Combine DMA loads #144

Open
long-long-float opened this issue May 1, 2020 · 6 comments
Labels
enhancement, optimization (related to an optimization step)

Comments

@long-long-float
Contributor

When we compile the following OpenCL code, which calls vload16 three times, with vc4c --asm -O3 -o dma_loads.asm dma_loads.cl, VC4C outputs the assembly attached below (dma_loads.txt). It contains three DMA loads, but these can be combined into a single DMA load.

__kernel void dma_loads(int width, int height, __global uchar *in, __global uchar *out)
{
    for (int y = 1; y < height - 1; y++) {
        size_t idx = y * width;
        uchar16 up   = vload16(idx - width, in);
        uchar16 center = vload16(idx, in);
        uchar16 down = vload16(idx + width, in);

        uchar16 r = (
            up                                               / (uchar16)(3) +
            center                                           / (uchar16)(3) +
            down                                             / (uchar16)(3));

        vstore16(r, idx, out);
    }
}

dma_loads.txt

I want to implement this combiner, and I am thinking about the method as follows.

For each basic block in the CFG of the LLVM IR:

; Function Attrs: convergent nounwind
define spir_kernel void @dma_loads(i32 %width, i32 %height, i8 addrspace(1)* %in, i8 addrspace(1)* %out) local_unnamed_addr #0 !kernel_arg_addr_space !3 !kernel_arg_access_qual !4 !kernel_arg_type !5 !kernel_arg_base_type !5 !kernel_arg_type_qual !6 !kernel_arg_name !7 {
  %sub = add nsw i32 %height, -1
  %cmp23 = icmp sgt i32 %height, 2
  br i1 %cmp23, label %.lr.ph.preheader, label %._crit_edge

.lr.ph.preheader:                                 ; preds = %0
  br label %.lr.ph

._crit_edge:                                      ; preds = %.lr.ph, %0
  ret void

.lr.ph:                                           ; preds = %.lr.ph.preheader, %.lr.ph
  %y.024 = phi i32 [ %inc, %.lr.ph ], [ 1, %.lr.ph.preheader ]
  %mul = mul nsw i32 %y.024, %width
  %sub1 = sub i32 %mul, %width
  %call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
  %call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
  %add = add i32 %mul, %width
  %call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2
  %div = udiv <16 x i8> %call, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
  %div4 = udiv <16 x i8> %call2, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
  %add5 = add nuw <16 x i8> %div4, %div
  %div6 = udiv <16 x i8> %call3, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
  %add7 = add <16 x i8> %add5, %div6
  tail call spir_func void @_Z8vstore16Dv16_hjPU3AS1h(<16 x i8> %add7, i32 %mul, i8 addrspace(1)* %out) #2
  %inc = add nuw nsw i32 %y.024, 1
  %cmp = icmp slt i32 %inc, %sub
  br i1 %cmp, label %.lr.ph, label %._crit_edge
}
  1. Collect the vload16 calls (actually _Z7vload16jPU3AS1Kh).
  2. Collect the DMA load addresses from the 1st argument of each vload16.
  3. Check whether the load addresses are at regular intervals.
  4. If so, combine these loads.

I think checking for regular intervals is the challenging part. Symbolic execution could be used for this.

Example

Collect the vload16 calls (and their address variables):

%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%add = add i32 %mul, %width

%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2

Addresses

  1. %mul - %width
  2. %mul
  3. %mul + %width

These are at regular intervals (with stride %width), so they can be combined (I should create the new functions dma_load and vpm_load).

dma_load(i32 %x.093, i8 addrspace(1)* %in, 3 /*= rows*/, 16/*= columns*/)
%call = vpm_load
%call2 = vpm_load
%call3 = vpm_load
@doe300
Owner

doe300 commented May 1, 2020

There is already some related code for this. It was added a while ago to do a similar job, but I am not sure whether it is still applied. Anyway, that might be a good place to start.

@doe300 doe300 added the enhancement and optimization labels May 9, 2020
@long-long-float
Contributor Author

@doe300 I have a question: is there a way to find the instruction corresponding to a local? For example, I want to get the instruction %sub1 = sub i32 %mul, %width from the value i32 %sub1. Or should I create such a method myself?

%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%add = add i32 %mul, %width

%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2

@doe300
Owner

doe300 commented May 16, 2020

In general, you can query Local#getUsers(LocalUse::Type::WRITER) to get all writers.

If there is just one writer, Local#getSingleWriter() will do the trick. Also, if you have a Value instead of the Local, you can call Value#getSingleWriter(), which does the same but first checks whether the value is a local. Of course, the result needs to be checked for nullptr in both cases!

@long-long-float
Contributor Author

@doe300 I want to insert an instruction (extending IntermediateInstruction) which performs a VPM load here, but I cannot find one.
Is there such an instruction, or should I create it?

@doe300
Owner

doe300 commented Jun 13, 2020

The general memory access (before we know whether the memory area is lowered to a register, the VPM, or accessed via TMU or DMA) is represented as a MemoryInstruction.
After the lowering, there are no specific instruction types for the various lowered forms (e.g. register, VPM); instead, the MemoryInstruction is directly converted into the (hardware) instructions executed to perform the memory accesses.
So if you want to insert a VPM access, have a look at the VPM header:

  • insertReadDMA, insertWriteDMA for "direct" DMA access (QPU <-> RAM), abstracting away the VPM
  • VPM::insertReadVPM, VPM::insertWriteVPM for VPM access (QPU <-> VPM), e.g. also for caching/exchanging data between QPUs
  • VPM::insertReadRAM, VPM::insertWriteRAM for DMA only access (VPM <-> RAM), e.g. to read/write back cached data

The VPM object required can be retrieved via the Method::vpm member.

Does this information suffice or do you need a special instruction type to represent VPM accesses (e.g. for further processing)?

@long-long-float
Contributor Author

I understand, thanks.
