Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant instructions for checking for-loop-exit-condition #22

Open
nomaddo opened this issue Dec 21, 2017 · 6 comments
Open

Remove redundant instructions for checking for-loop-exit-condition #22

nomaddo opened this issue Dec 21, 2017 · 6 comments
Labels
enhancement optimization related to an optimization step

Comments

@nomaddo
Copy link
Collaborator

nomaddo commented Dec 21, 2017

The for-loop creates redundant instructions in the instructions of checking exit condition.

ex)

kernel void loop1 (global float a[], global float b[]) {
  for (int i = 0; i < 1000; i++) {
    a[i] = -i;
    b[i] = -i;
  }
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 67 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), __global out float* b (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r2, unif, unif
or r1, 0 (0), 0 (0)
nop.never 
or r0, r1, r1
sub r0, 0 (0), r0
itof r3, r0; v8min r0, r1, r1
shl r0, r0, 2 (2)
add r0, ra0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or r0, r1, r1
or r1, r1, r1
shl r0, r0, 2 (2)
add r1, r1, 1 (1)
add r0, r2, r0
ldi ra1, 1000
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
xor.setf -, r1, ra1                                             /// redundant ?
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1)           /// 
or.setf -, r0, r0                                               ///
or.ifz r1, r1, r1                                               ///
or.setf -, elem_num, r0                                         ///
brr.ifallzc (pc+4) + 4
nop.never 
nop.never 
nop.never 
brr.ifanyz (pc+4) + -40
nop.never 
nop.never 
nop.never 
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -63
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

To me, above instructions (specifying ///) can be expressed as isub.setf -, r1, ra1.
Then, if the condition is negative, jump to the head of loop. Otherwise go through.

@nomaddo nomaddo changed the title Remov redundant instructions for checking for-loop-exit-condition Remove redundant instructions for checking for-loop-exit-condition Dec 21, 2017
@doe300
Copy link
Owner

doe300 commented Dec 21, 2017

xor.setf -, r1, ra1 /// redundant ?
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1) ///

These two lines calculate the condition (bool r0 = i == 1000). In this particular case, we don't really need the value of the condition stored in any register, the correct flags set would be enough.

or.setf -, r0, r0 ///
or.ifz r1, r1, r1 ///

These two lines are generated by the phi-node for i (something like setting the i of the next iteration to the i of the previous one). or.ifz r1, r1, r1 is of course redundant, with or without using flags, which also makes the previous instruction or.setf -, r0, r0 unnecessary.

or.setf -, elem_num, r0 ///

This instruction is required by the subsequent conditional branches to set the flags correctly. This could be removed, if the last instruction before setting flags already sets the flags for the same source value (and for all SIMD elements!!), which is the case here.

To sum it up:
You are right, with additional optimizations, this could be reduced to something like:

xor.setf -, r1, ra1
xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1)
brr.ifallzc (pc+4) + 4
...

@doe300 doe300 self-assigned this Dec 21, 2017
@nomaddo
Copy link
Collaborator Author

nomaddo commented Dec 21, 2017

The output of clang is as follows:

; Function Attrs: nounwind
define spir_kernel void @loop1(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b) #1 {
entry:
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i.09 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %sub = sub nsw i32 0, %i.09
  %conv = sitofp i32 %sub to float
  %arrayidx = getelementptr inbounds float addrspace(1)* %a, i32 %i.09
  store float %conv, float addrspace(1)* %arrayidx, align 4, !tbaa !10
  %arrayidx3 = getelementptr inbounds float addrspace(1)* %b, i32 %i.09
  store float %conv, float addrspace(1)* %arrayidx3, align 4, !tbaa !10
  %inc = add nuw nsw i32 %i.09, 1
  %exitcond = icmp eq i32 %inc, 1000
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                          ; preds = %for.body
  ret void
}

One to one translation is a little bit hard to do such optimization.
But if we can handle icmp eq ... and br ... in the same time, better optimization can be implemented in translation phase to vc4.

We have two choice:

  • Optimize it as peephole
  • Optimize it through translation

Peephole optimization might be general (it can be used for other bit-operation optimization), but latter is easier.

@nomaddo
Copy link
Collaborator Author

nomaddo commented Dec 21, 2017

One more thing: to jump neightbor basic_block, this cod output jump instruction.
brr.ifallzc (pc+4) + 4 can be deleted in this case.

@doe300
Copy link
Owner

doe300 commented Dec 21, 2017

You are right, with additional optimizations, this could be reduced to something like:

xor.setf -, r1, ra1
xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1)
brr.ifallzc (pc+4) + 4
...

Scratch that. According to here, we cannot set flags on both ALUs depending on the condition of their execution, thus xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1) won't work. Also the or.ifz r1, r1, r1 is original a i32 %18 = i32 %17 and therefore the same issue as discussed in #14.

@nomaddo
Copy link
Collaborator Author

nomaddo commented Jan 18, 2018

@doe300 you already done it? If so, let's close the issue.

@doe300
Copy link
Owner

doe300 commented Jan 18, 2018

No, I haven't found a fitting optimization yet.

@doe300 doe300 added the optimization related to an optimization step label Apr 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement optimization related to an optimization step
Projects
None yet
Development

No branches or pull requests

2 participants