-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase warmup
and rep
for FA benchmark
#2256
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Anatoly Myachev <[email protected]>
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py
Outdated
Show resolved
Hide resolved
@ESI-SYD @chengjunlu geomean diff will most likely be less, I will write the exact figures here after https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 is finished:
Are you aware of this effect where as the number of runs increases, the average time gets noticeably worse? I don't know what to do with this slowdown, but I still think the idea of running multiple times (>3, only in this case "*-CV" column will not be |
I think that when the warmup runs "too many times" the GPU may start heating up and then throttle the frequency down, so when the timed run start the performance is reduced. That means we are better off not increasing the rep/warmup to the point we see performance degradations in the benchmarks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should increase the number of repetition too much. Going from 10 too 600 repetitions is a huge increase.
The kernel timing distribution should be a normal (gaussian) curve. We only need to run the benchmark enough times to approximate a gaussian "bell" curve. From https://www.scribbr.com/statistics/central-limit-theorem/#:~:text=By%20convention%2C%20we%20consider%20a,if%20the%20population%20is%20normal. looks like 30 is the number of reps we should use.
@@ -234,10 +234,11 @@ def benchmark(Z, H, N_CTX, D_HEAD, CAUSAL, provider): | |||
v = torch.randn((Z, H, N_CTX, D_HEAD), device='xpu', dtype=dtype) | |||
sm_scale = 0.125 | |||
quantiles = [0.5, 0.0, 1.0] | |||
warmup, rep = 10, 600 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From 10 to 600 times? Way too many repetitions. It is going to slow down the time it takes to run the benchmarks too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From 10 to 600 times? Way too many repetitions. It is going to slow down the time it takes to run the benchmarks too much.
This value is measured in milliseconds and is needed for some test combinations where one run takes more than 100 ms.
If we revert #2142, then rep is the number of iterations, then the problem of NaNs in CV is gone? |
@whitneywhtsang Most likely yes. However, I made a change to make |
I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO. |
Closes #2255
CI status:
UPD: For some reason this greatly affects the mean time. However, if I reduce
warmup
, the mean does not deteriorate as much.