Increase `warmup` and `rep` for FA benchmark #2256

anmyachev · 2024-09-16T11:22:23Z

CI status:

https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10882781069 (100ms)
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10883238613 (150ms)
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10884030513 (150ms, without IPEX)
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10901498420 (200ms)
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10910606716 (300ms)
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10920458857 (300ms, without IPEX)
- https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10942449220 (again, with an up-to-date main branch)
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10994968680 (warmup=10)
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11092788850 (again with latest changes)

UPD: For some reason, increasing these values greatly affects the mean time. However, if I reduce `warmup`, the mean does not deteriorate as much.

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev · 2024-09-30T13:34:14Z

@ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will post the exact figures here once https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 has finished:

Triton ADV: -5%
Triton DFT: -2%
Xetla: -8%

Are you aware of this effect, where the average time gets noticeably worse as the number of runs increases? I don't know what to do about this slowdown, but I still think that running multiple times is a good idea for calculating the average (more than 3 runs; only then is the "*-CV" column not NaN).

cc @whitneywhtsang @etiotto
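The NaN in the "*-CV" column mentioned above can be illustrated with a minimal sketch. It assumes CV means the sample standard deviation divided by the mean; `coefficient_of_variation` is a hypothetical helper, not the repository's implementation. The comment above says the benchmark needs more than 3 runs, whereas this sketch only enforces the mathematical minimum of 2 samples.

```python
import math
import statistics


def coefficient_of_variation(times_ms):
    """Return sample std / mean, or NaN when there are too few samples."""
    # A sample standard deviation needs at least two measurements.
    if len(times_ms) < 2:
        return math.nan
    mean = statistics.mean(times_ms)
    return statistics.stdev(times_ms) / mean


# Illustrative timings in milliseconds (not measured data).
single_run = coefficient_of_variation([120.0])
several_runs = coefficient_of_variation([118.0, 120.0, 122.0, 120.0])
print(single_run, several_runs)
```

With a single timed run there is no spread to measure, so the column has nothing meaningful to report.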

etiotto · 2024-10-01T14:48:30Z

I think that when the warmup runs "too many times", the GPU may start heating up and then throttle its frequency down, so performance is already reduced when the timed run starts. That means we are better off not increasing rep/warmup to the point where we see performance degradation in the benchmarks.
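One way to probe the throttling hypothesis would be to compare early and late samples from a single timed run. The sketch below is only an illustration under that assumption: `drift_ratio` is a hypothetical helper, not part of the benchmark suite, and the timings are synthetic.

```python
import statistics


def drift_ratio(times_ms):
    """Mean of the second half of the samples divided by the mean of the first half."""
    half = len(times_ms) // 2
    if half == 0:
        raise ValueError("need at least two samples")
    early = statistics.mean(times_ms[:half])
    late = statistics.mean(times_ms[half:])
    return late / early


# Synthetic timings (not measured): a kernel that slows down during the run.
synthetic = [10.0, 10.0, 10.1, 10.9, 11.0, 11.1]
print(round(drift_ratio(synthetic), 3))
```

A ratio well above 1.0 would be consistent with the device slowing down over the course of the run, although it would not by itself identify the cause.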

etiotto

I do not think we should increase the number of repetitions too much. Going from 10 to 600 repetitions is a huge increase.

The kernel timing distribution should be a normal (Gaussian) curve. We only need to run the benchmark enough times to approximate a Gaussian "bell" curve. According to https://www.scribbr.com/statistics/central-limit-theorem/#:~:text=By%20convention%2C%20we%20consider%20a,if%20the%20population%20is%20normal. it looks like 30 is the number of reps we should use.
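The trade-off between repetition count and the precision of the mean can be sketched numerically. Assuming independent samples with standard deviation `sigma`, the standard error of the mean is `sigma / sqrt(n)`, so going from 30 to 600 samples narrows it by a factor of `sqrt(20)`, roughly 4.5. The function name below is hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean of n independent samples with std sigma."""
    return sigma / math.sqrt(n)


# Relative standard error (sigma = 1) for a few sample counts.
for n in (10, 30, 600):
    print(n, round(standard_error(1.0, n), 3))
```

Under this assumption the gain from each additional sample shrinks quickly, which is the usual argument for stopping at a moderate count.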

etiotto · 2024-10-01T14:50:34Z

benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py

@@ -234,10 +234,11 @@ def benchmark(Z, H, N_CTX, D_HEAD, CAUSAL, provider):
    v = torch.randn((Z, H, N_CTX, D_HEAD), device='xpu', dtype=dtype)
    sm_scale = 0.125
    quantiles = [0.5, 0.0, 1.0]
+    warmup, rep = 10, 600


From 10 to 600 times? That is far too many repetitions. It will make the benchmarks take much too long to run.

anmyachev

This value is measured in milliseconds, not iterations. It is needed for some test combinations where a single run takes more than 100 ms.
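To make the unit explicit: in a time-budget convention such as the one used by upstream Triton's `do_bench`, `warmup` and `rep` are durations in milliseconds that are divided by an estimated per-call time to obtain iteration counts. The sketch below assumes that convention; `iterations_for_budget` is a hypothetical helper, not the function in this repository, and the 120 ms estimate is illustrative.

```python
def iterations_for_budget(budget_ms, estimate_ms):
    """Number of calls that fit in a time budget, with a minimum of one."""
    return max(1, int(budget_ms / estimate_ms))


# Illustrative per-call estimate for a case where one run exceeds 100 ms.
estimate_ms = 120.0
for rep_ms in (100, 150, 300, 600):
    print(rep_ms, iterations_for_budget(rep_ms, estimate_ms))
```

Under these assumptions, a 100 ms budget yields a single timed call for such a kernel, which is too few to compute any spread.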

whitneywhtsang · 2024-10-01T15:07:28Z

If we revert #2142, so that rep is again the number of iterations, does the problem of NaNs in CV go away?

anmyachev · 2024-10-01T18:24:48Z

@whitneywhtsang Most likely, yes. However, I made that change to bring the `do_bench` function closer to the one used in upstream Triton. If this is not necessary, I can revert some of the changes.

whitneywhtsang · 2024-10-01T18:32:21Z

I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.

anmyachev and others added 3 commits September 16, 2024 11:16

Increase 'warmup' and 'rep' for FA benchmark

b1d2a0b

Signed-off-by: Anatoly Myachev <[email protected]>

Merge branch 'main' into amyachev/bench-time

339b709

Use 150ms

5ebbd01

Signed-off-by: Anatoly Myachev <[email protected]>

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

b1cc599

…ark.py

anmyachev added 2 commits September 17, 2024 22:28

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

0ad146f

…ark.py

Merge branch 'main' into amyachev/bench-time

bbf0557

anmyachev added 2 commits September 23, 2024 15:08

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

81fec9a

…ark.py

Merge branch 'main' into amyachev/bench-time

42e653a

anmyachev added 2 commits September 23, 2024 17:08

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

8f81c13

…ark.py

Merge branch 'main' into amyachev/bench-time

5d08d3a

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

b2d3398

…ark.py

anmyachev and others added 2 commits September 30, 2024 10:52

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

fe806b1

…ark.py

fix after merge

bf49b0d

Signed-off-by: Anatoly Myachev <[email protected]>

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

7493632

…ark.py

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

524f81d

…ark.py

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

4d40864

…ark.py

Update benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchm…

b0d91ce

…ark.py

anmyachev marked this pull request as ready for review September 30, 2024 13:29

anmyachev requested review from ESI-SYD and chengjunlu September 30, 2024 13:29

anmyachev requested review from whitneywhtsang and etiotto September 30, 2024 15:47

etiotto requested changes Oct 1, 2024
