Add field-last benchmark script #1950

charleskawczynski · 2024-08-22T00:59:13Z

This PR adds a benchmark to compare a dropped field dimension against moving the field dimension to the last index. This benchmark turned out to be a pretty simple modification of the offset benchmark.

cc @dennisYatunin (who was interested in this benchmark).

Let's look at the results for two kernels (Float32 for both, plus Float64 for the first):

Kernel `add3(x1, x2, x3) = x1 + x2 + x3` and `n_reads_writes=4` (Float32):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 72 microseconds, 899 nanoseconds │ 54.568  │ 1112.64     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 56 microseconds, 259 nanoseconds │ 70.708  │ 1441.74     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 56 microseconds, 515 nanoseconds │ 70.3877 │ 1435.21     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 67 microseconds, 462 nanoseconds │ 58.9663 │ 1202.32     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Kernel `add3(x1, x2, x3) = x1 + x2 + x3` and `n_reads_writes=4` (Float64):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float64, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 106 microseconds, 783 nanoseconds │ 74.5051 │ 1519.16     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 102 microseconds, 472 nanoseconds │ 77.6396 │ 1583.07     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 102 microseconds, 523 nanoseconds │ 77.6008 │ 1582.28     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 106 microseconds, 834 nanoseconds │ 74.4694 │ 1518.43     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Kernel `add3(x1, x2, x3) = x1` and `n_reads_writes=2` (Float32):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 61 microseconds, 185 nanoseconds │ 32.5079 │ 662.837     │ 2              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 31 microseconds, 376 nanoseconds │ 63.3926 │ 1292.57     │ 2              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 31 microseconds, 120 nanoseconds │ 63.9141 │ 1303.21     │ 2              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 44 microseconds, 53 nanoseconds  │ 45.1499 │ 920.607     │ 2              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
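
For anyone cross-checking the tables, here is a minimal sketch of how the `achieved bw` and `bw %` columns appear to follow from the measured time per call. The function name `achieved_bandwidth` is hypothetical, and the GiB-based convention (bytes / 1024^3 / seconds) is an assumption inferred from the printed values, not taken from the benchmark script.

```julia
# Bytes moved per call, divided by elapsed time.
# Assumption: bandwidth is reported in bytes / 1024^3 / seconds; this is
# inferred from the table values, not read from the benchmark script.
function achieved_bandwidth(problem_size, ::Type{FT}, n_reads_writes, time_s) where {FT}
    n_bytes = n_reads_writes * prod(problem_size) * sizeof(FT)
    return n_bytes / 1024^3 / time_s
end

problem_size = (63, 4, 4, 5400, 1)   # from the "Problem size" line
device_bandwidth_GBs = 2039          # from the "Problem size" line

# First Float32 row: 72 microseconds, 899 nanoseconds per call, 4 reads/writes.
bw = achieved_bandwidth(problem_size, Float32, 4, 72.899e-6)
bw_percent = 100 * bw / device_bandwidth_GBs

println("achieved bw = ", round(bw; digits = 2))
println("bw %        = ", round(bw_percent; digits = 3))
```

Under that assumption, the result agrees with the first `aos_cart_offset!` row (1112.64 and 54.568) to within rounding of the reported time.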

Note that `soa_linear_index!` is what #1929 implements, and `aos_lin_offset!` is the best we can get by moving the field index to the last index. The two are within about 1% of each other in every table above. So, if we move the field dimension to the end and avoid converting to Cartesian indices altogether, we can reach the same performance as the linear-index SoA kernel, even in "low-utilization" expressions (where not all field variables are used). I'm happy to merge this as a way to document our performance analysis.

These results support the claim that moving the field dimension to the end of the datalayout (in addition to leveraging linear indexing) will fix #1910.
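
To illustrate why a field-last layout lets us skip the Cartesian conversion, here is a small CPU-only sketch. It assumes a column-major array whose last dimension is the field dimension; `space_size`, `n_fields`, `cart_offset`, and `lin_offset` are hypothetical names for illustration and are not the functions in the benchmark script.

```julia
# Field-last layout: the field dimension is the last (slowest-varying) index
# of a column-major array, so consecutive fields are N elements apart.
space_size = (3, 2, 2, 4)   # small stand-in for the spatial dimensions
n_fields = 3
data = zeros(Float32, space_size..., n_fields)

N = prod(space_size)
CI = CartesianIndices(space_size)   # spatial linear index -> Cartesian index
LI = LinearIndices(size(data))      # Cartesian index -> linear index into data

# Cartesian route: convert to Cartesian, append the field index, convert back.
cart_offset(I, f) = LI[CartesianIndex(Tuple(CI[I])..., f)]

# Linear route: a single multiply-add, with no index conversion.
lin_offset(I, f) = I + (f - 1) * N

# Both routes address the same element for every point and field.
@assert all(cart_offset(I, f) == lin_offset(I, f) for I in 1:N, f in 1:n_fields)
```

The two routes agree everywhere, so the linear route can replace the Cartesian one whenever the field dimension is last.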

cc @cmbengue, @tapios

sriharshakandala · 2024-08-27T18:13:12Z

benchmarks/scripts/benchmark_field_last.jl

+    return LI[CI[I] + CartesianIndex((0, 0, 0, 0, field_index))]
+end
+
+# add3(x1, x2, x3) = x1 + x2 + x3


Should this be uncommented?

No, it should stay commented out; I did not loop over all benchmarks.

This corresponds to the n_reads_writes = 2 benchmark.

Sorry, I updated my response, @sriharshakandala.

sriharshakandala · 2024-08-29T15:33:49Z

benchmarks/scripts/benchmark_field_last.jl

+        threads = min(get_N(us), config.threads)
+        blocks = cld(get_N(us), threads)
+        for t in 1:n_trials
+            et = CUDA.@elapsed begin


Should this timer be moved outside the encompassing add function?
Can we use benchmark tools for gathering timing information?

No, this was done to avoid measuring launch latency. In my experiments it turned out not to matter, but I kept it this way to be careful.
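
For context on the `threads`/`blocks` lines in the excerpt above, here is a rough sketch of the launch-configuration arithmetic. It assumes `get_N(us)` returns the number of spatial points and that `config.threads` is 1024; both values are stand-ins and not confirmed by the script.

```julia
# Hypothetical stand-ins for get_N(us) and config.threads.
n_points = 63 * 4 * 4 * 5400   # assumed: product of the spatial dimensions
max_threads = 1024             # assumed device limit

threads = min(n_points, max_threads)   # never request more threads than points
blocks = cld(n_points, threads)        # ceiling division, so every point is covered

println("threads = ", threads, ", blocks = ", blocks)
```

Since `cld` rounds up, the last block may contain idle threads, which is why kernels launched this way usually guard against out-of-range indices.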

charleskawczynski added the Performance monitoring 🔍🚀 label Aug 22, 2024

charleskawczynski requested review from dennisYatunin, Sbozzolo and sriharshakandala August 22, 2024 00:59

charleskawczynski force-pushed the ck/field_last_bm branch from 442e584 to 7a72fa6 Compare August 27, 2024 12:23

Add field-last benchmark script

4638fe2

wip wip

charleskawczynski force-pushed the ck/field_last_bm branch from 7a72fa6 to 4638fe2 Compare August 27, 2024 12:26

sriharshakandala reviewed Aug 27, 2024

sriharshakandala reviewed Aug 29, 2024

sriharshakandala mentioned this pull request Aug 30, 2024

Add a benchmark script for IJFVH datalayout. #1963

Draft

4 tasks
