Add field-last benchmark script #1950

Open
wants to merge 1 commit into
base: main

Member

@charleskawczynski charleskawczynski commented Aug 22, 2024

This PR adds a benchmark to compare a dropped field dimension against moving the field dimension to the last index. This benchmark turned out to be a pretty simple modification of the offset benchmark.

cc @dennisYatunin (who was interested in this benchmark).

Let's look at the results for two kernels: Float32 and Float64 for the first, and Float32 for the second.

Kernel `add3(x1, x2, x3) = x1+x2+x3` and `n_reads_writes=4`:
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 72 microseconds, 899 nanoseconds │ 54.568  │ 1112.64     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 56 microseconds, 259 nanoseconds │ 70.708  │ 1441.74     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 56 microseconds, 515 nanoseconds │ 70.3877 │ 1435.21     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 67 microseconds, 462 nanoseconds │ 58.9663 │ 1202.32     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Kernel `add3(x1, x2, x3) = x1+x2+x3` and `n_reads_writes=4`:
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float64, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 106 microseconds, 783 nanoseconds │ 74.5051 │ 1519.16     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 102 microseconds, 472 nanoseconds │ 77.6396 │ 1583.07     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 102 microseconds, 523 nanoseconds │ 77.6008 │ 1582.28     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 106 microseconds, 834 nanoseconds │ 74.4694 │ 1518.43     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Kernel `add3(x1, x2, x3) = x1` and `n_reads_writes=2`:
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 61 microseconds, 185 nanoseconds │ 32.5079 │ 662.837     │ 2              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 31 microseconds, 376 nanoseconds │ 63.3926 │ 1292.57     │ 2              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 31 microseconds, 120 nanoseconds │ 63.9141 │ 1303.21     │ 2              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 44 microseconds, 53 nanoseconds  │ 45.1499 │ 920.607     │ 2              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
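For reference, the `achieved bw` and `bw %` columns above can be reproduced from the problem size, `n-reads/writes`, the float type, and the time per call. The sketch below is a back-calculation from the reported numbers, not the benchmark's own code: the helper names `achieved_bw` and `bw_percent` are hypothetical, and the division by `2^30` is an assumption inferred from the fact that it matches the table values.

```julia
# Sketch: reproduce the `achieved bw` and `bw %` columns from the tables.
# Assumption (inferred from the numbers, not from the benchmark source):
# bytes moved are divided by 2^30 before dividing by the time in seconds.

# Achieved bandwidth for one kernel call (hypothetical helper).
function achieved_bw(problem_size, n_reads_writes, ::Type{FT}, time_s) where {FT}
    n_points = prod(problem_size)                    # 63 * 4 * 4 * 5400 * 1
    bytes = n_points * n_reads_writes * sizeof(FT)   # total bytes read + written
    return bytes / 2^30 / time_s
end

# Percentage of the device bandwidth that was achieved (hypothetical helper).
bw_percent(bw, device_bandwidth_GBs) = 100 * bw / device_bandwidth_GBs

problem_size = (63, 4, 4, 5400, 1)
t = 72.899e-6   # aos_cart_offset!, Float32, n_reads_writes = 4
bw = achieved_bw(problem_size, 4, Float32, t)
pct = bw_percent(bw, 2039)
println("achieved bw ≈ ", round(bw; digits = 2), ", bw % ≈ ", round(pct; digits = 3))
```

Plugging in the other rows (for example `Float64` with `106.783e-6` seconds) should land within rounding of the corresponding table entries.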

Note that `soa_linear_index!` is what #1929 implements, and `aos_lin_offset!` is the best we can get by moving the field index to the last index. The two are within about 1% of each other in all three tables. So, if we move the field dimension to the end and avoid converting to Cartesian indices altogether, we can reach the best performance measured here, even in "low-utilization" expressions (where not all field variables are used). I'm happy to merge this as a way to document our performance analysis.

This supports the claim that moving the field dimension to the end of the data layout (in addition to leveraging linear indexing) will fix #1910.
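To illustrate why a trailing field dimension allows pure linear indexing, here is a small CPU-only sketch. It is not the benchmark code: the array sizes and the names `cart_offset` and `lin_offset` are hypothetical, chosen only to show that shifting along the last dimension through Cartesian indices is the same as adding a fixed stride to the linear index.

```julia
# Hypothetical, CPU-only sketch; sizes and names are illustrative.
dims = (3, 2, 2, 5, 4)            # (Nv, Ni, Nj, Nh, Nf), field dimension last
N = prod(dims[1:4])               # number of points per field
A = reshape(collect(1.0:prod(dims)), dims)

CI = CartesianIndices(dims)
LI = LinearIndices(dims)

# Cartesian route: convert the point's linear index to a CartesianIndex,
# shift along the field dimension, then convert back to a linear index.
cart_offset(I, f) = LI[CI[I] + CartesianIndex((0, 0, 0, 0, f - 1))]

# Linear route: with the field dimension last, field f starts at (f - 1) * N,
# so no index conversion is needed.
lin_offset(I, f) = I + (f - 1) * N

# Both routes address the same element for every point and every field.
ok = all(cart_offset(I, f) == lin_offset(I, f) for I in 1:N, f in 1:dims[5])
println("offsets agree: ", ok)
```

The linear route needs only one multiply and one add per access, which is the kind of index arithmetic the `*_lin_*` and `*_linear_*` variants in the tables are meant to exercise.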

cc @cmbengue, @tapios

return LI[CI[I] + CartesianIndex((0, 0, 0, 0, field_index))]
end

# add3(x1, x2, x3) = x1 + x2 + x3
Member

Should this be uncommented?

Member Author

@charleskawczynski charleskawczynski Aug 27, 2024

No, it should stay commented out. I did not loop over all benchmarks.

This corresponds to the n_reads_writes = 2 benchmark.

Member Author

Sorry, I updated my response, @sriharshakandala.

threads = min(get_N(us), config.threads)
blocks = cld(get_N(us), threads)
for t in 1:n_trials
et = CUDA.@elapsed begin
Member

Should this timer be moved outside the encompassing add function?
Can we use benchmark tools for gathering timing information?

Member Author

No, this was done to avoid measuring launch latency. In my experiments it turned out not to matter, but I did this to be careful.
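As a side note on the launch configuration in the snippet under review (`threads = min(get_N(us), config.threads)`, `blocks = cld(get_N(us), threads)`), the arithmetic can be checked on the CPU. This is only a sketch: `launch_config` is a hypothetical helper, the value `1024` for the maximum threads per block is an assumption, and `N` is assumed to be the total number of points in the problem size above.

```julia
# Hypothetical helper mirroring the threads/blocks arithmetic in the snippet.
# `max_threads` stands in for `config.threads` (assumed value, not measured).
function launch_config(N, max_threads)
    threads = min(N, max_threads)   # never launch more threads than points
    blocks = cld(N, threads)        # ceiling division so every point is covered
    return (threads, blocks)
end

N = 63 * 4 * 4 * 5400               # assumed: one thread per point
threads, blocks = launch_config(N, 1024)
println("threads = ", threads, ", blocks = ", blocks)

# The grid always covers at least N points.
@assert threads * blocks >= N
```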

Successfully merging this pull request may close these issues.

Fix uncoalesced memory reads
2 participants