Add field-last benchmark script #1950
    return LI[CI[I] + CartesianIndex((0, 0, 0, 0, field_index))]
end

# add3(x1, x2, x3) = x1 + x2 + x3
Should this be uncommented?
No, it should stay commented out; I did not loop over all benchmarks. This corresponds to the `n_reads_writes = 2` benchmark.
Sorry, I updated my response, @sriharshakandala.
threads = min(get_N(us), config.threads)
blocks = cld(get_N(us), threads)
for t in 1:n_trials
    et = CUDA.@elapsed begin
Should this timer be moved outside the encompassing `add` function? Can we use benchmark tools for gathering timing information?
No, the timer is placed there deliberately, to avoid measuring launch latency. In my experiments the launch latency turned out not to matter, but I kept the timer inside to be careful.
This PR adds a benchmark to compare a dropped field dimension against moving the field dimension to the last index. This benchmark turned out to be a pretty simple modification of the offset benchmark.
cc @dennisYatunin (who was interested in this benchmark).
Let's look at the `Float32` results for two kernels:

Note that `soa_linear_index!` is what #1929 implements. `aos_lin_offset!` would be the best we can get by moving the field index to the last index. So, if we move the field dimension to the end and avoid converting to cartesian indices altogether, we can reach maximum performance, even in "low-utilization" expressions (where not all field variables are used). I'm happy to merge this as a way to document our performance analysis.

This supports the conclusion that moving the field dimension to the end of the datalayout (in addition to leveraging linear indexing) will fix #1910.
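To illustrate why a field-last layout lets us skip the cartesian conversion, here is a minimal CPU-only sketch. It is not the benchmark code from this PR: the array sizes, the names `cartesian_offset` and `linear_offset`, and the zero-based `field_offset` convention are all assumptions made for illustration. It only checks that, with the field dimension last, shifting along the field dimension through cartesian indices gives the same result as adding a fixed stride to the linear index.

```julia
# Hypothetical 5-dimensional layout with the field dimension last.
# The sizes are arbitrary and chosen only for this illustration.
dims = (4, 4, 3, 2, 3)
data = reshape(collect(Float32, 1:prod(dims)), dims)

LI = LinearIndices(data)
CI = CartesianIndices(data)
n_points = prod(dims[1:4])  # number of points stored per field

# Cartesian route: linear -> cartesian, shift along the field dimension, back to linear.
# field_offset is zero-based here (0 selects the first field); this is an assumption.
cartesian_offset(I, field_offset) =
    LI[CI[I] + CartesianIndex((0, 0, 0, 0, field_offset))]

# Linear route: with the field dimension last, the same shift is a fixed stride.
linear_offset(I, field_offset) = I + field_offset * n_points

# Both routes agree for every point and every field.
for I in 1:n_points, f in 0:(dims[5] - 1)
    @assert cartesian_offset(I, f) == linear_offset(I, f)
end
```

The linear route needs only one multiply and one add per access, which is consistent with the observation above that avoiding cartesian indices helps; the sketch itself does not measure performance.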
cc @cmbengue, @tapios