Test rewrite of the geometry assembler #320

Draft: wants to merge 1 commit into base `dev`
Conversation

JamesWrigley (Member) commented Oct 15, 2021

Recently I've been playing around with re-implementing the geometry assembler, and I think it's at the point where we can decide whether to go with it or not. It's not a clear decision because the results are... mixed. This PR is not meant to be merged; it's just for discussion.

TL;DR:

  • The existing implementation computes the location of each pixel from its tile/module number. The new implementation instead looks up the location of each pixel in a precomputed LUT.
  • On Maxwell, the new implementation is on average ~3-5% slower. The performance depends heavily on the machine architecture, so results may vary. I haven't yet tested it on the online cluster.
  • The new implementation is far simpler: orders of magnitude fewer SLOC.
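To make the contrast concrete, here is a minimal sketch of the LUT idea (the names are illustrative, not the actual API in this PR): the LUT maps each input pixel index to its index in the flattened output array, depends only on the geometry, and per-pulse assembly reduces to an indexed copy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: the LUT maps input pixel index -> output pixel index.
// It depends only on the geometry, so it is built once and reused every pulse.
// (In the real code it would be derived from extra-geom data.)
std::vector<uint32_t> buildLut(const std::vector<uint32_t>& outputIndexPerPixel) {
    return outputIndexPerPixel;
}

// Per-pulse assembly is then just an indexed copy over the flattened input.
void assembleWithLut(const float* input, std::size_t nPixels,
                     const std::vector<uint32_t>& lut, float* output) {
    for (std::size_t i = 0; i < nPixels; ++i) {
        output[lut[i]] = input[i];
    }
}
```

The existing implementation effectively recomputes `lut[i]` from the tile/module geometry on every pulse; the new one pays that cost once up front.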

So this all started when I had an idea a while ago about using LUTs to assemble geometry: from pulse to pulse you don't need to recompute the output location of each pixel in the input, because it depends only on the geometry, so it can be computed just once and re-used. I went through about 6 iterations of this idea before finding one that I think works well enough to be considered, which is assembleDetectorData6(). This is a micro-optimized version of assembleDetectorData5(), close enough that I would suggest reading the code for assembleDetectorData5() to understand assembleDetectorData6(). In turn, assembleDetectorData5() is a slightly optimized version of assembleDetectorData4(). The other attempts are not worth considering because they're either broken or too slow.

Here's an overview of how it works:

  1. A LUT is generated with generateAssemblyLUT2() (this requires a specific input array from extra-geom to compute the LUT data). Let's define the number of pixels in a pulse as N; the LUT is then a 1D byte array of size N * 3. Each group of 3 bytes holds the first 3 bytes of a 4-byte int: since all possible LUT values (i.e. positions in the flattened output array) fit comfortably in 24 bits, we can cut the size of the LUT by 25% by storing 24-bit ints. This reduces the impact of the LUT on the CPU caches: the larger the LUT, the less cache space remains for the input and output arrays, and thus the more cache misses and the worse the performance.
  2. The LUT is passed to the assembler function, which iterates in parallel over the flattened input array, uses the LUT to calculate the right location in the output array, and copies the pixels. Calculating the right index into the LUT is slightly complicated, so I'd suggest looking at this commented helper function. Unpacking the 24-bit int into a uint32 is also slightly complicated; there are some comments about that here.
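The 24-bit packing described in step 1 and the unpacking in step 2 can be sketched as follows. These are illustrative helpers, not the PR's actual functions: each output index fits in 24 bits, so only the low 3 bytes of each 32-bit value are stored, shrinking the LUT by 25%.

```cpp
#include <cstdint>

// Pack the low 3 bytes of a 32-bit value into the LUT (assumes value < 2^24).
void pack24(uint32_t value, uint8_t* dst) {
    dst[0] = static_cast<uint8_t>(value);
    dst[1] = static_cast<uint8_t>(value >> 8);
    dst[2] = static_cast<uint8_t>(value >> 16);
}

// Reassemble the 3 stored bytes back into a uint32 output index.
uint32_t unpack24(const uint8_t* src) {
    return static_cast<uint32_t>(src[0])
         | static_cast<uint32_t>(src[1]) << 8
         | static_cast<uint32_t>(src[2]) << 16;
}
```

In the hot loop, pixel `i`'s output index would then be read as `unpack24(&lut[i * 3])`.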

Here are some graphs from running this on Maxwell for pulse counts from 1 to 800 (the 'reference implementation' is the existing implementation):
[benchmark graph]

Interestingly, this outperforms the reference until ~150 pulses, after which it degrades. I would put this down to a quirk of the machine's architecture rather than anything related to the algorithm, because I did not see this while testing on my own machine. It would be interesting to see if the same happens on the online cluster nodes.

[benchmark graph]

This shows that after the initial improvement up to ~150 pulses, performance degrades and then stabilizes at roughly 97% of the existing implementation's.

[benchmark graph]

And in practice, this translates to a slowdown of about 2ms for 800 pulses. At 400 pulses it's around 1ms, so assuming the slowdown scales linearly, that would be ~4ms at 1600 pulses (though at that point assembly itself would take ~160ms). It's interesting to me that the performance of assembleDetectorData4() is so jittery; perhaps it's more sensitive to other workloads on the machine.

General thoughts:

  • I believe the reason this doesn't perform as well as the reference is the cache space the LUT takes up; the LUT should be made as small as possible.
  • Even so, the current implementation will not scale for hundreds/thousands of pulses (at 800 pulses we're already at 80ms) so we will eventually need a faster way to assemble geometry.
  • Assembly is a glorified copy, and GPUs are great at copying 😁 There are I/O issues, but by optimizing the memory layout for a GPU we might be able to get good speedups over a CPU (already looking into this). Having said that, I think there will always be a place for a high-performance CPU-only implementation, if only to use for testing offline.
  • It's not clear to me that we should proceed with this implementation. It is far simpler, but the existing implementation already works and it is consistently faster, even if only by a few percentage points.
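For a sense of scale on the cache-footprint point, here is a back-of-envelope calculation. The detector size is an assumption for illustration (an AGIPD-like 1-Mpixel detector: 16 modules of 512x128 pixels); the PR does not state which detector the benchmarks used.

```cpp
#include <cstddef>

// Assumed detector size for illustration: 16 modules of 512x128 pixels.
constexpr std::size_t kPixels = 16 * 512 * 128;   // 1,048,576 pixels per pulse
constexpr std::size_t kLut32Bytes = kPixels * 4;  // 4 MiB LUT with 32-bit ints
constexpr std::size_t kLut24Bytes = kPixels * 3;  // 3 MiB LUT with 24-bit ints
```

At these sizes the LUT alone is comparable to a typical per-core L2 cache, which is why every byte shaved off it frees cache space for the input and output arrays.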

Thoughts? (anyone other than the reviewers, feel free to comment too :) )

@JamesWrigley JamesWrigley self-assigned this Oct 15, 2021
@JamesWrigley JamesWrigley marked this pull request as draft October 15, 2021 12:15
JamesWrigley (Member, Author):

Some interesting results with a similar algorithm implemented in Julia (attempt 8):
[benchmark graphs]

These benchmarks were done on Xeons. The difference from the previous set of benchmarks, which were done on EPYCs, is quite dramatic:

  • Performance is way worse across the board, roughly 3x slower.
  • Attempts 4 and 6 just beat out the reference implementation.

Attempt 8 is almost 2x faster than the others. Possible explanations:

  1. EXtra-foam was compiled on an EPYC with -march=native, so maybe the optimizations for the EPYC don't work as well on a Xeon? That doesn't bode well for performance on the online cluster (also very curious about what the difference might be).
  2. The algorithm in attempt 8 is just more cache-friendly.
  3. Julia generates better ASM. I think this is unlikely since Julia uses LLVM under the hood, whose code generator is surely comparable to GCC's.

JamesWrigley (Member, Author) commented Jan 20, 2022

Did another set of benchmarks on the offline cluster and got much more believable results:
[benchmark graphs]

Attempts 4/6 are the fastest on this machine, and the Julia implementation is the slowest.

And on the online cluster (where it really matters):
[benchmark graphs]

Very similar results: all the C++ attempts are faster than the reference, and the Julia one is the slowest. These graphs are much noisier than the ones from the offline cluster, I guess because there are Karabo devices running on the same machine using some CPU.

I think the benchmarks on the online cluster, in addition to the massively simplified algorithm, are a good enough reason to replace the current assembler, so I'm thinking I'll go with attempt 4. It's the simplest one, and from the benchmarks there's no clear advantage to the more complicated ones.

zhujun98 (Collaborator):
The point of the current implementation is to enable flexible and fast manipulation of the pixels (masking is a simple example) during assembly. As you already figured out, assembling is just a copy; nothing fancy. Also, the current implementation is not fully optimized, as there was no requirement for that.

There are also many other cases in EXtra-foam that leave room for scaling up and for including more complicated applications in the future.

I would be more than happy if you could lead EXtra-foam in this direction :)
