Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdma: add separate bounce buffer freelist for data (eager) messages #614

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rauteric
Copy link
Contributor

Separate out bounce buffer freelists into a smaller-sized freelist for
control messages and a larger size for data (eager) messages

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rauteric rauteric requested a review from a team as a code owner September 19, 2024 22:32
@rauteric rauteric force-pushed the fl-separate-pr branch 2 times, most recently from 19af9ec to 512366e Compare September 19, 2024 23:09
@rauteric rauteric changed the title rdma: add separate bounce buffer for data (eager messages) rdma: add separate bounce buffer freelist for data (eager) messages Sep 19, 2024
include/nccl_ofi_rdma.h Show resolved Hide resolved
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
@rauteric
Copy link
Contributor Author

rauteric commented Oct 3, 2024

The CI failure(s) are real:

scatter_perf: nccl_ofi_rdma.c:2100: alloc_bounce_req: Assertion `NCCL_OFI_IS_PTR_ALIGNED(&bounce_fl_item->bounce_msg, BOUNCE_BUFFER_ALIGNMENT)' failed.

Need to fix up some asserts.

@rauteric
Copy link
Contributor Author

rauteric commented Oct 3, 2024

Force-push to 3494556 fixes the CI failure by having alignment on both (eager) bounce buffers and ctrl recv buffers.

This requirement will be relaxed when we refactor the data structure in the related #600.

Separate out bounce buffer freelists into a smaller-sized freelist for
control messages and a larger size for data (eager) messages

Also refactor bounce buffer freelists to be per-rail instead of shared
across all rails. Parameters `min_posted_bounce_buffers` and
`max_posted_bounce_buffers` are now per-rail values.

Signed-off-by: Eric Raut <[email protected]>
@rauteric
Copy link
Contributor Author

rauteric commented Oct 3, 2024

Rebased on master

@AmedeoSapio
Copy link
Contributor

bot:aws:retest

1 similar comment
@AvivBenchorin
Copy link
Contributor

bot:aws:retest

Make posted buffer counts for control and eager bounce recv buffers
separately configurable, and set a higher count for ctrl buffers.

Signed-off-by: Eric Raut <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants