Extend range of communicator IDs for the RDMA protocol #367

Merged
merged 2 commits into from
Mar 31, 2024

AmedeoSapio
Contributor

@AmedeoSapio AmedeoSapio commented Mar 29, 2024

Issue

The RDMA protocol currently uses 12 bits for the communicator ID, which limits the number of communicators to 4,096 (4K). Large deployments can hit this limit.

Description of changes:

This PR changes the number of bits used for the communicator ID from 12 to 18 to reduce
the chance of running out of comm IDs. To fit the larger comm ID in the immediate data,
the message sequence number has been reduced from 16 bits to 10, which is still
more than enough, since the number of inflight messages is usually far smaller than
the 1,024 values that 10 bits provide.
The msg_seq_num space can be further reduced if needed.
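
To illustrate the bit budget described above, here is a minimal sketch of packing an 18-bit comm ID and a 10-bit msg_seq_num into one immediate-data word. It is not the plugin's actual code: the 32-bit word size, the field order, the 4-bit `other` field that fills the remaining bits, and every `sketch_`/`SKETCH_` name are assumptions made for illustration. Only the 18-bit and 10-bit widths come from this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of a 32-bit immediate-data word (the word size and
 * field order are assumptions; only the 18- and 10-bit widths are from the PR):
 *   [ other: 4 bits | comm ID: 18 bits | msg_seq_num: 10 bits ]
 */
#define SKETCH_SEQ_BITS      10
#define SKETCH_COMM_ID_BITS  18
#define SKETCH_OTHER_BITS    4

#define SKETCH_SEQ_MASK      ((UINT32_C(1) << SKETCH_SEQ_BITS) - 1)
#define SKETCH_COMM_ID_MASK  ((UINT32_C(1) << SKETCH_COMM_ID_BITS) - 1)
#define SKETCH_OTHER_MASK    ((UINT32_C(1) << SKETCH_OTHER_BITS) - 1)

/* Growing the comm ID by 6 bits is only possible because the sequence
 * number shrinks by 6 bits: the total must still fit the word. */
static_assert(SKETCH_SEQ_BITS + SKETCH_COMM_ID_BITS + SKETCH_OTHER_BITS == 32,
              "fields must exactly fill the 32-bit immediate data");

/* Pack the three fields; out-of-range inputs are truncated by the masks. */
static inline uint32_t sketch_pack_imm(uint32_t other, uint32_t comm_id,
                                       uint32_t seq)
{
    return ((other & SKETCH_OTHER_MASK)
                << (SKETCH_SEQ_BITS + SKETCH_COMM_ID_BITS)) |
           ((comm_id & SKETCH_COMM_ID_MASK) << SKETCH_SEQ_BITS) |
           (seq & SKETCH_SEQ_MASK);
}

/* Extract the comm ID field from a packed word. */
static inline uint32_t sketch_imm_comm_id(uint32_t imm)
{
    return (imm >> SKETCH_SEQ_BITS) & SKETCH_COMM_ID_MASK;
}

/* Extract the message sequence number field from a packed word. */
static inline uint32_t sketch_imm_seq(uint32_t imm)
{
    return imm & SKETCH_SEQ_MASK;
}
```

With this layout, a comm ID of 4096, which would not fit in the previous 12 bits, round-trips through `sketch_pack_imm` and `sketch_imm_comm_id` unchanged.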

With the 16-bit msg_seq_num, the msgbuff relied on the implicit wraparound of 16-bit integers.
To reduce the number of bits of msg_seq_num, we need to make the wraparound
explicit and implement a subtraction that handles the wraparound case. This makes
the msgbuff more generic, as it can now support any msg_seq_num space and any
number of inflight messages (not only powers of 2).
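
The explicit wraparound described above can be sketched as follows. This is an illustrative sketch, not the msgbuff implementation: the function names, the `SKETCH_SEQ_SPACE` constant, and the window check are hypothetical. It uses a modulo instead of a bit mask so that the same arithmetic holds for a sequence space of any size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the sequence-number space. 1024 corresponds to a 10-bit
 * msg_seq_num; the modulo arithmetic below does not require a power of 2. */
#define SKETCH_SEQ_SPACE UINT32_C(1024)

/* Successor of `seq`, wrapping from SKETCH_SEQ_SPACE - 1 back to 0. */
static inline uint32_t sketch_seq_next(uint32_t seq)
{
    return (seq + 1) % SKETCH_SEQ_SPACE;
}

/* Forward distance from `from` to `to`, i.e. how many increments are
 * needed to reach `to` starting at `from`. Both inputs must already be
 * in [0, SKETCH_SEQ_SPACE). Adding SKETCH_SEQ_SPACE before subtracting
 * keeps the intermediate value non-negative when `to` has wrapped. */
static inline uint32_t sketch_seq_distance(uint32_t from, uint32_t to)
{
    return (to + SKETCH_SEQ_SPACE - from) % SKETCH_SEQ_SPACE;
}

/* True if `seq` falls inside the window of `window` inflight messages
 * that starts at `base`, including across the wraparound point. */
static inline bool sketch_seq_in_window(uint32_t base, uint32_t seq,
                                        uint32_t window)
{
    return sketch_seq_distance(base, seq) < window;
}
```

For example, with a window starting at 1020, sequence number 3 is 7 steps ahead, so it is inside a window of 8 even though 3 is numerically smaller than 1020. With a full 16-bit sequence number the same result falls out of plain `uint16_t` subtraction, which is the implicit behavior that no longer works once the space is narrower than the integer type.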

The msgbuff unit test has also been extended to test the case when message sequence numbers wrap around.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AmedeoSapio AmedeoSapio requested a review from a team as a code owner March 29, 2024 07:30
@AmedeoSapio AmedeoSapio added the BuildTriggerRequest CI build will be triggered when this label is set label Mar 29, 2024
Contributor

@bwbarrett bwbarrett left a comment

Note that I only reviewed the last two commits, since the other ones are covered in a different PR. My assumption is that this PR will rebase once that one is committed.

@AmedeoSapio AmedeoSapio removed the BuildTriggerRequest CI build will be triggered when this label is set label Mar 30, 2024
@AmedeoSapio AmedeoSapio added the BuildTriggerRequest CI build will be triggered when this label is set label Mar 30, 2024
This changes the msgbuff so that it can support any number of bits for
the sequence number, not just 16.
This is in preparation for increasing the range of comm IDs, which requires
changing the number of bits used for the msg_seq_num.
To fit a larger comm ID in the immediate data, the message sequence number
bit-width must be reduced.

With the 16-bit msg_seq_num, the msgbuff relied on the implicit wraparound of
16-bit integers. To reduce the number of bits of msg_seq_num, we need to make
the wraparound explicit and implement a distance function that handles the
wraparound case. This makes the msgbuff more generic, as it can now support
any msg_seq_num space and any number of inflight messages, as long as they
are powers of 2.

This also extends the msgbuff unit test to cover the case where message sequence
numbers wrap around.

Signed-off-by: Amedeo Sapio <[email protected]>
Changing the number of bits used for the communicator ID from 12 to 18 to reduce
the chance of running out of comm IDs.

To fit a larger comm ID in the immediate data, the message sequence number
has been reduced from 16 bits to 10, which is still more than enough, since
the number of inflight messages is usually far smaller than the 1,024 values
that 10 bits provide.
The msg_seq_num space can be further reduced if needed.

Signed-off-by: Amedeo Sapio <[email protected]>
@bwbarrett bwbarrett merged commit 2fbfecf into aws:master Mar 31, 2024
13 checks passed