rdma: support NCCL multi-recv interface #348

Closed
wants to merge 6 commits into from

Commits on Feb 23, 2024

  1. rdma: defer connect completion after sending connect message

    In the current implementation of connect/accept, it is possible for
    `accept` to complete (i.e., return a non-NULL communicator) after the
    corresponding `connect` returned a NULL communicator (while waiting for
    a completion for the connection message). These semantics are surprising
    and evidently cause problems for NCCL, particularly in the multi-recv
    case (which is added in a later commit).
    
    So, after sending the connect message, defer waiting for completion;
    block when closing the send comm if necessary.
    
    Signed-off-by: Eric Raut <[email protected]>
    rauteric committed Feb 23, 2024
    0eb387f
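
The deferred-completion idea in the commit above can be sketched as a toy model. This is only an illustration under stated assumptions: `toy_send_comm`, `toy_connect`, `toy_poll_cq`, and `toy_close_send_comm` are hypothetical names, the completion queue is simulated by a counter, and none of this is taken from the plugin's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; these names are not from the plugin. */
typedef struct {
    bool conn_msg_sent;      /* connect message has been posted */
    bool conn_msg_completed; /* its send completion has been observed */
    int  polls_until_done;   /* simulated completion latency */
} toy_send_comm;

/* Simulated completion-queue poll: the completion shows up after
 * polls_until_done polls. */
static void toy_poll_cq(toy_send_comm *c)
{
    if (c->conn_msg_sent && !c->conn_msg_completed) {
        if (c->polls_until_done > 0)
            c->polls_until_done--;
        if (c->polls_until_done == 0)
            c->conn_msg_completed = true;
    }
}

/* Connect: post the connect message and return a non-NULL communicator
 * right away instead of waiting for the send completion. */
toy_send_comm *toy_connect(toy_send_comm *storage, int latency)
{
    assert(storage != NULL && latency >= 0);
    storage->conn_msg_sent = true;
    storage->conn_msg_completed = false;
    storage->polls_until_done = latency;
    return storage;
}

/* Close: block (by polling) until the deferred completion has arrived.
 * Returns the number of polls performed. */
int toy_close_send_comm(toy_send_comm *c)
{
    int polls = 0;
    while (c->conn_msg_sent && !c->conn_msg_completed) {
        toy_poll_cq(c);
        polls++;
    }
    return polls;
}
```

In this sketch the caller gets a usable communicator as soon as the connect message is posted, so the peer's `accept` can no longer complete while `connect` is still returning NULL; the only place that waits for the completion is the close path.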

Commits on Feb 24, 2024

  1. tests: set nrecv=1 in functional tests

    These tests do not support multi-recv.
    
    Signed-off-by: Eric Raut <[email protected]>
    rauteric committed Feb 24, 2024
    d20ba9e

Commits on Feb 25, 2024

  1. tests/ring: Post receives before sends

    The previous implementation of the ring unit test posts sends before
    receives, which is incompatible with the current multi-recv behavior of
    rejecting sends until ctrl information is available.
    
    Signed-off-by: Eric Raut <[email protected]>
    rauteric committed Feb 25, 2024
    9207e74
  2. msgbuff: support tags in preparation for multi-recv

    msgbuff functions now accept tag and multi-recv information. RDMA
    protocol code is updated to pass dummy values for these fields.
    Multi-recv support for the RDMA protocol will be added in an upcoming
    commit.
    
    Signed-off-by: Eric Raut <[email protected]>
    rauteric committed Feb 25, 2024
    8c770ce
  3. rdma: support NCCL multi-recv interface

    The multi-recv interface allows aggregating up to 8 receive requests in
    a single request.
    
    This commit does not yet advertise support for multi-recv to NCCL.
    
    * Temporarily disables eager; it will be re-enabled in a future commit.
    
    Signed-off-by: Eric Raut <[email protected]>
    rauteric committed Feb 25, 2024
    1278cf6
  4. Advertise multi-recv support to NCCL for RDMA protocol

    RDMA protocol will now support up to 8 multi-recv buffers at a time.
    
    Signed-off-by: Eric Raut <[email protected]>
    rauteric committed Feb 25, 2024
    5fddb2e
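
Two behaviors described in the Feb 25 commits, grouping up to 8 receives into one request and rejecting sends until the receiver's ctrl information is available, can be illustrated with a small toy model. This is a sketch under assumptions, not the plugin's implementation: `toy_multi_recv`, `toy_irecv`, `toy_isend`, and `toy_recv_complete` are hypothetical names, matching is done by tag only, and "ctrl information available" is modeled as a single `posted` flag in one process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_RECVS 8 /* per-request limit stated in the commit messages */

/* Hypothetical grouped receive: up to TOY_MAX_RECVS tagged buffers. */
typedef struct {
    bool   posted; /* models "ctrl information is available" */
    int    n;
    void  *data[TOY_MAX_RECVS];
    size_t sizes[TOY_MAX_RECVS];
    int    tags[TOY_MAX_RECVS];
    bool   done[TOY_MAX_RECVS];
} toy_multi_recv;

/* Post n receives as a single request. Returns -1 if n is out of range. */
int toy_irecv(toy_multi_recv *req, int n, void **data,
              const size_t *sizes, const int *tags)
{
    assert(req != NULL);
    if (n < 1 || n > TOY_MAX_RECVS)
        return -1;
    req->n = n;
    for (int i = 0; i < n; i++) {
        req->data[i] = data[i];
        req->sizes[i] = sizes[i];
        req->tags[i] = tags[i];
        req->done[i] = false;
    }
    req->posted = true;
    return 0;
}

/* Send one message to the peer's grouped receive.
 * Returns the matched slot index, -1 if the send is rejected because no
 * receive has been posted yet (caller should retry), or -2 if no open
 * slot has a matching tag and enough room. */
int toy_isend(toy_multi_recv *peer, int tag, const void *buf, size_t len)
{
    if (!peer->posted)
        return -1;
    for (int i = 0; i < peer->n; i++) {
        if (!peer->done[i] && peer->tags[i] == tag && len <= peer->sizes[i]) {
            memcpy(peer->data[i], buf, len);
            peer->done[i] = true;
            return i;
        }
    }
    return -2;
}

/* The grouped request completes only when every slot has been filled. */
bool toy_recv_complete(const toy_multi_recv *req)
{
    if (!req->posted)
        return false;
    for (int i = 0; i < req->n; i++) {
        if (!req->done[i])
            return false;
    }
    return true;
}
```

In this model a send issued before `toy_irecv` is rejected, which is why a test that posts all sends first and only then posts receives would never make progress; posting receives first, as the ring test commit does, avoids that.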