Packet copy for IB #109

Merged on Jun 28, 2023 (35 commits)

Commits
fe28794
Renew epochs
chhwang Jun 19, 2023
59fd2e1
Rename DirectChannel into SmChannel
chhwang Jun 19, 2023
aef1f5a
Add SmDeviceChannel. PingPongIb is not working
chhwang Jun 19, 2023
be1db47
IB test working
chhwang Jun 19, 2023
f33c8b1
Some bug fixes
chhwang Jun 20, 2023
12671ed
bug fixes
Jun 20, 2023
83f78b1
Minor fix
chhwang Jun 20, 2023
e4ec432
Performance measurement
chhwang Jun 20, 2023
c4adfd5
Minor bug fix
chhwang Jun 20, 2023
320f8c0
Add SmDeviceChannel example for AllReduce WIP
chhwang Jun 20, 2023
1afe2fc
Disable printf on Debug mode
chhwang Jun 21, 2023
4ab3cd8
Bug fixes
chhwang Jun 21, 2023
c857a69
1 node 8 gpu/node & 2 nodes 1 gpu/node work
chhwang Jun 21, 2023
d42fbde
Use different getPktBufs for local and remote (WIP)
chhwang Jun 21, 2023
db6bfc9
functional
Jun 22, 2023
1d527a9
Merge branch 'main' into chhwang/ib-pkt-cpy
chhwang Jun 25, 2023
c457888
Lint
chhwang Jun 25, 2023
cfd1d75
Move some constructors to channel.cc
chhwang Jun 25, 2023
4529874
changed Epoch names
Jun 26, 2023
d5db1db
proxy_channel added
Jun 26, 2023
3a993cf
name change
Jun 26, 2023
b2535e3
Removing dependencies around packet copy
chhwang Jun 26, 2023
15b5bf3
Remove SmDeviceChannel
chhwang Jun 27, 2023
825875a
Merge branch 'saemal/channel_cleanup' into chhwang/ib-pkt-cpy
chhwang Jun 27, 2023
edfc44c
Renaming
chhwang Jun 27, 2023
201cdc8
all cleaned up
Jun 27, 2023
4e829b1
renamed epoch to semaphore
Jun 27, 2023
893c11d
more name clean ups
Jun 28, 2023
5f7134e
Saeed's optimization
chhwang Jun 28, 2023
f17e8b9
Add NUMA binding to fifo_tests
chhwang Jun 28, 2023
4a9371c
Update README.md
chhwang Jun 28, 2023
41d10b4
Fix wrong names
chhwang Jun 28, 2023
145a898
Remove packet namespace
chhwang Jun 28, 2023
a040c63
Merge branch 'main' into chhwang/ib-pkt-cpy
chhwang Jun 28, 2023
5425765
Rename variables
chhwang Jun 28, 2023
26 changes: 13 additions & 13 deletions README.md
@@ -24,29 +24,29 @@ MSCCL++ v0.2 supports the following features.

### In-Kernel Communication Interfaces

-MSCCL++ provides inter-GPU communication interfaces to be called by a GPU thread. For example, the `put()` method in the following example copies 1KB data from the local GPU to a remote GPU. `devChan` is a peer-to-peer communication channel initialized before the kernel execution from the host side, which consists of information on send/receive buffers.
+MSCCL++ provides inter-GPU communication interfaces to be called by a GPU thread. For example, the `put()` method in the following example copies 1KB data from the local GPU to a remote GPU. `channel` is a peer-to-peer communication channel between two GPUs, which consists of information on send/receive buffers. `channel` is initialized from the host side before the kernel execution.

```cpp
-__device__ mscclpp::channel::SimpleDeviceChannel devChan;
+__device__ mscclpp::SimpleProxyChannel channel;
__global__ void gpuKernel() {
  ...
  // Only one thread is needed for this method.
-  devChan.put(/*dstOffset=*/ 0, /*srcOffset=*/ 0, /*size=*/ 1024);
+  channel.put(/*dstOffset=*/ 0, /*srcOffset=*/ 0, /*size=*/ 1024);
  ...
}
```

-MSCCL++ also provides efficient synchronization methods, `signal()`, `flush()`, and `wait()`. For example, we can implement a simple barrier between two ranks (peer-to-peer connected through `devChan`) as follows. Explanation of each method is inlined.
+MSCCL++ also provides efficient synchronization methods, `signal()`, `flush()`, and `wait()`. For example, we can implement a simple barrier between two ranks (peer-to-peer connected through `channel`) as follows. Explanation of each method is inlined.

```cpp
// Only one thread is needed for this function.
__device__ void barrier() {
  // Inform the peer GPU that I have arrived at this point.
-  devChan.signal();
-  // Immediately flush all previous requests sent via this channel.
-  devChan.flush();
+  channel.signal();
+  // Flush the previous signal() call, which will wait for completion of signaling.
+  channel.flush();
  // Wait for the peer GPU to call signal().
-  devChan.wait();
+  channel.wait();
  // Now this thread is synchronized with the remote GPU’s thread.
  // Users may call a local synchronize functions (e.g., __syncthreads())
  // to synchronize other local threads as well with the remote side.
@@ -57,7 +57,7 @@ MSCCL++ provides consistent in-kernel interfaces, i.e., the above interfaces are

### Host-Side Communication Proxy

-Some in-kernel communication interfaces of MSCCL++ send requests (called triggers) to a GPU-external helper that conducts key functionalities such as DMA or RDMA. This helper is called a channel service or a proxy. MSCCL++ provides a default implementation of a proxy, which is a background host thread that busy polls triggers from GPUs and conducts functionalities accordingly. For example, the following is a typical host-side code for MSCCL++.
+Some in-kernel communication interfaces of MSCCL++ send requests (called triggers) to a GPU-external helper that conducts key functionalities such as DMA or RDMA. This helper is called a proxy service or a proxy in short. MSCCL++ provides a default implementation of a proxy, which is a background host thread that busy polls triggers from GPUs and conducts functionalities accordingly. For example, the following is a typical host-side code for MSCCL++.

```cpp
// Bootstrap: initialize control-plane connections between all ranks
@@ -66,14 +66,14 @@ auto bootstrap = std::make_shared<mscclpp::Bootstrap>(rank, world_size);
mscclpp::Communicator comm(bootstrap);
// Setup connections here using `comm`
...
-// Construct the default channel service
-mscclpp::channel::DeviceChannelService channelService(comm);
+// Construct the default proxy
+mscclpp::ProxyService proxyService(comm);
// Start the proxy
-channelService.startProxy();
+proxyService.startProxy();
// Run the user application, i.e., launch GPU kernels here
...
// Stop the proxy after the application is finished
-channelService.stopProxy();
+proxyService.stopProxy();
```
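
The README text in this hunk describes the default proxy as a background host thread that busy-polls triggers. As a rough illustration of that pattern only, here is a host-only sketch in which a second CPU thread stands in for the GPU. `Trigger`, `TriggerFifo`, and `runProxyDemo` are hypothetical names invented for this sketch; they are not the MSCCL++ types or API, and the real trigger layout and FIFO are not shown in this diff.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical trigger: illustrative fields only, not the MSCCL++ layout.
struct Trigger {
  uint64_t dstOffset;
  uint64_t srcOffset;
  uint64_t size;
};

// Single-producer/single-consumer ring buffer standing in for the
// GPU-to-host request queue (assumption: one producer, one consumer).
class TriggerFifo {
 public:
  explicit TriggerFifo(std::size_t capacity) : slots_(capacity) {
    assert(capacity > 0);
  }

  // Producer side (the GPU thread in the real system).
  void push(const Trigger& t) {
    uint64_t head = head_.load(std::memory_order_relaxed);
    // Spin while the ring is full.
    while (head - tail_.load(std::memory_order_acquire) == slots_.size()) {
      std::this_thread::yield();
    }
    slots_[head % slots_.size()] = t;
    head_.store(head + 1, std::memory_order_release);
  }

  // Consumer side (the proxy thread). Returns false if nothing is pending.
  bool poll(Trigger& out) {
    uint64_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false;
    out = slots_[tail % slots_.size()];
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }

 private:
  std::vector<Trigger> slots_;
  std::atomic<uint64_t> head_{0};
  std::atomic<uint64_t> tail_{0};
};

// Starts a proxy thread, pushes `numTriggers` 1KB requests, stops the proxy,
// and returns the total number of bytes the proxy was asked to handle.
uint64_t runProxyDemo(int numTriggers) {
  TriggerFifo fifo(4);
  std::atomic<bool> running{true};
  uint64_t bytesHandled = 0;  // Written only by the proxy thread.

  std::thread proxy([&] {
    Trigger t;
    // Busy-poll until asked to stop.
    while (running.load(std::memory_order_acquire)) {
      if (fifo.poll(t)) bytesHandled += t.size;  // A real proxy would issue DMA/RDMA here.
    }
    // Drain requests that were pushed before the stop flag was set.
    while (fifo.poll(t)) bytesHandled += t.size;
  });

  for (int i = 0; i < numTriggers; ++i) fifo.push({0, 0, 1024});
  running.store(false, std::memory_order_release);
  proxy.join();
  return bytesHandled;
}
```

The start/stop flag plays the role that `startProxy()` and `stopProxy()` play in the host code above; the drain loop after the flag is cleared is a design choice of this sketch so that no pushed request is dropped on shutdown.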

While the default implementation already enables any kinds of communication, MSCCL++ also supports users to easily implement their own customized proxies for further optimization. For example, the following example re-defines how to interpret triggers from GPUs.
1 change: 1 addition & 0 deletions docs/.gitignore
@@ -0,0 +1 @@
+doxygen/
20 changes: 10 additions & 10 deletions docs/Doxyfile
@@ -32,7 +32,7 @@ DOXYFILE_ENCODING = UTF-8
# title of most generated pages and in a few other places.
# The default value is: My Project.

-PROJECT_NAME = "MSCCLPP"
+PROJECT_NAME = "MSCCL++"

# The PROJECT_NUMBER tag can be used to enter a project or revision number. This
# could be handy for archiving the generated documentation or if some version
@@ -44,7 +44,7 @@ PROJECT_NUMBER =
# for a project that appears at the top of each page and should give viewer a
# quick idea about the purpose of the project. Keep the description short.

-PROJECT_BRIEF =
+PROJECT_BRIEF = "GPU-driven computation & communication stack"

# With the PROJECT_LOGO tag one can specify a logo or an icon that is included
# in the documentation. The maximum height of the logo should not exceed 55
@@ -58,7 +58,7 @@ PROJECT_LOGO =
# entered, it will be relative to the location where doxygen was started. If
# left blank the current directory will be used.

-OUTPUT_DIRECTORY = output
+OUTPUT_DIRECTORY = doxygen

# If the CREATE_SUBDIRS tag is set to YES then doxygen will create 4096 sub-
# directories (in 2 levels) under the output directory of each output format and
@@ -244,7 +244,7 @@ SEPARATE_MEMBER_PAGES = NO
# uses this value to replace tabs by spaces in code fragments.
# Minimum value: 1, maximum value: 16, default value: 4.

-TAB_SIZE = 4
+TAB_SIZE = 2

# This tag can be used to specify a number of aliases that act as commands in
# the documentation. An alias has the form:
@@ -323,7 +323,7 @@ OPTIMIZE_OUTPUT_SLICE = NO
# Note that for custom extensions you also need to set FILE_PATTERNS otherwise
# the files are not read by doxygen.

-EXTENSION_MAPPING =
+EXTENSION_MAPPING = cu=C++

# If the MARKDOWN_SUPPORT tag is enabled then doxygen pre-processes all comments
# according to the Markdown format, which allows for more readable
@@ -333,7 +333,7 @@ EXTENSION_MAPPING =
# case of backward compatibilities issues.
# The default value is: YES.

-MARKDOWN_SUPPORT = NO
+MARKDOWN_SUPPORT = YES

# When the TOC_INCLUDE_HEADINGS tag is set to a non-zero value, all headings up
# to that level are automatically included in the table of contents, even if
@@ -467,7 +467,7 @@ LOOKUP_CACHE_SIZE = 0
# normally produced when WARNINGS is set to YES.
# The default value is: NO.

-EXTRACT_ALL = NO
+EXTRACT_ALL = YES

# If the EXTRACT_PRIVATE tag is set to YES, all private members of a class will
# be included in the documentation.
@@ -909,7 +909,7 @@ FILE_PATTERNS = *.c \
# be searched for input files as well.
# The default value is: NO.

-RECURSIVE = NO
+RECURSIVE = YES

# The EXCLUDE tag can be used to specify files and/or directories that should be
# excluded from the INPUT source files. This way you can easily exclude a
@@ -1738,7 +1738,7 @@ EXTRA_SEARCH_MAPPINGS =
# If the GENERATE_LATEX tag is set to YES, doxygen will generate LaTeX output.
# The default value is: YES.

-GENERATE_LATEX = YES
+GENERATE_LATEX = NO

# The LATEX_OUTPUT tag is used to specify where the LaTeX docs will be put. If a
# relative path is entered the value of OUTPUT_DIRECTORY will be put in front of
@@ -2200,7 +2200,7 @@ INCLUDE_FILE_PATTERNS =
# recursively expanded use the := operator instead of the = operator.
# This tag requires that the tag ENABLE_PREPROCESSING is set to YES.

-PREDEFINED =
+PREDEFINED = __CUDACC__

# If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then this
# tag can be used to specify a list of macro names that should be expanded. The