<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Install Alpa — Alpa 0.2.3.dev17 documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="_static/sg_gallery.css" type="text/css" />
<link rel="stylesheet" href="_static/sg_gallery-binder.css" type="text/css" />
<link rel="stylesheet" href="_static/sg_gallery-dataframe.css" type="text/css" />
<link rel="stylesheet" href="_static/sg_gallery-rendered-html.css" type="text/css" />
<link rel="shortcut icon" href="_static/alpa-logo.ico"/>
<!--[if lt IE 9]>
<script src="_static/js/html5shiv.min.js"></script>
<![endif]-->
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<script src="_static/js/theme.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Alpa Quickstart" href="tutorials/quickstart.html" />
<link rel="prev" title="Alpa Documentation" href="index.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="index.html" class="icon icon-home"> Alpa
</a>
<div class="version">
0.2.3.dev17
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Install Alpa</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#prerequisites">Prerequisites</a></li>
<li class="toctree-l2"><a class="reference internal" href="#methods">Methods</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#method-1-install-from-python-wheels">Method 1: Install from Python Wheels</a></li>
<li class="toctree-l3"><a class="reference internal" href="#method-2-install-from-source">Method 2: Install from Source</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#check-installation">Check Installation</a></li>
<li class="toctree-l2"><a class="reference internal" href="#optional-pytorch-frontend">[Optional] PyTorch Frontend</a></li>
<li class="toctree-l2"><a class="reference internal" href="#troubleshooting">Troubleshooting</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#unhandled-cuda-error">Unhandled Cuda Error</a></li>
<li class="toctree-l3"><a class="reference internal" href="#using-alpa-on-slurm">Using Alpa on Slurm</a></li>
<li class="toctree-l3"><a class="reference internal" href="#jaxlib-jax-flax-version-problems">Jaxlib, Jax, Flax Version Problems</a></li>
<li class="toctree-l3"><a class="reference internal" href="#numpy-version-problems">Numpy Version Problems</a></li>
<li class="toctree-l3"><a class="reference internal" href="#tests-hang-with-no-errors-on-multi-gpu-nodes">Tests Hang with no Errors on Multi-GPU Nodes</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="tutorials/quickstart.html">Alpa Quickstart</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Tutorials</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="tutorials/pipeshard_parallelism.html">Distributed Training with Both Shard and Pipeline Parallelism</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials/alpa_vs_pmap.html">Differences between alpa.parallelize, jax.pmap and jax.pjit</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials/opt_serving.html">Serving OPT-175B, BLOOM-176B and CodeGen-16B using Alpa</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials/perf_tuning_guide.html">Performance Tuning Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials/icml_big_model_tutorial.html">ICML’22 Big Model Tutorial</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials/alpa_on_slurm.html">Using Alpa on Slurm</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials/faq.html">Frequently Asked Questions (FAQ)</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Architecture</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="architecture/overview.html">Design and Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/alpa_compiler_walk_through.html">Alpa Compiler Walk-Through</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/intra_op_solver.html">Code Structure of the Intra-op Solver</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Benchmark</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="benchmark/benchmark.html">Performance Benchmark</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Publications</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="publications/publications.html">Publications</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Developer Guide</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="developer/developer_guide.html">Developer Guide</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="index.html">Alpa</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="index.html" class="icon icon-home"></a> »</li>
<li>Install Alpa</li>
<li class="wy-breadcrumbs-aside">
<a href="https://github.com/alpa-projects/alpa/blob/main/docs/install.rst" class="fa fa-github"> Edit on GitHub</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="install-alpa">
<h1>Install Alpa<a class="headerlink" href="#install-alpa" title="Permalink to this headline"></a></h1>
<p>This page provides instructions for installing Alpa from Python wheels or from source. The minimum supported Python version is 3.7.</p>
<section id="prerequisites">
<h2>Prerequisites<a class="headerlink" href="#prerequisites" title="Permalink to this headline"></a></h2>
<p>Whether you install from wheels or from source, a few prerequisite packages are required:</p>
<ol class="arabic simple">
<li><p>CUDA toolkit:</p></li>
</ol>
<blockquote>
<div><p>Follow the official guides to install <a class="reference external" href="https://developer.nvidia.com/cuda-toolkit">CUDA</a> and <a class="reference external" href="https://developer.nvidia.com/cudnn">cuDNN</a>.
Alpa requires CUDA >= 11.1 and cuDNN >= 8.0.5.</p>
</div></blockquote>
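<p>If you are unsure which CUDA and cuDNN versions are installed, the commands below are one way to check them (a sketch; the cuDNN header path is an assumption and may differ on your system):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Check the CUDA toolkit version
nvcc --version
# Check the GPU driver version and the highest CUDA version it supports
nvidia-smi
# Check the cuDNN version (the header may live elsewhere, e.g. /usr/include/x86_64-linux-gnu/)
grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn_version.h
</pre></div>
</div>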
<ol class="arabic simple" start="2">
<li><p>Update pip version and install cupy:</p></li>
</ol>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Update pip</span>
pip3 install --upgrade pip
<span class="c1"># Install cupy</span>
pip3 install cupy-cuda11x
</pre></div>
</div>
<p>Then, check whether your system already has NCCL installed.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>python3 -c <span class="s2">"from cupy.cuda import nccl"</span>
</pre></div>
</div>
<p>If the command prints nothing, NCCL is already installed.
Otherwise, follow the printed instructions to install NCCL (one common way is sketched below).</p>
</div></blockquote>
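<p>If the import fails, one common way to install NCCL is through CuPy's bundled library installer (a sketch; adjust the <code class="docutils literal notranslate"><span class="pre">--cuda</span></code> value to your CUDA version):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Install NCCL via CuPy's library installer
python3 -m cupyx.tools.install_library --cuda 11.x --library nccl
# Verify that the import now succeeds (it should print nothing)
python3 -c "from cupy.cuda import nccl"
</pre></div>
</div>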
</section>
<section id="methods">
<h2>Methods<a class="headerlink" href="#methods" title="Permalink to this headline"></a></h2>
<p>Choose one of the methods below.</p>
<section id="method-1-install-from-python-wheels">
<span id="install-from-wheels"></span><h3>Method 1: Install from Python Wheels<a class="headerlink" href="#method-1-install-from-python-wheels" title="Permalink to this headline"></a></h3>
<p>Alpa provides wheels for the following CUDA (cuDNN) and Python versions:</p>
<ul class="simple">
<li><p>CUDA (cuDNN): 11.1 (8.0.5), 11.2 (8.1.0), 11.3 (8.2.0)</p></li>
<li><p>Python: 3.7, 3.8, 3.9</p></li>
</ul>
<p>If you need to use other CUDA, cuDNN, or Python versions, please follow the next section to <a class="reference internal" href="#install-from-source"><span class="std std-ref">install from source</span></a>.</p>
<ol class="arabic simple">
<li><p>Install the Alpa Python package.</p></li>
</ol>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip3 install alpa
</pre></div>
</div>
</div></blockquote>
<ol class="arabic simple" start="2">
<li><p>Install the Alpa-modified jaxlib. Make sure that the jaxlib version corresponds to the versions of
the existing CUDA and cuDNN installation you want to use.
You can specify a particular CUDA and cuDNN version for jaxlib explicitly via:</p></li>
</ol>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip3 install <span class="nv">jaxlib</span><span class="o">==</span><span class="m">0</span>.3.22+cuda<span class="o">{</span>cuda_version<span class="o">}</span>.cudnn<span class="o">{</span>cudnn_version<span class="o">}</span> -f https://alpa-projects.github.io/wheels.html
</pre></div>
</div>
<p>For example, to install the wheel compatible with CUDA >= 11.1 and cuDNN >= 8.0.5, use the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip3 install <span class="nv">jaxlib</span><span class="o">==</span><span class="m">0</span>.3.22+cuda111.cudnn805 -f https://alpa-projects.github.io/wheels.html
</pre></div>
</div>
<p>You can see all available wheel versions we provide at our <a class="reference external" href="https://alpa-projects.github.io/wheels.html">PyPI index</a>.</p>
</div></blockquote>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Currently, the Alpa-modified jaxlib is based on version <code class="docutils literal notranslate"><span class="pre">jaxlib==0.3.22</span></code>. Alpa regularly rebases onto the official jaxlib repository to keep up with upstream.</p>
</div>
</section>
<section id="method-2-install-from-source">
<span id="install-from-source"></span><h3>Method 2: Install from Source<a class="headerlink" href="#method-2-install-from-source" title="Permalink to this headline"></a></h3>
<ol class="arabic simple">
<li><p>Clone the repository and its submodules.</p></li>
</ol>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>git clone --recursive https://github.com/alpa-projects/alpa.git
</pre></div>
</div>
</div></blockquote>
<ol class="arabic simple" start="2">
<li><p>Install the Alpa Python package.</p></li>
</ol>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">cd</span> alpa
pip3 install -e <span class="s2">".[dev]"</span> <span class="c1"># Note that the suffix `[dev]` is required to build custom modules.</span>
</pre></div>
</div>
</div></blockquote>
<ol class="arabic simple" start="3">
<li><p>Build and install the Alpa-modified jaxlib, which contains Alpa’s C++ code.</p></li>
</ol>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">cd</span> build_jaxlib
python3 build/build.py --enable_cuda --dev_install --bazel_options<span class="o">=</span>--override_repository<span class="o">=</span><span class="nv">org_tensorflow</span><span class="o">=</span><span class="k">$(</span><span class="nb">pwd</span><span class="k">)</span>/../third_party/tensorflow-alpa
<span class="nb">cd</span> dist
pip3 install -e .
</pre></div>
</div>
</div></blockquote>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Building the latest Alpa-modified jaxlib requires the C++17 standard. Some compiler versions, such as <code class="docutils literal notranslate"><span class="pre">gcc==7.3</span></code> or <code class="docutils literal notranslate"><span class="pre">gcc==9.4</span></code>, are known to miscompile the jaxlib code.
See <a class="reference external" href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90415">this thread</a> about the known issues.</p>
<p>If you encounter compilation errors, please install our recommended gcc version <code class="docutils literal notranslate"><span class="pre">gcc==7.5</span></code>; newer gcc versions might also work.
Then clean the bazel cache (<code class="docutils literal notranslate"><span class="pre">rm</span> <span class="pre">-rf</span> <span class="pre">~/.cache/bazel</span></code>) and build jaxlib again (see the sketch after this note).</p>
</div>
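<p>A minimal sketch of that recovery, assuming the build failed once and you are in the repository root; the commands are the same ones used in step 3:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Remove the bazel cache left by the failed build
rm -rf ~/.cache/bazel
# Rebuild jaxlib with the same command as step 3
cd build_jaxlib
python3 build/build.py --enable_cuda --dev_install --bazel_options=--override_repository=org_tensorflow=$(pwd)/../third_party/tensorflow-alpa
</pre></div>
</div>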
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>All installations are in development mode, so changes to Python code take effect immediately.
After modifying C++ code in tensorflow-alpa, you only need to re-run the command below from step 3 to recompile jaxlib:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>python3 build/build.py --enable_cuda --dev_install --bazel_options=--override_repository=org_tensorflow=$(pwd)/../third_party/tensorflow-alpa
</pre></div>
</div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The Alpa Python package and the Alpa-modified jaxlib are two separate libraries. If you only want to develop the Python source code, you can install
the Alpa Python package from source and the Alpa-modified jaxlib from wheels (a sketch of this combination is shown below).</p>
</div>
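<p>A sketch of that combination, reusing the commands above (adjust the CUDA/cuDNN tags in the jaxlib wheel name to your setup):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Install the Alpa Python package from source
git clone --recursive https://github.com/alpa-projects/alpa.git
cd alpa
pip3 install -e ".[dev]"
# Install the prebuilt Alpa-modified jaxlib from our wheels
pip3 install jaxlib==0.3.22+cuda111.cudnn805 -f https://alpa-projects.github.io/wheels.html
</pre></div>
</div>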
</section>
</section>
<section id="check-installation">
<h2>Check Installation<a class="headerlink" href="#check-installation" title="Permalink to this headline"></a></h2>
<p>You can check the installation by running the following commands.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>ray start --head
python3 -m alpa.test_install
</pre></div>
</div>
</section>
<section id="optional-pytorch-frontend">
<h2>[Optional] PyTorch Frontend<a class="headerlink" href="#optional-pytorch-frontend" title="Permalink to this headline"></a></h2>
<p>While Alpa is mainly designed for Jax, it also provides an experimental PyTorch frontend.
Alpa supports PyTorch models that meet the following requirements:</p>
<ol class="arabic simple">
<li><p>No input-dependent control flow</p></li>
<li><p>No weight sharing</p></li>
</ol>
<p>To enable Alpa for PyTorch, install the following dependencies:</p>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Install torch and torchdistx</span>
pip3 uninstall -y torch torchdistx
pip install --extra-index-url https://download.pytorch.org/whl/cpu <span class="nv">torch</span><span class="o">==</span><span class="m">1</span>.12 torchdistx
<span class="c1"># Build functorch from source</span>
git clone https://github.com/pytorch/functorch
<span class="nb">cd</span> functorch/
git checkout 76976db8412b60d322c680a5822116ba6f2f762a
python3 setup.py install
</pre></div>
</div>
</div></blockquote>
<p>Please look at <code class="docutils literal notranslate"><span class="pre">tests/torch_frontend/test_simple.py</span></code> for usage examples.</p>
</section>
<section id="troubleshooting">
<h2>Troubleshooting<a class="headerlink" href="#troubleshooting" title="Permalink to this headline"></a></h2>
<section id="unhandled-cuda-error">
<h3>Unhandled Cuda Error<a class="headerlink" href="#unhandled-cuda-error" title="Permalink to this headline"></a></h3>
<p>If you see errors like <code class="docutils literal notranslate"><span class="pre">cupy_backends.cuda.libs.nccl.NcclError:</span> <span class="pre">NCCL_ERROR_UNHANDLED_CUDA_ERROR:</span> <span class="pre">unhandled</span> <span class="pre">cuda</span> <span class="pre">error</span></code>, it is usually due to compatibility issues between the CUDA, NCCL, and GPU driver versions. Please double-check these versions and see <a class="reference external" href="https://github.com/alpa-projects/alpa/issues/496">Issue #496</a> for more details.</p>
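<p>A sketch of such a check, assuming CuPy is installed as in the prerequisites (the NCCL query uses <code class="docutils literal notranslate"><span class="pre">cupy.cuda.nccl.get_version</span></code>):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># GPU driver version and the highest CUDA version it supports
nvidia-smi
# CUDA toolkit version
nvcc --version
# NCCL version seen by CuPy (printed as an integer, e.g. 21205 for 2.12.5)
python3 -c "from cupy.cuda import nccl; print(nccl.get_version())"
</pre></div>
</div>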
</section>
<section id="using-alpa-on-slurm">
<h3>Using Alpa on Slurm<a class="headerlink" href="#using-alpa-on-slurm" title="Permalink to this headline"></a></h3>
<p>Since Alpa relies on Ray to manage the cluster nodes, Alpa can run on a Slurm cluster as long as Ray can run on it.
If you have trouble running Alpa on a Slurm cluster, we recommend following <a class="reference external" href="https://docs.ray.io/en/latest/cluster/slurm.html">this guide</a> to set up Ray on Slurm, making sure simple Ray examples
run without any problems, and then installing and running Alpa in the same environment.</p>
<p>Common issues of running Alpa on Slurm include:</p>
<ul class="simple">
<li><p>The Slurm cluster has additional networking proxies installed, so XLA client connections time out. Example errors can be found in <a class="reference external" href="https://github.com/alpa-projects/alpa/issues/452#issuecomment-1134260817">this thread</a>.
Slurm users might need to check and fix those proxies on their cluster and make sure the processes spawned by Alpa can see each other.</p></li>
<li><p>When launching a Slurm job with <code class="docutils literal notranslate"><span class="pre">srun</span></code>, the job does not request enough CPU threads or GPU resources for Ray to spawn its actors.
Adjust the value of the <code class="docutils literal notranslate"><span class="pre">--cpus-per-task</span></code> argument passed to <code class="docutils literal notranslate"><span class="pre">srun</span></code> when launching Alpa (see the example after this list). See the <a class="reference external" href="https://slurm.schedmd.com/srun.html">Slurm documentation</a> for more information.</p></li>
</ul>
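<p>As an illustrative sketch (the resource numbers and time limit are placeholders to adapt to your cluster and workload), an allocation that gives Ray enough CPU threads and GPUs on a single node might look like:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Placeholder resource numbers; adjust them to your cluster and workload
srun --ntasks=1 --cpus-per-task=16 --gres=gpu:4 --time=01:00:00 \
    bash -c "ray start --head; python3 -m alpa.test_install"
</pre></div>
</div>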
<p>You might also find the discussion under <a class="reference external" href="https://github.com/alpa-projects/alpa/issues/452">Issue #452</a> helpful.</p>
</section>
<section id="jaxlib-jax-flax-version-problems">
<h3>Jaxlib, Jax, Flax Version Problems<a class="headerlink" href="#jaxlib-jax-flax-version-problems" title="Permalink to this headline"></a></h3>
<p>Alpa is only tested against specific versions of Jax and Flax.
The recommended Jax and Flax versions are specified by <code class="docutils literal notranslate"><span class="pre">install_require_list</span></code> in <a class="reference external" href="https://github.com/alpa-projects/alpa/blob/main/setup.py">setup.py</a>.
(If you are not using the latest HEAD, check out the file at the corresponding version tag.)</p>
<p>If you see version errors like below</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>>>> import alpa
......
RuntimeError: jaxlib version <span class="m">0</span>.3.7 is newer than and incompatible with jax version <span class="m">0</span>.3.5. Please update your jax and/or jaxlib packages
</pre></div>
</div>
<p>Make sure your Jax, Flax, and Optax/Chex versions are compatible with the versions specified in Alpa’s <code class="docutils literal notranslate"><span class="pre">setup.py</span></code>.
Also make sure you re-install the <strong>Alpa-modified Jaxlib</strong>, either from <a class="reference internal" href="#install-from-wheels"><span class="std std-ref">our prebuilt wheels</span></a> or by <a class="reference internal" href="#install-from-source"><span class="std std-ref">installing from source</span></a>, to overwrite the default Jaxlib.</p>
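<p>A sketch of one way to list the relevant installed versions, assuming you run it from a clone of the Alpa repository:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Show the installed versions of the packages Alpa pins
pip3 list | grep -iE "jax|jaxlib|flax|optax|chex"
# Show the versions pinned by Alpa (install_require_list lives in setup.py)
grep -A 20 install_require_list setup.py
</pre></div>
</div>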
</section>
<section id="numpy-version-problems">
<h3>Numpy Version Problems<a class="headerlink" href="#numpy-version-problems" title="Permalink to this headline"></a></h3>
<p>If you start with a clean Python virtual environment and follow the procedures in this guide strictly, you should not see problems with numpy versions.</p>
<p>However, installing other Python packages can sometimes silently pull in a different version of numpy before jaxlib is compiled,
and you might then see numpy version errors similar to the following when launching Alpa after installing from source:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>>>> python3 tests/test_install.py
......
RuntimeError: module compiled against API version 0xf but this version of numpy is 0xd
ImportError: numpy.core._multiarray_umath failed to import
ImportError: numpy.core.umath failed to import
<span class="m">2022</span>-05-20 <span class="m">21</span>:57:35.710782: F external/org_tensorflow/tensorflow/compiler/xla/python/xla.cc:83<span class="o">]</span> Check failed: tensorflow::RegisterNumpyBfloat16<span class="o">()</span>
Aborted <span class="o">(</span>core dumped<span class="o">)</span>
</pre></div>
</div>
<p>This happens when jaxlib was compiled against a higher version of numpy but a lower version of numpy is later used to run Alpa.</p>
<p>To address the problem, first downgrade the numpy in your Python environment to <code class="docutils literal notranslate"><span class="pre">numpy==1.20</span></code> via <code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">numpy==1.20</span></code>,
then follow the procedures in <a class="reference internal" href="#install-from-source"><span class="std std-ref">install from source</span></a> to rebuild and reinstall jaxlib.
Optionally, you can switch back to the higher version of numpy (<code class="docutils literal notranslate"><span class="pre">numpy>=1.20</span></code>) to run Alpa and your other applications, thanks to numpy’s backward compatibility.</p>
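<p>A minimal sketch of these steps, assuming you build from the repository root as in Method 2:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># 1. Downgrade numpy before rebuilding jaxlib
pip3 install numpy==1.20
# 2. Rebuild and reinstall the Alpa-modified jaxlib (same commands as "Install from Source")
cd build_jaxlib
python3 build/build.py --enable_cuda --dev_install --bazel_options=--override_repository=org_tensorflow=$(pwd)/../third_party/tensorflow-alpa
cd dist
pip3 install -e .
# 3. Optionally switch back to a newer numpy to run Alpa
pip3 install "numpy>=1.20"
</pre></div>
</div>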
<p>See <a class="reference external" href="https://github.com/alpa-projects/alpa/issues/461">Issue#461</a> for more discussion.</p>
</section>
<section id="tests-hang-with-no-errors-on-multi-gpu-nodes">
<h3>Tests Hang with no Errors on Multi-GPU Nodes<a class="headerlink" href="#tests-hang-with-no-errors-on-multi-gpu-nodes" title="Permalink to this headline"></a></h3>
<p>This could be an indication that IO virtualization (VT-d or IOMMU) is interfering with the NCCL library. On multi-GPU systems, these mechanisms can redirect PCI point-to-point traffic to the CPU, causing performance reductions or hangs. They can typically be disabled in the BIOS, or sometimes from the OS. You can find more information in NVIDIA’s NCCL troubleshooting guide <a class="reference external" href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html">here</a>. Note that disabling IO virtualization can introduce security vulnerabilities, since peripherals gain read/write access to DRAM through DMA (Direct Memory Access).</p>
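<p>A minimal sketch of one way to check whether an IOMMU is active, assuming a Linux host (the exact kernel log messages vary by platform):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Check whether an IOMMU was enabled at boot (Intel VT-d reports as DMAR, AMD as AMD-Vi)
dmesg | grep -i -e DMAR -e IOMMU
# Check the kernel command line for iommu-related options
cat /proc/cmdline
</pre></div>
</div>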
</section>
</section>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="index.html" class="btn btn-neutral float-left" title="Alpa Documentation" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="tutorials/quickstart.html" class="btn btn-neutral float-right" title="Alpa Quickstart" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>© Copyright 2022, Alpa Developers.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
<!-- Theme Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-587CCSSRL2"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-587CCSSRL2', {
'anonymize_ip': false,
});
</script>
</body>
</html>