
UCX 1.10.1 sometimes fails to find mlx5_core for some MPI tasks on single node runs #8704

akesandgren opened this issue Nov 16, 2022 · 9 comments

@akesandgren

akesandgren commented Nov 16, 2022

Describe the bug

Sometimes, but not always, when running OpenMPI 4.1.1 with UCX 1.10.1, the pml/ucx component fails to find mlx5_core for some of the tasks in a single-node run.

Steps to Reproduce

  • srun mpicode (see the launch sketch below)
  • UCX version used: 1.10.1
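
A minimal single-node launch along these lines is enough to hit the selection path (a sketch; the node/task counts and binary name are placeholders, not taken from the report):

# Sketch of a single-node launch; the -N/-n values and the binary name are placeholders
srun -N 1 -n 8 ./mpicode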

Setup and versions

  • OS version: Rocky Linux release 8.5
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rdma-core-54mlnx1-1.54340

Additional information (depending on the issue)

  • OpenMPI version: 4.1.1
  • Output of ucx_info -d to show transports and devices recognized by UCX
ucx_info -d | grep -E 'Transport|Device'
#      Transport: posix
#         Device: memory
#      Transport: sysv
#         Device: memory
#      Transport: self
#         Device: memory
#      Transport: tcp
#         Device: lo
#      Transport: tcp
#         Device: ens1f1.1043
#      Transport: tcp
#         Device: enp0s20f0u1u6
#      Transport: tcp
#         Device: ens1f1
#      Transport: tcp
#         Device: ib0
#      Transport: tcp
#         Device: ens1f1.1044
#      Transport: rc_verbs
#         Device: mlx5_0:1
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#      Transport: dc_mlx5
#         Device: mlx5_0:1
#      Transport: ud_verbs
#         Device: mlx5_0:1
#      Transport: ud_mlx5
#         Device: mlx5_0:1
#      Transport: rc_verbs
#         Device: mlx5_1:1
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#      Transport: dc_mlx5
#         Device: mlx5_1:1
#      Transport: ud_verbs
#         Device: mlx5_1:1
#      Transport: ud_mlx5
#         Device: mlx5_1:1
#      Transport: cma
#         Device: memory

What happens is that, for a single-node job, the diff of the pml/ucx component debug output between two tasks looks like this:

--- /tmp/good.ucx.1     2022-11-16 07:12:47.124598108 +0100
+++ /tmp/bad.ucx.1      2022-11-16 07:12:53.156640711 +0100
@@ -13,15 +13,27 @@
 common_ucx.c:303 tcp/ens1f1: did not match transport list
 common_ucx.c:303 tcp/ib0: did not match transport list
 common_ucx.c:303 tcp/ens1f1.1044: did not match transport list
-common_ucx.c:198 driver '../../../../bus/pci/drivers/mlx5_core' matched by 'mlx*'
-common_ucx.c:294 rc_verbs/mlx5_0:1: matched both transport and device list
-common_ucx.c:311 support level is transports and devices
+common_ucx.c:298 rc_verbs/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 rc_mlx5/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 dc_mlx5/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 ud_verbs/mlx5_0:1: matched transport list but not device list
+common_ucx.c:303 ud_mlx5/mlx5_0:1: did not match transport list
+common_ucx.c:298 rc_verbs/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 rc_mlx5/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 dc_mlx5/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 ud_verbs/mlx5_1:1: matched transport list but not device list
+common_ucx.c:303 ud_mlx5/mlx5_1:1: did not match transport list
+common_ucx.c:303 cma/memory: did not match transport list
+common_ucx.c:311 support level is transports only
 pml_ucx.c:289 mca_pml_ucx_init
 pml_ucx.c:114 Pack remote worker address, size 482
 pml_ucx.c:114 Pack local worker address, size 618
-pml_ucx.c:351 created ucp context 0x23184d0, worker 0x28283a0
-pml_ucx_component.c:129 returning priority 51
-selected ucx best priority 51
-select: component ucx selected
-pml_ucx.c:182 Got proc 2 address, size 618
-pml_ucx.c:411 connecting to proc. 2
+pml_ucx.c:351 created ucp context 0x902650, worker 0xe053d0
+pml_ucx_component.c:129 returning priority 19
+pml_ucx.c:367 mca_pml_ucx_cleanup
+select: component ucx not selected / finalized
+pml_ucx.c:268 mca_pml_ucx_close
+mca: base: close: component ucx closed
+mca: base: close: unloading component ucx
+check:select: checking my pml cm against process [[33740,728],0] pml ucx
+[[33740,728],10] selected pml cm, but peer [[33740,728],0] on alvis-cpu1 selected pml ucx

Has this been fixed in later versions, and if so, which commit(s) are involved?
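
For context, per-rank selection traces like the ones in the diff above can be captured roughly like this (a sketch; the MCA parameter names are assumed from OpenMPI 4.1.x and the output filename pattern is a placeholder):

# Sketch: raise pml framework and pml/ucx verbosity via the OMPI_MCA_* environment form,
# and keep one log file per task so good and bad ranks can be diffed
export OMPI_MCA_pml_base_verbose=10
export OMPI_MCA_pml_ucx_verbose=100
srun --output=pml-ucx.%t.log ./mpicode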

@brminich
Contributor

@akesandgren, can you please attach the mentioned log files?

@akesandgren
Author

Exactly which ones do you mean?

@brminich
Contributor

In the original description you mentioned the following:

Configure result - config.log
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

Do you have those logs available?

@akesandgren
Author

The config output, yes; not the other one. I missed removing those lines originally.
And we don't have an --enable-logging build to run with.

The interesting part here is how two different tasks on the same node can end up with different outcomes in the choice of whether to use UCX. How it is actually built shouldn't really matter; it's a standard build with:

contrib/configure-release --prefix=/apps/Arch/software/UCX/1.10.0-GCCcore-10.3.0 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc

UCX.config.txt

@brminich
Contributor

brminich commented Nov 16, 2022

OK, I see. Can you please upload the outputs of both the bad and the good runs? Also, please set UCX_LOG_LEVEL=debug when running the apps (even the release UCX build contains some debug traces).
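
For example, something along these lines (a sketch; the output filename pattern and binary name are placeholders):

# Sketch: enable UCX debug tracing and write one output file per task
export UCX_LOG_LEVEL=debug
srun --output=ucx-debug.%t.log ./mpicode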

@akesandgren
Author

Here's a tar file with a correct and a failing output. Both cases were run on the same node.
UCX-fail.tar.gz

@brminich
Contributor

Thanks for the logs. The issue seems to be similar to #8511. The following errors appear when the job fails:

[1668610663.569809] [alvis-cpu1:2191091:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/2191086/fd/40 flags=0x0) failed: No such file or directory
[1668610663.569831] [alvis-cpu1:2191091:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216eee: Shared memory error

[1668610663.573565] [alvis-cpu1:2191117:1]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/2191116/fd/40 flags=0x0) failed: Permission denied
[1668610663.573574] [alvis-cpu1:2191117:1]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216f0c: Shared memory error

You are not using containers, right?
Does setting UCX_POSIX_USE_PROC_LINK=n make the problem go away?
What is the output of ipcs -l?
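
Both checks can be done in one go, roughly like this (a sketch; the launch line itself is a placeholder):

# Sketch: disable the /proc/<pid>/fd link mechanism of the posix shared-memory transport
# for a single run, then dump the node's SysV IPC limits
UCX_POSIX_USE_PROC_LINK=n srun ./mpicode
ipcs -l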

@akesandgren
Author

No, not using containers.

From the node I've been running on:

ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

UCX_POSIX_USE_PROC_LINK=n does not fix the problem.

UCX 1.11.2 does not seem to have this problem; at least I've been unable to trigger it (still using OpenMPI 4.1.1).

@yosefe
Contributor

yosefe commented Nov 20, 2022

A similar issue was fixed by open-mpi/ompi#9505, which is part of OpenMPI 4.1.2 and above.
