
UCX 1.10.1 sometimes fails to find mlx5_core for some MPI tasks on single node runs #8704

akesandgren opened this issue Nov 16, 2022 · 9 comments

@akesandgren

akesandgren commented Nov 16, 2022

Describe the bug

Sometimes, but not always, when running OpenMPI 4.1.1 with UCX 1.10.1, the pml/ucx component fails to find mlx5_core for some of the tasks in a single-node run.

Steps to Reproduce

  • srun mpicode (see the launch sketch below)
  • UCX version used: 1.10.1
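
A minimal single-node launch along these lines is enough to hit the selection path (a sketch; the node/task counts and binary name are placeholders, not taken from the report):

# Sketch of a single-node launch; the -N/-n values and the binary name are placeholders
srun -N 1 -n 8 ./mpicode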

Setup and versions

  • OS version: Rocky Linux release 8.5
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rdma-core-54mlnx1-1.54340

Additional information (depending on the issue)

  • OpenMPI version: 4.1.1
  • Output of ucx_info -d to show transports and devices recognized by UCX
ucx_info -d | grep -E 'Transport|Device'
#      Transport: posix
#         Device: memory
#      Transport: sysv
#         Device: memory
#      Transport: self
#         Device: memory
#      Transport: tcp
#         Device: lo
#      Transport: tcp
#         Device: ens1f1.1043
#      Transport: tcp
#         Device: enp0s20f0u1u6
#      Transport: tcp
#         Device: ens1f1
#      Transport: tcp
#         Device: ib0
#      Transport: tcp
#         Device: ens1f1.1044
#      Transport: rc_verbs
#         Device: mlx5_0:1
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#      Transport: dc_mlx5
#         Device: mlx5_0:1
#      Transport: ud_verbs
#         Device: mlx5_0:1
#      Transport: ud_mlx5
#         Device: mlx5_0:1
#      Transport: rc_verbs
#         Device: mlx5_1:1
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#      Transport: dc_mlx5
#         Device: mlx5_1:1
#      Transport: ud_verbs
#         Device: mlx5_1:1
#      Transport: ud_mlx5
#         Device: mlx5_1:1
#      Transport: cma
#         Device: memory

What happens is that, for a single-node job, the diff of the pml/ucx component debug output between two tasks looks like this:

--- /tmp/good.ucx.1     2022-11-16 07:12:47.124598108 +0100
+++ /tmp/bad.ucx.1      2022-11-16 07:12:53.156640711 +0100
@@ -13,15 +13,27 @@
 common_ucx.c:303 tcp/ens1f1: did not match transport list
 common_ucx.c:303 tcp/ib0: did not match transport list
 common_ucx.c:303 tcp/ens1f1.1044: did not match transport list
-common_ucx.c:198 driver '../../../../bus/pci/drivers/mlx5_core' matched by 'mlx*'
-common_ucx.c:294 rc_verbs/mlx5_0:1: matched both transport and device list
-common_ucx.c:311 support level is transports and devices
+common_ucx.c:298 rc_verbs/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 rc_mlx5/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 dc_mlx5/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 ud_verbs/mlx5_0:1: matched transport list but not device list
+common_ucx.c:303 ud_mlx5/mlx5_0:1: did not match transport list
+common_ucx.c:298 rc_verbs/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 rc_mlx5/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 dc_mlx5/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 ud_verbs/mlx5_1:1: matched transport list but not device list
+common_ucx.c:303 ud_mlx5/mlx5_1:1: did not match transport list
+common_ucx.c:303 cma/memory: did not match transport list
+common_ucx.c:311 support level is transports only
 pml_ucx.c:289 mca_pml_ucx_init
 pml_ucx.c:114 Pack remote worker address, size 482
 pml_ucx.c:114 Pack local worker address, size 618
-pml_ucx.c:351 created ucp context 0x23184d0, worker 0x28283a0
-pml_ucx_component.c:129 returning priority 51
-selected ucx best priority 51
-select: component ucx selected
-pml_ucx.c:182 Got proc 2 address, size 618
-pml_ucx.c:411 connecting to proc. 2
+pml_ucx.c:351 created ucp context 0x902650, worker 0xe053d0
+pml_ucx_component.c:129 returning priority 19
+pml_ucx.c:367 mca_pml_ucx_cleanup
+select: component ucx not selected / finalized
+pml_ucx.c:268 mca_pml_ucx_close
+mca: base: close: component ucx closed
+mca: base: close: unloading component ucx
+check:select: checking my pml cm against process [[33740,728],0] pml ucx
+[[33740,728],10] selected pml cm, but peer [[33740,728],0] on alvis-cpu1 selected pml ucx

Has this been fixed in later versions, and if so, which commit(s) are involved?
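
For context, per-rank selection traces like the ones in the diff above can be captured roughly like this (a sketch; the MCA parameter names are assumed from OpenMPI 4.1.x and the output filename pattern is a placeholder):

# Sketch: raise pml framework and pml/ucx verbosity via the OMPI_MCA_* environment form,
# and keep one log file per task so good and bad ranks can be diffed
export OMPI_MCA_pml_base_verbose=10
export OMPI_MCA_pml_ucx_verbose=100
srun --output=pml-ucx.%t.log ./mpicode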

@brminich
Contributor

@akesandgren, can you please attach the mentioned log files?

@akesandgren
Author

Exactly which ones do you mean?

@brminich
Contributor

In the original description you mentioned the following:

Configure result - config.log
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

Do you have those logs available?

@akesandgren
Author

The config output, yes; not the other one. I missed removing those lines originally.
And we don't have an --enable-logging build to run with.

The interesting part here is how two different tasks on the same node can end up with different outcomes in the choice of whether to use UCX. How it is actually built shouldn't really matter; it's a standard build with:

contrib/configure-release --prefix=/apps/Arch/software/UCX/1.10.0-GCCcore-10.3.0 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc

UCX.config.txt

@brminich
Contributor

brminich commented Nov 16, 2022

OK, I see. Can you please upload the outputs of both the bad and the good runs? Also, please set UCX_LOG_LEVEL=debug when running the apps (even the release UCX build contains some debug traces).
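
For example, something along these lines (a sketch; the output filename pattern and binary name are placeholders):

# Sketch: enable UCX debug tracing and write one output file per task
export UCX_LOG_LEVEL=debug
srun --output=ucx-debug.%t.log ./mpicode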

@akesandgren
Author

Here's a tar file with a correct and a failing output. Both cases were run on the same node.
UCX-fail.tar.gz

@brminich
Contributor

Thanks for the logs. The issue seems to be similar to #8511. The following errors appear when the job fails:

[1668610663.569809] [alvis-cpu1:2191091:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/2191086/fd/40 flags=0x0) failed: No such file or directory
[1668610663.569831] [alvis-cpu1:2191091:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216eee: Shared memory error

[1668610663.573565] [alvis-cpu1:2191117:1]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/2191116/fd/40 flags=0x0) failed: Permission denied
[1668610663.573574] [alvis-cpu1:2191117:1]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216f0c: Shared memory error

You are not using containers, right?
Does setting UCX_POSIX_USE_PROC_LINK=n make the problem go away?
What is the output of ipcs -l?
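
Both checks can be done in one go, roughly like this (a sketch; the launch line itself is a placeholder):

# Sketch: disable the /proc/<pid>/fd link mechanism of the posix shared-memory transport
# for a single run, then dump the node's SysV IPC limits
UCX_POSIX_USE_PROC_LINK=n srun ./mpicode
ipcs -l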

@akesandgren
Author

No, not using containers.

From the node I've been running on:

ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

UCX_POSIX_USE_PROC_LINK=n does not fix the problem.

UCX 1.11.2 does not seem to have this problem; at least I've been unable to trigger it (still using OpenMPI 4.1.1).

@yosefe
Contributor

yosefe commented Nov 20, 2022

A similar issue was fixed by open-mpi/ompi#9505, which is part of OpenMPI 4.1.2 and above.
