mm ep failed to connect to remote FIFO id : shared memory error; open(file_name=/proc/25599/fd/35 flags=0x0) failed: No such file or directory #8511

Open
jamesongithub opened this issue Sep 8, 2022 · 11 comments

@jamesongithub

jamesongithub commented Sep 8, 2022

Describe the bug

During an mpirun of the HPL benchmark, UCX errors were encountered that caused the job to fail.

The error message looks like the following:

[1662593980.908446] [slurm-slehpc15-james-hpc-pg0-4:25595:0]        mm_posix.c:207  UCX  ERROR   open(file_name=/proc/25599/fd/35 flags=0x0) failed: No such file or directory
[1662593980.908470] [slurm-slehpc15-james-hpc-pg0-4:25595:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc0000008c00063ff: Shared memory error

This looks like the side issue that was reported in #4224 as well as in easybuilders/easybuild#756.

Steps to Reproduce

  • Command line
mpirun --debug-daemons \
	--mca opal_common_ucx_verbose 9 \
	--allow-run-as-root \
	--mca btl ^tcp \
	--mca opal_common_ucx_opal_mem_hooks 1 \
	/shared/home/james/hpl-2.3-dl/bin/xhpl
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
ucx_info -v
If 'ucx_info' is not a typo you can use command-not-found to lookup the package that contains it, like this:
    cnf ucx_info
  • Any UCX environment variables used
env |grep UCX
UCX_IB_PKEY=0x000b

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
 cat /etc/os-*release
NAME="SLE_HPC"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise High Performance Computing 15 SP4"
ID="sle_hpc"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sle_hpc:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"
VARIANT_ID="sles-hpc"

uname -a
Linux slurm-slehpc15-james-scheduler 5.14.21-150400.14.7-azure #1 SMP PREEMPT_DYNAMIC Tue Jul 12 09:32:53 UTC 2022 (00ddf73) x86_64 x86_64 x86_64 GNU/Linux
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rpm -q rdma-core or rpm -q libibverbs
rpm -q rdma-core
rdma-core-38.1-150400.4.6.x86_64
    - or: MLNX_OFED version `ofed_info -s`
  • HW information from ibstat or ibv_devinfo -vv command
sudo ibstatus
Infiniband device 'mlx5_0' port 1 status:
	default gid:	 unknown
	base lid:	 0x0
	sm lid:		 0x0
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 40 Gb/sec (4X QDR)
	link_layer:	 Ethernet
  • For GPU related issues: N/A
    • GPU type
    • Cuda:
      • Drivers version
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv

Additional information (depending on the issue)

  • OpenMPI version
 mpirun --version
mpirun (Open MPI) 4.1.1.0.a8dd8708d8b6

Report bugs to http://www.open-mpi.org/community/help/
  • Output of ucx_info -d to show transports and devices recognized by UCX
ucx_info -d
If 'ucx_info' is not a typo you can use command-not-found to lookup the package that contains it, like this:
    cnf ucx_info
  • Configure result - config.log
  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

https://gist.github.com/jamesongithub/bda88d5575aa06bedcf31255dae82b25

@jamesongithub
Author

@hoopoepg any idea?

@hoopoepg
Contributor

Something seems to be wrong with the proc file system. Are containers being used?
Try adding the variable
UCX_POSIX_USE_PROC_LINK=n
to the command line.
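
One way to pass such a variable to every rank is Open MPI's `-x` option. The sketch below only prints the resulting launch line instead of running it. The helper name `ucx_launch_line` is made up for illustration, and the binary path and `--mca btl ^tcp` flag are copied from the original command, so adjust them to your setup.

```shell
# Hypothetical helper: prints the launch line rather than executing it.
ucx_launch_line() {
    # $1 is a UCX setting such as UCX_POSIX_USE_PROC_LINK=n.
    # -x asks Open MPI's mpirun to export that variable to all ranks.
    printf 'mpirun -x %s --mca btl ^tcp %s\n' "$1" /shared/home/james/hpl-2.3-dl/bin/xhpl
}

ucx_launch_line UCX_POSIX_USE_PROC_LINK=n
```

The same helper can be reused for any other UCX_* setting by changing the argument.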

@jamesongithub
Author

@hoopoepg No containers are used.

I tried adding UCX_POSIX_USE_PROC_LINK=n and didn't see a difference.

Please see log: https://gist.github.com/jamesongithub/ca1c9618f0dd994f6bf8356147111543

@hoopoepg
Contributor

hoopoepg commented Oct 7, 2022

OK, it seems the POSIX shm transport failed to access shared memory.
Could you try excluding posix from the transports? Add the variable UCX_TLS=^posix to your command line.
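
To see which shared memory transport the errors come from, it can help to count the UCX ERROR lines per source file (mm_posix.c for posix, mm_sysv.c for sysv). This is a rough sketch: the function name `count_ucx_errors` is made up, the two sample lines are copied from the report, and it assumes the log keeps the `file:line  UCX  ERROR` layout shown there.

```shell
# Two lines copied from the report; in practice, feed the real job log instead.
sample_log='[1662593980.908446] [slurm-slehpc15-james-hpc-pg0-4:25595:0]        mm_posix.c:207  UCX  ERROR   open(file_name=/proc/25599/fd/35 flags=0x0) failed: No such file or directory
[1662593980.908470] [slurm-slehpc15-james-hpc-pg0-4:25595:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc0000008c00063ff: Shared memory error'

# Count UCX ERROR lines per source file (the field right before "UCX").
count_ucx_errors() {
    awk '/UCX +ERROR/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "UCX") {
                split($(i - 1), parts, ":")   # "mm_posix.c:207" -> "mm_posix.c"
                seen[parts[1]]++
                break
            }
        }
    } END { for (f in seen) print f, seen[f] }' | sort
}

printf '%s\n' "$sample_log" | count_ucx_errors
```

With a real log, `count_ucx_errors < job.log` would be the usual invocation, where `job.log` is a placeholder name.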

@jamesongithub
Author

jamesongithub commented Oct 7, 2022

@hoopoepg

The /proc errors are gone; now there are shmat errors:

[1665173472.252802] [slurm-slehpc15-james-hpc-pg0-12:44314:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655360) failed: Invalid argument
[1665173472.252818] [slurm-slehpc15-james-hpc-pg0-12:44314:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44311] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44313] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[1665173472.252778] [slurm-slehpc15-james-hpc-pg0-12:44312:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655360) failed: Invalid argument
[1665173472.252792] [slurm-slehpc15-james-hpc-pg0-12:44312:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44312] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[1665173472.254258] [slurm-slehpc15-james-hpc-pg0-3:44147:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254273] [slurm-slehpc15-james-hpc-pg0-3:44147:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.252575] [slurm-slehpc15-james-hpc-pg0-12:44296:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655382) failed: Invalid argument
[1665173472.252585] [slurm-slehpc15-james-hpc-pg0-12:44296:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0016: Shared memory error
[slurm-slehpc15-james-hpc-pg0-3:44147] pml_ucx.c:419  Error: ucp_ep_create(proc=121) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-12:44314] pml_ucx.c:419  Error: ucp_ep_create(proc=502) failed: Shared memory error
[slurm-slehpc15-james-hpc-pg0-3:44190] pml_ucx.c:419  Error: ucp_ep_create(proc=121) failed: Shared memory error
[1665173472.252885] [slurm-slehpc15-james-hpc-pg0-12:44313:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655360) failed: Invalid argument
[1665173472.252902] [slurm-slehpc15-james-hpc-pg0-12:44313:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0000: Shared memory error
[1665173472.254221] [slurm-slehpc15-james-hpc-pg0-3:44148:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(sysv/memory cma/memory dc_mlx5/mlx5_0:1); 
[slurm-slehpc15-james-hpc-pg0-12:44294] *** An error occurred in MPI_Init
[slurm-slehpc15-james-hpc-pg0-12:44294] *** reported by process [2433024001,506]
[slurm-slehpc15-james-hpc-pg0-12:44294] *** on a NULL communicator
[slurm-slehpc15-james-hpc-pg0-12:44294] *** Unknown error
[slurm-slehpc15-james-hpc-pg0-12:44294] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm-slehpc15-james-hpc-pg0-12:44294] ***    and potentially your MPI job)
[1665173472.254259] [slurm-slehpc15-james-hpc-pg0-3:44150:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(sysv/memory cma/memory dc_mlx5/mlx5_0:1); 
[1665173472.254621] [slurm-slehpc15-james-hpc-pg0-3:44186:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254633] [slurm-slehpc15-james-hpc-pg0-3:44186:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254622] [slurm-slehpc15-james-hpc-pg0-3:44187:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254636] [slurm-slehpc15-james-hpc-pg0-3:44187:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254650] [slurm-slehpc15-james-hpc-pg0-3:44185:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254667] [slurm-slehpc15-james-hpc-pg0-3:44185:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
[1665173472.254558] [slurm-slehpc15-james-hpc-pg0-3:44188:0]         mm_sysv.c:56   UCX  ERROR   shmat(shmid=655378) failed: Invalid argument
[1665173472.254576] [slurm-slehpc15-james-hpc-pg0-3:44188:0]           mm_ep.c:159  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xa0012: Shared memory error
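
As a side note on reading the log above: each shmid in a shmat error is the same number as the remote FIFO id in the mm_ep.c line that follows it, only printed in decimal instead of hex. The small sketch below does the conversion; the function name `shmid_to_hex` is made up for illustration. It only shows that both messages refer to the same segment id, not why shmat rejects it.

```shell
# Convert a decimal shmid from the shmat errors to the hex form used for the FIFO id.
shmid_to_hex() {
    printf '0x%x\n' "$1"
}

# The three shmid values that appear in the log above.
for shmid in 655360 655378 655382; do
    printf '%s -> %s\n' "$shmid" "$(shmid_to_hex "$shmid")"
done
```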

@hoopoepg
Contributor

hoopoepg commented Oct 9, 2022

It seems there are some restrictions on using shared memory on your system, so UCX can't use this transport at all.
To disable it, add the variable UCX_TLS=^sm; that should allow your application to run.
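
One place where such restrictions might show up is the kernel's System V shared memory limits. The sketch below is only a starting point for inspection, run on a compute node: the function name `show_sysv_limits` is made up, it assumes a Linux /proc layout, and nothing in this thread confirms that these limits are the actual cause.

```shell
# Print the System V shared memory limits the kernel exposes under /proc.
show_sysv_limits() {
    for name in shmmax shmall shmmni; do
        f="/proc/sys/kernel/$name"
        if [ -r "$f" ]; then
            printf '%s=%s\n' "$name" "$(cat "$f")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
    # List existing System V segments when the ipcs tool is installed.
    if command -v ipcs >/dev/null 2>&1; then
        ipcs -m || true
    fi
}

show_sysv_limits
```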

@jamesongithub
Author

jamesongithub commented Oct 10, 2022

With UCX_TLS=^sm I am still having issues.

[1665438954.954469] [slurm-slehpc15-james-hpc-pg0-2:26258:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27333] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27333] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27333] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:27334] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27334] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27334] Process is bound: distance to device is 0.000000
[1665438954.955595] [slurm-slehpc15-james-hpc-pg0-2:26262:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.956864] [slurm-slehpc15-james-hpc-pg0-2:26266:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.958393] [slurm-slehpc15-james-hpc-pg0-2:26264:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:27320] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:27320] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:27320] Process is bound: distance to device is 0.000000
[1665438954.961159] [slurm-slehpc15-james-hpc-pg0-2:26250:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.962243] [slurm-slehpc15-james-hpc-pg0-2:26263:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27331] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27331] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27331] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27331] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27331] select: init of component vader returned success
[1665438954.964097] [slurm-slehpc15-james-hpc-pg0-2:26265:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27330] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27330] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27330] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27330] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27330] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27342] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27342] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27342] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27342] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27342] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27337] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27337] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27337] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27337] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27337] select: init of component vader returned success
[1665438954.970425] [slurm-slehpc15-james-hpc-pg0-2:26251:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27303] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27303] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27303] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27303] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27303] select: init of component vader returned success
[1665438954.972518] [slurm-slehpc15-james-hpc-pg0-2:26257:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27326] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27326] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27326] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27326] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27326] select: init of component vader returned success
[1665438954.974107] [slurm-slehpc15-james-hpc-pg0-2:26260:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.974839] [slurm-slehpc15-james-hpc-pg0-2:26259:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.975534] [slurm-slehpc15-james-hpc-pg0-2:26271:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.977762] [slurm-slehpc15-james-hpc-pg0-2:26253:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27335] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27335] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27335] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27335] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27335] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27333] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27333] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27333] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27333] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27333] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27334] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27334] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27334] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27334] select: init of component vader returned success
[1665438954.979615] [slurm-slehpc15-james-hpc-pg0-2:26255:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.980498] [slurm-slehpc15-james-hpc-pg0-2:26254:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.981717] [slurm-slehpc15-james-hpc-pg0-2:26272:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.982201] [slurm-slehpc15-james-hpc-pg0-2:26267:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.982706] [slurm-slehpc15-james-hpc-pg0-2:26268:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.983096] [slurm-slehpc15-james-hpc-pg0-2:26269:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438954.984676] [slurm-slehpc15-james-hpc-pg0-2:26270:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:27320] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:27320] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:27320] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:27320] select: init of component vader returned success
[1665438954.992201] [slurm-slehpc15-james-hpc-pg0-1:27304:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.009533] [slurm-slehpc15-james-hpc-pg0-1:27306:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.078640] [slurm-slehpc15-james-hpc-pg0-1:27308:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.106084] [slurm-slehpc15-james-hpc-pg0-1:27300:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.120960] [slurm-slehpc15-james-hpc-pg0-1:27317:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.133759] [slurm-slehpc15-james-hpc-pg0-1:27315:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.135367] [slurm-slehpc15-james-hpc-pg0-1:27314:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.138056] [slurm-slehpc15-james-hpc-pg0-1:27299:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.152223] [slurm-slehpc15-james-hpc-pg0-1:27309:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.154816] [slurm-slehpc15-james-hpc-pg0-1:27310:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.157326] [slurm-slehpc15-james-hpc-pg0-1:27318:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.166507] [slurm-slehpc15-james-hpc-pg0-1:27302:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.167333] [slurm-slehpc15-james-hpc-pg0-1:27313:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.169634] [slurm-slehpc15-james-hpc-pg0-1:27321:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.172057] [slurm-slehpc15-james-hpc-pg0-1:27327:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.172603] [slurm-slehpc15-james-hpc-pg0-1:27340:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.177517] [slurm-slehpc15-james-hpc-pg0-1:27319:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.178461] [slurm-slehpc15-james-hpc-pg0-1:27341:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.179619] [slurm-slehpc15-james-hpc-pg0-1:27338:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.181082] [slurm-slehpc15-james-hpc-pg0-1:27329:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.182425] [slurm-slehpc15-james-hpc-pg0-1:27322:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.184469] [slurm-slehpc15-james-hpc-pg0-1:27324:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.189443] [slurm-slehpc15-james-hpc-pg0-1:27328:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.190392] [slurm-slehpc15-james-hpc-pg0-1:27325:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.192031] [slurm-slehpc15-james-hpc-pg0-1:27332:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.192613] [slurm-slehpc15-james-hpc-pg0-1:27336:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.194418] [slurm-slehpc15-james-hpc-pg0-1:27323:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.195034] [slurm-slehpc15-james-hpc-pg0-1:27339:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.196726] [slurm-slehpc15-james-hpc-pg0-1:27331:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.200069] [slurm-slehpc15-james-hpc-pg0-1:27330:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.200985] [slurm-slehpc15-james-hpc-pg0-1:27342:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.202910] [slurm-slehpc15-james-hpc-pg0-1:27337:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.203561] [slurm-slehpc15-james-hpc-pg0-1:27333:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.204374] [slurm-slehpc15-james-hpc-pg0-1:27326:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.205513] [slurm-slehpc15-james-hpc-pg0-1:27303:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.205674] [slurm-slehpc15-james-hpc-pg0-1:27335:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.206344] [slurm-slehpc15-james-hpc-pg0-1:27334:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.206871] [slurm-slehpc15-james-hpc-pg0-1:27320:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665438955.216721] [slurm-slehpc15-james-hpc-pg0-1:27314:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216712] [slurm-slehpc15-james-hpc-pg0-1:27316:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216732] [slurm-slehpc15-james-hpc-pg0-1:27301:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216715] [slurm-slehpc15-james-hpc-pg0-1:27304:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216717] [slurm-slehpc15-james-hpc-pg0-1:27306:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216713] [slurm-slehpc15-james-hpc-pg0-1:27311:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216716] [slurm-slehpc15-james-hpc-pg0-1:27312:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216764] [slurm-slehpc15-james-hpc-pg0-1:27308:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27314:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216805] [slurm-slehpc15-james-hpc-pg0-1:27316:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216815] [slurm-slehpc15-james-hpc-pg0-1:27301:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27304:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216782] [slurm-slehpc15-james-hpc-pg0-1:27299:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216803] [slurm-slehpc15-james-hpc-pg0-1:27306:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216771] [slurm-slehpc15-james-hpc-pg0-1:27305:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216807] [slurm-slehpc15-james-hpc-pg0-1:27311:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216774] [slurm-slehpc15-james-hpc-pg0-1:27300:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216799] [slurm-slehpc15-james-hpc-pg0-1:27317:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216812] [slurm-slehpc15-james-hpc-pg0-1:27310:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216829] [slurm-slehpc15-james-hpc-pg0-1:27318:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216863] [slurm-slehpc15-james-hpc-pg0-1:27308:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216865] [slurm-slehpc15-james-hpc-pg0-1:27299:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216833] [slurm-slehpc15-james-hpc-pg0-1:27309:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216821] [slurm-slehpc15-james-hpc-pg0-1:27302:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216855] [slurm-slehpc15-james-hpc-pg0-1:27305:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216819] [slurm-slehpc15-james-hpc-pg0-1:27315:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216856] [slurm-slehpc15-james-hpc-pg0-1:27300:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216892] [slurm-slehpc15-james-hpc-pg0-1:27310:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216910] [slurm-slehpc15-james-hpc-pg0-1:27318:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216916] [slurm-slehpc15-james-hpc-pg0-1:27309:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216903] [slurm-slehpc15-james-hpc-pg0-1:27302:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216905] [slurm-slehpc15-james-hpc-pg0-1:27315:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[1665438955.216881] [slurm-slehpc15-james-hpc-pg0-1:27317:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216878] [slurm-slehpc15-james-hpc-pg0-1:27313:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216880] [slurm-slehpc15-james-hpc-pg0-1:27321:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216969] [slurm-slehpc15-james-hpc-pg0-1:27340:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216976] [slurm-slehpc15-james-hpc-pg0-1:27329:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216974] [slurm-slehpc15-james-hpc-pg0-1:27322:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216950] [slurm-slehpc15-james-hpc-pg0-1:27336:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216969] [slurm-slehpc15-james-hpc-pg0-1:27324:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27338:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216958] [slurm-slehpc15-james-hpc-pg0-1:27313:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216988] [slurm-slehpc15-james-hpc-pg0-1:27325:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216975] [slurm-slehpc15-james-hpc-pg0-1:27339:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216956] [slurm-slehpc15-james-hpc-pg0-1:27332:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27341:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27341:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216939] [slurm-slehpc15-james-hpc-pg0-1:27319:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27319:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[slurm-slehpc15-james-hpc-pg0-1:27328] [[21652,1],28] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[slurm-slehpc15-james-hpc-pg0-2:26233] [[21652,1],48] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[1665438955.216943] [slurm-slehpc15-james-hpc-pg0-1:27327:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217024] [slurm-slehpc15-james-hpc-pg0-1:27327:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.216959] [slurm-slehpc15-james-hpc-pg0-1:27321:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217002] [slurm-slehpc15-james-hpc-pg0-1:27312:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217053] [slurm-slehpc15-james-hpc-pg0-1:27340:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217054] [slurm-slehpc15-james-hpc-pg0-1:27329:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217049] [slurm-slehpc15-james-hpc-pg0-1:27322:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217025] [slurm-slehpc15-james-hpc-pg0-1:27336:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217051] [slurm-slehpc15-james-hpc-pg0-1:27324:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217053] [slurm-slehpc15-james-hpc-pg0-1:27338:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217063] [slurm-slehpc15-james-hpc-pg0-1:27331:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217144] [slurm-slehpc15-james-hpc-pg0-1:27331:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217041] [slurm-slehpc15-james-hpc-pg0-1:27337:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217117] [slurm-slehpc15-james-hpc-pg0-1:27337:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217074] [slurm-slehpc15-james-hpc-pg0-1:27325:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217156] [slurm-slehpc15-james-hpc-pg0-1:27335:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217232] [slurm-slehpc15-james-hpc-pg0-1:27335:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217088] [slurm-slehpc15-james-hpc-pg0-1:27303:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217167] [slurm-slehpc15-james-hpc-pg0-1:27303:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217042] [slurm-slehpc15-james-hpc-pg0-1:27342:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217120] [slurm-slehpc15-james-hpc-pg0-1:27342:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217049] [slurm-slehpc15-james-hpc-pg0-1:27339:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217035] [slurm-slehpc15-james-hpc-pg0-1:27332:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217069] [slurm-slehpc15-james-hpc-pg0-1:27323:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217148] [slurm-slehpc15-james-hpc-pg0-1:27323:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217069] [slurm-slehpc15-james-hpc-pg0-1:27330:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217147] [slurm-slehpc15-james-hpc-pg0-1:27330:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217106] [slurm-slehpc15-james-hpc-pg0-1:27320:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217184] [slurm-slehpc15-james-hpc-pg0-1:27320:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217096] [slurm-slehpc15-james-hpc-pg0-1:27326:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217175] [slurm-slehpc15-james-hpc-pg0-1:27326:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217134] [slurm-slehpc15-james-hpc-pg0-1:27307:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217218] [slurm-slehpc15-james-hpc-pg0-1:27307:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217105] [slurm-slehpc15-james-hpc-pg0-1:27334:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217182] [slurm-slehpc15-james-hpc-pg0-1:27334:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217082] [slurm-slehpc15-james-hpc-pg0-1:27333:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217155] [slurm-slehpc15-james-hpc-pg0-1:27333:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26235:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217517] [slurm-slehpc15-james-hpc-pg0-2:26232:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26237:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217518] [slurm-slehpc15-james-hpc-pg0-2:26234:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26234:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217517] [slurm-slehpc15-james-hpc-pg0-2:26231:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26231:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217521] [slurm-slehpc15-james-hpc-pg0-2:26230:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26230:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217513] [slurm-slehpc15-james-hpc-pg0-2:26236:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26236:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217523] [slurm-slehpc15-james-hpc-pg0-2:26240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26235:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217604] [slurm-slehpc15-james-hpc-pg0-2:26232:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217615] [slurm-slehpc15-james-hpc-pg0-2:26237:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217598] [slurm-slehpc15-james-hpc-pg0-2:26239:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217617] [slurm-slehpc15-james-hpc-pg0-2:26238:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217699] [slurm-slehpc15-james-hpc-pg0-2:26238:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217625] [slurm-slehpc15-james-hpc-pg0-2:26229:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217706] [slurm-slehpc15-james-hpc-pg0-2:26229:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217625] [slurm-slehpc15-james-hpc-pg0-2:26248:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217704] [slurm-slehpc15-james-hpc-pg0-2:26248:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217618] [slurm-slehpc15-james-hpc-pg0-2:26246:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217695] [slurm-slehpc15-james-hpc-pg0-2:26246:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217634] [slurm-slehpc15-james-hpc-pg0-2:26244:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217720] [slurm-slehpc15-james-hpc-pg0-2:26244:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217615] [slurm-slehpc15-james-hpc-pg0-2:26247:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217576] [slurm-slehpc15-james-hpc-pg0-2:26242:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217659] [slurm-slehpc15-james-hpc-pg0-2:26242:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217623] [slurm-slehpc15-james-hpc-pg0-2:26245:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217705] [slurm-slehpc15-james-hpc-pg0-2:26245:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217603] [slurm-slehpc15-james-hpc-pg0-2:26243:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217682] [slurm-slehpc15-james-hpc-pg0-2:26243:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217712] [slurm-slehpc15-james-hpc-pg0-2:26239:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217760] [slurm-slehpc15-james-hpc-pg0-2:26247:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217786] [slurm-slehpc15-james-hpc-pg0-2:26264:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217792] [slurm-slehpc15-james-hpc-pg0-2:26265:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217894] [slurm-slehpc15-james-hpc-pg0-2:26265:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26256:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26256:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217806] [slurm-slehpc15-james-hpc-pg0-2:26271:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217908] [slurm-slehpc15-james-hpc-pg0-2:26271:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217820] [slurm-slehpc15-james-hpc-pg0-2:26255:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217917] [slurm-slehpc15-james-hpc-pg0-2:26255:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217721] [slurm-slehpc15-james-hpc-pg0-2:26261:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26261:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217917] [slurm-slehpc15-james-hpc-pg0-2:26253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217826] [slurm-slehpc15-james-hpc-pg0-2:26251:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217925] [slurm-slehpc15-james-hpc-pg0-2:26251:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217720] [slurm-slehpc15-james-hpc-pg0-2:26252:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217810] [slurm-slehpc15-james-hpc-pg0-2:26252:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217798] [slurm-slehpc15-james-hpc-pg0-2:26257:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217893] [slurm-slehpc15-james-hpc-pg0-2:26257:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217750] [slurm-slehpc15-james-hpc-pg0-2:26260:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217833] [slurm-slehpc15-james-hpc-pg0-2:26260:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217725] [slurm-slehpc15-james-hpc-pg0-2:26249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217819] [slurm-slehpc15-james-hpc-pg0-2:26249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26258:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217820] [slurm-slehpc15-james-hpc-pg0-2:26258:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217727] [slurm-slehpc15-james-hpc-pg0-2:26250:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217801] [slurm-slehpc15-james-hpc-pg0-2:26250:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217719] [slurm-slehpc15-james-hpc-pg0-2:26266:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217809] [slurm-slehpc15-james-hpc-pg0-2:26266:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26263:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217918] [slurm-slehpc15-james-hpc-pg0-2:26263:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217776] [slurm-slehpc15-james-hpc-pg0-2:26262:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217885] [slurm-slehpc15-james-hpc-pg0-2:26262:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217817] [slurm-slehpc15-james-hpc-pg0-2:26259:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217914] [slurm-slehpc15-james-hpc-pg0-2:26259:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.218026] [slurm-slehpc15-james-hpc-pg0-2:26270:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217936] [slurm-slehpc15-james-hpc-pg0-2:26241:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218021] [slurm-slehpc15-james-hpc-pg0-2:26241:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217908] [slurm-slehpc15-james-hpc-pg0-2:26264:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217962] [slurm-slehpc15-james-hpc-pg0-2:26268:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218045] [slurm-slehpc15-james-hpc-pg0-2:26268:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217926] [slurm-slehpc15-james-hpc-pg0-2:26272:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218004] [slurm-slehpc15-james-hpc-pg0-2:26272:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217931] [slurm-slehpc15-james-hpc-pg0-2:26267:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218036] [slurm-slehpc15-james-hpc-pg0-2:26267:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217910] [slurm-slehpc15-james-hpc-pg0-2:26254:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.217992] [slurm-slehpc15-james-hpc-pg0-2:26254:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.217952] [slurm-slehpc15-james-hpc-pg0-2:26269:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665438955.218036] [slurm-slehpc15-james-hpc-pg0-2:26269:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665438955.218105] [slurm-slehpc15-james-hpc-pg0-2:26270:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[slurm-slehpc15-james-hpc-pg0-1:27328] *** An error occurred in MPI_Init
[slurm-slehpc15-james-hpc-pg0-1:27328] *** reported by process [1418985473,28]
[slurm-slehpc15-james-hpc-pg0-1:27328] *** on a NULL communicator
[slurm-slehpc15-james-hpc-pg0-1:27328] *** Unknown error
[slurm-slehpc15-james-hpc-pg0-1:27328] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm-slehpc15-james-hpc-pg0-1:27328] ***    and potentially your MPI job)
[slurm-slehpc15-james-hpc-pg0-1:27285] 87 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[slurm-slehpc15-james-hpc-pg0-1:27285] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[slurm-slehpc15-james-hpc-pg0-1:27285] 87 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[slurm-slehpc15-james-hpc-pg0-1:27285] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
[slurm-slehpc15-james-hpc-pg0-1:27285] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

@jamesongithub
Author

Preferably, instead of disabling shared memory, we would adjust the system configuration, since disabling UCX completely does give us a successful run.

Are these limits reasonable?

ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509481983
max total shared memory (kbytes) = 4611686018427386880
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

@hoopoepg
Contributor

Hi,
I don't see any issues in the ipcs -l output; we test UCX on a similar configuration and it works fine.
As far as I can see from the logs, UCX was able to start up, but some peers selected pml cm instead of ucx.
Can you add -mca pml ucx to the command line to force the use of UCX?

thank you
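
For anyone wanting to confirm the same mismatch in their own output, a quick way is to search for Open MPI's "selected pml ..., but peer ... selected pml ..." lines. This is only a minimal sketch: a temporary file stands in for the saved job output (an assumption; point `log` at wherever your mpirun output actually went), and the two sample lines are copied from the log posted earlier in this thread.

```shell
# Stand-in for the saved job output (assumption: replace with your real log path).
log=$(mktemp)
cat > "$log" <<'EOF'
[slurm-slehpc15-james-hpc-pg0-1:27328] [[21652,1],28] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
[slurm-slehpc15-james-hpc-pg0-2:26233] [[21652,1],48] selected pml cm, but peer [[21652,1],0] on slurm-slehpc15-james-hpc-pg0-1 selected pml ucx
EOF

# Count the lines in which a rank reports a PML different from its peer's.
mismatches=$(grep -c 'selected pml .*, but peer .* selected pml ' "$log")
echo "ranks reporting a PML mismatch: $mismatches"

# Print "host:pid pml" for each complaining rank, i.e. the PML it picked itself.
sed -n 's/^\[\([^]]*\)\] .* selected pml \([a-z0-9]*\), but peer .*/\1 \2/p' "$log"

rm -f "$log"
```

If any lines show up, the ranks did not all agree on one PML, which is the case where forcing it with -mca pml ucx is worth trying.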

@jamesongithub
Author

Hey, with -mca pml ucx I was able to get a successful run. Here is some of the output:

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            slurm-slehpc15-james-hpc-pg0-2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4120

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   slurm-slehpc15-james-hpc-pg0-2
  Local device: mlx5_0
--------------------------------------------------------------------------
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23280] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23280] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23280] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23280] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23280] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-2:23296] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-2:23296] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-2:23296] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23296] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23296] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23296] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23296] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23233] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23233] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23233] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23233] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23233] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23233] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23233] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23234] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23234] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23234] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23234] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23234] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23234] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23222] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23222] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23222] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23222] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23222] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23222] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23222] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23240] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23240] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23240] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23240] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23240] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23240] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23240] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23226] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23226] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23226] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23226] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23226] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-1:23226] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23226] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23226] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23226] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23226] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
...
[1665518071.720969] [slurm-slehpc15-james-hpc-pg0-2:23292:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-1:23215] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-1:23215] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-1:23215] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23215] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23276] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23276] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23276] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23276] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-2:23276] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23237] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23237] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component openib returned failure
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component openib closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component openib
[slurm-slehpc15-james-hpc-pg0-2:23272] select: initializing btl component usnic
[slurm-slehpc15-james-hpc-pg0-2:23272] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component usnic returned failure
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component usnic closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component usnic
[slurm-slehpc15-james-hpc-pg0-2:23272] select: initializing btl component vader
[slurm-slehpc15-james-hpc-pg0-1:23237] Checking distance from this process to device=mlx5_0
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->nbobjs=2
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->values[0]=10
[slurm-slehpc15-james-hpc-pg0-1:23237] hwloc_distances->values[1]=20
[slurm-slehpc15-james-hpc-pg0-1:23237] Process is bound: distance to device is 0.000000
[slurm-slehpc15-james-hpc-pg0-2:23272] select: init of component vader returned success
[slurm-slehpc15-james-hpc-pg0-1:23243] select: init of component ofi returned success
[slurm-slehpc15-james-hpc-pg0-1:23243] select: initializing btl component openib
[slurm-slehpc15-james-hpc-pg0-1:23243] Checking distance from this process to device=mlx5_0
...
[1665518072.196751] [slurm-slehpc15-james-hpc-pg0-1:23230:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197006] [slurm-slehpc15-james-hpc-pg0-1:23221:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197333] [slurm-slehpc15-james-hpc-pg0-1:23228:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.197729] [slurm-slehpc15-james-hpc-pg0-1:23244:0]          parser.c:1895 UCX  INFO  UCX_* env variables: UCX_TLS=^sm UCX_POSIX_USE_PROC_LINK=n UCX_LOG_LEVEL=info
[1665518072.199795] [slurm-slehpc15-james-hpc-pg0-1:23216:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199775] [slurm-slehpc15-james-hpc-pg0-1:23249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199779] [slurm-slehpc15-james-hpc-pg0-1:23253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199779] [slurm-slehpc15-james-hpc-pg0-1:23240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199859] [slurm-slehpc15-james-hpc-pg0-1:23255:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199851] [slurm-slehpc15-james-hpc-pg0-1:23217:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199851] [slurm-slehpc15-james-hpc-pg0-1:23212:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199878] [slurm-slehpc15-james-hpc-pg0-1:23229:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199854] [slurm-slehpc15-james-hpc-pg0-1:23236:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199875] [slurm-slehpc15-james-hpc-pg0-1:23225:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199865] [slurm-slehpc15-james-hpc-pg0-1:23249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
[1665518072.199829] [slurm-slehpc15-james-hpc-pg0-1:23247:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1); 
[1665518072.199892] [slurm-slehpc15-james-hpc-pg0-1:23253:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[1]: tag(dc_mlx5/mlx5_0:1); 
...
[1665523321.395622] [slurm-slehpc15-james-hpc-pg0-2:23284:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.397063] [slurm-slehpc15-james-hpc-pg0-1:23252:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.398176] [slurm-slehpc15-james-hpc-pg0-1:23220:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.400071] [slurm-slehpc15-james-hpc-pg0-2:23254:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.493806] [slurm-slehpc15-james-hpc-pg0-1:23227:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494078] [slurm-slehpc15-james-hpc-pg0-1:23237:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494303] [slurm-slehpc15-james-hpc-pg0-1:23224:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494504] [slurm-slehpc15-james-hpc-pg0-1:23243:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494491] [slurm-slehpc15-james-hpc-pg0-1:23249:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.494582] [slurm-slehpc15-james-hpc-pg0-1:23238:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.495015] [slurm-slehpc15-james-hpc-pg0-1:23240:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.495049] [slurm-slehpc15-james-hpc-pg0-1:23250:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.497471] [slurm-slehpc15-james-hpc-pg0-1:23225:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.497906] [slurm-slehpc15-james-hpc-pg0-1:23226:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
[1665523321.500558] [slurm-slehpc15-james-hpc-pg0-1:23221:0]      ucp_worker.c:1777 UCX  INFO    ep_cfg[2]: tag(dc_mlx5/mlx5_0:1); 
...
[slurm-slehpc15-james-hpc-pg0-2:23266] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23266] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23234] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23216] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23292] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-1:23215] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23272] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: unloading component ofi
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: component vader closed
[slurm-slehpc15-james-hpc-pg0-2:23289] mca: base: close: unloading component vader
[slurm-slehpc15-james-hpc-pg0-1:23225] mca: base: close: component ofi closed
[slurm-slehpc15-james-hpc-pg0-1:23225] mca: base: close: unloading component ofi
...

@jamesongithub
Author

Glad we were able to get a successful run, but I would like to know how to get it working with the default parameters.

Does this last result give us an idea of what should be changed to work with the defaults?
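
One way to read the last result is to tally which transports each UCX endpoint configuration ended up with. The following is only a rough sketch: it assumes the job output was saved to a file (`run.log` is a hypothetical name) and that the `ep_cfg[...]` lines have the same shape as in the log above. The helper name `summarize_ep_cfg` is made up for illustration.

```shell
#!/bin/sh
# Count how often each "ep_cfg[N]: tag(...)" combination appears in a saved log.
# Assumption: the log contains UCX INFO lines such as
#   ... ucp_worker.c:1777 UCX  INFO    ep_cfg[0]: tag(self/memory0 dc_mlx5/mlx5_0:1);
summarize_ep_cfg() {
    # -o prints only the matched part, so hostnames, PIDs and timestamps are dropped
    grep -o 'ep_cfg\[[0-9]*]: tag([^)]*)' "$1" | sort | uniq -c
}

# Example (hypothetical file name):
#   summarize_ep_cfg run.log
```

For the excerpt above, this would list `self/memory0 dc_mlx5/mlx5_0:1` for `ep_cfg[0]` and `dc_mlx5/mlx5_0:1` for `ep_cfg[1]` and `ep_cfg[2]`, i.e. no shared-memory transport shows up in any configuration when `UCX_TLS=^sm` is set.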
