incorrect/inconsistent core pinning with OpenMPI #179

Open
boegel opened this issue Nov 20, 2020 · 11 comments

@boegel
Member

boegel commented Nov 20, 2020

2x18-core Intel Xeon Gold 6140 (skitty, Skylake):

foss/2019b (OpenMPI 3.1.4)

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
...
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
$ mympirun --hybrid 4 --mpirunoptions="--bind-to core" grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0
Cpus_allowed_list:      4
Cpus_allowed_list:      8
Cpus_allowed_list:      12

foss/2020a (OpenMPI 4.0.3)

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      1
Cpus_allowed_list:      3
Cpus_allowed_list:      2
Cpus_allowed_list:      0
...
Cpus_allowed_list:      34
Cpus_allowed_list:      35
Cpus_allowed_list:      33
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0
Cpus_allowed_list:      4
Cpus_allowed_list:      8
Cpus_allowed_list:      12
$ mympirun --hybrid 4 --mpirunoptions="--bind-to core" grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0
Cpus_allowed_list:      4
Cpus_allowed_list:      8
Cpus_allowed_list:      12

intel/2019b or intel/2020a

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      6
Cpus_allowed_list:      2
Cpus_allowed_list:      14
Cpus_allowed_list:      10
Cpus_allowed_list:      26
...
Cpus_allowed_list:      31
Cpus_allowed_list:      12
Cpus_allowed_list:      20
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35

2x48-core AMD EPYC 7552 (doduo, Zen2):

foss/2019b (OpenMPI 3.1.4) + foss/2020a (OpenMPI 4.0.3)

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
Cpus_allowed_list:	16-19
Cpus_allowed_list:	20-23
Cpus_allowed_list:	24-27
...
Cpus_allowed_list:	88-91
Cpus_allowed_list:	92-95
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
Cpus_allowed_list:	16-19
...
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
Cpus_allowed_list:	16-19
Cpus_allowed_list:	20-23
...
Cpus_allowed_list:	88-91
Cpus_allowed_list:	92-95
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
$ mympirun --hybrid 4 --mpirunoptions="--bind-to core" grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	0
Cpus_allowed_list:	1
Cpus_allowed_list:	2
Cpus_allowed_list:	3

intel/2019b or intel/2020a

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	78
Cpus_allowed_list:	17
Cpus_allowed_list:	9
Cpus_allowed_list:	27
...
Cpus_allowed_list:	74
Cpus_allowed_list:	95
Cpus_allowed_list:	38
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	72-95
Cpus_allowed_list:	48-71
Cpus_allowed_list:	0-19,24-27
Cpus_allowed_list:	20-23,28-47
@stdweird
Member

@boegel I read the manpage again (and then a few more times), and also found https://stackoverflow.com/questions/28216897/syntax-of-the-map-by-option-in-openmpi-mpirun-v1-8
I think we should probably start everything with --map-by ppr:<hybrid>:node:pe=<ppn/hybrid> (assuming the default value of hybrid is ppn). This should pin as well.
Not sure if we should then also add a --rank-by option; the default slot probably means what I think it means.
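
Concretely, on a 2x18-core skitty node with 4 ranks per node and 9 cores per rank, that suggestion would be something like the following (untested sketch; the exact spelling of the pe= modifier may differ between OpenMPI versions, and --report-bindings is only there to verify the result):

$ mpirun --map-by ppr:4:node:pe=9 --report-bindings grep 'Cpus_allowed_list' /proc/self/status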

@hajgato
Contributor

hajgato commented Mar 11, 2021

It seems that foss/2020a is also wrong (maybe on multiple sockets?):

testm.pbs:
ml foss/2020a vsc-mympirun
grep 'Cpus_allowed_list' /proc/self/status
mpirun grep 'Cpus_allowed_list' /proc/self/status | sort -u
mympirun grep 'Cpus_allowed_list' /proc/self/status | sort -u

On Swalot:
4CPU:

Cpus_allowed_list:      1,14,16,18
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2608.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2608.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

3CPU:

Cpus_allowed_list:      14,16,18
Cpus_allowed_list:      14
Cpus_allowed_list:      16
Cpus_allowed_list:      18
Cpus_allowed_list:      14
Cpus_allowed_list:      16
Cpus_allowed_list:      18

9CPU:

Cpus_allowed_list:      1,3,5,7-9,11,16,18
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2606.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2606.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

On Skitty:
9CPUS:

Cpus_allowed_list:  2-3,7,15,19-20,25,29,33

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node3106.skitty.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node3106.skitty.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

8CPUS:

Cpus_allowed_list:  2,6,10,14,18,22,26,30
Cpus_allowed_list:  10
Cpus_allowed_list:  14
Cpus_allowed_list:  18
Cpus_allowed_list:  2
Cpus_allowed_list:  22
Cpus_allowed_list:  26
Cpus_allowed_list:  30
Cpus_allowed_list:  6
Cpus_allowed_list:  10
Cpus_allowed_list:  14
Cpus_allowed_list:  18
Cpus_allowed_list:  2
Cpus_allowed_list:  22
Cpus_allowed_list:  26
Cpus_allowed_list:  30
Cpus_allowed_list:  6

With foss/2020b these cases work:
On Skitty:
9CPUS:

Cpus_allowed_list:  2-3,7,15,19-20,25,29,33
Cpus_allowed_list:  2
Cpus_allowed_list:  20
Cpus_allowed_list:  25,29,33
Cpus_allowed_list:  3,7,15,19
Cpus_allowed_list:  2
Cpus_allowed_list:  20
Cpus_allowed_list:  25,29,33
Cpus_allowed_list:  3,7,15,19

8CPUS:

Cpus_allowed_list:  2,6,10,14,18,22,26,30
Cpus_allowed_list:  2,6,10,14,18,22,26,30
Cpus_allowed_list:  2,6,10,14,18,22,26,30

@boegel
Member Author

boegel commented Aug 12, 2021

After spending quite a bit of time on this issue this week, here's what I've figured out so far when using OpenMPI (foss toolchain).

The problem is basically two-fold with the current version of mympirun (5.2.6):

Core pinning

By default, OpenMPI does --bind-to numa when there are more than 2 MPI processes (for <= 2, it defaults to --bind-to core). The mpirun manpage claims the default is bind-to socket with > 2 ranks, but that seems incorrect from what I see. https://www-lb.open-mpi.org/papers/sc-2016/Open-MPI-SC16-BOF.pdf (slide 100) confirms the default is actually bind-to numa. This difference matters (e.g. on skitty, a NUMA domain is 9 cores vs 18 cores per socket, but on doduo it's 4 cores per NUMA domain vs 48 cores per socket...).
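
The effective binding is easy to verify straight from OpenMPI, since --report-bindings prints one line per rank with the actual binding, e.g.:

$ mympirun --mpirunoptions="--report-bindings" grep 'Cpus_allowed_list' /proc/self/status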

This default isn't very good when --hybrid is involved for actual hybrid (MPI+OpenMP) workloads, especially not on AMD Rome where a NUMA domain is just 4 cores, and for low values of --hybrid, since then a lot of cores will not be used.

Changes we should make here:

  • Explicitly use --bind-to core when --hybrid is not used (although the impact of --bind-to numa vs --bind-to core is probably very small).
  • Use an explicit --bind-to xxx when --hybrid is used, to ensure that core binding makes sense for MPI+OpenMP workloads. This probably means something more adaptive (see the sketch below this list): --bind-to node (or no binding in MPI, leaving it up to OpenMP) for --hybrid 1, --bind-to socket for --hybrid 2, etc. Ideally we also have a way to check how many cores there are in a single NUMA domain.
  • Add support to mympirun for a --pin-to option so you can control how core pinning is done, since what's ideal heavily depends on the application. For example, for OpenFOAM (which is MPI-only and memory-bandwidth sensitive), you always want --bind-to core; there's no point in binding a rank to multiple cores if there's no threading at all...
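
A minimal sketch of that adaptive --bind-to choice, assuming 2 sockets per node (the helper below is purely illustrative; the real logic would live inside mympirun itself):

# pick_bindto.sh (hypothetical helper): suggest an Open MPI --bind-to value
# for a given number of MPI ranks per node (the --hybrid value)
hybrid=$1
case "$hybrid" in
  1) echo "--bind-to none"   ;;  # one rank per node: no MPI-level binding, leave pinning to OpenMP
  2) echo "--bind-to socket" ;;  # one rank per socket
  *) echo "--bind-to core"   ;;  # many ranks: pin each rank to (a block of) cores
esac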

Process placement

(a.k.a. "mapping" in OpenMPI terms)

This is the biggest problem currently... When --hybrid is used with OpenMPI, mympirun does --map-by ppr:H:node (since #175), where H is the value passed to --hybrid.
ppr is short for "processes per resource", and node is the resource here.

That looks OK, but the core assignment (mapping) is done sequentially by NUMA domain per node, so ranks are not spread properly across sockets within the same node...

This gets worse in combination with the --bind-to numa default on AMD Rome systems, since then you're also pinning the ranks to the NUMA domains (a small number of cores on AMD Rome, just 4).
With --hybrid 12 on AMD Rome, all ranks are pinned to a NUMA domain on 1 socket, so half of the node is unused.
With --hybrid 4 on AMD Rome, you're stuck to 16 cores on a single socket...

So we definitely shouldn't blindly use map-by ppr:H:node and assume all is well...

Doing this properly probably requires an adaptive strategy again, based on the value for --hybrid, socket count (which we can hardcode to 2, I guess), and the size of a NUMA domain.
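
As one concrete illustration of such a strategy (untested sketch; it assumes the PE= modifier can be combined with ppr on these OpenMPI versions), --hybrid 4 on skitty could map 2 ranks per socket with 9 cores each, instead of letting ppr:4:node fill up NUMA domains sequentially:

$ mpirun --map-by ppr:2:socket:PE=9 --report-bindings grep 'Cpus_allowed_list' /proc/self/status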

With the intel toolchain, the problems are a lot less severe (or things are even fine, really); it seems like Intel MPI is pinning things more intelligently (but I haven't experimented much there yet).
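
For comparison, Intel MPI's pinning can be inspected and steered via its standard environment variables (I haven't checked which of these mympirun already sets):

$ I_MPI_DEBUG=4 mympirun grep 'Cpus_allowed_list' /proc/self/status                      # I_MPI_DEBUG >= 4 prints the pin map at startup
$ I_MPI_PIN_DOMAIN=omp mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status    # one pinning domain per rank, sized by OMP_NUM_THREADS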

@stdweird
Member

@boegel be aware that bind to numa on EPYC doesn't need to mean what you think it means. On doduo, a NUMA domain actually corresponds to a 4-core CCX sharing an L3 cache, and that makes much more sense than having part of the socket as a NUMA domain. I'm afraid we will need some more default options so people can choose the one that makes the most sense. (To figure that out, users need to make a communication map, like the IPM communication topology map, to see how the ranks communicate, so they can decide on the best placement.)
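
To check how the NUMA domains are actually laid out on a given node (e.g. whether a doduo NUMA domain is really a 4-core CCX), the standard tools suffice, assuming numactl is installed on the node:

$ lscpu | grep -i numa
$ numactl --hardware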

@stdweird
Member

With PMI, you can use Slurm's control, cf. https://slurm.schedmd.com/mc_support.html
srun has some --hint= options that might make sense to implement for mympirun as well.
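
For example, something along these lines with srun directly (both options are documented on the Slurm page above; whether and how mympirun should expose them is the open question):

$ srun --hint=nomultithread --cpu-bind=cores grep 'Cpus_allowed_list' /proc/self/status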

@stdweird
Member

stdweird commented Nov 9, 2021

@boegel what we really need now is a way to get sequential mapping, where each rank is pinned "next to" the previously pinned rank. This pinning is the default for Intel MPI (at least when used with mympirun).

@boegel
Member Author

boegel commented Nov 9, 2021

@stdweird I'm guessing that needs to be under control of a specific mympirun option (maybe it already is for Intel MPI, I didn't check)?

For Open MPI, it probably boils down to --map-by core --bind-to core, if I recall correctly.

That shouldn't be the default I guess, at least not when --hybrid is also used, since then you definitely want to "spread" the ranks across the available resources, not pin yourself to a small subset of available cores?
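
For reference, that sequential variant should be something like this with the current mympirun (untested sketch, reusing the --mpirunoptions pass-through from earlier in this thread):

$ mympirun --mpirunoptions="--map-by core --bind-to core" grep 'Cpus_allowed_list' /proc/self/status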

@stdweird
Member

stdweird commented Nov 9, 2021

At least the behaviour between Intel MPI and OpenMPI should be the same.

@stdweird
Member

@boegel even for hybrid, you want sequential blocks of MPI ranks per e.g. NUMA domain. What OpenMPI does so differently is the round-robin placement of the ranks.

placement should be:

split in sequential blocks per node
per node, subdivide in sequential blocks according to the number of NUMA domains and ranks per NUMA domain
pin each rank to a core or set of cores (also sequential wrt topology)
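
One possible way to express that with plain OpenMPI options (untested sketch; it assumes ppr per NUMA domain plus the PE= modifier behaves as documented, and the rank order across domains may still need a --rank-by setting): one rank per 4-core NUMA domain on doduo, each bound to the 4 sequential cores of its domain:

$ mpirun --map-by ppr:1:numa:PE=4 --report-bindings grep 'Cpus_allowed_list' /proc/self/status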

@boegel
Member Author

boegel commented Nov 23, 2021

@stdweird I'm happy to get back to this for hybrid (along with checking several scenarios like --hybrid N-S, --hybrid N/2, --hybrid S, --hybrid 1, etc., with N the number of cores and S the number of sockets), but let's get #184 merged first, since non-hybrid is way more common (and the impact on OpenFOAM performance is big on AMD systems).
