incorrect/inconsistent core pinning with OpenMPI #179

Open
boegel opened this issue Nov 20, 2020 · 11 comments

@boegel
Member

boegel commented Nov 20, 2020

2x18-core Intel Xeon Gold 6140 (skitty, Skylake):

foss/2019b (OpenMPI 3.1.4)

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
...
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35
$ mympirun --hybrid 4 --mpirunoptions="--bind-to core" grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0
Cpus_allowed_list:      4
Cpus_allowed_list:      8
Cpus_allowed_list:      12

foss/2020a (OpenMPI 4.0.3)

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      1
Cpus_allowed_list:      3
Cpus_allowed_list:      2
Cpus_allowed_list:      0
...
Cpus_allowed_list:      34
Cpus_allowed_list:      35
Cpus_allowed_list:      33
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0
Cpus_allowed_list:      4
Cpus_allowed_list:      8
Cpus_allowed_list:      12
$ mympirun --hybrid 4 --mpirunoptions="--bind-to core" grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      0
Cpus_allowed_list:      4
Cpus_allowed_list:      8
Cpus_allowed_list:      12

intel/2019b or intel/2020a

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      6
Cpus_allowed_list:      2
Cpus_allowed_list:      14
Cpus_allowed_list:      10
Cpus_allowed_list:      26
...
Cpus_allowed_list:      31
Cpus_allowed_list:      12
Cpus_allowed_list:      20
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:      2,6,10,14,18,22,26,30,34
Cpus_allowed_list:      0,4,8,12,16,20,24,28,32
Cpus_allowed_list:      1,5,9,13,17,21,25,29,33
Cpus_allowed_list:      3,7,11,15,19,23,27,31,35

2x48-core AMD EPYC 7552 (doduo, Zen2):

foss/2019b (OpenMPI 3.1.4) + foss/2020a (OpenMPI 4.0.3)

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
Cpus_allowed_list:	16-19
Cpus_allowed_list:	20-23
Cpus_allowed_list:	24-27
...
Cpus_allowed_list:	88-91
Cpus_allowed_list:	92-95
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
Cpus_allowed_list:	16-19
...
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
Cpus_allowed_list:	16-19
Cpus_allowed_list:	20-23
...
Cpus_allowed_list:	88-91
Cpus_allowed_list:	92-95
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	0-3
Cpus_allowed_list:	4-7
Cpus_allowed_list:	8-11
Cpus_allowed_list:	12-15
$ mympirun --hybrid 4 --mpirunoptions="--bind-to core" grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	0
Cpus_allowed_list:	1
Cpus_allowed_list:	2
Cpus_allowed_list:	3

intel/2019b or intel/2020a

$ mympirun grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	78
Cpus_allowed_list:	17
Cpus_allowed_list:	9
Cpus_allowed_list:	27
...
Cpus_allowed_list:	74
Cpus_allowed_list:	95
Cpus_allowed_list:	38
$ mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status
Cpus_allowed_list:	72-95
Cpus_allowed_list:	48-71
Cpus_allowed_list:	0-19,24-27
Cpus_allowed_list:	20-23,28-47
@stdweird
Member

@boegel I read the manpage again (and then a few more times), and also found https://stackoverflow.com/questions/28216897/syntax-of-the-map-by-option-in-openmpi-mpirun-v1-8
I think we should probably start everything with --map-by ppr:<hybrid>:node:pe=<ppn/hybrid> (assuming the default value of hybrid is ppn). This should pin as well.
Not sure if we should then also add a --rank-by option; the default slot probably means what I think it means.
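
Concretely, on a 2x18-core skitty node with 4 ranks per node and 9 cores per rank, that suggestion would be something like the following (untested sketch; the exact spelling of the pe= modifier may differ between OpenMPI versions, and --report-bindings is only there to verify the result):

$ mpirun --map-by ppr:4:node:pe=9 --report-bindings grep 'Cpus_allowed_list' /proc/self/status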

@hajgato
Contributor

hajgato commented Mar 11, 2021

It seems that foss/2020a is also wrong (maybe on multiple sockets?):

testm.pbs:
ml foss/2020a vsc-mympirun
grep 'Cpus_allowed_list' /proc/self/status
mpirun grep 'Cpus_allowed_list' /proc/self/status | sort -u
mympirun grep 'Cpus_allowed_list' /proc/self/status | sort -u

On Swalot:
4CPU:

Cpus_allowed_list:      1,14,16,18
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2608.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2608.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

3CPU:

Cpus_allowed_list:      14,16,18
Cpus_allowed_list:      14
Cpus_allowed_list:      16
Cpus_allowed_list:      18
Cpus_allowed_list:      14
Cpus_allowed_list:      16
Cpus_allowed_list:      18

9CPU:

Cpus_allowed_list:      1,3,5,7-9,11,16,18
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2606.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node2606.swalot.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

On Skitty:
9CPUS:

Cpus_allowed_list:  2-3,7,15,19-20,25,29,33

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node3106.skitty.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node3106.skitty.os
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

8CPUS:

Cpus_allowed_list:  2,6,10,14,18,22,26,30
Cpus_allowed_list:  10
Cpus_allowed_list:  14
Cpus_allowed_list:  18
Cpus_allowed_list:  2
Cpus_allowed_list:  22
Cpus_allowed_list:  26
Cpus_allowed_list:  30
Cpus_allowed_list:  6
Cpus_allowed_list:  10
Cpus_allowed_list:  14
Cpus_allowed_list:  18
Cpus_allowed_list:  2
Cpus_allowed_list:  22
Cpus_allowed_list:  26
Cpus_allowed_list:  30
Cpus_allowed_list:  6

With foss/2020b these cases work:
On Skitty:
9CPUS:

Cpus_allowed_list:  2-3,7,15,19-20,25,29,33
Cpus_allowed_list:  2
Cpus_allowed_list:  20
Cpus_allowed_list:  25,29,33
Cpus_allowed_list:  3,7,15,19
Cpus_allowed_list:  2
Cpus_allowed_list:  20
Cpus_allowed_list:  25,29,33
Cpus_allowed_list:  3,7,15,19

8CPUS:

Cpus_allowed_list:  2,6,10,14,18,22,26,30
Cpus_allowed_list:  2,6,10,14,18,22,26,30
Cpus_allowed_list:  2,6,10,14,18,22,26,30

@boegel
Member Author

boegel commented Aug 12, 2021

After spending quite a bit of time on this issue this week, here's what I've figured out so far when using OpenMPI (foss toolchain).

The problem is basically two-fold with the current version of mympirun (5.2.6):

Core pinning

By default, OpenMPI does --bind-to numa when there are more than 2 MPI processes (for <= 2, it defaults to --bind-to core). The mpirun manpage claims the default is bind-to socket with > 2 ranks, but that seems incorrect from what I see. https://www-lb.open-mpi.org/papers/sc-2016/Open-MPI-SC16-BOF.pdf (slide 100) confirms the default is actually bind-to numa. This difference matters (e.g. on skitty, a NUMA domain is 9 cores vs 18 cores per socket, but on doduo it's 4 cores per NUMA domain vs 48 cores per socket...).
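
The effective binding is easy to verify straight from OpenMPI, since --report-bindings prints one line per rank with the actual binding, e.g.:

$ mympirun --mpirunoptions="--report-bindings" grep 'Cpus_allowed_list' /proc/self/status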

This default isn't very good when --hybrid is involved for actual hybrid (MPI+OpenMP) workloads, especially not on AMD Rome where a NUMA domain is just 4 cores, and for low values of --hybrid, since then a lot of cores will not be used.

Changes we should make here:

  • Explicitly use --bind-to core when --hybrid is not used (although the impact of --bind-to numa vs --bind-to core is probably very small).
  • Use an explicit --bind-to xxx when --hybrid is used, to ensure that core binding makes sense for MPI+OpenMP workloads. This probably means something more adaptive (see the sketch below this list): --bind-to node (or no binding in MPI, leaving it up to OpenMP) for --hybrid 1, --bind-to socket for --hybrid 2, etc. Ideally we also have a way to check how many cores there are in a single NUMA domain.
  • Add support to mympirun for a --pin-to option so you can control how core pinning is done, since what's ideal heavily depends on the application. For example, for OpenFOAM (which is MPI-only and memory-bandwidth sensitive), you always want --bind-to core; there's no point in binding a rank to multiple cores if there's no threading at all...
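
A minimal sketch of that adaptive --bind-to choice, assuming 2 sockets per node (the helper below is purely illustrative; the real logic would live inside mympirun itself):

# pick_bindto.sh (hypothetical helper): suggest an Open MPI --bind-to value
# for a given number of MPI ranks per node (the --hybrid value)
hybrid=$1
case "$hybrid" in
  1) echo "--bind-to none"   ;;  # one rank per node: no MPI-level binding, leave pinning to OpenMP
  2) echo "--bind-to socket" ;;  # one rank per socket
  *) echo "--bind-to core"   ;;  # many ranks: pin each rank to (a block of) cores
esac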

Process placement

(a.k.a. "mapping" in OpenMPI terms)

This is the biggest problem currently... When --hybrid is used with OpenMPI, mympirun does --map-by ppr:H:node (since #175), where H is the value passed to --hybrid.
ppr is short for "processes per resource", and node is the resource here.

That looks OK, but the core assignment (mapping) is done sequentially by NUMA domain per node, so ranks are not spread properly across sockets within the same node...

This gets worse in combination with the --bind-to numa default on AMD Rome systems, since then you're also pinning the ranks to the NUMA domains (a small number of cores on AMD Rome, just 4).
With --hybrid 12 on AMD Rome, all ranks are pinned to a NUMA domain on 1 socket, so half of the node is unused.
With --hybrid 4 on AMD Rome, you're stuck to 16 cores on a single socket...

So we definitely shouldn't blindly use map-by ppr:H:node and assume all is well...

Doing this properly probably requires an adaptive strategy again, based on the value for --hybrid, socket count (which we can hardcode to 2, I guess), and the size of a NUMA domain.
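
As one concrete illustration of such a strategy (untested sketch; it assumes the PE= modifier can be combined with ppr on these OpenMPI versions), --hybrid 4 on skitty could map 2 ranks per socket with 9 cores each, instead of letting ppr:4:node fill up NUMA domains sequentially:

$ mpirun --map-by ppr:2:socket:PE=9 --report-bindings grep 'Cpus_allowed_list' /proc/self/status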

With the intel toolchain, the problems are a lot less severe (or things are even fine, really); it seems like Intel MPI is pinning things more intelligently (but I haven't experimented much there yet).
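
For comparison, Intel MPI's pinning can be inspected and steered via its standard environment variables (I haven't checked which of these mympirun already sets):

$ I_MPI_DEBUG=4 mympirun grep 'Cpus_allowed_list' /proc/self/status                      # I_MPI_DEBUG >= 4 prints the pin map at startup
$ I_MPI_PIN_DOMAIN=omp mympirun --hybrid 4 grep 'Cpus_allowed_list' /proc/self/status    # one pinning domain per rank, sized by OMP_NUM_THREADS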

@stdweird
Member

@boegel be aware that bind to numa on EPYC doesn't need to mean what you think it means. On doduo, a NUMA domain actually corresponds to a 4-core CCX sharing an L3 cache, and that makes much more sense than having part of the socket as a NUMA domain. I'm afraid we will need some more default options so people can choose the one that makes the most sense. (To figure that out, users need to make a communication map, like the IPM communication topology map, to see how the ranks communicate, so they can decide on the best placement.)
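
To check how the NUMA domains are actually laid out on a given node (e.g. whether a doduo NUMA domain is really a 4-core CCX), the standard tools suffice, assuming numactl is installed on the node:

$ lscpu | grep -i numa
$ numactl --hardware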

@stdweird
Member

With PMI, you can use Slurm's control, cf. https://slurm.schedmd.com/mc_support.html
srun has some --hint= options that might make sense to implement for mympirun as well.
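
For example, something along these lines with srun directly (both options are documented on the Slurm page above; whether and how mympirun should expose them is the open question):

$ srun --hint=nomultithread --cpu-bind=cores grep 'Cpus_allowed_list' /proc/self/status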

@stdweird
Member

stdweird commented Nov 9, 2021

@boegel what we really need now is a way to get sequential mapping, where each rank is pinned "next to" the previously pinned rank. This pinning is the default for Intel MPI (at least when used with mympirun).

@boegel
Member Author

boegel commented Nov 9, 2021

@stdweird I'm guessing that needs to be under control of a specific mympirun option (maybe it already is for Intel MPI, I didn't check)?

For Open MPI, it probably boils down to --map-by core --bind-to core, if I recall correctly.

That shouldn't be the default I guess, at least not when --hybrid is also used, since then you definitely want to "spread" the ranks across the available resources, not pin yourself to a small subset of available cores?
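
For reference, that sequential variant should be something like this with the current mympirun (untested sketch, reusing the --mpirunoptions pass-through from earlier in this thread):

$ mympirun --mpirunoptions="--map-by core --bind-to core" grep 'Cpus_allowed_list' /proc/self/status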

@stdweird
Member

stdweird commented Nov 9, 2021

At least the behaviour between Intel MPI and OpenMPI should be the same.

@stdweird
Member

@boegel even for hybrid, you want sequential blocks of MPI ranks per e.g. NUMA domain. What OpenMPI does so differently is the round-robin placement of the ranks.

placement should be:

split in sequential blocks per node
per node, subdivide in sequential blocks according to the number of NUMA domains and ranks per NUMA domain
pin each rank to a core or set of cores (also sequential wrt topology)
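
One possible way to express that with plain OpenMPI options (untested sketch; it assumes ppr per NUMA domain plus the PE= modifier behaves as documented, and the rank order across domains may still need a --rank-by setting): one rank per 4-core NUMA domain on doduo, each bound to the 4 sequential cores of its domain:

$ mpirun --map-by ppr:1:numa:PE=4 --report-bindings grep 'Cpus_allowed_list' /proc/self/status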

@boegel
Member Author

boegel commented Nov 23, 2021

@stdweird I'm happy to get back to this for hybrid (along with checking several scenarios like --hybrid N-S, --hybrid N/2, --hybrid S, --hybrid 1, etc., with N the number of cores and S the number of sockets), but let's get #184 merged first, since non-hybrid is way more common (and the impact on OpenFOAM performance is big on AMD systems).
