Segfault for large-scale GMM CPU runs #74

Open
bentsherman opened this issue Jan 31, 2019 · 2 comments

@bentsherman
Member

When processing the Rice dataset, I get this error at both 256 and 512 MPI ranks:

Testing with P = 256...
[node1629:25546] *** Process received signal ***
[node1629:25546] Signal: Segmentation fault (11)
[node1629:25546] Signal code: Invalid permissions (2)
[node1629:25546] Failing at address: 0x2b4f2fdff9a4
[node1629:25546] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b4f0eef25d0]
[node1629:25546] [ 1] kinc[0x43b412]
[node1629:25546] [ 2] kinc[0x43bed3]
[node1629:25546] [ 3] kinc[0x43ace5]
[node1629:25546] [ 4] kinc[0x439a97]
[node1629:25546] [ 5] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace8Analytic9SerialRun7addWorkEOSt10unique_ptrIN17EAbstractAnalytic5BlockESt14default_deleteIS4_EE+0x22)[0x2b4f0c49bcb2]
[node1629:25546] [ 6] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace8Analytic8MPISlave7processERK10QByteArray+0x49)[0x2b4f0c4a01c9]
[node1629:25546] [ 7] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace8Analytic8MPISlave12dataReceivedERK10QByteArrayi+0x2b)[0x2b4f0c4a038b]
[node1629:25546] [ 8] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN11QMetaObject8activateEP7QObjectiiPPv+0x973)[0x2b4f0e9d8e23]
[node1629:25546] [ 9] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace4QMPI12dataReceivedERK10QByteArrayi+0x33)[0x2b4f0c4ab6a3]
[node1629:25546] [10] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace4QMPI5probeEP19ompi_communicator_ti+0x120)[0x2b4f0c4821e0]
[node1629:25546] [11] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace4QMPI10timerEventEP11QTimerEvent+0x2d)[0x2b4f0c4826cd]
[node1629:25546] [12] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN7QObject5eventEP6QEvent+0x64)[0x2b4f0e9d9e94]
[node1629:25546] [13] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN16QCoreApplication6notifyEP7QObjectP6QEvent+0x3c)[0x2b4f0e9af38c]
[node1629:25546] [14] /home/btsheal/software/ACE/3.0.2/lib/libaceconsole.so.3(_ZN12EApplication6notifyEP7QObjectP6QEvent+0x16)[0x2b4f0dfd0196]
[node1629:25546] [15] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN16QCoreApplication15notifyInternal2EP7QObjectP6QEvent+0x75)[0x2b4f0e9af2c5]
[node1629:25546] [16] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN14QTimerInfoList14activateTimersEv+0x4ce)[0x2b4f0e9ff9ee]
[node1629:25546] [17] /software/Qt/5.9.2/lib/libQt5Core.so.5(+0x2c2069)[0x2b4f0ea00069]
[node1629:25546] [18] /lib64/libglib-2.0.so.0(g_main_context_dispatch+0x159)[0x2b4f140774c9]
[node1629:25546] [19] /lib64/libglib-2.0.so.0(+0x4a818)[0x2b4f14077818]
[node1629:25546] [20] /lib64/libglib-2.0.so.0(g_main_context_iteration+0x2c)[0x2b4f140778cc]
[node1629:25546] [21] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE+0x5c)[0x2b4f0ea0035c]
[node1629:25546] [22] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE+0xfb)[0x2b4f0e9ad5bb]
[node1629:25546] [23] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN16QCoreApplication4execEv+0x84)[0x2b4f0e9b5b84]
[node1629:25546] [24] /home/btsheal/software/ACE/3.0.2/lib/libaceconsole.so.3(_ZN12EApplication4execEv+0x714)[0x2b4f0dfd1c94]
[node1629:25546] [25] kinc[0x412e49]
[node1629:25546] [26] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b4f0f9b5495]
[node1629:25546] [27] kinc[0x4132c2]
[node1629:25546] *** End of error message ***

I know that this error doesn't occur at 128, so I'm guessing there's a cutoff somewhere between 128 and 256 MPI ranks. I'm not yet sure whether the cause is in ACE, KINC, or Palmetto. I will post new information as it becomes available.
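
For anyone picking this up: frames 1 to 4 of the trace are raw addresses inside the kinc binary, so the faulting function is not named. The first named frame above them is frame 5, which demangles to `Ace::Analytic::SerialRun::addWork`. Below is a minimal sketch for pulling the kinc addresses out of a log so they can be handed to a symbolizer. The helper name `kinc_frame_addresses` and the `SAMPLE` excerpt are illustrative only and are not part of KINC or ACE.

```python
import re

# Matches Open MPI backtrace lines such as
# "[node1629:25546] [ 1] kinc[0x43b412]" and captures the frame index
# and the address inside the kinc binary.
FRAME_RE = re.compile(r"\[\s*(\d+)\]\s+kinc\[(0x[0-9a-fA-F]+)\]")


def kinc_frame_addresses(log_text):
    """Return (frame_index, address) pairs for frames located in kinc."""
    return [(int(index), addr) for index, addr in FRAME_RE.findall(log_text)]


# Excerpt of the trace from this issue (hypothetical variable name).
SAMPLE = """\
[node1629:25546] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b4f0eef25d0]
[node1629:25546] [ 1] kinc[0x43b412]
[node1629:25546] [ 2] kinc[0x43bed3]
[node1629:25546] [ 3] kinc[0x43ace5]
[node1629:25546] [ 4] kinc[0x439a97]
[node1629:25546] [25] kinc[0x412e49]
"""

if __name__ == "__main__":
    for index, address in kinc_frame_addresses(SAMPLE):
        print(index, address)
```

The extracted addresses could then be passed to something like `addr2line -f -C -e kinc 0x43b412`. This assumes the exact same kinc build is available with debug symbols; otherwise the output will not be meaningful.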

@spficklin
Member

This issue is over 1 year old. Is it still a problem?

@bentsherman
Member Author

I haven't taken the time to test KINC with this many MPI ranks since then, but I'm still concerned that it could be a problem. Before closing this issue, I would like either myself or someone else to test KINC with up to 1024 CPU cores, because I want to be sure that KINC and ACE function properly at that scale. The error I got makes me worry that there is still some barrier to reaching that level of scalability.
