Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stall-detector: Try hard not to crash while collecting backtrace #2420

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions include/seastar/core/reactor.hh
Original file line number Diff line number Diff line change
Expand Up @@ -655,6 +655,7 @@ public:
/// resets the supression state.
static void set_stall_detector_report_function(std::function<void ()> report);
static std::function<void ()> get_stall_detector_report_function();
static void set_stall_detector_crash_collecting_backtrace();
};
};

Expand Down
29 changes: 29 additions & 0 deletions src/core/reactor.cc
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ module;
#include <fstream>
#include <regex>
#include <thread>
#include <setjmp.h>

#include <spawn.h>
#include <sys/syscall.h>
Expand Down Expand Up @@ -842,7 +843,23 @@ class backtrace_buffer {
}
};

static thread_local jmp_buf stall_detector_env;
static thread_local bool in_stall_detector = false;
static thread_local bool crash_collecting_backtrace = false;

inline void maybe_crash_for_test() noexcept {
if (crash_collecting_backtrace) [[unlikely]] {
*(volatile int *)nullptr = 0;
}
}

static void print_with_backtrace(backtrace_buffer& buf, bool oneline) noexcept {
if (sigsetjmp(stall_detector_env, 0)) {
buf.append(" ¯\\_(ツ)_/¯\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

goto out;
}
in_stall_detector = true;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be technically correct, we need an std::atomic_signal_fence(std::memory_order_relaxed). This prevents a magical compiler from delaying the write to memory because no one reads it.

if (local_engine) {
buf.append(" on shard ");
buf.append_decimal(this_shard_id());
Expand All @@ -853,12 +870,17 @@ static void print_with_backtrace(backtrace_buffer& buf, bool oneline) noexcept {

if (!oneline) {
buf.append(".\nBacktrace:\n");
maybe_crash_for_test();
buf.append_backtrace();
} else {
buf.append(". Backtrace:");
maybe_crash_for_test();
buf.append_backtrace_oneline();
buf.append("\n");
}

out:
in_stall_detector = false;
buf.flush();
}

Expand Down Expand Up @@ -1502,6 +1524,10 @@ reactor::test::set_stall_detector_report_function(std::function<void ()> report)
r._cpu_stall_detector->reset_suppression_state(reactor::now());
}

void reactor::test::set_stall_detector_crash_collecting_backtrace() {
crash_collecting_backtrace = true;
}

std::function<void ()>
reactor::test::get_stall_detector_report_function() {
return engine()._cpu_stall_detector->get_config().report;
Expand Down Expand Up @@ -3930,6 +3956,9 @@ void install_oneshot_signal_handler() {

struct sigaction sa;
sa.sa_sigaction = [](int sig, siginfo_t *info, void *p) {
if (sig == SIGSEGV && in_stall_detector) {
siglongjmp(stall_detector_env, 1);
}
std::lock_guard<util::spinlock> g(lock);
if (!handled) {
handled = true;
Expand Down
5 changes: 5 additions & 0 deletions tests/unit/stall_detector_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,11 @@ SEASTAR_THREAD_TEST_CASE(spin_in_kernel) {
test_spin_with_body("kernel", [] { mmap_populate(128 * 1024); });
}

SEASTAR_THREAD_TEST_CASE(crash_collecting_backtrace) {
reactor::test::set_stall_detector_crash_collecting_backtrace();
engine().update_blocked_reactor_notify_ms(100ms);
spin(500ms);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you also reproduce the crash during unwinding? It's not given that siglongjmp is a safe way to unwind. If the unwinder takes a lock, it will leak it (though I'm guessing it doesn't).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you also reproduce the crash during unwinding?

In labs -- unfortunately, no :(

It's not given that siglongjmp is a safe way to unwind.

Yes, sure, at this point the situation is already screwed up, and it's questionable whether these tricks are making things even worse or not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can override __cxa_throw and whatever function it uses to exist unwinding (but maybe there isn't one), and call them via RTLD_NEXT. Then we can set flags when unwinding is in progress, and just avoid going into the stall detector again (or perhaps: ask the stall detector to run on the exit path of __cxa_throw).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will work.

Also, tracing exception throwers is important.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can override __cxa_throw and whatever function it uses to exist unwinding (but maybe there isn't one)

There isn't one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe have a blacklist of functions that are known to crash. Every time we see a crash, add the triggering function to the blacklist. In a few short years we'll have a robust filter.


#else

Expand Down