Steal time constantly goes down #2374

Open
xemul opened this issue Jul 26, 2024 · 2 comments

Comments

@xemul
Contributor

xemul commented Jul 26, 2024

The formula is

    return std::chrono::duration_cast<std::chrono::nanoseconds>(now() - _start_time - _total_sleep) -
           std::chrono::duration_cast<std::chrono::nanoseconds>(thread_cputime_clock::now().time_since_epoch());

It computes (time_point - time_point - duration) - time_point, which is technically a "time", but subtracting a time_point from a duration shouldn't work.
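
To make the type complaint concrete, here is a minimal sketch of the `std::chrono` rules involved. It assumes `std::chrono::steady_clock` as a stand-in for the reactor's clock, and `raw_steal` and `can_subtract` are hypothetical names introduced only for illustration. The quoted formula compiles only because `time_since_epoch()` turns the thread CPU clock reading into a duration first.

```cpp
#include <chrono>
#include <type_traits>
#include <utility>

using steady = std::chrono::steady_clock;  // stand-in clock (assumption)

// time_point - time_point yields a duration.
static_assert(std::is_same_v<
    decltype(steady::time_point{} - steady::time_point{}), steady::duration>);

// time_since_epoch() also yields a duration, which is why the formula compiles.
static_assert(std::is_same_v<
    decltype(steady::time_point{}.time_since_epoch()), steady::duration>);

// Detects whether the expression `A - B` is well-formed.
template <typename A, typename B, typename = void>
struct can_subtract : std::false_type {};
template <typename A, typename B>
struct can_subtract<A, B,
    std::void_t<decltype(std::declval<A>() - std::declval<B>())>>
    : std::true_type {};

// time_point - duration is allowed; duration - time_point is not.
static_assert(can_subtract<steady::time_point, steady::duration>::value);
static_assert(!can_subtract<steady::duration, steady::time_point>::value);

// Hypothetical reshaping of the quoted formula with every operand typed.
// cpu_time plays the role of thread_cputime_clock::now().time_since_epoch().
inline std::chrono::nanoseconds raw_steal(steady::time_point now,
                                          steady::time_point start,
                                          steady::duration total_sleep,
                                          std::chrono::nanoseconds cpu_time) {
    auto awake = std::chrono::duration_cast<std::chrono::nanoseconds>(
        now - start - total_sleep);
    return awake - cpu_time;  // negative when sleep is over-counted
}
```

With 10s elapsed, 4s of recorded sleep and 5s of CPU time this returns 1s; if recorded sleep is 6s instead, it returns -1s.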

@xemul
Contributor Author

xemul commented Jul 31, 2024

[image]

@travisdowns
Contributor

travisdowns commented Aug 13, 2024

I have a fix for this.

#2390

travisdowns added a commit to travisdowns/seastar that referenced this issue Aug 15, 2024
total_steal_time() can decrease in value from call to call, a violation
of the rule that a counter metric must never decrease.

This happens because steal time is "awake time (wall clock)" minus
"CPU thread time (getrusage-style CPU time)". The awake time is itself
"time since reactor start" minus "accumulated sleep time", but it can
be under-counted because sleep time is over-counted: the true sleep
time cannot be determined exactly, only bracketed by timestamps taken
before and after a period that might involve a sleep.

Currently, sleep is even more significantly over-counted than the error
described above as it is measured at a point which includes significant
non-sleep work.

The result is that when there is little to no true steal, CPU time will
exceed measured awake wall clock time, resulting in negative steal.

This change "fixes" this by enforcing that steal time is monotonically
increasing. This occurs at measurement time by checking if "true steal"
(i.e., the old definition of steal) has increased since the last
measurement and adding that delta to our monotonic steal counter if so.
Otherwise the delta is dropped.
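
The clamping rule described in this paragraph can be sketched roughly as
follows. This is an illustration under assumptions, not the code from the
commit: the class name `monotonic_steal` and its members are hypothetical,
and the choice to always remember the latest raw value (even after a
decrease) is one reading of "otherwise the delta is dropped".

```cpp
#include <chrono>

// Hypothetical helper: turns a possibly decreasing "true steal" reading
// into a counter that never decreases.
class monotonic_steal {
    std::chrono::nanoseconds _last_raw{0};  // previous "true steal" reading
    std::chrono::nanoseconds _total{0};     // monotonic counter exposed as the metric
public:
    std::chrono::nanoseconds update(std::chrono::nanoseconds raw) {
        if (raw > _last_raw) {
            _total += raw - _last_raw;  // accumulate positive deltas only
        }
        _last_raw = raw;  // assumption: track the latest reading either way
        return _total;
    }
};
```

Feeding raw readings of 5ns, 3ns, 4ns yields 5ns, 5ns, 6ns: the drop from
5 to 3 is ignored, and the later rise from 3 to 4 is counted.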

While not ideal, this yields a useful metric that mostly clamps away
the error from negative steal times and, more importantly, avoids the
catastrophic failure of PromQL functions when they are applied to
non-monotonic counters.
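
To illustrate why a decreasing counter is so damaging: PromQL's rate() and
increase() treat any decrease as a counter reset. The sketch below is a
simplified model of that handling, ignoring range extrapolation and
staleness; `increase_with_resets` is a hypothetical name, not a Prometheus
API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of counter-reset handling: a sample lower than its
// predecessor is assumed to come from a counter that restarted at zero,
// so the whole new value is counted as increase.
inline std::int64_t increase_with_resets(const std::vector<std::int64_t>& samples) {
    std::int64_t total = 0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        if (samples[i] < samples[i - 1]) {
            total += samples[i];                   // treated as a reset
        } else {
            total += samples[i] - samples[i - 1];  // ordinary increase
        }
    }
    return total;
}
```

For samples 100, 101, 100, 102 the net change is 2, but this model reports
103, because the small dip from 101 to 100 is read as a restart from zero.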

Fixes scylladb#1521.
Fixes scylladb#2374.
travisdowns added a commit to travisdowns/seastar that referenced this issue Aug 20, 2024