Bug: hotplugging on linux startup doesn't work #1149
Comments
We are planning to switch to onlining / offlining CPUs, which should solve the problem. Thus setting to blocked.
@stradig do you have any ETA for that? Konstantin mentioned that he hit it several times during his prewarm tests, i.e. he frequently got <10 requested CPUs online.
@ololobus the temp fix for this is to force a fixed-size VM: https://docs.neon.build/autoscaling/operations/force_fixed_size_compute.html
cc @knizhnik ^ could be helpful for tests
Thank you. I have checked that forcing fixed-size compute really helps.
I just want to ask because this is important for assessing the priority of this bug.
To clarify: This issue only affects computes that: miss the pool AND have min CU > 1 AND are missing the "force fixed-size VM" flag. IIRC we had plans to auto-enable the force-fixed size VM flag for all computes above a certain size; I'm not sure if that happened. cc @stradig as it's part of neondatabase/cloud#18281. Currently, we aren't planning to roll out a dedicated fix for this, as it requires:
But switching to sysfs CPU scaling will already fix this, and is in progress (remaining: bump vm-builder in If we hit issues enabling sysfs CPU scaling though, we can always fall back on a dedicated fix for this to set an upper bound on the timeline.
edit: See also https://neondb.slack.com/archives/C07MM01A6UA/p1733201900302069
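For context on what "sysfs CPU scaling" means at the kernel level: Linux exposes a per-CPU `online` file under `/sys/devices/system/cpu/cpuN/`, and writing `1` or `0` to it brings that CPU online or offline. The sketch below only illustrates that interface. It is not the neonvm implementation, and the function name and the `cpu_root` parameter (there so the code can be pointed at a test directory) are made up for this example.

```python
from pathlib import Path

# Standard sysfs location for CPU control files on Linux.
DEFAULT_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu: int, online: bool, cpu_root: Path = DEFAULT_CPU_ROOT) -> bool:
    """Bring one CPU online or offline via its sysfs control file.

    Returns True if a write was needed, False if the CPU was already
    in the requested state. Hypothetical helper, for illustration only.
    """
    control = cpu_root / f"cpu{cpu}" / "online"
    if not control.exists():
        # On many systems cpu0 has no `online` file because it cannot be offlined.
        raise FileNotFoundError(f"no online control file for cpu{cpu}: {control}")

    wanted = "1" if online else "0"
    if control.read_text().strip() == wanted:
        return False

    # Needs root privileges when run against the real sysfs.
    control.write_text(wanted)
    return True
```

With this approach the CPUs are already present in the guest and only their online state changes, so nothing depends on a hotplug event being handled during boot.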
Environment
Initially discovered this issue on staging (slack thread)
We found a VM on staging that has a [10; 10] CU autoscaling range, but only 3 CPUs are online and ready to use.
When I looked at the startup logs, I found this:
From the logs, it looks like some CPUs were hotplugged before linux was fully started. Because of that, smpboot was not ready to online them, and as a result we only have 3 CPUs:
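One way to spot this state from inside the VM is to compare the kernel's `present` and `online` CPU lists under `/sys/devices/system/cpu`. Both files use the kernel's CPU list format (for example `0-2,5`). A minimal sketch, with hypothetical helper names and a `cpu_root` parameter added only so it can be run against a fake directory:

```python
from pathlib import Path

DEFAULT_CPU_ROOT = Path("/sys/devices/system/cpu")


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel CPU list such as "0-2,5" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means an empty list
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def plugged_but_offline(cpu_root: Path = DEFAULT_CPU_ROOT) -> set[int]:
    """Return CPUs the kernel knows about that are not online."""
    present = parse_cpu_list((cpu_root / "present").read_text())
    online = parse_cpu_list((cpu_root / "online").read_text())
    return present - online
```

In the situation described here, with 10 CPUs plugged and only 3 online, the result would be non-empty.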
Steps to reproduce
Should be easy to reproduce in tests, by creating a VM like this:
The idea is to have autoscaler-agent change `use` to 10 CU right away, before Linux has finished starting up. That should reproduce the issue.
Expected result
`use` always corresponds to the number of usable CPUs in the VM.
Actual result
Some CPUs are plugged, but not onlined and thus unavailable for postgres and other VM processes.
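A test for this could poll the guest until the online CPU count matches the requested value and fail on timeout. The sketch below is only an assumption about how such a check might look. `wait_for_online_cpus` and its parameters are hypothetical, and how `count_online` gets its number (for example, by counting the CPUs listed in `/sys/devices/system/cpu/online` inside the VM) is left to the caller.

```python
import time
from typing import Callable


def wait_for_online_cpus(
    expected: int,
    count_online: Callable[[], int],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until count_online() equals expected, or the timeout expires.

    Returns True on a match and False on timeout. `sleep` and `clock`
    are injectable so the helper can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if count_online() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With the bug present, a call like `wait_for_online_cpus(10, count_online)` would be expected to time out, since the plugged CPUs never come online.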
Other logs, links