Bug: hotplugging on linux startup doesn't work #1149

petuhovskiy · 2024-11-18T15:53:20Z

Environment

Initially discovered this issue on staging (slack thread)

We found a VM on staging that has a [10; 10] CU autoscaling range, but only 3 CPUs are online and ready to use.

When I looked at startup logs, I found this:

[    0.149516] smpboot: Max logical packages: 10
[    0.149841] smpboot: Total of 1 processors activated (5799.92 BogoMIPS)
[    0.154396] cpuidle: using governor ladder
[    0.154707] cpuidle: using governor menu
[    0.162735] ACPI: Added _OSI(Processor Device)
[    0.163434] ACPI: Added _OSI(Processor Aggregator Device)
[    0.167554] ACPI: Using IOAPIC for interrupt routing
[    0.365657] CPU1 has been hot-added
[    0.442354] Fallback order for Node 0: 0 
[    0.646402] Fallback order for Node 0: 0 
[    0.851667] intel_pstate: CPU model not supported
[    1.079508] CPU2 has been hot-added
[    1.079939] CPU3 has been hot-added
[    1.080254] CPU4 has been hot-added
[    1.080563] CPU5 has been hot-added
[    1.080873] CPU6 has been hot-added
[    1.112698] CPU7 has been hot-added
[    1.765315] CPU8 has been hot-added
[    1.789376] SMP alternatives: switching to SMP code
[    1.792977] smpboot: Booting Node 0 Processor 8 APIC 0x8
[    1.793740] Will online and init hotplugged CPU: 8
[    2.849321] CPU9 has been hot-added
[    2.909766] smpboot: Booting Node 0 Processor 9 APIC 0x9
[    2.910673] Will online and init hotplugged CPU: 9

From the logs, it looks like some CPUs were hotplugged before Linux had fully started. Because of that, smpboot was not ready to online them, and as a result we only have 3 CPUs:

CPU 0 online
CPUs 1-7 offline
CPUs 8-9 onlined after hotplug
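
The same classification can be derived mechanically from a kernel log. The following is a minimal sketch, assuming the two message formats seen in the log above ("CPUn has been hot-added" and "Will online and init hotplugged CPU: n"); the function name and the shortened sample log are illustrative and not part of any existing tool.

```python
import re

# Message formats taken from the kernel log quoted in this issue.
HOT_ADDED = re.compile(r"CPU(\d+) has been hot-added")
ONLINED = re.compile(r"Will online and init hotplugged CPU: (\d+)")


def classify_hotplugged_cpus(log_text):
    """Return (onlined, stuck_offline) CPU numbers found in a kernel log.

    A CPU counts as "stuck offline" if it was hot-added but the log never
    reports it being onlined afterwards.
    """
    hot_added = {int(n) for n in HOT_ADDED.findall(log_text)}
    onlined = {int(n) for n in ONLINED.findall(log_text)}
    return sorted(hot_added & onlined), sorted(hot_added - onlined)


# Shortened, hypothetical excerpt in the same format as the log above.
sample = """\
[    0.365657] CPU1 has been hot-added
[    1.079508] CPU2 has been hot-added
[    1.765315] CPU8 has been hot-added
[    1.793740] Will online and init hotplugged CPU: 8
[    2.849321] CPU9 has been hot-added
[    2.910673] Will online and init hotplugged CPU: 9
"""

onlined, stuck = classify_hotplugged_cpus(sample)
print("onlined:", onlined)
print("stuck offline:", stuck)
```

CPU 0 does not appear in either list because it is the boot CPU and is never hot-added.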

Steps to reproduce

Should be easy to reproduce in tests, by creating a VM like this:

apiVersion: vm.neon.tech/v1
kind: VirtualMachine
metadata:
  annotations:
    autoscaling.neon.tech/bounds: '{"min":{"cpu":"10","mem":"40Gi"},"max":{"cpu":"10","mem":"40Gi"}}'
    autoscaling.neon.tech/config: '{"enableLFCMetrics":true}'
...
    cpus:
      max: 10
      min: 250m
      use: 250m
...

The idea is to have autoscaler-agent change use to 10 CU right away, before Linux has finished starting up. That should reproduce the issue.

Expected result

The expected result is that use always corresponds to the number of usable CPUs in the VM.
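
A test could check this invariant from inside the guest by reading /sys/devices/system/cpu/online, which the kernel exposes as a CPU list such as "0,8-9". The following is a rough sketch only: the helper names are hypothetical, and it assumes use is a whole number of CPUs.

```python
def parse_cpu_list(cpulist):
    """Expand a kernel cpulist string such as "0,8-9" into CPU numbers."""
    cpus = []
    for part in cpulist.strip().split(","):
        if not part:
            continue  # tolerate an empty string
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpu_shortfall(expected_cpus, cpulist):
    """Number of CPUs missing relative to the expected count (0 if none)."""
    return max(0, expected_cpus - len(parse_cpu_list(cpulist)))


# The cpulist below is what the state described in this issue would
# presumably look like (CPU 0 plus CPUs 8-9); it is not captured output.
print(parse_cpu_list("0,8-9"))
print(online_cpu_shortfall(10, "0,8-9"))
```

On a real VM the string would come from reading the sysfs file, for example `open("/sys/devices/system/cpu/online").read()`.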

Actual result

Some CPUs are plugged but not onlined, and are thus unavailable to Postgres and other VM processes.

Other logs, links


stradig · 2024-11-18T16:45:37Z

We are planning to switch to online / offlining CPUs which should solve the problem. Thus setting to blocked.
After this is implemented we should also have a test.

ololobus · 2024-11-19T15:24:59Z

> We are planning to switch to online / offlining CPUs which should solve the problem.

@stradig do you have any ETA for that? Konstantin mentioned that he hit it several times during his prewarm tests, i.e. frequently got <10 requested CPUs online.

petuhovskiy · 2024-11-19T15:29:37Z

@ololobus the temporary fix for this is to force a fixed-size VM: https://docs.neon.build/autoscaling/operations/force_fixed_size_compute.html
Another potential fix is to use a [1; 10] CU autoscaling range.

ololobus · 2024-11-19T15:31:17Z

cc @knizhnik ^ could be helpful for tests

knizhnik · 2024-11-19T15:49:53Z

Thank you. I have checked that forcing fixed-size compute really helps.

Bodobolero · 2024-11-30T15:46:59Z

I just want to ask because this is important to assess the priority of this bug.
Does it ONLY affect computes with pool misses?
So is my 7 fixed CU ONLY affected because I use a pinned compute image which is not in the pool?
Or does it also hit projects in Prod that have high CU spec that we normally don't have in pool?
This would imply that it hits our most important large customers the most. If this is the case I think this is high priority (we have already lost 2 weeks on this?)

sharnoff · 2024-12-03T04:57:11Z

To clarify: This issue only affects computes that: miss the pool AND have min CU > 1 AND are missing the "force fixed-size VM" flag.

IIRC we had plans to auto-enable the "force fixed-size VM" flag for all computes above a certain size; I'm not sure if that happened. cc @stradig as it's part of neondatabase/cloud#18281.

Currently, we aren't planning to roll out a dedicated fix for this, as it requires:

Making the fix in vm-builder, and releasing autoscaling
Bumping vm-builder in neon.git, and releasing updated compute images

But switching to sysfs CPU scaling will already fix this, and is in progress (remaining: bump vm-builder in neon.git, release updated compute images, and then enable by default in prod).
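
For readers unfamiliar with the mechanism: on Linux, a CPU that is present but offline can be brought online by writing "1" to /sys/devices/system/cpu/cpuN/online. The sketch below illustrates that general interface only. It is not the vm-builder or autoscaling implementation, the function name is hypothetical, and the demo runs against a fake directory tree because writing to the real sysfs files requires root.

```python
import os
import tempfile


def online_offline_cpus(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Write "1" to the `online` file of every CPU that reports "0".

    Returns the CPU numbers that were switched on, in ascending order.
    """
    onlined = []
    for entry in os.listdir(sysfs_cpu_dir):
        # Only cpuN directories; skip entries such as cpufreq or cpuidle.
        if not (entry.startswith("cpu") and entry[3:].isdigit()):
            continue
        path = os.path.join(sysfs_cpu_dir, entry, "online")
        if not os.path.exists(path):
            continue  # the boot CPU often has no `online` file
        with open(path) as f:
            state = f.read().strip()
        if state == "0":
            with open(path, "w") as f:
                f.write("1")
            onlined.append(int(entry[3:]))
    return sorted(onlined)


# Demo against a fake sysfs tree: cpu1 and cpu2 offline, cpu8 online.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "cpu0"))  # no `online` file
    for n, state in [(1, "0"), (2, "0"), (8, "1")]:
        os.makedirs(os.path.join(root, f"cpu{n}"))
        with open(os.path.join(root, f"cpu{n}", "online"), "w") as f:
            f.write(state)
    print(online_offline_cpus(root))
```

Because the guest itself flips the online state, this approach does not depend on a hot-add event arriving at the right moment during boot, which is presumably why it sidesteps the race described in this issue.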

If we hit issues enabling sysfs CPU scaling though, we can always fall back on a dedicated fix for this to set an upper bound on the timeline.

edit: See also https://neondb.slack.com/archives/C07MM01A6UA/p1733201900302069

petuhovskiy added the t/bug Issue Type: Bug label Nov 18, 2024

Bodobolero added a/performance Area: relates to performance of the system a/benchmark Area: related to benchmarking labels Nov 30, 2024
