agent: Add per-VM metric for desired CU(s) #1108

sharnoff · 2024-10-12T01:48:25Z

This commit adds a new per-VM metric: autoscaling_vm_desired_cu.

It's based on the same "desired CU" information exposed by the scaling event reporting, but updated continuously instead of being rate limited to avoid spamming our reporting.

The metric has the same base labels as the other per-VM metrics, with the addition of the "reason" label, which is one of:

total - the goal CU, after taking the maximum of the individual parts and rounding up to the next unit.
cpu - goal CU size in order to fit the current CPU usage
mem - goal CU size in order to fit the current memory usage (including some information derived from LFC, to make sure there's room for cache too)
lfc - goal CU size in order to fit the estimated working set size

All of these values are also multiplied by the same Compute Unit factor as with the normal scaling event reporting, so that Neon's fractional compute units are exposed as such in the metrics, even as we use integer compute units in the autoscaler-agent.

Also note that all values except "total" are NOT rounded, and instead show the fractional amounts to allow better comparison.

KNOWN LIMITATION: If ReportDesiredScaling is disabled at runtime for a particular VM, the metrics will not be cleared, and instead will just cease to be updated. I figured this is a reasonable trade-off for simplicity.

Notes for review: Tested this locally with the following patch to vm-deploy.yaml:

diff --git a/vm-deploy.yaml b/vm-deploy.yaml
index 09588f60..dcf9f0f0 100644
--- a/vm-deploy.yaml
+++ b/vm-deploy.yaml
@@ -10,6 +10,8 @@ metadata:
     autoscaling.neon.tech/enabled: "true"
     # Set to "true" to continuously migrate the VM (TESTING ONLY)
     autoscaling.neon.tech/testing-only-always-migrate: "false"
+    autoscaling.neon.tech/report-desired-scaling: "true"
+    neon/endpoint-id: "foobar"
 spec:
   schedulerName: autoscale-scheduler
   enableSSH: true

AFAICT it works as intended, but metrics are sometimes tricky. I plan to test it on staging before merging.

Also note: This PR builds on #1107 and must not be merged before it.

github-actions · 2024-10-12T01:51:49Z

No changes to the coverage.

HTML Report

Click to open

This commit adds a new per-VM metric: autoscaling_vm_desired_cu. It's based on the same "desired CU" information exposed by the scaling event reporting, but updated continuously instead of being rate limited to avoid spamming our reporting. The metric has the same base labels as the other per-VM metrics, with the addition of the "reason" label, which is one of: * "total" - the goal CU, after taking the maximum of the individual parts and rounding up to the next unit. * "cpu" - goal CU size in order to fit the current CPU usage * "mem" - goal CU size in order to fit the current memory usage, which includes some assesssment * "lfc" - goal CU size in order to fit the estimated working set size All of these values are also multiplied by the same Compute Unit factor as with the normal scaling event reporting, so that Neon's fractional compute units are exposed as such in the metrics, even as we use integer compute units in the autoscaler-agent. Also note that all values except "total" are NOT rounded, and instead show the fractional amounts to allow better comparison. KNOWN LIMITATION: If ReportDesiredScaling is disabled at runtime for a particular VM, the metrics will not be cleared, and instead will just cease to be updated. I figured this is a reasonable trade-off for simplicity.

Omrigan · 2024-12-20T11:38:37Z

pkg/agent/prommetrics.go

+	// that when we set the desired CU from internal information, we can check whether the VM still
+	// exists.
+	// Otherwise it's not possible to prevent data races that would result in leaking metric labels.
+	activeMu  sync.Mutex


Why did this become needed only now, how desiredCU is different from other metrics we already had?

IIRC, it wasn't needed before because the metrics were all derived from the state of the VM, whereas desiredCU is derived from our internal state.

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 693b601 to a3cf0fa Compare October 12, 2024 21:39

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from c9f1d75 to fe3b019 Compare October 12, 2024 21:39

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from a3cf0fa to 16c0917 Compare October 12, 2024 22:16

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from fe3b019 to 244473b Compare October 12, 2024 22:21

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 16c0917 to d2b4d45 Compare October 17, 2024 17:13

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 244473b to 8008093 Compare October 17, 2024 17:14

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from d2b4d45 to 8c60b7f Compare November 18, 2024 04:01

sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 8008093 to 8639da0 Compare November 18, 2024 04:01

Omrigan self-requested a review December 16, 2024 17:14

Omrigan self-assigned this Dec 16, 2024

Omrigan requested review from mikhail-sakhnov and petuhovskiy December 16, 2024 17:15

Omrigan reviewed Dec 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent: Add per-VM metric for desired CU(s) #1108

agent: Add per-VM metric for desired CU(s) #1108

sharnoff commented Oct 12, 2024 •

edited

Loading

github-actions bot commented Oct 12, 2024 •

edited

Loading

Omrigan Dec 20, 2024

sharnoff Dec 20, 2024

agent: Add per-VM metric for desired CU(s) #1108

Are you sure you want to change the base?

agent: Add per-VM metric for desired CU(s) #1108

Conversation

sharnoff commented Oct 12, 2024 • edited Loading

github-actions bot commented Oct 12, 2024 • edited Loading

HTML Report

Omrigan Dec 20, 2024

Choose a reason for hiding this comment

sharnoff Dec 20, 2024

Choose a reason for hiding this comment

sharnoff commented Oct 12, 2024 •

edited

Loading

github-actions bot commented Oct 12, 2024 •

edited

Loading