Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: Add per-VM metric for desired CU(s) #1108

Open
wants to merge 1 commit into
base: sharnoff/scaling-event-reporting-2
Choose a base branch
from

Conversation

sharnoff
Copy link
Member

@sharnoff sharnoff commented Oct 12, 2024

This commit adds a new per-VM metric: autoscaling_vm_desired_cu.

It's based on the same "desired CU" information exposed by the scaling event reporting, but updated continuously instead of being rate limited to avoid spamming our reporting.

The metric has the same base labels as the other per-VM metrics, with the addition of the "reason" label, which is one of:

  • total - the goal CU, after taking the maximum of the individual parts and rounding up to the next unit.
  • cpu - goal CU size in order to fit the current CPU usage
  • mem - goal CU size in order to fit the current memory usage (including some information derived from LFC, to make sure there's room for cache too)
  • lfc - goal CU size in order to fit the estimated working set size

All of these values are also multiplied by the same Compute Unit factor as with the normal scaling event reporting, so that Neon's fractional compute units are exposed as such in the metrics, even as we use integer compute units in the autoscaler-agent.

Also note that all values except "total" are NOT rounded, and instead show the fractional amounts to allow better comparison.

KNOWN LIMITATION: If ReportDesiredScaling is disabled at runtime for a particular VM, the metrics will not be cleared, and instead will just cease to be updated. I figured this is a reasonable trade-off for simplicity.


Notes for review: Tested this locally with the following patch to vm-deploy.yaml:

diff --git a/vm-deploy.yaml b/vm-deploy.yaml
index 09588f60..dcf9f0f0 100644
--- a/vm-deploy.yaml
+++ b/vm-deploy.yaml
@@ -10,6 +10,8 @@ metadata:
     autoscaling.neon.tech/enabled: "true"
     # Set to "true" to continuously migrate the VM (TESTING ONLY)
     autoscaling.neon.tech/testing-only-always-migrate: "false"
+    autoscaling.neon.tech/report-desired-scaling: "true"
+    neon/endpoint-id: "foobar"
 spec:
   schedulerName: autoscale-scheduler
   enableSSH: true

AFAICT it works as intended, but metrics are sometimes tricky. I plan to test it on staging before merging.

Also note: This PR builds on #1107 and must not be merged before it.

Copy link

github-actions bot commented Oct 12, 2024

No changes to the coverage.

HTML Report

Click to open

@sharnoff sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 693b601 to a3cf0fa Compare October 12, 2024 21:39
@sharnoff sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from c9f1d75 to fe3b019 Compare October 12, 2024 21:39
@sharnoff sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from a3cf0fa to 16c0917 Compare October 12, 2024 22:16
@sharnoff sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from fe3b019 to 244473b Compare October 12, 2024 22:21
@sharnoff sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 16c0917 to d2b4d45 Compare October 17, 2024 17:13
@sharnoff sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 244473b to 8008093 Compare October 17, 2024 17:14
This commit adds a new per-VM metric: autoscaling_vm_desired_cu.

It's based on the same "desired CU" information exposed by the scaling
event reporting, but updated continuously instead of being rate limited
to avoid spamming our reporting.

The metric has the same base labels as the other per-VM metrics, with
the addition of the "reason" label, which is one of:

* "total" - the goal CU, after taking the maximum of the individual
  parts and rounding up to the next unit.
* "cpu" - goal CU size in order to fit the current CPU usage
* "mem" - goal CU size in order to fit the current memory usage, which
  includes some assesssment
* "lfc" - goal CU size in order to fit the estimated working set size

All of these values are also multiplied by the same Compute Unit factor
as with the normal scaling event reporting, so that Neon's fractional
compute units are exposed as such in the metrics, even as we use integer
compute units in the autoscaler-agent.

Also note that all values except "total" are NOT rounded, and instead
show the fractional amounts to allow better comparison.

KNOWN LIMITATION: If ReportDesiredScaling is disabled at runtime for a
particular VM, the metrics will not be cleared, and instead will just
cease to be updated. I figured this is a reasonable trade-off for
simplicity.
@sharnoff sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from d2b4d45 to 8c60b7f Compare November 18, 2024 04:01
@sharnoff sharnoff force-pushed the sharnoff/agent-desired-cu-metrics branch from 8008093 to 8639da0 Compare November 18, 2024 04:01
@Omrigan Omrigan self-requested a review December 16, 2024 17:14
@Omrigan Omrigan self-assigned this Dec 16, 2024
// that when we set the desired CU from internal information, we can check whether the VM still
// exists.
// Otherwise it's not possible to prevent data races that would result in leaking metric labels.
activeMu sync.Mutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this become needed only now, how desiredCU is different from other metrics we already had?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, it wasn't needed before because the metrics were all derived from the state of the VM, whereas desiredCU is derived from our internal state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants