agent: Add scaling event reporting #1107

Open: sharnoff wants to merge 3 commits into main

Member

@sharnoff sharnoff commented Oct 12, 2024

This is part 2 of 2; see #1078 for the groundwork and neondatabase/cloud#15939 for the full context.

In short, this PR:

  • Adds a new package: pkg/agent/scalingevents
  • Adds new callbacks to core.State to allow it to report on scaling events and changes in desired CU.

Notes for review:

I'd like to add minio-based S3 tests to this, but it seemed like it'd be non-trivial, particularly because scaling events actually require that some scaling happens, unlike the existing billing tests.

So I figured I'd open this for review in the meantime.

Also note: This PR builds on #1078 and must not be merged before it.

github-actions bot commented Oct 12, 2024

No changes to the coverage.

@sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 693b601 to a3cf0fa on October 12, 2024 21:39
@sharnoff force-pushed the sharnoff/scaling-event-reporting-1 branch from b70150d to 54bfb21 on October 12, 2024 21:53
@sharnoff force-pushed the sharnoff/scaling-event-reporting-1 branch from f608569 to 1c71a57 on October 12, 2024 22:06
@sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from a3cf0fa to 16c0917 on October 12, 2024 22:16
@sharnoff force-pushed the sharnoff/scaling-event-reporting-1 branch from a46466d to df54b37 on October 17, 2024 17:13
@sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 16c0917 to d2b4d45 on October 17, 2024 17:13
Base automatically changed from sharnoff/scaling-event-reporting-1 to main on November 13, 2024 16:50

This is part 2 of 2; see #1078 for the groundwork.

In short, this commit:

* Adds a new package: 'pkg/agent/scalingevents'
* Adds new callbacks to core.State to allow it to report on scaling
  events and changes in desired CU.

@sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from d2b4d45 to 8c60b7f on November 18, 2024 04:01
@sharnoff requested review from a team and Omrigan, and removed the request for a team, on November 19, 2024 19:42
Member Author

sharnoff commented Nov 19, 2024

Remaining items for me, on this:

  1. Add more thorough e2e tests
  2. Test on staging

In the meantime, it should be ok to review.

Contributor

@Omrigan Omrigan left a comment

Looks good, some questions and suggestions.

erc.Whenf(ec, c.ScalingEvents.RegionName == "", emptyTmpl, ".scalingEvents.regionName")
if c.ScalingEvents.Clients.S3 != nil {
	validateBaseReportingConfig(&c.ScalingEvents.Clients.S3.BaseClientConfig, "scalingEvents.clients.s3")
	erc.Whenf(ec, c.ScalingEvents.Clients.S3.Bucket == "", emptyTmpl, ".scalingEvents.clients.s3.bucket")
Contributor

Let's reuse this s3 validation, same as validateBaseReportingConfig?
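One way to read the suggestion above is a shared helper that both the billing and scaling-events configs call for their S3 clients. The following is only a sketch: `S3ClientConfig`, its `Region` field, and `validateS3ReportingConfig` are hypothetical names, and it collects plain errors rather than using the `erc` helper from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// S3ClientConfig is a hypothetical stand-in for the agent's S3 client config.
// Only Bucket appears in the quoted diff; Region is assumed for illustration.
type S3ClientConfig struct {
	Bucket string
	Region string
}

// validateS3ReportingConfig returns one error per empty required field,
// prefixing each field name with the config key it was found under.
func validateS3ReportingConfig(cfg *S3ClientConfig, key string) error {
	var errs []error
	check := func(empty bool, field string) {
		if empty {
			errs = append(errs, fmt.Errorf("field .%s.%s is empty", key, field))
		}
	}
	check(cfg.Bucket == "", "bucket")
	check(cfg.Region == "", "region")
	// errors.Join returns nil when there are no errors to join.
	return errors.Join(errs...)
}

func main() {
	billing := &S3ClientConfig{Bucket: "billing-bucket", Region: "us-east-1"}
	events := &S3ClientConfig{Region: "us-east-1"}
	fmt.Println(validateS3ReportingConfig(billing, "billing.clients.s3"))
	fmt.Println(validateS3ReportingConfig(events, "scalingEvents.clients.s3"))
}
```

With a helper of this shape, each call site would shrink to a single call with its own key prefix.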

}

-	goalCU := max(cpuGoalCU, memGoalCU, memTotalGoalCU, lfcGoalCU)
+	goalCU := uint32(math.Ceil(max(
+		math.Round(cpuGoalCU), // for historical compatibility, use round() instead of ceil()
Contributor

Do we really need the historical compatibility?

Member Author

@sharnoff sharnoff Nov 27, 2024

Probably not, but I didn't want a scaling algorithm change to be a side-effect of this PR.

I'd be happy to change it in a separate PR?

Contributor

> Probably not, but I didn't want a scaling algorithm change to be a side-effect of this PR.

Makes sense!

> I'd be happy to change it in a separate PR?

Up to you.
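To make the round-versus-ceil question in this thread concrete, here is a small sketch of the difference. It assumes the remaining components are passed to `max` unrounded, which the truncated diff does not show, and the function name `goalCU` is made up for the example. It relies on the built-in `max` from Go 1.21, as the quoted diff does.

```go
package main

import (
	"fmt"
	"math"
)

// goalCU rounds the CPU component to the nearest integer (the "historical"
// behavior), takes the largest component, and rounds the result up.
// How the non-CPU components are treated is an assumption of this sketch.
func goalCU(cpu, mem, memTotal, lfc float64) uint32 {
	return uint32(math.Ceil(max(
		math.Round(cpu),
		mem, memTotal, lfc,
	)))
}

func main() {
	// CPU-bound: 2.4 rounds down to 2, where ceil() alone would give 3.
	fmt.Println(goalCU(2.4, 1.0, 1.0, 1.0))
	// Memory-bound: 2.1 is the largest component and is rounded up to 3.
	fmt.Println(goalCU(2.4, 2.1, 0, 0))
}
```

Dropping `math.Round` would make a CPU goal of 2.4 scale to 3 CU instead of 2, which is the algorithm change the author wanted to keep out of this PR.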

Comment on lines +18 to +19
// This exists because Neon allows fractional compute units, while the autoscaler-agent acts on
// integer multiples of a smaller compute unit.
Contributor

I never paid attention to this difference before; I'd like to discuss it more broadly.
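For the broader discussion, the difference described in the quoted comment can be illustrated with a tiny conversion. This is a sketch only: the 0.25 multiplier and the names `cuMultiplier` and `toNeonCU` are assumptions for the example, not values taken from the PR.

```go
package main

import "fmt"

// cuMultiplier is the hypothetical size of one agent-internal compute unit,
// expressed in Neon compute units. The real value is not shown in this PR.
const cuMultiplier = 0.25

// toNeonCU converts an integer count of agent-internal units into the
// fractional compute units that Neon exposes.
func toNeonCU(agentUnits uint32) float64 {
	return float64(agentUnits) * cuMultiplier
}

func main() {
	for _, units := range []uint32{1, 4, 5} {
		fmt.Printf("%d agent units = %.2f CU\n", units, toNeonCU(units))
	}
}
```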

Comment on lines +26 to +27
ClusterName string `json:"clusterName"`
RegionName string `json:"regionName"`
Contributor

What do those mean?

pkg/agent/runner.go (outdated, resolved)

// Returns a function to generate keys for the placement of scaling events data into blob storage.
//
// Example: prefix/2024/10/31/23/events_{uuid}.ndjson.gz (11pm on halloween, UTC)
func newBlobStorageKeyGenerator(prefix string) func() string {
Contributor

Can we reuse the same key generator as we have for billing?

lastParts *scalingevents.GoalCUComponents
}

func (rl *desiredScalingReportLimiter) report(
Contributor

A potential alternative would be to have

type Skipper interface {
	Skip(event ScalingEvent) bool
}

injected into Reporter, and it would be called in Submit().

This would allow implementing the limiter in a more generic way, for any type of event. Plus, there would be no need to pass so many arguments when we can pass only a ScalingEvent.
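A minimal sketch of how that could fit together follows. The `Skipper` interface is the one proposed above; the `ScalingEvent` fields, the `Reporter` internals, and `unchangedGoalSkipper` are hypothetical and only illustrate the injection.

```go
package main

import "fmt"

// ScalingEvent is a hypothetical, trimmed-down event type.
type ScalingEvent struct {
	EndpointID string
	GoalCU     uint32
}

// Skipper decides whether an event should be dropped before it is queued.
type Skipper interface {
	Skip(event ScalingEvent) bool
}

// Reporter queues events, consulting the injected Skipper in Submit.
type Reporter struct {
	skipper Skipper
	queue   []ScalingEvent
}

// Submit enqueues the event unless the Skipper asks to drop it.
func (r *Reporter) Submit(event ScalingEvent) {
	if r.skipper != nil && r.skipper.Skip(event) {
		return
	}
	r.queue = append(r.queue, event)
}

// unchangedGoalSkipper skips events whose goal CU matches the last one it let through.
type unchangedGoalSkipper struct {
	last *uint32
}

func (s *unchangedGoalSkipper) Skip(event ScalingEvent) bool {
	if s.last != nil && *s.last == event.GoalCU {
		return true
	}
	cu := event.GoalCU
	s.last = &cu
	return false
}

func main() {
	r := &Reporter{skipper: &unchangedGoalSkipper{}}
	for _, cu := range []uint32{2, 2, 3} {
		r.Submit(ScalingEvent{EndpointID: "ep-example", GoalCU: cu})
	}
	fmt.Println(len(r.queue), "events queued")
}
```

The limiter's state then lives entirely in the Skipper implementation, and Submit takes a single argument.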

@@ -322,6 +333,102 @@ func (r *Runner) Run(ctx context.Context, logger *zap.Logger, vmInfoUpdated util
}
}

func (r *Runner) reportScalingEvent(timestamp time.Time, currentCU, targetCU uint32) {
Contributor

Suggestion: use more consistent names. Right now there are Real vs Hypothetical and ScalingEvent vs DesiredScaling.

reportRealScaling and reportHypotheticalScaling would be fine, but that suggests a hypothetical value cannot be real. Maybe Actual vs Desired?

Member Author

+1 on more consistent naming. Let me explain where I'm coming from; I want your thoughts:

IIRC, the thing I was trying to distinguish here is that the "hypothetical"/"desired" scaling events can go far beyond the endpoint limits, and also contain the fractional CU values for each "part" (cpu/mem/lfc). We also change these events if the components change significantly, even if the overall scaling is still the same (so not every "hypothetical"/"desired" scaling event constitutes actual scaling).

I think I opted not to go fully for "desired" scaling also because IIRC "desired" is used elsewhere in the autoscaler-agent to mean "the scaling value we should be working towards", and is restricted by endpoint CU limits.

Thoughts? Happy to go with Actual vs Desired if you think that makes more sense.

Comment on lines +43 to +44
ScalingEvent ReportScalingEventCallback
DesiredScaling ReportDesiredScalingCallback
Contributor

Suggestion: maybe instead of having those as callbacks, define a new adapter interface, and pass it like this?

Passing an interface feels more idiomatic.

@@ -727,8 +736,20 @@ func (s *state) desiredResourcesFromMetricsOrRequestedUpscaling(now time.Time) (
// 2. Cap the goal CU by min/max, etc
// 3. that's it!

reportGoals := func(goalCU uint32, parts scalingevents.GoalCUComponents) {
Contributor

Suggestion: Instead of having a callback here, we could merge scalingevents.GoalCUComponents into scalingGoal, define a GoalCU() method on it that returns the max, and pass this object to the DesiredScaling callback.

WDYT?

Member Author

Hm, how will that interact with things like #1129 / #1140? Otherwise I like this idea; I think it's a lot simpler.

Contributor

> Hm, how will that interact with things like #1129 / #1140?

I don't think there is any significant interaction: both PRs make scalingGoal public, and that should be it.
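The merged type discussed in this thread might look roughly like the sketch below. The field names and the `ReportDesiredScaling` callback type are hypothetical, and the PR's rounding details are deliberately left out: `GoalCU()` here simply rounds the largest component up.

```go
package main

import (
	"fmt"
	"math"
)

// ScalingGoal carries the fractional per-component goals.
// The field names are assumptions made for this sketch.
type ScalingGoal struct {
	CPU float64
	Mem float64
	LFC float64
}

// GoalCU returns the overall goal: the largest component, rounded up.
func (g ScalingGoal) GoalCU() uint32 {
	return uint32(math.Ceil(max(g.CPU, g.Mem, g.LFC)))
}

// ReportDesiredScaling is a hypothetical callback type that receives the
// whole goal, so the overall value and its parts travel together.
type ReportDesiredScaling func(goal ScalingGoal)

func main() {
	var report ReportDesiredScaling = func(goal ScalingGoal) {
		fmt.Printf("goal=%d parts=%+v\n", goal.GoalCU(), goal)
	}
	report(ScalingGoal{CPU: 1.2, Mem: 2.6, LFC: 0.4})
}
```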
