
fix(HMS-2181): statuser throttling configuration options #608

Merged
1 commit merged into RHEnVision:main from delay-statuser on Jul 20, 2023

Conversation

@lzap (Member) commented Jul 19, 2023

We are tracking a ticket to implement caching of source availability checks when necessary (AWS API limits). However, the numbers from both stage and production show that we do hundreds of checks per hour, so it would be premature to work on this.

https://issues.redhat.com/browse/HMS-1244

However, it makes sense to prepare app configuration values just in case we hit some API limits and need to slow down the rate of availability checks. One configuration value (delay) can be used to arbitrarily slow down the pace of checks per hyperscaler. Another config value (rate) can be used to randomly skip checks in case we need to buy time for a proper caching implementation.

We have about 150 checks in total on stage, and production is currently similar. Therefore I suggest starting with the default value of 1 second, which leaves plenty of room for growth. If we start getting Kafka lag (we have an SLO for that), we can easily either shorten the delay or enable dice rolling (e.g. every 2 out of 3 checks will be skipped on average = rate 0.33).
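For illustration, here is a minimal sketch of how the two options could be applied in the statuser loop. The `throttle` type, field names, and wiring are placeholders for this comment, not the actual code in the PR:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// throttle mirrors the two options described above; the real option
// names and wiring in the PR may differ (this is only an illustration).
type throttle struct {
	delay time.Duration // pause after each availability check, e.g. 1s
	rate  float64       // probability that a check is performed; 1.0 = never skip
}

func main() {
	t := throttle{delay: time.Second, rate: 1.0}
	for i := 0; i < 3; i++ {
		// Dice roll: with rate 0.33 roughly every 2 out of 3 checks
		// would be skipped on average.
		if rand.Float64() >= t.rate {
			continue
		}
		fmt.Println("performing availability check", i)
		// Arbitrary per-hyperscaler slowdown to stay under API limits.
		time.Sleep(t.delay)
	}
}
```

With roughly 150 checks per run and a 1 second delay, a full pass stays in the low minutes, which is why 1 second looks like a safe starting point.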

@ezr-ondrej (Member)

I do love the approach, but why are we implementing it when we do not see a problem? 🤔
Given the PR is not very complex, can't we implement it once there is an issue?
Or do you see us hitting the AWS limits soon?

@lzap (Member, Author) commented Jul 19, 2023

why are we implementing it when we do not see a problem

I want to be ready if something happens; in that case it will take us just an hour to get the app-interface configuration changed, which can be done for stage or production quite quickly. After all, this is why we chose to run the statuser as a single pod, so this is the ultimate goal of our effort - not implementing it actually feels weird.

@lzap (Member, Author) commented Jul 19, 2023

Rebased the example config file, why do I always forget? :-D

@avitova (Member) left a comment

I think that this is a valid approach. 👍 TY

@adiabramovitch (Member) commented Jul 20, 2023

What about the skipped requests? Shouldn't we address those whenever possible? Or do we prefer to wait for another identical request to arrive from the user (if it occurs)?

@lzap (Member, Author) commented Jul 20, 2023

What is a "skipped request"?

@lzap (Member, Author) commented Jul 20, 2023

Oh, so you mean when a probability-based check is skipped while this feature is enabled.

Indeed, this is a problem for user-initiated checks. We asked the sources team to solve this in https://issues.redhat.com/browse/RHCLOUD-22776, but until then, if we ever enabled the probability, it would probably be only a temporary measure to buy us some time. As you can see, there is no skipping in the default configuration (default value is 1.0 = all checks are performed).
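To make that concrete, a quick sketch under the assumption that the skip decision compares the rate against a uniform random number (not the literal code from this PR):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// rand.Float64() returns values in [0.0, 1.0), so with the default
	// rate of 1.0 the skip condition can never trigger: every check runs.
	rate := 1.0
	skipped := 0
	for i := 0; i < 1000; i++ {
		if rand.Float64() >= rate {
			skipped++
		}
	}
	fmt.Println("skipped checks:", skipped) // always 0 with the default rate
}
```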

@adiabramovitch merged commit dce8174 into RHEnVision:main on Jul 20, 2023
6 checks passed
@lzap deleted the delay-statuser branch on July 20, 2023 13:17