Perform background refresh of credentials during preempt expiry period #3541
base: v4-development
Is the problem you're seeing related to #2464? That issue also mentions the |
Yes, looks to be related. This PR would seem to resolve the issue reported there as well.
@dscpinheiro Please let me know if there's any information I can add or questions I can answer to help make this PR ready for review. Thanks in advance!
Would you mind targeting the |
c221644 to fa4741f
@dscpinheiro I've rebased and re-targeted the PR to the |
I just faced a similar issue where the refreshing credential (in our case AssumeRoleAWSCredentials) waited for the default 100-second STS timeout to elapse before finally processing the call to whatever service was using the credential. As far as I can tell, there is no way to configure the timeout for that call; I tried various things.
Tagging @normj for visibility
The branch has gotten a bit stale with the latest V4 changes, but I'm good with the changes. Can you refresh your branch with the latest V4 and take care of the conflicts?
internal TimeSpan GetTimeToLive(TimeSpan preemptExpiryTime)
{
    var now = AWSSDKUtils.CorrectedUtcNow;
    var exp = Expiration.ToUniversalTime();
The ToUniversalTime call should be removed. We recently merged the following PR in V4 to address the SDK's inconsistencies between UTC and local time and made the SDK use UTC: #3572
Done and done.
5c8605f to 4b91af0
Change looks good. I'm running the change through the internal build system for validation.
Internal build was successful. @dscpinheiro will do a second pass.
PreemptExpiryTime period.
Description
Adds a new method, GetTimeToLive, to the RefreshingAWSCredentials.CredentialsRefreshState class, which calculates the remaining time to live (TTL) for a credential, adjusting for the "baked-in" preempt expiry time period. Within the GetCredentials(Async) method, the TTL is used to determine whether the current credentials are in one of three states: valid, expired, or valid but within the preempt expiry period. When in that last state, the current (valid, non-expired) credentials will be returned and a background refresh of the credentials will be attempted. If there is already an in-flight attempt (inline or background) to refresh the credentials, then a new background refresh will not be triggered. When in the expired state, an inline request to generate new credentials will still be triggered; however, after acquiring the mutual exclusion lock, the current credentials will be re-evaluated for whether they are still expired or not. This double-check helps to elide calls to GenerateNewCredentials(Async) when multiple tasks were in queue to acquire the refresh credentials lock, and preserves the existing behavior, which contains the expiry check within the lock.
Motivation and Context
We have encountered an issue in our containerized HTTP API services that talk to AWS services (such as DynamoDB) while they are under load. The root cause is not an issue with the AWS SDK; however, an interesting cascading effect we have observed is "blips" in response times during an AWS credential refresh, in many cases leading to client request timeouts.
In the current implementation of the RefreshingAWSCredentials class, every call to GetCredentialsAsync will attempt to obtain exclusive access by calling SemaphoreSlim.WaitAsync(). When the GenerateNewCredentialsAsync call is delayed, all calls to obtain credentials are blocked. In our service, since every incoming request makes at least one AWS service call, this effectively blocks all requests until the call completes. This then leads to increased memory usage as all task continuations are enqueued with the SemaphoreSlim. If enough of these continuations are enqueued, GC pressure mounts, with the GC consuming more CPU time but unable to remove any of the rooted contexts, ultimately resulting in a self-reinforcing feedback loop where the process spends most of its time in futile GC attempts. In the image below, there are over 1,000 continuations enqueued waiting for the new credentials to be generated, which consumes around 100 MB.
This PR attempts to bypass any delays (and lock contention in general) with generating new credentials by attempting to perform a refresh of the credentials using a single background task.
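The three-state evaluation and the single in-flight refresh described under Description can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: CredentialState, TtlSketch, Classify, and BackgroundRefreshGate are hypothetical names, the clock is passed in explicitly rather than read from AWSSDKUtils.CorrectedUtcNow, and the sketch assumes the stored expiration already has the preempt expiry period subtracted (the "baked-in" adjustment), so it adds that period back to recover the real remaining lifetime.

```csharp
using System;
using System.Threading;

// Hypothetical names throughout; this is not the SDK's implementation.
public enum CredentialState
{
    Valid,           // plenty of lifetime left, return as-is
    InPreemptWindow, // still usable, but a background refresh should start
    Expired          // must be refreshed inline before use
}

public static class TtlSketch
{
    // Assumption: storedExpirationUtc already has preemptExpiryTime subtracted,
    // so adding it back yields the time until the credentials really expire.
    public static TimeSpan GetTimeToLive(
        DateTime storedExpirationUtc, TimeSpan preemptExpiryTime, DateTime nowUtc)
    {
        return storedExpirationUtc - nowUtc + preemptExpiryTime;
    }

    // Maps a TTL onto the three states named in the description.
    public static CredentialState Classify(TimeSpan timeToLive, TimeSpan preemptExpiryTime)
    {
        if (timeToLive <= TimeSpan.Zero)
            return CredentialState.Expired;
        if (timeToLive <= preemptExpiryTime)
            return CredentialState.InPreemptWindow;
        return CredentialState.Valid;
    }
}

// One possible way to make sure only a single refresh is in flight at a time.
public sealed class BackgroundRefreshGate
{
    private int _inFlight; // 0 = idle, 1 = a refresh is running

    // Returns true only for the caller that moves the gate from idle to running.
    public bool TryBegin() => Interlocked.CompareExchange(ref _inFlight, 1, 0) == 0;

    // Marks the refresh as finished so a later caller can start a new one.
    public void End() => Interlocked.Exchange(ref _inFlight, 0);
}
```

Under these assumptions, with a 5-minute preempt period and a stored expiration 2 minutes in the past, the TTL is 3 minutes and the credentials fall in the preempt window: they are returned immediately, and only the first caller for which TryBegin returns true starts the background refresh.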
Testing
We were able to reproduce the issue by using a custom implementation of AssumeRoleWithWebIdentityCredentials which allowed us to introduce a configurable amount of delay in the GenerateNewCredentialsAsync method. Additionally, we configured the PreemptExpiryTime value to be 59 minutes so that new credentials would be generated every 1 minute.
New unit tests were added to the solution to cover both existing and new functionality of the RefreshingAWSCredentials class.
Screenshots (if appropriate)
Types of changes
Checklist
License