
Perform background refresh of credentials during preempt expiry period #3541

Open
wants to merge 3 commits into base: v4-development
Conversation


@kevinstapleton kevinstapleton commented Nov 1, 2024

  • Enable a "background" refresh of the credentials when the current credentials state is expired but within the PreemptExpiryTime period.

Description

Adds a new method, GetTimeToLive, to the RefreshingAWSCredentials.CredentialsRefreshState class, which calculates the remaining time to live (TTL) for a credential, adjusting for the "baked-in" preempt expiry time period.

Within the GetCredentials(Async) methods, the TTL is used to determine which of three states the current credentials are in: valid, expired, or valid but within the preempt expiry period. In that last state, the current (valid, non-expired) credentials are returned and a background refresh of the credentials is attempted. If there is already an in-flight attempt (inline or background) to refresh the credentials, a new background refresh is not triggered.

In the expired state, an inline request to generate new credentials is still triggered; however, after acquiring the mutual exclusion lock, the current credentials are re-evaluated to check whether they are still expired. This double-check helps elide calls to GenerateNewCredentials(Async) when multiple tasks were queued to acquire the refresh-credentials lock, and it preserves the existing behavior of performing the expiry check within the lock.
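A minimal sketch of the TTL calculation and the three-state check described above (not the PR's exact code; it assumes the preempt expiry time was subtracted from Expiration when the refresh state was created, i.e. "baked in"):

```csharp
// Hypothetical sketch -- the actual PR implementation may differ.
// Assumes CredentialsRefreshState.Expiration already has the preempt expiry
// time subtracted, so the true expiry is Expiration + preemptExpiryTime.
internal TimeSpan GetTimeToLive(TimeSpan preemptExpiryTime)
{
    var now = AWSSDKUtils.CorrectedUtcNow;
    return (Expiration + preemptExpiryTime) - now;
}

// The three states derived from the TTL in GetCredentials(Async):
//   ttl > preemptExpiryTime       -> valid: return cached credentials
//   0 < ttl <= preemptExpiryTime  -> valid but inside the preempt window:
//                                    return cached credentials, start a
//                                    background refresh (unless one is in flight)
//   ttl <= 0                      -> expired: refresh inline, re-checking
//                                    expiry after acquiring the lock
```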

Motivation and Context

We have encountered an issue in our containerized HTTP API services that talk to AWS services (such as DynamoDB) while they are under load. The root cause is not an issue with the AWS SDK; however, an interesting cascading effect we have observed is "blips" in response times during an AWS credential refresh, in many cases leading to client request timeouts.

In the current implementation of the RefreshingAWSCredentials class, every call to GetCredentialsAsync attempts to obtain exclusive access by calling SemaphoreSlim.WaitAsync(). When the GenerateNewCredentialsAsync call is delayed, all calls to obtain credentials are blocked. In our service, since every incoming request makes at least one AWS service call, this effectively blocks all requests until the refresh completes. This in turn increases memory usage, as all task continuations are enqueued on the SemaphoreSlim. If enough of these continuations accumulate, GC pressure mounts, with the GC consuming more CPU time while unable to collect any of the rooted contexts, ultimately producing a negative feedback loop in which the process spends most of its time in futile GC attempts. In the screenshot below, over 1,000 continuations are enqueued waiting for the new credentials to be generated, consuming around 100 MB.
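Schematically, the blocking path looks like this (a simplified sketch, not the SDK's exact code; member names such as _updateGeneratedCredentialsSemaphore, ShouldUpdateState, and currentState are illustrative):

```csharp
// Simplified sketch of the pre-existing refresh path: every caller awaits the
// semaphore, so a single slow GenerateNewCredentialsAsync call stalls all of
// them, and each blocked caller roots a task continuation in the semaphore's queue.
private readonly SemaphoreSlim _updateGeneratedCredentialsSemaphore = new SemaphoreSlim(1, 1);

public virtual async Task<ImmutableCredentials> GetCredentialsAsync()
{
    await _updateGeneratedCredentialsSemaphore.WaitAsync().ConfigureAwait(false);
    try
    {
        if (ShouldUpdateState())
        {
            // All other callers queue up here while this (possibly slow) call runs.
            currentState = await GenerateNewCredentialsAsync().ConfigureAwait(false);
        }
        return currentState.Credentials;
    }
    finally
    {
        _updateGeneratedCredentialsSemaphore.Release();
    }
}
```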

This PR attempts to bypass the delays (and lock contention in general) associated with generating new credentials by performing the refresh in a single background task.
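One common shape for such a single-flight background refresh is a compare-and-swap guard (a sketch under assumed names, not necessarily the PR's implementation):

```csharp
// Hypothetical single-flight guard: at most one background refresh runs at a
// time; callers inside the preempt window return their cached (still valid)
// credentials immediately instead of waiting on the refresh.
private int _refreshInFlight; // 0 = idle, 1 = a refresh attempt is running

private void TryStartBackgroundRefresh()
{
    // Atomically claim the refresh slot; bail out if another attempt
    // (inline or background) is already in flight.
    if (Interlocked.CompareExchange(ref _refreshInFlight, 1, 0) != 0)
        return;

    _ = Task.Run(async () =>
    {
        try
        {
            currentState = await GenerateNewCredentialsAsync().ConfigureAwait(false);
        }
        catch
        {
            // Swallow: callers still hold credentials that are valid until the
            // true expiry, and the next call in the preempt window will retry.
        }
        finally
        {
            Interlocked.Exchange(ref _refreshInFlight, 0);
        }
    });
}
```

The Interlocked guard means the "only one in-flight refresh" invariant holds without taking the main credentials lock on the hot path.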

Testing

We were able to reproduce the issue by using a custom implementation of AssumeRoleWithWebIdentityCredentials which allowed us to introduce a configurable amount of delay in the GenerateNewCredentialsAsync method. Additionally, we configured the PreemptExpiryTime value to be 59 minutes so that new credentials would be generated every 1 minute.
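The repro can be sketched roughly as follows (a hypothetical test double; the class name, constructor, and delay property are assumptions, and PreemptExpiryTime accessibility may differ across SDK versions):

```csharp
using System;
using System.Threading.Tasks;
using Amazon.Runtime;

// Hypothetical test double that injects a configurable delay into credential
// generation, mirroring the custom AssumeRoleWithWebIdentityCredentials
// implementation described above.
internal sealed class SlowRefreshingCredentials : RefreshingAWSCredentials
{
    private readonly TimeSpan _generationDelay;

    public SlowRefreshingCredentials(TimeSpan generationDelay)
    {
        _generationDelay = generationDelay;
        // The credentials below last 60 minutes, so a 59-minute preempt expiry
        // forces a refresh attempt roughly every minute.
        PreemptExpiryTime = TimeSpan.FromMinutes(59);
    }

    protected override async Task<CredentialsRefreshState> GenerateNewCredentialsAsync()
    {
        // Simulate a slow STS call.
        await Task.Delay(_generationDelay).ConfigureAwait(false);
        return new CredentialsRefreshState(
            new ImmutableCredentials("accessKeyId", "secretKey", "sessionToken"),
            DateTime.UtcNow.AddMinutes(60));
    }
}
```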

New unit tests were added to the solution to cover both existing and new functionality of the RefreshingAWSCredentials class.

Screenshots (if appropriate)

[Screenshot 2024-10-10 085240: memory snapshot showing over 1,000 task continuations enqueued on the SemaphoreSlim, consuming around 100 MB]

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have read the README document
  • I have added tests to cover my changes
  • All new and existing tests passed

License

  • I confirm that this pull request can be released under the Apache 2 license

@dscpinheiro (Contributor) commented Nov 4, 2024

Is the problem you're seeing related to #2464? That issue also mentions the AssumeRoleWithWebIdentityCredentials provider.

@kevinstapleton (Author)

> Is the problem you're seeing related to #2464? That issue also mentions the AssumeRoleWithWebIdentityCredentials provider.

Yes, looks to be related. This PR would seem to resolve the issue reported there as well.

@kevinstapleton kevinstapleton marked this pull request as ready for review November 4, 2024 15:42
@dscpinheiro dscpinheiro changed the base branch from main to main-staging November 4, 2024 18:42
@kevinstapleton (Author)

@dscpinheiro Please let me know if there's any information I can add or questions I can answer to help make this PR ready for review. Thanks in advance!

@dscpinheiro (Contributor)

Would you mind targeting the v4-development branch? We have removed some obsolete properties you may have used in your PR, and since the next version of the .NET SDK is still in preview, targeting it allows us to check whether any behavior breaks when changing how the credentials provider works.

@kevinstapleton kevinstapleton force-pushed the kstapleton/background-cred-refresh branch from c221644 to fa4741f Compare November 7, 2024 15:51
@kevinstapleton kevinstapleton changed the base branch from main-staging to v4-development November 7, 2024 15:51
@kevinstapleton (Author)

@dscpinheiro I've rebased and re-targeted the PR to the v4-development branch. The only difference seems to be that AWSSDKUtils.CorrectedUtcNow is no longer marked obsolete, so the warning suppressions could be removed.

@sylvainduchesne

I just faced a similar issue where the refreshing credential (in our case AssumeRoleAWSCredentials) waited for the default STS timeout of 100 seconds before the call to whatever service was using the credential could finally proceed.

As far as I can tell, there is no way to really configure the timeout for that call; I tried various things. I'm pretty certain this PR would fix the issue, somewhat indirectly, but perhaps there is a way to actually configure this timeout? It seems we would need to specify the STS client for the refreshing credential, but I fail to see how to do this...

@ozziepeeps

Tagging @normj for visibility

@normj (Member) left a comment

The branch has gotten a bit stale with the latest v4 changes, but I'm good with the changes. Can you refresh your branch with the latest v4 and take care of the conflicts?

internal TimeSpan GetTimeToLive(TimeSpan preemptExpiryTime)
{
var now = AWSSDKUtils.CorrectedUtcNow;
var exp = Expiration.ToUniversalTime();
@normj (Member)

The ToUniversalTime should be removed. We recently merged #3572 in V4 to address the SDK's inconsistencies between UTC and local time and made the SDK use UTC.

@kevinstapleton (Author)

Done and done.

@kevinstapleton kevinstapleton force-pushed the kstapleton/background-cred-refresh branch from 5c8605f to 4b91af0 Compare December 20, 2024 15:12
@normj (Member) left a comment

Change looks good. I'm running the change through the internal build system for validation.

@normj (Member) commented Dec 20, 2024

Internal build was successful. @dscpinheiro will do a second pass.
