DynamoDB: use of closed network connection (w/ repro) #2737

Open · 2 tasks done
chrisr-spectra opened this issue Aug 7, 2024 · 11 comments
Labels: bug (This issue is a bug.), p2 (This is a standard priority issue), queued (This issue is on the AWS team's backlog)

chrisr-spectra commented Aug 7, 2024

Acknowledgements

Describe the bug

This is a reproduction of #1825 with the tracing suggested by @lucix-aws

2024-08-07T03:26:57.822Z !!! READ ON CLIENT-CLOSED CONN !!!
2024-08-07T03:26:57.822Z goroutine 191 [running]:
2024-08-07T03:26:57.822Z runtime/debug.Stack()
2024-08-07T03:26:57.822Z /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
2024-08-07T03:26:57.822Z github.com/---OurCode---/dynclient.(*tracedConn).Close(0xc0008bbbc0)
2024-08-07T03:26:57.822Z /go/pkg/mod/github.com/---OurCode---/client.go:53 +0x18
2024-08-07T03:26:57.822Z crypto/tls.(*Conn).Close(0xc00027d180?)
2024-08-07T03:26:57.822Z /usr/local/go/src/crypto/tls/conn.go:1428 +0xc8
2024-08-07T03:26:57.822Z net/http.(*persistConn).closeLocked(0xc0009c6000, {0x1d177e0, 0x2707f80})
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2746 +0x13a
2024-08-07T03:26:57.822Z net/http.(*persistConn).close(0xc00016d900?, {0x1d177e0?, 0x2707f80?})
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2731 +0xa8
2024-08-07T03:26:57.822Z net/http.(*persistConn).writeLoop(0xc0009c6000)
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2447 +0x29b
2024-08-07T03:26:57.822Z created by net/http.(*Transport).dialConn in goroutine 198
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:1777 +0x16f1
2024-08-07T03:26:57.822Z !!! -(end trace)-------------- !!!

Expected Behavior

dynamodb Client.Query to return without error

Current Behavior

dynamodb Client.Query returns error:

operation error DynamoDB: Query, https response error StatusCode: 200, RequestID: GUR48MP1C12O956O84J2H67F7NVV4KQNSO5AEMVJF66Q9ASUAAJG, deserialization failed, failed to decode response body, read tcp 169.254.76.1:56796->52.94.0.72:443: use of closed network connection

Reproduction Steps

// db is a *dynamodb.Client; the remaining variables are application-defined.
result, err := db.Query(
        ctx,
        &dynamodb.QueryInput{
                TableName:                 aws.String(name),
                ConsistentRead:            aws.Bool(true),
                Limit:                     aws.Int32(count),
                KeyConditionExpression:    aws.String(keyConditionExpression),
                ExpressionAttributeValues: expressionAttributeValues,
                ExclusiveStartKey:         exclusiveStartKey,
        },
)
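
A more self-contained sketch of the same call is below; the table name, key schema, and the load loop are placeholders (the original code runs inside a Lambda handler):

package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	db := dynamodb.NewFromConfig(cfg)

	// Hammer Query to surface the intermittent "use of closed network connection" error.
	for i := 0; i < 1000; i++ {
		_, err := db.Query(ctx, &dynamodb.QueryInput{
			TableName:                aws.String("my-table"), // placeholder table
			ConsistentRead:           aws.Bool(true),
			Limit:                    aws.Int32(25),
			KeyConditionExpression:   aws.String("#pk = :pk"),
			ExpressionAttributeNames: map[string]string{"#pk": "pk"}, // placeholder key attribute
			ExpressionAttributeValues: map[string]types.AttributeValue{
				":pk": &types.AttributeValueMemberS{Value: "example"},
			},
		})
		if err != nil {
			log.Printf("query %d failed: %v", i, err)
		}
	}
}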

Possible Solution

No response

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

    github.com/aws/aws-lambda-go v1.47.0
    github.com/aws/aws-sdk-go-v2 v1.27.1
    github.com/aws/aws-sdk-go-v2/service/s3 v1.54.4
    github.com/aws/aws-sdk-go-v2/service/sns v1.29.9
    github.com/aws/smithy-go v1.20.2
    github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.2 // indirect
    github.com/aws/aws-sdk-go-v2/config v1.27.17 // indirect
    github.com/aws/aws-sdk-go-v2/credentials v1.17.17 // indirect
    github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.4 // indirect
    github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.8 // indirect
    github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.8 // indirect
    github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 // indirect
    github.com/aws/aws-sdk-go-v2/internal/v4a v1.3.8 // indirect
    github.com/aws/aws-sdk-go-v2/service/dynamodb v1.32.7 // indirect
    github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.2 // indirect
    github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.3.10 // indirect
    github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.9.9 // indirect
    github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.10 // indirect
    github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.17.8 // indirect
    github.com/aws/aws-sdk-go-v2/service/kms v1.32.2 // indirect
    github.com/aws/aws-sdk-go-v2/service/sqs v1.32.4 // indirect
    github.com/aws/aws-sdk-go-v2/service/sso v1.20.10 // indirect
    github.com/aws/aws-sdk-go-v2/service/ssooidc v1.24.4 // indirect
    github.com/aws/aws-sdk-go-v2/service/sts v1.28.11 // indirect

Compiler and Version used

go version go1.21.3 linux/amd64

Operating System and version

lambda runtime provided.al2

chrisr-spectra added the bug and needs-triage labels on Aug 7, 2024

zapisanchez commented Aug 8, 2024

same here

goroutine 893760 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
parser/cmd/parser/storage.(*tracedConn).Close(0xc00070e840)
/go/src/cmd/parser/storage/traceconn.go:40 +0x18
crypto/tls.(*Conn).Close(0xc0000e8dd0?)
/usr/local/go/src/crypto/tls/conn.go:1429 +0xc8
net/http.(*persistConn).closeLocked(0xc000745200, {0xce89a0, 0x1101eb0})
/usr/local/go/src/net/http/transport.go:2783 +0x13a
net/http.(*persistConn).close(0xc0002e08c0?, {0xce89a0?, 0x1101eb0?})
/usr/local/go/src/net/http/transport.go:2768 +0xa8
net/http.(*persistConn).closeConnIfStillIdle(0xc000745200)
/usr/local/go/src/net/http/transport.go:2040 +0xbe
created by time.goFunc
/usr/local/go/src/time/sleep.go:177 +0x2d

@chrisr-spectra (Author)

For what it's worth, I actually get two different stack traces when I run with the tracing code. The one I provided in the description matches the time I saw the closed network connection error, and happens very infrequently. However, I frequently get this trace:

2024-08-07T03:26:22.589Z !!! READ ON CLIENT-CLOSED CONN !!!
2024-08-07T03:26:22.589Z goroutine 181 [running]:
2024-08-07T03:26:22.589Z runtime/debug.Stack()
2024-08-07T03:26:22.589Z /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
2024-08-07T03:26:22.589Z github.com/---OurCode---/dynclient.(*tracedConn).Close(0xc00021a040)
2024-08-07T03:26:22.589Z /go/pkg/mod/github.com/---OurCode---/client.go:53 +0x18
2024-08-07T03:26:22.589Z crypto/tls.(*Conn).Close(0xc00027d180?)
2024-08-07T03:26:22.589Z /usr/local/go/src/crypto/tls/conn.go:1428 +0xc8
2024-08-07T03:26:22.589Z net/http.(*persistConn).closeLocked(0xc000672240, {0x1d177e0, 0x2708ba0})
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2746 +0x13a
2024-08-07T03:26:22.589Z net/http.(*persistConn).close(0xc00027d180?, {0x1d177e0?, 0x2708ba0?})
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2731 +0xa8
2024-08-07T03:26:22.589Z net/http.(*persistConn).closeConnIfStillIdle(0xc000672240)
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2017 +0xbe
2024-08-07T03:26:22.589Z created by time.goFunc
2024-08-07T03:26:22.589Z /usr/local/go/src/time/sleep.go:176 +0x2d
2024-08-07T03:26:22.589Z !!! -(end trace)-------------- !!!

but it doesn't seem to correspond to an error in my client code.
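
For context, the traces above come from a wrapper around net.Conn that captures a stack trace when the HTTP transport closes the connection. A minimal sketch of such a wrapper follows; the real tracedConn lives in the reporter's code, so the package name, log message, and wiring here are guesses:

package dyntrace

import (
	"context"
	"log"
	"net"
	"net/http"
	"runtime/debug"
)

// tracedConn wraps a net.Conn and logs a stack trace whenever the
// HTTP transport closes the connection.
type tracedConn struct {
	net.Conn
}

func (c *tracedConn) Close() error {
	log.Printf("conn closed by client:\n%s", debug.Stack())
	return c.Conn.Close()
}

// NewTracingClient returns an *http.Client whose dialed connections are
// wrapped in tracedConn. It can be handed to the SDK via, for example,
// config.LoadDefaultConfig(ctx, config.WithHTTPClient(NewTracingClient())).
func NewTracingClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				conn, err := (&net.Dialer{}).DialContext(ctx, network, addr)
				if err != nil {
					return nil, err
				}
				return &tracedConn{Conn: conn}, nil
			},
		},
	}
}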

@lucix-aws (Contributor)

TYVM for taking the time to repro. I'll try to deep dive this week or early next.

@lucix-aws (Contributor)

These traces certainly appear to demonstrate that it's net/http that's ultimately closing the conns out from under us. Whether we indirectly caused said close remains to be seen.

RanVaknin added the queued and p2 labels and removed the needs-triage label on Aug 9, 2024
@kuadrosx

@lucix-aws hi, I have a theory, but I do not know how to prove it:

I think the real issue is hidden because the error from closing the request body is ignored here: https://github.com/aws/smithy-go/blob/bed421c3d7515856512061d107cf2ae3254cf3b8/transport/http/client.go#L116

I also saw similar issues in aws-sdk-go (v1), but the error there was "connection reset by peer"; maybe they are related to this issue? https://github.com/search?q=repo%3Aaws%2Faws-sdk-go+connection+reset+by+peer&type=issues

lucix-aws self-assigned this on Dec 16, 2024
lucix-aws changed the title from "Reproduction of #1825 (use of closed network connection) with trace" to "DynamoDB: use of closed network connection (w/ repro)" on Dec 16, 2024
@lucix-aws (Contributor)

Copying this comment I left on the original issue:

You can make the underlying error net.ErrClosed retryable in your client with the following:

import (
	"errors"
	"net"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// retryErrClosed wraps the configured retryer and additionally retries net.ErrClosed.
type retryErrClosed struct {
	aws.Retryer
}

func (r *retryErrClosed) IsErrorRetryable(err error) bool {
	return errors.Is(err, net.ErrClosed) || r.Retryer.IsErrorRetryable(err)
}

svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
	o.Retryer = &retryErrClosed{o.Retryer}
})

This should be safe to do for get-type operations such as dynamodb.Query.
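
The same pattern applied to the DynamoDB client from this issue would look roughly like this; it assumes the retryErrClosed type from the snippet above and an already-loaded cfg:

db := dynamodb.NewFromConfig(cfg, func(o *dynamodb.Options) {
	// Reuse the retryErrClosed wrapper so dynamodb.Query retries net.ErrClosed.
	o.Retryer = &retryErrClosed{o.Retryer}
})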

@lucix-aws (Contributor)

There's been a lot of back-and-forth about the premature request body close we're doing here. Some commenters claim that removing it in a fork fixes the issue for them. At least one person commented that removing it temporarily fixed the issue, but it eventually reappeared after throughput increased.

Unless I've missed something... this is a total red herring. We need to take a closer look at what this code is actually doing in practice. Let's look at dynamodb.Query, an oft-mentioned operation around this issue, as an example.

Some background: dynamodb.Query is a request-response style operation - we serialize the input structure to the wire and deserialize the HTTP response into the output structure. All of the request/response body Close()ing and such are opaque to the caller. This is different from a streaming-blob style operation like S3 GetObject - where you as the caller literally pass in a stream that we use as the http request body directly.

So when you call Query:

  1. we turn QueryInput into a json payload to be sent over the wire
  2. in the OperationSerializer middleware, we pass that json payload to smithyhttp.Request.SetStream() (code-generated source)
  3. in the ComputeContentLength middleware, this check happens - which causes the http.Request.ContentLength field to be set, since the length of the underlying request stream is known here (it's just a bytes.Reader underneath)
  4. We call smithyhttp.Request.Build() - which wraps the plain io.Reader stream previously set by SetStream in an io.NopCloser - which does nothing and never returns an error - and an additional self-implemented safeReadCloser that makes Close() calls idempotent (source)
  5. we Do() the request using the configured HTTP client
  6. we do the controversial request body close, ignoring the error

So plainly speaking -- the closing behavior that people claim to be problematic is literally a no-op. There is no error or side effect here on the request body close whatsoever, because for request-response style operations, the input is always a byte buffer wrapped in an io.NopCloser. There's no chance for us here, then, to be interfering with the HTTP client's connection lifecycle or anything like that. From a static analysis perspective, there is simply absolutely zero indication that this request body close routine is contributing to the appearance of this networking error.
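
Not the SDK's actual code, but a minimal illustration of the point above: for a request-response operation the serialized payload is a bytes.Reader behind an io.NopCloser, so closing the request body returns nil and never touches the network connection.

package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	payload := bytes.NewReader([]byte(`{"TableName":"example"}`)) // stand-in for the serialized QueryInput
	var body io.ReadCloser = io.NopCloser(payload)                // roughly what SetStream + Build produce

	fmt.Println(body.Close()) // <nil>: closing the body is a no-op
}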


sathishdv commented Dec 17, 2024

@lucix-aws Thanks for the detailed explanation. Based on my recent testing, these "use of closed network connection" errors are thrown when querying larger results that consume around 5-6 RCUs under high throughput. I also tried without setting an HTTP timeout, which technically means no timeout at all. 5-6 RCUs caps out at around 20-24 KB of data, which is small enough not to batch, but under high load this may be causing the connection to be closed prematurely before the complete response is deserialized. I also added filters to cut down the data, so on the wire the size probably shrinks to around 10-12 KB. But I didn't see this issue with the forked version where the closing of the body is commented out.

@lucix-aws (Contributor)

But didn't see this issue with the forked version where the closing of body is commented.

You're experiencing a placebo effect here. Removing that logic does nothing, because the logic itself does nothing as explained above.

@lucix-aws (Contributor)

lucix-aws commented Dec 17, 2024

In all likelihood this is simply a race in the underlying net.Conn read/write loop logic. I'm not an SME here but I've been doing a deep dive on this part of the stdlib and it seems like connection resets occur in one of two ways:

  • a client calls Read() on the response body which triggers a synchronous data pull on the connection. The peer sends an RST and we get an error like "connection reset". We handle this today in retries.
  • the conn read loop asynchronously pulls an RST, which causes the connection to be closed - which is what we see in the stack traces above. In this scenario we get the "use of closed network connection" error. We do not handle this in retries.

The fix is most likely to just add the latter to our set of retryables by default, as my snippet above shows, since these errors are effectively the same thing. Even if they're not -- a retry should get past this anyway.

@sathishdv

@lucix-aws Yes, a retry should solve this, but there's no proper way to categorize the error type. I had to check for certain keywords, i.e. whether the message contains "use of closed network connection", to recognize whether it's retryable or not.
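
For what it's worth, keyword matching should not be necessary: the SDK wraps the underlying transport error, so errors.Is can reach net.ErrClosed through the chain (as the retryer snippet above relies on). A small sketch; the isRetryableClose helper and the synthetic error below are made up for illustration:

package main

import (
	"errors"
	"fmt"
	"net"
)

// isRetryableClose reports whether err is, or wraps, net.ErrClosed.
func isRetryableClose(err error) bool {
	return errors.Is(err, net.ErrClosed)
}

func main() {
	// Simulate a wrapped "use of closed network connection" error.
	err := fmt.Errorf("operation error DynamoDB: Query: %w",
		fmt.Errorf("failed to decode response body: %w", net.ErrClosed))
	fmt.Println(isRetryableClose(err)) // true
}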
