Skip to content

Commit

Permalink
docs: add FAQ/troubleshooting and operation timing guide (#2434)
Browse files Browse the repository at this point in the history
  • Loading branch information
lucix-aws authored Dec 21, 2023
1 parent 54f1c1e commit 3f11379
Show file tree
Hide file tree
Showing 44 changed files with 2,214 additions and 188 deletions.
97 changes: 97 additions & 0 deletions content/en/docs/faq/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
---
title: "Frequently Asked Questions"
linkTitle: "FAQ / Troubleshooting"
description: "Answers to some commonly-asked questions about the {{% alias sdk-go %}}"
weight: 9
---

### How do I configure my SDK's HTTP client? Are there any guidelines or best practices?

We are unable to provide guidance to customers on how to configure their HTTP
workflow in a manner that is most effective for their particular workload. The
answer to this is the product of a multivariate equation, with input factors
including but not limited to:

* the network footprint of the application (TPS, throughput, etc.)
* the services being used
* the compute characteristics of the deployment
* the geographical nature of the deployment
* the desired application behavior or needs of the application itself (SLAs,
timings, etc.)

### How should I configure operation timeouts?

Much like the previous question, it depends. Elements to consider here include
the following:

* All of the above factors concerning HTTP client config
* Your own application timing or SLA constraints (e.g. if you yourself serve
traffic to other consumers)

**The answer to this question should almost NEVER be based on pure empirical
observation of upstream behavior** - e.g. "I made 1000 calls to this operation,
it took at most 5 seconds so I will set the timeout based on that with a safety
factor of 2x to 10 seconds". Environment conditions can change, services can
temporarily degrade, and these types of assumptions can become wrong without
warning.

### Requests made by the SDK are timing out or taking too long, how do I fix this?

We are unable to assist with extended or timed-out operation calls due to
extended time spent on the wire. "Wire time" in the SDK is defined as any of
the following:
* Time spent in an SDK client's `HTTPClient.Do()` method
* Time spent in `Read()`s on an HTTP response body that has been forwarded to
the caller (e.g. `GetObject`)

If you are experiencing issues due to operation latency or timeouts, your first
course of action should be to obtain telemetry of the SDK operation lifecycle
to determine the timing breakdown between time spent on the wire and the
surrounding overhead of the operation. See the guide on
[timing SDK operations]({{< ref "/docs/faq/timing-operations.md" >}}),
which contains a reusable code snippet that can achieve this.

### How do I fix a `read: connection reset` error?

The SDK retries any errors matching the `connection reset` pattern by default.
This will cover error handling for most operations, where the operation's HTTP
response is fully consumed and deserialized into its modeled result type.

However, this error can still occur in a context **outside** of the retry loop:
certain service operations directly forward the API's HTTP response body to the
caller to be consumed from the wire directly via `io.ReadCloser` (e.g.
`GetObject`'s object payload). You may encounter this error when performing a
`Read` on the response body.

This error indicates that your host, the service or any intermediary party
(e.g. NAT gateways, proxies, load balancers) closed the connection while
attempting to read the response.

This can occur for several reasons:
* You did not consume the response body for some time after the response itself
was received (after the service operation was called). **We recommend you
consume the HTTP response body as soon as possible for these types of
operations.**
* You did not close a previously-received response body. This can cause
connection resets on certain platforms. **You MUST close any `io.ReadCloser`
instances provided in an operation's response, regardless of whether you
consume its contents.**

Beyond that, try running a tcpdump for an affected connection at the edge of
your network (e.g. after any proxies that you control). If you see that the AWS
endpoint seems to be sending a TCP RST, you should use the AWS support console
to open a case against the offending service. Be prepared to provide request
IDs and specific timestamps of when the issue occured.

### Why am I getting "invalid signature" errors when using an HTTP proxy with the SDK?

The signature algorithm for AWS services (generally sigv4) is tied to the
serialized request's headers, more specifically most headers prefixed with
`X-`. Proxies are prone to modifying the outgoing request by adding additional
forwarding information (often via an `X-Forwarded-For` header) which
effectively breaks the signature that the SDK calculated.

If you're using an HTTP proxy and experiencing signature errors, you should
work to capture the request **as it appears outgoing from the proxy** and
determine whether it is different.

250 changes: 250 additions & 0 deletions content/en/docs/faq/timing-operations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,250 @@
---
title: "Timing SDK operations"
linkTitle: "Timing Operations"
description: "How to perform basic instrumentation in the {{% alias sdk-go %}} to time SDK operations"
weight: 1
---

When debugging timeout / latency issues in the SDK, it is critical to identify
the components of the operation lifecycle which are taking more time to execute
than expected. As a starting point, you will generally need to inspect the
timing breakdown between the overall operation call and the HTTP call itself.

The following sample program implements a basic instrumentation probe in terms
of `smithy-go` middleware for SQS clients and demonstrates how it is used. The
probe emits the following information for each operation call:

* AWS request ID
* service ID
* operation name
* operation invocation time
* http call time

Each emitted message is prefixed with a unique (to a single operation)
"invocation ID" which is set at the beginning of the handler stack.

The entry point for instrumentation is exposed as `WithOperationTiming`, which
is parameterized to accept a message handling function which will receive
instrumentation "events" as formatted strings. `PrintfMSGHandler` is provided
as a convenience which will simply dump messages to stdout.

The service used here is interchangeable - ALL service client options accept
`APIOptions` and an `HTTPClient` as configuration. For example,
`WithOperationTiming` could instead be declared as:

```go
func WithOperationTiming(msgHandler func(string)) func(*s3.Options)
func WithOperationTiming(msgHandler func(string)) func(*dynamodb.Options)
// etc.
```

If you change it, be sure to change the signature of the function it returns as
well.

```go
import (
"context"
"fmt"
"log"
"net/http"
"sync"
"time"

awsmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware"
awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/sqs"
"github.com/aws/smithy-go/middleware"
smithyrand "github.com/aws/smithy-go/rand"
)

// WithOperationTiming instruments an SQS client to dump timing information for
// the following spans:
// - overall operation time
// - HTTPClient call time
//
// This instrumentation will also emit the request ID, service name, and
// operation name for each invocation.
//
// Accepts a message "handler" which is invoked with formatted messages to be
// handled externally, you can use the declared PrintfMSGHandler to simply dump
// these values to stdout.
func WithOperationTiming(msgHandler func(string)) func(*sqs.Options) {
return func(o *sqs.Options) {
o.APIOptions = append(o.APIOptions, addTimingMiddlewares(msgHandler))
o.HTTPClient = &timedHTTPClient{
client: awshttp.NewBuildableClient(),
msgHandler: msgHandler,
}
}
}

// PrintfMSGHandler writes messages to stdout.
func PrintfMSGHandler(msg string) {
fmt.Printf("%s\n", msg)
}

type invokeIDKey struct{}

func setInvokeID(ctx context.Context, id string) context.Context {
return middleware.WithStackValue(ctx, invokeIDKey{}, id)
}

func getInvokeID(ctx context.Context) string {
id, _ := middleware.GetStackValue(ctx, invokeIDKey{}).(string)
return id
}

// Records the current time, and returns a function to be called when the
// target span of events is completed. The return function will emit the given
// span name and time elapsed to the given message consumer.
func timeSpan(ctx context.Context, name string, consumer func(string)) func() {
start := time.Now()
return func() {
elapsed := time.Now().Sub(start)
consumer(fmt.Sprintf("[%s] %s: %s", getInvokeID(ctx), name, elapsed))
}
}

type timedHTTPClient struct {
client *awshttp.BuildableClient
msgHandler func(string)
}

func (c *timedHTTPClient) Do(r *http.Request) (*http.Response, error) {
defer timeSpan(r.Context(), "http", c.msgHandler)()

resp, err := c.client.Do(r)
if err != nil {
return nil, fmt.Errorf("inner client do: %v", err)
}

return resp, nil
}

type addInvokeIDMiddleware struct {
msgHandler func(string)
}

func (*addInvokeIDMiddleware) ID() string { return "addInvokeID" }

func (*addInvokeIDMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
out middleware.InitializeOutput, md middleware.Metadata, err error,
) {
id, err := smithyrand.NewUUID(smithyrand.Reader).GetUUID()
if err != nil {
return out, md, fmt.Errorf("new uuid: %v", err)
}

return next.HandleInitialize(setInvokeID(ctx, id), in)
}

type timeOperationMiddleware struct {
msgHandler func(string)
}

func (*timeOperationMiddleware) ID() string { return "timeOperation" }

func (m *timeOperationMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
middleware.InitializeOutput, middleware.Metadata, error,
) {
defer timeSpan(ctx, "operation", m.msgHandler)()
return next.HandleInitialize(ctx, in)
}

type emitMetadataMiddleware struct {
msgHandler func(string)
}

func (*emitMetadataMiddleware) ID() string { return "emitMetadata" }

func (m *emitMetadataMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
middleware.InitializeOutput, middleware.Metadata, error,
) {
out, md, err := next.HandleInitialize(ctx, in)

invokeID := getInvokeID(ctx)
requestID, _ := awsmiddleware.GetRequestIDMetadata(md)
service := awsmiddleware.GetServiceID(ctx)
operation := awsmiddleware.GetOperationName(ctx)
m.msgHandler(fmt.Sprintf(`[%s] requestID = "%s"`, invokeID, requestID))
m.msgHandler(fmt.Sprintf(`[%s] service = "%s"`, invokeID, service))
m.msgHandler(fmt.Sprintf(`[%s] operation = "%s"`, invokeID, operation))

return out, md, err
}

func addTimingMiddlewares(mh func(string)) func(*middleware.Stack) error {
return func(s *middleware.Stack) error {
if err := s.Initialize.Add(&timeOperationMiddleware{msgHandler: mh}, middleware.Before); err != nil {
return fmt.Errorf("add time operation middleware: %v", err)
}
if err := s.Initialize.Add(&addInvokeIDMiddleware{msgHandler: mh}, middleware.Before); err != nil {
return fmt.Errorf("add invoke id middleware: %v", err)
}
if err := s.Initialize.Insert(&emitMetadataMiddleware{msgHandler: mh}, "RegisterServiceMetadata", middleware.After); err != nil {
return fmt.Errorf("add emit metadata middleware: %v", err)
}
return nil
}
}

func main() {
cfg, err := config.LoadDefaultConfig(context.Background())
if err != nil {
log.Fatal(fmt.Errorf("load default config: %v", err))
}

svc := sqs.NewFromConfig(cfg, WithOperationTiming(PrintfMSGHandler))

var wg sync.WaitGroup

for i := 0; i < 6; i++ {
wg.Add(1)
go func() {
defer wg.Done()

_, err = svc.ListQueues(context.Background(), nil)
if err != nil {
fmt.Println(fmt.Errorf("list queues: %v", err))
}
}()
}
wg.Wait()
}
```

A sample output of this program:

```
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] http: 192.24067ms
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] requestID = "dbee3082-96a3-5b23-adca-6d005696fa94"
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] service = "SQS"
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation = "ListQueues"
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation: 193.098393ms
[0740f0e0-953e-4328-94fc-830a5052e763] http: 195.185732ms
[0740f0e0-953e-4328-94fc-830a5052e763] requestID = "48b301fa-fc9f-5f1f-9007-5c783caa9322"
[0740f0e0-953e-4328-94fc-830a5052e763] service = "SQS"
[0740f0e0-953e-4328-94fc-830a5052e763] operation = "ListQueues"
[0740f0e0-953e-4328-94fc-830a5052e763] operation: 195.725491ms
[c0589832-f351-4cc7-84f1-c656eb79dbd7] http: 200.52383ms
[444030d0-6743-4de5-bd91-bc40b2b94c55] http: 200.525919ms
[c0589832-f351-4cc7-84f1-c656eb79dbd7] requestID = "4a73cc82-b47b-56e1-b327-9100744e1b1f"
[c0589832-f351-4cc7-84f1-c656eb79dbd7] service = "SQS"
[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation = "ListQueues"
[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation: 201.214365ms
[444030d0-6743-4de5-bd91-bc40b2b94c55] requestID = "ca1523ed-1879-5610-bf5d-7e6fd84cabee"
[444030d0-6743-4de5-bd91-bc40b2b94c55] service = "SQS"
[444030d0-6743-4de5-bd91-bc40b2b94c55] operation = "ListQueues"
[444030d0-6743-4de5-bd91-bc40b2b94c55] operation: 201.197071ms
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] http: 206.449568ms
[12b2b39d-df86-4648-a436-ff0482d13340] http: 206.526603ms
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] requestID = "64229710-b552-56ed-8f96-ca927567ec7b"
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] service = "SQS"
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation = "ListQueues"
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation: 207.252357ms
[12b2b39d-df86-4648-a436-ff0482d13340] requestID = "76d9cbc0-07aa-58aa-98b7-9642c79f9851"
[12b2b39d-df86-4648-a436-ff0482d13340] service = "SQS"
[12b2b39d-df86-4648-a436-ff0482d13340] operation = "ListQueues"
[12b2b39d-df86-4648-a436-ff0482d13340] operation: 207.360621ms
```
19 changes: 13 additions & 6 deletions content/en/docs/making-requests.md
Original file line number Diff line number Diff line change
Expand Up @@ -233,14 +233,21 @@ if err != nil {
For more information on error handling, including how to inspect for specific error types, see the
[Handling Errors]({{% ref "handling-errors.md" %}}) documentation.

#### Responses with io.ReadCloser
#### Responses with `io.ReadCloser`

Some API operations return a response struct that contain an output member that is an `io.ReadCloser`. If you're making
requests with these operations, always be sure to call `io.ReadCloser` member's `Close` method after you've completed
reading the content.
Some API operations return a response struct that contain an output member that
is an `io.ReadCloser`. This will be the case for API operations that expose
some element of their output in the body of the HTTP response itself.

For example {{% alias service=S3 %}} `GetObject` operation returns a response
whose `Body` member is an `io.ReadCloser`:
For example, {{% alias service=S3 %}} `GetObject` operation returns a response
whose `Body` member is an `io.ReadCloser` for accessing the object payload.

{{% pageinfo color="warning" %}}
**You MUST ALWAYS `Close()` any `io.ReadCloser` output members, regardless of
whether you've consumed its content. Failure to do so can leak resources and
potentially create issues with reading response bodies for operations called in
the future.**
{{% /pageinfo %}}

```go
resp, err := s3svc.GetObject(context.TODO(), &s3.GetObjectInput{...})
Expand Down
2 changes: 1 addition & 1 deletion docs/404.html
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@
aria-label="Search this site…"
autocomplete="off"

data-offline-search-index-json-src="/aws-sdk-go-v2/offline-search-index.810a5299d837b395cdd69648bfec2e99.json"
data-offline-search-index-json-src="/aws-sdk-go-v2/offline-search-index.ba2b7a89f1555398e1c5839c9aee3bc1.json"
data-offline-search-base-href="/"
data-offline-search-max-results="10"
>
Expand Down
Loading

0 comments on commit 3f11379

Please sign in to comment.