-
Notifications
You must be signed in to change notification settings - Fork 1
/
11-observability.slide
499 lines (345 loc) · 14.8 KB
/
11-observability.slide
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
# Observability
Course Go
Tags: golang, go
## Outline
- Health
- Metrics
- Prometheus
- Grafana
- Logs
- Loggers
- Loki
- Traces
- Jaeger
- OpenTelemetry
## Observability
- The ability to understand the internal state of a system
- The inspection is done by analyzing telemetry data outputted by the system
- Telemetry data describe how the application is used and how it performs
- [Four common telemetry data types](https://developer.newrelic.com/opentelemetry-masterclass/telemetry/telemetry-data-types/) exist (MELT):
- Metrics
- Events
- Logs
- Traces
## Health
## Health
- Used to ping or to healthcheck a service instance
- Reports on service components/dependencies health
- Usually exposed as `/health` or `/status`
```
{
"name": "API",
"version": "v1.15.3"
"status": "NOK",
"components": {
"database": {
"status": "OK",
"timestamp": "2024-05-04T12:10:18.001547Z"
},
"exporter": {
"status": "NOK",
"timestamp": "2024-05-04T12:13:45.652785Z",
"error": "corrupted data"
}
}
}
```
## Health
.code assets/lecture-11/health/health.go /START OMIT/,/END OMIT/
[GitHub: alexliesenfeld/health](https://github.com/alexliesenfeld/health)
## Metrics
## Metrics
- Runtime service measurements
- Quantifiable
- Examples:
- "Service uptime"
- "Memory allocations"
- "HTTP requests count"
- "HTTP response time"
## Prometheus
- System monitoring tool
- Built at SoundCloud [2012]
- A CNCF project since 2016
- Custom Prometheus Query Language ([PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/))
- [Stores data](https://prometheus.io/docs/prometheus/latest/storage/) as time series
- Exercises the pull strategy by default
- Pushing can be set-up using the [pushgateway](https://github.com/prometheus/pushgateway)
- Supports [alerting](https://prometheus.io/docs/alerting/latest/overview/) out of the box
[Prometheus](https://prometheus.io)
[GitHub: Prometheus](https://github.com/prometheus/prometheus)
## Prometheus Configuration
.code assets/lecture-11/metrics/prometheus.yaml
[Prometheus: Configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
## Metric types
- Counter
- Incrementing counter
- Gauge
- Numerical value
- Histogram
- Samples observations
- Distributes values into buckets
- Summary
- [Similar to Histogram](https://prometheus.io/docs/practices/histograms/)
- Exposes quantiles
[Prometheus: Metric Types](https://prometheus.io/docs/concepts/metric_types/)
## Prometheus Metrics
.code assets/lecture-11/metrics/prometheus-metrics.text
## Instrumenting
.code assets/lecture-11/metrics/prometheus-instrumenting.go /START OMIT/,/END OMIT/
[Prometheus: Instrumenting Go Application](https://prometheus.io/docs/guides/go-application/)
## PromQL
- Metrics are requested by name
```
promhttp_metric_handler_requests_total
```
- Can be filtered using labels
```
promhttp_metric_handler_requests_total{code="200"}
```
- Number of unique metrics with given name
```
count(promhttp_metric_handler_requests_total)
```
- Average `/metrics` handler requests with status OK across 3 minute windows
```
rate(promhttp_metric_handler_requests_total{code="200"}[3m])
```
- Average HTTP request duration split by 3 minute windows
```
sum(rate(http_request_duration_seconds_sum[3m])) / sum(rate(http_request_duration_seconds_count[3m]))
```
##
.image assets/lecture-11/metrics/prometheus-query-table.png 550 _
##
.image assets/lecture-11/metrics/prometheus-query-graph.png 580 _
##
.image assets/lecture-11/metrics/prometheus-query-graph-rate.png 600 _
## Grafana
- Visualization tool
- Supports alerting
- Supports over [150 datasources](https://grafana.com/grafana/plugins/data-source-plugins/)
- Grafana also provides its own fully managed [Grafana Cloud](https://grafana.com/products/cloud/)
- [Free plan](https://grafana.com/pricing/)
- 10k metrics
- 14-day retention
- 50 GB of logs and metrics
[Grafana Labs: Grafana](https://grafana.com/grafana/)
[GitHub: Grafana](https://github.com/grafana/grafana)
##
.video assets/lecture-11/metrics/grafana.mp4 video/mp4 520 _
[Grafana Labs: Grafana sandbox](https://play.grafana.org/d/000000012/grafana-play-home?orgId=1)
## Logs
## Logs
- The most basic form of telemetry
- Timestamped text records
- Structured
- Unstructured
- Generally supported by programming languages out of the box
- [log](https://pkg.go.dev/log)
- [slog](https://pkg.go.dev/slog)
## zap
- Created by Uber
- Focuses on performance
- Minimalizes allocations
- [Extendable](https://pkg.go.dev/go.uber.org/zap#hdr-Extending_Zap)
.code assets/lecture-11/logs/zap.go /START OMIT/,/END OMIT/
[GitHub: zap](https://github.com/uber-go/zap)
[BetterStack: Complete Guide to Zap](https://betterstack.com/community/guides/logging/go/zap/)
## zerolog
- Inspired by zap
- Comes out the fastest according to [benchmarks](https://github.com/uber-go/zap?tab=readme-ov-file#performance)
.code assets/lecture-11/logs/zerolog.go /START OMIT/,/END OMIT/
[GitHub: zerolog](https://github.com/rs/zerolog)
[BetterStack: Complete Guide to Zerolog](https://betterstack.com/community/guides/logging/zerolog/)
## Loki
- Horizontally-scalable, highly-available, log aggregation system
- Inspired by Prometheus
- Focuses on logs instead of metrics
- Created by Grafana Labs and maintained under the CNCF
- The log data are compressed and [stored](https://grafana.com/docs/loki/latest/get-started/architecture/#storage) in chunks
- Object storage
- [Labels](https://grafana.com/docs/loki/latest/get-started/labels/) the data and uses the labels to index them
- Multiple clients can be configured to collect the logs
- Custom [LogQL](https://grafana.com/docs/loki/latest/query/) query language similar to PromQL
[Grafana Lab: Loki](https://grafana.com/oss/loki/)
[GitHub: Loki](https://github.com/grafana/loki)
## Architecture
- Modular system composed of multiple [components](https://grafana.com/docs/loki/latest/get-started/components/)
- Can be deployed:
- All-in-one
- In logical groups
- Individually
```
$ loki -config.file=loki.yaml -target all
$ loki -config.file=loki.yaml -target write
$ loki -config.file=loki.yaml -target compactor
```
[Grafana Labs: Loki deployment modes](https://grafana.com/docs/loki/latest/get-started/deployment-modes/)
## Configuration
.code assets/lecture-11/logs/loki.yaml
[Grafana Lab: Loki configuration](https://grafana.com/docs/loki/latest/configure/)
## Clients
- [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/)
- Logging agent
- Reuses Prometheus `scrape_configs` configuration
- [Docker Driver](https://grafana.com/docs/loki/latest/send-data/docker-driver/)
- Docker plugin that reads container logs and pushes them to Loki
- [OpenTelemetry Collector](https://grafana.com/docs/loki/latest/send-data/otel/)
- Loki can ingest OTel logs
- More on OpenTelemetry later
- [Grafana Alloy](https://grafana.com/docs/alloy/latest/)
- Grafanas distribution of OpenTelemetry Collector
[Grafana Labs: Send log data to Loki](https://grafana.com/docs/loki/latest/send-data/)
## Traces
## Traces
- Model work in distributed systems
- Each trace models a transaction through your system
- End-to-end request/response
- Traces are basically a set of interconnected logs
- Even through multiple services
- Each trace is made up of a single or multiple spans
- Span models a part of the transaction
- A span can model a service interaction or a interaction within a component etc.
- Each spans contains tags/attributes
- Provide additional context
## Spans
.image assets/lecture-11/traces/spans.svg 450 _
[Jaeger: Spans](https://www.jaegertracing.io/docs/1.57/architecture/#span)
## Jaeger
- Distrubuted tracing platform
- Inspired by Dapper and OpenZepkin
- Created by Uber in 2015
- A CNCF graduated project
- Nowadays uses OpenTelemetry for instrumentation
- Previously had its own clients
- Supports multiple storage backends
- Most notable are Cassandra, Elasticsearch, and Kafka
- Jaeger maintains a [hotrod demo](https://github.com/jaegertracing/jaeger/tree/main/examples/hotrod)
[Jaeger](https://www.jaegertracing.io)
##
.image assets/lecture-11/traces/jaeger-traces-view.png 520 _
[Jaeger: Introduction](https://www.jaegertracing.io/docs/1.57/#screenshots)
##
.image assets/lecture-11/traces/jaeger-traces-detail-view.png 520 _
[Jaeger: Introduction](https://www.jaegertracing.io/docs/1.57/#screenshots)
## Architecture
- Similarily to Loki, Jaeger can be:
- Deployed as a all-in-one binary
- Or split by components
- Components
- Collector
- Receives traces, runs them through a pipeline and saves it
- Query
- Exposes query API and hosts UI
- Ingester
- Adds support for Kafka
[Jaeger: Architecture](https://www.jaegertracing.io/docs/1.57/architecture/)
## Architecture
.image assets/lecture-11/traces/jaeger-architecture.svg 320 _
[Jaeger: Architecture](https://www.jaegertracing.io/docs/1.57/architecture/)
## Sampling
- Telemetry data are often pretty rich
- Saving everything results in large storage costs
- Instrumentation also has some runtime overhead
- Sampling helps us mitigate these problems
- Head-based sampling
- The decision is done right at the start
- E.g. based on the generated trace ID
- Tail-based sampling
- Inspects the data as a whole and decides based on specified criteria
- E.g. the trace contains an error
[OpenTelemetry: Sampling](https://opentelemetry.io/docs/concepts/sampling/)
## OpenTelemetry
## OpenTelemetry
- AKA OTel
- Vendor-neutral observability framework
- Defines industry standards
- Instrumenting, generating, collecting, and exporting telemetry data
- Created by merging [OpenCensus](https://opencensus.io) and [OpenTracing](https://opentracing.io) in 2019
- Again, a CNCF project
[OpenTelemetry](https://opentelemetry.io)
## Components
- Specification
- APIs
- SDKs
- Data: OpenTelemetry Protocol (OTLP)
- Collector
- Proxy that receives, processes, and exports telemetry data
- APIs & SDKs implementations
- K8s operator
- FaaS assets
[OpenTelemetry: Components](https://opentelemetry.io/docs/concepts/components/)
## Signals
- AKA categories of telemetry
- [Traces](https://opentelemetry.io/docs/concepts/signals/traces/)
- [Metrics](https://opentelemetry.io/docs/concepts/signals/metrics/)
- [Logs](https://opentelemetry.io/docs/concepts/signals/logs/)
- [Baggage](https://opentelemetry.io/docs/concepts/signals/baggage/)
- Contextual information passed between signals
- _Profiles_ (WIP)
- _Sessions_ (WIP)
[OpenTelemetry: Signals](https://opentelemetry.io/docs/concepts/signals/)
## Non-OpenTelemetry Architecture
.image assets/lecture-11/opentelemetry/non-opentelemetry-architecture.svg 480 _
[OpenTelemetry: Logging specification](https://opentelemetry.io/docs/specs/otel/logs/)
## OpenTelemetry Architecture
.image assets/lecture-11/opentelemetry/opentelemetry-architecture.svg 480 _
[OpenTelemetry: Logging specification](https://opentelemetry.io/docs/specs/otel/logs/)
## OpenTelemetry Collector
- Telemetry data are received, processed, and exported using pipelines
- Multiple pipelines can be defined at once
Pipeline
.image assets/lecture-11/opentelemetry/collector-pipeline.svg 250 _
[OpenTelemetry: Collector Architecture](https://opentelemetry.io/docs/collector/architecture/)
## OpenTelemetry Collector Configuration
.code assets/lecture-11/opentelemetry/collector.yaml
[OpenTelemetry: Collector Configuration](https://opentelemetry.io/docs/collector/configuration/)
## OpenTelemetry Instrumentation
- Many libraries provide automatic intrumentation via plugins
- [OpenTelemetry Registry](https://opentelemetry.io/ecosystem/registry/?s=sqlx&component=&language=)
- [zap](https://pkg.go.dev/github.com/uptrace/opentelemetry-go-extra/otelzap)
- [net/http](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http)
- [gin](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin)
- [sqlx](https://pkg.go.dev/github.com/uptrace/opentelemetry-go-extra/otelsqlx)
- [gorm](https://pkg.go.dev/github.com/uptrace/opentelemetry-go-extra/otelgorm)
- [grpc](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc)
- [runtime](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/runtime)
- Or you can use the OpenTelemetry API and SDK directly
- [OpenTelemetry-Go](https://github.com/open-telemetry/opentelemetry-go)
## Instrumentation: Resources
.code assets/lecture-11/opentelemetry/otel.go /START RESOURCE OMIT/,/END RESOURCE OMIT/
[OpenTelemetry: Resources](https://opentelemetry.io/docs/languages/go/resources/)
## Instrumentation: Providers & Exporter
.code assets/lecture-11/opentelemetry/otel.go /START TRACE-PROVIDER OMIT/,/END TRACE-PROVIDER OMIT/
.code assets/lecture-11/opentelemetry/otel.go /START METRIC-PROVIDER OMIT/,/END METRIC-PROVIDER OMIT/
## Instrumentation: Initialization
.code assets/lecture-11/opentelemetry/otel.go /START TRACE-INIT OMIT/,/END TRACE-INIT OMIT/
[OpenTelemetry: Instrumentation](https://opentelemetry.io/docs/languages/go/instrumentation/)
## OpenTelemetry Instrumentation: Traces
.code assets/lecture-11/opentelemetry/otel.go /START CREATING-SPAN OMIT/,/END CREATING-SPAN OMIT/
.code assets/lecture-11/opentelemetry/otel.go /START SPAN-ATTRIBUTES OMIT/,/END SPAN-ATTRIBUTES OMIT/
.code assets/lecture-11/opentelemetry/otel.go /START SPAN-STATUS OMIT/,/END SPAN-STATUS OMIT/
## OpenTelemetry Instrumentation: Metrics
.code assets/lecture-11/opentelemetry/otel.go /START METRIC OMIT/,/END METRIC OMIT/
## OpenTelemetry Metric Instruments
- Synchronous
- Counter
- UpDownCounter
- Histogram
- Asynchronous (Observable)
- Counter
- UpDownCounter
- Gauge
[Go Packages: OpenTelemetry API Instruments](https://pkg.go.dev/go.opentelemetry.io/otel/metric#hdr-Instruments)
## OpenTelemetry SDK
- The SDK has to be enabled for applications to expose telemetry
- Otherwise, all telemetry related actions are no-ops
- The SDK can be configured using [environmental variables](https://opentelemetry.io/docs/languages/sdk-configuration/general/)
- [Implementation Compliance Matrix](https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md#environment-variables)
## Demo
- Additional [Go examples](https://github.com/open-telemetry/opentelemetry-go/tree/main/example)
- OpenTelemetry also maintains an up-to-date demo
- Microservice-based distributed system
- Showcases OTel usage
[OpenTelemetry: Demo Documentation](https://opentelemetry.io/docs/demo/)
[GitHub: OpenTelemetry Astronomy Shop](https://github.com/open-telemetry/opentelemetry-demo)