
Composite policy doesn't allow for overflow between subpolicies if a policy sets Sampled but has gone over its max rate #36959

Open
cancub opened this issue Dec 24, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (New item requiring triage), processor/tailsampling (Tail sampling processor)


cancub commented Dec 24, 2024

Component(s)

processor/tailsampling

What happened?

Description

Within the algorithm for processing subpolicies in the composite policy, the following may happen:

  1. The subpolicy currently being processed marks the trace as Sampled.
  2. We check whether allowing this sampling would put the subpolicy over its own spans/sec limit or over the global limit.
  3. If either check evaluates to true, we return NotSampled and exit the function, never checking any of the remaining subpolicies.

My personal opinion is that this is a bug, but I can see how this might be construed as a feature.
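
To make the control flow concrete, here is a rough sketch of the loop being described. The identifiers approximate those in `internal/sampling/composite.go` (linked below); treat this as an illustration rather than a quote of the upstream code:

// Illustrative sketch of the composite Evaluate loop described above.
// Field and variable names approximate the real code; this is pseudocode,
// not the upstream implementation.
for _, sub := range c.subpolicies {
	decision, err := sub.evaluator.Evaluate(ctx, traceID, trace)
	if err != nil {
		return Unspecified, err
	}

	if decision == Sampled {
		// Would sampling this trace push the subpolicy over its own
		// allocation or over the composite policy's global limit?
		spansInSecondIfSampled := sub.sampledSPS + trace.SpanCount.Load()
		if spansInSecondIfSampled <= sub.allocatedSPS && spansInSecondIfSampled <= c.maxTotalSPS {
			sub.sampledSPS = spansInSecondIfSampled
			return Sampled, nil
		}

		// Over a limit: return immediately, so the remaining
		// subpolicies never get a chance to evaluate this trace.
		return NotSampled, nil
	}
}

return NotSampled, nil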

Steps to Reproduce

  1. Start up an otelcol with the provided config.
  2. Send in a bunch of traces (for example, by using xk6-client-tracing).

Expected Result

The always sample subpolicy (ASSP) receives the traces first, and any unsampled traces then move on to the probabilistic subpolicy (PSP).

If the spans/sec rate ramps up gradually, we should at first see a 100% sampling rate for this composite policy, followed by an inflection point where the ASSP saturates at 100 spans/sec and 80% of the spans above that rate are not sampled. Similarly, looking at the exporter's spans/sec rate, there should be an inflection at around the same time, with the slope of the graph after the inflection being 20% of the slope before it.
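
To put numbers on that expectation (illustrative only; the 1,000 spans/sec input rate is an assumption): if incoming traffic ramps to 1,000 spans/sec, the ASSP contributes its saturated 100 spans/sec and the PSP samples roughly 20% of the remaining 900 spans/sec, so the exporter should level off near 100 + 0.2 × 900 = 280 spans/sec rather than flatlining at 100.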

Actual Result

The ASSP always receives all of the traces and the PSP never receives any. The spans-per-second rate flatlines at the maximum spans per second allocated to the ASSP.

Collector version

v0.116.0

Environment information

Environment

OS: Ubuntu 22.04
Compiler (if manually compiled): [email protected] go1.23.4

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:14317"

processors:
  tail_sampling:
    policies:
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 10_000
          policy_order:
            - always-sample-policy
            - probabilistic-policy
          composite_sub_policy:
            - name: always-sample-policy
              type: always_sample
            - name: probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 20
          rate_allocation:
            - policy: always-sample-policy
              percent: 5
            - policy: probabilistic-policy
              percent: 95

exporters:
  otlp:
    endpoint: "localhost:4317"
    tls:
      insecure: true
  debug:
    verbosity: detailed

service:
  telemetry:
    logs:
      level: DEBUG
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]

Log output

2024-12-24T17:29:51.319-0500	debug	[email protected]/processor.go:271	Sampling Policy Evaluation ticked	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500	debug	sampling/always_sample.go:29	Evaluating spans in always-sample filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}


### Additional context

We need two changes:

## Return On Global Max Right Away

We should move the check

`spansInSecondIfSampled <= c.maxTotalSPS`

out of the loop. There's no need to bother processing any of the subpolicies if sampling the trace would put the system over the global limit regardless of which subpolicy samples it.

I actually think this inequality might be a mistake.
`spansInSecondIfSampled` [only takes into account the number of sampled spans *for this subpolicy* and the number of spans in the current trace](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.116.0/processor/tailsamplingprocessor/internal/sampling/composite.go#L106). It doesn't make sense to compare this to the `maxTotalSPS` of the full `Composite` policy.
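
One possible shape for this change (a sketch only; `c.totalSampledSPS` is a hypothetical field that would track spans sampled across all subpolicies in the current second, which the current code does not have):

// Hypothetical sketch: check the composite-wide budget once, before the
// subpolicy loop. c.totalSampledSPS is an assumed new field tracking the
// spans sampled by *any* subpolicy in the current second.
if c.totalSampledSPS+trace.SpanCount.Load() > c.maxTotalSPS {
	// Sampling this trace would exceed the global budget regardless of
	// which subpolicy claims it, so there is nothing left to decide.
	return NotSampled, nil
}

for _, sub := range c.subpolicies {
	// ... per-subpolicy evaluation, checking only sub.allocatedSPS ...
}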

## Modify Logic for `Sampled`/`NotSampled` for Subpolicies

[This logic](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.116.0/processor/tailsamplingprocessor/internal/sampling/composite.go#L116-L120) is counterintuitive and -- I would argue -- restricts the utility of the `Composite` policy. It's saying that if the first subpolicy that wants to sample the trace is over its limit, then we should give up on the trace entirely.

We should get rid of this `return NotSampled, nil` and allow the remaining subpolicies to process the trace. We still have a `return NotSampled, nil` at the end of the function in case no policy wants to sample the trace or no policy is able to sample the trace based on its SPS limits.
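
A sketch of the suggested overflow behaviour (again using approximate identifiers, not the upstream code verbatim):

// Sketch: a subpolicy that is over its allocation no longer short-circuits
// the loop; the trace simply falls through to the next subpolicy.
for _, sub := range c.subpolicies {
	decision, err := sub.evaluator.Evaluate(ctx, traceID, trace)
	if err != nil {
		return Unspecified, err
	}
	if decision != Sampled {
		continue
	}

	spansInSecondIfSampled := sub.sampledSPS + trace.SpanCount.Load()
	if spansInSecondIfSampled <= sub.allocatedSPS {
		sub.sampledSPS = spansInSecondIfSampled
		return Sampled, nil
	}
	// This subpolicy wants the trace but is over its SPS allocation;
	// let the next subpolicy in policy_order have a chance instead of
	// returning NotSampled here.
}

// No subpolicy sampled the trace, or every willing subpolicy was over
// its SPS allocation.
return NotSampled, nil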
cancub added the bug and needs triage labels on Dec 24, 2024
github-actions bot added the processor/tailsampling label on Dec 24, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


cancub commented Dec 24, 2024

The graphs showing the flatline at the ASSP maximum rate (two screenshots attached; not reproduced here).
