Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: OTel SDK should expose a metric to inform about sampling decisions made #5756

Open
samsp-msft opened this issue Jul 17, 2024 · 3 comments
Labels
enhancement New feature or request needs-spec-change Issues which require the OpenTelemetry Specification to clarify or define behavior pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package
Milestone

Comments

@samsp-msft
Copy link

samsp-msft commented Jul 17, 2024

Package

OpenTelemetry

Is your feature request related to a problem?

Imagine the scenario that you have added a sampler to your Trace configuration, something like:

    builder.Services.AddOpenTelemetry()
        .WithTracing(tracing =>
        {
            tracing.AddAspNetCoreInstrumentation()
                .AddHttpClientInstrumentation()
               .SetSampler(new ParentBasedSampler(new RateLimitingSampler(3)))
        });

This will result in sampling decisions being made based on whether the incoming request has a trace header - if so it will honor that header, and if not it will sample a max of 3 requests per second.

When you look at the output you will get a combination of traces, but what you don't get is a good understanding of what traces got dropped and why.

What is the expected behavior?

I am suggesting that we have a new metric: opentelemetry.trace.sampler.count, implemented by the OTel SDK that provides details about the number of Activities that the sampler was called for and the sampling result. It should be dimensioned with:

Name Type Description Example
sampling.decision string The final verdict as to how the activity/span was flagged for sampling drop, record_only, record_and_sample
span.parent.is_remote bool Whether the parent is a remote span or local true
span.parent.recorded bool Whether the parent trace has the recorded flag set or not false
span.name string The name of the span that is being sampled Microsoft.AspNetCore.Hosting.HttpRequestIn
sampler.description string The description from the sampler that made the decision fixedratesampler{0.2}

The span.name may be too varied to be suitable for use in a metric, in which case we should make this the ActivitySource.Name. The goal being to give the observer some idea of which spans/activities are being sampled each way.

The expected use of the metric is to have observability into the trace sampling decisions that are being made by the sdk. By looking at the sampling.decision, you get a measure of how many Activities are being dropped, just recorded and those emitted. The ratios of these numbers should match what you have configured in the sampler and the incoming request rate.

The additional fields are to enable better diagnostics as to why the sampling decision was made, but limiting it to the fields that have constrained enough values for use in metric dimensions.

Which alternative solutions or features have you considered?

While the sampling state is available through EventSource events, that is not easily monitored, and so including this in the metrics already produced makes more sense to me.

Additional context

This feels like something that would apply to other languages/sdks where head-based sampling occurs. The same metric could also be used for tail sampling in a component like the OTel collector.

@samsp-msft samsp-msft added the enhancement New feature or request label Jul 17, 2024
@github-actions github-actions bot added the pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package label Jul 17, 2024
@samsp-msft
Copy link
Author

Tagging some folks: @reyang @lmolkova @CodeBlanch @cijothomas @noahfalk @kalyanaj

@kalyanaj
Copy link
Contributor

Tagging @jmacd (to evaluate/consider this feedback for the specification).

@cijothomas
Copy link
Member

I think this issue is best moved to spec/sem.convention repo, as this is independent of language implementation.

@CodeBlanch CodeBlanch added this to the Future milestone Aug 16, 2024
@CodeBlanch CodeBlanch added the needs-spec-change Issues which require the OpenTelemetry Specification to clarify or define behavior label Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs-spec-change Issues which require the OpenTelemetry Specification to clarify or define behavior pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package
Projects
None yet
Development

No branches or pull requests

4 participants