Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic Conventions - Multi-Registry Proposal #348

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

lquerel
Copy link
Contributor

@lquerel lquerel commented Aug 29, 2024

This PR proposes to support multiple semantic convention registries in OTEL and Weaver.

If you’re like me and prefer to read a rendered version of the markdown spec, it’s available here.

Note: This proposal could eventually be transformed into an OTEP if needed.

See GH issue #215

@lquerel lquerel self-assigned this Aug 30, 2024
@lquerel lquerel added documentation Improvements or additions to documentation enhancement New feature or request labels Aug 30, 2024
the needs of the owner. Registries should be accessible via a URL.
- **Common Format**: A published registry should adhere to a common packaging and format, making it easy to
consume and integrate with other registries.
- **Self-Contained**: A published registry should be self-contained and stored in a single file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with-self-contained, but single file sounds weird as a design principle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to make a published registry simple to consume, so in addition to being self-contained, having a single file also simplifies consumption. I’m going to remove the “single file” aspect from the design principles section for now, but I will continue to use the single file approach for the published registry in this proposal and see how things evolve.

These policies should be enforced by Weaver.
- **Cross-Registry References**: References between different registries should be supported, facilitating
interoperability and integration across various registries.
- **Conflict Avoidance through Scoping**: A scoping mechanism should be implemented to ensure that a signal
Copy link
Contributor

@lmolkova lmolkova Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to understand what scoping means.

There should not be a case when two attributes with the same name are defined within one schema_url.

If scoping means that both exist but you can specify which one is imported in yaml than this information is not available to the consumer of this telemetry.

Even worse if you can conditionally import one or another in different parts of conventions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make things clear, the consumer of telemetry, i.e., anyone downstream from signal production (backend, dashboard, etc.), will not see this type of conflict and will see the attribute or signal with its name as defined today. I will add this principle to the design principles.

I am currently writing an entire section on conflict resolution. Conflicts can arise for several reasons. Examples:

  • A registry that imports two registries, which in turn import a registry but in different versions.
  • A registry that imports two registries, which in turn import the same registry but apply different overrides.

Things should be better defined soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the following item in the design principles list:

  • Transparency for Telemetry Consumers: The downstream consumers of telemetry data, such as backends and dashboards, should never be exposed to conflicts within or between registries. They should see attributes and signals as they are defined, without any scope or conflict resolution directives, ensuring a consistent and reliable data experience.

- Short answer: No for backward compatibility.
- Long answer: Experimental entities are not meant to be used in production. Experimental entities are subject to
change or removal without any notice in the next version of the registry. The type of an experimental entity can
be changed. Weaver will detect and report an error if an experimental entity is referenced across registries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this limitation or reasoning behind it.

It's common to use experimental semconv in production, it's also common to implement them in OTel or outside it in the instrumentations.

It makes sense to require something that depends on experimental stuff to be experimental, but I don't understand why we should limit the ability to import experimental attribute/group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach was to support only the most essential/simple features at the beginning, with the option of adding such things later if the need arose. It seems that the need is already here :-)

Technically, I don’t think there’s a problem with supporting experimental definitions across registries. Even if we decide that this is supported by default, I believe we should, at a minimum, make it easier for users who don’t want to allow this. This could simply be implemented as an OTEL policy that prevents this type of dependency.

error if a reference is made to an entity that is not defined in the imported registry.

### Open Questions:
- Can we override any field of a group defined in an imported registry? No for `type` field, what about the other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd start with the same limitations as within the registry. You can't change type, stability, deprecation status. Can change anything else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

### Open Questions:
- Can we override any field of a group defined in an imported registry? No for `type` field, what about the other
fields?
- Is there a relationship to define between the instrumentation scope name and version and the registry?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instrumentation scope is a random string - name of the library/class/component that produced the telemetry, I can change which component/class emits my logs/events, but this should have no effect on the consumer of the telemetry.

I think we might change it at some point, but that's the current state of affairs.


### Spans
- No span name collisions
- Spans cannot be removed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not? what does it even mean to remove a span?

I assume I pick the spans that I want to support by referencing them and I should be free to drop any spans that I don't need.

Moreover, if I'm defining semconv for a specific system (e.g. Kafka library), not all spans declared in generic conventions apply to my library, some of them are defined for scenarios that I don't support and it'd be best if I don't ever mention corresponding spans in my conventions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the OTEL registry, once a span has been defined, can its definition be removed in a later version? I thought this wasn't allowed if the span had a stability status set to stable. If that's the case, I think we should apply the same rule for all registries, whether OTEL or not.

Copy link
Contributor

@lmolkova lmolkova Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say I work on a library that supports multiple protocols (HTTP and something else). At some point my library removes support for one of these protocols. I should be allowed to stop referencing corresponding protocol spans - I no longer report them.

For metrics we at least support informal requirement levels and we could say that if you report e.g. http metrics, you must at least report required ones. Even this is tough to say - how do you know that they are going to report HTTP metrics if they don't reference them?

For spans it's even more complicated since span identity is not defined and spans don't have requirement levels.
We don't define anything about spans in yaml beyond what's defined for the attribute_groups - we need to change this first.


I think there are two separate discussions here:

  1. What's allowed when I reference conventions from registry A in registry B (or within one registry).
  2. What changes are allowed in the the registry B from version N to N+1

I'm mostly talking about p1 (you can reference or not reference any group), and I think you're talking about p2.

Maybe it'd be more clear if policies were separated into two groups: within-one-version and backward-compatibility?

- Event names must match the following regular expression: `^[a-z][a-z0-9]*([._][a-z0-9]+)*$`

### Spans
- No span name collisions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to find a better identity parameter for spans than name. Span names is an existing thing - they are usually something like "GET /users/{user-id}" or SELECT public.my-shop-db - they are human-friendly summary of what this span represents and are part of the exported telemetry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. However, for symmetry with other signal types, we should ideally keep name to identify things, as we do with attributes and metrics, and use a separate identifier for the human-friendly summary of what this span represents.

Copy link
Contributor

@lmolkova lmolkova Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Span name is used everywhere - in the spec - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span-creation, in all semantic conventions. I don't think we can change how span name property is called - it'll be a huge source of confusion.

- Spans cannot be removed
- Spans cannot "degrade" in stability (e.g., stable -> experimental)
- The set of required/recommended attributes must remain the same
- Span names must match the following regular expression: `^[a-z][a-z0-9]*([._][a-z0-9]+)*$`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment above - span name is pre-existing thing with different rules (unicode string)

## Resolved Semantic Convention Registry Format

> Note: A resolved registry is self-contained and does not include any complex constructs
> like `imports`, `ref`, `extends`, etc. Their **structure is less subject to change**, making them good
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I'm defining a span or a metric, I have to reference some attribute and maybe extend some attribute group, so this may need clarification.

- Accessible via a URL.
- Self-contained, i.e. a single file.
- No `ref`, no `extends`, no `imports`, no alias, no other complex constructs.
- Yaml or JSON format so resolved registries can be easily consumed by any tool.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicking: I think single file and this point are mutually exclusive - I get some warnings from VSCode on resolved attribute registry alone due to its size.

Open Questions:

- Do we keep track of the imported registries in the resolved registry? If yes, how? Lineage?
- Can we leverage the attribute deduplication mechanism to simplify the merging of imported registries? ToDo -> Explain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we start by disallowing attribute/span/event/metric duplication and only add it along with scoping only if there is actual demand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next version of this document includes a complete revision of the resolution process and eliminates the need for scoping in the most common cases. I think the best approach at this stage is to postpone the discussion on scoping.


- What about introducing a new type of semconv file that will let end-users define global overrides and global redact
directives? For example, the requirement level of attributes such as `client.address` or `server.address` will be
better defined by the end-user than by the library author, or a vendor. A similar approach could be used for redact
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future concern: we should be crystal clear that user modifying requirement level in their registry has no effect on instrumentation libraries or produced telemetry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in a general context. I will add this point to the document. However, I also believe that Weaver can enable the development of a new type of type-safe Client SDK, where a change in the requirement_level would have a direct impact on what is reported by the applications using this SDK. Personally, I am a strong believer in this approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea, but it needs all instrumentation libraries to change.

I.e. if my library reports http.request.header.foo as opt-in (or doesn't report it at all), it's not in the OTel SDK power to make my instrumentation report it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we need to change anything in the current instrumentation libraries. In my view, we can follow a two-step approach. The first step involves proposing a library that offers a type-safe API on top of the existing SDK clients. Those who prefer the current approach can continue using the current SDK client as they always have. Those who want the pure type-safe API approach will only interact with this additional layer. It is even conceivable that a user might want to adopt a mixed approach, using one or the other APIs depending on the need.

In step 2, the idea is to no longer base the type-safe API on the SDK clients, but to provide a fully optimized version by taking advantage of the fact that a type-safe API doesn’t need all the overhead of a generic approach. Hashmaps, abstraction layers, etc., can either be optimized or completely removed.

The advantage of this two-step approach is that we can test the type-safe API approach with the community at a lower cost. I already have a proof of concept for step 1 in Rust.

name: otel

groups:
- ref: otel:http.server.request.duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that reusing the same name across different signals is allowed and is a feature rather than a bug (e.g. dns.lookup.duration is a metric name, but if DNS lookup is reported as an event, it'd make a perfect event name, or if it's reported as attribute on something, it could be an attribute name).

So we need ref: to either reference a group id (and then give them some meaning) or reference a specific signal (ref_metric).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not something currently allowed or even supported. You’re mentioning in this comment something you’d like us to explore. Is that correct?

Copy link
Contributor

@lmolkova lmolkova Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm talking about current state.

I can easily define an attribute with the same name as metric, there are no checks and nothing is going to stop me. Nothing would break if I do. And I consider it a feature, not a bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. It's now fixed. See 5d68d7d

Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed spec!

I really like the direction it goes into, here are some suggestions:

  • we need to define what it means to reference a group (metric/span/event) within semantic conventions - there are a lot of small questions I left on this, but I think this discussion needs to happen in semconv before anything is done in weaver to support it. We should start by supporting it within the same registry.
  • The current scope of this proposal is huge and while I agree it's important to design and prototype how the eventual thing would look like, I'd suggest to start with attributes and attribute registry for the initial implementation once we have a basic idea how it'd work for groups.

@lquerel lquerel changed the title [WIP] Multi-Registry - Draft Proposal Semantic Conventions - Multi-Registry Proposal Sep 7, 2024
@lquerel lquerel marked this pull request as ready for review September 7, 2024 04:31
@lquerel lquerel requested a review from a team September 7, 2024 04:31
> [!NOTE]
> In this document, “semantic convention registry” refers to a collection of semantic convention entities (attributes,
> groups, signals, etc.) that define the semantics of the data model used in OpenTelemetry. The terms “registry” and
> “semantic convention registry” are used interchangeably. The term “entity” refers to any semantic convention
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "Entity" may have overlap in name in OpenTelemetry. I can't think of a better term just yet, but we may need one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. Should I use telemetry object instead of entity?

```yaml
name: <registry_name>
description: <registry_description>
version: <registry_version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern I have here is the same I have, e.g. with cargo crate versions.

What does CI/CD look like? My ideal is releasing a version is as close to git tag <version> as possible.

This is a bit of a nit, but also a foundational question - What does release and maintenance of these look like? Should version be external, but everything else be in this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here’s the English version of your text:

Using git tag <version> and not defining the version number in the YAML file is fine with me, as long as we can maintain the self-contained aspect of the artifact that will be downloadable by the users of the corresponding registry. This artifact could be hosted on a GitHub repo, a simple web server, or a CDN. In the end, Weaver should be able to retrieve the version either from the content, the file name, headers, etc. So, I feel like we’ll need a bit more than just a git tag <version>. Any suggestions to achieve something like this are welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I'm fine if git tag <version> kicks off a workflow that fills out this version, just want to make sure we keep it flexible so we can do that. I agree there needs to be a final file that has the version filled out. I just want to make sure that's a "produced" artifact not an "input source" requiement.

is used in these processes.

Open Questions:
- Should we follow SemVer 2 for registry versions? It seems advisable, as Weaver can detect breaking changes. However,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we likely need to discuss on the specification itself. cc @tigrannajaryan.

I do think we should just adopt semconv as it alleviates a lot of problems with a path forward for handling them, all while keeping the versioning "simple".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should force semconv on all repositories. I think it's a good thing to follow for the otel repositories but might not be appropriate everywhere.

I think automatically upgrading to newer versions might be a bad idea in general. If I pinned to a specific version getting a newer version will eventually lead to unexpected results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant semver

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsuereth I agree. We should use SemVer for OTel but not strictly adhere to all semantic versioning rules. We could simply define an order relation between versions (e.g., 1.2.0 > 1.1.15 > 0.9.32), which would be sufficient to determine if one registry is newer than another.

@MadVikingGod Regarding the automatic upgrade, to clarify, I’m not suggesting automatically upgrading registry versions. Instead, I’m proposing a command that allows the user to explicitly choose whether to automatically upgrade registry versions, so it’s done in a controlled manner.

`attributes` section of a group can reference an imported attribute.

In the current semantic conventions specification, referencing a group is currently unsupported. Uniqueness within
groups is scoped by the type of group. It is entirely possible to have an event and a metric identified by the same ID.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should fix this first - This is something we could do BEFORE allowing multiple registries and I think would make our lives better overall.

cc @lmolkova for thoughts on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsuereth I think it's a feature, not a bug - #348 (comment)

E.g. http.request.body.size can in theory be an attribute and a metric name. Or messaging.message.time_in_queue.

I don't believe we have real examples of this but there were some discussions in the past where it would make sense to use the same attribute name on an event/span as some existing metric name.

It could be a good feature for spans/events->metrics pipeline (take specific attribute and convert it to metric) and I'd like to check if we can keep this door open.

@jmacd
Copy link

jmacd commented Sep 12, 2024

I don't feel qualified to approve this, but the motivation and vision look good to me. Thanks @lquerel!

@lquerel lquerel requested a review from a team as a code owner October 1, 2024 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

5 participants