
Unified semantic conventions for tasks, workflows, pipelines, jobs #1688

Open
svrnm opened this issue Dec 16, 2024 · 3 comments
Labels
area:new enhancement New feature or request experts needed This issue or pull request is outside an area where general approvers feel they can approve triage:needs-triage

Comments

@svrnm
Member

svrnm commented Dec 16, 2024

Area(s)

area:new

Is your change request related to a problem? Please describe.

While reviewing the AI Agent Span Semantic Convention I commented that the definitions of ai_agent.workflow.* and ai_agent.task.* should not be unique to that domain, since there are tasks & workflows outside the ai_agent scope that can be modeled similarly. The same is true for the existing experimental CICD pipeline attributes. Other examples might be cronjobs, business processes, and build tools like make or goyek, which defines its own goyek.flow.* tasks in https://pkg.go.dev/github.com/goyek/x/otelgoyek#example-package (thanks to @pellared for pointing me to goyek).

Describe the solution you'd like

The solution I would like to see is a unified set of attributes that describe such a "workflow", with only the truly unique attributes remaining in the related specifications, similar to what we have today for the HTTP SemConv, where server.* or url.* attributes are used where applicable.

So instead of the current proposals, the future may look like the following:

  • cicd.pipeline.name => workflow.name
  • cicd.pipeline.run.id => workflow.run.id
  • cicd.pipeline.task.name => workflow.task.name
  • cicd.pipeline.task.run.id => workflow.task.run.id
  • cicd.pipeline.task.run.url.full => workflow.task.url.full (or just url.full if it is a "task span"?)
  • cicd.pipeline.task.type => workflow.task.type
  • ...

and

  • ai_agent.workflow.name => workflow.name
  • ai_agent.task.name => workflow.task.name
  • ai_agent.task.output => workflow.task.output
  • ...

and

  • goyek.flow.output => workflow.output
  • ...
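The renamings above amount to a simple key mapping. As a minimal sketch, here is what translating tool-specific attribute sets into the proposed unified keys could look like; note that both the mapping and the workflow.* names are the proposal from this issue, not a ratified convention:

```python
# Hypothetical mapping from tool-specific attribute keys to the proposed
# unified workflow.* keys (illustration only; not a ratified convention).
UNIFIED_KEYS = {
    "cicd.pipeline.name": "workflow.name",
    "cicd.pipeline.run.id": "workflow.run.id",
    "cicd.pipeline.task.name": "workflow.task.name",
    "cicd.pipeline.task.run.id": "workflow.task.run.id",
    "cicd.pipeline.task.type": "workflow.task.type",
    "ai_agent.workflow.name": "workflow.name",
    "ai_agent.task.name": "workflow.task.name",
    "ai_agent.task.output": "workflow.task.output",
    "goyek.flow.output": "workflow.output",
}

def unify(attributes: dict) -> dict:
    """Rename tool-specific keys to the unified ones, keeping unknown keys."""
    return {UNIFIED_KEYS.get(k, k): v for k, v in attributes.items()}

print(unify({"cicd.pipeline.name": "build", "cicd.pipeline.task.type": "deploy"}))
```

A backend or processor applying such a mapping would let one dashboard serve all workflow types.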

Both @open-telemetry/semconv-genai-approvers and @open-telemetry/semconv-cicd-approvers are very active working groups pushing semantic conventions forward, and it would be great to see some broader thinking about "workflows" (or whatever this should be named in a unified way).

Describe alternatives you've considered

No response

Additional context

Note that this may also require making progress on the long-standing question of "long running 'spans'"1:

open-telemetry/opentelemetry-specification#373, open-telemetry/opentelemetry-specification#2692, and other previous issues touched upon this topic, but so far there is no solution for long-running (++minutes, hours, days) or even "infinite" spans.

Footnotes

  1. the term "span" might not be correct in this context.

@svrnm svrnm added enhancement New feature or request experts needed This issue or pull request is outside an area where general approvers feel they can approve triage:needs-triage labels Dec 16, 2024
@trask
Member

trask commented Dec 16, 2024

I think(?) this is related: open-telemetry/opentelemetry-java-instrumentation#12830

cc @cb645j

@christophe-kamphaus-jemmic
Contributor

Related issue for long running spans: #1648

@svrnm
Member Author

svrnm commented Dec 19, 2024

Thanks for the feedback here and during the SIG call. Especially @lmolkova's statement on not over-generalizing made me think more about this topic.

Before digging into that, let's ignore the "long running" issue for now; as shared in #1648, this is a spec issue and not a semantic convention issue. For the sake of simplicity, let's assume that a workflow run with its sub-tasks can be represented by a trace, with the run being the parent span and the tasks the child spans.
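That assumed trace shape (one parent span per run, one child span per task) can be sketched with a toy span model; this is a plain-Python stand-in for real OTel spans, and the workflow.* attribute names are the proposal under discussion, not an existing convention:

```python
from dataclasses import dataclass, field

# Toy span model illustrating the assumed trace shape: one parent span per
# workflow run, one child span per task. Attribute names are hypothetical.
@dataclass
class Span:
    name: str
    attributes: dict
    children: list = field(default_factory=list)

def workflow_run_trace(workflow: str, run_id: str, tasks: list) -> Span:
    """Build a parent 'run' span with one child span per task."""
    parent = Span(workflow, {"workflow.name": workflow, "workflow.run.id": run_id})
    for i, task in enumerate(tasks):
        parent.children.append(
            Span(task, {"workflow.task.name": task,
                        "workflow.task.run.id": f"{run_id}-{i}"}))
    return parent

trace = workflow_run_trace("nightly-build", "run-42", ["Setup", "Step 1", "Tear Down"])
```

With this shape, run-level questions aggregate over parent spans and task-level questions over children.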

Let me begin with where I am coming from: in my previous job as a solution engineer for APM I created lots of dashboards and visualizations, and one thing that annoyed me was having to do the same work again and again because things carried slightly different names, which required me to rework a lot. That's why I am a huge fan of the semantic conventions: they create consistency!

So, the leading question I had in mind is: how can I visualize workflow telemetry in my preferred visualization tool consistently?

To answer that question, let's take a look at how workflows are visualized across different solutions today:

What all workflows have in common is that they can be represented by a graph where the vertices represent tasks and the edges the dependencies between them. There are "start vertices" and "end vertices", and a run is a flow through that graph from a start to an end.

From an observability standpoint, what I want to understand at a high level is:

  • How many runs through my workflow are successful? How many fail?
    • If runs fail, which tasks have the most errors and are the root cause of this?
  • How long do my runs through my workflow take? How many are slow?
    • If my runs slow down, or if I want to speed them up, which tasks take the longest on average?
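Answering these two headline questions is a matter of simple aggregations over task runs. A minimal sketch, using invented sample data in place of real telemetry:

```python
from statistics import mean

# Invented sample of task runs: (task name, duration in seconds, succeeded?).
runs = [
    ("Setup", 15.0, True), ("Step 1", 180.0, True),
    ("Step 1", 170.0, False), ("Tear Down", 13.0, True),
]

def task_stats(runs):
    """Per-task error count and average duration, as a dashboard would show."""
    grouped = {}
    for name, duration, ok in runs:
        g = grouped.setdefault(name, {"durations": [], "errors": 0})
        g["durations"].append(duration)
        g["errors"] += 0 if ok else 1
    return {name: {"avg": mean(g["durations"]), "errors": g["errors"]}
            for name, g in grouped.items()}

print(task_stats(runs))
```

The aggregation is identical regardless of which workflow tool produced the task runs, which is exactly why shared attribute names matter.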

These questions are independent of the workflow I am looking at. This is how I would troubleshoot a CICD workflow, an AI Agent workflow, a business workflow, a build workflow, etc.

What is different is when I want to drill into a specific task and investigate further why there are errors, or why that task may be slow.

So, similar to a service map, a way to represent them for monitoring is something like the following:

```mermaid
flowchart TD
    style A fill:#0f0
    style B fill:#ff0
    style C fill:#0f0
    style D fill:#0f0
    style E fill:#0f0
    style F fill:#f00
    style G fill:#ff0
    subgraph "WF1, sr: 62/100, art: 4min"
        A[Setup<br>0 errors/min<br>15s AVG runtime] -->|100 runs/min| B(Step 1<br>2 errors/min<br>3min AVG runtime)
        B -->|98 runs/min| C(Step 2<br>0 errors/min<br>1.3s AVG runtime)
        C -->|40 runs/min| D[Step 3a<br>0 errors/min<br>12s AVG runtime]
        C -->|20 runs/min| E[Step 3b<br>0 errors/min<br>17s AVG runtime]
        C -->|40 runs/min| F[Step 3c<br>0 errors/min<br>3.7min AVG runtime]
        D --->|40 runs/min| G
        E --->|20 runs/min| G
        F --->|2 runs/min| G(Tear Down<br>2 errors/min<br>13s AVG runtime)
    end
```

If different workflows do not share a common set of attributes, engineers building visualizations either have to rebuild the same thing over and over again, or they have to add logic to select the right attributes depending on the workflow type they are looking at.

This means, from my point of view, there are a few common attributes:

  • workflow.name (which should also be the span name of the parent span)
  • workflow.id (which can be used to uniquely identify the workflow if names are equal)
  • workflow.task.name (which should also be the span name of the task span)
  • workflow.task.id (which can be used to uniquely identify the task within the workflow if names are equal)
  • workflow.task.run.id or workflow.task.flow.id, which uniquely identifies the current run.

If available, the *.id attributes should be taken from the workflow tool, such that they can be linked across the observability solution and the workflow tool.

For metrics, there are also some common examples:

  • workflow.run.duration (similar to http.server.request.duration)
  • workflow.failed_runs
  • workflow.active_runs (similar to http.server.active_requests)
  • workflow.task.run.duration
  • workflow.task.active_runs
  • workflow.task.failed_runs
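A minimal sketch of how such run-level metrics could be derived from run start/end events; the event model here is invented for illustration, and the metric names above are the proposal, not an existing convention:

```python
from collections import Counter

# Invented stream of workflow run events: (run_id, event, succeeded?).
events = [
    ("r1", "start", None), ("r2", "start", None),
    ("r1", "end", True), ("r3", "start", None), ("r3", "end", False),
]

active = 0            # would feed workflow.active_runs (up/down counter)
counters = Counter()  # would feed workflow.failed_runs and a total-runs count
for run_id, event, ok in events:
    if event == "start":
        active += 1
    else:
        active -= 1
        counters["runs"] += 1
        if not ok:
            counters["failed"] += 1

print(active, counters["runs"], counters["failed"])
```

In a real instrumentation these would map to an OpenTelemetry up/down counter, counter, and histogram respectively.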

Additionally, the following existing attributes may be used:

  • error.*
  • exception.*

There might be other common attributes, but these are the ones that would enable such a unified visualization. Other attributes would not be common, e.g. cicd.pipeline.task.run.url.full, since they are domain-specific or not available in all workflows (e.g. a tool like make has no concept of such a URL). I also now think that url.full would not be correct here, since this URL is not for a network call between services but for drilling deeper into the tool and its representation of that run.
