
Add k8s.container.status.waiting metric to semantic conventions #1672

Open
povilasv opened this issue Dec 11, 2024 · 4 comments
Labels
area:k8s enhancement New feature or request

Comments

@povilasv
Contributor

Area(s)

area:k8s

Is your change request related to a problem? Please describe.

Users of the K8s Cluster receiver would like to monitor the K8s CrashLoopBackOff state. We have had multiple issues about this.

Previously I tried to model status.waiting as a Resource attribute, but because it's mutable it cannot be a Resource attribute. See #997

I would like to propose adding this as a metric to unblock PR open-telemetry/opentelemetry-collector-contrib#35668

Describe the solution you'd like

  # k8s.container.* metrics
  - id: metric.k8s.container.status.waiting
    type: metric
    metric_name: k8s.container.status.waiting
    stability: experimental
    brief: "Whether container is in waiting state. (0 for no, 1 for yes)"
    instrument: gauge
    unit: ""

Describe alternatives you've considered

No response

Additional context

No response

@povilasv
Contributor Author

@TylerHelmuth / @ChrsMark / @dmitryax would appreciate thoughts / feedback 🙇

@ChrsMark
Member

Thanks for filing this @povilasv, modeling this as a metric makes sense!

There is a proposal on how we should model such "state"/"phase"/"status" metrics in SemConv. It is already in use by hw metrics and jmx metrics, as mentioned at #1032 (comment) (see #1554).
Process status will also be modeled like this: #1212 (comment).
I have filed another one for k8s.namespace.phase: #1668, which I think will help us verify the proposal.

My only question here would be what the modeling should actually look like. What you propose aligns with kube-state-metrics, but I would like to see how this will be combined with other statuses like running and terminated, as well as how we reflect the reason part. Can we deal with them all at once, so as to ensure that the decision will scale and cover all cases?

FWIW, if I'm not mistaken we can't have both k8s.container.status.waiting and k8s.container.status.waiting.reason: a name cannot be a metric name and a namespace at the same time, and the first would become a namespace of the second.

@povilasv
Contributor Author

Thanks for the review. I'll try to work on this next week to model this differently.

@povilasv
Contributor Author

I did some experimenting and looked at some of the k8s code (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L2057-L2423)

Some examples of the data for different states:

Terminated:

  containerStatuses:
  - containerID: containerd://be5d7d76ab64bdc354b97bb7d11ea81068f70d1a633574f72afe44d374f89628
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:2919d0172f7524b2d8df9e50066a682669e6d170ac0f6a49676d54358fe970b5
    lastState:
      terminated:
        containerID: containerd://67c60d2ab1e31bbf82d27d96ffb507a3d407dc4314b72e328442fa2b5a63df1b
        exitCode: 1
        finishedAt: "2024-12-17T14:30:21Z"
        reason: Error
        startedAt: "2024-12-17T14:30:21Z"
    name: crashloop-container
    ready: false
    restartCount: 2
    started: false
    state:
      terminated:
        containerID: containerd://be5d7d76ab64bdc354b97bb7d11ea81068f70d1a633574f72afe44d374f89628
        exitCode: 1
        finishedAt: "2024-12-17T14:30:39Z"
        reason: Error
        startedAt: "2024-12-17T14:30:39Z"

Running:

  containerStatuses:
  - containerID: containerd://9e1c366c3ef52f6a5df772b7b26d5950612d75e5c96b2fb27b49882a48ecce35
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:2919d0172f7524b2d8df9e50066a682669e6d170ac0f6a49676d54358fe970b5
    lastState: {}
    name: always-running-container
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-12-17T14:33:50Z"

Waiting:

  containerStatuses:
  - containerID: containerd://dba57c3f17b046e4ebad10351495d7f6f86db568abec77a611f81ffd3b10a378
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:2919d0172f7524b2d8df9e50066a682669e6d170ac0f6a49676d54358fe970b5
    lastState:
      terminated:
        containerID: containerd://dba57c3f17b046e4ebad10351495d7f6f86db568abec77a611f81ffd3b10a378
        exitCode: 1
        finishedAt: "2024-12-17T14:46:23Z"
        reason: Error
        startedAt: "2024-12-17T14:46:23Z"
    name: crashloop-container
    ready: false
    restartCount: 8
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=crashloop-container pod=crashloop-pod_default(f8e3ee00-f1ed-465a-bd70-e88de51219c2)
        reason: CrashLoopBackOff

Terminated is a rather short-lived state; the container quickly moves back to the Waiting state. All these states are mutually exclusive.

I think we can model something like this:

  - id: metric.k8s.container.status.state
    type: metric
    metric_name: k8s.container.status.state
    stability: experimental
    brief: ""
    instrument: updowncounter
    unit: ""
    attributes:
      - ref: k8s.container.state
        requirement_level: required
      - ref: k8s.container.reason
        requirement_level: required
      - ref: k8s.pod.name
        requirement_level: required
      - ref: k8s.container.name
        requirement_level: required
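
For a container currently in CrashLoopBackOff, this would emit something like the following (hypothetical names; how reason is populated for the inactive states is exactly the kind of open question listed below):

  k8s.container.status.state{k8s.container.state="waiting", k8s.container.reason="CrashLoopBackOff", k8s.pod.name="crashloop-pod", k8s.container.name="crashloop-container"} 1
  k8s.container.status.state{k8s.container.state="running", k8s.pod.name="crashloop-pod", k8s.container.name="crashloop-container"} 0
  k8s.container.status.state{k8s.container.state="terminated", k8s.pod.name="crashloop-pod", k8s.container.name="crashloop-container"} 0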

Things to consider:

  1. Should reason be in a separate metric, since it increases cardinality quite a bit? Then we would have two metrics, something like k8s.container.status.state and k8s.container.status.reason (see the sketch after this list).
  2. "reason" is a string (not an enum) in k8s and can contain any value, but it usually has a format like ErrImagePull / ImagePullBackOff / ContainerStatusUnknown / CrashLoopBackOff.
  3. There is also the last_terminated_reason, which is very useful for troubleshooting Waiting containers (was it OOM-killed? Did the process error out?). Should this be a separate metric?
  4. Also, a container can be Ready to accept traffic or not, so in the future we would likely have a k8s.container.status.ready metric.
  5. There is also a separate state for init containers, so we would likely need a k8s.initcontainer.status metric?
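
As a rough sketch of the split from point 1, mirroring kube-state-metrics' *_reason series (hypothetical, not an agreed-upon definition):

  - id: metric.k8s.container.status.reason
    type: metric
    metric_name: k8s.container.status.reason
    stability: experimental
    brief: "Describes the reason the container is in its current state. (1 for the current reason, 0 otherwise)"
    instrument: updowncounter
    unit: ""
    attributes:
      - ref: k8s.container.reason
        requirement_level: required
      - ref: k8s.pod.name
        requirement_level: required
      - ref: k8s.container.name
        requirement_level: required

This would keep the high-cardinality reason attribute off the state series, at the cost of having two metrics to correlate.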

FYI, kube-state-metrics did this:

kube_pod_container_status_terminated{container=""} 0 | 1
kube_pod_container_status_terminated_reason{container="", reason=""} 1
kube_pod_container_status_waiting{container=""} 0 | 1
kube_pod_container_status_waiting_reason{container="", reason=""} 1
kube_pod_container_status_running{container=""} 1
kube_pod_container_status_ready{container=""} 0 | 1
kube_pod_init_container_status_running{container=""} 0 | 1
