Container failed to start because of cgroups issue #7510

Open · asaff1 opened this issue Dec 10, 2024 · 0 comments

Labels: bug (Something isn't working), triage/needs-investigation (Issues that need to be investigated before triaging)

asaff1 commented Dec 10, 2024

Description

Observed Behavior:
After a node has been running pods for some time, it suddenly fails to start new pods.
The instance type is m7a.medium.

This is what I see in the pod status:

  containerStatuses:
    - name: main
      state:
        terminated:
          exitCode: 128
          reason: StartError
          message: >-
            failed to create containerd task: failed to create shim task: OCI
            runtime create failed: runc create failed: unable to start container
            process: error during container init: error setting cgroup config
            for procHooks process: unable to freeze: unknown
          startedAt: '1970-01-01T00:00:00Z'
          finishedAt: '2024-12-10T12:47:14Z'
          containerID: >-
            containerd://4d06aa70903cab3cfc48a96fcfddc4f6732c1869a4ff16573f28a52b9847f906
      lastState: {}
      ready: false
      restartCount: 0
      image: 'docker.io/library/python:3.7'
      imageID: >-
        docker.io/library/python@sha256:eedf63967cdb57d8214db38ce21f105003ed4e4d0358f02bedc057341bcf92a0
      containerID: >-
        containerd://4d06aa70903cab3cfc48a96fcfddc4f6732c1869a4ff16573f28a52b9847f906
      started: false
      volumeMounts:
        - name: var-run-argo
          mountPath: /var/run/argo
        - name: kube-api-access-9swtp
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          readOnly: true
          recursiveReadOnly: Disabled
        - name: aws-iam-token
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          readOnly: true
          recursiveReadOnly: Disabled

The node did run some tasks successfully, but once this issue appears, the pod is stuck in the "Running" state, and the only way out is to manually delete the node and then wait for a new node to be provisioned and run the pod.
This is not the first time it has happened; it usually appears after a while, not immediately. I don't understand what is causing this cgroups issue.
I'm not sure whether it is related to Karpenter; it could be an EKS bug (maybe an AMI bug?). But I can say for sure that this issue started when we started using Karpenter. We also didn't use this instance type before, so that could be related as well.
What can cause this error:

            failed to create containerd task: failed to create shim task: OCI
            runtime create failed: runc create failed: unable to start container
            process: error during container init: error setting cgroup config
            for procHooks process: unable to freeze: unknown

In the Kubernetes events I also see:
Cgroup v1 support is in maintenance mode, please migrate to Cgroup v2.
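
A quick way to confirm which cgroup version a node actually ended up on (assuming shell access to the instance, e.g. via SSM): `cgroup2fs` means cgroup v2, while `tmpfs` means the legacy cgroup v1 hierarchy that the "unable to freeze" error points at.

    # Filesystem type mounted at /sys/fs/cgroup:
    #   cgroup2fs -> unified cgroup v2, tmpfs -> legacy cgroup v1
    stat -fc %T /sys/fs/cgroup/

    # On cgroup v1 the freezer is a separate controller hierarchy
    ls -d /sys/fs/cgroup/freezer 2>/dev/null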

In EC2NodeClass I have:

spec:
  amiFamily: AL2
  amiSelectorTerms:
    - name: amazon-eks-node-1.31-v20241011
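
For reference, a minimal EC2NodeClass sketch that would move these nodes to AL2023 (which boots with cgroup v2 by default) instead of the cgroup v1 AL2 AMI above. This is only an assumption on my side as a possible workaround, not something I have verified fixes the freeze error; the name below is hypothetical.

    apiVersion: karpenter.k8s.aws/v1
    kind: EC2NodeClass
    metadata:
      name: al2023-example          # hypothetical name
    spec:
      amiSelectorTerms:
        - alias: al2023@latest      # amiFamily is inferred from the alias; pin a specific version in production
      # role / subnetSelectorTerms / securityGroupSelectorTerms omitted; same as the existing AL2 class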

Expected Behavior:
Pod should run successfully.

Versions:

  • Chart Version: karpenter-1.0.6

  • Kubernetes Version (kubectl version):
    Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"windows/amd64"}
    Server Version: version.Info{Major:"1", Minor:"31", GitVersion:"v1.31.2-eks-7f9249a", GitCommit:"1316e23bda3256fab6fbead2f22f6811dde77fb6", GitTreeState:"clean", BuildDate:"2024-10-23T23:38:37Z", GoVersion:"go1.22.8", Compiler:"gc", Platform:"linux/amd64"}

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@asaff1 asaff1 added bug Something isn't working needs-triage Issues that need to be triaged labels Dec 10, 2024
@jonathan-innis jonathan-innis added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Dec 10, 2024