Container failed to start because of cgroups issue #7510

Open · asaff1 opened this issue Dec 10, 2024 · 0 comments

Labels: bug (Something isn't working), triage/needs-investigation (Issues that need to be investigated before triaging)

asaff1 commented Dec 10, 2024

Description

Observed Behavior:
After a node has been running pods for some time, it suddenly fails to start new pods.
The instance type is m7a.medium.

This is what I see in the pod status:

  containerStatuses:
    - name: main
      state:
        terminated:
          exitCode: 128
          reason: StartError
          message: >-
            failed to create containerd task: failed to create shim task: OCI
            runtime create failed: runc create failed: unable to start container
            process: error during container init: error setting cgroup config
            for procHooks process: unable to freeze: unknown
          startedAt: '1970-01-01T00:00:00Z'
          finishedAt: '2024-12-10T12:47:14Z'
          containerID: >-
            containerd://4d06aa70903cab3cfc48a96fcfddc4f6732c1869a4ff16573f28a52b9847f906
      lastState: {}
      ready: false
      restartCount: 0
      image: 'docker.io/library/python:3.7'
      imageID: >-
        docker.io/library/python@sha256:eedf63967cdb57d8214db38ce21f105003ed4e4d0358f02bedc057341bcf92a0
      containerID: >-
        containerd://4d06aa70903cab3cfc48a96fcfddc4f6732c1869a4ff16573f28a52b9847f906
      started: false
      volumeMounts:
        - name: var-run-argo
          mountPath: /var/run/argo
        - name: kube-api-access-9swtp
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          readOnly: true
          recursiveReadOnly: Disabled
        - name: aws-iam-token
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          readOnly: true
          recursiveReadOnly: Disabled

The node did run some tasks successfully, but once this issue appears, the pod is stuck in the "Running" state, and the only way out is to manually delete the node and then wait for a new node to be provisioned and run the pod.
This is not the first time it has happened; it usually appears after a while, not immediately. I don't understand what is causing this cgroups issue.
I'm not sure whether it is related to Karpenter; it could be an EKS bug (maybe an AMI bug?). But I can say for sure that this issue started when we started using Karpenter. We also didn't use this instance type before, so that could be related as well.
What can cause this error:

            failed to create containerd task: failed to create shim task: OCI
            runtime create failed: runc create failed: unable to start container
            process: error during container init: error setting cgroup config
            for procHooks process: unable to freeze: unknown

In the Kubernetes events I also see:
Cgroup v1 support is in maintenance mode, please migrate to Cgroup v2.
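
A quick way to confirm which cgroup version a node actually ended up on (assuming shell access to the instance, e.g. via SSM): `cgroup2fs` means cgroup v2, while `tmpfs` means the legacy cgroup v1 hierarchy that the "unable to freeze" error points at.

    # Filesystem type mounted at /sys/fs/cgroup:
    #   cgroup2fs -> unified cgroup v2, tmpfs -> legacy cgroup v1
    stat -fc %T /sys/fs/cgroup/

    # On cgroup v1 the freezer is a separate controller hierarchy
    ls -d /sys/fs/cgroup/freezer 2>/dev/null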

In EC2NodeClass I have:

spec:
  amiFamily: AL2
  amiSelectorTerms:
    - name: amazon-eks-node-1.31-v20241011
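
For reference, a minimal EC2NodeClass sketch that would move these nodes to AL2023 (which boots with cgroup v2 by default) instead of the cgroup v1 AL2 AMI above. This is only an assumption on my side as a possible workaround, not something I have verified fixes the freeze error; the name below is hypothetical.

    apiVersion: karpenter.k8s.aws/v1
    kind: EC2NodeClass
    metadata:
      name: al2023-example          # hypothetical name
    spec:
      amiSelectorTerms:
        - alias: al2023@latest      # amiFamily is inferred from the alias; pin a specific version in production
      # role / subnetSelectorTerms / securityGroupSelectorTerms omitted; same as the existing AL2 class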

Expected Behavior:
Pod should run successfully.

Versions:

  • Chart Version: karpenter-1.0.6

  • Kubernetes Version (kubectl version):
    Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"windows/amd64"}
    Server Version: version.Info{Major:"1", Minor:"31", GitVersion:"v1.31.2-eks-7f9249a", GitCommit:"1316e23bda3256fab6fbead2f22f6811dde77fb6", GitTreeState:"clean", BuildDate:"2024-10-23T23:38:37Z", GoVersion:"go1.22.8", Compiler:"gc", Platform:"linux/amd64"}

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@asaff1 asaff1 added bug Something isn't working needs-triage Issues that need to be triaged labels Dec 10, 2024
@jonathan-innis jonathan-innis added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Dec 10, 2024