
[bug] When running pipeline code, the DAG pod always stays in status Init:StartError #11422

Open

Epochex opened this issue Nov 23, 2024 · 1 comment

Epochex commented Nov 23, 2024

What happened?

I am currently running hyperparameter tuning, which creates 60x4 pods within my Kubeflow pipeline. During execution, I hit an issue where the DAG driver pod is unable to complete its initialization, which prevents the pipeline from continuing.

What did you expect to happen?

Pod Status: I observed the status of one of the DAG driver pods:
kubeflow-user-example-com auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 0/2 Init:StartError 0 84m
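
To see how many driver pods were in this state, I listed them like this (the grep pattern is just what I happened to use):
kubectl get pods -n kubeflow-user-example-com | grep dag-driver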

Logs Check: When I tried to fetch the logs for the pod, I received the following message:

kubectl logs auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com
Error from server (BadRequest): container "main" in pod "auto-digits-pipeline-half-complex2-xdlfv-system-dag-driver-3285612689" is waiting to start: PodInitializing

This suggested that the DAG pod might be failing to initialize because of the large number of pods that have to run concurrently.
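
Since the "main" container never starts, I assume it is the init container that failed; a way to look at it directly (I list the init container names first, since I do not know what they are called):
kubectl get pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com -o jsonpath='{.spec.initContainers[*].name}'
kubectl describe pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com
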
Configuration Investigation: I attempted to locate the ConfigMap associated with the DAG to extend the initialization time limit, as I suspected that the pod timeout might be too short. I used the following command:

kubectl get cm -n kubeflow
However, I could not find a ConfigMap containing relevant parameters to control the DAG pod startup timeout.
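
For completeness, this is roughly how I searched all ConfigMap data for anything timeout-related (the key names are guesses; KFP may not expose such a setting at all):
kubectl get cm -n kubeflow -o yaml | grep -iE 'timeout|deadline'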

Cluster Events: Investigating further, I listed the cluster events:
kubectl get events
I found the following error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: fork/exec /kind/bin/mount-product-files.sh: argument list too long: unknown

The error suggests that the initialization failure is caused by the number or total size of the arguments passed to /kind/bin/mount-product-files.sh exceeding the allowable limit, which makes container creation fail.
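
"argument list too long" is the E2BIG error from exec(): the combined size of the arguments and environment handed to /kind/bin/mount-product-files.sh exceeds a kernel limit. The /kind/... path suggests the cluster runs on kind, so the node is itself a Docker container and the limits can be inspected roughly like this (the node name kind-control-plane is an assumption and may differ in my setup):
docker ps --filter name=control-plane
docker exec kind-control-plane getconf ARG_MAX
docker exec kind-control-plane sh -c 'echo $((32 * $(getconf PAGESIZE)))'   # rough per-string limit (MAX_ARG_STRLEN = 32 pages on Linux)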

Question

How can I modify the corresponding parameters to avoid this "argument list too long" issue during the container initialization phase? Specifically, I would appreciate guidance on:

Identifying the appropriate ConfigMap or configuration where I can modify the initialization settings for the DAG pods.

Mitigating the "argument list too long" issue, possibly by optimizing or limiting the number of mounted files or arguments.

Any insights or suggestions on how to address this issue would be greatly appreciated.
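
If it helps, one check I plan to do (a rough sketch; which field actually overflows is only my assumption) is measuring how large the stuck driver pod's container args and env really are, and looking at what the kind hook does with them:
kubectl get pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com -o jsonpath='{.spec.containers[*].args}' | wc -c
kubectl get pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com -o jsonpath='{.spec.containers[*].env}' | wc -c
docker exec kind-control-plane cat /kind/bin/mount-product-files.sh   # node name is an assumption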

Environment

Kubernetes version: 1.31

$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.0

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@andreyvelich
Member

Hi @Epochex, I think this issue is related to Kubeflow Pipelines, not Kubeflow Training Operator.
/transfer pipelines

@google-oss-prow google-oss-prow bot transferred this issue from kubeflow/training-operator Nov 30, 2024