
[bug] When running pipeline code, the DAG pod always stays in status Init:StartError #11422

Open

Epochex opened this issue Nov 23, 2024 · 1 comment

Epochex commented Nov 23, 2024

What happened?

I am currently running hyperparameter tuning, which creates 60x4 pods within my Kubeflow pipeline. During execution, I hit an issue where the DAG driver pod is unable to complete its initialization, which prevents the pipeline from continuing.

What did you expect to happen?

Pod Status: I observed the status of one of the DAG driver pods:
kubeflow-user-example-com auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 0/2 Init:StartError 0 84m
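
To see how many driver pods were in this state, I listed them like this (the grep pattern is just what I happened to use):
kubectl get pods -n kubeflow-user-example-com | grep dag-driver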

Logs Check: When I tried to fetch the logs for the pod, I received the following message:

kubectl logs auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com
Error from server (BadRequest): container "main" in pod "auto-digits-pipeline-half-complex2-xdlfv-system-dag-driver-3285612689" is waiting to start: PodInitializing

This suggested that the DAG pod might be failing to initialize because of the large number of pods that have to run concurrently.
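
Since the "main" container never starts, I assume it is the init container that failed; a way to look at it directly (I list the init container names first, since I do not know what they are called):
kubectl get pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com -o jsonpath='{.spec.initContainers[*].name}'
kubectl describe pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com
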
Configuration Investigation: I attempted to locate the ConfigMap associated with the DAG to extend the initialization time limit, as I suspected that the pod timeout might be too short. I used the following command:

kubectl get cm -n kubeflow
However, I could not find a ConfigMap containing relevant parameters to control the DAG pod startup timeout.
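
For completeness, this is roughly how I searched all ConfigMap data for anything timeout-related (the key names are guesses; KFP may not expose such a setting at all):
kubectl get cm -n kubeflow -o yaml | grep -iE 'timeout|deadline'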

Cluster Events: Investigating further, I listed the cluster events:
kubectl get events
I found the following error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: fork/exec /kind/bin/mount-product-files.sh: argument list too long: unknown

The error suggests that the initialization failure is caused by the number or total size of the arguments passed to /kind/bin/mount-product-files.sh exceeding the allowable limit, which makes container creation fail.
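
"argument list too long" is the E2BIG error from exec(): the combined size of the arguments and environment handed to /kind/bin/mount-product-files.sh exceeds a kernel limit. The /kind/... path suggests the cluster runs on kind, so the node is itself a Docker container and the limits can be inspected roughly like this (the node name kind-control-plane is an assumption and may differ in my setup):
docker ps --filter name=control-plane
docker exec kind-control-plane getconf ARG_MAX
docker exec kind-control-plane sh -c 'echo $((32 * $(getconf PAGESIZE)))'   # rough per-string limit (MAX_ARG_STRLEN = 32 pages on Linux)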

Question

How can I modify the corresponding parameters to avoid this "argument list too long" issue during the container initialization phase? Specifically, I would appreciate guidance on:

Identifying the appropriate ConfigMap or configuration where I can modify the initialization settings for the DAG pods.

Mitigating the "argument list too long" issue, possibly by optimizing or limiting the number of mounted files or arguments.

Any insights or suggestions on how to address this issue would be greatly appreciated.
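
If it helps, one check I plan to do (a rough sketch; which field actually overflows is only my assumption) is measuring how large the stuck driver pod's container args and env really are, and looking at what the kind hook does with them:
kubectl get pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com -o jsonpath='{.spec.containers[*].args}' | wc -c
kubectl get pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com -o jsonpath='{.spec.containers[*].env}' | wc -c
docker exec kind-control-plane cat /kind/bin/mount-product-files.sh   # node name is an assumption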

Environment

Kubernetes version: 1.31

$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.0

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@andreyvelich
Member

Hi @Epochex, I think this issue is related to Kubeflow Pipelines, not Kubeflow Training Operator.
/transfer pipelines

@google-oss-prow google-oss-prow bot transferred this issue from kubeflow/training-operator Nov 30, 2024