There are many error train pods, But Failed Trials still is 0, And Experiment still run #2452

Yumeka999 · 2024-11-08T07:03:17Z

What happened?

I ran the following Python script with max_trial_count=3 and parallel_trial_count=3:

import kubeflow.katib as katib


def objective(katib_param):
    # Fail on purpose so that every training pod exits with an error.
    raise ValueError("Let train error")
    # result = 4 * int(katib_param["a"]) - float(katib_param["b"]) ** 2
    # print(f"result={result}")


# Search space for the two hyperparameters.
katib_param = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2),
}
katib_client = katib.KatibClient(namespace="kubeflow")
name = "hpo-katib"
katib_client.tune(
    name=name,
    objective=objective,
    base_image="harbor.xnunion.com/run_torch:0.0.1",
    parameters=katib_param,
    objective_metric_name="result",
    max_trial_count=3,
    parallel_trial_count=3,
    resources_per_trial={"cpu": "2"},
)
katib_client.wait_for_experiment_condition(name=name)
katib_client.get_optimal_hyperparameters(name)

This is the status of all pods:

$ kubectl get pod -n kubeflow
NAME                                 READY   STATUS                     RESTARTS   AGE
hpo-katib-hzs5gl7g-62v64             0/2     Error                      0          61s
hpo-katib-hzs5gl7g-npjh4             0/2     Error                      0          89s
hpo-katib-hzs5gl7g-shcgt             0/2     Error                      0          93s
hpo-katib-hzs5gl7g-zn864             0/2     Error                      0          65s
hpo-katib-random-584585bfb8-tjfjk    1/1     Running                    0          115s
hpo-katib-rpkx8tsv-2v4wn             0/2     Error                      0          10s
hpo-katib-rpkx8tsv-6vp8h             0/2     Error                      0          79s
hpo-katib-rpkx8tsv-q6fw5             0/2     Error                      0          55s
hpo-katib-rpkx8tsv-vffsq             0/2     Error                      0          93s
hpo-katib-zcbpxgp5-4jk8s             0/2     Error                      0          51s
hpo-katib-zcbpxgp5-6pg6f             0/2     Error                      0          93s
hpo-katib-zcbpxgp5-hzb4t             0/2     Error                      0          79s
hpo-katib-zcbpxgp5-zh4bl             0/2     Error                      0          55s
katib-cert-generator-krsgv           0/1     Completed                  0          27h
katib-controller-5f9596f9f8-5vhh6    1/1     Running                    0          27h

This is the output of utils.print_experiment_status(experiment):

Experiment Trials status: 3 Trials, 0 Pending Trials, 3 Running Trials, 0 Succeeded Trials, 0 Failed Trials, 0 EarlyStopped Trials, 0 MetricsUnavailable Trials

This is strange: 12 training pods are in the Error state, but Failed Trials is still 0.

What did you expect to happen?

The number of Failed Trials should match the number of training pods in the Error state.

How should I write the code or config to achieve this?

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/kubeflowkatib/katib-controller:v0.15.0

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.


andreyvelich · 2024-11-30T00:40:47Z

Sorry for the late reply @Yumeka999!
As you can see, your Experiment creates only 3 Trials: hpo-katib-zcbpxgp5, hpo-katib-hzs5gl7g, and hpo-katib-rpkx8tsv.
Each Trial's pods are simply being recreated after every failure because of the batch/v1 Job pod backoff failure policy.
The default value of the Job's backoffLimit is 6: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy.

Once the Job hits that limit of failed pods, the Job is marked Failed, and the corresponding Trial should be marked Failed as well.
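
To make the arithmetic in the reply above concrete, here is a minimal sketch that groups the Error pods from the reported `kubectl get pod` output by their owning Job (one Job per Trial) and compares each count with the default backoffLimit of 6. The helper `failed_pods_per_job` is hypothetical and not part of Katib or kubectl. It assumes pod names have the form `<job-name>-<random-suffix>`, which matches the listing in this issue.

```python
from collections import Counter

# Default Job backoffLimit, as cited in the reply above.
DEFAULT_BACKOFF_LIMIT = 6


def failed_pods_per_job(kubectl_output: str) -> Counter:
    """Count pods in the Error state, grouped by owning Job (Trial) name.

    Hypothetical helper: assumes pod names look like "<job-name>-<suffix>".
    """
    counts = Counter()
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 5-column pod row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        pod_name, _ready, status, _restarts, _age = fields
        if status == "Error":
            job_name = pod_name.rsplit("-", 1)[0]
            counts[job_name] += 1
    return counts


# Pod listing copied from the issue report.
PODS = """\
NAME                                 READY   STATUS      RESTARTS   AGE
hpo-katib-hzs5gl7g-62v64             0/2     Error       0          61s
hpo-katib-hzs5gl7g-npjh4             0/2     Error       0          89s
hpo-katib-hzs5gl7g-shcgt             0/2     Error       0          93s
hpo-katib-hzs5gl7g-zn864             0/2     Error       0          65s
hpo-katib-random-584585bfb8-tjfjk    1/1     Running     0          115s
hpo-katib-rpkx8tsv-2v4wn             0/2     Error       0          10s
hpo-katib-rpkx8tsv-6vp8h             0/2     Error       0          79s
hpo-katib-rpkx8tsv-q6fw5             0/2     Error       0          55s
hpo-katib-rpkx8tsv-vffsq             0/2     Error       0          93s
hpo-katib-zcbpxgp5-4jk8s             0/2     Error       0          51s
hpo-katib-zcbpxgp5-6pg6f             0/2     Error       0          93s
hpo-katib-zcbpxgp5-hzb4t             0/2     Error       0          79s
hpo-katib-zcbpxgp5-zh4bl             0/2     Error       0          55s
katib-cert-generator-krsgv           0/1     Completed   0          27h
katib-controller-5f9596f9f8-5vhh6    1/1     Running     0          27h
"""

failed = failed_pods_per_job(PODS)
for job_name, n in sorted(failed.items()):
    note = "below the limit" if n < DEFAULT_BACKOFF_LIMIT else "at or past the limit"
    print(f"{job_name}: {n} failed pods, backoffLimit {DEFAULT_BACKOFF_LIMIT} ({note})")
```

With the listing from the report, each of the three Trials has 4 failed pods, so all 12 Error pods belong to Jobs that are still below the default limit. That is consistent with the Experiment showing 3 Running Trials and 0 Failed Trials at the time the output was captured.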

andreyvelich · 2024-11-30T00:40:57Z

/remove-label lifecycle/needs-triage

Yumeka999 added kind/bug lifecycle/needs-triage labels Nov 8, 2024

google-oss-prow bot removed the lifecycle/needs-triage label Nov 30, 2024
