
There are many Error train pods, but Failed Trials is still 0 and the Experiment is still running #2452

Open
Yumeka999 opened this issue Nov 8, 2024 · 2 comments

@Yumeka999

What happened?

When I run this Python script with max_trial_count=3 and parallel_trial_count=3:

import kubeflow.katib as katib

def objective(katib_param):
    # Intentionally fail every Trial to reproduce the issue.
    raise ValueError("Let train error")
    # result = 4 * int(katib_param["a"]) - float(katib_param["b"]) ** 2
    # print(f"result={result}")

katib_param = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2),
}
katib_client = katib.KatibClient(namespace="kubeflow")
name = "hpo-katib"
katib_client.tune(
    name=name,
    objective=objective,
    base_image="harbor.xnunion.com/run_torch:0.0.1",
    parameters=katib_param,
    objective_metric_name="result",
    max_trial_count=3,
    parallel_trial_count=3,
    resources_per_trial={"cpu": "2"},
)
katib_client.wait_for_experiment_condition(name=name)
katib_client.get_optimal_hyperparameters(name)

And this is the status of all pods:

kubectl get pod -n kubeflow
NAME                                 READY   STATUS                     RESTARTS   AGE
hpo-katib-hzs5gl7g-62v64             0/2     Error                      0          61s
hpo-katib-hzs5gl7g-npjh4             0/2     Error                      0          89s
hpo-katib-hzs5gl7g-shcgt             0/2     Error                      0          93s
hpo-katib-hzs5gl7g-zn864             0/2     Error                      0          65s
hpo-katib-random-584585bfb8-tjfjk    1/1     Running                    0          115s
hpo-katib-rpkx8tsv-2v4wn             0/2     Error                      0          10s
hpo-katib-rpkx8tsv-6vp8h             0/2     Error                      0          79s
hpo-katib-rpkx8tsv-q6fw5             0/2     Error                      0          55s
hpo-katib-rpkx8tsv-vffsq             0/2     Error                      0          93s
hpo-katib-zcbpxgp5-4jk8s             0/2     Error                      0          51s
hpo-katib-zcbpxgp5-6pg6f             0/2     Error                      0          93s
hpo-katib-zcbpxgp5-hzb4t             0/2     Error                      0          79s
hpo-katib-zcbpxgp5-zh4bl             0/2     Error                      0          55s
katib-cert-generator-krsgv           0/1     Completed                  0          27h
katib-controller-5f9596f9f8-5vhh6    1/1     Running                    0          27h

And this is the output of utils.print_experiment_status(experiment):

Experiment Trials status: 3 Trials, 0 Pending Trials, 3 Running Trials, 0 Succeeded Trials, 0 Failed Trials, 0 EarlyStopped Trials, 0 MetricsUnavailable Trials

It's strange: there are 12 Error train pods, but Failed Trials is still 0.

What did you expect to happen?

The number of Error train pods should equal the number of Failed Trials.

How should I write the code or configuration to get that behavior?

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/kubeflowkatib/katib-controller:v0.15.0

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with the most 👍

@andreyvelich
Member

Sorry for the late reply @Yumeka999!
As you can see, your Experiment creates only 3 Trials: hpo-katib-zcbpxgp5, hpo-katib-hzs5gl7g, and hpo-katib-rpkx8tsv.
The Trials' pods are simply being restarted under the batch Job's restart policy.
The default value of a Job's backoffLimit is 6: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

Once 6 pods have failed, the Job will be marked Failed and the corresponding Trial will fail as well.
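The arithmetic can be sketched in plain Python (a hypothetical illustration, not Katib code: the pod names are copied from the kubectl output earlier in this issue, and the failure threshold follows the Kubernetes Job default backoffLimit described above):

```python
from collections import Counter

# Kubernetes Job default: the Job is considered Failed after this many pod failures.
BACKOFF_LIMIT = 6

# Error pods from the `kubectl get pod` output above. The part before the
# final dash is the Trial (Job) name; the suffix is the per-retry pod hash.
error_pods = [
    "hpo-katib-hzs5gl7g-62v64", "hpo-katib-hzs5gl7g-npjh4",
    "hpo-katib-hzs5gl7g-shcgt", "hpo-katib-hzs5gl7g-zn864",
    "hpo-katib-rpkx8tsv-2v4wn", "hpo-katib-rpkx8tsv-6vp8h",
    "hpo-katib-rpkx8tsv-q6fw5", "hpo-katib-rpkx8tsv-vffsq",
    "hpo-katib-zcbpxgp5-4jk8s", "hpo-katib-zcbpxgp5-6pg6f",
    "hpo-katib-zcbpxgp5-hzb4t", "hpo-katib-zcbpxgp5-zh4bl",
]

# Group the 12 Error pods by Trial: there are only 3 Trials, each retried 4 times.
failures_per_trial = Counter(pod.rsplit("-", 1)[0] for pod in error_pods)
for trial, n_failed in sorted(failures_per_trial.items()):
    status = "Failed" if n_failed >= BACKOFF_LIMIT else "Running"
    print(f"{trial}: {n_failed} Error pods -> Trial {status}")
```

Each Trial has only 4 failed pods, which is below the backoffLimit of 6, so every Job (and therefore every Trial) is still Running and Failed Trials stays at 0.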

@andreyvelich
Member

/remove-label lifecycle/needs-triage
