When I run this Python script with max_trial_count=3 and parallel_trial_count=3:
    def objective(katib_param):
        raise ValueError("Let train error")
        # result = 4 * int(katib_param["a"]) - float(katib_param["b"]) ** 2
        # print(f"result={result}")

    import kubeflow.katib as katib

    katib_param = {"a": katib.search.int(min=10, max=20), "b": katib.search.double(min=0.1, max=0.2)}
    katib_client = katib.KatibClient(namespace="kubeflow")
    name = "hpo-katib"
    katib_client.tune(
        name=name,
        objective=objective,
        base_image="harbor.xnunion.com/run_torch:0.0.1",
        parameters=katib_param,
        objective_metric_name="result",
        max_trial_count=3,
        parallel_trial_count=3,
        resources_per_trial={"cpu": "2"})
    katib_client.wait_for_experiment_condition(name=name)
    katib_client.get_optimal_hyperparameters(name)
And this is the status of all pods:
    kubectl get pod -n kubeflow
    NAME                                 READY   STATUS      RESTARTS   AGE
    hpo-katib-hzs5gl7g-62v64             0/2     Error       0          61s
    hpo-katib-hzs5gl7g-npjh4             0/2     Error       0          89s
    hpo-katib-hzs5gl7g-shcgt             0/2     Error       0          93s
    hpo-katib-hzs5gl7g-zn864             0/2     Error       0          65s
    hpo-katib-random-584585bfb8-tjfjk    1/1     Running     0          115s
    hpo-katib-rpkx8tsv-2v4wn             0/2     Error       0          10s
    hpo-katib-rpkx8tsv-6vp8h             0/2     Error       0          79s
    hpo-katib-rpkx8tsv-q6fw5             0/2     Error       0          55s
    hpo-katib-rpkx8tsv-vffsq             0/2     Error       0          93s
    hpo-katib-zcbpxgp5-4jk8s             0/2     Error       0          51s
    hpo-katib-zcbpxgp5-6pg6f             0/2     Error       0          93s
    hpo-katib-zcbpxgp5-hzb4t             0/2     Error       0          79s
    hpo-katib-zcbpxgp5-zh4bl             0/2     Error       0          55s
    katib-cert-generator-krsgv           0/1     Completed   0          27h
    katib-controller-5f9596f9f8-5vhh6    1/1     Running     0          27h
And this is the output of utils.print_experiment_status(experiment):
Experiment Trials status: 3 Trials, 0 Pending Trials, 3 Running Trials, 0 Succeeded Trials, 0 Failed Trials, 0 EarlyStopped Trials, 0 MetricsUnavailable Trials
It's strange: 12 training pods are in the Error state, but Failed Trials is still 0.
I expected the number of Failed Trials to equal the number of training pods in the Error state.
How do I write the code or config to get that behaviour?
Kubernetes version:
    $ kubectl version
    Client Version: v1.29.6
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.30.2
Katib controller version:
    $ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
    kubeflow/kubeflowkatib/katib-controller:v0.15.0
Katib Python SDK version:
    $ pip show kubeflow-katib
    Name: kubeflow-katib
    Version: 0.17.0
Give it a 👍 We prioritize the issues with most 👍
Sorry for the late reply @Yumeka999! As you can see, your Experiment creates only 3 Trials: hpo-katib-zcbpxgp5, hpo-katib-hzs5gl7g, and hpo-katib-rpkx8tsv. Each Trial's pod is simply being retried because of the retry behaviour of the underlying batch/v1 Job. The default value of the Job's backoffLimit is 6: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy.
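To make the mapping from pods to Trials explicit, here is a small sketch that groups the pod names from the `kubectl get pod` output by Trial. It assumes each pod is named `<trial-name>-<random-suffix>`, which matches the listing but is not something I have checked against the Katib source; `trial_name` is a helper made up for this illustration.

```python
from collections import Counter

# Trial pod names copied from the `kubectl get pod` output in this issue.
pod_names = [
    "hpo-katib-hzs5gl7g-62v64", "hpo-katib-hzs5gl7g-npjh4",
    "hpo-katib-hzs5gl7g-shcgt", "hpo-katib-hzs5gl7g-zn864",
    "hpo-katib-rpkx8tsv-2v4wn", "hpo-katib-rpkx8tsv-6vp8h",
    "hpo-katib-rpkx8tsv-q6fw5", "hpo-katib-rpkx8tsv-vffsq",
    "hpo-katib-zcbpxgp5-4jk8s", "hpo-katib-zcbpxgp5-6pg6f",
    "hpo-katib-zcbpxgp5-hzb4t", "hpo-katib-zcbpxgp5-zh4bl",
]


def trial_name(pod_name: str) -> str:
    """Strip the trailing pod suffix (assumed naming: <trial-name>-<suffix>)."""
    return pod_name.rsplit("-", 1)[0]


# Count how many failed pods belong to each Trial.
pods_per_trial = Counter(trial_name(p) for p in pod_names)
for trial, count in sorted(pods_per_trial.items()):
    print(f"{trial}: {count} pods")
```

The 12 Error pods belong to just 3 Trials, 4 pods each, so at the time of the listing none of the Jobs had used up its retries yet.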
Once a Job has exhausted its retries (6 by default), the Job is marked Failed and the corresponding Trial should be marked Failed as well.
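If you want a Trial to fail on the first pod error rather than after the retries, one option is to lower `backoffLimit` on the Trial's Job. The sketch below is only a batch/v1 Job manifest written as a Python dict, on the assumption that the Experiment is defined with a custom Trial template. I am not sure whether `tune()` in SDK 0.17.0 exposes this field. The container name and command are placeholders, not taken from this issue.

```python
# Hypothetical Trial spec: a batch/v1 Job that is not retried after a pod failure.
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        # 0 retries: the first failed pod marks the Job as Failed.
        "backoffLimit": 0,
        "template": {
            "spec": {
                # Do not restart the container in place either.
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "training-container",  # placeholder name
                        "image": "harbor.xnunion.com/run_torch:0.0.1",
                        "command": ["python", "train.py"],  # placeholder command
                    }
                ],
            }
        },
    },
}

print(trial_spec["spec"]["backoffLimit"])
```

With a Job like this, each Trial would produce at most one Error pod, so the pod count and the Failed Trials count should line up once the controller has reconciled the Job status.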
/remove-label lifecycle/needs-triage