
Introduce Job-agnostic API to declare the maximal execution time for a Job #3125

Open
3 tasks
Tracked by #3192

Comments

@mimowo
Contributor

mimowo commented Sep 24, 2024

What would you like to be added:

A job-agnostic API to set the maximal execution time for a Job.

There are some open questions:

  • name of the API (I would initially propose the annotation kueue.x-k8s.io/max-exec-time-seconds; see the sketch after this list)
  • semantics of the deadline (measured since the last admission, or as cumulative execution time) - I would initially suggest measuring the cumulative execution time across admissions and re-using it in the EXEC_TIME command for kueuectl, but that will require API changes
  • what happens after the time is exceeded - I would initially propose deactivating the workload, but alternatives might be worth considering
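
As an illustration of the first point only, here is a minimal Go sketch of how an integration might read and parse the proposed annotation. The constant uses the name suggested above; the helper function, its signature, and its behavior are hypothetical and not part of any existing Kueue API.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// maxExecTimeAnnotation is the proposed (not yet final) annotation key.
const maxExecTimeAnnotation = "kueue.x-k8s.io/max-exec-time-seconds"

// maxExecTime reads the proposed annotation from an object's annotations and
// converts it into a duration. It returns false if the annotation is absent
// or not a positive integer.
func maxExecTime(annotations map[string]string) (time.Duration, bool) {
	v, ok := annotations[maxExecTimeAnnotation]
	if !ok {
		return 0, false
	}
	secs, err := strconv.Atoi(v)
	if err != nil || secs <= 0 {
		return 0, false
	}
	return time.Duration(secs) * time.Second, true
}

func main() {
	// Example: a Job annotated with a 1-hour limit.
	ann := map[string]string{maxExecTimeAnnotation: "3600"}
	if d, ok := maxExecTime(ann); ok {
		fmt.Println("max execution time:", d) // prints: max execution time: 1h0m0s
	}
}
```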

Why is this needed:

Different Job CRDs have a field with similar semantics, but there is no standard. For example, batch/Job has spec.activeDeadlineSeconds, while JobSet does not have such an API for now.

We would like to have such an API to use it in a kueuectl command as an analog of Slurm's "-t" option. Long term (but out of scope here), we could use this value to optimize Job scheduling.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@mimowo mimowo added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 24, 2024
@mimowo
Contributor Author

mimowo commented Sep 24, 2024

/cc @mwielgus @trasc @mbobrovskyi

@trasc
Contributor

trasc commented Sep 24, 2024

/assign

@kannon92
Contributor

kannon92 commented Sep 24, 2024

Maybe Kueue could set activeDeadlineSeconds on all the pod templates? NVM, I see that you want the limit for the entire job, so I don't think the pod level is the right fit here.

@tardieu
Copy link

tardieu commented Sep 24, 2024

Very much in favor.

Specifying a deadline for a workload (possibly wall clock time instead of cumulative execution time) would also enable capacity planning and permit developing ordering strategies to, e.g., ensure large workloads have a chance to run without resorting to priorities and preemption.

An alternative to workload deactivation could be to dynamically lower the priority of the workload, hence let it run if there is excess capacity and only evict if necessary.

@mimowo
Contributor Author

mimowo commented Sep 24, 2024

(possibly wall clock time instead of cumulative execution time)

Yeah, I was thinking that "cumulative execution time" measures wall time. I say it is cumulative because it would add up the running wall time from all admissions: say, if the workload is admitted (3min) -> suspended (whatever) -> admitted (3min), this would amount to 6min, vs. activeDeadlineSeconds, which accounts for only the 3min since the last admission.

wdyt?

would also enable capacity planning and permit developing ordering strategies to, e.g., ensure large workloads have a chance to run without resorting to priorities and preemption.

Yeah, that is the long-term plan. IIUC its counterpart ("-t" in slurm) is used for better scheduling.

An alternative to workload deactivation could be to dynamically lower the priority of the workload, hence let it run if there is excess capacity and only evict if necessary.

Potentially, but it sounds complex (what function should decrease the priorities?). Also, some users don't like to use priorities, as there is an incentive to set them as high as possible.

@tenzen-y
Member

I like this feature. Actually, users can easily violate fairness using the sleep inf command, and preemption is currently the only way to deal with such unfair Jobs. Once we provide a feature to enforce a deadline, we can prevent them.

My main question is, at what time can the calculation start? I guess that we need to add a dedicated field like startTime to the Workload object.

@tenzen-y
Member

Additionally, I guess that we need to consider #2737.
So, we may need to reset the Workload's startTime field (a new field; the name is just my idea) when the Workload gets the WorkloadWaitForPodsReadyReplacement condition.

@mimowo
Contributor Author

mimowo commented Sep 24, 2024

My main question is, at what time can the calculation start? I guess that we need to add a dedicated field like startTime to the Workload object.

Yeah, we need a new field for that, but I was rather thinking about keeping the accumulated time from the previous admissions (say prevAdmissionsRuntime).

Then compute exec time as:

execTime = prevAdmissionsRuntime + now() - Admitted.LastTransitionTime

I think with startTime we would not account for the time when the job is suspended. WDYT?
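
For illustration, a minimal Go sketch of this accounting, assuming the hypothetical prevAdmissionsRuntime field discussed above and using admittedSince to stand in for the Admitted condition's LastTransitionTime; none of these names exist in the current Workload API.

```go
package main

import (
	"fmt"
	"time"
)

// execTime sketches the formula above: wall time accumulated over previous
// admissions plus the time elapsed in the current admission (nothing extra
// accrues while the workload is suspended).
func execTime(prevAdmissionsRuntime time.Duration, admitted bool, admittedSince, now time.Time) time.Duration {
	if !admitted {
		return prevAdmissionsRuntime
	}
	return prevAdmissionsRuntime + now.Sub(admittedSince)
}

func main() {
	// Example from the discussion: 3 min from a previous admission, re-admitted
	// 3 min ago -> 6 min total, whereas activeDeadlineSeconds would only see 3 min.
	now := time.Now()
	fmt.Println(execTime(3*time.Minute, true, now.Add(-3*time.Minute), now)) // 6m0s
}
```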

@trasc
Contributor

trasc commented Sep 25, 2024

I've opened #3133 as a KEP PR for this, please have a look and let's continue the discussion there.

@mimowo
Contributor Author

mimowo commented Sep 26, 2024

I have played a bit with Slurm's "-t" option, and it looks like it puts a limit on the cumulative wall-clock time across suspend/resume cycles. While we don't need to follow it exactly, it seems like a sensible inspiration.

@tenzen-y
Member

I think with startTime we would not account for the time when the job is suspended. WDYT?

My intention was to reset the startTime when the job is preempted or evicted (StopJob), similar to the batch/v1 Job integration.
Anyway, we can evaluate both approaches during the proposal.
