Implement a KEP for DRA and Kueue
kannon92 committed Sep 16, 2024
1 parent 2d3d284 commit 66bdc4f
Showing 3 changed files with 254 additions and 85 deletions.
257 changes: 209 additions & 48 deletions keps/2941-DRA-Structured-Parameters/README.md
## Summary

Dynamic Resource Allocation (DRA) is a major effort to improve device support in Kubernetes.
It changes how one can request resources in a myriad of ways. Kueue should be able to integrate with DRA.

## Motivation

Dynamic Resource Allocation (DRA) provides the groundwork for more sophisticated device allocation to Pods, and it puts control over how devices are scheduled into the device driver.
Quota management is about enforcing rules around the use of resources.
For example, GPUs are resource constrained, and a popular request is the ability to enforce fair sharing of GPU resources.
Many users want access to these devices, and some users want the ability to preempt other users' workloads when their own have higher priority. Kueue provides support for this.

### Background

DRA points toward a future where users can schedule partitionable GPU devices (MIG) or time slicing. As devices gain more robust scheduling capabilities, it is important to walk through how Kueue will support DRA.

To dive into the details for Kueue, I first want to summarize the different use cases for DRA from a workload perspective.

DRA has three APIs that are relevant for Kueue:

- ResourceClaims
- DeviceClasses
- ResourceSlices

#### DRA Example

I found the easiest way to test DRA is the [dra-example-driver repository](https://github.com/kubernetes-sigs/dra-example-driver).

You can clone that repository and run `make setup-e2e`; this creates a Kind cluster with the DRA feature gate enabled and installs a mock DRA driver.

It does not use actual GPUs, so it is a perfect test environment for exploring the Kueue and DRA integration.

#### Workload Example

An example workload that uses DRA:

```yaml
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: batch/v1
kind: Job
metadata:
  namespace: gpu-test1
  name: job0
  labels:
    app: job
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ctr0
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["export; sleep 9999"]
        resources:
          claims:
          - name: gpu
          requests:
            cpu: 1
            memory: "200Mi"
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
```
#### Example Driver Cluster Resources
The dra-example-driver creates a resource slice and a device class for the entire cluster.
##### Resource slices

Resource slices are meant for communication between drivers and the control plane. They are not expected to be used by workloads,
so Kueue does not need to be aware of these resources.

##### Device classes

Each driver creates a device class, and every resource claim references a device class.
The dra-example-driver creates a simple device class named `gpu.example.com`.

Device classes can be a way to enforce quota limits.
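For reference, the device class created by the dra-example-driver looks roughly like the following sketch. The exact selector expression and API version depend on the driver release and on which DRA API revision is installed in the cluster, so treat the field values here as illustrative rather than authoritative.

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  # Select every device advertised by this driver; a CEL expression is the
  # mechanism DRA structured parameters use for device selection.
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
```

Because every resource claim names a device class like this one, the device class name is a natural key for Kueue to count usage against.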

### Goals

- Users can submit workloads using resource claims and Kueue can monitor the usage.
- Admins can enforce a quota on the number of requests against a given device class.

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

### Non-Goals

- We limit the scope of DRA support to structured parameters (beta in 1.32).

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal


<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
Expand All @@ -90,7 +177,11 @@ bogged down.

#### Story 1

As a user, I want to use resource claims to gain more control over the scheduling of devices.
I have a DRA driver installed on my cluster, and I am interested in using DRA for scheduling.

I want to enforce quota usage for a ClusterQueue and forbid admitting workloads once they exceed the cluster queue limit.

### Notes/Constraints/Caveats (Optional)

Expand All @@ -117,6 +208,96 @@ Consider including folks who also work outside the SIG or subproject.

## Design Details

### Resource Quota API

```golang
type ResourceQuota struct {
	// ...

	// Kind is the type of resource that this quota covers.
	// +kubebuilder:validation:Enum={Core,DeviceClass}
	// +kubebuilder:default=Core
	Kind ResourceKind `json:"kind"`
}
```

Kind allows one to distinguish between a Core resource and a DeviceClass.

With this, a cluster queue could be defined as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "gpu.example.com"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: "200Mi"
      - name: "gpu.example.com"
        nominalQuota: 2
        kind: "DeviceClass"
```
### Workloads

When a user submits a workload and the KueueDynamicResourceAllocation feature gate is on, Kueue will do the following:

a. Read the claims from `resources.claims` in the PodTemplateSpec.
b. Use the name of each claim to look up the corresponding entry in `resourceClaims` in the PodTemplateSpec.
c. Read the ResourceClaimTemplate named by that entry's `resourceClaimTemplateName`, in the same namespace as the workload.
d. Read the `deviceClassName` from the ResourceClaimTemplate.
e. Tally every claim that requests the same `deviceClassName` and report the total in the workload's resource usage.

```yaml
---
apiVersion: batch/v1
kind: Job
metadata:
  namespace: gpu-test1
  name: job0
  labels:
    app: job
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ctr0
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["export; sleep 9999"]
        resources:
          claims:
          - name: gpu # a) read the claim from resources.claims
          requests:
            cpu: 1
            memory: "200Mi"
      resourceClaims:
      - name: gpu # b) use the name in resources.claims
        resourceClaimTemplateName: single-gpu # c) the name of the resource claim template
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com # d) the name of the device class
```
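The lookup-and-tally flow in steps a–e can be sketched in Go as follows. The types here are simplified stand-ins for the corresponding Kubernetes API objects, introduced only for illustration; a real implementation would resolve ResourceClaimTemplates through the API server rather than a map.

```golang
package main

import "fmt"

// podResourceClaim mirrors an entry in the pod-level resourceClaims list.
type podResourceClaim struct {
	Name                      string
	ResourceClaimTemplateName string
}

// containerClaim mirrors an entry in a container's resources.claims list and
// references a pod-level resourceClaims entry by name.
type containerClaim struct {
	Name string
}

// countDeviceClassRequests tallies how many container claims resolve to each
// device class: container claims are matched to pod-level resourceClaims by
// name, the referenced ResourceClaimTemplate is looked up, and its
// deviceClassName is counted.
func countDeviceClassRequests(
	containerClaims []containerClaim,
	podClaims []podResourceClaim,
	templateDeviceClass map[string]string, // template name -> deviceClassName
) map[string]int64 {
	byName := make(map[string]podResourceClaim)
	for _, pc := range podClaims {
		byName[pc.Name] = pc
	}
	usage := make(map[string]int64)
	for _, cc := range containerClaims {
		pc, ok := byName[cc.Name]
		if !ok {
			continue // malformed spec; real code would surface an error
		}
		if class, ok := templateDeviceClass[pc.ResourceClaimTemplateName]; ok {
			usage[class]++
		}
	}
	return usage
}

func main() {
	usage := countDeviceClassRequests(
		[]containerClaim{{Name: "gpu"}},
		[]podResourceClaim{{Name: "gpu", ResourceClaimTemplateName: "single-gpu"}},
		map[string]string{"single-gpu": "gpu.example.com"},
	)
	fmt.Println(usage) // one request counted against gpu.example.com
}
```

For the Job above, this yields a usage of one against `gpu.example.com`, which Kueue would then charge against the ClusterQueue's quota for that device class.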
<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
proposal will be implemented, this is the place to discuss them.
-->

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests

- `<package>`: `<date>` - `<test coverage>`

#### Integration tests

I am not sure whether we can test DRA functionality at the integration level, since it requires alpha features to be enabled.

DRA also requires a kubelet plugin, so this may not be a good candidate for an integration test.

#### E2E Test

It may be worth installing the dra-example-driver and testing this end to end.

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

#### Feature Gate

We will introduce a `KueueDynamicResourceAllocation` feature gate.

This feature gate will go beta once DRA is beta.

The goal is to limit changes so that they take effect only when this feature gate is enabled in combination with the DRA feature gate.

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
- Draft on September 16th 2024.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->
NA. Kueue should be able to schedule devices following what upstream is proposing.
The only drawback is that Kueue must fetch the resource claim template for workloads that specify resource claims.

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
### Resource Claim By Count

Originally, I considered keeping a tally of the resource claims for a given workload.
The issue with this is that resource claims are namespace scoped,
and to enforce quota usage across namespaces we need to use cluster-scoped resources.
