Implement a KEP for DRA and Kueue
kannon92 committed Sep 16, 2024
1 parent 2d3d284 commit 66bdc4f
Showing 3 changed files with 254 additions and 85 deletions.
257 changes: 209 additions & 48 deletions keps/2941-DRA-Structured-Parameters/README.md
## Summary

Dynamic Resource Allocation (DRA) is a major effort to improve device support in Kubernetes.
It changes how one can request resources in a myriad of ways. Kueue should be able to integrate with DRA.

## Motivation

Dynamic Resource Allocation (DRA) provides the groundwork for more sophisticated device allocation to Pods, and it puts control over how devices are scheduled into the device driver.
Quota management is about enforcing rules around the use of resources.
For example, GPUs are resource constrained, and a popular request is the ability to enforce fair sharing of GPU resources.
Many users want access to these devices, and some users want the ability to preempt other users' workloads when their own have higher priority. Kueue provides support for this.

### Background

DRA points toward a future where users can schedule partitionable GPU devices (MIG) or time slicing. As devices gain more robust scheduling capabilities, it is important to walk through how Kueue will support DRA.

To dive into the details for Kueue, I first want to summarize the different use cases for DRA from a workload perspective.

DRA has three APIs that are relevant for Kueue:

- ResourceClaims
- DeviceClasses
- ResourceSlices

#### DRA Example

I found the easiest way to test DRA is the [dra-example-driver repository](https://github.com/kubernetes-sigs/dra-example-driver).

You can clone that repository and run `make setup-e2e`; this creates a Kind cluster with the DRA feature gate enabled and installs a mock DRA driver.

It does not use actual GPUs, so it is a perfect test environment for exploring the Kueue and DRA integration.

#### Workload Example

An example workload that uses DRA:

```yaml
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: batch/v1
kind: Job
metadata:
  namespace: gpu-test1
  name: job0
  labels:
    app: job
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ctr0
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["export; sleep 9999"]
        resources:
          claims:
          - name: gpu
          requests:
            cpu: 1
            memory: "200Mi"
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
```
#### Example Driver Cluster Resources
The dra-example-driver creates a resource slice and a device class for the entire cluster.
##### Resource slices

Resource slices are meant for communication between drivers and the control plane. They are not expected to be used by workloads,
so Kueue does not need to be aware of these resources.

##### Device classes

Each driver creates a device class, and every resource claim references a device class.
The dra-example-driver creates a simple device class named `gpu.example.com`.

Device classes can be a way to enforce quota limits.
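For reference, the device class created by the dra-example-driver looks roughly like the following sketch. The exact selector expression and API version depend on the driver release and on which DRA API revision is installed in the cluster, so treat the field values here as illustrative rather than authoritative.

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  # Select every device advertised by this driver; a CEL expression is the
  # mechanism DRA structured parameters use for device selection.
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
```

Because every resource claim names a device class like this one, the device class name is a natural key for Kueue to count usage against.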

### Goals

- Users can submit workloads using resource claims and Kueue can monitor the usage.
- Admins can enforce a quota on the number of requests against a given device class.

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

### Non-Goals

- We limit the scope of DRA support to structured parameters (beta in 1.32).

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal


<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
Expand All @@ -90,7 +177,11 @@ bogged down.

#### Story 1

As a user, I want to use resource claims to gain more control over the scheduling of devices.
I have a DRA driver installed on my cluster, and I am interested in using DRA for scheduling.

I want to enforce quota usage for a ClusterQueue and forbid admitting workloads once they exceed the cluster queue limit.

### Notes/Constraints/Caveats (Optional)

Expand All @@ -117,6 +208,96 @@ Consider including folks who also work outside the SIG or subproject.

## Design Details

### Resource Quota API

```golang
type ResourceQuota struct {
	// ...

	// Kind is the type of resource that this quota covers.
	// +kubebuilder:validation:Enum={Core,DeviceClass}
	// +kubebuilder:default=Core
	Kind ResourceKind `json:"kind"`
}
```

Kind allows one to distinguish between a Core resource and a DeviceClass.

With this, a cluster queue could be defined as follows:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "gpu.example.com"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: "200Mi"
      - name: "gpu.example.com"
        nominalQuota: 2
        kind: "DeviceClass"
```
### Workloads

When a user submits a workload and the KueueDynamicResourceAllocation feature gate is on, Kueue will do the following:

a. Read the claims from `resources.claims` in the PodTemplateSpec.
b. Use the name of each claim to look up the corresponding entry in `resourceClaims` in the PodTemplateSpec.
c. Read the ResourceClaimTemplate named by that entry's `resourceClaimTemplateName`, in the same namespace as the workload.
d. Read the `deviceClassName` from the ResourceClaimTemplate.
e. Tally every claim that requests the same `deviceClassName` and report the total in the workload's resource usage.

```yaml
---
apiVersion: batch/v1
kind: Job
metadata:
  namespace: gpu-test1
  name: job0
  labels:
    app: job
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ctr0
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["export; sleep 9999"]
        resources:
          claims:
          - name: gpu # a) read the claim from resources.claims
          requests:
            cpu: 1
            memory: "200Mi"
      resourceClaims:
      - name: gpu # b) use the name in resources.claims
        resourceClaimTemplateName: single-gpu # c) the name of the resource claim template
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com # d) the name of the device class
```
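The lookup-and-tally flow in steps a–e can be sketched in Go as follows. The types here are simplified stand-ins for the corresponding Kubernetes API objects, introduced only for illustration; a real implementation would resolve ResourceClaimTemplates through the API server rather than a map.

```golang
package main

import "fmt"

// podResourceClaim mirrors an entry in the pod-level resourceClaims list.
type podResourceClaim struct {
	Name                      string
	ResourceClaimTemplateName string
}

// containerClaim mirrors an entry in a container's resources.claims list and
// references a pod-level resourceClaims entry by name.
type containerClaim struct {
	Name string
}

// countDeviceClassRequests tallies how many container claims resolve to each
// device class: container claims are matched to pod-level resourceClaims by
// name, the referenced ResourceClaimTemplate is looked up, and its
// deviceClassName is counted.
func countDeviceClassRequests(
	containerClaims []containerClaim,
	podClaims []podResourceClaim,
	templateDeviceClass map[string]string, // template name -> deviceClassName
) map[string]int64 {
	byName := make(map[string]podResourceClaim)
	for _, pc := range podClaims {
		byName[pc.Name] = pc
	}
	usage := make(map[string]int64)
	for _, cc := range containerClaims {
		pc, ok := byName[cc.Name]
		if !ok {
			continue // malformed spec; real code would surface an error
		}
		if class, ok := templateDeviceClass[pc.ResourceClaimTemplateName]; ok {
			usage[class]++
		}
	}
	return usage
}

func main() {
	usage := countDeviceClassRequests(
		[]containerClaim{{Name: "gpu"}},
		[]podResourceClaim{{Name: "gpu", ResourceClaimTemplateName: "single-gpu"}},
		map[string]string{"single-gpu": "gpu.example.com"},
	)
	fmt.Println(usage) // one request counted against gpu.example.com
}
```

For the Job above, this yields a usage of one against `gpu.example.com`, which Kueue would then charge against the ClusterQueue's quota for that device class.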
<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
proposal will be implemented, this is the place to discuss them.
-->

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests

- `<package>`: `<date>` - `<test coverage>`

#### Integration tests

I am not sure whether we can test DRA functionality at the integration level, since it requires alpha features to be enabled.

DRA also requires a kubelet plugin, so this may not be a good candidate for an integration test.

#### E2E Test

It may be worth installing the dra-example-driver and testing this end to end.

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

### Graduation Criteria

#### Feature Gate

We will introduce a `KueueDynamicResourceAllocation` feature gate.

This feature gate will go beta once DRA is beta.

The goal is to limit changes so that they take effect only when this feature gate is enabled in combination with the DRA feature gate.

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
- Draft on September 16th 2024.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->
NA. Kueue should be able to schedule devices following what upstream is proposing.
The only drawback is that Kueue must fetch the resource claim template for workloads that specify resource claims.

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
### Resource Claim By Count

Originally, I considered keeping a tally of the resource claims for a given workload.
The issue with this is that resource claims are namespace scoped,
and to enforce quota usage across namespaces we need to use cluster-scoped resources.
