Job should list "Untolerated Taint" as reason for not being admitted #3158

Open
nfung-soundhound opened this issue Sep 27, 2024 · 0 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

What would you like to be added:
Pending Jobs/Workloads should state explicitly why they are pending, in particular when the workload cannot tolerate the taints of any of the ResourceFlavors in the ClusterQueue it was submitted to.

Why is this needed:
Consider the following example ResourceFlavor with a taint, followed by a Job definition.
Assume the dev LocalQueue submits to the dev ClusterQueue.
Also assume that the ClusterQueue contains the ResourceFlavor node_type defined below, with sufficient quota.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: node_type
spec:
  nodeLabels:
    beta.kubernetes.io/instance-type: "node_type"
  nodeTaints:
  - key: taint_key
    value: "taint_value"
    effect: NoSchedule

# job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
  labels:
    kueue.x-k8s.io/queue-name: dev
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: myapp
        image: busybox
        command: ["sleep", "500"]
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
          limits:
            memory: "128Mi"
            cpu: "500m"

When the job above is submitted, the controller suspends it.
However, the job events give very little information about why:

Events:
  Type    Reason           Age   From                        Message
  ----    ------           ----  ----                        -------
  Normal  Suspended        10m   job-controller              Job suspended
  Normal  CreatedWorkload  10m   batch/job-kueue-controller  Created Workload: default/job-myjob-d2369

After enabling debug logs on the controller, it turned out that the job was not admitted because it does not tolerate the taints of that ResourceFlavor. That might be acceptable for an administrator, but it is not user-friendly for developers, who can easily miss a taint by accident. Typically, when a pod is unschedulable, Kubernetes reports that it cannot tolerate certain taints on some nodes.

{"level":"debug",
 "ts":"2024-09-27T19:56:20.485176276Z",
 "logger":"events","caller":"recorder/recorder.go:104","msg":"couldn't assign flavors to pod set main: untolerated taint {taint_key taint_value NoSchedule <nil>} in flavor node_type","type":"Normal","object":{"kind":"Workload","namespace":"default","name":"job-myjob-d2369","uid":"f469c8b0-a23b-4854-bc05-048a38904520","apiVersion":"kueue.x-k8s.io/v1beta1","resourceVersion":"654137093"},"reason":"Pending"}
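
For reference, a minimal sketch of the toleration that job.yaml is missing, assuming the taint is exactly the one declared under nodeTaints in the node_type ResourceFlavor above. This only illustrates what the debug message is pointing at; it is not a statement of how the request should be implemented.

```yaml
# Sketch only: a toleration matching the nodeTaints of the node_type ResourceFlavor.
# It would be merged into the existing spec.template.spec of job.yaml.
spec:
  template:
    spec:
      tolerations:
      - key: taint_key
        operator: Equal
        value: "taint_value"
        effect: NoSchedule
```

With a toleration like this in place, the "untolerated taint" message for the node_type flavor should no longer apply to the workload, since the pod set would tolerate the flavor's taint.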

My current workaround is to stop using taints on the ResourceFlavors and rely on the taints on the nodes themselves, without setting any tolerations on the ResourceFlavors. This has the unintended side effect of reserving a portion of the quota without actually running a workload (i.e., the job is admitted, but its pods are stuck in Pending because they do not tolerate the node taints).

I have just begun to use Kueue, so please suggest any workarounds (I have thought of, but not tested, all-or-nothing scheduling for this case).

Completion requirements:

TBD, but this would require changes to the controller so that pending jobs/workloads report why they cannot be admitted even though sufficient quota exists.
