option to not auto-add tolerations #13088

Closed
3 of 4 tasks
tooptoop4 opened this issue May 24, 2024 · 3 comments
Labels
area/controller Controller issues, panics area/upstream This is an issue with an upstream dependency, not Argo itself solution/invalid This is incorrect. Also can be used for spam type/support User support issue - likely not a bug

Comments

@tooptoop4
Contributor

tooptoop4 commented May 24, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Looking at a pod (kubectl get pod -o yaml) for a step in a workflow, I see it has these tolerations added:

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

This can cause issues (unschedulable pods) when a step's pod is assigned to a new node that is not ready, e.g. aws/amazon-vpc-cni-k8s#2808, where the node flicks from Ready to NotReady and back to Ready.

Version

3.5.6

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

n/a

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

n/a
@agilgur5

agilgur5 commented May 24, 2024

I don't think Argo ever sets its own tolerations, only the ones that you specify. There's also only one place in the code where they're added.

Why do you think Argo is setting these?

@agilgur5 agilgur5 added type/support User support issue - likely not a bug solution/invalid This is incorrect. Also can be used for spam problem/more information needed Not enough information has been provide to diagnose this issue. area/controller Controller issues, panics and removed type/bug labels May 24, 2024
@tooptoop4
Contributor Author

tooptoop4 commented May 25, 2024

Turns out this is a k8s default: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

Note:
Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless you, or a controller, set those tolerations explicitly.

These automatically-added tolerations mean that Pods remain bound to Nodes for 5 minutes after one of these problems is detected.

perhaps docs should mention overwriting this
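
For illustration, here is a minimal sketch of what setting those tolerations explicitly could look like in a Workflow, based on the note quoted above. The workflow name, image, and the 60-second value are assumptions chosen for the example, not recommendations, and whether a shorter tolerationSeconds actually helps with the flapping-node case from aws/amazon-vpc-cni-k8s#2808 has not been verified here.

```yaml
# Sketch only: generateName, image, and tolerationSeconds values are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: explicit-tolerations-
spec:
  entrypoint: main
  # Because both keys are set explicitly here, Kubernetes should not add
  # its own defaults (tolerationSeconds: 300) for them to the workflow's pods.
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  templates:
  - name: main
    container:
      image: busybox
      command: [sh, -c]
      args: ["echo hello"]
```

Running kubectl get pod -o yaml on the resulting pod should then show the 60-second values instead of the 300-second defaults.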

@agilgur5

> perhaps docs should mention overwriting this

Tolerations are only mentioned in the Fields Reference, where they are inherited directly from k8s, so they have no Argo-specific behavior and no Argo-specific docs.
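
Since the field is passed straight through to the pod spec, one possible way to apply the same explicit tolerations to every workflow is the controller's workflowDefaults setting. The following is a rough sketch under the assumption that the controller runs in the argo namespace with the standard workflow-controller-configmap. The 60-second value is illustrative, and how these defaults combine with tolerations set on an individual Workflow should be checked against the Argo docs before relying on it.

```yaml
# Sketch only: the namespace and tolerationSeconds values are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Defaults applied to Workflows handled by this controller.
  workflowDefaults: |
    spec:
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
```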

@agilgur5 agilgur5 removed the problem/more information needed Not enough information has been provide to diagnose this issue. label May 30, 2024
@agilgur5 agilgur5 added the area/upstream This is an issue with an upstream dependency, not Argo itself label Jul 15, 2024
@argoproj argoproj locked as resolved and limited conversation to collaborators Jul 15, 2024