Use Node name instead of Machine name on removal #46
@HomayoonAlimohammadi I came to review this, but could you please add a description and references explaining why this change is necessary? Sorry, I don't have enough context to see how this is an obvious change. Thanks.
        return fmt.Errorf("machine %s has no node reference", machine.Name)
    }

    nodeName := machine.Status.NodeRef.Name
Nice work finding the root cause. I only have one question, otherwise this looks good. Is Status.NodeRef set automatically by CAPI's Machine controller?
I think it does:
- https://cluster-api.sigs.k8s.io/developer/architecture/controllers/machine-pool#:~:text=Setting%20NodeRefs%20on%20MachinePool%20instances%20to%20be%20able%20to%20associate%20them%20with%20Kubernetes%20nodes
- https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_noderef.go#L90-L100
Yes, this is part of the CAPI contract.
I'm really sorry @addyess, you're right, the PR was rushed out of excitement! :D
LGTM
Summary
This PR fixes the issue with CAPI rolling upgrades. The problem was that we were passing the wrong node-name to the k8sd remove-node endpoint.
Description
CAPI rolling upgrades were failing. We narrowed the problem down and found that the node removal process was not happening as expected (you can reproduce the problem by scaling down the control plane replicas of the workload cluster). It turned out that the /capi/remove-node endpoint of k8sd was returning an error indicating that it could not find the node it was asked to delete. That is because we were passing machine.Name to the k8sd /capi/remove-node endpoint, while, as can be seen in the output of kubectl get nodes (on the workload cluster), we need to pass the node name.
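To make the change concrete, here is a minimal, self-contained sketch of the lookup discussed in the review. The Machine, MachineStatus and ObjectReference types below are simplified stand-ins for the Cluster API types (only the fields mentioned in this PR are modeled), and nodeNameForRemoval is a hypothetical helper name used for illustration, not a function from this repository. How the resulting name is sent to /capi/remove-node is not shown.

```go
package main

import "fmt"

// ObjectReference, MachineStatus and Machine are simplified stand-ins for the
// Cluster API types (assumption: only the fields used in this PR are modeled).
type ObjectReference struct {
	Name string
}

type MachineStatus struct {
	// NodeRef points at the Kubernetes node backing the machine.
	// It stays nil until the CAPI Machine controller sets it.
	NodeRef *ObjectReference
}

type Machine struct {
	Name   string
	Status MachineStatus
}

// nodeNameForRemoval is a hypothetical helper: it returns the node name that
// should be passed on removal, instead of the machine name.
func nodeNameForRemoval(machine *Machine) (string, error) {
	if machine.Status.NodeRef == nil {
		return "", fmt.Errorf("machine %s has no node reference", machine.Name)
	}
	return machine.Status.NodeRef.Name, nil
}

func main() {
	// Illustrative names only; in a real cluster the node name comes from
	// the workload cluster and may differ from the machine name.
	m := &Machine{
		Name:   "example-machine",
		Status: MachineStatus{NodeRef: &ObjectReference{Name: "example-node"}},
	}

	nodeName, err := nodeNameForRemoval(m)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("node to remove:", nodeName)
}
```

Returning an error when NodeRef is nil, rather than falling back to machine.Name, avoids silently reintroducing the original failure.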