Health checker doesn't restart kubelet on EKS #861
Comments
Here is how kubelet is configured in k/k: https://github.com/kubernetes/kubernetes/blob/f8930f980d2986f9e486b04c14c3e93e57bdbe12/cluster/gce/gci/configure-helper.sh#L1652
Its kernel log pattern is different from
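For anyone comparing the two environments, one way to see the restart policy that comment is pointing at is to query the unit on the node. The expected output below is an assumption based on my reading of the linked configure-helper.sh, not something the thread states outright:

```sh
# On a GCE/GCI node, inspect how systemd reacts when kubelet's main process dies.
systemctl show -p Restart -p RestartUSec kubelet.service
# Assumed output for the unit generated by configure-helper.sh:
#   Restart=always
#   RestartUSec=10s
# With Restart=always, systemd starts kubelet again after `systemctl kill`,
# regardless of how the main process exited.
```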
@wangzhen127 Thanks for the explanation, and I see that you added the comment to the code as well 🫡. Unfortunately, the kubelet service in the EKS AMI is still using
Regarding #847, isn't there a better way to differentiate between unhealthy restarts and normal restarts? I also find it weird that
I think it's best if we don't have to rely on the difference between
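The same query on an EKS node shows the contrast. The output sketched below is an assumption based on the kubelet-containerd.service linked in the issue description (restart only on failure or SIGPIPE):

```sh
# On an EKS node, the kubelet unit only restarts on failure or on SIGPIPE.
systemctl show -p Restart -p RestartForceExitStatus kubelet.service
# Assumed output for the unit shipped in the EKS AMI:
#   Restart=on-failure
#   RestartForceExitStatus=SIGPIPE
# A SIGTERM from `systemctl kill` lets kubelet shut down cleanly (exit 0),
# which Restart=on-failure does not treat as a reason to start it again.
```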
I agree that it is desirable to find a better way to differentiate between an unhealthy restart and a planned restart. Do you have any suggestions?
I don't have any suggestions at the moment either. I will update this issue if I get time to play around with it and think of something.
I looked into it a bit more. The journald logs by the
Relying on the current hack, we could still make the repair function more robust by checking the
We could also simply always call
What do you think?
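To make that concrete, here is a minimal sketch of what a more defensive repair step could look like. It is only an illustration of the idea; the 10-second wait and the fallback to `systemctl restart` are my own choices, not what the health checker does today:

```sh
#!/usr/bin/env bash
# Sketch of a repair step that tolerates both Restart=always and
# Restart=on-failure kubelet units.
set -euo pipefail

unit="kubelet.service"

# Kill the (presumed unhealthy) main process, roughly what the health
# checker's repair does today per this issue. `|| true` because kill fails
# with "No main process to kill" if the service is already stopped.
systemctl kill "$unit" || true

# Give systemd a chance to apply its Restart= policy (RestartSec may delay it).
sleep 10

# If the unit did not come back on its own (e.g. Restart=on-failure after a
# clean exit), restart it explicitly.
if ! systemctl is-active --quiet "$unit"; then
  systemctl restart "$unit"
fi
```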
I think the reason why the system service is using
I cannot answer for the EKS maintainers, but I don't think it was a wrong choice. In their healthy state, these server services are expected to run forever, never exiting with code 0 on their own, so
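That behaviour is easy to reproduce in isolation with a throwaway transient unit, without touching kubelet at all. The unit name and the trap trick below are made up for the experiment; the point is that a process which exits 0 on SIGTERM (as kubelet does on a graceful shutdown, per this issue) is not restarted under Restart=on-failure:

```sh
# Run a dummy service under Restart=on-failure that, like kubelet, exits
# cleanly (code 0) when it receives SIGTERM.
systemd-run --unit=restart-demo -p Restart=on-failure \
  bash -c 'trap "exit 0" TERM; sleep infinity & wait'

# Send SIGTERM to the unit's processes, as `systemctl kill` does by default.
systemctl kill restart-demo.service
sleep 2

# The unit stays down: a clean exit is not a "failure".
systemctl is-active restart-demo.service   # prints "inactive"
```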
Like you said, I assumed most commonly would be admins reconfiguring the system. This wouldn't make
I think it's even more unnatural to restart a service by relying on
Since we cannot use
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
/remove-lifecycle stale
@wangzhen127 ping on this
Hi,

The health-checker-kubelet monitor runs `systemctl kill` against the kubelet service, but kubelet doesn't get restarted because the exit code would be 0. EKS containerd nodes only restart kubelet on failure or SIGPIPE: https://github.com/awslabs/amazon-eks-ami/blob/master/files/kubelet-containerd.service. Am I misunderstanding the purpose of the repair function? Is it meant to kill the kubelet so that another system can take care of it, e.g. rotating out the node?

Is there a reason `systemctl kill` is used instead of `systemctl restart`? I was testing this functionality by stopping the kubelet manually, and the health checker failed to kill the service with the error message `Failed to kill unit kubelet.service: No main process to kill` (slightly related: #860).
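A rough sketch of the reproduction described above, assuming an EKS node whose kubelet unit only restarts on failure or SIGPIPE; the sleep duration is arbitrary, just to let any Restart= policy kick in:

```sh
# What the health checker's repair effectively does on an EKS node:
systemctl kill kubelet.service
sleep 10
systemctl is-active kubelet.service
# -> "inactive": kubelet exits 0 on SIGTERM, and Restart=on-failure does not
#    bring it back up.

# If kubelet was already stopped beforehand, there is nothing to signal:
systemctl kill kubelet.service
# -> "Failed to kill unit kubelet.service: No main process to kill"

# An explicit restart works regardless of the previous exit status:
systemctl restart kubelet.service
systemctl is-active kubelet.service
# -> "active"
```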