Nodes never uncordoned #691
Thanks for raising this issue. The logs you posted are curious:
This indicates that the node is currently running. The agent logs, whose last message is about the reboot, seem to corroborate this: we triggered the reboot, Bottlerocket terminated us and started the reboot... but then we never hear from the agent daemon again. My suspicion is that the brupop agent is never actually restarted on the node after it reboots.
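If anyone else hits this, a quick way to check whether the agent came back after the reboot is sketched below. It assumes brupop's default namespace (`brupop-bottlerocket-aws`); the node name `node3` and the agent pod name are placeholders to adjust for your cluster.

```sh
# Is the agent pod on the rebooted node Running, and when was it last restarted?
kubectl get pods -n brupop-bottlerocket-aws -o wide --field-selector spec.nodeName=node3

# Logs from the agent pod after the reboot (pod name is hypothetical).
kubectl logs -n brupop-bottlerocket-aws brupop-agent-xxxxx
```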
Hey @cbgbt, unfortunately I don't have the logs anymore... but I do remember the node was reporting as running the new version (1.25.0), and I checked the uptime, which seemed to corroborate that too.
Thank you! I'll attempt to replicate, but if you do happen to encounter this again, logs from the agent after the reboot would be very helpful.
Hey @cbgbt, so the upgrade happened on Nov 3.
The scheduler is set to:
Current state of the cluster is:
It looks like the agent pod did not get restarted on the node, but other pods were able to start. On node3 these pods are scheduled:
The agent pod is showing the event "Pod sandbox changed, it will be killed and re-created." every couple of minutes since the node was restarted.
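For reference, a sketch of how this state can be inspected; `node3`, the agent pod name, and the `brupop-bottlerocket-aws` namespace are assumptions to adapt to your install:

```sh
# Recent events for the stuck agent pod, including the repeated
# "Pod sandbox changed, it will be killed and re-created." messages.
kubectl describe pod -n brupop-bottlerocket-aws brupop-agent-xxxxx

# Everything scheduled on node3, to compare the agent with pods that did start.
kubectl get pods -A -o wide --field-selector spec.nodeName=node3
```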
node3 admin container
Most Recent Log for node3 agent
Controller Log
I tried deleting the agent pod on node3, and after it was recreated it's now saying:
The aws-node pod seems to be running fine on node3 (we're running vpc-cni v1.18.5-eksbuild.1). I manually uncordoned the node, and the agent pod successfully started (log below). What's interesting is that this does not happen consistently: when I uncordoned the node, the operator started the upgrade on another node, which rebooted and was uncordoned successfully. It then did another node, which is stuck cordoned again.
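A rough sketch of the manual workaround described above (the node name and agent pod name are placeholders, and the namespace is brupop's default, which may differ in your install):

```sh
# Recreate the stuck agent pod; the DaemonSet controller schedules a new one.
kubectl delete pod -n brupop-bottlerocket-aws brupop-agent-xxxxx

# Mark the node schedulable again, which removes the lingering
# node.kubernetes.io/unschedulable:NoSchedule taint.
kubectl uncordon node3
```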
Image I'm using:
v1.4.0
Issue:
We're seeing our nodes get stuck with the taint `node.kubernetes.io/unschedulable:NoSchedule` after an update. It doesn't look like this happens to every node when it updates, only some of them. When it does hit this issue, the node successfully performs the update and reboots, but is never uncordoned when it comes back up. The last thing in the controller logs is the event `RebootedIntoUpdate`, and I see the node reports the new version of Bottlerocket. The last event I can see in the agent logs is `Bottlerocket node is terminated by reboot signal`.
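A minimal sketch for spotting nodes stuck in this state, assuming a default brupop install (the `brs` short name for BottlerocketShadow resources and the `brupop-bottlerocket-aws` namespace are the chart defaults; verify against your install):

```sh
# Nodes that are still cordoned (i.e. carry the unschedulable:NoSchedule taint).
kubectl get nodes --field-selector spec.unschedulable=true

# brupop's view of each node's update state, via the BottlerocketShadow CRs.
kubectl get brs -n brupop-bottlerocket-aws
```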
**Helm Values**