Node Launch Issues in Bottlerocket 1.26.0 #4262

Open
KCSesh opened this issue Oct 24, 2024 · 12 comments
Assignees
Labels
status/needs-triage Pending triage or re-evaluation type/bug Something isn't working

Comments

@KCSesh
Contributor

KCSesh commented Oct 24, 2024

Bottlerocket has rolled back release 1.26.0, and 1.25.0 is now the latest release/AMI available. #4253

The Bottlerocket team has been made aware of several issues with the rollout of release 1.26.0.

Reported issues:

@KCSesh KCSesh added type/bug Something isn't working status/needs-triage Pending triage or re-evaluation labels Oct 24, 2024
@KCSesh KCSesh self-assigned this Oct 24, 2024
@KCSesh KCSesh pinned this issue Oct 24, 2024
@ginglis13
Contributor

bottlerocket-os/bottlerocket-core-kit#158 added MemoryDenyWriteExecute=yes as a default security setting for all systemd services.

From the systemd man page: https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#MemoryDenyWriteExecute=

> If set, attempts to create memory mappings that are writable and executable at the same time, or to change existing memory mappings to become executable, or mapping shared memory segments as executable, are prohibited. Specifically, a system call filter is added (or preferably, an equivalent kernel check is enabled with prctl(2)) that rejects mmap(2) system calls

Based on user reports of denied mmap syscalls in the two issues linked to this tracker, it looks like this commit is the culprit.

I'm working to verify that now.
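For anyone who wants to check whether a given process context is subject to this restriction, here is a minimal sketch in Python. It simply asks the kernel for an anonymous mapping that is both writable and executable, which is the kind of request the man page excerpt above says is prohibited. The function name `can_map_writable_executable` is made up for illustration, and treating `EPERM` or `EACCES` as "denied" is an assumption about how the refusal surfaces, not something confirmed in this thread.

```python
import errno
import mmap


def can_map_writable_executable(length: int = mmap.PAGESIZE) -> bool:
    """Return True if an anonymous writable+executable mapping can be created.

    Hypothetical helper for illustration only. It returns False when the
    request is refused with EPERM or EACCES (assumed here to be how a
    MemoryDenyWriteExecute-style restriction shows up) and re-raises any
    other error.
    """
    try:
        region = mmap.mmap(
            -1,  # anonymous mapping, not backed by a file
            length,
            flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
            prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC,
        )
    except OSError as exc:
        if exc.errno in (errno.EPERM, errno.EACCES):
            return False
        raise
    region.close()
    return True


if __name__ == "__main__":
    print("W+X mmap allowed:", can_map_writable_executable())
```

The result only describes the process that runs the check. A service started by systemd with the setting enabled and an interactive shell on the same host may see different answers.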

@KCSesh
Contributor Author

KCSesh commented Oct 24, 2024

We have successfully rolled back 1.26.0, and now 1.25.0 is listed as the latest AMI available.

@kintoandar

Thank you for the rollback, version 1.26.0 wreaked havoc on our infrastructure

@patkinson01

Hi @KCSesh, I've just rolled out new nodes on a cluster pinned to the 1.25 AMI; however, BRUPOP is still updating to 1.26. Has the BRUPOP rollback been deployed?

@portswigger-tim

portswigger-tim commented Oct 24, 2024

Another AWS blerp - I think I'll be pinning versions from now on 🤣 - It seems that Karpenter still thinks the latest version is 1.26.0 - possibly as a result of caching 🤔

@mikel-jason

Thanks for your efforts and rolling back, appreciated a lot! ❤️

We still see 1.26 AMIs published in AWS (owneralias: amazon). Can you tell if/when they will be removed?

@btuffreau

I've also noticed an issue with the nginx-ingress controller (Helm chart ingress-nginx-4.10.1) being unable to start. Let me know if this warrants a separate issue.

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.10.1
  Build:         4fb5aac1dd3669daa3a14d9de3e3cdb371b4c518
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.3

-------------------------------------------------------------------------------

W1024 09:36:02.004830       7 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1024 09:36:02.004940       7 main.go:205] "Creating API client" host="https://10.101.0.1:443"
I1024 09:36:02.010877       7 main.go:248] "Running in Kubernetes cluster" major="1" minor="29+" git="v1.29.9-eks-ce1d5eb" state="clean" commit="573458d196a7a17e4d349781dbad9c8b56de2681" platform="linux/amd64"
I1024 09:36:02.158976       7 main.go:101] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I1024 09:36:02.176597       7 ssl.go:535] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I1024 09:36:02.185223       7 nginx.go:264] "Starting NGINX Ingress controller"
I1024 09:36:02.185263       7 logger.go:42] Is Chrooted, starting logger
I1024 09:36:02.190905       7 event.go:364] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"ingress-nginx-controller", UID:"3986ab72-cd5d-424f-9767-0c943c6aecdf", APIVersion:"v1", ResourceVersion:"5830", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/ingress-nginx-controller
I1024 09:36:03.387014       7 nginx.go:307] "Starting NGINX process"
I1024 09:36:03.387053       7 leaderelection.go:250] attempting to acquire leader lease ingress-nginx/ingress-nginx-leader...
I1024 09:36:03.387399       7 nginx.go:327] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I1024 09:36:03.387576       7 controller.go:190] "Configuration changes detected, backend reload required"
I1024 09:36:03.390475       7 status.go:84] "New leader elected" identity="ingress-nginx-controller-678b74779-4ss7z"
I1024 09:36:03.457826       7 controller.go:210] "Backend successfully reloaded"
I1024 09:36:03.457906       7 controller.go:221] "Initial sync, sleeping for 1 second"
I1024 09:36:03.457978       7 event.go:364] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-678b74779-z62th", UID:"33701a45-ec08-47e1-a1d6-289acc46139a", APIVersion:"v1", ResourceVersion:"627864", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
PANIC: unprotected error in call to Lua API (runtime code generation failed, restricted kernel?)
W1024 09:36:03.498316       7 nginx.go:37]
-------------------------------------------------------------------------------
NGINX master process died (1): exit status 1
-------------------------------------------------------------------------------
W1024 09:36:04.459276       7 controller.go:241] Dynamic reconfiguration failed (retrying; 15 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W1024 09:36:05.523712       7 controller.go:241] Dynamic reconfiguration failed (retrying; 14 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused

Thanks for the update.
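The line in the log above that seems to matter is the `PANIC` from the Lua API ("runtime code generation failed, restricted kernel?"). As a rough sketch for triaging other pods, the snippet below scans log lines for that message. The marker string is copied from the log above; the function name `find_jit_failures` is hypothetical, and a match is only a hint that JIT code generation was blocked, not proof of the cause.

```python
from typing import Iterable, List

# Substring taken from the PANIC line in the log above.
JIT_FAILURE_MARKER = "runtime code generation failed"


def find_jit_failures(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the JIT failure marker."""
    return [line.rstrip("\n") for line in lines if JIT_FAILURE_MARKER in line]


if __name__ == "__main__":
    import sys

    # Example: kubectl logs <pod> | python this_script.py
    for hit in find_jit_failures(sys.stdin):
        print(hit)
```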

@rpkelly
Contributor

rpkelly commented Oct 24, 2024

@mikel-jason Do you have a requirement for the AMIs themselves to no longer be discoverable? Our actions thus far have not affected that; instead, we have rolled back via control of our latest SSM parameter and our TUF repositories.

@KCSesh
Contributor Author

KCSesh commented Oct 24, 2024

Hey @patkinson01, we have heard reports of that, but the affected nodes enabled ignore-waves. Can you confirm that’s the case for you?

@patkinson01

> Hey @patkinson01, we have heard reports of that, but the affected nodes enabled ignore-waves. Can you confirm that’s the case for you?

Yes, the affected clusters were using ignore-waves=true, but I can now confirm these clusters are fixing on 1.25.

@mikel-jason

@rpkelly We're using Karpenter, which analyzes available AMIs and does not work with SSM params. I would guess we are not the only ones? See https://karpenter.sh/docs/concepts/nodeclasses/#specamiselectorterms

We rolled back our affected clusters by selecting version 1.25. We have some that should explicitly run with latest, which is not possible at the moment, from what I understand.

@larvacea
Member

larvacea commented Oct 28, 2024

> We're using karpenter which analyzes available AMIs and does not work with SSM params.

Karpenter will consult AWS public SSM parameters to find AMI IDs for Amazon Linux 2, Amazon Linux 2023, or Bottlerocket given an alias parameter. There is a complication:

The newer versions of Karpenter can cache discovered SSM aliases for up to 24 hours.

I am told that it is possible to restart Karpenter to discard the cache.

Caching is not a problem for explicit version aliases, as that SSM parameter is immutable once published:

You may wish to consult Karpenter's Managing AMIs documentation for some additional suggestions. I hope this information helps.
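To make the difference between the "latest" parameter and an explicit version concrete, here is a small Python sketch. It assumes the public Bottlerocket parameters follow the `/aws/service/bottlerocket/<variant>/<arch>/<version>/image_id` layout; please check the Bottlerocket documentation for your variant before relying on it. The function names, the `aws-k8s-1.29` variant, and the region in the usage note are illustrative choices, not something stated in this thread.

```python
from typing import Any


def bottlerocket_ssm_parameter(
    variant: str,
    arch: str = "x86_64",
    version: str = "latest",
    field: str = "image_id",
) -> str:
    """Build the SSM parameter name for a Bottlerocket AMI (assumed layout)."""
    return f"/aws/service/bottlerocket/{variant}/{arch}/{version}/{field}"


def resolve_ami_id(
    ssm_client: Any,
    variant: str,
    arch: str = "x86_64",
    version: str = "latest",
) -> str:
    """Look up the AMI ID using any client that offers get_parameter(Name=...).

    With version="latest" the answer can change over time, for example after
    a rollback. With an explicit version such as "1.25.0" it should not.
    """
    name = bottlerocket_ssm_parameter(variant, arch, version)
    response = ssm_client.get_parameter(Name=name)
    return response["Parameter"]["Value"]
```

With boto3 installed and credentials configured, something like `resolve_ami_id(boto3.client("ssm", region_name="us-west-2"), "aws-k8s-1.29", version="1.25.0")` would fetch a pinned AMI ID, while leaving `version` at its default follows whatever "latest" currently points to.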

@KCSesh KCSesh unpinned this issue Nov 4, 2024
9 participants