rke2-server isn't starting up after restarting cluster node #6644

Open
kurborg opened this issue Aug 27, 2024 · 5 comments

@kurborg

kurborg commented Aug 27, 2024

Environmental Info:
RKE2 Version:
rke2 version v1.26.15+rke2r1 (a413a7f)
go version go1.21.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Rocky 8.6

Cluster Configuration:
1 server. 5 agents.

Describe the bug:
The API server seems to be failing to come up on port 6443.

Steps To Reproduce:
sudo systemctl restart rke2-server

Expected behavior:
rke2-server starts up normally and connects to agent nodes.

Actual behavior:
Command:
netstat -ltn
Output:
Shows that etcd is listening on port 2379
tcp 129 0 127.0.0.1:2379 0.0.0.0:* LISTEN

Command:
curl -k https://127.0.0.1:2379/health
Output:
Times out
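
(Note: etcd on RKE2 requires client certificates, so curl -k without them is not a reliable health probe. A minimal sketch using the server's etcd client certs; the paths below are the usual RKE2 defaults and should be verified on the node:)

curl --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  https://127.0.0.1:2379/health
# a healthy member typically answers {"health":"true"}; a hang here points at etcd itself rather than TLS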

Command:
journalctl -u rke2-server -f
Output:
Aug 27 16:31:32 OurServer rke2[132272]: {"level":"warn","ts":"2024-08-27T16:31:32.345951-0400","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d29500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Aug 27 16:31:32 OurServer rke2[132272]: time="2024-08-27T16:31:32-04:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Aug 27 16:31:32 OurServer rke2[132272]: time="2024-08-27T16:31:32-04:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Aug 27 16:31:35 OurServer rke2[132272]: time="2024-08-27T16:31:35-04:00" level=error msg="Sending HTTP 503 response to 192.168.1.111:53002: runtime core not ready"
Aug 27 16:31:37 OurServer rke2[132272]: time="2024-08-27T16:31:37-04:00" level=error msg="Sending HTTP 503 response to 192.168.1.115:50932: runtime core not ready"
Aug 27 16:31:37 OurServer rke2[132272]: time="2024-08-27T16:31:37-04:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Aug 27 16:31:40 OurServer rke2[132272]: time="2024-08-27T16:31:40-04:00" level=error msg="Sending HTTP 503 response to 192.168.1.111:53158: runtime core not ready"
Aug 27 16:31:41 OurServer rke2[132272]: {"level":"warn","ts":"2024-08-27T16:31:41.599955-0400","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d29500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: i/o timeout\""}
Aug 27 16:31:41 OurServer rke2[132272]: {"level":"info","ts":"2024-08-27T16:31:41.600009-0400","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Aug 27 16:31:41 OurServer rke2[132272]: time="2024-08-27T16:31:41-04:00" level=info msg="Waiting for etcd server to become available"
Aug 27 16:31:41 OurServer rke2[132272]: time="2024-08-27T16:31:41-04:00" level=info msg="Waiting for API server to become available"

Additional context / logs:

We have disabled firewalld on all the server and agent nodes.
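
(A quick way to confirm that on each node, using standard systemd commands:)

systemctl is-active firewalld    # expect "inactive" (or the unit to not exist)
systemctl is-enabled firewalld   # expect "disabled" or "masked"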

/etc/rancher/rke2/rke2.yaml

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <OUR_CERT>
    server: https://127.0.0.1:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    client-certificate-data: <OUR_CERT>
    client-key-data: <OUR_CERT>

/etc/rancher/rke2/registries.yaml

mirrors:
  docker.io:
    endpoint:
      - 'https://harbor.ourInstance.com'
  harbor.ourInstance.com:
    endpoint:
      - 'https://harbor.ourInstance.com'

configs:
  'harbor.ourInstance.com':
    auth:
      username: admin # this is the registry username
      password: $HARBOR_PASSWORD # this is the registry password
    tls:
      cert_file: /etc/docker/certs.d/harbor.ourInstance.com/harbor.ourInstance.com.cert   # path to the cert file used to authenticate to the registry
      key_file: /etc/docker/certs.d/harbor.ourInstance.com/harbor.ourInstance.com.key            # path to the key file for the certificate used to authenticate to the registry
      ca_file: /etc/docker/certs.d/harbor.ourInstance.com/ca.crt       # path to the ca file used to verify the registry's certificate
      insecure_skip_verify: false # may be set to true to skip verifying the registry's certificate
@brandond
Member

The etcd and kube-apiserver static pods do not appear to be running. Check the logs under /var/log/pods to see why. You might also check kubelet.log.
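
(For reference, a sketch of where those logs usually live on an RKE2 server node; the pod directory names will differ per node, and the kubelet.log path is the usual RKE2 default:)

ls /var/log/pods/ | grep -E 'etcd|kube-apiserver'
tail -n 200 /var/log/pods/kube-system_etcd-*/etcd/*.log
tail -n 200 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log
tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log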

@kurborg
Author

kurborg commented Aug 28, 2024

Yeah, I didn't find anything very helpful.

Command:
tail /var/log/pods/kube-system_kube-apiserver-silverbox1_c5fc9ef130551800e3d08eba4103e251/kube-apiserver/14.log

Output:
2024-08-27T13:04:33.2026255-04:00 stderr F I0827 17:04:33.202398 1 trace.go:219] Trace[1654830898]: "Get" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:a507a912-3063-4c5e-8709-b4e32a0575bd,client:192.168.1.114,protocol:HTTP/2.0,resource:nodes,scope:resource,url:/api/v1/nodes/silverbox5,user-agent:kubelet/v1.26.15+rke2r1 (linux/amd64) kubernetes/1649f59,verb:GET (27-Aug-2024 17:04:23.199) (total time: 10002ms):
2024-08-27T13:04:33.202636751-04:00 stderr F Trace[1654830898]: [10.002934173s] [10.002934173s] END
2024-08-27T13:04:33.20263862-04:00 stderr F E0827 17:04:33.202441 1 timeout.go:142] post-timeout activity - time-elapsed: 2.8923ms, GET "/api/v1/nodes/silverbox5" result: <nil>
2024-08-27T13:04:33.217658989-04:00 stderr F E0827 17:04:33.217493 1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
2024-08-27T13:04:33.217667206-04:00 stderr F E0827 17:04:33.217502 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
2024-08-27T13:04:33.218779791-04:00 stderr F E0827 17:04:33.218579 1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
2024-08-27T13:04:33.218788137-04:00 stderr F I0827 17:04:33.218598 1 trace.go:219] Trace[1519742216]: "List" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:7ef246ec-c97d-41d8-9c85-2d86373a279e,client:192.168.1.114,protocol:HTTP/2.0,resource:csidrivers,scope:cluster,url:/apis/storage.k8s.io/v1/csidrivers,user-agent:kubelet/v1.26.15+rke2r1 (linux/amd64) kubernetes/1649f59,verb:LIST (27-Aug-2024 17:03:33.217) (total time: 60001ms):
2024-08-27T13:04:33.21878948-04:00 stderr F Trace[1519742216]: ["cacher list" audit-id:7ef246ec-c97d-41d8-9c85-2d86373a279e,type:csidrivers.storage.k8s.io 60001ms (17:03:33.217)]
2024-08-27T13:04:33.218790599-04:00 stderr F Trace[1519742216]: [1m0.001315607s] [1m0.001315607s] END
2024-08-27T13:04:33.218791911-04:00 stderr F E0827 17:04:33.218625 1 timeout.go:142] post-timeout activity - time-elapsed: 1.138035ms, GET "/apis/storage.k8s.io/v1/csidrivers" result: <nil>

Command:
tail /var/log/pods/kube-system_etcd-silverbox1_9160bfe2d04901447e9cc07e54611d49/etcd/11.log.20240827-130353

Output:
2024-08-27T13:03:51.598160944-04:00 stderr F {"level":"info","ts":"2024-08-27T17:03:51.597908Z","caller":"v3rpc/maintenance.go:96","msg":"finished defragment"}
2024-08-27T13:03:52.090591742-04:00 stderr F {"level":"warn","ts":"2024-08-27T17:03:52.090291Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":11895010693730220364,"retry-timeout":"500ms"}
2024-08-27T13:03:52.48278394-04:00 stderr F {"level":"warn","ts":"2024-08-27T17:03:52.482555Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"1.999988543s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
2024-08-27T13:03:52.482792771-04:00 stderr F {"level":"info","ts":"2024-08-27T17:03:52.482589Z","caller":"traceutil/trace.go:171","msg":"trace[762688015] range","detail":"{range_begin:/registry/health; range_end:; }","duration":"2.000029581s","start":"2024-08-27T17:03:50.482552Z","end":"2024-08-27T17:03:52.482581Z","steps":["trace[762688015] 'agreement among raft nodes before linearized reading' (duration: 1.999987279s)"],"step_count":1}
2024-08-27T13:03:52.482794434-04:00 stderr F {"level":"warn","ts":"2024-08-27T17:03:52.482606Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-08-27T17:03:50.482547Z","time spent":"2.000054422s","remote":"127.0.0.1:59482","response type":"/etcdserverpb.KV/Range","request count":0,"request size":18,"response count":0,"response size":0,"request content":"key:\"/registry/health\" "}
2024-08-27T13:03:52.48279553-04:00 stderr F {"level":"warn","ts":"2024-08-27T17:03:52.482669Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"2.00031618s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
2024-08-27T13:03:52.4827968-04:00 stderr F {"level":"info","ts":"2024-08-27T17:03:52.482695Z","caller":"traceutil/trace.go:171","msg":"trace[1630550680] range","detail":"{range_begin:/registry/health; range_end:; }","duration":"2.000350502s","start":"2024-08-27T17:03:50.482339Z","end":"2024-08-27T17:03:52.48269Z","steps":["trace[1630550680] 'agreement among raft nodes before linearized reading' (duration: 2.000315157s)"],"step_count":1}
2024-08-27T13:03:52.483099162-04:00 stderr F {"level":"warn","ts":"2024-08-27T17:03:52.48271Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-08-27T17:03:50.482334Z","time spent":"2.000372167s","remote":"127.0.0.1:59480","response type":"/etcdserverpb.KV/Range","request count":0,"request size":18,"response count":0,"response size":0,"request content":"key:\"/registry/health\" "}
2024-08-27T13:03:52.590909904-04:00 stderr F {"level":"warn","ts":"2024-08-27T17:03:52.590649Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":11895010693730220364,"retry-timeout":"500ms"}
2024-08-27T13:03:53.091226335-04:00 stderr F {"level":"warn","ts":"2024-08-27T17:03:53.090896Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":11895010693730220364,"retry-timeout":"500ms"}

@brandond
Member

Please just attach the files; a few lines from the tail aren't really enough to figure out what's going on.

The messages from etcd suggest that it is running EXTREMELY slowly though. What are the CPU and IO load on this node? If this is a single-node cluster, there should be no delays in RAFT operations.
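
(A sketch of the kind of checks that answer that, using standard tools; iostat comes from the sysstat package:)

uptime                      # load averages
top -b -n 1 | head -n 15    # CPU usage snapshot
iostat -x 1 3               # per-device utilization and await times
df -h /var/lib/rancher      # confirm the etcd data volume isn't full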

@pythonbdfl

pythonbdfl commented Aug 30, 2024

I was having the same issue today, but I kept looking and found that I had been using the gzipped rke2 binary, which didn't have any suffix attached to it. Run the Linux 'file' command on your rke2 binary to verify that you're not doing the same. If it's a gzip file, the fix is simply to extract it and move the extracted bin/rke2 into place.
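
(A minimal check, assuming the binary was installed at /usr/local/bin/rke2; adjust the path for your install:)

file /usr/local/bin/rke2
# "ELF 64-bit LSB executable ..." means a real binary
# "gzip compressed data ..." means the release archive was installed as-is; extract it and install bin/rke2 from it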


This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
