
Traefik stops being monitorable during graceTimeout #1202

Open
brianbraunstein opened this issue Sep 27, 2024 · 2 comments
Labels
kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed.

Comments

@brianbraunstein

Welcome!

  • Yes, I've searched similar issues on GitHub and didn't find any.
  • Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What version of the Traefik Helm Chart are you using?

v32.0.0

What version of Traefik are you using?

The default from the v32.0.0 Helm chart

What did you do?

I noticed that Traefik's Prometheus target is marked up == 0 while kube_pod_status_ready{condition="true"} == 0, i.e. the instance becomes unmonitorable while the pod is shutting down.

I traced it to the Helm chart not setting requestAcceptGraceTimeout for the metrics entrypoint/port (docs: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle). It should be set by default in this block of values.yaml:

metrics:
  # -- When using hostNetwork, use another port to avoid conflict with node exporter:
  # https://github.com/prometheus/prometheus/wiki/Default-port-allocations
  port: 9100
  # -- You may not want to expose the metrics port on production deployments.
  # If you want to access it from outside your cluster,
  # use `kubectl port-forward` or create a secure ingress
  expose:
    default: false
  # -- The exposed port for this service
  exposedPort: 9100
  # -- The port protocol (TCP/UDP)
  protocol: TCP

The default value for graceTimeout is 10 seconds according to https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle, which means most people don't notice this bug. However, we needed to increase graceTimeout to handle long-lived connections, so Traefik becomes completely unmonitored for long periods of time and appears down (up == 0) to our Prometheus.
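A possible workaround sketch until the chart sets this by default: pass the flag from the lifecycle docs above through the chart's additionalArguments value, so the metrics entrypoint keeps accepting scrapes for the whole grace period. The entrypoint name "metrics" comes from the ports block above, and the 1h value is only an illustration matching a long graceTimeout:

additionalArguments:
  # Keep the metrics entrypoint accepting new connections (Prometheus scrapes)
  # while existing connections are drained during graceTimeout.
  - "--entryPoints.metrics.transport.lifeCycle.requestAcceptGraceTimeout=1h"

Newer chart versions also appear to expose ports.<name>.transport.lifeCycle in values.yaml, which may be a cleaner place for the same setting.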

What did you see instead?

I saw a bug

What is your environment & configuration?

Traefik Helm chart + Kubernetes + Prometheus + long-lived connections, with graceTimeout set to a long value (hours).

Additional Information

No response

@mloiseleur added the kind/bug/possible label and removed the status/0-needs-triage label on Oct 4, 2024
@mloiseleur
Contributor

At first glance this looks more like a configuration enhancement, or a warning to display, than a real bug, but let's dig into it.
Would you please share the values that reproduce the issue you encountered?

@brianbraunstein
Author

Can you confirm whether these statements are true? I might be missing something:

  • A) By default in Traefik, graceTimeout is 10 seconds source
  • B) By default in Traefik's Helm chart, the metrics entrypoint does not set requestAcceptGraceTimeout source, so it gets the default of 0s source
  • C) A + B means the default Helm chart leaves Traefik unmonitorable for 10 seconds while shutting down (defaults illustrated in the sketch below)
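For reference, a minimal sketch of what A and B mean together in Traefik's static configuration, using the documented defaults from the lifecycle link above (the "metrics" entrypoint name is the one created by the chart's ports block):

entryPoints:
  metrics:
    transport:
      lifeCycle:
        # Default 0s: the entrypoint stops accepting new connections as soon as shutdown starts,
        # so Prometheus scrapes begin failing immediately.
        requestAcceptGraceTimeout: 0s
        # Default 10s: time given to in-flight requests to finish before the entrypoint closes.
        graceTimeOut: 10s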
