
Traefik stops being monitorable during graceTimeout #1202

Open
brianbraunstein opened this issue Sep 27, 2024 · 2 comments
Labels
kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed.

Comments

@brianbraunstein

Welcome!

  • Yes, I've searched similar issues on GitHub and didn't find any.
  • Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What version of the Traefik Helm Chart are you using?

v32.0.0

What version of Traefik are you using?

The default from the v32.0.0 Helm chart

What did you do?

I noticed that Traefik's Prometheus target is marked up == 0 while kube_pod_status_ready{condition="true"} == 0, i.e. the instance becomes unmonitorable while the pod is shutting down.

I traced it to the Helm chart not setting requestAcceptGraceTimeout for the metrics entrypoint/port (docs: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle). It should be set by default in this block of values.yaml:

metrics:
  # -- When using hostNetwork, use another port to avoid conflict with node exporter:
  # https://github.com/prometheus/prometheus/wiki/Default-port-allocations
  port: 9100
  # -- You may not want to expose the metrics port on production deployments.
  # If you want to access it from outside your cluster,
  # use `kubectl port-forward` or create a secure ingress
  expose:
    default: false
  # -- The exposed port for this service
  exposedPort: 9100
  # -- The port protocol (TCP/UDP)
  protocol: TCP

The default value for graceTimeout is 10 seconds according to https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle, which means most people don't notice this bug. However, we needed to increase graceTimeout to handle long-lived connections, so Traefik becomes completely unmonitored for long periods of time and appears down (up == 0) to our Prometheus.
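A possible workaround sketch until the chart sets this by default: pass the flag from the lifecycle docs above through the chart's additionalArguments value, so the metrics entrypoint keeps accepting scrapes for the whole grace period. The entrypoint name "metrics" comes from the ports block above, and the 1h value is only an illustration matching a long graceTimeout:

additionalArguments:
  # Keep the metrics entrypoint accepting new connections (Prometheus scrapes)
  # while existing connections are drained during graceTimeout.
  - "--entryPoints.metrics.transport.lifeCycle.requestAcceptGraceTimeout=1h"

Newer chart versions also appear to expose ports.<name>.transport.lifeCycle in values.yaml, which may be a cleaner place for the same setting.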

What did you see instead?

I saw a bug

What is your environment & configuration?

Traefik Helm chart + Kubernetes + Prometheus + long-lived connections, with graceTimeout set to a long value (hours).

Additional Information

No response

@mloiseleur added the kind/bug/possible label and removed the status/0-needs-triage label on Oct 4, 2024
@mloiseleur
Contributor

At first glance this looks more like a configuration enhancement, or a warning to display, than a real bug, but let's dig into it.
Would you please share the values that reproduce the issue you encountered?

@brianbraunstein
Author

Can you confirm whether these statements are true? I might be missing something:

  • A) By default in Traefik, graceTimeout is 10 seconds source
  • B) By default in Traefik's Helm chart, the metrics entrypoint does not set requestAcceptGraceTimeout source, so it gets the default of 0s source
  • C) A + B means the default Helm chart leaves Traefik unmonitorable for 10 seconds while shutting down (defaults illustrated in the sketch below)
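For reference, a minimal sketch of what A and B mean together in Traefik's static configuration, using the documented defaults from the lifecycle link above (the "metrics" entrypoint name is the one created by the chart's ports block):

entryPoints:
  metrics:
    transport:
      lifeCycle:
        # Default 0s: the entrypoint stops accepting new connections as soon as shutdown starts,
        # so Prometheus scrapes begin failing immediately.
        requestAcceptGraceTimeout: 0s
        # Default 10s: time given to in-flight requests to finish before the entrypoint closes.
        graceTimeOut: 10s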
