What feature you would like to be added?
When submitting Spark applications to a Kubernetes cluster, intermittent connection issues with the Kubernetes API server (e.g., Connection refused or Failed to connect to /:443) can cause job submissions to fail entirely. These errors are often transient, caused by short-lived network interruptions, API server restarts, or resource contention. Currently, the Spark operator lacks a robust retry mechanism for such scenarios, leading to failed jobs that could have succeeded with retries.
Why is this needed?
• Improved reliability of Spark applications in production environments with transient network or API server issues.
• Reduced operational overhead caused by manually resubmitting failed jobs.
• Enhanced user experience with more resilient job submissions.
Describe the solution you would like
Introduce enhanced retry logic in Spark's Kubernetes client to handle transient connection errors more gracefully during job submission.
Key Features:
Configurable Retry Mechanism:
Add configuration options for retry behavior, such as (see the sketch below):
  - Number of retries (spark.kubernetes.maxRetries).
  - Initial delay and backoff multiplier for retries (spark.kubernetes.retryBackoffMs).
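As a rough illustration, the proposed keys could be resolved from SparkConf with conservative defaults. Both key names come from this request (neither exists in Spark today) and the default values shown are placeholders, not agreed-upon settings:

```scala
import org.apache.spark.SparkConf

// Proposed keys from this issue; neither exists in Spark today, and the
// defaults below are placeholder values, not agreed-upon settings.
val conf = new SparkConf()
val maxRetries     = conf.getInt("spark.kubernetes.maxRetries", 3)
val retryBackoffMs = conf.getLong("spark.kubernetes.retryBackoffMs", 1000L)
```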
Retryable Error Detection:
Implement logic to detect and classify retryable errors (e.g., java.net.ConnectException, Connection refused, TimeoutException) and apply retries selectively.
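One way this classification could look, sketched under the assumption that only connection-level failures are retried and that wrapped client errors are recognized by message matching; the helper name is hypothetical:

```scala
import java.net.{ConnectException, SocketTimeoutException}
import java.util.concurrent.TimeoutException

// Hypothetical helper: decide whether a submission failure is worth retrying.
def isRetryable(t: Throwable): Boolean = t match {
  case _: ConnectException | _: SocketTimeoutException | _: TimeoutException => true
  case other =>
    // Fall back to message matching for wrapped client exceptions,
    // e.g. "Connection refused" or "Failed to connect to ...:443".
    val msg = Option(other.getMessage).getOrElse("")
    msg.contains("Connection refused") || msg.contains("Failed to connect")
}
```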
Enhanced Logging:
Provide detailed logs for retry attempts, including reasons for retries and the final status after all attempts.
Default Safe Retry Settings:
Define safe default retry values to avoid overwhelming the Kubernetes API server while improving resiliency.
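Putting those pieces together, a hedged sketch of what the retry loop itself could look like. Here `submit` stands in for whatever call actually creates the driver resources against the API server, `isRetryable` is the helper sketched above, and the log format is illustrative rather than Spark's actual logging:

```scala
// Hypothetical wrapper, not Spark's actual submission API: retries a
// submission with exponential backoff while the failure looks transient.
def submitWithRetries[T](submit: () => T,
                         remaining: Int,
                         delayMs: Long,
                         backoffMultiplier: Double = 2.0): T = {
  try submit()
  catch {
    case e: Throwable if isRetryable(e) && remaining > 0 =>
      // Log why we retry and how long we back off before the next attempt.
      println(s"Transient submission failure (${e.getMessage}); " +
              s"retrying in $delayMs ms, $remaining attempt(s) left")
      Thread.sleep(delayMs)
      submitWithRetries(submit, remaining - 1,
                        (delayMs * backoffMultiplier).toLong, backoffMultiplier)
  }
}

// Example with the placeholder defaults above (createDriverPod is a stand-in):
// submitWithRetries(() => createDriverPod(), remaining = maxRetries, delayMs = retryBackoffMs)
```

Capping the number of attempts and growing the delay between them is what keeps retries from overwhelming the API server, which is the intent behind the safe defaults above.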
Describe alternatives you have considered
• Implement custom retry logic in Spark job submission scripts.
• Use external tools to monitor and resubmit failed jobs.
Additional context
This feature aligns with the principles of resilient distributed systems and would be especially beneficial in dynamic, multi-tenant Kubernetes clusters where transient errors are more common.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.