What feature you would like to be added?
When submitting Spark applications to a Kubernetes cluster, intermittent connection issues with the Kubernetes API server (e.g., Connection refused or Failed to connect to /:443) can cause job submissions to fail entirely. These errors are often transient, caused by short-lived network interruptions, API server restarts, or resource contention. Currently, the Spark operator lacks a robust retry mechanism for such scenarios, leading to failed jobs that could have succeeded with retries.
Why is this needed?
• Improved reliability of Spark applications in production environments with transient network or API server issues.
• Reduced operational overhead caused by manually resubmitting failed jobs.
• Enhanced user experience with more resilient job submissions.
Describe the solution you would like
Introduce enhanced retry logic in Spark's Kubernetes client to handle transient connection errors more gracefully during job submission.
Key Features:
Configurable Retry Mechanism:
Add configuration options for retry behavior, such as (see the sketch below):
  - Number of retries (spark.kubernetes.maxRetries).
  - Initial delay and backoff multiplier for retries (spark.kubernetes.retryBackoffMs).
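As a rough illustration, the proposed keys could be resolved from SparkConf with conservative defaults. Both key names come from this request (neither exists in Spark today) and the default values shown are placeholders, not agreed-upon settings:

```scala
import org.apache.spark.SparkConf

// Proposed keys from this issue; neither exists in Spark today, and the
// defaults below are placeholder values, not agreed-upon settings.
val conf = new SparkConf()
val maxRetries     = conf.getInt("spark.kubernetes.maxRetries", 3)
val retryBackoffMs = conf.getLong("spark.kubernetes.retryBackoffMs", 1000L)
```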
Retryable Error Detection:
Implement logic to detect and classify retryable errors (e.g., java.net.ConnectException, Connection refused, TimeoutException) and apply retries selectively.
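One way this classification could look, sketched under the assumption that only connection-level failures are retried and that wrapped client errors are recognized by message matching; the helper name is hypothetical:

```scala
import java.net.{ConnectException, SocketTimeoutException}
import java.util.concurrent.TimeoutException

// Hypothetical helper: decide whether a submission failure is worth retrying.
def isRetryable(t: Throwable): Boolean = t match {
  case _: ConnectException | _: SocketTimeoutException | _: TimeoutException => true
  case other =>
    // Fall back to message matching for wrapped client exceptions,
    // e.g. "Connection refused" or "Failed to connect to ...:443".
    val msg = Option(other.getMessage).getOrElse("")
    msg.contains("Connection refused") || msg.contains("Failed to connect")
}
```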
Enhanced Logging:
Provide detailed logs for retry attempts, including reasons for retries and the final status after all attempts.
Default Safe Retry Settings:
Define safe default retry values to avoid overwhelming the Kubernetes API server while improving resiliency.
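Putting those pieces together, a hedged sketch of what the retry loop itself could look like. Here `submit` stands in for whatever call actually creates the driver resources against the API server, `isRetryable` is the helper sketched above, and the log format is illustrative rather than Spark's actual logging:

```scala
// Hypothetical wrapper, not Spark's actual submission API: retries a
// submission with exponential backoff while the failure looks transient.
def submitWithRetries[T](submit: () => T,
                         remaining: Int,
                         delayMs: Long,
                         backoffMultiplier: Double = 2.0): T = {
  try submit()
  catch {
    case e: Throwable if isRetryable(e) && remaining > 0 =>
      // Log why we retry and how long we back off before the next attempt.
      println(s"Transient submission failure (${e.getMessage}); " +
              s"retrying in $delayMs ms, $remaining attempt(s) left")
      Thread.sleep(delayMs)
      submitWithRetries(submit, remaining - 1,
                        (delayMs * backoffMultiplier).toLong, backoffMultiplier)
  }
}

// Example with the placeholder defaults above (createDriverPod is a stand-in):
// submitWithRetries(() => createDriverPod(), remaining = maxRetries, delayMs = retryBackoffMs)
```

Capping the number of attempts and growing the delay between them is what keeps retries from overwhelming the API server, which is the intent behind the safe defaults above.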
Describe alternatives you have considered
• Implement custom retry logic in Spark job submission scripts.
• Use external tools to monitor and resubmit failed jobs.
Additional context
This feature aligns with the principles of resilient distributed systems and would be especially beneficial in dynamic, multi-tenant Kubernetes clusters where transient errors are more common.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.