
Handle transient errors gracefully by retrying when submitting requests to the k8s API server #2345

Open
ajvishwa-dev opened this issue Dec 3, 2024 · 0 comments

What feature would you like to be added?

When submitting Spark applications to a Kubernetes cluster, intermittent connection issues with the Kubernetes API server (e.g., Connection refused or Failed to connect to /:443) can cause job submissions to fail entirely. These errors are often transient, caused by short-lived network interruptions, API server restarts, or resource contention. Currently, the Spark operator lacks a robust retry mechanism for such scenarios, so jobs fail that would have succeeded if the submission had simply been retried.

Why is this needed?

• Improved reliability of Spark applications in production environments with transient network or API server issues.
• Reduced operational overhead caused by manually resubmitting failed jobs.
• Enhanced user experience with more resilient job submissions.

Describe the solution you would like

Introduce enhanced retry logic in Spark's Kubernetes client to handle transient connection errors more gracefully during job submission.
Key Features (see the sketch after this list):

  1. Configurable Retry Mechanism:
    Add configuration options for retry behavior, such as:
    • Number of retries (spark.kubernetes.maxRetries).
    • Initial delay and backoff multiplier for retries (spark.kubernetes.retryBackoffMs).
  2. Retryable Error Detection:
    Detect and classify retryable errors (e.g., java.net.ConnectException, Connection refused, TimeoutException) and apply retries selectively, failing fast on everything else.
  3. Enhanced Logging:
    Provide detailed logs for retry attempts, including the reason for each retry and the final status after all attempts.
  4. Default Safe Retry Settings:
    Define safe default retry values that improve resiliency without overwhelming the Kubernetes API server.
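
As a rough illustration only (not the operator's actual code), here is a minimal Go sketch of the proposed behavior: a bounded retry loop with exponential backoff that only retries errors classified as transient. The names RetryConfig, isRetryable, and submitWithRetry are hypothetical, and the struct fields merely mirror the proposed spark.kubernetes.maxRetries and spark.kubernetes.retryBackoffMs settings.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"
	"syscall"
	"time"
)

// RetryConfig mirrors the proposed settings: a bounded number of retries
// plus an initial backoff that grows by a multiplier on each attempt.
type RetryConfig struct {
	MaxRetries        int           // analogous to the proposed spark.kubernetes.maxRetries
	InitialBackoff    time.Duration // analogous to the proposed spark.kubernetes.retryBackoffMs
	BackoffMultiplier float64
}

// isRetryable classifies transient failures (timeouts, connection refused)
// that are worth retrying; anything else is treated as permanent.
func isRetryable(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return errors.Is(err, syscall.ECONNREFUSED)
}

// submitWithRetry wraps a submission call with exponential backoff and logs
// each retry attempt, its cause, and the final outcome.
func submitWithRetry(cfg RetryConfig, submit func() error) error {
	backoff := cfg.InitialBackoff
	var err error
	for attempt := 1; attempt <= cfg.MaxRetries+1; attempt++ {
		if err = submit(); err == nil {
			return nil
		}
		if !isRetryable(err) {
			return fmt.Errorf("non-retryable submission error: %w", err)
		}
		if attempt == cfg.MaxRetries+1 {
			break
		}
		log.Printf("submission attempt %d failed (%v); retrying in %s", attempt, err, backoff)
		time.Sleep(backoff)
		backoff = time.Duration(float64(backoff) * cfg.BackoffMultiplier)
	}
	return fmt.Errorf("submission failed after %d attempts: %w", cfg.MaxRetries+1, err)
}

func main() {
	cfg := RetryConfig{MaxRetries: 3, InitialBackoff: 500 * time.Millisecond, BackoffMultiplier: 2.0}
	// Stand-in for the real API call; always reports a transient failure here.
	err := submitWithRetry(cfg, func() error { return syscall.ECONNREFUSED })
	log.Printf("final status: %v", err)
}
```

Bounding the number of attempts and growing the delay between them keeps retries from hammering an API server that is already struggling, which is the intent behind the "Default Safe Retry Settings" item above.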

Describe alternatives you have considered

  • Implement custom retry logic in Spark job submission scripts (a rough example is sketched below).
  • Use external tools to monitor and resubmit failed jobs.
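
For comparison, the scripted alternative might look roughly like the following Go wrapper around spark-submit; every command-line argument here (master URL, class, example jar) is a placeholder for illustration, not a recommendation.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	const maxAttempts = 3
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		// Re-run spark-submit outside the operator; all arguments are placeholders.
		cmd := exec.Command("spark-submit",
			"--master", "k8s://https://kubernetes.default.svc:443",
			"--deploy-mode", "cluster",
			"--class", "org.apache.spark.examples.SparkPi",
			"local:///opt/spark/examples/jars/spark-examples.jar")
		out, err := cmd.CombinedOutput()
		if err == nil {
			log.Printf("submission succeeded on attempt %d", attempt)
			return
		}
		// Retries on any non-zero exit, then waits progressively longer.
		log.Printf("attempt %d failed: %v\n%s", attempt, err, out)
		time.Sleep(time.Duration(attempt) * 5 * time.Second)
	}
	log.Fatalf("all %d submission attempts failed", maxAttempts)
}
```

Because a wrapper like this bases its decision only on the exit status, it cannot tell a transient connection failure apart from a genuine application failure, which is a key reason built-in, error-aware retries are preferable.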

Additional context

This feature aligns with the principles of resilient distributed systems and would be especially beneficial in dynamic, multi-tenant Kubernetes clusters where transient errors are more common.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
