Retry OOM killed jobs #4

Open
wants to merge 12 commits into
base: develop

Commits on Sep 16, 2024

  1. wip

    cmelone committed Sep 16, 2024
    d9b67c8

Commits on Sep 18, 2024

  1. 0491a4e
  2. c63072a
  3. add handle_pipeline function to collect webhook

    - if pipeline failed, check if any of the jobs failed due to OOM
    - insert OOM jobs into database for prediction step
    cmelone committed Sep 18, 2024
    6f978ad
  4. 496dca2
  5. clean comment

    cmelone committed Sep 18, 2024
    a79ce67
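
The flow described in the "add handle_pipeline function" commit can be sketched roughly as follows. This is not the code from the PR: the payload shape (`id`, `status`, `jobs`), the `oom_jobs` table, the `create_schema` helper, and the `is_oom` callback are all hypothetical names chosen for illustration, and how an OOM kill is actually detected is left to the caller.

```python
import sqlite3
from typing import Callable


def create_schema(db: sqlite3.Connection) -> None:
    # Hypothetical table holding jobs that were OOM killed.
    db.execute(
        "CREATE TABLE IF NOT EXISTS oom_jobs ("
        "job_id INTEGER PRIMARY KEY, pipeline_id INTEGER NOT NULL)"
    )


def handle_pipeline(
    payload: dict,
    db: sqlite3.Connection,
    is_oom: Callable[[dict], bool],
) -> list[int]:
    """Record OOM-killed jobs of a failed pipeline and return their ids."""
    # Pipelines that did not fail need no inspection.
    if payload.get("status") != "failed":
        return []

    # Keep only failed jobs that the caller-supplied check flags as OOM.
    oom_ids = [
        job["id"]
        for job in payload.get("jobs", [])
        if job.get("status") == "failed" and is_oom(job)
    ]

    # Store them so a later prediction step can look them up.
    db.executemany(
        "INSERT OR IGNORE INTO oom_jobs (job_id, pipeline_id) VALUES (?, ?)",
        [(job_id, payload["id"]) for job_id in oom_ids],
    )
    db.commit()
    return oom_ids
```

Passing `is_oom` in as a function keeps the sketch independent of any particular way of deciding that a job ran out of memory.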

Commits on Sep 19, 2024

  1. remove extra print

    cmelone committed Sep 19, 2024
    f0a16a7
  2. ce3c2b2
  3. cef234a
  4. cbe19c3
  5. add tests for OOM handling

    cmelone committed Sep 19, 2024
    45cc72a

Commits on Sep 27, 2024

  1. Add retry limit check on the collection side

    If a job has exceeded the defined retry limit, we won't mark it as needing a retry. However, because retries are handled at the pipeline level, the pipeline will still be retried if another of its jobs was OOM killed and is not over the limit, which leads to some idiosyncrasies.
    cmelone committed Sep 27, 2024
    29a26db
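
The interaction between the per-job retry limit and the pipeline-level retry can be illustrated with a small sketch. The field name `retry_count`, the two helper functions, and the reading that a job stays eligible while `retry_count < retry_limit` are assumptions made for this example, not details taken from the PR.

```python
def jobs_needing_retry(oom_jobs: list[dict], retry_limit: int) -> list[dict]:
    """Return the OOM-killed jobs that are still under the retry limit."""
    # Assumption: a job is eligible while its retry count is below the limit.
    return [job for job in oom_jobs if job["retry_count"] < retry_limit]


def should_retry_pipeline(oom_jobs: list[dict], retry_limit: int) -> bool:
    """A pipeline is retried if at least one of its OOM jobs is eligible."""
    return len(jobs_needing_retry(oom_jobs, retry_limit)) > 0


# Job 1 has used up its retries; job 2 has not.
pipeline_jobs = [
    {"id": 1, "retry_count": 3},
    {"id": 2, "retry_count": 0},
]

marked = jobs_needing_retry(pipeline_jobs, retry_limit=3)
print([job["id"] for job in marked])                    # [2]
print(should_retry_pipeline(pipeline_jobs, retry_limit=3))  # True
```

Here job 1 is not marked for retry, yet the pipeline it belongs to is still retried because of job 2, which is the kind of idiosyncrasy the commit message mentions.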