
Retry OOM killed jobs #4

Open

wants to merge 12 commits into develop
Conversation

@cmelone (Collaborator) commented Jan 23, 2024

Automatically retries jobs if they are OOM killed after gantry underallocates memory.


  • Implements a pipeline webhook handler that checks whether any jobs in the pipeline failed because they were OOM killed and inserts them into the db (a sketch of the handler follows this list).
  • Creates a new pipeline for the ref, following spackbot's example and GitLab's API.
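
For reference, here is a minimal sketch of how that handler could look, assuming an aiohttp-style app; the handler name, the db/prometheus app keys, and the exact is_oom signature are illustrative rather than gantry's actual API.

```python
# Minimal sketch of the pipeline webhook handler (aiohttp-style).
# The app keys ("db", "prometheus"), helper names, and the is_oom signature
# are assumptions for illustration only.
from aiohttp import web

async def handle_pipeline_webhook(request: web.Request) -> web.Response:
    payload = await request.json()

    # only act on pipelines that finished unsuccessfully
    if payload.get("object_attributes", {}).get("status") != "failed":
        return web.Response(status=200)

    # collect failed jobs that were OOM killed
    oomed_jobs = []
    for job in payload.get("builds", []):
        if job["status"] != "failed":
            continue
        # is_oom is expected to consult Prometheus/k8s metrics (see TODO below)
        if await request.app["prometheus"].job.is_oom(job["id"]):
            oomed_jobs.append(job)

    # record OOM killed jobs so the generate step can bump their allocations
    for job in oomed_jobs:
        await request.app["db"].insert_oom_job(job)

    return web.Response(status=200)
```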

As discussed in #74, we would prefer to restart jobs directly by supplying new variables, but this feature is not supported by GitLab. When a new pipeline is created for a ref, successful builds from the previous pipeline will be pruned in the generate/concretization step, minimizing wasted cycles.
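
Restarting therefore amounts to a single call to GitLab's pipeline-creation endpoint (POST /projects/:id/pipeline). A rough sketch; the instance URL and token handling are assumptions:

```python
# Sketch of triggering a new pipeline for a ref via GitLab's REST API
# (POST /projects/:id/pipeline). The instance URL and token plumbing are
# assumptions for illustration.
import os
import urllib.parse

import aiohttp

async def create_pipeline(project_id: int, ref: str) -> dict:
    url = (
        f"https://gitlab.spack.io/api/v4/projects/{project_id}/pipeline"
        f"?ref={urllib.parse.quote(ref, safe='')}"
    )
    headers = {"PRIVATE-TOKEN": os.environ["GITLAB_API_TOKEN"]}
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers) as resp:
            resp.raise_for_status()
            return await resp.json()
```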

During the generate step, gantry will receive a request for resource allocations for a job that was recently OOM killed. The program will look for an exact spec match in the database and return these modified variables (a sketch follows the list):

  • KUBERNETES_MEMORY_LIMIT * 1.2 -- bump the previous allocation by 20%
  • GANTRY_RETRY_COUNT += 1 -- maintain a count of how many times this spec has been retried
  • CPU request/limit and memory limit will remain unmodified
  • maybe: GANTRY_RETRY_ID -- GitLab ID of the original job, to link retries together; not sure if this is necessary
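
A sketch of that bump for a matched spec; the dict shape, the helper name, and the assumption that the stored value is a plain number (rather than a "2G"-style string) are illustrative:

```python
# Sketch of the variable bump applied during the generate step for a spec
# that was recently OOM killed. The dict shape and helper name are
# assumptions; the 20% factor and retry cap follow the description above.
MEMORY_BUMP_FACTOR = 1.2
MAX_RETRIES = 3

def retry_variables(previous: dict) -> dict | None:
    """Return modified job variables, or None if the retry limit is exhausted."""
    retry_count = previous.get("GANTRY_RETRY_COUNT", 0) + 1
    if retry_count > MAX_RETRIES:
        return None  # allow the job to fail; see the retry-limit discussion below

    variables = dict(previous)
    # assumes the stored limit is a plain number (e.g. bytes), not a "2G"-style string
    variables["KUBERNETES_MEMORY_LIMIT"] = int(
        previous["KUBERNETES_MEMORY_LIMIT"] * MEMORY_BUMP_FACTOR
    )
    variables["GANTRY_RETRY_COUNT"] = retry_count
    # CPU request/limit are passed through unchanged
    return variables
```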

To ensure we don't fall into an infinite loop of increasing memory limits, gantry will not bump the limit if the retry count would exceed three. Additionally, it will not restart a pipeline if every OOM killed job in it has already hit the retry limit. This means we are allowing certain jobs to fail. If we investigate and an increase seems warranted, how do we ensure that this gets communicated to gantry?
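
The pipeline-level decision could look roughly like this (the job dict shape and constant are assumptions):

```python
# Sketch of the pipeline-level decision: only restart the pipeline if at
# least one OOM killed job is still under the retry limit.
MAX_RETRIES = 3

def should_restart_pipeline(oomed_jobs: list[dict]) -> bool:
    """Restart only when some OOM killed job can still be bumped."""
    return any(job["retry_count"] < MAX_RETRIES for job in oomed_jobs)
```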

New optional columns in the jobs table (a schema sketch follows the list):

  • oomed -- whether the failed job was OOM killed
  • retry_count -- number of times the job has been retried
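
A possible migration, assuming the jobs table lives in SQLite and nullable INTEGER columns are acceptable:

```python
# Sketch of the schema change for the two new optional columns.
# Assumes the jobs table lives in SQLite; column types are illustrative.
import sqlite3

def add_retry_columns(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("ALTER TABLE jobs ADD COLUMN oomed INTEGER")
        conn.execute("ALTER TABLE jobs ADD COLUMN retry_count INTEGER")
```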

TODO:

  • fix OOM detection in the k8s cluster; see the kitware-llnl channel
  • tests
  • figure out how we'll deal with the missing-data issue -- use the last timestamp as a datapoint? (what happens during missed webhooks?)
  • get API permissions for restarting pipelines in the spack-infra terraform config
  • add new variables to annotations in the spack-infra k8s config

Questions:

  • Do we need to weigh the most recent build more heavily after it has been retried and bumped up?
    • No, we will allow the genetic algorithm to learn. If subsequent jobs are OOM killed, they will be given more memory and retried, which will eventually lead to an optimal memory limit.

@cmelone cmelone added the feature New feature or request label Jan 23, 2024
@cmelone cmelone self-assigned this Jan 23, 2024
@cmelone cmelone mentioned this pull request Jan 24, 2024
Base automatically changed from add/collection-func to develop February 12, 2024 19:19
@cmelone cmelone mentioned this pull request Sep 17, 2024
@cmelone cmelone changed the title Handle build OOMs Retry OOM killed jobs Sep 17, 2024
@cmelone cmelone marked this pull request as ready for review September 18, 2024 22:03
@cmelone (Collaborator, Author) commented Sep 18, 2024

@alecbcs the pipeline webhook aspect of this PR is ready for review. This first step essentially detects OOM killed jobs and inserts them into the db.

As I mentioned in the kitware channel, OOM detection is broken at the moment, so I will update the prometheus.job.is_oom method once that is cleared up.
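
Once it's working, that method might query something like kube-state-metrics' last-terminated-reason metric; a hypothetical sketch, where the pod labeling and query shape are assumptions:

```python
# Hypothetical sketch of an OOM check against Prometheus/kube-state-metrics.
# The metric labeling and query shape are assumptions, not gantry's actual code.
import aiohttp

async def is_oom(prometheus_url: str, pod_name: str) -> bool:
    query = (
        "kube_pod_container_status_last_terminated_reason"
        f'{{reason="OOMKilled", pod="{pod_name}"}}'
    )
    async with aiohttp.ClientSession() as session:
        async with session.get(
            f"{prometheus_url}/api/v1/query", params={"query": query}
        ) as resp:
            resp.raise_for_status()
            data = await resp.json()
    # any returned series means the pod's last termination reason was OOMKilled
    return bool(data["data"]["result"])
```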

@github-actions github-actions bot added the docs Improvements or additions to documentation label Sep 19, 2024
If a job is over the defined retry limit, we won't mark it as needing to be retried. However, because we are handling this at the pipeline level, if another job in the pipeline was OOMed but not over the retry limit, the pipeline will still be retried, leading to some idiosyncrasies.