Ephemeral runners are often not available - linux.12xlarge.ephemeral #5493

Open
atalman opened this issue Jul 23, 2024 · 5 comments

Comments

@atalman
Contributor

atalman commented Jul 23, 2024

Creating this issue to address ephemeral runners not being available.
This has been an ongoing issue for Docker builds: https://github.com/pytorch/builder/actions/workflows/build-manywheel-images.yml

Problematic runner: linux.12xlarge.ephemeral

Workflow queueing:
https://github.com/pytorch/builder/actions/runs/10054503945/job/27789222269?pr=1928 (spent 6 hours in the queue)
https://github.com/pytorch/builder/actions/runs/10046348302/job/27807149533

Since ephemeral runners need to be used in the nightly and release pipelines, is it possible to reserve capacity so that it's always available?

Can we look into a possible EC2 capacity reservation for these runners:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html

cc @jeanschmidt @malfet @seemethere

@jeanschmidt
Contributor

We can reserve capacity, but we can't guarantee that an instance we request will come from the reserved capacity.

So, say we reserve capacity for 10 c5.2xlarge instances. AFAIK, the first 10 c5.2xlarge instances we create will use that capacity. If two runner types both use c5.2xlarge, the reservation is consumed first come, first served. Since non-ephemeral instances tend to live longer, they tend to end up holding all of the reserved capacity.

This is a nice approach, but maybe we should have different instance types only for those runners so we leverage the reservation?
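
The first-come, first-served behavior described above can be illustrated with a toy model. This is only a sketch, not AWS code: the `Reservation` class, the `simulate` helper, the runner label `linux.2xlarge`, and the mapping of `linux.8xlarge.ephemeral` to c5.9xlarge are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    """Toy model of a capacity reservation for a single instance type."""
    instance_type: str
    slots: int
    holders: list = field(default_factory=list)

    def launch(self, runner_label: str, instance_type: str) -> bool:
        """Return True if this launch consumed one of the reserved slots."""
        if instance_type != self.instance_type:
            return False  # different instance type: reservation does not apply
        if len(self.holders) >= self.slots:
            return False  # reservation already fully consumed
        self.holders.append(runner_label)
        return True


def simulate(launches, reservation):
    """Apply launches in arrival order; report which ones got reserved capacity."""
    return [reservation.launch(label, itype) for label, itype in launches]


# Shared instance type: long-lived runners arrive first and hold every slot.
shared = simulate(
    [
        ("linux.2xlarge", "c5.2xlarge"),            # hypothetical non-ephemeral runner
        ("linux.2xlarge", "c5.2xlarge"),
        ("linux.2xlarge.ephemeral", "c5.2xlarge"),  # hypothetical ephemeral runner
    ],
    Reservation("c5.2xlarge", slots=2),
)

# Dedicated instance type: only the ephemeral runner matches the reservation.
dedicated = simulate(
    [
        ("linux.2xlarge", "c5.2xlarge"),
        ("linux.2xlarge", "c5.2xlarge"),
        ("linux.8xlarge.ephemeral", "c5.9xlarge"),  # assumed mapping
    ],
    Reservation("c5.9xlarge", slots=2),
)
```

In the shared case the ephemeral runner misses the reservation because earlier launches already hold both slots; with a dedicated instance type, the other runners never match it.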

@jeanschmidt
Contributor

I believe that we should eventually continue our work toward adding multi-region support, so we can solve these stock-out problems.

@atalman
Contributor Author

atalman commented Jul 23, 2024

Yes, I like the idea of having a different runner type. Maybe we can try linux.8xlarge.ephemeral?

@jeanschmidt
Contributor

OK, so I'll ask for a reservation of 15 c5.9xlarge instances that we dedicate to those runners.

@jeanschmidt
Contributor

Screenshot 2024-07-23 at 16 58 49

So we should now have 5 instances reserved in each of AZs A, B and C, for a total of 15 instances. I did not reserve them all in the same AZ, both for redundancy and because of potential networking issues (running out of IPv4 addresses in that region).
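
As a rough sketch of the split described above, the helper below spreads a total instance count evenly across availability zones and builds one request per zone. The function name `split_reservation` and the zone names are hypothetical. The dictionary keys follow the parameter names of the EC2 `CreateCapacityReservation` API as I understand them, and nothing here calls AWS or reflects the exact request that was made.

```python
def split_reservation(instance_type, total, zones):
    """Spread `total` reserved instances as evenly as possible across `zones`."""
    base, extra = divmod(total, len(zones))
    requests = []
    for i, zone in enumerate(zones):
        # The first `extra` zones absorb the remainder, one instance each.
        count = base + (1 if i < extra else 0)
        if count == 0:
            continue  # skip zones that would receive nothing
        requests.append(
            {
                "InstanceType": instance_type,
                "InstancePlatform": "Linux/UNIX",
                "AvailabilityZone": zone,
                "InstanceCount": count,
            }
        )
    return requests


# Hypothetical zone names; the thread only refers to AZs A, B and C.
requests = split_reservation(
    "c5.9xlarge", 15, ["us-east-1a", "us-east-1b", "us-east-1c"]
)
counts = [r["InstanceCount"] for r in requests]
```

Each resulting dictionary describes one per-AZ reservation, so a stock-out or IP exhaustion in a single zone affects only a third of the reserved capacity.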

atalman added a commit that referenced this issue Jul 23, 2024
… builds (#5498)

Trying to mitigate #5493
Should replace linux.12xlarge.ephemeral runners, which are often not available