Ephemeral runners are often not available - linux.12xlarge.ephemeral #5493
Comments
We can reserve capacity, but we can't guarantee that an instance we request will come from the reserved capacity. Say we reserve capacity for 10 instances of c5.2xlarge. AFAIK, the first 10 c5.2xlarge instances we create will use that capacity. If two runner types both use c5.2xlarge, it is first come, first served. Because non-ephemeral instances tend to live longer, they tend to end up holding all of the reserved capacity. A reservation is a nice approach, but maybe we should use a different instance type only for those runners so that they actually benefit from the reservation?
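The first-come, first-served behaviour described above can be illustrated with a toy model. This is only a sketch and does not call any AWS API: it assumes a reservation that is matched purely by instance type, and the names `Instance`, `launch`, and `RESERVED` are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    ephemeral: bool
    uses_reservation: bool = False


def launch(fleet, name, ephemeral, reserved_slots):
    """Launch an instance; it takes a reserved slot only if one is still free."""
    used = sum(1 for inst in fleet if inst.uses_reservation)
    inst = Instance(name, ephemeral, uses_reservation=used < reserved_slots)
    fleet.append(inst)
    return inst


RESERVED = 10  # reservation size taken from the example in the comment
fleet = []

# Ten long-lived, non-ephemeral runners start first and fill the reservation.
for n in range(10):
    launch(fleet, f"persistent-{n}", ephemeral=False, reserved_slots=RESERVED)

# An ephemeral runner of the same instance type arrives afterwards.
late = launch(fleet, "ephemeral-0", ephemeral=True, reserved_slots=RESERVED)
print(late.uses_reservation)  # False: no reserved slot is left for it
```

In this model the ephemeral runner has to fall back to ordinary on-demand capacity, which is exactly where stock-outs show up. Giving the ephemeral runners their own instance type would give them their own pool of reserved slots.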
I believe we should eventually continue our work on adding multi-region support, so we can solve these stock-out problems.
Yes, I like the idea of having a different runner type. Maybe we can try linux.8xlarge.ephemeral?
OK, so I'll ask for a reservation of 15.
Creating this issue to address ephemeral runners not being available.
This has been an ongoing issue for Docker builds: https://github.com/pytorch/builder/actions/workflows/build-manywheel-images.yml
Problematic runner: linux.12xlarge.ephemeral
Workflow queueing:
https://github.com/pytorch/builder/actions/runs/10054503945/job/27789222269?pr=1928 (spent 6 hours in the queue)
https://github.com/pytorch/builder/actions/runs/10046348302/job/27807149533
Since ephemeral runners need to be used in the nightly and release pipelines, is it possible to reserve capacity so that it's always available?
Can we look into a possible EC2 capacity reservation for these runners:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html
cc @jeanschmidt @malfet @seemethere
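
For reference, a rough sketch of what such a reservation request could look like with boto3's `create_capacity_reservation`. The instance type `c5.12xlarge`, the availability zone `us-east-1a`, and the count of 15 are placeholders only; the actual instance type behind linux.12xlarge.ephemeral is not stated in this issue. The helper names `reservation_request` and `create_reservation` are hypothetical. The sketch uses `InstanceMatchCriteria="targeted"`, which means only instances that explicitly target the reservation at launch consume it.

```python
def reservation_request(instance_type, count, availability_zone):
    """Build keyword arguments for EC2 create_capacity_reservation."""
    return {
        "InstanceType": instance_type,
        "InstancePlatform": "Linux/UNIX",
        "AvailabilityZone": availability_zone,
        "InstanceCount": count,
        "EndDateType": "unlimited",
        # "targeted": only launches that name this reservation may use it.
        "InstanceMatchCriteria": "targeted",
    }


def create_reservation(request, region):
    """Send the request to EC2. Needs boto3 and AWS credentials; not called here."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_capacity_reservation(**request)


# Placeholder values, not the real runner configuration.
request = reservation_request("c5.12xlarge", 15, "us-east-1a")
print(request["InstanceMatchCriteria"])  # targeted
```

Whether a targeted reservation fits the way these runners are launched would need to be checked against the runner scale-up configuration.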