Describe the feature you'd like
I would love to be able to make use of the SIGTERM handling provided by modern ML frameworks such as pytorch_lightning. If I understand correctly, when a spot interruption is announced, the container receives a SIGTERM and has 120 seconds before it is forcefully terminated. I would like that signal to be passed down to the entry point, so that the SIGTERM-handling callbacks provided by those frameworks can actually run.
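To illustrate what I mean by "passed down": here is a minimal sketch (not the toolkit's actual code; `train.py` is just a placeholder entry point) of a launcher that forwards SIGTERM to its child process instead of swallowing it:

```python
import signal
import subprocess
import sys

def main():
    # Launch the user's entry point as a child process
    # ("train.py" is only a placeholder here).
    child = subprocess.Popen([sys.executable, "train.py"])

    def forward_sigterm(signum, frame):
        # Pass the spot-interruption SIGTERM down to the entry point,
        # whose own handlers (e.g. framework callbacks) then get to run.
        child.send_signal(signal.SIGTERM)

    signal.signal(signal.SIGTERM, forward_sigterm)
    sys.exit(child.wait())

if __name__ == "__main__":
    main()
```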
How would this feature be used? Please describe.
The 120 seconds could be used to write out a final checkpoint and to end the run gracefully in the experiment tracker.
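For illustration, this is roughly what a training script could do with the forwarded signal. It is a plain-`signal` sketch of what framework callbacks such as pytorch_lightning's do internally, not their actual API; the training step and checkpoint writing are stand-ins:

```python
import signal
import time

stop_requested = False

def handle_sigterm(signum, frame):
    # Only set a flag here; do the real work in the training loop.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

for step in range(1_000_000):
    time.sleep(0.1)  # stand-in for one training step
    if stop_requested:
        print(f"SIGTERM received, writing final checkpoint at step {step}")
        # ...write the checkpoint and close the tracked run gracefully here,
        # e.g. mlflow.end_run(status="KILLED")...
        break
```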
Describe alternatives you've considered
One can simply forgo the last 120 seconds, restart from the last checkpoint the model wrote out, and accept that spot-interrupted runs are marked as "failed" in the MLflow experiment tracker.
Additional context
While getting to the bottom of this problem, I created a small proof of concept of the changes that would be necessary to make this work in my specific case (i.e., handling SIGTERM in a shell-script entry point, which could then pass it down to a pytorch_lightning training script). See here for an example: https://github.com/croth1/sagemaker-toolkit-sigterm-handling
Only a few changes are necessary to make this work in my specific case, see: https://github.com/aws/sagemaker-training-toolkit/compare/master...croth1:sigterm_forwarding?expand=1. However, this is just a proof of concept: there are many paths in the codebase that eventually lead to entry-point execution, and this fixes only the one I used.
I hope this is of interest :)