Describe the bug
I ran a training job as part of a SageMaker pipeline. The model wrote checkpoints by default, and after epoch 2 of 10 disk utilisation reached 100%.
Despite the abnormal exit from the training script, the training job, and hence the pipeline step, was reported as successful.
To reproduce
I used the HuggingFace estimator with the following parameters:
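The exact parameters were not captured in this report, but a minimal sketch of the kind of setup involved might look like the following (all values here are hypothetical placeholders, not the original configuration; `checkpoint_s3_uri` is what enables the checkpointing that filled the disk):

```python
from sagemaker.huggingface import HuggingFace

# Hypothetical reconstruction -- entry point, versions, and URIs
# are illustrative placeholders, not the reporter's actual values.
estimator = HuggingFace(
    entry_point="train.py",
    source_dir="src",                      # contains requirements.txt
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 10},
    # Enabling this makes SageMaker sync /opt/ml/checkpoints to S3;
    # frequent checkpoints can exhaust the instance volume.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",
)
```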
The model is a sentence-transformers model (installed via requirements.txt). I inadvertently enabled checkpointing, hence the out-of-disk issue. CloudWatch logs indicate abnormal termination, i.e.
The training job charts show disk utilisation hitting 100%.
But the training job status is "Completed"; the abnormal termination wasn't detected.
Expected behavior
SageMaker pipeline steps shouldn't report success unless the training script terminated normally (i.e. exited with code 0).
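As a workaround until abnormal exits are surfaced, the training script itself can fail fast with a nonzero exit code before a checkpoint write fills the volume. A minimal sketch (the threshold and the `/opt/ml/checkpoints` path are assumptions; the path is SageMaker's default local checkpoint directory):

```python
import shutil
import sys

def ensure_free_space(path, min_free_bytes=1 * 1024**3):
    """Abort with a nonzero exit code if the filesystem holding `path`
    has less than `min_free_bytes` free, so the job is marked failed
    instead of silently dying with a full disk."""
    usage = shutil.disk_usage(path)
    if usage.free < min_free_bytes:
        sys.exit(f"Out of disk on {path}: only {usage.free} bytes free")
    return usage.free

# Hypothetical usage inside the training loop, before each checkpoint:
# ensure_free_space("/opt/ml/checkpoints")
```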