This repository has been archived by the owner on Sep 19, 2022. It is now read-only.
Issues: kubeflow/pytorch-operator
run https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/test/test_e2e.py failed
#363
opened Nov 19, 2021 by
sxl1993
Service label mismatches selector, which results in inconsistency
kind/bug
#360
opened Nov 5, 2021 by
konnase
The training hangs after reloading one of master/worker pods
area/engprod
kind/question
#359
opened Oct 28, 2021 by
dmitsf
Can I freeze pytorchjob training pods and migrate them to other nodes?
#356
opened Sep 22, 2021 by
Shuai-Xie
Pytorch version may have an effect on the training reproduction
#355
opened Sep 21, 2021 by
Shuai-Xie
container "pytorch" is waiting to start: PodInitializing
kind/bug
#348
opened Aug 15, 2021 by
gogogwwb
PytorchJob replicas have different node affinity behaviors compared with Deployment
#344
opened Jul 21, 2021 by
Shuai-Xie
'host not found' error occurs during PyTorch distributed learning
kind/feature
#333
opened Apr 30, 2021 by
JGoo1
Operator has invalid memory address error on specific pytorchjob spec
#321
opened Feb 22, 2021 by
ca-scribner
Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator
kind/bug
#319
opened Feb 11, 2021 by
asahalyft