Add BERT e2e training test #467
base: main
Conversation
…erfile for the e2e BERT training task
…erfile for the e2e BERT training task
…or MPI, NCCL, and EFA.
…ce to be consistent with the other test images
resources:
  requests:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 0
  limits:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 0
I think we shouldn't hard-code this, since we might need to run it in different node configurations (e.g., node type, node count).
See the following for a reference on how to avoid hardcoding it:
https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/main_test.go#L98-L144
https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml
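The linked reference renders the manifest from a template instead of hardcoding resource counts. As a rough illustration of the idea (in Python rather than the repo's Go, and with a hypothetical template and helper names), the GPU/EFA counts become substitution parameters filled in at test time:

```python
from string import Template

# Hypothetical manifest template: the resource counts are placeholders
# instead of hardcoded values, mirroring the templating approach used in
# the linked aws-k8s-tester test case.
MANIFEST_TEMPLATE = Template("""\
resources:
  requests:
    nvidia.com/gpu: $gpu_per_node
    vpc.amazonaws.com/efa: $efa_per_node
  limits:
    nvidia.com/gpu: $gpu_per_node
    vpc.amazonaws.com/efa: $efa_per_node
""")

def render_manifest(gpu_per_node: int, efa_per_node: int) -> str:
    """Fill in the node-specific resource counts for the current cluster."""
    return MANIFEST_TEMPLATE.substitute(
        gpu_per_node=gpu_per_node, efa_per_node=efa_per_node
    )

print(render_manifest(8, 0))
```

The same manifest then works unchanged across instance types; only the values passed to the renderer differ.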
Sure, we can parameterize it for future-proofing. Right now all tests will be run on an instance with 8 NVIDIA GPUs, but I have no problem with this. Will make the update.
	return ctx
}).
Teardown(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
	// Delete the manifest
If we can print out the logs before deleting the resources, it will help us troubleshoot test failures better.
Reference: https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/mpi_test.go#L128-L135
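The referenced mpi_test.go fetches pod logs before cleanup. As a rough sketch of the same idea (in Python for illustration; the namespace and label selector are hypothetical and would depend on the actual MPIJob manifest), the teardown could dump logs before deleting the manifest:

```python
import shutil
import subprocess

def build_logs_cmd(namespace: str, selector: str) -> list[str]:
    # Hypothetical helper: namespace and label selector depend on the
    # actual manifest used by the BERT training test.
    return ["kubectl", "logs", "-n", namespace, "-l", selector, "--tail=-1"]

def dump_pod_logs(namespace: str, selector: str) -> None:
    # Print pod logs before teardown deletes the resources, so a failed
    # run still leaves something to debug.
    cmd = build_logs_cmd(namespace, selector)
    if shutil.which("kubectl") is None:
        print("kubectl not found; skipping:", " ".join(cmd))
        return
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)

dump_pod_logs("default", "app=bert-training")
```

In the actual Go test this would more likely use the Kubernetes client from the test framework, as the referenced code does, rather than shelling out.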
@@ -85,6 +86,8 @@ def train_bert(rank, world_size, local_rank, model, tokenizer):

    start_time = time.time()

    print(f"starting training for rank: {rank}")

    for epoch in range(1):  # Short run for testing
        ddp_model.train()
        for batch in train_dataloader:
The test is broken:
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/app/train.py", line 138, in <module>
[1,0]<stderr>: main()
[1,0]<stderr>: File "/app/train.py", line 123, in main
[1,0]<stderr>: num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"])
[1,0]<stderr>: File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,0]<stderr>: raise KeyError(key) from None
[1,0]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "/app/train.py", line 138, in <module>
[1,1]<stderr>: main()
[1,1]<stderr>: File "/app/train.py", line 123, in main
[1,1]<stderr>: num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"])
[1,1]<stderr>: File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,1]<stderr>: raise KeyError(key) from None
[1,1]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
[1,2]<stderr>:Traceback (most recent call last):
[1,2]<stderr>: File "/app/train.py", line 138, in <module>
[1,2]<stderr>: main()
[1,2]<stderr>: File "/app/train.py", line 123, in main
[1,2]<stderr>: num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"])
[1,2]<stderr>: File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,2]<stderr>: raise KeyError(key) from None
[1,2]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
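The traceback shows train.py reading the variable with `os.environ["NUM_GPUS_PER_NODE"]`, which raises a bare KeyError whenever the manifest fails to export it. One option (a sketch, not necessarily the PR's actual fix; the default of 8 matches the manifest above but is an assumption) is to read it defensively:

```python
import os

def gpus_per_node(default: int = 8) -> int:
    # Fall back to a default (assumed to be 8, matching the manifest's
    # nvidia.com/gpu request) instead of crashing with a KeyError when
    # the environment variable is missing.
    value = os.environ.get("NUM_GPUS_PER_NODE")
    if value is None:
        print(f"NUM_GPUS_PER_NODE not set; defaulting to {default}")
        return default
    return int(value)

print(gpus_per_node())
```

Alternatively, keeping the hard failure but wrapping it in a clear error message would also make the missing-variable case obvious from the logs.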
Also, you forgot to add the test binary to Dockerfile.kubetest2.
Issue #, if available:
Description of changes:
This PR adds an E2E BERT training test. The test was validated on a four-node cluster of p3.16xlarge instances.
The results of running the training test are shown below; these logs were obtained from the master pod that coordinated the E2E BERT training job.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.