
Add support for testing with minimum supported Nvidia Drivers to release validations #5434

Open
atalman opened this issue Jul 16, 2024 · 3 comments


atalman commented Jul 16, 2024

To avoid issues like pytorch/pytorch#130684, I would like to add support for testing with the minimum supported Nvidia drivers to release validations.

  1. Add an nvidia-driver parameter to linux_job.yml:
    https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job.yml

  2. Make sure we pass this parameter to:
    https://github.com/pytorch/test-infra/blob/main/.github/actions/setup-nvidia/action.yml

  3. Add an option to validate release binaries with the minimum supported driver to https://github.com/pytorch/builder/actions/workflows/validate-binaries.yml
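
Steps 1 and 2 above could look roughly like the following sketch of linux_job.yml. The `driver-version` input of setup-nvidia is documented in the action linked above; the `nvidia-driver` input name, default value, and runner label here are assumptions for illustration, not the actual workflow contents:

```yaml
# Hypothetical sketch of the proposed change to linux_job.yml.
# Only setup-nvidia's driver-version input is confirmed to exist;
# the nvidia-driver input name and runner label are assumptions.
on:
  workflow_call:
    inputs:
      nvidia-driver:
        description: "Nvidia driver version to install (e.g. the minimum supported version)"
        required: false
        type: string
        default: ""

jobs:
  test:
    runs-on: linux.4xlarge.nvidia.gpu  # assumed GPU runner label
    steps:
      - name: Setup Nvidia driver
        uses: pytorch/test-infra/.github/actions/setup-nvidia@main
        with:
          driver-version: ${{ inputs.nvidia-driver }}
```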


malfet commented Jul 16, 2024

An easier way to accomplish that would probably be to use a different AMI with the driver we want.


atalman commented Jul 16, 2024

@malfet I'm not sure. Building an AMI and managing it is quite a big headache, while this should be straightforward.

The setup-nvidia action already supports driver-version as a parameter:
https://github.com/pytorch/test-infra/blob/main/.github/actions/setup-nvidia/action.yml#L6

Hence, all we have to do is pass it through.
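
Concretely, a release-validation job could then pin the minimum driver when calling the reusable workflow. This is a hypothetical caller sketch: the `nvidia-driver` input, the `runner` input name, and the driver version shown are illustrative assumptions:

```yaml
# Hypothetical caller sketch pinning the minimum supported driver
# via the proposed nvidia-driver input to linux_job.yml.
jobs:
  validate-min-driver:
    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
    with:
      runner: linux.4xlarge.nvidia.gpu   # assumed input name and runner label
      nvidia-driver: "525.105.17"        # example version, not authoritative
```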


ptrblck commented Jul 16, 2024

Adding more driver tests sounds generally like a valid idea.
However, I don't think pytorch/pytorch#130684 is the best motivation for it, as it's unclear to me whether we even support PyTorch + CUDA 11.8 + Triton; see: pytorch/pytorch#106144 (comment)

From the linked issue:

"Yes, triton always uses cuda-12"

It's somewhat hard to test something like that in CI, as runners are provisioned with the latest kernel driver in order to be usable with both CUDA 12 and CUDA 11.8. Also, older drivers are less stable, so we ran into multiple hangs/segfaults that were mitigated by installing a newer driver.

I would not want to add the risk of using older drivers (unless these are additional tests) to test a potentially invalid or unsupported PyTorch + Triton combination. Or is Triton now fully supported in our CUDA 11.8 builds?
