chore: [NCCL] reorg and better error messages #3338

narendasan · 2024-12-25T00:18:26Z

Description

Pre-commit was clearly not run, so this PR contains assorted fixes and a reorg. It also leaves some action items (AIs) for @apbose regarding custom ops and some hardcoded special cases.

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Checklist:

My code follows the style guidelines of this project (You can use the linters)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR so that relevant reviewers are notified

Signed-off-by: Naren Dasan <[email protected]>

apbose · 2024-12-25T01:43:07Z

py/torch_tensorrt/dynamo/conversion/impl/select.py

@@ -25,6 +23,8 @@
 )
 from torch_tensorrt.fx.types import TRTTensor

+import tensorrt as trt
+


Why is this change here? Is this a pre-commit change?

apbose · 2024-12-25T01:44:59Z

py/torch_tensorrt/dynamo/lowering/passes/fuse_distributed_ops.py

@@ -12,16 +12,23 @@
 logger = logging.getLogger(__name__)


-def tensorrt_fused_nccl_all_gather_op(args0, args1, args2):
+# TODO: @apbose make these actual torch custom ops, should allow aot_export
+def tensorrt_fused_nccl_all_gather_op(


Ok noted in the TODOs. Just to get clarity, what is meant by allowing aot_export? What is the difference in the present behavior?

So that you don't need to use the autograd workflow and could potentially support torch.export flows.

apbose · 2024-12-25T01:45:30Z

pyproject.toml

 prerelease = "if-necessary-or-explicit"
-
+index-strategy = "unsafe-best-match"                                    # Needed for TRT-LLM



Why is this added? What is the significance?

This is for uv to install TensorRT-LLM.
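
To make the significance of the setting concrete, here is a minimal sketch of how the two uv index strategies differ. It is a simplified toy model and not uv's actual resolver: the index contents, the package name `example-pkg`, and the `resolve` helper are all hypothetical. As documented by uv, the default `first-index` strategy limits candidate versions to the first index that contains the package, while `unsafe-best-match` picks the best version from the combined candidates of all indexes.

```python
from typing import Optional

# Hypothetical index contents, listed in priority order.
# Each index maps a package name to the versions it hosts.
INDEXES = [
    {"example-pkg": ["0.15.0"]},            # first index: only an older build
    {"example-pkg": ["0.15.0", "0.16.0"]},  # second index: also has a newer build
]


def parse(version: str) -> tuple:
    """Turn '0.16.0' into (0, 16, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(indexes: list, package: str, minimum: str, strategy: str) -> Optional[str]:
    """Pick the highest version >= minimum under the given strategy (toy model)."""
    candidates: list = []
    if strategy == "first-index":
        # Only the first index that contains the package contributes candidates.
        for index in indexes:
            if package in index:
                candidates = index[package]
                break
    elif strategy == "unsafe-best-match":
        # Candidates from every index are pooled together.
        for index in indexes:
            candidates.extend(index.get(package, []))
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    compatible = [v for v in candidates if parse(v) >= parse(minimum)]
    return max(compatible, key=parse) if compatible else None


print(resolve(INDEXES, "example-pkg", "0.16.0", "first-index"))        # None
print(resolve(INDEXES, "example-pkg", "0.16.0", "unsafe-best-match"))  # 0.16.0
```

In this model, a requirement that can only be satisfied from a later index fails under `first-index` but resolves under `unsafe-best-match`. The "unsafe" in the name reflects that pooling indexes can expose an install to dependency confusion, which is presumably why the setting carries an explanatory comment in pyproject.toml.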

apbose · 2024-12-25T01:50:18Z

I wanted to get clarity on "precommit was clearly not run". I generally run pre-commit run, push the modified changes, and cross-check that the CI lint check passes. Should I be mindful of anything else before pushing?
Also, what are the hardcoded changes?
I have some review comments/questions; otherwise the PR changes look good to me.

apbose · 2024-12-25T01:52:22Z

pyproject.toml

 monitoring-tools = ["rich>=13.7.1"]
 jupyter = ["rich[jupyter]>=13.7.1"]
+distributed = ["tensorrt-llm>=0.16.0"]



How do we ensure this goes through when a Torch-TensorRT container has a Python version other than 3.10? Won't this cause an issue here?

It's an extra, so it won't affect install.
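
Because an extra is only installed when it is requested explicitly, code that relies on it typically has to check for the dependency at runtime. The following is a minimal sketch of such a guard, not the code in this PR: the helper name `require_optional` is hypothetical, and the distribution name `torch-tensorrt` and module name `tensorrt_llm` are assumptions used for illustration.

```python
import importlib.util


def require_optional(module_name: str, extra: str, package: str = "torch-tensorrt") -> None:
    """Raise a descriptive ImportError if an optional top-level module is missing.

    module_name: importable top-level module name, e.g. "tensorrt_llm" (assumed).
    extra:       name of the extra that provides it, e.g. "distributed".
    package:     distribution name used in the install hint (assumed).
    """
    # find_spec returns None when a top-level module cannot be found,
    # without actually importing it.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f'Install it with: pip install "{package}[{extra}]"'
        )


# Usage: the base install is unaffected; only the distributed code path checks.
try:
    require_optional("tensorrt_llm", extra="distributed")
except ImportError as err:
    print(err)
```

With this pattern, a base install on any supported Python version never touches the optional dependency, and users who reach the distributed path without it get an actionable message instead of a bare import failure.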

chore: reorg and better error messages

e532c05

Signed-off-by: Naren Dasan <[email protected]>

facebook-github-bot added the cla signed label Dec 25, 2024

github-actions bot requested a review from zewenli98 December 25, 2024 00:18

narendasan changed the title ~~chore: reorg and better error messages~~ chore: [NCCL] reorg and better error messages Dec 25, 2024

apbose reviewed Dec 25, 2024

apbose approved these changes Dec 25, 2024

apbose reviewed Dec 25, 2024
