chore: [NCCL] reorg and better error messages #3338
base: nccl_ops_multi_gpu
Conversation
Signed-off-by: Naren Dasan <[email protected]>
@@ -25,6 +23,8 @@
 )
 from torch_tensorrt.fx.types import TRTTensor
+
+import tensorrt as trt
Why is this change here? Is this a pre-commit change?
Precommit
@@ -12,16 +12,23 @@
 logger = logging.getLogger(__name__)


-def tensorrt_fused_nccl_all_gather_op(args0, args1, args2):
+# TODO: @apbose make these actual torch custom ops, should allow aot_export
+def tensorrt_fused_nccl_all_gather_op(
Ok, noted in the TODOs. Just to get clarity: what is meant by allowing aot_export? How does it differ from the present behavior?
So that you don't need to use the autograd workflow and could potentially support torch.export flows.
prerelease = "if-necessary-or-explicit"

index-strategy = "unsafe-best-match"  # Needed for TRT-LLM
Why is this added? What is the significance?
This is for uv to be able to install TensorRT-LLM.
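For context, a sketch of how these uv settings might sit together in `pyproject.toml`. The `[tool.uv]` table placement and the comments are assumptions based on the diff above, not copied from the PR:

```toml
[tool.uv]
# Allow pre-release versions only when required or explicitly requested
prerelease = "if-necessary-or-explicit"
# tensorrt-llm lives on a separate index, so let uv pick the best match
# across all configured indexes instead of stopping at the first index
index-strategy = "unsafe-best-match"  # Needed for TRT-LLM
```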
Wanted to get clarity on:
monitoring-tools = ["rich>=13.7.1"]
jupyter = ["rich[jupyter]>=13.7.1"]
distributed = ["tensorrt-llm>=0.16.0"]
How do we ensure this goes through when we have a Torch-TensorRT container with a Python version other than 3.10? Won't this cause an issue here?
It's an extra, so it won't affect the default install.
Description
Pre-commit was clearly not run; assorted fixes and reorg. Some action items for @apbose re: custom ops and some hardcoded special cases.
Fixes # (issue)
Type of change
Please delete options that are not relevant and/or add your own.
Checklist: