Accelerating Performance in Dynamo + Torch-TRT

TL;DR
torch.compile and torch._dynamo.export are two promising new utilities for model compilation that are under active development for integration into the Torch-TRT framework. This is a discussion of methods and suggestions to further accelerate inference along these paths, with a focus on torch.compile.
Goal(s)
The objective of this discussion is to highlight current shortcomings of the torch.compile backend and related issues with the torch._dynamo.export path, and suggest ideas for improved acceleration of these frameworks. Currently, the performance pitfalls of torch.compile originate from three sources: Module-Level Acceleration, Converter Coverage, and Control Flow.
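For reference, the sketch below shows how the two paths are typically invoked. The "torch_tensorrt" backend string is the name Torch-TRT registers with the dynamo backend registry, the toy model is purely illustrative, and the torch._dynamo.export call signature has varied across PyTorch releases.

```python
import torch
import torch_tensorrt  # registers the "torch_tensorrt" dynamo backend

# Illustrative toy model standing in for a real workload
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().cuda()
inputs = (torch.randn(1, 3, 224, 224, device="cuda"),)

# JIT path: Torch-TRT as a torch.compile backend
compiled = torch.compile(model, backend="torch_tensorrt")
compiled(*inputs)

# Export path: capture an aten-level graph ahead of time
# (the export signature has varied across PyTorch releases)
gm, guards = torch._dynamo.export(model, *inputs)
```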
1. Module-Level Acceleration
Problem Context
Module-level acceleration refers to translating high-level modules, such as Attention, directly into their accelerated counterparts, which is more effective for performance than breaking such modules up into their smaller components. Currently, in both the torch.compile and torch._dynamo.export paths, large modules such as Attention are subdivided into component subgraphs composed of aten operators. Ideally, such modules would instead be replaced directly with their accelerated counterparts.
Proposed Solution
The solution here could take one of two paths. First, one could intercept the Attention module prior to lowering and replace it automatically with its accelerated counterpart. Alternatively, one could lower the Attention module to its aten components and then use subgraph matching to detect the pattern of calls that corresponds to an Attention module. The latter is much easier to implement, since there is no known way to intercept modules before lowering in the current torch.compile framework, but the former would be the cleaner solution. Options for subgraph-matching utilities include the Torch FX subgraph rewriter and the Inductor pattern matcher.
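As an illustration of the subgraph-matching path, the sketch below uses the Torch FX subgraph rewriter (torch.fx.replace_pattern) to swap a naive attention computation for torch.nn.functional.scaled_dot_product_attention. The pattern is deliberately simplified and omits the 1/sqrt(d) scaling for brevity; a faithful pattern would mirror the exact aten sequence the lowering actually produces.

```python
import torch
import torch.fx as fx

# Simplified pattern: the ops a naive attention block might lower to
# (real lowered graphs vary and include the 1/sqrt(d) scaling).
def attention_pattern(q, k, v):
    scores = torch.matmul(q, k.transpose(-2, -1))
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

# Replacement: a single fused, accelerated operator.
def attention_replacement(q, k, v):
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

class NaiveAttention(torch.nn.Module):
    def forward(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1))
        probs = torch.softmax(scores, dim=-1)
        return torch.matmul(probs, v)

gm = fx.symbolic_trace(NaiveAttention())
matches = fx.replace_pattern(gm, attention_pattern, attention_replacement)
print(f"Replaced {len(matches)} attention subgraph(s)")
```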
2. Converter Coverage
Problem Context
Converter coverage is a critical piece of both the torch.compile and torch._dynamo.export paths, as both rely on the aten converters. Improved converter coverage allows more operations in a given model to be accelerated and reduces the graph segmentation caused by partitioning.
Proposed Solution
Currently, we are working to implement the aten converters that are key for certain critical models. This effort should go hand in hand with effective lowering passes, which can both reduce the number of converter implementations needed and improve code performance and readability. An alternative to keep in mind is Prims IR, a lower-level, more restricted version of the aten operator set. The potential utility of Prims IR is that we could implement converters for the entire set of prim operators and thereby support many more models. The drawback is that these decompositions reach a much lower level, so the optimizations we can make are much more limited.
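For concreteness, below is a minimal sketch of what registering an aten converter can look like, using the decorator-style registry from the FX path (torch_tensorrt.fx.converter_registry.tensorrt_converter); the exact registry location and converter signature vary across Torch-TRT versions.

```python
import tensorrt as trt
import torch
from torch_tensorrt.fx.converter_registry import tensorrt_converter

# Sketch: map aten.relu onto a TensorRT activation layer. The
# (network, target, args, kwargs, name) signature follows the FX
# converter registry; args[0] is the ITensor from the upstream op.
@tensorrt_converter(torch.ops.aten.relu.default)
def aten_relu_converter(network, target, args, kwargs, name):
    layer = network.add_activation(args[0], trt.ActivationType.RELU)
    layer.name = name
    return layer.get_output(0)
```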
3. Control Flow
Problem Context
One of the most promising aspects of the torch.compile path is its ability to handle control flow automatically, splitting the graph into subgraphs at control flow branches. This is also one of the drawbacks of the method, as excessive control flow produces many graph breaks, which can deteriorate performance. The torch._dynamo.export path also provides restricted support for control flow when the model uses the experimental Torch conditional operators.
Proposed Solution
On this topic, the solution is more of a trade-off. In the torch.compile path, taking a new branch at a control flow decision triggers recompilation, but no changes to the model code are needed. In the torch._dynamo.export path, the resulting model will not need recompilation, but substantial model rewriting is required to support control flow within the model.
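As an illustration of that rewrite, the sketch below replaces data-dependent Python branching with the experimental conditional operator. At the time of writing the operator lives in functorch.experimental.control_flow (newer releases expose it as torch.cond), and the torch._dynamo.export signature has varied across releases.

```python
import torch
from functorch.experimental.control_flow import cond  # newer releases: torch.cond

def true_fn(x):
    return torch.sin(x)

def false_fn(x):
    return torch.cos(x)

class Branchy(torch.nn.Module):
    def forward(self, x):
        # Instead of `if x.sum() > 0: ...`, which export cannot capture,
        # route the branch through the conditional operator.
        return cond(x.sum() > 0, true_fn, false_fn, [x])

# Both branches are captured into a single graph, so no recompilation
# is needed when the predicate flips at runtime.
gm, guards = torch._dynamo.export(Branchy(), torch.randn(3))
print(gm.graph)
```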