
[Features] Add NMS Kernel support with Triton Implementation #8746

Open

Stonepia wants to merge 9 commits into main
Conversation

@Stonepia commented Nov 25, 2024

Motivation

This PR follows RFC #8679, which proposes adding Triton kernel implementations for torchvision custom ops.

Implementation Method

The Triton kernels largely follow the native CUDA kernels. As shown below, each native CUDA kernel is mapped to a Triton kernel. Logic that cannot run in parallel is implemented in Python or with C++ (ATen) ops instead.

[Figure: Kernel Mapping, showing how the native CUDA kernels map to Triton kernels]

This PR contains the following parts:

  1. Kernel implementation: mostly in the torchvision/ops/triton/ folder. This contains the common logic that can be expressed in Triton.
  2. Op registration: in torchvision/ops/xpu. This registers the op and combines the non-Triton pieces with the Triton kernel into one op (a registration sketch follows this list).
  3. Tests: the same as the existing tests.
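
For illustration, here is a minimal sketch of what the registration in part 2 could look like, assuming the existing torchvision::nms schema and PyTorch's torch.library.register_kernel API (available in recent PyTorch). The placeholder body simply defers to the reference CPU kernel; it is not the PR's implementation, which combines the ATen pre/post-processing with the Triton IoU kernel.

import torch
import torchvision  # loads the torchvision::nms op schema


# Hypothetical registration sketch: bind a Python kernel to the XPU backend.
@torch.library.register_kernel("torchvision::nms", "xpu")
def _nms_xpu(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # Placeholder body: run the reference CPU kernel and move the result back.
    keep = torch.ops.torchvision.nms(boxes.cpu(), scores.cpu(), iou_threshold)
    return keep.to(boxes.device)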

Kernel Implementation Structure

The NMS op consists of three parts; see torchvision/ops/xpu/nms.py for details. A runnable sketch of the overall structure follows this list:

  1. Pre-processing: some logic, such as argsort, is handled with PyTorch ATen ops.
  2. Triton kernel: computes the IoU mask matrix. This lives in torchvision/ops/triton/nms.py and is device-agnostic, so it can be shared across devices.
  3. Post-processing: a serialized step with data dependencies; there is no benefit to implementing it in Triton, so it falls back to the ATen implementation.
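
To make this structure concrete, here is a minimal, self-contained sketch of the three parts. It is not the PR's code: the Triton IoU kernel (part 2) is replaced by a plain PyTorch IoU computation so the snippet runs on its own, and the mask is kept as an unpacked boolean [N, N] matrix rather than the packed bit mask described in the next section.

import torch
from torchvision.ops import box_iou


def nms_sketch(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # 1. Pre-processing (ATen ops): sort boxes by descending score.
    order = scores.argsort(descending=True)
    boxes = boxes[order]
    n = boxes.shape[0]

    # 2. Mask computation: the PR does this in a Triton kernel
    #    (torchvision/ops/triton/nms.py); a plain IoU matrix stands in here.
    iou_keep_out_mask = (box_iou(boxes, boxes) > iou_threshold).cpu()

    # 3. Post-processing (serialized, on CPU): greedily keep boxes and
    #    suppress later boxes that overlap an already-kept box.
    order = order.cpu()
    remove_box = torch.zeros(n, dtype=torch.bool)
    picked = []
    for i in range(n):
        if remove_box[i]:
            continue
        picked.append(order[i].item())
        remove_box[i:] |= iou_keep_out_mask[i][i:]
    return torch.as_tensor(picked, dtype=torch.int64, device=scores.device)

Called as nms_sketch(boxes, scores, 0.5), this returns the indices of the kept boxes in descending score order, matching the contract of torchvision.ops.nms.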

Kernel Implementation Details

  1. The Triton kernel computes a mask matrix for the input boxes based on their intersection-over-union (IoU) scores. The output indicates, for each pair of boxes, whether box j should be suppressed once box i has been chosen. A naive implementation would produce an [N, N] matrix; for performance, the kernel packs the bit mask into 32-bit integers, so the output has shape [N, N // 32]. A simplified Triton sketch follows this list.
  2. After the mask matrix is computed, a serialized post-processing function iterates over its rows. Choosing row i means choosing box i, which in turn excludes some boxes j; applying that exclusion is what the post-processing function does. To keep it device-agnostic, this serialized step runs on the CPU.
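
A simplified Triton sketch of the bit-mask kernel (item 1) and its host-side launcher is shown below. This is not the PR's kernel in torchvision/ops/triton/nms.py; it assumes boxes in (x1, y1, x2, y2) layout, contiguous float32 storage, and a fixed 32-lane block so each program instance writes exactly one packed int32.

import torch
import triton
import triton.language as tl


@triton.jit
def _iou_bitmask_kernel(boxes_ptr, mask_ptr, n_boxes, iou_threshold,
                        BLOCK_SIZE: tl.constexpr):
    # Program (row, block) compares box `row` against the 32 boxes of column
    # block `block` and packs the 32 comparison bits into one int32.
    row = tl.program_id(0)
    block = tl.program_id(1)
    cols = block * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    valid = cols < n_boxes

    # Box `row` coordinates (x1, y1, x2, y2).
    x1_i = tl.load(boxes_ptr + row * 4 + 0)
    y1_i = tl.load(boxes_ptr + row * 4 + 1)
    x2_i = tl.load(boxes_ptr + row * 4 + 2)
    y2_i = tl.load(boxes_ptr + row * 4 + 3)

    # A block of candidate boxes j.
    x1_j = tl.load(boxes_ptr + cols * 4 + 0, mask=valid, other=0.0)
    y1_j = tl.load(boxes_ptr + cols * 4 + 1, mask=valid, other=0.0)
    x2_j = tl.load(boxes_ptr + cols * 4 + 2, mask=valid, other=0.0)
    y2_j = tl.load(boxes_ptr + cols * 4 + 3, mask=valid, other=0.0)

    # IoU of box `row` with each box j in the block.
    inter_w = tl.maximum(tl.minimum(x2_i, x2_j) - tl.maximum(x1_i, x1_j), 0.0)
    inter_h = tl.maximum(tl.minimum(y2_i, y2_j) - tl.maximum(y1_i, y1_j), 0.0)
    inter = inter_w * inter_h
    area_i = (x2_i - x1_i) * (y2_i - y1_i)
    area_j = (x2_j - x1_j) * (y2_j - y1_j)
    iou = inter / (area_i + area_j - inter)

    # Pack the "IoU above threshold" bits (skipping the self-comparison)
    # into a single int32 covering 32 columns.
    keep_out = valid & (cols != row) & (iou > iou_threshold)
    bits = tl.where(keep_out, 1, 0).to(tl.int32)
    packed = tl.sum(bits << (cols % BLOCK_SIZE), axis=0)
    n_col_blocks = tl.cdiv(n_boxes, BLOCK_SIZE)
    tl.store(mask_ptr + row * n_col_blocks + block, packed)


def iou_bitmask(boxes: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # Host-side launcher: the output has shape [N, ceil(N / 32)], i.e. the
    # packed [N, N // 32] layout described above.
    n = boxes.shape[0]
    n_col_blocks = triton.cdiv(n, 32)
    mask = torch.zeros((n, n_col_blocks), dtype=torch.int32, device=boxes.device)
    _iou_bitmask_kernel[(n, n_col_blocks)](
        boxes.contiguous(), mask, n, iou_threshold, BLOCK_SIZE=32
    )
    return mask

Before the serialized CPU scan in item 2, the packed mask can be unpacked back to a boolean [N, N] matrix after moving it to the host, for example with ((mask.cpu().unsqueeze(-1) >> torch.arange(32)) & 1).reshape(n, -1)[:, :n].bool().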

cc: @EikanWang

pytorch-bot bot commented Nov 25, 2024

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8746

picked.append(order[i])
remove_box[i:] |= iou_keep_out_mask[i][i:]

return torch.as_tensor(picked)


Should this also respect the device of the boxes? (remove_box is allocated on boxes.device, while the return value is always on CPU.)

Author

Thanks for the reminder! Yes, this should be on boxes.device. I will update it.

@Stonepia (Author)

Also attaching a performance comparison with the native CUDA implementation on an A100:

[Plot: nms_performance, showing throughput (GB/s) vs. input size for the Triton, Torch Compile, Native CUDA, and Python implementations]

From this plot, the Triton implementation reaches performance competitive with native CUDA. The torch.compile path reaches the peak because it reduces the Python overhead and kernel launch overhead.

One interesting finding, however, is that at very large sizes all of the implementations show a large performance drop. This likely hits a memory bottleneck; optimizing it could be the next step.

import torch
import torchvision
import triton
import triton.testing

# `threshold`, `_create_tensors_with_iou`, `_reference_nms`, and
# `custom_nms_triton_kernel` are defined elsewhere in the benchmark script
# (the helper functions presumably mirror those in torchvision's test_ops.py).

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['size'],  # Argument names to use as an x-axis for the plot.
        x_vals=[2**i for i in range(2, 13, 1)],  # Different possible values for `x_name`.
        x_log=True,  # x axis is logarithmic.
        line_arg='provider',  # Argument name whose value corresponds to a different line in the plot.
        line_vals=['triton', 'torch_compile', 'native_cuda', 'python'],  # Possible values for `line_arg`.
        line_names=['Triton', 'Torch Compile', 'Native CUDA', 'Python Op'],  # Label name for the lines.
        styles=[('blue', '-'), ('pink', '-'), ('green', '-'), ('yellow', '-')],  # Line styles.
        ylabel='GB/s',  # Label name for the y-axis.
        plot_name='nms_performance',  # Name for the plot. Used also as a file name for saving the plot.
        args={},  # Values for function arguments not in `x_names` and `y_name`.
    ))
def benchmark(size, provider):
    boxes, scores = _create_tensors_with_iou(size, threshold)
    quantiles = [0.5, 0.2, 0.8]
    compiled_nms = torch.compile(custom_nms_triton_kernel)
    if provider == 'native_cuda':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: torchvision.ops.nms(boxes, scores, threshold), quantiles=quantiles)
    if provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: custom_nms_triton_kernel(boxes, scores, threshold), quantiles=quantiles)
    if provider == 'python':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: _reference_nms(boxes, scores, threshold), quantiles=quantiles)
    if provider == 'torch_compile':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: compiled_nms(boxes, scores, threshold), quantiles=quantiles)
    gbps = lambda ms: boxes.numel() * boxes.element_size() * 1e-9 / (ms * 1e-3)  # bytes of box data / elapsed time -> GB/s
    return gbps(ms), gbps(max_ms), gbps(min_ms)

benchmark.run(print_data=True, show_plots=True, save_path='.')

@Stonepia marked this pull request as ready for review December 16, 2024 10:05
@Stonepia changed the title [Draft] [Features] Add NMS Kernel support with Triton Implementation [Features] Add NMS Kernel support with Triton Implementation Dec 16, 2024