[WIP] Implement approximate mode #440

scotts · 2024-12-20T03:29:55Z

[WIP]

Comprehensive implementation of approximate mode. Modifies the C++ and Python sides to match. Adds parameters to existing tests, but does not add any new tests.

I know that doing all of the changes in one go - C++, core ops, Python public interface - is a bear to review. But I wanted to quickly get to a state where we could compare performance. Also, because the approach I took makes the decoder have two different modes of operation, it requires changes everywhere.

Starting with the C++ VideoDecoder layer:

Adds a new enum called SeekMode:
- exact implies doing a full file scan, and we can use all of the prior algorithms for getting ranges of frames.
- approximate implies not doing a full file scan, and we have to rely on the average FPS to calculate indices. We also assume that the minimum pts in seconds is 0, and the maximum is the duration. If there is no duration or average FPS in the metadata, that's an error. (TODO: make sure the C++ throws on all of those conditions.)
The seek mode is passed in as a constructor parameter to VideoDecoder, and it applies to all streams. I considered trying to make it apply only when we add a stream, but because we want it to control whether or not we do a full file scan, I made it a property of the entire decoder. We could potentially make it apply per stream, and do a lazy scan if an exact mode stream is added.
The default seek mode is exact. I modified a bunch of our core ops tests to reflect this fact, as it is now redundant to request exact seek mode and manually do a full file scan.
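As a rough illustration of the decoder-wide design described above, here is a minimal Python sketch. `DecoderSketch`, its constructor arguments, and `_scan_file` are hypothetical stand-ins, not the actual C++ `VideoDecoder`. The point at which the missing-metadata error is raised is also an assumption, since the description leaves that as a TODO.

```python
from enum import Enum
from typing import Optional


class SeekMode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


class DecoderSketch:
    """Hypothetical stand-in for the decoder; not the real VideoDecoder."""

    def __init__(
        self,
        duration_seconds: Optional[float],
        average_fps: Optional[float],
        seek_mode: SeekMode = SeekMode.EXACT,
    ) -> None:
        self.seek_mode = seek_mode
        self.scanned = False
        if seek_mode is SeekMode.EXACT:
            # Exact mode: scan once, up front; the result serves every stream.
            self._scan_file()
        elif duration_seconds is None or average_fps is None:
            # Approximate mode has no scan results to fall back on.
            raise ValueError("approximate mode needs duration and average FPS")

    def _scan_file(self) -> None:
        # Placeholder for the full file scan that collects per-frame pts values.
        self.scanned = True
```

Because the mode is fixed at construction, no caller needs to request a scan separately, which is why requesting exact mode and also scanning manually becomes redundant.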
The pts-based member functions still convert the requested pts values to indices. I initially implemented different algorithms for approximate mode that used the actual pts values as is. That led to poor performance because we could not allocate our output tensor up front and could not take advantage of our dedupe logic. Since the point of approximate mode is performance, I switched to what we currently have: the helper functions are seek mode aware, but the main algorithms are not. However, this does mean that approximate mode may return inaccurate frames even in cases where we could, in principle, return exact ones. That is, if a user requests a batch of frames at pts values [x, y, z], we turn that into indices [a, b, c] based on the average FPS. If the average FPS is wrong, or if the video has a variable frame rate, that mapping may be wrong, even though decoding by pts directly could have returned the correct frames.
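The pts-to-index mapping and its failure mode can be sketched in Python as follows. The function names and the sample pts lists are hypothetical and are not the PR's C++ helpers. The approximate path assumes the stream begins at 0 seconds, as described above, and the rounding rule (floor) is an assumption.

```python
import bisect
import math


def seconds_to_index_approximate(pts_seconds: float, average_fps: float) -> int:
    """Approximate mode: no scan, so derive the index from the average FPS.

    Assumes the stream starts at 0 seconds.
    """
    return math.floor(pts_seconds * average_fps)


def seconds_to_index_exact(pts_seconds: float, frame_pts: list[float]) -> int:
    """Exact mode: find the last frame whose pts is <= pts_seconds.

    frame_pts stands in for the sorted per-frame pts values a full scan gathers.
    """
    return bisect.bisect_right(frame_pts, pts_seconds) - 1


# Constant frame rate: 5 evenly spaced frames over 1 second.
constant_pts = [0.0, 0.2, 0.4, 0.6, 0.8]
# Variable frame rate: also 5 frames over 1 second, but unevenly spaced.
variable_pts = [0.0, 0.1, 0.2, 0.3, 0.8]
average_fps = 5.0  # number of frames / duration, identical for both streams

# The two modes agree on the constant-rate stream...
agree = (
    seconds_to_index_exact(0.5, constant_pts),
    seconds_to_index_approximate(0.5, average_fps),
)
# ...but disagree on the variable-rate stream.
disagree = (
    seconds_to_index_exact(0.35, variable_pts),
    seconds_to_index_approximate(0.35, average_fps),
)
```

Working with indices is what lets the main algorithms size the output up front and deduplicate repeated requests, at the cost of the mismatch shown in `disagree`.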

In the Python public API, we add:

  seek_mode: Literal["exact", "approximate"] = "exact",

to the Python VideoDecoder constructor. Not much logic actually changes here.

We do have a lot of changes in the Python metadata to support the VideoDecoder:

begin_stream_seconds -> begin_stream_seconds_from_content: This value only ever came from the full scan. We need to be explicit about that now.
begin_stream_seconds is now a property that returns 0 if begin_stream_seconds_from_content is None.
end_stream_seconds -> end_stream_seconds_from_content: Same reasoning as above.
end_stream_seconds is now a property that returns duration_seconds if end_stream_seconds_from_content is None.

By changing the metadata in that manner, the Python VideoDecoder class remains mostly unchanged as the real logic is in resolving what metadata to use. Note that the metadata class has no knowledge of the seek mode; it just knows if some values are missing.
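The fallback behavior of the metadata can be sketched as below. `StreamMetadataSketch` is a hypothetical, trimmed-down class and not the real metadata type. Returning the scanned value when it is present is an assumption implied by the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamMetadataSketch:
    # Header-derived value; available without a scan (may still be missing).
    duration_seconds: Optional[float] = None
    # Only populated by a full file scan.
    begin_stream_seconds_from_content: Optional[float] = None
    end_stream_seconds_from_content: Optional[float] = None

    @property
    def begin_stream_seconds(self) -> float:
        # No scan result: assume the stream starts at 0.
        if self.begin_stream_seconds_from_content is None:
            return 0.0
        return self.begin_stream_seconds_from_content

    @property
    def end_stream_seconds(self) -> Optional[float]:
        # No scan result: fall back to the header duration.
        if self.end_stream_seconds_from_content is None:
            return self.duration_seconds
        return self.end_stream_seconds_from_content


# Without a scan, the properties fall back to 0 and the duration.
unscanned = StreamMetadataSketch(duration_seconds=120.0)
# With a scan, the values from content take precedence.
scanned = StreamMetadataSketch(
    duration_seconds=120.0,
    begin_stream_seconds_from_content=0.5,
    end_stream_seconds_from_content=119.5,
)
```

Note that nothing here inspects a seek mode: the class only reacts to which values are missing.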

On defaults, the current implementation has exact mode as the default everywhere. We could, however, decide to apply different defaults at each level. The levels at which we can apply defaults are:

During construction of the C++ VideoDecoder class. In some ways, the old behavior was approximate as default, since scanning the file did not automatically happen.
In the Python core API create_decoder_from family of functions. Similar to the C++, the old behavior was closer to approximate as default.
In the Python VideoDecoder constructor. The old behavior was definitely exact as we always scanned.

scotts · 2024-12-20T04:38:20Z

Initial performance numbers are encouraging. I generated a video with the following command:

ffmpeg -y -f lavfi -i mandelbrot=s=1920x1080 -t 120 -c:v h264 -r 60 -g 600 -pix_fmt yuv420p mandelbrot_1920x1080_120s.mp4

That produced a 141 MB video file. I then ran our standard benchmark with:

python benchmarks/decoders/benchmark_decoders.py --decoders torchcodec_public:seek_mode=exact,torchcodec_public:seek_mode=approximate,torchcodec_public_nonbatch:seek_mode=exact,torchcodec_public_nonbatch:seek_mode=approximate,decord,decord_batch,torchaudio,torchcodec_core_batch,torchcodec_core_nonbatch --min-run-seconds 40 --video-paths mandelbrot_1920x1080_120s.mp4

And that yields:

[-------------------------------------------------- video=mandelbrot_1920x1080_120s.mp4 h264 1920x1080, 120.0s 60.0fps -------------------------------------------------]
                                                      |  decode 10 uniform frames  |  decode 10 random frames  |  first 1 frames  |  first 10 frames  |  first 100 frames
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------
      TorchAudio                                      |           1590.5           |           2112.7          |       29.3       |        49.0       |       253.2      
      DecordAccurate                                  |           1798.6           |           2201.3          |       95.0       |       171.6       |       701.8      
      DecordAccurateBatch                             |           1756.8           |           2164.2          |       92.9       |       135.7       |       576.7      
      TorchCodecCoreBatch                             |           1605.8           |           1334.6          |       79.6       |        92.6       |       480.1      
      TorchCodecCoreNonBatch                          |           1587.3           |           1971.6          |       25.7       |        38.9       |       155.9      
      TorchCodecPublicNonBatch:seek_mode=exact        |           1620.3           |           2011.6          |       79.3       |        92.7       |       218.8      
      TorchCodecPublicNonBatch:seek_mode=approximate  |           1541.5           |           1981.8          |       26.4       |        39.5       |       166.5      
      TorchCodecPublic:seek_mode=exact                |           1591.5           |           1330.2          |       78.6       |        92.1       |       214.2      
      TorchCodecPublic:seek_mode=approximate          |           1562.3           |           1292.8          |       25.9       |        39.0       |       159.9      

Times are in milliseconds (ms).

Some explanations of what the options are:

TorchCodecCoreBatch: core API, uses exact seek mode and batch APIs.
TorchCodecCoreNonBatch: core API, uses approximate seek mode and non-batch APIs.
Public means using VideoDecoder.
NonBatch means using the single frame API, even when getting multiple frames.
Batch means using the batch API when getting multiple frames.
seek_mode= is changing the seek mode.

For the full matrix, we could change the seek modes for the core options. Approximate mode is basically meeting our performance expectations here.
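To put the exact versus approximate comparison in relative terms, the short script below computes the speedup per workload from the two TorchCodecPublic rows of the table. The numbers are copied from the table; the variable names are only illustrative.

```python
# Times in milliseconds, copied from the TorchCodecPublic rows of the table above.
columns = [
    "decode 10 uniform frames",
    "decode 10 random frames",
    "first 1 frames",
    "first 10 frames",
    "first 100 frames",
]
exact_ms = [1591.5, 1330.2, 78.6, 92.1, 214.2]
approximate_ms = [1562.3, 1292.8, 25.9, 39.0, 159.9]

# Speedup of approximate mode over exact mode (values > 1 mean approximate is faster).
speedup = {
    name: exact / approx
    for name, exact, approx in zip(columns, exact_ms, approximate_ms)
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

On these numbers the gain is concentrated in the "first N frames" workloads, roughly 3x for the first frame, while the 10-frame uniform and random workloads differ by only a few percent.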

scotts added 5 commits December 16, 2024 07:41

Start implementation of approximate mode

16d698f

Merge branch 'main' of github.com:pytorch/torchcodec into approx

9a5abce

Initial seek mode implementation in VideoDecoder.

d95b128

Merge branch 'main' of github.com:pytorch/torchcodec into approx

97ac764

Added Python side support, extended tests.

35f2e59

facebook-github-bot added the CLA Signed label Dec 20, 2024

scotts added 3 commits December 19, 2024 19:34

Apply lints

8c9aeac

Default C++ tests to approximate mode

921b822

Apply lints

b349282

scotts mentioned this pull request Dec 20, 2024

Approximate seeking mode #427

Open

Updated metadata; all tests pass.

081a5bb

scotts added 4 commits December 20, 2024 11:27

Removed commented out code.

802b881

Consolidated logic for timestamp batch. Big perf win.

911a3bc

Consolidated logic for timestamp range.

7267b5a

More mode consolidation.

ae44f78
