[WIP] Implement approximate mode #440

Draft: wants to merge 13 commits into base: main

@scotts scotts commented Dec 20, 2024

[WIP]

Comprehensive implementation of approximate mode. Modifies the C++ and Python sides to match. Adds parameters to existing tests, but does not add any new tests.

I know that doing all of the changes in one go - C++, core ops, Python public interface - is a bear to review. But I wanted to quickly get to a state where we could compare performance. Also, because the approach I took makes the decoder have two different modes of operation, it requires changes everywhere.

Starting with the C++ VideoDecoder layer:

  • Adds a new enum called SeekMode:
    • exact implies doing a full file scan, and we can use all of the prior algorithms for getting ranges of frames.
    • approximate implies not doing a full file scan, and we have to rely on the average FPS to calculate indices. We also assume that the minimum pts in seconds is 0, and the maximum is the duration. If there is no duration or average FPS in the metadata, that's an error. (TODO: make sure the C++ throws on all of those conditions.)
  • The seek mode is passed in as a constructor parameter to VideoDecoder, and it applies to all streams. I considered trying to make it apply only when we add a stream, but because we want it to control whether or not we do a full file scan, I made it a property of the entire decoder. We could potentially make it apply per stream, and do a lazy scan if an exact mode stream is added.
  • The default seek mode is exact. I modified a bunch of our core ops tests to reflect this fact, as it is now redundant to request exact seek mode and manually do a full file scan.
  • The pts-based member functions that converted the requested pts values to indices still do so. I initially implemented different algorithms for approximate mode where we used the actual pts values as is. That led to poor performance because we could not allocate our output tensor up front and we could not take advantage of our dedupe logic. Since the point of approximate mode is performance, I switched it to what we currently have: helper functions which are seek mode aware, but the main algorithms are unaware of seek mode. However, this does mean that in approximate mode, we may return inaccurate frames even in cases when in principle we could return exact frames. That is, if a user requests a batch of frames at pts values [x, y, z], we will turn that into indices [a, b, c] based on the average fps. If the average fps is wrong, or if the video has a variable frame rate, then that mapping may be wrong, even though in principle we could have returned the correct frames.
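
To make the pts-to-index mapping concrete, here is a minimal Python sketch of a seek-mode-aware helper. It is not the C++ code in this PR: the names `seconds_to_index`, `scanned_pts_seconds` and `average_fps` are hypothetical, and the exact-mode branch assumes a sorted list of pts values gathered by a full scan. The example stream is deliberately variable frame rate so the two modes disagree.

```python
import bisect
import math
from enum import Enum
from typing import Optional, Sequence


class SeekMode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


def seconds_to_index(
    seconds: float,
    mode: SeekMode,
    scanned_pts_seconds: Optional[Sequence[float]] = None,
    average_fps: Optional[float] = None,
) -> int:
    """Map a pts value in seconds to a frame index (hypothetical helper)."""
    if mode is SeekMode.EXACT:
        if not scanned_pts_seconds:
            raise ValueError("exact mode needs pts values from a full scan")
        # Index of the last frame whose pts is <= seconds.
        return max(bisect.bisect_right(scanned_pts_seconds, seconds) - 1, 0)
    if average_fps is None:
        raise ValueError("approximate mode needs an average FPS in the metadata")
    # Assumes the stream starts at pts 0 and frames are evenly spaced.
    return math.floor(seconds * average_fps)


# Made-up variable-frame-rate stream: 4 frames, assumed duration 2 s,
# so the average FPS reported by the metadata would be 2.0.
pts = [0.0, 0.1, 0.2, 1.0]
requested = (0.15, 0.6, 1.5)
exact = [seconds_to_index(t, SeekMode.EXACT, scanned_pts_seconds=pts) for t in requested]
approx = [seconds_to_index(t, SeekMode.APPROXIMATE, average_fps=2.0) for t in requested]
print(exact, approx)
```

For this made-up stream the exact branch yields indices [1, 2, 3] while the approximate branch yields [0, 1, 3], which is the kind of mismatch described above. The callers that consume the index do not need to know which branch produced it.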

In the Python public API, we add:

  seek_mode: Literal["exact", "approximate"] = "exact",

to the Python VideoDecoder constructor. Not much logic actually changes here.

We do have a lot of changes in the Python metadata to support the VideoDecoder:

  • begin_stream_seconds -> begin_stream_seconds_from_content: This was always only from the full scan. We need to be explicit now.
  • begin_stream_seconds is now a property: it returns begin_stream_seconds_from_content when that is set, and 0 when it is None.
  • end_stream_seconds -> end_stream_seconds_from_content: Same reasoning as above.
  • end_stream_seconds is now a property: it returns end_stream_seconds_from_content when that is set, and duration_seconds when it is None.

By changing the metadata in that manner, the Python VideoDecoder class remains mostly unchanged as the real logic is in resolving what metadata to use. Note that the metadata class has no knowledge of the seek mode; it just knows if some values are missing.
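
The fallback behavior of those two properties can be sketched as follows. This is a simplified stand-in, not the real metadata class: `StreamMetadataSketch` is a hypothetical name and it carries only the fields mentioned in this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamMetadataSketch:
    # From the container header; available without a scan.
    duration_seconds: Optional[float] = None
    # Only populated after a full file scan.
    begin_stream_seconds_from_content: Optional[float] = None
    end_stream_seconds_from_content: Optional[float] = None

    @property
    def begin_stream_seconds(self) -> float:
        # No scan happened: assume the stream starts at 0.
        if self.begin_stream_seconds_from_content is None:
            return 0.0
        return self.begin_stream_seconds_from_content

    @property
    def end_stream_seconds(self) -> Optional[float]:
        # No scan happened: fall back to the header duration.
        if self.end_stream_seconds_from_content is None:
            return self.duration_seconds
        return self.end_stream_seconds_from_content


no_scan = StreamMetadataSketch(duration_seconds=120.0)
scanned = StreamMetadataSketch(
    duration_seconds=120.0,
    begin_stream_seconds_from_content=0.5,
    end_stream_seconds_from_content=119.9,
)
print(no_scan.begin_stream_seconds, no_scan.end_stream_seconds)
print(scanned.begin_stream_seconds, scanned.end_stream_seconds)
```

Nothing in the sketch refers to a seek mode. The properties only react to which values are missing, so the same decoder code can read `begin_stream_seconds` and `end_stream_seconds` in either mode.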

On defaults, the current implementation has exact mode as the default everywhere. We could, however, decide to apply different defaults at each level. The levels at which we can apply defaults are:

  • During construction of the C++ VideoDecoder class. In some ways, the old behavior was approximate as default, since scanning the file did not automatically happen.
  • In the Python core API create_decoder_from family of functions. Similar to the C++, the old behavior was closer to approximate as default.
  • In the Python VideoDecoder constructor. The old behavior was definitely exact as we always scanned.

@facebook-github-bot added the CLA Signed label Dec 20, 2024

scotts commented Dec 20, 2024

Initial performance numbers are encouraging. I generated a video with the following command:

ffmpeg -y -f lavfi -i mandelbrot=s=1920x1080 -t 120 -c:v h264 -r 60 -g 600 -pix_fmt yuv420p mandelbrot_1920x1080_120s.mp4

That produced a 141 MB video file. I then ran our standard benchmark with:

python benchmarks/decoders/benchmark_decoders.py --decoders torchcodec_public:seek_mode=exact,torchcodec_public:seek_mode=approximate,torchcodec_public_nonbatch:seek_mode=exact,torchcodec_public_nonbatch:seek_mode=approximate,decord,decord_batch,torchaudio,torchcodec_core_batch,torchcodec_core_nonbatch --min-run-seconds 40 --video-paths mandelbrot_1920x1080_120s.mp4

And that yields:

[-------------------------------------------------- video=mandelbrot_1920x1080_120s.mp4 h264 1920x1080, 120.0s 60.0fps -------------------------------------------------]
                                                      |  decode 10 uniform frames  |  decode 10 random frames  |  first 1 frames  |  first 10 frames  |  first 100 frames
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------
      TorchAudio                                      |           1590.5           |           2112.7          |       29.3       |        49.0       |       253.2      
      DecordAccurate                                  |           1798.6           |           2201.3          |       95.0       |       171.6       |       701.8      
      DecordAccurateBatch                             |           1756.8           |           2164.2          |       92.9       |       135.7       |       576.7      
      TorchCodecCoreBatch                             |           1605.8           |           1334.6          |       79.6       |        92.6       |       480.1      
      TorchCodecCoreNonBatch                          |           1587.3           |           1971.6          |       25.7       |        38.9       |       155.9      
      TorchCodecPublicNonBatch:seek_mode=exact        |           1620.3           |           2011.6          |       79.3       |        92.7       |       218.8      
      TorchCodecPublicNonBatch:seek_mode=approximate  |           1541.5           |           1981.8          |       26.4       |        39.5       |       166.5      
      TorchCodecPublic:seek_mode=exact                |           1591.5           |           1330.2          |       78.6       |        92.1       |       214.2      
      TorchCodecPublic:seek_mode=approximate          |           1562.3           |           1292.8          |       25.9       |        39.0       |       159.9      

Times are in milliseconds (ms).

Some explanations of what the options are:

  • TorchCodecCoreBatch: core API, uses exact seek mode and batch APIs.
  • TorchCodecCoreNonBatch: core API, uses approximate seek mode and non-batch APIs.
  • Public means using VideoDecoder.
  • NonBatch means using the single frame API, even when getting multiple frames.
  • Batch means using the batch API when getting multiple frames.
  • seek_mode= is changing the seek mode.

To fill out the full matrix, we could also vary the seek mode for the core options. Approximate mode is basically meeting our performance expectations here.
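
As a rough way to read the table, the snippet below computes the exact-to-approximate ratio for the two TorchCodecPublic rows. The timings are copied from the table above; the variable names are only for illustration.

```python
# Timings in ms copied from the TorchCodecPublic rows of the table above.
columns = ["10 uniform", "10 random", "first 1", "first 10", "first 100"]
exact_ms = [1591.5, 1330.2, 78.6, 92.1, 214.2]
approximate_ms = [1562.3, 1292.8, 25.9, 39.0, 159.9]

# Ratio > 1 means approximate mode was faster for that workload.
speedup = {c: e / a for c, e, a in zip(columns, exact_ms, approximate_ms)}
# Absolute time saved in ms for each workload.
saved_ms = {c: e - a for c, e, a in zip(columns, exact_ms, approximate_ms)}

for name in columns:
    print(f"{name:>10}: {speedup[name]:.2f}x, {saved_ms[name]:.1f} ms saved")
```

The ratio is about 3x for the first frame and shrinks as more frames are decoded, while the absolute saving in the three "first N frames" columns stays at roughly 53 to 54 ms. That pattern is consistent with approximate mode skipping a fixed up-front scan cost, though this single run does not prove it.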
