Enable new models in audio-to-text #163
Conversation
```python
logger.info("AudioToTextPipeline using float16 precision for %s", model_id)
kwargs["torch_dtype"] = torch.float16

if bfloat16_enabled:
    logger.info("AudioToTextPipeline using bfloat16 precision for %s", model_id)
```
@eliteprox, thanks for the pull request! 🚀 It looks good overall. However, please keep in mind that the default models `openai/whisper-large-v3` and `distil-whisper/distil-large-v3` use weights in either float16 or bfloat16 formats. The `torch_dtype` parameter primarily controls the precision used for calculations at runtime. You can verify this by checking the model files in these repositories: Hugging Face - distil-large-v3. Notice the presence of files with the `.fp32.safetensors` extension, indicating the format being used.

If the standard `.safetensors` (fp16) format meets your needs, you might consider removing the `FLOAT16` environment variable and instead switching based on the model extension. This approach was implemented by Yondon in this commit. I will leave that decision to you based on your research 👍🏻. Feel free to merge when you think this pull request is done 🚀.
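The suggestion above amounts to selecting the precision per model instead of via a global flag. A minimal sketch of that idea, using an illustrative model-to-dtype mapping (the dictionary values and the `recommended_dtype` helper are assumptions for illustration, not the PR's actual code or the values it ships):

```python
# Hypothetical sketch: choose a precision per model id rather than from a
# single FLOAT16 environment variable. The mapping below is illustrative;
# the real choice should follow the weight formats published in each repo.
MODEL_DTYPES = {
    "openai/whisper-large-v3": "float16",
    "distil-whisper/distil-large-v3": "float16",
    "openai/whisper-medium": "float32",
}

def recommended_dtype(model_id: str, default: str = "float32") -> str:
    """Return the dtype name recommended for a given model id."""
    return MODEL_DTYPES.get(model_id, default)
```

In the pipeline, the returned name could then be resolved to a real dtype object, e.g. `kwargs["torch_dtype"] = getattr(torch, recommended_dtype(model_id))`.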
Thanks for the tip! I updated the logic to load the recommended float precision for each model, and tested that the models download and load correctly.
This reverts commit f835dd4.
@rickstaa I made several changes since you last reviewed this PR, so I held off on merging. Could you or @ad-astra-video re-review the latest changes?
This change adds support for the new Whisper models `distil-whisper/distil-large-v3` and `openai/whisper-medium`. It also configures those models to use the appropriate bfloat16, float16, or float32 precision.

Credit to @ad-astra-video for initially exploring these models and optimizations.