Merge pull request #149 from martincerven/voicecraft_tutorial
Added voicecraft tutorial and cleaned audiocraft tutorial
dusty-nv authored May 16, 2024
2 parents a93ee94 + 77fd1a7 commit 0482ad5
Showing 5 changed files with 97 additions and 9 deletions.
Binary file added docs/images/voicecraft_load_models.png
3 changes: 2 additions & 1 deletion docs/tutorial-intro.md
@@ -55,8 +55,9 @@ Give your locally running LLM access to vision!

| | |
| :---------- | :----------------------------------- |
| **[AudioCraft](./tutorial_audiocraft.md)** | Meta's [AudioCraft](https://github.com/facebookresearch/audiocraft), to produce high-quality audio and music |
| **[Whisper](./tutorial_whisper.md)** | OpenAI's [Whisper](https://github.com/openai/whisper), pre-trained model for automatic speech recognition (ASR) |
| **[AudioCraft](./tutorial_audiocraft.md)** | Meta's [AudioCraft](https://github.com/facebookresearch/audiocraft), to produce high-quality audio and music |
| **[VoiceCraft](./tutorial_voicecraft.md)** | [VoiceCraft](https://github.com/jasonppy/VoiceCraft), speech editing and zero-shot TTS |

### Metropolis Microservices

33 changes: 26 additions & 7 deletions docs/tutorial_audiocraft.md
@@ -50,12 +50,14 @@ In the Jupyter Lab navigation pane on the left, double-click the `demos` folder.

### AudioGen demo

For "**Text-conditional Generation**", you should get something like this.
<!-- For "**Text-conditional Generation**", you should get something like this.
<audio controls>
<source src="./assets/subway.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</audio> -->

Run the cells with ```Shift + Enter```; the first one will download the models, which can take some time.

!!! info

@@ -65,13 +67,30 @@ Your browser does not support the audio element.
Error caught was: No module named 'triton'
```

!!! warning
<!-- !!! warning
When running the 5th cell of `audiogen_demo.ipynb`, you may run into a "**Failed to load audio**" `RuntimeError`. -->

In the *Audio Continuation* cells, you can generate audio that continues an existing clip, guided by text, while in *Text-conditional Generation* you can generate audio from text descriptions alone.
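
For reference, a minimal *Text-conditional Generation* cell looks roughly like this (a sketch following the notebook's pattern; the `facebook/audiogen-medium` checkpoint, clip duration, and example descriptions here are assumptions):

```python
from audiocraft.models import AudioGen
from audiocraft.utils.notebook import display_audio

# load a pretrained AudioGen checkpoint (assumed; the notebook may pick a different one)
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds of audio

# generate one clip per text description
output = model.generate(
    descriptions=[
        'Subway train blowing its horn',
        'A dog barking in the distance',
    ],
    progress=True,
)
display_audio(output, sample_rate=16000)
```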

You can also use your own audio as a prompt and generate continuations from text descriptions:
```python
# `torchaudio`, `model` (AudioGen), and `display_audio` come from the notebook's earlier cells
prompt_waveform, prompt_sr = torchaudio.load("../assets/sirens_and_a_humming_engine_approach_and_pass.mp3")  # you can upload your own audio
prompt_duration = 2  # use only the first 2 seconds as the prompt
prompt_waveform = prompt_waveform[..., :int(prompt_duration * prompt_sr)]
output = model.generate_continuation(prompt_waveform.expand(3, -1, -1), prompt_sample_rate=prompt_sr, descriptions=[
    'Subway train blowing its horn',  # one continuation per text description
    'Horse neighing furiously',
    'Cat hissing'
], progress=True)
display_audio(output, sample_rate=16000)
```

When running the 5th cell of `audiogen_demo.ipynb`, you may run into a "**Failed to load audio**" `RuntimeError`.
### MusicGen and MAGNeT demos

### MusicGen demo
The two other Jupyter notebooks are similar to the AudioGen one: you can generate continuations or generate audio from text, using models trained to generate music.
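
As an example, a text-conditional MusicGen cell follows the same pattern (a sketch; the `facebook/musicgen-small` checkpoint and the prompt text are assumptions):

```python
from audiocraft.models import MusicGen
from audiocraft.utils.notebook import display_audio

# load a pretrained MusicGen checkpoint (assumed; the notebook may use another size)
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # generate 8 seconds of music

output = model.generate(
    descriptions=['80s pop track with bassy drums and synth'],
    progress=True,
)
display_audio(output, sample_rate=32000)  # MusicGen outputs 32 kHz audio
```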

For "**Text-conditional Generation**", you should get something like this.
<!-- For "**Text-conditional Generation**", you should get something like this.
<audio controls>
<source src="./assets/80s-pop.wav" type="audio/wav">
@@ -80,4 +99,4 @@ Your browser does not support the audio element.
!!! warning
When running the 5th cell of `musicgen_demo.ipynb`, you may run into a "**Failed to load audio**" `RuntimeError`.
When running the 5th cell of `musicgen_demo.ipynb`, you may run into a "**Failed to load audio**" `RuntimeError`. -->
67 changes: 67 additions & 0 deletions docs/tutorial_voicecraft.md
@@ -0,0 +1,67 @@
# Tutorial - VoiceCraft

Let's run [VoiceCraft](https://github.com/jasonppy/VoiceCraft), a model for zero-shot speech editing and text-to-speech in the wild!

!!! abstract "What you need"

1. One of the following Jetson devices:

<span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
<span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>
<!-- <span class="blobLightGreen4">Jetson Orin Nano (8GB)</span> -->

2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

<!-- <span class="blobPink1">JetPack 5 (L4T r35.x)</span> -->
<span class="blobPink2">JetPack 6 (L4T r36.x)</span>

3. Sufficient storage space (preferably with NVMe SSD).

- `15.6 GB` for `voicecraft` container image
- Space for models

4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

```bash
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh
```

## How to start

Use the `jetson-containers run` and `autotag` commands to automatically pull or build a compatible container image.

```bash
jetson-containers run $(autotag voicecraft)
```

The container has a default run command (`CMD`) that will automatically start the Gradio app.

Open your browser and access `http://<IP_ADDRESS>:7860`.

<!-- > The default password for Jupyter Lab is `nvidia`. -->

## Gradio app

The VoiceCraft repo comes with a Gradio demo app.

1. Select which models you want to use. I recommend `330M_TTSEnhanced` on the 32GB AGX Orin.
2. Click load. On the first run, the models are downloaded from Hugging Face; on later runs they are loaded from the `/data` folder, where they were saved previously.
3. Upload an audio file of your choice (MP3/WAV).
4. Click transcribe. It uses Whisper to get a transcription along with the start/end time of each spoken word (see the sketch after this list).
5. Now you can edit the sentence, or use TTS. Click Run to generate the output.
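
As a rough illustration of what the transcribe step does, here is a minimal standalone sketch using the `openai-whisper` package; the `base.en` model and the `prompt.wav` path are assumptions, and the demo app wires its own Whisper variant in for you:

```python
# Hedged sketch: word-level timestamps with openai-whisper (pip install openai-whisper)
import whisper

model = whisper.load_model("base.en")  # assumed model size
result = model.transcribe("prompt.wav", word_timestamps=True)  # assumed input file

# print each spoken word with its start/end time
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']!r}: {word['start']:.2f}s - {word['end']:.2f}s")
```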


![VoiceCraft Gradio app, model loading panel](./images/voicecraft_load_models.png)


!!! warning

For TTS it's fine to use only the first few seconds of audio as the prompt, since generation consumes a lot of memory. On the 32GB AGX Orin, the maximum length of generated TTS audio is around 16 seconds in headless mode.


## Resources
If you want to know how it works under the hood, you can read the following papers:

1. [VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild](https://arxiv.org/pdf/2403.16973)
2. [High Fidelity Neural Audio Compression](https://arxiv.org/pdf/2210.13438)
3. [Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/pdf/2301.02111)
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -104,8 +104,9 @@ nav:
- Vector Database:
- NanoDB: tutorial_nanodb.md
- Audio:
- AudioCraft: tutorial_audiocraft.md
- Whisper: tutorial_whisper.md
- AudioCraft: tutorial_audiocraft.md
- VoiceCraft: tutorial_voicecraft.md
- Metropolis Microservices:
- First Steps: tutorial_mmj.md
# - Tools:
