Merge pull request #149 from martincerven/voicecraft_tutorial

Added voicecraft tutorial and cleaned audiocraft tutorial
NVIDIA-AI-IOT · May 16, 2024 · 0482ad5
2 parents a93ee94 + 77fd1a7
commit 0482ad5

Showing 5 changed files with 97 additions and 9 deletions.
diff --git a/docs/images/voicecraft_load_models.png b/docs/images/voicecraft_load_models.png
diff --git a/docs/tutorial-intro.md b/docs/tutorial-intro.md
@@ -55,8 +55,9 @@ Give your locally running LLM an access to vision!
 
 |      |                     |
 | :---------- | :----------------------------------- |
-| **[AudioCraft](./tutorial_audiocraft.md)** | Meta's [AudioCraft](https://github.com/facebookresearch/audiocraft), to produce high-quality audio and music |
 | **[Whisper](./tutorial_whisper.md)** | OpenAI's [Whisper](https://github.com/openai/whisper), pre-trained model for automatic speech recognition (ASR) |
+| **[AudioCraft](./tutorial_audiocraft.md)** | Meta's [AudioCraft](https://github.com/facebookresearch/audiocraft), to produce high-quality audio and music |
+| **[VoiceCraft](./tutorial_voicecraft.md)** | [VoiceCraft](https://github.com/jasonppy/VoiceCraft), speech editing and zero-shot TTS |
 
 ### Metropolis Microservices
 

diff --git a/docs/tutorial_audiocraft.md b/docs/tutorial_audiocraft.md
@@ -50,12 +50,14 @@ On Jupyter Lab navigation pane on the left, double-click `demos` folder.
 
 ### AudioGen demo
 
-For "**Text-conditional Generation**", you should get something like this.
+<!-- For "**Text-conditional Generation**", you should get something like this.
 
 <audio controls>
   <source src="./assets/subway.wav" type="audio/wav">
 Your browser does not support the audio element.
-</audio>
+</audio> -->
+
+Run each cell with ```Shift + Enter```. The first cell downloads the models, which can take some time.
 
 !!! info
 
@@ -65,13 +67,30 @@ Your browser does not support the audio element.
     Error caught was: No module named 'triton'
     ```
 
-!!! warning
+<!-- !!! warning
+
+    When running the 5-th cell of `audiogen_demo.ipynb`, you may run into "**Failed to load audio**" RuntimeError. -->
+
+In the *Audio Continuation* cells, you can generate a continuation of an audio prompt guided by text descriptions, while in *Text-conditional Generation* you generate audio from text descriptions alone.
+
+You can also use your own audio as the prompt and use text descriptions to guide the continuation:
+```python
+import torchaudio  # `model` and `display_audio` are defined in earlier notebook cells
+
+# Load the audio prompt; you can upload your own audio and point to it here.
+prompt_waveform, prompt_sr = torchaudio.load("../assets/sirens_and_a_humming_engine_approach_and_pass.mp3")
+prompt_duration = 2  # seconds of the prompt to keep
+prompt_waveform = prompt_waveform[..., :int(prompt_duration * prompt_sr)]
+# Broadcast the prompt to a batch of 3, one copy per text description.
+output = model.generate_continuation(prompt_waveform.expand(3, -1, -1), prompt_sample_rate=prompt_sr, descriptions=[
+        'Subway train blowing its horn',   # text descriptions for the continuation
+        'Horse neighing furiously',
+        'Cat hissing'
+], progress=True)
+display_audio(output, sample_rate=16000)
+```
 
-    When running the 5-th cell of `audiogen_demo.ipynb`, you may run into "**Failed to load audio**" RuntimeError.
+### MusicGen and MAGNeT demos
 
-### MusicGen demo
+The other two Jupyter notebooks are similar to the AudioGen one: you can generate a continuation or generate audio from text, using models trained to generate music.
 
-For "**Text-conditional Generation**", you should get something like this.
+<!-- For "**Text-conditional Generation**", you should get something like this.
 
 <audio controls>
   <source src="./assets/80s-pop.wav" type="audio/wav">
@@ -80,4 +99,4 @@ Your browser does not support the audio element.
 
 !!! warning
 
-    When running the 5-th cell of `musicgen_demo.ipynb`, you may run into "**Failed to load audio**" RuntimeError.
+    When running the 5-th cell of `musicgen_demo.ipynb`, you may run into "**Failed to load audio**" RuntimeError. -->
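
The prompt-trimming step in the continuation snippet added above keeps `int(prompt_duration * prompt_sr)` samples from the loaded audio. A minimal sketch of that arithmetic follows; `prompt_samples` is a hypothetical helper written for illustration and is not part of AudioCraft.

```python
def prompt_samples(duration_s: float, sample_rate: int, total_samples: int) -> int:
    """Samples kept by `waveform[..., :int(duration_s * sample_rate)]`.

    Hypothetical helper for illustration only. Slicing past the end of the
    waveform simply keeps everything, hence the clamp to total_samples.
    """
    return min(int(duration_s * sample_rate), total_samples)


# A 2-second prompt cut from a 10-second clip, at two common sample rates.
for sr in (16000, 44100):
    print(sr, prompt_samples(2, sr, total_samples=10 * sr))
```

Because the sample rate comes from the loaded file, the same `prompt_duration` maps to a different number of samples for each input, which is why the snippet passes `prompt_sample_rate=prompt_sr` along with the waveform.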
diff --git a/docs/tutorial_voicecraft.md b/docs/tutorial_voicecraft.md
@@ -0,0 +1,67 @@
+# Tutorial - VoiceCraft
+
+Let's run [VoiceCraft](https://github.com/jasonppy/VoiceCraft), a model for zero-shot speech editing and text-to-speech in the wild!
+!!! abstract "What you need"
+
+    1. One of the following Jetson devices:
+
+        <span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
+        <span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>
+        <!-- <span class="blobLightGreen4">Jetson Orin Nano (8GB)</span> -->
+
+    2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):
+
+        <!-- <span class="blobPink1">JetPack 5 (L4T r35.x)</span> -->
+         <span class="blobPink2">JetPack 6 (L4T r36.x)</span>
+
+    3. Sufficient storage space (preferably with NVMe SSD).
+
+        - `15.6 GB` for `voicecraft` container image
+        - Space for models
+
+    4. Clone and set up [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:
+
+		```bash
+		git clone https://github.com/dusty-nv/jetson-containers
+		bash jetson-containers/install.sh
+		``` 
+
+## How to start
+
+Use the `jetson-containers run` and `autotag` commands to automatically pull or build a compatible container image.
+
+```bash
+jetson-containers run $(autotag voicecraft)
+```
+
+The container has a default run command (`CMD`) that will automatically start the Gradio app.
+
+Open your browser and access `http://<IP_ADDRESS>:7860`.
+
+<!-- > The default password for Jupyter Lab is `nvidia`. -->
+
+## Gradio app
+
+The VoiceCraft repo comes with a Gradio demo app.
+
+1. Select which models you want to use. I recommend `330M_TTSEnhanced` on the 32GB AGX Orin.
+2. Click load. On the first run, the models are downloaded from Hugging Face; on later runs they are loaded from the ```/data``` folder, where previous runs saved them.
+3. Upload an audio file of your choice (MP3/WAV).
+4. Click transcribe. This uses Whisper to get the transcription along with the start/end time of each spoken word.
+5. Now you can edit the sentence or use TTS. Click Run to generate the output.
+
+
+![](./images/voicecraft_load_models.png)
+
+
+!!! warning
+
+    For TTS, it's fine to use only the first few seconds of audio as the prompt, since generation consumes a lot of memory. On the 32GB AGX Orin, the maximum length of generated TTS audio is about 16 seconds in headless mode.
+
+
+## Resources
+If you want to know how it works under the hood, you can read the following papers:
+
+1.  [VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild](https://arxiv.org/pdf/2403.16973)
+2.  [High Fidelity Neural Audio Compression](https://arxiv.org/pdf/2210.13438)
+3.  [Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/pdf/2301.02111)
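
The warning in the VoiceCraft tutorial above suggests using only the first few seconds of audio as the TTS prompt. One way to prepare such a prompt before uploading it is sketched below with Python's standard `wave` module. This is an assumption-laden sketch, not part of the tutorial or of VoiceCraft: `trim_wav` and the file names are hypothetical, and it handles uncompressed PCM WAV files only (not MP3).

```python
import wave


def trim_wav(src_path: str, dst_path: str, seconds: float) -> int:
    """Copy the first `seconds` of a PCM WAV file to dst_path.

    Hypothetical helper for illustration. Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Never ask for more frames than the file contains.
        n_frames = min(int(seconds * src.getframerate()), src.getnframes())
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed up on close
        dst.writeframes(frames)
    return n_frames


# Example (hypothetical file names):
# trim_wav("my_voice_full.wav", "my_voice_prompt.wav", seconds=4)
```

The trimmed file can then be uploaded in step 3 of the Gradio app in place of the full recording.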
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -104,8 +104,9 @@ nav:
     - Vector Database:
       - NanoDB: tutorial_nanodb.md
     - Audio:
-      - AudioCraft: tutorial_audiocraft.md
       - Whisper: tutorial_whisper.md
+      - AudioCraft: tutorial_audiocraft.md
+      - VoiceCraft: tutorial_voicecraft.md
     - Metropolis Microservices:
       - First Steps: tutorial_mmj.md
     # - Tools: