Voxlingua is a project that allows users to input a YouTube video link, choose a target language, and receive the original video with translated audio while preserving the original speaker's voice. This project employs multiple technologies to process video, recognize speech, translate text, convert text to speech, clone voices, and synchronize audio with video.
Current status: Coqui XTTS running on an EC2 instance was chosen for the voice cloning step. F5-TTS is a better alternative, but it only supports English and Chinese. AWS S3, EC2, and Lambda are being integrated with the project, and video_processing.py is complete.
The workflow of Voxlingua consists of six primary steps:
- Video Processing:
  - The user uploads a YouTube link, and the video is processed using `yt-dlp` for download and `ffmpeg` for extraction of the audio stream.
- Speech Recognition:
  - The extracted audio is processed through OpenAI's Whisper for speech recognition, converting the speech in the original language to text.
- Text Translation:
  - The recognized text is translated into the target language using the MarianMT Model from Hugging Face Transformers.
- Text-to-Speech:
  - The translated text is converted into speech using Google Text-to-Speech (gTTS), generating an audio file in the target language.
- Voice Cloning:
  - Using GPT-SoVITS / OpenVoice, the generated audio is transformed to retain the original speaker's voice, ensuring that the translated audio mimics the pitch and tone of the speaker in the original video.
- Audio-Video Sync:
  - Finally, the translated audio is synced back to the video using `ffmpeg`, producing a video with the translated audio but preserving the speaker's original voice characteristics.
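The first and last steps above are plain command-line calls, so they can be sketched as small helpers that build the `yt-dlp` and `ffmpeg` argument lists. This is a minimal sketch, not the contents of video_processing.py: the function names, the mono 16 kHz WAV format, and the file paths are assumptions chosen for illustration.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def download_cmd(url: str, video: Path) -> list[str]:
    # -f mp4 asks yt-dlp for the best single-file MP4; -o sets the output path.
    return ["yt-dlp", "-f", "mp4", "-o", str(video), url]


def extract_audio_cmd(video: Path, audio: Path) -> list[str]:
    # -vn drops the video stream; mono 16 kHz WAV is an assumed format
    # that suits speech recognition input.
    return ["ffmpeg", "-y", "-i", str(video), "-vn",
            "-ac", "1", "-ar", "16000", str(audio)]


def mux_cmd(video: Path, dubbed: Path, output: Path) -> list[str]:
    # Video comes from input 0 and audio from input 1. The video stream is
    # copied without re-encoding; -shortest stops at the shorter stream.
    return ["ffmpeg", "-y", "-i", str(video), "-i", str(dubbed),
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-shortest", str(output)]


def run(cmd: list[str]) -> None:
    # Raises CalledProcessError if the tool exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Calling `run(download_cmd(url, Path("in.mp4")))` followed by `run(extract_audio_cmd(Path("in.mp4"), Path("in.wav")))` would cover the Video Processing step. Note that `-shortest` only trims the output; it does not stretch the translated audio to match the original timing, so a real sync step would likely need per-segment alignment as well.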
To set up and run Voxlingua locally, follow these steps:
- Clone the repository and change into its directory: `git clone https://github.com/yourusername/Voxlingua.git`, then `cd Voxlingua`.
The basic architecture v1 is illustrated below:
- yt-dlp: Used to download the video and extract the audio.
- ffmpeg: For video processing and synchronization of audio with video.
- Whisper (OpenAI): For speech-to-text conversion.
- Hugging Face Transformers: For text translation into the desired language.
- Google Text-to-Speech (gTTS): For text-to-speech generation in the target language.
- GPT-SoVITS: For cloning the speaker’s voice to retain their unique vocal characteristics.
- Gradio: For creating an interactive user interface.
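As a rough illustration of how the Whisper, MarianMT, and gTTS components listed above could be chained, the sketch below wraps each library in one function. The function names and the `base` Whisper model size are assumptions, not taken from the project. The `Helsinki-NLP/opus-mt-{src}-{tgt}` pattern is the usual naming scheme for MarianMT checkpoints on the Hugging Face Hub, but not every language pair has a published model. The heavy libraries are imported inside the functions so the module loads even when they are not installed.

```python
from __future__ import annotations


def marian_model_name(src: str, tgt: str) -> str:
    # Hub naming scheme for MarianMT checkpoints; availability of a
    # given pair is not guaranteed.
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def transcribe(audio_path: str, model_size: str = "base") -> tuple[str, str]:
    import whisper  # installed with: pip install openai-whisper

    result = whisper.load_model(model_size).transcribe(audio_path)
    # Whisper returns the transcript and the language code it detected.
    return result["text"].strip(), result["language"]


def translate(text: str, src: str, tgt: str) -> str:
    from transformers import MarianMTModel, MarianTokenizer

    name = marian_model_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    # truncation=True cuts input that exceeds the model's maximum length.
    batch = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


def synthesize(text: str, lang: str, out_mp3: str) -> None:
    from gtts import gTTS

    # gTTS writes MP3 audio and needs network access.
    gTTS(text=text, lang=lang).save(out_mp3)
```

Because of the truncation, a long transcript would need to be split into sentences or segments and translated piece by piece. Whisper's per-segment timestamps (`result["segments"]`) are one possible basis for that split.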
- Input a YouTube Video:
  - Enter the YouTube URL in the input field.
  - Select the target language for translation.
- Process and Output:
  - Voxlingua will process the video in the background and provide a downloadable link for the output video with translated audio in the original speaker’s voice.
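The usage flow above maps naturally onto a small Gradio interface with a URL field, a language dropdown, and a video output. The following is a hypothetical sketch: `is_youtube_url`, `dub_video`, `build_ui`, and the `LANGUAGES` table are illustrative names that do not come from the project, and `dub_video` is left as a placeholder for the actual pipeline.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Illustrative subset of display names mapped to language codes.
LANGUAGES = {"French": "fr", "German": "de", "Spanish": "es"}


def is_youtube_url(url: str) -> bool:
    # Accept youtube.com, its subdomains, and the youtu.be short links.
    host = (urlparse(url).hostname or "").lower()
    return host in {"youtube.com", "youtu.be"} or host.endswith(".youtube.com")


def dub_video(url: str, language: str) -> str:
    # Placeholder: a real implementation would run the pipeline and
    # return the path of the translated video file.
    if not is_youtube_url(url):
        raise ValueError("Please enter a valid YouTube URL.")
    raise NotImplementedError("Connect the Voxlingua pipeline here.")


def build_ui():
    import gradio as gr  # imported lazily so the helpers work without it

    return gr.Interface(
        fn=dub_video,
        inputs=[
            gr.Textbox(label="YouTube URL"),
            gr.Dropdown(choices=list(LANGUAGES), label="Target language"),
        ],
        outputs=gr.Video(label="Translated video"),
    )
```

Running `build_ui().launch()` would start the local web interface. Validating the URL before any download keeps obviously wrong input from reaching `yt-dlp`.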
- Implement the trained voice cloning model for accurate voice cloning.
- Add more language support for translation.
- Improve real-time performance of voice cloning.
- Enable additional customization options for the user.
Feel free to contribute to this project by opening issues or submitting pull requests.
This project is licensed under the MIT License.