Introduction
Text-to-speech (TTS) technology has revolutionized how we interact with devices, making it easier to consume content by listening rather than reading. TTS is vital in applications such as virtual assistants, audiobooks, accessibility tools for the visually impaired, and language-learning platforms.
This tutorial will explore how to convert text to speech using Hugging Face's MeloTTS transformer, a powerful model designed for high-quality TTS tasks.
We will walk through installing the necessary libraries, creating basic examples, experimenting with different accents and languages, adjusting speech speed, and ultimately, combining these elements into a comprehensive TTS function.
Note: Check out my article on how to generate stunning images from text if you are interested in text-to-image generation.
Installing Required Libraries
To begin, we must clone the MeloTTS repository from GitHub and install the required dependencies. This can be done with the following commands:
!git clone https://github.com/myshell-ai/MeloTTS.git
%cd MeloTTS
!pip install -e .
!python -m unidic download
In the above script, the git clone command fetches the MeloTTS repository, and the %cd magic navigates into the cloned directory. The pip install -e . command installs the package in editable mode, allowing us to make changes if necessary. Finally, the python -m unidic download command downloads the UniDic dictionary required for text processing.
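Before moving on, it can be worth confirming that the installation actually succeeded. The small helper below is my own convenience, not part of MeloTTS; it simply reports which of the required packages cannot be imported yet.

```python
import importlib.util

def missing_packages(packages=("melo", "unidic")):
    """Return the names from `packages` that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# Print what still needs installing, if anything.
print(missing_packages() or "All required packages are installed.")
```

If the list is non-empty, re-run the installation commands above before continuing.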
A Basic Example
Let's create a basic example of converting English text to speech using the MeloTTS model. In the following code, we import the TTS class from the melo.api module and set the speech speed to 1.0.
Notice that we pass the language and the device type to the TTS class constructor. The device parameter is set to auto, allowing the model to use the GPU if one is available. We define our English text and initialize the TTS model for the English language. The speaker_ids dictionary maps the available accents to their respective IDs. Finally, we call the tts_to_file() method to generate the speech with an American accent and save it as en-us.wav.
from melo.api import TTS
speed = 1.0
device = 'auto' # Will automatically use GPU if available
# English
text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id
# American accent
output_path = 'en-us.wav'
model.tts_to_file(text, speaker_ids['EN-US'], output_path, speed=speed)
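Once the file has been written, you may want to sanity-check the output without opening an audio player. The helper below is a sketch of my own using only Python's standard-library wave module; it does not depend on MeloTTS at all.

```python
import wave

def wav_info(path):
    """Return (duration_seconds, sample_rate, channels) for a WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return frames / rate, rate, wf.getnchannels()

# After running the example above, you could inspect the result with e.g.:
# print(wav_info('en-us.wav'))
```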
Trying Different Accents
You can try different accents for a given language. For instance, the previous script created an English model; you can list the accents available for it by printing the speaker_ids dictionary.
print(speaker_ids)
Output:
{'EN-US': 0, 'EN-BR': 1, 'EN_INDIA': 2, 'EN-AU': 3, 'EN-Default': 4}
Let's try to generate speech with an Indian accent using the EN_INDIA speaker ID.
speed = 1.0
device = 'auto'
# English Indian Accent
text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id
output_path = 'en-in.wav'
model.tts_to_file(text, speaker_ids['EN_INDIA'], output_path, speed=speed)
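Note that the key naming is not entirely consistent (EN_INDIA uses an underscore while EN-US uses a hyphen), so a mistyped key raises a KeyError. The lookup helper below is my own convenience, not part of the MeloTTS API; it falls back to a default speaker when the preferred one is missing.

```python
def resolve_speaker(spk2id, preferred, fallback="EN-Default"):
    """Return the speaker ID for `preferred`, falling back if it is absent."""
    if preferred in spk2id:
        return spk2id[preferred]
    if fallback in spk2id:
        return spk2id[fallback]
    raise KeyError(f"Neither {preferred!r} nor {fallback!r} found in {sorted(spk2id)}")
```

For example, resolve_speaker(speaker_ids, 'EN_INDIA') returns the Indian-accent ID, while an unknown key such as 'EN-UK' would quietly resolve to the EN-Default speaker instead of crashing.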
Trying Different Languages
We can also generate speech in different languages. All supported languages are listed on the Hugging Face MeloTTS model card. To change the language, pass the corresponding language ID to the language parameter of the TTS class constructor. For example, the following script uses the FR ID to create a French TTS model. You can then print the accents available for that language via the speaker_ids dictionary, as shown in the script below.
speed = 1.0
device = 'auto'
# French
model = TTS(language='FR', device=device)
speaker_ids = model.hps.data.spk2id
print(speaker_ids)
Output:
{'FR': 0}
The above output shows that French has only one speaker, FR. You can use this speaker ID to generate French speech.
text = "Dans cette vidéo, vous allez apprendre sur les Large Language Models. Ça va être amusant."
output_path = 'fr.wav'
model.tts_to_file(text, speaker_ids['FR'], output_path, speed=speed)
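When switching languages programmatically, it can help to validate the language code up front rather than letting the model constructor fail. The mapping below reflects the languages listed on the MeloTTS model card at the time of writing; check the card for the current list before relying on it, and note that the helper itself is my own sketch, not part of the library.

```python
# Language IDs per the MeloTTS model card at the time of writing.
SUPPORTED_LANGUAGES = {
    "EN": "English", "ES": "Spanish", "FR": "French",
    "ZH": "Chinese", "JP": "Japanese", "KR": "Korean",
}

def validate_language(language):
    """Normalize a language code and fail early if it is not supported."""
    code = language.upper()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language {language!r}; "
                         f"choose one of {sorted(SUPPORTED_LANGUAGES)}")
    return code
```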
Adjusting Speech Speed
The speed of the speech can be adjusted by changing the speed parameter. Here’s an example with a faster speech speed:
speed = 5
text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id
output_path = 'en-us.wav'
model.tts_to_file(text, speaker_ids['EN-US'], output_path, speed=speed)
By setting the speed to 5, the speech is generated at a much faster rate; conversely, values below 1.0 slow the speech down.
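As a rough rule of thumb (my own approximation, not a guarantee from MeloTTS), the output duration shrinks in inverse proportion to the speed value:

```python
def expected_duration(base_seconds, speed):
    """Approximate output duration if `base_seconds` is the duration at speed 1.0."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return base_seconds / speed

# A clip that takes 10 s at speed 1.0 should take roughly 2 s at speed 5.
print(expected_duration(10.0, 5))
```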
Putting it All Together
Finally, to streamline the process, we can create a function that generates speech based on the specified parameters:
def generate_speech(text, language="EN", speaker_id="EN-US", speed=1.0, audio_path="speech.wav"):
    model = TTS(language=language, device='auto')
    speaker_ids = model.hps.data.spk2id
    model.tts_to_file(text, speaker_ids[speaker_id], audio_path, speed=speed)
Using the generate_speech function, we can quickly generate speech by providing the necessary arguments:
generate_speech(text="Hello, this is a test speech.",
                speaker_id="EN_INDIA",
                speed=3,
                audio_path="indian_speech.wav")
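Building on generate_speech, a small batch wrapper (a sketch of my own, not part of MeloTTS) can synthesize several texts in one call. The generate callable is passed in explicitly, which keeps the wrapper decoupled from the model and easy to test:

```python
def batch_generate(items, generate):
    """Synthesize one file per (text, language, speaker_id) tuple.

    `generate` is expected to have the same signature as generate_speech
    above. Returns the list of output paths.
    """
    paths = []
    for i, (text, language, speaker_id) in enumerate(items):
        path = f"speech_{i}_{language.lower()}.wav"
        generate(text, language=language, speaker_id=speaker_id, audio_path=path)
        paths.append(path)
    return paths
```

For example, batch_generate([("Hello", "EN", "EN-US"), ("Bonjour", "FR", "FR")], generate_speech) would produce speech_0_en.wav and speech_1_fr.wav.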
Conclusion
This tutorial explored the basics of text-to-speech conversion using Hugging Face's MeloTTS transformer. We covered installing required libraries, creating basic examples, experimenting with different accents and languages, adjusting speech speed, and putting everything together in a reusable function. With these tools, you can create high-quality speech synthesis for various applications, enhancing user experiences and accessibility.