Setting Up Meta AI’s SeamlessM4T — Massively Multilingual & Multimodal Machine Translation Model
What is SeamlessM4T?
Earlier today, MetaAI released the SeamlessM4T library. So what is SeamlessM4T, you may ask? According to the library’s official GitHub page:
SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
📥 101 languages for speech input.
⌨️ 96 Languages for text input/output.
🗣️ 35 languages for speech output.
What really got me interested is that they’ve stated that this is a unified model that would enable multiple tasks WITHOUT the need to rely on multiple models.
The supported tasks are:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
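To keep the five tasks straight, here is a small lookup table written as a Python sketch. It is illustrative only: `Task`, `TASKS` and `describe` are hypothetical names for this example and are not part of the library, and the lowercase task codes are assumed to follow the abbreviations in the list above (two of them, `"t2st"` and `"s2st"`, are the ones used later in this article).

```python
from typing import NamedTuple


class Task(NamedTuple):
    """Input and output modality for one task (hypothetical helper)."""
    name: str
    source: str  # "speech" or "text"
    target: str  # "speech" or "text"


# Keys are lowercase versions of the abbreviations in the list above.
TASKS = {
    "s2st": Task("Speech-to-speech translation", "speech", "speech"),
    "s2tt": Task("Speech-to-text translation", "speech", "text"),
    "t2st": Task("Text-to-speech translation", "text", "speech"),
    "t2tt": Task("Text-to-text translation", "text", "text"),
    "asr": Task("Automatic speech recognition", "speech", "text"),
}


def describe(code: str) -> str:
    """Return a one-line description of a task code; raises KeyError if unknown."""
    key = code.lower()
    task = TASKS[key]
    return f"{key}: {task.name} ({task.source} in, {task.target} out)"


for code in TASKS:
    print(describe(code))
```

Seen this way, the single model covers every combination of speech and text on either side, plus plain transcription.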
Setting Up The Environment
I set up a new PC in December 2022, and we will use it for the SeamlessM4T environment. Based on the research I’ve done, SeamlessM4T doesn’t support Windows (yet), so it’s a good thing that this PC dual-boots Windows and Ubuntu Linux. The setup instructions can be found on the official GitHub page, but here are the actual Ubuntu terminal commands that I used to set up the environment.
conda create -y -n seamless_communication
conda activate seamless_communication
git clone https://github.com/facebookresearch/seamless_communication
cd seamless_communication
pip install .
conda install -y -c conda-forge libsndfile
# Register the environment as a Jupyter kernel
conda install ipykernel -y
conda install ipywidgets==8.0.4 -y
python -m ipykernel install --user --name seamless_communication --display-name "seamless_communication"
The last 3 lines above will enable us to utilize the new conda environment from within a Jupyter lab notebook.
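Before opening a notebook, it can be worth checking that the new environment actually sees the packages we just installed. The following is a minimal sketch that uses only the Python standard library; `REQUIRED` and `missing_packages` are hypothetical names for this example, and the package list is an assumption based on the imports used later in this article.

```python
import importlib.util
import sys

# Packages that the code later in this article imports (assumed list).
REQUIRED = ["torch", "torchaudio", "seamless_communication"]


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Shows which interpreter is running, useful for confirming the kernel choice.
print("Python executable:", sys.executable)

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If the printed executable path does not point inside the `seamless_communication` conda environment, the notebook is probably running on a different kernel.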
Testing it out
The SeamlessM4T model card page on Hugging Face has some boilerplate code as a starting point. I used it to prepare a simple JupyterLab notebook, as seen below. There are two model flavors, SeamlessM4T-Medium and SeamlessM4T-Large. For this example, we will be using the Large version.
Where to get the list of supported languages?
The official SeamlessM4T paper is available for download, and pages 14 to 15 list the languages the model supports along with their language codes.
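Since the codes have to be copied by hand from the paper, it can help to keep the ones you use in one place and fail early on a typo. Here is a minimal sketch, assuming only the two codes used in this article (`tgl` and `eng`); `LANG_CODES` and `lang_code` are hypothetical names, and the table would need to be extended from the paper for any other language.

```python
# Partial table: only the two languages used in this article.
LANG_CODES = {
    "tagalog": "tgl",
    "english": "eng",
}


def lang_code(language: str) -> str:
    """Look up the three-letter code for a language name (case-insensitive)."""
    key = language.strip().lower()
    if key not in LANG_CODES:
        known = ", ".join(sorted(LANG_CODES))
        raise KeyError(f"No code recorded for {language!r}; known languages: {known}")
    return LANG_CODES[key]


src_lang = lang_code("Tagalog")
tgt_lang = lang_code("English")
print(src_lang, tgt_lang)
```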
Picking Our Source and Target Languages
Now that we know the supported languages and their language codes, we can put the model to use. I live in the Philippines, where well over a hundred languages and dialects are spoken. One of the most widely spoken is Tagalog, and we will use it as our source language.
We will submit the following sample text for translation:
Salamat sa MetaAI at naglabas sila ng SeamlessM4T model para gamitin ng mga tao.
Which in English means:
Thanks to MetaAI for releasing the SeamlessM4T model to be used by people.
Actual Translations
Here’s the actual code I used for text-to-speech translation (T2ST):
import os
import torch
import torchaudio
from seamless_communication.models.inference import Translator
# Initialize a Translator object with a multitask model, vocoder on the GPU.
translator = Translator("seamlessM4T_large", vocoder_name_or_card="vocoder_36langs", device=torch.device("cuda:0"))
# We got the language codes from the official paper
src_lang = "tgl"  # Tagalog
tgt_lang = "eng"  # English
input_text = "Salamat sa MetaAI at naglabas sila ng SeamlessM4T model para gamitin ng mga tao!"
translated_text, wav, sr = translator.predict(input_text, "t2st", tgt_lang=tgt_lang, src_lang=src_lang)
# Let's print the translated text
print(translated_text)
# Make sure the output folder exists, then save the generated audio.
os.makedirs("./wav_files", exist_ok=True)
torchaudio.save(
    "./wav_files/Tagalog-to-English.wav",
    wav[0].cpu(),
    sample_rate=sr,
)
Since we invoked a T2ST task, the primary purpose of the call is to generate a wav file containing the translated version of the text we passed in above. Here’s the SoundCloud link to the actual wav file that was generated by the model.
Usefully, besides generating the wav file, the model also returns the translated text, which the code prints.
Based on both the wav file and the translated text, the model got it right! Amazing! Using the generated wav file, let’s try converting from English back to Tagalog with the code below.
# Round trip: the English audio is now the input, so the target language is Tagalog ("tgl").
translated_text, wav, sr = translator.predict("./wav_files/Tagalog-to-English.wav", "s2st", tgt_lang="tgl")
# Let's print the translated text
print(translated_text)
# Save the translated audio generation.
torchaudio.save(
    "./wav_files/English-to-Tagalog.wav",
    wav[0].cpu(),
    sample_rate=sr,
)
Here’s the generated wav file from English back to Tagalog. I hosted it on SoundCloud as well. Listening to it, I couldn’t quite make out how it rendered the term SeamlessM4T. Good thing we can print the text version.
Apparently, the model translated the term SeamlessM4T to CEMUT M40. It got that wrong, of course, but it does sound like SeamlessM4T. Then again, say that term to almost anyone and they would either misspell it or get it completely wrong (Seamless-Forty… whut?!).
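One rough way to put a number on how far a round-trip result drifted is a character-level similarity score. The sketch below uses Python's standard `difflib` module for this. It is not part of SeamlessM4T, `similarity` is a hypothetical helper name, and the score is only a crude indicator that says nothing about whether the meaning survived.

```python
from difflib import SequenceMatcher


def similarity(original: str, round_trip: str) -> float:
    """Return a 0.0 to 1.0 character-level similarity, ignoring case and outer whitespace."""
    a = original.strip().lower()
    b = round_trip.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# The term from this article and what the model produced for it.
score = similarity("SeamlessM4T", "CEMUT M40")
print(f"Similarity: {score:.2f}")
```

The same helper could be applied to whole sentences, for example the original input text and the text returned by the round trip, to flag results that deserve a closer listen.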
Takeaways
The potential impact of SeamlessM4T is immense. It may change the way we interact and connect with people from different cultures and backgrounds, to the point where language is no longer a barrier to understanding and collaboration. SeamlessM4T could make communication across languages seamless and effortless.
Of course, there will always be room for improvement. As with any technology, there may be occasional hiccups and inaccuracies. However, with continuous refinement and feedback, MetaAI’s SeamlessM4T model has the potential to become an indispensable tool for breaking down language barriers.
I hope that you’ve learned something new by reading this Medium article of mine, and have thought of ways to utilize this amazing new model.
Always be awesome!