Meet Meta’s Speech-to-Text, Text-to-Speech model for more than 1,100 languages

Sriram Parthasarathy
5 min read · May 23, 2023

Meta’s solution stands out by leveraging audio recordings of individuals reading translated texts from the New Testament in various languages. This innovative approach tackles the data scarcity issue for less common languages, allowing MMS to overcome that limitation.

Most of us have used an AI assistant on our phones. Speech recognition algorithms can understand natural language and allow us to interact with machines in a natural way. They have applications across various domains, such as education and entertainment, and help enhance productivity and minimize errors.

One notable application of machine learning speech recognition can be observed in the legal field. Lawyers and legal professionals routinely use this technology to transcribe legal documents and case notes, enabling them to dedicate more time to analysis and strategy. Speech recognition technology is also being employed in legal research to process and transcribe court proceedings, facilitating faster access to relevant information.

Lawyers benefit from speech-to-text technology, allowing them to efficiently transcribe legal documents, review case notes, and focus on analyzing and strategizing for their clients.

Let’s take an example from education. In classrooms, this technology is being used to transcribe lectures and discussions, providing students with accurate and accessible study materials. Additionally, speech recognition can be used in language learning applications, where it assists learners in improving pronunciation and fluency through real-time feedback.

Teachers leverage speech-to-text technology to enhance their instructional practices. It enables them to transcribe lectures, create accessible study materials, and provide real-time feedback to students for improved learning experiences.

Another example is the hospitality sector, which uses this technology to enhance customer experiences. In hotels, voice-activated assistants provide guests with information about amenities, local attractions, and room service options. The technology can also assist hotel staff in managing reservations and personalizing guest preferences, resulting in a seamless and efficient guest experience.

Hospitality workers utilize speech-to-text technology to enhance guest experiences. It assists in managing reservations, providing information about amenities, and streamlining communication for efficient customer service in the hospitality industry.

However, one of the primary challenges facing speech recognition technology is limited support for lesser-known or less widely spoken languages. This is a difficult obstacle to overcome, as it requires further advancements in language models and data collection efforts to ensure inclusivity and accessibility for diverse language communities worldwide.

In the past, constructing a highly multilingual model has proven challenging for several key reasons. The conventional approach to training such models relies heavily on abundant supervised data: large volumes of speech samples accompanied by transcriptions. This dependency significantly restricts the quantity of available training data, as manually generating transcriptions is both expensive and laborious. This is where Meta’s new announcement makes a huge difference.

Meta Platforms Inc.’s AI research team has released an open-source project called Massively Multilingual Speech (MMS) that can identify over 4,000 spoken languages and provide speech-to-text and text-to-speech for more than 1,100 of them. This is a significant advancement over previous technologies. What sets it apart is the innovative use of audio recordings of individuals reading translated texts from the New Testament in different languages. This creative approach addresses the challenge of limited data availability for less common languages, enabling MMS to overcome it.

Addressing the training data challenge

The world comprises over 7,000 known spoken languages, yet current speech models support only a small fraction of them, roughly 100. The primary challenge lies in acquiring labeled audio data for such a vast array of languages; even the largest existing speech datasets cover at most about 100 languages.

Meta addressed this challenge in a unique and interesting manner: they leveraged religious texts, such as the Bible, that have been translated into numerous languages. Using a dataset of New Testament readings in over 1,100 languages, they obtained an average of 32 hours of data per language. By also incorporating unlabeled recordings of various other Christian religious readings, the number of available languages was expanded to over 4,000.

Naturally, a mere 32 hours of data is insufficient to train a conventional supervised speech recognition model. This limitation led to the utilization of wav2vec 2.0, a self-supervised learning algorithm that allows machines to learn without depending on labeled training data.
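To make the idea concrete, here is a minimal sketch of what self-supervised pretraining buys you: a wav2vec 2.0 encoder produces contextual speech representations from raw audio alone, with no transcriptions involved. The sketch uses the Hugging Face transformers port of wav2vec 2.0 (an assumption for illustration; Meta’s original code lives in fairseq).

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# This checkpoint was pretrained purely on unlabeled audio.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# One second of placeholder 16 kHz audio; in practice, load a real clip.
audio = np.zeros(16_000, dtype=np.float32)
inputs = extractor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual representations learned without any transcriptions:
# shape (batch, time_frames, hidden_size), e.g. (1, 49, 768).
print(outputs.last_hidden_state.shape)
```

A thin task-specific head fine-tuned on top of these representations is what lets a few dozen hours of labeled speech go a long way.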

By leveraging wav2vec 2.0, it becomes feasible to train speech recognition models using significantly less data. The MMS (Massively Multilingual Speech) project employed this approach, training multiple self-supervised models on approximately 500,000 hours of speech data across more than 1,400 languages.
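The trained checkpoints have since been published openly. Below is a hedged sketch of running MMS speech-to-text through the transformers library, assuming the facebook/mms-1b-all checkpoint and its per-language adapter API from Meta’s release; details may differ from Meta’s own fairseq tooling.

```python
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # checkpoint name from Meta's release
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# MMS swaps in a small per-language adapter; "eng" selects English.
processor.tokenizer.set_target_lang("eng")
model.load_adapter("eng")

# The input should be a 16 kHz mono waveform; a silent placeholder is used here.
audio = np.zeros(16_000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding of the most likely token at each frame.
ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```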

Model performance

Meta’s groundbreaking research achieved a remarkable feat by training multilingual speech recognition models across an extensive range of over 1,100 languages. They utilized a powerful 1B parameter wav2vec 2.0 model for this purpose. As the number of languages increased from 61 to 1,107, there was a minor decline in performance, with just a 0.4% rise in character error rate. Nonetheless, the expansion in language coverage surpassed a staggering 17-fold increase.

In an enlightening comparison with OpenAI LP’s Whisper speech recognition model, Meta’s researchers made a remarkable discovery. The models trained on MMS data achieved a noteworthy reduction of approximately 50% in word error rate. This finding serves as compelling evidence of their model’s exceptional performance, surpassing existing leading speech models in the field.
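For readers unfamiliar with these metrics, word error rate (WER) and character error rate (CER) measure the edit distance between a model’s transcript and a reference, normalized by the reference length. A quick illustration with the jiwer library (the example sentences are invented, not from the paper):

```python
import jiwer  # pip install jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"

# One substituted word out of six reference words -> WER of about 16.7%.
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")
# CER counts character-level edits instead, so it comes out lower here.
print(f"CER: {jiwer.cer(reference, hypothesis):.1%}")
```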

From Meta’s paper. Word error rate of OpenAI Whisper compared to Massively Multilingual Speech on the 54 FLEURS languages that enable a direct comparison.

Conclusion

In conclusion, Meta AI’s Massively Multilingual Speech (MMS) represents an important milestone in speech recognition technology.
The unique aspect of their approach is how they solved the challenge of acquiring datasets for under-resourced languages. They used translations of the New Testament in over 1,100 languages and combined them with other datasets for training. The resulting models they released show impressive performance across a wide range of languages.

With the open-sourcing of the MMS dataset and tools, Meta is enabling researchers to make further advancements in this field. MMS has the potential to greatly impact businesses, enabling improved customer service, multilingual market reach, and the development of language-specific applications. Now with the availability of the code and tools, I am sure open source researchers will take this to the next level in the coming months.
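As a small taste of what the released tools make possible, here is a hedged sketch of MMS text-to-speech via the transformers VITS port, assuming the facebook/mms-tts-eng checkpoint name from Meta’s release:

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-eng"  # English TTS checkpoint from the MMS release
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Massively multilingual speech is here.", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (batch, samples)

# Save the synthesized audio at the model's native sampling rate.
scipy.io.wavfile.write(
    "mms_tts_demo.wav",
    rate=model.config.sampling_rate,
    data=waveform.squeeze().numpy(),
)
```

Swapping the language code in the checkpoint name is, per Meta’s release notes, how the same recipe extends to the other 1,100+ supported languages.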
