Omnilingual ASR: Advancing Automatic Speech Recognition for 1,600+ Languages

Meta has unveiled a revolutionary leap in speech technology with the introduction of Omnilingual Automatic Speech Recognition (ASR), a comprehensive suite of models that delivers high-quality speech-to-text capabilities for over 1,600 languages. This breakthrough includes support for 500 low-resource languages previously untranscribed by AI, achieving state-of-the-art performance at an unprecedented scale. By transcending the limitations of traditional ASR systems that focus mainly on high-resource languages, Omnilingual ASR aims to bridge the digital divide and make speech technology universally accessible.

*Figure: Omnilingual ASR performance*

The architecture behind Omnilingual ASR features two key innovations: a massively scaled wav2vec 2.0 speech encoder with 7 billion parameters, and two decoder variants—one using connectionist temporal classification (CTC) and another using a transformer decoder inspired by large language models (LLMs). The LLM-inspired decoder delivers a marked performance gain, especially for long-tail languages, achieving a character error rate (CER) below 10 for 78% of the supported languages.
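To make the CTC variant concrete, here is a minimal, self-contained sketch of CTC greedy decoding: the encoder emits one label per audio frame, and the decoder collapses consecutive repeats and removes the blank symbol to recover the transcript. The function and blank symbol below are illustrative assumptions, not part of Meta's released code.

```python
BLANK = "_"  # CTC blank token (illustrative choice of symbol)

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame CTC label sequence into a transcript:
    merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev:        # collapse consecutive repeats
            if label != BLANK:   # drop the blank token
                out.append(label)
        prev = label
    return "".join(out)

# Example: per-frame argmax output spelling "hello";
# the blank between the two l's keeps them distinct.
frames = ["h", "h", "_", "e", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_collapse(frames))  # -> "hello"
```

The blank token is what lets CTC represent genuinely doubled characters: without the `_` between the two runs of `l`, they would collapse into a single letter.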

*Figure: character error rates across supported languages*

A standout feature of Omnilingual ASR is its community-driven “Bring Your Own Language” framework. Unlike conventional systems that require expert fine-tuning, this model allows users to extend support to new languages with just a few audio-text samples. By incorporating in-context learning from LLMs, it enables speakers of unsupported languages to achieve usable transcription quality without extensive data, expertise, or high-end computing resources.
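The few-shot mechanism can be pictured as prompt assembly: paired audio-transcript examples are interleaved ahead of the target utterance so the LLM-style decoder can condition on them in context. The data structures and function below are a hypothetical sketch of that layout, not Meta's actual API.

```python
from dataclasses import dataclass

@dataclass
class Example:
    audio_path: str   # path to a short in-language recording
    transcript: str   # its ground-truth text

def build_context(examples, target_audio):
    """Interleave few-shot (audio, text) pairs, ending with the
    target audio whose transcript the model should generate."""
    context = []
    for ex in examples:
        context.append(("audio", ex.audio_path))
        context.append(("text", ex.transcript))
    context.append(("audio", target_audio))  # model completes the text for this
    return context

# Two illustrative Kinyarwanda shots followed by the query utterance.
shots = [Example("greeting.wav", "mwaramutse"),
         Example("thanks.wav", "murakoze")]
ctx = build_context(shots, "query.wav")
print(len(ctx))  # 2 entries per shot + 1 target = 5
```

The key point is that extending to a new language changes only this context, not the model weights: no fine-tuning, labeled corpus, or GPU cluster is required.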

*Figure: in-context learning for new languages*

The release includes a versatile suite of models tailored to different use cases, ranging from lightweight 300M variants for low-power devices to the 7B models for maximum accuracy. All assets are open-sourced under permissive licenses, including the Omnilingual wav2vec 2.0 foundation model for broader speech tasks and the Omnilingual ASR Corpus—a large collection of transcribed speech in 350 underserved languages.

Developed in collaboration with global partners such as Mozilla Foundation’s Common Voice and Lanfrica/NaijaVoices, Omnilingual ASR integrates deep linguistic and cultural insights. The training corpus, one of the largest and most diverse ever assembled, combines public datasets with community-sourced recordings from remote regions. This initiative empowers researchers, developers, and language advocates worldwide to advance speech technology and foster inclusive communication across linguistic barriers.