Five hundred languages. One family.
The Bantu language family is one of the largest in the world, spread from Cameroon to South Africa over millennia of migration. We start with the most-spoken and most under-served, and work outward.
- Family: Bantu
- Phylum: Niger-Congo
- Languages: 500+
- Speakers: 350M+
Spoken from the Atlantic to the Indian Ocean.
Bantu languages emerged in present-day Cameroon and Nigeria around 3,000–4,000 years ago. The great Bantu expansion carried them south and east, where they evolved into more than 500 distinct languages spoken today across Central, East and Southern Africa.
Despite their reach, only Swahili enjoys meaningful representation in modern NLP datasets. The rest, including Lingala, Kikongo, Tshiluba, Shona, Sesotho and hundreds more, are functionally absent.
Where we are, and where we're going.
The roadmap is rolling: new languages are prioritized as community contributors and partners come on board.
Swahili
Kikongo
Tshiluba
Kinyarwanda
Shona
Zulu
Luganda
Sesotho
Seven-vowel system (i, e, ɛ, a, ɔ, o, u). Tonal: high vs. low tones distinguish meaning. Limited consonant clusters.
Agglutinative. Noun-class system characteristic of Bantu languages, with paired singular/plural prefixes and class agreement on verbs and adjectives.
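To make the paired-prefix idea concrete, here is a minimal sketch in Python using a few common Lingala nouns. The table, the class labels, and the `pluralize` helper are illustrative assumptions for this example, not a full description of the noun-class system, which has more classes, irregular nouns, and sound changes that this ignores.

```python
# Simplified, illustrative singular/plural prefix pairs for four noun classes.
# This table is an assumption for the sketch, not a complete grammar.
NOUN_CLASS_PAIRS = {
    "1/2": ("mo", "ba"),  # moto (person)    -> bato
    "3/4": ("mo", "mi"),  # mokanda (letter) -> mikanda
    "5/6": ("li", "ma"),  # liloba (word)    -> maloba
    "7/8": ("e", "bi"),   # eloko (thing)    -> biloko
}


def pluralize(noun: str, noun_class: str) -> str:
    """Swap the singular class prefix for its plural partner."""
    singular, plural = NOUN_CLASS_PAIRS[noun_class]
    if not noun.startswith(singular):
        raise ValueError(f"{noun!r} does not start with prefix {singular!r}")
    return plural + noun[len(singular):]


for noun, noun_class in [("moto", "1/2"), ("mokanda", "3/4"),
                         ("liloba", "5/6"), ("eloko", "7/8")]:
    print(noun, "->", pluralize(noun, noun_class))
```

The class is passed explicitly because the prefix alone is ambiguous: a singular in mo- may pluralize with ba- or mi- depending on its class, which is one reason rule-based tooling for these languages needs a lexicon as well as prefix rules.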
Latin alphabet, with several orthographic conventions in use. Tone marking is inconsistent in everyday writing, a known challenge for ASR and TTS systems.
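One common way to cope with inconsistent tone marking in text pipelines is to build a tone-insensitive key, so that marked and unmarked spellings of the same word match. The sketch below assumes tone is written with combining acute, grave, circumflex, or caron marks. That set and the `strip_tone_marks` helper are assumptions for illustration, not a description of any particular project's preprocessing.

```python
import unicodedata

# Combining marks assumed to carry tone in this sketch:
# acute, grave, circumflex, caron.
TONE_MARKS = {"\u0301", "\u0300", "\u0302", "\u030c"}


def strip_tone_marks(text: str) -> str:
    """Return text with tone diacritics removed and base letters kept."""
    # NFD splits precomposed letters such as "á" into "a" + combining acute.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # Recompose whatever remains into canonical form.
    return unicodedata.normalize("NFC", kept)


print(strip_tone_marks("ling\u00e1la"))                # lingala
print(strip_tone_marks("m\u0254\u0301k\u0254\u0301"))  # open-o vowels survive, accents do not
```

The open vowels ɛ and ɔ are separate base letters, not diacritics, so they pass through unchanged. Stripping tone is lossy: it suits search and deduplication keys, but a TTS or ASR system that needs tone should keep the original text alongside the key.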
Sparse high-quality corpora, code-switching with French, dialectal variation between Kinshasa and upriver Lingala, and limited speech data.