bli://thelanguagesweservev0.1
/ The languages we serve

Five hundred languages. One family.

The Bantu language family is one of the largest in the world, spread from Cameroon to South Africa over millennia of migration. We start with the most-spoken and most under-served, and work outward.

Family
01
Bantu
Phylum
02
Niger-Congo
Languages
03
500+
Speakers
04
350M+
/ 01 · Distribution

Spoken from the Atlantic to the Indian Ocean.

Bantu languages emerged in present-day Cameroon and Nigeria around 3,000–4,000 years ago. The great Bantu expansion carried them south and east, where they evolved into more than 500 distinct languages spoken today across Central, East and Southern Africa.

Despite their reach, only Swahili enjoys meaningful representation in modern NLP datasets. The rest — Lingala, Kikongo, Tshiluba, Shona, Sesotho and hundreds more — are functionally absent.

Bantu language regions · approximateLingala
/ 02 · Languages in scope

Where we are, and where we're going.

Roadmap is rolling. New languages get prioritized as community contributors and partners come on board.

Lingala

ISO 639-3 · lin
In progress
RegionDR Congo · Republic of Congo
Speakers70M+
First language in publication

Swahili

ISO 639-3 · swa
Planned
RegionEast Africa · Lingua franca
Speakers200M+
Most spoken Bantu language

Kikongo

ISO 639-3 · kon
Planned
RegionDR Congo · Angola · Congo
Speakers8M+

Tshiluba

ISO 639-3 · lua
Planned
RegionDR Congo (Kasai)
Speakers6M+

Kinyarwanda

ISO 639-3 · kin
Planned
RegionRwanda · DR Congo
Speakers12M+

Shona

ISO 639-3 · sna
Planned
RegionZimbabwe · Mozambique
Speakers10M+

Zulu

ISO 639-3 · zul
Planned
RegionSouth Africa
Speakers12M+

Luganda

ISO 639-3 · lug
Planned
RegionUganda
Speakers10M+

Sesotho

ISO 639-3 · sot
Planned
RegionLesotho · South Africa
Speakers6M+
First language

A close look at Lingala.

Open dataset on Hugging Face ↗
Phonology

Seven-vowel system (i, e, ɛ, a, ɔ, o, u). Tonal: high vs. low tones distinguish meaning. Limited consonant clusters.

Morphology

Agglutinative. Noun-class system characteristic of Bantu languages, with paired singular/plural prefixes and class-agreement on verbs and adjectives.

Orthography

Latin alphabet, with several conventions in use. Tonal marking is inconsistent in everyday writing — a known challenge for ASR and TTS systems.

NLP challenges

Sparse high-quality corpora, code-switching with French, dialectal variation between Kinshasa and upriver Lingala, and limited speech data.