initiative · live · v0.1 · founding-phase · a3f9·c1b

Mbote. Building the AI infrastructure for 350M Bantu speakers.

Open datasets, NLP/ASR/TTS models and language tooling for the 500+ Bantu languages spoken across Central, East and Southern Africa — starting with Lingala.

languages · 500+ · in family
speakers · 350M · across africa
datasets · 1+ · in publication
bli://datasets/lingala/specimen.json · live
specimen / 001 · iso 639-3 · lin
Mbote
[mboˈte] · Lingala · CG · CD

family · Bantu (Niger-Congo)
speakers · ~350 million
languages · 500+
license · CC-BY-4.0

lingala-corpus-base · 62%
lingala-asr-seed · 14%
swahili-fr-parallel · 05%
huggingface.co/BantuLanguagesInitiative
↑ live system specimen
roadmap.feed · auto-refresh
[lin] Lingala · 70M speakers · wip
[swa] Swahili · 200M speakers · planned
[kon] Kikongo · 8M speakers · planned
[lua] Tshiluba · 6M speakers · planned
[kin] Kinyarwanda · 12M speakers · planned
[sna] Shona · 10M speakers · planned
[zul] Zulu · 12M speakers · planned
[lug] Luganda · 10M speakers · planned
[sot] Sesotho · 6M speakers · planned
[xho] Xhosa · 8M speakers · planned
[nya] Chichewa · 12M speakers · planned
[tso] Tsonga · 4M speakers · planned
scope.metrics · v2026.q2 · last sync · live
/ 02 · the absence, in numbers

The scale of an absence.

Hundreds of millions of speakers. Almost no representation in major AI systems. Both numbers are real.

speakers · 01/4
350M+
Speakers of Bantu languages
central · east · southern africa
languages · 02/4
500+
Distinct Bantu languages
lingala · swahili · zulu · shona · …
datasets · 03/4
0 / many
Datasets in publication
first releases land 2026
sectors · 04/4
6
Sectors of impact
fin · hlt · edu · gov · tec · cul
/ 03 · why this matters

An entire continent of voices — missing from the machine.

The largest language models in the world are trained on data that is overwhelmingly English, Mandarin, and a handful of other European languages. When the 350 million people who speak a Bantu language try to use a voice assistant, a chatbot, or a search engine, the systems often don't even pretend to listen.

It's not a quirk of the technology. It's a question of which voices were judged worth recording. We exist to change that answer — not by hand-waving about “AI for Africa”, but by shipping the dull, exact, foundational work: clean datasets, reproducible pipelines, public benchmarks.

If you can't train on it, evaluate on it, and improve it in the open — it doesn't exist for AI. We're changing that.

$ tree --bantu --depth=2        glottolog 5.x
proto-bantu/
├─ northwest/ [3]
│  ├─ duala.lang
│  ├─ bubi.lang
│  └─ yaoundé.lang
├─ central/ [3]
│  ├─ lingala.lang  ← active
│  ├─ kikongo.lang
│  └─ tshiluba.lang
├─ east/ [3]
│  ├─ swahili.lang
│  ├─ kikuyu.lang
│  └─ luganda.lang
└─ southern/ [4]
   ├─ zulu.lang
   ├─ xhosa.lang
   ├─ shona.lang
   └─ sesotho.lang
$ _
family · niger-congo · phylum · total ~500 languages
/ 04 · infrastructure pipeline

From a recorded voice in Kinshasa to an open dataset the world can train on.

01 · step

Collect

Texts, audio, transcriptions. Sourced from communities, archives, broadcasters and academic partners.

text · audio · metadata
02 · step

Process

Cleaning, normalization, annotation, alignment. Reproducible pipelines, public scripts, version-controlled.

pipeline · QA · alignment
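As a rough illustration of the cleaning and normalization step, here is a minimal Python sketch. It is not the initiative's published pipeline: the function names `normalize_line` and `clean_corpus` are hypothetical, and the choice of Unicode NFC normalization plus case-insensitive de-duplication is an assumption about what a first cleaning pass might include.

```python
import re
import unicodedata


def normalize_line(text: str) -> str:
    """Apply Unicode NFC normalization, then collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(lines):
    """Normalize each line, drop empty lines, remove case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for raw in lines:
        line = normalize_line(raw)
        key = line.casefold()
        if not line or key in seen:
            continue
        seen.add(key)
        cleaned.append(line)
    return cleaned


# Hypothetical input: the sample sentence from this page, with noise added.
sample = [
    "Mbote  na yo,\tndenge nini ozali?",
    "mbote na yo, ndenge nini ozali?",
    "   ",
]
print(clean_corpus(sample))
```

Keeping each pass as a small pure function is one way to make a pipeline reproducible: the same input lines always yield the same cleaned output, which is easy to check in CI.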
03 · step

Publish

Open release on Hugging Face, with model cards, license, data sheets, and contribution guidelines.

HF datasets · CC-BY-4.0
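Before an open release, a small metadata check can catch a missing license or a malformed language code. The sketch below is an assumption about what such a check could look like, not a description of the initiative's tooling: `REQUIRED_FIELDS`, `validate_release`, and the record layout are all hypothetical, while the name, version, language code, and license values are taken from this page.

```python
import json
import re

# Hypothetical minimum set of fields for a release record.
REQUIRED_FIELDS = ("name", "version", "language", "license")


def validate_release(meta: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not meta.get(f)]
    lang = meta.get("language", "")
    # ISO 639-3 codes are exactly three lowercase ASCII letters, e.g. "lin".
    if lang and not re.fullmatch(r"[a-z]{3}", lang):
        problems.append(f"language is not an ISO 639-3 code: {lang!r}")
    return problems


release = {
    "name": "lingala-corpus-base",
    "version": "0.1",
    "language": "lin",
    "license": "CC-BY-4.0",
}

print(validate_release(release))
print(json.dumps(release, indent=2))
```

Returning a list of problems rather than raising on the first one lets a contributor fix every issue in a single pass.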
04 · step

Integrate

Reference NLP, ASR, and TTS models. APIs, SDKs, and partnerships with builders across the continent.

models · API · tools
/ 05 · sectors of impact

What working language models unlock.

The infrastructure is foundational. The applications it enables touch every part of public and economic life.

FIN · 01 / 6

Banking & Finance

Voice-enabled KYC, mobile money interfaces and chatbots in the languages clients actually speak.

HLT · 02 / 6

Health

Voice-based medical assistants, diagnostic translation, accessible patient interfaces in remote clinics.

EDU · 03 / 6

Education

Mother-tongue learning tools, literacy apps, augmented teacher resources for primary schools.

GOV · 04 / 6

Governance

Public services accessible in local languages: official translation, citizen-facing AI agents.

TEC · 05 / 6

Tech & Startups

Drop-in APIs and SDKs for African builders shipping localized apps to local markets.

CUL · 06 / 6

Media & Culture

Subtitling, archival, language preservation. Tools for journalists, broadcasters and storytellers.

/ 06 · first language

Why we're starting with Lingala.

Lingala is the lingua franca of Kinshasa — a city of fifteen million and one of the fastest-growing in the world — and of much of the Congo River basin. It is sung across Africa, broadcast on dozens of radio stations, and used daily in markets, offices, schools and ministries.

It is also functionally invisible to the AI systems that increasingly mediate banking, education and healthcare. Starting here is a choice with weight: a major African language, a rich oral and written tradition, and a measurable gap we can close.

Language fact sheet · Lingala
70M speakers across Central Africa
Speakers · 70 million+
Status · National language in DRC & Republic of Congo
Family · Bantu / Niger-Congo
Code (ISO 639-3) · lin
NLP coverage · Near-zero in major LLMs
Sample

“Mbote na yo, ndenge nini ozali?”

Hello, how are you? · [mbote na jo, ndenge nini oˈzali]

/ 07 · system.activity

Built in public. Logged in public.

2026.06 · RELEASE

lingala-corpus-base · v0.1

First open Lingala text corpus enters preview. Cleaning and tokenization scripts published alongside.

2026.05 · INFRA

Pipeline scaffolding online

Reproducible processing pipelines + CI on the public GitHub organization. Contributor guide drafted.

2026.04 · MILESTONE

Initiative founded

BantuLanguages Initiative formally established. Founding hubs in Brazzaville and Kinshasa.

2026.Q3 · RESEARCH

State-of-Bantu-NLP report (in progress)

Public foundation report mapping the gap between current LLM coverage and the Bantu language family.

/ 08 · join

Be part of the foundation.

Researchers, developers, linguists, foundations, ministries — we're building a community as multilingual as the languages it serves. Get the next dataset releases, calls for contribution, and research notes in your inbox.

We send no more than one email per month. No tracking, no sale of data — this is an open-source non-profit.