A research agenda for the long horizon.
Datasets are the foundation. The research agenda is what they enable: NLP, ASR, MT and TTS systems for languages that today have no working models worth the name — all released in the open.
Four tracks, one direction.
Natural Language Processing
Foundation models, tokenizers and benchmarks adapted to the morphology of Bantu languages — including noun-class agreement and tonal phenomena.
- Tokenizer R&D
- Morphological analyzers
- Public benchmarks
Automatic Speech Recognition
Speech-to-text systems robust to dialectal variation, code-switching with French and English, and the limited audio resources available today.
- Seed corpora
- Acoustic models
- Code-switching evaluation
Machine Translation
Open translation models between Bantu languages and major European languages, with parallel corpora and shared evaluation suites.
- Parallel corpora
- Reference models
- FLORES-style benchmarks
Text-to-Speech
Voice synthesis that respects tone, pronunciation and prosody — with native voice contributors compensated and credited.
- Voice banks
- Tonal modeling
- Voice contributor framework
In writing.
The first publications are in preparation. Drafts will be shared on the GitHub organization as they progress.
Lingala-001: a baseline corpus and evaluation suite.
Dataset paper
Tokenization strategies for agglutinative Bantu morphology.
Methods
Looking for university partners, especially in Africa.
We're building working relationships with linguistics departments, computer science labs, and applied research groups across Central and Southern Africa, and with diaspora researchers everywhere else. If your team is working on African NLP or wants to, we'd like to talk.