A research agenda for the long horizon.

Datasets are the foundation. The research agenda is what they enable: NLP, ASR, MT and TTS systems for languages that today have no working models worth the name — all released in the open.

/ 01 · Agenda

Four tracks, one direction.

NLP · track 01

Natural Language Processing

Foundation models, tokenizers and benchmarks adapted to the morphology of Bantu languages — including noun-class agreement and tonal phenomena.

Tokenizer R&D
Morphological analyzers
Public benchmarks

ASR · track 02

Automatic Speech Recognition

Speech-to-text systems robust to dialectal variation, code-switching with French and English, and the limited audio resources available today.

Seed corpora
Acoustic models
Code-switching evaluation

MT · track 03

Machine Translation

Open translation models between Bantu languages and major European languages, with parallel corpora and shared evaluation suites.

Parallel corpora
Reference models
FLORES-style benchmarks

TTS · track 04

Text-to-Speech

Voice synthesis that respects tone, pronunciation and prosody — with native voice contributors compensated and credited.

Voice banks
Tonal modeling
Voice contributor framework

/ 02 · Released

Shipped, not just planned.

Our first model is out. More follow as the datasets mature.

ASR · LIVE

BLI ASR 0

Automatic speech recognition for Lingala — Whisper large-v3 adapted with LoRA, trained on the Waxal Lingala corpus.

CER · normalized

0.1703
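
For readers new to the metric: character error rate (CER) is the character-level edit distance between the model's transcript and the reference transcript, divided by the length of the reference. "Normalized" means both texts are cleaned before comparison. The sketch below shows one way to compute it. The normalization steps (Unicode NFC, lowercasing, dropping punctuation, collapsing whitespace) and the function names are assumptions for illustration; the exact normalization behind the figure above is not described on this page.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: NFC, lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Unicode categories starting with "P" are punctuation marks.
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return " ".join("".join(kept).split())


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate after normalization (assumed definition)."""
    ref_n, hyp_n = normalize(reference), normalize(hypothesis)
    if not ref_n:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_n, hyp_n) / len(ref_n)
```

Under this definition, a CER of 0.1703 corresponds to roughly 17 character edits per 100 reference characters. Differences in casing or punctuation alone do not count as errors here, because they are removed before the comparison.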

Read the model page →

/ 03 · Publications

In writing.

The first publications are in preparation. Drafts will be shared on the GitHub organization as they progress.

2026

Coming

A foundation report on the state of Bantu languages in modern NLP.

Report

2026

Coming

Lingala-001: a baseline corpus and evaluation suite.

Dataset paper

2027

Planned

Tokenization strategies for agglutinative Bantu morphology.

Methods

/ 04 · Academic collaboration

Looking for university partners, especially in Africa.

We're building working relationships with linguistics departments, computer science labs, and applied research groups across Central and Southern Africa, and with diaspora researchers everywhere else. If your team is working on African NLP or wants to, we'd like to talk.

Reach out about research →