bli://research · v0.1
/ Research

A research agenda for the long horizon.

Datasets are the foundation. The research agenda is what they enable: NLP, ASR, MT and TTS systems for languages that today have no working models worthy of the name, all released in the open.

/ 01 · Agenda

Four tracks, one direction.

NLP · track 01

Natural Language Processing

Foundation models, tokenizers and benchmarks adapted to the morphology of Bantu languages — including noun-class agreement and tonal phenomena.

  • Tokenizer R&D
  • Morphological analyzers
  • Public benchmarks

ASR · track 02

Automatic Speech Recognition

Speech-to-text systems robust to dialectal variation, code-switching with French and English, and the limited audio resources available today.

  • Seed corpora
  • Acoustic models
  • Code-switching evaluation

MT · track 03

Machine Translation

Open translation models between Bantu languages and major European languages, with parallel corpora and shared evaluation suites.

  • Parallel corpora
  • Reference models
  • FLORES-style benchmarks

TTS · track 04

Text-to-Speech

Voice synthesis that respects tone, pronunciation and prosody — with native voice contributors compensated and credited.

  • Voice banks
  • Tonal modeling
  • Voice contributor framework

/ 02 · Publications

In writing.

The first publications are in preparation. Drafts will be shared on the GitHub organization as they progress.

2026 · Coming · Report

A foundation report on the state of Bantu languages in modern NLP.

2026 · Coming · Dataset paper

Lingala-001: a baseline corpus and evaluation suite.

2027 · Planned · Methods

Tokenization strategies for agglutinative Bantu morphology.

/ 03 · Academic collaboration

Looking for university partners, especially in Africa.

We're building working relationships with linguistics departments, computer science labs, and applied research groups across Central and Southern Africa, and with diaspora researchers everywhere else. If your team is working on African NLP or wants to, we'd like to talk.