The data layer. Public, documented, reusable.

Every release lives on Hugging Face under BantuLanguagesInitiative. Source code and pipelines are on GitHub. Standards are documented and enforced.

Hosted on: Hugging Face
Default license: CC-BY-4.0
Code: GitHub · Open
First release: 2026

/ 01 · Catalogue

What's in the pipeline.

View on Hugging Face ↗

lingala-corpus-base

○ in progress

First open text corpus of contemporary Lingala, drawn from public sources, broadcasts, and community contributions.

Language: Lingala
Type · Size: text · —

Open ↗

lingala-asr-seed

○ planned

Seed speech corpus to bootstrap automatic speech recognition for Lingala.

Language: Lingala
Type · Size: audio + transcript · —

Open ↗

swahili-french-parallel

○ planned

Bilingual parallel sentences for machine translation training and evaluation.

Language: Swahili / French
Type · Size: parallel corpus · —

Open ↗

kikongo-lexicon-base

○ planned

Foundational lexical database with phonetic and morphological annotations.

Language: Kikongo
Type · Size: lexicon · —

Open ↗

/ 02 · How to contribute

Quality standards.

We'd rather have one well-documented dataset than ten unusable scrapes. Every contribution is held to the same bar.

GitHub · Contribute ↗

Propose a dataset

Document the source.

Every dataset ships with a data sheet: provenance, collection method, consent, known biases.
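
As a rough illustration of how the four data sheet items could be checked before a submission, here is a minimal Python sketch. The `DataSheet` class, its field names, and the `missing_fields` helper are assumptions made for this example, not the project's actual schema or tooling, and the values are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass
class DataSheet:
    """Hypothetical data sheet record; field names are illustrative only."""
    provenance: str
    collection_method: str
    consent: str
    known_biases: str


def missing_fields(sheet: DataSheet) -> list[str]:
    """Return the names of data sheet fields left empty or whitespace-only."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Placeholder values, not a description of any real dataset.
sheet = DataSheet(
    provenance="Public radio broadcasts (placeholder)",
    collection_method="Manual transcription (placeholder)",
    consent="",  # left empty on purpose to show the check
    known_biases="Urban speakers over-represented (placeholder)",
)
print(missing_fields(sheet))  # → ['consent']
```

A check like this only confirms that each item has been filled in; whether the description is accurate still needs human review.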

Make it reproducible.

Cleaning and processing scripts live in the public repo. Anyone can rerun the pipeline.

License for reuse.

Default to CC-BY-4.0. If a more restrictive license is needed, we explain why on the dataset card.
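
The licensing rule could be expressed as a small check on a dataset card's metadata. The sketch below is an assumption-laden example: the `license_rationale` key and the `license_problems` helper are hypothetical, and the check flags any license other than the default rather than judging whether it is more restrictive.

```python
DEFAULT_LICENSE = "cc-by-4.0"


def license_problems(card: dict) -> list[str]:
    """Check card metadata against the default-license rule.

    `license_rationale` is a hypothetical key used only in this sketch.
    """
    problems = []
    license_id = str(card.get("license", "")).strip().lower()
    rationale = str(card.get("license_rationale", "")).strip()
    if not license_id:
        problems.append("no license declared")
    elif license_id != DEFAULT_LICENSE and not rationale:
        # Any non-default license needs an explanation on the card.
        problems.append(f"{license_id} differs from the default and no rationale is given")
    return problems


print(license_problems({"license": "CC-BY-4.0"}))  # → []
print(license_problems({"license": "cc-by-nc-4.0"}))
```

The second call reports one problem because the card names a non-default license without explaining why.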

Credit contributors.

Contributors are named on the dataset page. Communities providing data are consulted and acknowledged.