bli://opendatasets v0.1
/ Open datasets

The data layer. Public, documented, reusable.

Every release lives on Hugging Face under BantuLanguagesInitiative. Source code and pipelines on GitHub. Standards documented and enforced.

01 · Hosted on: Hugging Face
02 · Default license: CC-BY-4.0
03 · Code: GitHub · Open
04 · First release: 2026

/ 01 · Catalogue

What's in the pipeline.


lingala-corpus-base

○ in progress

First open text corpus of contemporary Lingala, drawn from public sources, broadcasts, and community contributions.

Language: Lingala
Type: text
Size: not listed

lingala-asr-seed

○ planned

Seed speech corpus to bootstrap automatic speech recognition for Lingala.

Language: Lingala
Type: audio + transcript
Size: not listed

swahili-french-parallel

○ planned

Bilingual parallel sentences for machine translation training and evaluation.

Language: Swahili / French
Type: parallel corpus
Size: not listed

kikongo-lexicon-base

○ planned

Foundational lexical database with phonetic and morphological annotations.

Language: Kikongo
Type: lexicon
Size: not listed
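
Once a release is published, loading it would presumably follow the usual Hugging Face pattern. The sketch below assumes repository ids take the form `BantuLanguagesInitiative/<dataset-name>` and that a `train` split exists. Neither is confirmed on this page, and no catalogue entry is released yet. `repo_id` and `load_release` are hypothetical helper names.

```python
ORG = "BantuLanguagesInitiative"  # organisation name as given on this page


def repo_id(name: str) -> str:
    """Build an ASSUMED Hub repository id from a catalogue name."""
    return f"{ORG}/{name}"


def load_release(name: str, split: str = "train"):
    """Load a published release (needs `pip install datasets` and network access).

    The split name "train" is an assumption, not something this page states.
    """
    from datasets import load_dataset  # lazy import so repo_id works without it

    return load_dataset(repo_id(name), split=split)
```

For example, `load_release("lingala-corpus-base")` would try to fetch `BantuLanguagesInitiative/lingala-corpus-base`, and will fail until that repository actually exists.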
/ 02 · How to contribute

Quality standards.

We'd rather have one well-documented dataset than ten unusable scrapes. Every contribution is held to the same bar.

01 · Document the source.

Every dataset ships with a data sheet: provenance, collection method, consent, known biases.
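
One way to keep the data sheet checkable is to treat it as a small structured record. The sketch below is illustrative only: the field names mirror the four items listed above, while the `DataSheet` class and `missing_fields` helper are hypothetical and not part of any published tooling.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DataSheet:
    """Hypothetical record for the four items a data sheet must cover."""
    provenance: str
    collection_method: str
    consent: str
    known_biases: list[str] = field(default_factory=list)


def missing_fields(sheet: DataSheet) -> list[str]:
    """Return the names of fields left empty (empty string or empty list)."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name)]


# Illustrative values only, not taken from a real dataset.
draft = DataSheet(
    provenance="public radio broadcasts",
    collection_method="manual transcription",
    consent="",
)
print(missing_fields(draft))
```

A contribution check could reject any sheet for which `missing_fields` returns a non-empty list.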

02 · Make it reproducible.

Cleaning and processing scripts live in the public repo. Anyone can rerun the pipeline.
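
What "rerunnable" can mean in practice: a deterministic cleaning step plus a fingerprint of its output, so two people running the same script on the same input can compare results. This is a minimal sketch, not the project's actual pipeline. The cleaning rules (NFC normalisation, whitespace collapsing, order-preserving de-duplication) and all function names are assumptions chosen for illustration.

```python
import hashlib
import unicodedata


def clean_line(line: str) -> str:
    """Normalise to Unicode NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def run_pipeline(lines: list[str]) -> list[str]:
    """Clean each line, drop empties, and de-duplicate while keeping order."""
    seen = set()
    out = []
    for raw in lines:
        cleaned = clean_line(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            out.append(cleaned)
    return out


def fingerprint(lines: list[str]) -> str:
    """SHA-256 hex digest of the output, for comparing two runs."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Illustrative input only.
raw = ["  Mbote   na  yo ", "Mbote na yo", "", "Sango nini?"]
cleaned = run_pipeline(raw)
```

Publishing the fingerprint alongside a release would let anyone confirm that a rerun produced the same output.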

03 · License for reuse.

Default to CC-BY-4.0. If a more restrictive license is needed, we explain why on the dataset card.
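
The license rule can be expressed as a small check over a dataset card's metadata. The sketch below assumes the metadata has already been parsed into a dictionary and uses `cc-by-4.0`, the identifier Hugging Face uses for this license. The `license_rationale` key is hypothetical, standing in for wherever the explanation lives on the card. For simplicity it flags any license other than the default, not only more restrictive ones.

```python
DEFAULT_LICENSE = "cc-by-4.0"


def license_problems(card_metadata: dict) -> list[str]:
    """Return a list of problems with the card's license declaration."""
    problems = []
    license_id = (card_metadata.get("license") or "").lower()
    if not license_id:
        problems.append("no license declared")
    elif license_id != DEFAULT_LICENSE and not card_metadata.get("license_rationale"):
        # "license_rationale" is a hypothetical key, not a Hub standard.
        problems.append(
            f"{license_id} differs from {DEFAULT_LICENSE} but no rationale is given"
        )
    return problems
```

A card declaring `cc-by-nc-4.0` without a rationale would yield one problem, and the same card with a rationale would yield none.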

04 · Credit contributors.

Contributors are named on the dataset page. Communities providing data are consulted and acknowledged.