lingala-corpus-base
○ in progress
First open text corpus of contemporary Lingala, drawn from public sources, broadcasts, and community contributions.
Every release lives on Hugging Face under BantuLanguagesInitiative. Source code and pipelines on GitHub. Standards documented and enforced.
Seed speech corpus to bootstrap automatic speech recognition for Lingala.
Bilingual parallel sentences for machine translation training and evaluation.
Foundational lexical database with phonetic and morphological annotations.
We'd rather have one well-documented dataset than ten unusable scrapes. Every contribution is held to the same bar.
Every dataset ships with a data sheet: provenance, collection method, consent, known biases.
Cleaning and processing scripts live in the public repo. Anyone can rerun the pipeline.
We default to CC-BY-4.0. If a more restrictive license is needed, we explain why on the dataset card.
Contributors are named on the dataset page. Communities providing data are consulted and acknowledged.