Hi everyone! I'm Omar, an ML/NLP researcher from Berlin, and I've been building Wikilangs: open-source NLP infrastructure and models trained on Wikipedia across 340+ languages, including many that have little to no existing model coverage.
The project currently provides, for each supported language:
- Vocabulary: a word list with usage frequencies
- Custom tokenizers: trained on native Wikipedia text per language
- N-gram language models: lightweight, fast, usable offline
- Word embeddings: cross-lingual vector spaces, both monolingual and English-aligned
- Markov chains: to generate all the nonsensical text you ever dreamed of
- Morphological tokenizers: to make stemming easier (uses an experimental statistical approach)
- And a Wordle-like game to kill time while your models are training
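For anyone unfamiliar with the simpler artifacts in this list, here is a toy sketch in plain Python of what a frequency vocabulary and a bigram Markov chain are. It does not use the `wikilangs` package or reflect its API. The corpus string and the names `vocabulary`, `transitions`, and `generate` are made up for illustration.

```python
import random
from collections import Counter, defaultdict

# Toy corpus standing in for Wikipedia text (illustrative only).
corpus = "the cat sat on the mat and the cat ate the fish"
tokens = corpus.split()

# Vocabulary: each word paired with its usage frequency.
vocabulary = Counter(tokens)

# Bigram counts: for each word, how often each next word follows it.
transitions = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    transitions[current][following] += 1


def generate(start, length, seed=0):
    """Walk the bigram chain from `start`, sampling next words by frequency."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        followers = transitions.get(words[-1])
        if not followers:  # dead end: this word never had a successor
            break
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words


print(vocabulary.most_common(3))
print(" ".join(generate("the", 8)))
```

Real n-gram models add smoothing and longer contexts, but the counting idea is the same, which is why these models stay small and work offline.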

You can explore everything at wikilangs.org or install directly:
pip install wikilangs
The hub page is at huggingface.co/wikilangs. Each language has its own model card with download stats, training corpus size, and evaluation notes.
This project builds on my dataset omarkamali/wikipedia-monthly, which publishes a monthly text corpus for every language on Wikipedia (3 years ahead of the official Wikipedia dataset on HF).
Why I'm posting here: I'd love to connect with researchers, engineers, and community members working on NLP for any of these languages, especially low-resource ones. If you're working on African languages, Arabic, indigenous languages, creoles and pidgins, or any other underrepresented language, I'd genuinely love to hear what's missing, what's broken, and what would make these resources actually useful for your work.
A few things I'm actively looking for:
- Native speakers / language experts who can help validate tokenization quality
- Collaborators interested in building downstream projects (LLMs, search and semantics, morphological tokenization, user-facing apps and games, …)
- Feedback on any specific language's model quality, or ideas on how to take the project further
Drop a reply and introduce yourself: which language(s) are you working on?