Wikilangs - Open NLP for 340+ Wikipedia Languages 🌐

omarkamali · March 8, 2026, 11:18am

Hi everyone! I’m Omar, an ML/NLP researcher from Berlin, and I’ve been building Wikilangs: open-source NLP infrastructure and models trained on Wikipedia across 340+ languages, including many that have little to no existing model coverage.

The project currently provides, for each supported language:

Vocabulary, a word list with usage frequencies
Custom tokenizers, trained on native Wikipedia text per language
N-gram language models, lightweight, fast, usable offline
Word embeddings, cross-lingual vector spaces, monolingual and English-aligned
Markov chains, to generate all the nonsensical text you ever dreamed of
Morphological tokenizers, to make stemming easier (uses an experimental statistical approach)
And a Wordle-like game to kill time while your models are training
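
To give a feel for what the vocabulary and Markov chain resources above are conceptually, here is a toy sketch in plain Python. It does not use the wikilangs package or its API, and it is not how the published models are built; the function names (build_vocabulary, build_markov_chain, generate) and the tiny corpus are made up purely for illustration.

```python
import random
from collections import Counter, defaultdict


def build_vocabulary(tokens):
    """Count how often each word occurs (a word list with frequencies)."""
    return Counter(tokens)


def build_markov_chain(tokens):
    """Map each word to a Counter of the words observed right after it."""
    chain = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        chain[current][following] += 1
    return chain


def generate(chain, start, length, seed=0):
    """Random walk over the chain, sampling followers by observed frequency."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        followers = chain.get(words[-1])
        if not followers:  # dead end: this word was never followed by anything
            break
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return words


# Hypothetical toy corpus; a real one would be Wikipedia text for one language.
corpus = "the cat sat on the mat and the cat slept"
tokens = corpus.split()

vocab = build_vocabulary(tokens)
chain = build_markov_chain(tokens)
sample = generate(chain, "the", 6)

print(vocab.most_common(2))
print(" ".join(sample))
```

Whitespace splitting is only a stand-in here. For many languages it is a poor tokenizer, which is exactly why per-language tokenizers matter.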

You can explore everything at wikilangs.org or install directly:

pip install wikilangs

The hub page is at huggingface.co/wikilangs. Each language has its own model card with download stats, training corpus size, and evaluation notes.

This project builds on my dataset omarkamali/wikipedia-monthly, which publishes a monthly text corpus for every language on Wikipedia (3 years ahead of the official Wikipedia dataset on HF).

Why I’m posting here: I’d love to connect with researchers, engineers, and community members working on NLP for any of these languages, especially low-resource ones. If you’re working on African languages, Arabic, indigenous languages, creoles and pidgins, or any other underrepresented language, I’d genuinely love to hear what’s missing, what’s broken, and what would make these resources actually useful for your work.

A few things I’m actively looking for:

Native speakers / language experts who can help validate tokenization quality
Collaborators interested in building downstream projects (LLMs, search and semantics, morphological tokenization, user-facing apps and games …)
Feedback on any specific language’s model quality, or ideas on how to take the project further

Drop a reply and introduce yourself: which language(s) are you working on?
