Wikilangs - Open NLP for 340+ Wikipedia Languages 🌍

Hi everyone! I'm Omar, an ML/NLP researcher from Berlin, and I've been building Wikilangs: open-source NLP infrastructure and models trained on Wikipedia across 340+ languages, including many that have little to no existing model coverage. 🌎🌍🌏

The project currently provides, for each supported language:

  • Vocabulary, a word list with usage frequencies

  • Custom tokenizers, trained on native Wikipedia text per language

  • N-gram language models, lightweight, fast, usable offline

  • Word embeddings, cross-lingual vector spaces, monolingual and English-aligned

  • Markov chains, to generate all the nonsensical text you ever dreamed of

  • Morphological tokenizers, to make stemming easier (uses an experimental statistical approach)

  • And a Wordle-like game to kill time while your models are training 🙂
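
To give a feel for what the vocabulary and Markov chain resources contain, here is a minimal pure-Python sketch. It is not the Wikilangs API or its implementation: the function names (`build_vocabulary`, `build_bigram_chain`, `generate`) and the toy corpus are made up for illustration, and plain whitespace tokenization is assumed.

```python
import random
from collections import Counter, defaultdict


def build_vocabulary(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.lower().split())


def build_bigram_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.lower().split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length=8, seed=0):
    """Random walk over the chain; stops early at a word with no successor."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length:
        followers = chain.get(output[-1])
        if not followers:
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Toy corpus standing in for real Wikipedia text (hypothetical example data).
corpus = "the cat sat on the mat and the dog sat on the rug"
vocab = build_vocabulary(corpus)
chain = build_bigram_chain(corpus)
print(vocab.most_common(2))
print(generate(chain, "the"))
```

Because followers are stored with repetition, frequent continuations are sampled more often, which is the basic idea behind a first-order Markov text generator.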

You can explore everything at wikilangs.org or install directly:

pip install wikilangs

The hub page is at 🤗 huggingface.co/wikilangs. Each language has its own model card with download stats, training corpus size, and evaluation notes.

This project builds on my dataset 🤗 omarkamali/wikipedia-monthly, which publishes a monthly text corpus for every language on Wikipedia (3 years ahead of the official Wikipedia dataset on HF).


Why I'm posting here: I'd love to connect with researchers, engineers, and community members working on NLP for any of these languages, especially low-resource ones. If you're working on African languages, Arabic, indigenous languages, creoles and pidgins, or any other underrepresented language, I'd genuinely love to hear what's missing, what's broken, and what would make these resources actually useful for your work.

A few things I'm actively looking for:

  • Native speakers / language experts who can help validate tokenization quality

  • Collaborators interested in building downstream projects (LLMs, search and semantics, morphological tokenization, user-facing apps and games, …)

  • Feedback on any specific language's model quality, or ideas on how to take the project further

Drop a reply and introduce yourself: which language(s) are you working on? 👇
