omnes-flores (technology preview)
omnes-flores is a unified NLP framework for LLMs consisting of three components:
- The LS component takes a document as input and outputs the results of the language identification and sentence segmentation tasks. The corresponding model is omnes-flores-40-lang-42-treebank-v0-ls.
- The WX component takes a sentence and its language, and outputs the results of the word segmentation and language-specific part-of-speech tagging tasks. The corresponding model is omnes-flores-40-lang-42-treebank-v0-wx.
- The UD component takes a sentence, its language, and its constituent word list, and outputs the result of the dependency parsing task. The corresponding model is the one on this page.
By executing these three components in sequence through the Python library omnes-flores, you can obtain dependency parses matched to the language of the input text simply by providing the text, regardless of its language.
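The three-stage data flow described above can be sketched with stand-in components. Note that the function names, signatures, and mocked outputs below are illustrative assumptions, not the actual omnes-flores API:

```python
# Illustrative sketch of the LS -> WX -> UD data flow.
# All component implementations here are hypothetical mocks that only
# show the shape of the inputs and outputs passed between stages.

def ls_component(document):
    # Language identification + sentence segmentation (mocked:
    # naive split on "." and a fixed language code).
    return [{"lang": "en", "sentence": s.strip() + "."}
            for s in document.split(".") if s.strip()]

def wx_component(sentence, lang):
    # Word segmentation + language-specific POS tagging (mocked:
    # whitespace split with a placeholder tag).
    return [{"form": w, "xpos": "X"} for w in sentence.rstrip(".").split()]

def ud_component(sentence, lang, words):
    # Dependency parsing (mocked: attach every word to the first one).
    return [{"form": w["form"],
             "head": 0 if i == 0 else 1,
             "deprel": "root" if i == 0 else "dep"}
            for i, w in enumerate(words)]

def parse(document):
    """Run LS, WX, and UD in sequence over a raw document."""
    results = []
    for sent in ls_component(document):
        words = wx_component(sent["sentence"], sent["lang"])
        results.append(ud_component(sent["sentence"], sent["lang"], words))
    return results

for parsed_sentence in parse("Colorless green ideas sleep furiously."):
    print(parsed_sentence)
```

The point of the sketch is only the interface: LS produces (language, sentence) pairs, WX consumes them to produce word lists, and UD consumes sentence, language, and word list to produce the dependency analysis.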
For details, please read the Requirements and Install sections of the omnes-flores repository.
42 Treebanks Used for LoRA SFT
This model was trained using training data from 40 UD languages, consisting of 42 treebanks.
This model uses the Corpus of Everyday Japanese Conversation (CEJC) as part of its training data. To handle the ungrammatical contexts found in fragmented everyday-conversation utterances, it adopts the NINJAL Short Unit Word (SUW), which does not presuppose bunsetsu structure, as the Japanese word unit.
The following 40 UD treebanks, each of which has both a commercially usable license and over 40k UD tokens in its train set, were selected to train the LoRA models of omnes-flores-40-lang-42-treebank-v0.
- UD_Armenian-ArmTDP, UD_Belarusian-HSE, UD_Bororo-BDT, UD_Chinese-GSD, UD_Chinese-GSDSimp, UD_Croatian-SET, UD_Czech-CAC, UD_Danish-DDT, UD_Dutch-Alpino, UD_English-EWT, UD_Estonian-EWT, UD_Finnish-TDT, UD_French-GSD, UD_German-GSD, UD_Haitian_Creole-Adolphe, UD_Hebrew-IAHLTwiki, UD_Icelandic-GC, UD_Indonesian-GSD, UD_Irish-IDT, UD_Japanese-GSDLUW, UD_Korean-Kaist, UD_Latvian-LVTB, UD_Lithuanian-ALKSNIS, UD_Naija-NSC, UD_Norwegian-Nynorsk, UD_Persian-PerDT, UD_Portuguese-Porttinari, UD_Romanian-RRT, UD_Russian-GSD, UD_Scottish_Gaelic-ARCOSG, UD_Serbian-SET, UD_Sindhi-Isra, UD_Slovak-SNK, UD_Slovenian-SSJ, UD_Spanish-GSD, UD_Swedish-Talbanken, UD_Thai-TUD, UD_Turkish-BOUN, UD_Ukrainian-ParlaMint, UD_Western_Armenian-ArmTDP
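The selection rule above (a commercially usable license and more than 40k train-set tokens) can be expressed as a simple filter. The license set and the treebank records below are illustrative assumptions, not values taken from the actual UD metadata:

```python
# Hedged sketch of the treebank selection rule: keep treebanks whose
# license permits commercial use and whose train split exceeds 40k tokens.
# The license whitelist and all example records are illustrative.

COMMERCIAL_OK = {"CC BY-SA 4.0", "CC BY 4.0"}  # assumption: treated as commercially usable

def select_treebanks(treebanks):
    return [tb["name"] for tb in treebanks
            if tb["license"] in COMMERCIAL_OK and tb["train_tokens"] > 40_000]

candidates = [
    {"name": "UD_English-EWT",  "license": "CC BY-SA 4.0",    "train_tokens": 200_000},
    {"name": "UD_Example-NC",   "license": "CC BY-NC-SA 4.0", "train_tokens": 120_000},
    {"name": "UD_Example-Tiny", "license": "CC BY-SA 4.0",    "train_tokens": 9_000},
]
print(select_treebanks(candidates))  # only UD_English-EWT passes both criteria
```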
In addition, the following datasets, specially licensed from the National Institute for Japanese Language and Linguistics exclusively for training this model, were also used:
- UD_Japanese-BCCWJ (excluding PN newspaper articles)
- UD_Japanese-CEJC
Acknowledgements
This work was conducted as part of a collaborative research project between Recruit Co., Ltd. and the National Institute for Japanese Language and Linguistics.
Citations
You are encouraged to cite one of the following papers if you use omnes-flores models:
@inproceedings{matsuda-etal-2025-step,
title = "Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of {LLM}s",
author = "Matsuda, Hiroshi and
Ma, Chunpeng and
Asahara, Masayuki",
editor = "Sagae, Kenji and
Oepen, Stephan",
booktitle = "Proceedings of the 18th International Conference on Parsing Technologies (IWPT, SyntaxFest 2025)",
month = aug,
year = "2025",
address = "Ljubljana, Slovenia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.iwpt-1.2/",
pages = "11--19",
ISBN = "979-8-89176-294-7",
abstract = "Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, and a simplified CoNLL-U like output format, our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches."
}
@misc{matsuda2025stepbystepinstructionssimpletabular,
title={Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs},
author={Hiroshi Matsuda and Chunpeng Ma and Masayuki Asahara},
year={2025},
eprint={2506.09983},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.09983},
}
Model tree for megagonlabs/omnes-flores-40-lang-42-treebank-v0
- Base model: google/gemma-2-9b