# LatinCy Flair (la_flair_latincy)
A Flair model suite for Latin trained on harmonized Universal Dependencies treebanks from LatinCy. Provides POS tagging, lemmatization, and named entity recognition using Flair's stacked embedding architecture (contextual string embeddings + word vectors + BiLSTM-CRF).
## Highlights
- POS tagger, NER tagger, and lemmatizer -- three sequence labeling models
- Custom character language models (forward + backward) trained on 1.6 GB of curated Latin text (13.7M sentences)
- Custom word vectors (CBOW-300, trained on curated Latin corpus)
- 6 UD treebanks: ITTB, LLCT, PROIEL, Perseus, UDante, CIRCSE (~1.03M tokens)
- NER with 3 entity types: PERSON, LOC, NORP
## Quick Start

```python
from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

# Download the POS tagger (cached after the first download)
model_path = hf_hub_download("latincy/la_flair_latincy", "models/pos/best-model.pt")
tagger = SequenceTagger.load(model_path)

# Tag a sentence
sentence = Sentence("Gallia est omnis divisa in partes tres.")
tagger.predict(sentence)

for token in sentence:
    print(f"{token.text:12s} {token.get_label('upos').value}")
```
Output:

```
Gallia       PROPN
est          AUX
omnis        DET
divisa       VERB
in           ADP
partes       NOUN
tres         NUM
.            PUNCT
```
### NER

```python
from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("latincy/la_flair_latincy", "models/ner/best-model.pt")
ner = SequenceTagger.load(model_path)

sentence = Sentence("Caesar in Galliam cum legionibus contendit.")
ner.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(f"{entity.text:20s} {entity.get_label('ner').value}")
```
Output:

```
Caesar               PERSON
Galliam              LOC
```
### Lemmatization

```python
from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("latincy/la_flair_latincy", "models/lemmatizer/best-model.pt")
lemmatizer = SequenceTagger.load(model_path)

sentence = Sentence("Arma virumque cano Troiae qui primus ab oris.")
lemmatizer.predict(sentence)

for token in sentence:
    print(f"{token.text:12s} {token.get_label('lemma').value}")
```
Output:

```
Arma         arma
virumque     vir
cano         cano
Troiae       Troia
qui          qui
primus       primus
ab           ab
oris         ora
.            .
```
## Model Description
| Property | Value |
|---|---|
| Author | Patrick J. Burns / LatinCy |
| Model type | Flair stacked embeddings + BiLSTM-CRF |
| Language | Latin |
| License | CC BY-NC-SA 4.0 |
| Total size | ~1.35 GB (5 model files) |
| Framework | Flair (Akbik et al.) |
## Model Components

| Component | Model File | Size | Architecture |
|---|---|---|---|
| POS Tagger | `models/pos/best-model.pt` | 429 MB | BiLSTM-CRF with stacked embeddings |
| NER Tagger | `models/ner/best-model.pt` | 429 MB | BiLSTM-CRF with stacked embeddings |
| Lemmatizer | `models/lemmatizer/best-model.pt` | 347 MB | Seq2seq encoder-decoder |
| CharLM (fwd) | `models/lm-forward/best-lm.pt` | 73 MB | Character-level LSTM language model |
| CharLM (bwd) | `models/lm-backward/best-lm.pt` | 73 MB | Character-level LSTM language model |
### Embedding Stack
All sequence labeling models use the same stacked embedding architecture:
- Flair forward embeddings -- contextual character-level LM (2048-dim LSTM)
- Flair backward embeddings -- reverse-direction character-level LM (2048-dim LSTM)
- Word embeddings -- CBOW-300 word vectors trained on curated Latin corpus
The stacked embeddings (4396-dim total) are fed into a BiLSTM-CRF for POS/NER, or a seq2seq encoder-decoder for lemmatization.
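Stacking here is plain per-token concatenation of the three embedding sources. As a quick sanity check on the stated dimensions, a minimal pure-Python sketch with dummy vectors (no Flair dependency; the dimension constants are the ones listed above):

```python
# Flair-style stacking = concatenating each embedding source's vector
# for a token. The vectors below are dummies; only the sizes matter.
FORWARD_DIM, BACKWARD_DIM, WORD_DIM = 2048, 2048, 300

def stack(vectors):
    """Concatenate one token's vectors from every embedding source."""
    stacked = []
    for vec in vectors:
        stacked.extend(vec)
    return stacked

token_embedding = stack([
    [0.0] * FORWARD_DIM,   # Flair forward CharLM state
    [0.0] * BACKWARD_DIM,  # Flair backward CharLM state
    [0.0] * WORD_DIM,      # CBOW-300 word vector
])
print(len(token_embedding))  # 2048 + 2048 + 300 = 4396
```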
## Training Data

### POS, Lemmatization
Trained on harmonized data from 6 Universal Dependencies Latin treebanks, prepared by the LatinCy treebank pipeline. Treebanks are harmonized to consistent annotation standards before combining.
| Treebank | Full Name | Domain |
|---|---|---|
| ITTB | Index Thomisticus Treebank | Scholastic Latin (Thomas Aquinas) |
| LLCT | Late Latin Charter Treebank | Medieval legal charters |
| PROIEL | PROIEL Treebank | Vulgate Bible, historical texts |
| Perseus | Perseus Latin Treebank | Classical Latin (Caesar, Cicero, etc.) |
| UDante | UDante Treebank | Dante Alighieri (De vulgari eloquentia, etc.) |
| CIRCSE | CIRCSE Latin Treebank | LASLA-derived classical texts |
Combined (v0.2): 130,692 train / 12,136 dev / 13,713 test sentences (~2.97M tokens, including LASLA classical Latin data).
### NER
Trained on LatinCy NER annotations from 4 sources: 13,493 train / 3,195 dev sentences. Entity types: PERSON (79%), LOC (14%), NORP (7%).
### Character Language Models
Trained on 1.6 GB of curated Latin text (13.7M sentences from 9 sources) for 10 epochs. Forward and backward CharLMs provide contextualized character-level features to the POS tagger, NER tagger, and lemmatizer.
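For intuition, a character LM's perplexity is the exponential of its mean per-character cross-entropy, so the reported values convert directly to bits per character (a standard identity, not a figure from the training logs):

```python
import math

# bits-per-character = log2(perplexity): a perplexity near 3.1 means the
# CharLM is left with about 1.64 bits of uncertainty per Latin character.
for name, ppl in [("forward", 3.11), ("backward", 3.12)]:
    print(f"{name}: perplexity {ppl} -> {math.log2(ppl):.2f} bits/char")
```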
## Training Procedure
POS tagger: BiLSTM-CRF with 256-unit hidden layer, stacked Flair + CBOW embeddings, SGD with learning rate 0.1, 150 epochs. Best model at epoch 100.
NER tagger: Same architecture as POS, trained on NER-annotated data, 150 epochs. Best model at epoch 62.
Lemmatizer: Flair seq2seq encoder-decoder with stacked embeddings, GRU decoder, 512-dim hidden, trained with SGD learning rate 0.1, batch size 1024, early stopping with patience 5.
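Flair's trainer anneals the learning rate when the dev score plateaus. A generic sketch of that schedule follows; the patience of 5 and halving factor mirror the configuration described above and Flair's usual defaults, but the dev-score curve is made up for illustration:

```python
# Hypothetical sketch of patience-based LR annealing / early stopping in
# the style of Flair's ModelTrainer: halve the LR after `patience`
# non-improving epochs, stop once the LR falls below a floor.
def anneal_schedule(dev_scores, lr=0.1, patience=5, anneal_factor=0.5, min_lr=1e-4):
    best, bad_epochs = float("-inf"), 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best, bad_epochs = score, 0   # new best: reset the counter
        else:
            bad_epochs += 1
        if bad_epochs > patience:         # plateau: anneal the LR
            lr *= anneal_factor
            bad_epochs = 0
        if lr < min_lr:                   # LR exhausted: stop early
            return epoch, lr, best
    return len(dev_scores), lr, best

# Made-up dev-score curve that improves for four epochs, then plateaus
scores = [0.90, 0.93, 0.95, 0.96] + [0.955] * 40
epochs_run, final_lr, best = anneal_schedule(scores)
print(epochs_run, final_lr, best)  # -> 44 0.0015625 0.96
```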
## Evaluation Results

### Overall Scores

| Component | Metric | Test | Dev |
|---|---|---|---|
| POS Tagger | F1-micro | 96.65 | 96.89 |
| NER | F1-micro | -- | 90.48 |
| Lemmatizer | F1-micro | 96.55 | 97.46 |
| CharLM (fwd) | Perplexity | -- | 3.11 |
| CharLM (bwd) | Perplexity | -- | 3.12 |
POS and Lemma scores are on held-out test sets (v0.2, retrained on UD+LASLA data). NER score is on dev only (no NER test set exists yet).
### Cross-Framework Comparison (LatinCy v3.8)

All models trained on the same harmonized treebank data. Scores on held-out test sets unless noted. NER scores are on dev (no test set exists).

| Metric | LatinCy Flair 0.2 | LatinCy Stanza 0.2 | LatinCy UDPipe 0.1 | LatinCy spaCy lg 3.9.0 |
|---|---|---|---|---|
| UPOS | 96.65 | 97.26 | 93.28 | 97.26 |
| UFeats | -- | 92.80 | 82.48 | 92.58 |
| Lemma | 96.55 | 97.87 | 93.05 | 95.26 |
| UAS | -- | 86.95 | 76.11 | 83.50 |
| LAS | -- | 83.23 | 71.29 | 78.53 |
| NER F1 | 90.48 | 90.22 | -- | -- |
Flair's BiLSTM-CRF with contextual string embeddings achieves competitive POS and lemmatization scores, and leads on NER. Dependency parsing is not supported by Flair's sequence labeling architecture.
## Limitations
- No dependency parsing: Flair is a sequence labeling framework and does not support dependency parsing. For dependency analysis, use LatinCy's spaCy, Stanza, or UDPipe models.
- No tokenizer: Flair uses whitespace tokenization by default. Texts should be pre-tokenized before input.
- Large model files: POS and NER models are 429 MB each because they embed the full CharLM weights. The CharLM files are also included separately for standalone use.
- NER scores on dev set: No held-out NER test evaluation is available.
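Since the models expect pre-tokenized input, something as simple as a regex pre-tokenizer that splits punctuation off words can be used before constructing a `Sentence`. This is an illustrative sketch, not part of the released pipeline; a proper Latin tokenizer (e.g. from a LatinCy spaCy pipeline) is preferable for real text:

```python
import re

# Minimal pre-tokenizer: runs of word characters stay together, and any
# other non-space character (punctuation) becomes its own token.
def pretokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokens = pretokenize("Gallia est omnis divisa in partes tres.")
print(tokens)
# ['Gallia', 'est', 'omnis', 'divisa', 'in', 'partes', 'tres', '.']
```

The whitespace-joined result, `" ".join(tokens)`, can then be passed to `Sentence` so each token is its own unit.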
## Future Development
- Phase 2: Replace Flair CharLM embeddings with LatinCy RoBERTa transformer features for improved accuracy
- Chunking: Shallow parsing / phrase structure detection
- Prose/verse classification: Period and register detection
## References
- Akbik, A., Blythe, D., and Vollgraf, R. 2018. "Contextual String Embeddings for Sequence Labeling." In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638--1649. https://aclanthology.org/C18-1139/
## Citation

```bibtex
@misc{burns2026latincyflair,
  author    = {Burns, Patrick J.},
  title     = {{LatinCy Flair (la\_flair\_latincy)}},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/latincy/la_flair_latincy},
}
```
## Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.