LatinCy Flair (la_flair_latincy)

A Flair model suite for Latin trained on harmonized Universal Dependencies treebanks from LatinCy. Provides POS tagging, lemmatization, and named entity recognition using Flair's stacked embedding architecture (contextual string embeddings + word vectors + BiLSTM-CRF).

Highlights

  • POS tagger, NER tagger, and lemmatizer -- three sequence labeling models
  • Custom character language models (forward + backward) trained on 1.6 GB of curated Latin text (13.7M sentences)
  • Custom word vectors (CBOW-300, trained on curated Latin corpus)
  • 6 UD treebanks: ITTB, LLCT, PROIEL, Perseus, UDante, CIRCSE (~1.03M tokens)
  • NER with 3 entity types: PERSON, LOC, NORP

Quick Start

from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

# Download POS tagger (cached after first download)
model_path = hf_hub_download("latincy/la_flair_latincy", "models/pos/best-model.pt")
tagger = SequenceTagger.load(model_path)

# Tag a sentence
sentence = Sentence("Gallia est omnis divisa in partes tres.")
tagger.predict(sentence)

for token in sentence:
    print(f"{token.text:12s} {token.get_label('upos').value}")

Output:

Gallia       PROPN
est          AUX
omnis        DET
divisa       VERB
in           ADP
partes       NOUN
tres         NUM
.            PUNCT

NER

from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("latincy/la_flair_latincy", "models/ner/best-model.pt")
ner = SequenceTagger.load(model_path)
sentence = Sentence("Caesar in Galliam cum legionibus contendit.")
ner.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(f"{entity.text:20s} {entity.get_label('ner').value}")

Output:

Caesar               PERSON
Galliam              LOC

Lemmatization

from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("latincy/la_flair_latincy", "models/lemmatizer/best-model.pt")
lemmatizer = SequenceTagger.load(model_path)
sentence = Sentence("Arma virumque cano Troiae qui primus ab oris.")
lemmatizer.predict(sentence)

for token in sentence:
    print(f"{token.text:12s} {token.get_label('lemma').value}")

Output:

Arma         arma
virumque     vir
cano         cano
Troiae       Troia
qui          qui
primus       primus
ab           ab
oris         ora
.            .

Model Description

Property     Value
Author       Patrick J. Burns / LatinCy
Model type   Flair stacked embeddings + BiLSTM-CRF
Language     Latin
License      CC BY-NC-SA 4.0
Total size   ~1.35 GB (5 model files)
Framework    Flair (Akbik et al.)

Model Components

Component     Model File                       Size    Architecture
POS Tagger    models/pos/best-model.pt         429 MB  BiLSTM-CRF with stacked embeddings
NER Tagger    models/ner/best-model.pt         429 MB  BiLSTM-CRF with stacked embeddings
Lemmatizer    models/lemmatizer/best-model.pt  347 MB  Seq2seq encoder-decoder
CharLM (fwd)  models/lm-forward/best-lm.pt     73 MB   Character-level LSTM language model
CharLM (bwd)  models/lm-backward/best-lm.pt    73 MB   Character-level LSTM language model

Embedding Stack

All sequence labeling models use the same stacked embedding architecture:

  1. Flair forward embeddings -- contextual character-level LM (2048-dim LSTM)
  2. Flair backward embeddings -- reverse-direction character-level LM (2048-dim LSTM)
  3. Word embeddings -- CBOW-300 word vectors trained on curated Latin corpus

The stacked embeddings (4396-dim total) are fed into a BiLSTM-CRF for POS/NER, or a seq2seq encoder-decoder for lemmatization.
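The stacking itself is plain vector concatenation, so the total dimensionality follows directly from the three components (2048 + 2048 + 300 = 4396). A minimal sketch with stand-in zero vectors -- the sizes mirror the stack described above, but this is not Flair's internal API:

```python
# Sketch: a stacked embedding is the concatenation of the per-token
# vectors from each component. Vectors here are dummies.

def stack_embeddings(*vectors):
    """Concatenate per-token embedding vectors into one stacked vector."""
    stacked = []
    for v in vectors:
        stacked.extend(v)
    return stacked

flair_forward = [0.0] * 2048   # contextual char-LM, forward direction
flair_backward = [0.0] * 2048  # contextual char-LM, backward direction
word_cbow = [0.0] * 300        # CBOW-300 word vector

token_embedding = stack_embeddings(flair_forward, flair_backward, word_cbow)
print(len(token_embedding))  # 4396
```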

Training Data

POS, Lemmatization

Trained on harmonized data from 6 Universal Dependencies Latin treebanks, prepared by the LatinCy treebank pipeline. Treebanks are harmonized to consistent annotation standards before combining.

Treebank  Full Name                    Domain
ITTB      Index Thomisticus Treebank   Scholastic Latin (Thomas Aquinas)
LLCT      Late Latin Charter Treebank  Medieval legal charters
PROIEL    PROIEL Treebank              Vulgate Bible, historical texts
Perseus   Perseus Latin Treebank       Classical Latin (Caesar, Cicero, etc.)
UDante    UDante Treebank              Dante Alighieri (De vulgari eloquentia, etc.)
CIRCSE    CIRCSE Latin Treebank        LASLA-derived classical texts

Combined (v0.2): 130,692 train / 12,136 dev / 13,713 test sentences (~2.97M tokens, including LASLA classical Latin data).

NER

Trained on LatinCy NER annotations from 4 sources: 13,493 train / 3,195 dev sentences. Entity types: PERSON (79%), LOC (14%), NORP (7%).

Character Language Models

Trained on 1.6 GB of curated Latin text (13.7M sentences from 9 sources) for 10 epochs. Forward and backward CharLMs provide contextualized character-level features to the POS tagger, NER tagger, and lemmatizer.

Training Procedure

POS tagger: BiLSTM-CRF with 256-unit hidden layer, stacked Flair + CBOW embeddings, SGD with learning rate 0.1, 150 epochs. Best model at epoch 100.

NER tagger: Same architecture as POS, trained on NER-annotated data, 150 epochs. Best model at epoch 62.

Lemmatizer: Flair seq2seq encoder-decoder with stacked embeddings, GRU decoder, 512-dim hidden, trained with SGD learning rate 0.1, batch size 1024, early stopping with patience 5.
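The "early stopping with patience 5" rule used for the lemmatizer can be sketched generically: training halts once the dev score has failed to improve for `patience` consecutive evaluations. A simplified illustration with a hypothetical score trajectory, not Flair's trainer code:

```python
def train_with_patience(dev_scores, patience=5):
    """Return the epoch index of the best dev score, stopping once
    `patience` consecutive epochs pass without improvement."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # patience exhausted: stop training
    return best_epoch

# Hypothetical dev-F1 trajectory: improvement stalls after epoch 3,
# so training stops five epochs later and epoch 3 holds the best model.
scores = [0.90, 0.93, 0.95, 0.96, 0.96, 0.955, 0.958, 0.959, 0.957]
print(train_with_patience(scores))  # 3
```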

Evaluation Results

Overall Scores

Component     Metric      Test   Dev
POS Tagger    F1-micro    96.65  96.89
NER           F1-micro    --     90.48
Lemmatizer    F1-micro    96.55  97.46
CharLM (fwd)  Perplexity  --     3.11
CharLM (bwd)  Perplexity  --     3.12

POS and Lemma scores are on held-out test sets (v0.2, retrained on UD+LASLA data). NER score is on dev only (no NER test set exists yet).
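For the CharLM rows, perplexity is the exponential of the mean per-character negative log-likelihood. A small sketch of the relation -- the loss value of ~1.135 nats/char is back-derived here from the reported figure, not taken from the training logs:

```python
import math

def perplexity(mean_nll):
    """Perplexity from mean negative log-likelihood (nats per character)."""
    return math.exp(mean_nll)

# A dev loss of about 1.135 nats/char corresponds to the ~3.11
# perplexity reported for the forward CharLM.
print(round(perplexity(1.135), 2))  # 3.11
```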

Cross-Framework Comparison (LatinCy v3.8)

All models trained on the same harmonized treebank data. Scores on held-out test sets unless noted. NER scores are on dev (no test set exists).

Metric  LatinCy     LatinCy     LatinCy     LatinCy
        Flair 0.2   Stanza 0.2  UDPipe 0.1  spaCy lg 3.9.0
UPOS    96.65       97.26       93.28       97.26
UFeats  --          92.80       82.48       92.58
Lemma   96.55       97.87       93.05       95.26
UAS     --          86.95       76.11       83.50
LAS     --          83.23       71.29       78.53
NER F1  90.48       90.22       --          --

Flair's BiLSTM-CRF with contextual string embeddings achieves competitive POS and lemmatization scores, and leads on NER. Dependency parsing is not supported by Flair's sequence labeling architecture.

Limitations

  • No dependency parsing: Flair is a sequence labeling framework and does not support dependency parsing. For dependency analysis, use LatinCy's spaCy, Stanza, or UDPipe models.
  • No tokenizer: Flair uses whitespace tokenization by default. Texts should be pre-tokenized before input.
  • Large model files: POS and NER models are 429 MB each because they embed the full CharLM weights. The CharLM files are also included separately for standalone use.
  • NER scores on dev set: No held-out NER test evaluation is available.
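Because Flair falls back to whitespace splitting, punctuation must be detached from words before tagging, or "tres." would reach the model as a single unseen token. A minimal regex pre-tokenizer sketch -- illustrative only; a proper Latin tokenizer (e.g. LatinCy's spaCy pipeline) is preferable in practice:

```python
import re

def pretokenize(text):
    """Split on word characters vs. punctuation so each token matches
    the whitespace-delimited units the taggers expect."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = pretokenize("Gallia est omnis divisa in partes tres.")
print(" ".join(tokens))  # Gallia est omnis divisa in partes tres .
```

Joining the tokens with spaces before constructing the Sentence lets Flair's default whitespace tokenization recover them exactly.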

Future Development

  • Phase 2: Replace Flair CharLM embeddings with LatinCy RoBERTa transformer features for improved accuracy
  • Chunking: Shallow parsing / phrase structure detection
  • Prose/verse classification: Period and register detection

References

  • Akbik, A., Blythe, D., and Vollgraf, R. 2018. "Contextual String Embeddings for Sequence Labeling." In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638--1649. https://aclanthology.org/C18-1139/

Citation

@misc{burns2026latincyflair,
  author = {Burns, Patrick J.},
  title = {{LatinCy Flair (la\_flair\_latincy)}},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/latincy/la_flair_latincy},
}

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
