LatinCy Flair (la_flair_latincy)

A Flair model suite for Latin trained on harmonized Universal Dependencies treebanks from LatinCy. Provides POS tagging, lemmatization, and named entity recognition using Flair's stacked embedding architecture (contextual string embeddings + word vectors + BiLSTM-CRF).

Highlights

  • POS tagger, NER tagger, and lemmatizer -- three sequence labeling models
  • Custom character language models (forward + backward) trained on 1.6 GB of curated Latin text (13.7M sentences)
  • Custom word vectors (CBOW-300, trained on curated Latin corpus)
  • 6 UD treebanks: ITTB, LLCT, PROIEL, Perseus, UDante, CIRCSE (~1.03M tokens)
  • NER with 3 entity types: PERSON, LOC, NORP

Quick Start

from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

# Download POS tagger (cached after first download)
model_path = hf_hub_download("latincy/la_flair_latincy", "models/pos/best-model.pt")
tagger = SequenceTagger.load(model_path)

# Tag a sentence
sentence = Sentence("Gallia est omnis divisa in partes tres.")
tagger.predict(sentence)

for token in sentence:
    print(f"{token.text:12s} {token.get_label('upos').value}")

Output:

Gallia       PROPN
est          AUX
omnis        DET
divisa       VERB
in           ADP
partes       NOUN
tres         NUM
.            PUNCT

NER

from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("latincy/la_flair_latincy", "models/ner/best-model.pt")
ner = SequenceTagger.load(model_path)
sentence = Sentence("Caesar in Galliam cum legionibus contendit.")
ner.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(f"{entity.text:20s} {entity.get_label('ner').value}")

Output:

Caesar               PERSON
Galliam              LOC

Lemmatization

from flair.data import Sentence
from flair.models import SequenceTagger
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("latincy/la_flair_latincy", "models/lemmatizer/best-model.pt")
lemmatizer = SequenceTagger.load(model_path)
sentence = Sentence("Arma virumque cano Troiae qui primus ab oris.")
lemmatizer.predict(sentence)

for token in sentence:
    print(f"{token.text:12s} {token.get_label('lemma').value}")

Output:

Arma         arma
virumque     vir
cano         cano
Troiae       Troia
qui          qui
primus       primus
ab           ab
oris         ora
.            .

Model Description

Property     Value
Author       Patrick J. Burns / LatinCy
Model type   Flair stacked embeddings + BiLSTM-CRF
Language     Latin
License      CC BY-NC-SA 4.0
Total size   ~1.35 GB (5 model files)
Framework    Flair (Akbik et al.)

Model Components

Component     Model File                       Size    Architecture
POS Tagger    models/pos/best-model.pt         429 MB  BiLSTM-CRF with stacked embeddings
NER Tagger    models/ner/best-model.pt         429 MB  BiLSTM-CRF with stacked embeddings
Lemmatizer    models/lemmatizer/best-model.pt  347 MB  Seq2seq encoder-decoder
CharLM (fwd)  models/lm-forward/best-lm.pt     73 MB   Character-level LSTM language model
CharLM (bwd)  models/lm-backward/best-lm.pt    73 MB   Character-level LSTM language model

Embedding Stack

All sequence labeling models use the same stacked embedding architecture:

  1. Flair forward embeddings -- contextual character-level LM (2048-dim LSTM)
  2. Flair backward embeddings -- reverse-direction character-level LM (2048-dim LSTM)
  3. Word embeddings -- CBOW-300 word vectors trained on curated Latin corpus

The stacked embeddings (4396-dim total) are fed into a BiLSTM-CRF for POS/NER, or a seq2seq encoder-decoder for lemmatization.
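The stacking itself is plain vector concatenation, so the total dimensionality follows directly from the three components (2048 + 2048 + 300 = 4396). A minimal sketch with stand-in zero vectors -- the sizes mirror the stack described above, but this is not Flair's internal API:

```python
# Sketch: a stacked embedding is the concatenation of the per-token
# vectors from each component. Vectors here are dummies.

def stack_embeddings(*vectors):
    """Concatenate per-token embedding vectors into one stacked vector."""
    stacked = []
    for v in vectors:
        stacked.extend(v)
    return stacked

flair_forward = [0.0] * 2048   # contextual char-LM, forward direction
flair_backward = [0.0] * 2048  # contextual char-LM, backward direction
word_cbow = [0.0] * 300        # CBOW-300 word vector

token_embedding = stack_embeddings(flair_forward, flair_backward, word_cbow)
print(len(token_embedding))  # 4396
```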

Training Data

POS, Lemmatization

Trained on harmonized data from 6 Universal Dependencies Latin treebanks, prepared by the LatinCy treebank pipeline. Treebanks are harmonized to consistent annotation standards before combining.

Treebank  Full Name                    Domain
ITTB      Index Thomisticus Treebank   Scholastic Latin (Thomas Aquinas)
LLCT      Late Latin Charter Treebank  Medieval legal charters
PROIEL    PROIEL Treebank              Vulgate Bible, historical texts
Perseus   Perseus Latin Treebank       Classical Latin (Caesar, Cicero, etc.)
UDante    UDante Treebank              Dante Alighieri (De vulgari eloquentia, etc.)
CIRCSE    CIRCSE Latin Treebank        LASLA-derived classical texts

Combined (v0.2): 130,692 train / 12,136 dev / 13,713 test sentences (~2.97M tokens, including LASLA classical Latin data).

NER

Trained on LatinCy NER annotations from 4 sources: 13,493 train / 3,195 dev sentences. Entity types: PERSON (79%), LOC (14%), NORP (7%).

Character Language Models

Trained on 1.6 GB of curated Latin text (13.7M sentences from 9 sources) for 10 epochs. Forward and backward CharLMs provide contextualized character-level features to the POS tagger, NER tagger, and lemmatizer.

Training Procedure

POS tagger: BiLSTM-CRF with 256-unit hidden layer, stacked Flair + CBOW embeddings, SGD with learning rate 0.1, 150 epochs. Best model at epoch 100.

NER tagger: Same architecture as POS, trained on NER-annotated data, 150 epochs. Best model at epoch 62.

Lemmatizer: Flair seq2seq encoder-decoder with stacked embeddings, GRU decoder, 512-dim hidden, trained with SGD learning rate 0.1, batch size 1024, early stopping with patience 5.
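The "early stopping with patience 5" rule used for the lemmatizer can be sketched generically: training halts once the dev score has failed to improve for `patience` consecutive evaluations. A simplified illustration with a hypothetical score trajectory, not Flair's trainer code:

```python
def train_with_patience(dev_scores, patience=5):
    """Return the epoch index of the best dev score, stopping once
    `patience` consecutive epochs pass without improvement."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # patience exhausted: stop training
    return best_epoch

# Hypothetical dev-F1 trajectory: improvement stalls after epoch 3,
# so training stops five epochs later and epoch 3 holds the best model.
scores = [0.90, 0.93, 0.95, 0.96, 0.96, 0.955, 0.958, 0.959, 0.957]
print(train_with_patience(scores))  # 3
```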

Evaluation Results

Overall Scores

Component     Metric      Test   Dev
POS Tagger    F1-micro    96.65  96.89
NER           F1-micro    --     90.48
Lemmatizer    F1-micro    96.55  97.46
CharLM (fwd)  Perplexity  --     3.11
CharLM (bwd)  Perplexity  --     3.12

POS and Lemma scores are on held-out test sets (v0.2, retrained on UD+LASLA data). NER score is on dev only (no NER test set exists yet).
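For the CharLM rows, perplexity is the exponential of the mean per-character negative log-likelihood. A small sketch of the relation -- the loss value of ~1.135 nats/char is back-derived here from the reported figure, not taken from the training logs:

```python
import math

def perplexity(mean_nll):
    """Perplexity from mean negative log-likelihood (nats per character)."""
    return math.exp(mean_nll)

# A dev loss of about 1.135 nats/char corresponds to the ~3.11
# perplexity reported for the forward CharLM.
print(round(perplexity(1.135), 2))  # 3.11
```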

Cross-Framework Comparison (LatinCy v3.8)

All models trained on the same harmonized treebank data. Scores on held-out test sets unless noted. NER scores are on dev (no test set exists).

Metric  LatinCy     LatinCy     LatinCy     LatinCy
        Flair 0.2   Stanza 0.2  UDPipe 0.1  spaCy lg 3.9.0
UPOS    96.65       97.26       93.28       97.26
UFeats  --          92.80       82.48       92.58
Lemma   96.55       97.87       93.05       95.26
UAS     --          86.95       76.11       83.50
LAS     --          83.23       71.29       78.53
NER F1  90.48       90.22       --          --

Flair's BiLSTM-CRF with contextual string embeddings achieves competitive POS and lemmatization scores, and leads on NER. Dependency parsing is not supported by Flair's sequence labeling architecture.

Limitations

  • No dependency parsing: Flair is a sequence labeling framework and does not support dependency parsing. For dependency analysis, use LatinCy's spaCy, Stanza, or UDPipe models.
  • No tokenizer: Flair uses whitespace tokenization by default. Texts should be pre-tokenized before input.
  • Large model files: POS and NER models are 429 MB each because they embed the full CharLM weights. The CharLM files are also included separately for standalone use.
  • NER scores on dev set: No held-out NER test evaluation is available.
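Because Flair falls back to whitespace splitting, punctuation must be detached from words before tagging, or "tres." would reach the model as a single unseen token. A minimal regex pre-tokenizer sketch -- illustrative only; a proper Latin tokenizer (e.g. LatinCy's spaCy pipeline) is preferable in practice:

```python
import re

def pretokenize(text):
    """Split on word characters vs. punctuation so each token matches
    the whitespace-delimited units the taggers expect."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = pretokenize("Gallia est omnis divisa in partes tres.")
print(" ".join(tokens))  # Gallia est omnis divisa in partes tres .
```

Joining the tokens with spaces before constructing the Sentence lets Flair's default whitespace tokenization recover them exactly.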

Future Development

  • Phase 2: Replace Flair CharLM embeddings with LatinCy RoBERTa transformer features for improved accuracy
  • Chunking: Shallow parsing / phrase structure detection
  • Prose/verse classification: Period and register detection

References

  • Akbik, A., Blythe, D., and Vollgraf, R. 2018. "Contextual String Embeddings for Sequence Labeling." In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638--1649. https://aclanthology.org/C18-1139/

Citation

@misc{burns2026latincyflair,
  author = {Burns, Patrick J.},
  title = {{LatinCy Flair (la\_flair\_latincy)}},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/latincy/la_flair_latincy},
}

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
