sentence-transformers
Safetensors
Etruscan
Latin
cross-lingual
low-resource-nlp
ancient-languages
etruscan
epigraphy
LoRA
LaBSE
XLM-R
Eval Results (legacy)
Instructions to use Eddy1919/etr-lora-v4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use Eddy1919/etr-lora-v4 with sentence-transformers:
from sentence_transformers import SentenceTransformer model = SentenceTransformer("Eddy1919/etr-lora-v4") sentences = [ "The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium." ] embeddings = model.encode(sentences) similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [3, 3] - Notebooks
- Google Colab
- Kaggle
File size: 12,205 Bytes
69694d6 09ede36 69694d6 09ede36 69694d6 09ede36 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 | ---
language:
- ett
- la
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- cross-lingual
- low-resource-nlp
- ancient-languages
- etruscan
- epigraphy
- LoRA
- LaBSE
- XLM-R
base_model: sentence-transformers/LaBSE
datasets:
- Eddy1919/openetruscan-corpus
metrics:
- precision-at-k
- cosine-similarity
model-index:
- name: etr-lora-v4
results:
- task:
type: cross-lingual-retrieval
name: Etruscan-Latin word-vector retrieval
dataset:
name: rosetta-eval-v1 (test split)
type: Eddy1919/openetruscan-rosetta-eval-v1
metrics:
- type: precision_at_10_semantic_field
value: 0.1875
name: Semantic-field precision@10 (LaBSE baseline)
verified: false
- type: precision_at_10
value: 0.0625
name: Strict-lexical precision@10 (LaBSE baseline)
verified: false
---
# etr-lora-v4 — Etruscan-side LoRA adapter for LaBSE
> **Status note.** The numbers in the YAML frontmatter and in the
> Evaluation table below are the **LaBSE-only** column of the current
> frozen `rosetta-eval-v1` benchmark. That is what the first Hub
> deposit covers. The **v4 column** will be added after WBS tasks
> **T2.3** (ingest v4 vectors behind a feature flag) and **T2.4** (run
> the head-to-head eval) land in prod and the benchmark gains its
> fourth row.
## TL;DR
`etr-lora-v4` is a **LoRA adapter** that fine-tunes the **Etruscan-side
vocabulary projection** of a multilingual encoder (XLM-R-base, with
LaBSE as the cross-lingual anchor on the Latin/Greek side) so that
Etruscan words land in the same 768-dim semantic space as the rest of
the multilingual vocabulary. The system is evaluated against held-out
Etruscan ↔ Latin equivalences drawn from the philological literature
(Bonfante & Bonfante 2002, Wallace 2008, Pallottino 1968), exposed
through the `rosetta-eval-v1` frozen benchmark.
The pipeline is designed for **semantic-neighbourhood retrieval over a
low-resource, undeciphered ancient language**, not lexical-equivalence
translation. See *Limitations* before you cite the numbers.
## Intended use
- **Cognate / loanword detection.** Given an Etruscan word, find
orthographically- or semantically-similar Latin or Greek words.
Useful for spotting Etruscan→Latin borrowings (e.g. `histrio`,
`popa`, `subulo`, `satura`).
- **Theonym and place-name alignment.** Etruscan deity and place
names were often Latinised by Roman authors with regular sound
correspondences. The system reliably recovers these:
`menrva→minerva`, `hercle→hercules`, `fanu→fanum`.
- **Within-language semantic-field exploration.** For an Etruscan
query, the system returns Latin words with related meanings even
when the exact target lemma is wrong (e.g. `papa→[papa, daddy,
pater]`).
- **Multilingual nearest-neighbour browsing** as a primitive other
ancient-language work (Phoenician, Faliscan, Oscan) can plug into
without rebuilding the storage / API layer.
## Out of scope
- **Mechanical Etruscan → Latin translation.** Lexical equivalence
between *unrelated* surface forms (`clan → filius`, `puia → uxor`,
`lautn → familia`) is **not** in the model, and no amount of
pooling, centering, or LoRA fine-tuning recovers signal that was
never in the training corpus.
- **Decipherment of unknown Etruscan words.** Top-k results will be
orthographic and semantic neighbours of the source surface form,
not authoritative semantic equivalents.
- **An Etruscan dictionary.** This is not a dictionary. We make no
such claim. The output is a ranked shortlist for downstream
philological judgement, not a translation.
## How to use
### From `sentence-transformers`
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Eddy1919/etr-lora-v4")
embeddings = model.encode(["fanu", "avil", "clan"])
# embeddings.shape == (3, 768)
```
### Through the hosted API
```bash
curl 'https://api.openetruscan.com/neural/rosetta?word=fanu&from=ett&to=lat&embedder=xlmr-lora-v4'
```
The default `embedder` is `LaBSE`; passing `embedder=xlmr-lora-v4`
routes the query through the v4 adapter. The route currently returns
LaBSE results until T2.3 lands the v4 partition in prod.
## Training data
Derived from the **OpenEtruscan corpus v1** (Zenodo DOI
[10.5281/zenodo.20075836](https://doi.org/10.5281/zenodo.20075836)):
- **6,633 unified inscriptions**, drawn primarily from the *Larth
Dataset* (Vico & Spanakis 2023; ~71% of rows) and the *Corpus
Inscriptionum Etruscarum* Vol. I extractions (~29%).
- **~8,905 unique Etruscan tokens** on the source side after
divider-normalisation (see *Training procedure* below).
- No primary-source-attested anchors are used in training — only the
raw transcriptions. The Bonfante / Wallace / Pallottino
equivalences are held out for evaluation in `rosetta-eval-v1`.
Upstream provenance chain is documented in
[`research/BIBLIOGRAPHY.md`](https://github.com/Eddy1919/openEtruscan/blob/main/research/BIBLIOGRAPHY.md).
## Training procedure
LoRA over **XLM-R-base** (768-dim hidden), trained on Vertex AI in
the `openetruscan-rosetta` GCP project.
- **Output adapter:** `gs://openetruscan-rosetta/adapters/etr-lora-v4/`
- **Re-embedded Etruscan vocabulary:**
`gs://openetruscan-rosetta/embeddings/etr-xlmr-lora-v4.jsonl`
(8,905 rows × 768 dim).
- **Etruscan-side preprocessing:** word-divider normalisation
(`:` and `·` → space, per Bonfante 2002 §10), preserving `.`
(intra-word phonological marker) and `-` (compounding marker).
Hyperparameters (matching the v3 → v4 recipe in
`scripts/training/vertex/submit_etr_lora_v4.sh`):
| Hyperparameter | Value |
|---|---|
| Base model | `xlm-roberta-base` |
| Epochs | 5 |
| Learning rate | 5e-4 |
| Batch size | 16 |
| Max length | 64 tokens |
| LoRA r | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Target modules | `q_proj`, `v_proj` |
| Seed | 42 |
| Hardware | 1× NVIDIA T4 (Vertex AI `n1-standard-8`) |
| Wall time | ~30–60 min |
| Compute cost | ~$0.40 USD |
The training recipe (and divider-normalisation function) is in
[`scripts/training/vertex/train_etruscan_lora.py`](https://github.com/Eddy1919/openEtruscan/blob/main/scripts/training/vertex/train_etruscan_lora.py).
The only delta from v3 is the corpus input
(`etruscan-prod-rawtext-v3.jsonl`, the cleaner V3 corpus produced
after `normalize_inscriptions.py` removed Cyrillic / Latin-Ext-B
mirror-glyph corruption and unified sibilant variants σ/ś/š/ς → SAN).
## Evaluation
All numbers below are from the first frozen run of `rosetta-eval-v1`,
committed at
[`eval/rosetta-eval-v1-20260510T210124Z.json`](https://github.com/Eddy1919/openEtruscan/blob/main/eval/rosetta-eval-v1-20260510T210124Z.json).
**The `model` column reflects the LaBSE baseline that prod was serving
at the time of the run.** The v4 column will be added when T2.3 lands
v4 vectors in prod and T2.4 runs the head-to-head.
### Headline numbers — 22-pair test split
| Metric | random | Levenshtein | LaBSE (current prod) | v4 (after T2.3 / T2.4) |
|---|---:|---:|---:|---:|
| Strict-lexical precision@10 | 0.0002 | 0.000 | **0.0625** | _to be added_ |
| Semantic-field precision@10 | 0.0081 | 0.000 | **0.1875** | _to be added_ |
| Coverage@cos≥0.50 | 0.000 | 0.955 | **1.000** | _to be added_ |
| Coverage@cos≥0.70 | 0.000 | 0.273 | **1.000** | _to be added_ |
| Coverage@cos≥0.85 | 0.000 | 0.091 | **0.6875** | _to be added_ |
| n evaluated (of 22) | 22 | 22 | 16 | _to be added_ |
| n skipped (OOV on the source side) | 0 | 0 | 6 (27.3%) | _to be added_ |
### Per-confidence breakdown (LaBSE column)
| Confidence | n | strict @10 | field @10 |
|---|---:|---:|---:|
| high | 10 | 0.100 | 0.200 |
| medium | 6 | 0.000 | 0.167 |
### Per-category breakdown (LaBSE column, field@10)
| Category | n | field @10 |
|---|---:|---:|
| kinship | 3 | 0.333 |
| theonym | 3 | 0.333 |
| onomastic | 2 | 0.500 |
| religious | 2 | 0.000 |
| time | 2 | 0.000 |
| numeral | 3 | 0.000 |
| verb | 1 | 0.000 |
The strict-lexical metric measures something the system *cannot* do
without parallel-data supervision; the semantic-field metric measures
what it *can* do, and is the honest reflection of the system's actual
research utility. Both are reported side-by-side for historical
comparability.
For the full reproducibility manifest (pinned commit hashes, Latin
vocab snapshot, baseline math), see
[`research/notes/reproduce-rosetta-eval-v1.md`](https://github.com/Eddy1919/openEtruscan/blob/main/research/notes/reproduce-rosetta-eval-v1.md).
## Limitations
Honesty matters more here than marketing:
1. **Small held-out test split (n=22 pairs).** Confidence intervals
are correspondingly wide. RG.4 in the SOTA roadmap adds
95%-bootstrap CIs to every reported number; until that lands,
treat single-decimal-point differences between models as noise.
2. **27% OOV rate on the source side.** 6 of the 22 test-split pairs
are skipped by the model because the Etruscan token has no vector
in `language_word_embeddings`. The other two baselines (random,
Levenshtein) evaluate all 22. Comparisons are accordingly *not*
apples-to-apples without per-pair pairing.
3. **No primary-source-attested anchors used in training.** The
evaluation set is itself the philological consensus. Any training
signal that pushed precision up — short of genuinely parallel data
we do not have — would be reflecting that same consensus back at
us. Work-package P4 (primary-source mining) is the route out.
4. **Philological consensus reflects a school.** The Bonfante &
Bonfante / Wallace / Pallottino reading is one school's best
reading. Categories like `verb` (n=1) and `time` (n=2) are
under-represented; the per-category breakdown above is indicative,
not authoritative.
5. **Cross-language semantic alignment for unrelated surface forms
remains weak.** `clan → filius`, `puia → uxor`, `lautn → familia`
are misses by design; there is no signal in the training corpus
that these are equivalent.
## Citation
If you use this model, please cite both the software/dataset DOI and
the model directly:
```bibtex
@software{openetruscan_2026,
author = {OpenEtruscan Contributors},
title = {{OpenEtruscan: open-source digital corpus platform for Etruscan epigraphy}},
year = {2026},
version = {0.5.0},
doi = {10.5281/zenodo.20075836},
url = {https://doi.org/10.5281/zenodo.20075836},
publisher = {Zenodo}
}
@misc{openetruscan_etr_lora_v4_2026,
author = {OpenEtruscan Contributors},
title = {{etr-lora-v4: Etruscan-side LoRA adapter for LaBSE / XLM-R}},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Eddy1919/etr-lora-v4}},
note = {Evaluated against the rosetta-eval-v1 frozen benchmark.}
}
```
The frozen reference benchmark is `rosetta-eval-v1`; full reproduction
instructions live in
[`research/notes/reproduce-rosetta-eval-v1.md`](https://github.com/Eddy1919/openEtruscan/blob/main/research/notes/reproduce-rosetta-eval-v1.md).
## License
**Apache 2.0** — matches the model-artifact licensing scheme of the
OpenEtruscan repository (code: MIT, data: CC0 1.0, models:
Apache 2.0).
## Acknowledgements
- Vico, A. and Spanakis, G. (2023). *Larth Dataset* — primary source
for ~71% of the unified corpus.
- Compilers of the *Corpus Inscriptionum Etruscarum* (CIE Vol. I),
source of the remaining ~29%.
- Bonfante, G. and Bonfante, L. (2002). *The Etruscan Language: An
Introduction*, 2nd edition.
- Wallace, R. E. (2008). *Zikh Rasna: A Manual of the Etruscan
Language and Inscriptions*.
- Pallottino, M. (1968). *Testimonia Linguae Etruscae*.
- Feng et al. (2020). *LaBSE: Language-agnostic BERT Sentence
Embedding* — the cross-lingual anchor.
- The Pelagios Network, the EpiDoc community, and the Classical
Language Toolkit.
|