---
language:
  - ett
  - la
license: apache-2.0
library_name: sentence-transformers
tags:
  - sentence-transformers
  - cross-lingual
  - low-resource-nlp
  - ancient-languages
  - etruscan
  - epigraphy
  - LoRA
  - LaBSE
  - XLM-R
base_model: sentence-transformers/LaBSE
datasets:
  - Eddy1919/openetruscan-corpus
metrics:
  - precision-at-k
  - cosine-similarity
model-index:
  - name: etr-lora-v4
    results:
      - task:
          type: cross-lingual-retrieval
          name: Etruscan-Latin word-vector retrieval
        dataset:
          name: rosetta-eval-v1 (test split)
          type: Eddy1919/openetruscan-rosetta-eval-v1
        metrics:
          - type: precision_at_10_semantic_field
            value: 0.1875
            name: Semantic-field precision@10 (LaBSE baseline)
            verified: false
          - type: precision_at_10
            value: 0.0625
            name: Strict-lexical precision@10 (LaBSE baseline)
            verified: false
---

# etr-lora-v4 — Etruscan-side LoRA adapter for LaBSE

> **Status note.** The numbers in the YAML frontmatter and in the
> Evaluation table below are the **LaBSE-only** column of the current
> frozen `rosetta-eval-v1` benchmark. That is what the first Hub
> deposit covers. The **v4 column** will be added after WBS tasks
> **T2.3** (ingest v4 vectors behind a feature flag) and **T2.4** (run
> the head-to-head eval) land in prod and the benchmark gains its
> fourth row.

## TL;DR

`etr-lora-v4` is a **LoRA adapter** that fine-tunes the **Etruscan-side
vocabulary projection** of a multilingual encoder (XLM-R-base, with
LaBSE as the cross-lingual anchor on the Latin/Greek side) so that
Etruscan words land in the same 768-dim semantic space as the rest of
the multilingual vocabulary. The system is evaluated against held-out
Etruscan ↔ Latin equivalences drawn from the philological literature
(Bonfante & Bonfante 2002, Wallace 2008, Pallottino 1968), exposed
through the `rosetta-eval-v1` frozen benchmark.

The pipeline is designed for **semantic-neighbourhood retrieval over a
low-resource, undeciphered ancient language**, not lexical-equivalence
translation. See *Limitations* before you cite the numbers.

## Intended use

- **Cognate / loanword detection.** Given an Etruscan word, find
  orthographically or semantically similar Latin or Greek words.
  Useful for spotting Etruscan→Latin borrowings (e.g. `histrio`,
  `popa`, `subulo`, `satura`).
- **Theonym and place-name alignment.** Etruscan deity and place
  names were often Latinised by Roman authors with regular sound
  correspondences. The system reliably recovers these:
  `menrva→minerva`, `hercle→hercules`, `fanu→fanum`.
- **Semantic-field exploration.** For an Etruscan query, the system
  returns Latin words with related meanings even when the exact
  target lemma is missed (e.g. `papa→[papa, daddy, pater]`).
- **Multilingual nearest-neighbour browsing** as a primitive other
  ancient-language work (Phoenician, Faliscan, Oscan) can plug into
  without rebuilding the storage / API layer.

## Out of scope

- **Mechanical Etruscan → Latin translation.** Lexical equivalence
  between *unrelated* surface forms (`clan → filius`, `puia → uxor`,
  `lautn → familia`) is **not** in the model, and no amount of
  pooling, centering, or LoRA fine-tuning recovers signal that was
  never in the training corpus.
- **Decipherment of unknown Etruscan words.** Top-k results will be
  orthographic and semantic neighbours of the source surface form,
  not authoritative semantic equivalents.
- **An Etruscan dictionary.** The output is a ranked shortlist for
  downstream philological judgement, not a dictionary entry or a
  translation.

## How to use

### From `sentence-transformers`

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Eddy1919/etr-lora-v4")
embeddings = model.encode(["fanu", "avil", "clan"])
# embeddings.shape == (3, 768)
```
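
Once embeddings are available, neighbour retrieval reduces to ranking candidates by cosine similarity. The following is a minimal sketch of that step, not the project's actual retrieval code: the 3-dim vectors are made-up stand-ins for 768-dim `model.encode()` output, and the helper names `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, candidates, k=10):
    """Return the k (word, score) pairs most similar to query_vec, best first."""
    scored = [(word, cosine(query_vec, vec)) for word, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only (not real model output).
latin = {
    "fanum": [0.9, 0.1, 0.0],
    "annus": [0.0, 1.0, 0.1],
    "filius": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]  # hypothetical vector for the Etruscan query "fanu"
ranking = top_k(query, latin, k=2)
```

With real embeddings, `query` would be one row of the array returned by `model.encode(...)` and `latin` would hold the target-side vocabulary vectors.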

### Through the hosted API

```bash
curl 'https://api.openetruscan.com/neural/rosetta?word=fanu&from=ett&to=lat&embedder=xlmr-lora-v4'
```

The default `embedder` is `LaBSE`. Passing `embedder=xlmr-lora-v4` is
intended to route the query through the v4 adapter, but until T2.3
lands the v4 partition in prod the route returns LaBSE results
regardless of this parameter.
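
For scripted access, the same request URL can be assembled in Python. This is a small sketch that only builds the URL shown in the `curl` example; the helper name `rosetta_url` is hypothetical, and the response format is not documented here, so no parsing is attempted.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openetruscan.com/neural/rosetta"

def rosetta_url(word, source="ett", target="lat", embedder=None):
    """Build the query URL; omit `embedder` to use the server default."""
    params = {"word": word, "from": source, "to": target}
    if embedder is not None:
        params["embedder"] = embedder
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    return f"{BASE_URL}?{urlencode(params)}"

url = rosetta_url("fanu", embedder="xlmr-lora-v4")
```

Building the query string with `urlencode` rather than string concatenation keeps transliterations that contain non-ASCII letters safe to send.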

## Training data

Derived from the **OpenEtruscan corpus v1** (Zenodo DOI
[10.5281/zenodo.20075836](https://doi.org/10.5281/zenodo.20075836)):

- **6,633 unified inscriptions**, drawn primarily from the *Larth
  Dataset* (Vico & Spanakis 2023; ~71% of rows) and the *Corpus
  Inscriptionum Etruscarum* Vol. I extractions (~29%).
- **~8,905 unique Etruscan tokens** on the source side after
  divider-normalisation (see *Training procedure* below).
- No primary-source-attested anchors are used in training — only the
  raw transcriptions. The Bonfante / Wallace / Pallottino
  equivalences are held out for evaluation in `rosetta-eval-v1`.

Upstream provenance chain is documented in
[`research/BIBLIOGRAPHY.md`](https://github.com/Eddy1919/openEtruscan/blob/main/research/BIBLIOGRAPHY.md).

## Training procedure

LoRA over **XLM-R-base** (768-dim hidden), trained on Vertex AI in
the `openetruscan-rosetta` GCP project.

- **Output adapter:** `gs://openetruscan-rosetta/adapters/etr-lora-v4/`
- **Re-embedded Etruscan vocabulary:**
  `gs://openetruscan-rosetta/embeddings/etr-xlmr-lora-v4.jsonl`
  (8,905 rows × 768 dim).
- **Etruscan-side preprocessing:** word-divider normalisation
  (`:` and `·` → space, per Bonfante 2002 §10), preserving `.`
  (intra-word phonological marker) and `-` (compounding marker).
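
The divider rule above can be sketched in a few lines. This is an illustrative reimplementation, not the function from `train_etruscan_lora.py`: the name `normalise_dividers` is hypothetical, collapsing repeated whitespace is an added assumption, and the sample strings are illustrative rather than quoted inscriptions.

```python
import re

_DIVIDERS = re.compile(r"[:·]")   # word dividers mapped to a space
_SPACES = re.compile(r"\s+")      # assumption: collapse runs of whitespace

def normalise_dividers(text):
    """Replace ':' and '·' with spaces; leave '.' and '-' untouched."""
    spaced = _DIVIDERS.sub(" ", text)
    return _SPACES.sub(" ", spaced).strip()

tokens = normalise_dividers("mi:larθia·velθur").split()
```

Because `.` and `-` are preserved, a form such as `vel.θur-clan` passes through unchanged and stays a single token.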

Hyperparameters (matching the v3 → v4 recipe in
`scripts/training/vertex/submit_etr_lora_v4.sh`):

| Hyperparameter | Value |
|---|---|
| Base model | `xlm-roberta-base` |
| Epochs | 5 |
| Learning rate | 5e-4 |
| Batch size | 16 |
| Max length | 64 tokens |
| LoRA r | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Target modules | `q_proj`, `v_proj` |
| Seed | 42 |
| Hardware | 1× NVIDIA T4 (Vertex AI `n1-standard-8`) |
| Wall time | ~30–60 min |
| Compute cost | ~$0.40 USD |
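
To give a sense of scale for these settings, the arithmetic below estimates the number of trainable LoRA parameters. It is a back-of-the-envelope sketch under stated assumptions: adapters attach to the two target modules in each of the 12 transformer layers of XLM-R-base, every targeted projection is 768 × 768, and standard LoRA scaling (`alpha / r`) applies. The training script may differ in detail.

```python
HIDDEN = 768          # XLM-R-base hidden size
LAYERS = 12           # XLM-R-base transformer layers
TARGETS_PER_LAYER = 2 # the two target modules from the table (assumed per layer)
R = 8                 # LoRA rank
ALPHA = 16            # LoRA alpha

# Each adapted projection gains two low-rank matrices:
# A with shape (R, d_in) and B with shape (d_out, R).
params_per_module = R * (HIDDEN + HIDDEN)
total_lora_params = params_per_module * TARGETS_PER_LAYER * LAYERS

# Standard LoRA scales the low-rank update by alpha / r.
scaling = ALPHA / R

print(params_per_module, total_lora_params, scaling)
```

Under these assumptions the adapter holds roughly 0.3M trainable parameters, which is in keeping with a single-T4 training run of under an hour.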

The training recipe (and divider-normalisation function) is in
[`scripts/training/vertex/train_etruscan_lora.py`](https://github.com/Eddy1919/openEtruscan/blob/main/scripts/training/vertex/train_etruscan_lora.py).
The only delta from v3 is the corpus input
(`etruscan-prod-rawtext-v3.jsonl`, the cleaner V3 corpus produced
after `normalize_inscriptions.py` removed Cyrillic / Latin-Ext-B
mirror-glyph corruption and unified sibilant variants σ/ś/š/ς → SAN).

## Evaluation

All numbers below are from the first frozen run of `rosetta-eval-v1`,
committed at
[`eval/rosetta-eval-v1-20260510T210124Z.json`](https://github.com/Eddy1919/openEtruscan/blob/main/eval/rosetta-eval-v1-20260510T210124Z.json).
**The `model` column reflects the LaBSE baseline that prod was serving
at the time of the run.** The v4 column will be added when T2.3 lands
v4 vectors in prod and T2.4 runs the head-to-head.

### Headline numbers — 22-pair test split

| Metric | random | Levenshtein | LaBSE (current prod) | v4 (after T2.3 / T2.4) |
|---|---:|---:|---:|---:|
| Strict-lexical precision@10           | 0.0002 | 0.000 | **0.0625** | _to be added_ |
| Semantic-field precision@10           | 0.0081 | 0.000 | **0.1875** | _to be added_ |
| Coverage@cos≥0.50                     | 0.000  | 0.955 | **1.000**  | _to be added_ |
| Coverage@cos≥0.70                     | 0.000  | 0.273 | **1.000**  | _to be added_ |
| Coverage@cos≥0.85                     | 0.000  | 0.091 | **0.6875** | _to be added_ |
| n evaluated (of 22)                   | 22     | 22    | 16         | _to be added_ |
| n skipped (OOV on the source side)    | 0      | 0     | 6 (27.3%)  | _to be added_ |

### Per-confidence breakdown (LaBSE column)

| Confidence | n | strict @10 | field @10 |
|---|---:|---:|---:|
| high   | 10 | 0.100 | 0.200 |
| medium | 6  | 0.000 | 0.167 |

### Per-category breakdown (LaBSE column, field@10)

| Category   | n | field @10 |
|---|---:|---:|
| kinship    | 3 | 0.333 |
| theonym    | 3 | 0.333 |
| onomastic  | 2 | 0.500 |
| religious  | 2 | 0.000 |
| time       | 2 | 0.000 |
| numeral    | 3 | 0.000 |
| verb       | 1 | 0.000 |

The strict-lexical metric measures something the system *cannot* do
without parallel-data supervision; the semantic-field metric measures
what it *can* do, and is the honest reflection of the system's actual
research utility. Both are reported side-by-side for historical
comparability.
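
The difference between the two metrics can be illustrated with a toy computation. The reported values are multiples of 1/16 (0.0625 = 1/16, 0.1875 = 3/16), which is consistent with a per-query hit rate over the 16 evaluated pairs; that reading is an assumption here, and the authoritative definition is in the reproducibility manifest. The rankings and semantic-field sets below are invented for illustration and are not taken from the benchmark.

```python
def hit_rate_at_k(rankings, gold, k=10):
    """Fraction of queries with at least one gold word among the top k."""
    hits = sum(
        1 for query, ranked in rankings.items()
        if gold[query] & set(ranked[:k])
    )
    return hits / len(rankings)

# Hypothetical top-k lists for four Etruscan queries.
rankings = {
    "menrva": ["minerva", "minervae"],
    "clan": ["clam", "natus"],
    "puia": ["pura", "pia"],
    "lautn": ["lautus", "laus"],
}
# Strict: only the exact target lemma counts.
strict_gold = {
    "menrva": {"minerva"}, "clan": {"filius"},
    "puia": {"uxor"}, "lautn": {"familia"},
}
# Field: any word in the (invented) semantic field counts.
field_gold = {
    "menrva": {"minerva"}, "clan": {"filius", "natus"},
    "puia": {"uxor", "coniunx"}, "lautn": {"familia", "gens"},
}

strict = hit_rate_at_k(rankings, strict_gold)
field = hit_rate_at_k(rankings, field_gold)
```

In this toy case `clan` is a strict miss but a field hit, which is the pattern the semantic-field metric is meant to capture.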

For the full reproducibility manifest (pinned commit hashes, Latin
vocab snapshot, baseline math), see
[`research/notes/reproduce-rosetta-eval-v1.md`](https://github.com/Eddy1919/openEtruscan/blob/main/research/notes/reproduce-rosetta-eval-v1.md).

## Limitations

Honesty matters more here than marketing:

1. **Small held-out test split (n=22 pairs).** Confidence intervals
   are correspondingly wide. RG.4 in the SOTA roadmap adds
   95%-bootstrap CIs to every reported number; until that lands,
   treat single-decimal-point differences between models as noise.
2. **27% OOV rate on the source side.** The model skips 6 of the 22
   test-split pairs because the Etruscan token has no vector in
   `language_word_embeddings`, whereas the other two baselines (random,
   Levenshtein) evaluate all 22. Comparisons across columns are
   therefore *not* like-for-like unless they are restricted to the
   same pairs.
3. **No primary-source-attested anchors used in training.** The
   evaluation set is itself the philological consensus. Any training
   signal that pushed precision up — short of genuinely parallel data
   we do not have — would be reflecting that same consensus back at
   us. Work-package P4 (primary-source mining) is the route out.
4. **Philological consensus reflects a school.** The Bonfante &
   Bonfante / Wallace / Pallottino reading is one school's best
   reading. Categories like `verb` (n=1) and `time` (n=2) are
   under-represented; the per-category breakdown above is indicative,
   not authoritative.
5. **Cross-language semantic alignment for unrelated surface forms
   remains weak.** `clan → filius`, `puia → uxor`, `lautn → familia`
   are misses by design; there is no signal in the training corpus
   that these are equivalent.
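
To make the first limitation concrete, a percentile bootstrap over per-pair outcomes shows how wide the uncertainty is at this sample size. This is a sketch only: the method RG.4 will adopt is not specified here, the function name `bootstrap_ci` is hypothetical, and the outcome vector assumes the semantic-field score of 0.1875 corresponds to 3 hits out of 16 evaluated pairs.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, level=0.95, seed=42):
    """Percentile-bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

# Assumed per-pair outcomes: 3 hits, 13 misses (3 / 16 = 0.1875).
outcomes = [1] * 3 + [0] * 13
low, high = bootstrap_ci(outcomes)
print(f"95% bootstrap CI: [{low:.3f}, {high:.3f}]")
```

With only 16 binary outcomes, each hit moves the score by 0.0625, so the resulting interval spans several such steps on either side of the point estimate.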

## Citation

If you use this model, please cite both the software/dataset DOI and
the model directly:

```bibtex
@software{openetruscan_2026,
  author    = {OpenEtruscan Contributors},
  title     = {{OpenEtruscan: open-source digital corpus platform for Etruscan epigraphy}},
  year      = {2026},
  version   = {0.5.0},
  doi       = {10.5281/zenodo.20075836},
  url       = {https://doi.org/10.5281/zenodo.20075836},
  publisher = {Zenodo}
}

@misc{openetruscan_etr_lora_v4_2026,
  author       = {OpenEtruscan Contributors},
  title        = {{etr-lora-v4: Etruscan-side LoRA adapter for LaBSE / XLM-R}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Eddy1919/etr-lora-v4}},
  note         = {Evaluated against the rosetta-eval-v1 frozen benchmark.}
}
```

The frozen reference benchmark is `rosetta-eval-v1`; full reproduction
instructions live in
[`research/notes/reproduce-rosetta-eval-v1.md`](https://github.com/Eddy1919/openEtruscan/blob/main/research/notes/reproduce-rosetta-eval-v1.md).

## License

**Apache 2.0** — matches the model-artifact licensing scheme of the
OpenEtruscan repository (code: MIT, data: CC0 1.0, models:
Apache 2.0).

## Acknowledgements

- Vico, A. and Spanakis, G. (2023). *Larth Dataset* — primary source
  for ~71% of the unified corpus.
- Compilers of the *Corpus Inscriptionum Etruscarum* (CIE Vol. I),
  source of the remaining ~29%.
- Bonfante, G. and Bonfante, L. (2002). *The Etruscan Language: An
  Introduction*, 2nd edition.
- Wallace, R. E. (2008). *Zikh Rasna: A Manual of the Etruscan
  Language and Inscriptions*.
- Pallottino, M. (1968). *Testimonia Linguae Etruscae*.
- Feng et al. (2020). *LaBSE: Language-agnostic BERT Sentence
  Embedding* — the cross-lingual anchor.
- The Pelagios Network, the EpiDoc community, and the Classical
  Language Toolkit.