Malfono LARK – Syriac–English Translation LoRA Adapter

Model Description

This is a LoRA adapter for CohereLabs/tiny-aya-base, fine‑tuned to translate between English and Classical Syriac. The model was trained with the LARK (Language-Agnostic Rule-Guided Knowledge-Constrained Generation) framework, which adds a constraint‑aware loss term that encourages grammatical correctness (subject‑verb agreement, construct‑state chains), based on a knowledge base of Syriac morphological rules extracted from a grammar textbook.
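The card does not spell out the exact form of the constraint‑aware loss; a common formulation for this kind of objective is a weighted sum of the standard cross‑entropy loss and a grammar‑constraint penalty. The sketch below illustrates that idea only (the function name and exact form are assumptions, not the released training code):

```python
def lark_loss(ce_loss: float, constraint_penalty: float, weight: float = 0.1) -> float:
    """Sketch of a combined objective: cross-entropy plus a weighted
    grammar-constraint penalty (weight 0.1 matches the card's setting)."""
    return ce_loss + weight * constraint_penalty
```

With the reported weight of 0.1, the constraint term nudges generation toward rule‑conforming morphology without dominating the translation objective.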

The adapter alone is small (~7 MB) and must be loaded on top of the base model.

Intended Uses & Limitations

Intended use:

  • Translation from English to Syriac and Syriac to English.
  • Research on neuro‑symbolic methods for low‑resource languages.
  • Demonstration of grammar‑aware fine‑tuning.

Limitations:

  • Due to limited training (30% of the Peshitta, ~1400 steps on a Kaggle T4 GPU), translation fluency is still moderate (BLEU score not reported).
  • The model sometimes produces repetitive or incomplete outputs.
  • Syriac orthography uses a simplified ASCII‑to‑Syriac transliteration; diacritics are not preserved.
  • The morphological analyzer is rule‑based and may produce occasional false positives.

Training Data

  • Syriac text: Peshitta Old Testament (ETCBC) – 49,455 verses.
  • English parallel: eBible Corpus (English Standard Version).
  • Training split: 30% of the aligned verses (≈ 15,000 examples).
  • Prompt format: Alpaca‑style instruction:
    ### Instruction:
    Translate the following English text to Syriac.
    ### Input:
    {English sentence}
    ### Response:
    {Syriac translation}
    
    (Both translation directions were used.)
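The prompt format above can be sketched as a small helper covering both directions (the function name is illustrative, not part of the released code; the blank lines between sections match the inference snippet later in this card):

```python
def build_prompt(text: str, direction: str = "en2syr") -> str:
    """Build an Alpaca-style prompt for either translation direction.

    direction: "en2syr" (English -> Syriac) or "syr2en" (Syriac -> English).
    Hypothetical helper illustrating the card's prompt format.
    """
    if direction == "en2syr":
        instruction = "Translate the following English text to Syriac."
    elif direction == "syr2en":
        instruction = "Translate the following Syriac text to English."
    else:
        raise ValueError(f"unknown direction: {direction}")
    return (
        "### Instruction:\n" + instruction
        + "\n\n### Input:\n" + text
        + "\n\n### Response:\n"
    )
```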

Training Procedure

  • Base model: CohereLabs/tiny-aya-base (3.35B parameters).
  • Quantization: 4‑bit (QLoRA) via unsloth.
  • LoRA rank: 16, applied to q_proj, k_proj, v_proj, o_proj.
  • Batch size: 2 per GPU, gradient accumulation 4 (effective batch 8).
  • Sequence length: 64 tokens.
  • Optimizer: paged_adamw_8bit.
  • Learning rate: 2e‑4.
  • Steps: 1000 (resumed from a checkpoint trained for 700 steps).
  • Constraint loss weight (LARK): 0.1.
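For reference, the hyperparameters above can be collected into a single config. This is a sketch of the settings listed in this card, not the actual training script (key names are illustrative):

```python
train_config = {
    "base_model": "CohereLabs/tiny-aya-base",
    "load_in_4bit": True,                      # QLoRA via unsloth
    "lora_r": 16,
    "lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "per_device_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "max_seq_length": 64,
    "optimizer": "paged_adamw_8bit",
    "learning_rate": 2e-4,
    "max_steps": 1000,
    "lark_constraint_weight": 0.1,
}

# Effective batch size = per-device batch * gradient accumulation steps.
effective_batch = (
    train_config["per_device_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
```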

Evaluation Results

The model was evaluated on 100 held‑out verses from the Peshitta using two grammar‑focused metrics:

  Metric                                                                   Score
  Subject‑verb agreement accuracy (gender & number)                        36.0%
  Morphological violation rate (% of tokens violating any rule in the KB)   9.9%

These numbers show that the LARK constraint reduces grammatical errors compared to a baseline fine‑tuned without constraints (baseline agreement accuracy ≈ 28%, violation rate ≈ 15%).
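As a rough illustration of how the violation‑rate metric is computed, suppose the analyzer flags each generated token as violating some KB rule or not (this boolean‑per‑token representation is an assumption; the actual rule‑based analyzer is not shown here):

```python
def violation_rate(token_flags) -> float:
    """Percentage of generated tokens flagged as violating any KB rule.

    token_flags: iterable of booleans, one per token (True = violation).
    Hypothetical representation of the analyzer's output.
    """
    flags = list(token_flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```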

How to Use

Installation

pip install peft transformers torch accelerate

Load the adapter (English → Syriac translation)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "CohereLabs/tiny-aya-base"
adapter_name = "aaronmat1905/malfono-lark-lora"

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(base_model, adapter_name)
model.eval()

Translation function

def translate_to_syriac(english_sentence: str, max_new_tokens: int = 60) -> str:
    # Build the Alpaca-style prompt used during training.
    prompt = (
        "### Instruction:\nTranslate the following English text to Syriac.\n\n"
        f"### Input:\n{english_sentence}\n\n### Response:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.2,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode, then keep only the text after the "### Response:" marker.
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "### Response:\n" in response:
        response = response.split("### Response:\n")[-1].strip()
    # Keep only the first line to trim any trailing generation.
    return response.split("\n")[0]

print(translate_to_syriac("Peace be with you."))

Reverse direction (Syriac → English)

Replace the instruction with "Translate the following Syriac text to English." and swap input/output.
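Concretely, a mirrored helper could look like the sketch below (the function name is hypothetical; it assumes the `model` and `tokenizer` objects loaded earlier are in scope):

```python
def translate_to_english(syriac_sentence: str, max_new_tokens: int = 60) -> str:
    # Same as translate_to_syriac above, with the instruction swapped.
    prompt = (
        "### Instruction:\nTranslate the following Syriac text to English.\n\n"
        f"### Input:\n{syriac_sentence}\n\n### Response:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                       max_length=512).to(model.device)
    outputs = model.generate(  # wrap in torch.no_grad() as above
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.2,
        do_sample=True,
        top_p=0.9,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "### Response:\n" in response:
        response = response.split("### Response:\n")[-1].strip()
    return response.split("\n")[0]
```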

Citation

If you use this model in your research, please cite the LARK project (see the LARK repository for details).

Contact

For questions, please open an issue on the Hugging Face community tab or the GitHub repository.
