---
base_model: AI-Sweden-Models/gpt-sw3-6.7b-v2
library_name: transformers
datasets:
- barbaroo/Sprotin_parallel
- barbaroo/fo_en_synthetic
language:
- en
- fo
metrics:
- bleu
- chrf
- bertscore
pipeline_tag: text-generation
---

# Model Card: English–Faroese Translation (Merged Model)

## Model Details

### Model Description

- **Developed by:** Barbara Scalvini
- **Model type:** Fully merged model for **English → Faroese** translation
- **Languages:** English, Faroese
- **License:** Inherits the license of the base model (GPT-SW3 6.7B)
- **Finetuned from:** [AI-Sweden-Models/gpt-sw3-6.7b-v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2)
- **Library:** [Transformers](https://github.com/huggingface/transformers)

This model is the **merged version** of the PEFT adapter [`barbaroo/gptsw3_translate_synth_6.7B`](https://huggingface.co/barbaroo/gptsw3_translate_synth_6.7B) with its base model.

---

## Uses

### Direct Use
- English → Faroese machine translation.

### Downstream Use
- Can be integrated into **multilingual NLP pipelines** or localization workflows.

### Out-of-Scope Use
- Languages other than English or Faroese.
- Tasks such as summarization, classification, or dialogue without further fine-tuning.

---

## Bias, Risks, and Limitations

- As with all translation models, this model may reflect **biases** present in its training corpora.
- Outputs should be **carefully validated** before use in sensitive or high-stakes domains.

---

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model repo
MODEL_NAME = "barbaroo/gptsw3-6.7B-translation-en-fo"

# Quantization config (8-bit) to reduce the memory footprint
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Initialize tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)
model.eval()

# Alpaca-style prompt template
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
print("EOS token:", EOS_TOKEN)

# Example sentences
sentences = ["I love Faroese!"]
translations = []

for sentence in sentences:
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",
                sentence,
                "",  # leave the response field empty for generation
            )
        ],
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        use_cache=True,
        do_sample=True,
        temperature=0.1,
        top_p=1.0,
    )

    # Keep special tokens so the EOS token can be stripped explicitly below
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]

    # The translation follows the "### Response:" marker in the prompt
    try:
        response = output_string.split("Response:\n", 1)[1]
        translation = response.replace(EOS_TOKEN, "").strip()
    except IndexError:
        translation = ""

    translations.append(translation)
    print(translation)
```

## Training Details

### Training Data
- [barbaroo/Sprotin_parallel](https://huggingface.co/datasets/barbaroo/Sprotin_parallel)
- [barbaroo/fo_en_synthetic](https://huggingface.co/datasets/barbaroo/fo_en_synthetic)

### Procedure
- Initially trained as a **PEFT adapter** using Alpaca-style prompts.
- Then **merged with the base GPT-SW3 6.7B model** to produce this standalone version (a merge sketch is included at the end of this card).

**Hyperparameters:**
- Epochs: 3 (early stopping on validation loss)
- Batch size: 2 (with 4 gradient accumulation steps)
- Learning rate: 2e-4
- Optimizer: AdamW with a learning-rate scheduler and warm-up

---

## Evaluation

### Test Data
- FLORES-200 benchmark (~1012 English–Faroese sentence pairs).

### Metrics
- **BLEU:** 19.8
- **chrF:** 52.4
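
These BLEU and chrF scores can be recomputed with [sacreBLEU](https://github.com/mjpost/sacrebleu). The sketch below is illustrative only: `hypotheses` and `references` are placeholder lists (in practice, model outputs from the generation snippet above and the corresponding FLORES-200 Faroese sentences), and the exact sacreBLEU settings behind the scores reported here are not documented in this card.

```python
import sacrebleu

# Placeholder data: fill `hypotheses` with model translations and
# `references` with the corresponding FLORES-200 Faroese sentences.
hypotheses = ["model translation 1", "model translation 2"]
references = ["reference translation 1", "reference translation 2"]

# sacreBLEU expects a list of reference *streams*, hence the extra nesting
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.1f}")
print(f"chrF: {chrf.score:.1f}")
```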
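
---

For reference, the adapter merge described under **Procedure** can be reproduced with PEFT's `merge_and_unload`. This is a minimal sketch under assumed settings (full-precision weights, default merge options), not the exact script used to produce this checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "AI-Sweden-Models/gpt-sw3-6.7b-v2"
ADAPTER = "barbaroo/gptsw3_translate_synth_6.7B"

# Load the base model in full precision (merging into quantized weights is
# lossy); this requires enough memory for the full 6.7B-parameter model.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")

# Attach the PEFT adapter, then fold its weights into the base model
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

# Save a standalone checkpoint (output directory name is arbitrary)
merged.save_pretrained("gptsw3-6.7B-translation-en-fo")
AutoTokenizer.from_pretrained(BASE).save_pretrained("gptsw3-6.7B-translation-en-fo")
```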