---
base_model: AI-Sweden-Models/gpt-sw3-6.7b-v2
library_name: transformers
datasets:
- barbaroo/Sprotin_parallel
- barbaroo/fo_en_synthetic
language:
- en
- fo
metrics:
- bleu
- chrf
- bertscore
pipeline_tag: text-generation
---

# Model Card: English–Faroese Translation (Merged Model)

## Model Details

### Model Description

- **Developed by:** Barbara Scalvini
- **Model type:** Fully merged model for **English → Faroese** translation
- **Languages:** English, Faroese
- **License:** Inherits the license of the base model (GPT-SW3 6.7B)
- **Finetuned from:** [AI-Sweden-Models/gpt-sw3-6.7b-v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2)
- **Library:** [Transformers](https://github.com/huggingface/transformers)

This model is the **merged version** of the PEFT adapter [`barbaroo/gptsw3_translate_synth_6.7B`](https://huggingface.co/barbaroo/gptsw3_translate_synth_6.7B) with its base model.

---

## Uses

### Direct Use
- English → Faroese machine translation.

### Downstream Use
- Can be integrated into **multilingual NLP pipelines** or localization workflows.

### Out-of-Scope Use
- Languages other than English or Faroese.
- Tasks such as summarization, classification, or dialogue without further fine-tuning.

---

## Bias, Risks, and Limitations

- As with all translation models, this model may reflect **biases** present in its training corpora.
- Outputs should be **carefully validated** before use in sensitive or high-stakes domains.

---

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model repo
MODEL_NAME = "barbaroo/gptsw3-6.7B-translation-en-fo"

# Quantization config (8-bit) to reduce the memory footprint
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Initialize tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)
model.eval()

# Alpaca-style prompt template
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
print("EOS token:", EOS_TOKEN)

# Example sentences
sentences = ["I love Faroese!"]
translations = []

for sentence in sentences:
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",
                sentence,
                "",  # leave the response field empty for generation
            )
        ],
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        use_cache=True,
        do_sample=True,
        temperature=0.1,
        top_p=1.0,
    )

    # Keep special tokens so the EOS token can be stripped explicitly below
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]

    # The translation follows the "### Response:" marker in the prompt
    try:
        response = output_string.split("Response:\n", 1)[1]
        translation = response.replace(EOS_TOKEN, "").strip()
    except IndexError:
        translation = ""

    translations.append(translation)
    print(translation)
```

## Training Details

### Training Data
- [barbaroo/Sprotin_parallel](https://huggingface.co/datasets/barbaroo/Sprotin_parallel)
- [barbaroo/fo_en_synthetic](https://huggingface.co/datasets/barbaroo/fo_en_synthetic)

### Procedure
- Initially trained as a **PEFT adapter** using Alpaca-style prompts.
- Then **merged with the base GPT-SW3 6.7B model** to produce this standalone version (a merge sketch is included at the end of this card).

**Hyperparameters:**
- Epochs: 3 (early stopping on validation loss)
- Batch size: 2 (with 4 gradient accumulation steps)
- Learning rate: 2e-4
- Optimizer: AdamW with a learning-rate scheduler and warm-up

---

## Evaluation

### Test Data
- FLORES-200 benchmark (~1012 English–Faroese sentence pairs).

### Metrics
- **BLEU:** 19.8
- **chrF:** 52.4
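
These BLEU and chrF scores can be recomputed with [sacreBLEU](https://github.com/mjpost/sacrebleu). The sketch below is illustrative only: `hypotheses` and `references` are placeholder lists (in practice, model outputs from the generation snippet above and the corresponding FLORES-200 Faroese sentences), and the exact sacreBLEU settings behind the scores reported here are not documented in this card.

```python
import sacrebleu

# Placeholder data: fill `hypotheses` with model translations and
# `references` with the corresponding FLORES-200 Faroese sentences.
hypotheses = ["model translation 1", "model translation 2"]
references = ["reference translation 1", "reference translation 2"]

# sacreBLEU expects a list of reference *streams*, hence the extra nesting
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.1f}")
print(f"chrF: {chrf.score:.1f}")
```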
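
---

For reference, the adapter merge described under **Procedure** can be reproduced with PEFT's `merge_and_unload`. This is a minimal sketch under assumed settings (full-precision weights, default merge options), not the exact script used to produce this checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "AI-Sweden-Models/gpt-sw3-6.7b-v2"
ADAPTER = "barbaroo/gptsw3_translate_synth_6.7B"

# Load the base model in full precision (merging into quantized weights is
# lossy); this requires enough memory for the full 6.7B-parameter model.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")

# Attach the PEFT adapter, then fold its weights into the base model
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

# Save a standalone checkpoint (output directory name is arbitrary)
merged.save_pretrained("gptsw3-6.7B-translation-en-fo")
AutoTokenizer.from_pretrained(BASE).save_pretrained("gptsw3-6.7B-translation-en-fo")
```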