Repetitive Answers From Fine-Tuned LLM

Hello people, I fine-tuned Llama 3.2-1B model with a Turkish dataset.

Sometimes it gives repetitive answers. Our dataset does not contain any repetitive lines, and this problem occurs with every model I fine-tune. I am making a mistake somewhere but don't know where. Here are the model and dataset first, if you want to inspect them:

You may not be able to read it since it's Turkish, but the dataset we combined is somewhat "synthetic" or "direct"; the examples are not very user-friendly. Maybe the problem is related to the dataset.

Here is the code:

from huggingface_hub import notebook_login
notebook_login()

# %%
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

# %%
from peft import LoraConfig
from transformers import BitsAndBytesConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# %%
from transformers import AutoTokenizer, AutoModelForCausalLM

modelName = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(modelName)
model = AutoModelForCausalLM.from_pretrained(modelName, quantization_config=bnb_config, device_map="auto")

# %%
from datasets import load_dataset
dataset = load_dataset("myzens/alpaca-turkish-combined", split="train")
dataset, dataset[0]

# %%
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

bos_token = tokenizer.bos_token
eos_token = tokenizer.eos_token

tokenizer.pad_token_id = 128002  # <|reserved_special_token_0|>, used as pad token
pad_token = tokenizer.pad_token

# %%
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = bos_token + alpaca_prompt.format(instruction, input, output) + eos_token
        texts.append(text)
    return { "text" : texts, }

# %%
dataset = dataset.map(formatting_prompts_func, batched = True)

# %%
print(dataset["text"][0])

# %%
from transformers import TrainingArguments

train_args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #max_steps = 150,
        num_train_epochs = 1,
        gradient_checkpointing = True,
        learning_rate = 2e-4,
        bf16 = True,
        logging_steps = 250,
        optim = "adamw_hf",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        output_dir = "llama3.2-1b-tr",
)

# %%
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    args = train_args,
    peft_config = lora_config,
    train_dataset = dataset,
    dataset_text_field = "text",
    packing = False,
)
trainer.train()

# %%
model.push_to_hub("emre570/llama3.2-1b-tr-qlora")
tokenizer.push_to_hub("emre570/llama3.2-1b-tr-qlora")

What am I doing wrong?


What other models have you tried that yield similar results?

I tried Gemma 2B, 7B, Gemma 1.1 7B and Llama 3 8B. And I think it’s my mistake.

Not an expert but I had similar problems with some other models and passing repetition_penalty to the generate function with a value larger than 1 helped.

All these Hugging Face methods are super confusing :slight_smile:


Use max_steps and train for about 2-5k steps.
Also, use repetition_penalty in your text generation settings.
See: Text generation strategies (huggingface.co)
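The step-budget suggestion above can be sketched like this; a hedged example, where max_steps=3000 is just an illustrative value within the suggested 2-5k range and the other arguments mirror the original post:

```python
from transformers import TrainingArguments

# Cap training by step count instead of epoch count.
# max_steps=3000 is illustrative; anywhere in the 2-5k range was suggested.
train_args = TrainingArguments(
    output_dir="llama3.2-1b-tr",
    max_steps=3000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
)
```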


Hi, sorry for the late answer, but I saw this recently and got curious. What's the logic behind it?


I'm having a similar issue here. Setting repetition_penalty does in fact help with the model repeating itself, but instead of ending, it just generates new text.

My issue is that the model doesn't stop; it just generates new text over and over again:

trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="models/email-tuning",
        num_train_epochs=2,
        per_device_train_batch_size=2,  
        per_device_eval_batch_size=2,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_steps=10,
        save_strategy="steps",
        report_to="wandb",
        run_name="email-tuning-v3-llama-3.1-8b-instruct-2048",
        max_length=2048,
    ),
)

trainer.train()

This is my training loop. I'm using meta-llama/Llama-3.1-8B-Instruct plus a dataset with messages and roles (system, assistant, user). Example:

chat = [
    {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
    {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

Hmm… How about max_new_tokens?

outputs = model.generate(
    **inputs,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.2,
    max_new_tokens=512,
    do_sample=True,
)

Yeah, I've tried that, and now the model just stops mid-sentence once it hits the max_new_tokens limit.


Something quite interesting happened here: I didn't change anything except swapping the model from Llama-3.1-8B-Instruct to Qwen2.5-7B-Instruct, and the problem stopped. The training works fine now.


I used Qwen2.5-3B-Instruct and Qwen2.5-0.5B-Instruct, and I have the same problem you had. I wonder, did you do something special?


I am having the same issue after fine-tuning Qwen2.5-VL-3B-Instruct. Did you manage to solve it?


Seems like a common SFT failure mode?


Your symptoms match a very specific, very common SFT failure mode: you think you trained “prompt → answer → EOS”, but in practice you trained on many truncated examples that never include the answer ending (and often never include EOS), while also optimizing the model to reproduce the whole prompt template. That combination makes looping much more likely.

Below is a concrete fix plan for your exact code and dataset.


Why repetition happens even when there are “no duplicated rows”

“Repetition” at inference is usually not caused by duplicated dataset rows. It is usually caused by one or more of these:

  1. Stop condition not learned
  • Model rarely sees EOS in the supervised target.
  • Or EOS is seen inconsistently because examples are truncated.
  • Generation then runs until max_new_tokens and often degenerates into loops.
    This exact pattern is widely reported in finetunes that “don’t generate EOS.” (GitHub)
  2. Training objective teaches “continue the template,” not “answer the user”
  • If you compute loss on prompt + response, the model is rewarded for predicting headers, separators, and prompt text.
  • This increases template echoing and template re-entry (“### Instruction:” again), which looks like repetition.
    TRL explicitly documents “train on completions only” as the intended fix. (Hugging Face)
  3. Decoding degeneracy
  • Greedy decoding is known to start repeating itself on longer outputs.
    Transformers docs explicitly call out that greedy decoding breaks down and repeats. (Hugging Face)

In your case, you have strong evidence for (1) and (2), and likely some of (3).


What you are doing wrong in your run

1) You did not set max_seq_length or max_length, so TRL truncates to 1024 by default

TRL’s SFTTrainer docs state a default maximum tokenized length of 1024, and sequences longer than that are truncated from the right. (Hugging Face)

TRL v0.7.4 is even more explicit: if you do not pass max_seq_length, it defaults to min(tokenizer.model_max_length, 1024). (Hugging Face)

Your dataset has very long examples:

  • Dataset viewer shows output string lengths up to ~8k characters and instruction lengths >1k characters. (Hugging Face)

So many examples will exceed 1024 tokens once formatted into your Alpaca template. Those examples are then truncated. Truncation is the big hidden bug.

Why truncation causes repetition

  • You append EOS at the very end of the formatted text.
  • If the example is truncated, the EOS is often cut off.
  • The model is trained on many sequences that end abruptly without EOS, so it does not reliably learn to stop.
  • At inference it may keep going until the token limit, and long generations are exactly where repetition loops show up. (GitHub)

2) Your dataset actually contains repetition-like targets inside single examples

Even if there are no duplicated rows, the dataset viewer shows an output that repeats “cesur” multiple times in a single answer. (Hugging Face)

That matters because small models can generalize “repeating items is acceptable” into unrelated contexts when generation becomes unstable.

3) You are training on the whole prompt template by default

In your code you do not supply a completion-only collator or masking. So your loss is applied to the entire concatenated string (instruction + input + “### Response:” + output).

TRL documentation explicitly recommends DataCollatorForCompletionOnlyLM for instruction SFT when you want to train only on the answer, and notes it works when packing=False (your setting). (Hugging Face)

There is even a TRL issue requesting that SFTTrainer’s default collator be changed because the current default is misleading for SFT use. (GitHub)

4) Your inference prompt format in the model card does not match your training format

Your training text uses:

  • “Below is an instruction…”
  • “### Instruction / ### Input / ### Response”
  • plus BOS at the start and EOS at the end.

But your model card example uses a different scaffold (“Instruction:” / “Input:” / “Response:”) without the same header lines and without the “Below is …” preamble. (Hugging Face)

Format mismatch does not always cause loops, but it increases “weird behaviors” because the model is being prompted off-distribution relative to training.

5) Special tokens are fine, but you are relying on manual BOS insertion

Your uploaded model config shows BOS and EOS ids are 128000 and 128001. (Hugging Face)
Your repo’s special tokens map shows pad token is <|reserved_special_token_0|>. (Hugging Face)
And tokenizer config lists id 128002 as that reserved token. (Hugging Face)

So special tokens are consistent.

Still, manually concatenating bos_token + text is a common footgun because tokenizers and model configs sometimes insert BOS automatically depending on settings. If you accidentally train with double-BOS sometimes and not others, you add noise to the learned boundary behavior.

This is not the biggest issue compared to truncation, but it is easy to clean up.
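A quick way to check for the double-BOS footgun, as a minimal sketch; the helper `has_double_bos` is a hypothetical name, not a library function, and the commented usage assumes the `tokenizer` and mapped `dataset` from the post:

```python
def has_double_bos(ids, bos_id):
    """True if a tokenized example starts with two BOS tokens, i.e. the
    tokenizer auto-inserted one on top of a manually prepended one."""
    return len(ids) >= 2 and ids[0] == bos_id and ids[1] == bos_id

# Usage sketch (Llama 3 BOS id is 128000, per the config discussed above):
# ids = tokenizer(dataset["text"][0])["input_ids"]
# print(has_double_bos(ids, tokenizer.bos_token_id))
```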


The fix plan (high leverage, in the right order)

Step 0: Measure the problem first (fast sanity checks)

You want two numbers:

  1. What fraction of training examples exceed 1024 tokens?
    If it is high, truncation is your main culprit. TRL default is 1024. (Hugging Face)

  2. What fraction of tokenized examples actually end with EOS inside the trained window?
    If many are missing EOS, the model will not stop reliably, and looping becomes likely. (GitHub)

You do not need Turkish understanding to compute these.
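Both numbers can be computed with a small helper; a minimal sketch, where `truncation_stats` is a hypothetical name and the commented usage assumes the mapped `dataset` and `tokenizer` from the post:

```python
def truncation_stats(token_id_lists, max_len, eos_id):
    """Return (fraction of examples longer than max_len,
    fraction whose right-truncated window still ends with EOS)."""
    too_long = 0
    eos_kept = 0
    for ids in token_id_lists:
        if len(ids) > max_len:
            too_long += 1
        window = ids[:max_len]  # TRL truncates from the right
        if window and window[-1] == eos_id:
            eos_kept += 1
    n = len(token_id_lists)
    return too_long / n, eos_kept / n

# Usage sketch:
# ids = [tokenizer(t)["input_ids"] for t in dataset["text"]]
# frac_long, frac_eos = truncation_stats(ids, max_len=1024,
#                                        eos_id=tokenizer.eos_token_id)
```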


Step 1: Stop silent truncation

Pick one approach:

Option A (most common): raise the sequence length

If your GPU allows it, set 2048 or 4096.

  • In older TRL (your style), set max_seq_length=2048.
  • In current TRL docs, this knob is described as max_length (default 1024, truncates right). (Hugging Face)

Option B (often best for 1B): filter or trim long examples

For a 1B model, training on extremely long synthetic outputs can be counterproductive. Filter out the longest tail or cap the response length so EOS is always present.

Why filtering often helps:

  • You increase the proportion of examples with a clean stop.
  • You reduce exposure to “degenerate list rambles,” which can imprint repetition patterns.
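Option B can be sketched as a simple length filter over the formatted `text` column; `within_budget` is a hypothetical helper and 2048 an illustrative cap:

```python
def within_budget(example, tokenizer, max_len=2048):
    """Keep only examples whose formatted text fits the training window,
    so the trailing EOS token is never truncated away."""
    n_tokens = len(tokenizer(example["text"])["input_ids"])
    return n_tokens <= max_len

# Usage sketch, assuming the mapped dataset from the post:
# dataset = dataset.filter(lambda ex: within_budget(ex, tokenizer))
```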

Step 2: Train on completion only (mask the prompt)

This is the standard fix for “template echo + looping inside scaffolding.”

TRL v0.7.4 shows exactly how to do this using DataCollatorForCompletionOnlyLM, and explicitly notes it works when packing=False. (Hugging Face)

Important detail: your response marker must match tokenization exactly.
TRL docs warn that tokenization can differ depending on context and show how to pass token IDs if needed (adding a newline context can matter). (Hugging Face)


Step 3: Make inference prompt match training prompt

If you trained with:

  • the “Below is an instruction…” preamble
  • and “### Instruction / ### Input / ### Response”

then use the same at inference.

Right now, your model card example does not match the training format. (Hugging Face)
This mismatch can worsen instability and repetition, especially for small models.

If you want to upgrade: use the model’s chat template (tokenizer.apply_chat_template) and train in chat format. But the minimal fix is just consistency.
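The minimal consistency fix can be sketched as a prompt builder that reproduces the training template verbatim and leaves “### Response:” open for the model to complete (the `build_prompt` helper is hypothetical):

```python
# Same preamble and headers as the training text, minus the gold answer.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(instruction, input_text=""):
    """Build an inference prompt that exactly matches the training scaffold."""
    return ALPACA_TEMPLATE.format(instruction=instruction, input=input_text)
```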


Step 4: Add light decoding “loop brakes”

Even with perfect training, greedy decoding can repeat on long outputs. Transformers docs explicitly call this out. (Hugging Face)

Use:

  • do_sample=True with top_p and a moderate temperature
  • repetition_penalty around 1.1–1.2
  • optionally no_repeat_ngram_size=3 (test carefully, can be too restrictive)

Your repo’s generation_config.json already sets do_sample: true, temperature: 0.6, top_p: 0.9. (Hugging Face)
But ensure your inference code actually loads and uses that generation config (many people call generate() without realizing they are effectively greedy). (Hugging Face)


Step 5: Clean the most harmful dataset patterns

You do have repetition-like targets (“cesur” repeated many times). (Hugging Face)
You do not need to remove them all. Just remove the extreme ones.

Practical filters that help:

  • Drop samples where a single token repeats above a threshold.
  • Drop samples where the unique word ratio is extremely low.
  • Deduplicate list items inside an output for “generate a list of adjectives/synonyms” tasks.

This improves both:

  • training stability
  • inference stability
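The filters above can be sketched as simple word-level heuristics; helper names and thresholds are illustrative, not tuned:

```python
def unique_word_ratio(text):
    """Fraction of distinct words; very low values suggest rambling repeats."""
    words = text.split()
    return len(set(words)) / len(words) if words else 1.0

def max_word_fraction(text):
    """Fraction of the text taken up by its single most frequent word."""
    words = text.split()
    if not words:
        return 0.0
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return max(counts.values()) / len(words)

def keep_output(output, min_unique=0.3, max_repeat=0.3):
    """Drop outputs that are dominated by one repeated word."""
    return unique_word_ratio(output) >= min_unique and max_word_fraction(output) <= max_repeat

# Usage sketch: dataset = dataset.filter(lambda ex: keep_output(ex["output"]))
```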

Concrete updated training code (drop-in style)

1) Do not manually prepend BOS

Let the tokenizer manage BOS. Keep EOS at the end.

2) Add completion-only collator + max length

Based on TRL’s documented approach. (Hugging Face)

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

EOS = tokenizer.eos_token

def formatting_prompts_func(examples):
    texts = []
    for instr, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        texts.append(alpaca_prompt.format(instruction=instr, input=inp, output=out) + EOS)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

# Add a little context before the marker so tokenization matches what appears in the full text.
response_template = "\n### Response:\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,          # required for this collator per TRL docs
    max_seq_length=2048,    # TRL v0.7.x name; increase if you can afford it
)

trainer.train()

If you are on newer TRL where the parameter is max_length, the concept is the same: default 1024 truncates, set it explicitly. (Hugging Face)


Concrete inference config (loop-resistant)

Transformers docs define greedy vs sampling behavior via do_sample. (Hugging Face)

gen = model.generate(
    **tokenizer(prompt, return_tensors="pt").to(model.device),
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

Also ensure prompt matches the training format, not the simplified format shown elsewhere. (Hugging Face)


Optional: optimizer and stability tweaks for QLoRA

Not the main cause of repetition, but helps stability:

  • Use bitsandbytes 8-bit optimizer (less memory, commonly used for finetuning). (Hugging Face)
  • Add lora_dropout=0.05
  • Reduce LR if you still see collapse into repetitive loops after some steps
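A sketch of those tweaks applied to the original config; values are illustrative, and `paged_adamw_8bit` is the bitsandbytes paged 8-bit AdamW exposed through `TrainingArguments.optim`:

```python
from peft import LoraConfig

# Same adapter config as the post, plus light dropout for regularization.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# In TrainingArguments, swap the optimizer and, if loops persist, lower the LR:
#   optim="paged_adamw_8bit", learning_rate=1e-4
```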

What I would do first in your exact situation (fastest wins)

  1. Set max_seq_length (or max_length) explicitly, because default 1024 truncation is almost certainly cutting off EOS for long samples. (Hugging Face)
  2. Switch to completion-only loss using DataCollatorForCompletionOnlyLM. (Hugging Face)
  3. Make inference prompt match training prompt exactly, because mismatch is visible in your own example usage. (Hugging Face)
  4. Add repetition brakes at decoding because greedy decoding is known to repeat. (Hugging Face)
  5. Filter the most repetition-heavy targets (you have some). (Hugging Face)

Summary

  • TRL SFTTrainer defaults to truncating at 1024 unless you set sequence length. Long Turkish outputs make truncation likely, which often removes EOS and causes looping. (Hugging Face)
  • Your dataset includes examples with repeated items inside one output, which can teach repetition patterns. (Hugging Face)
  • Train on completions only with DataCollatorForCompletionOnlyLM and keep packing=False. (Hugging Face)
  • Keep prompt format consistent between training and inference. Your current example usage differs from the training template. (Hugging Face)
  • Use sampling and repetition penalties because greedy decoding can repeat on long outputs. (Hugging Face)