Yes, this looks like a common SFT failure mode
Your symptoms match a very specific, very common SFT failure mode: you think you trained “prompt → answer → EOS”, but in practice you trained many truncated examples that never include the answer ending (and often never include EOS), while also optimizing the model to reproduce the whole prompt template. That combination makes looping much more likely.
Below is a concrete fix plan for your exact code and dataset.
Why repetition happens even when there are “no duplicated rows”
“Repetition” at inference is usually not caused by duplicated dataset rows. It is usually caused by one or more of these:
1. Stop condition not learned
   - The model rarely sees EOS in the supervised target, or sees EOS inconsistently because examples are truncated.
   - Generation then runs until max_new_tokens and often degenerates into loops.
   - This exact pattern is widely reported in finetunes that "don't generate EOS." (GitHub)
2. Training objective teaches "continue the template," not "answer the user"
   - If you compute loss on prompt + response, the model is rewarded for predicting headers, separators, and prompt text.
   - This increases template echoing and template re-entry ("### Instruction:" again), which looks like repetition.
   - TRL explicitly documents "train on completions only" as the intended fix. (Hugging Face)
3. Decoding degeneracy
   - Greedy decoding is known to start repeating itself on longer outputs; the Transformers docs explicitly call out that greedy decoding breaks down and repeats. (Hugging Face)
In your case, you have strong evidence for (1) and (2), and likely some of (3).
What you are doing wrong in your run
1) You did not set max_seq_length or max_length, so TRL truncates to 1024 by default
TRL’s SFTTrainer docs state a default maximum tokenized length of 1024, and sequences longer than that are truncated from the right. (Hugging Face)
TRL v0.7.4 is even more explicit: if you do not pass max_seq_length, it defaults to min(tokenizer.model_max_length, 1024). (Hugging Face)
Your dataset has very long examples:
- Dataset viewer shows output string lengths up to ~8k characters and instruction lengths >1k characters. (Hugging Face)
So many examples will exceed 1024 tokens once formatted into your Alpaca template. Those examples are then truncated. Truncation is the big hidden bug.
Why truncation causes repetition
- You append EOS at the very end of the formatted text.
- If the example is truncated, the EOS is often cut off.
- The model is trained on many sequences that end abruptly without EOS, so it does not reliably learn to stop.
- At inference it may keep going until the token limit, and long generations are exactly where repetition loops show up. (GitHub)
2) Your dataset actually contains repetition-like targets inside single examples
Even if there are no duplicated rows, the dataset viewer shows an output that repeats “cesur” multiple times in a single answer. (Hugging Face)
That matters because small models can generalize “repeating items is acceptable” into unrelated contexts when generation becomes unstable.
3) You are training on the whole prompt template by default
In your code you do not supply a completion-only collator or masking. So your loss is applied to the entire concatenated string (instruction + input + “### Response:” + output).
TRL documentation explicitly recommends DataCollatorForCompletionOnlyLM for instruction SFT when you want to train only on the answer, and notes it works when packing=False (your setting). (Hugging Face)
There is even a TRL issue requesting that SFTTrainer’s default collator be changed because the current default is misleading for SFT use. (GitHub)
4) Your inference prompt format in the model card does not match your training format
Your training text uses:
- “Below is an instruction…”
- “### Instruction / ### Input / ### Response”
- plus BOS at the start and EOS at the end.
But your model card example uses a different scaffold (“Instruction:” / “Input:” / “Response:”) without the same header lines and without the “Below is …” preamble. (Hugging Face)
Format mismatch does not always cause loops, but it increases “weird behaviors” because the model is being prompted off-distribution relative to training.
5) Special tokens are fine, but you are relying on manual BOS insertion
Your uploaded model config shows BOS and EOS ids are 128000 and 128001. (Hugging Face)
Your repo’s special tokens map shows pad token is <|reserved_special_token_0|>. (Hugging Face)
And tokenizer config lists id 128002 as that reserved token. (Hugging Face)
So special tokens are consistent.
Still, manually concatenating bos_token + text is a common footgun because tokenizers and model configs sometimes insert BOS automatically depending on settings. If you accidentally train with double-BOS sometimes and not others, you add noise to the learned boundary behavior.
This is not the biggest issue compared to truncation, but it is easy to clean up.
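If you want to rule this out quickly, a tiny check on tokenized samples catches double-BOS. This is a sketch; the helper name is illustrative, and bos_id=128000 is the BOS id from your uploaded model config mentioned above:

```python
def bos_ok(input_ids, bos_id=128000):
    """True iff the sequence starts with BOS and contains it exactly once.

    bos_id=128000 is the BOS id from the uploaded model config;
    the helper name is just illustrative."""
    return (
        bool(input_ids)
        and input_ids[0] == bos_id
        and input_ids.count(bos_id) == 1
    )
```

Run it on `tokenizer(formatted_text).input_ids` for a handful of samples; a count of 2 means you are double-inserting BOS.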
The fix plan (high leverage, in the right order)
Step 0: Measure the problem first (fast sanity checks)
You want two numbers:
1. What fraction of training examples exceed 1024 tokens? If it is high, truncation is your main culprit. The TRL default is 1024. (Hugging Face)
2. What fraction of tokenized examples actually end with EOS inside the trained window? If many are missing EOS, the model will not stop reliably, and looping becomes likely. (GitHub)
You do not need Turkish understanding to compute these.
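A minimal sketch for computing both numbers. It works on pre-tokenized id lists so it stays tokenizer-agnostic; `truncation_stats` is a hypothetical helper name:

```python
def truncation_stats(token_id_lists, eos_id, max_len=1024):
    """Return (fraction of examples longer than max_len,
    fraction whose EOS survives right-truncation to max_len)."""
    n = len(token_id_lists)
    too_long = sum(1 for ids in token_id_lists if len(ids) > max_len)
    # EOS only counts if it sits inside the window the model is trained on.
    eos_kept = sum(1 for ids in token_id_lists if eos_id in ids[:max_len])
    return too_long / n, eos_kept / n
```

Feed it something like `[tokenizer(t).input_ids for t in dataset["text"]]` with `eos_id=tokenizer.eos_token_id`.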
Step 1: Stop silent truncation
Pick one approach:
Option A (most common): raise the sequence length
If your GPU allows it, set 2048 or 4096.
- In older TRL (your style), set max_seq_length=2048.
- In current TRL docs, this knob is called max_length (default 1024, truncating from the right). (Hugging Face)
Option B (often best for 1B): filter or trim long examples
For a 1B model, training on extremely long synthetic outputs can be counterproductive. Filter out the longest tail or cap the response length so EOS is always present.
Why filtering often helps:
- You increase the proportion of examples with a clean stop.
- You reduce exposure to “degenerate list rambles,” which can imprint repetition patterns.
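One way to sketch the "cap the response" variant, assuming you tokenize prompt and response separately; `cap_response` is an illustrative name, and the key point is reserving one slot so EOS always fits inside the trained window:

```python
def cap_response(prompt_ids, response_ids, eos_id, max_len=2048):
    """Trim the response so prompt + response + EOS fits in max_len.

    Returns None if the prompt alone leaves no room for an answer,
    so the caller can drop that example instead."""
    budget = max_len - len(prompt_ids) - 1  # reserve one slot for EOS
    if budget <= 0:
        return None
    return prompt_ids + response_ids[:budget] + [eos_id]
```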
Step 2: Train on completion only (mask the prompt)
This is the standard fix for “template echo + looping inside scaffolding.”
TRL v0.7.4 shows exactly how to do this using DataCollatorForCompletionOnlyLM, and explicitly notes it works when packing=False. (Hugging Face)
Important detail: your response marker must match tokenization exactly.
TRL docs warn that tokenization can differ depending on context and show how to pass token IDs if needed (adding a newline context can matter). (Hugging Face)
Step 3: Make inference prompt match training prompt
If you trained with:
- the “Below is an instruction…” preamble
- and “### Instruction / ### Input / ### Response”
then use the same at inference.
Right now, your model card example does not match the training format. (Hugging Face)
This mismatch can worsen instability and repetition, especially for small models.
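A sketch of an inference prompt builder that mirrors the Alpaca-style training template used in this answer (the constant and function names are illustrative; double-check it against your exact training string, including blank lines):

```python
# Assumed to match the training template; verify against your formatting function.
ALPACA_INFER = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""

def build_prompt(instruction, user_input=""):
    # Note: no {output} slot here; everything else must match training exactly.
    return ALPACA_INFER.format(instruction=instruction, input=user_input)
```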
If you want to upgrade: use the model’s chat template (tokenizer.apply_chat_template) and train in chat format. But the minimal fix is just consistency.
Step 4: Add light decoding “loop brakes”
Even with perfect training, greedy decoding can repeat on long outputs. Transformers docs explicitly call this out. (Hugging Face)
Use:
- do_sample=True with top_p and a moderate temperature
- repetition_penalty around 1.1–1.2
- optionally no_repeat_ngram_size=3 (test carefully; it can be too restrictive)
Your repo’s generation_config.json already sets do_sample: true, temperature: 0.6, top_p: 0.9. (Hugging Face)
But ensure your inference code actually loads and uses that generation config (many people call generate() without realizing they are effectively greedy). (Hugging Face)
Step 5: Clean the most harmful dataset patterns
You do have repetition-like targets (“cesur” repeated many times). (Hugging Face)
You do not need to remove them all. Just remove the extreme ones.
Practical filters that help:
- Drop samples where a single token repeats above a threshold.
- Drop samples where the unique word ratio is extremely low.
- Deduplicate list items inside an output for “generate a list of adjectives/synonyms” tasks.
This improves both:
- training stability
- inference stability
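The filters above can be sketched in plain Python; the thresholds are assumptions to tune against your data, and `\w` matching is Unicode-aware in Python, so Turkish text works:

```python
import re
from collections import Counter

def unique_word_ratio(text):
    """Share of distinct words; near 0 for 'cesur cesur cesur ...' outputs."""
    words = re.findall(r"\w+", text.lower())
    return len(set(words)) / len(words) if words else 1.0

def max_word_repeats(text):
    """Highest count of any single word in the text."""
    words = re.findall(r"\w+", text.lower())
    return max(Counter(words).values()) if words else 0

def is_clean(output, min_unique_ratio=0.3, max_repeats=10):
    """Heuristic sample filter; both thresholds are assumptions to tune."""
    return (unique_word_ratio(output) >= min_unique_ratio
            and max_word_repeats(output) <= max_repeats)
```

Apply it with `dataset.filter(lambda ex: is_clean(ex["output"]))` and eyeball what gets dropped before committing.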
Concrete updated training code (drop-in style)
1) Do not manually prepend BOS
Let the tokenizer manage BOS. Keep EOS at the end.
2) Add completion-only collator + max length
Based on TRL’s documented approach. (Hugging Face)
```python
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

EOS = tokenizer.eos_token

def formatting_prompts_func(examples):
    texts = []
    for instr, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        texts.append(alpaca_prompt.format(instruction=instr, input=inp, output=out) + EOS)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

# Add a little context before the marker so tokenization matches what appears in the full text.
response_template = "\n### Response:\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,        # required for this collator per TRL docs
    max_seq_length=2048,  # TRL v0.7.x name; increase if you can afford it
)
trainer.train()
```
If you are on newer TRL where the parameter is max_length, the concept is the same: default 1024 truncates, set it explicitly. (Hugging Face)
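For reference, the newer-TRL equivalent looks roughly like this. It is a sketch only: the length argument moved from SFTTrainer onto SFTConfig and was renamed across releases, so verify against your installed version's docs.

```python
from trl import SFTConfig, SFTTrainer

# Newer TRL: length control lives on SFTConfig as max_length
# (default 1024, right-truncation). Other arguments are placeholders.
args = SFTConfig(
    output_dir="out",
    max_length=2048,
    packing=False,
)
trainer = SFTTrainer(model=model, args=args, train_dataset=dataset)
```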
Concrete inference config (loop-resistant)
Transformers docs define greedy vs sampling behavior via do_sample. (Hugging Face)
```python
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
gen = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
```
Also ensure prompt matches the training format, not the simplified format shown elsewhere. (Hugging Face)
Optional: optimizer and stability tweaks for QLoRA
Not the main cause of repetition, but helps stability:
- Use bitsandbytes 8-bit optimizer (less memory, commonly used for finetuning). (Hugging Face)
- Add lora_dropout=0.05.
- Reduce the learning rate if you still see collapse into repetitive loops after some steps.
What I would do first in your exact situation (fastest wins)
- Set max_seq_length (or max_length) explicitly, because the default 1024-token truncation is almost certainly cutting off EOS for long samples. (Hugging Face)
- Switch to completion-only loss using DataCollatorForCompletionOnlyLM. (Hugging Face)
- Make inference prompt match training prompt exactly, because mismatch is visible in your own example usage. (Hugging Face)
- Add repetition brakes at decoding because greedy decoding is known to repeat. (Hugging Face)
- Filter the most repetition-heavy targets (you have some). (Hugging Face)
Summary
- TRL SFTTrainer defaults to truncating at 1024 unless you set sequence length. Long Turkish outputs make truncation likely, which often removes EOS and causes looping. (Hugging Face)
- Your dataset includes examples with repeated items inside one output, which can teach repetition patterns. (Hugging Face)
- Train on completions only with DataCollatorForCompletionOnlyLM and keep packing=False. (Hugging Face)
- Keep prompt format consistent between training and inference. Your current example usage differs from the training template. (Hugging Face)
- Use sampling and repetition penalties because greedy decoding can repeat on long outputs. (Hugging Face)