Yes. What you are seeing is consistent with “pipeline is fine, evaluation is measuring something different,” plus one very likely extra issue in your setup: you evaluate with stochastic decoding.
You currently have three different regimes:
- Training loss: teacher forcing. The decoder always gets the correct previous tokens. This can go near 0 on 64 examples even with zero generalization.
- `eval_loss`: also teacher forcing, on the validation set. It can stagnate or rise if val differs, even while train loss goes to ~0. That is normal overfitting.
- `eval_sequence_recovery`: autoregressive generation, because `predict_with_generate=True` makes `Seq2SeqTrainer` call `model.generate()` during evaluation. Hugging Face explicitly documents this behavior. (Hugging Face)
This metric is not “the same thing as loss.” It is “how good is free-running decoding under your decoding strategy.”
So your understanding is basically correct: predict_with_generate=True makes evaluation include generation, which is a different setup than teacher forcing. But the more concrete diagnosis is:
The biggest reason your eval_sequence_recovery can look saturated or worse even when the model memorizes
You set:
- `do_sample=True`
- `top_p=0.85`, `top_k=3`, `temperature=1.0`
- `repetition_penalty=1.2`
That means evaluation uses sampling. Sampling is intentionally random. Hugging Face generation docs make it explicit: do_sample=True switches to multinomial sampling (or sampling variants), while do_sample=False gives greedy or beam decoding depending on beams. (Hugging Face)
Why this matters even if the model overfits perfectly
Even if the model assigns the correct next token the highest probability, with do_sample=True it can still sample a different token. With top_k=3, it is literally allowed to pick among the top 3. So you can get:
- train loss ≈ 0 (teacher forcing, easy)
- generated outputs still sometimes wrong (sampling, stochastic)
- recovery metric flat or noisy (because you are injecting randomness)
For a “sequence recovery” metric, sampling is usually the wrong choice during evaluation. Sampling is for diversity. Recovery is for correctness.
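To see why, here is a toy illustration (made-up logits, not your model or tokenizer) of how top-k sampling can pick something other than the argmax token even when the model is confident:

```python
import torch

# Toy next-token distribution: token 0 is the "correct" argmax token.
torch.manual_seed(0)
logits = torch.tensor([4.0, 3.2, 3.0, -1.0])
top_k = 3

# Restrict to the top-k candidates, renormalize, then sample (what do_sample=True does).
topk_vals, topk_idx = torch.topk(logits, top_k)
probs = torch.softmax(topk_vals, dim=-1)
picks = topk_idx[torch.multinomial(probs, num_samples=1000, replacement=True)]

# Even though token 0 is clearly preferred, a large fraction of samples pick something else.
print("fraction of non-argmax picks:", (picks != 0).float().mean().item())
```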
1) Your padding and collator look correct
`label_pad_token_id=-100` is the right setting, and the Transformers docs explicitly state that -100 label padding is ignored by PyTorch loss functions. (Hugging Face)
So padding is unlikely to be the cause here.
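For completeness, a minimal sketch of what that setting does (the checkpoint name is a placeholder and the hand-written features are just for illustration):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder checkpoint
collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=None,                # pass your model if you want decoder_input_ids prepared
    label_pad_token_id=-100,   # padded label positions are ignored by the loss
)

batch = collator([
    {"input_ids": [10, 11, 12], "labels": [20, 21]},
    {"input_ids": [13],         "labels": [22, 23, 24]},
])
print(batch["labels"])  # the shorter label row is padded with -100, not pad_token_id
```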
2) predict_with_generate=True is not “wrong,” but it changes what compute_metrics receives
With predict_with_generate=True, Seq2SeqTrainer generates sequences for eval and those generated sequences are what your metric sees. That is the intended feature. (Hugging Face)
So your metric is answering:
“How similar are sampled generations to references?”
Not:
“How well does the model predict the next token given the gold prefix?”
Those can diverge massively, especially with sampling.
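If it helps, this is roughly the shape of what `compute_metrics` receives in this mode (a sketch, assuming `tokenizer` is in scope; your actual recovery computation may differ):

```python
import numpy as np

def compute_metrics(eval_pred):
    # With predict_with_generate=True these are generated token IDs, not logits.
    preds, labels = eval_pred.predictions, eval_pred.label_ids
    # -100 is a loss-masking value, not a real token ID; restore the pad token before decoding.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Whatever similarity you compute here, it compares *generations* to references,
    # so the decoding strategy directly shapes the score.
    exact = np.mean([p.strip() == l.strip() for p, l in zip(pred_str, label_str)])
    return {"sequence_recovery": float(exact)}
```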
3) The single best debug test: evaluate on the training examples
“Could I also make an overfit run with using the same samples in the val set and see if it then improves?”
Yes. Do it. It is a standard sanity check.
Expected outcome if your pipeline is correct
- train loss → near 0
- `eval_loss` on the same data → near 0 (maybe slightly higher due to eval mode differences, but should drop hard)
- greedy generation recovery → very high, often near 1.0 on memorized examples
If that does NOT happen
Then something is wrong in evaluation or decoding, not “generalization.”
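A sketch of that sanity run, assuming your existing `model`, `training_args`, `collator`, `compute_metrics`, and a 64-example `train_subset`:

```python
from transformers import Seq2SeqTrainer

sanity_trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,            # with predict_with_generate=True
    train_dataset=train_subset,
    eval_dataset=train_subset,     # evaluate on the exact examples you train on
    data_collator=collator,
    tokenizer=tokenizer,           # newer Transformers versions call this processing_class
    compute_metrics=compute_metrics,
)
sanity_trainer.train()
```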
4) What I would change in your eval setup immediately
4.1 Make evaluation decoding deterministic
For debugging and for meaningful recovery curves:
- set `do_sample=False`
- remove `top_p`, `top_k`, `temperature` (they are sampling controls)
- set `num_beams=1` (greedy) first, then try `num_beams=4` if you want
This matches how generation modes are defined in Transformers. (Hugging Face)
Why greedy first
Greedy is the cleanest signal. If greedy can’t recover on train examples, sampling won’t fix it.
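A sketch of a deterministic eval setup, modifying the model's generation config in place (adapt if you build and pass your own `GenerationConfig` object):

```python
gen = model.generation_config
gen.do_sample = False          # greedy decoding
gen.num_beams = 1
gen.temperature = 1.0          # neutral values in place of the sampling knobs
gen.top_p = 1.0
gen.top_k = 50                 # library default; unused when do_sample=False
gen.repetition_penalty = 1.0   # no decoding-time penalty during debugging
```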
4.2 Make sure your “bad words” are not blocking required tokens
You pass bad_words_ids=bad_words. If any token that appears in your labels is included there, generation can never match the reference, no matter how well the model trains.
This is a very common silent killer for “recovery” style metrics: the model is trained to output something, but decoding forbids it.
Concrete check:
- collect the set of token IDs appearing in labels
- intersect with `bad_words_ids`
- the intersection must be empty (see the sketch below)
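A sketch of that check (assuming `bad_words` is your `bad_words_ids` list of ID lists and `eval_dataset["labels"]` holds the tokenized labels):

```python
# Collect every real label token ID (dropping the -100 masking value)...
label_ids = set()
for row in eval_dataset["labels"]:
    label_ids.update(t for t in row if t != -100)

# ...and intersect with the flattened bad-words IDs. This must come back empty.
bad_ids = {t for seq in bad_words for t in seq}
print("label tokens forbidden by bad_words_ids:", label_ids & bad_ids)
```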
4.3 Stop using sampling penalties for evaluation
repetition_penalty changes token probabilities during decoding. It can force the decoder away from the true sequence even if the model learned it.
Use such penalties only for “nice-looking text.” Not for correctness metrics.
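As a rough illustration of the effect (a simplified version of how the penalty rescales scores for tokens that have already been generated):

```python
import torch

# A positive logit for a token that already appeared gets divided by the penalty,
# so repeating the *correct* token is actively discouraged.
logit = torch.tensor(2.5)
penalty = 1.2
penalized = logit / penalty if logit > 0 else logit * penalty
print(float(penalized))  # ~2.08: the true token becomes less likely at that step
```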
5) A subtle metric pitfall in your code: skip_special_tokens=True
You decode with skip_special_tokens=True for both preds and labels.
That is often correct for normal text. But it can be catastrophically wrong if your task uses “special tokens” as meaningful symbols (custom markers, sentinels, reserved tokens).
If your ground-truth sequences contain tokens that the tokenizer considers “special,” they will be deleted before scoring and your metric can look flat or nonsensical.
Quick test:
- decode one label with `skip_special_tokens=False`
- compare to `skip_special_tokens=True`
- if important symbols disappear, your metric is not measuring what you think (see the sketch below)
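A sketch of that check, assuming `label_row` is one tokenized label sequence from your eval set:

```python
ids = [t for t in label_row if t != -100]          # drop the loss-masking value
print(tokenizer.decode(ids, skip_special_tokens=False))
print(tokenizer.decode(ids, skip_special_tokens=True))
# If symbols you rely on vanish in the second print, the metric is scoring a
# stripped-down version of your sequences.
```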
6) About eval_loss specifically: why it can increase while train loss goes to ~0 on 64 examples
With 64 examples, you can memorize training patterns quickly. Validation loss can:
- plateau if the mapping is not learnable from that tiny sample
- increase as you overfit and become more confident on wrong patterns for val
That is normal “tiny dataset overfit” behavior. Nothing surprising there.
What is surprising is “recovery metric does not improve even on train examples.” If that’s your situation, it is almost always decode/metric config.
7) Generation length knobs: reduce confusion
You set `generation_max_length=tgt_max` and also `GenerationConfig(max_length=tgt_max, ...)`.
This is redundant.
Also, Transformers has been nudging people toward max_new_tokens instead of relying on max_length defaults. There is even a warning that recommends max_new_tokens for controlling output length. (Mark Needham)
Separately, in Seq2SeqTrainer, generation_max_length is used to build gen_kwargs["max_length"] during evaluation, overriding defaults. (Hugging Face Forums)
For encoder-decoder models, max_length is usually fine, but for clarity during debugging:
- pick one place to define length
- avoid mixing multiple overlapping settings
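For example (a sketch: length defined once, on the generation config, with `generation_max_length` removed from the trainer arguments so nothing overrides it):

```python
# Define the output length once, on the generation config the trainer will use.
model.generation_config.max_new_tokens = tgt_max   # explicit budget for generated tokens
# ...and do not also pass generation_max_length / max_length elsewhere.
```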
8) A concrete “debug mode” recipe for your exact pipeline
Do this for 1–2 runs:
Debug Run A (pipeline correctness)
- train set = 64 examples
- eval set = same 64 examples
- `predict_with_generate=True`
- deterministic generation: `do_sample=False`, `num_beams=1`, no repetition penalty
- metric decode with `skip_special_tokens=False` temporarily (just to confirm nothing meaningful is being dropped)
Expected:
- `eval_loss` drops with training
- recovery rises strongly
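A sketch of Run A's trainer arguments (everything except the discussed flags is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

debug_args = Seq2SeqTrainingArguments(
    output_dir="debug-overfit",
    predict_with_generate=True,
    num_train_epochs=50,            # enough passes to memorize 64 examples
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",          # older Transformers versions call this evaluation_strategy
    logging_strategy="epoch",
)
# Pair this with eval_dataset=train_subset and the deterministic generation config from 4.1.
```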
Debug Run B (generalization baseline)
- train set = 64 examples
- eval set = separate validation split
- same deterministic generation
Expected:
- train loss goes near 0
- val loss may not improve much
- recovery may saturate low due to no generalization
This separates “evaluation broken” from “no generalization.”
High-quality references you are directly relying on here
- `predict_with_generate=True` triggers `generate()` during evaluation in Seq2SeqTrainer. (Hugging Face)
- Generation modes: greedy vs sampling vs beam depend on `do_sample` and `num_beams`. (Hugging Face)
- `label_pad_token_id=-100` is ignored by PyTorch loss, and is the documented default for seq2seq collators. (Hugging Face)
- Transformers guidance: prefer `max_new_tokens` for controlling generated length in many contexts. (Mark Needham)
- Seq2SeqTrainer generation kwargs behavior (how `generation_max_length` maps into `generate()`). (Hugging Face Forums)
Summary
- Yes, with 64 examples and a separate val split, “train loss ↓, eval_loss flat or ↑” is normal overfitting.
- Your recovery metric is likely “stuck” because you evaluate with sampling (`do_sample=True`), which injects randomness by design. (Hugging Face)
- `predict_with_generate=True` makes evaluation run `generate()`, so the metrics and `eval_loss` measure different regimes. (Hugging Face)
- Do the sanity check: use the same examples for train and val and switch eval decoding to deterministic.
- Also verify `bad_words_ids` does not forbid any label token, and confirm `skip_special_tokens=True` is not deleting meaningful symbols.