Finetuning T5 problems

Yes. What you are seeing is consistent with “pipeline is fine, evaluation is measuring something different,” plus one very likely extra issue in your setup: you evaluate with stochastic decoding.

You currently have three different regimes:

  1. Training loss: teacher forcing. The decoder always gets the correct previous tokens. This can go near 0 on 64 examples even with zero generalization.

  2. eval_loss: also teacher forcing on the validation set. It can stagnate or rise if val differs, even while train loss goes to ~0. That is normal overfitting.

  3. eval_sequence_recovery: autoregressive generation, because predict_with_generate=True makes Seq2SeqTrainer call model.generate() during evaluation. Hugging Face explicitly documents this behavior. (Hugging Face)
    This metric is not “the same thing as loss.” It is “how good is free-running decoding under your decoding strategy.”

So your understanding is basically correct: predict_with_generate=True makes evaluation include generation, which is a different setup than teacher forcing. But the more concrete diagnosis is:

The biggest reason your eval_sequence_recovery can look saturated or worse even when the model memorizes

You set:

  • do_sample=True
  • top_p=0.85, top_k=3, temperature=1.0
  • repetition_penalty=1.2

That means evaluation uses sampling. Sampling is intentionally random. Hugging Face generation docs make it explicit: do_sample=True switches to multinomial sampling (or sampling variants), while do_sample=False gives greedy or beam decoding depending on beams. (Hugging Face)

Why this matters even if the model overfits perfectly

Even if the model assigns the correct next token the highest probability, with do_sample=True it can still sample a different token. With top_k=3, it is literally allowed to pick among the top 3. So you can get:

  • train loss ≈ 0 (teacher forcing, easy)
  • generated outputs still sometimes wrong (sampling, stochastic)
  • recovery metric flat or noisy (because you are injecting randomness)

For a “sequence recovery” metric, sampling is usually the wrong choice during evaluation. Sampling is for diversity. Recovery is for correctness.


1) Your padding and collator look correct

label_pad_token_id=-100 is the right setting and Transformers explicitly states that -100 label padding is ignored by PyTorch loss functions. (Hugging Face)

So padding is unlikely to be the cause here.
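
For reference, a minimal sketch of the collator setup this assumes (tokenizer and model stand in for your existing objects):

```python
from transformers import DataCollatorForSeq2Seq

# Sketch of the assumed collator wiring; tokenizer and model are your existing objects.
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    label_pad_token_id=-100,  # padded label positions are ignored by the PyTorch loss
)
```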


2) predict_with_generate=True is not “wrong,” but it changes what compute_metrics receives

With predict_with_generate=True, Seq2SeqTrainer generates sequences for eval and those generated sequences are what your metric sees. That is the intended feature. (Hugging Face)

So your metric is answering:

“How similar are sampled generations to references?”

Not:

“How well does the model predict the next token given the gold prefix?”

Those can diverge massively, especially with sampling.
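
For concreteness, here is a minimal sketch of a compute_metrics that scores generated sequences. tokenizer is your existing tokenizer, and the exact-match comparison is only a stand-in for your real sequence-recovery metric:

```python
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # With predict_with_generate=True, preds are generated token IDs, not logits.
    # Replace -100 padding before decoding, since the tokenizer cannot decode it.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Placeholder exact-match "recovery"; swap in your real sequence_recovery logic.
    recovery = float(np.mean([p.strip() == l.strip() for p, l in zip(decoded_preds, decoded_labels)]))
    return {"sequence_recovery": recovery}
```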


3) The single best debug test: evaluate on the training examples

“Could I also make an overfit run with using the same samples in the val set and see if it then improves?”

Yes. Do it. It is a standard sanity check.

Expected outcome if your pipeline is correct

  • train loss → near 0
  • eval_loss on the same data → near 0 (maybe slightly higher due to eval mode differences, but should drop hard)
  • greedy generation recovery → very high, often near 1.0 on memorized examples

If that does NOT happen

Then something is wrong in evaluation or decoding, not “generalization.”
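
A quick way to run that check without rebuilding anything, assuming your existing trainer and train_dataset objects:

```python
# Score the exact examples the model was trained on.
metrics = trainer.evaluate(eval_dataset=train_dataset)
print(metrics["eval_loss"], metrics.get("eval_sequence_recovery"))
```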


4) What I would change in your eval setup immediately

4.1 Make evaluation decoding deterministic

For debugging and for meaningful recovery curves:

  • set do_sample=False
  • remove top_p, top_k, temperature (they are sampling controls)
  • set num_beams=1 (greedy) first, then try num_beams=4 if you want

This matches how generation modes are defined in Transformers. (Hugging Face)
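
A minimal sketch of a deterministic debug configuration, assuming an existing model (length control is discussed in section 7):

```python
from transformers import GenerationConfig

# Greedy, deterministic decoding for debugging.
model.generation_config = GenerationConfig(
    do_sample=False,  # no sampling randomness
    num_beams=1,      # greedy; switch to 4 later if you want beam search
)
```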

Why greedy first

Greedy is the cleanest signal. If greedy can’t recover on train examples, sampling won’t fix it.

4.2 Make sure your “bad words” are not blocking required tokens

You pass bad_words_ids=bad_words. If any token that appears in your labels is included there, generation can never match the reference, no matter how well the model trains.

This is a very common silent killer for “recovery” style metrics: the model is trained to output something, but decoding forbids it.

Concrete check (see the sketch after this list):

  • collect the set of token IDs appearing in labels
  • intersect with bad_words_ids
  • intersection must be empty
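
A sketch of that check, assuming a tokenized training set with a "labels" column (tokenized_train is a placeholder name) and your existing bad_words list of token-ID lists:

```python
# Collect every token ID that appears in the labels, then intersect with bad_words_ids.
label_ids = {tok for ex in tokenized_train["labels"] for tok in ex if tok != -100}
banned_ids = {tok for word in bad_words for tok in word}
overlap = label_ids & banned_ids
assert not overlap, f"bad_words_ids forbids tokens your labels need: {sorted(overlap)}"
```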

4.3 Stop using sampling penalties for evaluation

repetition_penalty changes token probabilities during decoding. It can force the decoder away from the true sequence even if the model learned it.

Use such penalties only for “nice-looking text.” Not for correctness metrics.


5) A subtle metric pitfall in your code: skip_special_tokens=True

You decode with skip_special_tokens=True for both preds and labels.

That is often correct for normal text. But it can be catastrophically wrong if your task uses “special tokens” as meaningful symbols (custom markers, sentinels, reserved tokens).

If your ground-truth sequences contain tokens that the tokenizer considers “special,” they will be deleted before scoring and your metric can look flat or nonsensical.

Quick test (see the sketch after this list):

  • decode one label with skip_special_tokens=False
  • compare to True
  • if important symbols disappear, your metric is not measuring what you think
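
A sketch of that test on a single example (tokenized_train is again a placeholder for your tokenized data):

```python
label_ids = [tok for tok in tokenized_train[0]["labels"] if tok != -100]
print(tokenizer.decode(label_ids, skip_special_tokens=False))  # everything, including special tokens
print(tokenizer.decode(label_ids, skip_special_tokens=True))   # what your metric currently sees
```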

6) About eval_loss specifically: why it can increase while train loss goes to ~0 on 64 examples

With 64 examples, you can memorize training patterns quickly. Validation loss can:

  • plateau if the mapping is not learnable from that tiny sample
  • increase as the model overfits and becomes increasingly confident in patterns that are wrong for the validation set

That is normal “tiny dataset overfit” behavior. Nothing surprising there.

What would be surprising is “recovery metric does not improve even on train examples.” If that is your situation, the cause is almost always decoding or metric configuration, not the model.


7) Generation length knobs: reduce confusion

You set generation_max_length=tgt_max and also GenerationConfig(max_length=tgt_max, ...).

This is redundant.

Also, Transformers has been nudging people toward max_new_tokens instead of relying on max_length defaults. There is even a warning that recommends max_new_tokens for controlling output length. (Mark Needham)

Separately, in Seq2SeqTrainer, generation_max_length is used to build gen_kwargs["max_length"] during evaluation, overriding defaults. (Hugging Face Forums)

For encoder-decoder models, max_length is usually fine, but for clarity during debugging:

  • pick one place to define length
  • avoid mixing multiple overlapping settings
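
One way to do that, sketched under the assumption that your Transformers version lets Seq2SeqTrainingArguments take a generation_config directly (if not, set model.generation_config instead and drop generation_max_length):

```python
from transformers import GenerationConfig, Seq2SeqTrainingArguments

# Single source of truth for decoding, including output length.
gen_config = GenerationConfig(do_sample=False, num_beams=1, max_new_tokens=tgt_max)

training_args = Seq2SeqTrainingArguments(
    output_dir="debug-run",
    predict_with_generate=True,
    generation_config=gen_config,  # no separate generation_max_length
    # ... your other arguments ...
)
```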

8) A concrete “debug mode” recipe for your exact pipeline

Do this for 1–2 runs:

Debug Run A (pipeline correctness)

  • train set = 64 examples
  • eval set = same 64 examples
  • predict_with_generate=True
  • deterministic generation:
    • do_sample=False
    • num_beams=1
    • no repetition penalty
  • metric decode with skip_special_tokens=False temporarily (just to confirm nothing meaningful is being dropped)

Expected:

  • eval_loss drops with training
  • recovery rises strongly

Debug Run B (generalization baseline)

  • train set = 64 examples
  • eval set = separate validation split
  • same deterministic generation

Expected:

  • train loss goes near 0
  • val loss may not improve much
  • recovery may saturate low due to no generalization

This separates “evaluation broken” from “no generalization.”
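
Putting Debug Run A together as code (a sketch: train_64, training_args, data_collator, and compute_metrics are placeholders for your own objects, with the deterministic decoding from section 4.1):

```python
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,        # predict_with_generate=True, deterministic decoding
    train_dataset=train_64,    # your 64 examples
    eval_dataset=train_64,     # the same 64 examples, on purpose
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()  # eval_loss should drop hard and recovery should climb toward 1.0
```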


High-quality references you are directly relying on here

  • predict_with_generate=True triggers generate() during evaluation in Seq2SeqTrainer. (Hugging Face)
  • Generation modes: greedy vs sampling vs beam depend on do_sample and num_beams. (Hugging Face)
  • label_pad_token_id=-100 is ignored by PyTorch loss, and is the documented default for seq2seq collators. (Hugging Face)
  • Transformers guidance: prefer max_new_tokens for controlling generated length in many contexts. (Mark Needham)
  • Seq2SeqTrainer generation kwargs behavior (how generation_max_length maps into generate). (Hugging Face Forums)

Summary

  • Yes, with 64 examples and a separate val split, “train loss ↓, eval_loss flat or ↑” is normal overfitting.
  • Your recovery metric is likely “stuck” because you evaluate with sampling (do_sample=True), which injects randomness by design. (Hugging Face)
  • predict_with_generate=True makes evaluation run generate(), so metrics and eval_loss measure different regimes. (Hugging Face)
  • Do the sanity check: use the same examples for train and val and switch eval decoding to deterministic.
  • Also verify bad_words_ids does not forbid any label token, and confirm skip_special_tokens=True is not deleting meaningful symbols.