Yes. What you are seeing is consistent with “pipeline is fine, evaluation is measuring something different,” plus one very likely extra issue in your setup: you evaluate with stochastic decoding.
You currently have three different regimes:
- Training loss: teacher forcing. The decoder always gets the correct previous tokens. This can go near 0 on 64 examples even with zero generalization.
- `eval_loss`: also teacher forcing, on the validation set. It can stagnate or rise if val differs, even while train loss goes to ~0. That is normal overfitting.
- `eval_sequence_recovery`: autoregressive generation, because `predict_with_generate=True` makes `Seq2SeqTrainer` call `model.generate()` during evaluation. Hugging Face explicitly documents this behavior. (Hugging Face)
This metric is not “the same thing as loss.” It is “how good is free-running decoding under your decoding strategy.”
So your understanding is basically correct: predict_with_generate=True makes evaluation include generation, which is a different setup than teacher forcing. But the more concrete diagnosis is:
The biggest reason your eval_sequence_recovery can look saturated or worse even when the model memorizes
You set:
- `do_sample=True`
- `top_p=0.85`, `top_k=3`, `temperature=1.0`
- `repetition_penalty=1.2`
That means evaluation uses sampling. Sampling is intentionally random. Hugging Face generation docs make it explicit: do_sample=True switches to multinomial sampling (or sampling variants), while do_sample=False gives greedy or beam decoding depending on beams. (Hugging Face)
Why this matters even if the model overfits perfectly
Even if the model assigns the correct next token the highest probability, with do_sample=True it can still sample a different token. With top_k=3, it is literally allowed to pick among the top 3. So you can get:
- train loss ≈ 0 (teacher forcing, easy)
- generated outputs still sometimes wrong (sampling, stochastic)
- recovery metric flat or noisy (because you are injecting randomness)
For a “sequence recovery” metric, sampling is usually the wrong choice during evaluation. Sampling is for diversity. Recovery is for correctness.
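To see why, here is a toy illustration (made-up logits, not your model or tokenizer) of how top-k sampling can pick something other than the argmax token even when the model is confident:

```python
import torch

# Toy next-token distribution: token 0 is the "correct" argmax token.
torch.manual_seed(0)
logits = torch.tensor([4.0, 3.2, 3.0, -1.0])
top_k = 3

# Restrict to the top-k candidates, renormalize, then sample (what do_sample=True does).
topk_vals, topk_idx = torch.topk(logits, top_k)
probs = torch.softmax(topk_vals, dim=-1)
picks = topk_idx[torch.multinomial(probs, num_samples=1000, replacement=True)]

# Even though token 0 is clearly preferred, a large fraction of samples pick something else.
print("fraction of non-argmax picks:", (picks != 0).float().mean().item())
```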
1) Your padding and collator look correct
`label_pad_token_id=-100` is the right setting, and the Transformers docs explicitly state that -100 label padding is ignored by PyTorch loss functions. (Hugging Face)
So padding is unlikely to be the cause here.
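For completeness, a minimal sketch of what that setting does (the checkpoint name is a placeholder and the hand-written features are just for illustration):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder checkpoint
collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=None,                # pass your model if you want decoder_input_ids prepared
    label_pad_token_id=-100,   # padded label positions are ignored by the loss
)

batch = collator([
    {"input_ids": [10, 11, 12], "labels": [20, 21]},
    {"input_ids": [13],         "labels": [22, 23, 24]},
])
print(batch["labels"])  # the shorter label row is padded with -100, not pad_token_id
```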
2) predict_with_generate=True is not “wrong,” but it changes what compute_metrics receives
With predict_with_generate=True, Seq2SeqTrainer generates sequences for eval and those generated sequences are what your metric sees. That is the intended feature. (Hugging Face)
So your metric is answering:
“How similar are sampled generations to references?”
Not:
“How well does the model predict the next token given the gold prefix?”
Those can diverge massively, especially with sampling.
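If it helps, this is roughly the shape of what `compute_metrics` receives in this mode (a sketch, assuming `tokenizer` is in scope; your actual recovery computation may differ):

```python
import numpy as np

def compute_metrics(eval_pred):
    # With predict_with_generate=True these are generated token IDs, not logits.
    preds, labels = eval_pred.predictions, eval_pred.label_ids
    # -100 is a loss-masking value, not a real token ID; restore the pad token before decoding.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Whatever similarity you compute here, it compares *generations* to references,
    # so the decoding strategy directly shapes the score.
    exact = np.mean([p.strip() == l.strip() for p, l in zip(pred_str, label_str)])
    return {"sequence_recovery": float(exact)}
```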
3) The single best debug test: evaluate on the training examples
“Could I also make an overfit run with using the same samples in the val set and see if it then improves?”
Yes. Do it. It is a standard sanity check.
Expected outcome if your pipeline is correct
- train loss → near 0
- `eval_loss` on the same data → near 0 (maybe slightly higher due to eval mode differences, but should drop hard)
- greedy generation recovery → very high, often near 1.0 on memorized examples
If that does NOT happen
Then something is wrong in evaluation or decoding, not “generalization.”
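A sketch of that sanity run, assuming your existing `model`, `training_args`, `collator`, `compute_metrics`, and a 64-example `train_subset`:

```python
from transformers import Seq2SeqTrainer

sanity_trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,            # with predict_with_generate=True
    train_dataset=train_subset,
    eval_dataset=train_subset,     # evaluate on the exact examples you train on
    data_collator=collator,
    tokenizer=tokenizer,           # newer Transformers versions call this processing_class
    compute_metrics=compute_metrics,
)
sanity_trainer.train()
```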
4) What I would change in your eval setup immediately
4.1 Make evaluation decoding deterministic
For debugging and for meaningful recovery curves:
- set `do_sample=False`
- remove `top_p`, `top_k`, `temperature` (they are sampling controls)
- set `num_beams=1` (greedy) first, then try `num_beams=4` if you want
This matches how generation modes are defined in Transformers. (Hugging Face)
Why greedy first
Greedy is the cleanest signal. If greedy can’t recover on train examples, sampling won’t fix it.
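A sketch of a deterministic eval setup, modifying the model's generation config in place (adapt if you build and pass your own `GenerationConfig` object):

```python
gen = model.generation_config
gen.do_sample = False          # greedy decoding
gen.num_beams = 1
gen.temperature = 1.0          # neutral values in place of the sampling knobs
gen.top_p = 1.0
gen.top_k = 50                 # library default; unused when do_sample=False
gen.repetition_penalty = 1.0   # no decoding-time penalty during debugging
```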
4.2 Make sure your “bad words” are not blocking required tokens
You pass bad_words_ids=bad_words. If any token that appears in your labels is included there, generation can never match the reference, no matter how well the model trains.
This is a very common silent killer for “recovery” style metrics: the model is trained to output something, but decoding forbids it.
Concrete check:
- collect the set of token IDs appearing in labels
- intersect with `bad_words_ids`
- the intersection must be empty (see the sketch below)
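A sketch of that check (assuming `bad_words` is your `bad_words_ids` list of ID lists and `eval_dataset["labels"]` holds the tokenized labels):

```python
# Collect every real label token ID (dropping the -100 masking value)...
label_ids = set()
for row in eval_dataset["labels"]:
    label_ids.update(t for t in row if t != -100)

# ...and intersect with the flattened bad-words IDs. This must come back empty.
bad_ids = {t for seq in bad_words for t in seq}
print("label tokens forbidden by bad_words_ids:", label_ids & bad_ids)
```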
4.3 Stop using sampling penalties for evaluation
repetition_penalty changes token probabilities during decoding. It can force the decoder away from the true sequence even if the model learned it.
Use such penalties only for “nice-looking text.” Not for correctness metrics.
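As a rough illustration of the effect (a simplified version of how the penalty rescales scores for tokens that have already been generated):

```python
import torch

# A positive logit for a token that already appeared gets divided by the penalty,
# so repeating the *correct* token is actively discouraged.
logit = torch.tensor(2.5)
penalty = 1.2
penalized = logit / penalty if logit > 0 else logit * penalty
print(float(penalized))  # ~2.08: the true token becomes less likely at that step
```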
5) A subtle metric pitfall in your code: skip_special_tokens=True
You decode with skip_special_tokens=True for both preds and labels.
That is often correct for normal text. But it can be catastrophically wrong if your task uses “special tokens” as meaningful symbols (custom markers, sentinels, reserved tokens).
If your ground-truth sequences contain tokens that the tokenizer considers “special,” they will be deleted before scoring and your metric can look flat or nonsensical.
Quick test:
- decode one label with `skip_special_tokens=False`
- compare to `skip_special_tokens=True`
- if important symbols disappear, your metric is not measuring what you think (see the sketch below)
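A sketch of that check, assuming `label_row` is one tokenized label sequence from your eval set:

```python
ids = [t for t in label_row if t != -100]          # drop the loss-masking value
print(tokenizer.decode(ids, skip_special_tokens=False))
print(tokenizer.decode(ids, skip_special_tokens=True))
# If symbols you rely on vanish in the second print, the metric is scoring a
# stripped-down version of your sequences.
```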
6) About eval_loss specifically: why it can increase while train loss goes to ~0 on 64 examples
With 64 examples, you can memorize training patterns quickly. Validation loss can:
- plateau if the mapping is not learnable from that tiny sample
- increase as you overfit and become more confident on wrong patterns for val
That is normal “tiny dataset overfit” behavior. Nothing surprising there.
What is surprising is “recovery metric does not improve even on train examples.” If that’s your situation, it is almost always decode/metric config.
7) Generation length knobs: reduce confusion
You set `generation_max_length=tgt_max` and also `GenerationConfig(max_length=tgt_max, ...)`.
This is redundant.
Also, Transformers has been nudging people toward max_new_tokens instead of relying on max_length defaults. There is even a warning that recommends max_new_tokens for controlling output length. (Mark Needham)
Separately, in Seq2SeqTrainer, generation_max_length is used to build gen_kwargs["max_length"] during evaluation, overriding defaults. (Hugging Face Forums)
For encoder-decoder models, max_length is usually fine, but for clarity during debugging:
- pick one place to define length
- avoid mixing multiple overlapping settings
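For example (a sketch: length defined once, on the generation config, with `generation_max_length` removed from the trainer arguments so nothing overrides it):

```python
# Define the output length once, on the generation config the trainer will use.
model.generation_config.max_new_tokens = tgt_max   # explicit budget for generated tokens
# ...and do not also pass generation_max_length / max_length elsewhere.
```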
8) A concrete “debug mode” recipe for your exact pipeline
Do this for 1–2 runs:
Debug Run A (pipeline correctness)
- train set = 64 examples
- eval set = same 64 examples
- `predict_with_generate=True`
- deterministic generation: `do_sample=False`, `num_beams=1`, no repetition penalty
- metric decode with `skip_special_tokens=False` temporarily (just to confirm nothing meaningful is being dropped)
Expected:
- `eval_loss` drops with training
- recovery rises strongly
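A sketch of Run A's trainer arguments (everything except the discussed flags is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

debug_args = Seq2SeqTrainingArguments(
    output_dir="debug-overfit",
    predict_with_generate=True,
    num_train_epochs=50,            # enough passes to memorize 64 examples
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",          # older Transformers versions call this evaluation_strategy
    logging_strategy="epoch",
)
# Pair this with eval_dataset=train_subset and the deterministic generation config from 4.1.
```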
Debug Run B (generalization baseline)
- train set = 64 examples
- eval set = separate validation split
- same deterministic generation
Expected:
- train loss goes near 0
- val loss may not improve much
- recovery may saturate low due to no generalization
This separates “evaluation broken” from “no generalization.”
High-quality references you are directly relying on here
- `predict_with_generate=True` triggers `generate()` during evaluation in Seq2SeqTrainer. (Hugging Face)
- Generation modes: greedy vs sampling vs beam depend on `do_sample` and `num_beams`. (Hugging Face)
- `label_pad_token_id=-100` is ignored by PyTorch loss, and is the documented default for seq2seq collators. (Hugging Face)
- Transformers guidance: prefer `max_new_tokens` for controlling generated length in many contexts. (Mark Needham)
- Seq2SeqTrainer generation kwargs behavior (how `generation_max_length` maps into `generate()`). (Hugging Face Forums)
Summary
- Yes, with 64 examples and a separate val split, “train loss ↓, eval_loss flat or ↑” is normal overfitting.
- Your recovery metric is likely “stuck” because you evaluate with sampling (`do_sample=True`), which injects randomness by design. (Hugging Face)
- `predict_with_generate=True` makes evaluation run `generate()`, so the metrics and `eval_loss` measure different regimes. (Hugging Face)
- Do the sanity check: use the same examples for train and val and switch eval decoding to deterministic.
- Also verify `bad_words_ids` does not forbid any label token, and confirm `skip_special_tokens=True` is not deleting meaningful symbols.