In your exact setup, "peaks near 8 GiB" can be normal. The "< 2 GiB" expectation comes from using a load-memory estimator for a training-peak question, plus a silent default in Trainer that likely makes your run much larger than you think.
Below is a concrete diagnosis for your code, then a verification plan, then fixes that reliably shrink VRAM.
1) Why your estimate is low and your measured peak is high
1.1 The Accelerate "Model Memory Estimator" is not a training peak estimator
The Accelerate calculator explicitly says it estimates memory "to purely load the model in, not to perform inference." Training uses even more memory than inference because it needs gradients, optimizer states, and saved activations. (Hugging Face)
So "< 2 GiB" can be consistent with "8 GiB peak during training."
1.2 You did not set batch size, so you likely trained with batch size 8
TrainingArguments.per_device_train_batch_size defaults to 8 (and eval also defaults to 8). (Hugging Face)
That matters a lot at block_size = 512.
If you thought you were training with batch size 1, but you are actually training with 8, peak VRAM can jump by multiple GB because activation memory scales roughly linearly with batch size.
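You can check the library default directly (a minimal sketch; "out" is just a placeholder output directory):

from transformers import TrainingArguments

# Construct TrainingArguments without specifying a batch size and inspect the default.
args = TrainingArguments(output_dir="out")
print(args.per_device_train_batch_size)  # 8 unless you override it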
1.3 Training peak memory is dominated by more than "model weights"
Hugging Face's "Efficient Training on a Single GPU" guide summarizes the core reality:
- Adam/AdamW stores additional per-parameter state.
- Activations add a large, shape-dependent chunk (batch size, sequence length, hidden size).
- So "bytes per parameter + activation memory" is the right mental model for training. (Hugging Face)
This is exactly where "peaks" come from: forward saves activations, backward consumes them, allocator caches blocks, repeat.
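As a rough back-of-the-envelope for GPT-2 small (about 124M parameters) trained in fp32 with AdamW, the per-parameter states alone already approach 2 GiB before a single activation is counted (a sketch of the mental model above, not an exact accounting):

# Rough fp32 + AdamW bytes-per-parameter estimate.
# Excludes activations, temporary buffers, allocator overhead, and the CUDA context.
n_params = 124_000_000               # GPT-2 small, approximately
bytes_per_param = 4 + 4 + 4 + 4      # weights + gradients + AdamW exp_avg + exp_avg_sq
print(n_params * bytes_per_param / 1024**3, "GiB")  # ~1.85 GiB before activations

Activation memory sits on top of that and scales with batch size, sequence length, and hidden size, which is where the rest of an 8 GiB peak comes from.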
2) Why the profiler shows "Unknown" memory
"Unknown" usually means "not attributed," not "mysterious."
2.1 PyTorch's caching allocator makes peaks look sticky and "bigger than tensors"
PyTorch uses a caching allocator. Freed blocks are often kept reserved for reuse. PyTorch documentation says unused cached memory can still show as "used" and recommends comparing:
- memory_allocated() and max_memory_allocated() for tensor memory
- memory_reserved() and max_memory_reserved() for allocator-managed total (PyTorch Docs)
If your timeline is showing allocator-reserved memory that isn't neatly categorized, it can appear as "Unknown."
A simple rule: if max_reserved is near 8 GiB but max_allocated is much lower, your "8 GiB peak" is mostly allocator reservation and fragmentation behavior, not 8 GiB of live tensors.
2.2 The memory timeline categorizer is known to miss categories in some cases
There are real reports of export_memory_timeline degenerating into large grey "unknown" regions, especially when execution mode changes (example: after torch.compile). (PyTorch Developer Mailing List)
There are also discussions where the first forward-pass allocation ramp is shown as "unknown" rather than "parameters/activations." (GitHub)
PyTorch's own blog post is explicit: the timeline "categorizes memory usage," but for deep dives with stack traces "we still rely on the Memory Snapshot." (PyTorch)
2.3 You are profiling from step 0 with warmup=0
Your schedule has wait=0, warmup=0. That means you record one-time setup allocations (CUDA context, kernel autotuning, allocator pool growth). Those one-time surges are common and can look like spikes.
3) The fastest way to confirm what is happening in your run
3.1 Print "max allocated" vs "max reserved"
Wrap your trainer.train() call like this (or collect the same stats from a callback during training):
import torch

def gib(x):
    return x / 1024**3

torch.cuda.reset_peak_memory_stats()  # start peak tracking from a clean slate
trainer.train()
print("max allocated (GiB):", gib(torch.cuda.max_memory_allocated()))  # peak of live tensors
print("max reserved (GiB):", gib(torch.cuda.max_memory_reserved()))    # peak of the allocator's pool
Interpretation:
- reserved ≫ allocated: the caching allocator and fragmentation dominate the "peak" story. This is expected behavior in PyTorch's CUDA memory model. (PyTorch Docs)
- allocated ≈ reserved ≈ 8 GiB: your live training tensors (activations, attention intermediates, optimizer states) truly reach that level.
3.2 Verify your actual microbatch size
Print trainer.args.per_device_train_batch_size and the effective global batch (microbatch × devices × gradient accumulation steps). This matters because the documented default is 8. (Hugging Face)
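A minimal check, assuming the trainer object from your script (world_size is 1 on a single GPU):

# Inspect the microbatch the Trainer actually uses and the resulting effective batch.
args = trainer.args
micro = args.per_device_train_batch_size
effective = micro * args.world_size * args.gradient_accumulation_steps
print("per-device microbatch:", micro)
print("effective global batch:", effective)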
4) High-impact changes that directly address your case
4.1 Set microbatch size explicitly and use gradient accumulation
This is the single most reliable way to shrink activation peaks without changing the effective batch.
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8, # effective batch similar to default 8
num_train_epochs=1,
learning_rate=2e-5,
weight_decay=0.01,
)
The HF single-GPU training guide explicitly frames batch size as a primary memory lever. (Hugging Face)
4.2 Disable KV cache during training (use_cache=False)
KV cache is for generation speed. It is usually wasted in plain causal LM training and can increase memory footprint.
Transformers GPT-2 code warns that use_cache=True is incompatible with gradient checkpointing and flips it off in that case. (Hugging Face)
Even if you do not use checkpointing, explicitly setting use_cache=False is a good training default.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.config.use_cache = False
4.3 Enable mixed precision (fp16 or bf16)
Mixed precision typically cuts activation memory substantially.
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
fp16=True, # or bf16=True if supported
num_train_epochs=1,
learning_rate=2e-5,
weight_decay=0.01,
)
This aligns with HF's guidance that training memory is "bytes per parameter plus activation memory," so shrinking activations matters. (Hugging Face)
4.4 Enable gradient checkpointing if seq_len=512 is the driver
Checkpointing trades compute for less saved activation memory.
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
fp16=True,
gradient_checkpointing=True,
num_train_epochs=1,
)
model.config.use_cache = False
Transformers issues and docs repeatedly point out the use_cache interaction with checkpointing. (GitHub)
4.5 Make sure you are using an efficient attention implementation (SDPA)
Transformers has an "attention backend" interface. You can request SDPA or FlashAttention2 at load time. (Hugging Face)
For GPT-2, the docs say SDPA is used by default for sufficiently new PyTorch when available, and you can also explicitly request it. (Hugging Face)
model = AutoModelForCausalLM.from_pretrained(
"gpt2",
attn_implementation="sdpa",
)
model.config.use_cache = False
Why this matters: PyTorch SDPA can dispatch to more memory-efficient kernels. The PyTorch SDPA blog explains SDPA has multiple backends such as flash attention and memory-efficient attention. (PyTorch)
5) Profiling changes to avoid misleading "first-iteration spikes"
5.1 Add warmup to your profiler schedule
Right now you profile the first CUDA allocations. That captures one-time spikes.
Use something like:
schedule=torch.profiler.schedule(wait=1, warmup=2, active=6, repeat=1)
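In context with Trainer, that schedule could look like the sketch below. The callback class name and the "prof_out" trace directory are placeholders, and the rest assumes the trainer object from your script:

import torch
from transformers import TrainerCallback

prof = torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=2, active=6, repeat=1),
    profile_memory=True,
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("prof_out"),
)

class ProfilerStepCallback(TrainerCallback):
    # Advance the profiler schedule once per optimizer step.
    def on_step_end(self, args, state, control, **kwargs):
        prof.step()

prof.start()
trainer.add_callback(ProfilerStepCallback())
trainer.train()
prof.stop()

With wait=1 and warmup=2, the recorded window starts after the one-time setup allocations instead of on the very first step.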
5.2 Use Memory Snapshots when the timeline says "Unknown"
PyTorchâs recommended workflow for attribution is CUDA memory snapshots plus the viewer (pytorch.org/memory_viz). (PyTorch Docs)
Minimal snapshot capture:
torch.cuda.memory._record_memory_history(max_entries=100000)  # start recording allocations with stack traces
trainer.train()
torch.cuda.memory._dump_snapshot("snapshot.pickle")           # write the snapshot to disk
torch.cuda.memory._record_memory_history(enabled=None)        # stop recording
Then inspect the snapshot in the viewer described in the docs. (PyTorch Docs)
This is how you turn "Unknown 3.2 GiB" into a list of allocations with stack traces.
6) If peaks are mostly "reserved," consider allocator knobs (optional, last mile)
If max_reserved stays near 8 GiB while max_allocated is far smaller, you are likely seeing fragmentation and allocator high-water behavior.
PyTorch documents allocator backend selection via PYTORCH_CUDA_ALLOC_CONF, including backend=cudaMallocAsync (CUDA 11.4+). (PyTorch Docs)
Caution: there are also bug reports around cudaMallocAsync in some setups, so treat it as an experiment, not a guaranteed fix. (GitHub)
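If you do experiment with it, the allocator is configured through an environment variable of comma-separated key:value pairs, set before the first CUDA allocation (a sketch; you can equally export the variable in your shell):

import os
# Set this at the very top of the script, before torch/transformers create any CUDA tensors.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # only import torch and build the model after the variable is set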
A "best practice" patch of your script (minimal and targeted)
model = AutoModelForCausalLM.from_pretrained(
"gpt2",
attn_implementation="sdpa", # request SDPA explicitly
)
model.config.use_cache = False # good training default
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
fp16=True, # or bf16=True
gradient_checkpointing=True, # optional but usually helpful at seq=512
num_train_epochs=1,
learning_rate=2e-5,
weight_decay=0.01,
)
This specifically targets your biggest likely drivers: default batch=8, saved activations, cache, attention kernel choice. (Hugging Face)
Summary
- Your "< 2 GiB" estimate is for loading, not training peaks. (Hugging Face)
- You likely trained with batch size 8 by default. That alone can push peaks toward 8 GiB at seq=512. (Hugging Face)
- "Unknown" is commonly allocator-reserved or unattributed memory. Compare max_allocated vs max_reserved. (PyTorch Docs)
- Fixes that usually work: set microbatch explicitly, use grad accumulation, disable use_cache, enable fp16/bf16, consider checkpointing, request SDPA. (Hugging Face)
- For definitive attribution, use CUDA memory snapshots and the official viewer. (PyTorch Docs)