In your exact setup, "peaks near 8 GiB" can be normal. The "< 2 GiB" expectation comes from using a load-memory estimator for a training-peak question, plus a silent default in Trainer that likely makes your run much larger than you think.
Below is a concrete diagnosis for your code, then a verification plan, then fixes that reliably shrink VRAM.
1) Why your estimate is low and your measured peak is high
1.1 The Accelerate "Model Memory Estimator" is not a training peak estimator
The Accelerate calculator explicitly says it estimates memory "to purely load the model in, not to perform inference." Training uses even more memory than inference because it needs gradients, optimizer states, and saved activations. (Hugging Face)
So "< 2 GiB" can be consistent with "8 GiB peak during training."
1.2 You did not set batch size, so you likely trained with batch size 8
TrainingArguments.per_device_train_batch_size defaults to 8 (and eval also defaults to 8). (Hugging Face)
That matters a lot at block_size = 512.
If you thought you were training with batch size 1, but you are actually training with 8, peak VRAM can jump by multiple GB because activation memory scales roughly linearly with batch size.
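You can check the library default directly (a minimal sketch; "out" is just a placeholder output directory):

from transformers import TrainingArguments

# Construct TrainingArguments without specifying a batch size and inspect the default.
args = TrainingArguments(output_dir="out")
print(args.per_device_train_batch_size)  # 8 unless you override it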
1.3 Training peak memory is dominated by more than "model weights"
Hugging Face's "Efficient Training on a Single GPU" guide summarizes the core reality:
- Adam/AdamW stores additional per-parameter state.
- Activations add a large, shape-dependent chunk (batch size, sequence length, hidden size).
- So "bytes per parameter + activation memory" is the right mental model for training. (Hugging Face)
This is exactly where "peaks" come from: forward saves activations, backward consumes them, allocator caches blocks, repeat.
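As a rough back-of-the-envelope for GPT-2 small (about 124M parameters) trained in fp32 with AdamW, the per-parameter states alone already approach 2 GiB before a single activation is counted (a sketch of the mental model above, not an exact accounting):

# Rough fp32 + AdamW bytes-per-parameter estimate.
# Excludes activations, temporary buffers, allocator overhead, and the CUDA context.
n_params = 124_000_000               # GPT-2 small, approximately
bytes_per_param = 4 + 4 + 4 + 4      # weights + gradients + AdamW exp_avg + exp_avg_sq
print(n_params * bytes_per_param / 1024**3, "GiB")  # ~1.85 GiB before activations

Activation memory sits on top of that and scales with batch size, sequence length, and hidden size, which is where the rest of an 8 GiB peak comes from.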
2) Why the profiler shows "Unknown" memory
"Unknown" usually means "not attributed," not "mysterious."
2.1 PyTorch's caching allocator makes peaks look sticky and "bigger than tensors"
PyTorch uses a caching allocator. Freed blocks are often kept reserved for reuse. PyTorch documentation says unused cached memory can still show as "used" and recommends comparing:
- memory_allocated() and max_memory_allocated() for tensor memory
- memory_reserved() and max_memory_reserved() for allocator-managed total (PyTorch Docs)
If your timeline is showing allocator-reserved memory that isn't neatly categorized, it can appear as "Unknown."
A simple rule: if max_reserved is near 8 GiB but max_allocated is much lower, your "8 GiB peak" is mostly allocator reservation and fragmentation behavior, not 8 GiB of live tensors.
2.2 The memory timeline categorizer is known to miss categories in some cases
There are real reports of export_memory_timeline degenerating into large grey "unknown" regions, especially when execution mode changes (example: after torch.compile). (PyTorch Developer Mailing List)
There are also discussions where the first forward-pass allocation ramp is shown as "unknown" rather than "parameters/activations." (GitHub)
PyTorch's own blog post is explicit: the timeline "categorizes memory usage," but for deep dives with stack traces "we still rely on the Memory Snapshot." (PyTorch)
2.3 You are profiling from step 0 with warmup=0
Your schedule has wait=0, warmup=0. That means you record one-time setup allocations (CUDA context, kernel autotuning, allocator pool growth). Those one-time surges are common and can look like spikes.
3) The fastest way to confirm what is happening in your run
3.1 Print "max allocated" vs "max reserved"
Wrap your trainer.train() call like this (or collect the same stats from a callback during training):
import torch

def gib(x):
    return x / 1024**3

torch.cuda.reset_peak_memory_stats()  # start peak tracking from a clean slate
trainer.train()
print("max allocated (GiB):", gib(torch.cuda.max_memory_allocated()))  # peak of live tensors
print("max reserved (GiB):", gib(torch.cuda.max_memory_reserved()))    # peak of the allocator's pool
Interpretation:
- reserved ≫ allocated: the caching allocator and fragmentation dominate the "peak" story. This is expected behavior in PyTorch's CUDA memory model. (PyTorch Docs)
- allocated ≈ reserved ≈ 8 GiB: your live training tensors (activations, attention intermediates, optimizer states) truly reach that level.
3.2 Verify your actual microbatch size
Print trainer.args.per_device_train_batch_size and the effective global batch (microbatch × devices × gradient accumulation steps). This matters because the documented default is 8. (Hugging Face)
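A minimal check, assuming the trainer object from your script (world_size is 1 on a single GPU):

# Inspect the microbatch the Trainer actually uses and the resulting effective batch.
args = trainer.args
micro = args.per_device_train_batch_size
effective = micro * args.world_size * args.gradient_accumulation_steps
print("per-device microbatch:", micro)
print("effective global batch:", effective)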
4) High-impact changes that directly address your case
4.1 Set microbatch size explicitly and use gradient accumulation
This is the single most reliable way to shrink activation peaks without changing the effective batch.
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8, # effective batch similar to default 8
num_train_epochs=1,
learning_rate=2e-5,
weight_decay=0.01,
)
The HF single-GPU training guide explicitly frames batch size as a primary memory lever. (Hugging Face)
4.2 Disable KV cache during training (use_cache=False)
KV cache is for generation speed. It is usually wasted in plain causal LM training and can increase memory footprint.
Transformers GPT-2 code warns that use_cache=True is incompatible with gradient checkpointing and flips it off in that case. (Hugging Face)
Even if you do not use checkpointing, explicitly setting use_cache=False is a good training default.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.config.use_cache = False
4.3 Enable mixed precision (fp16 or bf16)
Mixed precision typically cuts activation memory substantially.
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
fp16=True, # or bf16=True if supported
num_train_epochs=1,
learning_rate=2e-5,
weight_decay=0.01,
)
This aligns with HF's guidance that training memory is "bytes per parameter plus activation memory," so shrinking activations matters. (Hugging Face)
4.4 Enable gradient checkpointing if seq_len=512 is the driver
Checkpointing trades compute for less saved activation memory.
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
fp16=True,
gradient_checkpointing=True,
num_train_epochs=1,
)
model.config.use_cache = False
Transformers issues and docs repeatedly point out the use_cache interaction with checkpointing. (GitHub)
4.5 Make sure you are using an efficient attention implementation (SDPA)
Transformers has an "attention backend" interface. You can request SDPA or FlashAttention2 at load time. (Hugging Face)
For GPT-2, the docs say SDPA is used by default for sufficiently new PyTorch when available, and you can also explicitly request it. (Hugging Face)
model = AutoModelForCausalLM.from_pretrained(
"gpt2",
attn_implementation="sdpa",
)
model.config.use_cache = False
Why this matters: PyTorch SDPA can dispatch to more memory-efficient kernels. The PyTorch SDPA blog explains SDPA has multiple backends such as flash attention and memory-efficient attention. (PyTorch)
5) Profiling changes to avoid misleading "first-iteration spikes"
5.1 Add warmup to your profiler schedule
Right now you profile the first CUDA allocations. That captures one-time spikes.
Use something like:
schedule=torch.profiler.schedule(wait=1, warmup=2, active=6, repeat=1)
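In context with Trainer, that schedule could look like the sketch below. The callback class name and the "prof_out" trace directory are placeholders, and the rest assumes the trainer object from your script:

import torch
from transformers import TrainerCallback

prof = torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=2, active=6, repeat=1),
    profile_memory=True,
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("prof_out"),
)

class ProfilerStepCallback(TrainerCallback):
    # Advance the profiler schedule once per optimizer step.
    def on_step_end(self, args, state, control, **kwargs):
        prof.step()

prof.start()
trainer.add_callback(ProfilerStepCallback())
trainer.train()
prof.stop()

With wait=1 and warmup=2, the recorded window starts after the one-time setup allocations instead of on the very first step.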
5.2 Use Memory Snapshots when the timeline says "Unknown"
PyTorchâs recommended workflow for attribution is CUDA memory snapshots plus the viewer (pytorch.org/memory_viz). (PyTorch Docs)
Minimal snapshot capture:
torch.cuda.memory._record_memory_history(max_entries=100000)  # start recording allocations with stack traces
trainer.train()
torch.cuda.memory._dump_snapshot("snapshot.pickle")           # write the snapshot to disk
torch.cuda.memory._record_memory_history(enabled=None)        # stop recording
Then inspect the snapshot in the viewer described in the docs. (PyTorch Docs)
This is how you turn "Unknown 3.2 GiB" into a list of allocations with stack traces.
6) If peaks are mostly "reserved," consider allocator knobs (optional, last mile)
If max_reserved stays near 8 GiB while max_allocated is far smaller, you are likely seeing fragmentation and allocator high-water behavior.
PyTorch documents allocator backend selection via PYTORCH_CUDA_ALLOC_CONF, including backend=cudaMallocAsync (CUDA 11.4+). (PyTorch Docs)
Caution: there are also bug reports around cudaMallocAsync in some setups, so treat it as an experiment, not a guaranteed fix. (GitHub)
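If you do experiment with it, the allocator is configured through an environment variable of comma-separated key:value pairs, set before the first CUDA allocation (a sketch; you can equally export the variable in your shell):

import os
# Set this at the very top of the script, before torch/transformers create any CUDA tensors.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # only import torch and build the model after the variable is set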
A "best practice" patch of your script (minimal and targeted)
model = AutoModelForCausalLM.from_pretrained(
"gpt2",
attn_implementation="sdpa", # request SDPA explicitly
)
model.config.use_cache = False # good training default
training_args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
fp16=True, # or bf16=True
gradient_checkpointing=True, # optional but usually helpful at seq=512
num_train_epochs=1,
learning_rate=2e-5,
weight_decay=0.01,
)
This specifically targets your biggest likely drivers: default batch=8, saved activations, cache, attention kernel choice. (Hugging Face)
Summary
- Your "< 2 GiB" estimate is for loading, not training peaks. (Hugging Face)
- You likely trained with batch size 8 by default. That alone can push peaks toward 8 GiB at seq=512. (Hugging Face)
- "Unknown" is commonly allocator-reserved or unattributed memory. Compare max_allocated vs max_reserved. (PyTorch Docs)
- Fixes that usually work: set microbatch explicitly, use grad accumulation, disable use_cache, enable fp16/bf16, consider checkpointing, request SDPA. (Hugging Face)
- For definitive attribution, use CUDA memory snapshots and the official viewer. (PyTorch Docs)