Mismatch between memory estimate and Trainer API

Hey, I'm fooling around with some LMs for code generation and I run out of memory all the time, no matter which model I use. I have 8 GB of VRAM, so I tried flax-community/gpt-neo-125M-code-clippy-dedup-2048, because accelerate estimate-memory estimated 1.89 GB of VRAM for training with Adam.

When I train, I get the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 246.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Of the allocated memory 20.73 GiB is allocated by PyTorch, and 1.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why does PyTorch allocate so much memory?

  • I reduced the input to 10 SQL files
  • I reduced batch size to 1
  • I tried to reduce the max_length parameter as well, but it had no effect
  • I use transformers 4.37.0 and torch 2.1.2+cu121

Here is my code for more context:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer
from datasets import Dataset, load_dataset, Value, Features
from pathlib import Path


DATA_PATH = Path("./data")
files = [str(p) for p in DATA_PATH.glob("*.sql")][:10]
train_files, test_files = files[:int(len(files) * 0.8)], files[int(len(files) * 0.8):]

features = Features({'code': Value('string')})
ds = load_dataset("text", data_files={"train": train_files, "test": test_files}, sample_by="document", features=features)

tokenizer = AutoTokenizer.from_pretrained(
    # "stabilityai/stable-code-3b",
    # "deepseek-ai/deepseek-coder-1.3b-instruct",
    "flax-community/gpt-neo-125M-code-clippy-dedup-2048",
    trust_remote_code=True)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForCausalLM.from_pretrained(
    # "stabilityai/stable-code-3b",
    # "deepseek-ai/deepseek-coder-1.3b-instruct",
    "flax-community/gpt-neo-125M-code-clippy-dedup-2048",
    trust_remote_code=True,
    torch_dtype="auto",
)

max_length = model.config.max_position_embeddings

tokenized_dataset = (ds.map(
    lambda example: tokenizer(example["code"],
                              return_tensors="pt",
                              truncation=True,
                              padding="max_length",
                              max_length=max_length),
    batched=True,
    batch_size=1,
))
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments("test-trainer")
model.cuda()


trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()


Yep, this one's mostly a gotcha.

You didn't actually set the training batch size to 1.
ds.map(..., batch_size=1) only affects tokenisation, not the Trainer. TrainingArguments defaults to per_device_train_batch_size=8, so you're probably training with batch=8 and seq_len=2048 (because you pad to max length). That will eat 8 GB fast, even on a small model.

Also: you’re doing padding="max_length" with max_length = 2048, so every sample becomes 2048 tokens even if the SQL is short. That’s a VRAM killer.
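If you want to confirm what the Trainer will actually feed the model with your current setup, you can pull one batch from its train dataloader. Just a sketch against the trainer object from your script (the shape in the comment is what I'd expect with the defaults, not something I've run on your data):

# Sketch: inspect the first batch the Trainer would build with the current settings.
dl = trainer.get_train_dataloader()
batch = next(iter(dl))
print(batch["input_ids"].shape)  # expect something like torch.Size([8, 2048]) with the defaults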

Minimal fix:

training_args = TrainingArguments(
    "test-trainer",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # keep the "effective batch" if you want
    fp16=True,  # or bf16=True if supported
)

model.config.use_cache = False # saves memory during training
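
If you're not sure whether your card supports bf16, a quick check (just a sketch):

import torch
print(torch.cuda.is_bf16_supported())  # True -> prefer bf16=True, otherwise stick with fp16=True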

And don’t pre-pad everything to 2048 in the dataset. Let the collator pad dynamically:

tokenized_dataset = ds.map(
    lambda ex: tokenizer(ex["code"], truncation=True, max_length=512),  # pick a real max
    batched=True,
    remove_columns=["code"],
)
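
To sanity-check that the collator really pads per batch instead of to 2048, you can run it on a couple of tokenized samples. Sketch, reusing the data_collator and tokenized_dataset names from your script:

# Sketch: the collator should pad each batch only to its longest sequence.
samples = [tokenized_dataset["train"][i] for i in range(2)]
batch = data_collator(samples)
print(batch["input_ids"].shape)  # second dim = length of the longest of the two samples, not 2048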

Why the error reports odd numbers (like 20 GiB "allocated"): PyTorch uses a caching allocator, so the allocated/reserved figures can look strange when memory fragments or when chunks stay cached. The important part is that your run has 0 bytes free, so something (batch size plus padding to 2048) is simply too big.
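
If you want to see what the allocator is actually holding, torch exposes a few counters, and the env var from the error message has to be set before CUDA is initialised. Rough sketch; the 128 MiB split size is just an example value:

import os
# Must be set before torch initialises CUDA, e.g. at the very top of the script.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved")
print(torch.cuda.memory_summary())  # detailed breakdown, including fragmentation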
Hope this helps.
RFTSystems, Liam :nerd_face:
