Mismatch between memory estimate and Trainer API

Hey, I'm fooling around with some LMs for code generation and I run out of memory all the time, no matter which model I use. I have 8 GB of VRAM, so I tried flax-community/gpt-neo-125M-code-clippy-dedup-2048, because accelerate estimate-memory estimated 1.89 GB of VRAM for training with Adam.

When I train, I get the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 246.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Of the allocated memory 20.73 GiB is allocated by PyTorch, and 1.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why does PyTorch allocate so much memory?

  • I reduced the input to 10 SQL files
  • I reduced batch size to 1
  • I tried to reduce the max_length parameter as well, but it had no effect
  • I use transformers 4.37.0 and torch 2.1.2+cu121

Here is my code for more context:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer
from datasets import Dataset, load_dataset, Value, Features
from pathlib import Path


DATA_PATH = Path("./data")
files = [str(p) for p in DATA_PATH.glob("*.sql")][:10]
train_files, test_files = files[:int(len(files) * 0.8)], files[int(len(files) * 0.8):]

features = Features({'code': Value('string')})
ds = load_dataset("text", data_files={"train": train_files, "test": test_files}, sample_by="document", features=features)

tokenizer = AutoTokenizer.from_pretrained(
    # "stabilityai/stable-code-3b",
    # "deepseek-ai/deepseek-coder-1.3b-instruct",
    "flax-community/gpt-neo-125M-code-clippy-dedup-2048",
    trust_remote_code=True)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForCausalLM.from_pretrained(
    # "stabilityai/stable-code-3b",
    # "deepseek-ai/deepseek-coder-1.3b-instruct",
    "flax-community/gpt-neo-125M-code-clippy-dedup-2048",
    trust_remote_code=True,
    torch_dtype="auto",
)

max_length = model.config.max_position_embeddings

tokenized_dataset = (ds.map(
    lambda example: tokenizer(example["code"],
                              return_tensors="pt",
                              truncation=True,
                              padding="max_length",
                              max_length=max_length),
    batched=True,
    batch_size=1,
))
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments("test-trainer")
model.cuda()


trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()


Yep, this one's mostly a gotcha.

You didn't actually set the training batch size to 1.
ds.map(..., batch_size=1) only affects tokenisation, not the Trainer. TrainingArguments defaults to per_device_train_batch_size=8, so you're probably training with batch=8 and seq_len=2048 (because you pad to max length). That will eat 8 GB fast, even on a small model.

Also: you’re doing padding="max_length" with max_length = 2048, so every sample becomes 2048 tokens even if the SQL is short. That’s a VRAM killer.
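If you want to confirm what the Trainer will actually feed the model with your current setup, you can pull one batch from its train dataloader. Just a sketch against the trainer object from your script (the shape in the comment is what I'd expect with the defaults, not something I've run on your data):

# Sketch: inspect the first batch the Trainer would build with the current settings.
dl = trainer.get_train_dataloader()
batch = next(iter(dl))
print(batch["input_ids"].shape)  # expect something like torch.Size([8, 2048]) with the defaults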

Minimal fix:

training_args = TrainingArguments(
    "test-trainer",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # keep the "effective batch" if you want
    fp16=True,  # or bf16=True if supported
)

model.config.use_cache = False # saves memory during training
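
If you're not sure whether your card supports bf16, a quick check (just a sketch):

import torch
print(torch.cuda.is_bf16_supported())  # True -> prefer bf16=True, otherwise stick with fp16=True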

And don’t pre-pad everything to 2048 in the dataset. Let the collator pad dynamically:

tokenized_dataset = ds.map(
    lambda ex: tokenizer(ex["code"], truncation=True, max_length=512),  # pick a real max
    batched=True,
    remove_columns=["code"],
)
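
To sanity-check that the collator really pads per batch instead of to 2048, you can run it on a couple of tokenized samples. Sketch, reusing the data_collator and tokenized_dataset names from your script:

# Sketch: the collator should pad each batch only to its longest sequence.
samples = [tokenized_dataset["train"][i] for i in range(2)]
batch = data_collator(samples)
print(batch["input_ids"].shape)  # second dim = length of the longest of the two samples, not 2048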

Why the error reports odd numbers (like 20 GiB "allocated"): PyTorch uses a caching allocator, so the allocated/reserved figures can look strange when memory fragments or when chunks stay cached. The important part is that your run has 0 bytes free, so something (batch size plus padding to 2048) is simply too big.
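
If you want to see what the allocator is actually holding, torch exposes a few counters, and the env var from the error message has to be set before CUDA is initialised. Rough sketch; the 128 MiB split size is just an example value:

import os
# Must be set before torch initialises CUDA, e.g. at the very top of the script.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved")
print(torch.cuda.memory_summary())  # detailed breakdown, including fragmentation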
Hope this helps.
RFTSystems, Liam :nerd_face:
