GPT-oss-20b: torch.OutOfMemoryError: CUDA out of memory

Hi,

I am trying to perform a distributed training run of gpt-oss-20b on 8x A100s (40GB each); however, I am running into memory issues when loading the model with the code below. I am aware that for GPT-OSS the MXFP4 kernels are only supported on Hopper-generation GPUs and newer; however, even when dequantizing the model to float16/bfloat16 I should still be well within the required memory, since with stage-3 sharding I have 8 x 40GB = 320 GB in total.

Question: Do you have any suggestions as to why this might be happening? Am I loading the model correctly?

Debugging Steps

  • I have checked all devices are visible.
  • Tried manually setting device_map={'': local_rank}
  • Inference on the model works when quantization_config = Mxfp4Config(dequantize=False)
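
For reference, a minimal sketch of the inference-only load that worked (the generation call is omitted; exact settings are an assumption):

from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

model_id = "openai/gpt-oss-20b"

# Keep the MXFP4 weights as-is (no dequantization) and let Accelerate place them.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=Mxfp4Config(dequantize=False),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)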

Dependencies

torch: 2.10.0
transformers: 4.57.3
trl: 0.24.0

Accelerate Config

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
debug: false

Shell Script

module purge
module load GCCcore/9.3.0 CUDA/12.2.0 cuDNN/8.9.2.26-CUDA-12.2.0

# Ensuring all GPUs on the node are visible.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Path to accelerate config file.
CONFIG="accelerate_config.yaml"

# Launching the script with accelerate
accelerate launch --config_file ${CONFIG} script.py

Python Code

from transformers import AutoTokenizer, AutoModelForCausalLM, Mxfp4Config
import torch

# Setting up the model configuration and telling HF to dequantize the model.
quantization_config = Mxfp4Config(dequantize=True)
model_kwargs = dict(
    attn_implementation="eager",
    dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=True,
    device_map="auto",  # {'':local_rank}
    max_memory={i: "40GiB" for i in range(8)},
    offload_folder="offload"
)

# Loading the model and tokeniser into memory.
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

Error Trace


ZeRO-3 and device_map are not compatible…?


Why you can OOM at load time despite having 8×40GB

You are mixing two different distribution mechanisms:

  1. device_map="auto" / max_memory / offload_folder
    This triggers Accelerate Big Model Inference style inference-time dispatch: it “fills GPU(s) first, then CPU, then disk”. (Hugging Face)
    This is not DeepSpeed ZeRO sharding.

  2. DeepSpeed ZeRO-3 (stage-3 sharding)
    ZeRO-3 shards parameters/optimizer states across ranks, but it only works if the model is constructed/loaded under the ZeRO-3 initialization path (e.g., deepspeed.zero.Init or HfDeepSpeedConfig + from_pretrained), not via device_map.

In an accelerate launch --num_processes 8 run, each of the 8 processes executes your top-level Python code. With device_map="auto", each process will try to use all visible GPUs to dispatch the model, which can lead to “multiple copies worth” of allocations across the node (or heavy temporary allocations during dequantization), and you OOM before ZeRO-3 ever has a chance to shard things.

This is consistent with multiple upstream warnings/issues:

  • ZeRO-3 is incompatible with device_map and low_cpu_mem_usage in the Transformers loading path. (GitHub)
  • You can’t train a model loaded with device_map='auto' in distributed mode (Accelerate/Transformers explicitly error on this in many setups). (GitHub)

Even if your run doesn’t hit those exact ValueErrors (because you OOM first), the underlying incompatibility remains.
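
One quick way to see the duplication (a hedged diagnostic sketch, not part of the fix): print what each launched process sees before loading anything. LOCAL_RANK/RANK are set by the accelerate/torchrun launcher.

import os
import torch

# Each of the 8 launched processes runs this; with CUDA_VISIBLE_DEVICES=0-7,
# every process reports all 8 GPUs, and device_map="auto" would then dispatch
# the model across all of them independently in each process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(
    f"rank={os.environ.get('RANK', '?')} "
    f"local_rank={local_rank} "
    f"visible_gpus={torch.cuda.device_count()}"
)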


Why {"": local_rank} still OOMs on a single A100 40GB

Once you set Mxfp4Config(dequantize=True), you are effectively asking to materialize BF16/FP16 weights. A 20B-parameter model at BF16 is ~40GB just for parameters (20e9 × 2 bytes ≈ 40GB), before accounting for:

  • embeddings/head tied weights handling
  • layernorm/buffers
  • temporary tensors during weight loading/dequantization
  • fragmentation / allocator reserves

There is a very similar report from an A100 40GB user: they get an OOM while loading because the model already consumes ~37GB and then fails on an extra ~2GB allocation. (Hugging Face)

So: mapping the whole dequantized model onto one 40GB GPU is expected to be right on the edge (and often fails).
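
The back-of-envelope arithmetic, spelled out:

# Rough BF16 memory estimate for a ~20B-parameter model (weights only).
n_params = 20e9        # ~20 billion parameters
bytes_per_param = 2    # bfloat16 / float16 = 2 bytes each
weights_gb = n_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~40 GB, i.e. the whole A100-40GB
# Buffers, dequantization temporaries, and allocator overhead come on top of this.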


The core fix: don’t use device_map for ZeRO-3 training

What to remove from your from_pretrained call

For DeepSpeed ZeRO-3 training, remove:

  • device_map="auto"
  • max_memory=...
  • offload_folder=... (this is for Big Model Inference CPU/disk offload, not ZeRO offload)

Also set:

  • use_cache=False (cache is for generation; for training it’s wasted memory and often disabled in examples)

Correct loading patterns for ZeRO-3 sharded training

Option A (recommended): let Trainer/TRL + DeepSpeed handle initialization

If you’re using TRL/Trainer, pass a DeepSpeed config into the training arguments and load the model without device_map. The OpenAI cookbook’s fine-tuning article is single-H100 oriented (80GB) (OpenAI Cookbook), but the principle is the same: you need ZeRO-3 to own placement, not device_map.

Key idea: the distributed engine must be active during/around model init (or you’ll load full weights per process).
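
A minimal sketch of that ordering with the plain Trainer-style integration (the ds_zero3.json path is an assumption; use your own ZeRO-3 config): construct the training arguments that carry the DeepSpeed config first, then call from_pretrained.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Mxfp4Config

model_id = "openai/gpt-oss-20b"

# 1) Create TrainingArguments (carrying the DeepSpeed ZeRO-3 config) BEFORE the model,
#    so the ZeRO-3-aware initialization path is active during from_pretrained.
training_args = TrainingArguments(
    output_dir="out",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    deepspeed="ds_zero3.json",  # assumption: your ZeRO-3 JSON config
)

# 2) Load the model with no device_map / max_memory / offload_folder.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=Mxfp4Config(dequantize=True),
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# 3) Hand model/tokenizer to Trainer (or a TRL trainer) together with training_args.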

Option B (robust for “non-Trainer” setups): HfDeepSpeedConfig before from_pretrained

Transformers documents a “non-Trainer integration” where HfDeepSpeedConfig enables ZeRO-3 partitioning behavior during from_pretrained(). Critically, it must be instantiated before loading the model. (Hugging Face)

Minimal sketch (conceptual; adapt to your actual training loop):

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config
from transformers.integrations import HfDeepSpeedConfig

model_id = "openai/gpt-oss-20b"

# Load your DS ZeRO-3 config (JSON/dict) matching stage-3 + offload settings
with open("ds_zero3.json") as f:
    ds_config = json.load(f)

# Must be created BEFORE from_pretrained, and kept alive
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=Mxfp4Config(dequantize=True),
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

This avoids the device_map path entirely and uses the ZeRO-3-aware initialization hook described in the docs. (Hugging Face)

Option C (manual init): deepspeed.zero.Init(...)

Accelerate also shows that if automatic integration isn’t in play, you can explicitly use deepspeed.zero.Init to ensure the model is initialized under ZeRO-3 rules. (Hugging Face)
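
A conceptual sketch of that (note: from_config initializes random weights under ZeRO-3 partitioning; loading pretrained weights normally still goes through the HfDeepSpeedConfig/Trainer integration above):

import json

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

with open("ds_zero3.json") as f:   # assumption: your ZeRO stage-3 config
    ds_config = json.load(f)

config = AutoConfig.from_pretrained("openai/gpt-oss-20b")

# Parameters created inside this context are partitioned across ranks as they are
# created, instead of being materialized in full on every process.
with deepspeed.zero.Init(config_dict_or_path=ds_config, dtype=torch.bfloat16):
    model = AutoModelForCausalLM.from_config(config)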


Notes specific to MXFP4 and A100

  • Transformers will try to use MXFP4 Triton kernels only if available and supported; otherwise it falls back. (Hugging Face)
  • The gpt-oss model discussions include reports where A100 ends up dequantizing/falling back, and load-time memory becomes the limiter. (Hugging Face)

Also, there was a recent Transformers bug report about device_map="auto" failing to load dequantized gpt-oss on GPU+CPU offload (closed, but relevant if you keep experimenting with device_map). (GitHub)

Given you’re training with ZeRO-3 anyway, the clean solution is to stop using device_map in the training job.


What I think is happening in your exact script

  1. You launch 8 processes.
  2. Each process runs from_pretrained(...).
  3. Because you set device_map="auto" (+ max_memory), you’re in the Big Model Inference dispatch path (GPU→CPU→disk). (Hugging Face)
  4. You also request dequantization to BF16, which creates large allocations and temporary buffers.
  5. Before ZeRO-3 sharding is applied, one or more processes allocate enough on one GPU to push it over 40GB → torch.OutOfMemoryError.

This matches the A100-40GB OOM pattern reported by others when the model becomes effectively BF16-sized on a single device. (Hugging Face)


Similar cases + high-signal references

Device-map vs distributed training incompatibilities

  • Transformers issue: can’t train with device_map='auto' in distributed mode. (GitHub)
  • Accelerate issue: ZeRO-3 incompatible with device_map / low_cpu_mem_usage. (GitHub)
  • PEFT issue discussion explaining why: device_map/low_cpu_mem_usage implies a naive model-parallel style, while ZeRO-3 is sharded DP. (GitHub)

gpt-oss + A100 memory behavior

  • HF model discussion with A100-40GB load OOM. (Hugging Face)

Official docs you’ll actually use for the fix

  • Accelerate docs: Big Model Inference (device_map="auto") is an inference feature and how it dispatches memory. (Hugging Face)
  • Transformers docs: HfDeepSpeedConfig must be instantiated before loading to deploy ZeRO-3 efficiently. (Hugging Face)
  • PEFT/Accelerate DeepSpeed guide: explains zero3_init_flag / deepspeed.zero.Init. (Hugging Face)
  • Transformers quantization docs: MXFP4 kernels behavior. (Hugging Face)

Minimal actionable change for your code

Replace your model kwargs with something like:

from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config
import torch

model_id = "openai/gpt-oss-20b"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=Mxfp4Config(dequantize=True),
    use_cache=False,
    # no device_map, no max_memory, no offload_folder
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

…and ensure ZeRO-3 init is truly active during load (Trainer/TRL DS integration, or HfDeepSpeedConfig, or explicit deepspeed.zero.Init). (Hugging Face)

If you apply only one principle: for ZeRO-3 training, do not use device_map. (GitHub)


Thanks so much @John6666 for your help! This was a very comprehensive answer, and I learned a lot. I will give this a try and get back to you if it does not resolve my underlying issue. Many thanks again!


Hey @John6666,

Do you have an example of “Option A (recommended): let Trainer/TRL + DeepSpeed handle initialization”? I had a read of the linked article, but it doesn’t seem to show letting TRL handle the distributed training. I found this article, which was the closest to your suggestion. I am unsure how to configure the training arguments (including Mxfp4Config(dequantize=True)) as well as how to set up the (GRPO)Trainer.

I had a go at removing the device map with the following code (using same run script with accelerate):

from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config
import torch

model_id = "openai/gpt-oss-20b"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=Mxfp4Config(dequantize=True),
    use_cache=False,
    # no device_map, no max_memory, no offload_folder
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

However, that still gives an OOM error, and I suspect it is still not distributing properly.

I also tried the following code (using same run script with accelerate):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Mxfp4Config
from trl import GRPOTrainer

# Training arguments
training_args = TrainingArguments(
    #torch_dtype=torch.bfloat16,
    #quantization_config=Mxfp4Config(dequantize=True),
    per_device_train_batch_size=1,
    max_steps=0,
)

# Placeholder reward function.
def dummy_reward_fn(*args, **kwargs):
    return 0.0

# Initialize trainer (using the TRL library)
trainer = GRPOTrainer(
    model="openai/gpt-oss-20b",
    args=training_args,
    train_dataset=None,
    reward_funcs=dummy_reward_fn,
)

trainer.train()

However, this does not seem to work, and I am unsure how you would go about defining the dtype, quantisation configuration, PEFT, etc.


Hmm…


Why your “no device_map” snippet still OOMs (and why it’s not “distributing” yet)

  1. accelerate launch runs your top-level code on every rank.
    With num_processes: 8, your script is executed 8 times (once per GPU/process). If each process executes AutoModelForCausalLM.from_pretrained(...) in normal mode, then each rank can end up materializing the full BF16 model during load/dequantization before any ZeRO-3 partitioning is active.

  2. openai/gpt-oss-20b dequantized to BF16 is right on the edge of a 40GB GPU even for single-device load.
    There are A100-40GB reports where the model load reaches ~39GB and then fails on an extra ~2GB allocation during the MXFP4 conversion/dequantization path. (Hugging Face)

  3. The key gotcha: Accelerate’s zero3_init_flag: true alone historically did not guarantee that from_pretrained() runs under ZeRO-3 zero.Init, unless you also pass a DeepSpeed config through the Transformers/Trainer integration. (GitHub)
    If ZeRO-3 init isn’t active at load time, you can OOM before sharding ever happens.

So your suspicion (“still not distributing properly”) is consistent with known behavior: the model isn’t being initialized under the ZeRO-3-aware path at load time.
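
If you want to confirm this at runtime, here is a small hedged check (an assumption about DeepSpeed internals: parameters it has partitioned carry extra attributes such as ds_id; if none are present, the load happened outside the ZeRO-3 path):

def looks_zero3_partitioned(model) -> bool:
    # Parameters touched by DeepSpeed's zero.Init carry extra ds_* attributes.
    return any(hasattr(p, "ds_id") for p in model.parameters())

# Call this right after from_pretrained on each rank; False everywhere means the
# full model was materialized per process and sharding never happened at load time.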


Option A (recommended): Let TRL’s GRPOTrainer + DeepSpeed own init/sharding

The goal is:

  • Do not use device_map, max_memory, offload_folder (those are Big Model Inference dispatch knobs, not ZeRO-3 training sharding).
  • Provide a DeepSpeed ZeRO-3 config via Trainer/TRL args so Transformers activates the correct ZeRO-3 init path.
  • Ensure GRPO doesn’t OOM during the generation phase by disabling “gather weights for generation” (see below).

1) Create a DeepSpeed config JSON (e.g., ds_z3_bf16.json)

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "offload_param": { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  },
  "aio": { "block_size": 1048576, "queue_depth": 16, "single_submit": false, "overlap_events": true }
}

(Adjust batch/accumulation later; this is a “make it load and step” baseline.)

2) Minimal GRPOTrainer example that passes Mxfp4Config(dequantize=True) without device_map

This example is intentionally small: it demonstrates correct initialization + sharding. Replace the dataset/reward with your real setup.

import torch
from datasets import Dataset
from transformers import AutoTokenizer, Mxfp4Config
from trl import GRPOConfig, GRPOTrainer

model_id = "openai/gpt-oss-20b"

# Tiny placeholder dataset: GRPO expects prompts.
train_dataset = Dataset.from_dict({
    "prompt": [
        "Solve: 2+2=",
        "Solve: 10-3=",
        "What is the capital of Australia?",
        "Return only 'OK'."
    ]
})

def simple_reward(prompts, completions, **kwargs):
    # Dummy reward: prefer shorter outputs (replace with real reward model / function)
    return [-len(c) for c in completions]

tokenizer = AutoTokenizer.from_pretrained(model_id)

training_args = GRPOConfig(
    output_dir="grpo_z3_smoketest",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,

    # IMPORTANT: pass DeepSpeed here (Trainer/Transformers integration)
    deepspeed="ds_z3_bf16.json",

    # IMPORTANT for 40GB cards: avoid gathering full weights onto one GPU for generation
    ds3_gather_for_generation=False,

    # GRPO detail: effective batch size must be divisible by num_generations
    # With 8 GPUs * 1 * 1 = 8, keep num_generations=8 to satisfy the constraint.
    num_generations=8,

    # Keep memory sane
    gradient_checkpointing=True
)

# Let GRPOTrainer load the model internally so it happens after args are constructed
# (and so DS/ZeRO-3 init hooks can be active at load time).
training_args.model_init_kwargs = {
    "torch_dtype": torch.bfloat16,
    "quantization_config": Mxfp4Config(dequantize=True),
    "use_cache": False,
    # Consider "sdpa" or "flash_attention_2" if available; "eager" is OK but not best for memory.
    "attn_implementation": "sdpa",
}

trainer = GRPOTrainer(
    model=model_id,                 # <- pass string ID (Trainer loads it)
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    reward_funcs=simple_reward,
)

trainer.train()

3) Launch

Prefer launching with a simple Accelerate config (MULTI_GPU), because the DeepSpeed config is already supplied via GRPOConfig.deepspeed.

Example:

ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info \
accelerate launch --num_processes 8 train_grpo.py

What you want to see in logs: an indication that ZeRO-3 init is being activated for the model (the absence of this is a classic reason for “OOM during from_pretrained”). (GitHub)


Two GRPO-specific memory pitfalls that matter on 8×A100-40GB

Pitfall 1: ZeRO-3 “gather for generation” can OOM even if training fits

Online RL methods (including GRPO) do generation inside the training loop. With ZeRO-3, some stacks temporarily gather full weights on one GPU to generate faster—this can OOM for models that don’t fit on a single GPU. TRL documents this and explicitly recommends disabling it: (Hugging Face)

  • Set: ds3_gather_for_generation=False (as in the example).

Caveat: there is at least one historical TRL issue where GRPO + ZeRO-3 + ds3_gather_for_generation=False could stall. If you hit “step stuck at 0”, check this thread and your TRL version. (GitHub)

Pitfall 2: Reference model doubling memory (less likely for you, but confirm)

In TRL GRPO, the default beta=0.0 means “don’t load a reference model”, saving a lot of memory. Verify you didn’t set beta>0 unless you need KL regularization. (Hugging Face)


Why I would avoid device_map="auto" for your training job

  • There are active/very recent reports that device_map="auto" + dequantized gpt-oss behaves poorly on A100-class GPUs and/or offload scenarios. (GitHub)
  • Even when it “works”, device_map is a different distribution mechanism (inference-time dispatch), and it’s not what you want for ZeRO-3 training sharding.

If you still OOM at load time: pre-dequantize once on CPU, then train from BF16 weights

If MXFP4→BF16 conversion is producing big transient GPU allocations, a robust workaround is:

  1. Dequantize on CPU and save a BF16 checkpoint locally (no CUDA involved).
    A community recipe for this “load on CPU, save locally” approach exists (shown for smaller GPUs, but the principle applies). (Hugging Face)

  2. Train from that saved BF16 checkpoint without Mxfp4Config at runtime.

This trades disk space + one-time CPU time for much more predictable GPU memory behavior.
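
A one-off conversion script along those lines might look like this (a sketch; the output path is a placeholder, and the CPU node needs roughly the BF16 model size in RAM):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

model_id = "openai/gpt-oss-20b"
out_dir = "gpt-oss-20b-bf16"  # placeholder local path

# Load entirely on CPU and let Transformers dequantize MXFP4 -> BF16 (no CUDA involved).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=Mxfp4Config(dequantize=True),
    device_map="cpu",
)
model.save_pretrained(out_dir, safe_serialization=True)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)

# The training job then points from_pretrained at out_dir, with no Mxfp4Config.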


High-signal references for this exact problem cluster

  • Accelerate: ZeRO-3 init not activating from zero3_init_flag alone (historical) (GitHub)
  • Transformers: common “must create TrainingArguments / DS config before loading model” failure mode (GitHub)
  • TRL: disabling ZeRO-3 gather-for-generation to prevent OOM in online methods (Hugging Face)
  • A100-40GB reports of load-time OOM for dequantized gpt-oss (Hugging Face)
  • Very recent Transformers issue around device_map="auto" + dequantized gpt-oss (GitHub)
  • OpenAI cookbook fine-tune example assumes single H100 80GB (important context) (cookbook.openai.com)

Thanks for sharing. It helps me a lot.

