Cosmos-Reason2-2B-NVFP4

NVFP4-quantized version of nvidia/Cosmos-Reason2-2B, produced by vrfai with llm-compressor.

License: This model inherits the NVIDIA Open Model License from the base model. Commercial use and derivative models are permitted under its terms.

NVFP4 Quantization Details


Base model	nvidia/Cosmos-Reason2-2B
Quantization	NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8
Format	`compressed-tensors` (native vLLM support)
Tool	vllm-project/llm-compressor
Model size	4.6 GB → 2.7 GB (~41% reduction)
Requires	NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19
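
To make the "weights FP4, scales FP8" row more concrete, here is a minimal pure-Python sketch of block-wise FP4 rounding. It is an illustration only, not llm-compressor's implementation: the function names are hypothetical, the block scale is kept as an ordinary float instead of being rounded to FP8 (E4M3), and the per-tensor scale used by NVFP4 is omitted.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def fake_quantize_block(block):
    """Round one block to the E2M1 grid and return the dequantized values.

    Simplification: the scale stays a Python float here, whereas NVFP4
    stores it in FP8 (E4M3).
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_VALUES[-1]  # map the largest magnitude to 6.0
    out = []
    for x in block:
        # Pick the nearest representable magnitude, then restore the sign.
        q = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(q * scale, x))
    return out


def effective_bits_per_weight(weight_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per weight once the shared block scale is amortized."""
    return weight_bits + scale_bits / block_size


block = [0.1 * i for i in range(-8, 8)]  # 16 example values
print(fake_quantize_block(block))
print(effective_bits_per_weight())  # 4.5
```

The quantized linear layers therefore cost about 4.5 bits per weight instead of 16. The checkpoint as a whole shrinks less than that ratio suggests (4.6 GB to 2.7 GB) because the vision encoder and `lm_head` stay in BF16.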

What's Quantized / What's Not

Unlike hybrid-attention models (e.g. Qwen3.6), Cosmos-Reason2-2B uses a standard transformer backbone — all language model linear layers are quantized. Only the visual components and output head are preserved in BF16:

Component	Precision	Reason
All LLM layers — FFN + attention projections (28 layers)	NVFP4	Standard transformer, stable under 4-bit
Vision encoder — all 24 blocks + merger	BF16	Preserved for visual perception quality
DeepStack merger list (3×)	BF16	Multi-scale visual fusion, sensitive to precision
`lm_head`	BF16	Output logits preserved for generation stability

Quantization Config (llm-compressor)

# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 24 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
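
To see which modules the `ignore` list keeps in BF16, the sketch below re-implements the matching in plain Python. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that all other entries must match exactly. The function `is_ignored` and the example module names are hypothetical, so check them against the real checkpoint before relying on the result.

```python
import re

# Ignore entries copied from the recipe above.
IGNORE = [
    "lm_head",
    r"re:model\.visual\.blocks\.\d+\..*",
    "model.visual.merger.linear_fc1",
    "model.visual.merger.linear_fc2",
    r"re:model\.visual\.deepstack_merger_list\.\d+\..*",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name matches any ignore entry.

    Assumption: "re:" entries are regexes anchored at the start of the
    name; every other entry is compared for exact equality.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, used only to exercise the matcher.
for name in (
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate_proj",
):
    print(name, "-> BF16" if is_ignored(name) else "-> NVFP4")
```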

Quick Start (vLLM)

vllm serve vrfai/Cosmos-Reason2-2B-NVFP4 \
  --max-model-len 8192

The model fits comfortably on a single RTX 5090 (32 GB). No `--tensor-parallel-size` flag is needed.
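
Before starting the server, it can help to check the requirements listed above (SM 120+ and vLLM ≥ 0.19). The following is a rough preflight sketch; `parse_version` and `meets_requirements` are hypothetical helpers, and the version parsing is deliberately simple (it ignores pre-release and local suffixes).

```python
def parse_version(text):
    """Turn '0.19.1' into (0, 19, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_requirements(compute_capability, vllm_version,
                       min_capability=(12, 0), min_vllm=(0, 19)):
    """Check a (major, minor) compute capability and a vLLM version string."""
    return (tuple(compute_capability) >= min_capability
            and parse_version(vllm_version) >= min_vllm)


print(meets_requirements((12, 0), "0.19.1"))  # True
print(meets_requirements((9, 0), "0.19.1"))   # False
```

On a real machine the inputs could come from `torch.cuda.get_device_capability()` and `importlib.metadata.version("vllm")`.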

Python (Transformers)

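from transformers import AutoModelForImageTextToText, AutoProcessor

model_name = "vrfai/Cosmos-Reason2-2B-NVFP4"
# The processor bundles the tokenizer with the image/video preprocessor.
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# Qwen3VLForConditionalGeneration is a vision-language architecture, so it is
# loaded through the image-text-to-text auto class, not AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)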

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-2B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."}
            ]
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
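
The request above points at a remote image URL. For a local file, one common approach with OpenAI-style vision requests is to embed the image as a base64 data URL. The sketch below assumes the server accepts data URLs in `image_url`; both helper functions and the file name `scene.jpg` are hypothetical.

```python
import base64


def image_bytes_to_data_url(data, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(image_url, prompt):
    """Build one user message in the same shape as the request above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ],
    }


# Usage with a hypothetical local file:
#   with open("scene.jpg", "rb") as f:
#       url = image_bytes_to_data_url(f.read())
#   messages = [build_image_message(url, "Describe the physical interaction in this scene.")]
```

The resulting `messages` list can be passed to `client.chat.completions.create` in place of the inline one.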

Tested Environment

Component	Version
vLLM	0.19.1
Transformers	5.6.0
PyTorch	2.10.0+cu128
CUDA	12.8 (nvcc 12.8.61)
llm-compressor	compressed-tensors 0.14.0.1
GPU	1× NVIDIA RTX 5090

Model Overview

Cosmos-Reason2-2B is a vision-language model developed by NVIDIA for Physical AI reasoning — understanding physical common sense and embodied interactions from video and image inputs. It is designed for use as a planner or reasoning backbone in robotics and Vision-Language-Action (VLA) pipelines.


Architecture	`Qwen3VLForConditionalGeneration`
Parameters	~2B
Hidden size	2048
Layers	28 (standard GQA transformer)
Attention heads	16 Q / 8 KV
Vision encoder depth	24 blocks (DeepStack-enhanced)
Context length	262,144 tokens
Input modalities	Text, image, video
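
The figures in this table allow a rough estimate of the KV cache size per sequence, which helps explain why the Quick Start caps `--max-model-len` at 8192 rather than using the full context length. This is a back-of-the-envelope sketch: `head_dim=128` is an assumption (hidden size 2048 divided by 16 query heads), a BF16 cache is assumed, and weights, activations, and serving overhead are not counted.

```python
def kv_cache_bytes(num_tokens, num_layers=28, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    head_dim=128 is an assumption (2048 hidden / 16 query heads);
    bytes_per_value=2 assumes a BF16 KV cache.
    """
    # Factor 2 covers one key and one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 1024 ** 3
print(f"8,192 tokens:   {kv_cache_bytes(8192) / GIB:.3f} GiB")    # 0.875 GiB
print(f"262,144 tokens: {kv_cache_bytes(262144) / GIB:.3f} GiB")  # 28.000 GiB
```

Under these assumptions a single full-length sequence would need about 28 GiB of cache, close to the entire memory of a 32 GB card, while an 8192-token sequence needs under 1 GiB.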

Quality Benchmarks

For benchmark results, see the Physical AI Bench Leaderboard and the base model card.

Ethical Considerations & Safety

This section is reproduced from the base model card and applies equally to this quantized derivative.

This model is intended for Physical AI developers working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

Safety note: Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns here.

Credits

Original model: NVIDIA — Cosmos-Reason2-2B
NVFP4 quantization: vrfai
Quantization framework: vllm-project/llm-compressor
