---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---

This is an [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.4.0 [FP8 Dynamic](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8) quant.

You can refer to the [CPU offloading example](https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate), but for quantizing on an H100 node we used the following setup to avoid OOM errors:

```
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build the model skeleton on the meta device so no weights are allocated yet.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the eight H100s at 60 GiB and let the remainder spill to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Never split a decoder layer across devices.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```

Original model here: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
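
For context, the device map above feeds into the actual quantization run. The following is a minimal sketch continuing from the snippet above and following the upstream FP8 Dynamic example, not our exact script; the output directory name is an assumption, and depending on your llmcompressor version the import may instead be `from llmcompressor.transformers import oneshot`:

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the real weights using the device_map computed in the snippet above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 Dynamic quantizes Linear weights to FP8 and uses dynamic per-token
# activation scales, so no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"  # assumed output name
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting checkpoint loads directly in vLLM, e.g. `vllm serve <model> --tensor-parallel-size 8` on the same 8-GPU node.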