---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---

This is an [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.4.0 [FP8 Dynamic](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8) quant.

You can refer to the [CPU offloading example](https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate), but for quantizing on an H100 node we used the following setup to avoid OOM errors:

```
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build the model skeleton on the meta device so no weights are allocated yet.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the eight H100s at 60 GiB and let the remainder spill to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Never split a decoder layer across devices.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```

Original model here: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
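
For context, the device map above feeds into the actual quantization run. The following is a minimal sketch continuing from the snippet above and following the upstream FP8 Dynamic example, not our exact script; the output directory name is an assumption, and depending on your llmcompressor version the import may instead be `from llmcompressor.transformers import oneshot`:

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the real weights using the device_map computed in the snippet above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 Dynamic quantizes Linear weights to FP8 and uses dynamic per-token
# activation scales, so no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"  # assumed output name
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting checkpoint loads directly in vLLM, e.g. `vllm serve <model> --tensor-parallel-size 8` on the same 8-GPU node.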