Phi-4-mini-reasoning (BitsAndBytes 4-bit NF4 Quantized)

This repository contains a 4-bit quantized version of microsoft/Phi-4-mini-reasoning, produced with BitsAndBytes via Hugging Face Transformers.
Quantization reduces VRAM usage while preserving most of the model's reasoning capabilities.
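
As a rough back-of-the-envelope estimate (assuming the ~4B-parameter size listed under Technical Specifications, and ignoring activations, the KV cache, and modules kept in higher precision), 4-bit weights occupy roughly a quarter of the bf16 footprint:

# Approximate weight memory only; real usage adds activations and KV cache.
params = 4e9                      # ~4B parameters (approximate)
bf16_gb = params * 2 / 1024**3    # 2 bytes per bf16 weight
nf4_gb = params * 0.5 / 1024**3   # ~0.5 bytes per packed 4-bit weight
print(f"bf16: ~{bf16_gb:.1f} GB   NF4: ~{nf4_gb:.1f} GB")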


Model Details

Model Description

  • Developed by (base model): Microsoft
  • Shared by (quantized version): KavinduHansaka
  • Model type: Causal Language Model (decoder-only transformer)
  • Context length: 128K
  • Language(s): English
  • License: MIT (inherited from base model)
  • Finetuned from: microsoft/Phi-4-mini-reasoning

Model Sources

  • Base model: https://huggingface.co/microsoft/Phi-4-mini-reasoning

Uses

Direct Use

  • Text and reasoning generation
  • Educational and research experiments
  • Running inference on lower-VRAM GPUs

Downstream Use

  • Can be fine-tuned further for domain-specific reasoning tasks (see the QLoRA sketch after this list)
  • Integrated into chatbots, assistants, and research pipelines
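
Because the weights are already stored in 4-bit, parameter-efficient fine-tuning in the QLoRA style is the natural route. A minimal sketch, assuming the peft library is installed; the LoRA rank, the "all-linear" target-module shortcut, and the training step are placeholders to adapt:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=bnb_config
)

# Casts norms/embeddings for stable k-bit training, enables gradient checkpointing.
model = prepare_model_for_kbit_training(model)

# LoRA adapters train in higher precision on top of the frozen 4-bit base.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear",  # assumption: adjust to the model's actual module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# ...then train with transformers.Trainer or trl's SFTTrainer as usual.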

Out-of-Scope Use

  • Do not use for generating harmful, biased, or unsafe content
  • Not recommended for high-accuracy production systems without further testing

Bias, Risks, and Limitations

  • As with the base model, it may produce biased or incorrect content.
  • Quantization reduces numerical precision, which can slightly degrade reasoning quality (a quick perplexity spot-check is sketched after this list).
  • Long-context reasoning (128k tokens) may still be resource-intensive.
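
A lightweight way to gauge quantization loss is to compare perplexity on a held-out snippet against the original microsoft/Phi-4-mini-reasoning. A minimal sketch, assuming `tokenizer` and `model` are loaded as in the How to Get Started section; the sample text is a placeholder:

import math
import torch

text = "The mitochondrion is the site of cellular respiration. " * 20  # placeholder
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # shifted cross-entropy
print(f"perplexity: {math.exp(loss.item()):.2f}")
# Run the same computation on the bf16 base model and compare the two values.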

Recommendations

  • Apply appropriate safety filters before deploying in production.
  • Be aware that outputs are not guaranteed to be factually correct.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"

# 4-bit NF4 weights with double quantization; matmuls run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # place layers across available GPUs/CPU automatically
    dtype=torch.bfloat16,    # dtype for the modules left unquantized (e.g. norms)
    quantization_config=bnb_config
)

inputs = tokenizer("Explain why the sky is blue in simple terms.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
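
Phi-4-mini-reasoning is tuned as a chat/reasoning model, so prompting through the tokenizer's chat template generally works better than raw text. A minimal sketch reusing `tokenizer` and `model` from above; the question is a placeholder:

messages = [
    {"role": "user", "content": "Solve 2x + 5 = 11 and explain each step."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))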

Training Details

This model inherits its training data and weights from microsoft/Phi-4-mini-reasoning; no additional fine-tuning was performed, only post-training quantization.

  • Quantization method: BitsAndBytes 4-bit (NF4, double quantization)
  • Precision: bfloat16 compute
  • Original precision: fp16
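
To confirm how the checkpoint was quantized after loading, you can count the bitsandbytes 4-bit layers. A minimal sketch, assuming `model` is loaded as in the How to Get Started section:

import bitsandbytes as bnb

# Linear layers stored as 4-bit NF4 vs. modules left in higher precision.
quantized = [n for n, m in model.named_modules() if isinstance(m, bnb.nn.Linear4bit)]
print(f"{len(quantized)} Linear4bit layers, e.g. {quantized[:3]}")
print(model.config.quantization_config)  # records nf4 + double quantization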

Technical Specifications

  • Architecture: Decoder-only transformer
  • Parameters: ~4B (same as Phi-4-mini-reasoning)
  • Quantization: 4-bit NF4 with double quantization
  • Stored tensor types: F32, BF16, U8 (4-bit weights packed into uint8)

Citation

If you use this quantized model, please also cite the original Microsoft release:

@misc{microsoft2025phi4mini,
  title={Phi-4-mini-reasoning},
  author={Microsoft},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/microsoft/Phi-4-mini-reasoning}}
}

Model Card Authors

  • Quantized version shared by KavinduHansaka
  • Base model by Microsoft

Model Card Contact

For questions or issues with this quantized release, open a discussion on the KavinduHansaka/phi4-mini-bnb-4bit repository on Hugging Face.