Phi-4-mini-reasoning (BitsAndBytes 4-bit NF4 Quantized)

This repository contains a 4-bit quantized version of microsoft/Phi-4-mini-reasoning, produced with BitsAndBytes via Hugging Face Transformers.
Quantization reduces VRAM usage while preserving most of the model's reasoning capabilities.
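
As a rough back-of-the-envelope estimate (assuming the ~4B-parameter size listed under Technical Specifications, and ignoring activations, the KV cache, and modules kept in higher precision), 4-bit weights occupy roughly a quarter of the bf16 footprint:

# Approximate weight memory only; real usage adds activations and KV cache.
params = 4e9                      # ~4B parameters (approximate)
bf16_gb = params * 2 / 1024**3    # 2 bytes per bf16 weight
nf4_gb = params * 0.5 / 1024**3   # ~0.5 bytes per packed 4-bit weight
print(f"bf16: ~{bf16_gb:.1f} GB   NF4: ~{nf4_gb:.1f} GB")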


Model Details

Model Description

  • Developed by (base model): Microsoft
  • Shared by (quantized version): KavinduHansaka
  • Model type: Causal Language Model (decoder-only transformer)
  • Context length: 128K
  • Language(s): English
  • License: MIT (inherited from base model)
  • Finetuned from: microsoft/Phi-4-mini-reasoning

Model Sources

  • Base model: https://huggingface.co/microsoft/Phi-4-mini-reasoning

Uses

Direct Use

  • Text and reasoning generation
  • Educational and research experiments
  • Running inference on lower-VRAM GPUs

Downstream Use

  • Can be fine-tuned further for domain-specific reasoning tasks (see the QLoRA sketch after this list)
  • Integrated into chatbots, assistants, and research pipelines
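
Because the weights are already stored in 4-bit, parameter-efficient fine-tuning in the QLoRA style is the natural route. A minimal sketch, assuming the peft library is installed; the LoRA rank, the "all-linear" target-module shortcut, and the training step are placeholders to adapt:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=bnb_config
)

# Casts norms/embeddings for stable k-bit training, enables gradient checkpointing.
model = prepare_model_for_kbit_training(model)

# LoRA adapters train in higher precision on top of the frozen 4-bit base.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear",  # assumption: adjust to the model's actual module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# ...then train with transformers.Trainer or trl's SFTTrainer as usual.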

Out-of-Scope Use

  • Do not use for generating harmful, biased, or unsafe content
  • Not recommended for high-accuracy production systems without further testing

Bias, Risks, and Limitations

  • As with the base model, it may produce biased or incorrect content.
  • Quantization reduces numerical precision, which can slightly degrade reasoning quality (a quick perplexity spot-check is sketched after this list).
  • Long-context reasoning (128k tokens) may still be resource-intensive.
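
A lightweight way to gauge quantization loss is to compare perplexity on a held-out snippet against the original microsoft/Phi-4-mini-reasoning. A minimal sketch, assuming `tokenizer` and `model` are loaded as in the How to Get Started section; the sample text is a placeholder:

import math
import torch

text = "The mitochondrion is the site of cellular respiration. " * 20  # placeholder
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # shifted cross-entropy
print(f"perplexity: {math.exp(loss.item()):.2f}")
# Run the same computation on the bf16 base model and compare the two values.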

Recommendations

  • Apply appropriate safety filters before deploying in production.
  • Be aware that outputs are not guaranteed to be factually correct.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"

# 4-bit NF4 weights with double quantization; matmuls run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # place layers across available GPUs/CPU automatically
    dtype=torch.bfloat16,    # dtype for the modules left unquantized (e.g. norms)
    quantization_config=bnb_config
)

inputs = tokenizer("Explain why the sky is blue in simple terms.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
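
Phi-4-mini-reasoning is tuned as a chat/reasoning model, so prompting through the tokenizer's chat template generally works better than raw text. A minimal sketch reusing `tokenizer` and `model` from above; the question is a placeholder:

messages = [
    {"role": "user", "content": "Solve 2x + 5 = 11 and explain each step."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))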

Training Details

This model inherits its training data and weights from microsoft/Phi-4-mini-reasoning; no additional fine-tuning was performed, only post-training quantization.

  • Quantization method: BitsAndBytes 4-bit (NF4, double quantization)
  • Precision: bfloat16 compute
  • Original precision: fp16
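
To confirm how the checkpoint was quantized after loading, you can count the bitsandbytes 4-bit layers. A minimal sketch, assuming `model` is loaded as in the How to Get Started section:

import bitsandbytes as bnb

# Linear layers stored as 4-bit NF4 vs. modules left in higher precision.
quantized = [n for n, m in model.named_modules() if isinstance(m, bnb.nn.Linear4bit)]
print(f"{len(quantized)} Linear4bit layers, e.g. {quantized[:3]}")
print(model.config.quantization_config)  # records nf4 + double quantization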

Technical Specifications

  • Architecture: Decoder-only transformer
  • Parameters: ~4B (same as Phi-4-mini-reasoning)
  • Quantization: 4-bit NF4 with double quantization
  • Stored tensor types: F32, BF16, U8 (4-bit weights packed into uint8)

Citation

If you use this quantized model, please also cite the original Microsoft release:

@misc{microsoft2025phi4mini,
  title={Phi-4-mini-reasoning},
  author={Microsoft},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/microsoft/Phi-4-mini-reasoning}}
}

Model Card Authors

  • Quantized version shared by KavinduHansaka
  • Base model by Microsoft

Model Card Contact

For questions or issues with this quantized release, open a discussion on the KavinduHansaka/phi4-mini-bnb-4bit repository on Hugging Face.