# Qwen3.5-2B-q0f16-MLC
This is the Qwen3.5-2B model in MLC format q0f16.
Qwen3.5 is a hybrid architecture: 75% of its layers are GatedDeltaNet recurrent linear-attention layers and 25% are standard GQA softmax-attention layers. This requires the kHybrid KVStateKind in MLC-LLM, which manages a PagedKVCache and an RNNState simultaneously.
Compiled with mlc-llm using the hybrid KVStateKind branch.
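To make the 75%/25% split concrete, here is a minimal sketch of how a hybrid stack interleaves the two layer kinds. The function name, layer labels, and the 3:1 repeating pattern are illustrative assumptions for this sketch, not MLC-LLM's actual API:

```python
# Hypothetical sketch (names are illustrative, not MLC-LLM's API):
# a 75%/25% hybrid stack can be laid out as a repeating 4-layer block of
# three GatedDeltaNet layers (state kept in RNNState) followed by one
# GQA softmax-attention layer (state kept in PagedKVCache).

def hybrid_layer_plan(num_layers: int, period: int = 4) -> list[str]:
    """Return the per-layer state kind for a 3:1 hybrid stack."""
    plan = []
    for i in range(num_layers):
        # every `period`-th layer is full attention, the rest are recurrent
        if (i + 1) % period == 0:
            plan.append("attention")        # backed by PagedKVCache
        else:
            plan.append("gated_delta_net")  # backed by RNNState
    return plan

plan = hybrid_layer_plan(8)
# 6 of 8 layers recurrent (75%), 2 of 8 full attention (25%)
```

The kHybrid KVStateKind exists because neither cache alone suffices: the recurrent layers carry a fixed-size state that is updated in place, while the attention layers need a growing paged KV cache.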
## Usage

### Python API
```python
from mlc_llm import MLCEngine

# Create engine
model = "HF://Mitiskuma/Qwen3.5-2B-q0f16-MLC"
engine = MLCEngine(model, device="metal")

# Run chat completion in OpenAI API style, streaming the reply
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```
### Chat CLI

```shell
mlc_llm chat HF://Mitiskuma/Qwen3.5-2B-q0f16-MLC
```
## Model Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-2B |
| Architecture | Qwen3.5 GatedDeltaNet (hybrid recurrent + attention) |
| Quantization | q0f16 |
| KV state kind | hybrid (PagedKVCache + RNNState) |
| Context window | 1024 (compile-time setting) |
| Conversation template | chatml |
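For reference, a model packaged like this is typically produced with the standard MLC-LLM three-step pipeline (convert weights, generate config, compile). The sketch below is an assumption about how this build could be reproduced, not a record of the actual build; the local paths are hypothetical, and hybrid KVStateKind support comes from the branch mentioned above:

```shell
# Hypothetical build sketch using the standard MLC-LLM pipeline.
# Paths are placeholders; quantization, conv template, and context
# window match the table above.

# 1. Convert the weights to q0f16
mlc_llm convert_weight ./Qwen3.5-2B --quantization q0f16 \
    -o ./Qwen3.5-2B-q0f16-MLC

# 2. Generate the chat config
mlc_llm gen_config ./Qwen3.5-2B --quantization q0f16 \
    --conv-template chatml --context-window-size 1024 \
    -o ./Qwen3.5-2B-q0f16-MLC

# 3. Compile the model library for the target device
mlc_llm compile ./Qwen3.5-2B-q0f16-MLC/mlc-chat-config.json \
    --device metal -o ./Qwen3.5-2B-q0f16-metal.so
```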