# Qwen3.5-2B-q0f16-MLC

This is the Qwen3.5-2B model in MLC format q0f16.

Qwen3.5 is a hybrid architecture: 75% of its layers are GatedDeltaNet recurrent linear-attention layers, and 25% are standard GQA softmax-attention layers. Serving it requires the kHybrid KVStateKind in MLC-LLM, which manages a PagedKVCache and an RNNState simultaneously.
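A minimal sketch of what the 3:1 hybrid layout implies for per-layer state. The repeating 3-recurrent-to-1-attention pattern and the 28-layer depth are assumptions for illustration, not MLC-LLM internals; the real layout comes from the model config.

```python
# Sketch only: route each layer of a hybrid stack to its state kind.
# "rnn"  -> GatedDeltaNet linear attention, backed by an RNNState slot.
# "kv"   -> GQA softmax attention, backed by a PagedKVCache slot.
# The 3:1 pattern matches the stated 75%/25% split; 28 layers is assumed.
def layer_state_kinds(num_layers, pattern=("rnn", "rnn", "rnn", "kv")):
    """Return the per-layer state kind for a hybrid layer stack."""
    return [pattern[i % len(pattern)] for i in range(num_layers)]

kinds = layer_state_kinds(28)
print(kinds.count("rnn"), kinds.count("kv"))  # 21 recurrent, 7 attention
```

Under this layout a kHybrid cache only needs KV pages for a quarter of the layers; the rest carry a fixed-size recurrent state.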

Compiled with mlc-llm using the hybrid KVStateKind branch.

## Usage

### Python API

```python
from mlc_llm import MLCEngine

model = "HF://Mitiskuma/Qwen3.5-2B-q0f16-MLC"
engine = MLCEngine(model, device="metal")

# Stream the chat completion token by token.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```

### Chat CLI

```shell
mlc_llm chat HF://Mitiskuma/Qwen3.5-2B-q0f16-MLC
```

## Model Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-2B |
| Architecture | Qwen3.5 GatedDeltaNet (hybrid recurrent + attention) |
| Quantization | q0f16 |
| KV state kind | hybrid (PagedKVCache + RNNState) |
| Context window | 1024 (compile-time setting) |
| Conversation template | chatml |
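Since the 1024-token context window is fixed at compile time, runtime overrides can only work within that limit. As a hedged example (a config fragment, assuming the standard MLC-LLM `--overrides` flag applies to this model), you could shrink the effective window at launch:

```shell
# Assumed invocation: reduce the effective context window below the
# compile-time maximum of 1024 tokens. Raising it above 1024 would
# require recompiling the model.
mlc_llm chat HF://Mitiskuma/Qwen3.5-2B-q0f16-MLC \
    --overrides "context_window_size=512"
```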
