# Qwen3.5 Dense-to-MoE Weight Transfer
Qwen3.5 MoE models built by dual-source weight transfer (dense backbone + 35B-A3B experts), with hybrid DeltaNet + GQA attention.
Built from scratch in the style of Sebastian Raschka's *LLMs-from-Scratch*, with weights then transferred from `kshitijthakkar/qwen3.5-moe-0.87B-d0.8B`.
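The transfer procedure itself is not spelled out here; the core idea is copying every source tensor whose name and shape match the freshly initialized target. A minimal, dependency-free sketch of that idea, with plain dicts and `(shape, values)` tuples standing in for real `state_dict` tensors (all parameter names below are hypothetical, not the model's actual keys):

```python
def transfer_weights(source, target):
    """Copy every source entry whose name and shape match one in target.

    `source` and `target` map parameter names to (shape, values) tuples,
    standing in for real state-dict tensors in this sketch.
    """
    copied, skipped = [], []
    for name, tensor in target.items():
        src = source.get(name)
        if src is not None and src[0] == tensor[0]:  # name exists and shapes agree
            target[name] = src                       # take the source weights
            copied.append(name)
        else:
            skipped.append(name)                     # e.g. MoE-only router/expert params
    return copied, skipped

# Hypothetical dense backbone vs. MoE target: shared layers match, router does not.
dense = {"embed.weight": ((151936, 2048), "dense-embed"),
         "layers.0.attn.q_proj.weight": ((2048, 2048), "dense-q")}
moe = {"embed.weight": ((151936, 2048), "init-embed"),
       "layers.0.attn.q_proj.weight": ((2048, 2048), "init-q"),
       "layers.0.mlp.router.weight": ((8, 2048), "init-router")}
copied, skipped = transfer_weights(dense, moe)
print(copied)   # shared parameters taken from the dense model
print(skipped)  # MoE-only parameters left at their initialization
```

The same matching logic applies per-source when weights come from two checkpoints: backbone tensors from the dense model, expert tensors from the 35B-A3B donor.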
| Property | Value |
|---|---|
| Base model | kshitijthakkar/qwen3.5-moe-0.87B-d0.8B |
| Layers | 24 |
| Hidden size | 2048 |
| Experts | 8 total, top-2 active |
| Total parameters | 2914 M |
| Attention | Hybrid (GatedDeltaNet + Gated Full Attention) |
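The "8 total, top-2 active" row means the router scores all 8 experts per token but only the 2 highest-scoring ones execute, and their outputs are combined with renormalized gate weights. A pure-Python sketch of that routing step, using toy scaling functions as experts (the real model uses a learned linear router and FFN experts):

```python
import math

def top2_route(logits, experts, x):
    """Route input x through the 2 highest-scoring of len(experts) experts."""
    # Softmax over all expert logits (numerically stabilized).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top-2 experts and renormalize their gate weights to sum to 1.
    top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    gate_sum = probs[top2[0]] + probs[top2[1]]
    # Weighted combination of the two active experts' outputs.
    y = sum(probs[i] / gate_sum * experts[i](x) for i in top2)
    return y, top2

# 8 toy experts; expert i just scales its input by i + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # router scores for one token
y, active = top2_route(logits, experts, 1.0)
print(active)  # the two highest-logit experts (indices 4 and 1)
```

Because only 2 of 8 experts run per token, the active-parameter count stays far below the 2914 M total, which is the point of the MoE layout.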
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and processor from the Hub
model = AutoModelForImageTextToText.from_pretrained(
    "kshitijthakkar/qwen3.5-0.8b-moe-from-scratch",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("kshitijthakkar/qwen3.5-0.8b-moe-from-scratch")

# Build a chat prompt and tokenize it
messages = [{"role": "user", "content": "Explain the Qwen3.5 architecture."}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
See the companion notebook `qwen3_5_from_scratch.ipynb` for the full from-scratch implementation.