Qwen3.5 Dense-to-MoE Weight Transfer
A Qwen3.5 Mixture-of-Experts model created via dual-source weight transfer:
| Property | Value |
|---|---|
| Total Parameters | 854,386,752 (0.85B) |
| Active Parameters | 677,439,552 (0.68B) |
| Architecture | Qwen3.5 Hybrid MoE |
| Experts | 8 routed + 1 shared, top-2 |
| Hidden Size | 1024 |
| Layers | 24 (hybrid: DeltaNet + full attention) |
| Attention | GQA 8Q / 2KV, head_dim=256 |
| Context | 262,144 tokens |
| Vocab | 248,320 |
| Dtype | bfloat16 |
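A quick sanity check on the table above: the gap between total and active parameters should correspond to the six routed experts that sit idle for any given token. This is a back-of-envelope sketch, assuming the backbone and shared expert are always active:

```python
# Derive the per-expert size from the table's total/active counts.
# Assumption: the entire total-active gap is the 6 unselected routed experts.
total_params = 854_386_752
active_params = 677_439_552

inactive = total_params - active_params   # weights skipped for each token
per_routed_expert = inactive // 6         # 6 of 8 routed experts idle per token

print(f"inactive per token:  {inactive:,}")
print(f"per routed expert:  ~{per_routed_expert:,}")
```

The division comes out to an even 29,491,200 parameters per routed expert, which is consistent with the slice-and-resize construction described below.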
Total MoE FFN parameters are approximately equal to the dense model's FFN parameters. The speed benefit comes from sparsity: only the top-2 routed experts (plus the always-on shared expert) run for each token.
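A minimal sketch of what top-2 routing means in practice: the router scores all 8 experts per token, but only the two highest-scoring ones are evaluated, with their outputs mixed by a softmax over the selected pair. This is an illustrative NumPy helper, not the model's actual router code:

```python
import numpy as np

def top2_route(logits):
    """Pick the 2 best experts per token and normalize their weights.

    logits: (tokens, n_experts) router scores.
    Returns (indices, weights), each of shape (tokens, 2).
    """
    idx = np.argsort(logits, axis=-1)[:, -2:]          # two best experts per token
    picked = np.take_along_axis(logits, idx, axis=-1)  # their raw scores
    w = np.exp(picked - picked.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)              # softmax over the selected pair
    return idx, w

# one token scored against 8 experts
scores = np.array([[0.1, 2.0, -1.0, 3.0, 0.5, -0.2, 1.1, 0.0]])
idx, w = top2_route(scores)   # experts 3 and 1 win; weights sum to 1
```

The other 6 experts' FFN matrices are never touched for that token, which is where the active-parameter savings in the table come from.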
Most weights are pre-trained: the backbone comes from the dense model and the experts from 35B-A3B. Only the MoE dimension resize introduces noise, which makes this model suitable for fine-tuning at modest cost.
| Component | Source | Strategy |
|---|---|---|
| Embeddings, LM Head | Qwen/Qwen3.5-0.8B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-0.8B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-0.8B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-0.8B | Exact copy |
| Layer norms | Qwen/Qwen3.5-0.8B | Exact copy |
| Routed experts | Qwen3.5-35B-A3B | Slice 256 → 8, bilinear resize |
| Shared expert | Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen3.5-35B-A3B | Slice + resize |
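The expert-transfer rows above boil down to two operations: keep a subset of the 256 source experts, then resize each expert's weight matrices to the target dimensions. The sketch below uses a linear-interpolation resize along each axis as a stand-in for the bilinear resize; `resize_linear` and the toy matrix shapes are illustrative, not the actual transfer script:

```python
import numpy as np

def resize_linear(w, out_rows, out_cols):
    """Bilinear-style resize of a 2-D weight matrix: linear interpolation
    down the rows, then across the columns (illustrative stand-in)."""
    r_src, c_src = w.shape
    rows = np.linspace(0, r_src - 1, out_rows)
    cols = np.linspace(0, c_src - 1, out_cols)
    # interpolate along rows, one source column at a time -> (out_rows, c_src)
    w = np.stack([np.interp(rows, np.arange(r_src), w[:, j]) for j in range(c_src)], axis=1)
    # then along columns, one resized row at a time -> (out_rows, out_cols)
    w = np.stack([np.interp(cols, np.arange(c_src), w[i, :]) for i in range(out_rows)], axis=0)
    return w

# Toy dimensions; the real matrices are the 35B-A3B expert FFN weights.
src_experts = [np.random.randn(64, 48) for _ in range(256)]
kept = src_experts[:8]                                  # slice 256 -> 8
resized = [resize_linear(w, 32, 24) for w in kept]      # resize to target dims
```

When the output shape equals the input shape the resize is an exact copy, so the transfer only perturbs weights where the source and target dimensions actually differ.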
License: Apache 2.0, following the source models.