Qwen3.5 MoE 0.85B (from Qwen3.5-0.8B)

A Qwen3.5 Mixture-of-Experts model created via dual-source weight transfer: the backbone comes from a dense model and the experts from a larger MoE donor (see Weight Transfer Sources below).

Model Details

| Property | Value |
|---|---|
| Total Parameters | 854,386,752 (0.85B) |
| Active Parameters | 677,439,552 (0.68B) |
| Architecture | Qwen3.5 Hybrid MoE |
| Experts | 8 routed + 1 shared, top-2 |
| Hidden Size | 1024 |
| Layers | 24 (hybrid: DeltaNet + full attention) |
| Attention | GQA 8Q / 2KV, head_dim=256 |
| Context | 262,144 tokens |
| Vocab | 248,320 |
| Dtype | bfloat16 |

Design

Total MoE FFN parameters are approximately equal to the dense model's FFN parameters. The speed benefit comes from sparsity: only the top-2 routed experts plus the shared expert are active per token (~1/3 of total FFN parameters).

Most weights are pre-trained (backbone from the dense model, experts from Qwen3.5-35B-A3B). Only the MoE dimension resize introduces noise, making this model suitable for fine-tuning at modest cost.
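The active-fraction claim above is simple counting. A back-of-envelope check, assuming 8 equally sized routed experts, 1 shared expert of the same size, and top-2 routing (variable names below are illustrative, not actual config keys):

```python
# Per-token FFN sparsity: top-2 routed experts + the always-on shared expert.
n_routed = 8
n_shared = 1
top_k = 2

total_ffn_units = n_routed + n_shared   # 9 expert-sized FFN blocks in total
active_ffn_units = top_k + n_shared     # 3 of them run for each token

fraction = active_ffn_units / total_ffn_units
print(f"active FFN fraction per token: {fraction:.3f}")  # 0.333, i.e. ~1/3
```

Note that the overall active-parameter ratio (0.68B / 0.85B ≈ 0.79) is higher than 1/3 because attention, embeddings, and norms are always active; only the FFN is sparse.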

Weight Transfer Sources

| Component | Source | Strategy |
|---|---|---|
| Embeddings, LM head | Qwen/Qwen3.5-0.8B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-0.8B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-0.8B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-0.8B | Exact copy |
| Layer norms | Qwen/Qwen3.5-0.8B | Exact copy |
| Routed experts | Qwen3.5-35B-A3B | Slice 256 -> 8, bilinear resize |
| Shared expert | Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen3.5-35B-A3B | Slice + resize |
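The slice-and-resize strategy in the last three rows can be sketched as follows. This is an illustrative reconstruction, not the actual transfer script: the expert-selection rule (taking the first 8 of 256), the resize mapping (a simple align-corners bilinear resample in NumPy), and all tensor shapes are assumptions.

```python
import numpy as np

def bilinear_resize(w: np.ndarray, out_rows: int, out_cols: int) -> np.ndarray:
    """Bilinearly resample a 2-D weight matrix to (out_rows, out_cols),
    treating the matrix like an image (align-corners style mapping)."""
    in_rows, in_cols = w.shape
    r = np.linspace(0.0, in_rows - 1, out_rows)    # row sample positions
    c = np.linspace(0.0, in_cols - 1, out_cols)    # column sample positions
    r0 = np.floor(r).astype(int); r1 = np.minimum(r0 + 1, in_rows - 1)
    c0 = np.floor(c).astype(int); c1 = np.minimum(c0 + 1, in_cols - 1)
    fr = (r - r0)[:, None]                         # fractional row offsets
    fc = (c - c0)[None, :]                         # fractional column offsets
    top = w[np.ix_(r0, c0)] * (1 - fc) + w[np.ix_(r0, c1)] * fc
    bot = w[np.ix_(r1, c0)] * (1 - fc) + w[np.ix_(r1, c1)] * fc
    return top * (1 - fr) + bot * fr

# Slice: keep 8 of the donor's 256 routed experts, then resize each
# weight matrix to the small model's dimensions (dummy shapes here).
donor_experts = [np.random.randn(32, 64) for _ in range(256)]
kept = donor_experts[:8]                           # slice 256 -> 8
resized = [bilinear_resize(w, 16, 24) for w in kept]
print(resized[0].shape)  # (16, 24)
```

Bilinear resizing preserves the low-frequency structure of the donor weight matrices while changing their shape, which is why the resize is described as introducing only noise rather than destroying the pre-trained features.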

License

Apache 2.0 (following source models)
