Qwen3.5 Dense-to-MoE Weight Transfer
A Qwen3.5 Mixture-of-Experts model created via dual-source weight transfer:
| Property | Value |
|---|---|
| Total Parameters | 854,386,752 (0.85B) |
| Active Parameters | 677,439,552 (0.68B) |
| Architecture | Qwen3.5 Hybrid MoE |
| Experts | 8 routed + 1 shared, top-2 |
| Hidden Size | 1024 |
| Layers | 24 (hybrid: DeltaNet + full attention) |
| Attention | GQA 8Q / 2KV, head_dim=256 |
| Context | 262,144 tokens |
| Vocab | 248,320 |
| Dtype | bfloat16 |
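A quick sanity check on the table above: the gap between total and active parameters should correspond to the six routed experts that sit idle for any given token. This is a back-of-envelope sketch, assuming the backbone and shared expert are always active:

```python
# Derive the per-expert size from the table's total/active counts.
# Assumption: the entire total-active gap is the 6 unselected routed experts.
total_params = 854_386_752
active_params = 677_439_552

inactive = total_params - active_params   # weights skipped for each token
per_routed_expert = inactive // 6         # 6 of 8 routed experts idle per token

print(f"inactive per token:  {inactive:,}")
print(f"per routed expert:  ~{per_routed_expert:,}")
```

The division comes out to an even 29,491,200 parameters per routed expert, which is consistent with the slice-and-resize construction described below.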
Total MoE FFN parameters are approximately equal to the dense model's FFN parameters. The speed benefit comes from sparsity: only the top-2 routed experts (plus the always-on shared expert) run for each token.
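A minimal sketch of what top-2 routing means in practice: the router scores all 8 experts per token, but only the two highest-scoring ones are evaluated, with their outputs mixed by a softmax over the selected pair. This is an illustrative NumPy helper, not the model's actual router code:

```python
import numpy as np

def top2_route(logits):
    """Pick the 2 best experts per token and normalize their weights.

    logits: (tokens, n_experts) router scores.
    Returns (indices, weights), each of shape (tokens, 2).
    """
    idx = np.argsort(logits, axis=-1)[:, -2:]          # two best experts per token
    picked = np.take_along_axis(logits, idx, axis=-1)  # their raw scores
    w = np.exp(picked - picked.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)              # softmax over the selected pair
    return idx, w

# one token scored against 8 experts
scores = np.array([[0.1, 2.0, -1.0, 3.0, 0.5, -0.2, 1.1, 0.0]])
idx, w = top2_route(scores)   # experts 3 and 1 win; weights sum to 1
```

The other 6 experts' FFN matrices are never touched for that token, which is where the active-parameter savings in the table come from.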
Most weights are pre-trained: the backbone comes from the dense model and the experts from 35B-A3B. Only the MoE dimension resize introduces noise, which makes this model suitable for fine-tuning at modest cost.
| Component | Source | Strategy |
|---|---|---|
| Embeddings, LM Head | Qwen/Qwen3.5-0.8B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-0.8B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-0.8B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-0.8B | Exact copy |
| Layer norms | Qwen/Qwen3.5-0.8B | Exact copy |
| Routed experts | Qwen3.5-35B-A3B | Slice 256 → 8, bilinear resize |
| Shared expert | Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen3.5-35B-A3B | Slice + resize |
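The expert-transfer rows above boil down to two operations: keep a subset of the 256 source experts, then resize each expert's weight matrices to the target dimensions. The sketch below uses a linear-interpolation resize along each axis as a stand-in for the bilinear resize; `resize_linear` and the toy matrix shapes are illustrative, not the actual transfer script:

```python
import numpy as np

def resize_linear(w, out_rows, out_cols):
    """Bilinear-style resize of a 2-D weight matrix: linear interpolation
    down the rows, then across the columns (illustrative stand-in)."""
    r_src, c_src = w.shape
    rows = np.linspace(0, r_src - 1, out_rows)
    cols = np.linspace(0, c_src - 1, out_cols)
    # interpolate along rows, one source column at a time -> (out_rows, c_src)
    w = np.stack([np.interp(rows, np.arange(r_src), w[:, j]) for j in range(c_src)], axis=1)
    # then along columns, one resized row at a time -> (out_rows, out_cols)
    w = np.stack([np.interp(cols, np.arange(c_src), w[i, :]) for i in range(out_rows)], axis=0)
    return w

# Toy dimensions; the real matrices are the 35B-A3B expert FFN weights.
src_experts = [np.random.randn(64, 48) for _ in range(256)]
kept = src_experts[:8]                                  # slice 256 -> 8
resized = [resize_linear(w, 32, 24) for w in kept]      # resize to target dims
```

When the output shape equals the input shape the resize is an exact copy, so the transfer only perturbs weights where the source and target dimensions actually differ.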
License: Apache 2.0, following the source models.