# Qwen3.5 Dense-to-MoE Weight Transfer
Qwen3.5 MoE models built by dual-source weight transfer (dense backbone + 35B-A3B experts), with hybrid DeltaNet + GQA attention.
Built from scratch in the style of Sebastian Raschka's *LLMs-from-Scratch*, with weights then transferred from `kshitijthakkar/qwen3.5-moe-0.87B-d0.8B`.
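The transfer procedure itself is not spelled out here; the core idea is copying every source tensor whose name and shape match the freshly initialized target. A minimal, dependency-free sketch of that idea, with plain dicts and `(shape, values)` tuples standing in for real `state_dict` tensors (all parameter names below are hypothetical, not the model's actual keys):

```python
def transfer_weights(source, target):
    """Copy every source entry whose name and shape match one in target.

    `source` and `target` map parameter names to (shape, values) tuples,
    standing in for real state-dict tensors in this sketch.
    """
    copied, skipped = [], []
    for name, tensor in target.items():
        src = source.get(name)
        if src is not None and src[0] == tensor[0]:  # name exists and shapes agree
            target[name] = src                       # take the source weights
            copied.append(name)
        else:
            skipped.append(name)                     # e.g. MoE-only router/expert params
    return copied, skipped

# Hypothetical dense backbone vs. MoE target: shared layers match, router does not.
dense = {"embed.weight": ((151936, 2048), "dense-embed"),
         "layers.0.attn.q_proj.weight": ((2048, 2048), "dense-q")}
moe = {"embed.weight": ((151936, 2048), "init-embed"),
       "layers.0.attn.q_proj.weight": ((2048, 2048), "init-q"),
       "layers.0.mlp.router.weight": ((8, 2048), "init-router")}
copied, skipped = transfer_weights(dense, moe)
print(copied)   # shared parameters taken from the dense model
print(skipped)  # MoE-only parameters left at their initialization
```

The same matching logic applies per-source when weights come from two checkpoints: backbone tensors from the dense model, expert tensors from the 35B-A3B donor.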
| Property | Value |
|---|---|
| Base model | kshitijthakkar/qwen3.5-moe-0.87B-d0.8B |
| Layers | 24 |
| Hidden size | 2048 |
| Experts | 8 total, top-2 active |
| Total parameters | 2914 M |
| Attention | Hybrid (GatedDeltaNet + Gated Full Attention) |
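The "8 total, top-2 active" row means the router scores all 8 experts per token but only the 2 highest-scoring ones execute, and their outputs are combined with renormalized gate weights. A pure-Python sketch of that routing step, using toy scaling functions as experts (the real model uses a learned linear router and FFN experts):

```python
import math

def top2_route(logits, experts, x):
    """Route input x through the 2 highest-scoring of len(experts) experts."""
    # Softmax over all expert logits (numerically stabilized).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top-2 experts and renormalize their gate weights to sum to 1.
    top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    gate_sum = probs[top2[0]] + probs[top2[1]]
    # Weighted combination of the two active experts' outputs.
    y = sum(probs[i] / gate_sum * experts[i](x) for i in top2)
    return y, top2

# 8 toy experts; expert i just scales its input by i + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # router scores for one token
y, active = top2_route(logits, experts, 1.0)
print(active)  # the two highest-logit experts (indices 4 and 1)
```

Because only 2 of 8 experts run per token, the active-parameter count stays far below the 2914 M total, which is the point of the MoE layout.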
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and processor from the Hub
model = AutoModelForImageTextToText.from_pretrained(
    "kshitijthakkar/qwen3.5-0.8b-moe-from-scratch",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("kshitijthakkar/qwen3.5-0.8b-moe-from-scratch")

# Build a chat prompt and tokenize it
messages = [{"role": "user", "content": "Explain the Qwen3.5 architecture."}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
See the companion notebook `qwen3_5_from_scratch.ipynb` for the full from-scratch implementation.