# qwen3.5-0.8b-moe-from-scratch

A Qwen3.5-style MoE model implemented from scratch in the style of Sebastian Raschka's *LLMs-from-Scratch*, with weights then transferred from kshitijthakkar/qwen3.5-moe-0.87B-d0.8B.

## Architecture

| Property | Value |
| --- | --- |
| Base model | kshitijthakkar/qwen3.5-moe-0.87B-d0.8B |
| Layers | 24 |
| Hidden size | 2048 |
| Experts | 8 total, top-2 active |
| Total parameters | 2914 M |
| Attention | Hybrid (GatedDeltaNet + Gated Full Attention) |
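The "8 total, top-2 active" row describes sparse MoE routing: a router scores all experts per token, but only the two highest-scoring experts run, and their outputs are combined with softmax-normalized weights. A minimal NumPy sketch of this routing pattern (illustrative only; it is not the repo's implementation, and the toy linear experts stand in for the real MoE feed-forward blocks):

```python
import numpy as np

def top2_moe(x, gate_w, expert_ws):
    """Top-2 MoE routing sketch.

    x         : (tokens, hidden) input activations
    gate_w    : (hidden, n_experts) router weights
    expert_ws : list of (hidden, hidden) toy per-expert weights
    """
    logits = x @ gate_w                               # (tokens, n_experts)
    top2 = np.argsort(logits, axis=-1)[:, -2:]        # 2 best experts per token
    sel = np.take_along_axis(logits, top2, axis=-1)   # their logits
    # Softmax over only the selected experts' logits
    probs = np.exp(sel - sel.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # Each token's output is a weighted sum of its two chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(2):
            e = top2[t, slot]
            out[t] += probs[t, slot] * (x[t] @ expert_ws[e])
    return out
```

Because only 2 of 8 experts run per token, the active parameter count per forward pass is much smaller than the total (hence the 0.87B-total / 0.8B-active naming of the base model).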

## Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "kshitijthakkar/qwen3.5-0.8b-moe-from-scratch",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("kshitijthakkar/qwen3.5-0.8b-moe-from-scratch")

messages = [{"role": "user", "content": "Explain the Qwen3.5 architecture."}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## How it was built

See the companion notebook `qwen3_5_from_scratch.ipynb` for the full from-scratch implementation.
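Transferring weights from a pretrained checkpoint into a from-scratch implementation boils down to mapping parameter names between the two state dicts and copying tensors across. A hedged sketch of that pattern, with hypothetical key names (the notebook's actual mapping will differ):

```python
def transfer_weights(hf_state, name_map):
    """Copy tensors from a source (e.g. Hugging Face) state dict into a
    from-scratch model's state dict, renaming keys via name_map.

    hf_state : dict of source parameter name -> tensor
    name_map : dict of source name -> from-scratch name
    """
    out = {}
    for hf_key, my_key in name_map.items():
        if hf_key not in hf_state:
            # Fail loudly rather than silently leaving weights random-initialized
            raise KeyError(f"missing source weight: {hf_key}")
        out[my_key] = hf_state[hf_key]
    return out

# Toy example with hypothetical key names (lists stand in for tensors)
hf = {"model.embed_tokens.weight": [1, 2, 3]}
mapping = {"model.embed_tokens.weight": "tok_emb.weight"}
print(transfer_weights(hf, mapping))  # {'tok_emb.weight': [1, 2, 3]}
```

In practice you would follow this with `model.load_state_dict(out)` and a quick logits comparison against the base model to confirm the transfer was correct.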
