Attention 0.6B AdamW-WSD - step 8500

Intermediate checkpoint at optimizer step 8500 (of 20000) of the 0.6B Qwen3-style softmax-attention baseline used in the Parallax mechanism analysis.

  • Architecture: Qwen3-0.6B (28 layers, d_model=1024, 16 Q-heads, 8 KV-heads, GQA, RoPE theta 1e6)
  • Optimizer: AdamW (lr=3e-4, weight decay=0.1, betas=(0.9, 0.95))
  • Scheduler: WSD (warmup-stable-decay; warmup=0, learning rate linearly decayed over the last 20% of training)
  • Tokens seen at this step: ~33.4B (33,405M; sequence length 4096)
  • Dataset: Ultra-FineWeb (English split)
  • Tokenizer: Qwen3-0.6B (vocab 151936)
  • Total steps in run: 20000 (~80B tokens)
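The schedule and token figures in the list above can be sanity-checked with a small sketch. It is not the training code from this run. It assumes the decay is measured in optimizer steps, that the learning rate decays linearly to 0 (the final learning rate is not stated on this card), and that every step consumes the same number of tokens (about 80B / 20000, i.e. roughly 4M). The names `wsd_lr` and `approx_tokens_seen` are hypothetical.

```python
def wsd_lr(step, total_steps=20_000, peak_lr=3e-4, decay_frac=0.2, final_lr=0.0):
    """Learning rate at `step` for a warmup-free WSD schedule.

    Assumption: linear decay from peak_lr to final_lr over the last
    `decay_frac` of the run; final_lr=0.0 is a guess, not from the card.
    """
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step <= decay_start:
        return peak_lr  # stable phase (no warmup)
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (final_lr - peak_lr) * progress


def approx_tokens_seen(step, total_steps=20_000, total_tokens=80e9):
    """Rough token count at `step`, assuming a constant number of tokens per step."""
    return step / total_steps * total_tokens
```

Under these assumptions, step 8500 falls before the decay window begins, so `wsd_lr(8500)` returns the peak learning rate, and `approx_tokens_seen(8500)` gives about 34B tokens, in line with the total of ~80B tokens for the full run.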

The full training trajectory (every 500 steps) lives in the Attention 0.6B AdamW-WSD training trajectory collection.
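To step through the trajectory programmatically, a helper along the following lines may be convenient. It assumes, without confirmation from this card, that the other checkpoints in the collection follow the same repository naming pattern as this one and are published at multiples of 500 steps up to step 20000. `checkpoint_repo_id` is a hypothetical name; check the collection for the actual repository ids.

```python
# Prefix taken from this checkpoint's repository id; the shared pattern is an assumption.
REPO_PREFIX = "YifeiZuo/attention-0.6b-adamw-wsd-step"


def checkpoint_repo_id(step, interval=500, total_steps=20_000):
    """Build the (assumed) repository id for the checkpoint saved at `step`."""
    if step <= 0 or step > total_steps or step % interval != 0:
        raise ValueError(
            f"step must be a positive multiple of {interval} no larger than {total_steps}"
        )
    return f"{REPO_PREFIX}{step}"
```

For this checkpoint, `checkpoint_repo_id(8500)` reproduces the repository id used in the loading snippet.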

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "YifeiZuo/attention-0.6b-adamw-wsd-step8500"

# trust_remote_code=True lets transformers run the modeling code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
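Once loaded, the checkpoint can be sampled like any causal language model. The sketch below assumes the repository's model class supports the standard `generate` method and that plain-text continuation (no chat template) is the right way to prompt an intermediate pretraining checkpoint. `complete` is a hypothetical helper, not part of the repository.

```python
def complete(model, tokenizer, prompt, max_new_tokens=32):
    """Greedily continue `prompt` and return the decoded text (prompt included)."""
    # Tokenize to PyTorch tensors; the model is assumed to be on the default (CPU) device.
    inputs = tokenizer(prompt, return_tensors="pt")
    # do_sample=False selects greedy decoding, which keeps the output deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With `model` and `tokenizer` from the loading snippet, a call such as `complete(model, tokenizer, "The capital of France is")` returns the prompt followed by the model's continuation. Quality at step 8500 should be expected to lag the final checkpoint.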

Weights: Safetensors, 0.6B params, BF16