Attention 0.6B AdamW-WSD - step 8500

Intermediate checkpoint at optimizer step 8500 (of 20000) of the 0.6B Qwen3-style softmax-attention baseline used in the Parallax mechanism analysis.

Architecture: Qwen3-0.6B (28 layers, d_model=1024, 16 Q-heads, 8 KV-heads, GQA, RoPE theta 1e6)
Optimizer: AdamW (lr=3e-4, weight decay=0.1, betas=(0.9, 0.95))
Scheduler: WSD (warmup-stable-decay; warmup=0, learning rate decayed linearly over the last 20% of steps)
Tokens seen at this step: ~33.4B (8500 of 20000 steps, sequence length 4096)
Dataset: Ultra-FineWeb (English split)
Tokenizer: Qwen3-0.6B (vocab 151936)
Total steps in run: 20000 (~80B tokens)
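The schedule listed above can be sketched as a function of the optimizer step. This is an illustration, not the training code: the peak learning rate, total steps, and 20% decay window come from the list above, while the final learning rate of 0 and the helper name `wsd_lr` are assumptions.

```python
PEAK_LR = 3e-4         # from the optimizer settings above
TOTAL_STEPS = 20_000   # from the run description above
DECAY_FRACTION = 0.2   # "last 20% linearly decayed"
FINAL_LR = 0.0         # assumption: the card does not state the end value


def wsd_lr(step: int) -> float:
    """Learning rate at `step` for a WSD schedule with no warmup."""
    if not 0 <= step <= TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}]")
    decay_steps = round(TOTAL_STEPS * DECAY_FRACTION)
    decay_start = TOTAL_STEPS - decay_steps
    if step < decay_start:
        # Stable phase: constant peak learning rate.
        return PEAK_LR
    # Decay phase: linear interpolation from PEAK_LR to FINAL_LR.
    progress = (step - decay_start) / decay_steps
    return PEAK_LR + (FINAL_LR - PEAK_LR) * progress


if __name__ == "__main__":
    for s in (0, 8500, 16000, 18000, 20000):
        print(s, wsd_lr(s))
```

Under these assumptions, step 8500 falls in the stable phase, so this checkpoint was saved while the learning rate was still at its peak value.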

The full training trajectory (every 500 steps) lives in the Attention 0.6B AdamW-WSD training trajectory collection.
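To iterate over the trajectory programmatically, one option is to generate the checkpoint repository ids from the save interval. The sketch below assumes that every checkpoint follows the same naming pattern as this one and that checkpoints exist for steps 500 through 20000; neither is confirmed here, so check the collection for the actual names. The helper `checkpoint_repo_ids` is hypothetical.

```python
# Assumption: naming pattern extrapolated from this checkpoint's repository id.
REPO_PREFIX = "YifeiZuo/attention-0.6b-adamw-wsd-step"
TOTAL_STEPS = 20_000
SAVE_EVERY = 500


def checkpoint_repo_ids(first_step: int = SAVE_EVERY,
                        last_step: int = TOTAL_STEPS) -> list[str]:
    """Candidate repo ids for checkpoints saved every SAVE_EVERY steps."""
    return [
        f"{REPO_PREFIX}{step}"
        for step in range(first_step, last_step + 1, SAVE_EVERY)
    ]


if __name__ == "__main__":
    ids = checkpoint_repo_ids()
    print(len(ids), ids[0], ids[-1])
```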

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "YifeiZuo/attention-0.6b-adamw-wsd-step8500"

# trust_remote_code=True lets transformers run model code shipped in the repository.
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
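Once the model and tokenizer are loaded, a quick sanity check is to greedily continue a short prompt. The helper below is a minimal sketch: `continue_text` is a hypothetical name, and it assumes the checkpoint's remote code supports the standard `generate` method and that the model sits on the default device.

```python
def continue_text(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    """Greedily extend `prompt` and return the decoded text (prompt included)."""
    # Tokenize into PyTorch tensors, as expected by `generate`.
    inputs = tokenizer(prompt, return_tensors="pt")
    # do_sample=False selects greedy decoding, so the output is deterministic.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # `generate` returns a batch; decode the single sequence in it.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `print(continue_text(model, tokenizer, "The capital of France is"))` prints the prompt followed by the model's continuation. This is an intermediate pretraining checkpoint, so expect plain text continuation rather than instruction following.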