Collection: Attention 0.6B AdamW-WSD training trajectory
Per-step record (one checkpoint every 500 steps, 40 checkpoints in total) of the 0.6B Qwen3 softmax-attention baseline, trained with AdamW and a WSD (warmup-stable-decay) learning-rate schedule on 80B tokens.
Intermediate checkpoint at optimizer step 8500 (of 20000) of the 0.6B Qwen3-style softmax-attention baseline used in the Parallax mechanism analysis.
The full training trajectory (every 500 steps) lives in the Attention 0.6B AdamW-WSD training trajectory collection.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub repository of the step-8500 checkpoint.
repo_id = "YifeiZuo/attention-0.6b-adamw-wsd-step8500"

# trust_remote_code=True allows transformers to run any custom modeling code shipped in the repository.
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
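
To work with other points on the trajectory, it can help to enumerate the checkpoint steps programmatically. The sketch below rests on two assumptions that this card does not confirm: that the 40 checkpoints sit at steps 500, 1000, ..., 20000, and that the sibling repositories follow the same naming pattern as the step-8500 repository. The helper names `checkpoint_steps` and `checkpoint_repo_id` are hypothetical, introduced only for illustration.

```python
# Sketch only: the step grid and the repo naming pattern are assumptions,
# inferred from "every 500 steps", "40 ckpts", and the step-8500 repo ID.
TOTAL_STEPS = 20_000
SAVE_EVERY = 500
REPO_TEMPLATE = "YifeiZuo/attention-0.6b-adamw-wsd-step{step}"  # assumed pattern


def checkpoint_steps(total_steps=TOTAL_STEPS, save_every=SAVE_EVERY):
    """Return the assumed checkpoint steps: save_every, 2*save_every, ..., total_steps."""
    return list(range(save_every, total_steps + 1, save_every))


def checkpoint_repo_id(step):
    """Build the (assumed) Hub repo ID for a given checkpoint step."""
    if step not in checkpoint_steps():
        raise ValueError(f"step {step} is not on the assumed checkpoint grid")
    return REPO_TEMPLATE.format(step=step)


steps = checkpoint_steps()
repo_ids = [checkpoint_repo_id(s) for s in steps]
```

Any ID from `repo_ids` could then be passed to `from_pretrained` in place of the step-8500 ID, provided the corresponding repository actually exists in the collection.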