A 124M parameter language model trained from scratch on FineWeb-Edu.
| Property | Value |
|---|---|
| Architecture | Standard GPT-2 Small architecture |
| Parameters | 124M |
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Context length | 1,024 tokens |
| Vocab size | 50,257 |
| Final loss | 5.6518 |
| Final perplexity | 284.8 |
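
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard GPT-2 layout (learned position embeddings, tied input and output embeddings, a 4x MLP expansion, biases on every linear layer) and assumes the reported loss is a mean cross-entropy in nats. Neither assumption is stated explicitly on this card, and the function name `gpt2_param_count` is only illustrative.

```python
import math

# Values copied from the architecture table above.
N_LAYER, D_MODEL, N_CTX, VOCAB = 12, 768, 1024, 50257
FINAL_LOSS = 5.6518


def gpt2_param_count(n_layer, d_model, n_ctx, vocab):
    """Count parameters under the assumed standard GPT-2 layout."""
    # Token and position embeddings; the output head is assumed tied
    # to the token embedding, so it adds no extra parameters.
    embeddings = vocab * d_model + n_ctx * d_model
    # Attention: fused QKV projection (3d^2 + 3d) plus output projection (d^2 + d).
    attention = 4 * d_model * d_model + 4 * d_model
    # MLP: d -> 4d (4d^2 + 4d) followed by 4d -> d (4d^2 + d).
    mlp = 8 * d_model * d_model + 5 * d_model
    # Two LayerNorms per block, each with a weight and a bias vector.
    block_norms = 4 * d_model
    # Final LayerNorm after the last block.
    final_norm = 2 * d_model
    return embeddings + n_layer * (attention + mlp + block_norms) + final_norm


n_params = gpt2_param_count(N_LAYER, D_MODEL, N_CTX, VOCAB)
perplexity = math.exp(FINAL_LOSS)  # perplexity = exp(mean cross-entropy in nats)

print(f"parameters: {n_params:,}")     # parameters: 124,439,808
print(f"perplexity: {perplexity:.1f}")  # perplexity: 284.8
```

Under these assumptions the count rounds to the 124M in the table, and the perplexity matches the exponential of the reported final loss.
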

This model was trained from scratch (random initialization) for research purposes.

| Hyperparameter | Value |
|---|---|
| Batch size | 4 sequences |
| Gradient accumulation | 8 steps |
| Effective batch | 32,768 tokens/step |
| Total steps | 6,103 |
| Learning rate | 6e-4 (cosine decay to 6e-5) |
| Warmup steps | 200 |
| Optimizer | AdamW (beta1=0.9, beta2=0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | Mixed (AMP fp16) |
| Hardware | NVIDIA T4 GPU (16GB) |
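
The batch arithmetic and learning-rate schedule in the table can be sketched as follows. This is a minimal illustration, not the training script: it assumes linear warmup from near zero to the peak rate, followed by a cosine decay that reaches the minimum at the final step. The table gives only the endpoints and the warmup length, so the exact shape is an assumption, and the function name `lr_at` is hypothetical.

```python
import math

# Values copied from the hyperparameter table above.
BATCH_SIZE = 4          # sequences per micro-batch
GRAD_ACCUM = 8          # micro-batches per optimizer step
CONTEXT_LENGTH = 1024   # tokens per sequence
TOTAL_STEPS = 6103
WARMUP_STEPS = 200
MAX_LR, MIN_LR = 6e-4, 6e-5

tokens_per_step = BATCH_SIZE * GRAD_ACCUM * CONTEXT_LENGTH
total_tokens = tokens_per_step * TOTAL_STEPS


def lr_at(step):
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup, reaching MAX_LR on the last warmup step.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


print(f"tokens per step: {tokens_per_step:,}")  # tokens per step: 32,768
print(f"total tokens:    {total_tokens:,}")     # total tokens:    199,983,104
for step in (0, 199, 200, 3000, 6102):
    print(step, f"{lr_at(step):.2e}")
```

The total of roughly 200M tokens is consistent with the `200m` suffix in the model names listed below.
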

- rockerritesh/gpt2-small-fineweb-edu-200m: GPT-2 Small (124M, 12 layers)
- rockerritesh/smollm2-135m-fineweb-edu-200m: SmolLM2 (135M, 30 layers, Llama-style)