gpt2-small-fineweb-edu-200m

A 124M-parameter GPT-2 Small language model trained from scratch on roughly 200M tokens of FineWeb-Edu.

Model Description

Property            Value
Architecture        Standard GPT-2 Small architecture
Parameters          124M
Layers              12
Hidden size         768
Attention heads     12
Context length      1,024 tokens
Vocab size          50,257
Final loss          5.6518
Final perplexity    284.8
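
As a sanity check on the 124M figure, the parameter count can be derived from the values in the table. This is a minimal sketch that assumes the usual GPT-2 layout (tied input/output embeddings, learned position embeddings, biases on all linear layers, a 4x MLP expansion); the function name `gpt2_param_count` is illustrative and not part of any library.

```python
def gpt2_param_count(n_layer, n_head, d_model, vocab_size, n_ctx):
    """Count parameters for a GPT-2 style decoder (assumed standard layout)."""
    assert d_model % n_head == 0, "hidden size must split evenly across heads"

    # Token and position embeddings (output head assumed tied to token embeddings).
    embeddings = vocab_size * d_model + n_ctx * d_model

    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # MLP: expand to 4 * d_model, then project back, each with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)

    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)

    per_block = attn + mlp + layer_norms
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * per_block + final_layer_norm


total_params = gpt2_param_count(
    n_layer=12, n_head=12, d_model=768, vocab_size=50257, n_ctx=1024
)
print(f"{total_params:,}")  # 124,439,808
```

Under these assumptions the total is about 124.4M, which matches the rounded value in the table.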

Training Details

This model was trained from scratch (random initialization) for research purposes.

Training Data

  • Dataset: FineWeb-Edu (sample-10BT split)
  • Total tokens: ~200M (streamed)
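
The card does not describe the data pipeline beyond "streamed", so the following is only a sketch of one common approach: stream documents, tokenize them, and pack the token IDs into fixed-length blocks until a token budget is spent. The helper names `stream_fineweb_edu_texts` and `pack_blocks` are hypothetical, and the dataset ID `HuggingFaceFW/fineweb-edu` with config `sample-10BT` is an assumption about which Hub dataset was used.

```python
def stream_fineweb_edu_texts():
    """Yield raw document texts without downloading the full split (assumed dataset ID)."""
    from datasets import load_dataset  # imported lazily; requires the `datasets` package

    dataset = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="sample-10BT",
        split="train",
        streaming=True,
    )
    for row in dataset:
        yield row["text"]


def pack_blocks(token_streams, block_size, token_budget):
    """Concatenate tokenized documents and yield full blocks until the budget is reached."""
    buffer = []
    emitted = 0
    for token_ids in token_streams:
        buffer.extend(token_ids)
        # Emit as many complete blocks as the buffer and the budget allow.
        while len(buffer) >= block_size and emitted + block_size <= token_budget:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
            emitted += block_size
        if emitted + block_size > token_budget:
            return  # the next block would exceed the budget


# Toy demonstration with tiny "documents" instead of real tokenizer output.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
toy_blocks = list(pack_blocks(toy_docs, block_size=4, token_budget=8))
print(toy_blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

For this model the corresponding settings would be `block_size=1024` and a `token_budget` of about 200M, with each streamed text passed through the tokenizer first.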

Training Configuration

Hyperparameter           Value
Batch size               4 sequences
Gradient accumulation    8 steps
Effective batch          32,768 tokens/step
Total steps              6,103
Learning rate            6e-4 (cosine decay to 6e-5)
Warmup steps             200
Optimizer                AdamW (beta1=0.9, beta2=0.95)
Weight decay             0.1
Gradient clipping        1.0
Precision                Mixed (AMP fp16)
Hardware                 NVIDIA T4 GPU (16GB)
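
The batch and schedule numbers above fit together as follows. This sketch recomputes the effective batch and total token count, then shows one plausible learning-rate schedule. The card only states "cosine decay to 6e-5" with 200 warmup steps, so the linear warmup shape and the decay horizon (the full 6,103 steps) are assumptions, and `lr_at` is an illustrative name.

```python
import math

BATCH_SIZE = 4        # sequences per micro-batch
GRAD_ACCUM = 8        # micro-batches per optimizer step
CONTEXT_LENGTH = 1024
TOTAL_STEPS = 6103
WARMUP_STEPS = 200
MAX_LR = 6e-4
MIN_LR = 6e-5

tokens_per_step = BATCH_SIZE * GRAD_ACCUM * CONTEXT_LENGTH
total_tokens = tokens_per_step * TOTAL_STEPS
print(tokens_per_step)        # 32768
print(f"{total_tokens:,}")    # 199,983,104


def lr_at(step):
    """Learning rate at a 0-indexed optimizer step (assumed linear warmup + cosine decay)."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches MAX_LR on the last warmup step.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)
```

The total of 199,983,104 tokens is consistent with the "~200M" figure given under Training Data.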

Training Curve

  • Initial loss: ~10.9 (random)
  • Final loss: 5.6518 (perplexity 284.8)
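
Both numbers can be cross-checked with a few lines of arithmetic, assuming the reported loss is the mean cross-entropy in nats per token. Perplexity is then the exponential of the loss, and a model that predicts uniformly over the vocabulary would start at ln(50,257); a freshly initialized network is usually slightly above that value.

```python
import math

VOCAB_SIZE = 50257
FINAL_LOSS = 5.6518

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats.
final_perplexity = math.exp(FINAL_LOSS)
print(round(final_perplexity, 1))  # 284.8

# Loss of a uniform guess over the vocabulary, a floor for the initial loss.
uniform_loss = math.log(VOCAB_SIZE)
print(round(uniform_loss, 2))  # 10.82
```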

Limitations

  • Not for production use: Trained on only ~200M tokens, so output quality is low (final perplexity 284.8)
  • No instruction tuning: Base causal LM, not a chat model
  • English only: Trained on English FineWeb-Edu data
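
Because this is a base causal LM, it is prompted with plain text to continue rather than with chat messages. The sketch below assumes the checkpoint loads through the `transformers` auto classes under the repo ID `rockerritesh/gpt2-small-fineweb-edu-200m` and that the repo ships a GPT-2 compatible tokenizer; neither is confirmed by this card. The function name `complete` and the sampling settings are illustrative.

```python
REPO_ID = "rockerritesh/gpt2-small-fineweb-edu-200m"  # assumed Hub repo ID


def complete(prompt, max_new_tokens=50):
    """Continue a plain-text prompt with the base model (no chat template)."""
    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(complete("The water cycle begins when"))
```

Prompts work best when phrased as the start of an English passage to be continued; instructions or questions are not treated specially by a base model.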

Related Models


Dataset used to train rockerritesh/gpt2-small-fineweb-edu-200m