rockerritesh/gpt2-small-fineweb-edu-200m

Tags: Text Generation, training-dynamics


gpt2-small-fineweb-edu-200m

A 124M-parameter language model trained from scratch on ~200M tokens of FineWeb-Edu.

Model Description

Property	Value
Architecture	GPT-2 Small (standard)
Parameters	124M
Layers	12
Hidden size	768
Attention heads	12
Context length	1,024 tokens
Vocab size	50,257
Final loss	5.6518
Final perplexity	284.8
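
The 124M figure can be reproduced from the table above. The sketch below assumes the usual GPT-2 layout, which this card does not spell out: learned position embeddings, a 4x MLP expansion, biases on all linear layers, and an output head tied to the token embedding. The function name `gpt2_param_count` is illustrative only.

```python
def gpt2_param_count(vocab_size=50257, n_ctx=1024, d_model=768, n_layer=12):
    """Count parameters for a GPT-2 style decoder (assumed layout, see text above)."""
    d_ff = 4 * d_model  # assumption: standard 4x MLP expansion

    # Token embedding table plus learned position embedding table.
    embeddings = vocab_size * d_model + n_ctx * d_model

    layer_norm = 2 * d_model  # scale and bias
    # Fused QKV projection plus output projection, each with a bias.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two linear layers of the feed-forward block, each with a bias.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = 2 * layer_norm + attention + mlp

    final_layer_norm = 2 * d_model
    # Assumption: the LM head shares weights with the token embedding, so it adds nothing.
    return embeddings + n_layer * per_layer + final_layer_norm


total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # 124,439,808 parameters (~124M)
```

The number of attention heads does not appear in the count because splitting the hidden size across 12 heads changes the shape of the attention computation, not the number of weights.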

Training Details

This model was trained from scratch (random initialization) for research purposes.

Training Data

Dataset: FineWeb-Edu (sample-10BT split)
Total tokens: ~200M (streamed)
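
The ~200M figure is consistent with the training configuration listed in this card (batch size 4, 8 accumulation steps, 6,103 steps). A quick check, assuming every sequence is packed to the full 1,024-token context with no padding:

```python
seqs_per_micro_batch = 4   # "Batch size" in the training configuration
grad_accum_steps = 8       # "Gradient accumulation"
context_length = 1024      # tokens per sequence (assumed fully packed)
total_steps = 6103         # "Total steps"

tokens_per_step = seqs_per_micro_batch * grad_accum_steps * context_length
total_tokens = tokens_per_step * total_steps

print(tokens_per_step)  # 32768
print(total_tokens)     # 199983104
```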

Training Configuration

Hyperparameter	Value
Batch size	4 sequences
Gradient accumulation	8 steps
Effective batch	32,768 tokens/step (4 sequences × 8 accumulation steps × 1,024 tokens)
Total steps	6,103
Learning rate	6e-4 (cosine decay to 6e-5)
Warmup steps	200
Optimizer	AdamW (beta1=0.9, beta2=0.95)
Weight decay	0.1
Gradient clipping	1.0
Precision	Mixed (AMP fp16)
Hardware	NVIDIA T4 GPU (16GB)
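
The learning-rate rows above describe a warmup followed by cosine decay. The sketch below is one common way to implement that; the card does not state the exact form, so the linear warmup shape, the decay horizon (taken to be the full 6,103 steps), and the function name `learning_rate` are assumptions.

```python
import math


def learning_rate(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=200, total_steps=6103):
    """Assumed schedule: linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp; step is 0-indexed, so the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)


for step in (0, 199, 200, 3000, 6102):
    print(step, f"{learning_rate(step):.2e}")
```

In a PyTorch training loop, a function like this would typically be evaluated once per optimizer step and written into each parameter group's `lr` before calling `optimizer.step()`.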

Training Curve

Initial loss: ~10.9 (random initialization)
Final loss: 5.6518 (perplexity 284.8)
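
The two numbers above are related by perplexity = exp(loss), assuming the loss is the mean cross-entropy per token in nats. The same arithmetic gives a reference point for the initial loss: a model that assigns equal probability to all 50,257 tokens would score ln(50,257).

```python
import math

final_loss = 5.6518   # mean cross-entropy per token (assumed to be in nats)
vocab_size = 50257

perplexity = math.exp(final_loss)
uniform_loss = math.log(vocab_size)  # loss of a uniform prediction over the vocabulary

print(f"perplexity   = {perplexity:.1f}")    # 284.8
print(f"uniform loss = {uniform_loss:.2f}")  # 10.82
```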

Limitations

Not for production use: Trained on only ~200M tokens, so generated text is rough (final perplexity 284.8)
No instruction tuning: Base causal LM, not a chat model
English only: Trained on English FineWeb-Edu data
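
Because this is a base causal LM, it is used by plain text continuation rather than chat prompting. The sketch below shows one way to do that with the `transformers` library. It assumes the repository holds a `transformers`-compatible checkpoint together with its tokenizer files, which this card does not confirm; the helper name `generate` and the sampling settings (`top_k=50`, `temperature=0.8`) are illustrative choices, not the author's.

```python
REPO_ID = "rockerritesh/gpt2-small-fineweb-edu-200m"


def generate(prompt, max_new_tokens=50, seed=0):
    """Continue `prompt` with the base model; downloads the weights on first call."""
    # Imported lazily so the heavy dependencies are only needed when generating.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: tokenizer files are present in the same repository.
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID)
    model.eval()

    torch.manual_seed(seed)  # make the sampled continuation repeatable
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_k=50,
            temperature=0.8,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Photosynthesis is the process by which")` would return the prompt followed by up to 50 sampled tokens. The prompt and the new tokens together must fit within the 1,024-token context length.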

Related Models

rockerritesh/gpt2-small-fineweb-edu-200m (this model): GPT-2 Small (124M, 12 layers)
rockerritesh/smollm2-135m-fineweb-edu-200m: SmolLM2 (135M, 30 layers, Llama-style)


Safetensors: 0.1B params, tensor type F32
