DMaS-LLaMa-Lite-step-23.9k
This repository provides access to DMaS-LLaMa-Lite-step-23.9k, a 1.7-billion-parameter language model based on the LLaMa architecture. The model has been trained from scratch as part of the DMaS-LLaMa-Lite project using approximately 20 billion tokens of high-quality educational content.
Model Overview
- Architecture: LLaMa-based
- Parameters: 1.7B (36 layers, 32 attention heads, RMSNorm)
- Tokenizer: GPT-2 tokenizer
- Training Data: FineWeb-Edu subset (educational text)
- Training Steps: 23,900
- Optimizer: AdamW with linear warmup and decay
- Hardware: Trained on 1-2 RTX A6000 GPUs with PyTorch DDP
- Dataset Source: FineWeb-Edu Dataset
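For intuition, the listed shape (36 layers, 32 attention heads, GPT-2 vocabulary) is roughly consistent with 1.7B parameters at a hidden size around 2048. The hidden size below is an assumption for illustration, not a figure stated in this card; a back-of-the-envelope estimate:

```python
# Rough parameter-count estimate for a LLaMa-style decoder.
# hidden=2048 is an ASSUMPTION chosen to land near 1.7B parameters;
# the model card does not state the hidden size.
def estimate_params(n_layers=36, hidden=2048, vocab=50257):
    embed = vocab * hidden            # token embeddings (GPT-2 vocabulary)
    per_layer = 12 * hidden * hidden  # ~4d^2 attention + ~8d^2 SwiGLU MLP
    return embed + n_layers * per_layer

print(f"{estimate_params() / 1e9:.2f}B")  # prints 1.91B, the same order as 1.7B
```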
The training process emphasizes qualitative improvements in coherence, fluency, and factual grounding, demonstrating competitive results with significantly fewer training tokens than larger-scale models. This checkpoint captures the model's state at 23,900 training steps, where validation loss and downstream benchmarks already show notable improvements in text fluency and alignment with prompts.
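The "linear warmup and decay" schedule noted in the overview can be sketched as a plain function. The peak learning rate and warmup length below are illustrative placeholders, not values from the paper:

```python
def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=23_900, min_lr=0.0):
    """Linear warmup to peak_lr, then linear decay toward min_lr.

    peak_lr and warmup_steps are ASSUMED values for illustration only.
    """
    if step < warmup_steps:
        # Ramp up linearly from ~0 to peak_lr over the warmup phase
        return peak_lr * (step + 1) / warmup_steps
    # Decay linearly from peak_lr to min_lr over the remaining steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr + frac * (min_lr - peak_lr)
```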
Paper
The model was presented in the paper Experience of Training a 1.7B-Parameter LLaMa Model From Scratch.
Pretraining large language models is a complex endeavor influenced by multiple factors, including model architecture, data quality, training continuity, and hardware constraints. In this paper, we share insights gained from the experience of training DMaS-LLaMa-Lite, a fully open source, 1.7-billion-parameter, LLaMa-based model, on approximately 20 billion tokens of carefully curated data. We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output. Beyond pretraining, we extend our analysis to include a post-training phase focused on instruction tuning, where the model was refined to produce more contextually appropriate, user-aligned responses. We highlight practical considerations such as the importance of restoring optimizer states when resuming from checkpoints, and the impact of hardware changes on training stability and throughput. While qualitative evaluation provides an intuitive understanding of model improvements, our analysis extends to various performance benchmarks, demonstrating how high-quality data and thoughtful scaling enable competitive results with significantly fewer training tokens. By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies.
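One practical point from the abstract, restoring optimizer state when resuming from checkpoints, can be illustrated generically. This is a minimal sketch using plain dicts and pickle, not the project's actual PyTorch training loop; the point is that a checkpoint should bundle the optimizer state (e.g. AdamW moment estimates) alongside the weights, since resuming from weights alone resets those estimates:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, model_state, optimizer_state, step):
    # Persist model AND optimizer state together; resuming with only
    # the weights would reset AdamW's moment estimates.
    with open(path, "wb") as f:
        pickle.dump(
            {"model": model_state, "optimizer": optimizer_state, "step": step}, f
        )

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage with toy placeholder states
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
save_checkpoint(path, {"w": [0.1, 0.2]}, {"exp_avg": [0.0, 0.0]}, step=23_900)
ckpt = load_checkpoint(path)
```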
Training Code
The training script, including configurations and instructions, is open-sourced and available here:
📄 DMaS-LLaMa-Lite Training Code
Usage
You can load the model with the Hugging Face Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "McGill-DMaS/DMaS-LLaMa-Lite-step-23.9k"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a continuation of the prompt
inputs = tokenizer("The Pyramids of Giza in Egypt are some of the oldest man-made structures in the world.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Citation
If you use this model or its training insights in your work, please cite the following paper:
@INPROCEEDINGS{li2025training,
author={Li, Miles Q. and Fung, Benjamin C. M. and Huang, Shih-Chia},
booktitle={2025 International Joint Conference on Neural Networks (IJCNN)},
title={Training Dynamics of a 1.7B LLaMa Model: A Data-Efficient Approach},
year={2025},
volume={},
number={},
pages={1-10},
keywords={Training;Analytical models;Refining;Benchmark testing;Throughput;Data models;Hardware;Stability analysis;Trajectory;Tuning},
doi={10.1109/IJCNN64981.2025.11228044}}
License
This model and code are released under the Apache License 2.0. Please check the respective repositories for detailed terms.