Instructions for using axiomlaborg/Cable with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use axiomlaborg/Cable with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="axiomlaborg/Cable")
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("axiomlaborg/Cable", dtype="auto")
```
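Once built, the pipeline can be called on a prompt directly. A minimal usage sketch (the prompt and generation parameters below are illustrative, not from the repository):

```python
# Illustrative call: generate a short continuation with the pipeline.
output = pipe("Once upon a time,", max_new_tokens=50)
print(output[0]["generated_text"])
```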
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use axiomlaborg/Cable with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "axiomlaborg/Cable"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "axiomlaborg/Cable",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
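Because the server exposes an OpenAI-compatible API, it can also be called from Python. A minimal sketch, assuming the `openai` client package is installed (the `api_key` value is a placeholder; vLLM does not validate it by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="axiomlaborg/Cable",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```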
Use Docker

```bash
docker model run hf.co/axiomlaborg/Cable
```
- SGLang
How to use axiomlaborg/Cable with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "axiomlaborg/Cable" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "axiomlaborg/Cable",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
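The same endpoint can also be queried from Python with a plain HTTP client. A minimal sketch, assuming the `requests` package; the payload mirrors the curl example above:

```python
import requests

# Query the local SGLang server's OpenAI-compatible completions endpoint.
response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "axiomlaborg/Cable",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
print(response.json()["choices"][0]["text"])
```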
Use Docker images

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "axiomlaborg/Cable" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "axiomlaborg/Cable",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use axiomlaborg/Cable with Docker Model Runner:
```bash
docker model run hf.co/axiomlaborg/Cable
```
Context-aware Biases for Length Extrapolation
This repository contains the source code for Context-aware Biases for Length Extrapolation (Cable).
🚀 News
- [2025.02.03] Code release
Upcoming
- Cleaning codebase
- Adding scripts for training ALiBi, RoPE, T5-bias
Datasets and Models
Download the datasets from HuggingFace and use dataset_preparation.py to save the tokenized dataset.
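For orientation, a minimal sketch of what the tokenization step typically looks like, assuming the `datasets` library and the GPT-2 tokenizer; `dataset_preparation.py` in this repo is the authoritative script and its actual interface may differ:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical sketch: tokenize a text dataset and save it to disk.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=dataset.column_names,
)
tokenized.save_to_disk("tokenized_wikitext103")
```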
Some of the trained models:
| Dataset | Model | Parameters | Sequence Length | Checkpoint |
|---|---|---|---|---|
| Fineweb-Edu (10B) | GPT-Medium | 334M | 1024 | |
| Fineweb-Edu (10B) | GPT-Medium | 334M | 512 | |
| WikiText-103 | GPT-Tiny | 44M | 1024 | |
| WikiText-103 | GPT-Tiny | 44M | 512 | |
How to use the models:
```python
from transformers import AutoModel

cable_fineweb_md_1024 = AutoModel.from_pretrained("axiomlaborg/Cable", trust_remote_code=True, revision="cable-edufineweb-md-1024")
cable_fineweb_md_512 = AutoModel.from_pretrained("axiomlaborg/Cable", trust_remote_code=True, revision="cable-edufineweb-md-512")
cable_wiki_tiny_1024 = AutoModel.from_pretrained("axiomlaborg/Cable", trust_remote_code=True, revision="cable-wiki-tiny-1024")
cable_wiki_tiny_512 = AutoModel.from_pretrained("axiomlaborg/Cable", trust_remote_code=True, revision="cable-wiki-tiny-512")
```
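After loading, inference runs as usual. A hedged sketch, assuming the remote code exposes a GPT-2-style causal-LM interface with the GPT-2 tokenizer and logits in its output (an assumption about the custom code, so adjust to the actual interface):

```python
import torch
from transformers import AutoTokenizer

# Assumption: the checkpoints use the GPT-2 tokenizer and return logits.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids

with torch.no_grad():
    outputs = cable_wiki_tiny_1024(input_ids)

# Greedy next-token pick from the final position's logits.
next_token_id = outputs.logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))
```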
Training
Single GPU
```bash
python Cable.py --dataset-dir "path to dataset" --model "medium or small or tiny" --save-dir "dir for logs"
```

Multiple GPUs
```bash
torchrun --standalone --nproc_per_node=2 Cable.py
```
For the HellaSwag benchmark and for evaluating length extrapolation, please use the evaluation.ipynb notebook.
Length Extrapolation
A Cable model trained on T=1024 can extrapolate to T=8192, achieving better performance (PPL = 22.22) than a sinusoidal model (PPL = 22.81) trained on T=8192.
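For reference, perplexity at an extrapolated length can be estimated along these lines. A minimal sketch, assuming a causal-LM-style model that returns logits and a flat tensor of evaluation token ids (names are illustrative; evaluation.ipynb is the authoritative version):

```python
import torch
import torch.nn.functional as F

def eval_ppl(model, token_ids, seq_len=8192):
    # Slide over the evaluation stream in non-overlapping windows and
    # average next-token cross-entropy; PPL = exp(mean loss).
    losses = []
    for start in range(0, token_ids.size(0) - seq_len, seq_len):
        chunk = token_ids[start : start + seq_len + 1]
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:].unsqueeze(0)
        with torch.no_grad():
            logits = model(inputs).logits  # assumed HF-style output
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        losses.append(loss.item())
    return torch.exp(torch.tensor(losses).mean()).item()
```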
Runtime and Memory Overhead
Cable significantly improves the model's extrapolation ability with negligible time and memory overhead compared to the vanilla transformer. Furthermore, compared to existing RPE methods, our approach maintains nearly identical training time and GPU memory usage, while its inference overhead remains negligible or comparable, depending on the sequence length.
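Such overhead claims can be sanity-checked with a simple micro-benchmark. A minimal sketch, assuming a CUDA device and an already loaded model (illustrative; this is not the paper's benchmarking code):

```python
import torch

def measure_step(model, input_ids, n_iters=20):
    # Time forward passes with CUDA events and report peak GPU memory.
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    with torch.no_grad():
        for _ in range(n_iters):
            model(input_ids)
    end.record()
    torch.cuda.synchronize()
    ms_per_iter = start.elapsed_time(end) / n_iters
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    return ms_per_iter, peak_gib
```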
Citation
If you use this repository for your research or wish to refer to our positional encoding method, please use the following BibTeX entry:
```bibtex
@article{veisi2025context,
  title={Context-aware Biases for Length Extrapolation},
  author={Ali Veisi and Amir Mansourian},
  journal={arXiv preprint arXiv:2503.08067},
  year={2025}
}
```
Acknowledgement
This repo is based on Karpathy/Build-NanoGPT. Thanks for their excellent work.
Model tree for axiomlaborg/Cable
Base model: openai-community/gpt2