Ternary LLMs & Knowledge distillation & SOTA
• Addition is All You Need for Energy-efficient Language Models (arXiv:2410.00907); see the integer-addition sketch after this list
• The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (arXiv:2402.17764); see the absmean quantizer sketch below
• LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv:2404.16710); see the early-exit sketch below
• Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory (arXiv:2405.08707)
• Token-Scaled Logit Distillation for Ternary Weight Generative Language Models (arXiv:2308.06744); see the distillation-loss sketch below
• TerDiT: Ternary Diffusion Models with Transformers (arXiv:2405.14854)
• Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv:2405.12981); see the shared-KV sketch below
• You Only Cache Once: Decoder-Decoder Architectures for Language Models (arXiv:2405.05254)
• Differential Transformer (arXiv:2410.05258); see the differential-attention sketch below
• BitNet a4.8: 4-bit Activations for 1-bit LLMs (arXiv:2411.04965); see the activation-quantizer sketch below
• Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (arXiv:2502.11089); see the block-selection sketch below
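For arXiv:2410.00907, the core observation is that for positive floats, integer addition of the raw bit patterns approximates multiplication. A minimal sketch, assuming fp32 and the classic Mitchell-style approximation that the paper's L-Mul refines with an extra mantissa correction; `add_as_mul` is an illustrative name:

```python
import struct

def add_as_mul(x: float, y: float) -> float:
    """Mitchell-style trick: adding the bit patterns of two positive
    floats approximates their product. L-Mul (arXiv:2410.00907) refines
    this with a mantissa correction term; the paper's exact algorithm
    differs in detail."""
    ix = struct.unpack("<I", struct.pack("<f", x))[0]
    iy = struct.unpack("<I", struct.pack("<f", y))[0]
    iz = ix + iy - 0x3F800000  # subtract the bit pattern of 1.0 so the exponent bias counts once
    return struct.unpack("<f", struct.pack("<I", iz))[0]

print(add_as_mul(3.0, 5.0))  # ~14.0 vs. the exact 15.0; worst-case relative error ~11%
```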
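For arXiv:2402.17764, a minimal sketch of the absmean ternary quantizer the BitNet b1.58 paper describes; straight-through gradient estimation and the full training recipe are omitted:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Absmean quantizer from the b1.58 paper (arXiv:2402.17764): scale
    weights by their mean absolute value, then round and clip to
    {-1, 0, +1}."""
    gamma = w.abs().mean()                         # per-tensor absmean scale
    w_q = (w / (gamma + eps)).round().clamp(-1, 1)
    return w_q, gamma                              # effective weight: w_q * gamma
```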
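For arXiv:2404.16710, a toy early-exit forward pass; the argument names are illustrative, not a real library API, and assume a model trained for early exit as LayerSkip proposes:

```python
import torch

@torch.no_grad()
def early_exit_logits(blocks, final_norm, lm_head, x, exit_layer: int):
    """Run only the first `exit_layer` transformer blocks, then reuse
    the shared LM head for draft logits, in the spirit of LayerSkip
    (arXiv:2404.16710); the remaining layers can verify the drafts
    (self-speculative decoding)."""
    for block in blocks[:exit_layer]:
        x = block(x)
    return lm_head(final_norm(x))
```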
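For arXiv:2308.06744, a generic sketch of per-token-weighted logit distillation; the confidence-based weighting below is an assumption for illustration, and the paper's exact scaling rule may differ:

```python
import torch
import torch.nn.functional as F

def token_scaled_kd_loss(student_logits, teacher_logits, T: float = 1.0):
    """Per-token-weighted logit distillation, a generic reading of the
    idea named in arXiv:2308.06744. Tokens are weighted by teacher
    confidence here. Logit shapes: (batch, seq_len, vocab)."""
    t_logp = F.log_softmax(teacher_logits / T, dim=-1)
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)  # per-token KL, (batch, seq)
    conf = t_logp.exp().max(-1).values               # teacher confidence per token
    w = conf / conf.sum(-1, keepdim=True)            # normalize over the sequence
    return (w * kl).sum(-1).mean() * T**2
```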
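For arXiv:2405.12981, a toy pair of decoder layers sharing one K/V projection so only one layer's keys and values enter the cache, which is the core of cross-layer attention; the module layout (no MLPs or norms) is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAPair(nn.Module):
    """Two decoder layers sharing one K/V projection, sketching
    Cross-Layer Attention (arXiv:2405.12981): one set of K/V is cached
    and reused, halving KV-cache size for the pair."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.kv_proj = nn.Linear(d_model, 2 * d_model)  # shared across both layers
        self.q_proj_a = nn.Linear(d_model, d_model)
        self.q_proj_b = nn.Linear(d_model, d_model)

    def _attend(self, q, k, v):
        b, t, d = q.shape
        h = self.n_heads
        qh, kh, vh = (z.view(b, t, h, d // h).transpose(1, 2) for z in (q, k, v))
        out = F.scaled_dot_product_attention(qh, kh, vh, is_causal=True)
        return out.transpose(1, 2).reshape(b, t, d)

    def forward(self, x):
        k, v = self.kv_proj(x).chunk(2, dim=-1)        # computed (and cached) once
        x = x + self._attend(self.q_proj_a(x), k, v)
        x = x + self._attend(self.q_proj_b(x), k, v)   # layer B reuses layer A's K/V
        return x
```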
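For arXiv:2410.05258, a single-head sketch of differential attention as the difference of two softmax maps; the paper's learned lambda reparameterization and per-head normalization are omitted:

```python
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam: float):
    """Differential attention (arXiv:2410.05258): subtracting a second
    softmax map cancels common-mode attention noise.
    q*, k*: (batch, seq, d); v: (batch, seq, d_v)."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v
```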
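For arXiv:2411.04965, a generic per-token absmax INT4 activation quantizer; BitNet a4.8 itself uses a more involved hybrid quantization/sparsification scheme, so this shows only the basic step:

```python
import torch

def absmax_int4(x: torch.Tensor, eps: float = 1e-5):
    """Per-token absmax quantization of activations to INT4 ([-8, 7]),
    the kind of quantizer BitNet a4.8 (arXiv:2411.04965) builds on.
    x: (..., hidden)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / 7.0
    x_q = (x / scale).round().clamp(-8, 7)
    return x_q, scale  # dequantize as x_q * scale
```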
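For arXiv:2502.11089, a sketch of per-query top-k KV-block selection; scoring blocks with mean-pooled keys is a simplification here, since the paper derives block scores from its compression branch and combines selection with compressed and sliding-window attention:

```python
import torch

def topk_kv_blocks(q, k, block_size: int = 64, top_k: int = 4):
    """Block-selection step in the style of Native Sparse Attention
    (arXiv:2502.11089): score contiguous KV blocks per query and keep
    the top-k for sparse attention.
    q: (batch, q_len, d); k: (batch, kv_len, d), kv_len % block_size == 0."""
    b, n, d = k.shape
    pooled = k.view(b, n // block_size, block_size, d).mean(dim=2)
    scores = q @ pooled.transpose(-1, -2)        # (batch, q_len, n_blocks)
    return scores.topk(top_k, dim=-1).indices    # block ids to gather K/V from
```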