Abstract
A novel Post-LayerNorm Transformer architecture called Keel addresses training instability in extremely deep networks by replacing residual connections with Highway-style connections, enabling stable training beyond 1000 layers.
Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale led to its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves gradient flow through the residual branch, preventing signals from vanishing as they propagate from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility of future infinite-depth architectures.
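For a concrete picture of the idea the abstract describes, here is a minimal PyTorch sketch of a Post-LN block whose residual additions are replaced by Highway-style gated connections. The gate parameterization (a sigmoid over a learned linear projection of the block input), the LayerNorm placement, and all module names below are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch: Post-LN Transformer block with Highway-style gated
# connections in place of the usual residual add. Gate form and norm
# placement are assumptions for illustration, not Keel's official code.
import torch
import torch.nn as nn


class HighwayPostLNBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        # Highway transform gates (assumed parameterization): one per sublayer.
        self.gate_attn = nn.Linear(d_model, d_model)
        self.gate_ffn = nn.Linear(d_model, d_model)

    def _highway_merge(self, x, sublayer_out, gate, norm):
        # Highway-style combination t*f(x) + (1 - t)*x instead of x + f(x),
        # with LayerNorm applied after the merge (Post-LN ordering).
        t = torch.sigmoid(gate(x))
        return norm(t * sublayer_out + (1.0 - t) * x)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self._highway_merge(x, attn_out, self.gate_attn, self.norm_attn)
        ffn_out = self.ffn(x)
        x = self._highway_merge(x, ffn_out, self.gate_ffn, self.norm_ffn)
        return x


if __name__ == "__main__":
    # Stack a tower of blocks; the (1 - t)*x path keeps a direct route for
    # the signal, which is the property the abstract credits for stability
    # at extreme depth.
    blocks = nn.ModuleList([HighwayPostLNBlock(256, 4, 1024) for _ in range(8)])
    x = torch.randn(2, 16, 256)
    for blk in blocks:
        x = blk(x)
    print(x.shape)  # torch.Size([2, 16, 256])
```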
Community
Only a few lines of code changed, and we pushed deep LLMs to the next level.
With Keel, we scaled an LLM to 1000 layers, and the deeper we go, the more Keel pulls ahead of standard Pre-LN Transformers.
arXivlens breakdown of this paper: https://arxivlens.com/PaperView/Details/post-layernorm-is-back-stable-expressive-and-deep-4692-93927610
- Executive Summary
- Detailed Breakdown
- Practical Applications
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models (2025)
- The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss (2025)
- MIDUS: Memory-Infused Depth Up-Scaling (2025)
- VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse (2025)
- Data-Free Pruning of Self-Attention Layers in LLMs (2025)
- Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings (2025)
- Attention Projection Mixing with Exogenous Anchors (2026)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`