Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
Abstract
Models trained on random Boolean functions in a cellular automata framework show that increasing depth, recurrence, memory, and test-time compute scaling enhances multi-step reasoning capabilities.
Reasoning is a core capability of large language models, yet how they learn and perform multi-step reasoning remains an open problem. In this study, we explore how different architectures and training methods affect a model's multi-step reasoning capabilities within a cellular automata framework. By training on state sequences generated with random Boolean functions from random initial conditions, which rules out memorization, we demonstrate that most neural architectures learn to abstract the underlying rules. While models achieve high accuracy in next-state prediction, their performance declines sharply when multi-step reasoning is required. We confirm that increasing model depth plays a crucial role in sequential computation. We demonstrate that extending the effective model depth with recurrence, memory, and test-time compute scaling substantially enhances reasoning capabilities.
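To make the setup in the abstract concrete, here is a minimal sketch of how state sequences could be generated from a random Boolean rule and a random initial state. It is not the authors' data pipeline: the 1D grid, the periodic boundary, the lookup-table encoding of the rule, and all names (`random_rule`, `step`, `rollout`, `make_sample`, `width`, `radius`, `context_len`, `k`) are assumptions chosen for illustration.

```python
import random


def random_rule(radius, rng):
    """Sample a random Boolean function over a (2*radius + 1)-cell neighborhood.

    The function is stored as a lookup table with 2**(2*radius + 1) entries
    (an assumed encoding, not necessarily the one used in the paper).
    """
    return [rng.randint(0, 1) for _ in range(2 ** (2 * radius + 1))]


def step(state, rule, radius):
    """Apply the rule once to every cell, assuming wrap-around boundaries."""
    n = len(state)
    nxt = []
    for i in range(n):
        idx = 0
        # Read the neighborhood left to right as a binary number.
        for off in range(-radius, radius + 1):
            idx = (idx << 1) | state[(i + off) % n]
        nxt.append(rule[idx])
    return nxt


def rollout(state, rule, radius, k):
    """Apply the rule k times; k > 1 is the multi-step setting."""
    for _ in range(k):
        state = step(state, rule, radius)
    return state


def make_sample(width, radius, context_len, k, rng):
    """Build one sample: a fresh rule, a fresh initial state, a context of
    consecutive states, and the state k steps after the last context state."""
    rule = random_rule(radius, rng)
    state = [rng.randint(0, 1) for _ in range(width)]
    context = [state]
    for _ in range(context_len - 1):
        context.append(step(context[-1], rule, radius))
    target = rollout(context[-1], rule, radius, k)
    return context, target, rule


if __name__ == "__main__":
    rng = random.Random(0)
    context, target, _ = make_sample(width=12, radius=1, context_len=4, k=2, rng=rng)
    for row in context:
        print("".join(map(str, row)))
    print("target (k=2):", "".join(map(str, target)))
```

Because each sample draws a new rule and a new initial state, a model cannot succeed by memorizing trajectories; it has to infer the rule from the context states, which is the point the abstract makes about excluding memorization.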
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A Survey on Latent Reasoning (2025)
- The Serial Scaling Hypothesis (2025)
- The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner (2025)
- Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs (2025)
- Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling (2025)
- SABER: Switchable and Balanced Training for Efficient LLM Reasoning (2025)
- On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study (2025)
Wonderful paper! Thank you for the interesting read. I find your benchmark quite useful for testing the capabilities of small models being developed locally, and I have been having some fun with it. I have a question about something I didn't find in your paper but may have found in your code. It looks like you ran experiments with a batch size of 256 for 40k steps, so roughly 10 epochs? Just wondering if that is correct. Knowing this would help me work out compute equivalence with your charts so I can compare my results against them.
@tknecht
Glad to hear you found the work interesting! Yes, that’s correct—training was done for roughly 10 epochs. The training scripts are available here:
https://github.com/RodkinIvan/associative-recurrent-memory-transformer/tree/ACT/scripts/
In particular, see the cell_autom, ca_oo, and ca_grpo folders.
One note to avoid confusion: some transformer baselines were trained and evaluated as a single-segment ARMT. Since a one-segment ARMT is effectively equivalent to a transformer, this was done purely for code compatibility.
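For readers who want to compare their own runs against these figures, the arithmetic behind the batch-size, step, and epoch numbers in this thread can be sketched as follows. The dataset size computed here is only what those three numbers imply together; it is not a figure stated by the authors, and the helper names are hypothetical.

```python
def samples_seen(batch_size, steps):
    """Total training samples processed over a run."""
    return batch_size * steps


def implied_dataset_size(batch_size, steps, epochs):
    """Dataset size implied by a run length and an epoch count (an estimate)."""
    return samples_seen(batch_size, steps) // epochs


# Figures mentioned in the thread: batch size 256, 40k steps, roughly 10 epochs.
reference = samples_seen(256, 40_000)                 # 10_240_000 samples
implied = implied_dataset_size(256, 40_000, 10)       # 1_024_000 samples (implied)

# A smaller run, e.g. batch size 16 for 2,500 steps, sees far fewer samples.
small = samples_seen(16, 2_500)                       # 40_000 samples
ratio = reference / small                             # 256.0

if __name__ == "__main__":
    print(reference, implied, small, ratio)
```

Counting samples seen is only a rough proxy for compute, since it ignores model size and sequence length.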
Thanks for the information, @irodkin! There are several interesting things about your benchmark (a hard task with a short context and small vocabulary, which is good for ablations), and once I get clear results I might post a plot in this chat for future reference. As of now, a scaled-down variation of the Tiny Recursive Model (from the "Less is More" paper) seems to do very well: d=128, l=2, a single step with 3 cycles and 6 inner recursions, and no deep supervision. I am investigating why it seems to 'grok' the pattern behind the rules. Running at a low batch size (16) with one sample per unique rule, after 2500 steps (having seen 40k unique rules and their samples) it levels out at 95% accuracy for k=1 and 80% accuracy for a k=4 AR rollout. This is on a test set of unique rules not seen in training, while training on k=1 only (O-O), with no additional embeddings beyond tokens. Weird things must be happening within the latent space (~CoT) and the ACT-like iterative refinement. The TRM model is, of course, non-causal and not single-token AR. 🤷🏻‍♂️
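The gap between k=1 and k=4 accuracy described above comes from autoregressive rollout: each predicted state is fed back in as input, so a single wrong step corrupts everything after it. A minimal sketch of such an evaluation loop follows. The `predict_next` callable is a hypothetical stand-in for a trained model (here it is handed the true rule, which a real model would have to infer from context), and exact match over the whole state is an assumed metric, not necessarily the one used in the paper or in the comment.

```python
import random


def ca_step(state, rule):
    """One step of a radius-1 cellular automaton with wrap-around boundaries."""
    n = len(state)
    return [
        rule[(state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n]]
        for i in range(n)
    ]


def rollout_accuracy(predict_next, samples, k):
    """Fraction of samples whose k-step autoregressive prediction is exactly right.

    The predictor's own output is fed back as the next input, so errors compound.
    """
    correct = 0
    for state, rule in samples:
        truth, pred = state, state
        for _ in range(k):
            truth = ca_step(truth, rule)
            pred = predict_next(pred, rule)
        correct += int(pred == truth)
    return correct / len(samples)


def make_noisy_predictor(error_rate, rng):
    """Hypothetical imperfect model: correct step, then one bit flipped
    with probability error_rate."""
    def predict(state, rule):
        out = ca_step(state, rule)
        if rng.random() < error_rate:
            j = rng.randrange(len(out))
            out[j] ^= 1
        return out
    return predict


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [
        ([rng.randint(0, 1) for _ in range(16)], [rng.randint(0, 1) for _ in range(8)])
        for _ in range(500)
    ]
    noisy = make_noisy_predictor(error_rate=0.05, rng=rng)
    for k in (1, 2, 4):
        print(k, rollout_accuracy(noisy, samples, k))
```

With a per-step error rate like this, exact-match accuracy tends to fall as k grows, which is one plausible reading of why a model trained only on k=1 scores lower on a k=4 rollout.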







