arxiv:2508.16745

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Published on Aug 22, 2025

Submitted by Mikhail Burtsev on Aug 26, 2025

Authors:

Ivan Rodkin, Daniil Orel, …, …, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, …, …, …, Mikhail Burtsev

AI-generated summary

Experiments with models trained on state sequences generated by random Boolean functions in a cellular automata framework show that greater depth, recurrence, memory, and test-time compute scaling each improve multi-step reasoning.

Abstract

Reasoning is a core capability of large language models, yet understanding how they learn and perform multi-step reasoning remains an open problem. In this study, we explore how different architectures and training methods affect model multi-step reasoning capabilities within a cellular automata framework. By training on state sequences generated with random Boolean functions for random initial conditions to exclude memorization, we demonstrate that most neural architectures learn to abstract the underlying rules. While models achieve high accuracy in next-state prediction, their performance declines sharply if multi-step reasoning is required. We confirm that increasing model depth plays a crucial role for sequential computations. We demonstrate that an extension of the effective model depth with recurrence, memory, and test-time compute scaling substantially enhances reasoning capabilities.
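To make the setup concrete, here is a minimal sketch of how a state sequence can be generated from a random Boolean function in a one-dimensional cellular automaton. The neighborhood radius, lattice width, number of steps, and the periodic boundary condition are illustrative assumptions, not values taken from the paper, and the helper names (`random_rule`, `step`, `rollout`) are hypothetical.

```python
import random

def random_rule(radius, rng):
    """Sample a random Boolean function over neighborhoods of 2*radius + 1 cells."""
    table_size = 2 ** (2 * radius + 1)
    return [rng.randint(0, 1) for _ in range(table_size)]

def step(state, rule, radius):
    """Apply the rule once to every cell (assumed periodic boundaries)."""
    n = len(state)
    nxt = []
    for i in range(n):
        index = 0
        # Read the neighborhood left to right as a binary number.
        for offset in range(-radius, radius + 1):
            index = (index << 1) | state[(i + offset) % n]
        nxt.append(rule[index])
    return nxt

def rollout(state, rule, radius, steps):
    """Return the list of states [s_0, s_1, ..., s_steps]."""
    states = [state]
    for _ in range(steps):
        states.append(step(states[-1], rule, radius))
    return states

rng = random.Random(0)
RADIUS, WIDTH, STEPS = 2, 20, 20  # assumed sizes, for illustration only
rule = random_rule(RADIUS, rng)
initial = [rng.randint(0, 1) for _ in range(WIDTH)]
trajectory = rollout(initial, rule, RADIUS, STEPS)

# Next-state prediction asks for trajectory[t + 1] given the states up to t.
# A k-step task asks for trajectory[t + k] without showing the states in between.
for t, s in enumerate(trajectory[:5]):
    print(t, "".join(map(str, s)))
```

Because both the rule and the initial state are sampled at random, a model cannot succeed by memorizing trajectories; it has to infer the rule from the observed states and then apply it.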

Community

mbur

Paper author Paper submitter Aug 26, 2025

irodkin

Paper author Aug 31, 2025

ECA dataset used in the paper:

https://huggingface.co/datasets/irodkin/1dCA_r2s20T20

tknecht

7 days ago

Wonderful paper! Thank you for the interesting read. I find your benchmark quite useful for testing the capabilities of small models being developed locally, and I have been having some fun with it. I have a question about something I didn't find in your paper but may have found in your code. It looks like you ran experiments with a batch size of 256 for 40k steps, so roughly 10 epochs? Just wondering if that is correct. Knowing this would help me match the compute behind your charts so I can compare my results against them.

irodkin

Paper author 7 days ago

@tknecht
Glad to hear you found the work interesting! Yes, that’s correct—training was done for roughly 10 epochs. The training scripts are available here:
https://github.com/RodkinIvan/associative-recurrent-memory-transformer/tree/ACT/scripts/

In particular, see the cell_autom, ca_oo, and ca_grpo folders.

One note to avoid confusion: some transformer baselines were trained and evaluated as a single-segment ARMT. Since a one-segment ARMT is effectively equivalent to a transformer, this was done purely for code compatibility.
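For anyone matching compute against the numbers confirmed above, the arithmetic can be sketched as follows. The batch size, step count, and epoch count come from this thread; the implied dataset size is only an inference from those three numbers, not a figure reported by the authors, and `steps_for_same_budget` is a hypothetical helper.

```python
BATCH_SIZE = 256      # from the thread
TRAIN_STEPS = 40_000  # from the thread
EPOCHS = 10           # "roughly 10", as confirmed in the thread

# Total training samples processed over the whole run.
samples_seen = BATCH_SIZE * TRAIN_STEPS

# Inferred, not reported: dataset size that would make this about 10 epochs.
implied_dataset_size = samples_seen // EPOCHS

def steps_for_same_budget(batch_size, samples=samples_seen):
    """Optimizer steps needed to process the same number of samples."""
    return -(-samples // batch_size)  # ceiling division

print(samples_seen)               # 10240000
print(implied_dataset_size)       # 1024000
print(steps_for_same_budget(16))  # 640000
```

Matching the number of samples seen is only one notion of compute equivalence; it ignores differences in model size and per-sample cost.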

tknecht

3 days ago · edited 3 days ago

Thanks for the information, @irodkin! There are several interesting things about your benchmark (a hard task with short context and a small vocabulary, which is good for ablations), and once I get clear results I might toss a plot in this chat for future reference. As of now, a scaled-down variation of the Tiny Recursive Model (from the "Less is More" paper) seems to do excellently; the configuration is d=128, l=2, a single step with 3 cycles and 6 inner recursions, and no deep supervision. I am investigating why it seems to 'grok' the pattern behind the rules. Running at a low batch size (16) with one sample per unique rule, after 2500 steps (so it has seen 40k unique rules, one sample each) it levels out at 95% accuracy at k=1 and 80% accuracy for a k=4 AR rollout. That is on a test set of unique rules not seen in training, while being trained on k=1 only (O-O), with no additional embeddings beyond tokens. Weird things must occur within the latent space (~CoT) and in the ACT-like iterative refinement. The TRM model is of course non-causal and not single-token AR. 🤷🏻‍♂️
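The k=1 versus k=4 comparison above can be read as the following evaluation loop: a one-step predictor is fed its own output k times and the result is compared with the true state after k steps. This is a simplified sketch, not the commenter's or the authors' evaluation code. It assumes a single fixed rule, periodic boundaries, and an exact-match metric, and `flawed_predictor` is a hypothetical stand-in for a trained model (the true rule with one table entry flipped).

```python
import random

def ca_step(state, rule, radius):
    """One true cellular automaton update (assumed periodic boundaries)."""
    n = len(state)
    out = []
    for i in range(n):
        idx = 0
        for off in range(-radius, radius + 1):
            idx = (idx << 1) | state[(i + off) % n]
        out.append(rule[idx])
    return out

def rollout_accuracy(predict_next, rule, radius, initial_states, k):
    """Exact-match rate after feeding the predictor its own output k times."""
    hits = 0
    for state in initial_states:
        truth, guess = state, state
        for _ in range(k):
            truth = ca_step(truth, rule, radius)
            guess = predict_next(guess)
        hits += int(guess == truth)
    return hits / len(initial_states)

rng = random.Random(0)
RADIUS, WIDTH = 1, 12  # assumed sizes, for illustration only
rule = [rng.randint(0, 1) for _ in range(2 ** (2 * RADIUS + 1))]
states = [[rng.randint(0, 1) for _ in range(WIDTH)] for _ in range(200)]

# Hypothetical imperfect model: the true rule with one table entry flipped.
flawed_rule = list(rule)
flawed_rule[5] ^= 1

def perfect_predictor(state):
    return ca_step(state, rule, RADIUS)

def flawed_predictor(state):
    return ca_step(state, flawed_rule, RADIUS)

for k in (1, 4):
    print(k,
          rollout_accuracy(perfect_predictor, rule, RADIUS, states, k),
          rollout_accuracy(flawed_predictor, rule, RADIUS, states, k))
```

A predictor that is exactly right stays right at any k, while small one-step errors can compound under rollout, which is one plausible reading of the gap between the k=1 and k=4 numbers reported above.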
