ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall Paper • 2510.07896 • Published Oct 9, 2025 • 11
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL Paper • 2605.18703 • Published 17 days ago • 50
Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs Paper • 2605.30501 • Published 7 days ago • 29
MIB Datasets Collection The tasks and counterfactuals from the Mechanistic Interpretability Benchmark. • 7 items • Updated Apr 16, 2025 • 5
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning Paper • 2601.21037 • Published Jan 28 • 15
IntrEx: A Dataset for Modeling Engagement in Educational Conversations Paper • 2509.06652 • Published Sep 8, 2025 • 26
Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation? Paper • 2508.19827 • Published Aug 27, 2025 • 33
Spectrum Projection Score: Aligning Retrieved Summaries with Reader Models in Retrieval-Augmented Generation Paper • 2508.05909 • Published Aug 8, 2025 • 21
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems Paper • 2508.07407 • Published Aug 10, 2025 • 99