Most AI agents today have a fundamental amnesia problem. Deploy one to browse the web, resolve GitHub issues, or navigate a shopping platform, and it approaches every single task as if it has never seen anything like it before. No matter how many times it has stumbled on the same type of problem, it repeats the same mistakes.
Valuable lessons evaporate the moment a task ends. A team of researchers from Google Cloud AI, the University of Illinois Urbana-Champaign, and Yale University introduces ReasoningBank, a memory framework that doesn't just record what an agent did: it distills why something worked or failed into reusable, generalizable reasoning strategies.

The Problem with Existing Agent Memory

To understand why ReasoningBank is important, you need to understand what existing agent memory actually does.
Two popular approaches are trajectory memory (used in a system called Synapse) and workflow memory (used in Agent Workflow Memory, or AWM). Trajectory memory stores raw action logs — every click, scroll, and typed query an agent executed. Workflow memory goes a step further and extracts reusable step-by-step procedures from successful runs only.
Both have critical blind spots. Raw trajectories are noisy and too long to be directly useful for new tasks. Workflow memory only mines successful attempts, which means the rich learning signal buried in every failure (and agents fail a lot) gets completely discarded.

How ReasoningBank Works

ReasoningBank operates as a closed-loop memory process with three stages that run around every completed task: memory retrieval, memory extraction, and memory consolidation. (Figure source: https://arxiv.org/pdf/2509.25140)

Before an agent starts a new task, it queries ReasoningBank using embedding-based similarity search to retrieve the top-k most relevant memory items.
Those items get injected directly into the agent’s system prompt as additional context. Importantly, the default is k=1, a single retrieved memory item per task. Ablation experiments show that retrieving more memories actually hurts performance: success rate drops from 49.7% at k=1 to 44.4% at k=4.
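The retrieval step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `embed` function is a deterministic stub standing in for a real embedding model, and the example memory titles are invented.

```python
import hashlib
import math
import random

def embed(text: str, dim: int = 8) -> list[float]:
    """Stub embedding: a deterministic unit vector derived from the text.
    A real system would call an embedding model here."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, memory: list[dict], k: int = 1) -> list[dict]:
    """Return the top-k stored items by cosine similarity to the query.
    k defaults to 1, matching the paper's best-performing setting."""
    q = embed(query)
    scored = sorted(memory, key=lambda item: cosine(q, item["embedding"]),
                    reverse=True)
    return scored[:k]

# Hypothetical memory items with pre-computed embeddings.
memory = [{"title": t, "embedding": embed(t)} for t in [
    "Filter products before sorting",
    "Use site search instead of scrolling",
    "Check the order-history page for past purchases",
]]
top = retrieve("find my previous orders", memory)  # k=1 by default
```

The retrieved item's text would then be prepended to the agent's system prompt as extra context before the task begins.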
The quality and relevance of retrieved memory matter far more than quantity. Once the task is finished, a Memory Extractor — powered by the same backbone LLM as the agent — analyzes the trajectory and distills it into structured memory items. Each item has three components: a title (a concise strategy name), a description (a one-sentence summary), and content (1–3 sentences of distilled reasoning steps or operational insights).
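The three-field schema might look like the following sketch. The field names come from the paper's description; the dataclass, the JSON serialization, and the example strategies are illustrative assumptions.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MemoryItem:
    title: str        # concise strategy name
    description: str  # one-sentence summary
    content: str      # 1-3 sentences of distilled reasoning or operational insight

# A success contributes a validated strategy...
success_item = MemoryItem(
    title="Use the site's search bar for product lookups",
    description="Searching directly beats browsing category pages.",
    content="When the task names a specific product, type it into the search "
            "bar first; fall back to category navigation only if search fails.",
)

# ...while a failure contributes a counterfactual pitfall.
failure_item = MemoryItem(
    title="Do not assume the first search result is correct",
    description="Picking the top result without checking attributes led to a wrong purchase.",
    content="Verify product attributes (size, color, model) against the query "
            "before adding an item to the cart.",
)

# Consolidation: items are appended to a JSON-backed store.
store = json.dumps([asdict(success_item), asdict(failure_item)], indent=2)
```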
Crucially, the extractor treats successful and failed trajectories differently: successes contribute validated strategies, while failures supply counterfactual pitfalls and preventative lessons. To decide whether a trajectory was successful or not — without access to ground-truth labels at test time — the system uses an LLM-as-a-Judge, which outputs a binary “Success” or “Failure” verdict given the user query, the trajectory, and the final page state. The judge doesn’t need to be perfect; ablation experiments show ReasoningBank remains robust even when judge accuracy drops to around 70%.
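A minimal LLM-as-a-Judge sketch is shown below. The prompt wording and the `call_llm` helper are assumptions; the paper specifies only the judge's inputs (query, trajectory, final page state) and its binary verdict.

```python
JUDGE_PROMPT = """You are evaluating a web agent.
User query: {query}
Trajectory: {trajectory}
Final page state: {final_state}
Did the agent accomplish the user's goal? Answer with exactly one word:
Success or Failure."""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return "Success" if "order placed" in prompt.lower() else "Failure"

def judge(query: str, trajectory: list[str], final_state: str) -> bool:
    """Return True iff the judge labels the trajectory a success."""
    prompt = JUDGE_PROMPT.format(
        query=query,
        trajectory=" -> ".join(trajectory),
        final_state=final_state,
    )
    return call_llm(prompt).strip().lower().startswith("success")

verdict = judge(
    "Buy a blue mug",
    ["search 'blue mug'", "open result", "add to cart", "checkout"],
    "Confirmation page: order placed.",
)
```

Because the verdict only gates whether a trajectory feeds the "validated strategy" or "pitfall" extraction path, an imperfect judge degrades memory quality gracefully rather than breaking the loop.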
New memory items are then appended directly to the ReasoningBank store, maintained as JSON with pre-computed embeddings for fast cosine-similarity search, completing the loop.

MaTTS: Pairing Memory with Test-Time Scaling

The research team goes further and introduces memory-aware test-time scaling (MaTTS), which links ReasoningBank with test-time compute scaling, a technique that has already proven powerful in math reasoning and coding tasks. The insight is simple but important: scaling at test time generates multiple trajectories for the same task.
Instead of just picking the best answer and discarding the rest, MaTTS uses the full set of trajectories as rich contrastive signals for memory extraction. MaTTS comes in two variants. Parallel scaling generates k independent trajectories for the same query, then uses self-contrast, comparing what went right and wrong across all trajectories, to extract higher-quality, more reliable memory items.
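The parallel variant can be sketched as follows. `run_agent` and `self_contrast` are illustrative stubs: the real system would execute full rollouts and prompt an LLM with all k trajectories at once to do the contrasting.

```python
import random

def run_agent(query: str, seed: int) -> dict:
    """Stub rollout: each seed yields a trajectory and a judged outcome."""
    rng = random.Random(seed)
    return {
        "trajectory": [f"step-{i}" for i in range(rng.randint(2, 5))],
        "success": rng.random() > 0.4,
    }

def self_contrast(rollouts: list[dict]) -> list[str]:
    """Distill memory by contrasting successes and failures across rollouts."""
    lessons = []
    if any(r["success"] for r in rollouts):
        lessons.append("validated strategy drawn from successful rollouts")
    if any(not r["success"] for r in rollouts):
        lessons.append("pitfall to avoid, observed in failed rollouts")
    return lessons

k = 5  # the paper reports parallel scaling peaking around k=5 on WebArena-Shopping
rollouts = [run_agent("find cheapest laptop", seed=s) for s in range(k)]
memories = self_contrast(rollouts)
```

The key design choice is that failed rollouts are not discarded: having both outcomes in the same batch is exactly what gives self-contrast its signal.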
Sequential scaling iteratively refines a single trajectory using self-refinement, capturing intermediate corrections and insights as memory signals. The result is a positive feedback loop: better memory guides the agent toward more promising rollouts, and richer rollouts forge even stronger memory. The paper notes that at k=5, parallel scaling (55.1% SR) edges out sequential scaling (54.5% SR) on WebArena-Shopping: sequential gains saturate quickly once the model reaches a decisive success or failure, while parallel scaling keeps providing diverse rollouts that the agent can contrast and learn from.

Results Across Three Benchmarks

Tested on WebArena (a web navigation benchmark spanning shopping, admin, GitLab, and Reddit tasks), Mind2Web (which tests generalization across cross-task, cross-website, and cross-domain sett
