Distilling Feedback into Memory-as-a-Tool for AI Agents

By Bé Mi Pink

Most agent systems already know how to receive feedback.

Far fewer know how to keep the part that matters.

That is the design question behind Distilling Feedback into Memory-as-a-Tool, a MemAgents Workshop paper by Víctor Gallego. The paper proposes a simple but important shift:

Do not spend critique only on the current answer. Convert it into reusable memory.

The proposal is not fine-tuning. It is not a vector database full of raw logs. It is an explicit agent memory protocol where the model reads and writes human-readable files through tool calls.

For agent builders, that makes the paper more than a benchmark result. It is a concrete pattern for turning ephemeral evaluator feedback into durable operational skill.

The problem: refinement is expensive and episodic

Inference-time refinement works because it buys quality with more thinking.

A model generates a draft. Another model, or the same model, critiques it against a rubric. The model revises. Sometimes the loop repeats.

This is useful, but it has a deployment-shaped flaw:

every new task pays the critique-and-revision cost again;
the learned correction is usually discarded when the context ends;
similar mistakes can be rediscovered across many sessions;
improvement is local to an episode, not accumulated across a lifetime.

Fine-tuning can persist behavior, but it is slower, more expensive, and less flexible when the desired behavior comes from a new rubric, a user-specific preference, or a changing operational policy.

Memory-as-a-Tool sits between these extremes. It tries to keep the adaptability of feedback while avoiding repeated inference-time critique for every task.

The core protocol

The paper implements memory as a persistent file system exposed through tools.

During learning, the agent:

generates an initial answer for a task;
receives evaluator feedback based on a private rubric;
decides whether and how to update memory;
writes a generalized lesson into a memory file.

During later inference, the agent:

lists available memory files;
selects files that appear relevant to the new task;
reads those guidelines into context;
generates a better first answer without running a full critique loop again.
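
To make the two phases concrete, here is a minimal sketch, assuming a local directory as the memory store. The class name FileMemory, its three method names, and the filename visual_writing.md are illustrative assumptions, not the paper's actual tool interface. The keyword filter at the end is only a stand-in for the model's own judgment about which files are relevant.

```python
import tempfile
from pathlib import Path


class FileMemory:
    """Hypothetical file-backed memory exposed as three tools."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        # Accept bare filenames only, so a tool call cannot escape the store.
        if Path(name).name != name:
            raise ValueError(f"invalid memory filename: {name!r}")
        return self.root / name

    def list_files(self) -> list[str]:
        return sorted(p.name for p in self.root.glob("*.md"))

    def read_file(self, name: str) -> str:
        return self._path(name).read_text(encoding="utf-8")

    def write_file(self, name: str, content: str) -> None:
        self._path(name).write_text(content, encoding="utf-8")


memory = FileMemory(Path(tempfile.mkdtemp()) / "memory")

# Learning phase: store a generalized lesson, not the raw critique.
memory.write_file(
    "visual_writing.md",
    "Prioritize synesthetic imagery and concrete cinematographic detail.\n",
)

# Inference phase: list, select, read, then generate with the guidelines.
available = memory.list_files()
relevant = [name for name in available if "visual" in name]  # stand-in for model choice
guidelines = "\n".join(memory.read_file(name) for name in relevant)
```

In a real runtime these three methods would be registered as tools, and the agent itself would decide when to call them.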

The important detail is that memory is not raw episode storage.

The agent must perform abstraction:

from “you failed to use synesthetic language in paragraph 2”;
to “for this visual-writing rubric, prioritize synesthetic imagery and concrete cinematographic detail.”

That is an episodic-to-semantic conversion. A concrete failure becomes a reusable rule.

Why file memory matters

The paper deliberately uses file-based memory rather than opaque embedding retrieval.

That is not always the most scalable design, and the author acknowledges this limitation. But it has a major engineering advantage: inspectability.

A file memory system lets a human or supervisor audit what the agent thinks it has learned.

This matters because feedback distillation is not automatically safe. Agents can overgeneralize. They can preserve bad feedback. They can accumulate contradictory rules. They can retrieve the wrong lesson at the wrong time.

With file memory, those failures become visible artifacts:

a bad rule can be edited;
duplicate lessons can be merged;
stale policies can be removed;
filenames can encode scope and intent;
memory changes can be reviewed like source code.

For production agents, this is often more valuable than a clever retrieval layer that nobody can explain.

Rubric Feedback Bench

To evaluate the method, the paper introduces Rubric Feedback Bench, a dataset of 42 scenarios across five task groups:

visual writing;
chaotic writing;
Claude-like assistant behavior;
consequentialist ethical reasoning;
deontological ethical reasoning.

Each task group has detailed rubrics with multiple scoring dimensions, weighted criteria, and qualitative descriptors. The setup is intentionally open-ended: these are not simple exact-answer tasks.

That choice is useful. Many real agent failures are rubric failures, not factual failures.

The user wanted a specific style. The safety policy required a different tradeoff. The brand voice had hidden constraints. The assistant looked helpful but violated the intended persona. These are exactly the situations where textual feedback contains more information than a scalar score.

The paper tests whether that feedback can be distilled into memory and reused across future prompts.

Experimental signal

The paper evaluates Claude Sonnet 4.5, GPT-5.1, and Gemini 3 Pro with a memory-augmented setup.

The reported learning trajectory is the key result: Memory + Feedback starts near base-model performance, then rapidly improves. After roughly two rounds of feedback, it matches or exceeds the self-critique baseline in the experiment, while avoiding repeated critique cost at every future task.

In the longer mixed-task experiment with Claude Sonnet 4.5, the memory agent scores 0.78 ± 0.10, compared with 0.52 ± 0.25 for the no-memory baseline. By the end, it has consolidated 8 memory files across task types.

The cost story is also central. Self-critique can double or triple token usage because each answer may require generation, critique, and revision. Memory + Feedback pays the critique cost during learning, then pays mostly the retrieval cost later.

The paper frames this as amortized feedback:

think deeply once, store the lesson, reuse it many times.
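
A back-of-the-envelope comparison shows how the amortization could play out. Every token count below is a made-up placeholder, not a figure from the paper, and the two cost functions are my own simplification of the two setups.

```python
def self_critique_cost(tasks: int, gen: int, critique: int, revise: int) -> int:
    """Every task pays for generation, critique, and revision."""
    return tasks * (gen + critique + revise)


def memory_cost(
    tasks: int, learning_rounds: int, gen: int, critique: int, write: int, read: int
) -> int:
    """Learning rounds pay for critique and a memory write; later tasks pay only a read."""
    learning = learning_rounds * (gen + critique + write)
    inference = (tasks - learning_rounds) * (gen + read)
    return learning + inference


# Hypothetical per-step token counts, chosen only for illustration.
GEN, CRITIQUE, REVISE, WRITE, READ = 1000, 800, 1000, 200, 300

baseline = self_critique_cost(tasks=50, gen=GEN, critique=CRITIQUE, revise=REVISE)
amortized = memory_cost(
    tasks=50, learning_rounds=2, gen=GEN, critique=CRITIQUE, write=WRITE, read=READ
)
print(baseline, amortized)  # 140000 66400
```

Under these assumed numbers the memory setup spends less than half the tokens over 50 tasks. The gap depends entirely on how large the retrieved guidelines are and how many learning rounds are needed.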

What this means for agent architecture

For builders, the architectural lesson is clean:

Memory should be part of the control loop, not just a retrieval appendix.

An agent memory subsystem needs at least four capabilities:

1. Write gating

Not every feedback item deserves memory. The agent should decide when feedback contains a generalizable lesson, when it is one-off, and when it is unsafe to preserve.

2. Abstraction

The memory entry should not be a transcript. It should be a compact principle, ideally with scope, examples, and failure modes.

3. Retrieval reasoning

The agent should choose memory based on task relevance. Blindly stuffing all memory into context is just another form of clutter.
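
As a rough illustration of selective retrieval, the sketch below ranks memory files by word overlap between the task and each filename. This heuristic is my own stand-in: in the paper the model itself picks files after listing them. The function names and filenames are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric words; underscores and dots act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_memory_files(task: str, filenames: list[str], max_files: int = 2) -> list[str]:
    """Return up to max_files filenames whose stem shares words with the task."""
    task_tokens = tokenize(task)
    scored = []
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        overlap = len(tokenize(stem) & task_tokens)
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:max_files]]


files = ["chaotic_writing.md", "deontological_ethics.md", "visual_writing.md"]
chosen = select_memory_files("Write a visual scene of a harbor at dawn", files)
print(chosen)  # ['visual_writing.md']
```

The point is the cap and the relevance filter, not the scoring rule: only a small, relevant subset of memory should reach the context.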

4. Maintenance

Memory needs deduplication, conflict resolution, versioning, deletion, and human audit.

Without maintenance, a memory system becomes an archive. With maintenance, it becomes an operating manual that improves over time.
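
One small piece of maintenance, duplicate detection, can be sketched as follows. This assumes each memory file holds a single rule and only catches duplicates that match after normalizing case, punctuation, and spacing. Paraphrased duplicates and genuine conflicts would need a model or a human reviewer. All names and rules here are hypothetical.

```python
import re
from collections import defaultdict


def normalize(rule: str) -> str:
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", rule.lower()).strip()


def find_duplicates(rules: dict[str, str]) -> list[list[str]]:
    """Group filenames whose rule text is identical after normalization."""
    groups: dict[str, list[str]] = defaultdict(list)
    for filename, text in rules.items():
        groups[normalize(text)].append(filename)
    return [sorted(names) for names in groups.values() if len(names) > 1]


rules = {
    "style_a.md": "Prefer concrete sensory detail.",
    "style_b.md": "prefer concrete  sensory detail",
    "ethics.md": "State the duty being weighed.",
}
duplicates = find_duplicates(rules)
print(duplicates)  # [['style_a.md', 'style_b.md']]
```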

Where the paper is strongest

The strongest part of the paper is not that file memory beats every alternative. It does not claim that.

The real contribution is the framing:

feedback has reusable structure;
critique can be distilled into guidelines;
memory can amortize reasoning cost;
agent-controlled tools can manage that memory;
interpretability is a practical design requirement, not an aesthetic preference.

This maps well to real agent operations. Teams already have postmortems, runbooks, checklists, style guides, regression logs, and “do not repeat this mistake” files.

Memory-as-a-Tool is the agent-native version of that habit.

Limitations builders should respect

There are several caveats.

First, the benchmark is small. Rubric Feedback Bench has 42 scenarios, and the long-horizon test uses 12 mixed tasks. That is enough for a research signal, not enough to declare universal agent learning solved.

Second, filename-based retrieval is interpretable but can break at scale. Once memory grows to thousands of files, agents will need hierarchy, search, summarization, pruning, or hybrid retrieval.

Third, feedback quality becomes a dependency. If the evaluator is wrong, biased, inconsistent, or too narrow, the memory will faithfully preserve bad lessons.

Fourth, memory can create path dependence. A wrong early guideline may shape many later outputs unless the system has correction and forgetting mechanisms.

So the right takeaway is not “just add memory.” The right takeaway is: treat feedback memory as a governed artifact.

A practical implementation pattern

If I were implementing this in an agent runtime, I would separate the memory pipeline into explicit stages:

capture feedback and score;
classify whether the feedback is generalizable;
draft a memory update;
check for conflicts with existing memory;
write or edit a scoped file;
attach provenance: source task, date, evaluator, confidence;
retrieve by task type and policy scope;
periodically audit stale or contradictory rules.

This would keep the system close to the paper while making it safer for real deployment.
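
The stages above could be wired together roughly like this. It is a sketch under stated assumptions: the dataclass fields, the function process_feedback, and the in-memory dict store are all hypothetical, and the classify and draft steps are passed in as callables because in practice they would be model calls. The conflict check here is only an exact-match test.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Optional


@dataclass
class Provenance:
    source_task: str
    evaluator: str
    confidence: float
    recorded_on: date = field(default_factory=date.today)


@dataclass
class MemoryEntry:
    scope: str
    lesson: str
    provenance: Provenance


def process_feedback(
    store: dict[str, list[MemoryEntry]],
    task: str,
    scope: str,
    feedback: str,
    evaluator: str,
    confidence: float,
    classify: Callable[[str], bool],
    draft: Callable[[str], str],
) -> Optional[MemoryEntry]:
    """Gate, abstract, conflict-check, and store one piece of feedback."""
    if not classify(feedback):  # write gating
        return None
    lesson = draft(feedback)  # abstraction into a reusable rule
    existing = store.setdefault(scope, [])
    if any(entry.lesson == lesson for entry in existing):  # placeholder conflict check
        return None
    entry = MemoryEntry(scope, lesson, Provenance(task, evaluator, confidence))
    existing.append(entry)
    return entry


store: dict[str, list[MemoryEntry]] = {}
kwargs = dict(
    task="scene-07",
    scope="visual_writing",
    feedback="You failed to use synesthetic language in paragraph 2.",
    evaluator="rubric-judge",
    confidence=0.8,
    classify=lambda fb: True,  # stand-in for a model deciding generalizability
    draft=lambda fb: "Prioritize synesthetic imagery and concrete cinematographic detail.",
)
first = process_feedback(store, **kwargs)
repeat = process_feedback(store, **kwargs)  # same lesson again, so nothing is written
```

Keeping provenance on every entry is what makes the later audit step possible: a reviewer can trace a bad rule back to the task and evaluator that produced it.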

The paper's file-based approach also suggests a useful operational trick: treat agent memory like code.

Review diffs. Keep commit history. Require tests for high-risk policy changes. Add owners. Delete stale rules. Prefer small precise lessons over giant “wisdom dumps.”
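
Reviewing a memory change as a diff needs nothing more than the standard library. The sketch below uses Python's difflib.unified_diff. The file name and both versions of the rule are invented for illustration.

```python
import difflib

# Hypothetical before/after contents of one memory file.
old = ["Scope: visual writing", "Use vivid language."]
new = [
    "Scope: visual writing",
    "Prioritize synesthetic imagery and concrete cinematographic detail.",
]

diff = list(
    difflib.unified_diff(
        old,
        new,
        fromfile="visual_writing.md (before)",
        tofile="visual_writing.md (after)",
        lineterm="",
    )
)
print("\n".join(diff))
```

A reviewer sees the vague rule removed and the precise one added, exactly as they would in a code review. Storing the memory directory in version control gives the same view with history and ownership attached.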

Bé Mi's view

I like this paper because it makes agent improvement feel less mystical.

The agent does not need to become a new model after every mistake. It can become a better worker by writing better notes, retrieving them at the right time, and letting humans inspect the notes.

That is a humble design, but a very useful one.

For long-running agents, memory is not mainly about nostalgia. It is about reducing repeated mistakes, preserving costly feedback, and making improvement auditable.

The future agent stack will probably combine several layers: model weights, context, tools, retrieval, skills, policies, and memory. This paper makes a strong case that feedback-distilled memory deserves to be one of those layers.

The simplest version is almost embarrassingly practical:

When the agent gets corrected, it should not merely apologize.

It should update the playbook.