Skill Curation Is the Missing Operating System for Long-Term Agents
SkillOS reframes long-term agent memory as a maintenance policy: reusable skills are not merely stored as Markdown, but inserted, updated, pruned, and evaluated by their downstream effect on future tasks.

Source: “SkillOS: Learning Skill Curation for Self-Evolving Agents” by Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra, Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, Maohao Shen, Vishy Tirumalashetty, George Lee, Jiawei Han, Tomas Pfister, and Chen-Yu Lee. arXiv:2605.06614v1, May 7, 2026.
Long-term agents do not need bigger notebooks.
They need maintainers.
That is the most useful way to read SkillOS. The paper is not just another “agents need memory” story. It is about the operational layer that decides which experiences become reusable procedures, which old procedures get revised, and which pieces of memory should be deleted before they poison future work.
A lot of agent systems can write down what happened. Far fewer can maintain a skill repository so that yesterday’s trajectory becomes tomorrow’s competence.
SkillOS treats that maintenance problem as a trainable capability.
The problem: agents accumulate traces, not capabilities
Most agent memory systems start with a reasonable instinct: save the trajectory, summarize the failure, extract the lesson, retrieve it later.
That helps, but it also creates a familiar failure mode. Over time, the memory store turns into an attic: duplicates, stale workarounds, over-specific notes, generic advice, conflicting instructions, and polished summaries that do not actually help the next task.
The hard part is not storage. Storage is cheap.
The hard part is deciding what deserves to survive as procedure.
That is why SkillOS focuses on skill curation rather than merely skill use. A skill repository is useful only if it is actively maintained: new skills inserted, old skills updated, bad skills removed, and similar skills merged or compressed into something the executor can actually apply.
In other words, agent memory needs operations, not just append.
The architecture: executor and curator are different jobs
SkillOS separates the system into two roles:
- Agent Executor: a frozen agent that retrieves and applies skills while solving tasks.
- Skill Curator: a trainable policy that edits the external SkillRepo after observing task experience.
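The two-role loop can be sketched in a few lines. This is a minimal illustration of the separation, not the paper's implementation: the `executor` and `curator` callables, the operation names, and the dict-backed repo are all stand-ins.

```python
def run_task(task, skill_repo, executor, curator):
    """One pass of the two-role loop: the frozen executor solves the task
    using the current skill repository; the trainable curator then edits
    the repository based on the resulting experience."""
    # Executor: optimized for immediate action.
    trajectory, outcome = executor(task, skill_repo)
    # Curator: responsible for delayed usefulness. It may also emit no ops.
    for op, name, body in curator(trajectory, outcome, skill_repo):
        if op in ("insert_skill", "update_skill"):
            skill_repo[name] = body
        elif op == "delete_skill":
            skill_repo.pop(name, None)
    return outcome
```

Keeping the curator's edits behind a small, named operation set is what makes its behavior auditable and trainable.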
That separation matters.
Solving a task and maintaining reusable knowledge are not the same cognitive job. The executor is optimized for immediate action. The curator is responsible for delayed usefulness. It sees the trajectory, the result, and related existing skills, then decides whether to call operations such as insert_skill, update_skill, or delete_skill.
The executor is the worker.
The curator is the maintainer.
That division is the paper’s most builder-relevant design move. It suggests that long-term agent systems should not treat memory maintenance as a casual post-task note. It should be an explicit subsystem with its own policy, inputs, tools, rewards, and failure modes.
The representation: skills are procedural assets, not memory blobs
SkillOS represents skills as Markdown files with YAML frontmatter and Markdown instructions. That makes the design feel very close to real agent skill systems: a skill has a name, a description of when to use it, and an instruction body containing workflows, constraints, heuristics, and reusable procedures.
Markdown is a good substrate because it is easy to retrieve, edit, inspect, version, and audit.
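Concretely, a skill file and a minimal parser might look like the sketch below. The frontmatter fields and the skill content are illustrative, not the paper's schema, and the parser assumes flat `key: value` frontmatter with no nesting.

```python
def parse_skill(text):
    """Split a skill file into YAML-frontmatter fields and a Markdown body."""
    # A frontmatter-bearing file is "---\n<frontmatter>---\n<body>".
    _, frontmatter, body = text.split("---\n", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

# Hypothetical skill file in the Markdown-plus-frontmatter format.
SKILL = """---
name: recover-from-stale-login
description: Use when a web task fails with an expired session.
---
1. Detect the login redirect.
2. Re-authenticate before retrying the original action.
"""
```

The same plain-text property that makes the file easy to parse also makes it easy to diff and version-control.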
But the format is not the point.
A skill is not valuable because it looks like a clean Markdown document. It is valuable if it changes downstream behavior. A beautiful skill that never helps the executor is just decorative memory.
That distinction is important for anyone building agent systems. If your memory pipeline rewards nice summaries, you will get nice summaries. If it rewards downstream task lift, you may get reusable competence.
SkillOS aims at the second target.
The training signal: downstream usefulness beats pretty markdown
The paper’s training setup is built around grouped task streams.
Instead of evaluating a skill update only at the moment it is written, SkillOS groups related tasks. Earlier trajectories update the SkillRepo. Later related tasks evaluate whether those updates actually help the executor.
This is the key idea behind the thesis:
Memory maintenance is a learned policy, not a prompt trick.
The curator is trained with reinforcement learning, using composite rewards that combine several signals:
- downstream task outcome;
- valid function calls for skill operations;
- content quality;
- compression or compactness of the SkillRepo.
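A composite reward of this shape can be sketched as a weighted sum. The weights, the compactness formula, and the signal ranges below are illustrative assumptions, not the paper's values; the only structural claim is that downstream outcome dominates the smaller shaping terms.

```python
def curator_reward(downstream_success, valid_call_rate,
                   content_quality, repo_tokens, token_budget,
                   w=(1.0, 0.2, 0.2, 0.2)):
    """Composite curator reward: downstream task outcome dominates, with
    smaller terms for valid skill-operation calls, content quality, and
    keeping the SkillRepo compact. Weights here are illustrative."""
    # Compactness: 1.0 for an empty repo, 0.0 at or past the token budget.
    compactness = max(0.0, 1.0 - repo_tokens / token_budget)
    signals = (float(downstream_success), valid_call_rate,
               content_quality, compactness)
    return sum(wi * si for wi, si in zip(w, signals))
```

Note that `downstream_success` is measured on *later* tasks in the group, which is what makes the reward delayed rather than a judgment on the freshly written skill.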
The delayed part matters. A skill can sound plausible right after a trajectory and still be useless for the next related task. Grouped task streams make curation pay rent later.
That is a better approximation of real long-term agents. In real work, the value of a lesson is not how elegant it sounds when written. The value is whether it prevents the next mistake.
The results: curation improves both effectiveness and efficiency
SkillOS is evaluated on multi-turn agentic tasks and single-turn reasoning tasks, including ALFWorld, WebShop, AIME24, AIME25, and GPQA-Diamond.
The headline numbers are worth noting, but they should be read as evidence for the design pattern rather than universal constants.
On ALFWorld with a Qwen3-8B executor, SkillOS raises average success rate from 55.7 (the strongest baseline, ReasoningBank) to 61.2. With a Gemini-2.5-Pro executor, it raises average success rate from 66.4 without memory to 80.2. The paper also reports up to a +9.8% relative performance improvement and a 6.0% reduction in interaction steps compared with the strongest baseline in its setup.
One especially interesting result: a trained Qwen3-8B curator can outperform using Gemini-2.5-Pro directly as a zero-shot curator in several setups.
That is a useful corrective to a common assumption.
A stronger model is not automatically a better memory maintainer. Curation is not just intelligence in the abstract. It is alignment with the executor’s needs, the task stream, the repository format, and the downstream evaluation loop.
The OS metaphor: memory needs insert, update, delete
Calling this an “operating system” is not just branding.
Long-term agent memory needs something like OS-level maintenance operations:
- Insert when a new reusable procedure appears.
- Update when a skill is incomplete, stale, or contradicted by new experience.
- Delete when a skill is harmful, redundant, or too narrow to keep.
- Compress when a procedure can be made shorter without losing the conditions that make it useful.
- Resolve conflicts when two skills give different advice.
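These maintenance operations can be sketched as methods on a small repository class. Everything below is a toy stand-in: skills are plain strings, and near-duplicate detection uses crude token-overlap (Jaccard) similarity rather than whatever the real system would use.

```python
class SkillRepo:
    """Skill store with OS-style maintenance operations (minimal sketch)."""

    def __init__(self):
        self.skills = {}  # name -> instruction text

    def insert(self, name, body):
        self.skills[name] = body

    def update(self, name, body):
        if name in self.skills:
            self.skills[name] = body

    def delete(self, name):
        self.skills.pop(name, None)

    def merge(self, keep, drop):
        """Resolve near-duplicates by folding `drop` into `keep`."""
        if drop in self.skills:
            merged = self.skills.get(keep, "") + "\n" + self.skills.pop(drop)
            self.skills[keep] = merged.strip()

    def near_duplicates(self, threshold=0.8):
        """Flag skill pairs whose token-set Jaccard overlap is high."""
        names = list(self.skills)
        pairs = []
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                ta = set(self.skills[a].split())
                tb = set(self.skills[b].split())
                if len(ta & tb) / len(ta | tb) >= threshold:
                    pairs.append((a, b))
        return pairs
```

The point of the sketch is the interface: update, delete, and merge are peers of insert, not afterthoughts.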
Most agent memory demos overemphasize insertion. They show the agent learning by adding more notes.
But real skill repositories also rot by accumulation. A repo can become worse because it remembers too much: obsolete workarounds, duplicate advice, overfit task-specific instructions, and generic “be careful” lessons that consume context without improving behavior.
Pruning is not forgetting as failure.
Pruning is memory hygiene.
Three practical rules for agent builders
1. Separate execution from curation
Do not assume the task-solving loop is also the right loop to maintain long-term procedural memory.
The executor should solve the task. The curator should inspect the trajectory, compare it with existing skills, and decide whether the repo needs an insert, update, delete, merge, or no-op.
This separation also makes debugging easier. If task performance regresses, you can ask whether the executor failed to use a good skill, or whether the curator wrote a bad one.
2. Evaluate skills by downstream lift, not readability alone
A skill that is concise, well-formatted, and plausible can still be useless.
The real question is: after this skill exists, do related future tasks improve? Do they require fewer retries? Fewer tool calls? Fewer repeated mistakes? Better recovery from known edge cases?
If the evaluation only rewards “good-looking lessons,” the system will learn to produce good-looking lessons. SkillOS is valuable because it frames curation around delayed downstream effect.
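A downstream-lift check can be as simple as an A/B comparison over related future tasks. This helper is hypothetical, not the paper's exact evaluation protocol; it just makes the "does the skill pay rent later?" question operational.

```python
def downstream_lift(with_skill, without_skill):
    """Compare executor performance on the same group of related future
    tasks, run with and without the candidate skill in the repo.
    Each run is a (success: bool, interaction_steps: int) pair."""
    def summarize(runs):
        success_rate = sum(s for s, _ in runs) / len(runs)
        avg_steps = sum(n for _, n in runs) / len(runs)
        return success_rate, avg_steps

    succ_w, steps_w = summarize(with_skill)
    succ_o, steps_o = summarize(without_skill)
    # Positive success_lift and negative step_delta both favor the skill.
    return {"success_lift": succ_w - succ_o, "step_delta": steps_w - steps_o}
```

A skill that moves neither number is a candidate for deletion, however well-written it is.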
3. Make pruning and merging first-class operations
Append-only memory is seductive because it feels safe. Nothing is lost.
But an append-only skill repo eventually becomes a liability. Old skills compete with new ones. Near-duplicates confuse retrieval. Over-specific lessons crowd out general procedures. A stale workaround may look relevant and cause the executor to do the wrong thing.
So pruning, merging, deprecating, and conflict resolution should be explicit operations — not cleanup chores someone might do later.
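One way to make pruning explicit is to track per-skill usage statistics and flag skills that are retrieved often but rarely help. The thresholds and the statistics format below are illustrative assumptions; the output is a deprecation shortlist for the curator, not an automatic deletion.

```python
def prune_candidates(stats, min_uses=3, min_win_rate=0.4):
    """stats: skill name -> (times_retrieved, times_task_succeeded_when_used).
    Flag skills with enough usage data whose win rate is below threshold."""
    flagged = []
    for name, (uses, wins) in stats.items():
        if uses >= min_uses and wins / uses < min_win_rate:
            flagged.append(name)
    return flagged
```

Requiring a minimum usage count keeps new skills from being pruned before they have had a chance to prove themselves.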
For long-term agents, forgetting the right thing is a capability.
Failure modes: curator reward can rot the repo
SkillOS is not a magic answer to agent memory.
The paper itself points to limitations that matter for real systems. Retrieval uses BM25 in the studied setup, which may not be enough for larger or more heterogeneous repositories. The skill representation is a single Markdown file, while real production skills often need scripts, templates, references, assets, tests, and dependency metadata. The frozen executor setup is analytically useful, but it also raises executor-curator mismatch questions.
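To make the retrieval limitation concrete, here is a compact Okapi BM25 scorer of the kind the studied setup relies on. The naive whitespace tokenizer and the default `k1`/`b` values are assumptions for illustration; the limitation the paper flags is that purely lexical scoring like this may not scale to large, heterogeneous repositories.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25,
    using a naive lowercase whitespace tokenizer."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: how many docs contain each term.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Lexical overlap works when skill descriptions reuse the task's vocabulary; it silently fails when a relevant skill describes the same procedure in different words.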
The biggest caveat is reward design.
If the curator's reward is designed poorly, it can learn the wrong maintenance style. It may over-compress until the skill loses crucial context. It may duplicate patterns because duplicates look like coverage. It may delete aggressively to optimize compactness. It may produce Markdown that reads beautifully but does not help the executor act.
A bad curator can make a repo look cleaner while making the agent worse.
That is why curation must be evaluated by downstream use, not repository aesthetics.
The deeper lesson
SkillOS is useful because it moves agent memory from “what should we store?” to “who maintains the store, and how do we train that maintainer?”
That is the right question for long-term agents.
A durable agent will not be defined by how many traces it can save. It will be defined by whether it can turn repeated experience into a compact, retrievable, evolving set of procedures that actually improve future work.
The hard part of agent memory is not writing things down.
It is learning what deserves to survive.