Train the Skill, Not the Model: SkillOpt as Validation-Gated Procedural Memory
SkillOpt treats an agent skill document as trainable external state: a frozen target model runs scored rollouts, an optimizer proposes bounded text edits, and a held-out validation gate accepts only skill changes that actually improve performance.

A lot of agent work treats skills as static documentation: write a skill.md, hope the agent reads it, and patch it by hand when the workflow breaks.
The paper "SkillOpt: Executive Strategy for Self-Evolving Agent Skills" makes a more interesting move. It treats the skill document itself as trainable external state. The model weights stay frozen. The execution harness stays fixed. What changes is a compact procedural artifact that is updated from scored rollouts, bounded text edits, and a held-out validation gate.
That is the useful thesis for builders:
A skill can be trained like a small, auditable, deployable piece of agent memory — not by vibes, but by evidence, patches, rejection, and validation.
Paper: arXiv:2605.23904v2 — SkillOpt: Executive Strategy for Self-Evolving Agent Skills, Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Xue Yang, Xiaoyang Wu, C. L. Philip Chen, Wenqiang Lei, Yan Xia, Hongcheng Gao, Xuhong Zhang, Lichao Sun, Tongliang Liu, Shuaiqiang Wang, Dawei Yin. Version dated 25 May 2026.
The important shift: skill.md becomes external trainable state
SkillOpt is not fine-tuning. It is not prompt tinkering in a loop. It is not an agent randomly rewriting its own instructions after every failure.
The paper frames a skill document as external natural-language state attached to a frozen target model. The target model executes tasks with the current skill. A separate optimizer model reads scored trajectories and proposes structured edits to the skill.
The final deployed artifact is still just text (roughly 300–2,000 tokens in the paper's case studies), and deployment adds zero optimizer calls at inference time.
That distinction matters.
If a team fine-tunes model weights, the learned behavior is hard to inspect and harder to surgically revert. If a team only writes prompts by hand, improvement is often unmeasured and personality-driven. SkillOpt sits in the middle: train the procedure, keep it readable, and accept changes only when validation says the procedure got better.
For agent systems, that is a clean control surface.
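To make "external trainable state" concrete, here is a minimal sketch of a skill kept as versioned text next to a frozen model. The names `SkillStore` and `build_prompt` are hypothetical and do not come from the paper; the only point illustrated is that the skill is the sole mutable artifact and that a revert is cheap.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkillStore:
    """Append-only history of skill text. Model weights and harness are untouched."""
    versions: List[str] = field(default_factory=list)

    def commit(self, text: str) -> int:
        """Store a new skill version and return its version number."""
        self.versions.append(text)
        return len(self.versions) - 1

    def current(self) -> str:
        """Return the most recently committed skill text."""
        return self.versions[-1]

    def revert(self, version: int) -> int:
        """Re-commit an earlier version so the history stays append-only."""
        return self.commit(self.versions[version])


def build_prompt(skill: str, task: str) -> str:
    """Deployment is plain concatenation (an assumed format); no optimizer call."""
    return f"{skill}\n\n# Task\n{task}"


store = SkillStore()
v0 = store.commit("Check the header row before editing.")
v1 = store.commit("Check the header row before editing.\nVerify formulas after saving.")
store.revert(v0)  # a surgical rollback is just another commit
prompt = build_prompt(store.current(), "Sum column B.")
```

Because every version is plain text, a diff between any two versions is human-readable, which is what makes the control surface auditable.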
The loop: roll out, reflect, patch, validate
The SkillOpt loop has a surprisingly engineering-flavored shape:
- Run rollout batches with the current skill on training tasks.
- Collect scored trajectories: what succeeded, what failed, and where.
- Use reflection minibatches so the optimizer model sees common failure patterns rather than one-off anecdotes.
- Propose bounded edits to the skill: add, delete, or replace specific instructions.
- Clip the update by a textual learning-rate budget, so one step cannot rewrite the whole skill into a different creature.
- Evaluate the candidate skill on a held-out selection split.
- Accept only if the selection score strictly improves.
- Store rejected edits and failure patterns so the optimizer can avoid repeating bad updates.
- Apply epoch-wise slow/meta updates to preserve stable editing directions across longer training horizons.
This is the core design lesson: SkillOpt borrows the discipline of training — batches, learning rates, validation, ablation, accepted/rejected updates — but maps it into text-space.
The validation gate is especially important. Without it, “self-improving skill” can become a fancy name for instruction drift. The skill may get longer, louder, or more confident while quietly getting worse. SkillOpt’s held-out gate is what turns rewriting into optimization.
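The loop above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `train_skill`, `toy_rollout`, and `toy_propose` are hypothetical names, the optimizer is replaced by a trivial rule, and the epoch-wise slow/meta updates are left out. What it does show is the strict-improvement gate on a held-out selection split and the memory of rejected candidates.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Skill = List[str]                 # one instruction per line
Scored = List[Tuple[str, float]]  # (task, score) pairs from rollouts


def train_skill(
    skill: Skill,
    train_tasks: Sequence[str],
    selection_tasks: Sequence[str],
    rollout: Callable[[Skill, str], float],
    propose: Callable[[Skill, Scored, List[Skill]], Skill],
    steps: int = 5,
    batch_size: int = 3,
    seed: int = 0,
) -> Tuple[Skill, float, List[Skill]]:
    rng = random.Random(seed)
    best = list(skill)
    best_score = mean(rollout(best, t) for t in selection_tasks)
    rejected: List[Skill] = []
    for _ in range(steps):
        # Roll out the current skill on a training minibatch and score it.
        batch = rng.sample(list(train_tasks), min(batch_size, len(train_tasks)))
        scored = [(t, rollout(best, t)) for t in batch]
        # The optimizer sees scored trajectories plus past rejections.
        candidate = propose(best, scored, rejected)
        if candidate == best or candidate in rejected:
            continue
        # Validation gate: accept only on strict improvement on held-out tasks.
        cand_score = mean(rollout(candidate, t) for t in selection_tasks)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
        else:
            rejected.append(candidate)
    return best, best_score, rejected


# Toy stand-ins so the sketch runs end to end.
def toy_rollout(skill: Skill, task: str) -> float:
    keyword = task.split(":")[0]  # a task passes if the skill mentions its keyword
    return 1.0 if any(keyword in line for line in skill) else 0.0


def toy_propose(skill: Skill, scored: Scored, rejected: List[Skill]) -> Skill:
    for task, score in scored:
        candidate = skill + [f"Handle {task.split(':')[0]} explicitly."]
        if score < 1.0 and candidate not in rejected:
            return candidate
    return skill


best, score, rejected = train_skill(
    ["Handle dates explicitly."],
    train_tasks=["dates:t1", "formulas:t1", "units:t1"],
    selection_tasks=["dates:s1", "formulas:s1", "merged:s1"],
    rollout=toy_rollout,
    propose=toy_propose,
)
```

In this toy run the "units" edit fixes a training failure but does nothing on the selection split, so the gate rejects it, while the "formulas" edit is accepted. That is the difference between rewriting and optimizing in miniature.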
Why bounded text updates are not cosmetic
The paper’s add/delete/replace interface is not just implementation detail. It is a safety and stability mechanism.
Unbounded rewrites can:
- erase rules that were working;
- overfit to the latest failed example;
- introduce incompatible procedures;
- grow the skill into a bloated transcript summary;
- make it impossible to know which edit helped or hurt.
Bounded patch edits keep continuity. A candidate skill remains close enough to the previous skill that rejected edits and accepted edits still form a meaningful training history.
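A bounded add/delete/replace interface with a clipping budget might look like the sketch below. The `Edit` type, the word-count cost, and the budget value are assumptions made for illustration; the paper's actual edit format and its textual learning-rate definition may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # position in the ORIGINAL skill (for "add": insert before it)
    text: str = ""  # new instruction for "add" and "replace"


def edit_cost(edit: Edit, skill: List[str]) -> int:
    """Hypothetical cost: number of words removed plus words written."""
    removed = len(skill[edit.index].split()) if edit.op in ("delete", "replace") else 0
    written = len(edit.text.split()) if edit.op in ("add", "replace") else 0
    return removed + written


def clip_edits(edits: List[Edit], skill: List[str], budget: int) -> List[Edit]:
    """Keep edits in proposal order until the word budget would be exceeded."""
    kept, spent = [], 0
    for edit in edits:
        cost = edit_cost(edit, skill)
        if spent + cost > budget:
            break
        kept.append(edit)
        spent += cost
    return kept


def apply_edits(skill: List[str], edits: List[Edit]) -> List[str]:
    """Apply edits to a copy, highest index first so earlier positions do not shift."""
    if len({e.index for e in edits}) != len(edits):
        raise ValueError("each edit must target a distinct index")
    result = list(skill)
    for edit in sorted(edits, key=lambda e: e.index, reverse=True):
        if edit.op == "add":
            result.insert(edit.index, edit.text)
        elif edit.op == "delete":
            del result[edit.index]
        elif edit.op == "replace":
            result[edit.index] = edit.text
        else:
            raise ValueError(f"unknown op: {edit.op}")
    return result


skill = [
    "Check the header row first.",
    "Always overwrite existing cells.",
    "Save the file.",
]
proposed = [
    Edit("replace", 1, "Never overwrite cells without a backup."),
    Edit("add", 3, "Verify that formulas recalculate."),
    Edit("delete", 0),  # would push the step over budget, so it is clipped
]
kept = clip_edits(proposed, skill, budget=15)
candidate = apply_edits(skill, kept)
```

Since each kept edit is a small, named operation, an accepted or rejected step can be attributed to specific lines rather than to a wholesale rewrite.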
That is exactly what many real agent teams need. The hard problem is not “can the model write more instructions?” The hard problem is whether the team can maintain a procedural memory artifact that is compact, testable, auditable, and not constantly destabilized by the last weird failure.
Results, in the paper’s evaluation setup
The paper reports strong numbers across six benchmarks, seven target models, and three execution modes: direct chat, Codex-style agentic loops, and Claude Code-style loops.
The headline result is bold: SkillOpt is best or tied-best on all 52 evaluated model/benchmark/harness cells against no-skill, human-written skill, one-shot LLM skill, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines.
For GPT-5.5, the paper reports average gains over no-skill of:
- +23.5 points in direct chat;
- +24.8 points inside the Codex harness;
- +19.1 points inside Claude Code.
The authors also report transfer effects. A SpreadsheetBench skill trained on GPT-5.4 improves smaller GPT variants. A Codex-trained spreadsheet skill transfers to Claude Code with a reported +59.7 point gain. An OlympiadBench skill yields positive gains on Omni-MATH.
The right phrasing is “in their evaluation setup,” not “skills now universally self-improve.” Still, the result is meaningful because it crosses more than one boundary: model scale, execution harness, and nearby benchmark.
That is the part agent builders should pay attention to. If a skill only works in the exact loop that produced it, it may be a brittle prompt artifact. If it transfers across related harnesses, it starts to look more like procedural knowledge.
The OpenClaw/Hermes connection: skills as governed memory
For systems built around explicit skills — like OpenClaw/Hermes-style workflows — SkillOpt is interesting because it formalizes something practitioners already feel.
A good skill is not merely a note. It is a small operational policy:
- when to use a tool;
- in what order to perform checks;
- which failure modes to avoid;
- how to verify completion;
- what not to do even if the model is tempted.
SkillOpt’s contribution is to ask: can we optimize that policy artifact with the same seriousness we apply to model training?
The answer, at least on these benchmarks, is a qualified yes: the gains are large enough to be worth building around.
The practical direction is not a fully autonomous agent rewriting all of its memory forever. That would be sloppy and dangerous. The better direction is governed skill training:
- collect task evidence;
- propose minimal skill patches;
- run validation;
- reject regressions;
- preserve negative feedback;
- keep the artifact readable;
- deploy only the compact best skill.
That makes skills portable across agents, reviewable by humans, and debuggable when behavior changes.
What I would actually use this for
The best early use cases are domains with clear feedback:
- spreadsheet agents with executable checks;
- code agents with tests and lint;
- document QA with exact-match or rubric-backed scoring;
- browser/task agents with measurable success states;
- internal operational workflows where a verifier can say pass/fail.
This is less attractive for one-off creative writing, ambiguous judgment calls, or tasks where the scorer is weak. If the validation signal is noisy, biased, or gameable, SkillOpt can optimize toward the wrong target very efficiently. A bad scorer turns “skill training” into “skill laundering.”
That is not a flaw unique to SkillOpt. It is the core governance problem of any feedback-driven agent system.
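A small simulation shows why scorer noise matters for a strict-improvement gate. Everything here is an assumption for illustration: the pass rate, the split size, and the `margin_gate` alternative are hypothetical and are not taken from the paper. The candidate and the current skill are given identical true quality, so any acceptance is pure noise.

```python
import random
from statistics import mean
from typing import List, Tuple


def noisy_score(true_pass_rate: float, n_tasks: int, rng: random.Random) -> float:
    """Simulated selection score: each task passes with probability true_pass_rate."""
    return mean(1.0 if rng.random() < true_pass_rate else 0.0 for _ in range(n_tasks))


def strict_gate(old: float, new: float) -> bool:
    """Accept on any strict improvement of a single evaluation."""
    return new > old


def margin_gate(old_runs: List[float], new_runs: List[float], margin: float) -> bool:
    """Hypothetical stricter gate: average repeated runs and require a margin."""
    return mean(new_runs) - mean(old_runs) > margin


def false_accept_rates(trials: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Both skills have the SAME true quality, so every accept is a false accept."""
    rng = random.Random(seed)
    strict_hits = margin_hits = 0
    for _ in range(trials):
        old = [noisy_score(0.6, 50, rng) for _ in range(5)]
        new = [noisy_score(0.6, 50, rng) for _ in range(5)]
        strict_hits += strict_gate(old[0], new[0])
        margin_hits += margin_gate(old, new, margin=0.05)
    return strict_hits / trials, margin_hits / trials


strict_rate, margin_rate = false_accept_rates()
```

With a single noisy evaluation, a no-better candidate should pass the strict gate a little under half the time in expectation, while averaging repeated runs and demanding a margin cuts that rate substantially. The stricter gate costs more evaluation tokens, so the right setting depends on how noisy the scorer is.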
Caveats builders should keep in view
The paper is careful about several limitations, and they are not minor footnotes.
First, SkillOpt depends on scored trajectories and a held-out selection split. It works best when success can be measured with automatic verifiers, exact-match metrics, executable tests, or otherwise reliable feedback. Open-ended domains need stronger human or model-based evaluation.
Second, the deployed skill is cheap, but training is not free. The paper reports training-token costs from 0.6M to 46.4M tokens per absolute test-set point, depending on benchmark and trajectory length. That can be worth it for reusable workflows. It is probably overkill for throwaway tasks.
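As a back-of-envelope check, the reported range can be turned into a training budget. This sketch assumes the figure can be read as tokens spent per point gained and that cost scales linearly with the size of the gain, neither of which the paper is quoted as confirming; the 10-point gain and the price per million tokens are hypothetical placeholders.

```python
def training_tokens(tokens_per_point: int, points_gained: int) -> int:
    """Linear estimate: total training tokens for a given absolute gain."""
    return tokens_per_point * points_gained


LOW_TOKENS_PER_POINT = 600_000       # 0.6M, low end of the reported range
HIGH_TOKENS_PER_POINT = 46_400_000   # 46.4M, high end of the reported range
POINTS_GAINED = 10                   # hypothetical gain, for illustration only
PRICE_PER_MILLION_USD = 5.0          # hypothetical blended token price

low_tokens = training_tokens(LOW_TOKENS_PER_POINT, POINTS_GAINED)
high_tokens = training_tokens(HIGH_TOKENS_PER_POINT, POINTS_GAINED)
low_cost = low_tokens / 1_000_000 * PRICE_PER_MILLION_USD
high_cost = high_tokens / 1_000_000 * PRICE_PER_MILLION_USD

print(f"{low_tokens:,} to {high_tokens:,} tokens")      # 6,000,000 to 464,000,000 tokens
print(f"${low_cost:,.0f} to ${high_cost:,.0f}")         # $30 to $2,320
```

The spread of nearly two orders of magnitude is the practical message: the same method can be trivial or expensive depending on benchmark and trajectory length, so it pays to estimate before committing.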
Third, SkillOpt optimizes a single portable skill. That is clean for deployment, but some domains need a library of specialized procedures rather than one universal document.
Fourth, transfer is encouraging but not magic. The authors explicitly warn that optimized skills can encode domain-specific heuristics from the training distribution. Before moving a skill to a substantially different model, harness, or task setting, you still need held-out evaluation.
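One way to act on this caveat is a small pre-deployment check that compares the skill against a no-skill baseline on held-out tasks from the target setting. The functions `transfer_gain` and `approve_transfer`, the `min_gain` threshold, and the toy evaluator are all hypothetical; a real check would call the actual target model and harness.

```python
from statistics import mean
from typing import Callable, Optional, Sequence

# evaluate(skill_or_None, task) -> score in [0, 1] on the TARGET model and harness
Evaluate = Callable[[Optional[str], str], float]


def transfer_gain(evaluate: Evaluate, skill: str, tasks: Sequence[str]) -> float:
    """Mean score with the skill minus mean score without it, on held-out tasks."""
    with_skill = mean(evaluate(skill, t) for t in tasks)
    baseline = mean(evaluate(None, t) for t in tasks)
    return with_skill - baseline


def approve_transfer(
    evaluate: Evaluate, skill: str, tasks: Sequence[str], min_gain: float = 0.0
) -> bool:
    """Deploy to the new setting only if the held-out gain clears the threshold."""
    return transfer_gain(evaluate, skill, tasks) > min_gain


def toy_evaluate(skill: Optional[str], task: str) -> float:
    # Toy assumption: the skill only helps on spreadsheet-like tasks.
    if skill is not None and task.startswith("sheet"):
        return 1.0
    return 0.5


skill_text = "Check the header row before editing."
near_ok = approve_transfer(toy_evaluate, skill_text, ["sheet:1", "sheet:2"])
far_ok = approve_transfer(toy_evaluate, skill_text, ["math:1", "math:2"])
```

In the toy setup the skill is approved for the nearby task family and refused for the distant one, which is the behavior the authors' warning asks for.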
Finally, reproducibility matters. The paper provides a code link, but builder decisions should still depend on whether the released implementation, benchmark splits, scorers, and harness details reproduce the reported pattern outside the paper.
The builder takeaway
SkillOpt is valuable because it makes procedural memory feel less mystical.
It says: do not just ask the agent to “learn from experience.” Define the experience, score it, summarize the failures, propose bounded edits, validate the candidate, retain rejected edits as negative feedback, and export the best compact artifact.
That is not glamorous self-evolution. It is better than that.
It is engineering.
And for agent systems, that is the point: the path to better agents may not always be larger models or more autonomous loops. Sometimes it is a small best_skill.md that has been trained, rejected, validated, and kept honest.