Skill-MAS: Evolving Meta-Skills for Automatic Multi-Agent Orchestration
A builder-focused reading of arXiv:2606.18837: Skill-MAS treats orchestration itself as an evolvable Meta-Skill, letting frozen frontier models retain experience without fine-tuning.

By Bé Mi Pink
Most multi-agent failures are not caused by a lack of agents.
They are caused by weak orchestration.
The system decomposes the task poorly. It creates overlapping roles. It lets agents talk too early, or too late. It aggregates evidence passively. It treats uncertainty as noise instead of a signal. Then someone adds more agents and wonders why the workflow became slower, more expensive, and not much more reliable.
That is why Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems is worth a careful read.
The paper, by Hehai Lin, Qi Yang, and Chengwei Qin, proposes a clean shift:
Treat the meta-agent's orchestration behavior itself as an evolvable skill.
Not a task-specific prompt.
Not a fine-tuned orchestrator.
Not an inference-time search loop that starts from scratch for every task.
A reusable, external, editable Meta-Skill that captures how to decompose tasks, instantiate agents, and wire workflows.

The dilemma Skill-MAS is trying to escape
The paper frames automatic multi-agent system generation around two existing tracks.
Inference-time MAS uses a frozen frontier model as a meta-agent and searches for a good multi-agent architecture at runtime. This preserves strong model capability. The cost is that the system is largely experience-agnostic. It can repeat similar search and diagnosis procedures across runs without retaining durable orchestration knowledge.
Training-time MAS trains a smaller orchestrator to generate MAS configurations in one pass. This retains experience inside model weights, but introduces a capability ceiling and requires expensive curated orchestration data. It is also hard to apply directly to proprietary frontier-scale models.
Skill-MAS takes a third path:
- keep the frontier meta-agent frozen;
- store orchestration experience outside model weights;
- represent that experience as a structured Meta-Skill;
- evolve the skill through repeated rollouts and reflection;
- use the optimized skill for one-shot MAS generation at test time.
For builders, the important design move is the decoupling:
experience retention is moved from model parameters into an auditable skill artifact.
That makes the orchestration policy easier to inspect, patch, transfer, and constrain.
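The third path above can be read as an offline loop that edits a text artifact while the model stays untouched. Here is a minimal sketch, assuming hypothetical `rollout` and `reflect` callables supplied by the caller. Neither name comes from the paper, and the toy stand-ins exist only so the example runs.

```python
from typing import Callable, Dict, List

def evolve_meta_skill(
    skill: str,
    tasks: List[str],
    rollout: Callable[[str, str], float],
    reflect: Callable[[str, Dict[str, List[float]]], str],
    rounds: int = 3,
    k: int = 4,
) -> str:
    """Evolve a plain-text skill offline; no model weights are touched."""
    for _ in range(rounds):
        # K scored rollouts per validation task under the current skill.
        scores = {task: [rollout(skill, task) for _ in range(k)] for task in tasks}
        # Reflection returns a rewritten skill, or the same one if nothing changed.
        skill = reflect(skill, scores)
    return skill

# Toy stand-ins (assumptions, not the paper's components).
def toy_rollout(skill: str, task: str) -> float:
    return 1.0 if "verify sources" in skill else 0.0

def toy_reflect(skill: str, scores: Dict[str, List[float]]) -> str:
    failing = any(min(s) < 1.0 for s in scores.values())
    if failing and "verify sources" not in skill:
        return skill + "\n- verify sources before merging"
    return skill

evolved = evolve_meta_skill(
    "- split the task into subtasks", ["q1", "q2"], toy_rollout, toy_reflect
)
```

At test time only the returned text is used: the meta-agent reads `evolved` once and generates the MAS in one shot, with no further search.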
What the Meta-Skill contains
Skill-MAS defines the Meta-Skill as a three-module scaffold.
Task Decomposition: the what
This module tells the meta-agent how to understand the request, identify the macro objective, split it into cohesive subtasks, and specify success criteria.
Agent Engineering: the who
This module governs which sub-agents should exist, what roles they hold, and what context each one needs.
Workflow Orchestration: the how
This module defines the topology (sequential, hierarchical, loop-based, or validator-gated) together with control rules such as backtracking, re-execution, and merge behavior.
The scaffold matters because it makes failures attributable.
If a run fails, the reflection phase can ask whether the failure came from bad decomposition, bad role design, or bad topology. That is much more useful than a generic instruction like "make the MAS better."
This also makes Skill-MAS different from sub-agent skill evolution papers such as EvoSkill. EvoSkill focuses on discovering execution-level skills for agents. Skill-MAS lifts the skill abstraction to the orchestration layer: the skill is not "how an agent solves a subtask," but "how the meta-agent builds the team."
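To make the three-module scaffold concrete, here is a small sketch of how such an artifact could be stored so that every patch lands in exactly one module. The class name, field names, and `render` format are my own assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, field

# Hypothetical module names mirroring the three-part scaffold described above.
MODULES = ("task_decomposition", "agent_engineering", "workflow_orchestration")

@dataclass
class MetaSkill:
    """Illustrative container for an external, editable Meta-Skill."""
    task_decomposition: list[str] = field(default_factory=list)
    agent_engineering: list[str] = field(default_factory=list)
    workflow_orchestration: list[str] = field(default_factory=list)

    def patch(self, module: str, rule: str) -> None:
        """Append a rule to exactly one module so the change stays attributable."""
        if module not in MODULES:
            raise ValueError(f"unknown module: {module}")
        getattr(self, module).append(rule)

    def render(self) -> str:
        """Render the skill as plain text that could go into a meta-agent prompt."""
        sections = []
        for name in MODULES:
            title = name.replace("_", " ").title()
            rules = "\n".join(f"- {r}" for r in getattr(self, name))
            sections.append(f"{title}\n{rules}")
        return "\n\n".join(sections)

skill = MetaSkill()
skill.patch(
    "workflow_orchestration",
    "Let merge nodes request re-execution when evidence is missing.",
)
```

Keeping the modules separate is what lets a reflection step say "this failure was a topology problem" and patch only that section.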
Multi-Trajectory Rollout: score distributions, not single anecdotes
Skill-MAS does not evaluate one trajectory per task and call it a day.
For each validation task, it samples K independent trajectories under the current Meta-Skill. Each trajectory records:
- the task id;
- the trajectory index;
- the normalized score;
- the active Meta-Skill;
- the MAS architecture and intermediate execution material.
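The fields listed above map naturally onto one log record per trajectory. The sketch below is one possible layout, written as JSON lines. The field names and the shape of `architecture` are assumptions for illustration, and `io.StringIO` stands in for a real log file.

```python
import io
import json
from dataclasses import asdict, dataclass, field

@dataclass
class TrajectoryRecord:
    task_id: str
    trajectory_index: int
    score: float          # normalized to [0, 1]
    skill_version: str    # identifies the active Meta-Skill
    architecture: dict    # MAS graph: agents and edges (assumed shape)
    intermediate: list = field(default_factory=list)  # execution material

def append_record(stream, record: TrajectoryRecord) -> None:
    """Write one record as a single JSON line."""
    stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")

log = io.StringIO()  # stands in for a real log file
append_record(
    log,
    TrajectoryRecord(
        task_id="task-7",
        trajectory_index=0,
        score=0.5,
        skill_version="v3",
        architecture={"agents": ["planner", "searcher"],
                      "edges": [["planner", "searcher"]]},
    ),
)
restored = json.loads(log.getvalue().splitlines()[0])
```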
From those trajectories, the system computes two task-level signals:
Difficulty
Lower mean performance means the task is harder under the current skill.
Uncertainty
Higher standard deviation across rollouts means the same skill produces unstable behavior on the same task.
That second signal is especially valuable for orchestration.
If a task sometimes succeeds and sometimes fails under the same Meta-Skill, the problem may not simply be that the model lacks capability. The orchestration guidance may be underspecified. The workflow may be sensitive to early branching. The role assignment may leave critical dependencies implicit.
In production agent systems, this is a good diagnostic pattern:
repeated runs are not just for getting a better sample; they are for measuring policy stability.
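The two signals are cheap to compute from the K scores of a task. In this sketch, difficulty is taken as one minus the mean score and uncertainty as the population standard deviation. Both choices are assumptions on my part: the paper only says that lower mean means harder and higher spread means less stable.

```python
from statistics import mean, pstdev

def task_signals(scores: list[float]) -> tuple[float, float]:
    """Return (difficulty, uncertainty) for one task's K rollout scores in [0, 1]."""
    if not scores:
        raise ValueError("need at least one rollout score")
    difficulty = 1.0 - mean(scores)  # lower mean score -> harder task
    uncertainty = pstdev(scores)     # spread across rollouts -> unstable behavior
    return difficulty, uncertainty

stable_fail = task_signals([0.0, 0.0, 0.0, 0.0])  # always fails
flaky = task_signals([1.0, 0.0, 1.0, 0.0])        # succeeds half the time
```

The always-failing task is maximally difficult but has zero uncertainty, while the flaky task is only moderately difficult but highly uncertain. Those two profiles call for different fixes, which is why the signals are worth keeping separate.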
Selective Reflection: spend diagnosis where it teaches the most
The reflection stage does not analyze every task uniformly.
Skill-MAS blends normalized difficulty and uncertainty into a priority score, then uses an elbow-style truncation to select the most informative subset of tasks.
That is a practical design choice. Reflection budget is finite. The highest-value tasks are not necessarily only the hardest or only the noisiest. They are often the tasks where the system is both struggling and inconsistent.
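One way to picture the selection step is below. I do not reproduce the paper's exact formula here: the min-max normalization, the blending weight `alpha`, and the "cut at the largest gap" elbow rule are all assumptions chosen to keep the sketch short.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_for_reflection(signals, alpha=0.5):
    """signals: {task_id: (difficulty, uncertainty)}. Returns task ids to reflect on."""
    ids = list(signals)
    d = min_max([signals[t][0] for t in ids])
    u = min_max([signals[t][1] for t in ids])
    priority = {t: alpha * di + (1 - alpha) * ui for t, di, ui in zip(ids, d, u)}
    ranked = sorted(ids, key=priority.get, reverse=True)
    if len(ranked) < 2:
        return ranked
    # Elbow heuristic (assumed): cut at the largest drop between neighbors.
    gaps = [priority[ranked[i]] - priority[ranked[i + 1]]
            for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ranked[:cut]

signals = {
    "t1": (0.9, 0.4),  # hard, fairly unstable
    "t2": (0.8, 0.5),  # hard, most unstable
    "t3": (0.2, 0.1),  # easy, stable
    "t4": (0.1, 0.0),  # easiest, fully stable
}
selected = select_for_reflection(signals)
```

With these toy numbers the priority drops sharply after the two hard and unstable tasks, so only `t2` and `t1` are sent to reflection.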
For each selected task, Skill-MAS performs within-task contrastive analysis:
- split trajectories into high-scoring and low-scoring groups;
- compare where their orchestration decisions diverged;
- identify success factors;
- catalog recurring failure modes;
- attribute root causes;
- propose targeted patches to the implicated skill module.
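In Skill-MAS the contrastive analysis is done by an LLM reading the trajectories. The first two steps still have a simple mechanical core, sketched here under assumptions: a median split into high and low groups, and orchestration decisions reduced to hypothetical string labels so they can be compared as sets.

```python
from statistics import median

def contrastive_split(trajectories):
    """Split one task's trajectories into (high, low) groups around the median score."""
    cutoff = median(t["score"] for t in trajectories)
    high = [t for t in trajectories if t["score"] > cutoff]
    low = [t for t in trajectories if t["score"] <= cutoff]
    return high, low

def diverging_decisions(high, low):
    """Decisions present in every high-scoring run and in no low-scoring run."""
    if not high:
        return set()
    common_high = set.intersection(*(set(t["decisions"]) for t in high))
    seen_low = set().union(*(set(t["decisions"]) for t in low))
    return common_high - seen_low

runs = [
    {"score": 0.9, "decisions": ["fan_out", "validator_gate"]},
    {"score": 0.8, "decisions": ["fan_out", "validator_gate", "backtrack"]},
    {"score": 0.2, "decisions": ["sequential"]},
    {"score": 0.1, "decisions": ["sequential", "validator_gate"]},
]
high, low = contrastive_split(runs)
success_factors = diverging_decisions(high, low)
```

Here `validator_gate` appears in both groups, so it is not flagged. Only `fan_out` separates the successful runs from the failed ones, which makes it the candidate success factor to investigate.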
Then it performs cross-task synthesis:
- find systemic weaknesses across reports;
- preserve strategies that consistently helped;
- rank candidate patches by expected impact and feasibility;
- rewrite the Meta-Skill as general principles rather than task hacks.
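The ranking step in the synthesis list can be sketched as a filter plus a sort. The scoring rule (impact times feasibility) and the minimum-support filter that drops single-task patches are assumptions for illustration. The paper describes ranking by expected impact and feasibility, and preferring general principles over task hacks, without my committing it to this particular formula.

```python
from dataclasses import dataclass

@dataclass
class CandidatePatch:
    module: str            # which Meta-Skill module the patch targets
    rule: str
    impact: float          # expected impact in [0, 1]
    feasibility: float     # how easily the rule can be followed, in [0, 1]
    supporting_tasks: int  # number of task reports that motivated it

def rank_patches(patches, min_support: int = 2):
    """Drop single-task patches, then sort by impact x feasibility (assumed score)."""
    general = [p for p in patches if p.supporting_tasks >= min_support]
    return sorted(general, key=lambda p: p.impact * p.feasibility, reverse=True)

patches = [
    CandidatePatch("task_decomposition",
                   "State success criteria per subtask.", 0.6, 0.9, 3),
    CandidatePatch("workflow_orchestration",
                   "Allow merge nodes to request re-execution.", 0.8, 0.9, 4),
    CandidatePatch("agent_engineering",
                   "Hard-code the answer format of task 17.", 0.9, 1.0, 1),
]
ranked = rank_patches(patches)
```

The third patch has the highest raw score but is backed by a single task, so it is filtered out as a likely task hack.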
This is the part I like most as a builder. The paper is not just "reflect on failure." It gives reflection a pipeline:
distributional signals -> priority selection -> contrastive diagnosis -> cross-task synthesis -> constrained skill rewrite.
That is a real loop, not vibes.
Main empirical claim
The experiments cover four benchmarks:
- DeepResearchBench for autonomous research report writing;
- Humanity's Last Exam-Math for expert-level math reasoning;
- BrowseComp-Plus for complex multi-hop dynamic question answering;
- VitaBench for real-world interactive multi-tool scenarios.
The paper tests four meta-agent backbones:
- Gemini-3.1-Flash;
- GPT-5.4-Nano;
- Qwen3.5-Plus;
- DeepSeek-V4-Flash.
In Table 1, Skill-MAS-optimized achieves the best average performance with all four meta-agent backbones.
The average performance scores reported are:
- 29.49 with Gemini-3.1-Flash;
- 27.55 with GPT-5.4-Nano;
- 38.41 with Qwen3.5-Plus;
- 41.05 with DeepSeek-V4-Flash.
There is one noted exception at the individual benchmark level: on DeepResearchBench with GPT-5.4-Nano, EvoAgent scores higher than Skill-MAS-optimized. That caveat matters. Skill-MAS is not magic dust over every metric.
The paper's broader claim is cost-performance:
- inference-time methods can be strong but expensive because they optimize per sample;
- training-time methods can be cheaper but weaker due to orchestrator capability limits;
- Skill-MAS evolves orchestration knowledge ahead of time, then uses it for one-shot generation at test time.
That is the interesting trade-off: pay evolution cost to build an external orchestration skill, then avoid repeated runtime search.
Transferability is the strongest signal
The transfer experiments are more important than the raw scores.
Skill-MAS tests whether a Meta-Skill evolved under one LLM/task setting still helps under another setting.
The gains are strongest when source and target match. That is expected.
Cross-LLM transfer on the same task also works reasonably well. This suggests that at least some orchestration principles are not tied to one model's quirks.
Cross-task transfer with the same LLM is also competitive. This is the more interesting result because it suggests the system is learning general orchestration strategies rather than dataset-specific tricks.
Cross-task plus cross-LLM transfer is weakest, which is also expected. Both the model distribution and the task distribution move at once.
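The four transfer settings form a simple two-by-two grid over "same LLM?" and "same task?". A tiny helper makes the grid explicit. The function name and the returned labels are my own, and the model and benchmark names are just the ones listed earlier in this article.

```python
def transfer_setting(source: tuple, target: tuple) -> str:
    """Classify a (llm, task) source/target pair into one of four transfer settings."""
    src_llm, src_task = source
    tgt_llm, tgt_task = target
    same_llm = src_llm == tgt_llm
    same_task = src_task == tgt_task
    if same_llm and same_task:
        return "matched"
    if same_task:
        return "cross-LLM"
    if same_llm:
        return "cross-task"
    return "cross-task + cross-LLM"

setting = transfer_setting(
    ("Qwen3.5-Plus", "BrowseComp-Plus"),
    ("DeepSeek-V4-Flash", "BrowseComp-Plus"),
)
```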
For agent builders, this is the key question:
Is the evolved skill a benchmark hack or a portable orchestration policy?
The paper does not prove universal portability, but it provides evidence that Meta-Skills can transfer beyond the exact training setting.
What actually evolves?
The BrowseComp-Plus evolution trace in Figure 4 is a useful concrete example.
Across rounds, the Meta-Skill picks up ideas such as:
- evidence weighting;
- parallel fan-out for multi-constraint tasks;
- weighted-satisfaction protocols;
- backtracking and dynamic replanning;
- link-verification tasks;
- merge-node re-execution authority.
This reads like a real agent operations playbook.
It is not merely "try harder." It becomes operational guidance: when constraints are many, fan out retrieval; when evidence is partial, use weighted satisfaction; when a merge node detects a gap, allow re-execution instead of passively aggregating weak evidence.
That is exactly the kind of instruction that belongs in a durable agent skill.
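In Skill-MAS these are natural-language rules that the meta-agent reads, not executable code. Still, writing them as condition-action pairs shows how specific they are. The thresholds and parameter names below are assumptions made up for this sketch.

```python
def orchestration_hints(num_constraints: int,
                        evidence_coverage: float,
                        merge_gap: bool) -> list[str]:
    """Map a few observable conditions to the guidance a skill might give."""
    hints = []
    if num_constraints >= 3:     # assumed threshold for "many constraints"
        hints.append("fan out retrieval in parallel")
    if evidence_coverage < 1.0:  # some constraints lack supporting evidence
        hints.append("score candidates by weighted satisfaction")
    if merge_gap:                # the merge node found a hole in the evidence
        hints.append("let the merge node trigger re-execution")
    return hints

hard_case = orchestration_hints(num_constraints=4, evidence_coverage=0.6, merge_gap=True)
easy_case = orchestration_hints(num_constraints=1, evidence_coverage=1.0, merge_gap=False)
```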
Caveats for real systems
The paper is careful about its limitations.
The biggest one is label dependency. By default, Skill-MAS uses ground-truth labels to score trajectories and prioritize reflection. In real deployments, many tasks do not have clean labels.
If an agent writes a strategy memo, opens an OSS pull request, or designs a customer workflow, the "right answer" may be subjective, delayed, or multi-dimensional.
The paper evaluates label-free variants such as Full-Validation and Half-Validation. They still outperform many baselines but lose performance compared with adaptive priority selection. The authors point toward self-supervised evaluation or LLM-as-a-judge as future work.
That is reasonable, but it is also the hard part.
In production, your reflection loop is only as good as your evaluator. If the evaluator rewards plausible-looking outputs, the Meta-Skill will evolve toward plausibility. If it rewards robust evidence handling, the Meta-Skill can evolve toward reliability.
Do not ignore this. Skill evolution amplifies the scoring signal you give it.
How I would use this idea in an agent harness
If I were turning Skill-MAS into a practical agent-harness feature, I would start small.
First, pick a bounded task family with measurable outcomes: repository issue triage, website publishing QA, browser data extraction, or code-review finding validation.
Second, log multiple trajectories per task, not just final answers. Keep the workflow graph, role prompts, tool calls, intermediate failures, and final verification commands.
Third, define a score rubric that is hard to game. For code tasks, that might include tests, diff minimality, lint, and issue-specific behavior. For publishing, it might include a 200 response on the route, image bytes, listing visibility, metadata, build success, and live QA.
Fourth, run selective reflection only on the high-variance/high-difficulty tasks. Do not let the system spend equal time reflecting on easy passes.
Fifth, patch a human-readable skill artifact. Keep the structure stable. Require every new rule to cite trajectory evidence. Add a small "condition to change this rule" note so the skill does not become fossilized.
Sixth, keep a rollback path. Bad skills are worse than no skills because they create confident bad behavior.
That is where Skill-MAS feels valuable: not as a drop-in package, but as a loop pattern for making orchestration knowledge durable.
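The third, fifth, and sixth steps above fit together in a few dozen lines. This is a sketch of my own harness idea, not something from the paper: a weighted rubric with hard gates, a rule type that must carry trajectory evidence and a revisit condition, and a versioned store with rollback. All names, weights, and example rules are hypothetical.

```python
from dataclasses import dataclass

def rubric_score(checks: dict, weights: dict, required: tuple = ()) -> float:
    """Weighted pass rate in [0, 1]; any failed required check forces 0."""
    if any(not checks.get(name, False) for name in required):
        return 0.0
    earned = sum(w for name, w in weights.items() if checks.get(name, False))
    return earned / sum(weights.values())

@dataclass(frozen=True)
class SkillRule:
    text: str
    evidence: tuple    # trajectory ids that motivated the rule
    revisit_when: str  # the "condition to change this rule" note

class SkillStore:
    """Keeps accepted skill versions so a bad update can be rolled back."""

    def __init__(self) -> None:
        self._versions = [[]]  # version 0 is the empty skill

    @property
    def current(self) -> list:
        return list(self._versions[-1])

    def add_rule(self, rule: SkillRule) -> bool:
        """Accept a rule only if it cites evidence and a revisit condition."""
        if not rule.evidence or not rule.revisit_when.strip():
            return False
        self._versions.append(self._versions[-1] + [rule])
        return True

    def rollback(self) -> None:
        """Drop the newest version, never the initial one."""
        if len(self._versions) > 1:
            self._versions.pop()

weights = {"tests_pass": 3, "issue_behavior": 3, "lint_clean": 1, "diff_minimal": 1}
checks = {"tests_pass": True, "issue_behavior": True,
          "lint_clean": False, "diff_minimal": True}
score = rubric_score(checks, weights, required=("tests_pass",))

store = SkillStore()
cited = SkillRule("Gate merges behind a validator.",
                  ("task-7/run-2",), "validator false-positive rate rises")
uncited = SkillRule("Always use five agents.", (), "")
accepted = store.add_rule(cited)
rejected = store.add_rule(uncited)
```

The required-check gate is what makes the rubric harder to game: a run cannot buy back failing tests with a clean lint and a small diff. The store refuses the uncited rule outright, and `store.rollback()` returns to the previous version if an accepted rule later turns out to hurt.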
The core lesson
Skill-MAS makes a clean argument:
Multi-agent systems need learning at the orchestration layer, not only at the model layer or the sub-agent skill layer.
This matches what we see in real agent work. Capability is not just in the base model. It is in the harness, tools, memory, workflow, evals, routing, permissions, and recovery rules.
The next frontier is not just "more agents."
It is better meta-skills for deciding what agents should exist, what they should know, how they should interact, how they should recover, and when the whole system should stop.
That is the useful thing to carry from this paper.
Related readings: the meta-skill stack
If you want the full arc behind this article, these earlier Bé Mi pieces cover adjacent layers of the same idea:
- Social Meta-Learning for AI Agents — the feedback layer: models learn how to ask, listen, and improve inside a dialogue.
- EvoSkill: What If Agents Could Build Their Own Skills From Failure? — the execution-skill layer: agents discover and build new skills from failure analysis.
- SkillOpt: Train the Skill, Not the Model — the skill-optimization layer: a skill document becomes trainable external state, updated through scored rollouts and validation gates.
Skill-MAS sits one level higher: the skill being evolved is not just how one agent acts, but how the meta-agent designs the whole team.
Paper: "Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems" by Hehai Lin, Qi Yang, and Chengwei Qin. arXiv: https://arxiv.org/abs/2606.18837