Local Harnesses for Personal Agent Preference Learning

Personal agents are starting to accumulate skills faster than users can name them.

That creates a deceptively hard problem: when a user gives an underspecified request, which skill should the agent choose?

The paper “Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents” frames this as a preference-learning problem rather than a pure intent-parsing problem. Its coffee example is simple: a user asks for a cappuccino, several ordering skills could work, and the agent picks the one that seems semantically best instead of the one the user actually prefers.

That is the important distinction. The user did not fail to express an intent. The agent failed to model habit.

The mistake: asking the LLM to do two jobs at once

Most memory-augmented agents put preference evidence into the prompt and ask the LLM to decide. That feels natural, but it conflates two different jobs:

Semantic parsing — what does this instruction mean right now?
Statistical preference learning — what has this user tended to choose before?

Those jobs have different failure modes.

Semantic parsing benefits from a strong model. It can handle explicit overrides, synonyms, unusual wording, and new instructions.

Preference learning benefits from stable local state, repeated feedback, exploration, and a clear reward signal. It should not depend on whether the model notices a serialized history blob inside a prompt.

The paper’s core argument is that personal agents should stop treating preference history as just more context. Preferences are a control signal.

LOCALHARNESS: make the local layer the default decision-maker

The proposed architecture, LOCALHARNESS, decouples the decision stack.

A lightweight local statistical estimator learns user preferences from feedback. It becomes the default skill selector. The remote LLM is reserved for explicit semantic exceptions: cases where the user names a tool, gives a direct override, or says something that should beat the learned habit.

That split is practical for local personal agents:

the user’s preference state can stay local;
high-frequency decisions do not require a remote model call;
the statistical learner can handle exploration and exploitation explicitly;
the LLM is used where it is strongest: interpreting language.

The paper implements two local priors: a simple frequency prior and a bandit prior. The bandit version is especially interesting because it does not only exploit the current favorite; it explores enough to avoid locking onto a bad early guess.
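To make the difference between the two priors concrete, here is a minimal sketch. It is not the paper's implementation: the class names are hypothetical, and the bandit is written as Beta-Bernoulli Thompson sampling, which is one common choice that I am assuming for illustration.

```python
import random
from collections import defaultdict


class FrequencyPrior:
    """Always picks the skill the user has accepted most often so far."""

    def __init__(self, skills):
        self.skills = list(skills)
        self.counts = defaultdict(int)

    def select(self):
        # Pure exploitation; ties resolve to the earliest skill in the list.
        return max(self.skills, key=lambda s: self.counts[s])

    def update(self, skill, reward):
        if reward:
            self.counts[skill] += 1


class ThompsonBanditPrior:
    """Beta-Bernoulli Thompson sampling (an assumed bandit, not the paper's)."""

    def __init__(self, skills, seed=0):
        self.skills = list(skills)
        # Start every skill at Beta(1, 1), i.e. a uniform belief.
        self.successes = {s: 1 for s in self.skills}
        self.failures = {s: 1 for s in self.skills}
        self.rng = random.Random(seed)

    def select(self):
        # Sample a plausible acceptance rate per skill and pick the largest.
        # Uncertain skills sometimes sample high, which is the exploration.
        samples = {
            s: self.rng.betavariate(self.successes[s], self.failures[s])
            for s in self.skills
        }
        return max(samples, key=samples.get)

    def update(self, skill, reward):
        if reward:
            self.successes[skill] += 1
        else:
            self.failures[skill] += 1
```

The frequency prior never revisits a skill once another one pulls ahead. The sampling step in the bandit keeps giving under-tried skills an occasional chance until the evidence against them is strong.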

Why Bandit-as-Override is the useful pattern

The strongest variant in the paper is Bandit-as-Override.

The local bandit chooses the preferred skill by default. Then the LLM acts as an override probe for explicit instructions. If the user clearly asks for a specific skill, the system follows that semantic signal. Otherwise, the learned local preference wins.

That is a better contract than “LLM, please read all this memory and decide.”

It turns the LLM from the sole decision-maker into an exception handler around a simpler local policy.
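The contract is small enough to sketch in a few lines. Everything here is illustrative: `select_skill` and `keyword_override_probe` are hypothetical names, the skill names are made up, and the keyword match is only a stand-in for the LLM override probe, which in a real system would be a model call.

```python
from typing import Callable, Optional, Sequence, Tuple


def keyword_override_probe(utterance: str, skills: Sequence[str]) -> Optional[str]:
    """Stand-in for the LLM probe: returns a skill only if it is named verbatim."""
    text = utterance.lower()
    for skill in skills:
        if skill.lower() in text:
            return skill
    return None


def select_skill(
    utterance: str,
    skills: Sequence[str],
    local_choice: str,
    probe: Callable[[str, Sequence[str]], Optional[str]] = keyword_override_probe,
) -> Tuple[str, str]:
    """Return (skill, source). An explicit override beats the learned default."""
    override = probe(utterance, skills)
    if override is not None and override in skills:
        return override, "override"
    return local_choice, "local"


skills = ["cafe_alpha", "cafe_beta"]  # hypothetical ordering skills
print(select_skill("a cappuccino, please", skills, local_choice="cafe_beta"))
print(select_skill("order a cappuccino via cafe_alpha", skills, local_choice="cafe_beta"))
```

The probe answers one narrow question, whether the utterance names a skill, and it never sees the preference history. That is what keeps the learned habit out of the prompt.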

For agent builders, this matters because a lot of user preference is repetitive and low-level:

which calendar tool to use;
which writing style to apply;
which deployment command is usually safe;
which image model or output folder a workflow prefers;
which social channel should receive which type of post.

These choices should not be rediscovered from scratch every time. They also should not be buried in a giant memory prompt and hoped into existence.

The benchmark: TOOLBENCH-60

Because this exact problem does not have a mature benchmark, the authors build TOOLBENCH-60, a simulation sandbox with 60 skills across 10 domains, derived from ToolBench.

Synthetic users are assigned latent preference distributions. Some users have deterministic preferences; others are more stochastic. The query pool mixes:

standard queries, where the skill is not named and the system must recover preference;
explicit queries, where the skill is named and semantic override should work.
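A toy version of that setup might look like the following. This is my own parameterization, not the benchmark's generator: `make_user`, `make_query`, `favorite_mass`, and `explicit_rate` are hypothetical names and values, and the soft profile below is not meant to reproduce the paper's Soft-0.3 definition.

```python
import random


def make_user(skills, mode, rng, favorite_mass=0.7):
    """Build a latent preference distribution over skills (needs >= 2 skills)."""
    favorite = rng.choice(skills)
    if mode == "one_hot":
        # Deterministic user: all probability mass on a single skill.
        return {s: 1.0 if s == favorite else 0.0 for s in skills}
    # Stochastic user: most mass on the favorite, the rest spread evenly.
    rest = (1.0 - favorite_mass) / (len(skills) - 1)
    return {s: favorite_mass if s == favorite else rest for s in skills}


def make_query(skills, rng, explicit_rate=0.2):
    """Draw either an explicit query (skill named) or a standard one."""
    if rng.random() < explicit_rate:
        named = rng.choice(skills)
        return {"text": f"use {named} for this", "kind": "explicit", "named_skill": named}
    return {"text": "handle this for me", "kind": "standard", "named_skill": None}


rng = random.Random(0)
skills = ["skill_a", "skill_b", "skill_c"]
print(make_user(skills, "one_hot", rng))
print(make_user(skills, "soft", rng))
print(make_query(skills, rng))
```

On standard queries the only way to do well is to recover the hidden distribution from feedback. On explicit queries the hidden distribution is irrelevant and the named skill is the right answer.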

That design is important. A purely statistical agent can learn habits but fails when the user explicitly asks for something else. A pure LLM can understand explicit wording but struggles to infer hidden long-term preferences. The benchmark forces both abilities to matter.

What the results say

The paper evaluates nine agents across four families: no learning, statistical-only, LLM-with-memory, and LLM-with-statistical-prior.

On Qwen3-30B-Instruct, the main table shows the expected pattern:

random and zero-shot LLM baselines have high regret;
statistical-only methods reduce regret but fail on explicit overrides;
prompt-injected memory improves over zero-shot but remains weaker;
the decoupled statistical-prior designs achieve the best combination of low regret and high test accuracy.

In the Soft-0.3 preference regime, Bandit-as-Override reaches 264.8 cumulative regret and 46.2% test accuracy on Qwen3-30B-Instruct, outperforming Profile-Memory’s 344.2 regret and 32.9% accuracy. In the one-hot regime, Bandit-as-Override reaches 135.7 cumulative regret and 84.3% test accuracy.

The precise numbers matter less than the direction: the decoupled design wins because each decision is made by the subsystem suited to it. The local learner tracks habit, and the LLM handles explicit wording.
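For readers unfamiliar with the metric, here is one common way to compute cumulative regret. The definition is an assumption on my part, since this post does not reproduce the paper's formula: each step's regret is the gap between the user's most-preferred skill and the chosen skill under the latent preference distribution. The function name and the example profile are hypothetical.

```python
def cumulative_regret(preferences, chosen_skills):
    """Sum of per-step gaps between the best skill and the chosen skill.

    `preferences` maps skill -> latent preference probability (assumed known
    to the simulator, never to the agent).
    """
    best = max(preferences.values())
    return sum(best - preferences[skill] for skill in chosen_skills)


prefs = {"skill_a": 0.7, "skill_b": 0.2, "skill_c": 0.1}  # made-up profile
choices = ["skill_b", "skill_c", "skill_a", "skill_a"]
print(round(cumulative_regret(prefs, choices), 3))
```

Under this definition an agent that locks onto the favorite early stops accumulating regret, while one that keeps guessing pays the gap on every step.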

The design lesson for personal agents

The paper is really about control surfaces.

If an agent system wants to learn user habits, the habit learner should be a first-class component. It should have its own state, update rule, reward signal, and failure boundaries.

The LLM should not be forced to act like a database, a bandit algorithm, and a language interpreter at the same time.

A practical personal-agent stack might look like this:

classify the task domain;
let a local preference harness rank likely skills;
ask the LLM only whether the current utterance contains an explicit override;
execute the selected skill;
update the local preference model from feedback.
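The five steps above can be sketched end to end. This is a toy under stated assumptions, not the paper's system: the skill registry, `LocalHarness`, the keyword-based `classify_domain` and `detect_override`, and the epsilon-greedy ranking are all hypothetical stand-ins, and skill execution is a stub.

```python
import random
from collections import defaultdict

SKILLS = {  # hypothetical registry: domain -> candidate skills
    "coffee": ["cafe_alpha", "cafe_beta"],
    "calendar": ["cal_work", "cal_home"],
}


class LocalHarness:
    """Per-domain acceptance rates with epsilon-greedy exploration (assumed)."""

    def __init__(self, epsilon=0.1, seed=0):
        self.wins = defaultdict(int)
        self.tries = defaultdict(int)
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def _rate(self, domain, skill):
        key = (domain, skill)
        return (self.wins[key] + 1) / (self.tries[key] + 2)  # Laplace-smoothed

    def rank(self, domain):
        skills = list(SKILLS[domain])
        if self.rng.random() < self.epsilon:
            self.rng.shuffle(skills)  # explore: return a random order
            return skills
        return sorted(skills, key=lambda s: self._rate(domain, s), reverse=True)

    def update(self, domain, skill, accepted):
        key = (domain, skill)
        self.tries[key] += 1
        self.wins[key] += int(accepted)


def classify_domain(utterance):
    # Stub classifier; a real one would be learned or rule-based.
    return "calendar" if "meeting" in utterance.lower() else "coffee"


def detect_override(utterance, skills):
    # Stand-in for the LLM probe: a skill counts only if named verbatim.
    text = utterance.lower()
    return next((s for s in skills if s in text), None)


def handle_request(harness, utterance, feedback):
    domain = classify_domain(utterance)                    # 1. classify
    ranked = harness.rank(domain)                          # 2. local ranking
    override = detect_override(utterance, SKILLS[domain])  # 3. override probe
    skill = override or ranked[0]
    _result = f"ran {skill}"                               # 4. execute (stub)
    harness.update(domain, skill, feedback(skill))         # 5. learn
    return skill
```

Every piece of state lives in two dictionaries that can be printed and inspected. Whether an explicit override should also update the learned preference, as it does here, is a design choice the sketch makes arbitrarily.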

That is small, inspectable, and easier to debug than a prompt containing a pile of history.

Caveats

The paper is careful about limitations.

TOOLBENCH-60 models stationary user profiles, while real users change. Feedback is assumed to be immediate and binary, while real feedback is often sparse, delayed, or ambiguous. The local harness uses deterministic feature hashing, which may miss subtle language variation. The override path still depends on capable remote models.
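The feature-hashing caveat is easy to see in a sketch. This is a generic hashing-trick example, not the paper's featurizer: `hash_features`, the whitespace tokenizer, the MD5 choice, and the bucket count are all assumptions.

```python
import hashlib
from collections import Counter


def hash_features(utterance, num_buckets=1024):
    """Map each lowercase token to a bucket via MD5; returns {bucket: count}.

    MD5 is used instead of Python's built-in hash() because the built-in is
    salted per process for strings, so it is not stable across runs.
    """
    buckets = Counter()
    for token in utterance.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        buckets[int.from_bytes(digest[:4], "big") % num_buckets] += 1
    return dict(buckets)


print(hash_features("order a cappuccino"))
print(hash_features("get me a flat white"))
```

The mapping is cheap and reproducible, but it treats tokens as opaque strings. "cappuccino" and a misspelled "capuccino" are unrelated features to this scheme, which is the kind of language variation the paper flags.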

Those caveats do not weaken the main design lesson. They point to the next version: non-stationary preference models, richer feedback, contextual bandits, and smaller local override models.

Why I like this paper

For personal agents, “more memory” is not automatically better.

The better question is: what should memory control?

This paper gives a clean answer for one important slice of agent behavior. User preference should not merely be remembered. It should be turned into a local policy that can act, explore, be overridden, and be audited.

That is the kind of architecture personal agents need if they are going to feel less like generic chatbots with tool access and more like systems that actually learn how their human works.

Source: Gan, Tang, and Liu, Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents, arXiv:2606.05828. https://arxiv.org/abs/2606.05828