TGL as a Wake/Anchor Substrate for Proactive Agents

The useful lesson in “Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?” is not merely that a temporal graph model can be faster than an LLM trigger.

That is true, but too small.

The stronger systems lesson is this: a proactive agent should not collapse wake detection, context anchoring, and language generation into one always-on LLM loop. Those are different layers with different latency, privacy, calibration, and interface requirements.

A better architecture looks like this:

User/App Events
   ↓
Temporal Graph Encoder
   ↓
Wake Score + Entity Anchors
   ↓
Policy / Threshold / Permissions
   ↓
LLM only when needed
   ↓
User-facing Action or Explanation

In that architecture, temporal graph learning is not a cheaper chatbot brain. It is a wake/anchor substrate.

Proactive agents have three jobs before they act

A proactive agent is often described as an LLM that can decide when to help. Operationally, that framing is too blurry.

Before any user-facing action, the system must answer at least three questions:

Wake: should the agent intervene at this event, or stay quiet?
Anchor: if it intervenes, which entity or evidence should ground the intervention?
Speak/reason: how should the agent turn that grounded state into a helpful suggestion, question, or action?

LLMs are strong at the third layer. They can produce fluent explanations, ask clarifying questions, reason across task context, and adapt tone.

But the first two layers sit on the always-on path. They run for every event: file open, app switch, URL visit, query, meeting update, message activity, calendar interaction. Using a full LLM there makes the most frequent part of the system the most expensive and privacy-exposed part.

The paper’s contribution is to move wake and anchor into a small graph-first controller, then call the downstream LLM only when the trigger fires.

Event streams are already graphs

The paper starts from a deceptively simple observation: user activity is not natively text.

A desktop or mobile environment already emits structured events: actor, verb, object, timestamp, plus identifiers for files, apps, URLs, domains, queries, and artifacts. Benchmarks may serialize these into text, but the underlying structure is relational and temporal.

When a system renders that structure as natural language and asks an LLM to recover it, it performs an unnecessary round trip:

structured event → text prompt → inferred structure

That is convenient for prompt-based baselines, but it throws away useful inductive bias. The same file may appear across multiple events. One event may touch several entities. Entity type, time gap, and repeated interactions matter. A flat text sequence is a poor native substrate for that.

A heterogeneous temporal graph is a better fit:

event nodes represent time-ordered observations;
entity nodes represent files, apps, URLs, queries, artifacts, etc.;
type/schema nodes capture file extensions, app types, URL domains, query language, and related metadata;
temporal edges preserve how user activity evolves.

Once the always-on signal is graph-shaped, wake becomes event-node classification and anchoring becomes entity-node scoring.
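To make the node and edge types above concrete, here is a minimal in-memory sketch of such a graph. It is not the paper's implementation: the class names `Event` and `SessionGraph`, the field names, and the sample entities are all hypothetical, chosen only to show that entity identity and time gaps survive without any text serialization.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One structured activity event; field names are illustrative."""
    timestamp: float
    verb: str
    entities: tuple  # ((entity_type, entity_id), ...) touched by this event


@dataclass
class SessionGraph:
    """Hypothetical in-memory layout of a heterogeneous temporal graph."""
    events: list = field(default_factory=list)  # event nodes, time-ordered
    entity_events: dict = field(default_factory=lambda: defaultdict(list))
    type_members: dict = field(default_factory=lambda: defaultdict(set))
    temporal_edges: list = field(default_factory=list)  # (prev, next, gap_s)

    def add_event(self, event: Event) -> int:
        idx = len(self.events)
        if self.events:
            # Temporal edge from the previous event, annotated with the gap.
            gap = event.timestamp - self.events[-1].timestamp
            self.temporal_edges.append((idx - 1, idx, gap))
        self.events.append(event)
        for etype, eid in event.entities:
            key = (etype, eid)
            self.entity_events[key].append(idx)  # event-to-entity edge
            self.type_members[etype].add(key)    # entity-to-type edge
        return idx


graph = SessionGraph()
graph.add_event(Event(0.0, "open", (("file", "report.docx"), ("app", "Word"))))
graph.add_event(Event(12.5, "visit", (("url", "example.com/spec"),)))
graph.add_event(Event(20.0, "edit", (("file", "report.docx"),)))
```

The same file node now links events 0 and 2, which is exactly the repeated-interaction signal a flat text sequence has to rediscover.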

The model: one TGL backbone, two heads

The proposed system builds a heterogeneous temporal interaction graph for each activity session and trains one shared temporal graph learning backbone with two prediction heads:

a trigger head over event nodes, producing p_trig(t);
a routing head over entity nodes, producing per-entity relevance scores.

Anchoring is not generation. It is the act of selecting the evidence the agent is allowed to reason from: the file, person, thread, task, meeting, repository, or workflow that makes the event meaningful. If this step is wrong, the LLM may still sound fluent, but its suggestion will feel ungrounded.

At inference time:

the current session graph is updated;
one TGL forward pass produces both the wake score and entity routing scores;
if the trigger score is below threshold, no downstream LLM call is made;
if it fires, the top-ranked routed entities are forwarded as structured context;
a frozen instruction-tuned downstream LLM turns that handoff into a user-facing proactive suggestion.
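The inference steps above reduce to a small gate. The sketch below assumes a hypothetical `forward` callable that stands in for the TGL forward pass and returns a wake score plus per-entity scores; `Handoff`, `step`, and `top_k` are illustrative names, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Handoff:
    wake_score: float
    anchors: list  # top-ranked entity ids, best first


def step(
    forward: Callable[[], tuple],  # hypothetical: () -> (p_trig, {entity: score})
    threshold: float,
    top_k: int = 3,
) -> Optional[Handoff]:
    """One always-on step: a single forward pass yields both outputs."""
    p_trig, entity_scores = forward()
    if p_trig < threshold:
        return None  # stay quiet; no downstream LLM call is made
    ranked = sorted(entity_scores, key=entity_scores.get, reverse=True)
    return Handoff(wake_score=p_trig, anchors=ranked[:top_k])


# Stubbed forward passes standing in for the trained model.
quiet = step(lambda: (0.2, {"report.docx": 0.9}), threshold=0.5)
fired = step(
    lambda: (0.8, {"report.docx": 0.7, "Word": 0.1, "example.com/spec": 0.4}),
    threshold=0.5,
    top_k=2,
)
```

Only a non-`None` handoff would ever reach the downstream LLM, so the cost of the expensive layer scales with the number of fired triggers rather than the number of events.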

The design matters because the wake decision and supporting evidence come from the same hidden state. In many LLM-shaped proactive designs, trigger and context provider are separate modules. That separation can produce a common failure mode: the agent wakes for one reason but grounds the suggestion on a different or overly generic artifact.

The TGL design makes “why wake?” and “what to anchor?” part of the same controller.

Main result: routing improves downstream agents broadly

On the desktop ProactiveAgent benchmark, the authors evaluate 14 downstream language-agent backbones across open-weight, OpenAI, Anthropic, and DeepSeek families. They use one TGL checkpoint, one trigger threshold, and one prompt template across all backbones.

The top-line result, within the paper’s benchmark/protocol, is that TGL improves F1 on all 14 downstream backbones versus the compared vanilla protocol, with gains from +3.1 to +46.0 points and a mean gain of +16.7 points.

Examples from Table 1:

Qwen2-7B-Instruct: 60.74 → 70.68 F1
LLaMA-3.1-8B-Instruct: 55.06 → 72.07 F1
Qwen3-4B: 38.54 → 66.77 F1
Qwen3-8B: 26.14 → 72.14 F1
GPT-5.4: 73.47 → 76.57 F1
Claude-Opus-4.7: 74.44 → 79.86 F1
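As a quick arithmetic check, the per-model gains for these six rows can be recomputed directly. The numbers are the ones quoted above; the dictionary name is just for illustration, and because this is a subset of the 14 backbones its average is not expected to match the reported mean.

```python
# (vanilla F1, F1 with TGL routing) for the rows quoted above.
table1_examples = {
    "Qwen2-7B-Instruct": (60.74, 70.68),
    "LLaMA-3.1-8B-Instruct": (55.06, 72.07),
    "Qwen3-4B": (38.54, 66.77),
    "Qwen3-8B": (26.14, 72.14),
    "GPT-5.4": (73.47, 76.57),
    "Claude-Opus-4.7": (74.44, 79.86),
}

gains = {
    name: round(after - before, 2)
    for name, (before, after) in table1_examples.items()
}
smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} F1")
```

The smallest and largest gains in this subset (GPT-5.4 and Qwen3-8B) line up with the +3.1 and +46.0 endpoints of the reported range.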

The largest gains appear where vanilla self-regulation drifts furthest: models that over-fire on nearly every event or under-fire because they lack concrete anchors. Stronger backbones with already-good vanilla behavior still improve, but by smaller margins.

One particularly interesting comparison is Qwen2-7B:

vanilla: 60.74 F1
proactively fine-tuned baseline from Lu et al. (2025): 66.47 F1
vanilla + TGL routing: 70.68 F1

That suggests a small graph-encoder add-on can sometimes beat parameter-level proactive adaptation of the downstream LLM, at least under this benchmark protocol.

Trigger architecture: graph beats LLM-shaped triggering here

The paper also compares trigger architectures under shared trigger supervision: rule-based, tabular, textual embedding, LLM-as-trigger, and TGL.

TGL leads all three trigger AUC columns:

must-fire AUC: 0.738
can-skip AUC: 0.658
must-skip AUC: 0.639

The strongest LLM trigger in must-fire ranking is Qwen3-0.6B at 0.668, so TGL is 0.070 AUC (7.0 points) ahead on that axis.

Just as important: scaling the LLM trigger from Qwen3-0.6B to Qwen3-8B does not improve must-fire AUC in this setup (0.668 → 0.644). Bigger language reasoning is not automatically a better always-on trigger.
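For readers who want the ranking intuition behind these numbers: AUC is the probability that a randomly chosen positive event gets a higher score than a randomly chosen negative one. The sketch below implements that generic definition on made-up scores; it does not reproduce how the paper defines the must-fire, can-skip, and must-skip label sets.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win, matching the usual rank-based definition.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical trigger scores, not taken from the paper.
should_fire = [0.9, 0.6, 0.4]
should_skip = [0.7, 0.3, 0.2, 0.1]
auc = pairwise_auc(should_fire, should_skip)  # 10 of 12 pairs ranked correctly
```

Because AUC is threshold-free, it measures how well the trigger orders events; picking the operating threshold is a separate calibration question.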

The calibration result is also operationally important. TGL has the smallest trigger-threshold standard deviation across 14 downstream backbones (0.035), meaning one deployed threshold remains near-optimal across the panel. LLM and BGE-style textual triggers show much larger threshold drift, which implies more per-backbone calibration work.

For production systems, threshold stability is not a cosmetic metric. A proactive agent with a fragile threshold either annoys users or silently misses useful interventions.
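A minimal sketch of what the threshold-stability metric measures, assuming each downstream backbone has its own best trigger threshold. The two lists are invented for illustration and are not from the paper, and whether the paper uses population or sample standard deviation is not stated here; the sketch uses the population form.

```python
from statistics import mean, pstdev

# Hypothetical per-backbone optimal thresholds for two trigger designs.
stable_trigger = [0.48, 0.50, 0.52, 0.50, 0.47, 0.53]
drifting_trigger = [0.20, 0.65, 0.40, 0.80, 0.35, 0.60]


def threshold_spread(optimal_thresholds):
    """Return (shared threshold, std dev, worst-case miss) for one trigger."""
    shared = mean(optimal_thresholds)  # one deployed threshold for all backbones
    worst = max(abs(t - shared) for t in optimal_thresholds)
    return shared, pstdev(optimal_thresholds), worst


stable_stats = threshold_spread(stable_trigger)
drifting_stats = threshold_spread(drifting_trigger)
```

A small standard deviation means the single shared threshold sits close to every backbone's optimum; a large one means some backbones run far from their best operating point unless they are calibrated individually.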

Efficiency: always-on controllers need a different budget

The efficiency numbers are the obvious reason to avoid always-on LLM triggering.

Reported TGL trigger/routing latency, not end-to-end agent latency:

11.13 ms/event on an NVIDIA RTX A6000 server;
13.99 ms/event on a consumer laptop.

LLM-as-trigger latency from Table 2:

Qwen3-0.6B: 40.4 ms server, 162.3 ms local;
Qwen3-8B: 78.6 ms server, 1156.8 ms local;
generation-derived Proactive-Qwen3-0.6B: 3927 ms server, 11966 ms local.

The paper reports TGL as roughly 4–7× faster than tested single-forward LLM triggers on server and 12–83× faster on laptop.
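Those speedup ranges can be recovered from the latencies quoted above. This check assumes the "local" column for the LLM triggers is the same consumer laptop used for the TGL measurement, and it leaves out the generation-derived trigger, which is not a single forward pass.

```python
# Milliseconds per event, as quoted above.
tgl_ms = {"server": 11.13, "laptop": 13.99}
llm_trigger_ms = {
    "Qwen3-0.6B": {"server": 40.4, "laptop": 162.3},
    "Qwen3-8B": {"server": 78.6, "laptop": 1156.8},
}

speedup = {
    model: {hw: round(latency / tgl_ms[hw], 1) for hw, latency in by_hw.items()}
    for model, by_hw in llm_trigger_ms.items()
}

for model, by_hw in speedup.items():
    print(f"{model}: {by_hw['server']}x on server, {by_hw['laptop']}x on laptop")
```

The ratios come out to about 3.6x and 7.1x on the server and about 11.6x and 82.7x on the laptop, consistent with the rounded ranges in the text.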

Memory footprint is the bigger on-device story:

TGL has 1.16M trainable parameters plus a 109M frozen BGE encoder;
resident weights occupy about 220 MiB BF16;
streaming inference peaks around 267 MiB;
Qwen3-8B as an LLM trigger requires about 16 GB BF16 VRAM just to stay resident.
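The resident-weight figures follow from simple arithmetic: parameter count times two bytes per BF16 value. This is a weights-only back-of-envelope estimate that ignores activations, buffers, and runtime overhead, and the 8e9 parameter count for Qwen3-8B is a nominal assumption rather than an exact number.

```python
BYTES_PER_BF16 = 2

tgl_params = 1.16e6 + 109e6  # trainable TGL parameters + frozen BGE encoder
llm_params = 8e9             # nominal size assumed for an "8B" model

tgl_bytes = tgl_params * BYTES_PER_BF16
llm_bytes = llm_params * BYTES_PER_BF16

tgl_mb = tgl_bytes / 1e6     # decimal megabytes
tgl_mib = tgl_bytes / 2**20  # binary mebibytes
llm_gb = llm_bytes / 1e9     # decimal gigabytes

print(f"TGL weights: ~{tgl_mb:.0f} MB (~{tgl_mib:.0f} MiB)")
print(f"8B LLM weights: ~{llm_gb:.0f} GB")
print(f"ratio: ~{llm_bytes / tgl_bytes:.0f}x")
```

Under these assumptions the always-on controller's weights are roughly two orders of magnitude smaller than the 8B trigger's, which is the gap that matters when the trigger has to share memory with the user's applications.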

For a desktop assistant, the always-on trigger must coexist with the user’s actual applications. A trigger that consumes multi-GB memory before doing any useful generation is a poor default substrate.

Privacy: the always-on path is the sensitive path

The activity stream contains file names, URLs, search queries, app switches, and sometimes profile fields. This is exactly the data you do not want to ship casually through a cloud prompt loop for every event.

A better LLM handoff looks more like a compact case file than a raw activity dump:

trigger reason
relevant entities
recent temporal context
confidence score
policy constraints
allowed actions

This turns the LLM call from an open-ended “read everything and decide” prompt into a bounded reasoning step over selected evidence.
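One way to picture that case file is as a small typed record serialized for the LLM call. The sketch below is hypothetical throughout: the `CaseFile` name, field names, and sample values are invented to mirror the six items listed above, not taken from the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseFile:
    """Bounded handoff from the wake/anchor layer to the LLM (illustrative)."""
    trigger_reason: str
    relevant_entities: list
    recent_context: list
    confidence: float
    policy_constraints: list = field(default_factory=list)
    allowed_actions: list = field(default_factory=list)


case = CaseFile(
    trigger_reason="repeated edits to the same file after a spec lookup",
    relevant_entities=[{"type": "file", "id": "report.docx"}],
    recent_context=["open report.docx", "visit spec page", "edit report.docx"],
    confidence=0.81,
    policy_constraints=["do not include raw URLs"],
    allowed_actions=["suggest", "ask_clarifying_question"],
)

# The serialized payload is everything the downstream LLM gets to see.
payload = json.dumps(asdict(case), indent=2)
```

Keeping the handoff a fixed schema also makes it auditable: anything not in the record never reached the prompt.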

The graph-first architecture does not solve privacy by magic, but it makes a better deployment boundary plausible:

keep the always-on wake/anchor layer on-device;
minimize what crosses into the LLM layer;
forward structured entities only when policy permits;
retain full LLM reasoning for the smaller number of events that survive the trigger.
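The "forward only when policy permits" step can be sketched as a filter that runs on-device before anything crosses into the LLM layer. The sensitive type names and the `allow_cloud` flag are assumptions made for this example; a real policy would be richer and user-configurable.

```python
# Hypothetical policy: entity types that must never leave the device.
SENSITIVE_TYPES = frozenset({"query", "profile"})


def minimize(entities, allow_cloud, sensitive_types=SENSITIVE_TYPES):
    """Return only the routed entities that policy allows to be forwarded."""
    if not allow_cloud:
        return []  # user opted out: nothing crosses the boundary
    return [e for e in entities if e["type"] not in sensitive_types]


routed = [
    {"type": "file", "id": "report.docx"},
    {"type": "query", "id": "symptoms of burnout"},
    {"type": "app", "id": "Word"},
]

forwarded = minimize(routed, allow_cloud=True)
opted_out = minimize(routed, allow_cloud=False)
```

Because the filter operates on typed entities rather than free text, the boundary is a short allow/deny rule instead of a redaction pass over a prompt.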

The paper is careful here. It explicitly notes that production deployment still needs data minimization, sensitive-entity filtering, opt-out, retention controls, and fairness analysis before demographic fields are used as routing features.

That caveat should not be treated as boilerplate. Proactive agents are interruption systems and surveillance-adjacent systems at the same time. Their trigger layer is a safety boundary, not just a performance optimization.

What not to overclaim

This is a strong architecture paper, but builders should keep the scope clear.

The main evaluation follows offline reward-model judge protocols from ProactiveAgent and FingerTip. The paper does not prove that users will subjectively prefer the TGL-based agent in a long-running real deployment.

The real test is not whether the model fires on the benchmark. It is whether the user feels the agent is timely, grounded, and non-invasive after weeks of use.

In proactive systems, false positives are interruptions, and false negatives are missed chances to help. The metric is only the beginning; the lived relationship with the agent is the product.

On ProactiveAgent, the released stream is text-serialized, so the implementation reconstructs platform entities with a deterministic extractor. In a real OS/app-integrated deployment, those identifiers could come directly from the platform, which is cleaner, but building that integration is still an engineering requirement.

TGL also does not replace:

downstream LLM reasoning;
user permission and interruption policy;
action safety;
memory governance;
UI/UX design for how suggestions appear;
human preference learning over time.

The correct claim is not “TGL replaces LLMs for proactive agents.”

The correct claim is: wake and anchor are not necessarily language-modeling problems, and treating them as graph problems can improve cost, latency, grounding, and calibration.

Design principle for agent builders

The paper ends with a simple principle that is worth carrying into agent architecture:

Keep a lightweight temporal model always on, and reserve full LLM reasoning for moments that survive its trigger.

I would phrase it even more operationally:

Do not stringify structured event streams too early.
Preserve entity identity across time.
Couple wake decisions with evidence routing.
Calibrate the trigger separately from the downstream speaker.
Keep the always-on path small enough to run near the user.
Treat privacy and interruption cost as first-class architectural constraints.

A proactive agent is not just a chatbot that talks first. It is a layered system with a nervous system, a policy boundary, and a language interface.

TGL is not a cheaper chatbot brain. It is a wake/anchor substrate that lets the LLM arrive later, with better context, fewer privacy risks, and a clearer reason to speak.

TGL belongs in the nervous system. LLMs still belong in the reasoning and interface layer. The architecture gets better when each part stops pretending to be the other.

Reference

Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao, “Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?”, arXiv:2605.30152, submitted 28 May 2026. https://arxiv.org/abs/2605.30152