Scientific Forecasting Agents Need Calibration, Not Just Retrieval

A research agent that can explain yesterday’s breakthrough is not automatically a research agent that can forecast tomorrow’s.

That distinction is the useful sting in “Forecasting Scientific Progress with Artificial Intelligence” by Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, and Junchi Yu.

The paper introduces CUSP (Cutoff-conditioned Unseen Scientific Progress), a temporally grounded benchmark for evaluating whether AI systems can forecast scientific progress under controlled knowledge constraints.

The result is not “models know nothing about science.” The result is subtler and more operationally important:

Current models can often recognize plausible scientific mechanisms, but they are unreliable at forecasting whether an advance will happen, when it will happen, and how confident they should be before the outcome is known.

For agent builders, that is the point.

Scientific forecasting is not benchmark answering. It is a stateful calibration problem.

Paper: arXiv:2605.22681

CUSP’s core design: evaluate prediction before hindsight leaks in

Most scientific QA benchmarks are not forecasting benchmarks.

They test whether a model can answer questions about known scientific facts, reason over papers, solve technical problems, or retrieve information. Those are useful capabilities. But they do not tell us whether an agent can stand at a historical cutoff, see only what would have been available then, and form a calibrated expectation about what science will do next.

CUSP is designed around that cutoff boundary.

The benchmark contains:

4,760 scientific milestones
17,429 structured forecasting tasks
events spanning January 2024 to March 2026
nine top-level domains, including AI, biology, medicine, neuroscience, materials science, physics, environmental science, and chemistry
milestone sources such as Nature, Science, Cell, AI top-paper repositories, Hugging Face Top Papers, and AI leaderboard records, with additional metadata queries across services such as Crossref, Semantic Scholar, OpenAlex, Europe PMC, arXiv, and bioRxiv/medRxiv to establish earliest temporal references

The paper’s framing matters because it treats scientific forecasting as an event-level task with temporal accountability. A model should not get forecasting credit for rediscovering the answer after it has already happened.

That is a big deal for research agents.

A research agent with web access can sound extremely sharp after the fact. It can retrieve the winning paper, summarize the mechanism, and write a convincing explanation of why the result “made sense.” But that is hindsight reasoning. It is not the same as a decision-support system that helps allocate attention, compute, funding, or lab effort before the result exists.

Four forecasting abilities, not one generic “science skill”

CUSP decomposes scientific forecasting into four task families.

1. Feasibility assessment

Binary tasks ask whether a concrete scientific claim will be achieved. The benchmark also uses perturbed variants: plausible but unsupported modifications of the original claim, such as changed thresholds, conditions, scope, or outcomes.

This is where models should distinguish “this line of work is related” from “this exact claim will be satisfied.”

That distinction is operationally crucial. Many research ideas are plausible. Far fewer become concrete, published, verifiable advances.

2. Mechanistic reasoning

Multiple-choice questions ask the model to identify the technical approach that later enabled a discovery from competing candidates.

This is closer to mechanism recognition: given a set of possible approaches, can the model pick the one that aligns with the realized advance?

3. Generative solution design

Free-response tasks ask the model to propose a concrete solution strategy under a strict temporal cutoff. The paper evaluates these responses with a rubric covering alignment, specificity, novelty, and feasibility, while attempting to detect post-cutoff leakage.

This resembles the agent use case most directly: propose a research direction before the answer is known.

4. Temporal prediction

Date tasks ask when a milestone will be realized. CUSP scores predictions with an exponential decay based on month-level error.

This captures a capability many agent workflows silently need: not just “could this happen?” but “on what timescale should we act?”
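
The scoring idea can be sketched in a few lines. This is a minimal illustration of an exponential decay over month-level error, not the paper's exact formula: the decay scale `tau_months` and both function names are assumptions made for the example.

```python
import math
from datetime import date

def month_error(predicted: date, actual: date) -> int:
    """Signed error in whole months; positive means the prediction is late."""
    return (predicted.year - actual.year) * 12 + (predicted.month - actual.month)

def date_score(predicted: date, actual: date, tau_months: float = 6.0) -> float:
    """Score decays exponentially with absolute month error.

    An exact-month hit scores 1.0. tau_months is an assumed scale,
    not the constant used by CUSP.
    """
    return math.exp(-abs(month_error(predicted, actual)) / tau_months)
```

Under this sketch, early and late misses of the same size are penalized equally, so the score alone hides the direction of the error. Direction has to be tracked separately.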

The result pattern: recognition is easier than forecasting

The headline results are sharply split by task type.

On multiple-choice mechanistic reasoning, models perform well above chance. In the paper’s Table 2, GPT-5.4 reaches 0.819 MCQ accuracy, Claude S4.5 reaches 0.724, and DeepSeek R1 reaches 0.594, compared with 0.25 chance for four-choice questions.

That suggests frontier models can often recognize plausible mechanisms behind scientific progress when the candidate set is provided.

But binary feasibility is near chance. The paper reports merged binary accuracy in the 0.453–0.519 range after correcting for directional response bias.

That is the builder warning.

A model may be able to say:

“This approach is scientifically plausible.”

while still failing to say:

“This concrete advance will actually be realized.”

Those are different capabilities.

Free-response results tell a similar story. GPT-5.4 has the strongest free-response (FRQ) performance in the paper, with a 5.04/10 score and a 60.3% pass rate. Other evaluated models stay at or below a 20% pass rate. The paper also notes that models often achieve higher specificity than alignment: they can produce detailed, technical-sounding proposals that do not match the methods underlying the realized discoveries.

For research agents, that is dangerous. Specificity can create the feeling of competence while hiding misalignment with the actual path science takes.

A verbose proposal is not a calibrated forecast.

Date prediction exposes temporal bias

The date-prediction results are especially useful because they reveal that models do not merely lack facts. They lack stable temporal sense.

The paper reports that all evaluated models have positive signed error: they systematically predict publication dates later than the ground truth.

From Table 3:

LLaMA 3.3 has median error around +4 months and the best date score
DeepSeek R1 has median error +13 months
GPT-5.4 has median error +14 months
Claude S4.5 has median error +17 months
GPT-4o has median error +26 months

Exact-month accuracy is below 4% for all models.

The paper cautions that some date scores can be affected by temporal anchoring. If a model tends to cluster predictions around particular dates, it may look better on some distributions without truly understanding scientific time.
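
A builder can track the same diagnostics for their own agent. The sketch below is an assumption-laden illustration rather than the paper's evaluation code: the function names are hypothetical, and the "anchoring share" is just one simple proxy for clustering around a single date.

```python
from collections import Counter
from statistics import median

def summarize_date_errors(errors_months: list[int]) -> dict[str, float]:
    """Summarize predicted-minus-actual errors, measured in months."""
    if not errors_months:
        raise ValueError("need at least one error")
    n = len(errors_months)
    return {
        "median_signed_error": median(errors_months),
        "late_fraction": sum(e > 0 for e in errors_months) / n,
        "exact_month_accuracy": sum(e == 0 for e in errors_months) / n,
    }

def anchoring_share(predicted_months: list[str]) -> float:
    """Fraction of predictions that land on the single most common month."""
    if not predicted_months:
        raise ValueError("need at least one prediction")
    counts = Counter(predicted_months)
    return counts.most_common(1)[0][1] / len(predicted_months)

# Hypothetical errors for illustration only (not data from the paper).
summary = summarize_date_errors([14, 9, 0, 22, -3, 17])
```

A high anchoring share alongside a decent score is a hint that the agent is leaning on a favorite date rather than reasoning about the timeline.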

This is relevant for deployed agents because research planning is full of implicit timing assumptions:

Is this benchmark likely to saturate in six months or two years?
Is this biological result likely to appear after one experimental cycle or several?
Is a proposed method near-term engineering or long-horizon research?
Should a lab invest now, wait, or monitor?

A research agent that cannot reason about timescales should not be treated as a reliable roadmap generator.

More knowledge helps, but hindsight helps much more

One tempting interpretation is that models fail because they lack relevant pre-cutoff information. If so, better retrieval should fix the problem.

CUSP tests this with controlled information access.

The paper compares three settings: base models without search, web search restricted to pre-cutoff information, and unrestricted web search that can access full post-event information.

Pre-cutoff search helps. That shows a real knowledge gap: models do not always retrieve or use available prior knowledge effectively.

But pre-cutoff search does not close the gap to full-information settings. The paper calls the remaining difference the forecasting gap.

For GPT-5.4 date prediction, the paper cites a forecasting gap of 0.436 versus a knowledge gap of 0.070.

The paper also reports that the forecasting gap grows with citation-count quartile, meaning higher-impact advances are especially hard to recover from pre-cutoff evidence alone.
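
One way to read the two gaps is as differences between adjacent information settings. The sketch below assumes exactly that: each gap is a simple score difference. The paper's precise definition may differ, and the scores passed in the example call are hypothetical, not reported values.

```python
def information_gaps(base: float, pre_cutoff_search: float,
                     full_information: float) -> dict[str, float]:
    """Split the distance from base model to full hindsight into two parts.

    knowledge_gap:   what pre-cutoff retrieval adds over the base model.
    forecasting_gap: what remains between pre-cutoff retrieval and hindsight.
    """
    return {
        "knowledge_gap": pre_cutoff_search - base,
        "forecasting_gap": full_information - pre_cutoff_search,
    }

# Hypothetical scores, chosen only to show the shape of the comparison.
gaps = information_gaps(base=0.30, pre_cutoff_search=0.40, full_information=0.90)
```

Applied to the reported GPT-5.4 date-prediction figures, the forecasting gap is roughly six times the knowledge gap (0.436 / 0.070 ≈ 6.2).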

That is the central agent lesson:

Retrieval improves scientific forecasting, but it does not convert hindsight into foresight.

A research agent with tools may become a better analyst of existing evidence. It may still be a poor forecaster of event realization.

This means “give the agent web search” is not a complete answer for research planning. Search can fill the evidence envelope. It cannot by itself produce calibrated expectations about future scientific events.

Calibration failure is not a side issue

CUSP also finds systematic overconfidence and strong response biases.

Some models lean heavily toward “No” in binary prediction. The paper highlights GPT-OSS and GPT-4o as No-biased. LLaMA 3.3 shows a strong Yes bias. GPT-5.4 leans Yes. DeepSeek R1 is more balanced in the reported binary-response analysis, but still does not escape the broader forecasting limitations.

This matters because a biased forecasting agent may look useful in aggregate while being useless for decisions.

An always-optimistic agent will occasionally call breakthroughs correctly. An always-skeptical agent will avoid many false positives. Neither behavior is the same as evidence-sensitive forecasting.
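
A small example makes the point concrete. Balanced accuracy, used here as a generic bias-neutral metric that may differ from the paper's own correction, averages accuracy over the Yes-labelled and No-labelled items. The label set is hypothetical.

```python
def raw_accuracy(labels: list[bool], predictions: list[bool]) -> float:
    """Plain fraction of correct predictions."""
    return sum(p == y for y, p in zip(labels, predictions)) / len(labels)

def balanced_accuracy(labels: list[bool], predictions: list[bool]) -> float:
    """Mean of accuracy on true-Yes items and accuracy on true-No items."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    yes_preds = [p for y, p in zip(labels, predictions) if y]
    no_preds = [p for y, p in zip(labels, predictions) if not y]
    if not yes_preds or not no_preds:
        raise ValueError("need both Yes and No items")
    yes_acc = sum(yes_preds) / len(yes_preds)
    no_acc = sum(not p for p in no_preds) / len(no_preds)
    return (yes_acc + no_acc) / 2

# Hypothetical Yes-heavy evaluation set and an agent that always says Yes.
labels = [True] * 8 + [False] * 2
always_yes = [True] * 10
```

On this set the always-Yes agent has a raw accuracy of 0.8 but a balanced accuracy of 0.5, which is exactly chance.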

The paper also reports that uncertainty behavior is fragmented across tasks. Calibration may look acceptable on multiple-choice reasoning but deteriorate in open-ended forecasting or temporal prediction. Overconfidence can increase on events after the training cutoff even when accuracy does not improve.

For builders, calibration is not polish. It is part of the core capability.

A scientific forecasting agent needs to know not only what it predicts, but how much weight the prediction deserves.
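
Calibration can be measured directly once the agent states a confidence for each answer. The following is a standard expected calibration error (ECE) sketch with equal-width bins. It is offered as one common metric, not as the measure the paper uses, and the bin count is an arbitrary choice.

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the top bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

An agent that reports 90% confidence on every answer but is right half the time gets an ECE of about 0.4 under this sketch. Computing it per task family is what would expose the fragmentation described above.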

What CUSP implies for research-agent architecture

If you are building agents for scientific discovery, CUSP points toward a design shift.

Do not treat research forecasting as a single model call:

“Read papers. Predict what happens next.”

Treat it as a structured, auditable process.

A serious forecasting agent needs at least seven components.

1. Cutoff-aware evidence envelopes

Every forecast should specify what information was available at prediction time.

The agent should separate:

pre-cutoff evidence
post-cutoff information
unknowns
assumptions
retrieved sources
inaccessible or unverified claims

Without this boundary, the system cannot distinguish forecasting from retrospective explanation.
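
A minimal sketch of such an envelope, assuming every retrieved source carries a publication date or is explicitly marked as undated. The class and field names are hypothetical, chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Source:
    title: str
    published: Optional[date]  # None when the date cannot be verified

@dataclass
class EvidenceEnvelope:
    cutoff: date
    pre_cutoff: list[Source] = field(default_factory=list)
    post_cutoff: list[Source] = field(default_factory=list)
    unverified: list[Source] = field(default_factory=list)

    def add(self, source: Source) -> None:
        """Route a source by its date relative to the cutoff."""
        if source.published is None:
            self.unverified.append(source)
        elif source.published <= self.cutoff:
            self.pre_cutoff.append(source)
        else:
            self.post_cutoff.append(source)

    def usable_for_forecast(self) -> list[Source]:
        """Only dated, pre-cutoff sources may inform the forecast."""
        return list(self.pre_cutoff)
```

Undated sources are kept out of the forecast by default in this sketch. That is a conservative design choice, since an unverifiable date is a common path for leakage.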

2. Base-rate priors

Scientific progress has domain-specific tempos.

AI benchmark improvements, wet-lab biology, clinical medicine, materials discovery, and theoretical physics do not move at the same speed. A forecasting agent should learn and expose base rates by domain, subdomain, method class, publication venue, and benchmark type.

A prediction without a base rate is often just vibes in technical clothing.
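
As a rough sketch, a base rate can start as a smoothed frequency over previously resolved claims in the same domain. The function name, the history format, and the add-one smoothing are assumptions for illustration. A real system would likely condition on more than domain.

```python
from collections import defaultdict

def base_rates(history: list[tuple[str, bool]],
               prior_yes: float = 1.0,
               prior_no: float = 1.0) -> dict[str, float]:
    """Smoothed fraction of resolved claims per domain that were realized.

    history holds (domain, realized) pairs. The priors act as pseudo-counts
    so that a domain with one observation is not reported as 0% or 100%.
    """
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [realized, total]
    for domain, realized in history:
        counts[domain][0] += int(realized)
        counts[domain][1] += 1
    return {
        domain: (realized + prior_yes) / (total + prior_yes + prior_no)
        for domain, (realized, total) in counts.items()
    }

# Hypothetical resolved claims, for illustration only.
rates = base_rates([("ai", True), ("ai", True), ("ai", False), ("biology", False)])
```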

3. Counterfactual perturbation checks

CUSP’s perturbed binary tasks are a useful design pattern.

Before accepting a forecast, the agent should ask:

What if the threshold is higher?
What if the date is earlier?
What if the domain shifts?
What if the claim requires a stricter measurement condition?
What if the cited evidence supports the theme but not the exact criterion?

This helps catch a common model failure: matching the broad topic while missing the operational constraint.
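
For perturbations that make a claim strictly harder (a higher threshold, an earlier date, a stricter condition), this check can be automated. The sketch below assumes a hypothetical `forecast` callable that maps claim text to a probability. The labels and the `min_drop` threshold are illustrative choices.

```python
from typing import Callable

def perturbation_report(forecast: Callable[[str], float],
                        claim: str,
                        stricter_variants: list[str],
                        min_drop: float = 0.05) -> dict[str, str]:
    """Label how the forecast reacts to strictly harder versions of a claim."""
    p_original = forecast(claim)
    report: dict[str, str] = {}
    for variant in stricter_variants:
        p_variant = forecast(variant)
        if p_variant > p_original:
            report[variant] = "incoherent"   # harder claim rated more likely
        elif p_original - p_variant < min_drop:
            report[variant] = "insensitive"  # barely reacts to the stricter criterion
        else:
            report[variant] = "ok"
    return report

# Toy forecaster that only matches the topic and ignores the threshold.
def topic_matcher(text: str) -> float:
    return 0.7 if "protein" in text else 0.2

report = perturbation_report(
    topic_matcher,
    "protein design model reaches 80% success",
    ["protein design model reaches 95% success"],
)
```

The toy forecaster is flagged as insensitive: it assigns the same probability to the 80% and 95% versions of the claim. An "insensitive" label is a prompt for review rather than proof of error, because some stricter variants really are almost as likely.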

4. Explicit uncertainty and confidence accounting

The agent should output confidence, but also the reason for that confidence.

A useful confidence report should include:

evidence strength
base-rate support
disagreement among sources or models
known blockers
dependence on unresolved experiments
sensitivity to threshold/date changes

Confidence should be a structured object, not a decorative percentage.
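
One possible shape for that object, sketched as a dataclass. The field names mirror the list above. The `is_decorative` heuristic is a hypothetical example of a check a pipeline might run, not an established rule.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConfidenceReport:
    probability: float
    evidence_strength: str                 # e.g. "weak", "moderate", "strong"
    base_rate: Optional[float] = None      # None when no base rate was found
    source_disagreement: str = ""
    known_blockers: list[str] = field(default_factory=list)
    unresolved_dependencies: list[str] = field(default_factory=list)
    sensitivity_notes: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")

    def is_decorative(self) -> bool:
        """True when the number arrives with no base rate and no supporting notes."""
        has_notes = bool(self.known_blockers
                         or self.unresolved_dependencies
                         or self.sensitivity_notes)
        return self.base_rate is None and not has_notes
```

A report that fails this check is not necessarily wrong, but it gives a reviewer nothing to audit.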

5. Prediction ledgers

Forecasting systems should remember their predictions and resolve them later.

Each forecast needs:

prediction text
timestamp
cutoff
evidence used
confidence
resolution criteria
expected resolution date
final outcome
calibration update

Without a ledger, an agent will drift into post-hoc storytelling. It will remember the impressive calls and forget the misses.
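
A ledger does not need to be elaborate. The sketch below keeps entries in memory and uses the Brier score as the calibration signal. Both choices are assumptions for illustration, as are the class and field names. A production system would persist entries and make them append-only.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class LedgerEntry:
    prediction: str
    made_on: date
    cutoff: date
    probability: float            # stated probability that the claim resolves Yes
    resolution_criteria: str
    expected_resolution: date
    evidence: list[str] = field(default_factory=list)
    outcome: Optional[bool] = None  # None until the forecast is resolved

class PredictionLedger:
    def __init__(self) -> None:
        self.entries: list[LedgerEntry] = []

    def log(self, entry: LedgerEntry) -> int:
        """Store a forecast and return its index for later resolution."""
        self.entries.append(entry)
        return len(self.entries) - 1

    def resolve(self, index: int, outcome: bool) -> None:
        self.entries[index].outcome = outcome

    def brier_score(self) -> Optional[float]:
        """Mean squared error of stated probabilities over resolved entries."""
        resolved = [e for e in self.entries if e.outcome is not None]
        if not resolved:
            return None
        return sum((e.probability - float(e.outcome)) ** 2 for e in resolved) / len(resolved)
```

Because misses stay in the ledger next to the hits, the score reflects the whole record rather than the memorable calls.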

6. Separation between proposal quality and event probability

A research proposal can be scientifically elegant and still unlikely to become the realized path.

CUSP’s specificity/alignment gap is a warning here. Agents should separately score:

technical plausibility
novelty
feasibility
alignment with known constraints
probability of realization
expected time to realization

Do not collapse all of these into “good idea.”
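
Keeping the dimensions in separate fields makes the distinction hard to lose. In this sketch the 0-to-1 scales, the thresholds, and the names are all hypothetical. The point is only that quality and probability live in different fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProposalAssessment:
    technical_plausibility: float        # all scores on an assumed 0-1 scale
    novelty: float
    feasibility: float
    constraint_alignment: float
    realization_probability: float
    expected_months_to_realization: float

def elegant_but_unlikely(assessment: ProposalAssessment,
                         quality_floor: float = 0.7,
                         probability_ceiling: float = 0.3) -> bool:
    """Flag proposals that score well on quality but are unlikely to be realized."""
    quality = min(assessment.technical_plausibility,
                  assessment.novelty,
                  assessment.feasibility)
    return (quality >= quality_floor
            and assessment.realization_probability <= probability_ceiling)
```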

7. Hindsight quarantine

Once an outcome is known, the agent should quarantine that information when evaluating past forecasts.

This is especially important for agents with web search, memory, or long-lived context. A system that reads the future result and then updates its memory without provenance can contaminate later “forecasting” tests.

For long-running research agents, temporal provenance is a safety feature.
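
A minimal sketch of provenance-aware memory, assuming each stored item records both when the information became public and when the agent ingested it. The class names and the `recall(as_of=...)` interface are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class MemoryItem:
    content: str
    published: date  # when the information became public
    ingested: date   # when the agent stored it

class ProvenanceMemory:
    def __init__(self) -> None:
        self._items: list[MemoryItem] = []

    def store(self, item: MemoryItem) -> None:
        self._items.append(item)

    def recall(self, as_of: date) -> list[MemoryItem]:
        """Return only items public on or before as_of; later items stay quarantined."""
        return [item for item in self._items if item.published <= as_of]
```

Filtering on the publication date rather than the ingestion date is deliberate. An agent may ingest an old paper late, and that paper is still legitimate pre-cutoff evidence.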

What not to overclaim

CUSP is valuable, but it should not be treated as a perfect measurement of scientific foresight.

The benchmark depends on automatic task construction, validation filters, source selection, and LLM-based evaluation for free-response tasks. The paper includes human validation and reports meaningful agreement for the judge, but item-level variance remains.

The benchmark is also shaped by what becomes a measurable published event. Many real scientific contributions are incremental, negative, tacit, or embedded in lab practice rather than cleanly captured by a paper milestone.

So the right takeaway is not:

“CUSP perfectly measures whether AI can forecast science.”

The better takeaway is:

“CUSP gives us a concrete temporal framework for separating scientific knowledge, hindsight retrieval, and calibrated forecasting.”

That framework is the part agent builders should steal.

The takeaway

CUSP draws a hard line between two capabilities that are often blurred.

A model can be a strong scientific explainer and a weak scientific forecaster.

A model can recognize plausible mechanisms and still fail to predict event realization.

A model can retrieve relevant evidence and still lack calibrated uncertainty.

A model can sound technically specific and still be misaligned with the path science actually takes.

For research agents, the builder takeaway is simple:

Scientific forecasting agents need calibration infrastructure, not just better retrieval.

The next useful research agent should not merely summarize papers or generate hypotheses. It should maintain cutoff-aware evidence, track base rates, stress-test claims through perturbations, expose uncertainty, log predictions, resolve outcomes, and learn from miscalibration.

Until then, many “AI scientist” systems will remain better at explaining why a discovery made sense after it happened than at helping us decide what to believe before it does.

That is still useful.

But it is not yet foresight.

Source: Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, Junchi Yu — Forecasting Scientific Progress with Artificial Intelligence, arXiv:2605.22681v1, 2026.
https://arxiv.org/abs/2605.22681