The UK’s AI Security Institute has put a hard number on a suspicion many AI teams already had: fixed-budget benchmarks can make agents look weaker than they are.

In testing frontier models across seven benchmarks, AISI found that performance generally improved when agents were allowed more compute at inference time, with success rates rising by as much as 25% in some cases. The practical meaning is not subtle. A score produced under a tight budget is not a stable measure of capability; it is a snapshot of capability under one particular operating constraint.

That matters because a lot of the industry still treats benchmark numbers as if they were intrinsic model properties. They are not. They are budget-conditioned outputs. If you change the compute allowance, you change the result, and the gap can be large enough to alter product decisions, vendor comparisons, and security assumptions.

Benchmarks lag the frontier

The AISI finding is not simply that more compute helps. It is that newer models appear to benefit disproportionately from larger budgets, which means the frontier is moving faster than static tests suggest. In other words, benchmarks that were designed to compare systems under fixed conditions may be systematically underestimating what agents can do when they are given room to search, reason, retry, or plan.

That has immediate consequences for how teams interpret evaluations. If a vendor presents a single benchmark score without disclosing the compute budget behind it, buyers may be comparing unlike systems. A model that looks mediocre under a constrained setup may turn out to be competitive, or materially better, once it is allowed to spend more tokens, more time, or more intermediate steps.

This is especially important for agentic workflows, where the question is not whether a model can answer a prompt, but whether it can complete a task under a specific resource envelope. The AISI result suggests that “what the model can do” is inseparable from “how much compute it is allowed to spend doing it.”

Procurement needs budget-aware evaluation

For procurement teams, the obvious change is to stop treating benchmark results as fixed facts and start treating them as conditional performance claims.

That means asking vendors for:

  • the inference budget used in evaluation
  • whether the model’s performance was measured at a single compute setting or across multiple ones
  • how success curves change as budgets rise
  • whether newer models show outsized gains under higher compute allowances

If those details are missing, a benchmark number may not be actionable for deployment planning.
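
One way to make that checklist concrete is to record every vendor score together with the budget that produced it. The sketch below is illustrative only: the `BenchmarkClaim` name, its fields, and the use of tokens per task as the budget unit are assumptions made for this example, not a format AISI or any vendor has published.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BenchmarkClaim:
    """A vendor score stored with the budget behind it (hypothetical schema)."""
    benchmark: str
    model: str
    # Inference budget (assumed here: tokens per task) -> success rate at that budget.
    success_by_budget: Dict[int, float] = field(default_factory=dict)

    def missing_details(self) -> List[str]:
        """List the disclosures a buyer would still need to ask for."""
        gaps = []
        if not self.success_by_budget:
            gaps.append("no inference budget disclosed")
        elif len(self.success_by_budget) == 1:
            gaps.append("measured at a single compute setting only")
        return gaps

    def is_actionable(self) -> bool:
        """Treat a claim as usable for planning only if nothing is missing."""
        return not self.missing_details()
```

A claim with one budget point would be flagged as measured at a single setting, which is a prompt to ask for the rest of the curve rather than a reason to reject the vendor outright.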

It also means internal evals need to become budget-aware. A team deciding whether to automate a workflow should not ask only whether the model passes a benchmark. It should ask what happens when the model is given 2x, 5x, or 10x the budget it would have in production, and whether the marginal gain is worth the added latency, cost, or operational complexity.
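
The trade-off can be sketched as simple arithmetic. The function below is a minimal illustration, assuming that cost scales linearly with the budget multiplier; the function name, the `cost_per_unit` parameter, and the numbers in the usage example are all hypothetical and not taken from the AISI results.

```python
from typing import Dict, List


def marginal_gains(success_by_multiplier: Dict[float, float],
                   cost_per_unit: float) -> List[dict]:
    """Compare consecutive budget multipliers (e.g. 1x -> 2x -> 5x -> 10x).

    Assumes cost grows linearly with the multiplier, which is a
    simplification; real pricing and latency may not behave this way.
    """
    points = sorted(success_by_multiplier.items())
    rows = []
    for (m0, s0), (m1, s1) in zip(points, points[1:]):
        extra_cost = (m1 - m0) * cost_per_unit
        gain = s1 - s0
        rows.append({
            "from": m0,
            "to": m1,
            "gain": gain,                        # change in success rate
            "extra_cost": extra_cost,            # added spend for this step
            "gain_per_cost": gain / extra_cost,  # marginal return on compute
        })
    return rows


# Illustrative numbers only, not measured results.
for row in marginal_gains({1: 0.40, 2: 0.50, 5: 0.56, 10: 0.58}, cost_per_unit=1.0):
    print(row)
```

In this made-up curve the return per unit of spend shrinks at every step, which is the kind of pattern that would argue for stopping at the 2x setting. A different task could just as easily show the opposite.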

That is a more realistic way to model the economics of agent rollout. It also creates a better basis for vendor negotiations, because compute is no longer just an implementation detail; it is part of the performance contract.

Cybersecurity and software engineering stand out

The AISI results were not uniform across domains, which is exactly why they matter.

Cybersecurity tasks required very large budgets to show strong gains, suggesting that the hardest security problems may be especially sensitive to inference-time compute. Software engineering also showed meaningful improvement when more compute was available. Those are two domains where many organizations are actively testing agents for real work, and both sit near the center of the risk/reward tradeoff for deployment.

The implication is that a narrow benchmark regime can understate both opportunity and risk. In cybersecurity, a model that underperforms under tight budgets may still become materially more capable if a product team or attacker can afford to spend more compute. In software engineering, the same dynamic can change whether an agent is useful for code generation, debugging, or multi-step repository work.

The gains are task-dependent, not universal. That is an important constraint. But it is also the key insight: compute does not raise performance evenly; it raises it where the task structure rewards extra search, deliberation, or retries. That makes budget planning part of capability planning.

What teams and vendors should change now

The most immediate operational shift is to stop using one-number benchmark summaries as if they were sufficient.

Teams evaluating models should:

  1. Test at multiple compute budgets, not just one fixed setting.
  2. Record the budget behind each benchmark score separately from the budget production will allow, so the evaluation reflects production reality.
  3. Track marginal gains per additional unit of compute, especially for agentic and security-sensitive workflows.
  4. Use domain-specific tests, because compute sensitivity varies sharply across tasks.
  5. Treat risk reviews as compute reviews too, since higher budgets can unlock behaviors that are invisible in constrained tests.
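
The first and fourth items in that list can be combined into one small harness. The sketch below assumes a hypothetical `run_task(task_id, budget)` callable that returns whether the agent succeeded; it stands in for whatever agent runner a team already has, and the domain labels are placeholders.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Task = Tuple[str, str]  # (domain, task_id); both are illustrative labels


def sweep(run_task: Callable[[str, int], bool],
          tasks: Iterable[Task],
          budgets: List[int]) -> Dict[str, Dict[int, float]]:
    """Run every task at every budget and return success rate per domain."""
    successes = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for domain, task_id in tasks:
        counts[domain] += 1
        for budget in budgets:
            # run_task is a stand-in for a real agent runner (assumption).
            if run_task(task_id, budget):
                successes[domain][budget] += 1
    return {
        domain: {b: successes[domain][b] / counts[domain] for b in budgets}
        for domain in counts
    }
```

Keeping results keyed by domain and budget, rather than averaging into a single number, is the point of the exercise: it preserves the shape of each curve for later cost and risk review.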

Vendors should do the same thing from the other side. If a model’s value depends on extra inference spend, that should be disclosed plainly. Buyers need to know whether performance scales smoothly with budget or whether gains only appear after a threshold. They also need to know whether a model’s advantage is robust enough to survive the latency and cost limits of real deployment.
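
The difference between smooth scaling and a threshold effect can be checked with a rough heuristic. The sketch below labels a curve "threshold" when a single step between adjacent budgets accounts for most of the total gain. The 0.7 cutoff and the function name are arbitrary assumptions chosen for illustration, not an established method.

```python
from typing import Dict


def scaling_shape(curve: Dict[float, float], jump_share: float = 0.7) -> str:
    """Classify a budget -> success-rate curve as flat, smooth, or threshold.

    Heuristic only: 'threshold' means one step between adjacent budgets
    contributes at least `jump_share` of the total improvement.
    """
    if len(curve) < 2:
        raise ValueError("need at least two budget points")
    points = sorted(curve.items())
    steps = [s1 - s0 for (_, s0), (_, s1) in zip(points, points[1:])]
    total = points[-1][1] - points[0][1]
    if total <= 0:
        return "flat"
    return "threshold" if max(steps) / total >= jump_share else "smooth"
```

A buyer would read the two outcomes differently. A smooth curve lets a team trade cost against performance continuously, while a threshold curve means the model is only worth deploying if production budgets clear the jump.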

The broader lesson is that the industry’s habit of standardizing around fixed benchmarks may now be obscuring the very thing people most need to know: how much capability remains on the table when the system is allowed to think longer.

For AI product teams, that changes roadmap planning. For security teams, it changes threat modeling. And for procurement, it changes how performance claims should be read: not as a static score, but as a function of compute.