PACE Method Cuts Agent Evaluation Costs by 99% Using Proxy Benchmarks

A shortcut that costs under 1% and ranks models correctly about 85% of the time
PACE uses cheap proxy tests to identify strong agent candidates before expensive full benchmarks.

As artificial intelligence agents grow more capable, the cost of measuring that capability has become its own engineering problem. A research team has proposed PACE, a method that uses inexpensive, simpler tests to predict how a model will perform on the full, costly benchmarks that define the field, achieving about 85% ranking accuracy at less than 1% of the usual cost. The work does not claim to replace rigorous evaluation, but rather to make the early stages of model selection more manageable for teams constrained by time and budget. It is, in essence, a form of triage: not every candidate needs the full examination, and the method's job is to identify which ones do.

  • Full agent evaluations like SWE-Bench can cost thousands of dollars and run for days, creating a bottleneck that slows every decision in the development cycle.
  • Teams must compare models constantly — adjusting tool policies, routing logic, and strategies — but the only definitive tests are too expensive to run at each turn.
  • PACE proposes a shortcut: run cheap, non-agentic proxy tests first, then use those results to predict full benchmark performance with less than 4% average error.
  • Across 14 models and four major benchmarks, the method ranked models correctly 85% of the time while consuming under 1% of normal evaluation costs.
  • The risk is real — proxy tests may miss failures that only surface in complex, multi-step interactions or long-horizon reasoning tasks that matter most in production.
  • The method is positioned as a screening tool, not a verdict: it narrows the field so that only the strongest candidates face the full, expensive validation.

Building and testing AI agents has grown genuinely expensive. A single evaluation on benchmarks like SWE-Bench or GAIA can cost thousands of dollars, demand specialized infrastructure, and consume days of compute. For teams iterating on model selection or tool strategies, this creates a painful tension: the tests that give definitive answers are too costly to run at every decision point.

A paper submitted to arXiv on July 2, 2026 proposes a practical shortcut called PACE — Proxy for Agentic Capability Evaluation. The idea is to test models on a much smaller set of simpler, cheaper tasks, then use those results to predict performance on the full, expensive benchmarks. Tested across 14 models, four major agent benchmarks, and 19 pools of simpler test cases, the method predicted benchmark scores with less than 4% average error, achieved correlations above 0.80, and ranked models correctly about 85% of the time — all at under 1% of the cost of a full evaluation.

The practical value is one of triage. A team with five candidate models and a limited budget can run all five through PACE in hours, identify the two or three worth full validation, and avoid spending thousands on the rest. For teams managing evaluation infrastructure as a real constraint, this early signal is genuinely useful.

The limits are equally real. Proxy tests are simpler by design and may not surface failures that only emerge in complex, multi-step interactions — the long-horizon reliability issues that matter most in production. The researchers frame PACE accordingly: a screening tool, not a replacement for rigorous final validation. Whether the method holds as models grow more capable and agent tasks more complex remains the open question.

Building and testing AI agents has become expensive. A single full evaluation on benchmarks like SWE-Bench or GAIA can consume thousands of dollars, require specialized infrastructure, and run for days. Teams developing these systems face a practical problem: they need to compare models and strategies quickly, but the tools that give definitive answers are too costly to run at every decision point. A new paper submitted to arXiv on July 2, 2026, proposes a shortcut.

The method is called PACE—Proxy for Agentic Capability Evaluation. The core idea is straightforward: instead of running a model through a full, expensive agent benchmark, test it on a much smaller set of simpler, cheaper evaluation tasks. Then use those quick results to predict how the model would perform on the real benchmark. If the prediction is reliable enough, teams can use it to narrow their candidates before spending the money on full validation.
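The prediction step can be sketched in a few lines. The article does not say how the paper maps proxy results to benchmark scores, so the ordinary least squares fit below is an assumption chosen for simplicity, and every number in it is made up for illustration. The names fit_line and predict_benchmark are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Illustrative (invented) calibration data: models that were already run on
# both the cheap proxy tests and the full agent benchmark.
proxy_scores = [0.40, 0.55, 0.62, 0.71, 0.80]
benchmark_scores = [0.18, 0.27, 0.33, 0.38, 0.45]

slope, intercept = fit_line(proxy_scores, benchmark_scores)

def predict_benchmark(proxy_score):
    """Estimate a full-benchmark score from a proxy score alone."""
    return slope * proxy_score + intercept

# A new candidate only needs the cheap proxy run to get an estimate.
new_model_estimate = predict_benchmark(0.68)
print(f"estimated benchmark score: {new_model_estimate:.3f}")
```

A single-feature line is the simplest possible stand-in. With 19 proxy pools, a real predictor would presumably combine several inputs, but the calibrate-then-predict shape stays the same.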

The researchers tested this approach across 14 different models, four major agent benchmarks, and 19 pools of simpler, non-agentic test cases. The results suggest the method works. When they predicted agent benchmark scores using only the cheaper proxy tests, their predictions were off by less than 4% on average. The correlation between predicted and actual scores exceeded 0.80. When asked to rank models from best to worst, PACE got the ranking right about 85% of the time. All of this while using less than 1% of the computational cost of a full agent evaluation.
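The three reported figures (average error, correlation, ranking accuracy) correspond to standard metrics that can be computed as in the sketch below. The exact definitions used in the paper are not given in the article. In particular, reading "ranked models correctly" as pairwise ranking accuracy is an assumption, and the scores are invented for illustration.

```python
from itertools import combinations
from math import sqrt

def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def pairwise_ranking_accuracy(predicted, actual):
    """Fraction of model pairs whose predicted order matches the actual order."""
    pairs = [(i, j) for i, j in combinations(range(len(actual)), 2)
             if actual[i] != actual[j]]
    correct = sum((predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0
                  for i, j in pairs)
    return correct / len(pairs)

# Invented scores for five models (fractions of tasks solved).
predicted = [0.20, 0.31, 0.29, 0.41, 0.44]
actual = [0.18, 0.27, 0.33, 0.38, 0.45]

mae = mean_absolute_error(predicted, actual)
corr = pearson(predicted, actual)
rank_acc = pairwise_ranking_accuracy(predicted, actual)
print(f"MAE={mae:.3f}  r={corr:.3f}  pairwise ranking accuracy={rank_acc:.2f}")
```

In this toy data only one of the ten model pairs is ordered wrongly, so a predictor can have a small ranking error even when individual score estimates are off by a few points.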

The practical value lies not in replacing real testing but in triage. Imagine a team with five candidate models and a limited evaluation budget. They can run all five through PACE's cheap proxy tests in hours. The results suggest which two or three are worth the expensive full-benchmark treatment. The strong candidates get validated properly; the weaker ones never consume the thousands of dollars and days of infrastructure time. For teams iterating on model selection, tool policies, or routing logic, this kind of early signal matters.
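The budget arithmetic behind this triage can be sketched as follows. All model names, predicted scores, and dollar amounts are hypothetical. The proxy cost is set to 1% of the full cost, which is the upper bound reported in the article.

```python
def shortlist(predicted_scores, k):
    """Return the k model names with the highest predicted benchmark score."""
    ranked = sorted(predicted_scores, key=predicted_scores.get, reverse=True)
    return ranked[:k]

# Hypothetical proxy-based predictions for five candidate models.
predicted_scores = {
    "model_a": 0.31,
    "model_b": 0.44,
    "model_c": 0.22,
    "model_d": 0.39,
    "model_e": 0.27,
}

FULL_EVAL_COST = 3000.0   # assumed dollars per model for a full benchmark run
PROXY_EVAL_COST = 30.0    # assumed 1% of the full cost (the reported upper bound)

finalists = shortlist(predicted_scores, k=2)

# Everyone gets the cheap proxy run; only the finalists get the full run.
triage_cost = (len(predicted_scores) * PROXY_EVAL_COST
               + len(finalists) * FULL_EVAL_COST)
exhaustive_cost = len(predicted_scores) * FULL_EVAL_COST

print(finalists, triage_cost, exhaustive_cost)
```

Under these assumed prices, screening five models and fully validating two costs 6,150 dollars instead of 15,000. The saving grows with the number of candidates that are screened out, and the choice of k is where a team trades budget against the risk of dropping a model the proxy underrated.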

But the method has real limits. Proxy benchmarks, by definition, are simpler than the full thing. They may not catch failures that only emerge in complex, multi-step agent interactions. They might miss regressions in tool use or problems with long-horizon reasoning—the kind of reliability issues that matter most in production. An agent might score well on atomic test cases but fail when asked to solve a real software engineering problem or navigate a complex reasoning task.

The researchers themselves frame PACE as a screening tool, not a replacement for rigorous validation. The signal it provides is useful for narrowing the field, but final decisions should still rest on full benchmarks. For teams managing evaluation infrastructure as a real engineering constraint—which most are—PACE represents a practical direction worth exploring. The question now is whether the method holds up as models grow more capable and agent tasks grow more complex.

The practical use case is triage, not replacement. PACE gives evaluation teams a way to estimate agent performance earlier and more cheaply, then spend full benchmark budgets on final validation.
— arXiv paper authors (July 2, 2026)
The Hearth Conversation: Another perspective on the story
Inventor

Why does evaluation cost matter so much right now? Can't teams just run their models through benchmarks whenever they need to?

Model

They can, but the cost is real. A single full evaluation on SWE-Bench or GAIA can run thousands of dollars and take days. When you're iterating—trying different models, different tool policies, different routing strategies—you can't afford to run every option through the full gauntlet.

Inventor

So PACE is about speed and money. But how do you know a cheap proxy test actually predicts what happens on the real benchmark?

Model

That's the empirical question the paper answers. They tested it across 14 models and four agent benchmarks, and found that the proxy-based predictions ranked models correctly about 85% of the time, with less than 4% average error on the predicted scores.

Inventor

That sounds good. But what could go wrong? What does a proxy test miss?

Model

The things that only show up in complex, multi-step agent work. A model might handle simple atomic tasks fine but fail when it has to chain reasoning across a long-horizon task, use tools in unexpected ways, or recover from mistakes in a real environment.

Inventor

So PACE is a filter, not a judge.

Model

Exactly. It's a way to say "these three models are worth the expensive validation." The ones that pass PACE still need to prove themselves on the real benchmark before you trust them in production.

Inventor

Who benefits most from this?

Model

Teams that are actively developing and comparing agents. If you're running experiments constantly, PACE saves you money and time on the candidates that won't make the cut anyway.
