Asymmetric loss functions over budget -- evaluation considerations
The Core Philosophical Tension
What you're describing is an asymmetric loss function over budget. The
intuition is:
- If the agent is succeeding: marginally exceeding budget has low cost — you'd
rather they finish the job
- If the agent is failing: stopping early has high value — every dollar/hour
past the point where the trajectory is bad is pure waste
This is not a distraction. It is the exact structure of optimal stopping
problems from classical decision theory — when is the expected value of
continued effort negative given current evidence? The benchmark designer's
implicit question is whether the agent can reason about this. Most current
benchmarks sidestep it entirely.
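To make the asymmetry concrete, here is a minimal sketch of such a loss function in Python. The function name, the penalty weights, and the dollar figures are all illustrative assumptions, not part of any benchmark spec:

```python
# Hypothetical asymmetric loss over budget, as described above.
# Weights (0.1, 0.5) are illustrative assumptions, not benchmark values.

def budget_loss(succeeded: bool, spent: float, budget: float) -> float:
    """Asymmetric loss: mild penalty for overage on success,
    but on failure every dollar spent counts as waste."""
    overage = max(0.0, spent - budget)
    if succeeded:
        return 0.1 * overage           # mild: we'd rather the job finished
    return spent + 0.5 * overage       # failure: all spend is sunk, overage worse

# A succeeding agent slightly over budget loses far less than
# a failing agent that stopped early, let alone one that burned it all:
assert budget_loss(True, 15.25, 15.0) < budget_loss(False, 2.0, 15.0)
```

The shape matters more than the numbers: the success branch is nearly flat in overage, while the failure branch grows with every unit spent, which is exactly the optimal-stopping structure.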
---
What the Giants and Serious Labs Actually Say
METR's RE-Bench (the most rigorous agent-vs-human benchmark in existence right
now) is the clearest real-world example of researchers grappling with this.
Their 2024 findings are stark: agents are told how much time budget they have
remaining and can choose when to submit. The uncomfortable result:
"AI agents tend to assume things are out of scope or try to finish up the task
quickly without trying many things or testing their initial solution."
In other words, current agents give up too early even when continuing would
pay off — the opposite of your failing-agent case. Humans, by contrast, kept
improving as the time budget grew from 30 minutes to 8 hours. This means the
field's current worry is not agents exceeding budget recklessly, but agents
being too conservative. (METR Blog)
The multi-dimensional evaluation literature (frameworks like CLEAR, CLASSic)
explicitly says that cost-unaware benchmarks create a perverse incentive: they
reward building extremely capable but expensive agents, which wouldn't be
viable in deployment. The counterweight to METR's concern, in other words, is
enterprise researchers who argue budget enforcement is underweighted, not
overweighted. (arXiv:2511.14136)
No figure of Sutton's, Bengio's, or LeCun's stature has written directly about
budget-softness philosophy for coding agents — the problem is too new and too
operational. But the closest intellectual ancestor is Herbert Simon's
satisficing: agents don't maximize, they search until they find "good enough"
given resource constraints. The question of whether to give up is satisficing
applied at the meta-level.
---
Are These Secondary Concerns Distractions or Signal?
Honest answer: they are real signal that the field currently treats as noise,
for a practical reason — it's hard to measure cleanly.
Hard cutoffs are standard (SWE-bench, most coding benchmarks) because:
- They make comparisons fair and replicable
- Soft budgets introduce confounds (did agent A spend more because it was more
capable, or more reckless?)
- The primary outcome (did it solve the task?) dominates
But the community is slowly recognizing this is insufficient. The SERA paper
from AllenAI (arXiv:2509.16941) and the reliability gap discussion
(simmering.dev) both point out that how agents fail matters enormously for
real deployment — an agent that recognizes it's stuck and stops cleanly is
categorically better than one that burns compute in a broken loop.
---
The Philosophical Implication for Your Guide
The instruction "you have $15/12h but $0.25 over is fine if you do a great
job" is actually a well-formed policy — it models real economic contracts
(cost-plus with a grace margin). The interesting design question is: should a
succeeding agent be penalized at all for the overage? Probably not. Should a
failing agent get credit for stopping early? Yes, meaningfully so.
The ideal agent behavior given those instructions is:
1. Track trajectory quality against budget consumed (constantly)
2. If trajectory is good → finish the job, minor overage acceptable
3. If trajectory is bad and budget is mostly untouched → exit cleanly, don't
thrash
4. If trajectory is bad and budget is nearly gone → same, but the failure is
costlier
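The four cases above can be sketched as a simple decision rule. This is a hypothetical illustration: the function name, the return labels, and the 50%-of-budget threshold are assumptions made for the sake of the example.

```python
# Hypothetical stopping policy mirroring the four cases above.
# The 0.5 * budget threshold is an illustrative assumption.

def next_action(trajectory_good: bool, spent: float, budget: float) -> str:
    """Decide whether to continue given trajectory quality and spend so far."""
    if trajectory_good:
        return "continue"      # case 2: finish the job, minor overage acceptable
    if spent < 0.5 * budget:
        return "exit_clean"    # case 3: bad trajectory, budget mostly untouched
    return "exit_late"         # case 4: same decision, but the failure cost more

# An agent $2 into a $15 budget on a bad trajectory should exit cleanly:
assert next_action(False, 2.0, 15.0) == "exit_clean"
```

Note that cases 3 and 4 prescribe the same action; only the realized cost differs, which is why credit for stopping early has to come from the scoring side, not the policy side.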
This is not something current benchmarks reward explicitly. It may be worth
encoding it in the guide's scoring rather than leaving it implicit in the
budget efficiency section — because an agent that stops at $2 / 1h with a
clear "this approach won't work" conclusion has demonstrated something
genuinely valuable.
---
Sources:
- METR: Evaluating frontier AI R&D capabilities
- RE-Bench GitHub
- Beyond Accuracy: Multi-Dimensional Framework for Evaluating Enterprise
Agentic AI
- The Reliability Gap: Agent Benchmarks for Enterprise
- SWE-Bench Pro
- METR Time Horizons — Epoch AI