Asymmetric loss functions over budget -- evaluation considerations

The Core Philosophical Tension

What you're describing is an asymmetric loss function over budget. The intuition is:

- If the agent is succeeding: marginally exceeding budget has low cost — you'd rather it finish the job
- If the agent is failing: stopping early has high value — every dollar or hour spent past the point where the trajectory has gone bad is pure waste

This is not a distraction. It is the exact structure of optimal stopping problems from classical decision theory: when is the expected value of continued effort negative, given current evidence? The benchmark designer's implicit question is whether the agent can reason about this. Most current benchmarks sidestep it entirely.

---

What the Giants and Serious Labs Actually Say

METR's RE-Bench (the most rigorous agent-vs-human benchmark in existence right now) is the clearest real-world example of researchers grappling with this. Their 2024 findings are stark: agents are told how much time budget they have remaining and can choose when to submit. The uncomfortable result: "AI agents tend to assume things are out of scope or try to finish up the task quickly without trying many things or testing their initial solution." In other words, current agents give up too early even when continuing would pay off — the opposite of your failing-agent case. Humans, by contrast, kept improving as the time budget grew from 30 minutes to 8 hours. The field's current worry, then, is not agents exceeding budget recklessly but agents being too conservative. (METR Blog)

The multi-dimensional evaluation literature (frameworks like CLEAR and CLASSic) explicitly argues that cost-unaware benchmarks create a perverse incentive: they reward building extremely capable but expensive agents that wouldn't be viable in deployment. The counterweight to METR's concern, in other words, comes from enterprise researchers who argue that budget enforcement is underweighted, not overweighted.
(arXiv:2511.14136)

No figure of Sutton/Bengio/LeCun stature has written directly about budget-softness philosophy for coding agents — the problem is too new and too operational. But the closest intellectual ancestor is Herbert Simon's satisficing: agents don't maximize; they search until they find "good enough" given resource constraints. The question of whether to give up is just satisficing applied at the meta-level.

---

Are These Secondary Concerns Distractions or Signal?

Honest answer: they are real signal that the field currently treats as noise, for a practical reason — it is hard to measure cleanly. Hard cutoffs are standard (SWE-bench, most coding benchmarks) because:

- They make comparisons fair and replicable
- Soft budgets introduce confounds (did agent A spend more because it was more capable, or more reckless?)
- The primary outcome (did it solve the task?) dominates

But the community is slowly recognizing that this is insufficient. The SERA paper from AllenAI (arXiv:2509.16941) and the reliability-gap discussion (simmering.dev) both point out that how agents fail matters enormously for real deployment — an agent that recognizes it is stuck and stops cleanly is categorically better than one that burns compute in a broken loop.

---

The Philosophical Implication for Your Guide

The instruction "you have $15/12h, but $0.25 over is fine if you do a great job" is actually a well-formed policy — it models real economic contracts (cost-plus with a grace margin). The interesting design questions are: should a succeeding agent be penalized at all for the overage? Probably not. Should a failing agent get credit for stopping early? Yes, meaningfully so.

The ideal agent behavior given those instructions is:

1. Track trajectory quality against budget consumed, constantly
2. If the trajectory is good → finish the job; a minor overage is acceptable
3. If the trajectory is bad and the budget is mostly untouched → exit cleanly, don't thrash
4. If the trajectory is bad and the budget is nearly gone → same, but the failure is costlier

This is not something current benchmarks reward explicitly. It may be worth encoding in the guide's scoring rather than leaving it implicit in the budget-efficiency section, because an agent that stops at $2 / 1h with a clear "this approach won't work" conclusion has demonstrated something genuinely valuable.

---

Sources:

- METR: Evaluating frontier AI R&D capabilities
- RE-Bench GitHub
- Beyond Accuracy: Multi-Dimensional Framework for Evaluating Enterprise Agentic AI
- The Reliability Gap: Agent Benchmarks for Enterprise
- SWE-Bench Pro
- METR Time Horizons — Epoch AI
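The four-step policy described above can be sketched as a small decision rule. This is a minimal illustration under assumed names — `BudgetState`, `decide`, `failure_cost`, and the `grace` margin are all hypothetical, taken from none of the benchmarks or papers cited here. The asymmetry shows up in two places: a grace margin past budget for succeeding runs, and a failure cost that grows with the fraction of budget burned:

```python
# Hypothetical sketch of the asymmetric stopping policy; all names are
# illustrative, not from METR, SWE-bench, or any cited framework.
from dataclasses import dataclass

@dataclass
class BudgetState:
    spent_fraction: float   # budget consumed so far (1.0 = budget exhausted)
    trajectory_good: bool   # agent's own estimate: is this approach working?

def decide(state: BudgetState, grace: float = 0.05) -> str:
    """Return the action the asymmetric loss function favors."""
    if state.trajectory_good:
        # Succeeding: a small overage (up to `grace` past budget) is cheap
        # relative to abandoning near-complete work.
        if state.spent_fraction <= 1.0 + grace:
            return "continue"
        return "submit_best_effort"  # even success shouldn't run unbounded
    # Failing: every additional unit spent is pure waste, so stop now.
    # Stopping with most of the budget untouched is the cheapest failure.
    return "stop_and_report"

def failure_cost(spent_fraction: float) -> float:
    # Scoring-side asymmetry: a failure's cost grows with budget burned,
    # so stopping at 10% of budget is far cheaper than thrashing to 100%.
    return spent_fraction
```

Under this sketch, a succeeding agent slightly over budget keeps going, while a failing agent exits immediately regardless of how much budget remains — and the earlier it exits, the lower the cost a scoring rule like `failure_cost` would assign.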
