Asymmetric loss functions over budget -- evaluation considerations

The Core Philosophical Tension

What you're describing is an asymmetric loss function over budget. The intuition:

  • If the agent is succeeding, marginally exceeding budget has low cost: you'd rather it finish the job.
  • If the agent is failing, stopping early has high value: every dollar or hour spent past the point where the trajectory goes bad is pure waste.

This is not a distraction. It is the exact structure of optimal stopping problems from classical decision theory: when is the expected value of continued effort negative, given current evidence? The benchmark designer's implicit question is whether the agent can reason about this. Most current benchmarks sidestep it entirely.


What the Giants and Serious Labs Actually Say

METR's RE-Bench (the most rigorous agent-vs-human benchmark in existence right now) is the clearest real-world example of researchers grappling with this. Their 2024 findings are stark: agents are told how much time budget they have remaining and can choose when to submit. The uncomfortable result: "AI agents tend to assume things are out of scope or try to finish up the task quickly without trying many things or testing their initial solution." In other words, current agents give up too early even when continuing would pay off, the opposite of your failing-agent case. Humans, by contrast, kept improving as the time budget grew from 30 minutes to 8 hours. This means the field's current worry is not agents exceeding budget recklessly, but agents being too conservative. (METR Blog)

The multi-dimensional evaluation literature (frameworks like CLEAR, CLASSic) explicitly says that cost-unaware benchmarks create a perverse incentive: they reward building extremely capable but expensive agents, which wouldn't be viable in deployment. The counterweight to METR's concern, in other words, is enterprise researchers who argue budget enforcement is underweighted, not overweighted. (arXiv:2511.14136)

No figure with Sutton/Bengio/LeCun stature has written directly about budget-softness philosophy for coding agents; this problem is too new and too operational. But the closest intellectual ancestor is Herbert Simon's satisficing: agents don't maximize, they search until they find "good enough" given resource constraints. The question of whether to give up is just satisficing applied at the meta-level.


Are These Secondary Concerns Distractions or Signal?

Honest answer: they are real signal that the field currently treats as noise, for a practical reason: it's hard to measure cleanly. Hard cutoffs are standard (SWE-bench, most coding benchmarks) because:

  • They make comparisons fair and replicable.
  • Soft budgets introduce confounds (did agent A spend more because it was more capable, or more reckless?).
  • The primary outcome (did it solve the task?) dominates.

But the community is slowly recognizing this is insufficient. The SERA paper from AllenAI (arXiv:2509.16941) and the reliability gap discussion (simmering.dev) both point out that how agents fail matters enormously for real deployment: an agent that recognizes it's stuck and stops cleanly is categorically better than one that burns compute in a broken loop.


The Philosophical Implication for Your Guide

The instruction "you have $15/12h but $0.25 over is fine if you do a great job" is actually a well-formed policy; it models real economic contracts (cost-plus with a grace margin). The interesting design question: should a succeeding agent be penalized at all for the overage? Probably not. Should a failing agent get credit for stopping early? Yes, meaningfully so.

The ideal agent behavior given those instructions:

  1. Track trajectory quality against budget consumed (constantly).
  2. If the trajectory is good: finish the job; minor overage is acceptable.
  3. If the trajectory is bad and the budget is mostly untouched: exit cleanly, don't thrash.
  4. If the trajectory is bad and the budget is nearly gone: same, but the failure is costlier.

This is not something current benchmarks reward explicitly. It may be worth encoding in the guide's scoring rather than leaving it implicit in the budget efficiency section, because an agent that stops at $2 / 1h with a clear "this approach won't work" conclusion has demonstrated something genuinely valuable.


Sources:
  • METR: Evaluating frontier AI R&D capabilities
  • RE-Bench GitHub
  • Beyond Accuracy: Multi-Dimensional Framework for Evaluating Enterprise Agentic AI
  • The Reliability Gap: Agent Benchmarks for Enterprise
  • SWE-Bench Pro
  • METR Time Horizons (Epoch AI)
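The asymmetric loss described above can be sketched as a scoring rule. This is a minimal illustration in Python; the grace margin, penalty rates, and function name are my own assumptions for the example, not taken from any benchmark:

```python
# Sketch of an asymmetric loss over budget (illustrative numbers only).

def budget_loss(spent: float, budget: float, succeeded: bool,
                grace: float = 0.25, overrun_rate: float = 0.1,
                waste_rate: float = 1.0) -> float:
    """Mild penalty for overage on success; full penalty for budget
    burned on failure, so stopping early on a bad trajectory is cheap."""
    if succeeded:
        # Only spend beyond budget + grace margin is penalized, and lightly.
        overage = max(0.0, spent - budget - grace)
        return overrun_rate * overage
    # On failure, every dollar spent is pure waste.
    return waste_rate * spent

# Succeeding agent $0.20 over a $15 budget pays nothing (inside the grace margin):
print(budget_loss(15.20, 15.0, succeeded=True))   # 0.0
# A failing agent that stopped at $2 loses far less than one that burned $15:
print(budget_loss(2.0, 15.0, succeeded=False))    # 2.0
print(budget_loss(15.0, 15.0, succeeded=False))   # 15.0
```

Under this shape, the four-step ideal behavior falls out naturally: a good trajectory justifies finishing (small `overrun_rate`), while a bad one makes every further unit of spend a dead loss.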

Good terminal prompt (pats head)

I do the following to be more efficient:

# Allow $VAR and $(cmd) in PROMPT
setopt prompt_subst
# Needed for the (#m) glob flag used in shorten_path below
setopt extended_glob

# ---------------------------------------------------------------
# DATE / TIME
# ---------------------------------------------------------------
PROMPT_DATETIME='%D{%b %d %H:%M:%S}'   # or '%D{%Y-%m-%d %H:%M:%S}'

# ---------------------------------------------------------------
# EXIT CODE INDICATOR (shows only when nonzero)
# ---------------------------------------------------------------
PROMPT_EXIT='%(?..[%?])'

# ---------------------------------------------------------------
# COMMAND TIMING (built-in; prints for cmds > 2s)
# ---------------------------------------------------------------
REPORTTIME=2

# ---------------------------------------------------------------
# PYTHON VIRTUALENV INDICATOR (shows only if active)
# If you don't want the default "(venv)" that virtualenv adds,
# also set: export VIRTUAL_ENV_DISABLE_PROMPT=1
# ---------------------------------------------------------------
PROMPT_VENV='${VIRTUAL_ENV:+(${VIRTUAL_ENV:t})}'

# ---------------------------------------------------------------
# DIRECTORY SHORTENING (fish-style-ish)
# compress middle dirs to their first letter
# ---------------------------------------------------------------
shorten_path () {
  local P=$1
  local -a parts
  parts=(${(s:/:)P})   # unquoted so the split actually produces an array
  if (( ${#parts} > 3 )); then
    # first / compressed middle / last
    print -r -- "${parts[1]}/${(j:/:)${parts[2,-2]/(#m)?*/${MATCH[1]}}}/${parts[-1]}"
  else
    print -r -- "$P"
  fi
}
PROMPT_PWD='%~'
# PROMPT_PWD='$(shorten_path $PWD)'

# ---------------------------------------------------------------
# JOB COUNT (background jobs)
# ---------------------------------------------------------------
PROMPT_JOBS='%(1j.[%j job(s)].)'

# ---------------------------------------------------------------
# PURE-GIT INDICATORS (no external tools)
# branch name, dirtiness, staged, unstaged, untracked flags
# ---------------------------------------------------------------
git_prompt () {
  local branch staged unstaged untracked dirty
  branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null) || return 0
  staged=$(git diff --cached --name-status 2>/dev/null | wc -l | tr -d ' ')
  unstaged=$(git diff --name-status 2>/dev/null | wc -l | tr -d ' ')
  untracked=$(git ls-files --others --exclude-standard 2>/dev/null | wc -l | tr -d ' ')
  dirty=""
  (( staged > 0 ))    && dirty+="+"   # staged
  (( unstaged > 0 ))  && dirty+="*"   # unstaged
  (( untracked > 0 )) && dirty+="?"   # untracked
  print -r -- "[$branch$dirty]"
}
PROMPT_GIT='$(git_prompt)'

# ---------------------------------------------------------------
# FINAL PROMPT
# ---------------------------------------------------------------
PROMPT="${PROMPT_DATETIME} ${PROMPT_EXIT} ${PROMPT_VENV} ${PROMPT_PWD} ${PROMPT_JOBS} ${PROMPT_GIT} %# "

The result:

(venv) Feb 28 12:22:44 (venv) ~/repos/homemade_benchmark_tests %
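For anyone who wants to check the compression logic outside of zsh, here is a rough Python equivalent of shorten_path. This is a sketch, not exact zsh semantics: zsh's (s:/:) split elides empty fields, so absolute paths with a leading slash behave slightly differently.

```python
# Rough Python equivalent of the zsh shorten_path function above
# (a sketch; zsh's ${(s:/:)P} elides empty fields, so absolute
# paths like /usr/local/... differ slightly).

def shorten_path(path: str) -> str:
    parts = path.split("/")
    if len(parts) <= 3:
        return path
    # first component / middle components reduced to first letters / last component
    middle = "/".join(p[:1] for p in parts[1:-1])
    return f"{parts[0]}/{middle}/{parts[-1]}"

print(shorten_path("~/repos/homemade_benchmark_tests/src"))  # ~/r/h/src
print(shorten_path("a/b/c"))                                 # a/b/c (too short to compress)
```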

Why? How?

“'Why' and 'How' are words so important they cannot be too often used.” Napoleon Bonaparte

"A Name, an Address, a Route" Haiku — Found in RFC 791: DARPA’s 1981 Internet Protocol

 A name indicates what we seek.  

An address indicates where it is.  

A route indicates how to get there.  

The internet protocol deals primarily with addresses.


(Not technically a haiku).

Link.

Media Studies

The Treasure Vault We Forgot: Discovering the Internet Archive


Lately, I’ve been feeling like there’s less to watch. Not in the literal sense—there’s more “content” than ever—but most of it feels algorithmically interchangeable and pretty forgettable.

So I turned to my new favorite media source: the Internet Archive.


The Internet Archive: More Than Nostalgia

Most people know the Archive through the Wayback Machine. But their media library is vast—a cultural preservation project without peer. We're talking:

  • 40,000+ feature films in the public domain or Creative Commons
  • Millions of hours of television, including obscure public access shows
  • VHS transfers, classic video game playthroughs, industrial films
  • Music, radio, PSAs, animation, and forgotten gems of every stripe

I recently downloaded stacks of Jeopardy! episodes from the ‘80s and ‘90s—clean, digitized. I grabbed Escape from New York. I found documentaries I hadn’t seen since a single PBS airing decades ago. This is what the internet used to feel like.

For context: Netflix’s U.S. streaming library has around 6,600 titles. The now-defunct DVD service once had 156,000. At the end, they were down to 35,000.

Internet Archive Screenshot

Why the Archive Matters More Than Ever

We live in a paradox: more content is produced each year than ever before, yet less of it is accessible. Much of it is low-quality. The actors are all models who never sweat, and people talk and say nothing at 150 words per minute. The good stuff is trapped behind licensing purgatory and exclusivity deals.

Copyright exists to protect creators. But current copyright law, often lasting 70+ years past the creator’s death, frequently blocks access entirely. The Archive bridges that gap, not by pirating, but by preserving. Their legal foundation rests on fair use, public domain, educational exemptions, and direct donations. That hasn’t stopped the lawsuits. I believe in copyright. But I also hope the IA fights like hell to get everything into the public domain that we have a right to see.

In 2023, Hachette, HarperCollins, Penguin Random House, and Wiley sued over the Archive’s Open Library project. The Archive lost in district court and is now appealing. If they lose, it could jeopardize not just book lending, but their entire preservation model. And Hollywood is paying close attention.

TV news clips, rare broadcasts, and video content are also under pressure—despite being among the few public repositories for this vanishing media.

Want to help? You don’t need to volunteer. Just:

  1. Upload media if you have it.
  2. Download and back up at-risk content—especially from collections under legal or political threat. Save it in cold storage. Revisit in 20 years. But don’t delete it.


Algorithm-Free Zone (Blessedly)

My favorite part is that there are no recommendations!

No autoplay and no infinite scroll. You simply search and explore.

Algorithms don’t expand your curiosity. They create a fucking echo chamber. The Archive works the opposite way -- it is a maze of strange, beautiful, uncategorized media. A 1930s safety film might sit beside a punk show from 1997. No one’s nudging you toward anything. That’s good for the soul.

Also: there is no corporate puritanism, shall we say. There is no sanitization for international markets, no demonetization purges. The Archive doesn’t set out to shock, but it doesn’t coddle visitors, either. It preserves the weird, sacred, offensive, boring, and brilliant world.


Surprisingly Well-Organized

Unlike most open platforms, the Archive by and large avoids duplication chaos. I was surprised not to find the same clip posted a dozen times with minor edits. Collections are curated, metadata-rich, and often annotated by real librarians.


Sample of the Archive’s Video Collections

Collection                    Approx. Items   Description
Feature Films                 40,000+         Public domain/Creative Commons movies, mostly pre-1970s
Classic TV                    10,000+         Early television, vintage commercials, and more
Home Movies & Amateur Films   30,000+         Everyday life captured on VHS and film
Prelinger Archives            11,000+         Educational and government-produced films
News & Public Affairs         25,000+         Includes the January 6th Archive and raw TV coverage
VJ Archives / Video Games     15,000+         Gameplay recordings and early machinima
Community Video               300,000+        Fan edits, lectures, public access shows, and more

That’s just the beginning. The Archive’s user-created metadata and collections elevate browsing into real discovery.


Why I Keep Coming Back

It’s radical to browse a library with no limits. I can save, organize, and download without friction. I don’t need permission. I don’t need to “like” something to remember it. It belongs to the people.

And it reminds me: media used to be weird. The internet used to be weird. I am still weird, so let's embrace weird stuff and make the internet weird again. There’s a richness to old tapes and lost broadcasts that no algorithmically sorted catalog—or AI-generated SLOP—can replicate.


Support the Archive

If you're tired of having your media diet managed for you, visit the Internet Archive. If you can, donate. They’re not just preserving the past; they’re defending a future where discovery is still possible.

Start exploring: https://archive.org

You’ll be surprised what you find. I’d also like to shout out the Common Crawl project.


Recommendations & Links

Origin of power

 The term exponent originates from the Latin exponentem, the present participle of exponere, meaning "to put forth".[3] The term power (Latin: potentia, potestas, dignitas) is a mistranslation[4][5] of the ancient Greek δύναμις (dúnamis, here: "amplification"[4]) used by the Greek mathematician Euclid for the square of a line,[6] following Hippocrates of Chios.[7]

Woah!

 https://statmodeling.stat.columbia.edu/2026/01/22/aking/
