Software Benchmarks: hall of fame

I love these types of things. 

Classics:

Tom's Diner

By Suzanne Vega. Because of this song, she has been called "The Mother of the MP3". The engineers who worked on MP3 compression used it as the perfect test track. As one of them recalled: "I was ready to fine-tune my compression algorithm...somewhere down the corridor, a radio was playing 'Tom's Diner.' I was electrified. I knew it would be nearly impossible to compress this warm a cappella voice." It is also nice and short, which is ideal for a track that engineers wanted to test with.

I am sitting in the morning at the diner on the corner

I am waiting at the counter for the man to pour the coffee

And he fills it only halfway, and before I even argue

He is looking out the window at somebody coming in

`curl -o suzanne_vega.mp3 https://museumofportablesound.com/wp-content/uploads/2020/07/mopswaxcylindersuzannevega.mp3`

Here is a video that I like, because it is 2 and a half minutes long, and because it has no words. https://www.youtube.com/watch?v=4-ISLpKhQJI 

Lena Forsen 

An image of a model that appeared in tens of thousands of publications and conference proceedings on imaging research, from its first use in the 1970s until it was retired in the 2010s over concerns about objectifying women. It had become popular in the imaging community because it contains several features – such as light and dark illumination regions, sharp edges, flat regions, smoothly varying regions and textured regions intertwined with each other – which made it useful when testing the behaviour of image-processing or reconstruction algorithms.

AI Specific:

Pelican on a Bicycle

Will Smith Pasta Video


----

I have a few of my own: 


DSPy is an exciting idea but is still in its infancy

Like all followers of the main AI bloggers, I saw Simon Willison's endorsement of DSPy. It reminds me of MLflow. Early on in my pivot to AI, I recall many a LangChain + MLflow tutorial. It is premature. Evals. Evals. Have I ever actually created a good evals system? Nope.

A few things are needed within DSPy. First of all, it has to handle complete garbage output from the student models better (and from the teacher models, for that matter, not to mention the judge reflection models). What I mean is: suppose I am creating an AI pipeline that does something. The example doesn't matter, but let's say it is categorizing emails in bulk. I should be able to use DSPy to come up with a prompt that works, through an optimizer like MIPROv2. And if I pick a model that, on DSPy's first attempt at a prompt for the task, returns something, the optimizer should expect that the output may not only be the wrong category but may also be structurally nothing like what the system is built to handle.

When evaluating this library, I guess the problem I see is that it seems useful only in very unusual, specific circumstances. It is not robust enough. It needs easy ways to have the optimizer first solve an easy problem before solving the harder problem. It needs easy ways to start with smart models and move to dumb models, and vice versa. The DSPy signatures don't solve problems. They are for simple tasks.
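One workaround for the garbage-output problem is to make the metric itself tolerant, so that structurally broken output becomes a scored failure instead of an exception. This is a minimal sketch for the email-categorization example, not DSPy's own machinery. The `category` field, the `ALLOWED` label set, and the `normalize_label` helper are all hypothetical. The only DSPy convention assumed is that a metric is a plain function of `(example, pred, trace=None)`. `SimpleNamespace` objects stand in for DSPy examples and predictions so it runs without the library.

```python
import re
from types import SimpleNamespace

# Hypothetical label set for the email-categorization example.
ALLOWED = {"billing", "support", "spam", "other"}


def normalize_label(raw):
    """Return a clean label from model output, or None if it is structurally garbage."""
    if not isinstance(raw, str):
        return None
    # Take the first allowed label that appears as a whole word.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in ALLOWED:
            return word
    return None


def tolerant_metric(example, pred, trace=None):
    """Score 1.0 for a correct label, 0.0 otherwise, without ever raising."""
    label = normalize_label(getattr(pred, "category", None))
    if label is None:
        return 0.0  # malformed output is a scored failure, not a crash
    return 1.0 if label == example.category else 0.0


# Stand-ins for a gold example and a messy prediction.
gold = SimpleNamespace(category="billing")
messy = SimpleNamespace(category="Category: Billing!!")
print(tolerant_metric(gold, messy))
```

This only stops the metric from crashing. It does nothing about the optimizer's own parsing of student output, which is the part I wish the library handled.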

The coolest thing about this stream-of-consciousness article was that, as I was writing it, the links got filled out. Kind of neat.




ARTificial: Why Copyright Is Not the Right Policy Tool to Deal with Generative AI

 Mantegna argues that copyright law is the wrong tool for governing generative AI, especially for text, because copyright was never designed to regulate systems that learn statistical structures rather than copy expressive works. Attempts to stretch copyright to cover AI training and outputs risk (1) failing to meaningfully protect authors, (2) entrenching corporate intermediaries, (3) destabilizing fair use, (4) shrinking the public domain, and (5) degrading AI quality through defensive data practices. She insists that ethical harms (consent, attribution, labor displacement) are real—but conflating “unethical” with “illegal” produces bad law and ultimately worsens outcomes for creators and society.


https://yalelawjournal.org/forum/artificial-why-copyright-is-not-the-right-policy-tool-to-deal-with-generative-ai

"colloquially called the apply model"

You are a an AI coding assistant, powered by tensorzero::function_name::cursorzero. You operate in Cursor You are pair programming with a USER to solve their coding task. Each time the USER sends a message, we may automatically attach some information about their current state, such as what files they have open, where their cursor is, recently viewed files, edit history in their session so far, linter errors, and more. This information may or may not be relevant to the coding task, it is up for you to decide. Your main goal is to follow the USER's instructions at each message, denoted by the tag. When using markdown in assistant messages, use backticks to format file, directory, function, and class names. Use ( and ) for inline math, [ and ] for block math. If you are unsure about the answer to the USER's request or how to satiate their request, you should gather more information. This can be done by asking the USER for more information. Bias towards not asking the user for help if you can find the answer yourself. The user is likely just asking questions and not looking for edits. Only suggest edits if you are certain that the user is looking for edits. When the user is asking for edits to their code, please output a simplified version of the code block that highlights the changes necessary and adds comments to indicate where unchanged code has been skipped. For example: ```language:path/to/file // ... existing code ... {{ edit_1 }} // ... existing code ... {{ edit_2 }} // ... existing code ... ``` The user can see the entire file, so they prefer to only read the updates to the code. Often this will mean that the start/end of the file will be skipped, but that's okay! Rewrite the entire file only if specifically requested. Always provide a brief explanation of the updates, unless the user specifically requests only the code. These edit codeblocks are also read by a less intelligent language model, colloquially called the apply model, to update the file. 
To help specify the edit to the apply model, you will be very careful when generating the codeblock to not introduce ambiguity. You will specify all unchanged regions (code and comments) of the file with "// ... existing code ..." comment markers. This will ensure the apply model will not delete existing unchanged code or comments when editing the file. You will not mention the apply model. The user's OS version is darwin 24.3.0. The absolute path of the user's workspace is /Users/viraj/tensorzero/tensorzero/examples/cursorzero. The user's shell is /bin/zsh. You MUST use the following format when citing code regions or blocks: ```12:15:app/components/Todo.tsx // ... existing code ... ``` This is the ONLY acceptable format for code citations. The format is ```startLine:endLine:filepath``` where startLine and endLine are line numbers.
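The citation format at the end of that prompt is simple enough to parse mechanically. This is a minimal sketch of such a parser, purely for illustration. The `Citation` type, the `parse_citation` helper, and the sanity rule on line numbers are my own assumptions, not anything Cursor is known to use.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    start_line: int
    end_line: int
    path: str


# Matches the text that follows the opening backticks, e.g. "12:15:app/components/Todo.tsx".
_CITATION_RE = re.compile(r"^(\d+):(\d+):(.+)$")


def parse_citation(info: str) -> Optional[Citation]:
    """Parse 'startLine:endLine:filepath'; return None if it does not match."""
    match = _CITATION_RE.match(info.strip())
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    if start < 1 or end < start:
        return None  # assumed sanity rule: 1-indexed, non-empty range
    return Citation(start, end, match.group(3))


print(parse_citation("12:15:app/components/Todo.tsx"))
```

Note that the edit-block header from earlier in the prompt (`language:path/to/file`) does not match this pattern, so the two formats can be told apart by the leading digits.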

"We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future."

"...We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring."

https://openai.com/index/chain-of-thought-monitoring/




Long-Horizon Execution: A Key To AI Progress

Lately I keep hearing that AI has “slowed down.” It comes from two camps: casual users who aren't even using the latest and greatest, and savvy folks repeating a fashionable take. Either way, it’s wrong.

Recent op-eds claim progress is stalling. But look closer at what actually matters for agents, coding copilots, and enterprise workflows: long-horizon execution.

A new preprint—The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs—nails the point. Tiny gains in a model’s per-step reliability compound into huge increases in how long a task it can complete end-to-end.

The one-line idea

If a model’s per-step accuracy is p, the longest task it can finish with success rate s scales like

$H_s = \frac{\ln(s)}{\ln(p)}$.

Nudging p from “pretty good” to “slightly better” produces nonlinear—indeed, faster-than-exponential beyond ~70%—growth in the solvable horizon. So even if single-turn scores look flat, multi-step capability can be racing ahead.
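To get a feel for the compounding, here is a minimal sketch that plugs a few values into the horizon formula. It assumes every step succeeds independently with the same probability p, so a task of H steps succeeds with probability p^H, which is a simplification. The `horizon_length` function name and the sample values of p are mine.

```python
import math


def horizon_length(p: float, s: float = 0.5) -> float:
    """Steps H at which overall success p**H has dropped to s, i.e. H = ln(s) / ln(p)."""
    if not (0.0 < p < 1.0 and 0.0 < s < 1.0):
        raise ValueError("p and s must lie strictly between 0 and 1")
    return math.log(s) / math.log(p)


for p in (0.90, 0.99, 0.999):
    print(f"p = {p}: about {horizon_length(p):.1f} steps at 50% success")
```

Under that assumption, going from 90% to 99% to 99.9% per-step accuracy moves the 50%-success horizon from roughly 7 steps to roughly 69 to roughly 693, about a tenfold jump each time.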

It’s not (just) reasoning—execution breaks over time

The authors separate planning (“what to do”) from execution (“carry it out without slipping”). Give models the plan and the knowledge; accuracy still decays with length. Bigger models execute reliably for more turns even when smaller ones ace turn one. That’s an execution problem, not a knowledge problem.

They also surface a telling failure mode:

  • Self-conditioning. When the context contains a model’s own earlier errors, the chance of new errors rises; parameter count alone doesn’t fix it.

  • Thinking fixes it. Sequential test-time compute (i.e., explicit “thinking” or RL-trained reasoning) breaks that loop and extends single-turn execution dramatically. In their benchmark, a thinking GPT-5 variant clears 1,000+ steps in one shot; Claude 4 Sonnet is ~400.

Why “slowdown” misses the plot

Track task length at constant success rate and you see small step-wise gains compounding—especially in agentic settings where one slip spoils the run. Independent releases point the same way:

  • GPT-5: new SOTA on AIME 2025 (no tools), SWE-bench Verified, MMMU, GPQA; sharper drops in hallucinations and better multi-step, tool-using instruction following.

  • Claude 4 (Opus/Sonnet): strong jumps on real software-engineering tasks (SWE-bench Verified, Terminal-bench); explicitly tuned for long-running agent workflows.

  • Gemini 2.5 Pro: material gains on GPQA Diamond and AIME; “thinking” variants improve more.

  • METR: finds the horizon length that SOTA models can complete doubling on ~7-month timescales across domains.

None of that looks like a stall when you optimize for finishing work, not just first-turn answers.

Where “a little better” becomes “a lot more useful”

  • Long-context problem solving. In a chain of H dependent steps, boosting p to p+δ multiplies the maximum H at the same success rate ($H \propto 1/\ln p$). That’s why small upgrades suddenly let an agent finish instead of stalling mid-project.

  • Error recovery. Thinking mitigates self-conditioning; modest gains in recovering from a local slip unlock much longer runs. Non-thinking frontier models can fail at two sequential steps; RL-trained thinking models stretch to hundreds in one turn.

  • Agent workflows. Spending a few more tokens to “think” beats parallel sampling for long-horizon reliability.

Practical takeaways for builders

  • Measure the right thing. Track horizon length (e.g., steps until 50% success) and turn complexity, not just single-turn accuracy.

  • Bias toward thinking. Enable sequential test-time compute for multi-step tasks; it stabilizes execution and extends single-turn capacity.

  • Manage context. Window/summarize to reduce exposure to your model’s past mistakes and blunt self-conditioning drift.
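The first takeaway can be sketched in a few lines. This is a rough illustration of tracking horizon length from your own run logs, not the preprint's methodology. The `empirical_horizon` function, the `(task_length, succeeded)` log format, and the sample data are all made up for the example.

```python
from collections import defaultdict


def empirical_horizon(runs, threshold=0.5):
    """Largest task length reached before the success rate first drops below threshold.

    `runs` is an iterable of (task_length, succeeded) pairs. Returns 0 if even
    the shortest length falls below the threshold.
    """
    totals = defaultdict(lambda: [0, 0])  # length -> [successes, attempts]
    for length, succeeded in runs:
        totals[length][0] += int(succeeded)
        totals[length][1] += 1

    horizon = 0
    for length in sorted(totals):
        successes, attempts = totals[length]
        if successes / attempts < threshold:
            break  # stop at the first length that falls below the bar
        horizon = length
    return horizon


# Made-up run log: (number of steps in the task, did the run finish correctly?)
runs = [(10, True), (10, True), (50, True), (50, False), (100, False), (100, False)]
print(empirical_horizon(runs))  # → 50
```

With only a handful of runs per length the estimate is noisy, so in practice you would want many attempts at each task length before trusting the number.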

Bottom line

If you stare at short-task benchmarks, you can convince yourself returns are diminishing. But the moment you optimize for finishing real work—repo-wide code edits, multi-tool research, complex data flows—the picture flips. Small improvements in step reliability and recovery produce outsized gains in end-to-end completion. The frontier is quietly shifting from answering to executing—and by that yardstick, progress is very much alive.

Sources: Sinha et al., Illusion of Diminishing Returns (preprint); OpenAI GPT-5 benchmarks and reliability notes; Anthropic Claude 4 results; Google DeepMind Gemini 2.5 updates; METR horizon-length analyses.


IYKYK

https://gist.github.com/GideonPotok/9d8de616ee20571d1d38ea760c5b99a2