Long-Horizon Execution: A Key To AI Progress

Lately I keep hearing that AI has “slowed down.” It comes from two camps: casual users who aren't even using the latest and greatest, and savvy folks repeating a fashionable take. Either way, it’s wrong.

Recent op-eds claim progress is stalling. But look closer at what actually matters for agents, coding copilots, and enterprise workflows: long-horizon execution.

A new preprint—The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs—nails the point. Tiny gains in a model’s per-step reliability compound into huge increases in how long a task it can complete end-to-end.

The one-line idea

If a model’s per-step accuracy is p, the longest task it can finish with success rate s scales like

H_s = ln(s) / ln(p).

Nudging p from “pretty good” to “slightly better” produces nonlinear—indeed, faster-than-exponential beyond ~70%—growth in the solvable horizon. So even if single-turn scores look flat, multi-step capability can be racing ahead.
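
To make the compounding concrete, here is a minimal sketch in Python. It assumes the paper's simple model of independent per-step errors; the `horizon` helper and the sample accuracies are illustrative choices, not numbers from the paper.

```python
import math

def horizon(p: float, s: float = 0.5) -> float:
    """Longest chain of dependent steps completable with overall success rate s,
    assuming independent per-step accuracy p: H_s = ln(s) / ln(p)."""
    return math.log(s) / math.log(p)

for p in (0.90, 0.95, 0.99, 0.999):
    print(f"p = {p:.3f} -> ~{horizon(p):.0f} steps at 50% success")
# e.g. p = 0.990 -> ~69 steps; p = 0.999 -> ~693 steps at 50% success
```

One extra "nine" of per-step reliability buys roughly an order of magnitude in horizon, which is exactly the compounding the paper points at.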

It’s not (just) reasoning—execution breaks over time

The authors separate planning (“what to do”) from execution (“carry it out without slipping”). Give models the plan and the knowledge; accuracy still decays with length. Bigger models execute reliably for more turns even when smaller ones ace turn one. That’s an execution problem, not a knowledge problem.

They also surface a telling failure mode:

  • Self-conditioning. When the context contains a model’s own earlier errors, the chance of new errors rises; parameter count alone doesn’t fix it.

  • Thinking fixes it. Sequential test-time compute (i.e., explicit “thinking” or RL-trained reasoning) breaks that loop and extends single-turn execution dramatically. In their benchmark, a thinking GPT-5 variant clears 1,000+ steps in one shot; Claude 4 Sonnet is ~400.

Why “slowdown” misses the plot

Track task length at constant success rate and you see small step-wise gains compounding—especially in agentic settings where one slip spoils the run. Independent releases point the same way:

  • GPT-5: new SOTA on AIME 2025 (no tools), SWE-bench Verified, MMMU, GPQA; a sharp drop in hallucination rates and better multi-step, tool-using instruction following.

  • Claude 4 (Opus/Sonnet): strong jumps on real software-engineering tasks (SWE-bench Verified, Terminal-bench); explicitly tuned for long-running agent workflows.

  • Gemini 2.5 Pro: material gains on GPQA Diamond and AIME; “thinking” variants improve more.

  • METR: finds that the length of tasks SOTA models can complete (at a fixed success rate) has been doubling on a ~7-month timescale across domains.

None of that looks like a stall when you optimize for finishing work, not just first-turn answers.

Where “a little better” becomes “a lot more useful”

  • Long-context problem solving. In a chain of H dependent steps, boosting p to p+δ multiplies the maximum H at the same success rate (since H ∝ 1/|ln p| for fixed s). That’s why small upgrades suddenly let an agent finish instead of stalling mid-project; see the short sketch after this list.

  • Error recovery. Thinking mitigates self-conditioning; modest gains in recovering from a local slip unlock much longer runs. Non-thinking frontier models can fail at two sequential steps; RL-trained thinking models stretch to hundreds in one turn.

  • Agent workflows. Spending a few more tokens to “think” beats parallel sampling for long-horizon reliability.
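
To put a rough number on the first bullet above, here is a small illustrative calculation under the same independence assumption; the specific accuracies are mine, not measurements from the paper.

```python
import math

def horizon_multiplier(p: float, delta: float) -> float:
    """Factor by which the solvable horizon grows when per-step accuracy
    improves from p to p + delta, at any fixed success rate:
    H(p + delta) / H(p) = ln(p) / ln(p + delta)."""
    return math.log(p) / math.log(p + delta)

print(round(horizon_multiplier(0.98, 0.01), 2))  # ~2.01: one point of accuracy roughly doubles the horizon
```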

Practical takeaways for builders

  • Measure the right thing. Track horizon length (e.g., steps until 50% success) and turn complexity, not just single-turn accuracy; a minimal measurement sketch follows this list.

  • Bias toward thinking. Enable sequential test-time compute for multi-step tasks; it stabilizes execution and extends single-turn capacity.

  • Manage context. Window/summarize to reduce exposure to your model’s past mistakes and blunt self-conditioning drift.
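
For the first takeaway, here is a minimal sketch of what tracking horizon length can look like; `horizon_at` and the run statistics are hypothetical, not results from any published eval.

```python
def horizon_at(success_by_length: dict[int, float], threshold: float = 0.5) -> int:
    """Longest task length (in steps) whose measured end-to-end success rate
    still meets the threshold, i.e. the 'steps until 50% success' metric."""
    qualifying = [n for n, rate in success_by_length.items() if rate >= threshold]
    return max(qualifying, default=0)

# Hypothetical eval log: fraction of runs completed without an unrecovered
# error, keyed by task length in steps.
runs = {10: 0.96, 50: 0.88, 100: 0.71, 200: 0.52, 400: 0.31}
print(horizon_at(runs))  # 200
```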

Bottom line

If you stare at short-task benchmarks, you can convince yourself returns are diminishing. But the moment you optimize for finishing real work—repo-wide code edits, multi-tool research, complex data flows—the picture flips. Small improvements in step reliability and recovery produce outsized gains in end-to-end completion. The frontier is quietly shifting from answering to executing—and by that yardstick, progress is very much alive.

Sources: Sinha et al., The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs (preprint); OpenAI GPT-5 benchmark and reliability notes; Anthropic Claude 4 results; Google DeepMind Gemini 2.5 updates; METR horizon-length analyses.

