"colloquially called the apply model"

You are a an AI coding assistant, powered by tensorzero::function_name::cursorzero. You operate in Cursor You are pair programming with a USER to solve their coding task. Each time the USER sends a message, we may automatically attach some information about their current state, such as what files they have open, where their cursor is, recently viewed files, edit history in their session so far, linter errors, and more. This information may or may not be relevant to the coding task, it is up for you to decide. Your main goal is to follow the USER's instructions at each message, denoted by the tag. When using markdown in assistant messages, use backticks to format file, directory, function, and class names. Use ( and ) for inline math, [ and ] for block math. If you are unsure about the answer to the USER's request or how to satiate their request, you should gather more information. This can be done by asking the USER for more information. Bias towards not asking the user for help if you can find the answer yourself. The user is likely just asking questions and not looking for edits. Only suggest edits if you are certain that the user is looking for edits. When the user is asking for edits to their code, please output a simplified version of the code block that highlights the changes necessary and adds comments to indicate where unchanged code has been skipped. For example: ```language:path/to/file // ... existing code ... {{ edit_1 }} // ... existing code ... {{ edit_2 }} // ... existing code ... ``` The user can see the entire file, so they prefer to only read the updates to the code. Often this will mean that the start/end of the file will be skipped, but that's okay! Rewrite the entire file only if specifically requested. Always provide a brief explanation of the updates, unless the user specifically requests only the code. These edit codeblocks are also read by a less intelligent language model, colloquially called the apply model, to update the file. To help specify the edit to the apply model, you will be very careful when generating the codeblock to not introduce ambiguity. You will specify all unchanged regions (code and comments) of the file with "// ... existing code ..." comment markers. This will ensure the apply model will not delete existing unchanged code or comments when editing the file. You will not mention the apply model. The user's OS version is darwin 24.3.0. The absolute path of the user's workspace is /Users/viraj/tensorzero/tensorzero/examples/cursorzero. The user's shell is /bin/zsh. You MUST use the following format when citing code regions or blocks: ```12:15:app/components/Todo.tsx // ... existing code ... ``` This is the ONLY acceptable format for code citations. The format is ```startLine:endLine:filepath``` where startLine and endLine are line numbers.

"We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future."

"...We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring."

https://openai.com/index/chain-of-thought-monitoring/




Long-Horizon Execution: A Key To AI Progress

Lately I keep hearing that AI has “slowed down.” It comes from two camps: casual users who aren't even using the latest and greatest, and savvy folks repeating a fashionable take. Either way, it’s wrong.

Recent op-eds claim progress is stalling. But look closer at what actually matters for agents, coding copilots, and enterprise workflows: long-horizon execution.

A new preprint, "The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs," nails the point. Tiny gains in a model's per-step reliability compound into huge increases in the length of task it can complete end-to-end.

The one-line idea

If a model’s per-step accuracy is p, the longest task it can finish with success rate s scales like

H_s = ln(s) / ln(p).

Nudging p from “pretty good” to “slightly better” produces nonlinear—indeed, faster-than-exponential beyond ~70%—growth in the solvable horizon. So even if single-turn scores look flat, multi-step capability can be racing ahead.
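
To make the compounding concrete, here is a minimal sketch in plain Python that tabulates the 50%-success horizon implied by H_s = ln(s) / ln(p); the specific accuracy values are illustrative, not taken from the paper.

```python
import math

def horizon(p: float, s: float = 0.5) -> float:
    """Longest chain of dependent steps finishable with success rate s,
    given per-step accuracy p: H_s = ln(s) / ln(p)."""
    return math.log(s) / math.log(p)

# Illustrative per-step accuracies (not taken from the paper).
for p in (0.90, 0.95, 0.99, 0.995, 0.999):
    print(f"p = {p:5.3f}  ->  50%-success horizon ~ {horizon(p):6.0f} steps")
```

Moving from 99% to 99.9% per-step accuracy multiplies the 50%-success horizon by roughly 10x, even though the single-turn score barely moves.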

It’s not (just) reasoning—execution breaks over time

The authors separate planning (“what to do”) from execution (“carry it out without slipping”). Give models the plan and the knowledge; accuracy still decays with length. Bigger models execute reliably for more turns even when smaller ones ace turn one. That’s an execution problem, not a knowledge problem.

They also surface a telling failure mode, and what fixes it:

  • Self-conditioning. When the context contains a model's own earlier errors, the chance of new errors rises; parameter count alone doesn't fix it (a probing sketch follows below).

  • Thinking fixes it. Sequential test-time compute (i.e., explicit “thinking” or RL-trained reasoning) breaks that loop and extends single-turn execution dramatically. In their benchmark, a thinking GPT-5 variant clears 1,000+ steps in one shot; Claude 4 Sonnet is ~400.
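
One way to see self-conditioning in your own setup, sketched below under the assumption of a generic `model_step(history, step)` call (a hypothetical stand-in, not an API from the paper or any specific SDK): run the same chained task twice, once feeding the model its own earlier answers and once feeding corrected ones, then compare per-step error rates. This mirrors the direction of the paper's finding, not its exact protocol.

```python
from typing import Callable, List

def step_error_rate(
    model_step: Callable[[List[str], str], str],  # hypothetical: (history, step) -> answer
    steps: List[str],
    truths: List[str],
    feed_back_own_outputs: bool,
) -> float:
    """Run a chained task and return the fraction of steps answered incorrectly.

    If feed_back_own_outputs is True, the context accumulates the model's own
    (possibly wrong) answers; otherwise it accumulates the ground-truth answers.
    """
    history: List[str] = []
    errors = 0
    for step, truth in zip(steps, truths):
        answer = model_step(history, step)
        errors += answer != truth
        history.append(answer if feed_back_own_outputs else truth)
    return errors / len(steps)

# Self-conditioning shows up as a gap between the two conditions:
#   step_error_rate(..., feed_back_own_outputs=True)    # model sees its own mistakes
#   > step_error_rate(..., feed_back_own_outputs=False) # model sees clean history
```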

Why “slowdown” misses the point

Track task length at constant success rate and you see small step-wise gains compounding—especially in agentic settings where one slip spoils the run. Independent releases point the same way:

  • GPT-5: new SOTA on AIME 2025 (no tools), SWE-bench Verified, MMMU, GPQA; sharper drops in hallucinations and better multi-step, tool-using instruction following.

  • Claude 4 (Opus/Sonnet): strong jumps on real software-engineering tasks (SWE-bench Verified, Terminal-bench); explicitly tuned for long-running agent workflows.

  • Gemini 2.5 Pro: material gains on GPQA Diamond and AIME; “thinking” variants improve more.

  • METR: finds that the horizon length SOTA models can complete is doubling on a ~7-month timescale across domains.

None of that looks like a stall when you optimize for finishing work, not just first-turn answers.

Where “a little better” becomes “a lot more useful”

  • Long-context problem solving. In a chain of H dependent steps, boosting p to p+δ multiplies the maximum H at the same success rate (since H_s = ln(s) / ln(p), the horizon scales as 1/|ln p|). That's why small upgrades suddenly let an agent finish instead of stalling mid-project; a quick numeric check follows this list.

  • Error recovery. Thinking mitigates self-conditioning; modest gains in recovering from a local slip unlock much longer runs. Non-thinking frontier models can fail at two sequential steps; RL-trained thinking models stretch to hundreds in one turn.

  • Agent workflows. Spending a few more tokens to “think” beats parallel sampling for long-horizon reliability.
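
Regarding the first bullet, a short sketch of how much a small accuracy bump δ multiplies the horizon; the numbers are my own illustrations, not the paper's, and the ratio follows directly from H_s = ln(s) / ln(p), independent of the target success rate s.

```python
import math

def horizon_multiplier(p: float, delta: float) -> float:
    """Factor by which the fixed-success-rate horizon grows when per-step
    accuracy improves from p to p + delta: ln(p) / ln(p + delta)."""
    return math.log(p) / math.log(p + delta)

# Illustrative accuracy bumps (not taken from the paper).
for p, delta in ((0.95, 0.01), (0.98, 0.01), (0.99, 0.005)):
    print(f"p {p:.3f} -> {p + delta:.3f}: horizon multiplied by {horizon_multiplier(p, delta):.2f}x")
```

A one-point gain at 98% roughly doubles how long a chain the model can finish, which is the "finish instead of stalling mid-project" effect in practice.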

Practical takeaways for builders

  • Measure the right thing. Track horizon length (e.g., steps until 50% success) and turn complexity, not just single-turn accuracy; a small estimator sketch follows this list.

  • Bias toward thinking. Enable sequential test-time compute for multi-step tasks; it stabilizes execution and extends single-turn capacity.

  • Manage context. Window/summarize to reduce exposure to your model’s past mistakes and blunt self-conditioning drift.
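
For the first takeaway, a minimal estimator sketch, assuming your agent logs a pass/fail flag per step (that log format is an assumption for illustration, not anything from the paper or a specific framework): estimate per-step accuracy from the logs and convert it into a 50%-success horizon via the same H_s = ln(s) / ln(p) relation.

```python
import math
from typing import Iterable

def estimated_horizon(step_outcomes: Iterable[bool], target_success: float = 0.5) -> float:
    """Estimate how many chained steps the agent can finish at the target success rate,
    from logged per-step pass/fail outcomes (log format assumed for illustration)."""
    outcomes = list(step_outcomes)
    p_hat = sum(outcomes) / len(outcomes)    # estimated per-step accuracy
    if p_hat >= 1.0:
        return math.inf                      # no observed failures; this estimate puts no bound on the horizon
    return math.log(target_success) / math.log(p_hat)

# Hypothetical log: 485 good steps out of 500 observed.
outcomes = [True] * 485 + [False] * 15
print(f"estimated 50%-success horizon ~ {estimated_horizon(outcomes):.0f} steps")
```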

Bottom line

If you stare at short-task benchmarks, you can convince yourself returns are diminishing. But the moment you optimize for finishing real work—repo-wide code edits, multi-tool research, complex data flows—the picture flips. Small improvements in step reliability and recovery produce outsized gains in end-to-end completion. The frontier is quietly shifting from answering to executing—and by that yardstick, progress is very much alive.

Sources: Sinha et al., Illusion of Diminishing Returns (preprint); OpenAI GPT-5 benchmarks and reliability notes; Anthropic Claude 4 results; Google DeepMind Gemini 2.5 updates; METR horizon-length analyses.


"colloquially called the apply model"

You are a an AI coding assistant, powered by tensorzero::function_name::cursorzero. You operate in Cursor You are pair programming with a U...