Cold Hard Cache 💾
"colloquially called the apply model"
"We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future."
"...We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring."
https://openai.com/index/chain-of-thought-monitoring/
Long-Horizon Execution: A Key To AI Progress
Lately I keep hearing that AI has “slowed down.” It comes from two camps: casual users who aren't even using the latest and greatest, and savvy folks repeating a fashionable take. Either way, it’s wrong.
Recent op-eds claim progress is stalling. But look closer at what actually matters for agents, coding copilots, and enterprise workflows: long-horizon execution.
A new preprint, The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs, nails the point: tiny gains in a model's per-step reliability compound into huge increases in the length of task it can complete end-to-end.
The one-line idea
If a model's per-step accuracy is p, the longest task it can finish with success rate s scales like H = ln(s) / ln(p): a chain of H steps succeeds with probability p^H, and requiring p^H ≥ s gives H ≤ ln(s)/ln(p).
Nudging per-step accuracy from “pretty good” to “slightly better” produces nonlinear growth in the solvable horizon, and faster-than-exponential growth once accuracy passes roughly 70%. So even if single-turn scores look flat, multi-step capability can be racing ahead.
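To make that concrete, here's a quick Python sketch (mine, not the paper's code) that evaluates H = ln(s)/ln(p) at a fixed 50% target success rate; the swept accuracy values are just illustrative.

```python
import math

def max_horizon(p: float, s: float = 0.5) -> float:
    """Longest run of dependent steps completable with end-to-end
    success rate s, given per-step accuracy p: solve p**H >= s for H."""
    return math.log(s) / math.log(p)

for p in [0.90, 0.95, 0.99, 0.995, 0.999]:
    print(f"p = {p:.3f} -> horizon at 50% success = {max_horizon(p):6.0f} steps")

# p rises from 0.90 to 0.999 (about a 10% gain) while the horizon rises
# from ~7 steps to ~693 steps (about a 100x gain).
```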
It’s not (just) reasoning—execution breaks over time
The authors separate planning (“what to do”) from execution (“carry it out without slipping”). Give models the plan and the knowledge; accuracy still decays with length. Bigger models execute reliably for more turns even when smaller ones ace turn one. That’s an execution problem, not a knowledge problem.
They also surface a telling failure mode:
Self-conditioning. When the context contains a model's own earlier errors, the chance of new errors rises; scaling parameter count alone doesn't fix it.
Thinking fixes it. Sequential test-time compute (i.e., explicit “thinking” or RL-trained reasoning) breaks that loop and extends single-turn execution dramatically. In their benchmark, a thinking GPT-5 variant clears 1,000+ steps in one shot; Claude 4 Sonnet is ~400.
Why “slowdown” misses the plot
Track task length at constant success rate and you see small step-wise gains compounding—especially in agentic settings where one slip spoils the run. Independent releases point the same way:
GPT-5: new SOTA on AIME 2025 (no tools), SWE-bench Verified, MMMU, GPQA; sharper drops in hallucinations and better multi-step, tool-using instruction following.
Claude 4 (Opus/Sonnet): strong jumps on real software-engineering tasks (SWE-bench Verified, Terminal-bench); explicitly tuned for long-running agent workflows.
Gemini 2.5 Pro: material gains on GPQA Diamond and AIME; “thinking” variants improve more.
METR: finds that the horizon length SOTA models can complete has been doubling roughly every seven months across domains.
None of that looks like a stall when you optimize for finishing work, not just first-turn answers.
Where “a little better” becomes “a lot more useful”
Long-context problem solving. In a chain of dependent steps, boosting per-step accuracy from p to p′ multiplies the maximum horizon at the same success rate by ln(p)/ln(p′); moving from 99% to 99.9%, say, yields a roughly 10× longer horizon (see the sketch after this list). That's why small upgrades suddenly let an agent finish instead of stalling mid-project.
Error recovery. Thinking mitigates self-conditioning; modest gains in recovering from a local slip unlock much longer runs. Non-thinking frontier models can fail at two sequential steps; RL-trained thinking models stretch to hundreds in one turn.
Agent workflows. Spending a few more tokens to “think” beats parallel sampling for long-horizon reliability.
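As a sanity check on that multiplier claim, here's a minimal sketch (my own arithmetic, not from any of the cited reports) of how much the horizon stretches when per-step accuracy improves:

```python
import math

def horizon_multiplier(p_old: float, p_new: float) -> float:
    """Factor by which the solvable horizon grows when per-step accuracy
    improves from p_old to p_new. Since H = ln(s)/ln(p), the target
    success rate s cancels out of the ratio."""
    return math.log(p_old) / math.log(p_new)

print(horizon_multiplier(0.99, 0.999))  # ~10.0: 99% -> 99.9% buys a 10x horizon
print(horizon_multiplier(0.90, 0.909))  # ~1.1: the same bump lower down buys little
```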
Practical takeaways for builders
Measure the right thing. Track horizon length (e.g., steps until 50% success) and turn complexity, not just single-turn accuracy.
Bias toward thinking. Enable sequential test-time compute for multi-step tasks; it stabilizes execution and extends single-turn capacity.
Manage context. Window/summarize to reduce exposure to your model’s past mistakes and blunt self-conditioning drift.
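If you want a concrete way to track horizon length, here's a back-of-the-envelope sketch; the pooled log format and the independence assumption are mine, not from the paper:

```python
import math

def estimated_horizon(step_outcomes: list[bool], target: float = 0.5) -> int:
    """Estimate per-step accuracy from pooled step outcomes, then return
    the longest task length whose predicted end-to-end success rate still
    clears `target`. Assumes steps fail independently."""
    p = sum(step_outcomes) / len(step_outcomes)
    return math.floor(math.log(target) / math.log(p))

# Hypothetical eval log: 990 of 1,000 steps succeeded across all runs.
outcomes = [True] * 990 + [False] * 10
print(estimated_horizon(outcomes))  # 68: ~68-step tasks finish half the time
```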
Bottom line
If you stare at short-task benchmarks, you can convince yourself returns are diminishing. But the moment you optimize for finishing real work—repo-wide code edits, multi-tool research, complex data flows—the picture flips. Small improvements in step reliability and recovery produce outsized gains in end-to-end completion. The frontier is quietly shifting from answering to executing—and by that yardstick, progress is very much alive.
Sources: Sinha et al., Illusion of Diminishing Returns (preprint); OpenAI GPT-5 benchmarks and reliability notes; Anthropic Claude 4 results; Google DeepMind Gemini 2.5 updates; METR horizon-length analyses.
Dissecting ChatGPT's Agent Mode: A Look Under the Hood
We've all seen the demos of AI agents booking flights or ordering pizzas. They often feel like magic. But what's actually happening behind the curtain? I was able to analyze a detailed JSON export from a conversation using ChatGPT's agent mode, which gives us a fascinating, low-level look at the architecture in action.
The Agent's "Inner Monologue"
{"summary": "Clarifying Cedar Brook Fields location", "content": "The user is asking for a flight... I need to clarify which state Cedar Brook Fields is located in, as the user mentioned it might be in either Vermont or New Hampshire. Additionally, I may need to confirm the travel dates..."}
{"summary": "Verifying campground location", "content": "The search shows Cedar Brook Fields is in Vermont, but I need to confirm this by visiting an authoritative source, like the campground's official website."}
A Classic Action-Observation Loop
Think: It generates a plan in its thoughts log.
Act: It issues a command to a tool. This is a code message sent to a specific recipient like browser.search or computer.do. The command is a structured JSON object. For example, to perform a search, it sends: {"query": "Cedar Brook Fields campground location Vermont New Hampshire", "source": "computer"}
Observe: The tool executes the command and returns its output. Crucially, for the computer.do tool, this output includes a screenshot of the virtual machine's display. The agent sees what it's doing.
Re-think: It analyzes the new information (and the screenshot) and generates its next thought, adapting its plan based on the outcome of its last action.
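To make the loop concrete, here is a minimal sketch of that pattern in Python. The recipient names (browser.search, computer.do) come from the export; everything else, including the llm placeholder, the message schema, and run_agent itself, is hypothetical scaffolding, not OpenAI's actual implementation:

```python
import json

def run_agent(task: str, llm, tools: dict, max_turns: int = 20):
    """Minimal think-act-observe loop in the spirit of the exported trace.
    `llm` stands in for the model call; it returns a message dict with an
    optional "recipient" (e.g. "browser.search" or "computer.do") and a
    "content" payload. A sketch of the pattern, not OpenAI's implementation."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        msg = llm(context)                    # Think / Re-think
        context.append(msg)
        if msg.get("recipient") is None:      # no tool call -> final answer
            return msg["content"]
        command = json.loads(msg["content"])  # Act: structured JSON command
        observation = tools[msg["recipient"]](command)
        # Observe: for computer.do the observation would include a
        # screenshot of the VM display that the model sees next turn.
        context.append({"role": "tool", "content": observation})
    return None  # ran out of turns
```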
The Toolkit: More Than Just APIs
The Power of Self-Correction
{"summary": "Navigating to Google Flights", "content": "The search results page loaded, but instead of the Google Flights search, it defaulted to Bing. I'll directly navigate to flights.google.com to ensure accurate results."}
{"actions":[{"action":"keypress","keys":["CTRL","Z"]}]}
Grounding and Traceability
From the Code to the Community: How It's Being Used in the Wild
Perform deep research: Analyzing multiple websites to compare products, or even mining Reddit itself for sentiment analysis. One user tasked it with finding ceramic mugs that matched the visual style of a website they were building, a fuzzy, stylistic search that's difficult for traditional tools.
Automate grunt work: Filling out job applications, checking product stock, or finding a friend's lost wallet by navigating a German-language lost-and-found website.
Solve personal problems: One user even employed it to hunt down the best deal on a specific retired Lego set.
Conclusion: It's a Controller, Not Just a Chatbot
📊 Common Patterns Seen in Successful Software Projects
Dogfooding: Creators use their own software heavily, improving it organically.
“Thin to Thick” Clients: Many succes...
"A Name, an Address, a Route" Haiku — Found in RFC 791: DARPA’s 1981 Internet Protocol
A name indicates what we seek.
An address indicates where it is.
A route indicates how to get there.
The internet protocol deals primarily with addresses.
(Not technically a haiku).
What happened to tor?
On March 4, 2025, the New York Times shut down its .onion site, which it had launched in 2017. Is Tor dead? Tor's published metrics show a volume of activity that doesn't look like a big drop. It might just be that there is so much less publicity about it nowadays that it feels that way. It also seems to have lost its status as anything but a place for criminals. Maybe. One theory is that many of the onion proxy networks are run by the FBI. I don't know.
I did some research:
In short — Tor isn’t dead at all.
Its codebase, funding, and user base are still moving forward, but the hype cycle and many clearnet “onion directories” have stagnated, so a newcomer who googles for .onion links mostly finds blog posts frozen in 2014-2018. Behind that veneer, Tor Browser 14.5 shipped two weeks ago, the network still serves ~2.4 million daily users, and the project's 2024 fundraising goal was met. What changed is (a) the way journalists talk about Tor, (b) a big technical migration that broke most older onion addresses, and (c) law-enforcement pressure that discouraged public link lists.
"colloquially called the apply model"
You are a an AI coding assistant, powered by tensorzero::function_name::cursorzero. You operate in Cursor You are pair programming with a U...
-
A name indicates what we seek. An address indicates where it is. A route indicates how to get there. The internet protocol deals prima...
-
Pattern Description Dogfooding: Creators use their own software heavily, improving it organically. “Thin to Thick” Clients: Many succes...
-
On March 4, 2025, the New York Times shut down its .onion site which it had launched in 2017. Is tor dead? tor published metrics speak to a ...