"colloquially called the apply model"

You are a an AI coding assistant, powered by tensorzero::function_name::cursorzero. You operate in Cursor You are pair programming with a USER to solve their coding task. Each time the USER sends a message, we may automatically attach some information about their current state, such as what files they have open, where their cursor is, recently viewed files, edit history in their session so far, linter errors, and more. This information may or may not be relevant to the coding task, it is up for you to decide. Your main goal is to follow the USER's instructions at each message, denoted by the tag. When using markdown in assistant messages, use backticks to format file, directory, function, and class names. Use ( and ) for inline math, [ and ] for block math. If you are unsure about the answer to the USER's request or how to satiate their request, you should gather more information. This can be done by asking the USER for more information. Bias towards not asking the user for help if you can find the answer yourself. The user is likely just asking questions and not looking for edits. Only suggest edits if you are certain that the user is looking for edits. When the user is asking for edits to their code, please output a simplified version of the code block that highlights the changes necessary and adds comments to indicate where unchanged code has been skipped. For example: ```language:path/to/file // ... existing code ... {{ edit_1 }} // ... existing code ... {{ edit_2 }} // ... existing code ... ``` The user can see the entire file, so they prefer to only read the updates to the code. Often this will mean that the start/end of the file will be skipped, but that's okay! Rewrite the entire file only if specifically requested. Always provide a brief explanation of the updates, unless the user specifically requests only the code. These edit codeblocks are also read by a less intelligent language model, colloquially called the apply model, to update the file. To help specify the edit to the apply model, you will be very careful when generating the codeblock to not introduce ambiguity. You will specify all unchanged regions (code and comments) of the file with "// ... existing code ..." comment markers. This will ensure the apply model will not delete existing unchanged code or comments when editing the file. You will not mention the apply model. The user's OS version is darwin 24.3.0. The absolute path of the user's workspace is /Users/viraj/tensorzero/tensorzero/examples/cursorzero. The user's shell is /bin/zsh. You MUST use the following format when citing code regions or blocks: ```12:15:app/components/Todo.tsx // ... existing code ... ``` This is the ONLY acceptable format for code citations. The format is ```startLine:endLine:filepath``` where startLine and endLine are line numbers.

"We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future."

"...We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring."

https://openai.com/index/chain-of-thought-monitoring/




Long-Horizon Execution: A Key To AI Progress

Lately I keep hearing that AI has “slowed down.” It comes from two camps: casual users who aren't even using the latest and greatest, and savvy folks repeating a fashionable take. Either way, it’s wrong.

Recent op-eds claim progress is stalling. But look closer at what actually matters for agents, coding copilots, and enterprise workflows: long-horizon execution.

A new preprint—The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs—nails the point. Tiny gains in a model’s per-step reliability compound into huge increases in how long a task it can complete end-to-end.

The one-line idea

If a model’s per-step accuracy is p, the longest task it can finish with success rate s scales like

H_s = ln(s) / ln(p)

Nudging p from “pretty good” to “slightly better” produces nonlinear—indeed, faster-than-exponential beyond ~70%—growth in the solvable horizon. So even if single-turn scores look flat, multi-step capability can be racing ahead.
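To make the compounding concrete, here is a minimal Python sketch (my own illustration, not code from the preprint) that evaluates H_s = ln(s) / ln(p) for a few per-step accuracies:

```python
import math

def horizon(p: float, s: float = 0.5) -> float:
    """Longest chain of independent steps completable at success rate s,
    given per-step accuracy p: H_s = ln(s) / ln(p)."""
    return math.log(s) / math.log(p)

for p in (0.90, 0.99, 0.999):
    print(f"p = {p:.3f}  ->  horizon at 50% success ~ {horizon(p):.0f} steps")

# p = 0.900  ->  horizon at 50% success ~ 7 steps
# p = 0.990  ->  horizon at 50% success ~ 69 steps
# p = 0.999  ->  horizon at 50% success ~ 693 steps
```

Cutting per-step error from 1% to 0.1% buys roughly a 10x longer horizon at the same success rate, which is why flat-looking single-turn scores can hide large agentic gains.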

It’s not (just) reasoning—execution breaks over time

The authors separate planning (“what to do”) from execution (“carry it out without slipping”). Give models the plan and the knowledge; accuracy still decays with length. Bigger models execute reliably for more turns even when smaller ones ace turn one. That’s an execution problem, not a knowledge problem.

They also surface a telling failure mode:

        Self-conditioning. When the context contains a model’s own earlier errors, the chance of new errors rises; parameter count alone doesn’t fix it.

        Thinking fixes it. Sequential test-time compute (i.e., explicit “thinking” or RL-trained reasoning) breaks that loop and extends single-turn execution dramatically. In their benchmark, a thinking GPT-5 variant clears 1,000+ steps in one shot; Claude 4 Sonnet is ~400.

Why “slowdown” misses the plot

Track task length at constant success rate and you see small step-wise gains compounding—especially in agentic settings where one slip spoils the run. Independent releases point the same way:

  • GPT-5: new SOTA on AIME 2025 (no tools), SWE-bench Verified, MMMU, GPQA; sharper drops in hallucinations and better multi-step, tool-using instruction following.

  • Claude 4 (Opus/Sonnet): strong jumps on real software-engineering tasks (SWE-bench Verified, Terminal-bench); explicitly tuned for long-running agent workflows.

  • Gemini 2.5 Pro: material gains on GPQA Diamond and AIME; “thinking” variants improve more.

  • METR: finds that the horizon length SOTA models can complete is doubling on a ~7-month timescale across domains.

None of that looks like a stall when you optimize for finishing work, not just first-turn answers.

Where “a little better” becomes “a lot more useful”

  • Long-context problem solving. In a chain of H dependent steps, boosting p to p+δ multiplies the maximum H at the same success rate (H ∝ 1/ln(1/p)). That’s why small upgrades suddenly let an agent finish instead of stalling mid-project.

  • Error recovery. Thinking mitigates self-conditioning; modest gains in recovering from a local slip unlock much longer runs. Non-thinking frontier models can fail at two sequential steps; RL-trained thinking models stretch to hundreds in one turn.

  • Agent workflows. Spending a few more tokens to “think” beats parallel sampling for long-horizon reliability.

Practical takeaways for builders

  • Measure the right thing. Track horizon length (e.g., steps until 50% success) and turn complexity, not just single-turn accuracy; a rough measurement sketch follows this list.

  • Bias toward thinking. Enable sequential test-time compute for multi-step tasks; it stabilizes execution and extends single-turn capacity.

  • Manage context. Window/summarize to reduce exposure to your model’s past mistakes and blunt self-conditioning drift.
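Here is the measurement sketch referred to above (my own illustration; `run_agent_step` is a hypothetical stub standing in for your agent harness). It estimates steps-until-50%-success directly from rollouts:

```python
import random

def run_agent_step(state: dict) -> bool:
    """Hypothetical stub: returns True if the agent completed this step
    correctly. Replace with a call into your real agent harness."""
    return random.random() < 0.97  # assume ~97% per-step accuracy

def rollout(max_steps: int) -> int:
    """Run one episode and return how many consecutive steps succeeded."""
    for step in range(max_steps):
        if not run_agent_step({"step": step}):
            return step
    return max_steps

def horizon_at_50pct(n_rollouts: int = 200, max_steps: int = 2000) -> int:
    """Median completed length across rollouts ~ steps until 50% success."""
    lengths = sorted(rollout(max_steps) for _ in range(n_rollouts))
    return lengths[len(lengths) // 2]

print("estimated 50%-success horizon:", horizon_at_50pct(), "steps")
```

Tracked across model versions, this number moves much more than single-turn accuracy does.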

Bottom line

If you stare at short-task benchmarks, you can convince yourself returns are diminishing. But the moment you optimize for finishing real work—repo-wide code edits, multi-tool research, complex data flows—the picture flips. Small improvements in step reliability and recovery produce outsized gains in end-to-end completion. The frontier is quietly shifting from answering to executing—and by that yardstick, progress is very much alive.

Sources: Sinha et al., Illusion of Diminishing Returns (preprint); OpenAI GPT-5 benchmarks and reliability notes; Anthropic Claude 4 results; Google DeepMind Gemini 2.5 updates; METR horizon-length analyses.


Dissecting ChatGPT's Agent Mode: A Look Under the Hood

We've all seen the demos of AI agents booking flights or ordering pizzas. They often feel like magic. But what's actually happening behind the curtain? I was able to analyze a detailed JSON export from a conversation using ChatGPT's agent mode, which gives us a fascinating, low-level look at the architecture in action.

A quick but important note on the data: the original query contained a real campground name and user location. I randomized these details to avoid becoming a potential victim of social engineering. This is a fascinating point in itself—as these agents become more integrated into our lives, the logs of their actions will contain a rich tapestry of personal data, and thinking about how to safely share and analyze these logs is a critical security challenge.

With that in mind, our sanitized request is: "Find me a good flight to Cedar Brook Fields campground (which is in either Vermont or New Hampshire) from an airport near Brookline, Massachusetts."

The data reveals that this is far more than a language model with a few plugins. It's a methodical, tool-using controller operating a virtualized browser environment, complete with an "inner monologue" for planning, execution, and surprisingly robust self-correction.

Let's break down what the data shows.

The Agent's "Inner Monologue"

The most revealing feature is the agent's internal thought process, captured in messages with content_type: "thoughts". The user never sees this, but it's the core of the agent's strategy.

Immediately after the initial prompt, the agent doesn't guess—it plans. Its first thought is to identify the missing variables:

{"summary": "Clarifying Cedar Brook Fields location", "content": "The user is asking for a flight... I need to clarify which state Cedar Brook Fields is located in, as the user mentioned it might be in either Vermont or New Hampshire. Additionally, I may need to confirm the travel dates..."}

This isn't just generating a response; it's decomposing a problem. Later, after getting search results, it shows critical evaluation skills. It doesn't just trust the first result:

{"summary": "Verifying campground location", "content": "The search shows Cedar Brook Fields is in Vermont, but I need to confirm this by visiting an authoritative source, like the campground's official website."}

This inner monologue is the agent's strategy layer, where it decides what it knows, what it needs to find out, and which tool to use next.

A Classic Action-Observation Loop

The agent operates on a clear, programmatic loop: Think -> Act -> Observe -> Re-think.

  1. Think: It generates a plan in its thoughts log.

  2. Act: It issues a command to a tool. This is a code message sent to a specific recipient like browser.search or computer.do. The command is a structured JSON object. For example, to perform a search, it sends:

    {"query": "Cedar Brook Fields campground location Vermont New Hampshire", "source": "computer"}

  3. Observe: The tool executes the command and returns its output. Crucially, for the computer.do tool, this output includes a screenshot of the virtual machine's display. The agent sees what it's doing.

  4. Re-think: It analyzes the new information (and the screenshot) and generates its next thought, adapting its plan based on the outcome of its last action.
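In code, the loop is simple. Here is a minimal, runnable sketch of the pattern visible in the log (my own illustration, not OpenAI's implementation; `plan_next_action`, `execute_tool`, and `is_done` are hypothetical stand-ins):

```python
from typing import Tuple

def plan_next_action(history: list) -> Tuple[str, dict]:
    """Hypothetical stand-in for the model's 'thoughts' + tool-call step."""
    return ("I should confirm the campground's location first.",
            {"recipient": "browser.search",
             "args": {"query": "Cedar Brook Fields campground location"}})

def execute_tool(action: dict) -> str:
    """Hypothetical stand-in for the tool runtime (search, computer.do, ...)."""
    return f"results for {action['args']}"

def is_done(history: list) -> bool:
    """Hypothetical stand-in for the 'answer ready' check."""
    return len(history) >= 5

def run_agent(task: str, max_turns: int = 50) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # Think: an internal "thoughts" message plus a structured tool call.
        thought, action = plan_next_action(history)
        history.append({"content_type": "thoughts", "content": thought})

        # Act + Observe: dispatch the command; the tool's output (and, for
        # computer.do, a screenshot) is appended back into the context.
        observation = execute_tool(action)
        history.append({"role": "tool", "recipient": action["recipient"],
                        "content": observation})

        # Re-think: the next iteration replans against the new observation.
        if is_done(history):
            return history[-1]["content"]
    return "gave up after max_turns"

print(run_agent("Find a flight to Cedar Brook Fields campground"))
```

The interesting engineering lives in the pieces this sketch stubs out: the planner, the tool sandbox, and the stopping rule.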

The Toolkit: More Than Just APIs

The agent has access to a set of tools. The most powerful one we see is computer.do, which gives it direct, low-level control over the browser environment. It doesn't just call a high-level find_flights() API; it navigates a real website.

It can type, press keys, and click at specific (x, y) coordinates:

{"actions":[{"action":"click","x":408,"y":448,"button":1},{"action":"type","text":"Burlington Vermont"},{"action":"wait"}]}

This means it can interact with almost any website, just as a human would, without needing a pre-built API for that specific service. It's operating a GUI. (This has led to some wonderfully surreal moments out in the wild, like in this hilarious post where an agent casually clicks an "I'm not a robot" checkbox, completely oblivious to the irony).
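To make that action format concrete, here is a small sketch of how such a structured action list could be replayed against a local GUI using pyautogui (my own illustration; this is not the agent's actual runtime, just a mapping of the schema seen in the log):

```python
import time
import pyautogui  # pip install pyautogui

def replay_actions(actions: list) -> None:
    """Replay agent-style GUI actions locally (illustration only)."""
    for a in actions:
        if a["action"] == "click":
            pyautogui.click(x=a["x"], y=a["y"])
        elif a["action"] == "type":
            pyautogui.write(a["text"])
        elif a["action"] == "keypress":
            pyautogui.hotkey(*[k.lower() for k in a["keys"]])  # e.g. CTRL+Z
        elif a["action"] == "wait":
            time.sleep(1.0)

replay_actions([
    {"action": "click", "x": 408, "y": 448, "button": 1},
    {"action": "type", "text": "Burlington Vermont"},
    {"action": "wait"},
])
```

The point is that the agent's "hands" are nothing more exotic than synthetic mouse and keyboard events aimed at whatever happens to be on screen.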

The Power of Self-Correction

This is the most remarkable part. The agent can recognize when it has made a mistake and take steps to fix it. The JSON log shows two brilliant examples.

First, it tries to search for "Google Flights" but gets redirected to Bing. It sees this in the screenshot and its next thought is to recover:

{"summary": "Navigating to Google Flights", "content": "The search results page loaded, but instead of the Google Flights search, it defaulted to Bing. I'll directly navigate to flights.google.com to ensure accurate results."}

Its next action is to take control of the address bar and type the URL directly.

Even more impressively, at one point it makes a typo, entering the destination into the origin field. It recognizes the error from the screen and its next action is this:

{"actions":[{"action":"keypress","keys":["CTRL","Z"]}]}

It literally performs an "undo" command to correct its own mistake. This is a powerful and robust error-handling mechanism that goes far beyond simple retries.

Grounding and Traceability

Finally, when the agent provides its answer, it's not just making it up. The information is grounded in the sources it found. After navigating to what it believes is the campground's official website, its final confirmation includes a citation that links its statement directly to the URL it found. The metadata for the message contains a citations block that explicitly links the answer to the web page it visited, providing a clear, auditable trail for its facts.

From the Code to the Community: How It's Being Used in the Wild

So, this low-level data gives us a clear picture of the architecture. But how is this technology actually being used? A dive into communities like Reddit's r/OpenAI and r/ChatGPT reveals a vibrant mix of practical experimentation and constructive criticism.

Users are treating it like a junior assistant, offloading complex research and tedious tasks. We're seeing reports of it being used to:

  • Perform deep research: Analyzing multiple websites to compare products, or even mining Reddit itself for sentiment analysis. One user tasked it with finding ceramic mugs that matched the visual style of a website they were building—a fuzzy, stylistic search that's difficult for traditional tools.

  • Automate grunt work: Filling out job applications, checking product stock, or finding a friend's lost wallet by navigating a German-language lost-and-found website.

  • Solve personal problems: One user even employed it to hunt down the best deal on a specific retired Lego set.

The sentiment is one of cautious optimism. Many describe it as being in its "GPT-2 stage"—incredibly promising but still prone to glitches and sometimes taking longer than doing the task manually. The agent can get stuck, misinterpret instructions, or fail on websites with tricky UIs.

This leads to the biggest theme: trust and security. Users are rightly cautious about handing over credentials to their CRMs or email accounts. The concept of the agent as a supervised "intern" or "babysitter" is a common mental model. You let it do the work, but you watch it closely, especially when it's handling sensitive information. The risk of prompt injection, where a malicious website could hijack the agent's session, is a very real concern.

Conclusion: It's a Controller, Not Just a Chatbot

This data, combined with the community's real-world experiments, provides a clear picture: ChatGPT's agent mode is a sophisticated controller. It uses its language intelligence to create plans but relies on a methodical, tool-based execution loop to interact with the digital world.

The ability to see its environment via screenshots and correct its own mistakes using low-level keyboard and mouse commands is a significant architectural step. Even in its "GPT-2 stage," it moves the technology from being a pure "language model" to a functional, interactive agent, giving us a real glimpse into the future of practical AI.


📊 Common Patterns Seen in Successful Software Projects

  • Dogfooding: Creators use their own software heavily, improving it organically.

  • “Thin to Thick” Clients: Many successful apps evolve from simple frontends to more powerful, client-heavy experiences.

  • Adoption Loop: Usage -> Feedback -> Improvement -> More Usage. A positive feedback cycle.

  • Breakouts: A successful internal tool becomes open-source or commercialized (e.g., Kubernetes, VS Code).

  • Core + Plugin Architecture: Keeps the core lean while enabling rich extensibility (e.g., browsers, IDEs).

  • Rewrite moments: Major inflection points often include painful rewrites (Twitter, Slack, etc. all rewrote core systems at scale).
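As a toy illustration of the core-plus-plugin entry above (my own sketch, not tied to any of the projects named), the core can stay lean by exposing nothing more than a registration hook:

```python
from typing import Callable, Dict

PLUGINS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that plugins use to hook into the lean core."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[name] = fn
        return fn
    return wrap

@register("shout")
def shout(text: str) -> str:
    return text.upper()

def core_process(text: str) -> str:
    """Core pipeline: run every registered plugin over the input."""
    for plugin in PLUGINS.values():
        text = plugin(text)
    return text

print(core_process("hello, plugins"))  # -> HELLO, PLUGINS
```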

"A Name, an Address, a Route" Haiku — Found in RFC 791: DARPA’s 1981 Internet Protocol

A name indicates what we seek.  

An address indicates where it is.  

A route indicates how to get there.  

The internet protocol deals primarily with addresses.



(Not technically a haiku).

Link.

What happened to Tor?

On March 4, 2025, the New York Times shut down its .onion site, which it had launched in 2017. Is Tor dead? Tor’s published metrics show a volume of activity that doesn’t look like a big drop. It might just be that there is so much less publicity about it nowadays that it feels that way. Also, in the public eye it seems to have lost any reputation beyond being a place for criminals. Maybe. One theory is that many of the onion proxy networks are run by the FBI. I don’t know.


I did some research:


In short — Tor isn’t dead at all.

Its code-base, funding, and user-base are still moving forward, but the hype cycle and many clearnet “onion directories” have stagnated, so a newcomer who googles for .onion links mostly meets blog-posts frozen in 2014-2018. Behind that veneer, Tor Browser 14.5 shipped two weeks ago, the network still serves ~2.4 million daily users, and the project’s 2024 fundraising goal was met. What changed is (a) the way journalists talk about Tor, (b) a big technical migration that broke most older onion addresses, and (c) law-enforcement pressure that discouraged public link-lists. 


"colloquially called the apply model"

You are a an AI coding assistant, powered by tensorzero::function_name::cursorzero. You operate in Cursor You are pair programming with a U...