Dissecting ChatGPT's Agent Mode: A Look Under the Hood

We've all seen the demos of AI agents booking flights or ordering pizzas. They often feel like magic. But what's actually happening behind the curtain? I was able to analyze a detailed JSON export from a conversation using ChatGPT's agent mode, which gives us a fascinating, low-level look at the architecture in action.

A quick but important note on the data: the original query contained a real campground name and user location. I randomized these details to avoid becoming a potential victim of social engineering. This is a fascinating point in itself—as these agents become more integrated into our lives, the logs of their actions will contain a rich tapestry of personal data, and thinking about how to safely share and analyze these logs is a critical security challenge.

With that in mind, our sanitized request is: "Find me a good flight to Cedar Brook Fields campground (which is in either Vermont or New Hampshire) from an airport near Brookline, Massachusetts."

The data reveals that this is far more than a language model with a few plugins. It's a methodical, tool-using controller operating a virtualized browser environment, complete with an "inner monologue" for planning, execution, and surprisingly robust self-correction.

Let's break down what the data shows.

The Agent's "Inner Monologue"

The most revealing feature is the agent's internal thought process, captured in messages with content_type: "thoughts". The user never sees this, but it's the core of the agent's strategy.

Immediately after the initial prompt, the agent doesn't guess—it plans. Its first thought is to identify the missing variables:

{"summary": "Clarifying Cedar Brook Fields location", "content": "The user is asking for a flight... I need to clarify which state Cedar Brook Fields is located in, as the user mentioned it might be in either Vermont or New Hampshire. Additionally, I may need to confirm the travel dates..."}

This isn't just generating a response; it's decomposing a problem. Later, after getting search results, it shows critical evaluation skills. It doesn't just trust the first result:

{"summary": "Verifying campground location", "content": "The search shows Cedar Brook Fields is in Vermont, but I need to confirm this by visiting an authoritative source, like the campground's official website."}

This inner monologue is the agent's strategy layer, where it decides what it knows, what it needs to find out, and which tool to use next.
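If you want to dig these out of your own export, here's a minimal sketch. It assumes the standard conversations.json layout (each conversation holds a "mapping" of message nodes); the summary and content fields match what appears in the log above:

```python
import json

with open("conversations.json") as f:
    conversations = json.load(f)

# Walk every message node and print the hidden planning messages.
for convo in conversations:
    for node in convo["mapping"].values():
        msg = node.get("message")
        if msg and msg["content"].get("content_type") == "thoughts":
            for thought in msg["content"].get("thoughts", []):
                print(f'{thought["summary"]}: {thought["content"][:100]}')
```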

A Classic Action-Observation Loop

The agent operates on a clear, programmatic loop: Think -> Act -> Observe -> Re-think (sketched in code after the list).

  1. Think: It generates a plan in its thoughts log.

  2. Act: It issues a command to a tool. This is a code message sent to a specific recipient like browser.search or computer.do. The command is a structured JSON object. For example, to perform a search, it sends:

    {"query": "Cedar Brook Fields campground location Vermont New Hampshire", "source": "computer"}

  3. Observe: The tool executes the command and returns its output. Crucially, for the computer.do tool, this output includes a screenshot of the virtual machine's display. The agent sees what it's doing.

  4. Re-think: It analyzes the new information (and the screenshot) and generates its next thought, adapting its plan based on the outcome of its last action.
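Here is that loop as a minimal Python sketch. The tool names (browser.search, computer.do) and the hidden "thoughts" content type come straight from the log; the model and tool interfaces are assumptions for illustration:

```python
def run_agent(task, model, tools):
    """Think -> Act -> Observe -> Re-think, until the agent answers the user."""
    history = [{"role": "user", "content": task}]
    while True:
        # Think: a hidden "thoughts" message lays out the next step of the plan.
        history.append(model.generate(history, content_type="thoughts"))

        # Act: a "code" message addressed to a tool, e.g. browser.search or computer.do.
        action = model.generate(history, content_type="code")
        if action["recipient"] == "user":
            return action["content"]  # final, user-visible answer

        # Observe: run the structured command; computer.do also returns a screenshot.
        observation = tools[action["recipient"]](action["content"])
        history.append({"role": "tool", "content": observation})
        # Re-think happens on the next iteration, with the observation in context.
```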

The Toolkit: More Than Just APIs

The agent has access to a set of tools. The most powerful one we see is computer.do, which gives it direct, low-level control over the browser environment. It doesn't just call a high-level find_flights() API; it navigates a real website.

It can type, press keys, and click at specific (x, y) coordinates:

{"actions":[{"action":"click","x":408,"y":448,"button":1},{"action":"type","text":"Burlington Vermont"},{"action":"wait"}]}

This means it can interact with almost any website, just as a human would, without needing a pre-built API for that specific service. It's operating a GUI. (This has led to some wonderfully surreal moments out in the wild, like in this hilarious post where an agent casually clicks an "I'm not a robot" checkbox, completely oblivious to the irony).
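As a sketch of how such a payload might be executed, here's a hypothetical replay function using pyautogui as a stand-in for whatever actually drives the agent's VM (the dispatch logic and the fixed one-second wait are assumptions):

```python
import time
import pyautogui  # stand-in for whatever drives the real virtual display

def execute(actions: list[dict]) -> None:
    """Replay a computer.do-style action list against the local display."""
    for a in actions:
        if a["action"] == "click":
            pyautogui.click(x=a["x"], y=a["y"])  # button 1 == left click
        elif a["action"] == "type":
            pyautogui.write(a["text"])
        elif a["action"] == "keypress":
            pyautogui.hotkey(*(k.lower() for k in a["keys"]))  # e.g. ["CTRL", "Z"]
        elif a["action"] == "wait":
            time.sleep(1.0)  # assumption: the real semantics of "wait" are unknown

execute([{"action": "click", "x": 408, "y": 448, "button": 1},
         {"action": "type", "text": "Burlington Vermont"},
         {"action": "wait"}])
```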

The Power of Self-Correction

This is the most remarkable part. The agent can recognize when it has made a mistake and take steps to fix it. The JSON log shows two brilliant examples.

First, it tries to search for "Google Flights" but gets redirected to Bing. It sees this in the screenshot and its next thought is to recover:

{"summary": "Navigating to Google Flights", "content": "The search results page loaded, but instead of the Google Flights search, it defaulted to Bing. I'll directly navigate to flights.google.com to ensure accurate results."}

Its next action is to take control of the address bar and type the URL directly.

Even more impressively, at one point it makes a typo, entering the destination into the origin field. It recognizes the error from the screen and its next action is this:

{"actions":[{"action":"keypress","keys":["CTRL","Z"]}]}

It literally performs an "undo" command to correct its own mistake. This is a powerful and robust error-handling mechanism that goes far beyond simple retries.

Grounding and Traceability

Finally, when the agent provides its answer, it's not just making it up. The information is grounded in the sources it found. After navigating to what it believes is the campground's official website, it issues its final confirmation with a citation: the message metadata contains a citations block that explicitly links the answer to the URL it visited, providing a clear, auditable trail for its facts.
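The block looks roughly like this; the field names below are an approximation of the export schema rather than a verbatim copy, and the URL is a placeholder:

```json
{
  "metadata": {
    "citations": [
      {
        "start_ix": 142,
        "end_ix": 180,
        "metadata": {
          "type": "webpage",
          "title": "Cedar Brook Fields Campground",
          "url": "https://example.com/cedar-brook-fields"
        }
      }
    ]
  }
}
```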

From the Code to the Community: How It's Being Used in the Wild

So, this low-level data gives us a clear picture of the architecture. But how is this technology actually being used? A dive into communities like Reddit's r/OpenAI and r/ChatGPT reveals a vibrant mix of practical experimentation and constructive criticism.

Users are treating it like a junior assistant, offloading complex research and tedious tasks. We're seeing reports of it being used to:

  • Perform deep research: Analyzing multiple websites to compare products, or even mining Reddit itself for sentiment analysis. One user tasked it with finding ceramic mugs that matched the visual style of a website they were building—a fuzzy, stylistic search that's difficult for traditional tools.

  • Automate grunt work: Filling out job applications, checking product stock, or finding a friend's lost wallet by navigating a German-language lost-and-found website.

  • Solve personal problems: One user even employed it to hunt down the best deal on a specific retired Lego set.

The sentiment is one of cautious optimism. Many describe it as being in its "GPT-2 stage"—incredibly promising but still prone to glitches and sometimes taking longer than doing the task manually. The agent can get stuck, misinterpret instructions, or fail on websites with tricky UIs.

This leads to the biggest theme: trust and security. Users are rightly cautious about handing over credentials to their CRMs or email accounts. The concept of the agent as a supervised "intern" or "babysitter" is a common mental model. You let it do the work, but you watch it closely, especially when it's handling sensitive information. The risk of prompt injection, where a malicious website could hijack the agent's session, is a very real concern.

Conclusion: It's a Controller, Not Just a Chatbot

This data, combined with the community's real-world experiments, provides a clear picture: ChatGPT's agent mode is a sophisticated controller. It uses its language intelligence to create plans but relies on a methodical, tool-based execution loop to interact with the digital world.

The ability to see its environment via screenshots and correct its own mistakes using low-level keyboard and mouse commands is a significant architectural step. Even in its "GPT-2 stage," it moves the technology from being a pure "language model" to a functional, interactive agent, giving us a real glimpse into the future of practical AI.


📊 Common Patterns Seen in Successful Software Projects

  • Dogfooding: Creators use their own software heavily, improving it organically.

  • “Thin to Thick” Clients: Many successful apps evolve from simple frontends to more powerful, client-heavy experiences.

  • Adoption Loop: Usage -> Feedback -> Improvement -> More Usage. A positive feedback cycle.

  • Breakouts: A successful internal tool becomes open-source or commercialized (e.g., Kubernetes, VS Code).

  • Core + Plugin Architecture: Keeps the core lean while enabling rich extensibility (e.g., browsers, IDEs).

  • Rewrite Moments: Major inflection points often include painful rewrites (Twitter, Slack, etc. all rewrote core systems at scale).

"A Name, an Address, a Route" Haiku — Found in RFC 791: DARPA’s 1981 Internet Protocol

A name indicates what we seek.
An address indicates where it is.
A route indicates how to get there.
The internet protocol deals primarily with addresses.

(Not technically a haiku).


What happened to Tor?

On March 4, 2025, the New York Times shut down its .onion site, which it had launched in 2017. Is Tor dead? Tor's published metrics show a volume of activity that doesn't look like a big drop. It might just be that there is so much less publicity about it nowadays that it feels that way. It also seems to have lost its reputation as anything but a place for criminals. Maybe. One theory is that many of the onion proxy networks are run by the FBI. I don't know.


I did some research:


In short — Tor isn’t dead at all.

Its codebase, funding, and user base are still moving forward, but the hype cycle and many clearnet “onion directories” have stagnated, so a newcomer who googles for .onion links mostly meets blog posts frozen in 2014-2018. Behind that veneer, Tor Browser 14.5 shipped two weeks ago, the network still serves ~2.4 million daily users, and the project's 2024 fundraising goal was met. What changed is (a) the way journalists talk about Tor, (b) a big technical migration that broke most older onion addresses, and (c) law-enforcement pressure that discouraged public link-lists.


Creating a Family Tree using GEDCOM

A broad category of where genAI is useful is turning unstructured data into structured data. Creating a family tree (in GEDCOM format) is a good example: I gave it a stream-of-consciousness collection of details about who is who (this person is so-and-so's kid...) and asked for GEDCOM output.
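For the curious, GEDCOM itself is a simple line-oriented text format. A tiny hand-written sample (names hypothetical) looks like this:

```
0 HEAD
1 GEDC
2 VERS 5.5.1
1 CHAR UTF-8
0 @I1@ INDI
1 NAME Jane /Doe/
1 FAMS @F1@
0 @I2@ INDI
1 NAME John /Doe/
1 FAMS @F1@
0 @I3@ INDI
1 NAME Sam /Doe/
1 FAMC @F1@
0 @F1@ FAM
1 HUSB @I2@
1 WIFE @I1@
1 CHIL @I3@
0 TRLR
```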

Then I imported the GEDCOM file into https://www.familyecho.com/, exported to PDF, and printed it.

Analysis of my ChatGPT data

I did an initial analysis of my ChatGPT history export. Here's a surface-level, first-pass analysis:

Content Types

Most messages are text, with assistant messages also including code, thinking, and reasoning_recap. User messages are primarily text, with a small amount of multimodal_text.
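These breakdowns fall out of a simple group-by; this sketch assumes the export has already been flattened into a Spark DataFrame with author_role and content_type columns:

```python
# Count messages by author role and content type (column names assumed).
df.groupBy("author_role", "content_type").count().orderBy("count", ascending=False).show()
```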

Conversation Length

Some conversations are significantly longer than others (e.g., "Genkit to Gemini Refactor" has the most messages).
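A one-liner surfaces the longest threads, again assuming a conversation_title column in the flattened DataFrame:

```python
# Messages per conversation, longest first.
df.groupBy("conversation_title").count().orderBy("count", ascending=False).show(5, truncate=False)
```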

Message Length (Word Count)

  • Assistant messages: ~236 words on average.

  • User messages: ~222 words on average, but with high variability, including some very long inputs.

```python
from pyspark.sql.functions import col, size, split

# Word counts per message, for all messages and for user messages only.
df = df.withColumn("word_count", size(split(col("content_text"), r"\s+")))
df_user = df_user.withColumn("word_count", size(split(col("content_text"), r"\s+")))

df.select("word_count").describe().show()
df_user.select("word_count").describe().show()
```

Temporal Trends (Monthly Aggregations)

The data spans Feb 2023 to May 2025. I tracked metrics like average words per conversation, messages per month, TTR (type-token ratio), and subjectivity/objectivity over this period, revealing fluctuations and trends in language style.

```python
from pyspark.sql.functions import avg, col, count, date_format, sum as spark_sum

# Monthly aggregates per conversation: total words, average words, message count.
monthly = (
    df.withColumn("month", date_format(col("create_time"), "yyyy-MM"))
      .groupBy("conversation_title", "month")
      .agg(
          spark_sum("word_count").alias("word_count_per_conversation"),
          avg("word_count").alias("avg_word_count"),
          count("*").alias("message_count"),
      )
      .orderBy("conversation_title", "month")
)
monthly.show(n=5)
```

User messages show varying monthly averages for word count, TTR, and subjectivity.

Next Steps

The goal is to track the quality of GPT responses over time, which I haven't achieved yet. To do this, I plan to use LLM-based evaluation for quality assessment.
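A rough sketch of what that evaluation step could look like (the rubric, the model choice, and the 1-10 scale are all placeholder assumptions):

```python
from openai import OpenAI

client = OpenAI()

def judge_response(user_msg: str, assistant_msg: str) -> int:
    """Score an assistant response with an LLM judge; rubric is illustrative."""
    prompt = (
        "Rate the assistant response below for correctness, completeness, and "
        "clarity on a 1-10 scale. Reply with only the integer.\n\n"
        f"User: {user_msg}\n\nAssistant: {assistant_msg}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())
```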

This analysis provided a baseline for understanding message patterns and trends, but further refinement is needed to track the quality of assistant responses effectively.
