I did a first pass over my ChatGPT history export. Here's a surface-level summary of what I found:
Content Types
Most messages are plain text. Assistant messages also include the code, thinking, and reasoning_recap content types; user messages are almost entirely text, with a small amount of multimodal_text.
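A rough sketch of how these counts can be pulled from the export, assuming the standard conversations.json layout (a list of conversations, each with a "mapping" of message nodes); the exact schema can vary between export versions:

```python
import json
from collections import Counter

# Load the export's conversations.json.
with open("conversations.json") as f:
    conversations = json.load(f)

# Count (role, content_type) pairs across all message nodes.
type_counts = Counter()
for conv in conversations:
    for node in conv["mapping"].values():
        msg = node.get("message")
        if not msg:
            continue
        role = msg["author"]["role"]
        content_type = msg["content"].get("content_type", "unknown")
        type_counts[(role, content_type)] += 1

for (role, content_type), n in type_counts.most_common():
    print(f"{role:10s} {content_type:20s} {n}")
```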
Conversation Length
Conversation length varies widely: a few conversations are far longer than the rest (e.g., "Genkit to Gemini Refactor" has the most messages).
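Per-conversation message counts fall out of the same structure; this sketch reuses the conversations list loaded above and simply counts non-empty message nodes per title:

```python
# Messages per conversation title (titles are not guaranteed unique).
msg_counts = {
    conv.get("title", "(untitled)"): sum(
        1 for node in conv["mapping"].values() if node.get("message")
    )
    for conv in conversations
}

# Longest conversations first.
for title, n in sorted(msg_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{n:4d}  {title}")
```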
Message Length (Word Count)
Assistant messages: ~236 words on average.
User messages: ~222 words on average, but with high variability, including some very long inputs.
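The word counts come from joining each message's text parts and splitting on whitespace; a minimal sketch under the same schema assumptions as above:

```python
from statistics import mean

# Collect per-message word counts for user and assistant messages.
words_by_role = {"user": [], "assistant": []}
for conv in conversations:
    for node in conv["mapping"].values():
        msg = node.get("message")
        if not msg or msg["author"]["role"] not in words_by_role:
            continue
        parts = msg["content"].get("parts") or []
        text = " ".join(p for p in parts if isinstance(p, str))
        if text.strip():
            words_by_role[msg["author"]["role"]].append(len(text.split()))

for role, counts in words_by_role.items():
    print(f"{role}: mean={mean(counts):.0f} words, n={len(counts)}")
```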
Temporal Trends (Monthly Aggregations)
The data spans Feb 2023 to May 2025. Over this period I tracked monthly averages for words per conversation, messages per month, type-token ratio (TTR), and subjectivity/objectivity, which shows how language style fluctuates over time.
User messages in particular show noticeable month-to-month variation in average word count, TTR, and subjectivity.
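A sketch of the monthly aggregation using pandas. The subjectivity score here uses TextBlob, which is an assumption on my part; any sentence-level subjectivity scorer could be substituted, and TTR is computed naively as unique tokens over total tokens per message:

```python
import pandas as pd
from textblob import TextBlob  # assumed subjectivity scorer

rows = []
for conv in conversations:
    for node in conv["mapping"].values():
        msg = node.get("message")
        if not msg or not msg.get("create_time"):
            continue
        parts = msg["content"].get("parts") or []
        text = " ".join(p for p in parts if isinstance(p, str)).strip()
        if not text:
            continue
        tokens = text.lower().split()
        rows.append({
            "role": msg["author"]["role"],
            "month": pd.to_datetime(msg["create_time"], unit="s").to_period("M"),
            "words": len(tokens),
            "ttr": len(set(tokens)) / len(tokens),
            "subjectivity": TextBlob(text).sentiment.subjectivity,
        })

df = pd.DataFrame(rows)
monthly = df[df.role == "user"].groupby("month")[["words", "ttr", "subjectivity"]].mean()
print(monthly)
```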
Next Steps
The goal is to track the quality of GPT responses over time, which this pass doesn't achieve yet. My plan is to use LLM-based evaluation, with an LLM acting as a judge that scores each assistant response.
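A minimal sketch of what that LLM-as-judge step might look like; the judge model and rubric below are placeholders rather than settled choices:

```python
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the assistant response to the user prompt on a 1-5 scale for "
    "correctness, relevance, and clarity. Reply with a single integer."
)

def score_response(user_prompt: str, assistant_response: str) -> int:
    # Ask a judge model to grade one prompt/response pair.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap for whatever you prefer
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Prompt:\n{user_prompt}\n\nResponse:\n{assistant_response}"},
        ],
    )
    return int(completion.choices[0].message.content.strip())
```

Scoring every assistant message this way, then averaging scores per month, would give the quality-over-time curve the rest of this analysis is missing.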
This analysis provided a baseline for understanding message patterns and trends, but further refinement is needed to track the quality of assistant responses effectively.