INFO 230  ·  Cultural Analytics  ·  Spring 2026

Human vs. AI Text: A Cultural Analytics Approach

Maddie MacDonald  ·  UC Berkeley Statistics, MA

Can we distinguish human writing from machine-generated text, and if so, what does that difference actually look like?
Hi Professor Tim! The full analysis, code, and write-up live in the notebook. The README covers setup instructions and the repo structure, if helpful.

The short answer is yes, and the difference goes deeper than writing style. Using TF-IDF-based classification, a logistic regression model separated human from AI chunks with a macro F1 of 0.862. The most predictive features followed a clear pattern: AI writing consistently leaned on more abstract language, while human writing was grounded in specific dates, citations, and institutional vocabulary.
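To make the TF-IDF step concrete, here is a minimal sketch of how term weights can be computed. It uses only the Python standard library, whitespace tokenization, and a smoothed idf with L2 normalization. All of these are assumptions for illustration, not the notebook's actual settings, and the three toy chunks are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (an assumed variant): ln((1 + N) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy chunks, invented for illustration only.
chunks = [
    "the essay cited references from 2007",
    "the potential impact is significant",
    "ultimately the potential is significant",
]
vectors = tfidf(chunks)
```

In this toy example a term that appears in only one chunk, such as 2007, receives a higher weight than a term that appears in every chunk, such as the. A classifier can only pick up on distinctive vocabulary if the features weight it this way.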

Topic modeling with BERTopic and LDA across two chunking strategies confirmed the same pattern: human chunks concentrated in narrative and academically grounded topics, while AI chunks concentrated in generic essay-prompt territory. All four model/strategy combinations returned chi-square p-values essentially at zero, meaning the Human vs. AI distinction showed up no matter how the data was sliced.
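The chi-square test of independence behind those p-values can be sketched as follows. This is a standard-library illustration under stated assumptions: the counts are invented, the helper names are hypothetical, and the closed-form p-value shown is valid only for an even number of degrees of freedom. A library routine such as scipy.stats.chi2_contingency would normally handle the general case. Whether the notebook uses it is not stated here.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if topic and label were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def chi_square_p_value_even_dof(statistic, dof):
    """Upper-tail p-value; this closed form holds only for even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even dof")
    half = statistic / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))

# Rows: Human, AI. Columns: three hypothetical topics. Counts are invented.
counts = [
    [400, 80, 20],
    [60, 140, 300],
]
statistic, dof = chi_square_independence(counts)
p_value = chi_square_p_value_even_dof(statistic, dof)
```

When the two labels are spread across topics this unevenly, the statistic is large and the p-value is vanishingly small. That is the pattern the four tests reported.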

01

TF-IDF features alone carried strong signal. Logistic regression achieved a macro F1 of 0.862 and Naive Bayes an F1 of 0.816, both well above the 0.5 random-chance baseline.

02

AI writing reaches for abstraction. Words like potential, significant, additionally, and ultimately were the top predictors of AI authorship across every model tested.

03

Human writing is grounded in specifics. Dates like 2007, 2008, and 2009 and words like cited, references, and essay were the strongest predictors of human authorship.

04

The distinction held across all four model/strategy combinations. All chi-square tests returned p-values essentially at zero, meaning topic assignment was never independent of the Human vs. AI label.
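For readers unfamiliar with the macro F1 metric quoted in the first finding, here is a small worked sketch in plain Python. Macro F1 is the unweighted mean of the per-class F1 scores, so the human and AI classes count equally regardless of their sizes. The labels and predictions below are invented and the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)

# Invented toy labels: one mistake in each direction.
y_true = ["human", "human", "human", "ai", "ai", "ai"]
y_pred = ["human", "human", "ai", "ai", "ai", "human"]
score = macro_f1(y_true, y_pred)
```

In this toy case each class has two true positives, one false positive, and one false negative, so both per-class scores and their mean come out to 2/3.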

Human vs. LLM Text Corpus

788,922 texts from humans and 62 different LLMs. A reproducible 10,000-row sample was used for all analysis (RANDOM_SEED = 230). Kaggle, Zachary Grinberg, 2024.
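A seeded sample like the one described above could be drawn along these lines. This is a sketch under assumptions: it samples row indices with the standard library's random module, and the function name is hypothetical. The notebook's own sampling call is not shown here, so the rows selected by this sketch will not match the notebook's sample even with the same seed.

```python
import random

RANDOM_SEED = 230      # seed stated in the write-up
CORPUS_SIZE = 788_922  # number of texts in the full corpus
SAMPLE_SIZE = 10_000   # rows used for all analysis

def sample_row_indices(seed=RANDOM_SEED):
    """Draw SAMPLE_SIZE distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return sorted(rng.sample(range(CORPUS_SIZE), SAMPLE_SIZE))

indices = sample_row_indices()
```

Using a dedicated random.Random instance rather than the module-level functions keeps the draw reproducible even if other code consumes random numbers in between.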