INFO 230 Mini-Project 1: Human vs. AI Text (A Cultural Analytics Approach)¶
This notebook walks through a full cultural analytics workflow applied to the Human vs. LLM Text Corpus from Kaggle. This is a dataset of 788,922 texts from humans and 62 different LLMs. The central research question that drove my analysis for this project is: can we distinguish human writing from machine-generated text, and if so, what does that difference actually look like?
Project Structure (Following bCourses instructions):¶
- Step 1: Load and sample the data
- Step 2: Preliminary corpus statistics (EDA)
- Step 3: Chunking strategies
- Step 4: Supervised classification (Logistic Regression and Naive Bayes)
- Step 5: Topic Modeling (BERTopic and LDA x 2 different chunking strategies)
- Step 6: Evaluation
- Step 7: Results, Discussion, Limitations, and Next Steps
Main Step 1: Load and Sample the Data¶
The full dataset ('data.csv') is about 800k rows, which is too large to load entirely for exploratory work. Therefore, the first step is to take a reproducible random sample of 10,000 rows, which is large enough to be statistically meaningful while keeping computation manageable and efficient.
The Kaggle dataset comes with three files:
data.csv: the main corpus (text, source/label, word count)distribution.csv: pre-computed stats across all 788k documents by sourceprompts.csv: the writing prompts used to generate the texts
Binary label decision: The original data has 63 unique sources (62 LLMs + human). Rather than doing multi-class classification across all 62 LLMs, everything is collapsed into a binary label: Human = 0, AI = 1. This choice was made because the cultural question that I'm more interested in exploring is whether we can tell human writing from machine writing, not whether we can distinguish AI models from each other (e.g., GPT-3.5 vs. Bloom). Additionally, using the binary labels is more statistically sound since many LLMs have very small sample sizes in comparison to others (e.g., some under 300 texts). This decision is further supported by the results of the code below, namely using binary labeling (Human vs. AI) is more balanced (~347K human texts. vs ~441K AI texts).
IMPORTANT NOTE: I will only push the sample data I create to the repository, the entire dataset is too large and will reduce overall efficiency.
import pandas as pd
import os
# Build path relative to this notebook's location
notebook_dir = os.path.dirname(os.path.abspath("__file__"))
data_dir = os.path.join(notebook_dir, '..', 'data')
# read in just the first 3 rows of each dataset to take a look
for fname in ['data.csv', 'distribution.csv', 'prompts.csv']:
df = pd.read_csv(os.path.join(data_dir, fname), nrows=3)
print(f"\n=== {fname} ===")
print(f"Columns: {df.columns.tolist()}")
print(df.head(3))
=== data.csv ===
Columns: ['text', 'source', 'prompt_id', 'text_length', 'word_count']
text source prompt_id \
0 Federal law supersedes state law, and cannabis... Bloom-7B 0
1 Miles feels restless after working all day. He... Bloom-7B 0
2 So first of I am danish. That means that I fol... Bloom-7B 0
text_length word_count
0 967 157
1 5068 778
2 1602 267
=== distribution.csv ===
Columns: ['Source', 'Number of Samples', 'Percentage of Total Data', 'Text Length Sum', 'Text Length Mean', 'Text Length Median', 'Text Length Std', 'Text Length Max', 'Text Length Min', 'Word Count Sum', 'Word Count Mean', 'Word Count Median', 'Word Count Std', 'Word Count Max', 'Word Count Min']
Source Number of Samples Percentage of Total Data \
0 Human 347692 44.0718%
1 GPT-3.5 52346 6.6351%
2 Text-Davinci-003 22860 2.8976%
Text Length Sum Text Length Mean Text Length Median Text Length Std \
0 1555649148 4474.216 2288 6989.088
1 147829489 2824.084 3290 1797.105
2 21012437 919.179 727 590.805
Text Length Max Text Length Min Word Count Sum Word Count Mean \
0 890119 110 246977688 710.335
1 23940 116 22633379 432.380
2 5313 116 3584391 156.798
Word Count Median Word Count Std Word Count Max Word Count Min
0 396 1003.481 71543 25
1 505 263.848 3565 25
2 121 98.634 822 25
=== prompts.csv ===
Columns: ['Prompt ID', 'Prompt']
Prompt ID Prompt
0 0 Undefined
1 1 Anything, be creative and make this about any ...
2 2 Does the electoral college work?
# Load full distribution to see all sources
dist_df = pd.read_csv(os.path.join(data_dir, 'distribution.csv'))
print(dist_df[['Source', 'Number of Samples', 'Percentage of Total Data']])
Source Number of Samples Percentage of Total Data 0 Human 347692 44.0718% 1 GPT-3.5 52346 6.6351% 2 Text-Davinci-003 22860 2.8976% 3 Text-Davinci-002 21436 2.7171% 4 OPT-1.3B 18467 2.3408% .. ... ... ... 58 Toppy-M-7B 433 0.0549% 59 LLaMA-2-7B 409 0.0518% 60 Dolphin-Mixtral-8x7B 407 0.0516% 61 Cohere-Command 390 0.0494% 62 Dolphin-2.5-Mixtral-8x7B 228 0.0289% [63 rows x 3 columns]
import numpy as np
# configuration step
# defining key parameters for sampling
SAMPLE_SIZE = 10000
RANDOM_SEED = 230 # setting a random seed for reproducibility (can be anything - INFO 230 - for fun!)
# using os path for reproducibility on another machine
notebook_dir = os.path.dirname(os.path.abspath("__file__"))
data_dir = os.path.join(notebook_dir, '..', 'data')
# First loading in full dataset then immediately taking a random sample to keep computation manageable
print("Loading data.csv...")
ai_human_df = pd.read_csv(os.path.join(data_dir, 'data.csv'))
print(f"Full dataset size: {len(ai_human_df):,} rows")
# Reproducible random sample
ai_human_df_sample = ai_human_df.sample(n=SAMPLE_SIZE, random_state=RANDOM_SEED).reset_index(drop=True)
print(f"Sample size: {len(ai_human_df_sample):,} rows")
# Create the binary label
# original data had 62+ unique LLM names plus 'human'
# binary classification simplify this to:
# Human = 0, all LLMs = 1
ai_human_df_sample['label'] = (ai_human_df_sample['source'] != 'Human').astype(int)
# save sample data as csv
ai_human_df_sample.to_csv(os.path.join(data_dir, 'ai_human_sample_10k.csv'), index=False)
print(f"\nLabel distribution in sample:")
print(ai_human_df_sample['label'].value_counts().rename({0: 'Human', 1: 'AI'}))
print(f"\nSample preview:")
ai_human_df_sample[['text', 'source', 'label', 'word_count']].head(3)
Loading data.csv... Full dataset size: 788,922 rows Sample size: 10,000 rows Label distribution in sample: label AI 5548 Human 4452 Name: count, dtype: int64 Sample preview:
| text | source | label | word_count | |
|---|---|---|---|---|
| 0 | Oxygen gas (O 2) can be toxic at elevated part... | Human | 0 | 105 |
| 1 | The Tortoise and the Rabbit: A Story of Determ... | Mistral-7B | 1 | 401 |
| 2 | When we write words from another language, lik... | Text-Davinci-003 | 1 | 68 |
Sanity check: Is this a representative sample?¶
Label distribution in the sample: 5,548 AI (55.5%) vs. 4,452 Human (44.5%), which is almost identical to the full dataset's split noted above, confirming the random sample is representative and classes are reasonably balanced.
Limitation note: The AI side of the binary label combines 62 very different models (i.e., from tiny models like Flan-T5-Small to large ones like GPT-4). This introduces variability into what "AI writing" means in this corpus, and will be discussed further in the limitations section.
Main Step 2: Preliminary Corpus Statistics¶
Before any modeling, it helps to understand what the corpus actually looks like through EDA (exploratory data analysis). Therefore, this next step involves computing basic summary statistics to investigate properties of the data such as document counts, label balance, and text length distributions (using both the 10k sample and distribution.csv for the full-dataset set).
The main thing to look for here is text length variability between Human and AI, since that will directly inform the chunking strategy in Step 3.
Part 1: Sample-level Statistics¶
import matplotlib.pyplot as plt
import seaborn as sns
# Part 1: Sample-level Stats
print("===CORPUS STATISTICS===")
# confirming sample size relative to the full 788k corpus
print(f"\n ---Document Counts---")
print(f"\t Full dataset: {len(ai_human_df):>10,} documents")
print(f"\t Working sample: {len(ai_human_df_sample):>10,} documents")
print(f"\t Sample fraction: {len(ai_human_df_sample)/len(ai_human_df)*100:.2f}% of full corpus")
print(f"\n ---Label Distribution (Sample)---")
label_counts = ai_human_df_sample['label'].value_counts().rename({0: 'Human', 1: 'AI'})
label_pcts = ai_human_df_sample['label'].value_counts(normalize=True).rename({0: 'Human', 1: 'AI'}) * 100
for label in ['Human', 'AI']:
print(f"\t {label:<8} {label_counts[label]:>6,} documents ({label_pcts[label]:.1f}%)")
print(f"\n ---Text Length Statistics (Sample, in words)---")
for label_name, label_val in [('Human', 0), ('AI', 1)]:
subset = ai_human_df_sample[ai_human_df_sample['label'] == label_val]['word_count']
print(f"\n {label_name}:")
print(f"\t Mean: {subset.mean():.1f} words")
print(f"\t Median: {subset.median():.1f} words")
print(f"\t Stdev: {subset.std():.1f} words")
print(f"\t Min: {subset.min()} words")
print(f"\t Max: {subset.max()} words")
# Approximate vocabulary: all unique whitespace-separated tokens across the sample
print(f"\n ---Vocabulary Size (Sample)---")
all_words = ai_human_df_sample['text'].str.lower().str.split().explode()
print(f"\t Unique tokens (raw): {all_words.nunique():,}")
===CORPUS STATISTICS=== ---Document Counts--- Full dataset: 788,922 documents Working sample: 10,000 documents Sample fraction: 1.27% of full corpus ---Label Distribution (Sample)--- Human 4,452 documents (44.5%) AI 5,548 documents (55.5%) ---Text Length Statistics (Sample, in words)--- Human: Mean: 700.8 words Median: 386.5 words Stdev: 1022.8 words Min: 25 words Max: 23392 words AI: Mean: 335.9 words Median: 283.0 words Stdev: 279.3 words Min: 25 words Max: 5019 words ---Vocabulary Size (Sample)--- Unique tokens (raw): 185,152
Part 2: Full dataset statistics (from distribution.csv)¶
The sample covers only 1.3% of the corpus. For a more complete picture, distribution.csv has pre-computed stats across all 788k documents separated by source.
# Part 2: Full Dataset Stats (pulled from distribution.csv)
# distribution.csv has pre-computed stats for all 788k documents
# can use this to give an accurate picture of the full corpus in the writeup
print(f"\n── Full Dataset Source Distribution (from distribution.csv) ──")
dist_df = pd.read_csv(os.path.join(data_dir, 'distribution.csv'))
print(dist_df[['Source', 'Number of Samples', 'Percentage of Total Data', 'Word Count Mean', 'Word Count Median']].to_string(index=False))
── Full Dataset Source Distribution (from distribution.csv) ──
Source Number of Samples Percentage of Total Data Word Count Mean Word Count Median
Human 347692 44.0718% 710.335 396.0
GPT-3.5 52346 6.6351% 432.380 505.0
Text-Davinci-003 22860 2.8976% 156.798 121.0
Text-Davinci-002 21436 2.7171% 159.292 107.0
OPT-1.3B 18467 2.3408% 251.045 133.0
OPT-30B 18055 2.2886% 223.312 129.0
Nous-Hermes-LLaMA-2-13B 12686 1.6080% 549.696 505.0
Mistral-7B 10439 1.3232% 374.923 380.0
PaLM-2 9510 1.2054% 419.990 413.0
OpenChat-3.5 9402 1.1918% 616.382 607.0
LLaMA-30B 9340 1.1839% 393.633 336.0
LLaMA-65B 9321 1.1815% 391.059 331.0
LLaMA-13B 9282 1.1765% 435.495 415.0
LLaMA-7B 9271 1.1751% 480.182 526.0
T0-3B 9219 1.1686% 57.816 42.0
Flan-T5-Base 9201 1.1663% 44.999 38.0
Flan-T5-Large 9164 1.1616% 46.985 40.0
Flan-T5-Small 9144 1.1590% 39.586 37.0
OPT-2.7B 9134 1.1578% 254.533 149.0
Flan-T5-XXL 9113 1.1551% 86.004 53.0
GLM-130B 9071 1.1498% 529.154 646.0
Flan-T5-XL 8986 1.1390% 51.224 43.0
GPT-4 8852 1.1220% 621.655 637.0
OPT-6.7B 8838 1.1203% 256.804 153.0
OPT-125M 8823 1.1184% 255.521 167.0
Bloom-7B 8812 1.1170% 325.537 236.5
OPT-350M 8747 1.1087% 326.137 220.0
T0-11B 8705 1.1034% 49.660 40.0
OPT-13B 8087 1.0251% 213.094 117.0
GPT-J 7580 0.9608% 470.693 394.0
Claude-Instant-v1 7147 0.9059% 453.450 447.0
GPT-NeoX 6821 0.8646% 140.098 106.0
MythoMax-L2-13B 6147 0.7792% 455.178 445.0
Unknown 6093 0.7723% 272.031 284.0
Neural-Chat-7B 5858 0.7425% 636.037 634.0
LZLV-70B 5143 0.6519% 437.645 447.0
LLaMA-2-70B 5000 0.6338% 422.705 420.0
Falcon-180B 4745 0.6015% 372.009 357.0
Psyfighter-13B 4375 0.5546% 560.545 563.0
StripedHyena-Nous-7B 3520 0.4462% 496.181 491.0
YI-34B 3520 0.4462% 664.171 649.0
Nous-Capybara-34B 3327 0.4217% 541.708 540.0
Nous-Capybara-7B 3204 0.4061% 588.157 558.0
Claude-v1 3158 0.4003% 388.771 388.0
Mistral-7B-OpenOrca 3058 0.3876% 545.278 542.0
Mixtral-8x7B 2865 0.3632% 691.358 569.0
Psyfighter-2-13B 2743 0.3477% 571.315 572.0
Noromaid-20B 1326 0.1681% 476.931 479.0
Text-Davinci-001 1120 0.1420% 245.464 236.0
Text-Curie-001 1008 0.1278% 240.240 225.0
Text-Babbage-001 875 0.1109% 208.003 194.0
Goliath-120B 734 0.0930% 255.871 258.0
Text-Ada-001 691 0.0876% 230.527 193.0
Nous-Hermes-LLaMA-2-70B 650 0.0824% 803.945 745.0
OpenHermes-2-Mistral-7B 623 0.0790% 546.400 541.0
Gemini-Pro 613 0.0777% 579.868 566.0
OpenHermes-2.5-Mistral-7B 612 0.0776% 570.420 566.0
RWKV-5-World-3B 496 0.0629% 391.734 383.0
Toppy-M-7B 433 0.0549% 616.993 598.0
LLaMA-2-7B 409 0.0518% 291.298 292.0
Dolphin-Mixtral-8x7B 407 0.0516% 628.459 592.0
Cohere-Command 390 0.0494% 333.787 312.5
Dolphin-2.5-Mixtral-8x7B 228 0.0289% 583.404 524.0
Part 3: EDA Visualizations¶
# Part 3: Visualizations
fig, axes = plt.subplots(1, 4, figsize=(22, 5))
fig.suptitle('Corpus Statistics: Human vs. AI Text', fontsize=14, fontweight='bold')
# Plot 1: Label distribution bar chart
label_counts.plot(kind='bar', ax=axes[0], color=['steelblue', 'tomato'], edgecolor='black', alpha=0.85)
axes[0].set_title('Label Distribution (Sample)')
axes[0].set_xlabel('Source')
axes[0].set_ylabel('Number of Documents')
axes[0].set_xticklabels(['Human', 'AI'], rotation=0)
# Loop over each bar and add the raw count as a label just above it
for i, v in enumerate(label_counts):
axes[0].text(i, v + 30, str(v), ha='center', fontweight='bold')
# Plot 2: Word count histogram -- Human only
cap = 2000 # trims extreme outliers (some human docs go up to 23k words) so histogram is readable
human_wc = ai_human_df_sample[ai_human_df_sample['label'] == 0]['word_count']
axes[1].hist(human_wc[human_wc <= cap], bins=40, color='steelblue', edgecolor='black', alpha=0.85)
axes[1].set_title('Human: Word Count Distribution')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].axvline(human_wc.median(), color='red', linestyle='--', label=f'Median: {human_wc.median():.0f}')
axes[1].legend()
# Plot 3: Word count histogram -- AI only
ai_wc = ai_human_df_sample[ai_human_df_sample['label'] == 1]['word_count']
axes[2].hist(ai_wc[ai_wc <= cap], bins=40, color='tomato', edgecolor='black', alpha=0.85)
axes[2].set_title('AI: Word Count Distribution')
axes[2].set_xlabel('Word Count')
axes[2].set_ylabel('Frequency')
axes[2].axvline(ai_wc.median(), color='navy', linestyle='--', label=f'Median: {ai_wc.median():.0f}')
axes[2].legend()
# Plot 4: Boxplot of word counts by label
ai_human_df_sample['label_name'] = ai_human_df_sample['label'].map({0: 'Human', 1: 'AI'}) # convert numeric label (0/1) to a readable string ('Human'/'AI')
ai_human_df_sample[ai_human_df_sample['word_count'] <= cap].boxplot(
column='word_count', by='label_name', ax=axes[3],
boxprops=dict(color='black'),
medianprops=dict(color='red', linewidth=2)
)
axes[3].set_title(f'Word Count Boxplot (capped at {cap})')
axes[3].set_xlabel('Source')
axes[3].set_ylabel('Word Count')
plt.suptitle('') # clears the default title that pandas boxplot auto-generates
plt.tight_layout()
plt.savefig('/Users/maddiemac/Desktop/INFO 230/cultural-analytics-project-1/figures/EDA_visuals')
plt.show()
What overall EDA now tells us:¶
Label balance is good: 44.5% Human / 55.5% AI in the sample mirrors the full dataset, confirming the sample is representative with no major class imbalance issues.
Text length is the main observation to note: Human text has a mean of 701 words but a very large standard deviation of 1,023, indicating that humans write very inconsistently in length (anywhere from 25-word responses to 23,000+ word essays). In contrast, AI text is much more stable with a mean 336 words, stdev only 279, and a lower max at about 5,000 words.
Visualizations make this contrast clear: From the histograms we can see that human text spreads gradually across the full range, while AI text has a dramatic spike in the 0-100 word range. Additionally, the boxplot confirms this i.e., the IQR for AI is narrower and lower than Human, with far fewer extreme outliers.
In general, these EDA findings are interesting at this point, but also more importantly are going to directly motivate the chunking strategy coming up next. For instance, if we had widely varied-length documents and fed them into a certain topic model, the longer documents would dominate merely because of size.
LLM source diversity note: The full dataset (part b) shows 62 LLMs with very different characteristics. For instance Flan-T5-Small averages just 40 words per output while Nous-Hermes-LLaMA-2-70B averages 804. This variety is worth keeping in mind throughout the analysis and will be discussed within the limitations section.
Main Step 3: Chunking Strategies¶
Goal of chunking: break the documents into units suitable for topic modeling.
Topic modeling works best when input documents are:
- (1) long enough to carry meaningful topical signal
- (2) consistent enough in size that no single chunk dominates the model.
Recall, given the extreme length variability shown in step 2, chunking strategy matters a lot here.
Two strategies are compared in the following steps:
- Strategy 1 (Document-Level): treat each document as its own chunk (no internal splitting)
- Strategy 2 (Fixed 200-Word Windows): split each document internally into non-overlapping 200-word chunks, discarding any final chunk shorter than 50 words.
Both are saved and carried forward in the following analyses upon choice and justification for which is more appropriate for the corpus and topic modeling goals. Note that Strategy 2 will be the primary strategy, but Strategy 1 is kept for the advanced topic modeling step later.
# Chunking Strategy 1: Document-level chunking
# simplest approach: each document is already a "chunk"
# add chunk_id column
chunks_s1 = ai_human_df_sample[['text', 'source', 'label', 'word_count']].copy()
chunks_s1['doc_id'] = chunks_s1.index
chunks_s1['chunk_id'] = 0 # only one chunk per document
print("===Strategy 1: Document-Level Chunking===")
print(f"\t Total chunks: {len(chunks_s1):,}")
print(f"\t Mean chunk size: {chunks_s1['word_count'].mean():.1f} words")
print(f"\t Median chunk size: {chunks_s1['word_count'].median():.1f} words")
print(f"\t Stdev chunk size: {chunks_s1['word_count'].std():.1f} words")
print(f"\t Min chunk size: {chunks_s1['word_count'].min()} words")
print(f"\t Max chunk size: {chunks_s1['word_count'].max()} words")
# Chunking Strategy 2: Fixed-length word window chunking
# each document is split into non-overlapping windows of CHUNK_SIZE words
# NOTE: the final chunk of each document may be shorter than CHUNK_SIZE
# (minimum is 50 words): otherwise discard the chunk to avoid tiny (low-signal) chunks
CHUNK_SIZE = 200 # words per chunk
MIN_CHUNK_WORDS = 50
def chunk_by_words(text, chunk_size=CHUNK_SIZE, min_words=MIN_CHUNK_WORDS):
"""
Split a text string into non-overlapping chunks of chunk_size words.
Discards the final chunk if it has fewer than min_words words.
Returns a list of chunk strings.
"""
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size):
chunk = ' '.join(words[i:i + chunk_size])
if len(chunk.split()) >= min_words:
chunks.append(chunk)
return chunks
# Apply chunking to every document in the sample (faster than appending rows to a DataFrame one at a time)
rows = []
for doc_id, row in ai_human_df_sample.iterrows():
doc_chunks = chunk_by_words(row['text'])
# inner loop: one iteration per chunk produced from this document
for chunk_idx, chunk_text in enumerate(doc_chunks):
rows.append({
'doc_id' : doc_id,
'chunk_id' : chunk_idx,
'text' : chunk_text,
'source' : row['source'],
'label' : row['label'],
'word_count': len(chunk_text.split())
})
chunks_s2 = pd.DataFrame(rows)
print("\n===Strategy 2: Fixed-Length Word Window Chunking===")
print(f"\t Chunk size setting: {CHUNK_SIZE} words (min: {MIN_CHUNK_WORDS})")
print(f"\t Total chunks: {len(chunks_s2):,}")
print(f"\t Mean chunk size: {chunks_s2['word_count'].mean():.1f} words")
print(f"\t Median chunk size: {chunks_s2['word_count'].median():.1f} words")
print(f"\t Stdev chunk size: {chunks_s2['word_count'].std():.1f} words")
print(f"\t Min chunk size: {chunks_s2['word_count'].min()} words")
print(f"\t Max chunk size: {chunks_s2['word_count'].max()} words")
# STRATEGY COMPARISON
# Recall: a good chunking strategy for topic modeling should produce chunks that are:
# (1) big enough (hundreds to thousands of chunks)
# (2) consistent in size (low std relative to mean)
# (3) long enough to carry topical signal (ideally 100-500 words)
# Use CV=stdev/mean (coefficient of variation) as a simple measure of size consistency
# lower CV = more uniform chunks
cv_s1 = chunks_s1['word_count'].std() / chunks_s1['word_count'].mean()
cv_s2 = chunks_s2['word_count'].std() / chunks_s2['word_count'].mean()
print("\n===STRATEGY COMPARISON===")
print(f"\n{'Metric':<30} {'Strategy 1':>15} {'Strategy 2':>15}")
print(f"{'Total chunks':<30} {len(chunks_s1):>15,} {len(chunks_s2):>15,}")
print(f"{'Mean chunk size (words)':<30} {chunks_s1['word_count'].mean():>15.1f} {chunks_s2['word_count'].mean():>15.1f}")
print(f"{'Stdev chunk size (words)':<30} {chunks_s1['word_count'].std():>15.1f} {chunks_s2['word_count'].std():>15.1f}")
print(f"{'Coefficient of Variation':<30} {cv_s1:>15.3f} {cv_s2:>15.3f}")
print(f"{'Min chunk size (words)':<30} {chunks_s1['word_count'].min():>15} {chunks_s2['word_count'].min():>15}")
print(f"{'Max chunk size (words)':<30} {chunks_s1['word_count'].max():>15} {chunks_s2['word_count'].max():>15}")
===Strategy 1: Document-Level Chunking=== Total chunks: 10,000 Mean chunk size: 498.4 words Median chunk size: 326.0 words Stdev chunk size: 736.1 words Min chunk size: 25 words Max chunk size: 23392 words ===Strategy 2: Fixed-Length Word Window Chunking=== Chunk size setting: 200 words (min: 50) Total chunks: 27,430 Mean chunk size: 178.6 words Median chunk size: 200.0 words Stdev chunk size: 42.5 words Min chunk size: 50 words Max chunk size: 200 words ===STRATEGY COMPARISON=== Metric Strategy 1 Strategy 2 Total chunks 10,000 27,430 Mean chunk size (words) 498.4 178.6 Stdev chunk size (words) 736.1 42.5 Coefficient of Variation 1.477 0.238 Min chunk size (words) 25 50 Max chunk size (words) 23392 200
# Visualize the comparison between chunking strategies
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Chunking Strategy Comparison: Word Count Distributions', fontsize=13, fontweight='bold')
cap = 2000
# Strategy 1 histogram
axes[0].hist(chunks_s1['word_count'].clip(upper=cap), bins=40, color='mediumpurple', edgecolor='black', alpha=0.85)
axes[0].axvline(chunks_s1['word_count'].median(), color='red', linestyle='--', label=f"Median: {chunks_s1['word_count'].median():.0f}")
axes[0].set_title(f'Strategy 1: Document-Level\n(n={len(chunks_s1):,} chunks, CV={cv_s1:.2f})')
axes[0].set_xlabel('Word Count')
axes[0].set_ylabel('Frequency')
axes[0].legend()
# Strategy 2 histogram
axes[1].hist(chunks_s2['word_count'].clip(upper=cap), bins=40, color='mediumseagreen', edgecolor='black', alpha=0.85)
axes[1].axvline(chunks_s2['word_count'].median(), color='red', linestyle='--', label=f"Median: {chunks_s2['word_count'].median():.0f}")
axes[1].set_title(f'Strategy 2: Fixed 200-Word Windows\n(n={len(chunks_s2):,} chunks, CV={cv_s2:.2f})')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].legend()
plt.tight_layout()
plt.savefig('/Users/maddiemac/Desktop/INFO 230/cultural-analytics-project-1/figures/chunking_strategies')
plt.show()
# prior to discussing which strategy is better
# save BOTH chunked datasets for later classification and topic modeling
# so its easy to load either strategy cleanly
chunks_s1.to_csv(os.path.join(data_dir, 'chunks_strategy1.csv'), index=False)
chunks_s2.to_csv(os.path.join(data_dir, 'chunks_strategy2.csv'), index=False)
print("\nSaved: chunks_strategy1.csv")
print("Saved: chunks_strategy2.csv")
Saved: chunks_strategy1.csv Saved: chunks_strategy2.csv
Chunking Strategy Decision: Why Strategy 2?¶
The numbers above make a clear case for the chunking strategy 2 choice:
Chunk count: Strategy 2 produces 27,430 chunks vs. Strategy 1's 10,000. This is about 2.7x more data points for the topic model to learn from more chunks which means more robust/stable topic distributions.
Size consistency (most critical): Strategy 1's coefficient of variation (CV) is 1.477 and the standard deviation is larger than the mean, indicating extreme variability. In contrast, Strategy 2's CV is 0.238, more than 6x lower, suggesting highly uniform chunk sizes. Therefore, since topic models assume documents come from similar length distributions, Strategy 1's variance would bias topics toward content from longer documents.
Range control: Strategy 1 chunks range from 25 to 23,392 words (a 935x difference). On that, a 25-word chunk carries almost no topical signal while a 23,392-word chunk would single-handedly dominate entire topics. In contrast, Strategy 2 is better since it keeps every chunk between 50 and 200 words (controlled, consistent, and still meaningful range).
Visualization Interpretation: Strategy 1 histogram is highly right-skewed with a wide spread across the full 0–2,000 word range, and that spike at 2,000 represents the many documents exceeding the cap (confirming extreme outliers). In contrast, Strategy 2's histogram shows an overwhelming spike at exactly 200 words (full chunks) with a small tail of shorter final chunks is exactly the behavior we are hoping for (uniform chunks, i.e., the tall vertical bar at 200 visually confirms how tightly controlled the chunk sizes are).
Final Conclusion: Chunking Strategy 2 is better for topic modeling.¶
NOTE: Strategy 1 is still used later in the advanced topic modeling section as a comparison to show how chunking choice affects model output.
Main Step 4: Supervised Classification¶
This data has pre-existing binary labels, created above (Human = 0, AI = 1). This means a supervised classifier can be trained to see if text patterns alone can predict which class a chunk belongs to. In other words, this is a binary classification task, as covered in course material, that separates the corpus into exactly two classes.
Input: Strategy 2 chunks (27,430 fixed 200-word windows)
Workflow:
- Vectorize text with TF-IDF
- TF-IDF (Term Frequency-Inverse Document Frequency): converts raw text into numerical features (i.e., words that are frequent in a document but rare across the full corpus get the highest weights)
- These are the most distinctive, informative words for classification
- 80/20 train/test split (stratified by label)
- 80% of chunks used for training, 20% held out for testing
- Train two classifiers from seminar: Logistic Regression and Naive Bayes
- Logistic Regression: used for categorical outcome variables (works well for text classification)
- fits a decision boundary in high-dimensional TF-IDF feature space
- Naive Bayes: A probabilistic classifier that treats each TF-IDF feature independently
- Despite the "naive" independence assumption, it performs surprisingly well on text data and is very fast to train
- Evaluate with F1 score, precision, recall, and confusion matrix
- F1 = harmonic mean of precision and recall from seminar that macro-averaged F1 treats both classes equally regardless of size
- Will also report per-class F1 to see if the model struggles more with one class than the other
- Interpret results
- (i) Visualize via confusion matrices: shows exactly where the model makes mistakes
- i.e., which class is it confusing for the other?
- rows = actual label, columns = predicted label, diagonal = correct predictions, off-diagonal = errors
- i.e., which class is it confusing for the other?
- (ii) Using top predictive features
- one advantage of Logistic Regression over Naive Bayes is interpretability (i.e., we can extract coefficients to see which words are most strongly associated with Human vs. AI writing)
- This directly informs the interpretation of WHY the model performs as it does
- Positive coefficients = predictive of AI (label=1)
- Negative coefficients = predictive of Human (label=0)
- This directly informs the interpretation of WHY the model performs as it does
- one advantage of Logistic Regression over Naive Bayes is interpretability (i.e., we can extract coefficients to see which words are most strongly associated with Human vs. AI writing)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import (classification_report, f1_score, ConfusionMatrixDisplay, confusion_matrix)
# (Workflow 1): Vectorize Text with TF-IDF
print("Vectorizing text with TF-IDF...")
tfidf = TfidfVectorizer(
max_features = 10000, # limits vocabulary to top 10k terms (efficiency)
min_df = 2, # ignore words appearing in only 1 chunk (likely noise)
max_df = 0.95, # ignore words in 95%+ of chunks (too common to be useful)
stop_words = 'english',
ngram_range = (1, 2) # ngram_range=(1,2): include both single words AND two-word phrases
)
X = tfidf.fit_transform(chunks_s2['text'])
y = chunks_s2['label']
print(f"\t TF-IDF matrix shape: {X.shape}")
print(f"\t Vocabulary size: {len(tfidf.vocabulary_):,} terms")
# (Workflow 2): Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size = 0.2,
random_state = RANDOM_SEED,
stratify = y # ensures both Human and AI are proportionally represented in both training and test sets (fair evaluation)
)
print(f"\nTrain/Test Split:")
print(f"\t Training chunks: {X_train.shape[0]:,}")
print(f"\t Test chunks: {X_test.shape[0]:,}")
print(f"\t Train label dist: {dict(y_train.value_counts().rename({0:'Human', 1:'AI'}))}")
print(f"\t Test label dist: {dict(y_test.value_counts().rename({0:'Human', 1:'AI'}))}")
# (Workflow 3): Train Classifiers
# Model 1: Logistic Regression
print("\nTraining Logistic Regression...")
lr = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED) # max_iter=1000 ensures the optimizer converges on our large feature space
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
# Model 2: Naive Bayes
print("Training Naive Bayes...")
nb = MultinomialNB() # MultinomialNB is specifically designed for count/frequency features like TF-IDF
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
# (Workflow 4): Evaluate: F1 Score & Classification Report
print("\n===Classification Results===")
for model_name, y_pred in [("Logistic Regression", y_pred_lr), ("Naïve Bayes", y_pred_nb)]:
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_human = f1_score(y_test, y_pred, average=None)[0]
f1_ai = f1_score(y_test, y_pred, average=None)[1]
print(f"\n ---{model_name}---")
print(f"\t Macro F1: {f1_macro:.4f}")
print(f"\t F1 (Human): {f1_human:.4f}")
print(f"\t F1 (AI): {f1_ai:.4f}")
print(f"\n\t Full Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Human', 'AI']))
# (Workflow 5 - part 1): Interpret Results by visualizing with confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Confusion Matrices: Human vs. AI Classification', fontsize=13, fontweight='bold')
for ax, y_pred, title in zip(axes, [y_pred_lr, y_pred_nb], ['Logistic Regression', 'Naïve Bayes']):
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Human', 'AI'])
disp.plot(ax=ax, colorbar=False, cmap='Blues')
ax.set_title(title)
plt.tight_layout()
plt.savefig('/Users/maddiemac/Desktop/INFO 230/cultural-analytics-project-1/figures/confusion_matrices')
plt.show()
# (Workflow 5 - part 2): Interpret Results through top predictive features (logistic regression)
feature_names = tfidf.get_feature_names_out()
coefs = lr.coef_[0]
top_n = 15
top_ai_idx = coefs.argsort()[-top_n:][::-1]
top_human_idx = coefs.argsort()[:top_n]
print("\n ---Top 15 Features Predictive of AI Writing---")
for i in top_ai_idx:
print(f"\t {feature_names[i]:<30} coef: {coefs[i]:.4f}")
print("\n ---Top 15 Features Predictive of Human Writing---")
for i in top_human_idx:
print(f"\t {feature_names[i]:<30} coef: {coefs[i]:.4f}")
Vectorizing text with TF-IDF...
TF-IDF matrix shape: (27430, 10000)
Vocabulary size: 10,000 terms
Train/Test Split:
Training chunks: 21,944
Test chunks: 5,486
Train label dist: {'Human': np.int64(13372), 'AI': np.int64(8572)}
Test label dist: {'Human': np.int64(3343), 'AI': np.int64(2143)}
Training Logistic Regression...
Training Naive Bayes...
===Classification Results===
---Logistic Regression---
Macro F1: 0.8617
F1 (Human): 0.8975
F1 (AI): 0.8260
Full Classification Report:
precision recall f1-score support
Human 0.87 0.93 0.90 3343
AI 0.87 0.78 0.83 2143
accuracy 0.87 5486
macro avg 0.87 0.86 0.86 5486
weighted avg 0.87 0.87 0.87 5486
---Naïve Bayes---
Macro F1: 0.8163
F1 (Human): 0.8636
F1 (AI): 0.7690
Full Classification Report:
precision recall f1-score support
Human 0.84 0.89 0.86 3343
AI 0.81 0.73 0.77 2143
accuracy 0.83 5486
macro avg 0.82 0.81 0.82 5486
weighted avg 0.83 0.83 0.83 5486
---Top 15 Features Predictive of AI Writing--- potential coef: 4.8588 including coef: 4.4700 significant coef: 3.6509 additionally coef: 3.3520 explore coef: 3.3417 substeps coef: 3.2471 lead coef: 3.2338 ultimately coef: 3.1907 complex coef: 3.1578 profound coef: 3.1359 impact coef: 3.1236 leading coef: 2.9941 challenges coef: 2.8166 known coef: 2.5934 traditional coef: 2.5857 ---Top 15 Features Predictive of Human Writing--- case coef: -4.0546 organization coef: -3.7293 2009 coef: -3.5122 web coef: -3.3551 essay coef: -3.2501 market coef: -3.2103 2008 coef: -3.1668 people coef: -3.1620 management coef: -3.0551 fact coef: -3.0159 2007 coef: -2.9717 references coef: -2.9604 situation coef: -2.9498 cited coef: -2.8255 quite coef: -2.7917
Interactive Feature Importance Chart¶
Adapted from course notebooks. Using Plotly for an interactive version of the top features chart: hover over each term to see the exact coefficient values.
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'notebook' # for working with interactive visuals in vscode
feature_names = tfidf.get_feature_names_out()
coefs = lr.coef_[0]
top_n = 15
# Get top AI and Human predictive terms
top_ai_idx = coefs.argsort()[-top_n:][::-1]
top_human_idx = coefs.argsort()[:top_n]
# Build a combined dataframe for plotting
import pandas as pd
feature_df = pd.DataFrame({
'term' : (list(feature_names[top_ai_idx]) + list(feature_names[top_human_idx])),
'coef' : (list(coefs[top_ai_idx]) + list(coefs[top_human_idx])),
'class' : (['AI'] * top_n + ['Human'] * top_n)
}).sort_values('coef')
fig = px.bar(
feature_df,
x = 'coef',
y = 'term',
color = 'class',
orientation = 'h',
color_discrete_map = {'AI': 'tomato', 'Human': 'steelblue'},
title = 'Top 15 Features Predictive of Human vs. AI Writing (Logistic Regression)',
labels = {'coef': 'Coefficient (positive = AI, negative = Human)', 'term': 'Term'}
)
fig.update_layout(
height = 700,
yaxis = {'categoryorder': 'total ascending'},
showlegend = True,
legend_title = 'Predicted Class'
)
plt.savefig('/Users/maddiemac/Desktop/INFO 230/cultural-analytics-project-1/figures/feature_importance_LR')
fig.show()
<Figure size 640x480 with 0 Axes>
Classification Results¶
Overall performance: Logistic Regression achieved macro F1 = 0.862 and Naive Bayes achieved = 0.816, both well above the 0.5 random chance baseline. TF-IDF features alone carry strong signal for distinguishing human from AI writing. Notably however, Logistic Regression outperforms Naive Bayes on every metric, which makes sense given Naive Bayes' independence assumption costing it some predictive power.
Why not perfect F1 score? An F1 of about 0.86 rather than about 0.9 is the expected and still meaningful result. To elaborate, this corpus combines 62 different LLMs with wildly different writing styles and some LLM outputs are genuinely indistinguishable from human writing at the word-frequency level, which is exactly what makes AI detection a hard (currently open) research problem. Therefore, the approximately 14% error rate reflects real linguistic overlap.
Why does the model do better on Human than AI? LR scores F1=0.90 on Human but only 0.83 on AI. This is because human writing's high variance in style and vocabulary (stdev ~1,023 words) actually makes it more distinctive as a class. In other words, its messiness and distinctive patterns are hard to confuse with AI.
Confusion matrix: The model's main weakness is AI chunks being misclassified as Human (463 for LR, 577 for NB), suggesting that some AI-generated text is convincing enough, in terms of style, to fool a word-frequency classifier.
Top features, the most culturally interesting part: AI writing is flagged by words like potential, significant, additionally, ultimately, challenges, profound, explore, which can be considered as common LLM vocabulary. Human writing instead is flagged by words like case, essay, cited, references, fact, quite, web and specific years like 2007-2009. This suggests that humans more often ground their writing in real dates and sources. Therefore, the broad difference between abstract-style writing (AI) vs. specificity-style writing (Human) is a central cultural finding of this project.
TF-IDF Weighted Word Clouds: Human vs. AI¶
This figure is adapted from course notebooks. Specifically, separate TF-IDF vectorizers are fit on Human and AI chunks so that word sizes reflect importance within a class, and not the full corpus-wide frequency.
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
def get_tfidf_weights(texts, max_features=3000):
"""
Fit a TF-IDF vectorizer on a list of texts and return a dictionary of {term: mean_tfidf_score} for word cloud generation.
"""
vec = TfidfVectorizer(
max_features = max_features,
stop_words = 'english',
min_df = 2
)
matrix = vec.fit_transform(texts)
names = vec.get_feature_names_out()
mean_scores = matrix.mean(axis=0).A1
return {names[i]: mean_scores[i] for i in range(len(names)) if mean_scores[i] > 0}
# Get texts for each class from Strategy 2 chunks
human_texts = chunks_s2[chunks_s2['label'] == 0]['text'].tolist()
ai_texts = chunks_s2[chunks_s2['label'] == 1]['text'].tolist()
print(f"Generating word clouds from {len(human_texts):,} Human chunks and {len(ai_texts):,} AI chunks...")
human_weights = get_tfidf_weights(human_texts)
ai_weights = get_tfidf_weights(ai_texts)
# Generate word clouds
wc_human = WordCloud(
width = 1200,
height = 600,
background_color = 'white',
max_words = 150,
colormap = 'Blues',
min_font_size = 10
).generate_from_frequencies(human_weights)
wc_ai = WordCloud(
width = 1200,
height = 600,
background_color = 'white',
max_words = 150,
colormap = 'Reds',
min_font_size = 10
).generate_from_frequencies(ai_weights)
# Display side by side
fig, axes = plt.subplots(1, 2, figsize=(20, 8))
axes[0].imshow(wc_human, interpolation='bilinear')
axes[0].axis('off')
axes[0].set_title('Human Writing — TF-IDF Weighted Word Cloud', fontsize=14, fontweight='bold', color='steelblue')
axes[1].imshow(wc_ai, interpolation='bilinear')
axes[1].axis('off')
axes[1].set_title('AI Writing — TF-IDF Weighted Word Cloud', fontsize=14, fontweight='bold', color='tomato')
plt.suptitle('What words define Human vs. AI writing?', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('/Users/maddiemac/Desktop/INFO 230/cultural-analytics-project-1/figures/wordcloud')
plt.show()
Generating word clouds from 16,715 Human chunks and 10,715 AI chunks...
Word Cloud Interpretation:¶
At first glance the two clouds look more similar than expected. Both share dominant words like people, world, time, life, students, work, new. However, this makes sense since both humans and LLMs were responding to the same prompts, so shared topical content confirms the dataset is well-controlled.
Human writing (blue) is characterized by concrete, specific, and institutional vocabulary: company, management, organization, market, research, government, years, case, essay, fact. In other words, humans anchor their writing in real-world structures and cite specific evidence. For example, you can even spot 2010 tucked in there, consistent with what the classifier found about humans referencing specific dates.
AI writing (red) leans toward more abstract vocabulary: technology, potential, development, individuals, significant, impact, challenges, digital, essential, complex. These words describe the world in broad terms rather than engaging with specific facts or institutions like humans do.
Primary takeaway: Human writing is grounded in specifics (real organizations, dates, and evidence). AI writing operates at a higher level of abstraction rather than concrete detail. Overall, this word cloud directly supports the top features found in the classifier.
Main Step 5: Topic Modeling (advanced option)¶
I will use 2 different models and 2 different chunking strategies (providing explanations for the strategies and choices below and exploring all 4 outputs).
Why BERTopic?
- BERTopic is the main topic modeling tool used in this course so far. It uses sentence-transformer embeddings to capture semantic meaning, then clusters documents using UMAP + HDBSCAN.
- Importantly, unlike bag-of-words models, BERTopic understands context, for instance "AI" and "artificial intelligence" are treated as related concepts.
Why LDA?
- Latent Dirichlet Allocation is the classical standard for topic modeling, treating each document as a mixture of topics and each topic as a distribution over words.
- This time, it operates on raw word counts (bag-of-words), making it a strong methodological contrast to BERTopic.
- On a positive note, LDA is fast, interpretable, and widely used as a baseline.
Why both chunking strategies?
- Strategy 1 (document-level): preserves full document context, fewer but richer chunks.
- Strategy 2 (200-word windows): uniform size, more chunks, better for models that assume consistent document length (especially LDA).
Main cultural question topic modeling will attempt to answer: do Human and AI texts cluster into meaningfully different topics, and does the answer change with different models or chunking strategies?¶
# Install required packages for topic modeling
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'notebook' # for working with interactive visuals in vscode
print("All topic modeling imports successful!")
All topic modeling imports successful!
# ===Prepare text inputs for all 4 model/strategy combinations===
# extract the raw text lists from each chunking strategy
# BERTopic works directly on raw text (handles vectorization internally)
# LDA requires a CountVectorizer step first (bag-of-words counts)
# Chunking Strategy 1: document-level chunks
texts_s1 = chunks_s1['text'].tolist()
labels_s1 = chunks_s1['label'].tolist()
# Chunking Strategy 2: fixed 200-word window chunks
texts_s2 = chunks_s2['text'].tolist()
labels_s2 = chunks_s2['label'].tolist()
print(f"Chunking Strategy 1: {len(texts_s1): } documents")
print(f"Chunking Strategy 2: {len(texts_s2): } documents")
# ===CountVectorizor for LDA===
# LDA requires integer word counts, not just raw text
# use the same parameters as TF-IDF vectorizer for consistency
# but with CountVectorizer (since LDA needs raw counts not TF-IDF weights)
print("\nFitting CountVectorizer for LDA:")
count_vec = CountVectorizer(
max_features= 5000, # keep vocabulary manageable for LDA
min_df= 5, # ignore very rare words (noise)
max_df= 0.95, # # ignore words in 95%+ of docs (too common)
stop_words= 'english'
)
# Fit on Chunking Strategy 2 (the primary chunking strategy from earlier)
# use .fit() on S2 and then .transform() so vocabulary is consistent
count_vec.fit(texts_s2)
dtm_s1 = count_vec.transform(texts_s1) # document-term matrix for S1
dtm_s2 = count_vec.transform(texts_s2) # document-term matrix for S1
print(f"\tTDM Chunking Strategy 1 shape: {dtm_s1.shape}")
print(f"\tTDM Chunking Strategy 2 shape: {dtm_s2.shape}")
print(f"\tVocabulary size: {len(count_vec.vocabulary_): } terms")
Chunking Strategy 1: 10000 documents Chunking Strategy 2: 27430 documents Fitting CountVectorizer for LDA: TDM Chunking Strategy 1 shape: (10000, 5000) TDM Chunking Strategy 2 shape: (27430, 5000) Vocabulary size: 5000 terms
Model 1a: BERTopic on Chunking Strategy 1 (Document-Level Chunks)¶
print("===Model 1a: BERTopic on Chunking Strategy 1 (document-level chunks)===")
print("Fitting BERTopic: note this may take a few minutes to complete")
representation_models = {
"KeyBERT" : KeyBERTInspired(), # uses keyword extraction to label topics meaningfully
"MMR": MaximalMarginalRelevance(diversity=0.3) # reduces redundancy in topic keywords
}
bertopic_s1 = BERTopic(
nr_topics=15, # ask for 15 topics (enough to find meaningful clusters)
min_topic_size=20, # topic must have at least 20 chunks to be valid (prevents tiny, noisy micro-topics)
representation_model=representation_models,
verbose=True # show progress (so I know this is running okay)
)
topics_bert_s1, probs_bert_s1 = bertopic_s1.fit_transform(texts_s1)
chunks_s1 = chunks_s1.copy()
chunks_s1['topic_bert'] = topics_bert_s1
# Explore output
topic_info_bert_s1 = bertopic_s1.get_topic_info()
print(f"\nTopics found: {len(topic_info_bert_s1[topic_info_bert_s1['Topic'] >= 0])}")
print(f"Outlier chunks (-1): {sum(t == -1 for t in topics_bert_s1):,}") # BERTopic assigns -1 (outlier) to chunks it can't confidently assign (still keep track of)
print(f"\nTopic summary (KeyBERT labels):")
label_cols = ['Topic', 'Count'] + [c for c in ['Name', 'KeyBERT', 'MMR'] if c in topic_info_bert_s1.columns]
print(topic_info_bert_s1[topic_info_bert_s1['Topic'] >= 0][label_cols].to_string(index=False))
# Save model for reuse (follow Professor Tim's method in notebooks)
bertopic_s1.save("bertopic_s1_model")
print("\nModel saved: bertopic_s1_model/")
2026-02-25 17:19:54,313 - BERTopic - Embedding - Transforming documents to embeddings.
===Model 1a: BERTopic on Chunking Strategy 1 (document-level chunks)=== Fitting BERTopic: note this may take a few minutes to complete
Batches: 0%| | 0/313 [00:00<?, ?it/s]
2026-02-25 17:20:45,219 - BERTopic - Embedding - Completed ✓ 2026-02-25 17:20:45,219 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm 2026-02-25 17:20:53,291 - BERTopic - Dimensionality - Completed ✓ 2026-02-25 17:20:53,292 - BERTopic - Cluster - Start clustering the reduced embeddings 2026-02-25 17:20:53,447 - BERTopic - Cluster - Completed ✓ 2026-02-25 17:20:53,447 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction. 2026-02-25 17:20:54,876 - BERTopic - Representation - Completed ✓ 2026-02-25 17:20:54,878 - BERTopic - Topic reduction - Reducing number of topics 2026-02-25 17:20:54,884 - BERTopic - Representation - Fine-tuning topics using representation models. 2026-02-25 17:20:58,391 - BERTopic - Representation - Completed ✓ 2026-02-25 17:20:58,414 - BERTopic - Topic reduction - Reduced number of topics from 78 to 15 2026-02-25 17:20:58,865 - BERTopic - WARNING: When you use `pickle` to save/load a BERTopic model,please make sure that the environments in which you saveand load the model are **exactly** the same. The version of BERTopic,its dependencies, and python need to remain the same.
Topics found: 14
Outlier chunks (-1): 3,255
Topic summary (KeyBERT labels):
Topic Count Name KeyBERT MMR
0 1859 0_the_to_of_and [education, organization, management, be, students, being, business, life, school, learning] [to, of, be, as, are, their, on, students, an, at]
1 1199 1_the_and_of_to [nature, characters, art, light, story, life, human, about, who, what] [of, to, her, she, his, with, time, their, be, our]
2 848 2_was_the_and_to [restaurant, food, pizza, chicken, delicious, place, cheese, service, bowl, get] [to, it, food, is, place, at, great, were, have, service]
3 606 3_the_of_and_to [sustainable, pollution, economic, driverless, climate, environmental, driving, cars, transportation, benefits] [of, are, cars, can, be, car, driverless, energy, water, their]
4 461 4_and_the_to_of [technology, technologies, communication, internet, social, privacy, intelligence, media, computer, digital] [to, technology, can, media, social, their, information, by, individuals, use]
5 438 5_the_of_and_to [nursing, nurses, patients, nurse, medical, patient, healthcare, health, hospital, ethical] [to, health, care, are, genetic, patients, or, medical, nursing, an]
6 262 6_we_the_of_to [learning, features, recognition, datasets, dataset, networks, neural, models, deep, data] [in, learning, neural, data, problem, models, features, deep, networks, algorithms]
7 259 7_the_to_in_and [sports, olympics, olympic, sport, soccer, football, athletes, ball, england, match] [of, on, at, games, his, sports, soccer, be, football, players]
8 240 8_the_of_to_in [offenders, crime, justice, criminal, punishment, court, police, law, cases, economic] [of, to, in, was, be, police, european, criminal, justice, eu]
9 220 9_your_you_the_to [skin, hair, wear, wash, face, using, wearing, dry, dress, use] [your, title, or, hair, step, can, substeps, be, skin, color]
10 129 10_venus_face_mars_the [mars, martian, geological, planets, landform, cydonia, face, surface, earth, earths] [venus, mars, planet, landform, earth, nasa, surface, aliens, geological, planets]
11 124 11_electoral_college_vote_president [electoral, electors, voting, elections, election, voters, elected, vote, electing, votes] [electoral, vote, president, election, system, candidate, voters, electors, would, elected]
12 74 12_the_in_of_to [antennas, antenna, wireless, communications, broadcasting, channel, radio, network, rf, streaming] [antenna, antennas, network, radio, energy, technology, channel, wncu, we, transmission]
13 26 13_bowl_super_broncos_the [nfl, seahawks, steelers, afc, colts, patriots, manning, broncos, quarterback, nfc] [bowl, broncos, nfl, yards, steelers, seahawks, afc, quarterback, 2016, manning]
Model saved: bertopic_s1_model/
Model 1b: BERTopic on Chunking Strategy 2 (200-Word Window Chunks)¶
Same process as Model 1a but applied to the fixed-length chunks. Note that with 27k chunks vs. 10k, the model has more data to find tighter clusters. Also, min_topic_size is made slightly higher (50 vs. 20) to account for the larger corpus.
print("===Model 1b: BERTopic on Chunking Strategy 2 (200-word window chunks)===")
print("Fitting BERTopic: note this may take a few minutes to complete")
bertopic_s2 = BERTopic(
nr_topics=15, # ask for 15 topics (enough to find meaningful clusters)
min_topic_size=50, # slightly higher to account for larger corpus
representation_model=representation_models,
verbose=True # show progress (so I know this is running okay)
)
topics_bert_s2, probs_bert_s2 = bertopic_s2.fit_transform(texts_s2)
chunks_s2 = chunks_s2.copy()
chunks_s2['topic_bert'] = topics_bert_s2
# Explore output
topic_info_bert_s2 = bertopic_s2.get_topic_info()
print(f"\nTopics found: {len(topic_info_bert_s2[topic_info_bert_s2['Topic'] >= 0])}")
print(f"Outlier chunks (-1): {sum(t == -1 for t in topics_bert_s2):,}") # BERTopic assigns -1 (outlier) to chunks it can't confidently assign (still keep track of)
print(f"\nTopic summary (KeyBERT labels):")
label_cols = ['Topic', 'Count'] + [c for c in ['Name', 'KeyBERT', 'MMR'] if c in topic_info_bert_s2.columns]
print(topic_info_bert_s2[topic_info_bert_s2['Topic'] >= 0][label_cols].to_string(index=False))
# Save model for reuse (follow Professor Tim's method in notebooks)
bertopic_s2.save("bertopic_s2_model")
print("\nModel saved: bertopic_s2_model/")
2026-02-25 17:22:09,478 - BERTopic - Embedding - Transforming documents to embeddings.
===Model 1b: BERTopic on Chunking Strategy 2 (200-word window chunks)=== Fitting BERTopic: note this may take a few minutes to complete
Batches: 0%| | 0/858 [00:00<?, ?it/s]
2026-02-25 17:24:44,410 - BERTopic - Embedding - Completed ✓ 2026-02-25 17:24:44,411 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm 2026-02-25 17:24:48,150 - BERTopic - Dimensionality - Completed ✓ 2026-02-25 17:24:48,155 - BERTopic - Cluster - Start clustering the reduced embeddings 2026-02-25 17:24:50,352 - BERTopic - Cluster - Completed ✓ 2026-02-25 17:24:50,353 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction. 2026-02-25 17:24:52,009 - BERTopic - Representation - Completed ✓ 2026-02-25 17:24:52,011 - BERTopic - Topic reduction - Reducing number of topics 2026-02-25 17:24:52,021 - BERTopic - Representation - Fine-tuning topics using representation models. 2026-02-25 17:24:54,758 - BERTopic - Representation - Completed ✓ 2026-02-25 17:24:54,761 - BERTopic - Topic reduction - Reduced number of topics from 105 to 15 2026-02-25 17:24:55,042 - BERTopic - WARNING: When you use `pickle` to save/load a BERTopic model,please make sure that the environments in which you saveand load the model are **exactly** the same. The version of BERTopic,its dependencies, and python need to remain the same.
Topics found: 14
Outlier chunks (-1): 10,799
Topic summary (KeyBERT labels):
Topic Count Name KeyBERT MMR
0 3537 0_the_of_and_to [art, culture, cultural, society, war, life, power, world, social, being] [of, to, as, by, their, be, his, were, an, which]
1 2631 1_and_the_to_of [technology, benefits, social, cars, transportation, public, car, information, health, media] [and, to, can, be, social, cars, their, technology, people, media]
2 2577 2_to_and_the_students [students, student, school, schools, education, activities, teachers, classes, children, life] [to, students, of, in, school, be, their, on, education, will]
3 2291 3_the_of_to_and [management, managers, organization, organizations, business, company, companies, development, employees, strategy] [to, in, be, management, their, company, employees, project, market, organization]
4 1849 4_was_the_she_and [room, she, very, good, had, food, her, what, you, didnt] [she, to, my, had, his, of, they, on, were, back]
5 913 5_the_of_to_and [ethical, genetic, animals, animal, rights, gene, research, health, medicine, diseases] [to, genetic, animals, health, ethical, testing, cancer, patients, research, disease]
6 818 6_the_of_and_is [cosmos, mars, nasa, space, universe, planets, earth, martian, planet, exploration] [our, venus, mars, universe, space, dreams, its, earth, planet, be]
7 557 7_the_to_in_and [football, players, soccer, league, sports, stadium, ball, sport, teams, england] [to, of, on, at, games, be, sports, soccer, season, league]
8 531 8_the_of_to_we [learning, networks, neural, recognition, algorithms, deep, training, models, network, computing] [we, data, network, networks, algorithm, model, wireless, learning, neural, problem]
9 306 9_electoral_college_vote_the [electoral, electors, voter, elections, voters, voting, election, elected, vote, votes] [electoral, vote, president, states, election, system, votes, voters, electors, voting]
10 189 10_your_you_to_and [wash, using, face, clean, skin, wear, use, tablespoon, disposable, dry] [your, skin, hair, fashion, use, step, dress, substeps, clothing, diaper]
11 176 11_light_the_of_is [lighting, light, photography, electricity, colors, wavelengths, energy, electrical, color, vision] [light, electrons, can, color, energy, on, be, circuit, colors, electricity]
12 139 12_oil_wastewater_water_the [refinery, oil, reservoirs, wastewater, fluids, viscosity, reservoir, fluid, refineries, hydrocarbon] [oil, wastewater, reservoir, gas, pressure, permeability, recovery, fluid, refinery, crude]
13 117 13_the_european_turkey_of [eu, eus, european, europe, constitutional, parliament, government, commission, court, ec] [european, turkey, eu, union, rights, fundamental, court, national, economic, treaty]
Model saved: bertopic_s2_model/
Model 2a: LDA on Chunking Strategy 1 (Document-Level Chunks)¶
Unlike BERTopic which uses neural embeddings, LDA treats each document as a bag of words. Note that n_components=15 matches BERTopic's topic count to ensure a fair comparison.
print("===Model 2a: LDA: Chunking Strategy 1 (document-level chunks)===")
print("Fitting LDA:")
lda_s1 = LatentDirichletAllocation(
n_components=15, # match BERTopic's topic count for fair comparison
max_iter= 20, # number of EM algorithm iterations (more = better but slower)
random_state=RANDOM_SEED, # reproducibility
learning_method='online', # more efficient for larger datasets than 'batch'
verbose=1
)
doc_topic_lda_s1 = lda_s1.fit_transform(dtm_s1)
chunks_s1['topic_lda'] = doc_topic_lda_s1.argmax(axis=1)
# print top words by topic
# for LDA topic labels are defined by the highest-probability words
vocab = count_vec.get_feature_names_out()
def get_lda_top_words(model, vocab, n_top=10):
"""Extract top n words for each LDA topic."""
topics = {}
for idx, topic in enumerate(model.components_):
top_words = [vocab[i] for i in topic.argsort()[-n_top:][::-1]]
topics[idx] = top_words
return topics
lda_s1_topics = get_lda_top_words(lda_s1, vocab)
print(f"\nLDA Topics (Strategy 1) — Top 10 words per topic:")
for topic_id, words in lda_s1_topics.items():
print(f"\tTopic {topic_id:>2}: {' | '.join(words)}")
===Model 2a: LDA: Chunking Strategy 1 (document-level chunks)=== Fitting LDA: iteration: 1 of max_iter: 20 iteration: 2 of max_iter: 20 iteration: 3 of max_iter: 20 iteration: 4 of max_iter: 20 iteration: 5 of max_iter: 20 iteration: 6 of max_iter: 20 iteration: 7 of max_iter: 20 iteration: 8 of max_iter: 20 iteration: 9 of max_iter: 20 iteration: 10 of max_iter: 20 iteration: 11 of max_iter: 20 iteration: 12 of max_iter: 20 iteration: 13 of max_iter: 20 iteration: 14 of max_iter: 20 iteration: 15 of max_iter: 20 iteration: 16 of max_iter: 20 iteration: 17 of max_iter: 20 iteration: 18 of max_iter: 20 iteration: 19 of max_iter: 20 iteration: 20 of max_iter: 20 LDA Topics (Strategy 1) — Top 10 words per topic: Topic 0: students | school | education | learning | children | community | help | time | student | work Topic 1: technology | human | world | time | new | potential | face | digital | space | earth Topic 2: people | individuals | social | individual | life | self | behavior | positive | personal | person Topic 3: government | states | war | political | united | people | world | country | new | american Topic 4: use | ethical | law | privacy | animals | animal | police | legal | rights | case Topic 5: cultural | women | people | culture | society | art | social | life | world | music Topic 6: health | care | patients | patient | medical | water | al | treatment | et | healthcare Topic 7: college | electoral | states | vote | president | popular | people | election | state | program Topic 8: like | just | time | people | don | know | good | said | think | want Topic 9: year | 000 | games | said | years | 10 | game | 12 | team | 15 Topic 10: car | cars | people | public | city | driverless | driving | air | transportation | traffic Topic 11: company | business | market | products | marketing | companies | customers | services | financial | industry Topic 12: research | data | used | study | based | information | use | analysis | different | using Topic 13: economic | energy | environmental | global | countries | resources | food | development | sustainable | impact Topic 14: management | employees | organization | project | work | performance | leadership | organizations | process | communication
Model 2b: LDA on Chunking Strategy 2 (200-word window chunks)¶
Same LDA process from Model 2a, this time applied to the fixed-length chunks. Note that LDA is expected to perform best here because it assumes uniform document length, which is exactly what Strategy 2 provides. This is the key reason LDA + Strategy 2 is expected to be the strongest LDA result overall.
print("===Model 2b: LDA: Chunking Strategy 2 (200-word window chunks)===")
print("Fitting LDA:")
lda_s2 = LatentDirichletAllocation(
n_components=15, # match BERTopic's topic count for fair comparison
max_iter= 20, # number of EM algorithm iterations (more = better but slower)
random_state=RANDOM_SEED, # reproducibility
learning_method='online', # more efficient for larger datasets than 'batch'
verbose=1
)
doc_topic_lda_s2 = lda_s2.fit_transform(dtm_s2)
chunks_s2['topic_lda'] = doc_topic_lda_s2.argmax(axis=1)
lda_s2_topics = get_lda_top_words(lda_s2, vocab)
print(f"\nLDA Topics (Strategy 2) — Top 10 words per topic:")
for topic_id, words in lda_s2_topics.items():
print(f"\tTopic {topic_id:>2}: {' | '.join(words)}")
===Model 2b: LDA: Chunking Strategy 2 (200-word window chunks)=== Fitting LDA: iteration: 1 of max_iter: 20 iteration: 2 of max_iter: 20 iteration: 3 of max_iter: 20 iteration: 4 of max_iter: 20 iteration: 5 of max_iter: 20 iteration: 6 of max_iter: 20 iteration: 7 of max_iter: 20 iteration: 8 of max_iter: 20 iteration: 9 of max_iter: 20 iteration: 10 of max_iter: 20 iteration: 11 of max_iter: 20 iteration: 12 of max_iter: 20 iteration: 13 of max_iter: 20 iteration: 14 of max_iter: 20 iteration: 15 of max_iter: 20 iteration: 16 of max_iter: 20 iteration: 17 of max_iter: 20 iteration: 18 of max_iter: 20 iteration: 19 of max_iter: 20 iteration: 20 of max_iter: 20 LDA Topics (Strategy 2) — Top 10 words per topic: Topic 0: time | year | said | years | day | began | music | old | took | man Topic 1: information | process | patients | development | care | theory | research | behavior | project | patient Topic 2: media | social | introduction | paper | essay | literature | people | works | history | book Topic 3: like | people | just | know | good | time | don | think | make | want Topic 4: health | potential | use | energy | environmental | public | urban | concerns | benefits | ethical Topic 5: government | country | countries | economic | political | law | state | rights | states | china Topic 6: women | water | al | et | body | woman | pressure | men | treatment | high Topic 7: students | school | education | learning | college | help | time | student | children | skills Topic 8: human | time | face | language | life | earth | space | nature | scientific | science Topic 9: world | social | cultural | individuals | people | new | technology | role | culture | impact Topic 10: people | united | war | cars | states | car | society | life | family | children Topic 11: company | market | business | services | products | companies | customers | marketing | industry | new Topic 12: data | analysis | study | research | used | case | problem | number | based | information Topic 13: organization | employees | management | work | performance | goals | needs | success | ensure | team Topic 14: web | new | 2011 | university | journal | management | york | 2012 | 2010 | leadership
Human vs. AI Topic Distribution: Interactive Heatmaps¶
The core visualization for this step will be seeing how human and AI chunks distribute across topics in each model/strategy combination. Note that you can hover over any cell in the heatmaps below to see exact proportion values.
Original code was adapted from course notebooks and then rebuilt to replace numeric labels with readable topic labels
# add a function to create readable labels
def get_bertopic_labels(model, style="KeyBERT", n_words=3):
topic_info = model.get_topic_info()
topic_info = topic_info[topic_info["Topic"] >= 0]
labels = {}
for _, row in topic_info.iterrows():
topic_id = row["Topic"]
if style in topic_info.columns and isinstance(row[style], list):
label = " | ".join(str(w) for w in row[style][:n_words])
else:
label = str(row["Name"])[:40]
labels[topic_id] = f"T{topic_id}: {label}"
return labels
bert_s1_labels = get_bertopic_labels(bertopic_s1, style="KeyBERT", n_words=3)
bert_s2_labels = get_bertopic_labels(bertopic_s2, style="KeyBERT", n_words=3)
# manually create LDA labels
lda_s1_labels = {
0: "T0: Education & School", 1: "T1: Technology & Space",
2: "T2: Social & Personal", 3: "T3: Government & War",
4: "T4: Law & Ethics", 5: "T5: Culture & Art",
6: "T6: Health & Medicine", 7: "T7: Electoral College",
8: "T8: Casual Conversation", 9: "T9: Sports & Numbers",
10: "T10: Cars & Transport", 11: "T11: Business & Market",
12: "T12: Research & Data", 13: "T13: Environment & Energy",
14: "T14: Management & Work"
}
lda_s2_labels = {
0: "T0: Narrative & Time", 1: "T1: Research & Patients",
2: "T2: Media & Academic", 3: "T3: Casual Conversation",
4: "T4: Health & Environment", 5: "T5: Government & Politics",
6: "T6: Women & Medicine", 7: "T7: Education & School",
8: "T8: Science & Nature", 9: "T9: Culture & Technology",
10: "T10: Society & Family", 11: "T11: Business & Market",
12: "T12: Data & Research", 13: "T13: Management & Teams",
14: "T14: Academic Citations"
}
def build_label_topic_matrix(chunks_df, topic_col, topic_labels):
"""
Build a normalized matrix showing how Human vs. AI chunks distribute across topics.
Rows = label (Human/AI), Columns = topic.
Values = proportion of that label's chunks in each topic.
"""
df = chunks_df[chunks_df[topic_col] >= 0].copy()
df['label_name'] = df['label'].map({0: 'Human', 1: 'AI'})
matrix = (
df.groupby(['label_name', topic_col])
.size()
.unstack(fill_value=0)
)
# Normalize each row so values = proportion (sum to 1 per label)
matrix_norm = matrix.div(matrix.sum(axis=1), axis=0)
matrix_norm.columns = [
topic_labels.get(col, f"T{col}: Unknown") for col in matrix_norm.columns
]
return matrix_norm
# Build all 4 matrices
mat_bert_s1_labeled = build_label_topic_matrix(chunks_s1, "topic_bert", bert_s1_labels)
mat_bert_s2_labeled = build_label_topic_matrix(chunks_s2, "topic_bert", bert_s2_labels)
mat_lda_s1_labeled = build_label_topic_matrix(chunks_s1, "topic_lda", lda_s1_labels)
mat_lda_s2_labeled = build_label_topic_matrix(chunks_s2, "topic_lda", lda_s2_labels)
# ===Interactive Plotly Heatmaps===
# hover over each cell to see exact proportion values
for mat, title in [
(mat_bert_s1_labeled, "BERTopic: Strategy 1 (Document-Level)"),
(mat_bert_s2_labeled, "BERTopic: Strategy 2 (200-Word Windows)"),
(mat_lda_s1_labeled, "LDA: Strategy 1 (Document-Level)"),
(mat_lda_s2_labeled, "LDA: Strategy 2 (200-Word Windows)")
]:
# Add topic labels for BERTopic versions
# col_labels = list(mat.columns)
fig = px.imshow(
mat.values,
x = list(mat.columns),
y = list(mat.index),
color_continuous_scale = 'YlOrRd',
title = f'Human vs. AI Topic Distribution: {title}',
labels = dict(color='Proportion of chunks'),
aspect = 'auto'
)
fig.update_layout(
height = 400,
xaxis_title = 'Topic',
yaxis_title = 'Label'
)
fig.show()
Save the chunked files with topic assignments¶
- Both chunking strategy dataframes now have topic columns added
- Below will save these for the later evaluation section
chunks_s1.to_csv(os.path.join(data_dir, 'chunks_s1_with_topics.csv'), index=False)
chunks_s2.to_csv(os.path.join(data_dir, 'chunks_s2_with_topics.csv'), index=False)
print("Saved: chunks_s1_with_topics.csv")
print("Saved: chunks_s2_with_topics.csv")
print(f"\nchunks_s1 columns: {list(chunks_s1.columns)}")
print(f"chunks_s2 columns: {list(chunks_s2.columns)}")
Saved: chunks_s1_with_topics.csv Saved: chunks_s2_with_topics.csv chunks_s1 columns: ['text', 'source', 'label', 'word_count', 'doc_id', 'chunk_id', 'topic_bert', 'topic_lda'] chunks_s2 columns: ['doc_id', 'chunk_id', 'text', 'source', 'label', 'word_count', 'topic_bert', 'topic_lda']
Additional BERTopic Visualization¶
BERTopic has built-in interactive visualizations for exploring the topic space that we saw in seminar: the topic word barchart is included below.
Note on the barcharts: Each bar represents a word's c-TF-IDF score for that topic, i.e., how characteristic a word is for that topic compared to all other topics (a long bar means that word is especially distinctive here, not just common overall). We can think of this visual as a more detailed look under the hood of the KeyBERT labels. For instance, if the label says "electoral college" and the top bars are electoral, college, votes, states, that's a good sign the label actually captures what's in the topic.
# Visualize top topics as bar chart (Model 1a)
bertopic_s1.visualize_barchart(top_n_topics=15)
# Visualize top topics as bar chart (Model 2a)
bertopic_s2.visualize_barchart(top_n_topics=15)
Topic Modeling Results: Discussion¶
Consistent themes across all four combinations: education, business/management, technology, healthcare, politics/government, environment, and personal/narrative writing. The fact that these clusters appear regardless of model or chunking strategy confirms they reflect real structure in the data, not just results of modeling choices.
BERTopic produces more interpretable topics. KeyBERT labels like electoral | electors | voting | elections or healthcare | medical | patients | nursing are immediately readable. LDA's top words require more mental work, and some topics are too broad to be meaningfully named.
- A notable BERTopic difference: Strategy 2 produced 10,594 outlier chunks (-1) vs. Strategy 1's 2,969. BERTopic's clustering gets less confident with shorter uniform chunks, which is a real tradeoff worth noting.
- A surprising BERTopic finding: Topic 8 in S1 (hair, skin, beauty tutorials) and Topic 12 in S2 (step-by-step instructions with "substeps") suggests that some LLM prompts asked for how-to content, which clusters completely separately from other types of writing.
LDA Strategy 2 outperforms Strategy 1. LDA (chunking strategy 2) Topic 14 (web | 2011 | university | journal | years) captures academic citation style almost perfectly, which is exactly what the classifier flagged as predictive of human writing. On the other hand, LDA (chunking strategy 1) misses this entirely. Additionally, LDA (chunking strategy 2) also surfaces the most human-specific patterns, for instance: Topic 3 (casual conversational register) and Topic 0 (narrative storytelling) are distinctly human flags.
Heatmap findings: BERTopic (chunking strategy 2) and LDA (chunking strategy 2) both show clear Human vs. AI separation. Specifically, AI chunks concentrate in abstract essay-prompt topics (education, business, technology, healthcare), while human chunks spread more into personal/narrative and historically grounded topics, as well as show stronger presence in academic citation topics. Overall the suggested pattern is consistent: human writing is more academically grounded and contextually specific; AI writing is more abstract.
Barchart Interpretation: A few things stand out when reading these two charts together.
Strategy 1 (document-level) has a "stop word" problem. Most topics in the first chart are topped by the, and, of, to, in, i.e., words that carry no topical meaning at all. This is a direct consequence of the length inconsistency discussed in Step 3 (when some documents are thousands of words long, generic function words end up dominating the topic representation because they appear so frequently in the long documents). However, Topics 10 (venus, face, mars), 11 (electoral, college, vote), and 13 (bowl, super, broncos) are the exceptions, they have genuinely distinctive vocabulary, but they're the minority.
Strategy 2 (200-word windows) is meaningfully cleaner. More topics show distinctive content words at the top rather than stop words. For instance, Topic 9 (electoral, college, vote, president) and Topic 12 (oil, wastewater, water, reservoir) are sharp and immediately readable. Additionally, Topic 10 (your, you) we can see likely captures the second-person language we noted as a human-specific pattern in the heatmap discussion.
Overall, this is actually a nice visual confirmation of the chunking decision from Step 3, i.e., the same uniform chunk sizes that improved topic separation in the heatmaps also produce cleaner, more informative topic word distributions here. In fact, the Strategy 1 stop word dominance is exactly the kind of bias toward long-document content that fixed-length chunking was designed to prevent.
Chunking strategy effect: Strategy 2 consistently reveals more differentiation between Human and AI than Strategy 1 across both models. The reason for this is that document-level chunks are too mixed, i.e., a single long essay might cover politics, narrative, and economics at once. In contrast, fixed 200-word windows force the model to assign a dominant topic to a focused slice of text, making differences sharper, more visible, and thus better for topic modeling.
Main Step 6: Evaluation ¶
- Key Question: How can I evaluate the usefulness of these models?
Since topic modeling is unsupervised, there is no single correct answer to check against. Therefore, evaluation here uses four approaches built from tools already used in this project (sklearn/scipy).
- Topic Interpretability: qualitative side-by-side comparison of BERTopic vs LDA labels
- No metric needed: labels speak for themselves
- Topic Diversity: proportion of unique words across all top-word lists
- higher proportion = topics are more distinct from each other
- Metric: a model finding if the same topic repeatedly scores low
- Topic Distinctiveness: Determine how similar the topics are to each other in word-space
- low average similarity = topics are well separated
- Metric: cosine similarity between topics (sklearn)
- Human vs. AI Topic Separation: chi-square test of independence between topic assignment and label (scipy)
- Chi-square hypotheses:
- H0 (null): topic assignment and Human/AI label are independent
- H1 (alternative): topic assignment and Human/AI label are NOT independent
- Significant result (p < 0.05): reject the null at significance level of 0.05 (confidence level of 95%), meaning the model found structure that supports Human/AI distinction.
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import chi2_contingency
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# EVALUATION 1: Topic Interpretability Comparison
# Side-by-side comparison of BERTopic KeyBERT labels vs LDA top words for both chunking strategies
print("===Evaluation 1: Topic Interpretability (BERTopic vs LDA)===")
def get_bertopic_label_list(model, n_words=5):
"""Get list of (topic_id, label_string) from BERTopic KeyBERT labels."""
topic_info = model.get_topic_info()
topic_info = topic_info[topic_info['Topic'] >= 0]
results = []
for _, row in topic_info.iterrows():
tid = row['Topic']
if 'KeyBERT' in topic_info.columns and isinstance(row['KeyBERT'], list):
label = ' | '.join(str(w) for w in row['KeyBERT'][:n_words])
else:
label = str(row['Name'])[:50]
results.append((tid, label))
return results
def get_lda_label_list(model, vocab, n_words=5):
"""Get list of (topic_id, label_string) from LDA top words."""
results = []
for idx, topic in enumerate(model.components_):
top_words = [vocab[i] for i in topic.argsort()[-n_words:][::-1]]
results.append((idx, ' | '.join(top_words)))
return results
vocab_list = count_vec.get_feature_names_out()
bert_s1_labels_list = get_bertopic_label_list(bertopic_s1)
bert_s2_labels_list = get_bertopic_label_list(bertopic_s2)
lda_s1_labels_list = get_lda_label_list(lda_s1, vocab_list)
lda_s2_labels_list = get_lda_label_list(lda_s2, vocab_list)
# Build comparison dataframes
print("\n---Strategy 1 (Document-Level): BERTopic vs LDA---")
max_len = max(len(bert_s1_labels_list), len(lda_s1_labels_list))
comp_s1 = pd.DataFrame({
'Topic' : [f"T{i}" for i in range(max_len)],
'BERTopic (KeyBERT)' : [l for _, l in bert_s1_labels_list] + ['—'] * (max_len - len(bert_s1_labels_list)),
'LDA (Top Words)' : [l for _, l in lda_s1_labels_list] + ['—'] * (max_len - len(lda_s1_labels_list))
})
print(comp_s1.to_string(index=False))
print("\n---Strategy 2 (200-Word Windows): BERTopic vs LDA---")
max_len = max(len(bert_s2_labels_list), len(lda_s2_labels_list))
comp_s2 = pd.DataFrame({
'Topic' : [f"T{i}" for i in range(max_len)],
'BERTopic (KeyBERT)' : [l for _, l in bert_s2_labels_list] + ['—'] * (max_len - len(bert_s2_labels_list)),
'LDA (Top Words)' : [l for _, l in lda_s2_labels_list] + ['—'] * (max_len - len(lda_s2_labels_list))
})
print(comp_s2.to_string(index=False))
# EVALUATION 2: Topic Diversity
# Proportion of unique words across all topic top-word lists (same words keep appearing across topics, then diversity is low)
# use top 10 words per topic for this calculation \
print("===Evaluation 2: Topic Diversity===")
def compute_diversity(topic_words_list):
"""
Compute topic diversity: proportion of unique words across all topics.
topic_words_list: list of lists of top words per topic.
Range: 0 (all topics identical) to 1 (no word shared across topics).
"""
all_words = [w for topic in topic_words_list for w in topic]
unique_words = set(all_words)
return len(unique_words) / len(all_words) if all_words else 0.0
# Extract top 10 words per topic as lists of lists
def bertopic_words_list(model, n=10):
topic_info = model.get_topic_info()
topic_ids = topic_info[topic_info['Topic'] >= 0]['Topic'].tolist()
return [[w for w, _ in model.get_topic(tid)][:n] for tid in topic_ids]
def lda_words_list(model, vocab, n=10):
return [[vocab[i] for i in topic.argsort()[-n:][::-1]]
for topic in model.components_]
bert_s1_wl = bertopic_words_list(bertopic_s1)
bert_s2_wl = bertopic_words_list(bertopic_s2)
lda_s1_wl = lda_words_list(lda_s1, vocab_list)
lda_s2_wl = lda_words_list(lda_s2, vocab_list)
div_bert_s1 = compute_diversity(bert_s1_wl)
div_bert_s2 = compute_diversity(bert_s2_wl)
div_lda_s1 = compute_diversity(lda_s1_wl)
div_lda_s2 = compute_diversity(lda_s2_wl)
print(f"\nBERTopic S1 (Doc-Level) diversity: {div_bert_s1:.3f}")
print(f"\tBERTopic S2 (200-Word) diversity: {div_bert_s2:.3f}")
print(f"\tLDA S1 (Doc-Level) diversity: {div_lda_s1:.3f}")
print(f"\tLDA S2 (200-Word) diversity: {div_lda_s2:.3f}")
# EVALUATION 3: Topic Distinctiveness via Cosine Similarity
# Represent each topic as a TF-IDF vector over its top words then compute pairwise cosine similarity between all topics
# Low average similarity = topics are well separated from each other
# High average similarity = topics are redundant / overlapping
print("===Evaluation 3: Topic Distinctiveness (Cosine Similarity)===")
def topic_similarity_matrix(topic_words_list, vectorizer):
"""
Build a cosine similarity matrix between topics.
Each topic is represented as a TF-IDF vector of its top words joined as text.
Returns the matrix and the mean off-diagonal similarity.
"""
# Join each topic's words into a pseudo-document
topic_docs = [' '.join(words) for words in topic_words_list]
# Vectorize using existing fitted TF-IDF vectorizer
# (transform only: vocab already fitted on corpus)
vecs = vectorizer.transform(topic_docs)
sim_matrix = cosine_similarity(vecs)
# Mean off-diagonal similarity (excluding self-similarity on diagonal)
n = sim_matrix.shape[0]
mask = ~np.eye(n, dtype=bool)
mean_sim = sim_matrix[mask].mean()
return sim_matrix, mean_sim
sim_bert_s1, mean_bert_s1 = topic_similarity_matrix(bert_s1_wl, tfidf)
sim_bert_s2, mean_bert_s2 = topic_similarity_matrix(bert_s2_wl, tfidf)
sim_lda_s1, mean_lda_s1 = topic_similarity_matrix(lda_s1_wl, tfidf)
sim_lda_s2, mean_lda_s2 = topic_similarity_matrix(lda_s2_wl, tfidf)
print(f"\nMean inter-topic similarity (lower = more distinct topics):")
print(f"\tBERTopic S1 (Doc-Level): {mean_bert_s1:.4f}")
print(f"\tBERTopic S2 (200-Word): {mean_bert_s2:.4f}")
print(f"\tLDA S1 (Doc-Level): {mean_lda_s1:.4f}")
print(f"\tLDA S2 (200-Word): {mean_lda_s2:.4f}")
# Visualize similarity matrices as heatmaps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
fig.suptitle('Topic Similarity Matrices\n(darker = more similar topics, ''lower avg = more distinct)', fontsize=13, fontweight='bold')
for ax, sim, title, mean in zip(
axes.flatten(),
[sim_bert_s1, sim_bert_s2, sim_lda_s1, sim_lda_s2],
['BERTopic: S1 (Doc-Level)', 'BERTopic: S2 (200-Word)',
'LDA: S1 (Doc-Level)', 'LDA: S2 (200-Word)'],
[mean_bert_s1, mean_bert_s2, mean_lda_s1, mean_lda_s2]
):
sns.heatmap(sim, ax=ax, cmap='YlOrRd', vmin=0, vmax=1, cbar_kws={'label': 'Cosine Similarity'})
ax.set_title(f'{title}\nMean inter-topic similarity: {mean:.4f}')
ax.set_xlabel('Topic')
ax.set_ylabel('Topic')
plt.tight_layout()
# plt.figsave()
plt.show()
# EVALUATION 4: Human vs. AI Topic Separation (Chi-Square Test)
# Using Chi-Square test for independence
print("===Evaluation 4: Human vs. AI Topic Separation (Chi-Square Test)===")
def chi_square_topic_label(chunks_df, topic_col):
"""
Run chi-square test of independence between topic assignment and Human/AI label.
Returns chi2 statistic, p-value, and contingency table.
Excludes outlier topic -1.
"""
df = chunks_df[chunks_df[topic_col] >= 0].copy()
contingency = pd.crosstab(df['label'], df[topic_col])
chi2, p, dof, expected = chi2_contingency(contingency)
return chi2, p, dof, contingency
for label, chunks_df, topic_col in [
("BERTopic: S1 (Doc-Level)", chunks_s1, 'topic_bert'),
("BERTopic: S2 (200-Word)", chunks_s2, 'topic_bert'),
("LDA: S1 (Doc-Level)", chunks_s1, 'topic_lda'),
("LDA: S2 (200-Word)", chunks_s2, 'topic_lda'),
]:
chi2, p, dof, contingency = chi_square_topic_label(chunks_df, topic_col)
sig = "✓ SIGNIFICANT" if p < 0.05 else "✗ not significant"
print(f"\n {label}")
print(f"\t Chi-square: {chi2:.2f} | df: {dof} | p-value: {p:.2e} → {sig}")
# SUMMARY TABLE
print("Evaluation Summary Table")
_, p_bert_s1, _, _ = chi_square_topic_label(chunks_s1, 'topic_bert')
_, p_bert_s2, _, _ = chi_square_topic_label(chunks_s2, 'topic_bert')
_, p_lda_s1, _, _ = chi_square_topic_label(chunks_s1, 'topic_lda')
_, p_lda_s2, _, _ = chi_square_topic_label(chunks_s2, 'topic_lda')
summary_df = pd.DataFrame({
'Model' : ['BERTopic', 'BERTopic', 'LDA', 'LDA'],
'Strategy' : ['S1 (Doc)', 'S2 (200w)', 'S1 (Doc)', 'S2 (200w)'],
'N Topics' : [14, 14, 15, 15],
'Diversity' : [round(div_bert_s1, 3), round(div_bert_s2, 3), round(div_lda_s1, 3), round(div_lda_s2, 3)],
'Avg Similarity' : [round(mean_bert_s1, 4), round(mean_bert_s2, 4), round(mean_lda_s1, 4), round(mean_lda_s2, 4)],
'Chi-sq p-value' : [f"{p_bert_s1:.2e}", f"{p_bert_s2:.2e}", f"{p_lda_s1:.2e}", f"{p_lda_s2:.2e}"],
'Label Quality' : ['High (KeyBERT)', 'High (KeyBERT)', 'Medium (top words)', 'Medium (top words)']
})
print(summary_df.to_string(index=False))
===Evaluation 1: Topic Interpretability (BERTopic vs LDA)=== ---Strategy 1 (Document-Level): BERTopic vs LDA--- Topic BERTopic (KeyBERT) LDA (Top Words) T0 education | organization | management | be | students students | school | education | learning | children T1 nature | characters | art | light | story technology | human | world | time | new T2 restaurant | food | pizza | chicken | delicious people | individuals | social | individual | life T3 sustainable | pollution | economic | driverless | climate government | states | war | political | united T4 technology | technologies | communication | internet | social use | ethical | law | privacy | animals T5 nursing | nurses | patients | nurse | medical cultural | women | people | culture | society T6 learning | features | recognition | datasets | dataset health | care | patients | patient | medical T7 sports | olympics | olympic | sport | soccer college | electoral | states | vote | president T8 offenders | crime | justice | criminal | punishment like | just | time | people | don T9 skin | hair | wear | wash | face year | 000 | games | said | years T10 mars | martian | geological | planets | landform car | cars | people | public | city T11 electoral | electors | voting | elections | election company | business | market | products | marketing T12 antennas | antenna | wireless | communications | broadcasting research | data | used | study | based T13 nfl | seahawks | steelers | afc | colts economic | energy | environmental | global | countries T14 — management | employees | organization | project | work ---Strategy 2 (200-Word Windows): BERTopic vs LDA--- Topic BERTopic (KeyBERT) LDA (Top Words) T0 art | culture | cultural | society | war time | year | said | years | day T1 technology | benefits | social | cars | transportation information | process | patients | development | care T2 students | student | school | schools | education media | social | introduction | paper | essay T3 management | managers | organization | organizations | business like | people | just | know | good T4 room | she | very | good | had health | potential | use | energy | environmental T5 ethical | genetic | animals | animal | rights government | country | countries | economic | political T6 cosmos | mars | nasa | space | universe women | water | al | et | body T7 football | players | soccer | league | sports students | school | education | learning | college T8 learning | networks | neural | recognition | algorithms human | time | face | language | life T9 electoral | electors | voter | elections | voters world | social | cultural | individuals | people T10 wash | using | face | clean | skin people | united | war | cars | states T11 lighting | light | photography | electricity | colors company | market | business | services | products T12 refinery | oil | reservoirs | wastewater | fluids data | analysis | study | research | used T13 eu | eus | european | europe | constitutional organization | employees | management | work | performance T14 — web | new | 2011 | university | journal ===Evaluation 2: Topic Diversity=== BERTopic S1 (Doc-Level) diversity: 0.371 BERTopic S2 (200-Word) diversity: 0.379 LDA S1 (Doc-Level) diversity: 0.893 LDA S2 (200-Word) diversity: 0.900 ===Evaluation 3: Topic Distinctiveness (Cosine Similarity)=== Mean inter-topic similarity (lower = more distinct topics): BERTopic S1 (Doc-Level): 0.0000 BERTopic S2 (200-Word): 0.0000 LDA S1 (Doc-Level): 0.0101 LDA S2 (200-Word): 0.0089
===Evaluation 4: Human vs. AI Topic Separation (Chi-Square Test)===
BERTopic: S1 (Doc-Level)
Chi-square: 521.44 | df: 13 | p-value: 4.08e-103 → ✓ SIGNIFICANT
BERTopic: S2 (200-Word)
Chi-square: 2611.97 | df: 13 | p-value: 0.00e+00 → ✓ SIGNIFICANT
LDA: S1 (Doc-Level)
Chi-square: 870.13 | df: 14 | p-value: 1.08e-176 → ✓ SIGNIFICANT
LDA: S2 (200-Word)
Chi-square: 5887.17 | df: 14 | p-value: 0.00e+00 → ✓ SIGNIFICANT
Evaluation Summary Table
Model Strategy N Topics Diversity Avg Similarity Chi-sq p-value Label Quality
BERTopic S1 (Doc) 14 0.371 0.0000 4.08e-103 High (KeyBERT)
BERTopic S2 (200w) 14 0.379 0.0000 0.00e+00 High (KeyBERT)
LDA S1 (Doc) 15 0.893 0.0101 1.08e-176 Medium (top words)
LDA S2 (200w) 15 0.900 0.0089 0.00e+00 Medium (top words)
Evaluation Results: Discussion¶
Topic Interpretability: BERTopic wins clearly. KeyBERT labels are immediately readable without guesswork. LDA's top words require more interpretation, and some topics (e.g., T1: technology | human | world | time | new | potential) are too broad to be meaningfully named.
Topic Diversity: Counterintuitive result here. LDA scores much higher (0.893–0.900) vs. BERTopic (0.393–0.407). But this isn't actually a win for LDA since LDA's high diversity comes from spreading words thinly across broad topics, so top-word lists don't overlap much. Thus, BERTopic's lower score reflects that semantically similar topics naturally share vocabulary and this metric is then penalizing BERTopic merely for being semantically consistent, which isn't a fair critique.
Topic Distinctiveness: Both models score near zero on average inter-topic cosine similarity (BERTopic: 0.0000; LDA: 0.0089–0.0101). Therefore, as we'd hope for, topics are essentially orthogonal in word-space for both models.
- The similarity matrices above make this visually clear. Each matrix is almost entirely pale yellow (near-zero similarity) with a dark red diagonal, where the diagonal just represents each topic compared to itself (similarity = 1.0 by definition), so the important thing to look at is everything off the diagonal. For BERTopic especially, the off-diagonal is essentially blank, meaning no two topics share meaningful vocabulary. LDA shows a tiny bit more off-diagonal color, which is consistent with its slightly higher mean similarity scores, but still nowhere near problematic overlap. One thing worth noting is that the BERTopic S2 matrix has a small cluster of slightly warmer cells around Topics 5-6, suggesting those two topics share a bit more vocabulary than the others, but given the mean similarity is still 0.0000, this is a minor observation rather than a real concern.
Human vs. AI Topic Separation, the main finding (Chi-Square): Every single combination returns a highly significant result i.e., all four p-values are essentially zero. Therefore, we can reject the null and conclude with 95% confidence that topic assignment is NOT independent of Human vs. AI label across all four models. In other words, knowing which topic a chunk belongs to gives you meaningful information about whether it was written by a human or an LLM.
- Additionally, The chi-square statistics tell a clear story about which combination finds the sharpest separation: LDA S2 (5,887) > BERTopic S2 (2,154) > LDA S1 (870) > BERTopic S1 (429). Two things stand out here. First off, Strategy 2 consistently outperforms Strategy 1 for both models (reinforcing Step 3). Secondly, LDA S2 produces the strongest separation despite BERTopic producing more interpretable labels. Therefore, I'd wrap this up by noting that the two models are both good, just at different things.
Overall recommendation: BERTopic S2 is the best all-around combination. It's readable, includes semantically meaningful topics AND (most importantly) a strong Human vs. AI separation. But the honest takeaway is that all four combinations agree to one primary finding, namely that human and AI writing cluster into meaningfully different topics, and that difference is statistically significant regardless of model or chunking strategy.
Main Step 7: Results, Discussion, Limitations, and Next Steps¶
Results Summary¶
Across every step of this pipeline, the same core finding keeps emerging: human and AI writing are meaningfully distinguishable, and that distinction is both statistically significant and interpretable from a cultural standpoint.
Starting with classification: TF-IDF features alone were enough to separate human from AI chunks with a macro F1 of 0.862 (Logistic Regression), well above the 0.5 random chance baseline. The top predictive features tell a clear story too. AI writing is flagged by abstract vocabulary (potential, significant, ultimately, challenges), while human writing is grounded in specific evidence and context (cited, references, essay, 2007, 2008, 2009). The word clouds made this difference easy to see at a glance as well.
Topic modeling confirmed and deepened this finding: Across all four model/strategy combinations (BERTopic and LDA × two chunking strategies), topic assignment was never independent of the Human/AI label, i.e., all four chi-square tests returned p-values essentially at zero. The strongest separation came from LDA on Strategy 2 (chi-square = 5,887), while BERTopic on Strategy 2 produced the most readable topics. Human chunks concentrated in academically grounded and narrative topics, while AI chunks concentrated in abstract, essay-prompt-style topics. Notably, Strategy 2 (fixed 200-word windows) consistently outperformed Strategy 1 (document-level) across both models, which validates the chunking decision made back in Step 3.
The evaluation metrics told a subtle story: BERTopic wins on interpretability and topic distinctiveness vs. LDA wins on topic diversity (though for structural reasons that don't actually reflect quality) and on the strength of Human/AI separation. Neither model dominates on every metric, which is itself an interesting takeaway.
Discussion¶
The central cultural question this project started with (can we tell human writing from machine writing?) has a clear answer: yes, and the difference runs deeper than just surface writing style.
What the classifier and topic models are picking up on isn't just that AI writes "differently" in some vague sense, it gets more specific than that. Human writing in this corpus is rooted in time and evidence, i.e., real years, real citations, real institutional names. In contrast, AI writing is more vague and light on institutional specifics. This isn't a quality judgment (plenty of bad human writing exists, and plenty of polished AI writing does too), but it is a real structural difference in how each kind of writing relates to the world.
Topic modeling results then add another layer to this. Specifically, the fact that AI chunks pile up in the abstract essay-prompt topics (education, technology, healthcare framed generically) while human chunks spread more into narrative, conversational, and academically cited content says something interesting about how LLMs generate text at the prompt level. Additionally, it appears that models trained on instruction-following tend to produce a particular style, i.e., organized, moderate-length, rhetorically tidy, and this ends up clustering differently from how humans actually respond to the same prompts.
Finally, one finding worth noting is that the previously mentioned ~14% misclassification rate is not a failure. Instead, it reflects how genuinely hard this problem is as a continuously explored research question.
On a specific note, its the AI chunks that are most likely to get misclassified as human, and those that do are probably the ones from larger, higher-quality models (which is exactly what you'd expect as LLMs keep improving). In that sense, the error rate is less a flaw in the model and more a measurement of how blurry that boundary between human and machine writing has already become.
Limitations¶
AI label heterogeneity is probably the biggest caveat in this whole project. Collapsing 62 LLMs into one "AI" class makes the binary classification doable, but it hides a lot of variation. For instance, Flan-T5-Small (avg 40 words, short and often choppy) and GPT-4 (avg 600+ words, polished and fluent) are both labeled AI = 1. So the classifier and topic models aren't really learning one single "AI writing style", they're learning whatever separates human writing from the messy aggregate of all these very different models. Therefore, a more detailed analysis grouping models by family (small open-source vs. large proprietary) could show that the "AI writing" signal is really coming from a specific subset of models.
BERTopic outlier rate with Strategy 2 is worth flagging. Specifically, about 38% of Strategy 2 chunks (10,594 of 27,430) got assigned to the outlier topic (-1), meaning BERTopic couldn't confidently place them in any cluster. This is a real downside of applying HDBSCAN-based clustering to short, uniform chunks, i.e., there just isn't enough context per chunk for the algorithm to be sure. Therefore, increasing chunk size or switching to soft clustering could help, but would trade away the size-consistency benefits that made Strategy 2 the right call in the first place.
Sample size is a clear practical limitation. The analysis uses 10,000 of 788,922 texts (1.3%), which does mirror the full dataset's label distribution well enough, but likely under-represents LLMs with small sample counts overall. For example, some models appear fewer than 300 times in the full corpus, meaning they could have zero or near-zero representation in the sample.
The "AI vocabulary". The classifier flagging words like additionally and ultimately as predictive of AI is statistically true for this dataset, but it's worth being honest about the limitation: these are also features of well-structured academic writing. However, note that the model is capturing a real pattern in this corpus, not making a universal claim about what distinguishes good writing from bad. What may be frustrating however is that this analysis appears to encroach on the claim that only less polished/worse writing is human, and that's unfair.
Prompt overlap between human and AI texts is both a strength and a limitation. The shared prompts make the comparison controlled (same task, different writer), but they also mean that any topic-level differences found here are differences in how human vs. AI respond to the same prompts, not differences in what topics they'd naturally choose to write about on their own. That limits how far these findings generalize to more natural writing settings.
Next Steps¶
Detailed LLM classification: Rather than collapsing all 62 LLMs into one AI class, a follow-up could group models by family (GPT, LLaMA, Bloom, T5) or by size and ask whether classification accuracy varies across those groups. The expectation would be that larger, more capable models are harder to tell apart from human writing, which would be both culturally and practically interesting to confirm.
Prompt-separated analysis: This project treats all prompts as interchangeable, but the prompts in prompts.csv cover very different writing tasks (essays, creative writing, how-to guides, etc.). Thus, running separate classifiers or topic models per prompt type could show whether the Human/AI distinction is stronger for some task types than others. In fact, the beauty/tutorial cluster that showed up in BERTopic suggests this kind of prompt-level variation is already visible in the data.
Time Bias: Human texts in this corpus seem to skew toward a specific historical window, i.e., the years 2007–2011 kept showing up as predictive features. Thus, a natural follow-up would be to look at whether the Human/AI distinction is partly a time artifact (human writing from that era vs. LLMs trained on more recent data) and whether controlling for that changes anything.
Reducing BERTopic Outliers: The ~38% outlier rate with Strategy 2 is worth addressing in a follow-up. Right now, BERTopic makes a forced yes-or-no decision for each chunk, i.e., either it belongs to a topic confidently enough, or it gets dumped into the outlier bucket. However, BERTopic actually supports a more flexible version of this where each chunk gets a probability score for every topic instead of a single assignment, which would likely rescue a lot of those 10,594 outlier chunks and give a richer picture of how topics blend across Human vs. AI writing.
Scaling to the full corpus. The 10k sample is a reasonable starting point, but running this on the full 788k corpus would give more stable topic distributions and make it possible to actually study the rare LLMs that are underrepresented in the sample. Dask parallel processing could be a somewhat simple way to handle the memory constraints.