Mini-Project 3: Beyond the Joke: A Multimodal Analysis of Blame and Framing in US Political and COVID Memes
Memes are one of the most widespread forms of political communication on the internet today. In just a few words and an image, they assign blame, celebrate heroes, mock politicians, and process collective anxiety about public health crises. Despite their apparent simplicity, memes are doing real cultural work. Namely, they frame narratives, shape how people think about political figures, and circulate ideological positions in a format designed for rapid sharing. This project treats memes not as trivial internet content but as genuine cultural artifacts worth analyzing systematically.
The central research question driving this project is: how do US political and COVID memes differ in how they assign blame, victimhood, and heroism, and do visual, textual, and acoustic features reinforce those framings? To answer this, we draw on three distinct datasets. The primary dataset is the Memes Images: OCR Data, which provides 5,552 political and COVID memes with OCR-extracted text and crowd-annotated entity tags labeling heroes, villains, and victims, along with 16 actual meme image files for visual analysis. To contextualize the meme findings, we bring in the 2020 US Presidential Election Campaign Speeches, a peer-reviewed corpus of 1,081 cleaned campaign speeches from Trump, Biden, Pence, and Harris spanning January 2019 through January 2021. This allows us to ask whether the blame framing found in memes actually reflects what politicians are saying in their own words. Finally, the M-Arg Multimodal Argumentation Dataset provides 17 MP3 audio clips from the 2020 presidential debates with force-aligned utterance timestamps, which we use to extract acoustic speech features and ask whether how politicians literally sound connects to how memes portray them.
The analysis proceeds across eight steps. Steps 1 through 4 cover data loading, preprocessing, and results from text and image analysis of the meme data, including entity frequency analysis, VADER sentiment, TF-IDF vocabulary comparison, word clouds, color palette analysis, and text coverage estimation. Step 5 brings those findings together into a combined summary. Step 6 runs a stability check on a held-out validation set of 650 memes to confirm the main findings hold on unseen data. Step 7 expands the analysis to the campaign speech transcripts, comparing sentiment, vocabulary, and COVID blame-framing language across speakers and against the meme findings. Step 8 adds acoustic feature analysis using librosa on the actual debate audio, extracting zero crossing rate, RMS energy, Chroma STFT, and MFCCs per speaker utterance to ask whether acoustic speech patterns align with meme villain framing.
Notes:
- All Plotly figures in this notebook are interactive: hovering over charts reveals exact values. The image grid in Section 4.3 is a static matplotlib figure.
- Sentiment refers to the emotional tone of a text, measured using VADER on a scale from -1 (most negative) to +1 (most positive).
1. Dataset (Creation and Preprocessing)
Dataset: Memes Images: OCR Data (Kaggle). This dataset has two components:
- Text: Two CSVs (a training set of 5,552 memes and a validation set of 650 memes). Each row contains the OCR-extracted text from one meme and four entity tag columns labeling named heroes, villains, victims, and other entities.
- Images: 16 sample PNG files (8 COVID memes, 8 US political memes) included as actual image files (.png)
These two meme categories let us ask a comparative question: does the same internet meme format function differently when applied to electoral politics vs. a public health crisis? The entity tags are what make this dataset interesting. Instead of just asking 'is this meme positive or negative,' we can ask 'who exactly is getting blamed, and who is getting credit'.
1.1: Load Libraries and Data
# imports
import pandas as pd
import numpy as np
import ast
import os
from collections import Counter
# for text analysis
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('stopwords', quiet=True)
nltk.download('vader_lexicon', quiet=True)
nltk.download('punkt', quiet=True)
from sklearn.feature_extraction.text import TfidfVectorizer
# for image analysis
from PIL import Image
import cv2
# for visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook'
import matplotlib.pyplot as plt
# sanity check line
print("All libraries loaded successfully")
All libraries loaded successfully
# load the data
train_df = pd.read_csv('../data/text_with_ocr.csv')
val_df = pd.read_csv('../data/validation_data.csv')
# quick look at the data
print(f"Training rows: {len(train_df)}")
print(f"Validation rows: {len(val_df)}")
print(f"Columns: {list(train_df.columns)}")
train_df.head(3)
Training rows: 5552
Validation rows: 650
Columns: ['OCR', 'image', 'hero', 'villain', 'victim', 'other']
| | OCR | image | hero | villain | victim | other |
|---|---|---|---|---|---|---|
| 0 | Bernie or Elizabeth? Be informed. Compare them... | covid_memes_18.png | NaN | NaN | NaN | ['bernie sanders', 'elizabeth warren'] |
| 1 | Extending the Brexit deadline until October 31... | covid_memes_19.png | NaN | ['uk government'] | NaN | NaN |
| 2 | kwai gkwa 0964 #nnevvy applause to Thais from ... | covid_memes_252.png | ['thais'] | NaN | NaN | ['hong kong'] |
1.2: Inspect and Understand the Structure
# what do the entity tags look like
# note that tags are stored as string representations of Python lists, e.g. "['donald trump']"
# so they must be parsed properly
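def parse_tag(val):
    """Safely parse a tag column from string-list to actual list."""
    try:
        result = ast.literal_eval(val)
    except (ValueError, SyntaxError, TypeError):
        # NaN cells (floats) and malformed strings are treated as "no tags"
        return []
    # anything that parses but is not a list is also treated as "no tags"
    return result if isinstance(result, list) else []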
# apply the parser to all entity columns
for col in ['hero', 'villain', 'victim', 'other']:
    train_df[col + '_parsed'] = train_df[col].apply(parse_tag)
    val_df[col + '_parsed'] = val_df[col].apply(parse_tag)
# add a category column based on the image filename prefix
# i.e., images named 'covid_memes_X.png' are COVID; 'memes_X.png' are US politics
train_df['category'] = train_df['image'].apply(
    lambda x: 'COVID' if str(x).startswith('covid') else 'US Politics'
)
val_df['category'] = val_df['image'].apply(
    lambda x: 'COVID' if str(x).startswith('covid') else 'US Politics'
)
print("Category breakdown (train):")
print(train_df['category'].value_counts())
print()
print("Sample row OCR text:")
print(train_df['OCR'].iloc[0])
Category breakdown (train):
category
US Politics    2852
COVID          2700
Name: count, dtype: int64

Sample row OCR text:
Bernie or Elizabeth? Be informed. Compare them on the issues that matter. Issue: Who makes the dankest memes?
1.3: Preprocessing the OCR Text
# clean OCR text
# OCR from memes is inherently noisy
# (e.g., picks up watermarks, usernames, website URLs, and formatting artifacts)
# Here: do light cleaning (but careful not to strip too much because the meme language is intentional)
import re
stop_words = set(stopwords.words('english'))
def clean_ocr(text):
    """
    Light cleaning for meme OCR text:
    - lowercase
    - remove URLs and handles
    - remove special characters but keep spaces
    - strip extra whitespace
    """
    if not isinstance(text, str):
        return ''
    # lowercase
    text = text.lower()
    # remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # remove @handles
    text = re.sub(r'@\w+', '', text)
    # keep only letters + spaces
    text = re.sub(r'[^a-z\s]', ' ', text)
    # collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
train_df['ocr_clean'] = train_df['OCR'].apply(clean_ocr)
val_df['ocr_clean'] = val_df['OCR'].apply(clean_ocr)
# also compute word count on cleaned text (useful for EDA)
train_df['word_count'] = train_df['ocr_clean'].apply(lambda x: len(x.split()))
print("Preprocessing complete.")
print(f"Average word count per meme: {train_df['word_count'].mean():.1f}")
print(f"Median word count: {train_df['word_count'].median():.1f}")
Preprocessing complete.
Average word count per meme: 19.1
Median word count: 16.0
Preprocessing Summary
Before we do any analysis, it helps to know what we're actually working with. We loaded two CSVs: a training set of 5,552 memes and a validation set of 650. Each has six columns: the OCR-extracted text, the image filename, and four entity tag columns (hero, villain, victim, other).
The first thing we had to deal with is that the entity tags aren't stored as clean lists. They are stored as strings that look like lists (e.g., "['donald trump']"), and untagged cells are NaN. So for this preprocessing task, we wrote a parser that converts those strings into actual Python lists and maps everything else to an empty list.
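To make the parser's behavior concrete, here is a minimal, self-contained sketch of the same idea. It mirrors the `parse_tag` helper above but is written as a standalone illustration; the three inputs are made-up examples of the cases the column can contain (a well-formed list string, a NaN cell, and a malformed string).

```python
import ast


def parse_tag(val):
    """Parse a stringified list such as "['donald trump']" into a real list."""
    try:
        result = ast.literal_eval(val)
    except (ValueError, SyntaxError, TypeError):
        # NaN floats and malformed strings end up here
        return []
    return result if isinstance(result, list) else []


well_formed = parse_tag("['bernie sanders', 'elizabeth warren']")
missing = parse_tag(float("nan"))       # how pandas represents an empty cell
malformed = parse_tag("[donald trump]")  # no quotes, so not valid Python

print(well_formed)
print(missing)
print(malformed)
```

Using `ast.literal_eval` rather than `eval` means only Python literals are accepted, so a stray string in the CSV cannot execute code.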
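Next, we added a category label to each row based on the image filename (i.e., anything starting with covid gets labeled COVID, while everything else gets labeled US Politics). This gave us 2,852 politics memes and 2,700 COVID memes, a fairly even split. The prefix is an imperfect proxy for topic, though: the three rows previewed in 1.1 all carry the covid_memes prefix even though their text concerns the Democratic primary, Brexit, and Hong Kong.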
Finally, we cleaned the OCR text. Meme OCR is messy by nature: it picks up watermarks, website URLs, @handles, and random punctuation. We lowercased everything, stripped URLs and handles, and removed non-letter characters. We were careful not to over-clean, since meme language is intentional and weird in ways that matter. After cleaning, the average meme caption is about 19 words, with a median of 16, so we are dealing with short texts throughout.
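The cleaning steps can be traced on a single string. The sketch below repeats the same four regex steps as `clean_ocr`; the sample caption is invented for illustration and is not taken from the dataset.

```python
import re


def clean_ocr(text):
    """Lowercase, drop URLs and @handles, keep letters only, collapse spaces."""
    if not isinstance(text, str):
        return ''
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)              # @handles
    text = re.sub(r'[^a-z\s]', ' ', text)         # digits and punctuation become spaces
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical caption, for illustration only
sample = "Visit www.memes.com! @user says COVID19 isn't over #StayHome"
cleaned = clean_ocr(sample)
print(cleaned)
```

Two side effects are worth keeping in mind when reading later results. Digits and apostrophes are removed, so "covid19" collapses to "covid" and contractions split into fragments. Also, VADER treats capitalization and exclamation marks as intensity cues, so scoring the cleaned text rather than the raw OCR may mute some scores.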
2. Primary Methods
The first part of this project will use two parallel analysis tracks to answer the research question.
Track 1 (Step 3): Text Analysis (all 5,552 training rows):
- Entity frequency analysis: who appears as villain, hero, and victim across each meme category?
- Sentiment analysis using VADER: do COVID vs. politics memes carry systematically different emotional tones?
- TF-IDF comparison: what words are most distinctive to each category?
Track 2 (Step 4): Image Analysis (16 sample PNGs):
- Color palette analysis: average brightness, saturation, and dominant hue per category.
- Text-to-image area ratio: how much of each meme is covered by text vs. visual content?
Step 5 then brings both tracks together into a single combined summary.
3. Text Analysis
3.1: Entity Frequency: Who gets cast as villain, hero, and victim?
# flatten all entity tags and count by category
# for each role (villain, hero, victim), want to know:
# which named entities appear most frequently
# and does that differ between COVID memes and US Politics memes?
def get_entity_counts(df, role, category=None):
    """
    Extract and count entities for a given role and optional category filter
    """
    col = role + '_parsed'
    if category:
        df = df[df['category'] == category]
    all_entities = []
    for tags in df[col]:
        all_entities.extend(tags)
    return Counter(all_entities)
# get top villains, heroes, and victims for each category
roles = ['villain', 'hero', 'victim']
categories = ['US Politics', 'COVID']
for role in roles:
    print(f"\n=== Top 10 {role.upper()}S ===")
    for cat in categories:
        counts = get_entity_counts(train_df, role, cat)
        print(f"\t {cat}: {counts.most_common(10)}")
=== Top 10 VILLAINS ===
US Politics: [('donald trump', 358), ('joe biden', 149), ('democratic party', 145), ('republican party', 136), ('barack obama', 74), ('democrats', 55), ('hiliary clinton', 44), ('republicans', 39), ('libertarian party', 33), ('hillary clinton', 28)]
COVID: [('donald trump', 146), ('china', 62), ('coronavirus', 60), ('2020', 31), ('covid19', 17), ('joe biden', 13), ('covid infected people', 13), ('barack obama', 11), ('chinese', 11), ('government', 10)]
=== Top 10 HEROS ===
US Politics: [('donald trump', 37), ('barack obama', 32), ('green party', 19), ('libertarian party', 17), ('joe biden', 12), ('bernie sanders', 12), ('jill stein', 11), ('libertarian', 7), ('gary johnson', 5), ('republican party', 5)]
COVID: [('chuck norris', 11), ('corona beer', 11), ('weed', 8), ('dr. anthony fauci', 6), ('joe biden', 5), ('donald trump', 5), ('china', 4), ('russia', 4), ('alcohol', 4), ('barack obama', 3)]
=== Top 10 VICTIMS ===
US Politics: [('donald trump', 37), ('america', 30), ('barack obama', 24), ('democratic party', 21), ('women', 15), ('people', 13), ('americans', 12), ('joe biden', 12), ('mexicans', 11), ('american people', 10)]
COVID: [('donald trump', 22), ('people', 20), ('china', 14), ('coronavirus', 12), ('parents', 8), ('world', 6), ('the world', 6), ('usa', 6), ('america', 6), ('americans', 6)]
# Visualization: top villains by category (side-by-side bar chart)
# here focus on villains because they have the most tag coverage (1,884 rows) and are the most politically revealing
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=('US Politics Memes: Top Villains',
                                    'COVID Memes: Top Villains'))
for i, cat in enumerate(['US Politics', 'COVID'], 1):
    counts = get_entity_counts(train_df, 'villain', cat)
    top = counts.most_common(10)
    entities = [x[0] for x in top][::-1]  # reverse so highest at the top
    values = [x[1] for x in top][::-1]
    color = '#e74c3c' if cat == 'US Politics' else '#3498db'
    fig.add_trace(
        go.Bar(y=entities, x=values, orientation='h',
               marker_color=color, name=cat),
        row=1, col=i
    )
fig.update_layout(
    title='Who Gets Blamed? Top Villain Entities by Meme Category',
    height=420,
    showlegend=False,
    font=dict(size=11)
)
fig.show()
# Hero comparison
fig2 = make_subplots(rows=1, cols=2,
                     subplot_titles=('US Politics Memes: Top Heroes',
                                     'COVID Memes: Top Heroes'))
for i, cat in enumerate(['US Politics', 'COVID'], 1):
    counts = get_entity_counts(train_df, 'hero', cat)
    top = counts.most_common(10)
    if not top:
        continue
    entities = [x[0] for x in top][::-1]
    values = [x[1] for x in top][::-1]
    color = '#27ae60' if cat == 'US Politics' else '#16a085'
    fig2.add_trace(
        go.Bar(y=entities, x=values, orientation='h',
               marker_color=color, name=cat),
        row=1, col=i
    )
fig2.update_layout(
    title='Who Gets Praised? Top Hero Entities by Meme Category',
    height=420,
    showlegend=False,
    font=dict(size=11)
)
fig2.show()
# victim comparison
fig3 = make_subplots(rows=1, cols=2,
                     subplot_titles=('US Politics Memes: Top Victims',
                                     'COVID Memes: Top Victims'))
for i, cat in enumerate(['US Politics', 'COVID'], 1):
    counts = get_entity_counts(train_df, 'victim', cat)
    top = counts.most_common(10)
    if not top:
        continue
    entities = [x[0] for x in top][::-1]
    values = [x[1] for x in top][::-1]
    color = '#8e44ad' if cat == 'US Politics' else '#d35400'
    fig3.add_trace(
        go.Bar(y=entities, x=values, orientation='h',
               marker_color=color, name=cat),
        row=1, col=i
    )
fig3.update_layout(
    title='Who Gets Hurt? Top Victim Entities by Meme Category',
    height=420,
    showlegend=False,
    font=dict(size=11)
)
fig3.show()
3.1: Entity Frequency (brief interpretation based on code results and figures above):
The villain chart above is probably the most interesting result. Donald Trump dominates as the top villain in US Politics memes by a huge margin (nearly 360 appearances, more than double the next entry: Joe Biden at 149). What's also worth pointing out is that both major parties show up in the top 5, which means this dataset captures memes from across the political spectrum, not just one side. But then, the COVID villain list tells a different story, i.e., Trump still tops it, but he's followed by China, coronavirus, and "2020" itself (people were literally blaming the year as a concept, which says a lot about how exhausting the pandemic felt).
The hero data is where things get interesting. In COVID memes, Chuck Norris and corona beer tie for the top hero spot (11 each), with weed coming in third. This is classic meme culture: what seems like genuine commentary (Dr. Fauci at #4) sits next to pure absurdist humor. The politics hero list is more straightforward but still revealing. Trump and Obama both appear as heroes AND villains depending on the meme, which suggests that the same person means opposite things to different communities online. The counts are lopsided, though: Trump is tagged as a villain 358 times in politics memes and as a hero only 37 times.
Finally, the victim data rounds things out. "America," "women," "people," and "Americans" appearing as victims in politics memes reflects the populist framing that runs through a lot of political content: whoever the villain is, they're hurting regular people. COVID memes lean toward broader collectives such as "world," "the world," and "usa," alongside "people" and "parents," rather than specific political groups.
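One caveat about the rankings above: the raw tags contain spelling and naming variants (for example both 'hiliary clinton' and 'hillary clinton', and both 'democrats' and 'democratic party'). The sketch below shows how merging such variants could change the picture. The counts are copied from the US Politics villain output; the alias map is our own assumption, and treating "democrats" as equivalent to "democratic party" is a judgment call rather than something the dataset specifies.

```python
from collections import Counter

# Top-10 US Politics villain counts, copied from the printed output above
villains = Counter({
    'donald trump': 358, 'joe biden': 149, 'democratic party': 145,
    'republican party': 136, 'barack obama': 74, 'democrats': 55,
    'hiliary clinton': 44, 'republicans': 39, 'libertarian party': 33,
    'hillary clinton': 28,
})

# Hypothetical alias map (an assumption for illustration, not part of the dataset)
ALIASES = {
    'hiliary clinton': 'hillary clinton',
    'democrats': 'democratic party',
    'republicans': 'republican party',
}


def merge_aliases(counts, aliases):
    """Sum counts for entities that map to the same canonical name."""
    merged = Counter()
    for entity, n in counts.items():
        merged[aliases.get(entity, entity)] += n
    return merged


merged = merge_aliases(villains, ALIASES)
print(merged.most_common(5))
```

Under this particular alias map, the two parties move ahead of Joe Biden and Trump's lead over the runner-up shrinks to less than double. Because only the top-10 entries are used here, the merged totals are lower bounds; applying the map to the full `villain_parsed` column would be the more complete check.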
3.2: Sentiment Analysis with VADER
NOTE: Sentiment refers to the emotional tone of a piece of text (essentially asking: is it positive, negative, or neutral?). Here we measure this using VADER, a tool designed specifically for short social media text, which assigns each meme caption a score from -1 (most negative) to +1 (most positive).
# run VADER sentiment on cleaned OCR text
sid = SentimentIntensityAnalyzer()
def get_compound(text):
    """
    Return the VADER compound sentiment score for a text string.
    """
    if not isinstance(text, str) or len(text.strip()) == 0:
        return 0.0
    return sid.polarity_scores(text)['compound']
train_df['sentiment'] = train_df['ocr_clean'].apply(get_compound)
print("Sentiment by category:")
print(train_df.groupby('category')['sentiment'].describe().round(3))
Sentiment by category:
count mean std min 25% 50% 75% max
category
COVID 2700.0 0.000 0.390 -0.962 -0.178 0.0 0.231 0.975
US Politics 2852.0 0.103 0.477 -0.980 -0.225 0.0 0.494 0.986
# visualize sentiment distributions
fig3 = go.Figure()
for cat, color in [('US Politics', '#e74c3c'), ('COVID', '#3498db')]:
    subset = train_df[train_df['category'] == cat]['sentiment']
    fig3.add_trace(go.Histogram(
        x=subset,
        name=cat,
        opacity=0.65,
        marker_color=color,
        nbinsx=40
    ))
fig3.update_layout(
    barmode='overlay',
    title='Sentiment Score Distribution by Meme Category',
    xaxis_title='VADER Compound Score (negative ← 0 → positive)',
    yaxis_title='Count',
    height=380,
    legend=dict(x=0.01, y=0.99),
    font=dict(size=11)
)
fig3.show()
# box plot for cleaner comparison
fig4 = px.box(
    train_df, x='category', y='sentiment',
    color='category',
    color_discrete_map={'US Politics': '#e74c3c', 'COVID': '#3498db'},
    title='Sentiment Score by Category (Box Plot)',
    labels={'sentiment': 'VADER Compound Score', 'category': 'Meme Category'},
    points='outliers'
)
fig4.update_layout(height=380, showlegend=False, font=dict(size=11))
fig4.show()
3.2: Sentiment Analysis (brief interpretation based on code results and figures above):
As shown, both categories have a median sentiment of exactly 0. This does not mean that most memes are neutral, since the quartiles sit well away from zero on both sides. It means that a sizable cluster of captions scores exactly 0 and that this cluster straddles the midpoint of each distribution. That makes sense when you think about it: a lot of meme text is short, punchy, and sarcastic, VADER returns 0 when it finds no words it can score, and it doesn't always know what to do with sarcasm.
The bigger difference between the two categories is in their spread. US Politics memes have a much wider interquartile range (Q1: -0.23, Q3: 0.49) than COVID memes (Q1: -0.18, Q3: 0.23). Most of that extra width is on the positive side: the third quartile more than doubles while the first quartile moves only slightly. In other words, politics memes include noticeably more strongly positive captions (e.g., praising a candidate) and somewhat more negative ones. COVID memes cluster tighter around zero, which suggests the emotional tone is more uniform: less celebratory, less outright hostile, just kind of flat.
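The spread comparison can be checked directly from the quartiles in the `describe()` output above. The snippet below only redoes that arithmetic; it does not touch the underlying data.

```python
# Quartiles (25%, 75%) copied from the describe() table above
quartiles = {
    'COVID': (-0.178, 0.231),
    'US Politics': (-0.225, 0.494),
}

# Interquartile range = Q3 - Q1
iqr = {cat: round(q3 - q1, 3) for cat, (q1, q3) in quartiles.items()}
ratio = iqr['US Politics'] / iqr['COVID']

print(iqr)
print(f"US Politics IQR is {ratio:.2f}x the COVID IQR")
```

The politics IQR (0.719) is roughly 1.76 times the COVID IQR (0.409), which is the numerical basis for calling the politics distribution wider.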
The histogram backs up this numerical interpretation. The large spike at 0 is mostly COVID memes, while US Politics memes spread out more across the score range, with a noticeable bump around 0.4-0.5 that COVID memes don't really have. Those are plausibly the pro-candidate memes, and they pull the politics average up to 0.103 vs. COVID's 0.000.
Overall, the takeaway here is that sentiment alone doesn't cleanly separate these two categories: both are centered on zero by VADER's measure and differ mainly in spread. The entity tag analysis from 3.1 does much more work in showing HOW the two categories actually differ.
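A median of zero and "most captions are neutral" are different claims, and a small toy example makes the distinction concrete. The scores below are invented for illustration; `zero_share` is a hypothetical helper, and the 0.05 band is the conventional VADER cutoff for calling a compound score neutral.

```python
from statistics import median


def zero_share(scores, band=0.0):
    """Fraction of scores whose absolute value is within `band` of zero."""
    hits = sum(1 for s in scores if abs(s) <= band)
    return hits / len(scores)


# Toy compound scores (made up): the median is 0, yet only a third are neutral
toy = [-0.6, -0.3, 0.0, 0.0, 0.4, 0.7]

toy_median = median(toy)
exact_share = zero_share(toy)
banded_share = zero_share(toy, band=0.05)

print(toy_median, round(exact_share, 3), round(banded_share, 3))
```

Applied to the notebook's data, something like `train_df.groupby('category')['sentiment'].apply(lambda s: zero_share(list(s), 0.05))` would report the neutral share per category directly, which is a more precise statement than reading it off the median.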
3.3: TF-IDF (What Words Distinguish Each Category?)
# TF-IDF to find category-distinctive vocabulary
# treat all memes in each category as one big document
# then use TF-IDF to find words that are especially characteristic of politics memes vs. COVID memes relative to each other
# combine all OCR text per category into one string each
politics_text = ' '.join(train_df[train_df['category']=='US Politics']['ocr_clean'])
covid_text = ' '.join(train_df[train_df['category']=='COVID']['ocr_clean'])
# fit TF-IDF on both "documents"
tfidf = TfidfVectorizer(
    stop_words='english',
    max_features=5000,
    ngram_range=(1, 2)  # include bigrams for more meaningful phrases
)
tfidf_matrix = tfidf.fit_transform([politics_text, covid_text])
feature_names = tfidf.get_feature_names_out()
# extract top terms per category
def top_tfidf_terms(matrix, row_idx, feature_names, n=20):
    """
    Get the top n TF-IDF terms for a given row (category).
    """
    row = matrix[row_idx].toarray().flatten()
    top_idx = row.argsort()[::-1][:n]
    return [(feature_names[i], round(row[i], 4)) for i in top_idx]
politics_terms = top_tfidf_terms(tfidf_matrix, 0, feature_names)
covid_terms = top_tfidf_terms(tfidf_matrix, 1, feature_names)
print("Top TF-IDF terms — US Politics:")
for term, score in politics_terms[:15]:
    print(f"\t{term}: {score}")
print("\nTop TF-IDF terms — COVID:")
for term, score in covid_terms[:15]:
    print(f"\t{term}: {score}")
Top TF-IDF terms — US Politics:
	party: 0.5251
	biden: 0.2761
	trump: 0.275
	joe: 0.2457
	obama: 0.2379
	libertarian: 0.1874
	republican: 0.1643
	democratic party: 0.1617
	democratic: 0.1333
	com: 0.1306
	republican party: 0.1212
	libertarian party: 0.1205
	president: 0.1195
	people: 0.1107
	debate: 0.1065

Top TF-IDF terms — COVID:
	coronavirus: 0.3228
	virus: 0.3205
	home: 0.2894
	covid: 0.2584
	corona: 0.2157
	work: 0.2118
	mask: 0.2087
	trump: 0.1839
	work home: 0.1832
	china: 0.1738
	people: 0.1474
	com: 0.1451
	like: 0.1389
	corona virus: 0.1241
	day: 0.1102
# visualize TF-IDF as horizontal bar charts
fig5 = make_subplots(rows=1, cols=2,
                     subplot_titles=('US Politics: Top Distinctive Terms',
                                     'COVID: Top Distinctive Terms'))
for i, (terms, color) in enumerate(
        [(politics_terms[:15], '#e74c3c'), (covid_terms[:15], '#3498db')], 1):
    words = [t[0] for t in terms][::-1]
    scores = [t[1] for t in terms][::-1]
    fig5.add_trace(
        go.Bar(y=words, x=scores, orientation='h',
               marker_color=color),
        row=1, col=i
    )
fig5.update_layout(
    title='TF-IDF: Most Distinctive Vocabulary by Category',
    height=480,
    showlegend=False,
    font=dict(size=10)
)
fig5.show()
3.3: TF-IDF (brief interpretation based on code results and figures above):
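Recall that TF-IDF up-weights words that are frequent within a document and down-weights words that appear in many documents. Here there are only two documents (one per category), so the down-weighting is mild: a word that appears in both categories is still ranked largely by how often it occurs. That is why "trump," "people," and "com" show up in both lists ("com" is most likely residue from watermark URLs that lack an http or www prefix and so slip past the cleaning regex). The lists are best read as each category's high-frequency vocabulary with a modest bonus for category-exclusive terms, i.e., an approximate signature vocabulary rather than a strict list of unique words.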
As we can see from above, the US Politics chart is almost entirely names and party labels (e.g., party, biden, trump, joe, obama, libertarian, republican, democratic). This makes sense because political memes are fundamentally about people and institutions. The word "party" scores highest by a wide margin (0.53, nearly double the next term at 0.28), which suggests that a lot of politics memes explicitly invoke party identity rather than just individual politicians.
The COVID chart shifts to a completely different vocabulary (e.g., coronavirus, virus, home, covid, corona, work, mask). What's interesting here is that "home" ranks third and "work" sixth, with the bigram "work home" in ninth, which reflects how much of the COVID meme discourse was about lockdowns and working from home rather than just the virus itself. Trump also shows up in the COVID chart (8th place), which lines up with what we saw in the villain analysis: he crossed over into COVID discourse as a target of blame.
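To see how much weight "distinctiveness" can carry in a two-document setup, it helps to compute the inverse document frequency by hand. The sketch below uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the function name `smooth_idf` is our own.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF as used by scikit-learn's default TfidfVectorizer settings."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 2  # one concatenated document per meme category

idf_shared = smooth_idf(n_docs, 2)     # term appears in both categories
idf_exclusive = smooth_idf(n_docs, 1)  # term appears in only one category

print(f"shared term IDF:    {idf_shared:.4f}")
print(f"exclusive term IDF: {idf_exclusive:.4f}")
print(f"maximum boost:      {idf_exclusive / idf_shared:.2f}x")
```

With two documents, a category-exclusive term gets at most about a 1.4x boost over a shared term, so term frequency dominates the ranking. If sharper contrasts are wanted, one option would be to fit TF-IDF on individual memes and average scores per category, or to use a direct contrast measure such as a log-odds ratio; both are suggestions rather than something this notebook does.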
4. Image Analysis
The following results cover the image side of our analysis, using the 16 sample PNG files included in the dataset (8 COVID memes and 8 US Politics memes). In this section we look at two things: the visual color properties of each category (brightness, saturation, and hue) and how much of each meme's image area is covered by text. Note that because we are working with only 16 images, all results here are treated as preliminary observations rather than definitive conclusions (the full image dataset would be needed to test these patterns statistically).
4.1: Load and Organize Images
# Load and organize images by category
# images are already extracted into data/SampleImagesData/
# two subfolders: covidMemes/ and usPoliticsMemes/
image_data = {} # { filename: {'img': PIL Image, 'category': str} }
image_dirs = {
    'COVID': '../data/SampleImagesData/SampleImagesData/covidMemes',
    'US Politics': '../data/SampleImagesData/SampleImagesData/usPoliticsMemes'
}
for cat, folder in image_dirs.items():
    for fname in os.listdir(folder):
        if fname.endswith('.png'):
            img_path = os.path.join(folder, fname)
            img = Image.open(img_path).convert('RGB')
            image_data[fname] = {'img': img, 'category': cat}
print(f"Loaded {len(image_data)} images")
for cat in ['COVID', 'US Politics']:
    n = sum(1 for v in image_data.values() if v['category'] == cat)
    print(f"  {cat}: {n} images")
Loaded 16 images
  COVID: 8 images
  US Politics: 8 images
4.2 Color Palette Analysis
# analyze color features for each image
# for each image compute:
# average brightness (V channel in HSV — higher = brighter)
# average saturation (S channel in HSV — higher = more colorful)
# mean hue (H channel; an arithmetic mean over pixels, not a dominant-hue bin)
# these are simple but interpretable features
color_records = []
for fname, info in image_data.items():
    img = info['img']
    cat = info['category']
    # convert PIL image (RGB) to a numpy array, then straight to HSV
    img_np = np.array(img)
    img_hsv = cv2.cvtColor(img_np, cv2.COLOR_RGB2HSV)
    # split into H, S, V channels
    h_channel, s_channel, v_channel = cv2.split(img_hsv)
    # compute means (these are our image-level features)
    avg_brightness = v_channel.mean()  # 0-255
    avg_saturation = s_channel.mean()  # 0-255
    avg_hue = h_channel.mean()         # 0-179 in OpenCV HSV
    color_records.append({
        'filename': fname,
        'category': cat,
        'brightness': avg_brightness,
        'saturation': avg_saturation,
        'hue': avg_hue
    })
color_df = pd.DataFrame(color_records)
print(color_df.groupby('category')[['brightness','saturation','hue']].mean().round(2))
             brightness  saturation    hue
category
COVID            123.65       76.63  48.14
US Politics      132.27       55.18  49.58
# visualize color features
fig6 = make_subplots(rows=1, cols=3,
                     subplot_titles=('Brightness', 'Saturation', 'Hue'))
metrics = ['brightness', 'saturation', 'hue']
colors_map = {'US Politics': '#e74c3c', 'COVID': '#3498db'}
for col_idx, metric in enumerate(metrics, 1):
    for cat in ['US Politics', 'COVID']:
        subset = color_df[color_df['category'] == cat][metric]
        fig6.add_trace(
            go.Box(
                y=subset,
                name=cat,
                marker_color=colors_map[cat],
                showlegend=(col_idx == 1)  # only show legend once
            ),
            row=1, col=col_idx
        )
fig6.update_layout(
    title='Visual Color Features by Meme Category (16 sample images)',
    height=400,
    boxmode='group',
    font=dict(size=11)
)
fig6.show()
4.2: Color Palette Analysis (brief interpretation based on code results and figures above):
Keep in mind this is only 16 images (8 per category), so treat these numbers as patterns worth noting rather than hard conclusions.
That said, a few patterns stand out. US Politics memes are somewhat brighter on average (132.27) than COVID memes (123.65) on a 0-255 scale. One plausible explanation, which we have not verified image by image, is that the politics sample includes screenshotted tweets and news articles whose white backgrounds raise mean brightness, while several COVID memes place text over photographs with darker regions. With 8 images per group, a gap of under 9 points should be read cautiously.
The bigger difference we can see in this analysis is in saturation. US Politics memes have a much wider spread (the box plot shows values ranging from roughly 25 to 94) meaning some are very colorful and some are nearly grayscale. That grayscale end is likely screenshotted tweets and news articles, which have almost no color. On the other hand, COVID memes are more consistently saturated (median around 76), clustering tighter together.
Lastly, we can see hue is pretty similar between the two categories and probably the least informative of the three features (both categories cover a similar range of colors overall).
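One further reason to treat the hue numbers with care: hue is an angle, so an arithmetic mean can mislead when values sit on both sides of the wrap-around point. OpenCV stores 8-bit hue as degrees divided by two (0-179), so reds appear near both 0 and 179. The sketch below contrasts the arithmetic mean with a circular mean; `circular_mean_hue` is a hypothetical helper and the two hue values are illustrative, not measurements from the sample images.

```python
import math


def circular_mean_hue(hues, hue_max=180):
    """Mean of hue values treated as angles on a circle of size `hue_max`."""
    angles = [2 * math.pi * h / hue_max for h in hues]
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    mean_angle = math.atan2(mean_sin, mean_cos) % (2 * math.pi)
    return mean_angle * hue_max / (2 * math.pi)


# Two reddish pixels on either side of the wrap-around (OpenCV scale, 0-179)
reds = [2, 178]

arithmetic = sum(reds) / len(reds)  # lands at 90, which is cyan on this scale
circular = circular_mean_hue(reds)  # lands at the wrap-around point, i.e. red

print(f"arithmetic mean hue: {arithmetic:.1f}")
print(f"circular mean hue:   {circular:.1f} (0 and 180 are the same hue)")
```

A per-image dominant-hue histogram bin would be another reasonable alternative. Either way, the near-identical mean hues reported above (48.14 vs. 49.58) may partly reflect this averaging effect rather than a genuine similarity in palette.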
4.3 Text Coverage Ratio
# estimate what proportion of each meme is text vs. image
# strategy: convert to grayscale, apply adaptive thresholding to isolate
# high-contrast text regions, then calculate what % of pixels are 'text'
# NOTE: this is an approximation (meme text is usually bold, high-contrast, and distinct from photographic content)
text_coverage_records = []
for fname, info in image_data.items():
    img = info['img']
    cat = info['category']
    img_np = np.array(img)
    img_gray = cv2.cvtColor(img_np, cv2.COLOR_RGB2GRAY)
    # adaptive threshold isolates high-contrast regions (like bold meme text)
    thresh = cv2.adaptiveThreshold(
        img_gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV,
        blockSize=15,
        C=10
    )
    # text coverage = proportion of pixels flagged as high-contrast
    total_pixels = thresh.size
    text_pixels = thresh.sum() // 255
    coverage_pct = (text_pixels / total_pixels) * 100
    text_coverage_records.append({
        'filename': fname,
        'category': cat,
        'text_coverage_pct': round(coverage_pct, 2)
    })
coverage_df = pd.DataFrame(text_coverage_records)
# visualize
# show actual meme images with text coverage % annotated
# much more informative than a dot plot with only 16 images
# sorted by category then by coverage % so you can see the range within each group
sorted_coverage = coverage_df.sort_values(['category', 'text_coverage_pct'], ascending=[True, False])
fig, axes = plt.subplots(2, 8, figsize=(20, 6))
fig.suptitle('Estimated Text Coverage Across All 16 Sample Memes', fontsize=13, fontweight='bold', y=1.02)
for ax, (_, row) in zip(axes.flatten(), sorted_coverage.iterrows()):
    fname = row['filename']
    cat = row['category']
    pct = row['text_coverage_pct']
    # get the actual image
    img = image_data[fname]['img']
    ax.imshow(img)
    ax.axis('off')
    # color title by category
    color = '#3498db' if cat == 'COVID' else '#e74c3c'
    # annotate with filename and text coverage percentage
    ax.set_title(f"{fname.replace('.png','')}\nText: {pct}%", fontsize=6.5, color=color, fontweight='bold')
plt.tight_layout()
plt.show()
print("Mean text coverage by category:")
print(coverage_df.groupby('category')['text_coverage_pct'].mean().round(2))
Mean text coverage by category:
category
COVID          16.40
US Politics    12.89
Name: text_coverage_pct, dtype: float64
4.3: Text Coverage Ratio (brief interpretation based on code results and figures above):
As shown, COVID memes have a higher average text coverage (16.40%) compared to US Politics memes (12.89%), and looking at the actual images you can see why. The top COVID memes by text coverage tend to be the classic image macro format, i.e., bold white text overlaid directly on a photo, like the "Deadly Wuhan Virus" meme (26.85%) or the "He Handles the Pandemic" one (17.64%). In other words, these formats are designed to deliver a punchy text message front and center.
On the other hand, US Politics memes in this sample lean more toward photo-heavy content, using real photographs of politicians, screenshotted tweets, or image-based jokes where the visual itself carries the meaning rather than overlaid text. Note however, the Democrat Party meme (27.28%) is the outlier on the politics side, a classic text-heavy format that pushes that category's average up.
One thing worth flagging is that our text coverage method (adaptive thresholding) picks up any high-contrast region, not just actual text. Things like high-contrast clothing, sharp edges in photos, or dark/light boundaries can therefore inflate the numbers. The relative comparison between categories is still useful, but the exact percentages should be treated as estimates rather than precise measurements.
Overall this adds a small but interesting layer to the color analysis from 4.2 (i.e., COVID memes are not just more saturated visually, they also tend to pack more text into the frame, which fits the image macro format that dominated pandemic meme culture).
5. Putting it Together (Combined Results Summary)¶
This step pulls every finding from Steps 3 and 4 into one place. Here we build a full feature summary table comparing both categories across all metrics (sentiment, entity tags, brightness, saturation, and text coverage) and then visualize the entire 5,552-meme dataset in a single interactive density plot that maps sentiment against villain tag count. The goal in this section is to ask: do the text and image patterns point in the same direction, and what does that tell us about how these two meme categories frame the world differently?
# build the summary table
summary_records = []
for cat in ['US Politics', 'COVID']:
    subset_text = train_df[train_df['category']==cat]
    subset_color = color_df[color_df['category']==cat]
    subset_cov = coverage_df[coverage_df['category']==cat]
    # most frequent entity per role (falls back to 'N/A' if a role has no tags)
    top_villain = get_entity_counts(train_df, 'villain', cat).most_common(1)
    top_villain_str = top_villain[0][0] if top_villain else 'N/A'
    top_hero = get_entity_counts(train_df, 'hero', cat).most_common(1)
    top_hero_str = top_hero[0][0] if top_hero else 'N/A'
    top_victim = get_entity_counts(train_df, 'victim', cat).most_common(1)
    top_victim_str = top_victim[0][0] if top_victim else 'N/A'
    summary_records.append({
        'Category': cat,
        'N (train)': len(subset_text),
        'Avg Sentiment': round(subset_text['sentiment'].mean(), 3),
        'Top Villain': top_villain_str,
        'Top Hero': top_hero_str,
        'Top Victim': top_victim_str,
        'Avg Brightness': round(subset_color['brightness'].mean(), 1),
        'Avg Saturation': round(subset_color['saturation'].mean(), 1),
        'Avg Text Coverage %': round(subset_cov['text_coverage_pct'].mean(), 1)
    })
summary_table = pd.DataFrame(summary_records).set_index('Category')
print("=" * 48)
print("FULL FEATURE SUMMARY BY CATEGORY")
print("=" * 48)
print(summary_table.T.to_string())
================================================
FULL FEATURE SUMMARY BY CATEGORY
================================================
Category              US Politics         COVID
N (train)                    2852          2700
Avg Sentiment               0.103           0.0
Top Villain          donald trump  donald trump
Top Hero             donald trump  chuck norris
Top Victim           donald trump  donald trump
Avg Brightness              132.3         123.6
Avg Saturation               55.2          76.6
Avg Text Coverage %          12.9          16.4
from plotly.subplots import make_subplots
fig9 = make_subplots(
    rows=1, cols=2,
    subplot_titles=('US Politics Memes', 'COVID Memes'),
    shared_yaxes=True
)
# create villain_count column
train_df['villain_count'] = train_df['villain_parsed'].apply(len)
for i, (cat, color) in enumerate([('US Politics', '#e74c3c'), ('COVID', '#3498db')], 1):
    subset = train_df[train_df['category'] == cat]
    # cap villain count at 3 (covers 99.4% of data)
    subset = subset[subset['villain_count'] <= 3].copy()
    # filled density contours: darker = more memes in that region
    fig9.add_trace(
        go.Histogram2dContour(
            x=subset['sentiment'],
            y=subset['villain_count'],
            colorscale=[[0, 'rgba(255,255,255,0)'], [1, color]],
            showscale=False,
            ncontours=15,
            contours=dict(showlines=True, coloring='fill'),
            line=dict(width=0.5, color='rgba(0,0,0,0.25)'),
            name=cat
        ),
        row=1, col=i
    )
    # faint individual dots on top
    fig9.add_trace(
        go.Scatter(
            x=subset['sentiment'],
            y=subset['villain_count'],
            mode='markers',
            marker=dict(size=3, color=color, opacity=0.25),
            showlegend=False,
            hovertemplate='Sentiment: %{x:.2f}<br>Villains: %{y}<extra></extra>'
        ),
        row=1, col=i
    )
fig9.update_layout(
    title=dict(
        text='Where Do Memes Cluster? Sentiment vs. Villain Tags<br>'
             '<sup>Darker = more memes concentrated in that area | '
             'faint dots = individual memes | villain count capped at 3</sup>',
        x=0.5
    ),
    height=480,
    font=dict(size=11),
    showlegend=False
)
fig9.update_xaxes(
    title_text='Sentiment Score (negative <- 0 -> positive)',
    zeroline=True, zerolinecolor='grey', zerolinewidth=1.5,
    range=[-1.05, 1.05],
    showgrid=True, gridcolor='rgba(0,0,0,0.08)'
)
fig9.update_yaxes(
    title_text='Number of Villain Tags',
    dtick=1,
    range=[-0.3, 3.3],
    showgrid=True, gridcolor='rgba(0,0,0,0.08)'
)
fig9.show()
5: Combined Summary (brief interpretation based on code results and figures above):¶
Looking at the summary table first, the two categories are similar in size (2,852 vs. 2,700 memes) but differ meaningfully across almost every other feature. US Politics memes have a slightly positive average sentiment (0.103), while COVID memes average 0.0 after rounding. Moreover, Donald Trump is simultaneously the top villain, top hero, AND top victim in US Politics memes, which really drives home how polarized the political meme landscape is, i.e., the same person means completely different things to different communities.
COVID memes, on the other hand, are more all over the place: Chuck Norris is the top hero while Donald Trump is both top villain and top victim, which reflects the mix of absurdist humor and pandemic blame we saw throughout the analysis.
The image features back this up too. COVID memes score higher on both saturation (76.6 vs. 55.2) and text coverage (16.4% vs. 12.9%), which lines up with their image macro format, while politics memes are brighter overall (132.3 vs. 123.6), likely driven by the high-contrast screenshotted content.
Finally, the density plot is where it gets really interesting. US Politics memes show two distinct clusters: one centered right around 0 sentiment with 0-1 villain tags (the neutral/observational memes) and a second blob shifted toward positive sentiment around 0.5 with 1 villain tag (the partisan "our side is winning" memes). That split shape tells us there are basically two types of politics memes in this dataset. COVID memes, by contrast, form one tight cluster right at 0 sentiment and 0-1 villain tags, i.e., much more uniform, with less variation, less "celebration," and more flat neutrality or low-level negativity. In other words, COVID didn't really have a "winning side" to celebrate, which shows up directly in the shape of the distribution.
6. Stability Check (do our findings hold on the validation set?)¶
Recall that we loaded a validation set of 650 memes at the start of this project, but we haven't used it until now. Note that since we aren't building a classifier, there's no model to validate in the traditional sense, but we can still use the validation set to check whether the patterns we found in the training set show up consistently in "unseen data". If they do, that gives us more confidence that our findings reflect something real about these meme categories rather than quirks of the specific 5,552 rows we analyzed. Thus, in this step we re-run the same entity frequency analysis (top villains, heroes, and victims) and sentiment comparison on the validation set and check how closely the results match what we found in training.
NOTE: we only validate the text analysis findings here because the image analysis from Step 4 used the 16 sample PNG files which don't have a separate validation set, so there's nothing to replicate on that side.
# apply the same preprocessing to val_df
val_df['ocr_clean'] = val_df['OCR'].apply(clean_ocr)
val_df['word_count'] = val_df['ocr_clean'].apply(lambda x: len(x.split()))
val_df['category'] = val_df['image'].apply(
    lambda x: 'COVID' if str(x).startswith('covid') else 'US Politics'
)
for col in ['hero', 'villain', 'victim', 'other']:
    val_df[col + '_parsed'] = val_df[col].apply(parse_tag)
val_df['sentiment'] = val_df['ocr_clean'].apply(get_compound)
print("Validation set category breakdown:")
print(val_df['category'].value_counts())
print()
# compare top villains, heroes, and victims between train and validation sets
for role in ['villain', 'hero', 'victim']:
    print("=" * 55)
    print(f"TOP 5 {role.upper()}S -- TRAIN vs. VALIDATION")
    print("=" * 55)
    for cat in ['US Politics', 'COVID']:
        train_top = get_entity_counts(train_df, role, cat).most_common(5)
        val_top = get_entity_counts(val_df, role, cat).most_common(5)
        print(f"\n\t{cat}:")
        print(f"\tTrain: {[x[0] for x in train_top]}")
        print(f"\tValidation: {[x[0] for x in val_top]}")
    print()
# compare sentiment
print("=" * 55)
print("AVG SENTIMENT -- TRAIN vs. VALIDATION")
print("=" * 55)
for cat in ['US Politics', 'COVID']:
    train_sent = train_df[train_df['category']==cat]['sentiment'].mean()
    val_sent = val_df[val_df['category']==cat]['sentiment'].mean()
    print(f"\t{cat}: train={round(train_sent,3)} validation={round(val_sent,3)}")
Validation set category breakdown:
category
US Politics    350
COVID          300
Name: count, dtype: int64

=======================================================
TOP 5 VILLAINS -- TRAIN vs. VALIDATION
=======================================================

    US Politics:
    Train: ['donald trump', 'joe biden', 'democratic party', 'republican party', 'barack obama']
    Validation: ['donald trump', 'joe biden', 'democratic party', 'republican party', 'republicans']

    COVID:
    Train: ['donald trump', 'china', 'coronavirus', '2020', 'covid19']
    Validation: ['donald trump', 'coronavirus', 'china', 'wuhan', 'joe biden']

=======================================================
TOP 5 HEROS -- TRAIN vs. VALIDATION
=======================================================

    US Politics:
    Train: ['donald trump', 'barack obama', 'green party', 'libertarian party', 'joe biden']
    Validation: ['donald trump', 'barack obama', 'republican party', 'joe biden', 'libertarian']

    COVID:
    Train: ['chuck norris', 'corona beer', 'weed', 'dr. anthony fauci', 'joe biden']
    Validation: ['joe biden', 'goicho saib', 'kamala harris', 'zoom meeting', 'darth vader']

=======================================================
TOP 5 VICTIMS -- TRAIN vs. VALIDATION
=======================================================

    US Politics:
    Train: ['donald trump', 'america', 'barack obama', 'democratic party', 'women']
    Validation: ['republican party', 'america', 'donald trump', 'barack obama', 'people']

    COVID:
    Train: ['donald trump', 'people', 'china', 'coronavirus', 'parents']
    Validation: ['donald trump', 'doctors', 'america', 'world', 'italy']

=======================================================
AVG SENTIMENT -- TRAIN vs. VALIDATION
=======================================================
    US Politics: train=0.103 validation=0.127
    COVID: train=0.0 validation=0.009
# visual sanity check
# grouped bar chart comparing train vs. validation for top 5 entities per role (villain, hero, victim) + sentiment
# if train and validation bars are similar heights, findings replicate
from plotly.subplots import make_subplots
fig_val = make_subplots(
    rows=2, cols=4,
    subplot_titles=(
        'US Politics: Villains', 'COVID: Villains',
        'US Politics: Heroes', 'COVID: Heroes',
        'US Politics: Victims', 'COVID: Victims',
        'Sentiment Comparison', ''
    ),
    vertical_spacing=0.18,
    horizontal_spacing=0.08
)
roles = ['villain', 'hero', 'victim']
positions = [(1,1), (1,2), (1,3), (1,4), (2,1), (2,2)]
for idx, (role, cat) in enumerate([(r, c) for r in roles for c in ['US Politics', 'COVID']]):
    row, col = positions[idx]
    # get top 5 entities from training set
    train_top = get_entity_counts(train_df, role, cat).most_common(5)
    entities = [x[0] for x in train_top]
    # get counts for those same entities in validation set
    val_counts = get_entity_counts(val_df, role, cat)
    train_counts = [x[1] for x in train_top]
    val_counts_ = [val_counts.get(e, 0) for e in entities]
    # normalize by dataset size so comparison is fair
    # train has 5552 rows, val has 650 (need to normalize to per-1000 memes)
    # note: the denominator is the whole split, not the category subset
    train_norm = [c / len(train_df) * 1000 for c in train_counts]
    val_norm = [c / len(val_df) * 1000 for c in val_counts_]
    # truncate long entity names for readability
    short_entities = [e[:12] + '..' if len(e) > 12 else e for e in entities]
    # train bars
    fig_val.add_trace(go.Bar(
        x=short_entities, y=train_norm,
        name='Train',
        marker_color='#2ecc71',
        opacity=0.85,
        showlegend=(idx == 0),
        legendgroup='train'
    ), row=row, col=col)
    # validation bars
    fig_val.add_trace(go.Bar(
        x=short_entities, y=val_norm,
        name='Validation',
        marker_color='#f39c12',
        opacity=0.85,
        showlegend=(idx == 0),
        legendgroup='val'
    ), row=row, col=col)
# sentiment comparison panel (row 2, col 3)
categories = ['US Politics', 'COVID']
train_sents = [train_df[train_df['category']==c]['sentiment'].mean() for c in categories]
val_sents = [val_df[val_df['category']==c]['sentiment'].mean() for c in categories]
fig_val.add_trace(go.Bar(
    x=categories, y=train_sents,
    name='Train', marker_color='#2ecc71',
    opacity=0.85, showlegend=False,
    legendgroup='train'
), row=2, col=3)
fig_val.add_trace(go.Bar(
    x=categories, y=val_sents,
    name='Validation', marker_color='#f39c12',
    opacity=0.85, showlegend=False,
    legendgroup='val'
), row=2, col=3)
fig_val.update_layout(
    barmode='group',
    title=dict(
        text='Train vs. Validation: Do the Findings Replicate?<br>'
             '<sup>Counts normalized per 1,000 memes so train and validation are directly comparable</sup>',
        x=0.5
    ),
    height=620,
    font=dict(size=10),
    legend=dict(x=0.88, y=0.15),
    showlegend=True
)
fig_val.update_yaxes(title_text='Count per 1,000 memes')
fig_val.update_yaxes(title_text='Avg Sentiment Score', row=2, col=3)
fig_val.show()
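The per-1,000 normalization in the chart divides each tag count by the size of the whole split. The sketch below shows that calculation next to an alternative that divides by the size of the category subset instead. The split and category sizes are the ones reported in this notebook; the helper names (`rate_per_1000`, `both_rates`) and the tag counts passed in are hypothetical and only there for illustration.

```python
# split sizes reported earlier in the notebook
SIZES = {
    "train": {"US Politics": 2852, "COVID": 2700},
    "validation": {"US Politics": 350, "COVID": 300},
}


def rate_per_1000(count, denominator):
    """Scale a raw count to a rate per 1,000 memes."""
    return round(count / denominator * 1000, 1)


def both_rates(count, split, category):
    """Return the rate against the whole split and against the category only."""
    whole_split = sum(SIZES[split].values())
    return {
        "per_1000_all_memes": rate_per_1000(count, whole_split),
        "per_1000_category_memes": rate_per_1000(count, SIZES[split][category]),
    }


# hypothetical tag counts, chosen only to illustrate the arithmetic
print(both_rates(100, "train", "US Politics"))
print(both_rates(12, "validation", "US Politics"))
```

Because US Politics makes up a similar share of both splits (2,852 of 5,552 vs. 350 of 650), the two denominators lead to much the same train-vs-validation comparison here; the category-level rate would matter more if the splits had very different category mixes.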
6: Stability Check (brief interpretation based on code results and figures above):¶
The villain panels are the most reassuring part of this stability check: the bars are nearly identical in height between train and validation for both categories. In fact, the top 4 US Politics villains are exactly the same in both sets (Donald Trump, Joe Biden, Democratic Party, Republican Party), with only the 5th spot differing. COVID villains also replicate well, with Trump, China, and coronavirus all appearing in both lists. Note that the missing bars for "2020" and "covid19" in the COVID validation panel just mean those exact strings weren't tagged in the validation set (this is not a problem, just natural variation in how annotators phrased things).
Hero replication is also pretty decent for US Politics, where Trump and Obama stay at the top in both sets. However, this falls apart for COVID: Chuck Norris, corona beer, and weed drop out of the validation top 5 entirely, which is instead led by Joe Biden (the only name in both lists) alongside entries like Darth Vader and "zoom meeting". This isn't really a red flag though. It just confirms what we said in the original analysis, i.e., that COVID hero tags are sparse and somewhat random (reflecting absurdist humor rather than consistent ideological framing). With so few hero tags in the COVID category to begin with, small-sample variation is expected and the bars are just too small to be stable.
Next, we can see victim patterns hold reasonably well. Namely, America, Donald Trump, and Barack Obama appear in both US Politics lists, and Donald Trump stays the top COVID victim in both sets too.
Finally, sentiment is probably the cleanest replication of everything here. US Politics goes from 0.103 to 0.127 and COVID goes from 0.000 to 0.009 (essentially the same result in both sets), which is also clear in the sentiment panel, where the bar heights are almost identical.
Overall, from this stability check we can confirm the core findings replicate. Note that the one area that doesn't (COVID heroes) was already flagged as the weakest and most unpredictable part of the analysis, so even this is still consistent with what we expected.
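One simple way to put a number on "replicates" is to count how many names the train and validation top-5 lists share, and to compute the Jaccard index (shared names divided by all distinct names). The sketch below does this with the lists exactly as printed by the comparison cell above; the metric choice and the names `TOP5` and `top_k_overlap` are illustrative assumptions, not part of the original analysis.

```python
# (role, category) -> (train top 5, validation top 5), copied from the printed output
TOP5 = {
    ("villain", "US Politics"): (
        ["donald trump", "joe biden", "democratic party", "republican party", "barack obama"],
        ["donald trump", "joe biden", "democratic party", "republican party", "republicans"],
    ),
    ("villain", "COVID"): (
        ["donald trump", "china", "coronavirus", "2020", "covid19"],
        ["donald trump", "coronavirus", "china", "wuhan", "joe biden"],
    ),
    ("hero", "US Politics"): (
        ["donald trump", "barack obama", "green party", "libertarian party", "joe biden"],
        ["donald trump", "barack obama", "republican party", "joe biden", "libertarian"],
    ),
    ("hero", "COVID"): (
        ["chuck norris", "corona beer", "weed", "dr. anthony fauci", "joe biden"],
        ["joe biden", "goicho saib", "kamala harris", "zoom meeting", "darth vader"],
    ),
    ("victim", "US Politics"): (
        ["donald trump", "america", "barack obama", "democratic party", "women"],
        ["republican party", "america", "donald trump", "barack obama", "people"],
    ),
    ("victim", "COVID"): (
        ["donald trump", "people", "china", "coronavirus", "parents"],
        ["donald trump", "doctors", "america", "world", "italy"],
    ),
}


def top_k_overlap(train, val):
    """Return (number of shared names, Jaccard index rounded to 2 decimals)."""
    shared = set(train) & set(val)
    union = set(train) | set(val)
    return len(shared), round(len(shared) / len(union), 2)


overlap = {key: top_k_overlap(*lists) for key, lists in TOP5.items()}
for (role, cat), (n_shared, jaccard) in overlap.items():
    print(f"{role:8s} {cat:12s} shared={n_shared}/5  jaccard={jaccard}")
```

On these lists, US Politics villains share 4 of 5 names, while COVID heroes and COVID victims each share only 1. Exact string matching also undercounts near-duplicates such as "republican party" vs. "republicans", so the figures are a rough lower bound on agreement.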
7. Expand Analysis to Audio/Speech Data¶
This next part of the project adds a third modality to our analysis, namely spoken political discourse. For this step specifically, we use the 2020 US Presidential Election Campaign Speeches Dataset, a peer-reviewed corpus of 1,081 cleaned and curated speeches delivered by Donald Trump, Joe Biden, Mike Pence, and Kamala Harris between January 2019 and January 2021 (i.e., chosen specifically to overlap with COVID). Speeches were sourced from C-SPAN, VoteSmart, Medium, and the Miller Center.
The central question this step asks is: does the framing in political memes reflect what politicians are actually saying in their speeches, or does meme culture exaggerate and distort it?
We approach this by running the same text analysis methods from Step 3 (VADER sentiment, TF-IDF, keyword search) on the speech transcripts and comparing the results directly to our meme findings.
7.1 Load and Prepare the Speech Dataset¶
NOTE: this dataset is organized by speaker and source site. Here we load all TSV files, combine them into a single DataFrame, add party labels, and apply the same VADER sentiment pipeline used on the meme data in Step 3.
# load all speech TSVs from the data_clean folder
# the dataset is organized as data_clean/{source}/{Speaker}/cleantext_{Speaker}.tsv
# loop through all TSVs, load each one, and tag with speaker and source
import os
import glob
speech_dfs = []
data_clean_path = '../data/data_clean'
for tsv_path in glob.glob(f'{data_clean_path}/**/*.tsv', recursive=True):
    parts = tsv_path.replace('\\', '/').split('/')
    source = parts[-3]   # e.g., cspan, votesmart, medium, millercenter
    speaker = parts[-2]  # e.g., DonaldTrump, JoeBiden
    df = pd.read_csv(tsv_path, sep='\t', on_bad_lines='skip')
    df['speaker'] = speaker
    df['source_site'] = source
    speech_dfs.append(df)
# combine once, after every file has been loaded
speech_df = pd.concat(speech_dfs, ignore_index=True)
# add party label
party_map = {
    'DonaldTrump': 'Republican',
    'MikePence': 'Republican',
    'JoeBiden': 'Democrat',
    'KamalaHarris': 'Democrat'
}
speech_df['party'] = speech_df['speaker'].map(party_map)
# normalize speech type column
speech_df['Type'] = speech_df['Type'].str.strip().str.lower()
# parse date
speech_df['Date'] = pd.to_datetime(speech_df['Date'], errors='coerce')
# drop rows with no clean text
speech_df = speech_df.dropna(subset=['CleanText']).reset_index(drop=True)
# quick look at data
print(f"Total speeches loaded: {len(speech_df)}")
print()
print("Speeches by speaker:")
print(speech_df['speaker'].value_counts())
print()
print("Speeches by party:")
print(speech_df['party'].value_counts())
print()
print("Date range:", speech_df['Date'].min().date(),
      "to", speech_df['Date'].max().date())
Total speeches loaded: 1081

Speeches by speaker:
speaker
JoeBiden        420
DonaldTrump     361
KamalaHarris    164
MikePence       136
Name: count, dtype: int64

Speeches by party:
party
Democrat      584
Republican    497
Name: count, dtype: int64

Date range: 2019-01-08 to 2021-01-29
# apply VADER sentiment to each speech
# set up using the same get_compound function used in Step 3 (already defined earlier in notebook)
# CleanText is the pre-cleaned speech transcript from the dataset authors
# lowercase CleanText before sentiment (VADER is case-sensitive)
# the Republican speeches are stored in ALL CAPS, which would otherwise artificially inflate scores
speech_df['CleanText_lower'] = speech_df['CleanText'].str.lower()
# fix: score sentiment per 50-word chunk rather than full speech
# VADER was designed for short texts
# scoring 3000-word documents compounds toward extreme values and isn't meaningful
# chunking gives a fairer per-unit comparison across speakers
import re
def sentiment_by_chunks(text, chunk_size=50):
    """Split text into chunks of chunk_size words and average the VADER compound scores."""
    if not isinstance(text, str) or len(text.strip()) == 0:
        return 0.0
    words = text.split()
    chunks = [' '.join(words[i:i+chunk_size])
              for i in range(0, len(words), chunk_size)]
    scores = [get_compound(chunk) for chunk in chunks if chunk.strip()]
    return round(sum(scores) / len(scores), 4) if scores else 0.0
speech_df['sentiment'] = speech_df['CleanText_lower'].apply(sentiment_by_chunks)
print("Sentiment summary by speaker (chunked):")
print(speech_df.groupby('speaker')['sentiment'].describe().round(3))
print()
print("Sentiment summary by party (chunked):")
print(speech_df.groupby('party')['sentiment'].describe().round(3))
Sentiment summary by speaker (chunked):
              count   mean    std    min    25%    50%    75%    max
speaker
DonaldTrump   361.0  0.344  0.309 -0.901  0.192  0.306  0.551  0.942
JoeBiden      420.0  0.132  0.317 -0.737 -0.063  0.181  0.329  0.887
KamalaHarris  164.0  0.217  0.421 -0.848 -0.023  0.271  0.498  0.959
MikePence     136.0  0.507  0.193 -0.612  0.447  0.560  0.634  0.779

Sentiment summary by party (chunked):
            count   mean    std    min    25%    50%    75%    max
party
Democrat    584.0  0.156  0.351 -0.848 -0.053  0.208  0.371  0.959
Republican  497.0  0.389  0.291 -0.901  0.227  0.375  0.595  0.942
7.2 Sentiment: Speeches vs. Memes¶
Next, we compare the average VADER sentiment of political speeches to the sentiment we found in political memes in Step 3. The key question we hope to explore is: are politicians speaking in a more positive or negative tone than the memes about them suggest? Note that because the speech transcripts are much longer than meme captions (averaging 1,000-3,000 words vs. ~19 words for memes), we score sentiment per 50-word chunk and average across chunks rather than scoring each speech as one document. This gives a fairer comparison to the meme-level sentiment scores from Step 3.
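The effect of chunking can be illustrated without VADER itself. The sketch below uses a hypothetical stand-in scorer (`toy_compound`, with a made-up two-word lexicon) that sums word valences and squashes the total into (-1, 1) using x / sqrt(x^2 + 15). That squashing is assumed here as a rough analogue of how a bounded compound score behaves; it is not taken from the notebook, and the numbers are purely illustrative.

```python
import math


def chunk_words(text, chunk_size=50):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


# hypothetical stand-in lexicon (NOT VADER's lexicon)
TOY_VALENCE = {"great": 2.0, "terrible": -2.0}


def toy_compound(text):
    """Sum word valences, then squash the total into the open interval (-1, 1)."""
    raw = sum(TOY_VALENCE.get(w, 0.0) for w in text.lower().split())
    return raw / math.sqrt(raw * raw + 15)


def mean_chunk_score(text, scorer, chunk_size=50):
    """Average the scorer's output over fixed-size word chunks."""
    chunks = chunk_words(text, chunk_size)
    if not chunks:
        return 0.0
    return sum(scorer(c) for c in chunks) / len(chunks)


# 150-word toy "speech": four positive words up front, then neutral filler
speech = " ".join(["great"] * 4 + ["the"] * 146)

whole_doc = toy_compound(speech)                 # scored as one long document
chunked = mean_chunk_score(speech, toy_compound)  # scored per 50-word chunk

print(len(chunk_words(speech)), round(whole_doc, 2), round(chunked, 2))
```

Scored as one document, four positive words are enough to push the toy score to about 0.9; averaged over three 50-word chunks, the same text scores about 0.3, because the two neutral chunks pull the mean down. This is the sense in which chunking gives a fairer per-unit comparison with short meme captions.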
# sentiment distribution by speaker (box plot)
fig_s1 = px.box(
    speech_df,
    x='speaker', y='sentiment',
    color='party',
    color_discrete_map={'Republican': '#e74c3c', 'Democrat': '#3498db'},
    title='Sentiment Score by Speaker: 2020 Campaign Speeches (Scored per 50-word chunk and averaged - avoids VADER inflation on long documents)',
    labels={'sentiment': 'VADER Compound Score', 'speaker': 'Speaker'},
    points='outliers',
    category_orders={'speaker': ['DonaldTrump', 'MikePence', 'JoeBiden', 'KamalaHarris']}
)
fig_s1.update_layout(height=420, font=dict(size=11))
fig_s1.show()
# cross-modal comparison: meme sentiment vs speech sentiment
# this is the key comparison chart of Step 7
# plot average sentiment for each speaker/category side by side so we can directly see whether memes match the tone of real speeches
# meme sentiment averages (from Step 3)
meme_politics_sent = train_df[train_df['category']=='US Politics']['sentiment'].mean()
meme_covid_sent = train_df[train_df['category']=='COVID']['sentiment'].mean()
# speech sentiment averages by speaker (chunked scores from 7.1)
speech_sent = speech_df.groupby('speaker')['sentiment'].mean()
# build comparison dataframe
comparison_data = []
# meme rows
comparison_data.append({'label': 'US Politics Memes', 'sentiment': meme_politics_sent, 'type': 'Meme', 'color': '#e74c3c'})
comparison_data.append({'label': 'COVID Memes', 'sentiment': meme_covid_sent, 'type': 'Meme', 'color': '#3498db'})
# speech rows
speaker_colors = {
    'DonaldTrump': '#c0392b',
    'MikePence': '#e67e22',
    'JoeBiden': '#2980b9',
    'KamalaHarris': '#8e44ad'
}
speaker_labels = {
    'DonaldTrump': 'Trump Speeches',
    'MikePence': 'Pence Speeches',
    'JoeBiden': 'Biden Speeches',
    'KamalaHarris': 'Harris Speeches'
}
for spk, sent in speech_sent.items():
    comparison_data.append({
        'label': speaker_labels[spk],
        'sentiment': sent,
        'type': 'Speech',
        'color': speaker_colors[spk]
    })
comp_df = pd.DataFrame(comparison_data)
# horizontal bar chart (easy to compare at a glance)
fig_s2 = go.Figure()
for _, row in comp_df.iterrows():
    fig_s2.add_trace(go.Bar(
        y=[row['label']],
        x=[row['sentiment']],
        orientation='h',
        marker_color=row['color'],
        name=row['label'],
        showlegend=False,
        hovertemplate=f"{row['label']}: %{{x:.3f}}<extra></extra>"
    ))
# add a vertical line at 0
fig_s2.add_vline(x=0, line_width=1.5, line_dash='dash', line_color='grey')
# shade meme rows differently
fig_s2.add_hrect(y0=-0.5, y1=1.5, fillcolor='rgba(200,200,200,0.15)',
                 line_width=0, annotation_text='Memes',
                 annotation_position='right')
fig_s2.add_hrect(y0=1.5, y1=5.5, fillcolor='rgba(255,255,255,0)',
                 line_width=0, annotation_text='Speeches',
                 annotation_position='right')
fig_s2.update_layout(
    title=dict(
        text='Meme Sentiment vs. Speech Sentiment (Direct Comparison)<br>'
             '<sup>VADER compound score: negative <- 0 -> positive | dashed line = neutral | '
             'speech scores averaged per 50-word chunk</sup>',
        x=0.5
    ),
    xaxis=dict(title='Avg VADER Compound Score', range=[-0.3, 0.7]),
    yaxis=dict(title=''),
    height=420,
    font=dict(size=11),
    margin=dict(r=100)
)
fig_s2.show()
7.2: Sentiment (brief interpretation based on figures above):¶
The most immediate takeaway from this chart is that all four politicians speak more positively than the memes about them suggest. Every speech bar sits to the right of both meme bars, which tells us that meme culture is generally more negative than the actual speech content it draws from.
The gap is most dramatic for Pence, whose speeches average about 0.51 (well above everyone else), likely reflecting his formal, ceremonial speechmaking style, which VADER reads as consistently upbeat. Trump's speeches land around 0.34, noticeably more positive than US Politics memes (0.10). This makes sense: in his own speeches, Trump is typically boastful and triumphant, while memes about him skew toward mockery and blame.
The Democratic side is more moderate, with Biden sitting around 0.13 and Harris around 0.22, i.e., closer to the meme scores (Biden is only slightly above the 0.10 of US Politics memes) but still more positive than both meme categories.
Finally, COVID memes sitting right at 0.0 is the starkest contrast. No politician in this dataset averages anywhere near that flat; even the most measured speeches are more positive than pandemic meme discourse. This backs up what we found in Step 3: COVID memes are uniquely emotionally flat, not because the topic is unimportant but because the meme format processes collective anxiety through detachment and irony rather than the optimistic framing politicians tend to use in public speeches.
7.3 TF-IDF: Speech Vocabulary vs. Meme Vocabulary¶
In this part, we run TF-IDF on the speech transcripts, treating each speaker as one big document, then compare the top distinctive terms to the meme TF-IDF results from Step 3.3. The key questions we hope to answer are: do the words that define Trump's speeches show up in the meme vocabulary, and does Biden's speech vocabulary connect to COVID meme language?
# TF-IDF on speeches by speaker
# same approach as Step 3.3: treat each speaker as one big document then find the most distinctive vocabulary per speaker
from sklearn.feature_extraction.text import TfidfVectorizer
# combine all speech text per speaker into one string
speaker_docs = {}
for spk in ['DonaldTrump', 'JoeBiden', 'MikePence', 'KamalaHarris']:
    texts = speech_df[speech_df['speaker']==spk]['CleanText'].dropna()
    speaker_docs[spk] = ' '.join(texts.str.lower())
speakers = list(speaker_docs.keys())
documents = list(speaker_docs.values())
# fit TF-IDF -- include bigrams for more meaningful phrases
tfidf_speech = TfidfVectorizer(
    stop_words='english',
    max_features=5000,
    ngram_range=(1, 2)
)
tfidf_matrix_speech = tfidf_speech.fit_transform(documents)
feature_names_speech = tfidf_speech.get_feature_names_out()
def top_tfidf_terms(matrix, row_idx, feature_names, n=15):
    """Return the n highest-weighted (term, score) pairs for one row of a TF-IDF matrix."""
    row = matrix[row_idx].toarray().flatten()
    top_idx = row.argsort()[::-1][:n]
    return [(feature_names[i], round(row[i], 4)) for i in top_idx]
print("Top TF-IDF terms by speaker:")
for i, spk in enumerate(speakers):
    terms = top_tfidf_terms(tfidf_matrix_speech, i, feature_names_speech, n=12)
    print(f"\n  {spk}:")
    for term, score in terms:
        print(f"    {term}: {score}")
Top TF-IDF terms by speaker:

  DonaldTrump:
    people: 0.2769
    great: 0.2386
    know: 0.2371
    said: 0.2351
    going: 0.2222
    want: 0.1917
    don: 0.1866
    like: 0.1748
    right: 0.1516
    thank: 0.1494
    country: 0.1468
    years: 0.1444

  JoeBiden:
    president: 0.2772
    people: 0.2712
    trump: 0.208
    america: 0.1763
    american: 0.1713
    going: 0.1663
    know: 0.1581
    country: 0.156
    just: 0.1325
    need: 0.1318
    make: 0.1283
    nation: 0.1215

  MikePence:
    president: 0.3737
    american: 0.2727
    applause: 0.2214
    america: 0.2141
    people: 0.1968
    know: 0.1912
    just: 0.171
    great: 0.16
    today: 0.1544
    trump: 0.1407
    years: 0.1394
    ve: 0.1387

  KamalaHarris:
    people: 0.3991
    know: 0.2257
    president: 0.1806
    let: 0.1742
    country: 0.1735
    america: 0.1735
    joe: 0.1573
    states: 0.1562
    justice: 0.1333
    health: 0.1307
    united: 0.1299
    fight: 0.1269
# visualize TF-IDF per speaker as side-by-side bar charts
fig_s3 = make_subplots(rows=1, cols=4,
                       subplot_titles=('Trump', 'Biden', 'Pence', 'Harris'))
speaker_colors_tfidf = {
    'DonaldTrump': '#c0392b',
    'JoeBiden': '#2980b9',
    'MikePence': '#e67e22',
    'KamalaHarris': '#8e44ad'
}
for col_idx, spk in enumerate(speakers, 1):
    i = speakers.index(spk)
    terms = top_tfidf_terms(tfidf_matrix_speech, i, feature_names_speech, n=12)
    # reverse so the highest-scoring term is drawn at the top of the horizontal bar chart
    words = [t[0] for t in terms][::-1]
    scores = [t[1] for t in terms][::-1]
    fig_s3.add_trace(
        go.Bar(y=words, x=scores, orientation='h',
               marker_color=speaker_colors_tfidf[spk]), row=1, col=col_idx
    )
fig_s3.update_layout(
    title='TF-IDF: Most Distinctive Vocabulary per Speaker (Campaign Speeches)',
    height=500,
    showlegend=False, font=dict(size=9)
)
fig_s3.show()
# overlap analysis: which meme TF-IDF terms also appear in speech TF-IDF?
# this directly answers: does meme vocabulary mirror speech vocabulary?
# get top 30 terms from each meme category (from Step 3.3)
# assumes tfidf, tfidf_matrix, feature_names are still in memory from Step 3.3
politics_meme_terms = set(t[0] for t in top_tfidf_terms(tfidf_matrix, 0, feature_names, n=30))
covid_meme_terms = set(t[0] for t in top_tfidf_terms(tfidf_matrix, 1, feature_names, n=30))
# get top 30 terms from each speaker's speeches
print("Vocabulary overlap -- Meme terms that also appear in Speech TF-IDF top 30:")
print()
for i, spk in enumerate(speakers):
    speech_terms = set(t[0] for t in top_tfidf_terms(tfidf_matrix_speech, i, feature_names_speech, n=30))
    overlap_politics = politics_meme_terms & speech_terms
    overlap_covid = covid_meme_terms & speech_terms
    print(f" {spk}:")
    print(f" Overlap with US Politics meme terms ({len(overlap_politics)}): "
          f"{sorted(overlap_politics)}")
    print(f" Overlap with COVID meme terms ({len(overlap_covid)}): "
          f"{sorted(overlap_covid)}")
    print()
Vocabulary overlap -- Meme terms that also appear in Speech TF-IDF top 30:

 DonaldTrump:
 Overlap with US Politics meme terms (7): ['america', 'biden', 'don', 'just', 'like', 'people', 'president']
 Overlap with COVID meme terms (7): ['don', 'just', 'like', 'new', 'people', 'time', 'world']

 JoeBiden:
 Overlap with US Politics meme terms (8): ['america', 'donald', 'donald trump', 'just', 'like', 'people', 'president', 'trump']
 Overlap with COVID meme terms (7): ['just', 'like', 'people', 'time', 'trump', 'work', 'world']

 MikePence:
 Overlap with US Politics meme terms (6): ['america', 'just', 'like', 'people', 'president', 'trump']
 Overlap with COVID meme terms (8): ['day', 'just', 'like', 'people', 'time', 'trump', 'work', 'world']

 KamalaHarris:
 Overlap with US Politics meme terms (9): ['america', 'biden', 'joe', 'joe biden', 'just', 'people', 'president', 'trump', 'vote']
 Overlap with COVID meme terms (6): ['just', 'people', 'time', 'trump', 'work', 'working']
7.3: TF-IDF (brief interpretation based on code results and figures above):¶
The TF-IDF chart above shows each speaker's most distinctive vocabulary, the words that set them apart from the others. A few things jump out. First, Trump's chart is dominated by conversational filler words like "people," "great," "know," "said," and "going," which reflects his rally-style speaking pattern more than any specific policy substance. Second, Biden and Pence both lead with "president" as their top term, which makes sense since both frequently reference Trump by title in their speeches. Lastly, Harris's chart is the most substantively distinct, with words like "justice," "health," "fight," and "united" pointing to a more issue-focused vocabulary than the others.
The overlap analysis is where it gets interesting. Every speaker overlaps with meme vocabulary primarily through generic high-frequency words like "people," "just," "like," and "time," which tells us that the surface-level vocabulary connection between speeches and memes is mostly incidental rather than meaningful. The more revealing overlaps are the named entities. Biden's speeches overlap with US Politics meme terms on "donald trump," "trump," and "donald," meaning Biden references Trump by name frequently enough that it shows up in his TF-IDF, which parallels how Trump dominates the villain tags in our meme data. Harris similarly overlaps on "trump," "joe biden," and "vote." The takeaway is that meme culture picks up on the same named targets that politicians themselves center in their speeches. In other words, the villain framing in memes is not invented out of nowhere; it reflects who politicians are actually talking about.
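To separate incidental overlap from named-entity overlap more systematically, one option is to drop a small list of generic terms before reading the overlap sets. The sketch below is illustrative only: `GENERIC_TERMS` is a hand-picked, hypothetical stoplist, `split_overlap` is a helper introduced here, and the term set is copied from the Biden overlap printed above rather than recomputed.

```python
# Hypothetical stoplist of generic, high-frequency words (an assumption,
# chosen by hand for illustration, not derived from the data).
GENERIC_TERMS = {"people", "just", "like", "time", "don", "work", "world"}


def split_overlap(overlap_terms, generic=GENERIC_TERMS):
    """Split an overlap set into (generic, distinctive) sorted lists."""
    generic_hits = sorted(t for t in overlap_terms if t in generic)
    distinctive = sorted(t for t in overlap_terms if t not in generic)
    return generic_hits, distinctive


# Biden's overlap with US Politics meme terms, copied from the output above.
biden_politics_overlap = {
    "america", "donald", "donald trump", "just", "like",
    "people", "president", "trump",
}

generic_hits, distinctive = split_overlap(biden_politics_overlap)
print("generic:", generic_hits)
print("distinctive:", distinctive)
```

What remains after filtering is mostly the named targets, which is the part of the overlap the interpretation relies on.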
7.4 COVID Language: Do Speeches Blame the Same Targets as Memes?¶
In Step 3, we found that COVID memes primarily blame China, coronavirus, and Trump. Here we check whether the politicians' speeches from 2020 onward use similar blame-framing language by searching for COVID-related keywords and measuring how often each speaker invokes them.
# COVID keyword frequency in speeches
# filter to speeches from 2020 onward (the pandemic period)
covid_period = speech_df[speech_df['Date'] >= '2020-01-01'].copy()
print(f"Speeches from 2020 onward: {len(covid_period)}")
print(covid_period['speaker'].value_counts())
print()
# define COVID blame-framing keywords found in our meme analysis
covid_keywords = [
'china', 'chinese', 'wuhan', 'virus', 'coronavirus', 'covid',
'pandemic', 'mask', 'lockdown', 'quarantine', 'fauci',
'death', 'blame', 'failed', 'incompetent'
]
# count keyword mentions per speaker (normalized per 1000 words)
def keyword_rate(text, keywords):
    """Count keyword occurrences per 1000 words."""
    if not isinstance(text, str):
        return {k: 0 for k in keywords}
    text_lower = text.lower()
    words = len(text_lower.split())
    if words == 0:
        return {k: 0 for k in keywords}
    # NOTE: str.count matches substrings, so 'virus' is also counted inside
    # 'coronavirus' and 'mask' inside 'masks'
    return {k: text_lower.count(k) / words * 1000 for k in keywords}
# apply to each speech
kw_records = []
for _, row in covid_period.iterrows():
    rates = keyword_rate(row['CleanText'], covid_keywords)
    rates['speaker'] = row['speaker']
    rates['party'] = row['party']
    kw_records.append(rates)
kw_df = pd.DataFrame(kw_records)
# average keyword rate per speaker
kw_summary = kw_df.groupby('speaker')[covid_keywords].mean().round(3)
print("Average keyword mentions per 1,000 words (2020+ speeches):")
print(kw_summary.T.to_string())
Speeches from 2020 onward: 766
speaker
JoeBiden        373
DonaldTrump     263
KamalaHarris     77
MikePence        53
Name: count, dtype: int64

Average keyword mentions per 1,000 words (2020+ speeches):
speaker      DonaldTrump  JoeBiden  KamalaHarris  MikePence
china              0.874     0.297         0.042      0.421
chinese            0.061     0.046         0.024      0.022
wuhan              0.009     0.001         0.000      0.026
virus              1.130     2.032         1.364      1.243
coronavirus        0.604     0.486         0.798      1.184
covid              0.333     1.846         0.980      0.000
pandemic           0.457     2.038         1.730      0.779
mask               0.052     0.287         0.131      0.074
lockdown           0.034     0.024         0.000      0.000
quarantine         0.013     0.041         0.000      0.000
fauci              0.023     0.057         0.000      0.024
death              0.216     0.349         0.180      0.102
blame              0.015     0.032         0.000      0.000
failed             0.087     0.691         0.484      0.083
incompetent        0.008     0.006         0.000      0.000
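One caveat when reading this table: the rates come from `str.count`, which matches substrings, so every "coronavirus" also adds to the "virus" count. A whole-word variant is sketched below for comparison. `keyword_rate_whole_word` is a hypothetical helper, the sample sentence is made up, and the simple regex tokenizer is an assumption rather than the notebook's preprocessing.

```python
import re


def keyword_rate_whole_word(text, keywords):
    """Count whole-word keyword matches per 1,000 words."""
    if not isinstance(text, str):
        return {k: 0.0 for k in keywords}
    # Simple tokenizer (an assumption): runs of letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return {k: 0.0 for k in keywords}
    return {k: tokens.count(k) / len(tokens) * 1000 for k in keywords}


# Made-up sentence to show the difference between the two counting rules.
sample = "The coronavirus is a virus. Masks help stop the virus."

substring_virus = sample.lower().count("virus")   # also matches 'coronavirus'
whole_word = keyword_rate_whole_word(sample, ["virus", "mask"])

print("substring matches for 'virus':", substring_virus)
print("whole-word rates per 1,000 words:", whole_word)
```

The trade-off is that whole-word matching no longer counts plurals such as "masks" toward "mask", so a real rerun would need either explicit plural keywords or a stemming step.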
# visualize COVID keyword usage per speaker as heatmap
# heatmap lets us see at a glance which speakers use which blame-framing words most
# prepare matrix (rows = speakers, columns = keywords)
kw_matrix = kw_summary.values
kw_speakers = kw_summary.index.tolist()
kw_labels = covid_keywords
# normalize each keyword to 0-1 so colors are comparable across keywords
# MinMaxScaler scales column by column, and the keywords are already the columns
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
kw_norm = scaler.fit_transform(kw_matrix)  # normalize per keyword
fig_s4 = go.Figure(data=go.Heatmap(
z=kw_norm,
x=kw_labels,
y=kw_speakers,
colorscale='RdBu_r',
text=kw_matrix.round(2),
texttemplate='%{text}',
textfont={'size': 9},
hovertemplate='%{y} -- %{x}<br>Rate per 1000 words: %{text}<extra></extra>'))
fig_s4.update_layout(
title=dict(
text='COVID Blame-Framing Keywords in 2020+ Speeches<br>'
'<sup>Color = normalized frequency | numbers = raw rate per 1,000 words</sup>',
x=0.5
),
xaxis=dict(title='Keyword', tickangle=-35),
yaxis=dict(title='Speaker'),
height=350,
font=dict(size=10)
)
fig_s4.show()
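Since `MinMaxScaler` scales each column independently, the orientation of the matrix decides whether the colors are normalized per keyword or per speaker. The NumPy sketch below shows per-keyword scaling on a toy matrix with speakers as rows and keywords as columns. The values are made up, and this is an illustration of the idea rather than the notebook's exact pipeline.

```python
import numpy as np

# Toy matrix: rows = 2 speakers, columns = 3 keywords (made-up values).
rates = np.array([
    [0.9, 1.0, 0.0],
    [0.3, 2.0, 0.0],
])

# Min and range are taken down each column, i.e. per keyword across speakers.
col_min = rates.min(axis=0)
col_range = rates.max(axis=0) - col_min
# Avoid dividing by zero when every speaker has the same rate for a keyword.
safe_range = np.where(col_range == 0, 1.0, col_range)
per_keyword = (rates - col_min) / safe_range

print(per_keyword)
```

With this orientation, the speaker with the highest rate for a keyword always gets 1 in that column, which is what makes colors comparable from one keyword to the next.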
7.4: Covid Language (brief interpretation based on code results and figures above):¶
The heatmap above makes the framing differences between speakers visible at a glance. The most notable pattern is in the "china" column: Trump uses it at 0.87 per 1,000 words, nearly three times Biden's rate (0.30), about double Pence's (0.42), and far above Harris's (0.04). This mirrors what we found in COVID memes, where China was the second most common villain tag. Trump's speeches are therefore a plausible source of that framing: he consistently named China as the origin and cause of the pandemic in his public addresses, and the same language shows up in the meme data.
The Democrats flip the script on "covid" and "pandemic." Biden uses "covid" at 1.85 per 1,000 words and "pandemic" at 2.04, roughly 5.5 and 4.5 times Trump's rates (0.33 and 0.46). Harris follows a similar pattern. This reflects a difference in how the two campaigns talked about the crisis: Democrats named it directly and repeatedly, while Trump leaned more on geographic blame framing ("china" and, to a much smaller extent, "wuhan") than on the disease terminology itself. Pence is the outlier on "coronavirus" (1.18, the highest of any speaker) but uses "covid" at exactly 0, possibly a deliberate stylistic choice consistent with Republican messaging at the time, although his 53 speeches are the smallest sample here.
Finally, the "failed" column is worth noting. Biden uses it at 0.69 per 1,000 words, the highest of anyone, which tracks with his campaign strategy of framing Trump's pandemic response as a failure. That blame language shows up in COVID memes too, just directed at different targets depending on who made the meme. Overall, this heatmap supports the core finding: the blame vocabulary in COVID memes is not random. It maps onto the language politicians were using in their speeches, amplified and simplified into the meme format.
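The speaker-to-speaker comparisons in this interpretation are simple ratios of the table values. As a quick check, the sketch below recomputes them from rates copied out of the printed table. The `ratio` helper and the dictionary layout are introduced here for illustration only.

```python
# Rates per 1,000 words, copied from the keyword table above.
rates = {
    "china": {"DonaldTrump": 0.874, "JoeBiden": 0.297},
    "covid": {"DonaldTrump": 0.333, "JoeBiden": 1.846},
    "pandemic": {"DonaldTrump": 0.457, "JoeBiden": 2.038},
}


def ratio(keyword, numerator, denominator):
    """How many times more often `numerator` uses `keyword` than `denominator`."""
    return rates[keyword][numerator] / rates[keyword][denominator]


china_ratio = ratio("china", "DonaldTrump", "JoeBiden")
covid_ratio = ratio("covid", "JoeBiden", "DonaldTrump")
pandemic_ratio = ratio("pandemic", "JoeBiden", "DonaldTrump")

print(f"china    Trump/Biden: {china_ratio:.2f}x")
print(f"covid    Biden/Trump: {covid_ratio:.2f}x")
print(f"pandemic Biden/Trump: {pandemic_ratio:.2f}x")
```

These are ratios of per-speech averages, so they describe typical speeches rather than each speaker's total word output.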
8. Acoustic Feature Analysis¶
This final step adds actual audio analysis to the project using the M-Arg Multimodal Argumentation Dataset (Mestre et al., 2021), a peer-reviewed corpus of 17 MP3 audio clips from five televised 2020 US election events: the two Trump vs. Biden presidential debates, a Biden town hall, a Trump town hall, and the Pence vs. Harris vice presidential debate. Each clip has force-aligned sentence-level timestamps linking exact seconds of audio to specific utterances, and separate transcript files record who spoke each one.
Following the approach of Chowdhury et al. (2025), we extract four hand-crafted acoustic features per speaker using librosa: Zero Crossing Rate (ZCR), RMS Energy, MFCCs, and Chroma STFT. The central question in this step is: do Trump and Biden sound acoustically different in these recordings, and does that acoustic profile connect to how they are framed in political memes?
8.1 Install librosa and Load Timestamp Data¶
Here, we load the timestamp CSVs (which give us start time, end time, and text per utterance) and the tokenised text split CSVs (which give us speaker labels per utterance). Together these let us clip exactly the right seconds of audio for each speaker without needing the full preprocessed dataset.
# load librosa and set up paths
# NOTE: librosa is the standard Python library for audio feature extraction
import librosa
import librosa.display
import numpy as np
import pandas as pd
import os
import glob
# paths
AUDIO_DIR = '../data/audio_data/data/split audio'
TIMESTAMP_DIR = '../data/audio_data/data/timestamps'
# verify files are found
mp3_files = glob.glob(os.path.join(AUDIO_DIR, '*.mp3'))
ts_files = glob.glob(os.path.join(TIMESTAMP_DIR, '*.csv'))
print(f"MP3 files found: {len(mp3_files)}")
print(f"Timestamp files found: {len(ts_files)}")
print()
for f in sorted(mp3_files):
    print(' ', os.path.basename(f))
MP3 files found: 17
Timestamp files found: 17

  us_election_2020_1st_presidential_debate_part1.mp3
  us_election_2020_1st_presidential_debate_part2.mp3
  us_election_2020_2nd_presidential_debate_part1.mp3
  us_election_2020_2nd_presidential_debate_part2.mp3
  us_election_2020_biden_town_hall_part1.mp3
  us_election_2020_biden_town_hall_part2.mp3
  us_election_2020_biden_town_hall_part3.mp3
  us_election_2020_biden_town_hall_part4.mp3
  us_election_2020_biden_town_hall_part5.mp3
  us_election_2020_biden_town_hall_part6.mp3
  us_election_2020_biden_town_hall_part7.mp3
  us_election_2020_trump_town_hall_1.mp3
  us_election_2020_trump_town_hall_2.mp3
  us_election_2020_trump_town_hall_3.mp3
  us_election_2020_trump_town_hall_4.mp3
  us_election_2020_vice_presidential_debate_1.mp3
  us_election_2020_vice_presidential_debate_2.mp3
ts_records = []
for ts_path in sorted(ts_files):
    basename = os.path.basename(ts_path).replace('_timestamp.csv', '')
    try:
        df = pd.read_csv(ts_path, header=None,
                         names=['id', 'start_time', 'end_time', 'text'],
                         on_bad_lines='skip')
        df['audio_file'] = basename
        ts_records.append(df)
    except Exception as e:
        print(f"Could not load {basename}: {e}")
print(f"Loaded {len(ts_records)} files")
timestamps_df = pd.concat(ts_records, ignore_index=True)
timestamps_df['start_time'] = pd.to_numeric(timestamps_df['start_time'], errors='coerce')
timestamps_df['end_time'] = pd.to_numeric(timestamps_df['end_time'], errors='coerce')
timestamps_df = timestamps_df.dropna(subset=['start_time', 'end_time'])
print(f"Total utterances: {len(timestamps_df)}")
print(f"Audio files covered: {timestamps_df['audio_file'].nunique()}")
print(timestamps_df.head(3).to_string(index=False))
Loaded 17 files
Total utterances: 6527
Audio files covered: 17
id start_time end_time text audio_file
f000001 0.0 5.80 Gentlemen, a lot of people been waiting for this night, so let’s get going. us_election_2020_1st_presidential_debate_part1
f000002 5.8 9.00 Our first subject is the Supreme Court. us_election_2020_1st_presidential_debate_part1
f000003 9.0 17.88 President Trump, you nominated Amy Coney Barrett over the weekend to succeed the late Ruth Bader Ginsburg on the Court. us_election_2020_1st_presidential_debate_part1
# load speaker labels from tokenised text split CSVs
# the _split.csv files in 'tokenised text/' have two columns:
# 'speaker' and 'text' -- one row per utterance in order
# we use these to know which utterances belong to which speaker then cross-reference with the timestamp CSVs by row position
TOKENISED_DIR = '../data/audio_data/data/tokenised text'
# debate stem -> split CSV filename
split_csv_map = {
'us_election_2020_1st_presidential_debate':
'us_election_2020_1st_presidential_debate_split.csv',
'us_election_2020_2nd_presidential_debate':
'us_election_2020_2nd_presidential_debate_split.csv',
'us_election_2020_biden_town_hall':
'us_election_2020_biden_town_hall_split.csv',
'us_election_2020_trump_town_hall':
'us_election_2020_trump_town_hall_split.csv',
'us_election_2020_vice_presidential_debate':
'us_election_2020_vice_presidential_debate_split.csv',
}
speaker_dfs = {}
for stem, csv_name in split_csv_map.items():
    path = os.path.join(TOKENISED_DIR, csv_name)
    try:
        df = pd.read_csv(path)
        speaker_dfs[stem] = df
        print(f"{stem}: {len(df)} rows,\t"
              f"speakers: {df['speaker'].unique().tolist()}")
    except FileNotFoundError:
        print(f"Not found: {path}")
us_election_2020_1st_presidential_debate: 1904 rows, speakers: ['Chris Wallace', 'Joe Biden', 'Donald Trump']
us_election_2020_2nd_presidential_debate: 1674 rows, speakers: ['Kristen Welker', 'Donald Trump', 'Joe Biden']
us_election_2020_biden_town_hall: 838 rows, speakers: ['George Stephanopoulos', 'Joe Biden', 'Audience Member 1', 'Audience Member 2', 'Audience Member 3', 'Voice Over', 'Audience Member 4', 'Audience Member 5', 'Audience Member 6', 'Audience Member 7', 'Audience Member 8', 'Audience Member 9', 'Audience Member 10', 'Audience Member 11']
us_election_2020_trump_town_hall: 1138 rows, speakers: ['Savannah Guthrie', 'Voice Over', 'Donald Trump', 'Audience Member 12', 'Audience Member 13', 'Audience Member 14', 'Audience Member 15', 'Audience Member 16', 'Audience Member 17', 'Audience Member 18', 'Audience Member 19']
us_election_2020_vice_presidential_debate: 1028 rows, speakers: ['Susan Page', 'Kamala Harris', 'Mike Pence']
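The split CSVs total 6,582 rows while the timestamp files yield 6,527 utterances, so matching the two by row position is fragile. One alternative is to join on normalized utterance text and then keep only the candidate's rows. The sketch below uses made-up toy rows. `normalize_text` is a hypothetical helper, and the column names follow the ones shown above, but the real files may need extra cleaning.

```python
import pandas as pd


def normalize_text(s):
    """Lowercase and collapse whitespace so small formatting differences match."""
    return " ".join(str(s).lower().split())


# Toy stand-ins for one timestamp CSV and one split CSV (made-up rows).
toy_timestamps = pd.DataFrame({
    "start_time": [0.0, 4.2, 9.1],
    "end_time": [4.2, 9.1, 12.0],
    "text": ["Welcome  to the town hall.", "Thank you, George.", "First question."],
})
toy_split = pd.DataFrame({
    "speaker": ["George Stephanopoulos", "Joe Biden", "Audience Member 1"],
    "text": ["Welcome to the town hall.", "Thank you, George.", "First question."],
})

toy_timestamps["key"] = toy_timestamps["text"].map(normalize_text)
toy_split["key"] = toy_split["text"].map(normalize_text)

# Left join keeps every timestamped utterance; unmatched rows get a NaN speaker.
labeled = toy_timestamps.merge(
    toy_split[["key", "speaker"]].drop_duplicates("key"),
    on="key",
    how="left",
)

# Keep only the candidate's own utterances for acoustic analysis.
candidate_only = labeled[labeled["speaker"] == "Joe Biden"]
print(candidate_only[["start_time", "end_time", "speaker"]])
```

A text join has its own weakness: short utterances repeated by different speakers (for example "Yes.") collapse to the first speaker seen, so such rows are better dropped or checked by hand.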
8.2 Extract Acoustic Features per Speaker¶
Next, we extract four acoustic features per utterance using librosa, following Chowdhury et al. (2025):
- Zero Crossing Rate (ZCR): the fraction of consecutive audio samples at which the signal changes sign, averaged over frames -> higher = more high-frequency, consonant-heavy speech.
- RMS Energy: average level of the speech signal, a rough proxy for loudness/force.
- MFCCs (13 coefficients): a compact summary of the voice's spectral shape (its tonal fingerprint), and among the most widely used features in speech emotion recognition.
- Chroma STFT: 12 pitch class features capturing harmonic content (Chowdhury et al. highlight this as consistently important across datasets); we average them into one value per utterance.
Note that we focus on the Trump and Biden town hall files, where only one candidate is present, and sample up to 100 utterances per candidate to keep runtime manageable. Speaker labels are assigned per file, so these samples also include moderator, audience, and voice-over utterances from the same recordings.
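To make the first two features concrete, the sketch below computes a whole-signal ZCR and RMS for two synthetic tones with the same amplitude but different frequencies. It is a simplified NumPy stand-in for the frame-wise values librosa returns. The helper names and the test tones are assumptions for illustration, not part of the notebook.

```python
import numpy as np

SR = 22050  # same sample rate as the notebook


def zero_crossing_rate(y):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.signbit(y)
    return float(np.mean(signs[1:] != signs[:-1]))


def rms(y):
    """Root-mean-square amplitude of the signal."""
    return float(np.sqrt(np.mean(y ** 2)))


# One second of two pure tones with identical amplitude.
t = np.arange(SR) / SR
low_tone = 0.5 * np.sin(2 * np.pi * 100 * t + 0.1)    # 100 Hz
high_tone = 0.5 * np.sin(2 * np.pi * 2000 * t + 0.1)  # 2000 Hz

zcr_low, zcr_high = zero_crossing_rate(low_tone), zero_crossing_rate(high_tone)
rms_low, rms_high = rms(low_tone), rms(high_tone)

print(f"ZCR  low tone: {zcr_low:.4f}   high tone: {zcr_high:.4f}")
print(f"RMS  low tone: {rms_low:.4f}   high tone: {rms_high:.4f}")
```

The two tones have the same RMS but very different ZCR, which is why the two features can disagree: ZCR tracks how much high-frequency content a signal has, not how loud it is.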
# build a mapping from debate stem name to full MP3 path
# the timestamp CSV stems match the MP3 filenames exactly
audio_map = {}
for mp3_path in mp3_files:
    key = os.path.basename(mp3_path).replace('.mp3', '')
    audio_map[key] = mp3_path
print("Audio file mapping confirmed:")
for k in sorted(audio_map.keys()):
    print(f"\t{k}")
Audio file mapping confirmed:
	us_election_2020_1st_presidential_debate_part1
	us_election_2020_1st_presidential_debate_part2
	us_election_2020_2nd_presidential_debate_part1
	us_election_2020_2nd_presidential_debate_part2
	us_election_2020_biden_town_hall_part1
	us_election_2020_biden_town_hall_part2
	us_election_2020_biden_town_hall_part3
	us_election_2020_biden_town_hall_part4
	us_election_2020_biden_town_hall_part5
	us_election_2020_biden_town_hall_part6
	us_election_2020_biden_town_hall_part7
	us_election_2020_trump_town_hall_1
	us_election_2020_trump_town_hall_2
	us_election_2020_trump_town_hall_3
	us_election_2020_trump_town_hall_4
	us_election_2020_vice_presidential_debate_1
	us_election_2020_vice_presidential_debate_2
# assign speakers to utterances based on event structure
# presidential debates: Trump and Biden both present
# biden town hall parts: Biden is the only candidate
# trump town hall parts: Trump is the only candidate
# vice presidential debate: Pence and Harris
# NOTE: the town hall files also contain moderator, audience, and voice-over
# utterances (see the speaker lists above), so a file-level label is only an
# approximation of who is speaking
def assign_speaker_from_file(audio_file):
    """
    Return the candidate label for a given audio file.
    For debates with multiple candidates we return a 'mixed_*' label;
    speaker-level splitting requires the preprocessed dataset.
    """
    af = audio_file.lower()
    if 'biden_town_hall' in af:
        return 'Joe Biden'
    elif 'trump_town_hall' in af:
        return 'Donald Trump'
    elif 'vice_presidential' in af:
        return 'mixed_pence_harris'
    elif 'presidential_debate' in af:
        return 'mixed_trump_biden'
    return 'unknown'
timestamps_df['inferred_speaker'] = timestamps_df['audio_file'].apply(assign_speaker_from_file)
# keep only the one-candidate town hall files for feature extraction
single_speaker = timestamps_df[~timestamps_df['inferred_speaker'].str.startswith('mixed')].copy()
print("Utterances by inferred speaker (single-speaker files only):")
print(single_speaker['inferred_speaker'].value_counts())
Utterances by inferred speaker (single-speaker files only):
inferred_speaker
Donald Trump    1132
Joe Biden        838
Name: count, dtype: int64
# extract acoustic features for sampled utterances per speaker
# for each utterance we clip the right seconds of audio using the timestamp
# then extract ZCR, RMS, MFCC, and Chroma using librosa
# NOTE: this may take a few minutes depending on your machine
MAX_UTTERANCES = 100 # increase if you have more time/compute
SR = 22050 # librosa default sample rate
def extract_features(audio_path, start_sec, end_sec, sr=SR):
    """
    Load only the utterance segment using offset + duration.
    Returns a dict of feature values, or None if the clip is too short
    or cannot be read.
    """
    duration = end_sec - start_sec
    if duration <= 0.1:
        return None
    try:
        y, _ = librosa.load(
            audio_path, sr=sr, offset=start_sec, duration=duration
        )
        if len(y) < sr * 0.1:
            return None
        # zero crossing rate -- higher = more high-frequency, consonant-heavy speech
        zcr = float(np.mean(librosa.feature.zero_crossing_rate(y)))
        # RMS energy -- average level of the utterance
        rms = float(np.mean(librosa.feature.rms(y=y)))
        # MFCCs -- 13 mel-frequency cepstral coefficients
        # captures the tonal/spectral fingerprint of the voice
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfcc_mean = np.mean(mfcc, axis=1)
        # Chroma STFT -- 12 pitch class features
        # captures harmonic content, highlighted in Chowdhury et al. (2025)
        # here all 12 classes and all frames are averaged into one number
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        chroma_mean = float(np.mean(chroma))
        result = {'zcr': zcr, 'rms': rms, 'chroma': chroma_mean}
        for j, val in enumerate(mfcc_mean):
            result[f'mfcc_{j+1}'] = float(val)
        return result
    except Exception:
        # any load or feature error drops the utterance
        return None
# sample utterances per speaker and extract features
feature_records = []
for speaker in ['Joe Biden', 'Donald Trump']:
    subset = single_speaker[
        single_speaker['inferred_speaker'] == speaker
    ]
    subset = subset.sample(
        n=min(MAX_UTTERANCES, len(subset)), random_state=42
    )
    print(f"Extracting features for {speaker} ({len(subset)} utterances)...")
    for _, row in subset.iterrows():
        audio_path = audio_map.get(row['audio_file'])
        if not audio_path:
            continue
        feats = extract_features(
            audio_path, row['start_time'], row['end_time']
        )
        if feats:
            feats['speaker'] = speaker
            feats['audio_file'] = row['audio_file']
            feature_records.append(feats)
feat_df = pd.DataFrame(feature_records)
print(f"\nFeatures extracted: {len(feat_df)} utterances")
print()
print("Summary by speaker:")
print(
feat_df.groupby('speaker')[['zcr', 'rms', 'chroma', 'mfcc_1']]
.describe()
.round(4)
)
Extracting features for Joe Biden (100 utterances)...
Extracting features for Donald Trump (100 utterances)...
Features extracted: 197 utterances
Summary by speaker:
zcr \
count mean std min 25% 50% 75% max
speaker
Donald Trump 97.0 0.1519 0.0386 0.0738 0.1276 0.1515 0.1775 0.2514
Joe Biden 100.0 0.1053 0.0251 0.0403 0.0904 0.1011 0.1213 0.1921
rms ... chroma mfcc_1 \
count mean ... 75% max count mean std
speaker ...
Donald Trump 97.0 0.0435 ... 0.361 0.4849 97.0 -243.0097 26.1616
Joe Biden 100.0 0.0424 ... 0.396 0.6170 100.0 -261.5239 15.3010
min 25% 50% 75% max
speaker
Donald Trump -328.3606 -260.2521 -248.7375 -232.2854 -167.4580
Joe Biden -304.4760 -272.0928 -260.5080 -251.2004 -219.8857
[2 rows x 32 columns]
8.3 Visualize: Do the Speakers Sound Different?¶
Next, we visualize the four acoustic features side by side using box plots, one box per speaker per feature (MFCC-1 stands in for the 13 MFCCs). This lets us directly compare whether Trump and Biden have systematically different acoustic profiles in their town hall appearances, and if so, in which direction and by how much.
# box plots: ZCR, RMS, Chroma, and MFCC-1 by speaker
# four panels side by side, colored by speaker
# boxmean=True shows the mean as a dashed line inside each box
import plotly.graph_objects as go
from plotly.subplots import make_subplots
speaker_colors = {'Donald Trump': '#c0392b', 'Joe Biden': '#2980b9'}
features_to_plot = ['zcr', 'rms', 'chroma', 'mfcc_1']
feature_labels = [
'Zero Crossing Rate (ZCR)',
'RMS Energy',
'Chroma STFT (Harmonic Content)',
'MFCC-1 (Spectral Energy)'
]
fig_a1 = make_subplots(rows=1, cols=4, subplot_titles=feature_labels)
for col_idx, feat in enumerate(features_to_plot, 1):
    for speaker in ['Donald Trump', 'Joe Biden']:
        subset = feat_df[feat_df['speaker'] == speaker][feat].dropna()
        fig_a1.add_trace(
            go.Box(
                y=subset,
                name=speaker,
                marker_color=speaker_colors[speaker],
                showlegend=(col_idx == 1),
                legendgroup=speaker,
                boxmean=True
            ),
            row=1, col=col_idx
        )
fig_a1.update_layout(
title=dict(
text=(
'Acoustic Feature Comparison: Trump vs. Biden '
'(2020 Town Halls)<br>'
'<sup>Each point = one utterance | '
'dashed line inside box = mean | '
'features follow Chowdhury et al. (2025)</sup>'
),
x=0.5
),
height=480,
font=dict(size=10),
boxmode='group',
legend=dict(title='Speaker')
)
fig_a1.show()
# print numerical summary for more interpretation
print("=" * 60)
print("ACOUSTIC FEATURE SUMMARY BY SPEAKER")
print("=" * 60)
summary = feat_df.groupby('speaker')[
['zcr', 'rms', 'chroma', 'mfcc_1']
].agg(['mean', 'median', 'std']).round(5)
print(summary.to_string())
============================================================
ACOUSTIC FEATURE SUMMARY BY SPEAKER
============================================================
zcr rms chroma mfcc_1
mean median std mean median std mean median std mean median std
speaker
Donald Trump 0.15191 0.15155 0.03864 0.04352 0.04285 0.00765 0.34061 0.33933 0.03908 -243.00967 -248.73750 26.16161
Joe Biden 0.10530 0.10111 0.02506 0.04240 0.04358 0.00661 0.38049 0.37709 0.05093 -261.52390 -260.50801 15.30103
8.3: Audio Feature Comparison (brief interpretation based on code results and figures above):¶
The four acoustic features above give a fairly clear picture of how differently the Trump and Biden town hall recordings sound.
The ZCR difference is the most obvious. Trump's average is 0.152 compared to Biden's 0.105, a gap larger than one standard deviation of either distribution. ZCR measures the fraction of consecutive samples at which the waveform changes sign, and higher values indicate more high-frequency, consonant-heavy content. In other words, Trump's utterances are more acoustically "busy" at the signal level. Biden's distribution is also tighter (std 0.025 vs. Trump's 0.039), meaning his utterances are more consistent on this measure while Trump's swing around more.
RMS energy, by contrast, is very similar between the two. Trump averages 0.044 and Biden 0.042, so raw signal level is not what separates them. They come across at roughly the same volume, which makes sense given that both were recorded in controlled town hall settings with microphones.
The Chroma STFT difference is interesting. Biden scores higher (0.381 vs. Trump's 0.341). Because we average all 12 pitch classes into a single number, a higher value means energy is spread more evenly across pitch classes, which we read tentatively as more tonal variety. This is consistent with Biden's more measured, conversational delivery compared to Trump's faster, more consonant-heavy, and more abrupt speaking pattern.
Lastly, MFCC-1 captures overall spectral energy. Both means are negative, which is normal for this feature, but Trump's is less negative (-243 vs. -262), meaning his utterances carry more overall spectral energy. Biden's MFCC-1 distribution is also notably tighter (std 15.3 vs. 26.2), again reinforcing the consistency pattern.
Overall, the main takeaway from this step is that the Trump and Biden recordings differ measurably at the acoustic level, and ZCR and Chroma show it most clearly. Because speaker labels are assigned per file, these averages also include moderator and audience utterances and reflect each event's recording conditions, so they are best read as preliminary.
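One way to put "how different" on a common scale is a standardized effect size. The sketch below computes Cohen's d with a pooled standard deviation from the means, standard deviations, and counts printed in the summary tables above. Cohen's d is a technique introduced here for illustration, not something the notebook itself computes, and `cohens_d` is a hypothetical helper.

```python
import math


def cohens_d(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Standardized mean difference (a minus b) using the pooled std."""
    pooled_var = (
        (n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2
    ) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# (mean, std, n) per speaker, copied from the summary tables above.
stats = {
    "zcr": {"trump": (0.15191, 0.03864, 97), "biden": (0.10530, 0.02506, 100)},
    "rms": {"trump": (0.04352, 0.00765, 97), "biden": (0.04240, 0.00661, 100)},
    "chroma": {"trump": (0.34061, 0.03908, 97), "biden": (0.38049, 0.05093, 100)},
}

effect_sizes = {
    feat: cohens_d(*vals["trump"], *vals["biden"]) for feat, vals in stats.items()
}

for feat, d in effect_sizes.items():
    print(f"{feat:>6}: d = {d:+.2f}")
```

On these numbers the ZCR gap is well over one pooled standard deviation, the chroma gap is somewhat smaller and in the opposite direction, and the RMS gap is small, which matches the reading of the box plots.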
8.4 Connecting Acoustic Features to Meme Framing¶
This is the cultural analytics payoff of the audio analysis: bringing together Steps 3, 7, and 8 to ask whether the acoustic profile of each speaker's voice connects to how they are culturally represented in political memes.
Recall that Trump is the #1 villain in our meme dataset by a wide margin (358 villain tags vs. Biden's 149 in US Politics memes alone). If his speech also carries acoustic markers that we read as signs of a forceful, aggressive delivery (high ZCR, high energy), that would suggest meme culture is picking up on something real in how he actually sounds, not just what he says.
# build cross-modal comparison table
# acoustic averages per speaker from Step 8.2
acoustic_summary = feat_df.groupby('speaker')[
['zcr', 'rms', 'chroma', 'mfcc_1']
].mean()
# villain counts from meme data (Step 3)
# how many times does each politician appear as villain across all memes?
speaker_to_meme = {
'Donald Trump': 'donald trump',
'Joe Biden': 'joe biden'
}
villain_counts = {}
for speaker, meme_name in speaker_to_meme.items():
    count = sum(
        1 for tags in train_df['villain_parsed']
        if meme_name in [t.lower() for t in tags]
    )
    villain_counts[speaker] = count
print("Villain tag counts in meme dataset:")
for spk, cnt in villain_counts.items():
    print(f" {spk}: {cnt} villain appearances")
print()
print("Acoustic feature averages:")
print(acoustic_summary.round(5).to_string())
Villain tag counts in meme dataset:
Donald Trump: 504 villain appearances
Joe Biden: 162 villain appearances
Acoustic feature averages:
zcr rms chroma mfcc_1
speaker
Donald Trump 0.15191 0.04352 0.34061 -243.00967
Joe Biden 0.10530 0.04240 0.38049 -261.52390
# combined visualization: acoustic profile vs. meme villain count
# 4 panels: ZCR, RMS, Chroma, and Villain Count side by side directly shows whether acoustic energy aligns with meme blame assignment
fig_a2 = make_subplots(
rows=1, cols=4,
subplot_titles=(
'Avg ZCR (Speech Energy)',
'Avg RMS Energy',
'Avg Chroma (Harmonic)',
'Villain Tags in Memes'
)
)
# use a separate name so the Step 7 `speakers` list is not overwritten
acoustic_speakers = ['Donald Trump', 'Joe Biden']
sp_colors = ['#c0392b', '#2980b9']
for col_idx, feat in enumerate(['zcr', 'rms', 'chroma'], 1):
    for spk, col in zip(acoustic_speakers, sp_colors):
        fig_a2.add_trace(
            go.Bar(
                x=[spk],
                y=[acoustic_summary.loc[spk, feat]],
                marker_color=col,
                name=spk,
                showlegend=(col_idx == 1),
                legendgroup=spk
            ),
            row=1, col=col_idx
        )
# villain counts panel
for spk, col in zip(acoustic_speakers, sp_colors):
    fig_a2.add_trace(
        go.Bar(
            x=[spk],
            y=[villain_counts[spk]],
            marker_color=col,
            name=spk,
            showlegend=False,
            legendgroup=spk
        ),
        row=1, col=4
    )
fig_a2.update_layout(
title=dict(
text=(
'Acoustic Energy vs. Meme Villain Framing -- '
'Trump vs. Biden<br>'
'<sup>Does higher speech energy match higher villain '
'assignment in memes?</sup>'
),
x=0.5
),
height=430,
font=dict(size=10),
barmode='group',
showlegend=True
)
fig_a2.show()
8.4: Acoustic to Meme Connection (brief interpretation based on code results and figures above):¶
Now that we know the Trump and Biden recordings sound acoustically different, the question is whether that difference lines up with how meme culture treats them. The villain count panel gives the comparison: Trump has 504 villain appearances in our meme dataset compared to Biden's 162, more than 3x as many.
What's worth noting is that the feature that separates them most acoustically (ZCR) points in the same direction as that villain gap: higher ZCR and a higher villain count for Trump, lower ZCR and a lower villain count for Biden. RMS, on the other hand, is nearly identical between the two speakers, and chroma runs the other way (higher for Biden), so neither lines up with the villain framing in the same way.
This is consistent with the idea that meme culture is not simply responding to how loud or how harmonically rich a politician sounds, but to something closer to speech energy and abruptness, which is roughly what ZCR captures. With only two speakers this is an observed alignment rather than a tested relationship, but the acoustic feature that lines up best with villain assignment in memes is the one that measures how busy the speech signal is at its most basic level.
Results/Discussion¶
Across eight total steps and three datasets, this project set out to answer one primary question: how do US political and COVID memes differ in how they assign blame, and do the patterns we find in memes connect to anything real in how politicians actually speak and sound? Through all the work done in this project, we can say the short answer is yes, but in ways worth unpacking and summarizing beyond the brief interpretations scattered throughout above.
First off, the most consistent finding across every analysis is that Donald Trump functions as a villain in both meme categories simultaneously. He is the top villain in US Politics memes by a wide margin, but he also tops the COVID villain list, crosses over as a hero in some communities, and appears as the top victim in others. No other figure in the dataset comes close to this level of cultural saturation. This tells us something important about how political memes work: they do not simply reflect reality, they amplify and polarize it. The same person can mean completely opposite things depending on who made the meme, which is exactly what you would expect from a deeply divided political media landscape.
Next, the sentiment and vocabulary findings add some texture, so to speak, to that picture. Specifically, both meme categories are emotionally flat on average, with median VADER scores at zero, but US Politics memes have a much wider spread. They swing harder in both the positive and negative direction, capturing both celebratory "our side is winning" content and genuinely hostile blame assignment. On the other hand, COVID memes are more uniform, sitting in a narrow band of low-level negativity with no equivalent celebratory cluster. Importantly, COVID (unlike partisan politics) did not have a winning side to celebrate.
Moving into the speech transcript analysis in Step 7 is where things get genuinely interesting from a cultural analytics perspective. Every politician in the dataset speaks more positively than the memes about them suggest. The gap is largest for Pence and Trump, whose actual speeches are upbeat and triumphant in tone, while memes about them are overwhelmingly negative. This tells us that meme culture is not just passively reflecting political speech. It is actively inverting the tone of that speech and redirecting it as blame. For instance, politicians say "we are winning," and memes respond by casting them as villains. That inversion is itself a form of political counter-narrative.
Moreover, the COVID keyword analysis makes the sourcing of that blame even clearer. Trump used China-framing language at nearly three times the rate of Biden in his 2020 speeches, and that exact framing shows up as the second most common villain tag in COVID memes. In other words, the meme vocabulary did not invent the China blame narrative. It absorbed it directly from political speech and amplified it into a cultural shorthand. This is one of the most concrete findings in the project because it draws a direct line from a politician's rhetorical choices to the memes that circulate in the broader culture.
Finally, the acoustic analysis in Step 8 closes the loop in an unexpected direction. Trump and Biden do not just speak differently in content; they sound different at the signal level. Trump's zero crossing rate (ZCR) is meaningfully higher, indicating more energetic, high-frequency speech, while Biden's acoustic profile is more consistent and harmonically richer. The acoustic feature that separates them most clearly, ZCR, also tracks with the villain count gap in memes (504 for Trump vs 162 for Biden), which suggests that meme framing is not purely a response to what politicians say. It may also be a response to how they say it. How a politician sounds physically, the texture and energy of their voice, appears to be part of what makes them culturally legible as a threat or antagonist in internet discourse.
Taken together, these findings support a view of political memes as a genuine form of cultural production that is both responsive to and distorting of political reality. They borrow vocabulary and blame targets from real political speech, then strip away the positive framing, flatten the emotional nuance, and redistribute blame through a polarizing lens. Understanding this process matters because memes are not on the sidelines of political communication. For many people, they are the primary way political ideas travel and spread.
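To make the zero crossing rate intuition concrete, here is a minimal sketch rather than the Step 8 pipeline itself. It uses synthetic sine waves in place of the debate audio, and the helper names `zero_crossing_rate` and `sine_wave` are illustrative assumptions. librosa computes the same quantity frame by frame with `librosa.feature.zero_crossing_rate`.

```python
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def sine_wave(freq_hz, sr=16000, duration_s=1.0):
    """Synthetic tone standing in for real speech (assumption for illustration)."""
    n = int(sr * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sr) for t in range(n)]


low = zero_crossing_rate(sine_wave(200))    # low-pitched tone
high = zero_crossing_rate(sine_wave(2000))  # high-pitched tone
print(low < high)
```

A signal with more high-frequency content changes sign more often per sample, which is why a higher ZCR is read as brighter or noisier speech. ZCR by itself says nothing about loudness, which is what RMS energy captures.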
Limitations & Directions for Future Study¶
Every methodological choice in this project came with tradeoffs, and being upfront about them matters.
The most significant limitation is probably the image dataset. Only 16 sample PNG files were included with the Memes Images: OCR Data download, which means the entire image analysis in Step 4 rests on 8 COVID memes and 8 US Politics memes. The color palette and text coverage findings are interesting as preliminary observations, but they cannot be treated as statistically meaningful conclusions. The full image dataset referenced in the CSV would need to be available to test these patterns properly, so future work should prioritize sourcing a version of this dataset with all images intact.
Next, face detection was attempted as a third image analysis track using both OpenCV's Haar cascade detector and MediaPipe, with the goal of asking whether political memes personalize blame through faces more than COVID memes do. Both approaches produced unreliable results on meme content, which is not surprising in retrospect. Memes routinely feature cartoons, side profiles, dark backgrounds, and stylized imagery that defeat standard face detectors trained on clean frontal photographs. Given these results, we chose to drop face detection from the final analysis entirely rather than include findings we could not trust. A more robust approach for future work would be to use a model specifically trained on social media imagery, or to manually annotate a smaller sample.
On the text side, the OCR extraction process introduces noise that our cleaning pipeline only partially addressed. The presence of watermark text like "imgflip.com" in the TF-IDF results is a small but visible reminder that meme OCR is inherently messy. A more aggressive domain-specific cleaning step, or a dataset with manually verified transcriptions, would improve the quality of the text analysis.
The speech transcript analysis in Step 7 is methodologically clean, but VADER was designed for short social media text and is being applied here to long political speeches. We addressed this by chunking speeches into 50-word segments and averaging the scores, which is a reasonable workaround but still an approximation. A sentiment tool fine-tuned on political speech would likely be more appropriate for this task.
Lastly, the acoustic analysis in Step 8 has two notable constraints. First, we only analyzed Trump and Biden because the clean single-speaker audio files available were the town hall recordings. The presidential debate files contain both speakers mixed together, and without the full force-aligned utterance-level audio clips from the M-Arg preprocessed dataset, cleanly separating speakers in the debate audio was not feasible. This means Pence and Harris are absent from the acoustic comparison, which limits the party-level conclusions we can draw. Second, 100 utterances per speaker is a reasonable sample but a small one. Running the full extraction across all available utterances would give more stable feature estimates.
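As one possible shape for a domain-specific cleaning step, the sketch below strips watermark domains from OCR text before vectorization. It is an illustration under stated assumptions, not the cleaning pipeline used in this project: the function names are hypothetical, and the domain list contains only "imgflip.com" because that is the only watermark named above.

```python
import re

# Only "imgflip.com" comes from the TF-IDF results; extend this list after
# inspecting the data (any further entries would be assumptions).
WATERMARK_DOMAINS = ["imgflip.com"]


def build_watermark_pattern(domains):
    """Match each domain, tolerating OCR-inserted spaces around the dot."""
    parts = [
        r"\s*\.\s*".join(re.escape(piece) for piece in domain.split("."))
        for domain in domains
    ]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


WATERMARK_RE = build_watermark_pattern(WATERMARK_DOMAINS)


def strip_watermarks(text, pattern=WATERMARK_RE):
    """Remove watermark matches, then collapse leftover whitespace."""
    cleaned = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_watermarks("WHEN THE VACCINE ARRIVES imgflip.com"))
```

Running this before TF-IDF would keep watermark tokens out of the vocabulary comparison, though it only catches the variants the pattern anticipates.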
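The chunk-and-average workaround can be sketched as follows. This is a minimal illustration, not the exact Step 7 code: `chunk_words`, `mean_chunk_score`, and `toy_scorer` are hypothetical names, and the toy scorer is a stand-in so the example runs without extra dependencies. With VADER installed, the scorer would instead be something like `lambda c: SentimentIntensityAnalyzer().polarity_scores(c)["compound"]`.

```python
from statistics import mean


def chunk_words(text, size=50):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def mean_chunk_score(text, scorer, size=50):
    """Score each chunk with `scorer` and return the unweighted mean."""
    chunks = chunk_words(text, size)
    if not chunks:
        return 0.0
    return mean(scorer(chunk) for chunk in chunks)


def toy_scorer(chunk):
    """Stand-in for a real sentiment model (assumption for illustration)."""
    words = chunk.lower().split()
    pos = sum(w == "good" for w in words)
    neg = sum(w == "bad" for w in words)
    return (pos - neg) / len(words)


speech = "good " * 50 + "bad " * 50
print(mean_chunk_score(speech, toy_scorer))
```

One design choice to note: the final chunk of a speech is usually shorter than 50 words but counts as much as a full chunk in an unweighted mean, so a length-weighted average is a reasonable alternative.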
More broadly, this project demonstrates correlation between acoustic features and meme framing but cannot establish causation. The finding that Trump's ZCR tracks with his villain count is suggestive and worth pursuing, but a proper causal claim would require controlled experimental design that is well outside the scope of a single course project.
Looking forward, the most natural extension of this work would be to build a multimodal classifier that uses text, image, and acoustic features together to predict entity framing in memes, following the approach of Chowdhury et al. (2025) more directly. A topic model over the full speech corpus, as Professor Tim suggested, would also add a meaningful layer to the Step 7 findings by surfacing thematic structure in how different politicians talk about the same events. Finally, expanding the meme dataset to include more recent political cycles or a non-US context would test whether the blame framing patterns found here are specific to the 2020 US political moment or reflect something more general about how meme culture processes political conflict.
References¶
Datasets:¶
yogesh239. (2023). Memes Images: OCR Data. Kaggle. https://www.kaggle.com/datasets/yogesh239/text-data-ocr
Chalkiadakis, I., Angles d'Auriac, L., Peters, G.W., and Frau-Meigs, D. (2025). A text dataset of campaign speeches of the main tickets in the 2020 US presidential election. Scientific Data, 12, 662. https://doi.org/10.1038/s41597-025-04681-x Dataset available at: https://doi.org/10.5281/zenodo.14785782
Mestre, R., Milicin, R., Middleton, S.E., Ryan, M., Zhu, J., and Norman, T.J. (2021). M-Arg: Multimodal Argument Mining Dataset for Political Debates with Audio and Transcripts. Proceedings of the 8th Workshop on Argument Mining, pp. 78-88. Association for Computational Linguistics. https://aclanthology.org/2021.argmining-1.8/ Dataset available at: https://zenodo.org/records/5653504
Methods and Tools:¶
Chowdhury, J.H., Ramanna, S., and Kotecha, K. (2025). Speech emotion recognition with light weight deep neural ensemble model using hand crafted features. Scientific Reports, 15, 11824. https://doi.org/10.1038/s41598-025-95734-z
Hutto, C.J. and Gilbert, E.E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM-14). https://doi.org/10.1609/icwsm.v8i1.14550
McFee, B., Raffel, C., Liang, D., Ellis, D.P.W., McVicar, M., Battenberg, E., and Nieto, O. (2015). librosa: Audio and Music Signal Analysis in Python. Proceedings of the 14th Python in Science Conference (SciPy 2015), pp. 18-24. https://doi.org/10.25080/majora-7b98e3ed-003