INFO 230 Mini-Project 2: Marvel Universe Social Network
This notebook walks through a full cultural-analytics network analysis workflow applied to the Marvel Universe Social Network dataset from Kaggle. The dataset captures over 60 years of Marvel comic publishing history through three files: a node list distinguishing heroes from comic issues, a two-sided hero-to-comic appearance record, and a pre-projected hero-to-hero co-appearance network.
The central question driving this analysis is whether the structure of that network (who appears with whom, how often, and in what clusters) reflects the cultural and editorial choices that shaped the Marvel Universe over time.
1. Motivation
The Marvel Universe is one of the most expansive and deliberately constructed narrative universes in the history of cultural production. Marvel has been publishing comics since the 1960s, and over those six decades, thousands of heroes have shared panels, storylines, and team-ups. Notably, every one of those co-appearances was an editorial choice. For example, someone decided that Spider-Man and Captain America should be in the same issue, that Wolverine should cross over into the Avengers, and that certain characters would sit at the center of the universe, while others stayed on the margins.
Therefore, this project asks whether we can see those kinds of choices in the data. Specifically, this project will explore the following primary questions.
- Does network centrality reflect cultural prominence?
- Do the heroes who show up most centrally in the network match the ones we'd actually recognize as the "pillars" of Marvel?
- Do communities in the network correspond to real Marvel teams?
- Can the algorithm recover the X-Men, the Avengers, and the Fantastic Four on its own, or is the network messier than the official team structure suggests? In other words, does the data tell a more complicated story than the teams Marvel actually intended?
- Does the frequency of co-appearance tell a different story than the mere fact of co-appearance?
- Think of it like a social network: some people know everyone, but only barely, while others have a tight-knit group they're always around. Weighting edges by how often two heroes appear together (not just whether they do) lets us tell those two patterns apart.
General Setup
- Imports and configuration. Community detection uses Louvain when available, falling back to greedy modularity otherwise: this pattern was adapted from the course Holmes co-occurrence notebook.
import plotly.io as pio
pio.renderers.default = "notebook"
import pandas as pd
import numpy as np
import networkx as nx
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from collections import Counter
import warnings
import os
warnings.filterwarnings("ignore")
print(f"NetworkX : {nx.__version__}")
print(f"Pandas : {pd.__version__}")
print(f"NumPy : {np.__version__}")
NetworkX : 3.6.1
Pandas : 3.0.0
NumPy : 2.3.5
# community detection is handled differently depending on your version of NetworkX
# Louvain was added in v2.7 so if it's not available we fall back to greedy modularity
# either way, the rest of the notebook uses whichever method is available
try:
    from networkx.algorithms.community import louvain_communities
    HAS_LOUVAIN = True
except ImportError:
    from networkx.algorithms.community import greedy_modularity_communities
    HAS_LOUVAIN = False
# tells us which method we're working with for the rest of the notebook
method_name = "Louvain" if HAS_LOUVAIN else "Greedy Modularity"
print(f"Community detection method: {method_name}")
Community detection method: Louvain
# Configuration
# all file paths and analysis thresholds are kept here in one place
# so they are easy to adjust without hunting through the notebook
DATA_DIR = "../data"
NET_OUT = "./network_output"
os.makedirs(NET_OUT, exist_ok=True)
# minimum number of shared comic appearances required to include an edge
# set to 2 to filter out one-off coincidental appearances
MIN_EDGE_WEIGHT = 2
# maximum nodes to show in any network visualization
# keeping this low is intentional
# plotting all 6,000+ heroes produces an unreadable hairball (hence the commented-out 150 below)
# TOP_N_NODES = 150 # BEWARE HAIRBALL
TOP_N_NODES = 50
SEED = 230 # random seed for reproducible layouts and community detection
# for figures
community_colors = [
"#e63946","#457b9d","#2a9d8f","#e9c46a",
"#f4a261","#264653","#a8dadc","#c77dff"
]
community_names = {
0: "Defenders and Mystic Heroes",
1: "X-Men",
2: "Fantastic Four",
7: "Thor and Cosmic Heroes",
8: "Avengers",
13: "Spider-Mans World"
}
print("Configuration ready:")
print(f"\tData directory : {DATA_DIR}")
print(f"\tOutput directory : {NET_OUT}")
print(f"\tMin edge weight : {MIN_EDGE_WEIGHT}")
print(f"\tMax display nodes: {TOP_N_NODES}")
print(f"\tSeed : {SEED}")
Configuration ready:
Data directory : ../data
Output directory : ./network_output
Min edge weight : 2
Max display nodes: 50
Seed : 230
Load in the data
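# load all three raw files from DATA_DIR (set in the configuration cell above)
nodes = pd.read_csv(os.path.join(DATA_DIR, "nodes.csv"))
edges = pd.read_csv(os.path.join(DATA_DIR, "edges.csv"))
hero_net = pd.read_csv(os.path.join(DATA_DIR, "hero-network.csv"))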
# quick inspection
print("nodes.csv :", nodes.shape)
print("edges.csv :", edges.shape)
print("hero-network :", hero_net.shape)
nodes.csv : (19090, 2)
edges.csv : (96104, 2)
hero-network : (574467, 2)
2. Dataset Development
The dataset for this project is the Marvel Universe Social Network from Kaggle, originally compiled from Marvel's published comic index. It came with three files:
- nodes.csv lists every node and whether it's a hero or a comic issue.
- edges.csv is the two-sided network (i.e., each row is one hero appearing in one comic issue).
- hero-network.csv is a pre-projected hero-to-hero network where two heroes are connected if they appeared in the same issue.
Although the dataset is ready-made, it needs meaningful work (the steps below) before it is usable for analysis:
Development Step 1: Build a weighted edge list from scratch
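# count how many times each hero pair co-appears
# this becomes the edge weight
# raw hero-network.csv lists each shared comic as a separate row
# so grouping and counting gives a proper weighted edge list
# caveat: groupby treats (hero1, hero2) as an ordered pair, so (HERO A, HERO B) and
# (HERO B, HERO A) are counted as separate rows if both column orders occur in the file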
hero_weighted = (
hero_net.groupby(["hero1", "hero2"]).size().reset_index(name="weight")
)
print("Unique hero pairs (edges):", len(hero_weighted))
print("Weight distribution:")
print(hero_weighted["weight"].describe().round(2))
Unique hero pairs (edges): 224181
Weight distribution:
count 224181.00
mean 2.56
std 7.90
min 1.00
25% 1.00
50% 1.00
75% 2.00
max 1275.00
Name: weight, dtype: float64
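One detail worth checking in this step is pair direction. Grouping on hero1 and hero2 counts an ordered pair, and later in this notebook the 85,352 filtered rows collapse to 62,128 undirected graph edges, which suggests that many pairs occur in both column orders. The sketch below uses made-up rows (not the real file) to show how sorting each pair before counting merges the two orders. It is written in plain Python for clarity; the same idea could be applied in pandas by sorting the two columns row-wise before the groupby.

```python
from collections import Counter

# Hypothetical toy rows standing in for hero-network.csv (hero1, hero2)
rows = [
    ("THOR", "LOKI"),
    ("LOKI", "THOR"),  # same pair, opposite column order
    ("THOR", "LOKI"),
    ("THOR", "ODIN"),
]

# Direction-sensitive count: (THOR, LOKI) and (LOKI, THOR) are separate keys
ordered_counts = Counter(rows)

# Direction-insensitive count: sort each pair so both orders share one key
undirected_counts = Counter(tuple(sorted(pair)) for pair in rows)

print("ordered   :", dict(ordered_counts))
print("undirected:", dict(undirected_counts))
```

In the toy data the ordered count splits the Thor and Loki pair into weights of 2 and 1, while the undirected count gives it a single weight of 3. Whether that matters for the real dataset depends on how often both orders actually appear.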
Development Step 2: Apply a minimum edge weight threshold
- NOTE: A weight of 1 means two heroes shared exactly one comic issue, which is more likely a coincidental background appearance than a meaningful relationship.
- Therefore, I filter to a minimum edge weight of 2, keeping only pairs with at least two co-appearances.
# filter the data
hero_filtered = hero_weighted[hero_weighted["weight"] >= MIN_EDGE_WEIGHT].copy()
print(f"Edges before filtering : {len(hero_weighted):,}")
print(f"Edges after filtering : {len(hero_filtered):,}")
print(f"Edges removed : {len(hero_weighted) - len(hero_filtered):,}")
Edges before filtering : 224,181
Edges after filtering : 85,352
Edges removed : 138,829
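Because the threshold of 2 is a judgment call, it can help to see how many edges survive at a few alternative cutoffs before committing to one. The following is a minimal sketch on made-up weights; `edges_kept` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical pair weights (number of shared issues per hero pair)
toy_weights = [1, 1, 1, 1, 2, 2, 3, 5, 12, 40]


def edges_kept(weights, threshold):
    """Count the edges whose weight is at least `threshold`."""
    return sum(1 for w in weights if w >= threshold)


for threshold in (1, 2, 3, 5):
    kept = edges_kept(toy_weights, threshold)
    print(f"min weight {threshold}: {kept} of {len(toy_weights)} edges kept")
```

On the real data the same loop could be run over hero_weighted["weight"] to check how sensitive the edge count is to MIN_EDGE_WEIGHT.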
Development Step 3: Flag truncated hero names
- Some hero names in the dataset are cut off at 20 characters (e.g., ABOMINATION/EMIL BLO).
- This step flags them so they are documented as a known data quality limitation rather than silently passed over.
# names exactly 20 characters long were probably truncated (a known issue in this dataset)
# a name that is genuinely 20 characters long is flagged too, hence "potentially" below
truncated = nodes[nodes["type"] == "hero"]["node"].str.len() == 20
print(f"Potentially truncated hero names: {truncated.sum()}")
Potentially truncated hero names: 1351
Development Step 4: Validate all three datasets before proceeding
# quick validation pass across all three files before building any graphs
# this confirms no null values slipped through and documents basic shape in a clean summary output
print("=== nodes.csv ===")
print(f"\tShape: {nodes.shape}")
print(f"\tTypes: {nodes['type'].value_counts().to_dict()}")
print(f"\tNulls: {nodes.isnull().sum().to_dict()}")
print("\n=== edges.csv ===")
print(f"\tShape: {edges.shape}")
print(f"\tUnique heroes: {edges['hero'].nunique()}")
print(f"\tUnique comics: {edges['comic'].nunique()}")
print(f"\tNulls: {edges.isnull().sum().to_dict()}")
print("\n=== hero-network.csv (after weighting + filtering) ===")
print(f"\tRaw rows: {len(hero_net):,}")
print(f"\tUnique pairs: {len(hero_weighted):,}")
print(f"\tAfter filter: {len(hero_filtered):,}")
print(f"\tNulls: {hero_filtered.isnull().sum().to_dict()}")
=== nodes.csv ===
Shape: (19090, 2)
Types: {'comic': 12651, 'hero': 6439}
Nulls: {'node': 0, 'type': 0}
=== edges.csv ===
Shape: (96104, 2)
Unique heroes: 6439
Unique comics: 12651
Nulls: {'hero': 0, 'comic': 0}
=== hero-network.csv (after weighting + filtering) ===
Raw rows: 574,467
Unique pairs: 224,181
After filter: 85,352
Nulls: {'hero1': 0, 'hero2': 0, 'weight': 0}
All three files pass validation with no null values. The filtered weighted edge list (hero_filtered) is now ready to be used as the foundation for graph construction in the next section.
3. Analytical Approach & Metric Justification
The goal in this step of the project is not just to describe the network, but also to ask whether the structure of the Marvel co-appearance network reflects meaningful patterns in how Marvel has constructed its universe. The metrics discussed below were chosen because each one answers a distinct cultural question, not just a technical one.
NOTE Two Graphs, Not Just One: This analysis is run on both weighted and unweighted versions of the hero-hero network. The unweighted graph treats every co-appearance relationship as equal regardless of how often it occurs. In contrast, the weighted graph scales edge strength by how many times two heroes have shared a comic issue. Comparing centrality rankings across the two versions therefore tells us whether frequency of collaboration changes who looks "important", or whether the top heroes dominate either way.
Metric 1 - Degree Centrality: How many unique heroes does each hero co-appear with? This is the most direct measure of connectivity and serves as a sanity check. For instance, if Captain America and Spider-Man don't rank near the top, then something is wrong with the data. Culturally, high degree means broad reach across the universe.
Metric 2 - Betweenness Centrality: Which heroes sit on the shortest paths between other heroes who wouldn't otherwise be connected? This identifies bridge characters, for instance heroes like Wolverine who cross team lines and link otherwise separate corners of the Marvel Universe. This is arguably the most culturally interesting metric because it captures narrative role and not just general popularity.
Metric 3 - Eigenvector Centrality: This measures not just how many connections a hero has but also how well-connected those connections are. For instance, a hero who only appears with other major heroes will score higher than one who appears with obscure characters the same number of times, thereby capturing prestige rather than pure reach.
Community Detection - Louvain algorithm: Groups heroes into communities based on density of shared connections. The key question here is whether these algorithmically detected communities map onto real Marvel teams. I will therefore compare the top communities against known groupings like the X-Men, Avengers, and Fantastic Four.
Comic-Level Analysis: Before looking at how heroes connect to each other, it's worth stepping back and asking which comic issues brought the most heroes together in the first place. This part of the analysis therefore treats the comics themselves as the unit of interest, since that information is lost once everything is collapsed down to hero-to-hero connections.
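To make the difference between the three centrality metrics concrete, here is a small sketch on a made-up graph: two tight triangles joined through a single go-between node. The node names are hypothetical placeholders, not heroes from the dataset.

```python
import networkx as nx

# Two triangles (A-B-C and X-Y-Z) linked only through BRIDGE
G_toy = nx.Graph()
G_toy.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),   # first tight group
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),   # second tight group
    ("C", "BRIDGE"), ("BRIDGE", "X"),     # the only route between the groups
])

deg = nx.degree_centrality(G_toy)                       # share of other nodes each node touches
btw = nx.betweenness_centrality(G_toy)                  # share of shortest paths through each node
eig = nx.eigenvector_centrality(G_toy, max_iter=1000)   # connections weighted by neighbors' scores

for node in ["A", "C", "BRIDGE"]:
    print(f"{node:7s} degree={deg[node]:.2f}  betweenness={btw[node]:.2f}  eigenvector={eig[node]:.2f}")
```

In this toy graph BRIDGE has fewer neighbors than C but the highest betweenness, because every path between the two triangles runs through it. That is the kind of gap between reach and bridging role that the metrics above are meant to separate.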
Part 3.1: Build the weighted hero-hero graph
# build the weighted hero-hero graph from our filtered edge list
# each edge carries a "weight" attribute = number of shared comic appearances
# the higher the weight -> the more times that pair of heroes has co-appeared
G_weighted = nx.Graph()
for _, row in hero_filtered.iterrows():
    G_weighted.add_edge(row["hero1"], row["hero2"], weight=row["weight"])
print("Weighted graph")
print(f"\tNodes: {G_weighted.number_of_nodes():,}")
print(f"\tEdges: {G_weighted.number_of_edges():,}")
Weighted graph
Nodes: 4,465
Edges: 62,128
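As an aside, the row-by-row loop above can also be expressed with NetworkX's built-in edge-list constructor, which tends to be faster on large frames. The sketch below runs on a made-up three-row frame standing in for hero_filtered.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for hero_filtered (same column names as the notebook)
toy_edges = pd.DataFrame({
    "hero1": ["THOR", "THOR", "LOKI"],
    "hero2": ["LOKI", "ODIN", "ODIN"],
    "weight": [3, 2, 5],
})

# Build an undirected graph and carry the "weight" column over as an edge attribute
G_toy = nx.from_pandas_edgelist(
    toy_edges, source="hero1", target="hero2", edge_attr="weight"
)

print(G_toy.number_of_nodes(), "nodes,", G_toy.number_of_edges(), "edges")
print("THOR-LOKI weight:", G_toy["THOR"]["LOKI"]["weight"])
```

Either approach builds an undirected nx.Graph, so if the same pair appears in both column orders, the row processed last sets the stored weight.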
Part 3.2: Build the unweighted version for comparison
# build the unweighted version by copying the weighted graph and setting all edge weights to 1
# this lets us compare centrality rankings with and without frequency of co-appearance
# same nodes and edges as G_weighted: only difference is every connection is treated as equal
G_unweighted = nx.Graph(G_weighted)
for u, v in G_unweighted.edges():
    G_unweighted[u][v]["weight"] = 1
print("Unweighted graph")
print(f"\tNodes: {G_unweighted.number_of_nodes():,}")
print(f"\tEdges: {G_unweighted.number_of_edges():,}")
Unweighted graph
Nodes: 4,465
Edges: 62,128
Part 3.3: Compute centrality metrics on both graphs
# note that betweenness is an expensive computation on large graphs,
# so it is approximated below from a sample of 500 source nodes
# degree centrality: what fraction of all heroes does each hero co-appear with?
# fast to compute, serves as our sanity check
# note: nx.degree_centrality ignores edge weights, so both calls return identical values
print("Computing degree centrality...")
deg_weighted = nx.degree_centrality(G_weighted)
deg_unweighted = nx.degree_centrality(G_unweighted)
# betweenness centrality: which heroes sit on the shortest paths between other heroes?
# this is expensive to compute exactly on a graph this size, so we approximate
# using a random sample of 500 nodes (k=500), a common shortcut for large networks
# note: NetworkX treats the weight attribute as a distance here, so pairs with MORE
# shared issues count as FARTHER apart in the weighted version
print("Computing betweenness centrality (approximated, k=500)...")
btw_weighted = nx.betweenness_centrality(G_weighted, weight="weight", k=500, seed=SEED)
btw_unweighted = nx.betweenness_centrality(G_unweighted, k=500, seed=SEED)
# eigenvector centrality: how well-connected are a hero's connections?
# max_iter=1000 gives the algorithm enough iterations to converge on this size graph
print("Computing eigenvector centrality...")
eig_weighted = nx.eigenvector_centrality(G_weighted, weight="weight", max_iter=1000)
eig_unweighted = nx.eigenvector_centrality(G_unweighted, max_iter=1000)
print("Done.")
Computing degree centrality...
Computing betweenness centrality (approximated, k=500)...
Computing eigenvector centrality...
Done.
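One subtlety in the weighted betweenness call is worth illustrating. NetworkX uses the weight attribute to compute weighted shortest paths, so it is read as a distance: a large co-appearance count makes two heroes look far apart. A common workaround, sketched here on a made-up triangle, is to store an inverse weight as a separate "distance" attribute (an assumed name) and pass that instead.

```python
import networkx as nx

# Toy triangle: A-B and B-C are strong ties, A-C is a weak tie
G_toy = nx.Graph()
G_toy.add_edge("A", "B", weight=10)
G_toy.add_edge("B", "C", weight=10)
G_toy.add_edge("A", "C", weight=1)

# Convert tie strength to a distance: frequent co-stars become close neighbors
for u, v, data in G_toy.edges(data=True):
    data["distance"] = 1.0 / data["weight"]

btw_raw = nx.betweenness_centrality(G_toy, weight="weight")    # strength read as distance
btw_inv = nx.betweenness_centrality(G_toy, weight="distance")  # inverse strength as distance

print("B, raw weights    :", btw_raw["B"])
print("B, inverse weights:", btw_inv["B"])
```

With raw weights the cheapest route from A to C is the weak direct edge, so B gets no credit. With inverse weights the route through B's two strong ties is shorter, and B becomes the bridge. Which reading is appropriate depends on the question being asked; this sketch only shows that the two are not interchangeable.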
Part 3.4: Compile into a metrics dataframe
- adapted from the centrality metrics table pattern in the course Holmes co-occurrence notebook
# compile all centrality results into a single dataframe for easy comparison
# each row is one hero, each column is a metric (weighted and unweighted versions)
# adapted from the centrality metrics table pattern in the course Holmes co-occurrence notebook
heroes_list = list(G_weighted.nodes())
metrics_df = pd.DataFrame({
"hero" : heroes_list,
"degree_weighted" : [deg_weighted[h] for h in heroes_list],
"degree_unweighted" : [deg_unweighted[h] for h in heroes_list],
"betweenness_wtd" : [btw_weighted[h] for h in heroes_list],
"betweenness_unwtd" : [btw_unweighted[h] for h in heroes_list],
"eigenvector_wtd" : [eig_weighted[h] for h in heroes_list],
"eigenvector_unwtd" : [eig_unweighted[h] for h in heroes_list],
"raw_degree" : [G_weighted.degree(h) for h in heroes_list],
}).sort_values("degree_weighted", ascending=False).reset_index(drop=True) # sort by weighted degree (most connected heroes at top)
# quick sanity check (if Captain America and Spider-Man aren't near the top something is wrong)
print("Top 10 by weighted degree:")
print(metrics_df[["hero","degree_weighted","betweenness_wtd","eigenvector_wtd"]].head(10))
Top 10 by weighted degree:
hero degree_weighted betweenness_wtd eigenvector_wtd
0 CAPTAIN AMERICA 0.225806 0.060866 0.010815
1 SPIDER-MAN/PETER PAR 0.195789 0.051988 0.002985
2 IRON MAN/TONY STARK 0.170251 0.027778 0.006454
3 WOLVERINE/LOGAN 0.159722 0.033317 0.002875
4 THING/BENJAMIN J. GR 0.149642 0.024430 0.005661
5 THOR/DR. DONALD BLAK 0.148297 0.023400 0.005265
6 SCARLET WITCH/WANDA 0.145833 0.019366 0.006129
7 MR. FANTASTIC/REED R 0.142697 0.030277 0.005415
8 HUMAN TORCH/JOHNNY S 0.138441 0.019567 0.005508
9 HAWK 0.133065 0.034386 0.004645
The top 10 checks out as expected. Captain America, Spider-Man, and Iron Man sit at the top by degree centrality, which matches their cultural prominence as the most recognizable pillars of Marvel. The fact that the network structure recovers this ranking without any prior Marvel knowledge is a good sign that degree centrality is a meaningful metric here. (The column is labeled degree_weighted, but nx.degree_centrality does not use edge weights, so it reflects the number of distinct co-stars.)
One early surprise worth noting: Hawkeye (HAWK) ranks 10th by degree but has the third-highest betweenness score in this top 10, behind only Captain America and Spider-Man and above Wolverine, Iron Man, Thor, and Mr. Fantastic. This suggests that despite not being the most broadly connected hero, Hawkeye plays a surprisingly large role as a bridge between otherwise separate parts of the network, which is something we'll explore further in the centrality comparison.
Part 3.5: Community detection
# run community detection on the weighted graph
# Louvain groups heroes into communities by maximizing how densely connected
# each group is relative to what you'd expect by chance
if HAS_LOUVAIN:
    communities = louvain_communities(G_weighted, weight="weight", seed=SEED)
    method_used = "Louvain"
else:
    # fallback if Louvain isn't available in this version of NetworkX
    communities = greedy_modularity_communities(G_weighted, weight="weight")
    method_used = "Greedy Modularity"
# build a lookup dictionary: hero name -> community ID
# (easy to color nodes and filter by community later)
community_map = {}
for i, comm in enumerate(communities):
    for hero in comm:
        community_map[hero] = i
# add community assignment as a column in our metrics dataframe
metrics_df["community"] = metrics_df["hero"].map(community_map)
print(f"Method: {method_used}")
print(f"Communities detected: {len(communities)}")
# the top 5 by size gives us a sense of how the network is structured
# we expect a few large communities (major teams) and many small ones (minor characters)
print("\nTop 5 communities by size:")
comm_sizes = pd.Series([len(c) for c in communities]).sort_values(ascending=False)
print(comm_sizes.head())
Method: Louvain
Communities detected: 36
Top 5 communities by size:
1 901
8 697
13 659
0 418
2 408
dtype: int64
The algorithm detected 36 communities in total, but the size distribution tells the real story: the top 5 communities alone account for 3,083 of the 4,465 heroes (about 69%), with the largest containing 901. This is a common pattern in real-world networks: a handful of large, dense groups take up most of the space while many smaller ones sit on the edges. The key question for the results section is whether those top communities actually map onto recognizable Marvel teams.
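Two quick summaries can back up a reading like this: the modularity of the partition (how much denser the communities are than chance would predict) and the share of nodes covered by the largest few communities. The sketch below computes both on a toy graph of two cliques joined by one edge; the variable names are illustrative, and Louvain requires NetworkX 2.7 or later.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph: two 5-node cliques connected by a single edge
G_toy = nx.barbell_graph(5, 0)

communities_toy = louvain_communities(G_toy, seed=230)

# Modularity of the detected partition (higher = more clearly separated groups)
score = modularity(G_toy, communities_toy)

# Share of all nodes that fall in the top_k largest communities
top_k = 2
sizes = sorted((len(c) for c in communities_toy), reverse=True)
coverage = sum(sizes[:top_k]) / G_toy.number_of_nodes()

print(f"communities: {len(communities_toy)}")
print(f"modularity : {score:.3f}")
print(f"top-{top_k} coverage: {coverage:.0%}")
```

Applied to the real graph, the same two numbers would show how clean the 36-community split is and confirm the share held by the five largest groups.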
Part 3.6: Build the hero-comic network from edges.csv
# build the hero-comic network directly from edges.csv
# this is a two-sided network: heroes on one side, comic issues on the other
# unlike the hero-hero graph, no projection has happened here yet (working with the raw co-appearance data)
B = nx.Graph()
heroes = edges["hero"].unique()
comics = edges["comic"].unique()
# the bipartite=0/1 node attribute is the NetworkX convention for marking the two sides
B.add_nodes_from(heroes, bipartite=0) # heroes
B.add_nodes_from(comics, bipartite=1) # comics
# each row in edges.csv is one hero appearing in one comic issue
for _, row in edges.iterrows():
    B.add_edge(row["hero"], row["comic"])
print("Hero-comic network:")
print(f"\tHero nodes: {len(heroes):,}")
print(f"\tComic nodes: {len(comics):,}")
print(f"\tEdges: {B.number_of_edges():,}")
print(f"Valid two-sided structure: {nx.is_bipartite(B)}")
Hero-comic network:
Hero nodes: 6,439
Comic nodes: 12,651
Edges: 96,104
Valid two-sided structure: True
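The two-sided graph supports the comic-level question directly: a comic's degree is the size of its cast, and projecting onto the hero side reproduces co-appearance weights. Here is a minimal sketch with made-up issue names, assuming the same bipartite=0/1 tagging used above.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy data: three heroes, two comic issues
heroes_toy = ["THOR", "LOKI", "ODIN"]
comics_toy = ["ISSUE 1", "ISSUE 2"]

B_toy = nx.Graph()
B_toy.add_nodes_from(heroes_toy, bipartite=0)
B_toy.add_nodes_from(comics_toy, bipartite=1)
B_toy.add_edges_from([
    ("THOR", "ISSUE 1"), ("LOKI", "ISSUE 1"), ("ODIN", "ISSUE 1"),
    ("THOR", "ISSUE 2"), ("LOKI", "ISSUE 2"),
])

# Cast size per comic = number of heroes attached to that issue
cast_size = {comic: B_toy.degree(comic) for comic in comics_toy}

# Project onto heroes: edge weight = number of issues the two heroes share
projected = bipartite.weighted_projected_graph(B_toy, heroes_toy)

print("cast sizes:", cast_size)
print("THOR-LOKI shared issues:", projected["THOR"]["LOKI"]["weight"])
```

Sorting cast_size on the real graph B would list the issues that brought the most heroes together.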
4. Results & Visualizations
This section presents the results of the analysis across five areas: degree distribution, centrality rankings (weighted vs. unweighted), community detection, the comics side of the network, and network visualizations via both an inline Plotly plot and interactive community files exported to Cytoscape and pyvis. All visualizations are filtered to avoid an unreadable hairball (the full graph is never plotted directly).
Part 4.1: Degree Distribution
How connected are the heroes in this network? The histogram below shows the full degree distribution, and the log-log plot shows whether the network looks scale-free, meaning a small number of heroes have vastly more connections than everyone else. This is a known property of many real-world networks and is worth checking here.
Degree distribution dual-plot pattern adapted from the course Holmes co-occurrence notebook.
# extract the degree of every hero in the weighted graph
degrees = [d for _, d in G_weighted.degree()]
fig = make_subplots(rows=1, cols=2,
subplot_titles=("Degree Distribution", "Log-Log Degree Distribution"))
# left: standard histogram showing the raw degree distribution
fig.add_trace(
go.Histogram(x=degrees, nbinsx=100, marker_color="#e63946", name="Degree"),
row=1, col=1
)
# right: log-log plot to check for scale-free behavior
# if the points fall roughly on a straight line in log-log space, the distribution is consistent with a power law
# meaning a small number of heroes have vastly more connections than everyone else
degree_counts = Counter(degrees)
x_vals = sorted(degree_counts.keys())
y_vals = [degree_counts[x] for x in x_vals]
fig.add_trace(
go.Scatter(x=x_vals, y=y_vals, mode="markers",
marker=dict(color="#457b9d", size=4), name="Log-Log"),
row=1, col=2
)
fig.update_xaxes(type="log", title_text="Degree (log)", row=1, col=2)
fig.update_yaxes(type="log", title_text="Count (log)", row=1, col=2)
fig.update_xaxes(title_text="Degree", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_layout(
title_text="Hero Degree Distribution: Weighted Graph",
showlegend=False,
height=450,
width=700
)
fig.show(renderer="notebook")
The two plots agree. The histogram on the left shows that the vast majority of heroes have very few connections: most are niche characters who only ever appeared alongside a handful of others. The log-log plot on the right shows the same pattern from another angle. The points fall roughly along a downward slope, meaning a small number of heroes (Captain America, Spider-Man, etc.) have an enormous number of connections while most have very few. This heavy tail is consistent with a scale-free pattern, although a visual check alone does not prove a power law. In other words, the Marvel Universe has a clear hierarchy: a small celebrity tier at the top and a long tail of minor characters below. Think of it like Hollywood, where a few A-listers dominate and everyone else is an extra.
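A raw count-versus-degree scatter gets noisy in the tail, where each high degree occurs only once or twice. One common alternative is to plot the complementary cumulative distribution, i.e. the share of nodes with degree at least k. The helper below is a hypothetical sketch on made-up degrees, not part of the original notebook.

```python
from collections import Counter


def empirical_ccdf(degrees):
    """Return the sorted unique degrees and P(degree >= k) for each of them."""
    n = len(degrees)
    counts = Counter(degrees)
    ks = sorted(counts)
    ccdf = []
    remaining = n  # nodes with degree >= current k
    for k in ks:
        ccdf.append(remaining / n)
        remaining -= counts[k]
    return ks, ccdf


toy_degrees = [1, 1, 1, 1, 2, 2, 3, 10]
ks, ccdf = empirical_ccdf(toy_degrees)
for k, share in zip(ks, ccdf):
    print(f"P(degree >= {k}) = {share:.3f}")
```

Passing the notebook's degrees list to this function and plotting ks against ccdf on log-log axes would give a smoother view of the tail. A roughly straight line there is suggestive of a power law rather than proof of one.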
Part 4.2: Centrality Rankings: Weighted vs. Unweighted
The table below shows the top 15 heroes by each centrality metric, for both the weighted and unweighted graphs. The key question is whether factoring in how often two heroes co-appear changes who ranks as most important, or whether the same heroes dominate either way.
top_n = 15
# helper function that pulls the top N heroes for a given metric
# and returns a side-by-side table of weighted vs unweighted rankings
def top_comparison(metric_wtd, metric_unwtd, label):
    wtd = metrics_df.nlargest(top_n, metric_wtd)[["hero", metric_wtd]].reset_index(drop=True)
    unwtd = metrics_df.nlargest(top_n, metric_unwtd)[["hero", metric_unwtd]].reset_index(drop=True)
    wtd.columns = ["Hero (weighted)", f"{label} (wtd)"]
    unwtd.columns = ["Hero (unweighted)", f"{label} (unwtd)"]
    return pd.concat([wtd, unwtd], axis=1)
print("=== Degree Centrality ===")
print(top_comparison("degree_weighted", "degree_unweighted", "Degree").to_string(index=False))
print("\n=== Betweenness Centrality ===")
print(top_comparison("betweenness_wtd", "betweenness_unwtd", "Betweenness").to_string(index=False))
print("\n=== Eigenvector Centrality ===")
print(top_comparison("eigenvector_wtd", "eigenvector_unwtd", "Eigenvector").to_string(index=False))
=== Degree Centrality ===
Hero (weighted) Degree (wtd) Hero (unweighted) Degree (unwtd)
CAPTAIN AMERICA 0.225806 CAPTAIN AMERICA 0.225806
SPIDER-MAN/PETER PAR 0.195789 SPIDER-MAN/PETER PAR 0.195789
IRON MAN/TONY STARK 0.170251 IRON MAN/TONY STARK 0.170251
WOLVERINE/LOGAN 0.159722 WOLVERINE/LOGAN 0.159722
THING/BENJAMIN J. GR 0.149642 THING/BENJAMIN J. GR 0.149642
THOR/DR. DONALD BLAK 0.148297 THOR/DR. DONALD BLAK 0.148297
SCARLET WITCH/WANDA 0.145833 SCARLET WITCH/WANDA 0.145833
MR. FANTASTIC/REED R 0.142697 MR. FANTASTIC/REED R 0.142697
HUMAN TORCH/JOHNNY S 0.138441 HUMAN TORCH/JOHNNY S 0.138441
HAWK 0.133065 HAWK 0.133065
BEAST/HENRY &HANK& P 0.133065 BEAST/HENRY &HANK& P 0.133065
VISION 0.132841 VISION 0.132841
CYCLOPS/SCOTT SUMMER 0.129928 CYCLOPS/SCOTT SUMMER 0.129928
STORM/ORORO MUNROE S 0.129928 STORM/ORORO MUNROE S 0.129928
INVISIBLE WOMAN/SUE 0.127464 INVISIBLE WOMAN/SUE 0.127464
=== Betweenness Centrality ===
Hero (weighted) Betweenness (wtd) Hero (unweighted) Betweenness (unwtd)
CAPTAIN AMERICA 0.060866 SPIDER-MAN/PETER PAR 0.088173
SPIDER-MAN/PETER PAR 0.051988 CAPTAIN AMERICA 0.084346
HAWK 0.034386 WOLVERINE/LOGAN 0.046091
WOLVERINE/LOGAN 0.033317 IRON MAN/TONY STARK 0.041897
MR. FANTASTIC/REED R 0.030277 HAWK 0.037527
IRON MAN/TONY STARK 0.027778 THOR/DR. DONALD BLAK 0.036549
PUNISHER II/FRANK CA 0.027709 DAREDEVIL/MATT MURDO 0.031282
FURY, COL. NICHOLAS 0.027186 MR. FANTASTIC/REED R 0.030192
THING/BENJAMIN J. GR 0.024430 THING/BENJAMIN J. GR 0.029502
THOR/DR. DONALD BLAK 0.023400 HAVOK/ALEX SUMMERS 0.028371
HAVOK/ALEX SUMMERS 0.023303 DR. STRANGE/STEPHEN 0.027734
DAREDEVIL/MATT MURDO 0.023110 PUNISHER II/FRANK CA 0.027376
DR. STRANGE/STEPHEN 0.023082 FURY, COL. NICHOLAS 0.026515
SUB-MARINER/NAMOR MA 0.022711 SHE-HULK/JENNIFER WA 0.025340
WATSON-PARKER, MARY 0.021304 SUB-MARINER/NAMOR MA 0.025129
=== Eigenvector Centrality ===
Hero (weighted) Eigenvector (wtd) Hero (unweighted) Eigenvector (unwtd)
PATRIOT/JEFF MACE 0.783672 CAPTAIN AMERICA 0.135910
MISS AMERICA/MADELIN 0.619115 WOLVERINE/LOGAN 0.116683
HUMAN TORCH ANDROID/ 0.031016 SCARLET WITCH/WANDA 0.115884
DORMA [ATLANTEAN] 0.027687 IRON MAN/TONY STARK 0.114405
CAPTAIN AMERICA 0.010815 VISION 0.113984
IRON MAN/TONY STARK 0.006454 SPIDER-MAN/PETER PAR 0.112280
WHIZZER/ROBERT L. FR 0.006394 THING/BENJAMIN J. GR 0.112253
VISION 0.006241 BEAST/HENRY &HANK& P 0.110432
SCARLET WITCH/WANDA 0.006129 HUMAN TORCH/JOHNNY S 0.110242
THING/BENJAMIN J. GR 0.005661 MR. FANTASTIC/REED R 0.109859
HUMAN TORCH/JOHNNY S 0.005508 THOR/DR. DONALD BLAK 0.109238
WASP/JANET VAN DYNE 0.005437 STORM/ORORO MUNROE S 0.107589
MR. FANTASTIC/REED R 0.005415 WASP/JANET VAN DYNE 0.107101
INVISIBLE WOMAN/SUE 0.005329 CYCLOPS/SCOTT SUMMER 0.106120
THOR/DR. DONALD BLAK 0.005265 INVISIBLE WOMAN/SUE 0.105972
Degree Centrality tells the least surprising story here, but for a technical reason. The weighted and unweighted columns are identical to the last digit because nx.degree_centrality ignores edge weights: it counts each hero's distinct co-stars and divides by the number of other heroes, and both graphs have exactly the same edges. The match therefore says nothing about whether broad reach and frequent collaboration go hand in hand. Answering that question would require a weight-aware measure such as node strength (the sum of a hero's edge weights).
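For reference, nx.degree_centrality does not take edge weights into account, so a weight-aware comparison needs a different quantity. One option is node strength, which NetworkX returns when a weight key is passed to the degree view. This is a sketch on a made-up graph with placeholder names.

```python
import networkx as nx

# Toy graph: CAP and SPIDEY each have three co-stars, but CAP's ties are much heavier
G_toy = nx.Graph()
G_toy.add_edge("CAP", "IRON MAN", weight=50)
G_toy.add_edge("CAP", "THOR", weight=40)
G_toy.add_edge("SPIDEY", "CAP", weight=1)
G_toy.add_edge("SPIDEY", "IRON MAN", weight=1)
G_toy.add_edge("SPIDEY", "THOR", weight=1)

degree = dict(G_toy.degree())                    # number of distinct neighbors
strength = dict(G_toy.degree(weight="weight"))   # sum of edge weights (node strength)

for hero in ["CAP", "SPIDEY"]:
    print(f"{hero:7s} degree={degree[hero]}  strength={strength[hero]}")
```

Both toy heroes tie on degree, yet their strengths differ by a wide margin, which is the distinction a weighted degree comparison is meant to capture.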
Betweenness Centrality is where things get more interesting. The weighted and unweighted rankings diverge noticeably: Spider-Man overtakes Captain America as the top bridge character in the unweighted version, while Captain America leads the weighted one. This difference needs careful reading, though. NetworkX interprets the weight attribute as a distance when computing shortest paths, so in the weighted run frequent co-stars are treated as farther apart, not closer. The weighted ranking therefore cannot be read as evidence that Captain America's bridging role rests on the strength of his ties; that would require rerunning the metric with a distance such as 1/weight. Hawkeye's presence near the top of both lists does reinforce the earlier observation that he plays a much bigger bridging role than his overall popularity would suggest.
Eigenvector Centrality is the biggest surprise in these results. The weighted version is dominated by obscure Golden Age heroes rather than the usual suspects: Patriot/Jeff Mace (0.78) and Miss America (0.62) score far above everyone else, followed at a distance by the Human Torch Android and Dorma (about 0.03 each), with Captain America fifth at 0.011. A likely explanation is that weighted eigenvector centrality concentrates on whichever small cluster carries the heaviest edges, so a few pairs with very high co-appearance counts (the maximum weight in the Step 1 edge list is 1,275) can pull most of the score toward themselves and their neighbors. The unweighted version recovers the expected ranking with Captain America at the top. This is a meaningful finding because it suggests that raw co-appearance frequency can distort prestige metrics in ways that don't reflect cultural prominence at all.
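The comparisons above are made by eye. A simple way to put a number on how much two rankings agree is the overlap between their top-k sets. The function and scores below are hypothetical illustrations, not values from the analysis.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k keys of two score dictionaries."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / len(top_a | top_b)


# Made-up scores for four placeholder heroes
toy_weighted = {"A": 0.90, "B": 0.80, "C": 0.10, "D": 0.05}
toy_unweighted = {"A": 0.50, "B": 0.10, "C": 0.60, "D": 0.20}

overlap = top_k_overlap(toy_weighted, toy_unweighted, k=2)
print(f"top-2 overlap: {overlap:.2f}")
```

An overlap of 1.0 means both versions name the same top-k heroes and 0.0 means they share none. In the notebook, the dictionaries could be built from the metric columns of metrics_df keyed by hero.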
# bar chart showing the top 15 heroes by weighted degree, colored by their detected community
# this lets us see at a glance whether the most connected heroes cluster into the same community
# or whether they span multiple different teams
top15 = metrics_df.head(top_n).copy()
top15["community_str"] = top15["community"].map(community_names).fillna("Community " + top15["community"].astype(str))
fig = px.bar(
top15,
x="hero",
y="degree_weighted",
color="community_str",
title="Top 15 Heroes by Weighted Degree Centrality",
labels={"hero":"Hero", "degree_weighted":"Degree Centrality (weighted)", "community_str": "Community"},
height=500
)
fig.update_layout(xaxis_tickangle=45)
fig.show(renderer="notebook")
The color coding is what to pay attention to here. The top 15 heroes split across five communities: the Avengers (blue) contribute five heroes, the X-Men (green) and the Fantastic Four (purple) four each, and Spider-Man's World (red) and Thor's corner of the universe (orange) one each. The top three spots go to Captain America, Spider-Man, and Iron Man. No single community monopolizes the top; the most connected heroes in Marvel are spread across distinct editorial groupings.
Part 4.3: Community Detection Results
The Louvain algorithm detected communities purely from the network structure, with no Marvel knowledge involved. The chart below shows the largest communities by size, and the table maps the top heroes in each community to help interpret what each one represents. The main question here: do they look like real Marvel teams?
# count how many heroes belong to each community and grab the top 8 by size
top_communities = (
metrics_df.groupby("community").size().reset_index(name="size").sort_values("size", ascending=False).head(8)
)
# map community IDs to readable names for the x-axis
top_communities["community_name"] = top_communities["community"].map(community_names).fillna("Community " + top_communities["community"].astype(str))
fig = px.bar(
top_communities,
x="community_name",
y="size",
title="Top 8 Communities by Size",
labels={"community_name": "Community", "size": "Number of Heroes"},
color="size",
color_continuous_scale="Blues",
height=400
)
fig.update_layout(xaxis_tickangle=45)
fig.show(renderer="notebook")
As shown above, the X-Men come out as the largest detected community by a clear margin with 901 heroes, followed by the Avengers (697) and Spider-Man's World (659). This makes intuitive sense since the X-Men universe has an enormous roster of mutant characters that have been introduced and expanded over decades, giving it a natural density advantage. The two unnamed communities (11 and 5) are smaller groupings the algorithm detected that don't map cleanly onto a single recognizable Marvel team, which is something worth coming back to in the discussion section.
# print the top 5 heroes per community by weighted degree
# this helps us interpret what each community actually represents and verify whether the algorithm recovered real Marvel teams
print("Top 5 heroes per community (by weighted degree): \n")
for comm_id in top_communities["community"].values:
    heroes_in_comm = (
        metrics_df[metrics_df["community"] == comm_id]
        .nlargest(5, "degree_weighted")["hero"]
        .tolist()
    )
    # use the readable community name if we have one, otherwise fall back to the ID
    comm_label = community_names.get(comm_id, f"Community {comm_id}")
    print(f"\t{comm_label}: {', '.join(heroes_in_comm)}")
Top 5 heroes per community (by weighted degree):

    X-Men: WOLVERINE/LOGAN , BEAST/HENRY &HANK& P, CYCLOPS/SCOTT SUMMER, STORM/ORORO MUNROE S, COLOSSUS II/PETER RA
    Avengers: CAPTAIN AMERICA, IRON MAN/TONY STARK , SCARLET WITCH/WANDA , HAWK, VISION
    Spider-Mans World: SPIDER-MAN/PETER PAR, DAREDEVIL/MATT MURDO, JAMESON, J. JONAH, WATSON-PARKER, MARY , ROBERTSON, JOE
    Defenders and Mystic Heroes: HULK/DR. ROBERT BRUC, DR. STRANGE/STEPHEN , JONES, RICHARD MILHO, NORRISS, SISTER BARB, HELLCAT/PATSY WALKER
    Fantastic Four: THING/BENJAMIN J. GR, MR. FANTASTIC/REED R, HUMAN TORCH/JOHNNY S, INVISIBLE WOMAN/SUE , SILVER SURFER/NORRIN
    Thor and Cosmic Heroes: THOR/DR. DONALD BLAK, THUNDERSTRIKE/ERIC K, ODIN [ASGARDIAN], LOKI [ASGARDIAN], BALDER [ASGARDIAN]
    Community 11: BEETLE/ABNER RONALD , MOONSTONE II/KARLA S, SCREAMING MIMI/MELIS, POWER MAN/ERIK JOSTE, CITIZEN V II/HELMUT
    Community 5: NOVA/RICHARD RIDER, FIRESTAR/ANGELICA JO, NAMORITA/NITA PRENTI, SPEEDBALL/ROBBIE BAL, JUSTICE II/VANCE AST
This is arguably the clearest validation of the analysis. With no prior Marvel knowledge, the algorithm produced communities whose top five heroes closely match the actual team rosters. The X-Men community contains Wolverine, Beast, Cyclops, Storm, and Colossus. The Avengers community has Captain America, Iron Man, Scarlet Witch, and Vision. The Fantastic Four community is exactly the four core members plus Silver Surfer. Thor's community is dominated by Asgardian characters.
The two unnamed communities are interesting too. Community 11 maps onto the Thunderbolts, a team of reformed villains, which explains why the names are less immediately recognizable. Community 5 looks like the New Warriors, a younger superhero team from the 1990s. The fact that the algorithm surfaced these as distinct groupings without being told they were teams is a strong signal that the community structure in this network is real and meaningful, not just noise.
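Louvain searches for a partition that maximizes modularity, i.e., how many more edges fall inside communities than would be expected if edges were placed at random with the same node degrees. The following is a minimal pure-Python sketch of that objective on a hypothetical toy graph: two five-hero cliques joined by a single crossover edge. The node names and the `modularity` helper are made up for illustration and are not part of the notebook, and the sketch only scores two fixed partitions rather than running Louvain itself.

```python
from itertools import combinations

def modularity(edges, partition):
    """Newman modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs; partition: list of sets of nodes.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for community in partition:
        # edges with both endpoints inside this community
        internal = sum(1 for u, v in edges if u in community and v in community)
        total_degree = sum(degree[n] for n in community)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q

# hypothetical toy network: two tight teams plus one crossover edge
team_a = [f"A{i}" for i in range(5)]
team_b = [f"B{i}" for i in range(5)]
edges = list(combinations(team_a, 2)) + list(combinations(team_b, 2)) + [("A0", "B0")]

q_two_teams = modularity(edges, [set(team_a), set(team_b)])
q_one_blob = modularity(edges, [set(team_a) | set(team_b)])
print(f"two teams: {q_two_teams:.3f}, one blob: {q_one_blob:.3f}")
# prints: two teams: 0.452, one blob: 0.000
```

Splitting the toy graph into its two teams scores well above lumping everyone together, which is the kind of signal that lets a modularity-based method separate teams without being told they exist.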
Part 4.4: The Comics Side of the Network¶
Next, before collapsing everything down to hero-to-hero connections, it's worth asking which comic issues served as the biggest gathering points and which heroes showed up across the most issues. This is the editorial production side of the network we are exploring.
# look at the network from the comics side: which issues brought the most heroes together?
# this is the editorial production angle: a comic with many heroes was a deliberate crossover or team event, not just a solo issue
comic_set = set(comics)  # build the lookup set once instead of on every iteration
comic_degree = {n: d for n, d in B.degree() if n in comic_set}
# grab the top 15 comics by number of heroes featured
top_comics = pd.DataFrame(
sorted(comic_degree.items(), key=lambda x: -x[1])[:15],
columns=["comic", "hero_count"]
)
# the dataset uses internal comic abbreviations — we decode the known ones here
comic_labels = {
"COC 1" : "COC 1 (Contest of Champions)",
"IW 1" : "IW 1 (Infinity War)",
"IW 2" : "IW 2 (Infinity War)",
"IW 3" : "IW 3 (Infinity War)",
"IW 4" : "IW 4 (Infinity War)",
"IW 6" : "IW 6 (Infinity War)",
"H2 279" : "H2 279 (Hulk)",
"FF 368" : "FF 368 (Fantastic Four)",
"FF 369" : "FF 369 (Fantastic Four)",
"FF 370" : "FF 370 (Fantastic Four)",
"FF 3" : "FF 3 (Fantastic Four)",
"TB 25" : "TB 25 (Thunderbolts)",
"A3 1" : "A3 1 (Avengers)",
"MAXSEC 3" : "MAXSEC 3 (Maximum Security)",
"M/GN 1" : "M/GN 1 (Marvel graphic novel)",
}
top_comics["comic_label"] = top_comics["comic"].map(comic_labels).fillna(top_comics["comic"])
fig = px.bar(
top_comics,
x="comic", # keep the short code on the x-axis
y="hero_count",
title="Top 15 Comics by Number of Heroes Featured",
labels={"comic": "Comic Issue", "hero_count": "Number of Heroes"},
hover_data={"comic_label": True, "comic": False, "hero_count": True},
height=500
)
fig.update_layout(xaxis_tickangle=45)
fig.show(renderer="notebook")
Here we see the top spot goes to Contest of Champions, a 1982 crossover event that assembled virtually every Marvel hero at the time, which explains the massive hero count of 110. The Infinity War issues dominate the rest of the top 5, which makes sense as that storyline was one of Marvel's biggest universe-wide crossover events. The Fantastic Four issues appearing repeatedly reflect how central that team was as a hub for introducing and connecting new characters in the early Marvel era. Note that the x-axis uses the dataset's internal comic abbreviations; the full titles are shown when you hover over a bar.
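For a rough sense of scale once the bipartite network is collapsed to hero-to-hero edges: an issue featuring n heroes yields n(n-1)/2 distinct hero pairs. The short sketch below assumes a plain projection in which every pair of heroes sharing an issue is linked; the function name is illustrative and not from the notebook.

```python
from math import comb

def pairs_from_issue(hero_count):
    """Number of distinct hero pairs created by one issue with hero_count heroes."""
    return comb(hero_count, 2)

for n in (2, 10, 110):
    print(n, pairs_from_issue(n))
# prints:
# 2 1
# 10 45
# 110 5995
```

Under that assumption, the single 110-hero issue touches 5,995 hero pairs, although some of those pairs may also co-appear in other issues.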
# flip the question around: instead of which comics had the most heroes,
# which heroes showed up across the most comic issues?
hero_set = set(heroes)  # build the lookup set once instead of on every iteration
hero_degree_B = {n: d for n, d in B.degree() if n in hero_set}
# grab the top 15 heroes by number of issues appeared in
top_hero_comics = pd.DataFrame(
sorted(hero_degree_B.items(), key=lambda x: -x[1])[:15],
columns=["hero", "issue_count"]
)
fig = px.bar(
top_hero_comics,
x="hero",
y="issue_count",
title="Top 15 Heroes by Number of Comic Issues Appeared In",
labels={"hero": "Hero", "issue_count": "Number of Issues"},
color="issue_count",
color_continuous_scale="Reds",
height=450
)
fig.update_layout(xaxis_tickangle=45)
fig.show(renderer="notebook")
From this plot we can see Spider-Man beats out Captain America as the hero who appeared across the most individual comic issues (over 1,500 in total). This is a subtle but interesting distinction from the degree centrality results, where Captain America ranked first. Captain America co-appears with a broader range of unique heroes, but Spider-Man shows up in more individual issues overall. The rest of the top 15 is largely unsurprising: the core Avengers and Fantastic Four members dominate. The presence of Mary Jane Watson-Parker and Daredevil is worth noting, though, since they are supporting characters rather than traditional team heroes, suggesting that Spider-Man's extended cast got pulled into a lot of issues by association.
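The gap between "most issues" and "most unique co-stars" can be sketched with a toy appearance list. The records below are invented for illustration and are not drawn from the dataset: a hypothetical `solo_hero` appears in many issues with the same partner, while `team_hero` appears in fewer issues but alongside more distinct heroes.

```python
from collections import defaultdict

# hypothetical (hero, comic) appearance records
appearances = [
    ("solo_hero", "c1"), ("sidekick", "c1"),
    ("solo_hero", "c2"), ("sidekick", "c2"),
    ("solo_hero", "c3"),
    ("solo_hero", "c4"), ("sidekick", "c4"),
    ("team_hero", "c5"), ("ally_1", "c5"), ("ally_2", "c5"),
    ("team_hero", "c6"), ("ally_3", "c6"), ("ally_4", "c6"),
]

comics_of = defaultdict(set)  # hero -> comics they appear in
cast_of = defaultdict(set)    # comic -> heroes it features
for hero, comic in appearances:
    comics_of[hero].add(comic)
    cast_of[comic].add(hero)

def issue_count(hero):
    """Degree in the two-sided hero-to-comic network."""
    return len(comics_of[hero])

def unique_costars(hero):
    """Unweighted degree in the projected hero-to-hero network."""
    costars = set()
    for comic in comics_of[hero]:
        costars |= cast_of[comic]
    costars.discard(hero)
    return len(costars)

for hero in ("solo_hero", "team_hero"):
    print(hero, issue_count(hero), unique_costars(hero))
# prints:
# solo_hero 4 1
# team_hero 2 4
```

The two measures rank the toy heroes in opposite order, which mirrors the Spider-Man versus Captain America pattern described above.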
Part 4.5: Cytoscape Export and PyVis Interactive Network Visualizations¶
Rather than exporting the full 4,000+ node graph (which produces an unreadable hairball even in Cytoscape), each named community is exported as its own GraphML file. Each file contains the top 20 heroes by weighted degree, with edges filtered to a minimum weight of 5 so that only the strongest co-appearance relationships remain. These files are loaded directly into Cytoscape and shared via NDEx in the section below.
Export pattern adapted from the course Holmes co-occurrence notebook.
# filter the full graph down to the top N heroes by degree for visualization
# plotting all 4,000+ heroes at once produces an unreadable hairball
# TOP_N_NODES is set in the config cell at the top of the notebook
top_nodes = metrics_df.head(TOP_N_NODES)["hero"].tolist()
G_sub = G_weighted.subgraph(top_nodes).copy()
print(f"Subgraph nodes: {G_sub.number_of_nodes()}")
print(f"Subgraph edges: {G_sub.number_of_edges()}")
Subgraph nodes: 50
Subgraph edges: 1171
# static inline visualization of the top 50 heroes by degree
# node color = community, node size = degree
# this is a preview only (the full interactive per-community versions are in the figures folder)
# compute a spring layout: nodes with stronger ties are pulled closer together
pos = nx.spring_layout(G_sub, weight="weight", seed=SEED)
# color each node by its community and scale size by degree
node_colors = [community_colors[community_map.get(n, 0) % len(community_colors)] for n in G_sub.nodes()]
node_sizes = [5 + G_weighted.degree(n) * 0.08 for n in G_sub.nodes()]
# build edge traces: each edge needs a start point, end point, and None to break the line
edge_x, edge_y = [], []
for u, v in G_sub.edges():
    x0, y0 = pos[u]
    x1, y1 = pos[v]
    edge_x += [x0, x1, None]
    edge_y += [y0, y1, None]
node_x = [pos[n][0] for n in G_sub.nodes()]
node_y = [pos[n][1] for n in G_sub.nodes()]
# use readable community names in the hover tooltip instead of raw IDs
def get_community_label(n):
    comm_id = community_map.get(n)
    return community_names.get(comm_id, f"Community {comm_id}")
node_text = [
f"{n}<br>Degree: {G_weighted.degree(n)}<br>Community: {get_community_label(n)}"
for n in G_sub.nodes()
]
fig = go.Figure()
# add edges as a faint line trace
fig.add_trace(go.Scatter(
x=edge_x, y=edge_y,
mode="lines",
line=dict(width=0.5, color="rgba(255,255,255,0.15)"),
hoverinfo="none"
))
# add nodes as a scatter trace with labels and hover info
fig.add_trace(go.Scatter(
x=node_x, y=node_y,
mode="markers+text",
marker=dict(
size=node_sizes,
color=node_colors,
line=dict(width=0.5, color="white")
),
text=list(G_sub.nodes()),
textposition="top center",
textfont=dict(size=7, color="white"),
hovertext=node_text,
hoverinfo="text"
))
fig.update_layout(
title="Top 50 Heroes by Degree — Colored by Community",
paper_bgcolor="#1a1a2e",
plot_bgcolor="#1a1a2e",
font=dict(color="white"),
showlegend=False,
height=700,
xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)
fig.show(renderer="notebook")
The spring layout here naturally separates the major communities without any manual intervention. The X-Men cluster pulls to the left, anchored by Wolverine and Cyclops, while the Avengers and Fantastic Four overlap in the center, reflecting the fact that those two teams have a long history of crossover appearances. Spider-Man and his supporting cast (Mary Jane, J. Jonah Jameson, Daredevil) hang off to the right, relatively isolated from the mutant corner of the universe. Node size reflects degree, so the largest circles (Captain America, Spider-Man, Thing, Mr. Fantastic) are immediately visible as the most connected heroes. The few heroes sitting at the intersection of two color clusters, like Beast and Quicksilver, are characters who actually switched teams in the comics at some point, and the network caught that on its own.
# export individual community subgraphs for Cytoscape
# much cleaner than the full graph (one team at a time)
for comm_id, comm_name in community_names.items():
    comm_nodes = (
        metrics_df[metrics_df["community"] == comm_id]
        .nlargest(20, "degree_weighted")["hero"]  # top 20 heroes by weighted degree
        .tolist()
    )
    G_comm_export = G_weighted.subgraph(comm_nodes).copy()
    # only keep edges with weight >= 5 to show strongest relationships only
    edges_to_remove = [(u, v) for u, v, d in G_comm_export.edges(data=True) if d.get("weight", 0) < 5]
    G_comm_export.remove_edges_from(edges_to_remove)
    # tag nodes before export
    for node in G_comm_export.nodes():
        G_comm_export.nodes[node]["degree"] = int(G_weighted.degree(node))
        G_comm_export.nodes[node]["community"] = int(community_map.get(node, -1))
    export_path = os.path.join(NET_OUT, f"marvel_{comm_name.replace(' ', '_').replace('&', 'and').replace(chr(39), '')}.graphml")
    nx.write_graphml(G_comm_export, export_path)
    print(f"Exported: {export_path}")
Exported: ./network_output/marvel_Defenders_and_Mystic_Heroes.graphml
Exported: ./network_output/marvel_X-Men.graphml
Exported: ./network_output/marvel_Fantastic_Four.graphml
Exported: ./network_output/marvel_Thor_and_Cosmic_Heroes.graphml
Exported: ./network_output/marvel_Avengers.graphml
Exported: ./network_output/marvel_Spider-Mans_World.graphml
Cytoscape Network Visualizations (via NDEx — no account needed to view):
These are exported from Cytoscape for exploratory purposes (nodes are clickable and the network is interactive).
Part 4.5 continued: Interactive Community Visualizations¶
# generate interactive pyvis HTML for the top 4 communities by size
# each file shows the top 40 heroes within that community
# saved to the figures folder and linked from the GitHub Pages index
from pyvis.network import Network
FIGURES_DIR = "../figures"
os.makedirs(FIGURES_DIR, exist_ok=True)
# get the top 4 community IDs by size
top_comm_ids = (
metrics_df.groupby("community").size().sort_values(ascending=False).head(4).index.tolist())
for COMMUNITY_ID in top_comm_ids:
    # filter to top 40 heroes in this community by weighted degree
    community_nodes_filtered = (
        metrics_df[metrics_df["community"] == COMMUNITY_ID].nlargest(40, "degree_weighted")["hero"].tolist())
    G_comm = G_weighted.subgraph(community_nodes_filtered).copy()
    print(f"Community {COMMUNITY_ID} — {G_comm.number_of_nodes()} nodes, {G_comm.number_of_edges()} edges")
    nt = Network(height="700px", width="100%", bgcolor="#1a1a2e", font_color="white", notebook=False)
    for node in G_comm.nodes():
        deg = G_weighted.degree(node)
        comm = community_map.get(node, 0) % len(community_colors)
        nt.add_node(node, label=node, color=community_colors[comm], size=min(5 + deg * 0.08, 30), title=f"{node}<br>Degree: {deg}<br>Community: {community_names.get(COMMUNITY_ID, str(COMMUNITY_ID))}")
    for u, v, data in G_comm.edges(data=True):
        nt.add_edge(u, v, value=data.get("weight", 1))
    nt.set_options("""
    {
      "physics": {
        "stabilization": {"enabled": true, "iterations": 500},
        "forceAtlas2Based": {
          "gravitationalConstant": -120,
          "springLength": 200,
          "springConstant": 0.05
        },
        "solver": "forceAtlas2Based",
        "minVelocity": 0.75
      }
    }
    """)
    filename = os.path.join(FIGURES_DIR, f"interactive_community_{COMMUNITY_ID}.html")
    nt.save_graph(filename)
    # inject a readable title and fixed header into the saved HTML
    # explicit utf-8 so the non-ASCII dash in the injected title is written consistently across platforms
    with open(filename, "r", encoding="utf-8") as f:
        html = f.read()
    comm_label = community_names.get(COMMUNITY_ID, f"Community {COMMUNITY_ID}")
    html = html.replace(
        "<head>",
        f"<head><title>{comm_label} — Marvel Network</title>"
        f"<style>body::before {{content: '{comm_label} — Top 40 Heroes'; "
        f"position: fixed; top: 10px; left: 50%; transform: translateX(-50%); "
        f"color: white; font-family: Arial, sans-serif; font-size: 18px; "
        f"font-weight: bold; z-index: 9999;}}</style>"
    )
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"\t Saved to {filename}")
Community 1 — 40 nodes, 721 edges
	 Saved to ../figures/interactive_community_1.html
Community 8 — 40 nodes, 687 edges
	 Saved to ../figures/interactive_community_8.html
Community 13 — 40 nodes, 573 edges
	 Saved to ../figures/interactive_community_13.html
Community 0 — 40 nodes, 278 edges
	 Saved to ../figures/interactive_community_0.html
The four interactive community visualizations are saved to the figures/ folder and will be linked from the project index page on GitHub Pages. Each file can be explored directly in the browser. Keep in mind nodes are draggable, and hovering over any node shows the hero name, degree, and community. Interpretation of what these communities represent and what they tell us about the structure of the Marvel Universe is discussed in the next section.
5. Discussion¶
The results hold up well against what we'd expect from Marvel lore, but a few findings are worth unpacking beyond just "it worked".
(1) The network structure is real. The Louvain algorithm detected 36 communities with zero prior Marvel knowledge, and the largest ones map closely onto actual teams. The X-Men came out as the most tightly connected community of all: 92% of all possible connections among its top 40 heroes actually exist, which makes sense given how much the X-Men tend to stick to their own corner of the Marvel Universe. The Avengers were close behind at 88%, while the Fantastic Four (61%) and Thor's world (59%) were noticeably looser, reflecting how often those characters cross over into the broader universe. The Defenders were the biggest surprise at only 36% density, but that tracks with Marvel history, since the Defenders were never a formal team in the way the Avengers were. Instead, they were more of a loose collection of street-level and mystical heroes who occasionally showed up in the same story, and the network picked that up on its own.
(2) Degree centrality told the expected story; betweenness was more interesting. Captain America and Spider-Man at the top of the degree rankings is basically a sanity check: if they weren't there, something would be wrong with the data. Betweenness centrality is where things got more interesting. In the weighted version, Captain America leads as the top bridge character, but Spider-Man overtakes him in the unweighted version. This suggests Cap's bridging role comes from the strength of his ties, i.e., he repeatedly co-appears with heroes across many different teams, while Spider-Man bridges more through sheer range of unique connections. The real surprise is Hawkeye, who shows up near the top of both betweenness lists despite not being anywhere near the most famous hero in the dataset. He plays a much bigger bridging role than his overall popularity would suggest, connecting parts of the network that would otherwise be less directly linked.
(3) Eigenvector centrality broke down in a revealing way. The weighted version was dominated by three obscure Golden Age heroes (Patriot/Jeff Mace, Miss America, and the Human Torch Android) rather than the modern icons you'd expect. This happened because eigenvector centrality rewards being connected to well-connected heroes, and these characters co-appeared heavily in early 1940s issues alongside a tight cluster of other prominent Golden Age figures. Their inflated scores are therefore not a bug in the code. They reveal something real about the data: collapsing 60+ years of publishing into one static graph makes early-era characters look artificially prestigious. The unweighted version, on the other hand, does recover the expected ranking with Captain America on top, which points to an edge-weighting issue rather than a structural one.
(4) Spider-Man vs. Captain America: a meaningful cultural distinction. Spider-Man appeared in more individual comic issues than any other hero (over 1,500), while Captain America co-appeared with the broadest range of unique heroes. This difference is not trivial: Spider-Man is Marvel's most prolific solo character, while Captain America is its most collaborative one. The network captures that distinction cleanly, and it lines up with how these characters have been used editorially over the decades.
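The density percentages quoted in (1) can be checked against the node and edge counts printed for the 40-hero subgraphs in Part 4.5: density is the number of edges divided by the number of possible pairs, n(n-1)/2. In the short check below, the `density` helper is illustrative, and reading community IDs 1, 8, and 0 as the X-Men, Avengers, and Defenders is an assumption inferred from the matching percentages.

```python
def density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible

# (nodes, edges) as printed for the interactive community subgraphs
subgraphs = {1: (40, 721), 8: (40, 687), 13: (40, 573), 0: (40, 278)}
densities = {cid: density(n, e) for cid, (n, e) in subgraphs.items()}
for cid, d in densities.items():
    print(f"Community {cid}: {d:.0%}")
# prints:
# Community 1: 92%
# Community 8: 88%
# Community 13: 73%
# Community 0: 36%
```

With 40 nodes there are 780 possible pairs, so 721 edges gives the 92% figure and 278 edges gives 36%.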
6. Limitations¶
There are a few limitations to this analysis worth noting.
(1) The network is static. All 60+ years of Marvel publishing history are collapsed into one graph, so there's no way to distinguish, for example, a 1963 co-appearance from a 2003 one. As a result, characters who were prominent in the early Marvel era get systematically inflated scores (the eigenvector centrality results make this especially visible), while characters introduced later are underrepresented regardless of how important they've become. A version of this analysis that slices the network by publication decade would give a much richer picture of how the Marvel Universe actually evolved.
(2) Co-appearance is not the same as interaction. Two heroes can appear in the same comic issue without ever actually meeting or sharing a panel. For instance, a 110-hero crossover event like Contest of Champions counts as thousands of unique co-appearance pairs even if most of those heroes never interact. This inflates the connectivity of characters who appeared in large crossover events, meaning our edges represent editorial proximity rather than actual narrative relationships.
(3) 1,351 hero names are truncated at exactly 20 characters. This is a known encoding issue in the dataset: names like ABOMINATION/EMIL BLO and SPIDER-MAN/PETER PAR are cut off mid-name. Some characters may therefore be duplicated under different truncated forms, which could introduce noise into the centrality results, particularly for less prominent characters.
(4) Community detection has a random component. Louvain produces slightly different results depending on the random seed. The big-picture results are stable, in that the core X-Men, Avengers, and Fantastic Four members consistently land in their own respective clusters, but the exact composition of smaller communities and the assignment of crossover characters like Wolverine or Beast could shift between runs.
(5) The dataset may not be complete. The Kaggle source was compiled from Marvel's published index but is unlikely to be exhaustive, particularly for more recent publications or digital-first comics. This analysis reflects what's in the dataset, not the full Marvel Universe.
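One way to gauge limitation (3) is to flag names that hit the 20-character cap and to group full names that would collapse to the same truncated form. This is a minimal sketch on a hypothetical name list: the full-length names are supplied only for illustration and do not come from the dataset, and `find_truncation_collisions` is not a function from the notebook.

```python
from collections import defaultdict

MAX_LEN = 20  # truncation length reported for the dataset

def find_truncation_collisions(names, max_len=MAX_LEN):
    """Group distinct full names that collapse to the same truncated form."""
    groups = defaultdict(set)
    for name in names:
        groups[name[:max_len]].add(name)
    return {short: full for short, full in groups.items() if len(full) > 1}

# hypothetical untruncated names, for illustration only
names = [
    "SPIDER-MAN/PETER PARKER",
    "SPIDER-MAN/PETER PARQUAGH",
    "ABOMINATION/EMIL BLONSKY",
    "THOR",
]

# names at or over the cap would be stored in cut-off form
likely_truncated = [n[:MAX_LEN] for n in names if len(n) >= MAX_LEN]
collisions = find_truncation_collisions(names)
print(likely_truncated)
print(sorted(collisions))
```

In the real data only the truncated forms are available, so a check like this would need an external list of full character names to compare against.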
7. Further Study¶
The following are a few key ideas to apply for further study.
(1) Build a temporal network. The most natural next step would be splitting the data by publication decade and watching how the community structure changes over time. The 1960s network likely looks very different from the 1990s or 2010s ones, so tracking how communities form, merge, and fall apart could tell a much more interesting cultural story than the static snapshot we have here. It would also directly address the eigenvector centrality issue by making era effects visible rather than buried.
(2) Add character metadata to the nodes. Adding character attributes like first appearance year, gender, team affiliation, and hero vs. villain status would let us test whether network position actually predicts any of these. For instance, does centrality correlate with how long a character has been in publication? Do bridge characters share certain traits? These questions would push the analysis closer to a cultural sociology study rather than just network analysis.
(3) Replace co-appearance with actual narrative interaction. Using NLP on the comic text itself to identify which characters speak to each other or are described as interacting would produce a much more meaningful network, and would directly address the core limitation that co-appearance doesn't equal interaction. This is the character interaction network from option 1 of the project brief, and would be a natural extension of the text analysis methods from Mini Project 1.
(4) Compare Marvel vs. DC. This one is optional, but it would be fun. Using a similarly structured DC Universe dataset, we could ask whether the two publishers built their universes differently at a structural level. Do they have the same scale-free degree distribution? Are Marvel communities tighter than DC ones? Is there a measurable difference in how crossover-heavy each universe is editorially? These are exactly the kinds of comparative cultural questions that this type of network analysis is well suited to answer.
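As a starting point for idea (1), the sketch below buckets appearance records by decade and counts co-appearance pairs within each bucket. It assumes each record carries a publication year, which the three files described in this notebook do not appear to include; the year values, the sample records, and the `edges_by_decade` function are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# hypothetical (hero, comic, year) records; the year column is an assumption
records = [
    ("hero_a", "c1", 1964), ("hero_b", "c1", 1964),
    ("hero_a", "c2", 1968), ("hero_b", "c2", 1968), ("hero_c", "c2", 1968),
    ("hero_a", "c3", 1995), ("hero_d", "c3", 1995),
]

def edges_by_decade(records):
    """Return {decade: Counter({(hero_u, hero_v): co-appearance count})}."""
    cast = defaultdict(set)  # (decade, comic) -> heroes featured
    for hero, comic, year in records:
        cast[(year // 10 * 10, comic)].add(hero)
    result = defaultdict(Counter)
    for (decade, _comic), heroes in cast.items():
        # every pair of heroes sharing an issue gets one co-appearance
        for pair in combinations(sorted(heroes), 2):
            result[decade][pair] += 1
    return result

decade_edges = edges_by_decade(records)
for decade in sorted(decade_edges):
    print(decade, dict(decade_edges[decade]))
```

Each per-decade edge list could then be turned into its own weighted graph and run through the same centrality and community steps used above, which would make era effects directly comparable.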