Nathan Cunningham, a PhD student at the University of Warwick, does some awesome statistical analysis using R, often involving Harry Potter or The Simpsons as the data source. Our favorite article by Cunningham calculates the most frequently appearing side characters in the 622 episodes of The Simpsons, a monumental task if you were to go through each episode yourself.
Luckily, rather than sitting through every episode, you can let R do the work: the rvest package makes it possible to scrape webpages in bulk and extract the relevant data, which saves a ton of time. From there, Cunningham extracted all of the character names from the episode descriptions and calculated how many episodes each side character appeared in (read the full procedure).
The process is far more complicated than I make it sound, and Cunningham did a lot of work creating this top 20 side characters graph, including writing and publishing each of the R inputs he used for ease of replication. Even something that sounds relatively simple, like extracting names from a paragraph, can be challenging when you realize a name may appear as “Moe,” “Moe Szyslak,” “Moe’s Tavern,” etc…
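To give a feel for the name-matching problem, here is a minimal Python sketch. It is not Cunningham's R code, just an illustration under my own assumptions: the alias table, the list of phrases to ignore, and the episode descriptions are all made up, and treating “Moe’s Tavern” as a place rather than an appearance is my choice, not necessarily his.

```python
import re
from collections import Counter

# Hypothetical alias table: canonical character -> name forms we accept.
ALIASES = {
    "Moe Szyslak": ["Moe Szyslak", "Moe"],
    "Ned Flanders": ["Ned Flanders", "Flanders", "Ned"],
}

# Assumption: these phrases contain a name but refer to a place, so we
# strip them before matching instead of counting them as appearances.
FALSE_POSITIVES = ["Moe's Tavern"]


def normalize(text):
    """Replace curly apostrophes so "Moe’s" and "Moe's" compare equal."""
    return text.replace("\u2019", "'")


def characters_in(description):
    """Return the set of canonical characters named in one description."""
    text = normalize(description)
    for phrase in FALSE_POSITIVES:
        text = text.replace(phrase, " ")
    found = set()
    for character, names in ALIASES.items():
        # Word boundaries keep "Ned" from matching inside "Nedward".
        pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
        if re.search(pattern, text):
            found.add(character)
    return found


def episode_counts(descriptions):
    """Count how many descriptions mention each character at least once."""
    counts = Counter()
    for description in descriptions:
        counts.update(characters_in(description))
    return counts


# Made-up episode descriptions, for illustration only.
EPISODES = [
    "Homer spends the evening at Moe's Tavern.",
    "Moe Szyslak and Moe's new bartender feud with Flanders.",
    "Ned borrows a ladder from Homer.",
]

counts = episode_counts(EPISODES)
```

Because each description contributes a set, a character named twice in the same episode still counts once, which matches the "number of episodes" measure used for the graph. A real alias table would need far more entries and a lot more judgment calls than this one.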
The results seem fairly solid: they are intuitive and supported by Cunningham’s list of main characters’ episode totals, which has Homer in all but 10 episodes. While scraping Wikipedia for character references probably isn’t the most accurate way to collect the data (physically watching every episode or scraping the scripts would probably be slightly more accurate), it’s good enough for me.
The Top 20 Side Characters (by # of Episodes)
“Thank you! Come again.”
“grease me up woman!”
“That’s Homer Simpson, one of your ‘condescending noun’ from sector 7G.”
“Have a wowwipop.”
“How’s my blubber in-law?”
“See, statements like that are why people think we’re gay.”
“Ow, my eye! I’m not supposed to get pudding in it!”
“Hey hey, kids!”
“I know very little about children.”
“When I get a hold of you-”