Kevin Bacon and Wikipedia: My introduction to web scraping
There’s a game played among movie buffs called “Six Degrees of Kevin Bacon”. The goal is to find the fewest degrees of separation between any actor or actress and Kevin Bacon by connecting them through other actors they’ve performed in a movie with. For example, Tom Hanks has a “Bacon number” of 1, since he co-starred in Apollo 13 with Bacon. The Civil War general William Rufus Shafter has a Bacon number of 7, if you count the tenuous connection between Theodore Roosevelt and Oprah Winfrey in Food, Inc. Since Kevin Bacon is a prolific actor, the theory is that nearly all actors and actresses are within six degrees of separation from Kevin Bacon. Check out the “Oracle of Bacon” to explore more connections, or just Google “bacon number person”. Fortunately, Kevin Bacon’s own perspective on the game has taken a significant turn for the better.
“Getting to Philosophy” is the Wikipedia version of this game. The idea is that, because Wikipedia is an encyclopaedia, most of its pages link to more general topics, e.g. “Stanford University” → “private university”, with “Philosophy” being the most general topic. The difference between this and the Kevin Bacon game is that we only use the first link to connect to the next page. I’ll illustrate this with some Wikipedia pages of personal interest:
- Lithium-ion battery → Rechargeable battery → Battery (electricity) → Electrochemical cell → Electrical energy → Electric potential energy → Potential energy → Energy → Physics → Natural science → Science → Knowledge → Fact → Verificationism → Philosophy. 14 degrees
- Philadelphia Eagles → American football → Gridiron football → Football → Team sport → Sport → Competition → Territory (animal) → Ethology → Scientific method → Scientific technique → Systematic process → Critical thinking → Objectivity (philosophy) → Philosophy. Also 14 degrees
- Delaware → U.S. State → Polity → Entity → Existence → Ontology → Philosophy. 6 degrees
There are some rules for what counts as the first link:
- It must link to a valid Wikipedia article: no in-page citations, external links, or “meta” Wikipedia pages (such as the help page for pronunciation)
- The link can’t be in parentheses, since these links typically reference language pages (see pages like Science or Egypt)
- The link can’t be a previously referenced page, to avoid loops. If the first link has already been referenced, go to the second link, and so on.
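
The three rules above can be sketched in code. The version below is only an illustration, not the script from this post: it uses Python's standard-library `html.parser` rather than BeautifulSoup so that it runs without extra packages, and the names `CandidateLinkParser`, `first_valid_link`, and `META_PREFIXES` are hypothetical. It also assumes that internal article links start with `/wiki/` and that parentheses can be tracked by counting them in the visible text.

```python
from html.parser import HTMLParser

# Assumed (non-exhaustive) prefixes of "meta" namespaces that should be skipped.
META_PREFIXES = ("Help:", "File:", "Wikipedia:", "Special:",
                 "Category:", "Template:", "Portal:", "Talk:")


class CandidateLinkParser(HTMLParser):
    """Collect article links that appear outside parentheses, in page order."""

    def __init__(self):
        super().__init__()
        self.paren_depth = 0   # number of "(" currently open in the visible text
        self.candidates = []   # article titles such as "Rechargeable_battery"

    def handle_data(self, data):
        # Count parentheses in visible text only (tag attributes never reach here).
        for ch in data:
            if ch == "(":
                self.paren_depth += 1
            elif ch == ")" and self.paren_depth > 0:
                self.paren_depth -= 1

    def handle_starttag(self, tag, attrs):
        # Rule 2: ignore links that sit inside parentheses.
        if tag != "a" or self.paren_depth > 0:
            return
        href = dict(attrs).get("href") or ""
        # Rule 1: keep internal article links only (no "#cite" anchors, no external URLs).
        if not href.startswith("/wiki/"):
            return
        title = href[len("/wiki/"):].split("#")[0]
        if title and not title.startswith(META_PREFIXES):
            self.candidates.append(title)


def first_valid_link(html, visited=()):
    """Return the first candidate title not already visited (rule 3), or None."""
    parser = CandidateLinkParser()
    parser.feed(html)
    for title in parser.candidates:
        if title not in visited:
            return title
    return None
```

In practice the HTML passed in would be the article body only, since navigation boxes and sidebars also contain `/wiki/` links; how the original script isolates the body is not shown here.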
To learn web scraping and Python fundamentals, I wrote a script to automatically find the degree of separation between a Wikipedia page and “philosophy”. In this post, I share my results. Keep in mind I only search the English-language version of Wikipedia (https://en.wikipedia.org).
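
As a rough sketch of what such a script has to do, the loop below follows first links until it reaches the target and counts the hops. It is not the code from the repository: `degrees_to_target`, `get_links`, and `max_steps` are hypothetical names, and `get_links` stands in for whatever function downloads a page and returns its candidate links in order. A small hand-written graph replaces real network requests so the example runs offline.

```python
def degrees_to_target(start, get_links, target="Philosophy", max_steps=100):
    """Follow the first unvisited link from each page.

    Returns (degrees, path) on success, or (None, path) if the walk hits a
    dead end or exceeds max_steps.
    """
    path = [start]
    visited = {start}
    while path[-1] != target:
        if len(path) - 1 >= max_steps:
            return None, path
        # Skip links that were already visited, which avoids loops.
        next_page = next(
            (title for title in get_links(path[-1]) if title not in visited),
            None,
        )
        if next_page is None:
            return None, path
        visited.add(next_page)
        path.append(next_page)
    return len(path) - 1, path


# Toy stand-in for Wikipedia, using the Delaware chain listed earlier.
toy_graph = {
    "Delaware": ["U.S. state"],
    "U.S. state": ["Polity"],
    "Polity": ["Entity"],
    "Entity": ["Existence"],
    "Existence": ["Ontology"],
    "Ontology": ["Philosophy"],
}

degrees, path = degrees_to_target("Delaware", lambda page: toy_graph.get(page, []))
print(degrees, " -> ".join(path))
```

Passing the link source in as a function keeps the counting logic separate from the scraping, so the same loop works whether the links come from a live request or a cached page.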
Random pages
I spent many an hour in high school looking at random Wikipedia pages, so the first thing I tried was investigating the distribution of degrees of separation for 500 random pages. Check it out here:
The distribution is right-skewed: most pages are clustered between 8 and 17 degrees, but there’s a long tail of very distant articles.
Some fun ones:
- At five degrees removed, the closest articles were the novel Warbreaker and the 1923 silent movie Drifting. Both connected through genre → category → ontology → philosophy
- At 29 degrees removed, the most distant article was Chennai Slam, the reigning champions of the Indian basketball league
- Skimming through the random articles, many appeared to be small villages around the world like Deh Sheykh, Sirjan, international media like Three Plus Two, or musical groups like Dave’s True Story. The most banal entries I came across were banana peel and fishing bait. One that stuck out was Mysterious Castles of Clay. I also learned about the practice of eco-running.
Top 100 pages
Wikipedia has an article for nearly everything, obscure as it may be. How does this distribution change when we look at the most popular Wikipedia articles? The plot below shows the distribution of the 100 most popular pages on Wikipedia:
The distributions are similarly shaped. We can look at the population statistics to more quantitatively compare this distribution to the distribution of random pages:
Statistic | Random pages | Top 100 pages |
---|---|---|
Mean | 13.6 | 12.9 |
Median | 13 | 12 |
Mode | 12 | 10 |
Minimum | 5 | 5 |
Maximum | 29 | 27 |
Overall, the top-100 distribution is centered slightly to the left of the random-page distribution. I attribute this to fewer “out there” articles compared to the population of random pages.
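
For reference, the statistics in the table can be computed with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module (the original analysis may well have used pandas instead); `summarize` is a hypothetical helper, and the sample list is made up for illustration, not taken from the real results.

```python
import statistics


def summarize(degrees):
    """Return the five summary statistics used in the table above."""
    return {
        "mean": round(statistics.mean(degrees), 1),
        "median": statistics.median(degrees),
        "mode": statistics.mode(degrees),
        "min": min(degrees),
        "max": max(degrees),
    }


# Made-up degrees of separation, only to show the call.
example_degrees = [5, 10, 10, 12, 13, 27]
print(summarize(example_degrees))
```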
Is there a relationship between degrees of separation and page rank? Just for fun, I made an interactive plot below to investigate this. Hover over a point to learn more about it, and click on a point to go to its Wikipedia page.
As expected, there’s no apparent relationship between page rank and degrees from “philosophy”.
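
One way to put a number on “no relationship” would be a correlation coefficient between rank and degrees; a value near zero is consistent with no linear trend. The post’s conclusion comes from looking at the plot, so the sketch below is an optional check rather than something the original script is known to do, and `pearson_r` is a hypothetical helper written out by hand to avoid extra dependencies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would be the page ranks (1 to 100) and `ys` the matching degrees of separation.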
Six pages were five degrees from “philosophy”. Five of them are TV shows (The Big Bang Theory, Game of Thrones, How I Met Your Mother, Breaking Bad, and Lost), all of which connected through the genre → category → ontology → philosophy path documented earlier. Interestingly, the page in the top 100 with the highest “philosophy number” (27) was also a TV show: Glee. For whatever it’s worth (read: very little), the second-highest-ranking page, Donald Trump, is as far removed from “philosophy” as the third-highest-ranking page, Barack Obama.
Categories
Wikipedia also maintains a list of the 30 most popular pages in 15 different categories. We can characterize the distributions of degrees of separation for the pages in each category with a boxplot:
Sports teams have the tightest distribution, since many of the teams are association football (“soccer”) teams that follow the same path to “philosophy”. As we’ve already seen, the “Films and TV series” category has the widest spread.
Since the above page lists the popularity of each page in each category, we can also characterize the popularity of each category. Below is a bar chart showing the page popularity of the 1st, 10th, and 30th most popular pages in each category. The top page in each category is displayed in the label.
I was surprised by two results:
- The popularity of countries. Wouldn’t Wikipedia’s most popular page, the United States, have very general information that’s not of interest to the casual Wikipedian? I typically go to Wikipedia to find very specific information. But I can see the US page’s popularity coming from school reports, etc.
- The popularity of, well, popular people (singers, actors, sportsmen, bands). Michael Jackson is at #6, Lady Gaga at #9, and Eminem at #10. While I’d expect Google queries to be high for these cultural icons, I didn’t expect Wikipedia to be such a prominent source for information. Of course, the entertainers at the top of the list are all American, which is partially influenced by using the English-language version of Wikipedia and widespread, high-quality internet access in America.
Coding
This exercise was a good “weekend project” introduction to commonly used Python packages like BeautifulSoup (web scraping), pandas (data frames), matplotlib (plotting), and bokeh (interactive web plots). The GitHub repository is here.