Kevin Bacon and Wikipedia: My introduction to web scraping

There’s a game played among movie buffs called “Six Degrees of Kevin Bacon”. The goal is to find the fewest degrees of separation between any actor or actress and Kevin Bacon by connecting them through other actors they’ve performed in a movie with. For example, Tom Hanks has a “Bacon number” of 1, since he co-starred in Apollo 13 with Bacon. The Civil War general William Rufus Shafter has a Bacon number of 7, if you count the tenuous connection between Theodore Roosevelt and Oprah Winfrey in Food, Inc. Since Kevin Bacon is a prolific actor, the theory is that nearly all actors and actresses are within six degrees of separation from Kevin Bacon. Check out the “Oracle of Bacon” to explore more connections, or just Google “bacon number person”. Fortunately, Kevin Bacon’s own perspective on the game has taken a significant turn for the better.

“Getting to Philosophy” is the Wikipedia version of this game. The idea is that as an encyclopaedia, most Wikipedia pages have links to more general topics, e.g. “Stanford University” → “private university”, with “Philosophy” being the most general topic. The difference between this and the Kevin Bacon game is that we only use the first link to connect to the next page. I’ll illustrate this with some Wikipedia pages of personal interest:

Lithium-ion battery → Rechargeable battery → Battery (electricity) → Electrochemical cell → Electrical energy → Electric potential energy → Potential energy → Energy → Physics → Natural science → Science → Knowledge → Fact → Verificationism → Philosophy. 14 degrees
Philadelphia Eagles → American football → Gridiron football → Football → Team sport → Sport → Competition → Territory (animal) → Ethology → Scientific method → Scientific technique → Systematic process → Critical thinking → Objectivity (philosophy) → Philosophy. Also 14 degrees
Delaware → U.S. State → Polity → Entity → Existence → Ontology → Philosophy. 6 degrees

There are some rules for what counts as the first link:

It must link to a valid Wikipedia page: no in-page citations, external links, or “meta” Wikipedia pages like this one for pronunciation.
The link can’t be in parentheses, since these links typically reference language pages (see pages like Science or Egypt)
The link can’t be a previously referenced page, to avoid loops. If the first link has already been referenced, go to the second link, and so on.
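
The three rules above can be sketched in a few lines of Python. This is a minimal illustration and not the script from the repository: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs, the names `FirstLinkParser` and `first_valid_link` are made up for this example, and the `sample` paragraph is hand-written HTML rather than a real Wikipedia page.

```python
from html.parser import HTMLParser


class FirstLinkParser(HTMLParser):
    """Collect article links that appear outside parentheses, in page order."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth of "(" in the visible text
        self.candidates = []  # article titles, in the order they appear

    def handle_starttag(self, tag, attrs):
        # Rule 2: ignore links while we are inside parentheses.
        if tag != "a" or self.depth > 0:
            return
        href = dict(attrs).get("href") or ""
        # Rule 1 (simplified assumption): keep only /wiki/Title links with no
        # namespace prefix (Help:, File:, ...) and no in-page anchor. This
        # also rejects the few real article titles that contain a colon.
        if href.startswith("/wiki/") and ":" not in href and "#" not in href:
            self.candidates.append(href[len("/wiki/"):])

    def handle_data(self, data):
        # Only visible text reaches this method, so parentheses inside an
        # href such as /wiki/Battery_(electricity) are not counted.
        for ch in data:
            if ch == "(":
                self.depth += 1
            elif ch == ")" and self.depth > 0:
                self.depth -= 1


def first_valid_link(paragraph_html, visited):
    """Return the first candidate title not yet visited, or None."""
    parser = FirstLinkParser()
    parser.feed(paragraph_html)
    parser.close()
    # Rule 3: skip pages we have already been to, to avoid loops.
    for title in parser.candidates:
        if title not in visited:
            return title
    return None


# Hand-written stand-in for an article's first paragraph.
sample = (
    '<p><b>Science</b> (from <a href="/wiki/Latin">Latin</a> '
    '<i>scientia</i>) is a <a href="/wiki/Help:IPA/English">x</a> systematic '
    '<a href="/wiki/Enterprise">enterprise</a> that builds '
    '<a href="/wiki/Knowledge">knowledge</a>.'
    '<a href="#cite_note-1">[1]</a></p>'
)

print(first_valid_link(sample, visited=set()))           # Enterprise
print(first_valid_link(sample, visited={"Enterprise"}))  # Knowledge
```

The parenthesis counter is deliberately simple. A real page would likely also need handling for tables, infoboxes, and italicized hatnotes, which this sketch ignores.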

To learn web scraping and Python fundamentals, I wrote a script to automatically find the degrees of separation between a Wikipedia page and “philosophy”. In this post, I share my results. Keep in mind I only search the English-language version of Wikipedia (https://en.wikipedia.org).
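
The core of such a script is a loop that keeps following first links until it lands on “Philosophy”. Here is a rough sketch of that loop, not the actual code from the repository. The function name `follow_first_links`, the `max_steps` cap, and the `TOY_LINKS` dictionary are all assumptions for illustration. The toy graph stands in for real page fetches so the example runs offline, and its back-link from Polity to Delaware is invented to show the loop rule in action.

```python
def follow_first_links(start, next_link, target="Philosophy", max_steps=100):
    """Follow first links from `start` until `target`.

    `next_link(title, visited)` must return the next page title or None.
    Returns the list of pages visited, or None if the walk dead-ends or
    exceeds `max_steps` hops.
    """
    path = [start]
    visited = {start}
    while path[-1] != target:
        if len(path) - 1 >= max_steps:
            return None  # give up on very long or endless chains
        nxt = next_link(path[-1], visited)
        if nxt is None:
            return None  # no usable link on this page
        path.append(nxt)
        visited.add(nxt)
    return path


# Toy link graph (hypothetical data, in page order); a real script would
# download each page and extract its links instead.
TOY_LINKS = {
    "Delaware": ["U.S. State"],
    "U.S. State": ["Polity"],
    "Polity": ["Delaware", "Entity"],  # invented back-link to exercise the loop rule
    "Entity": ["Existence"],
    "Existence": ["Ontology"],
    "Ontology": ["Philosophy"],
}


def toy_next_link(title, visited):
    """Return the first link on `title` that has not been visited yet."""
    for candidate in TOY_LINKS.get(title, []):
        if candidate not in visited:
            return candidate
    return None


path = follow_first_links("Delaware", toy_next_link)
print(" → ".join(path))
print(len(path) - 1, "degrees")  # 6 degrees
```

Passing the link-picking function in as an argument keeps the walking logic separate from the scraping, which makes it easy to test without touching the network.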

Random pages

I spent many an hour in high school looking at random Wikipedia pages, so the first thing I tried was investigating the distribution of degrees of separation for 500 random pages. Check it out here:

The distribution is right-skewed: most pages are clustered between 8 and 17 degrees, but there’s a long tail of very distant articles.

Some fun ones:

At five degrees removed, the closest articles were the novel Warbreaker and the 1923 silent movie Drifting. Both connected through genre → category → ontology → philosophy.
At 29 degrees removed, the most distant article was Chennai Slam, the reigning champions of the Indian basketball league.
Skimming through the random articles, many appeared to be small villages around the world like Deh Sheykh, Sirjan, international media like Three Plus Two, or musical groups like Dave’s True Story. The most banal entries I came across were banana peel and fishing bait. One that stuck out was Mysterious Castles of Clay. I also learned about the practice of eco-running.

Top 100 pages

Wikipedia has an article for nearly everything, obscure as it may be. How does this distribution change when we look at the most popular Wikipedia articles? The plot below shows the distribution of degrees of separation for the 100 most popular pages on Wikipedia:

The distributions are similarly shaped. We can look at summary statistics to compare this distribution more quantitatively to the distribution of random pages:

Statistic	Random pages	Top 100 pages
Mean	13.6	12.9
Median	13	12
Mode	12	10
Minimum	5	5
Maximum	29	27
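
For reference, the rows of a table like this can be computed with a few lines of Python. The sketch below uses the standard library's `statistics` module rather than pandas to stay dependency-free, the `summarize` helper is a name invented for this example, and `example_degrees` is a short made-up list, not the data behind the table.

```python
import statistics


def summarize(degrees):
    """Return the summary statistics used in the table for a list of degrees."""
    return {
        "Mean": round(statistics.mean(degrees), 1),
        "Median": statistics.median(degrees),
        "Mode": statistics.mode(degrees),
        "Minimum": min(degrees),
        "Maximum": max(degrees),
    }


# Made-up values for illustration only; not the post's results.
example_degrees = [5, 10, 12, 12, 13, 14, 17, 29]

for name, value in summarize(example_degrees).items():
    print(f"{name}\t{value}")
```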

Overall, the distribution for the top 100 pages is centered slightly to the left of the distribution for random pages. I attribute this to fewer “out there” articles compared to the population of random pages.

Is there a relationship between degrees of separation and page rank? Just for fun, I made an interactive plot below to investigate this. Hover over a point to learn more about it, and click on a point to go to its Wikipedia page.

[Interactive Bokeh plot]

As expected, there’s no relationship between page rank and degrees from “philosophy”.

There were six pages that were five degrees from “philosophy”. Five of them are TV shows (The Big Bang Theory, Game of Thrones, How I Met Your Mother, Breaking Bad, and Lost), all of which connected through the genre → category → ontology → philosophy chain documented earlier. Interestingly, the page in the top 100 with the highest “philosophy number” (27) was also a TV show: Glee. For whatever it’s worth (read: very little), the second-highest-ranking page, Donald Trump, is exactly as far removed from “philosophy” as the third-highest-ranking page, Barack Obama.

Coding

This exercise was a good “weekend project” introduction to commonly used Python packages like BeautifulSoup (web scraping), pandas (data frames), matplotlib (plotting), and bokeh (interactive web plots). The GitHub repository is here.
