There’s a game played among movie buffs called “Six Degrees of Kevin Bacon”. The goal is to find the fewest degrees of separation between any actor or actress and Kevin Bacon by connecting them through other actors they’ve performed in a movie with. For example, Tom Hanks has a “Bacon number” of 1, since he co-starred in Apollo 13 with Bacon. The Civil War general William Rufus Shafter has a Bacon number of 7, if you count the tenuous connection between Theodore Roosevelt and Oprah Winfrey in Food, Inc. Since Kevin Bacon is a prolific actor, the theory is that nearly all actors and actresses are within six degrees of separation from Kevin Bacon. Check out the “Oracle of Bacon” to explore more connections, or just Google “bacon number person”. Fortunately, Kevin Bacon’s own perspective of the game has taken a significant turn for the better.

“Getting to Philosophy” is the Wikipedia version of this game. The idea is that, since Wikipedia is an encyclopaedia, most of its pages link to more general topics, e.g. “Stanford University” → “private university”, with “Philosophy” being the most general topic. The difference between this and the Kevin Bacon game is that we only use the first link to connect to the next page. I’ll illustrate this with some Wikipedia pages of personal interest:

There are some rules for what counts as the first link:

  • It must link to a valid Wikipedia page - no in-page citations, external links, “meta” Wikipedia pages like this one for pronunciation, etc
  • The link can’t be in parentheses, since these links typically reference language pages (see pages like Science or Egypt)
  • The link can’t be a previously referenced page, to avoid loops. If the first link has already been referenced, go to the second link, and so on.
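
As a rough illustration of how these three rules could be applied to a page’s HTML, here is a minimal sketch. It is not the script from this post: it uses Python’s built-in html.parser instead of BeautifulSoup so that it runs without extra packages. The names FirstLinkParser and first_link are made up for this example, and skipping every link that contains a colon is a heuristic assumption for dropping “meta” pages such as Help: or File: (it would also skip the few real articles with a colon in the title).

```python
from html.parser import HTMLParser


class FirstLinkParser(HTMLParser):
    """Collect article links that appear outside parentheses, in page order."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many "(" are currently open in the text
        self.candidates = []  # hrefs that pass rules 1 and 2

    def handle_starttag(self, tag, attrs):
        # Rule 2: ignore links while we are inside parentheses.
        if tag != "a" or self.depth > 0:
            return
        href = dict(attrs).get("href") or ""
        # Rule 1: keep article links only. Citations start with "#",
        # external links with "http", and meta pages contain ":" (heuristic).
        if href.startswith("/wiki/") and ":" not in href:
            self.candidates.append(href)

    def handle_data(self, data):
        # Track parentheses in the visible text only, so a href such as
        # "/wiki/Mercury_(planet)" does not affect the depth.
        for ch in data:
            if ch == "(":
                self.depth += 1
            elif ch == ")" and self.depth > 0:
                self.depth -= 1


def first_link(html, visited):
    """Return the first valid link not yet visited (rule 3), or None."""
    parser = FirstLinkParser()
    parser.feed(html)
    parser.close()
    for href in parser.candidates:
        if href not in visited:
            return href
    return None


# A hand-written snippet shaped like the opening of a country article.
sample = (
    '<p><b>Egypt</b> (<a href="/wiki/Arabic">Arabic</a>: '
    '<a href="/wiki/Help:IPA/Arabic">[misr]</a>) is a '
    '<a href="/wiki/Country">country</a> in '
    '<a href="/wiki/Africa">Africa</a>.<a href="#cite_note-1">[1]</a></p>'
)
print(first_link(sample, visited=set()))
print(first_link(sample, visited={"/wiki/Country"}))
```

The first call skips the language and pronunciation links inside the parentheses; the second also skips the already-visited page and falls through to the next candidate.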

To learn web scraping and Python fundamentals, I wrote a script to automatically find the degrees of separation between a Wikipedia page and “philosophy”. In this post, I share my results. Keep in mind I only search the English-language version of Wikipedia (https://en.wikipedia.org).
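
The counting itself can be sketched separately from the scraping. The function below is an assumption about how such a loop might look, not the code from the repository: degrees_to and ordered_links are hypothetical names, and the tiny graph is invented so the example runs offline. In a real run, ordered_links would fetch a page and return its valid links in order.

```python
def degrees_to(start, target, ordered_links, max_steps=100):
    """Follow first links from start and count the hops needed to reach target.

    ordered_links(page) must return that page's candidate links in page order.
    Returns None if the chain dead-ends or exceeds max_steps.
    """
    visited = {start}
    page = start
    steps = 0
    while page != target:
        if steps >= max_steps:
            return None
        # Take the first link that has not been visited yet (avoids loops).
        nxt = next((l for l in ordered_links(page) if l not in visited), None)
        if nxt is None:
            return None
        visited.add(nxt)
        page = nxt
        steps += 1
    return steps


# Invented toy graph: B's first link points back to A, so the loop rule
# forces the walk to use B's second link instead.
toy = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["Philosophy"],
}
print(degrees_to("A", "Philosophy", lambda page: toy.get(page, [])))
```

The max_steps cap is a safety net for chains that wander for a long time without looping.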

Random pages

I spent many an hour in high school looking at random Wikipedia pages, so the first thing I tried was investigating the distribution of degrees of separation for 500 random pages. Check it out here:

The distribution is right-skewed - most pages are clustered between 8 and 17 degrees, but there’s a long tail of very distant articles.

Some fun ones:

Top 100 pages

Wikipedia has an article for nearly everything, obscure as it may be. How does this distribution change when we look at the most popular Wikipedia articles? The below plot shows the distribution of the 100 most popular pages on Wikipedia:

The distributions are similarly shaped. We can look at the population statistics to more quantitatively compare this distribution to the distribution of random pages:

Statistic    Random pages    Top 100 pages
Mean         13.6            12.9
Median       13              12
Mode         12              10
Minimum      5               5
Maximum      29              27
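
For reference, statistics like these take only a few lines to compute. This sketch uses Python’s standard statistics module rather than pandas so it runs anywhere; summarize is a hypothetical helper, and the list of degrees is made-up sample data, not the results above.

```python
import statistics


def summarize(degrees):
    """Return the summary statistics used in the table for a list of degrees."""
    return {
        "mean": statistics.mean(degrees),
        "median": statistics.median(degrees),
        "mode": statistics.mode(degrees),
        "minimum": min(degrees),
        "maximum": max(degrees),
    }


# Made-up degrees of separation for six pages (illustration only).
example_degrees = [5, 10, 10, 12, 13, 27]
for name, value in summarize(example_degrees).items():
    print(f"{name}: {value:.1f}")
```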


Overall, the distribution of top 100 is centered slightly to the left of the random distribution. I attribute this to fewer “out there” articles compared to the population of random pages.

Is there a relationship between degrees of separation and page rank? Just for fun, I made an interactive plot below to investigate this. Hover over a point to learn more about it, and click on a point to go to its Wikipedia page.


As expected, there’s no relationship between page rank and degrees from “philosophy”.

There were six pages that were five degrees from “philosophy”. Five of them are TV shows (The Big Bang Theory, Game of Thrones, How I Met Your Mother, Breaking Bad, and Lost), all of which connected through the genre → category → ontology → philosophy chain documented earlier. Interestingly, the page in the top 100 with the highest “philosophy number” (27) was also a TV show - Glee. For whatever it’s worth (read: very little), the second-highest-ranking page, Donald Trump, is exactly as far removed from “philosophy” as the third-highest-ranking page, Barack Obama.

Categories

Wikipedia also maintains a list of the 30 most popular pages in 15 different categories. We can characterize the distributions of degrees of separation for the pages in each category with a boxplot:

Sports teams have the tightest distribution, since many of the teams are association football (“soccer”) teams that follow the same path to “philosophy”. As we’ve already seen, the “Films and TV series” category has the widest distribution.
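
A boxplot is a picture of each category’s quartiles, so “tight” and “wide” can also be put into numbers via the interquartile range. The sketch below assumes a hypothetical box_stats helper and uses invented numbers for two categories; the “inclusive” method is chosen here because it interpolates linearly between data points.

```python
import statistics


def box_stats(degrees):
    """Return (first quartile, median, third quartile, interquartile range)."""
    q1, q2, q3 = statistics.quantiles(degrees, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1


# Invented data: one tightly clustered category and one spread-out category.
by_category = {
    "tight category": [8, 9, 10, 11, 12],
    "wide category": [5, 5, 12, 20, 27],
}
for name, degrees in by_category.items():
    q1, q2, q3, iqr = box_stats(degrees)
    print(f"{name}: Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```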

Since the above page lists the popularity of each page in each category, we can also characterize the popularity of each category. Below is a bar chart showing the page popularity of the 1st, 10th, and 30th most popular pages in each category. The top page in each category is displayed in the label.

I was surprised by two results:

  • The popularity of countries. Wouldn’t Wikipedia’s most popular page, the United States, have very general information that’s not of interest to the casual Wikipedian? I typically go to Wikipedia to find very specific information. But I can see the US page’s popularity coming from school reports, etc.
  • The popularity of, well, popular people (singers, actors, sportsmen, bands). Michael Jackson is at #6, Lady Gaga at #9, and Eminem at #10. While I’d expect Google queries to be high for these cultural icons, I didn’t expect Wikipedia to be such a prominent source for information. Of course, the entertainers at the top of the list are all American, which is partially influenced by using the English-language version of Wikipedia and widespread, high-quality internet access in America.

Coding

This exercise was a good “weekend project” introduction to commonly-used Python packages like BeautifulSoup (web scraping), pandas (data frames), matplotlib (plotting), and bokeh (interactive web plots). The GitHub repository is here.