Exploring the Network Behind Wikispeedia


Lab Reports, Networks

Introduction

I have fond memories of discovering a game called Wikispeedia in the early 2010s. It is built on the idea that, given two randomly chosen Wikipedia pages, there should be a path of hyperlinks leading from one to the other. The concept is gamified by adding a time element to see how quickly a player can find such a path. Because the game relies on the relatedness of topics, I was inspired to use a network visualization to represent these relations.

I had also recently read an article exploring how the various language editions of Wikipedia are structured. It investigates whether the network structure that Wikipedia is built upon can indicate anything about the culture of a particular edition (e.g. English Wikipedia, Wikipédia en français, etc.). I found the network visualizations included in that article very interesting, which inspired me to do my own research in this area.

For this project I wanted to take a look at the Wikispeedia data and see which Wikipedia pages are traversed most often as part of the path from source page to target page. I was also interested in which pages link to the greatest number of other pages, and in what the network for this dataset looks like overall.

Methodology

Finding a Dataset

Since I knew I wanted to work with a dataset containing information about the links between Wikipedia pages, I searched specifically for datasets in this category. Websites such as CASOS, SNAP, and the Gephi wiki proved to be useful platforms for finding datasets suitable for network visualizations. I ended up finding the Wikispeedia navigation paths dataset in Stanford's Large Network Dataset Collection (SNAP). It consists of data collected through the human-computation game Wikispeedia and contains both finished and unfinished paths from the game. I decided to focus on just the finished paths for this lab. The finished-paths dataset contained 51,318 rows with the following columns:

| Hashed IP Address | Timestamp | Duration (seconds) | Path | Difficulty Rating |

Table 1: Columns present in the dataset

A snippet of the data can be seen below in Table 2.

| Hashed IP Address | Timestamp | Duration (seconds) | Path | Difficulty Rating |
| 6a3701d319fc3754 | 1297740409 | 166 | Achilles;Ethiopia;Africa;Gold | 2 |
| 3824310e536af032 | 1344753412 | 88 | Plum;Apricot;China;Korea;South_Korea | 3 |
| 415612e93584d30e | 1349298640 | 138 | Arugula;Vegetable;Bean | 1 |

Table 2: Snippet of dataset

Tools

During the process of creating my visualization, I used a variety of tools, including OpenRefine to clean the data and RStudio to format the data in a way that Gephi could interpret in order to create the final network visualization.

Process

Cleaning the Data

This dataset was surprisingly well formatted at the source, but it still required some cleaning. Since it consisted of more than 50,000 rows, I wanted to reduce that number to something easier to work with and minimize the possibility of overloading RStudio and Gephi. I noticed that some paths included the "<" character, which indicates that a user pressed the back button while finding the path from the source page to the target page. To keep things simple, I decided to filter out all paths in which the user clicked the back button. I wanted each edge in the resulting network to represent a relation between topics, not necessarily the direction of the hyperlink on the Wikipedia page, and since I was hoping to create an undirected network it made sense to remove these paths.

Next, I noticed that some rows contained special characters that would eventually hurt the readability of the resulting node labels in the network, so I removed all rows that included these characters. I also decided to look only at paths completed in under 2 minutes and filtered the data on the duration column. After all this, 24,205 rows remained, and I decided to attempt to create my network with this cleaned and filtered dataset.
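The cleaning itself was done in OpenRefine, but the same filtering steps can be sketched in R. The snippet below is only an illustration: it assumes the raw paths_finished.tsv file from SNAP (tab separated, with comment lines starting with "#"), and the "special character" pattern is a stand-in for whatever set of characters actually needs removing.

```r
# Read the raw finished-paths file (tab separated, "#" comment lines),
# assuming the column layout described in Table 1
paths <- read.delim("paths_finished.tsv",
                    comment.char = "#",
                    header = FALSE,
                    stringsAsFactors = FALSE,
                    col.names = c("hashedIpAddress", "timestamp",
                                  "durationInSec", "path", "rating"))

# Drop paths where the player clicked the back button (marked with "<")
paths <- paths[!grepl("<", paths$path, fixed = TRUE), ]

# Drop paths containing characters that would clutter the node labels
# (this particular pattern is an assumption for illustration only)
paths <- paths[!grepl("[^A-Za-z0-9_;]", paths$path), ]

# Keep only paths completed in under 2 minutes
paths <- paths[paths$durationInSec < 120, ]
```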

Formatting the Data

Once the data was cleaned, it needed to be formatted so that Gephi could interpret the node and edge relationships. To do this I used RStudio, which proved to be a powerful tool even for a dataset of more than 24,000 rows. I transformed the data into a weighted edge list in which each row consists of two nodes that are connected, being careful to only indicate an edge between two adjacent pages in a Wikispeedia path. Ultimately there ended up being 808 nodes and 1,770 edges. An example of what the formatted data looked like can be seen in Table 3 below.

| Source | Target | Type | Weight |
| Achilles | Ethiopia | undirected | 1 |
| Ethiopia | Africa | undirected | 1 |
| Africa | Gold | undirected | 3 |

Table 3: Formatted data from RStudio
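The core of that transformation can be sketched roughly as follows; this is a simplified version of what I did in RStudio, working from the cleaned paths data frame above.

```r
# Rough sketch of building the weighted undirected edge list from the paths.
# Each path is split on ";" and every pair of adjacent pages becomes an edge;
# endpoints are ordered alphabetically so that A-B and B-A count as one edge.
edges <- do.call(rbind, lapply(strsplit(paths$path, ";", fixed = TRUE), function(p) {
  if (length(p) < 2) return(NULL)
  a <- head(p, -1)   # all pages except the last
  b <- tail(p, -1)   # all pages except the first
  data.frame(Source = ifelse(a < b, a, b),
             Target = ifelse(a < b, b, a),
             stringsAsFactors = FALSE)
}))

# Collapse duplicate pairs into a single weighted edge and add the Type
# column that Gephi expects for an undirected import
edges$Weight <- 1
edge_list <- aggregate(Weight ~ Source + Target, data = edges, FUN = sum)
edge_list$Type <- "undirected"

write.csv(edge_list, "wikispeedia_edges.csv", row.names = FALSE)
```

Importing the resulting CSV through Gephi's Data Laboratory then provides the Source, Target, Type, and Weight columns needed for an undirected, weighted network.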

Creating the Network

I imported this spreadsheet into Gephi to create the network visualization. Initially the graph looked very crowded and completely unreadable. I played around with various layouts and ended up choosing the Fruchterman-Reingold layout, since it consolidated all of the nodes into a circular area and clustered connected nodes near each other.

I also decided to map the color and thickness of each edge to its weight. This means an edge between two Wikipedia pages that occurred often in the paths from a random source to a target page is drawn in a darker green and much thicker than an edge between two pages that occurred only once across all the paths in the dataset. This is useful for spotting strong relationships between two topics on Wikipedia.

Since there are more than 800 nodes displayed in this network, I knew I had to use node size and color carefully to improve readability. First I ran the average path length statistic offered by Gephi and determined that the average path length for this network is about 4.095. Running this statistic also computes betweenness centrality, which looks at the shortest path between every pair of nodes in the network and counts how often a particular node appears on those shortest paths. It's a great way to measure which nodes play an important intermediary role within the network. I used betweenness centrality to set node size, so that important nodes are larger and less important nodes are smaller.

Figure 1: Network statistics
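As a rough cross-check on the numbers reported in Figure 1, the same statistics can also be computed in R with the igraph package. This is only a sketch using the edge list built earlier; the exact values may differ slightly from Gephi's depending on settings.

```r
library(igraph)

# Build an undirected graph from the edge list created in RStudio
g <- graph_from_data_frame(edge_list[, c("Source", "Target", "Weight")],
                           directed = FALSE)

# Average path length over all connected pairs of nodes (Gephi reported ~4.095)
mean_distance(g, directed = FALSE)

# Betweenness centrality: how often each node sits on a shortest path
# between two other nodes; these values drove the node sizing
btw <- betweenness(g, directed = FALSE)
head(sort(btw, decreasing = TRUE), 5)   # the most "in between" pages
```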

I also decided to apply some clustering in order to color the graph and show groups of nodes that are more densely connected within the network. I ran the modularity statistic with a resolution of 1.5, which resulted in the 9 clusters of nodes shown below.

Figure 2: Modularity class partitions
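Gephi's modularity statistic is based on the Louvain method, so a comparable clustering can be sketched with igraph as well. Note that the resolution argument requires a reasonably recent igraph version, and the clusters found will not match Gephi's exactly from run to run.

```r
# Community detection comparable to Gephi's modularity statistic
# (Louvain method with a resolution parameter)
communities <- cluster_louvain(g, resolution = 1.5)
length(communities)              # number of clusters found
modularity(communities)          # modularity score of the partition

# Store the cluster membership on the nodes, e.g. for coloring
V(g)$cluster <- membership(communities)
```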

Results and Analysis

The resulting network can be seen below in Figure 3.

Figure 3: Resulting network visualization

Each node represents a Wikipedia page that was traversed in the Wikispeedia game, and each edge represents a link between two pages. However, since this graph is undirected it does not capture the direction of the hyperlink, so only a relation between two pages can be determined, not the direction of that relationship.

Gephi made it easy to compute some further statistics on this graph. The diameter, the longest of all shortest paths between two nodes, is 11. In this context it means that any two connected pages in this network are at most 11 link clicks apart. I was also able to determine that the average degree is 4.381 edges and the density is 0.005. Such a low density means that most pairs of nodes in this network are not directly connected, which is reinforced by an average degree of only 4.381 in a network of 808 nodes. This makes sense for this type of data, since the network represents paths that are commonly traversed to get from a source to a target page, not all of the relations that exist between Wikipedia pages.
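Both of these numbers follow directly from the node and edge counts:

```r
# Average degree and density for 808 nodes and 1,770 undirected edges
n <- 808
e <- 1770
2 * e / n               # average degree: about 4.38
2 * e / (n * (n - 1))   # density: about 0.005
```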

Since the graph shown in Figure 3 is quite large and hard to read, I decided to experiment with some filtering. I filtered on degree and removed nodes with 3 or fewer edges. This removed many of the nodes on the periphery of the circular layout and left only 44.8% of the nodes visible. This can be seen in Figure 4.

Figure 4: Network after nodes with 3 edges or less were filtered out
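The same degree filter can be reproduced in igraph as a quick check on the share of nodes that survives:

```r
# Keep only nodes with degree 4 or more, mirroring Gephi's degree range filter
g_filtered <- induced_subgraph(g, V(g)[degree(g) >= 4])
vcount(g_filtered) / vcount(g)   # fraction of nodes remaining
```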

It's interesting to note that the nodes with the largest betweenness centrality are United States, Earth, Africa, England, and Europe. For the source and target Wikipedia pages included in this dataset, these pages proved to be vital steps along the paths. Most of them are countries or large geographic regions, which may explain their importance in the source-to-target paths, since such Wikipedia pages cover a wide array of information including geography, culture, history, and so on. It's also interesting that the nodes with the largest betweenness centrality are not necessarily the nodes attached to the highest-weighted edges.

Figure 5: Node with the largest betweenness centrality
Figure 6: Edges with the largest weight

The thickest edges occur between Batman and Scotland and between Batman and Chemistry. Since the weight of each edge is determined by counting how many times the same pair of adjacent pages appears across all paths, this indicates that these links were traversed the most times in the dataset.

Because of the nature of this dataset, there are no isolates in this network since each node is part of a completed path between the source and the target page.

Reflection

Gephi proved to be a very useful tool in visualizing this large network of nodes and edges. With the option to customize various aspects of the network it was easy to understand and analyze the relationship between the various key nodes.

In the future there are a few different things I would like to do to delve deeper into this project. First, I would like to create a hypergraph to understand more about the relationship between the clusters that Gephi identified. I would also like to experiment with turning this into a directed graph to preserve the direction of the link that each edge represents, and see how that network compares to this one and whether the nodes with the largest betweenness centrality remain the same. Lastly, I think it could be interesting to introduce the data from the unfinished paths, since that may illuminate isolates that exist within the Wikipedia network structure.

Overall, I really enjoyed exploring this dataset and was very impressed by both RStudio and Gephi and how powerful these tools are in creating complex network visualizations.

References