{"id":33475,"date":"2022-07-04T20:43:36","date_gmt":"2022-07-05T00:43:36","guid":{"rendered":"https:\/\/studentwork.prattsi.org\/infovis\/?p=33475"},"modified":"2022-07-04T20:43:36","modified_gmt":"2022-07-05T00:43:36","slug":"exploring-the-network-behind-wikispeedia","status":"publish","type":"post","link":"https:\/\/studentwork.prattsi.org\/infovis\/labs\/exploring-the-network-behind-wikispeedia\/","title":{"rendered":"Exploring the Network Behind Wikispeedia"},"content":{"rendered":"\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"458\" height=\"370\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/wikispeedia.png?resize=458%2C370&#038;ssl=1\" alt=\"\" class=\"wp-image-33550\" srcset=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/wikispeedia.png?w=458&amp;ssl=1 458w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/wikispeedia.png?resize=300%2C242&amp;ssl=1 300w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/wikispeedia.png?resize=223%2C180&amp;ssl=1 223w\" sizes=\"auto, (max-width: 458px) 100vw, 458px\" \/><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"introduction\">Introduction<\/h2>\n\n\n\n<p>I have fond memories of discovering a game called <a href=\"https:\/\/dlab.epfl.ch\/wikispeedia\/play\/\" target=\"_blank\" rel=\"noreferrer noopener\">Wikispeedia<\/a> in the early 2010s. It&#8217;s a game based on the concept that given two randomly generated Wikipedia pages, there should be a path via hyperlinks to reach one from the other. This concept is gamified by adding a time element to see how fast one is able to find this path. 
Because this game relies on the relatedness of topics, I was inspired to use a network visualization to represent these relations. <\/p>\n\n\n\n<p>I had also recently read an <a rel=\"noreferrer noopener\" href=\"https:\/\/docmarionum1.medium.com\/what-wikipedias-network-structure-can-tell-us-about-culture-38f8caabf69d\" target=\"_blank\">article<\/a> that explored how the different language editions of Wikipedia are structured. This article investigates whether the network structure that Wikipedia is built upon can indicate anything about the culture of that particular Wikipedia edition (e.g., English Wikipedia, Wikip\u00e9dia en fran\u00e7ais). I found the network visualizations included in this article very interesting, which inspired me to do my own research in this area.<\/p>\n\n\n\n<p>For the purpose of this project I wanted to take a look at the Wikispeedia data and see which Wikipedia pages are traversed most often on the path from source page to target page. I was also interested in seeing which pages link to the largest number of other pages, as well as what the network looks like overall for this dataset. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"methodology\">Methodology<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"finding-a-dataset\">Finding a Dataset<\/h3>\n\n\n\n<p>Since I knew I wanted to work with a dataset containing information about the links between Wikipedia pages, I searched specifically for datasets in this category. Websites such as <a rel=\"noreferrer noopener\" href=\"http:\/\/www.casos.cs.cmu.edu\/tools\/data.php\" target=\"_blank\">CASOS<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/snap.stanford.edu\/data\" target=\"_blank\">SNAP<\/a>, and the <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/gephi\/gephi\/wiki\/Datasets\" target=\"_blank\">Gephi Wiki page<\/a> proved to be useful platforms for finding datasets suitable for network visualizations. 
I ended up finding the <a rel=\"noreferrer noopener\" href=\"https:\/\/snap.stanford.edu\/data\/wikispeedia.html\" target=\"_blank\">Wikispeedia navigation paths<\/a> dataset on Stanford&#8217;s Large Network Dataset collection (SNAP). This dataset consisted of data collected through the human computation game, Wikispeedia. It contained datasets for both finished and unfinished paths from the game. I decided to focus on just the finished paths for the purpose of this lab. The finished paths dataset contained 51,318 rows with the following columns:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Hashed IP Address<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Timestamp<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Duration (seconds)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Path<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Difficulty<\/strong> <strong>Rating<\/strong><\/td><\/tr><\/tbody><\/table><figcaption><em>Table 1:<\/em> Columns present in the dataset<\/figcaption><\/figure>\n\n\n\n<div style=\"height:1px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>A snippet of the data can be seen below in<strong> Table 2<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes has-small-font-size\"><table class=\"has-fixed-layout\"><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Hashed IP Address<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Timestamp<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Duration (seconds)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Path<\/strong><\/td><td class=\"has-text-align-center\" 
data-align=\"center\"><strong>Difficulty<\/strong> <strong>Rating<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">6a3701d319fc3754<\/td><td class=\"has-text-align-center\" data-align=\"center\">1297740409<\/td><td class=\"has-text-align-center\" data-align=\"center\">166<\/td><td class=\"has-text-align-center\" data-align=\"center\">Achilles;Ethiopia;Africa;Gold<\/td><td class=\"has-text-align-center\" data-align=\"center\">2<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">3824310e536af032<\/td><td class=\"has-text-align-center\" data-align=\"center\">1344753412<\/td><td class=\"has-text-align-center\" data-align=\"center\">88<\/td><td class=\"has-text-align-center\" data-align=\"center\">Plum;Apricot;China;Korea;South_Korea<\/td><td class=\"has-text-align-center\" data-align=\"center\">3<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">415612e93584d30e<\/td><td class=\"has-text-align-center\" data-align=\"center\">1349298640<\/td><td class=\"has-text-align-center\" data-align=\"center\">138<\/td><td class=\"has-text-align-center\" data-align=\"center\">Arugula;Vegetable;Bean<\/td><td class=\"has-text-align-center\" data-align=\"center\">1<\/td><\/tr><\/tbody><\/table><figcaption><em>Table 2<\/em>: Snippet of dataset<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"tools\">Tools<\/h3>\n\n\n\n<p>During the process of creating my visualization, I used a variety of tools including <a rel=\"noreferrer noopener\" href=\"https:\/\/openrefine.org\/\" target=\"_blank\">Open<strong> <\/strong>Refine<\/a> to clean the data and <a rel=\"noreferrer noopener\" href=\"https:\/\/www.rstudio.com\/\" target=\"_blank\">RStudio<\/a><strong> <\/strong>to format the data in a way that <a href=\"https:\/\/gephi.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Gephi<\/a><strong> <\/strong>would be able to interpret in order to create the final network visualization. 
<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"process\">Process<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"cleaning-the-data\">Cleaning the Data<\/h4>\n\n\n\n<p>This dataset was surprisingly well formatted from the source, but still required some cleaning. Since the dataset consisted of more than 50,000 rows I attempted to reduce that number to something that would be easier to work with and minimize the possibility of overloading RStudio and Gephi. I noticed that some paths included the &#8220;&lt;&#8221; character which indicated that a user had pressed the back button while finding the path from the source page to target page. To keep things simple, I decided to filter out all paths in which the user clicked the back button. I wanted each edge in the resulting network to represent a relation between topics, not necessarily the direction of the hyperlink on the Wikipedia page. Since I was hoping to create an undirected network it made sense to remove these paths. <\/p>\n\n\n\n<p>Next, I noticed that some rows had special characters that would eventually impact the readability of the resulting node labels in the network. I went ahead and removed all rows that included these characters. I also decided to only look at paths that were completed in under 2 minutes and filtered the data based on the duration column. After doing all this, there were still 24,205 remaining rows. I decided to attempt to create my network with this cleaned and filtered dataset.  <\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"formatting-the-data\">Formatting the Data<\/h4>\n\n\n\n<p>Once the data was cleaned, it was ready to be formatted for Gephi to interpret the nodes and edges relationship. To do this I used RStudio which proved to be a powerful tool even for a dataset of more than 24,000 rows. I followed the process to transform the data into a weighted edge list where each row consists of two nodes that are connected. 
I was careful when creating the edge list to only indicate an edge between two adjacent pages in the Wikipedia path. Ultimately there ended up being <strong>808 nodes<\/strong> and <strong>1770 edges<\/strong>. An example of what the formatted data looked like can be seen in <strong>Table 3<\/strong> below.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-center\" data-align=\"center\">Source<\/th><th class=\"has-text-align-center\" data-align=\"center\">Target<\/th><th class=\"has-text-align-center\" data-align=\"center\">Type<\/th><th class=\"has-text-align-center\" data-align=\"center\">Weight<\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">Achilles<\/td><td class=\"has-text-align-center\" data-align=\"center\">Ethiopia<\/td><td class=\"has-text-align-center\" data-align=\"center\">undirected<\/td><td class=\"has-text-align-center\" data-align=\"center\">1<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Ethiopia<\/td><td class=\"has-text-align-center\" data-align=\"center\">Africa<\/td><td class=\"has-text-align-center\" data-align=\"center\">undirected<\/td><td class=\"has-text-align-center\" data-align=\"center\">1<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Africa<\/td><td class=\"has-text-align-center\" data-align=\"center\">Gold<\/td><td class=\"has-text-align-center\" data-align=\"center\">undirected<\/td><td class=\"has-text-align-center\" data-align=\"center\">3<\/td><\/tr><\/tbody><\/table><figcaption><em>Table 3: <\/em>Formatted data from RStudio<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"creating-the-network\">Creating the Network<\/h4>\n\n\n\n<p>I imported this spreadsheet into Gephi to create the network visualization. Initially the graph looked very crowded and completely unreadable. 
I played around with various layouts and ended up choosing the Fruchterman-Reingold layout since it consolidated all of the nodes into a circular area and placed connected nodes near each other. <\/p>\n\n\n\n<p>I also decided to set the color and thickness of each edge based on its weight. This means an edge between two Wikipedia pages that occurred often in the paths from a source page to a target page is colored a darker green and drawn much thicker than an edge between two pages that occurred only once across all the paths in the dataset. This is useful for identifying strong relationships between two topics on Wikipedia. <\/p>\n\n\n\n<p>Since there are more than 800 nodes displayed in this network, I knew I had to use the size and color of the nodes carefully to improve readability. First I ran the <em>average path length <\/em>statistic offered by Gephi and determined that the average path length was about 4.095 for this network. Running this statistic also computes the <em>betweenness centrality<\/em>, which considers the shortest paths between every pair of nodes in the network and measures how often a particular node occurs on those shortest paths. It&#8217;s a great way to measure which nodes play an important intermediate role within the network. 
In this case, I used the betweenness centrality measure to set node size so that important nodes are larger and less important nodes are smaller.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.21.56-PM-1-1024x834.png?resize=512%2C417&#038;ssl=1\" alt=\"\" class=\"wp-image-33546\" width=\"512\" height=\"417\" srcset=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.21.56-PM-1.png?resize=1024%2C834&amp;ssl=1 1024w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.21.56-PM-1.png?resize=300%2C244&amp;ssl=1 300w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.21.56-PM-1.png?resize=768%2C625&amp;ssl=1 768w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.21.56-PM-1.png?resize=800%2C651&amp;ssl=1 800w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.21.56-PM-1.png?resize=221%2C180&amp;ssl=1 221w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.21.56-PM-1.png?w=1228&amp;ssl=1 1228w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><figcaption><em>Figure 1: <\/em>Network statistics<\/figcaption><\/figure><\/div>\n\n\n\n<p>I also decided to apply some clustering in order to color the graph and show groups of nodes that were more connected within the network. 
I ran the modularity statistic with a resolution of 1.5 which resulted in 9 clusters of nodes shown below.<\/p>\n\n\n\n<div class=\"wp-block-image is-style-default\"><figure class=\"aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.24.07-PM.png?resize=540%2C225&#038;ssl=1\" alt=\"\" class=\"wp-image-33547\" width=\"540\" height=\"225\" srcset=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.24.07-PM.png?w=720&amp;ssl=1 720w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.24.07-PM.png?resize=300%2C125&amp;ssl=1 300w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.24.07-PM.png?resize=400%2C167&amp;ssl=1 400w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><figcaption><em>Figure 2: <\/em>Modularity class partitions<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"results-and-analysis\">Results and Analysis<\/h2>\n\n\n\n<p>The resulting network can be seen below in <strong>Figure 3<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"840\" height=\"768\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM-1024x936.png?resize=840%2C768&#038;ssl=1\" alt=\"\" class=\"wp-image-33531\" srcset=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?resize=1024%2C936&amp;ssl=1 1024w, 
https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?resize=300%2C274&amp;ssl=1 300w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?resize=768%2C702&amp;ssl=1 768w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?resize=1536%2C1405&amp;ssl=1 1536w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?resize=800%2C732&amp;ssl=1 800w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?resize=197%2C180&amp;ssl=1 197w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?w=1824&amp;ssl=1 1824w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-4.42.43-PM.png?w=1680 1680w\" sizes=\"auto, (max-width: 840px) 100vw, 840px\" \/><figcaption>Figure 3: Resulting network visualization<\/figcaption><\/figure>\n\n\n\n<p>Each node represents a Wikipedia page that was traversed in the Wikispeedia game, and each edge represents a link between two pages. However, since this graph is undirected it does not provide information about the direction of the hyperlink so only a relation between two pages can be determined &#8211; not a direction of this relationship.  <\/p>\n\n\n\n<p>Gephi made it easy to understand some further statistics on this graph. The diameter, which refers to the maximum shortest distance between two nodes, is 11. 
In this context, this means that the shortest path between any two Wikipedia pages in this network is at most 11 steps long. I was also able to determine that the average degree is 4.381 edges and the density is 0.005. Such a low density means that most pairs of nodes within this network are not directly connected to each other: only 1,770 of the 326,028 possible edges between 808 nodes are present. This is consistent with the average degree of 4.381, which means that each of the 808 pages is directly connected to only about four others on average. This makes sense for this type of data since this network represents paths that are commonly traversed to get from a source to a target page, not necessarily all of the relations that exist between Wikipedia pages. <\/p>\n\n\n\n<p>Since the graph shown in <strong>Figure 3<\/strong> is quite large and hard to read, I decided to experiment with some filtering. I filtered based on degree and removed nodes with 3 or fewer edges. This removed many of the nodes on the periphery of the circular layout and left only 44.8% of the nodes visible. 
This can be seen in <strong>Figure 4<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"840\" height=\"711\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM-1024x867.png?resize=840%2C711&#038;ssl=1\" alt=\"\" class=\"wp-image-33540\" srcset=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM.png?resize=1024%2C867&amp;ssl=1 1024w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM.png?resize=300%2C254&amp;ssl=1 300w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM.png?resize=768%2C650&amp;ssl=1 768w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM.png?resize=1536%2C1301&amp;ssl=1 1536w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM.png?resize=800%2C677&amp;ssl=1 800w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM.png?resize=213%2C180&amp;ssl=1 213w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-5.19.08-PM.png?w=1724&amp;ssl=1 1724w\" sizes=\"auto, (max-width: 840px) 100vw, 840px\" \/><figcaption><em>Figure 4: <\/em>Network after nodes with 3 edges or less were filtered out<\/figcaption><\/figure>\n\n\n\n<p>It&#8217;s interesting to note that the nodes with the largest betweenness centrality are United States, Earth, Africa, England, Europe. 
These pages proved to be vital steps in the paths between the source and target pages included in this dataset. Most of these nodes are countries or continents, which may explain their importance in the source to target path: the Wikipedia page for a place covers a wide array of information, including geography, culture, and history. It&#8217;s also interesting to see that the nodes with the largest betweenness centrality are not necessarily the nodes that are connected to the highest-weighted edges.<\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<div class=\"wp-block-image\"><figure class=\"aligncenter is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM-1024x763.png?resize=417%2C310&#038;ssl=1\" alt=\"\" class=\"wp-image-33548\" width=\"417\" height=\"310\" srcset=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?resize=1024%2C763&amp;ssl=1 1024w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?resize=300%2C224&amp;ssl=1 300w, 
https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?resize=768%2C573&amp;ssl=1 768w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?resize=1536%2C1145&amp;ssl=1 1536w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?resize=2048%2C1527&amp;ssl=1 2048w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?resize=800%2C596&amp;ssl=1 800w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?resize=241%2C180&amp;ssl=1 241w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.29.59-PM.png?w=1680 1680w\" sizes=\"auto, (max-width: 417px) 100vw, 417px\" \/><figcaption>Figure 5: Node with the largest betweenness centrality<\/figcaption><\/figure><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"840\" height=\"921\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.31.23-PM-934x1024.png?resize=840%2C921&#038;ssl=1\" alt=\"\" class=\"wp-image-33549\" srcset=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.31.23-PM.png?resize=934%2C1024&amp;ssl=1 934w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.31.23-PM.png?resize=274%2C300&amp;ssl=1 274w, 
https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.31.23-PM.png?resize=768%2C842&amp;ssl=1 768w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.31.23-PM.png?resize=800%2C877&amp;ssl=1 800w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.31.23-PM.png?resize=164%2C180&amp;ssl=1 164w, https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/Screen-Shot-2022-07-04-at-8.31.23-PM.png?w=1198&amp;ssl=1 1198w\" sizes=\"auto, (max-width: 840px) 100vw, 840px\" \/><figcaption>Figure 6: Edges with the largest weight<\/figcaption><\/figure><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>The thickest edges occur between <em>Batman and Scotland<\/em> and <em>Batman and Chemistry<\/em>. Since the weight of each edge is determined by aggregating the matching source and target values, this indicates that these two links were traversed the most times in the paths included in this dataset.<\/p>\n\n\n\n<p>Because of the nature of this dataset, there are no isolates in this network since each node is part of a completed path between the source and the target page.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"reflection\">Reflection<\/h2>\n\n\n\n<p>Gephi proved to be a very useful tool for visualizing this large network of nodes and edges. With the option to customize various aspects of the network, it was easy to understand and analyze the relationships between the key nodes. <\/p>\n\n\n\n<p>In the future there are a few different things I would like to do to take this project further. First, I would like to create a hypergraph to understand more about the relationship between the clusters that were identified by Gephi. 
I would also like to experiment with turning this into a directed graph to preserve the direction of the link that each edge represents. I would like to see how that network compares to this one and whether the nodes with the largest betweenness centrality remain the same. Lastly, I think it could be interesting to also introduce the data from the unfinished paths since that may reveal isolates that exist within the Wikipedia network structure.<\/p>\n\n\n\n<p>Overall, I really enjoyed exploring this dataset and was very impressed by how powerful both RStudio and Gephi are for creating complex network visualizations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"references\">References<\/h2>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<ul class=\"wp-block-list\"><li>Robert West and Jure Leskovec:\u00a0<a href=\"http:\/\/infolab.stanford.edu\/~west1\/pubs\/West-Leskovec_WWW-12.pdf\">Human Wayfinding in Information Networks.<\/a>\u00a0<em>21st International World Wide Web Conference (WWW),<\/em>\u00a02012.<\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li>Robert West, Joelle Pineau, and Doina Precup:\u00a0<a href=\"http:\/\/infolab.stanford.edu\/~west1\/pubs\/West-Pineau-Precup_IJCAI-09.pdf\">Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts.<\/a>\u00a0<em>21st International Joint Conference on Artificial Intelligence (IJCAI),<\/em>\u00a02009.<\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/snap.stanford.edu\/data\/wikispeedia.html\">https:\/\/snap.stanford.edu\/data\/wikispeedia.html<\/a><\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li><a 
href=\"https:\/\/docmarionum1.medium.com\/what-wikipedias-network-structure-can-tell-us-about-culture-38f8caabf69d\">https:\/\/docmarionum1.medium.com\/what-wikipedias-network-structure-can-tell-us-about-culture-38f8caabf69d<\/a><\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"http:\/\/www.martingrandjean.ch\/gephi-introduction\/\">http:\/\/www.martingrandjean.ch\/gephi-introduction\/<\/a><\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/gephi.org\/users\/quick-start\/\">https:\/\/gephi.org\/users\/quick-start\/<\/a><\/li><\/ul>\n<\/div><\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Introduction I have fond memories of discovering a game called Wikispeedia in the early 2010s. It&#8217;s a game based on the concept that given two randomly generated Wikipedia pages, there should be a path via hyperlinks to reach one from the other. This concept is gamified by adding a time element to see how fast&hellip;<\/p>\n","protected":false},"author":4005,"featured_media":33550,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[149,342],"tags":[],"coauthors":[1792],"class_list":["post-33475","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-labs","category-networks"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infovis\/wp-content\/uploads\/sites\/3\/2022\/07\/wikispeedia.png?fit=458%2C370&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paBdcV-8HV","_links":{"self":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/posts\/33475","targetHints":{"allow":["GET"]}}],"collect
ion":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/users\/4005"}],"replies":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/comments?post=33475"}],"version-history":[{"count":4,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/posts\/33475\/revisions"}],"predecessor-version":[{"id":33551,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/posts\/33475\/revisions\/33551"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/media\/33550"}],"wp:attachment":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/media?parent=33475"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/categories?post=33475"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/tags?post=33475"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/coauthors?post=33475"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}