Network visualization is not a task for the easily discouraged. Between the search for a suitable dataset, the occasional program crash when Gephi got overloaded, and the endless questions that crop up, working with network data requires patience and attention to detail. After familiarizing myself with some of Gephi’s basic functions by practicing with its “quick start guide” and working by trial and error through several datasets, like Pablo Gleiser and Leon Danon’s (2003) jazz musician network dataset, I settled on a dataset created by Mark Newman (2006) that captures scholarly coauthorship of articles related to network science, accessed via the Gephi wiki. My primary goals in working with Newman’s dataset in Gephi were to identify clusters within this network and to experiment with design to determine which design methods would best underscore the nuances within this community. As I played with the dataset in Gephi, my initial goals transformed into identifying an explanation for the low density of this network as well as determining what methods are best suited to visualizing open networks.
When I first uploaded Newman’s coauthorship dataset, I applied some of the basic steps from the Gephi quick start guide in order to get oriented within the dataset, playing with different layouts and running reports to give myself more data to work with. I also took into consideration the notes and questions raised in class after I presented a rough draft of this network, included below.
I decided to go back to the drawing board with comments from my professor and peers in mind; I leaned into understanding why the network is so sparse by comparing the number of triangles with other measures of connectedness, centrality, and community. I also kept an open mind to different strategies for laying out, coloring, or otherwise working with this network.
The network below is what Gephi generated when I first restarted, with nodes sized and colored by degree; red represents nodes with the highest degree, and blue represents nodes with the lowest degree. Edge colors are a blend of the colors of the source and target nodes. Edge thickness corresponds to the degree of the connected nodes, with thicker lines indicating higher-degree nodes.
Alternating between the Force Atlas, Expand, Fruchterman-Reingold, and “Noverlap” layout functions, I achieved a regular, circular structure of the same network representing degree (coloring and sizing of nodes remaining the same at this point).
I found that Force Atlas, Yifan Hu, and other layout functions resulted in large, sparse networks like the one above. I reapplied the Fruchterman-Reingold layout and obtained the following network.
After running the modularity, clustering coefficient, connectivity, eigenvector centrality, graph density, and edge overview reports, I focused on eigenvector centrality and the number of triangles present within the network and experimented with coloring, sizing, and partitions for nodes and edges. When visualizing node degree, closeness, and clustering coefficients via color, no clear patterns emerged; the quality most emphasized was the overall lack of connections between nodes. Manipulating the size of nodes based on these variables wasn’t helpful either, given the relatively limited range of degrees present (degree ranged between 1 and 34) and, again, the overall lack of connection between nodes. Edge thickness was a better proxy for degree.
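To make “lack of connections” concrete, here is a minimal sketch of the usual density calculation for an undirected graph: the number of edges that exist divided by the number that could exist. The function name and the toy numbers are my own illustrations, not values from Newman’s dataset or from Gephi’s report.

```python
def graph_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected simple graph: realized edges / possible edges."""
    if num_nodes < 2:
        return 0.0  # a graph with fewer than two nodes has no possible edges
    possible_edges = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible_edges


# Hypothetical toy network: 6 authors joined by 4 coauthorship ties.
toy_density = graph_density(6, 4)
print(round(toy_density, 3))
```

Because the number of possible edges grows with the square of the number of nodes, a large coauthorship network can have a very low density even when every author has a handful of collaborators.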
At this point, I took a step back from the graph and focused on the results of the statistics reports. In re-running the community detection statistics reports, I found the number of communities present was never less than 396, no matter how high I set the resolution of the community detection function. When comparing reports, I realized that this was exactly the same as the number of weakly connected components present in this dataset – 396.
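The match between the two numbers makes sense in hindsight. As far as I understand it, merging two groups that share no edges can never raise a modularity score, so the number of connected components acts as a floor on the number of communities. Below is a minimal sketch of counting connected components with a breadth-first search; the toy author names and edges are hypothetical, and this is not Gephi’s own implementation.

```python
from collections import deque


def connected_components(nodes, edges):
    """Group the nodes of an undirected graph into connected components."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical toy network: one trio, one pair, and one author with no ties.
toy_nodes = ["A", "B", "C", "D", "E", "F"]
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]
toy_components = connected_components(toy_nodes, toy_edges)
print(len(toy_components))
```

In this toy network no community detection run could return fewer than three groups, just as no run on the coauthorship network returned fewer than 396.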
Unlike modularity and closeness measures, visualizing the number of triangles present (obtained via the clustering coefficient report) provided a more effective proxy for understanding spheres of influence among authors. I compared networks visualizing the number of triangles with others visualizing eigenvector centrality. This was inspired by Newman (2006), as he created this dataset to demonstrate the usefulness of eigenvector values when assessing centrality of components in a network. What I found especially interesting about this comparison is the nuance provided by the number of triangles relative to eigenvector centrality. Compare the two networks below: the first represents eigenvector centrality, the second represents the number of triangles. In both, red is assigned to nodes and edges with the highest values, blue is assigned to nodes and edges with the lowest values, and transitional colors are assigned to values in between. In the eigenvector network, only one cluster of nodes in the lower right quadrant stands out, with some hints of green in the upper left quadrant. In the triangles network, the same cluster from the eigenvector network stands out, as well as three additional clusters.
Eigenvector Centrality
Number of triangles
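One possible reason the triangles view surfaces more clusters: eigenvector centrality rewards being connected to well-connected nodes, so in a network split into many separate components the scores tend to pile up in a single dominant component, while a triangle count only looks at a node’s immediate neighborhood. The sketch below illustrates that difference on a hypothetical toy network. The function names are mine, and the simple power iteration is an assumption for illustration, not necessarily how Gephi computes its report.

```python
from itertools import combinations


def build_adjacency(nodes, edges):
    """Undirected adjacency sets keyed by node."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def triangles_per_node(adjacency):
    """Count, for each node, the pairs of its neighbors that are themselves linked."""
    return {
        node: sum(1 for u, v in combinations(sorted(neighbors), 2) if v in adjacency[u])
        for node, neighbors in adjacency.items()
    }


def eigenvector_centrality(adjacency, iterations=100):
    """Power iteration, rescaled so the largest score is 1 after each pass."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        updated = {node: sum(scores[n] for n in adjacency[node]) for node in adjacency}
        largest = max(updated.values())
        if largest == 0:
            break  # no edges at all, so there is nothing to propagate
        scores = {node: value / largest for node, value in updated.items()}
    return scores


# Hypothetical toy network: a tightly knit group of four (A-D) and a separate trio (X-Z).
toy_nodes = ["A", "B", "C", "D", "X", "Y", "Z"]
toy_edges = list(combinations(["A", "B", "C", "D"], 2)) + list(combinations(["X", "Y", "Z"], 2))
toy_adjacency = build_adjacency(toy_nodes, toy_edges)

triangle_counts = triangles_per_node(toy_adjacency)
centrality = eigenvector_centrality(toy_adjacency)
print(triangle_counts)
print({node: round(score, 6) for node, score in centrality.items()})
```

In the toy network the separate trio still registers one triangle per author, but its eigenvector scores shrink toward zero because the larger group dominates the calculation.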
From here, I started filtering out nodes by their degree to take a closer look at the areas of interest in the number of triangles network and at the coauthors within them. The next sequence of visualizations shows what that process looked like.
Here’s our network with only nodes with a degree of 5 or more remaining.
Now, nodes with a degree of 10 or higher, and the remaining network re-laid out.
Now, only nodes with a degree of 16 or higher, re-laid out again, and labeled.
And, lastly, only nodes with a degree of 20 or higher, re-laid out again, and labeled.
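The filtering sequence above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy network, and it assumes that degree is computed once on the full network and that an edge is kept only when both of its endpoints survive the filter; I have not verified that Gephi’s degree filter behaves exactly this way.

```python
def filter_by_min_degree(edges, min_degree):
    """Keep nodes whose degree in the full graph is at least `min_degree`."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1

    kept_nodes = {node for node, d in degree.items() if d >= min_degree}
    # An edge survives only if both of its endpoints survive.
    kept_edges = [(a, b) for a, b in edges if a in kept_nodes and b in kept_nodes]
    return kept_nodes, kept_edges


# Hypothetical toy network: two hubs (H1, H2) with separate sets of collaborators.
toy_edges = [
    ("H1", "a"), ("H1", "b"), ("H1", "c"),
    ("H2", "d"), ("H2", "e"), ("H2", "f"),
    ("a", "b"),
]

for threshold in (1, 2, 3):
    nodes, edges = filter_by_min_degree(toy_edges, threshold)
    print(threshold, sorted(nodes), edges)
```

At the highest threshold both hubs remain but no edges do, which shows how a well-connected author can end up isolated in a filtered view once all of their less-connected collaborators have been removed.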
In filtering out more and more nodes, the recurring red cluster becomes more nuanced, highlighting nodes that had repeated connections to each other (i.e., authored multiple papers together). The relative intensity of the relationship between Cagney, Mansfield, and Uetz compared to others in their cluster becomes more apparent after the network was re-laid out, and even more so when the degree filter was set to 20. Additionally, we see some clusters shrink to reveal pairs of authors who collaborated the most and some individual authors who collaborate more often than others. For example, once nodes with a degree less than 16 were filtered out, the unequal relationships between Barabasi, Oltvai, and Jeong became clear. Each author has collaborated with the other two, with Barabasi collaborating with Jeong more often than with Oltvai, and Jeong and Oltvai collaborating even less frequently. In the final network, Newman and Young are unconnected to the remaining triangles. This indicates that these authors have collaborated more widely than the individuals with whom they collaborate.
After reaching the last iteration of this network, I turned to Google Scholar to take a closer look at the remaining authors. While all of these authors focus on network science, the sources of their networks varied. Cagney, Mansfield, and Uetz’ scholarly writing is primarily grounded in microbiology or plant and animal biology. Although Barabasi and Jeong often evaluate networks regarding social interactions, their networks focus on human health and biology instead of plant or animal biology. Even Newman, who has dedicated his academic career to network science, describes himself on his website as “the physicist who works on networks” at the University of Michigan, Ann Arbor. This is a potential explanation for why this overall network was so sparse: all authors in this network were united in their focus on network science in some way, but their disciplines diverged.
Delving into network data, both in the sense of locating a network dataset that worked for me and in working with the coauthorships in network science dataset, compelled me to think critically about the different ways we are all connected to each other as well as how fragile these connections can be. Searching for a suitable dataset and working with it in Gephi confirmed for me that we stand on the shoulders of giants, those who have come before us, as we reach for what we don’t know. As for next directions, with this dataset or others, I’m interested in comparing the communities present within other datasets focused on academic coauthorship. The last iteration of my network visualization gave me a manageable list of authors to delve deeper into, and I’m still intrigued by the seeming lack of overlap of primary disciplines between these scholars. I believe this diversity of discipline explains the lack of density present in this network, and I’m curious as to how much denser a network of coauthors would be if it were limited to, say, a single discipline, or a group of academic publications centered around a single discipline.