Scholarly Collaboration in Network Science

For this lab, I used the Co-authorships in Network Science dataset compiled by M. Newman in 2006 (retrieved from CASOS). The data records instances of co-authorship among 1,589 scientists writing about network theory, gathered from the bibliographies of two articles on the topic. The dataset contains 1,589 nodes, one per author. The network is undirected because co-authorship is mutual, and it is weighted “directly in terms of the number of collaborations between authors and inversely in terms of the number of other authors involved” (University of California Irvine, n.d.).

I chose this data because I am interested in using visualizations to communicate information related to academics, including co-authorships, citations, and other scholarly collaborations. These kinds of networks can be very large or small, but this one seemed like a manageable size initially. I entered the lab with several expectations and goals. Since the data is based on a specific group of authors cited by the same publications, I expected the network to be dense, with a relatively small number of main clusters. I also wanted to reach the following goals with my visualization:

Display which author or authors have the greatest centrality
See the relationships between authors connected to the most central nodes
Use color to encode clusters and size to encode degree
Determine the diameter and density of the network: is the network dense, or do a few main nodes hold all the connections (are there structural holes)?

As will be seen, the network was more complex than anticipated. The final product leaves room for improvement, but it has helped me understand how to handle network data better.

Inspirational Visualizations

This visualization of a scientific collaboration network, drawn from PubMed data, is large but dense. I thought the use of color was effective, and it is striking how many red nodes and edges there are. Some nodes are also significantly larger than the rest, drawing the eye directly to the most central nodes of the network. I also liked how nodes with only a few connections are pushed to the outer edges of the network and draw less attention. Ultimately, this is an effect I would have liked to achieve in my own visualization.

For a different layout, I found this visualization showing conversations between students on a listserv to be quite powerful. I question how effective the radial layout would be for a network with as many nodes as mine, but here it communicates the concentration of conversations and shows which students (the particularly bright spots) reach out across the network the most. With my visualization, I wanted to get the same message across. In the future, I want to try a radial layout as well.

Lastly, this network of retweets using hashtags related to GamerGate stands out to me because of how polarized the nodes are. There are some cross-discussions happening, but the conversation is primarily concentrated on either side of the network. I think it is a great example of the importance of positioning of nodes in a network, and in my work going forward, I would use this as a reminder to aim for the same level of clarity through positioning.

Methods & Results

The earliest stage of the network. Note the high volume of isolates and dyads.

I opened the GML file of the data and began with the Force Atlas 2 layout because I thought that having nodes repel and attract each other worked best for showing relationships between individuals. I also tried the Fruchterman Reingold algorithm, which could have worked as well, but I did not have a chance to investigate it in any depth. Before adjusting the gravity pull, the nodes were very spread out and fragmented, making it difficult to see relationships. Increasing the gravity pulled the nodes into a circle and drew the nodes together, but there was still a great deal of fragmentation due to a large number of isolates and small clusters containing just 2-3 nodes. There was also too much overlap, especially in these numerous small clusters, so I used the prevent overlap feature and slightly decreased the gravity. The nodes began to spread apart enough to see the different connections and cluster shapes more distinctly.

Once the layout was stable and before manipulating the nodes any further, I ran the statistics. The diameter of the network was 17 and the density was only 0.002, so it was very far from being fully connected. Degree ranged from 0 to 34 collaborators, with an average of 3.451, but far more nodes sat at the low end of that range than the high end. When I used the ranking feature to size nodes by degree, this concentration on the low end made nodes of different values indistinguishable. It was not clear which nodes were in the same positions/roles and which were not. I adjusted the max/min sizes to create a better range of node sizes to show differences in degree. I also sized the edges by weight to show how much each pair of authors collaborated.
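The reported density can be cross-checked against the average degree: for an undirected network, density equals the average degree divided by the number of nodes minus one. The sketch below does that arithmetic with the figures above, and also shows a linear degree-to-size mapping of the kind the ranking pane applies. The function names and the size bounds of 5 and 50 are illustrative assumptions, not the values I used in the lab.

```python
def density_from_average_degree(avg_degree, n_nodes):
    """Density of an undirected simple graph: 2m / (n(n-1)) = avg_degree / (n-1)."""
    return avg_degree / (n_nodes - 1)


def scale_size(degree, min_degree, max_degree, min_size, max_size):
    """Linearly map a degree onto the range [min_size, max_size]."""
    if max_degree == min_degree:
        return float(min_size)
    fraction = (degree - min_degree) / (max_degree - min_degree)
    return min_size + fraction * (max_size - min_size)


# Figures reported above: 1,589 nodes, average degree 3.451, degrees from 0 to 34.
density = density_from_average_degree(3.451, 1589)
print(f"density = {density:.4f}")

# Hypothetical size bounds (5 and 50), chosen only for illustration.
for degree in (0, 3, 34):
    print(degree, round(scale_size(degree, 0, 34, 5, 50), 1))
```

Under a linear mapping like this, a node with a near-average degree of 3 lands very close to the minimum size, which is consistent with most nodes looking alike until the size range is widened.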

The network prior to filtering, after partition. Note the many clusters and colors.

For clustering, I ran the modularity starting with the resolution at 1. This created 407 communities, which is far too many for such a small dataset. I tried other resolutions, but even at 1.5 there were still 402 communities. Going into the partition pane, I colored the nodes according to the modularity. This led to a huge range of colors with little variation, making the clusters too similar in color and creating the false impression that one cluster was connected to another when they were not.
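A likely reason the community count barely moved between resolutions is fragmentation. Modularity-based clustering gains nothing from merging groups that share no edges, so each isolate and each disconnected dyad or triad should end up as its own community (this is my assumption about how Gephi's implementation behaves; I did not verify it). If so, the number of connected components acts as a floor on the number of communities. A minimal sketch of counting components, using a made-up toy network rather than the lab data:

```python
from collections import defaultdict


def count_components(nodes, edges):
    """Count connected components of an undirected graph with an iterative DFS."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        # An unvisited node starts a new component.
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            current = stack.pop()
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components


# Hypothetical toy network: one triangle, one dyad, and two isolates.
toy_nodes = ["a", "b", "c", "d", "e", "f", "g"]
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
print(count_components(toy_nodes, toy_edges))
```

Running the same count on the full edge list would show how much of the 407 is explained by disconnected pieces alone.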

Though not ideal, I decided that the isolates and the dyads could be excluded to reduce the clutter. I filtered the network to include only nodes with a degree of 3 or higher, which reduced the number of communities to 88 and the number of nodes to 752. If I filtered to higher degrees, nodes that were connected to central clusters began to disappear. The problem of similar colors persisted, so I manually changed the color of the smaller clusters to gray in order to draw focus to only the densest clusters. This helped highlight the biggest clusters, but their colors are still too similar and need to be edited (particularly the different hues of green, blue, and purple nodes).
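The degree filter amounts to a simple rule: count each node's connections and keep the nodes at or above a threshold. Here is a minimal sketch in Python, assuming (as I understand Gephi's degree range filter) that degree is measured once on the full network before anything is removed. The toy edge list and the function name are hypothetical.

```python
from collections import Counter


def filter_by_min_degree(edges, min_degree):
    """Keep nodes whose degree in the full graph is at least min_degree,
    plus the edges that run between two kept nodes."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    kept_nodes = {node for node, d in degree.items() if d >= min_degree}
    kept_edges = [(a, b) for a, b in edges if a in kept_nodes and b in kept_nodes]
    return kept_nodes, kept_edges


# Hypothetical toy edge list: a fully connected group of four (a, b, c, d),
# one low-degree node (e) attached to it, and a separate dyad (x, y).
toy_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"), ("x", "y"),
]
nodes, edges = filter_by_min_degree(toy_edges, 3)
print(sorted(nodes))
print(len(edges))
```

In the toy example, node e is dropped even though it is attached to the main cluster, which mirrors what happened in the lab when I raised the threshold.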

The latest version of the network.

The visualization I created does show some interesting connections that would not be clear just by looking at a list or table of coauthors. The centrality of certain nodes is clear, especially in the pink cluster toward the middle of the network in which the center node is visibly larger than the rest and has thick edges reaching out to other slightly smaller nodes. The edge weights are very clear in the colored clusters, and it is interesting to note that in the denser clusters that are more fully connected (such as the bright blue, orange, and purple clusters at the very edges of the network), the edge weights are mostly the same between all nodes. The visualization shows that there are subsets of this community of authors that interact a great deal with each other but do not reach outside their group. On the other hand, there are groups that will connect to one another with just a few, or even one, collaboration. The coloring nicely shows these links and enables the viewer to follow the paths between clusters. In the center, all of the clusters are connected in some way. If the nodes were labeled, we would be able to tell which authors have the most centrality and which are reaching out to collaborate between groups.

Future Directions

If I were to work on this visualization more, my first priority would be fixing the color problems discussed above. I would also like to experiment with more types of layouts, although I think using Force Atlas 2 ultimately communicates most of the information I wanted to convey. I would want to figure out how to better pull the most connected nodes to the center and push the rest outwards (as in the PubMed visualization example above). Ideally, I would also make the visualization interactive and dynamic. The display would be greatly enhanced with filtering options for the user and the ability to hover over nodes to see the names of authors as well as their stats. The nodes could have more attributes assigned to them for even more enhanced filtering and analysis, such as institutional affiliation.

In terms of actually using the visualization, it could be a good starting point for understanding the relationships between the authors writing about network theory. Some research questions immediately come to mind: what are the authors within each cluster writing about? What topics are connecting one cluster to another? Using other types of visualizations in conjunction with the network could show this kind of information, such as a bar graph depicting popular topics or a map that charts where these authors are writing from to show geographical relationships.

References

Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Preprint physics/0605087.

University of California, Irvine. (n.d.). Newman – NetScience Co-Authors. Datasets. Retrieved from http://moreno.ss.uci.edu/data.html#netsci

Information Visualization

Student work at the School of Information, Pratt Institute
