Introduction
Academic publishing is a large, lucrative business. From textbooks to journal articles and many formats in between, the publishing and distribution of scholarly materials rake in over twenty-five billion dollars a year (Bluestone, 2015).
I have the pleasure of working for a giant in the industry, ITHAKA Harbors, Inc. Specifically, I work for a subsidiary called JSTOR, a digital platform which licenses journals and makes their content available to various subscribers in higher education. The journals are organized by subject to facilitate efficient and efficacious discovery, but inevitably these groupings prove outdated, as fields of study become more interdisciplinary and less easily defined. To ensure that JSTOR remains both relevant and innovative, the content development department is in constant conversation about the boundaries of current subjects and the potential of new ones.
For the last six months, I have collected data on the latest developments in higher education. My data collection, or, more aptly, data selection process is outlined in Appendix I. The purpose of the dataset is two-fold:
- To provide a snapshot of the current trends in higher education
- To inform general content development inquiries
In addition to producing this data, I am also charged with designing visualizations to help communicate key conclusions. During conversations with many interested parties within ITHAKA, the desire to highlight and explore the interdisciplinarity of the data arose repeatedly.
Discussion
Given its utility as a display of relationships and interconnectedness, a network visualization is appropriate to show the most commonly occurring subjects and those which are most frequently associated.
In his search to discern the centrality of Hegel in the history of philosophy, Adam Hogan shares a vis created by his friend using Gephi:
The graphic shows Hegel as a highly influential figure, but Hogan was dubious, saying that “[his] money would have been on Plato” (Hogan, 2012). Using data scraped from Wikipedia and relying on his own experience with the history of philosophy, Hogan tests his theory by recreating the vis with a few key changes.
First, Hogan changes the edges from undirected to directed, a better proxy for the inspired/inspirer relationship, but this produces no discernible difference. Next, Hogan swaps the “default Authority measure” in Gephi (Hogan, 2012) for PageRank, “the secret sauce that Google used” to decide which sites were most important in search results (Sullivan, 2016). Certain that this alteration will be the key to a more accurate vis, Hogan meets with disappointment again.
Eventually, Hogan decides that the issue lies within the data and does not involve Gephi at all. I can certainly relate, as my own vis has had problems I have not been able to resolve in Gephi. The main difference is that I have no one to blame for an unruly dataset but myself.
Here’s Hogan’s vis again (part 2), but with an interactive interface.
Because the audience for my vis will almost certainly value a movable, scalable graphic over a PDF, I’m committed to untangling the complications of SigmaJS to produce something similar.
The following visualization is a result of manipulating a JSTOR dataset.
I am exclusively interested in the aesthetics of this vis, however. In particular, I’m partial to substituting the node labels for the nodes themselves and to sizing each label (font) relative to its “structural importance (betweenness centrality)”. Also, the edges are thicker for “more strongly… associated” words (“Generating and Visualizing…”, 2015).
Nodus Labs produced a vis with color coded groupings to show the structure of Russian protest groups.
Fortunately, they outlined their process, and in Future Directions, I share my intent to apply their approach to a more representative dataset than what is used in this lab.
Materials
I used a computer with the following programs: Gephi (0.9.1) and the SigmaJS plugin (for Gephi 0.9.0), R, and Excel.
Methods and Results
Data selection is outlined in Appendix I. From the larger dataset, I cut out the Subjects field and pasted it into a CSV file. This file proved too large to parse in R, so I pared the CSV down to the first 100 records. I uploaded this CSV file to my R workspace and applied the R code provided by Professor Chris Sula for generating all possible edges and attaching the frequency of their occurrence within the data. The code includes a line which saves the edges table to a new CSV file. In Excel, I trimmed the cells to remove the leading spaces that would cause duplicate data in Gephi. I also removed subjects which were not considered “official,” e.g. “Applicable JSTOR Subject Not Found.”
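The edge-generation and cleanup steps can be sketched as follows. This is not Professor Sula's R code, which is not reproduced in this report. It is a minimal Python illustration that assumes each record's Subjects field is a single comma-separated string. The function name `build_edges`, the delimiter, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Placeholder value removed because it is not an "official" subject.
EXCLUDED = {"Applicable JSTOR Subject Not Found"}


def build_edges(records, delimiter=","):
    """Count how often each unordered pair of subjects shares a record."""
    counts = Counter()
    for record in records:
        # Strip stray whitespace so " Public Health" and "Public Health" match.
        subjects = {s.strip() for s in record.split(delimiter)}
        subjects = {s for s in subjects if s and s not in EXCLUDED}
        # Sorting makes (A, B) and (B, A) the same undirected edge.
        for source, target in combinations(sorted(subjects), 2):
            counts[(source, target)] += 1
    return counts


# Hypothetical sample records, not drawn from the JSTOR dataset.
sample = [
    "Health Policy, Public Health, Health Sciences",
    "Health Policy, Public Policy & Administration",
    "Public Health, Health Policy, Applicable JSTOR Subject Not Found",
]
edges = build_edges(sample)
for (source, target), weight in sorted(edges.items()):
    print(f"{source},{target},{weight}")
```

Trimming whitespace before counting, rather than afterwards in a spreadsheet, keeps a subject with a leading space from being tallied as a separate node.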
The edges table was uploaded to Gephi with Type set to Undirected. I copied Node names into the Labels column and assigned the field containing frequency to the edges’ Weight field. Then, I calculated betweenness and Weighted Degree.
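Gephi computes these statistics internally. As a rough illustration of what Weighted Degree measures, the sketch below sums the weights of every edge touching a node. It assumes an undirected edge table keyed by subject pairs. The function name `weighted_degree` and the example weights are hypothetical, and betweenness is left out because it needs a shortest-path algorithm that is beyond a short example.

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum the weights of all edges touching each node (undirected)."""
    totals = defaultdict(int)
    for (source, target), weight in edges.items():
        # An undirected edge contributes its weight to both endpoints.
        totals[source] += weight
        totals[target] += weight
    return dict(totals)


# Hypothetical edge weights, not taken from the JSTOR data.
example_edges = {
    ("Health Policy", "Public Health"): 2,
    ("Health Policy", "Health Sciences"): 1,
    ("Health Sciences", "Public Health"): 1,
}
degrees = weighted_degree(example_edges)
for node, total in sorted(degrees.items()):
    print(node, total)
```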
I applied the following layouts to the final vis in this order: OpenOrd, Expansion (ten times), Noverlap. I set node color to #75b5ff and node size to a scale of 5 to 100 by Weighted Degree. I set edge color on a scale from pale to deep orange by edge weight and edge thickness by the same measure.
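The node sizing can be pictured as a rescaling of Weighted Degree onto the 5 to 100 range. The sketch below assumes a simple linear interpolation between the smallest and largest values; Gephi's own ranking settings may interpolate differently. The function name `scale_sizes` and the input values are hypothetical.

```python
def scale_sizes(values, min_size=5.0, max_size=100.0):
    """Map each node's value linearly onto [min_size, max_size]."""
    lowest = min(values.values())
    highest = max(values.values())
    if highest == lowest:
        # All nodes are equal, so give every node the midpoint size.
        midpoint = (min_size + max_size) / 2
        return {node: midpoint for node in values}
    span = highest - lowest
    return {
        node: min_size + (value - lowest) / span * (max_size - min_size)
        for node, value in values.items()
    }


# Hypothetical Weighted Degree values, not taken from the JSTOR data.
sizes = scale_sizes({"A": 2, "B": 3, "C": 6})
for node, size in sorted(sizes.items()):
    print(node, size)
```

With these inputs, the smallest node is drawn at size 5 and the largest at size 100, with the rest placed proportionally between them.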
To be consistent with other visualizations I’ve made for my office, I chose #D1D1D1 as the background color.
I performed calculations for diameter and density. The diameter was 1, which is expected because each subject in the small dataset was connected to every other subject at least once. The density was high for the same reason.
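To make the two measures concrete, the sketch below computes them for a small undirected graph. It uses the standard density formula for a simple undirected graph, 2E / (N(N - 1)), and finds the diameter by breadth-first search from every node. The function names and the four-node example are hypothetical, and Gephi's reported figures may be computed under different settings. Under this formula, a graph with diameter 1 is fully connected and has density 1.

```python
from collections import deque


def density(nodes, edges):
    """Share of possible undirected edges that are present."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))


def diameter(nodes, edges):
    """Longest shortest path, counted in edges, between any two nodes."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    longest = 0
    for start in nodes:
        # Breadth-first search gives shortest hop counts from `start`.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in distance:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
        if len(distance) < len(nodes):
            raise ValueError("graph is not connected")
        longest = max(longest, max(distance.values()))
    return longest


# Hypothetical graph in which every node is linked to every other node.
nodes = ["A", "B", "C", "D"]
complete = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
print(density(nodes, complete), diameter(nodes, complete))
```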
Future Directions
I experimented with several layouts before resetting and applying the final mix. While they produced interesting visuals, they did not show the groupings I envisioned (like those of the Russian protest groups vis).
Here’s the Fruchterman Reingold
and the Force Atlas:
For future iterations, I will try still more layouts, starting with the procedure of Nodus Labs.
I would like the vis to be interactive before presenting it to my supervisors (one of whom explicitly expressed her desire to see this done). Unfortunately, because of unknown complications with SigmaJS, this version is only available as an image or from the Gephi file. Embedding the vis in a site on our intranet, where users can manipulate and modify it for their specific interests, is the ideal, so I will continue troubleshooting.
Further, the data used to create this vis were not a representative sampling. The most prominent nodes (like Public Policy & Administration) do not reflect the statistics taken of the original data (which show other subjects to be most popular by frequency of occurrence and other measures). However, the vis is accurate in displaying which subjects occur together (Health Policy, Public Health, and Health Sciences are strongly linked, as are Health Policy and Public Policy & Administration). Because Gephi was limited in the size of the imported file just like R, the full dataset is not an option for analysis. Therefore, I will work with the other data scientists and information professionals in my office to select “better” records for analysis and visualization.
On a strictly aesthetic basis, coworkers who have viewed the vis explained that the color scheme is misleading and the overlay of visual cues causes confusion. In class, we discussed that the use of color only stretches so far, so I will be revisiting those design decisions as well.
Additionally, the data are private, so the final visualization won’t be shareable outside of my company. Next steps for this vis thus include making it publicly available, so that all interested parties have access to this wealth of information concerning higher education.
References
Bluestone, Marisa (2015). U.S. Publishing Industry’s Annual Survey Reveals Nearly $28 Billion Revenue in 2015. Association of American Publishers. Retrieved from http://newsroom.publishers.org/us-publishing-industrys-annual-survey-reveals-nearly-28-billion-in-revenue-in-2015/
Generating and Visualizing Topic Models with Tethne and MALLET (2015). ASU Digital Innovation Group. Retrieved from http://diging.github.io/tethne/api/tutorial.mallet.html
Hogan, Adam (2012). Visualizing the History of Philosophy as a social network: The Problem with Hegel. Design & Analytics. Retrieved from http://www.designandanalytics.com/visualizing-the-history-of-philosophy-as-a-social-network-the-problem-with-hegel
Nodus Labs (2011). Network Analysis of Russian Protest Groups on Facebook using Gephi and Netvizz. Nodus Labs. Retrieved from http://noduslabs.com/cases/russian-protest-network-analysis-facebook-gephi-netvizz/
Sullivan, Danny (2016). RIP Google PageRank score: A retrospective on how it ruined the web. Search Engine Land. Retrieved from http://searchengineland.com/rip-google-pagerank-retrospective-244286
Appendix I: Data Selection Process
- Move from highest to lowest ranked on THE World University Rankings.
- Find institutional website.
- Utilize Google advanced search shorthand to locate relevant sub-sites. The following slides explain this process.
- Key search terms are listed in a later slide.
- If information is difficult to glean from Google search, use the search function (and all other available tools) on the institution’s home page.
- Read every relevant site that has the institutional site name as its base (e.g. for University of California, Berkeley, peruse all pertinent sites with the base “berkeley.edu”).
- Relevance refers to the potential for each initiative to produce new scholarship as a “tangible priorit[y]” of the institution.
- Pull data from these sites and organize in Access Database.
- If no date is found, record is excluded from data set.
- Consult JSTOR subject page to determine appropriate classification, if necessary (e.g. Sustainability includes “energy,” “energy policy,” etc.)