Trends in Scholarly Publishing at the Icahn School of Medicine


Visualization

Introduction to the Data

For my final project, I wanted to use data that had meaning to me and create visualizations that could be immediately useful. I was initially inspired by the work I have been doing for one of my internships at the Icahn School of Medicine at Mount Sinai. Their library is currently implementing new ways of measuring scholarly output, including altmetrics analytics which I have been working on. I could not use that data because it is still in the early stages of development, so I instead decided to use Scopus to get data on publications by ISMMS faculty. I aimed to create finished products that could go directly to my target audience of ISMMS librarians and be used as part of their research into faculty publishing and the reach that ISMMS has in the larger scholarly conversation. Previous work has been done at the library regarding collaborations between ISMMS and institutions in other countries, but this was for a single year and did not account for other factors such as the number or subject of the publications. My plan for this project was to use that as a starting point to improve upon, and my ideas expanded from there as I worked with the data.

Scopus has data on ISMMS publications from 1965 through the present, covering almost 44,000 citations. A huge amount of data can be extracted from this, and Scopus offers an analysis function that breaks some of it down by the following factors: year, source (journal title), author, affiliation (institution the author is affiliated with), country/territory, document type (article, book chapter, review, etc.), and subject area. Each of these factors can also be drilled down further. It is easier to extract data from these categories than to export thousands of citations, but it is still very time-consuming because only one data set for one factor can be exported at a time. For instance, I cannot export data on subjects in certain countries for a span of time in a single step. Each of those pieces must be pulled individually, then manually entered into a spreadsheet.

I originally planned to use the affiliation data to create a network of institutions that would show the strength of connections as well as the most popular subject area that was written about in those collaborations. However, this did not work out because Scopus limits the number of rows that can be exported for the affiliation variable (I contacted Scopus and they confirmed this), so the only way to get around this would have been to export citations in batches and separate that data. This was not possible given the scope of this project.

Ultimately, I feel being forced to go in a new direction led to a better product. I had to decide on a new, much clearer story to tell. The aspect of the data I was most interested in was the diversity of subject areas that fall under publications of ISMMS faculty. There are 28 total subject areas for the entire span of 1965-2014 (my project does not include 2015 and 2016 data as it is incomplete). As the visualizations show, Medicine is naturally the dominant subject, but there are others such as Arts and Humanities and Astronomy that are less expected to be coming from a medical school. I was struck by the changes in popularity of subjects, as well as by the changing connections between ISMMS and other countries over time, so I started from there. My final products include two dashboards for a total of five visualizations using Tableau Public and two networks created using Gephi. Together, these graphs represent different aspects of the data and convey a multi-dimensional story.

Network Visualizations

Subjects Network

The first network is meant to be studied in conjunction with the first dashboard because both pertain to the broader picture of the data. The network shows which subjects appeared together in cross-listed publications (i.e. articles classified under more than one subject area) from 1965-2014, laid out using the Force Atlas 2 algorithm. Each of the 28 nodes is a subject, and each edge is weighted by the number of times the two subjects it connects appear together. The nodes are colored by cluster; the clusters were determined by running modularity with a resolution of 1.0. A resolution of 1.0 created two clusters, 0.75 created three, and 0.5 created five. I tried these varying resolutions to see the network broken down further than just two clusters, but found that the view with two clusters was actually the most informative. With more clusters, subjects with weaker ties were being grouped together and the relationships between the subjects were not as clear. The network shows subjects like Economics, Social Sciences, and Arts and Humanities, the nodes for which are being repelled from the center, as being clustered with the most popular medical subjects including Medicine, Neuroscience, and Biochemistry (blue). The other cluster (red) is mostly subjects that are not medical, such as Computer Science and Engineering. This shows that soft sciences and humanities are more likely to be written about in an article that is also classified under life sciences or other medical topics. The network also shows that non-medical sciences are written about together more frequently than they are with medical sciences. The center nodes all represent the most popular subjects and are also the most similar subjects.
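As a rough illustration of how the weighted edges in a network like this can be derived, the following Python sketch counts how often each pair of subjects is listed on the same publication. It assumes one list of subjects per publication, which is not necessarily how the data for this project was compiled, and the sample records are hypothetical rather than actual Scopus data. The printed rows follow the Source, Target, Weight column layout that Gephi accepts for an edge list.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(records):
    """Count how often each pair of subjects is listed on the same publication."""
    edges = Counter()
    for subjects in records:
        # set() drops repeated subjects within one record; sorted() makes
        # (A, B) and (B, A) count as the same undirected edge.
        for pair in combinations(sorted(set(subjects)), 2):
            edges[pair] += 1
    return edges


# Hypothetical sample records, one list of subject areas per publication.
sample_records = [
    ["Medicine", "Neuroscience"],
    ["Medicine", "Neuroscience", "Arts and Humanities"],
    ["Computer Science", "Engineering"],
    ["Medicine"],  # a single-subject publication contributes no edges
]

edges = cooccurrence_edges(sample_records)

# Print an edge list with a header row.
print("Source,Target,Weight")
for (source, target), weight in sorted(edges.items()):
    print(f"{source},{target},{weight}")
```

Publications with only one subject add no edges, so a subject that is never cross-listed would not be connected to anything in this kind of network.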

AH Network

The second network is meant to be studied together with the second dashboard because both are focused on the Arts and Humanities. I felt having only big-picture visualizations was not enough to be compelling and would eventually fail to communicate new information to the user, so I chose to narrow down to the subject that has the least obvious connection to ISMMS. Taking the same approach as in the first network, I compiled the data on which subjects Arts and Humanities articles were cross-listed with from 1967-2014 (Arts and Humanities did not appear as a subject until 1967) and used the Force Atlas 2 algorithm to generate the network. The nodes are again colored by clusters determined with a resolution of 1.0, which created three communities. Of the 28 total subjects associated with ISMMS, 17 can be tied to Arts and Humanities for a total of 18 nodes in this network. We can see that Mathematics and Decision Sciences (red), the only two formal sciences in the network, are the most isolated and are only written about with one another and Arts and Humanities. The second largest cluster (green) is interesting because it contains two applied sciences (Engineering and Computer Science), a life science (Agricultural and Biological Sciences), and Business. This shows that when Business, Management, and Accounting (one of the most non-medical subjects in the network) is written about, it will most likely be in relation to one of those three sciences. It does make sense that topics like Computer Science and Agricultural and Biological Sciences overlap with Business. Lastly, the largest cluster (blue) shows that Arts and Humanities have the strongest ties to medical and social sciences.

Dashboards

Dashboard 1

The first dashboard went through many stages of trial and error before I found the right combination of variables and the best way to represent them. This dashboard gives an overview of trends for 1965-2014. The area chart shows the total number of publications by year, broken down by subject. This allows the user to see both how many publications were produced in a given year and how many publications under a certain subject area were produced. It is clear that the gap between the number of articles on Medicine and those on all other subjects has increased, though there appears to be a pattern of rising and falling numbers of Medicine publications about every five years. That is, until 2011, when the total number for Medicine started to grow steadily. The next most popular subject, Biochemistry, Genetics, and Molecular Biology, largely follows the same trend as Medicine, indicating that these subjects are closely related. This is confirmed by the first network, in which these two subjects have the strongest tie between them. We can also see from the area chart that variety in subjects has continued to grow, with more new subjects appearing over time, particularly since the year 2000. Although there are many colors packed together at the bottom of the chart, it is still easy to see that subject areas are expanding. I tried to make the colors as easy to differentiate as possible and not place colors that are too similar beside each other.

I next wanted to show the growing relationships between ISMMS and institutions in other countries, so I supplemented the area chart with a map depicting those countries and the most popular subjects written about in those collaborations. This map is broken down by decade, but because Tableau cannot comprehend date ranges, each year in the filter actually represents ten years (i.e. selecting 1974 will show the totals for all years between 1965 and 1974). It was unfortunately not possible to color each point on the map by the most popular subject without redoing the entire data set to just include those top values (a very long process), so each point is instead a pie chart. This is not ideal, but because it is not important for the user to determine specific numbers from this map, it will suffice. As the user filters through each decade, they will see the number of points on the map increase as the ISMMS network expands. This visualization allows the user to see the concentration of collaborations. The points are clustered in certain regions, particularly Western Europe, the Middle East, Asia, and Central and South America. Knowing this clustering occurs could lead to further research into why ISMMS collaborations are focused in certain parts of the world.
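The decade labeling described above, where a single year in the filter stands for the ten years ending in that year, can be sketched in a few lines of Python. This is only an illustration of the binning logic, not how the workbook itself was built; the function names and the yearly counts are hypothetical, and the only values taken from the text are the 1965 start year and the ten-year span.

```python
FIRST_YEAR = 1965  # first year of the data set, per the text
SPAN = 10          # each filter value stands for ten years


def decade_label(year):
    """Return the last year of the ten-year window that contains `year`."""
    if year < FIRST_YEAR:
        raise ValueError(f"{year} is before the first year of data ({FIRST_YEAR})")
    window = (year - FIRST_YEAR) // SPAN
    return FIRST_YEAR + window * SPAN + SPAN - 1


def totals_by_decade(yearly_counts):
    """Sum per-year counts into per-decade totals keyed by the decade label."""
    totals = {}
    for year, count in yearly_counts.items():
        label = decade_label(year)
        totals[label] = totals.get(label, 0) + count
    return totals


# Hypothetical yearly counts, not actual Scopus figures.
sample_counts = {1965: 3, 1970: 5, 1974: 2, 1975: 7, 2014: 11}
print(totals_by_decade(sample_counts))
```

With these windows, 1965 through 1974 all map to the label 1974, and 2005 through 2014 map to 2014, which matches the ranges used in the filter.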

To break down this data on subjects and countries further, I use Tableau’s circle view to show the distribution of subjects per country. The year (decade) filter simultaneously applies to both this chart and the map, so the user sees both dimensions at the same time. The size of the circles encodes the number of documents that fall under a subject in a given country. This chart is effective because users can immediately see when there is or is not a circle for a subject or country. The chart automatically sorts to show the countries with the most collaborations with ISMMS first, and the user can scroll further to see the descending values. It was not possible to just display the top ten countries, but this is an acceptable compromise. It is interesting to see the order of countries change in each decade, such as when Japan is the top collaborator for 1985-1994 but falls to the ninth place spot in 2005-2014. Again, looking at this visualization opens doors to more questions that could explain why certain collaborations take place and trends occur.
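The ordering of countries within a decade can also be reproduced outside Tableau. The sketch below, in Python, sums documents per country for one decade and sorts the countries in descending order, with an optional cutoff for a top-ten style view. The row layout, the function name, and all of the counts are assumptions made for illustration and do not come from the actual data set.

```python
from collections import defaultdict

# Hypothetical rows: (decade label, country, subject, documents).
sample_rows = [
    (1994, "Japan", "Medicine", 40),
    (1994, "Japan", "Neuroscience", 15),
    (1994, "Italy", "Medicine", 30),
    (1994, "Canada", "Medicine", 30),
    (2014, "Italy", "Medicine", 90),
    (2014, "Japan", "Medicine", 20),
]


def rank_countries(rows, decade, top_n=None):
    """Rank countries by total documents in one decade, largest first."""
    totals = defaultdict(int)
    for row_decade, country, _subject, documents in rows:
        if row_decade == decade:
            totals[country] += documents
    # Sort by descending total; ties are broken alphabetically by country.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]


print(rank_countries(sample_rows, 1994))
print(rank_countries(sample_rows, 2014, top_n=1))
```

Passing `top_n=10` would limit the result to the ten countries with the most documents in that decade.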

Dashboard 2

For the second dashboard, my goal was to again delve into one subject, Arts and Humanities, to show more information than the second network discussed above could capture. I wanted to find a way to incorporate journals into the visualizations, and this seemed to be a good opportunity to do so. This dashboard contains a stacked bar graph showing the total number of articles classified under Arts and Humanities that were published in each journal for the years 1967-2014. The colors within each bar represent the subjects that are cross-listed with Arts and Humanities in that journal. The colors in this dashboard match those in the first dashboard for consistency. Although not all of the journal titles are displayed as labels on the x-axis, hovering over a bar will display the title. We can see that the most popular journal is Neurology, which has published articles that are classified as Arts and Humanities along with Medicine and/or Neuroscience. Using stacked bars allows the user to see the breakdown of subjects within each publication. This is an effective supplement to the network because it shows the relationship between Arts and Humanities and other subjects at a more detailed level. The user can look at an edge in the network, then look at this dashboard to find out exactly what that edge consists of.

Lastly, I use a tree map which divides journal titles according to the subjects of articles in that journal (that is, subjects that appear together with Arts and Humanities). I chose to use a tree map because it can display multiple variables at once in a condensed, but still easily readable, format. Each box is a journal, sized according to the number of articles and colored according to subject. The tree map shows how many different journals a subject appears in, which is an interesting perspective. We can see, for example, that some subjects may have multiple articles but only appear in one publication, such as Decision Sciences. It is possible to filter by journal title, which will affect both visualizations on this dashboard and thus will display a single section of the tree map and one bar of the stacked bar chart. This would be helpful to users interested in viewing the information for specific publications or getting a closer look at the smaller bars or boxes that are slightly harder to see. The subject legend also acts as a filter for both charts.

Conclusion

Overall, my visualizations work together to represent large, multifaceted sets of data. I aimed to create functional, informative products that could be immediately utilized by ISMMS librarians who are interested in tracking the scholarly output of the school’s faculty. With the first network and dashboard, they can interact with several variables to see almost 50 years of data in small, easily consumable pieces. With the second network and dashboard, they can see details about a particular subject especially in relation to journal titles. These visualizations can help to investigate a number of topics:

  • Trends in the publication of articles classified under certain subjects – when is a given subject most popular? What could an increase or decrease be attributed to – institutional changes, world events?
  • Trends in collaborations between ISMMS and institutions in other countries – are there official partnerships determining the clusters of activity that are evident on the map?
  • Relationships between subject areas
  • Which journals ISMMS faculty publish in most frequently

There are still many variables left to look into more deeply. Ideally, I would have liked to break down the networks to an institutional level. All of the factors that Scopus provides data for could potentially be combined into massive visualizations using the same methods that went into making these smaller ones. Incorporating altmetrics into these graphs is another possibility. Even though my project is not concentrated on authors or citations, it is a foundation for representing that kind of low-level data.