A visualization of the co-authorship phenomenon on the subject of network theory up to 2006

Introduction

In my explorations in this course so far, I have kept to the theme of art, exploring and visualizing data from different artistic fields. Report 1 studied the visualization of music, and Report 2 studied the distribution of galleries in New York. In this third report, I continue with another field of art: literature, and specifically co-authorship.

As one of the mainstream media for transmitting information, the internet plays an irreplaceable role in our lives. Network theory analyzes graphs that represent systems of discrete objects and the connections between them. Studying it helps us better understand the relationships between seemingly discrete objects.

This report studies co-authorship on the subject of network theory up to 2006 and uses visualization to analyze the collaboration between authors and the number of publications by each author.

Data

When looking for data sources in related fields, I consulted several websites, such as CASOS, SNAP, and the Gephi wiki. I searched for keywords such as “network theory” and “co-authorship” and found a database of authors and books on network theory and experiment, collected by M. E. J. Newman in 2006, which contains 1589 authors and more than 2000 books.

Data link

(M. E. J. Newman, Finding community structure in networks using the eigenvectors of matrices, Preprint physics/0605087 (2006).)

What surprised me most was that Professor Newman also created a column called “Weight”, which uses a weighting algorithm to quantify how much the authors in this field have published together. This column was the one I found most helpful for my research, so I chose this database and processed it.

Process

I processed the data in Excel, converting the file into CSV format according to the basic requirements of the Node and Edge tables, and obtained the following data.
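The same conversion could also be scripted. The following is only a minimal sketch, not the procedure used for this report: it assumes the co-authorship data is available as (author, author, weight) rows, and it writes the column headers Gephi's spreadsheet importer recognizes (Id and Label for nodes; Source, Target, Type and Weight for edges). The function name write_gephi_tables and the sample rows are hypothetical.

```python
import csv
import io


def write_gephi_tables(coauthorships, node_file, edge_file):
    """Write Gephi-style node and edge tables from (author_a, author_b, weight) rows."""
    # Give each author a numeric id in order of first appearance.
    ids = {}
    for a, b, _ in coauthorships:
        for name in (a, b):
            ids.setdefault(name, len(ids))

    nodes = csv.writer(node_file, lineterminator="\n")
    nodes.writerow(["Id", "Label"])
    for name, node_id in ids.items():
        nodes.writerow([node_id, name])

    edges = csv.writer(edge_file, lineterminator="\n")
    edges.writerow(["Source", "Target", "Type", "Weight"])
    for a, b, weight in coauthorships:
        # Co-authorship has no direction, so every edge is marked Undirected.
        edges.writerow([ids[a], ids[b], "Undirected", weight])
    return ids


# Hypothetical sample rows, not taken from the real dataset.
sample = [("Author A", "Author B", 2.5), ("Author B", "Author C", 1.0)]
node_buf, edge_buf = io.StringIO(), io.StringIO()
write_gephi_tables(sample, node_buf, edge_buf)
print(node_buf.getvalue())
print(edge_buf.getvalue())
```

In practice the two buffers would be replaced by files opened for writing, which can then be imported into Gephi as the node table and the edge table.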

Inspiration

Because the focus of this report is the co-authorship relationships among authors, I was interested less in how any particular node relates to the other nodes than in the structure of the system as a whole and the share of each id within it. The relationships in my edge table are also undirected. I therefore chose a type of visualization suited to whole networks.

I was inspired by an example presented in class, Hypergraph of keyword clusters in bioethics articles (1970-2010), a visualization that is also about authorship. It finds commonalities among works in a given field by frequency of occurrence and distinguishes them with different shapes and colors. Works that share commonalities are marked with the same color, and the size of a shape indicates the frequency of the content: the larger the circle, the higher the frequency of occurrence.
Because this example is similar to my research, the nodes in my visualization are also distinguished by color and shape.

Visualization and Discussion

The tool I used for the visualization was Gephi, an excellent tool that can turn several file types, such as CSV and GML, into visual charts.

First, I imported the Node and Edge CSV files into Gephi and got the following diagram (Fig-1). I could already see the complex interlocking relationships between many of the nodes, but this alone was obviously not enough for a visualization.

My first step was to size the nodes according to their frequency. I ranked node size by “Degree” in the Nodes Ranking settings, setting the minimum size to 2 and the maximum size to 35 so that differences in frequency would stand out, and got the following graph (Fig-2).
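The sizing step can be illustrated with a small sketch. It assumes a plain linear mapping from degree to size between the chosen minimum (2) and maximum (35); Gephi also offers other interpolation curves, which are not modeled here. The function names and the toy edge list are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node of an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def scale_sizes(deg, min_size=2.0, max_size=35.0):
    """Map degrees linearly onto [min_size, max_size]."""
    lo, hi = min(deg.values()), max(deg.values())
    if hi == lo:
        # All nodes have the same degree, so they all get the minimum size.
        return {node: min_size for node in deg}
    span = max_size - min_size
    return {node: min_size + (d - lo) * span / (hi - lo) for node, d in deg.items()}


# Hypothetical toy network: one author with three co-authors, one of whom has a second tie.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]
sizes = scale_sizes(degrees(toy_edges))
print(sizes)
```

The node with the highest degree receives size 35 and the nodes with the lowest degree receive size 2, with everything else placed proportionally in between.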

But the graph was still not clear enough, so I added color through the Nodes Partition settings, based on “Modularity” (Fig-3), which divides the nodes into communities according to how closely they are connected. I set the Modularity to 0.5, obtained 408 communities, and got the following graph (Fig-4).
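To make the idea behind the Modularity statistic more concrete, here is a rough sketch of how a modularity score can be computed for a partition that is already given. It does not reproduce Gephi's community search, it ignores edge weights, and it does not model the 0.5 setting mentioned above. The formula used is the standard one: for each community, the fraction of edges inside it minus the squared fraction of edge endpoints attached to it. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (a, b) pairs; community: dict mapping node -> community label.
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            intra[community[a]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {n: 0 for n in "abcdef"}
print(modularity(toy_edges, two_groups))
print(modularity(toy_edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than putting every node in one community, which is the sense in which a community detection method prefers the first partition.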

To get a more legible layout, I looked for a suitable template in the Layout panel. Since, as mentioned above, this is a relationship diagram of a whole network, I chose the “ForceAtlas 2” template and got the following diagram (Fig-5). Several communities are closely connected in the center of the screen, while others are scattered around the blank space. After comparing with the original database, I found that the frequencies of these disconnected parts are between 0.1% and 1%, so they can be ignored in this study, and more attention will be paid to the relationships between connected authors (co-authorship) (Fig-6).
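One way to check how small the scattered parts are is to list the connected components of the network and the share of nodes in each. The sketch below assumes that "frequency" here means a component's percentage of all nodes, which may not be exactly the measure used in this report. The function names and the toy data are hypothetical.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        seen.add(start)
        queue = deque([start])
        comp = []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return sorted(components, key=len, reverse=True)


def component_shares(nodes, edges):
    """Percentage of all nodes that falls in each component, largest first."""
    return [100.0 * len(c) / len(nodes) for c in connected_components(nodes, edges)]


# Hypothetical toy network: one chain of six authors, one pair, two isolated authors.
toy_nodes = list("abcdefghij")
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("g", "h")]
print(component_shares(toy_nodes, toy_edges))
```

Components whose share falls below a chosen threshold can then be set aside, leaving the large connected part in the center for closer study.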

After adjusting some layout settings (e.g., Gravity and Scaling), I obtained the following figure (Fig-7), in which the authors with co-authored works are clearly linked together. However, the figure still contains many gray dots, which indicate authors without co-authored works. To remove them, I used the Filter function. After setting the filter value to 6, I got a satisfactory image (Fig-8).

From this visualization, it is clear that there are five main groups of authors working closely together. Group A (purple), dominated by authors such as Maritan, A and Vespignani, A, has the highest number of co-authored works, often in collaboration with group B (blue), dominated by authors such as Boccaletti, S, and group C (dark gray), dominated by authors such as Newman, M. Group D (green), dominated by Barabasi, A, Jeong, H, and others, has the largest number of authors with high co-authorship frequency and the closest relationships. Group E (pink), dominated by authors such as Kahng, B, has fewer authors and fewer co-authorships than the other groups but still maintains a close relationship with them. It is also worth noting that the closeness of the relationship between two authors (i.e., their number of co-authorships) is reflected in the thickness of the edge: the thicker the line, the more often the two authors have co-authored.

In addition to these groups, there are some isolated groups around the edge of the chart, such as group F (orange) (Fig-9), which shows central symmetry at the top left of the canvas, and group G (flesh pink) (Fig-10), represented by White, H and Wellman, B at the bottom right. These form more closed small groups that only cooperate internally, so they are not the target of this report, and I will continue to focus on the connected authors in the middle.

Through this visualization, the relationship between groups and the frequency of each author’s co-authorship are well represented.

However, I still wanted to explore more correlations, so I ran “Network Diameter” in the Statistics panel on the right side, which produced several new “Centrality” measures to filter by and gave me more information on correlations.

The graph obtained using “Betweenness Centrality” (Fig-11) shows that there are several central authors in the system of network theory and experiment, including Newman, M, Holme, P, Jeong, H, and Pastor-Satorras, R, whose presence ties the entire system together. Among them, Newman, M has the highest centrality; that is, his role in the system is the most important.
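For readers curious about what this measure counts, the following sketch computes unnormalized betweenness on an unweighted, undirected graph using Brandes' algorithm: a node scores highly when many shortest paths between other pairs of nodes pass through it. This is an illustration only. It ignores edge weights and is not claimed to match Gephi's implementation or normalization. The function name and toy graph are hypothetical.

```python
from collections import defaultdict, deque


def betweenness(nodes, edges):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was counted once from each endpoint, so halve.
    return {n: value / 2.0 for n, value in score.items()}


# Hypothetical toy graph: a chain a-b-c-d.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(betweenness(toy_nodes, toy_edges))
```

In the chain, the two middle nodes lie on the shortest paths between the others and score highest, while the end nodes score zero. This mirrors how an author who bridges otherwise separate groups ends up with high betweenness.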

The graph obtained using “Closeness Centrality” (Fig-12) shows that most of the values are similar, except for Bassler, K and Toroczkai, Z in the upper right, who are most strongly connected with each other. In other words, most authors co-author with other authors at about the same frequency.
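A simple sketch of closeness may help in reading this kind of chart. It assumes one common definition: the number of other authors a node can reach, divided by the sum of its shortest-path distances to them, computed only within the node's own connected component. Gephi's exact definition and normalization may differ, so this is an assumption rather than a description of the tool. The function name and toy graph are hypothetical.

```python
from collections import defaultdict, deque


def closeness(nodes, edges):
    """Closeness centrality per node, computed within its own component."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    result = {}
    for s in nodes:
        # BFS gives shortest-path distances from s to every reachable node.
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        # Isolated nodes reach nobody, so their closeness is set to 0.
        result[s] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


# Hypothetical toy graph: a chain a-b-c, a separate pair d-e, and an isolated node f.
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(closeness(toy_nodes, toy_edges))
```

Under this per-component definition, two authors connected only to each other both receive the maximum value of 1.0, so very small isolated groups can show the highest closeness even though they are peripheral to the network as a whole.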

Finally, to give users more intuitive access to author information and to how authors are related when viewing the visualization, I enabled the “Prevent Overlap” option. Since the distance between authors is small and there are 1500+ authors, showing all the labels on the final chart would be confusing, so I chose to hide non-selected labels: the user can move the mouse over a dot to display the author’s name, or type the author’s last name in the search field to mark the corresponding dot on the chart. I explored different settings in the Appearance panel, including resizing and coloring the nodes, and adjusted settings in the Layout panel for better visual clarity. In addition, I adjusted the settings in the Preview panel to get a more subtle rendering. The final result is shown in the figure below (Fig-13).

Here is the link to my interactive website.

Reflection and future development

Since this was my first time using Gephi and I am still familiarizing myself with the tool, I used only some of its features in this report. The final presentation satisfied me because it shows the co-authorship between authors in a more visual way. However, I am not fully satisfied with this layout, and I think other layouts might present the data better.

Unfortunately, this database does not contain fields describing the content of the co-authored books, so it is not possible to categorize the nodes as in the example that inspired me and obtain more accurate group characteristics.

Information Visualization

Student work at the School of Information, Pratt Institute
