The Words of Dickens: A Network Visualization



Background

Creating a network visualization, a graph of the relationships between data points, requires very specifically structured data. As I searched the internet for a dataset that piqued my interest, I discovered that many of the available datasets deal with social networks, such as how a group of people is connected by different degrees through Facebook friends. The abundance of this type of content was not a huge shock to me; as social media continues to grow, we want to learn more about how different users are connected to one another.

However, I came across a dataset on the Gephi GitHub page that stood out to me as different from the rest. This network contains the relationships between adjectives and nouns in the novel “David Copperfield” by Charles Dickens. Though I have not read the book myself, I chose this set because I was curious to see how the language of the novel would look in a visualization.

Process

The dataset that I chose contains 112 nodes, or words used in the novel, and 425 edges, or adjacencies between the adjectives and nouns. By loading this file into Gephi, I was able to explore the relationships visually. Gephi is a free, open-source application that lets users manipulate networks, analyze layouts, and uncover connections and patterns in data. After loading the dataset, formatted as a GML (Graph Modelling Language) file, I began by changing the layout of the clusters. I ended up choosing Fruchterman Reingold, with the area set to 100 and gravity set to 10. This created a web of nodes that were spaced out and visually easy to digest. Words on the outside of the visualization have fewer adjacencies, and words on the inside of the cluster have more adjacencies to other words in “David Copperfield.”
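To make the layout step less of a black box, here is a minimal sketch of a Fruchterman Reingold style force-directed layout in plain Python. It is a simplified illustration, not Gephi's actual implementation: the function name, the cooling schedule, the gravity formula, and the toy word list are all my own assumptions, and the `area` and `gravity` values only echo the settings above without matching Gephi's scale.

```python
import math
import random


def fruchterman_reingold(nodes, edges, area=100.0, gravity=10.0,
                         iterations=200, seed=42):
    """Return {node: (x, y)} from a simplified force-directed layout."""
    rng = random.Random(seed)
    k = math.sqrt(area / len(nodes))      # ideal distance between nodes
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    temperature = math.sqrt(area) / 10.0  # caps how far a node moves per step

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Every pair of nodes repels, which spreads the graph out.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force

        # Connected nodes attract, which pulls well-connected words inward.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force

        # Gravity (assumed formula) pulls every node toward the origin.
        for n in nodes:
            disp[n][0] -= 0.01 * k * gravity * pos[n][0]
            disp[n][1] -= 0.01 * k * gravity * pos[n][1]

        # Move each node, limited by the temperature, then cool down.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1])
            if length > 0:
                step = min(length, temperature)
                pos[n][0] += disp[n][0] / length * step
                pos[n][1] += disp[n][1] / length * step
        temperature *= 0.95

    return {n: (x, y) for n, (x, y) in pos.items()}


# Hypothetical toy graph for illustration; not taken from the dataset.
toy_nodes = ["old", "man", "little", "boy", "good"]
toy_edges = [("old", "man"), ("good", "man"), ("little", "boy"),
             ("good", "boy"), ("old", "boy")]
layout = fruchterman_reingold(toy_nodes, toy_edges)
```

The balance between repulsion and attraction is what tends to push sparsely connected words toward the edge of the picture while words with many adjacencies settle near the middle.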

The weighted degree report.

Gephi can run several statistical reports on the data. I chose to run the “Weighted Degree” report, which differs from plain degree because it accounts for the weight of each edge as well as the number of edges connected to a node. I was then able to use the weighted degree to size the nodes in the graph, with the minimum set to 5 and the maximum set to 15. I used the same criterion for the labels, so smaller words have fewer adjacencies and bigger words are connected to more.
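As a rough sketch of what this step computes, the snippet below sums edge weights per node and then maps the totals onto a size range of 5 to 15. The function names, the toy edges, and the choice of a straight linear scale are my assumptions for illustration; Gephi's own sizing may interpolate differently.

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum the weights of all edges touching each node."""
    totals = defaultdict(float)
    for source, target, weight in edges:
        totals[source] += weight
        totals[target] += weight
    return dict(totals)


def scale_sizes(values, min_size=5.0, max_size=15.0):
    """Linearly map each value onto the range [min_size, max_size]."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        # All nodes are equally connected, so give them all the same size.
        return {node: min_size for node in values}
    span = high - low
    return {node: min_size + (value - low) / span * (max_size - min_size)
            for node, value in values.items()}


# Hypothetical toy edges (source, target, weight); not taken from the dataset.
toy_edges = [("old", "man", 1.0), ("good", "man", 1.0),
             ("little", "boy", 1.0), ("good", "boy", 1.0),
             ("old", "boy", 1.0)]
degrees = weighted_degree(toy_edges)
sizes = scale_sizes(degrees)
```

When every edge has a weight of 1, as in this toy example, the weighted degree is simply the number of edges attached to a word.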

The dataset assigns each word a value of either 1 or 0, depending on whether it is a noun or an adjective. I was able to use this value to create a two-color scheme that allows viewers to distinguish between the types of nodes and clusters. I also used this color scheme of red (nouns) and blue (adjectives) to color the edges. This way, it is easier to see whether an edge runs between two nouns, two adjectives, or a noun and an adjective.
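The coloring logic can be sketched in a few lines of Python. Here I assume that a value of 1 marks a noun and 0 marks an adjective, which should be checked against the dataset's documentation; the function names and the toy words are hypothetical and only illustrate the idea.

```python
# Assumption: 1 marks a noun and 0 marks an adjective in the node values.
NOUN, ADJECTIVE = 1, 0
COLORS = {NOUN: "red", ADJECTIVE: "blue"}


def node_color(value):
    """Return the display color for a node's part-of-speech value."""
    return COLORS[value]


def edge_kind(source_value, target_value):
    """Describe which two types of word an edge connects."""
    if source_value == target_value:
        return "noun-noun" if source_value == NOUN else "adjective-adjective"
    return "adjective-noun"


# Hypothetical toy data for illustration; not taken from the dataset.
values = {"old": 0, "man": 1, "little": 0, "boy": 1, "good": 0}
toy_edges = [("old", "man"), ("little", "boy"), ("man", "boy"), ("old", "good")]
kinds = [edge_kind(values[s], values[t]) for s, t in toy_edges]
```

Keeping the mapping in one dictionary means the color scheme can be swapped without touching the classification logic.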

The final visualization of the adjacency network of common adjectives and nouns in the novel “David Copperfield” by Charles Dickens.

Inspiration

Because I have never worked with network visualizations before, I needed to conduct some research on the best methods for displaying relationships. One graph that I found relevant to my work was “A Graph of Medium’s Tags” by Ludi Rehak. As the title suggests, the graph covers the most common tags on 1,000 of Medium’s most popular stories. I chose this visualization as an inspiration for my work because it also works with words as data points. What I took away from the graph was how Rehak used the size of the nodes to show the number of times each tag was used. Also, I appreciated how color related to the category of tags.

Created by Ludi Rehak, backend software engineer at Mist Systems

Reflection

Overall, I found that Gephi took some time to learn. It is not the most intuitive system for a first-time user, especially one who is not very familiar with network visualizations as a whole. After several tutorials, however, it was much easier to get the hang of the program and manipulate the data as I pleased. I think that using color to denote nouns and adjectives was successful, while size was reserved to indicate the number of adjacencies. Because color and size carry different meanings, viewers can more easily understand the data.

In another iteration of this graph, there are definitely things I would like to change. I would like to use the number of adjacencies between words to make the edges thicker or thinner. While I liked the Fruchterman Reingold layout the best, I wish the data had sorted into clearer clusters. I also think that making the graph interactive, so that users could roll over a word to see which other words it connects to, would communicate a lot more information about each relationship.

References

https://towardsdatascience.com/a-graph-of-mediums-tags-8e3cf6cad1d9

M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).

https://github.com/gephi/gephi/wiki/Datasets