Network Lab: Word Adjacencies in Dickens’ David Copperfield



Introduction:

I was curious to explore network analysis as applied to literature and found a dataset on the Gephi GitHub site of word adjacencies between common nouns and adjectives in the Charles Dickens novel David Copperfield. The dataset includes 112 words (58 adjectives and 54 nouns) connected by 425 edges. This gave me some sense of the word adjacencies in the dataset. Unfortunately, I have not read David Copperfield, but I felt that wouldn’t prevent me from exploring the dataset a little further. I knew the dataset was directed, and with that number of edges I figured there might be some discoveries to be made. I started with a few questions:

  • Is there one noun and/or adjective that is used most often? Knowing that Dickens wrote Copperfield in 19 installments, I figured there would likely be some kind of repetition.
  • Is there a maximum number of pairs used with one or more words?
  • Do certain words have more connections than others? Is there any kind of pattern to be seen? Maybe some words only have a few ties and other words have a greater number of ties? The degree of each node would tell me more, particularly in-degree and out-degree since the graph is directed.
  • I thought there might be some kind of relationship between gender and the adjectives used with “woman” or with “man” that might reveal something about the novel.
  • Betweenness centrality might tell me something about how all the words are connected. Are there any words that play a central role?
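The degree questions above can be sketched in a few lines of Python. This is only a minimal illustration, not what Gephi runs internally: the edge list is a made-up toy example (the words are borrowed from the dataset, the connections are not), and `degree_counts` is a hypothetical helper name.

```python
from collections import Counter

# Toy directed edges as (source, target). Invented for illustration;
# these are NOT the real David Copperfield adjacencies.
edges = [
    ("little", "woman"),
    ("little", "man"),
    ("old", "man"),
    ("good", "man"),
    ("old", "woman"),
]

def degree_counts(edge_list):
    """Return (in_degree, out_degree, total_degree) Counters for directed edges."""
    in_deg = Counter(target for _, target in edge_list)
    out_deg = Counter(source for source, _ in edge_list)
    total = in_deg + out_deg  # total degree = in-degree + out-degree
    return in_deg, out_deg, total

in_deg, out_deg, total = degree_counts(edges)
for word, deg in total.most_common():
    print(word, "in:", in_deg[word], "out:", out_deg[word], "total:", deg)
```

In a directed graph the split matters: in this toy example "man" only receives edges while "little" only sends them, which a single total degree would hide.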

Inspirations:

I was curious to see what else had been done with literature using Gephi. I found an interesting visualization produced by the Georgia Tech Digital Humanities lab related to Thomas Jefferson. This was an arc diagram and can be found here. I liked that this visualization relates to archives (30,000 papers of Thomas Jefferson) and that at a glance you can see the categories and the density of arcs.

Thomas Jefferson Arc Diagram from Georgia Tech DH lab illustrating relations found in papers of Jefferson

I also watched a brief video from the 2014 Design Automation Conference that looked at network visualizations in literature. The two I found most related to Gephi were the author network, showing co-authors of papers and their connections, and a citation network. What became immediately obvious to the creators was that the co-authorship network was denser, showing a lot of collaboration, while the citation network was sparse and not as richly connected. I still thought all of the sub-communities/networks were interesting to see, even if not connected to the larger whole.

DAC Citation Network, showing connections and communities

Process, Results, and Discussion:

I used Gephi 0.8.1 beta, which I downloaded from the Gephi site and used in the lab and at home. The Dickens word adjacencies dataset was already formatted to open in Gephi: I simply downloaded the GML file and opened it in Gephi. I knew there were nouns and adjectives in the dataset, but wasn’t sure how they were distinguished. Once I took a look in the Data Laboratory, I changed all words with a value of 0.0 to adjective and 1.0 to noun. This helped me distinguish between the two more clearly.
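The same recoding could be done outside the Data Laboratory. The sketch below assumes the node table has been reduced to (label, value) pairs; the 0.0/1.0 coding is the one described above, while the sample rows and the names `WORD_TYPES` and `recode` are hypothetical.

```python
# Mapping used in the recoding step: 0.0 marks an adjective, 1.0 a noun.
WORD_TYPES = {0.0: "adjective", 1.0: "noun"}

# Illustrative rows only; the real table has 112 words.
nodes = [("little", 0.0), ("man", 1.0), ("old", 0.0), ("aunt", 1.0)]

def recode(rows):
    """Replace each numeric value with its word-type label."""
    return [(label, WORD_TYPES[value]) for label, value in rows]

recoded = recode(nodes)
for label, word_type in recoded:
    print(label, word_type)
```

A value outside the mapping raises a `KeyError`, which is a useful check that the column really contains only 0.0 and 1.0.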

In my first tests, I focused solely on the relationships between nouns and adjectives and their usage, and created the graph below.

noun_adj_vifan_hu_layout

I thought I would highlight nouns and adjectives and then see if I could find some kind of interesting relationship between words associated with “woman” and words associated with “man.” This did not end up revealing anything.

In the Overview panel, I ran some of the statistics and then reviewed the Data tab to see if anything interesting popped up. After running “Average Degree” and “Average Weighted Degree,” I went to the Ranking panel for Nodes and chose “Degree,” which showed a range between 1 and 49. After applying it, I immediately saw that “Aunt” had the lowest degree and “Little” had the highest.

There were several other nodes that were small in scale, but the layout I ran pushed “Aunt” farthest out from the rest of the cluster of words, which indicated to me that it must be one of the weakest links, perhaps with only one connection. I then explored node size and saw that it was preset to min 1 and max 4. I changed the range to min 1 and max 15 and was again reassured that “Little” held the highest degree.
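To make the size ranking concrete, here is a rough sketch of how a degree could be mapped onto a node size. It assumes plain linear interpolation between the minimum and maximum size, which may not match Gephi's interpolator settings exactly; `scale_size` is a hypothetical name, and the defaults are the degree range (1 to 49) and size range (1 to 15) mentioned above.

```python
def scale_size(degree, deg_min=1, deg_max=49, size_min=1.0, size_max=15.0):
    """Linearly map a degree in [deg_min, deg_max] to a size in [size_min, size_max]."""
    if deg_max == deg_min:
        # Every node has the same degree, so there is nothing to interpolate.
        return size_min
    fraction = (degree - deg_min) / (deg_max - deg_min)
    return size_min + fraction * (size_max - size_min)

for degree in (1, 25, 49):
    print(degree, scale_size(degree))
```

Under this assumption a node halfway through the degree range (degree 25) lands halfway through the size range (size 8), so raising the maximum from 4 to 15 stretches the visual gap between low-degree and high-degree words.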

I thought I would explore incoming and outgoing edges further, but this did not change much and made it difficult to see the edges and their respective weights. I did notice that some words besides “Little” were more prevalent: “old”, “other”, “same”, and “good.” I experimented with Betweenness Centrality and then more with layouts.

betweenness centrality graph highlighting words
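For readers who want to see what betweenness centrality counts, the following is a small sketch of Brandes' algorithm for an unweighted graph. It treats edges as undirected for simplicity (an assumption; the dataset is directed), uses an invented four-word chain rather than the real data, and does not normalize the scores, so the numbers are not comparable to Gephi's output.

```python
from collections import deque

def betweenness_centrality(edge_list):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    neighbors = {}
    for a, b in edge_list:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    betweenness = {node: 0.0 for node in neighbors}
    for source in neighbors:
        # Breadth-first search from source, counting shortest paths.
        order = []
        predecessors = {node: [] for node in neighbors}
        path_count = {node: 0 for node in neighbors}
        distance = {node: -1 for node in neighbors}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in neighbors[node]:
                if distance[nbr] < 0:  # first time this node is reached
                    distance[nbr] = distance[node] + 1
                    queue.append(nbr)
                if distance[nbr] == distance[node] + 1:  # on a shortest path
                    path_count[nbr] += path_count[node]
                    predecessors[nbr].append(node)

        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in neighbors}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                betweenness[node] += dependency[node]

    # Each unordered pair was counted from both endpoints, so halve the totals.
    return {node: value / 2 for node, value in betweenness.items()}

# Toy chain: aunt - little - man - old (invented, not from the dataset).
toy_edges = [("aunt", "little"), ("little", "man"), ("man", "old")]
scores = betweenness_centrality(toy_edges)
print(scores)
```

In the toy chain the two middle words each sit on two shortest paths, while the end words sit on none, which is the sense in which a word can "play a central role."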

I decided that Eigenvector Centrality might give me more interesting information about the nouns and adjectives. What I understood about Eigenvector Centrality is that it measures importance based on a node’s connections to other important nodes. I learned more about this from here. Once I applied Eigenvector Centrality, my graph changed considerably, and in a more interesting way, I felt. The default is 100 iterations. I left it at that based on what I read here.

eigenvector centrality- changed node size and relations
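The idea that a node is important when its neighbors are important can be sketched with a simple power iteration. This is an illustration under assumptions, not Gephi's implementation: edges are treated as undirected, scores are rescaled by the largest value after each pass, the toy edges are invented, and the default of 100 passes simply mirrors the iteration setting mentioned above.

```python
def eigenvector_centrality(edge_list, iterations=100):
    """Approximate eigenvector centrality by repeated neighbor-score summing."""
    neighbors = {}
    for a, b in edge_list:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    scores = {node: 1.0 for node in neighbors}  # start every node equal
    for _ in range(iterations):
        # A node's new score is the sum of its neighbors' current scores.
        updated = {
            node: sum(scores[nbr] for nbr in nbrs)
            for node, nbrs in neighbors.items()
        }
        largest = max(updated.values())
        # Rescale so the top node scores 1.0 and values stay bounded.
        scores = {node: value / largest for node, value in updated.items()}
    return scores

# Toy edges (invented). The little-man-woman triangle keeps the graph from
# being bipartite, which helps this plain iteration settle down.
toy_edges = [
    ("little", "man"), ("little", "woman"), ("little", "boy"),
    ("old", "man"), ("old", "woman"), ("man", "woman"),
]
ec_scores = eigenvector_centrality(toy_edges)
for word, score in sorted(ec_scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 3))
```

In this toy graph "little", "man", and "woman" all have degree 3, yet "man" and "woman" come out ahead because "little" spends one of its ties on the poorly connected "boy". That difference from a plain degree ranking is why the graph can look so different once this measure is applied.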

Next I changed the layout to Yifan Hu Proportional and then moved on to experimenting with Modularity. I first ran it with a resolution of 1.0, which gave me seven classes. I wanted fewer in an effort to simplify the relationships, so I set it to 1.5, which gave me four classes. I narrowed this down to three colors since classes 2 and 3 had the same percentage (9.82%), and colored the first class (63.39%) blue and the 16.96% class red.

legend for modularity class
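As a quick sanity check, the four percentages can be tied back to the 112 words. The node counts below are back-calculated from the percentages rather than read from Gephi, so treat them as an assumption; the class names are placeholders, not Gephi's class ids.

```python
TOTAL_NODES = 112

# Assumed node counts per modularity class, inferred from the percentages.
class_counts = {"largest": 71, "second": 19, "third": 11, "fourth": 11}

percentages = {
    name: round(100 * count / TOTAL_NODES, 2)
    for name, count in class_counts.items()
}

print(percentages)
print("nodes accounted for:", sum(class_counts.values()))
```

If these counts are right, the two smallest classes hold 11 words each, which would explain why they share the same percentage.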

The blue nodes, which make up most of the graph, are clearly visible and the red ones are strongly highlighted, which I thought played off nicely, leaving the yellow ones least visible since they play a lesser role. I played around with the color settings and with moving some of the nodes. Overall the graph seemed too dense in the center. It’s a tricky balance between visually appealing and visually readable. I wanted more space, more room. I went back to ForceAtlas2 knowing I could change some things around. I set the scaling high, to 50, to try to open things up.

force atlas2 graph with modularity highlighted

Future Directions

I would like to download a plugin that lets readers search for the words listed, and to add a legend that explains the colored relationships, both of which would make the graph more usable. I would also like to try using filtering to highlight the top ten word adjacencies.
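One way the top-ten idea might look outside Gephi is sketched below. It assumes "top" means highest edge weight, which is only one possible reading; the weighted edges are invented and `top_adjacencies` is a hypothetical helper.

```python
import heapq

# Invented (source, target, weight) rows, for illustration only.
weighted_edges = [
    ("little", "man", 5.0),
    ("old", "man", 9.0),
    ("good", "woman", 2.0),
    ("same", "time", 7.0),
    ("other", "side", 4.0),
]

def top_adjacencies(edges, n=10):
    """Return the n edges with the highest weight, heaviest first."""
    return heapq.nlargest(n, edges, key=lambda edge: edge[2])

for source, target, weight in top_adjacencies(weighted_edges, n=3):
    print(source, target, weight)
```

With the real edge table, leaving `n` at its default of 10 would give the ten heaviest adjacencies.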

My decision to use a pre-existing dataset for Gephi rather than create my own was made with the idea that I could spend more time understanding the software and network analysis. I’m not so sure that was the right decision in the end. I tried playing around with the edge CSV file to see if working in the Data Laboratory might give me more insights.

original-altered-csv

This complicated things and added a whole outer section of unconnected nodes. I am not sure why this happened, so I would like to look into it further. The image below on the left shows the initial result. I noted in the Data tab that some rows in the edge table had a source and target but no type or id, so I deleted those, which resulted in just the outer ring being illustrated.

 

screenshots of experiments in gephi with csv files
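Before deleting rows, it might help to list which edge rows are incomplete. The sketch below separates complete rows from those missing a type or id. The column names (`Source`, `Target`, `Type`, `Id`) and the sample rows are assumptions for illustration, not a copy of the actual file.

```python
import csv
import io

# Made-up edge table; the second data row has a source and target only.
sample_csv = """Source,Target,Type,Id
0,1,Directed,0
2,3,,
1,2,Directed,1
"""

def split_edge_rows(text):
    """Split edge rows into (complete, incomplete) based on Type and Id."""
    complete, incomplete = [], []
    for row in csv.DictReader(io.StringIO(text)):
        if row["Type"] and row["Id"]:
            complete.append(row)
        else:
            incomplete.append(row)
    return complete, incomplete

complete, incomplete = split_edge_rows(sample_csv)
print("complete rows:", len(complete))
for row in incomplete:
    print("missing type or id:", row["Source"], "->", row["Target"])
```

Inspecting the incomplete rows first shows which sources and targets they involve, which could help explain the extra ring of nodes before anything is removed.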

I would like to try working with Gephi again with a dataset that I either manipulate or create myself, so I can demystify more of what’s happening.

I found this tutorial on Gephi also related to literature and thought about playing around with the filtering options to see how that might impact the network. For a future dataset, I would be more interested in choosing a dataset that would show either relationships between published works or relationships between authors to highlight possible influences between authors and/or their works.

Citation for Dataset:

  1. M. E. J. Newman, Phys. Rev. E 74, 036104 (2006). Found here.