Word Adjacencies in David Copperfield by Charles Dickens


Visualization

M.E.J. Newman released a word adjacency network of common adjectives and nouns found in Charles Dickens’ novel David Copperfield. This dataset consists of 112 nodes (words) and 425 edges (adjacencies). The dataset also includes other categories that describe each word's relationship to the others and the data as a whole.

Initially, the underlying question that influenced the design of this network graph was how to identify and visualize clusters within the dataset. As work on the network graph continued, the goal evolved into identifying the relationship between the words based on their value. The following example, a network graph of topic terms from The Art of Literary Text Analysis by Stéfan Sinclair and Geoffrey Rockwell, identifies clusters within the terms. This example was chosen because it also uses terms from a text to create the network graph.

kurzgeschichten_network

Image source: https://wiki.de.dariah.eu/pages/viewpage.action?pageId=40213783

The dataset used for this network graph was found on the Gephi wiki page, which provides a number of sample datasets in various formats. It was downloaded as a Graph Modelling Language (GML) file and opened directly in Gephi. The Sigmajs Exporter plugin was later installed in Gephi in order to make the final network graph interactive.
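For readers who want to inspect a GML file outside Gephi, the following is a minimal sketch of reading one with only the Python standard library. It assumes a flat layout in which each node carries `id`, `label`, and `value` and each edge carries `source` and `target`; the sample words and links are made up for illustration and are not copied from the real file. The `parse_gml` helper is hypothetical and handles only this simple case.

```python
import re

# Hand-written sample in the assumed GML layout (illustrative words only).
SAMPLE_GML = '''
graph
[
  node
  [
    id 0
    label "old"
    value 0
  ]
  node
  [
    id 1
    label "man"
    value 1
  ]
  node
  [
    id 2
    label "little"
    value 0
  ]
  edge
  [
    source 0
    target 1
  ]
  edge
  [
    source 2
    target 1
  ]
]
'''


def parse_gml(text):
    """Return (nodes, edges) from a flat GML graph with numeric or quoted attributes."""
    nodes, edges = {}, []
    # Each node or edge is a keyword followed by a bracketed list of attributes.
    for kind, body in re.findall(r'\b(node|edge)\s*\[([^\[\]]*)\]', text):
        attrs = {}
        for key, quoted, bare in re.findall(r'(\w+)\s+(?:"([^"]*)"|(\S+))', body):
            # Quoted values stay strings; bare values are read as numbers.
            attrs[key] = quoted if bare == "" else float(bare)
        if kind == "node":
            nodes[int(attrs["id"])] = {"label": attrs["label"], "value": attrs["value"]}
        else:
            edges.append((int(attrs["source"]), int(attrs["target"])))
    return nodes, edges


nodes, edges = parse_gml(SAMPLE_GML)
print(len(nodes), "nodes and", len(edges), "edges")
```

Run against the full file, the same counts would be expected to match the 112 nodes and 425 edges reported for the dataset, provided the file follows the layout assumed here.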

In hopes of identifying clusters, several of Gephi's statistical analyses were run, including modularity. The modularity report produced the following result.

communities-size-distribution
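To give a sense of what a modularity score measures, here is a minimal sketch that computes Newman's modularity Q for a given assignment of nodes to classes. It does not reproduce the community detection that Gephi runs to find the classes; it only scores a partition that is already known. The toy graph (two triangles joined by one edge) and the `modularity` function name are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> class label.
    Q = sum over classes of (internal_edges / m) - (class_degree / (2 * m)) ** 2
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends in the same class
    degree = defaultdict(int)    # total degree of the nodes in each class
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles connected by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_classes = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_class = {n: "A" for n in range(6)}

q_two = modularity(toy_edges, two_classes)
q_one = modularity(toy_edges, one_class)
print(round(q_two, 3), round(q_one, 3))
```

A higher Q means more edges fall inside the classes than would be expected by chance, which is why splitting the toy graph into its two triangles scores above lumping every node together.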

It was still difficult to tell what information was used to define the 7 modularity classes, so attention was refocused on the value of each word. In the dataset, each word was given a value of either 0.0 (adjective) or 1.0 (noun). The Fruchterman Reingold layout, with gravity set to 5, was finally chosen. The size ranking was set to “weighted degree” with a minimum size of 10 and a maximum size of 30. The nodes were colored either blue (adjective) or yellow (noun).
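The sizing and coloring described above can be sketched outside Gephi as follows. This assumes the ranking interpolates linearly between the minimum and maximum size, which may not match Gephi's exact behavior, and it uses made-up words and edge weights of 1. The function names `weighted_degree` and `scale_sizes` are hypothetical.

```python
from collections import defaultdict

# Illustrative data: value 0.0 marks an adjective, 1.0 marks a noun.
values = {"old": 0.0, "little": 0.0, "good": 0.0, "man": 1.0, "boy": 1.0}
weighted_edges = [
    ("old", "man", 1.0),
    ("little", "man", 1.0),
    ("good", "man", 1.0),
    ("little", "boy", 1.0),
]
COLORS = {0.0: "blue", 1.0: "yellow"}


def weighted_degree(edges):
    """Sum the weights of all edges touching each node (undirected)."""
    totals = defaultdict(float)
    for u, v, weight in edges:
        totals[u] += weight
        totals[v] += weight
    return dict(totals)


def scale_sizes(degrees, min_size=10, max_size=30):
    """Map degrees linearly onto the range [min_size, max_size]."""
    low, high = min(degrees.values()), max(degrees.values())
    span = high - low
    if span == 0:
        # Every node has the same degree, so every node gets the minimum size.
        return {node: float(min_size) for node in degrees}
    return {
        node: min_size + (deg - low) / span * (max_size - min_size)
        for node, deg in degrees.items()
    }


degrees = weighted_degree(weighted_edges)
sizes = scale_sizes(degrees)
colors = {word: COLORS[value] for word, value in values.items()}
for word in values:
    print(word, sizes[word], colors[word])
```

In this toy data the noun "man" touches the most edges, so it receives the maximum size of 30, while words with a single adjacency stay at 10.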

 

In the preview window, the edges were curved and colored either blue or yellow, based on the value of the node each one extended from. The background was also changed, and the nodes were labeled, with each label made proportional to the size of its node.

DCLab

Upon speaking with others using the same dataset, it was decided that more information would be necessary in order to have a better understanding of the overall data provided in the dataset. Options were limited without a full grasp of why certain categories were included, and what certain data was based on, such as the modularity classes.

Despite the lack of key information, the dataset could still be used. A dashboard, for example, could be created that features two network graphs. One would have the ranking set to “weighted in-degree”, which would size the nodes based on the number of edges coming to them. The other would have the ranking set to “weighted out-degree”, which would size the nodes based on the number of edges going away from them. Each of the graphs on this new dashboard would use a different color scheme to help users tell them apart.
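The two rankings proposed for the dashboard could be computed along these lines. This is only a sketch: it assumes each edge is read as directed from its source word to its target word, which the dataset may or may not intend, and the words, weights, and the `in_out_degree` name are illustrative.

```python
from collections import defaultdict


def in_out_degree(edges):
    """Return (in_degree, out_degree) dicts of summed edge weights.

    edges: list of (source, target, weight) tuples, read as directed.
    """
    in_deg, out_deg = defaultdict(float), defaultdict(float)
    for source, target, weight in edges:
        out_deg[source] += weight  # edge leaves the source node
        in_deg[target] += weight   # edge arrives at the target node
    return dict(in_deg), dict(out_deg)


# Illustrative directed edges with weight 1.0 each.
directed_edges = [
    ("old", "man", 1.0),
    ("little", "man", 1.0),
    ("little", "boy", 1.0),
]

in_deg, out_deg = in_out_degree(directed_edges)
print("in-degree:", in_deg)
print("out-degree:", out_deg)
```

With every weight equal to 1, the weighted in-degree and out-degree reduce to plain counts of incoming and outgoing edges, so each graph on the dashboard would emphasize a different set of words.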

It would also be interesting to create a similar word adjacency network graph for every novel by Charles Dickens, and possibly a final network graph that visualizes the word adjacencies in all his works.