Word Adjacencies in Dickens’ David Copperfield


Lab Reports, Networks, Visualization
Charles Dickens, 1843 daguerreotype by Unbek in America, the earliest known photographic portrait of the author. Source: Wikipedia

Charles Dickens first published David Copperfield serially in 1849–50, and as a book in 1850. It is his most autobiographical novel and is told from the perspective of an adult David Copperfield, looking back at his life and his development over time.

As a former student of English Literature, I was excited to work with a topic that was familiar to me. I was curious to see how the language of a book that is nearly 200 years old would look when visualized, and whether the network of words would reveal themes that give someone unfamiliar with the novel a basic understanding of it.

Inspiration

My first thought about network visualizations was that they reminded me of word clouds, where scale and color inform how the viewer perceives the hierarchy and frequency of different words.

Google Image search for ‘word clouds’

After I started experimenting, I also looked at examples from two other students, Alvina Lai and Anna Size, who used the same dataset, to make sure that I was on the right track.

Materials

The software used for this project is Gephi, which is free and open source and runs on Windows, Mac, and Linux. It is used to calculate network statistics, detect clusters, and filter, style, and label networks. Plugins from open-source developers provide additional features. Once the visualizations are finalized, they can be exported as PDF, PNG, or SVG.

The data set I am working with was found on the Gephi GitHub page. The file was in Graph Modeling Language (GML) format, which I was unfamiliar with, but I opened it in a text editor and it seemed to be consistently formatted, so I decided to try importing it.

GML node syntax
GML edge syntax
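
To make the node and edge syntax concrete, here is a minimal sketch of what such a file looks like and how its blocks can be read with the Python standard library. The sample is illustrative rather than copied from the dataset: the ids, labels, and adjacencies are made up, and `parse_blocks` is a hypothetical helper that only handles flat key/value blocks, not the full GML grammar.

```python
import re

# Illustrative GML snippet (NOT copied from the dataset): each node has an id
# and a label, and each edge refers to two node ids.
SAMPLE_GML = '''
graph [
  node [
    id 0
    label "little"
  ]
  node [
    id 1
    label "old"
  ]
  node [
    id 2
    label "man"
  ]
  edge [
    source 0
    target 2
  ]
  edge [
    source 1
    target 2
  ]
]
'''

def parse_blocks(text, kind):
    """Return one dict per `kind [ ... ]` block (flat key/value pairs only)."""
    blocks = []
    for body in re.findall(rf'\b{kind}\s*\[(.*?)\]', text, flags=re.DOTALL):
        # Each entry is a key followed by a quoted string or a bare token.
        pairs = re.findall(r'(\w+)\s+("[^"]*"|\S+)', body)
        blocks.append({key: value.strip('"') for key, value in pairs})
    return blocks

nodes = parse_blocks(SAMPLE_GML, "node")
edges = parse_blocks(SAMPLE_GML, "edge")

# Print each adjacency as a pair of words.
labels = {n["id"]: n["label"] for n in nodes}
for e in edges:
    print(labels[e["source"]], "-", labels[e["target"]])
```

Gephi does this parsing itself on import, so a check like this is only useful for getting a feel for the file before opening it.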

Methods & Process

My first step was to open the GML file in Gephi and review the import report. It showed no errors, so I was able to move forward. The file contained 112 nodes and 425 edges. I also added a Word Type column so that I could style nouns and adjectives differently.

Next, I ran the following statistics and checked that the data tables updated with the calculated values:

  1. Average degree = 3.795
  2. Diameter = 5
  3. Density = 0.068
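
As a sanity check, average degree and density can be recomputed from the node and edge counts alone. The sketch below assumes the usual textbook definitions rather than Gephi's exact implementation: average degree is edges per node when the graph is read as directed and twice that when it is read as undirected, and density is the share of possible edges that are present. The function names are my own.

```python
def average_degree(n_nodes, n_edges, directed=False):
    """Mean degree: E/N if directed (in- or out-degree), 2E/N if undirected."""
    return n_edges / n_nodes if directed else 2 * n_edges / n_nodes

def density(n_nodes, n_edges, directed=False):
    """Fraction of possible edges present (no self-loops)."""
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible /= 2
    return n_edges / possible

N, E = 112, 425  # node and edge counts from the import

print(round(average_degree(N, E, directed=True), 3))  # 3.795
print(round(average_degree(N, E), 3))                 # 7.589
print(round(density(N, E), 3))                        # 0.068
print(round(density(N, E, directed=True), 3))         # 0.034
```

With these counts, the directed reading reproduces the average degree of 3.795 reported above. The diameter cannot be derived from the counts alone, because it depends on which nodes are connected.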

With the data ready to go, I experimented with layout and styling on the Overview tab, then previewed and exported some samples:

I started with the Fruchterman-Reingold layout. The outer nodes are too small to read, and the edges are thin and hard to differentiate.
I updated the previous iteration to set the font size proportional to frequency, so that high-level themes in the data would stand out.

ForceAtlas2 layout. Node labels are a uniform size, and edges are thicker, with gradient colors based on weight. Labels are illegible.
I worked out how to color the adjective nodes yellow and the noun nodes blue, in addition to keeping the proportional text. No text overlaps and all words are legible.
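
The styling in the last two iterations comes down to two mappings: a number per word to a label size, and a word type to a color. The sketch below illustrates that idea outside Gephi. The counts, the size range, and the helper name `scale_sizes` are all assumptions made for illustration; in Gephi itself this is done through the Appearance settings rather than code.

```python
def scale_sizes(values, min_size=8.0, max_size=32.0):
    """Linearly map each word's value onto the range [min_size, max_size]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        # All values are equal, so give every label the midpoint size.
        return {word: (min_size + max_size) / 2 for word in values}
    return {
        word: min_size + (v - lo) / span * (max_size - min_size)
        for word, v in values.items()
    }

# Colors by word type, matching the yellow/blue scheme described above.
COLORS = {"adjective": "yellow", "noun": "blue"}

# Hypothetical per-word counts and types (NOT taken from the dataset).
counts = {"little": 10, "old": 8, "man": 4, "eye": 2}
word_type = {"little": "adjective", "old": "adjective", "man": "noun", "eye": "noun"}

sizes = scale_sizes(counts)
for word in counts:
    print(word, sizes[word], COLORS[word_type[word]])
```

A linear scale keeps the ordering of the words obvious. A square-root or logarithmic scale would be a reasonable alternative if a few very frequent words dwarf the rest.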

Results & Analysis

After a few changes to the labels (font and color), the node sizes, and the edge style, I decided that the version below is the clearest: it shows which nouns and adjectives are most common and have the most connections. Since the words ‘little’, ‘old’, and ‘good’ stand out most, it goes some way toward my initial goal of conveying the broad themes of David Copperfield.

Reflection & Next Steps

Overall, I was nervous about working with Gephi at first, since I had no previous experience with graph theory or networks. However, once the data imported cleanly and I started experimenting with the styling, I became more comfortable. Seeing how the data and layout adjustments affected the visualization helped me understand the statistics better. This data set was relatively small and simple, so I would be interested in trying something with more complexity or modularity in the future.

Sources

Dickens, Charles. David Copperfield. http://www.gutenberg.org/files/766/766-h/766-h.htm#link2HCH0001

Newman, M. E. J. “Finding community structure in networks using the eigenvectors of matrices.” Phys. Rev. E 74, 036104 (2006).