The English language has patterns that network visualizations can reveal. Analyzing how often nouns and adjectives appear, and how they relate to one another, can help us understand common habits of English in a given period. This visualization looks at the writing of the 19th-century author Charles Dickens, specifically his novel David Copperfield. This post walks through how the visualization was developed and some conclusions drawn from the final product.
The program used was a free network visualization program, Gephi (link here).
The dataset was a free, open file from the Gephi wiki (link here). The dataset used was “GML. Word adjacencies.”
First I looked for datasets. I wanted to use one related to data from the website TV Tropes because I understand the subject matter well and have used the website before. However, the dataset did not include information about nodes. Instead, I chose the dataset about David Copperfield due to familiarity with 19th century literature.
For inspiration material, I wanted something to guide my use of color and edge space.
I found Grocery Purchasing Correlations (link here). I liked that the colors were somewhat complementary, which helps tell nodes apart from edges. I also liked that the nodes did not have a colored (often black) outline, which can overlap and block text.
I admired the Netflix Similarity Map (link here) because it was both aesthetically pleasing and much more detailed than the grocery visualization. Its edges did not overlap. However, I recognized that my dataset would not achieve that impact.
I downloaded the dataset and opened it in Gephi. I was working with 112 nodes and 425 undirected edges. On the Overview tab, I ran statistics for Average Degree (about 7.6), Network Diameter (5), and Graph Density (0.068).
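As a rough check on those figures, the average degree and the density can be recomputed from the node and edge counts alone. This is a minimal sketch using the standard formulas for an undirected graph. The variable names are my own, and Gephi's report may round its output differently.

```python
# Node and edge counts reported for the undirected word-adjacency network.
nodes = 112
edges = 425

# In an undirected graph every edge touches two nodes,
# so the average degree is 2 * E / N.
average_degree = 2 * edges / nodes

# Density is the share of possible node pairs that are actually linked:
# E divided by N * (N - 1) / 2.
possible_pairs = nodes * (nodes - 1) / 2
density = edges / possible_pairs

print(f"average degree: {average_degree:.2f}")
print(f"graph density:  {density:.3f}")
```

With these counts the script prints an average degree of 7.59 and a density of 0.068. The network diameter cannot be derived this way, because it depends on which words are connected and not only on how many.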
In the nodes data table, I created a new column. The dataset included both adjectives and nouns, so I created a “speech” column and marked each row as “a” or “n”. This let me color the nodes with a partition on speech. For node size, I ranked them by degree.
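The same two steps, labeling each word and sizing it by degree, can be sketched outside Gephi. The word pairs, the labels, and the size range below are hypothetical stand-ins and not rows from the actual file. The linear scaling is an assumption about how a degree ranking maps to node size, not a description of Gephi's internals.

```python
from collections import Counter

# Hypothetical sample of word pairs; the real file has 425 edges.
edges = [
    ("old", "man"),
    ("little", "man"),
    ("little", "room"),
    ("old", "little"),
    ("good", "man"),
]

# The hand-made "speech" column: "a" for adjective, "n" for noun.
speech = {"old": "a", "little": "a", "good": "a", "man": "n", "room": "n"}

# Degree = number of edges touching each word.
degree = Counter()
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

# Hypothetical size range; each node is scaled linearly between the two.
MIN_SIZE, MAX_SIZE = 10, 40
low, high = min(degree.values()), max(degree.values())


def node_size(word):
    """Map a word's degree onto the size range."""
    if high == low:
        return MIN_SIZE
    return MIN_SIZE + (degree[word] - low) * (MAX_SIZE - MIN_SIZE) / (high - low)


# List words from most to least connected, with label, degree, and size.
for word in sorted(degree, key=degree.get, reverse=True):
    print(word, speech[word], degree[word], round(node_size(word), 1))
```

In this toy sample the best-connected words get the largest size and the words with a single edge get the smallest, which mirrors how the larger nodes in the final image are the words with the most edges.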
For layout, Force Atlas 2 was the first option I tried. Fruchterman Reingold was a potential aesthetic option, but it limited the information conveyed by the distance between nodes. The final layout choice was Force Atlas because it lets the nodes sit farther from each other, and the room in between was helpful.
After experimenting with edge color, I decided that each edge would take a mix of its nodes' colors because that provides the most information. Originally the edge color came from the source node, but the network looked “like spaghetti.” I adjusted the colors so that adjectives, which were more common, were orange (close to the color “copper”). Nouns were blue because it contrasts with the orange. It also helped bring attention to the nouns, which are both smaller and fewer. The resulting mixed edge color, a green-grey, was acceptable. I removed the node borders so the black text would be more visible. Having edges in a color different from most of the nodes made me feel more confident in this decision.
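To illustrate what a mixed edge color means, here is a minimal sketch that blends the colors of the two nodes an edge joins. The RGB values are hypothetical, since the exact orange and blue are not recorded here, and the plain per-channel average is an assumption that may not match Gephi's own blending.

```python
# Hypothetical RGB values standing in for the palette; the exact
# colors picked in Gephi are not recorded in the post.
ADJECTIVE_ORANGE = (230, 140, 50)
NOUN_BLUE = (70, 150, 190)


def mix(color_a, color_b):
    """Average two RGB colors channel by channel."""
    return tuple((a + b) // 2 for a, b in zip(color_a, color_b))


def edge_color(source_color, target_color):
    """An edge takes the blend of the two nodes it joins."""
    return mix(source_color, target_color)


print(edge_color(ADJECTIVE_ORANGE, NOUN_BLUE))         # adjective-noun edge
print(edge_color(ADJECTIVE_ORANGE, ADJECTIVE_ORANGE))  # adjective-adjective edge
```

An edge between two words of the same type keeps that type's color, while an edge between an adjective and a noun lands on a muted in-between tone. That is why the mix carries more information than coloring by source alone.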
I considered changing the font to a serif, to match the time period of the novel. However, visually it was neither helpful nor pleasant. A sans-serif made more sense for a screen view.
By asking for assistance, I learned that there is no way to add a color key or a title to the exported image. Because the image was going into a blog post, I simply noted the limitation. However, if the chart were interactive or printed in some way, I would have preferred to include both elements.
I found the interactive view of the chart more helpful than the static one. I appreciated being able to hover over a node and clearly see the items directly related to it. I was advised to export the network and put it on GitHub (link here) so other users can experience the interactive version.
The dataset provided more adjectives than nouns. A single noun node may have multiple adjectives related to it. Adjectives also connect to other adjectives more often than nouns connect to other nouns.
The bigger nodes, which are words that have the most edges, connect frequently to both nouns and adjectives. The smaller nodes tend to be adjectives connecting to nouns, which then connect to a larger web of adjectives.
When the edges were colored by source, orange (adjectives) dominated the screen. In the final result, though, the edges were colored by a mix of their node colors.
When edges are colored by the mix of the node colors, a grey-green (the combination of the orange adjectives and the blue nouns) dominates. There are some blue and orange lines as well. This suggests that adjectives tend to connect to nouns.
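The visual impression could also be checked with a count. The sketch below tallies edges by the word types at each end. The labels and word pairs are hypothetical and only show the idea, so the numbers say nothing about the real dataset.

```python
from collections import Counter

# Hypothetical labels and word pairs, for illustration only.
speech = {"old": "a", "little": "a", "good": "a", "man": "n", "room": "n"}
edges = [
    ("old", "man"),
    ("little", "man"),
    ("little", "room"),
    ("old", "little"),
    ("good", "man"),
    ("man", "room"),
]


def edge_kind(source, target):
    """Return 'a-a', 'a-n', or 'n-n' regardless of edge direction."""
    return "-".join(sorted((speech[source], speech[target])))


# Count how many edges fall into each pairing of word types.
kinds = Counter(edge_kind(s, t) for s, t in edges)

for kind in ("a-a", "a-n", "n-n"):
    share = kinds[kind] / len(edges)
    print(f"{kind}: {kinds[kind]} edges ({share:.0%})")
```

Run against the full edge list, a tally like this would show whether adjective-noun edges really outnumber the other two kinds, instead of relying on which color seems to fill the screen.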
I found that the subject matter was important. I wanted to work on the TV Tropes dataset because it interests me, but I was also comfortable working with the Copperfield dataset because literature is an interest as well.
Because language evolves over time, and because the words are taken out of context, it was not always clear if a word was meant to be an adjective or a noun. To create a more accurate visualization, I would read the novel to understand the most common usage of certain words in order to attribute the right label to them.
It would also be interesting to compare this novel to other novels, to see if the same words appear, or if they have the same relationships. It would also be fascinating to compare Dickens to other writers of his time. This project opened up the fun and potential of language-related visualization for me, and I’d like to create a dataset from literature to explore this topic further.