The Connection Between Words in the Novel David Copperfield


Introduction

I’ve always been interested in language and linguistics, especially in how different languages connect adjectives and nouns. I believe the language we speak shapes how each of us perceives the world.
Lera Boroditsky is a cognitive scientist and professor working in the fields of language and cognition, and she is now one of the main contributors to the theory of linguistic relativity. In her TED talk “How language shapes the way we think”, she describes the interesting differences in “grammatical gender” across languages. In a language with grammatical gender, every noun is assigned a gender, often masculine or feminine. For instance, Boroditsky mentions in the talk that “the sun is feminine in German but masculine in Spanish, and the moon, the reverse.”
I found a dataset in Gephi’s GitHub resources that could give me a taste of analyzing the linguistic/semantic relationships in a book, so I decided to use it to explore how adjectives connect to nouns in that book. From the final data viz outcome, I should be able to see the author’s word preferences and the most frequently used words.

Tools and Dataset

Gephi is a good tool for anyone who wants to visualize the relationships between topics, objects, people, and incidents. It’s a free, open-source graph platform that runs on Windows, Mac, and Linux. For this data viz exploration, I wanted to play with its styling and analysis features to discover the best way of telling a visual story with a network graph.
As I mentioned in the introduction, my interest in linguistics drives this data viz exploration. I therefore chose the dataset “adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens” as the material for this report. You can find the GML file here.

Inspiration/Critique

Network data visualizations inevitably have a lot of lines and connecting points. This raises a concern about the readability of the final outcome: can the audience really understand the relationships and the story? I think audiences can often identify patterns only by recognizing the size of the nodes (connecting points) and the density of the lines across the whole network. As a result, I can’t really tell from any of these three graphs whether a single link, or a few major links, between nodes forms the main storyline. But I like the color selections, since they showcase the vibe of the different stories, and I would like to apply this to my own work.

Methods/Process
Pull Data and Observe

The very first graph generated by Gephi was what I expected to see. My dataset is a decent one for experimenting with network data viz: it has 112 nodes and 425 edges, an average degree of 7.589, and a graph density of 0.034.
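These figures can be cross-checked from the node and edge counts alone. The snippet below is a minimal sketch using the standard textbook formulas; the function names are hypothetical and this is not Gephi's own code. One observation from the arithmetic: the reported average degree matches the undirected formula, while the reported density of 0.034 matches the directed formula. Whether Gephi treated the imported graph as directed is an assumption that isn't confirmed here.

```python
def average_degree(nodes: int, edges: int) -> float:
    """Mean number of edge endpoints per node, counting each edge twice (undirected)."""
    return 2 * edges / nodes


def density(nodes: int, edges: int, directed: bool = False) -> float:
    """Share of possible node pairs that are actually connected."""
    possible = nodes * (nodes - 1)  # ordered pairs (directed case)
    if not directed:
        possible /= 2               # unordered pairs (undirected case)
    return edges / possible


NODES, EDGES = 112, 425  # counts reported for the David Copperfield dataset

print(round(average_degree(NODES, EDGES), 3))           # 7.589
print(round(density(NODES, EDGES, directed=True), 3))   # 0.034
print(round(density(NODES, EDGES, directed=False), 3))  # 0.068
```

Either way, the graph is sparse: only a small share of all possible word pairs are actually linked.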
Even before I styled the nodes and edges, the graph seemed manageable to read and interpret given its scale and complexity. So I started to play with the layout styles to figure out which one best represents this dataset.

Before styling

I adjusted the node size based on degree, so I could see which word is the most connected and which other words connect to it. I also tried coloring the edges by degree, but the value distribution was not distinct enough to tell them apart.
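The idea behind degree-based sizing can be sketched outside Gephi as well. The example below is only an illustration under stated assumptions: the edge list is a made-up toy sample rather than the real dataset, and the helper names and the 10 to 50 size range are hypothetical choices, not Gephi settings.

```python
from collections import Counter


def degrees(edges):
    """Count how many edges touch each word (undirected degree)."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def scale_sizes(deg, min_size=10.0, max_size=50.0):
    """Linearly map degrees onto a node-size range."""
    lo, hi = min(deg.values()), max(deg.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all degrees are equal
    return {
        word: min_size + (d - lo) / span * (max_size - min_size)
        for word, d in deg.items()
    }


# Toy edge list, invented for illustration only (not taken from the dataset).
sample_edges = [("little", "man"), ("little", "old"), ("old", "man"), ("little", "child")]

deg = degrees(sample_edges)
sizes = scale_sizes(deg)
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: degree {deg[word]}, size {size:.0f}")
```

In this toy sample, the word with the most connections gets the largest node, which is the same visual cue the styled graph relies on.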
After the basic styling, I also wanted to know how strong the connections between words are. But the dataset doesn’t seem to include a weight value that shows the strength of each connection, which is why I started looking for alternative options for my network data viz work.

The weight values are all the same
Alternatives

My professor suggested Voyant as a tool for analyzing the full text of David Copperfield by Charles Dickens. With this tool, I might be able to get the weight of the connections between words. For Voyant to extract this value, I had to go to Project Gutenberg to get the full plain text to use as input.
For some reason, Voyant wouldn’t let me export the file I needed for the network data viz, so another tool called textexture was suggested to me as an alternative. Unfortunately, it no longer allows new signups, so I couldn’t load my data to see the result.
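One possible way to derive weights directly from the plain text is to count how often two words of interest appear next to each other. The sketch below is an assumption about how such a weight could be defined; it is not what Voyant or textexture computes. The function name, the sample sentence, and the vocabulary are all invented for illustration, and the tokenizer is deliberately simple (it ignores punctuation, so pairs can span sentence boundaries).

```python
import re
from collections import Counter


def adjacency_weights(text, vocabulary):
    """Count adjacent occurrences of word pairs where both words are in `vocabulary`."""
    words = re.findall(r"[a-z']+", text.lower())
    weights = Counter()
    for left, right in zip(words, words[1:]):
        if left in vocabulary and right in vocabulary and left != right:
            # Sort the pair so ("old", "man") and ("man", "old") share one edge.
            weights[tuple(sorted((left, right)))] += 1
    return weights


# Invented sample text and word list, for illustration only.
sample_text = "The little old man met a little boy. The old man smiled."
vocab = {"little", "old", "man", "boy"}

weights = adjacency_weights(sample_text, vocab)
for (a, b), count in weights.most_common():
    print(f"{a} - {b}: {count}")
```

Run on the full Project Gutenberg text with the dataset's own word list as the vocabulary, counts like these could be attached to the edges as a weight column before importing into Gephi.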

Result

Since I still wanted to see how strong the connections between words are, I decided to make up the weight values manually to see how the graph might look with weights applied.
I also felt the edges were more visible on a dark background with light-colored labels. Low-degree nodes have darker labels, since I consider them minor information in this graph compared with the most connected ones.

Reflection

Understanding the connections between words is definitely a topic I would like to visualize again in the future, but related resources and tools are pretty hard to find. I’d even like to make two graphs comparing two different languages, to see whether there are any similarities or patterns across languages.