Introduction
I wanted to visually explore the relationships between words and, luckily, the Gephi GitHub wiki contained a data set titled “Word adjacencies: adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens” (Newman, 2006). This data set seemed like a perfect sandbox for testing out Gephi, an open-source network analysis tool.
The visualizations I collected as models for this exercise focus on using different attributes to make sense of overwhelming data.
Figure 1 uses clusters and colors to visualize relationships between words. The researchers who created this network analysis wanted to visualize the word-morph game, where individuals change one letter of a word at a time to slowly transform it into another word (ex: cat-cot-dot-dog). The force-directed network shows all valid English three-letter words, where two words are connected by an edge if they differ in exactly one letter. Considering the overwhelming amount of data, I thought the researchers found a good way of visualizing the relationship between certain words, though including text labels on key nodes might help guide users through the network without the need to consult the paper.
Meanwhile, Fig. 2 uses a radial diagram to illustrate the relationship between words and chapters in the popular novel, “Les Misérables.” The novel has 365 chapters that are grouped into books. The graph uses colors, text, and location to map the relationship between words in individual chapters, books, and the entire novel. While some radial diagrams can be confusing, I thought this one was very clear and displayed information successfully.
Figure 3 uses colors and text to emphasize certain nodes, while the edges represent physical interactions. I found this graph interesting because the authors of the study did not use color to emphasize clusters. Instead, they wanted to draw the reader’s attention to certain major proteins (nodes). The network effectively shows the relationship between major and minor nodes and their physical interactions.
Creating the Visualization
I imported my zipped GML file into Gephi. GML (Graph Modeling Language) is a text file format that supports network data with a very simple syntax and works well with Gephi (Gephi, 2016). Gephi automatically graphed my data in the Overview tab, but I had to decide how I wanted to interpret the relationship between these words. I clicked on the Data Laboratory tab to analyze the content in the GML file. Nouns and adjectives had different weights, which meant I could easily distinguish and quantify the two variables. The information was also directed, which meant I could explore the relationship between adjacent nouns and adjectives. I wanted to look into which adjectives modified which nouns and which nouns were modified the most, so I decided to do two networks:
• The first graph would address the most common adjectives used to describe nouns. This network would be an out-degree directed graph because most of the time, though not always, adjectives are placed before nouns.
• The second graph would address which nouns are described the most. This network would be an in-degree directed graph.
The logic behind both graphs can be summarized this way: adjective -> noun.
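To make the in-degree and out-degree idea concrete, here is a minimal Python sketch. It is not what Gephi does internally, just the same counting done by hand. The edge list is a small toy sample built from adjective-noun pairs that appear in the word lists later in this post, not the full dataset, and the variable names are my own.

```python
from collections import Counter

# Toy directed edges (adjective -> noun). These pairs come from the word
# lists reported later in this post; this is NOT the full dataset.
edges = [
    ("good", "man"), ("good", "woman"), ("good", "friend"), ("good", "thing"),
    ("poor", "man"), ("poor", "boy"),
    ("little", "man"), ("little", "woman"), ("little", "room"),
    ("old", "man"), ("old", "room"),
]

# Out-degree: how many edges leave a word (how many nouns an adjective describes).
out_degree = Counter(source for source, _ in edges)

# In-degree: how many edges arrive at a word (how many adjectives describe a noun).
in_degree = Counter(target for _, target in edges)

top_adjective, adj_count = out_degree.most_common(1)[0]
top_noun, noun_count = in_degree.most_common(1)[0]

print(top_adjective, adj_count)  # good 4
print(top_noun, noun_count)      # man 4
```

Sizing nodes by out-degree therefore highlights the busiest adjectives, while sizing by in-degree highlights the most described nouns.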
I decided my network would answer four questions:
• Are nouns or adjectives more popular in David Copperfield?
• Which are the most described nouns and most used descriptive adjectives?
• What is the relationship between the top 3 nouns and adjectives?
• Can anything be deduced about the novel by looking at the relationship between nouns and adjectives?
I clicked on the Overview tab again. In the Nodes section, I chose Attribute>>Size>>Out-Degree and then In-Degree for the reasons discussed above. I then chose Attribute>>Color>>Value as a way of using colors to distinguish between nouns and adjectives. I decided not to change the default red setting for adjectives and green setting for nouns, since it obeyed the hierarchy of color. I selected Label Adjust as my Layout since my graph is based on understanding the relationship between words, so legibility is very important. In the Preview tab, I selected Default Straight for the edges because I wanted users to clearly see how nouns and adjectives were connected. I also decided on Adobe Caslon Pro for the font because I thought it gave my graph an old-timey Charles Dickens vibe without being obnoxious. I had to play around with some of the nodes because they overlapped, but overall, I was pleased with how my first Gephi graphs looked.
Interpreting the Data
Based on this dataset, adjectives are featured more prominently in David Copperfield. I hypothesized that Charles Dickens might use more adjectives than nouns since he was supposedly paid by the word and multiple adjectives can be added to describe a single noun, but this “fact” turned out to be apocryphal. Apparently, his style was meant to be tongue-in-cheek: he imitated “long-winded bureaucratic, professional or ceremonious jargon to satirize the institutions that use such language” (Grossman, 2012). Perhaps this is why more adjectives are repeated throughout the text, or maybe this is just a common phenomenon in fictional prose.
The variety of individual adjectives and nouns is almost identical (51.79% vs 48.21%). However, upon further consideration, the creator of the dataset did not explain how he narrowed down the most common nouns and adjectives (did he try to select approximately the same number of adjectives and nouns?), so this might not be the best dataset for analyzing which parts of speech are used the most in the novel.
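For anyone who wants to check the percentages, the arithmetic can be sketched as below. The counts of 58 adjectives and 54 nouns are an assumption on my part: they are not stated above, but they are consistent with the quoted percentages.

```python
# Assumed counts (not stated in the text above): 58 adjectives and 54 nouns.
adjectives = 58
nouns = 54
total = adjectives + nouns

# Share of each part of speech among the distinct words, as a percentage.
adj_share = round(100 * adjectives / total, 2)
noun_share = round(100 * nouns / total, 2)

print(adj_share, noun_share)  # 51.79 48.21
```

With these assumed counts, the gap between the two groups is only four words, which is why the split looks almost even.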
However, this dataset does a good job of selecting the most described nouns and the most descriptive adjectives.
The top five descriptive adjectives are (Fig. 4):
• Good
• Same
• Other
• Small
• Poor
The top five described nouns are (Fig. 5):
• Man
• Room
• Way
• Face
• Thing
Even though I have never read David Copperfield, I thought these results could help further a literary analysis of the novel. I thought it was interesting that “good” described the most nouns, even though the word “poor” also made it into the top five. “Man” is the most described noun in the novel. Since the title of the book is a man’s name, this is not surprising, but it is interesting that it is the only living being among the top five described nouns.
The top three adjectives described the following nouns:
• Good = Something, Friend, Thing, Head, Time, Day, Person, Man, Word, Mind, Fire, Night, Hope, Part, Place, Arm, Woman
• Same = Thing, Room, Time, Place, Manner, Boy, Air, Word, State, Face, Moment
• Other = Hand, Way, House, Side, Face, Evening, Door, Head, Part, Room, Friend, Place, Time, Manner, Course, Boy, Person, Child
Perhaps researchers may find it heartening that the novel contains a variety of humans described as being “good.” These include “good friend,” “good person,” “good man,” and “good woman.” That said, Dickens seems to prefer describing non-living things as good.
I also wanted to verify how “poor” was used, since many of Charles Dickens’s novels deal with poverty.
• Poor = Hand, Head, Child, Boy, Mother, Man
Without context it is difficult to tell whether this word has an economic connotation or a pitying one.
The top three nouns are described by adjectives the following way:
• Man = Little, Old, First, Young, Better, Black, Good, Best, Happy, Certain, Common, Poor, Agreeable, Alone
• Room = Little, Old, First, Other, Best, Long, Large, New, Dark, Quiet, Small, Same, Usual
• Way = Little, Old, Other, Best, Short, Pretty, Whole, Long, New, Wrong, Quiet, Light, General, Small
Someone interested in performing a gender analysis of the novel might also look at the most common adjectives used to describe women in the novel and compare it against the word “man”:
• Woman = Little, Old, Better, Young, Good, Pretty
Based on this analysis, fewer words are used to describe women, and when they are described, the words mostly pertain to their physical appearance.
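The comparison between the two word lists can also be done with simple set operations. This is only a sketch using the adjectives listed above for “man” and “woman”; the variable names are my own.

```python
# Adjectives listed above for each noun, lowercased for comparison.
man = {"little", "old", "first", "young", "better", "black", "good",
       "best", "happy", "certain", "common", "poor", "agreeable", "alone"}
woman = {"little", "old", "better", "young", "good", "pretty"}

shared = man & woman      # adjectives used for both nouns
only_woman = woman - man  # adjectives used only for "woman"
only_man = man - woman    # adjectives used only for "man"

print(len(man), len(woman))  # 14 6
print(sorted(shared))        # ['better', 'good', 'little', 'old', 'young']
print(sorted(only_woman))    # ['pretty']
```

On these lists, five of the six adjectives for “woman” also describe “man,” and “pretty” is the only one unique to “woman.”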
As a fun exercise, I tried to guess the themes and motifs of David Copperfield based on the relationships between the top words. I hypothesized that it dealt with the plight of a man, that people he felt kinship towards were poor and down on their luck, that friendship played a strong role, women were used as a symbol of goodness/hope, and that at the end good prevailed. The novel was probably set in a city with a lot of scenes indoors.
Weaknesses, Lessons, and Future Directions
The dataset is not perfect. Its flaws are most apparent in Figure 5, where Gephi does not understand when multiple adjectives are being used to describe one noun and interprets the first adjective as a modifier of the second adjective. Also, some nouns can be used before an adjective (ex: “something small”), and there is no way to take this into account in the set.
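The stacked-adjective problem is easy to reproduce. The sketch below assumes that edges simply join words in adjacent positions, which is my reading of the dataset title rather than a documented procedure. The phrase and the function name are hypothetical.

```python
def adjacent_pairs(words):
    """Pair each word with the word that immediately follows it."""
    return list(zip(words, words[1:]))

# Hypothetical phrase with two adjectives in front of one noun.
phrase = ["poor", "little", "boy"]

print(adjacent_pairs(phrase))  # [('poor', 'little'), ('little', 'boy')]
```

Under that assumption, “poor” is linked to “little” instead of “boy,” so the first adjective looks like it modifies the second one.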
I wish the adjective nodes in Figure 5 did not draw the user’s attention away from the noun nodes. In further analysis, I would experiment with more ways of using text, color, and size to better draw users to the intended information.
I would have liked to interpret this data using a hypergraph. My next project will involve mastering that skill set.
References
• Gephi (2016). GML Format. Retrieved from https://gephi.org/users/supported-graph-formats/gml-format/.
• Grossman, J. (2012, February 2). Five myths about Charles Dickens. Washington Post. Retrieved from https://www.washingtonpost.com/opinions/five-myths-about-charles-dickens/2012/01/30/gIQAp0cUlQ_story.html.
• Newman, M.E.J. (2006). “Word adjacencies: adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens.” Phys. Rev. E 74, 036104. Retrieved from https://github.com/gephi/gephi/wiki/Datasets.