A Tangled Web of Words: Gephi Takes on Dickens



Network visualizations are my favorite – just when you think you know your data, it might suddenly show you something new. Gephi is a complicated tool, but I really enjoy the types of visualizations you can make and how they’re usually very pleasant to look at – a bonus when you want audiences to really spend time with your data. My introduction to Gephi happened a year ago, when I was part of a research team studying intersections in LIS literature. After gathering data from over 1,000 articles, we wanted to see what types of identity intersections were occurring. Did racism get mentioned alongside feminism and archives? Did LGBT themes ever appear alongside civil rights in libraries, or were all these identity terms of marginalized groups just lumped under the umbrella of general diversity?

[Image: hypergraph network]

In an attempt to make sense of our data, we turned to Gephi to see if any patterns arose. By creating a hypergraph, we were able to identify groups of works that were related to one another, and even which topics were talked about the most. It was a great exercise that helped us see how our data fit together, and influenced the way we went forward with presenting our findings. Without the help of Gephi’s hypergraph, it would have taken much longer to see and understand what correlations were being made within the data.

I was hopeful that my tiny history with Gephi meant my next project would yield amazing results. Browsing other network visualizations, I found a large network viz created by researchers at the University of Washington that showed Twitter activity following the Boston Marathon bombings of 2013. As a Bostonian who had just moved out of town a month earlier, I had followed the news stories to see what was actually happening back in my hometown. Many people were confused about who was attacking whom, and where and why it was all happening, and Twitter was a quick, easy way to communicate.

This particular graph maps hashtags used within the four-day period after the initial attacks; the modularity classes that appear here are different conversations about the events and how they were being handled. Pink hashtags are about the shooting at MIT, green about the manhunt that had started in Watertown, red shows overall Boston pride and support, and so on.

Since this was such a huge news story that was also very close to me and people I knew, I had no trouble reading this graph or finding patterns within it. My lab, however, was not so clear… I used a sample dataset from GitHub that is supposed to show word adjacencies in Charles Dickens’ novel David Copperfield. Having never read the story and knowing nothing about it (is this not about the magician…? oh), I found it very difficult to understand any patterns that may or may not have appeared as I was working on the visualizations.

To start, I ran the data through ForceAtlas 2 to air it out a bit, and also ran the degree statistic. Since that didn’t give me a particularly interesting network shape, I also tried the Fruchterman-Reingold and Yifan Hu Multilevel layouts before deciding to play around with modularity. The last time I used Gephi, this had worked so well, showing wonderful patterns and creating interesting clumps of data to study. I ran modularity with a resolution of 1.0, and Gephi returned 7 classes.
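If you’d rather poke at the same numbers outside of Gephi, here’s a minimal sketch of roughly what the modularity statistic is doing, using Python and networkx. It assumes you have the David Copperfield word-adjacency network as a GML file (the filename below is a placeholder) and uses networkx’s Louvain implementation, which is comparable to, though not identical to, what Gephi runs.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Load the word-adjacency network for David Copperfield.
# "adjnoun.gml" is an assumed filename -- point this at wherever your copy lives.
G = nx.read_gml("adjnoun.gml")

# Gephi's modularity statistic is based on the Louvain method; networkx (>= 2.8)
# ships a comparable routine. resolution=1.0 mirrors the setting used in Gephi,
# though the exact class count and membership can vary from run to run.
communities = louvain_communities(G, resolution=1.0, seed=42)

print(f"{len(communities)} modularity classes")
for i, words in enumerate(communities):
    print(i, sorted(words)[:8], "...")
```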

[Screenshots: modularity settings and the resulting class-size distribution]

The distribution seemed okay, so I moved forward with the visualization. I colored nodes and edges by modularity class, and played with the optimal distance of nodes in Yifan Hu in order to get a shape I liked. Even though I ended up with a quite beautiful graph, I have no idea what it means. A look at the modularity classes in the data laboratory did not seem to yield any clues – even though this data only includes nouns and adjectives used in the novel, both word types are mixed within each modularity class.

[Image: the David Copperfield word-adjacency graph, colored by modularity class]
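You can check that mixing directly. In Newman’s version of this dataset, each node reportedly carries a “value” attribute marking its part of speech (0 for adjectives, 1 for nouns – an assumption worth verifying against your copy of the file). Continuing from the sketch above:

```python
from collections import Counter

# Tally the part-of-speech mix inside each detected class.
# Assumes the "value" node attribute encodes 0 = adjective, 1 = noun.
for i, words in enumerate(communities):
    counts = Counter(G.nodes[w].get("value") for w in words)
    print(f"class {i}: {counts.get(0, 0)} adjectives, {counts.get(1, 0)} nouns")
```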

The word “little” has the most intersections and occurrences in the book, but the other words in that modularity class don’t seem to have anything else in common besides the obvious fact that they can be paired with the term “little” to make understandable phrases. Modularity class 5 shows the most promise of making sense, seeming to include words that relate to the concept of family, but again, as I have not read the book, I can’t be sure what this is trying to tell me.
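Degree is the number behind that claim about “little” – how many distinct words it sits next to in the text. Again continuing from the same graph, and assuming the node labels are the words themselves:

```python
# Words with the most distinct adjacencies; "little" should top the list.
top_degree = sorted(G.degree, key=lambda pair: pair[1], reverse=True)[:5]
print(top_degree)

# Everything "little" gets paired with in the novel.
print(sorted(G.neighbors("little")))
```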

[Screenshot: modularity class 5 in the data laboratory]

To add further confusion, the color assigned to modularity class 5 looks like it should be blue, but blue is not very visible within the graph. A group that large should be more prominent, like the red or yellow categories shown, so I am not sure what happened here.

Going forward, I would definitely start off with a dataset I understood well before trying to look for any cool patterns that might be lurking within the numbers. Working essentially in the dark with such a sensitive tool started to feel like an exercise in futility. I took a look at some other visualizations that showed relationships between words, and came across a nice series by Chris Harrison. Using a simple format of “rays” of words that shoot off a central shape, Harrison’s visualizations compare word associations between two words that have opposite definitions. The structure of these graphs is immediately clear, and using semantic color associations for each word helps ground the information even more. Whatever terms appear in the central rays of the viz are used neutrally with both words, and also appear in a shade that is neutral between the two colors used on either end.

[Image: one of Chris Harrison’s word-association visualizations]

I love how easy these visualizations are to understand, and think that maybe it’s better to show straightforward linguistic concepts instead of trying to be intricate with specific patterns and themes. In order to understand the visualization I created, I will definitely need to speak to someone who has read David Copperfield, or perhaps even someone who is intimately familiar with all of Dickens’ works and general writing style. I realize that the specificity shown is built from the dataset itself, so perhaps a less complicated Gephi dataset is all I need in order to get a visualization I’m happy with!