Info Vis Lab 4: Gephi


Lab Reports

The primary goal of this lab was to become familiar with Gephi, especially its quick save feature. The data I used is from Big Allied and Dangerous (BAAD), a project that “focuses on creation and maintenance of a comprehensive database of terrorist organizational characteristics.” The data includes 395 distinct terrorist groups that perpetrated at least one attack between 1998 and 2005. I used this data to create two distinct network visualizations.

I used three visualizations as guides for this lab. The first was from “Visualizing crime patterns data as a graph,” which comes from visualization company Cambridge Intelligence’s website. Their visualization links types of crimes to specific incidents, with their data coming from the Cambridge (MA) police station’s Record Management System. While they go through several iterations of their visualization, my focus is on the first one they created. The pattern it displays is very similar to the first visualization I made, with multiple large clusters that had little connectivity to one another. They ended up removing the leaf nodes and focusing on the other connections. This was not an option for my first dataset, but it helped me see that I didn’t necessarily need the country data there to find relations between groups.

The second visualization I looked at was “Shakespeare tragedies as network graphs,” which was created by Martin Grandjean. The eleven tragedies (Titus Andronicus, Romeo and Juliet, Julius Caesar, Hamlet, Troilus and Cressida, Othello, King Lear, Macbeth, Timon of Athens, Antony and Cleopatra, and Coriolanus) are each separate from one another in the visualization. This internal cohesion is what I wanted to achieve with my visualization, with the groups connected to one another just as the character nodes are. It also helped to show that nodes for the countries are unnecessary; there aren’t any nodes labeled Hamlet or Titus Andronicus.

The third and final visualization that I used as a guide for this lab was “Lexical Distance Among the Languages of Europe,” created by Teresa Elms. I thought this visualization was interesting for the way that it incorporated a semblance of geographic spacing into the network. The connections between different clusters (the smallest of the “lexical distance” lines) helped to create a more complete picture of the network than if each language group were simply floating in space.

The first dataset I used was a matrix of the terrorist groups and the countries they operated in. This data had to be altered and transposed in Excel before it could be uploaded into Gephi. Unfortunately, once it was uploaded, it was soon clear that there was no data connecting the groups to one another; each group was only connected to a country. While the visualization this created was interesting, it wasn’t quite what I was looking for. It produced a network in which every country node was important (in terms of centrality) and every other node was not. (See link below.)

Terrorist Groups by Country
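The reshaping step described above was done by hand in Excel, but the same idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny matrix, the group names, and the function names are all made up, and the real BAAD file is laid out differently. It turns a 0/1 group-by-country matrix into an edge list with Source and Target columns, which is a layout Gephi's spreadsheet import accepts.

```python
import csv
import io

# Hypothetical miniature of the group-by-country matrix (not the real BAAD data):
# one row per group, one column per country, 1 where the group operated there.
MATRIX_CSV = """group,Greece,India,Iraq
Group A,1,0,0
Group B,1,0,0
Group C,0,1,1
"""


def matrix_to_edges(matrix_text):
    """Turn a 0/1 group-by-country matrix into (group, country) pairs."""
    reader = csv.DictReader(io.StringIO(matrix_text))
    edges = []
    for row in reader:
        group = row["group"]
        for column, flag in row.items():
            # Every column except the group name is treated as a country.
            if column != "group" and flag.strip() == "1":
                edges.append((group, column))
    return edges


def edges_to_csv(edges):
    """Write the pairs as CSV text with a Source,Target header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return out.getvalue()


edges = matrix_to_edges(MATRIX_CSV)
edge_csv = edges_to_csv(edges)
print(edge_csv)
```

Because every pair runs from a group to a country, no edge ever joins two groups, which is why this first network could only show country hubs surrounded by leaf nodes.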

In order to find and display connections between groups, I had to edit and transpose a second dataset in Excel that contained edges between groups. This dataset provided what at first appeared to be a completely different visualization, with clusters made up of dozens of (seemingly) equally connected nodes. Once I compared it to the first visualization, it was clear that the country clusters had been preserved, even though no country data was technically present. Since Gephi kept the same color scheme for both visualizations, it was a simple matter to track the country clusters onto the second visualization. I think that ultimately the second visualization could stand on its own, but the context of the countries would have to be transferred in some way.

In addition, it’s easier to count the number of nodes in the smallest clusters in the first visualization because the edges are more obviously separated from one another. I wish that I could have found a way to make the terrorist group names more legible in the second visualization, but the sheer number of them and the length of many of the names made it unworkable. (See link below.)

Terrorist Groups Interconnected

After looking over my visualizations, I was surprised to see that Greece has the largest number of active groups, especially since the 2003 invasion of Iraq began during the time frame, albeit towards the end. I also expected Israel to rank higher (it’s fourth, after Greece, India, and Iraq). After some consideration, I noted that while this data is good for determining the number of terrorist groups active in each country, that number does not necessarily correspond to the amount of terrorist activity.

For example, while Hezbollah and Al-Qaeda would each count for only one node (for Israel and Iraq, respectively), they may be responsible for many more incidents than several of the Greek organizations put together. Therefore, while this seems to show that Greece was the country most affected by terrorism from 1998 to 2005, it’s in fact only possible to determine that Greece had the most active terrorist organizations within its borders. There’s also the issue of organizations operating in multiple nations. Al-Qaeda, for example, was active in several areas during this dataset’s time frame, but there are no links between Al-Qaeda in the US and Al-Qaeda in Afghanistan, because there is only one node for each terrorist group.

The data also makes no attempt to denote the severity of the attacks that each group carried out. A single assassination of a single target by a small organization creates a node that is equal to the node of an organization that carries out hundreds of large-scale attacks over the entire eight-year timeline, as long as both occur in the same country (or countries with an equal number of nodes). That being said, these are issues with the initial dataset, not with the way Gephi renders the network.
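The country ranking discussed above amounts to counting how many group nodes are attached to each country node, which is that country's degree in the first network. A minimal sketch of that count follows, assuming a group-to-country edge list. The rows and names here are invented for illustration and are not the BAAD figures.

```python
from collections import Counter

# Made-up (group, country) pairs standing in for the real edge list.
edges = [
    ("Group A", "Greece"),
    ("Group B", "Greece"),
    ("Group C", "India"),
    ("Group C", "Iraq"),
]


def groups_per_country(edge_list):
    """Count distinct groups attached to each country (the country's degree)."""
    counts = Counter()
    for _group, country in set(edge_list):  # set() drops duplicate rows
        counts[country] += 1
    return counts


counts = groups_per_country(edges)

# Highest count first; ties are broken alphabetically by country name.
ranking = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
for country, n_groups in ranking:
    print(country, n_groups)
```

Each group adds exactly one to a country's count no matter how many attacks it carried out, which is the limitation described above: the ranking measures how many organizations were active, not how much activity there was.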

Overall, I found Gephi to be an effective, if occasionally temperamental, program. The incremental way that the user runs the analysis is very useful for preventing information overload. This is especially true for color-coding and changing the size of the nodes and edges; the different ways to differentiate them only appear after each analysis is run. BAAD is apparently compiling a 2.0 version of the data, so I’d like to see what additions they’ve made and whether any more nuanced networks can be created with the additional information provided.