Visualizing In-School Friendships With Gephi

Background & Data

To explore Gephi, I wanted a network of people with several different characteristics, so that I could learn how to visualize relationships in terms of human commonalities. This was partly inspired by the sample visualizations I found, and partly by the fact that I had not used Gephi in earnest and hoped to maximize the number of design possibilities from a single data set. Accordingly, I chose a data set from Carnegie Mellon’s CASOS database called “Goodreau’s Faux Mesa High School.” Described as a simulation of an in-school friendship network, it records friendships among 205 students along with each student’s gender, grade, and race. The data was written as an XML file that gave the properties for the 205 nodes, the relationships that made up the 404 edges, and a statement that the graph was directed.

Research Questions

From this dataset, I generated a few questions to guide my Gephi lab. Because I was unsure exactly what Gephi could do, I kept my questions broad enough to include all of the properties in the dataset.

Question 1: To what extent are these high school friendships structured by gender? grade? race?

Question 2: What patterns across these properties can be found in the most “in the know” agents? (the term taken from the CASOS Analysis Report about this network, simply meaning nodes with the highest degree)

Sample Visualizations As Inspirations

The first visualization I wanted to take inspiration from was a network called “The Merging Worlds of Technology and Cars”, created for Bloomberg. The network itself is problematic for trying to represent several types of relationships equally and at once (e.g. investment, partnership). What I took from it was more on the design side, specifically the use of color. I found the viz effective for how it labeled nodes individually but also colored them by industry groupings: technology, carmakers, and rideshare companies, the latter being the impetus for the graph. I wasn’t quite sure how my nodes would cluster, but this network made me conscious of how I might use color to convey relationships between nodes that didn’t necessarily connect through the metric of in-school friendships. (figure 1)

Figure 1

My next sample visualization is one of the many “Facebook visualizations” floating around the internet, in this case a mutual friend network. The article explains that this particular visualization ought to be experienced dynamically, but I was more inspired by its organization: among all of the people in the network, one is singled out and centered while the others are arranged around that person. If it proved useful, I wanted to try this formation in Gephi. (Figure 2)

Figure 2

Finally, I looked at a digital humanities project that used a radial layout in Gephi to visualize biodiversity databases, in which color represented threat status and each “spoke” signified a country. This graph inspired me to look into the radial axis layout before coming to class, and I ultimately tried to apply the layout to my data set (with mixed results). (Figure 3)

Figure 3

Methods & Results

After deciding on my data set of in-school friendships, I used Google Refine to convert the XML into a csv file. Using the Web Addresses function in Refine’s Create Project screen, I was able to select the XML element that I wanted to create a table for (first the nodes with the three properties, then the edges with the source/target designation). For the nodes table, I had to transpose and remove empty rows in order to create a normalized csv file that featured name, gender, grade, and race. For the edge table, I added a column to specify “Directed” per the CASOS analysis report and output to csv.
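The same XML-to-csv conversion could also be scripted. The following is only a minimal sketch and not the Refine workflow described above: the element and attribute names (`node`, `property`, `edge`, `source`, `target`) and the three sample students are invented for illustration and are not the actual CASOS schema or data. The `Id`, `Source`, `Target`, and `Type` headers are the column names Gephi's spreadsheet import looks for.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature file. Tag names, attribute names, and values are
# assumptions made for this sketch, not the real CASOS format or data.
SAMPLE_XML = """
<network>
  <nodes>
    <node id="s1"><property name="gender" value="F"/><property name="grade" value="7"/><property name="race" value="Hisp"/></node>
    <node id="s2"><property name="gender" value="M"/><property name="grade" value="7"/><property name="race" value="NatAm"/></node>
    <node id="s3"><property name="gender" value="F"/><property name="grade" value="8"/><property name="race" value="White"/></node>
  </nodes>
  <edges>
    <edge source="s1" target="s2"/>
    <edge source="s2" target="s3"/>
  </edges>
</network>
"""


def xml_to_tables(xml_text, edge_type="Directed"):
    """Return (node_rows, edge_rows) as lists of dicts, one dict per csv row."""
    root = ET.fromstring(xml_text)
    node_rows = []
    for node in root.iter("node"):
        row = {"Id": node.get("id")}
        # Each <property> becomes one column, so every node ends up on one row.
        for prop in node.findall("property"):
            row[prop.get("name")] = prop.get("value")
        node_rows.append(row)
    edge_rows = [
        {"Source": e.get("source"), "Target": e.get("target"), "Type": edge_type}
        for e in root.iter("edge")
    ]
    return node_rows, edge_rows


def to_csv(rows, fieldnames):
    """Serialize rows to csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


node_rows, edge_rows = xml_to_tables(SAMPLE_XML)
print(to_csv(node_rows, ["Id", "gender", "grade", "race"]))
print(to_csv(edge_rows, ["Source", "Target", "Type"]))
```

Passing `edge_type="Undirected"` would write the other edge type into the `Type` column without changing anything else.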

After creating my project in Gephi and importing both csv files into the Data Laboratory, I began my analysis with the Force Atlas 2 layout. Next, I ran statistics on my network. The average degree was 3.941, the diameter was 16, the density was .01, and the modularity was .802 at a resolution of 1.0. The density seemed lower than what we had discussed in class (social networks, albeit larger ones, can be as dense as 20%), but I did note a sizable number of isolates, which might help explain the figure. The other strange discovery was that for every single node, the In-Degree and Out-Degree were exactly the same. This led me to believe that the network was almost certainly undirected, contrary to what the CASOS file had said. Regardless, when sizing nodes by degree I used a minimum size of 1 and a maximum of 26, which effectively doubled the size of every node while still reflecting the exact value listed as each node’s degree.
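As a rough cross-check on these numbers, the arithmetic behind average degree and density can be worked out from the node and edge counts alone. This is a sketch using the standard textbook formulas, not a reproduction of Gephi's own computation, and the three-node cycle at the end is an invented example. It also shows why matching In-Degree and Out-Degree is a strong hint, but not proof, that a graph is undirected; checking that every edge has a reverse is the stricter test.

```python
from collections import Counter


def average_degree(n_nodes, n_edges):
    """Mean degree when each edge is stored once and touches two nodes."""
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges, directed):
    """Share of all possible edges that are actually present."""
    possible = n_nodes * (n_nodes - 1)  # ordered pairs
    if not directed:
        possible //= 2  # unordered pairs
    return n_edges / possible


def in_out_degrees_match(edges):
    """True if every node has the same in-degree and out-degree."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    return all(out_deg[n] == in_deg[n] for n in set(out_deg) | set(in_deg))


def is_symmetric(edges):
    """True if every edge (a, b) is accompanied by its reverse (b, a)."""
    edge_set = set(edges)
    return all((dst, src) in edge_set for src, dst in edge_set)


# Counts quoted in the text: 205 nodes, 404 edges.
print(round(average_degree(205, 404), 3))           # 3.941
print(round(density(205, 404, directed=True), 4))   # 0.0097
print(round(density(205, 404, directed=False), 4))  # 0.0193

# Invented counterexample: a directed cycle has matching in/out degrees
# at every node, yet no edge is reciprocated.
cycle = [("a", "b"), ("b", "c"), ("c", "a")]
print(in_out_degrees_match(cycle), is_symmetric(cycle))  # True False
```

Read this way, an average degree of 3.941 agrees with 404 edges each counted once between two students, while a density near .01 matches the directed formula and the undirected formula gives roughly .02.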

Sizing my nodes from 1 to 26 and coloring according to the modularity class, I created my first visualization for this data set (figure 4).

Figure 4

Before continuing on with my research questions, I looked at a few other layouts and how they compared in design. For consistency’s sake, I followed the same steps with sizing/coloring to display the clusters. The next layout I chose was Fruchterman Reingold (figure 5). I ultimately chose not to continue with this layout, moving on to the radial axis layout (figure 5).

Figure 5

Satisfied with how these two layouts differentiated the clusters, I returned to Force Atlas 2 to begin addressing the various node properties of my first research question.

Sizing nodes consistently by degree and simply changing the colors depending on property, I created three separate network visualizations that conveyed the relationships within the school according to gender (figure 6), grade, and race (figure 7). In each case, I was sure to copy the particular property column information to the label column in the data laboratory.

Figure 6 – Gender

Figure 7 – Race

To address question 2, I had the idea to use the radial axis layout to evaluate the node with the largest degree in the manner of the mutual friend network from my sample visualizations. I quickly realized, though, that the radial layout was better-suited to the kind of graph seen in the DH biodiversity project and that I could apply the “characteristics of ‘in the know’ students” question not only to a single node but to the hub of each cluster. The resulting graphs, which I again ran and colored according to separate gender/grade/race properties, are exemplified here (figure 8).

Figure 8 – Gender

I spent the remainder of the lab working in Numbers to try and generate a source cluster and target cluster table so as to create a hypergraph, but unfortunately ran out of time. As I discuss below, this is still something I would like to explore in the future.
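A source-cluster/target-cluster table of this kind can be sketched in a few lines. This is an assumption about what such a table might contain, not the table I was building in Numbers: each friendship is replaced by the pair of clusters its endpoints belong to, and identical pairs are counted into a weight. The student ids and cluster assignments below are invented.

```python
from collections import Counter


def cluster_edge_table(cluster_of, edges, directed=False):
    """Collapse node-level edges into weighted cluster-level edges.

    cluster_of maps node id -> cluster id (e.g. a modularity class).
    Edges inside one cluster become self-loops on that cluster.
    """
    weights = Counter()
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if not directed and a > b:
            a, b = b, a  # treat (1, 0) and (0, 1) as the same pair
        weights[(a, b)] += 1
    return [
        {"Source": a, "Target": b, "Weight": w}
        for (a, b), w in sorted(weights.items())
    ]


# Invented example: five students in three clusters.
cluster_of = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 2}
edges = [("s1", "s2"), ("s2", "s3"), ("s4", "s1"), ("s3", "s4"), ("s4", "s5")]

for row in cluster_edge_table(cluster_of, edges):
    print(row)
```

Dropping the rows where `Source` equals `Target` would leave only the ties between clusters.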

Discussion

Of the three node properties I highlighted in my networks, by far the most telling was gender. The Force Atlas 2 graph in particular revealed how friendships among this group of students are anchored by a shared gender, with three of the four largest clusters exhibiting an overwhelming percentage of female students and another predominantly male (albeit with more females weighing in than in the female-dominated clusters). Considering the nearly even split between the students (100 females, 105 males), it seems significant too that so many more clusters are female-centric. It is possible that the network appears skewed because seven of the ten highest-degree nodes are female, but it is difficult to say. For a complete picture, I counted the number of isolates among male and female students and found that the counts were nearly identical, which again supports this reading of the network.
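The isolate count can be checked programmatically as well. A minimal sketch, assuming the node table has been read into a mapping from student id to gender and the edge table into a list of pairs; the five students below are invented and do not come from the data set.

```python
from collections import Counter


def isolates_by_attribute(attribute_of, edges):
    """Count nodes that appear in no edge, grouped by an attribute value."""
    connected = {node for edge in edges for node in edge}
    return Counter(
        value for node, value in attribute_of.items() if node not in connected
    )


# Invented example: only s1 and s2 are friends, the other three are isolates.
gender_of = {"s1": "F", "s2": "M", "s3": "F", "s4": "M", "s5": "F"}
friendships = [("s1", "s2")]

print(isolates_by_attribute(gender_of, friendships))
```

Swapping in a grade or race mapping gives the same breakdown for the other two properties.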

The student friendships also clustered by grade, but I found this somewhat unremarkable, especially since it was so consistent for every grade. More intriguing was the factor of race, although even this network only exhibited close-knit friendships among Native American students. It is worth mentioning that this might be the case because, according to the data set’s CASOS analysis report, this high school has a majority of Native American and Hispanic students, allowing for more nodes to cluster with those particular properties.

For question 2, I ultimately found the radial axis layout most helpful for examining trends among the most “in-the-know” nodes within each cluster. Regardless of how the nodes had clustered (which I found most visible in the Force Atlas 2 graphs anyhow), the radial axis viz showed that in almost every cluster the gender of the most “in-the-know” student predicted the majority gender within the rest of the cluster, suggesting a directionality after all when considered along with the conclusions about how the genders clustered in question 1. This is to say that, of all the properties provided by the data set, gender seems to be the most significant for friendships in this school.
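The comparison between each cluster's hub and the rest of its cluster was done by eye in the radial layout, but it could be tabulated too. The sketch below assumes each node carries a `cluster` value (for instance an exported modularity class) and a `gender` value; both field names and all of the sample students are hypothetical.

```python
from collections import Counter, defaultdict


def hub_vs_majority(nodes, edges, attr="gender"):
    """For each cluster, return (hub id, hub's attribute, majority among the rest).

    The hub is the member with the highest degree. Ties, for the hub or for
    the majority value, go to whichever candidate is encountered first.
    """
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1

    members_of = defaultdict(list)
    for node_id, info in nodes.items():
        members_of[info["cluster"]].append(node_id)

    report = {}
    for cluster_id, members in members_of.items():
        hub = max(members, key=lambda n: degree[n])
        rest = Counter(nodes[n][attr] for n in members if n != hub)
        majority = rest.most_common(1)[0][0] if rest else None
        report[cluster_id] = (hub, nodes[hub][attr], majority)
    return report


# Invented example with two clusters.
nodes = {
    "a": {"gender": "F", "cluster": 0},
    "b": {"gender": "F", "cluster": 0},
    "c": {"gender": "F", "cluster": 0},
    "d": {"gender": "M", "cluster": 0},
    "e": {"gender": "M", "cluster": 1},
    "f": {"gender": "M", "cluster": 1},
    "g": {"gender": "F", "cluster": 1},
    "h": {"gender": "M", "cluster": 1},
}
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "f"), ("e", "g"), ("e", "h")]

for cluster_id, (hub, hub_value, majority) in hub_vs_majority(nodes, edges).items():
    print(cluster_id, hub, hub_value, majority, hub_value == majority)
```

Counting how often the last column is `True` would turn "almost every cluster" into an explicit fraction.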

Future Directions

I unfortunately ran out of time in class while trying to create the cluster node and edge tables, but I would be interested in seeing what a hypergraph looks like for this data set. With a resolution of 1.0, the clusters in this network are already fairly well-defined in the layouts I used, but enough clusters emerged in my work towards the new tables that I would be fascinated to see what that could offer. Similarly, if I were to work with another in-school friendship network, I would be interested in incorporating an element of time, creating separate networks for the same nodes across grades and considering how the influence of these properties changed (or not) throughout adolescence.

Information Visualization

Student work at the School of Information, Pratt Institute
