This semester I have been exploring various aspects of public health information as data visualizations. I was interested in finding a dataset to use to explore network data with Gephi, open source software that allows you to create various network visualizations with datasets.
In my search I came across this dataset on the website SocioPatterns, a Dublin-based organization that collects data on “social dynamics” and their resultant patterns. They have an incredible data visualization on infectious socio patterns based on interactions of people wearing electronic sensors at the Science Gallery in Dublin to simulate infectious transmission.
The dataset I chose to use was the result of a study of the social interactions of children in a primary school, from first through fifth grade, and their teachers. Among the data recorded were the children’s gender and grade and the frequency and duration of their face-to-face interactions.
I thought that, given the current outbreak of measles, a highly contagious disease, among school-age children, there could be something to be gleaned from the patterns of interaction, among other information. Each grade has two classes, distinguished as A and B (i.e., 2A, 2B, 3A, 3B, etc.).
SocioPatterns provided GEXF files to download that imported easily into Gephi. The dataset has its own visualization on the website:
My first observation was that the original colors are unhelpful. It seemed to me that it would be better for each grade to have its own color with different shades of that color for each class. Gephi allows you to alter the node colors in the “overview” panel under “appearance”. While you can save versions of the new colors, the old colors are embedded in the original dataset so you would have to alter that if you wanted to make a permanent change. You also can’t create a legend in Gephi, which would be useful.
I used the Gephi layout ForceAtlas 2 and changed the colors:
1A – Light pink
1B – Dark pink
2A – Light green
2B – Dark green
3A – Light yellow
3B – Dark yellow
4A – Light purple
4B – Dark purple
5A – Light blue
5B – Dark blue
Teachers – Black
This allows you to see not only how the classes relate to each other but also to more easily see how the grades relate to each other (anyone who remembers elementary school will likely remember what a difference a year made in the life of a kid).
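For anyone who would rather keep the color scheme outside of Gephi (for example, to build the legend Gephi can't), the mapping above can be written down as a small lookup. This is only a sketch: the hex values are my own guesses at "light" and "dark" shades, not the colors used in the screenshots, and the function name is made up for illustration.

```python
# Hypothetical hex values: the post names only the shades, not exact colors.
GRADE_SHADES = {
    1: ("#f8bbd0", "#c2185b"),  # pink: (light for class A, dark for class B)
    2: ("#c8e6c9", "#2e7d32"),  # green
    3: ("#fff59d", "#f9a825"),  # yellow
    4: ("#e1bee7", "#6a1b9a"),  # purple
    5: ("#bbdefb", "#0d47a1"),  # blue
}
TEACHER_COLOR = "#000000"  # teachers are black


def class_color(label: str) -> str:
    """Return the hex color for a class label such as '3B' or 'Teachers'."""
    if label.lower().startswith("teacher"):
        return TEACHER_COLOR
    grade = int(label[0])          # '3B' -> 3
    section = label[1].upper()     # '3B' -> 'B'
    light, dark = GRADE_SHADES[grade]
    return light if section == "A" else dark


# Print a plain-text legend, one line per class.
for label in ["1A", "1B", "2A", "2B", "3A", "3B", "4A", "4B", "5A", "5B", "Teachers"]:
    print(label, class_color(label))
```

Keeping one hue per grade and one shade per class in a single table makes it easy to swap in different colors later without touching the rest of the logic.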
I expected each grade to be most connected to itself and did not find this to be the case. Based on the nodes related by color I was able to see that the 3rd grade (yellows) and 5th grade (blues) classes interacted most with their same-grade peers. However, 1B and 2B were just as connected, though they are two different grades. As their teachers (if I presume that the teacher of each class interacted most with their own students) are closely connected, it’s possible that they were central in causing the kids from their respective classes to also interact more. The other 2nd grade class (2A) appears to be the most isolated overall, while the other 1st grade class (1A) seems to have had encounters with kids from many of the other grades. For the 4th grade classes it’s possible that 4B interacted more with the 3rd graders while 4A interacted with the 5th graders.
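Reading this off the picture by eye is a bit subjective, so one way to check it would be to count how many edges fall within each grade and between each pair of grades. Here is a minimal sketch of that count in plain Python. The toy nodes and edges are invented for illustration and are not taken from the SocioPatterns file, and the function names are my own.

```python
from collections import Counter


def grade_of(label):
    """'3B' -> '3'; any label not starting with a digit (e.g. 'Teachers') -> 'T'."""
    return label[0] if label[0].isdigit() else "T"


def grade_mixing(edges, class_of):
    """Count edges per unordered pair of grades.

    edges: iterable of (node_a, node_b) pairs
    class_of: dict mapping each node to its class label
    """
    counts = Counter()
    for a, b in edges:
        pair = tuple(sorted((grade_of(class_of[a]), grade_of(class_of[b]))))
        counts[pair] += 1
    return counts


# Invented toy data, only to show the shape of the inputs.
toy_classes = {"n1": "1B", "n2": "2B", "n3": "3A", "n4": "3B", "n5": "Teachers"}
toy_edges = [("n1", "n2"), ("n3", "n4"), ("n2", "n1"), ("n4", "n5")]

mixing = grade_mixing(toy_edges, toy_classes)
for pair, n in sorted(mixing.items()):
    print(pair, n)
```

A pair like ('3', '3') is a same-grade contact, while ('1', '2') is a cross-grade one, so comparing the two kinds of counts would show whether a grade really does keep to itself.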
The “appearance” panel also allows you to change the size of the nodes, scaling them by degree under “ranking”. Therefore the more connections a node has, the larger it is drawn. While kids in each class ranged from fewer to more interactions, overall 3rd graders appear to be the most interactive with both each other and other kids. It’s possible that, being the middle grade, they are the kids most likely to socialize with both older and younger kids, with a few kids who seem to be the school rock stars. However, 1A appears to have at least one kid who has as many interactions as several third graders. It’s also worth noting that 1A’s teacher had the fewest interactions of all of the teachers.
I found the edges to be less useful because there was so much overlap, so I tried the Yifan Hu layout of the same data. There the nodes were more overlapped and clustered, but the edges were more informative, especially given their proportional thickness.
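To make the idea of sizing by degree concrete, here is a rough sketch of what a ranking like this does: count each node's connections, then stretch those counts linearly between a smallest and largest node size. I'm assuming a simple linear scale and made-up size limits of 10 and 50; Gephi's own interpolation and defaults may differ, and the edge list is invented.

```python
from collections import Counter


def degrees(edges):
    """Degree of each node: the number of edges touching it."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def node_sizes(deg, min_size=10.0, max_size=50.0):
    """Map degrees linearly onto [min_size, max_size] (assumed limits)."""
    lo, hi = min(deg.values()), max(deg.values())
    span = hi - lo
    if span == 0:
        # Every node has the same degree, so every node gets the same size.
        return {node: min_size for node in deg}
    return {
        node: min_size + (d - lo) / span * (max_size - min_size)
        for node, d in deg.items()
    }


# Invented toy edges: "a" is the most connected node, "d" the least.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
deg = degrees(toy_edges)
sizes = node_sizes(deg)
for node in sorted(sizes):
    print(node, deg[node], sizes[node])
```

Note that degree counts distinct connections, not how long or how often two kids were in contact; that information lives in the edge weights, which show up as edge thickness instead.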
The other attribute from this dataset was gender. Again the embedded colors were problematic so I changed them to the obvious pink for girls and blue for boys. Unknown gender is in green.
Unfortunately it was at this point that I started having problems with Gephi. Inexplicably, the changes I made to edge features no longer carried over to the exported PNGs. So while my layout looked like this in Gephi:
The PNG looked like this:
I have no idea if I inadvertently changed a setting or if Gephi was misbehaving (there were other settings in preview that stopped working for me). However the nodes displayed by gender rather than class tell an interesting story.
My first observation is that most of the kids with the most interactions are boys, and that the most interactive boys and most interactive girls in 3rd grade interact a lot with each other. I also noticed that the most gender-segregated grade is 5th grade.
How could this all be useful in terms of public health? If one were looking for the source of a contagious disease in a school it might make sense to first look at those third grade boys and girls who interact most with each other and all of the other grades. It may be that, ironically, the more popular you are the more likely you are to actually have cooties.