Background and goals
Since 2014, I have been working as a research assistant for Linked Jazz, a Pratt Institute-based project that experiments with applying semantic web and linked open data (LOD) technologies to cultural heritage materials. The largest and most visible segment of our work is the visualization of relationships between jazz musicians. The nodes of the graph are derived from transcripts of interviews with jazz musicians, with each node representing a person interviewed or a person mentioned. Each edge represents a directional “knows of” relationship between the person interviewed and the person being mentioned in the interview. To date, ca. 52 interviews have been processed and are represented in our graph.
Over the past year, I worked with this set of names towards enriching the entire list of people with a gender attribute. This was an experiment in tapping linked open data resources via endpoints and APIs to enrich our dataset. Using Python scripts, I queried LOD resources such as VIAF, DBpedia (representing the Wikipedia dataset), and MusicBrainz to obtain a gender attribute. The details of this experiment lie beyond the scope of this lab report, but more information can be found in my blogpost on the Linked Jazz website.
A final realization of this work would be the enhancement of the current Linked Jazz tool with a gender overlay and set of filters. As a first step, however, I decided to use our Gephi lab day to create preliminary visualizations of the results of this data enrichment. I was less interested in quantifying gender division in our network than I was in representational methods that could adequately address the ways in which one might use a visualization to explore gender. I tried to think of questions a person might ask of the interviews, for example, “What does the overall distribution of gender look like in this series of interviews?” or “Do women tend to mention other women more than men mention women?” Another question might be, “Are there any men who mentioned a lot of women or conversely no women at all?” Because the current Linked Jazz visualization tool provides access to actual transcript passages, the hints provided by a gender overlay would enable researchers to click further into the source material to read precisely what was said.
A secondary concern was to also faithfully represent this as an experiment in designing automated methods for data enrichment. For example, often no gender information was found for a person using these methods, in which case, the gender was always stored as “unknown”, regardless of whether the person’s gender was actually common knowledge.
The primary inspiration for my visualization is the original Linked Jazz network visualization tool, developed by Matt Miller, the project's developer and co-director of Linked Jazz with Cristina Pattuelli. I have always appreciated that the network can be enjoyed with only a very basic introduction, or even none at all. Understanding it as an experiment in building tools on LOD foundations requires more extensive orientation, but again, that orientation is not required to discover connections and explore these oral histories in this novel way.
A second series of visualizations that I found interesting to use as a model was created by the ViDi research group at UC Davis. These visualizations represent the friendship and aggression network of students, grades 8-12. Unfortunately, it is not stated whether it represents only one school or many schools. Nevertheless, because grade distribution was an important clustering factor, I like that the first friendship network visualization shows this clustering and then the second proceeds to represent gender distribution across these clusters.
A third and final visualization that I found interesting (less for any aesthetic beauty than for some of the choices made in communicating information) was created by Marc Smith and was the subject of a Cyborgology blogpost by Whitney Erin Boesel. It represents tweets with the hashtag for the 2012 American Sociological Association meeting, with further information about the top hashtags from clustered groups. On a very immediate level, I like the use of face images to represent people in the network and that they are sized proportional to frequency. Densities in the network and nodes connected by no more than one or two edges are also easy to discern.
Preparing the data
Linked Jazz publishes most of its datasets on the website, including the nodes and edges files used by Matt Miller to create the network visualization in Gephi. In order to conform as much as possible to the existing tool for future integration, I downloaded his files to use as the base for my own.
Our list of name entities consists of over 2000 people derived from interviews. Matching on the literal person name in Matt’s CSV nodes file, I used Python to append the gender data acquired in my experiment for 1330 names to the nodes file. There were several other groups of names that justifiably needed to be identified by hand (without undermining the experiment), and these were also added to the file, bringing the total number of people with positive gender data to 1564. The only gender values represented were those acquired programmatically: ‘male’, ‘female’, and ‘unknown’. In cases where no value was acquired, ‘unknown’ was added as the default. I also wanted to include photos of the musicians, but only for the interviewees, to avoid distraction from the main focus of the visualization. I downloaded these photos from the Linked Jazz website and added the file names to my CSV file. The Gephi plugin Image Preview had to be installed in order to use images for nodes.
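The name-matching step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script I actually ran: the column names (`Id`, `Label`, `gender`), the helper `append_gender`, and the sample rows are all hypothetical. Names without a match fall back to ‘unknown’, mirroring the default described above.

```python
import csv
import io

def append_gender(node_rows, gender_by_name, name_col="Label"):
    """Return copies of the node rows with a 'gender' column added.

    Matching is on the literal name string; anything not found in the
    lookup is stored as 'unknown'.
    """
    enriched = []
    for row in node_rows:
        new_row = dict(row)
        new_row["gender"] = gender_by_name.get(row[name_col], "unknown")
        enriched.append(new_row)
    return enriched

# Hypothetical stand-in for the downloaded nodes file.
nodes_csv = io.StringIO("Id,Label\n1,Ella Fitzgerald\n2,Person B\n3,Person C\n")

# Hypothetical stand-in for the gender values gathered from LOD sources.
gender_by_name = {"Ella Fitzgerald": "female", "Person B": "male"}

rows = list(csv.DictReader(nodes_csv))
enriched = append_gender(rows, gender_by_name)

# Write the amended nodes file (to an in-memory buffer in this sketch).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Id", "Label", "gender"])
writer.writeheader()
writer.writerows(enriched)
```

Matching on literal strings is simple but brittle: any spelling variant between the two files silently ends up as ‘unknown’, which is one reason some groups of names had to be handled by hand.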
Once this preparation work was done, the amended node file and unaltered edge file were easily imported into Gephi to begin creating the network graph.
Creating the graph
The graph consists of 2006 nodes and 3646 edges and was created in Gephi using the Force Atlas algorithm. After experimenting with several versions, I realized that the graph communicated most effectively when the hubs were stretched out to show the edges and the fanning of nodes around interviewed subjects more clearly. Since the focus is gender, I changed the color of the nodes in Gephi’s Data Laboratory tab according to the gender column value: blue for ‘M’, red for ‘F’, and gray for ‘Unknown’, complying with standard coloring conventions to ensure readability. Edges were set to the color of edge targets, since the focus is gender distribution in the content of the interviews, not in our choice of which interviews to process. The outline of the nodes was set to the source color to ensure that the gender of the nodes with images could be read, and the nodes were set to be sized proportional to rank (number of edge connections). This was my preliminary graph:
This graph also can be viewed as a troubleshooting tool for the Linked Jazz project and as a snapshot of the current state of open data resources with regard to gender information. An example of the former is that the loose nodes along the periphery represent the fact that the relationships for one processed transcript are no longer represented in our graph, and that the people from that transcript now exist as orphans in our database. An example of the latter point is that many entities did not acquire gender data by means of querying these open data resources, not even some of our interviewees, as evinced by the abundance of gray lines and nodes and a few gray-outlined images.
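The styling rules used for the preliminary graph (color by gender, size by number of connections) and the orphan check can be approximated outside Gephi. The sketch below is an assumption-laden illustration, not Gephi's own logic: the function names, the tiny sample network, and the normalization of gender values by first letter (so that both ‘male’ and ‘M’ map to blue) are my own choices for the example.

```python
from collections import Counter

# First letter of the gender value -> node color; anything else is gray.
COLOR_BY_INITIAL = {"m": "blue", "f": "red"}

def node_degrees(node_ids, edges):
    """Count how many edges touch each node (as source or as target)."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    # A Counter returns 0 for nodes that never appear in an edge.
    return {node: counts[node] for node in node_ids}

def style_nodes(genders, edges):
    """Assign each node a color (by gender) and a degree (for sizing)."""
    degrees = node_degrees(genders.keys(), edges)
    return {
        node: {
            "color": COLOR_BY_INITIAL.get(gender.lower()[:1], "gray"),
            "degree": degrees[node],
        }
        for node, gender in genders.items()
    }

def orphans(styled):
    """Nodes with no edges at all, i.e. the loose nodes on the periphery."""
    return sorted(node for node, s in styled.items() if s["degree"] == 0)

# Hypothetical four-person network.
genders = {"a": "female", "b": "M", "c": "unknown", "d": "male"}
edges = [("a", "b"), ("a", "c"), ("b", "a")]

styled = style_nodes(genders, edges)
loose = orphans(styled)
```

In this sample, node "d" has no edges and is reported as an orphan, which is the same condition that produces the loose peripheral nodes in the graph.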
In order to see the gender distribution more clearly, I decided to remove the nodes.
I then used filter queries in the Overview window to show only the edges leading to targets of a given gender. The contrast in line density between blue and red is extremely clear.
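The effect of filtering edges by target gender can be mimicked on the raw edge list. This is only a sketch of the idea, not a reproduction of Gephi's filter queries; the function name and sample data are hypothetical, and the gender lookup is assumed to use the values ‘male’, ‘female’, and ‘unknown’.

```python
def filter_edges_by_target_gender(edges, genders, wanted):
    """Keep only (source, target) edges whose target has the wanted gender.

    Targets missing from the lookup are treated as 'unknown'.
    """
    return [
        (source, target)
        for source, target in edges
        if genders.get(target, "unknown") == wanted
    ]

# Hypothetical sample network.
genders = {"a": "female", "b": "male", "c": "female", "d": "unknown"}
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("b", "a")]

female_edges = filter_edges_by_target_gender(edges, genders, "female")
male_edges = filter_edges_by_target_gender(edges, genders, "male")
```

Comparing `len(female_edges)` with `len(male_edges)` gives a crude numeric counterpart to the difference in line density between the red and blue graphs.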
In the last graph representing interviewees (photo nodes) of different genders mentioning women (the red lines), it is easy to see who mentioned many women and who mentioned one or none (with the caveat that some may only look like one or none and actually be more, if any of the mentioned people of “unknown” gender turn out to be women). On the other end of the spectrum, this loose graph also allowed me to identify denser intersections that represent a person who is mentioned by many people. I identified three large intersections in the mid-section of the graph and a fourth, lesser intersection in the lower half.
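Both readings of the filtered graph (who mentioned many or no women, and which women were mentioned by many people) correspond to simple counts over the edge list. The following is a hedged sketch with hypothetical names and data; it assumes every source in the edge list is an interviewee, and the same caveat about ‘unknown’ genders applies to its output.

```python
from collections import Counter

def mention_counts(edges, genders, wanted="female"):
    """Summarize mentions of people with the wanted gender.

    Returns a pair:
      - per interviewee, how many people of the wanted gender they mentioned
        (including interviewees who mentioned none);
      - the mentioned people of that gender, ranked by number of mentions.
    """
    mentioned_by = Counter()  # interviewee -> count of wanted-gender targets
    mentions_of = Counter()   # mentioned person -> count of incoming edges
    sources = set()
    for source, target in edges:
        sources.add(source)
        if genders.get(target, "unknown") == wanted:
            mentioned_by[source] += 1
            mentions_of[target] += 1
    per_interviewee = {source: mentioned_by[source] for source in sources}
    return per_interviewee, mentions_of.most_common()

# Hypothetical sample: three interviewees (i1-i3) and three mentioned people.
genders = {
    "i1": "male", "i2": "female", "i3": "male",
    "v1": "female", "v2": "female", "m1": "male",
}
edges = [("i1", "v1"), ("i2", "v1"), ("i3", "m1"), ("i1", "v2"), ("i2", "m1")]

per_interviewee, ranked = mention_counts(edges, genders)
```

Here "i3" shows up with a count of zero, the tabular equivalent of a photo node with no red lines, and "v1" tops the ranking in the way a dense intersection stands out in the graph.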
When I inspected the nodes on each of those intersections, I found they were all well-known female jazz vocalists. By moving the filtered graph into a new workspace, I used further filtering and selection to add labels to those nodes. In order of ranked mention, those nodes are Sarah Vaughan, Ella Fitzgerald, Billie Holiday, and Lena Horne, as seen here (name label size relates to rank).
As expressed above, the ideal final version of these visualizations would be integrated as a tool on the current Linked Jazz platform, allowing users such as jazz researchers and digital humanists to easily explore the Linked Jazz social network and interview data through the lens of gender. If this method proves successful, additional attributes could be added to the name nodes, like instrument played, date of birth, and places lived.
In the meantime, however, the main objective will be to create an interactive prototype represented by the visualizations shown here, possibly by using the Gephi plugin Sigmajs Exporter. The main interactive functionality to implement would be gender filtering on both the interviewees and people mentioned, as well as the ability to see the name labels for nodes/intersections of interest on mouseover.