Introduction
With this project, I explored a biological network of human diseases and disease genes. I approached the visualization mainly with an exploratory eye, but I was loosely looking for correlations between classes of disease, as well as ties between diseases and their related disease genes. As a layman, I found the former much easier to evaluate.
Materials
The dataset that I used for this project is titled “Diseasome” (originally authored by Goh K-I, Cusick ME, Valle D, Childs B, Vidal M, Barabási A-L), and I retrieved it from the Gephi Wiki. The zipped file can be found here: GEXF. Since the data was already formatted for Gephi, I did not need to clean or change it in any way before working with it in Gephi 0.9.1. Later, however, I attempted to create a hypergraph in Gephi, which required some data work in Excel.
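For readers who want to inspect a GEXF file outside of Gephi, here is a minimal sketch using only Python's standard library. The embedded sample is a made-up two-node graph, not the Diseasome data, and the helper names (`local_name`, `read_gexf`) are hypothetical. It assumes the usual GEXF layout of `node` elements with `id` and `label` attributes and `edge` elements with `source` and `target` attributes, and it ignores the XML namespace, since that differs between GEXF versions.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general GEXF shape (NOT the Diseasome data).
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected">
    <nodes>
      <node id="1" label="Disease A"/>
      <node id="2" label="Gene X"/>
    </nodes>
    <edges>
      <edge id="0" source="1" target="2"/>
    </edges>
  </graph>
</gexf>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_gexf(text):
    """Return ({node_id: label}, [(source_id, target_id), ...])."""
    root = ET.fromstring(text)
    nodes = {}
    edges = []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "node":
            nodes[element.get("id")] = element.get("label")
        elif name == "edge":
            edges.append((element.get("source"), element.get("target")))
    return nodes, edges


nodes, edges = read_gexf(SAMPLE_GEXF)
```

For a file on disk, the same function could be fed the file's contents after unzipping the download.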
Methods & Discussion
After opening the dataset in Gephi, the first step was to check the “context” for data accuracy. Everything looked as it should, so I began experimenting with different layouts. All of the visual iterations of this dataset that I came across (and there were many) used a force-directed layout, which was also Gephi’s automatic recommendation, so it seemed obvious to move forward with this design.
First, I toyed with the “Force Atlas” and “Force Atlas 2” layouts, but I was not satisfied with them: they produced very spread-out designs with lots of white space and yet many tight, overlapping elements. I briefly tried moving individual nodes to create better spacing, but in the end I switched the layout to “Fruchterman Reingold” and found the spacing much more pleasing.
At this point, I ran some statistics. I started by running and experimenting with “modularity,” for which I decided on a setting slightly below the standard (~0.8). Then I ran “graph density” for a directed graph (0.002) and “network diameter” (15).
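To make two of these statistics more concrete, here is a small sketch of how density and diameter can be computed by hand. It uses a made-up four-node path graph rather than the Diseasome data, and the function names are hypothetical. It assumes the common definition of directed density, edges divided by n(n - 1), and, as a simplification, it follows edges in both directions when measuring path lengths. Gephi's own implementation may differ in details.

```python
from collections import deque


def directed_density(num_nodes, num_edges):
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


def diameter(adjacency):
    """Longest shortest-path length over all reachable pairs (unweighted BFS)."""
    longest = 0
    for start in adjacency:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        longest = max(longest, max(distance.values()))
    return longest


# Toy path graph a - b - c - d (illustrative only).
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
toy_nodes = sorted({node for edge in toy_edges for node in edge})

# Assumption: edges are walked in both directions for path lengths.
adjacency = {node: set() for node in toy_nodes}
for source, target in toy_edges:
    adjacency[source].add(target)
    adjacency[target].add(source)

toy_density = directed_density(len(toy_nodes), len(toy_edges))  # 3 / (4 * 3)
toy_diameter = diameter(adjacency)  # longest shortest path: a to d
```

A density as low as the 0.002 reported above means that only a tiny fraction of all possible node pairs are actually linked, which fits the sparse, clustered look of the network.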
When it came time to play with colors, sizes, and labels, I approached the design knowing that my biggest challenge would be overcrowding, especially once labels were incorporated. As exemplified by the visualization below, I saw no shortage of chaos when looking at previous work with this dataset.
For the most part, I noticed that the attempted solution was to include a legend in order to reduce labeling; in practice, however, I found this very difficult to follow.
On the other hand, one approach that I did appreciate in other network examples was the choice to employ a black background, with white labels. Maybe this is just personal aesthetic, but I felt that this design better differentiated the labels from the nodes, and helped my eye to focus on more of the network’s details.
Starting with what I learned from this last example (while using the first two examples as my “what not to do” guide), I first changed the color of the background to black and the color of the node labels to white. Then I set the labels to “proportional size” and worked on the font. At a font size of 5, it seemed that much was lost, so I tried 8, but that looked too hectic, so I settled back on 5. Next, I began fiddling with the options for coloring nodes by attribute. Coloring the nodes by “type” was impactful, as it showed a strong distinction between the “disease” nodes and the “disease gene” nodes.
Ultimately, however, I was more interested in the visualization produced by coloring the nodes by “disease class.” Here, many clear relationships emerged: disease classes were grouped, and links between classes became evident (e.g., obvious, direct links like “Obesity” to “Asthma,” as well as more curious correlations, like “Cardiomyopathy” to “Deafness,” which inspired me to do some basic reading via Google Scholar).
Since this visualization allowed me to observe relationships and develop questions based on what I saw, I felt satisfied and decided to move on to the creation of a hypergraph.
To do this, I exported the node and edge tables from Gephi into Excel, where I first created a new edge table by replacing the “source” and “target” values with cluster numbers. Then I created a new node table that expressed the size of each cluster. After importing these new tables back into Gephi and beginning to manipulate the result, I ran into a technical problem: after many attempts at troubleshooting, and far too much time, I still could not change the sizes of the nodes in the visualization. For this reason, I decided to take a break from the unfinished hypergraph.
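The table work described above could also be scripted. The following is only a sketch under stated assumptions, not the Excel procedure I actually followed. The node IDs and cluster numbers are invented, the function name `collapse_to_clusters` is hypothetical, and I assume that edges falling inside a single cluster are dropped and that parallel edges between two clusters are counted as a weight.

```python
from collections import Counter

# Invented example data: node ID -> cluster number (e.g. a modularity class).
node_cluster = {"n1": 0, "n2": 0, "n3": 1, "n4": 1, "n5": 2}
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n4", "n5")]


def collapse_to_clusters(node_cluster, edges):
    """Build cluster-level node sizes and weighted cluster-to-cluster edges."""
    # New node table: one row per cluster, sized by its member count.
    cluster_sizes = Counter(node_cluster.values())

    # New edge table: source and target replaced by cluster numbers.
    cluster_edges = Counter()
    for source, target in edges:
        source_cluster = node_cluster[source]
        target_cluster = node_cluster[target]
        if source_cluster != target_cluster:  # assumption: skip intra-cluster edges
            cluster_edges[(source_cluster, target_cluster)] += 1
    return cluster_sizes, cluster_edges


cluster_sizes, cluster_edges = collapse_to_clusters(node_cluster, edges)
```

The two resulting tables correspond to the new node table (cluster sizes) and the new edge table (cluster-to-cluster links) that I built by hand. Writing them out as CSV files would be one way to get them back into Gephi.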
Future Directions
If I were to continue with this project, I would definitely like to do more work on the hypergraph to see what trends may materialize. I would continue to troubleshoot the node sizes, manipulate the colors, add and design labels, etc. Aside from the hypergraph, I would also like to continue with the first visualizations that I made. I think that these allow for the development of more pointed research questions, and in pursuit of such questions, I could zoom in on a particular relationship, or collection of relationships, within the network to take a closer look at smaller clusters, as well as the links between disease genes. Currently, the disease genes can barely be analyzed, as they are far too small to be labeled and my knowledge of each gene is limited, but with more time, this data could be made more valuable.