Medical advances have shaped the past and will continue to determine the future. We celebrate new discoveries, but lack the understanding to appreciate all that we do not know. Looking back at history, germ theory, now considered common sense, only gained widespread acceptance in the late 19th century. Additionally, many innovations we take for granted (antibiotics and contraceptive pills, for example) are as recent as the 20th century, a mere two to three generations ago. Acknowledging how new most of our medical understanding is, I can’t help but wonder: how much are we still getting wrong?
The mapping of the human genome has helped to fill in some of these blanks, and could one day prove to be our greatest medical achievement. Linking genes and diseases opens the door to discoveries not only in treatments and cures, but also in prevention. Advancements are already being made, as acknowledged by the 2020 Nobel Prize in Chemistry, awarded to two scientists who developed CRISPR, a method for genome editing. Future breakthroughs are inevitable.
Not having a medical background, but still intrigued by this relatively new understanding of the human genome, I was excited to discover Diseasome, a network of linked genes and diseases, formed from a 2007 dataset. The data reveals both disease-gene associations and subsequently which diseases are likely to develop concurrently.
I opted to explore the data using network analysis, as I was not quite sure what I was looking for and open to unexpected discoveries.
As someone new to working with network data, I used the open-source software Gephi for my analysis and visualization. Although mastering Gephi would take a great deal of time (and, I suspect, a better understanding of statistics), it is possible to download the program and start exploring relatively quickly. Gephi generates some statistical data and offers a variety of filtering tools. Setting the layout to ForceAtlas 2 allows for further refining of the visualization.
Finding a usable dataset proved to be more challenging than getting familiar with Gephi. Many links to data had since died, datasets were too large for my computer to handle, files were in formats I was unable to open, or my unfamiliarity with network analysis rendered the data too difficult to even attempt to understand. Ultimately, I discovered Diseasome on a GitHub post dedicated to Gephi sample datasets.
Lastly, I exported my visualizations from Gephi as SVGs and refined the colors, line weights and type using Adobe Illustrator.
The entire dataset consisted of 1,419 nodes and 3,926 edges.
I used Gephi to run some statistics:
• Average Degree: 2.767 – each node is connected to almost 3 other nodes
• Network density: 0.002 (about 0.2%) – very low, implying that only a tiny fraction of all possible connections between data points actually exist
• Network diameter: 15 – the two furthest-apart connected nodes are 15 connections away from each other along the shortest route
• Modularity: 0.87 – high, meaning there are dense connections between the nodes within a module but sparse connections between nodes in different modules
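For readers who want to see where the first two figures come from, here is a minimal Python sketch of the standard formulas. It is not how Gephi computes them internally. It assumes the graph is treated as directed, which is my guess based on the reported average degree matching edges divided by nodes. The function names are my own.

```python
def average_degree(nodes: int, edges: int, directed: bool = True) -> float:
    """Average number of edges per node.

    Assumption: in a directed graph each edge is counted once per node
    (edges / nodes); in an undirected graph each edge touches two nodes.
    """
    return edges / nodes if directed else 2 * edges / nodes


def density(nodes: int, edges: int, directed: bool = True) -> float:
    """Share of all possible connections that actually exist (0 to 1)."""
    possible = nodes * (nodes - 1)  # ordered pairs of distinct nodes
    if not directed:
        possible /= 2  # unordered pairs
    return edges / possible


# Node and edge counts of the full Diseasome network, as reported above.
NODES, EDGES = 1419, 3926

print(round(average_degree(NODES, EDGES), 3))  # 2.767
print(round(density(NODES, EDGES), 3))         # 0.002
```

Under the undirected reading, both numbers would simply double.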
I started my visual exploration by looking at all of the data, coloring on modularity class. The resulting network proved too unwieldy to be meaningfully viewed in a static graphic. This view did, however, reveal that the diseases comprised the larger nodes, while the genes were much smaller offshoots, inspiring me to filter out the genes and look at only diseases. Admittedly, this path diverged from my initial intention, but having a limited understanding of genetics, focusing on the diseases seemed like the more logical course of action.
The disease data, consisting of 516 nodes and 2,376 edges, was more manageable to display in a visualization. I used ForceAtlas 2’s strong gravity feature to pull modules closer together, while preventing overlap for easier legibility. Color was determined by ranking nodes on degree. I used a red / orange color palette as it is reminiscent of blood cells. The colors and white text popped clearest against a dark navy blue background. I maintained the same color palette and styling for all my graphics, as they were based on the same dataset.
I now had a sense of which diseases were most interconnected, represented by the larger and more vividly colored nodes in the above visualization. In order to take a closer look at one such disease, diabetes mellitus, I filtered the data as an ego network, focusing on all diseases and genes with a direct connection to it. I ranked and colored by degree, using ForceAtlas 2’s LinLog mode to space out the nodes and make labeling easier. I removed the gene labels, as they were the smallest nodes and, to someone with limited genetic understanding, the least meaningful.
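To make the idea of an ego network concrete, here is a small Python sketch of what the filter does conceptually: keep one central node, everything directly connected to it, and the edges among those nodes. This is not Gephi's implementation. The edge list is a made-up toy example, and names such as GENE_A, GENE_B and disease_X are placeholders rather than entries from the Diseasome data.

```python
from collections import defaultdict


def ego_network(edges, center, depth=1):
    """Return (nodes, edges) within `depth` hops of `center`.

    Edges are treated as undirected for the purpose of finding neighbors.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    kept = {center}
    frontier = {center}
    for _ in range(depth):
        # Step one hop outward, ignoring nodes we already kept.
        frontier = {n for node in frontier for n in neighbors[node]} - kept
        kept |= frontier

    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges


# Hypothetical toy data, for illustration only.
toy_edges = [
    ("diabetes mellitus", "obesity"),
    ("diabetes mellitus", "hypertension"),
    ("diabetes mellitus", "GENE_A"),
    ("obesity", "hypertension"),
    ("hypertension", "disease_X"),
    ("disease_X", "GENE_B"),
]

ego_nodes, ego_edges = ego_network(toy_edges, "diabetes mellitus")
print(sorted(ego_nodes))
print(len(ego_edges))
```

In this toy graph, disease_X and GENE_B are two or more steps away from the center, so a depth of 1 leaves them out.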
The resulting visual showed some expected links to diabetes: obesity, hypertension, and myocardial infarction. More surprising connections were also captured: rheumatoid arthritis, leprechaunism and the progression of SARS.
When visualizing all diseases, I noted that most cancers seemed to be closely connected. For my next graphic, I filtered on cancer, to find out if all cancers shared links. I discovered five cancers shared no connections to the main cancer module, denoted in the below visualization with blue text. Most interesting, two of these solo cancers were connected to one another – a form of skin cancer and a form of brain cancer.
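The check for cancers that sit outside the main module can be thought of as finding connected components: groups of nodes that can reach one another but nothing else. Below is a minimal Python sketch of that idea, not the steps I took in Gephi. The cancer names and links are invented placeholders rather than the real data.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components, largest first."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical toy data, for illustration only.
cancers = ["cancer_A", "cancer_B", "cancer_C", "cancer_D", "cancer_E", "cancer_F"]
links = [
    ("cancer_A", "cancer_B"),
    ("cancer_B", "cancer_C"),
    ("cancer_A", "cancer_C"),
    ("cancer_E", "cancer_F"),
]

main_module, *outside = connected_components(cancers, links)
print(sorted(main_module))
print([sorted(group) for group in outside])
```

Here cancer_E and cancer_F form their own pair, much like the skin and brain cancers, while cancer_D stands entirely alone.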
There are endless connections to be discovered by exploring the Diseasome data. I was intrigued to see two disparate cancers (brain and skin) sharing their own separate module, and am now curious whether there are other standalone partnerships to explore. A better understanding of the human genome, as well as more experience with network data, would allow for more extensive analysis.
I am curious to pursue more network analysis. I was discouraged by the majority of network data I found online and would next like to work with data that I at least played some role in collecting. Having more familiarity with the dataset will help me to better understand the explorations that Gephi reveals to me.