According to the World Health Organization (WHO), cancer is the second leading cause of death globally, with an estimated 9.6 million deaths in 2018 alone (WHO, 2018). For many people, cancer is no longer an obscure disease that they read about in the news; it is a common disease whose cases range from human organs to blood vessels. People are often told that cancer spreads readily within the body: a patient originally diagnosed with colon cancer can have the disease spread to other parts of the body. However, for a non-medical professional, it is difficult to understand exactly how the different types of cancer and cancer-related genes are interrelated. Which cancer is most highly associated with another? Are there any “rare” types of cancer? How exactly are diseases and genes interrelated? This visualization project aims to provide some insight into these questions through an examination of network data of human diseases.
Two visualizations were referenced prior to designing the cancer disease-gene network view.
The first reference was Hu et al.’s (2013) research on mapping nervous system diseases at the molecular level. Although I liked the use of different shapes to indicate whether the nodes were major molecular substances or diseases, I found it difficult to draw any meaningful insight without medical knowledge. If the color scheme had not forced a gradient within each node as well as across clusters, it might have been easier to see a pattern across the network. However, I was able to see a more interesting clustering in the research’s other graph, which shows the interrelationships between recent subjects and major molecular substances. Filtering the nodes to include only recent values allowed viewers to grasp more easily whether certain nodes were associated. The reduced number of nodes and smaller maximum node size also allowed for legible labels.
The second reference was a visualization of the Human Disease Network in an article from Quanta Magazine (Greenwood, 2015). Unlike the previous reference, despite the numerous human diseases shown, viewers with no understanding of pathology could understand the general theme of the disorder classes shown. The color coding of the clusters (disorder class) further aided understanding of the network associations (e.g. how cancer-related and ophthalmological diseases are closely connected within their own class, while muscular diseases are associated across multiple disorder classes). I did think that the network could have benefited from further clustering, or from a darker background so that viewers could better differentiate the disorder classes shown in lighter colors (i.e. endocrine, immunological, unclassified).
Gephi was used to import, edit, and visualize the diseases dataset. Gephi is free and open source. It can be demanding on an ordinary laptop processor and is not the most user-friendly tool; however, it has many features that allow users to create diverse types of network maps and graphs. The software also allows users to directly import .csv (and other) data files and to edit the dataset directly in its Data Laboratory module.
The data used came directly from Gephi’s wiki page of free datasets. The wiki had a pre-formatted GEXF file, ‘Diseasome’, which contains network data of human diseases/disorders and genes. The original dataset comes from a study on human diseases by Goh et al. (2007).
Research and Visualization Methodology
I downloaded the ‘Diseasome’ dataset directly from Gephi’s wiki of datasets. Because it was already a GEXF file that Gephi can open without any import configuration, I opened the downloaded file as a new project in Gephi. In the initial overview of the dataset, I found the number of nodes to be too large for the project’s focus on cancer-related diseases and genes, so I used the filter function to keep only the nodes whose disease class attribute equals ‘Cancer.’ I wanted to verify that there were enough nodes and edges before deleting any nodes in the data table. This was the case: the filtered table contained 88 nodes and 548 edges, a size that should allow viewers to read the labels without being overwhelmed when the nodes are clustered. I then deleted the nodes whose disease class was not ‘Cancer’ from the data table so that the statistics would be calculated on the filtered dataset and not on the whole ‘Diseasome’ table.
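The filtering step can also be sketched outside Gephi. The following is a minimal illustration in plain Python, not the Gephi workflow itself; the toy node table, the attribute key `disease_class`, and the helper `filter_by_class` are hypothetical names chosen for this example and are not taken from the Diseasome file.

```python
def filter_by_class(nodes, edges, wanted):
    """Keep nodes whose class equals `wanted`, plus edges between kept nodes."""
    kept = {
        node_id
        for node_id, attrs in nodes.items()
        if attrs.get("disease_class") == wanted
    }
    # An edge survives only if both of its endpoints survive the filter.
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges


# Hypothetical toy data for illustration only.
nodes = {
    "node_a": {"disease_class": "Cancer"},
    "node_b": {"disease_class": "Cancer"},
    "node_c": {"disease_class": "Muscular"},
}
edges = [("node_a", "node_b"), ("node_a", "node_c")]

kept_nodes, kept_edges = filter_by_class(nodes, edges, "Cancer")
print(len(kept_nodes), len(kept_edges))
```

Dropping the edge to `node_c` along with the node mirrors why the statistics change after filtering: they are computed only over what remains.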
Prior to styling the visualization, I used the Statistics tool in the Overview module to find the average degree, network diameter, graph density, and modularity. Then I changed the visualization’s layout. I selected Force Atlas 2 from the dropdown menu since I wanted to have all the cancer-related nodes in a plane and show clustering among associated nodes (i.e. where there is high attraction by edges). Since the dataset contained two modes (diseases and genes), I thought it would be best to use a force-directed display rather than an arc diagram, radial diagram, or matrix, which I believe are more apt for a single-mode dataset. I then edited the appearance of the graph, coloring the nodes by modularity. I noticed that the default color scheme assigned similar shades of grey to two clusters, so I chose the second preset palette, which gave each cluster a color distinct enough for users to tell the nodes apart at a glance. Node size was ranked by degree so that diseases with the greatest number of associations (edges) would be larger, perhaps providing an interesting insight for viewers.
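Ranking node size by degree can be sketched as follows. This is an illustrative approximation under stated assumptions: it interpolates linearly between a minimum and a maximum size, which may differ from how Gephi interpolates internally, and the function name `degree_sizes`, the size range, and the toy edge list are all hypothetical.

```python
from collections import Counter


def degree_sizes(edges, min_size=10.0, max_size=50.0):
    """Map each node's degree linearly onto the range [min_size, max_size]."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    if not degree:
        return {}
    lowest, highest = min(degree.values()), max(degree.values())
    span = highest - lowest
    sizes = {}
    for node, d in degree.items():
        # If every node has the same degree, all nodes get the minimum size.
        fraction = (d - lowest) / span if span else 0.0
        sizes[node] = min_size + fraction * (max_size - min_size)
    return sizes


# Hypothetical toy edges: "a" has degree 3, "b" and "c" degree 2, "d" degree 1.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
sizes = degree_sizes(toy_edges)
print(sizes)
```

The most connected node receives the largest size, which is what makes hubs stand out visually in the final network.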
When zooming in and out of the graph, I noticed that there were a couple of isolates far from the center. To keep viewers from missing these nodes, I set the gravity to 10 in the layout to bring them closer to the main view. Some isolates were manually dragged closer to the center. I also enabled the Dissuade Hubs option and used the Prevent Overlap function to avoid an overly compact view in the center. In the Preview mode, I adjusted the font size of the labels to make them more legible.
After confirming the final visualization in Preview, I exported it both as a PDF (static image) and through the Sigma.js export, so that users can either view a static image or explore an interactive network with search, hover, click, and fade functions.
The resulting network of cancer disorder-gene association made in Gephi can be found below:
The statistical analysis of the dataset provided the following results:
- Total # Nodes: 88
- Total # Edges: 548
- Average Degree: 6.2
- Network Diameter: 6
- Graph Density: 0.07
- Modularity: 0.5 with 10 communities
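As a quick sanity check, the average degree and graph density can be recomputed from the node and edge counts alone. The sketch below is an inference rather than something stated in the Gephi report: the listed values are consistent with the graph being treated as directed (average degree as edges per node, density as edges over n(n-1)). The function names are hypothetical.

```python
def average_degree(n_nodes, n_edges, directed=True):
    """Edges per node if directed; twice that if each edge counts for both ends."""
    if directed:
        return n_edges / n_nodes
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges, directed=True):
    """Fraction of possible edges that are present (no self-loops)."""
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible /= 2
    return n_edges / possible


print(round(average_degree(88, 548), 1))  # 6.2
print(round(density(88, 548), 2))         # 0.07
```

Under the undirected convention the same counts would give roughly 12.5 and 0.14, so the convention matters when comparing these figures with other networks.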
The resulting visualization provides a couple of interesting insights (though a cancer expert would likely see more). Although most cancer disorders and genes appear related gene-to-gene, gene-to-disorder, or disorder-to-disorder, some isolated groups of nodes appear to be associated only with one another, such as ‘basal cell carcinoma’, ‘medulloblastoma’, and ‘squamous cell carcinoma’. This could mean that not all cancer-related genes and diseases are linked to other forms of cancer. Moreover, the larger nodes indicate that some diseases are more highly related to others, potentially indicating a higher risk of disease transfer or spread. The image below shows how colon cancer is one of the largest nodes (highest degree) and how it ties to multiple nodes across clusters. Other large nodes such as breast cancer, prostate cancer, and gastric cancer were interesting because these three, alongside colon cancer, are among the leading types of cancer around the world according to the WHO (2018).
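Isolated groups like these correspond to small connected components of the network. A minimal sketch of how such groups could be found programmatically is given below, using a breadth-first search over an undirected view of the edges. The function name and the toy edge list are hypothetical and do not come from the Diseasome data.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return components as sets of nodes, largest first, ignoring direction."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        queue = deque([start])
        seen.add(start)
        group = set()
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(group)
    return sorted(components, key=len, reverse=True)


# Hypothetical toy edges: one main group of four nodes and one isolated trio.
toy_edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("x1", "x2"), ("x2", "x3")]
components = connected_components(toy_edges)
print([sorted(group) for group in components])
```

Any component other than the largest one is a candidate for the kind of isolated group described above.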
I believe that the visualization would have benefited from distinguishing the genes and disorders by using different shapes for the nodes (similar to Hu et al.’s visualization). The data could have been cleaned up further if there were an Excel version of the file, so that a column distinguishing disorders from genes could be added as another attribute, though this would most likely require assistance from a medical professional.
Moreover, custom color-blind-friendly colors could be used to increase the accessibility of the visualization. Although I exported the more interactive and searchable Sigma.js version of the chart, a color-blind user who only had access to my static PDF visualization could have a hard time understanding the associations without reading all of the labels. A usability peer review also suggested that the visualization could have benefited from a lighter node outline instead of a darker one to minimize distraction; a darker outline may be more useful on a white background.
This visualization of the cancer gene-disease network can be useful for users who have always been interested in the interrelatedness of different types of cancer. The network can be taken as an initial guide to understanding whether some disorders are, in fact, not associated at all with other cancer-related nodes (or, conversely, whether some disease genes are particularly highly connected to others). With more insight from a medical professional or contextual research into disease ontology and pathology, I believe the visualization could be strengthened to provide valuable insights about the human disease network and perhaps help researchers think of ways in which the spread of diseases to other parts of the body can be countered or prevented. Adding further attributes such as fatality rate or speed of spread could provide more insight into human disease tracking.
Cancer. (2018, September 12). Retrieved October 23, 2020, from https://www.who.int/news-room/fact-sheets/detail/cancer
Greenwood, V. (2015, January 29). Disease Networks Show Molecular Connections. Quanta Magazine. Retrieved October 23, 2020, from https://www.quantamagazine.org/disease-networks-show-molecular-connections-20150129/
Hu, X., Zhao, D., & Strotmann, A. (2013, June 25). Mapping Molecular Association Networks of Nervous System Diseases via Large-Scale Analysis of Published Research. Retrieved October 23, 2020, from https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0067121
Goh, K.-I., Cusick, M. E., Valle, D., Childs, B., Vidal, M., & Barabási, A.-L. (2007). The Human Disease Network. Proc Natl Acad Sci USA 104:8685-8690. Dataset retrieved October 21, 2020, from http://gephi.org/datasets/diseasome.gexf.zip
Vardimon, H. (n.d.). Portfolio. Retrieved October 23, 2020, from https://hagarvardimon.com/