Visualization Analysis of Diseases and Genes

Introduction

Modern medical research has shown that human diseases are related to genes, directly or indirectly, and the cause of most diseases can be traced to genes. There are two main types of gene-related diseases. The first category is hereditary disease, which includes more than 4,000 types that are inherited from parents through genes. The second category is common disease, such as cardiovascular disease, diabetes, and various types of cancer, which results from the interaction of multiple genetic and environmental factors.

Although I have no background in medical research, I am still interested in discovering the relationship between genes and diseases. Inspired by several visualization analyses of similar topics, I decided to explore this topic via network analysis in order to gain an in-depth understanding of the patterns behind the data. The dataset I used is Diseasome, a network that reveals both disease-gene associations and associations among diseases.

When I started to analyze the dataset, I had not yet settled on a research direction, because such a large amount of data needs to be explored from various perspectives. Instead, I considered several possible directions, including disease-gene associations, connections among diseases, and the disease-causing genes of specific diseases.

Inspiration

I referred to multiple visualizations to gain an overall understanding of network analysis. The two most important projects were Connect the Diseases (by Diane Ferrera) and Genes and Disorders: Cancer (by Ji Hee Yoon).

The first visualization uses contrast in node color and size to show how interconnected the diseases are, and also focuses on specific diseases to produce ego-network visualizations. To prevent information clutter, its labels were added manually, which is necessary for creating a clear and well-organized visualization. The second visualization made me realize the importance of concentrating on certain topics or features: flexible use of filters and adjustment of node size make the visualization more legible and insightful.

Materials

I used Gephi, an open-source visualization and exploration software for many kinds of graphs and networks. It takes time to master the fundamental skills of Gephi, but the tutorial videos helped a lot in quickly grasping its internal structure. After some initial exploration, a beginner can discover Gephi's strong capabilities and its friendliness toward researchers who do not code.

The dataset “Diseasome” came from the free datasets listed on Gephi’s wiki. It is pre-formatted for network analysis and can be imported directly into Gephi. For revising and labeling work, I used Adobe Photoshop, a widely used image editing software, to add labels and annotations.

Analysis

There are 1,419 nodes and 3,926 edges in total in the dataset. The original visualization created automatically by Gephi shows tightly packed nodes and edges. Although multiple colors distinguish the types of disease across which the genes are distributed, it is still far from showing explicit connections among the numerous objects.

Following the instructions in the tutorial, I tried multiple layouts to give the modules a structured form that distinguishes priorities. The layouts I used included ForceAtlas 2, Noverlap, OpenOrd and Expansion, each of which plays a pivotal role in a different dimension.

After several attempts, the final version of the visualization displays a relatively reasonable structure, although it still needs further work to mine more useful information. By sizing nodes by degree, the contrast among diseases became clearer and the overall legibility was greatly improved. Several failed attempts made me realize that neither a dense nor a uniform distribution can show the real relationships in this dataset.

Another difficulty I met was the lack of an undo function, so I had to keep separate Gephi files for different attempts to avoid overwriting key versions during repeated operations.

1. Disease and Gene Types

From an overall perspective, there are two types of nodes: diseases and disease genes. The first visualization uses two colors to distinguish these two types. The light blue nodes stand for diseases, and the red nodes stand for genes. The size of each disease node is determined by the number of genes participating in that disease (this is the “degree” parameter, one of the key parameters of the dataset).
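The sizing rule can also be sketched outside Gephi. The snippet below is only an illustration under stated assumptions: the edge list is hypothetical (not taken from the Diseasome file), and `disease_degrees` and `node_sizes` are helper names invented for this example, not Gephi functions. It counts the genes attached to each disease and maps that degree linearly onto a size range, which is roughly what sizing nodes by degree does.

```python
from collections import defaultdict

# Hypothetical disease-gene edges for illustration only;
# these are NOT records from the Diseasome dataset.
edges = [
    ("Disease A", "GENE1"),
    ("Disease A", "GENE2"),
    ("Disease A", "GENE3"),
    ("Disease B", "GENE2"),
    ("Disease C", "GENE4"),
]


def disease_degrees(edge_list):
    """Return {disease: number of distinct genes linked to it}."""
    genes_by_disease = defaultdict(set)
    for disease, gene in edge_list:
        genes_by_disease[disease].add(gene)
    return {disease: len(genes) for disease, genes in genes_by_disease.items()}


def node_sizes(degrees, min_size=10.0, max_size=50.0):
    """Map each degree linearly onto the range [min_size, max_size]."""
    lo, hi = min(degrees.values()), max(degrees.values())
    span = (hi - lo) or 1  # avoid division by zero when all degrees are equal
    return {
        node: min_size + (deg - lo) / span * (max_size - min_size)
        for node, deg in degrees.items()
    }


degrees = disease_degrees(edges)
sizes = node_sizes(degrees)
```

In this toy input, "Disease A" has three genes and therefore receives the largest size, while the single-gene diseases receive the smallest.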

2. Disease Association Degree

When gene nodes are excluded, diseases can be analyzed separately. The second visualization shows the degree of association between diseases and genes. Both the size and color of the nodes are proportional to the number of genes participating in each disease. The visual contrast shows that Colon cancer, Deafness and Leukemia have the largest numbers of associations, followed by Diabetes mellitus, Breast cancer, Retinitis pigmentosa, etc.

3. Disease Categories

The third visualization distinguishes disease categories with different colors. All nodes are colored by the disease class to which they belong. The visualization shows that the Cancer category contains the largest number of diseases, followed by Neurological, Multiple, Ophthamological, etc.
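The ranking of categories by size amounts to counting diseases per class. A minimal sketch, assuming the disease classes have been exported into a mapping from disease name to class; the mapping below is made up for illustration and does not reproduce the real class sizes.

```python
from collections import Counter

# Hypothetical disease -> class mapping; the real dataset stores the
# class as a node attribute, and these entries are NOT taken from it.
disease_class = {
    "Disease A": "Cancer",
    "Disease B": "Cancer",
    "Disease C": "Neurological",
    "Disease D": "Cancer",
    "Disease E": "Metabolic",
    "Disease F": "Neurological",
}

# Count how many diseases fall into each class, largest class first.
class_counts = Counter(disease_class.values())
ranked = class_counts.most_common()
```

Here `ranked` lists the classes from largest to smallest, the same ordering that the color legend conveys visually.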

According to the dataset's default rule, two diseases are connected when they share one or more genes. From this visualization, the degree of association among diseases can be inferred: Cancer diseases have the closest interconnection, while Neurological, Multiple, Metabolic and other diseases have looser interconnections.
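The shared-gene rule can be made concrete with a small projection from disease-gene pairs to disease-disease links. This is a sketch under assumptions: the edges are hypothetical, and `project_diseases` is a name chosen for this example rather than part of Gephi or the dataset.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical disease-gene edges for illustration only.
edges = [
    ("Disease A", "GENE1"),
    ("Disease B", "GENE1"),
    ("Disease A", "GENE2"),
    ("Disease B", "GENE2"),
    ("Disease B", "GENE3"),
    ("Disease C", "GENE3"),
    ("Disease D", "GENE4"),
]


def project_diseases(edge_list):
    """Link two diseases when they share at least one gene.

    Returns {(disease_1, disease_2): number of shared genes},
    with each pair stored in alphabetical order.
    """
    diseases_by_gene = defaultdict(set)
    for disease, gene in edge_list:
        diseases_by_gene[gene].add(disease)

    shared = defaultdict(int)
    for diseases in diseases_by_gene.values():
        # Every pair of diseases attached to the same gene gets one more shared gene.
        for first, second in combinations(sorted(diseases), 2):
            shared[(first, second)] += 1
    return dict(shared)


disease_links = project_diseases(edges)
```

The shared-gene count can serve as an edge weight: pairs with higher counts are more tightly connected, and a disease such as "Disease D" that shares no gene stays isolated.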

4. Genes Association Degree

During the exploration process, I found that many of the genes are connected with a single disease, while some genes are connected with two or more diseases. This means that mutations of the same gene can influence multiple diseases.

The fourth visualization colors and sizes genes by degree, so the number of diseases connected with each gene becomes visible. About 40% of genes connect with a single disease. About 34% of genes connect with 5 or more diseases; these are scattered across various disease types but mainly concentrated in Cancer diseases.
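Shares like these can be computed from gene degrees. The following sketch shows one way to do it, assuming a list of disease-gene pairs; the data is invented for illustration and does not reproduce the percentages above, and `gene_degree_shares` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical disease-gene edges: GENE1 links to five diseases,
# GENE2 and GENE3 to one each, GENE4 to two.
edges = [("D%d" % i, "GENE1") for i in range(1, 6)] + [
    ("D1", "GENE2"),
    ("D2", "GENE3"),
    ("D3", "GENE4"),
    ("D4", "GENE4"),
]


def gene_degree_shares(edge_list, threshold=5):
    """Return (share of genes with exactly 1 disease,
    share of genes with at least `threshold` diseases)."""
    diseases_by_gene = defaultdict(set)
    for disease, gene in edge_list:
        diseases_by_gene[gene].add(disease)

    degrees = [len(diseases) for diseases in diseases_by_gene.values()]
    total = len(degrees)
    single = sum(1 for d in degrees if d == 1)
    many = sum(1 for d in degrees if d >= threshold)
    return single / total, many / total


single_share, many_share = gene_degree_shares(edges)
```

With this toy input, half of the genes connect to a single disease and a quarter connect to five or more.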

Summary

In conclusion, the association between diseases and disorder genes in this dataset is partly presented in four visuals. The results need to be combined with practical medical theory to be judged and interpreted. As a next step, the ego-network of each common disease could be analyzed to give new insights into the dataset.

Reflection

Due to my lack of medical knowledge, the existing findings are based on my basic understanding of genetics. In the future I need to learn genetics in more detail in order to make more accurate and in-depth analyses.

Secondly, I realized that goal setting is essential when conducting visualization analysis. One should start from the original intention of solving a problem and aim to discover new problems during the exploration. A large-scale dataset has unlimited possibilities, so it’s important to propose a research direction first and then begin to explore from various perspectives.

Meanwhile, Gephi is a powerful visualization tool, but users need to explore it on their own to discover its professional functions. Some details could be improved, such as the lack of an undo function and the system freezing when two files are open at the same time. Besides, I hope Gephi’s layouts can provide sample demonstrations so that users understand the specific effect of each layout and its possibilities.

Information Visualization

Student work at the School of Information, Pratt Institute