Visualization of genetic disease-phenotype similarities by multiple maps t-SNE with Laplacian regularization
© Xu et al.; licensee BioMed Central Ltd. 2014
Published: 22 October 2014
From a phenotypic standpoint, certain diseases are difficult to diagnose accurately because of specific combinations of confounding symptoms. These shared sets of disease-related symptoms, referred to as phenotypic overlap, suggest shared pathophysiological mechanisms. Few attempts have been made to visualize the phenotypic relationships between human diseases from a machine learning perspective. The proposed method is intended to help researchers visually disambiguate symptoms that can confound the timely and accurate diagnosis of a disease.
Our method is primarily based on multiple maps t-SNE (mm-tSNE), a probabilistic method for visualizing data points in multiple low-dimensional spaces. We extend mm-tSNE with a Laplacian regularization term and provide an algorithm for optimizing the new objective function. The advantage of Laplacian regularization is that it exploits the clustering structure of the variables and yields sparser parameter estimates.
To assess our modified mm-tSNE algorithm comparatively, we re-examined two social network datasets used by the original mm-tSNE authors. We then applied our method to a phenotype dataset. In all of these cases, the proposed method performed better than the original mm-tSNE, as measured by the neighbourhood preservation ratio.
Phenotype grouping reflects the nature of human disease genetics. Phenotype visualization may therefore complement the investigation of candidate disease genes and of functional relations between genes and proteins. These relationships can be modelled by the modified mm-tSNE method, which can also be applied directly in other domains, including social and biological datasets.
A large number of studies have shown that mutations in functionally related genes are associated with genetic diseases characterized by overlapping phenotypes [1, 2]. On the other hand, diseases with different clinical features and genes may also have similar pathophysiological mechanisms [3, 4]. Based on these assumptions, a number of studies focus on developing computational frameworks for discovering disease-related gene candidates by exploiting complex associations between phenotypes and genotypes found within heterogeneous genomic datasets such as gene expression data, protein-protein interaction networks [5, 6] and gene ontology annotations. Studying the associations between diseases not only helps us find their common genetic basis, but also provides novel insights into molecular mechanisms and future drug targets for pharmaceutical research.
However, mm-tSNE has a potential disadvantage: points with high importance weights in the same map do not necessarily correspond to the same cluster, which makes the meaning of each map difficult to interpret. We therefore introduce a Laplacian regularization procedure for mm-tSNE. Laplacian regularization has been used with many other objective functions, such as linear regression and Gaussian mixture models. Its advantage for mm-tSNE is that it exploits the clustering structure of the variables and yields sparser parameter estimates. Our experimental results indicate that the novel method achieves comparable performance and provides a more flexible framework for data visualization than mm-tSNE.
The input phenotype similarity matrix A is a symmetric matrix in which each row (and column) corresponds to a phenotype. Phenotype similarity was constructed by van Driel et al. using the Online Mendelian Inheritance in Man (OMIM) database [16, 17]. The disease classification is obtained from the Human Disease Network, which uses plain text to summarize the specific features of each disease. We obtained a similarity matrix P among 1,014 phenotypes within 21 disease classes. Similarities that did not exceed a threshold of 0.5 were filtered out.
t-Stochastic Neighbourhood Embedding (t-SNE)
Multiple maps t-SNE
mm-tSNE is an extension of t-SNE that constructs several maps M to visualize non-metric properties of phenotype similarities, which alleviates the limitations of a single metric map. Because the input similarity matrix P in the high-dimensional space must be symmetric and non-negative and must sum to one, we normalized the original similarity matrix A accordingly.
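The normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `normalize_similarities`, the explicit symmetrization, the zeroed diagonal, and the placement of the 0.5 threshold inside this step are all hypothetical choices made for the example.

```python
import numpy as np

def normalize_similarities(A, threshold=0.5):
    """Turn a raw similarity matrix A into a joint distribution P.

    Assumed steps (illustrative only): symmetrize, drop self-similarities,
    zero out entries that do not exceed the threshold, then rescale so
    that all entries sum to one.
    """
    A = np.asarray(A, dtype=float)
    S = 0.5 * (A + A.T)          # enforce symmetry
    np.fill_diagonal(S, 0.0)     # ignore self-similarities
    S = np.clip(S, 0.0, None)    # enforce non-negativity
    S[S <= threshold] = 0.0      # keep only similarities above the threshold
    total = S.sum()
    if total == 0.0:
        raise ValueError("no similarity exceeds the threshold")
    return S / total

# Toy 3-phenotype example with made-up similarity values.
A_toy = np.array([[1.0, 0.8, 0.2],
                  [0.8, 1.0, 0.6],
                  [0.2, 0.6, 1.0]])
P_toy = normalize_similarities(A_toy)
```

The resulting matrix is symmetric, non-negative and sums to one, which is what the mm-tSNE cost function expects as input.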
The cost function is the same as Eq. (3), but it is optimized with respect to both the locations of the points in all phenotype maps and the importance weights.
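To make the roles of the map coordinates and the importance weights concrete, the sketch below computes the multiple-maps model similarities and a Kullback-Leibler cost in the form used by multiple maps t-SNE. It is an illustration under assumptions: the array layout, the function names, and the sign convention of the softmax over the unconstrained weights are choices made for this example, and no optimization step is shown.

```python
import numpy as np

def importance_weights(W):
    """Softmax over maps. W has shape (n_maps, n_points); columns sum to one."""
    W = W - W.max(axis=0, keepdims=True)   # shift for numerical stability
    E = np.exp(W)
    return E / E.sum(axis=0, keepdims=True)

def mm_tsne_q(Y, W):
    """Model similarities Q from coordinates Y and unconstrained weights W.

    Y: (n_maps, n_points, dim) map coordinates.
    W: (n_maps, n_points) weights turned into importance weights by softmax.
    """
    pi = importance_weights(W)
    diff = Y[:, :, None, :] - Y[:, None, :, :]
    d2 = (diff ** 2).sum(axis=-1)           # squared distances per map
    kernel = 1.0 / (1.0 + d2)               # Student-t kernel, one degree of freedom
    num = (pi[:, :, None] * pi[:, None, :] * kernel).sum(axis=0)
    np.fill_diagonal(num, 0.0)              # pairs (i, i) are excluded
    return num / num.sum()

def kl_cost(P, Q, eps=1e-12):
    """KL(P || Q), summed over the pairs where P is positive."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Toy example: 3 maps, 5 points, 2 dimensions, random initial values.
rng = np.random.default_rng(0)
Y_toy = rng.normal(size=(3, 5, 2))
W_toy = rng.normal(size=(3, 5))
Q_toy = mm_tsne_q(Y_toy, W_toy)
```

Because a point can carry high importance weight in one map and low weight in another, two pairs of points can each be close in different maps without forcing the third pair to be close, which is how multiple maps represent intransitive similarities.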
Multiple maps t-SNE with Laplacian regularization
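As a hedged illustration of what a graph Laplacian penalty looks like, the sketch below uses the common quadratic form λ·tr(W L Wᵀ), where L = D − S is the unnormalized Laplacian of a similarity graph S. Applying the penalty to the importance weights, the function names, and the use of the unnormalized Laplacian are assumptions made for this example and may differ from the exact regularization term used in this work.

```python
import numpy as np

def graph_laplacian(S):
    """Unnormalized graph Laplacian L = D - S of a symmetric, non-negative S."""
    S = np.asarray(S, dtype=float)
    return np.diag(S.sum(axis=1)) - S

def laplacian_penalty(W, S, lam):
    """lam * trace(W L W^T) for weights W of shape (n_maps, n_points).

    For symmetric S this equals (lam / 2) * sum_ij S_ij * ||w_i - w_j||^2,
    where w_i is the column of W belonging to point i, so strongly
    connected points are pushed towards similar weights.
    """
    L = graph_laplacian(S)
    return lam * float(np.trace(W @ L @ W.T))

# Toy example: two connected points, one map.
S_toy = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
W_toy = np.array([[1.0, 3.0]])
penalty_toy = laplacian_penalty(W_toy, S_toy, lam=0.5)
```

In a regularized objective of this kind, the penalty would be added to the mm-tSNE cost, with λ controlling the trade-off; setting λ to zero removes the penalty.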
Neighborhood preservation ratio
where the counting operator returns the number of elements in a set and n is the total number of phenotypes. In this paper, we assess the NPR in the same way, which helps us choose the number of maps by jointly varying the number of maps m and λ. We chose eleven values of λ from 0 to 0.01 in steps of 0.001. When λ = 0, our method is equivalent to mm-tSNE.
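A neighbourhood preservation ratio of this kind can be sketched as below. The definition used here is an assumption for illustration: for each phenotype, the k most similar phenotypes under the input similarities P are compared with the k most similar under the model similarities Q, and the overlap is averaged over all n phenotypes and divided by k. The function names and the choice of k are hypothetical.

```python
import numpy as np

def top_k_neighbours(M, k):
    """Indices of the k largest off-diagonal entries in each row of M."""
    M = np.array(M, dtype=float)            # copy so the caller's matrix is untouched
    np.fill_diagonal(M, -np.inf)            # a point is never its own neighbour
    return np.argsort(-M, axis=1, kind="stable")[:, :k]

def neighbourhood_preservation_ratio(P, Q, k):
    """Mean fraction of the k nearest neighbours under P that are kept under Q."""
    n = P.shape[0]
    top_p = top_k_neighbours(P, k)
    top_q = top_k_neighbours(Q, k)
    overlap = [len(set(top_p[i]) & set(top_q[i])) for i in range(n)]
    return float(np.mean(overlap)) / k

# Grid of regularization strengths described in the text: 0, 0.001, ..., 0.01.
lambdas = np.linspace(0.0, 0.01, 11)

# Toy check with made-up similarity values.
P_toy = np.array([[0, 3, 2, 1],
                  [3, 0, 2, 1],
                  [2, 1, 0, 3],
                  [1, 2, 3, 0]], dtype=float)
Q_toy = np.array([[0, 1, 2, 3],
                  [1, 0, 2, 3],
                  [3, 2, 0, 1],
                  [3, 2, 1, 0]], dtype=float)
npr_same = neighbourhood_preservation_ratio(P_toy, P_toy, k=2)
npr_diff = neighbourhood_preservation_ratio(P_toy, Q_toy, k=1)
```

For model selection, one would compute this ratio for each combination of the number of maps m and each value in `lambdas` and compare the results.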
Model selection and performance comparison
Laplacian regularized mm-tSNE reveals intransitive similarity
Overall, we found that phenotypes belonging to the same disease class tend to group together. However, some phenotypes in one disease category overlap with other disease classes. These include, but are not limited to, developmental and skeletal diseases. This is reasonable because most developmental diseases would be expected to affect multiple tissues.
Extracted similarities from original matrix.
Phenotype with OMIM ID
Importance weights for extracted phenotypes.
Besides CD, ABS has another close neighbour in Map 6 (see Figure 6): Melnick-Needles syndrome (MNS, OMIM: 309350), with a similarity of 0.502. ABS, CD and MNS are all neighbours in Map 6. However, it is surprising that the similarity between ABS and MNS is 0 (see Table 1). We therefore investigated these three phenotypes further. MNS is a skeletal disease associated with abnormal skeletal development as well as other health-related problems. Its main symptoms include short stature, abnormally long fingers and toes, and irregular ribs. ABS is an unclassified disease, but the two share most of their common symptoms. CD is a severe disorder that affects the development of the skeleton and the reproductive system. Although these three disorders fall into three different categories (skeletal, unclassified and developmental, respectively), they are all related to the skeletal system and are often life-threatening in the newborn period. The analysis shows that although the direct similarity between ABS and MNS is 0 as measured by the text-mining approach of van Driel et al., our method inferred their real relationship from the data. This is not inconsistent with the modelling of intransitive similarity, because the three phenotypes lie in the same metric space, Map 6.
We developed a novel visualization method, graph Laplacian regularized mm-tSNE. Compared with the previous method, the regularization imposes more sparsity on the weights of data points in different maps and less sparsity on the coordinates of data points. As a result, we obtained better visualization results and novel biological interpretations. In applying this method, we found that it can identify interesting intransitive similarities among disease phenotypes. The approach also adds flexibility for visualization tasks. For example, adjusting the parameter λ makes the weights (and the coordinates of data points in the low-dimensional space) more or less sparse, which "zooms in" or "zooms out" on data points in different maps. We expect the new technique to be useful for more general visualization analysis in other fields.
This work was supported in part by NSF IIP 1160960, NNS IIP 1332024, NSF CCF 0905291, NSFC 90920005, NSFC 61170189 and China National 12-5 plan 2012BAK24B01
Publication of this article has been funded by the NSF IIP 1160960, NNS IIP 1332024
This article has been published as part of BMC Medical Genomics Volume 7 Supplement 2, 2014: IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Bioinformatics in Medical Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/7/S2.
- Brunner HG, Van Driel MA: From syndrome families to functional genomics. Nature Reviews Genetics. 2004, 5 (7): 545-551. 10.1038/nrg1383.
- Lim J, et al: A protein-protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell. 2006, 125 (4): 801-814. 10.1016/j.cell.2006.03.032.
- Limviphuvadh V, et al: The commonality of protein interaction networks determined in neurodegenerative disorders (NDDs). Bioinformatics. 2007, 23 (16): 2129-2138. 10.1093/bioinformatics/btm307.
- Huynen MA, Brunner HG: Phenome connections. Trends in genetics. 2008, 24 (3): 103-106. 10.1016/j.tig.2007.12.005.
- Lage K, et al: A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature biotechnology. 2007, 25 (3): 309-316. 10.1038/nbt1295.
- Oti M, et al: Predicting disease genes using protein-protein interactions. Journal of medical genetics. 2006, 43 (8): 691-698. 10.1136/jmg.2006.041376.
- Freudenberg J, Propping P: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics. 2002, 18 (suppl 2): S110-S115. 10.1093/bioinformatics/18.suppl_2.S110.
- Loscalzo J, Kohane I, Barabasi AL: Human disease classification in the postgenomic era: a complex systems approach to human pathobiology. Molecular systems biology. 2007, 3 (1).
- Wang Q, et al: Multi-Dimensional Prioritization of Dental Caries Candidate Genes and Its Enriched Dense Network Modules. PloS one. 2013, 8 (10): e76666. 10.1371/journal.pone.0076666.
- Csermely P, et al: Structure and dynamics of molecular networks: A novel paradigm of drug discovery: A comprehensive review. Pharmacology & therapeutics. 2013, 138 (3): 333-408. 10.1016/j.pharmthera.2013.01.016.
- Legendre P, Legendre L: Numerical ecology. 2012, 20: Elsevier.
- Van der Maaten L, Hinton G: Visualizing non-metric similarities in multiple maps. Machine learning. 2012, 87 (1): 33-55. 10.1007/s10994-011-5273-4.
- Li C, Li H: Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008, 24 (9): 1175-1182. 10.1093/bioinformatics/btn081.
- He X, et al: Laplacian regularized gaussian mixture model for data clustering. Knowledge and Data Engineering, IEEE Transactions on. 2011, 23 (9): 1406-1418.
- van Driel MA, et al: A text-mining analysis of the human phenome. European journal of human genetics. 2006, 14 (5): 535-542. 10.1038/sj.ejhg.5201585.
- Hamosh A, et al: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic acids research. 2005, 33 (suppl 1): D514-D517.
- Jiang X, et al: Modularity in the genetic disease-phenotype network. FEBS letters. 2008, 582 (17): 2549-2554. 10.1016/j.febslet.2008.06.023.
- Van der Maaten L, Hinton G: Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008, 9 (11).
- Lacoste-Julien S, Sha F, Jordan MI: DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in neural information processing systems. 2008.
- Jamieson AR, et al: Exploring nonlinear feature space dimension reduction and data representation in breast CADx with Laplacian eigenmaps and t-SNE. Medical physics. 2010, 37: 339. 10.1118/1.3267037.
- Verloes A, et al: Fronto-otopalatodigital osteodysplasia: Clinical evidence for a single entity encompassing Melnick-Needles syndrome, otopalatodigital syndrome types 1 and 2, and frontometaphyseal dysplasia. American journal of medical genetics. 2000, 90 (5): 407-422. 10.1002/(SICI)1096-8628(20000228)90:5<407::AID-AJMG11>3.0.CO;2-D.
- McGlaughlin KL, et al: Spectrum of Antley-Bixler syndrome. Journal of Craniofacial Surgery. 2010, 21 (5): 1560-1564. 10.1097/SCS.0b013e3181ec6afe.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.