Measuring disease similarity and predicting disease-related ncRNAs by a novel method

Background: Similar diseases are often caused by similar molecular origins, such as disease-related protein-coding genes (PCGs), and the molecular associations between diseases reflect their similarity. Current methods for calculating disease similarity therefore often use functional interactions of PCGs. However, existing methods have neglected the fact that genes can also be associated in the gene functional network (GFN) through intermediate nodes. Methods: Here we present a novel method, InfDisSim, to deduce the similarity of diseases. InfDisSim uses the whole network, modelling the information flow by a random walk with damping. A benchmark set of similar disease pairs was employed to evaluate the performance of InfDisSim. Results: The area under the receiver operating characteristic curve (AUC) was calculated to assess performance. InfDisSim reaches a high AUC (0.9786), which indicates very good performance. Furthermore, after calculating disease similarity with InfDisSim, we reconfirmed that similar diseases tend to have common therapeutic drugs (Pearson correlation γ2 = 0.1315, p = 2.2e-16). Finally, the disease similarity computed by InfDisSim was employed to construct a lncRNA similarity network (LSN) and a miRNA similarity network (MSN), which were further exploited to predict potential lncRNA-disease and miRNA-disease associations, respectively. High AUCs (0.9893 and 0.9007) based on leave-one-out cross validation show that the LSN and MSN are well suited for predicting novel disease-related lncRNAs and miRNAs, respectively. Conclusions: The high AUC on benchmark data indicates that the method performs well. The method is valuable for the prediction of disease-related lncRNAs and miRNAs. Electronic supplementary material: The online version of this article (doi: 10.1186/s12920-017-0315-9) contains supplementary material, which is available to authorized users.


Background
One way to quantify the associations between pairs of diseases is their similarity. Compared with the associations themselves, disease similarity can indicate the relationships between diseases of multiple categories, for instance cancers, more clearly and easily [1]. In previous studies, disease similarity was exploited to compute similarities between protein-coding genes (PCGs), which can help to disclose the complex pathogenesis of diseases [1]. Moreover, disease similarity was also employed to calculate similarities between microRNA genes (miRNAs) [2,3] and long non-coding RNA genes (lncRNAs) [4][5][6][7][8], which can be applied to construct functional networks of non-coding RNA genes (ncRNAs). Recently, disease similarity has even been used to predict potential therapeutic drugs for diseases [9][10][11][12].
Semantic associations and disease gene associations are often used to quantify disease similarity. Semantic associations between diseases are documented in ontologies of disease terms. The most widely used ontology for calculating disease similarity is Disease Ontology (DO) [13], which is the first ontology established around disease terms. DO defines a type of semantic association named the 'IS_A' relationship, which reflects set inclusion relationships between disease terms [14]. The disease terms of DO form a directed acyclic graph (DAG) based on the 'IS_A' relationship. Disease-related genes are distributed across different sources, such as the Comparative Toxicogenomics Database (CTD) [15], Online Mendelian Inheritance in Man (OMIM) [16], Gene Reference into Functions (GeneRIFs) [17], the Genetic Association Database (GAD) [18], and so on.
Three widely used methods for computing the similarity of ontology terms were presented by Resnik [19], Lin [20], and Wang et al. [21], respectively. All three were used for computing disease similarity by DOSim [1]. Resnik introduced the information content (IC) of ontology terms [19]; in his method, the IC of the most informative common ancestor (MICA) of a pair of diseases serves as their similarity. Because both the IC of the pair of terms and the IC of their MICA contribute to their similarity, Lin [20] improved Resnik's method. In contrast to Resnik's and Lin's methods, Wang et al. [21] computed the similarity between terms entirely from the semantic associations of terms in the ontology.
In recent years, three methods for calculating the similarity of DO terms have been presented, all of which focus on disease-related genes. In other words, the similarity of two diseases is converted into the similarity of their two gene sets. Mathur and Dinakarpandian first proposed using the number of overlapping genes to calculate disease similarity [22]. However, even when two gene sets share no genes, they can still be connected through their involvement in the same or similar biological processes. Therefore, Mathur and Dinakarpandian designed a process-similarity based (PSB) method that computes disease similarity from biological process terms of Gene Ontology [23,24]. Besides biological processes, co-expression [25] and protein-protein interaction [26] can also be used to measure the similarity of disease-related gene sets [27,28]. Hence, Cheng et al. combined semantic associations and the comprehensive gene functional network to compute disease similarity (SemFunSim) [11], which performs very well.
Improved knowledge has suggested that semantic associations and disease gene associations are two significant types of association, both widely exploited to measure disease similarity. Recent studies have focused on incorporating disease gene associations from different views. Eventually, the comprehensive gene functional network (GFN) was incorporated in the SemFunSim method [11], in which functional interactions between pairs of genes were considered. A natural next question is whether the entire network can be fully used to measure disease similarity. For this purpose, in this study we designed a novel method, called InfDisSim, that computes disease similarity by modelling the information flow in the comprehensive GFN.

Data source

Disease ontology
Disease terms and semantic associations were obtained from DO [13] (Table 1), which is manually curated for disease names. At present, it includes 7124 'IS_A' relationships between 6920 terms.

Disease gene association network
Disease-related genes were derived from the latest versions of several open sources: CTD [15], GAD [18], GeneRIFs [17], and OMIM [16]. Disease terms in these databases were mapped to DO according to SIDD [29]. After integrating these four widely used sources, 130,144 associations between 3178 disease terms and 11,717 genes were obtained as the disease gene association network (Additional file 1).

Comprehensive gene functional network
The comprehensive GFN was obtained from HumanNet [30], which is built for Homo sapiens. Multiple interactions spanning human mRNA co-expression, protein-protein interaction, protein complex, and comparative genomics data sets, combined with similar lines of evidence from orthologs in yeast, fly, and worm, were comprehensively analyzed with a probabilistic method to build the network. Currently, it contains 476,399 interactions among 16,243 genes [30].

Disease-related drugs
Disease-related drugs were derived from CTD [15], a robust, publicly accessible database that elucidates how chemicals affect human health. Disease terms in CTD were mapped to DO according to SIDD [29]. As a result, 16,639 associations between 1093 diseases and 3887 drugs were obtained.

Disease-related lncRNAs
Human lncRNA-disease associations [31][32][33][34][35][36] were incorporated into the lncRNA similarity network (LSN), which was constructed based on disease similarity, to predict potential relationships between diseases and lncRNAs. These associations were derived from the manually curated database LncRNADisease [37], which provides experimentally supported disease-lncRNA associations. After removing disease terms not in DO and duplicate associations, 602 associations between 167 diseases and 338 lncRNAs were obtained (Additional file 2).

Disease-related miRNAs
Disease-related human miRNAs were extracted from the Human microRNA Disease Database (HMDD) v2.0 [3]. After manually mapping the disease terms of HMDD to DO, we obtained 5710 associations between 556 miRNAs and 265 diseases (Additional file 3).

Method for calculating disease similarity
In this study, we designed a novel method to compute disease similarity by modelling the information flow in the comprehensive GFN. In a previous study, a tool called ITM Probe [38] was created for analyzing information flow in a network based on random walk with damping. ITM Probe currently offers three models: absorbing, emitting, and channel. In these three models [39], the source nodes, which are the starting points of the random walk, and the sink nodes, which are its ending points, are regarded as boundary nodes, and the remaining nodes in the network are regarded as transient nodes. The channel model [39] was designed for directed information flow; it extends the absorbing and emitting models by specifying both the sources and the sinks of the information flow. Here, the channel model was applied to the network formed by the disease gene association network and the comprehensive GFN. In this network, disease terms are not directly linked to each other, but they can be associated through their related genes. As shown in Fig. 1, diseases in the network were considered as boundary nodes and all genes were considered as transient nodes. To assign a weight to each transient node for a disease, the given disease was considered as both the source node and the sink node of the information flow, and the damping factor was set to 0.85 based on a previous study [39].
Fig. 1 Workflow of InfDisSim to demonstrate the basic ideas of measuring disease similarity
Assume that N genes exist in the integrative network. Each disease can then be represented as an N-dimensional vector based on ITM Probe. For a given disease $t_1$, the weight vector can be described as:
$$WV_{t_1} = \left(w_{1,1}, w_{1,2}, \ldots, w_{1,N}\right) \tag{1}$$
where $WV_{t_1}$ indicates the weight vector of $t_1$, and $w_{1,i}$ indicates the weight score of $t_1$ on the ith dimension.
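The weighting step can be illustrated with a much simpler stand-in for the channel model. The sketch below is not ITM Probe: it only accumulates the expected visit counts of a random walk that starts on a disease's genes and survives each step with probability equal to the damping factor (0.85), then normalizes the counts into a weight vector. The function name `damped_visit_weights`, the dictionary-based graph, and the toy gene names and edge weights are assumptions made for illustration.

```python
def damped_visit_weights(adjacency, seeds, damping=0.85, tol=1e-12, max_iter=10_000):
    """Normalized expected visit counts of a damped random walk.

    adjacency: dict node -> dict neighbor -> positive edge weight.
    seeds: genes associated with the disease (assumed to be keys of
           `adjacency`); the walk starts uniformly on them.
    At each step the walker survives with probability `damping` and moves
    to a neighbor with probability proportional to the edge weight.
    """
    seeds = set(seeds)
    if not seeds:
        raise ValueError("at least one seed gene is required")
    nodes = sorted(adjacency)
    current = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    visits = dict(current)  # step-0 visits are the starting distribution
    for _ in range(max_iter):
        arriving = {n: 0.0 for n in nodes}
        for u, mass in current.items():
            total = sum(adjacency[u].values())
            if mass == 0.0 or total == 0.0:
                continue
            for v, w in adjacency[u].items():
                arriving[v] += damping * mass * w / total
        for n in nodes:
            visits[n] += arriving[n]
        current = arriving
        if sum(arriving.values()) < tol:  # remaining walk mass is negligible
            break
    norm = sum(visits.values())
    return {n: visits[n] / norm for n in nodes}


# Toy gene functional network (hypothetical genes and weights).
toy_gfn = {
    "g1": {"g2": 1.0},
    "g2": {"g1": 1.0, "g3": 1.0},
    "g3": {"g2": 1.0, "g4": 1.0},
    "g4": {"g3": 1.0},
    "g5": {},  # isolated gene, never reached by the walk
}
weights = damped_visit_weights(toy_gfn, seeds={"g1"})
```

In this toy network the weight decays with distance from the seed beyond its immediate neighbor, and the isolated gene receives zero weight. The channel model used in the paper additionally treats the disease as a sink, which this sketch does not attempt to reproduce.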
Then, the disease similarity based on information flow can be defined as the cosine of the weight vectors:
$$\cos\left(WV_{t_1}, WV_{t_2}\right) = \frac{\sum_{i=1}^{N} w_{1,i}\, w_{2,i}}{\sqrt{\sum_{i=1}^{N} w_{1,i}^{2}}\,\sqrt{\sum_{i=1}^{N} w_{2,i}^{2}}} \tag{2}$$
Because disease similarity is reflected by both semantic associations and disease gene associations, the final disease similarity is defined in Eq. 3, where $G_1$ and $G_2$ indicate the gene sets of $t_1$ and $t_2$, respectively, $G_{MICA}$ is the gene set of $t_3$, which is the most informative common ancestor of $t_1$ and $t_2$, and $|\cdot|$ represents the number of elements in the specified set. According to Lin's research, the similarity between a pair of DO terms is given by Eqs. 4 and 5, where $G_{root}$ represents the gene set of the root node of the DAG of DO. According to Eq. 5, the semantic similarity between $t_1$ and $t_2$ increases with $|G_1|$ and $|G_2|$ and decreases with $|G_{MICA}|$. The proportional relation of Eq. 3 is therefore consistent with that of Lin's method.
Assume that $T_1$ and $T_2$ are two disease sets that include n and m diseases, respectively. The similarity between two disease sets (Fig. 2) is defined in Eq. 6:
$$Sim(T_1, T_2) = \frac{\sum_{i=1}^{n} Sim(t_{1,i} \to T_2) + \sum_{j=1}^{m} Sim(t_{2,j} \to T_1)}{n + m} \tag{6}$$
where $t_{1,i}$ and $t_{2,j}$ represent the ith and jth diseases of $T_1$ and $T_2$, respectively, and $Sim(t_{1,i} \to T_2)$ represents the similarity from a disease term of $T_1$ to $T_2$. Taking $t_{1,1}$ as an example, Eq. 7 gives the definition:
$$Sim(t_{1,1} \to T_2) = \max_{1 \le j \le m} Sim(t_{1,1}, t_{2,j}) \tag{7}$$
where $Sim(t_{1,1}, t_{2,j})$ is the similarity between the two disease terms.
Fig. 2 An example of calculating the similarity between disease sets T1 and T2

Method for predicting disease-related lncRNAs and miRNAs
Disease-related lncRNAs and miRNAs were identified by applying a global network ranking algorithm called random walk with restart (RWR) [40]. The random walker starts from one or several seed nodes and then moves randomly to neighboring nodes according to the probabilities of the edges connecting the two nodes. The probability of returning to the seed nodes is denoted γ. The RWR algorithm can then be defined as:
$$P_{t+1} = (1 - \gamma)\, A\, P_t + \gamma P_0 \tag{8}$$
where $P_0$ represents the initial probability vector, $P_t$, which changes with the step t and the probability γ, is a vector in which the ith element represents the probability of finding the walker at node i at step t, and A indicates the column-normalized adjacency matrix of the network. The iteration was run until the difference between $P_t$ and $P_{t+1}$ fell below $10^{-10}$, which indicates that the state of all nodes has become stable.
Based on our method, researchers can predict novel lncRNA-disease and miRNA-disease associations using RWR. First, an LSN (MSN) is constructed for RWR. A lncRNA (miRNA) is associated with a set of diseases. Hence, the similarity between two lncRNAs (miRNAs) can be computed from their related disease sets, which makes it possible to construct an LSN (MSN). Then, lncRNAs (miRNAs) are scored for each disease by RWR, with the known lncRNAs (miRNAs) of the disease as seed nodes. For each disease, the lncRNAs (miRNAs) not yet known to be related to it are scored. After ranking the lncRNAs (miRNAs) by score, disease-related lncRNAs (miRNAs) are finally predicted.

Figure 3 shows the process of performance validation. First, a benchmark set of 70 pairs among 47 diseases was derived from two published articles (Additional file 4). One is the study by Suthram et al. [41], in which similar pairs of diseases were recognized from disease-related mRNA expression data and the human protein interaction network. The other is the study by Pakhomov et al. [42], in which similar pairs of diseases were manually checked by experts in related fields. Then, a random set ten times the size of the benchmark set was obtained from DO. After that, the similarities of the benchmark set and the random set were calculated by state-of-the-art methods including Resnik's, Lin's, Wang's, PSB, SemFunSim, and InfDisSim. Finally, the receiver operating characteristic (ROC) curve was drawn to assess the performance of these methods. The experiment was repeated 100 times, and the average area under the ROC curve (AUC) was obtained for each method.
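The validation loop described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the AUC is computed through its rank interpretation (the probability that a benchmark pair scores higher than a random pair), and the helper names `auc` and `mean_auc`, the sampling of random pairs with replacement, and the exclusion of benchmark pairs from the random set are all assumptions.

```python
import random


def auc(positive_scores, negative_scores):
    """Probability that a positive pair outscores a negative pair (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def mean_auc(similarity, benchmark_pairs, terms, ratio=10, repeats=100, seed=0):
    """Average AUC over repeated draws of a random set `ratio` times the benchmark size.

    similarity: function (term_a, term_b) -> float.
    Random pairs are drawn with replacement and benchmark pairs are skipped;
    both choices are assumptions of this sketch.
    """
    rng = random.Random(seed)
    benchmark = {frozenset(pair) for pair in benchmark_pairs}
    positives = [similarity(a, b) for a, b in benchmark_pairs]
    terms = list(terms)
    total = 0.0
    for _ in range(repeats):
        negatives = []
        while len(negatives) < ratio * len(positives):
            a, b = rng.sample(terms, 2)
            if frozenset((a, b)) not in benchmark:
                negatives.append(similarity(a, b))
        total += auc(positives, negatives)
    return total / repeats
```

With 70 benchmark pairs, `ratio=10` yields 700 random pairs per repetition, and `repeats=100` matches the number of iterations stated in the text.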

Results
Performance evaluation based on benchmark set
ROC curves of the state-of-the-art methods on a benchmark set and a random set are shown in Fig. 4a. The AUCs of Resnik's, Lin's, Wang's, PSB, SemFunSim, and InfDisSim are 0.6283, 0.6586, 0.6837, 0.8807, 0.9843, and 0.9786, respectively. The three classical methods (Resnik's, Lin's, and Wang's) perform almost identically, and all three perform only moderately. By contrast, the three newer methods, which incorporate more disease gene associations and gene interactions, perform better; among them, SemFunSim and InfDisSim perform best and nearly equally well.
Resnik's, Lin's, and Wang's methods concentrate on semantic associations, and few disease gene associations are used by these three methods. As more and more disease gene associations and gene interactions are identified, it becomes easier to study the similarity between diseases at the molecular level. PSB, SemFunSim, and InfDisSim integrate these associations with semantic associations. Interactions between genes, including mRNA co-expression, protein-protein interaction, protein complex, and so on, are readily available. Although the PSB method only uses the co-occurring biological processes of genes, its performance is already improved. To enhance performance further, SemFunSim and InfDisSim employ comprehensive gene functional associations from two different views, and both perform excellently.

Relationship between disease similarity by InfDisSim and co-occurrence drugs
Previous studies have indicated that similar diseases can have common therapeutic drugs [9,10]. It is therefore plausible that similar diseases tend to share more co-occurring drugs. To test this, we examined the relationship between InfDisSim disease similarity and co-occurring drugs, using the Jaccard index as the measure of disease similarity by drugs. InfDisSim disease similarity was significantly positively correlated with drug co-occurrence (Pearson correlation γ2 = 0.1315, p = 2.2e-16; Fig. 5). These results show that the disease similarity computed by our method is correlated with co-occurring drugs.
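The comparison between similarity scores and shared drugs can be sketched in a few lines. The example below computes the Jaccard index of two drug sets and a plain Pearson correlation coefficient; the disease identifiers, drug names, and similarity scores are hypothetical, and the significance test behind the reported p-value is not reproduced.

```python
import math


def jaccard(drugs_a, drugs_b):
    """Jaccard index of two drug sets: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    a, b = set(drugs_a), set(drugs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical drug sets and similarity scores for three disease pairs.
disease_drugs = {
    "d1": {"drugA", "drugB"},
    "d2": {"drugB", "drugC"},
    "d3": {"drugD"},
}
pairs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]
similarity = {("d1", "d2"): 0.8, ("d1", "d3"): 0.1, ("d2", "d3"): 0.2}

drug_overlap = [jaccard(disease_drugs[a], disease_drugs[b]) for a, b in pairs]
scores = [similarity[p] for p in pairs]
r = pearson(scores, drug_overlap)
```

In the toy data only the most similar pair shares a drug, so the coefficient `r` is positive.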

Application of disease similarity to the prediction of disease-related lncRNAs
To show the usefulness of the disease similarity computed by InfDisSim, we first constructed a lncRNA similarity network (LSN) based on disease similarity and then identified disease-related lncRNAs based on the LSN. The similarity of each pair of 111 lncRNAs was computed using Eq. 6. After that, the z-score of each pair of lncRNAs was computed from these scores, and each similarity score was assigned a one-sided P-value. Finally, all of these lncRNA similarity scores were applied to construct the LSN (Additional file 5).
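The construction of lncRNA similarity scores can be sketched under two assumptions that the text does not spell out: that Eq. 6 takes the common best-match-average form, in which each disease is matched to its most similar disease in the other set, and that the one-sided P-value is the upper tail of a standard normal distribution evaluated at the z-score. The function names and the toy similarity table are hypothetical.

```python
import math
import statistics


def set_similarity(diseases_a, diseases_b, sim):
    """Best-match average between two disease sets.

    sim: function (disease_x, disease_y) -> similarity in [0, 1].
    Each disease is matched to its most similar disease in the other set,
    and the best-match scores of both directions are averaged.
    """
    best_a = sum(max(sim(x, y) for y in diseases_b) for x in diseases_a)
    best_b = sum(max(sim(y, x) for x in diseases_a) for y in diseases_b)
    return (best_a + best_b) / (len(diseases_a) + len(diseases_b))


def z_scores_and_p_values(scores):
    """z-score of every score and its upper-tail P-value under a standard normal.

    Assumes the scores are not all identical (non-zero standard deviation).
    """
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    result = []
    for s in scores:
        z = (s - mean) / sd
        p = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
        result.append((z, p))
    return result


# Hypothetical pairwise disease similarities.
toy_table = {frozenset(("a", "b")): 0.5}


def toy_sim(x, y):
    return 1.0 if x == y else toy_table.get(frozenset((x, y)), 0.0)


# Two lncRNAs described by their related disease sets.
score = set_similarity(["a", "b"], ["a"], toy_sim)
```

Here disease `a` matches itself in both directions (1.0 each) and `b` matches `a` with 0.5, so the set similarity is 2.5 / 3.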
The LSN was further employed to predict disease-related lncRNAs using the RWR algorithm. Based on the 331 known associations between 125 diseases and 111 lncRNAs, the performance of the LSN was assessed by leave-one-out cross validation, which yielded an AUC of 0.9893 (Fig. 6).
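A random walk with restart on a similarity network, followed by one leave-one-out step, can be sketched as below. The update rule is the standard RWR iteration with a column-normalized matrix and the stopping threshold of 1e-10 stated in the methods; the restart probability of 0.7, the function names, and the three-node toy network are assumptions, since the text does not report the value of γ.

```python
def column_normalize(weights):
    """Return A with A[i][j] = weights[i][j] / (sum of column j)."""
    n = len(weights)
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    return [[weights[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


def rwr(weights, seeds, restart=0.7, tol=1e-10, max_iter=10_000):
    """Iterate P_{t+1} = (1 - restart) * A * P_t + restart * P_0 until stable."""
    n = len(weights)
    a = column_normalize(weights)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(a[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        diff = sum(abs(x - y) for x, y in zip(nxt, p))
        p = nxt
        if diff < tol:
            break
    return p


def leave_one_out_rank(weights, known, held_out, restart=0.7):
    """Rank of a held-out known node among all non-seed nodes (1 = best)."""
    seeds = set(known) - {held_out}
    if not seeds:
        raise ValueError("need at least one remaining seed")
    scores = rwr(weights, seeds, restart=restart)
    candidates = [i for i in range(len(weights)) if i not in seeds]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return ranked.index(held_out) + 1


# Toy similarity network over three lncRNAs (hypothetical weights).
toy_lsn = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]
steady = rwr(toy_lsn, seeds={0})
rank = leave_one_out_rank(toy_lsn, known={0, 1}, held_out=1)
```

Repeating `leave_one_out_rank` for every known association of every disease gives the ranks from which a ROC curve can be drawn. A disease with a single known association has no remaining seed and is rejected by this sketch.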

Application of disease similarity to the prediction of disease-related miRNAs
We also used the disease similarity to construct an MSN and to predict disease-related miRNAs based on the network. Here, we calculated the similarity of each pair of 265 miRNAs and the corresponding one-sided P-value. All of these miRNA similarity scores were employed to construct the MSN (Additional file 6) for predicting disease-related miRNAs. The performance of the MSN was assessed by leave-one-out cross validation, which yielded an AUC of 0.9007.
Fig. 5 The relationship between disease similarity based on InfDisSim and co-occurrence drugs
Fig. 6 The ROC curve of our method based on leave-one-out cross validation on experimentally verified lncRNA-disease associations

Discussion
To identify disease-related ncRNAs, including lncRNAs and miRNAs, we presented a novel method based on disease similarity and a random walk. Given the high AUCs for predicting disease-related lncRNAs and miRNAs (0.9893 and 0.9007, respectively), the methods proposed in this paper may also be applied to predict other disease-related modules, e.g. SNPs and risk pathways [43,44].

Conclusions
In this study, we presented a novel method, InfDisSim, that computes disease similarity from semantic associations and disease-related genes. When computing similarity based on genes, the information flow was modelled in a comprehensive GFN, which is constructed by integrating multiple interactions involving mRNA co-expression, protein-protein interaction, protein complex, and so on. In a previous study, SemFunSim introduced the interactions between pairs of genes from different gene sets. Here, the whole network is fully exploited through information flow, which offers a novel view for computing disease similarity.
The performance of InfDisSim was validated on the benchmark set, and the high AUC (0.9786) indicates excellent performance. We then assessed the observation that similar diseases can have common therapeutic drugs: InfDisSim disease similarity was significantly positively correlated with the co-occurrence drugs (Pearson correlation γ2 = 0.1315, p = 2.2e-16; Fig. 5). Therefore, InfDisSim disease similarity could be used to predict potential associations between diseases and drugs.
lncRNA similarity and miRNA similarity can be computed based on InfDisSim disease similarity. Here, we calculated the similarities of all pairs of lncRNAs (miRNAs) and used them to construct an LSN (MSN). The network was further used to predict disease-related lncRNAs (miRNAs). The high AUCs (0.9893 and 0.9007) illustrate that the LSN (MSN) is well suited for predicting potential associations between diseases and lncRNAs (miRNAs) based on RWR.