Methods overview
To predict unknown miRNA-disease interactions, we propose a new KNMBP model with five parts, as shown in Fig. 1. First, we calculate miRNA functional similarity and disease semantic similarity by using multiple histological data other than miRNA-disease interaction information (as shown in step 1 of Fig. 1). Second, based on the modified known miRNA-disease interaction network, we use the kernel-based neighborhood similarity model (KSNS) to calculate the disease kernel neighborhood similarity and miRNA kernel neighborhood similarity (as shown in step 2 and step 3 of Fig. 1). Finally, based on the integrated miRNA (disease) similar network calculated by Diffusion Component Analysis (clusDCA), we released a bidirectional propagation algorithm to predict unknown miRNA-disease interaction scores (as shown in step 4 and step 5 in Fig. 1).
Dataset collection
In order to fairly compare the performance of the model, we used two benchmark datasets to conduct experiments.
For benchmark dataset I, we utilized the dataset of miRNA-disease interaction prediction established by Chen et al. [16, 17]. The dataset I consists of three parts: First, 5430 interactions between 383 diseases and 495 miRNAs were extracted from HMDD v2.0 [21]. Second, based on the Medical Subject Headings (MeSH) descriptors in the U.S. National Library of Medicine, two semantic similarity matrices of diseases were established by wang et al. [11] and Xuan et al. [22], respectively. Third, the functional similarity matrix of miRNA was established by Lu et al. [23]. All these data can be downloaded from https://github.com/IMCMDAsourcecode/IMCMDA. However, Dataset I is based on the old version (HMDD v2.0), and it also has the disadvantage that the disease semantic similarity is very sparse and the miRNA functional similarity depends on the known miRNA-disease interaction. Therefore, we extracted information about miRNAs and diseases from several latest databases and built benchmark dataset II. We describe the establishment of dataset II from three aspects.
First, extract information about the disease. The Comparative Toxicogenomics Database (CTD) is an important database of disease research that provides a wealth of interactive information between disease and chemistry, genetic products, phenotypes and the environment [24]. Disease items in CTD are described by MeSH ID, which is a hierarchical vocabulary that provides a strict classification system for studying the relationships among various diseases, and the relationships between any diseases can be illustrated by a directed acyclic graph (DAG). For example, the MeSH ID of the disease “Deletion Syndrome (Partial)” was “MesH:C538288” in CTD, whose parent diseases are “Chromosome Deletion” and “Chromosome Disorders”, and the corresponding MesH ID were “MesH:D002872” and “MesH: D025063”, respectively. In order to get a detailed description of the disease, we download 12,988 diseases, including the names of diseases, multiple ID representations of the diseases, and information about their parent nodes. Furthermore, we downloaded gene-disease interactions, including 25,114,553 interactions between 46,045 genes and 7163 diseases. At the same time, disease-GO biological process interactions, including 1,727,119 interactions between 13,126 GOs and 7116 diseases were also downloaded.
Second, extract information about the miRNA. In order to accurately describe the relationship between miRNAs, we extracted as complete as possible miRNA interaction information from multiple latest databases. We obtained the miRNA-gene interaction information from experimentally verified databases, including TarBase (version 8.0) [25], miRTarBase (version 7.0) [26], miRNAMAP (version 2.0) [27], miRecord (version 4) [28]. DIANA-TarBase v8 is a reference database for indexing experimentally supported microRNA targets, has more than a decade of support in the field of non-coding RNA [25]. We downloaded 927,119 miRNA-gene interactions from the database, after the removal of non-human gene and converted the gene ID into Entrez Gene identifiers, a total of 423,392 interactions between 18,345 genes and 1084 miRNAs are retained. Meanwhile, we performed ID transformation of the genes in the miRTarBase database, deleted the null miRNAs and target genes, and finally obtained 381,088 interactions between 2599 miRNAs and 15,064 genes. Similarly, we extracted 83,071 interactions between 1135 target genes and 471 miRNAs from miRNAMAP, and obtained 1269 interactions between 767 target genes and 203 miRNAs from the miRecord. Based on miRBase [29], all of the above miRNAs were transformed into the v22 version using the R package ‘miRBaseConverter’, and the null and duplicate miRNAs were deleted. After integration, a total of 588,134 interactions between 2814 miRNAs and 18,468 genes were obtained. In addition, Lee et al. [30] integrated 21 omics data from multiple organisms by modifying bayes and used logarithmic likelihood scores to measure the probability of interaction between two genes with true functional links. To build similarity networks of genes, we downloaded the human weighted gene network data from the HumanNet database, which contained the log likelihood score of 476,399 interactions among 16,243 genes.
Third, extract interactive information of miRNA and disease. The human microRNA Disease Database (HMDD) collects large amounts of human miRNA-disease interactions from genetics, epigenetics, circulating miRNA and miRNA target interactions, and provides detailed annotation of miRNA-disease interactions [21]. In June 28, 2018, HMDD (version 3.0) [31] was also released, which provides 200.2% of human miRNA-disease interactions and has more evidence to classify. We extracted the disease information with MeSH ID or OMIM ID from HMDD v3.0, removed duplicate miRNA-disease interactions, and obtained 14,457 interactions between 1045 miRNAs and 627 diseases. To ensure all the miRNA similarity and all the disease similarity can be calculated, we delete the diseases and miRNAs not in the above two datasets, and finally got 10,561 interactions between 574 miRNAs and 579 diseases. The details of the two benchmark datasets are shown in Additional file 1.
Construction of disease semantic similarity network
In fact, most methods use MeSH descriptors to construct a directed acyclic graph of the disease, which contains common information between different diseases is used to describe the disease similarity, which leads to a sparsely similar network [16, 17]. In order to construct a more reasonable disease semantic similarity, we make full use of the various omics data to calculate the similarity of the disease. Protein-encoding genes can affect the pathogenesis of the disease to some extent [32], so disease-gene interactions also imply some features of the disease. Similarly, the gene ontology biological process of the disease is also the reflection of some characteristics of the disease. In this paper, we combine the disease-gene interactions (D-G) and disease-GO biological process interactions datasets (D-GO), and the MeSH descriptors of the disease, using the MultiSourcDSim model proposed by Lei et al. [33] to calculate the disease semantic similarity.
Based on the MeSH descriptor, a directed acyclic graph (DAG) can be used to describe the semantic relationship between diseases. Any disease d in the DAG can be expressed as DAG(d) = (d, S(d), F(d), A(d)), where S(d) and F(d), representing the set of direct child nodes and direct parent nodes of disease d, respectively, and A(d) represents the set constituted by all ancestor nodes of disease d.
First, combining the disease interaction dataset (D-G or D-GO) and DAG, the frequency FTc(d) of any disease d in the DAG can be calculated:
$$ {FT}_c(d)={f}_c(d)+\sum \limits_{d\in S(d)}{FT}_c(d) $$
(1)
where fc(d) represents the frequency of d in the interaction dataset c, it can be seen that the occurrence frequency of d in DAG is equal to the sum of the occurrence frequency of all its direct child nodes and the frequency of itself in the interaction dataset. Then, normalize the frequency of disease occurrence as follow:
$$ {PT}_c(d)=\frac{PT_c(d)}{PT_c(root)} $$
(2)
Where, PTc(root) represents the occurrence frequency of the root node in DAG. According to Eqs. 1 and 2, it can be known that 0 ≤ PTc(t) ≤ 1. Based on the more information shared, the higher the similarity. The disease similarity can be obtained:
$$ {S}_c\left({d}_1,{d}_2\right)={\displaystyle \begin{array}{c}\mathit{\operatorname{MAX}}\\ {}d\in COM\left({d}_1,{d}_2\right)\end{array}}\left(\frac{2\times \mathit{\log}\left({PT}_c(d)\right)}{\mathit{\log}\left({PT}_c\left({d}_1\right)\right)+\mathit{\log}\left({PT}_c\left({d}_2\right)\right)}\right) $$
(3)
Where, COM(d1, d2) is the set of the minimum common ancestor of the disease d1 and d2, and it is easy to see that 0 ≤ Sc(d1, d2) ≤ 1. According to D-G and D-GO, we can obtain two disease similarity networks {Sc, c = 1, 2}. After that, the clusDCA [34] was used to integrate the disease similar networks, and the integrated semantic similar network SSd was finally obtained.
Construction of miRNA functional similarity network
In order to overcome the dependence of miRNA functional similarity on known miRNA-disease interaction network, the algorithm can predict miRNAs not associated with any disease. We calculate the miRNA functional similarity by means of Luo [18] and Xiao’s [19] methods. Specifically, we used miRNA target gene interaction network and gene similarity network to calculate miRNA similarity.
First, we normalized and symmetrized the log-likelihood score data between genes downloaded from HumanNet:
$$ {S}^g\left({g}_i,{g}_j\right)=\Big\{{\displaystyle \begin{array}{c}\frac{LLS\left(i,j\right)}{{\operatorname{MAX}}_{LLS}},\kern2em LLS\left(i,j\right)\ne 0\\ {}\frac{LLS\left(j,i\right)}{{\operatorname{MAX}}_{LLS}},\kern2em LLS\left(i,j\right)=0 andLLS\left(j,i\right)\ne 0\\ {}0,\kern5.00em Otherwise\end{array}}\operatorname{} $$
(4)
Where Sg(gi, gj) represents the similarity between gene gi and gene gj, LLS(i, j) represents the log-likelihood score between gene gi and gene gj, MAXLLS represents the maximum log-likelihood score. At this point, we can define the similarity between any gene gi and any gene set G:
$$ {S}^g\left({g}_i,\mathrm{G}\right)=\underset{g_j\in \mathrm{G}}{\max}\left\{{S}^g\left({g}_i,{g}_j\right)\right\} $$
(5)
Where, Sg(gi, G) represents the similarity between gi and G. Then, we can get the functional similarity between miRNA mi and miRNA mj:
$$ {SF}_m\left({m}_i,{m}_j\right)=\frac{\sum_{g\in {G}_i}{S}^g\left(g,{G}_i\right)+{\sum}_{g\in {G}_j}{S}^g\left(g,{G}_j\right)}{\left|{G}_i\right|+\left|{G}_j\right|} $$
(6)
Where, SFm(mi, mj) represents the functional similarity between mi and mj, Gi represent the gene set associated with mi, and |Gi| represent the number of genes in the set Gi.
Kernel-based neighborhood similarity
Reasonable use of known miRNA-disease interaction information can greatly improve the performance of the model [17, 18]. In this paper, based on the known miRNA-disease interactions, we used the kernel-based neighborhood similarity (KSNS) [35] to calculate miRNA (disease) kernel neighborhood similarity. KSNS not only comprehensively utilizes the distance similarity and structural similarity of samples, but also fully excavates the nonlinear structural similarity information between samples, achieving a good prediction effect in lncRNA-protein interaction prediction. In addition, to overcome the sparse problem of the interaction matrix, a weighted k-neighborhood profile (WKNNP) algorithm was proposed by Xiao et al. [19] to preprocess the interaction matrix, achieved good results. Based on the above two points, we first use WKNNP to preprocess the known interaction matrix, and then uses KSNS to calculate the kernel neighborhood similarity of miRNA (disease).
Let the matrix X of the NM rows and ND columns represent the miRNA-disease interaction matrix, then X can be expressed as: \( \mathrm{X}=\left[{M}_1^T,{M}_2^T,\cdots, {M}_{NM}^T\right]=\left[{D}_1,{D}_2,\cdots, {D}_{ND}\right] \), where Mi is the ith row vector of X, could be regarded as the interaction profile feature of miRNA mi; Dj is the jth column vector of X, could be regarded as the interaction profile feature of disease dj.
According to the WKNNP algorithm, we make use of K-nearest neighbor feature of mi to enrich the interaction profile Mi, then the modified interaction profile \( {\hat{M}}_i \) of mi is as follows:
$$ {\hat{M}}_i=\frac{1}{Q_{m_i}}{\sum}_{k=1}^K{w}^k{M}_k $$
(7)
Where \( {Q}_{m_i}={\sum}_{m_{j\in N\left({m}_i\right)}}{SF}_m\left({m}_i,{m}_j\right) \) denotes regularization weight, and N(mi) represents the K nearest set of mi (For sake of simplicity, let K = 15 in the paper). wk is the weight coefficient of the kth neighbor, and decay factor α ∈ [0, 1] (For sake of simplicity, let α = 0.8 in the paper), It is easy to see that the more closer miRNAs have higher weight coefficients. At this point, the modified interaction profile matrix can \( {X}_M=\left[{\hat{M}}_1^T,{\hat{M}}_2^T,\cdots, {\hat{M}}_{NM}^T\right] \) be obtained through Eq. 7. Similarly, we can get the disease modified interaction profile matrix \( {X}_d=\left[{\hat{D}}_1,{\hat{D}}_2,\cdots, {\hat{D}}_{ND}\right] \). Finally, the modified interaction profile matrix X is shown as follows:
$$ \hat{X}=\max \left\{X,\frac{1}{2}\left({X}_m+{X}_d\right)\right\} $$
(8)
Now, based on the \( \hat{\mathrm{X}} \), we make use of KSNS to calculate miRNA (disease) kernel neighborhood similarity. First, we construct the K-neighboring discriminant matrix of miRNA based on the miRNA functional similarity:
$$ {C}_{i,j}=\Big\{{\displaystyle \begin{array}{c}1,\kern2em j\in N\left({m}_i\right)\\ {}0,\kern2em j\notin N\left({m}_i\right) ori=j\end{array}}\operatorname{} $$
(9)
Where N(mi) represents the set of NK nearest miRNAs of mi, NK = ⌊PN × N⌋, PN denotes neighbors proportion parameter, N is the total number of samples, ⌊∙⌋ means round down. Then weight matrix W of miRNA is as follow:
$$ {\displaystyle \begin{array}{c}\mathit{\min}\frac{1}{2}{\left\Vert \Phi (X)W-\Phi (X)\right\Vert}_F^2+\frac{\mu_1}{2}{\left\Vert W\bigodot \left(1-C\right)\right\Vert}_F^2+\frac{\mu_2}{2}{\left\Vert W\right\Vert}_F^2\\ {}s.t.{W}^Te=e\ W\ge 0\ \mathit{\operatorname{diag}}(W)=0\end{array}} $$
(10)
Where, Φ(∙) denotes kernel function, ‖∙‖F representsFrobenius norm, ⨀ is an element-by-element multiplication, μ1 is non-neighborhood control parameters, μ2 is similarity regularization parameters, e = (1, 1, ……, 1)T. The first item of constraint requires the sum of reconstruction weights of each sample to be 1, the second requires that all elements in W are non-negative, and the third term indicates that the self-similarity of miRNA is 0. Using the Lagrange multiplier method and the Karush-Kuhn-Tucker (KKT) condition, the iterative formula of W is as follows:
$$ {W}_{ij}=\frac{{\left[k\left(X,\mathrm{X}\right)+{\mu}_1W\bigodot C\right]}_{ij}}{{\left[k\left(X,\mathrm{X}\right)W+{\mu}_1W+{\mu}_2W\right]}_{ij}}{W}_{ij} $$
(11)
Where k(X, X) represents the kernel matrix of X. In this paper, we select Gaussian kernel function, which is represented as:
$$ k\left({x}_i,{x}_j\right)=\left\langle \Phi \left({x}_i\right),\Phi \left({x}_j\right)\right\rangle =\exp \left(-{\left\Vert {x}_i-{x}_j\right\Vert}^2/\upgamma \right) $$
(12)
Where k(xi, xj) is the kernel of any two samples of xi, xj. \( \upgamma =\frac{\sum {\left\Vert {x}_i\right\Vert}^2}{NM} \) represents the regularized bandwidth parameter. After that, we conducted multiple normalization operations on the weight matrix W to obtain the miRNA kernel neighborhood similarity matrix SIm, and the normalization formula is as follows:
$$ {SI}_m={D}^{-\frac{1}{2}}{\mathrm{W}}^T{D}^{-\frac{1}{2}} $$
(13)
Where, the diagonal matrix D = diag (d1, d2, …, dNM), \( {d}_j=\sum \limits_{i=1}^{NM}{W}_{i,j} \). Similarly, we can get the disease kernel neighborhood similarity SId. Then the clusDCA [34] was used to integrate the miRNA functional similarity SFm (disease semantic similarity matrix SSd) and kernel neighborhood similarity SIm (kernel neighborhood similarity SId) to obtain the final miRNA similarity matrix Sm= (disease similarity matrix Sd).
Bidirectional propagation algorithm
Based on miRNA similarity, disease similarity and known miRNA-disease interaction information, we proposed a bidirectional propagation algorithm to predict the miRNA-disease interaction score.
Let (F)NM × ND be the miRNA-disease interaction score matrix, then F can be decomposed as \( F=\left[{FM}_1^T,{FM}_2^T,\cdots, {FM}_{NM}^T\right]=\left[{FD}_1,{FD}_2,\cdots, {FD}_{ND}\right] \), Where, \( {FM}_i^T \) represents the predicted interaction score of miRNA mi with all diseases, and FDj denotes the predicted interaction score of disease dj. Based on the hypothesis that higher similarity miRNAs are more likely to be interacted with the same disease, we can get:
$$ \sum \limits_{i,j}^M{s}_{i,j}^m{\left\Vert \frac{1}{\sqrt{d_i^m}},{FM}_i,-,\frac{1}{\sqrt{d_j^m}},{FM}_j\right\Vert}^2= tr\left({F}^T\left(I-{D_m}^{-\frac{1}{2}}\bullet {S}_m\bullet {D_m}^{-\frac{1}{2}}\right)F\right) $$
(14)
Where \( {s}_{i,j}^m={\left({S}_m\right)}_{i,j} \) denotes the similarity of mi and mj. \( {d}_i^m=\sum \limits_{j=1}^{NM}{s}_{i,j}^m \), and the diagonal matrix \( {D}_m=\mathit{\operatorname{diag}}\left({d}_1^{\mathrm{m}},{d}_2^{\mathrm{m}},\cdots, {d}_{NM}^{\mathrm{m}}\right) \). Similarly for diseases, we can get:
$$ \sum \limits_{u,v}^{ND}{s}_{u,v}^d{\left\Vert \frac{1}{\sqrt{d_u^d}}{FD}_s-\frac{1}{\sqrt{d_v^d}}{FD}_t\right\Vert}^2= tr\left({F}^T\left(I-{D_d}^{-\frac{1}{2}}\bullet {S}_D\bullet {D_d}^{-\frac{1}{2}}\right)F\right) $$
(15)
Where \( {s}_{u,v}^d={\left({S}_d\right)}_{u,v} \) denotes the similarity of du and dv. \( {d}_u^d=\sum \limits_{k=1}^{ND}{s}_{u,k}^d \), and the diagonal matrix \( {D}_d=\mathit{\operatorname{diag}}\left({d}_1^d,{d}_2^d,\cdots, {d}_{ND}^d\right) \). By this stage, the bidirectional propagation algorithm can be obtained as follows:
$$ \Big\{{\displaystyle \begin{array}{c}\begin{array}{c} argmin\\ {}F\end{array}\left\{{\left\Vert F-Y\right\Vert}_F^2+\frac{\lambda_{\mathrm{m}}}{2} tr\left({F}^T{L}_mF\right)+\frac{\lambda_{\mathrm{d}}}{2} tr\left({FL}_d{F}^T\right)\right\}\\ {}\kern0.50em {L}_m=I-{D_m}^{-\frac{1}{2}}\bullet {S}_m\bullet {D_m}^{-\frac{1}{2}}\\ {}{L}_d=I-{D_d}^{-\frac{1}{2}}\bullet {S}_D\bullet {D_d}^{-\frac{1}{2}}\end{array}}\operatorname{} $$
(16)
Where \( {\left\Vert F-Y\right\Vert}_F^2 \) represents the overall prediction error, which is required to be as small as possible, λm and λd are the Laplacian regularization parameters of miRNA and disease, respectively. The derivative of Eq. 16 for F is as follows:
$$ \frac{\partial Q(F)}{F}=2\left(\mathrm{F}-\mathrm{Y}\right)+{\lambda}_m{L}_mF+{\lambda}_d{FL}_d $$
(17)
In order to speed up the optimization of the gradient algorithm, we use AdaGrad algorithm [34] to adaptively choose the gradient step size. The details of the optimization algorithm to the proposed bidirectional propagation model are described in Algorithm 1.