Materials
In this paper, we used the datasets from the study of Zhao et al. [7], namely Additional files 1, 2, 3, 4, and 5 of that study. These datasets contain 190 diseases, 111 lncRNAs and 264 miRNAs, as described below.
Known lncRNA-miRNA associations
The known lncRNA-miRNA associations were collected from starBase v2.0 [22] in February 2017. This database provides the most comprehensive set of experimentally confirmed lncRNA-miRNA interactions based on large-scale CLIP-Seq data. After eliminating duplicate values and erroneous data and removing lncRNAs not included in the DS2 dataset, we obtained the DS1 dataset, which contains 1880 known lncRNA-miRNA associations.
Known lncRNA-disease associations
The known lncRNA-disease associations were collected from 8842 known disease-lncRNA associations in the MNDR database [23] and 2934 known disease-lncRNA associations in the LncRNADisease database [24]. Because the disease names came from two different databases, diseases without any MeSH descriptors were eliminated and diseases with the same MeSH descriptors were merged. After also removing the lncRNAs not included in the lncRNA-miRNA dataset (DS1), 936 known associations between diseases and lncRNAs (DS2) remained.
Known disease-miRNA associations
The known human miRNA-disease associations were downloaded from the HMDD V2.0 database [25]. After we eliminated duplicate associations and miRNA-disease associations involving diseases or lncRNAs not contained in the DS1 or DS2 datasets, this dataset (DS3) contains 3252 high-quality miRNA-disease associations.
Method overview
In this paper, we propose a new method to infer miRNA-disease associations. The flowchart of the proposed method is illustrated in Fig. 1. The method consists of four main stages. In the first stage, we construct a tripartite graph G0 based on known miRNA-disease associations, known lncRNA-disease associations, and known miRNA-lncRNA interactions. The tripartite graph G0 is represented by three adjacency matrices, A0MD, A0ML and A0DL, where A0MD is the adjacency matrix between miRNAs and diseases, A0ML is the adjacency matrix between miRNAs and lncRNAs, and A0DL is the adjacency matrix between diseases and lncRNAs. In the second stage, to address the data imbalance problem, we apply a collaborative filtering algorithm to the tripartite graph G0 to obtain a tripartite graph Gu. The tripartite graph Gu is represented by three adjacency matrices, AuMD, AuML and A0DL, where AuMD and AuML are obtained by updating A0MD and A0ML with the collaborative filtering algorithm. In the third stage, the tripartite graph Gu is used in a resource allocation algorithm to calculate the final resource score (Rscore_final) of the miRNA candidates for each disease. In the final stage, we rank all miRNA candidates for each disease by Rscore_final in descending order, so that a candidate with a greater Rscore_final is considered more likely to be verified in the future.
Construction of a tripartite graph G0
Inspired by previous studies [19, 20] that inferred lncRNA-disease associations using a tripartite graph, we first construct a miRNA-disease-lncRNA tripartite graph G0 as follows:
Construction of known miRNA-disease association graph
Let M = {mk; k = 1,…,nm} denote the set of miRNAs and D = {dj; j = 1,…, nd} denote the set of diseases, where nm and nd represent the number of miRNAs and diseases, respectively. We build a graph MD0 based on the known miRNA-disease associations. The MD0 graph is represented by the matrix A0MD, the adjacency matrix of known miRNA-disease associations. The entry A0MD(mk, dj) is the element in the kth row and jth column of A0MD, with A0MD(mk, dj) = 1 if miRNA mk is associated with disease dj and A0MD(mk, dj) = 0 otherwise.
Construction of known miRNA-lncRNA interaction graph
In the same way, let M = {mk; k = 1,…,nm} denote the set of miRNAs and L = {li; i = 1,…, nl} denote the set of lncRNAs, where nm and nl represent the number of miRNAs and lncRNAs, respectively. We obtain the ML0 graph, which is built on known miRNA-lncRNA interactions, and its adjacency matrix A0ML. The entry A0ML(mk, li) is the element in the kth row and ith column of A0ML, with A0ML(mk, li) = 1 if miRNA mk interacts with lncRNA li and A0ML(mk, li) = 0 otherwise.
Construction of known disease-lncRNA association graph
Similarly, let D = {dj; j = 1,…, nd} denote the set of diseases and L = {li; i = 1,…,nl} denote the set of lncRNAs, where nd and nl represent the number of diseases and lncRNAs, respectively. We obtain the DL0 graph, which is built on known disease-lncRNA associations, and its adjacency matrix A0DL. The entry A0DL(dj, li) is the element in the jth row and ith column of A0DL, with A0DL(dj, li) = 1 if disease dj is associated with lncRNA li and A0DL(dj, li) = 0 otherwise.
Construction of a tripartite graph G0
By integrating the three graphs MD0, ML0 and DL0, we obtain the tripartite graph G0. As mentioned before, G0 is represented by the three adjacency matrices A0MD, A0ML and A0DL.
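As an illustration of the three adjacency matrices, the following minimal sketch builds them from lists of known pairs. The helper `build_adjacency`, the variable names, and the toy identifiers (`m1`, `d1`, `l1`, ...) are assumptions introduced here for illustration and do not come from the DS1, DS2 or DS3 files.

```python
import numpy as np

def build_adjacency(pairs, row_ids, col_ids):
    """Return a 0/1 matrix whose rows follow row_ids and columns follow col_ids."""
    row_index = {name: i for i, name in enumerate(row_ids)}
    col_index = {name: j for j, name in enumerate(col_ids)}
    A = np.zeros((len(row_ids), len(col_ids)), dtype=int)
    for r, c in pairs:
        A[row_index[r], col_index[c]] = 1  # known association or interaction
    return A

# Toy data (hypothetical identifiers, not taken from the real datasets)
mirnas = ["m1", "m2", "m3"]
diseases = ["d1", "d2"]
lncrnas = ["l1", "l2"]

A0_MD = build_adjacency([("m1", "d1"), ("m2", "d1"), ("m3", "d2")], mirnas, diseases)
A0_ML = build_adjacency([("m1", "l1"), ("m2", "l1"), ("m2", "l2")], mirnas, lncrnas)
A0_DL = build_adjacency([("d1", "l1"), ("d2", "l2")], diseases, lncrnas)
```

Rows of `A0_MD` and `A0_ML` share the same miRNA ordering, which is what later allows the two matrices to be concatenated column-wise.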
Construction of a tripartite graph Gu
In the tripartite graph G0, the numbers of known associations between miRNAs and diseases and between miRNAs and lncRNAs are small. Consequently, for any given lncRNA node li and disease node dj, the number of miRNA nodes associated with both li and dj will be very small. To alleviate this sparsity, our method uses a collaborative filtering algorithm to recommend suitable miRNA nodes to the corresponding lncRNA nodes and disease nodes, respectively. Since a recommender system may involve various input data including users and items [18], we take lncRNAs and diseases as users and miRNAs as items. From the two adjacency matrices A0ML and A0MD obtained above, we construct another adjacency matrix A0MLD = [A0ML, A0MD] by concatenating A0ML and A0MD column-wise, which is possible because both matrices have the same number of rows. Each row vector of A0MLD thus consists of the corresponding row vectors of A0ML and A0MD, while each column vector of A0MLD is identical to a column vector of A0ML or A0MD.
On the basis of A0MLD and the tripartite graph G0, we obtain a co-occurrence matrix Rm x m, in which the entry R(mk, mr) is the element in the kth row and rth column, with R(mk, mr) = 1 if and only if miRNA mk and miRNA mr have at least one common neighboring node in G0 and R(mk, mr) = 0 otherwise. The common neighboring node can be an lncRNA or a disease in G0. A similarity matrix Rnor is then calculated by normalizing Rm x m according to the following equation:
$${\mathrm{R}}^{nor}\left({m}_{k}, {m}_{r}\right)=\frac{\left|N\left({m}_{k}\right)\bigcap N({m}_{r})\right|}{\sqrt{\left|N({m}_{k})\right|*\left|N({m}_{r})\right|}}$$
(1)
where k and r are the indices of miRNAs. \(\left|N\left({m}_{k}\right)\right|\) is the number of known lncRNAs and diseases associated with mk in G0, i.e. the number of elements equal to 1 in the kth row of A0MLD. Likewise, \(\left|N\left({m}_{r}\right)\right|\) is the number of known lncRNAs and diseases associated with mr in G0, i.e. the number of elements equal to 1 in the rth row of A0MLD. ∣N(mk) ∩ N(mr)∣ is the number of known lncRNAs and diseases associated with both miRNA mk and miRNA mr in G0.
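Equation (1) can be sketched with matrix operations as follows. This is an illustrative sketch rather than the original implementation: the function name `cooccurrence_similarity` and the toy matrices are assumptions, and rows with no neighbors are given similarity 0 to avoid division by zero, a case the text does not discuss. The diagonal is kept exactly as Eq. (1) yields it.

```python
import numpy as np

def cooccurrence_similarity(A0_ML, A0_MD):
    """Normalized co-occurrence similarity between miRNAs, following Eq. (1)."""
    # A0_MLD = [A0_ML, A0_MD]: column-wise concatenation (same miRNA rows)
    A0_MLD = np.hstack([A0_ML, A0_MD]).astype(float)
    common = A0_MLD @ A0_MLD.T           # |N(mk) ∩ N(mr)| for every pair
    degree = A0_MLD.sum(axis=1)          # |N(mk)| for every miRNA
    denom = np.sqrt(np.outer(degree, degree))
    # Assumption: pairs involving an isolated miRNA get similarity 0
    return np.divide(common, denom, out=np.zeros_like(common), where=denom > 0)

# Toy matrices (hypothetical): 3 miRNAs, 2 lncRNAs, 2 diseases
A0_ML = np.array([[1, 0], [1, 1], [0, 0]])
A0_MD = np.array([[1, 0], [1, 0], [0, 1]])
R_nor = cooccurrence_similarity(A0_ML, A0_MD)
```

In this toy case miRNAs 1 and 2 share two neighbors and have 2 and 3 neighbors respectively, so their similarity is 2/sqrt(6), while miRNA 3 shares no neighbor with either and gets 0.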
Based on the similarity matrix Rnor and the adjacency matrix A0MLD, we calculate a new recommender matrix AuMLD as follows:
$$A^{u}_{MLD} = \, R^{nor} * \, A^{0}_{MLD}$$
(2)
Specifically, for a particular lncRNA li or disease dj in G0, if there is a miRNA mk satisfying A0MLD(mk, li) = 1 or A0MLD(mk, dj) = 1 in A0MLD, then we first calculate the sum of the values of all elements in the ith or jth column of AuMLD, respectively, and from it obtain the average value P. Next, if the ith or jth column of AuMLD contains a miRNA \({m}_{\theta }\) which satisfies AuMLD(\({m}_{\theta }\), li) > P or AuMLD(\({m}_{\theta }\), dj) > P, then we recommend miRNA \({m}_{\theta }\) for lncRNA li or disease dj, respectively, and add a new edge between \({m}_{\theta }\) and li or between \({m}_{\theta }\) and dj to the tripartite graph G0.
Finally, we obtain the tripartite graph Gu. It contains three graphs, MDupdate, MLupdate and DL0, and can be represented by three adjacency matrices: AuMD, AuML and A0DL. MDupdate is the updated version of MD0 after adding new edges between recommended miRNAs and diseases. MLupdate is the updated version of ML0 after adding new edges between recommended miRNAs and lncRNAs. AuMD is the adjacency matrix of the MDupdate graph; it contains 10,310 known and recommended associations and 39,850 remaining unknown associations. AuML is the adjacency matrix of the MLupdate graph.
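The recommendation step built on Eq. (2) can be sketched as below. The function name `recommend_edges` and the toy inputs are hypothetical. The sketch assumes that P is the mean of each column of the score matrix, that the comparison with P is strict, and that all original edges are retained; these are readings of the description above rather than confirmed implementation details.

```python
import numpy as np

def recommend_edges(R_nor, A0_MLD):
    """Add recommended miRNA edges to A0_MLD using Eq. (2) and the threshold P."""
    A0_MLD = np.asarray(A0_MLD)
    scores = np.asarray(R_nor, dtype=float) @ A0_MLD   # Eq. (2)
    has_known = A0_MLD.sum(axis=0) > 0                 # column has at least one known miRNA
    P = scores.mean(axis=0)                            # per-column average score
    recommended = (scores > P) & has_known             # strict comparison (assumption)
    return np.where(recommended, 1, A0_MLD)            # keep old edges, add new ones

# Toy inputs (hypothetical): 3 miRNAs; column 0 is an lncRNA, column 1 a disease
R_nor = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
A0_MLD = np.array([[1, 0],
                   [0, 0],
                   [0, 1]])
Au_MLD = recommend_edges(R_nor, A0_MLD)
```

Here the second miRNA scores 0.8 in the first column, above the column average of 0.6, so it receives a new edge. Splitting the result back at the column where A0ML ends would give the updated matrices for the lncRNA and disease parts.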
Employing resource allocation process on the tripartite graph Gu to infer miRNA-disease associations
To infer miRNA-disease associations, we apply the resource allocation algorithm to the tripartite graph Gu in the following steps:
Step 1: Calculating resource allocation between miRNAs and diseases
For a specific miRNA mk, we define the initial resources located on disease dj as:
$$fd\left( {m_{k} } \right) = A^{u}_{MD} \left( {m_{k} , \, d_{j} } \right),\quad \, j = 1,2, \ldots ,n_{d}$$
(3)
where nd is the number of diseases.
Then we calculate the resource moved back from D to M by using a weight matrix W = {wkt}nm x nm to indicate the resource allocation process between miRNAs and diseases as follows:
$$w_{kt} = \frac{1}{{\deg A_{MD}^{u} \left( {m_{k} } \right)}}*\mathop \sum \limits_{j = 1}^{{n_{d} }} \frac{{A_{MD }^{u} \left( {m_{k} , d_{j} } \right) * A_{MD }^{u} \left( {m_{t} , d_{j} } \right)}}{{\deg A_{MD}^{u} \left( {d_{j} } \right)}}$$
(4)
where \({w}_{kt}\) is the resource contribution moved from the tth node to the kth node in M, which can be understood as the similarity between miRNA mk and miRNA mt in the MDupdate graph. \(\mathit{deg}{A}_{MD}^{u}\left({m}_{k}\right)\) is the degree of miRNA mk in the MDupdate graph, i.e. the number of diseases associated with miRNA mk. Similarly, \(\mathit{deg}{A}_{MD}^{u}\left({d}_{j}\right)\) is the degree of disease dj in the MDupdate graph, i.e. the number of miRNAs associated with disease dj.
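The weight matrix of Eq. (4) can be computed in vectorized form, as in the following sketch. The function name `resource_weights` and the toy matrix are assumptions for illustration, and nodes of degree zero are assigned zero weight, a case Eq. (4) leaves undefined.

```python
import numpy as np

def resource_weights(Au_MD):
    """Weight matrix W = {w_kt} of Eq. (4) for a miRNA-by-disease 0/1 matrix."""
    A = np.asarray(Au_MD, dtype=float)
    deg_m = A.sum(axis=1)   # degree of each miRNA (number of associated diseases)
    deg_d = A.sum(axis=0)   # degree of each disease (number of associated miRNAs)
    # Assumption: a zero degree contributes zero instead of dividing by zero
    inv_m = np.divide(1.0, deg_m, out=np.zeros_like(deg_m), where=deg_m > 0)
    inv_d = np.divide(1.0, deg_d, out=np.zeros_like(deg_d), where=deg_d > 0)
    # sum_j A(k, j) * A(t, j) / deg(d_j), then scale row k by 1 / deg(m_k)
    return inv_m[:, None] * ((A * inv_d) @ A.T)

# Toy matrix (hypothetical): 3 miRNAs, 2 diseases
Au_MD = np.array([[1, 0],
                  [1, 1],
                  [0, 1]])
W = resource_weights(Au_MD)
```

A useful sanity check is that every row of W belonging to a miRNA with at least one association sums to 1, which follows directly from Eq. (4).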
Following the previous study [20], we also modify the resource allocation algorithm by considering the level of consistency between the resource contributions transferred in both directions. This captures the impact of co-selection for the pair (mk, mt), that is, between the contribution of resource from mk to mt and the contribution from mt to mk. The consistency-based resource allocation, which yields the final miRNA-disease weight matrix W’ = {w’kt}, is defined by the following equation:
$$w_{kt}^{\prime} = w_{kt} + \frac{{w_{tk} }}{{\mathop \sum \nolimits_{s = 1}^{{n_{m} }} w_{sk} }}$$
(5)
From the combination of the final miRNA-disease weight matrix W’ and the adjacency matrix AuMD, we define a final resource Rscore_ondisease_1 located on D as follows:
$$Rscore\_ondisease\_1 = W^{{\prime }} *A^{u}_{MD}$$
(6)
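Equations (5) and (6) can be sketched together as follows. The function name `consistency_scores` and the toy inputs are hypothetical, and columns of W that sum to zero are assumed to add no correction term, which the equations do not specify.

```python
import numpy as np

def consistency_scores(W, Au_MD):
    """Apply Eq. (5) to W, then Eq. (6) to obtain Rscore_ondisease_1."""
    W = np.asarray(W, dtype=float)
    col_sum = W.sum(axis=0)                      # sum over s of w_sk, one value per k
    # Entry (k, t) of the correction is w_tk / sum_s w_sk
    correction = np.divide(W.T, col_sum[:, None],
                           out=np.zeros_like(W), where=col_sum[:, None] > 0)
    W_prime = W + correction                     # Eq. (5)
    return W_prime @ np.asarray(Au_MD, dtype=float)   # Eq. (6)

# Toy inputs (hypothetical): W as obtained from Eq. (4) for this Au_MD
W = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
Au_MD = np.array([[1, 0],
                  [1, 1],
                  [0, 1]])
Rscore_ondisease_1 = consistency_scores(W, Au_MD)
```

The result has one row per miRNA and one column per disease, matching the shape of AuMD.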
Step 2: Calculating resource allocation between diseases and lncRNAs
Following the resource allocation between genes and diseases in TPGLDA [20], the same initial resources located on M nodes are allocated from nodes in M to nodes in D and then moved back, and the final resource matrix Rscore_ondisease_2 located on D nodes is given by:
$$Rscore\_ondisease\_2 = \mathop \sum \limits_{s = 1}^{{n_{l} }} \frac{{A_{DL }^{0} \left( {d_{j} , l_{s} } \right) }}{{\deg A_{DL}^{0} \left( {l_{i} } \right)}}*\mathop \sum \limits_{k = 1}^{{n_{d} }} \frac{{A_{MD}^{u} \left( { m_{k} , d_{j} } \right)}}{{\deg A_{DL}^{0} \left( {d_{j} } \right)}}$$
(7)
where \(\mathrm{deg}{A}_{DL}^{0}\left({l}_{i}\right)={\sum }_{j=1}^{{n}_{d}}{A}_{DL}^{0}({d}_{j}, {l}_{i})\) is the number of related diseases for lncRNA li or the degree of lncRNA li in DL0 graph. \(\mathrm{deg}{A}_{DL}^{0}\left({d}_{j}\right)\)=\({\sum }_{i=1}^{{n}_{l}}{A}_{DL}^{0}({d}_{j}, {l}_{i})\) is the number of related lncRNAs for disease dj or the degree of disease dj in DL0 graph.
Step 3: Calculating the final resource score Rscore_final to infer the potential disease-related miRNAs
We calculate the final resource score Rscore_final which is used to measure latent disease-related miRNAs as follows:
$$Rscore\_final = \gamma * Rscore\_ondisease\_1 + \, \left( {1 - \gamma } \right) \, *Rscore\_ondisease\_2$$
(8)
where γ is a tunable parameter with value in [0, 1]. Our model achieves the best prediction performance when γ = 0.9.
Ranking all candidate miRNAs’ Rscores for each disease in descending order
Finally, we sort all candidate miRNAs for each disease by Rscore_final in descending order, so that a candidate with a higher score is considered more likely to be verified in the future.
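The combination in Eq. (8) and the final ranking can be sketched as follows. The function name `rank_candidates` and the toy score matrices are hypothetical. The sketch assumes that both score matrices have shape miRNAs by diseases and that the candidates for a disease are the miRNAs without a known association to it; the text does not state the latter explicitly.

```python
import numpy as np

def rank_candidates(score1, score2, known, gamma=0.9):
    """Combine two score matrices (Eq. 8) and rank candidate miRNAs per disease."""
    final = gamma * score1 + (1 - gamma) * score2        # Eq. (8)
    rankings = {}
    for j in range(final.shape[1]):
        # Assumption: candidates are miRNAs with no known link to disease j
        candidates = np.flatnonzero(known[:, j] == 0)
        order = np.argsort(-final[candidates, j], kind="stable")  # descending
        rankings[j] = candidates[order].tolist()
    return final, rankings

# Toy inputs (hypothetical): 3 miRNAs, 2 diseases
score1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.4]])
score2 = np.array([[0.1, 0.3], [0.6, 0.2], [0.5, 0.9]])
known = np.array([[1, 0], [0, 1], [0, 0]])
Rscore_final, rankings = rank_candidates(score1, score2, known)
```

With γ = 0.9 the first score matrix dominates the result, so the ranking for each disease mostly follows Rscore_ondisease_1 and the second term acts as a small correction.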