Prediction of circRNA-disease associations based on inductive matrix completion
BMC Medical Genomics volume 13, Article number: 42 (2020)
Abstract
Background
Currently, numerous studies indicate that circular RNA (circRNA) is associated with various complex human diseases. Because identifying disease-related circRNAs in vivo is time- and labor-consuming, feasible and effective computational methods for predicting circRNA-disease associations merit further study.
Results
Here, we present a new method called SIMCCDA (Speedup Inductive Matrix Completion for CircRNA-Disease Associations prediction) to predict circRNA-disease associations. Based on known circRNA-disease associations, circRNA sequence similarity, disease semantic similarity, and the computed Gaussian interaction profile kernel similarity, we used speedup inductive matrix completion to construct the model. The proposed SIMCCDA method obtains an area under the ROC curve (AUC) of 0.8465 with leave-one-out cross validation on a dataset that combines three databases (circRNA disease, circ2Disease and circR2Disease). Our method surpasses other state-of-the-art models in predicting circRNA-disease associations. Furthermore, we conducted case studies on breast cancer, stomach cancer and colorectal cancer for further performance evaluation.
Conclusion
All the results show the reliable prediction ability of SIMCCDA. We anticipate that SIMCCDA could be utilized to facilitate further developments in the field and follow-up investigations by biomedical researchers.
Background
As an endogenous non-coding RNA, circular RNA (circRNA) is distinct from linear RNA. The largest difference is that circRNA does not possess terminal structures (i.e., 5′ caps and 3′ poly-A tails) and is covalently closed to form a loop structure [1]. Such a loop structure makes circRNA resistant to degradation by RNA exonucleases and gives it a more stable biological effect than the corresponding linear structure [2, 3].
Although circRNA was discovered as early as the 1970s, it was considered ‘junk’ RNA [4]. Recently, circRNA has been re-recognized and has gradually gained attention. CircRNA is involved in numerous important biological functions, especially regulatory functions [5]. Accumulating evidence has clearly demonstrated that changes in circRNA play an important role in the development of various pathological conditions and exhibit a significant correlation with diseases, especially cancer. For example, the circRNA CDR1as is an inhibitor of miR-7, which is known to be involved in various diseases, such as neurodegenerative diseases, atherosclerosis and breast cancer [6]. Therefore, circRNA is thought to be a promising disease biomarker and treatment target [5]. Analysis of existing circRNA-disease associations is necessary to help predict other potential associations, understand the molecular mechanisms of human disease, and identify biomarkers for disease diagnosis, treatment, and prevention at the circRNA level [7].
To date, an increasing number of databases of experimentally verified or reported circRNA-disease associations are available, such as circR2Disease [7], circRNA disease [8], circ2Disease [9], and circ2Traits [10]. However, experimental methods are too expensive and time-consuming to yield a large set of validated circRNA-disease associations. Developing computational methods to predict novel circRNA-disease associations has attracted considerable attention, as they can effectively decrease the time and cost of biological experiments. Nevertheless, few computational methods are available for predicting circRNA-disease associations. Lei et al. [11] developed a method for predicting circRNA-disease associations based on a path weighted model, and Fan et al. [12] proposed the KATZHCDA method using the KATZ model on heterogeneous networks. However, these methods predict potential associations using a single database, which is not enough to demonstrate the stability of the model. Moreover, it remains challenging to achieve strong performance in predicting circRNA-disease associations.
In this work, we propose a new method called SIMCCDA (Speedup Inductive Matrix Completion for CircRNA-Disease Associations prediction), which treats the prediction of circRNA-disease associations as a recommendation system problem. To the best of our knowledge, we are the first to apply the recommendation system approach inductive matrix completion (IMC) [13,14,15] to predict circRNA-disease associations. This method has been applied to various bioinformatics problems, such as drug-target interactions [16], drug repositioning [17], lncRNA (long non-coding RNA)-disease [18] and miRNA (microRNA)-disease associations [19]. We model the circRNA-disease association prediction problem as a recommendation task and solve it using speedup IMC [20]. Three databases, circRNA disease, circ2Disease and circR2Disease, are used as our raw data in this study. We then perform data screening, generate three corresponding sub-datasets (Dataset1, Dataset2 and Dataset3), and combine them into a total dataset (named TotalCircRD1). We first calculate circRNA sequence similarity and disease semantic similarity in these four datasets. Next, these two types of similarity are combined with the Gaussian interaction profile kernel similarity to generate new circRNA and disease similarities. Primary feature vectors of the similarity matrices are extracted by principal component analysis (PCA). The final model based on IMC is built for predicting circRNA-disease associations.
Leave-one-out cross validation (LOOCV) is used to examine the performance of our method. The optimal AUC on TotalCircRD1 is 0.8465. The AUC results on the three datasets are 0.8682 (Dataset1), 0.8303 (Dataset2) and 0.8509 (Dataset3), respectively. To further evaluate the performance of the proposed method, we rank the predictions and select the top 30 of each dataset to determine how many of them are verified associations. We also conduct case studies on breast cancer, stomach cancer and colorectal cancer to support our predictions. Finally, we compare our method with KATZHCDA, and the prediction results indicate that our method outperforms the previous method in predicting circRNA-disease associations. In summary, the proposed SIMCCDA method is able to predict circRNA-disease associations and can help guide future biomedical and clinical experiments.
Methods
Model overview
Here, we apply IMC with feature vectors to build the model called SIMCCDA. In addition, we add a linear Bregman iteration to speed up the calculation of the final score matrix. The flowchart is presented in Fig. 1. A_{ij} = 1 indicates that circRNA circ_{i} and disease d_{j} are associated, whereas A_{ij} = 0 indicates that their association is currently unknown. Given a known circRNA-disease association matrix A ∈ ℝ^{m × n} together with circRNA sequences and disease DOIDs (disease ontology identities), we obtain the circRNA and disease similarities, respectively. Then, PCA is employed to extract primary feature vectors from the acquired similarities. Finally, we construct the model with IMC based on the above information to predict circRNA-disease associations.
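As a small illustration of the association matrix A described above, the following Python sketch builds a binary matrix from a list of known (circRNA, disease) pairs and reports its density. The function name `build_association_matrix` and the placeholder identifiers (`circ_A`, `disease_X`, and so on) are hypothetical and do not come from the datasets used in this study.

```python
import numpy as np


def build_association_matrix(pairs):
    """Return (A, circRNA names, disease names) for a list of known pairs.

    A[i, j] = 1 marks a known association; 0 marks an unknown one.
    """
    circrnas = sorted({c for c, _ in pairs})
    diseases = sorted({d for _, d in pairs})
    row = {name: i for i, name in enumerate(circrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    a = np.zeros((len(circrnas), len(diseases)))
    for c, d in pairs:
        a[row[c], col[d]] = 1.0
    return a, circrnas, diseases


# Placeholder pairs, for illustration only.
known_pairs = [
    ("circ_A", "disease_X"),
    ("circ_B", "disease_X"),
    ("circ_B", "disease_Y"),
    ("circ_C", "disease_Z"),
]
A, circ_names, disease_names = build_association_matrix(known_pairs)
density = A.mean()  # fraction of entries that are known associations
```

The density computed here corresponds to the share of known associations among all circRNA-disease pairs, which is the quantity that makes real association matrices so sparse.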
Human circRNA-disease association data
We use three databases, circRNA disease, circ2Disease and circR2Disease, all of which include known human circRNA-disease associations. All data were downloaded before September 2018. The initial information regarding each downloaded dataset is as follows: the first database, circRNA disease, contains 354 circRNA-disease associations (including 330 circRNAs and 48 diseases); the second database, circ2Disease, includes 273 circRNA-disease associations (including 249 circRNAs and 61 diseases); and the third database, circR2Disease, includes 739 associations (including 661 circRNAs and 100 diseases). CircRNA sequence information and disease DOIDs were obtained by matching against the circBase [21] and Disease Ontology [22] (DO) databases, respectively. Based on the above data processing, we generate the final three datasets (Dataset1, Dataset2 and Dataset3). These datasets are merged, with duplicate associations removed, to obtain TotalCircRD1. Table 1 lists the detailed statistics of the four datasets.
The uncompleted associations in the datasets involve circRNAs without sequences or diseases without DOIDs. Given that the calculation of circRNA sequence similarity requires the circRNA sequence and the calculation of disease similarity requires the disease DOID, the preceding datasets exclude the uncompleted associations. To assess whether these uncompleted associations influence the prediction performance, we added several uncompleted associations to form four new datasets (Dataset4, Dataset5, Dataset6 and TotalCircRD2) based on Dataset1, Dataset2, Dataset3 and TotalCircRD1, respectively (Additional file 1: Table S1).
CircRNA sequence similarity
The sequence information of all the corresponding circRNAs in the aforementioned databases is obtained from circBase, and the Levenshtein distance [23] is used to calculate the similarity between each pair of circRNA sequences. As a string metric for measuring the difference between two strings, the Levenshtein distance is the minimum cost of the single-character edits (insertions, deletions or replacements) required to change one string into the other. In the present study, the editing costs of insertion and deletion are both 1, and the editing cost of replacement is 2. Formula (1) gives the similarity of two circRNA sequences:
where dist represents the minimum editing cost of converting the circRNA circ_{i} sequence to the circRNA circ_{j} sequence, and len(∙) represents the length of circRNA sequence.
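The weighted edit cost and the resulting similarity can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study. In particular, the normalisation \( 1-\frac{dist}{len\left({circ}_i\right)+len\left({circ}_j\right)} \) is an assumption: it is chosen because, with a replacement cost of 2, dist can never exceed the sum of the two lengths, so the similarity stays in [0, 1]. The function names `edit_cost` and `sequence_similarity` are hypothetical.

```python
def edit_cost(a: str, b: str, ins: int = 1, dele: int = 1, sub: int = 2) -> int:
    """Minimum weighted edit cost to turn `a` into `b` (two-row dynamic programming)."""
    prev = [j * ins for j in range(len(b) + 1)]  # cost of building b[:j] from ""
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]  # cost of deleting a[:i] to reach ""
        for j, cb in enumerate(b, start=1):
            replace_or_keep = prev[j - 1] + (0 if ca == cb else sub)
            curr.append(min(prev[j] + dele, curr[j - 1] + ins, replace_or_keep))
        prev = curr
    return prev[-1]


def sequence_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - dist / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_cost(a, b) / total


# Toy sequences, for illustration only.
same = sequence_similarity("ACGT", "ACGT")       # identical strings
one_deletion = sequence_similarity("ACGT", "AGT")  # one deletion, cost 1
```

The two-row table keeps memory linear in the length of the shorter input, which matters because circRNA sequences can be long; the running time remains proportional to the product of the two lengths.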
Disease semantic similarity
We use DOSim [24] in the DO-based R package DOSE to calculate the disease semantic similarity with the Wang measure [25]. The detailed formula is as follows:
For a given disease d_{i}, \( {T}_{d_i} \) is the ancestor term set of term d_{i} (including d_{i} itself). \( {S}_{d_i}(t) \) is defined as the contribution score of disease t (\( t\in {T}_{d_i} \)) to disease d_{i}. It can be expressed by the following formula:
Here, w_{e} is the semantic contribution factor of edge e, where e belongs to the set of edges connecting d_{i} and its ancestor \( {T}_{d_i} \). In DOSim, we set w_{e} = 0.7.
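The Wang measure can be illustrated on a toy ontology. In its usual form, the contribution score is \( {S}_{d_i}\left({d}_i\right)=1 \) and \( {S}_{d_i}(t)=\max \left\{{w}_e\cdot {S}_{d_i}\left({t}^{\prime}\right)\right\} \) over the children t′ of t, and the similarity of two diseases is the sum of both contribution scores over their shared ancestors divided by the sum of all contribution scores of both diseases. The Python sketch below follows that form. The three-level ontology and the function names `s_values` and `wang_similarity` are hypothetical, and the sketch is not the DOSE implementation.

```python
from typing import Dict, List


def s_values(term: str, parents: Dict[str, List[str]], w: float = 0.7) -> Dict[str, float]:
    """Contribution scores of `term` and all its ancestors (DAG given as child -> parents)."""
    scores = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            candidate = w * scores[node]
            # Keep the maximum contribution over all paths to this ancestor.
            if candidate > scores.get(parent, 0.0):
                scores[parent] = candidate
                frontier.append(parent)
    return scores


def wang_similarity(d1: str, d2: str, parents: Dict[str, List[str]], w: float = 0.7) -> float:
    """Shared-ancestor contributions divided by the total contributions of both terms."""
    s1 = s_values(d1, parents, w)
    s2 = s_values(d2, parents, w)
    shared = set(s1) & set(s2)
    numerator = sum(s1[t] + s2[t] for t in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))


# Hypothetical toy ontology: two cancers sharing the ancestors "cancer" and "disease".
toy_parents = {
    "breast cancer": ["cancer"],
    "stomach cancer": ["cancer"],
    "cancer": ["disease"],
}
similarity = wang_similarity("breast cancer", "stomach cancer", toy_parents)
```

In this toy case each disease has contribution scores 1, 0.7 and 0.49 for itself, "cancer" and "disease", so the similarity is 2.38 / 4.38.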
Gaussian interaction profile kernel similarity for circRNA and disease
Under the assumption that similar circRNAs tend to be associated with similar diseases, the Gaussian interaction profile kernel similarity is computed from the known circRNA-disease association datasets. Inspired by van Laarhoven et al. [26], we calculate the circRNA and disease similarity using the Gaussian interaction profile kernel on the four datasets. Equations (4) and (5) determine the similarity between circ_{i} and circ_{j}, where m is the number of circRNAs, IP(circ_{i}) is the interaction profile of circ_{i} (its associated disease set), and γ_{c} is the parameter that regulates the kernel bandwidth.
The Gaussian interaction profile kernel similarity of diseases d_{i} and d_{j} is defined analogously by equations (6) and (7), where n is the number of diseases:
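A minimal Python sketch of the Gaussian interaction profile kernel is given below. It assumes the usual form from van Laarhoven et al., \( \exp \left(-\gamma {\left\Vert IP\left({circ}_i\right)-IP\left({circ}_j\right)\right\Vert}^2\right) \), with the bandwidth γ set to a constant divided by the mean squared norm of the interaction profiles. The constant `gamma_prime = 1.0`, the function name `gip_kernel` and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def gip_kernel(profiles: np.ndarray, gamma_prime: float = 1.0) -> np.ndarray:
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm  # bandwidth normalised by the average profile norm
    # Squared Euclidean distances between every pair of rows.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Toy association matrix: 3 circRNAs (rows) by 4 diseases (columns).
A = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
])
circ_kernel = gip_kernel(A)       # circRNA-circRNA similarity, rows of A as profiles
disease_kernel = gip_kernel(A.T)  # disease-disease similarity, columns of A as profiles
```

Passing the transposed matrix reuses the same function for diseases, which mirrors how the disease kernel is defined in the same way as the circRNA kernel.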
Integrated similarity for circRNA and disease
Based on the previously defined circRNA sequence similarity, disease semantic similarity and Gaussian interaction profile kernel similarities, the integrated circRNA similarity matrix CS and the disease similarity matrix DS are calculated using the following equations (8) and (9):
Extract primary feature vectors
To remove the similarity redundancy, we use principal component analysis (PCA) to extract the primary feature vectors from integrated similarity, CS and DS. In this method, based on the dominating energy strategy [27], we use singular value decomposition (SVD) to perform PCA and formulas (10) and (11) to obtain the primary feature vectors of circRNA and disease similarity.
In the above formulas, S_{c} and S_{d} are the singular values of circRNA and the disease similarity matrix, respectively. α_{c} and α_{d} are adjusted parameters to obtain optimal results. In this study, Dataset2, Dataset3 and TotalCircRD1 share the parameters α_{c} =0.6 and α_{d} =0.9, whereas the parameters of Dataset1 are α_{c} =0.7 and α_{d} =0.9. Detailed adjustment work of α_{c} and α_{d} is discussed in the Results section.
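The feature extraction step can be sketched in Python as follows. This is a sketch under stated assumptions rather than the exact procedure of formulas (10) and (11): it assumes that the dominating energy strategy keeps the smallest number k of leading singular values whose sum reaches a fraction α of the sum of all singular values, and that the feature vectors are the first k left singular vectors scaled by their singular values. The function name `primary_features` and the toy matrix are hypothetical.

```python
import numpy as np


def primary_features(similarity: np.ndarray, alpha: float) -> np.ndarray:
    """Keep the leading singular directions that carry a fraction `alpha` of the energy."""
    u, s, _ = np.linalg.svd(similarity, full_matrices=False)
    energy = np.cumsum(s) / np.sum(s)  # cumulative share of the singular values
    # Smallest k whose cumulative share is at least alpha (clamped for rounding at alpha = 1).
    k = min(int(np.searchsorted(energy, alpha)) + 1, len(s))
    return u[:, :k] * s[:k]  # assumed scaling: left singular vectors times singular values


# Toy similarity matrix with singular values 6, 3 and 1.
toy_similarity = np.diag([6.0, 3.0, 1.0])
features_low = primary_features(toy_similarity, alpha=0.5)    # keeps 1 component
features_high = primary_features(toy_similarity, alpha=0.95)  # keeps 3 components
```

A larger α keeps more components, so the feature dimensions f_c and f_d grow with α_{c} and α_{d}.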
Model construction
In this study, we formulate circRNA-disease association prediction as a recommendation system problem. Generally, a recommendation system is an information filtering system that seeks to predict a user’s preference for a certain item based on partially known preference information. Here, we use the recommendation system method IMC [15] to identify circRNAs for a disease based on validated circRNA-disease associations.
As the matrix density in the last column of Table 1 shows, the association matrix is very sparse. Only a small amount of experimental association data is available, owing to the structural complexity of circRNAs and their long-ignored biological functions, and the available data are still at an early stage in scale. We therefore fill in the unknown associations between circRNAs and diseases through IMC to enhance the quality of our data. The advantage is that IMC can solve matrix completion problems using a relatively small set of known information. The detailed process of IMC is described below. Let A denote the human circRNA-disease association matrix. We assume that the row vectors of A lie in the subspace spanned by the column vectors of D (disease feature vectors), and that the column vectors of A lie in the subspace spanned by the column vectors of C (circRNA feature vectors). The problem can be defined as:
\[ \underset{Z}{\min}\ \frac{1}{2}{\left\Vert {\mathfrak{R}}_{\varOmega}\left( CZ{D}^T-A\right)\right\Vert}_F^2+\lambda {\left\Vert Z\right\Vert}_{\ast}\kern2em (12) \]
where Z is the objective matrix to complete A, CZD^{T} is the final scoring matrix based on the association matrix and the similarity matrix, Ω represents known association sets, ‖∙‖_{∗} is the nuclear norm defined as the sum of the singular values, λ is the regularization parameter controlling the extent of the nuclear norm (here we set λ to 1), and ‖∙‖_{F} is the Frobenius norm of the matrix.
Representing f(Z) as \( \frac{1}{2}{\left\Vert {\mathfrak{R}}_{\varOmega}\left( CZ{D}^T-A\right)\right\Vert}_F^2 \), formula (12) can be expressed as:
\[ \underset{Z}{\min}\ f(Z)+\lambda {\left\Vert Z\right\Vert}_{\ast}\kern2em (13) \]
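For any given \( Y\in {\mathbb{R}}^{f_c\times {f}_d} \), the following quadratic approximation of f(Z) at Y can be considered:
\[ f(Z)\approx f(Y)+\left\langle \nabla f(Y),Z-Y\right\rangle +\frac{\tau }{2}{\left\Vert Z-Y\right\Vert}_F^2 \]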
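where \( \nabla f(Y)={C}^T{\mathfrak{R}}_{\Omega}\left( CY{D}^T-A\right)D \) is the gradient of f(Z) at Y, 〈∙,∙〉 represents the matrix inner product, and τ is a proximal parameter for estimating the second-order gradient ∇^{2}f(Y). Accordingly, substituting this approximation into formula (13) and dropping the terms that do not depend on Z converts the minimization into the following formula:
\[ \underset{Z}{\min}\ \frac{\tau }{2}{\left\Vert Z-\left(Y-\frac{1}{\tau}\nabla f(Y)\right)\right\Vert}_F^2+\lambda {\left\Vert Z\right\Vert}_{\ast} \]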
Then, we use the accelerated proximal gradient singular value thresholding algorithm [28] for h iterations to obtain Z [29].
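The optimisation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain accelerated proximal gradient loop with singular value thresholding and omits the linear Bregman speed-up. The function names (`svt`, `imc_objective`, `imc_apg`), the 0/1 `mask` standing in for Ω, and the choice of τ as the bound \( {\left\Vert C\right\Vert}_2^2{\left\Vert D\right\Vert}_2^2 \) on the Lipschitz constant of ∇f are all assumptions introduced here. The loop stops when the relative change of the objective value between two iterations falls below a tolerance.

```python
import numpy as np


def svt(matrix: np.ndarray, threshold: float) -> np.ndarray:
    """Singular value thresholding: shrink every singular value by `threshold`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - threshold, 0.0)) @ vt


def imc_objective(z, a, c, d, mask, lam):
    """0.5 * ||mask * (C Z D^T - A)||_F^2 + lam * ||Z||_*."""
    residual = mask * (c @ z @ d.T - a)
    nuclear = np.linalg.svd(z, compute_uv=False).sum()
    return 0.5 * np.sum(residual ** 2) + lam * nuclear


def imc_apg(a, c, d, mask, lam=1.0, max_iter=500, tol=1e-5):
    """Accelerated proximal gradient for the IMC problem (no Bregman speed-up)."""
    # Assumed step parameter: upper bound on the Lipschitz constant of grad f.
    tau = (np.linalg.norm(c, 2) * np.linalg.norm(d, 2)) ** 2
    z = np.zeros((c.shape[1], d.shape[1]))
    y, t = z.copy(), 1.0
    prev_obj = imc_objective(z, a, c, d, mask, lam)
    for _ in range(max_iter):
        grad = c.T @ (mask * (c @ y @ d.T - a)) @ d
        z_new = svt(y - grad / tau, lam / tau)            # proximal step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # momentum schedule
        y = z_new + ((t - 1.0) / t_new) * (z_new - z)
        z, t = z_new, t_new
        obj = imc_objective(z, a, c, d, mask, lam)
        if prev_obj > 0 and abs(1.0 - obj / prev_obj) < tol:
            break
        prev_obj = obj
    return z


# Toy problem: identity features, every entry treated as observed.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C, D, M = np.eye(3), np.eye(2), np.ones_like(A)
Z = imc_apg(A, C, D, M, lam=1.0)
scores = C @ Z @ D.T  # final scoring matrix C Z D^T
```

With identity features and a full mask, the problem reduces to a single singular value thresholding of A, which makes the toy case easy to check by hand; with real feature matrices the same loop runs for many iterations.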
To examine the relationship between the objective function value and the number of iterations, we divide the circRNAs into several categories according to their chromosomal location and then randomly select one from each category to view the trend of the curve. Additional file 1: Figure S1 shows that the value of the objective function decreases as the number of iterations increases. When the gap between the objective function values of two consecutive iterations is sufficiently small, i.e. \( 1-\frac{{objective\ value}_k}{{objective\ value}_{k-1}}<{10}^{-5} \), the iterative process ends.
Results
LOOCV
To assess the predictive accuracy of SIMCCDA, we applied the leave-one-out cross validation (LOOCV) framework to the known circRNA-disease associations. LOOCV is used in this study because it is the current common practice in this field (prediction of lncRNA/miRNA/circRNA-disease associations) [30,31,32] for measuring model performance. For a disease d_{i}, each known circRNA association corresponding to the disease was left out in turn as a test sample. The other known associations were used as training samples, and each pair without a known association was regarded as a candidate sample. Within the set of candidate samples and the test sample, the test sample was deemed a positive sample, and the others were negative samples. After running the model, the probabilities of associations between candidate samples and disease d_{i} were calculated. Among these probabilities, we took the highest value as the final score of each candidate sample. Finally, we calculated the sensitivity and specificity as follows:
\[ Sensitivity=\frac{TP}{TP+ FN} \]
\[ Specificity=\frac{TN}{TN+ FP} \]
where TP denotes true positives, FP false positives, TN true negatives, and FN false negatives.
A Receiver Operating Characteristics (ROC) curve is drawn based on the LOOCV result. The X-axis of the ROC graph is 1 − specificity, and the Y-axis is the sensitivity. From the ROC curve, the Area Under ROC Curve (AUC) can be calculated as an evaluation measure for the model.
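The ROC construction can be sketched in Python as follows. The sketch sweeps a threshold over the scores, computes sensitivity and 1 − specificity at each threshold, and integrates the curve with the trapezoidal rule. The function names `roc_points` and `auc`, as well as the toy scores and labels, are hypothetical, and both classes must be present for the ratios to be defined.

```python
import numpy as np


def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs for every distinct score threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fpr, tpr = [0.0], [0.0]
    for threshold in np.unique(scores)[::-1]:  # from the strictest threshold down
        predicted = scores >= threshold
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        tn = np.sum(~predicted & (labels == 0))
        tpr.append(tp / (tp + fn))  # sensitivity
        fpr.append(fp / (fp + tn))  # 1 - specificity
    return np.array(fpr), np.array(tpr)


def auc(scores, labels) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    fpr, tpr = roc_points(scores, labels)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy ranking: two positives (label 1) and two negatives (label 0).
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

In the toy ranking, three of the four positive-negative pairs are ordered correctly, so the area is 0.75.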
The effect of adjusting parameters on the prediction result
In the PCA section of the Methods, two parameters α_{c} and α_{d} were included, which represent the percentage of singular values retained for the circRNA and disease similarity matrices, respectively. We tried values between 0.1 and 1 for α_{c} and α_{d}, with a step size of 0.1. The results of the parameterization on TotalCircRD1 are presented in Fig. 2, and results for Dataset1, Dataset2 and Dataset3 are presented in Additional file 1: Figures S2-S4. As noted in Fig. 2, as α_{c} increases, the AUC is initially stable and then generally declines. The results are consistent when α_{d} = 0.1 or α_{d} = 0.2. As α_{d} increases, the AUC gradually increases, but the growth rate is slow. The optimal parameters for Dataset2, Dataset3 and TotalCircRD1 are all α_{c} = 0.6 and α_{d} = 0.9, whereas the optimal parameters for Dataset1 are α_{c} = 0.7 and α_{d} = 0.9. LOOCV-based AUC results for the four datasets with optimal parameters are shown in Fig. 3b. The results of our model on the four datasets are at a solid level, and the gap between the maximum and minimum AUC values across the four datasets is less than 4 percentage points (0.8303 to 0.8682), which indicates that our model is robust. Figure 3a shows the PR (Precision-Recall) curves on the four datasets, which follow the same trend as the ROC curves. Figure 4 presents the number of experimentally validated associations among the top 30 associations predicted by our model on the four datasets. Additional file 1: Table S2 shows the predicted results for the top 10, 30, 50 and 100. Whether the top 10, top 30, top 50, or top 100 is considered, the trends are similar. For the sake of convenience, we only show the results of the top 30 in this work. Based on the above optimal parameters, all of the top 30 predictions were known circRNA-disease associations on Dataset2, Dataset3 and TotalCircRD1, and 26 of the top 30 were known associations on Dataset1. This shows that our results are optimal under these parameters, and that the four unknown associations in Dataset1 may be potential associations, as examined in the subsequent analysis.
In addition, we added weights to each part of the integrated similarity to see how the performance would be affected. We added weights (ranging from 0 to 1) to Sim_{lev}(circ_{i}, circ_{j}) and Gkl(circ_{i}, circ_{j}) in equations (8) and (9), respectively. For different weightings of the circRNA and disease similarities, the final results were obtained by combining the two pairs. Additional file 1: Figure S5 shows that the combinations of different similarity weights yield similar results for the models obtained on different datasets. Therefore, our final model uses equations (8) and (9) to calculate the circRNA similarity and the disease similarity, respectively.
The effect of uncompleted associations
The parameters α_{c} and α_{d} were adjusted in the same manner as described above, and the optimal parameters were selected to calculate the AUC on the Dataset4, Dataset5, Dataset6 and TotalCircRD2 datasets, as presented in Fig. 3c and d. The AUC scores of the newly added datasets (Fig. 3d) are slightly reduced compared with the initial datasets (Fig. 3b). The reason is that most of the newly added circRNAs are involved in only one disease, which makes the final association matrix sparser than the previous one. For example, circBANP is only associated with colorectal cancer and is not associated with other diseases. Adding it contributes one known association (circBANP with colorectal cancer) but also adds unknown associations between circBANP and every other disease, as observed from the matrix density columns of Additional file 1: Table S1. In summary, uncompleted associations exhibit a minimal effect on the results and only slightly reduce the performance of predictions.
The above results show that the sparseness of the dataset has little effect on the prediction results. However, if the association matrix is too sparse, the final prediction results will still be affected, so our method presupposes that the association matrix is not too sparse. We conducted the following experiment to explore how varying the sparsity of the datasets affects the overall performance. Since the final result depends on the dataset, we performed sparsity processing on each dataset separately, reducing the matrix density by 0.002 at each step. Additional file 1: Figure S6 shows that the result changes little when the density of Dataset1 is 0.015, but the performance starts to drop significantly at a density of 0.013. Similarly, for the other three datasets (Dataset2, Dataset3, TotalCircRD1), the performance starts to drop significantly at densities of 0.013, 0.009, and 0.007, respectively.
Comparison with other methods
Two methods are currently available for predicting circRNA-disease associations: PWCDA [11] and KATZHCDA [12]. Because the PWCDA method sets circRNA and disease similarity values below 0.5 to 0, and most of the similarities in our datasets are below 0.5, we only compared our method with KATZHCDA. KATZHCDA is a computational model based on the KATZ measure that constructs heterogeneous networks from circRNA expression profiles, disease phenotype similarity and Gaussian interaction profile kernel similarity. Here, we ran the KATZHCDA model on the same eight datasets to obtain its predictions. The results for six datasets (Dataset1, Dataset2, Dataset3, Dataset4, Dataset5 and Dataset6) are presented in Additional file 1: Figure S7, and the TotalCircRD1 and TotalCircRD2 results are presented in Fig. 5a and c. As shown in Additional file 1: Figure S7, both the PR curve and the ROC curve indicate that our model outperforms KATZHCDA. The KATZHCDA AUC scores on four datasets (Dataset1, Dataset2, Dataset3 and TotalCircRD1) are 0.7604, 0.7458, 0.7442 and 0.7558, respectively. In comparison, our model obtains an average AUC of 0.8490, which is more than 9 percentage points higher than KATZHCDA. The resulting top 30 predicted associations were also analyzed, demonstrating that our top 30 predictions are superior to those of KATZHCDA (Fig. 5b, d).
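The averages behind this comparison can be checked directly from the AUC values reported in this paper. The short Python sketch below only repeats that arithmetic; the variable names are chosen here for illustration.

```python
# AUC values as reported in the text for the four datasets.
simccda_auc = {"Dataset1": 0.8682, "Dataset2": 0.8303, "Dataset3": 0.8509, "TotalCircRD1": 0.8465}
katzhcda_auc = {"Dataset1": 0.7604, "Dataset2": 0.7458, "Dataset3": 0.7442, "TotalCircRD1": 0.7558}

mean_simccda = sum(simccda_auc.values()) / len(simccda_auc)
mean_katzhcda = sum(katzhcda_auc.values()) / len(katzhcda_auc)
gap = mean_simccda - mean_katzhcda  # difference of the two average AUCs
```

The two averages come out at approximately 0.849 and 0.752, a difference of roughly 0.097 in AUC.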
In addition, we compared our method with KATZHCDA by using Dataset1 as the training set and Dataset2 and Dataset3 as the test sets. As can be seen from Additional file 1: Figure S8, our performance is slightly better than that of KATZHCDA. Specifically, KATZHCDA predicts better than our method in the early stage, but its accuracy declines in the later stages. Taking the above two results together, our model is superior to KATZHCDA on the whole.
Case study
Analysis of predicted circRNA-disease associations with experimental evidence from the TotalCircRD1 dataset
To further measure the performance of SIMCCDA, case studies of three diseases, breast cancer, stomach cancer and colorectal cancer, from the TotalCircRD1 dataset were analyzed in detail. The top 30 disease-related circRNAs predicted by SIMCCDA and supporting evidence from PubMed are presented in Tables 2, 3 and 4.
Breast cancer is the most common cancer and remains the leading cause of cancer death among women worldwide [33]. Among the top 30 predicted candidate circRNAs for breast cancer, 29 are associated with breast cancer in related studies (Table 2). For instance, hsa_circ_0001875 (top 1) is upregulated in breast cancer tissues compared with normal breast tissue [34]. In addition, the expression of circRNA hsa_circ_0006054 (top 2) is significantly downregulated in breast cancer tissues compared with non-breast cancer tissues [34].
Gastric cancer is the second leading cause of cancer-related mortality and the fourth most frequent cancer globally [35]. Using the SIMCCDA method, all of the top 30 predicted candidate circRNAs for gastric cancer were confirmed (Table 3). Among them, circRNA hsa_circ_0084606 (top 1) is one of the top 10 upregulated circRNAs in stomach cancer tissues [36], whereas hsa_circ_0000140 (top 2), a typical circular RNA, is significantly increased in stomach cancer tissues compared with paired adjacent non-tumorous tissues [37].
Colorectal cancer is the third most common cancer worldwide, with 1.36 million people diagnosed in 2012 [38]. The inferred results cover 23 experimentally verified associations among the top 30 ranked predictions (Table 4). The evidence in the literature reveals that circRNA hsa_circ_0000523 exhibits significantly reduced expression in cancer compared with normal colorectal tissues. In colorectal cancer cells, the well-validated circRNA hsa_circ_0000504 is upregulated [39].
Analysis of predicted circRNA-disease associations without experimental evidence from four datasets
Given that all of the top 30 associations predicted by our method on Dataset2, Dataset3 and TotalCircRD1 were well validated, here we concentrated on the four newly predicted potential circRNA-disease associations from Dataset1 (as shown in Fig. 4). We employed circRNA-miRNA and miRNA-disease associations to construct the corresponding circRNA-miRNA-mRNA networks for the four new circRNA-disease associations.
We used the hsa_circ_0070963-stomach cancer association as an example for a detailed exposition. First, possible miRNA targets of hsa_circ_0070963 were predicted with the miRNA Target Sites tool of CircInteractome [40]. Their experimentally verified target genes were screened out from miRTarBase [41], and the hsa_circ_0070963-miRNA-disease regulatory network was then constructed using Cytoscape [42]. Finally, the corresponding experimentally verified miRNA-stomach cancer associations were obtained from HMDD [43] and added to the above network. As noted from the result (Fig. 6), hsa_circ_0070963 may be targeted by four miRNAs: hsa-miR-223, hsa-miR-421, hsa-miR-610 and hsa-miR-526b. CircRNAs can act as competing endogenous RNAs (ceRNAs, also termed miRNA sponges) to buffer the expression of the target genes (i.e., mRNAs) of miRNAs [36, 37], and hsa-miR-223 is linked to the largest number of targets. Thus, we hypothesize that hsa_circ_0070963 may function as a hsa-miR-223 sponge to interact with stomach carcinoma.
The other three newly predicted associations (hsa_circ_0061893, hsa_circ_0071410, and hsa_circ_0054345 in stomach cancer) exhibit similar scenarios, which are presented in Additional file 1: Figures S9-S11.
Conclusions
Increasing evidence demonstrates that circRNA plays an important role in the development of various diseases. Understanding the underlying mechanisms of circRNA in disease is becoming an urgent problem worldwide. To date, the number of experimentally validated circRNA-disease associations is small, and few computational methods for predicting circRNA-disease associations are available. In this paper, we proposed a method called SIMCCDA for predicting circRNA-disease associations based on known circRNA-disease associations. Integrating data on circRNA similarity and disease similarity, we employed IMC to construct the model. LOOCV was applied to assess the accuracy of SIMCCDA. We then compared our method with KATZHCDA. Further case studies were also performed on breast cancer, stomach cancer and colorectal cancer. Based on the prediction results, SIMCCDA performs well in cross validation on the four datasets we used. In addition, the comparison indicates that our method can identify more associations between circRNAs and diseases.
The strong performance of SIMCCDA may be attributed to the following factors. First, SIMCCDA was constructed based on the integrated circRNA and disease similarities, which makes full use of various similarity data to characterize potential circRNA-disease associations. Second, SIMCCDA transformed circRNA-disease association prediction into a recommendation system problem and applied the IMC algorithm of recommendation systems to predict potential circRNA-disease associations. A decisive advantage of IMC is that it can supplement the missing values in the circRNA-disease association matrix to improve the performance. Third, the datasets used in this study were derived from various validated databases. Observing the results obtained on the four datasets, we found that the prediction ability of our model was better than that of the previous method.
However, our model also has some limitations. First, although we introduce the sequence similarity of circRNAs and the semantic similarity of diseases, the calculation of Gaussian interaction profile kernel similarity relies heavily on known circRNA-disease associations, thus causing an inevitable bias towards well-investigated circRNAs and diseases. Second, SIMCCDA cannot be applied to unknown circRNAs and diseases. In our future work, we will extend our method to address these limitations.
Availability of data and materials
The data pertaining to the present study have been included in table and/or figure form in the present manuscript. All datasets and computational code underlying this study are available in an online archive: https://github.com/bioinformaticsAHU/SIMCCDA.
Abbreviations
AUC: Area under ROC curve
circRNA: circular RNA
DO: Disease Ontology
DOIDs: Disease ontology identities
IMC: Inductive matrix completion
lncRNA: long non-coding RNA
LOOCV: Leave-one-out cross validation
miRNA: microRNA
PCA: Principal component analysis
PR: Precision-Recall
ROC: Receiver Operating Characteristics
SIMCCDA: Speedup Inductive Matrix Completion for CircRNA-Disease Associations prediction
SVD: Singular value decomposition
References
Jeck WR, Sorrentino JA, Wang K, Slevin MK, Burd CE, Liu J, Marzluff WF, Sharpless NE. Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA. 2013;19(2):141–57.
Bahn JH, Zhang Q, Li F, Chan TM, Lin X, Kim Y, Wong DT, Xiao X. The landscape of microRNA, Piwi-interacting RNA, and circular RNA in human saliva. Clin Chem. 2015;61(1):221–30.
Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495(7441):333–8.
Zhang Z, Yang T, Xiao J. Circular RNAs: promising biomarkers for human diseases. EBioMedicine. 2018;34:267–74.
Fang Y. Circular RNAs as novel biomarkers with regulatory potency in human diseases. Future Sci OA. 2018;4(07):FSO314.
Peng L, Yuan XQ, Li GC. The emerging landscape of circular RNA ciRS-7 in cancer (review). Oncol Rep. 2015;33(6):2669–74.
Fan C, Lei X, Fang Z, Jiang Q, Wu FX. CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases. Database. 2018;2018:bay044.
Zhao Z, Wang K, Wu F, Wang W, Zhang K, Hu H, Liu Y, Jiang T. circRNA disease: a manually curated database of experimentally supported circRNA-disease associations. Cell Death Dis. 2018;9(5):475.
Yao D, Zhang L, Zheng M, Sun X, Lu Y, Liu P. Circ2Disease: a manually curated database of experimentally validated circRNAs in human disease. Sci Rep. 2018;8(1):11018.
Ghosal S, Das S, Sen R, Basak P, Chakrabarti J. Circ2Traits: a comprehensive database for circular RNA potentially associated with disease and traits. Front Genet. 2013;4:283.
Lei X, Fang Z, Chen L, Wu FX. PWCDA: Path Weighted Method for Predicting circRNADisease Associations. Int J Mol Sci. 2018;19(11):E3410.
Fan C, Lei X, Wu FX. Prediction of CircRNAdisease associations using KATZ model based on heterogeneous networks. Int J Biol Sci. 2018;14(14):1950–9.
Shin D, Cetintas S, Lee KC, Dhillon IS. Tumblr Blog Recommendation with Boosted Inductive Matrix Completion; 2015. p. 203–12.
Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;8:30–7.
Jain P, Dhillon IS. Provable inductive matrix completion. arXiv preprint arXiv:13060626; 2013.
Zheng X, Ding H, Mamitsuka H, Zhu S. Collaborative matrix factorization with multiple similarities for predicting drugtarget interactions. Chicago: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2013. p. 1025–33.
Luo H, Li M, Wang S, Liu Q, Li Y, Wang J. Computational drug repositioning using lowrank matrix approximation and randomized algorithms. Bioinformatics. 2018;34(11):1904–12.
Lu C, Yang M, Luo F, Wu FX, Li M, Pan Y, Li Y, Wang J. Prediction of lncRNAdisease associations based on inductive matrix completion. Bioinformatics. 2018;34(19):3357–64.
Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNAdisease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
Xu M, Jin R, Zhou ZH. Speedup matrix completion with side information: application to multilabel learning. In: Advances in neural information processing systems, vol. 2013; 2013. p. 2301–9.
Glazar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA. 2014;20(11):1666–70.
Schriml LM, Arze C, Nadendla S, Chang YW, Mazaitis M, Felix V, Feng G, Kibbe WA. Disease ontology: a backbone for disease semantic integration. Nucleic Acids Res. 2012;40(Database issue):D940–6.
Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady, vol. 1966; 1966. p. 707–10.
Li J, Gong B, Chen X, Liu T, Wu C, Zhang F, Li C, Li X, Rao S, Li X. DOSim: an R package for similarity between diseases based on disease ontology. BMC Bioinformatics. 2011;12:266.
Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.
van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drugtarget interaction. Bioinformatics. 2011;27(21):3036–43.
Ji H, Yu W, Li Y. A rank revealing randomized singular value decomposition (r3svd) algorithm for lowrank matrix approximations. arXiv preprint arXiv:160508134; 2016.
Toh KC, Yun S. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific J Optim. 2010;6(615–640):15.
Cai JF, Candès EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J Optim. 2010;20:1956–82.
Chen X, Yan GY. Novel human lncRNAdisease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
Chen X, Qu J, Yin J. TLHNMDA: triple layer heterogeneous network based inference for MiRNAdisease association prediction. Front Genet. 2018;9:234.
Chen X. KATZLDA: KATZ measure for the lncRNAdisease association prediction. Sci Rep. 2015;5:16840.
Nagini S. Breast cancer: current molecular therapeutic targets and new players. Anticancer Agents Med Chem. 2017;17(2):152–63.
Lü L, Sun J, Shi P, Kong W, Xu K, He B, Zhang S, Wang J. Identification of circular RNAs as a promising new class of diagnostic biomarkers for human breast cancer. Oncotarget. 2017;8(27):44096.
Ang TL, Fock KM. Clinical epidemiology of gastric cancer. Singap Med J. 2014;55(12):621.
Shao Y, Li J, Lu R, Li T, Yang Y, Xiao B, Guo J. Global circular RNA expression profile of human gastric cancer and its clinical significance. Cancer Med. 2017;6(6):1173–80.
Li P, Chen S, Chen H, Mo X, Li T, Shao Y, Xiao B, Guo J. Using circular RNA as a novel type of biomarker in the screening of gastric cancer. Clin Chim Acta. 2015;444:132–6.
Yiu AJ, Yiu CY. Biomarkers in colorectal cancer. Anticancer Res. 2016;36(3):1093–102.
Xiong W, Ai YQ, Li YF, Ye Q, Chen ZT, Qin JY, Liu QY, Wang H, Ju YH, Li WH. Microarray analysis of circular RNA expression profile associated with 5fluorouracilbased chemoradiation resistance in colorectal cancer cells. Biomed Res Int. 2017;2017:8421614.
Dudekula DB, Panda AC, Grammatikakis I, De S, Abdelmohsen K, Gorospe M. CircInteractome: a web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol. 2016;13(1):34–42.
Chou CH, Shrestha S, Yang CD, Chang NW, Lin YL, Liao KW, Huang WC, Sun TH, Tu SJ, Lee WH. miRTarBase update 2018: a resource for experimentally validated microRNAtarget interactions. Nucleic Acids Res. 2017;46(D1):D296–302.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
Huang Z, Shi J, Gao Y, Cui C, Zhang S, Li J, Zhou Y, Cui Q. HMDD v3. 0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 2018;47:D1013–7.
Acknowledgements
The authors thank the members of our laboratory for their valuable contributions to SIMCCDA.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 13 Supplement 5, 2020: The International Conference on Intelligent Biology and Medicine (ICIBM) 2019: Computational methods and application in medical genomics (part 1). The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume13supplement5 .
Funding
Publication costs are funded by the National Natural Science Foundation of China (61672037, 21601001, U19A2064, 11835014), the Anhui Provincial Outstanding Young Talent Support Plan (gxyqZD2017005), the Young Wanjiang Scholar Program of Anhui Province, the Key Project of Anhui Provincial Education Department (KJ2017ZD01), the China Postdoctoral Science Foundation Grant (2018M630699), and the Anhui Provincial Postdoctoral Science Foundation Grant (2017B325).
Author information
Contributions
ML (Li) implemented the prediction system, performed the analysis, and drafted the manuscript. ML (Liu) collected the datasets, performed the analysis, and drafted the manuscript. YB participated in the design of the study and drafted the manuscript. JX designed the study, performed the analysis, and drafted the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
Supplementary file to this work (Tables S1–S2 and Figures S1–S11).
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, M., Liu, M., Bin, Y. et al. Prediction of circRNA-disease associations based on inductive matrix completion. BMC Med Genomics 13 (Suppl 5), 42 (2020). https://doi.org/10.1186/s1292002006790
Keywords
 CircRNA-disease associations
 CircRNA sequence similarity
 Disease semantic similarity
 Inductive matrix completion