Predicting binary, discrete and continued lncRNA-disease associations via a unified framework based on graph regression
BMC Medical Genomics volume 10, Article number: 65 (2017)
Abstract
Background
In the human genome, long noncoding RNAs (lncRNAs) have attracted increasing attention because their dysfunction is involved in many diseases. However, the associations between lncRNAs and diseases (LDA) remain unknown in most cases. While identifying disease-related lncRNAs in vivo is costly, computational approaches promise not only to accelerate the identification of associations but also to provide clues about the underlying mechanisms of various lncRNA-caused diseases. Previous computational approaches usually focus only on predicting new associations between lncRNAs that have known associations with diseases and other lncRNA-associated diseases. They also work only on binary lncRNA-disease associations (whether the pair has an association or not), which cannot reflect other biological facts, such as the number of proteins involved in an LDA or how strong the association is (i.e., the intensity of the LDA).
Results
To address the above-mentioned issues, we propose a graph regression-based unified framework (GRUF). In particular, our method can work on lncRNAs that have no previously known disease association and on diseases that have no known association with any lncRNA. Also, instead of giving only a binary answer for the association, our method tries to uncover more of the biological relationship between an lncRNA and a disease, which may provide better clues for researchers. We compared GRUF with three state-of-the-art approaches and demonstrated the superiority of GRUF, which achieves a 5% to 16% improvement in terms of the area under the receiver operating characteristic curve (AUC). GRUF also provides a confidence score for each predicted LDA, and this score is significantly correlated with the number of RNA-binding proteins involved in the LDA. Lastly, three of the top-5 LDA candidates generated by GRUF in novel prediction are verified indirectly by medical literature and known biological facts.
Conclusions
The proposed GRUF has two advantages over existing approaches. Firstly, it can work on lncRNAs that have no known disease association and on diseases that have no known association with any lncRNA. Secondly, instead of providing only a binary answer (with or without association), GRUF works for both discrete and continued LDAs, which helps reveal the pathological implications between lncRNAs and diseases.
Background
According to the central dogma of molecular biology, DNA is transcribed into different kinds of RNAs [1]. The transcriptional outputs of DNA comprise both protein-coding messenger RNAs (mRNAs) and noncoding RNAs (ncRNAs). The latter were commonly regarded as transcriptional noise [1]. However, the Human Genome Project unexpectedly revealed that only ~2% of the chemical bases in the genome sequence are transcribed into mRNAs [1], while the remaining bases, which account for a very large portion of the whole genome, are transcribed into ncRNAs [2]. As a result, ‘Why is the majority of the genome noncoding?’ has become one of the core questions in genomics [3].
In recent years, biological experiments have shown the critical biological roles of ncRNAs, which are involved in the regulation of transcription, translation, RNA modification, maturation or transportation, and in the epigenetic modification of chromatin structures [3]. ncRNAs vary widely in structure and in gene regulation outcomes. As the number of known functional ncRNAs increases [4], the various RNA species in the human genome can be roughly categorized into short (sncRNAs) and long (lncRNAs) groups by sequence length (generally with a threshold of 200 nucleotides). sncRNAs, such as siRNA (small inhibitory RNA), miRNA (microRNA), piRNA (piwi RNA) and antisense RNA, have fewer than 200 nucleotides (nts), are highly conserved across species and play a key role in the transcriptional and post-transcriptional silencing of genes. On the other hand, lncRNAs, with lengths of over 200 nts, are poorly conserved and have low expression levels and high tissue specificity. lncRNAs are usually encoded in intergenic, intronic or overlapping regions [5]. Unfortunately, how they perform their diverse functions is still largely unknown [6, 7].
The dysfunction (e.g. mutations and deregulations [8, 9]) of lncRNAs is heavily involved in the development or progression of diseases, such as cardiovascular disease [10] and cancer [11]. Thus, lncRNAs could be novel molecules for disease diagnosis and therapy [3, 9, 12]. Nevertheless, the number of lncRNAs that have been functionally characterized and associated with diseases is extremely small. The relationship between lncRNAs and human diseases remains unknown in most cases. Consequently, there is an increasing need to identify lncRNA-disease associations (LDA) on a genome-wide scale [12].
However, identifying disease-related lncRNAs through biological experiments is still a great challenge because of the lengthy process and high cost. Computational approaches provide alternatives for identifying possible lncRNA-disease associations for further study and validation in the wet lab [13]. Besides, computational approaches can also provide clues about the underlying mechanisms of various lncRNA-caused diseases and accelerate the identification of potential biomarkers for disease diagnosis, treatment, prognosis and prevention [3, 14].
Computational approaches, especially those based on machine learning, such as Laplacian Regularized Least Squares [15], network topology inference [16, 17], Random Walk [13, 18] and SVM [19], have been developed to predict potential LDAs, based on the assumption that similar diseases tend to be associated with functionally similar lncRNAs [19].
Most of the former approaches focus only on the prediction scenario involving lncRNAs with known associated diseases and diseases with known associated lncRNAs. However, the majority of lncRNAs have no known disease association. Also, there are more and more diseases that have no known association with any lncRNA. It is desirable to have an approach that can work on these lncRNAs and diseases. Moreover, to the best of our knowledge, existing computational approaches work only on binary LDAs (i.e. they only report whether there is an association or not), which cannot reflect many biological facts. For example, a disease-associated lncRNA may cause the disease by dysregulating diverse proteins [3, 20]. Binary associations can show neither the number of proteins involved in the associations nor the intensity of the associations.
To address the above-mentioned issues, we propose a Graph Regression-based approach that provides a Unified Framework (GRUF) for four prediction tasks, including the traditional task solved by existing approaches, which work on lncRNAs with known disease associations and diseases with known associations with some lncRNAs. GRUF is also able to work on lncRNAs with no known disease association and diseases without known association with any lncRNA. Moreover, instead of predicting binary LDAs only, GRUF can also work for both discrete and continued LDAs, which helps to reveal the implications between lncRNA and pathology. We demonstrate the superiority of GRUF both by comparison with three state-of-the-art approaches and by comprehensive prediction across distinct tasks over multiple types of associations. In addition, its effectiveness is further verified by validating the predicted novel lncRNA-disease associations against both medical literature and a related database.
Methods
Problem formulation
Given a set of associations between m known lncRNAs \( \{r_i\} \) (denoted as R) and n known diseases \( \{d_j\} \) (denoted as D), we have four prediction scenarios/tasks (Fig. 1). The first one (T1) is the traditional one handled by existing approaches, which infers how likely there are novel associations between R and D, where both R and D have other known associations. The second one (T2) is to find potential associated diseases from D for an lncRNA \( r_x \) that has no known disease association. Symmetrically, the third one (T3) is to find potential associated lncRNAs from R for a disease \( d_y \) that has no known association with any lncRNA. The last one (T4) is the most difficult task, which deduces how likely there are potential associations between lncRNAs with no known disease association and diseases with no known association with any lncRNA. Solving T4 could provide clues for researchers to further investigate unexpected associations between lncRNAs and diseases. Moreover, lncRNAs without known disease association and diseases without known association with lncRNAs are the majority.
The set of known LDAs between R and D can be organized into an association matrix \( {\mathbf{A}}_{m\times n} \). We consider three types of associations between lncRNAs and diseases: binary, discrete and continued LDAs. The corresponding association matrices are denoted as \( {\mathbf{A}}_{m\times n}^b \), \( {\mathbf{A}}_{m\times n}^d \) and \( {\mathbf{A}}_{m\times n}^c \). Traditionally, in \( {\mathbf{A}}_{m\times n}^b \), \( a^b(i,j) = 1 \) if there is a known interaction between lncRNA \( r_i \) and disease \( d_j \), and \( a^b(i,j) = 0 \) otherwise. By contrast, in \( {\mathbf{A}}_{m\times n}^d \), \( a^d(i,j) \in \mathbb{N}^{+} \) (positive integers) if there is a known interaction between lncRNA \( r_i \) and disease \( d_j \), and \( a^d(i,j) = 0 \) otherwise. In \( {\mathbf{A}}_{m\times n}^c \), \( a^c(i,j) \in \mathbb{R}^{+} \) (positive real numbers), with \( a^c(i,j) \ge 1 \) if there is a known interaction between lncRNA \( r_i \) and disease \( d_j \), and \( a^c(i,j) < 1 \) otherwise. \( {\mathbf{A}}_{m\times n}^d \) provides more information than \( {\mathbf{A}}_{m\times n}^b \), such as the number of proteins (or their coding genes) involved in the associations, while \( {\mathbf{A}}_{m\times n}^c \) can further reflect how strong the association is (i.e., the intensity of the LDA). The three kinds of associations can be represented as a binary graph, a weighted graph and a complete weighted graph, respectively, in which lncRNAs and diseases are nodes and their associations are edges. For short, the graph is denoted as \( G_a \).
We aim to develop a unified framework for predicting LDAs in the above four scenarios. Because they involve new nodes, the predictions in T2, T3 and T4 can be regarded as cold-start problems in recommendation systems. Besides the topology of the association graph, additional attributes of nodes must be integrated in T2, T3 and T4, which require predicting links for nodes that have no existing links at all.
Note that pairwise lncRNA similarities can be measured independently of the topology of the LDA graph and organized into an lncRNA similarity graph \( G_r \). Similarly, pairwise disease similarities can be organized into a disease similarity graph \( G_d \). Their symmetric adjacency matrices, represented as \( \mathbf{S}_r \) and \( \mathbf{S}_d \) respectively, are further integrated with \( {\mathbf{A}}_{m\times n} \) to perform the prediction of LDAs.
Graph regression
We transform the prediction task into a graph regression between \( G_r \), \( G_d \) and \( G_a \) (Fig. 2). The graph regression is performed simultaneously in three latent spaces: the associating space, the lncRNA similarity space and the disease similarity space. Therefore, the graph regression can be formulated as follows,
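The display equation is not reproduced in this text. A reconstruction consistent with the five terms and matrix dimensions given in the next paragraph is sketched here. The squared Frobenius norms, the unweighted sum and the symmetric form \( \mathbf{F}\mathbf{F}^T \) for the similarity terms are assumptions, not confirmed by the source:

\[
\min_{\mathbf{A}_r,\mathbf{A}_d,\mathbf{F}_r,\mathbf{F}_d,\mathbf{B}_r,\mathbf{B}_d}
\left\| \mathbf{A} - \mathbf{A}_r\mathbf{A}_d^T \right\|_F^2
+ \left\| \mathbf{S}_r - \mathbf{F}_r\mathbf{F}_r^T \right\|_F^2
+ \left\| \mathbf{S}_d - \mathbf{F}_d\mathbf{F}_d^T \right\|_F^2
+ \left\| \mathbf{A}_r - \mathbf{F}_r\mathbf{B}_r \right\|_F^2
+ \left\| \mathbf{A}_d - \mathbf{F}_d\mathbf{B}_d \right\|_F^2
\tag{1}
\]

With \( \mathbf{A} \) of size m × n, \( \mathbf{A}_r \) of size m × r, \( \mathbf{A}_d \) of size n × r, \( \mathbf{F}_r \) of size m × p, \( \mathbf{F}_d \) of size n × q, \( \mathbf{B}_r \) of size p × r and \( \mathbf{B}_d \) of size q × r, every product in this sketch is dimensionally consistent.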
The first three terms in the above objective function account for three low-rank decompositions, which map \( G_a \) into the associating space, \( G_r \) into the lncRNA similarity space and \( G_d \) into the disease similarity space, respectively. The last two terms account for the regression between the associating space and the lncRNA similarity space, and the regression between the associating space and the disease similarity space. For brevity, the regularization terms are omitted. In detail, the lncRNAs and diseases in \( G_a \) are mapped into an m × r lncRNA associating matrix \( \mathbf{A}_r \) (RAM) and an n × r disease associating matrix \( \mathbf{A}_d \) (DAM); the lncRNAs in \( G_r \) are mapped into an m × p lncRNA latent feature matrix \( \mathbf{F}_r \) (RLFM); the diseases in \( G_d \) are mapped into an n × q disease latent feature matrix \( \mathbf{F}_d \) (DLFM); the p × r matrix \( \mathbf{B}_r \) and the q × r matrix \( \mathbf{B}_d \) are the corresponding regression coefficient matrices.
Assuming that the five terms in formula (1) are independent, we give a simple solution to the above optimization problem by minimizing the terms individually. For the low-rank decompositions, we apply Singular Value Decomposition (SVD) to generate RAM, DAM, RLFM and DLFM respectively by \( \mathbf{M}\overset{SVD}{=}{\mathbf{U}\boldsymbol{\Sigma } \mathbf{V}}^T=\left(\mathbf{U}\sqrt{\boldsymbol{\Sigma}}\right){\left(\mathbf{V}\sqrt{\boldsymbol{\Sigma}}\right)}^T={\mathbf{LR}}^T \), where \( \mathbf{M} \) denotes \( \mathbf{A} \), \( \mathbf{S}_r \) or \( \mathbf{S}_d \). For the regression, we use Partial Least-Squares (PLS) regression to generate \( \mathbf{B}_r \) and \( \mathbf{B}_d \) individually.
Consequently, the proposed graph regression model enables us to solve T1, T2, T3 and T4 in a unified framework. The predicted confidence scores of being a potential LDA in all the tasks are defined as
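The two-step solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy matrices and the chosen ranks are hypothetical, and ordinary least squares is swapped in for PLS regression to keep the example dependency-free.

```python
import numpy as np

def low_rank_factors(M, rank):
    """Split M ~ L @ R.T via truncated SVD, with L = U*sqrt(S) and R = V*sqrt(S)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:rank])
    L = U[:, :rank] * root          # scale each kept column of U by sqrt(singular value)
    R = Vt[:rank].T * root          # same scaling for the kept columns of V
    return L, R

def least_squares_coefficients(F, A):
    """Solve min ||A - F @ B||_F for B. Ordinary least squares, a stand-in for PLS."""
    B, *_ = np.linalg.lstsq(F, A, rcond=None)
    return B

# Toy data (hypothetical): 6 lncRNAs, 5 diseases.
rng = np.random.default_rng(0)
A = (rng.random((6, 5)) < 0.4).astype(float)    # binary association matrix
X = rng.random((6, 8))
S_r = X @ X.T                                   # symmetric lncRNA similarity

A_r, A_d = low_rank_factors(A, rank=3)          # RAM (6 x 3) and DAM (5 x 3)
F_r, _ = low_rank_factors(S_r, rank=4)          # RLFM (6 x 4)
B_r = least_squares_coefficients(F_r, A_r)      # regression coefficients (4 x 3)
```

Keeping all singular values reproduces the input matrix exactly, so the rank acts as the only approximation knob in this sketch.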
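The score definitions are not reproduced in this text. A reconstruction that agrees with the matrix dimensions given earlier and with the definition of \( \boldsymbol{\Theta} \) in the following sentence is sketched here. The exact form, in particular the T1 case, is an assumption:

\[
\mathrm{score} =
\begin{cases}
\mathbf{A}_r\mathbf{A}_d^T, & \text{T1} \\
\mathbf{F}_{r,x}\,\mathbf{B}_r\,\mathbf{A}_d^T, & \text{T2} \\
\mathbf{A}_r\,\mathbf{B}_d^T\,\mathbf{F}_{d,y}^T, & \text{T3} \\
\mathbf{F}_{r,x}\,\mathbf{B}_r\mathbf{B}_d^T\,\mathbf{F}_{d,y}^T = \mathbf{F}_{r,x}\,\boldsymbol{\Theta}\,\mathbf{F}_{d,y}^T, & \text{T4}
\end{cases}
\tag{2}
\]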
where \( \mathbf{F}_{r,x} \), calculated from the lncRNA similarity matrix, denotes the latent feature vectors of the newly given lncRNAs \( r_x \) (having no association with diseases), \( \mathbf{F}_{d,y} \), calculated from the disease similarity matrix, denotes the latent feature vectors of the newly given diseases \( d_y \) (having no association with lncRNAs), and \( \boldsymbol{\Theta} ={\mathbf{B}}_{\mathbf{r}}{\mathbf{B}}_{\mathbf{d}}^T \) is the bi-regression coefficient matrix, calculated from the known lncRNA set R and the known disease set D.
Moreover, this framework remains applicable when no similarity graph is available but real-world feature vectors are, such as lncRNA sequence features. In this case, \( \boldsymbol{\Theta} \) builds the bridge between the features of lncRNAs, the features of diseases and the associations between them. Its entries indicate the importance of each pair of lncRNA feature and disease feature in distinguishing associations from non-associations. Compared with latent features, real-world features usually provide a more straightforward explanation of why an lncRNA is associated with a disease.
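How the four tasks could share one scoring routine is sketched below. The bilinear score forms are an assumption inferred from the matrix shapes (m × r, n × r, p × r, q × r) and from \( \boldsymbol{\Theta} = \mathbf{B}_r\mathbf{B}_d^T \); the function name `gruf_scores` and the toy sizes are hypothetical.

```python
import numpy as np

def gruf_scores(A_r, A_d, B_r, B_d, F_rx=None, F_dy=None):
    """Confidence scores for T1-T4 (assumed bilinear forms, see lead-in).

    F_rx: latent features of new lncRNAs (rows), or None for known lncRNAs.
    F_dy: latent features of new diseases (rows), or None for known diseases.
    """
    if F_rx is None and F_dy is None:      # T1: known lncRNAs x known diseases
        return A_r @ A_d.T
    if F_dy is None:                       # T2: new lncRNAs x known diseases
        return F_rx @ B_r @ A_d.T
    if F_rx is None:                       # T3: known lncRNAs x new diseases
        return A_r @ B_d.T @ F_dy.T
    theta = B_r @ B_d.T                    # T4: bi-regression coefficient matrix (p x q)
    return F_rx @ theta @ F_dy.T

# Toy shapes (hypothetical): m=6, n=5, r=3, p=4, q=2.
rng = np.random.default_rng(1)
A_r, A_d = rng.random((6, 3)), rng.random((5, 3))
B_r, B_d = rng.random((4, 3)), rng.random((2, 3))
F_rx, F_dy = rng.random((2, 4)), rng.random((3, 2))   # 2 new lncRNAs, 3 new diseases

t1 = gruf_scores(A_r, A_d, B_r, B_d)
t2 = gruf_scores(A_r, A_d, B_r, B_d, F_rx=F_rx)
t3 = gruf_scores(A_r, A_d, B_r, B_d, F_dy=F_dy)
t4 = gruf_scores(A_r, A_d, B_r, B_d, F_rx=F_rx, F_dy=F_dy)
```

In this sketch T2 and T3 replace one associating matrix by its regression estimate from latent features, and T4 replaces both.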
Generation of nonbinary association
Considering the discovery that the interactions between lncRNAs and RNA-binding proteins (RBPs) can reveal the roles of lncRNAs in multilayered transcriptional regulation [21] and their possible involvement in the alteration of cellular pathways [3], we expect that RBPs may contribute to better lncRNA annotations for understanding disease-related regulation. In addition, the genes coding RBPs are probably associated with diseases. Therefore, we used both lncRNA-RBP interactions and gene-disease associations to construct discrete and continued lncRNA-disease associations, which are represented by integer and real values respectively rather than by binary indicators. For convenience, the terms gene and protein refer to the same object in the following text.
The traditional binary association \( {\mathbf{A}}_{m\times n}^b \) can be easily generated by checking whether or not an lncRNA and a disease share common genes/proteins. If they do, they are associated with each other. The discrete association \( {\mathbf{A}}_{m\times n}^d \) can be generated by counting the number of common genes/proteins. These counts are the values of the discrete associations. The continued association is generated as follows. Let \( \mathbf{A}_{r-p} \) be the interaction matrix between lncRNAs and RBPs, \( \mathbf{A}_{d-g} \) be the association matrix between diseases and disease-related genes, and \( \mathbf{S}_P \) be the symmetric similarity matrix between the proteins coded by the genes common to the coding genes of RBPs in \( \mathbf{A}_{r-p} \) and the disease-related genes in \( \mathbf{A}_{d-g} \). We believe that the larger the number of common genes/proteins and the more similar they are, the more likely the lncRNA is associated with the disease. Thus, the continued association matrix between lncRNAs and diseases can be defined as \( {\mathbf{A}}_{m\times n}^c={\mathbf{A}}_{r-p}{\mathbf{S}}_P{\mathbf{A}}_{d-g} \).
A toy example illustrates these three types of LDAs in Fig. 2. Three observations can be drawn: (1) the binary LDA matrix only denotes whether or not lncRNAs are associated with diseases; (2) beyond the binary matrix, the discrete matrix provides additional information on how many RNA-binding proteins or their coding genes are involved in each LDA; (3) going deeper, the continued matrix shows the intensity of LDAs, which can distinguish entries even when they have the same number of RBPs (e.g. the blue entry and the green entries). As a result, compared with the binary associations, the union of the discrete associations and the continued associations provides evidence for functionally annotating the roles of lncRNAs and discovering their underlying mechanisms associated with diseases.
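The construction of the three association matrices can be illustrated with a small NumPy sketch. The function name and the toy matrices are hypothetical. The sketch assumes that `A_dg` is stored as diseases × genes with columns aligned to the protein columns of `A_rp`, which is why a transpose appears in the products.

```python
import numpy as np

def build_associations(A_rp, A_dg, S_p):
    """Return (binary, discrete, continued) lncRNA-disease matrices.

    A_rp: lncRNA x protein 0/1 interactions.
    A_dg: disease x gene 0/1 associations (columns aligned with A_rp, assumed).
    S_p:  protein x protein symmetric similarity with unit diagonal.
    """
    discrete = A_rp @ A_dg.T               # number of shared genes/proteins
    binary = (discrete > 0).astype(int)    # associated if at least one is shared
    continued = A_rp @ S_p @ A_dg.T        # shared count weighted by protein similarity
    return binary, discrete, continued

# Toy example (hypothetical): 2 lncRNAs, 2 diseases, 3 proteins.
A_rp = np.array([[1, 1, 0],
                 [0, 0, 1]])
A_dg = np.array([[1, 1, 0],
                 [0, 0, 1]])
S_p = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.2],
                [0.0, 0.2, 1.0]])

binary, discrete, continued = build_associations(A_rp, A_dg, S_p)
# discrete  -> [[2, 0], [0, 1]]
# continued -> [[3.0, 0.2], [0.2, 1.0]]
```

In this toy case the continued values are at least 1 for associated pairs and below 1 otherwise, in line with the definition of the continued association matrix given earlier. Unassociated pairs still receive a small positive value through similar but non-identical proteins.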
Similarity measurement
The similarity matrices of lncRNAs, proteins and diseases are constructed as follows. First, the occurrence frequency of K-mers, short substrings consisting of K letters drawn from the set {A, C, G, U}, is used to characterize an RNA sequence. In general, the occurrence frequency of 4-mers is used to calculate the sequence features of lncRNAs [22]. Considering that the binding between RNAs and proteins occurs in local zones of the sequence, we enhance the original 4-mer feature by dividing an lncRNA sequence into multiple (e.g. 35) segments of approximately equal length, calculating 4-mer features for each segment separately and concatenating them into one feature vector. Then, the pairwise similarity between any two lncRNA sequences, which accounts for an edge in the lncRNA graph \( G_r \), can be generated from their feature vectors \( \mathbf{r}_i \) and \( \mathbf{r}_j \) by \( 1/(1 + \mathrm{dist}(\mathbf{r}_i, \mathbf{r}_j)) \), where dist denotes the Euclidean distance.
Secondly, considering the importance of specific properties of amino acids in diverse kinds of binding, we adopted the approach in [23] to calculate protein sequence similarity as follows. According to both dipole moments and side chain volume, the twenty kinds of amino acids are first separated into 7 groups: {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E} and {C}. Then, protein sequences are encoded into a new type of sequence, which consists of 7 letters corresponding to those groups. Last, the occurrence frequency of 3-mers in these encoded sequences is used to represent protein sequences. The pairwise similarity between two protein sequences, whose feature vectors are represented as \( \mathbf{p}_i \) and \( \mathbf{p}_j \) respectively, can likewise be defined by \( 1/(1 + \mathrm{dist}(\mathbf{p}_i, \mathbf{p}_j)) \).
Thirdly, we calculated disease similarity with the help of MeSH, which provides a hierarchical disease classification system containing a set of semantic disease descriptors (nodes) [24]. Each descriptor accounts for a disease category containing one or more diseases. Meanwhile, a disease may be assigned to one or more categories. For example, Breast Neoplasms belongs to two categories, C04.588.180 and C17.800.90.500. Based on MeSH descriptors, each disease can be represented as a directed acyclic graph (DAG), and the pairwise similarity of two diseases is calculated by comparing their DAGs. The larger the common part of their DAGs, the more similar the two diseases are. The details can be found in [25]. We adopt this semantic similarity as the disease similarity when predicting lncRNA-disease associations.
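The segmented 4-mer feature and the distance-based similarity can be sketched as follows. The function names are hypothetical, and normalizing counts to frequencies within each segment is an assumption about how "occurrence frequency" is computed.

```python
from itertools import product
import numpy as np

ALPHABET = "ACGU"

def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer over ALPHABET in seq (fixed lexicographic order)."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])        # skip k-mers with unexpected letters
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def segmented_kmer_features(seq, n_segments=35, k=4):
    """Split seq into n_segments of roughly equal length and concatenate their k-mer features."""
    bounds = np.linspace(0, len(seq), n_segments + 1).astype(int)
    parts = [kmer_frequencies(seq[a:b], k) for a, b in zip(bounds[:-1], bounds[1:])]
    return np.concatenate(parts)

def similarity(x, y):
    """Similarity 1 / (1 + Euclidean distance), equal to 1 for identical vectors."""
    return 1.0 / (1.0 + np.linalg.norm(x - y))

features = segmented_kmer_features("ACGU" * 100)   # toy 400-nt sequence
```

With 35 segments and 4-mers the feature vector has 35 × 256 = 8960 entries. Segments shorter than k contribute all-zero blocks in this sketch.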
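The group encoding and 3-mer counting for proteins can be sketched in the same way. The digit labels 0 to 6 for the seven groups and the function names are hypothetical choices for illustration. Residues outside the twenty standard amino acids are simply dropped, which is an assumption.

```python
from itertools import product
import numpy as np

# The seven amino acid groups listed above, labelled 0..6 in that order.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(g) for g, members in enumerate(GROUPS) for aa in members}

def encode_protein(seq):
    """Rewrite a protein sequence over the 7-letter group alphabet."""
    return "".join(AA_TO_GROUP[aa] for aa in seq if aa in AA_TO_GROUP)

def triad_frequencies(seq):
    """Relative frequency of each of the 7**3 = 343 group 3-mers in the encoded sequence."""
    encoded = encode_protein(seq)
    index = {"".join(p): i for i, p in enumerate(product("0123456", repeat=3))}
    counts = np.zeros(len(index))
    for i in range(len(encoded) - 2):
        counts[index[encoded[i:i + 3]]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

vec = triad_frequencies("MKRDEAGV")   # toy peptide, encoded as "24455000"
```

Two such vectors can then be compared with the same \( 1/(1 + \mathrm{dist}) \) similarity used for lncRNAs.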
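One commonly used way to compare disease DAGs is to let every ancestor contribute a value that decays with its distance from the disease, then compare the contributions of shared ancestors. The sketch below follows that idea. Whether [25] uses exactly this formula is an assumption, as are the decay factor 0.5, the function names, and the simplification of treating each MeSH tree-number prefix as a DAG node.

```python
def semantic_contributions(tree_numbers, decay=0.5):
    """Map each ancestor node (tree-number prefix) to decay**steps from the disease."""
    contrib = {}
    for tn in tree_numbers:
        parts = tn.split(".")
        for depth in range(len(parts), 0, -1):
            node = ".".join(parts[:depth])
            value = decay ** (len(parts) - depth)   # 1 for the disease itself
            contrib[node] = max(value, contrib.get(node, 0.0))
    return contrib

def dag_similarity(tn_a, tn_b, decay=0.5):
    """Shared-ancestor contributions divided by the total contributions of both DAGs."""
    ca = semantic_contributions(tn_a, decay)
    cb = semantic_contributions(tn_b, decay)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    denominator = sum(ca.values()) + sum(cb.values())
    return numerator / denominator if denominator else 0.0

# Two hypothetical sibling categories under C04.588 share the ancestors C04.588 and C04.
sim = dag_similarity(["C04.588.180"], ["C04.588.200"])
```

Identical tree-number sets give a similarity of 1, and diseases with no common ancestor give 0.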
Assessment
The assessment of a prediction approach should consider two crucial factors: algorithm validation and performance evaluation. Algorithm validation is usually implemented by the well-known Cross Validation (CV). Notably, an appropriate CV scheme should be adopted for each scenario, otherwise over-optimistic results may be obtained [26, 27]. We generated different CV tasks under the four scenarios as follows (see also Fig. 1):

CV_T1: CV performed on lncRNA-disease pairs, where random entries (lncRNA-disease pairs) in A were selected for testing and the remaining entries were used for training.

CV_T2: CV performed on lncRNAs, where random rows corresponding to lncRNAs in A were blinded for testing and the remaining rows were used for training.

CV_T3: CV performed on diseases, where random columns in A (accounting for diseases) were blinded for testing and the remaining columns were used for training.

CV_T4: CV performed on lncRNA-disease pairs, where random entries in A were selected for testing, and all the rows and columns containing the testing entries were excluded from training. In other words, the rows and columns in A used for training contain none of the testing entries.
We adopt K-fold cross validation (KCV) to assess our approach in the different prediction scenarios. The objects in an LDA matrix are randomly split into K subsets of approximately equal size. In each round of CV, one subset of objects is taken as the testing set while the union of the other subsets is taken as the training set. This procedure is repeated for K rounds, with each subset serving as the testing set in turn. Here, the term ‘object’ refers to the entries of the LDA matrix in CV_T1 and CV_T4, and to the rows and the columns of the LDA matrix in CV_T2 and CV_T3, respectively.
Moreover, over these CV schemes, we use two metrics to evaluate the performance of LDA prediction. One is the popular Area Under the receiver operating characteristic Curve (AUC), which can be calculated from the predicted confidence scores of positive and negative entries. In the binary prediction of LDAs, known LDAs and other lncRNA-disease pairs are assigned positive labels and negative labels, respectively. AUC is appropriate for binary LDA prediction but is not directly applicable to discrete and continued prediction. We therefore design a strategy to adapt AUC to them.
Because there is a one-to-one correspondence between each entry of the binary LDA matrix (BAM) and its enriched entry in either the discrete LDA matrix (DAM) or the continued LDA matrix (CAM), the binary values of the entries in BAM can be taken as the binary labels of those entries in DAM or CAM. Once the labels are set, the predicted confidence scores generated by discrete prediction or continued prediction can be used to calculate AUC in the same way as in binary prediction.
However, AUC is not enough to measure the performance of discrete prediction or continued prediction because it can only indicate how well the approach distinguishes LDAs from non-LDAs. Therefore, another metric, Correlation, is proposed as an enhanced measure for discrete prediction or continued prediction. It indicates the consistency between the intensity of LDAs and their predicted confidence scores. The higher, the better. A perfect prediction model generates predicted scores that are completely correlated with DAM or CAM.
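The four CV schemes can be expressed as boolean masks over the association matrix. The sketch below is illustrative only: the function names are hypothetical, and the T4 split is built from pairs of row folds and column folds, which is one way to guarantee that training rows and columns contain no testing entries.

```python
import numpy as np

def kfold_indices(n_items, k, rng):
    """Randomly split range(n_items) into k folds of approximately equal size."""
    return np.array_split(rng.permutation(n_items), k)

def cv_t1(shape, k, rng):
    """Yield test masks over individual entries (CV_T1)."""
    m, n = shape
    for fold in kfold_indices(m * n, k, rng):
        mask = np.zeros(m * n, dtype=bool)
        mask[fold] = True
        yield mask.reshape(m, n)

def cv_rows_or_cols(shape, k, rng, axis):
    """Yield test masks over whole rows (axis=0, CV_T2) or columns (axis=1, CV_T3)."""
    for fold in kfold_indices(shape[axis], k, rng):
        mask = np.zeros(shape, dtype=bool)
        if axis == 0:
            mask[fold, :] = True
        else:
            mask[:, fold] = True
        yield mask

def cv_t4(shape, k, rng):
    """Yield (train_mask, test_mask) for every pair of row fold and column fold (CV_T4)."""
    m, n = shape
    row_folds = kfold_indices(m, k, rng)
    col_folds = kfold_indices(n, k, rng)
    for rows in row_folds:
        for cols in col_folds:
            test = np.zeros(shape, dtype=bool)
            test[np.ix_(rows, cols)] = True     # held-out rows x held-out columns
            train = np.ones(shape, dtype=bool)
            train[rows, :] = False              # drop every held-out row
            train[:, cols] = False              # and every held-out column
            yield train, test
```

With K folds on each axis, `cv_t4` produces K × K train/test pairs. Entries that lie in a held-out row or column but outside the test block take part in neither training nor testing.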
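Both metrics can be computed from a matrix of predicted scores, the non-binary ground truth and its binary counterpart. The following is a minimal sketch with hypothetical function names. The source does not say which correlation coefficient is used, so Pearson correlation is an assumption here.

```python
import numpy as np

def auc_from_scores(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]          # all positive-negative score differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def evaluate(nonbinary_truth, binary_truth, scores):
    """Return (AUC using the binary labels, Pearson correlation with the non-binary truth)."""
    auc = auc_from_scores(binary_truth.ravel(), scores.ravel())
    corr = np.corrcoef(nonbinary_truth.ravel(), scores.ravel())[0, 1]
    return auc, corr

# Toy example (hypothetical): discrete truth, its binary labels, and predicted scores.
truth_d = np.array([[2, 0], [0, 1]])
truth_b = (truth_d > 0).astype(int)
scores = np.array([[2.0, 0.1], [0.0, 1.2]])
auc, corr = evaluate(truth_d, truth_b, scores)
```

The pairwise AUC computation is quadratic in the number of entries, which is acceptable for matrices of the sizes discussed here but would need a rank-based formula for much larger data.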
Results and discussion
Datasets
We collected three datasets to evaluate GRUF. The first, denoted as DB1, was used as a benchmark dataset in former approaches [15, 28] and was also used to build a web server for predicting binary lncRNA-disease associations in the most recent work [19]. DB1 contains 117 lncRNAs, 159 diseases and 285 binary associations between them. It also contains two lncRNA similarity matrices (sequence similarity and disease association-based similarity) as well as five disease similarity matrices (gene functional similarity, GO-based similarity, PPI topology-based similarity, PPI shortest path-based similarity and lncRNA association-based similarity). The lncRNA similarity matrices and the disease similarity matrices are combined respectively [19].
The second benchmark dataset, denoted as DB2, was collected from the recently published database LncRNA2Cancer [29], which provides comprehensive experimentally supported associations between lncRNAs and human cancers. After removing the lncRNAs having no available sequence in LncRNA2Cancer [29] and their associated cancers, we obtained 345 lncRNAs, 93 cancers and 747 binary associations between them in DB2. Using the approach in Section “Similarity measurement”, we calculated the sequence similarity of lncRNAs. Since LncRNA2Cancer contains no MeSH codes for cancers but only labels from the International Classification of Diseases (ICD), we simply set the pairwise disease similarity to 1 if two cancers belong to the same ICD category, and to 0 otherwise.
Moreover, we built the third dataset (DB3) to demonstrate the capability of GRUF in the four prediction scenarios over the three types of lncRNA-disease associations. In order to construct the three kinds of LDAs, we collected the interactions between lncRNAs and their RBPs from LncRNADisease [30] and collected disease-associated genes and their diseases from DisGeNET [31]. We kept only the intersection of the coding genes of the proteins and the disease-associated genes, and selected their related lncRNAs and diseases respectively. Finally, DB3 contains 89 lncRNAs, 108 genes and 406 diseases. In total, there are 154 experimentally supported lncRNA-protein interactions and 884 experimentally confirmed gene-disease associations.
When calculating lncRNA similarity, we split lncRNA sequences into 35 segments and obtained 8960-dimensional (\( 35 \times 4^4 = 8960 \)) feature vectors based on 4-mers. Because all the values of the 4-mer feature entries are small, we normalized them by Z-score and obtained a normalized feature matrix whose columns have sample mean zero and sample standard deviation one. In addition, to accelerate the calculation of the lncRNA similarity matrix, we performed Principal Component Analysis (PCA) on the feature vectors. After removing those dimensions whose entries are all zero within numerical accuracy, we obtained 88-dimensional feature vectors. Moreover, we calculated the protein features for genes based on the 7 amino acid groups, after translating gene sequences into protein sequences. Similarly, by PCA, we mapped the original \( 7^3 \)-dimensional feature vectors of proteins into 107-dimensional feature vectors. After preprocessing the lncRNA and protein feature vectors, we calculated the lncRNA similarity and the protein similarity. The disease similarity was calculated directly from the MeSH descriptors of the diseases (see also Section “Similarity measurement”).
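The preprocessing described above, column-wise Z-scoring followed by PCA with removal of numerically empty dimensions, can be sketched with NumPy alone. The function names, the tolerance and the toy data are hypothetical.

```python
import numpy as np

def zscore_columns(X):
    """Give every column sample mean 0 and sample standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                     # leave constant columns at zero
    return (X - mean) / std

def pca_nonzero_components(X, tol=1e-10):
    """Project X onto its principal components, dropping numerically zero ones."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s.max()                # discard directions with no variance
    return U[:, keep] * s[keep]             # component scores of each sample

# Toy data (hypothetical): 10 samples with 50 features each.
rng = np.random.default_rng(0)
X = rng.random((10, 50))
Z = zscore_columns(X)
P = pca_nonzero_components(Z)               # shape (10, 9)
```

A centered matrix of n samples has rank at most n − 1, so when there are more features than samples this procedure keeps at most n − 1 dimensions. That is consistent with the 88 dimensions reported for 89 lncRNAs and the 107 dimensions reported for 108 genes.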
Comparison with state-of-the-art approaches
In order to demonstrate the effectiveness of GRUF, we performed three experiments. We first compared our approach with three state-of-the-art approaches, RWR [28], LRLSLDA [15] and LDAP [19]. However, these approaches are not designed to work with non-binary lncRNA-disease associations and are also unable to predict associations for lncRNAs and diseases without known associations. Therefore, the comparison was performed only for the traditional binary association in Scenario T1. To make a fair comparison, we adopted the same dataset (DB1), the same cross validation (leave-one-out) and the same measure (AUC) as those in LDAP (the most recent approach). The result shows that our approach is significantly superior to those state-of-the-art approaches in terms of AUC (Fig. 3).
To our knowledge, there is no existing approach using DB2 as a benchmark dataset since it was published very recently. Thus, we compared GRUF with two models, MLKNN [26] and RLS [27], which work on a similar form of problem (drug-target interaction prediction). As recommended in [26, 27], an extra metric, the area under the precision-recall curve (AUPR), was adopted together with AUC to measure the prediction performance. Since those models were originally designed for Scenarios T2 and T3, the prediction was run in the same scenarios under 10-fold CV (Table 1). The comparison shows that the performance of GRUF is significantly better than that of those models, especially in terms of AUPR.
Predicting comprehensive lncRNA-disease associations
We demonstrated the prediction ability of GRUF on both discrete and continued associations in three scenarios, T2, T3 and T4, which involve lncRNAs and/or diseases with no known association. Ten-fold CV was run on DB3 to evaluate the performance of GRUF. In detail, all lncRNAs and all diseases with known associations are randomly partitioned into 10 non-overlapping subsets of equal size, respectively. In each round of the CV in T2, one subset of lncRNAs is removed as the testing lncRNAs \( Tst_r \) and the remaining lncRNAs are taken as the training lncRNAs \( Trn_r \). Similarly, in T3, one subset of diseases is removed as the testing diseases \( Tst_d \) and the remaining diseases are taken as the training diseases \( Trn_d \). In T4, the submatrix containing all the entries between \( Trn_r \) and \( Trn_d \) in the association matrix A is labelled as the training part, only the submatrix containing the entries between \( Tst_r \) and \( Tst_d \) is labelled as the testing part, and the entries between \( Tst_r \) and \( Trn_d \) as well as those between \( Trn_r \) and \( Tst_d \) take part in neither the training nor the testing phase. Thus, T4 requires 10×10 cross validation. In addition, the results of predicting binary, discrete and continued associations in T1 are listed for a comprehensive comparison.
Based on the predicted confidence scores, which indicate how likely the testing lncRNA-disease pairs are to be potential LDAs, we performed two investigations. The former examines how well GRUF separates LDAs from non-LDAs for binary, discrete and continued LDAs respectively (Table 2). The latter explores how beneficial the discrete and continued LDA entries are to the prediction (Table 3).
In the first investigation, the values of AUC show that: (1) T1 is the easiest task while T4 is the hardest task across all three kinds of associations; (2) GRUF shows similar results in binary, discrete and continued prediction over the four prediction scenarios.
In the second investigation, the correlation between the predicted confidence scores and the number of RBPs involved in LDAs shows that: (1) continued prediction achieves the best results, discrete prediction obtains moderate results, and binary prediction generates the worst results; (2) GRUF usually achieves the best performance in T1 and the worst performance in T4.
Consequently, we may draw the following conclusions: (1) T1 is the easiest task of LDA prediction, T2 and T3 are moderate tasks, and T4 is the hardest task over binary, discrete and continued LDAs in terms of both AUC and Correlation; (2) when using discrete prediction and continued prediction, GRUF has a similar ability to separate LDAs from non-LDAs; (3) more importantly, GRUF shows its power to capture cues to the underlying mechanisms of LDAs, because the correlations between the number of RBPs and the predicted confidence scores in discrete and continued prediction are higher than that in binary prediction. The last two points enable GRUF to reveal the implications between lncRNA and pathology.
In addition, considering that GRUF achieves the most confident prediction in T1, we performed a novel prediction in T1 to find potential LDAs in DB3 (Table 4). The predicted lncRNA-disease pairs were ranked by their confidence scores of being potential associations. The top 5 were selected for validation by checking medical literature and LncRNADisease [30], and three of the top 5 were validated. The result shows that our approach is able to uncover novel lncRNA-disease associations.
Conclusions
Existing computational approaches focus only on predicting associations between lncRNAs with known disease associations and diseases with known associations with some lncRNAs. An open question is whether we can predict associations for lncRNAs without known disease association and/or diseases with no known association with any lncRNA. In addition, current computational approaches work only on binary lncRNA-disease associations (LDA), which cannot reflect many biological facts, such as the number of proteins involved in lncRNA-disease associations and how strong LDAs are. To address the above-mentioned issues, we have proposed a unified inference approach based on graph regression, GRUF. GRUF is able to work on four distinct prediction tasks, in particular on lncRNAs and diseases without any known association. More importantly, it is able to predict not only binary LDAs but also discrete and continued LDAs, which helps reveal the implications between lncRNA and pathology. Experiments on real datasets demonstrate the superiority and effectiveness of our approach. As a remark, we want to emphasize that the results of our approach may be affected by the quality of the dataset and by the expression level of a particular lncRNA. For example, for lncRNAs with low expression levels, it may be difficult for our method or any existing method to accurately predict their associations. How to tackle these difficult cases is a challenging problem for further research. Also, we plan to include more disease-related knowledge to improve the accuracy of prediction.
Abbreviations
GRUF: Graph regression-based unified framework
LDA: Associations between lncRNAs and diseases
lncRNAs: Long noncoding RNAs
AUC: Area under the receiver operating characteristic curve
AUPR: Area under the precision-recall curve
CV: Cross-validation
DAG: Directed acyclic graph
DAM: Disease associating matrix
DLFM: Disease latent feature matrix
ICD: International Classification of Diseases
MeSH: Medical Subject Headings
PLSR: Partial least-squares regression
RAM: RNA associating matrix
RBP: RNA-binding protein
RLFM: RNA latent feature matrix
References
van Bakel H, Nislow C, Blencowe BJ, Hughes TR. Most "dark matter" transcripts are associated with known genes. PLoS Biol. 2010;8(5):e1000371.
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74.
Yotsukura S, duVerle D, Hancock T, Natsume-Kitatani Y, Mamitsuka H. Computational recognition for long noncoding RNA (lncRNA): software and databases. Brief Bioinform. 2016;18(1):9–27.
Guil S, Esteller M. Cis-acting noncoding RNAs: friends and foes. Nat Struct Mol Biol. 2012;19(11):1068–75.
Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C, et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010;28(5):503–10.
Mercer TR, Dinger ME, Mattick JS. Long noncoding RNAs: insights into functions. Nat Rev Genet. 2009;10(3):155–9.
Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell. 2009;136(4):629–41.
Lalevee S, Feil R. Long noncoding RNAs in human disease: emerging mechanisms and therapeutic strategies. Epigenomics. 2015;7(6):877–9.
Wapinski O, Chang HY. Long noncoding RNAs and human disease. Trends Cell Biol. 2011;21(6):354–61.
Kataoka M, Wang DZ. Noncoding RNAs including miRNAs and lncRNAs in cardiovascular biology and disease. Cells. 2014;3(3):883–98.
Chakravarty D, Sboner A, Nair SS, Giannopoulou E, Li R, Hennig S, Mosquera JM, Pauwels J, Park K, Kossai M, et al. The oestrogen receptor alpha-regulated lncRNA NEAT1 is a critical modulator of prostate cancer. Nat Commun. 2014;5:5383.
Wang J, Ma R, Ma W, Chen J, Yang J, Xi Y, Cui Q. LncDisease: a sequence based bioinformatics tool for predicting lncRNA-disease associations. Nucleic Acids Res. 2016;44(9):e90.
Chen X, You ZH, Yan GY, Gong DW. IRWRLDA: improved random walk with restart for lncRNA-disease association prediction. Oncotarget. 2016;7(36):57919–31.
Chen X. Predicting lncRNA-disease associations and constructing lncRNA functional similarity network based on the information of miRNA. Sci Rep. 2015;5:13186.
Chen X, Yan GY. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
Alaimo S, Giugno R, Pulvirenti A. ncPred: ncRNA-disease association prediction through tripartite network-based inference. Front Bioeng Biotechnol. 2014;2:71.
Yang X, Gao L, Guo X, Shi X, Wu H, Song F, Wang B. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PLoS One. 2014;9(1):e87797.
Zhou M, Wang X, Li J, Hao D, Wang Z, Shi H, Han L, Zhou H, Sun J. Prioritizing candidate disease-related long noncoding RNAs by walking on the heterogeneous lncRNA and disease network. Mol BioSyst. 2015;11(3):760–9.
Lan W, Li M, Zhao K, Liu J, Wu FX, Pan Y, Wang J. LDAP: a web server for lncRNA-disease association prediction. Bioinformatics. 2016;33(3):458–60.
Li JH, Liu S, Zheng LL, Wu J, Sun WJ, Wang ZL, Zhou H, Qu LH, Yang JH. Discovery of protein-lncRNA interactions by integrating large-scale CLIP-Seq and RNA-Seq datasets. Front Bioeng Biotechnol. 2014;2:88.
Clark MB, Johnston RL, Inostroza-Ponta M, Fox AH, Fortini E, Moscato P, Dinger ME, Mattick JS. Genome-wide analysis of long noncoding RNA stability. Genome Res. 2012;22(5):885–98.
Muppirala U, Honavar V, Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics. 2011;12:489.
Suresh V, Liu L, Adjeroh D, Zhou X. RPI-Pred: predicting ncRNA-protein interaction using sequence and structural information. Nucleic Acids Res. 2015;43(3):1370–9.
Lipscomb CE. Medical subject headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265–6.
Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.
Shi JY, Yiu SM, Li YM, Leung HCM, Chin FYL. Predicting drug-target interaction for new drugs using enhanced similarity measures and super-target clustering. Methods. 2015;83:98–104.
Shi JY, Li JX, Lu HM. Predicting existing targets for new drugs base on strategies for missing interactions. BMC Bioinformatics. 2016;17(Suppl 8):282.
Sun J, Shi H, Wang Z, Zhang C, Liu L, Wang L, He W, Hao D, Liu S, Zhou M. Inferring novel lncRNA-disease associations based on a random walk model of a lncRNA functional similarity network. Mol BioSyst. 2014;10(8):2074–81.
Ning SW, Zhang JZ, Wang P, Zhi H, Wang JJ, Liu Y, Gao Y, Guo MN, Yue M, Wang LH, et al. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 2016;44(D1):D980–5.
Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013;41(Database issue):D983–6.
Pinero J, Queralt-Rosinach N, Bravo A, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: the journal of biological databases and curation. 2015;2015:bav028.
Berteaux N, Aptel N, Cathala G, Genton C, Coll J, Daccache A, Spruyt N, Hondermarck H, Dugimont T, Curgy JJ, et al. A novel H19 antisense RNA overexpressed in breast cancer contributes to paternal IGF2 expression. Mol Cell Biol. 2008;28(22):6731–45.
Reddy R, Henning D, Subrahmanyam CS, Busch H. Primary and secondary structure of 7-3 (K) RNA of Novikoff hepatoma. J Biol Chem. 1984;259(19):12265–70.
Acknowledgements
This work was supported by the RGC Collaborative Research Fund (CRF) of Hong Kong (C100816G), the National High Technology Research and Development Program of China (No. 2015AA016008), the Program of Peak Experience of NWPU (2016), and the China National Training Program of Innovation and Entrepreneurship for Undergraduates (201710699330), and was partially supported by the National Natural Science Foundation of China (Nos. 61473232, 91430111).
Funding
The publication charge was funded by the RGC Collaborative Research Fund (CRF) of Hong Kong (C100816G).
Availability of data and materials
The dataset used in this work can be downloaded from https://github.com/JustinShi2016/InCoB2017/
About this supplement
This article has been published as part of BMC Medical Genomics Volume 10 Supplement 4, 2017: 16th International Conference on Bioinformatics (InCoB 2017): Medical Genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-10-supplement-4.
Author information
Authors and Affiliations
Contributions
JYS conceived and designed the experiments and drafted the manuscript. HH and YXL collected the dataset and performed the experiments. JYS, YNZ and SMY analyzed the results. JYS and HH contributed materials/analysis tools and developed the code used in the analysis. YNZ and SMY helped to draft the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Shi, JY., Huang, H., Zhang, YN. et al. Predicting binary, discrete and continued lncRNA-disease associations via a unified framework based on graph regression. BMC Med Genomics 10, 65 (2017). https://doi.org/10.1186/s12920-017-0305-y
DOI: https://doi.org/10.1186/s12920-017-0305-y
Keywords
 lncRNA-disease association
 Graph regression
 Prediction
 Discrete
 Continued
 Sequence feature
 Semantic similarity