Overview of CRlncRC
An integrated machine-learning pipeline was designed and designated as CRlncRC (Cancer-Related lncRNA Classifier). The pipeline was shown in Fig. 1.
Firstly, in order to increase the precision of predictions, we strictly selected the positive and negative collection for training. The positive dataset consisted of 158 experimentally-validated cancer-related lncRNAs curated from the scientific literature (Additional file 1); while the negative was randomly sampled from long intergenic noncoding RNAs whose 10 kb upstream and downstream had no cancer-related SNPs (in total 4553 lncRNAs, Additional file 2), repeatedly 100 times. For lncRNAs from each dataset, we constructed four categories of features including genomic, expression, epigenetic and network (Additional file 3).
Second, to evaluate the performance of different machine-learning algorithms, we used five popular algorithms, including Random Forest (RF), Naïve bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR) and K-Nearest Neighbors (KNN), to proceed with ten-fold cross-validation in 100 training datasets. For each test, the receiver operating characteristic curves (ROCs) were calculated, and the average area under the ROC curve (AUC) was used to assess the best performance for each algorithm in 100 training sets. For the best performance model, we further compared it with other existing cancer-related lncRNA classifiers in terms of performance, and evaluated the weightiness of four categories of features contributing to cancer-related classification.
At last, we managed to use the best performance model to predict novel cancer-related lncRNAs. Here, we adopted all lncRNAs from the TANRIC dataset (Additional file 4), which were completely separate from our positive and negative datasets, to examine the prediction performance of CRlncRC. For these novel cancer-related candidates, we utilized genome-wide data to assess the probability of their associations with cancer, which include their enrichment of somatic mutations in cancer genome, distance with the cancer-related proteins, differential expression fold-change between pairs of tumor and normal tissues, and GO enrichment analysis. In addition, we also inspected the potent experimental supports from literature.
Cross-validation accuracy
We used ten-fold cross-validation to evaluate the model accuracy. As shown in Fig. 2a, RF, NB, SVM, LR and KNN achieved average AUC scores of 0.82, 0.78, 0.79, 0.76 and 0.68, respectively. Apparently, there are four models achieving an average score of AUC more than 0.75 except that of KNN, wherein RF model shows the best performance. We next checked the resulting accuracy of RF classifier when we used only one category of features. As shown in Fig. 2b, training RF model with epigenetic, expression, network and genomic features, our model achieved AUC scores of 0.76, 0.73, 0.76 and 0.73, respectively. So, using RF model, any feature solely could gain an average AUC score of > 0.7, much less an extra 6–9% AUC score when combining all the features. No single category of feature could achieve the top performance as that of united features, which strongly suggests the complementary nature between features and the advantage of integrative approaches. In addition to AUC value, more evaluation indicators were used to assess our results such as precision, recall, accuracy, and AUC confidence interval (Additional file 5). In order to perform a comprehensive analysis of the effect of features on model’s performance, we also compared two types of features and three types of features (Additional file 6).
The contribution of features to identify cancer-related lncRNAs
To better comprehensively understand the significance of features to identification of the cancer-related lncRNAs, we used ExtraTreeClassifier [13] in scikit-learn package as a measurement for further evaluating the hierarchies of all features in terms of importance (Fig. 3) (Additional file 3). Here, we summarized the amount of four categories of features located in Top10/20/50 feature importance list (Fig. 3a). In the Top10 features, four features are pertinent to genomic features, epigenetic and network features each have three positions. However, as respects of the Top20 and Top50 features, the first and the second occupancy among features belong to epigenetic (9 in Top20 and 18 in Top50) and network features (7 in Top20 and 14 in Top50), respectively. It surprised us that no expression features emerged in the Top10 and Top20 features, which only occupy 8 locations in the Top50 features, indicating that expression features are still necessary though less important than other types of features. The fact that lncRNA expression always had a strong tissue specificity with a relatively low level might explain the less importance of expression-related features on cancer-related lncRNAs prediction.
We calculated the cumulative distribution and corresponding Kolmogorov-Smirnov test p-value of all the features in the positive and negative lncRNA datasets (Additional file 7). Figure 3b and Fig. 4 showed the top nine features sorted by importance, and their corresponding cumulative distribution in the positive and negative datasets. Interestingly, ‘SINE (Short interspersed nuclear element) numbers in gene body’ was the most important feature. The cumulative curve also showed that cancer-related lncRNAs have obviously higher SINE numbers than cancer-unrelated lncRNAs (Fig. 4a, p-value = 5.3e-05). LINE (Long interspersed nuclear element) was another example of the repeats which contributes to the classifier. ‘LINE numbers in gene body’ ranks the No.8 in all of the features. Similar with SINE, we found that cancer-related lncRNAs have obviously higher LINE numbers than cancer-unrelated lncRNAs (Fig. 4h, p-value = 0.00086). We further compared the length distribution of positive and negative lncRNAs, and found that these two distributions only have slight difference (Additional file 8). Our results implied that repeat element might be an important functional element for lncRNAs and widely participant in the cancer-related process, which is consistent with a lot of published researches. For example, Alu is a subtype of SINE and has been implicated in several inherited human diseases and in various forms of cancers; and, LINE can activate immune responses and contribute to disease progression [14, 15], as well as potentially affect chromatin formation [16].
Apart from the repeat, there are other two genomic-related features in the top nine features: ‘intron GC content’ and ‘exon phastCons score’. Compared with lncRNAs from negative dataset, the introns in cancer-related lncRNAs have a relatively higher GC content (Fig. 4d, p-value = 0.014). The GC content was related to the stability of gene and regulation and might have played a significant role in the evolution [17]. Besides, the composite patterns of GC content between intron and exon would likely affect gene splicing [18, 19]. These facts hinted the relationship between ‘intron GC content’ and cancer-related lncRNA. Moreover, the exon sequences in cancer-related lncRNAs showed obviously higher conservation than negative set, implying that cancer-related lncRNAs may undergo evolutionary pressure for maintaining some important functions relevant to normal cell behavior (Fig. 4g, p-value = 7.5e-05).
In the top nine features, two epigenetic features ranked at NO.2 and No.6, they are “H3k4me1” and “H3k4me3” epigenetic modification signals within lncRNA gene body region in H1hesc cell line, respectively. Both signals in positive dataset are significantly higher than the negative set (Fig. 4b, p-value = 1.7e-10; Fig. 4f, p-value = 4.3e-13). Epigenetic feature H3k4me3 are likely associated with the expression of cancer-related lncRNAs. High levels of H3K4me3 are often found near the promoter region, and commonly associated with the activation of transcription of nearby genes [20, 21]. A broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor-suppressor genes [12]. While H3K4me1 is usually found in intergenic region with enrichment at enhancers [22]. Recent studies have demonstrated that many enhancer elements can be transcribed into a novel class of lncRNAs, enhancer RNAs (eRNAs) [23,24,25]. These eRNAs could exert cancer-related functions through their associated enhancers, as in the case of eRNAs from p53-bound enhancer region that are required for p53-dependent enhancer activity and gene transcription [26]. On the other hand, the fact that these two histone modification related features listed in top 9 features are associated with H1hesc cell line, instead of Gm12878 and K562 cell lines, indicated that the effects of histone modifications to cancer-related lncRNAs might have tissue/cell type-specificity.
Consistent with the other papers, the cancer-related lncRNAs tend to be more likely interacted with cancer-related proteins. In the top nine features, network features ranked the position of No.3, No.5 and No.9. The lncRNAs in the positive set displayed more strongly co-expression with ERBB2, CTNNB1 and CDKN2A than the negative set (Fig. 4c, p-value = 4.6e-10; Fig. 4e, p-value = 6.6e-08; Fig. 4i, p-value = 4.4e-07). Wherein, ERBB2 was found associated with Glioma Susceptibility 1 and Lung Cancer; CTNNB1 is part of a complex of proteins that constitute adherens junctions, mutations in CTNNB1 are a cause of colorectal cancer, pilomatrixoma, medulloblastoma, and ovarian cancer; while CDKN2A (i.e. p16) is frequently mutated or deleted in a wide variety of tumors and is known to be an important tumor suppressor gene.
Comparison with other cancer-related lncRNA prediction algorithms
We used ten-fold cross-validation to compare the prediction performance of our CRlncRC with that of the other two developed prediction algorithms as mentioned in Background. Considering that the latter two developed early and were comprised of relatively small-scale datasets (for example, Zhao et al. collected 70 cancer-related lncRNAs as positive dataset, while Lanzós Andrés et al. collected 45 cancer-related lncRNAs as positive dataset), for fairness, we applied their datasets for retraining our RF model rather than our own well-established model.
As shown in Fig. 5a, the AUC score of our method reached 0.85, much higher than 0.79 reported in Zhao’s results. In the aspects of the feature choice, we adopted genomic, network, expression, and epigenetic (totally four categories of features) in our model, while Zhao et al. selected three features of genome, regulome and transcriptome in their prediction model. This result suggested that the newly introduced epigenetic features in CRlncRC, which not include in Zhao’s study, may have a great contribution to the classification between cancer-related lncRNAs and cancer-unrelated lncRNAs. On the other hand, compared with the NB model used in Zhao’s method, CRlncRC employed RF as its learning model after broad evaluation of five learning techniques, with a dominant consequence of performance enhancements.
A cancer driver gene is defined as one whose mutations increase net cell growth under the specific micro-environmental conditions that exist in the cell in vivo [27]. While a cancer-related lncRNA can be defined as it can promote or inhibits the growth of cancer cells through some mechanism [28]. To comprehensively discover the candidates of cancer driver lncRNAs, Lanzós Andrés et al. developed ExInAtor and run it on 23 tumor types. We choose ‘BRCA’ that is believed as the best tumor type of prediction in Lanzós Andrés’s work and ‘Superpancancer’ to do the comparison, two of which respectively represent the type-specific and the ubiquitous cancer-related lncRNA gene discovery. As shown in Fig. 5b, our model had an obvious superiority against ExInAtor in both ‘Superpancancer’ (AUC score 0.78 vs. 0.53) and ‘BRCA’ (AUC score 0.88 vs. 0.57). These results suggested that ExInAtor is probably just perfect for finding cancer driver lncRNAs when considering of only one feature of genomic somatic mutation, but do not suit for the prediction of cancer-related lncRNAs.
Systematic evaluation of predicted novel cancer-related lncRNAs
We used CRlncRC to predict novel cancer-related lncRNAs from TANRIC lncRNA dataset (Additional file 4), which was completely separate from our positive and negative datasets. The 11,656 unknown lncRNAs were assessed by use of the best RF model we trained. In total, 121 cancer-related lncRNA candidates were identified (Additional file 9), including 55 antisense lncRNAs, 57 intergenic lncRNAs and 9 overlapping lncRNAs. For these novel cancer-related candidates, we further utilized genome-wide data to systematically evaluate the probability of their associations with cancer. For that purpose, three types of lncRNA set were applied to our analysis, including cancer-related lncRNAs (positive), cancer-unrelated lncRNAs (negative), and predicted novel cancer-related lncRNAs (predict).
First, we assumed that these potential cancer-related lncRNAs were likely to have more somatic mutations in cancer genomes, since many previous studies had demonstrated that mutation in function genes is a main cause of cancer induction. To validate the assumption, we made a comparison of the number of somatic mutations (documented in COSMIC) between different lncRNA sets and cancer-related protein set (Fig. 6a). As a result, cancer-related protein set as the positive control possessed far more somatic mutations than cancer-unrelated lncRNA set, which is the negative control (Kolmogorov-Smirnov test, p-value = 6.10e-33). The somatic mutation numbers in both positive and predicted cancer-related lncRNA sets are between cancer-unrelated lncRNAs and cancer-related proteins, with a significant higher quantities than that of cancer-unrelated lncRNAs (Kolmogorov-Smirnov test, p-value 2.35e-07 and 3.25e-06 respectively).
Because a number of lncRNAs exert their function in cis to influence their neighboring genes, we assumed that these potential cancer-related lncRNAs likely have a closer distance with cancer-related proteins by comparison of cancer-unrelated lncRNAs. Therefore, we calculated the distance of different lncRNA sets to their closest cancer-related proteins, and compared it with the random background (that is the distance between cancer-related proteins and random positions in genome) (Fig. 6b). We found that the distances between cancer-unrelated lncRNAs and cancer-related proteins are significantly larger than that between cancer-related lncRNAs and cancer-related proteins (Kolmogorov-Smirnov test, p-value = 0.00041). Similarly, the distance of predicted cancer-related lncRNAs to cancer-related proteins is far closer than to cancer-unrelated lncRNAs (Kolmogorov-Smirnov test, p-value = 0.00116). Moreover, no significant difference was detected between background and cancer-unrelated lncRNAs set, as expected.
Next, we examined whether it is possible that the expression levels of cancer-related lncRNAs in cancer have a more marked change as compared with that of cancer-unrelated lncRNAs (Fig. 6c). By using lncRNA expression data from TANRIC database, we calculated the percentage of lncRNA differential expressed between pairs of cancer and paracancerous tissues (lncRNAs with absolute log2-fold change greater than 1), to see if there is a difference among different lncRNA sets. We found that lncRNAs in positive set had the highest percentage of differential expressed genes (about 40%), while negative set only with about 20%. For those predicted cancer-related lncRNAs, over 28% of them showed differential expression. This result further supported our prediction products have an evident association with cancer, and also revealed that simple dependence on differential expression is far from enough for identification of cancer-related lncRNAs.
Finally, we investigated the GO (Gene Ontology) annotations of these cancer-related lncRNAs candidates. LncRNA’s GO annotations were predicted according to the enriched GO terms of its neighboring proteins in the co-expression network. The Top10 (sorted by p-value, Fisher’s exact test) enriched GO terms were listed in Fig. 6d. From the list, we can found that the functions of these cancer-related lncRNA candidates mainly focused on the following keywords: 1) ‘RNA splicing’, such as ‘mRNA splicing, via spliceosome’, ‘RNA splicing, via transesterification reactions with bulged adenosine as nucleophile’, and ‘RNA splicing, via transesterification reactions’; 2) ‘morphogenesis’, such as ‘cilium morphogenesis’, ‘cell projection morphogenesis’, and ‘cell part morphogenesis’; 3) ‘immune’, such as ‘immune effector process’ and ‘regulation of immune system process’; 4) ‘mRNA processing ‘, such as ‘mRNA processing’ and ‘mRNA metabolic process’. These annotations revealed the potential action modes of cancer-related lncRNAs, which is consistent with many of the latest studies. For example, Simon et al. discovered that a bifunctional RNA, encoding both PNUTS mRNA and lncRNA-PNUTS, could mediate EMT and tumor progression when its splice switches from coding to noncoding transcript [29]. Musahl et al. found ncRNA-RB1 could positively regulate the expression of calreticulin (CALR) and sequentially activate anticancer immune responses [30].
Case study of the cancer-related lncRNA candidates
Besides utilizing genome-wide data to systematically evaluate these cancer-related lncRNA candidates, we also drilled down into some lncRNA cases. To our amazement, in the Top10 cancer-related lncRNA candidates in our prediction results, there are six predicted lncRNAs (NNT-AS1, TP53TG1, LINC01278, LRRC75A-AS1, MAGI2-AS3, EIF3J-AS1) to be found with literature supports. For example, lncRNA NNT-AS1 could promote cell proliferation and invasion through Wnt/β-catenin signaling pathway in cervical cancer [31] and contribute to proliferation and migration of colorectal cancer cells both in vitro and in vivo [32]. Besides, it can promote hepatocellular carcinoma and breast cancer progression through targeting miR-363/CDK6 axis [33] and miR-142-3p/ZEB1 axis [34], respectively. Another example is a p53-induced lncRNA TP53TG1, which is a newly identified tumor-suppressor gene and plays a distinct role in the p53 response to DNA damage. TP53TG1 hypermethylation in primary tumors is shown to be associated with poor outcome [35]. According the newest research findings, TP53TG1 participated in the stress response under glucose deprivation in glioma [36], and enhanced cisplatin sensitivity of non-small cell lung cancer cells through regulating miR-18a/PTEN axis [37].
Besides the lncRNAs mentioned above, another very interesting lncRNA -- UBR5-AS1 (UBR5 antisense RNA1) -- came into our view. UBR5-AS1 sits between two protein-coding genes (UBR5 and P53R2). The 3′ terminal sequence of UBR5 is partial antisense to UBR5, the latter is an oncogene in many cancers and contributes to cancer progression, cell proliferation [38, 39]. The 5′ end of UBR5-AS1 is positioned head-to-head (or divergent) to P53R2, which is believed to play essential roles in DNA repair, mtDNA synthesis and protection against oxidative stress, and has a positive correlation with drug sensitivity and tumor invasiveness [40]. Since a host of studies had demonstrated that lncRNAs often exert their function in cis to influence their neighboring genes, we have good reasons to believe that UBR5-AS1 is very likely to be associated with cancer. However, till now UBR5-AS1 has not been studied by researchers.
Figure 7a showed UBR5-AS1 and its neighbor region, with a variety of information about epigenetics, conservation and repeats (as visualized by UCSC genome browser). We can see that the shared promoter region between UBR5-AS1 and P53R2 had high H3K4me3 and H3K27Ac signals, which are normally associated with active transcription. On the other hand, although lncRNAs often show less conservation compared with protein-coding genes, the lncRNA UBR5-AS1 presented a much strong sequence conservation that is nearly comparable to the proteins of P53R2 and UBR5 (scoring by 100 vertebrates Basewise Conservation by PhyloP). This result suggested that UBR5-AS1 may undergo evolutionary pressure for maintaining some important functions. In addition, in the gene-body region of UBR5-AS1, a great number of SINE and LTR repeats were found, both of which had been extensively proved to be associated with lncRNA’s regulatory function [41]. Next, we identified up to 20 cancer-related proteins co-expressed with UBR5-AS1 (Fig. 7b) and predicted the GO annotations of UBR5-AS1 via GO enrichment analysis (Fig. 7c), by using the co-expression sub-network centralized on UBR5-AS1. The Top10 (sorted by p-value, Fisher’s exact test) enriched GO terms showed that UBR5-AS1 was functionally relevant with ‘RNA splicing’, ‘leukocyte activation’, ‘immune system process’ and so on. All these findings indicate that UBR5-AS1 underlines a highly potential cancer-related lncRNA and is worthy of more intensive study.