Construction of an lncRNA model for prognostic prediction of bladder cancer

Objective We aimed to investigate the role and potential mechanisms of long non-coding RNAs (lncRNAs) in bladder cancer (BC), as well as determine their prognostic value. Methods LncRNA expression data and clinical data from BC patients were downloaded from The Cancer Genome Atlas (TCGA) database. R software was used to carry out principal component analysis (PCA), differential analysis, and prognostic analysis. Lasso regression and multivariate Cox regression analyses were performed to identify potential prognostic genes. The expression of five identified genes and their correlation with prognosis were verified using TCGA and GSE13507 datasets. In addition, quantitative real-time polymerase chain reaction (qRT-PCR) was used to confirm the expression of these five genes in cell lines (two human BC cell lines and one human bladder epithelial cell line) and tissues (84 pairs of BC tissues and the corresponding paracancerous tissues). Risk scores that had been generated from the five genes and their prognostic ability were assessed by receiver operating characteristic (ROC) and Kaplan–Meier (KM) curves. Co-expressed genes were screened by WGCNA and analyzed by GO and KEGG, while functional enrichment and immune infiltration analyses were performed using STRING (https://cn.string-db.org/) and TIMER2.0 (http://timer.cistrome.org/) online tools, respectively. Results CYP4F8, FAR2P1, LINC01518, LINC01764, and DTNA were identified as potential prognostic genes. We found that these five genes were differentially expressed in BC tissue, as well as in BC cell lines, and were significantly correlated with the prognosis of BC patients. KM analysis considering risk scores as independent parameters revealed differences in overall survival (OS) by subgroups. The ROC curve revealed that a combined model consisting of all five genes had good predictive ability at 1, 3, and 5 years. GO and KEGG analyses of 567 co-expressed genes revealed that these genes were significantly associated with muscle function. Conclusion LncRNAs can be good predictors of BC development and prognosis, and may act as potential tumor markers and therapeutic targets that may be beneficial in helping clinicians decide the most effective treatment strategies.


Introduction
Bladder cancer (BC) is ranked tenth among the most common cancers worldwide and ninth among cancer deaths accounting for approximately 573,000 new cases and 213,000 deaths in 2020 [1]. By 2022, the incidence and mortality of BC in China is predicted to be higher than in developed countries such as the United States [2]. Therefore, there is a critical need to develop novel Open Access *Correspondence: lsun91@cmu.edu.cn non-invasive markers to improve the diagnosis and treatment of BC.
In the human genome more than 98% of genes do not encode proteins, and are termed non-coding RNAs (ncR-NAs) [3]. Long non-coding RNAs (lncRNAs) have been identified as RNA transcripts containing more than 200 nucleotides [4]. LncRNAs are not only the predominant type of ncRNA, but are also considered to be essential regulators of a range of biological processes [5,6].
The Cancer Genome Atlas (TCGA) database was started in 2006, and now contains data on 33 types of cancer, as well as the molecular typing of more than 20,000 primary cancers. The TCGA database has had a profound impact on global oncological research, and will be important in identifying new targets associated with BC.
In this study, we identified five novel genes by screening BC data in the TCGA database. By examining their expression profiles and prognostic relevance, as well as performing functional enrichment analysis, we determined the potential biological characteristics of these five genes.

Source and processing of the dataset
RNA data corresponding to 411 BC tissue and 19 paraneoplastic tissue samples were obtained from TCGA using GENCODE software. Analysis using the RTCGA clinical package in R software revealed 409 cases of uroepithelial carcinoma, one case of squamous carcinoma, and one case of adenocarcinoma in the 411 samples. Of the 409 uroepithelial carcinomas, four were stage I, 131 were stage II, 141 were stage III, and 136 were stage IV. Of these, 21 were low-grade, and 388 were high-grade uroepithelial carcinomas. The biomaRt package was used to screen a total of 24,971 genes for PCA analysis, and the sva R package was used to remove batch effects. The GSE13507 dataset-comprising clinical data on 188 BC tissues (165 primary uroepithelial carcinomas and 23 recurrent carcinomas) and 68 paracancerous tissues was downloaded from the GEO database using the clus-terProfiler package. Of the 165 uroepithelial carcinomas, 103 were stage I, 26 were stage II, 28 were stage III and 8 were stage IV. In total, the study comprised 105 lowgrade and 60 high-grade uroepithelial carcinomas.

Selection and validation of differential genes and prognosis-related genes
Differential expression of the 24,971 genes was analyzed using the limma package, and further screened for significantly upregulated and downregulated genes using the DESeq2 package (|Log2FoldChange|> 3.5, P < 0.05). In addition, univariate Cox analysis (Hazard Ratio (HR) ≠ 1, p < 0.05) was performed using the survival package. Prognosis-related genes with HR > 1 were intersected with upregulated differentially expressed genes separately using the VennDiagram package,while prognosis-related genes with HR < 1 were intersected with downregulated differentially expressed genes. The Glmnet package was used to perform Lasso regression and tenfold cross-validation on the intersecting genes to determine the optimal lambda (λ) value for the minimum partial likelihood deviation to derive the validation genes. Multivariate Cox regression analysis was performed on the validated genes to identify the most relevant genes.

Expression and prognostic analysis of the identified genes in the TCGA and GSE13507 databases
Differences between the identified gene expression levels in cancer and paracancer samples were analyzed in TCGA and GSE13507 datasets using the dplyr package. The median expression levels of the identified genes were used as the boundary, and Kaplan-Meier (KM) survival curves based on the survminer R package were used to compare differences in survival between the high and low expression groups of the identified genes in the two databases.

Analysis of gene expression levels in bladder cells and tissues
Two human BC cell lines (T24 and 5637) and a human bladder epithelial cell line (SV-HUC-1) were obtained from Fu Heng Biologicals (Shanghai, China). T24 cells were cultured in DMEM containing 10% fetal bovine serum (FBS), while 5637 cells were cultured in RPMI 1640 containing 10% FBS. Cell lines were cultured at 37 °C in a 5% CO2 incubator. All of the above culture materials were obtained from Gibco. A total of 84 pairs of BC and paracancer specimens were selected from our previous collections. Among them, 36 pairs were muscle invasive BC (MIBC) and 48 pairs were non-muscle invasive BC (NMIBC). All specimens were placed in liquid nitrogen immediately after surgical excision and transferred to a − 80 °C freezer for long-term storage within 30 min. Total RNA was extracted from the cells and tissues using TRIzol. The cDNA was synthesized by reverse transcription using the PrimeScript RTTM Master Mix Takara (Biotech, Dalian, China) according to the manufacturer's instructions. qRT-PCR was performed with SYBR Premix Ex TaqTM II to determine the expression levels of the five identified genes. Relative mRNA expression levels were calculated using the 2 −ΔΔCt method. βactin was used as the control. The primer sequences are shown in Table 1.

Construction and evaluation of the risk score model
Risk score models were constructed based on the multivariate Cox regression data. Forest plots were drawn using the ggforest package. Risk factor linkage plots were drawn using the ggrisk software package. Based on the median risk score, patients were divided into highrisk and low-risk groups. KM survival curves were used to compare differences between high-risk and low-risk subgroups in terms of gender, age, grading, and stage. Time-dependent receiver operating characteristic (ROC) curves based on the survROC package were used to plot risk scores and compare the prognostic accuracy of each gene.

Identification and functional analysis of co-expressed coding genes
The WGCNA software package was used to implement variance screening of the underlying data, analysis of variance between samples, removal of outlier samples, determination of soft thresholds, co-expression module mining, and calculation of correlations between genes and modules. Co-expressed genes of each module were exported and visualized by Cytoscape software. The coexpressed genes were de-duplicated and organized using the TCGA and MSigDB gene sets (c2.all.v7.5.1.entrez. gmt). GO and KEGG functional enrichment analyses were carried out using the clusterProfiler package in the TCGA and MSigDB gene sets (c2.all.v7.5.1.entrez.gmt).

Key coding gene screening and immune infiltration analysis
Co-expressed genes were imported into the STRING online tool to determine protein interactions (screening criteria were scores > 0.4). The top 10 coding genes were obtained using the cytoHubba plugin in Cytoscape software and re-imported into STRING to analyze their functional properties. Immune infiltration analysis was carried out using the TIMER2.0 online tool.

Identification and validation of genes
A total of 300 overlapping genes were identified by intersecting the prognostic-related genes with the differentially expressed genes ( Fig. 2A). A total of 12 genes were identified after subjecting the overlapping genes to a tenfold validated Lasso regression analysis (Fig. 2B). Following Cox univariate and multivariate analyses of these 12 genes, the following five genes were identified: CYP4F8, FAR2P1, LINC01518, LINC01764, and DTNA ( Table 2). Analysis of the TCGA database revealed that the expression levels of CYP4F8, FAR2P1, LINC01518, and LINC01764 were lower in cancerous than paracancerous tissues. In contrast, elevated DTNA expression levels were found in cancerous tissues compared to paracancerous tissues (Fig. 3A). Similar results were obtained using the GSE13507 dataset (Fig. 3B). Further validation by qRT-PCR revealed that the expression levels of CYP4F8, FAR2P1, LINC01518, LINC0176 in T24 and 5637 cell lines were lower than those observed in SV-HUC-1 cells, while DTNA expression levels were higher in T24 and 5637 cell lines compared to SV-HUC-1 cells (Fig. 4A). The expression levels of CYP4F8, FAR2P1, LINC01518, and LINC0176 in MIBC tissues were lower than those observed in paraneoplastic tissues, while the expression levels of DTNA in MIBC tissues were higher than those in paraneoplastic tissues (Fig. 4B). The validation results in NMIBC tissues were consistent with those in MIBC tissues (Fig. 4C). Using the median expression of five genes as cutoff values, we found that high expression of DTNA and low expression of CYP4F8, FAR2P1, LINC01518, and LINC01764 were associated with poor prognoses in BC patients (Fig. 5A-B).

Construction and evaluation of a 5-gene prognostic model
The forest plot based on the Cox multivariate regression data is shown in Fig. 6A   . We found that an increased number of patient deaths was associated with a higher risk score in the corresponding A-plot (Fig. 6C). The heat map of the five gene expression profiles revealed that DTNA was expressed at high levels, while CYP4F8, FAR2P1, LINC01518, and LINC01764 were expressed at low levels (Fig. 6D). KM survival curves showed that the OS was worse in the high-risk group compared to the low-risk group (Fig. 6E

Potential biological functions of the five identified genes
Next, we used the WGCNA package to establish a coexpression network. First, a soft threshold value of 5 was selected (Fig. 8A). WGCNA analysis revealed that 71 genes were co-expressed with CYP4F8 (Fig. 8B), 80 with FAR2P1 (Fig. 8C), 346 with LINC01518 (Fig. 8D), 16 with LINC01764 (Fig. 8E), and 65 with DTNA (Fig. 8F, https:// doi. org/ 10. 6084/ m9. figsh are. 20055 986. v1). These co-expressed genes were sorted and de-duplicated for GO analysis. The main biological processes in which they were significantly enriched were muscle system processes and muscle tissue contraction (Fig. 9A, https:// doi. org/ The expression levels of CYP4F8, FAR2P1, LINC01518, and LINC0176 were lower in MIBC tissues than those observed in paraneoplastic tissues, while DTNA expression levels were higher in MIBC tissues than in paraneoplastic tissues. C: The expression levels of CYP4F8, FAR2P1, LINC01518, and LINC0176 were lower in NMIBC tissues than those observed in paraneoplastic tissues, while the DTNA expression level was higher in NMIBC tissues than in paraneoplastic tissues. β-actin was used as the control. **** P < 0.001, *** P < 0.005, ** P < 0.01, * P < 0.05. ). Enrichment analysis of the KEGG pathway mainly showed calcium signaling pathway and myocardial contraction, similar to the GO analysis findings (Fig. 9D). Following validation in the MsigDB gene set, enrichment analysis also revealed that these genes were associated with myocardial contraction and cardiac conduction (Fig. 9E). Using the STRING online tool, we found that six of the top 10 genes were associated with muscle contraction (Fig. 10A). Using the TIMER 2.0 online tool for immune infiltration analysis of these six genes, we found that the expression levels of TNNI3, TNNT1, ACTC1, and MYH11 were significantly correlated with the level of immune cell infiltration (Fig. 10B). Furthermore, we found that TNNT1, ACTC1, and MYH11 expression levels were negatively correlated with tumor purity, while TNNI3 expression levels were positively correlated with tumor purity.

Discussion
The rapid development of industrialization worldwide has led to a continuous increase in the incidence of BC [7]. As a highly heterogeneous tumor with a poor prognosis and high recurrence rate [8], BC differs from other tumors of the urinary system due to its lack of good prognostic molecular markers [9]. Although the prognosis of BC is mainly related to histopathological staging [10], the use of modern genetic technologies is becoming more common. Thus, a diversified analysis combined with de novo tumor markers would be more conducive to accurate prediction. In recent years, numerous studies have found that lncRNAs are closely related to the prognosis of patients with urological tumors [11][12][13][14]. Therefore, an in-depth study of lncRNAs would be beneficial in identifying de novo markers associated with the prognosis of BC.
In the current study, we used different strategies to screen the TCGA database, and identified five genes associated with the prognosis of BC patients. Among them, DTNA was found to be an independent risk gene for the prognosis of BC patients, while CYP4F8, FAR2P1, LINC01518, and LINC01764 were identified as protective genes.
DTNA, as a member of the myotonic dystrophy protein family, is not only involved in calcium-binding and synapse formation and stability [15], but also plays a role in the progression of several malignancies as a proto-oncogene. For example, Fu et al. [16] found that miRNA-301b promotes the growth of esophageal cancer by regulating DTNA. In addition, Hu et al. [17] demonstrated that by binding to STAT3, DTNA further activates STAT3 resulting in the induction of TGF-β1 expression and inhibition of P53 expression, thereby promoting the progression of HBV-induced liver fibrosis, cirrhosis and hepatocellular carcinoma. Liu et al. [18] screened eight key lncR-NAs, including DTNA, in early colon adenocarcinoma, and found that DTNA may be involved in the pathogenesis of early colon adenocarcinoma, as well as a potentially valuable tool for the diagnosis of early colon adenocarcinoma. Furthermore, using TCGA and GEO databases, Zhang et al. [19] successfully constructed and validated a new hypoxic signature model for BC, which accurately predicted the prognosis of BC patients. DTNA was identified as one of the key predictive genes in this model, consistent with our findings. The oncogenic role of DTNA may be associated with its involvement in the aggregation of nicotinic acetylcholine receptors, which control the synthesis and release of growth, angiogenesis and neurotrophic factors in cancer cells and the cancer microenvironment [20].
CYP4F8, a member of the cytochrome P450 family, functions as a 19-hydroxylase of prostaglandins in the seminal vesicles [21] and is involved in the metabolism of arachidonic acid [22]. When stimulated, arachidonic acid is converted by CYP4F8 into various biologically active arachidonic acids, which induce the proliferation of prostate cancer cells [23]. This was confirmed by Vainio et al. [24], who found that CYP4F8 was highly expressed in prostate cancer and was therefore a potential novel therapeutic target for prostate cancer. Other studies have shown that CYP4F8, a regulatory target of human peroxisome proliferator-activated receptor, has antiangiogenic and anti-tumorigenic properties [25], which is consistent with the findings in our study.
The reprogramming of fatty acid metabolism in cancer can promote tumor growth, angiogenesis, survival, and metastasis [26]. The fatty acid metabolic pathway requires the generation of fatty acyl coenzyme A. FAR2P1, as a reductase of fatty acyl coenzyme A, may be involved in tumor cell progression through fatty acid metabolism. Wang et al. [27] also confirmed that FAR2P1 was highly expressed in the EGFR exon 19 deletion group of lung adenocarcinoma and that differential expression of FAR2P1 was associated with the development and progression of non-small cell lung cancer [28].
LINC01518 and LINC01764 are lncRNAs that act as targets or regulatory factors in multiple signaling pathways [29,30]. Although their oncogenic roles have been studied in neuroblastoma [31], esophageal squamous cell carcinoma [32] and colorectal cancer [33], little is known about their anti-oncogenic roles.
In summary, in the current study, we identified five genes associated with BC malignancy. The differential expressions of CYP4F8, FAR2P1, LINC01518, LINC0176, and DTNA were verified in cell lines and tissues. In particular, the same trend was obtained in the different subtypes of BC. The overall expression levels of CYP4F8, FAR2P1, LINC01518, and LINC0176 were lower and the overall expression levels of DTNA were higher in patients with MIBC, when compared to patients with NMIBC. This indirectly suggested that CYP4F8, FAR2P1, LINC01518, and LINC0176 mainly played the role of anti-oncogenes and DTNA mainly played the role of oncogenes in BC patients. We found that a risk-prognosis model based on these five genes effectively assessed the survival of BC patients. Our findings confirmed that OS was worse in the high-risk group than the low-risk group. We further explored the potential biological functions of these five genes, using co-expression followed by GO analysis, and found that these genes were significantly associated with muscle composition, contraction, and calcium signaling pathways, which were further validated in the MsigDB gene set. Immune infiltration analysis revealed that the expression levels of TNNI3, TNNT1, ACTC1, and MYH11 were correlated with the level of immune cell infiltration and tumor purity. However, whether these four coding genes are related to the prognosis of BC patients and whether they are independently associated with the expression of the five lncRNAs will be addressed in future studies. In the current study, we also examined the TNNI3, TNNT1, ACTC1 and MYH11 genes, but unfortunately, none of these coding genes was associated with the prognosis of BC patients. In addition, we found no independent association between the four coding genes and the expression of the five lncRNAs in known databases using R software and the STRING online tool.
There are several limitations to the current study. First, the BC data in the TCGA database contain only one patient with squamous carcinoma and one patient with adenocarcinoma, while the remaining patients were all diagnosed with uroepithelial carcinoma. Similarly, all 165 samples in the GSE13507 dataset were from patients with uroepithelial carcinoma. Thus, due to the limited number of squamous and adenocarcinoma samples, differential analysis based on cancer type was not possible. Future studies should include analysis of additional public databases.Second, although this study demonstrated that DTNA, CYP4F8, FAR2P1, LINC01518, and LINC01764 could predict the occurrence and prognosis of BC, our study was limited to data mining analysis and qRT-PCR validation on cell lines and tissues, and did not include control or more in-depth validation studies.

Conclusion
We identified five new genes by screening BC data in the TCGA database and studied their expression profiles, prognostic relevance, and biological functions. Our findings revealed that a prognostic model based on these five genes could predict the 1-, 3-, and 5-year OS in BC patients. Thus, our results indicate that these lncRNAs can predict the occurrence and prognosis of BC, and may be novel BC markers and potential therapeutic targets of BC that could prove to be beneficial in helping clinicians decide the most effective treatment strategy.