The Cancer Omics Atlas: an integrative resource for cancer omics annotations

Background The Cancer Genome Atlas (TCGA) is an important data resource for cancer biologists and oncologists. However, a lack of bioinformatics expertise often hinders experimental cancer biologists and oncologists from exploring the TCGA resource. Although a number of tools have been developed to help cancer researchers use the TCGA data, these tools do not fully meet the needs of the large community of experimental cancer biologists and oncologists without bioinformatics expertise. Methods We developed a new web-based tool, The Cancer Omics Atlas (TCOA, http://tcoa.cpu.edu.cn), for fast and straightforward querying of TCGA “omics” data. Results TCOA supports querying of gene expression, somatic mutation, microRNA (miRNA) expression, and protein expression data based on a single molecule or cancer type. TCOA also supports querying of expression correlations between gene pairs, miRNA pairs, gene and miRNA, and gene and protein. Moreover, TCOA supports querying of the associations between gene, miRNA, or protein expression and survival prognosis in cancers. In addition, TCOA displays transcriptional profiles across various human cancer types based on a pan-cancer analysis. Finally, TCOA supports querying of molecular profiles for 2877 immune-related genes in human cancers. These immune-related genes include established or promising targets for cancer immunotherapy such as CTLA4, PD1, PD-L1, PD-L2, IDO1, LAG3, and TIGIT. Conclusions TCOA is a useful tool that supplies a number of new functions complementary to those of existing tools to facilitate exploration of the TCGA resource.


Background
With the development of high-throughput sequencing technology, a large volume of cancer genomics data is emerging and advancing cancer research. Notably, The Cancer Genome Atlas (TCGA) datasets cover 33 different cancer types and more than 10,000 cancer cases in total (https://gdc-portal.nci.nih.gov/). Each TCGA cancer type contains different types of "omics" data, including: whole exome (genome) sequencing; genomic DNA copy number arrays; DNA methylation; mRNA expression array and RNA-Seq data; microRNA (miRNA) sequencing; reverse-phase protein arrays; and clinical metadata. TCGA is becoming a necessary data resource not only for cancer informatics researchers, but also for experimental cancer researchers and oncologists. In particular, many cancer researchers are interested in a preliminary search of TCGA to find or filter their experimental targets, and many seek validation of their experimental results in TCGA. However, because most experimental biologists and oncologists lack sufficient skills in bioinformatics analysis, it is usually difficult for them to explore the TCGA resource. Thus, web-based tools with a user-friendly graphical user interface (GUI) should help experimental biologists and oncologists find what they need in TCGA. Some web-based tools have been developed to explore the TCGA data. cBioPortal (http://cbioportal.org) is a web resource for analyzing and visualizing multidimensional cancer genomics data, including those from TCGA [1,2]. The Broad Institute TCGA GDAC Firehose (http://gdac.broadinstitute.org/) provides standardized datasets, algorithms, and analysis results for TCGA. MEXPRESS (http://mexpress.be/) provides query and visualization of the clinical, gene expression and methylation data in TCGA [3].
GEPIA (http://gepia.cancer-pku.cn/) is a web tool for visualizing gene expression comparisons and correlations, and associations with patient survival prognosis, based on TCGA and GTEx data [4]. The Cancer Proteome Atlas (http://tcpaportal.org/tcpa/) is a web-based data portal for downloading, visualizing, and analyzing TCGA proteomics data [5]. All these tools are valuable resources that help cancer biologists and oncologists explore the TCGA data. However, the existing tools still leave room for improvement in serving the large community of experimental biologists and oncologists without bioinformatics expertise. For example, cBioPortal lacks differential expression and survival analyses based on gene expression profiles, although these data are often of interest to biologists and oncologists. GDAC Firehose is a good resource for bioinformatics scientists but is not straightforward for cancer biologists and oncologists without bioinformatics training. Although MEXPRESS provides fast querying and visualization of the clinical, gene expression and methylation data in TCGA, it lacks some data types that are relevant to cancer biology and oncology, such as gene somatic mutations, miRNAs, and proteins, and their associations with survival prognosis in cancers. Similarly, GEPIA is a recently published web tool that specializes in querying gene expression and its association with survival prognosis in cancers, but it lacks other cancer omics data such as gene somatic mutations, miRNAs and proteomics, and the associations of these molecular profiles with survival prognosis in cancers.
To provide useful functions complementary to these existing tools, we developed a new web-based tool, The Cancer Omics Atlas (TCOA, http://tcoa.cpu.edu.cn), for fast and straightforward querying of TCGA gene expression, somatic mutation, miRNA expression, and protein expression data based on a single molecule or cancer type. TCOA also supports querying of gene-gene, miRNA-miRNA, protein-protein, gene-miRNA and gene-protein expression correlations, and of the correlation of gene, miRNA, or protein expression with survival prognosis in cancers. Moreover, TCOA provides a portrait of the transcriptional landscape of human cancers based on a pan-cancer analysis. In addition, because cancer immunotherapy is increasingly noted for its effectiveness in treating a variety of cancers, we provide a dedicated tab in TCOA for querying 2877 immune-related genes.

Construction and content
Database architecture and web interface
TCOA was developed using PHP (Hypertext Preprocessor, version 5.5.10) with an R-based web framework. The back-end database was built with MySQL (version 5.5.36) and contains the TCGA data needed for querying. PHP scripts handle database queries and the computational results produced by R scripts, generate the results, and send them to users. The user interface of the TCOA website was developed with HTML (HyperText Markup Language) and JavaScript. TCOA contains six major modules: Gene, MicroRNA, Cancer, Pan-cancer, Immuno-Oncology, and Protein (Table 1). For every user query, TCOA returns visualized results, mostly as figures and in a few cases as tables.

Functions of the "Gene" module
In the "Gene" module, when a user queries a gene by gene symbol or Entrez ID, TCOA outputs information on the expression and somatic mutations of the gene in 33 cancer types. The gene expression data include: gene expression levels in cancers; expression correlations with other genes in cancers; differential expression comparisons between cancer and normal samples (if the gene expression data in normal samples are available in TCGA); differential expression comparisons between different cancer phenotypes such as stage and grade; and associations of gene expression with survival prognosis in cancers. The gene somatic mutation data include: mutation rates in cancers; variant classification in cancers; comparisons of mutation rates between different cancer phenotypes such as stage and grade; comparisons of gene expression between gene-mutated and gene-wildtype cancers; and associations of gene mutations with survival prognosis in cancers.
For example, if we are interested in the tumor suppressor gene TP53 in cancers, we can search for the gene in the "Gene" module. First, we obtain a summary of the TP53 mean expression levels and somatic mutation rates in 33 cancer types. We find that TP53 has its highest mutation rate, 91.2%, in uterine carcinosarcoma (UCS) and its second highest, 83%, in ovarian serous cystadenocarcinoma (OV). In total, ten cancer types have a TP53 mutation rate greater than 50% (Fig. 1a). Moreover, we can find a summary of the variant classification of TP53 mutations in cancers; e.g., in pancreatic adenocarcinoma (PAAD), 64% and 12% of TP53 mutations are missense and frame-shift insertion mutations, respectively (Fig. 1b). Importantly, we can find the associations of TP53 mutations with survival prognosis in cancers. For example, TP53 mutations are associated with worse survival (overall and disease-free survival) prognosis in PAAD (Fig. 1c). In addition, one could be interested in the expression associations of other genes with TP53 in cancers, e.g., the expression association between PLK1 and TP53 in PAAD (Fig. 1d).
In fact, previous studies have shown that PLK1 interacts with TP53, and that p53 dysfunction causes enhanced expression of PLK1 in cancers [6][7][8][9].

Table 1 (excerpt; each function is followed by its output format). Show genes whose upregulation is associated with poor prognosis in cancers: survival curve. Show genes whose downregulation is associated with poor prognosis in cancers: survival curve. Show genes with increased or decreased expression alterations consistently from normal tissue to low-advanced cancers, and from low-advanced cancers to highly-advanced cancers: bar chart. Show the cell cycle pathway consistently up-regulated in cancers: bar chart. Show genes whose expression levels are significantly higher or lower in cancers than in normal tissue: table. Show genes whose expression levels are significantly higher or lower in high-grade cancers than in low-grade cancers.

Functions of the "MicroRNA" module
In the "MicroRNA" module, a user can query a miRNA by its human miRNA symbol. TCOA will output the miRNA expression-related data in 33 cancer types. These data include: miRNA expression levels in cancers; expression correlations with genes in cancers; expression correlations with other miRNAs in cancers; differential miRNA expression comparisons between cancer and normal samples (if the miRNA expression data in normal samples are available in TCGA); differential miRNA expression comparisons between different cancer phenotypes such as stage and grade; and associations of miRNA expression with survival prognosis in cancers. For example, to explore the human miRNA hsa-mir-100 in cancers, we can query hsa-mir-100 in the "MicroRNA" module. First, we obtain a summary of hsa-mir-100 mean expression levels in 33 cancer types. Next, we may wish to explore the expression levels of hsa-mir-100 in breast invasive carcinoma (BRCA). On selecting this cancer type, we find that hsa-mir-100 has significantly lower expression levels in BRCA than in normal tissue (Fig. 2a). Moreover, we find that elevated expression of hsa-mir-100 is associated with better overall survival (OS) prognosis in BRCA (Fig. 2b). Furthermore, we explore the expression [...] cancer by targeting a number of genes including PLK1 [10,11]. Accordingly, the TCOA search result shows that elevated expression of PLK1 is associated with worse OS prognosis in BRCA (Fig. 2d).

Functions of the "Cancer" module
In the "Cancer" module, when a user clicks a cancer type, TCOA outputs the 50 most frequently mutated genes in the cancer, together with the genes and miRNAs that are up-regulated or down-regulated in the cancer relative to normal controls. This module also outputs important pathways associated with the highly-expressed genes in the cancer type. The up-regulated and down-regulated genes or miRNAs are selected according to thresholds set by the user: the fold change of expression levels in cancer compared to normal tissue, and the adjusted p-value. The adjusted p-values (FDR q-values) are calculated by the Benjamini and Hochberg (BH) method [12]. For example, if we query liver hepatocellular carcinoma (LIHC) in the module, we find that TTN has the highest mutation rate, 34%, and TP53 the second highest, 31.1%, in LIHC. The other frequently mutated genes in LIHC include CTNNB1, MUC16, ND5, OBSCN, RYR2, and ALB (Fig. 3a). TCOA shows that THBS4 has the highest mean expression increase (nearly 40-fold) in LIHC relative to normal tissue. This gene has been shown to be overexpressed in multiple cancer types [13,14]. The other overexpressed genes in LIHC include ZIC2, GPC3, EPS8L3, CPLX2, IGF2BP1, NUF2, CDC25C, CDC20, and GABRD (Fig. 3b). In contrast, the most down-regulated gene in LIHC is CLEC4M, whose expression is nearly 335-fold lower than in normal tissue. This gene encodes a protein that is involved in the innate immune system and is expressed in the endothelial cells of the lymph nodes and liver. Previous studies have shown that CLEC4M and its product are down-regulated in LIHC and other cancer types [15,16]. The other repressed genes in LIHC include CLEC4G, INS-IGF2, CLEC1B, CYP1A2, GDF2, FCN2, MARCO, STAB2, HAMP, and MT1H (Fig. 3b).
The gene set enrichment analysis of the highly-expressed genes in LIHC shows that the pathways of cell cycle, DNA replication, ECM-receptor interaction, p53 signaling, MAPK signaling, axon guidance, focal adhesion, metabolism, and mismatch repair are enriched in LIHC (Fig. 3c). In addition, TCOA shows that mir-1269, mir-10b, mir-224, and mir-183 are overexpressed in LIHC with more than 4-fold expression increase compared to normal tissue, while mir-1258, mir-675, mir-490, mir-424, mir-483, mir-1247, mir-199b, mir-199a-2, mir-139, mir-199a-1, mir-3607, and mir-451 are underexpressed in LIHC with more than 4-fold expression decrease compared to normal tissue (Fig. 3d).
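The threshold-based selection described for the "Cancer" module can be illustrated with a short sketch. It is written in Python with the standard library only and is not the TCOA implementation (which uses PHP and R). The function names, the default thresholds, the convention that fold change is a cancer/normal ratio, and the toy gene names and numbers are all assumptions made for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR q-values), in input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so that each q-value is the minimum
    # of p * n / rank over all ranks at or above its own (capped at 1).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def select_deregulated(genes, fold_changes, pvalues, min_fold=2.0, max_q=0.05):
    """Return (up-regulated, down-regulated) gene lists.

    fold_changes are cancer/normal ratios (an assumption of this sketch);
    min_fold and max_q stand in for the user-supplied thresholds.
    """
    qvalues = bh_adjust(pvalues)
    up, down = [], []
    for gene, fold, q in zip(genes, fold_changes, qvalues):
        if q >= max_q:
            continue  # not significant after multiple-testing correction
        if fold >= min_fold:
            up.append(gene)
        elif fold <= 1.0 / min_fold:
            down.append(gene)
    return up, down


# Toy input; gene names and numbers are illustrative, not TCGA results.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
fold_changes = [8.0, 0.1, 1.2, 3.0]
pvalues = [0.001, 0.004, 0.03, 0.6]
up, down = select_deregulated(genes, fold_changes, pvalues)
print("up:", up, "down:", down)
```

In this toy input, GENE_C passes the q-value threshold but not the fold-change threshold, and GENE_D passes the fold-change threshold but not the q-value threshold, so neither is reported.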

Functions of the "Pan-cancer" module
In the "Pan-cancer" module, TCOA outputs the genes consistently up-regulated or down-regulated, pathways significantly up-regulated, and genes whose deregulation is significantly associated with survival prognosis across various cancer types. This module also outputs the genes that are differentially expressed between cancer and normal samples, and between lowly-advanced and highly-advanced cancers across various cancer types. We refer to early-stage (Stage I-II) or low-grade (Grade I-II) cancers as lowly-advanced cancers, and to late-stage (Stage III-IV) or high-grade (Grade III-IV) cancers as highly-advanced cancers. A comparison of tumor mutation burden (TMB, defined as the total number of substitutions, regardless of variant type) among different cancer types is also shown in this module (Fig. 4). Figure 4 shows that cutaneous melanoma (SKCM) has the highest median TMB, followed by lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). This is consistent with reports that TMB is associated with clinical response to immunotherapy [17][18][19]: several cancer types with high TMB, such as melanoma [20] and non-small cell lung cancer (NSCLC) [21], have shown positive responses to immune checkpoint blockade treatment. The results presented in the "Pan-cancer" module are mainly based on a recent study by our group [22].
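The TMB comparison can be sketched as a per-sample mutation count followed by a median per cancer type. The Python sketch below is illustrative only: the input layout (one record per substitution plus a sample-to-cancer mapping), the function name, and the sample and cancer labels are hypothetical and do not reflect TCGA file formats or values.

```python
from collections import Counter, defaultdict
from statistics import median


def median_tmb_by_cancer(mutations, sample_cancer):
    """Median per-sample mutation count for each cancer type.

    mutations: iterable of (sample_id, gene) pairs, one per substitution.
    sample_cancer: dict mapping every sample_id to its cancer type, so that
    samples with no recorded mutation still contribute a TMB of 0.
    """
    per_sample = Counter(sample_id for sample_id, _gene in mutations)
    by_cancer = defaultdict(list)
    for sample_id, cancer in sample_cancer.items():
        by_cancer[cancer].append(per_sample.get(sample_id, 0))
    return {cancer: median(counts) for cancer, counts in by_cancer.items()}


# Toy records; sample IDs, genes and cancer labels are made up.
mutations = [
    ("s1", "GENE_A"), ("s1", "GENE_B"), ("s1", "GENE_C"),
    ("s2", "GENE_A"),
    ("s3", "GENE_A"), ("s3", "GENE_B"),
]
sample_cancer = {"s1": "TYPE_X", "s2": "TYPE_X", "s3": "TYPE_Y", "s4": "TYPE_Y"}

tmb = median_tmb_by_cancer(mutations, sample_cancer)
# Cancer types ordered from highest to lowest median TMB.
ranking = sorted(tmb, key=tmb.get, reverse=True)
print(tmb, ranking)
```

Passing the full sample list separately from the mutation records is a deliberate choice here: deriving samples from the mutation records alone would silently drop mutation-free samples and inflate the median.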

Functions of the "Immuno-Oncology" module
In the "Immuno-Oncology" module, TCOA supports querying of 2877 immune-related genes for their expression, mutations and associations with survival prognosis in various cancer types. When a user selects a gene, TCOA opens the same gene information interface that a query of that gene in the "Gene" module would return. For example, many users could be interested in the gene PD-L1, whose product plays an important role in cancer immune evasion and is an important target for cancer immunotherapy [23]. TCOA shows that PD-L1 has significantly higher expression levels in esophageal carcinoma (ESCA) and kidney chromophobe (KICH), and significantly lower expression levels in LIHC, LUAD, LUSC and prostate adenocarcinoma (PRAD), compared to the corresponding normal tissue (Fig. 5a). Interestingly, elevated expression of PD-L1 is associated with better OS and/or disease-free survival (DFS) prognosis in adrenocortical carcinoma (ACC), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), and SKCM, but with worse OS and/or DFS prognosis in brain lower grade glioma (LGG) and PAAD (Fig. 5b).

Functions of the "Protein" module
In the "Protein" module, when a user queries a protein, TCOA will output expression data for the protein in 33 cancer types. These data include: protein expression levels in cancers; protein-gene expression correlation in cancers; differential protein expression comparisons between different cancer phenotypes such as stage and grade; and associations of protein expression with survival prognosis in cancers. Figure 6 shows two DNA mismatch repair proteins, MSH2 (MutS protein homolog 2) and MSH6 (MutS protein homolog 6), whose expression is significantly associated with survival prognosis in a wide range of cancer types. Elevated expression of MSH2 and MSH6 is associated with worse OS and/or DFS prognosis in BRCA, sarcoma (SARC), uterine corpus endometrial carcinoma (UCEC), thyroid carcinoma (THCA), rectum adenocarcinoma (READ), KIRC, UCS and ACC, but with better OS and/or DFS prognosis in LUSC and COAD.

Computational and statistical analyses
Class comparison to identify differentially-expressed genes, miRNAs or proteins
We normalized the TCGA gene and miRNA expression values by log2(x + 1) transformation, and used the protein expression data as downloaded since they had already been normalized. We compared expression levels of a single gene, miRNA or protein between two classes of samples using Student's t test.
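As a minimal sketch of this class comparison, the Python code below applies the log2(x + 1) transformation and computes a two-sample Student's t statistic with pooled variance. It uses only the standard library and stops at the t statistic and degrees of freedom; a p-value would be read from the t distribution, for instance with `scipy.stats.ttest_ind`. The function names and the toy expression values are assumptions for illustration.

```python
import math
from statistics import mean, variance


def log2p1(values):
    """log2(x + 1) transformation of raw expression values."""
    return [math.log2(x + 1) for x in values]


def student_t(group_a, group_b):
    """Two-sample Student's t statistic (equal variances assumed).

    Returns (t, degrees_of_freedom). Each group needs at least two values.
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pooled sample variance across the two groups.
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Toy expression values for one gene; not TCGA data.
tumor = log2p1([15, 31, 63])
normal = log2p1([1, 3, 7])
t_stat, dof = student_t(tumor, normal)
print(round(t_stat, 4), dof)
```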

Correlation analysis, pathway analysis and survival analysis
We calculated gene-gene, gene-miRNA, miRNA-miRNA and gene-protein expression correlations by Pearson product-moment or Spearman correlation analysis. We performed pathway analysis of gene sets using the Gene Set Enrichment Analysis (GSEA) software [24]. The KEGG pathways significantly associated with gene sets were displayed (FDR q-value < 0.05). We performed survival analysis of TCGA patients based on gene somatic mutation data and on expression data for genes, miRNAs and proteins, respectively. Kaplan-Meier survival curves were used to show the survival (OS or DFS) differences between gene-mutated and gene-wildtype cancer patients, and between patients with higher and lower expression levels of a gene, miRNA or protein. The higher-expression-level and lower-expression-level groups were defined by the median expression value: a patient whose expression level was higher than the median was assigned to the higher-expression-level group, and all other patients to the lower-expression-level group. We used the log-rank test to calculate the significance of survival-time differences between two classes of patients.
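The median split and the log-rank test can be sketched as follows. This is a standard-library Python illustration of the two-group log-rank statistic, not the code behind TCOA; the function names and the toy follow-up data are assumptions. The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math
from statistics import median


def split_by_median(expression):
    """True for the higher-expression-level group, False otherwise."""
    cutoff = median(expression)
    return [value > cutoff for value in expression]


def logrank_test(times, events, in_group_a):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times: follow-up time per patient; events: True if the event was
    observed, False if censored; in_group_a: True for group A members.
    """
    observed_a = expected_a = var_a = 0.0
    for t in sorted({tm for tm, ev in zip(times, events) if ev}):
        # Patients still at risk just before time t, overall and in group A.
        n = sum(1 for tm in times if tm >= t)
        n_a = sum(1 for tm, g in zip(times, in_group_a) if tm >= t and g)
        # Events occurring exactly at time t, overall and in group A.
        d = sum(1 for tm, ev in zip(times, events) if tm == t and ev)
        d_a = sum(1 for tm, ev, g in zip(times, events, in_group_a)
                  if tm == t and ev and g)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            var_a += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var_a == 0:
        raise ValueError("log-rank test needs events while both groups are at risk")
    chi2 = (observed_a - expected_a) ** 2 / var_a
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy cohort; expression values and follow-up times are made up.
expression = [9.1, 8.7, 8.2, 3.0, 2.5, 2.1]
times = [1, 2, 3, 4, 5, 6]
events = [True] * 6
high = split_by_median(expression)
chi2, p_value = logrank_test(times, events, high)
print(high, round(chi2, 3), round(p_value, 3))
```

For real analyses, an established survival package is preferable to a hand-written test; the sketch only makes the grouping rule and the test statistic concrete.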

Utility and discussion
The TCGA data are an invaluable resource for cancer researchers and oncologists. However, a lack of bioinformatics expertise often hinders experimental cancer biologists and oncologists from exploring the TCGA resource. Although a number of tools have been developed to help cancer biologists and oncologists use the TCGA data, these tools do not fully meet the needs of the large community of experimental cancer biologists and oncologists without bioinformatics expertise. We therefore developed TCOA with additional functions complementary to these existing tools. TCOA provides fast and straightforward querying of TCGA gene expression, somatic mutation, miRNA expression, and protein expression data based on a single molecule or cancer type. TCOA supports querying of expression correlations not only between gene pairs, but also between miRNA pairs, gene and miRNA, and gene and protein. TCOA also supports querying of the associations of gene, miRNA, or protein expression with survival prognosis in cancers. Moreover, TCOA presents transcriptional profiles across various human cancer types based on the pan-cancer analysis [22]. In addition, TCOA supports querying of molecular profiles for 2877 immune-related genes in human cancers. These immune-related genes include established or promising targets for cancer immunotherapy such as CTLA4, PD1, PD-L1, PD-L2, IDO1, LAG3, and TIGIT. Cancer researchers and oncologists are likely to be interested in querying the expression, mutations and correlations with survival prognosis of these immune-related genes across various human cancer types.
TCOA will be continuously updated with more functions and modules such as DNA methylation and DNA copy number alteration modules. In addition, for a specific cancer type, one could be interested in molecular alterations across different subtypes. For the immune-related genes, one could be more interested in gene-sets that represent the activities of specific immune cells, functions or pathways [25]. TCOA is expected to provide such functions in future updates.

Conclusions
TCOA is a useful tool that supplies a number of unique and new functions complementary to the existing tools to facilitate exploration of the TCGA resource.