Among the 32 TCGA cancers, 18 have fewer than 10 matched adjacent normal tissue samples in the Treehouse dataset, and ten have no matched adjacent tissue samples at all (Fig. 1). By contrast, GTEx has profiles for 47 tissue sites with at least ten normal samples each. This motivates exploring GTEx as a source of reference tissue.
Computing tissue of origin
We first asked whether gene expression profiles could be used to identify the tissue of origin. We considered a cancer site correctly identified if the computed tissue was the site of cancer origin or a closely related proximal site, e.g. kidney - cortex for kidney papillary carcinoma. We considered sites unrelated if they lie further away from the site of origin (Fig. 4). Using a minimal set of 100 varying genes, the correlation method correctly identified the top tissue site for only 8 of 14 cancers. Increasing the number of varying genes to 5000 raised this to 11 of 14 cancers. Increasing the number of varying genes further gave no additional improvement in tissue selection. PCA, a standard dimension reduction method, correctly identified only 8 of 14 cancers, so we did not examine it in the following analysis. The best automated method we found for reference tissue selection was correlating autoencoder features, which chose the correct tissue for 12 of 14 cancers.
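The correlation-based site selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of Pearson correlation on per-gene medians, and the choice to measure variability across pooled tumor and normal samples are all assumptions made for illustration.

```python
from statistics import median, pvariance

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def top_varying_genes(expr, n_genes):
    """expr maps gene -> list of values; return the n_genes most variable genes."""
    return sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)[:n_genes]

def rank_sites(tumor, normals_by_site, n_genes):
    """Rank candidate normal sites by correlation with the tumor profile.

    tumor: gene -> list of values across tumor samples
    normals_by_site: site -> (gene -> list of values across that site's samples)
    """
    # Assumption: variability is measured over tumor and normal samples pooled.
    pooled = {
        g: tumor[g] + [v for expr in normals_by_site.values() for v in expr[g]]
        for g in tumor
    }
    genes = top_varying_genes(pooled, n_genes)
    tumor_profile = [median(tumor[g]) for g in genes]
    scores = {}
    for site, expr in normals_by_site.items():
        site_profile = [median(expr[g]) for g in genes]
        scores[site] = pearson(tumor_profile, site_profile)
    # Highest-correlation site first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: four genes, two tumor samples, two candidate sites.
toy_tumor = {"g1": [9, 10], "g2": [1, 2], "g3": [5, 5], "g4": [0, 1]}
toy_normals = {
    "liver": {"g1": [8, 9], "g2": [2, 1], "g3": [5, 5], "g4": [1, 0]},
    "lung": {"g1": [1, 2], "g2": [9, 8], "g3": [5, 5], "g4": [4, 5]},
}
toy_ranking = rank_sites(toy_tumor, toy_normals, n_genes=3)
```

In the toy data the constant gene g3 is dropped by the variance filter, and the site whose profile tracks the tumor profile ranks first.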
Further examination of the three cancers misclassified by the varying genes method, Bladder Urothelial Carcinoma, Lung Squamous Cell Carcinoma and Stomach Adenocarcinoma, revealed correlation values of 0.549, 0.300, and 0.858, respectively. The low correlations for the bladder and lung carcinomas may be due to substantial differences in tissue expression between the computed site, esophagus, and their expected sites of origin, bladder and lung. The correlation for stomach adenocarcinoma was quite high, which may be due to the similarity between the computed site, the ileum of the small intestine, and the stomach (Additional file 1: Table S1).
Squamous cell carcinomas arise from squamous cells that reside in the cavities and surfaces of blood vessels and organs. Because samples in GTEx were taken from bulk tissues, this may lower the computed correlation between the cancer tissue and its site of origin, leading to erratic computational choices. Manual selection of the tissue of origin for lung squamous cell carcinoma and stomach adenocarcinoma improved the correlation from 0.300 and 0.858 to 0.883 and 0.926, respectively (Additional file 1: Table S1). For Bladder Urothelial Carcinoma, the varying genes method chose esophagus - mucosa as the top site (correlation 0.549), whereas the autoencoder correctly chose the bladder site (correlation 0.926). This shows that choosing the correct site improves the correlation.
Interestingly, Kidney Clear Cell Carcinoma, Kidney Papillary Cell Carcinoma and Kidney Chromophobe share the same tissue of origin, Kidney - Cortex. This is consistent with cancers arising from different parts of one tissue, and raises the question of whether we should use all normal samples from one site as the reference.
Examples of hepatocellular carcinoma and bladder urothelial carcinoma
We used two cancers as examples for further in-depth analysis: Hepatocellular Carcinoma (HCC) and Bladder Urothelial Carcinoma (BUC). In our prior results, we found that using more genes to compute the correlation generally helped to select the correct tissue site for the tumor. We therefore computed the correlation for each site using an increasing number of varying genes as well as autoencoder features. For HCC, we normalized the correlations relative to that of the cancer site, liver (Fig. 5a). As the number of genes increases, all tissues generally converge toward a higher correlation with the disease tissue. This may be due to the inclusion of genes from conserved regions or with low expression. Using all features from the autoencoder gives much better separation of the liver site from unrelated sites, indicating that the autoencoder captures the biology of the disease sample more specifically (Fig. 5b-c).
For BUC, however, the varying genes method was unable to identify bladder as the best site and instead chose esophagus (Fig. 6a-b). Increasing the number of varying genes from 100 to 40,000 brought down the correlation of the esophagus site relative to bladder, but it also brought up the correlation of other tissue sites relative to bladder (Fig. 6a), similar to what we see in Fig. 5a. This suggests that naively increasing the number of varying genes does not help to distinguish between tissue sites. Meanwhile, the autoencoder method correctly predicts bladder as the top site, with a large separation between bladder and esophagus (Fig. 6a, c). Notably, the correlation in BUC is lower than that in HCC across different similarity metrics. This suggests that cell composition in bladder tissues may be more diverse.
Disease signature comparison
Having demonstrated that gene expression profiles can be used to identify the tissue of origin, we then asked whether GTEx samples sharing the same tissue of origin can substitute for adjacent tissues from TCGA when creating disease signatures. We employed three approaches to select samples (Fig. 2). We evaluated consistency based on the significance of the overlap between signatures and the correlation of fold changes of common signature genes.
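One common way to quantify the significance of an overlap between two gene signatures is a one-sided hypergeometric test. The text does not state which test was used, so the sketch below is an assumption for illustration only; the function name and parameters are hypothetical.

```python
from math import comb

def overlap_pvalue(n_universe, n_sig_a, n_sig_b, n_overlap):
    """P(overlap >= n_overlap) if signature B were n_sig_b genes drawn at
    random from n_universe genes, of which n_sig_a belong to signature A."""
    total = comb(n_universe, n_sig_b)
    upper = min(n_sig_a, n_sig_b)  # largest overlap that is possible
    tail = sum(
        comb(n_sig_a, i) * comb(n_universe - n_sig_a, n_sig_b - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy case: 10 genes in total, two signatures of 5 genes that overlap completely.
p_full_overlap = overlap_pvalue(10, 5, 5, 5)
```

A small p-value indicates that the two signatures share more genes than expected by chance for signatures of their sizes.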
Figure 7 shows the rank-based correlation of differential expression between consensus transcripts for each TCGA cancer, comparing signatures built with GTEx reference tissue against those built with TCGA case-control samples. Taking the average of three random tissue site selections as our baseline, we see that our other strategies are superior. The autoencoder produced better correlations overall, regardless of the sample selection method.
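A rank-based correlation of fold changes over the genes common to two signatures can be sketched as follows. This assumes Spearman correlation with average ranks for ties and signatures stored as gene-to-log-fold-change mappings; the function names and toy values are hypothetical and not taken from the analysis.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def fold_change_agreement(sig_a, sig_b):
    """Return (Spearman correlation, number of shared genes) for two signatures."""
    common = sorted(set(sig_a) & set(sig_b))
    rho = spearman([sig_a[g] for g in common], [sig_b[g] for g in common])
    return rho, len(common)

# Toy signatures (gene -> log2 fold change); values are made up.
sig_gtex_ref = {"A": 2.0, "B": -1.0, "C": 0.5, "D": 3.0}
sig_tcga_ref = {"A": 1.5, "B": -2.0, "C": 0.1, "E": 1.0}
rho, n_common = fold_change_agreement(sig_gtex_ref, sig_tcga_ref)
```

Because only ranks enter the calculation, the measure is insensitive to differences in the scale of fold changes between the two reference sets.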
For the autoencoder, choosing all samples from the same tissue of origin appears to perform slightly better than choosing only the most correlated samples (those at or above the 25th percentile) from the same tissue of origin. Interestingly, choosing the top 50 most correlated samples from any tissue performs reasonably well, or even better in some cancers where the tissue of origin was misclassified, such as stomach adenocarcinoma under the varying genes method (Additional file 1: Table S1). This is important because in many cases where we have no or too few matched normal tissues, we may be able to use normal samples from other sites. For example, for the three kidney cancers, Kidney Clear Cell Carcinoma, Kidney Papillary Cell Carcinoma and Kidney Chromophobe, our analysis suggests that they can share the same reference tissue site despite their different origins within the kidney.
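The three sample selection strategies compared above (all samples from the chosen site, samples from that site at or above the 25th percentile of correlation, and the top 50 most correlated samples from any site) can be sketched as simple filters. The data layout, function names, and the inclusive percentile convention are assumptions for illustration, not the authors' code.

```python
from statistics import quantiles

def all_from_site(samples, site):
    """Strategy 1: every normal sample from the chosen site."""
    return [s for s in samples if s["site"] == site]

def above_first_quartile(samples, site):
    """Strategy 2: samples from the chosen site whose correlation with the
    tumor is at or above the site's 25th percentile (needs >= 2 samples)."""
    in_site = all_from_site(samples, site)
    cutoff = quantiles([s["corr"] for s in in_site], n=4, method="inclusive")[0]
    return [s for s in in_site if s["corr"] >= cutoff]

def top_n_any_site(samples, n=50):
    """Strategy 3: the n most correlated samples regardless of site."""
    return sorted(samples, key=lambda s: s["corr"], reverse=True)[:n]

# Toy samples: correlation of each normal sample with the tumor profile.
toy_samples = [
    {"id": "b1", "site": "bladder", "corr": 0.1},
    {"id": "b2", "site": "bladder", "corr": 0.2},
    {"id": "b3", "site": "bladder", "corr": 0.3},
    {"id": "b4", "site": "bladder", "corr": 0.4},
    {"id": "b5", "site": "bladder", "corr": 0.5},
    {"id": "e1", "site": "esophagus", "corr": 0.6},
    {"id": "e2", "site": "esophagus", "corr": 0.05},
]
kept = above_first_quartile(toy_samples, "bladder")
top3 = top_n_any_site(toy_samples, n=3)
```

The third strategy can mix sites, which is why it remains usable when the chosen site has very few samples or was misassigned.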
We also assessed how many normal samples are sufficient for disease signature analyses. We found that even a relatively low number of normal samples may be sufficient for calculating differential expression. For bladder urothelial cancer, for example, the autoencoder selected the bladder GTEx site, which consists of only nine tissue samples (Fig. 1), for a correlation of 0.924; filtering for samples above the 25th percentile left only seven samples, for a correlation of 0.926. When we used a strategy that selected more samples, i.e. the autoencoder top 50 method, 50 samples were used (9 from bladder and 41 from other top correlated sites), and the correlation dropped slightly to 0.847. This indicates that even a relatively low number of reference tissue samples may provide a robust match.
Finally, we assessed whether it is a better strategy overall to select all samples from the same tissue site as the cancer of interest or only those that are correlated with the tumor sample. We found that the samples producing the best performance come from the site where the tumor developed or from closely related sites. When it is not possible to use such sites (e.g., when no data are available), it is feasible to use the top correlated tissues, as seen with the top 50 methods. For some cancers, however, even choosing the top correlated sites can still produce erratic results, as in the case of lung squamous cell cancer. In this case, the correlations for all non-random methods were between 0.1 and 0.3, which did not even beat random tissue selection (Additional file 1: Table S1). Along these lines, we evaluated differential expression similarity using samples from a different origin than the cancer of interest. For example, for two kidney cancers, Kidney Papillary Carcinoma and Kidney Chromophobe, the kidney cortex was computed as the top site, and for Head and Neck carcinoma the esophagus - mucosa was the top site. Their high correlation with case-control signatures (> 0.8) indicates that choosing sites of a different origin but proximal to the cancer can provide good disease signatures (Additional file 1: Table S1).
Assign normal tissues for cancers with low case-control pairs
Since 18 cancers had an insufficient number of adjacent normal tissues, we used our computational approach to assign a primary site to each. Of the 18 cancers, the autoencoder method determined the correct site for 10, whereas the top 5000 varying genes produced only 4 correct sites (Fig. 8). This suggests that an autoencoder can select suitable samples for creating disease signatures for those cancers.