Deep learning-based cancer survival prognosis from RNA-seq data: approaches and evaluations

Background Recent advances in kernel-based Deep Learning models have introduced a new era in medical research. Originally designed for pattern recognition and image processing, Deep Learning models are now applied to survival prognosis of cancer patients. Specifically, Deep Learning versions of the Cox proportional hazards models are trained with transcriptomic data to predict survival outcomes in cancer patients. Methods In this study, a broad analysis was performed on TCGA cancers using a variety of Deep Learning-based models, including Cox-nnet, DeepSurv, and a method proposed by our group named AECOX (AutoEncoder with Cox regression network). Concordance index and p-value of the log-rank test are used to evaluate the model performances. Results All models show competitive results across 12 cancer types. The last hidden layers of the Deep Learning approaches are lower dimensional representations of the input data that can be used for feature reduction and visualization. Furthermore, the prognosis performances reveal a negative correlation between model accuracy, overall survival time statistics, and tumor mutation burden (TMB), suggesting an association among overall survival time, TMB, and prognosis prediction accuracy. Conclusions Deep Learning based algorithms demonstrate superior performances than traditional machine learning based models. The cancer prognosis results measured in concordance index are indistinguishable across models while are highly variable across cancers. These findings shedding some light into the relationships between patient characteristics and survival learnability on a pan-cancer level.


Background
With the high prevalence of neural networks and Deep Learning-based algorithms in the Computational Biology, it is clear that the advantages of optimization in a highly non-linear space are welcomed improvements in biomedicine [1][2][3][4][5][6][7]. In Bioinformatics, significant effort has been committed to harnessing transcriptomic data for multiple analyses [7][8][9][10][11][12][13] especially cancer survival prognosis [14,15]. Faraggi and Simon [16] was the first study to use clinical information to predict prostate cancer survival through an artificial neural network model. Mobadersany et al. [17] integrated histological features, Convolutional Neural Networks (CNN), and genomics data to predict cancer prognosis via Cox regression. Despite of various existed applications on survival analysis such as [14,15], the use of Deep-Learning Cox models was pioneered by Ching et al. [18], who applied Cox regression with neural networks (Cox-nnet) to predict survival using transcriptomic data became prevalent. Similarly, Katzman et al. [19] used DeepSurv with multilayer neural networks for survival prognosis and developed a personalized treatment recommendation system.
As a new and effective dimensionality reduction technique, the Autoencoder (AE) framework can lead to efficient lower dimensional representations using unsupervised or supervised learning [20][21][22][23][24]. In addition, Chaudhary et al. [25] also applied AE for dimensionality reduction and then used the low-dimensional representation of data to perform prognosis prediction using traditional method. In this paper, besides two recently developed Deep Learning based methods, namely Cox-nnet and DeepSurv, we also attempted an Autoencoder-based approach (called AECOX) for cancer prognosis prediction with simultaneous learning of lower dimensional representation of inputs. This approach is similar to Cox-nnet [18] and DeepSurv [19], as it implements neural networks with Cox regression, though the network architectures differ. In AECOX (Fig. 1c), the code from AE will link to a Cox regression layer for the prognosis. Both losses from the AE networks and Cox regression layer will be counted to train the entire network weights through back-propagation. AECOX is with symmetric structure of the Autoencoder, and can accept any number of hidden layers. We refer readers to the Additional file 1 for more detailed settings of AECOX.
To evaluate the prediction performance, we adopt two metrics, namely the concordance index and p-value of log-rank test. These metrics are used in comparing two state-of-the-art Deep Learning-based prognosis models (i.e., Cox-nnet, DeepSurv) with AECOX, in a pan-cancer study covering 12 TCGA (The Cancer Genome Atlas) cancers. In addition, we use Partitioning Around Medoids (PAM) clustering algorithm [26] on the last hidden layer for each model to evaluates how well the models discriminate subgroups in the lower dimensional space. P-value of log-rank test based on K groups of Kaplan-Meier survival curve is the metric used for evaluation [27].
As we compared the prognosis prediction performance across 12 cancer types, we wonder whether the performance is related to tumor mutation burden and overall survival time. Tumor mutation burden (TMB) is a measurement of mutations in tumor [28,29] and is an important genomic marker that is closely associate with immunotherapy and survival prognosis [30][31][32][33][34]. While incorporating TMB feature into input does not increase the prediction performances, we found that TMB is negatively correlated with overall survival time statistics, and both of them are correlated with the concordance index for all three models across cancer types, suggesting an association between TMB, overall survival time, and disease prognosis accuracy.
Overall, we observed comparative results across three different Deep Learning-based cancer survival prognosis models in terms of concordance index. We also investigated the lower dimensional representation that conveyed by Deep Learning algorithms. By inspecting the relationship between TMB, overall survival statistics, and concordance index across 12 cancer types, we confirmed an association among them, suggesting a future study direction of patient stratification and integrative analysis.

Integrating Cox proportional hazards model with neural networks
The neural network architectures of all three Deep Learning-based approaches are provided in Fig. 1. Cox- Fig. 1 Neural network architectures of three Deep Learning-based models. a Cox-nnet with a single hidden layer; b DeepSurv with multiple hidden layers having consistent dimensions; c AECOX with multiple hidden layers in the both encoder and decoder part. Last hidden layers in all models were indicated in orange and were connect to a Cox regression neural networks with hazard ratios as the outputs nnet ( Fig. 1a) is the most succinct model with only one hidden layer, while DeepSurv (Fig. 1b) uses multiple hidden layers of consistent dimensions and treats the number of hidden layers as a hyper-parameter. Similarly, AECOX also treats the number of hidden layers as a hyper-parameter, but the hidden layers lay symmetrically in the encoder and decoder (Fig. 1c). All three models employed the same Cox proportional hazards model. However, Cox-nnet and DeepSurv accept the output of the last hidden layer to the Cox model while AECOX uses the low-dimensional code as the input. The output hazard ratio was then compared to the ground truth and the details of evaluation metrics are provided earlier.
The reason we introduced AECOX is to explore the feasibility of simultaneously generating a lowdimensional representation of the data while developing an effective model for prognosis.
The Cox proportional hazards model, also known as the Cox model, was developed to models the age specific failure rate or hazard function [35] at time t for patient i with covariate vector X i by.
The partial likelihood L i for patient i, which is defined to be the probability of occurrence of a death event at time Y i for patient i, is found to be β k X ik Þ . β = (β 1 , β 2 , …, β K ) are the K parameters to be estimated. The summation in denominator is carried out over all patients j (including patient i) for which a death event did not occur before time Y i . The partial likelihood for all patients is then defined as where C i = 1 indicates the occurrence of a death event.
The log partial likelihood of Cox model is then obtained as Values of the parameters β = (β 1 , β 2 , …, β K ) are then obtained through maximum likelihood estimation (MLE), that isβ Alternatively, since the Cox model utilizes a regression model that can be implemented as neural network with weights β = (β 1 , β 2 , …, β K ), values of these weights were obtained through back-propagation. This approach was embedded in all the aforementioned models and was denoted by the blue line with caption "Cox-Regression Neural Network" in Fig. 1.
These models offer several advanced features: (1) a highly non-linear function is learned, (2) neural networks and Cox proportional hazards regression are integrated together enabling the entire weights of the models to be learned through back-propagation, (3) the number of hidden layers and hidden layer dimensions were treated as hyper-parameters that can be fine-tuned, and (4) dimensionality reduction in conjunction with supervised learning is achieved.
To demonstrate the advantages of Deep Learningbased prognosis models, we also compared three traditional machine learning based models for prognosis, they are: Cox proportional hazards model with R package "glmnet" [36], Random Survival Forest (RSF) [37], and Support Vector Machine (SVM) [25]. Particularly, in Chaudhary et al. [25], we implemented their SVM model according to the top 100 mRNA-seq features selected from ANOVA (Analysis of variance) [38].

Regularization, loss functions and hyper-parameters
Despite the fact that the aforementioned Deep Learningbased approaches shared the same Cox regression network and used the hazard ratio as the output (Table 1), yet certain differences existed among the models. Currently all three models used the L2 norm regularization in the final learning after hyper-parameters tuning as it gave the optimal validation accuracy. While all models attempted Dropout and the L2 norm regularization (Ridge Regularization [39]) to penalize the network weights, AECOX also included L1 norm regularization (Least Absolute Shrinkage and Selection Operator, LASSO in short [40]) and elastic net [41].
The structure of loss functions among models shared a common base formula (Table 1), but each approach used additional penalization. Specifically, both Cox-nnet and DeepSurv used the same objective (loss) function: whereas AECOX took into account the Autoencoder's input-output difference: Here Θ denotes to the neural networks' weights to be learned, including hidden layer weights and Cox regression neural network weights, X input and X output are the input and output covariate vectors of Autoencoder, respectively. MSE(•) is the mean squared error function. The hyper-parameter λ 1 balances the loss between Autoencoder's input-output difference which is a measure of dimensionality reduction and the Cox hazard, which is a measure of regression based supervised learning. The combination of λ 2 and λ 3 permits the utilization of Elastic Net regularization. Forcing λ 2 = 0 results in L2 regularization, whereas forcing λ 3 = 0 results in L1 regularization. To optimize the objective functions given above, Cox-nnet, DeepSurv, and AECOX use Nesterov accelerated gradient descent [42], stochastic gradient descent (SGD) [43], and adaptive moment estimation (Adam) optimizer [44], respectively. AECOX adopted Adam optimizer as it is more computationally efficient and require little tuning on hyper-parameters.
As shown in Table 1, Cox-nnet has one hyperparameter to be fine-tuned, and thus a linear search technique was adopted, whereas DeepSurv and AECOX had multiple hyper-parameters in a high dimensional space. It is thus unrealistic to perform a linear search in each dimension of the hyper-parameter space as the computational complexity would be O(n p ) for p hyperparameters. Instead, DeepSurv and AECOX utilize the Sobol solver [45] in the Optunity python package [46]. Given a search time q (e.g., q = 100), the Sobol solver samples q points assuming the hyper-parameters are uniformly distributed in p-dimensional space. This reduces the computational complexity to O(nq), regardless of how large the value of p is.

Data preprocessing and statistics
Genes with lowest 20% absolute expression values and lowest 10% variance across samples were removed. This  denoising step was performed via the TSUNAMI package (https://apps.medgen.iupui.edu/rsc/tsunami/) [15], ensuring model robustness and reducing irrelevant noise.
The expression data were then rescaled with natural logarithm operation: where X original was the original non-negative RNA sequencing expression values (Illumina Hi-Seq RNA-seq v2 RSEM normalized), and X input was the input covariate vector for the models. Subsequently each gene expression at row r in the input data was normalized as This step ensured that each row of the gene expression contributed to the model on an equal scale. Table 2 provides a summary of the median and range in terms of age and survival months for the TCGA data. Each dataset was split into training, validation, and testing sets in a proportion of 60, 20, and 20% respectively. Confounding effects [47] were minimized by randomly shuffling the data 1000 times and choosing the 5 pairs of training/validation/testing sets with lowest corresponding differences. The differences that were minimized is the summation of (1) standard deviation of male/female ratio on training/validation/testing sets, (2) standard deviation of overall survival time's standard deviation on training/validation/testing sets, (3) standard deviation of overall survival time's mean on training/validation/ testing sets, (4) standard deviation of the ratio of deceased group to whole population on training/validation/testing sets, and (5) standard deviation of the ratio of tumor stages to whole population on training/validation/testing sets. Thus, survival prognosis was estimated for each cancer type 5 times.
In this study, TCGA mutation annotation files (MAFs), containing subsets of the patients for prognosis tasks, were used to calculate TMB summary statistics, including mean, median, max, and 20, 10, 5% tail cut values. These characteristics were used for examining correlation between TMB and concordance index.

Evaluation metrics
We evaluated model performance with concordance index and the p-value of log-rank test. Concordance index had been widely used for evaluating survival prognosis models [48][49][50]. Its value ranges from 0 to 1 and it describes how well models differentiated groups (censored and uncensored groups, or living and deceased groups) [50][51][52][53]. A concordance index of 0.5 indicates that a model was ineffective and is viewed to have generated a random prediction with respect to ground truth. Values above 0.5 indicate improved prediction by a model, with increased performance being conveyed by a concordance index approaching 1. Values below 0.5 indicate that a model predicted values that are the opposite of the ground truth. Higher concordance index values indicate better capability of model to perform cancer survival prognosis.
P-values were derived by dichotomizing the hazard ratios through median value and performing log-rank tests [54][55][56] between the resulted high-risk and low-risk groups. Model performance was then assessed wherein a lower p-value represents an enhanced ability to distinguish two patient groups.
To evaluate the performances across cancer types and across model types, two-way ANOVA [38] is adopted. Pairwise paired t-test [57,58] and the linear mixedeffects models test from the R package "nlme" [59,60] are also used. The linear mixed-effects models test is to test between pairs of models while accounting for random effects. The mixed effect model assumed the data (performances) to be dependent within each cancer type and independent across cancer types.

Results
The performance comparison was conducted at pancancer level using 12 cancer from The Cancer Genome Atlas (TCGA). These 12 cancers were chosen due to their relatively large sample sizes and sufficient information about patient outcomes. The specific cancers analyzed in this paper were (1)   concordance indices of approximately 0.5. This led to our further investigation with tumor mutation burden (TMB) and overall survival time as described earlier. We also made a comparison between three traditional machine learning models (Fig. 2c, d). Specifically, we presented the results in Fig. 2 as two parts in order to directly visualize the comparison between Deep Learning-based models and traditional machine learning based models in Fig. 2c and d.
Since five experiments were carried out for each cancer type and each model type, we compared the performances (via concordance index and p-value of log-rank test) for all 12 TCGA cancer types using pairwise paired t-test among all models (Table 3a) and the linear mixedeffects models test (Table 3b). In this case we considered a model to be better than another if a higher concordance index or a lower p-value of log-rank test was observed. Thus, a positive t-statistic in Table 3a or a positive coefficient in Table 3b was used to conclude that the model (distribution 1) was better than the other (distribution 2) with respect to the concordance index. In the case of the p-value of log-rank test, a negative tstatistic or coefficient was used to reach the same inference.
As can be observed from Table 3b, all models have a similar performance since most of the test results of their linear mixed-effects models are insignificant. Both Table 3a and Table 3b concluded that among Deep Learning-based approaches, Cox-nnet provided the overall optimal survival prognosis results at pan-cancer level, with respect to the concordance index and the p-value of log-rank test. This advantage of Cox-nnet is due to a simpler neural network architecture and reduced search space for hyper-parameters. Additional file 1: Table S10-S11 presented the same quantitative comparison of performances for Deep Learning-based and traditional machine learning models. All three Deep Learning models demonstrated superior performance than traditional machine learning models, suggesting the advantages of Deep Learning approaches on prognosis prediction.

Lower dimensional representation
The final hidden layer (or the code in AECOX), highlighted in orange in Fig. 1, produces a lower dimensional representation of the input and is one of the intrinsic properties in Deep Learning-based algorithms [18,21,61,62]. By using the Partitioning Around Medoids (PAM) clustering algorithm [26] on the output of the last hidden layer after the network is trained, we can then inspect the original covariate vector in a lower dimensional space. The most suitable number of clusters (ranging from 2 to 10) was determined by maximizing the averaged silhouette score [63,64]. As depicted in Table 4, Cox-nnet appeared to have overall better p-values of the log-rank test measured between different clusters, indicating a better capacity of dimensionality reduction for 9 cancers (KIRP, KIRC, LIHC, BRCA, CESC, LUAD, HNSC, OV, LUSC).

Relationship between prognosis prediction performances and tumor mutation burden
From the performances within a cancer type across models in Fig. 2 and results in Table 3, it appeared that all models achieve respectable performances measured by concordance index. We also found that performance (concordance index) was more significantly associated with cancer types than algorithms (two-way ANOVA: Cancer type p-value <2E-16, Model type p-value = 9.57E-02). This observation suggests that intrinsic characteristics of different cancer types have a large influence on the performance of prognosis models. One such characteristics is the tumor mutation burden (TMB), Notes: β denotes the coefficient (slope) of linear mixed-effects models, P denotes the p-value obtained.
which is known to vary largely between different types of cancers. TMB was increasingly used as a marker in predicting efficacy of immunotherapy [33] and was also shown to be a predictor of prognosis [34]. Since the ability to train a cancer survival prognosis model across cancer types varies significantly, we explored whether TMB can be associated with these changes. By inspecting the mutation information associated with different cancer types, we observed that the performance of survival prognosis models was associated with tumor mutation burden (TMB) characteristics. Specifically, we observed that all TMB characteristics were negatively correlated with concordance index especially the mean TMB (Mean TMB: Pearson ρ = − 0.45 (Fig. 3b) One interesting question is then if the incorporating TMB in the model would enhance the model performance. To investigate this, we take the joint subset of patients who have both RNA-seq data and TMB data, performed survival prognosis with Cox-nnet model (the method which has the best performance) with and without TMB feature, respectively. As shown in Fig. 4, although there is a slight improvement on concordance index (average value = 0.003419) after TMB feature is incorporated, the correlation between improved concordance index (mean) and mean TMB values is 0.0688 across 12 TCGA cancers, suggesting that introducing TMB feature into a mRNA-seq based learning model does not substantially improve the performance for Coxnnet.
Next, we found the correlation between the mean of overall survival times and the mean of TMB values is − 0.6853 (Pearson) and − 0.7133 (Spearman) across 12 cancers, and the correlation between the variance of overall survival times and the variance of TMB values is − 0.6159 (Pearson) and − 0.2448 (Spearman), suggesting a

Discussion
Overall our study demonstrated that the Deep Learning architecture can be effectively applied for cancer prognosis prediction with Cox-proportional hazard model incorporated. We found that Deep Learning-based model demonstrated superior performances comparing to traditional machine learning models. Among the three Deep Learning-based models tested, we observed that Coxnnet, which has the most succinct neural network structure, resulted in better prognosis performances in the measurement of concordance index and p-value of logrank test. We showed that integrating autoencoder with Cox regression network does not significantly improve the prognosis performances. These results highlight an important issue in Deep Learning approaches-namely simpler models often perform similar or better to more complex models in biological data. From the associated fine-tuned hyper-parameters (Additional file 1: Table S5-S9) during the hyperparameters tuning (with optimal validation accuracy), we found that Deep Learning-based algorithms and traditional machine learning based algorithms, especially with multiple hyper-parameters, tends to converge into different local minima with different hyper-parameter values. For example, the optimal parameter pairs of AECOX are not consistent in five different folds even when these experiments are from same cancer (e.g., TCGA BRCA cancer). This result can potentially be due to the curse of dimensionality [65]: with limited number of different training samples and large number of parameters (e.g., the hidden layer weights in Deep Learningbased models), the optimization may not guarantee to converge to same local minima. These observations lead us to rethink the robustness of training procedureespecially when higher performances are observed on Cox-nnet where it has the least hyper-parameter tuning effort.
We also noticed a negative correlation between TMB values and prognosis prediction performances. The relationship between TMB and prognosis have been examined in existing literatures in cancer biology for individual cancer types. For example, Owada-Ozaki et al. [66] examined the relationship between individual TMB and prognosis and concluded that high TMB is a poor prognostic factor in non-small cell lung cancer (NSCLC). A similar pattern occurs between TMB and prognosis specifically for lung adenocarcinomas (a subtype of NSCLC) (Naidoo et al. [67]). Our pan-cancer analyses are agreeing with these findings yet have a different conclusion. We observed that TMB is correlated with prognosis performances (concordance index), however, integrating TMB to the Cox-nnet model does not improve the performances at the pan-cancer level. By further examining the relationships behind these features and results, we found that TMB is highly correlated with overall survival times (both in mean and variance) across cancer types. Specifically, lower TMB value is associated with longer mean overall survival time, concluded that  Table S4. Other results of TMB statistics versus concordance index were shown in Additional file 1: Figure S2 - Figure S6.
TMB is a marker for tumor malignancy. These findings lead us to speculate that TMB either affect or are affected by overall survival time, but may not directly contribute to prognosis prediction when gene expression data are used. However, with a strong correlation to TMB, shorter overall survival times leads to worse prognosis performance, suggesting a direct relationship between overall survival statistics and prognosis performances. These findings will guide us to design future experiments to further explain the detailed relationships especially the dependency among TMB, survival times, and prognosis performances at pan-cancer level.

Conclusion
Bringing artificial intelligence into clinical and cancer studies [6,[68][69][70] can unravel numerous interpretabilities behind the data. In this paper, we focused on three different Deep Learning-based cancer prognosis models. The survival predictions are conducted across 12 TCGA cancer types with sufficient number of patients and survival information. We found that Deep Learning based algorithms demonstrate superior performances than traditional machine learning based models. We also found that the cancer prognosis results measured in concordance index are indistinguishable across models while are highly variable across cancers by two-way ANOVA. The highest concordance index that models can predict is renal papillary cell carcinoma (KIRP), while the lowest concordance index is observed for lung squamous cell carcinoma (LUSC). We then examined the relationships between TMB statistics, overall survival statistics, and concordance indices across 12 cancers. We found that although TMB and overall survival times are negatively correlated with concordance indices across the cancer types, integrating TMB does not improve the prognosis prediction performance for individual cancers significantly, whereas TMB has a strong correlation with overall survival times. These findings will guide us to explore the relationships between patient characteristics and survival learnability in a pancancer level in the future work.

Additional file
Additional file 1: Figure S1. An example framework of the AECOX model with four hidden layers. Table S1. The network design of AECOX .  Table S2. The hyper-parameters of AECOX to be searched. Table S3. Performances of testing set in TCGA Kidney Renal Clear Cell Carcinoma (KIRC) dataset. Bolded texts indicated optimal results among all models. Table S4. Individual model correlations (Pearson ρ) of mean TMB (Fig. 2). Figure S2. Relationship between concordance index and median TMB. Pearson ρ = − 0.30 (p-value = 7.75E-02). Figure S3. Relationship between concordance index and max TMB. Pearson ρ = − 0.40 (p-value = 1.68E-02). Figure S4. Relationship between concordance index and 20% tail TMB. Pearson ρ = − 0.32 (p-value = 5.51E-02). Figure S5. Relationship between concordance index and 10% tail TMB. Pearson ρ = − 0.32 (p-value = 5.93E-02). Figure S6. Relationship between concordance index and 5% tail TMB. Pearson ρ = − 0.30 (p-value = 7.45E-02). Table S5. Fine-tuned hyper-parameters of Cox-nnet (L2 penalty weight λ) across 12 cancer types and 5 experiments (folds). Table S6. Fine-tuned hyper-parameters of DeepSurv across 12 cancer types and 5 experiments (folds). Table S7. Fine-tuned hyper-parameters of AECOX across 12 cancer types and 5 experiments (folds). Note that we fixed λ 2 = 0 to only impose L2 sparsity. Table S8. Fine-tuned hyper-parameters of Random Survival Forest (RSF) (number of the trees) across 12 cancer types and 5 experiments (folds). Table S9. Fine-tuned hyper-parameters of SVM (α, weight of penalizing the squared hinge loss in the objective function) across 12 cancer types and 5 experiments (folds). Table S10. Model-wised performances comparison at pan-cancer level (12 TCGA (The Cancer Genome Atlas) cancer types) by pairwise paired t-test, according to metrics concordance index and p-value of log-rank test. Note that for concordance index, larger tstatistic/coefficient indicated better performance at pan-cancer level, while the p-value of log-rank test was on the contrary. Table S11. Model-wised performances comparison at pan-cancer level (12 TCGA (The Cancer Genome Atlas) cancer types) by linear mixed-effects models test, according to metrics concordance index and p-value of log-rank test. Note that for concordance index, larger t-statistic/coefficient indicated better performance at pan-cancer level, while the p-value of log-rank test was on the contrary.