Cancer survival analysis using semi-supervised learning method based on Cox and AFT models with L_{1/2} regularization
 Yong Liang^{1} (corresponding author),
 Hua Chai^{1},
 Xiao-Ying Liu^{1},
 Zong-Ben Xu^{2},
 Hai Zhang^{2} and
 Kwong-Sak Leung^{3}
https://doi.org/10.1186/s12920-016-0169-6
© Liang et al. 2016
Received: 6 January 2015
Accepted: 16 February 2016
Published: 1 March 2016
Abstract
Background
One of the most important objectives of clinical cancer research is to diagnose cancer more accurately based on patients’ gene expression profiles. Both the Cox proportional hazards model (Cox) and the accelerated failure time model (AFT) have been widely adopted for high-risk and low-risk classification and for survival time prediction in patients’ clinical treatment. Nevertheless, two main dilemmas limit the accuracy of these prediction methods. First, small sample sizes and censored data remain a bottleneck for training a robust and accurate Cox classification model. Second, tumours with similar phenotypes and prognoses can actually be completely different diseases at the genotype and molecular level. Thus, the utility of the AFT model for survival time prediction is limited when such biological differences between the diseases have not been previously identified.
Methods
To overcome these two dilemmas, we propose a novel semi-supervised learning method based on the Cox and AFT models to accurately predict the treatment risk and the survival time of patients. Moreover, we adopt the efficient L_{1/2} regularization approach within the semi-supervised learning method to select the relevant genes that are significantly associated with the disease.
Results
The results of the simulation experiments show that the semi-supervised learning model can significantly improve the predictive performance of the Cox and AFT models in survival analysis. The proposed procedures have been successfully applied to four real microarray gene expression datasets and to artificial evaluation datasets.
Conclusions
The advantages of our proposed semi-supervised learning method include: 1) a significant increase in the available training samples obtained from censored data; 2) a high capability for identifying the survival risk classes of patients in the Cox model; 3) high predictive accuracy for patients’ survival time in the AFT model; 4) a strong capability for selecting relevant biomarkers. Consequently, our proposed semi-supervised learning model is a more appropriate tool for survival analysis in clinical cancer research.
Keywords
Cancer survival analysis; Semi-supervised learning; Gene selection; Regularization; Cox proportional hazards model; Accelerated failure time model
Background

The small sample size and censored survival data versus high-dimensional covariates dilemma in the Cox model

The similar phenotype disease versus different genotype cancer dilemma in the AFT model
In the accelerated failure time model, to increase the available sample size and obtain more accurate results, each censored observation time is replaced with an imputed value using an estimator such as the inverse probability weighting (IPW) method [10], the mean imputation method, the Buckley-James method [11] or the rank-based method. These estimation methods assume that the AFT model is applied to patients with similar-phenotype cancer and that the survival times follow the same unspecified common probability distribution. Nevertheless, the disparity we see in disease progression and treatment response can be attributed to the fact that cancers with a similar phenotype may be completely different diseases at the molecular genotype level. We therefore need to identify the different cancer genotypes. Can we do so based exclusively on the clinical data? For example, patients can be assigned to a “low-risk” or a “high-risk” subgroup based on whether they were still alive, or whether their tumour had metastasized, after a certain amount of time. This approach has also been used to develop procedures to diagnose patients [12]. However, when patients are divided into subgroups based only on their survival times, the resulting subgroups may not be biologically meaningful. Suppose, for example, that the underlying cell types of each patient are unknown. If we were to assign patients to “low-risk” and “high-risk” subgroups based on their survival times, many patients would be assigned to the wrong subgroup, and any future predictions based on this model would be suspect. Therefore, we need to propose more accurate classification methods that identify these underlying cancer subtypes based on microarray data and clinical data together, and to build a model that can determine which subtype is present in future patients.
Our idea in this study is to strike a tactical balance between the two contradictory dilemmas. We propose a novel semi-supervised learning method based on the combination of the Cox and AFT models with L_{1/2} regularization for high-dimensional and low-sample-size biological data. In our semi-supervised learning framework, the Cox model classifies patients into the “low-risk” or “high-risk” subgroup using as many samples as possible to improve its predictive accuracy. Meanwhile, the AFT model estimates the censored data within each subgroup, in which the samples have the same molecular genotype.
Methods
Cox proportional hazards model (Cox)
Accelerated failure time model (AFT)
L_{1/2} regularization
Remark: In our previous work [23], we used \( \frac{3}{4}{\left(\uplambda \right)}^{\frac{2}{3}} \) in the representation of the L_{1/2} regularization thresholding operator. Here, we introduce a new half thresholding representation, \( \frac{\sqrt[3]{54}}{4}{\left(\uplambda \right)}^{\frac{2}{3}} \). This new value is more precise and effective than the old one. It is known that the quality of the solutions of a regularization problem depends strongly on the setting of the regularization parameter λ. Based on this novel thresholding operator, the convergence of the algorithm has been proved [24] when λ is chosen by an efficient parameter tuning strategy, such as cross-validation.
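To make the role of the threshold concrete, the following is a minimal Python sketch of a half thresholding operator. Only the threshold value \( \frac{\sqrt[3]{54}}{4}{\left(\uplambda \right)}^{\frac{2}{3}} \) is taken from the remark above; the cosine form applied above the threshold is the expression commonly given in the L_{1/2} thresholding literature and is assumed here, and the function name `half_threshold` is hypothetical.

```python
import numpy as np

def half_threshold(x, lam):
    """Half thresholding operator (sketch; closed form assumed from the literature)."""
    x = np.asarray(x, dtype=float)
    # Threshold from the remark: (54^(1/3) / 4) * lam^(2/3)
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    out = np.zeros_like(x)
    mask = np.abs(x) > thresh
    # Angle term of the closed-form solution; the arccos argument stays
    # below 1/sqrt(2) whenever |x| exceeds the threshold.
    phi = np.arccos((lam / 8.0) * (np.abs(x[mask]) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * x[mask] * (
        1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0)
    )
    return out

if __name__ == "__main__":
    # Small coefficients are set exactly to zero; large ones are shrunk slightly.
    print(half_threshold([0.1, -0.8, 10.0], lam=1.0))
```

With λ = 1 the new threshold is about 0.945 compared with 0.75 for the earlier value, so an input such as 0.8 is zeroed under the new representation but would have passed the old threshold.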
Our proposed semi-supervised learning method
In the semi-supervised learning framework, the predictive accuracy of the Cox and AFT models is improved because the number of training samples increases and the censored data are imputed reasonably. The L_{1/2} regularization approach can select the significantly relevant gene sets based on the Cox and AFT models respectively.
In our proposed semi-supervised learning method, the censored data are imputed within the same risk class to improve prediction performance. However, some imputations of the censored data are observably wrong: for example, the survival time estimated by the AFT model is sometimes even shorter than the censoring time. We regard such values as erroneous estimates and do not use them for model training.
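The rule for discarding erroneous imputations can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' implementation: the function name `accept_imputations` and the array layout (event = 1 for an observed death, 0 for a censored sample) are hypothetical, and the AFT imputations are taken as given.

```python
import numpy as np

def accept_imputations(obs_time, event, imputed_time):
    """Keep complete samples and censored samples with a plausible imputation.

    obs_time     : observed or censoring time of each sample
    event        : 1 if the survival time was observed, 0 if censored
    imputed_time : AFT estimate of the survival time (used only where event == 0)
    """
    obs_time = np.asarray(obs_time, dtype=float)
    event = np.asarray(event, dtype=int)
    imputed_time = np.asarray(imputed_time, dtype=float)

    censored = event == 0
    # An imputed survival time shorter than the censoring time contradicts
    # the data, so such samples are excluded from training.
    valid = censored & (imputed_time >= obs_time)
    keep = (event == 1) | valid
    train_time = np.where(valid, imputed_time, obs_time)
    return train_time[keep], keep

if __name__ == "__main__":
    times, keep = accept_imputations(
        obs_time=[5.0, 3.0, 8.0, 2.0],
        event=[1, 0, 0, 1],
        imputed_time=[5.0, 4.5, 6.0, 2.0],
    )
    print(times, keep)
```

In the example, the third sample is censored at 8.0 but imputed as 6.0, so it is dropped, while the second sample enters training with its imputed time 4.5.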
In this paper, two measures were used to evaluate the performance of the different methods.
Integrated Brier Score (IBS)
The IBS is used to assess the goodness of the predicted survival functions of all observations at every time between 0 and max(t_{i}).
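For reference, a commonly used form of the Brier score for right-censored data weights each squared error by the inverse probability of censoring and then integrates over time. The expression below is that textbook form and is not necessarily the exact expression used in this study; \( \hat{S}(t\mid x_i) \) (the predicted survival function of observation i) and \( \hat{G} \) (the Kaplan-Meier estimate of the censoring distribution) are notation assumed here.

```latex
\mathrm{BS}(t) = \frac{1}{n}\sum_{i=1}^{n}\left[
  \frac{\hat{S}(t\mid x_i)^{2}\, I(t_i \le t,\ \delta_i = 1)}{\hat{G}(t_i)}
  + \frac{\bigl(1-\hat{S}(t\mid x_i)\bigr)^{2}\, I(t_i > t)}{\hat{G}(t)}
\right],
\qquad
\mathrm{IBS} = \frac{1}{\max(t_i)}\int_{0}^{\max(t_i)} \mathrm{BS}(t)\, dt
```

Under this form, lower values indicate better predictions.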
Concordance Index (CI)
Note that the values of CI lie between 0 and 1: perfect predictions of the fitted model lead to a CI of 1, whereas random predictions give a CI of about 0.5.
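As an illustration of how such an index behaves, the sketch below computes a pairwise concordance index in the style of Harrell's C for predicted survival times. It is an assumed, simplified form (the function name `concordance_index` is hypothetical, and tied predictions are counted as half concordant), not necessarily the exact definition used in this study.

```python
import numpy as np

def concordance_index(time, event, pred_time):
    """Fraction of comparable pairs whose predicted order matches the observed order."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    pred = np.asarray(pred_time, dtype=float)

    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if event[i] != 1:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if pred[i] < pred[j]:
                    concordant += 1.0
                elif pred[i] == pred[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

if __name__ == "__main__":
    t = [1.0, 2.0, 3.0, 4.0]
    d = [1, 1, 0, 1]
    print(concordance_index(t, d, [1.0, 2.0, 3.0, 4.0]))  # perfectly ordered predictions
    print(concordance_index(t, d, [2.0, 2.0, 2.0, 2.0]))  # uninformative predictions
```

The double loop is quadratic in the number of samples, which is acceptable for datasets of the size considered here.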
Results
Simulated experiment
 Step 1:
We generate γ_{i0}, γ_{i1},…, γ_{ip} (i = 1,…,n) independently from the standard normal distribution and set \( {X}_{ij}={\gamma}_{ij}\sqrt{1-\rho }+{\gamma}_{i0}\sqrt{\rho } \) (j = 1,…, p), where ρ is the correlation coefficient.
 Step 2:
The survival time y_{i} is written as \( {y}_{i}=\frac{1}{\alpha}\log \left(1-\frac{\alpha \log (U)}{\omega \exp \left(\beta X\right)}\right) \), where U is a uniformly distributed variable, ω is the scale parameter and α is the shape parameter.
 Step 3:
The censoring time point y_{i}′ (i = 1,…,n) is obtained from a random distribution E(θ), where θ is determined by the specified censoring rate.
 Step 4:
We define y_{i} = min(y_{i}, y_{i}′) and δ_{i} = I(y_{i} < y_{i}′); the observed data for the model are then represented as (y_{i}, x_{i}, δ_{i}).
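The four steps above can be sketched with NumPy as follows. Several choices are assumptions made only for illustration and are not stated in the text: E(θ) is taken to be an exponential distribution with scale θ, the nonzero coefficients are all set to 1, and the default values of α, ω and θ are arbitrary. The function name `simulate_survival` is hypothetical.

```python
import numpy as np

def simulate_survival(n=100, p=1000, rho=0.3, alpha=1.0, omega=1.0,
                      theta=1.0, n_signal=10, seed=0):
    """Generate (X, y, delta) following Steps 1 to 4 (sketch under assumptions)."""
    rng = np.random.default_rng(seed)

    # Step 1: correlated covariates X_ij = gamma_ij*sqrt(1-rho) + gamma_i0*sqrt(rho)
    gamma = rng.standard_normal((n, p + 1))
    X = gamma[:, 1:] * np.sqrt(1.0 - rho) + gamma[:, [0]] * np.sqrt(rho)

    # Assumed coefficients: the first n_signal genes are prognostic, the rest are zero
    beta = np.zeros(p)
    beta[:n_signal] = 1.0

    # Step 2: survival times; U is drawn from (0, 1] so that log(U) is finite
    U = 1.0 - rng.random(n)
    t = (1.0 / alpha) * np.log(1.0 - alpha * np.log(U) / (omega * np.exp(X @ beta)))

    # Step 3: censoring times (exponential distribution assumed for E(theta))
    c = rng.exponential(scale=theta, size=n)

    # Step 4: observed time and event indicator
    y = np.minimum(t, c)
    delta = (t < c).astype(int)
    return X, y, delta

if __name__ == "__main__":
    X, y, delta = simulate_survival(n=100, p=1000, rho=0.3)
    print(X.shape, y[:3], 1.0 - delta.mean())  # last value: empirical censoring rate
```

In practice θ would be tuned until the empirical censoring rate printed at the end matches the target rate.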
In our simulated experiments, we build high-dimensional and low-sample-size datasets. In every dataset, the dimension of the predictive genes is p = 1000, of which 10 prognostic genes have nonzero coefficients; the coefficients of the remaining 990 genes are zero. About 40 % of the data in each subgroup are right-censored. We considered training sample sizes of n = 100, 200 and 300 and gene correlation coefficients of ρ = 0 and ρ = 0.3. The simulated data were analysed with the single Cox model, the single AFT model and the semi-supervised learning approach with Cox and AFT models. For gene selection, we use the L_{1/2} regularization approach, and the regularization parameters are tuned by 5-fold cross-validation. To assess the variability of the experiment, each method is evaluated on a test set of 200 samples and replicated over 50 random training and test partitions.
The performance of the Cox and AFT models with and without the semi-supervised learning approach in the simulated experiments (means over 50 runs, with standard deviations in brackets)
Cor.  Size  Cox Correct  Cox Selected  Cox Precision  Semi-Cox Correct  Semi-Cox Selected  Semi-Cox Precision
ρ = 0  100  4.06 (1.39)  24.44 (4.65)  0.166 (0.044)  6.58 (1.41)  16.96 (6.41)  0.388 (0.080)
ρ = 0  200  5.62 (1.64)  28.22 (6.16)  0.199 (0.031)  8.68 (1.56)  17.84 (5.72)  0.487 (0.078)
ρ = 0  300  8.02 (1.43)  35.18 (5.81)  0.228 (0.029)  9.76 (0.98)  19.02 (5.41)  0.513 (0.087)
ρ = 0.3  100  3.90 (1.43)  24.38 (5.83)  0.159 (0.041)  6.46 (1.37)  17.08 (6.05)  0.378 (0.075)
ρ = 0.3  200  5.68 (1.42)  29.64 (6.19)  0.192 (0.035)  8.62 (1.11)  17.86 (5.45)  0.483 (0.074)
ρ = 0.3  300  7.84 (1.55)  35.86 (5.96)  0.219 (0.037)  9.42 (0.68)  18.54 (5.10)  0.508 (0.082)
Cor.  Size  AFT Correct  AFT Selected  AFT Precision  Semi-AFT Correct  Semi-AFT Selected  Semi-AFT Precision
ρ = 0  100  5.02 (1.61)  38.74 (6.27)  0.130 (0.029)  6.84 (1.37)  35.52 (6.17)  0.192 (0.031)
ρ = 0  200  7.12 (1.30)  46.68 (6.03)  0.152 (0.025)  8.84 (1.18)  42.16 (5.38)  0.210 (0.039)
ρ = 0  300  8.90 (0.99)  56.54 (6.85)  0.157 (0.019)  9.86 (0.46)  50.84 (5.49)  0.194 (0.027)
ρ = 0.3  100  4.74 (1.19)  39.54 (5.88)  0.120 (0.030)  6.72 (1.43)  35.84 (6.43)  0.188 (0.033)
ρ = 0.3  200  6.98 (1.50)  47.02 (6.32)  0.148 (0.024)  8.78 (1.02)  44.96 (6.95)  0.195 (0.031)
ρ = 0.3  300  8.80 (1.02)  56.82 (6.30)  0.155 (0.022)  9.78 (0.50)  49.31 (5.86)  0.198 (0.034)
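The Correct, Selected and Precision columns of the table above can be read as counts of selected genes per run. The sketch below shows one plausible way to compute them, assuming that Precision is the number of correctly selected genes divided by the number of selected genes (which is consistent with the reported averages, for example 4.06 / 24.44 ≈ 0.166). The function name `selection_metrics` and the index-based gene labels are hypothetical.

```python
def selection_metrics(selected, true_genes):
    """Return (correct, n_selected, precision) for one simulation run."""
    selected = set(selected)
    true_genes = set(true_genes)
    correct = len(selected & true_genes)   # prognostic genes that were selected
    n_selected = len(selected)             # all genes with nonzero coefficients
    precision = correct / n_selected if n_selected else 0.0
    return correct, n_selected, precision

if __name__ == "__main__":
    # Genes 0..9 are the assumed prognostic genes; the model selected five genes.
    print(selection_metrics(selected=[0, 1, 2, 50, 60], true_genes=range(10)))
```

Averaging these three values over the 50 replications would give entries comparable to one cell group of the table.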
Simulation analysis of real microarray datasets
The detailed information of the four real gene expression datasets used in the experiments
Datasets  No. of genes  No. of samples  No. of censored
DLBCL (2002)  7399  240  102
DLBCL (2003)  8810  92  28
Lung cancer  7129  86  62
AML  6283  116  49
In order to accurately assess the performance of the semi-supervised learning approach, the real datasets were randomly divided into two parts: two thirds of the available patient samples, which include the complete data and the correctly imputed censored data, were put in the training set used for estimation, and the remaining complete and censored patients’ data were used to test the prediction capability. For comparison, we used the single Cox and single AFT models with L_{1/2} regularization; for each procedure, the regularization parameters were tuned by 5-fold cross-validation. All results in this article are averaged over 50 repetitions.
On the other hand, we find that, for all four gene expression datasets, the genes selected by the Cox and AFT models are quite different and only a small part overlaps. We think the reason may be that the regularized Cox model selects the genes relevant for low-risk and high-risk classification, whereas the genes selected by the AFT model are highly correlated with the survival time of patients. These two models may therefore select different genes with different biological functions. The analyses below show that the genes selected by the semi-supervised learning methods are significantly relevant to cancer.
Discussion
In this section, we briefly discuss the biology of the genes selected for the Lung cancer dataset to demonstrate the advantage of our proposed semi-supervised learning method. The semi-supervised learning method selects fewer genes than the single Cox and AFT models, but its selection includes some genes that are significantly associated with cancer and are not selected by the two single models, such as GDF15, ARHGDIB and PDGFRL. GDF15 belongs to the transforming growth factor-beta superfamily and is a bone morphogenetic protein. It has been shown that GDF15 can serve for prognostication of cancer morbidity and mortality in men [31]. ARHGDIB is a member of the Rho (or ARH) protein family; it is involved in many different cell events, such as cell secretion and proliferation, and is likely to have an impact on cancer [32]. PDGFRL encodes a protein containing an important sequence that is similar to the ligand-binding domain of platelet-derived growth factor receptor beta. Biological research has confirmed that this gene can affect sporadic hepatocellular carcinomas, which suggests that its product may have a tumour-inhibiting function.
At the same time, the Cox and AFT models with and without the semi-supervised learning method also selected some common genes, for example PTP4A2, TFAP2C and GSTT2. PTP4A2 is a member of the protein tyrosine phosphatase family; overexpression of PTP4A2 confers a transformed phenotype in mammalian cells, which suggests a role in tumorigenesis [33]. TFAP2C encodes a protein containing a sequence-specific DNA-binding transcription factor that can activate some developmental genes [34]. GSTT2 is a member of a superfamily of proteins. It has been reported to play an important role in human carcinogenesis, which indicates that these genes are linked to cancer [35].
Through the comparison of the biological analyses of the selected genes, we found that the semi-supervised method based on the Cox and AFT models with L_{1/2} regularization is competitive with the single regularized Cox and AFT models.
Conclusion
To overcome the limitations of fully unsupervised and fully supervised approaches for survival analysis in cancer research, we have developed a discriminative semi-supervised method based on the Cox and AFT models with L_{1/2} regularization. This method combines the advantages of both the Cox and AFT models and overcomes the dilemmas in their application. By comparing the results of the Cox and AFT models with and without the semi-supervised method, in simulation experiments and in real microarray dataset experiments with different regularization methods, we demonstrated that 1) the censored data can be employed after appropriate processing; 2) the semi-supervised classification improves prediction accuracy compared with the state-of-the-art single Cox model; 3) the gain in gene selection performance increases with the number of available samples. Therefore, for clinical applications, where the goal is often to develop an accurate predictive test using fewer genes in order to control cost, the semi-supervised method based on the Cox and AFT models with L_{1/2} regularization can be applied; it is an efficient and accurate method for high-dimensional and low-sample-size data in cancer survival analysis.
Declarations
Acknowledgements
This research was supported by Macau Science and Technology Develop Funds (Grant No. 099/2013/A3) of Macau Special Administrative Region of the People’s Republic of China.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
1. Cox DR. Partial likelihood. Biometrika. 1975;62:269–76.
2. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med. 1992;11:1871–9.
3. Chapelle O, Sindhwani V, Keerthi SS. Optimization techniques for semi-supervised support vector machines. J Mach Learn Res. 2008;9:203–33.
4. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002;99:6567–72.
5. Wasito I, Veritawati I. Subtype of Cancer Identification for Patient Survival Prediction Using Semi Supervised Method. JCIT. 2012;7:14.
6. Xia Z, Wu LY, Zhou X, et al. Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst Biol. 2010;4 Suppl 2:S6.
7. Qi Y, Tastan O, Carbonell JG, et al. Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins. Bioinformatics. 2010;26(18):i645–52.
8. Koestler DC, Marsit CJ, Christensen BC, et al. Semi-supervised recursively partitioned mixture models for identifying cancer subtypes. Bioinformatics. 2010;26(20):2578–85.
9. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7.
10. Wang Z, Wang CY. Buckley-James boosting for survival analysis with high-dimensional biomarker data. Stat Appl Genet Mol Biol. 2010;9(1):Article 24.
11. Seaman SR, White IR, Copas AJ, et al. Combining Multiple Imputation and Inverse-Probability Weighting. Biometrics. 2012;68(1):129–37.
12. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004;2:E108.
13. Huang J, Ma S, Xie H. Regularized Estimation in the Accelerated Failure Time Model with High-Dimensional Covariates. Biometrics. 2006;62(3):813–20.
14. Tsiatis A. Estimating regression parameters using linear rank tests for censored data. Ann Stat. 1996;18:305–28.
15. Datta S. Estimating the mean life time using right censored data. Stat Methodol. 2005;2:65–9.
16. Luan Y, Li H. Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics. 2004;20:332–9.
17. Gui J, Li H. Threshold gradient descent method for censored data regression, with applications in pharmacogenomics. Pac Symp Biocomput. 2005a;10:272–83.
18. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005b;21:3001–8.
19. Xu ZB, et al. L1/2 regularization. Sci China Series F. 2010;40(3):1–11.
20. Liu C, et al. The L1/2 regularization method for variable selection in the Cox model. Appl Soft Comput. 2014;14(c):498–503.
21. Cox DR. Regression models and life-tables. J R Statist Soc. 1972b;34:187–220.
22. Ernst J, et al. A semi-supervised method for predicting transcription factor-gene interactions in Escherichia coli. PLoS Comput Biol. 2008;4(3):e1000044.
23. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2002;30:74–99.
24. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21(13):3001–8.
25. Murphy AH. A new vector partition of the probability score. J Appl Meteorol. 1973;12(4):595–600.
26. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005;24:1713–23.
27. Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346:1937–46.
28. Rosenwald A, et al. The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell. 2003;3:185–97.
29. Beer DG, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002;8:816–24.
30. Bullinger L, et al. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med. 2004;350:1605–16.
31. Wallentin L, et al. GDF15 for prognostication of cardiovascular and cancer morbidity and mortality in men. PLoS One. 2013;8:12.
32. Hatakeyama K, et al. Placenta-specific novel splice variants of Rho GDP dissociation inhibitor beta are highly expressed in cancerous cells. BMC Res Notes. 2012;5:666.
33. Riker A, et al. The gene expression profiles of primary and metastatic melanoma yields a transition point of tumor progression and metastasis. BMC Med Genomics. 2008;1:13.
34. Ailan H, et al. Identification of target genes of transcription factor activator protein 2 gamma in breast cancer cells. BMC Cancer. 2009;9:279.
35. Jang SG, Kim IJ, Kang HC, et al. GSTT2 promoter polymorphisms and colorectal cancer risk. BMC Cancer. 2007;7:16.