An algorithm for classifying tumors based on genomic aberrations and selecting representative tumor models
© Lu et al. 2010
Received: 27 January 2010
Accepted: 22 June 2010
Published: 22 June 2010
Skip to main content
© Lu et al. 2010
Received: 27 January 2010
Accepted: 22 June 2010
Published: 22 June 2010
Cancer is a heterogeneous disease caused by genomic aberrations and characterized by significant variability in clinical outcomes and response to therapies. Several subtypes of common cancers have been identified based on alterations of individual cancer genes, such as HER2, EGFR, and others. However, cancer is a complex disease driven by the interaction of multiple genes, so the copy number status of individual genes is not sufficient to define cancer subtypes and predict responses to treatments. A classification based on genome-wide copy number patterns would be better suited for this purpose.
To develop a more comprehensive cancer taxonomy based on genome-wide patterns of copy number abnormalities, we designed an unsupervised classification algorithm that identifies genomic subgroups of tumors. This algorithm is based on a modified genomic Non-negative Matrix Factorization (gNMF) algorithm and includes several additional components, namely a pilot hierarchical clustering procedure to determine the number of clusters, a multiple random initiation scheme, a new stop criterion for the core gNMF, as well as a 10-fold cross-validation stability test for quality assessment.
We applied our algorithm to identify genomic subgroups of three major cancer types: non-small cell lung carcinoma (NSCLC), colorectal cancer (CRC), and malignant melanoma. High-density SNP array datasets for patient tumors and established cell lines were used to define genomic subclasses of the diseases and identify cell lines representative of each genomic subtype. The algorithm was compared with several traditional clustering methods and showed improved performance. To validate our genomic taxonomy of NSCLC, we correlated the genomic classification with disease outcomes. Overall survival time and time to recurrence were shown to differ significantly between the genomic subtypes.
We developed an algorithm for cancer classification based on genome-wide patterns of copy number aberrations and demonstrated its superiority to existing clustering methods. The algorithm was applied to define genomic subgroups of three cancer types and identify cell lines representative of these subgroups. Our data enabled the assembly of representative cell line panels for testing drug candidates.
Cancer is a disease of the genome that is characterized by substantial variability in the clinical course, outcome, and response to therapies. A key factor underlying this variability is the genomic heterogeneity of human tumors: individual tumors of the same histopathological subtype and anatomical origin typically carry different aberrations in their cellular DNA. Many of the most efficacious recent drugs target specific genetic aberrations rather than histological disease subtypes, for example trastuzumab and lapatinib for treating HER2-positive breast cancers , tamoxifen for treating ER-positive breast cancers[2, 3], and gefitinib and erlotinib for non-small cell lung cancer with EGFR mutations [4–8].
Several subtypes of common cancers have been identified based on the aberrations of individual cancer genes, for example HER2-amplified breast cancer [1, 9, 10], EGFR-mutated and EGFR-amplified non-small-cell lung cancer [5, 8], and others. However, cancer is a complex disease driven by the interaction of multiple genes and pathways [11, 12]. Therefore, the copy number status of individual genes may not be sufficient to define cancer subtypes and predict the response to treatments. More comprehensive cancer taxonomy needs to be designed based on genome-wide patterns of DNA copy number abnormalities.
Previous ground-breaking studies have reported molecular classifications for key cancer types based on their global patterns of gene expression [13–16]. As the high-density array technology became a reliable tool for copy number profiling, multiple gene copy number datasets were generated, revealing the genomic heterogeneity of key cancer types at the gene copy number level . Various clustering methodologies have been applied to comparative genomic hybridization (CGH) data sets to classify cancers based on their copy number patterns and identify copy number aberration hotspots [17–23]. Taxonomies based on gene copy number have a number of advantages over gene expression-based classifications. In particular, copy number alterations are stable events, not affected by cell cycle or cytokine stimulation. Additionally, they show greater consistency between primary human tumors and cultured cell lines.
Here we developed a copy number-based methodology for cancer classification in order to enable identification of genomic subgroups of major cancer types and facilitate rational selection of tumor models representative of individual subgroups. The methodology is based on the previously published genomic non-negative matrix factorization (gNMF) algorithm [23–26], with several major modifications to enhance the performance. We applied the algorithm to three major tumor types: non-small cell lung carcinoma (NSCLC), colorectal carcinoma (CRC), and malignant melanoma, identified distinct genomic subtypes for each cancer, and identified cell lines representative of each subtype. Our data enabled the assembly of representative cell line panels for testing drug candidates.
To determine the best of the models built by gNMF algorithm with different numbers of clusters, we calculated the Cophenetic correlation coefficient and Bayesian Information Criterion (BIC) for these models, and then selected the one with the minimum BIC or the greatest decrease of Cophenetic correlation. In our study, the minimum BIC and greatest decrease of Cophenetic correlation often pointed to the same model. Finally, the 10-fold stability test was performed on the selected model. Thus, the iteration procedure converges to the best solution, and the optimal model is identified. The entire methodology was implemented using the R language http://cran.r-project.org/. The source code is available from the authors.
The use of human tumor specimens collected at Rush University was approved by the Rush Institutional Review Board. Additional human tumor specimens were obtained from tissue banks (Caprion Proteomics, Montreal, Canada; ProteoGenex, Culver City, CA; Asterand, Detroit, MI; Genomics Collaborative, Inc, Cambridge, MA; and Ontario Tumor Bank, Toronto, Ontario, Canada), and the use was approved by the vendors. Written informed consent was obtained from all the individuals. Genomic DNA was extracted from tumor samples and cell lines using a DNAeasy kit (Qiagen, Valencia, CA). The DNA samples were then processed and hybridized to Affymetrix GeneChip Mapping arrays (Affymetrix, Santa Clara, CA, http://www.affymetrix.com). The arrays were run according to the manufacturer's protocol and scanned using a GeneChip Scanner 3000 G7 (Affymetrix, Santa Clara, CA). The Affymetrix GeneChip® Operating Software (GCOS) collected and extracted feature data from the scanner.
Partek Genomic Suite software (Partek Inc., St. Louis, Missouri, http://www.partek.com/) was used for low-level processing of the raw data to determine the copy numbers at each locus and define regions of copy number alteration. Copy numbers were calculated by comparing the signal intensities for the DNA samples to those for a reference set of 48 normal female tissue samples. The original array scan files and the calculated copy number data were deposited into the Gene Expression Omnibus database http://www.ncbi.nlm.nih.gov/geo/ with series ID GSE20481. The resulting probe-level copy number data were then segmented, and the copy number alteration regions were detected for each sample. Specifically, probe-level copy numbers were segmented into regions using the following control parameters: (i) a region must contain at least 100 probes, (ii) the p-value comparing the mean copy number of the region versus the adjacent regions must be less than 0.00001, and (iii) the signal/noise ratio of the transition must be greater than 0.1. The copy number alteration regions were detected when the mean copy numbers in these regions were lower than 1.65 (deletion) or higher than 2.65 (gain), with P values below 0.01.
Tumor samples may contain a significant percentage of normal adjacent tissue, which will dilute the signal of copy number alteration and result in false negatives. Therefore, we developed a machine learning algorithm to distinguish between copy number patterns of tumor and normal samples and applied it to identify and eliminate samples contaminated with normal tissue from further analyses. We first selected a subset of samples with the highest number of copy number alteration regions and a set of normal samples. These two groups of samples were used as a training set to train a Random Forest (RF)  classifier to classify tumors from normal samples. The trained classifier was then applied to all incoming tumor samples to assign a score to each of them representing the probability of being contaminated by normal tissue. Samples with normal contamination probability over 50% were excluded from our clustering analysis.
Although broadly used for clustering of gene expression and copy number patterns, hierarchical clustering is highly sensitive to the distance metrics and typically requires subjective evaluation to define clusters . Nevertheless, this method provides an easy and intuitive way to visualize the overall pattern of the data. The gNMF algorithm requires the number of clusters as input parameter, which is usually unknown for a given data set. In previous applications of gNMF to cluster CGH data, several numbers of clusters were tested, but without an intuitive method to determine these initial numbers [17, 23]. We used hierarchical clustering as a pilot to estimate the range of the possible numbers of clusters that exist in the data before applying the gNMF algorithm.
For each data set, we hierarchically clustered the segmented CGH data using Pearson dissimilarity (defined as 1 minus Pearson correlation coefficient). The hierarchical clustering patterns were plotted and visually inspected to derive a range of possible numbers of subgroups in the dataset. These numbers were then used as input in the gNMF clustering analysis.
Non-negative matrix factorization (NMF) was first suggested by Paatero et al . Like principal component analysis (PCA) or independent component analysis (ICA, ), NMF is a linear decomposition algorithm which decomposes a data matrix into a factor matrix and a weight matrix with limited dimensions. However, NMF adds a unique constraint that the factor matrix and weight matrix have only non-negative entries. This feature makes NMF particularly useful in the analysis of positive-value data, such as images, gene expression levels or gene copy numbers. This mathematical tool was further developed by Lee et al. and applied in image analyses [24, 25], and then in the analysis of gene expression data  and genomic copy number data [17, 23].
Where α runs from 1 to r, μ runs from 1 to m, and i runs from 1 to n.
Because of the positive constraint of the NMF algorithm, the result can easily be adapted for clustering purposes by assigning each sample into the cluster (component) that has the maximum weight. Alternatively, under a multiple-run scheme (discussed below), the average correlation of H matrix of multiple runs can be used to assign samples into clusters. Therefore, gNMF can also serve as a tool for unsupervised clustering.
Here we introduced several modifications to the gNMF algorithm. In previous applications of gNMF to cluster CGH data, the algorithm was stopped when the subgroup assignments of samples did not change after a pre-defined number of steps (e.g. 100) [17, 23]. Based on our tests with simulated data as well as actual CGH data, we determined that this criterion might stop the procedure too early, suggesting that the results could potentially be further improved if the algorithm were allowed to run more steps. Therefore, we modified the algorithm so that after every 100 steps of multiplicative updating the divergence  of the current model from the data was calculated, and the iterative algorithm was stopped if the divergence did not decrease by more than 0.001% of the previous divergence calculated 100 steps earlier.
Since gNMF is a stochastic procedure, the algorithm may generate different outcomes when started from different initial values. To further improve the performance of gNMF, we implemented a new multiple initiation strategy. For each data set, we ran the gNMF algorithm 200 times following the above stop criterion, calculated the Pearson correlation coefficient matrix of H from the output of each of the 200 random gNMF runs, and averaged the correlation matrices over the 200 runs. The final clustering result was derived by running a hierarchical clustering algorithm using one minus the average correlation matrix as the distance matrix and cutting the dendrogram into the given number of subgroups.
The core gNMF algorithm was run with different numbers of clusters, r, as suggested by the pilot hierarchical clustering. After that, to select the best model, we calculated the Cophenetic correlation coefficient and Bayesian information criterion (BIC) for these models, and then selected the one with the minimum BIC or the greatest decrease in Cophenetic correlation.
The Cophenetic correlation coefficient  is a measure of how faithfully a dendrogram that is used to derive the final clustering result preserves the pairwise distances between the original unmodeled data points. Suppose the original data X i have been modeled by a dendrogram T i . Distance measures are defined as follows:
x(i, j) = |X i - X j |, the distance between the ith and jth samples.
t(i, j) = the dendrogrammatic distance between the model points T i and T j . This distance is the height of the node at which these two points are first joined together.
The Cophenetic correlation coefficient has been used to select the best model built by the gNMF algorithm in previous applications [17, 23], and it has been reported that with the increase of r, the Cophenetic correlation will decrease dramatically at certain point which suggests the best number of clusters.
where L is the likelihood which measures how good the model approximate the data, k is the number of parameters used in the model, and n is the number of samples. The second term, kln(n), serves as a penalty on the number of parameters used in the model to avoid overfitting. A lower BIC value usually represents a good model with relatively higher likelihood and also relatively lower risk of overfitting.
where r is the number of clusters, n i is the number of samples in cluster i, and m is the number of segments, y ijt is the log transformed copy number of the t th segment of the j th sample in the i th cluster, μ it is the average of log transformed copy numbers of the t th segment in the i th cluster, and σ it is the standard deviation of log transformed copy numbers of the t th segment in the i th cluster. Then the number of parameters, k, in the specified model would be 2 × r × m.
In our studies, the minimum BIC and greatest decrease of Cophenetic correlation often pointed to the same model.
A 10-fold cross-validation test was used to assess the stability of clustering results. After assigning samples into subgroups, we randomly left 10% of samples out and applied the same procedure to the remaining 90% of samples using the same control parameters. The number of samples that were assigned to a different subgroup was counted as errors. This test was repeated 200 times to derive an error rate, which represented the stability of the clustering result with respect to the permutation of samples.
Cophenetic correlation and BIC of the NSCLC gNMF models under different cluster numbers.
Number of tumor patients in each of the four NSCLC subtypes and the cell lines selected to represent each subtype.
Number of tumors
Representative Cancer Cell Lines for each subtype
HCC827, NCI-H1437, NCI-H1563, NCI-H1568, NCI-H1623, NCI-H1651, NCI-H1693, NCI-H1755, NCI-H1793, NCI-H1838, NCI-H1944, NCI-H1975, NCI-H1993, NCI-H2023, NCI-H2073, NCI-H2085, NCI-H2087, NCI-H2122, NCI-H2126, NCI-H2228, NCI-H2291, NCI-H23, NCI-H2342, NCI-H2347, NCI-H647, NCI-H920, NCI-H969, CLS-54, LX-289, SK-LU-1, H2882, Calu-6, H358, H460
NCI-H2405, NCI-H522, SK-MES-1, H157, H1819, H2009, H2887, HCC1171, HCC1359, HCC15, HCC193, HCC366, HCC461, HCC515, HCC78, HOP-62, HOP-92, NCI-H266
A549, Calu-3, NCI-H1734, NCI-H838, HCC95
Thus, we identified genomic subgroups of NSCLC using the unsupervised clustering methodology above and selected representative cell lines for each subtype. However, the biological significance of the subgroups of NSCLC was still unclear. One way to determine the phenotypical relevance of a classification is to test the difference in the clinical outcome of patients assigned into the subgroups.
To further validate the ability of our algorithm to classify tumors into biologically relevant genomic groups, an independent set of 71 NSCLC tumor samples was profiled for copy number alterations, and the data were segmented. These validation samples were then matched to the four NSCLC clusters.
Thus, the NSCLC clusters identified using our clustering methodology are associated with different disease outcomes, suggesting that these genomic clusters represent clinically distinct subgroups of the disease. Consequently, the cell lines selected can be used as models to represent the existing subtypes of the disease.
We used the NSCLC dataset to compare the performance of our method with several known unsupervised clustering methodologies, namely the hierarchical clustering, the original gNMF method [17, 23–26], and K-means clustering . The same NSCLC dataset was clustered into 4 subgroups by these methods. The ability of the procedures to generate clinically relevant tumor subgroups was used as the main performance criterion. The cluster stability was assessed as an additional performance metric.
Performance comparison of the unsupervised clustering algorithms on NSCLC data set.
P-value for TTRa
P-value for OSb
To evaluate the relative stability of clusters produced by different procedures, the 10-fold cross-validation scheme was applied. For our algorithm, the total error rate for the NSCLC data set was 14.24%, while the error rates for hierarchical clustering, the original gNMF, and K-means clustering were 23.32%, 21.59%, and 23.27%, respectively. Thus, our comparison shows that the tumor classification methodology described here outperforms the existing algorithms, both in terms of cluster stability and the biological relevance of clusters.
Number of tumor patients in each of the five CRC subtypes and the cell lines selected to represent each subtype.
Number of tumors
Representative Cancer Cell Lines for each subtype
HCT-8, LS 174T, SK-CO-1, SW48, DLD-1, HCT-15, HCT116, LoVo, CL-34, CL-40, C170, LS180
Caco-2, LS1034, LS411N, LS513, NCI-H498, NCI-H747, SW1116, SW1417, SW837, HT-29, SW620, CL-11, CL-14, Colo-678, SW-480
Colo 320DM, NCI-H508, NCI-H716, SW1463, SW403, SW948, Colo 205, Colo-206F
Number of tumor patients in each of the six melanoma subtypes and the cell lines selected to represent each subtype.
Number of tumors
Representative Cancer Cell Lines for each subtype
SKMEL119, HS944, WM1366, WM88
451LU, SKMEL19, SKMEL28, SKMEL30, SKMEL63, WM35, WM983, WM983C
WM3211, M14, MEWO, SKMEL2, SKMEL5, UACC257, UACC62, WM122, WM13662, WM239A, WM32112, WM32482, WM793B, 501MEL
Currently, pre-clinical models for oncology drug testing are selected based on their availability, adaptability to tumor formation in mice, growth in culture, as well as other parameters. This approach does not take into account the genetic heterogeneity of the parent tumor, resulting in poor representation of molecular subtypes of tumors during preclinical testing. Thus, the high response rates that are frequently seen in pre-clinical testing may only represent the response of the single molecular subtype represented in the preclinical testing laboratory. If this subtype represents only a fraction of the patient population, and if the drug is efficacious only against this specific subtype, then the response rate in the clinic will be significantly lower. Therefore, there is a need for improved pre-clinical testing models that would cover a broader spectrum of parent tumors. Such improved pre-clinical testing will increase the predictability of the preclinical testing of new drugs.
Recently, application of high-density SNP genotyping microarrays for gene copy number profiling enabled generation of comprehensive copy number datasets for most major tumor types. However, the processing and use of high-density copy number for cancer classification remains a challenge due to the complexity of using alterations of varied length and amplitude in clustering samples. We believe that copy number profiles are best suited for developing a genomic taxonomy of cancer due to their temporal stability and easier detection in various samples. Therefore, we developed an unsupervised methodology based on revised gNMF algorithm for classifying samples based on their genomic patterns of copy number aberrations and applied it to three datasets, which contained both internally generated and public data. The NMF algorithm has previously been used in image analysis [24, 25]. It has been adapted for use in gene expression profiling  and gene copy number analysis [17, 23]. It has shown advantages over traditional clustering and component analysis algorithms such as Princial Component Analysis (PCA), Self-Organizing Maps (SOM), and hierarchical clustering when applied to gene expression data . We further improved the original gNMF algorithm and developed an integrated workflow, which includes pilot hierarchical clustering to estimate the possible number of clusters, gNMF with a revised stop criterion and multi-run strategy, model selection using Cophenetic correlation and BIC, and a cross-validation stability test.
We applied our algorithm to datasets for NSCLC, CRC, and melanoma, each containing copy number profiles for hundreds of patient tumors and ~50 cell lines. The algorithm identified distinct new genomic subtypes for each tumor type. The clustering results were then evaluated for stability and reproducibility by using 10-fold cross-validation schemes. To determine whether the clusters discovered have biologically meaningful differences, we correlated the available disease outcome parameters (overall survival and time to recurrence) with the cluster distribution of samples and found a significant difference in clinical outcome between samples assigned into different clusters. As an additional test to validate the correlation between the classification and the outcome, we also profiled a group of outcome-annotated validation samples and assigned them to the existing clusters based on their copy number profiles. It was shown that the cluster assignment of the samples also significantly correlates with the clinical outcome, providing additional validation to our methodology.
Copy number alterations of single oncogenes and tumor-suppressor genes have previously been implicated as important biomarkers in cancer. Notable examples include HER2 amplification in breast cancer [1, 9, 10], EGFR amplification in lung cancer [4–8], and MYCN amplification in neuroblastoma [34, 35]. In these and other examples of previous validation and use of copy number alterations as predictors of outcome and response to therapeutic agents, only individual alterations were considered. Our methodology is based on the analysis of complex whole-genome wide patterns of copy number alterations. Therefore, it provides a complete characterization of genomic subtypes of the cancer under consideration and is expected to generate more precise correlates of clinical behaviour and response to drugs. We believe that the proposed genomic taxonomy is valid for the entire population of tumor patients, because (i) the sample sets used to develop it were large enough (150~300 samples) and (ii) the samples were acquired from a variety of sources, thus eliminating the possibility of bias.
Heatmaps of our clustering models shown in Figs. 2, 6, and 7 revealed major genomic aberrations characteristic of individual clusters, such as 8q gain for cluster 1 or 3q gain for cluster 3 of NSCLC, 13q gain for cluster 3 or 18q loss and 20q gain for cluster 2 of CRC, and chr7 gain for cluster 3 of melanoma. However, we would like to emphasize the fact that the classification patterns observed are driven by the genome-wide copy number aberration profiles rather than selected major aberrations. In many cases, smaller copy number aberrations contain important cancer genes. Some aberrations may also represent secondary events that are part of the tumor's genomic signature. To identify the exact genes that differentiate between clusters in a statistically rigorous manner, we would need to address several complex issues related to variable selection from high-dimensional data, including controlling for false-positives and false-negatives, correlation between variables, and the lack of reproducibility between studies due to systematic biases. Analysis of these issues is beyond the scope of this manuscript.
We are currently testing the classification methodology for its ability to predict disease outcome for new patients. To predict the disease outcome for a new patient, a tumor sample will be profiled for copy number alterations by high-density arrays and assigned into one of the subgroups. Alternatively, a panel of FISH probes may be designed based on the most characteristic copy number abnormalities for each subgroup, and new patient samples would be tested by FISH using the probe panel developed and classified into one of the subgroups based on the pattern of aberrations observed. The association with one of the groups would be used to predict response to the agent under consideration. A companion diagnostic can thus be developed for use with the drug considered. An additional application of the proposed classification methodology is in the assembly of a collection of preclinical testing models representative of the genetic diversity of tumors, such as the NSCLC, CRC, and melanoma testing panels that are described here.
We developed a modified genomic non-Negative Matrix Factorization (gNMF) clustering algorithm to cluster CGH data of tumor and cell lines to identify genomic subtypes of tumors and select representative cell lines. Our algorithm enables the assembly of panels of cell lines representative of the genomic heterogeneity of cancers.
We would like to thank Viswanath Devanarayan for his suggestions and comments on the statistical methodologies used in this work. We are grateful to Lisa Roberts for generating a subset of NSCLC patient CGH data.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.