Predictive gene lists for breast cancer prognosis: A topographic visualisation study

Sivaraksa, Mingmanas; Lowe, David

doi:10.1186/1755-8794-1-8

Technical advance
Open access
Published: 17 April 2008

Predictive gene lists for breast cancer prognosis: A topographic visualisation study

Mingmanas Sivaraksa¹ &
David Lowe¹

BMC Medical Genomics volume 1, Article number: 8 (2008) Cite this article

4423 Accesses
3 Citations
Metrics details

Abstract

Background

The controversy surrounding the non-uniqueness of predictive gene lists (PGL) of small selected subsets of genes from very large potential candidates as available in DNA microarray experiments is now widely acknowledged [1]. Many of these studies have focused on constructing discriminative semi-parametric models and as such are also subject to the issue of random correlations of sparse model selection in high dimensional spaces. In this work we outline a different approach based around an unsupervised patient-specific nonlinear topographic projection in predictive gene lists.

Methods

We construct nonlinear topographic projection maps based on inter-patient gene-list relative dissimilarities. The Neuroscale, the Stochastic Neighbor Embedding(SNE) and the Locally Linear Embedding(LLE) techniques have been used to construct two-dimensional projective visualisation plots of 70 dimensional PGLs per patient, classifiers are also constructed to identify the prognosis indicator of each patient using the resulting projections from those visualisation techniques and investigate whether a-posteriori two prognosis groups are separable on the evidence of the gene lists.

A literature-proposed predictive gene list for breast cancer is benchmarked against a separate gene list using the above methods. Generalisation ability is investigated by using the mapping capability of Neuroscale to visualise the follow-up study, but based on the projections derived from the original dataset.

Results

The results indicate that small subsets of patient-specific PGLs have insufficient prognostic dissimilarity to permit a distinction between two prognosis patients. Uncertainty and diversity across multiple gene expressions prevents unambiguous or even confident patient grouping. Comparative projections across different PGLs provide similar results.

Conclusion

The random correlation effect to an arbitrary outcome induced by small subset selection from very high dimensional interrelated gene expression profiles leads to an outcome with associated uncertainty. This continuum and uncertainty precludes any attempts at constructing discriminative classifiers.

However a patient's gene expression profile could possibly be used in treatment planning, based on knowledge of other patients' responses.

We conclude that many of the patients involved in such medical studies are intrinsically unclassifiable on the basis of provided PGL evidence. This additional category of 'unclassifiable' should be accommodated within medical decision support systems if serious errors and unnecessary adjuvant therapy are to be avoided.

Peer Review reports

Background

Metastasis is crucial in determining the life expectancy of breast cancer patients. Numerous studies have focused on searching for methods to predict the predilection of cancer patients to metastasize. Traditional methods fail to correctly predict the outcome of patients who reach metastasis leading to unnecessary clinical adjuvant therapy, such as chemotherapy. Gene profiling based, for example, on DNA microarray technology, has the potential to be a more reliable method allowing better prediction of patient cancer outcome. However, using lists of many thousands of genes is uninformative and not useful in providing insight for the specialist, nor for discovering the role of specific genes. Feature selection methods have been applied to filter the number of genes which are correlated with outcome to produce a much smaller (typically of the order of a few tens) and more informative 'predictive gene list' (PGL), ideally consisting of the key genes which control the behaviour of the cancer.

Despite the obvious benefits of producing an informative and small PGL, the difficulty is how to perform the feature selection in the absence of good quality functional models of the individual gene pathways. The alternative is to use data-driven data-mining approaches and seek correlations between response and outcome to rank potential genes. To rank the genes requires an appropriate metric. Although feature selection and feature extraction are often unsupervised methods, in the literature in this domain it is more common to be based on a supervised approach based on a specific choice of a nonparametric model linking the gene expressions to the outcomes. For example, using a classification model to infer likely outcome conditioned on expression values allows saliency of individual genes to be obtained, relevant to the classified outcome, e.g. good or poor prognosis patients. This saliency is of course dependent on the chosen model, the pre-specified outcomes, and the specific data used to construct the nonparametric classifier model. It is usually assumed that the data used to construct the classifier is representative of the problem so that results obtained are not highly sensitive on the specific choice of data. However, a different choice of model, or different choice of outcome would modify the saliency even if the chosen model was correct. In addition, in problems of such large input dimensionality (5000 or more on microarray chips), and relative sparsity of patient examples (a few hundred is typical), it is statistically plausible to select small subsets which are randomly correlated with any given desired outcome, irrespective of any biological functionality of the gene expression itself. This aspect has already been discussed in [1, 2] for example. Therefore the question arises as to whether a specific PGL can be obtained based on clinical datasets, given these concerns over reliability of pattern processing techniques.

Almost all nonlinear studies so far have examined supervised approaches to patient discrimination. A major problem with dealing with such high dimensional data is the lack of reliable approaches to investigate and compare patient-specific gene expression profiles separate to the construction of supervised models. We wish to explore an alternative analysis approach, based on unsupervised, nonlinear, topographic (structure-preserving) projection and visualisation methods.

This paper explores several recent nonlinear visualisation models applied to the data introspection of the van't Veer breast cancer study [3]. The approach can be used to a-posteriori explore whether there exists likely discriminability between patient groups of good and poor prognosis for example. For comparison with the preferred PGL selected by the van't Veer study, we also select a PGL based on cross-patient consistency rather than correlation with outcome and examine its performance also by these data introspection methods.

Reviews

We briefly overview some relevant recent works which have explored different classification, discrimination and clustering techniques to represent the separation between two groups of prognosis signature patients. The studies of van't Veer's group [3, 4] have suggested that a PGL of 70 specifically selected genes has proven accurate in out-of-sample patient prognosis of metastasis.

However, other studies have concluded that the likelihood there exists a 'best' small-size predictive gene list which can be used to reliably improve the ability of patient-specific prognosis using automated pattern processing techniques is unlikely. In very recent work [5], analysing supervised machine learning approaches across several public domain data sets, it was found that many gene sets are capable of predicting molecular phenotypes accurately. Hence it is not surprising that expression profiles identified using different training datasets selected from a larger cohort, should show little agreement. It was also demonstrated that predicting relapse directly from microarray data using supervised machine learning approaches was not viable.

In other work [6], it was shown that the specific example of the van't Veer PGL selection of 70 genes was no more effective at prognosis than the Nottingham Prognostic Indicator (NPI) or a suitably trained artificial neural network using traditional non-genomic biomarkers. This is not surprising from a systems biology perspective, where we would regard cancer as the result of complex interactions between genetic, biological and environmental influences.

In [1], they also found that the top 70 most correlated genes in the van't Veer study can vary significantly depending on the specific training set of patients used. Different randomly selected 70 gene PGL's were selected and shown to have similar prediction ability. They suggested that there is no unique set of genes that can be assumed to be the best or the only set of genes for prognosis accuracy of breast cancer. A follow-up study [2] also suggested a similar conclusion, that we can not create a definitive classifier from a small subset of genes based on the small patient datasets available. Generally, large patient sample sizes are needed to produce viable and robust prediction outcomes of cancer prognosis.

Projective Visualisation

Projective data visualisation is an approach for introspection of large dimensional datasets by extracting useful information and representing it in a more meaningful way that can be more easily interpreted prior to deciding upon subsequent analysis such as constructing classifiers [7]. The approach is very useful for interpreting data by simply observing two or three dimensional projective maps of the original dataspace, where relative positioning of data points reflects some form of structural similarities in the original dataspace. This allows the easy recognition of anomalous data points, outliers, implicit clustering and relative dissimilarity.

In microarray data, the combination of large dimensionality, noise, and sparse patient samples makes it almost impossible to explore and extract useful information contained in the data. Dimensionality reduction techniques are required for visualising microarray data. The dendrogram is one of the traditional approaches to perform microarray data clustering. However it usually produces a suboptimal local clustering solution and is not effective as a spatial visualisation tool to reflect relative dissimilarities. Many other algorithms for reduced dimensionality representation have previously been used to visualise microarray data. For instance, the Self Organising Map (SOM) has been used to investigate yeast [8] and human cancers [9] and [10], the latter in combination with the k-means algorithm. Analogously, Principal Component Analysis (PCA) has been used to investigate yeast [11] and to identify tissue-specific expression of human genes [12]. However, both SOM and PCA have significant drawbacks. PCA is a variance-preserving linear projection, and this limitation does not lead to a topographic representation [13]. On the other hand, the SOM lacks a sound theoretical underpinning (for example, there is no cost function to optimise, and training parameters must be chosen arbitrarily).

We therefore seek principled approaches to unsupervised data introspection which are nonlinear (since microarray data distributions are unlikely to be distributed on a linear manifold in high dimensional spaces). In this paper we will explore the Neuroscale model [14, 15], Local Linear Embeddings (LLE) [16] and Stochastic Neighbor Embeddings (SNE) [17].

Methods

The van't Veer Data set

We re-visit the well-known study of van't Veer et.al. [3] in which we focus on 78 sporadic lymph-node negative patients. Of these 78 patients, 34 developed distant metastases within 5 years and 44 remained free of cancer in that period. These are regarded as poor and good-prognosis groups respectively. The interest is whether the information in a gene expression profile alone could be used to perform a patient-specific prognosis separation between those two groups of patients. We will primarily use structure-preserving projective visualisation techniques to investigate this possibility. In the van't Veer study, from an initial set of 24481 human genes synthesised by inkjet microarray technology, about 5000 genes were found to be significantly expressed. They ranked genes by the magnitude of the correlation coefficient and eventually reduced the number of genes to 70, the number of genes which maximised a specific classification model. The centroid-based classifier they constructed could allocate 83% of the patients into the correct prognosis groups with 5 poor prognosis and 8 good-prognosis patients misclassified into the opposite categories.

An alternative PGL

To illustrate the lack of uniqueness of capability of the van't Veer gene list, which we denote List A in this paper, we compare results on a different gene list, denoted List B, selected on the basis of cross-patient consistency rather than maximising classification accuracy on a specific classification model. Let x ⁱdenote the gene expression vector for patient i of the van't Veer PGL. x _G, where G = {1, 2,..., 44} represents a set of expression values across all good prognosis patients, and x _P, where P = {45, 46,..., 78} represents the set of all poor prognosis patients.

The variance of individual gene expression values across each patient group is estimated by

σ_{L}^{2} = {〈 {(x_{i} - {\bar{x}}_{L})}^{2} 〉}_{i \in L},

where L = {G, P} and the average is taken across all patients. Assume $R_{j}^{L}$ is the rank order of the variance of gene j for each patient group. The unique top T ranked genes from each group are extracted,

\begin{array}{l} L_{G} = {j | R_{j}^{G} \leq T} \\ L_{P} = {j | R_{j}^{P} \leq T} \end{array}

The number of T genes is chosen so that List B has a total number of genes equal to 70, the same as List A. Specifically, in this case the 35 lowest non-overlapped variance genes from each patient group were extracted.

This selection criterion emphasises consistency of gene expression across patients, rather than explicitly seeking discrimination (see table 25 for list of genes). Examining the details of the two 70-gene subsets, we observe that there are only five genes in common between the van't Veer study and this alternative gene list. If List A has superior prognostic value, its projective visualisation and discrimination properties should be better than those of List B, since List A was chosen explicitly to maximise discrimination.

Table 25 The alternative gene list

Predictive gene lists for breast cancer prognosis: A topographic visualisation study

Abstract

Background

Methods

Results

Conclusion

Background

Reviews

Projective Visualisation

Methods

The van't Veer Data set

An alternative PGL

The validation data set of van de Vijver [4]

Topographic Visualisation

NeuroScale

LLE

Stochastic Neighbor Embedding

Classifier

Results

NeuroScale Projection

Locally Linear Embedding

Stochastic Neighbor Embedding

Discussion

Comparison across models

Comparison of PGLs across patient groups

Generalisation results using the van de Vijver data set

Conclusion

References

Pre-publication history

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Medical Genomics

Contact us