DeepTRIAGE: interpretable and individualised biomarker scores using attention mechanism for the classification of breast cancer sub-types

Background Breast cancer is a collection of multiple tissue pathologies, each with a distinct molecular signature that correlates with patient prognosis and response to therapy. Accurately differentiating between breast cancer sub-types is an important part of clinical decision-making. Although this problem has been addressed using machine learning methods in the past, there remains unexplained heterogeneity within the established sub-types that cannot be resolved by the commonly used classification algorithms. Methods In this paper, we propose a novel deep learning architecture, called DeepTRIAGE (Deep learning for the TRactable Individualised Analysis of Gene Expression), which uses an attention mechanism to obtain personalised biomarker scores that describe how important each gene is in predicting the cancer sub-type for each sample. We then perform a principal component analysis of these biomarker scores to visualise the sample heterogeneity, and use a linear model to test whether the major principal axes associate with known clinical phenotypes. Results Our model not only classifies cancer sub-types with good accuracy, but simultaneously assigns each patient their own set of interpretable and individualised biomarker scores. These personalised scores describe how important each feature is in the classification of any patient, and can be analysed post-hoc to generate new hypotheses about latent heterogeneity. Conclusions We apply the DeepTRIAGE framework to classify the gene expression signatures of luminal A and luminal B breast cancer sub-types, and illustrate its use for genes as well as the GO and KEGG gene sets. Using DeepTRIAGE, we calculate personalised biomarker scores that describe the most important features for classifying an individual patient as luminal A or luminal B. In doing so, DeepTRIAGE simultaneously reveals heterogeneity within the luminal A biomarker scores that significantly associate with tumour stage, placing all luminal samples along a continuum of severity.


Background
Breast cancer is a collection of multiple tissue pathologies with a joint genetic and environmental aetiology, and is a leading cause of death among women worldwide. During the progression of cancer, inherited or acquired mutations in the DNA change the sequence (or amount) of the messenger RNA (mRNA) produced by the cell, thereby changing the structure (or amount) of functional protein.
As such, mRNA can serve as a useful proxy for evaluating the functional state of a cell, with its abundance being easily measured by microarray or high-throughput RNA sequencing (RNA-Seq). Indeed, mRNA abundance has already been used as a biomarker for cancer diagnosis and classification [1,2], cancer sub-type classification [3,4], and for clustering gene expression signatures [5]. For a comprehensive comparison of the supervised and unsupervised methods used with gene expression data, see [6].
Despite advancements in the field, mRNA-based classifiers still present unique challenges. First, these datasets contain many more features (10,000s) than samples (100s), a p n problem that is usually addressed by feature selection or feature engineering [7,8]. Second, it is often difficult to interpret mRNA-based classifiers because the predictive genetic features do not necessarily make sense to biologist experts without explicit contextualisation. Third, the routine use of discriminative methods (e.g., support vector machines [9] or random forests [8,10]) only provide information with regard to the importance of a feature for an entire class. For cancer data, this means that these methods cannot suggest the importance of a feature for a specific patient, but instead only provide such information at the level of cancer type or sub-type. This is important given that a substantial amount of heterogeneity remains unaddressed within cancer sub-types [4,11].
In this paper, we propose a deep learning method for the stratification of clinical samples that not only offers interpretability through feature importance at the level of the cancer sub-type, but also at the level of the individual patient. As such, our method offers a finer level of interpretation than existing methods by capturing the heterogeneity of samples within each sub-type. To achieve this goal, we use an attention mechanism, a deep learning technique first proposed for machine translation and automatic image captioning [12,13]. Attention allows salient features to come dynamically to the forefront for each patient as needed. As a result, the global knowledge of the model, obtained from the discriminating classes, is enhanced by the local knowledge that each patient provides. In other words, the attention mechanism offers an insight into the model's decision-making process by revealing a set of individualised importance scores that describe how important each feature is for the classification of that patient. Further analysis of these importance scores reveals valuable insights into sub-type heterogeneity that are not directly apparent in the unattended data.
Machine learning techniques have been successfully applied to gene expression data for decades. More recently, deep learning, especially unsupervised deep learning, has influenced several approaches to gene expression analysis. These unsupervised models have been used to learn meaningful abstractions of biology from unlabelled gene expression data. For example, [14] extracted biological insights from the Pseudomonas aeruginosa gene expression compendium using shallow auto-encoders. Similarly, stacked auto-encoders have been adopted to capture a hierarchical latent space from yeast gene expression data [15], showing that the first layer correctly captures yeast transcription factors, while deeper layers conform to the existing knowledge of biological processes. More related to our work, it has been shown that shallow denoising auto-encoders are capable of extracting clinical information and molecular signatures from the gene expression data of patients with breast cancer [16]. Later, [17] coupled representations obtained from stacked denoising auto-encoders with a traditional classifier to achieve discriminatory power, applying it to classify cancerous samples from healthy ones.
Although the classification of cancer is an important task, breast cancer is not a monolithic entity. It is comprised of distinct molecular sub-types, each with a distinct molecular signature that correlate with patient prognosis and response to therapy [18]. This is especially true when considering the luminal sub-types: luminal A cancers have a much better prognosis than luminal B cancers and can be treated with endocrine therapy alone [18]. Yet, luminal A remains one of the most diverse cancer subtypes in terms of its molecular signature and severity [19]. As a case study, we focus our analysis on the difficult problem of classifying breast cancer sub-types based on gene expression signatures, and approach it using an endto-end supervised deep learning model. Using publicly available data from The Cancer Genome Atlas (TCGA), we develop and apply a novel deep learning architecture, called DeepTRIAGE (Deep learning for the TRactable Individualised Analysis of Gene Expression). The DeepTRIAGE architecture achieves two key outcomes.
First, our architecture extends the attention mechanism to model data where the number of features is much larger than the number of observations. Second, our architecture facilitates a new interpretation of feature importance by providing individualised patient-level importance scores. These patientlevel importance scores can be analysed directly using multivariate methods to reveal and describe latent intraclass heterogeneity.
Taken together, our work establishes a computational framework for calculating interpretable and individualised biomarker scores that can accurately classify luminal sub-types, while simultaneously revealing and describing intra-class heterogeneity. Using DeepTRIAGE, we calculate personalised biomarker scores that describe the most important features for classifying an individual patient as luminal A or luminal B. In doing so, Deep-TRIAGE simultaneously reveals heterogeneity within the luminal A biomarker scores that significantly associate with tumour stage, placing all luminal samples along a continuum of severity.

Data acquisition
We retrieved the unnormalised gene-level RNA-Seq data for the TCGA breast cancer (BRCA) cohort [20] using the TCGAbiolinks package in Bioconductor [21]. After filtering any genes with zero counts across all samples, we performed an effective library size normalisation of the count data using DESeq2 [22]. To retrieve luminal A (LumA) and luminal B (LumB) sub-type status for the TCGA BRCA samples, we downloaded the "PAM50" labels from the supplementary data of [19]. With 1148 PAM50 labels retrieved, we excluded patients that had more than one tumour sample sequenced. This left us with 528 LumA and 201 LumB samples. Of these, 176 LumA and 67 LumB samples were set aside as a test set.

Engineering annotation-level expression from genes
To reduce the dimensionality of the raw feature space, we transformed raw features from "gene expression space" into an "annotation space". For this, we elected to use the Gene Ontology (GO) Biological Process and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation databases. Pathways with less than ten associated genes were removed before estimating pathway-level expression by taking the sum of counts for all genes in each pathway, as described and validated in [23]. This results in 3942 GO features and 302 KEGG features that are then used for model training.

Model architecture
Gene expression data-sets usually have many more features than samples. Using a standard deep learning architecture with such data leads to a parameter explosion in the model that can cause over-fitting and reduce generalizability. Instead, our model aims to (i) reduce the number of free parameters of the model, (ii) encode global knowledge in the data by finding discriminatory features at the cancer sub-type level (akin to a logistic regression), and (iii) encode local information provided by each individual patient using an attention mechanism. These innovations make the attention mechanism tractable for high-dimensional data.

The deepTRIAGE model
Let d be the dimension of the raw feature space and let x j = x j1 , . . . , x jd be the representation of sample j in this space. Our goal is to train a binary classifier that learns whether sample j belongs to the LumA class (y j = 1) or the LumB class (y j = 0).
LetE d×m be an embedding matrix and let e i = [e i1 , . . . , e im ] be the embedding vector for feature i ∈ {1 : d}. Using Eq. 1, we definex (i) j , the m-dimensional embedded vector of feature i for sample j: where f e , parametrised by e , is the function that captures the relationship between the scalar value x ji and the embedding vector e i . Note that the same embedding matrix is used for all samples. Now, we can define a new representation for sample j using its embedded vectors, as shown in Eq. 2: wherex j is a p-dimensional representation of the raw feature data, f x : R m → R p is parametrised by x , and β ji is a scalar value denoting the individualised importance that feature i has in the classification of sample j. Note that m d and usually p m. In other words, the dimensionality of the embedding space is much smaller than the dimensionality of the raw feature space, and the dimensionality of the final representationx j is smaller than the dimensionality of the embedding space, allowing the attention mechanism to work successfully for such high-dimensional data. By viewing Eq. 2 in a deep learning framework, one can interpret β ji as the attention of sample j to feature i and compute it using Equations 3, 4, and 5: is simply a normalisation to ensure that the attention weights sum to one for each sample. Now, given the fixed size representation of sample j as x j , we can train the binary classifier f y for LumA vs. LumB classification: where y is the set of parameters for f y . Given the embedding matrix E, we have defined our model through Equations 1-6. Figure 1 shows a schematic overview of the DeepTRIAGE model architecture. In the next section, we discuss how we learn the parameters of this model.

Remark 1
There are different approaches to constructing the embedding matrix E. For instance: end-to-end learning with an unsupervised component added to the model, estimation using auto-encoders, or dimensionality reduction using PCA. We chose to use random vectors because it has been shown that their performance is comparable with the aforementioned techniques [24,25]. Therefore, e i is an m-dimensional random vector.

Remark 2
There are many ways to compute the attention weights. We used a definition inspired by the concept of self-attention which means that the attention to a feature is only influenced by that feature [26].

Learning model parameters
In the previous section, we defined our model through Equations 1-6. Now we discuss how to specify its components f e , f x , f α , f y and how to learn their parameters e , x , α , y . Since we want to learn the model endto-end, we choose these components to be differentiable. In order to computex (i) j , we capture the relationship between the feature value x ji and the embedding vector e i via multiplicative interaction using Eq. 7. Therefore, e is a null set. One could, however, choose a more complex function.
= x ji e i We choose f x and f α to be two feed-forward neural networks with weights x and α respectively. See Equations 8 and 9: where both can be thought of as a non-linear transform; nnet x : R m → R p and nnet α : R m → R.
Givenx j , any differentiable classifier can be placed on top to predict the cancer sub-type (Eq. 6). We use a feedforward network with a sigmoid activation function in the last layer to calculate the probability of sample j belonging to a sub-type: where y represents the weights of this network. To limit the model complexity, we choose f x to be a single-layer neural network with tanh nonlinearity, f α to be a network with one hidden layer and tanh nonlinearity, and f y to be a network with one hidden layer, batch normalisation and ReLu nonlinearity. Dropout with p = 0.5 is also applied to these three functions. Again, one can use more complex functions as long as they are differentiable. Since all components are fully differentiable, the entire model can be learnt by minimising the log-loss function employing automatic differentiation and gradient-based methods. In this case, we used the Adam optimiser [27].

Analysis of importance scores
What we have described so far focuses on the discriminatory mechanism of our model. When viewed from the top, our proposed model is capable of separating cancer subtypes, like many other classification algorithms. However, one important distinction is that our model also generates an individualised importance score for each feature at the sample-level. This aspect is highly useful as it opens new opportunities for post-classification analyses of individual patients, making our method both hypothesis-testing and hypothesis-generating.
Given β j = β j1 , . . . , β jd , where β ji is the individualised importance score for sample j and feature i, we can construct an importance score matrix B by stacking β j for all samples. We benchmark its performance as compared to a logistic regression and support vector machine (SVM), using both gene and gene set annotation features. From this, we see that our model, which adds a level of interpretability at the individual level, does not sacrifice classification accuracy. The objective of DeepTRIAGE is to improve interpretability, not accuracy. Yet, this method appears to perform marginally better for the given test set To detect emerging patterns within the individualised importance scores, we perform non-negative matrix factorisation (NMF) and principal component analysis (PCA) of the importance score matrix B. As a point of reference, we also perform an ordination of the raw feature space from "Engineering annotation-level expression from genes" section. Note that all individualised persample importance scores were calculated on the withheld test set. Table 1 shows the performance of the DeepTRIAGE model for luminal sub-type classification according to a single test set. When applying this model to Ensembl gene expression features, we obtain personalised biomarker scores that describe how important each gene is in predicting the cancer sub-type for each sample. The objective of DeepTRIAGE is to improve interpretability, not accuracy. Yet, this method appears to perform marginally better for the given test set.

GINS1 drives luminal sub-type classification in test set
We can interpret the resultant importance score matrix directly using multivariate methods. Figure 2 shows the NMF factor which best discriminates between the breast cancer sub-types. Here, we see that a single gene, GINS1 (ENSG00000101003), contributes most to this factor. This gene has a role in the initiation of DNA replication, and has been associated with worse outcomes for both luminal A and luminal B sub-types [28]. Interestingly, this is not a PAM50 gene, suggesting that our model does not merely re-discover the PAM50 signature. We posit that the model performance, along with this biologically plausible result, validates its use for gene expression data.

Kinetochore organisation associates with tumour severity within and between luminal sub-types
To reduce the number of features and to facilitate the interpretation of feature importance, we transformed the gene-level expression matrix into an annotation-level expression matrix using the Gene Ontology (GO) annotation set (cf. "Engineering annotation-level expression from genes" section). Table 1 shows that GO annotation features perform as well as gene features for all Fig. 2 This figure presents the results of non-negative matrix factorisation applied to the importance score matrix computed from Ensemble gene expression data using DeepTRIAGE. Shown here is the factor which best discriminates between the two breast cancer sub-types. a shows the relative contribution of each gene term to the most discriminative factor, with the top 3 components labelled explicitly. b shows a box plot of the distribution of all samples across the composite factor score. This figure is produced using the test set only models. Although annotation features do not improve performance, they do improve the interpretability of the model by representing the data in a way that reflects domain-specific knowledge [29]. By applying DeepTRIAGE to the GO features, we obtain personalised biomarker scores that describe how important each GO term is in predicting the cancer sub-type for each sample. Figure 3 shows the most discriminative NMF factor of the GO-based importance score matrix. The left panel shows the relative contribution of each term to this factor, while the right panel shows the distribution of samples with regard to this factor. From this, we see that a single factor cleanly delineates the luminal A samples from the luminal B samples, and is comprised mostly by the GO:0051383 (kinetochore organisation) gene set. Figure 4 shows a PCA of the same importance score matrix, along with a biplot of the 5 most variable GO terms, offering another perspective into the structure of the importance score matrix.
Both visualisations show that the kinetochore organisation gene set can meaningfully discriminate between the luminal A and luminal B cancer sub-types. This gene set contains 5 members: SMC4, NDC80, SMC2, CENPH, and CDT1. Figure 5 shows the expression of these genes in the test data, showing that the prioritised gene set contains genes with significant mean differences between the two sub-types (p-value < 0.01). Interestingly, only one of these (NDC80) is a member of the PAM50 gene set used to define the luminal A and B sub-types. The kinetochore organisation gene set is involved in the assembly and disassembly of the chromosome centromere, an attachment point for spindle microtubules during cell division. The dysregulation of this gene set would be expected to associate with luminal sub-typing because centromere instability drives genomic instability, and luminal B cancers are more unstable than luminal A cancers (as evidenced by Ki-67 staining [30] and tumour severity). Indeed, NDC80 and CENPH dysregulation has already been associated with worse breast cancer outcomes, with luminal A exhibiting less centromere and kinetochore dysregulation in general [31].
However, the real added value of our attention model is that it projects all samples according to a distribution of importance scores, implicitly revealing and describing heterogeneity within the cancer sub-types. While Fig. 4 shows how GO:0051383 distinguishes between the luminal sub-types, it also shows how GO:0031668 (cellular response to extra-cellular stimulus) and GO:0061158 (3'-UTR-mediated mRNA destabilisation) explain much variance within the luminal A group. These axes are not arbitrary. A linear model predicting each PCA axis as a function of the tumour (T), node (N), and metastasis (M) stage (as nominal factors) among the luminal A samples only, reveals that small values in the first axis (PC1) significantly associate with the lower T stages, while large values significantly associate with the N2 stage (p < 0.05). Fig. 3 This figure presents the results of non-negative matrix factorisation applied to the GO-based importance score matrix. Shown here is the factor which best discriminates between the two breast cancer sub-types. a shows the relative contribution of each GO term to the most discriminative factor, with the top 3 components labelled explicitly. b shows a box plot of the distribution of all samples across the composite factor score. This figure is produced using the test set only For the importance scores, we see that the first principal axis describes much of the variance between the breast cancer sub-types, while the second principal axis describes much of the variance within the luminal A sub-type. By super-imposing the features as arrows, we can see which annotations best describe the origin of this variance. This level of structure is not evident when looking at the PCA biplot of the annotation feature space. This figure is produced using the test set only Meanwhile, large values in the second axis (PC2) significantly associate with the T4 stage (p < 0.05). This suggests that the luminal A samples which are closest to luminal B samples in the PCA tend to be worse tumours. This is consistent with the literature which describes luminal B cancer as a more severe disease [18], as well as Netanely et al's observation that luminal cancers exist along a phenotypic continuum of severity [19]. Thus, our method provides a biological explanation for some of the variance associated with the diagnostically-relevant   For the importance scores, we see that the first principal axis describes much of the variance between the breast cancer sub-types, while the second principal axis describes much of the variance within the luminal A sub-type. By super-imposing the features as arrows, we can see which annotations best describe the origin of this variance. This level of structure is not evident when looking at the PCA biplot of the annotation feature space. This figure is produced using the test set only differences in luminal sub-types. This level of resolution is not provided by the other machine learning algorithms used for RNA-Seq data, and is not evident in the ordination of the unattended GO annotation features (see Fig. 4b).

DNA mismatch repair associates with tumour severity within and between luminal sub-types
We repeated the same analysis above using the Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation set which organises genes according to canonical functional pathways (cf. "Engineering annotation-level expression from genes" section). Like with GO annotations, the DeepTRIAGE model performed well with KEGG annotations (see Table 1). By applying DeepTRIAGE to the KEGG features, we obtain personalised biomarker scores that describe how important each KEGG term is for the classification of each patient.
The NMF and PCA ordination of the KEGG-based importance scores both show that hsa03430 (DNA mismatch repair) explains much of the inter-group variability (see Fig. 6 and Fig. 7). This is expected to separate luminal A and B sub-types because errors in the DNA mismatch repair mechanism allow mutations to propagate, resulting in a more aggressive cancer. Yet, the PCA biplot shows that there exists a large amount of intraclass heterogeneity that is not explained by this pathway. Along this axis, we see a contribution by hsa04670 (Leukocyte transendothelial migration) and hsa04215 (Apoptosis), both relevant to tumour progression and metastasis. Again, these axes are not arbitrary. A linear model predicting each PCA axis as a function of the tumour (T), node (N), and metastasis (M) stage (as nominal factors) among the luminal A samples only, reveals that small values in both axes (PC1 and PC2) significantly associate with the T1 stage (p < 0.05). This suggests that the heterogeneity uncovered by the DeepTRIAGE architecture places patients along a diagnostically-relevant continuum of tumour severity. Again, this level of resolution is not provided by other machine learning algorithms and is not evident in the ordination of the unattended annotationlevel data (see Figure 7b).

Conclusions
Breast cancer is a complex heterogeneous disorder with many distinct molecular sub-types. The luminal breast cancer class, comprised of the luminal A and luminal B intrinsic sub-types, varies in disease severity, prognosis, and treatment response [18], and has been described as existing along a vast phenotypic continuum of severity [19]. Stratifying individual cancerous samples along this severity continuum could inform clinical decisionmaking and generate new research hypotheses. In this manuscript, we propose the DeepTRIAGE architecture as a general solution to the classification and stratification of biological samples using gene expression data. To the best of our knowledge, this work showcases the first application of the attention mechanism to the classification of high-dimensional gene expression data.
In developing DeepTRIAGE, we also innovate the attention mechanism so that it extends to high-dimensional data where there are many more features than samples. Using DeepTRIAGE, we show that the attention mechanism can not only classify cancer sub-types with good accuracy, but can also provide individualised biomarker scores that reveal and describe the heterogeneity within and between cancer sub-types. While commonly used feature selection methods prioritise features at the populationlevel during training, our attention mechanism prioritises features at the sample-level during testing. By applying DeepTRIAGE to the gene expression signatures of luminal breast cancer samples, we identify canonical cancer pathways that differentiate between the cancer sub-types and explain the variation within them, and find that some of this intra-class variation associates with tumour severity.