Integrative subspace clustering by common and specific decomposition for applications on cancer subtype identification
BMC Medical Genomics volume 12, Article number: 191 (2019)
Abstract
Background
Recent high-throughput technologies have been applied to collect heterogeneous biomedical omics datasets. Computational analysis of multi-omics datasets could potentially reveal deep insights into a given disease. Most existing clustering methods for multi-omics data assume strong consistency among different data sources, and thus may lose efficacy when the consistency is relatively weak. Furthermore, they cannot identify the conflicting parts of each view, which might be important in applications such as cancer subtype identification.
Methods
In this work, we propose an integrative subspace clustering method (ISC) based on common and specific decomposition to identify clustering structures in multi-omics datasets. The main idea of ISC is that the original representation of the samples in each view can be reconstructed by the concatenation of a common part and a view-specific part lying in orthogonal subspaces. The problem is formulated as a matrix decomposition problem and solved efficiently by our proposed algorithm.
Results
Experiments on simulation and text datasets show that our method outperforms other state-of-the-art methods. Our method is further evaluated by identifying cancer types using a colorectal dataset. We finally apply our method to cancer subtype identification for five cancers using TCGA datasets, and the survival analysis shows that the subtypes we found differ more significantly in survival than those found by the compared methods.
Conclusion
We conclude that our ISC model can not only discover the weak common information across views but also identify the view-specific information.
Background
With advances in biological technologies, many kinds of data are now available, such as genomic DNA copy number arrays, DNA methylation, exome sequencing, messenger RNA arrays, microRNA sequencing and reverse-phase protein arrays. By analyzing the multiple types of data generated from cancer patients, it is now possible to classify patients into different subgroups, and thus improve diagnosis and treatment. For example, breast cancer is one of the most common cancers worldwide, and it is clinically categorized into four basic therapeutic subgroups: (1) Luminal A, an oestrogen receptor (ER) positive group; (2) Luminal B, also an ER positive group; (3) the HER2 amplified group; and (4) triple-negative breast cancers (TNBCs, also called basal-like, lacking expression of ER, progesterone receptor (PR) and HER2). The ER positive group (including Luminal A and B) is the most common and diverse, and several genomic tests can be used to predict outcomes for ER+ patients receiving endocrine therapy. Treatment of the HER2 amplified subtype has been very successful owing to the effective therapeutic targeting of HER2. For basal-like breast cancers, which often occur in patients with BRCA1 mutations or of African ancestry, chemotherapy is the only option. Therefore, subtype identification for breast cancer can assist the treatment of patients.
Most molecular studies of subtype identification for breast cancer integrate genomic, epigenomic, and transcriptomic profiling, including mRNA expression profiling, miRNA expression, DNA methylation and DNA copy number analysis. These studies assume that integrative clustering of multi-omics data can capture clearer structure that cannot be discovered by exploring a single type of omics data alone. In fact, in many other applications, a single object can often be represented by multiple feature sets or views. For example, an image can be represented by its pixels and its captions, an Internet webpage can be represented by its text contents and the hyperlinks to other webpages, and a scientific publication can be represented by its text contents and its citations. In all these applications, multi-view clustering takes information from all views into account so that better clustering structures can be discovered.
The difficulty in multi-view learning lies mainly in the fact that the similarity measures, geometric distributions, clustering structures, noise levels and so on often differ across views. Samples represented in different views may have their own clustering structures, or lie in their own subspaces. These differences hamper clustering significantly, and it is challenging to reconcile the conflicting information among views efficiently.
Most existing multi-view clustering approaches follow one of three directions. The first class of methods [1–7] attempts to determine new representations by minimizing the differences or maximizing the correlations between different views. The second class of approaches propagates information across views to construct graphs or similarities, each in a slightly different way, including multi-view EM [8], multi-view spectral clustering [9, 10], multi-view clustering with unsupervised feature selection [11, 12], nonnegative matrix factorization [13], pattern fusion [14], and similarity network fusion (SNF) [4]. For example, SNF [4] fuses multiple networks into one network by iteratively updating a sequence of nonnegative status matrices. The third class of methods aims to learn an optimal linear combination of multiple kernels or similarities [15–20]. For example, the optimized kernel k-means [16] obtains an optimal linear combination of multiple kernels and the cluster assignment matrix simultaneously by minimizing a trace clustering loss.
However, almost all the existing methods assume strong consistency among different views or omics, and thus they capture the clustering structure by using the hidden shared information. This can be problematic when the different views share only a relatively weak common clustering structure. For instance, different views may have different levels of noise. Furthermore, different views may have conflicting clustering structures, or a single view may have a clustering structure that differs from all the others. All of these make it difficult to identify the shared information among views. A biological example is that analyses of different omics for glioblastoma multiforme (GBM), an aggressive adult brain tumor, obtain different results. One work [21], based on expression and copy-number-variant data, identifies two subtypes, which is inconsistent with the results in [22], which identifies four subtypes primarily from expression data alone. Therefore, when the consistent information is weaker than the conflicting information, which is highly likely in subtype identification, it is challenging to discover the hidden clustering structures. A natural idea to overcome this challenge is to decompose the information in each view into a part shared across all views and a view-specific part. A kernel-based method [23] has been developed following this idea, which attempts to construct a consensus kernel using multi-omics data. However, in applications it focuses on the common part and ignores the view-specific clustering structure. Furthermore, the semidefinite programming used to solve its optimization problem is computationally expensive.
In this work, we propose a novel integrative subspace clustering method that assumes the common structure information is weak across views. The main idea is to find a specific subspace for each view, so that the new representation of each sample in each view in this subspace is a concatenation of two vectors: a common representation shared by all views, and a specific representation for this view. This ensures that the common parts and the specific parts lie in two orthogonal subspaces for each view. Furthermore, the representations of the common part are expected to be independent of those of each specific part, where the dependence is measured by the Hilbert-Schmidt Independence Criterion (HSIC). Our main contributions in this work are summarized as follows.
We propose a novel subspace learning model to discover the common and specific representations for each sample, especially for the case when the common information might be relatively weaker than the specific information. We propose an algorithm to solve the corresponding optimization problem efficiently.
We test our method on simulation datasets, multi-view text datasets, and cancer type identification, and it works best in most cases. In particular, our model works even when the common information across views is very weak.
We apply the proposed clustering method to subtype identification, assuming that the subtype information may also come from the view-specific part of a single omics dataset. We apply our approach to identify subtypes for five cancers using TCGA datasets. The survival analysis on the clustering results shows that our method works best in most cases.
Methods
In this section, we present the proposed integrative subspace clustering method based on multi-view matrix decomposition. We first give a problem statement, and then propose a subspace learning method based on multi-view matrix decomposition. We then introduce the Hilbert-Schmidt Independence Criterion, and finally propose our integrative subspace clustering model ISC and the corresponding optimization algorithm.
Problem statement
Suppose we are given n samples with V views, X=[X_{1},⋯,X_{V}], where \(X_{v} \in R^{p_{v} \times n},v=1,\cdots,V\). Denote \(X_{v} = [x^{v}_{1},\cdots,x^{v}_{n}]\), where \(x^{v}_{i}\in R^{p_{v}}\). The aim is to cluster the n samples into a given number of clusters based on the integrated information from the V views. In cancer subtype identification, the views can be different data sources, omics or platforms.
Subspace learning for common and specific decomposition
We assume that the samples \(X_{v}\in R^{p_{v}\times n}\) from view v approximately lie in a d-dimensional subspace \(\Omega _{v}\subset R^{p_{v}}\) (d<p_{v}), which is spanned by the columns of a matrix with orthonormal columns \(P_{v}\in R^{p_{v}\times d}, P_{v}^{T}P_{v} = I_{d}\). This means that
$$x_{i}^{v} \approx P_{v} z_{i}^{v}, \quad i=1,\cdots,n,$$where \(z_{i}^{v}\in R^{d}\) is the new representation of \(x_{i}^{v}\) in this subspace. We assume that the samples X_{v} from view v have both common and specific clustering structures, which means that \(z_{i}^{v}\) can be further represented as
$$z_{i}^{v} = \begin{pmatrix} c_{i} \\ s_{i}^{v} \end{pmatrix},$$where \(c_{i}\in R^{d_{0}}\) is the common representation of x_{i} across all views, and \(s_{i}^{v}\in R^{d_{v}}\) is the specific representation of x_{i} in the vth view. Note that d=d_{0}+d_{v}. In other words, \(x_{i}^{v}\) can be approximately represented as
$$x_{i}^{v} \approx P_{v}^{(c)} c_{i} + P_{v}^{(s)} s_{i}^{v},$$where \(P_{v} = \left (P_{v}^{(c)} \ \ P_{v}^{(s)}\right), (P_{v}^{(c)})^{T}P_{v}^{(c)}=I_{d_{0}}\) and \(\left (P_{v}^{(s)}\right)^{T}P_{v}^{(s)}=I_{d_{v}}\). This means that the d-dimensional subspace Ω_{v} spanned by P_{v} is further decomposed into two orthogonal subspaces \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\), spanned by the orthonormal columns of \(P_{v}^{(c)}\) and \(P_{v}^{(s)}\), respectively. In other words, \(\Omega _{v} = \Omega _{v}^{(c)} \oplus \Omega _{v}^{(s)}\), where \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\) are orthogonal to each other. We can rewrite the above equations in matrix form as follows,
$$X_{v} = P_{v} Z_{v} + E_{v} = P_{v}^{(c)} C + P_{v}^{(s)} S_{v} + E_{v},$$where \(Z_{v}=\left [z_{1}^{v},\cdots,z_{n}^{v}\right ], C=\left [c_{1},\cdots,c_{n}\right ], S_{v}=\left [s_{1}^{v},\cdots,s_{n}^{v}\right ]\), and E_{v} is the error matrix for view v.
We demonstrate the decomposition idea in Fig. 1. We attempt to find two orthogonal subspaces \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\) for each view v, such that X_{v} could be decomposed to the common part C and the specific part S_{v} in the subspace \(\Omega _{v} = \Omega _{v}^{(c)} \oplus \Omega _{v}^{(s)}\). Hopefully, the common clustering structure is hidden in C, and the specific clustering structure for view v is hidden in S_{v}.
Hilbert-Schmidt independence criterion (HSIC)
To better decompose each view into a common part and a view-specific part, such that each view-specific clustering structure in S_{v} is independent of the common part C across all views, a measure of independence is required. We use the Hilbert-Schmidt Independence Criterion (HSIC), which is a measure of statistical independence [24]. Intuitively, HSIC can be considered as a squared correlation coefficient between two random variables c and s computed in feature spaces \(\mathcal {F}\) and \(\mathcal {G}\).
Let c and s be two random variables from the domains \(\mathcal {C}\) and \(\mathcal {S}\), respectively. Let \(\mathcal {F}\) and \(\mathcal {G}\) be feature spaces on \(\mathcal {C}\) and \(\mathcal {S}\) with associated kernels \( k_{c}: \mathcal {C} \times \mathcal {C} \rightarrow \mathbb {R}\) and \(k_{s}: \mathcal {S} \times \mathcal {S} \rightarrow \mathbb {R}\), respectively. Denote the joint probability distribution of c and s by p_{(c,s)}, and let (c,s) and (c^{′},s^{′}) be drawn independently according to p_{(c,s)}. Then the Hilbert-Schmidt Independence Criterion can be computed in terms of kernel functions via:
$$\mathrm{HSIC}(p_{(c,s)},\mathcal{F},\mathcal{G}) = E_{c,c',s,s'}\left[k_{c}(c,c')k_{s}(s,s')\right] + E_{c,c'}\left[k_{c}(c,c')\right]E_{s,s'}\left[k_{s}(s,s')\right] - 2E_{c,s}\left[E_{c'}\left[k_{c}(c,c')\right]E_{s'}\left[k_{s}(s,s')\right]\right],$$where E is the expectation operator.
The empirical estimator of HSIC for a finite sample of points C and S from c and s with p_{(c,s)} was given in [24] to be
$$\mathrm{HSIC}(C,S) = (n-1)^{-2}\,\mathrm{tr}(K_{c}HK_{s}H),$$where tr is the trace operator of a matrix, H is the centering matrix \(H = I_{n}-\frac {ee^{T}}{n}\) (e is a column vector of all ones of the appropriate dimension), and K_{c} and K_{s}∈R^{n×n} are kernel matrices. The smaller the HSIC value, the more likely C and S are independent of each other.
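As a small illustration, the empirical estimator of [24], \((n-1)^{-2}\mathrm{tr}(K_{c}HK_{s}H)\) with \(H = I_{n}-ee^{T}/n\), can be sketched in NumPy as follows. The function names `empirical_hsic` and `linear_kernel` and the toy data are hypothetical and chosen only for this example; linear kernels are used because they are the ones adopted later in the ISC model.

```python
import numpy as np

def linear_kernel(Z):
    """Linear kernel Z^T Z for a d x n matrix whose columns are samples."""
    return Z.T @ Z

def empirical_hsic(K_c, K_s):
    """Empirical HSIC estimate (n-1)^{-2} tr(K_c H K_s H) for n x n kernel matrices."""
    n = K_c.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix H = I_n - e e^T / n
    return np.trace(K_c @ H @ K_s @ H) / (n - 1) ** 2

# Toy data (illustrative only): S_indep is drawn independently of C,
# whereas S_dep is built from C and is therefore strongly dependent on it.
rng = np.random.default_rng(0)
C = rng.normal(size=(2, 300))
S_indep = rng.normal(size=(3, 300))
S_dep = np.vstack([C, C[:1] + 0.1 * rng.normal(size=(1, 300))])

hsic_indep = empirical_hsic(linear_kernel(C), linear_kernel(S_indep))
hsic_dep = empirical_hsic(linear_kernel(C), linear_kernel(S_dep))
print(f"independent: {hsic_indep:.4f}, dependent: {hsic_dep:.4f}")
```

With linear kernels the estimate equals \((n-1)^{-2}\Vert CHS^{T}\Vert _{F}^{2}\), so it is nonnegative and is expected to be much smaller for the independent pair than for the dependent one.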
Integrative subspace clustering (ISC) model
Based on the above considerations, we propose our integrative subspace clustering model as follows,
where \(S_{v}^{T}S_{v}\) and C^{T}C are the linear kernels of S_{v} and C, respectively, and β is a parameter. Note that the first term is the decomposition term, which seeks the orthogonal subspaces in which the common and view-specific representations lie, and the second term is the independence term, which minimizes the dependence between the common part and each view-specific part. We use linear kernels of C and S_{v} to simplify the computation. After C and the S_{v} for all views are obtained, k-means clustering is applied to the samples represented by C and by each S_{v}, respectively. The clustering results obtained from the common part C and the specific parts S_{1}, S_{2}, ⋯ are called ISCC, ISCS1, ISCS2, ⋯, respectively.
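For orientation, a model matching this description (a decomposition term plus a linear-kernel HSIC penalty weighted by β, with orthonormal \(P_{v}\)) could be written as below. This is a reconstruction from the surrounding definitions, not a quotation of (3); in particular, absorbing the constant factor \((n-1)^{-2}\) of the HSIC estimator into β is an assumption.

$$\min_{\{P_{v}\},\,C,\,\{S_{v}\}} \; \sum_{v=1}^{V} \left\| X_{v} - P_{v}\begin{pmatrix} C \\ S_{v} \end{pmatrix} \right\|_{F}^{2} + \beta \sum_{v=1}^{V} \mathrm{tr}\left(C^{T}C\,H\,S_{v}^{T}S_{v}\,H\right) \quad \text{s.t.} \quad P_{v}^{T}P_{v} = I_{d},\; v=1,\cdots,V.$$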
Based on the resulting C and S_{i}, we define a consensus score (Cscore), similar to that in [23], as below:
Cscore is used to measure the weight of the consensus part in the ith view. Note that the Cscore ranges from 0 to 1, and a higher Cscore implies stronger consistent information in the corresponding view.
Optimization algorithm
We propose an alternating update approach to solve the optimization problem (3).
Step 1. We first fix P_{v} and C in (3), and solve for the optimal S_{1},⋯,S_{V} one by one. The vth optimization subproblem can be written as:
Since P_{v} can be represented as \(P_{v}=(P_{v}^{(c)} \ \ P_{v}^{(s)})\), the subproblem (5) to solve for S_{v} can be simplified to:
By setting the derivatives of the objective function f(S_{v}) in (6) with respect to S_{v} to be zero, we obtain
The matrix equation for S_{v} in (7) is a standard Sylvester equation and can be solved efficiently using the method in [25].
Step 2. We then fix C,S_{1},⋯,S_{V}, and solve the optimization problem (3) for optimal P_{1},⋯,P_{V} one by one. The corresponding vth optimization subproblem can be written as:
where \(Z_{v} = \begin{pmatrix} C \\ S_{v} \end{pmatrix}.\) The optimization problem (8) is a least squares problem on the Grassmann manifold, and is solved by Algorithm 2 in [26].
Step 3. We fix P_{1},⋯,P_{V} and S_{1},⋯,S_{V}, then solve the optimization problem (3) for C. The corresponding subproblem can be written as:
Similarly, we set the derivative of the objective function of subproblem (9) with respect to C to zero, and obtain
The matrix equation for C in (10) is also a standard Sylvester equation and the same algorithm for solving (7) can be used.
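As a sketch of where Sylvester-type equations arise in Steps 1 and 3, suppose the independence term for view v is the linear-kernel penalty \(\beta \,\mathrm{tr}(C^{T}C\,H\,S_{v}^{T}S_{v}\,H)\), with the factor \((n-1)^{-2}\) absorbed into β (this scaling is an assumption, so the expressions below may differ from (7) and (10) by constants). Using \((P_{v}^{(s)})^{T}P_{v}^{(c)}=0\), which follows from \(P_{v}^{T}P_{v}=I_{d}\), and setting the gradients with respect to S_{v} and C to zero gives

$$S_{v}\left(I_{n} + \beta HC^{T}CH\right) = \left(P_{v}^{(s)}\right)^{T}X_{v}, \qquad C\left(V I_{n} + \beta \sum_{v=1}^{V} HS_{v}^{T}S_{v}H\right) = \sum_{v=1}^{V}\left(P_{v}^{(c)}\right)^{T}X_{v}.$$

Both have the form AX+XB=Q with A a multiple of the identity, and both right-hand coefficient matrices are symmetric positive definite for β≥0, so each equation has a unique solution.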
The overall algorithm for solving (3) is shown in the algorithm box ISC. In each iteration, we solve three subproblems to alternately update S_{v}, P_{v} and C. The objective function of the ISC model in (3) is bounded below by zero, and the objective value does not increase at any of the three subproblem steps. Therefore, the convergence of the objective values of our algorithm is assured. We also show the convergence of the objective values experimentally on four text datasets in Fig. 2, which further confirms the convergence analysis above.
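The three alternating steps can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is taken to be \(\sum _{v}\Vert X_{v}-P_{v}[C;S_{v}]\Vert _{F}^{2}+\beta \sum _{v}\mathrm{tr}(C^{T}CHS_{v}^{T}S_{v}H)\) (HSIC scaling absorbed into β), the Sylvester-type equations are solved by direct linear solves rather than the method in [25], and the P_{v} update uses the closed-form orthogonal Procrustes solution in place of Algorithm 2 in [26]. The names `isc_sketch` and `objective` are hypothetical.

```python
import numpy as np

def objective(Xs, Ps, C, Ss, H, beta):
    """Assumed ISC objective: reconstruction error plus linear-kernel HSIC penalty."""
    fit = sum(np.linalg.norm(X - P @ np.vstack([C, S])) ** 2
              for X, P, S in zip(Xs, Ps, Ss))
    dep = sum(np.trace(C.T @ C @ H @ S.T @ S @ H) for S in Ss)
    return fit + beta * dep

def isc_sketch(Xs, d0, dvs, beta=1.0, n_iter=30, seed=0):
    """Alternating updates of S_v, P_v and C; requires p_v >= d0 + d_v for every view."""
    rng = np.random.default_rng(seed)
    n, V = Xs[0].shape[1], len(Xs)
    H = np.eye(n) - np.ones((n, n)) / n
    # Random initial bases with orthonormal columns (QR of a Gaussian matrix).
    Ps = [np.linalg.qr(rng.normal(size=(X.shape[0], d0 + dv)))[0]
          for X, dv in zip(Xs, dvs)]
    C = rng.normal(size=(d0, n))
    history = []
    for _ in range(n_iter):
        # Step 1: solve S_v (I + beta H C^T C H) = (P_v^(s))^T X_v for each view.
        A = np.eye(n) + beta * H @ C.T @ C @ H
        Ss = [np.linalg.solve(A, (P[:, d0:].T @ X).T).T for P, X in zip(Ps, Xs)]
        # Step 2: orthogonal Procrustes, P_v = U W^T where X_v Z_v^T = U diag(s) W^T.
        for v, X in enumerate(Xs):
            Z = np.vstack([C, Ss[v]])
            U, _, Wt = np.linalg.svd(X @ Z.T, full_matrices=False)
            Ps[v] = U @ Wt
        # Step 3: solve C (V I + beta sum_v H S_v^T S_v H) = sum_v (P_v^(c))^T X_v.
        B = V * np.eye(n) + beta * sum(H @ S.T @ S @ H for S in Ss)
        R = sum(P[:, :d0].T @ X for P, X in zip(Ps, Xs))
        C = np.linalg.solve(B, R.T).T
        history.append(objective(Xs, Ps, C, Ss, H, beta))
    return Ps, C, Ss, history

# Toy run on random two-view data (illustrative only).
rng = np.random.default_rng(1)
Xs = [rng.normal(size=(8, 60)), rng.normal(size=(8, 60))]
Ps, C, Ss, history = isc_sketch(Xs, d0=2, dvs=[2, 2], beta=0.1, n_iter=20)
print(history[0], history[-1])
```

Because each step minimizes the assumed objective exactly over one block of variables, the recorded objective values should be non-increasing, which mirrors the convergence argument above. k-means would then be applied to the columns of `C` and of each matrix in `Ss`.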
Results
Comparative methods
We compare our ISC model with the following comparative methods.
Spectral clustering for single views (SV1, SV2).
Co-regularized spectral clustering (Coreg) [3]. The Coreg method extends single-view spectral clustering by adding a co-regularization term that forces the low-dimensional embeddings from multiple views to be close.
Similarity network fusion (SNF) [4]. The SNF method integrates the sample similarity networks constructed from each data type into a single similarity network by a nonlinear combination approach. The converged network can then be used to cluster multi-view datasets.
Enhanced consensus multi-view clustering model (ECMC) [23]. The ECMC method attempts to find the consensus kernel of multiple views by dividing the kernel of each view into a consensus kernel and a disagreement kernel. The method can achieve relatively good clustering results even when the correlation between views is weak.
Measurements of clustering performance
We use the following three measurements to evaluate the clustering results when the ground truth clustering is given.
Normalized mutual information (NMI). The normalized mutual information (NMI) of a clustering result \({\mathcal {C}} = \{C_{k}\}\) is defined as
$$\begin{aligned} \mbox{NMI}({\mathcal{C}},{\mathcal{C}}^{*}) &= \frac{2\,\text{MI}({\mathcal{C}},{\mathcal{C}}^{*})}{H({\mathcal{C}})+H({\mathcal{C}}^{*})}\quad \text{with}\\ \text{MI}({\mathcal{C}},{\mathcal{C}}^{*}) &= \sum_{C_{k}\in {\mathcal{C}},\,C_{\ell}^{*} \in {\mathcal{C}}^{*}} p\left(C_{k},C_{\ell}^{*}\right)\cdot \log_{2} \frac{p\left(C_{k},C_{\ell}^{*}\right)}{p(C_{k})\,p\left(C_{\ell}^{*}\right)}, \end{aligned}$$where \({\mathcal {C}}^{*}=\{C^{*}_{\ell}\}\) is the ground truth clustering, \(p(C_{k}):= |C_{k}|/n\), \(p\left (C_{k},C_{\ell}^{*} \right):= |C_{k}\cap C_{\ell}^{*}|/n\) is the joint probability of the two classes C_{k} and \(C_{\ell}^{*}\), and \(H({\mathcal {C}}) = -\sum _{C_{k} \in {\mathcal {C}}}p(C_{k})\log _{2} (p(C_{k})).\)
Average clustering accuracy (ACC). With the clustering labels {l_{j}} of \({\mathcal {C}}\) relabeled by the cluster ordering that best matches the ground truth labels \(\left \{l^{*}_{j}\right \}\) of \({\mathcal {C}}^{*}\), the average clustering accuracy (ACC) is defined as
$$ ACC({\mathcal{C}},{\mathcal{C}}^{*}) = \frac1n \sum_{j=1}^{n} \delta\left(l_{j},l_{j}^{*}\right), $$where the function \(\delta (l_{j},l_{j}^{*})=1\) if \(l_{j}=l_{j}^{*}\), and \(\delta \left (l_{j},l_{j}^{*}\right)=0\) otherwise.
Adjusted Rand index (ARI). For a computed cluster C_{i} and a ground truth cluster \(C^{*}_{j}\), let \(n_{i.}=|C_{i}|, n_{.j}=|C^{*}_{j}|\), and \(n_{ij}=|C_{i} \cap C^{*}_{j}|\). The adjusted Rand index is defined as
$$ARI=\frac{RI-E(RI)}{\max(RI)-E(RI)},$$where \(RI=\sum _{i,j} C_{n_{ij}}^{2}, \max(RI)=\frac {1}{2}\left (\sum _{i} C_{n_{i.}}^{2}+ \sum _{j} C_{n_{.j}}^{2}\right)\), and \(E(RI)=\left (\sum _{i} C_{n_{i.}}^{2}\right)\left (\sum _{j}C_{n_{.j}}^{2}\right)/C_{n}^{2}\), where \(C_{m}^{2}\) denotes the number of combinations of two elements out of m. The range of ARI is from −1 to 1. A larger value of ARI means that the clustering result is more consistent with the ground truth clustering.
Silhouette score (Sscore) [27]. When the ground truth clustering is unknown, the above criteria cannot be computed, and thus the Silhouette score defined as follows can be used
$$\text{Sscore} = \frac{1}{n} \sum_{i} \frac{b_{i}-a_{i}}{\max\{a_{i},b_{i}\}},$$where a_{i} is the average Euclidean distance from sample i to the other samples within the same cluster as sample i, and b_{i} is the minimum, over all other clusters, of the average Euclidean distance from sample i to all samples in that cluster. The range of the silhouette score is from −1 to 1. The larger the silhouette score is, the better the clustering structure is.
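The four measurements can be computed directly from the definitions, as in the following NumPy sketch. The function names are hypothetical, ACC matches clusters by brute force over label permutations (practical only for a handful of clusters; a Hungarian-type assignment solver is the usual choice for larger problems), and the silhouette function assumes every cluster has at least two samples.

```python
import numpy as np
from itertools import permutations
from math import comb

def contingency(labels, truth):
    """Counts n_ij = size of the intersection of computed cluster i and true cluster j."""
    _, li = np.unique(labels, return_inverse=True)
    _, ti = np.unique(truth, return_inverse=True)
    N = np.zeros((li.max() + 1, ti.max() + 1), dtype=int)
    np.add.at(N, (li, ti), 1)
    return N

def nmi(labels, truth):
    p = contingency(labels, truth) / len(labels)
    pi, pj = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log2(p[nz] / np.outer(pi, pj)[nz]))
    entropy = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    return 2 * mi / (entropy(pi) + entropy(pj))

def acc(labels, truth):
    N = contingency(labels, truth)
    k = max(N.shape)
    M = np.zeros((k, k), dtype=int)      # pad to square so every cluster can be matched
    M[:N.shape[0], :N.shape[1]] = N
    best = max(sum(M[i, perm[i]] for i in range(k)) for perm in permutations(range(k)))
    return best / len(labels)

def ari(labels, truth):
    N = contingency(labels, truth)
    ri = sum(comb(int(x), 2) for x in N.ravel())
    a = sum(comb(int(x), 2) for x in N.sum(axis=1))
    b = sum(comb(int(x), 2) for x in N.sum(axis=0))
    expected = a * b / comb(len(labels), 2)
    return (ri - expected) / (0.5 * (a + b) - expected)

def silhouette(X, labels):
    """Mean silhouette for samples stored in the columns of X (Euclidean distance)."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    scores = []
    for i, li in enumerate(labels):
        same = labels == li
        same[i] = False                  # exclude the sample itself from a_i
        a = D[i, same].mean()
        b = min(D[i, labels == l].mean() for l in set(labels.tolist()) if l != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy check: a relabeled but otherwise identical clustering, and one with a single error.
truth = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([1, 1, 1, 0, 0, 0])
pred2 = np.array([0, 0, 1, 1, 1, 1])
X1d = np.array([[0.0, 0.1, 0.2, 5.0, 5.1, 5.2]])
print(nmi(pred, truth), acc(pred, truth), ari(pred, truth), acc(pred2, truth))
print(silhouette(X1d, truth))
```

NMI, ACC and ARI are invariant to how the cluster labels are named, which is why the swapped labelling `pred` scores the same as the ground truth itself.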
Simulation experiments
In this section, we use synthetic datasets to evaluate our ISC model. The synthetic datasets are generated in the following way. We first sample 200 two-dimensional points evenly from a mixture of two Gaussian distributions with μ_{1}=[−4,6], μ_{2}=[3,−10] and a common covariance matrix Σ=[10 0;0 6], and thus obtain a matrix Y∈R^{2×200}. By adding white noise to Y, we get two data matrices Y_{1}∈R^{2×200} and Y_{2}∈R^{2×200}, which can be considered as the common parts of the two views. We then construct two specific matrices T_{1} and T_{2} by randomly permuting the columns of Y_{1} and Y_{2}, respectively. Finally, we randomly construct two matrices P_{v}∈R^{8×4} and construct the two-view matrices X_{v}=P_{v}[Y_{v};tT_{v}]∈R^{8×200},(v=1,2), where t is a parameter that controls the degree of inconsistency between the views. Note that the ground truth clustering labels for the common part and for the two specific parts are all known, and are denoted by y,y_{1},y_{2}. We construct 10 corresponding datasets by taking t={0.1,0.9,1,2,5,6,10,15,20,30}. We report the consensus scores for the two views on the simulation datasets in Table 1. From the table, we can see that simulation datasets with small t have high consensus scores and those with large t have low consensus scores.
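The generation procedure can be sketched as follows. Several details are not specified in the text and are assumptions here: "evenly" is read as 100 points per Gaussian component, the white-noise standard deviation `noise_std` is arbitrary, and each random P_{v} is given orthonormal columns via a QR factorization, in keeping with the constraint \(P_{v}^{T}P_{v}=I_{d}\) of the model. The function name `make_views` is hypothetical.

```python
import numpy as np

def make_views(t, noise_std=0.5, seed=0):
    """Two-view synthetic data X_v = P_v [Y_v; t T_v] with known common and specific labels."""
    rng = np.random.default_rng(seed)
    mus = np.array([[-4.0, 6.0], [3.0, -10.0]])
    cov = np.array([[10.0, 0.0], [0.0, 6.0]])
    y = np.repeat([0, 1], 100)  # assumption: 100 points per mixture component
    Y = np.vstack([rng.multivariate_normal(mus[k], cov, size=100) for k in (0, 1)]).T
    Xs, parts, spec_labels = [], [], []
    for _ in range(2):
        Yv = Y + noise_std * rng.normal(size=Y.shape)   # common part of this view
        perm = rng.permutation(200)
        Tv = Yv[:, perm]                                # specific part: permuted columns
        Pv, _ = np.linalg.qr(rng.normal(size=(8, 4)))   # assumption: orthonormal columns
        Xs.append(Pv @ np.vstack([Yv, t * Tv]))
        parts.append((Yv, Tv))
        spec_labels.append(y[perm])                     # labels of the specific structure
    return Xs, parts, y, spec_labels

t = 10
Xs, parts, y, spec_labels = make_views(t)
print(Xs[0].shape, Xs[1].shape)
```

With orthonormal columns in P_{v}, the linear kernel of each view splits additively as \(X_{v}^{T}X_{v}=Y_{v}^{T}Y_{v}+t^{2}T_{v}^{T}T_{v}\), so t directly sets the relative weight of the specific structure.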
We first compare the three clustering results obtained by our method and show how their performance changes with t. We apply our ISC model to compute the common part C and the specific parts S_{1} and S_{2}. k-means clustering is then applied to C, S_{1} and S_{2}, which gives the three clustering results ISCC, ISCS1 and ISCS2, respectively. Since k-means may be sensitive to initialization, we run it 100 times and report the average of the results. We choose the parameter β from {0,1e−6,1e−5,⋯,1e+5,1e+6}. We report the average Silhouette scores for the three clustering results in Table 1. As we can see, the clustering result of ISCC achieves a higher silhouette score than the clustering results of ISCS1 and ISCS2 for every t, which indicates that the common part may have a better clustering structure in the simulation datasets. We also compute the NMI, ACC and ARI by comparing the three clustering results with the ground truth labels y,y_{1} and y_{2}, respectively. The average values are reported in Table 2. We make two observations from the results. First, ISCC performs perfectly as t changes, and the results of ISCS1 and ISCS2 improve as t increases. This means that ISCC can always capture the common structure even when the consistency is very weak, and that ISCS1 and ISCS2 capture the specific structures better as the consistency gets weaker. Second, ISCC achieves higher NMI, ACC and ARI values than ISCS1 and ISCS2, which is consistent with the results obtained from the silhouette scores. This implies that Silhouette scores may be used to select the best clustering result.
We then compare our clustering result ISCC with the comparison methods, all of which except ECMC assume strong consistency across views, by computing the NMI, ACC, and ARI of each method. The average values for all the methods are reported in Table 2. When t is relatively small, almost all the methods perform well. As the degree of inconsistency increases with t, our method ISCC outperforms the other methods. This is because, when the consistency signal is very weak, existing methods can no longer capture the common clustering structure, whereas ISCC can still discover it very well. We also plot the clustering results for all multi-view methods with t=0.1 and t=10 in Fig. 3. In the figure, since the common result of the SNF method is in the form of a kernel, we present all the data in the form of kernels. Specifically, for the simulation datasets, the linear kernels of X_{v},Y_{v} and T_{v} are denoted as \(K_{v}, K_{v}^{c}\) and \(K_{v}^{s}\), respectively. In addition, when using a linear kernel, the equations \(K_{v}=K_{v}^{c}+K_{v}^{s}\) hold for v=1,2. We can see that in Fig. 3a, where t is small and the consensus score is high, all methods discover the latent common clustering structure with high accuracy. However, in Fig. 3b, where t is large and the consensus score is low, all baseline methods fail to discover the best clustering structure, but our ISCC method still captures the common structure across views. This further shows the power of our method even when the common information is very weak.
Experiments on multiview text datasets
In this section, we evaluate our ISC method on multi-view text datasets. Since only the ground truth labels for the common part are known, we compare the ISCC results with those of the other methods.
BBC and BBCSport datasets. The BBC dataset consists of 2,225 documents from the BBC News website, covering stories in the five topical areas of business, entertainment, politics, sport and technology from 2004 to 2005. The BBCSport dataset consists of 737 documents from the BBC Sport website, corresponding to sports news articles in the five topical areas of athletics, cricket, football, rugby and tennis from 2004 to 2005. Each article is divided into up to four segments of at least 200 characters each, and the segments are then randomly assigned to views, which yields the BBC2/3/4views and BBCSport2/3/4views datasets. Here we select only the BBC2/3views and BBCSport2/3views datasets for clustering.
Cora dataset. The Cora dataset consists of machine learning papers, each belonging to one of seven categories: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, and theory. There are 2,708 papers in the entire corpus. The dataset consists of two views. One view is a 0/1-valued word vector, indicating the absence/presence of the corresponding word in the dictionary. The other view is the citation relationship between each publication and the other publications.
By using the ISC model, we obtain the common part C and then apply k-means clustering to C. We compare the results of ISCC with those of the other methods, and the results are shown in Table 3. We can see from the table that our ISC model works best in most cases.
Identifying cancer types by colorectal cancer dataset
Tumors may not be diagnosed pathologically, and thus it is meaningful to determine whether a patient's specific symptoms indicate colon cancer or colorectal cancer. We further evaluate our method by distinguishing colon cancer from colorectal cancer on a colorectal cancer dataset [28], which consists of exome sequences, DNA copy number, promoter methylation, and messenger RNA and microRNA expression for 276 patients. We select three types of data: DNA methylation, mRNA expression and miRNA expression. Specifically, DNA methylation profiles are obtained by the Illumina Infinium HumanMethylation27 arrays, mRNA expression profiles are generated by Agilent microarray, and miRNA is quantified via Illumina sequencing. After screening, we obtain 85 cancer patients with colon cancer or colorectal cancer.
We apply our ISC model to identify the cancer types (colon cancer or colorectal cancer) for these patients with two or three views, and obtain the corresponding common part C and three specific parts S1, S2 and S3. Since we assume that the cancer type or subtype structures may be specific to a single omics, we check the clustering results for both the common and specific parts to see whether they capture the clustering information for cancer types. Note that the ground truth for cancer types is known, so we can also calculate NMI, ACC and ARI for the common part ISCC and the specific parts ISCS1, ISCS2, ISCS3. The results are reported in Table 4. Our method performs better than the baseline methods in most cases. Overall, our method ISCC, using the common part of the DNA methylation and miRNA expression data, performs the best among all the obtained clustering results. While SNF works best for the combination of miRNA and mRNA expression, our ISC method with the specific part of DNA methylation (ISCS1) works best among all methods on the view combinations that include DNA methylation. This may imply that DNA methylation plays an important role in the identification of the cancer type, and it supports our hypothesis that information about the type of cancer may be hidden in a particular omics.
Applications on cancer subtype identification using TCGA datasets
We finally apply our ISC model to data from The Cancer Genome Atlas (TCGA) Research Network [29] to identify subtypes for five cancers. TCGA is currently the largest database of cancer genetic information, and includes 33 types of cancer, 10 of which are rare cancer types. For each cancer, the database contains gene expression data, miRNA expression data, copy number variation, DNA methylation, SNP data, etc., together with ample clinical data.
Data sets
The TCGA datasets for the five cancers were collected by Wang et al. [4]. They cover five cancer types: glioblastoma multiforme (GBM), renal clear cell carcinoma (KRCCC), breast invasive carcinoma (BIC), colon adenocarcinoma (COAD) and lung squamous cell carcinoma (LSCC). There are three types of data for each cancer: DNA methylation, mRNA expression, and miRNA expression, as well as clinical information, including survival data for patients. Since we do not have ground truth labels for the subtypes in these datasets, survival analysis is mainly used to evaluate our model.
For each of the five datasets, we apply the ISC model to compute the common part and the specific parts, and then apply k-means to obtain clustering results. The procedure for obtaining the cancer subtypes is the same as that used for the colorectal cancer dataset. The numbers of subtypes are chosen as 3, 3, 4, 3 and 4 for GBM, KRCCC, BIC, COAD, and LSCC [4], respectively. We also report consensus scores for the three views of the five cancers in Table 5. As we can see, the consensus scores for the first two views are both very low. This implies that the consistent information across views is relatively weak compared to the inconsistent information, and thus the traditional multi-view methods may not work.
Survival analysis
We apply the logrank test to measure whether different subtypes obtained by clustering are meaningful, since the survival time in months are given for each sample in the TCGA datasets. The logrank test is a commonly used nonparametric test method for comparison of survival processes in survival analysis and can be used to compare whether two or more sets of survival curves are identical. In general, the smaller the pvalue obtained from it, the more different the survival curves of the two or more groups.
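To make the test concrete, the following sketch computes the log-rank statistic for the two-group case, where the statistic is compared with a chi-square distribution with one degree of freedom. It is an illustration only: the function name `logrank_two_groups` and the toy data are hypothetical, and comparing more than two subtypes, as done in this study, requires the multi-group form of the test with k−1 degrees of freedom.

```python
import numpy as np
from math import erfc, sqrt

def logrank_two_groups(time, event, group):
    """Two-group log-rank test.

    time  : observed time for each sample
    event : 1 if the event (death) was observed, 0 if censored
    group : 1 or 0, the group membership of each sample
    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    time, event, group = map(np.asarray, (time, event, group))
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):        # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                        # samples at risk just before t
        n1 = (at_risk & (group == 1)).sum()      # of which in group 1
        died = (time == t) & (event == 1)
        d = died.sum()                           # events at t
        d1 = (died & (group == 1)).sum()         # of which in group 1
        o_minus_e += d1 - d * n1 / n             # observed minus expected in group 1
        if n > 1:                                # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = o_minus_e ** 2 / var
    p_value = erfc(sqrt(chi2 / 2))               # chi-square(1) survival function
    return chi2, p_value

# Toy data: group 1 has uniformly shorter survival than group 0, with no censoring.
time = [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]
event = [1] * 10
group = [1] * 5 + [0] * 5
chi2, p_value = logrank_two_groups(time, event, group)
print(chi2, p_value)

# Two groups with identical survival times give a statistic of zero.
chi2_same, p_same = logrank_two_groups([1, 2, 3, 1, 2, 3], [1] * 6, [1, 1, 1, 0, 0, 0])
```

In practice, an established survival analysis package would normally be used instead of a hand-written routine, especially when censoring is heavy or more than two groups are compared.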
The logrank pvalues for all the methods are reported in Table 6. we can see from the table that, for four cancers including GBM, BIC, KRCCC, and LSCC, our ISC method could obtain the most significant pvalues. For COAD, our method with ISCS2 could obtain the similarly good pvalue with the ECMC method. Furthermore, the subtypes for GBM and KRCCC found by the common part across three views obtain the most significant pvalues, the BIC subtypes found by miRNA expression are the most significant, and the subtypes for LSCC found by DNA methylation are the most significant. We also report the silhouette scores for the clustering results of ISCC, ISCS1, ISCS2, and ISCS3 in Table 7. By comparing Tables 6 and 7, for four of five datasets except GBM, the best clustering results with the best cox pvalues among our four clustering results are corresponding to the highest silhouette scores. This implies that the our selection sheme for the clustering results is effective in this application.
We also plot the Kaplan-Meier survival curves for the ISC clustering results with the most significant p-values for all five cancer types. Figure 4 shows the curves for GBM, BIC, COAD, and LSCC, and Fig. 5 shows the curve for KRCCC. From the figures, we can see significantly different survival profiles across the subtypes. For KRCCC, we also plot in Fig. 5 the Kaplan-Meier survival curves obtained by the baseline methods Coreg, ECMC and SNF. We can see that the survival curves obtained by our ISC method differ more significantly than those obtained by the other compared methods.
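For readers unfamiliar with how such curves are built, the following is a minimal sketch of the Kaplan-Meier product-limit estimator for a single group, using only the Python standard library. The function name `kaplan_meier` and the toy data are hypothetical; one curve of this kind would be computed per subtype and then plotted.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function (illustrative sketch).

    times  : follow-up times (e.g. months)
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (t, S(t)) pairs, one per distinct event time.
    """
    curve = []
    survival = 1.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1.0 - deaths / at_risk   # product-limit update
        curve.append((t, survival))
    return curve

if __name__ == "__main__":
    # Toy data: four patients, the second one censored at month 2.
    for t, s in kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]):
        print(t, s)
```

In the toy example the censored patient leaves the risk set without causing a drop, so the estimate steps through 0.75, 0.375 and 0 at months 1, 3 and 4.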
Subtype visualization
We further analyze the breast cancer subtypes obtained by our ISC model with S3, since S3 (miRNA expression) generates the most significantly different survival profiles across subtypes. Fig. 6 shows a visualization of the four breast cancer subtypes identified by the specific part of miRNA (S3). It can be seen that, under this clustering, the samples in the other two views (mRNA expression and DNA methylation) are not well separated, and some subtypes are even very similar. However, the miRNA expression characteristics of the four subtypes appear markedly different. This implies that the best subtypes, identified by ISCS3, are specifically revealed by miRNA expression but not by the other views.
Drug treatment analysis on cancer subtypes
We finally validate the obtained subtypes by comparing the survival profiles of different treatment groups within each subtype. We choose two drug treatments, Cytoxan and Adriamycin, for breast cancer, and the drug treatment temozolomide for GBM. For each subtype, we check whether the survival profiles differ significantly between the treated and the untreated patients. The Cox p-values for all three treatments in all subtypes are reported in Table 8. Interestingly, we can see that for breast cancer, the patients in Subtype 2 are sensitive to both Cytoxan and Adriamycin. The Kaplan-Meier survival curves for these two treatments in Subtype 2 are shown in Fig. 7. In Subtype 1 of GBM, the patients treated with temozolomide have significantly different survival profiles from the untreated patients in this subtype. The Kaplan-Meier survival curves for GBM Subtype 1 are shown in Fig. 8. These results further validate that the subtypes we found are biologically meaningful.
Discussion on breast subtypes
We further discuss the subtypes we found for breast cancer. Breast cancer is a heterogeneous and polygenic disease, and is one of the most common malignancies in women. Based on histological and genomic features, breast cancer can be roughly separated into four subtypes (luminal A, luminal B, HER2-amplified, and basal-like) [30].
To date, researchers have reported many genes related to subtypes of breast cancer. We first collect genes associated with each of these subtypes, and then check how well our four resulting subtypes match the four known subtypes. BUB1, CDCA4, CHEK1, FOXM1 and HDAC2 are probably key genes in the basal-like subtype, because alterations in these genes are linked to a deletion event in basal cancers: a basal-like cancer-enriched subgroup harbours chromosome 5q deletions, with which several signaling molecules, transcription factors and cell division genes are associated [31]. In addition, the basal-like subtype may also correlate with the gene EGFR, which is supported by the fact that alterations of EGFR, p53 and PTEN are cooperative and likely to play an important role in basal-like breast cancer pathogenesis [32]. For the luminal B subtype, PPP2R2A is an associated gene due to the dysregulation of specific PPP2R2A functions in luminal B breast cancers [31]. The genes ZNF703 and DHRS2 are also likely to correlate with luminal B, since [33] suggests that ZNF703 is a luminal B specific driver and that tumors with elevated ZNF703 levels are characterized by alterations in a lipid metabolism and detoxification pathway that includes DHRS2 as a key signaling component. For the HER2 subtype, [34] confirms that agents targeting GAB2 or GAB2-dependent pathways may be useful for treating breast tumors that overexpress HER2, and thus we include GAB2 as a correlated gene for HER2-type breast cancer. In addition, trastuzumab blocks the HER2-HER3 (ERBB3) interaction and is used to treat breast cancers with HER2 overexpression, although some of these cancers develop trastuzumab resistance. By using small interfering RNA (siRNA) to identify genes involved in trastuzumab resistance, [35] identified several kinases and phosphatases that were upregulated in trastuzumab-resistant cancers, including PPM1H. This suggests that PPM1H and ERBB3 may be linked to HER2-type breast cancer.
For each subtype computed by our ISC algorithm, we first calculate t-test p-values for each of these correlated genes to assess whether the gene expression levels differ significantly between the subtype and the other subtypes. We then apply Fisher's combined probability test [36] to compute a group p-value for these genes, which tests whether the selected group of genes differs significantly between the subtype and the other subtypes. We report the group p-values for each resulting subtype in Table 9. The results show that our computed Subtype 2 very likely corresponds to the basal-like breast cancer subtype, with a group p-value of 3.83e-08. Our computed Subtype 4 may also contain the basal-like breast cancer subtype, with a group p-value of 4.79e-07. Our Subtype 4 probably corresponds to the HER2 breast cancer subtype, with a group p-value of 4.17e-07, and our Subtype 3 is likely to correspond to the luminal B breast cancer subtype.
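The combination step above can be sketched as follows. Fisher's method forms the statistic X = -2 * sum(ln p_i) over the k per-gene p-values and compares it with a chi-square distribution with 2k degrees of freedom; because the degrees of freedom are even, the tail probability has a closed form. The function name `fisher_combined` and the example p-values are hypothetical and are not taken from Table 9, and the method assumes the individual tests are independent.

```python
import math

def fisher_combined(p_values):
    """Fisher's combined probability test (illustrative sketch).

    Returns (statistic, group_p_value), where the statistic
    X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null (k independent tests).
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)   # X / 2
    # Chi-square survival function for even d.o.f. 2k:
    #   exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, series = 1.0, 0.0
    for i in range(k):
        series += term
        term *= half_stat / (i + 1)
    return 2.0 * half_stat, math.exp(-half_stat) * series

if __name__ == "__main__":
    # Hypothetical per-gene t-test p-values for one subtype.
    stat, group_p = fisher_combined([0.01, 0.04, 0.2])
    print(f"X = {stat:.3f}, group p-value = {group_p:.5f}")
```

With a single input the group p-value equals that input, and for two inputs it reduces to p1 * p2 * (1 - ln(p1 * p2)), which gives a quick sanity check on the formula.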
Conclusion
Our goal in this work is to discover common and specific information simultaneously from multiple views when the consistency across views is relatively weak and the specific signal is strong. We propose an integrative subspace clustering method (ISC) by common and specific decomposition to find two orthogonal subspaces for each view. To better distinguish the common part from the view-specific part, we also encourage the two parts to be as independent as possible, as measured by HSIC. Our simulation experiments, real-world benchmark experiments, cancer type identification on colorectal data, and subtype identification for five cancers on TCGA datasets all show that the ISC model outperforms other state-of-the-art multiview clustering algorithms. In particular, we find some interesting subtypes in breast cancer and GBM, and the survival analysis shows that the subtypes are biologically meaningful.
Availability of data and materials
Multiview text datasets were downloaded from http://mlg.ucd.ie/datasets/bbc.html. Colorectal cancer dataset was downloaded from http://www.cbioportal.org/study/summary?id=coadread_tcga_pub. TCGA datasets were downloaded on 18/4/2017 from http://compbio.cs.toronto.edu/SNF/SNF/Software.html.
Abbreviations
BIC: Breast cancer
COAD: Colon cancer
GBM: Glio cancer
HSIC: Hilbert Schmidt Independence Criterion
ISCC: Clustering by the common part in our ISC method
ISCS: Clustering by the specific part in our ISC method
KRCCC: Kidney cancer
LSCC: Lung cancer
SV: Single view
TCGA: The cancer genome atlas
References
Tang W, Lu Z, Dhillon I. Clustering with multiple graphs. 2009; 24(4):1016–21. https://doi.org/10.1109/icdm.2009.125.
Chaudhuri K, Kakade S, Livescu K, Sridharan K. Multiview clustering via canonical correlation analysis. In: International Conference on Machine Learning: 2009. p. 129–36. https://doi.org/10.1145/1553374.1553391.
Kumar A, Rai P, Daumé H. Coregularized multiview spectral clustering. In: Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a Meeting Held. Granada: 2012. p. 1413–14. http://papers.nips.cc/paper/4360coregularizedmultiviewspectralclustering.
Wang B, Mezlini A, Demir F, Fiume M, Tu Z, Brudno M, Haibekains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014; 11(3):333.
Blum A, Mitchell T. Combining labeled and unlabeled data with cotraining. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory: 1998. p. 92–100. https://doi.org/10.1145/279943.279962.
Muslea I, Minton S, Knoblock C. Active learning with multiple views. J Artif Intell Res. 2006; 27:203–33.
Wang W, Zhou Z. A new analysis of cotraining. In: Proceedings of the 27th International Conference on Machine Learning (ICML10): 2010. p. 1135–1142.
Bickel S, Scheffer T. Multiview clustering. In: ICDM: 2004. p. 19–26. https://doi.org/10.1109/icdm.2004.10095.
Kumar A, Daumé III H. A cotraining approach for multiview spectral clustering. In: Proceedings of the 28th International Conference on Machine Learning, ICML. Bellevue: 2011. p. 393–400. https://icml.cc/2011/papers/272icmlpaper.pdf.
Xia R, Pan Y, Du L, Yin J. Robust multiview spectral clustering via lowrank and sparse decomposition. In: Proceedings of the TwentyEighth AAAI Conference on Artificial Intelligence. Québec: 2014. p. 2149–55.
Tang J, Hu X, Gao H, Liu H. Unsupervised feature selection for multiview data in social media. In: SDM: 2013. p. 270–8. https://doi.org/10.1137/1.9781611972832.30.
Wang H, Nie F, Huang H. Multiview clustering via joint nonnegative matrix factorization. In: Proceedings of the 13th SIAM International Conference on Data Mining. Austin: 2013. p. 352–60. http://proceedings.mlr.press/v28/wang13c.html.
Gao J, Han J, Liu J, Wang C. Multiview clustering via joint nonnegative matrix factorization. In: Proceedings of the 13th SIAM International Conference on Data Mining. Austin: 2013. p. 252–60. https://doi.org/10.1137/1.9781611972832.28.
Qianqian S, Chuanchao Z, Minrui P, Xiangtian Y, Tao Z, Juan L, Luonan C. Pattern fusion analysis by adaptive alignment of multiple heterogeneous omics data. Bioinformatics. 2017. https://doi.org/10.1093/bioinformatics/btx176.
Lanckriet G, Cristianini N, Bartlett P, El G, Jordan M. Learning the kernel matrix with semidefinite programming. J Mach Learn Res. 2002; 5(1):27–72.
Yu S, Tranchevent L, Liu X, Glanzel W. Optimized data fusion for kernel kmeans clustering. Pattern Anal Mach Intell IEEE Trans. 2011; 34(5):1031–9.
Lange T, Buhmann J. Fusion of similarity data in clustering. In: Advances in Neural Information Processing Systems 18. Vancouver: NIPS: 2005. p. 723–30. http://papers.nips.cc/paper/2880fusionofsimilaritydatainclustering.
Chuang Y. Affinity aggregation for spectral clustering. IEEE Conf Comput Vis Pattern Recogn. 2012; 23(10):773–80.
Gönen M, Margolin A. Localized data fusion for kernel kmeans clustering with application to cancer biology. Adv Neural Inf Process Syst. 2014; 2:1305–13.
Bach F, Lanckriet G, Jordan M. Multiple kernel learning, conic duality, and the smo algorithm. In: International Conference: 2004. p. 6. https://doi.org/10.1145/1015330.1015424.
Nigro JM, Misra A, Zhang L, Smirnov I, Colman H, Griffin C, Ozburn N, Chen M, Pan E, Koul D, Yung WKA, Feuerstein BG, Aldape KD. Integrated arraycomparative genomic hybridization and expression array profiles identify clinically relevant molecular subtypes of glioblastoma. Cancer Res. 2005; 65(5):1678–86.
Verhaak RGW, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer Cell. 2010; 17(1):98–110.
Cai M, Li L. Subtype identification from heterogeneous tcga datasets on a genomic scale by multiview clustering with enhanced consensus. BMC Med Genomics. 2017; 4:75. https://doi.org/10.1186/s129200170306x.
Gretton A, Bousquet O, Smola A J, Schölkopf B. Measuring statistical dependence with hilbertschmidt norms. In: ALT: 2005. p. 63–77. https://doi.org/10.1007/11564089_7.
Bartels RH, Stewart GW. Solution of the matrix equation ax+xb=c [f4] (algorithm 432). Commun Acm. 1972; 15(9):820–6.
Wen, Zaiwen. A feasible method for optimization with orthogonality constraints. Math Program. 2013; 142(12):397–434.
Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1999; 20(20):53–65.
Network CGA, et al. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012; 487(7407):330.
Network TCGA. The cancer genome atlas. 2006. http://cancergenome.nih.gov/. Accessed 20 Jun 2019.
Parker JS, Mullins M, Cheang MC, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009; 27(8):1160–7.
Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012; 486(7403):346–52.
Pires MM, Hopkins BD, Saal LH, Parsons RE. Alterations of egfr, p53 and pten that mimic changes found in basallike breast cancer promote transformation of human mammary epithelial cells. Cancer Biol Ther. 2013; 14(3):246–53.
Holland DG, Burleigh A, Git A, Goldgraben MA, Perez-Mancera PA, Chin SF, Hurtado A, Bruna A, Ali HR, Greenwood W. Znf703 is a common luminal b breast cancer oncogene that differentially regulates luminal and basal progenitors in human mammary epithelium. Embo Mol Med. 2015; 3(3):167–80.
Bentires-Alj M, Gil SG, Chan R, Wang ZC, Wang Y, Imanaka N, Harris LN, Richardson A, Neel BG, Gu H. A role for the scaffolding adapter gab2 in breast cancer. Nat Med. 2006; 12(1):114.
Lee-Hoeflich S, Pham T, Dowbenko D, Munroe X, Lee J, Li L, Zhou W, Haverty P, Pujara K, Stinson J. Ppm1h is a p27 phosphatase implicated in trastuzumab resistance. Cancer Discov. 2011; 1(4):326–37.
Fisher R, Vol. 118. Statistical Methods for Research Workers; 1954, pp. 66–70. https://doi.org/10.2307/2528855.
Acknowledgements
Supported by the NSFC projects 11631012, Shanghai Municipal Science and Technology Major Project (No.2018SHZDZX01), LCNBI and ZJLab, and the Fundamental Research Funds for the Central Universities.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 12 Supplement 9, 2019: Proceedings of the Joint International GIW & ABACBS2019 Conference: medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume12supplement9.
Funding
The publication charges for this article were funded by the Fundamental Research Funds for the Central Universities.
Author information
Contributions
YG designed the optimization algorithms and conducted the experiments. HL conducted the survival analysis. LL and MC designed the model and the experiments, and wrote the manuscript. All authors revised and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Guo, Y., Li, H., Cai, M. et al. Integrative subspace clustering by common and specific decomposition for applications on cancer subtype identification. BMC Med Genomics 12 (Suppl 9), 191 (2019). https://doi.org/10.1186/s1292001906331
DOI: https://doi.org/10.1186/s1292001906331
Keywords
 Subtype identification
 Multiview clustering
 Subspace clustering