In this section, we present the proposed integrative subspace clustering method based on multi-view matrix decomposition. We first give a problem statement and then describe a subspace learning method via multi-view matrix decomposition. We next introduce the Hilbert-Schmidt Independence Criterion, and finally propose our integrative subspace clustering model (ISC) and the corresponding optimization algorithm.
Problem statement
Suppose we are given n samples with V views, X=[X1,⋯,XV], where \(X_{v} \in R^{p_{v} \times n},v=1,\cdots,V\). Denote \(X_{v} = [x^{v}_{1},\cdots,x^{v}_{n}]\), where \(x^{v}_{i}\in R^{p_{v}}\). The aim is to cluster the n samples, with a given number of clusters, based on the integrative information from the V views. In cancer subtype identification, the views can be different data sources, omics types or platforms.
Subspace learning for common and specific decomposition
We assume that the samples \(X_{v}\in R^{p_{v}\times n}\) from view v approximately lie in a d-dimensional subspace \(\Omega _{v}\subset R^{p_{v}}\) (d<pv), which is spanned by the columns of an orthonormal matrix \(P_{v}\in R^{p_{v}\times d}, P_{v}^{T}P_{v} = I_{d}\). This means that
$$x_{i}^{v} \approx P_{v}z_{i}^{v}, $$
where \(z_{i}^{v}\in R^{d}\) is the new representation of \(x_{i}^{v}\) in this subspace. We assume that the samples Xv from view v have both common and specific clustering structures, which means that \(z_{i}^{v}\) can be further represented as
$$z_{i}^{v} = \left(\begin{array}{c} c_{i} \\ s_{i}^{v} \end{array} \right) $$
where \(c_{i}\in R^{d_{0}}\) is the common representation of xi across all views, and \(s_{i}^{v}\in R^{d_{v}}\) is the specific representation of xi in the v-th view. Note that d=d0+dv. In other words, \(x_{i}^{v}\) can be approximately represented as
$$x_{i}^{v} \approx P_{v}z_{i}^{v} \,=\, P_{v}\left(\begin{array}{c} c_{i} \\ s_{i}^{v} \end{array} \right) = \left(P_{v}^{(c)} \ \ P_{v}^{(s)}\right)\left(\begin{array}{c} c_{i} \\ s_{i}^{v} \end{array} \right) =P_{v}^{(c)}c_{i} +P_{v}^{(s)}s_{i}^{v}, $$
where \(P_{v} = \left (P_{v}^{(c)} \ \ P_{v}^{(s)}\right), (P_{v}^{(c)})^{T}P_{v}^{(c)}=I_{d_{0}}\) and \(\left (P_{v}^{(s)}\right)^{T}P_{v}^{(s)}=I_{d_{v}}\). This means that the d-dimensional subspace Ωv spanned by Pv is further decomposed into two orthogonal subspaces \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\), spanned by the orthonormal matrices \(P_{v}^{(c)}\) and \(P_{v}^{(s)}\), respectively. In other words, \(\Omega _{v} = \Omega _{v}^{(c)} \oplus \Omega _{v}^{(s)}\), where \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\) are mutually orthogonal. We can rewrite the above equations in matrix form as follows,
$$ \begin{array}{lll} X_{v}& = & P_{v} Z_{v} +E_{v} \\ &= & \left(P_{v}^{(c)} \ \ P_{v}^{(s)}\right) \left(\begin{aligned} C \\ S_{v} \end{aligned} \right) + E_{v}\\ & = & P_{v}^{(c)}C+P_{v}^{(s)}S_{v} +E_{v}, \ \ v=1,\cdots,V \\ \end{array} $$
(1)
where \(Z_{v}=\left [z_{1}^{v},\cdots,z_{n}^{v}\right ], C=\left [c_{1},\cdots,c_{n}\right ], S_{v}=\left [s_{1}^{v},\cdots,s_{n}^{v}\right ]\), and Ev is the error matrix for view v.
We illustrate the decomposition idea in Fig. 1. We attempt to find two orthogonal subspaces \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\) for each view v, such that Xv can be decomposed into a common part C and a specific part Sv in the subspace \(\Omega _{v} = \Omega _{v}^{(c)} \oplus \Omega _{v}^{(s)}\). The common clustering structure is expected to be captured by C, and the view-specific clustering structure for view v by Sv.
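To make model (1) concrete, the following toy sketch (our illustration; the dimensions, noise level, and variable names are arbitrary choices, not values from the paper) generates two views that share a common representation C but carry view-specific parts Sv:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0 = 100, 3                  # number of samples and common dimension d_0
p, dv = [50, 80], [2, 4]        # feature dimensions p_v and specific dimensions d_v of two views

C = rng.standard_normal((d0, n))                # common representation shared by all views
X, P, S = [], [], []
for pv, dsv in zip(p, dv):
    # random orthonormal basis P_v = (P_v^(c)  P_v^(s)), so P_v^T P_v = I
    Pv, _ = np.linalg.qr(rng.standard_normal((pv, d0 + dsv)))
    Sv = rng.standard_normal((dsv, n))          # view-specific representation
    Ev = 0.1 * rng.standard_normal((pv, n))     # error term E_v
    X.append(Pv[:, :d0] @ C + Pv[:, d0:] @ Sv + Ev)   # model (1)
    P.append(Pv); S.append(Sv)
```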
Hilbert-Schmidt Independence Criterion (HSIC)
To better decompose each view into a common part and a view-specific part, such that each view-specific clustering structure in Sv is independent of the common part C shared across all views, a measure of independence is required. We measure independence using the Hilbert-Schmidt Independence Criterion (HSIC), a kernel-based measure of statistical independence [24]. Intuitively, HSIC can be regarded as a squared correlation coefficient between two random variables c and s computed in feature spaces \(\mathcal {F}\) and \(\mathcal {G}\).
Let c and s be two random variables from the domains \(\mathcal {C}\) and \(\mathcal {S}\), respectively. Let \(\mathcal {F}\) and \(\mathcal {G}\) be feature spaces on \(\mathcal {C}\) and \(\mathcal {S}\) with associated kernels \( k_{c}: \mathcal {C} \times \mathcal {C} \rightarrow \mathbb {R}\) and \(k_{s}: \mathcal {S} \times \mathcal {S} \rightarrow \mathbb {R}\), respectively. Denote the joint probability distribution of c and s by p(c,s), and let (c,s) and (c′,s′) be drawn independently from p(c,s). Then the Hilbert-Schmidt Independence Criterion can be computed in terms of kernel functions via:
$$\begin{array}{*{20}l} \text{HSIC}(p_{(c,s)},\mathcal{F},\mathcal{G})&=\mathbf{E}_{c,c',s,s'}[k_{c}(c,c')k_{s}(s,s')]\notag\\ &\quad+\mathbf{E}_{c,c'}[k_{c}(c,c')]\mathbf{E}_{s,s'}[k_{s}(s,s')] \notag\\ &\quad-2\mathbf{E}_{c,s}[\mathbf{E}_{c'}[k_{c}(c,c')]\mathbf{E}_{s'}[k_{s}(s,s')]], \end{array} $$
where E is the expectation operator.
The empirical estimator of HSIC for finite samples C and S drawn from p(c,s) was given in [24] as
$$\begin{array}{*{20}l} \text{HSIC}((C,S),\mathcal{F},\mathcal{G})\propto tr({K_{c}HK_{s}H}), \end{array} $$
(2)
where tr is the matrix trace operator, H is the centering matrix \(H = I_{n}-\frac {ee^{T}}{n}\) (e is the all-ones column vector of appropriate dimension), and Kc and Ks∈Rn×n are kernel matrices. The smaller the HSIC value, the more likely C and S are independent of each other.
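With linear kernels \(K_{c}=C^{T}C\) and \(K_{s}=S^{T}S\), the empirical estimator (2) reduces to a few matrix products. A minimal sketch (toy inputs; the proportionality constant in (2) is dropped) is:

```python
import numpy as np

def hsic_linear(C, S):
    """Empirical HSIC (2) with linear kernels, up to its constant factor."""
    n = C.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - ee^T / n
    Kc, Ks = C.T @ C, S.T @ S                  # linear kernel matrices, n x n
    return np.trace(Kc @ H @ Ks @ H)

rng = np.random.default_rng(1)
C = rng.standard_normal((3, 50))
print(hsic_linear(C, 2.0 * C))                         # large value: S strongly depends on C
print(hsic_linear(C, rng.standard_normal((4, 50))))    # much smaller: nearly independent
```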
Integrative subspace clustering (ISC) model
Based on the above considerations, we propose our integrative subspace clustering model as follows,
$$ {\begin{aligned} \begin{array}{l} \min_{\substack{P_{1},\cdots,P_{V}\\ C,S_{1},\cdots,S_{V}}} \sum_{v=1}^{V} \left|\left| X_{v}-P_{v} \left(\begin{aligned} C \\ S_{v} \end{aligned} \right) \right|\right|_{F}^{2} +\beta \sum_{v=1}^{V} tr\left(C^{T}CHS_{v}^{T}S_{v}H\right) \\ s.t.~ P_{v}^{T}P_{v}=I, \end{array} \end{aligned}} $$
(3)
where \(S_{v}^{T}S_{v}\) and CTC are the linear kernel matrices of Sv and C, respectively, and β is a tuning parameter that balances the two terms. The first (decomposition) term seeks the orthogonal subspaces in which the common and view-specific representations lie, and the second (independence) term minimizes the dependence between the common part and each view-specific part. We use linear kernels for C and Sv to simplify the computation. After C and the Sv's for all views are obtained, k-means clustering is applied to the samples represented by C and by each Sv. The resulting clusterings based on the common part C and the specific parts Sv are denoted ISC-C and ISC-S1, ISC-S2, ⋯, respectively.
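The clustering step itself is plain k-means on the columns of C and of each Sv. A sketch using scikit-learn (one possible implementation choice; the paper only specifies k-means, and the matrices below are toy placeholders) might look like:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 100))                                    # common part (d_0 x n) from ISC
S = [rng.standard_normal((2, 100)), rng.standard_normal((4, 100))]   # specific parts S_1, S_2

k = 4                                                                # given cluster number
isc_c  = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(C.T)    # ISC-C labels
isc_sv = [KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Sv.T)
          for Sv in S]                                               # ISC-S1, ISC-S2 labels
```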
Based on the resulting C and the Si's, we define a consensus score (C-score), similar to [23], as follows:
$$ \text{C-score}_{i}=\frac{tr\left(HX_{i}^{T}X_{i}HC^{T}C\right)}{tr\left(HX_{i}^{T}X_{i}H\left(C^{T}C+S_{i}^{T}S_{i}\right)\right)}. $$
(4)
The C-score measures the weight of the consensus (common) part in the i-th view. It ranges from 0 to 1, and a higher C-score indicates stronger consistent information in the corresponding view.
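The C-score in (4) is a ratio of two traces and can be computed directly; the helper below is our sketch, with the view matrix Xi, the common part C and the specific part Si assumed to be numpy arrays of compatible sizes:

```python
import numpy as np

def c_score(Xi, C, Si):
    """Consensus score (4): weight of the common part in the i-th view (between 0 and 1)."""
    n = Xi.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    G = H @ Xi.T @ Xi @ H                      # centered Gram matrix of view i
    num = np.trace(G @ (C.T @ C))
    den = np.trace(G @ (C.T @ C + Si.T @ Si))
    return num / den
```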
Optimization algorithm
We propose an alternating updating approach to solve the optimization problem (3).
Step 1. We first fix Pv and C in (3), and solve for the optimal S1,⋯,SV one by one. The v-th optimization subproblem can be written as:
$$ \min_{S_{v}} \left|\left| X_{v}-P_{v} \left(\begin{aligned} C \\ S_{v} \end{aligned} \right) \right|\right|_{F}^{2}+\beta tr(C^{T}CHS_{v}^{T}S_{v}H). $$
(5)
Since Pv can be written as \(P_{v}=(P_{v}^{(c)} \ \ P_{v}^{(s)})\), subproblem (5) for Sv can be simplified to:
$$ {\begin{aligned} \begin{array}{ll} \min\limits_{S_{v}}& tr\left(-2 X_{v}^{T} P_{v}^{(s)} S_{v} + 2 S_{v}^{T} \left(P_{v}^{(s)}\right)^{T} P_{v}^{(c)} C + S_{v}^{T} \left(P_{v}^{(s)}\right)^{T} P_{v}^{(s)} S_{v}\right) \\ & +\beta tr\left(C^{T}CHS_{v}^{T} S_{v}H\right) \end{array} \end{aligned}} $$
(6)
By setting the derivative of the objective function f(Sv) in (6) with respect to Sv to zero, we obtain
$$ {\begin{aligned} \frac{\partial f(S_{v})}{\partial S_{v}}&=0 \Rightarrow \left(P_{v}^{(s)}\right)^{T} P_{v}^{(s)}S_{v}+\beta S_{v}H C^{T}C H \\ &= \left(P_{v}^{(s)}\right)^{T}X_{v}-\left(P_{v}^{(s)}\right)^{T} P_{v}^{(c)}C. \end{aligned}} $$
(7)
The matrix equation for Sv in (7) is a standard Sylvester equation and can be solved efficiently using the method in [25].
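Concretely, (7) has the Sylvester form \(AS_{v}+S_{v}B=Q\) with \(A=(P_{v}^{(s)})^{T}P_{v}^{(s)}\), \(B=\beta HC^{T}CH\) and \(Q=(P_{v}^{(s)})^{T}(X_{v}-P_{v}^{(c)}C)\). The sketch below uses scipy's general Sylvester solver as one possible choice; it is not necessarily the method of [25]:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_S(Xv, Pc, Ps, C, beta):
    """Solve the Sylvester equation (7) for the view-specific representation S_v."""
    n = Xv.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    A = Ps.T @ Ps                        # d_v x d_v (equals the identity when P_v is orthonormal)
    B = beta * (H @ C.T @ C @ H)         # n x n
    Q = Ps.T @ (Xv - Pc @ C)             # d_v x n, right-hand side of (7)
    return solve_sylvester(A, B, Q)      # solves A S_v + S_v B = Q
```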
Step 2. We then fix C,S1,⋯,SV, and solve the optimization problem (3) for optimal P1,⋯,PV one by one. The corresponding v-th optimization subproblem can be written as:
$$ \min_{P_{v}} \left|\left| X_{v}-P_{v} Z_{v} \right|\right|_{F}^{2} \qquad \quad s.t.\quad P_{v}^{T}P_{v}=I, $$
(8)
where \(Z_{v} = \left (\begin {aligned} C \\ S_{v} \end {aligned} \right).\) The optimization problem (8) is a least-squares problem on the Grassmann manifold and is solved by Algorithm 2 in [26].
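Under the constraint \(P_{v}^{T}P_{v}=I\), subproblem (8) is an orthogonal Procrustes problem and also admits a closed-form solution via the thin SVD of \(X_{v}Z_{v}^{T}\). The sketch below takes this SVD route, which is a standard alternative and not necessarily Algorithm 2 of [26]:

```python
import numpy as np

def update_P(Xv, Zv):
    """Closed-form minimizer of ||X_v - P Z_v||_F^2 subject to P^T P = I (subproblem (8)),
    obtained from the thin SVD of X_v Z_v^T (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(Xv @ Zv.T, full_matrices=False)
    return U @ Vt                        # p_v x d matrix with orthonormal columns
```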
Step 3. We fix P1,⋯,PV and S1,⋯,SV, then solve the optimization problem (3) for C. The corresponding subproblem can be written as:
$$ {\begin{aligned} \begin{array}{ll} \min\limits_{C} & \sum_{v=1}^{V}\left[ tr\left(-2 X_{v}^{T} P_{v}^{(c)}C +2 S_{v}^{T} \left(P_{v}^{(s)}\right)^{T} P_{v}^{(c)} C + C^{T} \left(P_{v}^{(c)}\right)^{T} P_{v}^{(c)} C\right)\right. \\ & \left. \ + \beta\, tr\left(S_{v}^{T} S_{v} H C^{T} C H \right)\right]. \end{array} \end{aligned}} $$
(9)
Similarly, we set the derivative of the objective function of subproblem (9) with respect to C to zero, and obtain
$$ {\begin{aligned} \left(\sum_{v=1}^{V} \left(P_{v}^{(c)}\right)^{T} P_{v}^{(c)}\right) C &+ \beta C \left(\sum_{v=1}^{V} HS_{v}^{T}S_{v}H\right)\\ &=\sum_{v=1}^{V}{ \left(P_{v}^{(c)}\right)^{T} X_{v}-\left(P_{v}^{(c)}\right)^{T} P_{v}^{(s)} S_{v} }. \end{aligned}} $$
(10)
The matrix equation for C in (10) is also a standard Sylvester equation and can be solved with the same algorithm used for (7).
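Analogously to the Sv update, (10) has the form \(AC+CB=Q\) with \(A=\sum _{v}(P_{v}^{(c)})^{T}P_{v}^{(c)}\), \(B=\beta \sum _{v}HS_{v}^{T}S_{v}H\) and \(Q=\sum _{v}(P_{v}^{(c)})^{T}(X_{v}-P_{v}^{(s)}S_{v})\); a sketch mirroring the earlier one (again using scipy's solver as an assumption, with lists over the V views):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_C(X, Pc, Ps, S, beta):
    """Solve the Sylvester equation (10) for the common representation C."""
    n = X[0].shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    A = sum(P.T @ P for P in Pc)                                    # d_0 x d_0
    B = beta * sum(H @ Sv.T @ Sv @ H for Sv in S)                   # n x n
    Q = sum(P.T @ (Xv - Pspec @ Sv)                                 # d_0 x n
            for P, Pspec, Xv, Sv in zip(Pc, Ps, X, S))
    return solve_sylvester(A, B, Q)                                 # solves A C + C B = Q
```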
The overall algorithm for solving (3) is summarized in the algorithm box ISC. In each iteration, our ISC algorithm solves the three subproblems to alternately update Sv, Pv and C. Since the objective function of the ISC model in (3) is bounded below by zero, and the objective value is non-increasing after each of the three subproblem updates, convergence of the objective values of our algorithm is assured. We also experimentally show the convergence of the objective values on four text datasets in Fig. 2, which further confirms the convergence analysis above.
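In practice, the monotone decrease can be checked by evaluating the objective of (3) after every full round of updates; the helper below is our sketch of such a check (X, Pc, Ps and S are lists over the V views):

```python
import numpy as np

def isc_objective(X, Pc, Ps, C, S, beta):
    """Objective value of (3) for the current iterates."""
    n = X[0].shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    obj = 0.0
    for Xv, Pcv, Psv, Sv in zip(X, Pc, Ps, S):
        obj += np.linalg.norm(Xv - Pcv @ C - Psv @ Sv, 'fro') ** 2      # reconstruction term
        obj += beta * np.trace(C.T @ C @ H @ Sv.T @ Sv @ H)             # independence (HSIC) term
    return obj   # should be non-increasing across the alternating iterations
```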