In this section, we present the proposed integrative subspace clustering method based on multi-view matrix decomposition. We first state the problem, and then propose a subspace learning method via multi-view matrix decomposition. We then introduce the Hilbert-Schmidt Independence Criterion, and finally present our integrative subspace clustering model (ISC) and the corresponding optimization algorithm.

### Problem statement

Suppose we are given *n* samples with *V* views, *X*=[*X*_{1},⋯,*X*_{V}], where \(X_{v} \in R^{p_{v} \times n}, v=1,\cdots,V\). Denote \(X_{v} = [x^{v}_{1},\cdots,x^{v}_{n}]\), where \(x^{v}_{i}\in R^{p_{v}}\). The aim is to cluster the *n* samples with a given cluster number based on the integrative information from the *V* views. In cancer subtype identification, the views can be different data sources, omics types, or platforms.

### Subspace learning for common and specific decomposition

We assume that the samples \(X_{v}\in R^{p_{v}\times n}\) from view *v* lie approximately in a *d*-dimensional subspace \(\Omega _{v}\subset R^{p_{v}}\) (*d*<*p*_{v}), which is spanned by the columns of an orthonormal matrix \(P_{v}\in R^{p_{v}\times d}, P_{v}^{T}P_{v} = I_{d}\). This means that

$$x_{i}^{v} \approx P_{v}z_{i}^{v}, $$

where \(z_{i}^{v}\in R^{d}\) is the new representation of \(x_{i}^{v}\) in this subspace. We assume that the samples *X*_{v} from view *v* have both common and specific clustering structures, which means that \(z_{i}^{v}\) can be further represented as

$$z_{i}^{v} = \left(\begin{array}{c} c_{i} \\ s_{i}^{v} \end{array} \right) $$

where \(c_{i}\in R^{d_{0}}\) is the common representation of *x*_{i} across all views, and \(s_{i}^{v}\in R^{d_{v}}\) is the specific representation of *x*_{i} in the *v*-th view. Note that *d*=*d*_{0}+*d*_{v}. In other words, \(x_{i}^{v}\) can be approximately represented as

$$x_{i}^{v} \approx P_{v}z_{i}^{v} \,=\, P_{v}\left(\begin{array}{c} c_{i} \\ s_{i}^{v} \end{array} \right) = \left(P_{v}^{(c)} \ \ P_{v}^{(s)}\right)\left(\begin{array}{c} c_{i} \\ s_{i}^{v} \end{array} \right) =P_{v}^{(c)}c_{i} +P_{v}^{(s)}s_{i}^{v}, $$

where \(P_{v} = \left (P_{v}^{(c)} \ \ P_{v}^{(s)}\right)\), \((P_{v}^{(c)})^{T}P_{v}^{(c)}=I_{d_{0}}\), and \(\left (P_{v}^{(s)}\right)^{T}P_{v}^{(s)}=I_{d_{v}}\). This means that the *d*-dimensional subspace *Ω*_{v} spanned by *P*_{v} is further decomposed into two mutually orthogonal subspaces \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\), spanned by the orthonormal matrices \(P_{v}^{(c)}\) and \(P_{v}^{(s)}\), respectively; that is, \(\Omega _{v} = \Omega _{v}^{(c)} \oplus \Omega _{v}^{(s)}\). We can rewrite the above equations in matrix form as follows,

$$ \begin{array}{lll} X_{v}& = & P_{v} Z_{v} +E_{v} \\ &= & \left(P_{v}^{(c)} \ \ P_{v}^{(s)}\right) \left(\begin{array}{c} C \\ S_{v} \end{array} \right) + E_{v}\\ & = & P_{v}^{(c)}C+P_{v}^{(s)}S_{v} +E_{v}, \qquad v=1,\cdots,V, \\ \end{array} $$

(1)

where \(Z_{v}=\left [z_{1}^{v},\cdots,z_{n}^{v}\right ], C=\left [c_{1},\cdots,c_{n}\right ], S_{v}=\left [s_{1}^{v},\cdots,s_{n}^{v}\right ]\), and *E*_{v} is the error matrix for view *v*.

We illustrate the decomposition idea in Fig. 1. We attempt to find two orthogonal subspaces \(\Omega _{v}^{(c)}\) and \(\Omega _{v}^{(s)}\) for each view *v*, such that *X*_{v} can be decomposed into the common part *C* and the specific part *S*_{v} in the subspace \(\Omega _{v} = \Omega _{v}^{(c)} \oplus \Omega _{v}^{(s)}\). Ideally, the common clustering structure is captured by *C*, and the specific clustering structure for view *v* is captured by *S*_{v}.
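
To make the decomposition in Eq. (1) concrete, the following minimal Python sketch simulates two views that follow the assumed model; the Gaussian choices, dimensions, and variable names here are illustrative assumptions only, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0 = 100, 3                # samples, common subspace dimension d_0
p, dv = [50, 40], [2, 4]      # per-view feature and specific dimensions

C = rng.standard_normal((d0, n))                 # common representation
X = []
for p_v, d_v in zip(p, dv):
    # Orthonormal P_v = (P_v^(c)  P_v^(s)) via a reduced QR factorization
    Pv = np.linalg.qr(rng.standard_normal((p_v, d0 + d_v)))[0]
    Sv = rng.standard_normal((d_v, n))           # view-specific representation
    Ev = 0.01 * rng.standard_normal((p_v, n))    # error matrix E_v
    X.append(Pv[:, :d0] @ C + Pv[:, d0:] @ Sv + Ev)   # Eq. (1)
```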

### Hilbert-Schmidt Independence Criterion (HSIC)

To better decompose each view into a common and a view-specific part, such that each view-specific clustering structure in *S*_{v} is independent of the common part *C* shared across all views, a measure of independence is required. We adopt the Hilbert-Schmidt Independence Criterion (HSIC), a measure of statistical independence [24]. Intuitively, HSIC can be viewed as a squared correlation coefficient between two random variables *c* and *s* computed in feature spaces \(\mathcal {F}\) and \(\mathcal {G}\).

Let *c* and *s* be two random variables from the domains \(\mathcal {C}\) and \(\mathcal {S}\), respectively. Let \(\mathcal {F}\) and \(\mathcal {G}\) be feature spaces on \(\mathcal {C}\) and \(\mathcal {S}\) with associated kernels \( k_{c}: \mathcal {C} \times \mathcal {C} \rightarrow \mathbb {R}\) and \(k_{s}: \mathcal {S} \times \mathcal {S} \rightarrow \mathbb {R}\), respectively. Denote the joint probability distribution of *c* and *s* by *p*_{(c,s)}, and let (*c*,*s*) and (*c*^{′},*s*^{′}) be independent pairs drawn according to *p*_{(c,s)}. Then the Hilbert-Schmidt Independence Criterion can be computed in terms of kernel functions via:

$$\begin{array}{*{20}l} \text{HSIC}(p_{(c,s)},\mathcal{F},\mathcal{G})&=\mathbf{E}_{c,c',s,s'}[k_{c}(c,c')k_{s}(s,s')]\notag\\ &\quad+\mathbf{E}_{c,c'}[k_{c}(c,c')]\mathbf{E}_{s,s'}[k_{s}(s,s')] \notag\\ &\quad-2\mathbf{E}_{c,s}[\mathbf{E}_{c'}[k_{c}(c,c')]\mathbf{E}_{s'}[k_{s}(s,s')]], \end{array} $$

where **E** is the expectation operator.

The empirical estimator of HSIC, given finite samples *C* and *S* of *c* and *s* drawn from *p*_{(c,s)}, was shown in [24] to satisfy

$$\begin{array}{*{20}l} \text{HSIC}((C,S),\mathcal{F},\mathcal{G})\propto tr({K_{c}HK_{s}H}), \end{array} $$

(2)

where *tr* is the matrix trace operator, *H* is the centering matrix \(H = I_{n}-\frac {ee^{T}}{n}\) (*e* is the all-ones column vector of appropriate dimension), and *K*_{c} and *K*_{s}∈*R*^{n×n} are kernel matrices. The smaller the HSIC value, the more likely *C* and *S* are to be independent of each other.
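
As a minimal sketch, the empirical estimator in Eq. (2) can be computed directly from two kernel matrices; the function name is ours, and the linear kernels shown in the usage comment match the choice made later in the ISC model.

```python
import numpy as np

def empirical_hsic(Kc, Ks):
    """Empirical HSIC up to a constant factor, Eq. (2): tr(Kc H Ks H)."""
    n = Kc.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H = I - ee^T / n
    return np.trace(Kc @ H @ Ks @ H)

# With the linear kernels used in the ISC model: Kc = C^T C, Ks = S_v^T S_v
# hsic = empirical_hsic(C.T @ C, Sv.T @ Sv)
```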

### Integrative subspace clustering (ISC) model

Based on the above considerations, we propose our integrative subspace clustering model as follows,

$$ {\begin{aligned} \begin{array}{l} \min_{\substack{P_{1},\cdots,P_{V}\\ C,S_{1},\cdots,S_{V}}} \sum_{v=1}^{V} \left|\left| X_{v}-P_{v} \left(\begin{array}{c} C \\ S_{v} \end{array} \right) \right|\right|_{F}^{2} +\beta \sum_{v=1}^{V} tr\left(C^{T}CHS_{v}^{T}S_{v}H\right) \\ s.t.~ P_{v}^{T}P_{v}=I, \ v=1,\cdots,V, \end{array} \end{aligned}} $$

(3)

where \(S_{v}^{T}S_{v}\) and *C*^{T}*C* are the linear kernels of *S*_{v} and *C*, respectively, and *β* is a trade-off parameter. Note that the first term is the decomposition term, which seeks the orthogonal subspaces in which the common and view-specific representations lie, and the second term is the independence term, which minimizes the dependence between the common part and each view-specific part. We use the linear kernels of *C* and *S*_{v} to simplify the computation. After *C* and the *S*_{v}'s for all views are obtained, *k*-means clustering is applied to cluster the samples represented by *C* and by each *S*_{v}, respectively. The clustering results using the common part *C* and the specific parts *S*_{v} are denoted ISC-C, ISC-S1, ISC-S2, ⋯, respectively.
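
For reference, a hedged sketch of evaluating the objective in (3) and of the subsequent *k*-means step is given below; the function `isc_objective`, the list-of-arrays data layout, and the scikit-learn call are our assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def isc_objective(X, P, C, S, beta):
    """Value of the objective in Eq. (3); P[v] = (P_v^(c)  P_v^(s))."""
    n = C.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    val = 0.0
    for Xv, Pv, Sv in zip(X, P, S):
        Zv = np.vstack([C, Sv])                  # stacked representation Z_v
        val += np.linalg.norm(Xv - Pv @ Zv, 'fro') ** 2
        val += beta * np.trace(C.T @ C @ H @ Sv.T @ Sv @ H)
    return val

# ISC-C: cluster the samples (columns of C) with a given cluster number k
# labels = KMeans(n_clusters=k, n_init=10).fit_predict(C.T)
```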

Based on the resulting *C* and *S*_{i}'s, we define a consensus score (C-score), similar to [23], as follows:

$$ \text{C-score}_{i}=\frac{tr\left(HX_{i}^{T}X_{i}HC^{T}C\right)}{tr\left(HX_{i}^{T}X_{i}H\left(C^{T}C+S_{i}^{T}S_{i}\right)\right)}. $$

(4)

The C-score measures the weight of the consensus part in the *i*-th view. Note that the C-score ranges from 0 to 1, and a higher C-score implies that the corresponding view contains stronger consensus information.
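
A direct sketch of Eq. (4), under the same hypothetical data layout as the sketches above, is:

```python
import numpy as np

def c_score(Xi, C, Si):
    """Consensus score of the i-th view, Eq. (4)."""
    n = Xi.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    G = H @ Xi.T @ Xi @ H                        # centered Gram matrix of X_i
    num = np.trace(G @ C.T @ C)
    den = np.trace(G @ (C.T @ C + Si.T @ Si))
    return num / den
```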

### Optimization algorithm

We propose an alternating updating approach to solve the optimization problem (3).

Step 1. We first fix *P*_{v} and *C* in (3), and solve for the optimal *S*_{1},⋯,*S*_{V} one by one. The *v*-th optimization subproblem can be written as:

$$ \min_{S_{v}} \left|\left| X_{v}-P_{v} \left(\begin{array}{c} C \\ S_{v} \end{array} \right) \right|\right|_{F}^{2}+\beta tr(C^{T}CHS_{v}^{T}S_{v}H). $$

(5)

Since *P*_{v} can be represented as \(P_{v}=(P_{v}^{(c)} \ \ P_{v}^{(s)})\), subproblem (5) for *S*_{v} can be simplified to:

$$ {\begin{aligned} \begin{array}{ll} \min\limits_{S_{v}}& tr\left(-2 X_{v}^{T} P_{v}^{(s)} S_{v} + 2 S_{v}^{T} \left(P_{v}^{(s)}\right)^{T} P_{v}^{(c)} C + S_{v}^{T} \left(P_{v}^{(s)}\right)^{T} P_{v}^{(s)} S_{v}\right) \\ & +\beta tr\left(C^{T}CHS_{v}^{T} S_{v}H\right) \end{array} \end{aligned}} $$

(6)

Setting the derivative of the objective function *f*(*S*_{v}) in (6) with respect to *S*_{v} to zero, we obtain

$$ {\begin{aligned} \frac{\partial f(S_{v})}{\partial S_{v}}&=0 \Rightarrow \left(P_{v}^{(s)}\right)^{T} P_{v}^{(s)}S_{v}+\beta S_{v}H C^{T}C H \\ &= \left(P_{v}^{(s)}\right)^{T}X_{v}-\left(P_{v}^{(s)}\right)^{T} P_{v}^{(c)}C. \end{aligned}} $$

(7)

The matrix equation for *S*_{v} in (7) is a standard Sylvester equation and can be solved efficiently using the method in [25].
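
The paper solves (7) with the method in [25]; as a stand-in, a minimal sketch can call SciPy's general-purpose Sylvester solver, which handles equations of the form *AS*+*SB*=*Q*:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_Sv(Xv, Pc, Ps, C, beta):
    """Step 1: solve the Sylvester equation (7) for S_v."""
    n = Xv.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    A = Ps.T @ Ps                          # equals I_{d_v} by orthonormality
    B = beta * H @ C.T @ C @ H
    Q = Ps.T @ Xv - Ps.T @ Pc @ C
    return solve_sylvester(A, B, Q)        # solves A S_v + S_v B = Q
```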

Step 2. We then fix *C*,*S*_{1},⋯,*S*_{V}, and solve the optimization problem (3) for optimal *P*_{1},⋯,*P*_{V} one by one. The corresponding *v*-th optimization subproblem can be written as:

$$ \min_{P_{v}} \left|\left| X_{v}-P_{v} Z_{v} \right|\right|_{F}^{2} \qquad \quad s.t.\quad P_{v}^{T}P_{v}=I, $$

(8)

where \(Z_{v} = \left (\begin {array}{c} C \\ S_{v} \end {array} \right).\) The optimization problem (8) is a least-squares problem on the Grassmann manifold, and is solved by Algorithm 2 in [26].
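
The paper uses Algorithm 2 in [26] for this step. As a hedged alternative sketch, note that under the constraint \(P_{v}^{T}P_{v}=I\), subproblem (8) also admits the classical closed-form orthogonal Procrustes solution via a thin SVD of \(X_{v}Z_{v}^{T}\); we show that substitute here, not the authors' solver.

```python
import numpy as np

def update_Pv(Xv, C, Sv):
    """Step 2: minimize ||X_v - P_v Z_v||_F^2 subject to P_v^T P_v = I."""
    Zv = np.vstack([C, Sv])
    U, _, Vt = np.linalg.svd(Xv @ Zv.T, full_matrices=False)
    return U @ Vt                          # orthonormal columns by construction
```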

Step 3. We fix *P*_{1},⋯,*P*_{V} and *S*_{1},⋯,*S*_{V}, then solve the optimization problem (3) for *C*. The corresponding subproblem can be written as:

$$ {\begin{aligned} \begin{array}{ll} \min\limits_{C} & \sum_{v=1}^{V} \left[ tr\left(-2 X_{v}^{T} P_{v}^{(c)}C +2 S_{v}^{T} \left(P_{v}^{(s)}\right)^{T} P_{v}^{(c)} C + C^{T} \left(P_{v}^{(c)}\right)^{T} P_{v}^{(c)} C\right) \right. \\ & \left. + \beta tr\left(S_{v}^{T} S_{v} H C^{T} C H \right) \right]. \end{array} \end{aligned}} $$

(9)

Similarly, setting the derivative of the objective function of subproblem (9) with respect to *C* to zero, we obtain

$$ {\begin{aligned} \left(\sum_{v=1}^{V} \left(P_{v}^{(c)}\right)^{T} P_{v}^{(c)}\right) C &+ \beta C \left(\sum_{v=1}^{V} HS_{v}^{T}S_{v}H\right)\\ &=\sum_{v=1}^{V} \left( \left(P_{v}^{(c)}\right)^{T} X_{v}-\left(P_{v}^{(c)}\right)^{T} P_{v}^{(s)} S_{v} \right). \end{aligned}} $$

(10)

The matrix equation for *C* in (10) is also a standard Sylvester equation and the same algorithm for solving (7) can be used.
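
Under the same assumptions as the sketches above, the update for *C* can reuse the Sylvester solver:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_C(X, Pc_list, Ps_list, S, beta):
    """Step 3: solve the Sylvester equation (10) for the common part C."""
    n = X[0].shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    A = sum(Pc.T @ Pc for Pc in Pc_list)             # left coefficient matrix
    B = beta * sum(H @ Sv.T @ Sv @ H for Sv in S)    # right coefficient matrix
    Q = sum(Pc.T @ Xv - Pc.T @ Ps @ Sv
            for Xv, Pc, Ps, Sv in zip(X, Pc_list, Ps_list, S))
    return solve_sylvester(A, B, Q)
```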

The overall algorithm for solving (3) is summarized in the algorithm box ISC. In each iteration, we solve the three subproblems to alternately update *S*_{v}, *P*_{v}, and *C*. Since the objective function of the ISC model in (3) is bounded below by zero and its value decreases after each subproblem is solved, the convergence of the objective values is guaranteed. We also experimentally demonstrate the convergence of the objective values on four text datasets in Fig. 2, which further confirms the analysis above.
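
Putting the three steps together, a minimal alternating loop under the assumptions of the sketches above (reusing the hypothetical helpers `update_Sv`, `update_Pv`, `update_C`, and `isc_objective`) might look as follows; the random initialization and stopping rule here are our choices, not the authors'.

```python
import numpy as np

def isc(X, d0, dv, beta, n_iter=50, tol=1e-6, seed=0):
    """Alternating updates for the ISC model, Eq. (3)."""
    rng = np.random.default_rng(seed)
    n, V = X[0].shape[1], len(X)
    # Random orthonormal initialization of each P_v via reduced QR
    P = [np.linalg.qr(rng.standard_normal((X[v].shape[0], d0 + dv[v])))[0]
         for v in range(V)]
    C = rng.standard_normal((d0, n))
    S = [rng.standard_normal((dv[v], n)) for v in range(V)]
    prev = np.inf
    for _ in range(n_iter):
        S = [update_Sv(Xv, Pv[:, :d0], Pv[:, d0:], C, beta)   # Step 1
             for Xv, Pv in zip(X, P)]
        P = [update_Pv(Xv, C, Sv) for Xv, Sv in zip(X, S)]    # Step 2
        C = update_C(X, [Pv[:, :d0] for Pv in P],             # Step 3
                     [Pv[:, d0:] for Pv in P], S, beta)
        obj = isc_objective(X, P, C, S, beta)
        if prev - obj < tol:          # objective values decrease monotonically
            break
        prev = obj
    return P, C, S
```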