Overview of the conCluster model
To identify subtypes from a collection of cancer cells, we developed conCluster, which ensembles multiple clustering results. Let E denote an N×G single-cell gene expression matrix, in which rows correspond to cells and columns correspond to genes. Each element Eij is the expression of gene j in the ith cell. conCluster takes the expression matrix E as input and, through four steps, partitions the N cells into K clusters, represented as C={Ck|k=1,2,⋯,K}. Figure 1 shows an overview of the proposed conCluster model. In the following, we elaborate on each step in detail.
Step 1 Filter genes
To focus on the intrinsic transcriptomic signatures of these tumor cells, we filtered out rare and ubiquitous genes and then identified the most variable genes across the single-cell dataset. First, because rare and ubiquitous genes are usually not informative for clustering, we filtered out genes that are expressed in fewer than r% of cells (rare genes) or in at least (100-r)% of cells (ubiquitous genes). As in the previous study [22], r is set to 6. Next, we identified the top v% most variable genes across the single cells, controlling for the relationship between mean expression and variability.
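The two filters can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used here: a gene is treated as "expressed" in a cell when its value is greater than zero, the mean-variability relationship is controlled by z-scoring dispersion (variance/mean) within bins of genes with similar mean expression, and the function name `filter_genes`, the default `v=20`, and the parameter `n_bins` are hypothetical.

```python
import numpy as np


def filter_genes(E, r=6.0, v=20.0, n_bins=10):
    """Return sorted column indices of the genes kept after both filters.

    E is an (N cells x G genes) matrix of non-negative expression values.
    """
    E = np.asarray(E, dtype=float)

    # Assumption: a gene counts as "expressed" in a cell when its value is > 0.
    pct_expressed = 100.0 * (E > 0).mean(axis=0)

    # Drop rare genes (< r% of cells) and ubiquitous genes (>= (100 - r)%).
    kept = np.flatnonzero((pct_expressed >= r) & (pct_expressed < 100.0 - r))
    if kept.size == 0:
        return kept

    sub = E[:, kept]
    mean = sub.mean(axis=0)
    dispersion = sub.var(axis=0) / np.maximum(mean, 1e-12)

    # Assumption: control for the mean-variability relationship by z-scoring
    # dispersion within bins of genes that have similar mean expression.
    z = np.zeros(kept.size)
    for members in np.array_split(np.argsort(mean), n_bins):
        if members.size == 0:
            continue
        sd = dispersion[members].std()
        if sd > 0:
            z[members] = (dispersion[members] - dispersion[members].mean()) / sd

    # Keep the top v% of the surviving genes by normalized dispersion.
    n_top = max(1, int(np.ceil(v * kept.size / 100.0)))
    top = np.argsort(z)[::-1][:n_top]
    return np.sort(kept[top])
```

The filtered matrix is then `E[:, filter_genes(E)]`; with r=6, a gene survives the first filter only if it is expressed in at least 6% and fewer than 94% of cells.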
Step 2 Reduce dimension using t-SNE
To further reduce the dimensionality, we adopted the widely used t-SNE to project the high-dimensional data into a lower-dimensional subspace. Perplexity is an important parameter of t-SNE and can be interpreted as a smooth measure of the effective number of neighbors. Previous studies indicate that the performance of t-SNE is fairly robust to changes in perplexity between 5 and 50. Here, we set the perplexity to 30 and used t-SNE to reduce the filtered scRNA-seq expression data to two dimensions.
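As a rough sketch of this step, the projection could be run with scikit-learn's `TSNE`. The choice of scikit-learn, the helper name `reduce_with_tsne`, and the fixed random seed are assumptions made for illustration; only the perplexity of 30 and the two output dimensions come from the text.

```python
import numpy as np
from sklearn.manifold import TSNE


def reduce_with_tsne(E_filtered, perplexity=30.0, seed=0):
    """Project an (N cells x G' genes) matrix to an (N x 2) embedding."""
    E_filtered = np.asarray(E_filtered, dtype=float)

    # t-SNE needs more cells than the perplexity to define neighborhoods.
    if perplexity >= E_filtered.shape[0]:
        raise ValueError("perplexity must be smaller than the number of cells")

    # Assumption: a fixed seed is used only to make the sketch reproducible.
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(E_filtered)
```

Because t-SNE is stochastic, different seeds give different embeddings, so the seed should be recorded if the embedding needs to be reproduced.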
Step 3 Partition cells in multiple ways
Based on the transformed two-dimensional data matrix, we performed K-means clustering T times with different initial parameters to obtain different basic partitions of the single cells. Other basic clustering methods can also be used in this step. For each individual clustering result, we derived a binary matrix BN×Kt from the corresponding cluster labels of the N cells, where Kt (t=1,2,⋯,T) is the number of clusters in the tth basic partition. Each row of BN×Kt contains exactly one element equal to 1, in the column of the cluster to which the cell is assigned; all other elements are 0.
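The construction of the binary matrices can be sketched as below, assuming scikit-learn's `KMeans` as the basic clustering method. Which "initial parameters" are varied is not specified above; this sketch assumes that both the cluster number Kt and the random initialization change between runs. The names `one_hot` and `basic_partitions` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def one_hot(labels, k):
    """Encode N cluster labels in {0, ..., k-1} as an (N x k) binary matrix."""
    labels = np.asarray(labels)
    B = np.zeros((labels.size, k), dtype=int)
    B[np.arange(labels.size), labels] = 1  # exactly one 1 per row
    return B


def basic_partitions(Y, ks, seed=0):
    """Run K-means once per entry of ks and return the T binary matrices."""
    mats = []
    for t, k in enumerate(ks):
        # Assumption: a single random initialization with a different seed
        # per run, so that the T basic partitions differ from one another.
        km = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed + t)
        mats.append(one_hot(km.fit_predict(Y), k))
    return mats
```

For example, `basic_partitions(Y, ks=[3, 4, 5, 6])` returns T=4 matrices with 3, 4, 5, and 6 columns, one row per cell.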
Step 4 Consensus clustering
After obtaining the T different partitions, we concatenated all the binary matrices column-wise into a larger binary matrix B={BN×Kt|t=1,2,⋯,T}, which has N rows and K1+K2+⋯+KT columns. We then performed K-means clustering on the merged binary matrix, using the Calinski-Harabasz index [22] to decide the number of clusters. In this way, the individual clustering results are fused into a consensus partition [23].
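A minimal sketch of the consensus step is given below, again assuming scikit-learn. The candidate range for the number of clusters, the helper name `consensus_clustering`, and the rule of keeping the candidate with the highest index value are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score


def consensus_clustering(binary_mats, k_candidates=range(2, 11), seed=0):
    """Cluster the merged binary matrix and return (labels, chosen K)."""
    # Column-wise concatenation: N rows, K_1 + ... + K_T columns.
    B = np.hstack(binary_mats).astype(float)

    best = None
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(B)
        if np.unique(labels).size < 2:
            continue  # the index is undefined for a single cluster
        score = calinski_harabasz_score(B, labels)
        # Assumption: the K with the largest index value is selected.
        if best is None or score > best[0]:
            best = (score, k, labels)

    if best is None:
        raise ValueError("no candidate K produced at least two clusters")
    return best[2], best[1]
```

Two cells that are repeatedly assigned to the same basic cluster have similar rows in the merged matrix, so K-means on that matrix tends to group cells that co-cluster across the basic partitions.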
Evaluation Metrics
When cell labels are available in the dataset, we adopted the adjusted Rand index (ARI) to measure the accuracy of clustering [24]. For a set of N cells and two different partitions of these cells, the overlap between the two partitions can be summarized in a contingency table, in which each entry denotes the number of cells shared by a cluster of one partition and a cluster of the other. The ARI is then calculated as:
$$ ARI=\frac{\sum\limits_{ij}\binom{n_{ij}}{2}-\left[\sum\limits_{i}\binom{a_{i}}{2}\sum\limits_{j}\binom{b_{j}}{2}\right]/\binom{n}{2}}{\left[\sum\limits_{i}\binom{a_{i}}{2}+\sum\limits_{j}\binom{b_{j}}{2}\right]/2-\left[\sum\limits_{i}\binom{a_{i}}{2}\sum\limits_{j}\binom{b_{j}}{2}\right]/\binom{n}{2}} $$
(1)
where $\binom{x}{2}=x(x-1)/2$ denotes a binomial coefficient, nij is the entry in row i and column j of the contingency table, ai is the sum of the ith row of the contingency table, bj is the sum of the jth column of the contingency table, and n is the total number of cells (n=N).
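Equation (1) can be evaluated directly from two label vectors, as in the following sketch. The function name `adjusted_rand_index` is hypothetical, and returning 1.0 when the denominator vanishes (both partitions put all cells in one cluster, or both put every cell in its own cluster) is a convention assumed here rather than something stated above.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Compute the ARI of Eq. (1) for two partitions of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same cells")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two cells are required")

    # Contingency table entries n_ij and its row sums a_i and column sums b_j.
    n_ij = Counter(zip(labels_a, labels_b))
    a = Counter(labels_a)
    b = Counter(labels_b)

    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())

    expected = sum_a * sum_b / comb(n, 2)  # chance-level agreement
    max_index = (sum_a + sum_b) / 2        # agreement of identical partitions

    # Assumption: the degenerate case with a zero denominator is scored as 1.
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The index equals 1 when the two partitions agree up to a relabeling of clusters and is close to 0 for partitions that agree only by chance; it can be negative when agreement is below the chance level.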
Datasets
Single-cell expression data from two recent scRNA-seq studies were selected from the data repository NCBI Gene Expression Omnibus (GSE72056 [25], GSE73727 [26]).
Because the original publications provide cell-type annotations, these datasets can be used to validate the clustering results of different methods. In these studies, cell types were determined through a multi-stage process involving additional information such as cell-type molecular signatures. The first dataset is from human melanoma tumors and consists of 4645 single cells isolated from 19 patients; the second dataset is from human pancreatic islets and contains 6 known human islet cell types. To ensure good data quality, samples with a library size of less than 10,000 were excluded. Datasets transformed by logTPM were used as inputs to the different methods.