Pre-processing of CTCs and DEA
After the quality control and pre-processing step for GSE86978, 74 out of 77 cells were included in the downstream analyses. The excluded cells revealed low quality, therefore, they were filtered out. The differential expression analysis was implemented after the normalization step (FDR < 0.05) [18]. The adjusted p-values and logarithm of fold changes were reported in Additional files 1 and 2.
We discovered the differential expression between CTC clusters and single CTCs for two identified subnetworks (EMT and Immune). As shown in Fig. 1, the immune subnetwork genes were downregulated in CTC clusters (light purple color in the heatmap), and the genes of the EMT subnetworks were upregulated (dark purple color in the heatmap) in CTC clusters in comparison with single CTCs. Figure 1 was extracted from the GSE86978 dataset. Moreover, using identified subnetworks, the single and cluster CTCs were grouped well in both subnetworks (Fig. 1a, b). The immune-related subnetwork represented a stronger expression difference between clustered and single cells. Figure 1 was extracted from GSE86978.
Metastasis associated subnetworks
Metastasis-associated subnetworks were determined, using co-expression analysis and hierarchical clustering [16]. We detected 16 subnetworks. The first principle component (in PCA analysis) of subnetworks and the trait (cluster CTCs vs. single CTCs) association was assessed by correlation analysis. The two top significant subnetworks, which had the highest correlation with the trait, were nominated for enrichment analysis (\(\left|\mathrm{midnightblue correlation}\right|\)=0.57, \(\left|\mathrm{turquoise correlation}\right|\)= 0.51) (Additional file 3). The sizes of midnightblue and turquoise subnetworks were 35 and 22 genes, and they were enriched for immune responses and EMT pathways, respectively (Fig. 2).
To have a biological concept for subnetworks, we addressed the midnightblue and the turquoise subnetworks, the Immune and EMT subnetworks, respectively, The EMT subnetwork included cancer-related pathways such as ‘cell–cell communication’, ‘tight junction’, ‘keratinization’, and ‘estrogen signaling pathway’. The Immune subnetwork included pathways such as ‘platelet activation’, ‘immune system’, and ‘innate immune system’. After having reviewed the literature, we detected the novel metastatic biomarkers in breast cancer. The immune-related novel metastatic biomarkers in breast cancer were PTCRA, F13A1, LAT, GNG11, ICAM2, NRGN, P2RX1, CLEC1B, BIN2, LPAR5, CCL5, SELP, RUFY1, C6ORF25, TUBB1, GFI1B, C2ORF88, ACRBP, and C17ORF72. Module membership and gene significance of the Immune subnetwork were reported in Additional file 1.
The EMT-related genes were LRPPRC, AGR2, CLDN4, CRIP1, DSP, ELF3, JUP, KRT8, KRT18, KRT19, FAM102A, TACSTD2, EPCAM, PEBP1, PSMD8, RAN, SNRPC, SPTAN1, EZR, DDR1, MLPH, and WDR34. In which, gene SNRPC, upregulated in CTC clusters, is a metastatic novel biomarker in breast cancer. Module membership and gene significance for the EMT subnetwork were summarized in Additional file 2.
The preservation of all subnetworks was assessed in the external dataset (GSE51827). The two combined statistics \({Z}_{summary }\) and \({Median}_{rank}\) were calculated to assess subnetworks preservation in the second dataset (Immune subnetwork: \({Z}_{summary }=14\), \({Median}_{rank}=9\) and EMT subnetwork: \({Z}_{summary }=31\),\({Median}_{rank}=6\)). The \({Z}_{summary}\) values for both subnetworks were more than 10, and \({Median}_{rank}\) values were low; Subsequently, these statistics indicated the Immune and EMT subnetwork preservations in the second dataset. The statistics for all subnetworks were reported in Additional file 4).
Detection of crosstalk among pathways
The signaling crosstalk between two selected subnetworks as well as intermediate pathways between Immune and EMT was investigated by mapping them to the KEGG pathways and extracting induced subnetwork. A directed subnetwork of size 255 genes was extracted and illustrated in Fig. 3. The network density was 0.5 and it included 10 components, in which, PLCG1 showed the highest betweenness value; and MYC, MYLK, and MRAS showed the highest closeness in the subnetwork.
To have a better illustration of the interplay among pathways, we categorized genes by colors based on ClueGO results. We detected 12 gene categories based on biological processes, including’Hormonal regulation’,’Immune responses’,’Ion metabolism’,’Nucleobase metabolism’,’Oxidative responses’,’Protein localization’,’Protein topology response’,’STAT singling pathway’,’Vitamin metabolism’,’Cell differentiation’,’Circulation in blood regulation’, and’Energy metabolism’ (Fig. 3). These categories were illustrated by colors on the subnetwork nodes, and the genes with no category remained grey. The nodes with multiple colors indicated different biological processes. The node size was illustrated by node degrees (Fig. 3). Genes PLCG1 and ENTPD8, participated in ‘Energy Metabolism’ and ‘Nucleobase Metabolism’, were two hub nodes in our detected directed subnetwork. The ClueGO results were reported in Additional file 5.
Distant metastasis classification model
The distant metastasis potential of two nominated subnetworks was assessed using SVM, neural network, and decision tree classification methods in GSE7390. The accuracy, sensitivity, and specificity scores of the SVM model for the EMT subnetwork were 79%, 78%, and 21%, respectively. The neural network accuracy, sensitivity, and specificity scores were 18%, 18%, and 80%, respectively. Eventually, the decision tree accuracy, sensitivity, and specificity scores were 60%, 60%, and 30%, respectively. These results refer to a full model (all genes in the subnetwork included). Comparing three models, the SVM model was the strongest method in classifying metastatic and non-metastatic patients, but the specificity score was too low.
The SVM accuracy, sensitivity, and specificity scores for immune-related subnetwork were 78%, 78%, and 78%, respectively. The neural network accuracy, sensitivity, and specificity scores were 85%, 85%, and 14%, respectively. Eventually, the decision tree accuracy, sensitivity, and specificity scores were 71%, 71%, and 36%, respectively. These results refer to a full model (all genes of subnetwork included). The specificity of the neural network and decision tree methods was low compared to the SVM model. Due to the results, the SVM model was the most powerful method in classifying metastatic and non-metastatic patients for the immune-related subnetwork. The SVM model accuracy, sensitivity, and specificity for the Immune subnetwork were superior to the EMT subnetwork.
The feature selection algorithms, for the SVM model, were implemented for both subnetworks. The WCC introduced 13 and 15 genes, and GA introduced 12, 17 genes for EMT- and immune-related subnetworks, respectively. The WCC introduced HLA-E, MYLK, WIPF1, TLN1, F13A1, NRGN, ICAM2, PTGS1, SELP, PF4, ITGA2B, GFI1B, TUBB1, PTCRA, RUFY1, BIN2, and CLEC1B and the GA introduced CCL5, MYLK, WIPF1, TLN1, NRGN, GNG11, PTGS1, SELP, ITGA2B, MAX, GFI1B, P2RX1, PTCRA, RUFY1, and BIN2 for the immune subnetwork.
The SVM model, full model, for immune subnetwork validated in GSE9195. The accuracy, sensitivity, and specificity were 0.868; surprisingly, the validation scores were superior to GSE7390. The results confirmed that the immune-related genes detected in this study can classify metastatic and non-metastatic samples more precisely compared to the neural network and decision tree models, using two data sets. We implemented the classification methods to assess the metastasis potential of two nominated CTC-related subnetworks.
Distant metastasis-free-survival and overall survival analyses
The association between gene expression and distant metastasis-free survival /overall survival was implemented to detect metastatic potential genes in two selected subnetworks. Overall survival and distant metastasis-free survival of JUP, KRT18, and KRT19 were significant (Log-rank test p-value < 0.05; the exact p-values were reported in Figures) (Fig. 4a–f). These three genes belonged to the EMT subnetwork. The upregulation of JUP, KRT18, and KRT19 was associated with more metastases; Therefore, the lower overall survival of patients (Fig. 4a–f). Moreover, JUP, KRT18, and KRT19 were upregulated in CTC clusters (Fig. 1a). The lower distant metastasis-free survival and lower overall survival curves confirmed the importance of selected genes in metastasis. Therefore, they are important gene signatures in CTCs.
We fitted a metastasis-free-survival Cox-PH regression model for EMT and Immune subnetworks to assess patients’ metastasis risks through a predictive model. The Immune Cox-PH model included RUFY1 and P2RX1 variables (Likelihood ratio test p-value = 0.0295); and the EMT Cox-PH model included RAN, PEBP1, KRT8, DSP, DDR1, and CLDN4 variables (Likelihood ratio test p-value = 0.0001016). The coefficients of variables and p-values were reported in Additional file 6 and Additional file 7. All the significant genes in the model had VIF < 2 to avoid multicollinearity problems, using the VIF-feature selection method (Additional file 8 and Additional file 9). The proportional hazard assumptions for two model variables were assessed by the Schoenfeld residuals (Additional file 10 and Additional file 11). The predictive Cox-PH models for distant metastasis-free survival for two subnetworks were illustrated in Fig. 4g, h. The concordance index, a performance evaluation measure, for EMT and Immune predictive Cox-PH models were 0.7 and 0.6, respectively. Therefore, the EMT model is more powerful in discriminating patients into low, medium, and high metastasis risk groups compared to the Immune model.
Of note, the concordance index is a generalization of the area under the ROC curve (AUC) that is modified for survival analysis. Concordance index can take into account censored data. Therefore, to evaluate more than one model, we use the concordance index that evaluates the candidate models' performance. The higher concordance index indicates more power in discrimination. The supplementary material was provided in Additional file 12.