EARN: an ensemble machine learning algorithm to predict driver genes in metastatic breast cancer

Background Many markers are now available for the prognosis and diagnosis of complex diseases such as primary breast cancer. However, our understanding of the drivers that influence cancer aggression remains limited.
Methods In this work, we study somatic mutation data consisting of 450 metastatic breast tumor samples from the cBio Cancer Genomics Portal. We use four software tools to extract features from these data. We then propose an ensemble classifier (EC) learning algorithm called EARN (Ensemble of Artificial Neural Network, Random Forest, and non-linear Support Vector Machine) to evaluate plausible driver genes for metastatic breast cancer (MBCA). The decision-making strategy of the proposed ensemble is based on aggregating the predicted scores obtained from the individual classifiers in order to prioritize Homo sapiens genes annotated as protein-coding in NCBI.
Results This study focuses on several aspects of MBCA prognosis and diagnosis. First, the drivers and passengers predicted by SVM, ANN, RF, and EARN are introduced. Second, biological inferences from the predictions are discussed based on gene set enrichment analysis. Third, all learning methods are statistically validated and compared using several evaluation metrics. Finally, pathway enrichment analysis (PEA) with the ReactomeFIViz tool (FDR < 0.03) on the top 100 genes predicted by EARN leads us to propose a new gene set panel for MBCA. It includes HDAC3, ABAT, GRIN1, PLCB1, and KPNA2 as well as NCOR1, TBL1XR1, SIRT4, KRAS, CACNA1E, PRKCG, GPS2, SIN3A, ACTB, KDM6B, and PRMT1. Furthermore, we compare the results for MBCA with the outputs for 983 primary tumor samples of breast invasive carcinoma (BRCA) obtained from The Cancer Genome Atlas (TCGA). The comparison shows that ROC-AUC reaches 99.24% using EARN for MBCA and 99.79% for BRCA. In each case, this is better than the three individual classifiers.
Conclusions This research, using an integrative approach, assists precision oncologists in designing compact targeted panels that eliminate the need for whole-genome/exome sequencing. The schematic representation of the proposed model is presented as the Graphic abstract.
Supplementary Information The online version contains supplementary material available at 10.1186/s12920-021-00974-3.


Investigation of the diversity of features extracted from the original mutation file
The Venn diagram (p-value ≤ 0.05) for BRCA shows that gene prioritization is diverse: only ten genes are common to the outputs of the four software tools (Fig. 5a).
This diversity is a good indication that the proposed ensemble model will perform well in the next step, when the algorithms are implemented.
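The overlap behind such a Venn diagram can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the tool names, gene names, and the helper `overlap_summary` are hypothetical placeholders, and each list is assumed to hold the genes one tool reports as significant.

```python
from collections import Counter

def overlap_summary(gene_lists):
    """Return genes reported by every tool and a histogram of tool support.

    gene_lists maps a tool name to the genes it reports as significant.
    """
    counts = Counter()
    for genes in gene_lists.values():
        counts.update(set(genes))  # set() so each tool counts a gene once
    n_tools = len(gene_lists)
    common = sorted(g for g, c in counts.items() if c == n_tools)
    # Number of genes reported by exactly k tools, for k = 1..n_tools
    by_support = dict(Counter(counts.values()))
    return common, by_support

# Hypothetical toy outputs (placeholder tool and gene names)
tool_outputs = {
    "tool_a": ["GENE_1", "GENE_2", "GENE_3", "GENE_4"],
    "tool_b": ["GENE_1", "GENE_2", "GENE_5"],
    "tool_c": ["GENE_1", "GENE_6", "GENE_4"],
    "tool_d": ["GENE_1", "GENE_2", "GENE_7"],
}
common, by_support = overlap_summary(tool_outputs)
print("common to all tools:", common)
print("genes per support level:", by_support)
```

A small set of genes common to all tools, with most genes reported by only one or two tools, corresponds to the diverse prioritization described above.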

Outputs of three individual classifiers and EARN
For BRCA, after applying the non-linear SVM (NLSVM) to 18017 protein-coding genes, 39.11% of the genes are labeled +1 (Fig. 6a). The predictions of ANN and RF are illustrated in the same way in Fig. 6b, c. The outputs of EARN for BRCA show that 7729 of the 18017 genes (42.90%) were identified as drivers and 10288 genes (57.10%) were predicted as passengers (Fig. 6d). EARN therefore predicts fewer driver genes than RF and more than NLSVM and ANN.
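EARN is described as aggregating the predicted scores of the individual classifiers, but the exact aggregation rule is not given in this section. The sketch below therefore rests on stated assumptions: every classifier emits a score in [0, 1] for each gene, the ensemble score is the unweighted mean, and a gene is labeled +1 (driver) when the mean is at least 0.5 and -1 (passenger) otherwise. The function names and toy scores are hypothetical.

```python
def aggregate_scores(scores_by_model, threshold=0.5):
    """Average per-gene scores across models and assign +1 / -1 labels.

    Assumption: unweighted mean of scores in [0, 1]; the source does not
    specify the aggregation rule or the threshold.
    """
    models = list(scores_by_model)
    genes = scores_by_model[models[0]].keys()
    ensemble = {}
    for gene in genes:
        mean = sum(scores_by_model[m][gene] for m in models) / len(models)
        ensemble[gene] = (mean, 1 if mean >= threshold else -1)
    return ensemble

def driver_fraction(ensemble):
    """Fraction of genes labeled +1 by the ensemble."""
    drivers = sum(1 for _, label in ensemble.values() if label == 1)
    return drivers / len(ensemble)

# Hypothetical toy scores for four placeholder genes
scores = {
    "ann":   {"GENE_A": 0.9, "GENE_B": 0.2, "GENE_C": 0.6, "GENE_D": 0.1},
    "rf":    {"GENE_A": 0.8, "GENE_B": 0.4, "GENE_C": 0.7, "GENE_D": 0.3},
    "nlsvm": {"GENE_A": 0.7, "GENE_B": 0.3, "GENE_C": 0.5, "GENE_D": 0.2},
}
ensemble = aggregate_scores(scores)
for gene, (mean, label) in ensemble.items():
    print(f"{gene}: mean score {mean:.2f}, label {label:+d}")
print(f"predicted drivers: {100 * driver_fraction(ensemble):.2f}%")
```

Applied to all protein-coding genes, the last line yields the kind of percentage reported above (for example, 7729 of 18017 genes labeled +1).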

Investigation of top 100 genes predicted by the four machine learning methods
We compare the drivers predicted by the four methods (top 100 genes each) using the GeneVenn diagram tool (Fig. 7a). Ten genes, namely GNL3L, PTEN, SMAD2, CBFB, ERBB3, MARK1, TMEM167A, GRIK2, IL1RAPL, and GRXCR1, were predicted by all four methods. Of these, PTEN, SMAD2, CBFB, and ERBB3 are already reported for different cancers in the OMIM, CGC, and NCG databases (PTEN, CBFB, and ERBB3 are also known for breast cancer). The other comparisons are presented in an additional file. The 14 driver genes predicted uniquely by EARN100 are listed in Table S11.
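The comparison of top-ranked genes can be illustrated as follows. This is a sketch under assumptions: each method is taken to provide one score per gene, the top k genes are those with the highest scores, and the names `top_k`, `compare_top_sets`, and all toy values are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring genes (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return set(ranked[:k])

def compare_top_sets(scores_by_method, k, focus):
    """Genes shared by all top-k sets, and genes only in the focus method's set."""
    tops = {m: top_k(s, k) for m, s in scores_by_method.items()}
    shared = set.intersection(*tops.values())
    others = set().union(*(t for m, t in tops.items() if m != focus))
    unique_to_focus = tops[focus] - others
    return shared, unique_to_focus

# Hypothetical toy scores for five placeholder genes
scores_by_method = {
    "earn":  {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.3, "GENE_D": 0.2, "GENE_E": 0.1},
    "rf":    {"GENE_A": 0.9, "GENE_B": 0.1, "GENE_C": 0.8, "GENE_D": 0.2, "GENE_E": 0.3},
    "ann":   {"GENE_A": 0.7, "GENE_B": 0.2, "GENE_C": 0.1, "GENE_D": 0.6, "GENE_E": 0.3},
    "nlsvm": {"GENE_A": 0.8, "GENE_B": 0.3, "GENE_C": 0.7, "GENE_D": 0.1, "GENE_E": 0.2},
}
shared, unique_to_earn = compare_top_sets(scores_by_method, k=2, focus="earn")
print("predicted by all methods:", sorted(shared))
print("unique to the ensemble:", sorted(unique_to_earn))
```

With k = 100 and real scores, `shared` would correspond to the genes predicted by all four methods and `unique_to_earn` to the genes found only in the EARN top list.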

Biological validation of outputs based on gene set enrichment analysis
For BRCA, the biological inferences about the driver genes predicted by EARN follow the same two approaches as for MBCA.

The biological inferences of all predicted genes with label +1
For the label-based biological analysis, we calculated the extent to which the driver genes predicted by each method were enriched in public databases (OMIM, CGC, and NCG) associated with different cancers and, specifically, with breast cancer. Overall, these databases contain 2443 known and candidate genes involved in different primary cancers. Forty of these genes are in the positive training set. For BRCA, the enrichment rates of the driver genes predicted by EARN for different cancers and for breast cancer, after excluding the 40 positive training genes, are 58.18% and 72.14%, respectively.
We also investigated the original mutation file and found that 75.07% of the genes with more than 10 mutations across the 983 samples were also predicted as drivers by EARN, after excluding the positive training genes from the list of mutated genes.
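The enrichment rate is not defined formally in this section. The sketch below assumes it is the percentage of reference genes (database genes, or frequently mutated genes), after removing the positive training genes, that the classifier also labels as drivers. The function name, gene names, and counts are hypothetical.

```python
def enrichment_rate(predicted_drivers, reference_genes, training_genes=()):
    """Percentage of reference genes, minus training genes, predicted as drivers.

    Assumed definition; the source reports the rates without a formula.
    """
    reference = set(reference_genes) - set(training_genes)
    if not reference:
        raise ValueError("reference set is empty after excluding training genes")
    hits = reference & set(predicted_drivers)
    return 100.0 * len(hits) / len(reference)

# Hypothetical toy inputs (placeholder gene names)
predicted = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
training = {"GENE_T"}

# Database-based enrichment
database = {"GENE_A", "GENE_B", "GENE_E", "GENE_F", "GENE_T"}
db_rate = enrichment_rate(predicted, database, training)

# Mutation-based enrichment: genes with more than 10 mutations across samples
mutation_counts = {"GENE_A": 12, "GENE_C": 30, "GENE_E": 11, "GENE_G": 3, "GENE_T": 50}
frequent = {g for g, n in mutation_counts.items() if n > 10}
mut_rate = enrichment_rate(predicted, frequent, training)

print(f"database enrichment: {db_rate:.2f}%")
print(f"mutation-based enrichment: {mut_rate:.2f}%")
```

Excluding the training genes before counting keeps the rate from being inflated by genes the classifiers have already seen as positive examples.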

The biological inferences of predicted top genes
The enrichment of the top 50 genes in the three databases shows that the enrichment score for BRCA is 52% using EARN (26/50), compared with 24%, 42%, and 48% (average 38%) for RF, ANN, and NLSVM, respectively (Table S12).

Statistical validation of three individual classifiers and EARN based on evaluation measures
For statistical evaluation of the methods, 3-fold cross-validation was performed and repeated 100 times to calculate the metrics on test data. For BRCA, the comparison shows that the false-positive rate (FPR), precision (PPV), and average precision of EARN and ANN are better than those of the other methods. The recall (sensitivity) of EARN and RF is higher than that of ANN and NLSVM. EARN achieves slightly higher accuracy (99.77%) than the other methods and also outperforms them in F1 score (F-measure) and ROC-AUC.
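The evaluation protocol and most of the metrics named above can be sketched with the standard library alone. This is an illustration under assumptions, not the code used in the study: labels are taken to be +1 / -1, ROC-AUC is computed as the probability that a random positive outscores a random negative, average precision is omitted, and all function names and toy values are hypothetical.

```python
import random

def classification_metrics(y_true, y_pred):
    """Accuracy, precision (PPV), recall, FPR, and F1 for +1 / -1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t != 1 and p == 1)
    tn = sum(1 for t, p in pairs if t != 1 and p != 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "fpr": fpr, "f1": f1}

def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_kfold(n_samples, k=3, repeats=100, seed=0):
    """Yield (train, test) index lists for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # new random partition on every repeat
        for fold in range(k):
            test = idx[fold::k]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

# Hypothetical toy labels and scores
y_true = [1, 1, 1, -1, -1, -1]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else -1 for s in scores]
results = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
splits = list(repeated_kfold(n_samples=9, k=3, repeats=100))
print(results, auc, len(splits))
```

In a full evaluation, each classifier would be trained on every `train` split and scored on the matching `test` split, and the metrics would be averaged over all 300 splits.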