Discovering cancer genes by integrating network and functional properties
© Li et al. 2009
Received: 14 October 2008
Accepted: 19 September 2009
Published: 19 September 2009
Identification of novel cancer-causing genes is one of the main goals in cancer research. The rapid accumulation of genome-wide protein-protein interaction (PPI) data in humans has provided a new basis for studying the topological features of cancer genes in cellular networks. It is important to integrate multiple genomic data sources, including PPI networks, protein domains and Gene Ontology (GO) annotations, to facilitate the identification of cancer genes.
Topological features of the PPI network, as well as protein domain compositions, enrichment of Gene Ontology categories, and sequence and evolutionary conservation features, were extracted and compared between cancer genes and other genes. The predictive power of various classifiers for identification of cancer genes was evaluated by cross validation. Experimental validation of a subset of the prediction results was conducted using siRNA knockdown and viability assays in the human colon cancer cell line DLD-1.
Cross validation demonstrated advantageous performance of classifiers based on support vector machines (SVMs) with the inclusion of the topological features from the PPI network, protein domain compositions and GO annotations. We then applied the trained SVM classifier to human genes to prioritize putative cancer genes. siRNA knockdown of several SVM-predicted cancer genes greatly reduced cell viability in the human colon cancer cell line DLD-1.
Topological features of PPI networks, protein domain compositions and GO annotations are good predictors of cancer genes. The SVM classifier integrates multiple features and as such is useful for prioritizing candidate cancer genes for experimental validations.
Cancer is a complex disease whose multi-step progression involves alteration of many genes, including tumor suppressor genes and oncogenes. Although multiple targeted cancer therapeutic agents have been developed based on several known cancer genes, it is expected that many cancer genes remain to be identified. Identification of novel genes likely to be involved in cancer is important for understanding the disease mechanism and for developing cancer therapeutics. Recently, global genomic re-sequencing efforts have sought to identify novel cancer genes by detecting somatic mutations in tumor tissues [2–4]. However, it is challenging to distinguish true cancer-associated mutations from the large number of "passenger" variants detected in these studies, which are likely to be irrelevant to cancer progression.
Most gene products interact in complex cellular networks. It has been proposed that direct and indirect interactions often occur between protein pairs whose mutations give rise to similar disease phenotypes. This concept has been used to predict phenotypic effects of gene mutations using protein complexes and to identify previously unknown complexes likely to be associated with disease [6, 7]. A similar notion may be applied to cancer, where identifying the protein interaction network of known cancer genes may provide an efficient way to discover novel cancer genes. The rapid accumulation of genome-wide human PPI data has provided a new basis for studying the topological features of cancer genes. It was shown that network properties in human protein-protein interaction (PPI) data, such as network connectivity, differ between cancer-causing genes and other genes in the genome. An interactome-transcriptome analysis also reported increased interaction connectivity of differentially expressed genes in lung squamous cancer tissues. These studies indicated a central role of cancer proteins within the interactome. Recent studies also applied network approaches to studying cancer signaling and identifying biomarkers of cancer progression in specific cancer types [11, 12]. However, the utility of the PPI network for identifying novel genes whose genetic alterations are likely to be causally implicated in oncogenesis remains to be demonstrated. In addition, efforts have been made to use functional and sequence characteristics, such as GO annotation and sequence conservation, to predict cancer genes and cancer mutations [13, 14]. However, a systematic side-by-side analysis of all these features is needed to evaluate their merits, both individually and in combination, in cancer gene prediction.
In this study, we took a machine learning approach to investigate various network and functional properties of known cancer genes and to predict the likelihood that a gene is involved in cancer. Although the Cancer Gene Census provides a catalogue of currently known cancer-causing mutations, many other cancer genes may yet be discovered in the rest of the genome. To reduce false positives in classifying genes not involved in cancer, we extended the comparison of various features to four non-overlapping gene groups: "cancer genes" from the Cancer Gene Census (bona fide cancer genes whose mutations are causally implicated in cancers), "COSMIC genes" profiled for somatic mutations in cancer and deposited into the Catalogue Of Somatic Mutations In Cancer (COSMIC) database (excluding those in the cancer gene set), "OMIM genes" from the Online Mendelian Inheritance in Man (OMIM) database (excluding those in the cancer or COSMIC gene set), and other genes in the genome (noted as "non-cancer genes"). Somatic mutations were observed in cancers for a subset of "COSMIC genes", which are therefore potentially related to oncogenesis, while "OMIM genes" comprise genes known to be involved in diseases, excluding known cancer genes. We trained various classifiers using "cancer genes" and "non-cancer genes", and evaluated the contribution of various features and different classification methods using cross validation. We then applied the trained classifier with the best cross validation performance to human genes to prioritize those likely to be involved in cancer. To evaluate the roles of predicted cancer genes in cancer cell growth and proliferation, siRNA knockdown experiments and cell viability assays were conducted in a human colorectal cancer cell line.
The PPI network was constructed as the union of all relationships obtained from representative published datasets [8, 17, 18]. Sequence features were obtained from the NCBI Entrez database. The number of alternative transcripts for each Entrez gene was obtained from the RefSeq database. The non-synonymous mutation rate Ka and synonymous mutation rate Ks of human-mouse and human-rat orthologs were retrieved from the NCBI HomoloGene database (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/).
We constructed four non-overlapping gene groups: "cancer genes" from the Cancer Gene Census, "COSMIC genes" from the Catalogue Of Somatic Mutations In Cancer (COSMIC) database excluding genes in the cancer gene group, "OMIM genes" from the Online Mendelian Inheritance in Man (OMIM) database excluding genes in the cancer gene group or COSMIC gene group, and the rest of the genes (noted as "non-cancer genes").
From the human genome, 9218 Entrez genes were mapped to all of the following datasets: the PPI network, the RefSeq database, HomoloGene human-mouse and human-rat orthologs, GO annotations, and the Pfam database. Among these, 278 belong to the cancer gene set, 2191 to the COSMIC gene set, 1088 to the OMIM gene set, and 5661 to the non-cancer gene set.
The SVM classifier was built using LIBSVM, a library for support vector machines (http://www.csie.ntu.edu.tw/~cjlin/libsvm). SVMs were trained on cancer genes and non-cancer genes to estimate the probability that a gene is involved in cancer. We chose the radial basis function (RBF) as the SVM kernel. We conducted cross-validation to select the parameter gamma for the RBF kernel and the parameter c for the cost of training error. Cost weights wi were set based on the ratio between the number of negative examples and the number of positive examples in the training data. Naïve Bayes and logistic regression classifiers were built using default parameters in Weka (http://www.cs.waikato.ac.nz/ml/weka/).
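As a rough illustration of the two quantities named above, the following Python sketch evaluates an RBF kernel and derives class cost weights from the class sizes. The function names are hypothetical, and the specific weighting (positives weighted by the negative-to-positive ratio, negatives weighted by 1) is an assumption; the text only states that the weights were based on that ratio.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def class_cost_weights(n_positive, n_negative):
    """Assumed scheme: weight positives by n_negative / n_positive, negatives by 1."""
    return {"positive": n_negative / n_positive, "negative": 1.0}

# Class sizes reported in the text: 278 cancer genes, 5661 non-cancer genes.
weights = class_cost_weights(278, 5661)
print(weights["positive"])

# Identical feature vectors give the maximum kernel value of 1.
print(rbf_kernel([0.2, 1.0], [0.2, 1.0], gamma=0.1))
```

With roughly twenty negatives per positive, a weighting of this kind makes a misclassified cancer gene cost about twenty times as much as a misclassified non-cancer gene during training.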
The features used to train the classifiers include degree, clustering coefficient and average length of shortest path to a cancer gene from the PPI network, gene and protein lengths from sequence features, Ka and Ka/Ks from evolutionary features, presence or absence of annotation of selected GO terms and Pfam domains (p < 0.01 from chi-square tests of over- or under-representation in cancer genes compared with non-cancer genes). Continuous features whose distribution deviates significantly from the normal distribution were log transformed, including PPI degree, protein length, gene length, Ka and Ka/Ks.
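The three PPI network features can be sketched on a toy undirected network as follows. This is a minimal illustration, not the authors' code: the gene names and edges are made up, and "average length of shortest path to a cancer gene" is read here as the mean shortest-path length from a gene to every reachable cancer gene other than itself, which is one possible interpretation.

```python
import math
from collections import deque

def clustering_coefficient(adj, gene):
    """Fraction of pairs of a gene's neighbors that are themselves connected."""
    nbrs = sorted(adj[gene])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def mean_path_to_cancer_genes(adj, gene, cancer_genes):
    """Mean shortest-path length to reachable cancer genes other than the gene itself."""
    dist = bfs_distances(adj, gene)
    lengths = [dist[g] for g in cancer_genes if g != gene and g in dist]
    return sum(lengths) / len(lengths) if lengths else None

# Hypothetical toy network; A and E play the role of known cancer genes.
ppi = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E"}, "E": {"D"}}
cancer = {"A", "E"}

features_c = {
    "log_degree": math.log(len(ppi["C"])),  # degree is log transformed, as in the text
    "clustering": clustering_coefficient(ppi, "C"),
    "mean_path": mean_path_to_cancer_genes(ppi, "C", cancer),
}
print(features_c)
```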
Classifiers were trained and evaluated using cancer genes as positive examples and non-cancer genes as negative examples. 10-fold cross validation experiments were conducted to evaluate the performance of the classifiers. The dataset was randomly divided into ten subsets, each of which has one tenth of the number of examples in the original set and preserves the relative proportion between positive and negative examples. A classifier was trained and tested ten times where each time a different subset was used for testing and the remaining nine subsets were used for training. For SVM classifiers, we conducted 5-fold cross-validation using the training data for each round to select parameter pair c and gamma, and then a classifier trained using the selected parameter pair was evaluated using the test data (parameter pair was selected from c = 1, 4, 16, 64 and gamma = 0.001, 0.01, 0.1, 1). The area under the ROC curve (AUC) was used to measure the performance of different classifiers. ROC curves and AUC values were obtained using the LIBSVM and Weka tools. A classifier with better performance than a random predictor has an AUC between 0.5 and 1.
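The evaluation scheme can be sketched without any machine learning library. The snippet below is an illustrative sketch with hypothetical function names: it builds stratified folds, enumerates the (c, gamma) grid listed above, and computes AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative. The classifier itself is omitted; in the nested procedure described above, each candidate pair would be scored by 5-fold cross-validation on the training portion of each outer fold.

```python
import random
from itertools import product

def stratified_folds(labels, k, seed=0):
    """Split example indices into k folds that keep the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Parameter grid from the text: 4 values of c times 4 values of gamma.
param_grid = list(product([1, 4, 16, 64], [0.001, 0.01, 0.1, 1]))

# Toy labels: 20 positives and 80 negatives split into 10 stratified folds.
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, 10)
print([sum(labels[i] for i in fold) for fold in folds])

# Toy scores: three of the four positive/negative pairs are ranked correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```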
DLD-1 (ATCC CCL-221) cells were obtained from the American Type Culture Collection and maintained in High Glucose Dulbecco Modified Essential Media supplemented with 10% fetal bovine serum and 2 mM L-Glutamine. Gene-targeting siRNA duplexes (Dharmacon siGENOME), siGENOME Non-Targeting siRNA #2 (Dharmacon D-001210-02) and siGENOME Non-Targeting siRNA Pool #1 siRNAs were transfected into cells using Lipofectamine 2000 (Invitrogen #11668). siRNAs were transfected at concentrations of 25-30 nM (duplexes) or 100-120 nM (pools). Lipid-siRNA complexes were formed in OptiMEM Media (Gibco #31985), to which cells were added in antibiotic-free media. Four days after transfection, cell viability was measured by adding CellTiter-Glo (Promega #G7570), and luminescence was measured according to the manufacturer's instructions using a Perkin Elmer EnVision luminometer.
A viability score is defined as the ratio of the CellTiter-Glo readout after transfection of a test siRNA to that of the negative controls (non-targeting siRNAs). A viability score less than 1 indicates decreased cell viability with siRNA targeting a given gene. Viability scores for two replicated transfections using the same siRNA were averaged, and the statistical significance of reduced viability was evaluated by t tests for each siRNA oligo. As four siRNAs target each gene, we required that at least two of them have a p-value less than 0.05 and that the decrease in cell viability be at least 15% to call the gene essential for cell viability.
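The decision rule can be sketched as follows. Several details are assumptions rather than statements from the text: the test is taken to be a one-sample, one-sided t test of the two replicate scores against 1, and the 15% threshold is applied to each siRNA's averaged score. With two replicates the t statistic has one degree of freedom, so its cumulative distribution has the closed form 0.5 + atan(t)/pi and no statistics library is needed. All function names and example scores are hypothetical.

```python
import math
from statistics import mean, stdev

def viability_score(test_readout, control_readouts):
    """Ratio of a test siRNA readout to the mean non-targeting control readout."""
    return test_readout / mean(control_readouts)

def p_reduced_two_replicates(score1, score2, mu=1.0):
    """One-sided p-value (mean < mu) for two replicates; t has 1 degree of freedom."""
    m = (score1 + score2) / 2.0
    s = stdev([score1, score2])
    if s == 0.0:
        return 0.0 if m < mu else 1.0
    t = (m - mu) / (s / math.sqrt(2.0))
    # CDF of Student's t with 1 degree of freedom (Cauchy distribution).
    return 0.5 + math.atan(t) / math.pi

def gene_is_essential(sirna_replicates, alpha=0.05, min_decrease=0.15):
    """True if at least two siRNAs are significant and reduce viability by >= 15%."""
    hits = 0
    for score1, score2 in sirna_replicates:
        avg = (score1 + score2) / 2.0
        if p_reduced_two_replicates(score1, score2) < alpha and 1.0 - avg >= min_decrease:
            hits += 1
    return hits >= 2

# Hypothetical replicate viability scores for four siRNAs per gene.
gene_a = [(0.60, 0.62), (0.70, 0.72), (0.95, 1.05), (0.70, 0.90)]
gene_b = [(0.95, 1.05), (0.70, 0.90), (0.90, 0.92), (1.00, 1.10)]
print(gene_is_essential(gene_a), gene_is_essential(gene_b))
```

Under these assumptions, the third siRNA of the second gene is significant but reduces viability by only 9%, so it does not count toward the two required hits.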
A comprehensive human PPI network was built by integrating multiple publicly available data sources, including a collection of validated direct interactions, computationally predicted interactions based on homology mapping, and experimentally proposed interactions from large-scale human mass-spectrometry experiments. Validated interactions were derived from the Biomolecular Interaction Network Database (BIND), the Human Protein Reference Database (HPRD), Reactome, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). All proteins were mapped to Entrez genes, and the combined human PPI network consists of 13,802 genes (genes and their protein products are used interchangeably below) and 140,600 interactions. The PPI network included 331 out of 368 (89.9%) genes from the Cancer Gene Census, 2769 out of 3,001 (92.3%) genes from COSMIC (excluding genes also in the Cancer Gene Census), 1786 out of 1976 (90.4%) genes from OMIM (excluding genes also in the Cancer Gene Census or COSMIC) and 8916 out of 18,744 (47.6%) remaining genes in the human genome.
To investigate various PPI network, functional and sequence features of cancer genes and other genes, we selected 9218 well-annotated genes that were included in the PPI network and were assigned GO terms and Pfam domains. These genes also have mouse or rat orthologs defined by HomoloGene. We classified them into four mutually exclusive sets: 278 "cancer genes" from the Cancer Gene Census, 2191 "COSMIC genes" from the COSMIC database (excluding those also in the Cancer Gene Census), 1088 "OMIM genes" from the OMIM database (excluding those also in the Cancer Gene Census or COSMIC), and the remaining 5661 "non-cancer genes".
Table 1. Top 5 significantly over-represented GO terms of 'molecular function' and 'biological process' in cancer genes vs. non-cancer genes. Columns: GO term; # cancer genes; # non-cancer genes (the counts were not preserved).
negative regulation of cell cycle
response to DNA damage stimulus
regulation of cellular process
response to endogenous stimulus
protein amino acid phosphorylation
protein-tyrosine kinase activity
protein kinase activity
transcription regulator activity
phosphotransferase activity, alcohol group as acceptor
Table 2. Top 10 significantly over-represented Pfam domains in cancer genes vs. non-cancer genes. Columns: Pfam domain; # cancer genes; # non-cancer genes (the counts were not preserved).
Protein tyrosine kinase
Protein kinase domain
'Paired box' domain
DEAD/DEAH box helicase
DNA mismatch repair protein, C-terminal domain
Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase
Table 3. Summary statistics of sequence features. Each row lists one value per gene group; the column headers were not preserved.
Protein sequence length (aa): 761 ± 25a; 557 ± 22; 634 ± 26; 713 ± 30
Genomic sequence length (bp): 84163 ± 330; 54319 ± 317; 66471 ± 322; 63604 ± 330
Number of exons: 10.24 ± 2.66; 10.01 ± 2.98; 10.49 ± 3.24; 10.39 ± 2.97
Number of alternative splicing events: 1.82 ± 1.23; 1.36 ± 0.97; 1.50 ± 0.99; 1.54 ± 1.07
With regard to the other sequence features we examined, the probability density distributions of the average number of alternative transcripts did not show clear separation between cancer and non-cancer genes (data not shown), and no significant difference was observed in the number of exons among the four gene groups (ANOVA F test p-value 0.096).
As distinctive patterns were observed between cancer and non-cancer genes in the analyses of PPI networks, GO and Pfam annotations, and sequence and conservation features, we sought to design a classifier that combines the predictive power of each type of feature for identification of cancer genes. Specifically, we considered PPI network features including degree, clustering coefficient and the length of the shortest path to a cancer gene, sequence features including gene and protein lengths, and conservation features including Ka and the Ka/Ks ratio. For GO and Pfam features, we selected 79 GO terms and 61 Pfam domains that were significantly differentially represented in cancer genes compared to non-cancer genes by chi-square test (p < 0.01). For a given gene, the presence or absence of each GO term or Pfam domain assignment was encoded as '1' or '0', respectively.
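The selection and encoding of annotation features can be sketched as below. This is an assumption-laden illustration: it uses a Pearson chi-square test on a 2x2 table without continuity correction (the text does not say whether a correction was applied), the counts in the example are invented, and the function names are hypothetical. For one degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)), so only the standard library is needed.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))

def encode_annotations(gene_terms, selected_terms):
    """Binary vector: 1 if the gene carries the selected term or domain, else 0."""
    return [1 if term in gene_terms else 0 for term in selected_terms]

# Hypothetical counts: a term annotated in 40 of 278 cancer genes
# and in 100 of 5661 non-cancer genes.
stat, p_value = chi_square_2x2(40, 278 - 40, 100, 5661 - 100)
print(stat, p_value, p_value < 0.01)

# Hypothetical gene carrying one of two selected annotations.
selected = ["protein kinase activity", "Protein tyrosine kinase"]
print(encode_annotations({"protein kinase activity"}, selected))
```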
Table 4. Area under ROC (AUC) for feature selections and classifiers. The AUC values and classifier column headers were not preserved; the feature sets compared were:
Sequence (gene + protein length)
Conservation (Ka + Ka/Ks)
GO + Pfam
GO + Pfam + Sequence
GO + Pfam + Conservation
GO + Pfam + Sequence + Conservation
PPI + GO + Pfam
PPI + GO + Pfam + Sequence
PPI + GO + Pfam + Conservation
PPI + GO + Pfam + Sequence + Conservation
RNAi-based phenotypic screening has demonstrated its utility in identifying cancer genes and putative drug targets [32–34]. As genes that are essential for cancer cell proliferation and survival represent attractive drug target candidates, we examined a subset of predicted cancer genes using small interfering RNA (siRNA) knockdown and cell viability assays. Although our predictions do not distinguish between oncogenes and tumor suppressor genes, in this experiment we were interested in identifying novel oncogenes as potential new therapeutic targets. As the COSMIC, OMIM and non-cancer gene sets may contain novel oncogenes that have not been characterized as cancer genes in the Cancer Gene Census, we focused on these three gene sets and examined whether siRNA knockdown of their genes would lead to decreased cell viability. Decreased viability upon knockdown indicates that a gene is essential for cancer cell proliferation and may potentially become a novel drug target. A total of 332 genes from these three gene sets overlap with the duplex siRNA library for druggable genes (Dharmacon Inc.) and were included in a large siRNA screen conducted at our institution for druggable genes that affect the viability of the human colon cancer cell line DLD-1 (unpublished). Among these, 16 genes are likely to be cancer genes, having probability scores greater than 0.5 from the classifier (noted as predicted cancer genes), and the remaining 316 genes are less likely to be involved in cancer (noted as predicted non-cancer genes). A viability score is defined as the ratio of cell viability after transfection of the test siRNA to that after transfection of the negative control siRNAs (non-targeting siRNAs). A viability score significantly less than 1 indicates that siRNA knockdown of the target gene significantly decreased cell viability.
We conducted one-sample t tests of the viability scores from two replicated experiments for each of the four different siRNA oligos targeting the same gene. To identify genes whose siRNA knockdown leads to decreased viability in a cell line, we required that at least two of the four siRNAs targeting the gene produce significantly reduced viability (p < 0.05) and that the decrease in cell viability be at least 15%. As a result, 6 of the 16 (37.5%) predicted cancer genes vs. 40 of the 316 (12.7%) predicted non-cancer genes were selected. Fisher's exact test showed a significant enrichment of genes essential for cell viability in predicted cancer genes vs. predicted non-cancer genes (odds ratio 4.11 and p-value 0.014).
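The enrichment test can be reproduced approximately from the counts above. The sketch below computes the upper-tail Fisher's exact p-value from the hypergeometric distribution for the 2x2 table of predicted class (cancer vs. non-cancer) against screen outcome (selected vs. not selected). It is an illustration only: the function name is hypothetical, and the cross-product odds ratio computed here is a different estimator from the conditional estimate that some statistical packages report, which may explain the small difference from the reported 4.11.

```python
from math import comb

def fisher_exact_upper_tail(a, b, c, d):
    """P(X >= a) with all margins of the table [[a, b], [c, d]] held fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    upper = min(row1, col1)
    favorable = sum(comb(col1, k) * comb(n - col1, row1 - k)
                    for k in range(a, upper + 1))
    return favorable / total

# Counts from the text: 6 of 16 predicted cancer genes and
# 40 of 316 predicted non-cancer genes were selected.
a, b = 6, 16 - 6
c, d = 40, 316 - 40

p_enrichment = fisher_exact_upper_tail(a, b, c, d)
sample_odds_ratio = (a * d) / (b * c)
print(round(p_enrichment, 3), round(sample_odds_ratio, 2))
```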
Our study represents a first attempt to examine the predictive power of PPI network properties, in combination with an extensive set of structural and functional features, for identification of cancer genes. Compared to OMIM disease genes and non-cancer genes, cancer genes have more interaction partners, higher network density in their neighborhood, and are more closely related to other cancer genes in the PPI network. These observations agree with the notion that cancer genes play a central role in the cellular network and exert their functions in an interdependent, modular fashion. One common concern regarding analysis of PPI networks is that the observed higher connectivity of a certain group of genes could result from a bias in the PPI network, since these genes may have received more detailed investigation by the research community. To address this concern, it was previously argued that the higher number of known interaction partners for cancer genes is likely to be a consequence of a higher frequency of promiscuous domains (which interact with a variety of different domains) in cancer genes rather than an obvious bias in the PPI network. Based on a probability density function from the Pfam domain population, many of the top Pfam domains enriched in cancer genes vs. non-cancer genes in our study showed significantly higher-than-expected interaction promiscuity in terms of the number of different domains they interact with, such as the protein kinase domain, Ets domain and Homeobox domain (Table 2). In addition, there is a significant difference in connectivity and clustering coefficient between cancer and OMIM genes (Figure 1; see additional file 1: Supplementary Table S1) even though cancer genes and OMIM genes both represent heavily studied gene sets. Approximately 90% of both the cancer genes from the Cancer Gene Census and the disease genes from the OMIM database were included in the PPI network.
Furthermore, the analyses were conducted using the subset of well-annotated genes from the human genome that were assigned GO terms and Pfam domains. As a result, the less well-studied genes were filtered out of the non-cancer gene group.
Our study showed that cancer genes have functional, sequence and evolutionary characteristics distinct from those of COSMIC, OMIM and non-cancer genes. COSMIC genes and OMIM genes in turn differ from each other and from non-cancer genes. It should be noted that the OMIM gene set in our study is specific to the context of comparison with cancer genes, as we excluded from it those genes shared between the OMIM database and the Cancer Gene Census or COSMIC database. COSMIC genes showed relatively more similarities with cancer genes in many properties, and in fact many COSMIC genes were found to be involved in cancer although they are not included in the Cancer Gene Census database. Therefore, it is beneficial to separate the COSMIC and OMIM gene groups from non-cancer genes when training a classifier to predict cancer genes.
SVM classifiers on average perform slightly better than Naïve Bayes and logistic regression. Naïve Bayes performs the worst in our study, probably because our features are not independent of each other, which violates the conditional independence assumption of Naïve Bayes models. The theoretical advantage of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; maximizing the margin mitigates the problem of over-fitting the training data, which is of particular importance when dealing with a large number of features.
PPI topological features alone have relatively strong predictive power for identification of cancer genes. Like PPI features, GO and Pfam annotations are strong predictors compared to sequence and conservation features. Combining all these features maximizes the predictive power (Table 4). As more protein-protein interaction datasets accumulate, our approach of integrating PPI topological features will potentially become more powerful.
The SVM classifier provides a probability score to prioritize candidate cancer genes, which can be followed up by experimental studies, such as siRNA knock down and cell viability assays. Preliminary siRNA studies on predicted cancer genes showed promising leads for further investigations. Interestingly, COSMIC genes with somatic mutations in cancer samples have higher scores than other genes in the COSMIC database (Figure 5). As COSMIC genes were held out from the training set and no mutation information was included in the training features, this observation indicates our approach aligns with the large-scale systematic re-sequencing efforts and can serve as a useful complementary approach for identifying cancer genes.
Topological features of PPI networks, protein domain compositions and GO annotations are good predictors of cancer genes. The SVM classifier integrates multiple features and as such is useful for prioritizing candidate cancer genes for experimental validations. Preliminary siRNA studies on predicted cancer genes showed promising leads for further investigations. The integrative approach using PPI networks is a useful complement to large-scale systematic re-sequencing and other genomic discovery projects for identifying novel cancer genes.
SVM: support vector machine
COSMIC: Catalogue Of Somatic Mutations In Cancer
OMIM: Online Mendelian Inheritance in Man
BIND: Biomolecular Interaction Network Database
HPRD: Human Protein Reference Database
KEGG: Kyoto Encyclopedia of Genes and Genomes
ROC: Receiver Operating Characteristic
RBF: radial basis function.
We are grateful to Drs. Jinfeng Liu, Josh Kaminker, Zeming Zhang for helpful discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.