Skip to main content

Identifying driver genes involving gene dysregulated expression, tissue-specific expression and gene-gene network

Abstract

Background

Cancer as a kind of genomic alteration disease each year deprives many people’s life. The biggest challenge to overcome cancer is to identify driver genes that promote the cancer development from a huge amount of passenger mutations that have no effect on the selective growth advantage of cancer. In order to solve those problems, some researchers have started to focus on identification of driver genes by integrating networks with other biological information. However, more efforts should be needed to improve the prediction performance.

Methods

Considering the facts that driver genes have impact on expression of their downstream genes, they likely interact with each other to form functional modules and those modules should tend to be expressed similarly in the same tissue. We proposed a novel model named by DyTidriver to identify driver genes through involving the gene dysregulated expression, tissue-specific expression and variation frequency into the human functional interaction network (e.g. human FIN).

Results

This method was applied on 974 breast, 316 prostate and 230 lung cancer patients. The consequence shows our method outperformed other five existing methods in terms of Fscore, Precision and Recall values. The enrichment and cociter analysis illustrate DyTidriver can not only identifies the driver genes enriched in some significant pathways but also has the capability to figure out some unknown driver genes.

Conclusion

The final results imply that driver genes are those that impact more dysregulated genes and express similarly in the same tissue.

Background

Cancer as a kind of genomic alteration disease each year deprives many people’s life [1,2,3]. It is acknowledged that cancer arise is due to the accumulation of mutations in a subgroup of genes which conferring growth advantage, allowing uncontrolled proliferation and avoiding apoptosis [4, 5]. With the development of next-generation sequencing technology, several large-scale cancer projects have generated a large amount of cancer genomic data, such as The Cancer Genome Atlas (TCGA) [6], International Cancer Genome Consortium (ICGC) [7], which enable the detection of thousands of mutations. However, not all mutations contribute to the cancer initiation and progression. The mutations that are important to the cancer development and provide selective growth advantage are called driver mutations, the opposite is termed as the passenger mutations [8, 9]. Some researches show that the number of passenger mutations far beyond the number of driver mutations [9]. For example, from 11 cancer types, there are only 2 to 6 mutations have been regarded as the driver mutations among 200 somatic mutations which including missense, nonsense, silent, non-coding, splice-site, non-stop mutations, frameshift insertions and deletions (indels) and inframe indels [9,10,11,12]. Besides, those important alterations are not uniformly distributed across the genome and target to some specific genes associated with important cellular functions such as cell survival, cell fate etc. [4, 13,14,15]. For example, the well-known tumor suppressor TP53 participate in defense mechanisms against cancer and their inactivation by alteration can increase the selective growth advantage of the cell [16]. The alterations of ERBB2 [17] and KRAS [18] can lead to the acquisition of new properties that provide some selective growth advantage or spread to remote organs. Hence, the biggest challenge to overcome cancer is how to precisely discriminate those driver genes which harboring driver mutations and have the capability to promote cancer development from those irrelevant passenger genes [11]. This act is essential to understand the tumor biology and designing precision therapies [4, 19].

Traditional methods to identify cancer driver genes are based on the assumption that driver mutations confer a selective advantage to tumor growth and they occur more frequently than expected by random chance [20]. This kind of methods such as Mutsig [21] and MuSic [22] successfully pinpoints part of recurrence genes. However, in fact, only a small number of genes are altered in a high percentage of patient. Much larger number of genes are altered infrequently [11]. Besides, due to the heterogeneity of cancer, it is so hard to properly estimate the background mutation rate that many errors may be introduced [23].

A promising angle to identify cancer driver genes is based on network since it is acknowledged that cancer genes are more closely related with each other within a group to perform a certain function [24]. HotNet [25] and HotNet2 [26] apply a propagation process that diffuse the score of mutation frequency through the whole gene-gene interaction network and extract significantly mutated subnetworks to identify driver genes. NBS [27] detects driver genes by taking the strategies similar to HotNet. However, NBS detects mutated subnetworks of each patient and uses a consensus clustering framework to merge subnetworks across all patients. Unlike previous methods that use global network information, MUFFINN [28] prioritizes the cancer driver genes by measuring the impact from all neighbors of mutated genes in the functional network. Although these network-based methods mentioned above proposed a new focus on the interacting relationship of cancer driver genes, most of them identified cancer driver genes only consider the patient-gene mutation profiles and topology of networks. Besides, they are too much rely on the known network which may create some false positive data [23].

To overcome these limitations, some researchers focus on combining the cancer gene’s functional interactive relationship and other biological properties to improve the precision of detecting cancer driver genes. For example, DriverNet [29] identifies cancer driver genes by estimating their effect on mRNA expression. Inspired by the rationale that cancer driver genes may be determined by their impact on expressions of downstream genes, DriverNet firstly identifies the downstream genes (called outlying genes) with significantly differential expressions and then constructs a bi-graph where one side is mutated genes and the other side is outlying genes. It selects the driver genes that connect to the most nodes in the outlying gene side. Shi et al. [30] further improve DriverNet method by introducing diffusion process on the bi-graph. DawnRank [31] ranks potential cancer driver genes based on both their own expression difference and their impact on the overall differential expression of the downstream genes in the molecular interaction network. LNDriver [24] is also designed on the basis of bi-graph, while it incorporates the DNA length to filter mutated gene at the first step.

Above mentioned bi-graph-based methods to some degree improve the accuracy of identifying cancer driver genes by adding biology profiles to the gene itself. However, the reliability of network still needs to do further improvement since most of known networks are built based on either or mix of large scale of computational and experimental data. This may directly impact the efficiency and precision of detecting novel driver genes [23]. Hence, the fundamental problem is to establish one model that can improve the reliability of network so as to improve the power of prediction. To achieve this, some researchers consider to incorporate specific biological profiles to assign a weight for each interaction such as the impact of differential expression information [32]. However, seldom of them considered the facts that the majority of cancer genes interact with each other to form functional modules and those modules should tend to be expressed similarly in the same tissue. Ganegoda et.al [33] use the tissue-specific data to predict the new disease-gene associations by measuring the gene expression in disease related tissues and achieved higher performance. Besides, previous studies found genetic disorders tend to manifest only in a single or a few tissues for a given disease [34]. Motivated by these, we want to refine the gene functional interaction network by considering expression similarity between each pair of mutated genes in the cancer’s related one or two tissues. Moreover, from the previous research, it is known that cancer driver genes are more likely to be frequently mutated across a cohort of patients and also dysregulate downstream genes’ expression.

Based on the facts mentioned above, we proposed a model called DyTidriver to predict cancer driver genes by integrating dysregulated expression profiles, tissue-specific expression profiles, modularity of mutated genes and variation frequency into the gene functional interaction network. In DyTidriver, considering the fact that cancer driver genes are likely dysregulate downstream genes’ expression, mutated genes were firstly filtered according to their impact on the expression of downstream genes. After that, mutated genes’ interactive network was weighted by considering gene-gene co-expression in specific tissues of each query disease and the relationship between mutated genes. Because the majority of cancer driver genes interact with each other to form functional modules and those modules tend to be expressed similarly in the same tissue. Finally, with respect to the facts that driver genes are more likely to be frequently mutated across a cohort of patients and interact with each other to form functional modules, the mutated genes were ranked by summing up the weighted graph and multiplying itself variation frequency. We explored our method to detect cancer driver genes of lung cancer, breast cancer and prostate cancer. The result shows that our method significantly outperforms other five existing methods [28,29,30,31] in terms of Fscore, Precision and Recall. Besides, the cociter analysis illustrates our method can not only identify some well-known cancer driver genes but also detects the unknown cancer driver genes with high co-occurrence ratio in some publications. Furthermore, the identified cancer driver genes also enrich in some significant pathways and biological functions.

Methods

Our method consists of four steps (see Fig. 1). At first, we filtered the mutated genes for each patient according to whether or not it influenced the expression of downstream genes. Only the mutated genes that dysregualte downstream genes’ expression will be included in our study. Then, the remaining mutated genes for all patients were mapped to the human functional interaction network (human FIN) to construct the Mut-Mut matrix. Thirdly, the tissue-specific pearson correlation coefficient (PCC) matrix was constructed by calculating the co-expression values of mutated genes derived from downloaded tissue expression information after searching the disease-tissue matrix. Finally, we calculated the edge clustering coefficient (ECC) values for the interactions in the network which established at the last step and assigned each mutated gene in the network a score by firstly summing up ECC values of its connected edges and then multiplying the addictive result to its corresponding variation frequency. According to the scores, the mutated genes were ranked in a descending order and those ranked at the top of the list were considered as potential cancer driver genes.

Fig. 1
figure1

The workflow of Dytidriver. We divided our whole process of cancer driver gene identification into four steps and marked with ‘a’,’ b’, ‘c’, ‘d’. In the step ‘a’, we filtered the mutated genes for each patient according to whether or not it influenced the expression of downstream genes. Only the mutated genes which connect at least one outlying genes would be included in our study. Then, the filtered mutated genes for all patients were mapped to the human functional interaction network to construct the Mut-Mut matrix. The ‘b’ step is to generate the tissue-specific PCC matrix. For each cancer, we chose the top one or two tissues with the higher association score in disease-tissue matrix as the cancer related tissues such as the tissue 1 and tissue 2 for disease D1. For each tissue, we calculated its gene-gene pearson correlation values across the whole patients and then generated the gene-gene PCC matrix by keeping the absolute PCC values more than 0.3 while left setting to 0. If there are more than one tissue related to a cancer, the final tissue-specific PCC matrix is constructed by averaging the values in the gene-gene PCC matrix of each tissue. In the ‘c’ step, we constructed the ECC mutated matrix by utilizing the ECC equation. In the final ‘d’ step, we assigned each mutated gene in the network a score by summing up all the ECC values of its connecting edges and then multiply to its corresponding variation frequency. According to the scores, the mutated genes were ranked in a descending order and those ranked at the top list the were considered as potential driver genes

Experimental data

The datasets in this study derived from three places. The first part includes the somatic mutation data and their corresponding transcriptional expression data for each patient. Both of these datasets were downloaded from the TCGA website by utilizing the TCGA2STAT R packages. For our analysis, we focused on the somatic mutation and gene transcriptional expression data for 230 lung cancer patients, 974 breast cancer patients and 331 prostate cancer patients. The downloaded TCGA datasets include both tumor and normal patients: 58 of 230 lung, 110 of 974 breast and 52 of 331 prostate are normal patients.

The second part of dataset is the tissue-specific expression profiles. In order to find the most related tissues for each cancer type, we searched the tissue-disease matrix which can be downloaded from the reference [34]. Each entry in the matrix represents the covariance of a disease with a tissue through the way of counting the number of publications co-appearing the disease and tissue, relative to the number of publications mentioning the disease or tissue alone. It is acknowledged that genetic disorders tend to manifest only in a single or few tissues for a given disease [34]. Hence, we chose one or two of the most relevant tissues for each cancer type. Fortunately, the directly related tissue can be found for most of cancer type e.g. the lung tissue for lung cancer, prostate tissue for prostate cancer. However, we cannot find the breast tissue in the disease-tissue matrix. Instead, we chose the top two relevant tissues (e.g. prostate, ovary) with higher association score for breast cancer. In order to obtain the tissue-specific expression profiles, we used the Gene Expression Omnibus (GEO) database. Because GEO database is currently the largest and most famous expression data platform which stores relatively complete expression data. According to the identified most related tissues for each cancer type, we downloaded the gene expression details of each tissue sample from the GEO website by querying dataset GSE7307. The database lists the transcriptional profile of both normal and disease human tissues representing over 90 distinct tissue types by using the Affymetrix human U133 plus 2.0 array. At here, we used the R package called GEOquery to download the corresponding tissue expression information from the platform GPL570. The downloaded data is the expression profile matrix with genes and patients as the columns and rows respectively.

The last part of the dataset comes from the currently release version (2016) of human functional interaction network (human FIN) in which involving 12,275 genes and 46,0434 edges [35]. This network is constructed by extending curated pathways with non-curated sources of information, including protein-protein interactions, gene co-expression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions, which cover close to 50% of the human proteome. The benchmarking of driver genes was downloaded from the NCG 4.0 which included 537 known cancer genes from the Cancer Gene Census [36] and 1463 candidate cancer genes that were derived from the manual curation of 77 whole genome or whole exome cancer-resequencing screenings [37] .

Filtering mutated genes and constructing Mut-Mut matrix

The somatic mutation data were downloaded from TCGA website where records the information of mutated gene across patients. The genes that were mutated in at least one patient were kept and regarded as the mutated genes. Previous researches have pointed out that driver genes are more likely to regulate the expression of downstream genes [29,30,31]. Those gene whose expression were impacted significantly are called outlying genes. In order to acquire the outlying genes, we downloaded the transcriptional expression information from the TCGA website and calculated their z-scores. More specifically, for each gene and each patient, a gene was regarded as the outlying gene for the patient if its z-score > 2.0 or its z-score < − 2.0. The setting of threshold as ± 2.0 was referred to the DriverNet [29]. Then, we kept the mutated genes which have at least one connection with outlying genes in the human FIN while filtered out those having no connections with outlying genes. Finally, the remaining mutated genes were mapped to the human FIN to generated the binary Mut-Mut matrix in which the rows and columns are the remaining mutated genes and the element is 1 if there is a connection between the two mutated genes in the human FIN, 0 otherwise.

Assigning weight to Mut-Mut matrix by PCC values

Since the majority of disease genes forming a common functional module tend to be expressed similarly in the same tissue and there exist too much false positive connections in the gene networks, in this work, we use tissue-specific expression profile to assign weights for the interactions of genes in order to improve the reliability of genes interactive network. For each cancer type, at first, we chose the most related tissue according to its association score in the disease-tissue matrix [34]. If there is at least one tissue related with a cancer in the disease-tissue matrix, its corresponding tissue expression information across a cohort of patients can be downloaded from the GEO website. After that, we calculated the gene-gene PCC values of downloaded tissue expression matrix across the whole patients and then generated the PCC matrix by keeping their absolute PCC values more than 0.3 while left setting to 0. The threshold setting was according to previous research [34]. At last, the average score of PCC matrix of each tissue was regarded as the final tissue-specific PCC matrix of the cancer type. We assigned a weight to values in the Mut-Mut matrix based on the tissue-specific PCC matrix. Specifically, if a mutated gene i connects to a mutated gene j in the Mut-Mut matrix (e.g. W(i,j) = 1), the PCC value of genes i and j was assigned to the corresponding entry of the Mut-Mut matrix otherwise the value was set to 0. Consequently, a weighted mutated PCC matrix denoted by W is constructed.

Calculating the mutated gene score

Previous studies have found that cancer is the fact that genes act together in various signaling pathway and protein complexes [25]. Hence, in order to highlight the modularity of cancer driver genes, we calculated the ECC values for each pair of mutated genes in the mutated PCC matrix. The ECC value was normally used to measure the degree of closeness between two nodes in a network, which has been widely applied in detecting network modules [38,39,40]. We calculated the ECC values for each pair of mutated genes in the weighted mutated PCC matrix (denoted by Matrix W in Eq. 1). The higher ECC value means two genes are more likely to act together in a common module. The definition of ECC is as Eq. 1. After calculating the ECC score for each pair of mutated genes in the weighted mutated PCC matrix, we assigned each mutated gene a score (Mi) by summing up all ECC values of its connecting edges (see Eq. 2). It is known that cancer driver genes are more likely to be those frequently mutated in many patients. Hence, the final ranking score of each mutated gene was calculated by multiplying its variation frequency to its additive score (see Eq. 3). After that, all mutated genes were ranked in a descending order according to their ranking scores and the genes with the higher rank are more likely to be the cancer driver genes.

$$ ECC\left(i,j\right)=\frac{\sum_{k\in i\cap j}^n{W}_{ik}+{W}_{jk}}{\min \left({d}_i,{d}_j\right)} $$
(1)
$$ {M}_i=\sum \limits_{j\in {N}_i}^n ECC\left(i,j\right) $$
(2)
$$ {F}_i={V}_i\bullet {M}_i\kern0.5em $$
(3)

Where W denotes weighted mutated PCC matrix. k denotes the common neighbors between mutated gene i and gene j in the matrix W. Wik is the weight between mutated gene i and gene k. di and dj are the degrees of nodes i and j, respectively. Min (di,dj) represents the maximal possible number of triangles that might include the edge(i,j). Ni is the set of all neighbors of mutated gene i. Vi denotes variation frequency of gene i which is measured by mutated times of gene i out of total patient counts.

Statistic evaluation metrics

In order to evaluate the performance of our method, top N of ranked genes were selected as potential cancer driver genes. The accuracy of prediction depends on how well the predicted cancer driver genes match the real ones, which was measured by three widely used statistic metrics, Precision, Recall and Fscore.

$$ Precision=\frac{TP}{TP+ FP} $$
$$ Recall=\frac{TP}{TP+ FN} $$
$$ {F}_{score}=2\bullet \frac{Precision\bullet Recall}{Precision+ Recall} $$

where TP (true positive) is the number of predicted driver genes matched by known driver genes in benchmarking dataset. TN (true negative) is the number of not predicted driver genes that are not matched by known ones. FP (False Positive) is the number of predicted driver genes that are not matched by known driver genes. FN (false negative) is the number of known driver genes that are not matched by predicted ones.

Enrichment analysis

Another evaluation metric is pathway and GO enrichment analysis in order to evaluate whether or not the predicted cancer driver genes share common biological functions. It is widely known that cancer is a disease of pathways and the somatic mutations target the cancer genes in a group of regulatory and signaling networks [25]. Besides, those cancer-related driver mutations recurrently occur in the functional regions of protein (such as kinase domains and binding domains) to interrupt the major biological functions [41]. In this study, we leveraged the DAVID database to do the KEGG pathway enrichment analysis and GO enrichment analysis [42].

Results

In order to testify the effectiveness of our method, we applied our method and other four models:

DriverNet [29], DawnRank [31] and Diffusion algorithm [30], Muffinn [28] on the breast cancer, prostate cancer and lung cancer to identify their driver genes. Among them, the DriverNet, DawnRank and Shi’s Diffusion algorithm utilize the gene dysregulated expression information to identify outlying genes and construct the bipartite graph. These methods ranked mutated genes according to their connections with the outlying genes. The Muffinn method leverages both the variation frequency of mutated genes and the impact of their neighbors to design the ranking scores. It was further classified into two models: Muf_max and Muf_sum, according to considering the impact of either the most frequently mutated neighbor or all direct neighbors [28]. Unlike the DriverNet, DawnRank and Shi’s diffusion method that use gene dysregulated expression to construct bipartite graph, our study only employs the dysregulated expression profile to filter the mutated genes. Moreover, similar to the Muffinn method, we also consider the variation frequency of mutated genes and the impact of their direct neighbors. However, compared with other methods, our method not only integrates the features of dysregulated expression information, variation frequency and human FIN but also considers the modularity of mutated genes and their co-expression in the same tissue.

Running DawnRank demands expression data with normal and tumor samples. From the three cancer datasets, we can only download 110, 58, 52 tumor samples that have normal gene expression profiles for breast, lung and prostate respectively. Besides, we set the free parameter of DawnRank as three which was recommended by DawnRank authors [31].

Comparing performance

All the mutated genes were ranked in a descending order based on the scores assigned by each comparing method. After that, K of genes ranked in the top list were selected as candidate driver genes. According to the benchmark dataset, the Fscore, Recall, Precision values can be calculated to evaluate the performance of each method. With different values of K ranging from 1 to 200, the Fscore curve, Recall curve and Precision curve is drawn. The results are shown in the Fig. 2. In general, our results are superior to all of other four methods on the lung, prostate and breast cancer datasets. Compared with the other five methods, our model identifies the largest number of known drivers from NCG 4.0. For lung cancer, the Dytidriver and the other methods are tangled together when predicting small number of potential driver genes and then Dytidriver is significantly better than the other methods when the number of predicted driver genes increases from top 40 to 200. For prostate and breast cancer, our model demonstrated the best performance from beginning to the end. Similar to Muffinn, considering the variation frequency and the functional impact of direct neighbors, our method additionally takes advantage of the tissue-specific co-expression property and the modularity property which improve the precision of detecting driver genes to a higher level. Besides, the performance of Muf_max is worse than that of Muf_sum, which means it is inappropriate to judge a driver only based on the impact of single gene. DawnRank performed poorly among all comparing methods. The reason might be that only a limited number of cancer patients both have normal and tumor expression data for DawnRank.

Fig. 2
figure2

A comparison of the Precision, Recall, and Fscore for top ranking genes in the six methods. The X-axis represents the number of top-ranking genes. The Y-axis represents the score of the given metric

Enrichment analysis

We select the top 200 of cancer driver genes to do GO and pathway enrichment analysis. For lung cancer, in the biological process, the genes detected by our method enrich in the signal transduction, intracellular signaling cascade, transcription, metabolic process, regulation of cell death and apoptosis etc. In the cellular component, our results focus on the plasma membrane, organelle, cytoskeleton, lumen and cell fraction etc. In the molecular function, our results enrich in ion binding, nucleotide binding, ATP binding, transcription regulator activity etc. From the pathway aspect, our identified cancer driver genes enrich in some important cancer pathway, such as calcium signaling pathway, PI3K-Akt signaling pathway, mTOR signaling pathway.

With respect to the breast cancer, in biological process, our results enrich in the intracellular signaling cascade, signal transduction, regulation of transcription, metabolic process, regulation of cell death, phosphorylation, transcription, phosphorylation and cell proliferation. In the cellular component, our results enrich in the plasma membrane, organelle, lumen and cell fraction. In the molecular function, our results mainly enrich in the nucleotide binding, ATP binding, DNA binding, transcription regulator activity and kinase activity. In pathway analysis, our results enrich in Calcium signaling pathway, MAPK signaling pathway, PI3K signaling pathway, p53 signaling pathway etc.

In terms of prostate cancer, our results enrich in the regulation of transcription, signal transduction, adhesion molecules, regulation of GTPase activity etc. in biological process. For cellular component, our results enrich in nucleus, plasma membrane, cytosol, intracellular, protein complex etc. For molecular function, our results focus on protein binding, ATP binding, DNA binding, protein kinase activity and so on. From pathway aspect, our results enrich in the Calcium signaling pathway, PI3K signaling pathway, cAMP signaling pathway, mTOR signaling pathway.

Cociter analysis

Because the benchmark cancer driver genes are incomplete, to further prove the prediction capability of our method in distinguishing potentially cancer driver genes, we adopted the literature mining method to figure out the co-citation times of the predicted driver genes with the keywords ‘cancer type’(i.e. breast, prostate or lung), ‘driver’ and ‘cancer’ in the cociter website [25]. The larger the number of times the gene co-appeared with the keywords, the stronger associations between them. In this study, Tables 1, 2 and 3 show the cociter analysis of top 30 of genes identified by our method for each cancer type. In order to illustrate the capability of our method to prioritize significant well-known cancer driver genes, we also listed genes ranking position in other five methods.

Table 1 Cociter analysis of top 30 lung cancer driver genes identified by our method
Table 2 Cociter analysis of top 30 prostate cancer driver genes identified by our method
Table 3 Co-citer analysis of top 30 breast cancer driver genes identified by our method

For lung cancer, Table 1 shows some well-studied cancer driver genes were ranked in the top 30 by our methods, but were put in the latter positions by other methods. For example, Phosphatidylinositol 3-kinases (PI3Ks) are well known regulators of cellular growth and proliferation. It was ranked 12th by our method while ranked 112th by Dawnrank, 430th by Muf_max, 81th by Muf_sum. Toll-like receptor-4 (TLR4) in human tumors often correlates with chemoresistance and metastasis [43] which was ranked 13th by our method, ranked 71th by Diffusion algorithm while ranked 672th by Muf_max and 138th by Muf_sum. The oncogenic BRAF(V600E) mutation results in an active structural conformation characterized by greatly elevated ERK activity [44]. It was identified as the known cancer driver genes but ranked 70th, 75th, 155th, 392th and 150th by Diffusion, DriverNet and DawnRank, Muf_max and Muf_sum respectively. Our method can not only prioritize the significant cancer driver genes but also identify some potential cancer driver genes which were neglected by the NCG 4.0 such as the NES, MET and HGF. Especially for the MET, some researchers found that high MET gene copy number leads to shorter survival in patients with non-small cell lung cancer. MET co-existed with key words, ‘cancer’, ‘lung’ and ‘driver’ for 1045, 348 and 40 times.

For the prostate cancer as shown in Table 2, our method also identified some high-ranking significant driver genes, including TP53, CTNNB1, PTEN, PIK3CA and so on. What we want to mention is the famous tumor suppressor PTEN which is frequently inactivated in human prostate cancer [45]. It was ranked 6th by our method but strangely put in the 700th by Diffusion algorithm, 94th by DriverNet and even neglected by DawnRank. Furthermore, the results show DawnRank missed more than one significant cancer driver genes including PTEN, PIK3CA and AKT1. BRAF which involves in prostate related RAS/RAF/ERK signaling pathway [28] was ranked 13th by our methods while 326th by Diffusion algorithm, 63th by DriverNet, 36th by DawnRank, 348th by Muf_max and 34th by Muf_sum. Besides, some high associated genes ignored by NCG 4.0 are also ranked in the top list of our method. The ATM (ataxia telangiectasia mutated) kinase plays an essential role in maintaining genome integrity by coordinating cell cycle arrest, apoptosis, and DNA damage repair [46]. It was missed by the NCG 4.0 but co-appeared with ‘cancer’ for 1377 times, with ‘prostate’ for 61 times and with ‘driver’ for 5 times. Forkhead box protein A1 (FOXA1) modulates the transactivation of steroid hormone receptors and thus may influences tumor growth and hormone responsiveness in prostate cancer [47]. It was ranked 8th by our method while neglected by NCG 4.0. In addition, the transcription factors SP1 also has been missed by NCG 4.0.

For breast cancer in Table 3, our method successfully achieved a high precision in identifying the top 10 cancer driver genes with 8 out of 10 accuracy rates. The well-studied breast cancer driver genes including TP53, PIK3CA, MAP 3 K1, CDH1, ERBB2 and PTEN were also put in the top list of our method. Among those known breast cancer driver genes, the top three cancer driver genes (TP53, PIK3CA, MAP 3 K1) identified by our methods were ranked 233th, 156th and 128th respectively by Diffusion algorithm. The HER2 (official name is ERBB2) gene encodes a membrane receptor in the epidermal growth factor receptor family amplified and over expressed in adenocarcinoma [48]. It was regarded as the important cancer driver gene by many researchers and ranked 6th by our method while 72th, 64th, 90th, 73th by Diffusion algorithm, DriverNet, DawnRank and Muf_sum respectively. The breast cancer suppressor gene PTEN was ranked 14th by our method while 185th, 98th, 93th and 79th by Diffusion, DriverNet, DawnRank and Muf_sum receptively. Besides, the BRCA1 and JAK2 that co-cited with ‘cancer’ and ‘breast’ for many times were also missed by the DawnRank.

Discussion

The core step to overcome cancer is to identify the cancer driver genes which can promote cancer evolvement and development. However, it is a hard task since cancer is heterogeneous and there are too much irrelevant passenger genes. Recently, many methods try to shorten the distance to the truth. However, these methods still have some limitations. For example, they ignored many driver genes with low variation frequency and highly depend on the error-prone network. Inspired by the fact that cancer genes forming functional modules tend to be expressed similarly in the same tissue, we considered to improve the reliability of the gene functional interaction network by incorporating the expression similarity between mutated gene pairs in the cancers’ related tissues. In order to obtain the tissue-specific expression profiles, we used the GEO database. Because GEO database is currently the largest and most famous expression data platform which stores relatively complete expression data. The GEO dataset which we used in this work was consisted of a total of 677 patients, including cancer and normal patients, covered over 90 distinct tissue types and was created by the same organization using the same experimental technology. Although our model is superior to the other methods, it still has some limitations. For example, the datasets used in this work come from different projects: TCGA and GEO. Although, we just use the GEO dataset to calculate the co-expression values of mutated genes in a specific tissue. The likelihood is that there exists ambiguous since the heterogeneous within different patients. Therefore, in order to release this concern, in the future, we consider to unify the dataset as far as possible.

Conclusion

In this work, we proposed a new method to identify cancer driver genes by integrating the gene dysregulated expression, tissue-specific expression and variation frequency into the functional interaction network. Compared to other network-based methods, our method not only considered that driver genes have impact on the expression of downstream genes, but also took advantage of the modularity property of driver genes, their co-expression in specific tissues and itself variation frequency. We compared our results with other four similar methods and did cociter analysis and enrichment analysis. From the results, we can easily draw the conclusion that our method has the capability to identify the cancer driver genes with high precision and meanwhile detect some potential unknown cancer driver genes. Besides, the enrichment analysis also illustrates that the top ranking cancer driver genes in our list enrich in some significant cancer-related pathways and implement important functions [48].

Availability of data and materials

The source code and datasets used in this research can be downloaded from https://github.com/weiba/DyTidriver.

Abbreviations

ECC:

edge clustering coefficient

GEO:

Gene Expression Omnibus

human FIN:

human functional interaction network

ICGC:

Cancer Genome Consortium

PCC:

Pearson correlation coefficient

TCGA:

The Cancer Genome Atlas

References

  1. 1.

    Baker SG. The questionable premises underlying the search for cancer driver mutations and cancer susceptibility genes. Organisms Journal of Biological Sciences. 2017;1(1):3–4.

    Google Scholar 

  2. 2.

    Kumar S, Warrel J, Mcgillivray P, Meyerson W, Li S, Salichos L, et al. Passenger mutation landscape in cancer genomes. AACR. 2018:1279–9.

  3. 3.

    Yates LR, Campbell PJ. Evolution of the cancer genome. Nat Rev Genet. 2012;13(11):795.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  4. 4.

    Collier O, Stoven V, Vert J-P. LOTUS: a single-and multitask machine learning algorithm for the prediction of cancer driver genes. PLoS Comput Biol. 2019;15(9):e1007381.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  5. 5.

    Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, et al. Patterns of somatic mutation in human cancer genomes. Nature. 2007;446(7132):153.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  6. 6.

    Network CGAR. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455(7216):1061.

    Article  CAS  Google Scholar 

  7. 7.

    Consortium ICG. International network of cancer genome projects. Nature. 2010;464(7291):993.

    Article  CAS  Google Scholar 

  8. 8.

    Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  9. 9.

    Pon JR, Marra MA. Driver and passenger mutations in cancer. Annual Review of Pathology: Mechanisms of Disease. 2015;10:25–50.

    CAS  Article  Google Scholar 

  10. 10.

    Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, et al. Mutational landscape and significance across 12 major cancer types. Nature. 2013;502(7471):333.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  11. 11.

    Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer genome landscapes. Science. 2013;339(6127):1546–58.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Network CGAR. Integrated genomic and molecular characterization of cervical cancer. Nature. 2017;543(7645):378.

    Article  CAS  Google Scholar 

  13. 13.

    Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455(7216):1069.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Morin RD, Mendez-Lago M, Mungall AJ, Goya R, Mungall KL, Corbett RD, et al. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature. 2011;476(7360):298.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  15. 15.

    Paez JG, Jänne PA, Lee JC, Tracy S, Greulich H, Gabriel S, et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004;304(5676):1497–500.

    CAS  PubMed  Article  Google Scholar 

  16. 16.

    Chen P-L, Chen Y, Bookstein R, Lee W-H. Genetic mechanisms of tumor suppression by the human p53 gene. Science. 1990;250(4987):1576–80.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  17. 17.

    Gemignani ML, Schlaerth AC, Faina B, Barakat RR, Oscar L, Robert S, et al. Role of KRAS and BRAF gene mutations in mucinous ovarian carcinoma. Gynecol Oncol. 2003;90(2):378–81.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  18. 18.

    Schechter AL, Stern DF, Vaidyanathan L, Decker SJ, Drebin JA, Greene MI, et al. The neu oncogene: an erb-B-related gene encoding a 185,000-Mr tumour antigen. Nature. 1984;312(5994):513.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  19. 19.

    Reimand J. Candidate non-coding driver mutations in super-enhancers and long-range chromatin interaction networks across 1,800 whole cancer genomes. AACR. 2018:2354–4.

  20. 20.

    Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  21. 21.

    Banerji S, Cibulskis K, Rangel-Escareno C, Brown KK, Carter SL, Frederick AM, et al. Sequence analysis of mutations and translocations across breast cancer subtypes. Nature. 2012;486(7403):405.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  22. 22.

    Dees ND, Zhang Q, Kandoth C, Wendl MC, Schierding W, Koboldt DC, et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res. 2012;22(8):1589.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  23. 23.

    Cheng F, Zhao J, Zhao Z. Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes. Brief Bioinform. 2015;17(4):642–56.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  24. 24.

    Wei P-J, Zhang D, Xia J, Zheng C-H. LNDriver: identifying driver genes by integrating mutation and expression data based on gene-gene interaction network. BMC bioinformatics. 2016;17(17):467.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  25. 25.

    Vandin F, Upfal E, Raphael BJ. Algorithms for detecting significantly mutated pathways in cancer. J Comput Biol. 2011;18(3):507–22.

    CAS  PubMed  Article  Google Scholar 

  26. 26.

    Leiserson MD, Vandin F, Wu H-T, Dobson JR, Eldridge JV, Thomas JL, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet. 2015;47(2):106.

    CAS  PubMed  Article  Google Scholar 

  27. 27.

    Hofree M, Shen JP, Carter H, Gross A, Ideker T. Network-based stratification of tumor mutations. Nat Methods. 2013;10(11):1108.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  28. 28.

    Cho A, Shim JE, Kim E, Supek F, Lehner B, Lee I. MUFFINN: cancer gene discovery via network analysis of somatic mutation data. Genome Biol. 2016;17(1):129.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  29. 29.

    Bashashati A, Haffari G, Ding J, Ha G, Lui K, Rosner J, et al. DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol. 2012;13(12):R124.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  30. 30.

    Shi K, Gao L, Wang B. Discovering potential cancer driver genes by an integrated network-based approach. Mol BioSyst. 2016;12(9):2921–31.

    CAS  PubMed  Article  Google Scholar 

  31. 31.

    Hou JP, Ma J. DawnRank: discovering personalized driver genes in cancer. Genome medicine. 2014;6(7):56.

    PubMed  PubMed Central  Article  Google Scholar 

  32. 32.

    Guo W-F, Zhang S-W, Liu L-L, Liu F, Shi Q-Q, Zhang L, et al. Discovering personalized driver mutation profiles of single samples in cancer by network control strategy. Bioinformatics. 2018;34(11):1893–903.

    CAS  PubMed  Article  Google Scholar 

  33. 33.

    Ganegoda GU, Wang J, Wu F-X, Li M. Prediction of disease genes using tissue-specified gene-gene network. BMC Syst Biol. 2014;8(3):S3.

    PubMed  PubMed Central  Article  Google Scholar 

  34. 34.

    Lage K, Hansen NT, Karlberg EO, Eklund AC, Roque FS, Donahoe PK, et al. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc Natl Acad Sci. 2008;105(52):20870–5.

    CAS  PubMed  Article  Google Scholar 

  35. 35.

    Wu G, Feng X, Stein L. A human functional protein interaction network and its application to cancer data analysis. Genome Biol. 2010;11(5):R53.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  36. 36.

    Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, et al. A census of human cancer genes. Nat Rev Cancer. 2004;4(3):177.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  37. 37.

    An O, Pendino V, D’Antonio M, Ratti E, Gentilini M, Ciccarelli FD: NCG 4.0: the network of cancer genes in the era of massive mutational screenings of cancer genomes. Database. 2014;2014.

  38. 38.

    Wang J, Li M, Chen J, Pan Y. A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2011;8(3):607–20.

    Article  Google Scholar 

  39. 39.

    Peng W, Li M, Chen L, Wang L. Predicting protein functions by using unbalanced random walk algorithm on three biological networks. IEEE/ACM transactions on computational biology and bioinformatics. 2017;14(2):360–9.

    PubMed  Article  PubMed Central  Google Scholar 

  40. 40.

    Peng W, Wang J, Zhao B, Wang L. Identification of protein complexes using weighted PageRank-nibble algorithm and core-attachment structure. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2015;12(1):179–92.

    Article  Google Scholar 

  41. 41.

    Song J, Peng W, Wang F: An Entropy-based method for identifying mutual exclusive driver genes in cancer. IEEE/ACM transactions on computational biology and bioinformatics. 2019.2.7;https://doi.org/10.1109/TCBB.

  42. 42.

    Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2008;4(1):44.

    Article  CAS  Google Scholar 

  43. 43.

    Rajput S, Volk-Draper LD, Ran S. TLR4 is a novel determinant of the response to paclitaxel in breast cancer. Mol Cancer Ther. 2013;12(8):1676–87.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  44. 44.

    Lee MH, Lee SE, Kim DW, Ryu MJ, Kim SJ, Kim SJ, et al. Mitochondrial localization and regulation of BRAFV600E in thyroid cancer: a clinically used RAF inhibitor is unable to block the mitochondrial activities of BRAFV600E. The Journal of Clinical Endocrinology & Metabolism. 2011;96(1):E19–30.

    CAS  Article  Google Scholar 

  45. 45.

    Kwak MK, Johnson DT, Zhu C, Lee SH, Ye D-W, Luong R, et al. Conditional deletion of the Pten gene in the mouse prostate induces prostatic intraepithelial neoplasms at early ages but a slow progression to prostate tumors. PLoS One. 2013;8(1):e53476.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  46. 46.

    Fan C, Quan R, Feng X, Gillis A, He L, Matsumoto ED, et al. ATM activation is accompanied with earlier stages of prostate tumorigenesis. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research. 2006;1763(10):1090–7.

    CAS  Article  Google Scholar 

  47. 47.

    Gerhardt J, Montani M, Wild P, Beer M, Huber F, Hermanns T, et al. FOXA1 promotes tumor progression in prostate cancer and represents a novel hallmark of castration-resistant prostate cancer. Am J Pathol. 2012;180(2):848–61.

    CAS  PubMed  Article  Google Scholar 

  48. 48.

    Wen W, Xiao N, Bender R, Ghazalpour A, Tan Z, Swensen J, et al. Mutations in the kinase domain of the HER2/ERBB2 gene identified in a wide variety of human cancers. The Journal of Molecular Diagnostics. 2015;17(5):487–95.

    CAS  PubMed  Article  Google Scholar 

Download references

Acknowledgements

We appreciate Dr. Shi for providing the Diffusion algorithm’s source code and related datasets and the reviews for their helpful comments and constructive suggestions.

About this supplement

This article has been published as part of BMC Medical Genomics, Volume 12 Supplement 7, 2019: Selected articles from the 14th International Symposium on Bioinformatics Research and Applications (ISBRA-18): medical genomics. The full contents of the supplement are available at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-12-supplement-7

Funding

This work is supported in part by the National Natural Science Foundation of China under grant No.31560317, No.61972185, 61702122, No.61472133 and No.81560221. Natural Science Foundation of Yunnan Province of China (No. 2019FA024), Yunnan Ten Thousand Talents Plan young. The funders played roles in the design of study, collection, analysis and interpretation of data, writing the manuscript and will pay the publish fee of the manuscript.

Author information

Affiliations

Authors

Contributions

JS collected the datasets of three cancers, human functional interaction network and tissue specific information. WP and JS designed the Dytidriver method and analyzed the result. JS, WP and FW discussed extensively about this study and drafted the manuscript together. JS, WP, FW and JW participated in revisiting and modifying the draft. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Wei Peng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Song, J., Peng, W., Wang, F. et al. Identifying driver genes involving gene dysregulated expression, tissue-specific expression and gene-gene network. BMC Med Genomics 12, 168 (2019). https://doi.org/10.1186/s12920-019-0619-z

Download citation

Keywords

  • Driver genes
  • Dysregulated expression
  • Tissue-specific expression
  • Human functional interaction network
  • Variation frequency