Identification of lung cancer gene markers through kernel maximum mean discrepancy and information entropy

Background The early diagnosis of lung cancer has been a critical problem in clinical practice for a long time and identifying differentially expressed gene as disease marker is a promising solution. However, the most existing gene differential expression analysis (DEA) methods have two main drawbacks: First, these methods are based on fixed statistical hypotheses and not always effective; Second, these methods can not identify a certain expression level boundary when there is no obvious expression level gap between control and experiment groups. Methods This paper proposed a novel approach to identify marker genes and gene expression level boundary for lung cancer. By calculating a kernel maximum mean discrepancy, our method can evaluate the expression differences between normal, normal adjacent to tumor (NAT) and tumor samples. For the potential marker genes, the expression level boundaries among different groups are defined with the information entropy method. Results Compared with two conventional methods t-test and fold change, the top average ranked genes selected by our method can achieve better performance under all metrics in the 10-fold cross-validation. Then GO and KEGG enrichment analysis are conducted to explore the biological function of the top 100 ranked genes. At last, we choose the top 10 average ranked genes as lung cancer markers and their expression boundaries are calculated and reported. Conclusion The proposed approach is effective to identify gene markers for lung cancer diagnosis. It is not only more accurate than conventional DEA methods but also provides a reliable method to identify the gene expression level boundaries.


Background
Small-cell lung carcinoma (SCLC) and non-small-cell lung carcinoma (NSCLC) are two main types of lung cancer, comprising the majority of clinic cases [1]. As the most common cancer, lung cancer is the leading cause of cancer-related deaths all over the world [2,3]. However, most lung cancer cases were diagnosed in a very late stage when symptoms like coughing, coughing up blood, development are possible to diagnose cancer and distinguish the specific cancer sub-types [8][9][10]. Researchers have explored to identify efficient biomarkers from these molecules as the indicator of the pathogenic process to improve the diagnosis sensitivity [11]. These explorations are mainly focused on genetic mutations, DNA methylation profile, miRNA synthesis profile and especially blood proteins [12][13][14][15][16][17][18][19]. Till now, panels of protein markers have been identified and intensively used in clinic applications. For example, the combinations of CEACAM, CYFRA 21-1, ProGRP, CA125, NSE (neuron-specific enolase) and NY-ESO (cancer-testis antigen) are popular lung cancer diagnosis markers [20][21][22][23][24]. Recently, researchers also discovered that β-chain of human haptoglobin [25], SAA (serum amyloid A) [26], APOA1 (apolipoprotein A-1) [27] and some other proteins [28] may be potential biomarkers. Despite the advances in protein marker discovery, some disadvantages of protein markers are still existing, like genetic heterogeneity of tumors, poor reproducibility of laboratory test and low concentration of the proteins [18,29]. Recent years the next-generation sequence technologies have promoted the study of disease-related genomes. Projects like The Cancer Genome Atlas (TCGA) [30] and the Genotype-Tissue Expression (GTEx) [31] have collected a large number of sequencing experiments and provided tissue-specific gene expression data in public. As some genes have distinct expression levels between normal and tumor tissues for the reason of disease development, they are promising to diagnose lung cancer more timely and accurately.
During the past years, gene differential expression analysis (DEA) has been extensively applied in the preprocess of high-throughput profiling data collected from microarrays [32][33][34]. Based on statistical models, researchers developed tools to identify genes which had distinct expression levels between different experiment groups. Compared with the microarray data, the RNA-seq raw data comes with the unique feature of discrete reads which should be analyzed under an appropriate statistical hypothesis [35]. According to the statistical hypothesis, the existing RNA-seq analysis models can be categorized into Poisson model, negative binomial model, betabinomial model, and Bayesian model [36,37]. These models can tell whether the gene expression levels are the same between experiment groups and calculate a confidence coefficient scores (also named p-value) suggesting the magnitude of expression difference.
In cancer studies, the histologically normal tissue adjacent to tumor is usually used to compare with the tumor tissue under the assumption that they are the same with real healthy tissues This approach allows researchers to compare samples from the same patient and reduce the individual specific effects. However, recent studies have deepened our understanding about NAT tissue, indicating that NAT is not exactly equal to the real healthy tissue [38]. In NAT tissues, the specific micro-environment surrounding tumor makes the change of gene expression in various pathways that are related to disease development. In order to identify efficient and meaningful marker genes, we proposed to detect differentially expressed genes(DEGs) from real normal, NAT and tumor tissues.
Here, we present a novel approach to identify genes markers for lung cancer with kernel maximum mean discrepancy (MMD) and Information Entropy. As mentioned above, the conventional DEA methods can calculate a p-value to evaluate the expression difference based on certain statistical hypothesis, but it's hard to decide which distribution assumption is correct before calculation. Inspired by the distribution measure method of transfer learning, we use the kernal MMD to detect DEGs between tumor, NAT and normal tissues. This method can output the maximum mean discrepancy score which indicates the degree of differential expression which does not require a statistical hypothesis on data distribution. Besides, although the p-value of conventional techniques can identify DEGs, it is essential to define a threshold of expression level to distinguish different types of tissue. Commonly, Researchers would like to take the upper boundary of lower expressed tissue or lower edge of higher expressed tissue as the threshold when there is a distinct expression gap. But this kind of gap is not always existing and then the threshold is hard to define. As the gene expression level is continuous data and how to choose a definite threshold point is a tough task. Here we applied the information theory to solve this problem.
In this paper, we firstly evaluate the expression level difference of 23368 genes in normal, normal adjacent tumor and tumor tissues with the kernel maximum mean discrepancy. Then the top-ranked genes selected by kernel MMD method are compared with genes selected by two conventional DEA methods, t-test and fold change. Then GO and KEGG pathway enrichment analysis are conducted to analyze the top 100 genes ranked by average MMD scores. Lastly, the top 10 genes are selected as marker genes for lung cancer and their expression boundaries between normal, NAT and tumor tissues are identified by the proposed information theory method.

Dataset
Three gene expression datasets used in this paper are collected from different tissue types in reference [38], containing the expression data of 23368 genes. Dataset 1 includes the gene expression data of 373 normal healthy samples. The raw reads file of dataset 1 is obtained from the GTEx program (phs000424.v6.p1, 18 November 2015). Dataset 2 has 59 NAT tissues, while dataset 3 has 541 lung cancer tumor tissues. Their raw feature counts and FPKM values are original from NCBI Gene Expression Omnibus (GEO) [39]. Since the raw values are from different data sources, the RNA-sequencing raw reads files were processed and normalized with the Rsubread package and aligned to the UCSC hg19 reference genome with the same pipeline. The processed GTEx expression profiles of dataset 1 are available in GEO under an accession number GSE86354 and other two datasets are deposited as GSE62944.

Gene marker identification framework
With the above three datasets, we apply a novel approach to detect DEGs and determine the expression boundaries between normal, NAT and tumor cells as the criterion of lung cancer diagnosis.
In our method, there are mainly four steps: First, the kernel Maximum Mean Discrepancy is used to identify DEGs between two types of tissues respectively and genes are ranked by the MMD values; Second, the genes with top average MMD rankings are selected from all genes; Third, genes selected from the previous step are put into KEGG pathway analysis and GO enrichment analysis to validate the efficiency of those gene markers; Last, we define the gene expression boundaries for the top 10 marker genes with information gain theory. The whole framework of the proposed approach is illustrated in Fig. 1.

Kernel maximum mean discrepancy
The problem of comparing the probability distribution between two sample groups, also referred to as two-sample problem, widely exists in data science areas. In bioinformatics field, this problem is extensively existing in micro-array data analysis, database attribute matching, data integration from different platforms and so on. The key point of two-sample problem is how to determine if two groups of observations are from the same distribution and some statistical test methods were applied to address that in previous researches.
However, these methods have different statistical modelings based on specific assumptions of data distribution, which is commonly unknown before calculation in practical use. In some previous studies, researchers have explored to using the kernel Maximum Mean Discrepancy (MMD) method to test the distribution difference in RNA-Transcript expression and pathway differential expression and achieved better performance than traditional statistical tests [40,41].Here, we adopt kernal MMD to identify the DEGs in lung cancer gene expression data.
Give F to be a class of functions f : χ → R. Two samples X = {x 1 , x 2 , . . . x m } and Y = y 1 , y 2 , . . . y n are drawn form two probability distribution p and q, respectively. The empirical estimation of MMD value is as following [42]: (1)

Fig. 1 Gene marker identification framework
As the definition above, if the function class F is rich enough, the value of MMD will be zero if and only if p=q. But a too rich F will lead to that MMD differs from zero for most finite sample estimates. Thus some restrictions ought to be placed on the function class. One trade-off way is to set F as the unit ball in a universal reproducing kernel Hilbert space H, defined on the compact metric space χ. Since H is a complete inner product space of Then MMD can be rewritten as: Then we can calculate like the following function: As the inner product can be replaced by Gaussian ker- can be figured out as: In our method, the minimum variance unbiased estimate of MMD value is obtain according to the above functions based on Shogun package in python [43]. The computational complexity of MMD method is O(n 2 ). The MMD score can evaluate the gene expression difference between different sample types, while a higher MMD score means greater gene expression level difference.

Boundary discovery method
As a biomarker, there should be an expression threshold for the marker gene as the indicator for disease diagnosis. If the gene expression level is proved to be different in normal and tumor tissues, it is necessary to define a threshold of expression level as the boundary. When the gene expression level has a distinct gap between normal and tumor samples, the threshold is commonly the lower or upper boundary of this gap. However, the expression level does not have that kind of obvious gap all the time, thus how to define a reliable boundary is challenging in these cases.
Here we propose to identify the threshold with information theory which has been widely used in decision tree algorithms for classification problems. According to the information theory, the change of information entropy which is also named information gain can evaluate the classification efficiency of a threshold point. If there is the expression data of a gene from m normal samples and n tumor samples in dataset D, p m and p n refer to the proportions of normal and tumor samples in all samples, then the original entropy of D is defined as: In the boundary identification, all samples are reclassified by the gene expression level with a split point of x and D v denotes the new dataset re-classed by x. Then the information gain of this split point can be computed as: Different from discrete data, the expression level is continuous and it is inappropriate to use the expression level values in samples as the split points. Besides, as the distribution of the expression level is also unknown, we cannot use the probability function to calculate the entropy. To address this problem, we propose to deal with continuous data like discrete data: First, the expression level values are sorted from small to large and the middle points between two expression level values are taken as the split points; Second, we calculate the information gain of the split points respectively and choose the point that has the highest information gain as the boundary. The algorithm of expression boundary identification with information theory is illustrated in Algorithm S1 in Additional file 2.

GO and KEGG enrichment analysis
The GO enrichment analysis is the major gene-annotation analysis method based on the Gene Ontology resource, describing the gene function at a molecular level. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis has been widely used to model and simulate the molecular interactions and reaction networks in system biology. In this paper, these two methods are applied to figure out the molecular functions of identified potential marker genes and validate whether these genes are related to lung cancer. Here the enrichment analysis methods are both implemented based on the R package called ClusterProfiler developed by Guangchuang Yu's team [44]. The GO terms and enriched pathways are all filtered with the p-value <0.05.

Conventional DEA method and machine learning evaluation
In this work, two conventional differentially expressed gene analysis methods, t-test and fold change, are compared with the proposed kernel MMD. The t-test is completed based on a python package called 'Scipy' [45]. The fold change is calculated as below: Where E1 and E2 are the average of gene expression level in two different issue types. The p-value, fold change value and MMD score are calculated for every single gene in our datasets. Then genes are ranked with the same strategy and top ranking genes are regarded as potential markers. Here a 10-fold cross validation based on the random forest classifier is applied to evaluate the efficiency of these top genes under four frequently used metrics: recall, F1-score, accuracy and Matthews correlation coefficient (MCC).

Results
In the first part, we present the genes ranking with kernel MMD score and analysis the gene expression difference between different issue types. Then the top ranked genes are reported and compared with those genes identified by conventional t-test and fold change methods. The third part shows the results of GO and KEGG pathway analysis of the top ranked genes. At last, we choose the top ten genes of average ranking as marker gene and identify the expression boundaries of these gene markers with information gain theory.

Gene differential expression between different tissue types
For the three mentioned datasets, kernel MMD values are calculated on each two of them respectively to discover DEGs. For every single gene, we calculate three MMD values which are from Normal-NAT, Normal-Tumor and NAT-Tumor groups. The MMD scores indicate the difference of expression levels among three types of samples. The top 10 ranked genes in each group are shown in Table 1. As illustrated in the table, the top MMD scores in Normal-Tumor group are over 200, which are much higher than the other two groups. The Normal-NAT group has comparable MMD scores with NAT-Tumor group. It is clear that gene expression level difference in normal-tumor group is much greater than other two groups.
In addition, the NAT samples have different expression profiles from not only tumor samples but also the real healthy samples. Since the NAT samples are always considered as healthy samples in the state-of-art researches, we test the top 10 ranked genes selected by NAT-Tumor group, Normal-Tumor group and their average ranking to explore the influence of regarding NAT as real normal samples. To evaluate the effectiveness of selected genes, the expression data of the top genes above is applied to classify tumor samples from other samples via 10-fold cross-validation. The results of the 10-fold cross validation are reported in Table 2.
As shown in Table 2, the selected genes from each group can classify tumor samples from other samples. However, the performance of the three groups of genes varies greatly. When considering normal samples and NAT samples together, the top average ranked genes have the best scores under all metrics with an accuracy of 0.9907. The highest F1 score of 0.9914 implies that these genes also have a better classification balance. The results show that  the real normal samples and NAT samples are not exactly the same. Researchers should take both of them into consideration in cancer study rather than simply replacing real normal samples with NAT samples. The detailed results of differential expressing gene identification conducted by MMD and other two conventional methods are listed in Additional file 1.

Identify marker genes for lung cancer development
In this work, two conventional DEA methods t-test and fold change are compared with our approach. T-test and fold change methods are both applied to identify DEGs between different tissue types. The p-value of t-test and fold change value are calculated to evaluate the gene expression difference. Since the ability to detect tumor samples is more significant in clinical application, the top 10 genes of average rankings from Normal-Tumor group and NAT-tumor group selected by t-test and fold change are compared with the genes selected by our method. Another 10-fold cross validation is conducted and the results are reported in Table 3. As shown in Table 3, the proposed kernel MMD method outperforms other two conventional methods under all metrics with the recall of 0.9885, F1 score of 0.9914, accuracy of 0.9907 and MCC of 0.9816. The fold change method has the worst performance and the selected genes by fold change method are not efficient enough to classify tumors from other samples. The t-test has a comparable result with MMD method. Since the t-test and fold change methods have been widely used, the kernel MMD method is promising to improve the differential gene analysis efficiency in practical use.
From Table 1, we can see there are some overlapping genes like LOC442459, LOC100132831, LOC401127, CSNK1A1P1, CSNK1A1L and PIN1P1 in Normal-NAT group and Normal-Tumor group. These genes can distinguish normal samples from not only NAT samples, but also tumor samples. Inspired the previous part, the average ranking of all groups can help to identify more significant genes. Thus, the gene average ranking of the three groups is calculated and top genes of average ranking are chosen to be potential marker genes to diagnose lung cancer. In Fig. 2, expression levels in normal, NAT and tumor samples of the top 4 genes of average ranking are presented. From the figure, the four genes exactly have distinct expression levels in different types of tissues.

GO and KEGG pathway enrichment
From the average ranking gene list, we choose the top 100 genes to conduct the GO and KEGG pathway enrichment analysis. In the GO enrichment analysis, we select 'Biology Process' as the enrichment target, and there are 12 GO terms with p-value <1.0e-04 and count ≥5. As shown in Table 4, the top two terms, 'GO:0051480' and 'GO:0007204' , are both related to the regulation factors of cytosolic calcium ion concentration while term No.5 and No.6 are also involved in cellular calcium ion homeostasis. The influence of calcium ion channels on lung cancer has been studied for a long time [46][47][48], and the cellular calcium ion level change has been explored in lung cancer development [48]. It is suggested that these calcium ion regulation related genes are significant in lung cancer. The results of KEGG pathway enrichment analysis are illustrated in Fig. 3. There are 20 pathways with a p-value below 0.05 and count number over 2. The adrenergic signaling pathway and the cGMP-PKG signaling pathway are the most significant pathways. Currently, the role of adrenergic signaling pathways plays in lung cancer development have not been fully studied. However, the β-adrenergic signaling have been found to be a possible novel cancer therapy in tumor cells [49]. Besides, some researches have made some explorations about that [50]. The second top significant pathway is the cGMP-PKG signaling pathway which mediates the action of cellular ion concentration and sensitivity, influencing cell proliferation. The regulation relationship between cGMP-PKG signaling pathway and lung cancer has been studied in [51]. The results of GO and KEGG pathway enrichment analysis show that the top gene selected by MMD method is indeed highly related to lung cancer.

Expression boundary identification
Although the conventional methods can detect the differential expressed gene, they can only manually define the expression boundary when there is a distinct expression level gap. After selecting lung cancer marker genes, we identify the expression boundaries between normal,NAT and tumor with the mentioned information theory method. Here the top 10 genes in MMD average ranking list are chosen as the lung cancer marker genes and the expression boundaries of them are illustrated in  Table 5. The identified expression boundaries of all genes are reported in Additional file 1.
As shown in Table 5, the ten gene markers have a distinct expression range in normal, NAT and tumor samples, which can be an indicator of lung cancer development. Additionally, in practical clinic application, the boundary between tumor and other tissues is the most significant for disease diagnosis.The boundary between normal samples and NAT samples also implied that there would be some gene expression changes in the disease

Discussion
Since the early-diagnosis of lung cancer has been a longterm critical problem in clinical practice, researchers have explored various types of biomarkers, like genetic mutations, blood proteins. Here, this paper proposed a novel method to identify genes markers for lung cancer. There are two main problems in efficient gene markers identification: first, how to evaluate the gene expression difference; second, how to find the reliable expression boundary between tumor and other samples. The most existing DEA methods were built to solve the first problem, but they can only give out a p-value to assess the differential expressing gene without defining the expression boundary. The ln of this research is to address both of the problems in biomarker identifications. The gene markers are given out based on the existing lung cancer dataset. We think there are two limitations in our work. First, a larger dataset can help to obtain more accurate results; Second, a threshold of MMD value to define the differentially expressed gene can be defined with a large dataset, while here we just take the top ranked genes as potential marker genes.

Conclusion
In this paper, we not only proposed a more efficient method, kernel MMD, to evaluate the expression changes, but also provide a information theory based algorithm to identify the gene expression boundary. The experiment results show our method can select more significant genes than traditional methods and give out the expression boundary of the marker gene. Through the GO and KEGG pathway enrichment analysis, the function of marker genes in lung cancer is studied, and these marker genes are indeed related to lung cancer development. In the future, we will collect more gene expression data related to lung cancer and calculate more accurate results. In addition, we will explore the application of our method on biomarker discovery for other diseases.

Additional file 1:
The results of DEA and boundary identification. In Additional file 1, the results of MMD, t-test and fold change analysis between Normal-NAT, Normal-Tumor and NAT-Tumor are reported. The mmd score, p-value and fold change score for every single gene are all presented. Besides, the boundary identification results are also included in this file.

Additional file 2:
The supplementary files. In Additional file 2, the gene differential expression boundary identification algorithm is included.

Abbreviations
DEA: Differentially expressed analysis; GO: Gene ontology; KEGG: Kyoto encyclopedia of genes and genomes; MMD: Maximum mean discrepancy; NAT: Normal adjacent to the tumor