Label propagation-based semi-supervised feature selection on decoding clinical phenotypes with RNA-seq data

Background Clinically, behavior, cognitive, and mental functions are affected during the neurodegenerative disease progression. To date, the molecular pathogenesis of these complex disease is still unclear. With the rapid development of sequencing technologies, it is possible to delicately decode the molecular mechanisms corresponding to different clinical phenotypes at the genome-wide transcriptomic level using computational methods. Our previous studies have shown that it is difficult to distinguish disease genes from non-disease genes. Therefore, to precisely explore the molecular pathogenesis under complex clinical phenotypes, it is better to identify biomarkers corresponding to different disease stages or clinical phenotypes. So, in this study, we designed a label propagation-based semi-supervised feature selection approach (LPFS) to prioritize disease-associated genes corresponding to different disease stages or clinical phenotypes. Methods In this study, we pioneering put label propagation clustering and feature selection into one framework and proposed label propagation-based semi-supervised feature selection approach. LPFS prioritizes disease genes related to different disease stages or phenotypes through the alternative iteration of label propagation clustering based on sample network and feature selection with gene expression profiles. Then the GO and KEGG pathway enrichment analysis were carried as well as the gene functional analysis to explore molecular mechanisms of specific disease phenotypes, thus to decode the changes in individual behavioral and mental characteristics during neurodegenerative disease progression. Results Large amounts of experiments were conducted to verify the performance of LPFS with Huntington’s gene expression data. Experimental results shown that LPFS performs better in comparison with the-state-of-art methods. GO and KEGG enrichment analysis of key gene sets shown that TGF-beta signaling pathway, cytokine-cytokine receptor interaction, immune response, and inflammatory response were gradually affected during the Huntington’s disease progression. In addition, we found that the expression of SLC4A11, ZFP474, AMBP, TOP2A, PBK, CCDC33, APSL, DLGAP5, and Al662270 changed seriously by the development of the disease. Conclusions In this study, we designed a label propagation-based semi-supervised feature selection model to precisely selected key genes of different disease phenotypes. We conducted experiments using the model with Huntington’s disease mice gene expression data to decode the mechanisms of it. We found many cell types, including astrocyte, microglia, and GABAergic neuron, could be involved in the pathological process.


Background
Neurodegenerative disease is a type of chronic progressive disease with complex pathogenic mechanisms caused by neuronal degeneration, leading to abnormal behavior, mental dysfunction and ultimately death [1][2][3]. Motor ability, cognitive ability, memory ability and other functions are gradually impaired during the disease progression [4,5]. It has been reported that there are many pathogenic factors of neurodegenerative disease, such as neurotrophasthenia, impairments of axon transmission, impairments of metabolic pathways, protein misfolding, inflammation, and intestinal microorganism [6][7][8][9]. However, single pathogenic factor cannot fully explain the pathogenesis of the disorder. The pathogenesis is still not well understood, and there is no effective treatment for it.
Meanwhile, Huntington's disease (HD) is a representative neurodegenerative disease, which is caused by a triplet (CAG) repeat elongation in huntingtin (HTT) gene on chromosome 4 that codes for polyglutamine in the huntingtin protein [10]. The mutant protein can enter the nucleus and alter gene transcription [11]. With the accumulation of the mutant protein, numerous interactions between molecules and pathways can be affected, resulting in neuronal dysfunction and degeneration [12,13]. With the connections between neurons get sparse, the neurons finally died during the disease deterioration, and the volume of striatum tissue decreased markedly [14]. Clinically, motor ability, cognitive, and mental functions are gradually affected.
With the rapid development of high-throughput sequencing technology, large amounts of omics data and biomedical data have been accumulated, providing both opportunities and challenges to develop computational methods for mining biomarkers, such as functional elements and locus in DNA sequences. Further decoding regulatory relationships of those biomarkers to clinical phenotypes is helpful for understanding physiopathologic mechanisms under the abnormal behavior, promoting early diagnosis and interventional treatment for neurodegenerative disease. Generally, at the transcriptomic level, researchers select key genes affected by diseases based on the hypothesis that disease genes tend to differentially expressed between case samples and normal samples. Nevertheless, the relationship between genes and their functions is complex and multifaceted, namely the same gene can play a role in many different functions. In living organisms, genes interact with each other to produce high-level biological functions, such as motor ability, cognitive ability, memory, emotion, etc. It has been well established that genes that have synergistic effects usually have similar expression patterns, and participate in a same biochemical reaction or in a same pathway [15]. Therefore, searching for gene clusters that are severely affected, and analyzing the biological pathways involved in can be helpful to understand the dynamic molecular process during the degeneration of the disease. The screened key genes and pathways can further be used to decode molecular mechanisms related to clinical abnormal behaviors.
Because of the critical of some essential genes, the annotation of many genes that maintain the normal function of central nervous system is still unclear [16]. Besides, our previous studies shown that the expression level of most lethal phenotype genes are not significantly changed during Huntington's disease degeneration [17,18]. Therefore, traditional statistical-based differentially expressed gene selection methods can not effectively select clinical phenotype associated genes for complex neurodegenerative disease. Nevertheless, clustering algorithms often used to detect gene modules. Genes that belong to a same module would have similar function or expression pattern, while genes that belong to different modules usually have very different properties. Moreover, we can use clustering methods to detect high-order biological signals, deepen the understanding of biological process which are seriously affected by the disease.
Based on the objects to be clustered, clustering algorithms can be classified into three categories: gene-based clustering, sample-based clustering, and bi-clustering [19,20]. Gene-based clustering methods classify the genes with similar expression patterns into one category, such as label propagation algorithm [21], and fuzzy clustering algorithm [22], etc., to get meaningful gene modules. Sample-based clustering methods take the samples as cluster objects, and gene expression is seen as a feature of the sample, which can be used to measure and identify the subtypes of patients. Supervised machine learning technology are often used to conduct cluster analysis of samples. Bi-clustering algorithms cluster genes and samples at the same time, mining genes with similar expression patterns, and further exploring the dynamic changes of gene module function under different sample states [23][24][25]. Since the function of clustered gene module can be seen as high order biological signal, bi-clustering algorithms are usually used to analyze the changes of biological process during disease degeneration [26].
Meanwhile, label propagation clustering algorithm is a graph-based semi-supervised machine learning method. It is based on guilt-by-association to predict the label information of unlabeled nodes with a few labeled nodes [21]. When the labels of the nodes in the network tend to be stable, the nodes with the same label identity are divided into a same category. Since it is costly to make tags of the samples for big biomedical data, unsupervised and semi-supervised methods have great prospect in this type of applications. According to the above discussion, to identify key genes which could be matched to the complex clinical phenotypes of different disease stages, we designed a semi-supervised feature selection method based on label propagation clustering algorithm (LPFS). LPFS includes two parts: one part is label propagation clustering based on the sample network which is constructed with gene expression data, the other part is the feature selection process based on the feature selection matrix. By conducting alternative iteration of the two steps, we select key genes which could be matched to the complex clinical phenotypes of different disease stages. To our best knowledge, this is the first time to put gene selection and sample clustering into one framework to prioritize disease genes.
To investigate the effectiveness of the biomarkers selected by the LPFS, we also conducted experiments with DESeq2 [27], edgeR [28], limma [29], t-test [30], fold change method (FC) [30], joint non-negative matrix factorization meta-analysis method (jNMFMA) [31], and flexible non-negative matrix factorization method (FNMF) [18]. Finally, we performed GO and KEGG pathways enrichment analysis of key genes identified by LPFS, to explore the affected gene functions underlying the complex clinical phenotypes, gaining a deep understand of the dynamic molecular mechanisms during the disease progression.
The rest of this paper is organized as follows: In "Methods" section, we present the proposed LPFS in detail. In "Results and discussion" section, we illustrate experiments of different methods with RNA-seq data of Huntington's disease. The enrichment analysis of key genes obtained by LPFS are performed and reported. And the overall discussion of experimental results of various methods are also reported. In "Conclusions" section, conclusions are presented.

Methods
In this section, we present LPFS approach in detail and discuss the parameter setting of it.

Label-propagation based semi-supervised feature selection
The gene expression data is denoted as X = [x ij ] n×m , where x ij represents the expression level of gene j in sample i, x i· denotes sample i, and x ·j denotes gene j. L = {1, . . . , c} represents the set of labels, c is the number of cluster number, and l i is the label for sample x i· , l i ∈ L . The initial category label matrix is denoted as . To make precision diagnosis of a patient, one key point is to identify biomarkers corresponding to the illness state of the patient correctly. To address the problem, we designed a feature selection method based on label propagation clustering namely, LPFS. LPFS conduct key gene selection during the sample clustering process, filter out redundant features, and select key genes that would well represent and distinguish different category samples. The selected genes should make the sample distance within one class close, and the sample distance between classes farther. Biologically, to identify severely affected genes corresponding to different clinical stages or phenotypes, it is important to select key genes that can distinguish different stages of the disease. Since not all genes have positively contribute to sample classification, therefore, we put l 2,1 constraint on feature selection matrix to sparse each column of it and filter out noise factors [32]. According to mathematical meaning, LPFS can be formulated as the following optimization problem: (2) Here, µ and β are hyper-parameters. The parameter µ balances the importance of the final label and the initial label of a node during label propagation. The parameter β constraints the sparse penalty on the feature selection matrix. µ, β ∈ (0, 1) . It should be noted that . There only cluster indicator matrix H is unknown by fixing F in the first three terms of Eq. (2), and there only feature selection matrix F is unknown by fixing H in the last two terms of Eq. (2). So, we compute the solution for the LPFS via an iterative updating algorithm that alternatively updates H and F. The detailed solving processes are shown below.
Step 1. Define an undirected graph G = (V , E) using gene expression data X. We use Gaussian kernel function to measure the relationship between two nodes. The weight Step 2. Normalize the weight matrix. Let D = diag{d ii } , where d ii = n j=1 w ij . Therefore the normalized weight matrix is Step 3. Initialize the initial category label matrix Y, and initialize cluster indicator matrix H to Y.
Step 4. According to the last two terms in Eq. (2), we solve feature selection matrix F In this study, each row in the feature selection matrix F is randomly initialized in (0, 1). The elements in F should be non-negative to keep the contribution of genes not be systematically offset. φ ij is the Lagrangian multiplier of f ij ≥ 0 . So, we can construct Lagrangian function as below: is an Auxiliary matrix, and F i denotes the i-th column of matrix F, The derivation of F is Based on the KKT condition ψ ij f ij = 0 , we can get Equation (9) can be written as Then, we can get the update role of F Step 5. According to the first three terms in Eq.

Equation (12) is a convex function. The derivation of H is
We can get the global optimal solution at the stationary point.
The category of sample i is Therefore, we update the cluster indicator matrix H = [h ij ] n×c , where Step 6. Repeat Step 4 until the objective function of Eq. (5) converges. Then we can get the feature selection matrix F.
Step 7. Repeat Step 5 until the objective function of Eq. (12) converges. At this point, we obtain the cluster indicator matrix H.
Step 8. Conduct loop iteration of Step 3 to Step 7, until the objective function of Eq. (2) converges. At this points, we get both the feature selection matrix F and cluster indicator matrix H.
Step 9. Based on rank-product method [30], we calculate the element fluctuation of each row in the feature selection matrix. If the elements in k-th row fluctuate significantly, the rank-product value of that row is larger, representing that the corresponding feature gene k has a stronger ability to distinguish samples of different categories.
Sorting the rank-product value of each row of the feature selection matrix in descending order, high ranking rows are reserved and low ranking rows are removed from the feature selection matrix.
Low ranking row indicates the elements in that row change very little through different columns, i.e. the corresponding gene has no discrimination ability of different category samples. Therefore, to improve prediction precision and reduce computational complexity, we filter out low ranking genes to conduct next iteration.
Step 10. Repeat aforementioned steps from Step 1 to Step 9.
Let function top(v s ) represents the s larger elements of vector v.
Since greater elements in F j contribute more to the identification of the specific category j, the genes, whose column number in the gene expression matrix equals to the row number of the greater elements in the feature selection matrix, are seen as key features of category j, i.e., the genes that could be severely affected under this condition.
In this study, we use key j to denote the key gene set for category j.
The detailed process of LPFS is summarized in Algorithm 1.
It should be noted that when the number of features is too large, it becomes hard to distinguish connections between samples and to detect modules on sample network. It is difficult to get accurate clustering results since there is no obvious clustering patterns, resulting in unstable and invalid key gene sets. Besides, when the number of clusters is less than the categories, i.e., samples belong (18) key j = arg i≤m top s f ij .
to two categories are classified into one cluster in experimental results, it will result in one column of the cluster indicator matrix to be 0. Then, some columns in the feature selection matrix will be all equal to 0, eventually leading to instability of the solution.
Theoretically, the computational process tends to stable as the number of features decreases. In addition, increase the number of samples is helpful to clarify the module structure in the network.
To ensure the convergence of Eq. (2), we first solve feature selection matrix, and then solve cluster indicator matrix. Through the alternative iteration strategy, the Eq. (2) can be convergent to a stable solution. According to experience and suggestions in paper [33], we set µ = 0.2 , and β = 0.2 . Besides, we set δ = 200 to ensure ||x i − x j || 2 /(2δ 2 ) ∈ (0, 1) , to get a reasonable connection between genes. In each iteration, low ranking 1000 genes are removed to modify the gene expression data for next iteration. To accurately prioritize the clinical phenotype related genes, 5 iterations were conducted to end the process. Finally, 100 times of the total process were run to get statistical significant result.

Results and discussion
First, we briefly introduced the gene expression dataset of Huntington's disease. Second, we demonstrated the experimental results of LPFS. Then, to verify the effectiveness of LPFS, we also conducted experiments with DESeq2, edgeR, limma, t-test, FC, jNMFMA, and FNMF. We further analyzed and discussed the disease gene prediction accuracy of different methods. Finally, we conducted GO and KEGG pathway enrichment analysis of the selected key genes, thus to get a deep understanding of the pathological mechanisms under complex clinical phenotypes of different disease phenotypes.

Gene expression data
The gene expression data were downloaded from http:// www. hdinhd. org, which were obtained from the striatum tissue of Huntington's disease mice through RNAseq technology. The experimental mice in this data set are of 2-month-old, 6-month-old, and 10-month-old. The genotypes include ploy Q20, poly Q80, poly Q92, poly Q111, poly Q140, and poly Q175. The ploy Q20 is normal one, while the rest genotypes are disease ones. There are 16 2-month-old mice of ploy Q20, 16 10-month-old mice of ploy Q20, and 8 mice for other genotypes at each age. The data set contain 23,351 genes. Since the genes expressed robustly across all samples have little contribution to sample classification, we selected top 6000 genes based on the mean (Fig. 1) and variance (Fig. 2) of gene expression data to reduce computational complexity. Besides, to test the accuracy of the selected genes by different methods, we collected 520 modifier genes from the literature [34], including 89 disease genes and 431 non-disease genes. The detailed information of the data set is illustrated in Table 1.

Prediction performance of LPFS
To get robust gene sets of different disease stage, we designed the following experimental pipeline, see Fig. 3. First, we used normal samples with genotype of ploy Q20 under 3 different time points and case samples with genotype of ploy Q x under 3 different time points, Q x ∈ {Q80, Q92, Q111, Q140, Q175} , to conduct LPFS. Samples of a genotype at a time point belonged to one category. Thus, there are 6 categories in each experiment. Finally, we ranked genes in descending order according to the elements in each column of the feature selection matrix. Top ranking genes are seen as the key gene set for each category. During the label-propagation based feature selection process, low ranking 1000 genes were removed out from the original gene expression matrix. 5 times iteration have been conducted in each experiment. Finally, 1375 genes were left for each category. To get a robust key gene set, we run each experiment for 100 times. Then, through the intersection of 100 key gene sets for each category, genes that appeared more than 50 times were selected as the key genes for that category. The number of key genes for each category is shown in Table 2.
The selected key gene set for each category could be used to describe functional changes during the development of the disease. In summary, we selected 397 marker genes, including 133, 73, 101 specific marker genes for 2-month-old, 6-month-old, 10-month-old normal mice, and 38, 22, 30 specific marker genes for 2-month-old, 6-month-old, 10-month-old case mice, respectively.
The GO and KEGG pathway enrichment analysis of those key gene sets help to get a deep understanding of the intermediate phenotypes and molecular activity of different disease stages [35,36]. We conducted      enrichment analysis with Metascape [37]. The enrichment analysis results for specific marker genes are shown in Tables 3 and 4. From Table 3, we can see that the functions, such as pituitary gland development, aromatic amino acid family metabolic process, arachidonic acid metabolic process, and regulation of adaptive immune response, change greatly during the growth process. From Table 4, we can see that the functions, such as sensory perception of sound, aging, positive regulation of NF-kappaB transcription factor activity, negative regulation of inflammatory response are affected during the disease degeneration. The GO and KEGG pathway enrichment results for all the 397 marker genes are shown in Fig. 4. Figure 4 shows that the functions, such as metabolic process, immune system process, developmental process, growth, etc. change significantly between different disease state.

Prediction performance of FC, t-test, DESeq2, edgeR, limma, jNMFMA, FNMF, and LPFS
To verify the effectiveness of LPFS, we also conducted experiments with FC, t-test, DESeq2, edgeR, limma, jNMFMA, and FNMF. Hamming accuracy, one-error, coverage, area under ROC curve (AUC) and area under precision-recall (AUPR) curve were used as evaluative criteria of prediction accuracy. The experimental results of LPFS were shown in Table 5. The comparison results of the 8 methods were shown in Table 6, which indicates that the performance of LPFS was comparable to that of the-state-of-art methods. We further choose the best performed result of each method to draw the ROC curves and PR curves. The ROC curves and PR curves of the 8 methods were shown in Figs. 5, and 6, respectively. We could know that LPFS performs better than other methods.
In addiation, we statistices the overlap degree of top 1000 genes obtained by any two methods (397 genes for LPFS). The details are shown in Table 7. Finally, we get intersection genes of the top 1000 genes obtained by the 8 methods. There are 9 overlapped genes in total, i.e., SLC4A11 (Solute Carrier Family 4 Member 11, GOTERM_BP_DIRECT: cellular cation homeostasis, fluid transport), ZFP474 (zinc finger protein 474, GOTERM_MF_DIRECT: metal ion binding), CD209G (CD209g antigen, GOTERM_MF_DIRECT: carbohydrate binding), AMBP (alpha 1 microglobulin/bikunin, GOTERM_BP_DIRECT: negative regulation of peptidase activity, protein-chromophore linkage, protein catabolic process, protein maturation), TOP2A (topoisomerase (DNA) II alpha 170kDa, GOTERM_MF_DIRECT: ATP binding, DNA binding), PBK (PDZ binding kinase, GOTERM_BP_DIRECT: negative regulation of proteasomal ubiquitin-dependent protein catabolic process, negative regulation of stress-activated MAPK cascade, cellular response to UV, negative regulation of inflammatory response), CCDC33 (coiled-coil domain containing 33, COG_ONTOLOGY: cell division and chromosome partitioning), CAPSL (calcyphosine like, GOTERM_ MF_DIRECT: calcium ion binding), DLGAP5 (DLG Fig. 4 The enrichment analysis of 397 specific gene markers The precision recall curves of FC, t-test, DESeq2, edgeR, limma, jNMFMA, and LPFS associated protein 5, GOTERM_BP_DIRET: cell cycle, signaling), and Al662270 (have no annotation information yet), annotated with DAVID [38,39]. The annotations of these genes indicate that the function of fluid transport, metal ion binding, the regulation of inflammatory response, cell division, cell cycle, and calcium ion binding are severally affected with the progress of the disease. Moreover, by investigating the human prefrontal cortex single cell expression files, we found that Ccdc33 mainly expressed in astrocytes and GABAergic neurons, Capsl mainly expressed in neurons and GABAergic neurons, while Dlgap5 can expressed in astrocytes, neurons, microglia, OPC, stem cells, and GABAergic neurons. This indicates that the neuron, astrocyte, microglia, GABAergic neuron, and OPC may be involved in the pathological process.

Conclusions
Precisely decode the pathological mechanism of neurodegenerative disease is the prerequisite for the diagnosis and treatment of it. Recently, with the accumulation of omics data and clinical data, we could conduct more detailed analysis of the phenotype of the disease at different pathological stages.
In this study, to screen key genes associated with different disease stages or clinical phenotypes, we designed LPFS to screen key genes that specific identify or distinguish different disease stages. Large amounts of experiments have been conducted to investigate and verify the performance of LPFS. Then, GO and pathway enrichment analysis was been conducted to make a deep understanding of biological functions of key genes for each disease stage. Finally, by intersecting top ranking genes of the 8 methods, we found 9 novel genes, including SLC4A11, ZFP474, CD209G, TOP2A, PBK, CCDC33, CAPSL, DLGAP5, and AL662270, are seriously affected with the progressive of Huntington's disease. Moreover, we found that the neuron, astrocyte, microglia, GABAergic neuron, and OPC could be involved in the pathological process.