Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia: an introduction to the Pigengene package and its applications

Background The distinct types of hematological malignancies have different biological mechanisms and prognoses. For instance, myelodysplastic syndrome (MDS) is generally indolent and low risk; however, it may transform into acute myeloid leukemia (AML), which is much more aggressive. Methods We develop a novel network analysis approach that uses expression of eigengenes to delineate the biological differences between these two diseases. Results We find that specific genes in the extracellular matrix pathway are underexpressed in AML. We validate this finding in three ways: (a) We train our model on a microarray dataset of 364 cases and test it on an RNA Seq dataset of 74 cases. Our model showed 95% sensitivity and 86% specificity in the training dataset and showed 98% sensitivity and 91% specificity in the test dataset. This confirms that the identified biological signatures are independent from the expression profiling technology and independent from the training dataset. (b) Immunocytochemistry confirms that MMP9, an exemplar protein in the extracellular matrix, is underexpressed in AML. (c) MMP9 is hypermethylated in the majority of AML cases (n=194, Welch’s t-test p-value <10−138), which complies with its low expression in AML. Our novel network analysis approach is generalizable and useful in studying other complex diseases (e.g., breast cancer prognosis). We implement our methodology in the Pigengene software package, which is publicly available through Bioconductor. Conclusions Eigengenes define informative biological signatures that are robust with respect to expression profiling technology. These signatures provide valuable information about the underlying biology of diseases, and they are useful in predicting diagnosis and prognosis. Electronic supplementary material The online version of this article (doi:10.1186/s12920-017-0253-6) contains supplementary material, which is available to authorized users.

Supplementary Note 1: The advantages of network analysis. Biological processes in a cell often require intricate coordination between multiple genes and proteins, not just one gene or a single protein. 1 Therefore, the traditional approaches that study associations between individual genes and specific diseases are unable to provide a full understanding of complex biological phenomena. 2 Although statistical tests such as ANOVA, [3][4][5][6] t-test, 7,8 and rank products [9][10][11] are successfully used in many studies to identify deferentially expressed genes, [11][12][13][14][15] they have a limited statistical power. 2 These approaches fail to pinpoint the biological mechanisms of complicated conditions such as cancer because complex diseases are usually caused by collaboration of more than a few genes. 16 In other words, subtle but coordinated and consistent changes in the expression of a set of functionally related genes can be more important and informative than dramatic changes in the expression of a few individual genes. 1 Gene network analysis models the interactions between genes in a comprehensive structure. 17,18 The approaches taken to infer gene or protein networks [19][20][21] include ordinary differential equations (ODEs), [22][23][24][25] Boolean networks, 26,27 cofunction networks, 28-32 coexpression analysis, [33][34][35][36][37][38][39][40][41] and Bayesian networks. [42][43][44][45][46][47][48] Our methodology is inspired by, and builds upon, Bayesian network and coexpression network analyses that have been successfully used for interpreting many biological experiments. While coexpression analysis can potentially model the entire genome, unfortunately, its application is limited due to low accuracy, a deficiency that is rooted in imperfect clustering. 49,50 Bayesian networks can accurately model complicated probabilistic dependencies between a handful of genes. However, it is very challenging to fit them to the data when the number of genes exceeds hundreds or thousands. 43,51 We addressed this challenge by comparing expressions at the module level. Our analysis is robust to biological and technical noise because an eigengene is a weighted average expression of several genes, and thus, not affected by random fluctuation in expressions of a few genes ( Supplementary Fig. 10).
Supplementary Note 2: Identifying gene modules. The MILE dataset includes expression data of 202 AML and 164 MDS cases measured using Affymetrix microarrays with 54,000 probes (see Methods). Using GEO2R web application a , 52 we obtained an R script that ranked all of the 54,000 probes based on their p-values. The script used limma 53 (version 3.22.7) to test for each probe the null hypothesis that it was similarly expressed in AML and MDS (Supplementary Script 1). Consistent with the approach taken by other scholars, 43 we kept all of the top one-third (n=18,200) of the most variable probes in our analysis.
We used Custom CDF 54 (version 15) to map probes to Entrez-gene IDs. The mapping was not one-to-one and we took the following approach to project the data from the probe level to the gene level. First, we excluded the probes that were mapped to multiple Entrez-gene IDs. Out of 18,200 probes, 13,294 remained. Next, among all probes that were mapped to a specific Entrezgene, the one with the lowest p-value was chosen as the "representative" of that gene. That is, we considered the most differentially expressed probe as the representative of a gene. For genes with only one corresponding probe, this probe was taken as the representative. Our approach resulted in an expression profile with 9,166 probes, each representing to a unique Entrez-gene a http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE15061
To identify the gene modules, we applied Weighted correlation network analysis (WGCNA, version 1.41) on the 9, 166 × 202 AML expression matrix 35 (Supplementary Table 1). Specifically, we used the following parameters to call blockwiseModules() function from WGCNA package. The power (β ) parameter was adjusted based on the recommendation of authors using pickSoftThreshold() function with the default value of RsquaredCut=0.85 . For better results, we set the parameter maxBlockSize=9166 so that the process was done in only one block. We set TOMType= "unsigned" , and used the default values for the rest of the arguments of blockwiseModules() . WGCNA could not confidently assign 4,125 genes to any of the modules because they hardly correlated with any other gene. They were designated as module 0, and excluded from the rest of the analysis.  56 The HOXA&B module (28) had the highest enrichment as expected because many of HOXA and HOXB genes were reported by scholars to be associated to AML. [56][57][58][59][60][61][62][63][64][65][66][67][68][69][70][71][72][73][74] Supplementary Table 3 includes the numbers corresponding to this plot. A Bayesian network is a statistical model that represents a set of random variables using a directed acyclic graph. 75 Nodes of the network correspond to random variables and the edges (arcs) model their conditional dependencies. An important property of Bayesian networks is that each node conditioned on its parent variables is independent of its non-descendants. In particular, if two nodes are not connected by a directed path, they are conditionally independent.
We trained a Bayesian network to model the probabilistic dependencies between the modules. Each module eigengene was represented by a node (observed random variable). To model the hematological malignancy, we added "Disease" as an observed random variable to the network. For instance in our study, it was equal to 1 for AML, and 0 for MDS. No eigengene was allowed to be a parent of Disease node.
We used bnlearn package to infer the edges and fit the above Bayesian network to the eigengenes. 76 Specifically, we discretized the values of eigengenes into three levels using Hartemink's method. 77 We used the bn.boot() function from the bnlean package to fit 1000 networks to the discretized data. This function used hill climbing strategy to optimize Bayesian Dirichlet equivalent (BDe) score. 78 Consistent with the approach taken by other scholars, 43 we averaged one-third of the notworks with the highest scores to obtain the consensus network. To facilitate applying the above procedure in other studies, we provided learn.bn() function in our Pigengene R package (version 0.99.19). In particular, Code 1 reproduces the results presented in this paper. See the package manual for more detail. Code 1. Reproducing the Bayesian network library(Pigengene) ## version 0.99.19 data(eigengenes33) amlE <-eigengenes33$aml mdsE <-eigengenes33$mds eigengenes <-rbind(amlE,mdsE) Labels <-c(rep("AML",nrow(amlE)),rep("MDS",nrow(mdsE))) names(Labels) <-rownames(eigengenes) learnt <-learn.bn(Data=eigengenes, Labels=Labels, bnPath="bn", bnNum=1000, seed=1, verbose =4) ## Visualize: d1 <-draw.bn(BN=learnt$consensus1$BN,nodeFontSize=18) The computation does not need more than 2 MB of memory and it is done in one to two days depending on the computer speed. Our package is capable of learning the networks in parallel using a computer cluster. Parallelization results in decreasing the wall-time substantially, dividing it by the number of available compute nodes. The average accuracy of our decision tree over 1000 runs are shown on the y-axis. The error bars correspond to the standard deviation. The accuracy is very robust with respect to noise, for instance, even when 30% of the expression profile is perturbed, the decline in the accuracy is negligible (2%).  4). Each plot corresponds to a dataset. We identified the biological signatures using METABRIC training dataset (a), and confirmed their predictive value in METABRIC validation (b) and MILLER (c) datasets (Methods). The low-risk patients are identified by regulated cell cycle and transcription (green). For this group, the probability of surviving more than 10 years is above 89% in all the three datasets. The p-values indicate that the difference between low and high risk groups is statistically significant.

15/23
Supplementary Figure 12. Association with translational control. The y-axis shows the p-value from overrepresentation analysis on the smaller module that was automatically-selected by breast cancer survival analysis. This module of 193 genes is significantly associated with translation and translational control.

16/23
Supplementary Note 4: Survival analysis on breast cancer. We applied a methodology similar to our AML-MDS analysis on METABRIC discovery dataset to identify 15 modules and compute the corresponding eigengenes. We used glmnet package (version 2.0-2) to fit a regularized Cox model with the Lasso penalty (α = 1). 80 The regularization path indicated that two modules with 319 and 193 genes are most associated with survival. We used the corresponding two eigengenes to fit an accelerated failure time model to the survival data. Specifically, we used survreg function from Survival package (version 2.38-3), Weibull distribution with scale=1 , and the defaults values for the rest of the parameters. 81 We used the fitted model to predict the survival time using only the two eigengenes. We chose two thresholds for the predicted values that maximized the precision of low and high risk predictions in the METABRIC discovery dataset. We inferred the two eigengenes in METABRIC validation and MILLER datasets to evaluate our accelerated failure time model. Using the same thresholds, our approach could identify low-risk patients in these two independent datasets with high specificity (> 89%, Table 3 and Supplementary Fig. 11). This illustrates the biological significance of the identified signatures.