Integrative approach for inference of gene regulatory networks using lasso-based random featuring and application to psychiatric disorders
Abstract
Background
Inferring gene regulatory networks is one of the most interesting research areas in systems biology. Many inference methods have been developed using a variety of computational models and approaches. However, two issues remain. First, depending on the structural or computational model of the inference method, the results tend to be inconsistent because each method has inherently different advantages and limitations. Combining dissimilar approaches is therefore needed to overcome the limitations of standalone methods through complementary integration. Second, sparse linear regression penalized by a regularization parameter (lasso) and bootstrapping-based sparse linear regression have been proposed in state-of-the-art network inference methods, but they are not effective for data with a small sample size, and a true regulator can be missed if the target gene is strongly affected by an indirect regulator with high correlation or by another true regulator.
Results
We present two novel network inference methods based on the integration of three different criteria: (i) the z-score, to measure the variation of gene expression from knockout data; (ii) mutual information, for the dependency between two genes; and (iii) linear regression-based feature selection.
Based on these criteria, we propose a lasso-based random feature selection algorithm (LARF) that achieves better performance by overcoming the limitations of bootstrapping mentioned above.
Conclusions
This work makes three main contributions. First, our z-score-based method for measuring gene expression variation from knockout data is more effective than similar criteria in related work. Second, we confirmed that true regulator selection can be effectively improved by LARF. Lastly, we verified that an integrative approach can clearly outperform a single method when two different methods are effectively combined. In the experiments, our methods were validated by outperforming state-of-the-art methods on DREAM challenge data, and LARF was then applied to the inference of gene regulatory networks associated with psychiatric disorders.
Keywords
Gene regulatory network, Psychiatric disorder
Background
Inferring gene regulatory networks (GRNs) from biological data is currently one of the most interesting areas of systems biology research, which aims to elucidate cellular and physiological mechanisms. GRN inference, often referred to as reverse engineering, is the process of estimating the network structure that best represents the regulatory relationships underlying gene expression data. An inferred GRN consists of nodes and edges representing genes and gene-gene regulatory interactions (activation or suppression), respectively. Once the regulation maps are constructed by identifying gene interactions from high-throughput data such as gene microarrays [1], we can gain insight into complex biological processes from the regulatory networks in order to discover biomarkers for a target disease and further apply them to drug design [2, 3].
The choice of inference method depends both on the kind of data used for inference, such as gene expression, gene-transcription factor (TF) [4], or protein-protein interaction (PPI) [5] data, and on the type of network model assumed, such as a directed or undirected graph [6]. In addition, we have to consider the case of data integration: not only individual data types but also multiple data types together (e.g., integration of gene expression and gene-TF data [7]) can be used for more reliable inference [8, 9]. In this work, we limit our inference methods to directed networks with a single data type: gene expression data. To decipher regulatory interactions from gene microarray data, which provide the expression level of each gene as regulated directly or indirectly by the other genes, a number of effective network inference methods have been proposed that employ a variety of computational and structural models based on Boolean networks [10], Bayesian networks [11], information theory [12], regression models [13], and so on. Depending on the approach, however, the results tend to be inconsistent because of the inherently different advantages and limitations of each inference solution [14]. The results of the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project [15] describe well the pros and cons of the different methods, as well as how effectively they can work together when the advantages of all methods are integrated (although this does not mean that any combination always outperforms any standalone method). More specifically, we note that they draw two conclusions from the experiments: (i) without integration, a single criterion limits the continuous improvement of network inference research, and (ii) a bootstrapping (resampling) based regression method [16] is required to avoid overfitting in regression-based methods [15].
In this paper, we propose two methods, IMLARF (integration of MI and LARF) and ISLARF (integration of z-score and LARF). IMLARF consists of three steps. The first step is to build a matrix in which each element is an edge score calculated by MI. To overcome the limitation of MI mentioned above, the second step is to construct another edge score matrix using LARF; the two edge score matrices are then combined in the last step. In LARF, we regard sparse linear regression as feature selection, since our goal is to identify the regulators that best predict the expression level of target genes. The problem is that the features selected by lasso tend to be overfitted to a given tuning parameter λ. The instability caused by this overfitting can be addressed by bootstrapping [12, 21], in which the data are randomly resampled so that a more stable selection can be achieved. However, resampling may not be effective when the sample size is small. Another limitation of bootstrapping is that a true variable (regulator gene) is likely to be missed (a false negative) when strong indirect or direct regulators exist. LARF is similar to bootstrapping, but in each iteration it selects variables from randomly preselected candidate features over different tuning parameters of the lasso optimization. Because strong indirect or direct regulators are excluded from the feature set in some iterations, true features that are only weakly correlated with the target gene are less likely to be missed. The second method we propose is ISLARF, which integrates two criteria, ZS and LARF. ZS is the criterion that uses the z-score of the expression variation after a gene is knocked out. Although ISLARF is applicable only to knockout data, its performance is superior to that of similar z-score-based methods for knockout data in related work.
In the experimental evaluation, we validate the proposed methods on a dataset from the DREAM3 challenge [22]. In addition, we explore the gene networks of psychiatric disease using the related genes. The results show that the proposed methods significantly outperform the state of the art [23, 24] and rebuild the known regulations of genes possibly associated with psychiatric disorders.
Methods
Problem definition
We begin with a brief definition of the problem and notation. The network we target is a directed graph that consists of n nodes and n(n−1) possible edges, representing genes and regulations respectively. Given a matrix \(\mathbf{X} \in \mathbb{R}^{N \times n}\), where N is the number of samples, we denote the ith column by a vector \(\mathbf{x}_{i}\) indicating the expression levels of the ith gene over the N samples, and we let \(X=\{X_{1},\ldots,X_{n}\}\) be the set of variables (gene, feature, node, and variable are used interchangeably in this paper). The goal of our work is not only to identify the regulators of a given target gene but also to define the confidence level of each regulation as the weight of the corresponding edge. In other words, we estimate the weights of all possible regulations, which are the directed edges between all pairs of nodes \(\{X_{i} \leftarrow X_{j} : i,j \in X\}\) in the network, and then select only the edges whose weight is higher than a predefined threshold θ. As a final result, therefore, a weight matrix \(\mathbf{W} \in \mathbb{R}^{n \times n}\) is returned by the inference method, and \({W^{i}_{j}}\) represents the confidence level of the regulation in which target gene i is connected to activator or suppressor gene j. In the following sections, we present how the edge weight is estimated by information theory, the LARF algorithm, and the z-score from knockout data.
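The final selection step described above can be sketched in a few lines. This is only an illustration under the stated convention that row i is the target and column j the regulator; the function name `select_edges` and the toy weights are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def select_edges(W: List[List[float]], theta: float) -> List[Tuple[int, int, float]]:
    """Return directed edges (regulator j, target i, weight) with W[i][j] > theta."""
    n = len(W)
    edges = []
    for i in range(n):        # i indexes the target gene
        for j in range(n):    # j indexes the candidate regulator
            if i != j and W[i][j] > theta:
                edges.append((j, i, W[i][j]))
    # Report the most confident regulations first.
    edges.sort(key=lambda e: e[2], reverse=True)
    return edges


# Toy 3-gene weight matrix (illustrative values only).
W = [[0.0, 0.9, 0.1],
     [0.2, 0.0, 0.7],
     [0.4, 0.3, 0.0]]
edges = select_edges(W, theta=0.5)
```

With θ = 0.5, only the edges from gene 1 to gene 0 and from gene 2 to gene 1 are kept in this toy example.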
Overview
IMLARF and ISLARF
Information theoretic approach
Mutual information matrix
where \(cov(X_{i})\) is the covariance matrix of variable \(X_{i}\), and \(|cov(\cdot)|\) is the determinant of the covariance matrix. The reader is referred to [26] for more details. We build the MI matrix, in which each element \({M^{i}_{j}}\) indicates the dependency between \(X_{i}\) and \(X_{j}\); that is, \(X_{i}\) and \(X_{j}\) are regarded as independent if \({M^{i}_{j}}=0\) or if \({M^{i}_{j}}\) is relatively lower than for other edges. Networks consisting of the edges whose \({M^{i}_{j}}\) is higher than a heuristic threshold are referred to as relevance networks. Relevance networks, however, have two critical limitations: first, MI does not provide the direction of edges because \({M^{i}_{j}}={M^{j}_{i}}\), and second, high coexpression and indirect regulation may cause false positives.
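As a rough illustration of how such an MI matrix could be built, the sketch below assumes the common Gaussian form \(I(X_{i},X_{j}) = \tfrac{1}{2}\log\big(|cov(X_{i})|\,|cov(X_{j})| / |cov(X_{i},X_{j})|\big)\). This form is an assumption that matches the covariance-determinant description above; it is not copied from the paper's equation. The helper names `gaussian_mi` and `mi_matrix` and the toy data are hypothetical.

```python
import math
from typing import List


def gaussian_mi(x: List[float], y: List[float]) -> float:
    """Mutual information of two variables under a Gaussian assumption."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)   # |cov(X_i)| for a scalar variable
    vy = sum((b - my) ** 2 for b in y) / (n - 1)   # |cov(X_j)|
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    if vx == 0.0 or vy == 0.0:
        return 0.0                                 # a constant gene carries no information
    det = vx * vy - cxy * cxy                      # |cov(X_i, X_j)|, a 2x2 determinant
    if det <= 0.0:
        return math.inf                            # perfectly collinear pair
    return 0.5 * math.log(vx * vy / det)


def mi_matrix(X: List[List[float]]) -> List[List[float]]:
    """Symmetric MI matrix M for an N-by-n expression matrix (rows are samples)."""
    n = len(X[0])
    cols = [[row[j] for row in X] for j in range(n)]
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            M[i][j] = M[j][i] = gaussian_mi(cols[i], cols[j])
    return M


# Toy data: gene 1 tracks gene 0 closely, gene 2 is uncorrelated with gene 0.
g0 = [1.0, 2.0, 3.0, 4.0]
g1 = [1.1, 1.9, 3.2, 3.8]
g2 = [1.0, -1.0, -1.0, 1.0]
M = mi_matrix([list(r) for r in zip(g0, g1, g2)])
```

The symmetry `M[i][j] == M[j][i]` is built in here, which makes the missing edge direction noted above concrete.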
Statistical approach
Zscore and gene knockout data
where \(X^{i}_{j}\) is the expression level of gene j after knocking out gene i, and \(\mu_{D_{j}}\) and \(\sigma_{D_{j}}\) are the mean and standard deviation of the jth column vector \(D_{j}\) of the variation matrix D, respectively. Since the z-score of \({D^{i}_{j}}\) over \(D_{j}\) is the weight of the regulation edge \(G_{i} \rightarrow G_{j}\), the z-score of \({D^{i}_{j}}\) is equivalent to \({S^{j}_{i}}\) of the edge weight matrix S. The limitation of this criterion is that it is applicable only to knockout data.
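A minimal sketch of this criterion follows. It assumes that the variation \({D^{i}_{j}}\) is the signed difference between the knockout expression \(X^{i}_{j}\) and the wild-type expression of gene j, and that the sample standard deviation is used. Both choices are assumptions, since the defining equation is not reproduced here. The function name `zscore_edge_weights` and the toy data are hypothetical.

```python
import statistics
from typing import List


def zscore_edge_weights(knockout: List[List[float]],
                        wild_type: List[float]) -> List[List[float]]:
    """Return S where S[j][i] is the z-score of D[i][j] over column j of D.

    knockout[i][j] is the expression of gene j after knocking out gene i.
    """
    n = len(wild_type)
    # Assumed definition: signed variation from the wild type.
    D = [[knockout[i][j] - wild_type[j] for j in range(n)] for i in range(n)]
    S = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column = [D[i][j] for i in range(n)]
        mu = statistics.mean(column)
        sigma = statistics.stdev(column)   # sample standard deviation (assumption)
        if sigma == 0.0:
            continue                       # gene j never changes: leave weights at 0
        for i in range(n):
            S[j][i] = (D[i][j] - mu) / sigma   # weight of edge G_i -> G_j
    return S


# Toy 4-gene example: knocking out gene 0 raises gene 1 well above wild type.
wild_type = [1.0, 1.0, 1.0, 1.0]
knockout = [[0.0, 1.8, 1.0, 1.0],
            [1.0, 0.0, 1.0, 1.0],
            [1.0, 1.05, 0.0, 1.0],
            [1.0, 0.95, 1.0, 0.0]]
S = zscore_edge_weights(knockout, wild_type)
```

The diagonal entries correspond to a gene's response to its own knockout and would normally be ignored when ranking edges.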
LARF algorithm
where the coefficient column vector \(\beta_{i}\) represents the regulation relationships between the target gene i and the other genes. More precisely, after \(\beta_{i}\) is optimized to minimize the objective function (5), gene j does not regulate gene i if the jth element of \(\beta_{i}\) is zero; otherwise it does. The optimization is performed for each target gene i, i∈X. The coefficient matrix \(B=\{\beta_{1},\ldots,\beta_{n}\}^{T}\) is equivalent to an adjacency matrix in which a nonzero \(B_{ij}\) is the regulation edge from regulator gene j to target gene i. The tuning parameter λ in lasso is used to enforce network sparsity, so the number of selected (nonzero coefficient) variables varies with λ. In our work, we regard the variable selection of lasso as feature selection for predicting a target gene's expression level.
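To make the idea of lasso-based selection over randomly preselected features concrete, the following sketch scores each candidate regulator by how often it receives a nonzero coefficient when it is among the candidates. It is not the authors' LARF implementation. It assumes the objective \(\tfrac{1}{2N}\lVert y - X\beta\rVert^{2} + \lambda\lVert\beta\rVert_{1}\), solves it with plain coordinate descent, and omits the random sampling of observations. All names (`lasso_cd`, `random_featuring_scores`, `frac`, `n_rounds`) and the toy data are hypothetical.

```python
import random
from typing import List


def soft_threshold(z: float, g: float) -> float:
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X: List[List[float]], y: List[float], lam: float,
             n_sweeps: int = 100) -> List[float]:
    """Minimise (1/(2N))*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n_samples, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_sq = [sum(row[k] ** 2 for row in X) / n_samples for k in range(p)]
    for _ in range(n_sweeps):
        for k in range(p):
            if col_sq[k] == 0.0:
                continue
            # Covariance of feature k with the residual that excludes feature k.
            rho = sum(X[s][k] * (resid[s] + X[s][k] * beta[k])
                      for s in range(n_samples)) / n_samples
            new_beta = soft_threshold(rho, lam) / col_sq[k]
            delta = new_beta - beta[k]
            if delta != 0.0:
                for s in range(n_samples):
                    resid[s] -= X[s][k] * delta
                beta[k] = new_beta
    return beta


def center(v: List[float]) -> List[float]:
    m = sum(v) / len(v)
    return [a - m for a in v]


def random_featuring_scores(X: List[List[float]], target: int, lams: List[float],
                            n_rounds: int = 50, frac: float = 0.5,
                            seed: int = 0) -> List[float]:
    """Selection frequency of each gene as a regulator of `target`."""
    rng = random.Random(seed)
    n_samples, n = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(n)]
    candidates = [j for j in range(n) if j != target]
    k = max(1, int(frac * len(candidates)))
    selected, drawn = [0] * n, [0] * n
    for _ in range(n_rounds):
        subset = rng.sample(candidates, k)       # random preselection of features
        sub_X = [[cols[j][s] for j in subset] for s in range(n_samples)]
        for lam in lams:                         # several tuning parameters
            beta = lasso_cd(sub_X, cols[target], lam)
            for j, b in zip(subset, beta):
                drawn[j] += 1
                if b != 0.0:
                    selected[j] += 1
    return [selected[j] / drawn[j] if drawn[j] else 0.0 for j in range(n)]


# Toy orthogonal design: gene 0 = 2*gene 1 + 0.3*gene 2; genes 3 and 4 are unrelated.
x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
x4 = [1, -1, -1, 1, 1, -1, -1, 1]
X = [[2.0 * a + 0.3 * b, a, b, c, d] for a, b, c, d in zip(x1, x2, x3, x4)]
scores = random_featuring_scores(X, target=0, lams=[0.1, 0.2])
```

Normalising by how often a gene was drawn as a candidate, rather than by the total number of rounds, is a design choice of this sketch, so that genes are not penalised for being left out of a random subset.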
and \(max(F^{i})\) and \(min(F^{i}_{i})\) are the maximum value of the ith row vector of F and the minimum of \(F^{i}_{i}\), respectively.
Results

TPR=TP/(TP+FN)

FPR=FP/(FP+TN)

AUROC: the area under ROC curve.

MI: edge is scored by mutual information

ZS: relative variation from wild type is measured by z-score.

LARF: lasso-based random featuring and sampling.

IMLARF: integration of MI and LARF

ISLARF: integration of ZS and LARF

ZDR: top-ranked method in DREAM3 [23]

GENIE3: top-ranked method in DREAM4 [24]

TIGRESS: top-ranked method in DREAM5 [21]
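The evaluation metrics defined at the top of this list (TPR, FPR, and AUROC) can be computed from a list of edge scores and gold-standard labels as sketched below. The AUROC here uses the rank-based formulation, which equals the area under the ROC curve: the probability that a randomly chosen true edge is scored above a randomly chosen non-edge, with ties counted as one half. The function names and the toy scores are illustrative assumptions.

```python
from typing import List, Tuple


def tpr_fpr(scores: List[float], labels: List[int], theta: float) -> Tuple[float, float]:
    """TPR = TP/(TP+FN) and FPR = FP/(FP+TN) when edges with score > theta are predicted."""
    tp = sum(1 for s, l in zip(scores, labels) if s > theta and l == 1)
    fn = sum(1 for s, l in zip(scores, labels) if s <= theta and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s > theta and l == 0)
    tn = sum(1 for s, l in zip(scores, labels) if s <= theta and l == 0)
    return tp / (tp + fn), fp / (fp + tn)


def auroc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison of true edges and non-edges."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: four candidate edges, two of which are in the gold standard.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
rates = tpr_fpr(scores, labels, theta=0.5)
area = auroc(scores, labels)
```

The pairwise loop is quadratic in the number of edges, which is acceptable for a 100-gene network; a sort-based version would be preferable for much larger networks.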
Evaluation on the DREAM3 benchmarks
Materials
The data for the DREAM3 In Silico Network challenge consist of networks of three sizes (10, 50, and 100 genes), with five gold-standard networks for each size (15 networks in total). The five networks are named Ecoli1, Ecoli2, Yeast1, Yeast2, and Yeast3. For each true network, three data types (knockdown perturbations, knockout perturbations, and time series data) are provided, and the knockdown and knockout data each include a single wild-type sample. In our experiments, only knockout data are used, and the 10-gene, 50-gene, and 100-gene Yeast1 networks are mainly tested.
Random sampling vs Random featuring
Setting parameters
Effect of integration and performance comparisons
AUROC of standalone and integrative methods
Method       10-gene            50-gene            100-gene
GENIE3       0.9175             0.8427             0.8631
TIGRESS      0.7044 ± 0.0056    0.8179 ± 0.0025    0.7690 ± 0.0023
TIGRESS-TF   0.8154 ± 0.0037    0.9006 ± 0.0010    0.8777 ± 0.0009
MI           0.9312             0.8329             0.8586
LARF         0.9250 ± 0.0154    0.8489 ± 0.0038    0.8610 ± 0.0039
IMLARF       0.9425 ± 0.0047    0.8487 ± 0.0032    0.8701 ± 0.0012
ZDR*         0.8975             0.9223             0.8876
ZS*          0.9725             0.9204             0.8870
ZS* + MI     0.9775             0.8931             0.8925
ISLARF*      0.9892 ± 0.0021    0.9301 ± 0.0049    0.9065 ± 0.0029
Inference of GRN for psychiatric disorders
In this section, the proposed method is applied to real gene expression data for psychiatric disorders. Through the experiments, we evaluate how the method constructs the network and explore which potential biomarkers of psychiatric disorders appear in the inferred networks. The psychiatric disorder data, provided by the Stanley Medical Research Institute (SMRI), consist of gene expression data for 25833 genes and 131 samples (43 controls and 88 cases) covering three major psychiatric diseases: bipolar disorder, schizophrenia, and major depression.
Figure 8b shows the inferred network for cluster 7, in which a total of 8 genes known from the literature to be related to psychiatric disorders are found: TEF [36], NR1D1 [37], KIF13A [38], ADCYAP1R1 [39], MDGA1 [40], GNAZ [41], CNR1 [42], and DCLK1 [43]. Additionally, we defined 5 genes, ZBTB20, MAP7, ZBTB16, ANK2, and MRAP2, as susceptible genes, and, surprisingly, ZBTB20 [44], MAP7 [45], ZBTB16 [46], and ANK2 [47] have also been reported as schizophrenia-associated genes in SNP- and CNV-based studies. This suggests that genes that have only an edge to a reference gene are worth investigating as candidate genes associated with psychiatric disorders. In addition, reference genes in the network tend to interact with each other directly or indirectly through susceptible genes, but they are not widely spread, implying that they may work together or may be co-regulated by another unknown biomarker.
The network inference result for the combination of clusters 4 and 5 is shown in Fig. 8c and consists of two components. There are 10 reference genes, DLG4 [48], MIF [49], SLC6A5 [50], GAD1 [51], GAD2 [52], GOT2 [53], RGS9 [54], HDAC9 [55], CDH7 [56], and BDNF [57], and 3 susceptible genes, PRMT8, KIT, and ELAVL2. Notably, ELAVL2 has connections to three reference nodes and was reported as a schizophrenia-related gene in a recent GWAS [58].
Discussion
The difference between ZS and the z-score of [23] lies in whether the absolute value of the variation \({D^{i}_{j}}\) is taken before z-scoring or the original value of \({D^{i}_{j}}\) is used. In our method, we simply calculate the z-score to measure how many standard deviations the observed variation lies above or below the mean, whereas in [23] the absolute value of the variation, \(|{D^{i}_{j}}|\), is used for the z-score. Since we want to know how much higher the variation of a gene is than that of another target gene after knockout of the source gene, the use of \({D^{i}_{j}}\) rather than \(|{D^{i}_{j}}|\) is more reasonable, and selecting high-variation genes is not guaranteed if the absolute value of \({D^{i}_{j}}\) is used. Since random featuring and random sampling are performed in iterations of lasso, the computational time increases significantly, especially when searching for optimal parameters. In the implementation, the step size should therefore be set to a reasonably small value, and parallel processing (e.g., parfor in MATLAB) can reduce the processing time in practice (in our case, eight local cores were used). As future work, we can additionally integrate TF information into the inference to obtain more reliable results, and then apply our method to the DREAM5 challenge data for comparison with TIGRESS, which utilizes TF information.
Conclusion
We presented two integrative approaches for gene regulatory network inference, each combining two different algorithms. First, IMLARF is based on the integration of MI and LARF, a novel regression-based random featuring method, to overcome the limitations of random sampling and MI. Second, ISLARF is the combination of LARF and ZS, which is based on the z-score of the variation in expression after the candidate regulator is knocked out. Both integrative methods outperform the standalone methods and the selected state-of-the-art techniques on DREAM3 challenge data. In the application to inference of gene regulation associated with psychiatric disorders, we applied IMLARF to gene expression data and inferred the interactions between genes reported as psychiatric disorder-associated genes and susceptible genes defined by the inferred networks.
