 Research
 Open Access
 Published:
Identifying disease genes by integrating multiple data sources
BMC Medical Genomics volume 7, Article number: S2 (2014)
Abstract
Background
Now multiple types of data are available for identifying disease genes. Those data include genedisease associations, disease phenotype similarities, proteinprotein interactions, pathways, gene expression profiles, etc.. It is believed that integrating different kinds of biological data is an effective method to identify disease genes.
Results
In this paper, we propose a multiple data integration method based on the theory of Markov random field (MRF) and the method of Bayesian analysis for identifying human disease genes. The proposed method is not only flexible in easily incorporating different kinds of data, but also reliable in predicting candidate disease genes.
Conclusions
Numerical experiments are carried out by integrating known genedisease associations, protein complexes, proteinprotein interactions, pathways and gene expression profiles. Predictions are evaluated by the leaveoneout method. The proposed method achieves an AUC score of 0.743 when integrating all those biological data in our experiments.
Background
Many human genetic diseases or disorders are resulted from mutations of multiple genes [1]. The identification of those disease genes is not only important in understanding genetic disease mechanisms, but is also helpful in developing new methods in diagnostics and therapeutics [2].
Genes associated with similar disorders are often functionally related, supporting the existence of distinct diseasespecific functional modules [3–5]. A "guiltbyassociation" [6] assumption is often used by various algorithms to identify disease genes. If a gene is ranked as "close" to known disease genes, it would be likely regarded as related to the same disease. The principle is largely supported by many biological data sources, such as proteinprotein interactions (PPIs) [7–11], pathways [12–15], gene expression profiles [16–18], etc.. Lage et al. [19] rank disease genes from a constructed phenomeinteractome network by using PPIs and phenotype similarities. Wu et al. [5] develop a tool called CIPHER to predict disease genes based on a global concordance between a PPI network and a phenotype network. Hwang et al. [20] use a similar coherence score between a gene network and a phenotype network. Vanunu et al. [21] design a method called PRINCE that predicts disease genes and protein complexes associated with diseases at the same time. Li et al. [22] analyze human disease and disease relationships from a pathwaybased point of view. Ma et al. [23] employs the Markov Random Field (MRF) theory to prioritize genes associated with a specific phenotype or trait by using gene expression profiles and PPI data.
Multiple data integration is another commonly used methodology that collects evidences of gene disease associations from different data sources. Köhler et al. [24] propose a random walk with restart (RWR) algorithm that predicts disease genes by using a mixed PPI network. Zhang et al. [25] develop a Bayesian regression approach to explain similarities between disease phenotypes by using diffusion kernels of one or several PPI networks. Chen et al. [26] define a data integration rank (DIR) score by taking a max instead of average to capture the most informative evidence among a set of integrated data sources. The DIR algorithm potentially yields better performance than many other data integration methods [26].
However, challenges still exist because of the following reasons. Firstly, there are many levels of controls along paths from genotypes to phenotypes [26]. Genes have to be transcribed and then be translated into proteins, and proteins interact with many other molecules to perform cellular functions [26–28], resulting in the complex relationship between genotypes and phenotypes [29]. Secondly, different biological data are heterogeneous. They describe relationships of molecular entities in various levels. No widely acceptable criterion is available to standardize them into the same scale. An inappropriate integration method combines noise as well, which often decreases the prediction accuracy. Thirdly, many "guiltbyassociation" methods only take edges of a candidate gene with known disease genes into account, ignoring edges of the gene with many other vertices in a biological network. They ignore the fact that the biological network, let's say a PPI network or a gene coexpression network, is built independently for describing a specific biological relationship of proteins or genes. It may have no direct relationship with gene disease associations.
In this paper, we introduce a multiple data integration method for disease gene identifications, which considers comprehensive characters of a set of heterogeneous datasets to capture the complex relationship between genotypes and phenotypes. The method is based on the theory of MRF and the method of Bayesian analysis. Two previous algorithms of Deng et al. [30] and Ma et al. [23] have been proposed to integrating multiple datasets by using the MRF theory for yeast protein function predictions. Their method cannot be directly employed to identify human disease genes. Predictions of the method of Deng et al. [30] become unreliable due to the following scale problem. Human genome consists of around 21,000 genes [31], while most diseases are associated by mutations of only a few genes. Even merging similar diseases into classes, the associated genes of individual disease classes is still not enough to estimate parameters correctly by using Deng's method. The method by Ma et al. [23] mainly uses gene expression profiles to group genes with similar characters. PPI data are only employed to calibrate predictions. It is not clear how to integrate more kinds of biological data by using their method. In paper [32], we have developed a basic modified MRF model for human disease gene prioritization. In this study, we will further improve it by introducing a new parameter estimation strategy and a new Gibbs sampling strategy. The improved MRF algorithm is not only stable in terms of parameter estimation, but also reliable in terms of its prediction accuracy.
Methods
In this paper, we first briefly describe how the problem is formulated as a Bayesian labelling problem. The labelling configration assumes to follow a Gibbs distribution. After that, a MRF model is introduced to solve this problem by integrating multiple kinds of biological data, including known genedisease associations, protein complexes, PPIs, pathways and gene expression profiles.
The Bayesian labelling problem
Let L = {L_{1}, L_{2}, ..., L_{ k }} be a set of k labels and S = {S_{1}, S_{2}, ... , S_{ r }} be a set of r sites. A labelling problem [33] is defined as assigning each site Si with a label in L.
Let F = {F_{1}, F_{2}, ... , F_{ r }} be a family of random variables defined on S, in which each random variable F_{ i } takes value f_{ i } of L. We use the notation F = f to represent the joint event that {F_{1} = f_{1}, ... , F_{ r } = f_{ r }}, where f = {f_{1}, ... , f_{ r }} is called a configuration of F. The set of all configurations is denoted as \phantom{\rule{0.25em}{0ex}}\mathcal{F}.
The relationship of sites is determined by a neighborhood system \mathcal{N}=\left\{{N}_{i}{\forall}_{i}\in S\right\}, where N_{ i } is the set of sites neighboring i.
A family of random variables F is said to be a MRF on S w.r.t. N if and only if the following two conditions are satisfied:
1 Positivity: P\left(f\right)>0,\forall f\in \mathcal{F},
2 Markovianity: P\left({f}_{i}{F}_{S\backslash \left\{i\right\}}\right)=P\left({f}_{i}{f}_{{N}_{i}}\right).
The Markovianity indicates that the probability of a local event f_{ i } conditioned on all other events is equivalent to that conditioned on only events of its neighbors. Hence, the joint probability P(f) of the random field can be uniquely determined by local conditional probabilities.
Let rbe an observation of F . Suppose we know both the prior probability distribution P (f) of configuration f and the conditional probability distribution P (r f) of the observation rgiven the configuration f. The best estimation of f is the one maxizing a posteriori probability (MAP), which is
where P(r) is the probability that we get the observation r.
The Bayesian labelling problem [33] is that given a set of observation r, find the MAP configuration of labelling {f}^{*}={\mathsf{\text{argmax}}}_{f\in F}P\left(fr\right). Here, as P (r) is not a function of f , it does not affect the MAP estimation of f.
Gibbs distribution in MRF
It is usually hard to specify a prior probability of a MRF for a real problem. Fortunately, the HammersleyClifford theorem [34] provides a solution for this. According to the theorem, F is a MRF on S w.r.t. \phantom{\rule{0.25em}{0ex}}\mathcal{N} if and only if the probability distribution of P (F = f) of the configuration is a Gibbs distribution w.r.t. \phantom{\rule{0.25em}{0ex}}\mathcal{N}. The Gibbs distribution has a form of
where Z={\sum}_{f\in F}{e}^{U\left(f\right)/T} is a normalizing constant, T is a global control constant that is often assumed to be 1, and U(f) is the energy function calculated as follows
where V_{ i }(f) is the energy potential of C_{ i } (the set of i^{th} order cliques) in the neighborhood system \phantom{\rule{0.25em}{0ex}}\mathcal{N}, R_{ n }(f) represents those higher order terms. A special case of MRF is the Ising model that only considers up to the second order of cliques [35].
Given a configuration f, let the conditional probability distribution of observation r have the same exponential form
Then the posterior probability of the Gibbs distribution has form
where the posterior energy is [33]
Based on this, suppose the collection of whole human genes G = {g_{1}, g_{2}, ..., g_{ N }} is the site set, and {1, 0} is the label set, where 1 represents a gene is a disease gene and 0 otherwise. The problem of human disease gene identification is actually to find the best configuration of G according to what is currently known about human diseases.
The MRF model for identifying human disease genes
Suppose human genome consists of a set of N genes G = {g_{1}, g_{2}, ..., g_{ N }}. Some of them are already known to be associated with genetic diseases, while associations of most other genes are still not known. Without loss of generality, let g_{1}; g_{2}, ..., g_{ n } be genes that have not yet been known to be associated with genetic diseases, and g_{ n+1 }, g_{ n+2 }, ..., g_{ n+m } be currently known disease genes. Obviously, we have N = n + m. Let {D_{1}, D_{2}, ..., D_{ M } } be a set of human diseases, where D_{ i } consists of the set of genes that are already known associated with the i^{th} disease.
For a specific disease, let X = (X_{1}, X_{2}, ..., X_{ n+m }) be the random variables defined on all genes, where X_{ i } = 1 represents gene g_{ i } to be a associated gene of the disease and X_{ i } = 0 otherwise.
Consider those individual genes. Let (π_{1}, π_{2}, ..., π_{ n+m }) be a set of probabilities, where π_{ i } represents the probability that X_{ i } = 1. Let x = (x_{1}, x_{2}, ..., x_{ n+m }) be observations of X. The probability distribution of configuration x is proportional to
where {\alpha}_{i}=\mathsf{\text{log}}\frac{{\pi}_{i}}{1{\pi}_{i}}, and {\sum}_{i=1}^{n+m}\mathsf{\text{log}}\left(1{\pi}_{i}\right) is a constant.
Next, consider pairwise relationships between genes. Suppose we have K biological networks H = (H^{1}, ..., H^{K}), where vertices represent genes. Given a H^{k}, edges of H^{k} represent a specific kind of biological relationship between those genes. Let x be the observation labels of X. According to x, edges of H^{k} can be classified into three categories: (1) edges that between two 1labelled vertices, (2) edges that between a 1labelled vertex and a 0labelled vertex, and (3) edges that between two 0labelled vertices. Let {N}_{11}^{k},{N}_{10}^{k} and {N}_{00}^{k} denote the number of edges in each category of G^{k} respectively. Then
The probability that we have such a kind of biological network H^{k} conditional on those observed labels x follows as
where θ^{k} = (β^{k}, γ^{k}, κ^{k}) are weights of these three kinds of edges for H^{k}. One of three parameters in θ^{k} is redundant. Without loss of generality, let κ^{k} = 1. Similarly, for K biological networks, the probability that we observe them conditional on the observed labels follows as
Based on the Ising model, the energy function can be written in terms of x as
where θ = (α_{ i }, β^{1}, γ^{1}, ..., β^{K} , γ^{K}) are parameters. In the terminology of MRF [30], U (xθ) defines a Gibbs distribution of the entire networks
where Z(θ) is the normalized constant that is calculated by summing over all configurations χ:
The Gibbs sampling
The Gibbs distribution (14) gives a prior probability distribution of the configuration for all genes. In the study of identifying human disease genes, the objective is to find the posterior probability of X_{1}, X_{2}, · · ·, X_{ n } conditional on known disease genes
To achieve this, consider the following posterior probability distribution of an individual gene X_{ i }
where X_{[−i] }= (X_{1}, · · ·, X_{ i−1 }, X_{ i+1 }, · · ·, X_{ n+m }) represents labels of all other genes except X_{ i }, θ are parameters. According to the Bayes' theorem [36] and the Gibbs distribution (14), we have
where
the according to equation (13), and
Here {M}_{0}^{k} and {M}_{1}^{k} are the number of neighbors of the gene {g}_{i} labelled with 0 and 1 on network H^{k} , k = 1, ..., K, respectively.
Equation (15) provides a method to update the label X_{ i } according to all other labels. Suppose parameters θ = (α_{ i }, β^{1}, γ^{1}, ..., β^{K}, γ^{K}) of the model are given, together with prior observed labels of all genes. Using equation (15), we can update labels for all unknown genes. Repeating this procedure a number of times until all posterior probabilities of labels are stabilized. This is the essential procedure of the Gibbs sampling.
Parameter estimation
In practice, we do not know parameters of the model and they need to be estimated according to those known informations. Ideally, the maximum likelihood estimation (MLE) method is a good choice to estimate θ in equation (14). However, the normalizing part Z(θ) is also a function of θ, which is the main difficulty for using the MLE method directly. Deng et al. [30] using a pseudolikelihood method to estimate parameters in the MRF model. Specifically, the following pseudolikelihood function is derived from equation (15), which is
The parameter estimation can be done by a binary logistic regression, where dependent variables in equation (19) are categorical labels and independent variables are {M}_{0}^{1},\phantom{\rule{2.77695pt}{0ex}}{M}_{1}^{1},\dots ,{M}_{0}^{K},\phantom{\rule{2.77695pt}{0ex}}{M}_{1}^{K} of the K biological networks. The standard MATLAB function glmf it() can be employed to perform such binary logistic regression.
The pseudolikelihood method used by Deng et al. [30] is valuable. However, there is an important potential problem [32, 37], which may result in unreasonable predictions with their original method. The parameter estimation of Deng et al. [30] is conducted on only known labelled vertices of biological networks. However, a known vertex with labelling 1 may have plenty of unknown vertices with labelling 0 in a biological network and vice versa. A neglect of those unknown vertices may result in inaccurate estimated parameters, which makes predictions problematic. This problem becomes serious with the increasing number of unknown vertices [37]. Kourmpetis et al. [37] alternatively introduce a Bayesian MRF model to estimate parameters and update labels at the same time. An adaptive Markov Chain Monte Carlo (MCMC) algorithm is employed to perform the estimation by using another scaling parameter, a Z matrix and a multivariate normal distribution.
In this study, we introduce a new method to simultaneously estimate parameters and update labels. Suppose a prior probability of π_{ i } for each unknown vertex is known. A set of prior labels of unknown vertices can be assigned according to this probability. Then the pseudolikelihood parameter estimation method is performed on all labeled vertices, including those known labelled ones and those unknown prior labelled ones. Using these estimated parameters to update labels for all unknown vertices, and then using the updated labels to reestimate parameters until both of them are stable. The stepbystep description of this procedure is given as follows.
1 Initialization:
Let t = 0, and initialize labels of all vertices \left({X}_{1}^{\left(0\right)},{X}_{2}^{\left(0\right)},\dots ,{X}_{n+m}^{\left(t\right)}\right)
2 Estimating parameters:
3 Gibbs sampling:
4 Let t = t + 1, and go to 2, until stabilized.
During the Gibbs sampling procedure, a "burnin period" and a "lag period" often need to be specified. The "burnin period" is the period that a Markov process takes to become stabilized. Simulation results in this period are discarded to reduce the effect of initial prior probabilities. The "lag period" is the period that needs to reduce the dependence of the Markov process. The posterior probabilities in this period are estimated by averaging simulation results during individual lag steps.
In this study, the "burnin period" takes 100 steps while the "lag period" takes 90 steps. Simulation results are averaged every 10 steps in the "lag period". There is 1000 steps in total for simulations. For convenience, predictions made by the original MRF model of Deng et al. [30] is denoted as "MRFDeng", while predictions of our improved MRF method is denoted as "IMRF_{1}" hereafter. A second improved MRF method is also given in the following by adding a new period at last in simulations, which is called "prediction period". It takes the average estimated parameters in the "lag period" as parameters and fixes them hereafter in simulations. The input probabilities of unknown vertices are also obtained by the average posterior probabilities in the "lag period". The Markov process runs another 100 steps in this period. The average posterior probabilities in the "prediction period" are outputted as final predictions, and predictions of this method is denoted as "IMRF_{2}".
Estimating a prior probability
Now, the only problem left is to estimate the prior probability of π_{ i }. Similarly as the method used in Deng et al. [30], we also estimate them according to known protein complexes. Since genes that encode proteins in a same complex tend to associated with similar diseases. For a gene g_{ i } that encodes protein in a complex,
be the prior probability, where A is the number of disease genes for a specific disease in the complex, and B is the number of all disease genes in the complex. If a gene appears in multiple protein complexes, we use the maximum value as the prior probability for the gene.
For those genes that do not belong to any protein complex, let
as the prior probability, where C is the number of all currently known disease genes for the specific disease, and D is the total number of genes in human genome.
Data sources
The genedisease association data are obtained from Goh et al. [3], which contain 1 284 disorders and 1 777 disease genes. These data are originally collected from the Morbid Map list of the Online Mendelian Inheritance in Man (OMIM) [38]. Disorders are manually classified into 22 primary disease classes, including a 'multiple' class and a 'unclassified' class. In this study, we consider only those disease classes that consist of at least 30 genes. We also exclude the 'multiple' class, the 'unclassified' class, the 'cancer' class and the 'neurological' class due to the class evidence and the class heterogeneity [3]. The final dataset consists of 815 genes in 12 disease classes.
The protein complex data are collected from the database of CORUM [39] and PCDq [40]. There are 1677 and 1103 protein complexes in the dataset that consist of at least two proteins, respectively. There are in total 3881 proteins in those protein complexes.
The PPI datasets are derived from the database of HPRD (Release 9) [9], BioGrid (Release 3.2.108) [10] and IntAct (downloaded on Jan 26, 2014) [11], respectively. Duplicated edges between the same pair of vertices and edges connecting to itself are deleted. Each dataset is processed independently, and three PPI networks are obtained finally. The HPRD PPI network consists of 9465 vertices and 37039 edges. The BioGrid PPI network consists of 15298 vertices and 127612 edges. The IntAct PPI network consists of 13449 vertices and 63825 edges.
The pathway datasets are obtained from the database of KEGG [12], Reactome [13], PharmGKB [14] and PIN [15], There are 280, 1469, 99 and 2679 pathways in datasets, respectively. There are in total 8614 proteins in those pathways. A pathway coexisting network is constructed by taking individual proteins/genes as vertices. Edges are constructed between two vertices, if they coexist in any pathway.
The human gene expression profiles are obtained from BioGPS (GSE1133) [16, 17], which contain 79 human tissues in duplicates, measured using the Affymetrix U133A array. Pairwise Pearson correlation coefficients (PCC) are calculated and a pair of genes are linked by an edge if the PCC value is larger than 0.5, similar to the method used in [3, 26].
Hence, five biological networks are constructed by collecting data from various databases. All protein IDs are mapped onto the form of the gene symbol. In order to test the performance of multiple data integration of our methods, we select those genes that appears at least four times in the five networks. The final datasets consist of 7311 human genes, 815 out of which are known associated with 12 disease classes.
Validation method and evaluation criteria
The accuracy of predictions is validated by the leaveoneout method. For each known disease gene with at least one annotated interaction partner in a biological network, we assume it is an unknown gene and predict its posterior probability by our proposed methods. We use the receiver operating characteristic (ROC) curve to show the relationship between the true positive rate and the false positive rate by varying the threshold for declaring positives. The area under the ROC curve (AUC) is also employed to show an overall measure of the performance. The negative control set consists of known disease genes that do not belong to current disease class, and they are also validated by using the leaveoneout method.
Decision score and declaration of positives
One can directly use the posterior probabilities obtained by the Gibbs sampling to select candidate disease genes. The greater the probability is for a gene, the more likely it is to associated with specific disease. However, different disease classes consist of different numbers of known disease genes, and thus the prediction results may not be good if a global threshold is used for all classes. Hence, we propose to use a percentage as a decision score to generate the finial predictions. All the ROC curves and the AUC scores of our "IMRF_{1}" and "IMRF_{2}" method are calculated according to the decision score hereafter.
Results and discussion
We first analyze the performance of the IMRF_{1} and IMRF_{2} algorithms in terms of stability and reliability, and then compare our method with the original MRFDeng method [30], the RWR algorithm [24] and the DIR algorithm [26]. These three algorithms are selected elaborately.
Firstly, since ideas of our improved methods (IMRF_{1} and IMRF_{2}) are initially inspired by the MRFDeng method, the direct comparison illustrates how much improvement can be made results from our methods.
Secondly, we compare our methods with the RWR algorithm to show which manner of multiple data integration is better. The RWR algorithm is a typical data integration method that uses a mixed network, where vertices and edges of several biological networks are simply merged together, while our methods integrate different networks separately.
Finally, the DIR algorithm has a very good performance among multiple data integration methods, which also integrates different networks separately. It is the same with our methods in terms of the data integration method.
Stability and reliability of MRF methods
We first investigate the stability and reliability MRF methods, by analyzing Markov processes of the IMRF_{1} method and the MRFDeng method.
Parameters of the MRFDeng method are estimated from subnetworks of known vertices. This is feasible to be used for predicting protein functions of yeast in [30], since each function class consists of at least hundreds known vertices, which is possible for estimating reasonable parameters.
However, for disease gene identifications, only dozens of disease genes are available for individual disease classes. The estimated parameters of the MRFDeng method becomes unreliable. This can be seen by analyzing characters of Figure 1. In a Gibbs sampling process, it stops until all Markov processes and parameters are stabilized. However, stabilized Markov processes and parameters do not indicate they converge to expected results. It is also stabilized if most vertices are labelled with 1. Take the Figure 1 (a) and the Figure 1 (c) for example, the variation of posterior probability distributions by using the MRFDeng method is smaller than the IMRF_{1} method. It seems the performance of the MRFDeng method is better. However, if we look at Figure 1 (b) and Figure 1 (d), we find that there are 23.3% vertices with probabilities larger than 0.97. This is commonly unreasonable in practices, since it contains too many false positive predictions. The predictions of the IMRF_{1} is reasonable. Most unknown vertices are ranked with a very low probability by using the IMRF_{1} method. Only 5.7% unknown vertices are ranked with probabilities larger than 0.065, and only a few significant vertices are predicted with higher probabilities.
Here, the variation of posterior probabilities for two adjacent steps is calculated from
where P_{ i }(t) is the posterior probability P (X_{ i } = 1 X_{[−i]}, θ) of g_{ i } obtained in the t^{th} iteration.
Figure 2 illustrates the variation of estimated parameters for adjacent steps by using the IMRF_{1} method. We can see that all parameters converge very fast, but noises still exist and cannot be reduced by increasing iteration steps. This inspires us to add a "prediction period" for Gibbs sampling processes. The "prediction period" takes the average estimated parameters in the "lag period" as parameters and fixes them hereafter in simulations. The input probabilities of unknown vertices are also obtained by taking the average posterior probabilities in the "lag period".
Comparisons with the MRFDeng method
Our improved methods are significantly better than the MRFDeng method in terms of identifying disease genes. Figure 3 illustrates comparisons of the MRFDeng method, the IMRF_{1} method and the IMRF_{2} method in terms of ROC curves. Predictions of the IMRF_{1} method is significantly better than that by using the MRFDeng method, but is a little worse than the IMRF_{2} method, no matter using single biological network or using integrated biological networks. In terms of informativeness of each biological network, the HPRD PPI network (shows in Figure 3 (a)) is the most informative data source, which obtains the highest AUC value in all three methods.
Integration of heterogeneous data sources
Different biological datasets are commonly heterogeneous. When information in those data is integrated, noises are also integrated. Hence, an inappropriate method may result in a set of worse predictions than using only single dataset. Generally, various data integration methods can be divided into two categories: (1) by using a mixed network and (2) by using several separated networks. Generally, separated networks contain more information than the mixed network, since it is very easy to generate the mixed network from several separated networks but not vice versa. One advantage of the MRF model is that it takes the whole network into consideration, which potentially yields better performance than those using mixed network ones.
In Figures 4, we use the most stable IMRF_{2} method to compare the differences between different kinds of data integration methods. The separated network method achieves the best performance among all predictions, while the mixed network method achieves only modest performance. It seems that the mixed network method combines informations of individual datasets together with their noises, which does not improve its performance by integrating multiple datasets.
Comparisons by using multiple data sources
The IMRF_{2} method is compared with the RWR algorithm, the DIR algorithm and the MRFDeng algorithm, respectively. Figure 5 illustrates ROC crossvalidation results by integrating all five biological networks. The IMRF_{2} method achieves the highest AUC score at 0.743, followed by the DIR algorithm (AUC = 0.691) and the RWR algorithm (AUC = 0.676). The MRFDeng method achieves the AUC score only at 0.551. It also shows that the separated network interaction method performs better than the mixed network RWR method.
Conclusions
In this paper, we have presented an improved multiple data integration method for prioritizing human disease genes, which is based on the theory of MRF and the method of Bayesian analysis. The presented method is both flexible in terms of integrating different kinds of biological data and reliable in terms of prioritizing human disease genes. Compared to the MRFDeng method [30], two strategies have been developed to significantly improve the performance of the MRF method for disease gene identifications.
Firstly, parameters of our improved MRF methods are estimated according to all labelled vertices in integrated biological networks, instead of estimating them according to only known vertices. Moreover, parameters are updated together with sampling labels during iterations, instead of using fixed parameters. The improved parameter estimation method makes our MRF methods more stable and more reliable.
Secondly, a new "prediction period" is added to Gibbs sampling process. Parameters of this period is obtained by taking average parameters in the previous "lag period" and is fixed during iterations of this period. The input probability is also obtained by taking average of posterior probabilities in the "lag period". This strategy significant improves the prediction accuracy of our method.
Predictions when integrating known genedisease associations, protein complexes, PPIs, pathways and gene expression profiles achieve the AUC score of 0.743, which is better than the RWR method and the DIR method by using the same datasets.
Abbreviations
 MRF:

Markov random field
 PPI:

proteinprotein interaction
 RWR:

random walk with restart
 DIR:

data integration rank
 MLE:

maximum likelihood estimation
 MCMC:

Markov chain Monte Carlo
 OMIM:

online Mendelian inheritance in man
 PCC:

Pearson correlation coefficient
 ROC:

receiver operating characteristic
 AUC:

area under the ROC curve.
References
Oti M, Snel B, Huynen MA, Brunner HG: Predicting disease genes using proteinprotein interactions. J Med Genet. 2006, 43 (8): 691698. 10.1136/jmg.2006.041376.
Sun PG, Gao L, Han S: Prediction of human diseaserelated gene clusters by clustering analysis. Int J Biol Sci. 2011, 7 (1): 6173.
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL: The human disease network. Proc Natl Acad Sci USA. 2007, 104 (21): 86858690. 10.1073/pnas.0701361104.
Oti M, Brunner HG: The modular nature of genetic diseases. Clin Genet. 2007, 71 (1): 111.
Wu X, Jiang R, Zhang MQ, Li S: Networkbased global inference of human disease genes. Mol Syst Biol. 2008, 4: 189
Altshuler D, Daly M, Kruglyak L: Guilt by association. Nat Genet. 2000, 26 (2): 135137. 10.1038/79839.
Rual JF, Venkatesan K, Hao T, HirozaneKishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, AyiviGuedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, et al: Towards a proteomescale map of the human proteinprotein interaction network. Nature. 2005, 437 (7062): 11731178. 10.1038/nature04209.
Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksöz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human proteinprotein interaction network: a resource for annotating the proteome. Cell. 2005, 122 (6): 957968. 10.1016/j.cell.2005.08.029.
Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A: Human Protein Reference Database  2009 update. Nucleic Acids Res. 2009, 37 (Database): D767D772. 10.1093/nar/gkn892.
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006, 34 (Database): D535539.
Kerrien S, AlamFaruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, MontecchiPalazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H: IntAct  open source resource for molecular interaction data. Nucleic Acids Res. 2007, 35 (Database): D561565. 10.1093/nar/gkl958.
Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28 (1): 2730. 10.1093/nar/28.1.27.
Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L: Reactome: a knowledge base of biologic pathways and processes. Genome Biol. 2007, 8 (3): R3910.1186/gb200783r39.
WhirlCarrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, Altman RB, Klein TE: Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther. 2012, 92 (4): 414417. 10.1038/clpt.2012.96.
Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH: PID: the Pathway Interaction Database. Nucleic Acids Res. 2009, 37 (Database): D674D679. 10.1093/nar/gkn653.
Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW, Su AI: BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol. 2009, 10 (11): R13010.1186/gb20091011r130.
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human proteinencoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101 (16): 60626067. 10.1073/pnas.0400782101.
Lukk M, Kapushesky M, Nikkilä J, Parkinson H, Goncalves A, Huber W, Ukkonen E, Brazma A: A global map of human gene expression. Nat Biotechnol. 2010, 28 (4): 322324. 10.1038/nbt0410322.
Lage K, Karlberg EO, Størling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tümer Z, Pociot F, Tommerup N, Moreau Y, Brunak S: A human phenomeinteractome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007, 25 (3): 309316. 10.1038/nbt1295.
Hwang T, Zhang W, Xie M, Liu J, Kuang R: Inferring disease and gene set associations with rank coherence in networks. Bioinformatics. 2011, 27 (19): 26922699. 10.1093/bioinformatics/btr463.
Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R: Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010, 6 (1): e1000641.10.1371/journal.pcbi.1000641.
Li Y, Agarwal P: A PathwayBased View of Human Diseases and Disease Relationships. PLoS One. 2009, 4 (2): e434610.1371/journal.pone.0004346.
Ma X, Lee H, Wang L, Sun F: CGI: a new approach for prioritizing genes by combining gene expression and proteinprotein interaction data. Bioinformatics. 2007, 23 (2): 215221. 10.1093/bioinformatics/btl569.
Köhler S, Bauer S, Horn D, Robinson PN: Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008, 82 (4): 949958. 10.1016/j.ajhg.2008.02.013.
Zhang W, Sun F, Jiang R: Integrating multiple proteinprotein interaction networks to prioritize disease genes: a Bayesian regression approach. BMC Bioinformatics. 2011, 12 (Suppl 1): S1110.1186/1471210512S1S11.
Chen Y, Wang W, Zhou Y, Shields R, Chanda SK, Elston RC, Li J: In silico gene prioritization by integrating multiple data sources. PLoS One. 2011, 6 (6): e2113710.1371/journal.pone.0021137.
Chen B, Shi J, Zhang S, Wu FX: Identifying protein complexes in proteinprotein interaction networks by using clique seeds and graph entropy. Proteomics. 2013, 13 (2): 269277. 10.1002/pmic.201200336.
Chen B, Wu FX: Identifying protein complexes based on multiple topological structures in PPI networks. IEEE Trans Nanobioscience. 2013, 12 (3): 165172.
Strohman R: Maneuvering in the complex path from genotype to phenotype. Science. 2002, 296 (5568): 701703. 10.1126/science.1070534.
Deng M, Chen T, Sun F: An integrated probabilistic model for functional prediction of proteins. J Comput Biol. 2004, 11 (23): 463475. 10.1089/1066527041410346.
Bentley DR: The Human Genome Project  an overview. Med Res Rev. 2000, 20 (3): 189196. 10.1002/(SICI)10981128(200005)20:3<189::AIDMED2>3.0.CO;2#.
Chen B, Wang J, Wu FX: Prioritizing human disease genes by multiple data integration. Bioinformatics and Biomedicine (BIBM), 2013 IEEE International Conference on. 2013, 621
Li SZ: Markov random field models in computer vision. In Proceedings of the European Conference on Computer Vision. 1994, 361370.
Besag J: Spatial Interaction and the Statistical Analysis of Lattice Systems. J Royal Statist Soc B. 1974, 36 (2): 192236.
Kamberova G: Markov random field models: a Bayesian approach to computer vision problems. Department of Computer & Information Science Technical Reports. 1992, University of Pennsylvania
Suess EA, Trumbo BE: Introduction to probability simulation and Gibbs sampling with R. 2010, Springer New York
Kourmpetis YA, van Dijk AD, Bink MC, van Ham RC, ter Braak CJ: Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data. PLoS One. 2010, 5 (2): e929310.1371/journal.pone.0009293.
McKsick VA: Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet. 2007, 80 (4): 588604. 10.1086/514346.
Ruepp A, Waegele B, Lechner M, Brauner B, DungerKaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW: CORUM: the comprehensive resource of mammalian protein complexes  2009. Nucleic Acids Res. 2010, 38 (Database): D497D501. 10.1093/nar/gkp914.
Kikugawa S, Nishikata K, Murakami K, Sato Y, Suzuki M, AltafUlAmin M, Kanaya S, Imanishi T: PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from hinvitational proteinprotein interactions integrative dataset. BMC Syst Biol. 2012, 6 (Suppl 2): S710.1186/175205096S2S7.
Declarations
The publication costs for this article were supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the National Natural Science Foundation of China under Grant No.61232001 and No. 61370024, and the Program for New Century Excellent Talents in University (NCET120547).
This article has been published as part of BMC Medical Genomics Volume 7 Supplement 2, 2014: IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Bioinformatics in Medical Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/7/S2.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
FXW and BC initiated this study and designed algorithms and experiments. BC performed the experiments, analyzed the results, and drafted the manuscript. FXW, JXW and ML revised the manuscript. All authors have read and approved the final manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Chen, B., Wang, J., Li, M. et al. Identifying disease genes by integrating multiple data sources. BMC Med Genomics 7 (Suppl 2), S2 (2014). https://doi.org/10.1186/175587947S2S2
Published:
DOI: https://doi.org/10.1186/175587947S2S2
Keywords
 disease gene
 data integration
 Markov random field
 Bayesian analysis