In this paper, we first briefly describe how the problem is formulated as a Bayesian labelling problem, in which the labelling configuration is assumed to follow a Gibbs distribution. We then introduce an MRF model that solves this problem by integrating multiple kinds of biological data, including known gene-disease associations, protein complexes, PPIs, pathways and gene expression profiles.

### The Bayesian labelling problem

Let *L* = {*L*_{1}, *L*_{2}, ..., *L*_{k}} be a set of *k* labels and *S* = {*S*_{1}, *S*_{2}, ..., *S*_{r}} be a set of *r* sites. A *labelling problem* [33] is defined as assigning each site *S*_{i} a label in *L*.

Let *F* = {*F*_{1}, *F*_{2}, ..., *F*_{r}} be a family of random variables defined on *S*, in which each random variable *F*_{i} takes a value *f*_{i} in *L*. We use the notation *F* = *f* to represent the joint event {*F*_{1} = *f*_{1}, ..., *F*_{r} = *f*_{r}}, where *f* = {*f*_{1}, ..., *f*_{r}} is called a *configuration* of *F*. The set of all configurations is denoted as \mathcal{F}.

The relationship of sites is determined by a neighborhood system \mathcal{N}=\left\{{N}_{i}\mid \forall i\in S\right\}, where *N*_{i} is the set of sites neighboring *i*.

A family of random variables *F* is said to be an MRF on *S* w.r.t. \mathcal{N} if and only if the following two conditions are satisfied:

1 Positivity: P\left(f\right)>0,\forall f\in \mathcal{F},

2 Markovianity: P\left({f}_{i}|{f}_{S\backslash \left\{i\right\}}\right)=P\left({f}_{i}|{f}_{{N}_{i}}\right).

The Markovianity indicates that the probability of a local event *f*_{i} conditioned on all other events is equivalent to that conditioned on only the events of its neighbors. Hence, the joint probability *P*(*f*) of the random field can be uniquely determined by local conditional probabilities.

Let *r* be an observation of *F*. Suppose we know both the prior probability distribution *P*(*f*) of a configuration *f* and the conditional probability distribution *P*(*r*|*f*) of the observation *r* given the configuration *f*. The best estimation of *f* is the one maximizing the a posteriori probability (MAP), which is

P\left(f|r\right)=P\left(r|f\right)P\left(f\right)/P\left(r\right)

(1)

where *P*(*r*) is the probability that we get the observation *r*.

The *Bayesian labelling problem* [33] is, given an observation *r*, to find the MAP configuration of labelling {f}^{*}={\mathsf{\text{argmax}}}_{f\in \mathcal{F}}P\left(f|r\right). Here, as *P*(*r*) is not a function of *f*, it does not affect the MAP estimation of *f*.
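When the number of sites is small, the MAP configuration can be found by exhaustive enumeration, which makes the definition concrete. The sketch below uses an invented `toy_posterior` purely for illustration (real problems require approximate inference, such as the Gibbs sampling described later):

```python
import itertools

def map_labelling(sites, labels, posterior):
    """Brute-force MAP estimate: enumerate every configuration f and
    keep the one maximizing P(f | r), up to the constant P(r)."""
    best_f, best_p = None, -1.0
    for f in itertools.product(labels, repeat=len(sites)):
        p = posterior(f)
        if p > best_p:
            best_f, best_p = f, p
    return best_f

# Toy posterior (invented): prefers configurations whose consecutive
# sites agree, with a slight bias toward label 1 at the first site.
def toy_posterior(f):
    agree = sum(1 for a, b in zip(f, f[1:]) if a == b)
    return 2.0 ** agree * (1.5 if f[0] == 1 else 1.0)

print(map_labelling(range(3), [0, 1], toy_posterior))  # (1, 1, 1)
```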

### Gibbs distribution in MRF

It is usually hard to specify the prior probability of an MRF for a real problem. Fortunately, the Hammersley-Clifford theorem [34] provides a solution. According to the theorem, *F* is an MRF on *S* w.r.t. \mathcal{N} if and only if the probability distribution *P*(*F* = *f*) of the configuration is a Gibbs distribution w.r.t. \mathcal{N}. The Gibbs distribution has the form

P\left(f\right)={Z}^{-1}\cdot {e}^{-U\left(f\right)/T},

(2)

where Z={\sum}_{f\in \mathcal{F}}{e}^{-U\left(f\right)/T} is a normalizing constant, *T* is a global control constant that is often assumed to be 1, and *U*(*f*) is the energy function calculated as follows

\begin{array}{c}U\left(f\right)={\displaystyle \sum _{c\in C}}{V}_{c}\left(f\right)\\ ={\displaystyle \sum _{\left\{i\right\}\in {C}_{1}}}{V}_{1}\left({f}_{i}\right)+{\displaystyle \sum _{\{\left(i,j\right)\}\in {C}_{2}}}{V}_{2}\left({f}_{i},{f}_{j}\right)+{R}_{\mathrm{n}}\left(f\right),\end{array}

(3)

where *V*_{i}(*f*) is the energy potential of *C*_{i} (the set of *i*^{th} order cliques) in the neighborhood system \mathcal{N}, and *R*_{n}(*f*) represents the higher order terms. A special case of MRF is the Ising model, which considers only cliques up to the second order [35].

Given a configuration *f*, let the conditional probability distribution of the observation *r* have the same exponential form

P\left(r|f\right)={Z}_{r}^{-1}\cdot {e}^{-U\left(r|f\right)}.

(4)

Then the posterior probability of the Gibbs distribution has form

P\left(f|r\right)={Z}_{E}^{-1}\cdot {e}^{-U\left(f|r\right)},

(5)

where the posterior energy is [33]

U\left(f|r\right)=U\left(f\right)+U\left(r|f\right).

(6)

Based on this, suppose the collection of all human genes *G* = {*g*_{1}, *g*_{2}, ..., *g*_{N}} is the site set and {1, 0} is the label set, where 1 represents that a gene is a disease gene and 0 otherwise. The problem of human disease gene identification is then to find the best configuration of *G* according to what is currently known about human diseases.

### The MRF model for identifying human disease genes

Suppose the human genome consists of a set of *N* genes *G* = {*g*_{1}, *g*_{2}, ..., *g*_{N}}. Some of them are already known to be associated with genetic diseases, while the associations of most other genes are still unknown. Without loss of generality, let *g*_{1}, *g*_{2}, ..., *g*_{n} be genes that are not yet known to be associated with genetic diseases, and *g*_{n+1}, *g*_{n+2}, ..., *g*_{n+m} be currently known disease genes. Obviously, we have *N* = *n* + *m*. Let {*D*_{1}, *D*_{2}, ..., *D*_{M}} be a set of human diseases, where *D*_{i} consists of the set of genes already known to be associated with the *i*^{th} disease.

For a specific disease, let *X* = (*X*_{1}, *X*_{2}, ..., *X*_{n+m}) be the random variables defined on all genes, where *X*_{i} = 1 represents that gene *g*_{i} is an associated gene of the disease and *X*_{i} = 0 otherwise.

Consider individual genes first. Let (*π*_{1}, *π*_{2}, ..., *π*_{n+m}) be a set of probabilities, where *π*_{i} represents the probability that *X*_{i} = 1. Let *x* = (*x*_{1}, *x*_{2}, ..., *x*_{n+m}) be observations of *X*. The probability distribution of a configuration *x* is proportional to

\begin{array}{c}{\displaystyle \prod _{i=1}^{n+m}}{\pi}_{i}^{{x}_{i}}{\left(1-{\pi}_{i}\right)}^{1-{x}_{i}}={\displaystyle \prod _{i=1}^{n+m}}{\left(\frac{{\pi}_{i}}{1-{\pi}_{i}}\right)}^{{x}_{i}}\left(1-{\pi}_{i}\right)\\ =\mathsf{\text{exp}}\left[{\displaystyle \sum _{i=1}^{n+m}}{\alpha}_{i}{x}_{i}+{\displaystyle \sum _{i=1}^{n+m}}\mathsf{\text{log}}\left(1-{\pi}_{i}\right)\right]\propto \mathsf{\text{exp}}{\displaystyle \sum _{i=1}^{n+m}}{\alpha}_{i}{x}_{i},\end{array}

(7)

where {\alpha}_{i}=\mathsf{\text{log}}\frac{{\pi}_{i}}{1-{\pi}_{i}}, and {\sum}_{i=1}^{n+m}\mathsf{\text{log}}\left(1-{\pi}_{i}\right) is a constant.
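As a quick numeric check of this reparameterization, *α*_{i} is simply the log-odds of the prior *π*_{i}:

```python
import math

def alpha(pi):
    """Log-odds alpha_i = log(pi_i / (1 - pi_i)) from equation (7)."""
    return math.log(pi / (1.0 - pi))

print(round(alpha(0.5), 6))     # 0.0: an uninformative prior contributes nothing
print(alpha(0.9) > alpha(0.1))  # True: a larger prior gives a larger weight
```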

Next, consider pairwise relationships between genes. Suppose we have *K* biological networks *H* = (*H*^{1}, ..., *H*^{K}) whose vertices represent genes. For a given *H*^{k}, the edges of *H*^{k} represent a specific kind of biological relationship between genes. Let *x* be the observed labels of *X*. According to *x*, the edges of *H*^{k} can be classified into three categories: (1) edges between two 1-labelled vertices, (2) edges between a 1-labelled vertex and a 0-labelled vertex, and (3) edges between two 0-labelled vertices. Let {N}_{11}^{k},{N}_{10}^{k} and {N}_{00}^{k} denote the number of edges in each category of *H*^{k}, respectively. Then

{N}_{11}^{k}={\displaystyle \sum _{\left\{\left(i,j\right)\right\}\in E\left({H}^{k}\right)}}{x}_{i}{x}_{j},

(8)

{N}_{10}^{k}={\displaystyle \sum _{\left\{\left(i,j\right)\right\}\in E\left({H}^{k}\right)}}\left(1-{x}_{i}\right){x}_{j}+{x}_{i}\left(1-{x}_{j}\right),

(9)

{N}_{00}^{k}={\displaystyle \sum _{\left\{\left(i,j\right)\right\}\in E\left({H}^{k}\right)}}\left(1-{x}_{i}\right)\left(1-{x}_{j}\right).

(10)
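Equations (8)-(10) translate directly into code; the sketch below counts the three edge categories for a toy network given observed labels *x*:

```python
def edge_counts(edges, x):
    """Count the (N11, N10, N00) edge categories of a network H^k under
    labels x, following equations (8)-(10)."""
    n11 = sum(x[i] * x[j] for i, j in edges)
    n10 = sum((1 - x[i]) * x[j] + x[i] * (1 - x[j]) for i, j in edges)
    n00 = sum((1 - x[i]) * (1 - x[j]) for i, j in edges)
    return n11, n10, n00

# Toy network on four genes: one edge in each category.
x = [1, 1, 0, 0]
edges = [(0, 1), (1, 2), (2, 3)]
print(edge_counts(edges, x))  # (1, 1, 1)
```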

The probability of observing such a biological network *H*^{k} conditional on the observed labels *x* is

P\left({H}^{k}|x,{\theta}^{k}\right)\propto {e}^{{\beta}^{k}{N}_{10}^{k}+{\gamma}^{k}{N}_{11}^{k}+{\kappa}^{k}{N}_{00}^{k}},

(11)

where *θ*^{k} = (*β*^{k}, *γ*^{k}, *κ*^{k}) are the weights of these three kinds of edges for *H*^{k}. One of the three parameters in *θ*^{k} is redundant; without loss of generality, let *κ*^{k} = 1. Similarly, for the *K* biological networks, the probability that we observe them conditional on the observed labels is

P\left({H}^{1},\dots ,{H}^{K}|x,{\theta}^{1},\dots ,{\theta}^{K}\right)\propto {e}^{{\sum}_{k=1}^{K}\left({\beta}^{k}{N}_{10}^{k}+{\gamma}^{k}{N}_{11}^{k}+{N}_{00}^{k}\right)}.

(12)

Based on the Ising model, the energy function can be written in terms of *x* as

U\left(x|\theta \right)=-{\displaystyle \sum _{i=1}^{n+m}}{\alpha}_{i}{x}_{i}-{\displaystyle \sum _{k=1}^{K}}\left({\beta}^{k}{N}_{10}^{k}+{\gamma}^{k}{N}_{11}^{k}+{N}_{00}^{k}\right)

(13)

where *θ* = (*α*_{i}, *β*^{1}, *γ*^{1}, ..., *β*^{K}, *γ*^{K}) are parameters. In the terminology of MRF [30], *U*(*x*|*θ*) defines a Gibbs distribution over the entire networks

P\left(x|\theta \right)=\frac{1}{Z\left(\theta \right)}\times {e}^{-U\left(x|\theta \right)},

(14)

where *Z*(*θ*) is the normalizing constant, calculated by summing over all configurations *x* ∈ *χ*:

Z\left(\theta \right)={\displaystyle \sum _{x\in \chi}}{e}^{-U\left(x|\theta \right)}.

### The Gibbs sampling

The Gibbs distribution (14) gives a prior probability distribution of the configuration for all genes. In the study of identifying human disease genes, the objective is to find the posterior probability of *X*_{1}, *X*_{2}, ..., *X*_{n} conditional on known disease genes

P\left({X}_{1},{X}_{2},\cdots \phantom{\rule{0.3em}{0ex}},{X}_{n}|{X}_{n+1},{X}_{n+2},\cdots \phantom{\rule{0.3em}{0ex}},{X}_{n+m}\right).

To achieve this, consider the following posterior probability distribution of an individual gene *X*_{i}

P\left({X}_{i}=1|{X}_{\left[-i\right]},\theta \right)

where *X*_{[−i]} = (*X*_{1}, ..., *X*_{i−1}, *X*_{i+1}, ..., *X*_{n+m}) represents the labels of all genes except *X*_{i}, and *θ* are the parameters. According to Bayes' theorem [36] and the Gibbs distribution (14), we have

\begin{array}{c}P\left({X}_{i}=1|{X}_{\left[-i\right]},\theta \right)=\frac{P\left({X}_{i}=1,{X}_{\left[-i\right]}|\theta \right)}{P\left({X}_{i}=1,{X}_{\left[-i\right]}|\theta \right)+P\left({X}_{i}=0,{X}_{\left[-i\right]}|\theta \right)}\\ =\frac{{e}^{-U\left({X}_{i}=1,{X}_{\left[-i\right]}|\theta \right)}}{{e}^{-U\left({X}_{i}=1,{X}_{\left[-i\right]}|\theta \right)}+{e}^{-U\left({X}_{i}=0,{X}_{\left[-i\right]}|\theta \right)}}=\frac{{e}^{T\left(i\right)}}{{e}^{T\left(i\right)}+1},\end{array}

(15)

where

U\left({X}_{i}=1,{X}_{\left[-i\right]}|\theta \right)=U\left({X}_{\left[-i\right]}|\theta \right)-{\alpha}_{i}-{\displaystyle \sum _{k=1}^{K}}\left({\beta}^{k}{M}_{0}^{k}+{\gamma}^{k}{M}_{1}^{k}\right),

(16)

U\left({X}_{i}=0,{X}_{\left[-i\right]}|\theta \right)=U\left({X}_{\left[-i\right]}|\theta \right)-{\displaystyle \sum _{k=1}^{K}}\left({\beta}^{k}{M}_{1}^{k}+{M}_{0}^{k}\right),

(17)

both according to equation (13), and

\begin{array}{c}T\left(i\right)=-U\left({X}_{i}=1,{X}_{\left[-i\right]}|\theta \right)+U\left({X}_{i}=0,{X}_{\left[-i\right]}|\theta \right)\\ ={\alpha}_{i}+{\displaystyle \sum _{k=1}^{K}}\left[\left({\beta}^{k}-1\right){M}_{0}^{k}+\left({\gamma}^{k}-{\beta}^{k}\right){M}_{1}^{k}\right].\end{array}

(18)

Here {M}_{0}^{k} and {M}_{1}^{k} are the numbers of neighbors of gene {g}_{i} labelled with 0 and 1 in network *H*^{k}, *k* = 1, ..., *K*, respectively.
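Equations (15) and (18) together give the local update rule. A small sketch, with *κ*^{k} fixed to 1 as above (the argument values below are invented for illustration):

```python
import math

def posterior_prob(alpha_i, beta, gamma, m0, m1):
    """P(X_i = 1 | X_[-i], theta) from equations (15) and (18), where
    m0[k] and m1[k] are the 0- and 1-labelled neighbour counts of gene i
    in network H^k."""
    t = alpha_i + sum((beta[k] - 1.0) * m0[k] + (gamma[k] - beta[k]) * m1[k]
                      for k in range(len(beta)))
    return math.exp(t) / (math.exp(t) + 1.0)

# One network: a gene with many 1-labelled neighbours gets a high probability.
p = posterior_prob(alpha_i=0.0, beta=[1.0], gamma=[2.0], m0=[3], m1=[5])
print(round(p, 4))  # 0.9933, i.e. the logistic function at T(i) = 5
```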

Equation (15) provides a method to update the label *X*_{i} according to all other labels. Suppose the parameters *θ* = (*α*_{i}, *β*^{1}, *γ*^{1}, ..., *β*^{K}, *γ*^{K}) of the model are given, together with prior observed labels of all genes. Using equation (15), we can update the labels of all unknown genes. This procedure is repeated until all posterior probabilities of the labels are stabilized, which is the essential procedure of Gibbs sampling.

### Parameter estimation

In practice, the parameters of the model are not known and need to be estimated from known information. Ideally, the maximum likelihood estimation (MLE) method would be a good choice to estimate *θ* in equation (14). However, the normalizing part *Z*(*θ*) is also a function of *θ*, which is the main difficulty in using the MLE method directly. Deng et al. [30] used a pseudo-likelihood method to estimate parameters in the MRF model. Specifically, the following pseudo-likelihood function is derived from equation (15):

\mathsf{\text{log}}\frac{P\left({X}_{i}=1|{X}_{\left[-i\right]},\theta \right)}{1-P\left({X}_{i}=1|{X}_{\left[-i\right]},\theta \right)}=T\left(i\right)

(19)

The parameter estimation can be done by a *binary logistic regression*, where the dependent variables in equation (19) are the categorical labels and the independent variables are {M}_{0}^{1},{M}_{1}^{1},\dots ,{M}_{0}^{K},{M}_{1}^{K} of the *K* biological networks. The standard MATLAB function *glmfit*() can be employed to perform such a binary logistic regression.
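Outside MATLAB, the same regression can be approximated in plain Python. The sketch below fits equation (19) by gradient ascent on the log-likelihood; it is a stdlib stand-in for *glmfit*() with toy data, not the authors' implementation:

```python
import math

def fit_pseudo_likelihood(X, y, lr=0.1, steps=2000):
    """Binary logistic regression by gradient ascent. Rows of X are the
    neighbour counts (M_0^1, M_1^1, ..., M_0^K, M_1^K); y are the labels.
    Returns [intercept, coefficients...]."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))       # predicted P(X_i = 1)
            grad[0] += yi - p                    # intercept gradient
            for j in range(d):
                grad[j + 1] += (yi - p) * xi[j]  # coefficient gradients
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

# Toy data (invented): label 1 whenever the 1-neighbour count dominates.
X = [(3, 0), (2, 1), (0, 3), (1, 2)]
y = [0, 0, 1, 1]
w = fit_pseudo_likelihood(X, y)
z = w[0] + w[1] * 0 + w[2] * 3                   # a clearly 1-like gene
print(1.0 / (1.0 + math.exp(-z)) > 0.5)          # True
```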

The pseudo-likelihood method used by Deng et al. [30] is valuable. However, it has an important potential problem [32, 37] that may result in unreasonable predictions. The parameter estimation of Deng et al. [30] is conducted on only the known labelled vertices of biological networks. However, a known vertex labelled 1 may have plenty of unknown vertices labelled 0 in a biological network, and vice versa. Neglecting those unknown vertices may result in inaccurately estimated parameters, which makes predictions problematic. This problem becomes more serious as the number of unknown vertices increases [37]. Kourmpetis et al. [37] alternatively introduced a Bayesian MRF model to estimate parameters and update labels at the same time. An adaptive Markov Chain Monte Carlo (MCMC) algorithm is employed to perform the estimation by using an additional scaling parameter, a *Z* matrix and a multivariate normal distribution.

In this study, we introduce a new method to simultaneously estimate parameters and update labels. Suppose a prior probability *π*_{i} for each unknown vertex is known. A set of prior labels of the unknown vertices can be assigned according to this probability. The pseudo-likelihood parameter estimation is then performed on all labelled vertices, including the known labelled ones and the unknown prior-labelled ones. The estimated parameters are used to update the labels of all unknown vertices, and the updated labels are in turn used to re-estimate the parameters, until both are stable. The step-by-step description of this procedure is given as follows.

1 Initialization:

Let *t* = 0, and initialize the labels of all vertices \left({X}_{1}^{\left(0\right)},{X}_{2}^{\left(0\right)},\dots ,{X}_{n+m}^{\left(0\right)}\right);

2 Estimating parameters:

{\theta}^{\left(t\right)}\Leftarrow \left({X}_{1}^{\left(t\right)},{X}_{2}^{\left(t\right)},\dots ,{X}_{n+m}^{\left(t\right)}\right);

3 Gibbs sampling:

\begin{array}{c}{X}_{1}^{\left(t+1\right)}\Leftarrow \left({\theta}^{\left(t\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{2}^{\left(t\right)},\dots ,{X}_{n+m}^{\left(t\right)}\right)\hfill \\ {X}_{2}^{\left(t+1\right)}\Leftarrow \left({\theta}^{\left(t\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{1}^{\left(t+1\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{3}^{\left(t\right)},\dots ,{X}_{n+m}^{\left(t\right)}\right)\hfill \\ {X}_{3}^{\left(t+1\right)}\Leftarrow \left({\theta}^{\left(t\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{1}^{\left(t+1\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{2}^{\left(t+1\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{4}^{\left(t\right)},\dots ,{X}_{n+m}^{\left(t\right)}\right)\hfill \\ \vdots \hfill \\ {X}_{n}^{\left(t+1\right)}\Leftarrow \left({\theta}^{\left(t\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{1}^{\left(t+1\right)},\dots ,{X}_{n-1}^{\left(t+1\right)},\phantom{\rule{2.77695pt}{0ex}}{X}_{n+1}^{\left(t\right)},\dots ,{X}_{n+m}^{\left(t\right)}\right)\\ {X}_{n+1}^{\left(t+1\right)}\Leftarrow {X}_{n+1}^{\left(t\right)}\hfill \\ \vdots \hfill \\ {X}_{n+m}^{\left(t+1\right)}\Leftarrow {X}_{n+m}^{\left(t\right)}\hfill \end{array}

4 Let *t* = *t* + 1, and go to 2, until stabilized.
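One sweep of step 3 might look as follows; here `networks` is assumed to be a list of adjacency mappings (one per *H*^{k}), and the labels of known disease genes are left untouched:

```python
import math, random

def gibbs_step(x, alpha, beta, gamma, networks, known):
    """One sweep of the Gibbs sampler: resample each unknown gene's label
    from equation (15); genes in `known` keep their labels."""
    for i in range(len(x)):
        if i in known:
            continue
        t = alpha[i]
        for k, adj in enumerate(networks):
            m1 = sum(x[j] for j in adj[i])       # 1-labelled neighbours in H^k
            m0 = len(adj[i]) - m1                # 0-labelled neighbours in H^k
            t += (beta[k] - 1.0) * m0 + (gamma[k] - beta[k]) * m1
        p = math.exp(t) / (math.exp(t) + 1.0)    # P(X_i = 1 | X_[-i], theta)
        x[i] = 1 if random.random() < p else 0
    return x

# Tiny illustration: a huge alpha_0 forces gene 0 to label 1; gene 1 is known.
random.seed(1)
print(gibbs_step([0, 1], [50.0, 0.0], [1.0], [1.0], [{0: [1], 1: [0]}], {1}))  # [1, 1]
```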

During the Gibbs sampling procedure, a "burn-in period" and a "lag period" often need to be specified. The "burn-in period" is the time the Markov process takes to become stabilized; simulation results in this period are discarded to reduce the effect of the initial prior probabilities. The "lag period" is used to reduce the dependence between successive samples of the Markov process. The posterior probabilities are estimated by averaging the simulation results over individual lag steps.

In this study, the "burn-in period" takes 100 steps and the "lag period" takes 90 steps. Simulation results are averaged every 10 steps in the "lag period". There are 1,000 steps in total for the simulations. For convenience, predictions made by the original MRF model of Deng et al. [30] are denoted as "MRF-Deng", while predictions of our improved MRF method are denoted as "IMRF_{1}" hereafter. A second improved MRF method is also defined by appending a new period, called the "prediction period", to the end of the simulations. It takes the average estimated parameters of the "lag period" as parameters and keeps them fixed for the remainder of the simulations. The input probabilities of unknown vertices are likewise given by the average posterior probabilities of the "lag period". The Markov process runs another 100 steps in this period. The average posterior probabilities in the "prediction period" are output as the final predictions, and predictions of this method are denoted as "IMRF_{2}".

### Estimating a prior probability

Now, the only problem left is to estimate the prior probability *π*_{i}. Similarly to the method used in Deng et al. [30], we estimate it according to known protein complexes, since genes that encode proteins in the same complex tend to be associated with similar diseases. For a gene *g*_{i} that encodes a protein in a complex, let

{\widehat{\pi}}_{i}=A/B

(20)

be the prior probability, where *A* is the number of disease genes of the specific disease in the complex, and *B* is the number of all disease genes in the complex. If a gene appears in multiple protein complexes, we use the maximum value as its prior probability.

For those genes that do not belong to any protein complex, let

{\widehat{\pi}}_{i}=C/D

(21)

be the prior probability, where *C* is the number of all currently known disease genes of the specific disease, and *D* is the total number of genes in the human genome.
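Equations (20) and (21) can be combined into one prior-estimation routine. The sketch below assumes complexes and disease-gene sets are represented as Python sets (the gene names are invented):

```python
def prior_probability(gene, complexes, disease_genes, all_disease_genes,
                      n_disease, n_genome):
    """Prior pi_i: for a gene in protein complexes, the maximum ratio A/B
    over its complexes (equation (20)); otherwise the genome-wide ratio
    C/D (equation (21))."""
    ratios = []
    for members in complexes:
        if gene in members:
            a = len(members & disease_genes)      # A: disease genes of this disease
            b = len(members & all_disease_genes)  # B: all disease genes in the complex
            if b:
                ratios.append(a / b)
    if ratios:
        return max(ratios)
    return n_disease / n_genome                   # C / D

complexes = [{'g1', 'g2', 'g3'}, {'g1', 'g4'}]
disease = {'g2'}            # disease genes of the class of interest
all_disease = {'g2', 'g3'}  # all known disease genes
print(prior_probability('g1', complexes, disease, all_disease, 815, 7311))  # 0.5
```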

### Data sources

The gene-disease association data are obtained from Goh et al. [3], and contain 1,284 disorders and 1,777 disease genes. These data were originally collected from the Morbid Map list of the Online Mendelian Inheritance in Man (OMIM) database [38]. Disorders are manually classified into 22 primary disease classes, including a 'multiple' class and an 'unclassified' class. In this study, we consider only disease classes that consist of at least 30 genes. We also exclude the 'multiple', 'unclassified', 'cancer' and 'neurological' classes due to the class evidence and the class heterogeneity [3]. The final dataset consists of 815 genes in 12 disease classes.

The protein complex data are collected from the CORUM [39] and PCDq [40] databases, which contain 1,677 and 1,103 protein complexes consisting of at least two proteins, respectively. In total, 3,881 proteins appear in those protein complexes.

The PPI datasets are derived from the HPRD (Release 9) [9], BioGrid (Release 3.2.108) [10] and IntAct (downloaded on Jan 26, 2014) [11] databases, respectively. Duplicated edges between the same pair of vertices and self-loops are deleted. Each dataset is processed independently, yielding three PPI networks. The HPRD PPI network consists of 9,465 vertices and 37,039 edges; the BioGrid PPI network consists of 15,298 vertices and 127,612 edges; the IntAct PPI network consists of 13,449 vertices and 63,825 edges.

The pathway datasets are obtained from the KEGG [12], Reactome [13], PharmGKB [14] and PIN [15] databases, which contain 280, 1,469, 99 and 2,679 pathways, respectively. In total, 8,614 proteins appear in those pathways. A pathway co-existence network is constructed by taking individual proteins/genes as vertices and linking two vertices by an edge if they co-exist in any pathway.

The human gene expression profiles are obtained from BioGPS (GSE1133) [16, 17], and contain 79 human tissues measured in duplicate using the Affymetrix U133A array. Pairwise Pearson correlation coefficients (PCC) are calculated, and a pair of genes is linked by an edge if the PCC value is larger than 0.5, similar to the method used in [3, 26].
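The co-expression network construction can be sketched as follows, with toy expression profiles in place of the GSE1133 data:

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def coexpression_edges(profiles, cutoff=0.5):
    """Link two genes by an edge when the PCC of their profiles exceeds
    the cutoff (0.5 in this study)."""
    genes = sorted(profiles)
    return [(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]
            if pearson(profiles[g], profiles[h]) > cutoff]

# Toy profiles: 'a' and 'b' are perfectly correlated, 'c' is anti-correlated.
profiles = {'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 2, 1]}
print(coexpression_edges(profiles))  # [('a', 'b')]
```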

Hence, five biological networks are constructed by collecting data from various databases. All protein IDs are mapped to gene symbols. In order to test the performance of multiple data integration, we select those genes that appear in at least four of the five networks. The final dataset consists of 7,311 human genes, 815 of which are known to be associated with the 12 disease classes.

### Validation method and evaluation criteria

The accuracy of predictions is validated by the leave-one-out method. For each known disease gene with at least one annotated interaction partner in a biological network, we assume it is an unknown gene and predict its posterior probability with our proposed methods. We use the receiver operating characteristic (ROC) curve to show the relationship between the true positive rate and the false positive rate as the threshold for declaring positives is varied. The area under the ROC curve (AUC) is also employed as an overall measure of performance. The negative control set consists of known disease genes that do not belong to the current disease class; they are also validated using the leave-one-out method.
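The AUC over leave-one-out scores can be computed without plotting the ROC curve, via the rank-sum identity (the probabilities below are invented):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the rank-sum identity: the fraction
    of (positive, negative) score pairs that are correctly ordered,
    counting ties as half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Posterior probabilities from leave-one-out runs: held-out disease genes
# (positives) vs. disease genes of other classes (the negative control).
print(auc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2]))  # 8/9, one pair mis-ordered
```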

### Decision score and declaration of positives

One can directly use the posterior probabilities obtained by the Gibbs sampling to select candidate disease genes: the greater the probability for a gene, the more likely it is to be associated with the specific disease. However, different disease classes consist of different numbers of known disease genes, and thus the prediction results may not be good if a global threshold is used for all classes. Hence, we propose to use a percentage as a decision score to generate the final predictions. All ROC curves and AUC scores of our "IMRF_{1}" and "IMRF_{2}" methods are calculated according to this decision score hereafter.