Problem formulation
Let H be a bipartite graph consisting of two disjoint sets of vertices, where one set represents all known human genes {g_{1}, g_{2}, . . . , g_{N}}, while the other set represents all known genetic diseases {d_{1}, d_{2}, . . . , d_{r}}. The associations between these genes and genetic diseases can be obtained from the Online Mendelian Inheritance in Man (OMIM) database [22] or from similar databases.
Although a disease d_{k} may be associated with one or several genes, the number of all known disease genes m is much smaller than N. Hence, the associations of most other genes are still unknown and need to be analyzed. Without loss of generality, we can reorder the set of all human genes as a vector (g_{1}, g_{2}, . . . , g_{N}) according to a set of given gene-disease associations, where g_{n+1}, g_{n+2}, . . . , g_{n+m} are genes associated with at least one known disease (disease genes), and g_{1}, g_{2}, . . . , g_{n} are the others. Here N = n + m, and n is the number of all genes that are not known to be associated with any disease; they are called unknown genes in this paper.
For a specific disease d_{k}, the issue of disease gene identification is to find a set of candidate genes that may be associated with d_{k}. To achieve this, let {x}^{k}=\left({x}_{1}^{k},\phantom{\rule{0.3em}{0ex}}{x}_{2}^{k},...,{x}_{n+m}^{k}\right) be a vector of binary class labels (i.e. taking the value zero or one) defined on all genes, where {x}_{i}^{k}=1 represents gene g_{i} being a disease gene of d_{k}, and {x}_{i}^{k}=0 otherwise. Since each genetic disease is addressed one by one, we take d_{k} as an example and omit the superscript k in the vector x^{k} for simplicity, writing x = (x_{1}, x_{2}, . . . , x_{n+m}) hereafter. Therefore, the identification of disease genes is equivalent to finding the labels x_{i} of all unknown genes. Disease genes for other diseases can be identified in the same manner by changing d_{k} to another disease.
In this paper, the issue of disease gene identification is formulated as a two-class classification problem using Bayesian analysis and logistic regression. The conditional probability p(x_{i} = 1 | Φ) for each unknown gene is first calculated in an inference stage, and a decision score is then obtained from this probability in a decision stage [23]. Here Φ represents the information used to make the inference, such as a vector of prior labels of x, the connectivity of the bipartite graph H, the neighborhood relationships of g_{1}, g_{2}, . . . , g_{N}, and the similarity relationships between d_{1}, d_{2}, . . . , d_{r}. The flow diagram of the proposed algorithm is depicted in Figure 1.
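As a minimal illustration of this setup, the label vector x for a given disease can be read directly off the gene-disease associations. The sketch below uses hypothetical gene and disease names, not entries from OMIM or any real database:

```python
# Toy gene-disease association map; names are illustrative only.
gene_disease = {
    "d1": {"g4", "g5"},   # disease d1 is associated with genes g4 and g5
    "d2": {"g5"},
}
# Genes reordered so unknown genes come first, known disease genes last.
genes = ["g1", "g2", "g3", "g4", "g5"]

def label_vector(disease, genes, gene_disease):
    """Binary labels: x_i = 1 iff g_i is a known disease gene of `disease`."""
    assoc = gene_disease.get(disease, set())
    return [1 if g in assoc else 0 for g in genes]

x = label_vector("d1", genes, gene_disease)
```

For d1 this yields labels of 1 for g4 and g5 and 0 elsewhere; the task is then to infer the labels of the unknown genes g1, g2, g3.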
Prior label estimation
The logistic regression based algorithm needs a vector of prior labels for x. For known disease genes, one can directly assign 1 or 0 according to the known gene-disease associations. For unknown genes, a prior probability of each gene receiving label 1 must first be estimated.
The simplest way is to assign a prior probability of 0 to all unknown genes. The prediction results in this case are denoted as P_{0} hereafter.
However, one can do better by using additional prior information, such as protein complex data, to estimate prior probabilities for unknown genes. This is not only because such data are readily available from various databases, but also because of their capability to describe the modular characteristic of disease genes: if a disease results from the dysfunction of a protein complex, then any component of the complex should be associated with the disease with high probability.
Similar to the method used in [19, 20], if a gene g_{i} encodes a protein in a complex, then let
{\hat{p}}_{i}=\frac{A}{B}
(1)
be its prior probability, where A is the number of disease genes of the specific disease in the complex, and B is the number of all disease genes in the complex. If g_{i} appears in multiple protein complexes, we use the maximum value as its prior probability. If g_{i} does not belong to any protein complex, let
{\hat{p}}_{i}=\frac{C}{D}
(2)
be its prior probability, where C is the number of all currently known disease genes of the specific disease, and D is the total number of genes in the human genome.
Once a prior probability {\hat{p}}_{i} is estimated for g_{i}, its prior label {\hat{x}}_{i} can be obtained as follows. First, generate a random number following the standard uniform distribution. If the value of the random number is larger than {\hat{p}}_{i}, assign 0 as the prior label of g_{i}; otherwise, assign 1. Repeating this step for all unknown genes yields prior labels for all of them. The prediction results generated by using these prior labels are denoted as P_{c} hereafter.
Logistic regression
For a two-class classification problem, each gene is labelled with either 1 or 0. A vector of binary values of x is called a configuration. In the previous MRF algorithm [19, 20], the configuration x is formulated as an MRF which follows a Gibbs distribution. However, the Markovianity of the MRF model means that only direct neighbors are considered when constructing feature vectors, which limits the method's ability to use other topological attributes of a biological network. It is also very time consuming to maintain Markov chains for all unknown genes. In [21], we introduced a logistic regression based algorithm that estimates the configuration directly by using the same feature vectors. However, its application is still limited to a single biological network, and a method for integrating multiple data sources needs further investigation. To generalize the formulation of feature vectors with other topological attributes and to extend its applicability to multiple data integration, we propose an improved logistic regression based algorithm in this study, as follows.
Let C_{1} be the set of genes with label 1 and C_{0} be the set of genes with label 0. Suppose the following four kinds of probabilities are given: the class-conditional densities p(x|C_{1}) and p(x|C_{0}), which indicate the probability of the configuration x conditional on C_{1} and C_{0}, respectively, and the class prior densities p(C_{1}) and p(C_{0}), which indicate the prior probabilities of genes in C_{1} and C_{0} being labelled with 1 and 0, respectively.
According to Bayes' rule, the posterior probabilities of those genes in C_{1} that are labelled with 1 can be described as a logistic sigmoid function [23, 24]
p\left({C}_{1}|x\right)=\frac{p\left(x|{C}_{1}\right)p\left({C}_{1}\right)}{p\left(x|{C}_{1}\right)p\left({C}_{1}\right)+p\left(x|{C}_{0}\right)p\left({C}_{0}\right)}=\frac{{e}^{t}}{{e}^{t}+1}
(3)
and the posterior probabilities of those genes in C_{0} that are labelled with 0 can be similarly written as
p\left({C}_{0}|x\right)=\frac{p\left(x|{C}_{0}\right)p\left({C}_{0}\right)}{p\left(x|{C}_{1}\right)p\left({C}_{1}\right)+p\left(x|{C}_{0}\right)p\left({C}_{0}\right)}=\frac{1}{{e}^{t}+1}
(4)
where the variable t is defined as
t=\text{ln}\frac{p\left(x|{C}_{1}\right)p\left({C}_{1}\right)}{p\left(x|{C}_{0}\right)p\left({C}_{0}\right)},
(5)
which is related to the four kinds of probabilities.
Although t is often unavailable for a real problem, under general assumptions [23] it can be formulated as a function t = f (·) of different features associated with the integrated networks. To be more specific, let \hat{x} be a prior configuration of all human genes and f be a function. For any given gene g_{i}, let ϕ_{i} be the feature vector of g_{i} that is related to the prior configuration \hat{x}. The posterior probabilities that the specific gene g_{i} has label 1 and 0 are
p\left({x}_{i}=1|{\varphi}_{i},f\right)=\frac{\text{exp}\left(f\left({\varphi}_{i}\right)\right)}{\text{exp}\left(f\left({\varphi}_{i}\right)\right)+1},
(6)
and
p\left({x}_{i}=0|{\varphi}_{i},f\right)=\frac{1}{\text{exp}\left(f\left({\varphi}_{i}\right)\right)+1},
(7)
respectively. Note that the sum of the two probabilities (6) and (7) must equal 1 in this two-class classification problem. A linear function f (ϕ_{i}) = w^{T}ϕ_{i} with variables (feature vectors) ϕ_{i} and coefficients (parameters) w is the most commonly used choice, which keeps the calculation of the posterior probability simple.
The key step of the proposed algorithm is the construction of feature vectors. In the previous methods [19, 21], the numbers of direct neighbors that connect to disease genes and non-disease genes are employed as the feature vector of each gene. Taking g_{i} for example, its feature vector can be written as
{\varphi}_{i}={\left(1,{\varphi}_{i1},{\varphi}_{i0}\right)}^{T},
(8)
where ϕ_{i1} and ϕ_{i0} are the numbers of direct neighbors of g_{i} connected to vertices with labels 1 and 0, respectively. It is a three-dimensional vector, where the first element represents the constant term. The feature vectors of the individual genes together form the feature matrix
{F}_{1}={\left[\begin{array}{ccc}1\hfill & {\varphi}_{11}\hfill & {\varphi}_{10}\hfill \\ 1\hfill & {\varphi}_{21}\hfill & {\varphi}_{20}\hfill \\ \vdots \hfill & \vdots \hfill & \vdots \hfill \\ 1\hfill & {\varphi}_{N1}\hfill & {\varphi}_{N0}\hfill \end{array}\right]}_{{}_{N\times 3}}
(9)
where N is the number of all human genes. The corresponding parameter vector is w = (w_{0}, w_{1}, w_{2})^{T}. Predictions generated by using (9) are denoted as F_{1} hereafter.
In this study, two extended feature vector construction methods are proposed. First, in a single biological network, not only the number of direct neighbors of g_{i} but also the number of its second order neighbors is employed to construct the feature vector as
{\varphi}_{i}={\left(1,{\varphi}_{i1},{\varphi}_{i0},{{\varphi}^{\prime}}_{i1},{{\varphi}^{\prime}}_{i0}\right)}^{T}
(10)
where ϕ_{i1} and ϕ_{i0} are the numbers of direct neighbors of g_{i} connected to vertices with labels 1 and 0, respectively, and {\varphi}_{i1}^{\prime} and {\varphi}_{i0}^{\prime} are the numbers of second order neighbors of g_{i} connected to vertices with labels 1 and 0, respectively. The contribution of such indirect neighbors to predicting disease genes has been investigated in [20, 25, 26]. The feature matrix in this situation can be written as
{F}_{2}={\left[\begin{array}{ccccc}1\hfill & {\varphi}_{11}\hfill & {\varphi}_{10}\hfill & {{\varphi}^{\prime}}_{11}\hfill & {{\varphi}^{\prime}}_{10}\hfill \\ 1\hfill & {\varphi}_{21}\hfill & {\varphi}_{20}\hfill & {{\varphi}^{\prime}}_{21}\hfill & {{\varphi}^{\prime}}_{20}\hfill \\ \vdots \hfill & \vdots \hfill & \vdots \hfill & \vdots \hfill & \vdots \hfill \\ 1\hfill & {\varphi}_{N1}\hfill & {\varphi}_{N0}\hfill & {{\varphi}^{\prime}}_{N1}\hfill & {{\varphi}^{\prime}}_{N0}\hfill \end{array}\right]}_{N\times 5.}
(11)
The corresponding parameter vector w = (w_{0}, w_{1}, w_{2}, w_{3}, w_{4})^{T} is a five-dimensional vector. Predictions generated by using (11) are denoted as F_{2} hereafter.
Second, in the multiple data integration situation, suppose there are l biological networks. Let {\varphi}_{i1}^{j} and {\varphi}_{i0}^{j} be the numbers of direct neighbors of g_{i} connected to vertices with labels 1 and 0 in the j^{th} network, respectively. The feature vector obtained from these l networks
{\varphi}_{i}={\left(1,{\varphi}_{i1}^{1},{\varphi}_{i0}^{1},...,{\varphi}_{i1}^{l},{\varphi}_{i0}^{l}\right)}^{T}
(12)
is a (2l + 1)-dimensional vector. All these feature vectors together form the feature matrix
{F}_{3}={\left[\begin{array}{cccccc}1\hfill & {\varphi}_{11}^{1}\hfill & {\varphi}_{10}^{1}\hfill & \cdots \phantom{\rule{0.3em}{0ex}}\hfill & {\varphi}_{11}^{l}\hfill & {\varphi}_{10}^{l}\hfill \\ 1\hfill & {\varphi}_{21}^{1}\hfill & {\varphi}_{20}^{1}\hfill & \cdots \phantom{\rule{0.3em}{0ex}}\hfill & {\varphi}_{21}^{l}\hfill & {\varphi}_{20}^{l}\hfill \\ \vdots \hfill & \vdots \hfill & \vdots \hfill & \cdots \phantom{\rule{0.3em}{0ex}}\hfill & \vdots \hfill & \vdots \hfill \\ 1\hfill & {\varphi}_{N1}^{1}\hfill & {\varphi}_{N0}^{1}\hfill & \cdots \phantom{\rule{0.3em}{0ex}}\hfill & {\varphi}_{N1}^{l}\hfill & {\varphi}_{N0}^{l}\hfill \end{array}\right]}_{N\times \left(2l+1\right).}
(13)
The corresponding parameter vector w = (w_{0}, w_{1}, w_{2}, . . . , w_{2l−1}, w_{2l})^{T} is a (2l + 1)-dimensional vector, and N is the number of all human genes. Predictions generated from (13) by integrating multiple networks are denoted as F_{3} hereafter.
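The multi-network feature matrix F_{3} simply concatenates one direct-neighbor count pair per network. A sketch, under the same adjacency-map assumption as before:

```python
def feature_matrix_f3(networks, labels):
    """Build the N x (2l+1) feature matrix of Eq. (13): a constant 1
    followed by one (n1, n0) direct-neighbor count pair per network.
    `networks` is a list of l adjacency maps, one per biological network."""
    rows = []
    for i in range(len(labels)):
        row = [1]
        for adj in networks:
            nbrs = adj.get(i, set())
            n1 = sum(labels[j] for j in nbrs)
            row.extend([n1, len(nbrs) - n1])
        rows.append(row)
    return rows
```

Setting l = 1 recovers the F_{1} construction, so a single routine can serve both the single-network and the integrated case.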
Parameter estimation
Parameter estimation can be conducted on a training set consisting of known disease genes, where known genes associated with d_{k} are labelled with 1 and known genes associated with other diseases are labelled with 0. However, as discussed in [19, 21], excluding most unknown genes reduces the number of vertices with label 0 significantly, thereby making the parameter estimates inaccurate. Predictions from such inaccurate parameters are unreliable for disease gene identification.
It is noteworthy that the majority of human genes should not be disease genes associated with d_{k}. Hence, including all unknown genes with prior labels makes the training set more reasonable: the number of vertices with label 0 is significantly increased, while the number of vertices with label 1 does not change much. Such a training set, consisting of both known and unknown genes, has proved powerful and efficient for estimating meaningful parameters in [19–21].
Given a prior configuration \widehat{x} for all vertices, a maximum-likelihood estimation (MLE) method can be employed to estimate the parameter vector w. The likelihood function can be written as
\mathcal{L}\left(w;{x}_{1},{x}_{2},...,{x}_{N}\right)={\displaystyle \prod _{i=1}^{N}}p\left({x}_{i}|{\varphi}_{i},f\right),
(14)
where x_{i} is the label of g_{i}, ϕ_{i} is its feature vector calculated according to \widehat{x}, f is a linear function of ϕ_{i} of the form f (ϕ_{i}) = w^{T}ϕ_{i}, and N is the number of all human genes. The log likelihood of (14) is
\text{ln}\phantom{\rule{0.3em}{0ex}}\mathcal{L}\left(w;{x}_{1},{x}_{2},...,{x}_{N}\right)={\displaystyle \sum _{i=1}^{N}}\left[{x}_{i}{w}^{T}{\varphi}_{i}-\text{ln}\left(1+\text{exp}\left({w}^{T}{\varphi}_{i}\right)\right)\right].
(15)
The negative log likelihood − ln \mathcal{L} of (15) is a convex function [27]. Hence, a unique global optimal solution can be found by solving a convex optimization problem. In this study, the standard MATLAB function fminunc() is employed to find a numerical solution of (15) by minimizing − ln \mathcal{L}(w; x_{1}, x_{2}, . . . , x_{N}). The initial value of w is simply set to zero for the fminunc() function.
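The paper uses MATLAB's fminunc() on − ln L; purely as an illustration, the same maximization can be sketched in Python with plain gradient ascent. The learning rate and iteration count below are arbitrary choices for the sketch, not values from the paper:

```python
import math

def fit_logistic(features, labels, lr=0.1, iters=1000):
    """Maximize the log likelihood (15) by gradient ascent, a lightweight
    stand-in for fminunc on -ln L. The gradient of (15) with respect to w
    is sum_i (x_i - sigma(w^T phi_i)) phi_i, with sigma the sigmoid."""
    w = [0.0] * len(features[0])  # same zero initialization as in the text
    for _ in range(iters):
        grad = [0.0] * len(w)
        for phi, x in zip(features, labels):
            t = sum(wj * pj for wj, pj in zip(w, phi))
            p = 1.0 / (1.0 + math.exp(-t))
            for k, pk in enumerate(phi):
                grad[k] += (x - p) * pk
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w
```

A quasi-Newton routine (as fminunc uses by default) would converge faster, but the concave objective means any reasonable ascent scheme reaches the same optimum.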
Decision score and evaluation methods
The logistic regression based algorithm returns a set of posterior probabilities in the inference stage, which could be used directly to make decisions in the subsequent decision stage. However, the posterior probabilities do not always work well, because it is hard to set a probability threshold for a genetic disease. Inspired by the DIR method [18], we propose to use a percentage value of the posterior probability as the decision score of each gene, calculated as follows:
{q}_{i}=\frac{\left|\left\{j\mid {p}_{i}\ge {p}_{j}\right\}\right|}{n},\phantom{\rule{0.3em}{0ex}}i=1,2,...,n
(16)
where {p_{1}, p_{2}, . . . , p_{n}} are the posterior probabilities of the individual unknown genes, and q_{i} is the top percentage value of p_{i} among all those posterior probabilities. A candidate gene is more likely to be associated with d_{k} if its decision score is larger than those of the majority of the others.
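The decision score (16) is a rank statistic over the posterior probabilities; a direct transcription:

```python
def decision_scores(posteriors):
    """Decision scores per Eq. (16): q_i = |{j : p_i >= p_j}| / n, i.e. the
    fraction of unknown genes whose posterior does not exceed p_i."""
    n = len(posteriors)
    return [sum(1 for pj in posteriors if pi >= pj) / n for pi in posteriors]
```

The gene with the highest posterior receives q = 1, so larger scores always indicate stronger candidates regardless of the absolute scale of the posteriors.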
To evaluate the performance of the proposed algorithm, the leave-one-out cross validation paradigm is employed with the above decision scores. The receiver operating characteristic (ROC) curve, which shows the relationship between the true positive rate (TPR) and the false positive rate (FPR) as the threshold for determining positives varies, is employed as one evaluation criterion. The area under the ROC curve (AUC) is employed to summarize the overall performance of the algorithms.
The positive control genes are the known disease genes associated with d_{k}. Negative control genes are indispensable for calculating false positives and true negatives, but a true negative dataset is generally hard to obtain [28]. In this study, the negative control genes are randomly selected from known disease genes that are not associated with d_{k}. Since those genes have been widely studied as disease genes of other genetic diseases, they are less likely to be disease genes of a different specific disease. If there are s known disease genes associated with d_{k}, we randomly select \left\lfloor \frac{s}{2}\right\rfloor of these genes as the negative control set. Each gene in the negative control set is also validated by the leave-one-out cross validation paradigm.
The proposed algorithm is compared with four previous algorithms: (1) the initial logistic regression based algorithm proposed in [21]; (2) the RWR algorithm proposed in [17]; (3) the MRF algorithm proposed in [19]; and (4) the DIR algorithm proposed in [18]. The first algorithm is applicable only to a single network. The second and third are applicable to both a single network and multiple data integration. The fourth works only with multiple data integration. All of these algorithms identify disease genes with high prediction performance and outperform many earlier methods [17–19, 21].
Algorithm
The step-by-step description of the proposed logistic regression based algorithm is given as follows.
Input: The vector of all human genes (g_{1}, . . . , g_{n+m}), where (g_{1}, . . . , g_{n}) are unknown genes and (g_{n+1}, . . . , g_{n+m}) are known genes; l integrated biological networks G_{1}, G_{2}, . . . , G_{l}; a set of protein complexes; and a set of gene-disease associations.
Output: A vector of decision scores for the unknown genes, for each disease.

1: For a specific disease d_{k}, calculate prior probabilities for all human genes, where the prior probabilities of the unknown genes {\widehat{p}}_{1},...,{\widehat{p}}_{n} are calculated according to (1) and (2).

2: For each known gene g_{n+i}, i = 1, . . . , m, if g_{n+i} is known to be associated with d_{k}, let {\widehat{p}}_{n+i}=1; otherwise, let {\widehat{p}}_{n+i}=0.

3: Assign prior labels \widehat{x}=\left({\widehat{x}}_{1},{\widehat{x}}_{2},...,{\widehat{x}}_{n},{\widehat{x}}_{n+1},...,{\widehat{x}}_{n+m}\right) to all genes according to the prior probabilities \left({\widehat{p}}_{1},...,{\widehat{p}}_{n+m}\right), respectively.

4: Calculate the feature vector ϕ_{i} of each g_{i} according to the integrated biological networks and \widehat{x}.

5: Estimate the parameters \widehat{w} by maximizing the log likelihood ln \mathcal{L}\left(w;{x}_{1},{x}_{2},...,{x}_{N}\right) in (15) based on \widehat{x} and ϕ_{i}, i = 1, . . . , N. A binary logistic regression is performed here, taking the vector \widehat{x} as the categorical dependent variable and the label-related feature vectors ϕ_{i} as predictor variables.

6: Calculate the posterior probabilities p_{1}, . . . , p_{n} of the unknown genes according to (6) by using \widehat{w} and ϕ_{i}.

7: Calculate the decision scores q_{1}, . . . , q_{n} according to (16).

8: Repeat all the steps for another disease until every disease has been processed.