A network embedding model for pathogenic genes prediction by multi-path random walking on heterogeneous network

Background: Prediction of pathogenic genes is crucial for disease prevention, diagnosis, and treatment. However, traditional genetic localization methods are often technically difficult and time-consuming. With the development of computer science, computational biology has gradually become one of the main approaches for finding candidate pathogenic genes. Methods: We propose a pathogenic gene prediction method based on network embedding, called Multipath2vec. First, we construct a heterogeneous network, called GP−network, from three kinds of relationships between genes and phenotypes: correlations between phenotypes, interactions between genes, and known gene-phenotype pairs. Then, to embed the network better, we design the multi-path to guide random walks in the GP−network. The multi-path includes multiple paths between genes and phenotypes, which can capture the complex structural information of the heterogeneous network. Finally, we use the learned vector representation of each phenotype and protein to calculate similarities, and rank candidate genes according to their similarity to the target phenotype. Results: We implemented Multipath2vec and four baseline approaches (i.e., CATAPULT, PRINCE, Deepwalk and Metapath2vec) on many-genes gene-phenotype data, single-gene gene-phenotype data and whole gene-phenotype data. Experimental results show that Multipath2vec outperformed the state-of-the-art baselines in the pathogenic gene prediction task. Conclusions: We propose Multipath2vec, which can be used to predict pathogenic genes; experimental results show that it achieves higher accuracy in pathogenic gene prediction than the baselines.

genes and shorten the period of discovering pathogenic genes, laying the foundation for the development of biotechnology, and personalizing gene therapy, etc.
With the accumulation of protein-protein interaction data, predicting pathogenic genes from protein-protein interaction networks has become a research hotspot in bioinformatics [9][10][11][12]. Computational biology has gradually become one of the main approaches for finding candidate pathogenic genes. Calculating the functional similarity between unknown candidate genes and known pathogenic genes is one of the most popular methods for finding unknown candidate genes. Discovering pathogenic genes by network topological features in human protein-protein interaction networks has made some progress [13,14]. Moreover, many scholars have made efforts to identify gene-phenotype associations rather than gene-disease associations [10,15]. Lage et al. scored protein complexes using gene-phenotype data and ranked genes according to their association with the scored protein complexes [10]. Wu et al. established a regression model by calculating a score that measures the correlation between phenotype similarities and the functional relatedness of disease genes [15].
Some studies have shown that similar phenotypes are generally caused by functionally related genes [16][17][18]. Driven by this observation, researchers have proposed another approach that predicts candidate pathogenic genes from gene-phenotype associations [19][20][21][22]. Researchers prioritize candidate pathogenic genes of a given disease phenotype by constructing a heterogeneous network that consists of a phenotype network, a gene (protein) network, and known disease gene-phenotype associations [23]. In the phenotype network, phenotypes are regarded as vertices and the links between highly similar phenotypes are regarded as edges. In the protein network, individual proteins are regarded as vertices and the detected protein-protein interactions (PPI) are regarded as edges between the two corresponding vertices. The two networks are connected by the known disease gene-phenotype associations. This kind of heterogeneous network can be used to infer causative genes of a given phenotype. Many methods calculate the similarities between the candidate genes and the target phenotype using the heterogeneous network [24,25]. Li and Patra proposed a random walk with restart algorithm to infer gene-phenotype relationships on the heterogeneous network [24]. Yang et al. added information on real protein complexes to the heterogeneous network, constructing a novel protein complex network [25]. However, these studies are restricted by the large differences between the properties of vertices or links in the heterogeneous network; that is, predicting pathogenic genes with the heterogeneous network is limited by its complex network properties. Recently, network embedding algorithms have been proposed in the field of computer science [26][27][28][29][30].
Neural network-based learning models can represent latent embeddings in a low-dimensional space while capturing the internal relationships of rich and complex data. Network embedding algorithms have been shown to perform well in clustering, network classification, link prediction, etc. [31,32]. Deep learning techniques, which have proved successful in natural language processing, were first introduced to graph analysis in the Deepwalk algorithm [26,33,34]. Many studies have extended and modified the basic Deepwalk model in order to apply it to heterogeneous networks.
In this work, we use network embedding to predict causative genes in human gene-phenotype heterogeneous networks. We propose a network embedding method called Multipath2vec, which aims to precisely predict the pathogenic genes of a target disease. In Multipath2vec, we first construct a human gene-phenotype heterogeneous network. We then design the multi-path, which can better capture correlations between different types of vertices, to guide random walks in the human gene-phenotype heterogeneous network. Next, we use a network embedding algorithm to learn features of the constructed network. Finally, we calculate the similarities between genes and the target phenotypes and then predict the pathogenic genes. We make the following contributions.
• We propose a pathogenic gene prediction algorithm called Multipath2vec. In Multipath2vec, we propose a special multi-path random walk to make better use of the information in the heterogeneous network.
• We introduce a network embedding algorithm into the prediction of pathogenic genes. To the best of our knowledge, this is the first attempt to exploit network embedding in the prediction of pathogenic genes.
• The research strategy of this work can inspire solutions to other analysis tasks in bioinformatics.
The paper is organized as follows. The "Methods" section describes the Multipath2vec algorithm in detail. Experiments are presented in the "Results" section, and the "Discussion" section concludes the paper.

Methods
In this section, we describe the construction of the human gene-phenotype heterogeneous network in detail and propose the Multipath2vec algorithm. The flow chart of Multipath2vec is shown in Fig. 1. First, a human gene-phenotype heterogeneous network is constructed based on the correlations between genes and genes, phenotypes and phenotypes, and genes and phenotypes. Then we design the multi-path to guide random walks in the human gene-phenotype heterogeneous network and represent the vertices of the network as d-dimensional vectors. Next, we calculate the similarities between genes and the target phenotypes. Finally, we obtain the ranking list of candidate genes.

Heterogeneous network construction
The heterogeneous network consists of two types of nodes and three types of links. In the heterogeneous network, nodes include the gene nodes and the phenotype nodes. Edges are connected in three relationships: the relationship between phenotype and gene, the relationship between two genes, and the relationship between two phenotypes. The edge between two phenotypes is the link between two highly similar vertices. The edge is connected between two corresponding genes when there exists the experimentally detected protein-protein interaction. Besides, the known disease gene-phenotype associations are used to connect gene and phenotype. For a better understanding, we give the formal definition of the heterogeneous network as follows.
A heterogeneous network is defined as a graph $G = (V, E, T)$ in which each vertex $v$ and each edge $e$ are associated with the mapping functions $\varphi(v) : V \to T_V$ and $\phi(e) : E \to T_E$, respectively. $T_V$ and $T_E$ denote the sets of vertex types and relation types, where $|T_V| + |T_E| > 2$.
For predicting the pathogenic genes of the known disease, we first construct the human gene-phenotype heterogeneous network. In order to precisely describe the relationships in the human gene-phenotype heterogeneous network, we use proteins/genes (i.e., g) and phenotypes (i.e., p) and several relationships between them to represent heterogeneous networks. Proteins/genes and phenotypes are represented as vertices and the edges are denoted as phenotype similarity (i.e., p-p), protein-protein interaction (i.e., g-g), and gene-phenotype association (i.e., g-p/p-g), respectively. We give a clear definition of the human gene-phenotype heterogeneous network that we construct in this paper. We name this network as GP−network.
A GP−network is defined as a graph $G = (V, E, T)$, wherein $V = G \cup P$, $G$ is the gene set, and $P$ is the phenotype set. $T = T_V \cup T_E$ is the type set, where $T_V$ and $T_E$ represent the sets of object types and relation types. Figure 2a is an example of a GP−network. There are many associations between genes and phenotypes. Our purpose is to predict the unknown associations between certain genes and phenotypes according to the known links in the GP−network.
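As a rough illustration of the definition above, the following sketch builds a small GP−network from three edge lists, one per relation type. The function name `build_gp_network` and the toy identifiers (`g1`, `p1`, ...) are hypothetical and chosen only for this example; they are not taken from the data sets used in the paper.

```python
from collections import defaultdict


def build_gp_network(gg_edges, pp_edges, gp_edges):
    """Return (adjacency, node_type) for an undirected GP-network.

    gg_edges: gene-gene pairs (protein-protein interactions)
    pp_edges: phenotype-phenotype pairs (highly similar phenotypes)
    gp_edges: (gene, phenotype) pairs (known associations)
    """
    adjacency = defaultdict(set)
    node_type = {}

    def add_edge(u, u_type, v, v_type):
        # Record the type of both endpoints and store the edge in both directions.
        node_type[u] = u_type
        node_type[v] = v_type
        adjacency[u].add(v)
        adjacency[v].add(u)

    for g1, g2 in gg_edges:
        add_edge(g1, "g", g2, "g")
    for p1, p2 in pp_edges:
        add_edge(p1, "p", p2, "p")
    for g, p in gp_edges:
        add_edge(g, "g", p, "p")
    return dict(adjacency), node_type


# Toy example with made-up identifiers.
adjacency, node_type = build_gp_network(
    gg_edges=[("g1", "g2"), ("g2", "g4")],
    pp_edges=[("p1", "p2")],
    gp_edges=[("g2", "p1"), ("g1", "p2"), ("g4", "p4")],
)
```

Storing the vertex type next to the adjacency sets keeps the two node types and three edge types recoverable, which is what a type-guided random walk needs.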

Heterogeneous GP−network embedding
Dong et al. proposed Metapath2vec, which is a network embedding method for network analysis [35]. In this method, scholars designed meta-path (i.e., a path including different kinds of vertices) to guide random walk [36]. Metapath2vec generates paths through random walks based on meta-path, which can capture rich correlations between different types of vertices. In this paper, we design a novel multi-path to capture richer correlations between vertices. The formal definitions of meta-path and multi-path are respectively introduced as follows.

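Definition 1 Following [35], a meta-path scheme is defined as a path of the form $V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots \xrightarrow{R_{l-1}} V_l$, wherein $R = R_1 \circ R_2 \circ \cdots \circ R_{l-1}$ defines the composite relations between vertex types $V_1$ and $V_l$.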
Fig. 1 The flow of Multipath2vec. First, we construct the human gene-phenotype heterogeneous network. Based on the multi-path guided random walk, we obtain the vector representation of the network by network embedding. Finally, we calculate the similarities and then rank the candidate genes

Fig. 2 a An example of GP−network. Each light orange node represents a phenotype and each blue node represents a protein. The links between phenotypes represent high similarities. The links between proteins represent the interactions between proteins. The black dots represent the associations between genes and phenotypes. b The multi-path used in this work, which is "g-p-g&g-g-p" and "p-g-p&p-p-g"

A meta-path "g − p − g" represents the relationship between two genes (g) that are common pathogenic genes of a phenotype (p). However, the relationship captured by a single meta-path is not enough for the heterogeneous network we constructed. For example, if a gene $g_1$ is an undiscovered pathogenic gene of a phenotype $p_1$, there may be two reasons: (1) gene $g_1$ interacts with $g_2$, which has been confirmed to be a pathogenic gene of phenotype $p_1$; (2) gene $g_1$ is closely associated with $p_2$, which is highly similar to $p_1$. Therefore, a meta-path is not suitable for the heterogeneous network we constructed because it can only capture one relationship. Considering this particularity, we propose a multi-path based random walk to capture gene-phenotype relationships (g − p) and gene-gene relationships (g − g). We define multi-path as follows.

Definition 2 In GP−network, a multi-path scheme W is defined as a path that is denoted in the form of
That is, a multi-path cannot contain three successive vertices of the same type. Multi-path is more suitable for heterogeneous networks than meta-path because it can capture multiple relationships simultaneously. Taking the situation in Fig. 2a as an example, $g_2 \to p_2 \to g_1$ is a meta-path. Unlike a meta-path, a multi-path is allowed to contain two relationships simultaneously. For instance, $g_2 \to p_2 \to g_1$ and $g_2 \to g_4 \to p_4$ are multi-paths. Fig. 2b shows the multi-path used in this work.
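The constraint stated above (no three successive vertices of the same type) can be checked mechanically. The sketch below tests only this one condition on a sequence of vertex types; it is not a full implementation of Definition 2, and the helper name `is_valid_multipath` is hypothetical.

```python
def is_valid_multipath(type_sequence):
    """Return True if no three successive vertices share the same type.

    type_sequence: a sequence of vertex types, e.g. "gpg" or ["g", "g", "p"].
    """
    triples = zip(type_sequence, type_sequence[1:], type_sequence[2:])
    return all(not (a == b == c) for a, b, c in triples)


# The two example paths from the text, written as type sequences.
print(is_valid_multipath("gpg"))  # g2 -> p2 -> g1
print(is_valid_multipath("ggp"))  # g2 -> g4 -> p4
print(is_valid_multipath("ggg"))  # three genes in a row violates the constraint
```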
Here, we describe how the multi-path guides random walkers in the heterogeneous network we build. For a multi-path scheme W, the transition probability at step i is defined as shown in Eq. 1.
Wherein $v_t^i \in V_t$, $N_{t+1}(v_t^i)$ denotes the neighbors of $v_t^i$ that are of the $(t+1)$-th vertex type, and $\varphi(v^{i+1})$ represents the type of vertex $v^{i+1}$. That is, the walker follows the pre-defined multi-path W. The strategy of the multi-path based random walk ensures that the four kinds of relationships can serve as input to the heterogeneous skip-gram model. One of the advantages of the multi-path random walk is that it can capture richer structural correlations.
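The walking rule described above can be sketched as follows. The sketch assumes that Eq. 1 has the same form as the Metapath2vec transition probability: each neighbor of the required type receives probability $1/|N_{t+1}(v_t^i)|$ and every other vertex receives probability 0. The toy graph, the function names, and the representation of a scheme as a plain list of vertex types are all assumptions made for illustration.

```python
import random

# Toy GP-network with made-up identifiers; the first letter encodes the vertex type.
adjacency = {
    "g1": {"g2", "p2"},
    "g2": {"g1", "g4", "p1"},
    "g4": {"g2", "p4"},
    "p1": {"p2", "g2"},
    "p2": {"p1", "g1"},
    "p4": {"g4"},
}
node_type = {name: name[0] for name in adjacency}


def typed_neighbors(adjacency, node_type, vertex, wanted_type):
    """Neighbors of `vertex` whose type is `wanted_type` (sorted for reproducibility)."""
    return sorted(u for u in adjacency[vertex] if node_type[u] == wanted_type)


def transition_probabilities(adjacency, node_type, vertex, wanted_type):
    """Uniform distribution over typed neighbors; all other vertices implicitly get 0."""
    candidates = typed_neighbors(adjacency, node_type, vertex, wanted_type)
    return {u: 1.0 / len(candidates) for u in candidates}


def guided_walk(adjacency, node_type, start, scheme, rng):
    """Follow the vertex-type sequence `scheme` from `start`; stop early at a dead end."""
    if node_type[start] != scheme[0]:
        raise ValueError("start vertex does not match the first type of the scheme")
    walk = [start]
    for wanted_type in scheme[1:]:
        candidates = typed_neighbors(adjacency, node_type, walk[-1], wanted_type)
        if not candidates:
            break
        walk.append(rng.choice(candidates))
    return walk


walk = guided_walk(adjacency, node_type, "g2", ["g", "g", "p"], random.Random(0))
```

Passing an explicit `random.Random` instance makes the walks reproducible, which is convenient when comparing embeddings across runs.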
Given a multi-path guided random walk $W = \{v_1, \ldots, v_l\}$ of length $l$, the vertex embedding function is learned by maximizing the probability that the neighboring vertices of $v_i$ within a window of size $k$ occur, conditioned on the embedding of $v_i$. The objective function is shown in Eq. 2.
To effectively maximize the objective function, we approximate the conditional probability by using the independence assumption. The expression is in Eq. 3.
Heterogeneous skip-gram is used to learn effective vertex representations for a heterogeneous network by maximizing the probability of $v_j$ conditioned on the embedding of $v_i$, wherein $N_t(v)$ denotes the neighbors of $v$ that are of the $t$-th vertex type.
We also used negative sampling to approximate the objective function for efficient optimization.
wherein $\sigma(\cdot)$ is the sigmoid function, $v_{j_m}$ is the $m$-th negative node sampled for node $v_j$, and $M$ is the number of negative samples. The parameters are updated according to Eq. 6.
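For readers unfamiliar with negative sampling, the sketch below shows the standard skip-gram loss with $M$ negative samples and one gradient step on it. It is a generic illustration in plain Python, not the authors' implementation: the function names, the learning rate, and the toy vectors are assumptions, and the exact update in Eq. 6 may differ in detail.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, context, negatives):
    """Negative of: log sigma(context . center) + sum_m log sigma(-negative_m . center)."""
    positive_term = math.log(sigmoid(dot(context, center)))
    negative_term = sum(math.log(sigmoid(-dot(n, center))) for n in negatives)
    return -(positive_term + negative_term)


def sgd_step(center, context, negatives, learning_rate=0.1):
    """One gradient-descent step on the loss; returns updated copies of all vectors."""
    pos_err = sigmoid(dot(context, center)) - 1.0            # factor for the observed pair
    neg_errs = [sigmoid(dot(n, center)) for n in negatives]  # factors for sampled negatives

    # Gradient with respect to the center vector accumulates all terms.
    grad_center = [pos_err * c for c in context]
    for err, n in zip(neg_errs, negatives):
        grad_center = [g + err * x for g, x in zip(grad_center, n)]

    new_context = [c - learning_rate * pos_err * x for c, x in zip(context, center)]
    new_negatives = [
        [x_n - learning_rate * err * x_c for x_n, x_c in zip(n, center)]
        for err, n in zip(neg_errs, negatives)
    ]
    new_center = [x - learning_rate * g for x, g in zip(center, grad_center)]
    return new_center, new_context, new_negatives


# Toy vectors (d = 2, M = 1), chosen arbitrarily.
center, context, negatives = [0.1, -0.2], [0.3, 0.1], [[0.2, 0.2]]
loss_before = negative_sampling_loss(center, context, negatives)
center, context, negatives = sgd_step(center, context, negatives)
loss_after = negative_sampling_loss(center, context, negatives)
```

The step pulls the center vector toward the observed context vertex and pushes it away from the sampled negatives, so the loss on this toy example should decrease.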

Score and rank
After obtaining the vector representation of each phenotype and protein in the human gene-phenotype heterogeneous network, we calculate the similarity of every gene with the given phenotype. Given a gene $g = (x_1, x_2, \ldots, x_d)$ and a phenotype $p = (y_1, y_2, \ldots, y_d)$, we measure the similarity between the two vectors using the cosine similarity between the normalized vectors. The similarity is calculated as shown in Eq. 7.
After calculating the similarity of every protein in the human gene-phenotype heterogeneous network with the target phenotype, the similarity scores can be ranked in order. Candidate genes are then prioritized. Algorithm 1 shows the whole process of Multipath2vec.
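The scoring and ranking step can be sketched with the usual definition of cosine similarity, $\cos(g, p) = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$. The function names and the toy embeddings below are hypothetical; returning 0.0 for a zero vector is a convention chosen for this sketch only.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)


def rank_candidates(phenotype_vec, gene_vecs):
    """Return gene identifiers sorted by decreasing similarity to the phenotype."""
    scores = {g: cosine_similarity(v, phenotype_vec) for g, v in gene_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional embeddings with made-up identifiers.
gene_vecs = {"gA": [1.0, 0.0], "gB": [1.0, 1.0], "gC": [-1.0, 0.0]}
ranking = rank_candidates([1.0, 0.0], gene_vecs)
```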

Algorithm 1 Multipath2vec
Require: GP−network G(V, E), walks per vertex t, walk length l, embedding size d, g-p associations S_gp, a multi-path scheme W, window size k
Ensure: candidate gene rank
1: Initialize vertex embeddings X ∈ R^{|V|×d}
2: for each s_gp ∈ S_gp do
3:   for i = 0 → t do
5:     for each v_i ∈ V do
6:       X = Heterogeneous-network-embedding(G, W, v_i, l, X, R, k)
21: for j = max(0, i − k) → min(i + k, l) and j ≠ i do
22:   c_t = R[j]
23:   update X according to Eq. 6
24: end for

Results
In this section, we introduce the details about experimental data set, experiment settings, evaluation metrics, baseline approaches and the analysis of experimental results.

Data sets
We access data sets from three different sources to generate the GP−network. The details of these three data sets are described as below.
• PPI: We obtain human PPI data from the Human Protein Reference Database (HPRD). HPRD is a centralized platform that aims to present integrated information about the human proteome. The information in HPRD has been manually extracted by biologists. The data set we obtain from HPRD includes 39,240 interactions among 9,590 human proteins/genes. We filter out the proteins with self-interactions only. After filtering, a total of 8,756 human proteins were used in our experiments.
• Gene-phenotype associations: We obtain gene-phenotype association data from the Online Mendelian Inheritance in Man (OMIM) database. OMIM is an online catalog of human genes and genetic disorders, which focuses on heritable genetic diseases.

Experiment settings
Before generating the GP−network, we preprocess the data and set some details in our experiment as follows.
1 Proteins with self-interactions only in the PPI data set are filtered out.
2 In the gene network, we filter out proteins that are not in the gene-phenotype network and also have no links to proteins in the gene-phenotype network. In the phenotype network, we filter out phenotypes that are not in the gene-phenotype network and also have no links to phenotypes in the gene-phenotype network. In the gene-phenotype network, we filter out gene-phenotype associations whose gene is not in the gene network and whose phenotype is not in the phenotype network.
3 In the phenotype network, we connect two phenotypes when their similarity score is higher than 0.6, which is considered to be reliable according to previous studies [37].
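Steps 1 and 3 above can be sketched in a few lines. The data layout (interactions as pairs, similarities as a dictionary keyed by phenotype pairs), the function names, and the toy values are assumptions made for illustration; only the 0.6 threshold comes from the text.

```python
def filter_self_only_proteins(interactions):
    """Drop proteins whose only interactions are with themselves (step 1)."""
    connected = set()
    for a, b in interactions:
        if a != b:
            connected.update((a, b))
    # Keep an interaction only if both endpoints interact with some other protein.
    return [(a, b) for a, b in interactions if a in connected and b in connected]


def phenotype_edges(similarity, threshold=0.6):
    """Connect two phenotypes when their similarity is higher than the threshold (step 3)."""
    return [(a, b) for (a, b), s in similarity.items() if a != b and s > threshold]


# Toy inputs with made-up identifiers and scores.
ppi = filter_self_only_proteins([("A", "A"), ("A", "B"), ("C", "C")])
pp = phenotype_edges({("p1", "p2"): 0.8, ("p1", "p3"): 0.6, ("p2", "p3"): 0.3})
```

The comparison is strict, so a pair with a score of exactly 0.6 is not connected, matching the wording "higher than 0.6".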

Evaluation metrics
We used leave-one-out cross validation to verify the results of our experiments. Cross validation, sometimes also called rotation estimation, is a standard technique in statistics and is widely used to verify prediction methods. It can be used to test whether a prediction model is accurate in practice.

Leave-one-out cross validation
The first step of cross validation is to separate the original data into two groups, i.e., a training set and a testing set. We then use the training set to train the classifier. Finally, we use the testing set to evaluate the quality of the classifier. One of the most commonly used variants is leave-one-out cross validation.
In leave-one-out cross validation, one sample is held out as the testing set and the remaining samples form the training set; this is repeated for every sample. We use leave-one-out cross validation in this work since it is suitable for small samples.
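The splitting scheme can be sketched as a small generator. The sketch assumes that each sample is one known gene-phenotype association and that associations are unique; the function name and the toy pairs are hypothetical.

```python
def leave_one_out_splits(associations):
    """Yield (training, held_out) pairs, holding out each association exactly once."""
    for i, held_out in enumerate(associations):
        training = associations[:i] + associations[i + 1:]
        yield training, held_out


# Toy gene-phenotype associations with made-up identifiers.
associations = [("g1", "p1"), ("g2", "p1"), ("g3", "p2")]
splits = list(leave_one_out_splits(associations))
```

In each round, the held-out link would be removed from the network before embedding, and the method is judged by where the held-out gene appears in the ranking for its phenotype.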

Precision
Precision is widely used to evaluate the accuracy of prediction. There are four outcomes in binary detection, i.e., True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Taking Fig. 2 as an example, suppose there exists an association between gene $g_{10}$ and phenotype $p_8$. If $g_{10} - p_8$ is successfully predicted, this is counted as a TP. Since failed predictions are not informative in this setting, we calculate precision to evaluate the rate of successful predictions. The calculation formula is shown in Eq. 8.
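Using the standard definition of precision, TP / (TP + FP), the evaluation can be sketched as below. How a round is mapped to TP or FP is an assumption here: the sketch counts a round as a hit when the held-out gene appears among the top k ranked candidates. The function names and toy rankings are hypothetical.

```python
def precision(tp, fp):
    """Standard precision: TP / (TP + FP); returns 0.0 when nothing was predicted."""
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)


def top_k_hits(rankings, held_out_genes, k):
    """Count rounds in which the held-out gene is ranked within the top k."""
    return sum(
        1 for ranking, gene in zip(rankings, held_out_genes) if gene in ranking[:k]
    )


# Two toy leave-one-out rounds with made-up identifiers.
rankings = [["g1", "g2", "g3"], ["g2", "g1", "g3"]]
held_out_genes = ["g2", "g3"]
hits_at_2 = top_k_hits(rankings, held_out_genes, k=2)
precision_at_2 = precision(hits_at_2, len(rankings) - hits_at_2)
```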

Baseline approaches
We use four methods as baseline approaches, i.e., CATAPULT, PRINCE, Deepwalk and Metapath2vec. We describe the four baseline approaches in detail below.
1 CATAPULT: CATAPULT [38] uses a biased SVM framework and trains a bagging support vector machine classifier to classify gene-phenotype pairs. In CATAPULT, the similarities between vertices are evaluated by paths of different lengths, and supervised learning is used to learn the coefficients of the different paths. Specifically, the features of gene-phenotype pairs are represented by the number of paths of different lengths in the network, as given in Eq. 9, wherein $N_{GG}$ is the gene network, $N_{GP}$ is the gene-phenotype network, $N_{PG}$ is the transpose of $N_{GP}$, and $N_{PP}$ is the phenotype network. By training a bagging classifier, CATAPULT learns the weights of the different paths, after which the unconnected paths can be predicted. Since confirmed negative relationships between vertices do not exist, CATAPULT treats the unconnected links as unlabeled and randomly selects some unlabeled relationships as negative associations. Therefore, the samples of each SVM classifier consist of all positive relationships and randomly selected unlabeled relationships. Each training round yields a linear classifier $\theta_t$. The final results are calculated as the average over the multiple trained models.
2 PRINCE: PRINCE is one of the classical algorithms for this problem. The correlation between a gene g and a disease d is determined by two factors. One is the correlation between the neighbor genes of g and the target disease d. The other is the prior knowledge about gene g. The function F(g) describing the correlation between g and the disease is shown as follows. Wherein, w is the normalized form of the weight matrix of the network and α is the parameter used to adjust the weight of the two factors.
3 Deepwalk: Deepwalk is a network embedding method that is usually used on homogeneous networks. Deepwalk learns low-dimensional feature representations by using uniform random walks. It generates random walks by treating nodes of different types equally.
4 Metapath2vec: Metapath2vec is proposed for heterogeneous networks. Metapath2vec introduces the meta-path to guide random walks. It generates paths through random walks based on the meta-path, which can capture rich correlations between different types of vertices.
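To make the PRINCE baseline described in item 2 more concrete, the sketch below iterates the propagation rule $F \leftarrow \alpha W' F + (1 - \alpha) Y$, which is the form commonly given for PRINCE. Here $W'$ plays the role of the normalized weight matrix w, $Y$ is the prior knowledge vector, and α balances the two factors. The symmetric degree normalization, the toy three-gene path network, and all names are assumptions for illustration, not the settings used in the experiments.

```python
import math


def normalize(weights):
    """Symmetric normalization W'[i][j] = W[i][j] / sqrt(d_i * d_j), d = row sums."""
    degrees = [sum(row) for row in weights]
    n = len(weights)
    return [
        [
            weights[i][j] / math.sqrt(degrees[i] * degrees[j]) if weights[i][j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate(norm_weights, prior, alpha=0.5, iterations=50):
    """Iterate F <- alpha * W' F + (1 - alpha) * Y, starting from F = Y."""
    n = len(prior)
    scores = list(prior)
    for _ in range(iterations):
        scores = [
            alpha * sum(norm_weights[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * prior[i]
            for i in range(n)
        ]
    return scores


# Toy path network a - b - c; only gene a has prior evidence for the disease.
weights = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = propagate(normalize(weights), prior=[1.0, 0.0, 0.0], alpha=0.5)
```

On this toy network the score decays with distance from the gene that carries prior evidence, which is the intuition behind ranking genes by their neighbors' relevance.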

Experimental results analysis
We use leave-one-out cross validation to evaluate the performance of our method Multipath2vec and the four baseline methods. We set the experimental parameters as follows. For CATAPULT, PRINCE, Deepwalk and Metapath2vec, we follow the original settings of their previous experiments. In Metapath2vec, we used the meta-path "g − p − g". The similarities are calculated by each of the five approaches and then ranked in descending order. To better compare the five methods, we calculate the accuracy at Top 1 as well as at Top 5, Top 10, Top 30, Top 50, and Top 100. Table 1 shows the overall performance of Multipath2vec, CATAPULT, PRINCE, Deepwalk and Metapath2vec on the whole gene-phenotype data. Multipath2vec successfully predicted 317 pathogenic genes in the Top 1 list, whereas CATAPULT, PRINCE, Deepwalk and Metapath2vec successfully predicted 46, 203, 285 and 96 pathogenic genes, respectively. In the Top 5 list, Multipath2vec again achieved the best performance, successfully predicting 693 pathogenic genes, compared with 565 for Deepwalk, 121 for Metapath2vec, 403 for PRINCE and only 57 for CATAPULT.
For the single-gene gene-phenotype data, the experimental results are shown in Table 2. Multipath2vec outperforms the other algorithms. CATAPULT performed worst on the single-gene gene-phenotype data. The reason may be that CATAPULT trains a bagging classifier by learning the weights of different paths, but there is only one connected path between the target gene and the phenotype in single-gene data. We therefore focus on the comparison of Multipath2vec, PRINCE, Deepwalk and Metapath2vec. Multipath2vec successfully predicted 266 pathogenic genes in the Top 1 list, while PRINCE, Deepwalk and Metapath2vec predicted 179, 242 and 48, respectively.
Moreover, we list the overall performance of the five methods on the many-genes gene-phenotype data in Table 3. As shown in Table 3, Multipath2vec still outperforms the baselines. We also calculate the precision values of the five methods, which are shown in Figs. 3 and 4. Figure 3 shows the precision values of the five methods for 6 different groups on the single-gene, many-genes and whole-genes gene-phenotype data, respectively. We choose the 6 group sizes mentioned above, i.e., Top 1, Top 5, Top 10, Top 30, Top 50, and Top 100. Fig. 3 shows that the precision values of Multipath2vec are higher than those of CATAPULT, PRINCE, Deepwalk and Metapath2vec. Figure 4 shows the precision values of Multipath2vec, CATAPULT, PRINCE, Deepwalk and Metapath2vec, grouped by single-gene, many-genes and whole-genes gene-phenotype data, respectively. In general, Multipath2vec outperforms the other four approaches. Multipath2vec and Deepwalk are ahead of the other baseline approaches; Deepwalk performs close to Multipath2vec but does not match it. CATAPULT performs worst and is not competitive with Multipath2vec, PRINCE, Deepwalk and Metapath2vec.
In summary, Multipath2vec can successfully predict pathogenic genes with high accuracy. The experimental results show that Multipath2vec outperformed the baseline approaches from all perspectives. Therefore, Multipath2vec can be used to predict pathogenic genes.

Robustness of false negative
In our experiment, we used precision to evaluate the accuracy of prediction. In each round of leave-one-out cross validation, if the removed link between the target gene and phenotype is successfully predicted, the round is counted as a TP. We can also regard this situation as a TN, because the negative is successfully predicted as a negative. Thus the number of TPs equals the number of TNs, and the number of FPs equals the number of FNs. Since our method performs well in prediction accuracy, it is also robust to false negatives.

Conclusion
The study of pathogenic genes plays an important role in revealing the pathogenesis of diseases as well as in developing corresponding disease prevention and diagnosis methods. The key to deciphering the molecular and genetic basis of human disease is to analyze the correlation between diseases and genes. In this paper, we propose the Multipath2vec algorithm, which is based on network embedding, to predict pathogenic genes. The multi-path in Multipath2vec is designed to guide random walks in the human gene-phenotype heterogeneous network, and the multi-path based random walk can better represent the network. The experimental results show that Multipath2vec outperforms the four baseline methods from several perspectives. When the approaches are applied to single-gene gene-phenotype data, many-genes gene-phenotype data and whole-genes gene-phenotype data, Multipath2vec shows outstanding performance in the prediction of pathogenic genes. In terms of the precision values of the five methods, Multipath2vec also outperforms the baselines under all circumstances. This illustrates the feasibility of applying heterogeneous network embedding to the prediction of pathogenic genes.

Fig. 3 The precision values of Multipath2vec, CATAPULT, PRINCE, Deepwalk and Metapath2vec, grouped by Top 1, 10, 30, 50, 100 on whole-genes gene-phenotype data, one-gene gene-phenotype data and many-genes gene-phenotype data, respectively