This section describes the construction of the human gene-phenotype heterogeneous network and presents the Multipath2vec algorithm. The flow chart of Multipath2vec is shown in Fig. 1. First, a human gene-phenotype heterogeneous network is constructed from gene-gene, phenotype-phenotype, and gene-phenotype correlations. Next, we design a multi-path scheme to guide random walks in this network and embed each vertex as a d-dimensional vector. We then calculate the similarity between each gene and the target phenotype, and rank the candidate genes by this similarity.
Heterogeneous network construction
The heterogeneous network consists of two types of nodes and three types of links. The nodes are gene nodes and phenotype nodes. The edges represent three relationships: the relationship between a phenotype and a gene, the relationship between two genes, and the relationship between two phenotypes. An edge connects two phenotypes when they are highly similar. An edge connects two genes when an experimentally detected protein-protein interaction exists between the corresponding proteins. In addition, known disease gene-phenotype associations connect genes to phenotypes. For clarity, we give the formal definition of a heterogeneous network as follows.
A heterogeneous network is defined as a graph G=(V,E,T) in which each vertex v and each edge e are associated with the mapping functions \(\phi(v):V\rightarrow T_{V}\) and \(\varphi(e):E\rightarrow T_{E}\), respectively. \(T_{V}\) and \(T_{E}\) denote the sets of vertex types and relation types, where \(|T_{V}|+|T_{E}|>2\).
To predict the pathogenic genes of a known disease, we first construct the human gene-phenotype heterogeneous network. To describe the relationships in this network precisely, we represent proteins/genes (denoted g) and phenotypes (denoted p) as vertices, and represent phenotype similarity (p-p), protein-protein interaction (g-g), and gene-phenotype association (g-p/p-g) as edges. We now give a formal definition of the human gene-phenotype heterogeneous network constructed in this paper, which we call the GP−network.
A GP−network is defined as a graph G=(V,E,T), where V=G∪P, G is the gene set, and P is the phenotype set. T is the type set, with \(T=T_{V}\cup T_{E}\). \(T_{V}\) and \(T_{E}\) represent the sets of vertex types and relation types, where \(|T_{V}|+|T_{E}|>2\). In G, each vertex v is associated with the mapping function \(\phi(v):V\rightarrow T_{V}\) and each edge e is associated with the mapping function \(\varphi(e):E\rightarrow T_{E}\).
Figure 2a shows an example of a GP−network. Many associations exist between genes and phenotypes. Our goal is to predict unknown associations between genes and phenotypes from the known links in the GP−network.
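As an illustration of the definition above, the following minimal Python sketch stores a GP−network as a typed undirected graph. The class name `GPNetwork`, its method names, and the toy vertices and edges are hypothetical and chosen only for illustration; they are not taken from Fig. 2a or from the authors' implementation.

```python
from collections import defaultdict


class GPNetwork:
    """Toy gene-phenotype network with typed vertices and typed edges."""

    def __init__(self):
        self.vertex_type = {}               # phi: V -> T_V, values 'g' or 'p'
        self.neighbors = defaultdict(set)   # undirected adjacency sets
        self.edge_type = {}                 # varphi: E -> T_E

    def add_vertex(self, v, vtype):
        if vtype not in ("g", "p"):
            raise ValueError("vertex type must be 'g' (gene) or 'p' (phenotype)")
        self.vertex_type[v] = vtype

    def add_edge(self, u, v):
        # The relation type follows from the endpoint types:
        # 'g-g' (interaction), 'p-p' (similarity), or 'g-p' (association).
        etype = "-".join(sorted((self.vertex_type[u], self.vertex_type[v])))
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)
        self.edge_type[frozenset((u, v))] = etype
        return etype

    def typed_neighbors(self, v, vtype):
        """Neighbors of v that have the requested vertex type, sorted by name."""
        return sorted(n for n in self.neighbors[v] if self.vertex_type[n] == vtype)


def build_toy_network():
    """Build a small hypothetical GP-network (illustrative data only)."""
    net = GPNetwork()
    for gene in ("g1", "g2", "g4"):
        net.add_vertex(gene, "g")
    for phenotype in ("p1", "p2", "p4"):
        net.add_vertex(phenotype, "p")
    for u, v in [("g1", "g2"), ("g2", "g4"),                              # g-g
                 ("g2", "p1"), ("g2", "p2"), ("g1", "p2"), ("g4", "p4"),  # g-p
                 ("p1", "p2")]:                                           # p-p
        net.add_edge(u, v)
    return net
```

Deriving the relation type from the endpoint types keeps the two mapping functions consistent by construction; a real data set would instead load each relation from its own source.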
Heterogeneous GP−network embedding
Dong et al. proposed Metapath2vec, a network embedding method for heterogeneous network analysis [35]. In this method, the authors design meta-paths (i.e., paths that include different kinds of vertices) to guide random walks [36]. Metapath2vec generates paths through meta-path based random walks, which can capture rich correlations between different types of vertices. In this paper, we design a novel multi-path to capture richer correlations between vertices. The formal definitions of meta-path and multi-path are introduced as follows.
Definition 1
In a regular heterogeneous network, a meta-path scheme H is defined as a path of the form \(V_{1}\xrightarrow{R_{1}}V_{2}\xrightarrow{R_{2}}\cdots\xrightarrow{R_{l-1}}V_{l}\), where \(R=R_{1}\diamond R_{2}\diamond\cdots\diamond R_{l-1}\) defines the composite relation between vertex types \(V_{1}\) and \(V_{l}\).
A meta-path "g−p−g" represents the relationship between two genes (g) that are both pathogenic genes of the same phenotype (p). However, the relationship captured by a single meta-path is not sufficient for the heterogeneous network we constructed. For example, if a gene g1 is an undiscovered pathogenic gene of a phenotype p1, there are two possible reasons: (1) g1 may interact with g2, which has been confirmed as a pathogenic gene of p1. (2) g1 may be closely associated with p2, which is highly similar to p1. A single meta-path is therefore not suitable for the heterogeneous network we constructed, because it can capture only one of these relationships. For this reason, we propose a multi-path based random walk that captures both gene-phenotype relationships (g−p) and gene-gene relationships (g−g). We define multi-path as follows.
Definition 2
In a GP−network, a multi-path scheme W is defined as a path of the form \(V_{1}\xrightarrow{R_{1}}V_{2}\xrightarrow{R_{2}}\cdots\xrightarrow{R_{l-1}}V_{l}\), where \(R=R_{1}\diamond R_{2}\diamond\cdots\diamond R_{l-1}\) defines the composite relation between vertices. In addition, \(V_{i+2}\notin P\) when \(V_{i}\in P\wedge V_{i+1}\in P\). Likewise, \(V_{i+2}\notin G\) when \(V_{i}\in G\wedge V_{i+1}\in G\).
That is, a multi-path cannot contain three successive vertices of the same type. A multi-path is more suitable for heterogeneous networks than a meta-path because it can capture multiple relationships simultaneously. Taking Fig. 2a as an example, g2→p2→g1 is a meta-path. Unlike a meta-path, a multi-path may contain two kinds of relationships simultaneously. For instance, g2→p2→g1 and g2→g4→p4 are both multi-paths. Fig. 2b shows the multi-path used in this work.
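The constraint in Definition 2 can be checked mechanically. The sketch below assumes that a path is given as a sequence of vertex-type labels ('g' for gene, 'p' for phenotype); the function name `is_valid_multipath` is hypothetical.

```python
def is_valid_multipath(types):
    """Return True if no three successive vertices share the same type.

    `types` is a string or list of type labels, e.g. "ggp" or ["g", "p", "g"].
    """
    # Slide a window of three consecutive labels along the path.
    return all(not (a == b == c)
               for a, b, c in zip(types, types[1:], types[2:]))
```

For example, the type sequences of g2→p2→g1 ("gpg") and g2→g4→p4 ("ggp") both satisfy the constraint, whereas a sequence such as "gggp" does not.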
Here, we describe how a multi-path guides random walkers in the heterogeneous network we built. For a multi-path scheme \(W: V_{1}\xrightarrow{R_{1}}V_{2}\xrightarrow{R_{2}}\cdots\xrightarrow{R_{l-1}}V_{l}\), the transition probability at step i is defined in Eq. 1.
$$ \begin{aligned} Pr(v^{i+1}|v_{t}^{i},W)=\left\{ \begin{array}{ll} \frac{1}{\left|N_{t+1}(v_{t}^{i})\right|} & \left(v^{i+1},v_{t}^{i}\right)\in E,\ \phi(v^{i+1})=t+1 \\ 0 & \left(v^{i+1},v_{t}^{i}\right)\in E,\ \phi(v^{i+1})\neq t+1\\ 0 & \left(v^{i+1},v_{t}^{i}\right)\notin E \end{array} \right. \end{aligned} $$
(1)
where \(v_{t}^{i}\in {V_{t}}\), \(N_{t+1}(v_{t}^{i})\) denotes the neighbors of \(v_{t}^{i}\) that are of the (t+1)th vertex type, and \(\phi(v^{i+1})\) represents the type of vertex \(v^{i+1}\). That is, the walker follows the pre-defined multi-path W. This multi-path based random walk strategy ensures that the four kinds of relationships (p−p, g−g, g−p, and p−g) can all be fed into the heterogeneous skip-gram model. One advantage of the multi-path random walk is that it captures richer structural correlations than a single meta-path.
Given a multi-path guided random walk \(W=\{v_{1},\ldots,v_{l}\}\) of length l, we denote the vertex embedding function by Φ(·). Φ(·) is learned by maximizing the probability that the neighboring vertices of \(v_{i}\) within a window of size k occur, conditioned on \(\Phi(v_{i})\). Equivalently, we minimize the negative log-probability shown in Eq. 2.
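The transition rule of Eq. 1 can be sketched as follows. The toy adjacency data, the function names, and the example scheme g→g→p→g are assumptions made for illustration only (the scheme actually used in this work is the one in Fig. 2b). Stopping the walk at a dead end is also an assumption, since that case is not specified in the text.

```python
import random

# Hypothetical toy network: vertex names starting with 'g' are genes,
# names starting with 'p' are phenotypes.
NEIGHBORS = {
    "g1": {"g2", "p2"},
    "g2": {"g1", "g4", "p1", "p2"},
    "g4": {"g2", "p4"},
    "p1": {"g2", "p2"},
    "p2": {"g1", "g2", "p1"},
    "p4": {"g4"},
}
VERTEX_TYPE = {name: name[0] for name in NEIGHBORS}


def transition_prob(neighbors, vertex_type, current, nxt, next_type):
    """Probability of moving from `current` to `nxt` when the scheme
    requires the next vertex to have type `next_type` (Eq. 1)."""
    candidates = [n for n in neighbors[current] if vertex_type[n] == next_type]
    if nxt not in candidates:
        return 0.0  # not adjacent, or adjacent but of the wrong type
    return 1.0 / len(candidates)


def multipath_walk(neighbors, vertex_type, start, scheme, rng):
    """Generate one walk whose vertex types follow `scheme`.

    `scheme` is a sequence of type labels such as "ggpg". At each step the
    next vertex is drawn uniformly from the neighbors of the required type.
    """
    if vertex_type[start] != scheme[0]:
        raise ValueError("start vertex does not match the first scheme type")
    walk = [start]
    for next_type in scheme[1:]:
        candidates = sorted(n for n in neighbors[walk[-1]]
                            if vertex_type[n] == next_type)
        if not candidates:
            break  # assumption: stop early when no typed neighbor exists
        walk.append(rng.choice(candidates))
    return walk


if __name__ == "__main__":
    print(multipath_walk(NEIGHBORS, VERTEX_TYPE, "g1", "ggpg", random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the walks reproducible, which is convenient when comparing embeddings across runs.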
$$ \min_{\Phi}-\log Pr(\{v_{i-k},\ldots,v_{i+k}\} \backslash v_{i}|\Phi(v_{i})) $$
(2)
To optimize the objective function efficiently, we factorize the conditional probability using a conditional independence assumption, as shown in Eq. 3.
$$ Pr(\{v_{i-k},\ldots,v_{i+k}\} \backslash v_{i}|\Phi(v_{i}))=\prod_{j=i-k,j\neq i}^{i+k} Pr(v_{j}|\Phi(v_{i})) $$
(3)
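Under the independence assumption of Eq. 3, the objective decomposes into a sum over (center, context) pairs that fall within the window. A minimal sketch of how such pairs could be enumerated from one walk is given below; the function name `context_pairs` is hypothetical.

```python
def context_pairs(walk, k):
    """List all (v_i, v_j) pairs with 0 < |i - j| <= k in a walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - k)               # clip the window at the walk start
        hi = min(len(walk), i + k + 1)   # clip the window at the walk end
        for j in range(lo, hi):
            if j != i:                   # exclude the center vertex itself
                pairs.append((center, walk[j]))
    return pairs
```

Each pair contributes one factor \(Pr(v_{j}|\Phi(v_{i}))\) to the product in Eq. 3.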
The heterogeneous skip-gram model learns effective vertex representations for a heterogeneous network by maximizing the probability \(Pr(v_{j}|\Phi(v_{i}))\). It assumes that this probability depends on the type of vertex \(v_{j}\), as shown in Eq. 4.
$$ Pr(v_{j}|\Phi(v_{i}))=\frac{e^{\Psi(v_{j})\cdot \Phi(v_{i})}}{\sum_{u\in V}e^{\Psi(u)\cdot \Phi(v_{i})}},v_{j}\in N_{t}(v) $$
(4)
where \(N_{t}(v)\) denotes the neighbors of v that are of the tth vertex type, and Ψ(·) denotes the context embedding function, as distinct from the vertex embedding function Φ(·).
We also use negative sampling to approximate the objective function for efficient optimization, as shown in Eq. 5.
$$ \begin{aligned} O_{ij}=- \log Pr(v_{j}|\Phi(v_{i}))\approx -\log\sigma(\Psi(v_{j})\cdot \Phi(v_{i})) \\ -\sum_{m=1}^{M} \log\sigma(-\Psi(v_{jm})\cdot \Phi(v_{i})) \end{aligned} $$
(5)
where σ(·) is the sigmoid function, \(v_{jm}\) is the mth negative node sampled for node \(v_{j}\), and M is the number of negative samples. The parameters Φ and Ψ are updated by gradient descent with learning rate α as follows:
$$ \Phi =\Phi - \alpha\frac{\partial O_{ij}}{\partial\Phi},\Psi =\Psi - \alpha\frac{\partial O_{ij}}{\partial\Psi} $$
(6)
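One update of Eq. 6 for a single (center, context) pair can be sketched as below. The sketch assumes that \(O_{ij}\) is treated as a loss to be minimized, i.e. the negative of the negative-sampling log-likelihood, which is the convention implied by the descent update in Eq. 6. The function names and the plain-list vector representation are hypothetical, and the way negative nodes are sampled is left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(phi_i, psi_pos, psi_negs):
    """O_ij = -log s(psi_j . phi_i) - sum_m log s(-psi_jm . phi_i)."""
    value = -math.log(sigmoid(dot(psi_pos, phi_i)))
    for psi_neg in psi_negs:
        value -= math.log(sigmoid(-dot(psi_neg, phi_i)))
    return value


def sgd_step(phi_i, psi_pos, psi_negs, alpha):
    """Return updated copies of (phi_i, psi_pos, psi_negs) after one step."""
    grad_phi = [0.0] * len(phi_i)

    # Positive pair: dO/d(psi_j . phi_i) = sigmoid(psi_j . phi_i) - 1
    coeff = sigmoid(dot(psi_pos, phi_i)) - 1.0
    grad_phi = [g + coeff * w for g, w in zip(grad_phi, psi_pos)]
    new_pos = [w - alpha * coeff * x for w, x in zip(psi_pos, phi_i)]

    # Negative samples: dO/d(psi_jm . phi_i) = sigmoid(psi_jm . phi_i)
    new_negs = []
    for psi_neg in psi_negs:
        coeff = sigmoid(dot(psi_neg, phi_i))
        grad_phi = [g + coeff * w for g, w in zip(grad_phi, psi_neg)]
        new_negs.append([w - alpha * coeff * x for w, x in zip(psi_neg, phi_i)])

    # All gradients are evaluated at the old parameters before phi is moved.
    new_phi = [x - alpha * g for x, g in zip(phi_i, grad_phi)]
    return new_phi, new_pos, new_negs
```

Only the embeddings of the center vertex, the context vertex, and the M sampled negatives are touched in each step, which is what makes negative sampling cheaper than evaluating the full normalization in Eq. 4.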
Score and rank
After obtaining the vector representation of each phenotype and gene in the human gene-phenotype heterogeneous network, we calculate the similarity between every gene and the given phenotype. Given a gene \(g=(x_{1},x_{2},\ldots,x_{d})\) and a phenotype \(p=(y_{1},y_{2},\ldots,y_{d})\), we measure their similarity by the cosine similarity of the two vectors, i.e., the inner product of the normalized vectors. The similarity is calculated as shown in Eq. 7.
$$ sim(g,p)=\frac{\sum\limits_{n=1}^{d} x_{n}\ast y_{n}}{\sqrt{\sum\limits_{n=1}^{d} x_{n}^{2}}\ast\sqrt{\sum\limits_{n=1}^{d} y_{n}^{2}}} $$
(7)
After calculating the similarity between every gene in the human gene-phenotype heterogeneous network and the target phenotype, we rank the genes by similarity score in descending order to prioritize the candidate genes. Algorithm 1 shows the whole process of Multipath2vec.
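The scoring and ranking step can be sketched as follows. The function names and the example vectors are hypothetical; returning 0 for a zero-length vector is an assumption, since Eq. 7 is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (Eq. 7)."""
    inner = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar
    return inner / (norm_x * norm_y)


def rank_genes(gene_vectors, phenotype_vector):
    """Return (gene, score) pairs sorted from most to least similar."""
    scores = {gene: cosine_similarity(vec, phenotype_vector)
              for gene, vec in gene_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_genes = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [1.0, 1.0]}
    for gene, score in rank_genes(toy_genes, [1.0, 0.1]):
        print(gene, round(score, 3))
```

Because cosine similarity ignores vector length, genes are ranked by the direction of their embeddings relative to the target phenotype rather than by embedding magnitude.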