Heterogeneous network embedding enabling accurate disease association predictions

Background It is important to identify the complex biological mechanisms of various diseases in biomedical research. Recently, the growing generation of tremendous amounts of data in genomics, epigenomics, metagenomics, proteomics, metabolomics, nutriomics, etc., has given rise to systematic biological approaches for exploring complex diseases. However, the gap between the production of these data and our capability to analyze them has gradually widened. Furthermore, we observe that networks can represent many of the above-mentioned data. Based on the vector representations learned by network embedding methods, entities that are in close proximity but do not currently possess direct links are very likely to be related, and are therefore promising candidates for biological investigation.
Results We incorporate six public biological databases to construct a heterogeneous biological network containing three categories of entities (i.e., genes, diseases, miRNAs) and multiple types of edges (i.e., the known relationships). To tackle the inherent heterogeneity, we develop a heterogeneous network embedding model that maps the network into a low-dimensional vector space in which the relationships between entities are well preserved. To assess the effectiveness of our method, we conduct gene-disease and miRNA-disease association predictions, the results of which show the superiority of our method over several state-of-the-art methods. Furthermore, many associations predicted by our method are verified in the latest real-world dataset.
Conclusions We propose a novel heterogeneous network embedding method that takes advantage of the abundant contextual information and structure of a heterogeneous network. Moreover, we illustrate how the proposed method can direct studies in biology by helping to identify new hypotheses for biological investigation.

these methods only extract simple features from datasets and there still exist many challenges as discussed below.
Recent technological advances have enabled researchers to produce and investigate an enormous quantity of data to better illustrate the underlying biological mechanisms of complicated diseases [6]. Consequently, many large databases have been developed to preserve and organize the accumulated data, which were generated and curated through extensive collaboration. For instance, the DisGeNET database [7] collects a comprehensive catalogue of genes and variants involved in human diseases from various expert-curated repositories [1,4,8,9], and the miRNet database [10] integrates data from eleven disease-miRNA databases [5,11]. In addition, almost all of these datasets supply observed and/or inferred knowledge about relations between diseases and other biological entities. For instance, the MISIM database [12] preserves a miRNA similarity network; the Human Protein Reference Database (HPRD) [13] keeps a protein-protein interaction network; and MimMiner [14] offers a similarity network of diseases. Capturing the complicated biological relationships among these data requires a systematic method that considers the multifaceted data simultaneously, involving genes [15], proteins [16], miRNAs [17], drugs [18], side-effects [19] and so on. Such a method may shed light not only on the mechanisms of complex diseases, but also on new biological hypotheses to direct future research. Although several big consortia such as ENCODE and GTEx have made remarkable progress, we observe a growing disparity between our capability to produce data and our capability to integrate, investigate, and explain it. The majority of recent studies concentrate on data produced in an environment managed by the researchers themselves or by their colleagues, to make sure that the data are produced under homogeneous conditions and can thus be compared directly.
Accordingly, data produced by previous studies and the inferred knowledge preserved in available repositories are still widely underutilized. It is also impractical to exploit such an enormous amount of data through biological experiments because of the high expense. Moreover, the heterogeneity of data types, experimental environments and experimental technologies is a primary challenge. Consequently, we design a network-based analytic model to tackle these challenges.
We are motivated by the observation that networks, in which nodes indicate entities such as proteins and diseases and edges indicate relationships between these entities, can represent a majority of the above-mentioned data. Because there exist various types of entities, the relationships may likewise be of various types (e.g., protein-protein interaction, disease-miRNA association). Besides, nodes and edges may have auxiliary attributes, such as node features and link weights, which further describe the characteristics of the entities and relations. To make full use of the knowledge carried by the constructed network, we apply network embedding [20,21], which has proven effective in exploring and discovering relationships between persons within social networks. Network embedding maps the network data into a continuous low-dimensional feature space that preserves the vertex content, side information and topological structure, especially the existing relationships. Every entity (e.g., protein, disease) is embedded as a low-dimensional vector, i.e., a point in the vector space, and the stronger the relationship between two entities, the closer they are in the vector space. Figure 1a shows a sub-network that contains one disease (i.e., prostate cancer), two miRNAs (i.e., hsa-mir-223, hsa-mir-21) and two genes (i.e., ZNF804A, ATM), as well as their existing links to other diseases, miRNAs, and genes. Figure 1b displays a projection of a tiny region around prostate cancer in the two-dimensional embedding space, where the genes and miRNAs that are actually connected to prostate cancer are distributed in the proximity of this disease. The four red dashed edges denote the top two miRNAs and top two genes that do not possess direct links to prostate cancer but that our model predicts are highly likely to be connected to it.
Representation learning for the aforementioned heterogeneous networks faces several challenges. Nodes in a network may represent entities of vastly different characteristics, and edges may represent disparate relationships, each of which may carry a different weight or other attributes. Conventional network embedding methods [20][21][22] focus on homogeneous networks and are based on the skip-gram model [23] to learn the topological structures and other latent attributes of networks. Recently, deep neural networks have been introduced into homogeneous network embedding: [24][25][26] utilize graph convolutional networks (GCNs), which generalize the convolution operation [27] from traditional data (images or grids) to graph data and learn the connectivity structures from the adjacency matrices of graphs. There are also several existing works on heterogeneous network embedding [28][29][30][31]. Translation-based models [28,29] learn representations of entities (nodes) and relationships (links) in knowledge graphs, which can be regarded as heterogeneous networks, but these models only preserve the local structure by interpreting relations as translations and ignore the link weights in the network. Another kind of method [30,31] decomposes a heterogeneous network into a set of subgraphs and then embeds each subgraph individually; such methods ignore the different semantics of relationships in each subgraph and only capture the aggregated information of relationships by combining the embeddings of the subgraphs.
Moreover, [32,33] consider the distinctive characteristics of relations (or entities) in the heterogeneous network. However, [32] only projects different kinds of nodes (i.e., image and text) into the same vector space using neural networks, which ignores the semantic information that describes contextual properties in the heterogeneous network. [33] separates heterogeneous relations into two categories by structure-related measures and uses a different embedding model for each category, but in various heterogeneous networks there exist relations that cannot be well distinguished by such structure-related measures. Although [34,35] introduce meta paths [35] to capture the rich semantic information in heterogeneous networks, they do not show how to select proper meta paths for different networks, especially biological networks.
Another challenge is the scalability of the network embedding method. Heterogeneous networks provide a large amount of information about node relations. However, it is non-trivial to capture a large number of heterogeneous relationships. And it is impossible to list all neighbor nodes under different relations when the network scales up. Therefore, we need a scalable method to capture such rich relations efficiently.
To overcome the aforementioned challenges, we propose HeteWalk, which is based on meta path [35] controlled random walks for representation learning in heterogeneous networks. We also consider the edge weights during representation learning and provide a random walk-based measure to assist in selecting meta paths. We use the meta paths to capture the abundant semantic information in the heterogeneous network. The random walk procedure, which has been shown to scale to large networks [20], is controlled not only by the meta paths but also by the link weights of our network. In the embedding vector space, entities that are close to each other but do not currently possess direct links (edges) are probably connected, and are thus significant subjects for future biological study.
To demonstrate the effectiveness of our method, we construct a heterogeneous network of diseases, genes and miRNAs using data from six real-world datasets and conduct two disease-related prediction tasks: disease-gene association prediction and disease-miRNA association prediction. We then compare the proposed method with several advanced disease association prediction methods as well as some typical network embedding methods. The experimental results show the superiority of our proposed method. Moreover, we observe that incorporating additional datasets to train our method consistently improves the accuracy of the predicted results. Furthermore, a substantial number of the associations we predict are verified by the latest miRNet dataset [10], which demonstrates that our method can effectively guide the discovery of new disease-related associations in biological studies.

Network construction
The accumulated biological data have been preserved and organized in massive databases. Nevertheless, only a fraction of the data generated by previous studies has been utilized, and the heterogeneity in data types, experimental technologies and experimental settings remains a vital challenge. In this section, we describe the construction of a weighted heterogeneous network that integrates data from various databases.

Datasets description
We use real-world data from six public sources to illustrate the definition and effectiveness of the proposed method. These biological datasets offer association networks and similarity networks among three types of entities: diseases, miRNAs and genes. The biological networks are described in detail as follows:
• Gene (protein) interaction network: We obtain 39,240 protein-protein interactions (PPIs) from the Human Protein Reference Database (HPRD) [13], which were manually extracted from the biological literature. For each pair of proteins with a direct connection, the corresponding protein-coding genes are linked through an unweighted edge in the HPRD network, and we set the weight to 1.0.
• miRNA similarity network: We acquire the similarities of miRNA functions from the MISIM database [12], which provides the pairwise functional similarity of 271 miRNAs. The similarity score of each link, calculated by the MISIM method, ranges from 0 to 1.
• Disease phenotype similarity network: The similarities of human diseases are extracted from MimMiner [14], which uses a text-mining method to classify human diseases from the Online Mendelian Inheritance in Man (OMIM) database [36]. Every link carries a similarity score ranging from 0 to 1, calculated by the MimMiner system.
• Gene-Disease association network: We extract this network from the DisGeNET database [7], which incorporates human gene-disease associations from various professional databases. We use the 19,714 entries whose disease phenotypes can be related to OMIM terms. Every association possesses a confidence score ranging from 0 to 1, called the DisGeNET score [7], which takes into account the number of sources supporting the association and the reliability of each of them.
• Gene-miRNA interaction network: The gene-miRNA interactions are provided by the miRTarBase database [37], which is compiled through a manual survey of the literature on functional studies of miRNAs. The collected interactions are verified experimentally by reporter assay, western blot, microarray or next-generation sequencing experiments. During network construction, we set the weights of the 7,269 interactions supported by strong experimental evidence (reporter assay or western blot) to 1, and the weights of the 13,990 interactions supported by weak experimental evidence (microarray or pSILAC) to 0.3. The experimental evidence is supported by many crosslinking and immunoprecipitation sequencing (CLIP-seq) datasets generated by 21 independent studies [37].
• miRNA-Disease association network: Two datasets are combined to build this network. One dataset provides 242 miRNA-disease associations offered by Chen et al. [11]. The other is derived from the miRNet dataset [10], which contains a substantial number of confirmed miRNA-disease associations incorporated from HMDD [38], miR2Disease [39], and Phenomir [40]; from it we extract the records whose disease names can be mapped to OMIM ids, obtaining 666 disease-miRNA associations. After deleting duplicated records, we obtain 878 miRNA-disease associations involving 267 miRNAs and 59 diseases in total. Because the associations have been validated at a high level of confidence, we set all the weights to 1.0.

Weighted heterogeneous network construction
We build a weighted heterogeneous network by joining the six above-mentioned networks through their shared nodes. In these networks, genes are denoted by their gene symbols in HPRD [13], miRNAs are denoted by their names, and disease phenotypes are denoted by their respective OMIM ids [36]. We summarize each sub-network of the constructed heterogeneous network in Table 1. Figure 2 presents the network schema, which comprises three types of nodes: rhombuses denote genes, circles denote miRNAs and squares denote diseases. The solid black lines indicate the existing connections in the network, and the red dashed lines indicate the links to be predicted, namely disease-gene associations and disease-miRNA associations. The constructed heterogeneous network includes various types of entities as well as relationships (links) with different weights. However, it is not appropriate to compare the weights of links of different types directly, since they come from distinct datasets. For instance, if the weight of the link between prostate cancer (disease) and hsa-mir-21 (miRNA) is lower than that of the link between prostate cancer and ATM (gene), this does not necessarily suggest that hsa-mir-21 has a weaker association with prostate cancer than ATM does. Consequently, for a heterogeneous network, we need to map the network into a vector space where similarities and interactions between entities of different types can be numerically measured and predicted.
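As a minimal sketch of how sub-networks might be joined through shared nodes, the snippet below stores node types and weighted, undirected edges in plain dictionaries. The class name `HeteroNetwork`, its method names, the node names and the toy weights are all hypothetical and chosen only for illustration; they are not taken from the original implementation or from the source databases.

```python
from collections import defaultdict


class HeteroNetwork:
    """Undirected weighted network whose nodes carry a type label."""

    def __init__(self):
        self.node_type = {}            # node name -> "G", "M" or "D"
        self.adj = defaultdict(dict)   # node name -> {neighbor: weight}

    def add_edges(self, edges, u_type, v_type):
        """Add (u, v, weight) triples from one source sub-network."""
        for u, v, weight in edges:
            # A node name seen in several sub-networks maps to one node,
            # which is what joins the sub-networks together.
            self.node_type[u] = u_type
            self.node_type[v] = v_type
            self.adj[u][v] = weight
            self.adj[v][u] = weight

    def neighbors_of_type(self, node, wanted_type):
        """Return {neighbor: weight} restricted to one node type."""
        return {n: w for n, w in self.adj[node].items()
                if self.node_type[n] == wanted_type}


# Hypothetical toy edges; real weights would come from the source databases.
net = HeteroNetwork()
net.add_edges([("gene_a", "gene_b", 1.0)], "G", "G")        # G-G (PPI)
net.add_edges([("mirna_a", "mirna_b", 0.6)], "M", "M")      # M-M similarity
net.add_edges([("gene_a", "mirna_a", 0.3)], "G", "M")       # G-M interaction
net.add_edges([("gene_a", "disease_a", 0.4)], "G", "D")     # G-D association
net.add_edges([("mirna_a", "disease_a", 1.0)], "M", "D")    # M-D association
```

Keeping the type label next to each node makes it cheap to ask for "neighbors of a given type", which is the basic query a type-constrained walk over such a network would need.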

HeteWalk
HeteWalk is a network embedding method that generates a low-dimensional representation vector for every entity in the heterogeneous network; the vector captures structural and semantic information, especially the existing relationships. A key intuition behind our method is that diseases (or genes, miRNAs) in close proximity to each other in the network have a higher potential to be associated. For instance, a miRNA that plays an important part in one disease may play a similar part in a similar disease. This intuition allows us to predict unknown disease-related links from the existing edges.

Network embedding
Recently, several network embedding methods [20,21] have shown competitive performance in various tasks such as node classification, link prediction and clustering. To learn effective node representations for a network, we would like to maximize the probability of a node occurring given that its connected nodes (i.e., those with direct links) have occurred [20,22]. Given a node $v_i$ and its set of connected nodes $N(v_i)$, we want to maximize the conditional probability of observing $N(v_i)$ for the node $v_i$. Assuming that the probabilities of observing the individual nodes are independent of one another, we want to maximize the following objective function:
$$\max \prod_{v_i \in V} \prod_{v_j \in N(v_i)} p(v_j \mid v_i) \tag{1}$$
We define the conditional probability as follows:
$$p(v_j \mid v_i) = \frac{\exp(x_i \cdot x_j)}{\sum_{v_l \in V} \exp(x_i \cdot x_l)} \tag{2}$$
where $V$ is the set of all nodes in the network, $x_i$ is the embedding vector of node $v_i$, and $x_j$ is the embedding vector of node $v_j$. All node vectors are latent $d$-dimensional vectors learned by optimizing the objective function.
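To make the conditional probability concrete, the sketch below evaluates it as a softmax over inner products of embedding vectors, which is the usual skip-gram form and is assumed here rather than taken from the original implementation. The function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def conditional_probability(emb, i, j):
    """Softmax over inner products: probability of node j given node i."""
    scores = {k: dot(emb[i], vec) for k, vec in emb.items()}
    top = max(scores.values())                 # shift for numerical stability
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    return exps[j] / sum(exps.values())


# Hypothetical 2-dimensional embeddings for one disease and two genes.
emb = {"d1": [0.2, 0.1], "g1": [0.3, 0.0], "g2": [-0.4, 0.5]}
p_g1 = conditional_probability(emb, "d1", "g1")
p_g2 = conditional_probability(emb, "d1", "g2")
```

Note that the denominator runs over every node in the network, so a single evaluation costs time linear in the number of nodes; this is the cost that an approximation such as negative sampling avoids.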
The majority of existing network embedding methods focus on homogeneous networks, in which all nodes and edges are of the same type. In our constructed network, a disease node may link to other diseases, genes or miRNAs, which are not of a single type. To fully capture the abundant contextual information and semantic properties of a node in such a complicated network, we need to go beyond directly linked nodes. For instance, if a gene and a disease are connected via a path involving several links (for example, through an intermediate node such as a shared miRNA), they may be related as well. Next, we present how to take advantage of such paths in heterogeneous network embedding.

Meta path-controlled random walk
A meta path $P$ is a path that describes a composite relation between two objects. We write a meta path in the form $A_1 \rightarrow A_2 \rightarrow \cdots \rightarrow A_m$, where $A_i$ denotes a node type (e.g., disease, gene) [35]. Different meta paths can be used to distinguish the multiple relationships that two nodes may have in a heterogeneous network. For instance, the meta path Gene $\xrightarrow{\text{assoc}}$ Disease represents a direct gene-disease connection; the meta path Gene $\xrightarrow{\text{assoc}}$ miRNA $\xrightarrow{\text{assoc}}$ Disease represents the relationship in which a gene and a disease are connected to a common miRNA; and the meta path Gene $\xrightarrow{\text{sim}}$ Gene $\xrightarrow{\text{assoc}}$ Disease represents a gene that is similar to another gene which is associated with a disease. Clearly, the semantics underlying these meta paths are different.
A meta path is a powerful way to describe indirect relationships among specific types of nodes. The number of different meta paths increases exponentially with the number of entity and relation types and with the length of the meta paths, supplying rich semantic information about the contextual characteristics of the network. Furthermore, to take the link weights into account at the same time, we apply a meta path-controlled random walk to search for the associated entities of each meta path. The meta path indicates which type of neighbor node should be visited at each step, and the link weights then determine the probability with which each node of that type is chosen. We describe how to construct and select meta paths in the "Meta-path selection" and "Experimental settings" sections. Starting at a node $v_i$ of type $A_k$, given a meta path $P = A_1 \rightarrow A_2 \rightarrow \cdots \rightarrow A_m$, the random walk procedure only visits a connected node of type $A_{k+1}$ at the next step. If there are several such nodes, we randomly choose one with probability proportional to the link weight, so that a node with a higher link weight is more likely to be selected. For each node $v_i$ of type $A_k$, we define its transition probability to another node $v_j$ as:
$$p(v_j \mid v_i, P) = \begin{cases} \dfrac{w_{ij}}{\sum_{v_l:\ (v_i, v_l) \in E,\ \phi(v_l) = A_{k+1}} w_{il}} & \text{if } (v_i, v_j) \in E \text{ and } \phi(v_j) = A_{k+1}, \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$
where $E$ denotes the edge set of the network, $\phi(v_i)$ denotes the type of node $v_i$, and $w_{ij}$ indicates the weight of the link between $v_i$ and $v_j$. The random walk procedure creates a node sequence starting from each node, guided by a meta path. To produce enough node sequences, we repeat the random walk procedure from every node.
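The walk described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes that a walk follows the meta path once from its first type and stops early at a dead end, and the function names, node names and weights are hypothetical.

```python
import random


def transition_probs(adj, node_type, node, next_type):
    """Normalized link weights over the neighbors of `node` with `next_type`."""
    candidates = {n: w for n, w in adj[node].items()
                  if node_type[n] == next_type}
    total = sum(candidates.values())
    if total <= 0:
        return {}
    return {n: w / total for n, w in candidates.items()}


def meta_path_walk(adj, node_type, start, meta_path, rng):
    """Follow `meta_path` once from `start`; stop early at a dead end."""
    if node_type[start] != meta_path[0]:
        return []                      # start node does not match the path
    walk = [start]
    for next_type in meta_path[1:]:
        probs = transition_probs(adj, node_type, walk[-1], next_type)
        if not probs:
            break                      # no neighbor of the required type
        nodes = list(probs)
        weights = [probs[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical toy network: two genes (G) and two diseases (D).
adj = {
    "g1": {"g2": 1.0, "d1": 0.2},
    "g2": {"g1": 1.0, "d1": 0.9, "d2": 0.1},
    "d1": {"g1": 0.2, "g2": 0.9},
    "d2": {"g2": 0.1},
}
node_type = {"g1": "G", "g2": "G", "d1": "D", "d2": "D"}
walk = meta_path_walk(adj, node_type, "g1", ["G", "G", "D"], random.Random(0))
```

In this toy network a walk from `g1` along G, G, D must first move to `g2`, and then reaches `d1` far more often than `d2` because of the larger link weight.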

Meta-path selection
Although a variety of meta paths can be defined by combining different node types, many of them are redundant and may lead to low efficiency. Besides, some meta paths may carry misleading information, which can interfere with the tasks [41]. It is therefore important to select proper meta path(s). Here we propose a random walk-based measure to assist in selecting meta paths. During a random walk, we want to visit as many nodes as possible in order to capture more characteristics of the network. Given a candidate set of meta paths, for each meta path the random walk procedure controlled by that meta path is repeated $m$ times for each node. We then count the number of nodes that are visited no more than $m$ times, and we call these nodes isolated walking nodes. For a meta path $P$, with the random walk repeated $m$ times for every node in the network, the random walk-based measure is defined as the count of isolated walking nodes:
$$M(P) = \sum_{v_i \in V} I(t_i \le m) \tag{4}$$
where $I$ is the indicator function, $V$ is the set of all nodes in the network, and $t_i$ is the number of times node $v_i$ is visited by the random walks. The smaller the value of the random walk-based measure for a meta path, the more nodes the random walks controlled by that meta path visit and the more attributes of the network they capture; such a meta path is therefore a better choice.
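Counting isolated walking nodes is simple once the walks have been generated. The sketch below assumes the walks are available as lists of node names; the function name and the toy walks are hypothetical.

```python
from collections import Counter


def isolated_walking_nodes(all_nodes, walks, m):
    """Count the nodes that are visited no more than m times across all walks."""
    visits = Counter(node for walk in walks for node in walk)
    # Counter returns 0 for nodes that never appear in any walk.
    return sum(1 for node in all_nodes if visits[node] <= m)


# Hypothetical walks for one meta path with m = 1 walk per start node;
# the two disease nodes cannot start this path and yield empty walks.
all_nodes = ["g1", "g2", "d1", "d2"]
walks = [["g1", "g2", "d1"], ["g2", "g1", "d1"], [], []]
measure = isolated_walking_nodes(all_nodes, walks, m=1)
```

Here `g1`, `g2` and `d1` are each visited twice and `d2` is never visited, so only `d2` counts as an isolated walking node. Among several candidate meta paths, the one with the smallest count would be preferred under this measure.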

Negative sampling
After obtaining a set of node sequences, our next step is to learn the vector representation of each node. As illustrated in Eq. (1), we aim to maximize the probability of each node occurring given its linked nodes. That is, for nodes occurring in the same node sequence, their representations are updated to maximize Eq. (1). Because there is a massive number of node pairs in all the node sequences, it is very costly to compute Eq. (1). Inspired by the optimization used in word embedding methods, we employ negative sampling [23] to approximate it:
$$\log \sigma(x_i \cdot x_j) + \sum_{n=1}^{K} \mathbb{E}_{v_n \sim NEG(v_j)} \left[ \log \sigma(-x_i \cdot x_n) \right] \tag{5}$$
where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, $NEG(v_j)$ is the distribution from which a negative node $v_n$ is sampled, and $K$ is the number of negative samples.
For each positive node pair, we randomly choose $K$ negative node pairs. Take $(\text{Disease}_1, \text{Gene}_1)$ as an example: $K$ nodes of gene type are randomly selected, denoted by $\text{Gene}_{N_1}, \cdots, \text{Gene}_{N_K}$, where $\text{Gene}_{N_i} \neq \text{Gene}_1$. The positive sample $(\text{Disease}_1, \text{Gene}_1)$ and the $K$ negative samples $(\text{Disease}_1, \text{Gene}_{N_i})$ are fed into the model at the same time, and we use stochastic gradient descent (SGD) [42] to update their representation vectors based on Eq. (5).
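The following sketch shows one way such an update could look for a single positive pair and its negatives. It assumes the standard skip-gram negative-sampling objective, uniform sampling of negatives of the same type, and a fixed learning rate; the function names, the learning rate and the toy vectors are hypothetical and not taken from the original implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sample_negatives(same_type_nodes, positive, k, rng):
    """Draw k nodes of the positive node's type, never the positive itself."""
    pool = [n for n in same_type_nodes if n != positive]
    return [rng.choice(pool) for _ in range(k)]


def objective(emb, center, positive, negatives):
    """log sigma(x_i . x_j) plus the sum over negatives of log sigma(-x_i . x_n)."""
    xi = emb[center]
    value = math.log(sigmoid(dot(xi, emb[positive])))
    for n in negatives:
        value += math.log(sigmoid(-dot(xi, emb[n])))
    return value


def sgd_step(emb, center, positive, negatives, lr=0.025):
    """One stochastic gradient ascent step on `objective`, updating emb in place."""
    xi = emb[center]
    center_update = [0.0] * len(xi)
    for node, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        xn = emb[node]
        g = lr * (label - sigmoid(dot(xi, xn)))   # scaled gradient coefficient
        for d in range(len(xi)):
            center_update[d] += g * xn[d]         # read xn[d] before changing it
            xn[d] += g * xi[d]
    for d in range(len(xi)):
        xi[d] += center_update[d]


# Hypothetical toy embeddings: one disease and three genes.
rng = random.Random(0)
emb = {"d1": [0.1, -0.2], "g1": [0.3, 0.1], "g2": [-0.2, 0.4], "g3": [0.05, 0.05]}
negs = sample_negatives(["g1", "g2", "g3"], "g1", 2, rng)
before = objective(emb, "d1", "g1", negs)
for _ in range(50):
    sgd_step(emb, "d1", "g1", negs)
after = objective(emb, "d1", "g1", negs)
```

Each step touches only the positive node, the sampled negatives and the center node, so its cost depends on the number of negative samples rather than on the size of the network.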

Disease associations prediction
After network embedding, all types of nodes (diseases, genes and miRNAs) in our heterogeneous network are mapped to a common vector space. The cosine distance between node vectors is then used to assess their relationships. For the prediction of disease-related associations, if a disease and a gene/miRNA have no direct link in the network but are in proximity to each other in the projected vector space, they are very likely to be associated and are therefore promising subjects for biological investigation.
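A minimal sketch of this prediction step is given below. It ranks candidates by cosine similarity (equivalently, by ascending cosine distance, taking distance as one minus similarity, which is an assumption about the convention used). The function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return num / den if den > 0 else 0.0


def rank_candidates(emb, disease, candidates, known_links, top_k=2):
    """Rank candidates without a known link to `disease` by cosine similarity."""
    unlinked = [c for c in candidates if (disease, c) not in known_links]
    scored = [(cosine_similarity(emb[disease], emb[c]), c) for c in unlinked]
    scored.sort(reverse=True)          # most similar candidates first
    return [c for _, c in scored[:top_k]]


# Hypothetical embeddings; g1 already has a known link to d1 and is skipped.
emb = {"d1": [1.0, 0.0], "g1": [0.9, 0.1], "g2": [0.0, 1.0], "g3": [0.7, 0.7]}
top = rank_candidates(emb, "d1", ["g1", "g2", "g3"], known_links={("d1", "g1")})
```

The same function applies to miRNA candidates, since all node types share one vector space after embedding.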

Comparison to baselines
We compared our method HeteWalk with several state-of-the-art baselines to measure its performance. We partitioned these baseline methods into two groups. One group consists of CATAPULT [1], HSMP and HSSVM [4,5], which are conventional statistical and machine learning methods that do not use network embedding and are specially designed to identify a particular type of association (i.e., disease-miRNA or disease-gene). These methods were run on our constructed heterogeneous network. CATAPULT uses features extracted from paths of different lengths and is based on a biased support vector machine. HSMP and HSSVM evaluate the relevance between nodes using the HeteSim score [3], which measures the accessibility between two nodes along a given path. HSMP combines the HeteSim scores of multiple paths using a constant that dampens the contributions of long paths, and HSSVM integrates HeteSim scores using a supervised machine learning method.
The methods in the other group are representative network embedding methods, including DeepWalk [20], LINE [21], DGI [26], TransE [28] and AspEm [31]. DeepWalk is a typical homogeneous network embedding method, which uses a vanilla random walk procedure and learns representations of vertices by treating walks as sentences. LINE, which also ignores the heterogeneous information, preserves both first-order and second-order proximities and is suitable for arbitrary large-scale information networks such as our constructed network. DGI is, as far as we know, the latest homogeneous network embedding method that uses established graph convolutional network (GCN) [24] architectures. TransE, which models relationships as translations in the embedding space of entities, is a typical knowledge graph embedding method, where the knowledge graph can be regarded as a heterogeneous network. AspEm learns embeddings by aspects, with each aspect representing one underlying semantic facet of the heterogeneous network.
HeteWalk applies meta path-controlled random walks for heterogeneous network embedding. We use the embedded node vectors to predict entities (e.g., genes, miRNAs) that are highly likely to be associated with diseases.

Experimental settings
We experimentally evaluated the effectiveness of predicting two types of association: gene-disease association and miRNA-disease association. The vector dimension is set to 128, the number of walks per node and per meta path to 10, and the number of negative samples to 5, following common practice in network embedding [21,31]. In addition, for TransE we set the margin to 1 and the dissimilarity measure to L2, based on the best validation performance. For DGI, we used the one-hot representation of each node as node features and a weighted adjacency matrix extracted from our constructed network as input. For AspEm, since nodes may appear a different number of times in the selected set of representative aspects (e.g., one node may occur in two aspects, while another may occur in only one), and the dimension of the vector learned from each aspect was the same, we padded with zeros those vectors whose dimensions were below 128. We demonstrate in the "Parameter analysis" section that the performance is insensitive to the settings of the vector dimension and the number of walks.
When constructing meta paths, we first extracted all non-redundant meta paths related to the target entity types separately. After that, redundant meta paths were formed by combining two or more of them. Since long meta paths are of little use for capturing the link structure [35], only short meta paths of restricted length were extracted. This gave us the candidate set of meta paths. We then selected a meta path from the candidates using the random walk-based measure, with the number of random walks set to 10, the same as in the original experimental setting. The meta paths we extracted and their corresponding measure values are shown in Table 2. For gene-disease association prediction, the meta path "GGD" has the smallest measure value, 8658, and it is also the meta path selected empirically (i.e., the one with the best test results under cross validation over the meta paths). For miRNA-disease association prediction, however, the smallest measure value belongs to the meta path "MGGD", whereas empirically the meta path "MMDD" performed best ("G" denotes gene, "M" denotes miRNA and "D" denotes disease). This is mainly because the number of miRNA-disease edges is far smaller than that of the other edge types in the network, as can be observed from Table 1. Additionally, the measure value of "MMDD" is the smallest among the meta paths with only two node types (i.e., miRNA and disease). A meta path can therefore be selected not only empirically but also with the random walk-based measure, which can be regarded as an auxiliary approach that reduces the time spent on experiments. We used the meta path "GGD" for gene-disease association prediction and "MMDD" for miRNA-disease association prediction in the subsequent experiments. CATAPULT, HSMP, HSSVM, and our HeteWalk used the same meta paths.

Effectiveness measurement
In each experiment, we randomly partitioned the known disease associations into 10 sets of equal size, and used a subset of them for training and the rest for testing. For testing, the known associations were regarded as positive samples, and the same number of node pairs that have the same node types but no association were randomly selected as negative samples. The cosine distance between the embedding vectors of the node pair in each sample was the predicted value. The proportion of the training set varied from 50% to 90%. We repeated the experiments 10 times and report the average area under the receiver operating characteristic curve (AUROC) for each training ratio. The results are shown in Table 3 (gene-disease association prediction) and Table 4 (miRNA-disease association prediction). Our method outperforms the other methods in both disease association prediction tasks under all training ratios, except for gene-disease association prediction with 50% training data, where the AUROC score of HeteWalk is 0.638, slightly below the best score of 0.639 achieved by AspEm. With more training data, the advantage of our method becomes more significant. In practice, the training ratio is almost always much larger than 50%. For the miRNA-disease association prediction task, HeteWalk achieves an AUROC score of 0.969 at the 90% training ratio. In contrast, the best score on the gene-disease prediction task is 0.798, because there is a relatively larger number of candidate gene-disease associations.
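For reference, AUROC can be computed directly from the predicted values of the positive and negative samples as the fraction of positive-negative pairs that are ranked correctly. The sketch below uses this pairwise definition and assumes that a higher score means a more likely association; the function name and the toy scores are hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted values for two positive and two negative samples.
score = auroc([0.9, 0.8], [0.1, 0.85])
```

Three of the four positive-negative pairs in this toy example are ordered correctly, which gives a score of 0.75. The pairwise loop is quadratic in the number of samples, so a rank-based formulation would be preferable for large test sets.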
HeteWalk demonstrates the superiority over heterogeneous network-based baselines, involving CATAPULT, The best performance is in bold HSMP, HSSVM, TransE,and AspEm. CATAPULT, HSMP, and HSSVM use the same set of meta paths with Hete-Walk, but only simple features on accessibility between two nodes along path are extracted by them. By contrast, HeteWalk preserves existent relationships through maximizing the conditional probability of each node pair occurring given other pairs in a node sequence which is created based on the meta path. Though TransE considers the heterogeneity in node (entity) and edge (relation) types, it only preserves the local structures in the network represented by observed links and ignores link weights while our HeteWalk preserves global structures by meta path-controlled random walks in addition to the local structures and the selected nodes on random walk are determined by both link weight and meta path. AspEm learns embedding vectors from each aspect (selected subgraph) and then gets the final embedding for each node by concatenating the learned vectors from all aspects involving that node, so a problem occurs that not all embedding vectors are in the same vector space and some important information learned from the network may be lost after projecting all representation vectors to the same vector space.
The main reason why DeepWalk, LINE, and DGI perform poorly is that they are designed for homogeneous networks. When DeepWalk selects the next node to visit during a random walk, it ignores the differences between relationship types and treats all types of nodes equally. LINE, which preserves both local and global structures through first-order and second-order proximity, also ignores node and link types. DGI uses the weighted adjacency matrix as structural features, which does not distinguish between node and link types. As a result, these embedding methods may fail to preserve the relationships between specific entities.
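To make the contrast with type-blind walks concrete, the following is a minimal sketch of a meta path-controlled random walk in which the next node must have the type required by the meta path and is drawn with probability proportional to its link weight. It is an illustration under assumptions, not the authors' implementation: the graph representation, the function name `meta_path_walk`, and the toy nodes are hypothetical, and the subsequent embedding training on the generated sequences is omitted.

```python
import random


def meta_path_walk(graph, node_type, start, meta_path, walk_length, rng):
    """Generate one walk that follows `meta_path` cyclically.

    graph:      dict mapping node -> list of (neighbor, weight) pairs
    node_type:  dict mapping node -> type label, e.g. "G", "D" or "M"
    meta_path:  list of type labels whose first and last entries match,
                e.g. ["M", "D", "M"], so that the pattern can repeat
    """
    if meta_path[0] != meta_path[-1]:
        raise ValueError("meta path must start and end with the same type")
    if node_type[start] != meta_path[0]:
        raise ValueError("start node does not match the meta path")
    cycle = meta_path[:-1]  # repeating unit of the pattern
    walk = [start]
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        # Only neighbors of the required type are eligible.
        candidates = [(nbr, w) for nbr, w in graph.get(walk[-1], [])
                      if node_type[nbr] == wanted and w > 0]
        if not candidates:
            break  # dead end: no neighbor of the required type
        neighbors, weights = zip(*candidates)
        # Heavier links are proportionally more likely to be followed.
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


if __name__ == "__main__":
    # Hypothetical toy network, for illustration only.
    graph = {
        "m1": [("d1", 1.0), ("m2", 5.0)],
        "m2": [("d1", 2.0), ("m1", 5.0)],
        "d1": [("m1", 1.0), ("m2", 2.0), ("g1", 3.0)],
        "g1": [("d1", 3.0)],
    }
    node_type = {"m1": "M", "m2": "M", "d1": "D", "g1": "G"}
    print(meta_path_walk(graph, node_type, "m1", ["M", "D", "M"], 7,
                         random.Random(0)))
```

A homogeneous walk in the style of DeepWalk would correspond to dropping the type filter, so that the heavy `m1`-`m2` link could be followed at any step regardless of the intended semantics of the path.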

Advantage of heterogeneity
We investigated how well each method handles heterogeneity and what is gained by incorporating multiple data sources. To this end, we constructed two further heterogeneous networks, each consisting of only two types of nodes. For the gene-disease association prediction task, we joined only the G-G, G-D, and D-D networks described in Table 1. For the miRNA-disease association prediction task, we used only the D-D, M-M, and D-M networks. We conducted 3-fold cross validation: the known disease associations were divided into three equal-sized parts, and in each round two parts were used for training and the third for testing. Fig. 3 compares the average score of each method on the two tasks. Combining networks into a larger and more complex one yields a conspicuous improvement, particularly in the miRNA-disease association prediction task. This may be owing to the sparse relations between miRNAs and diseases, which make predictions based on these relations alone fairly unreliable. The gene-related data provide information about indirect relations between miRNAs and diseases, which may be captured via the meta paths. This demonstrates that latent knowledge about complex diseases can be uncovered by integrating multifaceted data, which substantially improves our prediction results. Although we have demonstrated the effectiveness of HeteWalk on six databases, HeteWalk can incorporate any amount of data that can be represented as a network. The number of node and link types is not limited.
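The two-type networks and the 3-fold protocol can be sketched as follows. This is an illustrative sketch only: the edge tuple layout `(source, target, weight, edge_type)`, the function names, and the toy data are assumptions, while the edge-type labels follow the network names used in the text.

```python
import random


def restrict_network(edges, allowed_edge_types):
    """Keep only the edges whose type label is in `allowed_edge_types`."""
    return [edge for edge in edges if edge[3] in allowed_edge_types]


def k_fold_splits(associations, k, rng):
    """Yield (train, test) lists so that each association is tested exactly once."""
    shuffled = list(associations)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [pair for j, fold in enumerate(folds) if j != i
                 for pair in fold]
        yield train, folds[i]


if __name__ == "__main__":
    # Hypothetical edges, for illustration only.
    edges = [
        ("g1", "g2", 1.0, "G-G"),
        ("g1", "d1", 0.5, "G-D"),
        ("d1", "d2", 0.8, "D-D"),
        ("m1", "m2", 0.9, "M-M"),
        ("d1", "m1", 0.7, "D-M"),
    ]
    gene_disease_net = restrict_network(edges, {"G-G", "G-D", "D-D"})
    mirna_disease_net = restrict_network(edges, {"D-D", "M-M", "D-M"})
    print(len(gene_disease_net), len(mirna_disease_net))

    known = [("g%d" % i, "d%d" % i) for i in range(9)]
    for train, test in k_fold_splits(known, 3, random.Random(0)):
        print(len(train), len(test))
```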

Parameter analysis
We explored the parameter sensitivity of HeteWalk using the same setting as the 3-fold cross validation described above. Fig. 4 presents the performance for various vector dimensions and numbers of walks per node. Fig. 4a shows that the best performance is attained at around 128 dimensions. Fig. 4b shows that the AUROC remains almost steady once the number of walks per node exceeds 10. We therefore set the vector dimension to 128 and the number of walks per node to 10 in the experiments, balancing performance against computational cost.

Top-ranked predicted associations for specified diseases
The top-ranked gene and miRNA candidates for eight disease phenotypes predicted by HeteWalk are listed in Table 5, to investigate which of them may play a dominant role in a particular disease. For each disease, the top-ranked genes are in the left column and the top-ranked miRNAs are in the right column.
These candidates are ranked by their cosine distance to each selected disease. For conciseness, the existing associations are not displayed; the numbers in the table denote the original rankings before the known associations were removed.
We find that the existing associations are not always ranked high on the list, even though the diseases have many directly related genes and miRNAs in our real-world datasets. For instance, there are 33 known genes associated with insulin resistance (125853) in the datasets, but not all of them are ranked near the top. This results from their relatively low link weights in our constructed network, which indicate a weak relation to insulin resistance. Moreover, in our method several meta paths can capture the complex relationships between insulin resistance and genes without direct links, so these genes may lie closer to the disease in the embedding space than some directly connected genes. In addition, many previously unknown associations with genes or miRNAs are predicted for other diseases, which may help biologists identify new disease relations.
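The ranking procedure described in this subsection can be sketched as below: candidates are sorted by their closeness to the disease vector, known associations are dropped from the displayed list, and each remaining candidate keeps its original rank. The names `rank_candidates`, the toy vectors, and the candidate labels are hypothetical, and cosine similarity (larger means closer) is assumed as the ranking score.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(disease_vec, candidate_vecs, known, top_k):
    """Return up to `top_k` (original_rank, candidate) pairs, skipping known associations."""
    ordered = sorted(
        candidate_vecs,
        key=lambda name: cosine_similarity(disease_vec, candidate_vecs[name]),
        reverse=True,
    )
    results = []
    for rank, name in enumerate(ordered, start=1):
        if name in known:
            continue  # known associations are hidden but still occupy a rank
        results.append((rank, name))
        if len(results) == top_k:
            break
    return results


if __name__ == "__main__":
    # Hypothetical embeddings, for illustration only.
    disease = [1.0, 0.0]
    genes = {"geneA": [1.0, 0.0], "geneB": [0.9, 0.1],
             "geneC": [0.0, 1.0], "geneD": [0.5, 0.5]}
    print(rank_candidates(disease, genes, known={"geneA"}, top_k=2))
```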

Validation and comparison of the top-ranked miRNA-disease associations prediction
To validate our approach, we manually checked the miRNA-disease associations predicted by our algorithm against the miRNet dataset [10], which contains a large collection of verified miRNA-disease associations from miR2Disease [39], HMDD [38] and PhenomiR [40]. Because each disease is represented by a disease name rather than its OMIM id, we combined only part of the records (666 of 19,342) into the heterogeneous network; the remaining records were used to validate the top-ranked miRNA-disease associations predicted by HeteWalk.
In the experiment, all datasets in Table 1 were used to generate the heterogeneous network, and our method was applied to learn the representation vector of each node. Table 6 reports the top 10 diseases predicted to be associated with each of four miRNAs (hsa-mir-21, hsa-let-7a-1, hsa-mir-125b-1 and hsa-mir-155), which have the largest numbers of verified records in the miRNet dataset. Among these predictions, we identified 8, 7, 6, and 7 confirmed associations for hsa-mir-21, hsa-let-7a-1, hsa-mir-125b-1 and hsa-mir-155, respectively, demonstrating the effectiveness of our method.
The first column in Table 6 gives the rank of each predicted disease among all associated diseases, and the second column gives the disease name and OMIM id. The last column indicates whether the predicted association is verified in miRNet and, if so, gives the verification source. There are 7, 11, 4, and 6 known disease associations in the training set for hsa-mir-21, hsa-let-7a-1, hsa-mir-125b-1, and hsa-mir-155, respectively. Some of these known associations were not ranked highly, for two reasons. First, some of them have relatively low weights, suggesting a weak relationship with the disease. Second, some diseases and miRNAs that have no direct link in the training data are nevertheless closely related through several meta paths in the heterogeneous network. Such diseases are therefore considered more strongly associated with the miRNAs than diseases that are directly connected but with low link weights, and they are more likely to be predicted by HeteWalk.
The top 10 disease phenotypes predicted for these four miRNAs by the alternative baselines (CATAPULT, HSMP and HSSVM) are listed in Tables 7, 8 and 9, with records verified by miRNet indicated in bold. The known associations are again omitted from these tables, and the first column gives the original rankings. We compare them with the results predicted by HeteWalk.
There is considerable overlap among the four miRNAs in the predictions from CATAPULT (Table 7). Male germ cell tumor (273300) occurs within the top three predicted candidate diseases for all four miRNAs. Nonmedullary thyroid cancer 1 (188550) and enterocolitis (226150) also occur in all four lists. This is because CATAPULT is biased towards nodes with larger degrees and may therefore neglect important connections that are specific to a single miRNA. The top-ranked predictions returned by HSMP (Table 8) and HSSVM (Table 9) overlap less than those of CATAPULT. In these two tables, associations verified by miRNet are in bold; the numbers of confirmed associations are 5, 5, 5, 4 for HSMP and 5, 6, 1, 5 for HSSVM, fewer than the 8, 7, 6, 7 obtained by HeteWalk.
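The two quantities compared in this subsection, the number of top-ranked predictions confirmed by a held-out validation set and the overlap between the prediction lists of different miRNAs, can be computed as in the following sketch. The function names and the placeholder labels are hypothetical, and no values from Tables 6 to 9 are reproduced here.

```python
def count_confirmed(predicted, verified):
    """Number of predicted diseases that appear in the verified set."""
    return sum(1 for disease in predicted if disease in verified)


def shared_across_lists(prediction_lists):
    """Diseases that occur in every miRNA's top-ranked list."""
    sets = [set(diseases) for diseases in prediction_lists.values()]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    top_predictions = {
        "mirna_a": ["d1", "d2", "d3"],
        "mirna_b": ["d2", "d3", "d4"],
    }
    verified = {
        "mirna_a": {"d2", "d3", "d9"},
        "mirna_b": {"d4"},
    }
    for mirna, predicted in top_predictions.items():
        print(mirna, count_confirmed(predicted, verified[mirna]))
    print(sorted(shared_across_lists(top_predictions)))
```

A large shared set across all lists is the pattern attributed above to a degree-biased method, whereas lists that differ between miRNAs suggest that miRNA-specific connections are being used.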

Conclusion
In this paper, we propose a heterogeneous network embedding method for accurately predicting disease associations. We construct a heterogeneous network from various biological databases and obtain a representation vector for each entity in the network through meta path [35] controlled random walks. Moreover, we take edge weights into account during representation learning and provide a random walk-based measure to assist in selecting meta paths. The learned network embedding captures the semantic characteristics and topological structures of the network well, enabling accurate prediction of disease-related associations. Experimental results on real-world datasets show the superiority of our method across multiple evaluations.
As for future work, we plan to incorporate more heterogeneous network data to improve the performance of association prediction and to generalize HeteWalk to other kinds of heterogeneous networks.

Authors' contributions
YX, MG, LR, XK and WW designed the study, performed the experiments and drafted the manuscript. CT assisted with the study design. YZ and WW supervised the study. All of the authors have read and approved the final manuscript.