 Research
 Open access
 Published:
HPOAnnotator: improving largescale prediction of HPO annotations by lowrank approximation with HPO semantic similarities and multiple PPI networks
BMC Medical Genomics volume 12, Article number: 187 (2019)
Abstract
Background
As a standardized vocabulary of phenotypic abnormalities associated with human diseases, the Human Phenotype Ontology (HPO) has been widely used by researchers to annotate phenotypes of genes/proteins. For saving the cost and time spent on experiments, many computational approaches have been proposed. They are able to alleviate the problem to some extent, but their performances are still far from satisfactory.
Method
For inferring largescale proteinphenotype associations, we propose HPOAnnotator that incorporates multiple ProteinProtein Interaction (PPI) information and the hierarchical structure of HPO. Specifically, we use a dual graph to regularize Nonnegative Matrix Factorization (NMF) in a way that the information from different sources can be seamlessly integrated. In essence, HPOAnnotator solves the sparsity problem of a proteinphenotype association matrix by using a lowrank approximation.
Results
By combining the hierarchical structure of HPO and coannotations of proteins, our model can well capture the HPO semantic similarities. Moreover, graph Laplacian regularizations are imposed in the latent space so as to utilize multiple PPI networks. The performance of HPOAnnotator has been validated under crossvalidation and independent test. Experimental results have shown that HPOAnnotator outperforms the competing methods significantly.
Conclusions
Through extensive comparisons with the stateoftheart methods, we conclude that the proposed HPOAnnotator is able to achieve the superior performance as a result of using a lowrank approximation with a graph regularization. It is promising in that our approach can be considered as a starting point to study more efficient matrix factorizationbased algorithms.
Background
Phenotypes refer to observable physical or biological traits of an organism. Revealing the relationships between genes/proteins and their related phenotypes is one of the main objectives of genetics in the postgenome era [1–3]. The Human Phenotype Ontology (HPO) [4] is a standardized vocabulary for describing the phenotypic abnormalities associated with human diseases [5]. Being initially populated by using databases of human genes and genetic disorders such as OMIM [6], Orphanet [7] and DECIPHER [8], HPO was later expanded by using literature curation [9]. At present, only small quantities of human proteincoding genes (∼3500) have HPO annotations. It is, however, believed that a large number of currently unannotated genes/proteins are related to disease phenotypes. Therefore, it is critical to predict genes/proteinHPO associations by using accurate computational methods.
Currently, HPO contains four subontologies: Organ abnormality, Mode of inheritance, Clinical modifier, and Mortality/Aging. As the main subontology, Organ abnormality describes clinical abnormalities whose first level children are formed by terms like abnormality of a skeletal system. The Mode of inheritance describes inheritance patterns of phenotypes and contains terms such as Autosomal dominant. The Clinical modifier contains classes that describe typical modifiers of clinical symptoms such as those triggered by carbohydrate ingestion. For Mortality/Aging, it describes the age of death by terms like Death in childhood and Sudden death. The Organ abnormality, Mode of inheritance, Clinical modifier, and Mortality/Aging have ∼12000, 28, 100, and 8 terms, respectively.
The annotations between genes/proteins and HPO terms are very sparse. Specifically, 284621 annotations are for 3459 proteins and 6407 HPO terms with the sparsity of 1.2%. Meanwhile, the annotation growth by time, for example, is about 5%, with adding only 14820 annotations as new ones between June 2017 to December 2017. Since genes/proteins are annotated with multiple HPO terms, the prediction can be regarded as a problem of multilabel predictions. Differing from this, HPO terms, however, form a hierarchical structure. This implies that once a gene/protein is labeled with one HPO term, it should also be labeled with all of its ancestors of this particular HPO term. In other words, when a gene/protein is not labeled with an HPO term, it should not be labeled with all of its descendants, either. That is, general terms are located at the top of the HPO structure, with the term specificity increasing from the root to the leaves. Figure 1 shows a real example of an HPO hierarchical structure (i.e., Directed Acyclic Graph, DAG) and the scale of subontologies.
The existing computational approaches for HPO annotation prediction can be divided into two categories, namely featurebased and networkbased methods. The featurebased approaches use gene/protein information as the features to predict its annotations for a query gene/protein. For sparse and noisy data, the incorporation of auxiliary information into original input data generally helps to improve predictive performance. One of these methods, learning to rank, has been demonstrated the superior performance in GO annotation prediction [10], for example. Compared with GO annotations, HPO annotations are, however, more reliable and stable. In addition, the sparseness of HPO annotations is much less than that of GO annotations, with focusing on human proteins and terms under Organ abnormality only. Nevertheless, few existing featurebased models take into consideration HPO information, e.g., the hierarchical structure and cooccurrence of HPO terms. The networkbased approaches are more prevalent at present. Usually, multiple networks are integrated into a new largescale network in order to improve the prediction in these approaches such as randomwalk [11] and weighted score computation [12]. However, networkbased approaches cannot perform well for sparse data. This is because of disconnected nodes that are commonly encountered in realworld graphs, particulary for sparse data, even though they can be related to each other.
Prediction of the annotations between genes/proteins and HPO terms can be grouped into two categorises: 1) pair prediction, which predicts the missing HPO annotations of existing proteins, and 2) prediction of new proteins, which annotates HPO terms to the totally unannotated proteins. Most existing work belong to the latter category, but few are for the former. To narrow this gap, we focus on the first category in this paper, which is also a famous task in the CAFA challenge. Existing methods for the first category have four major limitations. First, the hierarchy of HPO is completely ignored. The hierarchical structure poses a formidable challenge to a prediction: a model needs to evaluate the associations between a protein and all of its related phenotypes from the deeper levels to the root in the hierarchy. Second, the existing methods do not make full use of the potentials of ProteinProtein Interaction (PPI) networks. For example, a PPI network is modeled in the original annotation space in their models, which may not extract the information effectively. Moreover, multiple PPI networks may be derived from different sources, resulting in the data fusion. Third, only a few known associations are available for training. So they are extremely unbalanced. Specifically, more than half of the terms in HPO are used to annotate zero or only one protein. As a result, such a drastic sparsity makes prediction more challenging. Finally, existing methods usually study the subontologies independently without considering the coannotations of HPO terms. However, coannotations are quite common in annotations. It is likely that they help improve prediction results.
To address the above four problems, we apply matrix factorization to approximate a proteinHPO annotation matrix by two factorized lowrank matrices. As such, the latent factors that underlie the HPO annotations can be well captured. Since the HPO annotation matrix is binary, we choose to use Nonnegative Matrix Factorization (NMF). NMF has proved to be effective for sparse problems in the field of bioinformatics [13–16]. Based on our above observations, we propose an NMFbased framework called HPOAnnotator by which to predict missing proteinHPO annotations. In essence, the key idea of our model is to factorize the HPO annotation matrix into two nonnegative lowrank latent matrices, which correspond to the respective latent feature spaces of proteins and HPO terms. In addition, the graph Laplacian on PPI networks is performed to exploit their intrinsic geometric structure. Coannotations and the hierarchical structure of HPO are also incorporated to measure HPO semantic relationships.
We have experimentally validated the performance of HPOAnnotator by comparing it with the three networkbased approaches, which will be reviewed in the related work. The proposed model was tested on the latest largescale HPO data with around 300000 annotations. Experimental results clearly demonstrated that HPOAnnotator outperformed the competing methods under two scenarios: crossvalidation and independent test. It indicates that a lowrank approximation and network information are effective for pair prediction. Furthermore, our case studies further provide evidence for the practical use of HPOAnnotator. Note that, the work presented in this paper is the extension of our previous work AiProAnnotator [17] (AiPA for short). The main difference between the two methods is that HPOAnnotator can seamlessly combine multiple rather than single PPI networks and then benefit from them.
Related work
As mentioned before, we can group the existing approaches to HPO annotations into two categories: featurebased and networkbased ones.
Two wellknown methods of featurebased approaches are PHENOstruct [9] and ClusHMCEns [18]. ClusHMCEns applies the decision tree ensembles, while PHENOstruct (the extension of GOstruct which was designed to predict GO annotations) relies on the Structural Support Vector Machine (SSVM). Together with HPO annotations (i.e., labels) of each protein, a featurebased method normally accepts feature vectors as the input of a classifier. The trained classifier is then used to make a prediction. The above procedure is the same for both two categories of approaches. Additionally, it is worth noting that PHENOstruct and ClusHMCEns were originally developed for GO but then applied to HPO annotation prediction. In this sense, the difference between HPO annotations and GO annotations has not been fully taken into account by researchers.
Relying on two networks of proteinHPO annotations and the hierarchy of HPO (or Network of HPO, called NHPO) with an optional PPI Network (hereafter PPN), the networkbased approaches make predictions. The assumption behind them is that two nodes in a network should share some similarities, particular for those wellconnected nodes who have more similarities. In the following, we review the three methods as representatives of networkbased approaches, all of which are compared against our proposed approach in the experiments.
Birandom walk
BiRandom Walk (BiRW) [19, 20] has been demonstrated as a useful method for the binetwork prediction problem. BiRW performs random walks on the Kronecker product graph between PPN and NHPO in a way that they can be combined effectively for the proteinphenotype association prediction. The random walks iteratively performed by BiRW follow the equation:
where α>0 is a decay factor, P and G are the normalized PPN and NHPO matrix, respectively. Y_{t} is the estimation of associations at iteration t, and \(\mathbf { \widetilde {Y} }\) denotes the initial annotations in the training data. By introducing BiRW to capture the circular bigraphs patterns in the networks, the model can unveil phenomegenome associations over time.
Dual label propagation model
The label propagationbased algorithm has been successfully applied to predict phenotypegene associations in various forms [21, 22]. With the following objective function, label propagation assumes that proteins should be assigned to the same label, if they are connected in a PPN:
where \(\bar {\mathbf {S}}^{p}\) is a normalized PPN defined as \(\bar {\mathbf {S}}^{p} = \mathbf {D}^{\frac {1}{2}}\mathbf {S}^{p}\mathbf {D}^{\frac {1}{2}}\), and D is a diagonal matrix with the rowsum of S^{p} on the diagonal entries. Equation 2 can be rewritten as follows:
where tr(·) denotes the trace of matrix, ∥·∥_{F} denotes the Frobenius norm, and L_{S} is the normalized graph Laplacian matrix of \( \bar {\mathbf {S}}^{p}\) defined as \(\mathbf {L}_{S} = \mathbf {I}  \bar {\mathbf {S}}^{p}\).
The Dual Label Propagation model (DLP) [23] extends the label propagation model by adding two smoothness terms. The first term imposes the smoothness in a PPN such that interacting proteins tend to be associated with the same HPO term. The second term imposes the smoothness in NHPO in a way that the connected phenotypes (parentchild pair) are encouraged to be associated with the same protein. The objective function of DLP is given as:
where β,γ≥0 are tuning parameters, L_{S} and \(\mathbf {L}_{G_{Y}}\) encode the PPN and NHPO information, respectively. Ω is the binary indicator matrix that selects only the known associations to be penalized, and ⊙ denotes Hadamard product (a.k.a entrywise product).
Ontologyguided group lasso
The last method to be reviewed is Ontologyguided Group Lasso (OGL) [24]. It uses an ontologyguided group norm for HPO, rather than the graph regularizer in DLP. By combining label propagation and an ontologyguided group lasso norm derived from the hierarchical structure of HPO, OGL updates estimation, according to the following objective function:
where β,γ≥0 are balancing factors. \(r_{g}^{Y}\) is the group weight for group g. Y_{(g)i} selects the group members of group g from the ith column of Y, and the smoothness is imposed through the ℓ_{2}norm group lasso (∥·∥_{2}) among the members for the consistent prediction within the group. A notable difference between OGL and our model is that the estimated matrix is not factorized into lowrank matrices.
One of the biggest drawbacks of networkbased methods is that data sparseness has a significant impact on the performance. As mentioned before, the current HPO annotations are quite sparse. In addition, all of the network basedmethods suffer the heavy computational burden, as they accept a largescale proteinHPO annotation matrix as an input directly.
Methods
Notation
Let \(\phantom {\dot {i}\!}\mathbf {Y} \in \{ 0, 1 \}^{N_{p} \times N_{h}}\) be a proteinHPO annotation matrix, where N_{p} and N_{h} are the number of proteins and HPO terms, respectively. If protein i is annotated by an HPO term j, then Y_{ij}=1, and 0 otherwise. We define \(\phantom {\dot {i}\!}\mathbf {S}^{p_{k}}\) (k=1,2,⋯,t) be the networks for proteins, namely PPNs, where t is the total number of networks. \(\mathbf {S}^{p_{k}}_{i,j}\) represents the strength of the relationship between protein i and protein j in the kth PPN. Similarly, let S^{h} be the network of HPO terms which is generated from an ontology structure and coannotations, and \(\mathbf {S}^{h}_{i,j}\) is the similarity value between term i and term j. Our goal is to estimate \(\hat {\mathbf {Y}}\) given Y, \(\phantom {\dot {i}\!}\mathbf {S}^{p_{k}}\) and S^{h}.
Our proposed method
Preprocessing: generating a network from HPO
The network of HPO terms, or NHPO, is derived by measuring the similarity between two HPO terms in a hierarchy. We adopt the measure proposed in [25]. Having been extensively used in natural language processing, this metric defines the semantic similarity between two labeled nodes by counting the cooccurrence frequency in a corpus.
Specifically for HPO, the semantic similarity between two terms s and t is defined as:
where I(s)= log(p(s)) and \(p(s) = \frac {\text {count}(s)}{N_{p}}\). Here, count(s) denotes the number of proteins annotated by term s and mca(s,t) is given as follows:
where A(s,t) represents the set of all common ancestors of s and t.
The weight of the edge between nodes s and t in NHPO is exactly the similarity score. The larger the number of annotated proteins shared by s and t, the higher their similarity score is. It is more likely to happen when the common ancestor of s and t is located closely. This means that S^{h} considers both the coannotations of two HPO terms and their distance in a hierarchical structure.
Nonnegative matrix factorization
The aim of Nonnegative Matrix Factorization (NMF) is to find two lowrank matrices with all nonnegative elements by approximating the original input matrix. In fact, the latent factors that underlie the interactions are captured. Mathematically, the input matrix \({\mathbf {Y} \in \mathbb {R}^{N_{p} \times N_{h}}_{+}}\) is decomposed into two rankK matrices, \(\mathbf {U} \in \mathbb {R}^{N_{p} \times K }_{+}\) and \(\mathbf {V} \in \mathbb {R}^{N_{h} \times K}_{+}\). Then, finding U and V can be done by minimizing the reconstruction error which is defined as:
Generally, the ℓ2 (Tikhonov) regularization is imposed to Eq. (7) so as to alleviate overfitting of U and V.
Since there are unknown (missing) entries in Y, we encode the missingness with a masking matrix \(\phantom {\dot {i}\!}\mathbf {W} \in \{0, 1\}^{N_{p} \times N_{h}}\). If the annotation between protein i and HPO term j is missing, we set W_{ij}=0. Otherwise, we set W_{ij}=1, meaning that the element Y_{ij} is observed. Accordingly, W is also plugged as an extra input into our model. Together with the ℓ2norm regularization terms, the objective function is refined as follows:
where λ is a regularization coefficient.
The unobserved proteinHPO associations are completed by multiplying two factor matrices, or concretely, \(\hat {\mathbf {Y}} = \mathbf {U} \mathbf {V}^{T}\).
Network regularization
Once we obtain the similarity matrix of HPO, S^{h}, we can regularize V with the help of it. The basic idea is to impose smoothness constraints on the phenotypeside factors; that is
where V_{i} is the ith row vector of V, D^{h} is a diagonal matrix whose diagonals are the node degrees, and L^{h}=D^{h}−S^{h} is the graph Laplacian of S^{h}. Actually, the term is exactly the vanilla graph regularizer.
For proteins, multiple PPNs are derived from diverse data sources with heterogeneous properties. In this way, for a collective of PPNs \(\phantom {\dot {i}\!}\mathbf {S}^{p_{k}} (k = 1, \cdots, t)\), their regularizer is imposed as
where \(\mathbf {L}^{p_{k}} = \mathbf {D}^{p_{k}}  \mathbf {S}^{p_{k}} \) is the graph Laplacian of \(\phantom {\dot {i}\!}S^{p_{k}}\), and \(\phantom {\dot {i}\!}\mathbf {D}^{p_{k}}\) is the degree matrix.
Minimization of graphbased regularization terms will lead to the learned data representations (U and V) that respect the intrinsic geometrical structure of original data spaces (\(\phantom {\dot {i}\!}\mathbf {S}^{p_{k}}\) and S^{h}). Note that such standard graph regularization has already been used in a variety of applications [26].
Model formulation
By combining (8), (9) and (10), our model is formulated as follows:
where α and β are regularization coefficients to strike a balance between the reconstruction error and graph smoothness.
Model optimization
Notice that the objective function defined in Eq. (11) is biconvex with respect to U and V. A very regular but effective procedure for fitting is Alternating Least Square (ALS), which alternately optimizes one of the variables by fixing the others as constants until convergence.
We first hold U fixed and derive the updating rule of V. The objective function of V can be written as:
Accordingly, the derivative of J(V) with respect to V is
Taking the KarushKuhnTucker (KKT) complementary condition, we obtain
Now let us rewrite L^{h}=L^{h+}−L^{h−}, where we have L^{h+}=(L^{h}+L^{h})/2 and L^{h−}=(L^{h}−L^{h})/2. The multiplicative update rule of V is then:
Note that the problem given by (11) is symmetric in terms of U and V. Therefore, the derivation of the updating rule of U is simply the reverse of the above case. Precisely, we have
Training algorithm
We describe the overall framework of HPOAnnotator in Fig. 2. The procedure of our optimization process is presented in Algorithm 1. The optimization was implemented based on the MATLAB code provided by [26].
Results
Data
HPO annotations
Two HPO annotation datasets released by June 2017 and December 2017 were downloaded from the official HPO website (https://hpo.jax.org/). For the sake of brevity, we call them Data201706 and Data201712 in the following, respectively. The truepathrule is applied here to propagate annotations, and only HPO terms with at least one related protein remains. Table 1 lists the statistics of the two datasets.
According to the number of proteins annotated, we separated the HPO terms into five groups: 1 to 10, 11 to 30, 31 to 100, 101 to 300, and more than 300. Figure 3 shows the percentage of HPO terms and corresponding annotations over five groups in Data201706.
NHPO (Network of HPO)
We downloaded the hierarchical structure of HPO from their official website.
PPN (ProteinProtein Network)
Four types of PPNs were used in our experiments; that is, STRING [27] (https://stringdb.org/), GeneMANIA [28] (http://genemania.org/data/), BioGRID [29] (https://downloads.thebiogrid.org/BioGRID), and Reactome [30] (https://reactome.org/downloaddata). Table 2 reports the statistics of these four networks. Note that STRING is the most famous PPI network, which was found very useful for predicting HPO annotations in [9]. It combines diverse data sources, including coexpression, cooccurrence, fusion, neighborhood, genetic interactions, and physical interactions, by assigning a confidence score to a certain pair of proteins for indicating its reliability.
A preliminary test on pairs of two HPO terms in NHPO: the correlation between the number of shared proteins and the average similarity
First, we grouped all pairs of two HPO terms (from NHPO), according to the number of proteins, say M, shared by the two HPO terms. For each group, we then computed the average similarity score (S^{h}) by NHPO over those sharing M proteins. Finally, we plotted each group over the twodimensional space of M× the average similarity score. Figure 4 shows the result. The similarity score is equal to the edge weight of NHPO. This means that this test would be evaluated on the consistency of the similarity with the number of shared proteins from each HPO term pair. There found some correlations between these two, which would be a positive support for using NHPO for HPO annotations.
A preliminary test on pairs of proteinprotein edges in a PPN: correlations between the average similarity by a PPN and #shared HPO
Considering the extensiveness, we chosen STRING as the research object. At first step, we grouped all pairs of two proteins, according to the number of their shared HPO terms, denoted as K. For each group, we then computed the average of similarity score (S^{p}) of STRING PPN over those sharing the same number of HPO terms. Finally, we plotted each group over the twodimensional space of the average score (similarity) ×K. Figure 5 shows the plotted results. The line in this figure shows that the polynomial trend line is fitted to the distributed points of the twodimensional space. It shows a slightly positive correlation between the number of shared HPO terms and the average similarity score by a PPN. This observation validates the idea that the edges in a PPN may imply that proteins connected by the edges share the same HPO.
Evaluation criteria
The performance is evaluated from three aspects.
Annotationcentric measure Each annotation (or a proteinHPO term pair) is viewed as one instance. The models are evaluated using Area Under the receiver operator characteristics Curve (AUC) [31]. Considering the sparseness of proteinHPO association matrix, we measure the Area Under the PrecisionRecall curve (AUPR) as well.
Proteincentric measure AUCs (AUPRs) are calculated for each protein based on the corresponding predictive scores by all available HPO terms. Then the computed AUCs (AUPRs) are averaged over all proteins, resulting in microAUC (microAUPR).
HPO termcentric measure We think that the termcentric measure is important. Typical scientists or biologists focus first on a certain HPO term and are interested in obtaining genes/proteins, which can be annotated by the focused HPO term. The HPO termcentric measure can be computed in a total reverse manner of the proteincentric measure, with the following two steps: 1) AUCs (AUPRs) are first computed for each HPO term; and 2) The computed AUCs (AUPRs) are averaged over all HPO terms, which result in macroAUC (macroAUPR). In addition, we average the computed AUCs (AUPRs) over HPO terms at only leaves of the HPO hierarchical structure. We call the obtained AUC (AUPR) leafAUC (leafAUPR).
We further calculate the macroAUCs (macroAUPRs) for each of the five groups, which are generated by focusing on the number of annotations per HPO term (see Fig. 3). In total, (from annotation, protein, and HPO termcentric measures) we have the eight criteria to validate the performance.
Experimental procedures
Parameter settings
Our approach is compared with three networkbased methods: BiRW [20], DLP [23] and OGL [24] as described in related work. Besides, we take Logistic Regression (LR) as a featurebased baseline. Note that LR classifiers are trained on each single HPO term independently, and the features are built by concatenating association scores in PPNs together.
The parameter of BiRW is selected from {0.1,0.2,⋯,0.9}. Regularization coefficients (i.e., hyperparameters) of DLP and OGL, β and γ are selected from {10^{−6},10^{−5},⋯,10^{6}}. Note that the ranges of these parameters are specified by following [23]. Our model has four parameters: K, α, β and λ, which are determined by internal fivefold crossvalidation, where the training data is further randomly divided into five folds (one for validation and the rest for training). The search ranges are as follows: {100,200} for K, {2^{−3},2^{−2},⋯,2^{2},2^{3}} for λ, {2^{−7},2^{−6},⋯,2^{6},2^{7}} for α and β.
There are several variants of our algorithm by changing the settings of hyperparameters α and β. We also evaluate each of them as comparison methods. The details are as follows.
 1.
NMF: α=0 and β=0
Now the model is reduced to standard NMF, and the objective function is exactly the same as Eq. (8).
 2.
NMFPPN: α≠0 and β=0
Under this setting, there is no regularization term of NHPO, but PPN has. Thus, we term this model as NMFPPN.
 3.
NMFNHPO: α=0 and β≠0
This setting is in contrast to NMFPPN. That is, the regularization term of NHPO is kept, while that of PPN is not.
For the case of α≠0 and β≠0, there are two another variants depending on whether or not multiple PPNs are utilized.
 1.
AiPA: only one PPN is utilized
It is proposed in our previous study [17], which can be regarded as a special case of HPOAnnotator because only single PPN of STRING is exploited.
 2.
HPOAnnotator: multiple PPNs are utilized
It is our final model presented in this paper. All four PPNs are used, including STRING, GeneMANIA, BioGRID, and Reactome as described before.
Two evaluation settings
Under two different settings, we validate the performance of the compared methods from two viewpoints:
 1.
Crossvalidation over Data201706
We conduct 5 ×5fold crossvalidation over all annotations on Data201706. That is, we repeat the following procedure five times: all known annotations are divided randomly into five equal folds. The four folds are for training, while the remaining one is for test. After selecting the test annotation between protein p and HPO term h, all annotations between p and the descendants of term h in the hierarchical structure of HPO are removed from the training data, in order to avoid any overlaps between training data and test data. It means that we predict the annotation of protein p out of all unknown HPO terms, which is a fair and strict evaluation.
 2.
Independent test by using Data201712
HPO annotations are incomplete, due to various reasons, such as slow curation. The way of annotations might be changed over time. So we conduct additional several experiments other than regular crossvalidation by using data obtained in different time periods. That is, the training data is obtained before June 2017. All annotations in Data201706 are used for training, where an internal fivefold crossvalidation is done for setting up parameter values. After training, annotations obtained from June to December 2017 are then used for testing.
Experimental results
Predictive performance in crossvalidation on Data201706
Table 3 reports the scores of the eight criteria obtained by averaging over 5 ×5 crossvalidation (25 runs in total) on Data201706. In this experiment, we compare the nine methods in total. In particular, the four are existing methods (LR, BiRW, OGL and DLP), and another five are variants of our model (NMF, NMFPPN, NMFNHPO, AiPA and HPOAnnotator). Note that STRING is the only PPN utilized in NMFPPN. From the table, it clearly shows that our five methods perform better than the four existing methods. For example, our four methods achieve around 0.5 to 0.56 in AUPR, while all the scores by the existing methods are less than 0.1. In fact, our five methods perform better than the existing methods with respect to all of the eight metrics. Thus, their performance differences are very clear. We can conclude that a lowrank approximation is useful for the HPO annotation problem. Furthermore, HPOAnnotator always outperforms other variants in eight conditions among our five methods. This indicates that network information is well incorporated into our formulation.
Table 4 lists the AUC scores obtained for five groups divided by the number of annotations. Again, the results reported in these tables demonstrate the same conclusion as that in Table 3. That is, HPOAnnotator outperforms all other methods in all of the cases. A similar trend is also shown in Table 5. In summary, our approach is capable of achieving the best performance for HPO annotations in terms of crossvalidation.
A noteworthy point is that our method works well for the HPO terms with a very small number of annotations, i.e., only one to ten annotations per HPO term. In fact, this situation is usually hard for a lowrank approximation. As HPOAnnotator has achieved the best performance, this implies that a lowrank approximation is useful for all types of groups including HPO terms with a very small number of annotations for HPO annotations.
The effectiveness of individual PPNs in crossvalidation on Data201706
By using NMFPPN, we perform a set of experiments in order to identify the most effective PPN in terms of HPO predictions. To this end, we perform a series of experiments on NMFPPN by using a single PPN as its input at a time. NMFPPN with the four PPNs performs best as reported in Table 6. As shown in Table 6, we can conclude that STRING is the most useful PPN for predicting HPO annotations. By the way, Our model can take advantage of different PPNs to achieve the best performance.
Computation times in crossvalidation on Data201706
The computation (training) times of the eight methods compared in the crossvalidation are recorded, where the times are averaged over the total 25 runs (5 ×5 folds). The computation times on the same machine with the same settings are reported in Table 7. From the table, our four models run faster than the compared ones. In fact, they are more than eight times faster than OGL and DLP. The training data is updated periodically, thus the model must be trained by the updated data often. As such, this advantage of our models would make a difference. In addition, OGL and DLP need much more memory spaces than the compared methods.
Predictive performance in the independent test on Data201712
Table 8 reports AUC obtained by the experiments conducted on independent data for the eight competing methods. Among the three existing methods, DLP achieves the best performance, with AUC of 0.8298. NMF outperforms DLP with AUC of 0.8527, while two variants of NMF with one network regularizer further achieves better performance with AUC of around 0.89. AiPA achieves 0.9187 of AUC with STRING PPN and NHPO. Most importantly, HPOAnnotator archives the best performance, with the AUC of more than 0.92.
As Table 9 reports, seven out of the 30 highest ranked predicted annotations are validated to be true according to Data201712 which is released later. For example, protein Q02388, encoded by gene COL7A1, is actually annotated by HPO term HP:0001072 (Thickened skin). But we fail to find it in the data released by December 2017. Another example is protein Q9UBX5. According to Data201706, it has no relationship with HPO term HP:0012638 (Abnormality of nervous system physiology). But this record occurs in the later release of the data.
As the highestranked new annotation found by our model, HP:0001072 is known to also annotate another ten proteins, O43897, P07585, P08123, P08253, P12111, P20849, P20908, P25067, P53420, and Q13751, based on Data201706. We find that their similarity scores with Q02388 in STRING are more than 0.9. It indicates that their interactions between Q02388 and those ten proteins in PPNs imply a high possibility of annotating Q02388 by HP:0001072. In summary, the number of these examples have demonstrated both the effectiveness and necessity of introducing PPI networks for unknown HPO annotations prediction.
Validating false positives
As mentioned before, seven of the top 30 correct predictions from our model have already been found in the December 2017 release version of HPO annotations. Due to the fact that a curation process on HPO annotations is normally slow, we believe that there may be more false positives among our top ranked predictions. In order to validate our assumption, we first select the rest of the top 10 predictions that have not been found in the December 2017 HPO data. Using a protein name (or its coding gene name) and an HPO term name as a query for online search engines, we then check the relevant literature and diseases for each false prediction. Finally, we manually extract the information from the retrieved papers containing supporting evidence that suggest a particular false positive to be correct in fact. Using this manual process, we find evidence for another two predictions. Table 10 lists the PubMed IDs of the relevant literature, the relevant diseases names, and the detailed evidence for each pair of the found gene/proteinHPO term. The results strongly indicate that the performance of HPOAnnotator is underestimated, which is caused by the incompleteness of the current gold standard.
A typical example of demonstrating the performance of HPOAnnotator
To further demonstrate the performance of our proposed method for predicting HPO annotations, we here present the different predictions made by the four methods for a typical example, protein P23434. As listed in the last row of Table 11, this protein has 10 annotations. It is interesting to note that the number of correctly predicted HPO terms gradually increases from the first row to the fourth row. Again, this indicates that network information is effective for improving the performance of predicting HPO annotations.
Performance comparisons focusing on Organ abnormality
Most of the existing models are evaluated on separate subontologies. However, considering only part of the ontology may lose entire network information. Such information can connect proteins or HPO terms that are even beyond the boundaries of two or more subontologies in the network space. As such, we do not conduct the experiments on separate subontologies. Instead, we focus on the major subontology, Organ abnormality (the part under HP:0000118), with 6370 HPO terms, 3446 proteins and 269420 annotations in total according to Data201706. A 5 ×5fold crossvalidation has been conducted by following the same splitting strategy as before. Table 12 reports the scores of the eight evaluation criteria obtained by all compared methods. The results clearly show that the performance differences among the seven cases are subtle. For example, HPOAnnotator achieves the best performance with respect to all evaluation measure except for leafAUC. Comparing NMFOrgan and NMFPPNOrgan in terms of AUC, we can find that network information can help to improve the performance to a certain extent. Nonetheless, the use of both networks of PPN and NHPO might not be so effective in this scenario. Besides, it seems that the performance improvement is quite limited when we consider the whole ontology rather than individual subontologies. Tables 13 and 14 list the evaluation scores of MacroAUC and MacroAUPR over the five HPO term groups, respectively. The trend is similar to that presented in Table 12. Again, the results show no notable difference among the compared methods.
Discussion and Conclusion
In this paper, we have presented an approach that uses a lowrank approximation to solve the problem of the largescale prediction of HPO annotations for human proteins. In particular, network information is used to regulate such an approximation. The network information can be derived from both sides of annotations, i.e., PPI networks, and a hierarchical structure of an ontology. In essence, we provided a lowrank approximation solution to the optimization problem of matrix factorization with a networkderived regularization. Extensive experiments on the current HPO database have been conducted to validate the effectiveness of our approach. Experimental results clearly demonstrated the good performance of the proposed method under various settings, including crossvalidation, independent test, analysis on the major subontology Organ abnormality, and detailed case studies. The results have validated the good effectiveness as a result of using network information and ontology hierarchical structure as regularization and a lowrank approximation for HPO predictions, even for predictions on HPO terms with a very small number of known annotations.
Overall, the four important findings can be concluded from the experimental results: 1) a lowrank approximation works quite well for a largescale HPO annotations prediction; or more generally, for multilabel classification, even for predicting labels with an extremely small number of labeled instances; 2) a hierarchical ontology structure is very useful as side information for improving the performance of a lowrank approximation; 3) PPI networks from different sources play an important role in predictions; and 4) multiplicative parameter update of a lowrank approximation (matrix factorization) is timeefficient, with around eight times faster than networkbased approaches that need the huge memory because of using the original annotation matrices directly.
Availability of data and materials
Not applicable.
Abbreviations
 AiPA:

AiProAnnotator
 ALS:

Alternating least square
 AUC:

Area under the receiver operator characteristics curve
 AUPR:

Area under the precisionrecall curve
 BiRW:

Birandom walk
 CAFA:

Critical assessment of functional annotation
 DAG:

Directed acyclic graph
 DLP:

Dual label propagation
 GO:

Gene ontology
 HPO:

Human phenotype ontology
 KKT:

Karushkuhntucker
 LR:

Logistic regression
 NHPO:

Network of HPO
 NMF:

Nonnegative matrix factorization
 OGL:

Ontologyguided group lasso
 PPI:

Proteinprotein interaction
 PPN:

Proteinprotein network
 SSVM:

Structural support vector machine
References
Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003; 33(3s):228.
Li MJ, Sham PC, Wang J. Genetic variant representation, annotation and prioritization in the postgwas era. Cell Res. 2012; 22(10):1505–8.
Lage K, Karlberg EO, Størling ZM, et al. A human phenomeinteractome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007; 25(3):309–316.
Freimer N, Sabatti C. The human phenome project. Nat Genet. 2003; 34(1):15–21.
Köhler S, Doelken SC, Mungall CJ, et al. The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2013; 42(D1):966–74.
Hamosh A, Scott AF, Amberger JS, et al. Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005; 33(suppl_1):514–7.
Aymé S, Schmidtke J. Networking for rare diseases: a necessity for europe. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz. 2007; 50(12):1477–83.
Bragin E, Chatzimichali EA, Wright CF, et al. Decipher: database for the interpretation of phenotypelinked plausibly pathogenic sequence and copynumber variation. Nucleic Acids Res. 2013; 42(D1):993–1000.
Kahanda I, Funk C, Verspoor K, BenHur A. Phenostruct: Prediction of human phenotype ontology terms using heterogeneous data sources. F1000Res. 2015; 4:259.
You R, Zhang Z, Xiong Y, et al. Golabeler: Improving sequencebased largescale protein function prediction by learning to rank. Bioinformatics. 2018; 34(14):2465–73.
Xie M, Hwang T, Kuang R. Reconstructing disease phenomegenome association by birandom walk. Bioinformatics. 2012; 1(02):1–8.
Wang P, Lai W, Li MJ, et al. Inference of genephenotype associations via proteinprotein interaction and orthology. PloS one. 2013; 8(10):77478.
Gao Y, Church G. Improving molecular cancer class discovery through sparse nonnegative matrix factorization. Bioinformatics. 2005; 21(21):3970–5.
Kim H, Park H. Sparse nonnegative matrix factorizations via alternating nonnegativityconstrained least squares for microarray data analysis. Bioinformatics. 2007; 23(12):1495–502.
Wang JJ, Wang X, Gao X. Nonnegative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinformatics. 2013; 14(1):107.
Hofree M, Shen JP, Carter H, Gross A, Ideker T. Networkbased stratification of tumor mutations. Nat Methods. 2013; 10(11):1108–15.
Gao J, Yao S, Mamitsuka H, Zhu S. Aiproannotator: Lowrank approximation with network side information for highperformance, largescale human protein abnormality annotator. In: IEEE International Conference on Bioinformatics and Biomedicine, BIBM. Madrid: IEEE: 2018. p. 13–20.
Schietgat L, Vens C, Struyf J, et al. Predicting gene function using hierarchical multilabel decision tree ensembles. BMC Bioinformatics. 2010; 11(1):2.
Xie M, Hwang T, Kuang R. Prioritizing disease genes by birandom walk. In: Advances in Knowledge Discovery and Data Mining  16th PacificAsia Conference, PAKDD. Kuala Lumpur: Springer: 2012. p. 292–303.
Xie M, Xu Y, Zhang Y, Hwang T, Kuang R. Networkbased phenomegenome association prediction by birandom walk. PloS One. 2015; 10(5):0125138.
Hwang T, Kuang R. A heterogeneous label propagation algorithm for disease gene discovery. In: Proceedings of the SIAM International Conference on Data Mining, SDM. Columbus: SIAM: 2010. p. 583–94.
Mehan MR, NunezIglesias J, Dai C, Waterman MS, Zhou XJ. An integrative modular approach to systematically predict genephenotype associations. BMC Bioinformatics. 2010; 11(1):62.
Petegrosso R, Park S, Hwang TH, Kuang R. Transfer learning across ontologies for phenomegenome association prediction. Bioinformatics. 2016; 33(4):529–36.
K S, X EP. Treeguided group lasso for multitask regression with structured sparsity. In: Proceedings of the 27th International Conference on Machine Learning (ICML10). Haifa: Omnipress: 2010. p. 543–50.
Lin D. An informationtheoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning (ICML) 1998. Madison: Morgan Kaufmann: 1998. p. 296–304.
Cai D, He X, Han J, Huang TS. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2011; 33(8):1548–60.
Szklarczyk D, Franceschini A, Kuhn M, et al. The string database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2010; 39(suppl_1):561–8.
WardeFarley D, Donaldson SL, Comes O, et al. The genemania prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010; 38(suppl_2):214–20.
Stark C, Breitkreutz B, Reguly T, et al. Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(suppl_1):535–9.
Fabregat A, Jupe S, Matthews L, et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2017; 46(D1):649–55.
Wu X, Zhou Z. A unified view of multilabel performance measures. In: Proceedings of the 34th International Conference on Machine Learning, ICML. Sydney: PMLR: 2017. p. 3780–8.
Funding
Publication costs were funded by National Natural Science Foundation of China (No. 61872094 and No. 61572139). S. Z. is supported by Shanghai Municipal Science and Technology Major Project (No. 2017SHZDZX01). J. G., L. L. and S. Y. are supported by the 111 Project (NO. B18015), the key project of Shanghai Science & Technology (No. 16JC1420402), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and ZJLab. H. M. has been supported in part by JST ACCEL (grant number JPMJAC1503), MEXT Kakenhi (grant numbers 16H02868 and 19H04169), FiDiPro by Tekes (currently Business Finland) and AIPSE program by Academy of Finland. The funding body have no role in the design of the study and collection, analysis, and interpretation of data and writing the manuscript.
Author information
Authors and Affiliations
Contributions
JG and SZ jointly contributed to the design of the study. JG designed and implemented the ANMF method, performed the experiments, and drafted the manuscript. LL, SY, XH, HM and SZ helped the result analysis, and contributed to improving the writing of manuscripts. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Gao, J., Liu, L., Yao, S. et al. HPOAnnotator: improving largescale prediction of HPO annotations by lowrank approximation with HPO semantic similarities and multiple PPI networks. BMC Med Genomics 12 (Suppl 10), 187 (2019). https://doi.org/10.1186/s1292001906251
Published:
DOI: https://doi.org/10.1186/s1292001906251