Prediction of HIV-1 virus-host protein interactions using virus and host sequence motifs
© Evans et al. 2009
Received: 30 January 2009
Accepted: 18 May 2009
Published: 18 May 2009
Host protein-protein interaction networks are altered by invading virus proteins, which create new interactions, and modify or destroy others. The resulting network topology favors excessive amounts of virus production in a stressed host cell network. Short linear peptide motifs common to both virus and host provide the basis for host network modification.
We focused our host-pathogen study on the binding and competing interactions of HIV-1 and human proteins. We showed that peptide motifs conserved across 70% of HIV-1 subtype B and C samples occurred in similar positions on HIV-1 proteins, and we documented protein domains that interact with these conserved motifs. We predicted which human proteins may be targeted by HIV-1 by taking pairs of human proteins that may interact via a motif conserved in HIV-1 and the corresponding interacting protein domain.
Our predictions were enriched with host proteins known to interact with HIV-1 proteins ENV, NEF, and TAT (p-value < 4.26E-21). Cellular pathways statistically enriched for our predictions include the T cell receptor signaling, natural killer cell mediated cytotoxicity, cell cycle, and apoptosis pathways. Gene Ontology molecular function level 5 categories enriched with both predicted and confirmed HIV-1 targeted proteins included categories associated with phosphorylation events and adenyl ribonucleotide binding.
A list of host proteins highly enriched with those targeted by HIV-1 proteins can be obtained by searching for host protein motifs along virus protein sequences. The resulting set of host proteins predicted to be targeted by virus proteins will become more accurate with better annotations of motifs and domains. Nevertheless, our study validates the role of linear binding motifs shared by virus and host proteins as an important part of the crosstalk between virus and host.
This study focused on the computational identification of host proteins targeted by an invading virus, using HIV-1 infection as a case study because extensive study at the molecular level has yielded nearly fifteen hundred experimentally determined HIV-1, human protein interactions, which are catalogued in the HIV-1, Human Protein Interaction Database [1, 2]. Virus and cellular parasite proteins alter host interaction networks by competing with host proteins for binding in the host protein-protein interaction (PPI) network [3–5]. Knowledge of which host proteins interact with virus proteins is important for antiviral drug discovery and treatment optimization using existing drugs. Experimental approaches for finding virus protein binding partners in the human proteome have proved challenging because nearly thirty thousand human proteins must be tested. Computational approaches have helped by reducing the number of host proteins to verify experimentally.
Previous host-pathogen interaction prediction methods focused largely on finding PPIs between human and cellular parasite proteins. One recent method found the probability that two protein domains interact given the human PPI network, and used this probability to find the likelihood that pathogen and human proteins interact given their domain profiles. Another method matched human and pathogen protein pairs to proteins known to form complexes, and then filtered these interaction candidates based on expression data from human and pathogen. Translating these methods to interactions between HIV-1 and human proteins has been difficult because HIV-1 proteins have few domains and their structures are hard to find by comparative modeling. For instance, to find structures for the N-terminal and C-terminal domains of HIV-1 VIF, two different protein structures were required for comparative modeling.
In our study, we focus on protein interactions mediated by short eukaryotic linear motifs (ELMs) on HIV-1 proteins and human protein counter domains (CDs) known to interact with these ELMs. We aim to obtain host protein sets enriched with known sets of virus targeted proteins based on ELM and CD associations. The potential functional roles of interactions mediated by ELMs and their CDs in viral infection have been addressed in a number of recent articles [12–14]. The HIV-1 literature contains at least ten examples of HIV-1, human PPIs that are directly associated with motif and domain presence. The motif/domain basis of such PPIs is not restricted to a single HIV-1 protein, but is widely distributed across the HIV-1 proteome, including HIV-1 NEF, ENV, TAT, REV, VIF, and VPU. This experimental evidence is the motivation for systematically investigating the association of motif/domain pairs with PPIs between virus and host proteins. Although Tastan et al. estimated a relatively weak link between binding motif/domain presence and the actual virus-host PPIs, their work was restricted to predicting direct binding between host and HIV-1 proteins. In this study, we set out to identify host proteins involved in direct interactions as well as those that compete with HIV-1 proteins for binding to their host targets. Moreover, the algorithm presented by Tastan et al. is based on supervised learning and training from known interactions between HIV-1 and human proteins. In their method, each potentially interacting protein pair is associated with a feature vector composed of parameters related to Gene Ontology (GO), global gene expression profiles, the human protein interactome, and protein domains and motifs. Ours is a hypothesis-based approach, and does not require a priori knowledge of virus-host interactions beyond what can be gathered from viral and host protein sequences.
As such, it is directly applicable to identifying host protein sets enriched with virus targeted host proteins for a wide scope of infectious diseases. The extremely low p-values we calculated for the overlap between our predictions and experimentally verified HIV-1, host protein interactions indicate the potential value of our approach for deducing a first draft of the molecular vocabulary employed in less studied host-pathogen protein interactions.
We downloaded the 2007 versions of multiple protein alignments for 9 (ENV, GAG, NEF, POL, REV, VIF, VPR, TAT and VPU) HIV-1 translated open reading frames from the HIV-1 Sequence Database (http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html) and removed all sequences except those labeled as subtypes B or C. We focused on subtype B because it is most common in the industrialized world, and chose subtype C because it is most common globally. We computationally cleaved the GAG alignment into CA, MA, NC, P1, P2, and P6 alignments, and cleaved the POL alignment into IN, PR, and RT alignments using [GenBank: NC_001802] as a reference. All proteins in the resulting 18 alignments were annotated with ELMs using the ELM resource, accessed December 2008, using default settings except selecting human for the species field. Any protein lacking an ELM was removed from the study, leaving at least 70 sequences in each multiple alignment [see Additional file 1]. We considered an ELM to be conserved on an HIV-1 protein if it was present on more than 70% of the sequences in the protein's multiple alignment. This cutoff was chosen for its stability: raising the conservation threshold by an additional 5% did not alter the number of conserved ELMs (data not shown). A total of 99 ELMs were found on at least one virus protein sequence. The conservation threshold removed 43 of these, leaving 56 total.
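The conservation rule described above can be illustrated with a minimal Python sketch. It assumes each aligned sequence has already been reduced to the set of ELM names annotated on it; the function name, the toy ELM labels, and the input layout are hypothetical, and the sketch ignores motif positions within the alignment.

```python
from collections import Counter

def conserved_elms(elms_per_sequence, threshold=0.70):
    """Return the ELMs present on more than `threshold` of the sequences.

    `elms_per_sequence` is a list with one collection of ELM names per
    aligned sequence (a hypothetical input layout, not the ELM resource's).
    """
    n = len(elms_per_sequence)
    if n == 0:
        return set()
    counts = Counter()
    for elms in elms_per_sequence:
        counts.update(set(elms))  # count each ELM at most once per sequence
    # Strictly "more than" the threshold, as stated in the text.
    return {elm for elm, c in counts.items() if c / n > threshold}

# Toy alignment of 10 sequences with made-up ELM names.
toy_alignment = (
    [{"ELM_A", "ELM_B", "ELM_C"}] * 7
    + [{"ELM_A", "ELM_B"}] * 1
    + [{"ELM_A"}] * 2
)
print(sorted(conserved_elms(toy_alignment)))
```

In this toy input, `ELM_C` sits on exactly 70% of the sequences and is therefore not counted as conserved, because the rule requires strictly more than 70%.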
The ELM resource lists CDs or proteins known to interact with ELMs. For each ELM conserved on a virus protein, we found the appropriate CDs and mapped them to PROSITE domains. When the ELM resource listed a set of interacting proteins instead of CDs, we assumed that all proteins had a common unknown CD, and annotated them with that. We constructed a list of CDs and interacting proteins for each HIV-1 conserved ELM [see Additional file 2].
We annotated PROSITE domains and ELMs on the 9446 human protein sequences in the Human Protein Reference Database (HPRD) PPI network, and mapped these sequences to Entrez GeneIDs. PROSITE domains were annotated with the PROSITE scan tool (release 20.31) using the default parameters. ELMs were determined by using the ELM resource, accessed August 2008, selecting the same settings used for the HIV-1 sequences. Any protein lacking a PROSITE domain, or not binding to a protein with a PROSITE domain (other than itself), was removed from the study, leaving 5954 proteins.
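One possible reading of this filter is sketched below in Python. It assumes the network is given as a list of undirected protein pairs and the annotations as a mapping from protein to its set of PROSITE domains; the function name and the toy identifiers are hypothetical, and the exact handling of edge cases in the original pipeline is not stated in the text.

```python
def filter_proteins(domains, edges):
    """Keep proteins that have a PROSITE domain and at least one non-self
    interaction partner that also has a PROSITE domain.

    `domains` maps protein -> set of domain names; `edges` is a list of
    (protein_a, protein_b) pairs. Both layouts are assumptions.
    """
    has_domain = {p for p, d in domains.items() if d}
    kept = set()
    for a, b in edges:
        if a == b:
            continue  # self-interactions do not count
        if a in has_domain and b in has_domain:
            kept.update((a, b))
    return kept

# Toy network with made-up protein and domain names.
toy_domains = {"A": {"PS1"}, "B": {"PS2"}, "C": set(), "D": {"PS3"}, "E": {"PS4"}}
toy_edges = [("A", "B"), ("A", "C"), ("D", "D"), ("E", "C")]
print(sorted(filter_proteins(toy_domains, toy_edges)))
```

Here `C` is dropped for lacking a domain, `D` for having only a self-interaction, and `E` because its only partner lacks a domain.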
The prediction of HHP, the set of human proteins that might interact with HIV-1 proteins, was based on interactions mediated by ELMs and CDs. We built HHP from the union of two sets of human proteins, H1 and H2. H1 was the set of human proteins predicted to directly interact with one or more HIV-1 proteins via a human CD and a virus ELM. H2 was the set of human proteins whose interactions with proteins in H1 were potentially disrupted by competition with an HIV-1 protein. Here an H1 protein has a CD that it might use to interact with an ELM present on both H2 and HIV-1 proteins. For example, in the competition between an HIV-1 and H2 protein for phosphorylation by an H1 kinase, the H1 protein has a kinase CD and the competing proteins have ELMs for phosphorylation sites.
The HHP prediction algorithm was straightforward. For each virus protein, we looked at all interactions documented in HPRD that could be explained by an interaction between a virus protein's conserved ELM and a CD known to interact with that ELM, and added the protein with the CD to H1 and the protein with the ELM to H2. Human proteins are involved in multiple interactions, so H1 and H2 are not mutually exclusive. HHP for each virus protein was the union of the protein's H1 and H2 sets, and contains all host proteins that either bind to or compete with the virus protein. HHP has 2348 proteins involved in 23330 predicted HIV-1, human interactions.
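The procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the input dictionaries, the function name `predict_hhp`, and the toy motif and domain names are all hypothetical.

```python
def predict_hhp(virus_elms, elm_to_cds, human_cds, human_elms, hprd_edges):
    """Build H1, H2, and HHP for one virus protein.

    virus_elms : set of ELMs conserved on the virus protein
    elm_to_cds : dict ELM -> set of counter domains (CDs) that bind it
    human_cds  : dict human protein -> set of CDs annotated on it
    human_elms : dict human protein -> set of ELMs annotated on it
    hprd_edges : list of (protein_a, protein_b) human interactions
    """
    h1, h2 = set(), set()
    for elm in virus_elms:
        cds = elm_to_cds.get(elm, set())
        for a, b in hprd_edges:
            # Try both orientations of the undirected edge.
            for cd_prot, elm_prot in ((a, b), (b, a)):
                carries_cd = bool(human_cds.get(cd_prot, set()) & cds)
                carries_elm = elm in human_elms.get(elm_prot, set())
                if carries_cd and carries_elm:
                    h1.add(cd_prot)   # may bind the virus ELM directly
                    h2.add(elm_prot)  # may compete with the virus protein
    return h1, h2, h1 | h2

# Toy example: a kinase CD that recognizes a phosphorylation-site ELM.
h1, h2, hhp = predict_hhp(
    virus_elms={"ELM_PHOS"},
    elm_to_cds={"ELM_PHOS": {"KINASE_CD"}},
    human_cds={"P1": {"KINASE_CD"}, "P3": {"OTHER_CD"}},
    human_elms={"P2": {"ELM_PHOS"}, "P4": {"ELM_PHOS"}},
    hprd_edges=[("P1", "P2"), ("P3", "P4")],
)
print(sorted(h1), sorted(h2), sorted(hhp))
```

Only the `P1`/`P2` edge can be explained by the virus ELM and its counter domain, so `P1` enters H1 and `P2` enters H2; the `P3`/`P4` edge is ignored because `P3` lacks the matching CD.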
The HIV-1, Human Protein Interaction Database (accessed August 2008) has 3,950 interactions between 19 HIV-1 proteins and 1,439 human proteins. All interactions for ENV's 2 cleavage products, GP41 and GP120, were assigned to ENV. Interactions for GAG and POL products were shown separately as well as assigned to GAG or POL. We restricted the human proteins interacting with HIV-1 proteins to those belonging to the set of 5954 proteins that have PROSITE domains and appear in the HPRD network with at least one non-self edge. The HIV-1, human interactions are spread over 68 interaction types, such as "interacts with", "phosphorylates", and "upregulates". We considered all interaction types, both direct and indirect. For each HIV-1 protein, we removed an interaction type if it described fewer than six interactions. This resulted in a set of 1,687 verified interactions between 15 HIV-1 proteins and 887 human proteins, which we called HHE, and used to investigate the usefulness of HHP. We constructed a subset of HHE, DHHE, which had interaction types deemed to be direct by Tastan et al. DHHE was used to evaluate H1.
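The per-protein filtering of rare interaction types can be sketched as follows. The tuple layout `(hiv_protein, human_protein, interaction_type)` and the function name are assumptions made for illustration; the database's actual export format is not described in the text.

```python
from collections import Counter

def filter_rare_types(interactions, min_count=6):
    """Drop, for each HIV-1 protein, interaction types that describe
    fewer than `min_count` interactions.

    `interactions` is a list of (hiv_protein, human_protein, type) tuples.
    """
    counts = Counter((hiv, itype) for hiv, _human, itype in interactions)
    return [
        (hiv, human, itype)
        for hiv, human, itype in interactions
        if counts[(hiv, itype)] >= min_count
    ]

# Toy records with made-up human protein identifiers.
toy_interactions = (
    [("TAT", f"H{i}", "interacts with") for i in range(6)]
    + [("TAT", "H6", "upregulates"), ("TAT", "H7", "upregulates")]
    + [("NEF", f"H{i}", "interacts with") for i in range(3)]
)
kept = filter_rare_types(toy_interactions)
print(len(kept), {hiv for hiv, _, _ in kept})
```

Counts are taken per HIV-1 protein, so "interacts with" survives for TAT (six records) but not for NEF (three records) in this toy input.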
The statistics in this research focused on the comparison of our predicted set HHP and the experimental dataset HHE based on the overlap between the two sets, GO molecular function enrichment, and KEGG pathway enrichment. P-values for the overlap between HHP and HHE and their various subsets were calculated using the hypergeometric test in the R Project for Statistical Computing. P-values for GO and KEGG enrichment for a given protein set compared to a background set of 5954 proteins were found using Bonferroni corrected p-values from DAVID.
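For readers who want to reproduce this kind of overlap statistic outside R, the upper-tail hypergeometric probability can be written in plain Python. The sketch below is a generic implementation of the test, not the authors' script; the function names and toy numbers are illustrative, and the Bonferroni helper simply multiplies a p-value by the number of tests, which is the standard form of that correction.

```python
from math import comb

def hypergeom_overlap_pvalue(N, K, n, k):
    """P(overlap >= k) when n proteins are drawn without replacement from
    a background of N proteins, K of which are experimentally verified.

    N: background size, K: verified set size,
    n: predicted set size, k: observed overlap.
    """
    upper = min(K, n)
    # Exact integer arithmetic for the tail, converted to float at the end.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def bonferroni(p, m):
    """Bonferroni-corrected p-value for m tests, capped at 1."""
    return min(1.0, p * m)

# Toy numbers: background of 10, 5 verified, 5 predicted, all 5 overlap.
p = hypergeom_overlap_pvalue(N=10, K=5, n=5, k=5)
print(p)                 # 1/252
print(bonferroni(p, 20))
```

Using exact binomial coefficients keeps the sketch dependency-free; for large backgrounds, a statistics library's hypergeometric survival function would be the more usual choice.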
We found KEGG pathways enriched with proteins from each virus protein's HHP set (p-value < 0.01, see Methods). Shown in the lower half of Figure 4 are bar graphs demonstrating the intersection of HHP in KEGG (or HHP found in enriched pathways) and HHE for ENV, NEF, and TAT. Our model predicted 584, 519, and 410 proteins will interact with ENV, NEF, and TAT, respectively, and matched 127 of 409, 54 of 155, and 112 of 509 experimentally verified interactions. The p-values indicated a statistically significant match between predicted and experimental sets for ENV, NEF, and TAT when using both direct predictions (H1), and direct predictions in addition to competing predictions (HHP). However, the p-values in Figure 4 showed that the overlap between predicted and experimental data was weaker for H1 and DHHE than for HHP and HHE.
KEGG Pathway Enrichment
B cell receptor signaling pathway
Epithelial cell signaling in H. pylori infection
Fc epsilon RI signaling pathway
Leukocyte transendothelial migration
NK cell mediated cytotoxicity
Non-small cell lung cancer
Pathogenic E. coli infection – EHEC
Phosphatidylinositol signaling system
Regulation of actin cytoskeleton
Small cell lung cancer
T cell receptor signaling pathway
Toll-like receptor signaling pathway
Type II diabetes mellitus
Gene Ontology Enrichment
adenyl ribonucleotide binding
inositol or phosphatidylinositol kinase activity
interleukin receptor activity
lipid kinase activity
MAP kinase activity
MAP kinase kinase kinase activity
phosphoric monoester hydrolase activity
protein kinase activity
protein kinase binding
The KEGG pathways statistically enriched for ENV, NEF, and TAT interacting proteins (experimental as well as computational) included immune system pathways such as T cell and B cell receptor signaling pathways, apoptosis, focal adhesion, and toll-like receptor signaling pathways (Table 1). Gene expression data before and after HIV-1 infection of macrophages also showed apoptosis and MAPK signaling pathways as statistically enriched, as predicted here. Microarray results did not show cell cycle and toll-like receptor pathways as highly activated in HIV-1 activated macrophages, although the toll-like receptor pathway was highly enriched with known HIV-1 targeted proteins (Table 1). Also statistically enriched were disease pathways such as the colorectal cancer, leukemia, and lung cancer pathways, diseases that have been shown to have a high incidence in HIV-1 infected individuals. Other disease pathways predicted by our analysis included those previously associated with HIV-1 infection: H. pylori infection, E. coli infection, and type II diabetes. These observations indicated the promise of our method in predicting activated disease pathways based on viral sequence. Post-translational modification appeared to be an important element of HIV-1 cellular network hijacking. As shown in Table 2, protein kinase activity and protein kinase binding were highly statistically enriched both in HHP and HHE, suggesting the importance of altered phosphorylation events in the reorientation of the host cell PPI network towards virus survival and replication. The HIV-1 activated GO categories listed in Table 2 are associated with signal transduction processes in the KEGG pathways presented in Table 1.
The rapid sequencing of viral genomes with next generation sequencing technology makes it possible to link clinical parameters of viral infection to sequence motifs. The task of identifying host proteins targeted by a virus is worthwhile because such proteins may become drug targets to fight infection. Experimental studies for determining virus targeted proteins are expensive and highly challenging. Such efforts, although large-scale, have produced incomplete results for even well studied viruses like HIV-1 [6, 35, 36]. In this study, we used a systems approach to identify host protein subsets enriched with virus targeted proteins. Our method was based on the identification of host motifs on virus sequences. We used the a priori knowledge in the ELM resource to identify the counter domains associated with these motifs and information from the human interactome to focus on host protein interaction pairs with appropriate motif/domain links. KEGG pathways and the GO molecular functions were used to provide biological context to our findings.
The sets of host proteins we predicted as targeted by a given HIV-1 protein in KEGG pathways were highly statistically enriched with host proteins known to interact with the same HIV-1 protein (Figure 4). For example, the match between our predictions and the interactions for HIV-1 NEF in the HIV-1, Human Protein Interaction Database corresponded to a p-value of 4.26 E-21 in KEGG pathways enriched in our predicted set. After combining our predictions for all HIV-1 proteins, we had 607 proteins in HHP enriched KEGG pathways, and of these we matched 241 in the set of 877 experimentally verified proteins with a p-value of 3.11 E-58 (Figure 7). Our predictions were not nearly an exact match for experimental data, but our list was highly enriched with HIV-1 targeted host proteins. Given that HHP in KEGG pathways is about half as large as all HHP, and has a stronger overlap with HHE, experimentalists should begin verification with this set.
In addition to the binding/interaction research compiled in the HIV-1, Human Protein Interaction Database, recent experimental studies based on genome-wide siRNA screens have brought additional light to host-pathogen interactions that facilitate HIV-1 replication [6, 35, 36]. Three studies produced smaller lists of host proteins than the list in the HIV-1, Human Protein Interaction Database. The lower matrix in Figure 7 shows the five-way comparison of HIV-1 targeted protein lists: HHE, HHP, and the three screens. The table indicated the extent of discrepancy between lists, as well as the statistical significance of the matches between them. Our predictions matched HHE with the lowest p-value, and the genome-wide study lists generally matched each other better than the interaction studies. The list of 280 genes presented as host cellular factors required for HIV-1 replication by Brass et al. had 13 genes in common with the list of 295 genes deemed necessary by Konig et al. for regulation of early stage HIV-1 replication, and shared 10 genes with the 311 genes given in the Zhou study. When these proteins were projected into HPRD, the matches led to p-values of 7.35 E-4 and 4.46 E-5. Although the match was significant, there was still a discrepancy between the results. This mismatch may be attributed to the differences in the analysis and experimental methodologies used. Our predictions matched 56 of the 129 HPRD proteins presented by Konig et al. with a p-value of 0.15, 44 of the 91 HPRD proteins in the list by Brass et al. with a p-value of 0.03, and 54 of the 139 HPRD proteins given by Zhou et al. with a p-value of 0.52. These results indicated the challenges faced by experimental studies trying to uncover the grammar of HIV-1, host interactions.
Although our study produced host protein sets statistically enriched with proteins known to be targeted by HIV-1, mismatches between our predictions and experimental data cannot be ignored. It is possible that host-virus interactions employ a grammar that is much more complex than the short linear motif/counter domain interactions assumed in this study. The molecular vocabulary of PPIs is simply not well understood even for proteins belonging to the same species. However, one common mode of interaction is the binding of a linear binding motif on one protein to a domain on another protein. A central hypothesis in the discovery of the linear binding motifs mediating protein interactions has been that proteins with a common interacting partner, such as protein kinases, share a common feature in the form of a motif. Some of the linear binding motifs in the ELM resource have been shown to bind directly to sites at opposing counter domains listed in databases such as PROSITE and Pfam. However, for approximately 30% of the PPIs listed in the HPRD database, interacting proteins possess none of the already annotated domains. Thus, a model based on known motif/domain interactions would not be able to capture all of the known interactions in the host, let alone those between virus and host.
Another important cause of the discrepancy between our predictions and experimental data might have been the poor annotation of known motifs and counter domains used in this study. Recent studies of domain-motif interactions indicated that the annotation signatures are more specific than those presented in ELM and PROSITE. This was found to be true for the HIV-1 interacting PDZ domain, SH3 domain, and others. Emerging motif finding tools such as DILIMOT, SLIMFinder, and D-STAR will help researchers improve the specificity of the motifs that mediate host-virus interactions. Still, the list of host proteins we have provided [see Additional file 5] comprises a candidate set for genome-wide studies of the regulation of HIV-1 replication and infection.
We focused on HIV-1 infection in this study because we desired to assess the effectiveness of our computational approach by comparing our predictions with large-scale experimental data. Our results provided a rationale for applying our method to predict virus-human interactions for sequenced viruses. A systems approach to predicting host-pathogen interactions will at least be partially based on the sequence motifs of interacting genome/proteomes. The present study illustrated the importance of ELMs in the molecular cross talk between host and virus and opened the door for more extensive experimental and computational studies of host-virus interactions.
In this study, we described a bioinformatics model to investigate the crosstalk between the HIV-1 and human proteins. Our method used multiple sequence alignments of HIV-1 proteins, and three datasets related to the host: decoded sequences of the host proteins, a priori knowledge of experimentally observed protein-protein interactions within the host proteome, and interactions between short linear peptide motifs and protein domains. The output of the model was a list of host proteins that may interact with specific HIV-1 proteins using specific sites. This list can be used to draft a connectivity map between virus and host, and to determine a set of protein interaction pathways that are significantly enhanced by host proteins predicted to be targeted by HIV-1.
The model was based on the assumption that virus proteins interact with host proteins through a set of conserved linear sequence motifs present in the host proteome. The conserved spatial organization of these motifs on the rapidly evolving HIV-1 proteome supported the assertion that short linear motifs play critical roles in interactions with the host network. The model's predictions led to host protein sets that are enriched with known HIV-1 targeted proteins. This statistical enrichment was particularly high along cellular pathways modulated by HIV-1. The model's predictions were also consistent with experimental data showing phosphorylation events as key targets of HIV-1 when redirecting cell protein networks toward the goal of virus replication.
The methodology applied here for HIV-1, host protein interactions is applicable to any virus with multiple sequence alignments and any host with a known interactome. Therefore, our approach has potential use in the identification of host proteins targeted by emerging and/or understudied viruses. The resulting list may be useful for selecting optimal drug therapies and discovering new antiviral drugs. The systems approach presented here for predicting host-virus protein interactions will benefit from ongoing research on more specific annotations of the short linear motifs and domains involved in protein-protein interactions.
This study was supported by the National Institutes of Health (NIH) grant number # 232240 and by the National Science Foundation (NSF) grant # 235327. Additional support came from a Drexel University Calhoun Fellowship (WD) and from NIH training grant T32 HG000046 (PE). The authors would like to thank Drs. Roger Ptak and Fred Davis for their valuable input and comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.