HIP2: An online database of human plasma proteins from healthy individuals
© Saha et al; licensee BioMed Central Ltd. 2008
Received: 13 February 2008
Accepted: 25 April 2008
Published: 25 April 2008
With the introduction of increasingly powerful mass spectrometry (MS) techniques for clinical research, several recent large-scale MS proteomics studies have sought to characterize the entire human plasma proteome, with the general objective of identifying thousands of proteins leaked from tissues into the circulating blood. Understanding the basic constituents, diversity, and variability of the human plasma proteome is essential to the development of sensitive molecular diagnosis and treatment monitoring solutions for future biomedical applications. Biomedical researchers today, however, do not have an integrated online resource in which they can search for plasma proteins from healthy individuals collected across different mass spectrometry platforms, experimental protocols, and search software. The lack of such a resource for comparisons has made it difficult to interpret proteomics profile changes in patients' plasma and to design protein biomarker discovery experiments.
To aid future protein biomarker studies of disease and health from human plasma, we developed an online database, HIP2 (Healthy Human Individual's Integrated Plasma Proteome). The current version contains 12,787 protein entries linked to 86,831 peptide entries identified using different MS platforms.
This web-based database will be useful to biomedical researchers involved in biomarker discovery research. It has been developed as a comprehensive collection of healthy human plasma proteins, with protein data captured in a relational database schema built to contain mappings of supporting peptide evidence from several high-quality, high-throughput mass spectrometry (MS) experimental data sets. Users can search for plasma protein/peptide annotations, peptide/protein alignments, and experimental/sample conditions, with options for filter-based retrieval to achieve greater analytical power for discovery and validation.
A surge of interest in defining molecular biomarkers of health and disease from human plasma has emerged with the recent launch of the pilot Plasma Proteome Project (PPP) by the Human Proteome Organization (HUPO). The easy clinical access and processing of plasma samples, and the abundance of proteins as well as metabolites that may collectively define a person's health status, have made human plasma the top choice among bio-fluids for future clinical molecular diagnostic applications. However, the fluctuating nature of blood from different individuals, the huge dynamic range of protein concentrations (up to 10^12), and the protein detection limits of most MS platforms have made the plasma proteome elusive to define. Many proteomics researchers even believe that the current "plasma proteome" observed by a single shotgun MS experiment is analogous to a stochastic sampling of the human proteome, with low run-to-run consistency and inherent detection biases peculiar to each type of MS platform. Even for plasma from healthy individuals, and even with multi-dimensional separations and advanced bioinformatics search software, proteins identified in different shotgun MS/MS plasma proteomics experiments are often inconsistent with each other except for the most abundant proteins. To overcome the poor coverage and potential bias of each experimental measurement of the human plasma proteome, and to exploit the complementary nature of these measurements, it is necessary for biomedical researchers to collect and assess all reliable publicly available plasma protein data sets generated from different MS analytical and computational platforms for healthy individuals. A comprehensive integrated resource of the human plasma proteins for healthy individuals, currently missing in the field of clinical proteomics, would enable researchers to understand the basic constituents, diversity, and variability of the human plasma proteome.
Such a resource would provide considerable comparative power for interpreting proteomics profile changes in patients' plasma, and may supplement or compensate for limitations and biases associated with the set of controls for a given study. It would also improve the ability to find protein biomarkers that are known to occur in healthy human plasma in instances where a protein is differentially expressed in a patient sample relative to the quantities observed in the study control.
Although multiple projects to profile the human plasma proteome have been attempted, including PeptideAtlas, GPMO, HUPO PPP, and several recent publications [3–6], there remains an unaddressed need for a compiled, central repository structured to enable the stable retrieval, comparison, and querying of results. The existing sources vary widely in terms of data set size, available user interface, experimental protocol or sample details, choices of protein identifiers, linking to peptide evidence from MS experiments, MS search software used, and extent of data annotation. This information needs to be compiled further and assembled for end users before they can consider incorporating human plasma proteome data into their studies. The largest single source of data is an independently conducted experimental study that used an ion-mobility spectrometry (IMS) platform to chart the existence of 9,087 proteins based on 37,842 unique inferred peptide sequences, of which 2,928 proteins are high-confidence. There is, however, no web-based interface or other online resource making these data widely available. The other sources of data we examined have some form of online presentation (aside from publication) and range in size from fewer than one thousand to several thousand identified proteins. The Plasma Proteome Database (PPD) provides a web interface and is geared toward providing detailed functional annotations of 3,778 distinct proteins based on data extracted from the literature, yet the PPD provides information on neither the experimental protocol nor the associated MS-detected peptides used for protein identification. HUPO PPP information consists of 3,020 proteins and 47,950 peptides along with experimental protocol information and is available to the public online, but the data are accessible only as flat files.
The Institute for Systems Biology (ISB) has surveyed and analyzed a comparatively smaller set of data produced by 28 human plasma proteomics experiments, and has reported an approximate count of 960 proteins based on the 6,929 distinct observed peptides in their web-interfaced PeptideAtlas database. Another resource, providing evidence for human proteomics based mainly on data from HUPO PPP, is hosted by the Global Proteome Machine Organization (GPMO). An important feature of the GPMO database is that it provides annotated information to assist with the difficult process of validating peptide MS/MS spectra and patterns of protein coverage. GPMO also includes data from non-human organisms such as cats, guinea pigs, and rabbits, from unicellular eukaryotes such as yeasts, and from a number of prokaryotic organisms.
By gathering the protein and peptide data used to characterize the proteome of a healthy individual, we sought to develop a resource that presents a comparative baseline of plasma proteomics results against which proteomic data from patients with diseases such as cancer, neurodegenerative diseases, metabolic diseases, and other genetic disorders may be studied. In this effort, we define "healthy" or "normal" as human adults without major known life-threatening diseases, genetic diseases, HIV, or inflammation at the time of blood drawing (a slightly more stringent variation of the HUPO definition in Omenn et al.). On these premises, we developed an integrated database, HIP2 (Healthy Human Individual's Integrated Plasma Proteome), by compiling all of the existing experimental data obtained from healthy individual samples and creating a web-based interface to aid the many upcoming protein biomarker studies of health and disease. With HIP2, clinical samples of patient plasma can be compared against the reported set of healthy human plasma proteins to assess random or non-random aspects of overlap.
Construction and content
Summary of HIP2 database. The numbers of peptides and proteins represent unique entries that are the union of multiple subjects, possibly from different ethnic groups.
HUPO PPP (3,020 proteins and 47,950 peptides)
David Clemmer's group (9,087 proteins and 37,842 peptides)
PeptideAtlas (788 proteins and 6,039 peptides)
Leigh Anderson's group (1,175 proteins)
2DEMS & LC_MS/MS*
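The relational layout described for HIP2, in which proteins are linked to supporting peptide evidence and to the experiments that produced it, can be illustrated with a small sketch. The table names, column names, and toy rows below are assumptions made for illustration only; they are not the actual HIP2 schema or data. The sketch uses Python's built-in sqlite3 module and shows one plausible way to count distinct peptides, data sources, and MS platforms per protein, with a simple filter on the number of data sources.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the real HIP2 tables.
SCHEMA = """
CREATE TABLE protein (protein_id TEXT PRIMARY KEY);
CREATE TABLE peptide (
    peptide_id INTEGER PRIMARY KEY,
    sequence   TEXT UNIQUE NOT NULL
);
CREATE TABLE experiment (
    experiment_id   INTEGER PRIMARY KEY,
    data_source     TEXT NOT NULL,
    ms_platform     TEXT NOT NULL,
    search_software TEXT NOT NULL
);
-- One row per (protein, peptide, experiment) observation.
CREATE TABLE peptide_evidence (
    protein_id    TEXT    NOT NULL REFERENCES protein(protein_id),
    peptide_id    INTEGER NOT NULL REFERENCES peptide(peptide_id),
    experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
    PRIMARY KEY (protein_id, peptide_id, experiment_id)
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO protein VALUES (?)", [("P_A",), ("P_B",)])
    conn.executemany(
        "INSERT INTO peptide (peptide_id, sequence) VALUES (?, ?)",
        [(1, "AAAK"), (2, "CCCR"), (3, "DDDK")],
    )
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?, ?)",
        [(1, "source_1", "platform_X", "engine_1"),
         (2, "source_2", "platform_Y", "engine_2")],
    )
    conn.executemany(
        "INSERT INTO peptide_evidence VALUES (?, ?, ?)",
        [("P_A", 1, 1), ("P_A", 2, 1), ("P_A", 1, 2), ("P_B", 3, 2)],
    )
    conn.commit()
    return conn


def evidence_summary(conn, min_sources=1):
    """Return (protein, n_peptides, n_sources, n_platforms) per protein,
    keeping only proteins seen in at least `min_sources` data sources."""
    query = """
        SELECT e.protein_id,
               COUNT(DISTINCT e.peptide_id),
               COUNT(DISTINCT x.data_source),
               COUNT(DISTINCT x.ms_platform)
        FROM peptide_evidence AS e
        JOIN experiment AS x ON x.experiment_id = e.experiment_id
        GROUP BY e.protein_id
        HAVING COUNT(DISTINCT x.data_source) >= ?
        ORDER BY e.protein_id
    """
    return conn.execute(query, (min_sources,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in evidence_summary(demo, min_sources=1):
        print(row)
```

Keeping each observation as a separate row in a linking table is one common design choice here, since it lets the same peptide support several proteins and be counted per experiment without duplicating sequence data.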
Utility and discussion
The HIP2 database provides protein biologists and clinical biomedical researchers with a new gateway for exploring proteins from the human proteome with peptide-level evidence found in plasma. The basic questions that the HIP2 database helps biological researchers answer include "whether a protein may be found in human plasma" and "how likely or easy it is for a protein to be observed in healthy human plasma with mass spectrometry." The HIP2 database allows its users to assess the confidence of identifying plasma proteins in "normal" plasma MS proteomics experiments by examining such evidence as the number of matched peptide hits, data sources covered, MS experiments observed, types of MS platforms, and search software used. The protein and peptide sequence information also enables the user to examine peptide evidence that may be mapped to different gene splice variants and protein isoforms. Partial protein trypsin digestions can be evaluated based on the multiple peptide-to-protein alignment information presented in the database. The overall quality of digested peptides mapped to proteins can further be used by mass spectrometry data analysts to assess the performance of different MS proteomics platforms or samples. A typical example in the HIP2 database of a protein-peptide sequence comparison is how the peptide sequence 'KQSAGLVLWGAILFVAWNALLLLFFWTRPAPGRPPSVSALDGDPASLTR' is present in two proteins, IPI00000138 and IPI00179044, and aligns with the same sites of trypsin cleavage (as shown in Fig. 6). In the case of protein IPI00000138, evidence for the protein was found in three data sources, three MS experiments, three MS platforms, two MS search software packages, and six mapped peptide sequences, whereas in the case of protein IPI00179044, there are no experimentally proven peptides from MS results.
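The peptide-to-protein alignment idea above can be sketched in a few lines. This is a minimal illustration rather than the HIP2 implementation: the function names, protein identifiers, and toy sequences are hypothetical, and the cleavage check uses only the simple rule that trypsin cuts C-terminal to lysine (K) or arginine (R), ignoring any other sequence context.

```python
def find_peptide(protein_seq, peptide):
    """Return the 0-based start of every exact occurrence of `peptide`."""
    starts = []
    pos = protein_seq.find(peptide)
    while pos != -1:
        starts.append(pos)
        pos = protein_seq.find(peptide, pos + 1)
    return starts


def is_tryptic_boundary(protein_seq, start, length):
    """True if both peptide ends sit on a K/R cleavage site or a protein end.

    Simplifying assumption: cleavage is C-terminal to K or R only.
    """
    end = start + length
    n_term_ok = start == 0 or protein_seq[start - 1] in "KR"
    c_term_ok = end == len(protein_seq) or protein_seq[end - 1] in "KR"
    return n_term_ok and c_term_ok


def missed_cleavages(peptide):
    """Count internal K/R residues, a rough sign of partial digestion."""
    return sum(1 for residue in peptide[:-1] if residue in "KR")


def map_peptide(proteins, peptide):
    """Map one peptide onto a dict of {protein_id: sequence}.

    Returns {protein_id: [(start, is_fully_tryptic), ...]} for hits only.
    """
    hits = {}
    for protein_id, seq in proteins.items():
        sites = [
            (start, is_tryptic_boundary(seq, start, len(peptide)))
            for start in find_peptide(seq, peptide)
        ]
        if sites:
            hits[protein_id] = sites
    return hits


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy_proteins = {
        "iso_1": "MAGRLLSEKTTVR",
        "iso_2": "MAGALLSEKTT",
        "other": "MTTTPPP",
    }
    print(map_peptide(toy_proteins, "LLSEK"))
    print(missed_cleavages("KQSAK"))
```

In this toy data the peptide maps to both isoform-like sequences at the same position, but only the first has a K/R residue immediately before it, so only that match is flagged as fully tryptic.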
The primary goal of the HIP2 database is to support future clinical proteomics research, especially the discovery of biomarkers through plasma proteomics profiling. For biomedical researchers interested in MS-based plasma biomarker studies, HIP2 can be the first database against which to search a list of candidate proteins/genes or peptides before prioritized biomarker candidates are chosen. As the database grows, additional annotation information on human plasma proteins, such as relative abundance, normal range of variability, detectability, peptidomic patterns associated with cleavage, mutation, and putative sites of glycosylation, will be added. HIP2 helps provide an integrated interface where database curators and data contributors can work together to collect ongoing published data from healthy human plasma proteomics experiments, foster community-based assessment of the presence and absence of proteins in healthy human plasma, and provide a centralized data repository for subsequent bioinformatics analysis of the consistency and biases of each MS proteomics platform or search software. We expect this database to become an essential clinical proteomics resource, helping link together the community of biomedical researchers engaged in biomarker studies and the community of mass spectrometry researchers developing sensitive analytical solutions.
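The candidate-screening use case described above can be sketched as a simple lookup. The function name, protein identifiers, and peptide counts below are hypothetical, and the input is assumed to be a mapping from protein identifier to the number of supporting peptides reported in healthy plasma, as might be exported from a resource such as HIP2.

```python
def screen_candidates(candidates, healthy_peptide_counts):
    """Split candidate proteins by whether healthy-plasma evidence exists.

    `healthy_peptide_counts` maps protein ID -> number of supporting
    peptides (an assumed export format, not a documented HIP2 one).
    Returns (reported, not_reported); `reported` is sorted by descending
    peptide count, then by identifier.
    """
    reported, not_reported = [], []
    for protein_id in candidates:
        count = healthy_peptide_counts.get(protein_id, 0)
        if count > 0:
            reported.append((protein_id, count))
        else:
            not_reported.append(protein_id)
    reported.sort(key=lambda item: (-item[1], item[0]))
    return reported, not_reported


if __name__ == "__main__":
    # Made-up identifiers and counts, for illustration only.
    healthy = {"PROT_A": 6, "PROT_B": 1, "PROT_C": 0}
    seen, unseen = screen_candidates(
        ["PROT_B", "PROT_X", "PROT_A", "PROT_C"], healthy
    )
    print("Reported in healthy plasma:", seen)
    print("No healthy-plasma evidence:", unseen)
```

A candidate with strong healthy-plasma evidence is not necessarily a poor biomarker, since it may still be differentially abundant in patients; the split is meant only as a first pass before prioritization.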
Availability and requirements
The online content of HIP2 is freely available to all WWW users. The database infrastructure and software tools used to develop the database are subject to the intellectual property protection terms of Indiana University.
Project name: Database for Healthy Human Individual's Integrated Plasma Proteome
Project home page: HIP2 website, http://bio.informatics.iupui.edu/HIP2/
Browser requirements: Modern browsers (e.g., Firefox or Microsoft Internet Explorer) will function satisfactorily.
This work was partially supported by a Clinical Proteomic Technology Assessment for Cancer (CPTAC) grant from the National Cancer Institute (U24CA126480-01), part of NCI's Clinical Proteomic Technologies Initiative. We thank Fred Regnier and Charles Buck from Purdue University for support of this project. We thank Ron Beavis, David Tabb, and Steve Stein for helpful initial discussions that led to the conceptualization of this work.
- Omenn GS: Advancement of biomarker discovery and validation through the HUPO plasma proteome project. Disease markers. 2004, 20 (3): 131-134.
- Carr S, Aebersold R, Baldwin M, Burlingame A, Clauser K, Nesvizhskii A: The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data. Mol Cell Proteomics. 2004, 3 (6): 531-533.
- Deutsch EW, Eng JK, Zhang H, King NL, Nesvizhskii AI, Lin B, Lee H, Yi EC, Ossola R, Aebersold R: Human Plasma PeptideAtlas. Proteomics. 2005, 5 (13): 3497-3500.
- Beavis RC: Using the global proteome machine for protein identification. Methods in molecular biology (Clifton, NJ). 2006, 328: 217-228.
- Omenn GS, States DJ, Adamski M, Blackwell TW, Menon R, Hermjakob H, Apweiler R, Haab BB, Simpson RJ, Eddes JS, Kapp EA, Moritz RL, Chan DW, Rai AJ, Admon A, Aebersold R, Eng J, Hancock WS, Hefta SA, Meyer H, Paik YK, Yoo JS, Ping P, Pounds J, Adkins J, Qian X, Wang R, Wasinger V, Wu CY, Zhao X, et al: Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics. 2005, 5 (13): 3226-3245.
- Liu X, Valentine SJ, Plasencia MD, Trimpin S, Naylor S, Clemmer DE: Mapping the human plasma proteome by SCX-LC-IMS-MS. Journal of the American Society for Mass Spectrometry. 2007, 18 (7): 1249-1264.
- Muthusamy B, Hanumanthu G, Suresh S, Rekha B, Srinivas D, Karthick L, Vrushabendra BM, Sharma S, Mishra G, Chatterjee P, Mangala KS, Shivashankar HN, Chandrika KN, Deshpande N, Suresh M, Kannabiran N, Niranjan V, Nalli A, Prasad TS, Arun KS, Reddy R, Chandran S, Jadhav T, Julie D, Mahesh M, John SL, Palvankar K, Sudhir D, Bala P, Rashmi NS, et al: Plasma Proteome Database as a resource for proteomics research. Proteomics. 2005, 5 (13): 3531-3536.
- Plasma Proteome Project. [http://www.bioinformatics.med.umich.edu/hupo/ppp]
- Clinical Proteomics Technologies for Cancer. [http://proteomics.cancer.gov/]
- Anderson NL, Polanski M, Pieper R, Gatlin T, Tirumalai RS, Conrads TP, Veenstra TD, Adkins JN, Pounds JG, Fagan R, Lobley A: The human plasma proteome: a nonredundant list developed by combination of four separate sources. Mol Cell Proteomics. 2004, 3 (4): 311-326.
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004, 4 (7): 1985-1988.
- BioMart. [http://www.biomart.org/biomart/martview/]
- Olsen JV, Ong SE, Mann M: Trypsin cleaves exclusively C-terminal to arginine and lysine residues. Mol Cell Proteomics. 2004, 3 (6): 608-614.
- Statistical Analysis of Protein Sequences. [http://www.ebi.ac.uk/saps/]
- Taylor CF, Paton NW, Lilley KS, Binz PA, Julian RK, Jones AR, Zhu W, Apweiler R, Aebersold R, Deutsch EW, Dunn MJ, Heck AJ, Leitner A, Macht M, Mann M, Martens L, Neubert TA, Patterson SD, Ping P, Seymour SL, Souda P, Tsugita A, Vandekerckhove J, Vondriska TM, Whitelegge JP, Wilkins MR, Xenarios I, Yates JR, Hermjakob H: The minimum information about a proteomics experiment (MIAPE). Nature biotechnology. 2007, 25 (8): 887-893.
- Arnold RJ, Jayasankar N, Aggarwal D, Tang H, Radivojac P: A machine learning approach to predicting peptide fragmentation spectra. Pacific Symposium on Biocomputing. 2006, 219-230.
- HIP2 website. [http://bio.informatics.iupui.edu/HIP2/]
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/1/12/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.