Interest in defining molecular biomarkers of health and disease from human plasma has surged with the recent launch of the pilot Plasma Proteome Project (PPP) by the Human Proteome Organization (HUPO) [1]. The easy clinical access and processing of plasma samples, and the abundance of proteins as well as metabolites that may collectively define a person's health status, have made human plasma the top choice among bio-fluids for future clinical molecular diagnostic applications. However, the variability of blood among individuals, the huge dynamic range of protein concentrations (up to 10^12), and the protein detection limits of most MS platforms have made the plasma proteome elusive to define. Many proteomics researchers even believe that the current "plasma proteome" observed by a single shotgun MS experiment is analogous to a stochastic sampling of the human proteome, with low run-to-run consistency and inherent detection biases peculiar to each type of MS platform [2]. Even for plasma from healthy individuals, and even with multi-dimensional separations and advanced bioinformatics search software, the proteins identified in different shotgun MS/MS plasma proteomics experiments are often inconsistent with each other except for the most abundant proteins. To overcome the poor coverage and potential bias of each experimental measurement of the human plasma proteome, and to exploit the complementary nature of these measurements, biomedical researchers need to collect and assess all reliable publicly available plasma protein data sets generated from different MS analytical and computational platforms for healthy individuals. A comprehensive integrated resource of the human plasma proteins for healthy individuals, currently missing in the field of clinical proteomics, would enable researchers to understand the basic constituents, diversity, and variability of the human plasma proteome. 
Such a resource would provide substantial comparative power for interpreting proteomic profile changes in patients' plasma, and may supplement or compensate for limitations and biases associated with the set of controls for a given study. It would also make it easier to recognize candidate protein biomarkers that are already known to occur in healthy human plasma, in cases where a protein is differentially expressed in a patient sample relative to the quantities observed in the study control.
Although multiple projects to profile the human plasma proteome have been attempted, including PeptideAtlas, GPMO, HUPO PPP, and several recent publications [3–6], there remains an unaddressed need for a compiled, central repository structured to enable the stable retrieval, comparison, and querying of results. The existing sources vary widely in data set size, available user interface, experimental protocol or sample details, choice of protein identifiers, linking to peptide evidence from MS experiments, MS search software used, and extent of data annotation. This information needs to be compiled further and assembled for end users before they can consider incorporating human plasma proteome data into their studies. The largest single source of data is an independently conducted experimental study that used an ion-mobility spectrometry (IMS) platform to chart the existence of 9,087 proteins based on 37,842 unique inferred peptide sequences, of which 2,928 proteins are high-confidence [6]. There is, however, no web-based interface or other online resource that makes this data widely available. The other sources of data we examined have some form of online presentation (aside from publication) and range in size from fewer than one thousand to several thousand identified proteins. The Plasma Proteome Database (PPD) provides a web interface and is geared toward providing detailed functional annotations of 3,778 distinct proteins based on data extracted from the literature, yet the PPD provides information on neither the experimental protocol nor the associated MS-detected peptides used for protein identification [7]. HUPO PPP information consists of 3,020 proteins and 47,950 peptides along with experimental protocol information and is available to the public online, but the data is only accessible as flat files [8]. 
The Institute for Systems Biology (ISB) has surveyed and analyzed a comparatively smaller set of data produced by 28 human plasma proteomics experiments, and has reported an approximate count of 960 proteins based on the 6,929 distinct observed peptides in their web-interfaced PeptideAtlas database [3]. Another resource, providing evidence for human proteomics based mainly on data from HUPO PPP, is hosted by the Global Proteome Machine Organization (GPMO). An important feature of the GPMO database is that it provides annotated information to assist with the difficult process of validating peptide MS/MS spectra and patterns of protein coverage [4]. GPMO also includes data from non-human organisms such as cats, guinea pigs, and rabbits, from unicellular eukaryotes such as yeasts, and from a number of prokaryotic organisms.
By gathering the protein and peptide data used to characterize the proteome of a healthy individual, we set out to develop a resource that presents a comparative baseline of plasma proteomics results against which proteomic data from patients with diseases such as cancer, neurodegenerative diseases, metabolic diseases, and other genetic disorders may be studied. In this effort, we define "healthy" or "normal" as human adults without major known life-threatening diseases, genetic diseases, HIV, or inflammation at the time of blood drawing (a slightly more stringent variation of the HUPO definition in Omenn et al. [5]). On these premises, we developed an integrated database, HIP2 (Healthy Human Individual's Integrated Plasma Proteome), by compiling the existing data from experiments performed on samples from healthy individuals, and created a web-based interface to aid the many upcoming protein biomarker studies of health and disease [9]. With HIP2, clinical samplings of patient plasma can be compared with the reported set of healthy human plasma proteins to assess whether the observed overlap is random or non-random.
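As an illustration of what comparing a patient sampling against a healthy baseline could look like in practice, the following minimal Python sketch computes the overlap between two protein sets and estimates how likely an overlap of at least that size would be under random sampling. It is not the HIP2 implementation: the use of a hypergeometric upper-tail probability, the function name `overlap_p_value`, the placeholder identifiers, and the assumed size of the detectable protein universe are all assumptions made here for illustration only.

```python
from math import comb


def overlap_p_value(universe_size, baseline_size, sample_size, overlap):
    """Upper-tail hypergeometric probability of seeing at least `overlap`
    baseline proteins when `sample_size` proteins are drawn at random,
    without replacement, from a universe of `universe_size` proteins of
    which `baseline_size` belong to the healthy baseline."""
    upper = min(baseline_size, sample_size)
    total = comb(universe_size, sample_size)
    tail = sum(
        comb(baseline_size, k)
        * comb(universe_size - baseline_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical placeholder identifiers, not real accession numbers.
healthy_baseline = {"PROT_A", "PROT_B", "PROT_C", "PROT_D", "PROT_E"}
patient_sample = {"PROT_A", "PROT_B", "PROT_C", "PROT_X"}

# Proteins seen in both sets, and proteins seen only in the patient sample.
shared = healthy_baseline & patient_sample
patient_only = patient_sample - healthy_baseline

# Assumed universe of 20 detectable proteins (illustrative value only).
p = overlap_p_value(
    universe_size=20,
    baseline_size=len(healthy_baseline),
    sample_size=len(patient_sample),
    overlap=len(shared),
)
print(sorted(shared), sorted(patient_only), p)
```

A small probability suggests that the overlap with the healthy baseline is unlikely to arise by chance alone. The result depends strongly on the assumed universe size, so that value would need to be justified for any real analysis.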