Introduction

One goal of the Human Proteome Project is to define the protein molecules that make up the human body. Such an activity could generate a reference list to better understand and detect human disease, and, with hyperfine detail, our responses to new therapeutics. What follows is a short synopsis of an idea to crystallize the Human Proteome Project (HPP) into a focused effort to map the natural structure and variation of human beings at the molecular level, much like the Human Genome Project completed a decade ago.

Whereas the decoding of the human genome involved the determination of a linear sequence of A’s, G’s, C’s, and T’s present in most of our cells, proteins are far more context-dependent. This fact, along with the complexity of highly processed protein molecules and the lack of amplification methods, forces one to define the context and scope of a compelling project that builds on the exploding knowledge of human genomes. This includes describing a clear endpoint of high value that will transform both basic and clinical research, and accelerate the delivery of societal promises made for the post-genomic practice of medicine.

Discussion

Current Strategies for the HPP

Launched a decade ago [1], the Human Proteome Organization, or HUPO (http://www.hupo.org/), has focused on creating knowledge bases, antibody-based reagents, and mass spectrometry-based proteomics using a “Bottom Up” analytical strategy. Using antibodies, the construction of a Human Protein Atlas (http://www.proteinatlas.org/) has yielded immunofluorescence images profiling protein expression from ~40 % of human genes. There have been two articulations of initiatives using protein mass spectrometry thus far. A “biology/disease” approach, generally linked to disease research, was first to be put forth (e.g., for the human plasma proteome, the liver, the brain, etc.) [2, 3]. This has come to be known also as a “protein-centric” or discovery approach [4]. More recently, a “gene-centric” (aka, a “chromosome-centric”, or C-HPP) approach has emerged [5, 6], with groups in many countries coordinating national efforts to map the abundance, distribution, and sub-cellular localization of proteins whose genes are co-located on the same chromosome. One additional achievement of HUPO has been to begin unifying the field of proteomics via the Proteomics Standards Initiative [7] and to provide a forum for coordinated efforts to improve cross-lab reproducibility [8]. For ease, Table 1 summarizes the current articulations of the HPP.

Table 1 Abbreviations Used to Notate Articulations of the Human Proteome Project (HPP)

A new context in which to place the HPP takes inspiration from a particular level of the natural organization present in the human body (Figure 1), with cell type assuming the primary, defining context for the project. With a few exceptions, it is individual cells that convert the genome into the proteome, thereby defining cell type through biomolecule expression. A cell-centric focus places a premium on knowing and classifying all the sub-types of cells in the human body. With relevance across the spectrum of human disease, a cell mapping stage naturally precedes the large-scale characterization of protein molecules (vide infra). This is akin to the genome mapping stages (first using genetic and then physical techniques) that dominated the first decade of the HGP (Table 2, row 2). Stimulated by the end goal and the resources to achieve it, DNA sequencing technologies underwent development at a sharply accelerated rate during this mapping stage of the project. A similar stage of technology development for quantitative measurement of protein forms is envisioned for this “Cell-Based” articulation of the HPP (or “CB-HPP”; Table 1).

Figure 1
figure 1

Graphical depiction of the cell-based version of the Human Proteome Project (CB-HPP)

Table 2 Comparison of the Human Genome Project (HGP) and the Cell-Based Version of the Human Proteome Project (CB-HPP)

Mapping Cell Types

It is clear that cellular heterogeneity is a major point of confusion in normal and disease biology, and that the textbook number of ~230 different cell types in the human body is out of date in the age of molecular medicine. The cell mapping stage of the CB-HPP can utilize a variety of cell surface markers for fluorescence assisted cell sorting (FACS) to prepare 1000 to 1,000,000 cells of high purity prior to cell-specific proteomics [9]. Further, a cell-based project calls for the large-scale discovery and validation of cell surface markers, using capture technologies for cell surface proteins [10], FACS, mass cytometry [11], RNA-seq, and other multi-parameter tools to categorize the cell types present in the human body. The Cellpedia project has generated an ontology of cell types raising the classical number of 230 to >2500 currently [12]. Given that we will add substantially to the number of cell and sample types during the cell mapping stage of the project, the number could rise to perhaps ~4000 cell types. Defining the variation of healthy cells using quantitative and isoform-resolved proteomics, both within an individual and within populations, would provide a rich basis for subsequent disease-driven research and regenerative medicine. The source of cells should be highly restricted to those isolated from primary tissue. The CB-HPP has a high bar for sampling prior to mass spectrometry-based proteomics, a will use classifiers for definition of primary cell type.

Defining the Proteome of Specific Cell Types

This cell-based articulation of the Human Proteome Project takes inspiration, where appropriate, from the experience of the Human Genome Project (Table 2). The most analogous effort to the genome project is to provide the definitive primary structure for Homo sapiens at the level of protein molecules. This focused effort would involve the definition of all detectable proteoformsFootnote 1 of carefully defined and sorted cell types from the human body. Assuming there are ~250,000 distinct proteoforms detectable in a given cell type by technologies ready within a 10-year time horizon, the whole cell-based project involves characterization of at least 1 billion proteoforms present in nondiseased cell types (Figure 2). Combined with the 10 major body fluids such as blood [13]— the core of the CB-HPP project would involve identification, characterization, and quantitation of over 1 billion detectable protein forms. The precise level of analytical depth could be adjusted once a cost versus depth model is in place prior to a production scale effort being launched around the time the C-HPP is projected to be completed in the year ~2022 [5]. To facilitate interpretation of splicing events, mutations, and coding polymorphisms, samples would be subjected to parallel genome sequencing and RNA-seq using NGS.

Figure 2
figure 2

The levels of organization in the human body. The cell-based approach to the Human Proteome Project (CB-HPP) recognizes cell type as a primary context for mass spectrometry-based proteomics to measure the molecular complexity present in the body naturally. The CB-HPP also calls for accelerated development of new and emerging technologies to better define cell types and precisely catalogue whole protein molecules

The Human Genome Project involved taking a grand inventory of human DNA. Similarly, the proposed CB-HPP would create definitive knowledge of cell types and the protein molecules within them. With a simplified focus on cell type and protein primary structure, the core of a focused project based on mass spectrometry can then be crafted:

  • Goal: By the year 2030, to develop and apply the technology to analyze the ~1 billion primary structures of protein forms present in all the cell types and major fluids present in the human body.

This primary goal of the CB-HPP will drive development of technologies to transform the proteome from a nebulous enigma into a closed system—with knowable molecules and intelligible codes. One promising approach is the “Top Down” mass spectrometric strategy for analyzing molecules, now achievable for thousands of intact proteoforms [14]. For perspective, almost all practitioners of large-scale proteomics in discovery and targeted modes use the method of “Bottom Up” proteomics, which employs proteases to digest the primary structures of whole proteins present naturally. Clearly, both strategies can work together in a project that unifies the gene- and protein-centric articulations of the HPP. As judged by comparison with RNA-seq, Bottom Up methods are asymptotically approaching the ability to completely detect all expressed proteins (~10,000) in discovery mode from a single human cell type [15, 16]. Detection of proteoforms produced from these ~10,000 genes from carefully defined and isolated cell types then becomes the primary target for technology development in mass spectrometry-based proteomics.

This fresh and focused approach to the human proteome highlights major gaps in our current understanding of proteins and leads to a call for technological innovations (like the pioneers of genomics in the late 1980s). What combinations of coding polymorphisms, alternative splice forms, and post-translational modifications create the constellation of proteoforms present in each cell type? Once technologies are in place to answer this, we can address the question of how they vary in human disease in a deterministic and comprehensive fashion. A cell-based Human Proteome Project places a premium on defining and isolating specific cell types prior to analysis with 100 % sequence coverage for proteoforms detected at a copy number of 10 and above. Mainstream technologies in proteomics cover <20 % of the sequence space of the detectable proteome, and suffer limitations from the protein inference problem.

An Early Example: Knowing Proteoforms of Human Histones

The human genome as presented in chromatin is 1/2 DNA and 1/2 protein by weight – and knowledge of histone forms across the ~60 million nucleosomes in diploid cells is now in view from application of the full complement of mass spectrometric methods over the years. Recently, knowledge of over a thousand distinct molecular forms of core and linker histones has been obtained by analysis of intact histones. With this bird’s eye perspective (i.e., molecular composition and approximate quantity), we have a reasonably good “basis set” of histone forms that are present down to a copy number of ~1000. While increased depth of this analysis will uncover thousands (not billions) more histone proteoforms in the future, we can already use this reference set to better understand combinations of modifications, their epigenetic contribution to diseases, and our responses to epigenetic-based therapeutics (e.g., those in development for a variety of cancers of the blood).

Beyond Primary Structure: Capturing Protein Pleiotropism at All Levels of Organization in the Human Body

Proteins are heterogeneous and dynamic molecules in time and space. This is the reason why they are critical to our understanding of precise mechanisms in complex diseases. The dynamic nature of proteins also makes their analysis more challenging than the genome in several respects. The context of proteins in large complexes, organelles, lipid membranes, and organ/tissue type can defocus the protein analysis picture. However, recent developments do give hints of how we might proceed. For example, the canonical Top-Down experiment using mass spectrometry (i.e., complete analysis of protein primary structure) already has a next-generation counterpart, which includes characterization of the quaternary structure of megadalton protein complexes using native electrospray [17]. Also, by linking “Top Down” and “Bottom Up” flavors of mass spectrometry to separations that fractionate organelles or protein complexes, their composition can be built up using the concept of co-fractionation [18, 19]. Further, cellular and sub-cellular localization of proteins can be provided by the Human Protein Atlas, already well underway. In addition to capturing the tissue and organ context within the body (initiatives also underway in the HUPO consortium), such details on organellular localization and protein complexation form additional goals (added at considerable expense) to round out a project with the integrated resources to provide precise molecular information at each level of organization present in the human body (Figure 1).

Comparing/Contrasting with the Human Genome Project

Without an analogue to the polymerase chain reaction (PCR) for proteins, the challenge of the human proteome requires some different strategies and tactics (Table 2). When the HGP project was begun the technological hurdles seemed insurmountable. However, the architects of the HGP recognized that when conditions are right, methodological advances come more quickly than expected. The mapping phase of the human genome provided meaningful linkages to disease research and a “Top Down” scaffold that anchored the “Bottom Up” method of whole genome shotgun sequencing. In addition to improving cell-based separations and mass spectrometric-based analyses of endogenous proteoforms, the call for disruptive technologies in proteomics would be given new voice. Assuming a cost on the same scale and growth curve of the Human Genome Project, one should demonstrate value and performance in pilot projects. While small bacteria like H. influenzae served nicely as models to develop whole genome shotgun sequencing, most bacterial proteins are not highly modified into a diversity of proteoforms. Despite this, such microorganisms would serve as excellent models to judge completeness and benchmark technologies capable of measuring detectable proteins with complete coverage of their primary structures (Table 2, row 5). Also, pilot projects on readily obtained human cell types can commence straight away (e.g., those of the hematopoietic system). Discovery of surface markers and methods for defining and sorting unfixed cells from solid tissue is a critical early part of the cell-mapping phase for the CB-HPP, where proteoforms are inventoried on a cell-specific basis.

Regarding the Molecular Variation of DNA and Proteins

While the genome is quite stable and definable, there is substantial variation of it through mutation and polymorphism in populations. This variation is becoming known more fully as we reach the era of the $1000 genome [20]. The proteome has greater variation, but does not defy definition (particularly when each sample could readily have its full genome determined). Therefore, we can identify meaningful goals (vide supra), recognizing the differences between the genome and the more pleotropic proteome. Another major difference is the sampling of proteins versus DNA. For proteomics, the demands of ethical sampling would increase the requirement for a highly collaborative consortium and would extend the project in time.

Cost and Return

The Human Genome Project involved mapping stages and much technology development stimulated by articulation and funding of the project. Even after 2003, continued stimulus through the National Institutes of Health contributed to the amazing drop of more than six orders of magnitude in cost to sequence DNA. This created over 300,000 new jobs and an estimated ~ $700 billion of economic activity [21]. A similar trajectory could be envisioned for the acquisition of information of the estimated ~1 billion proteoforms (4000 cell types × 250,000 forms/type = 1 billion protein forms). Not until the cost reaches less than $1/proteoform would a production scale effort be launched. This would provide a target to stimulate public and private sector efforts to create disruptive technologies with orders-of-magnitude increases in efficiency for discovery- and targeted-proteomics.

Summary

The cell-based version of the Human Proteome Project, or CB-HPP, relies on two simple tenets. (1) Cells convert the blueprint of life into proteins. Therefore, a proteome project should use cell type as its primary index. (2) The “Top Down” philosophy of molecular analysis can be used in conjunction with the C-HPP project [5] to determine the complete primary structures of protein molecules on a cell-specific basis. Adherence to these tenets sharply focuses a long-term effort to create a more solid foundation for 21st century biology and provides clear metrics of progress and completion.

Outlook

This new approach to the Human Proteome Project calls for nonincremental technological leaps, and recognizes the biological hierarchy present naturally in our bodies. The CB-HPP would provide fundamental knowledge of all cells and detectable forms of protein molecules in a range of healthy human bodies. Such knowledge would revolutionize our understanding of the proteome, making it far easier and deterministic to prepare reagents and assays for diagnostics and therapeutics. For example, antibodies of the future can be constructed for targeted epitopes and even made as proteoform-specific reagents. Similarly, a drug candidate can be assessed for returning a specific constellation of proteoforms in a pathway back to a healthier state (with far better knowledge of off-target effects). Drugs and diagnostics can be devised to target specific protein molecules with a level of precision that will help drive the century of biology along the path envisioned by many decades ago. Such hyperfine control over complex biological systems was part of the original promise of the genome project; the CB-HPP can serve as a next bridge to that goal. While drawing from the HGP experience but acknowledging the strong context-dependent nature of the proteome, we may indeed see momentum gathering to develop a comprehensive understanding of just what we are at the level of protein molecules. How can we realize all our goals for the “Century of Biology” without a transformation in our molecular comprehension of the proteome?