Visualization and Dissemination of Multidimensional Proteomics Data Comparing Protein Abundance During Caenorhabditis elegans Development
Regulation of protein abundance is a critical aspect of cellular function, organism development, and aging. Alternative splicing may give rise to multiple possible proteoforms of gene products where the abundance of each proteoform is independently regulated. Understanding how the abundances of these distinct gene products change is essential to understanding the underlying mechanisms of many biological processes. Bottom-up proteomics mass spectrometry techniques may be used to estimate protein abundance indirectly by sequencing and quantifying peptides that are later mapped to proteins based on sequence. However, quantifying the abundance of distinct gene products is routinely confounded by peptides that map to multiple possible proteoforms. In this work, we describe a technique that may be used to help mitigate the effects of confounding ambiguous peptides and multiple proteoforms when quantifying proteins. We have applied this technique to visualize the distribution of distinct gene products for the whole proteome across 11 developmental stages of the model organism Caenorhabditis elegans. The result is a large multidimensional dataset for which web-based tools were developed for visualizing how translated gene products change during development and identifying possible proteoforms. The underlying instrument raw files and tandem mass spectra may also be downloaded. The data resource is freely available on the web at http://www.yeastrc.org/wormpes/.
Keywords: Proteoform, Proteomics, Visualization, Database, Caenorhabditis elegans, Development, Protein separation, SDS-PAGE
Bottom-up shotgun proteomics is a widely used technique for identifying peptides and, indirectly by inference, proteins present in biological samples. Broad adoption of this technique was facilitated by the advent of SEQUEST (and the availability of new genome sequences), which greatly streamlined the interpretation of tandem mass spectra. By searching spectra against a list of candidate peptides taken from a database of possible protein sequences, SEQUEST provided an unprecedented ability to quickly and easily identify proteins present in a protein mixture.
Increasingly, proteomics studies are focusing not only on the identification of proteins but also on the differences in the proteome between biological samples. Multiple techniques have been developed to quantify proteins in bottom-up shotgun proteomics experiments. These fall largely into methods that require introduction of internal reference standards (such as SILAC, iTRAQ, and ICAT) and so-called “label-free” methods that do not (such as spectral counting). Spectral counting, which uses a metric based simply on the number of observations for all peptides mapping to a given protein in an experiment, is a widely used and computationally inexpensive technique for comparing differences between samples [5, 6, 7, 8, 9]. However, the problem of ambiguous peptides is compounded when attempting to quantify distinct gene products or proteoforms using spectral counting. Given that peptides, not proteins, are being measured, and that there is no clear way to determine which of the proteoforms containing a peptide are contributing spectrum counts for that peptide, how can one reliably estimate the presence of each of those proteoforms using this method?
A technique that assigns these ambiguous peptides to distinct gene products or proteoforms using bottom-up proteomics was developed and applied as part of the modENCODE project, which aimed to fill in the gaps in the genome annotation for Caenorhabditis elegans. This technique uses Gelfree fractionation to separate the endogenous proteins in a sample by mass before analysis by mass spectrometry, so that identified peptides that map to multiple gene products with distinct masses may be attributed specifically to the gene product with the correct mass for the fraction (Figure 1, right panel). For the modENCODE study, this technique was applied separately to whole proteomes of 11 distinct developmental stages of Caenorhabditis elegans, resulting in a rich, multidimensional dataset that could conceivably be used not only to confirm the presence of distinct gene products or proteoforms but also to estimate and compare quantities of those gene products or proteoforms between developmental stages using spectral counting.
Given the complexity of the data, tools designed to help interpret the SEQUEST results in a biologically meaningful context are essential for efficient discovery and proteogenomic analysis. To this end, we constructed a database and web application that allow searching, visualizing, and downloading the data. Spectral counting-based analysis was performed, and the web application provides tools for identifying distinct proteoforms and interrogating how the quantities of those proteoforms may change with respect to developmental stage. The web site and all raw data are freely available at http://www.yeastrc.org/wormpes/.
Sample Preparation and Mass Spectrometry Analysis
Eleven developmental stages of C. elegans were analyzed: N2 embryo, N2 L1, N2 L2, N2 L3, N2 L4, N2 YA, N2 dauer, spe-9 L4, spe-9 YA, spe-9 adult, and him-8. Each developmental stage was grown at 20°C on agar plates seeded with the NA22 strain of E. coli, sucrose floated, lysed in the presence of protease inhibitors (Roche Diagnostics, Indianapolis, IN, USA), and centrifuged to separate insoluble and soluble fractions. For each developmental stage, 200 μg of soluble lysate was reduced with 5 mM DTT (Sigma, St. Louis, MO) in 30 μL Gelfree sample buffer (125 mM Tris, 4% SDS, 0.025% bromophenol blue, pH 7), vortexed, and heated to 50°C for 10 min. The samples were then cooled to room temperature, alkylated with 15 mM IAA (Sigma), and incubated at room temperature in the dark for 10 min. The samples were separated into 15 molecular weight fractions ranging from 3.5 to 500 kDa using the Gelfree 8100 fractionation system (Protein Discovery/Expedeon). Twelve fractions were collected from the mid-range Gelfree cartridge (3.5–100 kDa) and three fractions were collected from the high-range Gelfree cartridge (3.5–500 kDa).
fraction 1 (3.5–15 kD)
fraction 2 (13–17 kD)
fraction 3 (15–20 kD)
fraction 4 (15–25 kD)
fraction 5 (17–30 kD)
fraction 6 (23–35 kD)
fraction 7 (30–42 kD)
fraction 8 (35–50 kD)
fraction 9 (40–57 kD)
fraction 10 (50–57 kD)
fraction 11 (55–77 kD)
fraction 12 (70–100 kD)
fraction 15 (120–200 kD)
fraction 16 (190–250 kD)
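Because the nominal mass ranges of neighboring fractions overlap, a protein of a given calculated mass may be expected in more than one fraction. The following is a minimal sketch of such a lookup, not the method used by the web application: the function name `candidate_fractions` and the inclusive treatment of range boundaries are assumptions made for illustration, and only the fractions listed above are included.

```python
# Nominal mass ranges in kD, copied from the fraction list above.
# Fractions that do not appear in that list are deliberately left out.
FRACTION_RANGES_KD = {
    1: (3.5, 15), 2: (13, 17), 3: (15, 20), 4: (15, 25),
    5: (17, 30), 6: (23, 35), 7: (30, 42), 8: (35, 50),
    9: (40, 57), 10: (50, 57), 11: (55, 77), 12: (70, 100),
    15: (120, 200), 16: (190, 250),
}


def candidate_fractions(mass_kd: float) -> list[int]:
    """Return every listed fraction whose nominal range contains mass_kd.

    Range boundaries are treated as inclusive (an assumption).
    """
    return [
        fraction
        for fraction, (low, high) in FRACTION_RANGES_KD.items()
        if low <= mass_kd <= high
    ]
```

A 52 kD protein, for example, falls within the nominal ranges of fractions 9 and 10, so signal in either fraction would be consistent with its calculated mass.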
Each fraction was digested with trypsin (Promega, Madison, WI). SDS was removed with SDS removal columns (Pierce, Rockford, IL, USA) and salts were removed with MCX columns (Waters, Milford, MA, USA). The peptides from each fraction were analyzed using a 35 cm fused silica 75 μm column and a 4 cm fused silica Kasil1 (PQ Corporation, Malvern, PA, USA) frit trap loaded with Jupiter C12 reverse phase resin (Phenomenex, Torrance, CA, USA) with a 120-min LC-MS/MS run on a Thermo LTQ-Orbitrap Velos mass spectrometer coupled with an Eksigent nanoLC 2D. A biological and analytical replicate was performed for each sample.
Accurate masses were assigned using Bullseye, and peptides were identified using SEQUEST searched against a FASTA protein sequence database comprising Wormbase wormpep (WS229), RNA-seq-based predictions [10, 15], and gene predictions and translated C. briggsae intergenic ORFs as described in Merrihew et al. P-values and q-values were assigned to PSMs and peptides on a per-fraction basis using Percolator.
To guard against the effective increase in false discovery rate (FDR) associated with combining multiple datasets that are each filtered on q-value, we calculated a single q-value for each distinct peptide in the dataset. This q-value is meant to be the minimum false discovery rate at which we may confidently consider the peptide to be present in the whole dataset. We pooled the target and decoy PSMs from every run, ranked them by the P-value that Percolator calculated in their respective MS/MS runs, eliminated all but the top-scoring PSM for each distinct peptide, and used the decoys as an empirical null for the targets. Specifically, we computed a decoy-based P-value for each target peptide (i.e., the fraction of decoys that score better than the target score), and then converted the resulting P-values to q-values using qvality. Only peptides with a q-value ≤0.01 using this method were considered for spectral counting.
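The whole-dataset filtering steps can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used for the dataset: all function names are hypothetical, the input is assumed to be simple (peptide, Percolator P-value) pairs in which lower values are better, and the final P-value to q-value conversion uses a Benjamini-Hochberg style calculation as a stand-in for qvality, which estimates q-values differently.

```python
from bisect import bisect_left


def best_score_per_peptide(psms):
    """Keep only the top-scoring (lowest P-value) PSM for each distinct peptide.

    psms: iterable of (peptide, percolator_p_value) pairs pooled from all runs.
    """
    best = {}
    for peptide, p_value in psms:
        if peptide not in best or p_value < best[peptide]:
            best[peptide] = p_value
    return best


def decoy_based_p_values(target_best, decoy_best):
    """Fraction of decoy peptides that score strictly better than each target."""
    decoy_scores = sorted(decoy_best.values())
    if not decoy_scores:
        raise ValueError("at least one decoy peptide is required")
    n_decoys = len(decoy_scores)
    return {
        peptide: bisect_left(decoy_scores, score) / n_decoys
        for peptide, score in target_best.items()
    }


def bh_q_values(p_values):
    """Benjamini-Hochberg style q-values (a stand-in for qvality, not qvality itself)."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    q_values = {}
    running_min = 1.0
    # Walk from the worst P-value to the best so q-values stay monotone.
    for rank in range(m, 0, -1):
        peptide, p_value = ranked[rank - 1]
        running_min = min(running_min, p_value * m / rank)
        q_values[peptide] = running_min
    return q_values
```

Peptides whose resulting q-value is at most 0.01 would then be retained for spectral counting.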
Normalized Spectrum Count (NSC)
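NSC is calculated in two steps: NSCratio = Sp / St, and then NSC = NSCratio / NSCmin ratio, where Sp, St, and NSCmin ratio are defined below for each type of comparison. For example, given an NSCratio for a protein in three conditions of 5E-9, 4E-6, and 2E-7 and an NSCmin ratio of 1E-9, the NSC would be calculated as 5, 4000, and 200, respectively.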
NSC was calculated for all proteins separately for each developmental stage, such that the abundances may be compared between developmental stages. To calculate the NSCratio for a protein for a developmental stage, Sp is the sum total of PSMs for that protein across all fractions (including all replicates) and St is the sum total of all PSMs for all proteins across all fractions (including all replicates). Then, to calculate NSC, all NSCratio values are divided by NSCmin ratio, which is the minimum NSCratio calculated for all proteins across all developmental stages. (Only peptides with a whole-dataset q-value ≤ 0.01 and PSMs with a q-value ≤ 0.01 as calculated by the Percolator algorithm were considered).
The same method was used to compute NSC values for proteins for individual mass fractions. NSCratio was calculated where Sp is the sum total of PSMs for that protein in that mass fraction across all developmental stages, and St is the sum total of PSMs for all proteins in that fraction across all developmental stages. NSC was then calculated using an NSCmin ratio that was the minimum NSCratio calculated for all proteins across all fractions.
To compare spectrum counts between combinations of developmental stage and mass fraction, NSCratio was calculated where Sp was the sum total of PSMs for a protein using all replicate runs of that specific developmental stage and mass fraction, and St was the sum total of PSMs for all proteins in those runs. NSC was then calculated using an NSCmin ratio that was the minimum NSCratio calculated for all proteins across all possible combinations of developmental stage and mass fraction.
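The three variants described above differ only in how PSMs are grouped (by developmental stage, by mass fraction, or by the combination of both). A minimal sketch of the calculation follows; the function name and the input layout, a mapping from (group, protein) to the number of PSMs that pass both q-value filters, are assumptions made for illustration.

```python
from collections import defaultdict


def normalized_spectrum_counts(psm_counts):
    """Compute NSC for every (group, protein) pair.

    psm_counts: dict mapping (group, protein) -> PSM count (Sp), where group
    is a developmental stage, a mass fraction, or a (stage, fraction) tuple.
    Pairs with zero PSMs are skipped so that the minimum ratio is nonzero.
    """
    # St: total PSMs for all proteins within each group.
    totals = defaultdict(int)
    for (group, _protein), count in psm_counts.items():
        totals[group] += count

    # NSCratio = Sp / St for each (group, protein) pair.
    ratios = {
        key: count / totals[key[0]]
        for key, count in psm_counts.items()
        if count > 0
    }

    # NSCmin ratio: the minimum NSCratio over all proteins and all groups.
    min_ratio = min(ratios.values())
    return {key: ratio / min_ratio for key, ratio in ratios.items()}
```

With this normalization the smallest observed ratio maps to an NSC of 1, and every other value expresses how many times larger its ratio is than that minimum.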
Considerations for NSC
It is important to note that we are not performing any quantitative comparisons. We are only using NSC values to make qualitative comparisons of the same protein between samples. Properties of proteins, such as protein length or performance of tryptic peptides specific to a protein in the mass spectrometer, may have significant effects on spectrum counts for a given protein that are independent of the amount of protein. The NSAF score was developed to account for protein length by dividing the spectrum count for each protein by the protein’s length to calculate a spectrum abundance factor (SAF), then dividing this SAF by the sum of the SAFs calculated for all proteins in the run to arrive at a normalized SAF (NSAF). However, NSAF ignores the variable peptide performance resulting from different possible tryptic peptides between separate proteins. Additionally, we were not wholly confident in the true sequence lengths of the detected proteins, as we may be unknowingly detecting alternate splice variants and proteoforms that are post-translationally modified. Given these two factors, we chose to exclude protein length from the calculation of NSC to avoid the implication that NSC values may be legitimately compared between separate proteins.
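For contrast with NSC, the length-normalized NSAF calculation described above can be sketched as follows. The function name and input dictionaries are hypothetical, and the sketch assumes the usual NSAF definition in which the denominator sums the SAF over all proteins in the run. NSAF was not used for this dataset.

```python
def nsaf(spectrum_counts, lengths):
    """Normalized spectral abundance factor for each protein in one run.

    spectrum_counts: dict protein -> spectrum count
    lengths: dict protein -> protein length in residues
    """
    # SAF: spectrum count divided by protein length.
    saf = {
        protein: spectrum_counts[protein] / lengths[protein]
        for protein in spectrum_counts
    }
    # NSAF: each SAF divided by the sum of SAFs over all proteins in the run.
    total_saf = sum(saf.values())
    return {protein: value / total_saf for protein, value in saf.items()}
```

Two proteins with equal spectrum counts but lengths of 100 and 400 residues receive NSAF values of 0.8 and 0.2, which illustrates how strongly the result depends on the assumed sequence length.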
An inherent limitation in most (if not all) methods that use spectral counting is that deviation in conditions (or experimental design) between compared samples may introduce inherent biases for classes of proteins that are not a function of the biology as much as they are a function of the methods themselves (e.g., biases that enrich for size or hydrophobicity). These biases may invalidate comparison between samples by sufficiently altering the likelihood of sampling a particular protein (and thus its spectral counts) based solely on non-meaningful attributes of that protein. In this dataset, we use NSC to compare gene products across developmental stages and across separate mass fractions. While comparing spectrum counts across developmental stages should not be subject to these artificial biases, comparing spectrum counts across separate mass fractions from the Gelfree separation may have biases in terms of the complement of expected proteins in the fraction, and so may impact the likelihood of sampling a given protein. When comparing directly between mass fractions, users should not consider the NSC a direct comparison of abundance between those fractions but rather a crude proxy of how enriched the individual fractions are for the protein of interest.
Web Site and Database Implementation
Blast (blastp: 2.2.25+) was installed on multiple RHEL servers to support user-driven searching of the dataset by sequence. The FASTA file used to search the MS/MS data was used to build the Blast sequence database. A Jobcenter client module for executing Blast was developed and installed on the Blast servers and linked to an in-house installation of Jobcenter to support distributed execution of user-driven Blast requests from the web application.
Results and Discussion
The dataset comprises 698 MS/MS runs from which 4,732,473 PSMs were identified (individual q-value ≤ 0.01) for 39,563 distinct peptides (whole-dataset q-value ≤ 0.01) mapping to 28,740 protein sequences from the FASTA file used to search the data. Of these peptides, 8,725 map uniquely to a single protein sequence, and 2,748 do not map to any protein found in Wormbase but instead map to 1,273 protein sequences that are the result of RNA-seq or computational prediction (see the “Methods” section). Given the large, multidimensional nature of the data (each run being a biological or technical replicate of a combination of developmental stage and mass fraction), a database and web-based interface were constructed to collate the data, help find proteins of interest, visualize how abundances of those proteins (and their possible proteoforms) may change as a function of developmental stage, and view the underlying, supporting mass spectrometry data.
Searching for Proteins
Users may search for proteins by using query strings (such as common name, accession string, or keyword) or by protein sequence using Blastp. Searching using query string effectively limits the possible results to those proteins found in Wormbase because those are the only annotated proteins in the dataset. However, many proteins in the dataset are the result of RNA-seq or computational prediction and have no commonly known names or annotations. To solve this, a system for searching by sequence with Blastp was set up (see the “Methods” section) and a novel interface for visualizing Blast results was constructed that colors hits based on confidence and clusters the search results based on where they physically map to the query sequence. This approach will tend to cluster matching proteoforms together as easily distinguishable groups and aid users in interpreting the results and selecting possible proteins of interest. From either search method, users may click on the names of proteins to visualize comparative protein abundance and proteomics data associated with that protein.
Visualizing NSC Abundance
NSC Bar Chart
The NSC bar chart uses a simple bar graph to show how the total NSC of all peptides that map to a given protein changes with respect to developmental stage. However, some peptides may map (by sequence) to multiple proteoforms, and if other proteoforms are present, it is not simple to determine which (if any) of the peptides that map to the current protein were detected as a result of the presence of one or more of the other proteoforms. To help determine whether (and to what degree) confounding proteins may be present, a bar graph comparing NSC between mass fractions is also presented. It shows whether PSMs for peptides mapping to the current protein were detected in mass fractions other than the expected mass fraction for this protein’s calculated mass (the expected fraction is shaded blue). Detection of peptides in other fractions may indicate the presence of proteoforms (previously known or unknown), protein degradation products, or that the accepted protein sequence is incorrect. When signal is present only in the expected mass fraction, caution should still be used, as multiple proteoforms of a protein may have similar masses that cannot be distinguished by mass fraction.
Hovering the mouse pointer over any of the bars will show the raw and normalized spectrum counts being represented. The bars may be clicked on to view the peptides, PSMs, and spectra associated with those spectral counts. Each PSM is annotated with both the developmental stage and mass fraction in which it was observed in order to further interrogate the presence and effects of possible proteoforms.
Protein Heat Map
The protein heat map visualizes protein NSC with respect to both developmental stage and mass fraction simultaneously and is designed to further interrogate the presence and character of possible proteoforms—and help mitigate the effects of those proteoforms when interpreting NSC. With the heat map it is not only possible to see in which mass fractions peptides mapping to a given protein were detected but also how the NSC in each of those mass fractions is different with respect to developmental stage. In the heat map, brighter red represents a higher NSC and grey represents the lack of detected PSMs for that developmental stage/mass fraction combination. Red boxes outside the expected mass fraction may indicate the presence of peptides also matching to proteoforms. Differences between mass fractions in the pattern of NSC with respect to developmental stage may additionally suggest the presence of proteoforms whose abundances are differentially regulated with respect to developmental stage. Additionally, the confounding effects of multiple proteoforms may be mitigated somewhat by examining only the pattern of NSC in the expected mass fraction for the protein of interest.
Red squares in the heat map may be hovered over with the mouse pointer to view the raw and normalized spectral counts, and red squares may be clicked on to view peptides, PSMs, and spectra found for the specific developmental stage/mass fraction combination. A bar graph is present at the top and right side of the heat map that represents the total NSC for each developmental stage and mass fraction, respectively. Each bar may also be hovered over to view spectral counts and clicked to view peptides, PSMs, and spectra.
Peptide Coverage Heat Map
The peptide coverage heat map attempts to provide still further insight into proteoforms by providing a visual comparison of individual peptides that map to a given protein as a function of developmental stage or biochemical fraction. This view uses the Mason viewer to lay out the protein sequence coverage as a row by drawing rectangles along the horizontal axis (where the left and right edges are the N- and C-termini) that represent which segments of the protein are covered by identified peptides. The colors of the rectangles are shades of red, such that brighter red indicates a higher NSC. The software then stacks the rows vertically using the same scale so that patterns of sequence coverage may be easily compared between different stages or fractions. Where multiple peptides overlap and map to the same position in the protein, the cumulative NSC for peptides mapping to a given protein position is used to determine shading. In this case, distinct peptides may also be viewed by expanding a developmental stage or mass fraction by clicking the icon to the left of the row label.
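The cumulative shading rule for overlapping peptides can be sketched as a per-position sum. This is an illustration only and does not reproduce the Mason viewer's implementation: the function names are hypothetical, and peptide positions are assumed to be 1-based and inclusive.

```python
def cumulative_nsc_by_position(protein_length, peptides):
    """Sum the NSC of every peptide covering each residue position.

    peptides: iterable of (start, end, nsc) with 1-based inclusive positions.
    Returns a list of length protein_length; uncovered positions are 0.0.
    """
    coverage = [0.0] * protein_length
    for start, end, nsc in peptides:
        if not 1 <= start <= end <= protein_length:
            raise ValueError(f"peptide ({start}, {end}) is outside the protein")
        for position in range(start - 1, end):
            coverage[position] += nsc
    return coverage


def covered_segments(coverage):
    """Collapse per-position values into (start, end, value) runs of equal,
    nonzero cumulative NSC, using 1-based inclusive positions."""
    segments = []
    i = 0
    n = len(coverage)
    while i < n:
        if coverage[i] == 0:
            i += 1
            continue
        j = i
        while j + 1 < n and coverage[j + 1] == coverage[i]:
            j += 1
        segments.append((i + 1, j + 1, coverage[i]))
        i = j + 1
    return segments
```

Each returned segment corresponds to one rectangle in a row of the heat map, with the segment's value determining how bright its shading would be.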
Using this view, it is simple to see how patterns of protein coverage change between stages or fractions. Differences in this pattern may be the result of detecting proteoforms with overlapping peptides and provide some insight into the sequence composition of those proteoforms. It is also possible to review which peptides are contributing most significantly to the spectral count for a given protein, and in which mass fractions those specific peptides are most significantly represented.
All segments of protein coverage may be hovered over with the mouse pointer to view position in the protein, raw spectrum count, and NSC. Where peptides overlap, a row for a given stage or fraction may be expanded to view individual peptides. Individual peptides may be clicked on to view sequence, PSMs, and spectra associated with that peptide.
Application to a Biological Example
Viewing Underlying MS/MS Data
Conclusions
We have presented a web application and data resource designed to search, visualize, and interpret data generated by SEQUEST when applied to multiple mass fractions from multiple developmental stages of C. elegans. The application has been designed to not only illustrate how proteins may change between developmental stages but also to deduce whether proteoforms are present, the character of those proteoforms, and how they may be affecting the estimation of abundance for a given protein. The web application is freely accessible at http://www.yeastrc.org/wormpes/. All the instrument raw files and minimally-processed MS/MS data are available for download at the site.
The authors acknowledge support for this work by grants P41 GM103533, R01 DK069386, and U01 HG004263 from the National Institutes of Health, and the University of Washington Proteomics Resource (UWPR95794).