1 Introduction

The controversy surrounding the quality of discovery data generated from shotgun LC-MS proteomic experiments continues without resolution. For example, Ptolemy and Rifai [1] suggest a serious review of both the terminology and validation schema utilized in biomarker discovery experiments. The basis of their report is the apparent disconnect between the level of funding and effort associated with biomarker discovery and the limited number of protein biomarkers actually in use in routine clinical management. In addition, Ransohoff [2] details how the strong claims from both the genomic and proteomic biomarker initiatives suffer from relatively poor experimental design, reproducibility, and applicability. Likewise, White [3] discusses the potential cost of high-throughput proteomics, describing a culture that motivates laboratories to generate large lists of protein, peptide, and post-translational modification biomarker candidates, typically at the expense of accuracy and reproducibility. Fundamental data acquisition and data processing changes may be required to address accuracy and reproducibility issues [4–6]. The protein complement of comparable biological samples is known to be qualitatively and quantitatively similar. However, measurement efforts based on shotgun LC-MS experiments generally lack sufficient reproducibility. Stochastic and serendipitous data sampling arguments have been advanced as an explanation for why these experiments are not reproducible [7, 8], despite the protein complement, and the hydrophobicity and ionization efficiency of the proteolytic peptides, being largely the same. There is, however, a growing body of evidence that suggests that the selectivity of LC-MS/MS-based strategies may be insufficient to deal with the complexity of a proteolytically digested (sub)proteome [9–11].

A significant source of error in proteomic experiments results from the algorithmic interpretation of product ion spectra derived from chimeric and composite MS spectra. To date, in the instance of complex mixture experiments, most search engines do not acknowledge the fact that a typical data-dependent analysis (DDA) product ion spectrum is most likely to arise from co-fragmented peptides. Approximately two-thirds of all precursor ion detections in a complex protein digest mixture are at least two-and-a-half orders of magnitude lower in intensity than the most abundant ions [10, 12]. Consequently, the incidence of overlapping isotopic clusters of similar m/z and intensity is significant. The specificity of DDA acquisitions is challenged under such conditions, especially when the search engine peptide score is primarily based on the intensity of the matched product ions relative to the unmatched. Acquiring DDA data faster or with higher sensitivity hardly reduces these sources of error. An increase in speed and sensitivity without a concurrent increase in specificity will generally produce compromised information by generating more low abundant mixed spectra. Overloading the separation column can also exacerbate the chimeric and composite challenge since this will produce peak broadening and tailing of higher abundance peptides and enhances the incidence of interference. On the other hand, improving the overall separation capacity of the LC-MS method can have a positive impact on error rates. This is however only achieved if column flow rate, gradient and sample loading are harmonized to reduce the incidence of composite and chimeric spectra. A secondary benefit of increased chromatographic resolution is that chromatographic peak widths are reduced, which improves electrospray sensitivity by presenting a higher peptide concentration per unit time to the mass spectrometer. 
Multidimensional chromatography, when properly implemented, should have a positive impact on overall separation capacity [13, 14]. Alternatively, an additional dimension of separation such as ion mobility (IM) can be very effective in reducing chimeric and composite interferences and the benefits of this approach have been demonstrated in data-independent analysis (DIA) strategies [14, 15]. The application and combination of IM with DDA is less common. The technical advantages and limitations of DDA and DIA methods, including their main differences, have been discussed in detail [9, 10, 16]. Lastly, data processing errors associated with charge state assignment, de-isotoping and centroiding can be especially problematic when processing low abundance, overlapping isotopic cluster data [11, 17–19].

Different statistical approaches have been used to estimate the contribution to error from peptide sequence database search algorithms. The most widely used method to date has been the use of a decoy database strategy, whereby the decoy database is concatenated to the database of interest to infer a false positive rate (FPR) or false discovery rate (FDR) at the peptide and/or protein level [20–22]. This approach is based on the assumption that peptides from a random or reverse decoy database can be identified at a rate similar to that of the peptides from the original database [20]. The amino acid sequences of proteins are, however, not organized randomly or in a reversed manner. More specifically, the frequency of various sequence motifs commonly found in a given proteome may not be correctly represented in a decoy version of the database, resulting in an apparent low number of hits to the decoy database, in turn leading to an underestimation of the identification error rate [3]. Generic statistical tools have also been employed to calculate peptide and protein FDRs and FPRs, and their merits demonstrated [23–25]. Which peptides are identified and how they are scored varies significantly between the various employed methods.
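
The decoy-based error estimate described above can be illustrated with a minimal sketch. It assumes a concatenated target-decoy search and uses one common estimator (decoy hits divided by target hits above a score threshold); this is not necessarily the estimator used in the cited studies, and the function name and input layout are hypothetical.

```python
def estimate_fdr(matches, threshold):
    """Estimate the FDR at a score threshold for a concatenated target-decoy search.

    matches: list of (score, is_decoy) tuples, one per peptide identification.
    Assumption: FDR is approximated as (#decoy hits) / (#target hits).
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    if targets == 0:
        # No target hits pass the threshold, so there is nothing to estimate.
        return 0.0
    return decoys / targets


# Illustrative (made-up) scores only.
demo_matches = [(9.0, False), (8.0, False), (7.0, True),
                (6.0, False), (5.0, False), (2.0, True)]
print(estimate_fdr(demo_matches, threshold=5.0))
```

If decoy sequences under-represent the sequence motifs of the real proteome, as argued above, the decoy count in the numerator would be too small and the estimate would be optimistic.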

Assuming that high resolution exact mass measurement of peptide precursor and product ions, peptide fragmentation efficiency, relative retention time, and drift time are similar on comparable instruments, operated in a comparable manner, querying these metrics, as well as the relative intensity values of peptide precursor and product ions, should provide a means to identify and quantify proteins in complex mixtures. To that end, spectral library searching has been suggested as an alternative to the more traditional sequence database search approach [26]. In this strategy, an unknown spectrum is compared with a library of known spectra and a match achieved based on the similarity of physicochemical properties. Spectral searching and the use of libraries have been the premise of GC-MS for the interpretation of unknown spectra for some time [27–29]. Its utility in proteomics research has been further explored by a number of research groups, and it is moving gradually into more mainstream use and acceptance [30–34]. Spectral libraries must contain correctly identified spectra to have value. This generally requires the accumulation of replicate spectra, which results in challenging data storage and computational requirements, forming the motivation to cluster spectra and develop so-called spectral archives [35–37]. Currently, the sharing of experimental MS/MS data between laboratories to more effectively use spectral libraries and archives is not widespread. Several identification and data repositories such as PRIDE [38], Tranche [39], PeptideAtlas [40], and Peptidome [41] facilitate spectral upload, viewing, and comparison, but generally do not offer the ability to build validated composite MS/MS spectra, or conduct spectral or fragment ion searches against the validated composite spectra.

The work described in this paper demonstrates that the construction of a fragment ion repository using high-specificity product ion spectra, in combination with appropriate aggregation and query of the repository, provides promise as a strategy for characterizing complex protein digests both qualitatively and quantitatively. It will be shown how the strategy can be utilized in a targeted fashion to monitor the presence of a single or a number of proteins in a complex mixture as well as determining the stoichiometry of proteins in biological pathways. The strategy is based on maximizing the selectivity and specificity of the analytical workflow and on the use of signal replication. The method relies on technical and biological replication of DIA acquired precursor and product ion information of similar and dissimilar samples from a multitude of tissues and species, prepared, processed, and acquired in multiple laboratories. The quality of the fragment ion repository relies on the fact that no two datasets, either technical or biological, are likely to be fully identical in all analytical dimensions. The peptides and associated product ions detected in these mixtures illustrate however reproducible behavior and their physicochemical properties can be confirmed. These concepts and their application will be disclosed and discussed.

2 Materials and Methods

2.1 Relational Database Repository

A development repository derived from data-independent CID spectra of 740,278 redundant, 100,434 non-redundant peptide ions was created from 207 DIA LC-MS data sets of tryptic digests of various Rattus norvegicus and Mus musculus tissue and body fluid samples. The digestion methods and experimental conditions were generic and are described in more detail elsewhere [12, 42–45]. ProteinLynx Global SERVER ver. 2.4 was used as the database search algorithm for the preliminary data-independent identifications using either the reviewed entries of Rattus norvegicus (release 2010_11, 7,551 entries) or Mus musculus (release 2010_11, 16,320 entries) UniProtKB databases. Sequence information of internal standard proteins was added to the databases to normalize the data sets or to conduct quantification [46]. Guideline identification criteria were applied throughout [47]. In addition to the information provided by the search algorithm, including identification score and FDR [48], normalized fragment ion intensities f1 and f2 and a normalized peptide intensity p1 are calculated. Their definition and an explanation are provided in the Results section. In addition, the peptide retention times are standardized as described by Tarasova et al. [49], by obtaining linear fit parameters based on hydrophobicity [50] and standardizing to a reference. This approach affords initial population of the repository with orthogonal identification information from different instruments and laboratories. A variant based on repository content-based retention time normalization is applied in this study, which requires the upload of a sufficient number of identification results and in silico information to construct a reliable and robust reference retention normalization mechanism. Grouping normalized parameters f1, f2, and p1 creates a so-called ion map, the utility of which will be explained in detail in Section 3. Finally, a normalized protein molar amount P1 is expressed.
Parameters f1, f2, p1, and P1 are related to unique repository identifiers. In addition, accurate precursor and product ion mass information is uploaded into the relational database, as well as, if applied, their associated ion mobility values. Direct comparison of the experimental drift time with a standard mobility database is likely to enhance peptide ion identification [51]. As for mass, drift time values are not normalized or standardized. The relational database and queries rely on the native accuracy and precision of the mass and mobility measurements.
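
The retention time standardization outlined above can be sketched as follows. This is a minimal illustration, assuming that each run is described by a least-squares line of retention time against peptide hydrophobicity and that standardization maps the run onto the line of a reference; the function names and the example numbers are hypothetical and are not taken from the cited method.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standardize_rt(rt, run_fit, reference_fit):
    """Map a retention time from one run onto the reference scale.

    Assumption: the run's fit is inverted to a hydrophobicity-like value,
    which is then projected through the reference fit.
    """
    slope, intercept = run_fit
    ref_slope, ref_intercept = reference_fit
    hydrophobicity = (rt - intercept) / slope
    return ref_slope * hydrophobicity + ref_intercept


# Made-up calibration peptides: hydrophobicity index vs. observed RT (min).
run_fit = linear_fit([10.0, 20.0, 30.0], [15.0, 25.0, 35.0])
reference_fit = (2.0, 0.0)  # hypothetical reference slope and intercept
print(standardize_rt(25.0, run_fit, reference_fit))
```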

Taxonomy (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) and tissue (http://www.brenda-enzymes.info/ontology/tissue/tree/update/update_files/BrendaTissueOBO) identifiers are appended as ancillary, queryable information to the content of the repository during upload of the data. In addition, the protein accession numbers are mapped to the UniParc database [52].

2.2 Isoform/Homology Filtering

The protein concentrations are estimated as described [46]. Briefly, the average ion intensity of the three most abundant peptides identified to a protein is standardized to that of an internal standard spiked into the sample at known concentration. However, the observed signal intensity of sequence common peptides can be a summed value arising from redundant identifications. This is advantageous from a qualitative perspective since the intensity of the redundant peptides is cumulative. From a quantitative perspective, it hampers data analysis, especially if the contribution of the individual protein isoform cannot be addressed or accessed. Certain quantification schemes therefore disregard or down-weight these peptides to express a quantitative value, which could be problematic for highly homologous proteins since the number of proteotypic peptides could be small. An extension to the earlier presented absolute quantification schema is discussed.

The average intensity, in contrast to calculating the average intensity of the n best ionizing peptides, is calculated from the n most abundant proteotypic peptides. These averaged intensities are subsequently used to segment the total observed intensity of the common peptide belonging to each parent protein. In instances where no proteotypic peptides can be identified, the identified proteins will be grouped and an absolute amount assigned to the group as a whole. Next, the peptides are re-ordered based on their segmented intensities for the sequence common and non-segmented intensities of the proteotypic peptides and the molar amounts calculated. The segmentation process is illustrated in Supplementary Figure 1. This leads to improved estimation of the amount and concentration of protein isoforms and homologues. In the instance of comparative analysis, the method also provides a better estimate of the relative amounts or fold changes for different homologous proteins between two or more conditions, since the information content obtained from non-proteotypic peptides is more detailed and comprehensive.
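
A minimal sketch of the quantification and segmentation steps described above is given below. It assumes that the intensity of a sequence-common peptide is split in proportion to the average intensity of the n most abundant proteotypic peptides of each parent protein; the proportional split, the function names, and the example values are assumptions made for illustration.

```python
from statistics import mean


def top_n_mean(intensities, n=3):
    """Mean of the n most abundant intensities (fewer if fewer are available)."""
    top = sorted(intensities, reverse=True)[:n]
    return mean(top) if top else 0.0


def estimate_amount(protein_intensities, standard_intensities, standard_amount, n=3):
    """Amount of a protein relative to a spiked internal standard of known amount."""
    return top_n_mean(protein_intensities, n) / top_n_mean(standard_intensities, n) * standard_amount


def segment_shared_intensity(shared_intensity, proteotypic, n=3):
    """Split the intensity of a sequence-common peptide across its parent proteins.

    proteotypic maps a protein identifier to its proteotypic peptide intensities.
    Assumption: shares are proportional to each protein's top-n mean.
    """
    weights = {protein: top_n_mean(values, n) for protein, values in proteotypic.items()}
    total = sum(weights.values())
    if total == 0:
        # No proteotypic evidence: the proteins would be reported as one group.
        return None
    return {protein: shared_intensity * w / total for protein, w in weights.items()}


shares = segment_shared_intensity(400.0, {"A": [300, 300, 300, 10], "B": [100, 100, 100]})
print(shares)
```

Returning `None` when no proteotypic peptides exist mirrors the grouping rule in the text, where an absolute amount is assigned to the protein group as a whole.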

2.3 Fragment Ion Repository/Relational Database

The fragment ion repository is a multi-user application, with the information stored in a mySQL relational database server and the website run on Apache. Data entry and queries are achieved using a combination of PHP, Perl, and JavaScript. A single or multiple zip archives are uploaded to the server. Each zip file consists of the search results in comma-separated values format and the search criteria in text file format. The latter also holds instrumental performance and acquisitions settings and is used as a first pass assessment of the quality of the uploaded data. Equivalent and additional proteomics LC-MS performance metrics [53, 54] can be readily retrieved from the repository content for one-dimensional LC-MS experiments. An example is provided in Supplementary Figure 2, illustrating the interquartile retention time range, median retention time, interquartile retention time ratio, average chromatographic peak width at half height and number of identified peptides/min within the interquartile retention time range prior to retention time standardization, excluding in-source fragment, losses, and variable modifications. Performance metrics have intrinsic value but are not discussed in detail.
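
To make the relational layout more concrete, the following sketch builds a toy schema and runs one query. It uses Python's built-in sqlite3 module as a stand-in for the mySQL server described above; the table names, column names, accessions, and sequences are all hypothetical and do not reflect the actual repository schema.

```python
import sqlite3


def build_demo_repository():
    """Create an in-memory toy repository with a peptide-to-protein mapping."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE protein (id INTEGER PRIMARY KEY, accession TEXT, taxon_id INTEGER);
        CREATE TABLE peptide (id INTEGER PRIMARY KEY, sequence TEXT);
        CREATE TABLE peptide_protein (
            peptide_id INTEGER REFERENCES peptide(id),
            protein_id INTEGER REFERENCES protein(id)
        );
    """)
    # NCBI taxonomy identifiers: 10090 = Mus musculus, 10116 = Rattus norvegicus.
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                     [(1, "PROT_A", 10090), (2, "PROT_B", 10116)])
    conn.executemany("INSERT INTO peptide VALUES (?, ?)",
                     [(1, "UNIQUEK"), (2, "SHAREDK")])
    conn.executemany("INSERT INTO peptide_protein VALUES (?, ?)",
                     [(1, 1), (2, 1), (2, 2)])
    return conn


def shared_peptides(conn):
    """Return (sequence, number of parent proteins) for sequence-common peptides."""
    return conn.execute("""
        SELECT p.sequence, COUNT(DISTINCT pp.protein_id)
        FROM peptide AS p
        JOIN peptide_protein AS pp ON pp.peptide_id = p.id
        GROUP BY p.id
        HAVING COUNT(DISTINCT pp.protein_id) > 1
        ORDER BY p.sequence
    """).fetchall()


conn = build_demo_repository()
print(shared_peptides(conn))
```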

3 Results

The upload, annotation, and processing workflow of data-independent analysis (DIA) LC-MS data into the fragment ion repository is shown in Figure 1. A more detailed description of this process and the developed software is provided in the Materials and Methods section. Briefly, the uploaded results are initially quality flagged on the basis of the automatically derived precursor/product ion search tolerances and data resolution as calculated by the utilized search engine. This information is currently stored in the database and provides the possibility to reject data in case the quality is suspicious from an MS perspective. Currently, no rejection criteria are applied during the database population process. LC quality metrics are under study; their significance is discussed in the Materials and Methods section and illustrated in Supplementary Figure 2. The protein accession numbers are mapped to a universal identifier using user-provided species taxonomy as a filter. During this process, tissue information is appended to the processed data as well. Next, the data are processed, i.e., normalized on retention time, fragment ion intensity, precursor ion intensity, and estimated molar amounts, and the fragment ion to multiple parent peptide and protein relationships are determined. These normalized parameters describe fragment and precursor ion relationships and are calculated as follows:

$$ \mathrm{I}_{\text{fragment ion f, p}} \,/\, \mathrm{I}_{\text{peptide ion p}} \qquad (\text{f1}) $$

and

$$ \mathrm{I}_{\text{fragment ion f, p}} \,/\, \sum\nolimits_{\mathrm{i}} \mathrm{I}_{\text{fragment ion f(i), p}} \qquad (\text{f2}) $$

where I_{fragment ion f, p} is the intensity of fragment ion f of peptide p and I_{peptide ion p} is the intensity of peptide p. Normalized ratio f1 describes the fragmentation efficiency for a given fragment ion in relation to the precursor intensity, and normalized ratio f2 describes the preferred fragmentation pathway of a particular fragment ion for a given sequence. Alternatively, the intensities of y″max and bmax could be subtracted from the precursor and summed fragment ion intensity to account for unfragmented peptide precursor. Relative, normalized peptide intensities are calculated as follows:

$$ \mathrm{I}_{\text{peptide p, P}} \,/\, \sum\nolimits_{\mathrm{i}} \mathrm{I}_{\text{peptide p(i)}} \qquad (\text{p1}) $$

where I_{peptide p, P} is the intensity of the precursor ion of peptide p of protein P. Finally, a normalized protein molar amount is expressed:

$$ \mathrm{n}_{\mathrm{P}} \,/\, \sum\nolimits_{\mathrm{i}} \mathrm{n}_{\mathrm{P(i)}} \qquad (\text{P1}) $$

where n_P is the estimated (molar) amount of protein P [47]. Relative and normalized molar amounts can be used for stoichiometry and pathway analysis purposes. Already foreseen is a final, automated validation and curation process, based on variation converging with either technical or biological experiment increment, before final upload of the normalized spectral information into the fragment ion repository/relational database. In order to demonstrate the importance of normalized DIA spectrum intensity values and their use to create a fragment ion repository, no attempts were made to remove outlier data at this stage.
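
The four normalized parameters defined above can be computed with a few lines of code. The sketch below follows the definitions of f1, f2, p1, and P1 directly; the function names, dictionary layout, and example intensities are hypothetical and chosen for illustration only.

```python
def fragment_ratios(fragment_intensities, precursor_intensity):
    """Return (f1, f2) for the fragment ions of one peptide.

    f1: fragment intensity relative to the precursor (peptide ion) intensity.
    f2: fragment intensity relative to the summed fragment ion intensity.
    """
    total = sum(fragment_intensities.values())
    f1 = {ion: value / precursor_intensity for ion, value in fragment_intensities.items()}
    f2 = {ion: value / total for ion, value in fragment_intensities.items()}
    return f1, f2


def normalize_to_sum(values):
    """Divide each value by the sum of all values.

    Applied to the peptide intensities of a protein this gives p1;
    applied to the estimated molar amounts of the proteins in a sample it gives P1.
    """
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


# Made-up intensities for one peptide and its three fragment ions.
f1, f2 = fragment_ratios({"y5": 200.0, "y6": 600.0, "b3": 200.0}, precursor_intensity=2000.0)
p1 = normalize_to_sum({"peptide_1": 500.0, "peptide_2": 1500.0})
print(f1["y6"], f2["y6"], p1["peptide_2"])
```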

Figure 1

Workflow/activity diagram of the upload, annotation, processing, and creation of a multi-user relational database-based, data-independent fragment ion repository. Data are uploaded and species and tissue taxonomy meta data entered. The calculation of fragment, precursor, protein relationships, and standardized elution times is conducted automatically and appended to the database, including species-independent protein accession number mapping. A web front-end is used to query the database

The results of 207 DIA LC-MS experiments were uploaded to the relational database, representing the experimental results from six different laboratories and 10 tissue types. Currently, the database holds in total the identification results from 69,907 proteins and 1,032,110 peptides. Note that these results do not represent unique protein identifications, but unique protein-sample and peptide-sample combinations. Peptide identifications include in-source fragments and the losses of water and ammonia from precursor ions. These peptides can be readily mapped to multiple proteins, the relationships of which can be retrieved by means of standard SQL queries. The total number of redundant fragment ion identifications equals 7,480,798. This number of identifications is reduced to 2,181,901 by excluding the y and b ion losses of water and ammonia, ymax, bmax, in-source fragments, and variable modifications. More than 85.6% of these fragment ion identifications replicated at least twice across the database content. With decoy entries excluded, this amounts to 86.5%. This highlights the quality of the current content of the fragment identification repository, which is a primary prerequisite, as will be illustrated later, for successful and confident identifications. In contrast, only 8.3% of the decoy entries were identified in at least two LC-MS experiments. This number readily decreases to 0.7% when a replication rate of at least five is applied. With the same criteria applied, the replication rate of the non-decoy fragment ion identifications equals 72.5%. This is expected, as decoy signals are not amplified during the process of identification replication. Interestingly, two fragment ions, associated with two particular data sets, account for the majority of the replicating decoy identifications. Excluding these two ions reduced the number of replicating decoy fragment ion identifications further to 0.4%.
The same logic was applied throughout the analysis of the database content unless mentioned otherwise. These basic database entry statistics suggest however that accurate mass product ion signal replication is a very strong metric and that the value of product ion mass accuracy with respect to improving specificity is most likely underestimated by most sequence search algorithms.
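
The replication statistics discussed above can be sketched as a simple count of the number of distinct experiments in which each fragment ion identification occurs. The sketch assumes that the rate is expressed over distinct fragment ions, which may differ from how the reported percentages were derived; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def replication_rate(identifications, min_experiments=2):
    """Fraction of distinct fragment ions observed in at least min_experiments runs.

    identifications: iterable of (fragment_key, experiment_id) pairs.
    """
    experiments_per_fragment = defaultdict(set)
    for fragment_key, experiment_id in identifications:
        experiments_per_fragment[fragment_key].add(experiment_id)
    if not experiments_per_fragment:
        return 0.0
    replicated = sum(1 for runs in experiments_per_fragment.values()
                     if len(runs) >= min_experiments)
    return replicated / len(experiments_per_fragment)


# Toy data: fragment "a" replicates, "b" and "c" are each seen in one run only.
demo = [("a", 1), ("a", 2), ("b", 1), ("c", 1), ("c", 1)]
print(replication_rate(demo, min_experiments=2))
```

Running the same function separately on target and decoy entries, and at thresholds of two and five experiments, would give the kind of comparison reported in the text.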

The fragment ion relationships f1 and f2 describe the fragmentation efficiency for a given fragment ion in relation to the precursor intensity and the preferred fragmentation pathway of a particular fragment ion for a given sequence. Examples for f1 and f2 for three highly replicating fragment ions are shown in Figs. 2 and 3, respectively. Outlier data are marked by a black square and are indicative of incorrect identifications or interfered ion intensity measurements. The value and specificity of f1 and f2 increase substantially by supplementing the fragment ion signature with additional orthogonal information, such as standardized peptide retention time [49], drift time [51, 55] or collision cross section [56]. Supplementary Figure 3 illustrates the addition of the standardized retention time to f1 and f2, thereby creating a subset ion map for the three fragment ions of interest, indicating substantially improved specificity compared with the results shown in Figs. 2 and 3. Panel (a) of Supplementary Figure 3 illustrates that experimental retention times correlate linearly with predicted normalized hydrophobicity. Panel (b) shows f1 and f2 as a function of non-standardized (raw) retention time, and panel (c) f1 and f2 as a function of retention time after normalization and standardization [49]. Drift time and collision cross section information were not acquired and/or available for the results described. Drift time is expected to further increase specificity [51], whereas cross section information is believed to be useful for the analysis of post-translationally modified peptides [56]. Median values for f1 and f2 were 31.8% and 28.5%, respectively. Fragment ion relationships f1 and f2 can be utilized to create queryable library-like database spectra. An example is shown in Fig. 4. The circles represent curated average database f2 values with the database frequency in parentheses. A normalized experimental spectrum is superimposed.
The relative intensity of the majority of the observed fragment ions is in agreement with the database entries, contradicting a recent study [57]. Moreover, the spectrum indicates remarkable similarity with previously presented data from other species and sample types [46]. For this particular example, the absolute normalized fragment ion intensity standard deviation was as high as 50% to 70% for the lower abundant product ions with lower ion statistics, whereas for the more abundant fragment ions, this value is closer to 35%.

Figure 2

Intra- (tissue/experiment) and inter- (sequence) relative fragment ion intensity f1 distribution examples for one- and two-dimensional (*) LC-DIA-MS experiments of different mouse and rat proteomes. The statistical distribution of ratio f1 values for three fragment ions originating from three different peptides is shown as a box plot for each of the experiments deposited in the database. The overall distribution of the ratio f1 values for each of the fragment ions is shown in the right-hand panel. Outlier data are marked as black squares

Figure 3

Intra- (tissue/experiment) and inter- (sequence) relative fragment ion intensity f2 distribution examples for one- and two-dimensional (*) LC-DIA-MS experiments of different mouse and rat proteomes. The statistical distribution of ratio f2 values for three fragment ions originating from three different peptides is shown as a box plot for each of the experiments deposited in the database. The overall distribution of the ratio f2 values for each of the fragment ions is shown in the right-hand panel. Outlier data are marked as black squares

Figure 4

Experimental DIA fragment ion spectrum of VEIIANDQGNR overlaid with the normalized f2 fragment ion intensities of Hspa5 (78 kDa glucose-regulated protein) from mouse and rat. Color legend: grey = experimental spectrum, red = y ion, blue = b ion and green = neutral loss; marker legend: * = neutral loss of ammonia and º = neutral loss of water

An additional benefit of spectral library-like searches is demonstrated in Supplementary Figure 4 for various cancer cell line samples of mammalian species similar to the described organisms under study. The left hand side of Supplementary Figure 4 illustrates the absolute increase in number of identified proteins, corresponding to relative increases of 18%, 28%, and 4% for PC-3, MDA-MB-231, and Hep-G2, respectively. The right hand side shows the identification distributions of the sample for both approaches, i.e., database and repository centric. For the latter, an ion match tolerance of ±10 ppm was used. In addition, the retention and drift times, as discussed in the Materials and Methods section, were normalized and standardized as the experiments were conducted on different instruments in different laboratories. The match tolerances were ±1 min and ±1 drift time bin, respectively. As can be noticed from the presented results, the sample common number of identified proteins substantially increased, whereas the number of sample unique proteins decreased, by using a spectrum library-like search approach. More importantly, this protein identification increment was primarily achieved through ion detections from the lower concentration ranges of the dynamic range of the studied cancer cell line proteomes.
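
The tolerance-based matching used for the repository-centric search can be sketched as follows, using the stated tolerances of ±10 ppm, ±1 min, and ±1 drift time bin. The dictionary keys, function names, and example values are hypothetical, and the retention and drift times are assumed to be already normalized and standardized.

```python
def ppm_error(observed_mz, reference_mz):
    """Mass error of an observed m/z relative to a reference m/z, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def matches_entry(observed, entry, ppm_tol=10.0, rt_tol=1.0, drift_tol=1):
    """True if an observed ion falls within all three tolerances of a repository entry."""
    return (abs(ppm_error(observed["mz"], entry["mz"])) <= ppm_tol
            and abs(observed["rt"] - entry["rt"]) <= rt_tol
            and abs(observed["drift_bin"] - entry["drift_bin"]) <= drift_tol)


# Made-up repository entry and two observed ions.
entry = {"mz": 500.000, "rt": 30.0, "drift_bin": 50}
close_ion = {"mz": 500.004, "rt": 30.5, "drift_bin": 51}   # about 8 ppm away
far_ion = {"mz": 500.006, "rt": 30.5, "drift_bin": 51}     # about 12 ppm away
print(matches_entry(close_ion, entry), matches_entry(far_ion, entry))
```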

The median calculated value for p1 equaled 44.2% (n = 150), ranging from 23.7% (n = 43) to 45.6% (n = 50) for individual experiments. The results from the two-dimensional LC-MS experiments were excluded from p1 value trend analysis since different, first dimension-dependent, non-linear second dimension gradients were applied in order to optimally utilize the available chromatographic space/increase system peak capacity. Precursor intensities arising from multiple fractions are typically summed in the instance of two-dimensional data-independent LC-MS experiments, and the measurement error can, therefore, be slightly higher than expected. This could be overcome by extracting the precursor and product ion intensity from only one second dimension gradient separation, preferably the dominant contributing one in terms of identification confidence and ion statistics, in order to calculate fragment ion relationships f1 and f2. This was, however, not considered at this time. As previously mentioned, no attempts were made to remove outlier data, which could be part of the earlier proposed automatic digital curation process with the upload of new results. Alternatively, machine learning algorithms could be implemented and employed [58]. Examples for p1 for three highly replicating peptides are shown in Supplementary Figure 5. In this instance, the two-dimensional LC-MS data are included to illustrate that they exhibit somewhat more scatter. Peptides are expected to have similar ionization efficiencies, regardless of the sample and protein origin [12, 46]. This could go readily unnoticed in a DDA experiment, especially in the case of chimeric or composite spectrum instances. Some of the problems associated with chimeric events were emphasized already in the Introduction section. Undoubtedly, amino acid sequence-ionization efficiency relationships cannot be established when fragment ion spectra are incorrectly annotated.
Moreover, DDA experiments are duty cycle limited, thereby limiting the opportunity to detect and identify all peptides of interest [10]. This would be especially the case for high in-spectrum dynamic range occurrences in combination with automatic gain control, as applied with trap based mass analyzers, whereby low abundant peptides are unnoticed [59]. DIA acquisition methods are, therefore, arguably more suited to quantify ionization efficiencies by means of electrospray LC-MS as they are only detection limited. Superimposed chimeric DIA fragmentation spectra are searched with dedicated search engines since they, by default, arise from co-eluting, non-isolated peptides [46]. Species- and tissue-independent peptide ionization efficiency consistency is illustrated in Fig. 5, where the average p1 value, the related coefficient of variation, and replication rate are summarized for Aldolase A, B, and C for both rat and mouse using the amino acid sequence of Aldolase A from mouse as the alignment reference. The tryptic fragment number annotation of the latter is shown in Supplementary Figure 6. Despite sequence differences between the protein isoforms and species, similar precursor intensities can be observed. Normalized peptide p1 intensity values can be used for example for pathway analysis since peptides identified to associated proteins are also expected to show relative ionization distribution efficiency similarity [60].

Figure 5

Relative fragment ion intensity p1 isoform distribution and commonality. Isoform-independent tryptic fragment number annotation is provided in Supplementary Figure 6. Marker legend: size = average relative precursor ion intensity p1, angle arrow = variance relative precursor ion intensity p1 and color = tryptic fragment number

Proteotypic information and estimated relative within-sample molar amounts P1 were retrieved from the repository for 62,760 proteins, representing 2971 non-redundant, replicating identifications from non-fractionated mammalian samples. The actual total number of protein uploads equaled 69,907. In other words, proteotypic, quantifiable information was obtained for 89.8% of the identified proteins. Moreover, the repository currently holds 2266 replicating, species-independent genes, whereas to date 16,106 non-redundant and reviewed (curated) primary gene names can be retrieved from the utilized protein sequence databases, equaling a 14.1% depth of genome coverage. This very high volume of genomic and proteotypic content stems from the use of high-quality, curated databases and, equally important, a data-independent scanning approach and search algorithm that can extract the required type of information from complex samples. As an example, Fig. 6 illustrates normalized molar amounts of mitochondrial elongation factor Tu and cytoplasmic actin identified in both rat and mouse in various tissues and indicates that the protein abundance level of both proteins is similar across the investigated samples. The relative abundance between the two proteins of interest is approximately 30-fold and is relatively consistent across all tissue types and species. Mitochondrial elongation factor Tu was consistently low in abundance and not quantified in one of the samples. Cytoplasmic actin is more abundant in one of the investigated tissue samples, which may be biologically relevant for the sample and/or perturbation under study or sample preparation procedure related. The relational database captures, however, meta data during upload, which could be used to investigate discrepancies in more detail or filter the results. Relative molar amount comparison between tissues would benefit from geometric normalization as would be more typically applied in microarray analysis [61]. 
It has recently been demonstrated that this normalization technique can also be applied to data obtained from label-free, data-independent experiments [62]. This information can be utilized for both inter- and intra-sample stoichiometry analysis [62–64], as shown in Supplementary Figure 7, which illustrates the within-tissue consistency for a well-described multi-enzyme complex. The E1α (PDHA1), E1β (PDHB), and E2 (DLAT) subunits of the pyruvate dehydrogenase complex were normalized to E3 (DLD), which is also associated with other protein complexes. Good within-tissue agreement with the expected 1:1:1 subunit ratio was observed for the majority of the samples in which the proteins of interest were identified and quantified. The E1α subunit did not follow the commonly observed trend for the cerebral cortex samples and was identified at an approximate 1:1.5 ratio vs. the two other subunits.
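The subunit normalization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `normalize_to_reference` and the molar amounts are hypothetical and serve only to show how subunit amounts could be expressed relative to a reference subunit such as E3 (DLD) and then compared with an expected 1:1:1 ratio.

```python
from typing import Dict


def normalize_to_reference(amounts: Dict[str, float], reference: str) -> Dict[str, float]:
    """Divide each subunit's estimated molar amount by that of the reference subunit."""
    ref = amounts.get(reference)
    if ref is None or ref <= 0:
        raise ValueError(f"reference {reference!r} is missing or was not quantified")
    return {name: value / ref for name, value in amounts.items() if name != reference}


# Hypothetical within-sample molar amounts (arbitrary units) for one tissue sample.
sample = {"PDHA1": 2.0, "PDHB": 2.1, "DLAT": 1.9, "DLD": 4.0}

# Step 1: normalize E1alpha, E1beta, and E2 to E3 (DLD).
ratios = normalize_to_reference(sample, "DLD")

# Step 2: express the normalized subunits relative to one of them (here PDHB)
# so that the result can be read against the expected 1:1:1 stoichiometry.
base = ratios["PDHB"]
relative = {name: round(ratio / base, 2) for name, ratio in ratios.items()}
print(relative)
```

Values close to 1.0 for all three subunits would indicate agreement with the expected stoichiometry, whereas a value near 1.5 or 0.67 would correspond to the kind of deviation reported for E1α in the cerebral cortex samples.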

Figure 6

Examples of the relative distribution of estimated within-sample molar amounts P1 for one-dimensional LC-DIA-MS experiments of different mouse and rat proteomes

3.1 Concept and Outlook

Recent instrument developments and LC-MS based proteomics techniques have considerably improved the speed of analysis, depth of protein coverage, and information content that can be obtained from complex biological sample mixtures. Despite these impressive developments, variation in identification and quantification is still a concern and, thus, alternative and complementary methods are still required. The value and use of a data-independent fragment ion repository has therefore been explored. The sensitivity and selectivity required for protein identification and quantification have been demonstrated in the previous paragraphs. More conceptually, the following schema can be considered for validation and/or confirmation, hypothesis-driven studies, or selected reaction monitoring (SRM)/multiple reaction monitoring (MRM) method development [65, 66].

  1. A minimum of three peptides with three product ions each are selected from any given protein. The fragment selection is based upon the highest replication rate and the smallest signature product ion variation across both experiments and samples.

  2. In addition, similar precursor and product ions, three times three for the complete ‘protein set’, are selected from a second and a third protein that are consistently present with the protein of interest. Together, these proteins outline a fragment ion signature (i.e., ion map).

  3. Unknown samples can subsequently be mapped against the fragmentation database signature to validate the presence of the target protein, with the additional ions and their associated intensity ratios acting as an internal validation mechanism.
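The selection rule in step 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the repository's actual query logic: the `Fragment` record, the `select_signature` function, the peptide labels, and all numbers are hypothetical, and the ranking (replication rate first, coefficient of variation as tie-breaker) is only one possible reading of "highest replication rate and smallest variation".

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, NamedTuple


class Fragment(NamedTuple):
    peptide: str        # hypothetical peptide label
    ion: str            # product ion annotation, e.g. "y4"
    replication: float  # fraction of runs in which the ion was observed
    cv: float           # coefficient of variation of its normalized intensity


def select_signature(fragments: List[Fragment], n_peptides: int = 3,
                     n_ions: int = 3) -> Dict[str, List[str]]:
    """Pick n_peptides peptides, each with n_ions product ions, for one protein."""
    by_peptide = defaultdict(list)
    for fragment in fragments:
        by_peptide[fragment.peptide].append(fragment)

    candidates = []
    for peptide, ions in by_peptide.items():
        # Within a peptide: highest replication first, then smallest variation.
        best = sorted(ions, key=lambda f: (-f.replication, f.cv))[:n_ions]
        if len(best) == n_ions:  # skip peptides with too few product ions
            candidates.append((peptide, best))

    # Across peptides: rank by mean replication, then by mean variation.
    candidates.sort(key=lambda c: (-mean(f.replication for f in c[1]),
                                   mean(f.cv for f in c[1])))
    return {peptide: [f.ion for f in best] for peptide, best in candidates[:n_peptides]}


# Hypothetical repository content for a single protein.
fragments = [
    Fragment("PEP_A", "y4", 1.00, 0.05), Fragment("PEP_A", "y5", 0.90, 0.10),
    Fragment("PEP_A", "y6", 0.90, 0.08), Fragment("PEP_A", "b3", 0.50, 0.30),
    Fragment("PEP_B", "y3", 0.80, 0.12), Fragment("PEP_B", "y4", 0.80, 0.15),
    Fragment("PEP_B", "y7", 0.70, 0.20),
    Fragment("PEP_C", "y5", 1.00, 0.04), Fragment("PEP_C", "y6", 1.00, 0.06),
    Fragment("PEP_C", "b4", 0.95, 0.07),
    Fragment("PEP_D", "y2", 1.00, 0.02), Fragment("PEP_D", "y3", 1.00, 0.03),
    Fragment("PEP_E", "y4", 0.60, 0.25), Fragment("PEP_E", "y5", 0.60, 0.30),
    Fragment("PEP_E", "y8", 0.50, 0.20),
]
signature = select_signature(fragments)
print(signature)
```

Applying the same function to the second and third co-occurring proteins of step 2 would yield the three-times-three ion sets that together form the fragment ion signature against which unknown samples are mapped in step 3.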

Various statistical and computational tools and methods are currently being considered and implemented for the analysis of the content of the fragment ion repository, in order to facilitate the above and a more mathematical accounting of the information that resides in the relational database [67]. These developments, query tools, and the public section of the repository will become open source and can be followed at: http://sites.duke.edu/ionmap/. The fragment ion database can currently only be populated with qualitative results obtained through DIA experiments, also known as LC-MSE [68]. DDA experiments generally do not afford precursor and product ion intensity measurements across the complete chromatographic peak, or MS and MS/MS intensity recording for the same amount of time with the same gain applied. Hence, calculating normalized fragment and precursor ion intensities such as f1 and p1 could be more challenging. However, it has been demonstrated previously that DDA and DIA product ion spectra share great similarity [9], suggesting that normalized f2 values and aggregate MS/MS spectra derived from DDA spectra could be used to complement the content of the repository, which may hold great value for the more targeted analysis of fractionated or enriched samples. The presented concepts can be easily transferred to other application areas, including lipidomics [69] or metabolomics [70], facilitating the characterization and quantification of other molecule types. As spectral libraries and fragment ion repositories find more widespread use in proteomics, some of the remaining objections will be resolved. In addition to the identification of fragment ions, peptides, and proteins, data-independent fragment ion repositories have great potential with regard to the quantification of protein abundances, stoichiometry, and the reliable quantification of post-translational modifications.
In conclusion, repositories are a valuable addition to the requirements of systems biology, not only allowing quantitative analysis of low-abundance proteins, but also delivering reliable quantitative data when proteins are analyzed across multiple samples in multiple laboratories.