Introduction

A primary goal of proteomics is to confidently identify, sequence, and/or quantify all peptides from a complex mixture in a high throughput manner. This task is conventionally addressed with a “bottom up” strategy using liquid chromatography tandem mass spectrometry (LC-MS/MS). However, this strategy often misses many of the peptides that hold important biological relevance. Post-translational modifications (PTMs) of proteins are often not found, yet known to regulate a myriad of cellular mechanisms, and identification of these modifications leads to a better understanding of the components of the signaling networks that modify proteins. Signaling networks are found to be increasingly complex, as evidence mounts that signaling events require combinatorial modifications on a single protein working together to modulate the protein’s function. Mass spectrometry-based proteomics has become a fundamental tool in the identification of proteins and PTMs from both complex cellular systems and simple protein mixtures [25]; however, identification of combinatorial modifications remains an extremely important and challenging problem. Strong evidence of PTM peptides is readily available in high-resolution MS1 data and can be mined to substantially support the identification process. We believe this approach has been underutilized and describe a computational method to complement existing techniques in this important task.

The conventional technique for identifying peptides and proteins has been the “shotgun proteomics” approach, in which the protein(s) of interest are denatured and digested (usually with the enzyme trypsin), and the resulting peptides are separated using liquid chromatography. This is followed by tandem mass spectrometry (MS2) using data-dependent acquisition (DDA), in which a survey scan is first acquired and ions are then selected for fragmentation based on highest abundance and/or other predefined criteria. MS/MS spectra are generated, and search engine software identifies the peptides/proteins by comparing each observed spectrum to a theoretical spectrum for a given peptide sequence [6–26]. The method is straightforward to operate, as it is not necessary to know in advance which proteins are being targeted, and it performs well in common proteomic studies that ask specifically which proteins are present.

Despite these benefits, DDA has limitations, including limited dynamic range, limited reproducibility, and a bias toward abundant peptides. Relying solely on MS2 data from DDA to identify peptides is often problematic, especially for peptides that are post-translationally modified. Identification by tandem mass spectrometry is only possible if a peptide is selected for fragmentation, and PTMs typically exist on only a portion of a protein species, making them substoichiometric. In many cases, phosphorylation has been detected at less than 1% of total protein concentration [27]. Typical experiments rely on peak intensity to select ions for fragmentation, so modified species are routinely missed with this approach [24]. Furthermore, peptide sequencing is commonly performed with collision-induced dissociation (CID), and its fragmentation-related nuances, in particular where along the backbone the peptide bonds break, can produce MS2 spectra that are difficult to decipher [15, 17].

Several bioinformatic approaches have been developed to address the difficulty of PTM identification. The vast majority are based on analysis of MS2 data using a peptide spectrum match (PSM) algorithm, in which the observed spectrum is compared with a theoretical spectrum for a given peptide sequence [6–26]. De novo sequencing is another method for identifying PTMs from MS2 data, in which the mass differences between peaks in the mass spectrum are matched to known residue masses, allowing the sequence to be derived solely from the spectral data. This method has been previously described [28–35], and while it is somewhat limited to high quality data and has lower throughput, it has less of a computational limitation compared with the database search approach. These strategies have been used almost exclusively because the fragmentation data in MS2 spectra allow both the site of modification and the sequence of the modified peptide to be determined. As a result, the development of software tools for the identification of PTMs from MS2 data has been an extremely active area of research.

More recently, investigators have adopted unrestrictive search approaches [36–39], in which all possible modifications are searched at once. When combined with conventional protein database search strategies, these methods are limited by the number of simultaneous variable peptide modifications that can be searched. This is due to the exponential nature of the problem and the limits of existing computational capacity. Although computational capacity has doubled approximately every two years for more than half a century [40, 41], large numbers of modifications in combination still produce a very large number of permutations. This leads to a problem that is often prohibitively large or NP-complete [42], especially when unrestrictive searches are run.

Filtering to reduce the amount of data that must be considered has become a central approach for addressing this issue. It appears to be a reasonably effective strategy for the analysis of MS2 data, and several variations on this theme have been described [43–48]. Some have tackled the problem by using parallelization strategies [49]. In addition, unrestrictive searches are prone to high false discovery rates, requiring each search result to be examined. Tanner et al. [50] addressed this issue with PTMFinder. Database search approaches have also been described that iterate the search multiple times to significantly reduce the number of possible sequences considered. As a result, searches are completed more quickly, with fewer computational resources and with more stringent parameters [51, 52]. The large amount of research in this field underscores the importance of the problem, and although significant advances have been made, much remains to be done to address the challenges in this domain.

Recently, instruments have been able to function in data independent acquisition (DIA) mode in which MS/MS data are generated from all of the sample precursors. There are several different methods, but they are all essentially accomplished by fragmenting all ions within preset m/z and retention time windows. In this way, the precursor ion selection is not biased and low abundant peptides will be fragmented, increasing the number of identifications and sequence coverage. A downside to this method is that isolation windows can often result in co-fragmentation of precursor ions producing complex multiplexed spectra that are difficult to interpret. Several approaches have partially addressed this data analysis step [53, 54]; however, it remains difficult.

Another method has been described in which software builds a database of algorithmically selected peaks and directly interfaces with the instrument acquisition process. In this manner, it is able to generate inclusion lists using different peaks from previous runs, effectively increasing MS/MS sequence coverage and the number of proteins identified, while accomplishing this in a fully automated manner [55]. Despite improvements in instrumentation, techniques, and efforts in data analysis, we are still falling short of identifying all peptides and their isoforms in a sample, particularly in complex mixtures. This continues to be highlighted by the fact that many biologically important PTM identifications are missed. In light of this, methods to improve this area are needed. We believe that many of these issues may be addressed by complementing MS2 identification with the analysis of MS1 data, which is now commonly available with high mass accuracy, particularly in support of PTM identifications.

Some efforts have been made in this area. Multiple LC-MS runs have been aligned and quantitative information has been measured for the purpose of targeting post-translationally modified peptides for fragmentation [56–58]. Other tools compare the results from a database search with MS1 data to correlate unmatched spectra [59, 60] or to improve the scoring of search results [61, 62]. Although these efforts do not focus solely on the PTM identification problem and do not use MS1 interrogation as a central and initial method, they do highlight the value of the spectral data available in MS1. To date, the only work to specifically search for PTM peptides using the information present in MS1 data was presented recently [63, 64].

Using a conventional DDA approach, although it may not be possible to precisely determine the presence of a particular modification site through examination of MS1 data, it is possible to quickly and conclusively determine its absence. For every confirmed MS2 identification there will always be evidence of the peptide at the MS1 level. More importantly, if a modified peptide is present, this strategy quickly provides a short list of possible permutations associated with the isotopic signature. In complex multiple modified examples, there are often permutations within the same peptide that share a common chemical formula. If searched with MS1, only the unique combinations of formula and charge state need be considered, as they can be expected to result in the same isotopic distribution and monoisotopic m/z within the parts-per-million (ppm) error of the instrument. High scoring results can either be selected for subsequent targeted MS2 experiments or confirmed if fragmentation spectra are available. The smaller subset of permutations that need to be considered in MS1 data will theoretically lead to a reduction in computational expense and faster search times. Additional benefits of this strategy may include increased sequence coverage over MS2 and that MS1 data is readily available, as it is a prerequisite to MS2 data, allowing this approach to be applicable to older data sets.

We propose a computational strategy for the identification of protein post-translational modifications that first interrogates MS1 data and compares theoretical isotopic distributions against the experimental data to generate a list of possible matches for examination. A goal of this strategy is to reduce, at the outset, the number of peptide sequences taken into account in order to expedite subsequent MS2 analysis. In this respect it resembles previously described iteratively refined MS2-based search methods [51, 52], but it uses MS1 interrogation as the first step. This approach is amenable to automated computation and has been implemented in a software framework to address some of the key limitations of MS2-based methods: examination of non-fragmented peaks, identification of protein modifications that are associated with poor fragmentation patterns, and computational algorithmic enhancements.

Methods

The workflow for these methods was implemented in the Java programming language and took the following as input: Thermo Scientific raw data files, the sequence of the protein in FASTA format [65], a list of modifications, and a list of parameters that includes the enzyme, the number of missed cleavages, and limits on retention time, peptide length, and ppm tolerance. An in-silico enzymatic digest of the protein was performed using variables specified in the input to create the initial list of possible peptides from the protein. All possible modified permutations for each peptide were determined, and only distinct combinations of chemical formula and charge state were searched. This usually results in a much smaller number of permutations searched than would be needed for an MS2 search. For example, a search at the MS2 level for a single phosphorylation site on the RORg peptide LISSIFD would need to consider both LIpSSIFD and LISpSIFD, which share the same chemical formula because each contains one modified serine residue. When using the MS1 approach, only one of these forms would be preserved for the subsequent search.
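The digest and permutation steps can be illustrated with a minimal Python sketch (the published software is written in Java, and this is not its code). The function names are hypothetical, the cleavage rule is the common "after K or R, not before P" convention for trypsin, and only a single modification type (phosphorylation on S, T, or Y) is considered; all of these are assumptions of the sketch.

```python
from itertools import combinations


def tryptic_digest(sequence, missed_cleavages=2, min_len=1, max_len=20):
    """Return peptides from cleaving after K or R, except before P (assumed rule)."""
    cuts = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(sequence))

    peptides = []
    for start in range(len(cuts) - 1):
        # Join 1 .. (missed_cleavages + 1) adjacent fragments.
        last = min(start + missed_cleavages + 1, len(cuts) - 1)
        for end in range(start + 1, last + 1):
            peptide = sequence[cuts[start]:cuts[end]]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


def site_permutations(peptide, n_mods, residues="STY"):
    """All site-specific placements of n_mods phospho groups (MS2 view)."""
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    forms = []
    for chosen in combinations(sites, n_mods):
        chosen = set(chosen)
        forms.append("".join(
            "p" + aa if i in chosen else aa for i, aa in enumerate(peptide)
        ))
    return forms


def ms2_candidates(peptide, max_mods, residues="STY"):
    """Every placement for 0..max_mods modifications."""
    forms = []
    for n in range(max_mods + 1):
        forms.extend(site_permutations(peptide, n, residues))
    return forms


def ms1_candidates(peptide, max_mods, residues="STY"):
    """One entry per modification count: same count means same formula."""
    n_sites = sum(aa in residues for aa in peptide)
    return [(peptide, n) for n in range(min(max_mods, n_sites) + 1)]


print(site_permutations("LISSIFD", 1))       # the two MS2 forms from the text
print(len(ms2_candidates("LISSIFD", 2)), len(ms1_candidates("LISSIFD", 2)))
```

For LISSIFD with one phosphorylation, `site_permutations` yields the two positional isomers named in the text, while the MS1 view keeps a single entry for that modification count.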

Importantly, mass spectrometric data do not report the mass of a peptide but rather its mass-to-charge ratio. Therefore, each of the calculated permutation masses was converted to m/z values across the range of expected charges. A user-defined m/z range filter was then applied, typically between 350 and 2000 m/z units. Signals within ±7 ppm of each resulting m/z (the tolerance is defined by user input) were used to create selected ion chromatograms (SICs) from the mass spectral data. The chromatographic peaks were then used to home in on the possible retention times of the peptide and to determine the presence of the peptide and the best retention time ranges in the MS1 data using methods that have been described previously [63, 66, 67].
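The mass-to-m/z conversion and ppm window can be sketched as follows. This is an illustrative Python fragment, not the software's Java code: the function names are hypothetical, positive-mode protonated ions are assumed, and the defaults simply mirror the typical values quoted above.

```python
PROTON_MASS = 1.007276466  # Da, mass of a proton


def mz_from_mass(mono_mass, charge):
    """m/z of an [M + zH]z+ ion for a neutral monoisotopic mass (assumed ion type)."""
    return (mono_mass + charge * PROTON_MASS) / charge


def ppm_window(mz, ppm):
    """Lower and upper m/z bounds for a symmetric ppm tolerance."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta


def candidate_windows(mono_mass, max_charge=4, mz_min=350.0, mz_max=2000.0, ppm=7.0):
    """Return (charge, m/z, low, high) for every charge state inside the m/z filter."""
    windows = []
    for charge in range(1, max_charge + 1):
        mz = mz_from_mass(mono_mass, charge)
        if mz_min <= mz <= mz_max:
            low, high = ppm_window(mz, ppm)
            windows.append((charge, mz, low, high))
    return windows


for charge, mz, low, high in candidate_windows(1000.0):
    print(charge, round(mz, 4), round(low, 4), round(high, 4))
```

With a hypothetical neutral mass of 1000 Da, only the +1 and +2 ions fall inside the 350 to 2000 m/z filter; the +3 and +4 ions are discarded before any chromatogram is extracted.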

Each peptide is assigned a score that is used to rank the confidence of the match between the expected and observed spectra. The score is obtained primarily by a least squares approach combined with mass error and works as follows: the retention time ranges provided by the SIC are used in conjunction with a sliding window strategy that uses a preset expected elution range. The averaged mass spectrum for each retention time range is generated in profile mode and undergoes a peak picking step in which all spectral peaks not consistent with peptides are removed. The software then employs a custom version of the software Qmass [68, 69] to generate theoretical isotopic distributions for the peptide ion of interest, and each theoretical peak is assigned to a corresponding experimental peak so long as its intensity exceeds 4% of the most abundant peak. The score is determined using a least squares approach comparing the theoretical peaks with the observed peaks and is calculated as follows:

$$ \mathrm{score} = \sqrt{\frac{\sum_{i=1}^{N}\left(T_i - A_i\right)^2}{N}} $$

where $T_i$ is the relative abundance of the ith peak of the theoretical distribution, $A_i$ is the relative abundance of the ith peak of the observed distribution, and N is the number of peaks. The lowest score is assumed to be the best match for the peptide ion of interest, and results with a reported score exceeding the user-defined input threshold score and meeting the criteria of the input parameters are preserved for interrogation. Once the processing is complete, the results presented in the software interface are output in comma separated values (CSV) format for straightforward review. Once the list of candidate peptides has been validated, they can be compared against the MS2 identifications. Masses of peptides not previously identified are then used to build inclusion lists for subsequent MS2 analysis, often producing new identifications and increasing sequence coverage. An overview of this software approach is further described in Figure 1.
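The root-mean-square score defined above can be sketched in a few lines of Python. This is not the software's implementation: the function name is hypothetical, and the sketch assumes that the theoretical and observed peaks are already paired by index and that each list is normalised to its own most abundant peak. The mass-error component mentioned in the text is not modelled.

```python
import math


def isotope_fit_score(theoretical, observed, min_rel_abundance=0.04):
    """Root-mean-square difference between paired relative abundances.

    Peaks whose theoretical relative abundance does not exceed
    min_rel_abundance (4% by default) are left out of the comparison.
    """
    if len(theoretical) != len(observed) or not theoretical:
        raise ValueError("peak lists must be non-empty and the same length")
    t_max, a_max = max(theoretical), max(observed)
    if t_max <= 0 or a_max <= 0:
        raise ValueError("each peak list needs a positive maximum intensity")

    # Assumption: each list is scaled to its own base peak before comparison.
    pairs = [
        (t / t_max, a / a_max)
        for t, a in zip(theoretical, observed)
        if t / t_max > min_rel_abundance
    ]
    n = len(pairs)
    return math.sqrt(sum((t - a) ** 2 for t, a in pairs) / n)


# Lower is better: a perfect match on the retained peaks scores 0.
print(isotope_fit_score([100, 50, 2], [200, 100, 50]))
print(isotope_fit_score([1.0, 0.5], [1.0, 0.3]))
```

In the first call the third theoretical peak falls below the 4% cutoff and is ignored, so the remaining two peaks match exactly.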

Figure 1

Software overview. (a) The conventional DDA approach selects a subset of peptides for fragmentation based on abundance. The resulting MS/MS spectra are then subjected to a search algorithm to identify peptides. (b) The same raw file is searched at the MS1 level using Proteomics Workbench software to find all possible candidate peptides. The peptides that were not identified using the initial approach are used to create an inclusion list. (c) In a subsequent run, the inclusion list masses are fragmented and the resulting MS/MS spectra are searched, often resulting in new identifications.

We tested the software using the data set from a previous publication [1], in which several PTMs were discovered by calculating the theoretical peptide mass and manually extracting and validating the peaks using the Thermo Scientific Qual Browser. The sample preparation and MS2 analysis were carried out as follows: the kinase Ulk1 was prepared as an NTAP tagged construct and expressed in HEK293T cells; the resultant protein was purified using streptavidin binding resin and eluted to over 80% purity; the protein was digested, as previously described, and subjected to LC-MS/MS analysis using a Thermo Scientific Orbitrap.

To further validate the software and compare it with MS2 identifications from another software package, the protein bovine serum albumin (BSA) was searched with both Mascot 2.3 (Matrix Science) and Proteomics Workbench (PWB). The BSA sample was prepared such that all cysteines were carbamidomethylated (+57) and was digested using trypsin; ions with charge +2 and above were targeted for fragmentation. Data-dependent selection of the 10 most abundant ions was used for HCD. The resulting raw file was searched using Mascot with the following parameters: cysteine carbamidomethyl fixed modification, two missed cleavages, and digestion enzyme trypsin. Mass tolerance for the precursor ion was set to 50 ppm; mass tolerance for the fragment ions was 0.5 Da. To ensure that no identifications were precluded, the subsequent search was conducted with a decoy database, and identifications were considered valid if they passed the determined FDR threshold and had an ion score >10. The PWB search used the following parameters: max peptide length 20, mass tolerance 10 ppm, fixed modification of carbamidomethyl cysteine, and two missed cleavages.

Results and Discussion

All 16 sites of phosphorylation previously identified in Ulk1 were detected using the proposed computational workflow. The peptides that were found and their associated scores are listed in Table 2. The manual detection of the PTMs in that study took 3 weeks, because the new sites had to be discovered and verified by hand: selected ion chromatograms (SICs) were generated for each peptide in both modified and unmodified form, and correlating spectra were validated manually using the native instrument software (Thermo Scientific Qual Browser). Using the software approach, automated detection was completed in 3 h on a desktop PC with an Intel i7 processor and 16 GB of RAM.

Table 1 Number of Permutations Considered in MS1 Versus MS2 Searches for Proteins MKK5 and Ulk1. Using Conventional Search Parameters, the MS1 Approach Becomes More Essential as the Number of Modifications Searched Increases

In the BSA search, Mascot identified 56 non-duplicate peptides with sequence coverage of 56%. PWB scored 68 peptides with a sequence coverage of 66%. PWB found all of the peptides identified by Mascot, plus an additional 12. Of these 12, four were +1 peptides, which, as expected, were not found in the Mascot results because +1 ions, generally contaminants, were not fragmented (PWB only has a maximum charge parameter and cannot disregard +1 peptides). That left eight additional peptides that were found using the software. An inclusion list was made with these eight peptides and we targeted them for MS2 fragmentation. Of these eight peptides, one was confirmed at the MS2 level by Mascot with the same parameters as before, increasing sequence coverage by 4% (Figure 2). These results show that the software is comparable with established MS2 software and is able to search for modifications. They also show that using this approach with targeted inclusion lists can increase the number of peptides identified over a conventional single run.

Figure 2

MS2 spectra of an inclusion list peptide. MS2 spectra for the bovine serum albumin peptide YICDNQDTISSKLK (+3) are shown above. Using the conventional DDA approach, this peptide was not identified in the initial MS2 data. Using software analysis of the MS1 data, strong evidence of this peptide was discovered, and an inclusion list was generated for directed MS. The peptide was identified in the subsequent MS2 analysis using the same settings, yielding increased sequence coverage.

To illustrate the difference in the number of permutations considered by MS1 and MS2 searches, the sequences of MKK5, with a length of 448 amino acids, and of the larger protein Ulk1, with a length of 1050 amino acids, were examined with conventional search parameters. The MS2 search on the protein Ulk1 would need to consider 17 times, or 1.4 million, more permutations than an MS1 approach (Table 1).
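As a rough illustration of why the gap widens with the number of modifications, consider a simplified case that is an assumption of this sketch rather than the parameters used for the reported counts: a single peptide with n candidate residues for one modification type and at most k modifications per peptide. A site-specific MS2 search must enumerate every placement, whereas an MS1 search needs only one candidate per modification count, since all placements with the same count share a chemical formula:

$$ N_{\mathrm{MS2}} = \sum_{j=0}^{k} \binom{n}{j}, \qquad N_{\mathrm{MS1}} = k + 1 $$

For n = 6 and k = 3, this gives 1 + 6 + 15 + 20 = 42 placements at the MS2 level against 4 formulas at the MS1 level, per charge state.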

Table 2 Ulk1 Peptides. A comprehensive list of all the phosphorylated peptides previously identified from Ulk1, along with the corresponding theoretical masses, observed masses, and associated ppm errors [3]. The peptides listed were originally confirmed manually by calculating the exact mass and extracting (“chro-ing” out) the peaks using the Thermo Scientific Qual Browser over a period of three weeks. Using the software, we were able to find and validate all of these peptides in three hours. Phosphorylation sites are indicated in lower case

While the workflow described is suitable for the analysis of complex mixtures against large protein databases, Proteomics Workbench was initially designed to interrogate single proteins, providing graphical tools for quick validation and analysis. The software interface integrates several graphic tools for editing and reviewing data. Multiple peptides can be selected, and the resulting spectra and chromatographic data are rendered in the software (Figure 3). A user can validate a peptide by visually examining the spectra, which are displayed in the same way as in native instrument software such as the Qual Browser (Thermo Scientific, San Jose, CA, USA). The averaged mass spectrum for the peptide is rendered in the software interface, and grey bars outline where each peak should theoretically reside based on peak width and charge state. The extracted ion chromatogram (XIC) view is calculated from all peak mass ranges within the user-defined m/z start and m/z end, and presents the chromatographic peak and the retention time range used to generate the peptide’s peaks. If there are multiple peaks in the XIC, the user can select a different chromatographic peak and examine the resulting spectra by adjusting the retention time range. The theoretical isotopic distribution can also be overlaid onto the observed distribution. Using these tools, the user can quickly and confidently validate peptide assignments.

Figure 3

The main interface overview. This feature allows for integrating several graphic tools for editing and reviewing data. (a) The top table shows the selected peptide (highlighted in blue) has a phosphorylated tyrosine. Other columns include charge, theoretical m/z, and positional information. Peptide selection will launch results in the subsequent tools. (b) Extracted ion chromatogram (XIC) for one or more peptide replicates. (c) Bar charts present average area under the curve (AUC) from the XIC. (d) The spectral pane displays the averaged mass spectral data for one or several replicates. (e) Information Toolbar allows recalculation of input values. Tables (f) and (g) allow users to load from one to many replicates in the spectral and XIC panes for review. The replicate table (g) displays AUC results for each sample peptide replicate; (b), (c), and (d) support zoom in/out functionality

Area under the curve (AUC) data are represented in bar charts and all results are exportable from the software. Another available feature can search all peptides within the experiment and flag those that are isobaric or have mass conflicts based on shared mass. The software supports both peptide-specific and global modifications. Peptide-specific modifications are set in the protein editor interface in the peptide set. Global modifications selected in the initial detect job are applied to all peptides as variable. All predefined modifications can be selected prior to the search. User-defined modifications are accessible through an exposed XML file and formatted in a manner similar to UNIMOD. It should be pointed out that in most cases the software can be used in place of the native instrument software and does not need to be tied to this specific workflow.

Conclusions

Identification of post-translational modifications on proteins is an important part of the experimentation performed in mass spectrometry labs worldwide. However, few tools allow for the identification of PTMs, and the tools that exist often do not take into account the combinatorial nature of modifications. This software makes it possible to interrogate high-resolution MS data for mass signatures related to putatively modified peptides, which can then be validated by MS/MS spectra. Overall, this software lets the investigator dig deeper into collected data and could provide additional information about modified or unmodified peptides that are present in the analysis. Many who continue to search manually for peptides in MS1 would benefit from the automation and validation tools this software provides. This information will be of great value to investigators as they continue to determine the biological significance of these modifications and the regulation they control. Moreover, when this experimental process is automated, it can be used for rapid screening of sample data and applied iteratively, with peptide inclusion lists encompassing all of these putative masses developed for each subsequent analysis.

The software is freely available at http://www.proteomicsworkbench.com.