Journal of The American Society for Mass Spectrometry, Volume 27, Issue 9, pp 1579–1582

Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows

  • Mark V. Ivanov
  • Lev I. Levitsky
  • Mikhail V. Gorshkov
Application Note

Abstract

A number of proteomic database search engines implement multi-stage strategies aimed at increasing the sensitivity of proteome analysis. These approaches often employ a subset of the original database for the secondary stage of analysis. However, if the target-decoy approach (TDA) is used for false discovery rate (FDR) estimation, multi-stage strategies may violate the underlying assumption of TDA that false matches are distributed uniformly across the target and decoy databases. This violation occurs if the numbers of target and decoy proteins selected for the second search are not equal. Here, we propose a method of decoy database generation based on the previously reported decoy fusion strategy. This method allows unbiased TDA-based FDR estimation in multi-stage searches and can be easily integrated into existing workflows utilizing popular search engines and post-search algorithms.

Keywords

Proteomics; False discovery rate; X!Tandem; Mascot; Refinement; Target-decoy approach

Introduction

Protein identification in shotgun proteomics is performed using a search against a protein sequence database [1]. A number of popular search engines, including X!Tandem [2], Mascot [3], SEQUEST [1], PEAKS [4], Morpheus [5], and Andromeda [6], implement different algorithms that match MS/MS spectra to peptide sequences from the reference database. Filtering and validation of the search results is a crucial step for their subsequent biological interpretation. A simple and versatile strategy known as the target-decoy approach (TDA) [7] is widely employed for filtering of peptide identifications. The approach is based on generating a database of “decoy” proteins known a priori to be false, followed by its concatenation with the relevant (“target”) database. In addition, a number of search engines are capable of performing multi-stage strategies [2, 3, 4, 8], wherein a second search is run after the initial one using relaxed parameters and a reduced search space. A number of reports suggest that these multi-stage strategies can provide a significant gain in sensitivity at a relatively small time cost, owing to a drastic reduction in the search space [9, 10, 11, 12]. It has been noted in the cited works, however, that this reduction also leads to systematic errors in false discovery rate (FDR) estimation based on TDA. The problem arises if different numbers of target and decoy proteins are considered in the second search. In this case, the widely employed equation for FDR estimation can no longer be used, because the main assumption of TDA does not hold. Other implementations of two-pass searching, such as precursor mass recalibration performed by Andromeda, do not introduce errors in FDR estimation, as long as they do not change the proportion of decoy peptides in the search space.
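To make the assumption concrete, the sketch below shows one common form of the TDA estimate, in which the number of decoy matches stands in for the number of false target matches. The function name and the `target_to_decoy_ratio` parameter are illustrative assumptions, not part of any tool mentioned here; the parameter simply shows how the estimate would have to be rescaled if false matches were not equally likely to hit target and decoy sequences.

```python
def tda_fdr(n_target: int, n_decoy: int, target_to_decoy_ratio: float = 1.0) -> float:
    """Estimate the FDR among target PSMs from target and decoy counts.

    With target_to_decoy_ratio == 1.0 this is the plain decoys/targets
    estimate, valid only if a false match is equally likely to land on a
    target or a decoy sequence. The ratio argument (an assumption of this
    sketch) rescales the decoy count when that is not the case.
    """
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    return n_decoy * target_to_decoy_ratio / n_target


# Equal target and decoy search spaces: 10 decoys among 1000 targets.
balanced = tda_fdr(1000, 10)
# Same counts, but false matches hit targets four times as often as decoys.
imbalanced = tda_fdr(1000, 10, target_to_decoy_ratio=4.0)
```

If the second-stage search space contains several target peptides for every decoy peptide, the uncorrected estimate understates the true error rate by roughly that factor for the matches coming from the second stage.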

A correction for the X!Tandem search engine was reported to fix the problem [13]. It was later argued, however, that it introduces yet another bias into the FDR estimation, and that for accurate results the target and decoy parts of the database must be equal not only in size, but also in quality [14]. Recently, Zhang et al. proposed and implemented a new method for decoy database generation called decoy fusion [4]. In the classic TDA, decoy sequences are generated by reversing or shuffling the sequences from the reference database, or using a random walk approach [7]. Then, the generated decoy protein sequences are simply added to the protein database employed for the search. In the decoy fusion method, each reversed sequence is appended directly to the end of the target protein sequence within the same protein record, leaving the total number of sequences in the database unchanged. In this way, for every target protein passing into the second search stage, there will be a reversed sequence. The decoy fusion method corrects the bias in FDR estimation, yet it requires that post-search validation tools localize each identified peptide within the concatenated protein sequence to correctly label it as target or decoy. Popular tools do not support this functionality, which hinders integration of decoy fusion into workflows that do not use the PEAKS software.

Here, we propose a solution based on the decoy fusion approach and compatible with popular post-search tools without modifications to their software. The proposed solution was tested using two widely used search engines, X!Tandem and Mascot, which implement multi-stage search.

Experimental

Methods

In the proposed strategy, shown schematically in Figure 1, decoy protein sequences are added as new records to the database, but each decoy record contains both the reversed (or shuffled, etc.) and the original target sequence, separated with an enzyme-specific amino acid motif (we used a single arginine for tryptic digest searches). The approach is similar to the one implemented in PEAKS, except for two distinctive features: the decoy record starting with the reversed (or shuffled) sequence followed by the target one, and an enzyme-specific residue separating the two parts. When a target protein is matched in the initial search, the corresponding decoy protein also passes into the second search due to shared peptides with the target protein. Most post-search validation tools consider peptides shared between target and decoy proteins as target, which is correct in this case (all target peptides are shared with decoy proteins). For the results to be correct, it is necessary that the search engine reports all proteins for peptide identifications.
Figure 1. The classic TDA, the decoy fusion method (implemented in PEAKS) and the fused decoy approach presented in this work. In classic TDA, the database contains target and decoy records. In the decoy fusion approach, target proteins are fused with reversed sequences. In the proposed approach, fused records consisting of the decoy sequence, an enzyme-specific residue separator, and the target sequence are used as decoys. The database contains target and fused decoy records.
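The construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' scripts or the Pyteomics implementation: the function names, the `DECOY_` prefix, and the dictionary-based database are assumptions made for the example. Reversal is used as the decoy method and a single arginine as the separator, as in the tryptic searches described in the text.

```python
from typing import Dict, Iterable


def fused_decoy_sequence(target_seq: str, separator: str = "R") -> str:
    """Return reversed sequence + separator + original target sequence."""
    return target_seq[::-1] + separator + target_seq


def build_fused_database(targets: Dict[str, str],
                         decoy_prefix: str = "DECOY_",
                         separator: str = "R") -> Dict[str, str]:
    """Keep every target record and add one fused decoy record per target."""
    database = dict(targets)
    for name, seq in targets.items():
        database[decoy_prefix + name] = fused_decoy_sequence(seq, separator)
    return database


def is_decoy_match(protein_names: Iterable[str],
                   decoy_prefix: str = "DECOY_") -> bool:
    """A peptide counts as decoy only if ALL of its proteins are decoy records.

    Target peptides also occur inside the fused decoy records, so a peptide
    mapped to both a target and a decoy record is labeled as target.
    """
    names = list(protein_names)
    return bool(names) and all(n.startswith(decoy_prefix) for n in names)


toy_db = build_fused_database({"P1": "MKTAYR"})
```

The labeling rule in `is_decoy_match` is the reason the search engine must report all proteins for each peptide identification: if only the fused decoy record were reported for a true target peptide, that peptide would be mislabeled as a decoy.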

A data set of 157,128 LC-MS/MS spectra, obtained using a Thermo LTQ Orbitrap Velos for a human colorectal cancer sample annotated as TCGA-A6-3807-01A-22 and described elsewhere [15], was used in this work. The SwissProt human protein database was used for the searches. Decoy databases were generated using in-house Python scripts based on the Pyteomics library [16].

Two popular search engines implementing multi-stage searching, X!Tandem (ver. CYCLONE 2012.10.01.1) and Mascot (ver. 2.4.1), were used for peptide identification. X!Tandem output files were converted to pepXML using the pepxmltk utility [17]. MPscore [18] was applied to pepXML files for post-search validation. We also applied Percolator [19] to the “fused” database search results to demonstrate the versatility of the proposed approach. It should be noted that only versions of Percolator released after February 29, 2016 can be successfully used with fused decoy searches. Mascot results were converted to CSV files using Mascot server with the “Group protein families” setting disabled. Precursor and fragment mass tolerances were set at 15 ppm and 0.3 Da, respectively. Carbamidomethylation of cysteine was set as a fixed modification.

Instead of using a “nonsense” database to detect possible bias [13], we used a regular SwissProt database. To expose potential bias in FDR estimation, a “null model” (i.e., a set of false identifications coming from the second search) is needed. For X!Tandem searches, unlikely (“nonsense”) variable modifications were enabled for this purpose, and the matches to modified peptides were considered false. According to the main assumption of the target-decoy approach, PSMs with nonsense modifications are expected to come in equal proportions from the target and decoy databases. Variable modifications were set to mass shifts of 70.041865 Da at Ser, 150.041585 Da at Thr, 72.021129 Da at Ile, 74.019021 Da at Ala, and 100.016044 Da at Gly. The values of the mass shifts were taken arbitrarily from the Unimod database [20] to ensure that “modified” peptides fall inside the mass clusters described elsewhere [21]. An expectation value threshold for refinement was set at 0.01, which corresponds to 0.16% FDR for our data set.
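The null-model check amounts to counting target and decoy matches among the PSMs that carry a nonsense modification. The sketch below assumes a hypothetical PSM representation, a pair of booleans `(has_nonsense_mod, is_decoy)`, which is not the output format of any of the tools used here.

```python
from typing import Iterable, Tuple


def null_model_ratio(psms: Iterable[Tuple[bool, bool]]) -> float:
    """Target-to-decoy ratio among PSMs with a nonsense modification.

    Each PSM is a (has_nonsense_mod, is_decoy) pair (hypothetical format).
    A ratio close to 1 is what TDA assumes for false matches; a much larger
    ratio indicates that FDR would be underestimated.
    """
    targets = decoys = 0
    for has_mod, is_decoy in psms:
        if not has_mod:
            continue  # unmodified PSMs are not part of the null model
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if decoys == 0:
        return float("inf") if targets else float("nan")
    return targets / decoys


toy_psms = [(True, False), (True, True), (False, False), (True, False), (True, True)]
toy_ratio = null_model_ratio(toy_psms)
```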

Three X!Tandem searches were performed: a standard search with nonsense variable modifications without refinement (referred to as “no refine” for brevity), as well as two searches against the “target + fused decoy” (“fused”) and classic target-decoy databases (“classic”) with refinement and with nonsense modifications enabled at the refinement step only.

The Mascot search engine does not allow specifying a list of variable modifications for the second-stage (“error-tolerant”) search. For this reason, nonsense modifications cannot be used as a null model with the Mascot error-tolerant search. As with X!Tandem, three searches were performed: a single-stage search and two two-stage searches using classic and fused decoy databases. When comparing the results of the two-stage searches, we note that the target proteins considered in the second stage are the same. Hence, all identifications that are unique to the classic search are due to the difference in the decoy databases and must be false. Since all false matches are mutually independent, the subset of PSMs unique to the classic error-tolerant search can also serve as a null model and indicate the proportion of decoys in the second-stage search.

Results and Discussion

First, we estimated the ratio of target and decoy peptides for the decoy databases generated using the classic and proposed approaches. For 2,692,736 target peptides of length 6 and above with up to two missed cleavages allowed, there were 2,687,024 and 2,740,696 decoy peptides for the two approaches, respectively. The number of decoy peptides for the proposed approach is slightly higher because of peptides with missed cleavage sites spanning the border between the reversed and target sequences. When a decoy sequence ends with a cleavage site, an additional N-terminal target peptide with the separator motif attached at the N-terminus may also be generated. These peptides are considered to be decoys because they are not shared with target proteins. Thus, while the commonly accepted decoy sequence generation approach leads to a small underestimation of FDR, the proposed approach is slightly conservative.
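The origin of the extra decoy peptides can be reproduced on a toy sequence. The sketch below uses a simplified tryptic rule (cleavage after every K or R, with no proline exception), which is an assumption of this example and not necessarily the rule used for the counts above; the sequence and function name are also illustrative.

```python
import re
from typing import Set


def tryptic_peptides(sequence: str, missed_cleavages: int = 2,
                     min_length: int = 6) -> Set[str]:
    """In-silico digestion: cleave after K or R (simplified rule)."""
    # Split into fully cleaved pieces; the last piece may lack a K/R terminus.
    pieces = re.findall(r".*?[KR]|.+$", sequence)
    peptides = set()
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` extra consecutive pieces.
        for j in range(i, min(i + missed_cleavages + 1, len(pieces))):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


target = "ACDEFGHKLMNPQR"
fused = target[::-1] + "R" + target  # reversed + separator + target

target_peptides = tryptic_peptides(target)
fused_peptides = tryptic_peptides(fused)
# Peptides found only in the fused record are the ones labeled as decoys.
decoy_only = fused_peptides - target_peptides
```

Every target peptide is also present in the fused record, while `decoy_only` additionally contains peptides that cross the junction, such as one ending in the separator residue and one that continues into the target part through a missed cleavage.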

For X!Tandem, FDR filtering was performed in two ways: 1% global FDR at PSM level and 1% FDR at PSM level for modified peptides only. The latter was done by simply excluding all non-modified sequences from the identification list before FDR filtering. The results are shown in Table 1. Refinement using classic target-decoy database reports more PSMs in total, but also more PSMs with nonsense modifications. They can be interpreted as false positives. Both “no refine” and “fused decoy” report less than 1% of identifications containing nonsense modifications. Similar results were observed for the more recent VENGEANCE version of X!Tandem (2015.12.15), but it showed significantly lower performance for our data set. We observed this discrepancy between the versions in all searches with high values of fragment mass tolerance (0.3 Da in our case).
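As an illustration of PSM-level filtering, the sketch below sorts PSMs by expectation value and keeps the largest set for which the decoys/targets estimate stays within the requested FDR. It is a simplified stand-in for the post-search tools named above, not their algorithm, and the `(evalue, is_decoy)` tuple format is an assumption of the example.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (expectation value, is_decoy); hypothetical format


def filter_to_fdr(psms: List[PSM], fdr: float = 0.01) -> List[PSM]:
    """Return target PSMs from the largest top-scoring set within the FDR."""
    ordered = sorted(psms, key=lambda psm: psm[0])  # best (lowest) e-value first
    targets = decoys = 0
    cutoff = 0
    for index, (_, is_decoy) in enumerate(ordered, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest position where the estimate is still acceptable.
        if targets and decoys / targets <= fdr:
            cutoff = index
    return [psm for psm in ordered[:cutoff] if not psm[1]]


toy = [(i * 1e-4, False) for i in range(1, 201)]
toy += [(0.5, True), (0.6, False), (0.7, True), (0.8, True)]
accepted = filter_to_fdr(toy, fdr=0.01)
```

Filtering "for modified peptides only", as described above, corresponds to passing only the PSMs with modified sequences to the same function.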
Table 1. X!Tandem Peptide-Spectrum Matches Filtered to 1% FDR. With no refinement and when using fused decoy, modified peptides comprise less than 1%. With the classic database, modified peptides are more than 1.6%, which is an indication that FDR is underestimated

                                No refine    Classic    Fused
  Global FDR filtering
    Total PSMs                  48,529       54,712     50,169
    Modified PSMs               415          894        278
  Nonsense modifications only
    Modified PSMs               0            105        0

Figure 2a demonstrates the bias observed with refinement using “classic” database for X!Tandem. Each curve represents target and decoy counts among identifications with nonsense modifications sorted by expectation value. The target-to-decoy PSM ratio is close to 1 for “no refine” and “fused decoy” searches, which is expected for false identifications. However, for the “classic” search, this ratio is approximately 31. This high disproportion indicates an imbalance between target and decoy peptides passing into the second search, which results in underestimation of FDR. Direct analysis of “no refine” search results shows that the e-value threshold of 0.01 used for the refinement searches corresponds to 1.4% protein FDR, or 3635 target and 50 decoy proteins. These proteins yield 159,060 and 5436 target and decoy peptides, respectively. This means that there will be approximately 29 target peptides for every decoy selected for the second search. This is the essence of the bias inherent to “classic” search with refinement, explaining why it reports more PSMs in total, as shown further in Figure 2b and c for X!Tandem and Mascot, respectively.
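The numbers quoted in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative; the values are taken directly from the text.

```python
# Proteins passing the refinement threshold (e-value 0.01) in the "no refine" search.
target_proteins, decoy_proteins = 3635, 50
# Peptides generated from those proteins for the second-stage search.
target_peptides, decoy_peptides = 159_060, 5_436

protein_fdr = decoy_proteins / target_proteins    # fraction of decoy proteins
expected_ratio = target_peptides / decoy_peptides  # targets per decoy in stage two

print(f"protein FDR: {protein_fdr:.1%}")                      # 1.4%
print(f"target peptides per decoy peptide: {expected_ratio:.1f}")  # 29.3
```

The expected ratio of about 29 target peptides per decoy peptide is close to the target-to-decoy PSM ratio of approximately 31 observed for the nonsense-modified identifications in the “classic” search.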
Figure 2. X!Tandem identifications with nonsense modifications (a), total number of X!Tandem identifications (b), and total number of Mascot identifications (c) for searches with no refinement and with refinement using “classic” and “target + fused decoy” databases.

Mascot’s “classic” search reports more target identifications than both “fused decoy” and “no refine” searches for a fixed number of decoy identifications. This is due to the different numbers of target and decoy proteins used for the second search. To estimate the ratio of target to decoy proteins, the “classic” results were filtered to 1% FDR and divided into three groups: (1) shared with “fused decoy” and “no refine” searches; (2) shared with “fused decoy” only; and (3) unique to the “classic” search (Table 2). As discussed above, identifications in the last group are false because all true PSMs are matched in the “fused decoy” search with the same target peptide search space. From the table, the ratio of target to decoy proteins used in the error-tolerant search can be estimated as approximately 4:1 (442 versus 112) for our data instead of 1:1, which results in a “real” FDR estimate for the “classic” search of (29 + 345 + 442)/(31,825 + 16,823 + 442) = 1.7% instead of 1%.
Table 2. Mascot Peptide-Spectrum Matches Identified in “Classic” Search and Filtered to 1% FDR. All PSMs are divided into three groups: identified in all searches, identified only in “fused decoy” and “classic” searches, and unique to the “classic” search

            No refine, Fused, Classic    Fused, Classic    Classic
  Target    31,825                       16,823            442
  Decoy     29                           345               112
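The FDR arithmetic for the Mascot “classic” search can be reproduced from the Table 2 counts. The variable names are illustrative; the reasoning follows the text: decoy counts in the two shared groups estimate the false targets there, and all 442 target PSMs unique to the “classic” search are treated as false.

```python
# (target PSMs, decoy PSMs) for each group in Table 2.
all_searches = (31_825, 29)     # No refine, Fused, Classic
fused_and_classic = (16_823, 345)
classic_only = (442, 112)

total_targets = all_searches[0] + fused_and_classic[0] + classic_only[0]
total_decoys = all_searches[1] + fused_and_classic[1] + classic_only[1]

# Estimate reported by plain TDA filtering (the 1% threshold).
nominal_fdr = total_decoys / total_targets

# Estimate that counts every "classic"-only target PSM as false.
false_targets = all_searches[1] + fused_and_classic[1] + classic_only[0]
corrected_fdr = false_targets / total_targets

# Target-to-decoy ratio among the false, "classic"-only matches.
second_stage_ratio = classic_only[0] / classic_only[1]

print(f"nominal: {nominal_fdr:.2%}, corrected: {corrected_fdr:.2%}, "
      f"ratio: {second_stage_ratio:.1f}")
```

The nominal estimate comes out just under 1%, the corrected one at about 1.7%, and the ratio at roughly 4:1, in line with the values stated above.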

Conclusions

This study confirms the previously voiced concerns regarding a bias in target-decoy-based FDR estimation for multi-stage searches. A solution to the problem is proposed. Unlike the previously reported decoy fusion method, the proposed approach does not require any modifications to the search engines or post-search validation tools used for the analysis. The only necessary change is to use a specific decoy database generation procedure, which does not affect other steps in protein identification workflows. The method has been successfully tested for the X!Tandem search engine with the MPscore and Percolator post-search validation tools, as well as for the Mascot search engine, and it is not restricted to these: it can be used with other tools that allow multi-stage analysis. The proposed decoy generation method is implemented in the Pyteomics library [16].

Notes

Acknowledgments

This work was supported by the Russian Science Foundation (project #14-14-00971). The authors have declared no conflict of interest.

References

  1. Eng, J.K., McCormack, A.L., Yates, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994)
  2. Craig, R., Beavis, R.C.: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–1467 (2004)
  3. Perkins, D.N., Pappin, D.J., Creasy, D.M., Cottrell, J.S.: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999)
  4. Zhang, J., Xin, L., Shan, B., Chen, W., Xie, M., Yuen, D., Zhang, W., Zhang, Z., Lajoie, G.A., Ma, B.: PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics 11, M111.010587 (2012)
  5. Wenger, C.D., Coon, J.J.: A proteomics search algorithm specifically designed for high-resolution tandem mass spectra. J. Proteome Res. 12, 1377–1386 (2013)
  6. Cox, J., Neuhauser, N., Michalski, A., Scheltema, R.A., Olsen, J.V., Mann, M.: Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 10, 1794–1805 (2011)
  7. Elias, J.E., Gygi, S.P.: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007)
  8. Shilov, I.V., Seymour, S.L., Patel, A.A., Loboda, A., Tang, W.H., Keating, S.P., Hunter, C.L., Nuwaysir, L.M., Schaeffer, D.A.: The Paragon algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Mol. Cell. Proteomics 6, 1638–1655 (2007)
  9. Jeong, K., Kim, S., Bandeira, N.: False discovery rates in spectral identification. Bioinformatics 13(Suppl 16), S2 (2012)
  10. Gupta, N., Bandeira, N., Keich, U., Pevzner, P.A.: Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 22, 1111–1120 (2011)
  11. Nesvizhskii, A.I.: A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteome 73, 2092–2123 (2010)
  12. Tharakan, R., Edwards, N., Graham, D.R.M.: Data maximization by multipass analysis of protein mass spectra. Proteomics 10, 1160–1171 (2010)
  13. Everett, L.J., Bierl, C., Master, S.R.: Unbiased statistical analysis for multi-stage proteomic search strategies. J. Proteome Res. 9, 700–707 (2010)
  14. Bern, M., Kil, Y.J.: Comment on “Unbiased statistical analysis for multi-stage proteomic search strategies.” J. Proteome Res. 10, 2123–2127 (2011)
  15. Zhang, B., Wang, J., Wang, X., Zhu, J., Liu, Q., Shi, Z., Chambers, M.C., Zimmerman, L.J., Shaddox, K.F., Kim, S., Davies, S.R., Wang, S., Wang, P., Kinsinger, C.R., Rivers, R.C., Rodriguez, H., Townsend, R.R., Ellis, M.J.C., Carr, S.A., Tabb, D.L., Coffey, R.J., Slebos, R.J.C., Liebler, D.C.: NCI CPTAC: proteogenomic characterization of human colon and rectal cancer. Nature 513, 382–387 (2014)
  16. Goloborodko, A.A., Levitsky, L.I., Ivanov, M.V., Gorshkov, M.V.: Pyteomics—a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom. 24, 301–304 (2013)
  17. Ivanov, M.V., Levitsky, L.I., Tarasova, I.A., Gorshkov, M.V.: Pepxmltk—a format converter for peptide identification results obtained from tandem mass spectrometry data using X!Tandem search engine. J. Anal. Chem. 70, 1598–1599 (2015)
  18. Ivanov, M.V., Levitsky, L.I., Lobas, A.A., Panic, T., Laskay, Ü.A., Mitulovic, G., Schmid, R., Pridatchenko, M.L., Tsybin, Y.O., Gorshkov, M.V.: Empirical multidimensional space for scoring peptide spectrum matches in shotgun proteomics. J. Proteome Res. 13, 1911–1920 (2014)
  19. Käll, L., Canterbury, J.D., Weston, J., Noble, W.S., MacCoss, M.J.: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007)
  20. Creasy, D.M., Cottrell, J.S.: Unimod: protein modifications for mass spectrometry. Proteomics 4, 1534–1536 (2004)
  21. Gay, S., Binz, P.A., Hochstrasser, D.F., Appel, R.D.: Modeling peptide mass fingerprinting data using the atomic composition of peptides. Electrophoresis 20, 3527–3534 (1999)

Copyright information

© American Society for Mass Spectrometry 2016

Authors and Affiliations

  • Mark V. Ivanov (1, 2)
  • Lev I. Levitsky (1, 2)
  • Mikhail V. Gorshkov (1, 2)

  1. Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow, Russia
  2. Moscow Institute of Physics and Technology (State University), Moscow, Russia
