Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows
A number of proteomic database search engines implement multi-stage strategies that aim to increase the sensitivity of proteome analysis. These approaches often employ a subset of the original database for the secondary stage of analysis. However, if the target-decoy approach (TDA) is used for false discovery rate (FDR) estimation, the multi-stage strategies may violate the underlying assumption of TDA that false matches are distributed uniformly across the target and decoy databases. This violation occurs if the numbers of target and decoy proteins selected for the second search are not equal. Here, we propose a method of decoy database generation based on the previously reported decoy fusion strategy. This method allows unbiased TDA-based FDR estimation in multi-stage searches and can be easily integrated into existing workflows that use popular search engines and post-search algorithms.
Keywords: Proteomics; False discovery rate; X!Tandem; Mascot; Refinement; Target-decoy approach
Protein identification in shotgun proteomics is performed using a search against a protein sequence database. A number of popular search engines, including X!Tandem, Mascot, SEQUEST, PEAKS, Morpheus, and Andromeda, implement different algorithms that match MS/MS spectra to peptide sequences from the reference database. Filtering and validation of the search results is a crucial step for their subsequent biological interpretation. A simple and versatile strategy known as the target-decoy approach (TDA) is widely employed for filtering of peptide identifications. The approach is based on generating a database of “decoy” proteins known a priori to be false, followed by its concatenation with the relevant (“target”) database. In addition, a number of search engines are capable of performing multi-stage strategies [2, 3, 4, 8], wherein a second search is run after the initial one using relaxed parameters and a reduced search space. A number of reports suggest that these multi-stage strategies can provide a significant gain in sensitivity at a relatively small time cost, owing to a drastic reduction in the search space [9, 10, 11, 12]. It has been noted in the cited works, however, that these strategies also lead to systematic errors in false discovery rate (FDR) estimation based on TDA. The problem arises if different numbers of target and decoy proteins are considered in the second search. In this case, the widely employed equation for FDR estimation can no longer be used, because the main assumption of TDA does not hold. Other implementations of two-pass searching, such as precursor mass recalibration performed by Andromeda, do not introduce errors in FDR estimation, as long as they do not change the proportion of decoy peptides in the search space.
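The dependence of the usual TDA estimate on equal target and decoy search spaces can be illustrated with a short sketch. The function name `estimate_fdr` and the `decoy_ratio` parameter are introduced here for illustration only and are not part of any search engine or validation tool. With `decoy_ratio` equal to 1, the function reduces to the common estimate of decoy matches divided by target matches; a different ratio shows how the same decoy count may correspond to a different number of false target matches.

```python
def estimate_fdr(n_target, n_decoy, decoy_ratio=1.0):
    """Estimate FDR among target PSMs from the number of decoy PSMs.

    decoy_ratio is the assumed ratio of decoy to target search space.
    The default of 1.0 gives the usual TDA estimate n_decoy / n_target.
    """
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    if decoy_ratio <= 0:
        raise ValueError("decoy_ratio must be positive")
    # Each decoy match stands for 1 / decoy_ratio expected false target matches.
    return n_decoy / (decoy_ratio * n_target)


# Hypothetical counts, chosen only to show the effect of an unbalanced search space.
naive = estimate_fdr(10000, 100)
adjusted = estimate_fdr(10000, 100, decoy_ratio=0.5)
```

In this toy case the naive estimate is half of the adjusted one, which is the kind of underestimation that arises when fewer decoy than target proteins reach the second search. The sketch accounts only for size, not for the quality of the two parts of the database.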
A correction for the X!Tandem search engine was reported to fix the problem. It was later argued, however, that it introduces yet another bias into the FDR estimation, and that for accurate results the target and decoy parts of the database must be equal not only in size, but also in quality. Recently, Zhang et al. proposed and implemented a new method for decoy database generation called decoy fusion. In the classic TDA, decoy sequences are generated by reversing or shuffling the sequences from the reference database, or by using a random walk approach. Then, the generated decoy protein sequences are simply added to the protein database employed for the search. In the decoy fusion method, each reversed sequence is appended directly to the end of the target protein sequence within the same protein record, leaving the total number of sequences in the database unchanged. In this way, for every target protein passing into the second search stage, there will be a reversed sequence. The decoy fusion method corrects the bias in FDR estimation, yet it requires that post-search validation tools localize each identified peptide within the concatenated protein sequence to correctly label it as target or decoy. The popular tools do not support this functionality, which hinders integration of decoy fusion into workflows that do not use the PEAKS software.
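The difference between classic decoy records and fused records, as described above, can be sketched in a few lines. This is a minimal illustration, not the PEAKS or Pyteomics implementation: all function names and the `DECOY_` prefix are hypothetical, and the rule that a peptide crossing the border is labeled as decoy is an assumption made here.

```python
def reverse_decoy(sequence):
    """Return the reversed protein sequence."""
    return sequence[::-1]


def classic_records(description, sequence, prefix="DECOY_"):
    """Classic TDA: the decoy is a separate record added to the database."""
    return [
        (description, sequence),
        (prefix + description, reverse_decoy(sequence)),
    ]


def fused_record(description, sequence):
    """Decoy fusion: the reversed sequence is appended within the same record."""
    return description, sequence + reverse_decoy(sequence)


def is_decoy_peptide(end, target_length):
    """Label a peptide by its position in a fused sequence.

    end is the exclusive end index of the peptide. Any peptide that reaches
    past the target part is treated as a decoy (an assumption of this sketch).
    """
    return end > target_length


# Toy sequence, used only to show the layout of the records.
name, fused = fused_record("P1", "PEPTIDEK")
```

The position-based labeling in `is_decoy_peptide` is the step that standard post-search tools do not perform, which is why fused records cannot be used with them directly.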
Here, we propose a solution based on the decoy fusion approach and compatible with popular post-search tools without modifications to their software. The proposed solution was tested using two widely used search engines, X!Tandem and Mascot, which implement multi-stage search.
A data set consisting of 157,128 LC-MS/MS spectra obtained on a Thermo LTQ Orbitrap Velos for a human colorectal cancer sample annotated as TCGA-A6-3807-01A-22 and described elsewhere was used in this work. The SwissProt human protein database was used for the searches. Decoy databases were generated using in-house Python scripts based on the Pyteomics library.
Two popular search engines implementing multi-stage searching, X!Tandem (ver. CYCLONE 2012.10.01.1) and Mascot (ver. 2.4.1), were used for peptide identification. X!Tandem output files were converted to pepXML using the pepxmltk utility. MPscore was applied to the pepXML files for post-search validation. We also applied Percolator to the “fused” database search results to demonstrate the versatility of the proposed approach. It should be noted that only versions of Percolator released after February 29, 2016 can be successfully used with fused decoy searches. Mascot results were converted to CSV files using Mascot server with the “Group protein families” setting disabled. Precursor and fragment mass tolerances were set at 15 ppm and 0.3 Da, respectively. Carbamidomethylation of cysteine was set as a fixed modification.
Instead of using a “nonsense” database to detect possible bias, we used a regular SwissProt database. To expose potential bias in FDR estimation, a “null model” (i.e., a set of false identifications coming from the second search) is needed. For X!Tandem searches, unlikely (“nonsense”) variable modifications were enabled for this purpose, and the matches to modified peptides were considered false. According to the main assumption of the target-decoy approach, PSMs with nonsense modifications are expected to come in equal proportions from the target and decoy databases. Variable modifications were set to mass shifts of 70.041865 Da at Ser, 150.041585 Da at Thr, 72.021129 Da at Ile, 74.019021 Da at Ala, and 100.016044 Da at Gly. The values of the mass shifts were taken arbitrarily from the Unimod database to ensure that “modified” peptides fall inside the mass clusters described elsewhere. An expectation value threshold for refinement was set at 0.01, which corresponds to 0.16% FDR for our data set.
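The way such a null model exposes bias can be sketched as two simple summaries over a list of PSMs. The `PSM` class, its field names, and both function names are hypothetical and do not correspond to the output format of X!Tandem, MPscore, or Percolator; the toy PSMs are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    spectrum_id: str
    is_decoy: bool
    has_nonsense_mod: bool


def decoy_share_of_null(psms):
    """Share of decoys among PSMs that carry a nonsense modification.

    Under the TDA assumption this share is expected to be close to 0.5.
    """
    null = [p for p in psms if p.has_nonsense_mod]
    if not null:
        return None
    return sum(p.is_decoy for p in null) / len(null)


def nonsense_fraction_in_targets(filtered_psms):
    """Fraction of accepted target PSMs that carry a nonsense modification.

    After filtering to a given FDR, this fraction should not exceed
    that FDR if the estimate is unbiased.
    """
    targets = [p for p in filtered_psms if not p.is_decoy]
    if not targets:
        return None
    return sum(p.has_nonsense_mod for p in targets) / len(targets)


# Toy input, invented for illustration only.
toy = [
    PSM("s1", False, False),
    PSM("s2", False, True),
    PSM("s3", True, True),
    PSM("s4", False, False),
]
share = decoy_share_of_null(toy)
fraction = nonsense_fraction_in_targets(toy)
```

A fraction of nonsense-modified target PSMs clearly above the nominal FDR threshold would indicate that the FDR is underestimated.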
Three X!Tandem searches were performed: a standard search with nonsense variable modifications without refinement (referred to as “no refine” for brevity), as well as two searches against the “target + fused decoy” (“fused”) and classic target-decoy databases (“classic”) with refinement and with nonsense modifications enabled at the refinement step only.
The Mascot search engine does not allow specifying a list of variable modifications for the second-stage (“error-tolerant”) search. For this reason, nonsense modifications cannot be used as a null model with the Mascot error-tolerant search. As with X!Tandem, three searches were performed: a single-stage search and two two-stage searches using classic and fused decoy databases. When comparing the results of the two-stage searches, we note that the target proteins considered in the second stage are the same. Hence, all identifications that are unique to the classic search are due to the difference in the decoy databases and must be false. Since all false matches are mutually independent, the subset of PSMs unique to the classic error-tolerant search can also serve as a null model and indicate the proportion of decoys in the second-stage search.
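The comparison described above amounts to a set difference followed by a proportion. The sketch below assumes that each search result has already been reduced to a dictionary keyed by a (spectrum, peptide) pair with a decoy flag as the value; this representation and the function names are assumptions made for illustration and do not reflect the Mascot CSV layout.

```python
def unique_to_classic(classic_psms, fused_psms):
    """Return PSMs present in the classic search but absent from the fused one.

    Both arguments map a (spectrum_id, peptide) key to an is_decoy flag.
    """
    return {key: flag for key, flag in classic_psms.items() if key not in fused_psms}


def decoy_proportion(psms):
    """Proportion of decoy PSMs in a dictionary of key -> is_decoy."""
    if not psms:
        return None
    return sum(psms.values()) / len(psms)


# Toy results, invented for illustration only.
classic = {
    ("s1", "PEPTIDEK"): False,
    ("s2", "AAAGGK"): False,
    ("s3", "KEDITPEP"): True,
    ("s4", "GGGAAR"): True,
}
fused = {("s1", "PEPTIDEK"): False}

null_model = unique_to_classic(classic, fused)
proportion = decoy_proportion(null_model)
```

A proportion well away from 0.5 in real data would suggest that targets and decoys are not equally represented in the second-stage search.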
Results and Discussion
First, we estimated the ratio of target and decoy peptides for the decoy databases generated using the classic and proposed approaches. For 2,692,736 target peptides of length 6 and above with up to two missed cleavages allowed, there were 2,687,024 and 2,740,696 decoy peptides for the two approaches, respectively. The number of decoy peptides for the proposed approach is slightly higher because of peptides with missed cleavage sites that span the border between the target and reversed sequences. When a decoy sequence ends with a cleavage site, an additional N-terminal target peptide with the separator motif attached at its N-terminus may also be generated. These peptides are considered decoys because they are not shared with target proteins. Thus, while the commonly accepted decoy sequence generation approach leads to a small underestimation of FDR, the proposed approach is slightly conservative.
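A count of this kind can be sketched with a simple in silico digestion. The sketch below is not the script used in this work: the cleavage rule (after K or R unless followed by P) is an assumption, since the enzyme is not stated above, and all function names are hypothetical. Peptides are counted as decoys only if they do not also occur in a target protein, in line with the labeling described above.

```python
def cleavage_sites(sequence):
    """Return cut positions for an assumed trypsin-like rule."""
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        # Cut after K or R unless the next residue is P.
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))
    return sites


def tryptic_peptides(sequence, missed_cleavages=2, min_length=6):
    """Return the set of peptides with up to missed_cleavages missed sites."""
    sites = cleavage_sites(sequence)
    peptides = set()
    for i in range(len(sites) - 1):
        last = min(i + missed_cleavages + 2, len(sites))
        for j in range(i + 1, last):
            peptide = sequence[sites[i]:sites[j]]
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def peptide_set(sequences, **kwargs):
    """Union of digestion products over several protein sequences."""
    peptides = set()
    for sequence in sequences:
        peptides |= tryptic_peptides(sequence, **kwargs)
    return peptides


def count_target_and_decoy(targets, decoys, **kwargs):
    """Count unique target peptides and decoy peptides not shared with targets."""
    target_peptides = peptide_set(targets, **kwargs)
    decoy_peptides = peptide_set(decoys, **kwargs) - target_peptides
    return len(target_peptides), len(decoy_peptides)
```

Passing reversed sequences as `decoys` mimics the classic approach, while passing sequences in which the decoy and target parts are joined shows how extra border-spanning peptides increase the decoy count.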
X!Tandem peptide-spectrum matches filtered to 1% FDR. With no refinement and with the fused decoy database, modified peptides comprise less than 1% of the matches. With the classic database, modified peptides account for more than 1.6%, which indicates that FDR is underestimated. [Labels: global FDR filtering; nonsense modifications only]
Mascot peptide-spectrum matches identified in the “classic” search and filtered to 1% FDR. All PSMs are divided into three groups: identified in all searches, identified only in the “fused decoy” and “classic” searches, and unique to the “classic” search. [Labels: no refine; fused; classic]
This study confirms the previously voiced concerns regarding a bias in target-decoy-based FDR estimation for multi-stage searches. A solution to the problem is proposed. Unlike the previously reported decoy fusion method, the proposed approach does not require any modifications to the search engines or post-search validation tools used for the analysis. The only necessary change is to use a specific decoy database generation procedure, which does not affect other steps in protein identification workflows. The method has been successfully tested with the X!Tandem search engine combined with the MPscore and Percolator post-search validation tools, as well as with the Mascot search engine, and nothing limits its use with other tools that allow multi-stage analysis. The proposed decoy generation method is implemented in the Pyteomics library.
This work was supported by the Russian Science Foundation (project #14-14-00971). The authors have declared no conflict of interest.
8. Shilov, I.V., Seymour, S.L., Patel, A.A., Loboda, A., Tang, W.H., Keating, S.P., Hunter, C.L., Nuwaysir, L.M., Schaeffer, D.A.: The Paragon algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Mol. Cell. Proteomics 6, 1638–1655 (2007)
9. Jeong, K., Kim, S., Bandeira, N.: False discovery rates in spectral identification. BMC Bioinformatics 13(Suppl 16), S2 (2012)
15. Zhang, B., Wang, J., Wang, X., Zhu, J., Liu, Q., Shi, Z., Chambers, M.C., Zimmerman, L.J., Shaddox, K.F., Kim, S., Davies, S.R., Wang, S., Wang, P., Kinsinger, C.R., Rivers, R.C., Rodriguez, H., Townsend, R.R., Ellis, M.J.C., Carr, S.A., Tabb, D.L., Coffey, R.J., Slebos, R.J.C., Liebler, D.C.: NCI CPTAC: proteogenomic characterization of human colon and rectal cancer. Nature 513, 382–387 (2014)