Brute-Force Approach for Mass Spectrometry-Based Variant Peptide Identification in Proteogenomics without Personalized Genomic Data

  • Mark V. Ivanov
  • Anna A. Lobas
  • Lev I. Levitsky
  • Sergei A. Moshkovskii
  • Mikhail V. Gorshkov
Short Communication

Abstract

In a proteogenomic approach based on tandem mass spectrometry analysis of proteolytic peptide mixtures, customized exome or RNA-seq databases are employed for identifying protein sequence variants. However, the problem of variant peptide identification without personalized genomic data is important for a variety of applications. Following the recent proposal by Chick et al. (Nat. Biotechnol. 33, 743–749, 2015) on the feasibility of such variant peptide search, we evaluated two available approaches based on the previously suggested “open” search and the “brute-force” strategy. To improve the efficiency of these approaches, we propose an algorithm for exclusion of false variant identifications from the search results involving analysis of modifications mimicking single amino acid substitutions. Also, we propose a de novo based scoring scheme for assessment of identified point mutations. In the scheme, the search engine analyzes y-type fragment ions in MS/MS spectra to confirm the location of the mutation in the variant peptide sequence.

Keywords

Proteogenomics; LC-MS/MS; Search engine; Open search

Introduction

Proteogenomic studies rely almost exclusively on identification of variant peptides using customized databases [1]. To search for these peptides, either exome or RNA-seq data obtained for the same sample are required. Identifying variant peptides without personalized genomic or transcriptomic data is challenging, and the question of whether such a search can be performed with meaningful efficiency remains open [2, 3, 4]. However, this alternative is needed in many applications in which high-quality genomic data are not available for the samples. For example, obtaining genomic sequences for toxic plants and animals may be difficult because of genome size and organization, while the sequences of their peptide toxins are of specific interest [5].

Currently, two approaches are available for variant peptide search without employing a customized protein database. The first is the so-called “open” search [6], and the other is the “brute-force” strategy, in which the search is performed for all possible single amino acid substitutions in the wild-type protein sequences [7]. These approaches may yield a significant number of variant peptide identifications; however, correct false discovery rate (FDR) estimation for such identifications is challenging. A number of recent studies address class-specific FDR estimation and propose methods for improving the sensitivity of proteogenomic searches [8, 9, 10, 11, 12]. However, most of these studies use traditional search algorithms with scoring functions best suited to wild-type peptide identification. This leads to an increased level of false peptide identifications of a non-random nature, in which modified wild-type peptides are identified as variants. In this work, we studied the efficiency of the two search approaches and propose an enhancement to the traditional search scoring algorithm for the “brute-force” method that significantly improves its specificity. The study was performed using publicly available LC-MS/MS and exome sequencing data for the melanoma cell line ME14 from the NCI-60 panel [13, 14].

Note that the “open” search can be run on most of the existing search engines. However, since the “brute-force” strategy evaluated in this study required extensive modifications of the search engine, the open-source IdentiPy search engine [15] (freely available at https://bitbucket.org/levitsky/identipy/) was used to analyze LC-MS/MS data. For the search, carbamidomethylation of cysteine was set as a fixed modification. Enzyme specificity was set to “trypsin”. Up to one missed cleavage was allowed, and the fragment ion mass measurement accuracy was set to 0.01 Da. The precursor mass accuracy was set to 15 ppm and 300 Da for the classic and “open” search approaches, respectively. The reference database UniProt Human (ver. 2013_09, containing 88,277 protein sequences, downloaded from http://www.uniprot.org/taxonomy/complete-proteomes) and the NCI-60 variant database described earlier [16] were used for the conventional proteogenomic searches employing exome data. The human SwissProt database (20,193 proteins) was used for the exome-free searches. The FDR for variant peptide identifications was estimated using the group-specific target-decoy strategy [8, 10, 17, 18] as follows:
$$ \mathrm{FDR}=\frac{\text{Number of decoy variant PSMs}+1}{\text{Number of target variant PSMs}} \tag{1} $$

Decoy proteins were generated by reversing the original sequences using the Pyteomics library [19].
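
As a minimal sketch of Equation 1 and of decoy generation by sequence reversal, the following Python snippet may help. The function names and the representation of a PSM as an `(is_decoy, is_variant)` pair are assumptions made for illustration; this is not the IdentiPy or Pyteomics code.

```python
def reverse_decoy(sequence):
    """Build a decoy sequence by reversing a target protein sequence."""
    return sequence[::-1]


def variant_group_fdr(psms):
    """Group-specific FDR: (decoy variant PSMs + 1) / target variant PSMs.

    psms: iterable of (is_decoy, is_variant) pairs (hypothetical layout).
    Only PSMs from the variant group enter the estimate.
    """
    variant_flags = [is_decoy for is_decoy, is_variant in psms if is_variant]
    decoys = sum(variant_flags)
    targets = len(variant_flags) - decoys
    if targets == 0:
        raise ValueError("no target variant PSMs to estimate FDR from")
    return (decoys + 1) / targets
```

With 200 target and 1 decoy variant PSMs, the estimate is (1 + 1) / 200 = 1%, regardless of how many wild-type PSMs are present.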

Two approaches for variant peptide identification without customized databases were tested in this work. First, the “open” search approach was used, where the precursor mass window was significantly expanded [6, 7]. In this approach, the peptides with single amino acid mutations are matched with their wild-type counterparts and reported by search engines with a precursor mass shift corresponding to the difference between two amino acids.
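
The interpretation of an “open” search mass shift as a residue substitution can be sketched as below. This is an illustrative helper, not part of any search engine: the function name is hypothetical, the residue masses are standard monoisotopic values, the cysteine mass is assumed to include the fixed carbamidomethylation, and the 0.01 Da tolerance follows the value used later in the text.

```python
# Monoisotopic residue masses in Da; C includes the fixed carbamidomethylation.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 160.03065, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def substitutions_for_shift(shift, tol=0.01):
    """List (wild-type, variant) residue pairs whose mass difference matches shift."""
    pairs = []
    for wt, wt_mass in RESIDUE_MASS.items():
        for var, var_mass in RESIDUE_MASS.items():
            delta = var_mass - wt_mass
            if delta == 0.0:  # same residue or isobaric L/I: zero shift, excluded
                continue
            if abs(delta - shift) <= tol:
                pairs.append((wt, var))
    return pairs
```

For a precursor shift of 0.984 Da, the helper returns the pairs N>D and Q>E, so a single observed shift can correspond to more than one candidate substitution.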

A straightforward alternative to the “open” search approach is the “brute-force” strategy. In this strategy, the variants are searched simply by varying the amino acid residues one by one in the sequences of targeted proteins. This procedure is implemented in several search engines, including X!Tandem [20], Mascot [21], and IdentiPy [15].
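
The residue-by-residue variation can be illustrated with a short generator. This is only a sketch under simplifying assumptions: the function name is hypothetical, it works on a single sequence, and it ignores the fact that a substitution may create or remove a tryptic cleavage site. It does not reproduce the implementation in X!Tandem, Mascot, or IdentiPy.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitution_variants(sequence):
    """Yield (position, wild_type, variant, new_sequence) for every single
    amino acid substitution; position is 1-based."""
    for i, wt in enumerate(sequence):
        for var in AMINO_ACIDS:
            if var != wt:
                yield i + 1, wt, var, sequence[:i] + var + sequence[i + 1:]
```

Each residue has 19 alternatives, so a sequence of length n produces 19n candidates (133 for a 7-residue peptide), which indicates how quickly the search space grows in this strategy.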

One of the most challenging issues in variant peptide identification is the high probability of incorrect assignment of MS/MS spectra produced by wild-type peptides carrying chemical modifications. Indeed, a number of modifications result in the same mass shifts as amino acid changes, thus mimicking amino acid substitutions due to mutation [22]. First, we performed a conventional proteogenomic search against a combined database containing both wild-type and variant protein sequences as described elsewhere [23]. This search identified 54,095 wild-type and 125 variant peptides. The spectra assigned to wild-type peptides at 1% FDR were excluded from the subsequent searches. In the next step, we performed the “open” and “brute-force” searches. From the “open” search results, we selected a separate group of identifications in which the mass shift was equal (within 0.01 Da) to the difference between any two amino acids (excluding the cases with zero mass shift). Identifications in this group were filtered to 1% FDR. Variant peptide identifications found using the “brute-force” approach were also filtered to a group-specific FDR of 1%. The total numbers of variant peptides were 13,767 and 9,876 for the “open” search and the “brute-force” approaches, respectively, and only 57 and 70 of them were confirmed by the exome-derived protein database. Since most of the identifications are not confirmed by exome data, they are assumed by default to be false positives originating from modifications mimicking point mutations. At this stage of the analysis, we call the identifications in both groups “pseudo variant” peptides to reflect their non-mutation origin. Table 1 shows the most frequent “pseudo variant” point mutations identified using the “brute-force” strategy compared with the most frequent point mutations according to the exome-based protein database. The list of all identified “mutations” with PSM and peptide counts for all approaches used in the manuscript is shown in Supplementary Table S1.
Table 1

The most frequent modifications mimicking point mutations found for the “brute-force” strategy and the most frequent real point mutations according to the exome-based protein database

| “Brute-force”: mass diff, Da | “Brute-force”: # peptides | “Brute-force”: amino acid substitution | Exome-based protein database: mass diff, Da | Exome-based protein database: # mutations | Exome-based protein database: amino acid substitution |
|---|---|---|---|---|---|
| 0.98 | 1595 | N>D | 30.01 | 2053 | A>T |
| 0.98 | 1478 | Q>E | –28.04 | 1956 | R>Q |
| 15.99 | 1416 | A>S | 28.03 | 1889 | A>V |
| 57.02 | 581 | A>Q | –19.04 | 1788 | R>H |
| 57.02 | 571 | G>N | –0.95 | 1458 | E>K |
| 42.01 | 542 | S>E | 3.93 | 1400 | R>C |
| 15.99 | 518 | F>Y | 16.03 | 1285 | P>L |
| –29.99 | 339 | M>T | 14.02 | 1272 | V>I |
| 0.96 | 338 | L>N | 29.98 | 1131 | R>W |
| 16.01 | 317 | D>M | –10.02 | 1116 | P>S |


Exact calculation of the probability of observing a modification instead of an amino acid substitution is not straightforward. Yet it is clear that the point mutations most frequently identified by exome-free methods are the most likely to be false positive variant identifications.
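
To illustrate how a modification can mimic a substitution, the sketch below compares the mass difference of a residue pair with the monoisotopic shifts of a few common modifications. The choice of modifications, the function name, and the 0.01 Da tolerance are assumptions for illustration and are not taken from the study; a mass coincidence alone does not show that the modification actually occurs at that residue.

```python
# Monoisotopic residue masses in Da; C includes the fixed carbamidomethylation.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 160.03065, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Monoisotopic shifts (Da) of a few common modifications (illustrative selection).
MODIFICATION_SHIFT = {
    "deamidation": 0.984016,
    "methylation": 14.015650,
    "oxidation": 15.994915,
    "acetylation": 42.010565,
    "carbamidomethylation": 57.021464,
}


def mimicking_modifications(wt, var, tol=0.01):
    """Names of listed modifications whose shift matches the wt>var substitution."""
    delta = RESIDUE_MASS[var] - RESIDUE_MASS[wt]
    return [name for name, shift in MODIFICATION_SHIFT.items()
            if abs(shift - delta) <= tol]
```

Under these assumptions, N>D coincides with deamidation, A>S with oxidation, G>N with carbamidomethylation, and S>E with acetylation, whereas R>W matches none of the listed modifications.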

In the next step, we evaluated the sensitivity and specificity of the two exome-free variant peptide identification approaches. As noted above, the “pseudo variant” peptides not confirmed by the exome-derived protein database are considered false positives. From the exome-free variant identifications, we then excluded peptides with suspicious residue substitutions corresponding to the most frequent mass shifts obtained above and listed in Supplementary Table S1. The dependence of the number of variant peptides identified using exome-free and exome-specific searches on the number of excluded amino acid residue substitutions is shown in Figure 1a and Supplementary Table S2.
Figure 1

Variant peptide identifications for (a) “conventional” (red line), “open” search (green lines), and “brute-force” (blue lines) approaches, and (b) the “brute-force” approach using the “peak difference confirmation” (PDC) score (purple lines). Dashed lines correspond to all variant peptides identified at 1% group-specific FDR, and solid lines correspond to variant peptides confirmed by the exome-based protein database. The X-axis represents the number of amino acid substitutions excluded from the analysis. The exclusion was done according to the mass shift frequencies. The exclusion of “conventional” search identifications was done using the frequencies from the “brute-force” search

The number of variant peptide identifications confirmed by the exome-based protein database does not change dramatically with the exclusion. This observation was expected because the abundance of the excluded amino acid substitutions among true variants depends more on the variant database than on the probability of mimicking modifications (see Table 1 and Supplementary Table S1). In contrast, the number of variants not confirmed by exome data drops at a significantly higher rate when we start excluding the high-frequency modifications mimicking residue substitutions. Importantly, excluding 50–100 such “suspicious” substitutions decreases the FDR by one order of magnitude for both exome-free approaches without loss of variant identifications confirmed by the exome-derived protein database.
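
The exclusion step can be sketched as a frequency ranking followed by a filter. The data layout (a list of `(peptide, substitution)` pairs), the function name, and ranking by substitution label rather than by mass shift are simplifying assumptions for illustration only.

```python
from collections import Counter


def exclude_frequent_substitutions(identifications, n_excluded):
    """Drop identifications whose substitution is among the n most frequent.

    identifications: list of (peptide, substitution) pairs, e.g. ("AAA", "N>D").
    Returns (kept identifications, set of excluded substitutions).
    """
    counts = Counter(sub for _, sub in identifications)
    excluded = {sub for sub, _ in counts.most_common(n_excluded)}
    kept = [(pep, sub) for pep, sub in identifications if sub not in excluded]
    return kept, excluded
```

Calling the function with increasing `n_excluded` and counting the remaining confirmed and unconfirmed peptides would give curves of the kind described for Figure 1a.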

As a final note, we describe a novel scoring scheme for the variant PSMs found using exome-free searches. As shown above, these searches are characterized by a high rate of false positives, in part because search engines do not discriminate against peptide-spectrum matches in which the mutation is not confirmed at the fragment peak level. To address this issue, we proposed and implemented an additional requirement for variant sequence candidates: the search engine considers only sequences in which the location of the point mutation is confirmed by the series of y-fragment ions in the MS/MS spectrum. Hereafter, we call this scoring scheme “peak difference confirmation”.

An example of “peak difference confirmation” is shown in Supplementary Figure S1. When the point mutation occurs at the 4th residue in a peptide of length 7, the search engine looks for the y3 and y4 ions and reports the identification only if both are matched. Applying this simple filter resulted in a 3-fold decrease in sensitivity for the “brute-force” approach but eliminated most of the false positives among the variant identifications (see Figure 1b). For example, adding “peak difference confirmation” to the exclusion of the 20 amino acid substitutions with the most frequent mass shifts reduces the number of false identifications from 2073 peptides to 38, while the number of true identifications drops from 65 to 18.
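
The y-ion check described above can be sketched as follows. This is a hedged illustration rather than the IdentiPy implementation: the function names are hypothetical, only singly charged y ions are considered, the 0.01 Da tolerance is the fragment accuracy quoted earlier, cysteine is assumed to be carbamidomethylated, and a substitution at the C-terminal residue is simply reported as unconfirmed because no y0 ion exists.

```python
# Monoisotopic residue masses in Da; C includes the fixed carbamidomethylation.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 160.03065, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, Da
PROTON = 1.007276   # Da


def y_ion_mz(peptide, k):
    """m/z of the singly charged y_k ion (last k residues plus water and a proton)."""
    return sum(RESIDUE_MASS[aa] for aa in peptide[-k:]) + WATER + PROTON


def flanking_y_ions(length, position):
    """Indices of the two y ions whose mass difference equals the residue at
    the 1-based position, e.g. (3, 4) for position 4 in a 7-residue peptide."""
    return length - position, length - position + 1


def peak_difference_confirmed(peptide, position, peaks, tol=0.01):
    """True if both flanking y ions are matched by observed m/z values."""
    low, high = flanking_y_ions(len(peptide), position)
    if low < 1:  # y0 does not exist; C-terminal substitutions are not handled here
        return False
    return all(
        any(abs(mz - y_ion_mz(peptide, k)) <= tol for mz in peaks)
        for k in (low, high)
    )
```

For the example in the text, `flanking_y_ions(7, 4)` gives `(3, 4)`, and the candidate is kept only when peaks matching both y3 and y4 are present in the spectrum.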

The efficiencies of the exome-free “open” search and “brute-force” approaches were compared with the conventional proteogenomic search workflow employing a customized database generated using sample-specific exome data. The evaluated exome-free approaches delivered up to 56% (70 of 125 peptides) of the variant peptides identified using customized databases, without a priori knowledge of these variants. At the same time, they result in a large number of highly confident variant identifications that are not confirmed by the exome, which can be attributed to peptides with modifications mimicking the residue substitutions with identical mass shifts. For the “brute-force” approach, we proposed an addition to the standard scoring scheme, called “peak difference confirmation”, which also addresses the problem of “pseudo variant” identifications.

All search engine output, PSM tables, protein databases, and search parameters are available at http://pubdata.theorchromo.ru/noexomeproteogenomics.

Notes

Acknowledgments

This work was supported by the Russian Foundation for Basic Research (project #16-54-21006). The authors have declared no conflict of interest.

Supplementary material

13361_2017_1859_MOESM1_ESM.png (276 kb)
Supplementary Figure S1 (PNG 275 kb)
13361_2017_1859_MOESM2_ESM.xls (103 kb)
Supplementary Table S1 (XLS 103 kb)
13361_2017_1859_MOESM3_ESM.xls (49 kb)
Supplementary Table S2 (XLS 49 kb)

References

  1. Jaffe, J.D., Berg, H.C., Church, G.M.: Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 4, 59–77 (2004)
  2. Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009)
  3. Desiere, F., Deutsch, E.W., Nesvizhskii, A.I., Mallick, P., King, N.L., Eng, J.K., Aderem, A., Boyle, R., Brunner, E., Donohoe, S., Fausto, N., Hafen, E., Hood, L., Katze, M.G., Kennedy, K.A., Kregenow, F., Lee, H., Lin, B., Martin, D., Ranish, J.A., Rawlings, D.J., Samelson, L.E., Shiio, Y., Watts, J.D., Wollscheid, B., Wright, M.E., Yan, W., Yang, L., Yi, E.C., Zhang, H., Aebersold, R.: Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 6, R9 (2005)
  4. Ingolia, N.T.: Ribosome profiling: new views of translation, from single codons to genome scale. Nat. Rev. Genet. 15, 205–213 (2014)
  5. Barghi, N., Concepcion, G.P., Olivera, B.M., Lluisma, A.O.: Structural features of conopeptide genes inferred from partial sequences of the Conus tribblei genome. Mol. Gen. Genomics. 291, 411–422 (2016)
  6. Chick, J.M., Kolippakkam, D., Nusinow, D.P., Zhai, B., Rad, R., Huttlin, E.L., Gygi, S.P.: A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 33, 743–749 (2015)
  7. Xiong, Y., Guo, Y., Xiao, W., Cao, Q., Li, S., Qi, X., Zhang, Z., Wang, Q., Shui, W.: An NGS-independent strategy for proteome-wide identification of single amino acid polymorphisms by mass spectrometry. Anal. Chem. 88, 2784–2791 (2016)
  8. Nesvizhskii, A.I.: Proteogenomics: concepts, applications and computational strategies. Nat. Methods. 11, 1114–1125 (2014)
  9. Woo, S., Cha, S.W., Bonissone, S., Na, S., Tabb, D.L., Pevzner, P.A., Bafna, V.: Advanced proteogenomic analysis reveals multiple peptide mutations and complex immunoglobulin peptides in colon cancer. J. Proteome Res. 14, 3555–3567 (2015)
  10. Woo, S., Cha, S.W., Na, S., Guest, C., Liu, T., Smith, R.D., Rodland, K.D., Payne, S., Bafna, V.: Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data. Proteomics. 14, 2719–2730 (2014)
  11. Menschaert, G., Fenyö, D.: Proteogenomics from a bioinformatics angle: a growing field. Mass Spectrom. Rev. 36, 584–599 (2017)
  12. Noble, W.S.: Mass spectrometrists should search only for peptides they care about. Nat. Methods. 12, 605–608 (2015)
  13. Moghaddas Gholami, A., Hahne, H., Wu, Z., Auer, F.J., Meng, C., Wilhelm, M., Kuster, B.: Global proteome analysis of the NCI-60 cell line panel. Cell Rep. 4, 609–620 (2013)
  14. Abaan, O.D., Polley, E.C., Davis, S.R., Zhu, Y.J., Bilke, S., Walker, R.L., Pineda, M., Gindin, Y., Jiang, Y., Reinhold, W.C., Holbeck, S.L., Simon, R.M., Doroshow, J.H., Pommier, Y., Meltzer, P.S.: The exomes of the NCI-60 panel: a genomic resource for cancer biology and systems pharmacology. Cancer Res. 73, 4372–4382 (2013)
  15. Levitsky, L.I., Ivanov, M.V., Lobas, A.A., Gorshkov, M.V.: IdentiPy – an open-source search engine for shotgun proteomics. J. Proteome Res. Submitted (2017)
  16. Karpova, M.A., Karpov, D.S., Ivanov, M.V., Pyatnitskiy, M.A., Chernobrovkin, A.L., Lobas, A.A., Lisitsa, A.V., Archakov, A.I., Gorshkov, M.V., Moshkovskii, S.A.: Exome-driven characterization of the cancer cell lines at the proteome level: the NCI-60 case study. J. Proteome Res. 13, 5551–5560 (2014)
  17. Elias, J.E., Gygi, S.P.: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods. 4, 207–214 (2007)
  18. Levitsky, L.I., Ivanov, M.V., Lobas, A.A., Gorshkov, M.V.: Unbiased false discovery rate estimation for shotgun proteomics based on the target-decoy approach. J. Proteome Res. 16, 393–397 (2017)
  19. Goloborodko, A.A., Levitsky, L.I., Ivanov, M.V., Gorshkov, M.V.: Pyteomics – a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom. 24, 301–304 (2013)
  20. Craig, R., Beavis, R.C.: TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 20, 1466–1467 (2004)
  21. Perkins, D.N., Pappin, D.J., Creasy, D.M., Cottrell, J.S.: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 20, 3551–3567 (1999)
  22. Chernobrovkin, A.L., Kopylov, A.T., Zgoda, V.G., Moysa, A.A., Pyatnitskiy, M.A., Kuznetsova, K.G., Ilina, I.Y., Karpova, M.A., Karpov, D.S., Veselovsky, A.V., Ivanov, M.V., Gorshkov, M.V., Archakov, A.I., Moshkovskii, S.A.: Methionine to isothreonine conversion as a source of false discovery identifications of genetically encoded variants in proteogenomics. J. Proteomics. 120, 169–178 (2015)
  23. Ivanov, M.V., Lobas, A.A., Karpov, D.S., Moshkovskii, S.A., Gorshkov, M.V.: Comparison of false discovery rate control strategies for variant peptide identifications in shotgun proteogenomics. J. Proteome Res. 16, 1936–1943 (2017)

Copyright information

© American Society for Mass Spectrometry 2017

Authors and Affiliations

  • Mark V. Ivanov (1, 2)
  • Anna A. Lobas (1, 2)
  • Lev I. Levitsky (1, 2)
  • Sergei A. Moshkovskii (3, 4)
  • Mikhail V. Gorshkov (1, 2)

  1. Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow, Russia
  2. Moscow Institute of Physics and Technology (State University), Dolgoprudny, Russia
  3. Institute of Biomedical Chemistry, Moscow, Russia
  4. Pirogov Russian National Research Medical University, Moscow, Russia
