Indexing and Searching a Mass Spectrometry Database

  • Søren Besenbacher
  • Benno Schwikowski
  • Jens Stoye
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6060)


Database preprocessing in order to create an index often permits considerable speedup in search compared to the iterated query of an unprocessed database. In this paper we apply index-based database lookup to a range search problem that arises in mass spectrometry-based proteomics: given a large collection of sparse integer sets and a sparse query set, find all the sets from the collection that have at least k integers in common with the query set. This problem arises when searching for a mass spectrum in a database of theoretical mass spectra using the shared peaks count as similarity measure. The algorithms can easily be modified to use the more advanced shared peaks intensity measure instead of the shared peaks count. We introduce three different algorithms solving these problems. We conclude by presenting some experiments using the algorithms on realistic data showing the advantages and disadvantages of the algorithms.


Cluster Algorithm Tandem Mass Spectrometry Query Time Theoretical Spectrum International Protein Index 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Aebersold, R., Mann, M.: Mass spectrometry-based proteomics. Nature 422(6928), 198–207 (2003)CrossRefGoogle Scholar
  2. 2.
    Dutta, D., Chen, T.: Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search. Bioinformatics 23(5), 612–618 (2007)CrossRefGoogle Scholar
  3. 3.
    Elias, J.E., Gibbons, F.D., King, O.D., Roth, F.P., Gygi, S.P.: Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22(2), 214–219 (2004)CrossRefGoogle Scholar
  4. 4.
    Frank, A., Tanner, S., Pevzner, P.A.: Peptide sequence tags for fast database search in mass-spectrometry. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2005. LNCS (LNBI), vol. 3500, pp. 326–341. Springer, Heidelberg (2005)Google Scholar
  5. 5.
    Havilio, M., Haddad, Y., Smilansky, Z.: Intensity-based statistical scorer for tandem mass spectrometry. Anal. Chem. 75(3), 435–444 (2003)CrossRefGoogle Scholar
  6. 6.
    Izumi, T., Yokomaru, T., Takahashi, A., Kajitani, Y.: Computational complexity analysis of set-bin-packing problem. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E81-A(5), 842–849 (1998)Google Scholar
  7. 7.
    Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R., Shoemaker, D.D.: Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302(5653), 2141–2144 (2003)CrossRefGoogle Scholar
  8. 8.
    Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., Apweiler, R.: The international protein index: An integrated database for proteomics experiments. proteomics 4(7), 1985–1988 (2004)CrossRefGoogle Scholar
  9. 9.
    Mann, M., Jensen, O.N.: Proteomic analysis of post-translational modifications. Nature Biotechnol. 21(3), 255–261 (2003)CrossRefGoogle Scholar
  10. 10.
    Mann, M., Wilm, M.: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66(24), 4390–4399 (1994)CrossRefGoogle Scholar
  11. 11.
    Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L., Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Irzyk, G.P., Jando, S.C., Alenquer, M.L., Jarvie, T.P., Jirage, K.B., Kim, J.B., Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J., Lohman, K.L., Lu, H., Makhijani, V.B., McDade, K.E., McKenna, M.P., Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth, G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., Tomasz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P., Yu, P., Begley, R.F., Rothberg, J.M.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057), 376–380 (2005)Google Scholar
  12. 12.
    McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)zbMATHCrossRefMathSciNetGoogle Scholar
  13. 13.
    Palagi, P.M., Hernandez, P., Walther, D., Appel, R.D.: Proteome informatics I: Bioinformatics tools for processing experimental data. Proteomics 6(20), 5435–5444 (2006)CrossRefGoogle Scholar
  14. 14.
    Ramakrishnan, S.R., Mao, R., Nakorchevskiy, A.A., Prince, J.T., Willard, W.S., Xu, W., Marcotte, E.M., Miranker, D.P.: A fast coarse filtering method for peptide identification by mass spectrometry. Bioinformatics 22(12), 1524–1531 (2006)CrossRefGoogle Scholar
  15. 15.
    Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)zbMATHCrossRefMathSciNetGoogle Scholar
  16. 16.
    Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11. IEEE Press, Los Alamitos (1973)CrossRefGoogle Scholar
  17. 17.
    Whitfield, E.J., Pruess, M., Apweiler, R.: Bioinformatics database infrastructure for biotechnology research. J. Biotechnol. 124(4), 629–639 (2006)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Søren Besenbacher
    • 1
  • Benno Schwikowski
    • 2
  • Jens Stoye
    • 3
  1. 1.deCODE GeneticsReykjavikIceland
  2. 2.Institut PasteurSystems Biology GroupParisFrance
  3. 3.Technische FakultätUniversität BielefeldGermany

Personalised recommendations