Algorithmic Complexity of Protein Identification: Searching in Weighted Strings

Cieliebak, Mark; Lipták, Zsuzsanna; Welzl, Emo; Erlebach, Thomas; Stoye, Jens

doi:10.1007/978-0-387-35608-2_13


Part of the book series: IFIP — The International Federation for Information Processing (IFIPAICT, volume 96)

Abstract

We investigate a problem that arises in computational biology. Given a constant-size alphabet A with a weight function µ: A → ℕ, find an efficient data structure and query algorithm for the following problem: for a string σ over A and a weight M ∈ ℕ, decide whether σ contains a substring with weight M (One-String Mass Finding Problem). If the answer is yes, we may in addition require a witness, i.e., indices i ≤ j such that the substring beginning at position i and ending at position j has weight M. We allow preprocessing of the string and measure efficiency in two parameters: the storage space required for the preprocessed data, and the running time of the query algorithm for a given M. We are interested in data structures and algorithms requiring subquadratic storage space and sublinear query time, where the input size is the length n of the input string. Among others, we present two non-trivial efficient algorithms: Lookup solves the problem with O(n) space and $O\left(\frac{n}{\log n} \cdot \log\log n\right)$ time; Interval solves the problem for binary alphabets with O(n) storage space in O(log n) query time. Finally, we introduce other variants of the problem and sketch how our algorithms may be extended to these variants.
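To make the problem statement concrete, the following is a minimal sketch of a preprocessing-free baseline: a linear scan with a sliding window that answers one query in O(n) time. It is not the Lookup or Interval algorithm from the chapter, whose details are not given in the abstract. The function name `find_submass`, the dictionary representation of µ, and the restriction to strictly positive weights are assumptions made for illustration only.

```python
from typing import Dict, Optional, Tuple


def find_submass(sigma: str, mu: Dict[str, int], M: int) -> Optional[Tuple[int, int]]:
    """Baseline linear scan (assumes every weight in mu is strictly positive).

    Returns a witness (i, j) with 1-based indices i <= j such that the
    substring sigma[i..j] has weight M, or None if no such substring exists.
    """
    if M <= 0:
        return None  # with positive weights, no non-empty substring weighs <= 0
    left = 0    # 0-based start of the current window
    window = 0  # weight of sigma[left..right]
    for right, ch in enumerate(sigma):
        window += mu[ch]
        # Shrink from the left while the window is too heavy.
        while window > M and left <= right:
            window -= mu[sigma[left]]
            left += 1
        if window == M and left <= right:
            return (left + 1, right + 1)  # convert to 1-based indices
    return None


# Hypothetical example: weights a=2, b=3 and the string "abba".
weights = {"a": 2, "b": 3}
print(find_submass("abba", weights, 6))  # witness for the substring "bb"
print(find_submass("abba", weights, 4))  # no substring has weight 4
```

Because all weights are positive, the window weight grows when the right end advances and shrinks when the left end advances, so each position is visited at most twice. This baseline uses no preprocessed data and takes linear query time, which is the bound that sublinear-query data structures aim to beat.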

Author information

Jens Stoye
Present address: Faculty of Technology, Bielefeld University, Germany

Authors and Affiliations

Institute of Theoretical Computer Science, ETH Zurich, Switzerland
Mark Cieliebak, Zsuzsanna Lipták & Emo Welzl
Computer Engineering and Networks Laboratory, ETH Zurich, Switzerland
Thomas Erlebach
Max Planck Institute of Molecular Genetics, Konrad-Zuse-Zentrum (ZIB), Berlin, Germany
Jens Stoye

Editor information

Editors and Affiliations

Universidad de Chile, Chile
Ricardo Baeza-Yates
Università di Pisa, Italy
Ugo Montanari
Carleton University, Canada
Nicola Santoro

About this chapter

Cite this chapter

Cieliebak, M., Lipták, Z., Welzl, E., Erlebach, T., Stoye, J. (2002). Algorithmic Complexity of Protein Identification: Searching in Weighted Strings. In: Baeza-Yates, R., Montanari, U., Santoro, N. (eds) Foundations of Information Technology in the Era of Network and Mobile Computing. IFIP — The International Federation for Information Processing, vol 96. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-35608-2_13

DOI: https://doi.org/10.1007/978-0-387-35608-2_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4757-5275-5
Online ISBN: 978-0-387-35608-2
eBook Packages: Springer Book Archive
