Abstract
We investigate a problem which arises in computational biology: Given a constant-size alphabet A with a weight function µ: A → ℕ, find an efficient data structure and query algorithm solving the following problem: For a string σ over A and a weight M ∈ ℕ, decide whether σ contains a substring with weight M (One-String Mass Finding Problem). If the answer is yes, then we may in addition require a witness, i.e., indices i ≤ j such that the substring beginning at position i and ending at position j has weight M. We allow preprocessing of the string, and measure efficiency in two parameters: storage space required for the preprocessed data, and running time of the query algorithm for given M. We are interested in data structures and algorithms requiring subquadratic storage space and sublinear query time, where we measure the input size as the length n of the input string. Among others, we present two non-trivial efficient algorithms: Lookup solves the problem with O(n) space and $O\left(\frac{n}{\log n} \cdot \log\log n\right)$ time; Interval solves the problem for binary alphabets with O(n) storage space in O(log n) query time. Finally, we introduce other variants of the problem and sketch how our algorithms may be extended for these variants.
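To make the problem statement concrete, the following is a minimal Python sketch of a naive baseline, not of Lookup or Interval. It assumes that the weight of a substring is the sum of the weights of its letters and that all letter weights are strictly positive; the function name `find_submass` and the example alphabet are hypothetical. The sketch uses no preprocessing and scans the string once with a sliding window, so its query time is linear in n rather than sublinear.

```python
from typing import Dict, Optional, Tuple


def find_submass(sigma: str, mu: Dict[str, int], M: int) -> Optional[Tuple[int, int]]:
    """Return a witness (i, j), 1-based with i <= j, such that the substring
    of sigma from position i to position j has weight M, or None if no such
    substring exists.

    Assumption (not from the abstract): every weight in mu is strictly
    positive, so growing the window always increases its weight and
    shrinking it always decreases its weight.
    """
    if M <= 0:
        # With strictly positive weights, every non-empty substring weighs >= 1.
        return None

    left = 0      # 0-based index of the first letter in the window
    window = 0    # total weight of sigma[left..right]
    for right, letter in enumerate(sigma):
        window += mu[letter]
        # Drop letters from the left while the window is too heavy.
        while window > M:
            window -= mu[sigma[left]]
            left += 1
        if window == M:
            return (left + 1, right + 1)  # convert to 1-based positions
    return None


if __name__ == "__main__":
    weights = {"a": 1, "b": 3}   # hypothetical binary alphabet
    print(find_submass("abba", weights, 6))
    print(find_submass("abba", weights, 5))
```

Because each letter enters and leaves the window at most once, the scan takes O(n) time and O(1) extra space per query. This is the kind of trade-off the abstract aims to improve on: sublinear query time at the price of subquadratic preprocessed storage.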
© 2002 Springer Science+Business Media New York
Cieliebak, M., Lipták, Z., Welzl, E., Erlebach, T., Stoye, J. (2002). Algorithmic Complexity of Protein Identification: Searching in Weighted Strings. In: Baeza-Yates, R., Montanari, U., Santoro, N. (eds) Foundations of Information Technology in the Era of Network and Mobile Computing. IFIP — The International Federation for Information Processing, vol 96. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-35608-2_13
DOI: https://doi.org/10.1007/978-0-387-35608-2_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4757-5275-5
Online ISBN: 978-0-387-35608-2
eBook Packages: Springer Book Archive