Extracting Approximate Patterns
In a sequence, the number of approximate patterns is exponential. In this paper, we present a new notion of basis for the patterns with don’t cares occurring in a given text (sequence). The primitive patterns are of interest since their number is lower than that given by previously known definitions (and in one case, sub-linear in the size of the text), and these patterns can be used to extract all the patterns of a text.
We present an incremental algorithm that computes the primitive patterns occurring at least q times in a text of length n, given the N primitive patterns occurring at least q−1 times, in time $O(|\Sigma| N n^2 \log^2 n \log\log n)$. In the particular case where q = 2, the time complexity is only $O(|\Sigma| n^2 \log^2 n \log\log n)$. We also give an algorithm that decides whether a given pattern is primitive in a given text.
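To make the notion of a pattern with don’t cares occurring at least q times concrete, here is a minimal sketch in Python. It is not the algorithm of the paper: it is a naive scan that only lists occurrences and checks the quorum q, and it does not test primitivity. The don’t-care symbol `.` and the function names `matches_at`, `occurrences`, and `meets_quorum` are assumptions made for illustration.

```python
# Illustrative sketch only: naive occurrence search for a pattern with
# don't cares. The symbol "." and all function names are hypothetical.
DONT_CARE = "."


def matches_at(pattern: str, text: str, i: int) -> bool:
    """Return True if pattern matches text starting at position i.

    A don't-care position matches any single character of the text.
    """
    return all(p == DONT_CARE or p == text[i + k] for k, p in enumerate(pattern))


def occurrences(pattern: str, text: str) -> list[int]:
    """Return all start positions where pattern occurs in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if matches_at(pattern, text, i)]


def meets_quorum(pattern: str, text: str, q: int) -> bool:
    """Return True if pattern occurs at least q times in text."""
    return len(occurrences(pattern, text)) >= q


if __name__ == "__main__":
    text = "aabacbab"
    print(occurrences("a.b", text))      # start positions of "a.b"
    print(meets_quorum("a.b", text, 2))  # quorum check for q = 2
```

This scan takes time proportional to the product of the text length and the pattern length for each pattern tested. It serves only to fix the vocabulary (occurrence, quorum q) used above.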