String Pattern Matching for a Deluge Survival Kit

Handbook of Massive Data Sets

Part of the book series: Massive Computing ((MACO,volume 4))

Abstract

String Pattern Matching concerns itself with algorithmic and combinatorial issues related to matching and searching on linearly arranged sequences of symbols, arguably the simplest possible discrete structures. As unprecedented volumes of sequence data are amassed, disseminated and shared at an increasing pace, effective access to, and manipulation of, such data depend crucially on the efficiency with which strings are structured, compressed, transmitted, stored, searched and retrieved. This paper samples from this perspective, and with the authors’ own bias, a rich arsenal of ideas and techniques developed in more than three decades of history.


© 2002 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Apostolico, A., Crochemore, M. (2002). String Pattern Matching for a Deluge Survival Kit. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_6

  • DOI: https://doi.org/10.1007/978-1-4615-0005-6_6

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-4882-5

  • Online ISBN: 978-1-4615-0005-6

  • eBook Packages: Springer Book Archive
