Phrase-Based Pattern Matching in Compressed Text

  • J. Shane Culpepper
  • Alistair Moffat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4209)


Byte codes are a practical alternative to the traditional bit-oriented compression approaches when large alphabets are being used, and trade away a small amount of compression effectiveness for a relatively large gain in decoding efficiency. Byte codes also have the advantage of being searchable using standard string matching techniques. Here we describe methods for searching in byte-coded compressed text and investigate the impact of large alphabets on traditional string matching techniques. We also describe techniques for phrase-based searching in a restricted type of byte code, and present experimental results that compare our adapted methods with previous approaches.
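As a rough illustration of why byte-coded text can be searched with standard string matching, the sketch below uses a simple tagged byte code. This is an assumption chosen for brevity, not the restricted prefix code studied in the paper: each word is mapped to an integer rank, the rank is written in base 128, and only the final byte of each codeword has its high bit set. A phrase is encoded with the same model and located with an ordinary byte-string search, followed by a check that the match begins on a codeword boundary. All function names (`build_model`, `encode_rank`, `encode`, `find_phrase`) are hypothetical.

```python
from collections import Counter


def build_model(words):
    """Map each distinct word to a rank; frequent words get small ranks."""
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered)}


def encode_rank(rank):
    """Tagged byte code (assumed): base-128 digits, high bit on the last byte."""
    digits = [rank & 0x7F]
    rank >>= 7
    while rank:
        digits.append(rank & 0x7F)
        rank >>= 7
    digits.reverse()          # most significant digit first
    digits[-1] |= 0x80        # mark the final (stopper) byte
    return bytes(digits)


def encode(words, model):
    """Concatenate the codewords of a word sequence."""
    return b"".join(encode_rank(model[w]) for w in words)


def find_phrase(compressed, phrase_words, model):
    """Return byte offsets where the phrase occurs on a codeword boundary."""
    if any(w not in model for w in phrase_words):
        return []
    pattern = encode(phrase_words, model)
    hits = []
    pos = compressed.find(pattern)
    while pos != -1:
        # A genuine match starts at offset 0 or directly after a stopper byte;
        # otherwise the pattern only matched the tail of a longer codeword.
        if pos == 0 or compressed[pos - 1] & 0x80:
            hits.append(pos)
        pos = compressed.find(pattern, pos + 1)
    return hits


text = "fox socks box knox knox in box fox in socks".split()
model = build_model(text)
compressed = encode(text, model)
print(find_phrase(compressed, ["fox", "in", "socks"], model))  # [7]
```

The boundary check is what makes the plain byte search safe in this sketch: without it, a one-byte codeword could falsely match the last byte of a two-byte codeword.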


Pattern Match, Brute Force, Word Boundary, Huffman Code, Pattern Length
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • J. Shane Culpepper (1)
  • Alistair Moffat (1)
  1. NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Australia
