Skip to main content

Compression, Indexing, and Retrieval for Massive String Data

  • Conference paper
Combinatorial Pattern Matching (CPM 2010)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6129))

Included in the following conference series:

Abstract

The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.

Supported in part by Taiwan NSC grant 96–2221–E–007–082–MY3 (W. Hon) and USA National Science Foundation grant CCF–0621457 (R. Shah and J. S. Vitter).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aggarwal, A., Vitter, J.S.: The Input/Output complexity of sorting and related problems. Communications of the ACM 31(9), 1116–1127 (1988)

    Article  MathSciNet  Google Scholar 

  2. Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 83–94. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  3. Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multi-labeled trees. In: Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 680–689 (2007)

    Google Scholar 

  4. Bayer, R., Unterauer, K.: Prefix B-trees. ACM Transactions on Database Systems 2(1), 11–26 (1977)

    Article  Google Scholar 

  5. Belazzougui, D.: Succinct dictionary matching with no slowdown. In: Proc. Symp. on Combinatorial Pattern Matching (June 2010)

    Google Scholar 

  6. Bialynicka-Birula, I., Grossi, R.: Rank-sensitive data structures. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 79–90. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  7. Burrows, M., Wheeler, D.: A block sorting data compression algorithm. Technical report, Digital Systems Research Center (1994)

    Google Scholar 

  8. Chan, H.L., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2) (2007)

    Google Scholar 

  9. Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In: Proc. IEEE Data Compression Conf., pp. 252–261 (2008)

    Google Scholar 

  10. Chiu, S.-Y., Hon, W.-K., Shah, R., Vitter, J.S.: I/O-efficient compressed text indexes: From theory to practice. In: Proc. IEEE Data Compression Conf., pp. 426–434 (2010)

    Google Scholar 

  11. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proc. Symp. on Operating Systems Design and Implementation. December 2004, pp. 137–150, USENIX (2004)

    Google Scholar 

  12. Elias, P.: Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory IT-21, 194–203 (1975)

    Article  MATH  MathSciNet  Google Scholar 

  13. Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. Information and Computation 207(8), 849–866 (2009)

    Article  MATH  MathSciNet  Google Scholar 

  14. Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. Journal of the ACM 52(4), 688–713 (2005)

    Article  MathSciNet  Google Scholar 

  15. Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics 12, article 1.12 (2008)

    Google Scholar 

  16. Ferragina, P., Grossi, R.: The String B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM 46(2), 236–280 (1999)

    Article  MATH  MathSciNet  Google Scholar 

  17. Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: Proc. ACM Conf. on Principles of Database Systems, Vancouver, June 2008, pp. 181–190 (2008)

    Google Scholar 

  18. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: Proc. IEEE Symp. on Foundations of Computer Science, pp. 184–196 (2005)

    Google Scholar 

  19. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. IEEE Symp. on Foundations of Computer Science, November 2000, vol. 41, pp. 390–398 (2000)

    Google Scholar 

  20. Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)

    Article  MathSciNet  Google Scholar 

  21. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms 3(2) (May 2007) Conference version in SPIRE 2004

    Google Scholar 

  22. Ferragina, P., Venturini, R.: Compressed permuterm index. In: Proc. ACM SIGIR Conf. on Res. and Dev. in Information Retrieval, pp. 535–542 (2007)

    Google Scholar 

  23. Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theoretical Computer Science 410(51), 5354–5364 (2009)

    Article  MATH  MathSciNet  Google Scholar 

  24. Foschini, L., Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: Experiments on suffix arrays and trees. ACM Transactions on Algorithms 2(4), 611–639 (2004); Conference versions in SODA 2004 and DCC 2004

    Google Scholar 

  25. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc. IEEE Symp. on Foundations of Computer Science, vol. 40, pp. 285–298 (1999)

    Google Scholar 

  26. Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: PAT trees and PAT arrays. In: Information Retrieval: Data Structures And Algorithms, ch. 5, pp. 66–82. Prentice-Hall, Englewood Cliffs (1992)

    Google Scholar 

  27. González, R., Navarro, G.: A compressed text index on secondary memory. In: Proc. Intl. Work. Combinatorial Algorithms, Newcastle, Australia, pp. 80–91. College Publications (2007)

    Google Scholar 

  28. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. ACM-SIAM Symp. on Discrete Algorithms (January 2003)

    Google Scholar 

  29. Grossi, R., Gupta, A., Vitter, J.S.: Nearly tight bounds on the encoding length of the Burrows-Wheeler transform. In: Proc. Work. on Analytical Algorithmics and Combinatorics (January 2008)

    Google Scholar 

  30. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proc. ACM Symp. on Theory of Computing, May 2000, vol. 32, pp. 397–406 (2000)

    Google Scholar 

  31. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(32), 378–407 (2005)

    Article  MATH  MathSciNet  Google Scholar 

  32. Hon, W.-K., Lam, T.-W., Shah, R., Tam, S.-L., Vitter, J.S.: Compressed index for dictionary matching. In: Proc. IEEE Data Compression Conf., pp. 23–32 (2008)

    Google Scholar 

  33. Hon, W.-K., Lam, T.-W., Shah, R., Tam, S.-L., Vitter, J.S.: Succinct index for dynamic dictionary matching. In: Dong, Y., Du, D.-Z., Ibarra, O. (eds.) ISAAC 2009. LNCS, vol. 5878. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  34. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Hyyro, H. (ed.) SPIRE 2009. LNCS, vol. 5721, pp. 75–89. Springer, Heidelberg (2009)

    Google Scholar 

  35. Hon, W.-K., Shah, R., Vitter, J.S.: Ordered pattern matching: Towards full-text retrieval. In: Purdue University Tech. Rept. (2006)

    Google Scholar 

  36. Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-k string retrieval problems. In: Proc. IEEE Symp. on Foundations of Computer Science, Atlanta (October 2009)

    Google Scholar 

  37. Kärkkäinen, J.: Repetition-Based Text Indexes. Ph.d., University of Helsinki (1999)

    Google Scholar 

  38. Kärkkäinen, J., Rao, S.S.: Full-text indexes in external memory. In: Meyer, U., Sanders, P., Sibeyn, J. (eds.) Algorithms for Memory Hierarchies, ch. 7, pp. 149–170. Springer, Berlin (2003)

    Chapter  Google Scholar 

  39. Külekci, M.O., Hon, W.-K., Shah, R., Vitter, J.S., Xu, B.: A parallel sparse index for read alignment on genomes (2010)

    Google Scholar 

  40. Lam, T.-W., Sung, W.-K., Wong, S.-S.: Improved approximate string matching using compressed suffix data structures. Algorithmica 51(3), 298–314 (2008)

    Article  MATH  MathSciNet  Google Scholar 

  41. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3), article R25 (2009)

    Google Scholar 

  42. Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., Wang, J.: SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)

    Article  Google Scholar 

  43. Lin, H., Zhang, Z., Zhang, M.Q., Ma, B., Li, M.: ZOOM: Zillions of oligos mapped. Bioinformatics 24(21), 2431–2437 (2008)

    Article  Google Scholar 

  44. Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12(1), 40–66 (2005)

    MathSciNet  Google Scholar 

  45. Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: Proc. Latin American Theoretical Informatics Symp., pp. 703–714 (2006)

    Google Scholar 

  46. Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 229–241. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  47. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3), article 12 (June 2008)

    Google Scholar 

  48. Mäkinen, V., Navarro, G., Sadakane, K.: Advantages of backward searching—efficient secondary memory and distributed implementation of compressed suffix arrays. In: Fleischer, R., Trippen, G. (eds.) ISAAC 2004. LNCS, vol. 3341, pp. 681–692. Springer, Heidelberg (2004)

    Google Scholar 

  49. Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)

    Article  MATH  MathSciNet  Google Scholar 

  50. Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3) (2001); Conference version in SODA 1999

    Google Scholar 

  51. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)

    Article  MATH  MathSciNet  Google Scholar 

  52. Moffat, A., Zobel, J.: Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4), 349–379 (1996)

    Article  Google Scholar 

  53. Muthukrishnan, S.: Efficient Algorithms for Document Retrieval Problems. In: Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 657–666 (2002)

    Google Scholar 

  54. Muthukrishnan, S.: Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. now Publishers, Hanover (2005)

    Google Scholar 

  55. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)

    Google Scholar 

  56. NCBI short read archive SRR001115, http://www.ncbi.nlm.nih.gov/

  57. Patrascu, M.: Succincter. In: Proc. IEEE Symp. on Foundations of Computer Science, pp. 305–313 (2008)

    Google Scholar 

  58. Puglisi, S.J., Smyth, W.F., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 122–133. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  59. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3(4), article 43 (2007)

    Google Scholar 

  60. Russo, L., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 362–373. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  61. Sadakane, K.: Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Lee, D.T., Teng, S.-H. (eds.) ISAAC 2000. LNCS, vol. 1969, pp. 410–421. Springer, Heidelberg (December 2000)

    Google Scholar 

  62. Sadakane, K.: New text indexing functiionalities of the compressed suffix arrays. Journal of Algorithms 48(2), 294–313 (2003)

    Article  MATH  MathSciNet  Google Scholar 

  63. Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems 41(4), 589–607 (2007)

    Article  MATH  MathSciNet  Google Scholar 

  64. Sadakane, K.: Succinct Data Structures for Flexible Text Retrieval Systems. Journal of Discrete Algorithms 5(1), 12–22 (2007)

    Article  MATH  MathSciNet  Google Scholar 

  65. Sodan, A.C., Machina, J., Deshmeh, A., Macnaughton, K., Esbaugh, B.: Parallelism via multithreaded and multicore CPUs. IEEE Computer 43(3), 24–32 (2010)

    Google Scholar 

  66. Tam, A., Wu, E., Lam, T.W., Yiu, S.-M.: Succinct text indexing with wildcards. In: Proc. Intl. Symp. on String Processing Information Retrieval, August 2009, pp. 39–50 (2009)

    Google Scholar 

  67. Thankachan, S.V., Hon, W.-K., Shah, R., Vitter, J.S.: String retrieval for multi-pattern queries (2010)

    Google Scholar 

  68. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

    Article  MATH  MathSciNet  Google Scholar 

  69. Välimäki, N., Mäkinen, V.: Space-Efficient Algorithms for Document Retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  70. Vitter, J.S.: Algorithms and Data Structures for External Memory. Foundations and Trends in Theoretical Computer Science. now Publishers, Hanover (2008)

    Google Scholar 

  71. Vitter, J.S., Shriver, E.A.M.: Algorithms for parallel memory I: Two-level memories. Algorithmica 12(2–3), 110–147 (1994)

    Article  MATH  MathSciNet  Google Scholar 

  72. Weiner, P.: Linear pattern matching algorithm. In: Proc. IEEE Symp. on Switching and Automata Theory, Washington, DC, vol. 14, pp. 1–11 (1973)

    Google Scholar 

  73. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, Los Altos (1999)

    Google Scholar 

  74. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Computing Surveys 38(2) (2006)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hon, WK., Shah, R., Vitter, J.S. (2010). Compression, Indexing, and Retrieval for Massive String Data. In: Amir, A., Parida, L. (eds) Combinatorial Pattern Matching. CPM 2010. Lecture Notes in Computer Science, vol 6129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13509-5_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-13509-5_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13508-8

  • Online ISBN: 978-3-642-13509-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics