Compression, Indexing, and Retrieval for Massive String Data

  • Wing-Kai Hon
  • Rahul Shah
  • Jeffrey Scott Vitter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6129)

Abstract

The field of compressed data structures seeks to achieve fast search times while using a compressed representation, ideally one requiring less space than the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
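
As an illustrative sketch (not taken from the paper), the following Python builds a plain, uncompressed suffix array, the kind of traditional full-text index whose functionality compressed structures aim to match in less space, and uses binary search to find arbitrary substrings. The function names `build_suffix_array` and `find_occurrences` and the sample text are assumptions made for this example, and the naive construction is suitable only for small inputs.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction that sorts the suffixes directly; illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions at which pattern occurs in text."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


# Hypothetical sample text, chosen only for this illustration.
text = "compressed text indexes compress text"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "press"))   # matches inside two different words
print(find_occurrences(text, sa, "ext in"))  # match spanning a word boundary
```

Both queries match text that is not a whole word, which a word-based inverted index would not answer directly; this is one sense in which full-text indexes offer more general search capabilities.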

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Wing-Kai Hon (1)
  • Rahul Shah (2)
  • Jeffrey Scott Vitter (3)
  1. National Tsing Hua University, Taiwan
  2. Louisiana State University, USA
  3. Texas A&M University, USA
