Compression, Indexing, and Retrieval for Massive String Data

Hon, Wing-Kai; Shah, Rahul; Vitter, Jeffrey Scott

doi:10.1007/978-3-642-13509-5_24

Wing-Kai Hon¹⁸,
Rahul Shah¹⁹ &
Jeffrey Scott Vitter²⁰

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6129))

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

822 Accesses
18 Citations

Abstract

The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.

Supported in part by Taiwan NSC grant 96–2221–E–007–082–MY3 (W. Hon) and USA National Science Foundation grant CCF–0621457 (R. Shah and J. S. Vitter).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Aggarwal, A., Vitter, J.S.: The Input/Output complexity of sorting and related problems. Communications of the ACM 31(9), 1116–1127 (1988)
Article MathSciNet Google Scholar
Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 83–94. Springer, Heidelberg (2007)
Chapter Google Scholar
Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multi-labeled trees. In: Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 680–689 (2007)
Google Scholar
Bayer, R., Unterauer, K.: Prefix B-trees. ACM Transactions on Database Systems 2(1), 11–26 (1977)
Article Google Scholar
Belazzougui, D.: Succinct dictionary matching with no slowdown. In: Proc. Symp. on Combinatorial Pattern Matching (June 2010)
Google Scholar
Bialynicka-Birula, I., Grossi, R.: Rank-sensitive data structures. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 79–90. Springer, Heidelberg (2005)
Chapter Google Scholar
Burrows, M., Wheeler, D.: A block sorting data compression algorithm. Technical report, Digital Systems Research Center (1994)
Google Scholar
Chan, H.L., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2) (2007)
Google Scholar
Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In: Proc. IEEE Data Compression Conf., pp. 252–261 (2008)
Google Scholar
Chiu, S.-Y., Hon, W.-K., Shah, R., Vitter, J.S.: I/O-efficient compressed text indexes: From theory to practice. In: Proc. IEEE Data Compression Conf., pp. 426–434 (2010)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proc. Symp. on Operating Systems Design and Implementation. December 2004, pp. 137–150, USENIX (2004)
Google Scholar
Elias, P.: Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory IT-21, 194–203 (1975)
Article MATH MathSciNet Google Scholar
Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. Information and Computation 207(8), 849–866 (2009)
Article MATH MathSciNet Google Scholar
Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. Journal of the ACM 52(4), 688–713 (2005)
Article MathSciNet Google Scholar
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics 12, article 1.12 (2008)
Google Scholar
Ferragina, P., Grossi, R.: The String B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM 46(2), 236–280 (1999)
Article MATH MathSciNet Google Scholar
Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: Proc. ACM Conf. on Principles of Database Systems, Vancouver, June 2008, pp. 181–190 (2008)
Google Scholar
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: Proc. IEEE Symp. on Foundations of Computer Science, pp. 184–196 (2005)
Google Scholar
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. IEEE Symp. on Foundations of Computer Science, November 2000, vol. 41, pp. 390–398 (2000)
Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)
Article MathSciNet Google Scholar
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms 3(2) (May 2007) Conference version in SPIRE 2004
Google Scholar
Ferragina, P., Venturini, R.: Compressed permuterm index. In: Proc. ACM SIGIR Conf. on Res. and Dev. in Information Retrieval, pp. 535–542 (2007)
Google Scholar
Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theoretical Computer Science 410(51), 5354–5364 (2009)
Article MATH MathSciNet Google Scholar
Foschini, L., Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: Experiments on suffix arrays and trees. ACM Transactions on Algorithms 2(4), 611–639 (2004); Conference versions in SODA 2004 and DCC 2004
Google Scholar
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc. IEEE Symp. on Foundations of Computer Science, vol. 40, pp. 285–298 (1999)
Google Scholar
Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: PAT trees and PAT arrays. In: Information Retrieval: Data Structures And Algorithms, ch. 5, pp. 66–82. Prentice-Hall, Englewood Cliffs (1992)
Google Scholar
González, R., Navarro, G.: A compressed text index on secondary memory. In: Proc. Intl. Work. Combinatorial Algorithms, Newcastle, Australia, pp. 80–91. College Publications (2007)
Google Scholar
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. ACM-SIAM Symp. on Discrete Algorithms (January 2003)
Google Scholar
Grossi, R., Gupta, A., Vitter, J.S.: Nearly tight bounds on the encoding length of the Burrows-Wheeler transform. In: Proc. Work. on Analytical Algorithmics and Combinatorics (January 2008)
Google Scholar
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proc. ACM Symp. on Theory of Computing, May 2000, vol. 32, pp. 397–406 (2000)
Google Scholar
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(32), 378–407 (2005)
Article MATH MathSciNet Google Scholar
Hon, W.-K., Lam, T.-W., Shah, R., Tam, S.-L., Vitter, J.S.: Compressed index for dictionary matching. In: Proc. IEEE Data Compression Conf., pp. 23–32 (2008)
Google Scholar
Hon, W.-K., Lam, T.-W., Shah, R., Tam, S.-L., Vitter, J.S.: Succinct index for dynamic dictionary matching. In: Dong, Y., Du, D.-Z., Ibarra, O. (eds.) ISAAC 2009. LNCS, vol. 5878. Springer, Heidelberg (2009)
Chapter Google Scholar
Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Hyyro, H. (ed.) SPIRE 2009. LNCS, vol. 5721, pp. 75–89. Springer, Heidelberg (2009)
Google Scholar
Hon, W.-K., Shah, R., Vitter, J.S.: Ordered pattern matching: Towards full-text retrieval. In: Purdue University Tech. Rept. (2006)
Google Scholar
Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-k string retrieval problems. In: Proc. IEEE Symp. on Foundations of Computer Science, Atlanta (October 2009)
Google Scholar
Kärkkäinen, J.: Repetition-Based Text Indexes. Ph.d., University of Helsinki (1999)
Google Scholar
Kärkkäinen, J., Rao, S.S.: Full-text indexes in external memory. In: Meyer, U., Sanders, P., Sibeyn, J. (eds.) Algorithms for Memory Hierarchies, ch. 7, pp. 149–170. Springer, Berlin (2003)
Chapter Google Scholar
Külekci, M.O., Hon, W.-K., Shah, R., Vitter, J.S., Xu, B.: A parallel sparse index for read alignment on genomes (2010)
Google Scholar
Lam, T.-W., Sung, W.-K., Wong, S.-S.: Improved approximate string matching using compressed suffix data structures. Algorithmica 51(3), 298–314 (2008)
Article MATH MathSciNet Google Scholar
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3), article R25 (2009)
Google Scholar
Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., Wang, J.: SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
Article Google Scholar
Lin, H., Zhang, Z., Zhang, M.Q., Ma, B., Li, M.: ZOOM: Zillions of oligos mapped. Bioinformatics 24(21), 2431–2437 (2008)
Article Google Scholar
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12(1), 40–66 (2005)
MathSciNet Google Scholar
Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: Proc. Latin American Theoretical Informatics Symp., pp. 703–714 (2006)
Google Scholar
Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 229–241. Springer, Heidelberg (2007)
Chapter Google Scholar
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3), article 12 (June 2008)
Google Scholar
Mäkinen, V., Navarro, G., Sadakane, K.: Advantages of backward searching—efficient secondary memory and distributed implementation of compressed suffix arrays. In: Fleischer, R., Trippen, G. (eds.) ISAAC 2004. LNCS, vol. 3341, pp. 681–692. Springer, Heidelberg (2004)
Google Scholar
Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
Article MATH MathSciNet Google Scholar
Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3) (2001); Conference version in SODA 1999
Google Scholar
McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)
Article MATH MathSciNet Google Scholar
Moffat, A., Zobel, J.: Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4), 349–379 (1996)
Article Google Scholar
Muthukrishnan, S.: Efficient Algorithms for Document Retrieval Problems. In: Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 657–666 (2002)
Google Scholar
Muthukrishnan, S.: Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. now Publishers, Hanover (2005)
Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)
Google Scholar
NCBI short read archive SRR001115, http://www.ncbi.nlm.nih.gov/
Patrascu, M.: Succincter. In: Proc. IEEE Symp. on Foundations of Computer Science, pp. 305–313 (2008)
Google Scholar
Puglisi, S.J., Smyth, W.F., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 122–133. Springer, Heidelberg (2006)
Chapter Google Scholar
Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3(4), article 43 (2007)
Google Scholar
Russo, L., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 362–373. Springer, Heidelberg (2008)
Chapter Google Scholar
Sadakane, K.: Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Lee, D.T., Teng, S.-H. (eds.) ISAAC 2000. LNCS, vol. 1969, pp. 410–421. Springer, Heidelberg (December 2000)
Google Scholar
Sadakane, K.: New text indexing functiionalities of the compressed suffix arrays. Journal of Algorithms 48(2), 294–313 (2003)
Article MATH MathSciNet Google Scholar
Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems 41(4), 589–607 (2007)
Article MATH MathSciNet Google Scholar
Sadakane, K.: Succinct Data Structures for Flexible Text Retrieval Systems. Journal of Discrete Algorithms 5(1), 12–22 (2007)
Article MATH MathSciNet Google Scholar
Sodan, A.C., Machina, J., Deshmeh, A., Macnaughton, K., Esbaugh, B.: Parallelism via multithreaded and multicore CPUs. IEEE Computer 43(3), 24–32 (2010)
Google Scholar
Tam, A., Wu, E., Lam, T.W., Yiu, S.-M.: Succinct text indexing with wildcards. In: Proc. Intl. Symp. on String Processing Information Retrieval, August 2009, pp. 39–50 (2009)
Google Scholar
Thankachan, S.V., Hon, W.-K., Shah, R., Vitter, J.S.: String retrieval for multi-pattern queries (2010)
Google Scholar
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Article MATH MathSciNet Google Scholar
Välimäki, N., Mäkinen, V.: Space-Efficient Algorithms for Document Retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)
Chapter Google Scholar
Vitter, J.S.: Algorithms and Data Structures for External Memory. Foundations and Trends in Theoretical Computer Science. now Publishers, Hanover (2008)
Google Scholar
Vitter, J.S., Shriver, E.A.M.: Algorithms for parallel memory I: Two-level memories. Algorithmica 12(2–3), 110–147 (1994)
Article MATH MathSciNet Google Scholar
Weiner, P.: Linear pattern matching algorithm. In: Proc. IEEE Symp. on Switching and Automata Theory, Washington, DC, vol. 14, pp. 1–11 (1973)
Google Scholar
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, Los Altos (1999)
Google Scholar
Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Computing Surveys 38(2) (2006)
Google Scholar

Download references

Author information

Authors and Affiliations

National Tsing Hua University, Taiwan
Wing-Kai Hon
Louisiana State University, USA
Rahul Shah
Texas A&M University, USA
Jeffrey Scott Vitter

Authors

Wing-Kai Hon
View author publications
You can also search for this author in PubMed Google Scholar
Rahul Shah
View author publications
You can also search for this author in PubMed Google Scholar
Jeffrey Scott Vitter
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA, and Bar-Ilan University, 52900, Ramat-Gan, Israel
Amihood Amir
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Laxmi Parida

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Hon, WK., Shah, R., Vitter, J.S. (2010). Compression, Indexing, and Retrieval for Massive String Data. In: Amir, A., Parida, L. (eds) Combinatorial Pattern Matching. CPM 2010. Lecture Notes in Computer Science, vol 6129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13509-5_24

Download citation

DOI: https://doi.org/10.1007/978-3-642-13509-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13508-8
Online ISBN: 978-3-642-13509-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics