Information Retrieval, Volume 15, Issue 6, pp 527–557

Implicit indexing of natural language text by reorganizing bytecodes

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Susana Ladra
  • Gonzalo Navarro

Abstract

Word-based, byte-oriented compression has proved successful on large natural language text databases because it provides competitive compression ratios, fast random access, and direct sequential searching. We show that simply rearranging the target symbols of the compressed text into a tree-shaped structure, at the cost of negligible additional space, yields a new implicitly indexed representation of the compressed text in which search times improve drastically. The occurrences of a word can be listed directly, without any text scanning, and in general any inverted-index-like capability, such as efficient phrase searching, can be emulated without storing any inverted list information. Our experiments show that, when little extra space is used, the proposal is much more efficient not only than sequential search over compressed text but also than explicit inverted indexes and other types of indexes. The representation is especially successful when searching for single words and short phrases.
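
The rearrangement described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the codeword assignment in `vocab`, the function names `build`, `select`, and `locate`, and the use of plain linear scans are all assumptions made for illustration. The idea shown is that the first byte of every codeword goes to a root sequence, the second byte of every codeword starting with byte b goes to the child sequence for b, and so on; the occurrences of a word are then found by walking from its deepest node up to the root, with no scan of the text.

```python
from collections import defaultdict

def build(codes):
    """Distribute codeword bytes into tree nodes keyed by codeword prefix.

    `codes` is the text as a list of byte tuples, one prefix-free codeword
    per word. Node () is the root; node p holds, in text order, the byte
    that follows prefix p in every codeword starting with p.
    """
    nodes = defaultdict(list)
    for code in codes:
        for depth, byte in enumerate(code):
            nodes[code[:depth]].append(byte)
    return dict(nodes)

def select(seq, byte, k):
    """Index of the k-th (1-based) occurrence of `byte` in `seq`.

    Assumption: a linear scan keeps the sketch short; a practical
    structure would answer this with a small auxiliary index.
    """
    seen = 0
    for i, b in enumerate(seq):
        if b == byte:
            seen += 1
            if seen == k:
                return i
    raise ValueError("fewer than k occurrences")

def locate(nodes, code):
    """Word positions in the text of every occurrence of codeword `code`."""
    prefix, last = code[:-1], code[-1]
    total = nodes.get(prefix, []).count(last)
    positions = []
    for k in range(1, total + 1):
        pos = select(nodes[prefix], last, k)
        p = prefix
        while p:
            # Entry `pos` of node p is the (pos + 1)-th occurrence
            # of byte p[-1] in the parent node p[:-1].
            pos = select(nodes[p[:-1]], p[-1], pos + 1)
            p = p[:-1]
        positions.append(pos)
    return positions

# Hypothetical toy vocabulary: frequent words get 1-byte codes,
# rarer words get 2-byte codes sharing the first byte 0.
vocab = {"the": (1,), "cat": (2,), "sat": (0, 1), "mat": (0, 2), "on": (0, 3)}
text = "the cat sat on the mat the cat".split()
nodes = build([vocab[w] for w in text])

the_positions = locate(nodes, vocab["the"])
mat_positions = locate(nodes, vocab["mat"])
```

In this toy case the root sequence is `[1, 2, 0, 0, 1, 0, 1, 2]` and the node for prefix `(0,)` is `[1, 3, 2]`, so the total number of stored bytes equals that of the plain compressed text; only the order changes.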

Keywords

  • Word-based compression
  • Searching compressed text
  • Compressed indexing

Notes

Acknowledgments

The Spanish group was funded by MICINN grants TIN2009-14560-C03-02 and TIN2010-21246-C02-01, Ministerio de Ciencia e Innovación grant CDTI CEN-20091048, and Xunta de Galicia grant 2010/17. The fourth author was funded by Fondecyt grant 1-110066.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Antonio Fariña (1)
  • Susana Ladra (1)
  • Gonzalo Navarro (2)
  1. Database Laboratory, University of A Coruña, A Coruña, Spain
  2. Department of Computer Science, University of Chile, Santiago, Chile
