Algorithmica, Volume 67, Issue 4, pp 529–546

Distribution-Aware Compressed Full-Text Indexes

Abstract

In this paper we address the problem of building a compressed self-index that, given a distribution for the pattern queries and a bound on the space occupancy, minimizes the expected query time within that index space bound. We solve this problem by exploiting a reduction to the problem of finding a minimum weight K-link path in a properly designed Directed Acyclic Graph. Interestingly enough, our solution can be used with any compressed index based on the Burrows-Wheeler transform. Our experiments compare this optimal strategy with several other known approaches, showing its effectiveness in practice.
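To make the target of the reduction concrete, the following is a minimal sketch of the generic minimum weight K-link path problem, not the authors' algorithm. It assumes a complete DAG on nodes 0..n with an edge (i, j) for every i < j, and a caller-supplied `weight(i, j)` function; the function name, the node numbering, and the toy weight used in the usage note are all illustrative assumptions, since the abstract does not describe how the graph or its weights are built. The sketch uses a plain dynamic program running in O(K·n²) time; faster methods are known when the weights satisfy the concave Monge property.

```python
from math import inf
from typing import Callable, List, Tuple


def min_weight_k_link_path(
    n: int, k: int, weight: Callable[[int, int], float]
) -> Tuple[float, List[int]]:
    """Cheapest path from node 0 to node n using exactly k edges.

    Assumes a complete DAG with an edge (i, j) for every 0 <= i < j <= n.
    This is a naive O(k * n^2) dynamic program, shown only for illustration.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")

    # best[l][j]: minimum weight of a path from 0 to j with exactly l edges.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # prev[l][j]: predecessor of j on such a path (for reconstruction).
    prev = [[-1] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for l in range(1, k + 1):
        for j in range(l, n + 1):          # at least l edges are needed to reach j
            for i in range(l - 1, j):      # last edge of the path is (i, j)
                if best[l - 1][i] == inf:
                    continue
                cand = best[l - 1][i] + weight(i, j)
                if cand < best[l][j]:
                    best[l][j] = cand
                    prev[l][j] = i

    # Walk the predecessor table back from node n.
    path = [n]
    j = n
    for l in range(k, 0, -1):
        j = prev[l][j]
        path.append(j)
    path.reverse()
    return best[k][n], path
```

As a usage example with a toy weight, `min_weight_k_link_path(6, 3, lambda i, j: (j - i) ** 2)` picks the evenly spaced path through nodes 0, 2, 4, 6. In the setting of the abstract, K would presumably be tied to the space bound and the edge weights to the expected query cost, but that mapping is an assumption here.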

Keywords

Full-text indexing; Compressed full-text indexes; Succinct data structures; Dynamic programming

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Paolo Ferragina (1)
  • Jouni Sirén (2)
  • Rossano Venturini (1)

  1. Dipartimento di Informatica, University of Pisa, Pisa, Italy
  2. Department of Computer Science, University of Helsinki, Helsinki, Finland
