An Alphabet-Friendly FM-Index

  • Paolo Ferragina
  • Giovanni Manzini
  • Veli Mäkinen
  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3246)

Abstract

We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T[1,n] is bounded by \(n H_{k}(T) + O\bigl((n \log\log n)/\log_{\vert {\Sigma}\vert } n\bigr)\) bits, where \(H_{k}(T)\) is the k-th order empirical entropy of T.

The above bound holds simultaneously for all \(k \le \alpha \log_{\vert {\Sigma}\vert } n\) and \(0 < \alpha < 1\). Moreover, the index design does not depend on the parameter k, which plays a role only in the analysis of the space occupancy.
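To make the quantity \(H_{k}(T)\) in the space bound concrete, the following minimal sketch computes it for a small string. It assumes the common definition in which each length-k context w contributes the zeroth-order entropy of the symbols that follow its occurrences in T; the function names `h0` and `hk` are illustrative and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Assumed definition: H_k(T) = (1/n) * sum over contexts w of
    |w_T| * H_0(w_T), where w_T collects the symbols that follow
    each occurrence of the length-k context w in T.
    """
    if not text:
        return 0.0
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(len(text) - k):
        followers[text[i:i + k]].append(text[i + k])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


if __name__ == "__main__":
    t = "mississippi"
    for k in range(3):
        print(k, round(hk(t, k), 4))
```

Multiplying `hk(t, k)` by `len(t)` gives the leading term \(n H_{k}(T)\) of the bound; the lower-order term is not modelled here.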

Using our index, the counting of the occurrences of an arbitrary pattern P[1,p] as a substring of T takes \(O(p \log \vert {\Sigma}\vert )\) time. Locating each pattern occurrence takes \(O\bigl(\log \vert {\Sigma}\vert \, (\log^{2} n/\log\log n)\bigr)\) time. Reporting a text substring of length ℓ takes \(O\bigl((\ell + \log^{2} n/\log\log n) \log \vert {\Sigma}\vert \bigr)\) time.
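The counting bound can be illustrated with a simplified sketch of FM-index backward search in which every rank query is answered by a wavelet tree over the Burrows-Wheeler transform. This is not the index of the paper: it uses a single uncompressed wavelet tree over the whole transform, omits the compression boosting partition, and builds the transform by naively sorting rotations. The names `WaveletTree`, `bwt`, `build_index` and `count`, and the `"\0"` sentinel (assumed absent from the text and smaller than every text symbol), are assumptions made for illustration.

```python
from collections import Counter


class WaveletTree:
    """Pointer-based wavelet tree with plain prefix-count arrays."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: at most one distinct symbol
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # ones[i] = how many of seq[:i] are routed to the right child
        self.ones = [0]
        for ch in seq:
            self.ones.append(self.ones[-1] + (ch not in self.left_syms))
        self.left = WaveletTree(
            [ch for ch in seq if ch in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [ch for ch in seq if ch not in self.left_syms], self.alphabet[mid:])

    def rank(self, c, i):
        """Occurrences of c in seq[:i]; one step per tree level."""
        node = self
        while node.left is not None:
            if c in node.left_syms:
                i -= node.ones[i]
                node = node.left
            else:
                i = node.ones[i]
                node = node.right
        return i if node.alphabet == [c] else 0


def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, demo only)."""
    t = text + sentinel
    order = sorted(range(len(t)), key=lambda i: t[i:] + t[:i])
    return "".join(t[i - 1] for i in order)


def build_index(text):
    last = bwt(text)
    # C[c] = number of symbols in the text that are smaller than c
    C, total = {}, 0
    for ch, cnt in sorted(Counter(last).items()):
        C[ch] = total
        total += cnt
    return WaveletTree(last), C, len(last)


def count(pattern, wt, C, n):
    """Occurrences of pattern in the text, using two ranks per symbol."""
    sp, ep = 0, n  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        sp = C[c] + wt.rank(c, sp)
        ep = C[c] + wt.rank(c, ep)
        if sp >= ep:
            return 0
    return ep - sp


if __name__ == "__main__":
    wt, C, n = build_index("mississippi")
    print(count("ssi", wt, C, n))  # prints 2
```

Each pattern symbol costs two rank queries, and each rank walks one root-to-leaf path of depth about log |Σ|, which is where the p log |Σ| factor in the counting time comes from in this simplified setting.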

Keywords

Compression Algorithm; Query Time; Discrete Algorithm; Pattern Occurrence; Alphabet Size
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Paolo Ferragina (1)
  • Giovanni Manzini (2)
  • Veli Mäkinen (3)
  • Gonzalo Navarro (4)

  1. Dipartimento di Informatica, University of Pisa, Italy
  2. Dipartimento di Informatica, University of Piemonte Orientale, Italy
  3. Department of Computer Science, University of Helsinki, Finland
  4. Department of Computer Science, University of Chile, Chile
