International Symposium on String Processing and Information Retrieval

SPIRE 2015: String Processing and Information Retrieval, pp. 222–233

Space-Efficient Detection of Unusual Words

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9309)

Abstract

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of \(O(\sigma ^2\log ^2 n)\) bits, where \(n\) is the length of the string and \(\sigma \) is the size of the alphabet. The size of the stack is \(o(n)\) except for very large values of \(\sigma \). We further improve the algorithm by removing its time dependency on \(\sigma \), by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
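
To make the problem statement concrete, the following is a minimal brute-force sketch of what scoring words against an IID model can look like. It is not the algorithm of the paper: it stores every k-mer explicitly instead of working on the BWT, so it does not scale to genomes. The function name `iid_scores`, the fixed word length `k`, and the score \((f - E)/\sqrt{E}\) are assumptions chosen for illustration; the paper may use different scores.

```python
from collections import Counter
from math import sqrt


def iid_scores(text, k):
    """Return {word: (observed, expected, z)} for every k-mer occurring in text.

    Illustrative baseline only: symbol probabilities are estimated from the
    text itself, and z = (observed - expected) / sqrt(expected) is one
    possible over/under-representation score (an assumption, not the
    paper's definition).
    """
    n = len(text)
    if k <= 0 or k > n:
        return {}

    # IID model: each symbol is drawn independently with its empirical frequency.
    symbol_counts = Counter(text)
    prob = {c: count / n for c, count in symbol_counts.items()}

    # Observed number of (possibly overlapping) occurrences of each k-mer.
    observed = Counter(text[i:i + k] for i in range(n - k + 1))
    positions = n - k + 1

    scores = {}
    for word, f in observed.items():
        p = 1.0
        for c in word:
            p *= prob[c]
        expected = positions * p  # expected count under the IID model
        z = (f - expected) / sqrt(expected)
        scores[word] = (f, expected, z)
    return scores
```

A positive score suggests a word that occurs more often than the model predicts, a negative score one that occurs less often. Words that never occur in the text are not enumerated by this sketch, whereas the abstract states that the paper also detects and scores such candidates.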

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Computer Science, University of Helsinki, Helsinki, Finland
  2. Helsinki Institute for Information Technology, Helsinki, Finland
  3. Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
