Mathematics in Computer Science, Volume 11, Issue 2, pp 209–218

Indeterminate String Factorizations and Degenerate Text Transformations

Open Access
Article

Abstract

The data explosion problem continues to escalate, requiring novel and ingenious solutions. Pattern inference focusing on repetitive structures in data is a vigorous field of endeavor aimed at shrinking volumes of data by means of concise descriptions. The Burrows–Wheeler transformation computes a permutation of a string of letters over an alphabet, and is well suited to compression-related applications due to its invertibility and data clustering properties. For space efficiency, the input to the transform can be preprocessed into Lyndon factors. Rather than this classic deterministic approach for letter-based strings, we consider scenarios with uncertainty regarding the data: a position in an indeterminate or degenerate string is a set of letters. We first define indeterminate Lyndon words and establish their associated unique string factorization; then we introduce the novel degenerate Burrows–Wheeler transformation, which may apply the indeterminate Lyndon factorization. A core computation in Burrows–Wheeler type transforms is the linear sorting of all conjugates of the input string; we achieve this in the degenerate case with lex-extension ordering. Like the original forms, indeterminate Lyndon factorization and the degenerate transform and its inverse can all be computed in linear time and space with respect to the total input size of degenerate strings. Regular molecular biological strings yield a wealth of applications of big data; an important motivation for generalizing to degenerate strings is their extensive use in expressing polymorphism in DNA sequences.
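For orientation, the classic (deterministic, letter-based) forms that the abstract generalizes can be sketched as follows. This is a minimal illustration only, not the degenerate construction introduced in the article: `lyndon_factorization` follows Duval's well-known linear-time algorithm, and `bwt` sorts all conjugates naively, which is simple but not linear. Both function names are assumptions chosen for this example.

```python
def lyndon_factorization(s):
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a power of a Lyndon word
        # (possibly followed by a proper prefix of that word).
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i       # strictly larger letter: the whole run is one Lyndon word
            else:
                k += 1      # equal letter: keep matching the repeated period
            j += 1
        # Emit each full repetition of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bwt(s):
    """Classic BWT by naive sorting of all conjugates (rotations) of s.

    Returns the last column of the sorted conjugate matrix together with
    the row index of the original string, which is needed for inversion.
    """
    n = len(s)
    order = sorted(range(n), key=lambda r: s[r:] + s[:r])
    last_column = "".join(s[(r - 1) % n] for r in order)
    return last_column, order.index(0)


if __name__ == "__main__":
    print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
    print(bwt("banana"))                   # ('nnbaaa', 3)
```

In the degenerate setting described above, each position would hold a set of letters rather than a single letter, and the comparison `s[k] <= s[j]` would have to be replaced by the article's lex-extension order; that ordering is not reproduced here.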

Keywords

Degenerate biological string; Degenerate Burrows–Wheeler transform; Indeterminate Lyndon word; Indeterminate suffix array; Inverse transform; Lex-extension order; Linear

Mathematics Subject Classification

68R15 

Copyright information

© The Author(s) 2017

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Department of Informatics, King’s College London, London, UK
  2. Department of Computer Science, Royal Holloway, University of London, Egham, UK
  3. Information Science Department, Stellenbosch University, Stellenbosch, South Africa
  4. Centre for Artificial Intelligence Research, Meraka/CSIR, South Africa
