Parallelising the Computation of Minimal Absent Words

  • Carl Barton
  • Alice Heliou
  • Laurent Mouchard
  • Solon P. Pissis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9574)

Abstract

An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an \(\mathcal{O}(n)\)-time and \(\mathcal{O}(n)\)-space algorithm for computing all minimal absent words on a fixed-sized alphabet, based on the construction of the suffix array (Barton et al., 2014). The authors also provided an implementation of this algorithm, which is currently the fastest available. In this article, we present a new \(\mathcal{O}(n)\)-time and \(\mathcal{O}(n)\)-space algorithm for computing all minimal absent words; it has the desirable property that, once the indexing data structure has been built, the computation of minimal absent words can be executed in parallel. Experimental results show that a multiprocessing implementation of this algorithm can accelerate the overall computation by more than a factor of two compared to state-of-the-art approaches. When the construction time of the indexing data structure is excluded, the implementation achieves near-optimal speed-ups.
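
To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not the linear-time suffix-array algorithm described in the abstract, only a direct transcription of the definition that is practical for very short words. The function names `factors` and `minimal_absent_words` and the example word are assumptions chosen for illustration. The sketch relies on the observation that a word x of length at least 2 is a minimal absent word exactly when x is absent while its longest proper prefix and longest proper suffix both occur, since every other proper factor of x is a factor of one of those two.

```python
from itertools import product


def factors(y):
    """Return the set of all factors (substrings) of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def minimal_absent_words(y, alphabet):
    """Brute-force enumeration of the minimal absent words of y (illustrative only)."""
    occ = factors(y)
    # A letter that never occurs in y is a minimal absent word of length 1:
    # its only proper factor is the empty word, which always occurs.
    maws = {a for a in alphabet if a not in occ}
    # Every longer candidate has the form a + u + b, where u is a factor of y.
    for u in occ:
        for a, b in product(alphabet, repeat=2):
            x = a + u + b
            # x must be absent, while its longest proper prefix (a + u)
            # and longest proper suffix (u + b) must both occur in y.
            if x not in occ and a + u in occ and u + b in occ:
                maws.add(x)
    return sorted(maws)


print(minimal_absent_words("abaab", "ab"))  # ['aaa', 'aaba', 'bab', 'bb']
```

This sketch materialises all factors of y, so it needs quadratically many set entries and is unsuitable for genome-scale input; it is intended only to clarify which words qualify.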

Keywords

  • Algorithms on strings
  • Absent words
  • Suffix array

References

  1. Abboud, A., Williams, V.V., Weimann, O.: Consequences of faster alignment of sequences. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) ICALP 2014. LNCS, vol. 8572, pp. 39–51. Springer, Heidelberg (2014)
  2. Acquisti, C., Poste, G., Curtiss, D., Kumar, S.: Nullomers: really a matter of natural selection? PLoS One 2(10), e1022 (2007)
  3. Barton, C., Heliou, A., Mouchard, L., Pissis, S.P.: Linear-time computation of minimal absent words using suffix array. BMC Bioinform. 15, 388 (2014)
  4. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013)
  5. Chairungsee, S., Crochemore, M.: Using minimal absent words to build phylogeny. Theoret. Comput. Sci. 450, 109–116 (2012)
  6. Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inf. Process. Lett. 67, 111–117 (1998)
  7. Fischer, J.: Inducing the LCP-array. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011. LNCS, vol. 6844, pp. 374–385. Springer, Heidelberg (2011)
  8. Garcia, S.P., Pinho, A.J.: Minimal absent words in four human genome assemblies. PLoS One 6(12), e29344 (2011)
  9. Garcia, S.P., Pinho, O.J., Rodrigues, J., Bastos, C.A.C., Ferreira, G.P.J.S.: Minimal absent words in prokaryotic and eukaryotic genomes. PLoS One 6, e16065 (2011)
  10. Haubold, B., Pierstorff, N., Möller, F., Wiehe, T.: Genome comparison without alignment using shortest unique substrings. BMC Bioinform. 6, 123 (2005)
  11. Jacobson, G.: Space-efficient static trees and graphs. In: 30th SFCS 1989, pp. 549–554. IEEE Computer Society (1989)
  12. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
  13. Mignosi, F., Restivo, A., Sciortino, M.: Words and forbidden factors. Theoret. Comput. Sci. 273(1–2), 99–117 (2002)
  14. Nong, G., Zhang, S., Chan, W.H.: Linear suffix array construction by almost pure induced-sorting. In: DCC 2009, pp. 193–202. IEEE Computer Society (2009)
  15. Pinho, A.J., Ferreira, P.J.S.G., Garcia, S.P., Rodrigues, J.M.: On finding minimal absent words. BMC Bioinform. 10 (2009)
  16. Shun, J.: Fast parallel computation of longest common prefixes. In: SC 2014, pp. 387–398. IEEE Computer Society (2014)
  17. Silva, R.M., Pratas, D., Castro, L., Pinho, A.J., Ferreira, P.J.S.G.: Three minimal sequences found in Ebola virus genomes and absent from human DNA. Bioinformatics 31(15), 2421–2425 (2015)
  18. Wu, Z.D., Jiang, T., Su, W.J.: Efficient computation of shortest absent words in a genomic sequence. Inf. Process. Lett. 110(14–15), 596–601 (2010)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Carl Barton (1)
  • Alice Heliou (2, 3)
  • Laurent Mouchard (4)
  • Solon P. Pissis (5)

  1. The Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK
  2. Inria Saclay-Île de France, AMIB, Bâtiment Alan Turing, Palaiseau, France
  3. Laboratoire d’Informatique de l’École Polytechnique (LIX), CNRS UMR 7161, Palaiseau, France
  4. University of Rouen, LITIS EA 4108, TIBS, Rouen, France
  5. Department of Informatics, King’s College London, London, UK
