Knowledge and Information Systems, Volume 51, Issue 3, pp 1043–1066

Parallel construction of wavelet trees on multicore architectures

Regular Paper

Abstract

The wavelet tree has become a very useful data structure for efficiently representing and querying large volumes of data in many different domains, from bioinformatics to geographic information systems. One problem with wavelet trees is their construction time. In this paper, we introduce two algorithms that reduce the time complexity of wavelet tree construction by taking advantage of today’s ubiquitous multicore machines. Our first algorithm constructs all the levels of the wavelet tree in parallel in \(O(n)\) time and \(O(n\lg \sigma + \sigma \lg n)\) bits of working space, where \(n\) is the size of the input sequence and \(\sigma \) is the size of the alphabet. Our second algorithm constructs the wavelet tree in a domain decomposition fashion, using our first algorithm in each segment, reaching \(O(\lg n)\) time and \(O(n\lg \sigma + p\sigma \lg n/\lg \sigma )\) bits of extra space, where \(p\) is the number of available cores. Both algorithms are practical and achieve good speedups on large real datasets.
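
The idea behind building all levels in parallel can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_level` and `build_wavelet_tree` are hypothetical, symbols are assumed to be integers in the range 0 to \(\sigma - 1\), and each level is stored as a plain Python list of bits rather than a succinct bitmap. The point it shows is that the bitmap of a level depends only on the input sequence, not on the level above it, so every level can be handed to a separate task.

```python
from concurrent.futures import ThreadPoolExecutor


def build_level(seq, level, num_levels):
    """Return the bitmap of one level of a levelwise wavelet tree.

    Assumes every symbol is an integer in [0, 2**num_levels).
    The node a symbol belongs to at this level is given by its top
    `level` bits; the bit written is the next bit of the symbol.
    """
    shift = num_levels - level      # low bits ignored when picking the node
    num_nodes = 1 << level

    # Count how many symbols fall into each node of this level.
    counts = [0] * num_nodes
    for s in seq:
        counts[s >> shift] += 1

    # Exclusive prefix sum: starting offset of each node in the bitmap.
    offsets = [0] * num_nodes
    for i in range(1, num_nodes):
        offsets[i] = offsets[i - 1] + counts[i - 1]

    # Write each symbol's bit at the next free position of its node,
    # which keeps the original order of symbols inside every node.
    bits = [0] * len(seq)
    for s in seq:
        node = s >> shift
        bits[offsets[node]] = (s >> (shift - 1)) & 1
        offsets[node] += 1
    return bits


def build_wavelet_tree(seq, sigma):
    """Build every level independently, one task per level (a sketch)."""
    num_levels = max(1, (sigma - 1).bit_length())   # ceil(lg sigma)
    with ThreadPoolExecutor() as pool:
        return list(
            pool.map(lambda lvl: build_level(seq, lvl, num_levels),
                     range(num_levels))
        )


if __name__ == "__main__":
    for level, bits in enumerate(build_wavelet_tree([0, 3, 1, 2, 0, 1], 4)):
        print(level, bits)
```

Because of the global interpreter lock in CPython, the thread pool here only demonstrates the independence of the levels; it should not be expected to reproduce the speedups reported in the abstract.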

Keywords

Succinct data structure, Wavelet tree construction, Multicore, Parallel algorithm

Notes

Acknowledgments

This work was supported in part by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 690941 and by CONICYT doctoral scholarships Nos. 21120974 and 63130228 (first and second authors, respectively). We would also like to thank Roberto Asín for making his multicore computers, Mastropiero and Günther Frager, available to us.

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  1. Department of Computer Science, Universidad de Concepción, Concepción, Chile
  2. Faculty of Engineering, Universidad del Desarrollo, Santiago, Chile
