Fast Parallel Suffix Array on the GPU

  • Leyuan Wang
  • Sean Baxter
  • John D. Owens
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9233)


We implement two classes of suffix array construction algorithms on the GPU. The first, skew, makes algorithmic improvements to the previous work of Deo and Keely to achieve a speedup of 1.45\(\times\) over their work. The second, a hybrid skew and prefix-doubling implementation, is the first of its kind on the GPU and achieves a speedup of 2.3–4.4\(\times\) over Osipov’s prefix-doubling and 2.4–7.9\(\times\) over our skew implementation on large datasets. Our implementations rely on two efficient parallel primitives, a merge and a segmented sort. We also demonstrate the effectiveness of our implementations in a Burrows-Wheeler transform and a parallel FM index for pattern searching.
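For readers unfamiliar with the terms above, the following is a minimal sequential sketch of prefix doubling and of reading a Burrows-Wheeler transform off a suffix array. It is not the authors' GPU implementation: it runs on the CPU, uses Python's built-in sort in place of the parallel merge and segmented sort, and the function names `suffix_array_prefix_doubling` and `bwt_from_suffix_array` are hypothetical, chosen only for illustration.

```python
def suffix_array_prefix_doubling(text):
    """Return the suffix array of `text` using prefix doubling.

    Illustrative CPU sketch only (an assumption, not the paper's method).
    """
    n = len(text)
    if n == 0:
        return []
    # Round 0: rank every suffix by its first character.
    rank = [ord(c) for c in text]
    sa = list(range(n))
    k = 1
    while True:
        # A suffix's key is the rank of its first k characters followed by
        # the rank of the next k characters (-1 if it runs off the end).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with identical keys share the same rank.
        new_rank = [0] * n
        for j in range(1, n):
            differs = key(sa[j]) != key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (1 if differs else 0)
        rank = new_rank
        # Stop once every suffix has a distinct rank.
        if rank[sa[-1]] == n - 1:
            return sa
        k *= 2


def bwt_from_suffix_array(text, sa):
    """Return the BWT of `text`, which is assumed to end in a unique
    sentinel that sorts before every other character (here '$')."""
    # The BWT character for a suffix is the character preceding it;
    # index -1 wraps to the sentinel for the suffix starting at 0.
    return "".join(text[i - 1] for i in sa)


if __name__ == "__main__":
    s = "banana$"
    sa = suffix_array_prefix_doubling(s)
    print(sa)
    print(bwt_from_suffix_array(s, sa))
```

Each round doubles the prefix length that the ranks cover, so at most about log2(n) sorting rounds are needed. This is the property that makes the approach attractive for parallel hardware, where each round's sort can be handed to a parallel primitive.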


Suffix array, Parallel, GPU, Skew, Prefix-doubling, Burrows-Wheeler transform, FM index



We thank Yangzihao Wang for the initial implementation and good advice along the way. We would like to acknowledge Mrinal Deo for providing their paper’s original data, Vitaly Osipov for sharing his paper’s source code for comparison, and both Jason Mak and Carl Yang for feedback on early drafts of the paper. We appreciate the funding support of the National Science Foundation under grants OCI-1032859 and CCF-1017399, and UC Lab Fees Research Program Award 12-LR-238449.


  1. Davidson, A., Tarjan, D., Garland, M., Owens, J.D.: Efficient parallel merge sort for fixed and variable length keys. In: Proceedings of Innovative Parallel Computing, InPar 2012 (2012)
  2. Deo, M., Keely, S.: Parallel suffix array and least common prefix for the GPU. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013, pp. 197–206 (2013)
  3. Edwards, J.A., Vishkin, U.: Parallel algorithms for Burrows-Wheeler compression and decompression. Theor. Comput. Sci. 525, 10–22 (2014)
  4. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, pp. 390–398 (2000)
  5. Green, O., McColl, R., Bader, D.A.: GPU merge path: a GPU merging algorithm. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS 2012, pp. 331–340 (2012)
  6. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of the 30th International Conference on Automata, Languages and Programming, ICALP 2003, pp. 943–955. Springer, Heidelberg (2003)
  7. Karp, R.M., Miller, R.E., Rosenberg, A.L.: Rapid identification of repeated patterns in strings, trees and arrays. In: Proceedings of the Fourth Annual ACM Symposium on Theory of Computing, STOC 1972, pp. 125–136 (1972)
  8. Larsson, N.J., Sadakane, K.: Faster suffix sorting. Theor. Comput. Sci. 387(3), 258–272 (2007)
  9. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro 28(2), 39–55 (2008)
  10. Liu, C.M., Luo, R., Lam, T.W.: GPU-accelerated BWT construction for large collection of short reads (2014). arXiv preprint arXiv:1401.7457
  11. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’90, pp. 319–327 (1990)
  12. Merrill, D., Grimshaw, A.: Revisiting sorting for GPGPU stream architectures. Technical report CS2010-03, Department of Computer Science, University of Virginia (2010)
  13. Mori, Y.: libdivsufsort, version 2.0.1 (2010)
  14. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. ACM Queue 6, 40–53 (2008)
  15. Osipov, V.: Parallel suffix array construction for shared memory architectures. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 379–384. Springer, Heidelberg (2012)
  16. Pantaleoni, J.: A massively parallel algorithm for constructing the BWT of large string sets, October 2014. abs/1410.0562 (1410.0562v1)
  17. Patel, R.A., Zhang, Y., Mak, J., Owens, J.D.: Parallel lossless data compression on the GPU. In: Proceedings of Innovative Parallel Computing (2012)
  18. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Department of Computer Science, University of California, Davis, Davis, USA
  2. D. E. Shaw Research, New York, USA
  3. Department of Electrical and Computer Engineering, University of California, Davis, Davis, USA
