Parallel kd-Tree Construction on the GPU with an Adaptive Split and Sort Strategy

International Journal of Parallel Programming


We introduce a parallel kd-tree construction method for 3-dimensional points on a GPU which employs a sorting algorithm that maintains high parallelism throughout construction. Typically, large arrays in the upper levels of a kd-tree do not yield high performance when computing each node in one thread. Conversely, small arrays in the lower levels of the tree do not benefit from typical parallel sorts. To address these issues, the proposed sorting approach uses a modified parallel sort on the upper levels before switching to basic parallelization on the lower levels. Our work focuses on 3D point registration and our results indicate that a speed gain by a factor of 100 can be achieved in comparison to a naive parallel algorithm for a typical scene.





  1. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45(6), 891–923 (1998)

  2. Atkinson, M.D., Sack, J.R., Santoro, N., Strothotte, T.: Min-max heaps and generalized priority queues. Commun. ACM 29(10), 996–1000 (1986)

  3. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  4. Bentley, J.L.: Multidimensional divide-and-conquer. Commun. ACM 23(4), 214–229 (1980)

  5. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)

  6. Garcia, V., Debreuve, E., Barlaud, M.: Fast k nearest neighbor search using GPU. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2008)

  7. Garrett, T., Radkowski, R., Sheaffer, J.: GPU-accelerated descriptor extraction process for 3D registration in augmented reality. In: 23rd International Conference on Pattern Recognition, Cancun, Mexico (2016)

  8. Ha, L., Kruger, L., Silva, C.: Fast four-way parallel radix sorting on GPUs. Comput. Graph. Forum 28(8), 2368–2378 (2009)

  9. Harada, T., Howes, L.: Introduction to GPU radix sort. In: Heterogeneous Computing with OpenCL. Morgan Kaufmann (2011)

  10. Harris, M.: Maxwell: The Most Advanced CUDA GPU Ever Made. NVIDIA, Santa Clara (2014)

  11. Havran, V.: Heuristic ray shooting algorithms. Ph.D. thesis, Czech Technical University (2001)

  12. Hu, L., Nooshabadi, S., Ahmadi, M.: Massively parallel KD-tree construction and nearest neighbor search algorithms. In: 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2752–2755 (2015)

  13. Karras, T.: Maximizing parallelism in the construction of BVHs, Octrees, and k-d trees. In: Proceedings of the Fourth ACM SIGGRAPH / Eurographics Conference on High-Performance Graphics, EGGH-HPG’12, pp. 33–37 (2012)

  14. Leite, P., Teixeira, J.M., Farias, T., Reis, B., Teichrieb, V., Kelner, J.: Nearest neighbor searches on the GPU. Int. J. Parallel Program. 40(3), 313–330 (2012)

  15. Leite, P.J.S., Teixeira, J.M.X.N., de Farias, T.S.M.C., Teichrieb, V., Kelner, J.: Massively parallel nearest neighbor queries for dynamic point clouds on the GPU. In: 2009 21st International Symposium on Computer Architecture and High Performance Computing, pp. 19–25 (2009)

  16. Merrill, D.G., Grimshaw, A.S.: Revisiting sorting for GPGPU stream architectures. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pp. 545–546 (2010)

  17. Qiu, D., May, S., Nüchter, A.: GPU-accelerated nearest neighbor search for 3D registration. In: Proceedings of Computer Vision Systems: 7th International Conference on Computer Vision Systems, ICVS 2009, pp. 194–203. Springer, Berlin (2009)

  18. Radkowski, R.: Object tracking with a range camera for augmented reality assembly assistance. J. Comput. Inf. Sci. Eng. 16(1), 1–8 (2016)

  19. Radkowski, R., Garrett, T., Ingebrand, J., Wehr, D.: Trackingexpert—a versatile tracking toolbox for augmented reality. In: IDETC/CIE 2016, the ASME 2016 International Design Engineering Technical Conferences & Computers and Information in Engineering Conference, Charlotte, NC (2016)

  20. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore GPUs. In: 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–10 (2009)

  21. Singh, D.P., Joshi, I., Choudhary, J.: Survey of GPU based sorting algorithms. Int. J. Parallel Program. (2017).

  22. Zhou, K., Hou, Q., Wang, R., Guo, B.: Real-time KD-tree construction on graphics hardware. ACM Trans. Graph. 27(5), 126:1–126:11 (2008)

Author information

Corresponding author

Correspondence to Rafael Radkowski.



1.1 Median Splitting Determination

  1. To determine which chunk a point belongs to, we use the technique described in [7] to compute a median. In brief, we consider the width w of a chunk to be a real number \(w=\frac{N}{2^l}\), where N is the number of points and l is the zero-indexed tree level. Therefore, given a particular index i, we can determine the chunk by \(c = \left\lfloor \frac{i}{w} \right\rfloor\).

  2. During the histogram calculation, it is also necessary to determine whether a point is a median splitting element. The point at index i is a splitting element if the following criterion holds, as per [7]:

    $$\begin{aligned} \mathrm{splitting}(i) = {\left\{ \begin{array}{ll} \mathrm{true}, &{} \left\lceil \frac{i}{w} \right\rceil < \frac{i+1}{w} \wedge i \ne 0 \\ \mathrm{false}, &{} \mathrm{otherwise} \end{array}\right. } \end{aligned}$$

  3. Finally, we need the ability to calculate the starting index of a chunk, excluding the splitting element. Because the chunk width is constant, the starting index \(i_s\) of a chunk c can be computed by:

    $$\begin{aligned} i_s = {\left\{ \begin{array}{ll} 0, &{} c = 0 \\ \left\lfloor w \cdot c \right\rfloor + 1, &{} c \ne 0 \end{array}\right. } \end{aligned}$$
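The three rules above can be sketched on the host side as follows. This is a minimal illustration, not the implementation used in the paper or in [7]. The function names are hypothetical, and the chunk index is assumed to be rounded down to an integer. The sketch also assumes exact integer arithmetic (multiplying through by \(2^l\) and N) in place of the real-valued width w, so that floating-point rounding cannot affect the comparisons.

```python
def chunk_of(i: int, n: int, level: int) -> int:
    """Chunk index c = floor(i / w), where w = n / 2**level."""
    return (i << level) // n


def is_splitting(i: int, n: int, level: int) -> bool:
    """True if ceil(i / w) < (i + 1) / w and i != 0."""
    if i == 0:
        return False
    p = 1 << level                      # 2**level, so i / w == i * p / n
    ceil_i_over_w = -((-i * p) // n)    # integer ceiling of i * p / n
    # Compare ceil(i / w) < (i + 1) / w after multiplying both sides by n.
    return ceil_i_over_w * n < (i + 1) * p


def chunk_start(c: int, n: int, level: int) -> int:
    """Starting index i_s of chunk c, excluding the splitting element."""
    if c == 0:
        return 0
    return (c * n) // (1 << level) + 1  # floor(w * c) + 1


if __name__ == "__main__":
    n, level = 7, 2                     # 7 points, tree level 2 (w = 1.75)
    splits = [i for i in range(n) if is_splitting(i, n, level)]
    print(splits)                       # [1, 3, 5]
    print([chunk_start(c, n, level) for c in range(1 << level)])  # [0, 2, 4, 6]
```

For N = 7 and l = 2, indices 1, 3 and 5 are splitting elements, and the four chunks start at indices 0, 2, 4 and 6. Each starting index is the position directly after a splitting element, as the third rule requires.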

1.2 Experimental Data

See Tables 2 and 3.

Table 2 Quantitative results for all construction operations
Table 3 Quantitative results for all nearest-neighbor queries


Cite this article

Wehr, D., Radkowski, R. Parallel kd-Tree Construction on the GPU with an Adaptive Split and Sort Strategy. Int J Parallel Prog 46, 1139–1156 (2018).
