An MPI+\(X\) implementation of contact global search using Kokkos

  • Original Article
  • Engineering with Computers

Abstract

This paper describes an approach to parallelizing the spatial search associated with computational contact mechanics. In contact mechanics, the purpose of the spatial search is to find “nearest neighbors,” which is the prelude to an imprinting search that resolves the interactions between the external surfaces of contacting bodies. In particular, we are interested in the contact global search portion of the spatial search associated with this operation on domain-decomposition-based meshes. Specifically, we describe an implementation that combines standard domain-decomposition-based MPI-parallel spatial search with thread-level parallelism (MPI+X) available on advanced computer architectures (those with GPU coprocessors). Our goal is to demonstrate the efficacy of the MPI+X paradigm in the overall contact search. Standard MPI-parallel implementations typically use a domain decomposition of the external surfaces of bodies within the domain to distribute computational work efficiently. This decomposition may or may not be the same as the volume decomposition associated with the host physics. The parallel contact global search phase is then employed to find and distribute surface entities (nodes and faces) that are needed to compute contact constraints between entities owned by different MPI ranks without further inter-rank communication. Key steps of the contact global search include computing bounding boxes, building surface entity (node and face) search trees, and finding and distributing the entities required to complete on-rank (local) spatial searches. To enable source-code portability and performance across a variety of computer architectures, we implemented the algorithm using the Kokkos hardware abstraction library. While we targeted development towards machines with a GPU accelerator per MPI rank, we also report performance results for OpenMP with a conventional multi-core compute node per rank. Comparing the reference ACME algorithm with the GPU implementation on an 18M-face problem using four MPI ranks, the time spent within the global search algorithm decreased by 47 %. While further work remains to maximize performance on the GPU, this result illustrates the potential of the proposed implementation.
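
To make the thread-parallel portion concrete, the following is a minimal sketch of the per-face bounding-box step written as a Kokkos functor. It is not the paper's implementation; the data layout, the quad-face assumption, and all names (ComputeFaceBoxes, face_coords, face_boxes) are illustrative only.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical per-face axis-aligned bounding boxes: one work item per face.
struct ComputeFaceBoxes {
  Kokkos::View<double*[4][3]> face_coords; // 4 nodes per (quad) face, x/y/z
  Kokkos::View<double*[2][3]> face_boxes;  // per-face min (0) and max (1) corners

  KOKKOS_INLINE_FUNCTION
  void operator()(const int f) const {
    for (int d = 0; d < 3; ++d) {
      double lo = face_coords(f, 0, d), hi = lo;
      for (int n = 1; n < 4; ++n) {
        const double x = face_coords(f, n, d);
        lo = x < lo ? x : lo;
        hi = x > hi ? x : hi;
      }
      face_boxes(f, 0, d) = lo;
      face_boxes(f, 1, d) = hi;
    }
  }
};

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int num_faces = 1 << 20;
    Kokkos::View<double*[4][3]> coords("coords", num_faces);
    Kokkos::View<double*[2][3]> boxes("boxes", num_faces);
    // The same functor compiles into a CUDA kernel launch or an OpenMP loop,
    // depending on the default execution space selected at build time.
    Kokkos::parallel_for(num_faces, ComputeFaceBoxes{coords, boxes});
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

In the workflow described above, boxes computed this way would then feed the search-tree build and the distribution of entities to other MPI ranks.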


Notes

  1. Most of this discussion regarding the search operation is independent of whether the simulation is transient or quasi-static. We choose to use the term time step rather than load step here, but they may be used interchangeably unless otherwise noted.

  2. While OpenMP does support parallel reduce by providing a reduction construct that can be combined with parallel for, it does not provide similar support for parallel scan. An implementation of a parallel prefix scan algorithm for OpenMP must be provided externally.

  3. Note that in the Kokkos API the functor’s work increment is given by an integer instead of a range to permit parallel_for to compile into a kernel launch on a GPU; a minimal sketch of this pattern appears after these notes.

  4. Both Kokkos and TBB APIs for parallel for, parallel reduce, and parallel scan support the use of C++11 lambda objects in place of functors. In certain situations, this use of lambda objects can improve the readability of code. We find the functor-based APIs are easier to describe at the current stage of C++11 feature adoption, and in their full form they are more general.

  5. As distinguished from data that, once written, is constant.

  6. Node and face IDs are simply indexes into data arrays.

  7. The atomic variable is a scalar View whose data will be atomically read and updated (see the sketch following these notes).

  8. A well-balanced tree with N nodes has \(O(\lg N)\) levels with a small constant factor.

  9. These operations are included in the outer loops of the respective searches in the ACME reference implementation.
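
As an illustration of notes 3 and 7 above, the following minimal sketch (not taken from the paper's source; the names and one-dimensional data layout are assumptions) shows a Kokkos functor whose work item is a single integer index and whose shared counter is a rank-0 ("scalar") View updated with an atomic read-and-update:

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical overlap counter: one work item per candidate box (note 3),
// with all threads atomically incrementing a shared scalar View (note 7).
struct CountOverlaps {
  Kokkos::View<double*> box_min_x, box_max_x; // assumed 1-D box extents
  double query_lo, query_hi;                  // query interval
  Kokkos::View<int> count;                    // rank-0 ("scalar") View

  KOKKOS_INLINE_FUNCTION
  void operator()(const int i) const {        // integer work increment (note 3)
    if (box_max_x(i) >= query_lo && box_min_x(i) <= query_hi) {
      Kokkos::atomic_fetch_add(&count(), 1);  // atomic read-and-update (note 7)
    }
  }
};

// Typical use:
//   Kokkos::View<int> count("count");
//   Kokkos::parallel_for(num_boxes,
//                        CountOverlaps{box_min_x, box_max_x, 0.0, 1.0, count});
//   auto count_h = Kokkos::create_mirror_view(count);
//   Kokkos::deep_copy(count_h, count);       // count_h() now holds the total
```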

References

  1. Hansen G (2011) A Jacobian-free Newton Krylov method for mortar-discretized thermomechanical contact problems. J Comput Phys 230:6546–6562

  2. Brown KH, Glass MW, Gullerud AS, Heinstein MW, Jones RE, Voth TE (2004) ACME: algorithms for contact in a multiphysics environment API version 2.2. Technical report SAND2004-5486, Sandia National Laboratories

  3. Khamayseh A, Hansen G (2007) Use of the spatial \(k\)D-tree in computational physics applications. Commun Comput Phys 2(3):545–576

  4. Attaway SW, Hendrickson BA, Plimpton SJ, Gardner DR, Vaughan CT, Brown KH, Heinstein MW (1998) A parallel contact detection algorithm for transient solid dynamics simulations using PRONTO3D. Comput Mech 22:143–159

  5. Devine K, Boman E, Heaphy R, Hendrickson B, Vaughan C (2002) Zoltan data management services for parallel dynamic applications. Comput Sci Eng 4(2):90–97

  6. Karras T (2012) Maximizing parallelism in the construction of BVHs, octrees, and \(k\)-d trees. In: Dachsbacher C, Munkberg J, Pantaleoni J (eds) Eurographics/ACM SIGGRAPH symposium on high performance graphics, pp 33–37

  7. Karras T (2012) Thinking parallel, part II: tree traversal on the GPU. http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/

  8. OpenMP application program interface, version 4.0, July 2013

  9. Butenhof DR (1997) Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc, Boston

  10. NVIDIA Corporation (2015) CUDA C programming guide. http://docs.nvidia.com/cuda

  11. Wienke S, Springer P, Terboven C, an Mey D (2012) OpenACC: first experiences with real-world applications. In: Proceedings of the 18th international conference on parallel processing, Euro-Par’12. Springer, Berlin, pp 859–870

  12. Robison AD (2013) Composable parallel patterns with Intel Cilk Plus. Comput Sci Eng 15(2):66–71

  13. Reinders J (2007) Intel threading building blocks, 1st edn. O’Reilly & Associates Inc, Sebastopol

  14. Leijen D, Schulte W, Burckhardt S (2009) The design of a task parallel library. In: 24th ACM SIGPLAN conference on object oriented programming systems languages and applications (OOPSLA’09), Orlando, FL. Also appeared in Sigplan Not., 44(10): 227–242

  15. Edwards HC, Trott CR (2013) Kokkos: enabling performance portability across manycore architectures. XSEDE, Boulder. https://www.xsede.org/documents/271087/586927/Edwards-2013-XSCALE13-Kokkos.pdf

  16. Edwards HC, Trott CR, Sunderland D (2014) Kokkos, a manycore device performance portability library for C++ HPC applications. GPU technology conference, San Jose, CA, March 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4213-kokkos-manycore-device-perf-portability-library-hpc-apps.pdf. Also Sandia National Laboratories SAND2014-2317C

  17. Edwards HC, Trott CR, Sunderland D (2014) Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J Parallel Distrib Comput 74(12):3202–3216

  18. Bell N, Hoberock J (2011) Thrust: a productivity-oriented library for CUDA. GPU computing gems Jade Edition. Elsevier, Boston, p 359

  19. Robinson A et al (2008) ALEGRA: an arbitrary Lagrangian–Eulerian multimaterial, multiphysics code. In: Proceedings of the 46th AIAA aerospace sciences meeting

  20. Graham SL, Kessler PB, McKusick MK (1982) Gprof: a call graph execution profiler. In: Proceedings of the ACM SIGPLAN ’82 symposium on compiler construction, pp 120–126

  21. McCool M, Robison A, Reinders J (2012) Structured parallel programming. Morgan Kaufmann, San Francisco

  22. Blelloch GE (1989) Scans as primitive parallel operations. IEEE Trans Comput 38(11):1526–1538

  23. Message Passing Interface Forum (1994) MPI: a message-passing interface standard. Technical report, Knoxville, TN, USA. http://www.mpi-forum.org

  24. OpenMP application program interface, version 1.0, October 1997

  25. Trott CR, Hoemmen M, Hammond SD, Edwards HC (2015) Kokkos: the programming guide. Technical report SAND2015-4178, Sandia National Laboratories. https://github.com/kokkos

  26. Intel Corporation (2015) Intel threading building blocks reference manual. https://www.threadingbuildingblocks.org/docs/help/reference/

  27. Lauterbach C, Garland M, Sengupta S, Luebke D, Manocha D (2009) Fast BVH construction on GPUs. Comput Graph Forum 28:375–384

  28. Satish N, Kim C, Chhugani J, Nguyen AD, Lee VW, Kim D, Dubey P (2010) Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD ’10. ACM, New York, pp 351–362

  29. Davidson A, Tarjan D, Garland M, Owens JD (2012) Efficient parallel merge sort for fixed and variable length keys. In: Innovative parallel computing. p 9

  30. Robison AD (2014) A parallel stable sort using C++11 for TBB, Cilk Plus, and OpenMP. https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp

  31. Ha L, Krüger J, Silva CT (2009) Fast four-way parallel radix sorting on GPUs. Comput Graph Forum 28(8):2368–2378

  32. Peters H, Schulz-Hildebrandt O, Luttenberger N (2010) Fast in-place sorting with CUDA based on bitonic sort. In: Proceedings of the 8th international conference on parallel processing and applied mathematics: part I, PPAM’09. Springer, Berlin, pp 403–410

  33. Sintorn E, Assarsson U (2008) Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput 68(10):1381–1388

  34. Satish N, Harris M, Garland M (2009) Designing efficient sorting algorithms for manycore GPUs. In: IEEE international symposium on parallel distributed processing (IPDPS 2009), pp 1–10

  35. Karras T (2012) Thinking parallel, part III: tree construction on the GPU. http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/

Acknowledgments

This work was funded by the U.S. Department of Energy through the NNSA Advanced Simulation and Computing (ASC) Integrated Codes (IC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. The authors would like to thank Tero Karras of NVIDIA Corporation for his CUDA code and suggestions.

Author information

Corresponding author

Correspondence to Glen A. Hansen.

About this article

Cite this article

Hansen, G.A., Xavier, P.G., Mish, S.P. et al. An MPI+\(X\) implementation of contact global search using Kokkos. Engineering with Computers 32, 295–311 (2016). https://doi.org/10.1007/s00366-015-0418-x
