Abstract
This paper describes an approach that seeks to parallelize the spatial search associated with computational contact mechanics. In contact mechanics, the purpose of the spatial search is to find “nearest neighbors,” which is the prelude to an imprinting search that resolves the interactions between the external surfaces of contacting bodies. In particular, we are interested in the contact global search portion of the spatial search associated with this operation on domain-decomposition-based meshes. Specifically, we describe an implementation that combines standard domain-decomposition-based MPI-parallel spatial search with thread-level parallelism (MPI-X) available on advanced computer architectures (those with GPU coprocessors). Our goal is to demonstrate the efficacy of the MPI-X paradigm in the overall contact search. Standard MPI-parallel implementations typically use a domain decomposition of the external surfaces of bodies within the domain in an attempt to efficiently distribute computational work. This decomposition may or may not be the same as the volume decomposition associated with the host physics. The parallel contact global search phase is then employed to find and distribute surface entities (nodes and faces) that are needed to compute contact constraints between entities owned by different MPI ranks without further inter-rank communication. Key steps of the contact global search include computing bounding boxes, building surface entity (node and face) search trees and finding and distributing entities required to complete on-rank (local) spatial searches. To enable source-code portability and performance across a variety of different computer architectures, we implemented the algorithm using the Kokkos hardware abstraction library. While we targeted development towards machines with a GPU accelerator per MPI rank, we also report performance results for OpenMP with a conventional multi-core compute node per rank. 
The results show a 47% decrease in the time spent within the global search algorithm for the GPU implementation relative to the reference ACME algorithm, on an 18M-face problem using four MPI ranks. While further work remains to maximize performance on the GPU, this result illustrates the potential of the proposed implementation.
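To make the first key step of the global search concrete, the following is a minimal serial sketch of computing padded face bounding boxes and selecting the faces whose boxes overlap the box of another rank. It is an illustration under stated assumptions, not the ACME or Kokkos implementation: the names `Point`, `Box`, `empty_box`, `expand`, `overlaps`, `face_box`, and `faces_to_send`, as well as the padding parameter `pad`, are hypothetical, and the MPI exchange and search-tree construction are omitted.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::array<double, 3>;

// Axis-aligned bounding box (hypothetical type for this sketch).
struct Box {
  Point lo;
  Point hi;
};

// An "empty" box: the first point passed to expand() defines its extent.
Box empty_box() {
  const double big = std::numeric_limits<double>::max();
  Box b;
  for (int d = 0; d < 3; ++d) { b.lo[d] = big; b.hi[d] = -big; }
  return b;
}

// Grow the box so that it contains point p.
void expand(Box& b, const Point& p) {
  for (int d = 0; d < 3; ++d) {
    b.lo[d] = std::min(b.lo[d], p[d]);
    b.hi[d] = std::max(b.hi[d], p[d]);
  }
}

// True when the two boxes intersect (touching counts as overlap).
bool overlaps(const Box& a, const Box& b) {
  for (int d = 0; d < 3; ++d) {
    if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
  }
  return true;
}

// Box around one face, given node coordinates and the node IDs of the face
// (IDs are indexes into coords), padded by an assumed search tolerance.
Box face_box(const std::vector<Point>& coords,
             const std::vector<int>& face_nodes, double pad) {
  assert(!face_nodes.empty());
  Box b = empty_box();
  for (int n : face_nodes) expand(b, coords[n]);
  for (int d = 0; d < 3; ++d) { b.lo[d] -= pad; b.hi[d] += pad; }
  return b;
}

// IDs of local faces whose boxes overlap the box of a remote rank. In a
// distributed search these would be the candidates to send to that rank.
std::vector<int> faces_to_send(const std::vector<Box>& face_boxes,
                               const Box& remote_rank_box) {
  std::vector<int> ids;
  for (std::size_t f = 0; f < face_boxes.size(); ++f) {
    if (overlaps(face_boxes[f], remote_rank_box)) {
      ids.push_back(static_cast<int>(f));
    }
  }
  return ids;
}
```

In a threaded setting the loop over faces in `faces_to_send` would typically become a count pass, a prefix scan, and a fill pass, because a shared `push_back` is not safe across threads.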
Notes
Most of this discussion regarding the search operation is independent of whether the simulation is transient or quasi-static. We choose to use the term time step rather than load step here, but they may be used interchangeably unless otherwise noted.
While OpenMP supports parallel reduce through a reduction clause that can be combined with parallel for, it does not provide similar support for parallel scan. An implementation of a parallel prefix scan algorithm for OpenMP must be provided externally.
Note that in the Kokkos API the functor’s work increment is given by an integer instead of a range to permit parallel_for to compile into a kernel launch on a GPU.
Both Kokkos and TBB APIs for parallel for, parallel reduce, and parallel scan support the use of C++11 lambda objects in place of functors. In certain situations, this use of lambda objects can improve the readability of code. We find the functor-based APIs are easier to describe at the current stage of C++11 feature adoption, and in their full form they are more general.
As distinguished from data that, once written, is constant.
Node and face IDs are simply indexes into data arrays.
The atomic variable is a scalar View whose data will be atomically read-and-updated.
A well-balanced tree with \(N\) nodes has \(O(\lg N)\) levels with a small constant factor.
These operations are included in the outer loops of the respective searches in the ACME reference implementation.
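The notes above on functors, lambdas, integer work indices, and prefix scan can be illustrated with a small serial sketch in standard C++. It is not the Kokkos or TBB API: `for_each_index` is a hypothetical stand-in for a parallel-for dispatch, and `CountCandidates`, `prefix_offsets`, and `flatten` are names invented for this example. The sketch shows the count, scan, and fill pattern, in which an exclusive prefix scan turns per-entity candidate counts into write offsets so that each work index fills a disjoint slice of the output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a parallel-for dispatch: calls f(i) for each
// integer work index i in [0, n). A threaded backend would run the calls
// concurrently; this sketch runs them serially.
template <class Functor>
void for_each_index(std::size_t n, const Functor& f) {
  for (std::size_t i = 0; i < n; ++i) f(i);
}

// Functor form: the work increment is a single integer index.
struct CountCandidates {
  const std::vector<std::vector<int>>* hits;  // hits[i]: candidate IDs for entity i
  std::vector<std::size_t>* counts;           // counts[i]: number of candidates
  void operator()(std::size_t i) const { (*counts)[i] = (*hits)[i].size(); }
};

// Serial exclusive prefix scan: offsets[i] = counts[0] + ... + counts[i-1].
// Returns the total, which is the size of the flattened output.
std::size_t prefix_offsets(const std::vector<std::size_t>& counts,
                           std::vector<std::size_t>& offsets) {
  offsets.resize(counts.size());
  std::size_t running = 0;
  for (std::size_t i = 0; i < counts.size(); ++i) {
    offsets[i] = running;
    running += counts[i];
  }
  return running;
}

// Count, scan, then fill: every index writes to its own slice of `flat`,
// so the fill pass needs no synchronization between indices.
std::vector<int> flatten(const std::vector<std::vector<int>>& hits) {
  std::vector<std::size_t> counts(hits.size());
  std::vector<std::size_t> offsets;
  for_each_index(hits.size(), CountCandidates{&hits, &counts});
  const std::size_t total = prefix_offsets(counts, offsets);
  std::vector<int> flat(total);
  // Lambda form of a loop body, equivalent in role to a functor.
  for_each_index(hits.size(), [&](std::size_t i) {
    assert(offsets[i] + hits[i].size() <= flat.size());
    for (std::size_t k = 0; k < hits[i].size(); ++k) {
      flat[offsets[i] + k] = hits[i][k];
    }
  });
  return flat;
}
```

For example, `flatten({{7, 8}, {}, {9}})` produces the counts 2, 0, 1, the offsets 0, 2, 2, and the flattened list 7, 8, 9.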
References
Hansen G (2011) A Jacobian-free Newton Krylov method for mortar-discretized thermomechanical contact problems. J Comput Phys 230:6546–6562
Brown KH, Glass MW, Gullerud AS, Heinstein MW, Jones RE, Voth TE (2004) ACME: algorithms for contact in a multiphysics environment API version 2.2. Technical report SAND2004-5486, Sandia National Laboratories
Khamayseh A, Hansen G (2007) Use of the spatial \(k\)D-tree in computational physics applications. Commun Comput Phys 2(3):545–576
Attaway SW, Hendrickson BA, Plimpton SJ, Gardner DR, Vaughan CT, Brown KH, Heinstein MW (1998) A parallel contact detection algorithm for transient solid dynamics simulations using PRONTO3D. Comput Mech 22:143–159
Devine K, Boman E, Heaphy R, Hendrickson B, Vaughan C (2002) Zoltan data management services for parallel dynamic applications. Comput Sci Eng 4(2):90–97
Karras T (2012) Maximizing parallelism in the construction of BVHs, octrees, and \(k\)-d trees. In: Dachsbacher C, Munkberg J, Pantaleoni J (eds) Eurographics/ACM SIGGRAPH symposium on high performance graphics, pp 33–37
Karras T (2012) Thinking parallel, part II: tree traversal on the GPU. http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/
OpenMP application program interface, version 4.0, July 2013
Butenhof DR (1997) Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc, Boston
NVIDIA Corporation (2015) CUDA C programming guide. http://docs.nvidia.com/cuda
Wienke S, Springer P, Terboven C, an Mey D (2012) OpenACC: first experiences with real-world applications. In: Proceedings of the 18th international conference on parallel processing, Euro-Par’12. Springer, Berlin, pp 859–870
Robison AD (2013) Composable parallel patterns with Intel Cilk Plus. Comput Sci Eng 15(2):66–71
Reinders J (2007) Intel threading building blocks, 1st edn. O’Reilly & Associates Inc, Sebastopol
Leijen D, Schulte W, Burckhardt S (2009) The design of a task parallel library. In: 24th ACM SIGPLAN conference on object oriented programming systems languages and applications (OOPSLA’09), Orlando, FL. Also appeared in Sigplan Not., 44(10): 227–242
Edwards HC, Trott CR (2013) Kokkos: enabling performance portability across manycore architectures. XSEDE, Boulder. https://www.xsede.org/documents/271087/586927/Edwards-2013-XSCALE13-Kokkos.pdf
Edwards HC, Trott CR, Sunderland D (2013) Kokkos, a manycore device performance portability library for C++ HPC applications. San Jose, CA, March 2014. GPU technology conference. http://on-demand.gputechconf.com/gtc/2014/presentations/S4213-kokkos-manycore-device-perf-portability-library-hpc-apps.pdf. Also Sandia National Laboratories SAND2014-2317C
Edwards HC, Trott CR, Sunderland D (2014) Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J Parallel Distrib Comput 74(12):3202–3216
Bell N, Hoberock J (2011) Thrust: a productivity-oriented library for CUDA. GPU computing gems Jade Edition. Elsevier, Boston, p 359
Robinson A et al (2008) ALEGRA: an arbitrary Lagrangian–Eulerian multimaterial, multiphysics code. In: Proceedings of the 46th AIAA aerospace sciences meeting
Graham SL, Kessler PB, McKusick MK (1982) Gprof: a call graph execution profiler. In: Proceedings of the ACM SIGPLAN ’82 symposium on compiler construction, pp 120–126
McCool M, Robison A, Reinders J (2012) Structured parallel programming. Morgan Kaufmann, San Francisco
Blelloch GE (1989) Scans as primitive parallel operations. IEEE Trans Comput 38(11):1526–1538
Message Passing Interface Forum (1994) MPI: a message-passing interface standard. Technical report, Knoxville, TN, USA. http://www.mpi-forum.org
OpenMP application program interface, version 1.0, October 1997
Trott CR, Hoemmen M, Hammond SD, Edwards HC (2015) Kokkos: the programming guide. Technical report SAND2015-4178, Sandia National Laboratories. https://github.com/kokkos
Intel Corporation (2015) Intel threading building blocks reference manual. https://www.threadingbuildingblocks.org/docs/help/reference/
Lauterbach C, Garland M, Sengupta S, Luebke D, Manocha D (2009) Fast BVH construction on GPUs. Comput Graph Forum 28:375–384
Satish N, Kim C, Chhugani J, Nguyen AD, Lee VW, Kim D, Dubey P (2010) Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD ’10. ACM, New York, pp 351–362
Davidson A, Tarjan D, Garland M, Owens JD (2012) Efficient parallel merge sort for fixed and variable length keys. In: Innovative parallel computing. p 9
Robison AD (2014) A parallel stable sort using C++11 for TBB, Cilk Plus, and OpenMP. https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp
Ha L, Krüger J, Silva CT (2009) Fast four-way parallel radix sorting on GPUs. Comput Graph Forum 28(8):2368–2378
Peters H, Schulz-Hildebrandt O, Luttenberger N (2010) Fast in-place sorting with CUDA based on bitonic sort. In: Proceedings of the 8th international conference on parallel processing and applied mathematics: part I, PPAM’09. Springer, Berlin, pp 403–410
Sintorn E, Assarsson U (2008) Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput 68(10):1381–1388
Satish N, Harris M, Garland M (2009) Designing efficient sorting algorithms for manycore GPUs. In: Parallel distributed processing. IEEE international symposium on IPDPS 2009, pp 1–10
Karras T (2012) Thinking parallel, part III: tree construction on the GPU. http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/
Acknowledgments
This work was funded by the U.S. Department of Energy through the NNSA Advanced Scientific Computing (ASC) Integrated Codes (IC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. The authors would like to thank Tero Karras of NVIDIA Corporation for his CUDA code and suggestions.
Cite this article
Hansen, G.A., Xavier, P.G., Mish, S.P. et al. An MPI+\(X\) implementation of contact global search using Kokkos. Engineering with Computers 32, 295–311 (2016). https://doi.org/10.1007/s00366-015-0418-x
DOI: https://doi.org/10.1007/s00366-015-0418-x