An MPI+\(X\) implementation of contact global search using Kokkos

  • Original Article
  • Engineering with Computers

Abstract

This paper describes an approach to parallelizing the spatial search associated with computational contact mechanics. In contact mechanics, the purpose of the spatial search is to find “nearest neighbors,” which is the prelude to an imprinting search that resolves the interactions between the external surfaces of contacting bodies. In particular, we are interested in the contact global search portion of the spatial search associated with this operation on domain-decomposition-based meshes. Specifically, we describe an implementation that combines standard domain-decomposition-based MPI-parallel spatial search with thread-level parallelism (MPI+X) available on advanced computer architectures (those with GPU coprocessors). Our goal is to demonstrate the efficacy of the MPI+X paradigm in the overall contact search. Standard MPI-parallel implementations typically use a domain decomposition of the external surfaces of bodies within the domain to distribute computational work efficiently. This decomposition may or may not be the same as the volume decomposition associated with the host physics. The parallel contact global search phase is then employed to find and distribute surface entities (nodes and faces) that are needed to compute contact constraints between entities owned by different MPI ranks without further inter-rank communication. Key steps of the contact global search include computing bounding boxes, building surface entity (node and face) search trees, and finding and distributing the entities required to complete on-rank (local) spatial searches. To enable source-code portability and performance across a variety of computer architectures, we implemented the algorithm using the Kokkos hardware abstraction library. While we targeted development towards machines with a GPU accelerator per MPI rank, we also report performance results for OpenMP with a conventional multi-core compute node per rank. Comparing the reference ACME algorithm with the GPU implementation on an 18M-face problem using four MPI ranks, the time spent within the global search algorithm decreased by 47 %. While further work remains to maximize performance on the GPU, this result illustrates the potential of the proposed implementation.
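
To make the thread-parallel portion concrete, the following is a minimal sketch of the per-face bounding-box step written as a Kokkos functor. It is not the paper's implementation; the data layout, the quad-face assumption, and all names (ComputeFaceBoxes, face_coords, face_boxes) are illustrative only.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical per-face axis-aligned bounding boxes: one work item per face.
struct ComputeFaceBoxes {
  Kokkos::View<double*[4][3]> face_coords; // 4 nodes per (quad) face, x/y/z
  Kokkos::View<double*[2][3]> face_boxes;  // per-face min (0) and max (1) corners

  KOKKOS_INLINE_FUNCTION
  void operator()(const int f) const {
    for (int d = 0; d < 3; ++d) {
      double lo = face_coords(f, 0, d), hi = lo;
      for (int n = 1; n < 4; ++n) {
        const double x = face_coords(f, n, d);
        lo = x < lo ? x : lo;
        hi = x > hi ? x : hi;
      }
      face_boxes(f, 0, d) = lo;
      face_boxes(f, 1, d) = hi;
    }
  }
};

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int num_faces = 1 << 20;
    Kokkos::View<double*[4][3]> coords("coords", num_faces);
    Kokkos::View<double*[2][3]> boxes("boxes", num_faces);
    // The same functor compiles into a CUDA kernel launch or an OpenMP loop,
    // depending on the default execution space selected at build time.
    Kokkos::parallel_for(num_faces, ComputeFaceBoxes{coords, boxes});
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

In the workflow described above, boxes computed this way would then feed the search-tree build and the distribution of entities to other MPI ranks.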


Notes

  1. Most of this discussion regarding the search operation is independent of whether the simulation is transient or quasi-static. We choose to use the term time step rather than load step here, but they may be used interchangeably unless otherwise noted.

  2. While OpenMP does support parallel reduce by providing a reduction construct that can be combined with parallel for, it does not provide similar support for parallel scan. An implementation of a parallel prefix scan algorithm for OpenMP must be provided externally.

  3. Note that in the Kokkos API the functor’s work increment is given by an integer instead of a range to permit parallel_for to compile into a kernel launch on a GPU; a minimal sketch of this pattern appears after these notes.

  4. Both Kokkos and TBB APIs for parallel for, parallel reduce, and parallel scan support the use of C++11 lambda objects in place of functors. In certain situations, this use of lambda objects can improve the readability of code. We find the functor-based APIs are easier to describe at the current stage of C++11 feature adoption, and in their full form they are more general.

  5. As distinguished from data that, once written, is constant.

  6. Node and face IDs are simply indexes into data arrays.

  7. The atomic variable is a scalar View whose data will be atomically read and updated (see the sketch following these notes).

  8. A well-balanced tree with N nodes has \(O(\lg N)\) levels with a small constant factor.

  9. These operations are included in the outer loops of the respective searches in the ACME reference implementation.
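
As an illustration of notes 3 and 7 above, the following minimal sketch (not taken from the paper's source; the names and one-dimensional data layout are assumptions) shows a Kokkos functor whose work item is a single integer index and whose shared counter is a rank-0 ("scalar") View updated with an atomic read-and-update:

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical overlap counter: one work item per candidate box (note 3),
// with all threads atomically incrementing a shared scalar View (note 7).
struct CountOverlaps {
  Kokkos::View<double*> box_min_x, box_max_x; // assumed 1-D box extents
  double query_lo, query_hi;                  // query interval
  Kokkos::View<int> count;                    // rank-0 ("scalar") View

  KOKKOS_INLINE_FUNCTION
  void operator()(const int i) const {        // integer work increment (note 3)
    if (box_max_x(i) >= query_lo && box_min_x(i) <= query_hi) {
      Kokkos::atomic_fetch_add(&count(), 1);  // atomic read-and-update (note 7)
    }
  }
};

// Typical use:
//   Kokkos::View<int> count("count");
//   Kokkos::parallel_for(num_boxes,
//                        CountOverlaps{box_min_x, box_max_x, 0.0, 1.0, count});
//   auto count_h = Kokkos::create_mirror_view(count);
//   Kokkos::deep_copy(count_h, count);       // count_h() now holds the total
```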

References

  1. Hansen G (2011) A Jacobian-free Newton Krylov method for mortar-discretized thermomechanical contact problems. J Comput Phys 230:6546–6562

  2. Brown KH, Glass MW, Gullerud AS, Heinstein MW, Jones RE, Voth TE (2004) ACME: algorithms for contact in a multiphysics environment API version 2.2. Technical report SAND2004-5486, Sandia National Laboratories

  3. Khamayseh A, Hansen G (2007) Use of the spatial \(k\)D-tree in computational physics applications. Commun Comput Phys 2(3):545–576

  4. Attaway SW, Hendrickson BA, Plimpton SJ, Gardner DR, Vaughan CT, Brown KH, Heinstein MW (1998) A parallel contact detection algorithm for transient solid dynamics simulations using PRONTO3D. Comput Mech 22:143–159

  5. Devine K, Boman E, Heaphy R, Hendrickson B, Vaughan C (2002) Zoltan data management services for parallel dynamic applications. Comput Sci Eng 4(2):90–97

  6. Karras T (2012) Maximizing parallelism in the construction of BVHs, octrees, and \(k\)-d trees. In: Dachsbacher C, Munkberg J, Pantaleoni J (eds) Eurographics/ACM SIGGRAPH symposium on high performance graphics, pp 33–37

  7. Karras T (2012) Thinking parallel, part II: tree traversal on the GPU. http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/

  8. OpenMP application program interface, version 4.0, July 2013

  9. Butenhof DR (1997) Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc, Boston

  10. NVIDIA Corporation (2015) CUDA C programming guide. http://docs.nvidia.com/cuda

  11. Wienke S, Springer P, Terboven C, an Mey D (2012) OpenACC: first experiences with real-world applications. In: Proceedings of the 18th international conference on parallel processing, Euro-Par’12. Springer, Berlin, pp 859–870

  12. Robison AD (2013) Composable parallel patterns with Intel Cilk Plus. Comput Sci Eng 15(2):66–71

  13. Reinders J (2007) Intel threading building blocks, 1st edn. O’Reilly & Associates Inc, Sebastopol

  14. Leijen D, Schulte W, Burckhardt S (2009) The design of a task parallel library. In: 24th ACM SIGPLAN conference on object oriented programming systems languages and applications (OOPSLA’09), Orlando, FL. Also appeared in Sigplan Not., 44(10): 227–242

  15. Edwards HC, Trott CR (2013) Kokkos: enabling performance portability across manycore architectures. XSEDE, Boulder. https://www.xsede.org/documents/271087/586927/Edwards-2013-XSCALE13-Kokkos.pdf

  16. Edwards HC, Trott CR, Sunderland D (2014) Kokkos, a manycore device performance portability library for C++ HPC applications. GPU technology conference, San Jose, CA, March 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4213-kokkos-manycore-device-perf-portability-library-hpc-apps.pdf. Also Sandia National Laboratories SAND2014-2317C

  17. Edwards HC, Trott CR, Sunderland D (2014) Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J Parallel Distrib Comput 74(12):3202–3216

  18. Bell N, Hoberock J (2011) Thrust: a productivity-oriented library for CUDA. GPU computing gems Jade Edition. Elsevier, Boston, p 359

  19. Robinson A et al (2008) ALEGRA: an arbitrary Lagrangian–Eulerian multimaterial, multiphysics code. In: Proceedings of the 46th AIAA aerospace sciences meeting

  20. Graham SL, Kessler PB, McKusick MK (1982) Gprof: a call graph execution profiler. In: Proceedings of the ACM SIGPLAN ’82 symposium on compiler construction, pp 120–126

  21. McCool M, Robison A, Reinders J (2012) Structured parallel programming. Morgan Kaufmann, San Francisco

  22. Blelloch GE (1989) Scans as primitive parallel operations. IEEE Trans Comput 38(11):1526–1538

  23. Message Passing Interface Forum (1994) MPI: a message-passing interface standard. Technical report, Knoxville, TN, USA. http://www.mpi-forum.org

  24. OpenMP application program interface, version 1.0, October 1997

  25. Trott CR, Hoemmen M, Hammond SD, Edwards HC (2015) Kokkos: the programming guide. Technical report SAND2015-4178, Sandia National Laboratories. https://github.com/kokkos

  26. Intel Corporation (2015) Intel threading building blocks reference manual. https://www.threadingbuildingblocks.org/docs/help/reference/

  27. Lauterbach C, Garland M, Sengupta S, Luebke D, Manocha D (2009) Fast BVH construction on GPUs. Comput Graph Forum 28:375–384

  28. Satish N, Kim C, Chhugani J, Nguyen AD, Lee VW, Kim D, Dubey P (2010) Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD ’10. ACM, New York, pp 351–362

  29. Davidson A, Tarjan D, Garland M, Owens JD (2012) Efficient parallel merge sort for fixed and variable length keys. In: Innovative parallel computing. p 9

  30. Robison AD (2014) A parallel stable sort using C++11 for TBB, Cilk Plus, and OpenMP. https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp

  31. Ha L, Krüger J, Silva CT (2009) Fast four-way parallel radix sorting on GPUs. Comput Graph Forum 28(8):2368–2378

  32. Peters H, Schulz-Hildebrandt O, Luttenberger N (2010) Fast in-place sorting with CUDA based on bitonic sort. In: Proceedings of the 8th international conference on parallel processing and applied mathematics: part I, PPAM’09. Springer, Berlin, pp 403–410

  33. Sintorn E, Assarsson U (2008) Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput 68(10):1381–1388

  34. Satish N, Harris M, Garland M (2009) Designing efficient sorting algorithms for manycore GPUs. In: IEEE international symposium on parallel distributed processing (IPDPS 2009), pp 1–10

  35. Karras T (2012) Thinking parallel, part III: tree construction on the GPU. http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/

Acknowledgments

This work was funded by the U.S. Department of Energy through the NNSA Advanced Simulation and Computing (ASC) Integrated Codes (IC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. The authors would like to thank Tero Karras of NVIDIA Corporation for his CUDA code and suggestions.

Author information

Corresponding author

Correspondence to Glen A. Hansen.

About this article

Cite this article

Hansen, G.A., Xavier, P.G., Mish, S.P. et al. An MPI+\(X\) implementation of contact global search using Kokkos. Engineering with Computers 32, 295–311 (2016). https://doi.org/10.1007/s00366-015-0418-x
