Asaadi H, Khaldi D, Chapman B (2016) A comparative survey of the hpc and big data paradigms: Analysis and experiments. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 423–432. IEEE
ORNL (Oak Ridge National Laboratory) (2021): Frontier. https://www.olcf.ornl.gov/frontier/. Accessed: 2021-11-01
Brown WM (2011) Gpu acceleration in lammps. In: LAMMPS User’s Workshop and Symposium
Pronk S, Páll S, Schulz R, Larsson P, Bjelkmar P, Apostolov R, Shirts MR, Smith JC, Kasson PM, Van Der Spoel D, et al (2013) Gromacs 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29(7), 845–854
Salomon-Ferrer R, Gotz AW, Poole D, Le Grand S, Walker RC (2013) Routine microsecond molecular dynamics simulations with amber on gpus. 2. explicit solvent particle mesh ewald. Journal of chemical theory and computation 9(9), 3878–3888
Lee M, Malaya N, Moser RD (2013) Petascale direct numerical simulation of turbulent channel flow on up to 786k cores. In: SC’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. IEEE
Michel J-C, Moulinec H, Suquet P (1999) Effective properties of composite materials with periodic microstructure: a computational approach. Comput Methods Appl Mech Eng 172(1–4):109–143
Jung J, Kobayashi C, Imamura T, Sugita Y (2016) Parallel implementation of 3d fft with volumetric decomposition schemes for efficient molecular dynamics simulations. Comput Phys Commun 200:57–65
Tari V, Lebensohn RA, Pokharel R, Turner TJ, Shade PA, Bernier JV, Rollett AD (2018) Validation of micro-mechanical fft-based simulations using high energy diffraction microscopy on ti-7al. Acta Mater 154:273–283
Almgren AS, Bell JB, Lijewski MJ, Lukić Z, Van Andel E (2013) Nyx: A massively parallel amr code for computational cosmology. Astrophys J 765(1):39
Kowalski K, Bair R, Bauman NP, Boschen JS, Bylaska EJ, Daily J, de Jong WA, Dunning T Jr, Govind N, Harrison RJ et al (2021) From nwchem to nwchemex: Evolving with the computational chemistry landscape. Chem Rev 121(8):4962–4998
NVIDIA: cuFFT. https://docs.nvidia.com/cuda/cufft/index.html
ROCmSoftwarePlatform (2018) Rocmsoftwareplatform/ROCFFT: Next generation FFT implementation for ROCM. https://github.com/ROCmSoftwarePlatform/rocFFT
Gholami A, Hill J, Malhotra D, Biros G (2015) Accfft: a library for distributed-memory fft on cpu and gpu architectures. arXiv preprint arXiv:1506.07933
Takahashi D (2014) Ffte: A fast fourier transform package. http://www.ffte.jp/
Ayala A, Tomov S, Haidar A, Dongarra J (2020) heffte: highly efficient fft for exascale. In: International Conference on Computational Science, pp. 262–275. Springer
Barker B (2015) Message passing interface (mpi). In: Workshop: High Performance Computing on Stampede, vol. 262
Dagum L, Menon R (1998) Openmp: an industry standard API for shared-memory programming. IEEE Comput Sci Eng 5(1):46–55
Frigo M, Johnson SG (2005) The design and implementation of fftw3. Proc IEEE 93(2):216–231
Luszczek PR, Bailey DH, Dongarra JJ, Kepner J, Lucas RF, Rabenseifner R, Takahashi D (2006) The hpc challenge (hpcc) benchmark suite. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, vol. 213, pp. 1188455–1188677
Wang H, Potluri S, Bureddy D, Rosales C, Panda DK (2013) Gpu-aware mpi on rdma-enabled clusters: Design, implementation and evaluation. IEEE Trans Parallel Distrib Syst 25(10):2595–2605
Schroeder TC (2011) Peer-to-peer & unified virtual addressing. In: GPU Technology Conference, NVIDIA
Potluri S, Wang H, Bureddy D, Singh AK, Rosales C, Panda DK (2012) Optimizing mpi communication on multi-gpu systems using cuda inter-process communication. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 1848–1857. IEEE
ROCmSoftwarePlatform (2018) ROCmSoftwarePlatform/RCCL: ROCM Communication Collectives Library (RCCL). https://github.com/ROCmSoftwarePlatform/rccl
Sunitha N, Raju K, Chiplunkar NN (2017) Performance improvement of cuda applications by reducing cpu-gpu data transfer overhead. In: 2017 international conference on inventive communication and computational technologies (ICICCT), pp 211–215. IEEE
Jodra JL, Gurrutxaga I, Muguerza J (2015) Efficient 3d transpositions in graphics processing units. Int J Parallel Prog 43(5):876–891
Ruetsch G, Micikevicius P (2009) Optimizing matrix transpose in Cuda. Nvidia CUDA SDK Appl Note 18:1
AMD (2021) AMD INSTINCT™ MI100 accelerator | data center GPU | AMD. https://www.amd.com/en/products/server-accelerators/instinct-mi100