Subjective versus objective: classifying analytical models for productive heterogeneous performance prediction
Abstract
Heterogeneous analytical models are valuable tools that facilitate optimal application tuning via runtime prediction; however, understanding and employing them for meaningful performance prediction requires several person-hours of effort. Consequently, developers face the challenge of selecting performance models that best fit their design goals and level of system knowledge. In this research, we present a classification that enables users to select a set of easy-to-use and reliable analytical models for quality performance prediction. These models, which target general-purpose graphics processing unit (GPGPU)-based systems, are categorized into two primary analytical classes: subjective-analytical and objective-analytical. The subjective-analytical models predict the computation and communication components of an application by describing the system using a minimal set of qualitative relations among the system parameters, whereas the objective-analytical models predict these components by measuring pertinent hardware events using micro-benchmarks. We categorize, enhance, and characterize the existing analytical models for GPGPU computations, network-level communications, and interconnect communications to facilitate fast and reliable application performance prediction. We also explore a suitable combination of the two analytical classes, the hybrid approach, for high-quality performance prediction and report prediction accuracy of up to 95% for several tested GPGPU cluster configurations. The research ultimately aims to provide a collection of easy-to-select analytical models that promote straightforward and accurate performance prediction prior to large-scale implementation.
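As a rough illustration of how a computation estimate and a communication estimate might combine into a runtime prediction for a synchronous iterative algorithm, consider the minimal Python sketch below. It is not the authors' model. The per-iteration sum, the latency-plus-bandwidth communication term, the accuracy formula, and every name and number in it are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CommParams:
    """Hypothetical link parameters, e.g. as a micro-benchmark might report them."""
    latency_s: float        # fixed per-message cost in seconds
    bandwidth_bps: float    # sustained bandwidth in bytes per second


def comm_time(message_bytes: float, link: CommParams) -> float:
    # Assumed latency-plus-bandwidth cost for one message.
    return link.latency_s + message_bytes / link.bandwidth_bps


def predict_runtime(iterations: int, comp_time_s: float,
                    message_bytes: float, link: CommParams) -> float:
    # Assumption: each iteration computes, then communicates, with no overlap.
    return iterations * (comp_time_s + comm_time(message_bytes, link))


def prediction_accuracy(predicted_s: float, measured_s: float) -> float:
    # Assumed definition: 100% minus the relative error, in percent.
    return 100.0 * (1.0 - abs(predicted_s - measured_s) / measured_s)


if __name__ == "__main__":
    # All values below are made up for the example.
    link = CommParams(latency_s=2e-6, bandwidth_bps=5e9)
    predicted = predict_runtime(iterations=1000, comp_time_s=1e-3,
                                message_bytes=1e6, link=link)
    print(f"predicted runtime: {predicted:.3f} s")
    print(f"accuracy vs. 1.25 s measured: "
          f"{prediction_accuracy(predicted, 1.25):.2f} %")
```

In this sketch the computation time per iteration is passed in as a single number, so it could come from either class of model, while the communication parameters stand in for measured values.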
Keywords
Performance prediction, Analytical modeling, Qualitative analysis, Quantitative analysis, High-level abstraction, Synchronous iterative algorithms, GPGPU clusters, Kepler K20