Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems

Li, Lu; Dastgeer, Usman; Kessler, Christoph

doi:10.1007/978-3-642-38718-0_32


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)

Included in the following conference series:

International Conference on High Performance Computing for Computational Science


Abstract

In recent years, heterogeneous multi-core systems have received much attention. However, performance optimization on these platforms remains a major challenge. Compiler optimizations are often limited by the lack of dynamic information and of knowledge about the run-time environment, so applications are often not performance-portable. One current approach is to provide multiple implementations of the same interface that can be used interchangeably depending on the call context, and to expose the composition choices to a compiler, a deployment-time composition tool and/or a run-time system. Off-line machine-learning techniques can improve the precision and reduce the run-time overhead of run-time composition, and thereby improve performance portability. In this work we extend the run-time composition mechanism of the PEPPHER composition tool with off-line composition and present an adaptive machine-learning algorithm that generates compact and efficient dispatch data structures with low training time. As the dispatch data structure we propose an adaptive decision tree, which implies an adaptive training algorithm that allows control of the trade-off between training time, dispatch precision and run-time dispatch overhead.
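To make the dispatch idea concrete, the following Python sketch builds a small decision tree off-line over a single context property (problem size) and then uses it at call time to select an implementation variant. It is a minimal illustration under stated assumptions, not the PEPPHER composition tool's code or the authors' exact training algorithm: the variant names `cpu` and `gpu`, the synthetic `timer` cost model, and the helpers `Node`, `train` and `dispatch` are all hypothetical. Training measures only interval endpoints and splits an interval only when its endpoints disagree on the fastest variant; `max_depth` caps the number of splits and so stands in for the trade-off between training time and dispatch precision described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Node:
    """Leaf if `winner` is set; otherwise an inner node splitting at `mid`."""
    winner: Optional[str] = None
    mid: Optional[int] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def train(lo: int, hi: int, variants: Sequence[str],
          timer: Callable[[str, int], float], max_depth: int) -> Node:
    """Hypothetical adaptive off-line training over problem sizes [lo, hi].

    Only interval endpoints are measured; an interval is split only when
    its endpoints disagree on the fastest variant.
    """
    cache: Dict[int, str] = {}

    def fastest(n: int) -> str:
        # Time every variant once per sampled size (memoised).
        if n not in cache:
            cache[n] = min(variants, key=lambda v: timer(v, n))
        return cache[n]

    def build(lo: int, hi: int, depth: int) -> Node:
        w_lo, w_hi = fastest(lo), fastest(hi)
        if w_lo == w_hi:
            # Assumes no crossover hides inside the interval.
            return Node(winner=w_lo)
        if hi - lo <= 1:
            # Adjacent sizes: the crossover is located exactly.
            return Node(mid=lo, left=Node(winner=w_lo), right=Node(winner=w_hi))
        mid = (lo + hi) // 2
        if depth >= max_depth:
            # Training budget exhausted: approximate by the midpoint winner.
            return Node(winner=fastest(mid))
        return Node(mid=mid,
                    left=build(lo, mid, depth + 1),
                    right=build(mid, hi, depth + 1))

    return build(lo, hi, 0)


def dispatch(node: Node, n: int) -> str:
    """Walk the tree to pick a variant for problem size n (cost grows with depth)."""
    while node.winner is None:
        node = node.left if n <= node.mid else node.right
    return node.winner


# Hypothetical cost model: the CPU variant scales linearly, the GPU variant
# pays a fixed 1000-unit offload cost but is 10x faster per element.
def timer(variant: str, n: int) -> float:
    return float(n) if variant == "cpu" else 1000 + 0.1 * n


tree = train(1, 10_000, ["cpu", "gpu"], timer, max_depth=20)
print(dispatch(tree, 500), dispatch(tree, 5000))  # cpu gpu
```

Because dispatch only walks the tree, its run-time cost grows with tree depth rather than with the number of training samples. The endpoint-agreement test assumes that no crossover hides inside an interval whose endpoints agree; with non-monotonic costs such a sketch can misclassify some contexts, which is one source of imprecision this kind of adaptive training has to accept.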

We have evaluated our optimization strategy with simple kernels (matrix multiplication and sorting) as well as with applications from the RODINIA benchmark suite on two GPU-based heterogeneous systems. On average, the precision of composition choices reaches 83.6 percent with approximately 34 minutes of off-line training time.
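The reported precision can be read as the fraction of test contexts in which the dispatcher selects the variant that is actually fastest when measured. That reading is an assumption here, as are the helper name `selection_precision` and the sample data; the sketch only illustrates the arithmetic, not the paper's evaluation procedure.

```python
from typing import Sequence


def selection_precision(predicted: Sequence[str], actual_best: Sequence[str]) -> float:
    """Fraction of test contexts where the predicted variant was the measured-fastest one."""
    if len(predicted) != len(actual_best) or not predicted:
        raise ValueError("expected two non-empty sequences of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual_best))
    return hits / len(predicted)


# Hypothetical data: 3 of 4 dispatch decisions match the measured best variant.
print(selection_precision(["cpu", "gpu", "gpu", "cpu"],
                          ["cpu", "gpu", "cpu", "cpu"]))  # 0.75
```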



Author information

Authors and Affiliations

PELAB, IDA, Linköping University, S-581 83, Linköping, Sweden
Lu Li, Usman Dastgeer & Christoph Kessler


Editor information

Editors and Affiliations

INPT (ENSEEIHT) - IRIT, University of Toulouse, 31062, Toulouse, France
Michel Daydé
Lawrence Berkeley National Laboratory, 94720-8139, Berkeley, CA, USA
Osni Marques
Information Technology Center, The University of Tokyo, 113-8658, Tokyo, Japan
Kengo Nakajima


About this paper

Cite this paper

Li, L., Dastgeer, U., Kessler, C. (2013). Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_32


DOI: https://doi.org/10.1007/978-3-642-38718-0_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38717-3
Online ISBN: 978-3-642-38718-0
eBook Packages: Computer Science, Computer Science (R0)
