Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems

  • Lu Li
  • Usman Dastgeer
  • Christoph Kessler
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7851)

Abstract

In recent years, heterogeneous multi-core systems have received much attention, yet performance optimization on these platforms remains a major challenge. Optimizations performed by compilers are often limited by the lack of dynamic information and knowledge of the run-time environment, so applications are frequently not performance portable. One current approach is to provide multiple implementations of the same interface that can be used interchangeably depending on the call context, and to expose the composition choices to a compiler, a deployment-time composition tool and/or a run-time system. Off-line machine-learning techniques improve the precision and reduce the overhead of run-time composition, and thereby improve performance portability. In this work we extend the run-time composition mechanism in the PEPPHER composition tool with off-line composition and present an adaptive machine-learning algorithm that generates compact and efficient dispatch data structures with low training time. As dispatch data structure we propose an adaptive decision tree, together with an adaptive training algorithm that allows us to control the trade-off between training time, dispatch precision and run-time dispatch overhead.

We have evaluated our optimization strategy on two GPU-based heterogeneous systems, using simple kernels (matrix multiplication and sorting) as well as applications from the RODINIA benchmark suite. On average, the precision of the composition choices reaches 83.6 percent with approximately 34 minutes of off-line training time.
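To make the idea of the dispatch data structure concrete, the following is a minimal sketch in C++ of a decision tree over a single call-context feature (problem size) whose leaves record the implementation variant that performed best during off-line training. All names here (Variant, DispatchNode, select) are hypothetical illustrations and not part of the PEPPHER composition tool's API; the actual tool considers multiple context properties and builds the tree adaptively during training.

    // Hypothetical sketch (not the PEPPHER API): a dispatch tree over one
    // call-context feature (problem size n) that selects the implementation
    // variant recorded as best during off-line training.
    #include <cstddef>
    #include <iostream>
    #include <memory>

    enum class Variant { CpuSequential, CpuOpenMP, GpuCuda };

    // A node either splits the feature range at 'threshold' or is a leaf
    // holding the variant that won the training measurements for its range.
    struct DispatchNode {
        bool leaf = true;
        Variant best = Variant::CpuSequential;      // valid if leaf
        std::size_t threshold = 0;                  // valid if !leaf
        std::unique_ptr<DispatchNode> below, above; // valid if !leaf
    };

    // Run-time lookup: cost is proportional to tree depth, so the dispatch
    // overhead is bounded by how far training is allowed to refine the tree.
    Variant select(const DispatchNode& node, std::size_t n) {
        if (node.leaf) return node.best;
        return n < node.threshold ? select(*node.below, n)
                                  : select(*node.above, n);
    }

    int main() {
        // Tiny hand-built tree standing in for the output of off-line
        // training: small problems run on CPU, large problems on GPU.
        DispatchNode root;
        root.leaf = false;
        root.threshold = 1 << 14;
        root.below = std::make_unique<DispatchNode>();
        root.below->best = Variant::CpuOpenMP;
        root.above = std::make_unique<DispatchNode>();
        root.above->best = Variant::GpuCuda;

        std::cout << (select(root, 1024) == Variant::CpuOpenMP) << "\n";  // 1
        std::cout << (select(root, 1 << 20) == Variant::GpuCuda) << "\n"; // 1
    }

In this toy form, refining the tree (adding more splits and measuring more training points) would increase dispatch precision at the cost of longer training and a slightly deeper, and therefore slightly slower, run-time lookup, which is the trade-off the adaptive training algorithm controls.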

Keywords

Autotuning · Heterogeneous architecture · GPU

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Lu Li (1)
  • Usman Dastgeer (1)
  • Christoph Kessler (1)

  1. PELAB, IDA, Linköping University, Linköping, Sweden
