Abstract
The massively hardware-multithreaded VLIW emulated shared memory (ESM) architecture REPLICA has a dynamically reconfigurable on-chip network that offers two execution modes: PRAM and NUMA. PRAM mode is mainly suitable for applications with a high amount of thread-level parallelism (TLP), while NUMA mode is mainly intended for accelerating sequential programs or programs with low TLP. Some types of regular data-parallel algorithms also execute faster in NUMA mode. It is not obvious in which mode a given program region performs best. In this study we focus on generic stencil-like computations exhibiting regular control flow and memory access patterns. We use two state-of-the-art machine-learning methods, C5.0 (decision trees) and Eureqa Pro (symbolic regression), to select which mode to use. We use these methods to derive different predictors from the same training data and compare their results. The best derived predictors reach an accuracy of 95% and are generated by both C5.0 and Eureqa Pro, although the latter can in some cases be more sensitive to the training data. The average speedup gained from mode switching ranges from 1.92 to 2.23 for all generated predictors on the evaluation test cases, and by using a majority voting algorithm based on the three best predictors, we can eliminate all misclassifications.
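The majority-voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes three binary predictors that each return either "PRAM" or "NUMA" for a program region. The feature names (`threads`, `accesses_per_point`, `problem_size`), the thresholds, and the three predictor functions are hypothetical stand-ins chosen for illustration; they are not the predictors derived in the paper.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Features = Dict[str, float]
Predictor = Callable[[Features], str]


def majority_vote(predictors: Sequence[Predictor], features: Features) -> str:
    """Return the mode ("PRAM" or "NUMA") chosen by most predictors."""
    if len(predictors) % 2 == 0:
        # With two possible labels, an odd number of voters rules out ties.
        raise ValueError("use an odd number of predictors to avoid ties")
    votes = Counter(p(features) for p in predictors)
    mode, _count = votes.most_common(1)[0]
    return mode


# Hypothetical predictors: the rules and thresholds below are invented
# for illustration and do not come from C5.0 or Eureqa Pro output.
def predictor_a(f: Features) -> str:
    return "PRAM" if f["threads"] >= 64 else "NUMA"


def predictor_b(f: Features) -> str:
    return "PRAM" if f["threads"] * f["accesses_per_point"] >= 256 else "NUMA"


def predictor_c(f: Features) -> str:
    return "NUMA" if f["problem_size"] < 1024 else "PRAM"


# Hypothetical description of one stencil-like program region.
region = {"threads": 128, "accesses_per_point": 5, "problem_size": 512}
selected = majority_vote([predictor_a, predictor_b, predictor_c], region)
print(selected)
```

In this sketch a single predictor that disagrees with the other two is outvoted, which is the mechanism by which voting can mask an individual misclassification.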
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Hansson, E., Kessler, C. (2014). Optimized Selection of Runtime Mode for the Reconfigurable PRAM-NUMA Architecture REPLICA Using Machine-Learning. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8806. Springer, Cham. https://doi.org/10.1007/978-3-319-14313-2_12
DOI: https://doi.org/10.1007/978-3-319-14313-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14312-5
Online ISBN: 978-3-319-14313-2
eBook Packages: Computer Science (R0)