Automatic Task-Based Code Generation for High Performance Domain Specific Embedded Language

  • Antoine Tran Tan
  • Joel Falcou
  • Daniel Etiemble
  • Hartmut Kaiser

Abstract

Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques such as Domain Specific Embedded Languages (DSELs) try to address. In previous work, we investigated the design of such a DSEL, NT\(^2\), which provides a Matlab-like syntax for parallel numerical computations inside a C++ library. In this paper, we show how NT\(^2\) has been redesigned for shared memory systems in an extensible and portable way. The new NT\(^2\) design relies on a tiered Parallel Skeleton system built on asynchronous task management and automatic compile-time taskification of user-level code. We describe how this system can target various shared memory runtimes, and we evaluate the design with two benchmarks implementing linear algebra algorithms.
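The abstract mentions Matlab-like user code that is turned into asynchronous tasks. As a rough illustration of that general idea only (this is not the NT\(^2\) implementation, whose taskification happens at compile time through the library's expression machinery), the following C++11 sketch evaluates an element-wise array statement tile by tile with std::async; the toy container and the names eval_tile, assign_sum and tile_size are hypothetical.

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Toy dense container standing in for an NT2-style table.
    struct table {
      std::vector<double> data;
      explicit table(std::size_t n) : data(n, 0.0) {}
    };

    // Evaluate one tile of the element-wise statement r = a + b * 2.
    void eval_tile(table& r, table const& a, table const& b,
                   std::size_t begin, std::size_t end) {
      for (std::size_t i = begin; i < end; ++i)
        r.data[i] = a.data[i] + b.data[i] * 2.0;
    }

    // Split the assignment into tiles and launch each tile as an asynchronous
    // task, mimicking the taskification of one user-level array statement.
    void assign_sum(table& r, table const& a, table const& b,
                    std::size_t tile_size = 1024) {
      std::vector<std::future<void>> tasks;
      std::size_t const n = r.data.size();
      for (std::size_t begin = 0; begin < n; begin += tile_size) {
        std::size_t const end = std::min(begin + tile_size, n);
        tasks.push_back(std::async(std::launch::async, eval_tile,
                                   std::ref(r), std::cref(a), std::cref(b),
                                   begin, end));
      }
      for (auto& t : tasks) t.get();  // synchronize before the result is used
    }

In NT\(^2\) itself such a statement would be written directly on the library's table objects, e.g. r = a + 2.0 * b, and the library, rather than the user, would decide how to cut the work into tasks and which shared memory runtime executes them.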

Keywords

C++ · Parallel skeletons · Asynchronous programming · Generative programming


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Antoine Tran Tan (1)
  • Joel Falcou (1)
  • Daniel Etiemble (1)
  • Hartmut Kaiser (2)
  1. LRI, INRIA, Université Paris-Sud XI, Orsay, France
  2. CCT, Louisiana State University, Baton Rouge, USA
