Towards High-Performance Implementations of a Custom HPC Kernel Using Intel® Array Building Blocks

  • Alexander Heinecke
  • Michael Klemm
  • Hans Pabst
  • Dirk Pflüger
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7174)


Today’s highly parallel machines drive a new demand for parallel programming. Fixed power envelopes, increasing problem sizes, and new algorithms pose challenging targets for developers. HPC applications must leverage SIMD units, multi-core architectures, and heterogeneous computing platforms for optimal performance. This leads to low-level, non-portable code that is difficult to write and maintain. With Intel® Array Building Blocks (Intel ArBB), programmers focus on the high-level algorithms and rely on automatic parallelization and vectorization with strong safety guarantees. Intel ArBB hides vendor-specific hardware knowledge through just-in-time (JIT) compilation at runtime. This case study on data mining with adaptive sparse grids shows how deterministic parallelism, safety, and runtime optimization make Intel ArBB practically applicable. Hand-tuned code is about 40% faster than the Intel ArBB version but needs about 8x more code. Intel ArBB outperforms standard semi-automatically parallelized C/C++ code by approximately 6x.


parallel languages, vector computing, high performance computing, Intel® Array Building Blocks, Intel ArBB, OpenCL





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Alexander Heinecke (1)
  • Michael Klemm (2)
  • Hans Pabst (2)
  • Dirk Pflüger (1)
  1. Technische Universität München, Garching, Germany
  2. Intel GmbH, Feldkirchen, Germany
