Skip to main content

Implementation and Optimization of the OpenMP Accelerator Model for the TI Keystone II Architecture

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNPSE,volume 8766)

Abstract

The TI Keystone II architecture provides a unique combination of ARM Cortex-A15 processors with high performance TI C66x floating-point DSPs on a single low-power System-on-chip (SoC). Commercially available systems such as the HP Proliant m800 and nCore BrownDwarf are based on this ARM-DSP SoC. The Keystone II architecture promises to deliver high GFLOPS/Watt and is of increasing interest as it provides an alternate building block for future exascale systems. However, the success of this architecture is intimately related to the ease of migrating existing HPC applications for maximum performance. Effective use of all ARM and DSP cores and DMA co-processors is crucial for maximizing performance/watt. This paper explores issues and challenges encountered while migrating the matrix multiplication (GEMM) kernel, originally written only for the C6678 DSP to the ARM-DSP SoC using an early prototype of the OpenMP 4.0 accelerator model. Single precision (SGEMM) matrix multiplication performance of 110.11 GFLOPS and and double precision (DGEMM) performance of 29.15 GFLOPS was achieved on the TI Keystone II Evaluation Module Revision 3.0 (EVM). Trade-offs and factors affecting performance are discussed.

Keywords

  • Graphic Processing Unit
  • Target Device
  • Scratchpad Memory
  • OpenCL Kernel
  • Memory Management Unit

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-11454-5_15
  • Chapter length: 13 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   44.99
Price excludes VAT (USA)
  • ISBN: 978-3-319-11454-5
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   59.99
Price excludes VAT (USA)

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Ali, M., Stotzer, E., Igual, F.D., van de Geijn, R.A.: Level-3 BLAS on the TI C6678 multi-core DSP. In: IEEE 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 179–186. IEEE (2012)

    Google Scholar 

  2. Igual, F.D., Ali, M., Friedmann, A., Stotzer, E., Wentz, T., van de Geijn, R.A.: Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 26. IEEE Computer Society Press (2012)

    Google Scholar 

  3. HP: HP moonshot system (2014), http://h17007.www1.hp.com/us/en/enterprise/servers/products/moonshot/index.aspx

  4. nCore HPC: ncore browndwarf y-class supercomputer (2014), http://ncorehpc.com/browndwarf/

  5. Stotzer, E., Jayaraj, A., Ali, M., Friedmann, A., Mitra, G., Rendell, A.P., Lintault, I.: OpenMP on the Low-Power TI Keystone II ARM/DSP System-on-Chip. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 114–127. Springer, Heidelberg (2013)

    CrossRef  Google Scholar 

  6. OpenMP ARB: OpenMP Application Program Interface, v.4.0 (2013), http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf

  7. Texas Instruments Literature: SPRS866: 66AK2H12/06 Multicore DSP+ARM Keystone II System-on-Chip (SoC)

    Google Scholar 

  8. Rajovic, N., Rico, A., Puzovic, N., Adeniyi-Jones, C., Ramirez, A.: Tibidabo: making the case for an ARM-based HPC system (2013)

    Google Scholar 

  9. Mitra, G., Johnston, B., Rendell, A.P., McCreath, E., Zhou, J.: Use of SIMD vector operations to accelerate application code performance on low-powered ARM and Intel platforms. In: Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). IEEE (2013)

    Google Scholar 

  10. Khronos: OpenCL: The open standard for parallel programming of heterogeneous systems (2011), http://www.khronos.org/opencl

  11. Reyes, R., Lopez, I., Fumero, J.J., de Sande, F.: Directive-based programming for gpus: A comparative study. In: IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), pp. 410–417. IEEE (2012)

    Google Scholar 

  12. Wolfe, M.: Implementing the PGI accelerator model. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 43–50. ACM (2010)

    Google Scholar 

  13. Han, T.D., Abdelrahman, T.S.: Hi CUDA: A high-level directive-based language for GPU programming. In: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp. 52–61. ACM (2009)

    Google Scholar 

  14. Ahmad, A., Ali, M., South, F., Monroy, G.L., Adie, S.G., Shemonski, N., Carney, P.S., Boppart, S.A.: Interferometric synthetic aperture microscopy implementation on a floating point multi-core digital signal processer. In: SPIE BiOS, International Society for Optics and Photonics, pp. 857134–857134 (2013)

    Google Scholar 

  15. Note, F.W., Van Zee, F.G., Smith, T., Igual, F.D., Smelyanskiy, M., Zhang, X., Kistler, M., Austel, V., Gunnels, J., Low, T.M., et al.: Implementing level-3 blas with blis: Early experience (2013)

    Google Scholar 

  16. NVIDIA: Unified Memory in CUDA 6 (2014), http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

  17. NVIDIA: NVIDIA Tegra K1 Processor (2014), http://www.nvidia.com/object/tegra-k1-processor.html

  18. Liao, C., Yan, Y., de Supinski, B.R., Quinlan, D.J., Chapman, B.: Early Experiences With The OpenMP Accelerator Model. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 84–98. Springer, Heidelberg (2013)

    CrossRef  Google Scholar 

  19. Schmidl, D., Cramer, T., Wienke, S., Terboven, C., Müller, M.S.: Assessing the performance of OpenMP programs on the Intel Xeon Phi. In: Wolf, F., Mohr, B., an Mey, D. (eds.) Euro-Par 2013. LNCS, vol. 8097, pp. 547–558. Springer, Heidelberg (2013)

    CrossRef  Google Scholar 

  20. Barker, J., Bowden, J.: Manycore Parallelism through OpenMP. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 45–57. Springer, Heidelberg (2013)

    CrossRef  Google Scholar 

  21. Cramer, T., Schmidl, D., Klemm, M., an Mey, D.: OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison, pp. 38–44 (2012)

    Google Scholar 

  22. Leang, S.S., Rendell, A.P., Gordon, M.S.: Quantum chemical calculations using accelerators: Migrating matrix operations to the nvidia kepler gpu and the intel xeon phi. Journal of Chemical Theory and Computation 10(3), 908–912 (2014)

    CrossRef  Google Scholar 

  23. Newburn, C., Dmitriev, S., Narayanaswamy, R., Wiegert, J., Murty, R., Chinchilla, F., Deodhar, R., McGuire, R.: Offload Compiler Runtime for the Intel Xeon Phi Coprocessor. In: 2013 IEEE 27th International on Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), pp. 1213–1225 (May 2013)

    Google Scholar 

  24. Li, B., Chang, H.C., Leon Song, S., Su, C.Y., Meyer, T., Mooring, J., Cameron, K.W.: The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications. In: Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). IEEE (2014)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Mitra, G., Stotzer, E., Jayaraj, A., Rendell, A.P. (2014). Implementation and Optimization of the OpenMP Accelerator Model for the TI Keystone II Architecture. In: DeRose, L., de Supinski, B.R., Olivier, S.L., Chapman, B.M., Müller, M.S. (eds) Using and Improving OpenMP for Devices, Tasks, and More. IWOMP 2014. Lecture Notes in Computer Science, vol 8766. Springer, Cham. https://doi.org/10.1007/978-3-319-11454-5_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-11454-5_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11453-8

  • Online ISBN: 978-3-319-11454-5

  • eBook Packages: Computer ScienceComputer Science (R0)