High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

The Journal of Supercomputing

Abstract

Technological advancements in the silicon industry, as predicted by Moore’s law, have increased the number of processor cores on a single chip, giving rise first to multicore and subsequently to many-core architectures. This work identifies key architecture and software optimizations for attaining high performance from tiled many-core architectures (TMAs), an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. TMAs suit these applications because many of them incorporate low-power design features. We discuss performance optimizations on a single tile (processor core) as well as parallel performance optimizations for TMAs, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication. We elaborate on compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to demonstrate experimentally the performance and performance-per-watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Space and Naval Warfare Systems Command (SPAWAR N66001-11-1-4103), the Office of Naval Research (ONR R16480), and the National Science Foundation (NSF) (CNS-0953447 and CNS-0905308). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSERC, the SPAWAR, the ONR, or the NSF. Furthermore, the views expressed are those of the author(s) and do not reflect the official policy or position of the Department of Defense or the US Government. We would like to acknowledge Dr. Alan D. George, Director of the NSF Center of High-Performance Reconfigurable Computing (CHREC) at the University of Florida, Gainesville, Florida, USA, for providing access to CHREC resources and Tilera’s TILE64 and TILEPro64 for this work, as well as for discussions on high-performance computing with the lead author of this article.

Author information

Corresponding author

Correspondence to Arslan Munir.

Appendix: Matrix multiplication algorithms’ code snippets for Tilera’s TILEPro64

This appendix section provides code snippets of our matrix multiplication algorithms for Tilera’s TILEPro64. The code snippets are presented selectively to provide an understanding of our algorithms and some portions of the code are skipped for conciseness.

A.1 Serial non-blocked matrix multiplication algorithm

A.1.1 SerialNonBlockedMM.h

figure a

A.1.2 SerialNonBlockedMM.c

figure b
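
The listing itself is not reproduced in this text, so the following is only a minimal sketch of what a serial non-blocked kernel typically looks like; it is not the authors' code. The function name `matmul_naive`, the `int` element type, and the row-major flat-array layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C = A * B for n x n matrices stored row-major
 * in flat arrays. Element type int is an assumption. */
void matmul_naive(size_t n, const int *a, const int *b, int *c)
{
    assert(n > 0 && a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int sum = 0;
            /* Walks row i of A contiguously but column j of B with
             * stride n, which is what hurts cache locality for large n. */
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Calling `matmul_naive(2, a, b, c)` with `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}` fills `c` with `{19, 22, 43, 50}`.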

A.2 Serial blocked matrix multiplication algorithm

A.2.1 SerialBlockedMM.h

figure c

A.2.2 SerialBlockedMM.c

figure d
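
As an illustration of the cache blocking named in the abstract, here is a hedged sketch of a serial blocked kernel; it is not the authors' listing. The name `matmul_blocked`, the block-size parameter `bs`, the `int` element type, and the requirement that `bs` divides `n` are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: blocked C = A * B for n x n row-major matrices.
 * Assumes bs divides n; a production kernel would handle edge blocks. */
void matmul_blocked(size_t n, size_t bs, const int *a, const int *b, int *c)
{
    assert(n > 0 && bs > 0 && n % bs == 0);
    memset(c, 0, n * n * sizeof *c);

    /* Outer three loops pick a bs x bs block of C, A, and B. */
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                /* Inner three loops multiply the blocks; the working set
                 * is three bs x bs blocks, sized to stay cache-resident. */
                for (size_t i = ii; i < ii + bs; i++)
                    for (size_t k = kk; k < kk + bs; k++) {
                        const int aik = a[i * n + k];
                        for (size_t j = jj; j < jj + bs; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

In practice `bs` would be tuned so that three blocks fit in the target cache level; the right value for a given tile is an empirical question that this sketch does not answer.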

A.3 Parallel blocked matrix multiplication algorithm

A.3.1 ParallelBlockedMM.h

figure e

A.3.2 ParallelBlockedMM.c

figure f
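
To make the idea of application decomposition concrete, the sketch below parallelizes a blocked kernel over block-rows of the result. It is not the authors' listing and does not use Tilera's tile-level programming interface; OpenMP is swapped in here purely as a portable stand-in. The name `matmul_blocked_parallel`, the `int` element type, and the assumption that `bs` divides `n` are illustrative.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: blocked C = A * B, one block-row of C per
 * loop iteration. OpenMP stands in for the platform's own threading. */
void matmul_blocked_parallel(int n, int bs, const int *a, const int *b, int *c)
{
    assert(n > 0 && bs > 0 && n % bs == 0);
    memset(c, 0, (size_t)n * (size_t)n * sizeof *c);

    /* Each iteration of the ii loop writes a disjoint block-row of C,
     * so threads never write the same element and no locking is needed. */
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++) {
                        const int aik = a[i * n + k];
                        for (int j = jj; j < jj + bs; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

If the compiler is invoked without OpenMP support (for example, GCC without `-fopenmp`), the pragma is ignored and the function runs serially with the same result. Decomposing by block-rows keeps writes private to each worker, but A and B are still read by all workers, so placement of those inputs remains a separate concern.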

A.4 Parallel blocked Cannon’s algorithm for matrix multiplication

A.4.1 ParallelBlockedCannonMM.h

figure g

A.4.2 ParallelBlockedCannonMM.c

figure h
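
The following sketch shows the block schedule of Cannon's algorithm on a q x q grid of tiles; it is not the authors' listing. Instead of physically shifting blocks between neighbouring tiles, it simulates the schedule serially with index arithmetic: after the initial skew and `step` shifts, tile (row, col) holds block (row, kb) of A and block (kb, col) of B, where kb = (row + col + step) mod q. The names `matmul_cannon` and `block_multiply_add`, the `int` element type, and the assumption that `q` divides `n` are illustrative.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: C(bi,bj) += A(bi,bk) * B(bk,bj), where each
 * index names a bs x bs block of an n x n row-major matrix. */
static void block_multiply_add(int n, int bs, int bi, int bj, int bk,
                               const int *a, const int *b, int *c)
{
    for (int i = bi * bs; i < (bi + 1) * bs; i++)
        for (int k = bk * bs; k < (bk + 1) * bs; k++) {
            const int aik = a[i * n + k];
            for (int j = bj * bs; j < (bj + 1) * bs; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Serial simulation of Cannon's schedule on a q x q tile grid. */
void matmul_cannon(int n, int q, const int *a, const int *b, int *c)
{
    assert(n > 0 && q > 0 && n % q == 0);
    const int bs = n / q;
    memset(c, 0, (size_t)n * (size_t)n * sizeof *c);

    for (int step = 0; step < q; step++)        /* q multiply-and-shift rounds */
        for (int row = 0; row < q; row++)       /* tile row in the grid */
            for (int col = 0; col < q; col++) { /* tile column in the grid */
                /* Block index this tile holds after skew plus `step` shifts. */
                const int kb = (row + col + step) % q;
                block_multiply_add(n, bs, row, col, kb, a, b, c);
            }
}
```

On a real tiled device, the two inner loops would run concurrently, one iteration per tile, and the index arithmetic would be replaced by each tile passing its A block to its left neighbour and its B block to its upper neighbour. The sketch models neither that communication nor its cost.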

About this article

Cite this article

Munir, A., Koushanfar, F., Gordon-Ross, A. et al. High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study. J Supercomput 66, 431–487 (2013). https://doi.org/10.1007/s11227-013-0916-9

  • DOI: https://doi.org/10.1007/s11227-013-0916-9
