The Macro-DSE for HPC Processing Unit: The Physical Constraints Perspective

  • Yuxing Tang
  • Lei Wang
  • Yu Deng
  • Xiaoqiang Ni
  • Qiang Dou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9663)


Because of the popularity of big data and cloud computing, the evolution of microarchitecture has had to concentrate on raw computing ability, throughput, low power and low cost at the same time. Given the huge non-recurring engineering (NRE) costs, computer architects and processor designers rely on simulation tools and models to optimize the main processing unit. Design space exploration (DSE) methodology is responsible for filtering the possible design choices. However, the thousands of parameters in current multi-core processors make an exhaustive search prohibitively expensive. Future high performance computing (HPC) no longer insists on peak double-precision floating-point (DFP) performance alone, but also on high throughput and lightweight design. Depending on the level of detail, from the number of cores down to the size of individual pipeline buffers, the DSE problem can be divided into a macro level and a micro level.
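
The cost argument above can be made concrete by counting the points in a full-factorial design space. The following Python sketch splits parameters into a macro level (core style, core count, technology node) and a micro level (buffer and cache sizes), then estimates how long an exhaustive sweep would take. All parameter names, value lists and the one-hour-per-simulation cost are hypothetical and chosen only for illustration. They are not the paper's actual design space.

```python
from itertools import product

# Hypothetical macro-level parameters (illustrative only, not the paper's space).
MACRO_SPACE = {
    "core_type": ["big_ooo", "medium_ooo", "small_inorder"],
    "core_count": [16, 32, 64, 128, 256],
    "tech_node_nm": [65, 45, 32, 22, 16],
}

# Hypothetical micro-level parameters (per-core pipeline and cache sizing).
MICRO_SPACE = {
    "rob_entries": [64, 128, 192, 256],
    "issue_queue_entries": [16, 32, 48, 64],
    "l1d_kb": [16, 32, 64],
    "l2_kb": [256, 512, 1024],
}


def space_size(space):
    """Number of points in the full-factorial (exhaustive) design space."""
    size = 1
    for values in space.values():
        size *= len(values)  # each parameter multiplies the count
    return size


def enumerate_space(space):
    """Yield every design point as a dict; only practical for small spaces."""
    names = list(space)
    for combo in product(*(space[n] for n in names)):
        yield dict(zip(names, combo))


def exhaustive_sweep_days(space, hours_per_point):
    """Wall-clock days to simulate every point, assuming a fixed cost per point."""
    return space_size(space) * hours_per_point / 24.0


if __name__ == "__main__":
    joint = {**MACRO_SPACE, **MICRO_SPACE}
    print("macro points:", space_size(MACRO_SPACE))
    print("micro points:", space_size(MICRO_SPACE))
    print("joint points:", space_size(joint))
    print("days at 1 h/point:", exhaustive_sweep_days(joint, 1.0))
```

With these toy lists, the macro level has 75 points and the micro level has 144. The joint space has 10 800 points, which would take 450 days at an assumed one hour per simulation. Every added parameter multiplies this figure, which is one motivation for exploring the macro level separately before the micro level.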

In this paper, we focus on the macro-DSE problem of choosing the right style of processing core. First, we extend McPAT, the de facto DSE tool, to support technology nodes from 65 nm down to 16 nm and designs of up to 256 cores. Based on the physical design constraints of chip area, power and balanced design requirements, we examine and explore the design of future high-performance processing units. Although traditional HPC pursued peak performance alone, our DSE results show that physical constraints will steer the processing units of future HPC toward a limited set of choices. The experimental results show that, with only a 74.8 % increase in die area and a 3.8 % increase in power, one many-core design can achieve 4 times the peak performance of another in both INT and FP, along with a 285.6 % increase in performance/power efficiency. The key insight of our experiments is that a particular type of processing core can be the best choice, depending on the specific physical design plan.
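
The reported figures can be cross-checked with simple ratios. Assume that "performance/power efficiency" means peak performance divided by power. Then a 4× performance gain at 1.038× power gives an efficiency ratio of 4 / 1.038. The Python sketch below does this arithmetic. The function names are hypothetical, and the efficiency definition is an assumption rather than the paper's stated metric.

```python
def ratio_per_cost(perf_ratio: float, cost_increase_pct: float) -> float:
    """Return (perf_B / cost_B) / (perf_A / cost_A).

    perf_ratio        -- peak performance of design B divided by that of design A
    cost_increase_pct -- increase of a cost metric (power or area) of B over A, in percent
    """
    cost_ratio = 1.0 + cost_increase_pct / 100.0
    return perf_ratio / cost_ratio


def as_increase_pct(ratio: float) -> float:
    """Convert a ratio such as 3.85 into a percentage increase such as 285.0."""
    return (ratio - 1.0) * 100.0


# Figures quoted in the abstract: 4x peak performance, +3.8 % power, +74.8 % die area.
PERF_RATIO = 4.0
POWER_INCREASE_PCT = 3.8
AREA_INCREASE_PCT = 74.8

if __name__ == "__main__":
    per_watt = ratio_per_cost(PERF_RATIO, POWER_INCREASE_PCT)
    per_area = ratio_per_cost(PERF_RATIO, AREA_INCREASE_PCT)
    print(f"perf/power: {per_watt:.3f}x ({as_increase_pct(per_watt):.1f} % increase)")
    print(f"perf/area:  {per_area:.3f}x ({as_increase_pct(per_area):.1f} % increase)")
```

Under these assumptions, the efficiency ratio is about 3.85×, roughly a 285.4 % increase. This is close to the reported 285.6 %. The small gap may come from rounding in the published figures or from a slightly different efficiency metric. The same arithmetic gives about 2.29× peak performance per unit of die area.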


Keywords: Processor · Design space · HPC · Cloud



We thank the other members of the cpu@nudt team, who provided the architecture, microarchitecture and physical design parameters of various processors. This work is supported in part by NSFC grant No. 61272139 and by the National Science and Technology Major Project HGJ-2015ZX01028001-001.


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Yuxing Tang (corresponding author), Lei Wang, Yu Deng, Xiaoqiang Ni, Qiang Dou
    School of Computer, National University of Defense Technology, Changsha, China
