Abstract
Moore’s Law predicted that the number of transistors on a chip would double approximately every 2 years. However, this trend is arriving at an impasse. Making the best use of the available transistors within the thermal dissipation capabilities of the packaging remains an open problem. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization allows developers to exploit data-level parallelism by operating on several elements per instruction, thus reducing the pressure on the fetch and decode pipeline stages. In this paper, we analyze different resource optimization strategies for vector architectures. In particular, we expose the need to split the last-level cache (LLC), the ALUs and the vector ALUs into separate voltage and frequency domains in order to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts the vector register length at runtime.
Acknowledgements
Funding was provided by the RoMoL ERC Advanced Grant (Grant No. GA 321253), Juan de la Cierva (Grant No. JCI-2012-15047), and Marie Curie (Grant No. 2013 BP_B 00243).
Cite this article
Barredo, A., Cebrian, J.M., Valero, M. et al. Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies. J Supercomput 76, 1960–1979 (2020). https://doi.org/10.1007/s11227-019-02841-6
DOI: https://doi.org/10.1007/s11227-019-02841-6
Keywords
- Vector
- Efficiency
- DVFS
- Power wall