The Journal of Supercomputing, Volume 74, Issue 3, pp 1341–1377

Theoretical peak FLOPS per instruction set: a tutorial

  • Romain Dolbeau


Traditionally, evaluating the theoretical peak performance of a CPU in FLOPS (floating-point operations per second) was merely a matter of multiplying the frequency by the number of floating-point instructions per cycle. Today, however, CPUs have features such as vectorization, fused multiply-add, hyperthreading, and "turbo" mode. In this tutorial, we look into this theoretical peak for recent fully featured Intel CPUs and other hardware, taking into account not only the simple absolute peak but also the relevant instruction sets, instruction encodings, and the frequency-scaling behaviour of modern hardware.
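The multiplication described above can be sketched as a short calculation. The function and its parameters below are illustrative, not taken from the paper: it assumes a hypothetical CPU whose peak is frequency × cores × FLOPs per cycle, where FLOPs per cycle follows from the SIMD width, the precision, the number of vector FP units, and whether fused multiply-add (counted as two FLOPs per lane) is available.

```python
def peak_gflops(cores, ghz, simd_width_bits, fp_bits, fma_units, fma=True):
    """Theoretical peak in GFLOPS: cores x frequency x FLOPs per cycle."""
    lanes = simd_width_bits // fp_bits             # elements per vector register
    flops_per_instr = 2 * lanes if fma else lanes  # FMA counts as 2 FLOPs per lane
    return cores * ghz * flops_per_instr * fma_units

# Hypothetical AVX-512-class part: 16 cores at 2.0 GHz, two 512-bit FMA units,
# double precision (64-bit): 8 lanes x 2 FLOPs x 2 units = 32 FLOPs/cycle/core.
print(peak_gflops(16, 2.0, 512, 64, 2))  # -> 1024.0 GFLOPS
```

Note that, as the abstract warns, this simple product is only an upper bound: wide-vector "turbo" frequency reductions mean the sustained clock (and hence the real peak) can be lower than the nominal `ghz` used here.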


Keywords: FLOPS, Performance, Tutorial



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Atos, Rennes, France
