How Far Can We Go on the x64 Processors?

  • Mitsuru Matsui
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4047)


This paper studies state-of-the-art software optimization methodology for symmetric cryptographic primitives on the new 64-bit x64 processors, the AMD Athlon 64 (AMD64) and the Intel Pentium 4 (EM64T). We make full use of the newly introduced 64-bit registers and instructions to extract maximal performance from the target primitives. Our AES program with a 128-bit key runs in 170 cycles/block on the Athlon 64, which is, as far as we know, the fastest implementation of AES on a PC processor.

We also implemented “bitsliced” AES and Camellia for the first time, and both achieved very good performance. A bitslice implementation is important as a countermeasure against cache timing attacks because it requires no lookup tables with key-dependent addresses. We also analyze the performance of the SHA256/512 and Whirlpool hash functions and show that SHA512 can run faster than SHA256 on the Athlon 64. Finally, this paper reports the undocumented fact that 64-bit right shifts and 64-bit rotations are extremely slow on the Pentium 4, which often leads to serious and unavoidable performance penalties when programming encryption primitives on this processor.
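As a rough illustration of why a bitslice implementation needs no lookup table with a key-dependent address, the following Python sketch evaluates a toy 3-bit S-box on 64 inputs at once using only bitwise logic on three 64-bit words. It is not the paper's code: the S-box, the helper names (`toy_sbox_bits`, `to_bitslice`, `from_bitslice`) and the use of Python integers masked to 64 bits in place of x64 registers are all assumptions made for illustration, and the S-box is neither the AES nor the Camellia S-box.

```python
MASK64 = (1 << 64) - 1  # models one 64-bit general-purpose register


def toy_sbox_bits(x0, x1, x2):
    """Hypothetical 3-bit S-box expressed purely as boolean logic.

    Works on single bits (0/1) and equally on 64-bit words, where
    bit j of each word belongs to the j-th independent input.
    """
    y0 = (x0 ^ (~x1 & x2)) & MASK64
    y1 = (x1 ^ (~x2 & x0)) & MASK64
    y2 = (x2 ^ (~x0 & x1)) & MASK64
    return y0, y1, y2


# Reference table built from the same logic, one input at a time.
TABLE = []
for v in range(8):
    b0, b1, b2 = toy_sbox_bits(v & 1, (v >> 1) & 1, (v >> 2) & 1)
    TABLE.append(b0 | (b1 << 1) | (b2 << 2))


def to_bitslice(values):
    """Pack up to 64 3-bit values into 3 words (word i holds bit i of every value)."""
    words = [0, 0, 0]
    for j, v in enumerate(values):
        for i in range(3):
            words[i] |= ((v >> i) & 1) << j
    return words


def from_bitslice(words, count):
    """Inverse of to_bitslice: recover `count` 3-bit values."""
    return [sum(((words[i] >> j) & 1) << i for i in range(3))
            for j in range(count)]


def sbox_table(values):
    # Table lookup: the memory address depends on the (secret) value.
    return [TABLE[v] for v in values]


def sbox_bitsliced(values):
    # Bitslice: a fixed sequence of logic operations, no data-dependent address.
    x0, x1, x2 = to_bitslice(values)
    return from_bitslice(toy_sbox_bits(x0, x1, x2), len(values))


if __name__ == "__main__":
    data = [(5 * i + 3) % 8 for i in range(64)]
    print(sbox_bitsliced(data) == sbox_table(data))
```

In the table-based path the index into `TABLE` depends on the data being processed, which is the kind of access a cache timing attack observes. The bitsliced path runs the same instructions regardless of the data, at the cost of converting inputs into and out of the sliced representation.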


Keywords: Fast Software Encryption, x64 Processors, Bitslice



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Mitsuru Matsui, Information Technology R&D Center, Mitsubishi Electric Corporation, Japan
