Comb to Pipeline: Fast Software Encryption Revisited
Abstract
AES-NI, or Advanced Encryption Standard New Instructions, is an extension of the x86 architecture proposed by Intel in 2008. With a pipelined implementation utilizing AES-NI, parallelizable modes such as AES-CTR become extremely efficient. However, out of the four non-trivial NIST-recommended encryption modes, three are inherently sequential: CBC, CFB, and OFB. This significantly limits the benefit of AES-NI. Similar observations apply to CMAC, CCM and a great many other modes. We address this issue by proposing the comb scheduler, a fast, low-overhead scheduling algorithm based on an efficient look-ahead strategy, with which sequential modes profit from the AES-NI pipeline in real-world settings by filling it with multiple, independent messages.
We apply the comb scheduler to implementations on Haswell, Intel’s latest microarchitecture, for a wide range of modes. We observe a drastic speedup of factor 5 for NIST’s CBC, CFB, OFB and CMAC, which perform at around 0.88 cpb. Surprisingly, contrary to the entire body of previous performance analysis, the throughput of the authenticated encryption (AE) mode CCM gets very close to that of GCM and OCB3, with about 1.64 cpb (vs. 1.63 cpb and 1.51 cpb, resp.), despite Haswell’s heavily improved binary field multiplication. This suggests CCM as an AE mode of choice, as it is NIST-recommended, does not have any weak-key issues like GCM, and is royalty-free as opposed to OCB3. Among the CAESAR contestants, the comb scheduler significantly speeds up CLOC/SILC, JAMBU, and POET, with the mostly sequential nonce-misuse resistant design of POET, performing at 2.14 cpb, becoming faster than the well-parallelizable COPA.
Finally, this paper provides the first optimized AES-NI implementations for the novel AE modes OTR, CLOC/SILC, COBRA, POET, McOE-G, and Julius.
Keywords
AES-NI, pclmulqdq, Haswell, Authenticated encryption, CAESAR, CBC, OFB, CFB, CMAC, CCM, GCM, OCB3, OTR, CLOC, COBRA, JAMBU, SILC, McOE-G, COPA, POET, Julius
1 Introduction
With the introduction of AES-NI, Advanced Encryption Standard New Instructions, on Intel’s microarchitectures from Westmere onwards as well as on a variety of AMD CPUs, AES received a significant speedup in standard software, going well below 1 cycle per byte (cpb) and possessing a constant running time, which also thwarts cache-timing attacks. Important applications for AES-NI include OpenSSL, Microsoft’s BitLocker, Apple’s FileVault, TrueCrypt, PGP and many more. In a nutshell, AES-NI provides dedicated instructions for AES encryption and decryption. On Haswell, Intel’s newest architecture, the latency of these instructions is 7 clock cycles (cc) and the throughput is 1 cc. That is, AES-NI has a pipeline of length 7 and one can issue one instruction per clock cycle. This pipeline can be naturally exploited by parallel AES modes such as CTR in the encryption domain, PMAC in the message authentication domain as well as GCM and OCB in the authenticated encryption domain.
However, numerous AES modes of operation, both standardized ones and novel ones such as CAESAR submissions, are essentially sequential by design. Indeed, NIST-standardized CBC, CFB, OFB and CMAC [10] as well as CLOC and POET from FSE 2014 and McOE-G from FSE 2012 are essentially sequential. This significantly limits their performance on state-of-the-art servers and desktops, as the pipeline cannot be filled entirely, which incurs a severe performance penalty.
In this paper, we aim to address this gap and propose an efficient look-ahead comb scheduler for real-world Internet packets. Its application can change the landscape of AES modes of operation in terms of their practical throughput. Our contributions are as follows:
Novel Comb Scheduler. Communication devices on high-speed links are likely to process many messages at the same time. Indeed, on the Internet, the bulk of data is transmitted in packets of sizes between 1 and 2 KB, following a bimodal distribution. While most previous implementations of block cipher modes consider processing a single message, we propose to process several messages in parallel, which reflects this reality. This is particularly beneficial when using an inherently sequential mode. In this work, for the first time, we deal with AES modes of operation in this setting (see Sect. 3). More specifically, as our main contribution, we propose an efficient look-ahead comb scheduler. For real-world packet lengths on the Internet, this algorithm allows us to fill the pipeline of AES-NI and attain significant speedups for many popular modes. After covering some background in Sect. 2, we present our comb scheduler and its analysis in Sect. 3.
Speed-Up of Factor 5 for NIST’s CBC, OFB, CFB and CMAC. When applied to the NIST-recommended encryption and MAC modes, our comb scheduler delivers a performance gain of factor 5 with the real-world packet sizes. The modes get as fast as 0.88 cpb compared to around 4.5 cpb in the sequential message processing setting. These results are provided in Sect. 4.
Change of Landscape for AE. When our comb scheduler is applied to AE modes of operation, a high performance improvement is attained as well with the real-world message size distribution. CCM, having a sequential CBC-based MAC inside, gets as fast as GCM and OCB, which are inherently parallel. Being royalty-free, NIST-recommended and weak-key free, CCM becomes an attractive AE mode of operation in this setting.
In the context of the ongoing CAESAR competition, in the domain of nonce-misuse resistant modes, the essentially sequential POET gets a significant speedup of factor 2.7 down to 2.14 cpb. Its rival CAESAR contestant COPA runs at 2.68 cpb, while being insecure under release of unverified plaintext. This is somewhat surprising, considering that POET uses 3 AES calls per block vs. 2 AES calls per block for COPA.
Section 5 also contains first-time comprehensive performance evaluations of further AES-based modes in the CAESAR competition and beyond, both in the sequential and comb-scheduled implementations, including OTR, CLOC/SILC, JAMBU, COBRA, McOE-G and Julius.
Faster \(GF(2^{128})\) Multiplications on Haswell. Section 6 focuses on the technical implementation tricks on Haswell that we used to obtain our results and contains a detailed study of improved \(GF(2^{128})\) multiplications on the architecture.
2 Background
In this paper, we consider AES-based symmetric primitives, that is, algorithms that make use of the (full) AES block cipher in a black-box fashion. In particular, this includes block cipher modes of operation, block cipher based message authentication codes, and authenticated encryption (AE) modes.
NIST-recommended Modes. In its special publications SP 800-38A through 800-38D [10], NIST recommends the following modes of operation: ECB, CBC, CFB, OFB and CTR as basic encryption modes; CMAC as authentication mode; and CCM and GCM as authenticated encryption modes.
Authenticated Encryption Modes and CAESAR. Besides the widely employed and standardized modes CCM and GCM, a great number of modes for authenticated encryption have been proposed, many of them being contestants in the currently ongoing CAESAR competition. We give a brief overview of the AE modes we will consider in this study.
Table 1. Overview of the AE modes considered in this paper. The \(\Vert \) column indicates parallelizability; the “IF” (inverse-free) column indicates whether a mode avoids the inverse of the underlying block cipher in decryption/verification; the “E” and “M” columns give the number of calls, per message block, to the underlying block cipher and multiplications in \(GF(2^n)\), respectively.

| Mode | Ref. | Year | \(\Vert \) | IF | E | M | Description |
|---|---|---|---|---|---|---|---|
| Nonce-based AE modes | | | | | | | |
| CCM | [37] | 2002 | – | yes | 2 | – | CTR encryption, CBC-MAC authentication |
| GCM | [31] | 2004 | yes | yes | 1 | 1 | CTR mode with chain of multiplications |
| OCB3 | [26] | 2010 | yes | – | 1 | – | Gray code-based xor-encrypt-xor (XEX) |
| OTR | [33] | 2013 | yes | yes | 1 | – | Two-block Feistel structure |
| CLOC | [21] | 2014 | – | yes | 1 | – | CFB mode with low overhead |
| COBRA | [5] | 2014 | yes | yes | 1 | 1 | Combining OTR with chain of multiplications |
| JAMBU | [38] | 2014 | – | yes | 1 | – | AES in stream mode, lightweight |
| SILC | [22] | 2014 | – | yes | 1 | – | CLOC with smaller hardware footprint |
| Nonce-misuse resistant AE modes | | | | | | | |
| McOE-G | [11] | 2011 | – | – | 1 | 1 | Serial multiplication-encryption chain |
| COPA | [4] | 2013 | yes | – | 2 | – | Two-round XEX |
| POET | [1] | 2014 | yes | – | 3 | – | XEX with two AXU (full AES-128 call) chains |
| Julius | [7] | 2014 | – | – | 1 | 2 | SIV with polynomial hashing |
For the specifications of the AE modes considered, we refer to the relevant references listed in Table 1. We clarify that for COBRA we refer to the FSE 2014 version with its reduced security claims (compared to the withdrawn CAESAR candidate); with POET we refer to the version where the universal hashing is implemented as full AES-128 (since using four rounds would not comprise a mode of operation); and with Julius, we mean the CAESAR candidate regular Julius-ECB.
The AES-NI Instruction Set. Proposed in 2008 and implemented as of the 2010 Westmere microarchitecture, Intel’s special instructions for fast AES encryption and decryption [15] are called the AES New Instruction Set (AES-NI). The set provides instructions for computing one AES round (aesenc, aesenclast), its inverse (aesdec, aesdeclast), and auxiliary instructions for key scheduling. The instructions offer not only better performance but also better security, since they leak no timing information. AES-NI is supported in a subset of the Westmere, Sandy Bridge, Ivy Bridge and Haswell microarchitectures. A range of AMD processors also support the instructions under the name AES Instructions, including processors in the Bulldozer, Piledriver and Jaguar series [19].
Pipelining. Instruction pipelines allow CPUs to execute the same instruction for data-independent instances in an overlapping fashion. This is done by subdividing the instruction into steps called pipeline stages, with each stage processing its part of one instruction at a time. The performance of a pipelined instruction is characterized by its latency L (the number of cycles to complete one instruction) and throughput T (the number of cycles to wait between issuing instructions; strictly speaking, this is the inverse throughput). For instance, on the original Westmere architecture, the AES-NI aesenc instruction has a latency of 6 cycles and a throughput of 2, meaning that one instruction can be issued every two cycles.
Previous Work. Matsui and Fukuda at FSE 2005 [29] and Matsui [28] at FSE 2006 pioneered comprehensive studies on how to optimize symmetric primitives on the then-contemporary generation of Intel microprocessors. One year later, Matsui and Nakajima [30] demonstrated that the vector instruction units of the Core 2 architecture lend themselves to very fast bitsliced implementations of block ciphers. For the AES, on a variety of platforms, Bernstein and Schwabe [8] developed various micro-optimizations yielding vastly improved performance. Intel’s AES instructions were introduced to the symmetric community by Shay Gueron’s tutorial [14] at FSE 2009. In the same year, Käsper and Schwabe announced new records for bitsliced AES-CTR and AES-GCM performance [25]. At FSE 2010, Osvik et al. [35] explored fast AES implementations on AVR and GPU platforms. Finally, a study of the performance of CCM, GCM, OCB3 and CTR modes was presented by Krovetz and Rogaway [26] at FSE 2011.
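The effect of latency and throughput on a chain of AES round instructions can be illustrated with a simple cycle count. The following is a minimal sketch under an idealized model (no other instructions compete for the pipeline); the function names are hypothetical and only serve the illustration.

```python
def pipelined_cycles(n_ops: int, latency: int, inv_throughput: int) -> int:
    """Cycles until the last of n_ops data-independent instructions completes.

    Idealized model: a new instruction is issued every inv_throughput cycles
    and each one completes latency cycles after it was issued.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * inv_throughput


def sequential_cycles(n_ops: int, latency: int) -> int:
    """Cycles when every instruction has to wait for the previous result."""
    return n_ops * latency


# Westmere figures quoted in the text: latency 6, one issue every 2 cycles.
print(pipelined_cycles(10, 6, 2))   # 24
print(sequential_cycles(10, 6))     # 60
```

In this model, a dependent chain (as in CBC encryption of a single message) pays the full latency for every instruction, whereas independent instructions overlap almost completely.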
3 Comb Scheduler: An Efficient Look-Ahead Strategy
3.1 Motivation
A substantial number of block cipher modes of operation for (authenticated) encryption are inherently sequential in nature. Among the NIST-recommended modes, this includes the classic CBC, OFB, CFB and CCM modes as well as CBC derivatives such as CMAC. Also, more recent designs essentially owe their sequential nature to design goals, e.g. allowing lightweight implementations or achieving stricter notions of security, for instance not requiring a nonce for security (or allowing its reuse). Examples include ALE [9], APE [3], CLOC [21], the McOE family of algorithms [11, 12], and some variants of POET [1].
While being able to perform well in other environments, such algorithms cannot benefit from the available pipelining opportunities on contemporary general-purpose CPUs. For instance, as detailed in Sect. 6, the AES-NI encryption instructions on Intel’s recent Haswell architecture feature a high throughput of 1, but a relatively high latency of 7 cycles. Modes of operation that need to process data sequentially will invariably be penalized in such environments.
Furthermore, even if designed with parallelizability in mind, (authenticated) modes of operation for block ciphers typically achieve their best performance when operating on somewhat longer messages, often due to the simple fact that these diminish the impact of potentially costly initialization phases and tag generation. Equally importantly, only longer messages allow high-performance software implementations to make full use of the available pipelining opportunities [2, 16, 26, 32].
In practice, however, one rarely encounters messages that allow an algorithm to achieve its maximum performance. Recent studies on packet sizes on the Internet demonstrate that they basically follow a bimodal distribution [24, 34, 36]: 44 % of packets are between 40 and 100 bytes long; 37 % are between 1400 and 1500 bytes in size; the remaining 19 % are somewhere in between. Throughout the paper, we refer to this as the realistic distribution of message lengths. First, this emphasizes the importance of good performance for messages up to around 2 KB, as opposed to longer messages. Second, when looking at the weighted distribution, this implies that the vast majority of data is actually transmitted in packets of medium size between 1 and 2 KB. Considering the first mode of the distribution, we observe that many of the very small packets of Internet traffic comprise TCP ACKs (which are typically not encrypted), and that the use of authentication and encryption layers such as TLS or IPsec incurs overhead significant enough to blow up a payload of 1 byte to a 124 byte packet [20]. It is therefore this range of message sizes (128 to 2048 bytes) that authenticated modes of encryption should excel at processing, when employed for encryption of Internet traffic.
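To make the realistic distribution concrete, the following sketch samples message lengths from the three buckets quoted above and estimates which share of the transmitted bytes sits in packets of at least 1 KB. The bucket shares are taken from the text; the uniform spread inside each bucket, the exact bucket boundaries for the middle 19 %, and all names are assumptions made for illustration only.

```python
import random

# (probability, min_bytes, max_bytes). The shares come from the text; the
# uniform spread inside each bucket is an assumption of this sketch.
BUCKETS = [
    (0.44, 40, 100),
    (0.37, 1400, 1500),
    (0.19, 101, 1399),
]


def sample_length(rng: random.Random) -> int:
    """Draw one message length (in bytes) from the bimodal model."""
    u = rng.random()
    acc = 0.0
    for prob, lo, hi in BUCKETS:
        acc += prob
        if u < acc:
            return rng.randint(lo, hi)
    # Only reached through floating-point rounding of the cumulative sum.
    lo, hi = BUCKETS[-1][1:]
    return rng.randint(lo, hi)


def byte_share_of_large_packets(lengths, threshold=1024):
    """Fraction of all bytes carried by packets of at least threshold bytes."""
    total = sum(lengths)
    return sum(n for n in lengths if n >= threshold) / total


rng = random.Random(2015)
lengths = [sample_length(rng) for _ in range(20000)]
print(f"share of bytes in packets >= 1 KB: {byte_share_of_large_packets(lengths):.2f}")
```

Although small packets are the most frequent, they contribute few bytes, which is why the weighted distribution is dominated by the packets between 1 and 2 KB.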
3.2 Filling the Pipeline: Multiple Messages
It follows from the above discussion that the standard approach of considering one message at a time, while arguably optimizing message processing latency, cannot always generate optimal throughput in high-performance software implementations in most practically relevant scenarios. This is not surprising for the inherently sequential modes, but even when employing a parallelizable design, the prevailing distribution of message lengths makes it hard to achieve the best performance.
In order to remedy this, we propose to consider the scheduling of multiple messages in parallel already in the implementation of the algorithm itself, as opposed to considering it as a (single-message) black box to the message scheduler. This opens up possibilities of increasing the performance in the cases of both sequential modes and the availability of multiple shorter or medium-sized messages. In the first case, the performance penalty of sequential execution can potentially be hidden by filling the pipeline with a sufficient number of operations on independent data. In the second case, there is a potential of increasing performance by keeping the pipeline filled also for the overhead operations such as block cipher or multiplication calls during initialization or tag generation.
Note that while in this paper we consider the processing of multiple messages on a single core, the multiple message approach naturally extends to multicore settings.
Conceptually, the transition of a sequential to a multiple message implementation can be viewed as similar to the transition from a straightforward to a bitsliced implementation approach.
We note that an idealistic view of multiple-message processing was given in [9] for the dedicated authenticated encryption algorithm ALE. This consideration was rather rudimentary, did not involve real-world packet size distributions, and did not treat any modes of operation.
It is also important to note that while multiple message processing has the potential to increase the throughput of an implementation, it can also increase its latency (see also Sect. 3.4). The degree of parallelism therefore has to be chosen carefully and with the required application profile in mind.
3.3 Message Scheduling with a Comb
Consider the scenario where a number of messages of varying lengths need to be processed by a sequential encryption algorithm. As outlined before, blocks from multiple messages have to be processed in an interleaved fashion in order to make use of the available inter-message parallelism. Having messages of different lengths implies that generally the pipeline cannot always be filled completely. At the same time, the goal to schedule the message blocks such that pipeline usage is maximized has to be weighed against the computational cost of making such scheduling decisions: in particular, every conditional statement during the processing of the bulk data results in a pipeline stall.
In order to reconcile the goal of exploiting multi-message parallelism for sequential algorithms with the need for low-overhead scheduling, we propose comb scheduling.
Comb scheduling is based on the observation that ideally, messages processed in parallel have the same length, so given a desired (maximum) parallelism degree P and a list of message lengths \(\ell _1,\dots ,\ell _{k}\), we can subdivide the computation into a number of windows, in each of which we process as many consecutive message blocks as we can for as many independent messages as possible according to the restrictions based on the given message lengths.
Since our scheduling problem exhibits optimal substructure, this greedy approach yields an optimal solution. Furthermore, the scheduling decisions of how many blocks are to be processed at which parallelism level can be precomputed once the \(\ell _i\) are known. This implies that instead of making each processing step conditional, we only have conditional statements whenever we proceed from one window to the next.
The sorted messages are then processed in groups of P. Inside each group, the processing is window by window according to the precomputed parallelism levels \(\mathcal {P}\) and window lengths \(\mathcal {B}\): In window w, the same \(\mathcal {P}[w]\) messages of the current message group are processed \(\mathcal {B}[w]\) blocks further. In the next window, at least one message will be exhausted, and the parallelism level decreases by at least one.
As comb scheduling processes the blocks by common (sub)length from left to right, our method can be considered a symmetric-key variant of the well-known comb method for (multi-)exponentiation [27].
An Example. We illustrate comb scheduling in Fig. 1 with an example where \(P=k=7\): The precomputation determines that all 7 messages can be processed in a pipelined fashion for the first 5 blocks; four of the 7 messages can be processed further for the next 80 blocks; and finally three remaining messages are processed for another 9 blocks.
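The precomputation of parallelism levels \(\mathcal {P}\) and window lengths \(\mathcal {B}\) described above can be sketched as follows. This is a minimal illustration and not the optimized implementation: it assumes that lengths are given in whole blocks, that messages are sorted by decreasing length before grouping (the sort order is an assumption), and the function names are hypothetical.

```python
def comb_windows(block_lengths):
    """Return the windows [(parallelism, blocks), ...] for one message group.

    block_lengths holds the lengths, in blocks, of at most P messages.
    Each window processes `blocks` further blocks of the `parallelism`
    messages that are not yet exhausted.
    """
    remaining = sorted(block_lengths, reverse=True)  # shortest message last
    windows = []
    done = 0  # blocks already processed for every remaining message
    while remaining:
        shortest = remaining[-1]
        if shortest > done:
            windows.append((len(remaining), shortest - done))
            done = shortest
        # Drop all messages that are exhausted at this point.
        while remaining and remaining[-1] <= done:
            remaining.pop()
    return windows


def comb_schedule(block_lengths, P):
    """Sort all messages (assumed: by decreasing length) and schedule groups of P."""
    ordered = sorted(block_lengths, reverse=True)
    return [comb_windows(ordered[i:i + P]) for i in range(0, len(ordered), P)]


# Lengths matching the example above: three messages of 5 blocks, one of 85
# blocks and three of 94 blocks, with P = k = 7.
print(comb_windows([5, 5, 5, 85, 94, 94, 94]))  # [(7, 5), (4, 80), (3, 9)]
```

All branching on message lengths happens in this precomputation; the bulk processing then only needs one decision per window rather than one per block.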
3.4 Latency Vs Throughput
A point worth discussing is the latency increase one has to pay when using multiple message processing. Since the speedup is limited by the parallelization level, one can at most hope for the same latency as in the sequential processing case.
Table 2. Performance of CBC encryption (cpb) and relative speedup for comb scheduling with different parallelization levels for fixed lengths of 2048 bytes (top) and realistic message lengths (bottom).

| Parallelization level P | Sequential | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 2 K messages | 4.38 | 2.19 | 1.47 | 1.11 | 0.91 | 0.76 | 0.66 | 0.65 |
| Relative speedup | \(\times 1.00\) | \(\times 2.00\) | \(\times 2.98\) | \(\times 3.95\) | \(\times 4.81\) | \(\times 5.76\) | \(\times 6.64\) | \(\times 6.74\) |
| Realistic distribution | 4.38 | 2.42 | 1.73 | 1.37 | 1.08 | 0.98 | 0.87 | 0.85 |
| Relative speedup | \(\times 1.00\) | \(\times 1.81\) | \(\times 2.53\) | \(\times 3.20\) | \(\times 4.06\) | \(\times 4.47\) | \(\times 5.03\) | \(\times 5.15\) |
Table 2 shows that for identical message lengths, the ideal linear speedup is actually achieved for 2 to 4 parallel messages: Setting \(M=2048\), instead of waiting \(4.38 \cdot M\) cycles in the sequential case, one has a latency of either \(2.19 \cdot 2 \cdot M = 4.38 \cdot M\), \(1.47\cdot 3 \cdot M = 4.41 \cdot M\) or \(1.11\cdot 4 \cdot M = 4.44 \cdot M\) cycles, respectively. Starting from 5 messages, the latency slightly increases with the throughput, but remains at a manageable level even for 7 messages, where it is only around 5 % higher than in the sequential case, while achieving a 6.64 times speedup in throughput. For realistic message lengths, using 7 parallel messages, we see an average increase in latency of 39 %, which has to be contrasted with (and, depending on the application, weighed against) the significant 5.03 times speedup in throughput.
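The latency figures follow directly from Table 2: when P messages of M bytes share the pipeline, each of them completes after roughly \(P \cdot \mathrm{cpb}(P) \cdot M\) cycles. The short sketch below recomputes the relative latency from the table entries; the dictionary and function names are hypothetical.

```python
# cpb values from Table 2, indexed by parallelization level P (1 = sequential).
CPB_2K = {1: 4.38, 2: 2.19, 3: 1.47, 4: 1.11, 5: 0.91, 6: 0.76, 7: 0.66, 8: 0.65}
CPB_REALISTIC = {1: 4.38, 2: 2.42, 3: 1.73, 4: 1.37, 5: 1.08, 6: 0.98, 7: 0.87, 8: 0.85}


def latency_factor(cpb_table, P):
    """Latency with P interleaved messages relative to sequential processing.

    P messages of M bytes take P * cpb(P) * M cycles in total and finish
    (almost) together, so M cancels out in the ratio.
    """
    return P * cpb_table[P] / cpb_table[1]


for P in (2, 3, 4, 7):
    print(P, round(latency_factor(CPB_2K, P), 2), round(latency_factor(CPB_REALISTIC, P), 2))
```

For P = 7 this gives a factor of about 1.05 for 2 K messages and about 1.39 for the realistic distribution, matching the 5 % and 39 % increases quoted above.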
4 Pipelined NIST Encryption Modes
In this section, we present the results of our performance study of the NIST-recommended encryption modes when instantiated with AES as the block cipher and implemented with AES-NI and AVX vector instructions. We remark that we only measure encryption. Some modes covered, such as CBC and CFB, are sequential in encryption but parallel in decryption.
Experimental Setting. All measurements were taken on a single core of an Intel Core i5-4300U CPU (Haswell) at 1900 MHz. For each combination of parameters, the performance was determined as the median of 91 averaged timings of 200 measurements each. This method has also been used by Krovetz and Rogaway in their benchmarking of authenticated encryption modes in [26]. The measurements are taken over samples from the realistic distribution on message lengths.
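The measurement procedure (median of 91 averages over 200 runs each) can be sketched as below. This is only an illustration of the statistic: it uses a wall-clock timer in nanoseconds as a stand-in for a cycle counter, which is an assumption, and the function name and parameters are hypothetical.

```python
import statistics
import time


def median_of_averages(func, outer=91, inner=200, clock=time.perf_counter_ns):
    """Median over `outer` timings, each the average of `inner` calls of func.

    The default clock measures nanoseconds, not CPU cycles; a cycle-accurate
    measurement would need a cycle counter instead (not shown here).
    """
    averages = []
    for _ in range(outer):
        start = clock()
        for _ in range(inner):
            func()
        averages.append((clock() - start) / inner)
    return statistics.median(averages)


# Usage with a placeholder workload standing in for one encryption call.
print(median_of_averages(lambda: sum(range(100))) > 0)
```

Averaging over the inner runs amortizes the timer overhead, while taking the median over the outer runs discards outliers caused by interrupts or frequency changes.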
Table 3. Performance comparison (in cpb) of NIST encryption modes with trivial sequential processing and comb scheduling. Message lengths are sampled from the realistic Internet traffic distribution.

| Mode | Sequential processing | Comb scheduling | Speedup |
|---|---|---|---|
| AES-ECB | 0.65 | – | – |
| AES-CTR | 0.78 | – | – |
| AES-CBC | 4.47 | 0.87 | \(\times 5.14\) |
| AES-OFB | 4.48 | 0.88 | \(\times 5.09\) |
| AES-CFB | 4.45 | 0.89 | \(\times 5.00\) |
| CMAC-AES | 4.29 | 0.84 | \(\times 5.10\) |
Discussion. Our performance results for pipelined implementations of NIST encryption modes are presented in Table 3. It is apparent that the parallel processing of multiple messages using comb scheduling speeds up encryption performance by a factor of around 5, bringing the sequential modes within about 10 % of CTR mode performance. The results also indicate that the overhead induced by the comb scheduling algorithm itself can be considered negligible compared to the AES calls.
Due to their simple structure with almost no overhead, it comes as no surprise that CBC, OFB and CFB performance are virtually identical. That CMAC performs slightly better despite additional initialization overhead can be explained by the fact that there are no ciphertext blocks to be stored to memory.
5 Pipelined Authenticated Encryption
We now turn our attention to the AES-NI software performance of authenticated encryption modes. We consider the well-established modes CCM, GCM and OCB3 as well as a number of more recent proposals, many of them being contestants in the ongoing CAESAR competition.
Experimental Setting. The same experimental setup as for the encryption modes applies. For our performance measurements, we are interested in the performance of the various AE modes of operation during their bulk processing of message blocks, i.e. during the encryption phase. To that end, we do not measure the time spent on processing associated data. As some schemes can have a significant overhead when computing authentication tags (finalization) for short messages, we do include this phase in the measurements as well.
5.1 Performance in the Real World
Out of the AE modes under consideration, GCM, OCB3, OTR, COBRA, COPA and Julius are parallelizable designs. We therefore only measure their performance with sequential message processing. On the other hand, CCM, CLOC, SILC, JAMBU, McOE-G and POET are sequential designs and as such will also be measured in combination with comb scheduling. In all cases, we again measure the performance using the message lengths sampled from the realistic bimodal distribution of typical Internet traffic.
Table 4. Performance comparison (in cpb) of AES-based AE modes with trivial sequential processing and comb scheduling. Message lengths are sampled from the realistic Internet traffic distribution. CAESAR candidates are marked using a \({}^{\star }\) after their name.

Discussion. The performance data demonstrates that comb scheduling of multiple messages consistently provides a speedup of factors between 3 and 4 compared to normal sequential processing. For typical Internet packet sizes, comb scheduling enables sequential AE modes to run with performance comparable to the parallelizable designs, in some cases even outperforming them. This can be attributed to the fact that AE modes typically have heavier initialization and finalization than normal encryption modes, both implying a penalty for short message performance. By using comb scheduling, however, the initial and final AES calls can also be (at least partially) parallelized between different messages. The relative speedup for this will typically reduce with the message length. The surprisingly good performance of McOE-G is due to the fact that it basically benefits doubly from multiple message processing, since not only the AES calls, but also its sequential finite field multiplications can now be pipelined. For the comb scheduling implementation of CCM, which is two-pass, it is worth noting that all scheduling precomputations only need to be done once, since exactly the same processing windows can be used for both passes.
Best Performance Characteristics. From Table 4, it is apparent that for encryption of typical Internet packets, the difference, with respect to performance, between sequential and parallelizable modes somewhat blurs when comb scheduling is employed. This is especially true for the nonce-based setting, where CLOC, SILC, CCM, GCM and OCB3 all perform on a very comparable level. For the nonce-misuse resistant modes, our results surprisingly even show better performance of the two sequential modes for this application scenario. This can be attributed to the fact that the additional processing needed for achieving nonce-misuse resistance hampers performance on short messages, which can be mitigated to some extent by comb scheduling.
5.2 Traditional Approach: Sequential Messages of Fixed Lengths
While the previous section analyzed the performance of the various AE modes using a model of realistic message lengths, this section provides more detail on the exact performance exhibited by these modes for a range of (fixed) message lengths. To this end, we provide performance measurements for specific message lengths between 128 and 2048 bytes. The results are summarized in Table 5.
Table 5. Performance comparison (in cpb) of AE modes for processing a single message of various, fixed message lengths.

Discussion. The performance data clearly shows the expected difference between sequential and parallelizable modes when no use of multiple parallel messages can be made. Only initialization-heavy sequential modes like McOE-G and POET show significant performance differences between shorter and longer messages, while this effect is usually very pronounced for the parallelizable modes such as OCB3 and COPA. It can be seen from Table 5 that, in the nonce-based setting, the best performance is generally offered by OCB3, although OTR and GCM (on Haswell) provide quite similar performance. Among the nonce-misuse resistant modes, COPA performs best for all message sizes.
5.3 Exploring the Limits: Upper Bounding the Comb Scheduler Advantage
Having seen the performance data with comb scheduling for realistic message lengths, it is natural to ask what the performance of the various modes would be in the ideal scenario where the scheduler is given only messages of a fixed length. In this case, the comb precomputation would result in only one processing window, so essentially no scheduler-induced branches are needed during the processing of the messages. In a sense, this constitutes an upper bound for the multi-message performance with comb scheduling for the various encryption algorithms.
Performance comparison (in cpb) of sequential AE modes when comb scheduling is used for various fixed message lengths.

Discussion. It can be seen that for all modes considered, the performance for longer messages at least slightly improves compared to the realistic message length mix of Table 4, though the differences are quite small and do not exceed around 0.2 cpb. For smaller lengths, the difference can be more pronounced for a mode with heavy initialization such as POET. Overall, this shows that comb scheduling for a realistic distribution provides a performance which is very comparable to that of comb scheduling of messages with an idealized distribution.
Impact of Working Set Sizes. It can be seen from the plots that, as expected, most modes achieve their best speedup in the multiple messages scenario for a parallelization level of around 7 messages. It is worth noting, however, that for each of these messages, a complete working set (internal state of the algorithm) has to be maintained. Since only sixteen 128-bit XMM registers are available, even a working set of three 128-bit words (for instance cipher state, tweak mask, checksum) for 7 simultaneously processed messages, i.e. 21 words, will already exceed the number of available registers. As the parallelization degree P increases, this becomes more and more of a factor. This can be especially seen for POET, which has a larger internal state per instance. By contrast, CCM, JAMBU and McOE-G suffer a lot less from this effect.
The experimental results also confirm the intuition of Sect. 6.1 that Haswell’s improved memory interface can handle fairly large working set sizes efficiently by hiding the stack access latency between the cryptographic operations. This allows more parallel messages to be processed efficiently despite the increased register pressure, basically until the number of moves exceeds the latency of the other operations, or ultimately the limits of the Level-1 cache are reached.
6 Haswell Tricks: Towards Faster Code
In this section, we describe some of the optimization techniques and architecture features that were used for our implementations on Haswell.
6.1 General Considerations: AVX and AVX2 Instructions
In our Haswelloptimized AE scheme implementations we make heavy use of Intel Advanced Vector Extensions (AVX) which has been present in Intel processors since Sandy Bridge. AVX can be considered as an extension of the SSE+^{3} streaming SIMD instructions operating on 128bit xmm0 through xmm15 registers.
While AVX and AVX2, the latter which appears first on Intel’s Haswell processor, brings mainly support for 256bit wide registers to the table, this is not immediately useful in implementing an AESbased AE scheme, as the AESNI instructions as well as the pclmulqdq instruction support only the use of 128bit xmm registers. However, a feature of AVX that we use extensively is the threeoperand enhancement, due to the VEX coding scheme, of legacy twooperand SSE2 instructions. This means that, in a single instruction, one can nondestructively perform vector bit operations on two operands and store the result in a third operand, rather than overwriting one of the inputs, e.g. one can do \(c = a \oplus b\) rather than \(a = a \oplus b\). This eliminates overhead associated with mov operations required when overwriting an operand is not acceptable. With AVX, threeoperand versions of the AESNI and pclmulqdq instructions are also available.
A further Haswell feature worth taking into account is the increased throughput of logical instructions such as vpxor/vpand/vpor on AVX registers: while the latency remains at one cycle, up to 3 such instructions can now be scheduled simultaneously. Notable exceptions are algorithms that rely heavily on mixed 64/128-bit logical operations, such as JAMBU. For these, the inclusion of a fourth 64-bit ALU implies that they actually benefit from frequent conversion to 64-bit arithmetic via vpextrq/vpinsrq, rather than from artificially extending 64-bit operands to 128 bits for operation on the AVX registers.
On Haswell, the improved memory controller allows two simultaneous 16-byte aligned moves (vmovdqa) from registers to memory, with a latency of one cycle. This implies that the comparatively large latency of cryptographic instructions such as vaesenc or pclmulqdq allows the implementer to “hide” more memory accesses to the stack when a larger internal state of the algorithm leads to a register shortage. This also benefits the generally larger working sets induced by the multiple message strategy described in Sect. 3.
6.2 Improved AES Instructions
Experimental latency (L) and inverse throughput (\(T^{-1}\)), in cycles, of AES-NI and pclmulqdq instructions on Intel’s Haswell microarchitecture

| Instruction | L | \(T^{-1}\) |
|---|---|---|
| aesenc | 7 | 1 |
| aesdec | 7 | 1 |
| aesenclast | 7 | 1 |
| aesdeclast | 7 | 1 |
| aesimc | 14 | 2 |
| aeskeygenassist | 10 | 8 |
| pclmulqdq | 7 | 2 |
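To relate the latency and inverse throughput figures in the table above to the benefit of processing several messages at once, the following sketch uses a deliberately simplified steady-state model. It assumes that each message forms one chain of dependent instructions and that nothing else competes for the execution unit; the function names are hypothetical and the model is not taken from the measurements in this paper.

```python
from math import ceil


def cycles_per_instruction(latency: int, inv_throughput: int, streams: int) -> float:
    """Average cycles per instruction with `streams` independent chains.

    One instruction per stream is issued every `inv_throughput` cycles, and
    a stream can only continue once its previous result is available.
    """
    return max(latency, streams * inv_throughput) / streams


def streams_to_saturate(latency: int, inv_throughput: int) -> int:
    """Smallest number of independent streams that keeps the pipeline full."""
    return ceil(latency / inv_throughput)


def aes128_cpb(streams: int, latency: int = 7, inv_throughput: int = 1) -> float:
    """Rough cycles per byte for 10 AES rounds on a 16-byte block.

    Ignores key whitening, loads/stores and any mode-specific overhead.
    """
    return 10 * cycles_per_instruction(latency, inv_throughput, streams) / 16


print(streams_to_saturate(7, 1))  # aesenc: 7 independent messages
print(aes128_cpb(1))              # one sequential message: 4.375
print(aes128_cpb(7))              # seven interleaved messages: 0.625
```

Under these assumptions the pipeline for aesenc is saturated at 7 independent messages, which matches the parallelization level at which most modes reach their best speedup; measured figures are higher than this idealized bound because of the overheads the model leaves out.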
6.3 Improvements for Multiplication in \(GF(2^{128})\)
The pclmulqdq instruction was introduced by Intel along with the AES-NI instructions [17], but is not part of AES-NI itself. The instruction takes two 128-bit inputs and a byte input imm8, and performs a carry-less multiplication of one 64-bit half of each operand. The halves to be multiplied are selected by bits 4 and 0 of imm8.
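The operand selection can be illustrated with a small integer model. This is a sketch for exposition only: 128-bit registers are represented as Python integers, and the helper names `clmul64` and `pclmulqdq` are chosen here for readability rather than taken from any library.

```python
MASK64 = (1 << 64) - 1


def clmul64(a: int, b: int) -> int:
    """Carry-less product of two 64-bit values (result has up to 127 bits)."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i  # XOR instead of addition: no carries
    return result


def pclmulqdq(x: int, y: int, imm8: int) -> int:
    """Model of pclmulqdq on two 128-bit integers.

    Bit 0 of imm8 selects the half of x, bit 4 the half of y
    (0 = low 64 bits, 1 = high 64 bits).
    """
    a = (x >> 64) & MASK64 if imm8 & 0x01 else x & MASK64
    b = (y >> 64) & MASK64 if imm8 & 0x10 else y & MASK64
    return clmul64(a, b)


x = (7 << 64) | 3   # high half 7, low half 3
y = (5 << 64) | 2   # high half 5, low half 2
print(pclmulqdq(x, y, 0x00))  # low(x)  * low(y)  = 3 * 2 -> 6
print(pclmulqdq(x, y, 0x11))  # high(x) * high(y) = 7 * 5 -> 27
```

Note that 7 times 5 yields 27 rather than 35, since the partial products are combined with XOR.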
Most practically used AE modes using multiplication in a finite field use block lengths of 128 bits. As a consequence, multiplications are in the field \(GF(2^{128})\). As the particular choice of finite field does not influence the security proofs, modes use the tried-and-true GCM finite field. For our performance study, we have used two different implementation approaches for finite field multiplication (gfmul). The first implementation, which we refer to as the classical method, was introduced in Intel’s white paper [17]. It applies pclmulqdq three times in a carry-less Karatsuba multiplication followed by modular reduction. The second implementation variant, which we refer to as the Haswell-optimized method, was proposed by Gueron [16] with the goal of leveraging the much improved pclmulqdq performance on Haswell to trade many shifts and XORs for one more multiplication. This is motivated by the improvements in both latency (7 vs. 14 cycles) and inverse throughput (2 vs. 8 cycles) on Haswell [18].
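The structure of the classical method, three 64-bit carry-less multiplications combined by Karatsuba and followed by a reduction, can be sketched as follows. This is an integer model for illustration, not the intrinsics code of [17]: it assumes the polynomial \(x^{128} + x^7 + x^2 + x + 1\) with bit i holding the coefficient of \(x^i\) (GCM itself uses a bit-reflected convention), and the bit-by-bit reduction stands in for the shift-and-XOR sequence used in practice.

```python
MASK64 = (1 << 64) - 1
POLY = (1 << 128) | 0x87  # x^128 + x^7 + x^2 + x + 1 (assumed bit order)


def clmul64(a: int, b: int) -> int:
    """Carry-less product of two 64-bit values, i.e. one pclmulqdq."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result


def clmul128_karatsuba(x: int, y: int) -> int:
    """256-bit carry-less product of two 128-bit values, three multiplications."""
    x_hi, x_lo = x >> 64, x & MASK64
    y_hi, y_lo = y >> 64, y & MASK64
    hi = clmul64(x_hi, y_hi)
    lo = clmul64(x_lo, y_lo)
    # (x_hi + x_lo)(y_hi + y_lo) - hi - lo gives the two cross terms.
    mid = clmul64(x_hi ^ x_lo, y_hi ^ y_lo) ^ hi ^ lo
    return (hi << 128) ^ (mid << 64) ^ lo


def reduce256(p: int) -> int:
    """Reduce a polynomial of degree below 256 modulo POLY."""
    for i in range(p.bit_length() - 1, 127, -1):
        if (p >> i) & 1:
            p ^= POLY << (i - 128)
    return p


def gfmul(x: int, y: int) -> int:
    """Multiplication in GF(2^128) under the assumptions stated above."""
    return reduce256(clmul128_karatsuba(x, y))


print(hex(gfmul(1 << 127, 2)))  # x^127 * x = x^128 = x^7 + x^2 + x + 1 -> 0x87
```

The Haswell-optimized variant replaces part of this work by one additional pclmulqdq; it is not modelled here.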
In modes where the output of a multiplication over \(GF(2^{128})\) is not directly used, other than as a part of a chain combined using addition, the aggregated reduction method by Jankowski and Laurent [23] can be used to gain speedups. This method uses the inductive definitions of chaining values combined with the distributivity law for the finite field to postpone modular reduction at the cost of storing powers of an operand. Among the modes we benchmark in this work, the aggregated reduction method is applicable only to GCM and Julius. We therefore use this approach for those two modes, but apply the general gfmul implementations to the other modes.
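The idea of postponing the reduction can be illustrated with an integer model. The sketch assumes a GHASH-style chain \(Y_i = (Y_{i-1} \oplus X_i) \cdot H\) and the same plain (non-reflected) bit order for \(x^{128} + x^7 + x^2 + x + 1\); the function names are hypothetical, and for simplicity all blocks are aggregated at once, whereas an implementation would typically aggregate over small fixed-size groups.

```python
POLY = (1 << 128) | 0x87  # x^128 + x^7 + x^2 + x + 1 (assumed bit order)


def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce(p: int) -> int:
    """Reduce a polynomial modulo POLY."""
    for i in range(p.bit_length() - 1, 127, -1):
        if (p >> i) & 1:
            p ^= POLY << (i - 128)
    return p


def gfmul(a: int, b: int) -> int:
    return reduce(clmul(a, b))


def chain_sequential(blocks, h):
    """Y_i = (Y_{i-1} xor X_i) * H, with one reduction per block."""
    y = 0
    for x in blocks:
        y = gfmul(y ^ x, h)
    return y


def chain_aggregated(blocks, h):
    """Same result, but with a single reduction at the very end."""
    n = len(blocks)
    # Stored powers H^1 .. H^n: the memory cost of the method.
    powers = [h]
    for _ in range(n - 1):
        powers.append(gfmul(powers[-1], h))
    acc = 0
    for i, x in enumerate(blocks):
        acc ^= clmul(x, powers[n - 1 - i])  # unreduced X_i * H^(n-i)
    return reduce(acc)
```

Both functions return the same value because reduction is linear with respect to XOR, so the unreduced products can be accumulated first and reduced once.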
6.4 Classical Vs. Haswell \(GF(2^{128})\) Multiplication
Given the speedup of pclmulqdq on Haswell, it may seem somewhat counterintuitive at first that the Haswell-optimized multiplication does not automatically outperform the classical method in sequential implementations. We observe, however, that McOE-G and COBRA basically make sequential use of multiplications, which precludes utilizing the pipeline for sequential implementations. In this case, the still substantial latency of pclmulqdq is enough to offset the gains from replacing several other instructions in the reduction. This is different in the multiple message case, where the availability of independent data allows our implementations to make more efficient use of the pipeline, leading to superior results over the classical multiplication method.
6.5 Haswell-Optimized Doubling in \(GF(2^{128})\)
The doubling operation in \(GF(2^{128})\) is commonly used in AE schemes [6], and indeed among the schemes we benchmark, it is used by OCB3, OTR, COBRA, COPA and POET. Doubling in this field consists of left shifting the input by one bit and conditionally XORing a reduction polynomial if the MSB of the input equals one. Neither SSE+ nor AVX provides an instruction to shift a whole xmm register bitwise or to directly test just its MSB. Thus, these functions have to be emulated with other operations, opening up a number of implementation choices.
We emulate a left shift by one bit by the following procedure, which is optimal with regard to the number of instructions and cycles: Given an input v, the value \(2v \in GF(2^{128})\) is computed as in the listing. Consider \(v = (v_L \Vert v_R)\) where \(v_L\) and \(v_R\) are 64-bit values. In line 3 we set \(v_1 = (v_L \ll 1 \Vert v_R \ll 1)\), and lines 4 and 5 set first \(v_2 = (v_R \Vert 0)\) and then \(v_2 = ((v_R \gg 63) \Vert 0)\). As such, we have \(v \ll 1 = v_1 \mid v_2\). This leaves us with a number of possibilities when implementing the branching of line 6, which can be categorized as (i) extracting parts from v and testing, (ii) AVX variants of the test instruction, (iii) extracting a mask with the MSB of each part of v and (iv) comparing against \(10\cdots 0_2\) (called MSB_MASK in the listing, where RP denotes the reduction constant) and then extracting from the comparison result. Some of these approaches again leave several possibilities regarding the number of bits extracted, etc.
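The lane-wise shift emulation described above can be checked with a small integer model. The sketch below is an illustration and not the intrinsics-based listing: 128-bit registers are modelled as Python integers, the value RP = 0x87 (for \(x^{128} + x^7 + x^2 + x + 1\)) is an assumption, and the MSB test is written as a plain mask test rather than as any of the instruction-level variants (i) to (iv).

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
MSB_MASK = 1 << 127  # 10...0 in binary
RP = 0x87            # assumed reduction constant


def shift_left_1(v: int) -> int:
    """128-bit left shift by one bit, built from 64-bit lane operations."""
    v_l, v_r = v >> 64, v & MASK64
    # v1 = (v_L << 1 || v_R << 1): each 64-bit lane is shifted on its own,
    # so the top bit of v_R is lost at the lane boundary.
    v1 = (((v_l << 1) & MASK64) << 64) | ((v_r << 1) & MASK64)
    # v2 = (v_R || 0): move the low lane into the high lane.
    v2 = v_r << 64
    # v2 = ((v_R >> 63) || 0): keep only the bit that crosses the boundary.
    v2 = ((v2 >> 64) >> 63) << 64
    return v1 | v2


def double(v: int) -> int:
    """Doubling in GF(2^128): shift, then conditionally XOR RP."""
    shifted = shift_left_1(v)
    if v & MSB_MASK:  # MSB of the input decides whether to reduce
        shifted ^= RP
    return shifted


print(hex(double(1 << 63)))   # bit crosses the lane boundary: 0x10000000000000000
print(hex(double(1 << 127)))  # MSB set, so the result is RP: 0x87
```

The model confirms that \(v_1 \mid v_2\) equals the full 128-bit shift; which instruction sequence realizes the MSB test most efficiently is a separate question.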
Interestingly, the approach taken to check the MSB of v has a great impact on the doubling performance. This is illustrated by Table 5, where we give the performance of the doubling operation using various combinations of approaches. The numbers are obtained by averaging over \(10^8\) experiments. Surprisingly, we see a significant speedup, by a factor of about 3, when using comparison with MSB_MASK combined with extraction, over the other methods. Thus, we suggest using this approach to implement line 6.
7 Conclusions
In this paper, we have discussed the performance of various block cipher-based symmetric primitives when instantiated with the AES on Intel’s recent Haswell architecture.
As a general technique both to speed up inherently sequential modes and to deal with the typical scenario of having many shorter messages, we proposed our comb scheduler, an efficient algorithm for scheduling multiple simultaneous messages, based on a lookahead strategy within a certain window size. This leads to significant speedups for essentially all sequential modes, even when taking realistic Internet traffic distributions into account. Applied to the NIST-recommended modes CBC, CFB, OFB and CMAC, comb scheduling attains a speedup of a factor of at least 5, resulting in a performance of around 0.88 cpb, which is within about 10 % of the performance of the parallelizable CTR mode on the same message distribution.
Applying comb scheduling to authenticated encryption modes (which typically feature higher initialization and finalization overhead, thus penalizing performance on the frequently occurring short messages), our technique speeds up the inherently sequential AE modes CCM, CLOC, SILC, JAMBU, McOE-G and POET by factors between 3 and 4.5. In particular, this results in a CCM performance comparable to GCM or OCB3, without being afflicted by issues with weak-key classes or encumbered by patents.
Our study also establishes that for practitioners wishing to use a nonce-misuse resistant AE mode, the POET design with comb scheduling attains better performance than the completely parallelizable mode COPA. Since POET furthermore offers ciphertext-misuse resistance, this suggests that users do not have to choose between good performance or stricter notions of security.
References
1. Abed, F., Fluhrer, S., Forler, C., List, E., Lucks, S., McGrew, D., Wenzel, J.: Pipelineable online encryption. In: Cid, C., Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 205–223. Springer, Heidelberg (2015)
2. Akdemir, K., Dixon, M., Feghali, W., Fay, P., Gopal, V., Guilford, J., Ozturk, E., Wolrich, G., Zohar, R.: Breakthrough AES Performance with Intel AES New Instructions. Intel Corporation (2010)
3. Andreeva, E., Bilgin, B., Bogdanov, A., Luykx, A., Mennink, B., Mouha, N., Yasuda, K.: APE: authenticated permutation-based encryption for lightweight cryptography. In: Cid, C., Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 168–186. Springer, Heidelberg (2015)
4. Andreeva, E., Bogdanov, A., Luykx, A., Mennink, B., Tischhauser, E., Yasuda, K.: Parallelizable and authenticated online ciphers. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 424–443. Springer, Heidelberg (2013)
5. Andreeva, E., Luykx, A., Mennink, B., Yasuda, K.: COBRA: a parallelizable authenticated online cipher without block cipher inverse. In: Cid, C., Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 187–203. Springer, Heidelberg (2015)
6. Aoki, K., Iwata, T., Yasuda, K.: How fast can a two-pass mode go? A parallel deterministic authenticated encryption mode for AES-NI. In: DIAC 2012: Directions in Authenticated Ciphers (2012)
7. Bahack, L.: Julius: Secure Mode of Operation for Authenticated Encryption Based on ECB and Finite Field Multiplications. CAESAR competition proposal
8. Bernstein, D.J., Schwabe, P.: New AES software speed records. In: Chowdhury, D.R., Rijmen, V., Das, A. (eds.) INDOCRYPT 2008. LNCS, vol. 5365, pp. 322–336. Springer, Heidelberg (2008)
9. Bogdanov, A., Mendel, F., Regazzoni, F., Rijmen, V., Tischhauser, E.: ALE: AES-based lightweight authenticated encryption. In: Moriai, S. (ed.) FSE 2013. LNCS, vol. 8424, pp. 447–466. Springer, Heidelberg (2014)
10. Dworkin, M.J.: SP 800-38D. Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. Technical report, Gaithersburg, MD, USA (2007)
11. Fleischmann, E., Forler, C., Lucks, S.: McOE: a family of almost foolproof online authenticated encryption schemes. In: Canteaut, A. (ed.) FSE 2012. LNCS, vol. 7549, pp. 196–215. Springer, Heidelberg (2012)
12. Fleischmann, E., Forler, C., Lucks, S., Wenzel, J.: McOE: A Family of Almost Foolproof On-Line Authenticated Encryption Schemes. Cryptology ePrint Archive, Report 2011/644 (2011). http://eprint.iacr.org/
13. Fog, A.: Software Optimization Resources, February 2014. http://www.agner.org/optimize/. Accessed 17 February 2014
14. Gueron, S.: Intel’s new AES Instructions for enhanced performance and security. In: Dunkelman, O. (ed.) FSE 2009. LNCS, vol. 5665, pp. 51–66. Springer, Heidelberg (2009)
15. Gueron, S.: Intel Advanced Encryption Standard (AES) New Instructions Set. Intel Corporation (2010)
16. Gueron, S.: AES-GCM software performance on the current high end CPUs as a performance baseline for CAESAR. In: DIAC 2013: Directions in Authenticated Ciphers (2013)
17. Gueron, S., Kounavis, M.E.: Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode. Intel Corporation (2010)
18. Gulley, S., Gopal, V.: Haswell Cryptographic Performance. Intel Corporation (2013)
19. Hollingsworth, V.: New “Bulldozer” and “Piledriver” Instructions. Advanced Micro Devices Inc. (2012)
20. Iveson, S.: IPSec Bandwidth Overhead Using AES, October 2013. http://packetpushers.net/ipsecbandwidthoverheadusingaes/. Accessed 17 February 2014
21. Iwata, T., Minematsu, K., Guo, J., Morioka, S.: CLOC: authenticated encryption for short input. In: Cid, C., Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 149–167. Springer, Heidelberg (2015)
22. Iwata, T., Minematsu, K., Guo, J., Morioka, S., Kobayashi, E.: SILC: Simple Lightweight CFB. CAESAR competition proposal
23. Jankowski, K., Laurent, P.: Packed AES-GCM Algorithm Suitable for AES/PCLMULQDQ Instructions, pp. 135–138 (2011)
24. John, W., Tafvelin, S.: Analysis of internet backbone traffic and header anomalies observed. In: Internet Measurement Conference, pp. 111–116 (2007)
25. Käsper, E., Schwabe, P.: Faster and Timing-Attack Resistant AES-GCM. In: Clavier, C., Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp. 1–17. Springer, Heidelberg (2009)
26. Krovetz, T., Rogaway, P.: The software performance of authenticated-encryption modes. In: Joux, A. (ed.) FSE 2011. LNCS, vol. 6733, pp. 306–327. Springer, Heidelberg (2011)
27. Lim, C.H., Lee, P.J.: More Flexible Exponentiation with Precomputation. In: Desmedt, Y.G. (ed.) CRYPTO 1994. LNCS, vol. 839, pp. 95–107. Springer, Heidelberg (1994)
28. Matsui, M.: How far can we go on the x64 processors? In: Robshaw, M. (ed.) FSE 2006. LNCS, vol. 4047, pp. 341–358. Springer, Heidelberg (2006)
29. Matsui, M., Fukuda, S.: How to maximize software performance of symmetric primitives on Pentium III and 4 processors. In: Gilbert, H., Handschuh, H. (eds.) FSE 2005. LNCS, vol. 3557, pp. 398–412. Springer, Heidelberg (2005)
30. Matsui, M., Nakajima, J.: On the power of bitslice implementation on intel core2 processor. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 121–134. Springer, Heidelberg (2007)
31. Dworkin, M.J.: SP 800-38D. Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. Technical report, National Institute of Standards & Technology, Gaithersburg, MD, USA (2007)
32. McGrew, D.A., Viega, J.: The security and performance of the Galois/Counter Mode (GCM) of operation. In: Canteaut, A., Viswanathan, K. (eds.) INDOCRYPT 2004. LNCS, vol. 3348, pp. 343–355. Springer, Heidelberg (2004)
33. Minematsu, K.: Parallelizable rate-1 authenticated encryption from pseudorandom functions. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT 2014. LNCS, vol. 8441, pp. 275–292. Springer, Heidelberg (2014)
34. Murray, D., Koziniec, T.: The state of enterprise network traffic in 2012. In: 2012 18th Asia-Pacific Conference on Communications (APCC), pp. 179–184. IEEE (2012)
35. Osvik, D.A., Bos, J.W., Stefan, D., Canright, D.: Fast software AES encryption. In: Hong, S., Iwata, T. (eds.) FSE 2010. LNCS, vol. 6147, pp. 75–93. Springer, Heidelberg (2010)
36. Pentikousis, K., Badr, H.G.: Quantifying the deployment of TCP options - a comparative study, pp. 647–649 (2004)
37. Whiting, D., Housley, R., Ferguson, N.: Counter with CBC-MAC (CCM) (2003)
38. Wu, H., Huang, T.: JAMBU Lightweight Authenticated Encryption Mode and AES-JAMBU. CAESAR competition proposal