The Design and Evolution of OCB

We describe OCB3, the final version of OCB, a blockcipher mode for authenticated encryption (AE). We prove the construction secure, up to the birthday bound, assuming its underlying blockcipher is secure as a strong-PRP. We study the scheme’s software performance, comparing its speed, on multiple platforms, to a variety of other AE schemes. We reflect on the history and development of the mode.


Introduction
Schemes for authenticated encryption (AE) symmetrically encrypt a message in a way that ensures both its confidentiality and authenticity. OCB is a well-known algorithm for achieving this aim. It is a blockcipher mode of operation, the blockcipher usually being AES.
There are three main variants of OCB. The first, now called OCB1 (2001) [39], was motivated by Charanjit Jutla's IAPM [24]. A second version, now called OCB2 (2004) [18,38], added support for associated data (AD) [37] and redeveloped the mode using the idea of a tweakable blockcipher [30]. OCB2 was recently found to have a disastrous bug [17]. The final version of OCB, called OCB3 (2011) [26], corrected some missteps taken with OCB2 and achieved the best performance yet. It is specified in RFC 7253 [27] and was selected for the CAESAR final portfolio [7]. This paper is about OCB3: its definition, development, security, and software performance. We update the proceedings paper on OCB3 [26], freshening all of the experimental results, expanding the proof, and placing the entire enterprise in context. When we speak of OCB in this paper, we will henceforth mean OCB3.
OCB encryption protects both the confidentiality and authenticity of a plaintext M and, additionally, the authenticity of an associated data A and a nonce N. The nonce, a string of 120 or fewer bits, must be unique to each encryption call. OCB does its work using a 128-bit blockcipher E. Encryption needs at most m + a + 2 blockcipher calls, where m = ⌈|M|/128⌉ is the block length of the plaintext and a = ⌈|A|/128⌉ is the block length of the AD. If the nonce is implemented as a counter and the implementation caches a secret 16-byte value with each message encrypted, then 63/64 ≈ 98% of the time the number of blockcipher calls can be reduced to m + a + 1. If the AD is fixed during a session then, after processing it the first time, there is effectively no computational cost for subsequent authentications of it, and the addend a should be ignored. OCB requires a single key K for the underlying blockcipher, and all blockcipher calls are keyed by it. OCB is online in the sense that one need not know the length of A or M to proceed with encryption, nor the length of A or the ciphertext C to proceed with decryption. OCB is fully parallelizable: almost all of its blockcipher calls can be performed simultaneously. Computational work beyond blockcipher calls is restricted to a small number of logical operations per call.
OCB enjoys provable security: the mode is a secure AE scheme, under the standard definition [37,39], assuming that the underlying blockcipher E is secure as a strong pseudorandom permutation (PRP). As with many modes of operation, security degrades quadratically in the number of blocks processed.
The starting point for all versions of OCB was to insist on a parallelizable design based on a 128-bit blockcipher, presumably AES. We aimed to minimize overhead, understood as any work beyond a single AES-call per input block. As time has gone on, this starting point has only come to seem better. This is because a growing fraction of general-purpose CPUs include hardware support for AES, and the quality of AES acceleration has been steadily improving. On an Intel Cannon Lake processor, we see peak speeds for OCB of about 0.43 cpb (cycles per byte). On an ARM Cortex-A73 processor, we see peak speeds of about 1.14 cpb. These speeds are about 30% and 40% faster, respectively, than what we observe for GCM [10]. Securely encrypting at such speeds on a general-purpose processor might have seemed unrealistic even a decade ago.
We now move on to provide some preliminaries on blockciphers and AE schemes. We then give two equivalent descriptions of OCB: the first based on a conventional blockcipher, like AES, the second based on a tweakable blockcipher [30] constructed from an ordinary blockcipher. Next we describe some of the design choices underlying OCB, trying to sketch how the mechanism came to take its final form. After that, we describe our experimental work, comparing the software performance of OCB and five other AE schemes. Our experiments were carried out on three disparate platforms. Our performance study is limited to software; for hardware studies of CAESAR candidates, including OCB, see the work from Kris Gaj's lab at GMU [8,15]. Then comes a proof of security. We conclude with some comments about OCB and its significance, identifying some of the lessons learned along the way.


Preliminaries
Blockciphers. A blockcipher is a function E: K × {0,1}^n → {0,1}^n where the key space K is a finite nonempty set, the block size n ≥ 1 is a number, and E(K, ·) = E_K(·) is a permutation on {0,1}^n for each K ∈ K. Its inverse is D = E^{-1}, where D_K(Y) is the unique X with E_K(X) = Y. The strong-PRP advantage of an adversary A against E is Adv^{±prp}_E(A) = Pr[K ←$ K: A^{E_K(·), D_K(·)} ⇒ 1] − Pr[π ←$ Perm(n): A^{π(·), π^{-1}(·)} ⇒ 1], where Perm(n) denotes the set of all permutations on {0,1}^n. The (plain) PRP notion is recovered by removing the second oracle in each experiment.
The ideal blockcipher of block size n is the blockcipher E_n: K × {0,1}^n → {0,1}^n where each key K ∈ K names a distinct permutation. In this blockcipher, the key space K has (2^n)! keys, each corresponding to a table specifying an n-bit permutation.
Tweakable blockciphers. Liskov, Rivest, and Wagner (LRW) put forward the notion of a tweakable blockcipher [30], and they demonstrated that an OCB1-like construction could be more easily designed and proven secure from this abstraction than from a conventional blockcipher. Our work builds on this insight, viewing OCB as based either on a blockcipher or on a tweakable blockcipher (in which case we adjust the name to CB). A tweakable blockcipher (TBC) is a function E: K × T × {0,1}^n → {0,1}^n where the key space K is a finite nonempty set, or a set that otherwise has an understood distribution on it; the tweak space T is a nonempty set; and the block size n ≥ 1 is a number. We insist that E(K, T, ·) be a permutation on {0,1}^n. Its inverse D = E^{-1} is defined by letting D(K, T, Y) = X when E(K, T, X) = Y. We may write E_K^T(X) = E(K, T, X) and D_K^T(Y) = D(K, T, Y). For a granular notion of security for the TBC E: K × T × {0,1}^n → {0,1}^n, we identify each tweak as either a unidirectional tweak (only forward queries are allowed for it) or a bidirectional one (both forward and backward queries are allowed). Intuitively, this distinction is useful because, when making a TBC from a conventional blockcipher, unidirectional tweaks need only tweak-dependent "pre-whitening" but bidirectional tweaks need "post-whitening" too [30,38]. Formally, one names the subset B ⊆ T of bidirectional tweaks, understanding that the remaining tweaks U = T \ B are unidirectional. We associate to the TBC E: K × T × {0,1}^n → {0,1}^n and the subset of tweaks B ⊆ T the partitioned-tweakspace PRP measure Adv^{±prp}_{E,B}(A) = Pr[K ←$ K: A^{E_K(·,·), D_K(·,·)} ⇒ 1] − Pr[π ←$ Perm(T, n): A^{π(·,·), π^{-1}(·,·)} ⇒ 1], where K ←$ K is chosen uniformly at the beginning of the left-hand experiment, where π ←$ Perm(T, n) is chosen uniformly at the beginning of the right-hand experiment, and where adversary A may not ask any decryption (second-oracle) query (T, Y) with T ∈ T \ B. Any such (invalid) query would get an empty-string response.
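The whitening distinction just described can be made concrete with a small sketch. The Python toy below uses an 8-bit random permutation as a stand-in for a real blockcipher, and an arbitrary mask delta as a stand-in for the key- and tweak-derived offset of the XE/XEX-style constructions [30,38]; everything here is illustrative, not a real TBC.

```python
import random

def toy_blockcipher(nbits=8, seed=0):
    # A fixed random permutation standing in for a real blockcipher E_K;
    # 8 bits rather than 128 purely to keep the illustration small.
    rng = random.Random(seed)
    table = list(range(2 ** nbits))
    rng.shuffle(table)
    inverse = [0] * (2 ** nbits)
    for x, y in enumerate(table):
        inverse[y] = x
    return table, inverse

E, D = toy_blockcipher()

def tbc_enc(delta, x, bidirectional):
    # XE (pre-whitening only) suffices for unidirectional tweaks;
    # XEX (pre- and post-whitening) is used for bidirectional ones.
    y = E[x ^ delta]
    return y ^ delta if bidirectional else y

def tbc_dec(delta, y):
    # Inverse direction; only ever invoked for bidirectional tweaks.
    return D[y ^ delta] ^ delta
```

For either tweak type, a fixed delta yields a permutation of the input; only the bidirectional (XEX) variant is inverted by tbc_dec.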
By Perm(T, n) we mean the set of all functions π: T × {0,1}^n → {0,1}^n such that π(T, ·) is a permutation on {0,1}^n for each T ∈ T.
The partitioned-tweakspace PRP notion above unifies tweakable-PRP and strong-tweakable-PRP security: with E: K × T × {0,1}^n → {0,1}^n a TBC, we can resurrect the notion of a strong tweakable-PRP [30] by defining Adv^{±prp}_E(A) = Adv^{±prp}_{E,T}(A), making every tweak bidirectional. We can realize the conventional (not strong) version [30] by defining Adv^{prp}_E(A) = Adv^{±prp}_{E,∅}(A), making no tweak bidirectional. Going a step further, the PRP and strong-PRP security notions for a conventional blockcipher E: K × {0,1}^n → {0,1}^n can be recovered by regarding the tweak space as a singleton set.
The ideal TBC with tweak space T and block size n is the TBC E_{T,n}: K × T × {0,1}^n → {0,1}^n where each pair of a key K ∈ K and a tweak T ∈ T names a distinct permutation E_{T,n}(K, T, ·) on {0,1}^n.

AE schemes. Following traditions rooted in prior work [4,28,37,39], we regard an AE scheme (meaning a nonce-based AE scheme with associated data, sometimes referred to as an AEAD scheme [37]) as a triple Π = (Π.K, Π.E, Π.D) = (K, E, D). The key space K is a finite nonempty set (or a set that otherwise has an understood distribution). The encryption algorithm E takes in a key K ∈ K, a nonce N ∈ {0,1}^*, a string of associated data A ∈ {0,1}^*, and a plaintext M ∈ {0,1}^*. It deterministically returns a value E(K, N, A, M) that is either a string C ∈ C = {0,1}^*, the ciphertext, or the symbol ⊥. The former must happen if and only if (K, N, A, M) ∈ K × N × A × M for nonempty sets N, A, and M, the nonce space, AD space, and message space of Π. We assume there is a number τ, the tag length of Π, such that |E(K, N, A, M)| = |M| + τ whenever E(K, N, A, M) ≠ ⊥. The decryption algorithm D takes in a tuple (K, N, A, C) and deterministically returns a value D(K, N, A, C) that is either a string M or the distinguished symbol ⊥. The latter must happen if there is no M with E(K, N, A, M) = C.

Let Π = (K, E, D) be an AE scheme with tag length τ. Given an adversary A, we measure how well it breaks the privacy (here meaning the confidentiality) of Π by the real number Adv^{priv}_Π(A) = Pr[A^{E_K(·,·,·)} ⇒ 1] − Pr[A^{$(·,·,·)} ⇒ 1]. The first addend is the probability that A outputs 1 when given the oracle E_K(·,·,·). The experiment begins by uniformly selecting a key K ←$ K. Then, when the adversary asks a query (N, A, M), the oracle responds with E_K(N, A, M). The second term is the probability that A outputs 1 when given the oracle $(·,·,·). This oracle, on input (N, A, M), returns a uniformly random string of length |M| + τ. We demand that A never asks two queries with the same first component (the N-value) and that it never asks a query outside of N × A × M. Any such adversary query is invalid.
Next we define authenticity. With Π and A as before, let Adv^{auth}_Π(A) = Pr[A^{E_K(·,·,·)} forges]. Once again the experiment implicitly begins by selecting a key K ←$ K. We say that the adversary forges if it outputs a value (N, A, C) ∈ N × A × C such that D(K, N, A, C) ≠ ⊥ yet there was no prior query (N, A, M) that returned C. We demand that A never asks two queries with the same N-value (the first component) and never asks a query outside of N × A × M.
Informally, an AE scheme is secure if any adversary A with "reasonable" resources has "small" priv-advantage and small auth-advantage. Following the concrete-security tradition, and in view of the fact that we will only be defining OCB for use with an n = 128 bit blockcipher, we refrain from making any absolute or asymptotic definition of security.
An alternative definition of AE security merges privacy and authenticity into a unified security measure Adv^{ae}_Π(A) = Pr[A^{E_K(·,·,·), D_K(·,·,·)} ⇒ 1] − Pr[A^{$(·,·,·), ⊥(·,·,·)} ⇒ 1]. The adversary is given either a pair of "real" encryption and decryption oracles or, alternatively, a pair of "fake" ones. The former begin by choosing a random key K ←$ K. After that, the first oracle (the encryption oracle) responds to a query (N, A, M) with E(K, N, A, M), while the second oracle (the decryption oracle) responds to a query (N, A, C) with D(K, N, A, C). The $(·,·,·) oracle, on query (N, A, M), just returns |M| + τ uniformly random bits, where τ is the tag length of Π. The ⊥(·,·,·) oracle, on query (N, A, C), always returns ⊥. The adversary may not repeat the first component, N, in its encryption queries; it may not ask a decryption query (N, A, C) following an encryption query (N, A, M) that returned C; and it may only ask encryption queries in N × A × M and decryption queries in N × A × C.
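To make the oracle interfaces concrete, here is a minimal Python sketch of one run of the ae experiment, paired with a deliberately broken toy scheme. The names LeakyAE, ae_game, and adversary are ours, for illustration only; the adversary wins because the toy scheme's ciphertexts begin with the plaintext itself, which random bits essentially never do.

```python
import hashlib
import os

class LeakyAE:
    # A deliberately broken toy scheme: "encryption" leaves M in the clear
    # and appends a 16-byte hash-based tag. Not a real AE scheme.
    tau = 16  # tag length in bytes
    @staticmethod
    def keygen():
        return b"k" * 16
    @staticmethod
    def enc(key, n, a, m):
        return m + hashlib.sha256(key + n + a + m).digest()[:16]
    @staticmethod
    def dec(key, n, a, c):
        m, tag = c[:-16], c[-16:]
        return m if hashlib.sha256(key + n + a + m).digest()[:16] == tag else None

def ae_game(adversary, scheme, real):
    # One run of the ae experiment: the adversary talks to (Enc, Dec)
    # when real, and to ($, bottom) otherwise, then outputs a bit.
    key = scheme.keygen()
    used_nonces, returned = set(), set()
    def enc(n, a, m):
        assert n not in used_nonces          # nonces may not repeat
        used_nonces.add(n)
        c = scheme.enc(key, n, a, m) if real else os.urandom(len(m) + scheme.tau)
        returned.add((n, a, c))
        return c
    def dec(n, a, c):
        assert (n, a, c) not in returned     # no trivial decryption queries
        return scheme.dec(key, n, a, c) if real else None  # None plays the role of bottom
    return adversary(enc, dec)

def adversary(enc, dec):
    # Distinguishes LeakyAE from random bits with a single query.
    m = b"attack at dawn!!"
    c = enc(b"nonce-1", b"", m)
    return int(c[:len(m)] == m)
```

Running the adversary in the real world outputs 1; in the fake world it outputs 1 only if 16 random bytes happen to equal the plaintext, so its ae-advantage is essentially 1.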
The ae-security notion is essentially equivalent to the conjunction of priv-security and auth-security [41,Section 7]. In our proof of OCB's security, we will use one direction of this claim.
We say that AE schemes Π = (K, E, D) and Π′ = (K′, E′, D′) coincide if their encryption and decryption algorithms compute identical functions. We note that our priv and ae security notions demand indistinguishability from random bits (sometimes denoted IND$) rather than indistinguishability from the encryption of random (or fixed) bits. This notion goes back to OCB1 [39]. There are several reasons for it. First, this notion of privacy is stronger, yet is achieved by OCB (and most other real-world AE schemes). At the same time, it is what is most convenient to prove. Also, the encryption algorithm of an AE scheme meeting IND$-security can be used as a pseudorandom generator (PRG), which might be convenient, in some cases, for the user. Perhaps most significantly, IND$-security implies a form of anonymity [1] where one demands that the adversary cannot tell whether all ciphertexts are constructed under a single hidden key K or are, instead, constructed under one of two hidden keys K_0, K_1, the choice made by the adversary.

Specification 1
We now define OCB (meaning OCB3). The mode is parameterized by a blockcipher E: K × {0,1}^128 → {0,1}^128 and a tag length τ ∈ [0..128]; the number 2^d upper bounds the number of 128-bit blocks in any plaintext M, ciphertext C, and associated data A. We write ntz(i) for the number of trailing zeros in the binary representation of a positive integer i (e.g., ntz(1) = ntz(3) = 0 and ntz(4) = 2). We write msb(X) for the first (most significant) bit of the string X. We write A ∧ B for the bitwise-and of the equal-length strings A and B. We write A << i for the shift of A by i positions to the left (maintaining string length, leftmost bits falling off, zero-bits entering at the right). We write either A ∥ B or A B for the concatenation of strings A and B. At line 121 we write [a]_7 for a ∈ [0..127] encoded as a 7-bit string, while at line 126 we regard the variable Bottom as a number rather than a string.
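The notational helpers just introduced are simple enough to state in code. The following Python definitions, treating 128-bit strings as integers, are a direct transcription (the function names are ours):

```python
def ntz(i: int) -> int:
    # Number of trailing zeros in the binary representation of i > 0.
    return (i & -i).bit_length() - 1

def msb(x: int, nbits: int = 128) -> int:
    # First (most significant) bit of an nbits-wide string, held as an int.
    return (x >> (nbits - 1)) & 1

def shl(a: int, i: int, nbits: int = 128) -> int:
    # A << i: shift left by i; leftmost bits fall off, zeros enter at right.
    return (a << i) & ((1 << nbits) - 1)

def enc7(a: int) -> str:
    # [a]_7: a in [0..127] encoded as a 7-bit string.
    assert 0 <= a <= 127
    return format(a, "07b")
```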
For reasons of generality, OCB is defined to operate on bit strings. But for reasons of simplicity and efficiency, most implementations will assume that all strings operated on are strings of bytes, a byte being eight bits. Figure 2 illustrates encryption under OCB. The function Inc implicitly depends on the key K, with Inc_i(Δ) = Δ ⊕ L[ntz(i)], Inc_$(Δ) = Δ ⊕ L_$, Inc_*(Δ) = Δ ⊕ L_*, and the (zero-argument) Init = 0^128. Here L_* = E_K(0^128), L_$ = 2 · L_* = double(L_*), and L[i] = 2^{2+i} · L_* for all i ≥ 0, the multiplication being in GF(2^128). Here we write 2 = 0^126 10 = x to denote a particular nonzero point of the finite field. The diagrams should be read left to right and then top to bottom: first set Δ to be Init_K(N); then modify Δ with Inc_1; then compute C_1 using the blockcipher E with pre- and post-whitening by Δ; then modify Δ with Inc_2; and so on.
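A sketch of the key-dependent table setup may help. The Python below implements doubling in GF(2^128) and derives L_$ and the L[i] table from a stand-in value for L_* = E_K(0^128) (an arbitrary constant here, since no real blockcipher is invoked):

```python
MASK128 = (1 << 128) - 1

def double(x: int) -> int:
    # Multiplication by 2 (i.e., by the polynomial x) in GF(2^128),
    # represented modulo x^128 + x^7 + x^2 + x + 1;
    # 135 = 0b10000111 encodes x^7 + x^2 + x + 1.
    x <<= 1
    if x >> 128:
        x = (x & MASK128) ^ 135
    return x

def key_tables(l_star: int, count: int = 8):
    # l_star stands in for L_* = E_K(0^128), computed once per key.
    # Then L_$ = 2*L_* and L[i] = 2^{2+i}*L_*.
    l_dollar = double(l_star)
    L = [double(l_dollar)]           # L[0] = 4*L_*
    for _ in range(1, count):
        L.append(double(L[-1]))      # L[i] = 2*L[i-1]
    return l_dollar, L
```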

Specification 2
In this section, we generalize OCB by describing the construction in terms of a tweakable blockcipher (TBC). We show that this alternative description really is a generalization of OCB by explaining how to recover the definition of OCB with a particular choice for its TBC.

The TBC-Based Scheme
We can view OCB as an AE scheme built not from a blockcipher but from a tweakable blockcipher (TBC) [30], as defined and discussed in Section 2. Refer to Fig. 3, which defines the TBC-based version of OCB, which we name CB. It depends on a tweakable blockcipher E: K × T_CB × {0,1}^n → {0,1}^n with n = 128 and a tag length τ ∈ [0..128]. The tweak space for E is the rather complicated set T_CB = (N × N_1) ∪ (N × N_0 × {*}) ∪ (N × N_0 × {$}) ∪ (N × N_0 × {*$}) ∪ N_1 ∪ (N_0 × {*}), where N is the nonce space for OCB and N_1 and N_0 are the positive and nonnegative integers, respectively. For the TBC-based version of OCB, it is not actually necessary that n = 128, and the set T_CB could just as well be defined from any nonce space N. Tweaks, it can be seen, are of six mutually exclusive types. Those of the first type are in the set B = N × N_1 of bidirectional tweaks. Omitting parentheses and commas when writing tweaks, calls to the TBC will look like E_K^{N i}, E_K^{N i *}, E_K^{N i $}, E_K^{N i *$}, E_K^{i}, and E_K^{i *}. (In Fig. 4, the bottom-left picture shows the processing of a three-block AD; the bottom-right, an AD with a short final block.)
Encryption under CB is illustrated in Fig. 4. The encryption process is conceptually simpler than it is under OCB because the offsets (the Δ-values) of Fig. 2 are now gone, replaced by the tweaks that, conceptually, identify unrelated permutations E_K^T and E_K^{T′} for distinct tweaks T and T′.
The construction is specified in Fig. 5. There, multiplication is in GF(2^128) using the irreducible polynomial x^128 + x^7 + x^2 + x + 1. We do not distinguish between strings and what they represent in this finite field. The code of Fig. 5 is more understandable with the following background. The sequence of values γ_0, γ_1, γ_2, . . . employed is called a Gray code. We give a couple of well-known facts about Gray codes. First, γ: N_0 → N_0 is a permutation. In fact, it is a permutation on Z_{2^i} for each i. Thus, we also have that 0 ≤ γ_i ≤ 2i for all i. It follows from these facts and the definition of our λ-values that all of the values used as offset multipliers are distinct and nonzero. This is the only thing we will actually need of these values.
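The Gray-code facts above are easy to verify computationally. In the sketch below, gray(i) = i ⊕ ⌊i/2⌋ is the standard reflected Gray code; the last property checked, γ_i ⊕ γ_{i-1} = 2^{ntz(i)}, is what lets successive offsets be updated by xoring in a single precomputed value:

```python
def gray(i: int) -> int:
    # Standard reflected Gray code: gamma_i = i xor floor(i/2).
    return i ^ (i >> 1)

def ntz(i: int) -> int:
    # Number of trailing zeros of i > 0.
    return (i & -i).bit_length() - 1
```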
The reader can now check that OCB[E, τ] and CB[Tw[E], τ] coincide (we have defined Tw to make that so), a fact important enough to record as a lemma.

Evolution
We now explain some of the design choices made for OCB, simultaneously sketching the mode's history.
Development of OCB1. The initial reason for developing OCB1 [39] was to improve the efficiency of authenticated encryption. Prior work by Jutla [24], and also by Katz and Yung [28] and Gligor and Donescu [11], had shown that tightly integrating privacy and authenticity lets one get by with less work than one would see from generically composing separate privacy and authenticity mechanisms. Attending to "low-level" concerns, we aimed to create a blockcipher-based AE scheme that would push the efficiency bar. OCB1 handles plaintexts of any length and produces ciphertexts τ bits longer. It uses a near-minimal number of blockcipher calls and minimizes work beyond those calls. It allows nearly all blockcipher calls to be performed in parallel. It assumes a nonce, not a random IV. It is deterministic. A single key underlies all blockcipher calls. It is online, making a single pass over the plaintext or ciphertext, whose length need not be known in advance.

Development of OCB2.
A series of devastating attacks on 802.11 (WiFi) security created an urgent need to replace its confidentiality mechanism. Jesse Walker, Nancy Cam-Winget, and others advocated replacing it with OCB1. But they explained to Rogaway that there would need to be some way to authenticate but not encrypt header information. Formalizing and achieving this end led to the idea of associated data and to the paper by Rogaway that introduced it [37].
OCB1's speed questioned. In 2004 McGrew and Viega wrote a paper claiming that GCM [10] was about as fast as, if not faster than, OCB1 [31]. They suggested that two design issues underlie this. First, OCB1 uses m + 2 blockcipher calls to encrypt a message of m = ⌈|M|/128⌉ blocks. In contrast, GCM makes do with m + 1. Second, OCB1 twice needs the result of one AES computation before another AES computation can proceed. Both in hardware and in software, this can degrade performance. Still, we suspected that these two matters would have only a minor impact on performance. The timing studies were misleading, we believed, because McGrew and Viega had compared a reference implementation of OCB with an optimized implementation of GCM. Even that comparison could not be reproduced, due to McGrew and Viega's use of proprietary GCM code.
We decided to revise OCB1/OCB2 to address and assess the two efficiency concerns above. We integrated scheme development and performance measurements. The result of our work was the definition of OCB3 and a comparative study of the performance of CCM [9], GCM, and the OCB triplet [26]. We found that, across a variety of platforms, all versions of OCB were considerably faster than both CCM and GCM. The performance differences among OCB1, OCB2, and OCB3 were small. OCB3 had the best performance, but OCB2's performance was actually worse than OCB1's. We now look at some of the OCB3 changes over OCB1/OCB2.
Reduced latency. Suppose that the message M = M_1 · · · M_m M_* being encrypted is not a multiple of 128 bits: it has a final block M_* of 1-127 bits. With OCB1 and OCB2 one makes an M_m-dependent blockcipher call to compute a ciphertext block C_m that is in turn used to create a Checksum that is enciphered with another blockcipher call. This can result in a pipeline stall [31]: one blockcipher call must finish before the next one can begin. For OCB3 we restructure the algorithm so that the Checksum does not depend on any ciphertext block: we simplify the Checksum to M_1 ⊕ M_2 ⊕ · · · ⊕ M_m when there is a full final block M_m, and to M_1 ⊕ M_2 ⊕ · · · ⊕ M_m ⊕ M_* 10^* otherwise.
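The simplified Checksum is straightforward to state in code. A Python sketch, assuming byte-aligned messages and 16-byte blocks (so the 10* padding appends the byte 0x80 and then zeros):

```python
def checksum(message: bytes) -> int:
    # OCB3's simplified Checksum: xor of the full 16-byte blocks; a short
    # final block is padded with a 1 bit then zeros (10*) before xoring.
    acc = 0
    full, rem = divmod(len(message), 16)
    for i in range(full):
        acc ^= int.from_bytes(message[16 * i:16 * i + 16], "big")
    if rem:
        last = message[16 * full:] + b"\x80" + b"\x00" * (15 - rem)
        acc ^= int.from_bytes(last, "big")
    return acc
```

Because no term depends on a ciphertext block, every xor can be computed as soon as the corresponding plaintext block is available, avoiding the pipeline stall described above.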
Incrementing offsets. In OCB1, each noninitial offset is computed from the prior one by xoring in a key-derived value: the ith offset is Δ ← Δ ⊕ L[ntz(i)]. In OCB2, each noninitial offset is computed from the prior one by multiplying it, in GF(2^128), by a constant: Δ ← (Δ << 1) ⊕ (msb(Δ) · 135), an operation we think of as doubling. The first approach turns out to be faster [39]. While doubling can be coded in five Intel x86-64 assembly instructions, it still runs more slowly. In some settings, doubling loses big: it is expensive on 32-bit machines, and some compilers do poorly at turning C/C++ code for doubling into machine code that exploits the available instructions. On Intel x86, the 128-bit SSE registers cannot be efficiently shifted one position to the left. Finally, the doubling operation is not endian neutral: if we must create a bit pattern in memory to match the sequence generated by doubling, we will effectively favor big-endian architectures. We can trade this bias for a little-endian one by redefining double() to include a byte swap. But one is still favoring one endian convention over the other, and not just at key-setup time. For all of these reasons, OCB3 reverts to the OCB1 method of incrementing offsets.
In the development of OCB3, we tried a variety of other methods to generate the needed offsets, exploring word-oriented LFSRs (linear feedback shift registers) of various designs [39,Appendix B]. No alternative performed significantly better, even in isolation, than the method used in OCB1.
Trimming a blockcipher call. OCB1 and OCB2 took m + 2 blockcipher calls to encrypt an m-block string M: one to map the nonce N into an initial offset Δ; one for each block of M; and one to encipher the final Checksum. The first of these is easy to eliminate if one is willing to replace a blockcipher call by, say, K_1 · N, the product in GF(2^128) of the nonce N and a variant K_1 of K. But such a change would necessitate implementing a GF(2^128) multiply. We use a different method to trim the extra blockcipher call, most of the time, for a counter-based nonce: we employ a new xor-universal hash function, which we call the stretch-then-shift hash. The initial offset is Δ = Stretch[1+Bottom .. 128+Bottom], where Bottom is the last six bits of N and the (128+64)-bit string Stretch is made by a process involving enciphering N with its last six bits zeroed out. (This description ignores the inclusion of the tag length τ along with N.) The technique ensures that when the nonce N is a counter, the initial offset can be computed without a new blockcipher call about 63/64 ≈ 98% of the time. In this way we reduce the cost from m + 2 blockcipher calls to an amortized m + 1.02 blockcipher calls. Note, however, that achieving this reduction requires an implementation to cache the value of Ktop (line 124) and to notice, when true, that it need not be recomputed. And there are security considerations in doing this, for it is essential that the adversary not have access to the cached value.
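A sketch of the stretch-then-shift computation, following the nonce layout of RFC 7253, may clarify. Since Python's standard library has no AES, a hash-based stand-in plays the role of E_K below (it is not even a permutation; the point is only the nonce formatting, the Stretch construction, and the 128-bit window shifted by Bottom):

```python
import hashlib

def toy_encipher(key: bytes, block: bytes) -> bytes:
    # Stand-in for the blockcipher call E_K; NOT AES and NOT a
    # permutation, used only to illustrate the offset derivation.
    return hashlib.sha256(key + block).digest()[:16]

def initial_offset(key: bytes, nonce: bytes, taglen: int = 128) -> bytes:
    # Stretch-then-shift, RFC 7253 layout: format the 128-bit Nonce as
    # [taglen mod 128]_7 || 0^{120-|N|} || 1 || N, zero its last six bits
    # before enciphering, then slide a 128-bit window right by Bottom.
    assert 1 <= len(nonce) <= 15
    n = ((taglen % 128) << 121) | (1 << (8 * len(nonce))) | int.from_bytes(nonce, "big")
    bottom = n & 0x3F                      # last six bits of the Nonce
    top = (n & ~0x3F).to_bytes(16, "big")  # Nonce with last six bits zeroed
    ktop = int.from_bytes(toy_encipher(key, top), "big")  # cacheable value
    mask64 = (1 << 64) - 1
    # Stretch = Ktop || (Ktop[1..64] xor Ktop[9..72]), a 192-bit string.
    stretch = (ktop << 64) | ((ktop >> 64) ^ ((ktop >> 56) & mask64))
    offset = (stretch >> (64 - bottom)) & ((1 << 128) - 1)
    return offset.to_bytes(16, "big")
```

When the nonce is a counter, 64 consecutive nonces share the same top part, so Ktop can be cached and only the cheap window shift is recomputed.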
Incorporating tag length. The algorithm of this paper coincides with that of RFC 7253 [27] but differs from the algorithm in the proceedings paper [26] in one minor way: the tag length is used in computing the initial offset (line 121). A change in this direction was requested during the RFC review process in order that a ciphertext created for one value of the tag length τ should not be of value in forging a ciphertext for a different value of τ . The change is heuristic, as security claims associated with OCB assume τ to be fixed. Subsequent work by Reyhanitabar, Vaudenay, and Vizár demonstrated that the manner in which we incorporate τ does not work to improve the security notion to one where the same key is used for multiple tag lengths [36].
Representing field points. A low-level choice where OCB and GCM [10] part ways is in the representation of field elements. In GCM the polynomial a_127 x^127 + · · · + a_1 x + a_0 corresponds to the string a_0 . . . a_127 rather than the string a_127 . . . a_0. The usual convention on machines, whether big-endian or little-endian, is that the msb is the leftmost bit of any register. One advantage of following that convention is that a left shift by one can be implemented by adding a register to itself, an operation that is sometimes faster than a 1-bit logical shift. For example, on an ARM Cortex-A57, addition has one-cycle latency but a logical shift has two.
Incremental APIs. Unlike OCB1 and OCB2, under OCB3, each 128-bit block of plaintext is processed in the same way whether or not it is the final 128 bits. This change facilitates implementing an incremental API, as one is able to output each 128-bit chunk of ciphertext after receiving the corresponding chunk of plaintext, even if it is not yet known if the plaintext is over. Similarly, each 128-bit block of AD is now treated the same way whether it is or is not the message's end, simplifying the incremental provisioning of it.
Endian issues. While we expect the majority of processors running OCB will be little-endian, the mode's definition does nothing to favor this convention. Endian issues arise when "register oriented" and "memory oriented" values interact. These are the same on big-endian machines, but are opposite on little-endian ones. One could, therefore, favor little-endian machines by building into the algorithm byte swaps that mimic those that would occur naturally each time memory- and register-oriented data interact. We experimentally adapted our implementation to do this but found that it made very little performance difference. One reason for this is the efficient byte-reversal capability of most modern processors (e.g., the pshufb instruction can reverse 16 bytes on x86 machines in a single cycle). Also, the OCB1-based approach for incrementing offsets allows precomputed values to be endian-adjusted at key-setup time, removing most endian-dependency from subsequent encryption or decryption calls. Since it makes little difference to performance, and big-endian specifications are conceptually easier, OCB does not make any gestures toward little-endian orientation.

Schemes compared. The best-established AE schemes today are GCM [10,31], CCM [9,43], and Chacha20-Poly1305 [5,32]; they probably represent the bulk of AE in use. So we begin with these three schemes. Then, in the recently concluded CAESAR competition, AEGIS-128 [44] and OCB were the winners of the "high-performance" category. So we include AEGIS-128 as a fourth point of comparison. Finally, on the 64-bit Intel architecture we include AES-GCM-SIV, as specified in RFC 8452 [13] (2018). The AES-GCM-SIV mechanism builds on Rogaway and Shrimpton's SIV [41] to add nonce-reuse misuse-resistance, the most important security property, in our view, that is not delivered by OCB or the other alternatives we have named.
AES-GCM-SIV is benchmarked only on the 64-bit Intel architecture because, at the time of writing, that is the only architecture for which a high-quality implementation is available. We ran experiments on three different architectures. The most important architectures today are 64-bit Intel x86, for servers, desktops, and laptops, and 64-bit ARM, for high-performance embedded applications such as smartphones and tablets. We therefore benchmark on CPUs fitting these descriptions. For 64-bit Intel we use a recent Intel Core i3-8121U ("Cannon Lake"). For 64-bit ARM we test on a high-performance Cortex-A73. As a catch-all for architectures not falling into either of these categories, we also benchmark on an older 32-bit ARMv5 processor (Marvell 88FR131 "Feroceon"). The Feroceon results capture how much work each algorithm needs when the unit of work is a 32-bit RISC assembly instruction and the processor lacks hardware acceleration such as an AES unit or vector registers. The Feroceon is also a reasonable stand-in for what can be expected from ARM's current M-series embedded processors.

Performance
Source code. A problem sometimes seen in performance comparisons is the inclusion of implementations of uneven quality. To avoid this, we considered for benchmarking the best-performing version of each scheme among several sources. OpenSSL, Libgcrypt, and WolfSSL all provide open-source cryptographic libraries with significant assembly-language acceleration targeting multiple architectures. For each algorithm implemented by each library, we identified its peak speed on multiple architectures. In every case OpenSSL's library was the fastest. This is not surprising: OpenSSL has a longer record of producing good-quality assembly than the others. As a sanity check, whenever possible we also verified that speeds reported on the SUPERCOP benchmarking website [42] were not faster than the speeds we found. The AEGIS-128 implementation we used for benchmarking is a modified version of the optimized one placed on the SUPERCOP benchmarking website by the AEGIS designers. The AES-GCM-SIV implementation we used is from Google's BoringSSL library. We ourselves supplied the OCB code. The AEGIS-128 and OCB code was written in C (with AES intrinsics when available), while all other AE algorithms have their kernels written in assembly. All non-library code was compiled with both Clang 8.0 and GCC 8.3 or 9.1, using optimization levels 2 and 3, and whichever combination performed best is reported. The OpenSSL and BoringSSL libraries were built using the library's build script with the addition of architecture-specific flags. We found that changing the compiler or optimization level used to compile the libraries made no measurable difference in our tests, no doubt because the time-critical code was already in assembly and thus unaffected by compiler options.
Discussion. Having AES in hardware is the most significant determinant of performance. On our 64-bit ARM and 64-bit Intel processors, both of which have hardware AES support as well as support for GCM's "carryless" multiplication, OCB, AEGIS, and GCM are the fastest. On these architectures, OCB and AEGIS have similar performance profiles, while GCM encrypts 2KB messages 40-60% slower. Chacha20-Poly1305 and CCM have peak speeds 3-7 times slower than OCB and AEGIS-128. On systems without AES in hardware, Chacha20-Poly1305 is the clear winner. On our Feroceon system, Chacha20-Poly1305 encrypts 2KB messages 18% faster than AEGIS-128, twice as fast as OCB, and over three times faster than CCM and GCM.
The relative speeds of these algorithms are easy to explain. OCB and AEGIS-128 were designed to have little marginal overhead beyond calls to the AES round function. On systems with a hardware AES unit, this results in outstanding peak performance. Although CCM and GCM can also take advantage of a hardware AES unit, the computation of GHASH for GCM and of the CBC-MAC for CCM, in addition to the counter-mode encryption each performs, is enough to separate them from the speed leaders. Chacha20-Poly1305 does not use AES or carryless multiplication, instead depending on integer addition, rotation, and xor; it gets no benefit from AES hardware support. Chacha20-Poly1305 is designed to benefit from vectorized instructions, like AVX on Intel and NEON on ARM, but these do not have enough accelerative power to keep up with the AES unit's capabilities.
On a system without an AES unit, we see that AEGIS-128 is about twice as fast as OCB (because it computes about half as many AES rounds, per byte of input, as OCB), and OCB is about twice as fast as CCM (because it computes about half as many AES rounds as CCM). GCM is somewhere between OCB and CCM because GCM substitutes a less expensive GHASH computation for CCM's CBC-MAC. The simplicity of the Chacha20-Poly1305 design leads to its superior performance on simpler systems.
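The round-count arithmetic behind these factor-of-two ratios can be made explicit. The per-block figures below are approximate: AEGIS-128's state update applies five AES rounds per 16-byte block absorbed, OCB makes one ten-round AES-128 call per block, and CCM's two passes make two such calls.

```python
# Approximate AES rounds needed per 16-byte block of plaintext.
rounds_per_block = {"AEGIS-128": 5, "OCB": 10, "CCM": 20}

# Normalizing per byte shows the relative AES workloads directly.
rounds_per_byte = {name: r / 16 for name, r in rounds_per_block.items()}
```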
Effect of parallelism. On systems with an AES unit in hardware, parallelism has a profound effect. OCB can take advantage of any amount of AES parallelism. If a system is able to compute, say, 32 AES calls in parallel, then OCB computation can be arranged to loop and process 32 blocks per iteration. This means that OCB's performance depends primarily on how many AES computations can be done in parallel and how long each takes to complete. On Intel's Cannon Lake architecture, the AES unit can issue two AES rounds per cycle, and each takes five cycles to complete. This means that if the pipeline can be kept full, it completes ten rounds of processing every five cycles. Since each 16 bytes of plaintext requires at least ten rounds of AES processing, this implies a maximum throughput of five cycles per 16 bytes, or 0.31 cpb. On Intel's Skylake architecture, where only one AES round can be issued per cycle, it will take at least ten cycles per 16 bytes, so 0.62 cpb. OCB achieves this lower-bound speed on Skylake but not on Cannon Lake: there is enough slack in the functional units during Skylake's computation to absorb the computational overhead beyond AES computations, whereas on Cannon Lake there is not.
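The cycles-per-byte figures above follow from simple arithmetic. The sketch below is a back-of-envelope model, with round counts and issue widths taken from the discussion rather than from any vendor documentation; it reproduces the quoted numbers.

```python
def peak_cpb(rounds_per_block: int, rounds_issued_per_cycle: int) -> float:
    """Steady-state cycles per byte when the AES pipeline stays full.

    Each 16-byte block needs rounds_per_block AES-round instructions;
    instruction latency only affects pipeline fill, not steady-state
    throughput, so it does not appear here.
    """
    cycles_per_block = rounds_per_block / rounds_issued_per_cycle
    return cycles_per_block / 16  # an AES block is 16 bytes

# Cannon Lake: two AES rounds issued per cycle -> ten rounds every five cycles.
assert peak_cpb(10, 2) == 0.3125   # the text's 0.31 cpb
# Skylake: one AES round issued per cycle -> ten cycles per block.
assert peak_cpb(10, 1) == 0.625    # the text's 0.62 cpb
```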
Looking toward newer architectures, Intel is currently rolling out its AVX-512 architecture. Our Cannon Lake testbed has some previously developed 512-bit vector instructions, which are used by Chacha20-Poly1305, but Intel's Ice Lake architecture brings more significant improvements. Its AES unit doubles throughput and shaves 25% off latency, which will benefit OCB much more than the other AE algorithms. AEGIS-128 has a serial component (it cannot start the first AES round of an input block until the first AES round of the prior block has completed), implying that its performance is limited by AES round latency. This should result in a 25% speed increase between Cannon Lake and Ice Lake architectures, but that is less than the potential doubling of OCB speed. GCM will see a doubling in AES throughput, but carryless multiplication will maintain its throughput, limiting any performance increase. CCM's serial CBC-MAC computation will be 25% faster on Ice Lake than on Cannon Lake, but will continue to keep CCM's performance slowest of the group. Chacha20-Poly1305's AVX-512 speed improvements are already reflected in this study and should see no significant improvement on Ice Lake. Preliminary results on an Ice Lake system, completed as we go to press, show an OCB peak throughput of 0.19 cpb, which is very close to the predicted doubling.
Testing methodology. The number of CPU cycles needed to encrypt a message is divided by the length of the message to arrive at the cost per byte to encrypt messages of that length. This is done for every message length from 1 to 2048 bytes. So as not to have performance results overly influenced by the memory subsystem of a host computer, we arrange for all code and data to be in level-1 cache before timing begins. Two timing strategies are used: the clock function of C and the x86 time-stamp counter. In the clock version, the ANSI C clock() function is called before and after repeatedly encrypting the same message, on sequential nonces, for a little more than one second. The clock difference, together with the processor's clock rate, determines how many CPU cycles were spent on average per processed byte. This method is highly portable, but is time-consuming when collecting an entire dataset. On x86 machines there is a "time-stamp counter" (TSC) that increments once per CPU cycle. To capture the average cost of encryption, the TSC is used to time encryption of the same message 64 times on successive counter-based nonces. The TSC method is not portable, working only on x86, but it is fast. The TSC read instruction might be executed out of order, in some cases it has high latency, and it continues counting when other processes run. To reduce the impact of these problems, we read the TSC once before and once after encrypting the same message 65 times, and then read the TSC once before and once after encrypting the same message one more time. Subtracting the second timing from the first gives us the cost of encrypting the message 64 times, and mitigates the out-of-order and latency problems. To avoid including context switches, we run experiments multiple times and keep only the median timing.
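As a portable illustration of the subtract-off strategy, here is a Python analogue; the measurements reported in this paper were made in C with clock() and the TSC, not with Python's timers, so this is a sketch of the idea only.

```python
import statistics
import time

def median_encryption_time(encrypt, msg, inner=64, trials=9):
    """Time inner+1 encryptions and then 1 encryption, and subtract, so the
    fixed cost of reading the timer cancels; keep the median across trials
    to discard outliers caused by context switches."""
    samples = []
    for _ in range(trials):
        t0 = time.perf_counter_ns()
        for _ in range(inner + 1):
            encrypt(msg)
        t1 = time.perf_counter_ns()
        encrypt(msg)
        t2 = time.perf_counter_ns()
        # (t1-t0) covers inner+1 calls; (t2-t1) covers 1 call; the
        # difference is inner calls with timer overhead cancelled.
        samples.append(((t1 - t0) - (t2 - t1)) / inner)  # ns per call
    return statistics.median(samples)

# Stand-in "encryption": xor a constant byte over the message.
cost = median_encryption_time(lambda m: bytes(b ^ 0x5A for b in m), b"x" * 2048)
assert cost > 0
```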

Security
The key steps for proving OCB secure are establishing the security of the CB construction when based on an idealized TBC, and then showing that our way of creating a tweakable blockcipher, the Tw-construction, works. Both steps are in the information-theoretic setting. Passing to the complexity-theoretic setting for the final result is then standard.

Security of the TBC-Based Scheme
The following lemma establishes the priv-security and auth-security of CB when based on an ideal TBC. Note that in this idealized setting, privacy and authenticity do not degrade with message lengths, AD lengths, or the number of queries: privacy is perfect, while the chance of forging is about 1/2^τ.
Proof. Consulting Fig. 4 will make the proof more understandable. Keep in mind that, in the idealized setting we consider, the tweaked maps E_K^{N i}, E_K^{N i *}, and their siblings are random permutations on n bits (for all permissible N and i). To emphasize this, we write π in place of E_K, so π^{N i}, π^{N i *}, π^{N i $}, π^{N i * $}, π^{i}, and π^{i *}. Any two such permutations are independent of one another as long as their superscripts are distinct. We must argue privacy and authenticity.
Privacy. During the adversary's attack it asks queries (N^1, A^1, M^1), . . . , (N^q, A^q, M^q) and gets responses C^1, . . . , C^q, where C^j = C^j_1 · · · C^j_{m_j} T^j or C^j = C^j_1 · · · C^j_{m_j} C^j_* T^j. In general, in this proof we will use a superscript of j on a variable to indicate the value of that variable following the adversary's jth query. We may do this with any variable appearing in Fig. 4.
Since the N^j-values must be distinct, each permutation of the form π^{N^j ···} is used at most once. We are thus applying independent random permutations to a single point each, so all of these outputs are uniformly random and independent. The bulk of what is returned to the adversary, all the C^j_i values, are such π^{N^j i} outputs. But the C^j_* and T^j values are not just permutation outputs but are, instead, π^{N^j m_j *} or π^{N^j m_j $} or π^{N^j m_j * $} outputs that are xored with M^j_* 0^* and truncated to |M^j_*| bits, or, alternatively, are xored with Auth^j and then truncated to τ bits. Either way, the result remains uniform and independent of all other outputs, as Auth^j, M^j_*, and τ are independent of the π^{N^j ···} outputs. We conclude that each result from the adversary's jth query is a uniformly random string C^j of length |M^j| + τ that is independent of all other query responses. It follows that the adversary's privacy advantage is zero.
Authenticity. Before we launch into proving authenticity, consider the following simple game, which we call game G. Suppose that you know that an n-bit string Tag is not some particular value Tag_0. All of the 2^n − 1 other possibilities are equally likely. Then your chance of correctly predicting the τ-bit prefix of Tag is at most 2^{n−τ}/(2^n − 1). That is because the best strategy is to guess any τ-bit string other than the τ-bit prefix of Tag_0, and the probability of being right under this strategy is 2^{n−τ}/(2^n − 1). We will repeatedly use this fact in the sequel. Now suppose that the adversary asks a sequence of queries (N^1, A^1, M^1), . . . , (N^q, A^q, M^q) and then makes its forgery attempt (N, A, C), where C = C_1 · · · C_c T or C = C_1 · · · C_c C_* T. We need to bound the probability that (N, A, C) is a successful forgery.
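The claim about game G can be checked exhaustively at small parameters. The following sketch (illustrative, with toy values of n and τ) searches over all τ-bit guesses:

```python
from fractions import Fraction

def best_prefix_guess_prob(n, tau, tag0=0):
    """Best probability of guessing the tau-bit prefix of an n-bit Tag that
    is uniform over the 2^n - 1 values other than tag0 (game G, tiny n)."""
    total = 2 ** n
    best = Fraction(0)
    for guess in range(2 ** tau):
        # Count tags with this tau-bit prefix, excluding the ruled-out tag0.
        hits = sum(1 for t in range(total)
                   if t != tag0 and (t >> (n - tau)) == guess)
        best = max(best, Fraction(hits, total - 1))
    return best

# Matches 2^(n-tau) / (2^n - 1) for every small parameter set tried.
for n, tau in [(4, 1), (4, 2), (5, 3), (6, 2)]:
    assert best_prefix_guess_prob(n, tau) == Fraction(2 ** (n - tau), 2 ** n - 1)
```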
We begin by considering the case where the forgery attempt (N, A, C) employs a nonce N that is not among {N^1, . . . , N^q}. Then to forge the adversary needs to find the correct value of T = (π^{N ···}(Checksum) ⊕ Auth)[1..τ] but has seen nothing that depends on π^{N ···}. Having no relevant information to use, the adversary's chance of success is clearly 2^{−τ}, which is less than 2^{n−τ}/(2^n − 1).
Given the above, we can assume that the forgery attempt (N, A, C) uses a nonce N = N^j that coincides with the nonce from some prior query. All the other queries had to employ different nonces and thereby generated information theoretically unrelated to the adversary's task at hand. We can therefore disregard these other encryption queries and assess the maximal forging probability for a new and simpler game: the adversary asks a single encryption query (N, A′, M′), getting a response C′, and then it tries to forge, with the same nonce, outputting a triple (N, A, C). We must bound the probability that such a forgery is valid. We write A′ = A′_1 · · · A′_{a′} or A′ = A′_1 · · · A′_{a′} A′_*, M′ = M′_1 · · · M′_{m′} or M′ = M′_1 · · · M′_{m′} M′_*, A = A_1 · · · A_a or A = A_1 · · · A_a A_*, and C = C_1 · · · C_c T or C = C_1 · · · C_c C_* T. We proceed by case analysis, relating the form of the forgery attempt (N, A, C) to the encryption query (N, A′, M′). We say that a string is full if its length is a multiple of n bits, and partial otherwise.
We have several subcases to consider. Case 1a. If A′ is full and A is partial, then (Auth′, Auth) will be U_{2n} distributed, the uniform distribution on 2n bits, and the adversary will only be able to forge with probability 2^{−τ}. The uniformity of (Auth′, Auth) is because Auth′ will depend on a π^{a′} output but no π^{a *} output, while Auth will depend on a π^{a *} output. The case when A′ is partial and A is full is similar. Case 1b. If A and A′ are full and a ≠ a′, then (Auth′, Auth) will again be U_{2n} distributed, whence the forging probability will be 2^{−τ}. The uniformity of (Auth′, Auth) is because each includes a π^1 output in the xor but only one xors in a π^{max(a, a′)} output. The case of A and A′ being partial but a ≠ a′ is analogous. Case 1c. Suppose A = A_1 . . . A_a and A′ = A′_1 . . . A′_a are full, equal in length, but distinct; say A_j ≠ A′_j for some j. Then, even if the adversary were given not only C′ but also Final, all π^i(A′_i) values, and all π^i(A_i) values except for i = j, still the value of Auth = ⊕_i π^i(A_i) = Z ⊕ π^j(A_j), and therefore Tag, would be uniform in a space of 2^n − 1 possible values. We are now in the setting of game G, and the adversary's chance to predict T, the first τ bits of Tag, is at most 2^{n−τ}/(2^n − 1). Case 1d. The last case is when A = A_1 . . . A_a A_* and A′ = A′_1 . . . A′_a A′_* are partial, equal in length, but distinct. If A_j ≠ A′_j for some j, then proceed as with Case 1c. If A_* ≠ A′_*, then A_* 10^* ≠ A′_* 10^*, and proceed as with Case 1c, but with π^{a *}(A_* 10^*) the unknown value.
Case 2: A ≠ A′, with A = ε or A′ = ε. If A′ = ε ≠ A, then Auth′ = 0^n, while Auth is U_n distributed and independent of Final, so Tag is uniform and independent of all values the adversary has observed, whence its ability to predict the τ-bit prefix is at most 2^{−τ}. If A = ε ≠ A′, then Auth = 0^n and the adversary must output the τ-bit prefix of Final. It has received no information on any π^{N i $} or π^{N i * $} output from its encryption query, as the only such output was masked with Auth′, formed using a π^{a′} or π^{a′ *} output. So the forging probability is again 2^{−τ}.
Case 3: A = A′; M′ partial and C full, or the other way around. For this and all remaining cases, we imagine providing the adversary with Auth′ = Auth. So the adversary must predict the first τ bits of Final. But Final is the output of a random permutation π^{N c $} that has not yet been applied to any point. With no relevant information for carrying out this task, its chance to forge is at most 2^{−τ}. For the case with M′ full and C partial, Final is, similarly, an output of π^{N c * $}, where this permutation has not yet been applied to any point.

Case 4: A = A′; c ≠ m′; M′ and C both full or both partial. The adversary must predict the first τ bits of Final. But Final is the output of a random permutation π^{N c $} (if M′ and C are both full) that has not yet been applied to even one point. With no relevant information, the adversary forges with probability at most 2^{−τ}. Similarly when M′ and C are both partial.
Case 5: A = A′, c = m′, M′ and C full. For the forgery to be valid we must have C ≠ C′, so let j be the smallest value where C_j ≠ C′_j. Assume the adversary is given the value of Auth′ = Auth, all π^{N i} permutations where i ≠ j, and all π^{··· $} permutations. Then the adversary will know Checksum′ and all the values but one that are xored together to make Checksum. But it will be missing one addend in the xor, and all 2^n − 1 values other than M′_j are equally likely for it, whence Checksum can assume 2^n − 1 values, each equiprobable, and Final can assume 2^n − 1 values, each equiprobable. We are in the setting of game G and the adversary's chance to forge is at most 2^{n−τ}/(2^n − 1).
Case 6: A = A′, c = m′, M′ and C partial. Because C ≠ C′, either there is a smallest j where C_j ≠ C′_j or else C_1 · · · C_c = C′_1 · · · C′_c but C_* ≠ C′_*. For the former case, proceed as in Case 5. For the latter case the adversary may be able to compute Checksum′ = M′_1 ⊕ · · · ⊕ M′_c ⊕ M′_* 10^* and also Checksum = M′_1 ⊕ · · · ⊕ M′_c ⊕ M_* 10^*. These two values are necessarily distinct, due to the 10^* padding. Even if the adversary is given both values and knows Final′, the image of Checksum under π^{N c $} will be uniformly random among the 2^n − 1 points that are not Final′. We are again left in the situation of game G, and the adversary's probability to forge is at most 2^{n−τ}/(2^n − 1). This completes the proof.

Stretch-then-Shift Hash
OCB employs a new hash function, "Init" in Fig. 3, to map the nonce N to the initial offset. We aimed to construct a hash that would need at most one AES computation no matter what N-values are presented, but would usually reuse a previously computed AES output, plus a couple of shifts and xors, if N is one more than it was last time around. Our method applies AES to the top 122 bits of N to create an initial value Ktop, but then modifies Ktop in a simple way using the lower six bits of N, call them Bottom. Specifically, we return the first 128 bits of (Ktop ∥ (Ktop ⊕ (Ktop << 8))) << Bottom. There is no deep intuition motivating the formula; it is just the simplest method we found that we could prove to work. We start with the needed definitions.
Definition. Let K be a finite set and let H : K × X → {0, 1}^n be a function. We say that H is strongly xor-universal if for all distinct x, x′ ∈ X we have that H_K(x) ⊕ H_K(x′) is uniformly distributed in {0, 1}^n and, also, H_K(x) is uniformly distributed in {0, 1}^n for all x ∈ X. The first requirement is the usual definition of H being xor-universal; the second we call universal-1.
The technique. We aim to construct strongly xor-universal hash functions H : K × X → {0, 1}^n where K = {0, 1}^128, X = [0..domSize − 1], and n = 128. We want domSize to be at least some modest-size number, say domSize ≥ 64, and intend that computing H_K(x) be extremely fast. Computation of H should not require large tables, preprocessing of K, or special hardware support.
The method we propose is to stretch the key K into a longer string stretch(K) and then extract its bits x + 1 to x + 128. Symbolically, H_K(x) = (stretch(K))[x + 1 .. x + 128], where S[a..b] denotes bits a through b of S, indexing beginning with 1. Equivalently, H_K(x) = (stretch(K) << x)[1..128]. We call this a stretch-then-shift hash.
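A bignum sketch of the hash just described, using the bit-numbering of the text; c = 8 is supplied as a default only because that is the constant eventually adopted.

```python
def stretch_then_shift(K: int, x: int, c: int = 8, n: int = 128) -> int:
    """H_K(x) = (stretch(K))[x+1 .. x+n], stretch(K) = K || (K xor (K << c)).

    Bits are numbered from 1 at the left, as in the text; K is an n-bit
    integer, and the shift K << c discards bits pushed past position n.
    """
    mask = (1 << n) - 1
    stretch = (K << n) | (K ^ ((K << c) & mask))  # 2n-bit string as an integer
    return (stretch >> (n - x)) & mask            # bits x+1 .. x+n

# Sanity check: taking x = 0 returns the first n bits, i.e., K itself.
for K in (0x0123456789ABCDEF0123456789ABCDEF, (1 << 128) - 1):
    assert stretch_then_shift(K, 0) == K
```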
How to stretch K? It seems natural to have stretch(K) begin with K, so let us assume that stretch(K) = K ∥ s(K) for some function s. It is easy to see that s(K) = K and s(K) = K << c will not work, but s(K) = K ⊕ (K << c), for some constant c, looks plausible. We now demonstrate that, for well-chosen c, this function does the job. Write H^c for the resulting hash and regard H^c_K(i) = A_{i+1} K as the product, over GF(2), of the matrix A_{i+1} and the column vector K. Let B_{i,j} = A_i + A_j be the indicated 128 × 128 matrix, the matrix sum over GF(2). We would like to ensure that, for arbitrary 0 ≤ i < j < domSize(c) and a uniform K ∈ {0, 1}^128, the 128-bit string H^c_K(i) + H^c_K(j) is uniform, which is to say that A_{i+1}K + A_{j+1}K = (A_{i+1} + A_{j+1})K = B_{i+1, j+1}K is uniform. This will be true if and only if B_{i,j} is invertible over GF(2) for all 1 ≤ i < j ≤ domSize(c). Thus domSize(c) can be computed as the largest number such that B_{i,j} is full rank, over GF(2), for all 1 ≤ i < j ≤ domSize(c). Recalling the universal-1 property, we also demand that A_i have full rank for all 1 ≤ i ≤ domSize(c). Now for any c, the number of matrices B_{i,j} to consider is at most 2^13, and finding the rank in GF(2) of that many 128 × 128 matrices is not a difficult calculation.
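The rank computation described above is easy to carry out. The sketch below is our own verification code, not the authors': it derives each matrix by pushing basis vectors through the hash and checks invertibility over GF(2) for c = 8 on the domain [0..63] that OCB needs (the pairwise loop takes a few seconds).

```python
def hash_matrix(x, c=8, n=128):
    """Columns of the GF(2) matrix M_x with H_K(x) = M_x K, found by feeding
    the basis vectors of {0,1}^n through the stretch-then-shift hash."""
    mask = (1 << n) - 1
    cols = []
    for k in range(n):
        K = 1 << (n - 1 - k)                        # k-th basis vector
        stretch = (K << n) | (K ^ ((K << c) & mask))
        cols.append((stretch >> (n - x)) & mask)    # bits x+1 .. x+n
    return cols

def rank_gf2(vectors):
    """Rank over GF(2) of a matrix given as a list of column bitmasks."""
    pivots = {}
    for v in vectors:
        while v:
            high = v.bit_length() - 1
            if high in pivots:
                v ^= pivots[high]   # eliminate the leading bit
            else:
                pivots[high] = v
                break
    return len(pivots)

# Universal-1: every M_x is invertible; xor-universal: so is every M_x + M_y.
mats = [hash_matrix(x) for x in range(64)]
assert all(rank_gf2(m) == 128 for m in mats)
assert all(rank_gf2([a ^ b for a, b in zip(mats[i], mats[j])]) == 128
           for i in range(64) for j in range(i + 1, 64))
```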
Our results are tabulated in Fig. 9, which gives domSize(c) for each c ∈ [1..20]. OCB uses c = 8, for which domSize(c) ≥ 64. Restricting the domain accordingly gives the statement we need. Lemma 3. Let H_K(x) = (stretch(K))[x + 1 .. x + 128] for K ∈ {0, 1}^128 and x ∈ X = [0..63], where stretch(K) = K ∥ (K ⊕ (K << 8)). Then H is strongly xor-universal.
Efficiency. On 64-bit computers, assuming K ∥ (K ⊕ (K << 8)) is precomputed and in memory, the value of H_K(x) can be computed by three memory loads and two multiprecision shifts, requiring fewer than ten cycles on most architectures. If only K is in memory, then the first 64 bits of K ⊕ (K << 8) can be computed with three additional assembly instructions.

Instantiating the TBC
We show that Ẽ = Tw[E, τ] is a good TBC if E is a good blockcipher. In formalizing this, bidirectional tweaks are those of the form (N, i). Our result is again, for now, in the information-theoretic setting. Proof. We generalize the adversary's capabilities in attacking Tw[E, τ]; see Fig. 10 for the construction we will call TW. There we write π instead of E_K. The adversary, which we still refer to as A, may now ask queries we will refer to as being of TYPE-1, TYPE-2, TYPE-3a, and TYPE-3b. In other words, the adversary's queries may take any of the forms (1, W, λ), (2, W, Top, Bottom, λ), (3a, W, Top, Bottom, λ), or (3b, Z, Top, Bottom, λ). We insist that the adversary not ask a query with Top = 0 (we stop distinguishing between field points and the corresponding strings) and we demand that any λ ∈ GF(2^n) asked in a query is used only for queries of one numeric TYPE (it is fine to use the same λ in queries of TYPE-3a and TYPE-3b). The adversary may not repeat queries nor ask a query with a trivially known answer (a TYPE-3b query following the corresponding TYPE-3a query, or the other way around). Working in GF(2^n), we sometimes write xor as addition. As the adversary asks its queries, the mechanism makes what we will call internal queries to the random permutation π. For example, the adversary's TYPE-1 query of (W, λ) results in an internal type-1 query of X. The internal queries come in two flavors, direct and indirect, as shown in Fig. 5. Note that the total number of internal queries resulting from the adversary's q queries is at most σ = 2q + 1. The hash function H that we use to compute Initial is the map defined and proven secure in Lemma 3; that said, any strongly xor-universal hash function with the needed domain and range will do. It is important to understand that all of the abilities present in a "real" adversary attacking Tw[E, τ] are also represented in the abilities of an adversary attacking the TW construction we have now described.
We aim to show that A gets small advantage in attacking TW. The proof involves a game-playing argument followed by a case analysis of some collision probabilities. We begin with a game 1 that perfectly simulates the TW-construction. As the adversary A asks queries, the game grows the permutation π in the usual way, preparing each input for π or π^{−1} exactly as would TW. The responses to type-A and type-B queries are stored and looked up as needed. In game 2 we return, in response to each internal query to π or π^{−1}, a freshly minted uniformly random point of GF(2^n). Note that this results in values returned to the adversary that are, likewise, uniformly random. In game 3 we perfectly simulate an ideal tweakable blockcipher (with the right domain and tweak space). By the "switching lemma" the advantage of A in distinguishing games 2 and 3 is at most 0.5 q(q − 1)/2^n, so we must only bound the gap between games 1 and 2.
In game 1, consider answering each internal query by uniformly sampling from {0, 1}^n and, hopefully, returning that sample. If the speculatively generated return value has already been used, set a flag bad and re-sample from the unused portion of the range (for π-queries) or domain (for π^{−1}-queries). These bad-setting events occur with probability at most 0.5 σ(σ − 1)/2^n.
When an internal query clashes with any prior commitment made then, to accurately play game 1, we must answer the query according to the prior commitment. Assume we do so, and then set bad. Call these bad-setting events collisions. We can write games 1 and 2 so as to be identical until bad is set, so we have only to bound the probability of collisions in game 2, the version where we uniformly sample for responding to internal queries. Note that game 2 maintains the invariant that values returned to the adversary are independent of the values L and Initial selected internally. Because of this, we can simplify the temporal aspect of the game and replace it by an alternative one in which the adversary chooses all TBC queries, and their responses, at the beginning. Then we make the indirect queries that determine L and Initial, and determine if a collision has occurred. Excising adaptivity in such a manner has been illustrated in much prior work.
Any potential collision event (say, the 20th internal query colliding with the 6th) can be summarized by writing something like Coll(3a, 1), interpreted as saying that first there was a type-3a internal query (W, Top, Bottom, λ), which generated a π-input of X (its value to be determined) and an adversarially known response Z, and then the adversary asked a type-1 query of (W′, λ′), which gave rise to a π-input of X′ (value to be determined) and an adversarially known response of Y′. Now we make the underlying type-A and type-B queries and it so happens that X = X′. Such an event is unlikely, since it implies that W + Initial + λL = W′ + λ′L, and Initial is uniform and independent of all other values named in the formula: we select the type-B output Ktop of π uniformly at random, and H is universal-1, making H_Ktop(Bottom) uniform, too. The probability of the event happening, for a given pair of internal queries, is at most 2^{−n}. The same holds for each of the 36 possible collision types. To avoid tedious repetition, we provide a few examples, continuing the convention of priming variables for the second query. – One representative type equates a π-input involving Initial with a value determined by Y, λ′, and L; its probability is at most 2^{−n} as, for example, Initial is uniform and independent of Y, λ′, and L. – Pr[Coll(3b, 2)] = Pr[W + Initial + λL = W′ + Initial′ + λ′L]. Since λ-values must be distinct between TYPE-2 and TYPE-3b queries, the probability is Pr[cL = W + Initial + W′ + Initial′] for some c ≠ 0. Since the right-hand side is independent of L and L is uniform, the probability is at most 2^{−n}.
Continuing in this way, one finds that each type of collision occurs with probability at most 2^{−n}, implying a probability for any collision of at most 0.5 σ(σ − 1)/2^n. Summing this with the earlier addends of 0.5 σ(σ − 1)/2^n and 0.5 q(q − 1)/2^n, and recalling that σ ≤ 2q + 1, we conclude that the total adversarial advantage is at most (4.5q^2 + 1.5q)/2^n ≤ 6q^2/2^n, completing the proof.
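The closing arithmetic can be checked mechanically. The sketch below sums the three addends as reconstructed from the proof (two 0.5σ(σ − 1) terms and one 0.5q(q − 1) term, all in units of 1/2^n) and confirms the 4.5q^2 + 1.5q ≤ 6q^2 simplification:

```python
from fractions import Fraction

def advantage_numerator(q):
    """Sum, in units of 1/2^n, of the proof's addends with sigma = 2q + 1:
    bad-setting 0.5*s*(s-1), collisions 0.5*s*(s-1), switching 0.5*q*(q-1)."""
    s = 2 * q + 1
    return 2 * Fraction(s * (s - 1), 2) + Fraction(q * (q - 1), 2)

for q in range(1, 1000):
    assert advantage_numerator(q) == Fraction(9, 2) * q ** 2 + Fraction(3, 2) * q
    assert advantage_numerator(q) <= 6 * q * q   # the final simplification
```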

Wrapping Up
The following concretizes a claim, by now folklore, from Rogaway and Shrimpton [41,Section 7]. It is one half of the equivalence between ae-security and priv+auth security: an AE scheme that is secure in the priv and auth senses is secure under the all-in-one definition, too. We include a proof, as we don't know where else in the literature one appears.

Lemma 5.
Let Π be an AE scheme and let A be an adversary that asks at most q_d decryption queries. Then there are adversaries A_priv and A_auth such that Adv^ae_Π(A) ≤ Adv^priv_Π(A_priv) + q_d · Adv^auth_Π(A_auth). Adversaries A_priv and A_auth are explicitly constructed from A in the proof of this lemma. They have time and query complexity similar to that of A.
Proof. Adversary A_priv is constructed as follows. It runs adversary A. When A makes a left-oracle query of (N, A, M), adversary A_priv asks its own oracle (N, A, M) and gets a response C, which it returns to A. When A makes a right-oracle query of (N, A, C), it returns ⊥ to A. When A halts with an output of b, adversary A_priv outputs the same bit b and halts. Adversary A_auth is constructed as follows. It selects a random j ∈ [1..q_d]. It then runs adversary A. When A makes a left-oracle query of (N, A, M), adversary A_auth asks its own oracle (N, A, M) and gets a response C, which it returns to A. When A makes its ith right-oracle query of (N, A, C), adversary A_auth returns ⊥ if i ≠ j, while, if i = j, it outputs (N, A, C) (as its attempted forgery) and then halts.
Omitting oracle-argument placeholders (·, ·, ·) for improved readability, we have that Adv^ae_Π(A) ≤ Adv^priv_Π(A_priv) + Pr[SomeForge], where SomeForge is the event that adversary A, in the course of its execution with oracles E_K, D_K, asks its decryption oracle some query that results in a non-⊥ return value; where FirstForge_i is the event that A, in the course of its execution with oracles E_K, D_K, gets its first non-⊥ oracle response on query i; and where FirstForge is the event that A, in the course of its execution with oracles E_K, D_K, gets its first non-⊥ return value in response to its jth query, for j uniform in [1..q_d]. Since the events FirstForge_i are disjoint, Pr[SomeForge] = Σ_i Pr[FirstForge_i] = q_d · Pr[FirstForge] ≤ q_d · Adv^auth_Π(A_auth), establishing the lemma.

Combining the pieces bounds the ae-security of the idealized scheme. Let A be an adversary that asks at most q queries, at most q_d of them decryption queries, whose queries total at most σ blocks. Then Adv^ae(A) ≤ 6(σ + 2q)^2/2^n + 1.1 q_d/2^τ.
The first term is by Lemma 4 and the observations made previously concerning adversary D's attack. The second term is by Lemma 6.
Finally, we can quantify the security of OCB when instantiated with a "real" blockcipher E: there is an adversary A_prp, attacking E in the strong-PRP sense, whose advantage bounds the gap between A's ae-advantage against OCB[E, τ] and the information-theoretic terms above. Its time complexity is similar to that of A, and it asks at most σ + 2q queries.
Proof. Adversary A_prp is constructed as follows. It runs A. When A asks an encryption query of (N, A, M), adversary A_prp services this query by using the OCB construction with tag length τ, but employing its own left oracle in lieu of each forward blockcipher computation. When A asks a decryption query of (N, A, C), adversary A_prp services this query by using the OCB construction with tag length τ, but employing its own left oracle in lieu of each forward blockcipher computation and its right oracle in lieu of each inverse blockcipher computation. Now, when A_prp's oracles realize E_K and its inverse, it perfectly simulates OCB[E, τ] for A, and when they realize a random permutation and its inverse, it perfectly simulates the idealized scheme; the claimed bound follows.

Conclusions
Limitations. There are three limitations on OCB that we would like to emphasize. The first is that OCB's security crucially depends on the encrypting party not repeating a nonce. The mode should never be used in situations where that can't be assured; one should instead employ a misuse-resistant AE scheme [41]. These include AES-GCM-SIV [13,14], COLM [2], and Deoxys-II [23]. A second limitation of OCB is its birthday-bound degradation in provable security. This limitation implies that, given OCB's 128-bit block size, one must avoid operating on anything near 2^64 blocks of data. The RFC on OCB [27] asserts that a given key should be used to encrypt at most 2^48 blocks (4 petabytes), including the associated data. Practical AE modes that avoid the birthday-bound degradation in security are now known [14,19,20,23,40]. Finally, OCB uses both the forward and the backward direction of its underlying blockcipher.
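The usage limit is simple birthday-bound arithmetic. The sketch below plugs the RFC's 2^48-block limit into a 6σ^2/2^n-style bound; it ignores lower-order terms, so the exact constants are illustrative only.

```python
sigma = 2 ** 48                        # the RFC's per-key block limit
assert sigma * 16 == 2 ** 52           # 2^48 16-byte blocks = 4 pebibytes

advantage = 6 * sigma ** 2 / 2 ** 128  # birthday-style bound at that limit
assert advantage < 2 ** -29            # under two in a billion

# Near 2^64 blocks the bound becomes vacuous, hence the warning above.
assert 6 * (2 ** 64) ** 2 / 2 ** 128 >= 1
```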
Minematsu has devised an elegant approach for avoiding use of a blockcipher's inverse in an OCB-like construction [33].
The OCB2 bug. We mentioned in the Introduction the devastating bug in OCB2 discovered by Inoue, Iwata, Minematsu, and Poettering [17]. What was the root cause of this bug, and why does OCB3 escape it? We offer two high-level explanations.
One is to claim that OCB2 (and OCB1 and OCB3 as well) is over-optimized. In particular, the constructions twice omit post-whitening xors [38], switching from the XEX construction to the XE construction [30,38] for enciphering the Checksum and for Vernam-style encrypting the final block of plaintext. (OCB2 and OCB3 also use XE for processing the AD.) Shaving off these two xors increased conceptual complexity and proof complexity for an insignificant improvement in speed.
But the intermixing of XE and XEX constructions is not by itself a problem; what made this turn out so badly for OCB2 was the muddy (and therefore error-prone) abstraction boundary for dealing with TBCs that combine XE and XEX [38, Section 8]. Rather than partitioning the tweak space into unidirectional and bidirectional tweaks, as we did with OCB3, a one-bit tag was appended to each tweak, but then the tag was not treated as an intrinsic part of the tweak. This failed to properly capture the intent that things like the bold π^{N 4}, the barred π̄^{N 4}, and the plain π^{N 4} in [38, Fig. 1] all need to be instantiated so as to approximate independent permutations. Properly formalizing a partitioned-tweakspace PRP helped steer us clear of the error to which OCB2 succumbed.
ISO standardization. The OCB2 bug was made more significant because it had been standardized in ISO/IEC 19772 [18]. Unfortunately, nobody on that standardization committee had told us that they were planning to standardize this scheme. Had they, we would have explained that we had already abandoned OCB2 in favor of a cleaner and more empirically informed redesign, OCB3, an unpublished draft of which already existed. The same ISO standard included another faulty AE scheme: a badly misspecified version of encrypt-then-MAC [34]. These two problems suggest, to us, that a rather insular standardization process was then in place.
Patents. At present, OCB is not widely used. This is largely due to there having been potentially relevant patents held by multiple parties. This was the reason that OCB1 fell out of being a mandatory mechanism of 802.11 [16], and the reason it was not subsequently selected for a NIST Special Publication. The limited adoption of OCB exemplifies the often poisonous impact of patents on cryptographic practice.
The second author, who held the patents most directly reading on OCB, has renounced all OCB-relevant IP; all OCB-relevant patents of his have been placed in the public domain. While Rogaway had previously licensed OCB freely for all open-source software and all software for non-military use, IP concerns did not go away.
Final remark. As explained in Sect. 5, the initial reason for developing OCB1 was to improve the efficiency of authenticated encryption. But it soon became clear to us that AE schemes had a second benefit: that they could improve usability. By moving closer to what an ideal encryption scheme would deliver (and, also, by adjusting an encryption-scheme's syntax to include AD and avoid per-message randomness) one could make symmetric encryption less prone to misuse and less reliant on fussy composition techniques [4,34]. OCB, and AE more broadly, helped birth a reconceptualization of symmetric encryption. One abandons weak security notions and schemes that are hard to correctly use and moves toward strong security notions and schemes that are easier to use well. This, not speed, may be the lasting legacy of AE.