1 Introduction

In many settings, secrets needed for cryptography are measured in a noisy fashion. Biometrics such as fingerprints [31, 35], keystroke dynamics [23, 24], voice [22], and iris scans [31] are examples — repeated measurements of the same feature yield slight variants of one another. A long line of work has built ad hoc solutions for various cryptographic settings [17, 22,23,24], while another line of work starting with Dodis, Ostrovsky, Reyzin and Smith [13] explored a general primitive, called a fuzzy extractor, that can reproducibly derive secret keys given noisy measurements. The canonical fuzzy extractor construction combines a traditional key-derivation function (KDF) with a secure sketch, the latter serving as an error-correction code whose helper string leaks only a bounded amount of information about the secret.

In this work, we explore error correction for noisy secrets in the distribution-sensitive setting, in which one knows the distribution of secrets while designing cryptographic mechanisms. We ground our investigations in an important running case study: typo-tolerant password checking [12, 20], and ultimately offer a number of improvements, both theoretical and practical, on cryptographic error-tolerance in general and the design of typo-tolerant password hardening systems in particular.

Typo-Tolerant Password Checking. Recent work by Chatterjee et al. [12] revealed that users suffer from a high rate of typographical errors (typos), with even a handful of simple-to-correct typos (such as caps lock or other capitalization errors) occurring in \(10\%\) of all login attempts at Dropbox. They offer a usability improvement called “brute-force checking”: enumerate probable corrections of the submitted (incorrect) password, and check each of them using a previously stored slow-to-compute cryptographic hash of the correct password (e.g., scrypt [26], argon2 [6], or the PKCS#5 hashes [18, 27]). They also show empirically that this relaxed checking approach does not noticeably degrade security, assuming careful selection of the typos to correct.

To maintain performance, however, one must limit the runtime of password checking. Given a runtime budget \(\textsf {RT}\) and a cryptographic hash function that takes time \(\mathsf {c}\) to compute, one can check at most approximately \(b = \textsf {RT}/ \mathsf {c}\) candidate corrections. Given that \(\mathsf {c}\) should be slow — in order to prevent brute-force attacks — the size b of the ball, or set of potential corrections around an incorrect password, must be fairly small. Extending to larger numbers of errors — for example, we would like to handle the most frequent 64 typos, which would account for approximately \(50\%\) of all typos seen in measurement studies — would appear to force \(\mathsf {c}\) to be too low to ensure security in the face of attackers that obtain the password hash and mount dictionary attacks.

An existing alternative approach to brute-force ball search is to store, along with the password hash, a small bit of information to help in correcting errors. Because we want to maintain security in the case of compromise of a password database, we must ensure that this helper information does not unduly speed up brute-force cracking attacks. We therefore turn to secure sketches [13].

Secure Sketches. Introduced by Dodis, Ostrovsky, Reyzin and Smith [13], sketches allow correction of errors together with bounds on the information leaked about the original secret to an attacker. Traditionally, sketch security is measured by the conditional min-entropy \({\tilde{\mathrm {H}}_\infty }({W|s})\) of the secret \(W\) given the sketch \(s\) against unbounded adversaries. Fuller, Reyzin, and Smith (FRS) [15] show that the best one can hope for when achieving correction error at most \(\delta \) is \({\tilde{\mathrm {H}}_\infty }({W|s}) \ge {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) - \log (1- \delta )\), where \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W})\) is called the fuzzy min-entropy of the distribution and captures the worst-case cumulative weight of all points in a ball.

FRS give a clever construction, called layered hashing, that almost achieves the optimal result. They prove that

$$\begin{aligned} {\tilde{\mathrm {H}}_\infty }({W|s}) \ge {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) - \log (1/\delta ) - \log {\mathrm {H}_0}({W}) - 1 \;. \end{aligned}$$

Here \({\mathrm {H}_0}({W})\) is the Hartley entropy, defined to be the logarithm of the size of the distribution’s support. The FRS construction provides better bounds than any prior secure sketch construction (and, by the usual construction, the best known fuzzy extractor [13]). The construction works by splitting possible secrets into different layers according to their probability in the distribution \(W\), and then applying a universal hash of a specific length based on a message’s layer. Both the layer identifier and the resulting hash value are output. Intuitively, the idea is to tune hash lengths to balance error correction with entropy loss: more probable points are grouped into layers that have much shorter hash values, with less probable points grouped into layers with longer hashes.

The layered sketch works only in (what we call) the distribution-sensitive setting, meaning that the distribution of messages must be known at the time one designs the sketching algorithm. As another potential limitation, correcting an error using the sketch takes time linear in the size of the ball around the point, meaning the construction is only computationally efficient should balls be efficiently enumerable. That said, both conditions are satisfied in some settings, including typo-tolerant password checking: password leaks allow accurate characterization of password distributions [7, 19, 21, 33] when constructing sketches, and as mentioned above, the ball of errors required to cover most observed typos is small and fast to enumerate.

Our Contributions. In this work, we explore the open question above: How can we securely correct more errors than Chatterjee et al. in [12]? We offer two new approaches. The first uses secure sketching, and we give a new scheme, called the layer-hiding hash (LHH), and prove that it leaks less information than prior constructions. Perhaps counter-intuitively, LHH does so by actually lengthening, rather than shortening, the output of the sketch as compared to the FRS construction. Our second approach is a new distribution-sensitive brute-force based technique called popularity-proportional hashing (PPH), in which the time required to hash a password is tuned based on its popularity: The more probable the password is, the longer the hashing should take.

Finally, we offer a framework for comparing various approaches, and show that PPH offers a better time / security trade-off than LHH and FRS. For certain error settings, PPH allows us to correct more errors securely than Chatterjee et al.’s brute-force checking. Elsewhere their brute-force checking offers a better trade-off still. In fact, we conjecture that for many distributions no sketch will beat brute-force based approaches.

The Layer-Hiding Hash Sketch. Our first contribution is to provide a new sketch that we call the layer-hiding hash (LHH) sketch. We prove that LHH enjoys a bound of \({\tilde{\mathrm {H}}_\infty }({W|s}) \ge {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) - \log (1/\delta ) - 1\), yielding a substantial saving of \(\log {\mathrm {H}_0}({W})\) bits of entropy over FRS. The LHH starts with the same general approach as FRS, that of putting passwords into layers based on their probability. The key insight is that one can, as the name implies, hide the layer of the password underlying a sketch. To do so, the construction takes the output of applying a layer-specific strongly universal hash to the password and pads it to a carefully chosen maximum length with random bits. During recovery, one looks for a matching prefix of the sketch value when applying (differing length) strongly universal hashes. Hiding the layer intuitively avoids leaking additional information to the adversary, but, counterintuitively, the proof of security does not require any properties of the hash functions used. Rather, the proof uses only that the hash outputs have bounded length, together with the fact that (unlike in the FRS construction) sketches from different layers can collide. The proof of correctness relies on the strong universality of the underlying hashes.

LHH’s bound improves over FRS (and, consequently, all other known constructions) because it removes the \(\log {\mathrm {H}_0}({W})\) term. The improvement in the bound can be significant. Assuming \(W\) places non-zero probability on all passwords from the RockYou password leak [29] already makes \(\log {\mathrm {H}_0}({W}) \ge 3\). The min-entropy (let alone fuzzy min-entropy) of password distributions is typically measured to be only about 7 bits, making a loss of 3 bits significant. Of course, as pointed out by FRS, the loss due to \(\log (1/\delta )\) — which LHH also suffers — is likely to be even more problematic since we would like \(\delta \) to be small. An important question left open by our work is whether one can build a sketch that replaces \(\log (1/\delta )\) with the optimal \(\log (1 - \delta )\).

Sketch-Based Typo-Tolerant Checking. A seemingly attractive way of building a typo-tolerant password-based authentication scheme is to store a sketch of the password along with a slow-to-compute hash of the password. To later authenticate a submitted string, one first checks it with the slow hash and, if that fails, uses the sketch to error correct, and checks the result with the slow hash. In terms of security, we are primarily concerned about attackers that obtain (e.g., by system compromise) the sketch and slow hash value and mount offline brute-force dictionary attacks. The sketch will leak some information useful to the attacker.

The first challenge that arises in security analysis is that the traditional sketch security measure, conditional min-entropy \({\tilde{\mathrm {H}}_\infty }({W|s})\), does not provide good bounds when adversaries can make many guesses. The reason is that it measures the worst-case probability of guessing the message given the sketch in a single attempt, and for non-flat distributions the success probability of subsequent guesses after the first will be much lower. We therefore introduce a more general conditional q-min-entropy notion, denoted \({\tilde{\mathrm {H}}^{q}_\infty }({W|s})\). It is the worst-case aggregate probability of a message being any of q values, conditioned on the sketch. We revisit secure sketches in this new regime and analyze the q-min-entropy for the FRS and LHH constructions. These results are actually strictly more general since they cover the \(q=1\) bounds as well, and so in the body we start with the more general treatment and show the \(q=1\) results mentioned above as corollaries.

Popularity-Proportional Hashing. We also offer a new distribution-sensitive variant of brute-force checking called popularity-proportional hashing. Recall that brute-force checking uses the same slow hash function for all passwords. In popularity-proportional hashing, we use knowledge of the distribution of passwords to tune the time required to hash each password. The more popular a password, equivalently the more probable, the longer the hash computation.

In typo-tolerant hashing this has a nice effect for certain distributions: the ball of possible passwords around a submitted string will consist of a mix of lower- and higher-probability points, making the aggregate time required to check all of them lower than in brute-force checking. Timing side-channels can be avoided by fixing an upper bound on this aggregate time, and setting the hashing costs of the scheme such that every password can be checked within this time. The checking algorithm is then implemented to process each password for this maximum time, and accordingly its run time reveals nothing about the password being checked. This serves to “smooth” the distribution from the point of view of a brute-force attacker, who must choose between checking a popular password versus lower-cost checks of less probable passwords. We shall ultimately see that PPH offers a better time / security trade-off than sketch-based checking using both FRS and LHH. We note that the benefits of popularity-proportional hashing appear to be specific to the typo-tolerant setting; in exact checking schemes one would want to hash passwords with the maximum cost allowed by the runtime of the scheme, regardless of their weight.

Comparing the Approaches. We use the following methodology to compare the time / security trade-offs of the various approaches to error-tolerant authentication. First, one fixes an error setting, such as choosing a set of 64 frequently made typos, as well as a runtime budget \(\textsf {RT}\) for authentication. Then, one compares the brute-force attack security of various constructions that operate in time at most \(\textsf {RT}\) and correct the specified errors. So for brute-force checking, for example, one must pick a slow hash function that takes \(\textsf {RT}/64\) time to compute, and for secure sketches one can use a slow hash of time \(\textsf {RT}/2\) (where for simplicity we ignore the sketch cost, which is in practice negligible relative to \(\textsf {RT}\)). For popularity-proportional hashing one picks hash speeds so that the ball whose passwords have the highest aggregate probability can be checked in time \(\textsf {RT}\).

With this framework in place, we prove that PPH provides a better time / security trade-off than both FRS-assisted and LHH-assisted checking. The proofs require lower-bounding the success of a computationally efficient attacker against the FRS and LHH constructions, whose runtime constraint affords him \(q\) slow hash queries (equivalently, \(q\) guesses at the underlying password). The attack is simple: enumerate probable passwords, check which match the sketch, and output the heaviest q that match. It may not be obvious that this attack is efficient; we argue in the body that it is.

To analyze the attacker’s success, we use a proof strategy which at a high level proceeds as follows. We first model the hash underlying the sketch as a random oracle. This is conservative, as it can only make the adversary’s task harder. We then transform the analysis of the attacker’s success probability into a type of balls-in-bins analysis that differs slightly based on the construction. For the FRS case, which is simpler, balls of differing sizes represent passwords of differing weights, and bins represent individual sketch values within a layer. The random oracle ‘throws’ the balls into the bins; the compromise of a sketch and subsequent guessing attack is captured by sampling a bin and allowing the attacker to choose \(q\) balls from it. As such, computing a lower bound on the \(q\)-conditional min-entropy reduces to computing the expected (over the random oracle coins) aggregate weight of the q heaviest balls across all bins.

Rather than tackling this expectation directly, we form a direct comparison with PPH by showing that, with overwhelming probability, the set of points queried by an optimal brute-force adversary running in the same time against PPH is included in the set of points that the adversary against FRS chooses. As such, a brute-force attacker against FRS-assisted checking will always either match or (more often) beat attacks against PPH. We derive a similar result for LHH-assisted checking via a modified balls-in-bins experiment.

With the improvement of PPH over sketch-assisted checking established, we next compare PPH and brute-force checking. We quantify precisely the conditions which determine whether PPH or brute-force checking represents the better trade-off for a given error setting, and show that for certain error settings PPH allows us to correct many more errors securely than brute-force checking.

While PPH can be shown to improve on sketch-assisted checking for any distribution, the same is not true for brute-force checking — indeed there exist settings in which brute-force checking leads to a dramatic reduction in security — and in general, comparing the brute-force and sketch-assisted approaches directly appears technically challenging. However, by combining the above results, we show that for certain error settings (including passwords) the seemingly simplest brute-force checking approach provides the best trade-off of all — first by invoking the result showing PPH outperforms sketch-assisted checking, and then by showing that brute-force checking offers an even better trade-off than PPH. As such, for any given error setting, our results can be used to determine how many errors can be tolerated, and whether PPH or brute-force checking offers the better approach to typo-tolerance.

Extensions and Open Problems. We frame our results in the context of typo-tolerant password hashing and (reflecting the majority of in-use password hashing functions) primarily measure hashing cost in terms of time. We will in Sect. 7 briefly discuss how our results may be extended to incorporate memory-hard functions [1,2,3, 6, 26] and indicate other cryptographic applications, such as authenticated encryption and fuzzy extraction, in which they are applicable. Finally we will discuss the key open problem — can any distribution-sensitive secure sketch offer a better time / security trade-off than brute-force based approaches? We conjecture that for a large class of error settings no sketch can perform better. We offer some intuition to this end, and highlight it as an interesting direction for future research.

2 Definitions and Preliminaries

Notation. The set of binary strings of length n is denoted by \(\{0,1\}^n\). We use \(\perp \) to represent the null symbol. We write x||y to denote the concatenation of two binary strings x and y, and \([y]^{j}_1\) to denote the binary string y truncated to the lowest order j bits. We let [j] denote the set of integers from 1 to j inclusive, and \([j_1, j_2]\) the set of integers between \(j_1\) and \(j_2\) inclusive. The notation \(x \;{\mathop {\leftarrow }\limits ^{\,\$}}\;{\mathcal {X} }\) denotes sampling an element uniformly at random from the set \(\mathcal {X}\), and we let \(x \;{\mathop {\leftarrow }\limits ^{\,W}}\;{\mathcal {X} }\) denote sampling an element from the set \(\mathcal {X}\) according to the distribution \(W\). All logs are to base 2, and e denotes the base of the natural logarithm. For a given distribution \(W\) where \({\mathcal {M}}= \text {supp}(W)\), we let \({w}_1, \dots , {w}_{|{\mathcal {M}}|}\) denote the points in the support of \(W\) in order of descending probability, with associated probabilities \(p_{1}, \dots , p_{|{\mathcal {M}}|}\).

Hash Functions. Here we recall the definitions of universal and strongly universal hash function families.

Definition 1

A family of hash functions \(\mathsf {F}\ : \ {\mathcal {S}} \times \{0,1\}^{\ell } \rightarrow \{0,1\}^{d}\) is said to be universal if for all \({w}\ne {w}^\prime \in {\mathcal {S}} \), it holds that

$$\begin{aligned} {\Pr \left[ \,{\mathsf {F}({w}; \mathsf{sa}) = \mathsf {F}({w}^\prime ; \mathsf{sa})\ : \ {\mathsf{sa}\;{\mathop {\leftarrow }\limits ^{\,\$}}\;{\{0,1\}^\ell }}}\,\right] } = 2^{-d}\,. \end{aligned}$$

Definition 2

A family of hash functions \(\mathsf {F}\ : \ {\mathcal {S}} \times \{0,1\}^{\ell } \rightarrow \{0,1\}^{d}\) is said to be strongly universal if for all \({w}\ne {w}^\prime \in {\mathcal {S}} \), and \(y, y ^\prime \in \{0,1\}^{d}\), it holds that

$$\begin{aligned} {\Pr \left[ \,{\mathsf {F}({w}; \mathsf{sa}) = y\ : \ {\mathsf{sa}\;{\mathop {\leftarrow }\limits ^{\,\$}}\;{\{0,1\}^\ell }}}\,\right] } = 2^{-d}\,, \text { and} \end{aligned}$$
$$\begin{aligned} {\Pr \left[ \,{\mathsf {F}({w}; \mathsf{sa}) = y \wedge \mathsf {F}({w}^\prime ; \mathsf{sa}) = y ^\prime \ : \ {\mathsf{sa}\;{\mathop {\leftarrow }\limits ^{\,\$}}\;{\{0,1\}^\ell }}}\,\right] } \le 2^{-2d}\,. \end{aligned}$$
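
To make these definitions concrete, the following is a minimal Python sketch (our own illustration, not a construction from the paper) of a strongly universal family: the classic pairwise-independent family \(h_{a,b}(x) = (ax + b) \bmod p\) with salt \((a, b)\). The prime \(P\) and the integer encoding of messages are illustrative assumptions, and outputs live in \(\mathbb {Z}_p\) rather than \(\{0,1\}^{d}\).

```python
import secrets

# For any fixed x != x', the map (a, b) -> (h(x), h(x')) is a bijection on
# Z_p^2, so the output pair is uniform when the salt is uniform.  This gives
# both conditions of Definition 2, with 2^{-d} replaced by 1/p; mapping Z_p
# exactly onto d-bit strings is a detail we gloss over here.
P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the message space

def sample_salt() -> tuple:
    return (secrets.randbelow(P), secrets.randbelow(P))

def F(x: int, sa: tuple) -> int:
    a, b = sa
    return (a * x + b) % P
```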

Error Settings and Typos. Let \({\mathcal {S}}\) be a set with associated distance function \(\mathsf {dist}{\,:\;}{\mathcal {S}} \times {\mathcal {S}} \rightarrow \mathbb {R}^{\ge 0}\). If \(\mathsf {dist}\) is a metric over \({\mathcal {S}}\) — that is to say that \(\mathsf {dist}\) is non-negative, symmetric, and for all \(x, y, z \in {\mathcal {S}} \), it holds that \(\mathsf {dist}(x, z) \le \mathsf {dist}(x, y) + \mathsf {dist}(y, z)\) — then we say that the pair \(({\mathcal {S}}, \mathsf {dist})\) is a metric space. We can assign to \({\mathcal {S}}\) a distribution \(W\), and let \({\mathcal {M}}\) denote the set of possible messages, \({\mathcal {M}}= \text {supp}(W)\). We set an error threshold \(t\), denoting the maximum distance between points \({w}, \tilde{{w}} \) for which we will consider \(\tilde{{w}} \) an error of \({w}\). Together these components \(({\mathcal {S}}, W, \mathsf {dist}, t)\) define an error setting.

For an error setting \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\), the (closed) ball of size \(t\) around \(\tilde{{w}} \in {\mathcal {S}} \) is the set of all points \({w}^\prime \in \text {supp}(W)\) such that \(\mathsf {dist}({w}^\prime , \tilde{{w}}) \le t\), that is \({B_{t}(\tilde{{w}})} = \{{w}^\prime \in \text {supp}(W)\; | \; \mathsf {dist}({w}^\prime , \tilde{{w}}) \le t\}\). We let \(\beta _{\mathrm {max}}\) denote the size of the largest ball in the error setting; that is to say \(\beta _{\mathrm {max}}= \max _{\tilde{{w}}} | {B_{t}(\tilde{{w}})}|\). In this work, we shall be especially interested in error settings for which balls are efficiently enumerable, a property which we formalize below.

Definition 3

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting with maximum ball size \(\beta _{\mathrm {max}}\). We say \(\mathsf {E}\) has efficiently enumerable balls if there exists an algorithm \(\textsf {Enum} \) which takes as input a point \(\tilde{{w}} \in {\mathcal {S}} \) and outputs a set of points \(\mathcal {L}\) such that for all \(\tilde{{w}} \in {\mathcal {S}} \) it holds that

$$\begin{aligned} {\Pr \left[ \,{\mathcal {L} = {B_{t}(\tilde{{w}})} \ : \ \mathcal {L} \;{\mathop {\leftarrow }\limits ^{\,\$}}\;{\textsf {Enum} (\tilde{{w}})}}\,\right] } = 1\;, \end{aligned}$$

and \(\textsf {Enum} \) runs in time polynomial in \(\beta _{\mathrm {max}}\).
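
For intuition, here is a hypothetical \(\textsf {Enum} \) in Python for an error setting where \(\mathsf {dist}\) is edit distance and \(t= 1\); the alphabet and all names are our own illustrative choices.

```python
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def enum_ball_edit1(w_tilde: str, support: set) -> set:
    """All points of the distribution's support within one deletion,
    substitution, or insertion of the input.  The candidate count is
    O(|w| * |ALPHABET|), comfortably polynomial for password-length strings."""
    candidates = {w_tilde}
    for i in range(len(w_tilde)):
        candidates.add(w_tilde[:i] + w_tilde[i + 1:])            # deletions
        for c in ALPHABET:
            candidates.add(w_tilde[:i] + c + w_tilde[i + 1:])    # substitutions
    for i in range(len(w_tilde) + 1):
        for c in ALPHABET:
            candidates.add(w_tilde[:i] + c + w_tilde[i:])        # insertions
    return candidates & support
```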

Entropy. We now discuss several notions of entropy which capture the maximum success probability of an attacker who attempts to guess a point sampled from a given distribution. Traditionally these notions only consider the case in which the adversary gets one guess. However, in what follows, when we wish to capture the success rate of an adversary performing a brute-force attack, it will be useful to generalize these entropy notions to capture the maximum success probability of an adversary who may output a vector of \(q\) guesses. We define these notions below, generalized to the multi-guess setting; one can recover the familiar definitions by setting \(q= 1\).

Definition 4

Let \(W\) and Z be distributions. We define the q-min-entropy of \(W\), denoted \({\mathrm {H}^{{q}}_\infty }({W})\), to be,

$$\begin{aligned} {\mathrm {H}^{{q}}_\infty }({W}) = -\log {\bigg ( \max _{{w}_1, \dots , {w}_q}\ \sum _{i = 1}^{q} {\Pr \left[ \,{W= {w}_i}\,\right] }\bigg )}\, , \end{aligned}$$

where \({w}_1, \dots , {w}_q\) are distinct elements of \({\mathcal {S}} \). The conditional \(q\)-min-entropy of \(W\) conditioned on Z, denoted \({\tilde{\mathrm {H}}^{q}_\infty }({W|Z})\), is defined to be,

$$\begin{aligned} {\tilde{\mathrm {H}}^{q}_\infty }({W|Z}) = -\log { \bigg (\sum _{z}\max _{{w}_1, \dots , {w}_q}\sum _{i = 1}^q {\Pr }\left[ \, W= {w}_i\,\left| \right. \,Z= z\,\right] \cdot {\Pr \left[ \,{Z = z}\,\right] }\bigg ) }\; ; \end{aligned}$$

and the q-min-entropy of \(W\) joint with Z, denoted \({\mathrm {H}^{{q}}_\infty }({W, Z})\), is defined,

$$\begin{aligned} {\mathrm {H}^{{q}}_\infty }({W, Z}) = -\log {\bigg (\max _{\begin{array}{c} {w}_1, \dots , {w}_q \\ z_1, \dots , z_q \end{array}}\sum _{i = 1}^q {\Pr \left[ \,{W= {w}_i \wedge Z = z_i}\,\right] }\bigg )}\; , \end{aligned}$$

where the \({w}_1, \dots , {w}_q\) and \(z_1, \dots , z_q\) are distinct elements of the supports of \(W\) and Z respectively. The Hartley entropy of \(W\), denoted \({\mathrm {H}_0}({W})\), is defined to be,

$$\begin{aligned} {\mathrm {H}_0}({W}) = \log {|\text {supp}(W)|}\; . \end{aligned}$$

For an example which surfaces the usefulness of extending min-entropy definitions beyond one guess, consider a pair of distributions \(W_1\) and \(W_2\), such that \(W_1\) is flat with \(2^{-{\mathrm {H}_\infty }({W_1})} = 2^{-\mu }\) and \(W_2\) consists of one point of probability \(2^{-\mu }\) and \(2^{2\mu } - 2^{\mu }\) points of probability \(2^{-2\mu }\). While \({\mathrm {H}^{{1}}_\infty }({W_1}) = {\mathrm {H}^{{1}}_\infty }({W_2}) = \mu \), the two distributions are clearly very different, and in particular an attacker given some \(q > 1\) guesses to predict a value sampled from each of the distributions is going to have a much easier time with \(W_1\). This difference is highlighted when considering the q-min-entropy: \(2^{-{\mathrm {H}^{{q}}_\infty }({W_1})} = q\cdot 2^{-\mu }\), whereas \(2^{-{\mathrm {H}^{{q}}_\infty }({W_2})} = 2^{-\mu } + (q-1) \cdot 2^{-2\mu }\).
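A minimal Python sketch (ours) computes the quantity in Definition 4 for an explicitly listed distribution, and reproduces the comparison above for, say, \(\mu = 4\) and \(q= 8\).

```python
import math

def q_min_entropy(probs, q: int) -> float:
    # -log2 of the aggregate mass of the q heaviest points: the best
    # possible success rate of a q-guess attacker (Definition 4).
    return -math.log2(sum(sorted(probs, reverse=True)[:q]))

mu, q = 4, 8
W1 = [2.0 ** -mu] * 2 ** mu                                     # flat
W2 = [2.0 ** -mu] + [2.0 ** -(2 * mu)] * (2 ** (2 * mu) - 2 ** mu)
print(q_min_entropy(W1, q))   # -log2(8 * 2^-4)        = 1.0 bit
print(q_min_entropy(W2, q))   # -log2(2^-4 + 7 * 2^-8) ~ 3.48 bits
```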

In the \(q= 1\) case, the conditional min-entropy and Hartley entropy are linked via the chain rule for conditional min-entropy [13]. It is straightforward to see that this result extends to the multi-guess setting; for completeness we include a proof in the full version.

Lemma 1

Let \(W, Z\) be distributions. Then

$$\begin{aligned} {\tilde{\mathrm {H}}^{q}_\infty }({W|Z}) \ge {\mathrm {H}^{{q}}_\infty }({W, Z}) - {\mathrm {H}_0}({Z}) \;. \end{aligned}$$

Secure Sketches. Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting. Secure sketches, introduced by Dodis et al. in [13], allow reconstruction of a message which may be input with noise, while preserving as much of the min-entropy of the original message as possible.

In this work we focus on sketches in the distribution-sensitive setting, in which the distribution of secrets is precisely known at the time of designing the sketch. While distribution-sensitivity may not always be feasible, in the case of passwords there is a long line of work on accurately modeling the distribution of human-chosen passwords. Primarily motivated by password cracking, modeling techniques such as hidden Markov models (HMM) [11], probabilistic context free grammars (PCFG) [32, 33], or neural networks [21] use the plethora of real password leaks (e.g., [9]) to learn good estimates of \(W\). See [19] for a detailed discussion of these approaches. Of course, estimates may be wrong. A discussion on the effect of transferring our results to a setting in which the distribution is only approximately known is included in the full version. We recall the formal definition of secure sketches below.

Definition 5

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting. A secure sketch for \(\mathsf {E}\) is a pair of algorithms \(\mathsf {S} =({\mathsf {SS}}, {\mathsf{Rec}})\) defined as follows:

  • \({\mathsf {SS}}\) is a randomized algorithm which takes as input \({w}\in {\mathcal {S}} \), and outputs a bit string \(s \in \{0,1\}^*\).

  • \({\mathsf{Rec}}\) is an algorithm, possibly randomized, which takes as input \(\tilde{{w}} \in {\mathcal {S}} \) and \(s \in \{0,1\}^*\), and outputs \({w}^\prime \in {B_{t}(\tilde{{w}})} \cup \{\perp \}\).

We note that we slightly modify the definition of [15] so that \({\mathsf{Rec}}\) on input \(\tilde{{w}} \) always outputs \({w}^\prime \in {B_{t}(\tilde{{w}})} \cup \{\perp \}\), as opposed to \({w}^\prime \in {\mathcal {S}} \cup \{\perp \}\). As we shall see in the following definition, we only require \({\mathsf{Rec}}\) to return the correct point if that point lies in \({B_{t}(\tilde{{w}})}\). As such this is mainly a syntactic change, and all pre-existing sketch constructions discussed in this work already adhere to the condition. In the following definition, we generalize the security requirement to the multi-guess setting in the natural way; the usual definition (e.g. [13, 15]) is obtained by setting \(q= 1\).

Definition 6

A sketch \(\mathsf {S} =({\mathsf {SS}},{\mathsf{Rec}})\) is an \((({\mathcal {S}}, W, \mathsf {dist}, t), \bar{\mu }_q, \delta )\)-secure sketch if:

  1.

    (Correctness) For all \({w}, \tilde{{w}} \in {\mathcal {S}} \) for which \(\mathsf {dist}({w}, \tilde{{w}}) \le t\), it holds that

    $$\begin{aligned} {\Pr \left[ \,{{w}= {w}^\prime \ : \ {w}^\prime \leftarrow {\mathsf{Rec}}(\tilde{{w}},{\mathsf {SS}}({w}))}\,\right] }\ge 1 - \delta \; , \end{aligned}$$

    where the probability is taken over the coins used by \({\mathsf {SS}}\) and \({\mathsf{Rec}}\).

  2.

    (Security) The \(q\)-min-entropy of \(W\) conditioned on \(\mathsf {SS}(W)\) is such that,

    $$\begin{aligned} {\tilde{\mathrm {H}}^{q}_\infty }({W|{\mathsf {SS}}(W)}) \ge \bar{\mu }_q\; . \end{aligned}$$

Since the help string \(s \) is public, any party — including the adversary — can query \({\mathsf{Rec}}(\cdot , s)\) on \(\tilde{{w}} \in {\mathcal {S}} \). In an ideal secure sketch, knowledge of \(s\) would offer no greater advantage than that gained via oracle access to \({\mathsf{Rec}}(\cdot , s)\). In this case, an adversary challenged to guess the original value \({w}\in {\mathcal {S}} \) is forced to guess some \(\tilde{{w}} \) such that \(\mathsf {dist}({w}, \tilde{{w}}) \le t\). To capture this notion of ideal secure sketch security, Fuller et al. [15] introduce the notion of fuzzy min-entropy, which we generalize to the multi-guess setting in the natural way.

Definition 7

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting. The \(q\)-fuzzy min-entropy of \(W\) is defined to be,

$$\begin{aligned} {\mathrm {H}^{q, \mathrm{fuzz}}_{t, \infty }}({W}) = - \log \bigg ( \max _{\tilde{{w}} _1, \dots , \tilde{{w}} _q} \sum _{{w}^\prime \in \cup _{i = 1}^{q}{B_{t}(\tilde{{w}} _i)}} {\Pr \left[ \,{W= {w}^\prime }\,\right] } \bigg )\; , \end{aligned}$$

where \(\tilde{{w}} _1, \dots , \tilde{{w}} _q\) are distinct elements of \({\mathcal {S}} \).
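
For tiny, fully specified error settings, the maximization in Definition 7 can be evaluated exhaustively; the sketch below (ours, and exponential in \(q\)) restricts candidate centers to the support purely for brevity, whereas the definition allows any points of \({\mathcal {S}} \).

```python
import math
from itertools import combinations

def q_fuzzy_min_entropy(support, prob, dist, t, q):
    """The attacker picks the q centers whose radius-t balls jointly
    cover the most probability mass; purely illustrative."""
    def ball(w_tilde):
        return {w for w in support if dist(w, w_tilde) <= t}
    best = max(
        sum(prob[w] for w in set().union(*(ball(c) for c in centers)))
        for centers in combinations(support, q)
    )
    return -math.log2(best)
```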

3 New Bounds for FRS Sketches

In this section we describe and analyze two constructions of secure sketches due to Fuller, Reyzin, and Smith [15]. The FRS sketches have a number of attractive properties. The first is that these are the only secure sketches (to our knowledge) that can be utilized with any choice of distance function \(\mathsf {dist}\). We would like this flexibility so that ultimately we can tailor the distance function to the context of correcting password typos, for which traditional metrics such as edit distance are not best suited since typos are non-symmetric [12].

Even if edit distance were appropriate, we know of no constructions which provide sufficient security when used with parameters typical of password distributions. Constructions in [13, 14] either embed the edit metric into the Hamming or set-difference metrics using a low-distortion embedding of Ostrovsky and Rabani [25], or use a novel c-shingling technique.

As pointed out in [12], when applied to typical password distributions, which have a large alphabet of 96 ASCII characters, these constructions incur entropy losses of \(\approx 91\) bits and \(\approx 31\) bits respectively, even if we only attempt to correct edit-distance-one errors. Given that password distributions typically have at most 8 bits of min-entropy [8], it is clear these constructions are unsuitable for our purposes.

Most importantly, the FRS constructions achieve almost optimal security in the \(q= 1\) case. It was shown in [15] that high fuzzy min-entropy is a necessary condition for the existence of a good secure sketch or fuzzy extractor for a given error setting, surfacing a lower bound on the security of such schemes. We recall the result in the lemma below, which we extend to the multi-guess setting. The proof is given in the full version.

Lemma 2

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting, and let \(\mathsf {S} = (\mathsf {SS}, {\mathsf{Rec}})\) be an \((({\mathcal {S}}, W, \mathsf {dist}, t), \bar{\mu }_q, \delta )\)-secure sketch. Then \(\bar{\mu }_q\le {\mathrm {H}^{q, \mathrm{fuzz}}_{t, \infty }}({W}) - \log (1 - \delta )\).

FRS showed that in the distribution-sensitive setting, in which the precise distribution is known at the time of building the sketch, high fuzzy min-entropy also implies the existence of a good secure sketch for that distribution. We recall their constructions, and prove new results about them.

3.1 Secure Sketches for Flat Distributions

FRS describe a secure sketch which is nearly optimal for error settings \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) such that \(W\) is flat, which we recall in Fig. 1. We refer to this construction as \(\mathsf {FRS1} =(\mathsf {FRS1 - SS}, \mathsf {FRS1 - Rec})\).

The construction is built from a universal hash function family with output length \(\log (\beta _{\mathrm {max}}) + \log (1/\delta )\) bits, where \(\beta _{\mathrm {max}}\) denotes the size of the largest ball in the error setting. \(\mathsf {FRS1 - SS} \) chooses a salt \(\mathsf{sa}\;{\mathop {\leftarrow }\limits ^{\,\$}}\;{\{0,1\}^\ell }\), computes \(y = \mathsf {F}({w}; \mathsf{sa})\), and outputs \(s= (y, \mathsf{sa})\). On input \(\tilde{{w}} \in {\mathcal {S}} \) and \(s\), \(\mathsf {FRS1 - Rec} \) searches in \({B_{t}(\tilde{{w}})}\) for a point \({w}^\prime \) such that \(\mathsf {F}({w}^\prime ; \mathsf{sa}) = y \), returning the first match it finds. The authors note that the construction is not novel, with universal hash functions representing a commonly used tool for information reconciliation (e.g., [5, 28, 30]). Correctness follows from a straightforward application of Markov’s Inequality. In the following lemma we extend the analysis to cover the \(q\)-conditional min-entropy. The proof is given in the full version.

Lemma 3

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting for which \(W\) is flat, and let \(\beta _{\mathrm {max}}\) denote the size of the largest ball. Let \(\mathsf {FRS1} = (\mathsf {FRS1 - SS}, \mathsf {FRS1 - Rec})\) be as described in Fig. 1, and let \(\mathsf {F}\ : \ {\mathcal {S}} \times \{0,1\}^\ell \rightarrow \{0,1\}^{\log (\beta _{\mathrm {max}}) + \log (1/\delta )}\) be a family of universal hash functions where \(0< \delta < 1\). Then \(\mathsf {FRS1} \) is a \((({\mathcal {S}}, W, \mathsf {dist}, t), \bar{\mu }_q, \delta )\)-secure sketch, where

$$\begin{aligned} \bar{\mu }_q\ge {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) - \log (q) - \log (1/\delta )\;. \end{aligned}$$

Fig. 1. Construction of a secure sketch \(\mathsf {FRS1} = (\mathsf {FRS1 - SS}, \mathsf {FRS1 - Rec})\) for an error setting \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) from a universal hash function family \(\mathsf {F}: {\mathcal {S}} \times \{0,1\}^\ell \rightarrow \{0,1\}^{\log (\beta _{\mathrm {max}}) + \log {(1/\delta )}}\). Here \(\beta _{\mathrm {max}}\) denotes the size of the largest ball in the error setting.
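
To make Fig. 1 concrete, here is a minimal Python rendering (ours); the parameters F, sample_salt and enum_ball stand for the universal hash family, salt sampler, and ball enumerator of the error setting, as in the earlier illustrative sketches.

```python
def frs1_ss(w, F, sample_salt):
    """FRS1-SS: the sketch is (hash of the true point, fresh salt), where F
    is assumed universal with log(beta_max) + log(1/delta) output bits."""
    sa = sample_salt()
    return (F(w, sa), sa)

def frs1_rec(w_tilde, sketch, F, enum_ball):
    """FRS1-Rec: return the first point in the ball around the input whose
    hash matches; None plays the role of the bottom symbol."""
    y, sa = sketch
    for w_prime in sorted(enum_ball(w_tilde)):
        if F(w_prime, sa) == y:
            return w_prime
    return None
```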

3.2 Layered Hashing for Non-flat Distributions

The above construction may be significantly less secure in settings where the distribution in question is non-flat. In this case, having high fuzzy min-entropy does not exclude the possibility that the distribution contains a dense ball consisting of many low probability points. Disambiguating between points in this dense ball forces a hash function with a large range to be used, which leaks more information to adversaries.

The key idea is to split the support of the distribution into nearly flat layers; the layer in which a point lies is determined by its probability, and the layers are defined such that the probabilities of points in any given layer differ by at most a factor of two. We include the index of the layer in which a point lies as part of its sketch, and then apply the secure sketch for flat distributions of Lemma 3 tuned to the parameters of the appropriate layer. Revealing the layer in which a point lies degrades security; in an attempt to limit the damage, the number of layers is restricted so the extra loss amounts to \(\log {\mathrm {H}_0}({W}) + 1\) bits; for full details of the proof see [15].

For simplicity of exposition, we assume the existence of an efficient algorithm \(\textsf {L}\) which takes as input a point \({w}\in {\mathcal {S}} \) and outputs the index \(j\in \mathcal {J}\) of the layer in which it lies. We note that the parameters required to compute the cut-off points between layers are readily obtained from the password model, so computing the partitions is straightforward in practice; provided we can efficiently look up the weights of points in the password model, the algorithm \(\textsf {L}\) will be efficient also. The full construction is given in Fig. 2.

Fig. 2. Secure sketch \(\mathsf {FRS2} = (\mathsf {FRS2 - SS}, \mathsf {FRS2 - Rec})\) for an error setting \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) from a set of universal hash function families \(\mathsf {F}_j\ : \ {\mathcal {S}} \times \{0,1\}^{\ell _j} \rightarrow \{0,1\}^{j- {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) + \log (1/\delta ) + 1}\) for \(j\in [\mu , \lambda ]\), utilizing layering \(\mathsf {FRS2 - Layer} \).
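
The layering and sketching steps of Fig. 2 can be rendered as the following minimal Python sketch (ours); prob is assumed to be a lookup table from the password model, and F_for_layer(j) the layer-\(j\) universal hash family.

```python
import math

def layer(prob_w: float, mu: int, lam: int) -> int:
    # A point with probability in (2^{-(j+1)}, 2^{-j}] lies in layer j;
    # indices are clamped to [mu, lam], with all lighter points lumped
    # into the final layer (our reading of the FRS2 layering).
    return min(max(math.floor(-math.log2(prob_w)), mu), lam)

def frs2_ss(w, prob, mu, lam, F_for_layer, sample_salt):
    """FRS2-SS: reveal the layer index j alongside a layer-specific
    universal hash of w, whose output length grows with j."""
    j = layer(prob[w], mu, lam)
    sa = sample_salt()
    return (j, F_for_layer(j)(w, sa), sa)
```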

Theorem 1

[15]. Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting. Let \(\mathsf {F}_j\ : \ {\mathcal {S}}\ \times \ \{0,1\}^{\ell _j}\rightarrow \{0,1\}^{j- {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) + \log (1/\delta ) + 1}\) be a family of universal hash functions, where \(0 < \delta \le \frac{1}{2}\). Consider \(\mathsf {FRS2} = (\mathsf {FRS2 - SS}, \mathsf {FRS2 - Rec})\) with layering \(\mathsf {FRS2 - Layer} \) as defined in Fig. 2. Then \(\mathsf {FRS2} \) is a \((({\mathcal {S}}, W, \mathsf {dist}, t), \bar{\mu }_1, \delta )\)-secure sketch where

$$\begin{aligned} \bar{\mu }_1 = {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W})- \log {{\mathrm {H}_0}({W})} - \log {(1/\delta )} - 1. \end{aligned}$$

In the following theorem, we provide an analysis for \(\mathsf {FRS2} \) in the \(q\)-min-entropy setting. Our analysis also provides a tighter bound in the case that \(q= 1\). The full proof is given in the full version.

Theorem 2

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting, where \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = \tilde{\mu }\). Let \(\mathsf {F}_j\ : \ {\mathcal {S}}\ \times \ \{0,1\}^{\ell _j}\rightarrow \{0,1\}^{j- {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) + \log (1/\delta ) + 1}\) be a family of universal hash functions, where \(0 < \delta \le \frac{1}{2}\). Consider \(\mathsf {FRS2} = (\mathsf {FRS2 - SS}, \mathsf {FRS2 - Rec})\) with layering \(\mathsf {FRS2 - Layer} \) as defined in Fig. 2. Then \(\mathsf {FRS2} \) is a \((({\mathcal {S}}, W, \mathsf {dist}, t), \bar{\mu }_q, \delta )\)-secure sketch for,

$$\begin{aligned} 2^{-\bar{\mu }_q} \le {\Pr \left[ \,{W\in {L}_{\lambda }}\,\right] } + \sum _{j= \mu }^{\lambda - 1 } {\Pr \left[ \,{W\in {L}_{j}( q\cdot |R _j|)}\,\right] } \; . \end{aligned}$$

Here \({L}_{j}(q^\prime )\) denotes the set of the \(\min \{q^\prime , |{L}_{j}|\}\) heaviest points in layer \({L}_{j}\). We let \(R_j= range(\mathsf {F}_j)\) and let \({L}_{\lambda } = \{{w}\in W\ : \ {\Pr \left[ \,{W= {w}}\,\right] } < 2^{-\lambda }\}\) where \(\lambda = {{\mathrm {H}_\infty }({W}) +\left\lfloor {\mathrm {H}_0}({W}) - 1 \right\rfloor }\).

We note that the additional tightness in the bound is especially beneficial when considering distributions with many sparsely populated or empty layers. To give a concrete example of a distribution for which the tightness in the bound makes a significant difference, consider an error setting for which \(W\) contains \(2^{99}\) points of weight \(2^{-100}\), and \(2^{199}\) points of weight \(2^{-200}\), and the case that \(q = 1\). Since \({\mathrm {H}_0}({W}) \approx 199 \), the bound of Theorem 1 implies that \({\tilde{\mathrm {H}}_\infty }({W|\mathsf {FRS2 - SS} (W)}) \ge {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) - \log (1/\delta ) - 8.64 \). In contrast applying the new bound of Theorem 2 implies that \({\tilde{\mathrm {H}}_\infty }({W|\mathsf {FRS2 - SS} (W)}) \ge {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) - \log (1/\delta ) - 2\), (since \({\Pr \left[ \,{W\in {L}_{j}( |R _j|)}\,\right] } \le 2^{-{\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) + \log (1/\delta ) + 1}\) for \(j= 100, 200\), and 0 otherwise). This results in a saving of over 6.6 bits of entropy.
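
A two-line check of this arithmetic (ours):

```python
import math

# H_0(W) = log2(2^99 + 2^199) ~ 199, so the Theorem 1 loss beyond
# log(1/delta) is log2(H_0(W)) + 1 ~ 8.64 bits, versus 2 bits under
# Theorem 2 (two populated layers, each term at most 2^{-mu~ + log(1/delta) + 1}).
H0 = math.log2(2 ** 99 + 2 ** 199)
print(H0, math.log2(H0) + 1)   # ~199.0, ~8.64
```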

4 A New Construction: Layer Hiding Hash

In this section we present a new construction which yields a substantial increase in security over \(\mathsf {FRS2} \), while enjoying the same degree of correctness. The construction, which we call layer-hiding hash and denote \(\mathsf {LHH} = (\mathsf {LHH - SS}, \mathsf {LHH - Rec})\), is similar to \(\mathsf {FRS2} \), but crucially does not explicitly reveal the layer in which a point lies as part of the sketch.

First we split the distribution into layers as shown in Fig. 3. Note that this layering is slightly different from that used in \(\mathsf {FRS2} \). We now require a family of strongly universal hash functions, which we use to hash points \({w}\in {\mathcal {M}}\) to a fixed length which is a parameter of the scheme, and then truncate this hash to various lengths depending on the layer in which the point lies (in turn creating a family of strongly universal hash functions for each layer). The strong universality of the hash is required for the proof of correctness, in which we bound the probability that the hash of a point \({w}\) collides with a given string; such a collision represents a recovery error, and the lengths of the truncated hashes are chosen such that the probability this event occurs is at most \(\delta \).

The twist that enables the security savings is that rather than outputting this truncated hash as is and revealing the layer in which a point lies, we now view this hash as a prefix. The sketch is then computed by choosing a string at random from the set of all strings of a given length (a parameter of the scheme) which share that prefix. This is done efficiently by padding the hash with the appropriate number of random bits. The effect of this is to nearly flatten the joint distribution of \(W\) and \(\mathsf {SS}(W)\) such that for all \({w}\in {\mathcal {M}}\) and \(s\in \text {supp}(\mathsf {SS}(W))\), it holds that \({\Pr \left[ \,{W= {w}\wedge \mathsf {SS}(W) = s}\,\right] } \le 2^{-(\gamma + \ell )}\) (where \(\gamma \) indexes the layer of least probable points, and \(\ell \) denotes the length of the salt), regardless of the layer in which the point lies. During recovery, \(\mathsf {LHH - Rec} \) searches in the ball around the input for a point whose truncated hash matches the prefix of the sketch value, and outputs the first match it finds. The full construction is shown in Fig. 3.

Fig. 3. Construction of secure sketch \(\mathsf {LHH} = (\mathsf {LHH - SS}, \mathsf {LHH - Rec})\) for an error setting \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) with \(\tilde{\mu }= {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W})\), from a family of strongly universal hash functions \(\mathsf {F}: {\mathcal {S}} \times \{0,1\}^{\ell } \rightarrow \{0,1\}^{\gamma - \tilde{\mu }+ \log (\frac{1}{\delta }) + 1}\), utilizing layering \(\mathsf {LHH - Layer} \).
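
A minimal Python rendering of Fig. 3 (ours): here F is assumed to return its \(\gamma - \tilde{\mu }+ \log (1/\delta ) + 1\)-bit output as a ‘0’/‘1’ string, and layer_of maps a point to its layer index \(j\).

```python
import secrets

def lhh_ss(w, j, full_len, mu_tilde, log_inv_delta, F, sample_salt):
    """LHH-SS: truncate a strongly universal hash of w to the layer-j
    prefix length j - mu~ + log(1/delta) + 1, then pad with random bits
    to full_len bits, so the sketch itself does not reveal j."""
    prefix_len = j - mu_tilde + log_inv_delta + 1
    sa = sample_salt()
    prefix = F(w, sa)[:prefix_len]
    pad = ''.join(secrets.choice('01') for _ in range(full_len - prefix_len))
    return (prefix + pad, sa)

def lhh_rec(w_tilde, sketch, layer_of, mu_tilde, log_inv_delta, F, enum_ball):
    """LHH-Rec: a candidate matches if its truncated hash equals the
    sketch prefix of the corresponding length."""
    s, sa = sketch
    for w_prime in sorted(enum_ball(w_tilde)):
        plen = layer_of(w_prime) - mu_tilde + log_inv_delta + 1
        if F(w_prime, sa)[:plen] == s[:plen]:
            return w_prime
    return None
```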

In the following theorem we analyze the correctness and security of \(\mathsf {LHH} \), and emphasize the substantial entropy saving in the \(q= 1\) case of \(\log {\mathrm {H}_0}({W})\) bits in comparison to \(\mathsf {FRS2} \). The proof is given in the full version.

Theorem 3

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting. Let \(\mathsf {F}: {\mathcal {S}} \times \{0,1\}^{\ell } \rightarrow \{0,1\}^{\gamma -{\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) + \log {(1/\delta )} + 1}\) be a family of strongly universal hash functions where \(0< \delta < 1\). Let \(\mathsf {LHH} = (\mathsf {LHH - SS}, \mathsf {LHH - Rec})\) be as shown in Fig. 3 with layering \(\mathsf {LHH - Layer} \). Then \(\mathsf {LHH} \) is a \((({\mathcal {S}}, W, \mathsf {dist}, t), \bar{\mu }_q, \delta )\)-secure sketch, where

$$\begin{aligned} \bar{\mu }_q= {\mathrm {H}^{{q\cdot \eta }}_\infty }({W^\prime })\;. \end{aligned}$$

Here \(\eta = 2^{\gamma - {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) + \log (1/\delta ) + 1}\), and \(W^\prime \) is the distribution constructed by taking each point \({w}\in {\mathcal {M}}\) and replacing it with \(2^{(\gamma - j)}\) points, each of weight \({\Pr \left[ \,{W= {w}}\,\right] } \cdot 2^{-(\gamma - j)}\), where \({w}\in {L}_{j}\). In particular in the case where \(q= 1\), this gives,

$$\begin{aligned} \bar{\mu }_1 \ge {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) - \log {(1/\delta )} -1\; . \end{aligned}$$

5 Typo-Tolerant Password-Based Key Derivation

In this section we consider the application of secure sketches to typo-tolerant password-based key-derivation functions (PBKDF). PBKDFs are used in a number of settings, for example in password-based authentication during login and password-based encryption. PBKDFs are designed to slow attackers that mount brute-force attacks, by incorporating a computationally slow and / or memory-consuming task.

We begin by treating password-based authentication schemes (PBAS). We discuss how to extend to other PBKDF settings in Sect. 7. Roughly speaking, our results will apply in any situation in which the PBKDF-derived key is used in a cryptographically strong authentication setting, including notably password-based authenticated encryption. We will use an oracle model to capture computational slowness, analogous to prior treatments of PBKDFs in the random oracle model (ROM) [4, 34]. We will denote by \(\mathbf{H}\) the oracle, and assume it behaves as a random oracle mapping arbitrary length strings to randomly chosen strings of a fixed length \({\ell _H}\). We let \(\mathbf{H}\) take an additional input \(\mathsf {c}\) representing the unit cost of querying \(\mathbf{H}\). We formalize such schemes below, following [12].

Definition 8

A PBAS is a pair of algorithms \(\mathsf {PBAS}= (\mathrm {Reg}, {\mathrm {Chk}})\) defined as follows:

  • \(\mathrm {Reg}^\mathbf{H}\) is a randomized algorithm which takes as input a password \({w}\in {\mathcal {M}}\) and returns a string \(h\).

  • \({\mathrm {Chk}}^\mathbf{H}\) is a (possibly randomized) algorithm which takes as input \(\tilde{{w}} \in {\mathcal {S}} \) and string \(h\), and returns either true or false.

Both algorithms have access to oracle \(\mathbf{H}(\cdot ;\cdot ,\mathsf {c}){\,:\;}\{0,1\}^*\times \{0,1\}^{\ell _\mathsf{sa}} \rightarrow \{0,1\}^{\ell _H}\) where \(\mathsf {c}\) denotes the unit cost of calling \(\mathbf{H}\).

Fig. 4. Security game for password recovery in an offline brute-force cracking attack for a PBAS \(\mathsf {PBAS}= (\mathrm {Reg}, \mathrm {Chk})\) and error setting \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\).

The canonical scheme \(\mathsf {PBAS}= (\mathrm {Reg},\mathrm {Chk})\), used widely in practice, has \(\mathrm {Reg}^\mathbf{H}\) choose a random salt \(\mathsf{sa}\) and output \((\mathsf{sa},\mathbf{H}({w};\mathsf{sa},\mathsf {c}))\). Then, \(\mathrm {Chk} ^\mathbf{H}(\tilde{{w}},(\mathsf{sa},h))\) computes \(h' = \mathbf{H}(\tilde{{w}};\mathsf{sa},\mathsf {c})\) and outputs \(h'\,{\mathop {=}\limits ^{\,?}}\,h\). The runtime is \(\mathsf {c}\), the cost of one query to \(\mathbf{H}\). Typically, the PBKDF \(\mathbf{H}\) will be the \(\mathsf {c}\)-fold iteration \(\mathbf{H}(\cdot ; \cdot , \mathsf {c})= H^{\mathsf {c}}(\cdot ; \cdot )\) of some cryptographic hash function \(H \ : \ \{0,1\}^*\times \{0,1\}^{\ell _\mathsf{sa}} \rightarrow \{0,1\}^{\ell _H}\) which we model as a random oracle. We will, in general, ignore the cost of other operations (e.g., the comparison \(h = h'\)) as they will be dominated by \(\mathsf {c}\). For example, if \(\mathbf{H}\) consists of 10,000 iterations of a hash function such as SHA-256, then \(\mathsf {c}\) would be the cost of 10,000 computations of SHA-256.
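
A minimal Python sketch of the canonical scheme (ours): the oracle \(\mathbf{H}\) is modeled as iterated SHA-256 purely for illustration; deployments would use scrypt, argon2, or PBKDF2.

```python
import hashlib
import secrets

def H(w: str, sa: bytes, c: int) -> bytes:
    """The oracle H as the c-fold iteration of a hash function."""
    h = w.encode() + sa
    for _ in range(c):
        h = hashlib.sha256(h).digest()
    return h

def reg(w: str, c: int = 10_000) -> tuple:
    """Canonical Reg: fresh random salt plus slow hash of the password."""
    sa = secrets.token_bytes(16)
    return (sa, H(w, sa, c))

def chk(w_tilde: str, record: tuple, c: int = 10_000) -> bool:
    """Canonical Chk: recompute and compare in constant time."""
    sa, h = record
    return secrets.compare_digest(H(w_tilde, sa, c), h)
```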

We do not yet consider memory-hardness, and leave a proper treatment of it to future work (see Sect. 7).

Security Against Cracking Attacks. We will focus primarily on security against offline cracking attacks. Should an adversary obtain access to the output of \(\mathrm {Reg}\), we want it to be computationally difficult — in terms of the number of oracle calls to \(\mathbf{H}\) — to recover the password \({w}\). We formalize this in game \(\mathrm {MR} \), shown in Fig. 4, a close relative of existing security notions capturing brute-force attacks against passwords (e.g. [4, 16]). For an error setting \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\), we define the advantage of an adversary \({\mathcal {A}}\) against a scheme \(\mathsf {PBAS}\) by

$$\begin{aligned} \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS},\mathsf {E}}({\mathcal {A}}) = {\Pr \left[ \,{\mathrm {MR} _{\mathsf {PBAS},\mathsf {E}}^{{\mathcal {A}}}\Rightarrow \textsf {true} }\,\right] } \;. \end{aligned}$$

The probability is over the coins used in the game and those of the adversary. We assume that the adversary \({\mathcal {A}}\) has exact knowledge of the error setting \(\mathsf {E}\). The number of queries \({\mathcal {A}}\) may make to oracle \(\mathbf{H}\) is determined by its run time T and the cost \(\mathsf {c}\) of querying \(\mathbf{H}\), and for simplicity all other computations are assumed to be free. For example if \(\mathbf{H}\) has cost \(\mathsf {c}\), then an adversary \({\mathcal {A}}\) running in time T may make \(q= T / \mathsf {c}\) queries to \(\mathbf{H}\).

5.1 Brute-Force Checkers

To improve the usability of a given \(\mathsf {PBAS}= (\mathrm {Reg}, \mathrm {Chk})\) for some error setting \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\), Chatterjee et al. [12] advocate retaining the original \(\mathrm {Reg}\) algorithm but modifying the \(\mathrm {Chk} \) algorithm into a ‘relaxed checker’ that loosens the requirement that a password be entered exactly. They define what we will call the brute-force error-correction scheme \(\mathsf {PBAS - BF}= (\mathrm {Reg},\mathrm {Chk - BF})\) as follows.

Definition 9

Let \(\mathsf {PBAS}= (\mathrm {Reg}, \mathrm {Chk})\), and let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting. Let \(\mathbf{H}(\cdot ; \cdot , \mathsf {c}_{\mathrm {bf}})\) be a random oracle. Then the brute-force error-correction scheme \(\mathsf {PBAS - BF}= (\mathrm {Reg},\mathrm {Chk - BF})\) is defined as follows,

  • \(\mathrm {Reg}({w})\) chooses a salt \(\mathsf{sa}\) at random, and outputs \((\mathsf{sa},\mathbf{H}({w}; \mathsf{sa}, \mathsf {c}_{\mathrm {bf}}))\).

  • \(\mathrm {Chk - BF} (\tilde{{w}}, (\mathsf{sa}, h))\) checks whether \(h = \mathbf{H}(\tilde{{w}}; \mathsf{sa}, \mathsf {c}_{\mathrm {bf}})\) or \(h = \mathbf{H}({w}^\prime ; \mathsf{sa}, \mathsf {c}_{\mathrm {bf}})\) for each \({w}^\prime \in {B_{t}(\tilde{{w}})}\). If it finds a match, it returns true, and otherwise returns false.

Since \(\mathsf {c}_{\mathrm {bf}}\) denotes the unit cost of running \(\mathbf{H}\), it follows that the runtime \(\textsf {RT}\) of this algorithm is the unit cost of \(\mathbf{H}\) times the worst-case ball size, i.e., \(\textsf {RT}= \mathsf {c}_{\mathrm {bf}}\cdot \beta _{\mathrm {max}}\), where \(\beta _{\mathrm {max}} = \max _{\tilde{{w}}}|{B_{t}(\tilde{{w}})}|\). To avoid potential side-channels, one may want to always compute \(\mathbf{H}\) the same number of times, making the run time always \(\textsf {RT}\).
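
A minimal sketch of \(\mathrm {Chk - BF} \) (ours), reusing the illustrative H and \((\mathsf{sa}, h)\) record format from the canonical-scheme sketch above:

```python
import secrets

def chk_bf(w_tilde: str, record: tuple, enum_ball, c_bf: int) -> bool:
    """Accept if the submitted string or any correction in its ball hashes
    to the stored value.  A side-channel-hardened variant would always
    perform exactly beta_max + 1 hash computations rather than returning
    early."""
    sa, h = record
    for w_prime in [w_tilde] + sorted(enum_ball(w_tilde)):
        if secrets.compare_digest(H(w_prime, sa, c_bf), h):
            return True
    return False
```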

An adversary \({\mathcal {A}}\) running in time at most T in game \(\mathrm {MR} \) can make at most \(q_{\mathrm {bf}}= T/\mathsf {c}_{\mathrm {bf}}\) queries to \(\mathbf{H}\). It is straightforward to see that \({\mathcal {A}}\)’s optimal strategy is to query the \(q_{\mathrm {bf}}\) most probable points in \(W\) to \(\mathbf{H}\), and so \(\mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - BF},\mathsf {E}}({\mathcal {A}})~\le ~2^{-{\mathrm {H}^{{q_{\mathrm {bf}}}}_\infty }({W})}\). This value is precisely the \(q\)-success rate of Boztas [10], and is a standard measure of the predictability of a password distribution.

Empirical analysis in [12] finds that when we only attempt to correct a very small number of errors per password (e.g., balls of size at most four), the brute-force checker yields a noticeable increase in usability for a small reduction in security. However, the above security bound highlights a potential limitation of the brute-force checker: if we wish to correct balls with larger numbers of points, we must either accept an impractically long run time, or reduce \(\mathsf {c}_{\mathrm {bf}}\) to a level which for some error settings may result in significant security loss. This is an important consideration in the context of password typos, where the large alphabet (of up to 96 ASCII characters depending on the password creation policy) means that the set of points within edit distance one of a six-character string \(\tilde{{w}} \in {\mathcal {S}} \) contains well over 1000 points. This raises the question of whether secure sketches can be employed to achieve a better time / security trade-off.

5.2 Typo-Tolerant PBAS Using Secure Sketches

The error-correcting properties of secure sketches (see Sect. 2) make them a natural candidate to build typo-tolerant PBAS schemes. We now describe how to compose a secure sketch with any existing PBAS scheme to create a typo-tolerant PBAS. The construction is so simple it is essentially folklore. See also a discussion by Dodis et al. [13]. Our contribution here is merely to formalize it so that we can provide a full security analysis in our computational setting.

Definition 10

Let \(\mathsf {S} = (\mathsf {SS}, {\mathsf{Rec}})\) be a secure sketch for error setting \(\mathsf {E}= ({\mathcal {S}},W, \mathsf {dist}, t)\). Let \(\mathbf{H}(\cdot ; \cdot , \mathsf {c}_{\mathrm {ss}})\) be a random oracle. Then we define the scheme \(\mathsf {PBAS - SS}[\mathsf {S} ] = (\mathrm {Reg - SS},\mathrm {Chk - SS})\) as follows:

  • \(\mathrm {Reg - SS} ({w})\) runs \(\mathsf {SS}({w})\) to obtain a sketch \(s \). It additionally chooses a salt \(\mathsf{sa}\) at random, and outputs \((s,\mathsf{sa},\mathbf{H}({w}; \mathsf{sa}, \mathsf {c}_{\mathrm {ss}}))\).

  • \(\mathrm {Chk - SS} (\tilde{{w}},(s,\mathsf{sa},h))\) first runs \({w}^\prime \leftarrow {\mathsf{Rec}}(\tilde{{w}}, s)\). It then checks whether \(h= \mathbf{H}(\tilde{{w}}; \mathsf{sa}, \mathsf {c}_{\mathrm {ss}})\) or \(h= \mathbf{H}({w}^\prime ; \mathsf{sa}, \mathsf {c}_{\mathrm {ss}})\). If either matches, it returns true, and otherwise returns false.

As written, the run time of checking is always two calls to \(\mathbf{H}\) with unit cost \(\mathsf {c}_{\mathrm {ss}}\); it follows that \(\textsf {RT}= 2 \cdot \mathsf {c}_{\mathrm {ss}}\). One could short-circuit checking by first checking \(\tilde{{w}} \) and only running the sketch recovery if authentication fails; however, timing side-channels would then reveal when a user makes a typo. We would not want to short-circuit the calculations of \(\mathbf{H}\) on the sketch outputs, as this could reveal even more information about \({w}\) to a side-channel adversary.
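
A minimal sketch of the composition (ours), with ss and rec the secure-sketch algorithms and H as in the earlier illustrative PBAS sketch:

```python
import secrets

def reg_ss(w: str, ss, c_ss: int) -> tuple:
    """Reg-SS: store the sketch alongside the salted slow hash."""
    sa = secrets.token_bytes(16)
    return (ss(w), sa, H(w, sa, c_ss))

def chk_ss(w_tilde: str, record: tuple, rec, c_ss: int) -> bool:
    """Chk-SS: always exactly two slow-hash calls, so the run time
    2 * c_ss does not depend on whether the user made a typo."""
    s, sa, h = record
    w_prime = rec(w_tilde, s)
    if w_prime is None:
        w_prime = w_tilde          # dummy second call keeps the time fixed
    ok_input = secrets.compare_digest(H(w_tilde, sa, c_ss), h)
    ok_rec = secrets.compare_digest(H(w_prime, sa, c_ss), h)
    return ok_input or ok_rec
```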

An adversary \(\mathcal{B}\) running in time at most T in game \(\mathrm {MR} \) can make at most \({q_{\mathrm {ss}}}= T/\mathsf {c}_{\mathrm {ss}}\) queries to \(\mathbf{H}\). It is clear that \(\mathcal{B}\)’s optimal strategy on input \((s,\mathsf{sa},\mathbf{H}({w}; \mathsf{sa}, \mathsf {c}_{\mathrm {ss}}))\) is to query the \({q_{\mathrm {ss}}}\) heaviest points when ordered in terms of \({\Pr }\left[ \, W= {w}\,\left| \right. \,\mathsf {SS}(W) = s\,\right] \) to \(\mathbf{H}(\cdot ; \mathsf{sa}, \mathsf {c}_{\mathrm {ss}})\). As such for a given \(\mathsf {S} = (\mathsf {SS}, {\mathsf{Rec}})\) and error setting \(\mathsf {E}\), the definition of \(q\)-conditional min-entropy implies that \(\mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - SS}[\mathsf {S} ],\mathsf {E}}(\mathcal{B}) \le 2^{-{\tilde{\mathrm {H}}^{{{q_{\mathrm {ss}}}}}_\infty }({W|\mathsf {SS}(W)})}.\)

5.3 Popularity-Proportional Hashing

We now describe a new distribution-sensitive variant of brute-force checking — popularity-proportional hashing (PPH). We shall see in Sect. 6 that for certain error settings and cracking-attack run times, PPH allows us to correct more password errors securely than brute-force checking. For all other error settings, it serves as a useful stepping stone to show that brute-force checking provides a better time / security trade-off than sketch-based typo-tolerant PBAS built from FRS and LHH.

The key idea is to partition the points in the error setting into layers based upon their probability (as done in \(\mathsf {LHH} \)), then have the hashing cost vary across the layers. This is accomplished by having the PBKDF \(\mathbf{H}\) take as input a different iteration count for each layer. Formally, for a distribution \(W\) with \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = \tilde{\mu }\), if a password \({w}\) is such that \({\Pr \left[ \,{W= {w}}\,\right] }\in (2^{-(j+ 1)}, 2^{-j}]\), then hashing \({w}\) incurs a cost of \(\mathsf {c}_{\mathrm {PPH}}^j= \mathsf {c}_{\mathrm {PPH}}\cdot 2^{\tilde{\mu }- (j+ 1)}\), where \(\mathsf {c}_{\mathrm {PPH}}\) is a parameter of the scheme. By making it more computationally intensive for an attacker to hash popular passwords, the boost to an attacker’s success probability resulting from querying a likely password is offset by the greater cost incurred to compute the relevant PBKDF output. We provide full details of the scheme in Fig. 5. In the following lemma, we show how to set the parameter \(\mathsf {c}_{\mathrm {PPH}}\) to achieve a desired checking run time \(\textsf {RT}\).
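Before stating the lemma, here is an illustrative Python sketch of the scheme's outline (not a substitute for Fig. 5), reusing the PBKDF2 stand-in from the earlier sketch; the probability table prob and the ball enumerator ball are assumed helpers.

  import hashlib
  import math
  import os

  def layer(p: float) -> int:
      # PPH-Layer: w with Pr[W = w] in (2^-(j+1), 2^-j] lands in layer j.
      return math.floor(-math.log2(p))

  def pph_cost(p: float, mu_fuzz: float, c_pph: int) -> int:
      # Per-layer cost c_PPH^j = c_PPH * 2^(mu~ - (j+1)), as an iteration count.
      return max(1, round(c_pph * 2 ** (mu_fuzz - (layer(p) + 1))))

  def reg_pph(w: bytes, prob: dict, mu_fuzz: float, c_pph: int):
      # Reg-PPH: hash w at the cost dictated by its own layer.
      sa = os.urandom(16)
      cost = pph_cost(prob[w], mu_fuzz, c_pph)
      return (sa, hashlib.pbkdf2_hmac("sha256", w, sa, cost))

  def chk_pph(w_tilde: bytes, record, ball, prob: dict,
              mu_fuzz: float, c_pph: int) -> bool:
      # Chk-PPH: brute-force the ball around the typed string, hashing each
      # candidate at its layer's cost; heavier candidates cost more.
      sa, h = record
      for cand in ball(w_tilde):
          p = prob.get(cand)
          if not p:
              continue  # outside supp(W): cannot be the registered password
          cost = pph_cost(p, mu_fuzz, c_pph)
          if hashlib.pbkdf2_hmac("sha256", cand, sa, cost) == h:
              return True
      return False

Note that Chk-PPH hashes each candidate at its own layer's cost, so the same asymmetry that slows an attacker's guesses at popular passwords applies during legitimate checking.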

Lemma 4

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting. Let \(\mathsf {PBAS - PPH}\) be as shown in Fig. 5 using random oracle \(\mathbf{H}\). Then setting \(\mathsf {c}_{\mathrm {PPH}}= \textsf {RT} \) implies that

$$\begin{aligned} \textsf {RT} (\mathrm {Chk - PPH},\mathsf {c}_{\mathrm {PPH}})\le \textsf {RT} \; , \end{aligned}$$

where \(\textsf {RT} (\mathrm {Chk - PPH},\mathsf {c}_{\mathrm {PPH}})\) denotes the maximum run time of \(\mathrm {Chk - PPH} \) with cost parameter \(\mathsf {c}_{\mathrm {PPH}}\) on any input \(\tilde{{w}} \in {\mathcal {S}} \).

Proof

Fix any point \(\tilde{{w}} \in {\mathcal {S}} \). Since \(W\) is such that \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = \tilde{\mu }\), and recalling that \({w}\in {L}_{j}\) implies \({\Pr \left[ \,{W= {w}}\,\right] } > 2^{-(j+1)}\), it follows that

$$\begin{aligned} 2^{-\tilde{\mu }} \ge {\Pr \left[ \,{W\in {B_{t}(\tilde{{w}})}}\,\right] } > \sum _{j = \mu }^{\gamma }|{B_{t}(\tilde{{w}})}\cap {L}_{j}| 2^{-(j+1)}\;. \end{aligned}$$

Multiplying both sides by \(\mathsf {c}_{\mathrm {PPH}}\cdot 2^{\tilde{\mu }}\) and recalling that \(\mathsf {c}_{\mathrm {PPH}}= \textsf {RT}\) and \(\mathsf {c}_{\mathrm {PPH}}^j= \mathsf {c}_{\mathrm {PPH}}\cdot 2^{\tilde{\mu }- (j+ 1)}\) gives

$$\begin{aligned} \textsf {RT}>\sum _{j = \mu }^{\gamma }|{B_{t}(\tilde{{w}})}\cap {L}_{j} | \mathsf {c}_{\mathrm {PPH}}\cdot 2^{\tilde{\mu }- (j+ 1)} = \sum _{j = \mu }^{\gamma }|{B_{t}(\tilde{{w}})}\cap {L}_{j} | \mathsf {c}_{\mathrm {PPH}}^j\;, \end{aligned}$$

where the right hand side is precisely the run time of \(\mathrm {Chk - PPH} \) on input \(\tilde{{w}} \). Since the choice of \(\tilde{{w}} \) was arbitrary, it follows that the run time of \(\mathrm {Chk - PPH} \) on any input \(\tilde{{w}} \in {\mathcal {S}} \) is at most \(\textsf {RT}\), proving the claim.    \(\blacksquare \)
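As a sanity check of the displayed inequality, the following toy computation (reusing layer and pph_cost from the sketch above; the three-point ball and budget are invented) confirms that the per-layer costs summed over a ball stay within \(\textsf {RT}\).

  import math

  probs = {b"pass1": 2 ** -3, b"pass2": 2 ** -4, b"pass3": 2 ** -4}
  ball_mass = sum(probs.values())   # Pr[W in B_t(w~)] = 0.25 for this toy ball
  mu_fuzz = -math.log2(ball_mass)   # fuzzy min-entropy: 2 bits here
  RT = c_pph = 1_000
  run_time = sum(pph_cost(p, mu_fuzz, c_pph) for p in probs.values())
  assert run_time <= RT             # 250 + 125 + 125 = 500 <= 1000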

Fig. 5. The popularity-proportional hashing PBAS scheme \(\mathsf {PBAS - PPH}= (\mathrm {Reg - PPH}, \mathrm {Chk - PPH})\), from a PBKDF \(\mathbf{H}\) such that \(\mathbf{H}(\cdot ; \cdot , \mathsf {c}_{\mathrm {PPH}}^j)\) costs \(\mathsf {c}_{\mathrm {PPH}}^j= \mathsf {c}_{\mathrm {PPH}}\cdot 2^{\tilde{\mu }- (j+ 1)}\) to compute, where \(\mathsf {c}_{\mathrm {PPH}}\) is a parameter of the scheme and \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = \tilde{\mu }\). The scheme uses layering \(\mathsf {PPH - Layer} \).

6 Comparing the PBAS Approaches

In the last section we saw three different ways to provide typo-tolerant password-based authentication. Now we dig deeper into the trade-offs incurred by the different schemes, in an attempt to determine which provides the best time / security trade-off. We are most interested in the following question:

When balls are efficiently enumerable, can \(\mathsf {PBAS - SS}\) ever provide a better time / security trade-off compared to \( \mathsf {PBAS - BF}/ \mathsf {PBAS - PPH}\)?

We will answer this question in the negative for the cases of FRS and LHH. To do so, we will fix an error setting \(\mathsf {E}\) with computationally enumerable balls (Definition 3), fix the time allotted to authentication, and show that the popularity-proportional hashing PBAS \(\mathsf {PBAS - PPH}\) provably provides better security than both \(\mathsf {PBAS - SS}[\mathsf {FRS2} ]\) and \(\mathsf {PBAS - SS}[\mathsf {LHH} ]\). We will then discuss the conditions on error settings and attacker run time under which \(\mathsf {PBAS - BF}\) offers a better trade-off still.

An incorrect interpretation of our results would be that sketches are useless. This would be the wrong takeaway for several reasons. First, our analysis will only be for specific sketches, not all sketches in general, and so answering our question in full generality remains an interesting open question (see Sect. 7). Second, even if the answer to our main question is negative, it only considers computationally enumerable balls, and many of the error correction settings motivating secure sketches have balls too large to be efficiently enumerated. Potential examples include high-entropy biometrics such as iris scans and fingerprints. Another reason is that we only consider settings where one can check that a correction is in fact correct, which allows the brute-force ball search. Finally, and most broadly, we do not consider information theoretic security — the original setting of most sketch constructions.

With these caveats in place, we turn to setting up a framework by which we can make apples-to-apples comparisons between the different PBAS schemes.

Qualities of Typo-Tolerant PBAS. There are three key axes upon which we compare the efficacy of typo-tolerant PBAS schemes: correctness, security, and run time. Correctness is readily assessed — it is straightforward to see that for any error setting \(\mathsf {E}\) and \(\delta \)-correct sketch \(\mathsf {S} = (\mathsf {SS}, {\mathsf{Rec}})\), \(\mathsf {PBAS - SS}[\mathsf {S} ]\) inherits the \(\delta \)-correctness of the underlying sketch. On the other hand, the two brute-force correction schemes, \(\mathsf {PBAS - BF}\) and \(\mathsf {PBAS - PPH}\), are perfectly correct. This highlights a bonus of the brute-force approach — if the correct password lies in the ball around a typo, these schemes will always recover it.

The comparison between the time / security trade-offs incurred by the different approaches is less immediate. For a given PBAS, this trade-off is primarily dictated by the computational cost \(\mathsf {c}\) we assign to \(\mathbf{H}\) (corresponding, in practice, to picking larger security parameters for the slow hashing scheme). In order to compare different approaches, we fix a runtime budget \(\textsf {RT}\) for checking passwords, set each of the schemes’ parameters to achieve maximal security subject to the run time constraint \(\textsf {RT}\), and compare the security as measured by the message recovery game of Fig. 4.

6.1 \(\mathsf {PBAS - BF}\) versus \(\mathsf {FRS1} \) for Flat Distributions

As a warm up, we discuss the trade-off between \(\mathsf {PBAS - BF}\) and \(\mathsf {PBAS - SS}[\mathsf {FRS1} ]\) where \(\mathsf {FRS1} = (\mathsf {FRS1 - SS}, \mathsf {FRS1 - Rec})\) (Lemma 3).

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting, such that \(W\) is flat with \({\mathrm {H}_\infty }({W}) = \mu \) and maximum ball size \(\beta _{\mathrm {max}}\). For a given run time budget \(\textsf {RT}\), setting \(\mathsf {c}_{\mathrm {ss}}= \textsf {RT}/2\) and \(\mathsf {c}_{\mathrm {bf}}= \mathsf {c}_{\mathrm {ss}}\cdot \frac{2}{\beta _{\mathrm {max}}}\) ensures both schemes have equal run times. Let \({\mathcal {A}}\) be an adversary in game \(\mathrm {MR} \) running in time at most T. Letting \( {q_{\mathrm {ss}}}= T/\mathsf {c}_{\mathrm {ss}}\), it follows that,

$$\begin{aligned} \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - BF},\mathsf {E}}({\mathcal {A}}) = \bigg ({q_{\mathrm {ss}}}\cdot \frac{\beta _{\mathrm {max}}}{2}\bigg ) 2^{-\mu } \;\; ; \text { and}\\ \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - SS}[\mathsf {FRS1} ], \mathsf {E}}({\mathcal {A}}) \le \bigg ({q_{\mathrm {ss}}}\cdot \frac{\beta _{\mathrm {max}}}{\delta } \bigg )2^{-\mu }\;. \end{aligned}$$

The first statement arises since \({\mathcal {A}}\) can query at most \(T/\mathsf {c}_{\mathrm {bf}}= {q_{\mathrm {ss}}}\cdot \frac{\beta _{\mathrm {max}}}{2}\) points in time T, each of which contributes weight \(2^{-\mu }\) to its success probability. The latter follows since \({\mathcal {A}}\) can query at most \({q_{\mathrm {ss}}}= T / \mathsf {c}_{\mathrm {ss}}\) points in time T; substituting this into the bound on \(q\)-conditional min-entropy given in Lemma 3 yields the claim. Since \(0< \delta < 1\) (and since \(\delta \) represents the error probability of the sketch, in practice we would like \(\delta \) to be small), this clearly illustrates that, in terms of existing upper bounds, \(\mathsf {PBAS - BF}\) offers a significantly better time / security trade-off than \(\mathsf {PBAS - SS}[\mathsf {FRS1} ]\). However, this does not rule out tighter upper bounds being found. To conclusively show that \(\mathsf {PBAS - BF}\) offers the best performance, we would like to reverse the inequality in the second statement, and prove a lower bound on the advantage of an optimal attacker against \(\mathsf {PBAS - SS}[\mathsf {FRS1} ]\) that exceeds the upper bound on the advantage of any attacker against \(\mathsf {PBAS - BF}\).
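Plugging invented but representative numbers into the two bounds makes the \(2/\delta \) gap concrete:

  mu, beta_max, delta = 30.0, 64, 2 ** -10
  q_ss = 2 ** 15                                # attacker hash budget T / c_ss
  adv_bf = q_ss * (beta_max / 2) * 2 ** -mu     # ~ 2^-10
  adv_frs1 = q_ss * (beta_max / delta) * 2 ** -mu
  print(adv_bf, min(1.0, adv_frs1))             # the FRS1 bound is vacuous here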

Let’s unpick what this means. Let \(\mathcal{B}\) be the optimal attacker in game \(\mathrm {MR} \) against \(\mathsf {PBAS - SS}[\mathsf {FRS1} ]\). We model the universal hash function family \(\mathsf {F}: {\mathcal {S}} \times \{0,1\}^{\ell } \rightarrow \{0,1\}^{\log (\beta _{\mathrm {max}}) + \log (1/\delta )}\) utilized by \(\mathsf {FRS1} \) as a family of random oracles \(\mathcal {H} = \{\textsf {h}\}\); rather than including \(\mathsf{sa}\;{\mathop {\leftarrow }\limits ^{\,\$}}\;{\{0,1\}^\ell }\) as part of the sketch, we now give access to the chosen random oracle \(\textsf {h}\;{\mathop {\leftarrow }\limits ^{\,\$}}\;\mathcal {H}\). We note that this modeling is conservative, since it can only make the attacker’s job harder. With this in place, we may lower bound \(\mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - SS}[\mathsf {FRS1} ], \mathsf {E}}(\mathcal{B})\) via a balls-in-bins experiment. We represent each point \({w}\in {\mathcal {M}}= \text{ supp } (W)\) by a ball of weight \(2^{-\mu }\), and associate each of the \(2^{\log (\beta _{\mathrm {max}}) + \log (1/\delta )} = \frac{\beta _{\mathrm {max}}}{\delta }\) points \(y \in \text {range}(\mathcal {H})\) with a bin. The choice of oracle \(\textsf {h}\;{\mathop {\leftarrow }\limits ^{\,\$}}\;\mathcal {H}\) fixes a ‘throwing’ of the balls into the bins, and the adversary’s success probability is equal to the weight accrued when it is allowed to choose up to \({q_{\mathrm {ss}}}\) balls from each bin, where \({q_{\mathrm {ss}}}= T/\mathsf {c}_{\mathrm {ss}}\); the advantage is then the expectation of this total over the coins of the random oracle.

With this in place, all we must do is show that, with overwhelming probability, when we are allowed to choose at most \({q_{\mathrm {ss}}}\) balls from each bin the resulting set contains at least \({q_{\mathrm {ss}}}\cdot \frac{\beta _{\mathrm {max}}}{2}\) balls. Intuitively this must be the case. We would expect each bin to contain \(\frac{\delta \cdot | \text{ supp } (W)|}{\beta _{\mathrm {max}}}\) balls, and this value must be much larger than \({q_{\mathrm {ss}}}\) or the attack would be trivial. As such, for the total to be missed, a very high proportion of the bins would have to contain a number of balls that diverges wildly from the mean. Formalizing this intuition is nonetheless non-trivial. A theorem statement to this end can be derived as a special case of Theorem 4; we defer the formal statement and analysis to the full version.
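To make the intuition tangible (this is an illustration, not the deferred formal analysis), a toy Monte Carlo version of the balls-in-bins experiment with invented parameters shows the adversary filling its per-bin quota almost surely:

  import random

  def collected(n_balls: int, n_bins: int, q_ss: int, trials: int = 200) -> float:
      # Throw n_balls equal-weight balls into uniform bins (the choice of h);
      # the adversary takes up to q_ss balls per bin. Estimate the expected
      # number of balls it collects.
      total = 0
      for _ in range(trials):
          counts = [0] * n_bins
          for _ in range(n_balls):
              counts[random.randrange(n_bins)] += 1
          total += sum(min(c, q_ss) for c in counts)
      return total / trials

  # ~400 balls per bin on average, far above q_ss = 16, so the adversary
  # collects very nearly q_ss * n_bins = 800 balls.
  print(collected(n_balls=20_000, n_bins=50, q_ss=16))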

6.2 PPH versus \(\mathsf {FRS2} \) and \(\mathsf {LHH} \)

In this section, we show that \(\mathsf {PBAS - PPH}\) offers a better time / security trade-off than \(\mathsf {PBAS - SS}\) implemented with the FRS and LHH sketch constructions of Sect. 3.

To facilitate the comparison, we first set the hashing cost parameter \(\mathsf {c}_{\mathrm {PPH}}\) such that \(\mathsf {PBAS - PPH}\) achieves the same runtime as \(\mathsf {PBAS - SS}\) with associated hashing cost \(\mathsf {c}_{\mathrm {ss}}\). With this parameter setting, \(\mathsf {PBAS - SS}\) has checking run time \(\textsf {RT}= 2 \cdot \mathsf {c}_{\mathrm {ss}}\), so Lemma 4 implies that setting \(\mathsf {c}_{\mathrm {PPH}}= 2 \cdot \mathsf {c}_{\mathrm {ss}}\) ensures \(\mathsf {PBAS - PPH}\) achieves run time \(\textsf {RT}\) also. With this in place, we now upper-bound the success probability of an optimal attacker against \(\mathsf {PBAS - PPH}\) with these parameters; the proof is given in the full version.

Lemma 5

Let \(\mathsf {E}=({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting, where \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = \tilde{\mu }\). Let \(\mathsf {PBAS - PPH}\) be the popularity-proportional hashing scheme where random oracle \(\mathbf{H}\) has associated cost \(\mathsf {c}_{\mathrm {PPH}}= 2\cdot \mathsf {c}_{\mathrm {ss}}\). Let \({\mathcal {A}}\) be an adversary in game \(\mathrm {MR} _{\mathsf {PBAS - PPH},\mathsf {E}}^{\mathcal {A}}\) running in time at most T. Then,

$$\begin{aligned} \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - PPH},\mathsf {E}}({\mathcal {A}}) \le {q_{\mathrm {ss}}}\cdot 2^{-\tilde{\mu }}\; , \end{aligned}$$

where \({q_{\mathrm {ss}}}= T/\mathsf {c}_{\mathrm {ss}}\).

To give the above term some context, consider the equivalent upper bounds on success probability for sketch-based schemes \(\mathsf {PBAS - SS}[\mathsf {FRS2} ]\), and \(\mathsf {PBAS - SS}[\mathsf {LHH} ]\) (which are derived by substituting the parameters of the error setting into Theorems 1 and 3 respectively).Footnote 3 For any adversary \(\mathcal{B}\) running in time at most T it holds that,

$$\begin{aligned} \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - SS}[\mathsf {FRS2} ],\mathsf {E}}(\mathcal{B}) \le {q_{\mathrm {ss}}}\cdot \frac{{\mathrm {H}_0}({W}) \cdot 2^{-(\tilde{\mu }-1)}}{\delta } \;, \text {and} \end{aligned}$$
$$\begin{aligned} \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - SS}[\mathsf {LHH} ],\mathsf {E}}(\mathcal{B}) \le {q_{\mathrm {ss}}}\cdot \frac{2^{-(\tilde{\mu }- 1)}}{\delta } \; . \end{aligned}$$

By comparison with Lemma 5, it is immediately clear that \(\mathsf {PBAS - PPH}\) enjoys a better security upper bound than either construction. Of course, it could be that the upper bounds for the sketch-based schemes can be improved.
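For a rough sense of scale, evaluating the three bounds on invented parameters (the values below carry no significance beyond illustration):

  mu_fuzz, delta, H0 = 30.0, 2 ** -10, 40.0  # H0 = Hartley entropy of W (invented)
  q_ss = 2 ** 15
  adv_pph = q_ss * 2 ** -mu_fuzz                       # ~ 3.1e-5
  adv_lhh = q_ss * 2 ** -(mu_fuzz - 1) / delta         # ~ 0.0625
  adv_frs2 = q_ss * H0 * 2 ** -(mu_fuzz - 1) / delta   # vacuous (> 1)
  print(adv_pph, adv_lhh, min(1.0, adv_frs2))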

We therefore, in the following theorem, lower bound the success probability of an optimal attack against \(\mathsf {PBAS - SS}[\mathsf {FRS2} ]\) and \(\mathsf {PBAS - SS}[\mathsf {LHH} ]\) in terms of the advantage of any adversary against \(\mathsf {PBAS - PPH}\). This rules out improving the upper bounds enough to make the sketch-based schemes better than \(\mathsf {PBAS - PPH}\). We first state the theorem, then discuss its significance.

Theorem 4

Let \(\mathsf {E}= ({\mathcal {S}}, W,\mathsf {dist}, t)\) be an error setting with \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = \tilde{\mu }\). Let \(\Pi - \mathsf {S} = (\Pi - \mathsf {SS}, \Pi - {\mathsf{Rec}})\) be the secure sketch for the same error setting where \(\Pi \in \{ \mathsf {FRS2}, \mathsf {LHH} \}\), achieving \(1-\delta \) correctness for some \(0< \delta < 1\). We model the (strongly) universal hash functions used by the sketch as random oracles. Let \(\mathsf {PBAS - SS}[\Pi - \mathsf {S} ]\) be the sketch-assisted \(\mathsf {PBAS}\) built from \(\Pi - \mathsf {S} \), using random oracle \(\mathbf{H}\) with associated cost \(\mathsf {c}_{\mathrm {ss}}\). Let \(\mathsf {PBAS - PPH}\) be the popularity-proportional hashing \(\mathsf {PBAS}\) for this error setting, with random oracle \(\mathbf{H}^\prime \) with associated cost \(\mathsf {c}_{\mathrm {PPH}}\) set such that \(\textsf {RT} (\mathrm {Chk - SS},\mathsf {c}_{\mathrm {ss}}) \ge \textsf {RT} (\mathrm {Chk - PPH},\mathsf {c}_{\mathrm {PPH}})\). Then for any adversary \({\mathcal {A}}\) against \(\mathsf {PBAS - PPH}\) running in time at most T, there exists an adversary \(\mathcal{B}\) against \(\mathsf {PBAS - SS}[\Pi - \mathsf {S} ]\) such that

$$\begin{aligned} \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - PPH},\mathsf {E}}({\mathcal {A}}) \le \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - SS}[\Pi - \mathsf {S} ],\mathsf {E}}(\mathcal{B}) + \bigg (\frac{e\cdot \delta }{2} \bigg )^{q_{\mathrm {ss}}}\end{aligned}$$

and, moreover, \(\mathcal{B}\) runs in time T and so can make at most \({q_{\mathrm {ss}}}= T/\mathsf {c}_{\mathrm {ss}}\) queries.

We have stated the theorem in the form of a reduction to highlight that \(\mathsf {PBAS - PPH}\) provides at least as good a time / security trade-off as the seemingly more sophisticated sketch-based scheme. Given that \({q_{\mathrm {ss}}}\) will be large (this is the number of hash computations an attacker can make), provided that \(\delta < 2/e \approx 0.736\), the second term in the bound is vanishingly small. Since \(\delta \) represents the error rate of the sketch, in practice any usable sketch will require \(\delta \) much smaller than 0.736.

The proof of the theorem proceeds by specifying a concrete adversary \(\mathcal{B}\) against \(\mathsf {PBAS - SS}[\Pi - \mathsf {S} ]\) for \(\Pi \in \{ \mathsf {FRS2}, \mathsf {LHH} \}\), where the underlying (strongly) universal hash function family is modeled as a family of random oracles \(\mathcal {H} = \{ \textsf {h}\}\). It works as one would expect: the adversary is given some sketch \(s\) and access to the oracle \(\textsf {h}\) used in the computation of the sketch. The attack queries the \({q_{\mathrm {ss}}}\) heaviest points in the preimage set

$$\begin{aligned} \mathcal {X}_{s} = \{ {w}\in \text {supp}(W) \ : \ {\Pr \left[ \,{W= {w}\wedge \Pi - \mathsf {SS}^{\textsf {h}}(W) = s}\,\right] } > 0\} \end{aligned}$$

to the PBKDF \(\mathbf{H}\), where \({q_{\mathrm {ss}}}= T/\mathsf {c}_{\mathrm {ss}}\). This is the optimal attack.

We note that \(\mathcal{B}\) need not compute the entire preimage set before submitting its guesses to the oracle \(\mathbf{H}\); rather, its most efficient strategy is to compute the hashes of candidate points under \(\textsf {h}\) in descending order of weight, looking for points which lie in the preimage set. Intuitively this will be efficient because, assuming the sketch behaves uniformly, we would expect to find preimage-set points at fairly regular intervals. For example, if (for simplicity) the sketch were simply \(\textsf {h}({w}) = y \), then the expected run time for \(\mathcal{B}\) to find \({q_{\mathrm {ss}}}\) matches (over the coins of \(\textsf {h}\)) is \({q_{\mathrm {ss}}}\cdot |y |\) computations of \(\textsf {h}\).
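In code, the strategy looks roughly as follows (a sketch only, for the simplified case where the sketch is just \(\textsf {h}({w}) = s \); the callables h and H and the weight-sorted candidate list are hypothetical):

  def sketch_attack(s, h, H, candidates_by_weight, q_ss, target):
      # Walk candidates in descending probability; spend the PBKDF budget
      # only on points consistent with the sketch (i.e., in X_s).
      queried = 0
      for w in candidates_by_weight:   # heaviest first
          if queried == q_ss:
              break                    # PBKDF budget exhausted
          if h(w) != s:
              continue                 # not in the preimage set X_s
          queried += 1
          if H(w) == target:
              return w                 # message recovered
      return None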

The proof then must show a lower bound on the success of \(\mathcal{B}\). This analysis turns out to be quite tricky, involving a nuanced balls-in-bins argument. We make things easier by targeting only a rather loose lower bound that suffices to show the desired relationship with \(\mathsf {PBAS - PPH}\). We believe that better lower bounds can be found; these would signify an even bigger gap between the security of \(\mathsf {PBAS - PPH}\) and \(\mathsf {PBAS - SS}\), making \(\mathsf {PBAS - PPH}\) look even better in comparison.

We note that while the above result shows that \(\mathsf {PBAS - PPH}\) always offers a better time / security trade-off than sketch-based schemes using FRS or LHH, the same cannot be shown to hold for \(\mathsf {PBAS - BF}\). For example, consider an error setting such that \(W\) consists of \(2^{49}\) points of weight \(2^{-50}\) and \(2^{99}\) points of weight \(2^{-100}\), for which all balls contain a single point, except for one large ball containing \(2^{20}\) of the lower weight points. As such \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = 50\), and so by Theorems 2 and 3 it is easy to see that the security of the sketch-based schemes will degrade linearly and gracefully as T grows. On the other hand, the huge ball of \(2^{20}\) points means that for matching run-times we must set \(\mathsf {c}_{\mathrm {bf}}= 2^{-19} \cdot \mathsf {c}_{\mathrm {ss}}\) — so low that security for \(\mathsf {PBAS - BF}\) (at least initially; see Sect. 6.3) degrades dramatically compared to \(\mathsf {PBAS - SS}\).
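The arithmetic of the counterexample can be checked mechanically (all values as in the text):

  from math import log2

  mass = 2 ** 49 * 2 ** -50 + 2 ** 99 * 2 ** -100  # = 1.0: a valid distribution
  big_ball = 2 ** 20 * 2 ** -100                   # weight 2^-80 of the large ball
  mu_fuzz = -log2(max(2 ** -50, big_ball))         # heaviest ball gives 50.0
  c_bf_over_c_ss = 2 / 2 ** 20                     # = 2^-19 for matching run times
  print(mass, mu_fuzz, log2(c_bf_over_c_ss))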

This counterexample may be contrived, but for more realistic distributions there remains a technical challenge in comparing \(\mathsf {PBAS - BF}\) and \(\mathsf {PBAS - SS}\) for FRS and LHH directly. The fact that the latter schemes are parameterized by \(\tilde{\mu }= {\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W})\) means that the natural derived security bounds are in terms of \(\tilde{\mu }\) also, whereas for \(\mathsf {PBAS - BF}\), security is dictated by the sum of the weights of the points at the head of the distribution. Therefore any balls-in-bins analysis of the form described above involves a complicated comparison between two somewhat orthogonal terms. To overcome this, we can use PPH (whose success probability is also a function of \(\tilde{\mu }\)) as a bridge, first invoking Theorem 4 to show that \(\mathsf {PBAS - PPH}\) offers a better time / security trade-off than the sketch-based PBAS and then assessing, using results in Sect. 6.3, whether \(\mathsf {PBAS - BF}\) offers a better trade-off still.

6.3 Brute-Force Checking versus PPH

In the following theorem, we quantify precisely the conditions on an error setting \(\mathsf {E}\) under which \(\mathsf {PBAS - PPH}\) represents a better time / security trade-off than \(\mathsf {PBAS - BF}\). We fix a run time \(\textsf {RT}\) for the checking algorithms of both schemes, and set the associated hashing cost \(\mathsf {c}_{\mathrm {bf}}\) and hashing cost parameter \(\mathsf {c}_{\mathrm {PPH}}\) in a way that ensures both schemes work within this run time. We then consider the success probabilities of optimal adversaries attacking the schemes, both running in some time T.

The following theorem formally captures our comparison of the two schemes. Roughly speaking, the result indicates that there exists a crossover point: for T smaller than this point, \(\mathsf {PBAS - PPH}\) provides better security than \(\mathsf {PBAS - BF}\), and for T larger than this point, the reverse is true. The crossover point is dictated by the error setting. As we discuss in more detail below, the crossover point for typical error distributions seen with human-chosen passwords is actually rather small, meaning that \(\mathsf {PBAS - BF}\) would appear to dominate for distributions of practical interest. Whether \(\mathsf {PBAS - PPH}\) can be improved is an open question.

Theorem 5

Let \(\mathsf {E}= ({\mathcal {S}}, W, \mathsf {dist}, t)\) be an error setting, where \({\mathrm {H}^\mathrm{fuzz}_{t, \infty }}({W}) = \tilde{\mu }\) and the largest ball is of size \(\beta _{\mathrm {max}}\). Let \(\mathsf {PBAS - BF}= (\mathrm {Reg}, \mathrm {Chk - BF})\) be the brute-force \(\mathsf {PBAS}\) for this error setting, using oracle \(\mathbf{H}\) with associated cost \(\mathsf {c}_{\mathrm {bf}}\) and run time budget \(\textsf {RT} \). Let \(\mathsf {PBAS - PPH}= (\mathrm {Reg - PPH}, \mathrm {Chk - PPH})\) be the popularity-proportional hashing \(\mathsf {PBAS}\) for this error setting using an oracle \(\mathbf{H}^\prime \) with associated cost parameter \(\mathsf {c}_{\mathrm {PPH}}\) set such that \(\textsf {RT} (\mathrm {Chk - BF},\mathsf {c}_{\mathrm {bf}}) \ge \textsf {RT} (\mathrm {Chk - PPH},\mathsf {c}_{\mathrm {PPH}})\). Let \({\mathcal {A}}\) and \(\mathcal{B}\) be optimal attackers in games \(\mathrm {MR} _{\mathsf {PBAS - PPH}, \mathsf {E}}^{\mathcal {A}}\) and \(\mathrm {MR} _{\mathsf {PBAS - BF}, \mathsf {E}}^\mathcal{B}\) respectively running in time T. Let \(q_{\mathrm {bf}}= T/\mathsf {c}_{\mathrm {bf}}= T\cdot \beta _{\mathrm {max}}/ \textsf {RT} \). Then if T is such that

$$\begin{aligned} T \le \Big ( 2^{-{\mathrm {H}^{{q_{\mathrm {bf}}}}_\infty }({W})} \cdot 2^{\tilde{\mu }- 1} \Big ) \cdot \textsf {RT} \;, \end{aligned}$$

it holds that \(\mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - PPH},\mathsf {E}}({\mathcal {A}}) \le \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - BF},\mathsf {E}}(\mathcal{B})\). Conversely, if T is such that

$$\begin{aligned} T \ge \Big ( 2^{-{\mathrm {H}^{{q_{\mathrm {bf}}}}_\infty }({W})} \cdot 2^{\tilde{\mu }} + 1 \Big ) \cdot \textsf {RT} \;, \end{aligned}$$

it holds that \( \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - BF},\mathsf {E}}(\mathcal{B}) \le \mathbf {Adv}^\mathrm {mr}_{\mathsf {PBAS - PPH},\mathsf {E}}({\mathcal {A}})\).

The proof works by upper and lower bounding the success probability of an optimal attacker against PPH, and comparing this to the success probability of an optimal attacker against the brute-force checker. The proof is given in the full version.
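To see why the crossover is small for password-like distributions, consider a crude toy model of the first bound (all numbers invented; the region between the theorem's two bounds is left unresolved here):

  from math import log2

  RT, beta_max = 1.0, 64
  mu_fuzz, h_inf = 6.0, 7.0   # top-heavy, password-like toy values

  def h_inf_q(q_bf: float) -> float:
      # Crude model of the q-min-entropy: the q_bf heaviest points
      # together weigh min(1, q_bf * 2^-h_inf).
      return max(0.0, h_inf - log2(q_bf))

  for T in [2.0, 2.0 ** 5, 2.0 ** 10]:
      q_bf = T * beta_max / RT
      threshold = 2 ** -h_inf_q(q_bf) * 2 ** (mu_fuzz - 1) * RT
      print(T, "PPH no worse" if T <= threshold else "BF favored")

In this toy model the crossover sits near \(T \approx 32 \cdot \textsf {RT} \), far below realistic attacker budgets, consistent with the discussion below.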

At a high level, the first bound in the theorem shows that \(\mathsf {PBAS - PPH}\) favors error settings for which the weight of points decreases slowly (relative to the attack run time) as we move down through the distribution from the heaviest point. In such error settings, \(\mathsf {PBAS - PPH}\) allows us to securely correct larger balls — and accordingly more errors — than brute-force checking, provided balls are constructed so that the fuzzy min-entropy is high. This latter requirement is not much of a restriction, since a well-designed error setting will seek to maximize utility for a given level of security by defining balls with many points but low aggregate mass. For most such error settings, while there will be a point after which \(\mathsf {PBAS - BF}\) offers the better time / security trade-off, it lies at attack run times too large to be of concern. This class of distributions includes those described in Sect. 6.2 for which brute-force checking degrades security dramatically.

On the other hand, the second bound shows that if the weight of points decreases quickly as we move down through the distribution, then \(\mathsf {PBAS - BF}\) offers the better time / security trade-off. Intuitively this is because, as the weight of points decreases, the per-layer hashing cost under PPH falls, until it drops below the hashing cost used with \(\mathsf {PBAS - BF}\). As such, the crossover point after which brute-force checking offers the better trade-off falls within the attack run times of concern. Since password distributions are typically very ‘top-heavy’, with the weights of points decreasing rapidly to leave a long tail, they fall into the class for which brute-force checking offers the better time / security trade-off.

The theorem gives both upper and lower bounds on T, with a small gap between them, meaning the crossover point is not precisely pinned down. This is due to a small amount of slack in upper and lower bounding the success probability of an optimal attacker against \(\mathsf {PBAS - PPH}\) for general error settings. For specific error settings, one can sharpen the analysis.

7 Conclusion

In this work we investigated error correction for cryptographic secrets in the known-distribution setting. Using typo-tolerant password checking as a guiding case study, we provided several improvements in both theory and practice. On the theory side, we introduced a new information-theoretic security goal for secure sketches that better matches the needs of applications that may allow an attacker to make multiple guesses about the secret. While for high-entropy settings the distinction is moot, for passwords it is critical. We then analyzed the best known schemes in this setting, due to Fuller et al. [15].

Our first main contribution was the design and analysis of a new secure sketch construction, the layer-hiding hash (LHH). We proved that it provides better security than prior schemes. We then introduced a new distribution-sensitive brute-force based technique called popularity-proportional hashing (PPH) that, unlike the prior brute-force checking approach of Chatterjee et al. [12], varies the run time of the hash function according to the popularity of the password being hashed.

We gave a framework for comparing different approaches to typo-tolerant authentication, and used it to show that PPH outperforms sketch-based solutions to typo-tolerance, even when using the layer-hiding hash sketch. We also determined the conditions under which PPH improves on the brute-force checking approach of Chatterjee et al. [12], along with the conditions under which their simpler brute-force checking offers a better trade-off. Taken together, our results indicate that brute-force based approaches perform better than the best known secure sketches. We now finish with a few important points and open questions.

Complexity Beyond Time. Most in-use slow hashes only target extending the time required for a single hash computation. Increasingly, however, practitioners are transitioning to slow hashing that targets memory-hardness [1,2,3, 6, 26], meaning that the space-time complexity of computing a hash (the product of memory and time utilized) is lower bounded. Our constructions work with memory-hard hash functions as well, though our comparisons of different approaches currently only consider time complexity. Future work may also consider parallel computation models, which could be useful when a password-checking system can use multiple cores to simultaneously check multiple possible corrections.

Additional Applications. While we motivated and used as a running example the setting of password-based authentication, our constructions are generic. They apply in any distribution-sensitive setting in which one has efficiently enumerable balls (the same general setting considered by FRS). The FRS, LHH, and PPH approaches will not work for error settings with large balls, such as attempting to correct large Hamming or edit distances. In those contexts, existing secure sketch constructions [13, 14] seem to be the only solution. We note that their entropy loss is significantly worse than that of the FRS or LHH constructions, and so they would not seem useful for passwords.

We have focused on authentication, but our results and comparisons are applicable to any cryptographic application in which noisy secrets are used to derive a key for which one can efficiently test correctness. This includes at least all authentication primitives, such as message authentication codes and authenticated encryption. Similarly, our new sketch constructions can also be used to build a fuzzy extractor using the construction from [13], which inherits the security improvement over the fuzzy extractor from FRS.

Secure Sketches in the Multi-guess Setting. In the previous section, we proved that \(\mathsf {PBAS - SS}\) never offers a better time / security trade-off than \(\mathsf {PBAS - PPH}/ \mathsf {PBAS - BF}\) when implemented with the FRS sketches, or with the new — and, in the single-guess setting, nearly optimal — LHH sketch. The key open question is whether any distribution-sensitive secure sketch can perform better in this context. The challenge is to design a sketch that preserves much of the min-entropy of the underlying distribution in the face of an attacker who can make \(q\) guesses, for varying and large values of \(q\). This is an important requirement in many practical settings, yet it has been overlooked in the existing literature.

Intuitively, the correctness requirement means that the sketch must include sufficient information to disambiguate between points in the heaviest ball(s). As such, any efficiently computable sketch (we disregard those which, for example, solve an NP-hard problem to create an optimal arrangement of points into sketch preimage sets) is likely to leak more information than is strictly necessary for correctness in less heavy balls. This additional leakage can then be exploited by an attacker. More generally, we would expect that the larger \(q\) is, the wider the gap between the security \({\tilde{\mathrm {H}}^{q}_\infty }({W|\mathsf {SS}(W)})\) achieved by a sketch \(\mathsf {S} = (\mathsf {SS}, {\mathsf{Rec}})\) for that error setting, and the theoretical best-case security bound \({\mathrm {H}^{q, \mathrm{fuzz}}_{t, \infty }}({W}) - \log (1 - \delta )\).

We conjecture that for a significant class of error settings — especially those such as passwords, which inherently contain large balls — no efficient distribution-sensitive secure sketch can offer a better time / security trade-off than brute-force based approaches. Indeed, it seems likely that any insight leading to an improvement in secure sketch performance over LHH could also be utilized to create a brute-force approach which improves on \(\mathsf {PBAS - PPH}\) (much as the same layered approach is used by both LHH and PPH, with better performance in the latter). Refining and improving upon the brute-force based approaches described here is an interesting open problem.