1 Introduction

Ballot-marking devices (BMDs) print votes, often as barcodes or QR codes, together with a human-readable text summary (some BMD printout resembles a hand-marked paper ballot, HMPB). Jurisdictions including the U.S. state of Georgia, Los Angeles County, California, and Philadelphia, Pennsylvania, recently purchased BMDs for all in-person voters to use.

Bugs, misconfiguration, or malware can make the printed votes and QR codes differ from each other and from the voter’s selections. Some have argued that this does not compromise election integrity because voters have the opportunity to inspect BMD printout and to start over if the printout does not match their intended selections; and that since voters can make mistakes hand-marking ballots, HMPBs are no more secure or reliable than BMD printout [19]. We find those arguments unpersuasive:

  • In some jurisdictions, the official record of the vote for counts and recounts is the QR code, which voters cannot check.Footnote 1

  • The arguments equate holding voters responsible for their own errors with holding voters responsible for the overall security of the system [1, 2].

  • Most voters do not inspect BMD printout [3, 5, 10]. Those who do rarely detect actual errors [3, 13]. To reliably detect errors entails voters taking 3–6 minutes to compare a written slate of candidates with the printed selections [12], but voters generally spend less than 3 seconds reviewing BMD printout [5, 10].

  • If a BMD misprints a voter’s selections, only the voter can get evidence of the problem: elections conducted using BMDs are not contestible [1].

  • If BMDs misbehave, there is no way to determine the correct election outcome because there is no trustworthy paper record of the vote: BMDs are not strongly software independent [21].

Concerns about BMDs are not merely hypothetical: BMDs have caused scanners to fail to count votes accurately, to allow voters to vote, and to present all voting options to voters, even after passing logic and accuracy testing (LAT) [4, 6, 9, 15, 18, 20, 23, 24, 30, 33].

BMD advocates also claim BMDs eliminate ambiguous marks, prevent overvotes, and warn about undervotes (e.g., [19]). But that presumes BMDs function correctly; the rate of truly ambiguous handmade marks is minuscule [1]; and precinct-based optical scanners also protect against undervotes and overvotes (the Voluntary Voting System Guidelines, VVSG, require it).Footnote 2 Regardless, elections conducted using BMDs are not trustworthy unless there is a way to ensure that BMD misbehavior did not change any outcome. (If the paper trail itself is not trustworthy, risk-limiting audit procedures do not help because even an accurate full hand count may not reveal who really won.) Elections, and hence BMDs, need to be protected against malicious, technically capable attackers, such as nation states.Footnote 3 If testing has a high chance of detecting that an outcome was altered by a skilled attacker, it also protects against misconfiguration and bugs, which attackers could mimic.

2 Prior Work

Vulnerabilities of particular BMDs are discussed in depth in expert declarations by J. Alex Halderman in Curling et al. v. Raffensperger et al. Theoretical vulnerabilities of various BMD designs are discussed in [1]. [7, 32] discuss testing BMDs; here, we quantitatively investigate their heuristic claims. Three approaches to testing BMDs have been proposed: pre-election logic and accuracy testing (LAT), “live” or “parallel” testing during the election, and “passive” testing by monitoring the spoiled ballot rate. In LAT and parallel testing, auditors make selections on a BMD then check whether the printout accurately reflects those selections. The primary difference is that LAT happens before the election and parallel testing happens during the election. Passive testing uses the spoiled ballot rate: if more voters than usual request a do-over, that might be because the machines are malfunctioning.

3 How Much Testing is Enough?

If the paper trail accurately reflects who won, accurate full hand counts and risk-limiting audits (RLAs) can catch and correct wrong outcomes. Here, we study whether testing can establish with high confidence that a paper trail printed by BMDs accurately reflects who won. If not, a recount need not show who really won, and a genuine RLA is impossible.

3.1 Threats and Defenses

We make the following assumptions about BMD threats and defenses:

  1. Attackers seek to alter the outcome of one or more contests without being detected. (Some might want to be detected, to undermine public confidence.)

  2. Attackers know the testing strategy. This does not preclude the possibility that the strategy will be adaptive or have a random element.

  3. Attackers have access to the state history of each BMD, including votes, machine settings, etc.; auditors do not.

  4. Attackers have an accurate model of voter behavior in past elections, including political preferences, voting speed, BMD settings, and so on; auditors generally do not, because it would require monitoring voters illegally.

  5. Auditors seek to ensure that if any outcome is altered, there is a high chance of detecting it, while keeping the chance of false alarms small.

  6. Auditors do not know which contest(s), if any, were altered.

  7. Auditors must obey the law and protect voter privacy.

3.2 Jurisdiction Sizes, Contest Sizes, and Margins

U.S. elections are typically administered by counties, townships, or other political units smaller than states. A ballot style corresponds to the collection of contests a given voter is eligible to vote in. Typically in the U.S., some contests are on only a fraction of ballot styles in a jurisdiction, in part because many small political units have elections for various offices and measures. Many contests of all sizes are decided by small margins. For instance, in Georgia, U.S., the reported margin in the 2020 presidential election was about 0.2%.

Few votes need to be changed to alter the outcome of small contests and contests with small margins. Conversely, the number of voters in a jurisdiction is an upper bound on the number of passive tests that can be performed and on the sample size to “learn” voter behavior for efficient parallel testing. Thus, jurisdiction size is an important constraint on BMD testing. Since ballot layout, contests, equipment, demographics, political preferences, and other variables vary across and within jurisdictions and malware could affect only some equipment or ballot styles, it is not possible to pool data across jurisdictions to get more power.

Changing votes on 1% of ballots in a jurisdiction can alter the margin of a jurisdiction-wide plurality contest by 2% if there are no undervotes or invalid votes in that contest. If the undervote rate is 30%, then changing votes on 1% of the ballots can change the margin by \(0.02/0.7 = 2.9\%\). If a contest is only on 10% of the ballots and the undervote rate in the contest is 30%, altering the votes on 1% of ballots could change the margin in that contest by nearly 29%.
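The arithmetic in the preceding paragraph can be sketched in a few lines. This is a minimal illustration, assuming every altered ballot carries a valid vote in the contest and that each alteration moves one vote from the reported loser to the reported winner; the function name `margin_shift` and its parameters are hypothetical, not taken from the authors' software.

```python
def margin_shift(frac_ballots_altered, frac_ballots_with_contest=1.0, undervote_rate=0.0):
    """Largest change in a contest's margin, as a fraction of valid votes in it.

    Assumption (not from the source code): every altered ballot has a valid
    vote in the contest, and flipping it moves the margin by two votes.
    All arguments are fractions of the ballots cast in the jurisdiction,
    except undervote_rate, which is a fraction of ballots with the contest.
    """
    valid_votes = frac_ballots_with_contest * (1.0 - undervote_rate)
    return 2.0 * frac_ballots_altered / valid_votes


# The three cases discussed in the text: 1% of ballots altered in each.
for contest_share, undervotes in [(1.0, 0.0), (1.0, 0.3), (0.1, 0.3)]:
    print(f"{margin_shift(0.01, contest_share, undervotes):.1%}")
```

The three printed values correspond to the 2%, 2.9%, and nearly 29% figures quoted above.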

As of 2020, only 1,629 U.S. cities had populations of 100,000 or more, out of more than 81,363 incorporated places [26]. The 2020 median population of U.S. incorporated areas is 1,201, so about half of the 81,363 incorporated places have turnout of 1,201 or fewer voters. Thus, an attacker does not have to change many votes to alter the outcome of a typical contest for an elected official in a U.S. city or incorporated township. According to [29] the 2020 median turnout in the 6,405 U.S. counties with recorded active voter data was 4,470 voters, and turnout was less than 11,500 voters for more than 2/3 of jurisdictions. In 65.5% of states, more than 50% of counties have fewer than 30,000 active voters. In 85.5% of states, more than 50% of counties have fewer than 100,000 active voters.

Fig. 1. 2020 turnout by jurisdiction in 3073 counties [29]. Turnout was below 10,000 in \(\approx \)50% of counties.

Fig. 2. Median 2020 turnout by jurisdiction in the U.S. [29]

3.3 Voting Transactions

We shall call a voter’s interaction with a BMD a voting transaction or transaction (see Table 1). Transactions are characterized by many variables, including:

  • when the transaction starts

  • time since the previous voter finished (a measure of polling-place congestion)

  • number of transactions before the current transaction

  • the voter’s sequence of selections and revisions of selections

  • the time to make each selection before taking another action

  • whether the voter looks at every page of options in each contest

  • the time the voter spends reviewing and revising selections

  • precisely where the voter touches the screen

  • BMD settings, including font size, language, use of audio, volume, tempo, pausing, rewinding, use of the sip-and-puff interface, inactivity warnings

Table 1 lists some of the variables and the number of values they can take.Footnote 4 The huge number of possible transactions helps an attacker pick a subset large enough to change an outcome but that auditors are unlikely to probe.

Table 1. Some parameters of BMD transactions and their number of possible values.

4 Passive Testing

Passive testing sounds an alarm if the number of spoiled ballots reaches some threshold, t. To ensure that passive testing has a false negative rate (failing to detect altered outcomes) of at most X%, we need to know that the chance that the number of spoiled ballots is greater than or equal to t is at least (100−X)% if BMDs altered any outcome. Conversely, to limit the false alarm rate to at most Y%, we need to know that the chance that the number of spoiled ballots is greater than or equal to t is at most Y% if BMDs function correctly.

Finding such a value of t is impossible in practice because the distribution of spoiled ballots may depend on ballot design, voting rules, the number of contests, and other things that vary from election to election and place to place—and when BMDs misbehave, also on the number of altered transactions, and the voters and contests affected. Hence, to lower-bound the difficulty, we will assume (optimistically) that the number of spoiled ballots has a Poisson distribution whether BMDs behave correctly or not; but with a rate that depends on the rate of altered transactions. We assume either that 7% of voters will notice errors and spoil their ballots, consistent with the findings of [3], or that 25% of voters will. We consider contest margins of 1%–5% and rates of false positives (false alarms) and false negatives (failing to notice altered outcomes) of 5% and 1%. Results are in Table 2; software to calculate these numbers is in https://github.com/pbstark/Parallel19.

Table 2. Minimum turnout for passive testing with a 5% false negative rate to have at most a 5% false positive rate (cols 3–5) or for passive testing with a 1% false negative rate to have at most a 1% false positive rate (cols 6–8), as a function of the contest margin (col 1), the percentage of voters who would notice errors (col 2), and the base rate at which voters spoil BMD printout. The number of spoiled ballots is assumed to have a Poisson distribution, with known rate, absent malfunctions. Malfunctions increase the rate by half the margin times the detection rate.
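The kind of calculation behind Table 2 can be sketched as follows. This is not the authors' software (which is in the repository cited above) but a minimal standard-library illustration of the stated Poisson model. The base spoilage rate of 1% used in the example is a hypothetical value chosen only for illustration, all function names are invented here, and the search treats feasibility as monotone in turnout, which holds only approximately because the alarm threshold is an integer.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def alarm_threshold(lam0, max_false_pos):
    """Smallest t with P(X >= t) <= max_false_pos when X ~ Poisson(lam0)."""
    cdf, t = 0.0, 0
    while 1.0 - cdf > max_false_pos:  # here 1 - cdf equals P(X >= t)
        cdf += poisson_pmf(t, lam0)
        t += 1
    return t


def feasible(turnout, margin, detect_rate, base_rate, max_false_pos, max_false_neg):
    """Can one threshold meet both error limits at this turnout?"""
    lam0 = turnout * base_rate  # spoilage rate if BMDs work
    # Malfunctions add half the margin times the detection rate (Table 2 caption).
    lam1 = turnout * (base_rate + 0.5 * margin * detect_rate)
    t = alarm_threshold(lam0, max_false_pos)
    power = 1.0 - sum(poisson_pmf(k, lam1) for k in range(t))  # P(X >= t)
    return power >= 1.0 - max_false_neg


def min_turnout(margin, detect_rate, base_rate, max_false_pos=0.05, max_false_neg=0.05):
    """Approximate smallest turnout for which passive testing can work."""
    args = (margin, detect_rate, base_rate, max_false_pos, max_false_neg)
    hi = 1
    while not feasible(hi, *args):  # double until feasible
        hi *= 2
    lo = hi // 2
    while hi - lo > 1:  # bisect between infeasible lo and feasible hi
        mid = (lo + hi) // 2
        if feasible(mid, *args):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical inputs: 2% margin, 7% of affected voters notice, 1% base spoilage.
print(min_turnout(margin=0.02, detect_rate=0.07, base_rate=0.01))
```

Under these illustrative inputs the required turnout is in the hundreds of thousands, and raising the detection rate to 25% lowers it substantially, in line with the qualitative pattern the text describes.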

Combining Table 2 and Fig. 1 shows that even if the probability distribution of spoiled ballots were known to be Poisson and the spoilage rate when equipment functions correctly were known perfectly, in 2020, in 58.2% of U.S. states fewer than half the counties had enough voters for passive testing to work, even in county-wide contests, on the assumption that 7% of voters whose votes are altered will spoil their ballots.

If turnout is roughly 50%, jurisdiction-wide contests in jurisdictions with fewer than 60,000 voters—22 of California’s 58 counties in 2020 [29]—cannot in principle limit the chances of false positives and false negatives to 5% for margins below 4%, even under these optimistic assumptions. For contests that involve only part of a jurisdiction, the situation is worse.

4.1 Targeting Vulnerable Voters

The analysis above assumes that all voters are equally likely to detect errors and spoil their ballots. But an attacker can use BMD settings, state history, and session data to target voters who are less likely to notice problems.

Voters with Visual Impairments. Approximately 0.8% of the U.S. population is legally blind; approximately 2% age 16 to 64 have a visual impairment [16]. Current BMDs do not provide voters with visual impairments a way to check the printout. If an attacker only alters votes when the voter uses the audio interface or large fonts, detection may be very unlikely.

Voters with Motor Impairments. Some BMDs allow voters to print and cast a ballot without looking at it, for instance the ES&S ExpressVote with “Autocast,” aka “permission to cheat” [1]. The attacker can change every vote cast using Autocast, with zero chance of detection.

Voters who use Languages other than English. U.S. law requires some jurisdictions to provide ballots in languages other than English. For instance, Los Angeles County, CA, provides voting materials in 13 languages [14]. In 2013, roughly 26% of voters in Los Angeles County spoke a language other than English at home [14]. It is our understanding that BMDs generally print only in English. If voters who use a foreign language on the BMD are unlikely to check the English-language printout, an attacker could change the outcome of contests with large margins with little chance of detection.

Fast and Slow Voters. An attacker can monitor how long it takes voters to make their selections, whether they change selections, how long they review the summary screen, etc. A voter who spends little time reviewing selections onscreen may be unlikely to review the printout carefully. Conversely, a voter who takes a very long time to make selections or changes selections repeatedly might find voting difficult or confusing and be unlikely to notice errors.

Thus, it is in the attacker’s interest to target the same groups of voters BMDs are supposed to help: voters with visual impairments, voters with limited dexterity, voters who use a language other than English, and voters with cognitive disabilities.

4.2 FUD Attacks on Passive Testing

Even under ideal circumstances, passive testing does not produce direct evidence of problems; it does not identify which ballots or contests have errors; and it does not provide any evidence about whether problems changed outcomes. Relying on spoiled ballots as a sign of fraud opens the door to a simple, legal way to undermine elections: encourage voters to spoil ballots.

5 LAT and Parallel Testing

Suppose that malware alters one or more votes with probability \(p\), independently across transactions, uniformly across voters—regardless of the voter’s selections or any aspect of the transaction that the attacker can ascertain. Then if auditors make \(n\) tests, the chance that the BMD will alter at least one of the votes in at least one of the tests—and the attack will be detected—is \(1 - (1-p)^n\). For \(p=0.01\), \(n=300\) tests would give a 95% chance of detecting a problem.

A BMD can handle roughly 140 transactions per day. Testing enough to have a 95% chance of detecting a 1% problem on one BMD would leave no time for voters to use that BMD. Even for pre-election LAT, where capacity for actual voters is not an issue, conducting 300 “typical” tests would take about 25 h.

If there were a large number of machines known to have been (mis)programmed identically, tests could be spread across them. But there are many small contests that need to be tested in conjunction with all other contests that appear on any ballot style that contains them, and there is no guarantee that all BMDs in a jurisdiction are programmed identically.

This threat model is completely unrealistic. An attacker who wants Alice to beat Bob will not alter votes for Alice: it would needlessly increase the chance of detection. And as discussed in Sect. 4.1, rather than randomly changing votes for Bob into votes for Alice, an attacker can target transactions that auditors are unlikely to probe.

Setting aside specific machines for testing facilitates a “Dieselgate” type attack [11], as does conducting tests on a schedule, as suggested by [7]. Tests need to be unpredictable—with respect to the specific BMDs tested, time, vote pattern, duration, and other characteristics of voting transactions—or attackers can avoid detection by altering only transactions that do not correspond to any test. There may be pressure to reduce testing when BMDs are busy, to reduce waiting times. Because malware can monitor the pace of voting, reducing testing when machines are busy makes it easier to avoid detection.

An attacker need not alter many transactions to change the outcome of small contests and contests with small margins. The fewer votes altered, the more tests required to ensure a large chance of detection. To test efficiently, tests should sample more common transactions with higher probability. Attackers might be able to estimate the distribution of transactions using malware installed on BMDs in previous elections, but testers will not, since it involves tracking voter behavior at a level of detail that violates voter privacy. See assumptions 4 and 7 and Sect. 5.2.

Auditors do not know which contest(s) and candidate(s) are affected. To have a large chance of detecting interference, there needs to be a large chance of testing a transaction the attacker alters. Attackers can target transactions that are intrinsically expensive to test, e.g., transactions that take longer than 10 min, transactions in which the voter changes some number of selections, transactions that display the ballot in a language other than English, transactions that use the audio interface at a reduced tempo, etc.

5.1 Lower Bounds on the Difficulty of Parallel Testing

We now study an idealized version of parallel testing, where auditors can tell whether a random sample of BMD printouts accurately show the voters’ selections. Suppose a contest has 4,470 voters, the median jurisdiction turnout in 2020. Suppose that malware alters votes in 23 transactions, which could change a margin by more than 1% in a jurisdiction-wide contest. How many randomly selected printouts would need to be checked to have at least a 95% chance of finding at least one with an error? The answer is the smallest \(n\) such that

$$\begin{aligned} \frac{4470-23}{4470} \cdot \frac{4469-23}{4469} \cdots \frac{4470-(n-1)-23}{4470-(n-1)} \le 0.05, \end{aligned}$$
(1)

i.e., \(n=546\) printouts, about 12.2% of the transactions, corresponding to testing each BMD several times per hour.
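Eq. 1 is the chance that a simple random sample of \(n\) printouts, drawn without replacement, misses all 23 altered transactions. A minimal sketch of the search for the smallest such \(n\) follows; the function names are hypothetical, and the result should land very close to the value quoted above (an exact evaluation can differ by one printout depending on rounding conventions).

```python
def miss_probability(total, altered, sample):
    """Chance that `sample` printouts drawn without replacement contain no altered one."""
    prob = 1.0
    for i in range(sample):
        prob *= (total - altered - i) / (total - i)
    return prob


def smallest_sample(total, altered, max_miss=0.05):
    """Smallest sample size whose miss probability is at most max_miss."""
    prob, n = 1.0, 0
    while prob > max_miss:
        prob *= (total - altered - n) / (total - n)  # multiply in the next factor of Eq. 1
        n += 1
    return n


n = smallest_sample(total=4470, altered=23)
print(n, f"{n / 4470:.1%}", miss_probability(4470, 23, n))
```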

Conversely, suppose auditors randomly check 13 printouts per day per machine (on average, testing hourly for a 13-hour day, \(\approx \)9.2% of BMD capacity). To have at least a 95% chance of detecting that the outcome of a contest with a 1% margin was altered, there would need to be at least 6,580 voters in the contest (almost 150% of the median turnout in jurisdictions across the U.S.), corresponding to 47 BMDs, even under these unrealistically optimistic assumptions.

5.2 Building a Model of Voter Behavior

In practice, auditors cannot check whether voters’ BMD printout is correct. Instead of sampling voters’ actual transactions in the election, they will have to come up with their own test transactions. Testing transactions uniformly at random from all possible transactions is doomed because the number of possible transactions is so large. To mimic voters, auditors might consider sampling from P, the population distribution of voting transactions, i.e., the fraction of voters who use the BMD in each of the \(S = 6.14 \times 10^6\) ways in the optimistic estimate in Table 1. Suppose an attacker wants to change the outcome of a contest with a margin of m, expressed as a fraction of ballots cast (rather than as a number of votes). The attacker only needs to change a fraction \(m/2\) of the transactions to change the margin by \(m\). To have probability at least \(1-\alpha \) of detecting a change to the outcome of any contest with true margin \(m\), auditors must test in a way that has probability at least \(1-\alpha \) of sampling at least once from every subset of transactions that contains a fraction \(m/2\) of the transactions. If auditors could sample transactions independently at random from \(P\), each sample transaction would have probability \(1-m/2\) of not being one of the altered transactions. The chance that t randomly selected transactions would not include one that is altered would be \((1-m/2)^t\). Thus the number of transactions auditors would need to test is the smallest \(t\) for which

$$\begin{aligned} ( 1-m/2 )^t \le \alpha , \; \text{ i.e., } \end{aligned}$$
(2)
$$\begin{aligned} t \ge \frac{ \log \alpha }{ \log (1-m/2) }. \end{aligned}$$
(3)

This is essentially Eq. 1 for sampling with replacement; the two are indistinguishable when t is small compared to the total number of possible transactions. A key difference is that in Eq. 1, auditors are sampling from the actual transactions in the election, while in Eq. 3, auditors are sampling from a model, the frequency distribution of transactions.

In practice, the auditors do not know P; they will have to estimate it by monitoring voters. In reality, this is impossible to do well: (i) In a given election, P will depend on the particular contests on the ballot and the particular voters who participate, both of which change from election to election. (ii) The variables that characterize a voting transaction include the voter’s selections and details about how the voter uses the BMD, so collecting the data would violate voter privacy illegally. To get a sense of the statistical difficulty of the problem, we ignore these practical difficulties. If auditors could select voters at random (with replacement) and observe in detail how they use the BMD—all the variables in Table 1—that would yield independent, identically distributed (IID) draws from \(P\), which could be used to make an estimate, \(\hat{P}\). If \(\hat{P}\) differs too much from P, no number of tests will suffice, because \(\hat{P}\) might estimate that the frequency of a transaction is zero when in fact it is sufficiently frequent that altering it could change an outcome. (By assumption 4, above, the attacker knows P and hence can exploit differences between \(\hat{P}\) and P.) How many voters would auditors have to observe to ensure (with sufficiently high probability) that \(\hat{P}\) is accurate enough for parallel testing?

The \(L_1\) distance between two distributions bounds the difference in the probability they assign to any set (\(|\hat{P}(A)-P(A)| \le \Vert \hat{P}-P\Vert _1/2\)). If \(\Vert \hat{P}- P\Vert _1 \ge m\), there may be a set \(A\) of transactions for which \(P(A) = m/2\) but \(\hat{P}(A) = 0\), so changing votes for transactions in \(A\) could alter some marginFootnote 5 by \(m\), with zero chance of detection, no matter how many tests are performed, if the tests are drawn from \(\hat{P}\) rather than P.

We cannot guarantee that \(\Vert \hat{P} - P\Vert _1 \le \varepsilon \) with certainty, but by observing enough randomly selected voters, we can ensure that the chance that \(\Vert \hat{P}-P\Vert _1 > \varepsilon \) is at most \(\beta \). If \(\alpha \le \beta \), even an infinite number of tests drawn from \(\hat{P}\) may not suffice to guarantee chance at least \(1-\alpha \) of detecting outcome-changing manipulations. If \(\alpha > \beta \), to guarantee chance at least \(1-\alpha \) of catching an outcome-changing error, the minimum number of tests required is

$$\begin{aligned} \min \left\{ t : ( 1+ \varepsilon /2 - m/2 )^t \le \frac{\alpha -\beta }{1-\beta } \right\} . \end{aligned}$$
(4)

Minimax Lower Bounds. Suppose auditors draw an IID sample of \(n\) transactions from \(P\), a frequency distribution on \(S\) possible transactions. Let \(\mathcal {M}_S\) denote the collection of all frequency distributions for those transactions. Then the training sample size \(n\) must be at least large enough to ensure that the \(L_1\) error of the best estimator \(\hat{P}\) is unlikely to exceed \(\varepsilon \), provided \(P \in \mathcal {M}_S\):

$$\begin{aligned} \inf _{\hat{P}} \sup _ { P \in \mathcal {M}_S } \Pr _P \{ \Vert \hat{P} - P \Vert _1 > \varepsilon \} \le \beta . \end{aligned}$$
(5)

Theorem

([8]) For any \(\zeta \in \left( 0, 1 \right] \),

$$\begin{aligned} \inf _{\hat{P}} \sup _{P \in \mathcal {M}_S} \mathbb {E}_P \Vert \hat{P} - P\Vert _1 \ge {}& \frac{1}{8} \sqrt{ \frac{eS}{(1+\zeta ) n}} \, \mathbbm {1} \left( \frac{(1+\zeta )n}{S} > \frac{e}{16} \right) \\ &+ \exp \left( - \frac{2(1+\zeta )n}{S} \right) \mathbbm {1} \left( \frac{(1+\zeta )n}{S} \le \frac{e}{16} \right) \\ &- \exp \left( - \frac{ \zeta ^2 n }{ 24 } \right) - 12 \exp \left( -\frac{\zeta ^2 S }{ 32( \ln S)^2 } \right) , \end{aligned}$$
(6)

where the infimum is over all \(\mathcal {M}_S\)-measurable estimators \(\hat{P}\).

Lemma

Let \(X\) be a random variable with variance \(\mathrm {Var}(X) \le 1\), and let \(\beta \in (0, 1)\). If \(\Pr \{X \ge \mathbb {E}X + \lambda \} \le \beta \) then \(\lambda \ge -\sqrt{\beta /(1-\beta )}\).

Proof

Suppose \(\lambda \ge 0\). Then \(\lambda \ge - \sqrt{\beta /(1-\beta )}\). Suppose \(\lambda < 0\). By Cantelli’s inequality and the premise of the lemma,

$$\begin{aligned} \beta \ge \Pr \{X \ge \mathbb {E}X + \lambda \} \ge 1 - \frac{\sigma ^2}{\sigma ^2 + \lambda ^2} = \frac{\lambda ^2}{\sigma ^2 + \lambda ^2} \ge \frac{\lambda ^2}{1 + \lambda ^2}. \end{aligned}$$
(7)

Solving for \(\lambda \) yields the desired inequality. \(\Box \).
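Spelled out, the final step of the proof uses only Eq. 7 and \(0< \beta < 1\):

$$\begin{aligned} \beta \ge \frac{\lambda ^2}{1 + \lambda ^2} \;\Longleftrightarrow \; \beta (1 + \lambda ^2) \ge \lambda ^2 \;\Longleftrightarrow \; \lambda ^2 \le \frac{\beta }{1-\beta }, \end{aligned}$$

and since \(\lambda < 0\) in this case, \(\lambda ^2 \le \beta /(1-\beta )\) is equivalent to \(\lambda \ge -\sqrt{\beta /(1-\beta )}\).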

Now \(0 \le \Vert \hat{P} - P \Vert _1 \le 2\), so by Popoviciu’s inequality \(\mathrm {Var}\, \Vert \hat{P} - P \Vert _1 \le (2-0)^2/4 = 1\). By the lemma, we need \(\lambda \ge -\sqrt{\beta /(1-\beta )}\) to ensure that \( \Pr \{ \Vert \hat{P}-P \Vert _1 \ge \mathbb {E} \Vert \hat{P}-P \Vert _1 + \lambda \} \le \beta \). If \(\Vert \hat{P}-P \Vert _1 \ge m\), there can be a set of transactions \(\tau \) such that \(P(\tau ) = m/2\) but \(\hat{P}(\tau ) = 0\), so if tests are generated randomly according to \(\hat{P}\) there is zero probability of testing any transaction in \(\tau \), no matter how many tests are performed. Thus if \(\Pr \{ \Vert \hat{P} - P \Vert _1 \ge m \} > \alpha \), even an infinite number of tests cannot guarantee chance at least \(1-\alpha \) of detecting that a fraction \(m/2\) of the transactions were altered, enough to wipe out a margin of \(m\). By the lemma, that is the case if \(m < \mathbb {E} \Vert \hat{P} - P \Vert _1 -\sqrt{\alpha /(1-\alpha )}\), i.e., if \(\mathbb {E} \Vert \hat{P} - P \Vert _1 > m+\sqrt{\alpha /(1-\alpha )}\).

The theorem gives a family of lower bounds on \(\mathbb {E} \Vert \hat{P} - P \Vert _1\) in terms of \(n\). If the lower bound exceeds \(m+\sqrt{\alpha /(1-\alpha )}\), testing by drawing transactions from \(\hat{P}\) cannot protect against all outcome-changing errors. The bound grows with \(S\), the number of possible transactions. To be optimistic, we use an unrealistically small value \(S = 6.14\times 10^6\) (Table 3).Footnote 6
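The comparison described above can be sketched numerically. The code below only evaluates the right-hand side of Eq. 6 and compares it with \(m+\sqrt{\alpha /(1-\alpha )}\); it is not the authors' software and is not claimed to reproduce Table 3, whose entries may rest on choices (such as the value of \(\zeta \)) that are not spelled out here. The function names and the default \(\zeta = 1\) are assumptions made for illustration; any \(\zeta \in (0, 1]\) yields a valid bound.

```python
import math


def l1_lower_bound(n, S, zeta=1.0):
    """Right-hand side of Eq. 6 for training size n, support size S, zeta in (0, 1]."""
    ratio = (1.0 + zeta) * n / S
    if ratio > math.e / 16.0:
        main = math.sqrt(math.e / ratio) / 8.0   # (1/8) * sqrt(e*S / ((1+zeta)*n))
    else:
        main = math.exp(-2.0 * ratio)
    return (main
            - math.exp(-zeta ** 2 * n / 24.0)
            - 12.0 * math.exp(-zeta ** 2 * S / (32.0 * math.log(S) ** 2)))


def model_too_coarse(n, S, margin, alpha, zeta=1.0):
    """True if the bound shows that tests drawn from the estimate cannot suffice."""
    return l1_lower_bound(n, S, zeta) > margin + math.sqrt(alpha / (1.0 - alpha))


S = 6.14e6  # the optimistic support size used in the text
print(l1_lower_bound(1_000_000, S))
print(model_too_coarse(1_082_000, S, margin=0.10, alpha=0.05))
```

With \(\zeta = 1\), a training sample of about a million transactions still leaves the bound above the threshold for a 10% margin and \(\alpha = 5\%\), so such a sample is too small under this criterion.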

To guarantee a 95% chance of detecting that \(m/2=5\%\) of transactions were altered, which could change jurisdiction-wide margins by 10% or more, the training sample would need to include at least 1.082 million transactions, even if auditors could conduct an infinite number of parallel tests. That is larger than the turnout in 99.7% of U.S. jurisdictions in 2020 [29]; it is roughly 0.5% of the U.S. voting population. To guarantee 99% chance of detecting that 0.5% of transactions were altered, which could change jurisdiction-wide margins by 1% or more, would require observing 3.876 million voters in complete, privacy-eliminating detail—more than the turnout in 99.9% of U.S. jurisdictions in 2020 [29], roughly 1.9% of the U.S. voting population.

Table 3. Lower bound on the sample size (col 4) required to estimate the distribution of voting transactions well enough to ensure the probability (col 1) of detecting the manipulation of the fraction of transactions (col 3) using some number of tests (col 2), if the support of the distribution of transactions has \(S=6.14\times 10^6\) points.

Fig. 3. Minimum training sample sizes as a function of the fraction of altered votes.

6 Complications

Reality is worse than the optimistic assumptions in our analyses:

Margins are not Known in Advance. Margins are not known until the election is over, when it is too late to do more testing if contests have narrower margins than anticipated. Testing to any pre-defined threshold, e.g., a 95% chance of detecting changes to 0.5% of the votes in any contest, will not always suffice.

Tests have Uncertainty. If the BMD printout reflects the wrong electoral outcome, a perfect full manual tally, recount, or risk-limiting audit based on BMD printout will confirm that wrong outcome. Suppose one could design practical parallel tests that had a 95% chance of sounding an alarm if BMDs alter \(\ge 0.5\%\) of the votes in any contest. A reported margin of 1% or less in a plurality contest is below the “limit of detection” of such tests. Would laws require a runoff whenever a reported margin is below the limit of detection of the tests?

Special Risks for Some Voters. As discussed above in Sect. 4.1, BMDs can be used to selectively disenfranchise voters with disabilities and voters whose preferred language is not English. Indeed, the attacker’s best strategy is to target such voters, in part because poll workers might be more likely to think that complaints by such voters reflect voter mistakes rather than BMD malfunctions.

The only Remedy is a New Election. If a BMD is caught misbehaving, it should be removed from service and all BMDs in that jurisdiction should be investigated. But there is no way to determine the correct outcome or which votes were affected: BMDs are not strongly software independent [21].

7 Conclusion

We show that to protect against outcome-altering BMD malfunctions requires orders of magnitude more testing than is feasible. To our knowledge, no jurisdiction has conducted any parallel testing of BMDs of the kind suggested by [7, 32], much less enough to reliably detect outcome-changing errors, bugs, or hacks.

Even if it were possible to test enough to get high confidence that no more than some threshold percentage of the votes were changed in any contest, fairness would demand a runoff in contests decided by less than that threshold.

Some BMDs may be the best extant technology for voters with particular disabilities to mark and cast a paper ballot independently. But many BMDs are poorly designed. Some have easily exploited security flaws [1] and some do not enable voters with common disabilities to vote independently [22, pp. 68–90]. To our knowledge, no VVSG-certified BMD system provides a means for blind voters to check whether the printout matches their intended selections.

Using BMDs makes elections less trustworthy, less resilient, less transparent, more fragile, and more expensive [1, 17]. BMDs have failure modes that hand-marked paper ballots do not have, and lack resilience when failures occur [1]. BMDs shift the burden of ensuring that voting equipment functions correctly from officials to voters, but do not provide voters any way to prove that they observed problems, if they do; nor can election officials show that outcomes are correct despite any problems that might have occurred [1]. BMDs undermine the ability of election officials to provide affirmative evidence that outcomes are correct, the fundamental principle of “evidence-based elections” [2, 25].

Voters who use BMDs should be urged to bring a written list of their selections to the polls to check against BMD printout, and to request a fresh ballot if the printout does not match their intended selections. Election officials should track spoiled BMD printouts. There should be research on how to encourage voters to check BMD printout and report discrepancies, how to ensure the checks are accurate, and how to ensure that any reported problems are accountably and transparently recorded, addressed, and publicized; these issues also arise in end-to-end cryptographically verifiable (E2E-V) voting systems. For the foreseeable future, prudent election administration requires keeping the use of BMDs to a minimum.