1 Introduction

The vast majority of votes in the United States are counted mechanically, either by optical scanners that read paper ballots or by direct-recording electronic (DRE) voting machines [70]. To validate that these tabulation devices are configured and functioning correctly, jurisdictions perform a procedure called “logic and accuracy testing” (“L&A testing”) shortly before each election. It typically involves casting a “test deck”—a set of ballots with known votes—on each machine, then printing the results and ensuring the tally is as expected. Any deviation is a potential indicator that the election equipment has misbehaved.

While more sophisticated mechanisms such as risk-limiting audits [37] and end-to-end verification [15] can reliably detect and recover from both errors and attacks after the fact, they are not yet widely applied in the U.S. Even if they were, L&A testing would remain useful for heading off some sources of error before they affected results. Ideally, L&A testing can protect against certain kinds of malfunction, configuration error, and fraud as well as strengthen voter confidence, but its effectiveness depends on many details of how the testing is performed. In the U.S., L&A testing requirements—like most aspects of election procedure and the selection of voting equipment—are determined by individual states, resulting in a diversity of practices with widely varying utility.

Unfortunately, this heterogeneity means that many states diverge negatively from the norm and makes it difficult to offer the national public any blanket assurances about the degree of protection that L&A testing affords. Moreover, many states do not publish detailed L&A procedures, leaving voters with little ability to assess the effectiveness of their own states’ rules, let alone whether any tests they observe comply with them. Yet this decentralized regulatory environment has also allowed a variety of positive L&A testing procedures to evolve, and there are abundant opportunities for the exchange of best practices.

This paper provides the first comparative analysis of L&A testing requirements across the fifty states. To determine how each state performs L&A testing, we conducted an extensive review of available documentation and reached out to election officials in every state. We then assessed and scored each state’s policy using criteria designed to reflect its functional effectiveness and suitability as a basis for voter confidence. The results provide a detailed understanding of how states’ procedures differ and how well they approach an ideal model of what L&A testing can achieve. Our analysis reveals that several important L&A criteria are absent in many or most states’ rules, yet we also highlight specific examples of policies that could serve as models for broader dissemination. We hope this work will encourage the adoption of more effective L&A testing requirements across the United States and help promote policies that better inspire public trust.

2 Background

2.1 L&A Testing Goals

L&A testing was first introduced in the early 1900s for lever-style voting machines [63], which contained a mechanical counter for each candidate. The counters were susceptible to becoming jammed due to physical failure or tampering, so tests were designed to establish that each counter would advance when voted.

Modern DRE voting machines and ballot scanners can suffer from analogous problems—miscalibrated touch-screens or dirty scanner heads can prevent votes in specific ballot positions from being recorded [31]—but they also have more complex failure modes that call for different forms of testing. These devices must be provisioned with an “election definition” that specifies the ballot layout and rules. If the election definition is wrong—for instance, if the order or position of voting targets does not match the ballots a scanner will read—votes may be miscounted.

Problems with election definitions caused by human error are surprisingly common. They contributed to the publication of incorrect initial election results in Northampton County, Pennsylvania, in 2019 [14], Antrim County, Michigan, in 2020 [28], and DeKalb County, Georgia, in 2022 [23]. In these documented cases the errors were fortunately detected, but only after the results were announced. They likely could have been prevented in the first place by sufficient L&A testing.

L&A testing can also serve a role in election security. Research has long recognized that L&A testing cannot reliably defeat an adversary who manages to execute malware on voting machines, because the malware could detect when it was under test and only begin cheating during the election itself (see, e.g., [24]). However, L&A testing can potentially thwart more limited attackers who manage to tamper with election definitions or configuration settings. For example, although there is no evidence that the instances of error described above were caused by fraud, attackers could cause similar election definition problems deliberately in an attempt to alter results. This would likely require far less sophistication than creating vote-stealing malware. Moreover, there is growing concern about threats posed by dishonest election insiders, who routinely have the access necessary to perform such an attack [17].

Beyond providing these protections, L&A testing also frequently serves a role in enhancing public confidence in elections. Most states conduct at least part of their L&A testing during a public ceremony, where interested political party representatives, candidates, news media, and residents can observe the process and sometimes even participate by marking test ballots. Some jurisdictions also provide live or recorded video of their testing ceremonies online. These public tests can help build trust by allowing voters to meet their local officials, observe their level of diligence, and become more familiar with election processes. Additionally, public observers have the potential to make testing stronger, by providing an independent check that the required tests were completed and performed correctly. At least in principle, public observation could also help thwart attempts by dishonest officials to subvert L&A testing by skipping tests or ignoring errors.

2.2 U.S. Elections

L&A testing fills a role that is best understood with a view towards the broader context of election administration in the jurisdictions where it is practiced. In the U.S., many subjects are put to the voters, frequently all at once, and a single ballot might include contests ranging from the national presidency and congress to the state governor, legislature, and judges to the local mayor, city council, sheriff, and school board [13]. This means elections tend to involve many contests—typically around 20, although some jurisdictions have occasionally had nearly 100 [76]. There may also be several ballot variants within a single polling place to accommodate candidates from different sets of districts. These features make tallying by hand impracticable in many areas. As a result, nearly all jurisdictions rely on electronic tabulation equipment, today most commonly in the form of computerized ballot scanners [70]. Ensuring that these machines are properly configured and functioning on election day is the key motivation for L&A testing.

Election administration in the U.S. is largely the province of state and local governments. Although the Constitution gives Congress the power to override state law regarding the “manner of holding Elections for Senators and Representatives,” this authority has been applied only sparingly, for instance to establish accessibility requirements and enforce civil rights [30, 73]. Each state legislature establishes its own election laws, and the state executive (typically the secretary of state) promulgates more detailed regulations and procedures. In practice, election administration powers are exercised primarily by local jurisdictions, such as counties or cities and townships, where local officials (often elected officials called “clerks”) are responsible for conducting elections [45].

Because of this structure, there is little standardization of election practices across the states, and L&A testing is no exception. Testing processes (and the ceremonies that accompany them) vary substantially between and within states. As we show, these variations have significant effects, both with respect to error-detection effectiveness and procedural transparency and intelligibility. Pessimistically, one can view this broad local discretion as a way for some jurisdictions to use lax practices with little accountability. We note, however, that it also grants many clerks the power to depart upwards from their states’ mandatory procedures, achieving stronger protections than the law requires. This provides an opportunity for improved practices to see early and rapid adoption.

2.3 Related Work

Although L&A testing itself has so far received little research attention, there is extensive literature analyzing other aspects of election mechanics across states and countries, with the goal of informing policymaking and spreading best practices. For instance, past work has examined state practices and their impacts regarding post-election audits [68], voter registration list maintenance [10], voter identification requirements [16], online voter registration [79], election observation laws [27], and the availability of universal vote-by-mail [67]. A far larger body of research exists comparing state practices in fields other than elections.

Despite the abundance of this work, we are the first (to our knowledge) to examine states’ L&A testing practices in detail. A 2018 state-by-state report by the Center for American Progress [58] considered L&A testing among several other aspects of election security preparedness; however, it primarily focused on the narrow question of whether states required all equipment to be tested. To build upon this research, we consider many other policy choices that influence the effectiveness of L&A requirements and procedures.

3 Methodology

3.1 Data Collection

To gather information on states’ practices, we began by collecting official documentation where publicly available, relying primarily on state legal codes, state election websites, and Internet search engines. If we could not locate sufficient information, we attempted to contact the state via email or by phone to supplement our understanding or ask for clarifications. We directed these inquiries to the state elections division’s main contact point, as identified on its website.

State responses varied. While some states provided line-by-line answers to each of our questions, it was common for states to indicate that our criteria were more specific than what state resources dictated, to point us instead to the same statutes and documentation we had already examined, to provide additional documentation that was still unresponsive to our questions, or to reply in paragraphs that partially addressed some questions while completely disregarding others. In cases where we could not find evidence to support that a state satisfied certain criteria and the state did not provide supporting evidence upon request, we did not award the state any points for those criteria.

Upon finalizing our summary of each state’s practices, we contacted officials again to provide an opportunity for them to complete or correct our understanding. Over the course of nine months, we communicated with all 50 states and received at least some feedback on our summaries from all but seven states—Iowa, New Jersey, New York, Rhode Island, Tennessee, Vermont, and Wisconsin. Our data and analysis are current as of July 2022.

3.2 Evaluation Criteria

To uniformly assess and compare states’ practices, we applied the following criteria and scoring methodology, which reflect attributes we consider important for maximizing the benefits of L&A testing in terms of accuracy and voter confidence. These criteria are non-exhaustive, but we believe they are sufficiently comprehensive to evaluate state procedures relative to one another. (Additional desirable testing properties are discussed in Sect. 5.) Note that our assessments do not necessarily reflect practice in each of a state’s subdivisions, since local officials sometimes have authority to exceed state guidelines. To keep the analysis tractable, we instead focus on the baseline established by statewide requirements.

We developed two categories of criteria: procedural criteria, which encompass the existence of procedures, the scope of testing, and transparency; and functional criteria, which reflect whether the testing could reliably detect various kinds of errors and attacks. To facilitate quantitative comparisons, we assigned point values to each criterion, such that each category is worth a total of 10 points and the weights of specific items reflect our assessment of their relative importance.
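As an illustration of how these weights combine, the following sketch expresses the scoring scheme as a simple weighted sum. This is our own minimal example, not the tooling we actually used, and the state assessment shown is hypothetical; the criterion identifiers and weights mirror those defined in the remainder of this section.

PROCEDURAL_WEIGHTS = {
    "RT1": 1.5, "RT2": 1.0, "RT3": 1.5, "RT4": 1.0,  # Rules and Transparency (5 pts)
    "ST1": 2.0, "ST2": 1.0, "ST3": 2.0,              # Scope of Testing (5 pts)
}
FUNCTIONAL_WEIGHTS = {
    "BP1": 1.0, "BP2": 3.0,                          # Basic Protections (4 pts)
    "OP1": 0.5, "OP2": 1.5,                          # Overvote Protection (2 pts)
    "ND1": 1.0, "ND2": 3.0,                          # Nondeterministic Testing (4 pts)
}

def category_score(weights, assessment):
    # assessment maps criterion -> 1.0 (met), 0.5 (partly met), or 0.0 (unmet/unknown)
    return sum(w * assessment.get(c, 0.0) for c, w in weights.items())

# Hypothetical state: general (not detailed) published rules, restricted public testing,
# all machines and ballot styles tested, and only the basic functional protections.
example = {"RT1": 0.5, "RT2": 1.0, "RT3": 0.5, "ST1": 1.0, "ST3": 1.0, "BP1": 1.0, "OP1": 1.0}
print(category_score(PROCEDURAL_WEIGHTS, example))  # 6.5 of 10
print(category_score(FUNCTIONAL_WEIGHTS, example))  # 1.5 of 10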

Procedural Criteria

Rules and Transparency (5 points).   To provide the strongest basis for trust, testing should meet or exceed published requirements and be conducted in public.

  • RT1 (1.5 pts): Procedures are specified in a detailed public document. This captures the threshold matter of whether states have published L&A requirements. Detailed or step-by-step guidelines received full credit, and general laws or policies received half credit.

  • RT2 (1.0 pts): The document is readily available, e.g., via the state’s website.

    Making L&A procedures easily available helps inform the public and enables observers to assess tests they witness.

  • RT3 (1.5 pts): Some testing is open to the public, candidates/parties, and journalists.

    This tracks the potential for public L&A ceremonies to strengthen confidence.

  • RT4 (1.0 pts): Local jurisdictions have latitude to exceed baseline requirements.

Scope of Testing (5 points).   A comprehensive approach to testing covers every ballot design across all the voting machines or scanners where they can be used.

  • ST1 (2.0 pts): All voting machines/scanners must be tested before each election.

  • ST2 (1.0 pts): All devices must be tested at a public event before each election.

  • ST3 (2.0 pts): All devices must be tested with every applicable ballot design. Failing to test all machines or all ballot styles risks that localized problems will go undetected, so each was assigned a substantial 2 points. One additional point was provided if all testing is public, to reflect transparency interests.

Functional Criteria

In each of three sets of functional criteria, we assess a simple form of the protection (with a small point value) and a more rigorous form (with a large point value).

Basic Protections (4 points).   To guard against common errors, tests should cover every voting target and ensure detection of transpositions.

  • BP1 (1.0 pts): All choices receive at least one valid vote during testing.

  • BP2 (3.0 pts): No two choices in a contest receive the same number of votes.

    The first test minimally detects whether each candidate has some functioning voting target. The second further ensures the detection of transposed targets within a contest, which can result from misconfigured election definitions.
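To make the distinction concrete, the short sketch below (our own illustration, not a procedure prescribed by any state; the candidate names are hypothetical) shows why BP2-style distinct vote counts expose a transposed pair of voting targets:

def bp2_counts(candidates):
    # Give every choice a distinct, nonzero number of test votes: 1, 2, 3, ...
    return {name: i + 1 for i, name in enumerate(candidates)}

def swap_targets(tally, a, b):
    # Model an election definition that transposes the voting targets of a and b.
    swapped = dict(tally)
    swapped[a], swapped[b] = tally[b], tally[a]
    return swapped

expected = bp2_counts(["Alice", "Bob", "Carol"])   # {'Alice': 1, 'Bob': 2, 'Carol': 3}
reported = swap_targets(expected, "Bob", "Carol")  # what a mis-mapped scanner would report
print(reported != expected)  # True: distinct counts make the transposition visible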

Overvote Protection (2 points).   Testing should exercise overvote detection and, ideally, confirm that the overvote threshold in each contest is set correctly.

  • OP1 (0.5 pts): At least one overvoted ballot is cast during testing.

  • OP2 (1.5 pts): For each contest c, a test deck includes a ballot with \(n_c\) selections and one with \(n_c+1\) selections, where \(n_c\) is the permitted number of selections.

    An overvote occurs when the voter selects more than the permitted number of candidates, rendering the selections invalid. The first practice minimally detects that the machine is configured to reject overvotes, while the second tests that the allowed number of selections is set correctly for each contest.
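As a rough sketch of what an OP2-style deck involves (again our own illustration with hypothetical contest data, not a state-mandated procedure), each contest contributes one ballot at its selection limit and one just over it:

def overvote_test_ballots(contests):
    # contests maps a contest name to (list of choices, permitted number of selections n_c).
    for name, (choices, n_c) in contests.items():
        yield name, choices[:n_c], "should be counted"       # exactly n_c selections
        yield name, choices[:n_c + 1], "should be rejected"  # n_c + 1 selections: an overvote

contests = {
    "School Board (vote for 3)": (["A", "B", "C", "D", "E"], 3),
    "Mayor (vote for 1)":        (["X", "Y", "Z"], 1),
}
for contest, selections, expectation in overvote_test_ballots(contests):
    print(contest, selections, expectation)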

Nondeterministic Testing (4 points).   For stronger protection against deliberate errors, attackers should be unable to predict how the test deck is marked.

  • ND1 (1.0 pts): Public observers are allowed to arbitrarily mark and cast ballots.

  • ND2 (3.0 pts): Some ballots must be marked using a source of randomness.

    Attackers who can predict the test deck potentially can tamper with the election definition such that errors will not be visible during testing. If the public can contribute test ballots, this introduces uncertainty for the attacker, while requiring random ballots allows for more rigorous probabilistic detection.
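A minimal sketch of ND2-style marking, assuming a cryptographically secure source of randomness (Python's secrets module here) and hypothetical contest data, could look like the following; any real procedure would also need to record the random choices so the expected tally can be computed:

import secrets

def random_test_ballot(contests):
    # Mark each contest with a uniformly random number (0..n_c) of randomly chosen
    # selections, so an attacker cannot predict the deck's contents in advance.
    ballot = {}
    for name, (choices, n_c) in contests.items():
        remaining = list(choices)
        k = secrets.randbelow(n_c + 1)
        ballot[name] = [remaining.pop(secrets.randbelow(len(remaining))) for _ in range(k)]
    return ballot

contests = {"Governor (vote for 1)": (["P", "Q", "R"], 1),
            "County Commission (vote for 2)": (["S", "T", "U", "V"], 2)}
print(random_test_ballot(contests))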


4 Analysis

Our nationwide review of L&A procedures highlights significant variation among the testing practices of the fifty states, as illustrated by the maps in Fig. 2. The tables on page 7 summarize our findings and rank the states with respect to the procedural and functional criteria. We also provide a capsule summary of each state’s practices in Appendix A.

Fig. 1. We count the number of states that met, partly met, or did not meet each criterion. While states commonly require simple protections (BP1, OP1), most do not achieve more rigorous forms of error detection (e.g., OP2, ND1, and ND2).

4.1 Performance by Criterion

Figure 1 shows the number of states that met, partly met, or did not meet each criterion. All 50 states have laws that require L&A testing, but only 22 have a public, statewide document that details the steps necessary to properly conduct the tests (RT1). Of those that do not, several (such as California) merely instruct local jurisdictions to follow instructions from their voting equipment vendors, leaving the efficacy of logic and accuracy procedures up to each vendor’s preferences.

States generally performed well with respect to transparency criteria. Every state has some public documentation about its L&A practices, with 40 states making this documentation readily available (RT2). At least 45 states perform some or all testing in public (RT3), although 7 of these impose restrictions on who is allowed to attend. At least 32 states test every machine in public (ST2). Just three states (Kentucky, Maryland, and Hawaii) do not conduct any public L&A testing, which may be a significant lost opportunity to build public trust.

Most states also scored high marks regarding the scope of their testing. We were able to confirm that at least 44 states require all equipment to be tested before each election (ST1). Exceptions include Tennessee, Texas, and Indiana, which only require testing a sample of machines. At least 32 states require every ballot style to be tested, but 5 or more do not (ST3), which increases the chances that problems will go undetected in these jurisdictions.

Consideration of several other criteria is more complicated, because we often lack evidence for or against their being met. We have insufficient data about 19 or more states for RT4 and most of the functional criteria (BP2, OP2, ND1, and ND2). Details concerning functional criteria tended to be less frequently described in public documentation, which potentially biases our analysis in cases where states were also unresponsive to inquiries. We treated such instances as unmet for scoring purposes, but it is also informative to consider the ratio of met to unmet, as depicted in Fig. 1. One example is whether states allow local jurisdictions to exceed their baseline requirements (RT4). Although we lack data for 23 states, the criterion is met by at least 26 states, and we have only confirmed that it is unmet in one state (New Mexico). This suggests that many of the unconfirmed states likely also allow local officials to depart upwards from their requirements.

After accounting for what data was available, states clearly perform much better on our procedural criteria than on our functional criteria. This suggests that many of the functional attributes we looked for are aspirational relative to current practice and indicates that L&A testing could provide much more value.

The only two functional criteria that most states meet are basic protections for voting targets (BP1) and overvotes (OP1), which are provided in at least 40 and 42 states, respectively. At least 17 states would detect transposed voting targets within contests (BP2), but as few as 8 fully validate overvote thresholds (OP2). These more rigorous protections require more complicated procedures and larger test decks, but that some states achieve them suggests that they would be practical to implement more broadly.

Policies facilitating even the basic form of nondeterministic testing were rare. Only 11 states scored even partial points for conducting nondeterministic testing, with 9 of them allowing public observers to mark test ballots (ND1) and 3 of them (Arizona, Connecticut, Vermont) confirming that election officials are required to mark random selections (ND2). Of the three, only Arizona confirmed that it required officials to use a random number generator, thus earning full points. These findings are surprising, since unpredictable or randomized testing can thwart certain kinds of attacks that predictable tests cannot. That nondeterministic testing is rare greatly limits the security benefits of typical state L&A practices.

Fig. 2. Mapping state scores (darker indicates better performance) shows that L&A testing practices vary significantly within all geographic regions of the U.S.

Fig. 3. Many states have perfect or near-perfect procedural scores, but functional scores are generally lower, reflecting opportunities for making L&A more effective.

Fig. 4. No state’s functional score exceeds its procedural score, perhaps due to more limited data about functional aspects of testing. At most levels of procedural scores, states’ functional scores spanned a wide range, with no strong correlation.

4.2 Performance by State

When comparing states’ overall L&A testing practices, we find wide variation across both procedural and functional criteria. As illustrated in Fig. 2, this variation is not clearly explained by regionalism. However, the plot in Fig. 3 reveals several notable features in the distributions of states’ scores.

Most obviously, procedural scores were much higher than functional scores. Again, this likely reflects both the relative scarcity of public documentation about functional aspects of L&A testing and that our chosen functional criteria were somewhat aspirational. No state achieved a perfect functional score, but 4 states (Montana, Ohio, Pennsylvania, and Washington) achieved perfect procedural scores. Four other states could potentially achieve this benchmark but did not provide missing information we requested. Eleven more states clustered just shy of perfect procedural scores, of which 10 could achieve full points simply by making detailed L&A procedures public (RT1)—potentially a zero-cost policy change.

Notable relationships occur between certain criteria. For instance, concerning the scope of testing, only 2 states that are known to require testing every ballot style (ST3) do not also require testing every machine (ST1). It is much more common for states that require testing every machine to not require testing every ballot style (or remain silent), which 14 of 44 states did. This suggests that L&A policymakers are more likely to be aware of the potential for problems that affect only specific machines than of issues that can affect only specific ballot styles.

The distribution of functional scores highlights further relationships. The largest cluster, at 1.5, comprises 16 states that require basic protections for voting targets (BP1) and overvotes (OP1) but meet no other functional criteria. Interestingly, many of these states employ similar statutory language, with little or no variation. Although we have so far been unable to identify the common source these states drew on, the situation suggests that providing stronger model legislation could be a fruitful way to encourage the adoption of improved L&A practices.

At least 21 other states accomplish basic voting-target and overvote protections plus one or more of the stronger functional criteria. Most commonly, they require additional testing to detect transposed voting targets within contests (BP2), which 17 states do. Eight of these states accomplish no further functional criteria, resulting in a cluster at score 4.5. Five others also fully validate overvote thresholds (OP2), as do only 3 other states, indicating a strong correlation between these more rigorous testing policies. In a surprising contrast, although nondeterministic testing is comparably uncommon (only 8 states fully achieve either ND1 or ND2), practicing it does not appear to be well correlated with any of the other non-basic functional criteria. This may indicate that states have introduced nondeterministic testing haphazardly, rather than as the result of careful test-process design.

Considering both scoring categories together (Fig. 4), we see that although no state’s functional score exceeds its procedural score, there is otherwise a wide range of functional scores at almost every level of procedural score. This may partly reflect limitations due to unresponsive states, but it may also suggest opportunities to better inform policymakers about ways to strengthen L\( { \& }\)A functionality, particularly in states in the lower-right corner of the figure, which have a demonstrated ability to develop procedurally robust testing requirements.

The overall national picture shows that every one of our evaluation criteria is satisfied by at least one state, indicating that even the most rigorous functional criteria are realistic to implement in practice. Several states have the potential to serve as models of best practice across most dimensions of L&A testing, especially if procedural specifics are made readily accessible. In particular, Arizona and South Dakota each achieved full points in all but one criterion from each category, and Connecticut achieved the highest total score of any state under our metrics. We provide additional information and references regarding their procedures in our state-by-state summaries, found in Appendix A.

5 Discussion

Our findings support the need for strengthened L&A procedures nationwide. Current practice has room for substantial improvement in both transparency and substance, and state policy should seek to realize this potential.

Election security researchers and practitioners should work together to establish normative standards for L&A testing procedures and to draft model legislation to realize them. The precise mechanism for establishing this standard is beyond the scope of this paper, but a potential route would be for the National Institute of Standards and Technology (NIST) to issue L&A testing guidelines. Under the Help America Vote Act (HAVA), NIST is charged with the design of voting system standards, in coordination with the U.S. Election Assistance Commission (EAC) [30], and it has previously issued guidance for other aspects of election technology administration, such as cybersecurity and accessibility.

One challenge in the adoption of any technical standard is leaving safe and flexible opportunities for upward departure. It would be dangerous to lock in procedures that are later found to be insufficient, especially if every state would have to update its laws in response. For this reason, it is important that any L&A policy changes allow some degree of flexibility for local jurisdictions. Too much flexibility, however, can weaken security guarantees even with the best of intentions. One clerk we spoke with in the preparation of this paper offhandedly told us that she did not always follow the state requirement that every candidate in a contest receive a different number of votes, since in real elections ties could occur and she felt it was important to test that behavior too. Despite the well-meaning nature of this deviation, it decreased the guarantees provided by her L&A testing, since it meant the two candidates who tied in the test deck could have had their votes swapped without the test noticing. Clerks do not have the resources to rigorously analyze all ramifications of deviations from procedure, so latitude to deviate should be provided only where it cannot reduce the integrity of the process, such as in optional, additional phases of testing.
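The risk introduced by this kind of deviation can be seen in a two-line check, in the same spirit as the BP2 sketch in Sect. 3.2 (the names and counts here are hypothetical): when two candidates receive equal test counts, transposing their voting targets leaves the reported tally unchanged.

# Hypothetical test deck in which Alice and Bob deliberately tie:
expected = {"Alice": 2, "Bob": 2, "Carol": 3}
# A scanner whose election definition swaps Alice's and Bob's targets reports:
reported = {"Alice": expected["Bob"], "Bob": expected["Alice"], "Carol": expected["Carol"]}
print(reported == expected)  # True: the tie makes the swap undetectable by the test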

We leave to future work determining what model L&A policies should look like. While the elements of transparency, openness, and security we considered in this paper are potential low-hanging fruit, there are other elements of successful L&A practice that we did not measure or describe. For instance, testing policies should consider not only ballot scanners but also ballot marking devices (BMDs), which are computer kiosks that some voters use to mark and print their ballots. Most jurisdictions use BMDs primarily for voters with assistive needs, but some states require all in-person voters to use them [70]. Errors in BMD election definitions can lead to inaccurate results [23], but carefully designed L&A testing might reduce the incidence of such problems. Another example of an intervention that would have detected real-world issues in the past is “end-to-end” L&A testing, where tabulator memory cards are loaded into the central election management system (EMS) and its result reports are checked against the test decks. One of the problems in Antrim County that caused it to report initially incorrect results in 2020 was an inconsistency between some tabulators and the EMS software, and “end-to-end” L&A testing could have headed off this issue [28].
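A hedged sketch of what such an end-to-end check could look like follows; the data structures and values are hypothetical, and actual tabulator tapes and EMS report formats vary by vendor. The idea is simply to compare the known test-deck totals against both the tabulator-level results and the report the EMS produces after importing the memory cards.

def discrepancies(expected, reported, source):
    # List every contest/choice where a report disagrees with the known test deck.
    return [f"{source}: {contest}/{choice} expected {n}, reported {reported.get(contest, {}).get(choice)}"
            for contest, choices in expected.items()
            for choice, n in choices.items()
            if reported.get(contest, {}).get(choice) != n]

expected       = {"Mayor": {"X": 3, "Y": 1}}
tabulator_tape = {"Mayor": {"X": 3, "Y": 1}}  # matches the deck: tabulator-level test passes
ems_report     = {"Mayor": {"X": 1, "Y": 3}}  # mis-mapped during import: only an end-to-end check catches it

print(discrepancies(expected, tabulator_tape, "tabulator"))  # []
print(discrepancies(expected, ems_report, "EMS"))            # two mismatches flagged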

We do, however, recommend that future L&A guidelines incorporate elements of nondeterministic testing. While our data shows that this practice is still quite rare in the U.S., using test decks that are unpredictable would make it more difficult to construct malicious election definitions that pass the testing procedure.

Election technology has evolved over time, but some L&A testing practices still carry baggage from the past. For instance, functional requirements in many U.S. states are suited for detecting common problems with mechanical lever voting machines but less adept at uncovering common failure modes in modern computerized optical scanners, such as transposed voting targets. Other nations, which may at this time be adopting optical scan equipment of their own, can learn from these standards and improve on them as they choose their own practices for the future. By applying careful scrutiny to existing processes and incorporating the elements that make the most sense in their own context, these polities can ensure that their testing procedures are constructed to meet their needs.

6 Conclusion

We performed the first detailed comparative analysis of L&A testing procedures, based on a review of L&A requirements and processes across all fifty U.S. states. Although L&A testing can be a valuable tool for spotting common configuration errors and even certain kinds of low-tech attacks, our analysis shows that there is wide variation in how well states’ testing requirements fulfill these prospects. We hope that our work can help rectify this by highlighting best practices as well as opportunities for improvement. Rigorous, transparent L&A testing could also be a valuable tool for increasing public trust in elections, by giving voters a stronger basis for confidence that their votes will be counted correctly.