Introduction

In the last 20 years, technological advances such as optical mark recognition and online surveys have allowed much data entry to be computerized, which increases both efficiency and accuracy. However, not all data entry can be automated: Behavioral observations, children’s data, and scores from open-ended questions still need to be manually entered into a computer. Moreover, many of the data that theoretically could be entered automatically are not. For both financial and practical reasons, field data, surveys, and classroom exams are often completed on paper and then later entered into a computer.

Manual data entry inevitably leads to data-entry errors. In medical settings, data-entry errors can have catastrophic consequences for patients and are thankfully rare (Gibson, Harvey, Everett, Parmar, & on behalf of the CHART Steering Committee, 1994): Estimates put the error rate around 0.20% (Reynolds-Haertle & McBride, 1992) or between 0.04% and 0.67%, depending upon the type of data (Paulsen, Overgaard, & Lauritsen, 2012). However, in research contexts, data-entry errors are more common. Error rates typically range from 0.55% to 3.6% (Barchard & Pace, 2011; Bateman, Lindquist, Whitehouse, & Gonzalez, 2013; Buchele, Och, Bolte, & Weiland, 2005; Kozak, Krzanowski, Cichocka, & Hartley, 2015; Walther et al., 2011), although error rates as high as 26.9% have been found (Goldberg, Niemierko, & Turchin, 2008). Even if only 1% of entries are erroneous, if a study contains just 200 items, manually entering the data could result in data-entry errors for almost every participant.

Simple data-entry errors, such as typing an incorrect number or skipping over a line, can drastically change the results of a study (Barchard & Pace, 2008, 2011; Hoaglin & Velleman, 1995; Kruskal, 1960; Wilcox, 1998). For example, they can reverse the direction of a correlation or make a significant t test nonsignificant (Barchard & Verenikina, 2013). In one study, data-entry errors sometimes increased sample means tenfold, made confidence intervals 17 times as wide, and increased correlations by more than .20 (Kozak et al., 2015). In clinical research, data-entry errors can impact study conclusions, thus swaying the standard care of thousands of patients (Goldberg et al., 2008). Researchers therefore use a variety of strategies to prevent data-entry errors, such as numbering the items, using data-entry software that shows all items simultaneously, and entering data exactly as shown on the page (Schneider & Deenan, 2004). However, whatever steps are taken to ensure the initial data accuracy, researchers cannot know whether entries were entered correctly unless they check them.

Some researchers have advocated holistic data-checking methods, such as examining scatterplots and histograms (Tukey, 1977) and calculating univariate and multivariate statistics to detect outliers and other influential data points (Osborne & Overbay, 2004; Tabachnick & Fidell, 2013). However, these methods may not detect errors that fall within the intended ranges for the variables (Stellman, 1989). Therefore, item-by-item data checking is necessary to ensure that all data-entry errors are identified.

A variety of item-based data-checking methods can be used. In visual checking, the data checker visually compares the original paper data sheets with the entries on the computer screen. In solo read aloud, the data checker reads the original paper data sheets aloud and visually checks that the entries on the screen match. In partner read aloud, two data checkers are required: One reads the data sheets aloud, while another checks that the entries on the computer screen match. Finally, in double entry, the data checker enters the data into the computer a second time, and the computer compares the two entries and flags any discrepancies (see Fig. 1); the data checker then examines the original data sheet to determine which entry is correct. The purpose of all these data-checking procedures is to identify and correct data-entry errors. It is therefore important to study the effectiveness and efficiency of these methods.
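
To make the comparison step in double entry concrete, the brief R sketch below shows how a computer could flag discrepancies between two passes over the same data. It is a minimal illustration under our own naming (flag_discrepancies, MathTest, and the toy ID values are ours), not the software used in any particular study.

```r
# Minimal sketch (not the study's software): flag every cell where two passes
# over the same data sheet layout disagree, so the checker can resolve each
# flagged cell against the original paper data sheet.
flag_discrepancies <- function(first_entry, second_entry) {
  stopifnot(identical(dim(first_entry), dim(second_entry)))
  mismatch <- first_entry != second_entry          # TRUE wherever the two passes disagree
  which(as.matrix(mismatch), arr.ind = TRUE)       # row/column positions to re-check
}

# Toy example: two passes that disagree in one math-test entry
first  <- data.frame(ID = c("100231", "100232"), MathTest = c(417, 532))
second <- data.frame(ID = c("100231", "100232"), MathTest = c(417, 523))
flag_discrepancies(first, second)                  # flags row 2 of the MathTest column
```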

Fig. 1 Double-entry screen layout

Double entry is recommended by many sources (Burchinal & Neebe, 2006; Cummings & Masten, 1994; DuChene et al., 1986; McFadden, 1998; Ohmann et al., 2011) because it is more accurate than visual checking and partner read aloud. For example, among experienced data enterers checking medical data, double entry detected 73% more errors than partner read aloud (Kawado et al., 2003), and among university students checking psychological data, double entry was three times as likely as visual checking and partner read aloud to correct every data-entry error (Barchard & Verenikina, 2013). In fact, double entry is as accurate as optical mark recognition and intelligent character recognition (Paulsen et al., 2012). However, double entry is more time-consuming than visual checking or partner read aloud (Barchard & Pace, 2011; Barchard & Verenikina, 2013; Kawado et al., 2003; Reynolds-Haertle & McBride, 1992), so researchers have sought alternatives. For example, visual checking can be augmented by graphical representations of the numbers being checked: One study showed that this reduced initial data-entry errors by 60%, but unfortunately it had no noticeable effect on data-checking accuracy (Tu et al., 2016). Some researchers have created dynamic data-entry systems that ask for only a portion of the values to be reentered: those entries that Bayesian analyses suggest are likely to be data-entry errors (Chen, Chen, Conway, Hellerstein, & Parikh, 2011). Although such systems save time and are better than no data checking, one study showed that they still left 22% of the errors in the dataset (Chen, Hellerstein, & Parikh, 2010).

In this study we examined the effectiveness of a new data-checking method: solo read aloud. Although substantial research has demonstrated that double entry is superior to visual checking and partner read aloud, no previous research has examined the accuracy of solo read aloud or compared its accuracy to that of other data-checking methods. Therefore, the purpose of our study was to compare solo read aloud to double entry, visual checking, and partner read aloud. On the basis of previous research, we predicted that double entry would be more accurate than visual checking and partner read aloud, but also more time-consuming. We made no predictions regarding solo read aloud, which had never been empirically tested.

The present study went beyond previous research in two additional ways. First, this study included a detailed examination of the types of errors that data-checking participants left in datasets, to provide guidance for future improvements to data-checking systems. As you will see, this novel analysis led to insights regarding further improvements in double-entry systems. Second, this study compared the accuracy of data-checking participants with previous data-entry experience to those without. Surprisingly, no previous research has examined the effect of experience on data-checking accuracy. This study examined whether data-entry experience increases speed, reduces errors, and changes subjective opinions of the data-checking system.

Method

Participants

A total of 412 undergraduates (255 female, 153 male, 4 unspecified) participated in return for course credit. They ranged in age from 18 to 50 (M = 20.98, SD = 5.09). They identified themselves as Caucasian (32.2%), Hispanic (26.7%), Asian (21.6%), African-American (9.7%), Pacific Islander (2.7%), and Other (0.5%).

We originally planned to collect data from about 400 participants, roughly 100 using each method. After 4.5 years of data collection, we had 412 participants. Because participants were assigned to the data-checking methods completely at random, there were slightly more participants using some methods than others: double entry, 94; visual checking, 98; partner read aloud, 119; and solo read aloud, 101.

Of these 412 participants, 90 had previous data-entry experience. Their experience ranged from 4 h of data entry to more than 2 years of full-time work (40+ h per week). About two-thirds of them (61) had more than 100 h of data-entry experience.

Materials

The 412 participants in our study were taking the role of research assistants, each of whom was checking the complete dataset for an imaginary study with 20 subjects. Before participants arrived for the study, the data sheets were entered into the computer. These data sheets contained six types of data (see Fig. 2): a six-digit ID code, a letter (M or F; labeled Sex), five-point numerical rating scales (labeled Learning Style), five-point alphabetical scales (SD D N A SA; labeled Study Habits), words in capital letters (often with spelling errors; labeled Spelling Test), and three-digit whole numbers (labeled Math Test).

Fig. 2 Example data sheet

When we entered these 20 data sheets, we deliberately introduced 32 errors (see Table 1). Thirteen of these errors would be easy for data checkers to identify later: for example, entering a word when a number was expected or entering two numbers in a single cell. These entries were so obviously wrong that they could be detected by simply looking at the Excel sheet: A data checker would not even need to look at the data sheet. The other 19 errors were less obvious, and thus would be difficult for data checkers to identify from a superficial examination of the data: for example, entering an incorrect number that was still within the range for that variable. These entries were plausible values for those variables, so the data checker would only know that they were wrong by noticing that they did not match the data sheet.

Table 1 Errors researchers inserted into the Excel file for participants to locate and correct

These 32 errors represented 5% of the entries. This error rate is higher than the rates found in most previous research (e.g., Barchard & Pace, 2011). Using a larger number and variety of errors increased our ability to detect differences between the data-checking methods.
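
To illustrate why the two kinds of errors differ in detectability, the R sketch below implements the sort of type-and-range screen that would flag the 13 obvious errors but pass over the 19 in-range ones; the function name and example item are hypothetical.

```r
# Sketch of a type-and-range screen: it catches entries that are non-numeric
# where a number is expected, or numeric but outside the legal range, yet it
# cannot catch a plausible wrong value. Function and variable names are ours.
screen_item <- function(x, lo, hi) {
  value     <- suppressWarnings(as.numeric(x))     # wrong-type entries become NA
  bad_type  <- is.na(value) & !is.na(x)            # e.g., a word typed into a numeric item
  out_range <- !is.na(value) & (value < lo | value > hi)
  bad_type | out_range
}

# Five-point rating item: "CAT" and 8 are flagged, but an in-range wrong value
# (3 entered where the data sheet says 4) slips through unnoticed.
screen_item(c("4", "CAT", "8", "3"), lo = 1, hi = 5)
#> FALSE  TRUE  TRUE FALSE
```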

Procedures

Participants completed this study individually during single 90-min sessions supervised by trained research assistants. Because participants used Excel 2007 to check the data, they started the study by watching a video on how to use Excel. The video explained how to scroll vertically and horizontally, move between cells, enter and delete cell contents, and save and close the spreadsheet. To view this video (and the other videos used in this study), see the supplementary materials.

Next, the computer randomly assigned participants to one of the four data-checking methods and showed them a video on using that method to identify and correct data-entry errors. These videos provided an overview of the imaginary study. They showed an example paper data sheet with its 34 variables (ID, sex, and eight items for each of the four scales) and explained that the participant would use Excel to check data entered by someone else. The videos also reviewed the Excel data file, explaining that each row contained the data for one subject and that each column provided a different piece of information about that subject.

The Excel sheet in the videos differed depending upon the data-checking method. The Excel sheet in the visual-checking, solo read-aloud, and partner read-aloud videos showed data for five subjects. The Excel sheet in the double-entry video did not show any data initially. This video showed participants how to enter the data themselves. Then it showed them the first set of entries, along with the mismatch counter and the out-of-range counter. The mismatch counter gives the number of nonidentical entries between the first and second entries. The out-of-range counter gives the number of entries that are outside the range for the variable (e.g., an entry of 8 on a five-point scale).
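
The R sketch below shows one way the two counters could be computed. The study implemented them in Excel, so this is an equivalent illustration rather than the study’s actual formulas; the column names and legal ranges shown are assumptions.

```r
# Illustrative counters (the study used Excel; this is an equivalent sketch).
# 'first_entry' and 'second_entry' are data frames with the same 20 x 34 layout;
# 'ranges' lists an assumed lo/hi pair for each numeric column.
mismatch_counter <- function(first_entry, second_entry) {
  sum(first_entry != second_entry, na.rm = TRUE)   # cells where the two passes differ
}

out_of_range_counter <- function(entries, ranges) {
  sum(vapply(names(ranges), function(col) {
    value <- suppressWarnings(as.numeric(entries[[col]]))
    sum(!is.na(value) & (value < ranges[[col]][1] | value > ranges[[col]][2]))
  }, numeric(1)))
}

# Example range list (column names and bounds are assumptions):
# five-point scales run 1-5, math-test scores are three-digit whole numbers.
ranges <- list(LearningStyle1 = c(1, 5), MathTest1 = c(100, 999))
```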

Most critically, the data-checking videos showed different methods of identifying and correcting data-entry errors. The visual-checking video told participants to check the data in Excel by visually comparing them to the paper data sheets. If the entry did not match the paper data sheet, participants were to correct the entry. The solo read-aloud video told participants to read the paper data sheets out loud and to check that the data in Excel matched. The partner read-aloud video told participants to listen as the researcher read the paper data sheet out loud; the participant was told to say “check” when the entries matched, and “verify” when they did not, in which case the researcher would read the data point a second time. During the two read-aloud videos, the computer took the roles of the participant and researcher: It read the data out loud and said “check” and “verify” as needed. During all three of these videos, the computer demonstrated how to check the data for an entire subject. If the Excel entries did not match what was on the data sheet, the computer demonstrated how to correct them.

Similarly, the double-entry video showed participants how to check and correct the data for an entire subject. First, the video told participants to enter the data themselves. Next, the video showed participants the mismatch counter and out-of-range counter and explained what they were. If either of these counters was nonzero, participants were told to check the original paper data sheet to determine the correct value. Finally, the video demonstrated correcting whichever Excel entry was incorrect (one example corrected the original entry, and another example corrected the second entry).

Several variations of double entry are used in real research (Kawado et al., 2003). In one method, a single person enters the data twice, and then that person (or another person) compares the two entries to each other and resolves discrepancies. In another method, two people enter the data into the computer, and then one of those two people (or another person) compares the entries to each other and resolves discrepancies. In this study, we used the second method: The research team entered the data into the computer before participants arrived; the double-entry participants then entered the data a second time. We used this method so that all participants—regardless of whether they were using double entry, visual checking, solo read aloud, or partner read aloud—would be responsible for identifying and correcting exactly the same errors.

After participants had finished watching the videos, they started checking the data. During the training phase, participants checked the data for five data sheets, and the researcher answered any questions they had and corrected any procedural errors they had made. During the testing phase, participants checked the data for 20 additional sheets without further interactions with the researcher.

After participants had completed the data checking, they provided their subjective opinions of the data-checking method they had used by rating their assigned method on 16 adjectives (i.e., satisfying, comfortable, frustrating, pleasant, painful, boring, relaxing, accurate, enjoyable, tedious, uncomfortable, fun, annoying, calming, depressing, and reliable) using a five-point scale from (1) strongly disagree to (5) strongly agree.

Statistical analyses

We examined the effect of data-entry experience and data-entry method on three criterion variables: time, number of data-checking errors, and subjective opinions. Time was the difference between the time when the participant loaded the Excel file whose data they were checking and the time when they closed that file and moved on to the follow-up questionnaire. The number of data-checking errors was the number of discrepancies between the participant’s completed data file and the correct data. If the participant corrected each of the 32 original errors and did not introduce any new errors, the participant was further coded as having completed perfect data checking. Subjective opinions were the participants’ ratings on the 16-adjective measure. We used the total score from the 16 adjectives, after reverse coding the seven items containing negative adjectives.
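
As a concrete sketch of this scoring, the R function below computes the three criterion variables for a single participant; the function and argument names are illustrative, and it assumes the participant’s finished file and the correct file share the same layout and that the 16 adjective ratings are supplied as a named vector.

```r
# Sketch of the scoring for one participant. 'checked' and 'correct' are data
# frames with the same layout; 'ratings' is a named numeric vector of the 16
# adjective ratings (1-5). Function and object names are illustrative.
score_participant <- function(checked, correct, ratings) {
  n_errors <- sum(checked != correct, na.rm = TRUE)  # discrepancies from the correct data
  perfect  <- as.integer(n_errors == 0)              # 1 = every original error fixed, none added
  negative <- c("frustrating", "painful", "boring", "tedious",
                "uncomfortable", "annoying", "depressing")
  ratings[negative] <- 6 - ratings[negative]         # reverse code the negative adjectives
  c(errors = n_errors, perfect = perfect, subjective = sum(ratings))
}
```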

We began by examining the effect of data-entry experience on these three criterion variables. We hypothesized that previous data-entry experience would improve both the speed and accuracy of data checking. We made no prediction about the effect of data-entry experience on subjective opinions of the data-checking methods.

We used different types of models for the different criterion variables. We selected these types of models on the basis of theoretical considerations (e.g., continuous vs. dichotomous criterion variables) and then checked to ensure that model assumptions had been met. For the criterion variable of time, which was continuous with a roughly normal distribution, we used multiple linear regression. For the number of data-checking errors, which was a count variable with a large number of zero values, we used negative binomial regression (as recommended by Cameron & Trivedi, 1998). For the dichotomous variable of whether the participant had perfect data checking, we used logistic regression with Nagelkerke’s pseudo-R2 as a measure of effect size. Finally, for the criterion variable of subjective opinions, which was a nearly continuous variable with a roughly normal distribution, we used multiple linear regression.
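
The R sketch below illustrates these model families for the data-entry-experience analyses, assuming a data frame d with one row per participant and illustrative column names; the Nagelkerke pseudo-R2 helper is a standard computation from the model deviances rather than code from the original analysis.

```r
library(MASS)  # glm.nb() for negative binomial regression

# Model 1 for each criterion, assuming data frame 'd' with illustrative columns
# time, errors, perfect (0/1), subjective, and experience (0/1).
m_time       <- lm(time ~ experience, data = d)                        # continuous
m_errors     <- glm.nb(errors ~ experience, data = d)                  # zero-heavy count
m_perfect    <- glm(perfect ~ experience, family = binomial, data = d) # dichotomous
m_subjective <- lm(subjective ~ experience, data = d)                  # near-continuous

# Nagelkerke's pseudo-R2 for the logistic model, from its deviances
nagelkerke <- function(model) {
  n  <- length(model$y)
  cs <- 1 - exp((model$deviance - model$null.deviance) / n)  # Cox & Snell R2
  cs / (1 - exp(-model$null.deviance / n))                   # rescale to a 0-1 maximum
}
nagelkerke(m_perfect)
```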

Next we examined the effect of data-checking method. We used the same three criterion variables and the same models. To take into account data-entry experience, we fit a hierarchical series of models. In the first model, the only predictor was data-entry experience. In the second model, data-checking method was added as a second predictor. In the third model, the interaction between data-entry experience and data-checking method was added as a third predictor. However, for each of the criterion variables, the interaction term was nonsignificant (all ps > .05). Therefore, we do not report the results of the third models.
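
A minimal sketch of this hierarchical series for the count criterion appears below (the same pattern applies to the other criteria); it again assumes a data frame d, with method stored as a four-level factor, and compares successive negative binomial models with likelihood-ratio tests.

```r
library(MASS)

# Hierarchical series for the count criterion, assuming 'd$method' is a factor
# with levels for the four data-checking methods (reference level chosen to
# match the comparison of interest, e.g., double entry).
d$method <- relevel(factor(d$method), ref = "double entry")

model1 <- glm.nb(errors ~ experience,          data = d)  # experience only
model2 <- glm.nb(errors ~ experience + method, data = d)  # add data-checking method
model3 <- glm.nb(errors ~ experience * method, data = d)  # add the interaction

anova(model1, model2)  # does method add beyond experience?
anova(model2, model3)  # is the experience x method interaction needed?
```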

For each criterion variable, we present the results in two tables. In the first table, we provide means and 95% confidence intervals for each combination of data-entry experience and data-checking method. The confidence intervals were constructed using percentile bootstrapping with 10,000 replications. In the second table, we present the hierarchical models that we used to examine the effects of data-entry experience and data-checking method. For convenience, we have indicated the results of these significance tests in the first table (with the means). Thus, the first table shows what we found, and the second table shows how we found it.
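
The percentile bootstrap for a single cell mean could be computed as in the sketch below, using the boot package with 10,000 replications; the data vector shown is a toy example rather than values from the study.

```r
library(boot)

# Percentile bootstrap CI for one cell mean, with 10,000 replications.
# The data vector is a toy illustration, not values from the study.
x <- c(41, 45, 38, 52, 47, 44, 39, 50)  # e.g., data-checking times in minutes

boot_mean <- boot(x, statistic = function(data, idx) mean(data[idx]), R = 10000)
boot.ci(boot_mean, type = "perc")       # percentile 95% confidence interval
```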

For each criterion variable, we fit the hierarchical models twice, to compare the data-checking methods to each other. First, because double entry is the current gold standard in data checking, we compared double entry to the remaining three data-checking methods. These first analyses included all participants. Second, because solo read aloud has never been examined empirically, we compared it to partner read aloud and visual checking (double entry was excluded from this analysis because this comparison had already been completed). These second analyses included only the participants who used the solo read-aloud, partner read-aloud, and visual-checking methods.

Most of these models converged upon solutions without incident, providing useable results for testing the hypotheses above. However, when we were predicting the number of errors for variables with very few errors (i.e., sex), these models did not converge. For this variable, we are unable to make conclusions about the relative frequency of errors across data-checking methods, because so few errors were made.

Finally, to determine whether the data-checking participants were able to judge the accuracy of their work, we completed two analyses. First, we correlated subjective judgments of accuracy and reliability with actual accuracy. These subjective judgments of accuracy and reliability were obtained as part of the 16-item measure of subjective opinions. Actual accuracy was calculated as the number of correct entries in the Excel sheet after the participant had finished checking it. If participants were good judges of the quality of their data-checking efforts, these correlations should be large and positive. Second, we examined the effect of data-entry experience and data-checking method on subjective opinions of accuracy and reliability. If subjective opinions are a good way of evaluating data-checking quality, then the effects of data-entry experience and data-checking methods on subjective opinions should mirror their effects on actual errors. We therefore used the same models as we had used to compare actual errors: Model 1 examined the effect of data-entry experience on subjective opinions; Model 2 added the predictor of data-checking method; and Model 3 added the interaction term (which was not significant, so Model 3 is not reported).

We conducted a power analysis before beginning data collection. However, our analytic plan changed during peer review, making those earlier calculations irrelevant. Therefore, we conducted a sensitivity analysis after the fact to determine what size of effect we would have been able to detect, given our sample size of 412 participants. Using the R statistical package lmSupport (Curtin, 2017), we found that we had power of at least .80 to detect small differences between the data-checking methods in terms of the total number of errors. Specifically, when predicting the total number of errors on the basis of data-entry experience (Model 1), we had 80% power to find effects as small as η2 = .0188. When predicting the total number of errors on the basis of data-checking method while controlling for data-entry experience (Model 2), we had 80% power to find effects as small as partial η2 = .0261. Finally, when predicting the total number of errors from the interaction between data-checking method and data-entry experience, while controlling for data-checking method and data-entry experience (Model 3), we had 80% power to find effects as small as partial η2 = .0263.
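
A roughly equivalent sensitivity calculation can be reproduced with the pwr package (shown here in place of lmSupport); the degrees of freedom follow from the design described above, and the conversion between Cohen’s f2 and partial eta-squared is standard.

```r
library(pwr)

# Sensitivity for the Model 2 test of data-checking method (3 df, because the
# four methods require three dummy codes), with N = 412 and 4 predictors in the
# model, giving 412 - 4 - 1 = 407 error df. pwr solves for Cohen's f2.
pwr.f2.test(u = 3, v = 407, sig.level = .05, power = .80)

# Cohen's f2 converts to partial eta-squared via eta2 = f2 / (1 + f2), which
# should land near the value of roughly .026 reported above.
```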

Results

Data-entry experience

Time

Data-entry experience had no significant effect on the time it took participants to check the data. Across the four data-checking methods, participants with data-entry experience took 1.8% less time than participants without data-entry experience (see Table 2). However, this small effect for data-entry experience was nonsignificant [F(1, 409) = 0.73, p = .395; see Table 3, Model 1, for all participants].

Table 2 Means [95% confidence intervals] for time to complete data checking (in minutes)
Table 3 Hierarchical multiple linear regressions predicting time to complete data checking

Perfect data checking

Data-entry experience had no significant effect on the proportion of participants who created an error-free dataset (by correcting each of the original errors without introducing any new errors). Across the four data-checking methods, the participants with data-entry experience were 14.6% more likely to create an error-free dataset (see Table 4). However, this small effect was not statistically significant (see Table 5).

Table 4 Proportions of participants [95% confidence intervals] with perfect data checking
Table 5 Hierarchical generalized linear models predicting perfect data checking

Number of errors

Data-entry experience significantly reduced the number of errors. Across the four data-checking methods, participants with data-entry experience had 41% fewer total errors (see Table 6, first section). This moderate main effect was statistically significant (p = .007; see Table 7, first section). In addition, the participants with data-entry experience had 64% fewer errors when checking the spelling test (p = .002) and 43% fewer errors while checking the math test (p = .046; see Tables 6 and 7, later sections).

Table 6 Mean numbers of errors per participant [95% confidence intervals]
Table 7 Hierarchical generalized linear models predicting numbers of errors

The 412 participants had a total of 743 errors at the end of the experiment. These errors can be divided into two types: original errors that participants left in the dataset, and new errors that participants introduced while checking the data. In Table 8, these are referred to as left original error and introduced new error. Using Table 1 as a guide, we further subdivided these errors into those that would be easy to find during data checking, such as cells that contained entirely wrong words, and errors that would be hard to find, such as cells in which the digits of a number had been reordered. We found that only 70 of the 743 errors (9.4%) were easy-to-find errors that could have been detected by a systematic examination of histograms or frequency tables during preanalysis data screening. Most of the data-checking errors that participants made (90.6%) would be hard to detect during data screening.

Table 8 Mean numbers of easy-to-find errors and hard-to-find errors per participant [95% confidence intervals]

Participants with data-entry experience were 40% less likely to leave hard-to-find errors in the dataset (p = .039; see Table 9). There were no significant differences in the number of hard-to-find errors that they introduced into the dataset (p = .120), the number of easy-to-find errors that they left in the dataset (p = .053), or the number of easy-to-find errors that they introduced into the dataset (p = .206).

Table 9 Hierarchical generalized linear models predicting the numbers of easy-to-find errors and hard-to-find errors

Subjective evaluations

Data-entry experience was not associated with subjective evaluations of the data-checking methods. Across the four data-checking methods, the participants with data-entry experience provided ratings that were 1.3% higher than the ratings provided by participants without data-entry experience (see Table 10). However, this small effect was not statistically significant (see Table 11).

Table 10 Means [95% confidence intervals] for subjective evaluations
Table 11 Hierarchical multiple linear regressions for subjective evaluations

Summary

This is the first empirical study of the effect of data-entry experience on data-checking performance. We found that data-entry experience significantly and substantially reduces the number of data-checking errors.

Data-checking methods

Next we compared the four data-checking methods: double entry, solo read aloud, partner read aloud, and visual checking. Because participants were assigned to data-checking methods completely at random, participants with data-entry experience were not evenly distributed between the four data-checking methods. This represents a potential confound, which could cause spurious differences between the methods. To be conservative, we therefore controlled for data-entry experience in all comparisons of the data-checking methods.

Time

Double entry was the slowest method, as expected. Overall, double entry took 45 min, solo read aloud took 35 min, visual checking took 32 min, and partner read aloud took 30 min (see Table 2). After we controlled for data-entry experience, double entry was 28% slower than visual checking, 39% slower than partner read aloud, and 48% slower than solo read aloud, all ps < .001 (see Table 3, Model 2 for all participants).

Solo read aloud was the second slowest method. After we controlled for data-entry experience, solo read aloud was 16% slower than visual checking and 16% slower than partner read aloud, all ps < .01 (see Table 3, Model 2 for comparisons of solo read-aloud, partner read-aloud, and visual-checking participants).

Perfect data checking

Originally, there were 32 errors in the Excel file. Double-entry participants were the most likely to correct all of these errors without introducing any new errors. Overall, the probabilities of creating an error-free dataset were .83 for double entry, .46 for solo read aloud, .21 for partner read aloud, and .35 for visual checking (see Table 4 and Fig. 3). After we controlled for data-entry experience, double entry was 83% more likely than solo read aloud to result in perfect checking, 95% more likely than partner read aloud, and 89% more likely than visual checking, all ps < .001 (see Table 5).

Fig. 3 Proportions of participants with perfect data checking (with the standard error for each proportion). DE = double entry. VC = visual checking. PRA = partner read aloud. SRA = solo read aloud

Solo read aloud was the next most likely to result in perfect data. After we controlled for data-entry experience, solo read aloud was 68% more likely than partner read aloud to result in perfect data checking (p < .001), although it was not significantly more likely than visual checking (p = .098; see Table 5).

Number of errors

Double-entry participants had the fewest total errors. These participants averaged 0.64 errors, as compared to 1.30 for solo read aloud, 3.27 for partner read aloud, and 1.66 for visual checking (see Table 6 and Fig. 4). After we controlled for data-entry experience, solo read aloud had 2.01 times as many total errors as double entry, partner read aloud had 5.07 times as many, and visual checking had 2.61 times as many, all ps < .01 (see Table 7, total errors).

Fig. 4 Mean numbers of errors (with the standard error for each mean). DE = double entry. VC = visual checking. PRA = partner read aloud. SRA = solo read aloud

Solo read aloud had the next fewest errors. After we controlled for data-entry experience, partner read aloud resulted in 2.53 times as many errors as solo read aloud (p < .001). Visual checking resulted in 1.31 times as many errors as solo read aloud, but this ratio was not statistically significant, p = .168 (see Table 7). Thus, solo read aloud was not as accurate as double entry, but it was more accurate than partner read aloud.

To determine whether the number of errors depended on the type of data being checked, we fit separate models for each of the six types of data on the data sheets. Most of these models converged without incident. For those models, the differences between data-checking methods largely reflected the patterns seen above for total number of errors (see Tables 6 and 7). Double entry had significantly fewer errors than most of the other data-checking methods for four of the types of data. Solo read aloud resulted in significantly fewer errors than partner read aloud for three of the types of data and resulted in significantly fewer errors than visual checking for one type of data. The regression model predicting errors when checking the sex data did not converge, because participants rarely made errors when checking this variable (five of the 412 participants made one error; the rest made zero errors); however, we noted descriptively that double-entry and solo read-aloud participants made no errors when checking the sex data, whereas partner read aloud and visual checking each had participants who made errors. All together, these results reinforce our previous conclusion that solo read aloud was less accurate than double entry, but more accurate than partner read aloud. In addition, solo read aloud was sometimes more accurate than visual checking.

Excluding extreme cases

To ensure that these results were not driven by a few extreme scores, we repeated these analyses after excluding the worst 1% of participants on total number of errors. These five participants (one from double entry, one from visual checking, and three from partner read aloud, all of whom had no previous data-entry experience) each had 20 or more errors. See the supplementary materials, Tables A and B. For total number of errors and errors on checking the math test, prior data-entry experience was no longer significant, but the effect was still in the same direction. Additionally, solo read-aloud participants now made significantly fewer errors than visual-checking participants for the study habits test, but the direction of the difference was the same as before. For the remaining analyses, removing these five participants had no effect on either the direction or the significance of coefficients. We concluded that the five error-prone participants did not have a substantial or meaningful effect on the above comparisons of the data-checking methods.

Types of errors

Double entry and solo read aloud both substantially reduced the number of easy-to-find errors, as compared to partner read aloud and visual checking (see Table 8). After we controlled for previous data-entry experience (see Table 9), partner read aloud and visual checking resulted in 22.58 and 23.98 times as many easy-to-find errors as double entry, both ps < .012, and 24.30 and 25.77 times as many as solo read aloud, both ps < .010. Solo read aloud was not significantly different from double entry in the number of easy-to-find errors that were left in the dataset, p = .962. In addition, partner read aloud introduced 11.22 times as many easy-to-find errors into the dataset as solo read aloud, p = .028. The model comparing double entry to the other methods did not converge, but we noted that double entry resulted in even fewer easy-to-find errors being introduced into the dataset than solo read aloud.

Double entry and solo read aloud both also reduced the number of hard-to-find errors, as compared to partner read aloud and visual checking. After we controlled for previous data-entry experience (see Table 9), partner read-aloud and visual-checking participants left 3.72 and 2.96 times as many hard-to-find errors in the dataset as double-entry participants, both ps < .001, and 2.19 and 1.76 times as many as solo read-aloud participants, both ps < .019. In addition, partner read-aloud participants introduced 5.77 and 2.37 times as many hard-to-find errors as double-entry and solo read-aloud participants, both ps < .004. However, solo read aloud was not as good as double entry. Solo read-aloud participants introduced 2.44 times as many hard-to-find errors into the dataset as double-entry participants, p < .004. Thus, double entry and solo read aloud are both better than partner read aloud and visual checking, but solo read aloud is not as good as double entry at avoiding hard-to-find errors. Double entry remains the gold standard in terms of reducing the number of errors and shows the greatest advantages in eliminating hard-to-find errors.

Double-entry errors

Although double-entry participants had fewer than half as many errors as those using the next best method, they still made some errors: 60 errors, to be exact. We examined these errors carefully to glean insights for improving double-entry systems. We found that double-entry participants always made the two computer entries match each other, but sometimes did not make the entries match the original paper data sheet. This happened in two circumstances. Most often (32 of the 60 errors), double-entry participants failed to find an existing error. When the two entries disagreed, they changed their entry to match the original (incorrect) entry. Sometimes (28 of the errors), double-entry participants entered something incorrectly themselves and then changed the original entry to match. Thus, they introduced an error that had not existed before.

As we described above, previous data-entry experience reduced the number of errors. Consistent with that overall finding, double-entry participants with previous data-entry experience never made their entries match incorrect original entries. However, they did sometimes change correct entries to match their incorrect entries. See Table 8.

Subjective opinions

For all four data-checking methods, subjective evaluations were near the midpoint of the five-point scale. However, participants liked solo read aloud the least (see Table 10). After we controlled for previous data-entry experience (see Table 11), participants reported significantly lower subjective evaluations of solo read aloud than double entry (p = .003) and visual checking (p = .003) and slightly (but not significantly) lower evaluations than partner read aloud (p = .121). In contrast, ratings of double entry were not significantly different from the ratings of either partner read aloud (p = .124) or visual checking (p = .947).

Comparing actual accuracy with perceived accuracy and reliability

Participants were poor judges of the quality of their data-checking efforts. First, a participant’s subjective opinions on accuracy and reliability were not significantly related to their actual accuracy [accuracy judgment: r(410) = .03, 95% CI [−.07, .13], p = .538; reliability judgment: r(410) = .08, 95% CI [−.01, .18], p = .091]. Second, participants with data-entry experience did not differ from participants with no data-entry experience in terms of their accuracy and reliability judgments, all ps > .06 (see Table 11), even though their actual accuracy was substantially and significantly higher (see Tables 6 and 7). Third, participants did not give higher ratings of accuracy and reliability to those data-checking methods that had the highest actual accuracy, both ps > .05 (see Table 11). In contrast, there were four statistically significant differences between the data-checking methods in terms of actual accuracy (see Table 7, total errors). Thus, subjective opinions of accuracy and reliability are poor substitutes for measuring actual accuracy.

Discussion

Every data-checking method corrected the vast majority of errors. Even the worst data-checking method (partner read aloud completed by people with no previous data-entry experience) left only 3.54 errors in the Excel file, on average. Because the Excel file only contained 5% errors to start, 99.48% of the entries were correct when the data checking was complete. The best data-checking method (double entry completed by people with previous data-entry experience) resulted in 99.91% accuracy. In absolute terms, the difference between 99.48% accuracy and 99.91% accuracy is small, which may explain why most researchers think that their particular data-checking method is excellent. This may also explain why there was no correlation between the number of errors and perceived accuracy. With all methods having accuracy rates greater than 99%, it is difficult for researchers to discern the differences between data-checking methods without doing systematic research like the present study. The reader is reminded, however, that an accuracy rate of 99.48% is far from optimal. Such errors can reverse the sign of a correlation coefficient or make a significant t test nonsignificant (Barchard & Verenikina, 2013). Moreover, the vast majority of data-entry errors in this study and in Barchard and Verenikina’s were within the range for the variables, and thus could not be detected using holistic methods such as frequency tables and histograms. Therefore, it is essential that researchers use high-quality item-by-item data-checking methods.

Previous research has shown that double entry results in significantly fewer errors than single entry (Barchard & Pace, 2011), visual checking (Barchard & Pace, 2011; Barchard & Verenikina, 2013), and partner read aloud (Barchard & Verenikina, 2013; Kawado et al., 2003). This study replicated those findings by showing that double entry had the fewest errors of the four data-checking methods we examined. It was more likely than other methods to result in perfect data (more than five times as likely as the next best method) and had fewer errors (less than half as many as the next best method). As compared to the other methods, it was particularly good at finding the hard-to-find errors: those errors that would be impossible to detect using holistic methods such as histograms and frequency tables.

Double entry might result in the lowest error rates because it does not rely upon maintaining attention. If people have a small lapse in concentration while entering the data, they might stop typing or type the wrong thing. If they stop typing, then the Excel sheet will show them exactly where they need to start typing again. If they type the wrong thing, then the Excel sheet will highlight that error, making it easy to correct. Similarly, if they have a short lapse of attention while they are looking for mismatches and out-of-range values, and therefore overlook a highlighted cell, the highlighting remains in place so they can easily spot the problem later. Because lapses of attention seem less likely to result in data-checking errors for double entry, double entry may maintain its low error rate if someone checks data for several hours in a row (as might happen if this were a paid professional or a student working toward a thesis deadline), even though the checker might become increasingly tired.

In contrast, participants using visual checking and partner read aloud sometimes overlooked what should have been obvious errors, such as cells that contained entirely wrong words. This might have occurred because both methods are sensitive to lapses of attention. When using visual checking, data checkers have to look back and forth between the paper data sheet and the computer monitor, and thus constantly have to find the place where they left off. If they have a short lapse of attention, they might accidentally skip a few items, and nothing in the data-entry system would alert them to that mistake. When using partner read aloud, data checkers have to maintain strict attention on the computer screen. If they have a short lapse of attention or if they get behind in reading the screen and mentally comparing it to the data being read out loud, they might do only a cursory comparison of some items and thus overlook some errors. Fundamentally, both methods allow errors to go undetected because short lapses of attention can allow data checkers to skip some items.

In addition to replicating the superiority of double entry over visual checking and partner read aloud, this study also examined a data-checking method that has never been examined empirically: solo read aloud. Solo read aloud was more accurate than partner read aloud. It was more than twice as likely to result in a perfect dataset and had fewer than half as many errors. Solo read aloud might be more accurate than partner read aloud because the data checker is able to control the speed at which the data are read. The person reading the data out loud reads at an even pace, which might sometimes leave the data checker rushing to catch up. This can result in some entries being checked hastily and possibly inaccurately.

Solo read aloud was significantly more accurate than visual checking for one of the six types of data (the spelling test). It might have been more accurate because solo read aloud requires participants to compare a sound to a visual stimulus, whereas visual checking requires participants to compare two visual stimuli. Previous research has shown that cross-modal comparisons are more accurate than within-modal comparisons (Ueda & Saiki, 2012). On the other hand, solo read aloud was only sometimes more accurate than visual checking. Indeed, solo read aloud was slightly (though not significantly) worse than visual checking for the five-point numeric and alphanumeric scales (the learning style and study habits scales), two types of data that are very common in psychology. Therefore, future research should determine under which circumstances each of these methods works best.

Solo read aloud was not as accurate as double entry. Solo read aloud resulted in twice as many errors and was only about one-sixth as likely to result in a perfect dataset. Therefore, double entry remains the gold standard data-checking method. However, double entry is not always possible. For example, when an instructor is entering course grades into the university’s student information system or when a researcher is using a colleague’s data-entry system for a joint project, double entry might not be an option. In these situations, we recommend solo read aloud or visual checking.

In our study, participants liked solo read aloud the least. This might be because participants were checking data while sitting next to the study administrator and may have felt self-conscious about talking to themselves. This discomfort would likely be much reduced, or eliminated entirely, if participants were able to use solo read aloud in a private location.

Regardless of which data-checking method was used, previous experience reduced error rates. Across the four data-checking methods, the average number of errors went from 1.89 for participants with no previous data-entry experience to 1.14 for people with experience, a 40% decrease. This suggests that junior researchers should be given experience with data entry. This experience could occur during undergraduate courses (e.g., research methods, statistics, labs, and honors theses). Recall, however, that many of our participants had a lot of data-entry experience: 61 of them had over 100 h. Thus, data entry completed during coursework might not be sufficient. Junior researchers are likely to start data entry for potentially publishable research (e.g., theses and dissertations) while their error rates are still relatively high. Moreover, undergraduate research assistants with little to no previous data-entry experience are often asked to enter data for publications. This raises the question of how junior researchers and research assistants can get data-entry experience, without erroneous results being published. We recommend they use double-entry systems. This doubles the amount of data-entry experience that research assistants can obtain and is also the best method of preventing data-entry errors from ruining study results.

Although double-entry participants had fewer than half as many errors as those using the next best method, and although data-entry experience improves data-checking accuracy, experienced double-entry participants still made some errors. Therefore, there is still room for improvement. By examining the errors that double-entry participants made, this study provides two clues that might help us design better double-entry systems. First, double-entry participants always made the two computer entries match each other, but sometimes they did not make the entries match the original paper data sheet. This suggests they were more focused on making the two entries match each other than on making them match the paper data sheet. Second, double-entry participants with previous data-entry experience never left original errors in the Excel file, but they sometimes changed correct entries to match their incorrect entries. This suggests they were biased to assume that their entries were correct.

Double-entry participants may have introduced errors into the Excel sheet because they preferred their own entries to the original entries; if the entries mismatched, they assumed theirs was correct. This preference could have come about in three ways. First, participants likely noticed that the original entries had more errors than they had. When we entered the data into the Excel sheet originally, we introduced 32 errors, which is a 5% error rate. Participants entering psychological data typically make about 1% errors (Barchard & Verenikina, 2013). Thus, when the present participants identified discrepancies, they probably found, over and over again, that the error was in the original entries. They might have therefore started assuming that the first entry was wrong. However, this would probably not occur in real data entry: The error rates of the first and second enterers would likely be more comparable. Second, our double-entry participants might have preferred their own entries because they were explicitly told that they were to check the original entries (they were not told that they should check their own entries). Therefore, when mismatches occurred, they might have thought that their job was to fix the original entries. Finally, people might naturally prefer their own work to someone else’s. Therefore, we might be able to design a better double-entry system by ensuring that data checkers do not have stronger affiliations with one of the entries than the other, because of either the instructions they are given or a natural affinity for their own work.

In this study, double entry involved one person (the researcher) who entered the data and a second person (the participant) who entered the data a second time, compared the entries, and fixed the errors by referring to the original paper data sheet. There are three ways this system could be modified to prevent the data checker from having a greater affinity for one of the sets of entries. System 1 would be for one person to enter the data twice, compare the two sets of entries using the computer, and fix errors by referring to the original paper data sheet. System 2 would be for one person to enter the data twice and a second person to compare the entries and fix errors. System 3 would be for two separate people to enter the data and a third person to compare the entries and fix errors.

We believe System 3 would be more accurate than System 1 or 2, because two different people would be entering the data: If one of them made a data-entry error, the other person would be unlikely to make the same error. Only one study has compared double-entry systems (Kawado et al., 2003), and it did find that System 3 resulted in fewer errors than System 2. However, that study used only two data enterers, making it difficult to generalize the results to other data enterers. Moreover, both of their data enterers had previous experience, making it difficult to generalize to the types of data enterers typically used in psychology, and they were entering medical data, which are quite different from the types of data usually used in psychology. Therefore, future research should determine which double-entry system is the most accurate for the data and data enterers typically used in psychological studies. At this point, we recommend any of these double-entry systems, but particularly System 3. Double-entry modules are available through commercial statistical programs such as SPSS and SAS, and several free programs are also available, including web-based systems (Harris et al., 2009), standalone programs (Gao et al., 2008; Lauritsen, 2000–2018), and add-ons for Excel (Barchard, Bedoy, Verenikina, & Pace, 2016). All of these programs can implement double-entry System 3.

Final words

Psychology is gravely concerned about avoiding errors due to random sampling of participants, violation of statistical assumptions, and response biases. These are all important concerns. However, psychologists need to be equally concerned that their analyses are based on the actual data they collected. Therefore, we should use double-entry systems when possible, particularly ones in which two different people enter the data and a third person compares them. When double entry is not possible, we should use solo read aloud or visual checking. Finally, we should increase the data-entry experience of junior researchers, by including double entry in undergraduate and graduate courses and by asking research assistants to double enter our own research data.