Teachers’ assessments of students’ written academic work are susceptible to various cognitive biases. One major cognitive bias, a form of the halo effect (Cooper 1981; Thorndike 1920), is the handwriting legibility effect (for the presentation effect, see also Graham et al. 2011). This effect refers to higher grades being given to legibly than to illegibly handwritten academic work (for a meta-analysis, see Graham et al. 2011). Legibility is defined as the ease of grasping the message the writing conveys, that is, its processing fluency (Szymczak 2016; for “clear display”, see Kahneman 2011).

Several studies have reported a robust handwriting legibility effect, with notable differences in average scores and considerable effect sizes (for reviews, see Graham et al. 2011; Morris 2013; Meadows and Billington 2005). For example, James (1927) found in 43 high-school teachers that the handwriting quality of an essay biased their grading of its content by 8.7 points (averages of 59.8 for poor and 68.5 for good handwriting quality). Shepherd (1929) found that participating teachers gave illegibly handwritten texts lower scores than legibly handwritten ones. The mean effect size estimated by Graham et al. (2011) across the two experiments in Shepherd’s (1929) study was a notable 1.2. Chase’s (1968) and Soloff’s (1973) studies of teachers demonstrated similar findings with notable effect sizes (0.7 and 0.94, respectively, as estimated by Graham et al. 2011). Markham’s (1976) and Briggs’ (1970) findings in teachers and teacher students statistically supported this evidence, although no standard deviations were available to allow the calculation of effect sizes.

More recently, Klein and Taub (2005) found that handwriting legibility affected sixth-grade teachers’ judgments of the content of essays, the score differences corresponding to large effect sizes as indexed by Cohen’s d values of 1.412 for pen and 0.81 for pencil. Similarly, Greifeneder et al. (2012) observed in university students higher grades for essays handwritten with high than with low legibility, with a large effect size (ηp² = 0.70). Greifeneder et al. (2010) observed similar findings in university students with medium to large effect sizes (Cohen’s d values ranging from 0.77 to 0.90 across different levels of content quality). However, they also observed that if the students were explicitly informed about the threat the handwriting legibility effect poses to content quality assessments, the effect disappeared.

Some studies have failed to observe the handwriting legibility effect. For example, Chase (1979) found no such effect in graduate-student scorers. Less legible handwriting was only found to make the scorers more susceptible to bias from the achievement expectations of the presumed writer, expectations that could be drawn from information in a cover sheet about the writer’s previous academic achievements in other topics. Massey (1983) found that experienced examiners (the University of Oxford Delegacy of Local Examinations) were not observably biased by the quality of handwriting when assessing its content quality. Marshall (1972) also failed to find such an effect in a reasonably large sample of 480 classroom teachers. In an earlier study, Marshall and Powers (1969) had likewise observed only small and inconsistent effects of handwriting quality in their sample of 420 prospective teachers. They reported that essay scores differed, confusingly, only between neatly written (5.66) and fairly neatly written essays (5.02), but not between either of these and poorly written essays (5.25), the only difference corresponding to a Cohen’s d of 0.41 (note also the effect size of 0.38 estimated by Graham et al. 2011). Finally, Eames and Loewenthal (1990) failed to observe the handwriting legibility effect in their admittedly small sample of 16 experienced psychology examiners.

The evidence of the handwriting legibility effect is thus contradictory. The positive evidence is difficult to simply dismiss as a set of false alarms, and yet the amount of negative evidence is considerable. Three possible explanations for the negative evidence can be proposed. First, the variation in handwriting legibility may be incapable of implicitly altering the participants’ behaviour. Second, the handwriting legibility effect could be compensated away by another, directionally opposite, gender-mediated effect of handwriting legibility. Namely, illegible handwriting could be attributed to male and legible handwriting to female writers (Burr 2002; Hartley 1991). Males could, in turn, be given higher grades than females (King 1998; Martin 1972; Spear 1984; for negative findings, however, see Birch et al. 2016). Because this gender-mediated handwriting legibility effect and the handwriting legibility effect proper are directionally opposite, they could cancel each other out. Third, the participants could be aware of biases threatening grading and, consequently, spontaneously and selectively inhibit the behavioural manifestations of the handwriting legibility effect.

The present experiment, which yielded negative evidence of the handwriting legibility effect, addressed these three explanations.

Materials and methods

Participants and design

Forty first-year teacher students (30 female; age M = 22.5 years, SD = 2.76) of the University of Turku, Finland, formed a convenience sample. Informed consent was obtained from the participants after the nature of the experiment was explained to them. At the beginning of the experiment, the participants were informed that a detailed description of the study would be provided after the experiment (to avoid experimental bias due to knowledge of the independent variables). The participants were debriefed on the purpose of the research immediately after completing their task. The research was undertaken at the University of Turku, which does not require an internal research permit.

The study followed a true experimental design. Each participant graded three test answers about the human circulatory system that, independently and pseudorandomly across the participants, differed in content quality (excellent, good, poor) and handwriting legibility (high, medium, low).

Constructing test answers

A set of test answers was collected from the test answers of a group of fifth-grade Finnish school pupils. Answers of different content qualities were selected from this set, then typewritten and edited by the researchers for experimental control purposes. The answers were then copied in handwriting by a group of sixth-grade Finnish school pupils. The final set of nine test answers represented three levels of content quality and three levels of handwriting legibility. The construction of this set is described in more detail below. The school principals and the pupils’ teachers had granted permission to use the pupils’ test answers, without personal data, for the present research.

Twenty-four fifth-grade Finnish students of a teacher training school handwrote answers to a test about human blood circulation as a part of their everyday activities in an environmental studies class. They were instructed to write everything they knew about human blood circulation. The topic had been addressed previously in class, so the students were familiar with it.

Two of the authors (HS & MV) then graded the answers on a scale from 4 to 10 points. The assessment followed criteria derived from the objectives for teaching environmental studies in grades 3–6, as laid out in the Core Curriculum for Basic Education 2014 (Finnish National Board of Education 2016), and from the contents of the biology and geography study book Pisara (Cantell et al. 2016).

The answers were then divided into three categories by content quality: excellent, good and poor. The excellent answer was chosen from among answers with a score of 10. It was typed on a computer and edited down to 126 words. The good test answer was similarly made up, in 121 words, by combining elements from several test answers of similar quality so that it achieved a score of 7.5. The poor test answer was similarly made up, in 121 words, so that it achieved a score of 5.

Sixth-grade school pupils (N = 16) then copied the typed answers by hand, each pupil writing all three answers. These handwritten answers were assessed for legibility by nine adult female volunteers who kindly helped with this stage of material production. A score was given to describe how easy each answer was to read (ranging from “difficult to read” 1 via “mediocre to read” 3 to “easy to read” 5). Based on these scores, three handwritings (one of high, one of medium and one of low legibility) were selected to be used in conjunction with the answers of excellent, good and poor content quality.

Materials

The materials for the participants included general information about the experiment, detailed instructions, a summary of the human circulatory system and three test answers, each followed by an assessment form. The order of the three test answers in the set was pseudorandom across the participants. At the end, the materials included a questionnaire about the participant’s personal attributes (age, gender) and about the gender of the presumed writer of each test answer (this was not asked earlier, to prevent handwriting legibility from engaging the participants’ attention during grading).

In the assessment forms, three elements of an answer were graded with scores that could range from “unacceptable” 0 to “excellent” 6 in steps of 1. These three elements were (a) the basic structure and functions of the human circulatory system, (b) the composition and functions of blood and (c) the role of physical exercise in health. An overall score was then given to the answer as a whole; this score could range from “unacceptable” 4 to “excellent” 10 in steps of 1. The confidence of grading also received a score, ranging from “unconfident” 0 to “highly confident” 10 in steps of 1. The participants then assessed whether a female or a male student had written the test answer. The writer’s presumed grade in mathematics was asked last but is not analysed here.

The legibility of each answer was also assessed with a score that could range from “very difficult to read” 1 to “very easy to read” 5 in steps of 1. In addition, the participants were asked about their gender, age and study year. Finally, a sample of each participant’s own handwriting was obtained; this sample is not analysed here.

Procedure

The participants were given 20 min to complete the task at their own pace. They were instructed to strictly follow the ordering of the sub-tasks in their individual sets of materials.

Analysis

The effects of content quality and handwriting legibility were tested as within-subject factors in repeated-measures analyses of variance (ANOVA) performed using IBM SPSS Statistics, version 24 (IBM Corp., Armonk, NY, USA). Greenhouse–Geisser-adjusted degrees of freedom were used whenever the sphericity assumption (Mauchly’s test) was violated; in such cases, only the corrected p-value is reported. Subsequent pairwise comparisons were uncorrected. Partial eta squared (ηp²) was used as the measure of effect size in the ANOVAs, and Pearson-correlation-corrected Cohen’s d (to allow its use in within-subject designs) in the subsequent pairwise comparisons (Morris and DeShon 2002). All statistical tests were two-tailed with an alpha level of 0.05. There were two missing values in the dataset (both in the score for the basic structure and functions of the human circulatory system, one with excellent and the other with poor content quality). These values were replaced with the average of the existing values of the variable.
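To make this pipeline concrete, the following is a minimal sketch in Python rather than the SPSS procedure actually used. The long-format file grades_long.csv and its column names (participant, quality, element, score) are hypothetical; note that statsmodels’ AnovaRM does not apply the Greenhouse–Geisser correction, so the sketch reports uncorrected p-values and recovers partial eta squared from each F statistic and its degrees of freedom.

```python
# A minimal sketch of the analysis pipeline, not the authors' SPSS syntax.
# Assumed (hypothetical) long-format data: one row per participant x quality x element.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def partial_eta_squared(f: float, df_effect: float, df_error: float) -> float:
    """Recover partial eta squared from an F value and its degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)

# Load data and replace missing scores with the mean of the existing values
# of the same variable (quality x element cell), as described above.
df = pd.read_csv("grades_long.csv")  # hypothetical file name
df["score"] = df.groupby(["quality", "element"])["score"].transform(
    lambda s: s.fillna(s.mean())
)

# Two-way repeated-measures ANOVA: content quality x assessed element.
# Unlike SPSS, AnovaRM does not correct the degrees of freedom when
# Mauchly's test indicates a sphericity violation.
fit = AnovaRM(data=df, depvar="score", subject="participant",
              within=["quality", "element"]).fit()
table = fit.anova_table
table["np2"] = [
    partial_eta_squared(f, d1, d2)
    for f, d1, d2 in zip(table["F Value"], table["Num DF"], table["Den DF"])
]
print(table)
```

The manual ηp² step uses the identity ηp² = F·df_effect / (F·df_effect + df_error), which reproduces, for example, ηp² ≈ 0.80 from F(2, 78) = 158.4 reported below.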

Results

Content quality

Scores for elements

The participants graded each of the three elements of the test answers with a score. A repeated-measures ANOVA performed on this score with content quality (excellent, good, poor) and assessed elements (the basic structure and functions of the human circulatory system, the composition and functions of blood, the role of physical exercise in health) as within-subject factors revealed a statistically significant main effect of content quality, F(2, 78) = 158.4, p < 0.001, ηp² = 0.80 (Fig. 1a). Subsequent comparisons indicated a significant difference between every pair of the three factor levels (paired t-tests, p < 0.001). A statistically significant interaction between content quality and assessed elements, F(4, 156) = 32.2, p < 0.001, ηp² = 0.45, indicated a distinctly low score for “the role of physical exercise in health” with poor content quality. Statistically significant (paired t-tests, p ≤ 0.002) differences were found between excellent and good and between good and poor content quality for each of the three assessed elements.

Fig. 1 Results for content quality. (a) Mean scores of content quality across content quality levels and test answer elements. (b) Mean scores of overall content quality across content quality levels. (c) Mean scores of grading confidence across content quality levels. Error bars (which in Panel (a) are overlapped by the markers) refer to the standard error of the mean

Overall scores

The participants also graded each test answer as a whole with an overall score. A repeated-measures ANOVA performed on the overall score revealed a statistically significant main effect of content quality (excellent, good, poor), F(2, 78) = 89.8, p < 0.001, ηp² = 0.70 (Fig. 1b). Subsequent pairwise comparisons indicated a statistically significant difference between every pair of content qualities (paired t-tests, p < 0.001), Cohen’s d values being 0.82 (corrected with a correlation of 0.33) between excellent and good and 1.37 (corrected with a correlation of 0.11) between good and poor content quality.
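The text does not spell out the exact variant of the correlation correction; the sketch below follows one common reading of Morris and DeShon (2002), standardizing the mean paired difference by the standard deviation of the difference scores, into which the Pearson correlation r between the paired conditions enters. The function name and example variables are illustrative.

```python
# A hedged sketch of a correlation-corrected Cohen's d for paired scores,
# following one common reading of Morris and DeShon (2002); the paper does
# not specify the exact variant used.
import numpy as np
from scipy.stats import ttest_rel

def cohens_d_within(x, y):
    """Return the repeated-measures d and the Pearson correlation r.

    The correlation enters through the SD of the paired differences:
    sd_diff = sqrt(s_x**2 + s_y**2 - 2 * r * s_x * s_y).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    sd_diff = np.sqrt(sx**2 + sy**2 - 2 * r * sx * sy)
    return (x.mean() - y.mean()) / sd_diff, r

# Example usage with hypothetical per-participant overall scores:
# d, r = cohens_d_within(scores_excellent, scores_good)
# t, p = ttest_rel(scores_excellent, scores_good)  # the uncorrected paired t-test
```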

Grading confidence

A repeated-measures ANOVA performed on the participants’ grading confidence revealed a statistically significant main effect of content quality (excellent, good, poor), F(2, 78) = 31.8, p < 0.001, ηp² = 0.30 (Fig. 1c). Subsequent pairwise comparisons indicated statistically significant (paired t-tests, p ≤ 0.032) differences between excellent and good and between good and poor content quality, Cohen’s d values being 0.61 (corrected with a correlation of 0.10) between excellent and good and 0.36 (corrected with a correlation of 0.09) between good and poor content quality.

Presumed gender of the writer

The observed probabilities of the presumed genders of the writers did not differ from the expected 0.5 with excellent, good or poor content quality (binomial exact tests, all p ≥ 0.268).

To assess whether the presumed gender of the writer was reflected in grading, the cases were grouped by presumed gender (15 males and 25 females) within the subsample of cases with medium handwriting legibility (to keep handwriting legibility constant). A mixed-model ANOVA on the scores for content quality with assessed elements (the basic structure and functions of the human circulatory system, the composition and functions of blood, the role of physical exercise in health) as a within-subject factor and presumed gender (female, male) as a between-subject factor revealed no significant main effect of gender, F(1, 38) = 0.10, p = 0.753, nor an interaction between gender and assessed elements, F(2, 76) = 0.83, p = 0.406 (Fig. 2a).
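As an illustration only (the original analysis was run in SPSS), the same mixed design could be sketched with pingouin’s mixed_anova; the file and column names, including presumed_gender, are assumptions carried over from the earlier sketch.

```python
# A sketch of the mixed-model ANOVA on the medium-legibility subsample, using
# pingouin rather than the SPSS procedure actually used; 'presumed_gender' is
# the hypothetical between-subject grouping column.
import pandas as pd
import pingouin as pg

df = pd.read_csv("grades_long.csv")        # hypothetical file
medium = df[df["legibility"] == "medium"]  # keep handwriting legibility constant

aov = pg.mixed_anova(data=medium, dv="score", within="element",
                     subject="participant", between="presumed_gender")
print(aov)
```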

Fig. 2 Results for the writer’s presumed gender. Only cases with medium handwriting legibility are included. (a) Mean scores of content quality across presumed genders and test answer elements. Ns refers to a statistically non-significant (p > .05) main effect of gender and its interaction with assessed elements. (b) Mean scores of overall content quality across presumed genders. Ns refers to a statistically non-significant (p > .05) gender difference

Consistently, no significant gender difference in the overall score for content quality was found in the participants’ grades (independent-samples t-test, t(38) = 0.50, p = 0.619) (Fig. 2b).

Handwriting legibility

Scores for elements

The participants graded each of the three elements of the test answers with a score. A repeated-measures ANOVA on this score with handwriting legibility (high, medium, low) and assessed elements (the basic structure and functions of the human circulatory system, the composition and functions of blood, the role of physical exercise in health) as within-subject factors revealed neither a main effect of handwriting legibility, F(2, 78) = 0.5, p = 0.61, nor an interaction between handwriting legibility and assessed elements, F(4, 156) = 0.6, p = 0.61 (Fig. 3a).

Fig. 3 Results for handwriting legibility. (a) Mean scores of content quality across handwriting legibility levels and test answer elements. (b) Mean scores of overall content quality across handwriting legibility levels. (c) Mean scores of grading confidence across handwriting legibility levels. Ns refers to a statistically non-significant (p > .05) main effect of handwriting legibility and its interaction with assessed elements. p = .087 refers to the statistical significance of the main effect of handwriting legibility. Error bars refer to the standard error of the mean

Overall scores

The participants also graded each test answer as a whole with an overall score. A repeated-measures ANOVA on this overall score revealed no statistically significant main effect of handwriting legibility (high, medium, low), F(2, 78) = 0.55, p = 0.577 (Fig. 3b).

Grading confidence

A repeated-measures ANOVA performed on the participants’ grading confidence revealed a marginally significant main effect of handwriting legibility (high, medium, low), F(2, 78) = 2.5, p = 0.087, ηp² = 0.06 (Fig. 3c).

Presumed gender of the writer

The gender distribution differed significantly from the expected 0.5 with low (34 males vs. 6 females) and with high (3 males vs. 37 females) handwriting legibility (binomial exact tests, both p < 0.001). With medium handwriting legibility, the distribution (15 males vs. 25 females) did not differ significantly from the expected 0.5 (binomial exact test, p > 0.05).
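These exact tests are easy to cross-check from the reported counts; a short sketch using scipy’s binomtest (two-sided, chance level 0.5):

```python
# Reproducing the binomial exact tests from the reported counts (two-sided,
# chance probability 0.5); requires scipy >= 1.7 for binomtest.
from scipy.stats import binomtest

counts = {"low legibility": (34, 6),      # (males, females)
          "high legibility": (3, 37),
          "medium legibility": (15, 25)}

for condition, (males, females) in counts.items():
    result = binomtest(males, n=males + females, p=0.5, alternative="two-sided")
    print(f"{condition}: {males} vs {females}, p = {result.pvalue:.3f}")
# Low and high legibility give p < 0.001; medium gives p of roughly 0.15,
# i.e. non-significant at the 0.05 alpha level used throughout.
```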

Perceived handwriting legibility

A repeated-measures ANOVA performed on the participants’ perceived level of handwriting legibility revealed a statistically significant main effect of handwriting legibility (high, medium, low), F(2, 78) = 135.5, p < 0.001, ηp² = 0.78. Subsequent pairwise comparisons indicated statistically significant (p < 0.001) differences between consecutive levels of handwriting legibility, Cohen’s d values being 0.99 (corrected with a correlation of 0.09) between high and medium and 1.81 (corrected with a correlation of 0.18) between medium and low legibility (Fig. 4).

Fig. 4 Results for perceived handwriting legibility. Mean scores of perceived handwriting legibility across handwriting legibility levels. Error bars refer to the standard error of the mean

Discussion

The participants, first-year teacher students, were instructed to grade fifth-grade primary school pupils’ handwritten test answers. The answers varied independently in content quality and handwriting legibility. The participants succeeded in grading by content quality (Fig. 1a and b), although with lower grading confidence towards lower content quality (Fig. 1c). The grades were not systematically altered by handwriting legibility (Fig. 3a and b), a negative finding for the handwriting legibility effect. Nevertheless, the participants’ grading confidence decreased at a trend level towards lower handwriting legibility (Fig. 3c), and they stereotyped different levels of handwriting legibility by the presumed gender of the writer (low legibility associated with males and high legibility with females). However, these gender stereotypes were not observably reflected in the grades (Fig. 2a and b).

The negative evidence of the handwriting legibility effect (Fig. 3a and b; Graham et al. 2011) called for further analysis of the data to assess the plausibility of the three explanations proposed in the introduction. These explanations are addressed in the following.

The first explanation for the negative evidence, the inability of the variation in handwriting legibility to implicitly alter the participants’ behaviour, had to be rejected. The participants were found to spontaneously stereotype texts written with different levels of handwriting legibility by gender. They attributed high-legibility writing to female and low-legibility writing to male writers (see also Burr 2002; Hartley 1991), presumably reflecting the participants’ previous experience of females producing more legible handwriting than males (Weintraub et al. 2007; Graham et al. 1998). Furthermore, the statistical trend towards lower grading confidence with lower handwriting legibility (Fig. 3c) tentatively suggests that handwriting legibility directly affected the process of grading, albeit not the grades themselves. It is possible that the participants needed greater cognitive effort to grade low- than high-legibility test answers (Kahneman 2011). Their decreased grading confidence towards lower content quality (Fig. 1c) is more difficult to explain. It is possible that the test answers with lower content quality were cognitively more demanding to grade because they were poorer matches to the predefined elements of an ideal test answer specified in the instructions.

The second explanation for the unobservability of the handwriting legibility effect, namely higher grades given to (illegibly writing) males than to (legibly writing) females (King 1998; Martin 1972; Spear 1984), had to be rejected as well. The logic of this explanation is straightforward. The handwriting legibility effect involves lower grades for low-legibility than for high-legibility handwriting (Graham et al. 2011). However, genders, if stereotyped from these legibility levels (Burr 2002; Hartley 1991), should lead to just the opposite: low-legibility writers, simply because they are presumed male, should earn higher grades than high-legibility writers, simply because they are presumed female (King 1998; Martin 1972; Spear 1984), thereby compensating away the handwriting legibility effect proper (Graham et al. 2011). Although the present data showed that the participants stereotyped handwriting legibility by gender, the gender stereotypes were not observably reflected in the grades (Fig. 2a and b).

Of the three explanations, one remains. Most probably, the participants spontaneously and selectively inhibited the effects of handwriting legibility on grading in particular. The handwriting legibility effect (Graham et al. 2011) has been found to disappear when participants become explicitly aware of the threat of this bias (Greifeneder et al. 2010). In the present study, the participants were not informed about possible bias by handwriting legibility. However, the grading task was well structured and well instructed, which in itself could have effectively supported the participants’ objectivity in their assessments.

The present findings also have some limitations. First, the participants did not explicitly express their gender stereotypes for handwriting legibility until after the grades had been given. It therefore remains unclear whether the participants implicitly engaged in such stereotyping during grading. Note, however, that if they did not, the alternative hypothesis (of the handwriting-induced gender effect) for the unobservability of the handwriting legibility effect would have to be rejected anyway. Secondly, the finding that gender stereotypes were not measurably reflected in grades was based on only part of the dataset, namely the cases with medium handwriting legibility (to control for variation in handwriting legibility). This obviously reduced statistical power, which underlines the inconclusiveness of this negative finding. Note also that the genders were likely to be less clearly inferable from medium- than from high- or low-legibility handwriting and thereby possibly less effective in biasing grading. Thirdly, although the test answers had some typical elements of informational content and handwriting legibility faced by class teachers in their profession, grading here took place as part of an experiment in a non-natural environment. Therefore, the external validity of the present findings remains unclear. Future studies are needed on whether, and if so to what extent, teachers would be similarly inclined to resist the handwriting legibility effect in their professional settings.

Conclusions

Teacher-student participants were not found to grade legibly handwritten test answers measurably higher than illegibly handwritten ones (the handwriting legibility effect). Nevertheless, they spontaneously stereotyped different levels of handwriting legibility by gender and, at a trend level, were less confident in grading the content of the writing the lower its legibility. Therefore, the handwriting legibility effect was unlikely to be absent simply because the variation in handwriting legibility was insufficient to implicitly alter the participants’ behaviour. The finding that the gender stereotypes were not measurably reflected in the grades, in turn, suggests that a directionally opposite, handwriting-legibility-induced gender effect on the grades was unlikely to have compensated away the handwriting legibility effect proper. Thus, most probably, the participants voluntarily inhibited the handwriting legibility effect. Such voluntary debiasing could also explain at least some of the previous negative findings of this effect.