Introduction

Handwriting has been a cornerstone of human civilization (Harari, 2014), and even in the digital age it still plays an important role in daily learning and communication. It is used in many everyday activities, for instance writing a birthday card or filling in a banking form. More importantly, it is still the dominant mode of learning and assessment for students (Cutler & Graham, 2008; Jones & Hall, 2013). In fact, with the availability of digital pens and handwriting input methods, handwriting now plays an increasingly important role in information processing and communication (e.g., note-taking, annotations, and handwritten graphics using digital pens, or handwriting input using fingers on smartphones). However, handwriting is a skill that many students struggle with (McBride, 2019) and may be in danger of being lost (e.g., Jones & Hall, 2013; Medwell & Wray, 2008; Wollscheid et al., 2016), especially in languages with a logographic script such as Chinese. In Chinese-speaking regions, there has been a significant decline in people’s handwriting literacy due to the dominant use of typing, with many Chinese handwriters suffering from what is known as character amnesia (提笔忘字 in Chinese; see, for example, Almog, 2019; Hilburger, 2016; Huang, Lin et al., 2021; Huang, Zhou et al., 2021).

In this paper, we draw on a large-scale Chinese character handwriting database (Cai, Xu & Yu, in prep-b), and use an Item Response Theory model to select 30 characters for use in a test of character amnesia rate. This short diagnostic test can be used in a pencil-and-paper assessment of character amnesia in educational and clinical settings.

Character amnesia in Chinese handwriting

The dominance of digital typing in recent decades has diminished the opportunity for people to handwrite. With increasing use of digital communication, people now rarely use a pen other than a stylus, and keyboard-based input systems dominate Chinese character production. While digital typing resembles handwriting in transparent scripts, where phonemes are spelled sequentially into graphemes (e.g., letters in Greek or English), this is not true of logographic writing systems such as Chinese. In the Chinese character-based system, phonemes in the spoken language do not have transparent mappings onto graphemes in writing. Instead, each meaning-carrying Chinese syllable is mapped onto a character. Chinese has thousands of characters, which are in turn combinations of more than 600 orthographic components (i.e., radicals and strokes, documented in the GB13 Character Component Standard, 1997). For example, the character 语 (meaning “language”) is composed of three radicals (讠, 五, and 口). A radical is further composed of two or more strokes arranged in a two-dimensional space with a particular stroke order (e.g., 讠 decomposes into upper and lower components, which are re-used in other contexts). In addition, there is no one-to-one mapping between phonology and orthography in Chinese. A syllable such as yu3 (subscript indicating tone) in Mandarin can be realized as different characters, depending on the intended meaning (e.g., 语 meaning “language”, 羽 meaning “feather”, 雨 meaning “rain”, 宇 meaning “eaves”). As a result, handwriting literacy in Chinese requires explicit orthographic-motor knowledge of thousands of characters, which is typically accumulated in childhood via years if not decades of rote learning, and needs to be constantly consolidated against attrition in adulthood (McBride-Chang & Ho, 2000; Tong & McBride-Chang, 2010). To make things worse, the dominant form of digital typing of Chinese (at least in mainland China) uses the Pinyin romanization system, which generates lists of candidate characters from a phonological transcription entered without tones. For example, the input “yu” will generate candidates such as 鱼 (“fish”), 语 (“language”), 雨 (“rain”), and 玉 (“jade”), from which the intended target can be selected. Such phonology-based input methods make it unnecessary for writers to access the radicals and strokes that compose a character.

A negative consequence of digital typing, especially the use of Pinyin input methods, is character amnesia (Pyle, 2016). As mentioned above, character amnesia refers to the phenomenon whereby Chinese speakers fail to handwrite a character even though they know which character they are supposed to write (e.g., Almog, 2019; Hilburger, 2016; Huang, Zhou et al., 2021). For instance, when asked to handwrite a character according to a dictation prompt (e.g., shui3 dao4 de dao4, 水稻的稻, meaning “rice from the word rice-plant”), people have access to both the phonology (i.e., dao4) and the meaning (i.e., rice), but may produce none or only some of the strokes of the target character. Character amnesia has been a societal concern widely reported in the news. A survey by the China Youth Daily (Xu & Zhou, 2013), for instance, revealed that 98.9% of 2517 survey respondents reported experience of character amnesia, with about 30% of them saying that character amnesia was quite common for them. A recent survey by Guang Ming Daily (Guangming Wang Public Opinion Center, 2019) of internet users in China found that 60% reported regular experience of character amnesia. A recent analysis (Huang, Zhou et al., 2021) of a large-scale handwriting database (developed by Wang et al., 2020) showed that college students in mainland China suffered from character amnesia about 5.6% of the time on average.

Assessing handwriting literacy in Chinese

A range of assessment tools have been developed for measuring handwriting literacy. Many of these focus on penmanship, the fine motor skills involved in producing legible characters. Typically, tests of penmanship involve test-takers (normally children) being asked to copy a number of words on a sheet (Phelps et al., 1985; Reisman, 1993), or to produce highly familiar characters such as their own names (Chan & Louie, 1992; Chan et al., 2008; Tse et al., 2019; Yin & Treiman, 2013). Tests of penmanship provide an opportunity for a trained professional to evaluate important aspects of handwriting legibility such as alignment, spacing and size. Other assessments place emphasis on fluency. Copied characters-per-minute is commonly used as a measure of handwriting fluency (Chow et al., 2003; Tseng & Hsueh, 1997; Wallen et al., 1997). Recent work has explored more multifaceted measurements of handwriting properties in copying tasks, including legibility and pen-pressure (Li-Tsang et al., 2019), and other handwriting features made available in a standardized format by computerized handwriting input (Falk et al., 2011; Li-Tsang et al., 2022).

By using highly familiar or provided characters, tests of penmanship or fluency typically minimize or eliminate the need to retrieve orthographic representations. Thus, these tests do not evaluate orthographic retrieval, the process involved in character amnesia. By design, most existing Chinese handwriting assessments do not assess how good (or bad) a test-taker is at accessing the orthographic components of characters. A better solution for assessing character amnesia is to use writing to dictation, where test-takers handwrite a character according to a spoken dictation prompt (see for example Huang, Lin et al., 2021; Huang, Zhou et al., 2021; Wang et al., 2020).

A key issue in the design of dictation tests is the selection of test characters. Chinese has approximately 12,500 characters according to the Modern Chinese Language Dictionary (The Commercial Press, 2016) and about 4500 characters in common daily use according to the Modern Chinese Frequency Dictionary (Institute of Language Teaching & Research, 1986). Since an exhaustive test is impossible, it is important for a practical assessment of character amnesia to select a reasonably small number of test characters. The selected characters should be neither too easy nor too difficult to write, as such characters cannot finely assess test-takers’ degree of character amnesia. In this paper, we aim to develop an assessment of character amnesia on the basis of a large-scale handwriting database (Cai, Xu & Yu, in prep-b), using Item Response Theory to select a set of characters that discriminate well between test-takers with different amnesia rates. Below, we describe the construction of the short-form test and quantify how well it predicts amnesia rates on a comprehensive long-form test.

Item response theory

Item response theory (IRT) departs from classical test theory in trying to jointly infer properties of both the test items and the test-takers (Bock, 1997). The core idea is that item properties such as difficulty can and should inform the interpretation of test-takers’ answers. For example, a correct answer to a difficult item should ‘count more’ towards an estimate of ability than a correct answer to an easy item. Jointly inferring item difficulty alongside test-taker ability is an evidence-based way to set the appropriate weights. Difficulty is just one commonly inferred item characteristic: other possibilities include diagnosticity (how quickly the probability of success changes with increases in ability), guess rates (for multiple-choice questions), and factor loadings (if multiple distinct factors are thought to influence responses). IRT models form a large and diverse family of approaches (Chalmers, 2012; Van der Linden & Hambleton, 1997), differing in the item characteristics they aim to estimate, the assumptions they make about the structure of the target ability being tested, and the inferential framework used. Particularly common variants include the Rasch model, which estimates difficulty only; the 2-parameter IRT model, which estimates difficulty and discriminability; and the 3-parameter IRT model, which additionally considers guess rate (Reise et al., 2005). In this paper we take a hierarchical Bayesian approach (Bürkner & Vuorre, 2019), applying the hierarchical 2-parameter IRT model (closely following an example implementation by the Stan development team, 2021) to our domain of interest in handwriting assessment.

In addition to supporting inferences about test-taker abilities, estimates of item characteristics can inform test design. Test designers typically aim to achieve maximum information within time and resource constraints. Information gain is highest when all outcomes for a test item are equally likely, and zero if a particular response is guaranteed ahead of time. Test information gain is therefore a function of both the item characteristics and the ability of the test-taker (Lord, 1977; Schmidt & Embretson, 2012). In particular, for the 2-parameter model considered here, the information for item \(j\) is given by \(I_{j}(\alpha) = \gamma_{j}^{2} P_{j}(\alpha) Q_{j}(\alpha)\), where \(\gamma_{j}\) is the estimate of item discriminability, \(\alpha\) is participant ability, \(P_{j}(\alpha)\) is the probability of writing the character correctly for the item in question, and \(Q_{j}(\alpha) = 1 - P_{j}(\alpha)\) (Tsompanaki & Caroni, 2021).
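To make the information calculation concrete, the following minimal numpy sketch (our illustration, not the paper’s released code) evaluates the 2-parameter item information function over a grid of ability values, using made-up parameter values:

```python
import numpy as np

def p_success(alpha, beta, gamma):
    """2PL success probability: inverse logit of gamma * (alpha - beta)."""
    return 1.0 / (1.0 + np.exp(-gamma * (alpha - beta)))

def item_information(alpha, beta, gamma):
    """Information contributed by one 2PL item at ability alpha:
    I(alpha) = gamma**2 * P(alpha) * Q(alpha)."""
    p = p_success(alpha, beta, gamma)
    return gamma ** 2 * p * (1.0 - p)

# Information peaks where alpha == beta (P = Q = 0.5), with value
# gamma**2 / 4, and falls off on either side.  Illustrative values only:
abilities = np.linspace(-3.0, 3.0, 7)
print(item_information(abilities, beta=0.0, gamma=2.0))
```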

The main barrier to directly optimizing for information gain is its dependence on the test-taker’s ability, which implies that different combinations of items will be most sensitive in different ability ranges. The optimal distribution of sensitivity over ability levels depends on the goals of the test: if the goal is highest overall accuracy, better results will be achieved by concentrating sensitivity in the most common ability ranges. However, in some applications it may be desirable to sacrifice some precision in the most commonly occurring ability ranges in order to better distinguish the extremes.

In the absence of a uniquely optimal way to create subsets of test items for all applications, we focus on subsets maximizing the correlation between predicted amnesia rates and observed amnesia rates in held-out observations. In the following, we distinguish between holding out responses, which examines the model’s ability to predict amnesia/success on unseen characters given only responses to a tested subset, and holding out participants, which examines the subsetting procedure’s ability to select characters appropriate for new test-takers from the same population. To the best of our knowledge, analytical results giving optimal subsets for these measures of performance are not available, so we turn instead to empirical cross-validation of heuristic subsetting schemes.

One reason analytical results are hard to come by is that high discrimination and a diversity of difficulties are both highly desirable, and these factors interact with each other to determine the overall quality of the test (An & Yung, 2014). Low-discrimination items can attract both correct and incorrect answers from participants with a wide range of ability scores, so observing one response to a low-discrimination item is only weakly informative about where on the ability scale a participant lies. High-discrimination items come closer to ruling out all ability levels not consistent with the observed response, and are therefore typically more informative. However, high discrimination does not straightforwardly guarantee high information gain. High-discrimination items can give poor information gain if they cluster together in a small difficulty range, with the restricted coverage making the information from each item redundant. They may also be uninformative if they take on extreme values such as very high or very low difficulty, leaving them discriminating sharply between participant ability values that are rarely observed.

Below, we propose two heuristic schemes for selecting items from a large pool of candidates to construct a short-form test of character amnesia rates, with a third random selection scheme considered as a baseline. The two heuristic schemes are subsetting based on high discriminability only, or on high discriminability after first ensuring a diversity of difficulties. We make no claim to optimality, and investigate the quality of these item selection schemes empirically with the expectation that their performance will depend on the specific distribution of participant ability and item characteristics particular to our application domain. Our primary measure of test performance is the prediction of amnesia/correct handwriting responses in the 1200-item test using participant ability parameters estimated from responses to the short-form test, for held-out participants not involved in any aspect of the test construction or item characteristic estimation. We find the correlation between predicted and observed character amnesia rates to be typically around 0.89 in repeated tests of the best-performing selection procedure (maximum discrimination) using different sets of held-out participants. We also discuss the stability of the item selection procedures given sampling variation in the responses used to calibrate our estimates of the item characteristics, and describe how prediction performance changes with the number of items included in the short-form test.

The handwriting database

Our work here draws on an existing large-scale handwriting database (Cai, Xu & Yu, in prep-b), which served as input to our IRT analysis for the construction of the diagnostic test. The database contains handwriting outcomes for 1200 Chinese characters produced by 42 adult native speakers of Mandarin (32 female, age range 19–22 years, mean 20.50) in a written-dictation task. All participants were right-handed, had normal or corrected-to-normal vision and hearing, and reported no neurological or psychiatric disorders or ongoing medical treatment.

The characters originate from SUBTLEX-CH (Cai & Brysbaert, 2010), a corpus of Chinese film subtitles, with several filters applied to extract a subset suitable for a dictation task (Wang et al., 2020). The first filter was on character frequency, which was constrained to a log frequency between 1.5 and 5.0 (corresponding to more than 32 and fewer than 100,000 instances in the corpus). Cai et al. (in prep-b) also discarded characters with four or fewer strokes, for which most handwriters could be expected to perform at ceiling. These filters resulted in a candidate set of 2095 characters, which were further screened by familiarity rating. For each candidate character, Cai et al. (in prep-b) took the most common bi-character word context in which the character appeared (e.g., 水稻, shui3 dao4, for the target character 稻, dao4) and asked 15 participants (university undergraduates from the same universities as the participants in the current study) to rate it 1–7 for familiarity. Bi-character words with an average familiarity rating of less than 4 were excluded. Dictation phrases were converted to audio using Google’s text-to-speech reader (https://cloud.google.com/text-to-speech) and further filtered for clarity, resulting in 1600 candidate dictation recordings. In order to keep participation times within four 1-h sessions, Cai et al. (in prep-b) selected 1200 items uniformly at random from this candidate pool. Mean log frequency with respect to the original corpus was 3.44. The number of strokes per character ranged between 5 and 21, with a mean of 9.72, while the number of constituent radicals ranged between 1 and 7, with a mean of 2.98.

In the experiment, participants heard a dictation prompt indicating the to-be-written character in a one-word context (e.g., prompts of the form “水稻的稻”, meaning “The character ‘稻’ in the word 水稻”) and wrote the target character on completion of the prompt. When they had finished handwriting, participants were shown the target character and self-reported whether they had written the character correctly (a success response), knew which character they were supposed to write but had forgotten how to write it (a character amnesia response), or did not know which character to write (a don’t-know response) (see Fig. 1). Self-reports were checked by three checkers, each examining all responses to 400 of the characters. This process identified only 297 self-reported handwriting outcomes as questionable (0.6%); on this basis we accept the accuracy of the self-reported response status and use it throughout. In all, there were 45,737 success responses (90.7%), 2679 character amnesia responses (5.3%), and 1984 don’t-know responses (3.9%). Here we consider only success and character amnesia responses, and interpret results as handwriting success conditional on knowing the character (Fig. 2).

Fig. 1

Presentation procedure for the dictation task. Participants heard a start-of-trial ding, followed by a 500 ms pause and a dictation prompt (e.g., 水稻的稻/shui3 dao4 de1 dao4/, meaning ‘rice’ (dao4) from the word ‘rice-plant’ (shui3 dao4)). After they finished handwriting, the target character was shown on screen and they self-evaluated their response using the options: “0: Correct writing”, “1: I knew the character to write but forgot how to write it”, and “2: I did not know the character to write”

Fig. 2

Proportion of each response type (amnesia, don’t know, or success) for each participant. Most participants wrote most characters correctly. The smallest number of amnesia responses observed was one; the maximum was 137

The mean amnesia rate for participants in this study was 0.055 (median 0.053), with a maximum observed amnesia rate of 0.11. The smallest number of amnesia responses for any one participant was one; the maximum was 137 (see Fig. 2 for the proportion of each response type from each participant). The mean amnesia rate for characters was 0.061, but the distribution of amnesia rates is strongly right-skewed (Fig. 3), with a median of 0.024 and third quartile at 0.071. There were 553 characters that never attracted an amnesia response, while a small minority of characters produced very many amnesia responses, with the most-forgotten character, 瘪 (bie3, deflated), giving an amnesia rate of 0.852.

Fig. 3

Histograms of amnesia rates by item and by participant. Most characters were usually written correctly, with a minority of characters attracting high amnesia rates. Median participant amnesia was around 5%

Analysis

In this section, we provide the model definition and item selection procedures that together constitute our handwriting test. The IRT model follows the 2-parameter IRT implementation given in the Stan user’s guide (Stan development team, 2021) and estimates participant ability (α), item difficulty (β), and item discriminability (γ). The item selection procedures target high discriminability or diverse difficulty, with random item selection also considered as a baseline. We then aim to answer three questions. First, how stable are the item selection procedures, given that there is some sampling variability in the estimates of item characteristics? Second, given our best estimates of the item characteristics, how does prediction performance change with the number of items included in the subset? Finally, at our chosen subset size of 30 items, what is the expected performance when predicting character amnesia rates in future test-takers?

Brief model description

Full implementation details appear in Appendix 1. In brief, each trial is treated as a Bernoulli random variable with probability of success given by \(\Pr(\mathrm{success}) = \mathrm{logit}^{-1}(\gamma_{j}(\alpha_{i} - \beta_{j}))\) for participant \(i\) responding to item \(j\), where the participant has ability \(\alpha_{i}\), and the item has difficulty \(\beta_{j}\) and discrimination \(\gamma_{j}\). The inverse logit transforms the combined parameters into an S-shaped curve between zero and one, interpretable as a probability of success. The difference between participant ability and item difficulty determines the location of the ‘step’ from likely failure to likely success, while the discrimination parameter determines how steep the step is. Subscripts indicate that distinct abilities and item characteristics are estimated for each participant and item. A plate diagram and details of the prior specification appear in Appendix 1, along with some further discussion of the modelling choices. Code is available at https://osf.io/nersu/?view_only=80343b61798d4eac8f9912de0b9c1ec9.
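The model itself is fit with Stan; purely as an illustration of the response function, a minimal numpy sketch of the trial-level Bernoulli log-likelihood might look as follows (the array names and indexing scheme are our assumptions, not the released implementation):

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood(y, person, item, alpha, beta, gamma):
    """Bernoulli log-likelihood of all trials under the 2PL response
    function.  y[t] is 1 for a success and 0 for an amnesia response on
    trial t; person[t] and item[t] are integer indices into the
    parameter vectors alpha (ability), beta (difficulty), and gamma
    (discrimination)."""
    p = inv_logit(gamma[item] * (alpha[person] - beta[item]))
    return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))
```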

Item selection procedures

The three item selection procedures described below draw from the 1200 characters for which item characteristic estimates are available, and attempt to select subsets that will support accurate inference of amnesia rates in the whole set.

Random

The random selection procedure chooses characters uniformly at random without replacement from the available set of 1200. It is presented here as a baseline procedure.

Maximum discrimination

The maximum discrimination selection procedure first excludes from consideration any characters with an amnesia rate of less than 1%. Although items with no variation in responses are uninformative, they might otherwise be selected by this procedure: a lack of variation is most consistent with an item having high discrimination and floor or ceiling difficulty, since low-discrimination items produce the minority response more readily over a broader range of abilities. After excluding these very easy items, the procedure sorts the remainder on mean estimated discrimination (γ) and selects the highest n items to produce candidate test sets of size n.

Diverse difficulty

The diverse difficulty selection procedure first excludes characters with amnesia rates of less than 1%, as in the maximum discrimination procedure. The remaining items are sorted by mean estimated difficulty (β) and divided into n equally sized bins, where n is the desired subset size, with the last bin ‘short’ if the available characters cannot be divided evenly. Within each difficulty bin, the highest-discrimination character is selected. While this procedure attempts to select items from the full range of difficulties spanned by the items showing amnesia responses, note that the β scores of test sets selected in this way preserve the strong right skew of difficulty scores present in the original full set of items. That is to say, the easiest items are very easy, and difficulty increases quickly over the first few bins and more slowly over subsequent bins.
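As an illustration of all three procedures, the following sketch (our paraphrase, not the released code; the data-frame columns amnesia_rate, beta, and gamma are assumed names for the item-level estimates) implements random, maximum discrimination, and diverse difficulty selection:

```python
import numpy as np
import pandas as pd

def select_random(items, n, seed=0):
    """Baseline: n characters drawn uniformly at random, without
    replacement, from the full pool."""
    return items.sample(n=n, random_state=seed)

def select_max_discrimination(items, n):
    """Drop near-ceiling characters (amnesia rate below 1%), then take
    the n items with the highest mean estimated discrimination."""
    eligible = items[items["amnesia_rate"] >= 0.01]
    return eligible.nlargest(n, "gamma")

def select_diverse_difficulty(items, n):
    """As above, but first sort eligible items by difficulty and split
    them into n bins (np.array_split makes near-equal bins, one way to
    handle the 'short last bin' remainder), then take the
    highest-discrimination item from each bin."""
    eligible = items[items["amnesia_rate"] >= 0.01].sort_values("beta")
    bins = np.array_split(np.arange(len(eligible)), n)
    picks = [eligible.iloc[idx]["gamma"].idxmax() for idx in bins if len(idx)]
    return eligible.loc[picks]

# Toy usage with made-up estimates:
items = pd.DataFrame({"amnesia_rate": [0.00, 0.02, 0.30, 0.65],
                      "beta":         [-2.0, 0.1, 0.8, 2.2],
                      "gamma":        [1.5, 2.4, 0.7, 2.0]},
                     index=["水", "稻", "瘪", "语"])
print(select_max_discrimination(items, 2))
print(select_diverse_difficulty(items, 2))
```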

Stability of the calibration estimates

We investigated the stability of the item selection procedures defined above by repeatedly removing 10 participants, estimating item characteristics from the remaining observations, and selecting items according to these estimates. Over 30 repetitions of this procedure, we asked how often each item was selected for inclusion in a 30-item test. The random procedure is not sensitive to the number of calibration observations, but the maximum discrimination and diverse difficulty procedures would produce subsets deterministically in the limit of perfectly stable β and γ estimates. How variable the character ‘inclusion rate’ is under repeated dropping of participants therefore gives some indication of how sensitive each item selection procedure is to sampling variability in the calibration sample.
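A sketch of this stability check, with fit_fn and select_fn as caller-supplied stand-ins for the Stan calibration fit and a selection procedure (our illustration, not the released code; the "participant" column is an assumption about the response table), might look like:

```python
import numpy as np

def inclusion_counts(responses, fit_fn, select_fn, n_items=30,
                     n_reps=30, n_drop=10, seed=0):
    """Repeatedly drop n_drop participants, re-estimate item
    characteristics from the remaining responses (fit_fn wraps the
    calibration fit, e.g., the Stan model), select an n_items test with
    select_fn, and count how often each character is included."""
    rng = np.random.default_rng(seed)
    participants = responses["participant"].unique()
    counts = {}
    for _ in range(n_reps):
        dropped = rng.choice(participants, size=n_drop, replace=False)
        kept = responses[~responses["participant"].isin(dropped)]
        item_estimates = fit_fn(kept)
        for char in select_fn(item_estimates, n_items).index:
            counts[char] = counts.get(char, 0) + 1
    # Characters appearing in most repetitions indicate stable selection;
    # a near-uniform spread would indicate sensitivity to the sample.
    return counts
```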

Thirty repetitions selecting item sets of size 30 gives 900 opportunities for item selection. With only 1200 items available, there will be some collisions due to chance: random item selection would typically (95%) result in between 632 and 658 unique characters used. In contrast, the diverse difficulty item selection procedure used only 290 distinct characters across the 30 tests, with 6 characters appearing in more than half the tests. The maximum discrimination procedure used only 153 distinct characters in the test lists it selected, with 19 characters appearing in over half the tests. These large overlaps lead us to conclude that increasing the number of calibration participants further is unlikely to result in substantial changes to the subsets selected by maximum discrimination and diverse difficulty, particularly since the substitutions are for items estimated to have similar characteristics.

Test set size

To examine the impact of test set size we take participant amnesia rates on the full set of 1200 tested characters as the ‘ground truth’ and ask to what extent amnesia rate on the full set can be predicted by the model when fit only to participant responses to a subset. In these analyses the item characteristics used were those estimated from all available data. Because these item characteristics are estimated using the same participants whose responses are being predicted, it is possible that prediction performance may be somewhat optimistic. Here our focus is on the relative improvement in prediction accuracy for held-out items as the number of items in the test increases. In the next analysis, we assess prediction performance for held-out participants who were not involved in estimates of item characteristics.
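Given ability estimates obtained from subset responses alone, the predicted full-set amnesia rate and its correlation with the observed rate can be computed as in the following sketch (ours, under the 2PL response function; parameter and variable names are illustrative):

```python
import numpy as np

def predicted_amnesia_rate(alpha_hat, beta, gamma):
    """Model-predicted amnesia rate over the full item pool for one
    participant: the mean predicted failure probability across all
    items, given an ability estimate from subset responses alone."""
    p_success = 1.0 / (1.0 + np.exp(-gamma * (alpha_hat - beta)))
    return float(np.mean(1.0 - p_success))

def prediction_correlation(alpha_hats, beta, gamma, observed_rates):
    """Correlation between predicted and observed full-set amnesia
    rates across participants."""
    predicted = [predicted_amnesia_rate(a, beta, gamma) for a in alpha_hats]
    return np.corrcoef(predicted, observed_rates)[0, 1]
```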

We find that as test-set size increases, correlations between predicted and observed amnesia rates on the full set of items approach roughly 0.9, with diminishing improvements from additional test items (Fig. 4). The maximum discrimination subsetting scheme reaches this level of performance fastest, at around 20 items, followed by diverse difficulty, which requires around 50 items for similar performance. Random selection shows large variation depending on the characters chosen, and appears to approach a somewhat lower correlation of 0.8 at the largest set size tested, 60 items. We conclude that a test-set size of 30 items is adequate for predicting the amnesia rates that would have been observed in a comprehensive long-form test.

Fig. 4

Correlations between predicted and observed character amnesia rates using test subsets chosen randomly (left), for diverse difficulty (center), or for maximum discrimination (right). Random subsets perform poorly and are highly variable; maximum discrimination and diverse difficulty subsets converge relatively quickly to predictions correlating at around 0.9 with observed rates on all items. Reference lines are loess-smoothed trends. The proposed test-set size of 30 items is highlighted with a dashed vertical line. We conclude that tests of this size can be adequately predictive of amnesia rates seen in the full 1200-item set

Figure 5 ‘unpacks’ one of the correlations plotted in Fig. 4 to show predicted vs. observed amnesia rates for all participants. This particular test used a test-set size of 30 items and produced predicted-vs-observed correlations of 0.89 for maximum discrimination, 0.79 for diverse difficulty, and 0.63 for random item selection. We present it here as a typical representative of results at this test-set size. Note that high amnesia rates appear to be systematically overestimated by the model when it is given only the subset responses; rank-ordering among participants is nonetheless well-preserved.

Fig. 5

A single prediction check at the target test size of 30 items. In this test, the subset chosen by maximum discrimination (right) produces predicted amnesia rates that correlate at r = 0.89 with those observed over all responses. Correlations from diverse difficulty (center) and random (left) were 0.79 and 0.62, respectively. The reference line shows the identity line representing the ideal outcome: estimates from the high-discriminability subsets appear to overestimate very high amnesia rates, although rank ordering is well-preserved

Performance with new test-takers

We sought to evaluate out-of-sample performance of the procedure as a whole, including item selection and subsequent prediction of amnesia rates after fitting to the selected short-form test. Over 30 repetitions, we randomly held out 10 participants and fit the model using the responses of the remaining 32. This calibration fit provided estimates of item characteristics for all 1200 characters. Item subsets were then selected according to each of the three selection procedures using these estimates. The model was then re-fit to all the data that would have been visible had the hold-out participants been new test-takers: all responses from participants involved in the calibration and item selection plus responses to the selected subset only from hold-out participants who were not involved in calibration or item selection. Overall amnesia rates were estimated for all participants, including the newly introduced ones who supplied only a subset of responses, by predicting success or amnesia on each item in the full item set using the item characteristics estimated from the 32-participant calibration data.
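The following sketch shows one repetition of this hold-out procedure, with fit_fn, select_fn, and predict_fn as stand-ins for the Stan fit, an item selection procedure, and the full-set prediction step (our illustration; the column names, and the assumption that responses contains only success/amnesia trials coded in a 0/1 "success" column, are not taken from the released code):

```python
import numpy as np

def holdout_correlation(responses, fit_fn, select_fn, predict_fn,
                        n_items=30, n_hold=10, seed=0):
    """One repetition of the hold-out check: calibrate on the retained
    participants, select a subset, refit with hold-out participants
    contributing only subset responses, and correlate predicted with
    observed hold-out amnesia rates."""
    rng = np.random.default_rng(seed)
    participants = responses["participant"].unique()
    held_out = rng.choice(participants, size=n_hold, replace=False)
    calib = responses[~responses["participant"].isin(held_out)]
    item_estimates = fit_fn(calib)                    # calibration fit
    subset = select_fn(item_estimates, n_items).index
    # Hold-out participants contribute responses to subset items only:
    visible = responses[(~responses["participant"].isin(held_out)) |
                        (responses["item"].isin(subset))]
    refit = fit_fn(visible)
    predicted = predict_fn(refit, held_out)           # full-set rates
    # Observed amnesia rate = 1 - success rate over success/amnesia trials:
    observed = (1.0 - responses[responses["participant"].isin(held_out)]
                .groupby("participant")["success"].mean())
    return np.corrcoef(predicted, observed.loc[held_out])[0, 1]
```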

Our main outcome of interest was the correlation between the observed and predicted amnesia rates for the held-out participants, who only provided observations for the selected subset of items (Fig. 6). Over thirty repetitions, these correlations ranged between 0.96 and 0.49 for maximum discrimination item sets, between 0.94 and 0.46 for diverse difficulty item sets, and between 0.91 and 0.04 for random item sets, with mean correlations of 0.82, 0.74, and 0.62, respectively. Since this measure is bounded above, these distributions are all left-skewed: the modal correlation observed for the best-performing selection procedure (maximum discrimination) is 0.89.

Fig. 6

Over 30 repetitions, 10 participants were held out and item characteristics estimated from the remaining 32. Item subsets were then selected according to these estimates, and observations from only selected items were added to the observation pool from the held-out participants. Histograms show the 30 correlations obtained between estimates of the 10 hold-out participants’ amnesia rates based only on the selected subset of items and the observed amnesia rates taken from their responses to all items

Discussion

We set out to address the issue of constructing a short-form test that can assess a Mandarin Chinese speaker’s character amnesia rate in handwriting. We approached this issue by applying IRT to an existing large-scale handwriting database. Acknowledging that different application areas will have different requirements, we find that a thirty-item test is probably adequate for many applications, with the best results given by the maximum discrimination item selection procedure. We estimate that the correlation between predicted and actual amnesia rates for new participants not involved in the calibration of the item characteristics is likely to be between 0.82 and 0.89 in this case. We further show that increasing the test beyond 30 items gives no substantial improvement, at least up to the 60-item sets we examined.

The selected 30 characters can be easily compiled into a quick (about 15 min) diagnostic test of handwriting literacy with an implementation following the dictation test format used in the calibration data (Fig. 1). In particular, the test will involve test-takers listening to a dictation prompt specifying a target character for the test-taker to write down. We note that in addition to measuring a test-taker’s character amnesia rate, such a test can also measure a test-taker’s handwriting fluency if handwriting times are recorded (see Wang et al., 2020, for one such implementation). In addition, the resultant handwritten images can also be used for penmanship evaluation (Xu et al., in prep).

We anticipate a range of uses for this test in both research and applied settings. First, in the absence of easily administered and validated assessments of character amnesia and other handwriting problems (e.g., disfluency or poor penmanship), the identification of factors that lead to a decline of handwriting literacy has been largely speculative in nature. Digital typing has been argued to be the main cause of handwriting problems (e.g., Almog, 2019), and there is indeed some empirical evidence that people with more digital exposure (e.g., using smartphones or computers) tend to perform worse in handwriting (Huang, Lin et al., 2021). However, it remains largely unclear exactly how digital typing damages people’s handwriting ability. It is possible that digital typing adversely affects handwriting simply because handwriting time is diminished. Alternatively, it is possible that specific aspects of digital typing are at work. For instance, phonology-based inputs such as Pinyin allow users to generate candidate characters without having to input any orthographic components, potentially leading to decay of the orthographic-motor representations that are critical for Chinese character handwriting. If this is the case, people who use orthography-based input methods such as Cangjie should be expected to be less susceptible to character amnesia (and indeed other handwriting difficulties). Addressing these issues will require broad assays of handwriting literacy over many individuals, and our short diagnostic test can be used for this purpose.

Second, our diagnostic test may have applications in prescreening Chinese-speaking children for developmental dysgraphia, since this manifests as a specific difficulty in acquiring handwriting skills (Berninger et al., 2008; Gosse & Van Reybroeck, 2020; Kandel et al., 2017; McCloskey & Rapp, 2017). These children have highly inflated character amnesia rates compared to normally-developing children (Cheng-Lai et al., 2013; Lam et al., 2011), and as such, could be expected to receive high amnesia estimates in an appropriately-calibrated test. The test would certainly need to be re-calibrated to find the item characteristics as experienced by a school-age population, and the sensitivity and specificity of the test would depend on the degree of variation in handwriting ability among normally-developing students (which may be somewhat larger than the undergraduate cohort studied here) and the degree of separation between the groups.

Another related application might be to prescreen elderly people for very early signs of neurodegenerative diseases such as Alzheimer’s disease or Parkinson’s disease. Alzheimer’s disease, for example, is initially associated with linguistic degradation symptoms such as semantic memory decay and word finding difficulty (Bartzokis, 2004). This means that people with early Alzheimer’s disease may also have increased character amnesia in handwriting. The mild phase of the disease is also associated with deterioration in fine motor control and coordination (Yan et al., 2008). A handwriting task may be well placed to serve as a test of both cognitive and motor deficits (Garre-Olmo et al., 2017), although as in the developmental dysgraphia case the practical performance of such a test would depend on the to-be-determined variability in the healthy group and the degree of contrast shown by the target group.

Of course, there are caveats to these applications. The most important limitation is that the test procedure needs to be calibrated for each particular population. The only reason it is possible to make inferences based on so few characters is that we have preexisting information about the character attributes, and these attributes will certainly differ across populations. Easy and difficult characters for university undergraduates are not the same as easy and difficult characters for primary school students. Test size, predictive performance, and even which of the subsetting procedures works best may vary by target population, since these all depend on the shape of the full distributions over participant abilities and item difficulties. However, estimating these quantities appears feasible from realistic amounts of calibration data. For the case we examine here, adequate calibration data was collected from a relatively small number of participants. While more variable populations may require more calibration data, the current results suggest that on the order of 40–50 calibration participants is a reasonable starting point for this method, with exact numbers to be assessed on a case-by-case basis.

It is worth noting that the interpretation of failures to handwrite may also differ across these target populations. The test given here estimates handwriting failures from all causes, whether in orthographic retrieval, attention, motor planning and execution, or other factors, and different factors may dominate in different cohorts. Although ‘character amnesia’ has been used as a convenient catch-all term in previous literature (e.g., Hilburger, 2016), and the overall rate of handwriting failures may in many cases be directly of interest, if the target of investigation is a specific sub-process, the test described here would need to be supplemented with additional construct-specific testing.

A second important caveat is the tendency of the model to over-estimate high amnesia rates when only given access to a selected subset of items. The amnesia response is a relatively rare outcome across all 1200 characters, so it is necessarily more common in the highly informative subsets than in general. Test-takers who produce amnesia responses are then unable to ‘dilute’ this response with subsequent successes as much as they otherwise might in a longer test. How serious a limitation this is depends on the application. For applications concerned mainly with rank-ordering, such as identifying students who may benefit from a particular writing exercise, or applications making between-groups comparisons on the same test within the same population, it is not a major problem, as ranks are reasonably well-preserved. It does, however, limit potential comparisons between tests calibrated for different populations, or absolute claims about amnesia rates in natural writing outside the test context. There are applications where such claims might be desirable, for example in cross-sectional analysis of age cohort effects; the test presented here does not support such analyses.

Future work could examine richer modelling of response types, moving from the binary amnesia/success outcomes used here to ternary amnesia/success/don’t know, or also distinguishing between different types of failure-to-write depending on whether many, some, or no elements of the target character were produced. Since the handwriting test is administered via a computer interface, it may also be desirable to use adaptive designs. Estimates of item difficulty should allow for adaptive staircase designs on character difficulty, with estimates of test-taker ability admitting a common language in terms of α to allow meaningful comparisons between people who saw different tests. We defer investigating these improvements to future work.

Conclusion

The causes of character amnesia are the subject of frequent speculation, often invoking the fact that people handwrite less and less due to the dominance of digital typing (Guangming Wang Public Opinion Center, 2019; Xu & Zhou, 2013). An understanding of handwriting failures is also part of the broader study of writing systems (Morin et al., 2020). The work presented above provides one part of the toolkit needed to answer such questions, by supplying a quick and easily administered test of character amnesia rate calibrated for adult native speakers of Mandarin. The example test developed here, calibrated for university undergraduates, demonstrates that quick and easy testing for character amnesia rates with reasonable reliability is feasible using calibration data from the target population on the order of 40 × 4 participation hours. The example test described above and scripts to select items using new calibration data are available under a Creative Commons Attribution-NonCommercial 4.0 International Public License at https://osf.io/nersu/?view_only=80343b61798d4eac8f9912de0b9c1ec9.