Executive function refers to the top-down cognitive control processes involved in initiating, maintaining, and flexibly updating goal-directed behaviors (Diamond, 2013). Executive function decline is a hallmark of cognitive aging (Lacreuse et al., 2020) and increases the risk of cognitive impairment and falls in older adults (Montero-Odasso & Speechley, 2018). Thus, fast, accessible, and convenient executive function tests that identify high-risk individuals are valuable for meeting the challenge of an increasingly aging population. The Wisconsin Card Sorting Test (WCST) and its variants (Barceló, 2003; Berg, 1948; Eling et al., 2008; Grant & Berg, 1948; Greve, 2001; Nelson, 1976; Nyhus & Barcelo, 2009) are popular tools for measuring executive function, or cognitive flexibility in particular. However, the WCST is a standard neuropsychological test usually administered by professionals, and its complex scoring procedure causes inconsistency in the literature (Miles et al., 2021). The present study thus evaluates the feasibility of an open-source, self-administered, short-version online card sorting task (OCST) with an automated scoring procedure using a sample of young and elderly Chinese participants.

WCST and its variants

The WCST originates from the German psychology of thinking and from early twentieth-century clinical practices of assessing prefrontal function (Eling et al., 2008). In 1948, Grant and Berg (1948) formalized the design of the “University of Wisconsin Card Sorting Test,” the predecessor of the well-known WCST. The WCST quickly became a prominent test of prefrontal lobe damage following Milner’s pioneering work (Milner, 1963). It was later refined and published as a standardized neuropsychological test with norms and an improved scoring procedure (Heaton et al., 1981; Heaton et al., 1993).

Card sorting enables quantitative evaluation of executive function (Milner, 1963). Typically, there are four stimulus cards and a pack of response cards. All cards show patterns composed of different forms (e.g., triangle, star, cross, circle), colors (e.g., red, green, yellow, blue), and numbers (1, 2, 3, 4). The pack of response cards is shuffled before the test begins. Participants are instructed to figure out the sorting rule and match each response card to one of the four stimulus cards. The sorting rule changes without warning after the participant makes ten consecutive correct choices, usually in a fixed order, such as color-form-number-color-form-number, unknown to participants. As the stimulus cards have orthogonal properties, participants’ choice history indicates whether they can quickly form and flexibly adjust their mindset.

The WCST has undergone continual modification and improvement. Milner (1963) used 128 cards in her pioneering work, but a shorter version (WCST-64) with 64 response cards might fit clinical settings better (Greve, 2001). Nelson (1976) developed the Modified Wisconsin Card Sorting Task (M-WCST), which removed all ambiguous response cards that shared more than one attribute with the stimulus cards. Barceló proposed an innovative design, the Madrid card sorting test (MCST), which combines the task-switching paradigm with the WCST, enabling both behavioral and neurophysiological recording (Barceló, 2003, 2021). A beta version of an online MCST is also available in Spanish and English. Open-source software has also accelerated the widespread usage of WCST-like card sorting tasks, such as the Berg Card Sorting Task provided by the Psychology Experiment Building Language (PEBL) (Fox et al., 2013; Piper et al., 2012). Lange and colleagues designed a self-administered computerized variant of the WCST (cWCST) characterized by unpredictable sorting rule changes and, like the MCST, removal of all ambiguous cards (Lange & Dewitte, 2019; Steinke et al., 2020; Steinke et al., 2021). An online, open-source WCST-like card sorting task is also available with the assistance of the powerful jsPsych library (de Leeuw, 2015; Vékony, 2022). However, as far as we know, few studies have validated an online, self-administered WCST-like card sorting task with a community sample.

Application of WCST

Card sorting is the most widely accepted task for assessing executive function deficits (Stuss & Benson, 1984). A survey of 747 North American psychologists revealed the WCST as one of the ten most frequently used neuropsychological assessment tools (Rabin et al., 2005). In a follow-up survey, the computerized version of the WCST ranked as one of the two most commonly used automated test tools (Rabin et al., 2014). A recent systematic review identified the WCST as one of the top five executive function assessment tools with the most validations for children and adolescents in low- and middle-income countries (Kusi-Mensah et al., 2022). The classical form of the WCST, originating from the Milner version (Milner, 1963), remains a well-known neurocognitive task frequently used by clinicians (Miles et al., 2021).

The WCST and its variants also rank as the seventh most frequently used neurocognitive tool for evaluating executive functions in aging (Faria et al., 2015). An age-related performance decline on WCST measures is expected and is supported by behavioral (Haaland et al., 1987; Lineweaver et al., 1999; Marquine et al., 2021; Perez-Enriquez et al., 2021) and neuroimaging studies (Esposito et al., 1999; Heckner et al., 2021). However, there is also evidence that the WCST was not sensitive to aging in a Taiwanese population (Shan et al., 2008). In addition to age, education level is another factor affecting WCST performance (Lineweaver et al., 1999; Marquine et al., 2021). Moreover, WCST performance deficits might not stem solely from deficits in cognitive flexibility but also from reduced working memory (Hartman et al., 2001; Lange et al., 2016).

Although the WCST is a popular tool for assessing cognitive flexibility (Uddin, 2021), its construct validity as a pure measure has long been questioned (Nyhus & Barcelo, 2009). Optimal WCST performance depends on multiple cognitive components, including set-shifting related to the frontoparietal network and rule inference related to the frontostriatal network (Lange et al., 2017). Recent brain imaging studies also imply that large-scale functional brain networks subserve the cognitive flexibility component measured by the WCST, questioning its anatomical specificity (Nomi et al., 2017). Some researchers have thus advocated refining the WCST to make the measure “pure,” or specific to a single cognitive process (Barceló, 2021; Nyhus & Barcelo, 2009). However, a refined WCST task such as the MCST resembles a task-switching paradigm, and its measures might not be comparable to the classical WCST measures (Miles et al., 2021).

Scoring of the card sorting task

The standardized WCST can provide up to 16 main outcome measures (Chiu & Lee, 2021), several of which are redundant because they are linear combinations of other measures. Seven major indices suffice to validate the latent structure of the WCST: total correct, perseverative responses, perseverative errors, non-perseverative errors, conceptual level responses, categories completed, and failure-to-maintain-set (Greve et al., 2005). Table 1 provides a brief explanation of the seven measures. Other indices, such as “trials-to-complete-the-first-category” and “learning to learn,” might not be suitable for factor analysis because many subjects fail to learn the task structure, leaving these indices at zero. Some novel indices have also been validated, such as “cognitive persistence” (Teubner-Rhodes et al., 2017).

Table 1 Description of the seven OCST indices

Perseverative errors and perseverative responses, which indicate the stubborn usage of outdated response rules, are widely used as indices of cognitive flexibility. Unfortunately, conceptual confusion and inconsistent scoring of perseveration are common in practice (Flashman et al., 1991; Miles et al., 2021). The seminal work of Flashman et al. (1991) spent about six pages elaborating nine rules for scoring perseveration. Three decades later, Miles and colleagues still needed a lengthy tutorial paper to resolve the inconsistency, and they recommended automated scoring to settle the issue (Miles et al., 2021). The critical concept in scoring perseveration is the perseverated-to principle, which refers to the repeatedly used incorrect sorting rule (Miles et al., 2021). Perseverative responses are responses that conform to the perseverated-to principle, revealing the subject’s cognitive flexibility in mindset shifting. Even though the idea of perseverative responses seems intuitive at first, their calculation remains inconsistent in the literature. The inconsistency stems from ambiguous trials, where the selected stimulus card shares two or more attributes with the current response card. Flashman and colleagues proposed the “sandwich rule” to deal with ambiguous trials, later formalized in the standardized WCST manual (Heaton et al., 1993): if an ambiguous response matches the perseverated-to principle and is preceded and followed by an unambiguous perseverative error, it is scored as a perseverative response (Flashman et al., 1991, p. 191). However, several exceptional cases exist in practice, such as successive ambiguous responses being “sandwiched,” which might confuse the scorer. Moreover, a perseverative response can be correct, as it might overlap with the right attribute (Miles et al., 2021). The scoring procedure is tedious and error-prone without training. Figure 1A shows a typical high-performance profile that is easy to score, whereas Fig. 1B shows a typical low-performance profile with substantial scoring ambiguity that challenges an inexperienced scorer. An automated scoring procedure is thus necessary to standardize clinical and research practices (Miles et al., 2021).

Fig. 1

Two typical scoring cases. A High-performance response profile. B Low-performance response profile that is hard for an inexperienced scorer. Note. The x-axis depicts the 64 trials sequentially. The y-axis indicates the four dimensions: C (Color), F (Form), N (Number), and O (Other). A colored tile indicates a match between the response card and the chosen stimulus card on each trial. O refers to a choice that did not match any of the C, F, or N dimensions. The WCST is characterized by ambiguous trials, where the chosen stimulus card might match the response card on more than one dimension. As shown in Fig. 1A, the examinee chose the stimulus card matching the form and number dimensions on the first trial, which was incorrect, as indicated by the red color. On the second trial, the examinee chose the stimulus card matching the number dimension, which was also wrong. From the third trial, the examinee grasped the correct rule (color) and completed ten consecutive correct choices, indicated by the yellow borderline. On trial 13, the correct rule shifted to form, and the examinee made a perseverative error (indicated by a white “P”). Figure 1B presents a hard-to-score scenario involving almost every specific scoring principle. For example, trials 60–63 are perseverative responses or perseverative errors because they were enclosed by unambiguous perseverative errors (the sandwich rule). Please refer to our Method section for a detailed description of the scoring method. The visualization method that produced Fig. 1 supports informative visual checks and clinical diagnosis and is publicly available in our GitHub repository (see Code Availability section)
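For readers who wish to reproduce this tile-style visualization, the following minimal R sketch using ggplot2 illustrates the core idea. The data frame layout and function name are illustrative only; our released script on GitHub additionally draws the category borderlines and perseveration markers shown in Fig. 1.

library(ggplot2)

# Minimal sketch of the trial-tile plot in Fig. 1 (illustrative only).
# d has one row per (trial, matched dimension) pair:
#   trial   - trial number (1-64)
#   dim     - matched dimension, factor with levels C, F, N, O
#   correct - logical, whether the trial was scored correct
plot_profile <- function(d) {
  ggplot(d, aes(x = trial, y = dim, fill = correct)) +
    geom_tile(color = "white") +
    scale_fill_manual(values = c(`TRUE` = "gold", `FALSE` = "red")) +
    scale_x_continuous(breaks = seq(0, 64, by = 8)) +
    labs(x = "Trial", y = "Dimension", fill = "Correct") +
    theme_minimal()
}

# Toy example mirroring the first three trials described in the caption
toy <- data.frame(
  trial   = c(1, 1, 2, 3),
  dim     = factor(c("F", "N", "N", "C"), levels = c("C", "F", "N", "O")),
  correct = c(FALSE, FALSE, FALSE, TRUE)
)
plot_profile(toy)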

Reliability of card sorting task measures

Reliability is fundamental in clinical settings and individual difference studies. First, the reliability estimate for a specific measure informs practitioners about the precision of the test scores. Second, low-reliability measurements reduce the statistical power to detect potential associations in individual difference studies (Hedge et al., 2018). However, the reliability of the WCST and its variants has received disproportionately little research attention given their widespread usage (Kopp et al., 2021).

A validation study with a sample of 229 healthy community-dwelling older adults indicates that the test–retest reliability estimates of measures from the M-WCST proposed by Nelson (1976) are only modest (ranging from .46 to .64) (Lineweaver et al., 1999). Even after one year, there was a practice effect on the measure of non-perseverative errors (Lineweaver et al., 1999). However, Kopp and colleagues report that the M-WCST manifests desirable reliability estimates (> 0.9) using the split-half estimation method in a sample of neurological inpatients (n = 146) (Kopp et al., 2021). Chiu and Lee (2021) investigated the test–retest reliability of the classical WCST using a schizophrenia sample in Taiwan (n = 63) with a two-week interval and found that most WCST measures were acceptable except for non-perseverative errors and failure-to-maintain-set (Chiu & Lee, 2021). A recent study using a sample of healthy Argentinian adults (18–89 years old, n = 235) indicates that the classical manual version of the WCST has good reliability as indexed by Cronbach’s alpha coefficient (Miranda et al., 2020).

The inconsistency among this limited number of studies has two implications. First, many cognitive tasks, including the WCST and its variants, might suffer from practice or learning effects, so test–retest reliability might not be applicable in practice. Split-half reliability, a convenient estimate of internal consistency among different trials, is better suited to estimating the reliability of cognitive tasks (Parsons et al., 2019; Pronk et al., 2022; Steinke & Kopp, 2020). Second, reliability estimates depend on the specific measures and samples used and cannot be generalized across task variants, indices, populations, and test scenarios. Thus, more research is needed to validate the usage of the WCST and its variants in research and practice.

The present study

The classical form of the WCST is widely used in clinical and research practice. In addition to the commercial, standardized WCST (Heaton et al., 1981; Heaton et al., 1993), open-source solutions are also widely used (Fox et al., 2013; Vékony, 2022). However, three issues remain unresolved. First, the scoring procedure of WCST-like tasks is inconsistent, which makes comparisons among studies problematic (Flashman et al., 1991; Miles et al., 2021). Second, the reliability of WCST-like tasks has received insufficient research attention. Third, the use of online WCST-like tasks in cognitive aging studies remains to be validated. The present study operationalizes the expert consensus on WCST scoring (Flashman et al., 1991; Miles et al., 2021) into an automated scoring and visualization procedure using the R language. In addition, we investigate the split-half reliability of an online WCST-like card sorting task (OCST) (Vékony, 2022) with a large community sample of young and elderly participants. Our results suggest that most OCST measures manifest acceptable reliability and are sensitive to the age difference at the group level.

Method

Sample

We recruited, through advertisement, 256 young (18–45 years) and older adults (55–81 years) from a community medical examination center in a county of Hefei City, Anhui Province, China, when they attended their regular annual health screening. All participants had normal or corrected-to-normal vision. Six participants did not provide age, gender, or education information, and 30 participants failed to complete the whole test; both groups were excluded from the analysis, leaving a final sample of 220 participants. The final young group included 65 females and 42 males (age, M = 30.1 years, SD = 5.5 years), and the old group included 53 females and 60 males (age, M = 64.0 years, SD = 6.7 years). The study was approved by the ethics committee of the Hefei Institutes of Physical Science and was conducted following the Declaration of Helsinki. All participants gave written informed consent and received monetary compensation.

Procedure

Two nurses from the community medical examination center screened the volunteers. Eligible participants were taken to one of two testing rooms, where a nurse registered participant information, introduced the task instructions, and familiarized the participants with the keyboard operation. Participants completed a short practice first and then completed the formal test by themselves. Some of them also completed a two-step decision-making task and an attention network task, which belonged to a parallel study. After the online test, the nurse measured the participants’ working memory using the digit-span test (Wechsler, 1997). The data for estimating split-half reliability are publicly available (see Data Availability Statement).

Task design

The OCST (Vékony, 2022) follows the design of the short-form Berg Card Sorting Task (Fox et al., 2013; Piper et al., 2012), which uses 64 response cards, as does the standardized WCST-64. We translated the task instructions into simplified Chinese and added a short practice to help the older participants, who might be unfamiliar with keyboard operation. The task was deployed on Tencent Cloud using the Python Django web framework. A typical trial began with a screen showing four stimulus cards in the top row and a response card in the middle of the bottom row. The four stimulus cards showed one red triangle, two green stars, three yellow diamonds, and four blue circles. The response card was drawn from 64 cards (4 colors × 4 forms × 4 numbers) without replacement. Participants decided which of the four stimulus cards the response card belonged to by pressing the corresponding key and received feedback on whether they were correct. After a participant made ten consecutive correct choices, a new sorting rule (category) began. There was a maximum of six categories, in the order color-form-number-color-form-number, unknown to the participants.
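As an illustration of this task structure, the sketch below expresses the deck and rule sequence in R. This is our reconstruction for exposition only; the deployed OCST itself runs as a jsPsych-based web task, and the variable names here are our own.

# The 64-card deck: every color x form x number combination exactly once,
# presented in shuffled order without replacement
colors  <- c("red", "green", "yellow", "blue")
forms   <- c("triangle", "star", "diamond", "circle")
numbers <- 1:4
deck    <- expand.grid(color = colors, form = forms, number = numbers)
deck    <- deck[sample(nrow(deck)), ]

# Rules change after ten consecutive correct choices, in a fixed order
# unknown to the participant, for a maximum of six categories
rule_order <- c("color", "form", "number", "color", "form", "number")
criterion  <- 10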

Scoring procedure

We calculated the seven measures listed in Table 1. All measures were straightforward except for perseverative responses and perseverative errors. We distilled the recommendations in the literature (Flashman et al., 1991; Miles et al., 2021) into a logical flow and operationalized it as an R script publicly available on GitHub (see Code Availability Statement). Figure 2 elaborates the procedure and scoring principles of our scoring script. For each scoring principle, Supplementary Table 1 gives the source information in quoted and cited form. We strongly recommend Flashman et al. (1991) and Miles et al. (2021) to interested readers. Please note that perseverative errors are a subset of perseverative responses (Miles et al., 2021): a perseverative response can be correct because of the large portion of ambiguous trials in the WCST, and once perseverative responses are identified, the incorrect ones among them are the perseverative errors. The scoring procedure is a two-step decision. First, the perseverated-to principle is determined for each trial according to the first-error rule, new-category rule, and sequential-error rule. Then, an unambiguous error is classified as a perseverative error if the choice adheres to the perseverated-to principle of that trial (unambiguous-perseverated-to rule). For ambiguous trials enclosed by perseverative errors, the sandwich rule judges whether each is a perseverative error (or perseverative response).

Fig. 2

A schematic diagram of the automated scoring logic for perseverative responses. Note. The first incorrect unambiguous choice dimension is considered the perseverated-to principle (first-error rule). After a category criterion is achieved (ten consecutive correct choices), the old rule becomes the perseverated-to principle (new-category rule). If the participant makes three sequential unambiguous incorrect choices on a particular dimension (ambiguous choices interleaved among the three trials do not break the continuity), this dimension becomes the perseverated-to principle from the second unambiguous position of the sequence (sequential-error rule). Trials that are unambiguous and cohere with the trial’s perseverated-to principle are tagged as “perseverative responses,” which are also “perseverative errors” (unambiguous-perseverated-to rule). Ambiguous responses that match the perseverated-to principle and are enclosed by unambiguous perseverative responses are considered perseverative responses (sandwich rule). Please note that perseverative errors are a subset of perseverative responses: perseverative errors are incorrect perseverative responses
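To make the two-step decision concrete, the following simplified R sketch implements the first-error, new-category, and unambiguous-perseverated-to rules. The sequential-error and sandwich rules are omitted here for brevity (the full implementation is in our GitHub repository), and the column names below are illustrative.

score_perseveration <- function(trials) {
  # trials: data frame with one row per trial and columns
  #   matched  - list column of matched dimensions, e.g., c("color", "form")
  #   correct  - logical, whether the choice was correct
  #   new_cat  - logical, TRUE on the first trial after a completed category
  #   old_rule - rule of the just-completed category (NA otherwise)
  pts <- NA_character_          # current perseverated-to principle
  pr  <- logical(nrow(trials))  # perseverative-response flags
  for (i in seq_len(nrow(trials))) {
    m <- trials$matched[[i]]
    unambiguous <- length(m) == 1
    # New-category rule: the old rule becomes the perseverated-to principle
    if (trials$new_cat[i]) pts <- trials$old_rule[i]
    # First-error rule: the first unambiguous error sets the principle
    if (is.na(pts) && unambiguous && !trials$correct[i]) {
      pts <- m
      next  # the principle-setting error itself is not perseverative
    }
    # Unambiguous-perseverated-to rule
    if (!is.na(pts) && unambiguous && m == pts) pr[i] <- TRUE
  }
  pr  # perseverative errors are then pr & !trials$correct
}

# Toy example: an unambiguous color error sets the principle on trial 1,
# so the repeated color choice on trial 2 is a perseverative response
toy <- data.frame(correct = c(FALSE, FALSE, TRUE),
                  new_cat = FALSE, old_rule = NA_character_)
toy$matched <- list("color", "color", "form")
score_perseveration(toy)  # FALSE TRUE FALSE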

Split-half reliability calculation

Evaluating split-half reliability involves splitting the trials into two halves, scoring each half for each subject, and calculating the Pearson correlation between the two half scores. The Spearman–Brown formula is then used to correct for the underestimation due to using only half of the trials (Parsons et al., 2019). However, splitting a trial sequence into two halves is not straightforward. Although first-second and odd-even splitting are common practice in evaluating questionnaires, recent simulation studies recommend split methods based on random permutation (Parsons et al., 2019), sampling (Steinke & Kopp, 2020), or Monte Carlo simulation (Pronk et al., 2022).
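Concretely, the Spearman–Brown correction for a half-length split is r_full = 2r / (1 + r), where r is the Pearson correlation between the two halves; in R:

# Spearman-Brown correction: project a half-length correlation r
# onto the reliability of the full-length test
spearman_brown <- function(r) 2 * r / (1 + r)
spearman_brown(0.70)  # a half-test correlation of .70 projects to about .82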

We conducted the split-half reliability estimation with splithalfr (Pronk et al., 2022) in R version 4.1.3 (R Core Team, 2022) using four split methods: first-second half split, odd-even trial split, permuted split, and Monte Carlo split. For the permuted and Monte Carlo splits, we performed 5,000 resamplings. For the first-second, odd-even, and permuted splits, reliability was the (mean) Pearson correlation with the Spearman–Brown correction applied. As the Monte Carlo method constructs a full-length trial sequence for each half, its reliability was the median Pearson correlation without correction. Because card sorting trials are not independent, it is unreasonable to split trials first and then calculate task indices for each half. We therefore first tagged each trial using our scoring procedure: each trial was labeled as correct, a perseverative response, a perseverative error, a non-perseverative error, a conceptual level response, or a failure-to-maintain-set. We also assigned a value of 1/10 to each trial belonging to an achieved category. This method, inspired by Kopp et al. (2021), enables split-half estimation of all seven measures. The reliability analysis script is publicly available (see Code Availability Statement).
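The following hand-rolled sketch illustrates the permuted-split logic for one trial-level tag. It mimics, but does not use, the splithalfr interface, and the function and variable names are our own.

# Permuted split-half reliability for one tag (e.g., 0/1 perseverative-
# response flags, or 1/10 values for trials in an achieved category).
# tags: participants x trials matrix of per-trial tag values
permuted_split_reliability <- function(tags, n_splits = 5000) {
  n_trials <- ncol(tags)
  rs <- replicate(n_splits, {
    half <- sample(n_trials, n_trials %/% 2)  # random half of the trials
    cor(rowSums(tags[, half]), rowSums(tags[, -half]))
  })
  r <- mean(rs)
  2 * r / (1 + r)  # Spearman-Brown correction
}

# Toy data: 220 participants x 64 trials of 0/1 flags
set.seed(1)
toy_tags <- matrix(rbinom(220 * 64, 1, 0.2), nrow = 220)
permuted_split_reliability(toy_tags, n_splits = 500)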

Statistical analysis

The intercorrelations among the seven measures were evaluated using Spearman’s rank correlation coefficients and visualized using the corrplot package. We calculated the effect size of age group for each measure using Cohen’s d and simulated the effect size distribution by bootstrap resampling (n = 5,000). We then visualized the results with Gardner–Altman estimation plots using the dabestr package in R (Ho et al., 2019). To examine whether card sorting performance declined continuously with age, we performed multiple regressions with gender, education years, and digit span scores as covariates. As a linear trend might be driven by the group difference between the young and the elderly, we performed separate regression analyses for the elderly and young groups. Permutation tests of the linear regression models were conducted using the permuco package (Frossard & Renaud, 2021). The Bonferroni correction was applied for the repeated analysis of the seven measures; specifically, all p values were multiplied by seven to control the Type I error rate. The alpha threshold was .05 (two-tailed) for hypothesis testing. All statistical analyses were conducted in R version 4.1.3 (R Core Team, 2022), and the analysis script is publicly available.
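As a sketch of this pipeline (with toy data and illustrative variable names; the reported analyses used dabestr for the estimation plots and permuco for the permutation tests):

library(permuco)

# Cohen's d with pooled standard deviation
cohens_d <- function(x, y) {
  sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
               (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / sp
}

# Toy scores shaped like categories completed in the two groups
set.seed(2)
old_scores   <- rnorm(113, mean = 2.1, sd = 1.4)
young_scores <- rnorm(107, mean = 3.5, sd = 1.5)

# Bootstrap distribution of the group effect size (5,000 resamples)
boot_d <- replicate(5000, cohens_d(sample(old_scores, replace = TRUE),
                                   sample(young_scores, replace = TRUE)))
quantile(boot_d, c(.025, .975))  # percentile 95% CI

# Permutation test of a continuous age effect with covariates;
# Bonferroni: multiply each p value by seven (the number of measures)
dat <- data.frame(score      = c(old_scores, young_scores),
                  age        = c(runif(113, 55, 81), runif(107, 18, 45)),
                  gender     = factor(sample(c("F", "M"), 220, replace = TRUE)),
                  edu_years  = sample(6:19, 220, replace = TRUE),
                  digit_span = sample(4:12, 220, replace = TRUE))
fit <- lmperm(score ~ age + gender + edu_years + digit_span,
              data = dat, np = 5000)
summary(fit)  # permutation p values for each coefficient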

Results

Descriptive statistics

Table 2 provides descriptive statistics on the raw scores of the seven measures for elderly females (n = 53), elderly males (n = 60), young females (n = 65), and young males (n = 42). The intercorrelations among these measures were high except for failure-to-maintain-set, and the intercorrelation matrix formed three clusters (Fig. 3). The first cluster comprised total correct, categories completed, and conceptual level responses, with intercorrelations larger than .85. Another cluster comprised perseverative responses, perseverative errors, and non-perseverative errors, with intercorrelations larger than .49. Failure-to-maintain-set was largely independent of the other measures, except for a small-to-moderate correlation with categories completed.
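A minimal sketch of this correlation analysis follows (toy data; scores stands in for the participants-by-measures table behind Fig. 3):

library(corrplot)

# Toy stand-in for the 220-participant x 7-measure score table
set.seed(3)
scores <- as.data.frame(matrix(rnorm(220 * 7), ncol = 7))
names(scores) <- c("TC", "PR", "PE", "NPE", "CLR", "CAT", "FMS")

# Spearman rank intercorrelations, visualized as in Fig. 3
M <- cor(scores, method = "spearman")
corrplot(M, method = "color", type = "upper", addCoef.col = "black")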

Table 2 Means and standard deviations for age and core task indices in old and young groups
Fig. 3

The intercorrelations among the seven card sorting measures. Note. TC: Total Correct; PR: Perseverative Response; PE: Perseverative Errors; NPE: Non-perseverative Errors; CLR: Conceptual Level Responses; CAT: Categories Completed; FMS: Failure-to-Maintain-Set

Split-half reliability estimates

Table 3 describes the reliability estimates for the seven indices using the first-second, odd-even, permuted, and Monte Carlo split methods. In general, the first-second split method tended to underestimate, and the odd-even method tended to overestimate, the reliability. The permuted and Monte Carlo methods yielded comparable estimates except for failure-to-maintain-set.

Table 3 Split-half reliability estimates of the seven OCST measures in old and young groups

Figure 4 elaborates the reliability difference between the young and elderly groups. Although there were slight differences, the reliability estimates for the two groups were comparable. Categories completed, conceptual level responses, and total correct were the three measures with the highest reliability: categories completed was clearly above .90, followed by conceptual level responses and total correct at around .90. These three measures fell into a range desirable for a clinical assessment tool. Reliability estimates for perseverative responses, perseverative errors, and non-perseverative errors were around .80, acceptable for individual difference studies. However, failure-to-maintain-set failed to manifest a reasonable reliability estimate.

Fig. 4

Monte Carlo reliability estimates for the seven indices of OCST. Note: TC: Total Correct; PR: Perseverative Response; PE: Perseverative Errors; NPE: Non-perseverative Errors; CLR: Conceptual Level Responses; CAT: Categories Completed; FMS: Failure-to-Maintain-Set

Age effect

Figure 5 illustrates the differences between the young and elderly groups. Elderly participants completed about two categories (M = 2.1, SD = 1.4), fewer than young participants (M = 3.5, SD = 1.5) (categories completed: Cohen’s d = –.97, 95% CI [–1.29, –.67]). Consistently, they also made fewer correct responses (M = 40.3, SD = 10.1) than the young participants (M = 49.6, SD = 8.9) (total correct: Cohen’s d = –.97, 95% CI [–1.27, –.67]). Furthermore, conceptual level responses were lower in the elderly group (M = 24.0, SD = 11.6) than in the young group (M = 35.1, SD = 11.3) (Cohen’s d = –.96, 95% CI [–1.27, –.65]).

Fig. 5

Gardner–Altman estimation plot depicting the age group difference on the seven measures. A Total Correct (TC); B Perseverative Response (PR); C Perseverative Errors (PE); D Non-perseverative Errors (NPE); E Conceptual Level Responses (CLR); F Categories Completed (CAT); G Failure-to-Maintain-Set (FMS)

Elderly participants (M = 12.6, SD = 6.8) manifested more perseverative responses than young participants (M = 7.7, SD = 5.6) (Cohen’s d = .78, 95% CI [.46, 1.08]). The elderly (M = 11.4, SD = 5.8) also committed more perseverative errors than the young group (M = 7.2, SD = 4.6) (Cohen’s d = .80, 95% CI [.48, 1.09]). On non-perseverative errors, the elderly group (M = 12.3, SD = 6.3) performed worse than the young group (M = 7.3, SD = 5.5) (Cohen’s d = .85, 95% CI [.54, 1.14]). All participants made few failure-to-maintain-set errors (range: 0 to 5); half of the young participants made zero errors, and half of the elderly made less than one error. The young group (M = .5, SD = .7) made fewer failure-to-maintain-set errors than the elderly group (M = 1.0, SD = 1.1), with a medium effect size (Cohen’s d = .46, 95% CI [.20, .69]).

We also examined whether the seven measures manifested a continuous age-related decline within the elderly and young groups separately, with education years, gender, and digit span score as covariates. None of the continuous age effects reached significance after Bonferroni correction (all adjusted p > .05). In addition, our results did not yield any effect of gender (all adjusted p > .05). For the young group, there was a positive association between education years and categories completed (adjusted p = .021); however, the effect of education years did not reach significance in the elderly group (all adjusted p > .05).

Discussion

The WCST and its variants have become popular clinical and research tools for assessing executive function. The classical form of the WCST contains many ambiguous trials, making the scoring of perseverative responses challenging. Although significant progress has been made in standardizing the scoring procedure (Flashman et al., 1991; Heaton et al., 1981; Heaton et al., 1993), great controversy and inconsistency remain (Miles et al., 2021). The present study contributes an automated scoring and informative visualization procedure, freely available as R scripts, which can benefit clinical psychologists and researchers. We also report split-half reliability estimates for seven frequently used WCST measures for a publicly available online card sorting task (OCST) in a young and elderly community sample. Our results suggest that most WCST measures manifest acceptable reliability and are sensitive to the age difference at the group level.

The automated scoring and visualization tool

The standardized WCST can provide up to sixteen measures (Chiu & Lee, 2021). Most are straightforward except for perseverative responses and perseverative errors (Flashman et al., 1991; Miles et al., 2021). The challenge mainly arises from ambiguous trials, where the response card shares more than one dimension with the chosen stimulus card. There are two solutions to this issue. The first is to remove all ambiguous response cards, as in the M-WCST (Nelson, 1976), MCST (Barceló, 2003), and cWCST (Steinke et al., 2021); however, these modified versions might not be comparable to the classical form of the WCST. The second is to provide systematic tutorials, as done by Flashman et al. (1991) and Miles et al. (2021); however, it is still hard to master all the scoring principles. An automated scoring tool is thus necessary to resolve the scoring inconsistency of WCST-like tasks (Miles et al., 2021). The present study provides an open-source, transparent scoring procedure that strictly coheres to the expert consensus. The procedure facilitates the scoring of the OCST (Vékony, 2022), a publicly available card sorting task that follows the typical design of the WCST, and it can be easily modified if a custom task script is used. Furthermore, we contribute an informative visualization tool that facilitates clinical diagnosis, scoring checks, and self-education on the scoring principles. The automated scoring and visualization method is valuable for neuropsychological services in developing countries, where trained professionals are scarce.

Reliability of card sorting measures

The Monte Carlo results indicate that the reliability estimates of categories completed, conceptual level responses, and total correct are suitable for clinical diagnostic use (rel > 0.9), and that perseverative responses, non-perseverative errors, and perseverative errors are acceptable for research use (rel > 0.8). The exception is failure-to-maintain-set, whose reliability estimate is around 0.5, mainly due to the very few errors of this type. Our estimates are generally superior to those of the standardized M-WCST reported in its manual, which quantifies the five-year test–retest reliability as 0.55 (Schretlen, 2010). However, as Schretlen (2010) estimated test–retest reliability with a different task version, readers should be cautious about comparing our findings directly with that report. Our reliability estimates are lower than those of the cWCST in Steinke et al. (2021), in which all measures achieved good reliability (rel > 0.9). However, there are several fundamental differences between Steinke et al. (2021) and our study. First, the OCST in our study was a 64-trial short version, while the cWCST had about 168 trials; longer tasks generally yield higher reliability. Second, the OCST has ambiguous response cards; the exclusion of ambiguous trials in the cWCST reduces task difficulty, which might promote a consistent strategy in the task. Third, the cWCST reported only three error measures and three response time measures, while we report seven indices widely used in the WCST literature. Despite this, our reliability estimates are valuable because the OCST closely follows the task form and scoring of the WCST-64, which remains a leading instrument in clinical settings (Miles et al., 2021).

Method of splitting

The present study reports split-half reliability estimates using four splitting methods. For most indices, first-second splitting provided the lowest, and odd-even splitting the highest, reliability estimates. This pattern is consistent with Kopp and colleagues’ recent study of the split-half reliability of the M-WCST in a clinical sample (Kopp et al., 2021). Their study systematically evaluated the impact of trial grain size on reliability estimates, ranging from odd-even splitting (grain size = 1) to the first-second approach (grain size = half of the trial length), and found a decreasing trend as the trial grain size increased. The phenomenon might stem from confounding learning effects or strong dependence between trials (Pronk et al., 2022). Thus, odd-even splitting might overestimate, while first-second splitting might underestimate, the reliability of the WCST and its variants.

Splitting by random permutation or Monte Carlo sampling might be the optimal choice. First, random sampling averages out the bias due to arbitrary trial grain sizes. Second, the sampling approach provides both point estimates and a confidence interval indicating their precision. Using random permutation splitting, Kopp and colleagues report encouraging reliability estimates (rel > 0.9) for the M-WCST in a clinical setting (Kopp et al., 2021) and for the cWCST in a young sample in a lab setting (Steinke et al., 2021). Our results yielded comparable reliability estimates between the random permutation and Monte Carlo sampling methods except for failure-to-maintain-set, for which the Monte Carlo method yielded more reasonable estimates. As mentioned in the Results section, about half of the young participants committed zero failure-to-maintain-set errors, and half of the old participants committed less than one such error. Consequently, in many cases, all failure-to-maintain-set errors fell into only one of the two halves. The Monte Carlo method avoided this problem because it simulated a complete trial record for each half.

Sensitivity to cognitive aging

To deal with a world full of noise, people need to filter interference, make forward-looking plans, inhibit useless or harmful behaviors, and change their mindset flexibly when the environment changes; at the core of these abilities is executive function. Despite its irreplaceability in the human cognitive architecture, executive function is fragile and declines with age (Lacreuse et al., 2020). The WCST and its variants are popular neuropsychological assessment tools for prefrontal or executive function. A recent meta-analysis suggests an association between prefrontal integrity and executive function; moreover, compared with other executive function measures, the WCST indices correlate more robustly with prefrontal volume (Yuan & Raz, 2014).

The WCST and its variants are widely used in cognitive aging and normative studies (Esposito et al., 1999; Faria et al., 2015; Hartman et al., 2001; Heckner et al., 2021; Lineweaver et al., 1999; Marquine et al., 2021; Perez-Enriquez et al., 2021; Sanchez-Rodriguez et al., 2022; Shan et al., 2008). Most of those studies support age differences in WCST measures, consistent with our findings: on all seven major WCST indices, the old group (55–81 years) showed noticeable deterioration compared with the young group (18–45 years). To test whether executive function declined continuously with age within the old group, we also performed regression analyses controlling for gender, education years, and digit span score. However, our results did not yield any significant linear aging trend on the seven measures. A study using Taiwanese samples also reported age group differences but no continuous linear decline with age (Shan et al., 2008). A recent study of a large Attention Network Test dataset revealed a complex aging pattern in which aging is accompanied by both improvements and declines (Verissimo et al., 2021). Thus, executive function might depend non-linearly on age. To clarify this issue, a larger sample with hierarchical sampling would benefit both linear and non-linear analyses of the aging effect.

The collinearity among OCST measures

The standardized WCST can provide up to 16 different indices, but many are linear combinations of other indices; for example, total errors equals perseverative errors plus non-perseverative errors. Greve et al. (2005) conducted the first large-scale confirmatory factor analysis of the WCST (128-card version), adopting seven major indices: total correct, perseverative errors, perseverative responses, non-perseverative errors, conceptual level responses, categories completed, and failure-to-maintain-set. Perseverative errors were removed from the final model due to collinearity. Although the authors suggested a three-factor solution, the model fit was unsatisfactory, indicating that the model might not reflect the data structure (Greve et al., 2005). Our correlation analysis reveals a serious collinearity problem among the seven measures used by Greve et al. (2005), with ten pairwise correlation coefficients larger than 0.8. The exception is failure-to-maintain-set, which showed only a moderate correlation with categories completed; unfortunately, its reliability was unsatisfactory in our study. The collinearity issue calls into question the necessity of reporting many index scores in practice and research. As the current study used a 64-card, self-administered version in a Chinese community sample, whether the collinearity issue applies to other versions or populations should be checked in future studies. Moreover, the latent factor structure, and hence the construct validity, of the WCST and its variants calls for additional research attention.

Usage of OCST in online cognitive aging studies

Global population aging is accelerating. Fast identification of individuals at risk of abnormal aging can help communities and families take quick action to mitigate negative consequences such as neurodegenerative diseases and falls. Worldwide epidemics such as COVID-19 also underscore the urgency of developing digital neuropsychology to provide accessible online neuropsychological test services (Germine et al., 2019; Steinke et al., 2021). Steinke et al. (2021) provided the first valuable study evaluating the split-half reliability of a self-administered, computerized version of the WCST (cWCST) in the lab with young volunteers. However, as far as we know, the present study is the first to explore an online WCST-like task (OCST) in the community with both young and old volunteers. Unlike the cWCST, the OCST follows the original design and improved scoring scheme of the standardized WCST-64. Thus, our results provide valuable information to researchers and practitioners planning to use the classical WCST in community-based research.

Limitations and future directions

The present study’s automated scoring, informative visualization, and split-half reliability estimation procedures can significantly benefit clinicians and researchers. Furthermore, we provide reliability estimates for a publicly available card sorting task in a community-dwelling young and elderly sample. However, the sample size of the present study is insufficient for establishing reliable norms. In addition, the validity of the card sorting task as a fast cognitive screening tool should be justified against measures of everyday function. Last but not least, it would be worthwhile to directly compare different WCST versions in community samples in the future.

Conclusions

Executive function decline is a hallmark of cognitive aging, which increases the risk of cognitive impairment and falls in older adults. Fast executive function tests to identify high-risk individuals are necessary for community care services. Card sorting tasks have been widely used as measures of executive function. The present study investigated the usability of an open-source, self-administered, short-version online card sorting task with a sample of young and old Chinese adults. We developed an automated scoring procedure following recent recommendations on scoring perseverative responses to make the results comparable to the standardized WCST. Reliability estimates of commonly used measures were calculated using the split-half method. The reliabilities of all task indices were reasonably good except for failure-to-maintain-set. The R scripts for automated scoring and reliability estimation are publicly available.