Detecting invalid data is a vital part of conducting research. In particular, psychological research can be vulnerable to invalid data in that participants’ responses may not always reflect “true” values (Curran, 2016). Common threats to validity include participants misunderstanding instructions, misrepresenting themselves, or responding carelessly or with insufficient effort (Curran, 2016; Groves et al., 2009; Johnson, 2005; Krosnick, 1999). In practice, many participants show evidence of careless/insufficient effort (C/IE) responding; typical estimates are around 10% of respondents (Maniaci & Rogge, 2014; Meade & Craig, 2012), though some report much higher rates (46%; Brühlmann et al., 2020). This is concerning because even relatively small percentages of C/IE respondents (~5–10%) can cause measurable losses in data quality or even alter the conclusions drawn from a study (Arias et al., 2020; Credé, 2010; Huang et al., 2015; Wood et al., 2017; Woods, 2006).

Text responses appear to be no exception. In a study analysing ~7000 text responses to a survey, 8% of these responses were manually labelled as irrelevant or miscellaneous text and removed from further analysis (Cunningham & Wells, 2017). In another sample of over 600,000 text responses, 28% were manually identified as either meaningless or meaningful but not useful (Etz et al., 2018). As such, invalid data detection has been recognized as important when analysing open-ended or free text responses (Etz et al., 2018; Müller et al., 2014), especially for use in computational text analysis (Banks et al., 2018). To date, many methods for detecting invalid responses have been proposed (for reviews, see Curran, 2016; Meade & Craig, 2012). However, only some of these data quality indicators are applicable to text responses, including response speed or completion time (Curran, 2016; Huang et al., 2012; Leiner, 2019; Maniaci & Rogge, 2014), character or word count (Chen, 2011; Etz et al., 2018), and manual coding (Kennedy et al., 2020). In the present study, we propose an alternative method of detecting invalid text responses using supervised machine learning (ML). We first review past methods of detecting invalid responses and then provide an overview of our current approach.

Time-based data quality indicators

Past studies have considered response speed or completion time to be an indicator of response validity. The rationale is that participants who answer exceptionally quickly are unlikely to have read the instructions or given substantial thought to an item. Though extreme response speeds on either end of the spectrum (i.e., fast or slow) are potential signals of C/IE responding, only very fast respondents are typically excluded (Huang et al., 2012; Wood et al., 2017).

Some researchers have used static cutoffs for completion time, whereby invalid responses were defined as those submitted in under two seconds (Kaczmirek et al., 2017; Wise & Kong, 2005). Expanding upon this strategy, Huang et al. (2012) defined invalid respondents as those who completed the study in less than 2 s per item (or less than 52 s per page), whereas Niessen et al. (2016) used 3 s per item. However, these static cutoffs have largely been designed with closed-ended responses (e.g., Likert-type scales) in mind, rather than open-ended or text responses. Relative cutoffs for completion time might be more applicable to text responses, since they base the valid/invalid classification on one’s own data. For example, Maniaci and Rogge (2014) proposed labelling responses as invalid if they are completed in less than half the mean completion time, and these authors emphasized recalibrating this cutoff for one’s own studies. Leiner (2019) suggested a comparable approach, which involves computing speed factors (median completion time divided by the participant’s completion time) for the total study and/or the pages containing the responses of interest. Similar to Maniaci and Rogge (2014), participants completing pages more than twice as quickly as the median (speed factor > 2) are flagged as suspicious under this cutoff (Leiner, 2019).

Text-based data quality indicators

Text response length has also been thought to signal invalid responding in that very short answers are unlikely to contain enough content to constitute a meaningful response to the item. Further, common types of invalid text responses are typically short (e.g., “don’t know” responses; Behr et al., 2014; Holland & Christian, 2009; Kennedy et al., 2020; Scholz & Zuell, 2012). Despite general agreement that low character count can indicate poor data quality, static cutoffs for character count have differed considerably in the literature. In one case, responses that were 50 or fewer characters long (Kaczmirek et al., 2017) were treated as invalid, whereas other researchers have implemented minimums of 200 (Banks et al., 2018) or 900 characters (Jones et al., 2021). As an alternative strategy, some authors have used multiple rounds of cutoffs to filter out text responses shorter than either six, 16, or 51 characters (Etz et al., 2018), which flagged the most suspicious cases first (fewer than six characters) and the less suspicious cases after (fewer than 16 or 51 characters). Finally, some relative cutoffs have been used, such that responses below the 50th percentile on character count have been filtered out before analysis (Viani et al., 2021).

Like character count, low word count has also been conceptualized as a signal of invalid responses to open text questions (Chen, 2011). Some minimum cutoffs for word count have included one (Kaczmirek et al., 2017) or three words (Gagnon et al., 2019), though no empirical support is presented for these cutoffs. Some more deliberate approaches have involved explicitly asking participants to write at least 50 words, and then later checking whether responses met that 50-word minimum (Banks et al., 2018; Brühlmann et al., 2020). However, other studies without an instructed minimum word count may find this cutoff inappropriate. Indeed, Banks et al. (2018) note that “a minimum number of words cannot be recommended because this issue is context dependent” (p. 452). Researchers appear to have agreed with this sentiment, given that cutoffs for character/word count have varied so widely in the literature. The wide range of static cutoffs proposed in past research might also reflect the diversity of tasks and items used across studies. As such, relative cutoffs might be preferable since they are dependent on researchers’ own datasets, rather than static cutoffs determined by others’ datasets.

The final text-based data quality indicator is content analysis of the text response, or manual coding (Brühlmann et al., 2020; Kennedy et al., 2020; Scholz & Zuell, 2012; Sischka et al., 2020; Smyth et al., 2009). Theoretically, coding is the most direct measure of whether a text response is valid or not – while response time and character/word count might signal an invalid response, the content of the text dictates the response’s validity. Coding schemes also allow researchers to decide what constitutes a valid or invalid response in their own contexts. For instance, Brühlmann et al. (2020) manually labelled text responses for whether they were “thematically substantive” (p. 5), consisted of complete sentences, answered subquestions of the original question, and answered subquestions elaboratively. In terms of invalid responses, many researchers have arrived at similar codes, including refusals to respond, “don’t know” or “no opinion” responses, irrelevant or non sequitur responses, and incomprehensible or nonsensical responses (Behr et al., 2014; Kennedy et al., 2020; Scholz & Zuell, 2012). Despite these advantages, coding remains more time- and labour-intensive than other data quality indicators: it typically requires many hours of manual analysis of the content within texts. While this approach may be feasible for studies with small sample sizes, it becomes untenable when texts are too long or too numerous.

Machine learning to detect invalid responses

To overcome these limitations, we suggest an approach that leverages ML to keep the advantages of coding, but without the need to manually examine entire datasets (Nelson et al., 2018; Sebastiani, 2002). Using ML to detect C/IE responding has been a very recent development (Gogami et al., 2021; Schroeders et al., 2021) and these studies have not yet attempted to detect invalid text responses. Specifically, Schroeders et al. (2021) used ML to detect careless responders using a combination of existing data quality indicators (e.g., Mahalanobis distance, even-odd consistency). Though this approach appeared to work well with simulated data (balanced accuracy = .86–.95), performance dropped considerably with data from actual participants (balanced accuracy = .60–.66). A similar approach was taken by Gogami et al. (2021), who also used ML to detect careless responders, but they logged additional data (such as scrolling behaviour, deleting text, and reselecting response options) in concert with traditional data quality indicators (e.g., character count, response time). Implementing supervised ML with those data was reported to effectively detect careless responders to their survey, which included open text items (recall/sensitivity = .86; Gogami et al., 2021). However, these past studies attempted to detect invalid respondents (rather than invalid responses), and neither used the content of text responses as data. In the present study, we ask if ML can detect invalid text responses based on the content of participants’ text responses alone. It is one thing to flag individuals who are carelessly responding to the survey overall; it is another to identify single text responses that are meaningless or irrelevant to the question being asked.

This distinction is relevant from both theoretical and practical standpoints. For one, it is theoretically important to consider that invalid text responses may not represent C/IE responding as it is typically conceptualized for other response types (e.g., Likert-type scales). It is certainly plausible that some participants will write meaningless text to finish the study faster. Importantly, it is equally plausible that some participants will write meaningful but not useful text (Etz et al., 2018; Kennedy et al., 2020). For example, participants may simply fail to understand the question or write that they are declining to respond (Groves et al., 2009; Scholz & Zuell, 2012). In such cases, participants may take their time writing thoughtful or lengthy responses, causing them to pass basic time- or text-based data quality indicators – yet these texts would ultimately still be invalid and still need to be detected. Further, a practical consideration is that researchers may not have the ability to administer or analyse items beyond the text response itself. If other safeguards were not included at the time of testing, researchers have few options: text- or content-based measures are the only ones guaranteed to be available.

We propose that invalid text responses can be effectively detected using ML. Classifying texts into specific categories based on their content alone has been a long-established problem in ML literature (Aggarwal & Zhai, 2012; Yang, 1999), to which many effective solutions have been proposed. For example, researchers have used ML to successfully classify texts as sarcastic or nonsarcastic (Sarsam et al., 2020), abusive or nonabusive (Nobata et al., 2016), or expressing positive or negative sentiment (Liu & Zhang, 2012). Similar to these tasks, the current approach can be conceptualized as a supervised binary classification task (Chicco, 2017): it is supervised in that there are manually labelled values (valid or invalid) that we train, validate, and test the model upon; and it is binary classification in that there are exactly two possible target values (e.g., 0 = valid, 1 = invalid) for each text response. As such, we suggest a supervised ML approach that (a) trains, validates, and tests on a subset of texts manually labelled as valid or invalid, (b) calculates performance metrics to help select the best model, and (c) predicts whether unlabelled texts are valid or invalid based on the text alone. Finally, we compare the ML model’s predictions to existing data quality indicators to assess its performance relative to traditional methods (e.g., time- or text-based cutoffs).

Here, we examined the effectiveness of our ML approach using text responses from an online survey, in which we asked participants about any recurrent involuntary autobiographical memories (IAMs) they may have experienced (Yeung & Fernandes, 2020). These are memories that spring to mind unintentionally and repetitively (Berntsen, 1996; Berntsen & Rubin, 2008), which occur commonly in daily life among large and diverse samples (Berntsen & Rubin, 2008; Brewin et al., 1996; Bywaters et al., 2004; Marks et al., 2018; Yeung & Fernandes, 2020; Yeung & Fernandes, 2021). During our online survey, if participants reported that they had experienced recurrent IAMs within the past year, we asked them to write a text description of their most frequently recurring one. Importantly, these text data seem prone to common concerns about validity. For instance, in order to provide a valid response, participants must have (a) read and understood the instructions, (b) accurately judged whether they had experienced this type of memory, (c) retrieved relevant information about a specific memory, and (d) decided to describe this memory in writing. Declining to perform any of these steps (or performing them carelessly) would have likely led to invalid responses (Groves et al., 2009; Krosnick, 1999). Others have also demonstrated that an ML approach can successfully classify autobiographical memory texts into categories (i.e., specific vs. nonspecific memories; Takano et al., 2017). As such, we believe that our dataset, consisting of involuntary autobiographical memories, is a realistic example of psychological research involving text as data, from which invalid responses could be detected using ML.

Methods

Participants

Undergraduate students from the University of Waterloo participated in the current study in return for course credit. Data were collected in five waves between September 2018 and February 2020, with each wave occurring at the start of an academic term (i.e., Fall/September, Winter/January, Spring/May). In total, 6187 unique participants were recruited, and they produced 3624 text responses. Of these participants, 71% were women, 28% were men, and 1% were nonbinary, genderqueer, or gender nonconforming. Mean age was 19.9 years (SD = 3.3, range = 16–49).

Materials

Recurrent Memory Scale

The Recurrent Memory Scale (Yeung & Fernandes, 2020) assessed participants’ recurrent IAMs. Participants indicated if they had experienced at least one recurrent IAM within the past year, not within the past year, or never (Berntsen & Rubin, 2008). If they had experienced at least one within the past year, they wrote a brief description of their one most frequently recurring IAM and rated it on a series of five-point Likert scales (e.g., frequency of recurring, valence; see Yeung & Fernandes, 2020 for details). Only participants’ text responses describing their memories were used in the current study; no ratings were analysed.

Procedure

Undergraduate students enrolled in at least one psychology course at the University of Waterloo self-registered for the 60-minute online study. After providing informed consent, participants completed a battery of questionnaires in a randomized order, including the Recurrent Memory Scale. For the current study, we analysed only the text responses from this scale; all other measures were unrelated to the current study. All procedures were approved by the University of Waterloo’s Office of Research Ethics (Protocol #40049).

Data preparation

Text data were preprocessed following recommended standards (Banks et al., 2018; Kobayashi et al., 2018), including stop word removal (removing highly frequent words) and lemmatization (reducing words to their base forms). During preprocessing, 50 of the 3624 responses were removed because they no longer contained any text (e.g., the original text contained only nonalphabetic characters or stop words), resulting in 3574 processed text responses.
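To make this step concrete, the following is a minimal Python sketch of stop word removal and lemmatization; the choice of spaCy and the example texts are illustrative assumptions rather than the exact implementation used here.

```python
# Minimal preprocessing sketch; the use of spaCy and the example texts are
# illustrative assumptions, not the exact implementation used in this study.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (lemmatizer + stop words)

def preprocess(text: str) -> str:
    """Lowercase, lemmatize, and drop stop words and non-alphabetic tokens."""
    doc = nlp(text.lower())
    lemmas = [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]
    return " ".join(lemmas)

raw = ["I keep remembering the time my sister and I got lost at the beach.", ":-)"]
processed = [preprocess(t) for t in raw]
# Responses left empty after preprocessing (e.g., only stop words or symbols)
# would be dropped, as with the 50 responses removed above.
processed = [t for t in processed if t]
```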

The processed texts from the labelled subset (n = 940) were then split into training, validation, and test sets. Specifically, 20% of these processed and labelled text responses were partitioned off as the test set (n = 188), to be held out for final evaluation of the model (i.e., never used during training or validation; Chicco, 2017). The remaining 80% of the processed and labelled text responses constituted the training (n = 676–677) and validation sets (n = 75–76), which were split ten times using stratified K-fold cross-validation (K = 10; Geisser, 1975; Marcot & Hanea, 2021).
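A sketch of this partitioning with scikit-learn is shown below; `processed_texts` and `labels` are placeholder names for the processed, labelled subset.

```python
# Sketch of the data partitioning; `processed_texts` and `labels` (0 = valid,
# 1 = invalid) are placeholder names for the processed, labelled subset.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.asarray(processed_texts)
y = np.asarray(labels)

# Hold out 20% as the test set, preserving the valid/invalid ratio
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Stratified ten-fold cross-validation over the remaining 80%
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_dev, y_dev):
    X_train, X_val = X_dev[train_idx], X_dev[val_idx]
    y_train, y_val = y_dev[train_idx], y_dev[val_idx]
    # ...fit and evaluate a candidate model on this fold...
```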

Texts were then represented using a bag-of-words, unigram approach (Grimmer & Stewart, 2013), which decomposes texts into collections of individual words without retaining word order. Specifically, processed text responses were transformed into document-term matrices (Grimmer & Stewart, 2013; Kobayashi et al., 2018; Welbers et al., 2017) and weighted using term frequency-inverse document frequency (TF-IDF; Salton & Buckley, 1988). In brief, TF-IDF accounts for how commonly a word is used across the entire dataset; words that are prevalent across many different texts are down-weighted since these common words are unlikely to distinguish between texts well (Salton & Buckley, 1988).
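Continuing the partitioning sketch above, the document-term matrix and TF-IDF weighting can be produced with scikit-learn’s TfidfVectorizer; default settings are shown for illustration and may differ from the exact parameters used here.

```python
# Bag-of-words/TF-IDF representation; default settings are shown for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()                      # unigrams by default
X_train_dtm = vectorizer.fit_transform(X_train)     # learn vocabulary on training texts
X_val_dtm = vectorizer.transform(X_val)             # reuse that vocabulary
# Rows are text responses; columns are (lemmatized) words, weighted by TF-IDF.
```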

Results

Manual coding of text response subset

As part of a prior study (Yeung & Fernandes, 2020), research assistants were trained to label a subset of the text responses (n = 949; 26% of responses collected) as either valid or invalid. For each text, two research assistants independently judged whether the text was valid (the participant’s text described a memory) or invalid (the participant’s text did not describe a memory). Specifically, they were instructed to judge whether the participant had chosen to describe a memory of a real, experienced event in their text response. One coder (C1) labelled all 949 texts, whereas the other three coders (C2, C3, C4) labelled approximately 316 (33%) texts each. All coders were naïve to the hypotheses of the study.

Interrater reliability between coders was calculated using intraclass correlation coefficients (ICCs) based on a mean-rating (k = 2), absolute agreement, one-way random effects model (Koo & Li, 2016). For the validity judgment, the average measures ICC between C1 and C2, C3, or C4 was .84 (95% CI [.82, .86]), indicating good reliability, F(948, 949) = 6.2, p < .001.

To compare interrater reliability across coder pairs, ICCs were also calculated between C1 and each of C2, C3, and C4 separately. Based on a mean-rating (k = 2), absolute agreement, two-way random effects model, the average measures ICC for C1–C2 was .78 (95% CI [.73, .81]); for C1–C3 was .88 (95% CI [.85, .90]); and for C1–C4 was .86 (95% CI [.83, .88]), indicating good reliability across coder pairs (all ps < .001).

Because confidence intervals did not overlap for C1–C2 compared to C1–C3 and C1–C4, we recalculated the ICC between C1 and C2, C3, or C4 (one-way random effects model) with C2’s labels removed. The average measures ICC with C1–C2 excluded was .87 (95% CI [.85, .88]), F(631, 632) = 7.6, p < .001, which was not significantly different from the ICC with C1–C2 included (.84, 95% CI [.82, .86]). As such, we retained C2’s labels in the following analyses. The final labelled subset consisted of 949 texts, 878 of which were labelled as valid by at least one coder (92.5%), and 71 of which were labelled as invalid by both coders (7.5%).
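For readers wishing to compute comparable reliability estimates, the sketch below shows average-measures ICCs on long-format ratings using the Python package pingouin; the package choice and the toy ratings are assumptions for illustration only (the software used for our ICCs is not specified above).

```python
# Sketch of average-measures ICCs with pingouin; the package choice and the toy
# ratings are illustrative assumptions.
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per (text, coder) validity judgment
ratings = pd.DataFrame({
    "text_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "coder":   ["C1", "C2"] * 5,
    "valid":   [1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
})

icc = pg.intraclass_corr(data=ratings, targets="text_id",
                         raters="coder", ratings="valid")
# ICC1k corresponds to a one-way random effects, average-measures model;
# ICC2k to a two-way random effects, absolute-agreement, average-measures model.
print(icc)
```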

Invalid text detection

Supervised ML was implemented using Python (Harris et al., 2020; Lemaître et al., 2017; McKinney, 2010; Pedregosa et al., 2011; Seabold & Perktold, 2010; Waskom, 2021).

Model validation

Using stratified ten-fold cross-validation, we selected the best performing model based on the Matthews correlation coefficient (MCC; Chicco & Jurman, 2020; Luque et al., 2019), which indexes how well predictions are made on positive (invalid) as well as negative (valid) cases. While no single performance metric is exhaustive, the MCC offers many advantages in our context. First, it is unaffected by the high class imbalance (a 9:1 ratio of valid to invalid texts) present in our study (Chicco & Jurman, 2020). In contrast, other popular metrics such as accuracy (the proportion of correct classifications) are inadequate with imbalanced datasets since they are strongly impacted by the ratio of classes present in the data (Chawla et al., 2002; Chawla, 2009; He & Garcia, 2009; Menardi & Torelli, 2012). For example, trivially assigning the majority label (e.g., valid) to all cases can achieve impressive accuracy scores even though this strategy does not actually perform the task at hand (i.e., detecting invalid texts). Conversely, MCCs can only be high in a binary classification task if the model is correctly labelling both positive and negative cases (Chicco & Jurman, 2020). To supplement the information provided by MCCs, we also present macro F1 scores (the arithmetic mean over individual F1 scores; Opitz & Burst, 2019) in Table 1.

Table 1 Cross-validation performance of all classifiers and resamplers
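To illustrate why accuracy can be misleading under a roughly 9:1 imbalance, consider the toy example below: a model that labels every response as valid earns high accuracy yet an MCC of zero (scikit-learn metrics; values are illustrative).

```python
# Why accuracy is misleading under a ~9:1 class imbalance, while MCC is not.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = np.array([0] * 90 + [1] * 10)     # 90 valid (0), 10 invalid (1)
y_trivial = np.zeros_like(y_true)          # trivially label everything "valid"

print(accuracy_score(y_true, y_trivial))              # 0.90 -- looks impressive
print(matthews_corrcoef(y_true, y_trivial))           # 0.0  -- no detection ability
print(f1_score(y_true, y_trivial, average="macro"))   # ~0.47 -- also penalized
```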

We defined the best model as the combination of classifier and resampler that resulted in the highest mean MCC across the ten folds (Lever et al., 2016), with preference given to simpler models (Chicco, 2017; Hand, 2006). Classifiers are algorithms that dictate the class to which each case is predicted to belong (Aggarwal & Zhai, 2012); resamplers are methods of temporarily modifying datasets with class imbalance (i.e., datasets where one class highly outnumbers another). Because our data were highly imbalanced, we used resampling methods to balance the classes’ proportions during training (Estabrooks et al., 2004; He & Garcia, 2009). Classifiers tend to label minority cases poorly (in our case, invalid texts) unless this imbalance is accounted for, since classifiers can simply ignore the minority class and still achieve high performance metrics (Estabrooks et al., 2004).

Here, we compared the performance of common classifiers and resamplers that have been previously applied to text classification (Aggarwal & Zhai, 2012; Fernández et al., 2018). For classifiers, we compared multinomial naïve Bayes (McCallum & Nigam, 1998), logistic regression (Ng & Jordan, 2002), random forest (Breiman, 2001), decision tree (Safavian & Landgrebe, 1991), gradient boost (Friedman, 2001), linear support vector machine (SVM; Fan et al., 2008), and logistic regression with elastic net regularization (Zou & Hastie, 2005). For resamplers, we compared Synthetic Minority Oversampling Technique (SMOTE; Chawla et al., 2002), Support Vector Machine SMOTE (SVM-SMOTE; Nguyen et al., 2011), and adaptive synthetic sampling (ADASYN; He et al., 2008).
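A condensed sketch of this comparison is shown below, using imbalanced-learn pipelines so that resampling is applied only within the training folds; only a subset of the classifiers is included, all hyperparameters are left at their defaults, and the data follow the placeholder names from the earlier partitioning sketch.

```python
# Sketch of the classifier-by-resampler comparison; only some classifiers are shown
# and hyperparameters are defaults. X_dev/y_dev follow the earlier partitioning sketch.
from itertools import product

from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

classifiers = {"nb": MultinomialNB(),
               "logreg": LogisticRegression(max_iter=1000),
               "svm": LinearSVC()}
resamplers = {"smote": SMOTE(), "svm_smote": SVMSMOTE(), "adasyn": ADASYN()}

mcc = make_scorer(matthews_corrcoef)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

results = {}
for (clf_name, clf), (rs_name, rs) in product(classifiers.items(), resamplers.items()):
    pipe = Pipeline([("tfidf", TfidfVectorizer()),  # vectorize within each fold
                     ("resample", rs),              # oversample invalid texts (training only)
                     ("clf", clf)])
    scores = cross_val_score(pipe, X_dev, y_dev, scoring=mcc, cv=cv)
    results[(clf_name, rs_name)] = scores.mean()

best_combo = max(results, key=results.get)   # combination with the highest mean MCC
print(best_combo, round(results[best_combo], 2))
```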

As shown in Table 1, mean MCCs were highest for naïve Bayes & SVM-SMOTE; naïve Bayes & SMOTE; and naïve Bayes & ADASYN. We selected naïve Bayes & SVM-SMOTE as the best model because it had the highest mean MCC of all the models compared (MMCC = .60). Performance of this cross-validation-trained model was similar between the test set (MMCC = .62, SD = .04) and the validation folds (MMCC = .60, SD = .29), suggesting reasonable ability to generalize to unseen data.

Model evaluation

The best model was then trained on the full training and validation set (80% of the labelled data; n = 752) and evaluated on the test set (20% of the labelled data; n = 188; Sebastiani, 2002). The best model was never trained on any of the test data (Chicco, 2017). The correlation between the best model’s predictions and human coding was positive, statistically significant, and large (r = .57, 95% CI [.46, .66], t(186) = 9.35, p < .001). To illustrate the model’s performance, we present the confusion matrix (Table 2), which indicates how each text response was classified according to human coding versus the ML model. Two of the four cells reflect cases where human coding and the model agreed, which we interpret as the model making “correct” classifications. These include cases where both methods labelled a text response as invalid (true positives or hits; n = 8), and cases where both methods labelled a text response as valid (true negatives or correct rejections; n = 169). The other two cells reflect cases where human coding and the model disagreed, which we interpret as the model making “incorrect” classifications. These include cases where humans coded a text response as valid, but the model labelled it invalid (false positives or false alarms; n = 7), and cases where humans coded a text response as invalid, but the model labelled it valid (false negatives or misses; n = 4). Further, we also provide examples of correctly and incorrectly classified text responses from the test set (Table 3).

Table 2 Confusion matrix for best model’s predictions on test data (n = 188)
Table 3 Example text responses for best model’s predictions on test data (n = 188)
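The held-out evaluation described above can be sketched as follows, continuing the earlier sketches; the pipeline shown is the selected naïve Bayes & SVM-SMOTE combination with default settings.

```python
# Sketch of the held-out evaluation: refit the selected pipeline on the combined
# training/validation texts (X_dev), then score its predictions on the test set.
from imblearn.over_sampling import SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.naive_bayes import MultinomialNB

best_pipe = Pipeline([("tfidf", TfidfVectorizer()),
                      ("resample", SVMSMOTE()),
                      ("clf", MultinomialNB())])
best_pipe.fit(X_dev, y_dev)                # 80% of the labelled data
y_pred = best_pipe.predict(X_test)         # 20% held-out test set

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"hits={tp}, correct rejections={tn}, false alarms={fp}, misses={fn}")
# For binary labels, the MCC equals the phi/Pearson correlation reported above.
print("MCC:", round(matthews_corrcoef(y_test, y_pred), 2))
```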

Finally, we extracted the words that were most predictive of each class according to the ML model (i.e., feature importance). For the invalid class, the ten most predictive words were “remember”, “memory”, “thing”, “song”, “listen”, the negator “n’t”, “think”, “happen”, “sure”, and “like”. The ML model seems to have identified that invalid texts tended to contain terms related to the metacognitive aspects of remembering (e.g., being unsure if an event happened), rather than descriptions of the remembered event itself. In contrast, the most predictive words for the valid class included words like “ambulance”, “awkward”, “couple”, “girlfriend”, “lonely”, “outburst”, “shocked”, “sister”, “speak”, and “visit”. These words seem to reflect the ML model identifying that valid texts tended to contain specific details from remembered events (e.g., persons present, emotions felt).
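One way to obtain such word-level importances from a multinomial naïve Bayes model is to contrast its per-class log probabilities, as sketched below (reusing the fitted pipeline from the previous sketch); this is an illustrative approach and may differ from the exact feature-importance procedure used here.

```python
# Contrast per-class word log probabilities from the fitted naive Bayes pipeline
# (best_pipe from the sketch above); this is one illustrative way to rank words.
import numpy as np

words = best_pipe.named_steps["tfidf"].get_feature_names_out()
log_prob = best_pipe.named_steps["clf"].feature_log_prob_   # rows: [valid, invalid]

# log P(word | invalid) - log P(word | valid): large values favour "invalid"
log_ratio = log_prob[1] - log_prob[0]
print("Most indicative of invalid:", words[np.argsort(log_ratio)[-10:]])
print("Most indicative of valid:  ", words[np.argsort(log_ratio)[:10]])
```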

Comparison to existing data quality indicators

As alternative methods of detecting invalid text responses, we implemented non-ML data quality indicators as recommended by existing literature (Curran, 2016; Etz et al., 2018; Leiner, 2019; Maniaci & Rogge, 2014; Meade & Craig, 2012). First, we detected univariate outliers for all indicator variables (total survey completion time, page completion time, character count, word count) in the labelled subset (n = 949) using the median absolute deviation (MAD) method (Leys et al., 2013; Leys et al., 2019). For each variable, values more than three MADs above or below the median were winsorized (i.e., replaced with the value three MADs from the median). Problematic skew (1.1–30.2) and kurtosis (1.6–920.6) values were resolved following this process (skew = 0.8–0.9, kurtosis = −0.9 to 0.3).
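The MAD-based winsorizing can be sketched as follows; the completion times are toy values, and the normal-consistency constant b = 1.4826 (Leys et al., 2013) is an assumption shown for illustration.

```python
# Sketch of MAD-based winsorizing (median +/- 3 MADs), following Leys et al. (2013);
# the completion times below are toy values for illustration.
import numpy as np

def winsorize_mad(x, n_mads=3.0, b=1.4826):
    """Replace values beyond median +/- n_mads * MAD with the cutoff value."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = b * np.median(np.abs(x - med))       # b assumes approximate normality
    return np.clip(x, med - n_mads * mad, med + n_mads * mad)

completion_times = np.array([35, 40, 42, 44, 45, 47, 50, 52, 55, 600])
print(winsorize_mad(completion_times))          # 600 is pulled down to the upper cutoff
```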

Time-based measures

Similar to Maniaci and Rogge (2014) and Leiner (2019), we calculated speed factors for both the total survey as well as the one page containing the text response of interest. Speed factors were computed by dividing the median completion time by the participant’s completion time; a speed factor of 2 represents the participant completing the total survey or page twice as quickly compared to the median completion time. We used the relatively lenient cutoff of 2 suggested by Leiner (2019), such that instances of participants finishing either the total survey or the page more than twice as fast as the median time were labelled as invalid responses.
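A minimal sketch of this computation is given below; the page completion times are toy values.

```python
# Speed factors: median completion time divided by each participant's completion
# time; factors above 2 (more than twice as fast as the median) are flagged.
import numpy as np

page_times = np.array([120, 95, 150, 40, 110, 30, 135])   # toy times in seconds
speed_factor = np.median(page_times) / page_times
flag_invalid = speed_factor > 2
print(np.round(speed_factor, 2), flag_invalid)
```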

Text-based measures

Following Etz et al. (2018), we used two cutoffs for character count: a minimum of 16 characters or 51 characters (inclusive). Both cutoffs have been described as effective methods of detecting meaningful but not useful data (Etz et al., 2018), where responses shorter than either 16 or 51 characters (exclusive) were labelled as invalid.

For word count, a cutoff of minimum 50 words has been suggested (Banks et al., 2018; Brühlmann et al., 2020). However, this cutoff was inappropriate for the current dataset, as our text responses were typically shorter than 50 words long (in the labelled subset, Mwords = 31.1, SDwords = 21.2). Instead, we opted for a cutoff of minimum ten words long (inclusive), which was approximately 1 SD below the mean for word count. Responses shorter than ten words (exclusive) were labelled as invalid based on this cutoff.
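For completeness, these static text-length cutoffs reduce to simple length checks, as sketched below with toy responses.

```python
# Static text-length cutoffs: responses shorter than 16 characters, 51 characters,
# or 10 words are flagged as invalid. Example responses are toy values.
responses = ["idk",
             "A recurring memory of my grandmother's kitchen.",
             "I often remember the car accident I witnessed on the highway last winter."]

for text in responses:
    n_chars, n_words = len(text), len(text.split())
    print(n_chars < 16, n_chars < 51, n_words < 10)   # True = flagged as invalid
```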

Correlations to human coding

To compare all data quality indicators’ abilities to estimate human coding, we computed correlations between human coding, model predictions, character count cutoffs, word count cutoff, total speed factor cutoff, and page speed factor cutoff (see Table 4). These analyses were confined to the test set (n = 188) to ensure that the ML model was never trained or validated on these data. To account for multiple comparisons, Holm corrections were applied.

Table 4 Correlation matrix of data quality indicators

The correlations illustrate that the ML model’s predictions successfully estimated human coding. Of the non-ML data quality indicators, only the text-based measures (character count cutoffs, word count cutoff) were significantly correlated with human coding (rs = .31–.40, ps < .001). Time-based measures were not significantly correlated with human coding (rs = −.07 to .02, ps = .4–.8).

We then tested whether the human–model correlation was significantly different from the correlations between human coding and existing text-based data quality indicators (Graham & Baldwin, 2014). To do so, we conducted Williams’ tests to compare correlation coefficients derived from a single sample (Dunn & Clark, 1971; Neill & Dunn, 1975). These Williams’ tests indicated that the human–model correlation was significantly stronger than the correlations between human coding and the 16-character count cutoff (t(186) = 2.19, p = .03, Cohen’s q = .22), the 51-character count cutoff (t(186) = 3.45, p < .001, Cohen’s q = .31), and the ten-word count cutoff (t(186) = 3.56, p < .001, Cohen’s q = .32).
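A sketch of one common formulation of Williams’s test for two dependent correlations sharing a variable is given below (following the presentation in Steiger, 1980); the between-indicator correlation shown is a hypothetical value, and the exact implementation used in our analyses may differ in detail.

```python
# Williams's test for comparing two dependent correlations that share one variable
# (e.g., human coding correlated with the model vs. with a character-count cutoff).
# This follows a common formulation (e.g., Steiger, 1980); the exact implementation
# used in the analyses above may differ in detail.
import numpy as np
from scipy import stats

def williams_test(r_jk, r_jh, r_kh, n):
    """Compare r(j,k) and r(j,h) when k and h are themselves correlated (r_kh)."""
    det = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    r_bar = (r_jk + r_jh) / 2
    t = (r_jk - r_jh) * np.sqrt(
        ((n - 1) * (1 + r_kh))
        / (2 * det * (n - 1) / (n - 3) + r_bar**2 * (1 - r_kh) ** 3))
    p = 2 * stats.t.sf(abs(t), df=n - 3)
    return t, p

# r(human, model) = .57 and r(human, 16-character cutoff) = .31 are reported above;
# the correlation between the two indicators (r_kh = .30) is a hypothetical value.
print(williams_test(r_jk=0.57, r_jh=0.31, r_kh=0.30, n=188))
```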

Confusion matrices also illustrate the shortcomings of the existing text-based data quality indicators (Table 5). For instance, the 16-character count cutoff only labelled two responses out of 188 as invalid. Both were successful detections (true positives or hits), since these two responses were indeed labelled by humans as invalid – however, the 16-character count cutoff incorrectly labelled the vast majority of human-labelled invalid texts as valid (83%; false negatives or misses). In comparison, the 51-character count cutoff and the ten-word count cutoff labelled more responses as invalid (21 and 22, respectively). Unfortunately, most of these “invalid” responses were human-labelled as valid (71–73%; false positives or false alarms).

Table 5 Confusion matrices for existing data quality indicators on test data (n = 188)

Predicting validity of unseen text responses

To apply the ML model to out-of-sample, unlabelled data, we retrained it on the full labelled dataset (training, validation, and test sets combined) to maximize the amount of training data (and theoretically, final model performance; Sebastiani, 2002).
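This final step can be sketched as follows; `all_labelled_texts`, `all_labels`, and `unlabelled_texts` are placeholder names.

```python
# Sketch of the final step: refit the selected pipeline on all labelled texts and
# predict the unlabelled responses. Variable names are placeholders.
from imblearn.over_sampling import SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

final_model = Pipeline([("tfidf", TfidfVectorizer()),
                        ("resample", SVMSMOTE()),
                        ("clf", MultinomialNB())])
final_model.fit(all_labelled_texts, all_labels)      # training + validation + test sets
predicted = final_model.predict(unlabelled_texts)    # remaining unlabelled responses
print((predicted == 1).mean())                       # proportion predicted as invalid
```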

Out of the 2634 unlabelled text responses, the final model predicted 170 as invalid (6.5%) and 2464 as valid (93.5%), which is similar to base rates produced by human coding (7.5% invalid and 92.5% valid out of 949 labelled text responses). The final model detected invalid texts containing a wide range of content, including text responses about declining to respond, text responses about misunderstanding instructions, and text responses that were irrelevant to the question (e.g., describing dreams when the question asked about memories). For a random sample of unlabelled texts predicted as valid or invalid, see Table 6.

Table 6 Example text responses from the unlabelled subset (n = 2634) that were predicted as valid or invalid by the best model

Discussion

Best practices dictate that researchers using text as data should detect and remove text responses that are invalid (e.g., meaningless or irrelevant text; Banks et al., 2018; Etz et al., 2018; Müller et al., 2014), since even small proportions of invalid responses can measurably distort results (Arias et al., 2020; Credé, 2010; Huang et al., 2015; Wood et al., 2017; Woods, 2006). However, current literature offers little guidance as to how to detect these invalid text responses effectively or efficiently, and even less evidence as to which methods work well with text data. Here, we implemented and assessed a supervised ML-based method of detecting invalid text responses and compared it to existing detection methods. Results indicated that our ML approach accurately classified valid and invalid autobiographical memory texts with performance near human coding, significantly outperforming previous non-ML approaches.

First, we found that only some of the data quality indicators recommended in the current literature could reliably discriminate valid from invalid responses in our text data. Specifically, only the minimum cutoffs for character count and word count were significantly correlated with human coding. Because labelling very short text responses as invalid approximated human coding to some extent in our dataset, it might be reasonable to flag extremely suspicious cases using character or word count (Etz et al., 2018). At the very least, using a 16-character count cutoff (Etz et al., 2018) did no harm: no valid responses (as labelled by humans) were incorrectly labelled as invalid based on this criterion in our test set. However, one should not expect this 16-character count cutoff to never result in false alarms, since even extremely short text responses (e.g., a few words) can be valid (see Table 6). Confusion matrices also demonstrate the limited effectiveness of using these text-based approaches to discriminate valid from invalid responses. Character and word count cutoffs suffered from very high miss rates (labelling responses as valid when humans labelled them as invalid), very high false alarm rates (labelling responses as invalid when humans labelled them as valid), or both (see Table 5).

Time-based cutoffs also offered little value for detecting invalid text responses in our current study. The maximum cutoffs for speed factor (for either the total survey or single page) were not significantly correlated with human coding. This is interesting considering that time-based data quality indicators have effectively detected C/IE responding in other studies, leading many researchers to recommend their use (Curran, 2016; Huang et al., 2012; Jones et al., 2021; Maniaci & Rogge, 2014; Niessen et al., 2016). Importantly, some caveats to using response time have been reported: Meade and Craig (2012) noted that while time-based measures were better than nothing, using response time alone was ineffective at detecting invalid responses in their studies. Further, Leiner (2019) suggested that response time may fail to detect invalid responses when questions are fact- or knowledge-based – a potentially relevant distinction to the extent that text responses may rely more heavily on retrieving knowledge from memory compared to closed-ended responses (Holland & Christian, 2009; Kaczmirek et al., 2017; Krosnick, 1999; Scholz & Zuell, 2012). In our case, retrieval was necessary because the question asked participants to recall an autobiographical memory – given that individual differences in autobiographical memory function are well documented (Palombo et al., 2018; Rubin, 2021), responding “too quickly” may represent natural variance in individuals’ access to autobiographical memories more so than C/IE responding.

Critically, our analyses indicated that our ML approach was a significantly better proxy for human coding compared to existing data quality indicators. Though character and word count cutoffs were acceptable to some degree (rs = .31–.40, ps < .001), the ML model performed significantly better in all comparisons (ps ≤ .03). A closer inspection of the best non-ML data quality indicator in our study (the 16-character count cutoff) further illustrates its inadequacy for invalid text detection – only two out of a possible 188 texts (1%) were flagged this way. Our own data suggest that human coders detected far more invalid responses (7%) when examining the same set of texts. As such, this character cutoff is likely insufficient to address the problem of detecting invalid text responses, as only the most extreme cases are ever identified as suspicious. In comparison to existing methods, our ML approach was more accurate at both flagging text responses that humans labelled as invalid and retaining text responses that humans labelled as valid. Given its success here, we have made our code openly available (https://github.com/ryancyeung/invalid-text-detect) in the hopes of facilitating research involving text data. Our code is fully open source, takes less than one minute to run from start to finish on a modern computer, and can be implemented with any text data (see Appendix for further details).

Though the ML model significantly outperformed existing methods here, many considerations should be taken into account when deciding which method best fits with one’s own data and goals. As with any method of excluding data, researchers should weigh the potential benefits of removing invalid data against the potential costs of accidentally discarding valid data. In some contexts, it might be critical that meaningless or irrelevant text be removed, lest they alter a study’s conclusions – if this is the case, researchers might find it acceptable if some valid data is misclassified and ultimately excluded prior to analysis. In other contexts, researchers might prefer retaining as much valid data as possible, even if it means that some invalid data is included in analyses. Recent literature suggests that the former case is more realistic, since many studies have shown that even small amounts of invalid responses can cause noticeable harm to data quality (Arias et al., 2020; Credé, 2010; Huang et al., 2015; Wood et al., 2017; Woods, 2006). Our ML approach might be particularly attractive in these situations, since the model shows a balanced ability to both retain valid data and detect invalid data. However, some may find the possibility of removing valid data (false positives/false alarms) too costly. In these circumstances, using a lenient non-ML method might be more appealing (e.g., a 16-character count cutoff). Alternatively, our ML approach can be supplemented with manual adjustments to the unlabelled texts predicted as invalid. If it is paramount to preserve as much data as possible (e.g., with small sample sizes), one should consider manually inspecting the cases predicted as invalid and adjusting as necessary. Any manual adjustments to the predicted cases should be conducted rigorously (e.g., with multiple coders) and reported transparently.

We also want to acknowledge that our current ML approach requires some up-front commitment of resources, since a labelled or manually coded subset is necessary. Recommendations to date for the size of this labelled subset have ranged from 100 to 500 labelled texts (Figueroa et al., 2012; Grimmer & Stewart, 2013; Hopkins & King, 2010). When choosing how many texts to label, a factor to consider is how many invalid texts are to be expected – if humans label very few texts as invalid in the labelled subset, ML may have difficulty training well on these rare cases, especially after recommended splits into training, validation, and test sets (Chicco, 2017). For example, if an ML model is trained on very few invalid cases, it might fail to learn enough from them to generalize well to new, unseen text responses. We mitigated the impact of class imbalance (e.g., few invalid texts compared to valid texts) using resampling techniques (Chawla et al., 2002; Estabrooks et al., 2004; He et al., 2008; He & Garcia, 2009; Nguyen et al., 2011), but researchers should still take care to ensure that instances of the minority class (in our case, invalid texts) exist in the training, validation, and test sets. Devoting resources to labelling a subset of text responses may appear onerous to some; however, we note that researchers have already adopted coding as a method of detecting invalid text responses (Brühlmann et al., 2020; Etz et al., 2018; Kennedy et al., 2020; Scholz & Zuell, 2012; Sischka et al., 2020), which necessitates coding the entire dataset manually. In contrast, our current ML approach only requires coding a subset of the text responses. Our tools can thus cut the workload to a small fraction of what is typically undertaken.

Future directions

Other researchers have demonstrated that unsupervised ML can be used to detect irrelevant text responses to short-answer questions in an educational context (Gagnon et al., 2019). Specifically, it was found that irrelevant text responses clustered together, since the words commonly used in irrelevant answers tended to be dissimilar to the words commonly used in relevant answers. Future work could investigate unsupervised ML further, as this could eliminate the need to manually label a subset to train, validate, and test upon. However, the assumption that valid texts use similar words might not hold in a context where responses are more varied (e.g., autobiographical memories) compared to educational short-answer questions.

Alternatively, performance and/or selection of the best ML model might be improved by adding steps such as hyperparameter tuning (Chicco, 2017) or feature selection (Aggarwal & Zhai, 2012), which have generally improved text classification in other contexts (Desmet & Hoste, 2018; Takano et al., 2017). Text responses could also be represented in different ways to retain information about word order (e.g., trigram/skip-gram models). For instance, word embeddings can be used to represent words or phrases as vectors (Mikolov et al., 2013), which is thought to capture syntactic and semantic components of language better than the bag-of-words representations used in the current study. However, these more complex representations do not always result in better performance (Grimmer & Stewart, 2013). In fact, the simpler bag-of-words representations we implemented here (count, TF-IDF) have been shown to be well suited for relatively short texts (Joti et al., 2020; Padurariu & Breaban, 2019; Wang et al., 2017), such as those in our dataset (Mwords = 30.9, Mcharacters = 164.1).

Estimates of ML model performance may also be made more reliable by using nested cross-validation instead of the flat cross-validation used in our present study (Varma & Simon, 2006). That is, instead of conducting cross-validation in a single round, it is possible to cross-validate in multiple, recursive rounds (i.e., one cross-validation nested within another cross-validation), which has the potential to reduce error in the performance estimates used for model selection. Importantly, it remains worth noting that in practical settings such as our own, nested cross-validation often results in similar quality model selection to that of flat cross-validation (Wainer & Cawley, 2021). Future extensions of this work could explore whether nested cross-validation might impact the model selection process.
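As a sketch, nested cross-validation can be implemented by wrapping an inner hyperparameter search inside an outer performance-estimation loop; the tuning grid below is purely illustrative, and the data follow the placeholder names from the earlier sketches.

```python
# Nested cross-validation sketch: the inner loop tunes hyperparameters, the outer
# loop estimates performance of the whole procedure. The grid is illustrative only.
from imblearn.over_sampling import SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("resample", SVMSMOTE()),
                 ("clf", MultinomialNB())])
param_grid = {"clf__alpha": [0.1, 0.5, 1.0]}          # hypothetical tuning grid
mcc = make_scorer(matthews_corrcoef)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

search = GridSearchCV(pipe, param_grid, scoring=mcc, cv=inner_cv)
nested_scores = cross_val_score(search, X_dev, y_dev, scoring=mcc, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```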

Conclusions

Researchers have recommended that those working with text as data ought to identify invalid text responses prior to analysis, in order to ensure that data are meaningful and analytically useful. Despite the importance of detecting invalid data, options in past literature are of limited applicability to text responses. While some options are relatively fast to compute (e.g., response time, character/word count), their effectiveness as detection methods has not been well-demonstrated. Conversely, more direct measures of text validity (e.g., coding) are not always practical due to the time and labour required. In the current work, we demonstrated that a supervised ML approach can approximate the accuracy of human coding, but without the need to manually label full datasets. Our ML approach also significantly outperformed existing data quality indicators when classifying text responses as valid or invalid. In the hopes of facilitating research involving text as data, we present our openly available code as a new option for detecting invalid responses that is both effective and practical.