Model-agnostic unsupervised detection of bots in a Likert-type questionnaire

  • Original Manuscript
Behavior Research Methods

Abstract

A wealth of literature addresses the statistical detection of bots in online survey data using only responses to Likert-type items. There are two traditions in this literature. One tradition requires labeled data but forgoes strong model assumptions. The other requires a measurement model but forgoes the collection of labeled data. In the present article, we consider the problem where neither labeled data nor a measurement model is available, for an inventory that has the same number of Likert-type categories for all items. We propose a bot detection algorithm that is both model-agnostic and unsupervised. Our proposed algorithm involves a permutation test with leave-one-out calculations of outlier statistics. For each respondent, it outputs a p value for the null hypothesis that the respondent is a bot. Such an algorithm offers nominal sensitivity calibration that is robust to the bot response distribution. In a simulation study, we found our proposed algorithm to improve upon naive alternatives in terms of 95% sensitivity calibration and, in many scenarios, in terms of classification accuracy.
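The following Python sketch illustrates one possible reading of the idea described in the abstract; it is not the authors' implementation (their example R code is in the OSF repository). Several choices here are assumptions made for illustration only: the outlier statistic is a person-total correlation, the permutation shuffles a respondent's own responses across items (which presumes bot responses are exchangeable across items), and the function names `person_total_correlation` and `bot_p_values` are hypothetical.

```python
import numpy as np


def person_total_correlation(x, others):
    """Correlate one response vector with the item means of the other respondents."""
    item_means = others.mean(axis=0)
    if x.std() == 0 or item_means.std() == 0:
        # Correlation is undefined for a constant vector; treat it as uninformative.
        return 0.0
    return float(np.corrcoef(x, item_means)[0, 1])


def bot_p_values(responses, n_perm=999, seed=0):
    """Permutation p value per respondent for the null 'this respondent is a bot'.

    responses: (n_respondents, n_items) array of Likert-type category codes.
    Assumption: under the null, shuffling a respondent's answers across items
    yields vectors that are as plausible as the observed one.
    """
    responses = np.asarray(responses, dtype=float)
    rng = np.random.default_rng(seed)
    n = responses.shape[0]
    p = np.empty(n)
    for i in range(n):
        # Leave-one-out: the reference sample excludes respondent i.
        others = np.delete(responses, i, axis=0)
        observed = person_total_correlation(responses[i], others)
        perm_stats = np.array([
            person_total_correlation(rng.permutation(responses[i]), others)
            for _ in range(n_perm)
        ])
        # Small p: the observed vector agrees with the sample far more than
        # its shuffled copies do, which is evidence against the bot null.
        p[i] = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return p
```

Under this sketch, a respondent would be flagged as a suspected bot when the p value exceeds the chosen level (for example, 0.05), which is how a test with "bot" as the null hypothesis would aim for roughly 95% sensitivity regardless of how the bots respond.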

Open Practices Statement

Example R code for implementing the methods studied in the manuscript is available at an Open Science Framework repository (https://osf.io/e5v3s/). The simulation study was not preregistered.

Notes

  1. Some studies blur the distinction between the two traditions. Person-fit statistics require measurement model assumptions but are often used (e.g., Beck et al., 2019) toward direct detection. Interestingly, the approach in Patton et al. (2019) iteratively re-estimates parameters of a measurement model by removing suspicious respondents directly detected. Regardless, our taxonomy remains instructive for the purposes of the present article.

  2. Note then that i is ambiguous as an index—it may refer to a training set respondent or a test set respondent, depending on the presence of the superscript.

  3. Schroeders et al. (2022) is an unusual case where the favored approach does not involve NRIs: the raw Likert-type item responses are themselves the features, a convenience afforded by their flexible (and complicated) classifier. Note that using raw Likert-type item responses as features admits no obvious ideal point, in contrast to our proposed approach.

  4. The setup for unsupervised learning differs slightly from common textbook examples (e.g., James et al., 2013) where the goal is to extract features that can reproduce the data at hand. In such examples, there is still no labeled data and the existence of separate classes is not assumed. Furthermore, in both supervised and unsupervised paradigms, a user-defined training/test split of the data at hand is possible in order to assess model performance and prevent overfitting. If such a split is done, both training and test sets in the supervised paradigm contain labeled data, and neither set has labeled data in the unsupervised paradigm. In either paradigm, the chosen model can then be applied to yet additional future observations. We do not employ or consider this alternative use of training/test set terminology and instead reserve such terms to signal the presence/absence of known class exemplars when constructing the classifier.

Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC; funding reference numbers RGPIN-2018-05357 and DGECR-2018-00083) and the Fonds de recherche du Québec–Nature et technologies (2022-PR-298903).

Author information

Corresponding author

Correspondence to Carl F. Falk.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 177 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ilagan, M.J., Falk, C.F. Model-agnostic unsupervised detection of bots in a Likert-type questionnaire. Behav Res (2023). https://doi.org/10.3758/s13428-023-02246-7

  • DOI: https://doi.org/10.3758/s13428-023-02246-7
