1 Introduction

Progress in the research field of neural language generation has in recent years resulted in a variety of generative models able to produce texts of high quality. Current state-of-the-art generative models for text, also referred to as neural language models, produce textual output that is often so grammatically correct, fluent, and coherent that it is hard to tell apart from text written by humans [5, 48]. As with many other technologies, such models can be used for malicious purposes in several ways. Examples of use cases include targeted bot attacks [47], fake news generation [43, 48], and fake review generation [1].

An even more worrying threat is the potential weaponization of such techniques by state actors and state-sponsored groups [5]. Open societies are already challenged by malicious actors who deliberately create and spread disinformation in social media, for purposes ranging from economic gain to deepening divides and sowing distrust in the political systems of democratic countries, using a combination of bots, sockpuppets, and hijacked accounts [4]. Although it is hard to estimate the real-world impact of such online information operations, it is clear that some actors spend considerable amounts of money to orchestrate the spreading of lies and disinformation that follow specific narratives of interest [3, 13]. Hence, there is more than a hypothetical risk that malicious actors will attempt to make use of generative models of various modalities, including the high-quality texts generated by powerful language models such as GPT-2 [35] and Grover [48]. Potentially, this will lower the existing barriers for state actors and other malicious users to efficiently produce misinformation [5] in the social media landscape.

Although there have lately been some reports of language models being used to abuse governmental processes [47] and to create blog posts more or less automatically [24], there have so far not been any reports of state actors actively using language models for information operation purposes. One explanation may be that, until recently, it has been complicated to control language models to follow a specific narrative while still producing varied text of high quality. However, controlling language models has become a popular research topic, resulting in several new ways to better steer what is generated by a language model, in addition to previous, coarser methods such as fine-tuning and priming. Examples of such methods include adding metadata to the training data [28, 48] and the use of attribute classifiers [12, 29] for guiding the text generation process. As noted by Brown et al. [5], this may lead to an increased future risk of language models being misused by, e.g., state-sponsored groups. For this reason, it is of interest to find out to what degree a malicious actor can control the content being generated using such techniques. Furthermore, it is becoming increasingly important to investigate to what extent existing detection methods suggested in the research literature can distinguish between human-written text and text produced by neural language models, especially as a growing amount of research indicates that this task is very challenging for humans [1, 20, 26, 48]. Initial research on the large-scale state-of-the-art neural generative model GPT-3 even suggests that human ability to distinguish between real texts and texts generated by the largest GPT-3 models is not better than random guessing [5].

The main contribution of the work presented in this article is that we evaluate the performance of promising machine learning-based detection models suggested in the available research literature on a wide variety of datasets, covering several types of texts, including news articles, product reviews, forum posts, and tweets. The texts are generated using existing language models such as GPT-2 and Grover, while more fine-grained control of the topic of the generated texts is achieved using the control mechanisms PPLM and GeDi. Such control of the generated texts is likely to be utilized by actors wanting to misuse language models for information operation purposes. The generalizability of the detection methods is studied in both in-distribution and out-of-distribution experiments, as well as on in-the-wild data. Lastly, the detectors’ robustness toward adversarial attacks is investigated.

All in all, it is shown that detectors based on RoBERTa [30] demonstrate reasonable generalizability to out-of-distribution data, but that the detectors are not accurate enough for practical use, other than in constrained scenarios where pre-trained generative models are likely to be used out of the box. Furthermore, active countermeasures such as the use of adversarial attacks can cause the detection algorithms to perform worse than random guessing. This calls for future research into more robust detection methods than are available today.

The rest of this article is structured as follows. In Sect. 2, it is explained how neural language models work, how they are trained, and how they can be used by attackers to automatically generate novel text content on a specific topic and with a specific sentiment. In Sect. 3, various detection algorithms suggested in the existing research literature for distinguishing between real and computer-generated text are reviewed, together with related work on bot detection and adversarial attacks. The actual task definition studied in this work is specified in Sect. 4. Next, the experimental setup is described in Sect. 5, including datasets used for training or fine-tuning the detectors, as well as evaluating their performance on in-distribution, out-of-distribution, and in-the-wild data produced by various neural language models being controlled in different ways. The obtained results are presented and analyzed in Sect. 6, while their potential implications are discussed in Sect. 7, together with ideas for future work. Finally, conclusions are presented in Sect. 8.

2 Neural text generation

The idea of text generation methods is not new. For example, various n-gram language models have been around for a long time [34]. Earlier text generation methods usually relied on extracting and storing statistical frequencies from large text corpora, and used these to estimate probability distributions from which new text sequences could be sampled. However, text produced by such models tends to be ungrammatical and incoherent [18], and hence easy for humans to tell apart from its human-written counterpart. Neural RNN-based language models [40] took a step forward in terms of the quality of the generated text, but the quality reached another level when large-scale language models based on the Transformer architecture [44] were introduced. Unlike RNNs, which have to process data sequentially, Transformer models allow for significantly better parallelization thanks to their attention mechanism, which lets them selectively focus on the segments of input text they predict to be most relevant. Since Transformer-based language models such as GPT-2 [35] require large quantities of text and compute to train, their popularity has, at least until recently, relied on their being trained and publicly released by large companies or research organizations. However, it has become easier for other actors to both fine-tune and train such models from scratch, due to factors such as the release of open-source code, the development of new hardware accelerators, and new research on how to fine-tune existing language models for other languages. Hence, although this is not feasible for the average user, it is without doubt accomplishable for state actors, under the premise that they can find large enough representative datasets for the domain and language in which they are interested in generating text.

2.1 Language modeling

On a high level, neural language models can be described as being trained to predict the next token (such as a word or a word-piece) in a text sequence, given the previous tokens. This is an example of a self-supervised learning task for which no human-annotated texts are required as training data. Instead, only large quantities of unstructured and unlabeled text are required, where tokens can be masked out automatically. As a concrete example of a single training example, the language model can be asked to predict the next word in the text sequence: “Barack Obama is a former,” where continuations like “US,” “American,” or “president” can be expected. More formally, given a corpus of texts \(D=\{ {\mathbf {x}}^i \}_{i=1}^{|D|}\), where each text \({\mathbf {x}}^i\) is composed of a sequence of tokens \((x^i_1, \dots , x^i_{N})\), a left-to-right neural language model \(P_{\theta }\) is trained using a language modeling objective to learn the distribution:

$$\begin{aligned} P({\mathbf {x}}) = \prod _{i=1}^N P(x_i|x_{<i}). \end{aligned}$$
(1)

The chain rule decomposition in Eq. 1 reflects the autoregressive manner in which texts are generated, one token at a time. The parameters \(\theta \) of the language model \(P_{\theta }\) are obtained by minimizing the language modeling loss function:

$$\begin{aligned} L = -\sum _{{\mathbf {x}}\in D}\sum _{i=1}^{N}\log P_\theta (x_i|x_{<i}). \end{aligned}$$
(2)

Once a neural language model has been trained, it can be used to estimate the probability of a text sequence, but also to generate new text.
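To make the training objective and the scoring use case concrete, the following minimal sketch (our illustration, not code from the article) computes the negative log-likelihood of Eqs. 1 and 2 for a single text using a pre-trained GPT-2 model from the Hugging Face transformers library; the model name and example sentence are illustrative choices.

```python
# Minimal sketch: computing the language modeling loss of Eq. 2 for one text
# with a pre-trained GPT-2 model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Barack Obama is a former president of the United States."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # When labels are provided, the model shifts them internally and returns
    # the mean cross-entropy, i.e., the average of -log P(x_i | x_<i).
    outputs = model(input_ids, labels=input_ids)

mean_nll = outputs.loss.item()                    # average negative log-likelihood per token
total_nll = mean_nll * (input_ids.size(1) - 1)    # summed loss of Eq. 2 for this text
print(f"mean NLL: {mean_nll:.3f}, total NLL: {total_nll:.3f}")
```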

2.2 Generating text

A trained neural language model can be used to create text conditioned on some input, such as the beginning of a sentence, or just an empty start token in the case of unconstrained text generation. New texts can then be generated by sampling tokens repeatedly from the conditional distribution \(P_{\theta }(x_i|x_{<i})\) until an end token is generated or other stopping criteria are fulfilled, such as a pre-specified maximum sequence length being reached. Although, in theory, it is possible to simply greedily select the most probable token at each step, this leads to repetitive and highly non-varied text [25]. Hence, some kind of non-deterministic sampling strategy is needed. One such strategy could be to let each token have a chance of being generated that is directly proportional to its estimated probability, as expressed by the language model. However, this tends to lead to texts that significantly deviate from human-written text, as the probability distribution often contains a long tail of tokens that individually are assigned low probabilities, but which cumulatively carry a high probability mass [27]. It is therefore more common in practice to sample from a truncated part of the probability distribution. One common strategy is top-k sampling [17], where the probability distribution is reassigned to only include the k most probable tokens. Nucleus sampling [25] is a dynamic version of top-k sampling that truncates the distribution to the smallest set of tokens whose total probability mass exceeds a fixed threshold \(p \in [0,1]\).
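The two truncation strategies can be expressed in a few lines. The sketch below is our illustration (not code from the article); it assumes a 1-D tensor of next-token probabilities and returns the truncated, renormalized distribution.

```python
# Minimal sketch of top-k and nucleus (top-p) truncation of a next-token distribution.
import torch

def top_k_filter(probs: torch.Tensor, k: int) -> torch.Tensor:
    topk_vals, topk_idx = torch.topk(probs, k)
    filtered = torch.zeros_like(probs)
    filtered[topk_idx] = topk_vals
    return filtered / filtered.sum()          # renormalize over the k kept tokens

def nucleus_filter(probs: torch.Tensor, p: float) -> torch.Tensor:
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    # Keep the smallest prefix whose total mass reaches p (always keep at least one token).
    cutoff = int((cumulative < p).sum().item()) + 1
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[:cutoff]] = sorted_probs[:cutoff]
    return filtered / filtered.sum()

probs = torch.softmax(torch.randn(50257), dim=0)   # dummy distribution over a GPT-2-sized vocabulary
next_token = torch.multinomial(nucleus_filter(probs, p=0.9), num_samples=1)
```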

2.2.1 GPT-2

GPT-2 [35] is a Transformer-based language model trained to predict the next token in a sequence, as described in Sect. 2.1. It was originally trained on a dataset (WebText) containing 40 GB of text scraped from the internet. The relatively large training dataset and its powerful architecture make it capable of generating diverse, coherent text in a multitude of domains. GPT-2 can easily be adapted to generate text in more restricted domains (e.g., reviews and social media comments) with additional fine-tuning on datasets several orders of magnitude smaller than the WebText dataset.
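For illustration, the following sketch shows how a pre-trained GPT-2 checkpoint can be prompted to continue a text with nucleus sampling via the Hugging Face generate API. The prompt and parameter values are illustrative assumptions, not settings used in the article.

```python
# Sketch of conditional generation with a pre-trained GPT-2 model using nucleus sampling.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

prompt = "The city council announced today that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    do_sample=True,                         # non-deterministic sampling, cf. Sect. 2.2
    top_p=0.9,                              # nucleus sampling threshold
    max_length=200,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```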

2.2.2 GROVER

Grover [48] is a language model with the same architecture as GPT-2. However, it has been trained on the RealNews dataset, containing articles from a broad range of news domains. Unlike GPT-2, it is trained to generate texts conditioned on a headline, date, author, and domain, adding the possibility of steering the generated text more closely toward a desired style and topic. When generating an article, the text is initialized with the desired article attributes enclosed in their corresponding start and end tokens as illustrated in Fig. 1, whereafter the rest of the text is generated auto-regressively as described in Sect. 2.2. There are three different model sizes of Grover, ranging from a 117M parameter (Base) model to the largest 1.5B parameter (Mega) model.

Fig. 1
An example of how Grover is conditioned on article fields in order to generate a news article. The desired characteristics of the text, e.g., the chosen title, are added as an initial string on which Grover is conditioned. The generation begins after the \(<|\)beginarticle\(|>\) token.
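The conditioning pattern of Fig. 1 can be sketched as simple prompt construction. The delimiter strings in the sketch below are hypothetical placeholders chosen for illustration; Grover's actual field tokens follow the same pattern but are defined by its own vocabulary.

```python
# Illustrative sketch only: building a metadata-conditioned prompt in the spirit
# of Fig. 1. The field delimiters below are hypothetical placeholders.
def build_article_prompt(domain: str, date: str, authors: str, title: str) -> str:
    fields = {"domain": domain, "date": date, "authors": authors, "title": title}
    parts = [f"<|begin{name}|> {value} <|end{name}|>" for name, value in fields.items()]
    # Article generation starts after the begin-article token (cf. Fig. 1).
    return " ".join(parts) + " <|beginarticle|>"

prompt = build_article_prompt(
    domain="example-news.com",
    date="April 6, 2019",
    authors="Jane Doe",
    title="City council approves new budget",
)
```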

2.3 Controllable text generation

Although the output of neural language models can be controlled to some degree by conditioning (priming) them on an input sequence or fine-tuning them on a more domain-specific dataset, this level of control is in general not enough for a malicious actor who wants to use them within the context of information operations. There are, however, more sophisticated ways in which the generated text can be controlled. One way is to incorporate metadata in the form of control tokens as additional information during training, so that these can later be utilized for finer-grained control during text generation. Examples of this kind of class-conditional language model are Grover [48] and CTRL [28]. Another, more flexible, way to achieve control over what is being generated is to make use of attribute classifiers. An early variant of this was suggested by Adelani et al. [1]. In their approach, texts generated by a language model were fed to a separate discriminator model so that all texts with an unwanted sentiment could be discarded. While such a filtering mechanism does not affect the actual text generation (and may therefore require generating large amounts of text before producing a text in line with what its user wants), more modern approaches incorporate the attribute classifier into the generation process, so that the text generation can be guided more directly. Examples of methods that use such attribute classifiers are Plug and Play Language Models (PPLM) [12] and generative discriminator guided sequence generation (GeDi) [29].

2.3.1 PPLM

PPLM relies on an external attribute model, in addition to a pre-trained neural language model, in order to generate text with a desired characteristic. The attribute model is typically implemented as a standard text classifier. This makes it several orders of magnitude smaller than the original language model, while still allowing for effective steering of the output [12]. This is achieved by sampling text using the language model and feeding the generated text into the attribute model, which yields a probability that the text belongs to the desired class. Gradients from the attribute model are then used in a backward pass that updates the language model's internal latent representations, so that a new distribution over the vocabulary can be generated from the updated latents. This process is repeated at every generation step, leading to a gradual transition of the generated text toward the desired attribute.
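The gradient step at the core of PPLM can be illustrated with a highly simplified sketch. The code below is our illustration, not the original implementation: a toy latent vector stands in for the language model's internal key/value representations, and a linear layer stands in for the attribute classifier; sizes and step counts are arbitrary.

```python
# Highly simplified sketch of PPLM's latent update.
import torch
import torch.nn.functional as F

hidden_size, num_classes, step_size = 768, 2, 0.02
attribute_model = torch.nn.Linear(hidden_size, num_classes)   # assumed pre-trained
latent = torch.randn(hidden_size, requires_grad=True)         # stand-in for current latents
desired_class = torch.tensor(1)                                # e.g., positive sentiment

for _ in range(3):  # a few gradient steps per generation step
    logits = attribute_model(latent)
    loss = F.cross_entropy(logits.unsqueeze(0), desired_class.unsqueeze(0))
    grad, = torch.autograd.grad(loss, latent)
    # Move the latent in the direction that increases the attribute probability.
    latent = (latent - step_size * grad).detach().requires_grad_(True)

# In the real algorithm, the perturbed latents are fed back into the language
# model to produce an updated next-token distribution.
```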

2.3.2 GeDi

GeDi uses generative discriminators to, with the help of a control code, guide (larger) language models toward generating text with a desired attribute, or alternatively, away from generating text with undesired attributes. GeDi drastically reduces the required computation time per generated sample compared to PPLM [29] (as it, unlike PPLM, does not require multiple forward passes per generation step), but is on the other hand more computationally expensive and difficult to train, since it requires training a separate (but smaller) language model using hybrid generative-discriminative training. In essence, GeDi guides the text generation process by, at each step, efficiently computing classification probabilities for all possible next tokens at once using Bayes' rule. This is accomplished by normalizing over two class-conditional distributions, where the first is conditioned on the desired attribute (e.g., positive sentiment) and the other on the undesired attribute (e.g., negative sentiment). The computed likelihoods can then efficiently guide the generation of text from the original (large) language model using various heuristics.
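The Bayes' rule step can be sketched in a few lines. The code below is our illustration of the general idea rather than GeDi's exact heuristics: given next-token distributions from two class-conditional language models, it computes the class probability for every candidate token at once (with a uniform class prior) and uses it to reweight the base language model's distribution; the guidance strength is an arbitrary example value.

```python
# Minimal sketch of GeDi-style Bayes-rule weighting of a next-token distribution.
import torch

vocab_size = 50257
p_base = torch.softmax(torch.randn(vocab_size), dim=0)        # base LM: P(x_t | x_<t)
p_desired = torch.softmax(torch.randn(vocab_size), dim=0)      # CC-LM: P(x_t | x_<t, c=desired)
p_undesired = torch.softmax(torch.randn(vocab_size), dim=0)    # CC-LM: P(x_t | x_<t, c=undesired)

# Bayes' rule with a uniform class prior: P(desired | x_t) for all candidate tokens at once.
class_prob = p_desired / (p_desired + p_undesired + 1e-12)

# One simple heuristic: upweight tokens that the desired class explains better.
omega = 10.0   # guidance strength (illustrative value)
guided = p_base * class_prob.pow(omega)
guided = guided / guided.sum()

next_token = torch.multinomial(guided, num_samples=1)
```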

3 Detection models

Detection of text generated by language models has received increasing attention [1, 20, 43, 48] since the advent of large-scale language models such as GPT-2. However, compared to the detection of images [11, 31, 45, 46] and videos [2, 7, 32] synthesized or manipulated using generative models, the text counterpart is under-researched.

Among the suggested approaches for predicting whether a text sequence has been machine generated or not, different classes of methods can be identified. Some of these make direct use of the probability distribution expressed by neural language models, while others rely on machine learning-based classifiers trained using supervised learning. Within the first class of methods, the total probability method introduced by Solaiman et al. [39] is a representative example. It simply computes the total probability of the text sequence of interest under a pre-trained GPT-2 language model. If the computed probability is closer to the mean likelihood over a set of known machine-generated sequences than to the corresponding mean likelihood over a set of human-written texts, the text sequence is classified as machine generated. This idea can easily be extended to incorporate other pre-trained language models.
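A minimal sketch of this total probability detector is given below; it is our illustration, and the two reference means are placeholders rather than measured values.

```python
# Sketch of the total probability method: score a candidate text with GPT-2 and
# compare its mean log-likelihood with reference means for machine and human text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean negative log-likelihood per token
    return -loss.item()

mean_ll_machine = -3.1   # placeholder: average over known machine-generated texts
mean_ll_human = -4.0     # placeholder: average over known human-written texts

def classify(text: str) -> str:
    ll = mean_log_likelihood(text)
    return "machine" if abs(ll - mean_ll_machine) < abs(ll - mean_ll_human) else "human"
```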

A related detection method is GLTR [20]. GLTR relies on the observation that text generation methods tend to sample from a truncated head of the full probability distribution. In addition to calculating the probability of each word in the text sequence of interest according to a pre-trained language model, it also computes the absolute rank of each word. After binning the ranks into a small number of buckets, the text can be overlaid with colors corresponding to the chosen buckets. In this way, a human can more easily spot whether probable words are overrepresented in the text sequence. Averages over the calculated values can also be used as input features to shallow classifiers, which has been tested with limited success [26]. Other detectors based on shallow classifiers have also been proposed, such as a baseline logistic regression model representing texts using TF-IDF features on the unigram and bigram level [39].
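The rank features can be computed as sketched below (our illustration, not the GLTR implementation); the bucket boundaries are example values.

```python
# Sketch of GLTR-style rank features: for each token, compute its rank under a
# pre-trained GPT-2 model's prediction and count tokens per rank bucket.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def rank_histogram(text: str, buckets=(10, 100, 1000)) -> list:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # shape (1, seq_len, vocab)
    counts = [0] * (len(buckets) + 1)
    for pos in range(ids.size(1) - 1):
        next_id = ids[0, pos + 1]
        # Rank of the actual next token under the model's prediction at `pos`.
        rank = int((logits[0, pos] > logits[0, pos, next_id]).sum().item()) + 1
        bucket = sum(rank > b for b in buckets)         # 0: top-10, 1: top-100, 2: top-1000, 3: beyond
        counts[bucket] += 1
    return counts  # such counts/averages can feed a shallow classifier
```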

Together with the public release of the largest GPT-2 model (consisting of 1.5B parameters), OpenAI released a sequence classifier based on a pre-trained RoBERTa [30] model, fine-tuned to distinguish between texts generated by the GPT-2 model and real texts [39]. The detector was trained on 250,000 samples from the WebText dataset [35] and an equal amount of texts synthesized with GPT-2 using a mixture of sampling methods.

In a similar vein, Zellers et al. [48] have proposed adding a linear classification layer on top of their powerful Grover language model. They argue that Grover's capability to generate text also makes it a strong detector. Their largest Grover-Mega detector has been trained on an equal number of human-written articles and articles generated by Grover-Mega using top-\(p=0.94\) sampling. In their experimental results, it is shown to outperform other detectors (including a detector based on BERT [15]), although later work [43] questions the generalizability of using Grover as a detector when other potential generators are taken into consideration.

Although there is a growing amount of research on detecting text generated by language models, there is still a lack of understanding of which detection models perform best, especially when they have to generalize to data from other distributions than those they were trained on. In a real-world scenario, a sophisticated attacker is unlikely to generate text straight from a publicly available language model on which a detection model can be trained. Instead, it can be expected that such language models are retrained on other types of (non-public) text data before use, and that some kind of controlled text generation method is used to steer the content of the generated text. The use of alternative sampling mechanisms, or even adversarial attacks aimed at confusing specific detection models, can also cause the generated text to deviate significantly from what a public language model would generate using default settings. Hence, there still exist many research questions to address within this area of research.

Although the focus in this article is on detection of computer-generated text, it is highly related to the more well-researched problem of bot detection. Bot detection has been an active field of research for more than a decade [9], i.e., much longer than there have been widespread discussions on the impact of social bots on polarization and the spread of disinformation [37]. Early machine learning-based approaches to detecting automation of social media accounts were often based on relatively simple measurements related to posting behavior, posted content, and account properties of individual accounts [6], but such approaches do not work as well today, as newer generations of bots are often far more sophisticated than previous ones [10]. As demonstrated in [42], bot detection systems are often not robust enough to generalize to social bot scenarios that are not part of the training data. Moreover, the false positive and false negative rates of such systems on real-world data can be questioned [36]. Graph-based approaches that take coordination and synchronization among groups of accounts into consideration when classifying the accounts are therefore becoming increasingly popular [9]. For coming generations of social bots, it is more than likely that controllable text generation using language models will be utilized for generating the textual content, making the bot detection problem even more challenging than it already is today.

Both bot detection and detection of computer-generated text can be seen as an arms race, where improved detectors may trigger increasingly sophisticated attack strategies designed to bypass the defense. For this reason, it becomes relevant to study the robustness of the developed machine learning-based detection methods against adversarial attacks, i.e., slight modifications of the input designed to be difficult for machine learning-based models to classify accurately [22]. To the best of our knowledge, adversarial attacks have not previously been studied in the context of detecting language model-generated text, but the relatively immature research field of adversarial examples is quickly evolving. In white-box attacks, in which an attacker has perfect knowledge of the classification model used for detection, it is in many cases a rather straightforward optimization problem to modify the input so that a minimal perturbation causes it to be misclassified by the model, e.g., by changing a few pixels in an input image. This can, for example, be accomplished using gradient-based search algorithms such as FGSM [23] or L-BFGS [41]. In black-box scenarios, in which the detection model used is unknown to the attacker, it becomes more challenging to carry out adversarial attacks. However, it has been demonstrated that adversarial examples transfer surprisingly well [41], so that an attack optimized for a substitute model to which the adversary has access is likely to be misclassified also by the target detection model unavailable to the attacker. Suggested defenses against adversarial attacks include adversarial training [23, 41] and defensive distillation [33], but these defenses may often be broken using black-box attacks or more expensive iterative optimization attacks [22].

For textual input, adversarial attacks are less studied. Such attacks are somewhat different, as they have to be carried out on the level of individual characters or words rather than on the level of pixels. While a slight change of the intensity of a single pixel rarely changes the overall content of an image, adding or changing a single token in a sentence may change its meaning completely. Despite this, various attacks for NLP applications continue to emerge in the research literature [49], which emphasizes the need to also evaluate the robustness against adversarial attacks of machine learning-based models aimed at detecting text generated by language models.

4 Task definition

We assess the performance of promising machine learning-based detection algorithms suggested in the research literature for distinguishing between real and machine-generated texts with respect to:

  1. their ability to generalize to different domains, generators, and control mechanisms, and

  2. the extent to which they are robust against adversarial attacks.

The generalizability aspect is important since, in practice, it is most likely that an adversary who generates text in an information operations context will do so using a method that produces data deviating from what the detection model was originally trained on. This can, e.g., be the result of the attacker using a new or fine-tuned language model, an alternative sampling strategy, or steering the generated text toward a specific narrative of interest.

The robustness aspect becomes relevant in situations where the attacker knows that a defender may use machine learning-based detection models to automatically identify use of machine-generated text. If there are publicly available detection models, the attacker may design adversarial examples specifically targeted for being misclassified by these models. If the defender instead uses a non-public detection model, black-box attacks may still be a valid threat. For these reasons, it is of interest to evaluate both the generalizability and robustness of the detectors. The performance of the detectors is evaluated on a binary classification task, i.e., predicting whether individual texts have been machine generated or not.

5 Experimental setup

In this section, the experimental setup used to investigate the generalizability and robustness of promising detectors is described. When evaluating the performance of such detectors, representative data become highly important. Section 5.1 describes the datasets used for evaluation of the models, while Sect. 5.2 presents the actual detection models that have been tested on this data. Section 5.3 describes the methodology for investigating the detection models’ generalizability, while Sect. 5.4 describes how the robustness has been evaluated using white-box and black-box adversarial text attacks. In the experiments, machine-generated text is treated as the positive class. The performance is measured in terms of accuracy, precision, recall, and F1-score. A high-level overview of the conducted experiments is given in Fig. 2. Details on the hyperparameters used for each generation strategy are described in further detail in “Appendix A.2.”

Fig. 2
An illustration of the conducted experiments. The robustness of the detection models is evaluated using four different groups of datasets with increasing levels of difficulty. The evaluation begins with texts generated with nucleus sampling, continuing with out-of-distribution texts and in-the-wild datasets generated with novel models and sampling strategies. Finally, the models are evaluated on the most challenging dataset of adversarial examples that have been optimized to fool the detectors.

5.1 Generators and datasets

The experiments have been conducted on two very different types of textual domains: news articles and social media texts. On a finer scale, the social media texts that have been included in this research can be divided into tweets, Reddit comments, Yahoo answers, and Yelp user reviews. For each domain of interest, representative datasets have been required, covering both real human-written texts and language model-generated texts. Further details about these datasets and their generator models are presented below and summarized in Table 1.

Table 1 Generator models used to synthesize the different texts

5.1.1 News articles

For news articles, it is well known that Grover language models are able to produce highly realistic articles. For this reason, two Grover models of different sizes have been included as language models used to create machine-generated news articles, while data instances from the RealNews dataset [48] (originally used for training Grover) have been utilized as real texts. As Grover allows for conditioning on metadata relating to headline, domain, author, and date, this information was extracted from genuine news articles sampled randomly from the RealNews dataset. The generated text was thereafter sampled autoregressively from the generator, conditioned on the sampled metadata.

5.1.2 Social media texts

The GPT-2 language model has (unlike Grover) partly been pre-trained on social media data and is therefore better suited for generating such data. We fine-tuned four separate generative models, all based on a pre-trained medium-size version of GPT-2. The fine-tuning was performed on the following social media datasets, from which we also extracted the real social media texts:

  • Sentiment140 [21]: A dataset of Twitter posts originally created for sentiment analysis. Fine-tuning was carried out for one epoch on all of the 1,599,502 texts belonging to the training split of the dataset.

  • GoEmotions [14]: A dataset containing Reddit comments, originally used for fine-grained emotion classification. All of the 43,410 comments in the training split were used for fine-tuning the GPT-2 model for ten epochs.

  • Yahoo! Answers (nfL6) [8]: A dataset of 87,362 questions and their corresponding answers. The first 82,363 answers of the dataset were used to fine-tune a pre-trained GPT-2 model for ten epochs. None of the questions in the dataset were used.

  • Yelp Polarity Reviews [50]: A dataset containing an equal number of positive and negative Yelp reviews. The GPT-2 model was fine-tuned for one epoch on the training split containing 560,000 reviews.

The texts from each dataset that were not used for training the generators were later utilized as real texts when evaluating the various detectors. Each dataset was obtained from Huggingface Datasets.
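To make the fine-tuning setup more concrete, the sketch below shows how one such run could be implemented with the Hugging Face libraries. It is our illustration rather than the authors' code: the model size and number of epochs follow the text above, but the dataset identifier, column name, batch size, and sequence length are assumptions.

```python
# Sketch of fine-tuning GPT-2 (medium) on tweets with a causal LM objective.
from datasets import load_dataset
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

dataset = load_dataset("sentiment140", split="train")      # tweets; "text" column assumed

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)   # causal LM labels

args = TrainingArguments(output_dir="gpt2-tweets", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized).train()
```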

5.1.3 Controlled text generation

While the data described so far cover the different domains and generators used in the experiments, some further complexity arises when taking the controllability into account. As explained earlier, an attacker may want to be able to control the content of what is being generated, which potentially can have an impact on the detectors’ performance. In addition to unconditioned text generation (and conditioning on sampled metadata for the Grover generator) as described above, we have also controlled a subset of the generated news articles and social media texts on a more fine-grained level using PPLM and GeDi.

For PPLM, two different attribute models were used. The first attribute model was a simple bag-of-words (BoW) model, in which a list of military-related terms was used to represent a military topic. According to this straightforward model, the likelihood of a text containing the military topic is given by the sum of the likelihoods of each word in the bag. As the second (slightly more complex) attribute model, a single linear layer was trained on top of the last hidden state of each generator model on the task of classifying sentiment (based on data from the Stanford Sentiment Treebank [38]). Once trained, gradients from the attribute models were used to steer the generated texts to be (1) positive or negative, or (2) military-related, respectively, while simultaneously taking gradient steps in the direction of high likelihood as expressed by the underlying text generation model.
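The BoW attribute score can be sketched in a few lines (our illustration, not the PPLM code): the topic likelihood at a generation step is the probability mass that the language model assigns to the words in the bag.

```python
# Sketch of a bag-of-words attribute score for PPLM-style topic steering.
import torch

def bow_log_likelihood(next_token_probs: torch.Tensor, bow_token_ids: list) -> torch.Tensor:
    """next_token_probs: 1-D distribution over the vocabulary at one generation step."""
    # Log of the total probability assigned to the topic-related terms.
    return torch.log(next_token_probs[bow_token_ids].sum() + 1e-12)

# During PPLM generation, the gradient of this score with respect to the model's
# latents is what nudges the text toward the chosen topic.
```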

For GeDi, the generated text was also steered toward a specific sentiment (negative) or topic (food). For this purpose, pre-trained generative discriminators were utilized. The parameters used for steering the generation using PPLM and GeDi are found in “Appendix A.2.”

5.2 Detection models

Although many different types of detection models have been suggested in the research literature for the task of discriminating between real and machine-generated texts (as described in more detail in Sect. 3), we have in the experiments reported here focused on Transformer-based detection models, as these have shown the most promising results in previous research. In our conducted experiments, only pre-trained detectors which are publicly available have been included. The external detectors are interesting as they are available out of the box, making them reasonably easy to use for various types of actors (such as social media platforms) attempting to detect online information operations. As shown in Table 2, we have included several versions of pre-trained Transformer-based detection models from OpenAI, based on the RoBERTa architecture. The difference between the OpenAI RoBERTa-Base and OpenAI RoBERTa-Large models lies in the number of parameters; the Large model is simply a deeper network, consisting of more Transformer layers than the Base model. Both models have been fine-tuned on the task of detecting generated text using the same dataset. We have also included a large Grover-Mega model, in which a linear classification layer has been trained on top of the Grover language model, as described in Sect. 3.

Table 2 A list of the detection models used in the experiments

While the pre-trained detection models have already been trained on a mix of real and machine-generated text, they do not necessarily cover the same domains as those they are applied to in the experiments. Although we do want detection models that generalize to data they have not been trained on, it may be too much of a challenge for a detection model that has only been trained on well-written news articles to generalize to shorter and less formal social media posts. For this reason, we have additionally included an OpenAI RoBERTa-Large model that has been further fine-tuned for half an epoch on the training data listed in Table 3, in order to get a better sense of the importance of domain-specific training examples when generalizing to previously unseen domains. The data contain a mix of real and machine-generated texts, where the latter span from news articles to tweets and Reddit comments. Fifty percent of the generated texts were created using Grover and the rest by GPT-2. We used 70% of the text samples for fine-tuning and 10% for validation. The remaining 20% were used as test data during evaluation. We used a batch size of 128 and a learning rate of \(5\times 10^{-5}\) when fine-tuning the RoBERTa model.

Table 3 Data used for fine-tuning and validating the RoBERTa model
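For illustration, such a detector fine-tuning run could look roughly as follows. This is our sketch, not the authors' code: the batch size and learning rate follow the values stated above, while everything else (including starting from the base roberta-large checkpoint rather than the released OpenAI detector weights, and the toy dataset) is an assumption.

```python
# Sketch of fine-tuning a RoBERTa-based detector as a binary sequence classifier.
from transformers import (RobertaForSequenceClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

# Hypothetical training data: machine-generated text is the positive class (label 1).
train = Dataset.from_dict({
    "text": ["a human-written example ...", "a machine-generated example ..."],
    "label": [0, 1],
})
train = train.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256), batched=True)

args = TrainingArguments(output_dir="roberta-detector",
                         per_device_train_batch_size=128,
                         learning_rate=5e-5,
                         num_train_epochs=0.5)
Trainer(model=model, args=args, train_dataset=train).train()
```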

5.3 Evaluating generalizability

In order to evaluate the generalizability of the various detectors, they have been tested in experiments of increasing difficulty. First of all, the detectors were tested on test data held out from the dataset already described in Table 3. These texts have been generated with similar sampling strategies and language models as the texts used for training the detector models. The fine-tuned RoBERTa model has the advantage of being trained on data from the same domain, while this is not the case for the other detection models.

Next, the detectors were tested in more challenging out-of-distribution evaluations focusing on the impact of fine-grained control mechanisms such as PPLM and GeDi on the detectors’ accuracy. Hence, this experiment simulated a scenario in which an attacker attempts to steer the text generation in a certain direction, such as following a specific narrative. Texts synthesized with PPLM and GeDi were not present in any of the detectors’ training data, which simulates a more realistic scenario where a defender cannot be assumed to have knowledge of the techniques used by the attacker.

Finally, we also wanted to get an idea of how well the detectors generalize to in-the-wild data, possibly originating from completely different types of generators than those the detectors have been trained on. For this reason, a number of additional datasets have been experimented with to test the detectors’ in-the-wild detection capabilities:

  • TweepFake dataset [16]: TweepFake (Twitter Deep Fake Dataset) is a dataset consisting of a mix of tweets written by 23 genuine Twitter accounts and equally many bot accounts automatically posting impostor tweets using various language models. The tweets synthesized by bots were generated with language models such as GPT-2, RNNs, and LSTMs. In total, the dataset contains 25,836 tweets with an equal number of human and bot tweets.

  • Deepfake bot submissions dataset [47]: A dataset consisting of 795 human-written comments and 1,001 comments generated with the 124M version of GPT-2, fine-tuned on real comments submitted to a federal public comment website for Medicaid Reform Waiver. We use the 795 human-written comments and an equal amount of the generated comments in our evaluations.

  • Mixed NLG dataset [43]: A comprehensive dataset with texts synthesized with eight different Transformer-based language models, as well as texts written by humans. The dataset contains 1066 texts from each of the models, in addition to equally many human-written texts.

  • GPT-3 dataset [5]: A dataset containing samples generated with the full 175B version of GPT-3, the state-of-the-art successor of GPT-2. We split the GPT-3 samples each time an end-of-text token appears, resulting in a total of 2008 texts. Equally many real texts have been taken from the WebText dataset.

Information about all the test datasets used to evaluate the detectors’ ability to generalize is summarized in Table 4.

Table 4 All of the (balanced) test datasets used to evaluate the detectors

5.4 Evaluating robustness to adversarial attacks

In the last experiments, the robustness of the detection models to adversarial examples was evaluated using perturbed inputs explicitly designed to cause misclassifications. First, a subset of the generated texts was post-processed with the DeepWordBug [19] adversarial attack algorithm, with the goal of making the detectors misclassify them as human written. Human-written texts were not attacked, as it is unlikely that an attacker would be interested in carrying out an attack in that direction. The adversarial attack algorithm ranks each token of the input according to its individual contribution to the classification score. Subsequently, the algorithm perturbs the most influential tokens with one of four character-level transformations: adjacent character swapping, character substitution, character deletion, and character insertion. The attacks were restricted such that a Levenshtein edit distance of no more than 30 was allowed between the adversarial example and the original text. In the first robustness experiment, a white-box attack was carried out against the large OpenAI RoBERTa model. In a second experiment, it was investigated how well this attack transfers to the other RoBERTa models in a black-box setting.
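The four character-level transformations can be illustrated with the following simplified sketch. It is our illustration, not the original DeepWordBug implementation or an attack-library recipe: word importance ranking and the Levenshtein-distance constraint of the real attack are only hinted at in comments.

```python
# Simplified sketch of DeepWordBug-style character perturbations.
import random
import string

def perturb_word(word: str) -> str:
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    op = random.choice(["swap", "substitute", "delete", "insert"])
    if op == "swap":        # swap two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "substitute":  # replace one character
        return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]
    if op == "delete":      # remove one character
        return word[:i] + word[i + 1:]
    return word[:i] + random.choice(string.ascii_lowercase) + word[i:]   # insert

def attack(text: str, important_indices: list, max_perturbed_words: int = 10) -> str:
    """Perturb the most influential words (ranked beforehand by their contribution
    to the detector's score). The real attack additionally constrains the
    Levenshtein distance to the original text to at most 30."""
    words = text.split()
    for idx in important_indices[:max_perturbed_words]:
        words[idx] = perturb_word(words[idx])
    return " ".join(words)
```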

6 Results

In this section, the experimental results are presented. The detection models’ generalizability is evaluated in Sect. 6.1, while the results from the straightforward white-box and black-box adversarial attacks are presented in Sect. 6.2.

6.1 Generalizability results

When presenting the detection models’ achieved performance (evaluated on balanced test data), the experiments are grouped into in-distribution, out-of-distribution, and in-the-wild settings. In addition to the reported accuracies, detection results in terms of precision, recall, and F1-score can be found in “Appendix B.” The best result achieved for each dataset is marked in bold font.

6.1.1 In-distribution detection

Table 5 shows the performance of the detector models on the test data generated with the models in Table 1 using nucleus sampling.

Table 5 Detection accuracy (%) on the in-distribution datasets

Across all of the evaluated detection models, irrespective of whether they have been fine-tuned on data from this particular domain or not, news articles seem to be relatively easy to detect, especially texts generated with the smaller Grover model. Given the relatively large text length of news articles, this is not surprising. However, despite their length, articles from the Grover-Mega generator remain difficult to distinguish from real articles for most of the detectors, even though the fine-tuned OpenAI detector reached an accuracy of 95.25%.

All of the shorter social media texts, apart from those from the Yelp Polarity dataset, were undoubtedly more difficult to detect across all of the models. The OpenAI-Large model reached accuracies of 66.57% and 69.27% for texts from the Sentiment140 and GoEmotions datasets, respectively, with the fine-tuned OpenAI-Large model achieving better accuracies, especially for the Sentiment140 dataset. Notably, the 1.5B parameter Grover-Mega discriminator, trained solely on news articles, performed only slightly better than random chance.

All in all, these results suggest that current state-of-the-art detectors cannot reliably distinguish between real and machine-generated social media posts, not even when having access to training data from a similar distribution.

6.1.2 Out-of-distribution detection

In the second experiment, the impact of controlling the text output using PPLM and GeDi was studied. The results are presented in Table 6.

Table 6 Detection accuracy (%) on the texts generated with PPLM and GeDi

For PPLM, there does not seem to be a notable impact on the detection performance, especially for the best performing RoBERTa models. However, this is not the case for texts generated with GeDi, which in general seem to be harder to detect. The impact is especially noticeable when using GeDi to generate texts with a negative sentiment. This is probably due to GeDi being a more sophisticated control model than PPLM, able to steer the text toward a specific topic without compromising the humanness of the texts as much as PPLM does. This hypothesis is strengthened when manually inspecting samples of the generated texts, as discussed in more detail in Sect. 7.

Interestingly, the OpenAI-Large detector fine-tuned on, e.g., the Sentiment140 and GoEmotions datasets consistently performed worse than the same detector without fine-tuning when applied to the corresponding data controlled by GeDi. This suggests that texts generated with GeDi undergo a noteworthy distribution shift compared to the corresponding texts generated solely with nucleus sampling.

6.1.3 In-the-wild detection

In the last experiment on detection generalizability, the detectors were evaluated on the in-the-wild datasets, which arguably give a better indication of how well the trained detectors generalize to generators and text generation methods other than those they have been trained on. The obtained results are presented in Table 7.

Table 7 Detection accuracy (%) on the in-the-wild datasets

Texts generated with the relatively simple GPT generator were surprisingly hard to detect across all detection models, more so than texts from the 175B parameter state-of-the-art GPT-3 generator. Likewise, generations from the cross-lingual XLM model and from XLNet were equally difficult to detect. However, after manual inspection of the texts from especially the two latter models, we found them to be of such poor quality that they would not be especially useful for an attacker attempting to use language models for conducting information operations. Therefore, these detection accuracies are of limited practical interest.

The fine-tuned OpenAI-Large model did not generalize particularly well to tweets from the TweepFake dataset, even though it was trained on data that included real and generated tweets. Notably, the OpenAI-Large model that was not fine-tuned on Sentiment140 was better at detecting fake tweets from the TweepFake dataset than the fine-tuned version. Although this is a reasonable result, given that the tweet datasets contain texts synthesized with different models, it shows how brittle the detectors are to model variations.

6.2 Robustness results

As a robustness test, the synthesized Yahoo Answers and Yelp Polarity texts were post-processed using the DeepWordBug algorithm described in Sect. 5.4. Both attacks were performed against the OpenAI-Large detector, as it was overall the best performing detector in terms of generalizability. Table 8 summarizes the results of the conducted attacks. An example of one of the generated adversarial examples is shown in Fig. 3.

Table 8 Attack results when perturbing 1000 generated texts of the Yahoo Answers and Yelp Polarity datasets, respectively
Fig. 3
An adversarial example and the original text from the Yelp Review dataset. The three edited words cause the OpenAI-Large model to incorrectly change its classification of the text from machine-generated to human-written.

Clearly, the attacks were effective, causing a majority of the synthesized texts to be classified as human-written. However, these attacks require that the attacker has access to the detection model. This may well be the case for publicly available detectors, but not for non-public detectors trained on private datasets. Therefore, to further test the effectiveness of the adversarial examples, we also considered an alternative scenario in which the attacker was not given access to the detector itself, but only partial information about its model architecture. Hence, some of the other RoBERTa-based detector models were evaluated on the adversarial examples optimized for OpenAI-Large, investigating to what extent the adversarial examples transfer between the models. The results of the transferability experiment are shown in Table 9.

Table 9 Accuracy on the datasets of adversarial examples generated for the OpenAI-Large model

Interestingly, the adversarial examples remained adversarial to a large extent across the detection models, causing a severe decrease in detection accuracies. A model that cannot be accessed and queried by the attacker is therefore not necessarily safe, as the attacker can use adversarial examples computed for a surrogate model trained on the same task as the target model. This is crucial to keep in mind when considering fine-tuning a publicly available detection model on a domain-specific detection task. As the fine-tuned model shares the same architecture and many features with the original model, it is likely to be brittle to the same adversarial examples that fool the original detector.

7 Discussion

Given the experimental results, it is quite clear that, among the evaluated detection methods, the detectors based on a RoBERTa architecture in general perform better than a Grover-based detector on detection tasks involving data from other distributions than those they have been trained on. This is in line with results from Uchendu et al. [43], suggesting that a pre-trained Grover detector does not perform well on textual data generated by language models other than the Grover generator.

Somewhat surprisingly, it does not seem to consistently help to fine-tune the off-the-shelf OpenAI RoBERTa detector on more representative data for the actual detection task. Most likely, this is due to the experimental setup used. In the experiments, we fine-tune on a number of data sources at once, rather than on a single data source, since we aimed at investigating how to build generalizable detectors rather than reaching state-of-the-art performance on single datasets. When aiming for the latter, it is probably a better approach to pre-train a large-scale RoBERTa detector on a large and varied dataset, and then use this detector as a base when fine-tuning individual detectors for each domain of interest.

As has been shown, almost all the evaluated detectors perform worse on social media posts than on news articles. This is a problem, as we have identified the social media domain as being of high importance from an information operations perspective. For this reason, it is of practical importance to be able to increase the detection performance, especially for short posts such as tweets. In initial follow-up experiments, we have found that it is possible to increase detection accuracy by concatenating several posts from the same source. This is especially useful when classifying social media posts, as such posts can be obtained and concatenated on a user level. As an example, we can increase the detection accuracy of the fine-tuned OpenAI RoBERTa detection model on the Sentiment140 dataset from 82.2% to 98.9%, simply by classifying concatenations of ten tweets rather than individual tweets, as illustrated in Fig. 4. This is a promising strategy given the low number of tweets needed to reach an accuracy of this magnitude. Nonetheless, it is only feasible under the assumption that the accounts are not posting a mix of human-written and machine-generated text.

Fig. 4
Detection accuracy of the fine-tuned OpenAI-Large model on generated tweets from the Sentiment140 dataset, as a function of the number of tweets used in each prediction.
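The concatenation strategy described above can be sketched as follows. This is our illustration, not the authors' code, and the per-account majority vote over chunks is an added assumption about how account-level decisions could be aggregated.

```python
# Sketch of classifying concatenations of posts per account instead of single posts.
def chunked_predictions(tweets_by_user: dict, detector, chunk_size: int = 10) -> dict:
    """tweets_by_user maps an account id to a list of its tweets; `detector` is any
    callable returning 1 for machine-generated and 0 for human-written text."""
    results = {}
    for user, tweets in tweets_by_user.items():
        chunks = [" ".join(tweets[i:i + chunk_size])
                  for i in range(0, len(tweets), chunk_size)]
        # Majority vote over the chunk-level predictions for this account.
        votes = [detector(chunk) for chunk in chunks]
        results[user] = int(sum(votes) > len(votes) / 2)
    return results
```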

7.1 Quality of the generated texts

Although contemporary language models can generate texts of unprecedented quality, there is still a risk that some generated texts end up highly repetitive or with other defects. This is especially relevant for generated news articles controlled with GeDi and PPLM, as their outputs are more prone to defects due to the extra complexity the control mechanisms add to the text generation process. Since an attacker is likely to reject synthesized texts with obvious defects, especially news articles, and since evaluations of detection algorithms may yield overly optimistic results if a large fraction of such low-quality texts is used during testing, there was a need to verify the quality of the generated texts used in the experiments. A limited manual assessment was performed on data samples generated using all different combinations of generators and control mechanisms used in the experiments. In this assessment, a file consisting of 3000 random samples was created per combination. Each file was checked for approximately ten minutes by each of the two authors (independently of each other), whereupon the observations were discussed. The high-level finding from this manual assessment is that the generators produce impressive content of high quality, especially when the topic of the text is not controlled using PPLM or GeDi. GeDi succeeds well in controlling the topic, especially for shorter social media posts, which are very hard to tell apart from genuine social media posts; it was in general somewhat easier for longer news articles, as these in many cases had more problems maintaining a consistent thread throughout the article, compared to the corresponding uncontrolled generated news articles. For PPLM, there were in some instances more visible signs of repetition, or the attempt to follow a certain topic created less trustworthy content. There were also more samples in which PPLM did not succeed in having an impact on the topic of the generated text.

In addition to this manual assessment, simple heuristics were utilized for a more quantitative text quality assessment. However, this was only used for news articles, as we found it to be highly variable and not sufficiently well correlated with human judgement when evaluated on shorter social media posts. The method used for this quantitative assessment and the obtained results are described in detail in “Appendix C.1.” The results confirm the findings that the generated news articles are generally of high quality, with slightly more quality issues for texts controlled using PPLM.

Fig. 5
Top: an example of a news article generated by the Grover-Mega generator. Bottom: an article generated by Grover-Base and controlled using PPLM. The bottom text achieved the highest perplexity among the texts generated in this way.

To better illustrate the quality of the generated texts, two examples are shown in Fig. 5. The first generated news article received a rather low perplexity value, while the second received the highest perplexity among the PPLM-controlled Grover-Base articles. As can be seen, the objective to introduce positive sentiment has, in the bottom example, gained too much influence over the textual content, as reflected in the perplexity score. In general, this phenomenon tends to occur more often for PPLM than for GeDi. More examples of generated texts for different domains, with and without attribute models, are provided in “Appendix D.”

To summarize, the control mechanisms seem to work overall, but there certainly are individual cases where the investigated generators and control mechanisms fail to produce texts with a content and quality that suit the needs of an attacker. Our findings suggest that large-scale language models such as Grover combined with GeDi pose more of a viable threat from an information operations perspective, compared to PPLM, which is harder to control and often results in generated text of slightly lower quality.

7.2 Future work

In the experiments presented in this article, and in almost all existing research on detection of text generated by language models, only English texts have been taken into consideration. Information operations involving machine-generated text are in practice not likely to only involve generation of English text, but rather a wide variety of languages adapted for the intended target groups. For this reason, future work in this area should not only focus on English, as detectors may perform differently on other languages due to factors such as the amount of available training data and the morphology of the language.

Another idea for future work is to attempt to increase the robustness of detectors against adversarial attacks, as the best existing detectors in this work have been shown to be highly susceptible to both direct and indirect adversarial attacks. Therefore, it is of interest to evaluate how well approaches based on, e.g., adversarial training and out-of-distribution detection methods work in the context of building more robust detectors.

Finally, it would also be interesting to study to what degree language models like GPT-3 can be controlled by attackers at inference time, simply by conditioning on a few examples of the types of texts of interest. This type of in-context learning has been shown to work surprisingly well for other tasks [5], but it is rather sensitive to the exact choice of prompt and would probably not work for every attribute an attacker would like to control in an information operations context.

8 Conclusions

Control mechanisms such as PPLM and GeDi provide users with more fine-grained control of what is being generated by neural language models such as GPT-2 and Grover. Unfortunately, this increases the risk of malicious actors misusing automatically generated text for creating and spreading disinformation. Several detection algorithms have been suggested in the research literature for predicting whether texts have been computer generated or not. In this work, the generalizability and robustness of several machine learning-based detectors have been investigated. Overall, the detectors were able to tell computer-generated news articles apart from real ones with reasonable accuracy, while the same task was considerably more challenging for shorter social media posts. Controlling the text generation process with PPLM does not seem to increase the difficulty of the detection task, while the contrary holds for output controlled by GeDi. When evaluating the detection methods on in-the-wild datasets and on data from outside the distribution the detectors have been trained on, the accuracy decreases significantly. Furthermore, even the best performing RoBERTa-based detector is shown to be highly sensitive to simple adversarial attacks, causing it to perform worse than random under white-box attacks in which the detection model is accessible to the attacker. The adversarial attacks are also shown to transfer well, i.e., the attacker can severely reduce the detector’s accuracy even without access to the detection model.

These results question the practical usefulness of current state-of-the-art detection methods, and call for more research on how to improve their generalizability and robustness.