1 Introduction

Progress in the research field of neural language generation has in recent years resulted in a variety of generative models able to produce texts of high quality. Current state-of-the-art generative models for text, also referred to as neural language models, produce textual output that is often so grammatically correct, fluent, and coherent that it is hard to tell apart from text written by humans [5, 48]. As with many other technologies, such models can be used for malicious purposes in several ways. Examples of use cases include targeted bot attacks [47], fake news generation [43, 48], and fake review generation [1].

An even more worrying threat is the potential weaponization of such techniques by state actors and state-sponsored groups [5]. Open societies are already challenged by malicious actors who deliberately create and spread disinformation in social media, for purposes ranging from economic gain to deepening divides and sowing distrust in the political systems of democratic countries, using a combination of bots, sockpuppets, and hijacked accounts [4]. Although it is hard to estimate the real-world impact of such online information operations, it is clear that some actors spend considerable amounts of money to orchestrate the spreading of lies and disinformation that follow specific narratives of interest [3, 13]. Hence, there is more than a hypothetical risk that malicious actors will attempt to make use of generative models of various modalities, including the high-quality texts generated by powerful language models such as GPT-2 [35] and Grover [48]. Potentially, this will lower the existing barriers for state actors and other malicious users to efficiently produce misinformation [5] in the social media landscape.

Although there have lately been some reports of language models being used to abuse governmental processes [47] and to create blog posts more or less automatically [24], there have so far not been any reports of state actors actively using language models for information operation purposes. One explanation may be that, until recently, it has been complicated to control language models to follow a specific narrative while still producing varied text of high quality. However, controlling language models has become a popular research topic, resulting in several new ways to better steer what is generated by a language model, in addition to previous, coarser methods such as fine-tuning and priming. Examples of such methods include adding metadata to the training data [28, 48] and the use of attribute classifiers [12, 29] for guiding the text generation process. As noted by Brown et al. [5], this may lead to an increased future risk of language models being misused by, e.g., state-sponsored groups. For this reason, it is of interest to find out to what degree a malicious actor can control the content being generated using such techniques. Furthermore, it is becoming increasingly important to investigate to what extent existing detection methods suggested in the research literature can distinguish between human-written text and text produced by neural language models, especially as a growing amount of research indicates that this task is very challenging for humans [1, 20, 26, 48]. Initial research on the large-scale state-of-the-art neural generative model GPT-3 even suggests that human ability to distinguish between real texts and texts generated by the largest GPT-3 models is not better than random guessing [5].

The main contribution of the work presented in this article is that we evaluate the performance of promising machine learning-based detection models suggested in the available research literature on a wide variety of datasets, covering several types of texts, including news articles, product reviews, forum posts, and tweets. The texts are generated using existing language models such as GPT-2 and Grover, while more fine-grained control of the topic of the generated texts is achieved using the control mechanisms PPLM and GeDi. Such control of the generated texts is likely to be utilized by actors wanting to misuse language models for information operation purposes. The generalizability of the detection methods is studied in both in-distribution and out-of-distribution experiments, as well as on in-the-wild data. Lastly, the detectors’ robustness toward adversarial attacks is investigated.

All in all, it is shown that detectors based on RoBERTa [30] demonstrate reasonable generalizability to out-of-distribution data, but that the detectors are not accurate enough for practical use, other than in constrained scenarios where pre-trained generative models are likely to be used out of the box. Furthermore, active countermeasures such as the use of adversarial attacks can cause the detection algorithms to perform worse than random guessing. This calls for future research into more robust detection methods than are available today.

The rest of this article is structured as follows. In Sect. 2, it is explained how neural language models work, how they are trained, and how they can be used by attackers to automatically generate novel text content on a specific topic and with a specific sentiment. In Sect. 3, various detection algorithms suggested in the existing research literature for distinguishing between real and computer-generated text are reviewed, together with related work on bot detection and adversarial attacks. The actual task definition studied in this work is specified in Sect. 4. Next, the experimental setup is described in Sect. 5, including datasets used for training or fine-tuning the detectors, as well as evaluating their performance on in-distribution, out-of-distribution, and in-the-wild data produced by various neural language models being controlled in different ways. The obtained results are presented and analyzed in Sect. 6, while their potential implications are discussed in Sect. 7, together with ideas for future work. Finally, conclusions are presented in Sect. 8.

2 Neural text generation

The idea of text generation methods is not new. For example, various n-gram language models have been around for a long time [34]. Earlier text generation methods usually relied on extracting and storing statistical frequencies from large text corpora, and used these to estimate probability distributions from which new text sequences could be sampled. However, text produced by such models tends to be ungrammatical and incoherent [18], and hence easy for humans to tell apart from its human-written counterpart. Neural RNN-based language models [40] took a step forward in terms of the quality of the generated text, but the quality reached another level when large-scale language models based on the Transformer architecture [44] were introduced. Unlike RNNs, which have to process data sequentially, Transformer models allow for significantly better parallelization thanks to their attention mechanism, which lets them selectively focus on the segments of input text they predict to be most relevant. Since Transformer-based language models such as GPT-2 [35] require large quantities of text and compute to train, their popularity has, at least until recently, relied on their being trained and publicly released by large companies or research organizations. However, it has become easier for other actors to both fine-tune and train such models from scratch, due to factors such as the release of open-source code, the development of new hardware accelerators, and new research on how to fine-tune existing language models for other languages. Hence, although this is not feasible for the average user, it is without doubt accomplishable for state actors, under the premise that they can find large enough representative datasets for the domain and language in which they are interested in generating text.

2.1 Language modeling

On a high level, neural language models can be described as being trained to predict the next token (such as a word or a word-piece) in a text sequence, given the previous tokens. This is an example of a self-supervised learning task for which no human-annotated texts are required as training data. Instead, only large quantities of unstructured and unlabeled text are required, where tokens can be masked out automatically. As a concrete example of a single training example, the language model can be asked to predict the next word in the text sequence: “Barack Obama is a former,” where continuations like “US,” “American,” or “president” can be expected. More formally, given a corpus of texts \(D=\{ {\mathbf {x}}^i \}_{i=1}^{|D|}\), where each text \({\mathbf {x}}^i\) is composed of a sequence of tokens \((x^i_1, \dots , x^i_{N})\), a left-to-right neural language model \(P_{\theta }\) is trained using a language modeling objective to learn the distribution:

$$\begin{aligned} P({\mathbf {x}}) = \prod _{i=1}^N P(x_i|x_{<i}). \end{aligned}$$
(1)

The chain rule decomposition in Eq. 1 reflects the autoregressive manner in which texts are generated, one token at a time. The parameters \(\theta \) of the language model \(P_{\theta }\) are obtained by minimizing the language modeling loss function:

$$\begin{aligned} L = -\sum _{{\mathbf {x}}\in D}\sum _{i=1}^{N}\log P_\theta (x_i|x_{<i}). \end{aligned}$$
(2)

Once a neural language model has been trained, it can be used to estimate the probability of a text sequence, but also to generate new text.
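To make the training objective and the scoring use case concrete, the following minimal sketch (our illustration, not code from the article) computes the negative log-likelihood of Eqs. 1 and 2 for a single text using a pre-trained GPT-2 model from the Hugging Face transformers library; the model name and example sentence are illustrative choices.

```python
# Minimal sketch: computing the language modeling loss of Eq. 2 for one text
# with a pre-trained GPT-2 model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Barack Obama is a former president of the United States."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # When labels are provided, the model shifts them internally and returns
    # the mean cross-entropy, i.e., the average of -log P(x_i | x_<i).
    outputs = model(input_ids, labels=input_ids)

mean_nll = outputs.loss.item()                    # average negative log-likelihood per token
total_nll = mean_nll * (input_ids.size(1) - 1)    # summed loss of Eq. 2 for this text
print(f"mean NLL: {mean_nll:.3f}, total NLL: {total_nll:.3f}")
```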

2.2 Generating text

A trained neural language model can be used to create text conditioned on some input, such as the beginning of a sentence, or just an empty start token in the case of unconstrained text generation. New texts can then be generated by sampling tokens repeatedly from the conditional distribution \(P_{\theta }(x_i|x_{<i})\) until an end token is generated or other stopping criteria are fulfilled, such as a pre-specified maximum sequence length being reached. Although, in theory, it is possible to simply greedily select the most probable token at each step, this leads to repetitive and highly non-varied text [25]. Hence, some kind of non-deterministic sampling strategy is needed. One such strategy could be to let each token have a chance of being generated that is directly proportional to its estimated probability, as expressed by the language model. However, this tends to lead to texts that significantly deviate from human-written text, as the probability distribution often contains a long tail of tokens that individually are assigned low probabilities, but which cumulatively carry a high probability mass [27]. It is therefore more common in practice to sample from a truncated part of the probability distribution. One common strategy is top-k sampling [17], where the probability distribution is reassigned to only include the k most probable tokens. Nucleus sampling [25] is a dynamic version of top-k sampling that truncates the distribution to the smallest set of tokens whose total probability mass exceeds a fixed threshold \(p \in [0,1]\).
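The two truncation strategies can be expressed in a few lines. The sketch below is our illustration (not code from the article); it assumes a 1-D tensor of next-token probabilities and returns the truncated, renormalized distribution.

```python
# Minimal sketch of top-k and nucleus (top-p) truncation of a next-token distribution.
import torch

def top_k_filter(probs: torch.Tensor, k: int) -> torch.Tensor:
    topk_vals, topk_idx = torch.topk(probs, k)
    filtered = torch.zeros_like(probs)
    filtered[topk_idx] = topk_vals
    return filtered / filtered.sum()          # renormalize over the k kept tokens

def nucleus_filter(probs: torch.Tensor, p: float) -> torch.Tensor:
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    # Keep the smallest prefix whose total mass reaches p (always keep at least one token).
    cutoff = int((cumulative < p).sum().item()) + 1
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[:cutoff]] = sorted_probs[:cutoff]
    return filtered / filtered.sum()

probs = torch.softmax(torch.randn(50257), dim=0)   # dummy distribution over a GPT-2-sized vocabulary
next_token = torch.multinomial(nucleus_filter(probs, p=0.9), num_samples=1)
```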

2.2.1 GPT-2

GPT-2 [35] is a Transformer-based language model trained to predict the next token in a sequence, as described in Sect. 2.1. It was originally trained on a dataset (WebText) containing 40 GB of text scraped from the internet. The relatively large training dataset and its powerful architecture make it capable of generating diverse, coherent text in a multitude of domains. GPT-2 can easily be adapted to generate text in more restricted domains (e.g., reviews and social media comments) with additional fine-tuning on datasets several orders of magnitude smaller than the WebText dataset.
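For illustration, the following sketch shows how a pre-trained GPT-2 checkpoint can be prompted to continue a text with nucleus sampling via the Hugging Face generate API. The prompt and parameter values are illustrative assumptions, not settings used in the article.

```python
# Sketch of conditional generation with a pre-trained GPT-2 model using nucleus sampling.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

prompt = "The city council announced today that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    do_sample=True,                         # non-deterministic sampling, cf. Sect. 2.2
    top_p=0.9,                              # nucleus sampling threshold
    max_length=200,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```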

2.2.2 GROVER

Grover [48] is a language model with the same architecture as GPT-2. However, it has been trained on the RealNews dataset, containing articles from a broad range of news domains. Unlike GPT-2, it is trained to generate texts conditioned on a headline, date, author, and domain, adding the possibility of steering the generated text more closely toward a desired style and topic. When generating an article, the text is initialized with the desired article attributes enclosed in their corresponding start and end tokens as illustrated in Fig. 1, whereafter the rest of the text is generated auto-regressively as described in Sect. 2.2. There are three different model sizes of Grover, ranging from a 117M parameter (Base) model to the largest 1.5B parameter (Mega) model.

Fig. 1
An example of how Grover is conditioned on article fields in order to generate a news article. The desired characteristics of the text, e.g., the chosen title, are added as an initial string on which Grover is conditioned. The generation begins after the \(<|\)beginarticle\(|>\) token.
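The conditioning pattern of Fig. 1 can be sketched as simple prompt construction. The delimiter strings in the sketch below are hypothetical placeholders chosen for illustration; Grover's actual field tokens follow the same pattern but are defined by its own vocabulary.

```python
# Illustrative sketch only: building a metadata-conditioned prompt in the spirit
# of Fig. 1. The field delimiters below are hypothetical placeholders.
def build_article_prompt(domain: str, date: str, authors: str, title: str) -> str:
    fields = {"domain": domain, "date": date, "authors": authors, "title": title}
    parts = [f"<|begin{name}|> {value} <|end{name}|>" for name, value in fields.items()]
    # Article generation starts after the begin-article token (cf. Fig. 1).
    return " ".join(parts) + " <|beginarticle|>"

prompt = build_article_prompt(
    domain="example-news.com",
    date="April 6, 2019",
    authors="Jane Doe",
    title="City council approves new budget",
)
```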

2.3 Controllable text generation

Although the output of neural language models can be controlled to some degree by conditioning (priming) them on an input sequence or fine-tuning them on a more domain-specific dataset, this level of control is in general not enough for a malicious actor who wants to use them within the context of information operations. There are, however, more sophisticated ways in which the generated text can be controlled. One way is to incorporate metadata in the form of control tokens as additional information during training, so that these can later be utilized for finer-grained control during text generation. Examples of this kind of class-conditional language model are Grover [48] and CTRL [28]. Another, more flexible, way to achieve control over what is being generated is to make use of attribute classifiers. An early variant of this was suggested by Adelani et al. [1]. In their approach, texts generated by a language model were fed to a separate discriminator model so that all texts with an unwanted sentiment could be discarded. While such a filtering mechanism does not affect the actual text generation (and may therefore require generating large amounts of text before producing a text in line with what its user wants), more modern approaches incorporate the attribute classifier into the generation process, so that the text generation can be guided more directly. Examples of methods that use such attribute classifiers are Plug and Play Language Models (PPLM) [12] and generative discriminator guided sequence generation (GeDi) [29].

2.3.1 PPLM

PPLM relies on an external attribute model, in addition to a pre-trained neural language model, in order to generate text with a desired characteristic. The attribute model is typically implemented as a standard text classifier. This makes it several orders of magnitude smaller than the original language model, while still allowing for effective steering of the output [12]. This is achieved by sampling text using the language model and feeding the generated text into the attribute model, which yields a probability that the text belongs to the desired class. Gradients from the attribute model are then used in a backward pass that updates the language model's internal latent representations, so that a new distribution over the vocabulary can be generated from the updated latents. This process is repeated at every generation step, leading to a gradual transition of the generated text toward the desired attribute.
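The gradient step at the core of PPLM can be illustrated with a highly simplified sketch. The code below is our illustration, not the original implementation: a toy latent vector stands in for the language model's internal key/value representations, and a linear layer stands in for the attribute classifier; sizes and step counts are arbitrary.

```python
# Highly simplified sketch of PPLM's latent update.
import torch
import torch.nn.functional as F

hidden_size, num_classes, step_size = 768, 2, 0.02
attribute_model = torch.nn.Linear(hidden_size, num_classes)   # assumed pre-trained
latent = torch.randn(hidden_size, requires_grad=True)         # stand-in for current latents
desired_class = torch.tensor(1)                                # e.g., positive sentiment

for _ in range(3):  # a few gradient steps per generation step
    logits = attribute_model(latent)
    loss = F.cross_entropy(logits.unsqueeze(0), desired_class.unsqueeze(0))
    grad, = torch.autograd.grad(loss, latent)
    # Move the latent in the direction that increases the attribute probability.
    latent = (latent - step_size * grad).detach().requires_grad_(True)

# In the real algorithm, the perturbed latents are fed back into the language
# model to produce an updated next-token distribution.
```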

2.3.2 GeDi

GeDi uses generative discriminators to, with the help of a control code, guide (larger) language models toward generating text with a desired attribute, or alternatively, away from generating text with undesired attributes. GeDi drastically reduces the required computation time per generated sample compared to PPLM [29] (as it, unlike PPLM, does not require multiple forward passes per generation step), but is on the other hand more computationally expensive and difficult to train, since it requires training a separate (but smaller) language model using hybrid generative-discriminative training. In essence, GeDi guides the text generation process by, at each step, efficiently computing classification probabilities for all possible next tokens at once using Bayes' rule. This is accomplished by normalizing over two class-conditional distributions, where the first is conditioned on the desired attribute (e.g., positive sentiment) and the other on the undesired attribute (e.g., negative sentiment). The computed likelihoods can then efficiently guide the generation of text from the original (large) language model using various heuristics.
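The Bayes' rule step can be sketched in a few lines. The code below is our illustration of the general idea rather than GeDi's exact heuristics: given next-token distributions from two class-conditional language models, it computes the class probability for every candidate token at once (with a uniform class prior) and uses it to reweight the base language model's distribution; the guidance strength is an arbitrary example value.

```python
# Minimal sketch of GeDi-style Bayes-rule weighting of a next-token distribution.
import torch

vocab_size = 50257
p_base = torch.softmax(torch.randn(vocab_size), dim=0)        # base LM: P(x_t | x_<t)
p_desired = torch.softmax(torch.randn(vocab_size), dim=0)      # CC-LM: P(x_t | x_<t, c=desired)
p_undesired = torch.softmax(torch.randn(vocab_size), dim=0)    # CC-LM: P(x_t | x_<t, c=undesired)

# Bayes' rule with a uniform class prior: P(desired | x_t) for all candidate tokens at once.
class_prob = p_desired / (p_desired + p_undesired + 1e-12)

# One simple heuristic: upweight tokens that the desired class explains better.
omega = 10.0   # guidance strength (illustrative value)
guided = p_base * class_prob.pow(omega)
guided = guided / guided.sum()

next_token = torch.multinomial(guided, num_samples=1)
```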

3 Detection models

Detection of text generated by language models has received increasing attention [1, 20, 43, 48] since the advent of large-scale language models such as GPT-2. However, compared to the detection of images [11, 31, 45, 46] and videos [2, 7, 32] synthesized or manipulated using generative models, the text counterpart is under-researched.

Among the suggested approaches for predicting whether a text sequence has been machine generated or not, different classes of methods can be identified. Some of these make direct use of the probability distribution expressed by neural language models, while others rely on machine learning-based classifiers trained using supervised learning. Within the first class of methods, the total probability method introduced by Solaiman et al. [39] is a representative example. It simply computes the total probability of the text sequence of interest under a pre-trained GPT-2 language model. If the computed probability is closer to the mean likelihood over a set of known machine-generated sequences than to the corresponding mean likelihood over a set of human-written texts, the text sequence is classified as machine generated. This idea can easily be extended to incorporate other pre-trained language models.
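A minimal sketch of this total probability detector is given below; it is our illustration, and the two reference means are placeholders rather than measured values.

```python
# Sketch of the total probability method: score a candidate text with GPT-2 and
# compare its mean log-likelihood with reference means for machine and human text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean negative log-likelihood per token
    return -loss.item()

mean_ll_machine = -3.1   # placeholder: average over known machine-generated texts
mean_ll_human = -4.0     # placeholder: average over known human-written texts

def classify(text: str) -> str:
    ll = mean_log_likelihood(text)
    return "machine" if abs(ll - mean_ll_machine) < abs(ll - mean_ll_human) else "human"
```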

A related detection method is GLTR [20]. GLTR relies on the observation that text generation methods tend to sample from a truncated head of the full probability distribution. In addition to calculating the probability of each word in the text sequence of interest according to a pre-trained language model, it also computes the absolute rank of each word. After binning the ranks into a small number of buckets, the text can be overlaid with colors corresponding to the chosen buckets. In this way, a human can more easily spot whether probable words are overrepresented in the text sequence. Averages over the calculated values can also be used as input features to shallow classifiers, which has been tested with limited success [26]. Other detectors based on shallow classifiers have also been proposed, such as a baseline logistic regression model representing texts using TF-IDF features on the unigram and bigram level [39].
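The rank features can be computed as sketched below (our illustration, not the GLTR implementation); the bucket boundaries are example values.

```python
# Sketch of GLTR-style rank features: for each token, compute its rank under a
# pre-trained GPT-2 model's prediction and count tokens per rank bucket.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def rank_histogram(text: str, buckets=(10, 100, 1000)) -> list:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # shape (1, seq_len, vocab)
    counts = [0] * (len(buckets) + 1)
    for pos in range(ids.size(1) - 1):
        next_id = ids[0, pos + 1]
        # Rank of the actual next token under the model's prediction at `pos`.
        rank = int((logits[0, pos] > logits[0, pos, next_id]).sum().item()) + 1
        bucket = sum(rank > b for b in buckets)         # 0: top-10, 1: top-100, 2: top-1000, 3: beyond
        counts[bucket] += 1
    return counts  # such counts/averages can feed a shallow classifier
```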

Together with the public release of the largest GPT-2 model (consisting of 1.5B parameters), OpenAI released a sequence classifier based on a pre-trained RoBERTa [30] model, fine-tuned to distinguish between texts generated by the GPT-2 model and real texts [39]. The detector was trained on 250,000 samples from the WebText dataset [35] and an equal amount of texts synthesized with GPT-2 using a mixture of sampling methods.

In a similar vein, Zellers et al. [48] have proposed adding a linear classification layer on top of their powerful Grover language model. They argue that Grover's capability to generate text also makes it a strong detector. Their largest Grover-Mega detector has been trained on an equal number of human-written articles and articles generated by Grover-Mega using top-\(p=0.94\) sampling. In their experimental results, it is shown to outperform other detectors (including a detector based on BERT [15]), although later work [43] questions the generalizability of using Grover as a detector when other potential generators are taken into consideration.

Although there is a growing amount of research on detecting text generated by language models, there is still a lack of understanding of which detection models perform best, especially when they have to generalize to data from other distributions than those they were trained on. In a real-world scenario, a sophisticated attacker is unlikely to generate text straight from a publicly available language model on which a detection model can be trained. Instead, it can be expected that such language models are retrained on other types of (non-public) text data before use, and that some kind of controlled text generation method is used to steer the content of the generated text. The use of alternative sampling mechanisms, or even adversarial attacks aimed at confusing specific detection models, can also cause the generated text to deviate significantly from what a public language model would generate using default settings. Hence, there still exist many research questions to address within this area of research.

Although the focus in this article is on detection of computer-generated text, it is highly related to the more well-researched problem of bot detection. Bot detection has been an active field of research for more than a decade [9], i.e., much longer than there have been widespread discussions on the impact of social bots on polarization and the spread of disinformation [37]. Early machine learning-based approaches to detecting automation of social media accounts were often based on relatively simple measurements related to posting behavior, posted content, and account properties of individual accounts [6], but such approaches do not work as well today, as newer generations of bots are often far more sophisticated than previous ones [10]. As demonstrated in [42], bot detection systems are often not robust enough to generalize to social bot scenarios that are not part of the training data. Moreover, the false positive and false negative rates of such systems on real-world data can be questioned [36]. Graph-based approaches that take coordination and synchronization among groups of accounts into consideration when classifying the accounts are therefore becoming increasingly popular [9]. For coming generations of social bots, it is more than likely that controllable text generation using language models will be utilized for generating the textual content, making the bot detection problem even more challenging than it already is today.

Both bot detection and detection of computer-generated text can be seen as an arms race, where improved detectors may trigger increasingly sophisticated attack strategies designed to bypass the defense. For this reason, it becomes relevant to study the robustness of the developed machine learning-based detection methods against adversarial attacks, i.e., slight modifications of the input designed to be difficult for machine learning-based models to classify accurately [22]. To the best of our knowledge, adversarial attacks have not previously been studied in the context of detecting language model-generated text, but the relatively immature research field of adversarial examples is quickly evolving. In white-box attacks, in which an attacker has perfect knowledge of the classification model used for detection, it is in many cases a rather straightforward optimization problem to modify the input so that a minimal perturbation causes it to be misclassified by the model, e.g., by changing a few pixels in an input image. This can, for example, be accomplished using gradient-based search algorithms such as FGSM [23] or L-BFGS [41]. In black-box scenarios, in which the detection model used is unknown to the attacker, it becomes more challenging to carry out adversarial attacks. However, it has been demonstrated that adversarial examples transfer surprisingly well [41], so that an attack optimized for a substitute model to which the adversary has access is likely to be misclassified also by the target detection model unavailable to the attacker. Suggested defenses against adversarial attacks include adversarial training [23, 41] and defensive distillation [33], but these defenses may often be broken using black-box attacks or more expensive iterative optimization attacks [22].

For textual input, adversarial attacks are less studied. Such attacks are somewhat different, as they have to be carried out on the level of individual characters or words rather than on the level of pixels. While a slight change of the intensity of a single pixel rarely changes the overall content of an image, adding or changing a single token in a sentence may change its meaning completely. Despite this, various attacks for NLP applications continue to emerge in the research literature [49], which emphasizes the need to also evaluate the robustness against adversarial attacks of machine learning-based models aimed at detecting text generated by language models.

4 Task definition

We assess the performance of promising machine learning-based detection algorithms suggested in the research literature for distinguishing between real and machine-generated texts with respect to:

  1. their ability to generalize to different domains, generators, and control mechanisms, and

  2. the extent to which they are robust against adversarial attacks.

The generalizability aspect is important since, in practice, it is most likely that an adversary who generates text in an information operations context will do so using a method that produces data deviating from what the detection model was originally trained on. This can, e.g., be the result of the attacker using a new or fine-tuned language model, an alternative sampling strategy, or steering the generated text toward a specific narrative of interest.

The robustness aspect becomes relevant in situations where the attacker knows that a defender may use machine learning-based detection models to automatically identify use of machine-generated text. If there are publicly available detection models, the attacker may design adversarial examples specifically targeted for being misclassified by these models. If the defender instead uses a non-public detection model, black-box attacks may still be a valid threat. For these reasons, it is of interest to evaluate both the generalizability and robustness of the detectors. The performance of the detectors is evaluated on a binary classification task, i.e., predicting whether individual texts have been machine generated or not.

5 Experimental setup

In this section, the experimental setup used to investigate the generalizability and robustness of promising detectors is described. When evaluating the performance of such detectors, representative data become highly important. Section 5.1 describes the datasets used for evaluation of the models, while Sect. 5.2 presents the actual detection models that have been tested on this data. Section 5.3 describes the methodology for investigating the detection models’ generalizability, while Sect. 5.4 describes how the robustness has been evaluated using white-box and black-box adversarial text attacks. In the experiments, machine-generated text is treated as the positive class. The performance is measured in terms of accuracy, precision, recall, and F1-score. A high-level overview of the conducted experiments is given in Fig. 2. Details on the hyperparameters used for each generation strategy are described in further detail in “Appendix A.2.”

Fig. 2
An illustration of the conducted experiments. The robustness of the detection models is evaluated using four different groups of datasets with increasing levels of difficulty. The evaluation begins with texts generated with nucleus sampling, continuing with out-of-distribution texts and in-the-wild datasets generated with novel models and sampling strategies. Finally, the models are evaluated on the most challenging dataset of adversarial examples that have been optimized to fool the detectors.

5.1 Generators and datasets

The experiments have been conducted on two very different types of textual domains: news articles and social media texts. On a finer scale, the social media texts that have been included in this research can be divided into tweets, Reddit comments, Yahoo answers, and Yelp user reviews. For each domain of interest, representative datasets have been required, covering both real human-written texts and language model-generated texts. Further details about these datasets and their generator models are presented below and summarized in Table 1.

Table 1 Generator models used to synthesize the different texts

5.1.1 News articles

For news articles, it is well known that Grover language models are able to produce highly realistic articles. For this reason, two Grover models of different sizes have been included as language models used to create machine-generated news articles, while data instances from the RealNews dataset [48] (originally used for training Grover) have been utilized as real texts. As Grover allows for conditioning on metadata relating to headline, domain, author, and date, this information was extracted from genuine news articles sampled randomly from the RealNews dataset. The generated text was thereafter sampled autoregressively from the generator, conditioned on the sampled metadata.

5.1.2 Social media texts

The GPT-2 language model has (unlike Grover) partly been pre-trained on social media data and is therefore better suited for generating such data. We fine-tuned four separate generative models, all based on a pre-trained medium-size version of GPT-2. The fine-tuning was performed on the following social media datasets, from which we also extracted the real social media texts:

  • Sentiment140 [21]: A dataset of Twitter posts originally created for sentiment analysis. Fine-tuning was carried out for one epoch on all of the 1,599,502 texts belonging to the training split of the dataset.

  • GoEmotions [14]: A dataset containing Reddit comments, originally used for fine-grained emotion classification. All of the 43,410 comments in the training split were used for fine-tuning the GPT-2 model for ten epochs.

  • Yahoo! Answers (nfL6) [8]: A dataset of 87,362 questions and their corresponding answers. The first 82,363 answers of the dataset were used to fine-tune a pre-trained GPT-2 model for ten epochs. None of the questions in the dataset were used.

  • Yelp Polarity Reviews [50]: A dataset containing an equal number of positive and negative Yelp reviews. The GPT-2 model was fine-tuned for one epoch on the training split containing 560,000 reviews.

The texts from each dataset that were not used for training the generators were later utilized as real texts when evaluating the various detectors. Each dataset was obtained from Huggingface Datasets.
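To make the fine-tuning setup more concrete, the sketch below shows how one such run could be implemented with the Hugging Face libraries. It is our illustration rather than the authors' code: the model size and number of epochs follow the text above, but the dataset identifier, column name, batch size, and sequence length are assumptions.

```python
# Sketch of fine-tuning GPT-2 (medium) on tweets with a causal LM objective.
from datasets import load_dataset
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

dataset = load_dataset("sentiment140", split="train")      # tweets; "text" column assumed

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)   # causal LM labels

args = TrainingArguments(output_dir="gpt2-tweets", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized).train()
```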

5.1.3 Controlled text generation

While the data described so far cover the different domains and generators used in the experiments, some further complexity arises when taking the controllability into account. As explained earlier, an attacker may want to be able to control the content of what is being generated, which potentially can have an impact on the detectors’ performance. In addition to unconditioned text generation (and conditioning on sampled metadata for the Grover generator) as described above, we have also controlled a subset of the generated news articles and social media texts on a more fine-grained level using PPLM and GeDi.

For PPLM, two different attribute models were used. The first attribute model was a simple bag-of-words (BoW) model, in which a list of military-related terms was used to represent a military topic. According to this straightforward model, the likelihood of a text containing the military topic is given by the sum of the likelihoods of each word in the bag. As the second (slightly more complex) attribute model, a single linear layer was trained on top of the last hidden state of each generator model on the task of classifying sentiment (based on data from the Stanford Sentiment Treebank [38]). Once trained, gradients from the attribute models were used to steer the generated texts to be (1) positive or negative, or (2) military-related, respectively, while simultaneously taking gradient steps in the direction of high likelihood as expressed by the underlying text generation model.
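The BoW attribute score can be sketched in a few lines (our illustration, not the PPLM code): the topic likelihood at a generation step is the probability mass that the language model assigns to the words in the bag.

```python
# Sketch of a bag-of-words attribute score for PPLM-style topic steering.
import torch

def bow_log_likelihood(next_token_probs: torch.Tensor, bow_token_ids: list) -> torch.Tensor:
    """next_token_probs: 1-D distribution over the vocabulary at one generation step."""
    # Log of the total probability assigned to the topic-related terms.
    return torch.log(next_token_probs[bow_token_ids].sum() + 1e-12)

# During PPLM generation, the gradient of this score with respect to the model's
# latents is what nudges the text toward the chosen topic.
```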

For GeDi, the generated text was also steered toward a specific sentiment (negative) or topic (food). For this purpose, pre-trained generative discriminators were utilized. The parameters used for steering the generation using PPLM and GeDi are found in “Appendix A.2.”

5.2 Detection models

Although many different types of detection models have been suggested in the research literature for the task of discriminating between real and machine-generated texts (as described in more detail in Sect. 3), we have in the experiments reported here focused on Transformer-based detection models, as these have shown the most promising results in previous research. In our conducted experiments, only pre-trained detectors which are publicly available have been included. The external detectors are interesting as they are available out of the box, making them reasonably easy to use for various types of actors (such as social media platforms) attempting to detect online information operations. As shown in Table 2, we have included several versions of pre-trained Transformer-based detection models from OpenAI, based on the RoBERTa architecture. The difference between the OpenAI RoBERTa-Base and OpenAI RoBERTa-Large models lies in the number of parameters; the Large model is simply a deeper network, consisting of more Transformer layers than the Base model. Both models have been fine-tuned on the task of detecting generated text using the same dataset. We have also included a large Grover-Mega model, in which a linear classification layer has been trained on top of the Grover language model, as described in Sect. 3.

Table 2 A list of the detection models used in the experiments

While the pre-trained detection models have already been trained on a mix of real and machine-generated text, they do not necessarily cover the same domains as those they are applied to in the experiments. Although we do want detection models that generalize to data they have not been trained on, it may be too much of a challenge for a detection model that has only been trained on well-written news articles to generalize to shorter and less formal social media posts. For this reason, we have additionally included an OpenAI RoBERTa-Large model that has been further fine-tuned for half an epoch on the training data listed in Table 3, in order to get a better sense of the importance of domain-specific training examples when generalizing to previously unseen domains. The data contain a mix of real and machine-generated texts, where the latter span from news articles to tweets and Reddit comments. Fifty percent of the generated texts were created using Grover and the rest by GPT-2. We used 70% of the text samples for fine-tuning and 10% for validation. The remaining 20% were used as test data during evaluation. We used a batch size of 128 and a learning rate of \(5\times 10^{-5}\) when fine-tuning the RoBERTa model.

Table 3 Data used for fine-tuning and validating the RoBERTa model
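For illustration, such a detector fine-tuning run could look roughly as follows. This is our sketch, not the authors' code: the batch size and learning rate follow the values stated above, while everything else (including starting from the base roberta-large checkpoint rather than the released OpenAI detector weights, and the toy dataset) is an assumption.

```python
# Sketch of fine-tuning a RoBERTa-based detector as a binary sequence classifier.
from transformers import (RobertaForSequenceClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

# Hypothetical training data: machine-generated text is the positive class (label 1).
train = Dataset.from_dict({
    "text": ["a human-written example ...", "a machine-generated example ..."],
    "label": [0, 1],
})
train = train.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256), batched=True)

args = TrainingArguments(output_dir="roberta-detector",
                         per_device_train_batch_size=128,
                         learning_rate=5e-5,
                         num_train_epochs=0.5)
Trainer(model=model, args=args, train_dataset=train).train()
```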

5.3 Evaluating generalizability

In order to evaluate the generalizability of the various detectors, they have been tested in experiments of increasing difficulty. First of all, the detectors were tested on test data held out from the dataset already described in Table 3. These texts have been generated with similar sampling strategies and language models as the texts used for training the detector models. The fine-tuned RoBERTa model has the advantage of being trained on data from the same domain, while this is not the case for the other detection models.

Next, the detectors were tested in more challenging out-of-distribution evaluations focusing on the impact of fine-grained control mechanisms such as PPLM and GeDi on the detectors’ accuracy. Hence, this experiment simulated a scenario in which an attacker attempts to steer the text generation in a certain direction, such as following a specific narrative. Texts synthesized with PPLM and GeDi were not present in any of the detectors’ training data, which simulates a more realistic scenario where a defender cannot be assumed to have knowledge of the techniques used by the attacker.

Finally, we also wanted to get an idea of how well the detectors generalize to in-the-wild data, possibly originating from completely different types of generators than those the detectors have been trained on. For this reason, a number of additional datasets have been experimented with to test the detectors’ in-the-wild detection capabilities:

  • TweepFake dataset [16]: TweepFake (Twitter Deep Fake Dataset) is a dataset consisting of a mix of tweets written by 23 genuine Twitter accounts and equally many bot accounts automatically posting impostor tweets using various language models. The tweets synthesized by bots were generated with language models such as GPT-2, RNNs, and LSTMs. In total, the dataset contains 25,836 tweets with an equal number of human and bot tweets.

  • Deepfake bot submissions dataset [47]: A dataset consisting of 795 human-written comments and 1,001 comments generated with the 124M version of GPT-2, fine-tuned on real comments submitted to a federal public comment website for Medicaid Reform Waiver. We use the 795 human-written comments and an equal amount of the generated comments in our evaluations.

  • Mixed NLG dataset [43]: A comprehensive dataset with texts synthesized with eight different Transformer-based language models, as well as texts written by humans. The dataset contains 1066 texts from each of the models, in addition to equally many human-written texts.

  • GPT-3 dataset [5]: A dataset containing samples generated with the full 175B version of GPT-3, the state-of-the-art successor of GPT-2. We split the GPT-3 samples each time an end-of-text token appears, resulting in a total of 2008 texts. Equally many real texts have been taken from the WebText dataset.

Information about all the test datasets used to evaluate the detectors’ ability to generalize is summarized in Table 4.

Table 4 All of the (balanced) test datasets used to evaluate the detectors

5.4 Evaluating robustness to adversarial attacks

In the last experiments, the robustness of the detection models to adversarial examples was evaluated using perturbed inputs explicitly designed to cause misclassifications. First, a subset of the generated texts was post-processed with the DeepWordBug [19] adversarial attack algorithm, with the goal of making the detectors misclassify them as human written. Human-written texts were not attacked, as it is unlikely that an attacker would be interested in carrying out an attack in that direction. The adversarial attack algorithm ranks each token of the input according to its individual contribution to the classification score. Subsequently, the algorithm perturbs the most influential tokens with one of four character-level transformations: adjacent character swapping, character substitution, character deletion, and character insertion. The attacks were restricted such that a Levenshtein edit distance of no more than 30 was allowed between the adversarial example and the original text. In the first robustness experiment, a white-box attack was carried out against the large OpenAI RoBERTa model. In a second experiment, it was investigated how well this attack transfers to the other RoBERTa models in a black-box setting.
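The four character-level transformations can be illustrated with the following simplified sketch. It is our illustration, not the original DeepWordBug implementation or an attack-library recipe: word importance ranking and the Levenshtein-distance constraint of the real attack are only hinted at in comments.

```python
# Simplified sketch of DeepWordBug-style character perturbations.
import random
import string

def perturb_word(word: str) -> str:
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    op = random.choice(["swap", "substitute", "delete", "insert"])
    if op == "swap":        # swap two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "substitute":  # replace one character
        return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]
    if op == "delete":      # remove one character
        return word[:i] + word[i + 1:]
    return word[:i] + random.choice(string.ascii_lowercase) + word[i:]   # insert

def attack(text: str, important_indices: list, max_perturbed_words: int = 10) -> str:
    """Perturb the most influential words (ranked beforehand by their contribution
    to the detector's score). The real attack additionally constrains the
    Levenshtein distance to the original text to at most 30."""
    words = text.split()
    for idx in important_indices[:max_perturbed_words]:
        words[idx] = perturb_word(words[idx])
    return " ".join(words)
```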

6 Results

In this section, the experimental results are presented. The detection models’ generalizability is evaluated in Sect. 6.1, while the results from the straightforward white-box and black-box adversarial attacks are presented in Sect. 6.2.

6.1 Generalizability results

When presenting the detection models’ achieved performance (evaluated on balanced test data), the experiments are grouped into in-distribution, out-of-distribution, and in-the-wild settings. In addition to the reported accuracies, detection results in terms of precision, recall, and F1-score can be found in “Appendix B.” The best result achieved for each dataset is marked in bold font.

6.1.1 In-distribution detection

Table 5 shows the performance of the detector models on the test data generated with the models in Table 1 using nucleus sampling.

Table 5 Detection accuracy (%) on the in-distribution datasets

Across all of the evaluated detection models, irrespective of whether they have been fine-tuned on data from this particular domain or not, news articles seem to be relatively easy to detect, especially texts generated with the smaller Grover model. Given the relatively large text length of news articles, this is not surprising. However, despite their length, articles from the Grover-Mega generator remain difficult to distinguish from real articles for most of the detectors, even though the fine-tuned OpenAI detector reached an accuracy of 95.25%.

All of the shorter social media texts, apart from those from the Yelp Polarity dataset, were undoubtedly more difficult to detect across all of the models. The OpenAI-Large model reached accuracies of 66.57% and 69.27% for texts from the Sentiment140 and GoEmotions datasets, respectively, with the fine-tuned OpenAI-Large model achieving better accuracies, especially for the Sentiment140 dataset. Notably, the 1.5B parameter Grover-Mega discriminator, trained solely on news articles, performed only slightly better than random chance.

All in all, these results suggest that current state-of-the-art detectors cannot reliably distinguish between real and machine-generated social media posts, not even when having access to training data from a similar distribution.

6.1.2 Out-of-distribution detection

In the second experiment, the impact of controlling the text output using PPLM and GeDi was studied. The results are presented in Table 6.

Table 6 Detection accuracy (%) on the texts generated with PPLM and GeDi

For PPLM, there does not seem to be a notable impact on the detection performance, especially for the best performing RoBERTa models. However, this is not the case for texts generated with GeDi, which in general seem to be harder to detect. The impact is especially noticeable when using GeDi to generate texts with a negative sentiment. This is probably due to GeDi being a more sophisticated control model than PPLM, able to steer the text toward a specific topic without compromising the humanness of the texts as much as PPLM does. This hypothesis is strengthened when manually inspecting samples of the generated texts, as discussed in more detail in Sect. 7.

Interestingly, the OpenAI-Large detector fine-tuned on, e.g., the Sentiment140 and GoEmotions datasets consistently performed worse than the same detector without fine-tuning when applied to the corresponding data controlled by GeDi. This suggests that texts generated with GeDi undergo a noteworthy distribution shift compared to the corresponding texts generated solely with nucleus sampling.

6.1.3 In-the-wild detection

In the last experiment on detection generalizability, the detectors were evaluated on the in-the-wild datasets, which arguably give a better indication of how well the trained detectors generalize to generators and text generation methods other than those they have been trained on. The obtained results are presented in Table 7.

Table 7 Detection accuracy (%) on the in-the-wild datasets

Texts generated with the relatively simple GPT generator were surprisingly hard to detect across all detection models, more so than texts from the 175B parameter state-of-the-art GPT-3 generator. Likewise, generations from the cross-lingual XLM model and from XLNet were equally difficult to detect. However, after manual inspection of the texts from especially the two latter models, we found them to be of such poor quality that they would not be especially useful for an attacker attempting to use language models for conducting information operations. Therefore, these detection accuracies are of limited practical interest.

The fine-tuned OpenAI-Large model did not generalize particularly well to tweets from the TweepFake dataset, even though it was trained on data that included real and generated tweets. Notably, the OpenAI-Large model that was not fine-tuned on Sentiment140 was better at detecting fake tweets from the TweepFake dataset than the fine-tuned version. Although this is a reasonable result, given that the tweet datasets contain texts synthesized with different models, it shows how brittle the detectors are to model variations.

6.2 Robustness results

As a robustness test, the synthesized Yahoo Answers and Yelp Polarity texts were post-processed using the DeepWordBug algorithm described in Sect. 5.4. Both attacks were performed against the OpenAI-Large detector, as it was overall the best performing detector in terms of generalizability. Table 8 summarizes the results of the conducted attacks. An example of one of the generated adversarial examples is shown in Fig. 3.

Table 8 Attack results when perturbing 1000 generated texts of the Yahoo Answers and Yelp Polarity datasets, respectively
Fig. 3
An adversarial example and the original text from the Yelp Review dataset. The three edited words cause the OpenAI-Large model to incorrectly change its classification of the text from machine-generated to human-written.

Clearly, the attacks were effective, causing a majority of the synthesized texts to be classified as human-written. However, these attacks require that the attacker has access to the detection model. This may well be the case for publicly available detectors, but not for non-public detectors trained on private datasets. Therefore, to further test the effectiveness of the adversarial examples, we also considered an alternative scenario in which the attacker was not given access to the detector itself, but only partial information about its model architecture. Hence, some of the other RoBERTa-based detector models were evaluated on the adversarial examples optimized for OpenAI-Large, investigating to what extent the adversarial examples transfer between the models. The results of the transferability experiment are shown in Table 9.

Table 9 Accuracy on the datasets of adversarial examples generated for the OpenAI-Large model

Interestingly, the adversarial examples remained adversarial to a large extent across the detection models, causing a severe decrease in detection accuracies. A model that cannot be accessed and queried by the attacker is therefore not necessarily safe, as the attacker can use adversarial examples computed for a surrogate model trained on the same task as the target model. This is crucial to keep in mind when considering fine-tuning a publicly available detection model on a domain-specific detection task. As the fine-tuned model shares the same architecture and many features with the original model, it is likely to be brittle to the same adversarial examples that fool the original detector.

7 Discussion

Given the experimental results, it is quite clear that, among the evaluated detection methods, the detectors based on a RoBERTa architecture in general perform better than a Grover-based detector on detection tasks involving data from other distributions than those they have been trained on. This is in line with results from Uchendu et al. [43], suggesting that a pre-trained Grover detector does not perform well on textual data generated by language models other than the Grover generator.

Somewhat surprisingly, it does not seem to consistently help to fine-tune the off-the-shelf OpenAI RoBERTa detector on more representative data for the actual detection task. Most likely, this is due to the experimental setup used. In the experiments, we fine-tune on a number of data sources at once, rather than on a single data source, since we aimed at investigating how to build generalizable detectors rather than reaching state-of-the-art performance on single datasets. When aiming for the latter, it is probably a better approach to pre-train a large-scale RoBERTa detector on a large and varied dataset, and then use this detector as a base when fine-tuning individual detectors for each domain of interest.

As has been shown, almost all the evaluated detectors perform worse on social media posts than on news articles. This is a problem, as we have identified the social media domain as being of high importance from an information operations perspective. For this reason, it is of practical importance to be able to increase the detection performance, especially for short posts such as tweets. In initial follow-up experiments, we have found that it is possible to increase detection accuracy by concatenating several posts from the same source. This is especially useful when classifying social media posts, as such posts can be obtained and concatenated on a user level. As an example, we can increase the detection accuracy of the fine-tuned OpenAI RoBERTa detection model on the Sentiment140 dataset from 82.2% to 98.9%, simply by classifying concatenations of ten tweets rather than individual tweets, as illustrated in Fig. 4. This is a promising strategy given the low number of tweets needed to reach an accuracy of this magnitude. Nonetheless, it is only feasible under the assumption that the accounts are not posting a mix of human-written and machine-generated text.

Fig. 4
Detection accuracy of the fine-tuned OpenAI-Large model on generated tweets from the Sentiment140 dataset, as a function of the number of tweets used in each prediction.
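The concatenation strategy described above can be sketched as follows. This is our illustration, not the authors' code, and the per-account majority vote over chunks is an added assumption about how account-level decisions could be aggregated.

```python
# Sketch of classifying concatenations of posts per account instead of single posts.
def chunked_predictions(tweets_by_user: dict, detector, chunk_size: int = 10) -> dict:
    """tweets_by_user maps an account id to a list of its tweets; `detector` is any
    callable returning 1 for machine-generated and 0 for human-written text."""
    results = {}
    for user, tweets in tweets_by_user.items():
        chunks = [" ".join(tweets[i:i + chunk_size])
                  for i in range(0, len(tweets), chunk_size)]
        # Majority vote over the chunk-level predictions for this account.
        votes = [detector(chunk) for chunk in chunks]
        results[user] = int(sum(votes) > len(votes) / 2)
    return results
```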

7.1 Quality of the generated texts

Although contemporary language models can generate texts of unprecedented quality, there is still a risk that some generated texts end up highly repetitive or with other defects. This is especially relevant for generated news articles controlled with GeDi and PPLM, as their outputs are more prone to defects due to the extra complexity the control mechanisms add to the text generation process. Since an attacker is likely to reject synthesized texts with obvious defects, especially news articles, and since evaluations of detection algorithms may yield overly optimistic results if a large fraction of such low-quality texts is used during testing, there was a need to verify the quality of the generated texts used in the experiments. A limited manual assessment was performed on data samples generated using all different combinations of generators and control mechanisms used in the experiments. In this assessment, a file consisting of 3000 random samples was created per combination. Each file was checked for approximately ten minutes by each of the two authors (independently of each other), whereupon the observations were discussed. The high-level finding from this manual assessment is that the generators produce impressive content of high quality, especially when the topic of the text is not controlled using PPLM or GeDi. GeDi succeeds well in controlling the topic, especially for shorter social media posts, which are very hard to tell apart from genuine social media posts; it was in general somewhat easier for longer news articles, as these in many cases had more problems maintaining a consistent thread throughout the article, compared to the corresponding uncontrolled generated news articles. For PPLM, there were in some instances more visible signs of repetition, or the attempt to follow a certain topic created less trustworthy content. There were also more samples in which PPLM did not succeed in having an impact on the topic of the generated text.

In addition to this manual assessment, simple heuristics were utilized for a more quantitative text quality assessment. However, this was only used for news articles, as we found it to be highly variable and not sufficiently well correlated with human judgement when evaluated on shorter social media posts. The method used for this quantitative assessment and the obtained results are described in detail in “Appendix C.1.” The results confirm the findings that the generated news articles are generally of high quality, with slightly more quality issues for texts controlled using PPLM.

Fig. 5
Top: an example of a news article generated by the Grover-Mega generator. Bottom: an article generated by Grover-Base and controlled using PPLM. The bottom text achieved the highest perplexity among the texts generated in this way.

To better illustrate the quality of the generated texts, two examples are shown in Fig. 5. The first generated news article received a rather low perplexity value, while the second received the highest perplexity among the PPLM-controlled Grover-Base articles. As can be seen, the objective to introduce positive sentiment has, in the bottom example, gained too much influence over the textual content, as reflected in the perplexity score. In general, this phenomenon tends to occur more often for PPLM than for GeDi. More examples of generated texts for different domains, with and without attribute models, are provided in “Appendix D.”

To summarize, the control mechanisms seem to work overall, but there certainly are individual cases where the investigated generators and control mechanisms fail to produce texts with a content and quality that suit the needs of an attacker. Our findings suggest that large-scale language models such as Grover combined with GeDi pose more of a viable threat from an information operations perspective, compared to PPLM, which is harder to control and often results in generated text of slightly lower quality.

7.2 Future work

In the experiments presented in this article, and in almost all existing research on detection of text generated by language models, only English texts have been taken into consideration. Information operations involving machine-generated text are in practice not likely to only involve generation of English text, but rather a wide variety of languages adapted for the intended target groups. For this reason, future work in this area should not only focus on English, as detectors may perform differently on other languages due to factors such as the amount of available training data and the morphology of the language.

Another idea for future work is to attempt to increase the robustness of detectors against adversarial attacks, as the best existing detectors in this work have been shown to be highly susceptible to both direct and indirect adversarial attacks. Therefore, it is of interest to evaluate how well approaches based on, e.g., adversarial training and out-of-distribution detection methods work in the context of building more robust detectors.

Finally, it would also be interesting to study to what degree language models like GPT-3 can be controlled by attackers at inference time, simply by conditioning on a few examples of the types of texts of interest. This type of in-context learning has been shown to work surprisingly well for other tasks [5], but it is rather sensitive to the exact choice of prompt and would probably not work for every attribute an attacker would like to control in an information operations context.

8 Conclusions

Control mechanisms such as PPLM and GeDi provide users with more fine-grained control of what is being generated by neural language models such as GPT-2 and Grover. Unfortunately, this increases the risk of malicious actors misusing automatically generated text for creating and spreading disinformation. Several detection algorithms have been suggested in the research literature for predicting whether texts have been computer generated or not. In this work, the generalizability and robustness of several machine learning-based detectors have been investigated. Overall, the detectors were able to tell computer-generated news articles apart from real ones with reasonable accuracy, while the same task was considerably more challenging for shorter social media posts. Controlling the text generation process with PPLM does not seem to increase the difficulty of the detection task, while the contrary holds for output controlled by GeDi. When evaluating the detection methods on in-the-wild datasets and on data from outside the distribution the detectors have been trained on, the accuracy decreases significantly. Furthermore, even the best performing RoBERTa-based detector is shown to be highly sensitive to simple adversarial attacks, causing it to perform worse than random under white-box attacks in which the detection model is accessible to the attacker. The adversarial attacks are also shown to transfer well, i.e., the attacker can severely reduce the detector’s accuracy even without access to the detection model.

These results question the practical usefulness of current state-of-the-art detection methods, and call for more research on how to improve their generalizability and robustness.