1 Introduction

Through large language models (LLMs), such as ChatGPT (Generative Pre-trained Transformer), generative artificial intelligence (GenAI) has become a cheap and quick means of generating misleading or fake stories that mimic human writing with coherence and fluidity [1, 2]. This new means of creating and disseminating information disorder contributes to the computational amplification phenomenon, as content can be produced and disseminated at scale [3]. The potential consequences include harming online communities and manipulating public opinion by spreading disinformation or conspiracy theories [2].

Among the numerous ethical and practical challenges related to the use of GenAI systems, the ability to detect machine-generated text accurately is a crucial issue [4, 5]. Hence, the primary approach to countering machine-generated misinformation or disinformation involves detection systems. Research in this field has started to grow, but most studies are only available on arXiv, meaning that they have not yet been peer-reviewed or are still awaiting publication, making it difficult to establish a well-defined standard. However, the available results demonstrate the limitations of current detection systems. First, they cannot be considered accurate and reliable tools, as they do not differentiate effectively between human and machine writing [6]. Second, they suffer from several limitations, most of which stem from their binary classification framing and their dependence on the English language, rendering them ineffective in many cases [7]. Even the classifier developed by OpenAI, the company behind ChatGPT, proved unreliable, generating more false positives than true positives, which led to the shutdown of the online service [8].

Machine-generated texts have become so sophisticated that they are increasingly difficult to distinguish from human writing, even for experts [2, 9, 10]. This capacity to generate compelling pieces extends far beyond the misuse of GenAI in creating and disseminating information disorder. For instance, journalists and news publishers also employ GenAI systems to provide information to their audiences. According to a survey published by the World Association of News Publishers (WAN-IFRA) [11], half of newsrooms worldwide are already using GenAI technologies.

Generating misleading or inaccurate content is not always intentional, as the system is likely to produce wrong or inaccurate outcomes without being prompted to do so [12]. This phenomenon, called ‘artificial hallucination’, is described as generating realistic experiences that do not correspond to any real-world input [13]. It occurs when the generated content relies on the internal logic or patterns of the system [14] and can be explained by the fact that the system was trained on large amounts of unsupervised data [5]. The black-box nature of the system also explains its malfunctions [14]. Furthermore, research has pinpointed that the process followed by LLMs is error-prone, starting with biased training data, which not only threatens the accuracy and reliability of machine-generated content but also increases the risk of generating harmful content [15, 16].

Because LLMs are just as likely to be used to inform as to misinform or disinform [13], detecting the human or non-human nature of a text cannot guarantee that a given piece of content has been intentionally manipulated. From this perspective, the relevance of direct detection systems is questioned in the context of news information [17, 18]. Moreover, distinguishing truthful text from misinformation has become particularly challenging, as machine-generated misinformation presents a writing style similar to that of machine-generated texts with true content [19], while research has primarily focused on detecting AI-generated text without addressing this specific context [20].

On the other hand, there is a need to develop more comprehensive approaches that consider the broader ecosystem of dis- and misinformation dissemination. This requires a nuanced perspective, acknowledging that transparency about the nature of a text’s authorship is insufficient to address the multifaceted challenges posed by misleading content. Although research has stressed the importance of semantic detection and fact verification in preventing and detecting the misuse of machine-generated content [21, 22], these computational approaches remain limited [23]. This is mainly because automated verification or fact-checking requires socio-technical considerations upstream and downstream of the process: not only do humans use these automated tools at the end, but verification and fact-checking still call for a human touch, particularly the critical and nuanced judgement that is difficult to automate [24,25,26]. At the same time, research has also demonstrated the added value of human expertise in evaluating and mitigating artificial hallucinations [27, 28].

Building upon these considerations, this study participates in the paradigm shift from classifying a news piece as human or non-human towards focusing on content quality by evaluating the presence of manipulated or fake content. It therefore explores the potential of leveraging human-based judgement methods from the field of natural language processing (NLP) to assess the characteristics of machine-generated content [29, 30]. Specifically, it outlines the potential applications of the Information Disorder Level (IDL) index, a human-based judgement metric designed to evaluate the factual accuracy of machine-generated content. It demonstrates that the assessment of machine-generated content is most reliable when done by humans, because it involves critical thought about the meaning of the information and its informative value, which relates to the accuracy and reliability of the news.

2 Method

In NLP, human-based evaluations involve judges (experts) who are asked to rate a corpus of machine-generated and human-written texts by assigning a score on a rating scale. In Lester and Porter’s experiment, for instance, one of the first in this field, eight experts were asked to rate 15 texts according to different criteria (quality, consistency, writing style, content, organisation and accuracy) [31]. Such an approach is intrinsic, i.e., related to the content’s quality according to several criteria. In contrast, an extrinsic approach measures the impact of the generated texts on task performance, the quantity or level of post-editing of generated texts, or the speed at which people read generated texts [32].

Assessments based on human judgement must ensure that the judges are independent, impartial and familiar with the application domain, considering that the opinions of human experts are likely to vary [33, 34]. Although such assessments are long and expensive to implement, they make it possible to assess the quality of a system and its properties, to demonstrate progress in the field and to understand its current state [30].

Human-based judgement methods have been used in journalism studies to assess audiences’ perception of automatically generated content derived from a data-to-text approach and to question the human or non-human nature of the author [35,36,37,38]. These studies also used rating scales to assess the intrinsic quality of generated texts, such as coherence, descriptive value, usability, writing quality, informativeness, clarity, pleasantness, interest, boredom, preciseness, trustworthiness and objectivity [39]; or intelligence, education, reliability, bias, accuracy, completeness, factuality, quality and honesty [40]. Hence, one of the main advantages of the method is that the quality indicators are established according to the research objective. In the context of texts generated by large language models, such as ChatGPT, human judgement can be valuable for assessing both the accuracy of an event report and the extent to which the system generates “artificial hallucinations”, from a perspective grounded in fact detection and verification.

The development of the Information Disorder Level index is grounded in these considerations. It is derived from human analysis of a corpus of forty news articles generated using ChatGPT (see Fig. 1). Our primary objective in this experiment was to test the model’s ability to create fake news articles in different styles. First, we asked ChatGPT to generate twenty fake news articles on three topics (a Russian nuclear threat to Brussels, the Chinese invasion of Taiwan, and a car accident in Norway) using five different editorial styles (factual, sensationalist, high-quality newspaper, pro-Russian, and columnist).

Fig. 1.
figure 1

Sample text: Tensions between the Wagner Group and the Russian military (based on real events, factual style).

As we observed that ChatGPT had difficulty sticking to the facts in its writing, we asked the system to generate twenty more news stories, but this time based on real-world events (a Ukrainian invasion of Russia, the death of a famous American spy, the destruction of a dam in Ukraine, and tensions between the Wagner Group and Ukrainian forces in Donetsk). While acknowledging that the system’s knowledge does not extend beyond 2021, we sought to evaluate ChatGPT’s ability to generate news articles with real-world insights using prompts based on real-world events.

The content generated by ChatGPT effectively replicated journalistic writing, which can be defined by the use of relatively short sentences and adherence to the inverted pyramid structure, a characteristic feature of journalism whereby the narrative progresses from general information to specific details [41, 42]. However, strict adherence to the facts proved the most challenging for the system. ChatGPT also tended to add comments or opinions that had nothing to do with factual journalism. We hypothesised that this was due to the nature of the prompts, as the system was also being asked to generate editorials.

To define the Information Disorder Level (IDL) index, we considered that each sentence of a text contains short pieces of information ranging from ‘True’ to ‘False’. However, assessing the factuality of a sentence can be more nuanced than such a binary approach. Hence, we introduced the intermediate ‘Mostly true’, ‘Cannot say’ and ‘Mostly false’ levels. We defined the different levels as follows:

  • True: Completely true or accurate and reliable (informative).

  • Mostly True: Predominantly true with some elements of falsehood.

  • Cannot Say: Difficult to determine accuracy.

  • Mostly False: Predominantly false with some elements of truth.

  • False: Completely false or incorrect (mis- or dis-informative).

The IDL index is the sum of the cumulative scores for ‘Mostly true’ (1 point per sentence), ‘Mostly false’ (2 points per sentence) and ‘False’ (3 points per sentence), divided by the number of sentences so scored multiplied by 3 (the maximum possible score per sentence). The ‘Cannot say’ answer is not included in the formula, based on the assumption that, as a joker, it does not provide meaningful input to the evaluation process. The index is then normalised on a scale ranging from 0 to 10.

The formula for the IDL index can be expressed as:

$$\begin{aligned} \text {IDL index} = \left( \frac{(\text {MT} \times 1) + (\text {MF} \times 2) + (\text {F} \times 3)}{\text {(MT + MF + F)} \times 3}\right) \times 10 \end{aligned}$$


where

$$\begin{aligned} \text {MT} & = \text {number of sentences classified as `Mostly True'} \\ \text {MF} & = \text {number of sentences classified as `Mostly False'} \\ \text {F} & = \text {number of sentences classified as `False'} \end{aligned}$$
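As a minimal sketch, the formula above can be computed from a list of per-sentence labels as follows (the function name and label strings are illustrative assumptions, not part of the actual tool, which is implemented in JavaScript):

```python
def idl_index(labels):
    """Compute the IDL index from per-sentence labels.

    Expected labels: 'True', 'Mostly True', 'Cannot Say',
    'Mostly False', 'False'. Per the formula, only the three
    weighted labels enter both numerator and denominator;
    'True' and 'Cannot Say' are excluded.
    """
    weights = {'Mostly True': 1, 'Mostly False': 2, 'False': 3}
    assessed = [label for label in labels if label in weights]
    if not assessed:
        return 0.0  # no falsehood detected among the assessed sentences
    # Weighted sum over the maximum possible score, normalised to 0-10
    return sum(weights[label] for label in assessed) / (len(assessed) * 3) * 10
```

For example, a text with one ‘Mostly true’ and one ‘False’ sentence scores (1 + 3) / (2 × 3) × 10 ≈ 6.7, while a text rated entirely ‘False’ scores 10.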

At the operational level, we developed a JavaScript interface that allows a user to evaluate a machine-generated text using the metric. The tool consists of a three-stage process. Two fields are displayed on the first screen: the first for pasting the machine-generated text and the second for pasting the prompt used (see Fig. 2). The second stage consists of the actual assessment after sentence tokenisation, i.e., the segmentation of the text based on sentence boundaries such as full stops, question marks, exclamation marks or ellipses [43, 44]. The evaluator can always refer to the prompt used to generate the text to check whether all elements are present and whether there are additional elements (see Fig. 3). In other words, the evaluator proceeds by comparing the prompt used (source) with the generated text (target). In this prototype version, we did not include the omission of facts, which could be integrated into further developments.
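A sentence segmenter along these lines can be sketched with a regular expression splitting on the boundary punctuation cited above (this is an illustrative sketch, not the tokeniser used in the actual JavaScript tool):

```python
import re

def split_sentences(text):
    """Split a text into sentences at boundaries marked by a full stop,
    question mark, exclamation mark or ellipsis followed by whitespace."""
    parts = re.split(r'(?<=[.!?\u2026])\s+', text.strip())
    return [part for part in parts if part]
```

Real-world segmentation must also handle abbreviations, decimal numbers and quotations, which is why dedicated tokenisers are generally preferred over a bare regular expression.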

Considering that news information is also characterised by the distinction between facts and comments [45, 46], we introduced the Opinions/Comments (OC) rate into the prototype. The human judge can mark a sentence as an opinion or a comment, and the OC rate corresponds to the percentage of sentences marked as such. It is considered a complementary indicator of the informational quality of the machine-generated content, although it is not the central element in assessing the factuality of an event report. In the third step of the evaluation process, a final screen provides the results (see Fig. 4).
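Assuming, as the reported results suggest, that the OC rate is expressed on the same 0–10 scale as the IDL index, it can be sketched as the share of flagged sentences (the function name is an illustrative assumption):

```python
def oc_rate(flags):
    """Opinions/Comments rate: share of sentences flagged as opinion or
    comment, normalised to a 0-10 scale (an assumption based on the
    ranges reported in the results)."""
    if not flags:
        return 0.0
    return sum(1 for flagged in flags if flagged) / len(flags) * 10
```

For instance, one flagged sentence out of four yields an OC rate of 2.5.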

Fig. 2.
figure 2

First step: paste the machine-generated text and the prompt used (optional).

Fig. 3.
figure 3

Second step: assessing the content after sentence tokenisation.

Fig. 4.
figure 4

Third and final step: showing the results.

3 Results

Each text in the corpus was evaluated using the assessment tool, and the scores for the Information Disorder Level (IDL) index and the Opinions/Comments (OC) rate were recorded in a spreadsheet. Descriptive statistics were computed for both measures. The IDL index ranged from 0 (in only two cases) to 8.2, with an average of 3.9 and a median of 3.3. Around 32.5% of the machine-generated texts received a score of 5 or above. In 80% of the cases, ChatGPT added made-up content, regardless of subject or style, and in 35% of the cases this reached alarming proportions, as measured by an IDL index of 5 or higher.
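Descriptive statistics of this kind are straightforward to reproduce; a sketch using Python’s standard library on hypothetical scores (not the actual corpus data) could read:

```python
import statistics

# Hypothetical IDL scores for illustration only (not the corpus data)
idl_scores = [0.0, 2.5, 3.3, 4.1, 5.0, 6.7, 8.2]

mean_idl = statistics.mean(idl_scores)
median_idl = statistics.median(idl_scores)
# Share of texts whose IDL index reaches 5 or above
share_alarming = sum(1 for s in idl_scores if s >= 5) / len(idl_scores)
```

The same summary (minimum, maximum, mean, median, share above a threshold) applies unchanged to the OC rate column of the spreadsheet.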

As explained previously, separating facts from opinions and comments is an ethical prerequisite in journalism. Here also, ChatGPT performed poorly, contributing thoughts or observations in 100% of the cases. No text in the corpus was exempt from such additions, with a minimum Opinions/Comments (OC) rate of 2.31, a maximum of 9.5, an average of 5.65 and a median of 5.75 (see Fig. 5). To mitigate biases in these results, we excluded the sensationalist, pro-Russian and columnist writing styles to examine the OC rate for the factual and high-quality newspaper styles (see Fig. 6). The 14 texts retained for this analysis show an average OC rate (normalised on a scale of 10) of 3.72, with a minimum of 2.31 and a maximum of 5.45.

Fig. 5.
figure 5

Descriptive statistics of the corpus.

Fig. 6.
figure 6

Sample text: Car accident in Norway (factual style).

A correlation analysis was performed to examine the possible relationship between the IDL index and the OC rate. The correlation coefficient of 0.05 suggests the absence of a meaningful positive correlation between the two variables, which may be due to the difficulty of assessing the factuality or truthfulness of a comment or an opinion [47]. The t-value (-1.58) and the p-value (0.20) indicate that there is no statistically significant difference between the means of the two variables. Additionally, a linear regression model was fitted to explore the relationship between the IDL index and the OC rate, but it did not yield statistically significant results. The low multiple R-squared (0.003) and adjusted R-squared (-0.023) values suggest that the model does not fit the data well. Therefore, based on this analysis, there is no strong evidence that the IDL index has a significant influence on, or relationship with, the OC rate.
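The correlation and regression steps above can be reproduced with standard formulas; the sketch below implements them in pure Python on illustrative paired scores (the analysis in this study may well have used a statistical package instead):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ols_fit(xs, ys):
    """Simple linear regression y = a + b*x; returns (a, b, r_squared)."""
    r = pearson_r(xs, ys)
    b = r * (statistics.stdev(ys) / statistics.stdev(xs))
    a = statistics.mean(ys) - b * statistics.mean(xs)
    return a, b, r ** 2
```

With the IDL index as `xs` and the OC rate as `ys`, `r ** 2` corresponds to the multiple R-squared reported above: a value near zero, as obtained here, indicates that the regression explains almost none of the variance.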

4 Conclusion

The limits of this experiment are related to the relatively small size of the corpus (with only forty samples) as well as to a human evaluation carried out by a single judge. Given the subjective nature of any analysis or evaluation activity, a corpus should ideally be submitted to at least two human evaluators to better frame and weigh the results. Nevertheless, the results presented in this paper illustrate the potential of using the IDL index and the OC rate as quality indicators to assess content generated by LLMs.

As ChatGPT added opinions or comments to all the samples related to the factual and high-quality news styles, it is possible to hypothesise that this mixture of genres is a clue to determining that it consists of a machine-generated non-journalistic piece. However, some media outlets and blogs fail to distinguish between facts, opinions, and comments. In addition, the sample included writing styles that, by their very nature, contained opinions or comments. Hence, further investigation is needed in this area.

The content invented by ChatGPT is part of the story’s logic and is more akin to fictionalising than to what is commonly called artificial hallucination. While ChatGPT may not fully understand its writing, the output can be considered a simulation or extrapolation of content generation. Therefore, we suggest that the invented parts of the generated texts should be understood as a product of pattern-matching abilities rather than as a manifestation of artificial hallucination.

For newsrooms using generative AI, these results suggest that every piece of machine-generated content should be verified and post-edited by a human before being published. From a digital literacy perspective, the IDL index can be considered a useful tool for understanding the limits of generative AI and encouraging critical thinking about what makes an event report factual. The corpus used and the source code of the web application developed for this experiment are available on GitHub.