Abstract
Large language models have enabled the rapid production of misleading or fake narratives, presenting a challenge for direct detection methods. Given that generative artificial intelligence tools are as likely to be used to inform as to disinform, the value of assessing the (non-)human nature of machine-generated content is questionable, especially in light of the ‘hallucination’ phenomenon, in which generated content does not correspond to any real-world input. In this study, we argue that assessing machine-generated content is most reliable when done by humans, because doing so involves critical consideration of the meaning of the information and of its informative, misinformative or disinformative value, which relates to the accuracy and reliability of the news. To explore human-based judgement methods, we developed the Information Disorder Level (IDL) index, a language-independent metric for evaluating the factuality of machine-generated content. It was tested on a corpus of forty news stories, about both invented and actual events, generated with ChatGPT. For newsrooms using generative AI, the results suggest that every piece of machine-generated content should be vetted and post-edited by humans before publication. From a digital media literacy perspective, the IDL index is a valuable tool for understanding the limits of generative AI and for prompting reflection on what constitutes the factuality of a reported event.
Keywords
- Generative AI
- natural language processing
- social science
This research was funded by EU CEF Grant No. 2394203.
1 Introduction
Through large language models (LLMs) such as ChatGPT (based on the Generative Pre-trained Transformer architecture), generative artificial intelligence (GenAI) has become a cheap and quick way to generate misleading or fake stories that mimic human writing with coherence and fluidity [1, 2]. This new means of creating and disseminating information disorder contributes to the phenomenon of computational amplification, owing to the ability to produce and distribute content on a large scale [3]. The potential consequences include harm to online communities and the manipulation of public opinion through the spread of disinformation or conspiracy theories [2].
Among the numerous ethical and practical challenges related to the use of GenAI systems, the ability to detect machine-generated text accurately is a crucial issue [4, 5]. Hence, the primary approach to countering machine-generated misinformation or disinformation involves using detection systems. Research in this field has started to grow, but most studies are available only on arXiv, meaning they have not yet been peer-reviewed or are still awaiting publication, which makes it difficult to establish a well-defined standard. However, the available results already demonstrate the limitations of current detection systems. First, they cannot be considered accurate and reliable tools, as they do not differentiate effectively between human and machine writing [6]. Second, they suffer from several limitations, most of which stem from their framing as a binary classification problem and their dependence on the English language, rendering them ineffective in many cases [7]. Even the classifier developed by OpenAI, the company behind ChatGPT, was unreliable, as it generated more false positives than true positives, leading to the shutdown of the online service [8].
Machine-generated texts have become so sophisticated that they are increasingly difficult to distinguish from human writing, even for experts [2, 9, 10]. This capacity to generate compelling pieces extends far beyond the creation and dissemination of information disorders. For instance, journalists and news publishers also employ GenAI systems to provide information to their audiences. According to a survey published by the World Association of News Publishers (WAN-IFRA) [11], half of newsrooms worldwide are already using GenAI technologies.
Generating misleading or inaccurate content is not always intentional, as the system is likely to produce wrong or inaccurate output even without being prompted to do so [12]. This phenomenon, called ‘artificial hallucination’, describes the generation of realistic content that does not correspond to any real-world input [13]. It occurs when the generated content relies on the internal logic or patterns of the system [14], and it can be explained by the fact that the system was trained on large amounts of unsupervised data [5]. The black-box nature of the system also explains its malfunctions [14]. Furthermore, research has pinpointed that the process followed by LLMs is error-prone, starting with biased training data, which not only threatens the accuracy and reliability of machine-generated content but also increases the risk of generating harmful content [15, 16].
Because LLMs are just as likely to be used to inform as to misinform or disinform [13], detecting the human or non-human nature of a text cannot guarantee that a given piece of content has been intentionally manipulated. From this perspective, the relevance of direct detection systems in the context of news information is questionable [17, 18]. Moreover, distinguishing truthful text from misinformation has become particularly challenging, as machine-generated misinformation exhibits a writing style similar to that of machine-generated texts with true content [19], while research has primarily focused on detecting AI-generated text without addressing this specific context [20].
On the other hand, there is a need to develop more comprehensive approaches that consider the broader ecosystem of dis- and misinformation dissemination. This requires a nuanced perspective, acknowledging that transparency about the nature of a text’s authorship is insufficient to address the multifaceted challenges posed by misleading content. Although research has stressed the importance of semantic detection and fact verification in preventing and detecting the misuse of machine-generated content [21, 22], these computational approaches remain limited [23]. This is mainly because verification and automated fact-checking require socio-technical considerations both upstream and downstream of the process: humans use these automated tools at the end of the pipeline, and verification and fact-checking still demand a human touch, in particular the critical and nuanced judgement that is difficult to automate [24,25,26]. At the same time, research has also demonstrated the added value of human expertise in evaluating and mitigating artificial hallucinations [27, 28].
Building upon these considerations, this study contributes to the paradigm shift from classifying a news piece as human or non-human towards focusing on content quality by evaluating the presence of manipulated or fabricated content. It therefore explores the potential of leveraging human-based judgement methods from the field of natural language processing (NLP) to assess the characteristics of machine-generated content [29, 30]. Specifically, it outlines the potential applications of the Information Disorder Level (IDL) index, a human-based judgement metric designed to evaluate the factual accuracy of machine-generated content. It demonstrates that the assessment of machine-generated content is most reliable when done by humans because it involves critical thought about the meaning of the information and its informative value, which relates to the accuracy and reliability of the news.
2 Method
In NLP, human-based evaluations involve judges (experts) who are asked to rate a corpus of generated and human-written texts by assigning scores on a rating scale. In Lester and Porter’s experiment, for instance, one of the first in this field, eight experts were asked to rate 15 texts according to different criteria (quality, consistency, writing style, content, organisation and accuracy) [31]. Such an approach is intrinsic, i.e., related to the quality of the content according to several criteria. In contrast, an extrinsic approach measures, for example, the impact of the generated texts on task performance, the amount or level of post-editing they require, or the speed at which people read them [32].
Assessments based on human judgement must ensure that the judges are independent, impartial and familiar with the application domain, considering that the opinions of human experts are likely to vary [33, 34]. Although such assessments are time-consuming and expensive to implement, their benefits include assessing the quality of a system and its properties, demonstrating progress, and understanding the current state of the field [30].
Human-based judgement methods have been used in journalism studies to assess audiences’ perception of automatically generated content derived from a data-to-text approach and to question the human or non-human nature of the author [35,36,37,38]. These studies also used rating scales to assess the intrinsic quality of generated texts, such as coherence, descriptive value, usability, writing quality, informativeness, clarity, pleasantness, interest, boredom, preciseness, trustworthiness and objectivity [39]; or intelligence, education, reliability, bias, accuracy, completeness, factuality, quality and honesty [40]. Hence, one of the main advantages of the method is that the quality indicators are established according to the research objective. In the context of texts generated by large language models such as ChatGPT, such scales can be valuable for assessing both the accuracy of an event report and the extent to which the system generates “artificial hallucinations”, from a perspective grounded in fact detection and verification.
The development of the Information Disorder Level index is grounded in these considerations. It is derived from a human analysis of a corpus of forty news articles generated using ChatGPT (see Fig. 1). Our primary objective in this experiment was to test the model’s ability to create fake news articles in different styles. First, we asked ChatGPT to generate twenty fake news stories on three topics (a Russian nuclear threat to Brussels, the Chinese invasion of Taiwan, and a car accident in Norway) using five different editorial styles (factual, sensationalist, high-quality newspaper, pro-Russian, and columnist).
As we observed that ChatGPT had difficulty sticking to the facts in its writing, we asked the system to generate twenty more news stories, but this time based on real-world events (a Ukrainian invasion of Russia, the death of a famous American spy, the destruction of a dam in Ukraine, and tensions between the Wagner Group and Ukrainian forces in Donetsk). While acknowledging that the system’s knowledge does not extend beyond 2021, we sought to evaluate ChatGPT’s ability to generate news articles with real-world insights using prompts based on real-world events.
The content generated by ChatGPT effectively replicated journalistic writing, which can be defined by the use of relatively short sentences and adherence to the inverted pyramid structure, a characteristic feature of journalism whereby the narrative progresses from general information to specific details [41, 42]. However, strict adherence to the facts proved to be the biggest challenge for the system. ChatGPT also tended to add comments or opinions that had nothing to do with factual journalism. We hypothesised that this was due to the nature of the prompts, where the system was also asked to generate editorials.
To define the Information Disorder Level (IDL) index, we considered that each sentence of a text contains short pieces of information ranging from ‘True’ to ‘False’. However, assessing the factuality of a sentence can be more nuanced than such a binary approach, so we introduced the ‘Mostly true’ and ‘Mostly false’ levels. We defined the different levels as follows:
- True: Completely true or accurate and reliable (informative).
- Mostly true: Predominantly true with some elements of falsehood.
- Cannot say: Difficult to determine accuracy.
- Mostly false: Predominantly false with some elements of truth.
- False: Completely false or incorrect (mis- or disinformative).
Considering the total number of assessed sentences (the ‘Cannot say’ answer is not included in the formula, based on the assumption that, as a joker, it does not provide meaningful input to the evaluation process), the IDL index consists of the sum of the cumulative scores for ‘Mostly true’ (1 point attributed to each sentence), ‘Mostly false’ (2 points attributed to each sentence), and ‘False’ (3 points attributed to each sentence), divided by the total number of sentences assessed multiplied by 3 (the maximum possible score). The index is then normalised on a scale ranging from 0 to 10.
The formula for the IDL index can be expressed as:

$$\mathrm{IDL} = \frac{(1 \times n_{\mathrm{MT}}) + (2 \times n_{\mathrm{MF}}) + (3 \times n_{\mathrm{F}})}{3 \times N} \times 10$$

where:

- $n_{\mathrm{MT}}$, $n_{\mathrm{MF}}$ and $n_{\mathrm{F}}$ are the numbers of sentences rated ‘Mostly true’, ‘Mostly false’ and ‘False’, respectively (‘True’ sentences contribute no points);
- $N$ is the total number of assessed sentences, excluding those rated ‘Cannot say’;
- $3 \times N$ is the maximum possible score, so that the index ranges from 0 (fully accurate) to 10 (fully false).
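For illustration, this computation can be sketched in JavaScript (the language of the prototype described below). This is a minimal sketch with rating labels of our own choosing, not the code of the actual tool:

```javascript
// Minimal sketch of the IDL computation, assuming one rating per assessed sentence.
// The rating labels below are our own naming, not taken from the tool's source code.
const POINTS = { true: 0, mostly_true: 1, mostly_false: 2, false: 3 };

function idlIndex(ratings) {
  // 'Cannot say' is excluded from both the score and the sentence count.
  const assessed = ratings.filter((r) => r !== "cannot_say");
  if (assessed.length === 0) return 0;
  const score = assessed.reduce((sum, r) => sum + POINTS[r], 0);
  // Divide by the maximum possible score (3 points per sentence), normalise to 0-10.
  return (score / (3 * assessed.length)) * 10;
}

// Worked example: 4 true, 3 mostly true, 2 mostly false, 1 false
// -> (3*1 + 2*2 + 1*3) / (3*10) * 10 = 10/30 * 10 ≈ 3.33
const ratings = [
  "true", "true", "true", "true",
  "mostly_true", "mostly_true", "mostly_true",
  "mostly_false", "mostly_false",
  "false",
];
console.log(idlIndex(ratings).toFixed(2)); // "3.33"
```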
At the operational level, we developed a JavaScript interface that allows a user to evaluate a machine-generated text using the metric. The tool consists of a three-stage process. The first screen displays two fields: one for pasting the machine-generated text and one for pasting the prompt used (see Fig. 2). The second stage consists of the actual assessment, after sentence tokenisation or segmentation of the text based on sentence boundaries such as full stops, question marks, exclamation marks or ellipses [43, 44]. The evaluator can always refer to the prompt used to generate the text to check whether all elements are present and whether additional elements have appeared (see Fig. 3). In other words, the evaluator proceeds by comparing the prompt used (source) with the generated text (target). In this prototype version, we did not include the omission of facts, which could be integrated into further developments.
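As an illustration of the segmentation step, a simple boundary-based splitter could look like the following sketch; this is an assumption about one possible strategy, not necessarily the one implemented in the prototype:

```javascript
// Illustrative regex-based sentence segmentation in the spirit of the tool's second
// stage; the actual prototype may use a different splitting strategy.
function splitSentences(text) {
  // Split on whitespace that follows '.', '!', '?' or an ellipsis ('…').
  // Note: a naive splitter like this also breaks after abbreviations ("Dr. Smith").
  return text
    .split(/(?<=[.!?…])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

console.log(splitSentences("The dam was destroyed. Who is responsible? Officials did not say…"));
// -> [ 'The dam was destroyed.', 'Who is responsible?', 'Officials did not say…' ]
```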
Considering that current information is also characterised by the distinction between facts and comments [45, 46], we introduced the Opinions/Comments (OC) rate into the prototype: the human judge can mark a sentence as an opinion or a comment, and the OC rate corresponds to the percentage of sentences marked as such. It serves as a complementary indicator of the informational quality of the machine-generated content, although it is not the central element for assessing the factuality of a reported event. In the third step of the evaluation process, a final screen provides the results (see Fig. 4).
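A minimal sketch of the OC rate, assuming one boolean flag per assessed sentence (again our own naming, not the prototype's code):

```javascript
// Sketch of the Opinions/Comments (OC) rate: the share of assessed sentences that
// the evaluator flagged as an opinion or a comment (hypothetical boolean array).
function ocRate(flags) {
  const marked = flags.filter(Boolean).length;
  return (marked / flags.length) * 100; // percentage of flagged sentences
}

// Example: 2 of 5 sentences flagged -> 40%; divide by 10 for the 0-10 scale
// used when reporting the results below.
console.log(ocRate([false, true, false, false, true])); // 40
```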
3 Results
Each text in the corpus was evaluated using the assessment tool, and the scores for the Information Disorder Level (IDL) index and the Opinions/Comments (OC) rate were recorded in a spreadsheet. Descriptive statistics were computed for both measures. The IDL index ranged from 0 (in only two cases) to 8.2, with an average of 3.9 and a median of 3.3. Around 32.5% of the machine-generated texts scored 5 or above. In 80% of the cases, ChatGPT added made-up content, regardless of subject or style, and in 35% of the cases this reached alarming proportions, as measured by an IDL index of 5 or higher.
As explained previously, separating facts from opinions and comments is an ethical prerequisite in journalism. Here too, ChatGPT performed poorly, contributing thoughts or observations in 100% of the cases. No text in the corpus was exempt from such additions, with a minimum Opinions/Comments (OC) rate of 2.31, a maximum of 9.5, an average of 5.65 and a median of 5.75 (see Fig. 5). To mitigate biases in these results, we excluded the sensationalist, pro-Russian and columnist writing styles and examined the OC rate for the factual and high-quality newspaper styles only (see Fig. 6). The 14 texts retained for this analysis show an average OC rate (normalised on a scale of 10) of 3.72, with a minimum of 2.31 and a maximum of 5.45.
A correlation analysis was performed to examine the possible relationship between the IDL index and the OC rate. The correlation coefficient of 0.05 suggests the absence of any meaningful positive correlation between these two variables, which may be due to the difficulty of assessing the factuality or truthfulness of a comment or an opinion [47]. The t-value (-1.58) and the p-value (0.20) indicate that there is no statistically significant difference between the means of the two variables. Additionally, a linear regression model was fitted to explore the relationship between the IDL index and the OC rate, but it did not yield statistically significant results. The low multiple R-squared (0.003) and adjusted R-squared (-0.023) values suggest that the model does not fit the data well. Based on this analysis, there is therefore no strong evidence that the IDL index has a significant influence on, or relationship with, the OC rate.
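For reference, this kind of correlation between the two score series can be sketched as follows; the arrays are placeholders and not the study's data:

```javascript
// Illustrative Pearson correlation between two series of scores, as used to relate
// the IDL index and the OC rate; the arrays below are placeholders, not the study's data.
function pearson(x, y) {
  const n = x.length;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(x);
  const my = mean(y);
  let num = 0;
  let dx2 = 0;
  let dy2 = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx2 += (x[i] - mx) ** 2;
    dy2 += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx2 * dy2);
}

const idlScores = [3.3, 5.0, 0.0, 8.2, 4.1]; // placeholder values
const ocRates = [5.7, 2.3, 9.5, 5.6, 4.0];   // placeholder values
console.log(pearson(idlScores, ocRates).toFixed(2));
```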
4 Conclusion
The limitations of this experiment relate to the relatively small size of the corpus (only forty samples) and to the fact that the human evaluation was carried out by a single judge. Given the subjective nature of any analysis or evaluation activity, a corpus should ideally be submitted to at least two human evaluators to better frame and weigh the results. Nevertheless, the results presented in this paper illustrate the potential of using the IDL index and the OC rate as quality indicators for content generated by LLMs.
As ChatGPT added opinions or comments to all the samples written in the factual and high-quality newspaper styles, it is possible to hypothesise that this mixture of genres is a clue that a text is a machine-generated, non-journalistic piece. However, some media outlets and blogs also fail to distinguish between facts, opinions and comments. In addition, the sample included writing styles that, by their very nature, contain opinions or comments. Hence, further investigation is needed in this area.
The content invented by ChatGPT is part of the story’s logic and is more akin to fictionalising than to what is commonly called artificial hallucination. While ChatGPT may not fully understand what it writes, its output can be considered a simulation or extrapolation of content generation. We therefore suggest that the invented parts of the generated texts should be understood as a product of pattern-matching abilities rather than as a manifestation of artificial hallucination.
For newsrooms using generative AI, these results suggest that every piece of machine-generated content should be verified and post-edited by a human before being published. From a digital literacy perspective, the IDL index can be considered a useful tool for understanding the limits of generative AI and for encouraging critical thinking about what makes a reported event factual. The tool developed for this experiment is available at https://laurence001.github.io/idl/; the corpus used and the source code of the web application are also available on GitHub: https://github.com/laurence001/idl/tree/main.
References
Giansiracusa, N.: How algorithms create and prevent fake news: Exploring the impacts of social media, deepfakes, GPT-3, and more. APress (2021)
Ferrara, E.: Social bot detection in the age of ChatGPT: Challenges and opportunities. First Monday (2023)
Wardle, C., Derakhshan, H.: Information disorder: toward an interdisciplinary framework for research and policymaking. Council of Europe Strasbourg (2017)
De Angelis, L., et al.: ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front. Public Health 11, 1166120 (2023)
Ray, P.: ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems (2023)
Weber-Wulff, D., et al.: Testing of detection tools for AI-generated text. ArXiv [cs.CL]. (2023). http://arxiv.org/abs/2306.15666
Crothers, E., Japkowicz, N., Viktor, H.: Machine generated text: A comprehensive survey of threat models and detection methods. ArXiv [cs.CL] (2022). http://arxiv.org/abs/2210.07321
Kirchner, J., Ahmad, L., Aaronson, S., Leike, J.: New AI classifier for indicating AI-written text. OpenAI (2023)
Gehrmann, S., Strobelt, H., Rush, A.: GLTR: Statistical detection and visualization of generated text. ArXiv [cs.CL]. (2019). http://arxiv.org/abs/1906.04043
Gao, C., et al.: Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digital Med. 6, 75 (2023)
Henriksson, T.: New survey finds half of newsrooms use Generative AI tools; only 20% have guidelines in place - WAN-IFRA. World Association Of News Publishers (2023). https://wan-ifra.org/2023/05/new-genai-survey/
Dwivedi, Y., Kshetri, N., Hughes, L., Slade, E., Jeyaraj, A., Kar, A.: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int. J. Inf. Manage. 71, 102642 (2023)
Hanley, H., Durumeric, Z.: Machine-made media: monitoring the mobilization of machine-generated articles on misinformation and mainstream news websites. ArXiv [cs.CY] (2023). http://arxiv.org/abs/2305.09820
Li, Z.: The dark side of ChatGPT: legal and ethical challenges from stochastic parrots and hallucination. ArXiv [cs.CY] (2023). http://arxiv.org/abs/2304.1434
Ferrara, E.: Should ChatGPT be biased? Challenges and risks of bias in large language models. ArXiv [cs.CY] (2023). http://arxiv.org/abs/2304.03738
Rozado, D.: The political biases of ChatGPT. Soc. Sci. 12, 148 (2023)
Tang, R., Chuang, Y., Hu, X.: The science of detecting LLM-generated texts. ArXiv [cs.CL] (2023). http://arxiv.org/abs/2303.07205
Zellers, R., et al.: Defending against neural fake news. ArXiv [cs.CL] (2019). http://arxiv.org/abs/1905.12616
Schuster, T., Schuster, R., Shah, D., Barzilay, R.: The limitations of stylometry for detecting machine-generated fake news. Comput. Linguist. 46, 499–510 (2020). https://doi.org/10.1162/coli_a_00380
Kumarage, T., et al.: J-Guard: journalism guided adversarially robust detection of AI-generated news. ArXiv [cs.CL] (2023). http://arxiv.org/abs/2309.03164
Pu, J., et al.: Deepfake text detection: Limitations and opportunities. ArXiv [cs.CR] (2022). http://arxiv.org/abs/2210.09421
Guo, B., et al.: How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. ArXiv [cs.CL] (2023). http://arxiv.org/abs/2301.07597
Lazarski, E., Al-Khassaweneh, M., Howard, C.: Using NLP for fact checking: a survey. Designs 5, 42 (2021). https://doi.org/10.3390/designs5030042
Dierickx, L., Lindén, C., Opdahl, A.L.: Automated fact-checking to support professional practices: systematic literature review and meta-analysis. Int. J. Commun. 17, 21 (2023)
Graves, D.: Understanding the promise and limits of automated fact-checking. Reuters Institute for the Study of Journalism (2018)
Schlichtkrull, M., Ousidhoum, N., Vlachos, A.: The intended uses of automated fact-checking artefacts: Why, how and who. ArXiv [cs.CL] (2023). http://arxiv.org/abs/2304.14238
Alkaissi, H., McFarlane, S.: Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 15, 1–5 (2023)
Buholayka, M., Zouabi, R., Tadinada, A.: Is ChatGPT ready to write scientific case reports independently? A comparative evaluation between human and artificial intelligence. Cureus. 15, 1–6 (2023). https://doi.org/10.7759/cureus.39386
Thomson, C., Reiter, E.: A gold standard methodology for evaluating accuracy in data-to-text systems. ArXiv [cs.CL] (2020). http://arxiv.org/abs/2011.03992
van der Lee, C., Gatt, A., Miltenburg, E., Krahmer, E.: Human evaluation of automatically generated text: current trends and best practice guidelines. Comput. Speech Lang. 67, 101151 (2021)
Lester, J., Porter, B.: Developing and empirically evaluating robust explanation generators: the KNIGHT experiments. Comput. Linguist. 23, 65–101 (1997)
Belz, A., Reiter, E.: Comparing automatic and human evaluation of NLG systems. In: 11th Conference of the European Chapter of the Association For Computational Linguistics, pp. 313–320 (2006)
Belz, A., Reiter, E.: An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Comput. Linguist. 35, 529–558 (2009)
Dale, R., White, M.: Shared tasks and comparative evaluation in natural language generation. In: Proceedings of the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation, pp. 1–6 (2007)
Graefe, A., Haim, M., Haarmann, B., Brosius, H.: Perception of automated computer-generated news: credibility, expertise, and readability. 11th Dubrovnik Media Days, Dubrovnik (2015)
Haim, M., Graefe, A.: Automated news: better than expected? Digit. J. 5, 1044–1059 (2017)
Wölker, A., Powell, T.: Algorithms in the newsroom? News readers’ perceived credibility and selection of automated journalism. Journalism 22, 86–103 (2021). https://doi.org/10.1177/1464884918757072
Melin, M., Bäck, A., Södergård, C., Munezero, M., Leppänen, L., Toivonen, H.: No landslide for the human journalist - an empirical study of computer-generated election news in Finland. IEEE Access 6, 43356–43367 (2018). https://doi.org/10.1109/access.2018.2861987
Clerwall, C.: Enter the robot journalist: users’ perceptions of automated content. J. Pract. 8, 519–531 (2014). https://doi.org/10.1080/17512786.2014.883116
Van Der Kaa, H., Krahmer, E.: Journalist versus news consumer: the perceived credibility of machine-written news. In: Proceedings of the Computation+Journalism Conference. (2014)
Johnston, J., Graham, C.: The new, old journalism: narrative writing in contemporary newspapers. J. Stud. 13, 517–533 (2012). https://doi.org/10.1080/1461670x.2011.629803
Tandoc Jr, E., Thomas, R., Bishop, L.: What is (fake) news? Analyzing news values (and more) in fake stories. Media Commun. 9, 110–119 (2021). https://doi.org/10.17645/mac.v9i1.3331
Jurish, B., Würzner, K.: Word and sentence tokenization with hidden Markov models. J. Lang. Technol. Comput. Linguist. 28, 61–83 (2013). https://doi.org/10.21248/jlcl.28.2013.176
Matusov, E., Leusch, G., Bender, O., Ney, H.: Evaluating machine translation output with automatic sentence segmentation. In: Proceedings of the Second International Workshop on Spoken Language Translation (2005)
Hanitzsch, T.: Deconstructing journalism culture: toward a universal theory. Commun. Theory 17, 367–385 (2007). https://doi.org/10.1111/j.1468-2885.2007.00303.x
Ward, S.: Truth and objectivity. In: The Routledge Handbook of Mass Media Ethics, pp. 101–114 (2020). https://doi.org/10.4324/9781315545929-8
Walter, N., Salovich, N.: Unchecked vs. uncheckable: how opinion-based claims can impede corrections of misinformation. Mass Commun. Soc. 24, 500–526 (2021). https://doi.org/10.1080/15205436.2020.1864406