Introduction

The rise of digitization and the establishment of social media as a major medium of content production and reproduction have led to new paradigms of journalism and news spreading. The rapid changes of the last 20 years have created an environment of pluralism without borders, in which many threats are also lurking. One of these threats is the rapid spread of misinformation and disinformation. It has been reported that fake news spreads up to six times faster than credible information [1]. This phenomenon represents a major concern, firstly, for media organizations and professionals and, secondly, for law enforcement agencies (LEAs), since the rapid spread of disinformation can severely threaten several aspects of society. According to the European Commission, the spread of both disinformation and misinformation can have a range of harmful consequences, such as threatening our democracies, polarizing debates, and putting the health, security, and environment of EU citizens at risk [2].

As the practices of misinformation and disinformation evolve, it is of utmost importance to design, develop, and deploy innovative technologies and solutions to tackle such phenomena. In this light, numerous approaches have emerged that take advantage of machine learning (ML) to address this problem from different viewpoints. Even though, from a technical perspective, different solutions for fake news detection and misinformation identification exist, such as transfer learning, multi-task learning, reinforcement learning, and online learning, no universal solution addressing all aspects of the issue has been developed so far. Almost every solution targets the problem within a specific topic or narrow domain and is based on a limited dataset.

The purpose of this study is to present an approach that combines and evaluates the results of different machine learning prediction models in a common environment named the “Meta-Detection Toolset.” This solution relies on the calculation of a meta-score through weight-based voting among different “prediction models,” referred to herein as “verification services.” The weights of the verification services are constantly updated based on an annotation procedure performed by the end users of the toolset. This turns the current solution into a lifelong learning approach that is future-proof and adaptable, as the machine learning models may improve or deteriorate over time and may perform better or worse for different topics or styles of writing.

The remainder of this study is structured as follows: Section “Related Work” contains related works concerning natural language processing applications (e.g., topic selection and language modeling) and lifelong learning studies. Section “Meta-Detection Toolset” presents in detail the proposed Meta-Detection Toolset, and, finally, Section “Conclusion: Future Work” concludes the article and paves the way for future updates of the presented toolset.

Related Work

Lifelong learning (LL), or continuous learning (CL), is an emerging trend in computer science as well as in artificial intelligence. Thus, in the last few years, there has been an upward trend in studies focused on producing systems and solutions based on the concept of LL. The vast majority of dis/misinformation-fighting tools are based on machine learning and deep learning algorithms. A comparative analysis of six available state-of-the-art fake news detection tools was conducted by Giełczyk et al. [3]. This comparison was feasible because the datasets used were labeled, which is rare in real-life conditions. The use of LL comes as an answer to minimizing the need for expensive and scarce labeled data. In the domain of dis/misinformation and the trustworthiness of news and articles, LL is at a nascent stage; thus, this section presents notable LL studies relevant to other fields of text analysis and natural language processing.

Topic identification is an application that can be enabled by LL approaches. More specifically, ML-based models called “topic models” extract hidden structures and correlations from a collection of documents in order to classify similar documents under a common topic. Each topic contains sets of common or contextually related words or characteristics [4]. In this light, Chen et al. [5] proposed a lifelong topic model based on non-negative matrix factorization, the NMF-lifelong topic model (NMF-LTM). In extensive experiments on public corpora, the method of Chen et al. [5] showed better performance than competing methods. In the same direction, Xu et al. [6] proposed the lifelong learning topic (LLT) model, which tries to lift the limitations that arise when word co-occurrences in a dataset are limited. The LLT model is based on the notion of lifelong learning and expands the discovered topic knowledge by learning new word embeddings based on the topics generated in previous iterations. Another interesting approach to topic modeling and learning was taken by Zhang et al. [7]. More specifically, Zhang et al. combined a generative adversarial network (GAN) with lifelong learning in their solution, named the lifelong knowledge-enhanced adversarial neural topic model (LKATM). LKATM discovers topics in documents by using a knowledge extractor that relies on knowledge distillation and data augmentation to transfer prior topic knowledge.

Apart from topic identification, language modeling is another task where LL is exploited to offer state-of-the-art solutions. In this context, Sun et al. [8] proposed LAMOL, a framework for language modeling for lifelong language learning. Specifically, LAMOL is a language model that learns to solve tasks and, at the same time, generates training samples. The dynamic representations for imbalanced lifelong learning (DRILL) solution was presented by Ahrens et al. [9] and mainly focuses on addressing dataset limitations. In particular, DRILL is defined as a novel lifelong learning architecture for open-domain sequence classification. DRILL is a hybrid architectural and rehearsal-based continuous learning method that utilizes meta-learning and a self-organizing neural architecture in order to adapt to new, unseen data while avoiding catastrophic forgetting.

Chen et al. [10] proposed an LL solution for sentiment classification based on product reviews. Their approach focused on negative or positive reviews for various products, with each product representing a different task for the proposed LL learner. He et al. [11] studied and demonstrated the applicability of an LL model to Weibo rumor detection. Their approach aimed to cope with the rapid changes in online news and rumors as well as the limited availability of data.

To the best of our knowledge and at the time of writing the current study, there is no dedicated LL method or approach to the trustworthiness of news or articles. Thus, the related works presented in this section constitute LL solutions and approaches concerning applications relevant to text and language processing in other domains, such as rumor detection, sentiment detection, language modeling, and topic detection.

Meta-Detection Toolset

The proposed Meta-Detection Toolset (MDT) engages different verification services. These diverse verification services serve as predictors of credibility for a given piece of content (typically, an article provided in the form of a URL or text). Based on the integration and implementation of a weighted majority algorithm [12], equal weights are initially assigned to each verification service. During the continuous training process, the weight assigned to a verification service is automatically adjusted according to the accuracy of its predictions. Verification services with more correct predictions during the training phase receive higher weights, thus playing a more significant role when the MDT calculates the credibility of a certain article. The verification services that the MDT can host may vary, ranging, for instance, from BERT-based models to sentiment and stylometric analysis models (Fig. 27.1). Moreover, the input that the MDT can process may come from diverse sources, such as news sites or social media posts (Twitter, Telegram, etc.), as shown in Fig. 27.1.
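
The weighted voting step described above can be sketched as follows. This is a minimal illustration, not the actual MDT interface: the service names, the 0/1 encoding of predictions (1 = legitimate, 0 = fake), and the normalization of the score are assumptions for the sake of the example.

```python
def meta_score(predictions, weights):
    """Combine per-service credibility predictions (1 = legitimate,
    0 = fake) into a single weighted meta-score in [0, 1]."""
    total = sum(weights[service] for service in predictions)
    return sum(weights[s] * p for s, p in predictions.items()) / total

# Initially, every verification service carries an equal weight.
weights = {"bert_model": 1.0, "sentiment": 1.0, "stylometric": 1.0}
preds = {"bert_model": 1, "sentiment": 0, "stylometric": 1}

# With equal weights, two of three services voting "legitimate"
# yields a meta-score of 2/3.
score = meta_score(preds, weights)
```

As the weights diverge during training, better-performing services pull the meta-score toward their own prediction.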

Fig. 27.1
A schematic diagram of the MDT presents the verification services provided, including toxic language detection, topic modeling, bot detection, clickbait detection, spam detection, sentiment analysis, stylometric analysis, and author attribution, for URL, file, or text upload, Twitter feeds, and Telegram posts.

Potential verification services that can be hosted in the MDT

End users, for example, fact-checkers, also play an active role in the training process. More specifically, end users can insert their credibility evaluation of specific articles (i.e., indicating whether a specific piece of content represents legitimate or fake news). These user evaluations are provided in the form of ground-truth labels (legitimate/fake), stored in a database, and utilized during the continuous training phase to update the weights assigned to each verification service. Thus, a growing number of annotations leads to improved verification results of the Meta-Detection Toolset.

The accumulated experience of the toolset leads to the generation of a model that extensively utilizes contemporary AI technologies for combating the spread of dis/misinformation on the web and in social media. This model comprises multiple specialized verification services and is able to combine them in order to evaluate truthfulness based on a complex scoring mechanism. This AI-based process is called Meta-Detection and achieves continuous improvement through annotation processes performed by specialized end users. In the context of the Meta-Detection Toolset, an integrated management environment for the verification services has been developed, in which the Meta-Detection scores are also determined according to the annotations provided by fact-checkers. More specifically, for a given article, a ground-truth label annotation (legitimate/fake) is provided by certified fact-checkers.

As shown in Fig. 27.2, data ingestion can be achieved either at the end users’ side over the HTTPS protocol or by using data connectors (Kafka topics and/or REST APIs). Then, the input data can be consumed by the various verification services integrated into or connected with the toolset. Following the completion of the verification services’ computation processes, the prediction results are sent to the MDT, where they are combined in order to compute a meta-score that reflects the credibility of the digital content. The meta-score results are made available through REST API endpoints and/or Kafka topics.

Fig. 27.2
A schematic diagram of the Meta-Detection Toolset presents sources 1 to N, Kafka topics, REST APIs, end users, the MDT UI, and web server services, including verification services, meta-score engines, and advanced analytics.

High-level architecture of the Meta-Detection Toolset

As shown in Figs. 27.2 and 27.3, the evaluations of different verification services are combined by the MDT. The annotation process performed by the fact-checkers helps to identify which verification services perform better compared to the rest. These annotations are provided through the Meta-Detection Toolset user interface and better-performing verification services are provided with higher weights. In this way, the MDT enables the knowledge retention from previous evaluations and is capable of updating the weights in a continuous way, leading to a continuous learning paradigm. Last but not least, the system is constantly expanding by evaluating new pieces of content and recalculating the weights based on the experts’ feedback. This entire process is depicted in Fig. 27.3.
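
A minimal sketch of the annotation-driven weight update is given below, following the classic weighted majority rule [12]: services whose prediction contradicts the fact-checkers' ground-truth label are penalized by a multiplicative factor. The penalty value `BETA = 0.5` and the data layout are illustrative assumptions, not the MDT's actual parameters.

```python
BETA = 0.5  # assumed penalty factor; any 0 < BETA < 1 works

def update_weights(weights, predictions, ground_truth):
    """Return updated weights after one annotated article: keep the
    weight of services that agreed with the ground-truth label and
    multiply the weight of mispredicting services by BETA."""
    return {
        service: (w if predictions[service] == ground_truth else w * BETA)
        for service, w in weights.items()
    }

weights = {"bert_model": 1.0, "sentiment": 1.0, "stylometric": 1.0}
preds = {"bert_model": 1, "sentiment": 0, "stylometric": 1}

# Fact-checker annotates the article as legitimate (1); only the
# "sentiment" service mispredicted, so only its weight is reduced.
weights = update_weights(weights, preds, ground_truth=1)
```

Repeating this update over a stream of annotated articles concentrates the voting power on the services that have historically agreed with the fact-checkers, which is the knowledge-retention behavior described above.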

Fig. 27.3
A schematic diagram of the continuous learning process of the MDT presents the experts' input in searching for evaluation results, getting verification service results, retrieving weights, MDT evaluation, and updating the weights to update past knowledge or insert new knowledge, as well as retrieving past knowledge.

Continuous learning process of MDT

Conclusion: Future Work

The work presented in this study (Section “Meta-Detection Toolset”) combines the prediction results of various dis/misinformation prediction models and computes a meta-score that reflects the credibility of the digital content, aiming to achieve continuous improvement based on annotation processes. Through this solution, an aggregation of different ML prediction models is implemented in order to provide more trustworthy insights into the content credibility of news articles. Thus, end users are provided with a reliable indicative score of the credibility of the content under evaluation.

Future steps involve the expansion of the Meta-Detection Toolset to integrate more verification services that can work on different data formats, such as pictures, video, and voice, in order to assess their credibility. In addition to the legitimate/fake annotations, future steps of the Meta-Detection Toolset focus on enabling end users to also annotate the type of news included in a URL (e.g., political news and sports) or even insert a news category annotation of their own choice. For each category of news (including user-defined categories), a distinct set of verification service weights will be calculated. This aims to improve the predictions of the MDT as more annotations arrive over time. It will also enhance the proposed solution with the ability to learn new tasks (credibility evaluation of additional content categories) that are initially unknown to it.
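
The planned per-category weight sets can be sketched as follows: each news category (including user-defined ones) keeps its own weight vector, created lazily with equal initial weights when a category is first encountered, so a new category constitutes a new task starting from scratch. The category names and service names are illustrative assumptions.

```python
SERVICES = ["bert_model", "sentiment", "stylometric"]

category_weights = {}  # one independent weight set per news category

def weights_for(category):
    """Return the weight set for a news category, creating it with
    equal initial weights if the category is new to the system."""
    if category not in category_weights:
        category_weights[category] = {s: 1.0 for s in SERVICES}
    return category_weights[category]

# Annotations for "political" articles would only ever update this set,
# leaving the "sports" set (and any user-defined category) untouched.
w_politics = weights_for("political")
w_sports = weights_for("sports")
```

Under this scheme, the annotation-driven weight update is applied only to the weight set of the annotated article's category, so a service can dominate the vote for one category while carrying little weight in another.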