Span identification and technique classification of propaganda in news articles

Propaganda is a rhetorical technique designed to serve a specific topic, which is often used purposefully in news article to achieve our intended purpose because of its specific psychological effect. Therefore, it is significant to be clear where and what propaganda techniques are used in the news for people to understand its theme efficiently during our daily lives. Recently, some relevant researches are proposed for propaganda detection but unsatisfactorily. As a result, detection of propaganda techniques in news articles is badly in need of research. In this paper, we are going to introduce our systems for detection of propaganda techniques in news articles, which is split into two tasks, Span Identification and Technique Classification. For these two tasks, we design a system based on the popular pretrained BERT model, respectively. Furthermore, we adopt the over-sampling and EDA strategies, propose a sentence-level feature concatenating method in our systems. Experiments on the dataset of about 550 news articles offered by SEMEVAL show that our systems perform state-of-the-art.


Introduction
Recently, with the development of related models in the field of natural language processing, research on propaganda detection also goes ahead, which originates from document level [1], then develops to sentence level [6,21] and now to fragment level [13,26]. At present, identifying those specific fragments which contain at least one propaganda technique and identifying the applied propaganda technique in the University of Technology Sydney, Ultimo, Australia fragment are main tasks of the fragment-level propaganda detection. As an extension of text classification task in the field of natural language processing, there are many relevant advanced algorithms [8,10,12,19,27] which can be used for reference.
SEMEVAL, the most influential and largest semantic evaluation competition all over the world, provides a news article corpus in which fragments containing one out of 14 propaganda techniques [14,18] have been annotated as shown in Fig. 1. Based on this dataset, numerous researchers have sprung up putting forward a large quantity of algorithms to search the usages of propaganda techniques. Among the algorithms, the great mass of them are based on the popular language models such as ELMO [16], GPT [17] and especially BERT [3]. As shown in Fig. 2, BERT Model raised by Google outperforms previous methods in 11 NLP tasks. Undoubtedly, it has achieved state-of-the-art performance on multiple NLP benchmarks [22]. In our systems, we choose BERT as our basic model as well.
In this work, we introduce our systems for span identification and technique classification of propaganda in news articles. As for the span identification task, we have set forth two of architectures working on it. The first is the BERTbased binary classifier, and the other one is the BERT-based three-token type classifier. The latter is our second-to-none Fig. 1 The corpus of news articles which have been retrieved with the newspaper 3k library and sentence splitting has been performed automatically with NLTK sentence splitter system. Besides joining the most popular BERT model, we have also optimized the sampling [2] process, combined EDA [23,24] to prevent the overfitting of our system and adopted the sentence-level feature concatenating (SLFC), in which case our model can learn characteristics better. As for the technique classification of propaganda task, we have designed a BERT-based architecture with a dimensionality reduction Full Connected (FC) layer and a linear classifier. Same as SI task, we have utilized EDA strategy in the data loading process. The final result in "Experiment and results" shows that it is very meaningful of our optimizing and improving of the pre-trained BERT model. At last, both of our systems for SI and TC have exceeded most of the existed models and made a breakthrough.
The contributions of our paper are as follows: • We fine-tune the BERT with Linear layers and devise two accurate systems for the span identification and technique classification of propaganda in news articles. • We change the binary sequence tagging task SI into a three-way classification task by adding 'invalid' token type and compare the binary tagging method with the three-token type method. • We propose SLFC approach in SI system. To our best knowledge, it is the first work to integrate sentence-level classification features into each word.
• For our systems, we have obtained the optimal network parameters through experiments and comparative analysis.

Related work
The followings are the history and the correlative approaches about propaganda detection in news articles.

Propaganda detection
Propaganda techniques detection is born in the process of fake news detection. Some of the earlier workers judge a news article as authentic or not only according to its origin. As we can imagine that this approach is unscientific. Recently, with the rise of artificial intelligence and machine learning, propaganda detection has attracted researcher's eyes which promotes it to become a standalone research field. In the early days of NLP neural networks, a bidirectional long-term short-term memory (BiLSTM) [5] layer was proposed to capture the semantic features of human language. Gradually, people began to utilize it to detect the using of propaganda in news articles. Initially, a corpus has been created for news articles automatically annotated with a novel multi-granularity neural network which is superior to some powerful BERT-based baselines [14]. Simultaneously, Proppy [1], a system to unmask propaganda in online news, has appeared for document-level propaganda detection, which works by analysing various representations, from writing style and readability level to the presence of certain keywords. Later, to further improve the accuracy of detection, researchers began to pay attention to the detection of sentence level. Hou et al. proposed a model for sentencelevel detecting which could understand semantic features of language better by constructing context-dependent input pairs (sentence-title pair and sentence-context pair) [6]. After the NLP4IF workshop, fragment-level classification (FLC) of propaganda occurred. For instance, different neural architectures (e.g., CNN, LSTM-CRF, and BERT) have been explored to further improve the effect of neural networks [15].

BERT-based model
In our experiment, we have been designing models specifically for SI and TC tasks based on BERT [3,4] model architecture, which incorporates the strength of the other language models. As shown in Fig. 2, BERT utilizes the transformer's attention mechanism [20] to decode the input word vector. Unlike the previous NLP models, BERT is able to run in parallel. More uniquely, the pre-training process of BERT includes two tasks, Masked Language Model (MLM) and Next Sentence Prediction (NSP), which make the BERT model more suitable for NLP tasks. After completing different pre-training and fine-tuning for different tasks, BERT has made great progress on many NLP tasks. Many researchers have discovered the huge potential of the two-stage new model (pre-training and then fine-tuning) on BERT. As a consequence, in recent years, based on BERT, many improved models occurred, such as MT-DNN [7], XLNET [25], ALBERT [11], etc.
In our system, using BERT is mainly for word feature extraction, thanks to that BERT adopts the popular feature extractor transformer, and also implements a bidirectional language model. It is the core concept of BERT to convert word into word vector input, which is added by Token Embedding, Segment Embedding, and Position Embedding to integrate the whole sentence semantics into each word in the same sentence. For our SI task, we process the obtained feature vector from BERT generator by incorporating sentence-level features into each word vector and then put it to a multi-class (prop, non-prop, invalid) classifier layer. To fit our TC task, we truncate the valid fragments and pad it for the latter FC layer and the classifier. Since there are two versions of BERT, taking our SI and TC tasks into account we use the 12-layer BERT pre-trained model as our basis.

Method
In this section, we will introduce the details of our solutions and show the model architectures designed for the span identification (SI) and technique classification (TC) tasks.

Data process
S On account of that only a small portion of the texts use propaganda in SI and some of the techniques rarely appear in the given fragments in TC which lead to the imbalance of dataset, we have proposed two methods aimed at these two problems.
Over-sampling In the task of SI, we utilize the over-sampling (OS) [2,26] method to get more balanced and suitable dataset for training. Since sentences with propaganda techniques are relatively few, we sample them with a higher probability, and the number of non-propaganda ones is correspondingly reduced considering the whole training process. Nevertheless, during our experiment, we find that if over-sampling is overused, the labeled part will be too much in the sample which will cause overfitting, and the F1-score will decline to an undesirable level as a result. Therefore, when training our model, we take the strategy that the first half of the epoch uses the over-sampling and the latter part uses the sequential sampling. While TC is merely a classification task and each fragment in the given dataset corresponds to a specific propaganda technique, over-sampling is superfluous.
Data augmentation Since the pre-trained BERT model is easy to overfit, we have adopted a data augmentation scheme to improve the generalization ability and robustness [24] of the model. In the task of SI, we apply EDA Synonym Replacement (SR) [9] and Random Swap (RS) [23] to our model. Namely, each word has equal probability of being swapped or replaced by its synonyms without changing the label. Compared to short sentences, long sentences absorb more noise, which can better balance the dataset. After processing, sentences different from before are added to the training dataset. While in TC task, the data augmentation strategy is the same as that in SI which is a random process, initially. However, some of the techniques still cannot be detected such as Appeal_to_Authority, Bandwagon, Reduc-tio_ad_hitlerum and Black-and-White_Fallacy. Aiming at this problem, on the basis of random data augmentation, we compulsively add them into the set to be data augmented. In this way, the purpose of increasing the valid noise of the training dataset is achieved. Meanwhile, the training time gets shortened as well.

Approach of span identification (SI)
To deal with the SI task, we first defined it as a binary classification task [13], but after experiment we found Precision and F1-score of this solution were unexpected. After analyzing the cause and effect of this issue, we propose a three-classification model to classify each word in the news articles into three token types. The concrete architecture of our model is shown in Fig. 3. Two of them are 'prop' and 'non-prop'; the other one is 'invalid' which means the label of some invalid words like 'CLS', 'SEP' [3] and those used to ensure the input sentence with the same length. Classifying these invalid words into a 'invalid' token type reduces the noise and improves Precision and F1-score. Furthermore, we have utilized sampling skill and EDA to optimize the dataset.
Due to that the labels of the plain-text document offered by SEMEVAL are at char level, converting them into word level for word embedding in pre-trained BERT model is the first step. Before inputting them into the classifier, we combine the word vectors in each sentence with the feature vector of the sentence where they are. Then the word vectors with semantic integration of the sentence are normalized for the last classifier layer. As shown in Figure 3 on the right, the concatenating process generates the new concatenated vectors by placing the sentence vector in front of the word vectors. The following formula (1) shows the concatenating process mathematically: where s 1 , s 2 represent the elements in a sentence vector which contains the classifying result of sentence-level prediction. And the right matrix (768 × 200) contains 200 word vectors (768 dim) on behalf of one sentence. By concatenating, the input matrix (770 × 200) of the final classifier is made as the below one. This concatenating step is reasonable considering that sentence-level prediction is more undemanding and accurate than word-level prediction. The result also shows Fig. 3 The architecture of Span Identification task adopting over-sampling, data augmentation and sentence-level feature concatenating. The Concat means adding the classification feature of the sentence to its every word vector that concatenating step plays a key role in the promotion of word-level prediction accuracy. Finally, by merging the successive words with identical propaganda technique, those specific fragments which include at least one propaganda technique are identified.

Approach of technique classification (TC)
For the multi-class classification task TC, we have utilized a Full Connection layer and a linear classifier based on BERT model, as shown in Fig. 4. Since the dimension of the valid fragment vector is large, we utilize the former Full Connection layer for dimensionality reduction, and the second for classifying them into 14 classes. And we handle those propaganda techniques that rarely appear in the dataset by utilizing EDA so as to solve the imbalance of dataset. Comparing our model without and with EDA, the latter gets an improvement of around 4 points in F1 score as shown in Table 3. For details of TC task, we take the given text fragment identified as propaganda and its document context as the input of the pre-trained BERT generator. Different from SI task which is a word-level classification task, the TC task is fragment-level. Hence, incorporating sentence-level features into each word vector is ineffectual for TC task. As for the fragment which belongs to several sentences, we divide it into different sentences in the training process, while evaluating we treat it as a whole. Then for the sake of obtaining the valid fragment with propaganda techniques, we make segmentation of the output of BERT and pad it with invalid zero vector to a settled length (120). With the dimensionality reduction of our Full Connection layer, a linear classifier is used for 14 token types classification.

Experiment and results
In this section, we will show the experiment details and the achieved experiment results by comparing our surpassing systems respectively for SI and TC to several other attempts.

Experiment details
In our experiment, we have trained our models parallelly with 4 Nvidia GTX 1080Ti GPUs to reduce the time required. Based on the PyTorch Framework and CrossEntropy Optimizer [28] (after trying the focal loss), we have fine-tuned the pre-trained BERT model for our SI and TC tasks.
Dataset The datasets for both of the SI and TC tasks are news articles in plain text format, including train-articles, dev-articles and the label files. To begin with, we have split each article into individual sentences to reduce parameters of our model. And before the experiment, we divided the annotated corpus of about 550 articles into 80% train set for model training and 20% test set for model eval-uation, respectively. By calculating the instances of each technique, we find that the dataset for TC is imbalanced as shown in Table 1. Some of the techniques such as "Loaded_Language" has a high proportion of 34.64%, while some of the techniques such as "Black-and-White_Fallacy", "Slogans" and "Whataboutism, Straw_Men, Red_Herring" show up less often. What is worse, neither "Bandwagon, Reductio_ad_hitlerum" nor "Thought-terminating_Cliches" has no more than 80 instances which may badly influence the training process. During training, in order to enhance the generalization capability of our model, we utilized EDA to make train set extended and more well-proportioned. Besides, particularly for SI task, we adopted the over-sampling strategy for tagged sentences.
Evaluation metric So as to make a fair comparison, we use different evaluation criteria in different comparison experiment. For both of SI and TC tasks, we adopt the F1-score (F1) as the main metric. In addition, the general Precision (P) and Recall (R) are the secondary metric for SI task. The F1-score is denoted by the following formula:

Results: span identification (SI)
For the purpose of achieving the SI task, we have presented two diverse architectures and optimized one of them with over-sampling, EDA, and sentence-level feature concatenating (SLFC). As is shown in Table 2, our top perform system is three-token type classification system with Oversampling, EDA and SLFC. We have contrasted our SI system with BERT-based Binary classifier model, and BERT-based Three-token type classifier model. As we have seen, the BERT-based three-token type classifier reaching 40.8815% F1-score, 40.1099% Precision and 41.6834% Recall behaves better than baseline which is merely BERT-based with no fine tuning and the Binary classifier model. We owe this success to the 'invalid' token type which impairs the noise of the invalid words by classifying the irrelevant words individually. Besides, after using EDA, it only took two epochs or so to reach the peak of the Recall, without which it took about six epochs to reach the peak and the results were not as expected. Ultimately, our SI system, based on our three-token type classifier and utilizing our strategies of over-sampling, EDA and SLFC, prevails over others, which scores 44.1732% of F1-score on the test set.
Next, we will give a deep analysis of the usage of SLFC and how it benefits our system on Recall and F1-score shown in Fig. 5. Generally speaking, the word-level prediction requires more accurate detection and there is a bigger margin of error than sentence-level prediction, which is why we give each word more information about whether the sentence it is in has propaganda with the aid of SLFC. Namely, the sentence classification prediction provides reference for the word prediction. If a sentence is propagandistic, it is of high probability containing propaganda fragments. On the contrary, if a sentence is non-propagandistic, the words in it are not of propagandistic as well. Based on this knowledge, we successively apply SLFC to our model, which does increase the F1-score by around three points and the Recall by around four points, respectively. Meanwhile, the precision does not decrease significantly. All in all, compared with no SLFC, Fig. 5 The comparison of SI training process between our systems with and without SLFC The bold values shows that our better model or method our system identifies the propaganda spans more accurate which consequently promotes the F1-score and Recall.

Results: technique classification (TC)
To better complete the TC task, we have presented two architectures, one without EDA and another with EDA. Comparing them with the baseline which is merely BERT-based with no fine tuning, both of our systems with EDA or not have reached a new high state, improving the F1-score by two times approximately as shown in Table 3. During our experiment, we have made an experimental comparison and analysis for our strategy of utilizing EDA in the data loading process of our TC system. The final result has indicated that our TC system with EDA improved F1-score by around 3% compared to the absent EDA system. It stands to reason that our TC system reaches the state-of-the-art in the end, which scores 57.5729% of F1-score on the test set. The respective promotions with EDA strategy in F1score for each of propaganda technique are shown in Table  4. Compared with our no EDA model, in spite of the fact that three of techniques ('Doubt', 'Flag-Waving', and 'Whataboutism, Straw_Men, Red_Herring') have slightly decreased to some extent, more than half of the techniques have made progress in F1-score. For details, most of them have gotten about eight-point improvement on average, such as 'Appeal_to_fear-prejudice', 'Exaggeration, Minimisation', 'Repetition' and so on. What is worth mentioning is that the techniques named 'Causal_Oversimplification' and 'Thought-terminating_Cliches' have gotten about 14-point improvement. Thus, our TC system makes many breakthroughs on the whole, giving the credit to EDA which can enhance the data set, prompt the model to converge faster and improve the generalization ability and robustness of the model.

Parameter analysis
After a series of experiments, we have given a set of optimal parameters [epoch, learning rate (lr), sentence length (sentlen)] for the models of the two tasks. The optimal parameter combinations are shown in bold in Table 5.
For the sentence length, which is the length of the single input into the BERT and is usually set to 256, we have set it to 200 and 210 for our SI and TC tasks, respectively. In SI task, it is attributed to that the whole sentences in dataset do not exceed 200 in length, and too much padding will lead to greater classification error. As for TC task, the maximum length of valid fragments in the dataset is 210, so we choose it as the limit for padding. In terms of learning rate, both of our choices are 3 ×10 −5 because our valid dataset is small. Through the analysis of the SLFC method for SI task in Sect. 4.2, we have found that the model began to converge around the epoch 7, so we set the training epoch to 8 to prevent overfitting in our SI system. Besides, in the experiment process of TC task, we have found the epoch parameter greater than 15 caused F1-score decreased, so we set it as the best choice for our TC system. Based on the above optimal parameters, our SI and TC systems finally obtained the F1-score of 0.441732 and 0.575729, and both of the training processes have taken around 2.5 h using 4 Nvidia GTX 1080Ti graphics cards (i.e. around 10 GPU hours).

Conclusion and future work
Based on the BERT model, we have set forth two specific systems for Span Identification and Technique Classification of Propaganda in news articles. In the data loading process, we have tried two strategies, over-sampling in SI task and EDA in both of SI and TC tasks, in order to deal with the imbalance between data with and without tags and enlarge our training dataset. For SI task, we have afresh defined it as a three-token type sequence tagging task with our SI system, and adopted sentence-level feature concatenating method. For TC task, we have devised a system based on BERT with a dimensionality reduction FC layer and a linear classifier. Ultimately, we have achieved two efficient and accurate systems for propaganda detection in news articles. And the final result also confirmed that our research further perfects the BERT model.
In the future, we are going to improve the Precision, Recall and F1-score further by drawing lessons from the SpanBERT model, which may perform better. Namely, in the process of masking, we would like to cover consecutive words randomly instead of scattered words. And we are thinking about searching for a more suitable architecture of BERT adopting the popular Neural Architecture Search (NAS). Besides, we hope our model can be compressed to some extent. For instance, we can prune the classifier layer, quantify or share the parameters of our model. In these cases our model can be applied widely and conveniently in our daily lives. The bold values shows that our better model or method

Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.