1 Introduction

Digital technology produces a massive amount of data, especially text data on the internet, which is available in news articles, academic publications, emails, messages and other formats [1]. Recent world events show how strongly the news impacts our lives. Blogs, news, social media messages and posts contain a lot of truth, but they can also include fake information that seeks to manipulate people. In the absence of control mechanisms, many suspicious messages and accounts have reportedly been spread across multiple platforms. Identifying and labelling fake news is a demanding problem due to the massive amount of content [2]. Natural language processing gives computers the ability to understand text and spoken words in a manner similar to human beings. Machine learning methods combined with natural language processing are currently the best tools to automate the analytical process of identifying fake news. Researchers try to identify fake news using various techniques, from word-based analysis, through syntactic and semantic analysis, to different classification algorithms, both statistical and machine learning based [3].

Morphological and syntactic analysis seem to improve methods for analysing the content of texts [4]. Syntactic analysis can be performed using constituency grammar or dependency grammar. The common core of all varieties of dependency grammar is the assumption that syntactic structure consists primarily of binary asymmetrical relations that hold between words. This structure can be displayed in a dependency tree, where nodes represent words and labelled arcs represent different types of dependency relations. A dependency tree representation of syntactic structure emphasizes the functional role of a word in a sentence [5].

Morphological analysis is a basic task of natural language processing that segments an input sentence into tokens and annotates them with part-of-speech (POS) tags [6].
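To make the two kinds of analysis concrete, the following minimal sketch prints the POS tag and the dependency relation of each token; spaCy is used here as an assumption, since the tooling is not specified at this point.

```python
# Minimal illustration (using spaCy, which is an assumption here)
# of POS tagging and dependency relations.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("Researchers identify fake news with syntactic analysis.")

for token in doc:
    # token.pos_: POS tag, token.dep_: dependency relation,
    # token.head: the governing word in the dependency tree
    print(token.text, token.pos_, token.dep_, token.head.text)
```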

Morphological analysis can also be used for identifying fake news [7,8,9]. In this paper, a new feature extraction technique, MDgwPosF, is introduced; it combines the standard TfIdf feature extraction method, weighted by word depths, with the relative frequencies of POS tags. The MDgwPosF technique was evaluated on two different datasets about the Covid-19 virus, one of which was manually annotated and the other automatically.

The current state of research on different feature extraction methods is summarized in the second section. The third section presents the proposed feature extraction method as well as the datasets used for evaluation. The results are summarized in Sect. 4, and finally, the discussion and conclusions form the content of the last section of the paper.

2 Related work

Various research is based on different techniques of feature extraction from texts. The development of extraction techniques leads to classification with higher accuracy. Kadhim [10] used the commonly utilized TfIdf technique to extract term features. He compared different supervised machine learning classification algorithms, such as Naïve Bayes, Support Vector Machine, and k-nearest neighbours. His results show that different techniques perform differently depending on the dataset.

Szabo Nagy and Kapusta [11] proposed a novel technique for fake news classification named TwIdw. It is employed for feature extraction and is based on TfIdf, replacing term frequencies with the depths of words in documents. An increase in accuracy of up to 3.9% was observed with the feed-forward neural network method on the political dataset.

The TfIdf technique was also used by Gaydhani et al. [12] in their research for feature extraction. They performed experiments considering n-grams as features and passing their TfIdf values to multiple machine learning models. The results were evaluated on three different classification algorithms: Naïve Bayes, Logistic Regression and Support Vector Machine. Support Vector Machine performed worse than Naïve Bayes and Logistic Regression. The best results were achieved with the Logistic Regression model, which reached 95.6% accuracy on the test data after model tuning.

Das et al. [13] compared n-grams and TfIdf as feature extraction techniques for sentiment analysis. Support Vector Machine, Logistic Regression, Multinomial Naive Bayes, Random Forest, Decision Tree, and k-nearest neighbours were used for classification. Of the two feature extraction methods, TfIdf yielded significantly better results. TfIdf achieved the highest accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) with the Random Forest classifier.

N-grams from morphological tags were used by Kapusta et al. [9] for a classification task. Three techniques based on POS tags were proposed and applied to different groups of n-grams in the pre-processing phase of fake news detection. The results showed that the newly proposed techniques are comparable with the traditional TfIdf technique, and the morphological analysis can improve the baseline TfIdf technique.

Multiple works can be found that analyse techniques for improving TfIdf. Wu and Yuan [14] introduced an improved TfIdf algorithm based on word frequency distribution information and category distribution information. The improved algorithm introduces the concepts of word frequency distribution and class distribution to describe the weight of a feature item more accurately. The experimental results show that the improved algorithm achieves better classification results on both balanced and unbalanced text datasets. Its classification accuracy is 12.88% higher than that of the original algorithm.

Text classification plays a very important role in processing massive text data, but the accuracy of classification is often affected by the performance of term weighting. TfIdf is not effective enough for text classification, especially for processing text data with unbalanced distributions. For this reason, Jiang et al. [15] calculated the variance between the document frequency value of a particular term and the average of all document frequencies. The document frequency variance was proposed to enhance the ability to process text data with an unbalanced distribution. They proposed four techniques: TF-IADF, TF-IADF+, TF-IADF\(_\textrm{norm}\), and TF-IADF+\(_\textrm{norm}\).

Zhang and Ge [16] introduced a new algorithm named TF-IDF-\(\rho\). They utilized it to represent desensitized data for text classification. Their experiments show an increase in the F1 measure of up to 4.07% for TF-IDF-\(\rho\) in comparison with the traditional TfIdf. Another improvement of TfIdf was surveyed by Zhang et al. [17]. Their paper presents a new improved method called the TFIDFZ algorithm, which assigns different weights according to the word character and to different positions in the text; the further improved variant is named the TFIDFZW algorithm. The experimental results show that the precision and recall rates of the TFIDFZ and TFIDFZW algorithms are better than those of the traditional TfIdf.

Dependency grammar is one of the methods for syntactic analysis. Zhang et al. [18] classified Chinese short texts based on dependency grammar in their research. They trained word vectors based on sentence dependency triples. The results of the experiment show that the proposed algorithm remarkably improves the performance of short text classification. Nagy and Kapusta [3] used dependency grammar together with TfIdf values to improve the classification of fake news. The results show that dependency grammar information can be used with acceptable accuracy for the classification of fake news and that dependency grammar can improve existing techniques such as traditional TfIdf.

Dependency grammar together with POS tags was used by Zhi et al. [19] in a sentence classifier that filters out non-feature-containing sentences before feature extraction. To evaluate the performance of their classifier, they produced a dataset with corresponding annotations. The results show that their classifier can successfully filter out 79% of non-feature-containing sentences. Namdari and Durrani [20] investigated the predictability of fundamental and technical analyses using a multilayer feed-forward perceptron neural network (MLP). Historical stock prices and financial ratios of technology companies were utilized. The model incorporated self-organizing maps (SOMs) and underwent hyper-parameter optimization with a three-hidden-layer MLP. The hybrid model successfully predicted short-term stock trends with a directional accuracy of 70.36%, surpassing the performance of fundamental and technical analyses. Neural networks have found applications in various other fields, showcasing their versatility and effectiveness in handling data [21, 22].

3 Materials and methods

3.1 Feature extraction

The major objective of feature extraction is to convert text from any format into a representation that can be easily processed by supervised learning [6, 10, 23]. This paper focuses on two approaches to feature extraction: POS tags from morphological analysis and TfIdf improved by adding term depths in the sentences. Feature extraction and the creation of input vectors for the classification models are based on two research papers, the first by Nagy and Kapusta [3] and the second by Kapusta et al. [9].

Nagy and Kapusta [3] introduced an improved TfIdf technique for feature extraction and named it MultipleDgw.

MultipleDgw is calculated as follows

$$\begin{aligned} \vec {MultipleDgw(d)}=(t_1*w_1, t_2*w_2,\ldots ,t_n*w_n) \end{aligned}$$
(1)

where t is the TfIdf value of a term, w is the weight (depth) of the term in the sentence, and d is a document. MultipleDgw calculates the weights based on the knowledge that verbs and derived nouns, and occasionally adjectives, are more important within the sentence than prepositions, conjunctions, or other parts of speech. The weights are based on the order (depth) of words. A possible problem in calculating the weights is the fact that the analysed text can contain the same word at different depths. This is handled by calculating the average depth for words that occur more than once in the analysed records. The calculation is also derived from min-max normalization and is explained in detail in the cited paper.
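A simplified sketch of this weighting is given below. It assumes depth is measured as the distance from the root of the dependency tree (spaCy is used as a stand-in parser) and that shallower words, such as the root verb, receive larger weights; the exact normalization is the one defined in [3], so this is only an approximation.

```python
# Simplified sketch of MultipleDgw-style depth weighting; the exact
# normalization follows [3], this is an approximation.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in parser (an assumption)

def word_depths(text):
    """Average depth of each word across the document's dependency trees."""
    depths = defaultdict(list)
    for token in nlp(text):
        depth, t = 0, token
        while t.head is not t:          # climb to the sentence root
            t = t.head
            depth += 1
        depths[token.lemma_.lower()].append(depth)
    # the same word may occur at different depths -> average them
    return {w: sum(d) / len(d) for w, d in depths.items()}

def depth_weights(text):
    """Min-max normalized weights; shallower words (e.g. the root verb)
    receive larger weights in this sketch."""
    avg = word_depths(text)
    lo, hi = min(avg.values()), max(avg.values())
    span = (hi - lo) or 1.0
    return {w: 1.0 - (d - lo) / span for w, d in avg.items()}

def multiple_dgw(tfidf, weights):
    """MultipleDgw(d): each term's TfIdf value times its depth weight.
    Assumes tfidf keys are the same lower-cased lemmas as in weights."""
    return {w: v * weights.get(w, 0.0) for w, v in tfidf.items()}
```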

The second paper [9] introduces the PosF vector \(\vec {PosF(d)} = (p_1, p_2,\ldots , p_n)\), which represents the relative frequency of POS tags within the analysed list of POS tags in the document. This technique is an analogy of the Term Frequency technique, and the concrete calculation is explained in the cited paper.
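The calculation can be sketched as follows, assuming a fixed ordering of the tagset that defines the vector dimensions:

```python
# PosF sketch: relative frequency of each POS tag in a document.
from collections import Counter

def posf_vector(pos_tags, tagset):
    """pos_tags: list of POS tags of the document's tokens;
    tagset: fixed ordering of tags defining the vector dimensions."""
    counts = Counter(pos_tags)
    total = len(pos_tags) or 1
    return [counts[tag] / total for tag in tagset]

# e.g. posf_vector(["NOUN", "VERB", "NOUN"], ["NOUN", "VERB", "ADJ"])
# -> [0.666..., 0.333..., 0.0]
```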

In the pursuit of improving the classification of unstructured texts, a new vector is introduced. The vector MDgwPosF is a merge of MultipleDgw and PosF, and it is calculated as follows:

$$\begin{aligned} \vec {MDgwPosF(d)}=(t_1*w_1, t_2*w_2,\ldots ,t_n*w_n, p_1, p_2,\ldots , p_n) \end{aligned}$$
(2)

where \(t*w\) is the MultipleDgw value of a term, p is the PosF value of a morphological tag, and d is a document. In this paper, the proposed method is evaluated against the baseline models TfIdf and TfIdf with POS tags.
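In vector terms, the merge is a simple concatenation of the two parts, as the following sketch illustrates (the names are illustrative):

```python
# MDgwPosF sketch: the final input vector is the concatenation of the
# MultipleDgw term vector and the PosF tag-frequency vector.
import numpy as np

def mdgw_posf(multiple_dgw_vec, posf_vec):
    # multiple_dgw_vec: TfIdf values already multiplied by depth weights
    # posf_vec: relative frequencies of the POS tags
    return np.concatenate([multiple_dgw_vec, posf_vec])
```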

3.2 Datasets

Two datasets are used in the research (Table 1). Both datasets concern the Covid-19 pandemic and contain true and fake news. The first dataset was collected automatically by Li [24] and is more evenly distributed, as the ratio between true and fake information is almost 50:50. It contains true records from trusted news sources and fake records from well-known fake news websites that intentionally try to spread misinformation. This dataset is referred to as Data_auto in this research. The second dataset was collected by Koirala [25] and contains news from December 2019 to July 2020. It was collected using webhose.io and was manually labelled. It is referred to as Data_manual in this research. This dataset is not evenly distributed, as it contains 2061 true records and 659 fake records.

Table 2 presents the sizes of the generated vectors from the data sets using three different approaches: TfIdf (M1), TfIdf and relative frequencies of POS tags (M2), and MDgwPosF (M3). The table provides informative insights into the dimensions of the vectors produced by these methods.

Table 1 Number of records in datasets
Table 2 Vector shapes

3.3 Methods

Three neural networks were implemented in this research—feed-forward neural network, LSTM and GRU.

The feed-forward neural network is one of the basic neural network architectures, in which the output of one layer is forwarded to each neuron in the next layer; it thus works in a unidirectional way. As there are no connections back to previous layers, feed-forward neural networks cannot persist past information.

The LSTM neural network was introduced in 1997 by Hochreiter and Schmidhuber [26]. It consists of building blocks for the layers of a recurrent neural network [27, 28]. An LSTM unit is composed of a cell and three gates: input, output and forget. The cell "remembers" values over a time interval, so a word at the beginning of the text can influence the output for a word later in the text [29].

The GRU neural network was first introduced in 2014 [30]. It is a recurrent neural network like the LSTM, but less complex [31]. It also has a gating mechanism to control the information flow through the cell state, but it has fewer parameters and does not contain an output gate [32].

Three different neural networks were created for each type: a simpler one (T1), a moderate one (T2) and a more complex one (T3). The architectures of the neural networks in this research are shown in Table 3, and the architecture of each type is the same. The sigmoid activation function was used between hidden layers and hard-sigmoid on the output layers. Dropout of 0.25 was applied between layers. In the recurrent neural networks, recurrency was added to the layers. The Adam optimizer, 25 epochs and a batch size of 10 were used. In previous studies, experimental outcomes showed that optimizers such as RMSProp and Adam, which use adaptive moment estimation, yield improved results [33, 34]. The training was performed 10 times because of k-fold validation with k = 10. K-fold validation ensures that every observation from the original dataset appears in both the train and the test set. The first step of k-fold validation is a random shuffling of the dataset, so the inputs are not biased in any way. The original sample is randomly partitioned into k equal-sized subsamples. Training and testing are performed k times; in each iteration, one subsample is chosen for testing and the remaining \(k - 1\) subsamples are used for training the model. Each of the k subsamples is used exactly once for testing.

The Keras Python library was used for the models' implementation. Figure 1 summarizes all of the methods and models in the paper.
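A hedged sketch of how one such feed-forward configuration could look in Keras is shown below. The hidden-layer sizes are illustrative assumptions (the actual topologies are given in Table 3), while the activations, dropout, optimizer, epochs, batch size and 10-fold scheme follow the description above.

```python
# Sketch of one feed-forward configuration with 10-fold validation.
# Hidden-layer sizes are assumptions; the actual topologies (T1-T3)
# are listed in Table 3.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

def build_ff(input_dim, hidden=(64, 32)):          # sizes assumed
    model = Sequential([Input(shape=(input_dim,))])
    for units in hidden:
        model.add(Dense(units, activation="sigmoid"))
        model.add(Dropout(0.25))                   # dropout as in the text
    model.add(Dense(1, activation="hard_sigmoid")) # output layer
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def accuracy_10fold(X, y):
    accs = []
    # shuffle first so the folds are not biased, then split into 10 parts
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
        model = build_ff(X.shape[1])
        model.fit(X[train_idx], y[train_idx],
                  epochs=25, batch_size=10, verbose=0)
        accs.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
    return float(np.mean(accs))
```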

Fig. 1 The workflow of the experiment

Table 3 Used neural networks architectures

4 Results

The quality of the proposed methods (TfIdf labelled as M1, TfIdf and POS tags labelled as M2, the proposed MDgwPosF as M3) was evaluated using evaluation measures (accuracy, precision, recall, F1-score, precision_fake, recall_fake, precision_real, recall_real). Within the 10-fold validation, 10 measurements of each evaluation metric were obtained, one per fold.

Descriptive statistics of the accuracy values for each NN model (FF, LSTM, GRU), method (M1, M2 and M3), topology (T1, T2, T3) and dataset (auto, manual) are given in Table 4 for the auto dataset and Table 5 for the manual dataset. The tables are sorted by the mean values. The dataset that was created automatically achieved better outcomes. Although this dataset was labelled using classification methods, which may have influenced its accuracy, it still achieved better results even with other classification methods. On the other hand, more stable accuracy results were obtained on the manual dataset; the auto dataset has the most heterogeneous values across all characteristics. The results also show better values of the mean and of the confidence interval for the mean (\(-\)95.00%, +95.00%) for the automatically created dataset. The descriptive statistics show that the most successful neural network model was the feed-forward model. It achieved the best results for both datasets and for all topologies and methods. A more detailed view shows better results for the MDgwPosF (M3) method, which is evident from most measurements for the individual NN models and topologies. The most significant impact on the results came from the neural network model used: not only the mean but also the lower confidence bound (\(-\)95.00%) indicates that when the feed-forward neural network model is used, the lower bound of accuracy is above 0.89. From the perspective of the method and topology used, these differences are not large. Similar findings were obtained for most values of the other observed performance measures (precision, recall). Figure 2 presents the F1-score for all the results.

Fig. 2 Means with confidence intervals plot for F1-score

Given the results of recall and precision, this metric only confirmed a small difference in results for all three methods examined. The results for the F1-score confirm the better results on the auto dataset. Additionally, looking more closely at the results, the best F1-scores were observed for the MDgwPosF (M3) method independent of the model and topology, except for the LSTM model with the T2 topology. It was unclear whether the results support the M3 method on the manual dataset. The results also show that the best results for both datasets were achieved using the feed-forward model and the T2 topology.

From the perspective of the feed-forward neural network model, the best results were reported for the MDgwPosF (M3) method on the auto dataset, and for the TfIdf with POS tags (M2) and MDgwPosF (M3) methods on the manual dataset. It is also evident that topology T2 was the most successful, regardless of the model or method used.

An interesting perspective is obtained by evaluating the results for recall. This performance measure represents how many actually fake and actually true news items were correctly classified.

Fig. 3 Means with confidence intervals plot for rec_fake

Fig. 4 Means with confidence intervals plot for rec_real

The findings are confirmed by the results of rec_fake (Fig. 3). Very low F1-score values can be noticed for the LSTM and GRU models on the manual dataset. Conversely, for rec_fake, better results were observed on the manual dataset. This is mainly due to the unbalanced manual dataset. To ensure comparability of the methods, rebalancing methods were not applied; for this reason, the imbalance affected the results. Despite this fact, it is compelling that in the case of the feed-forward model, very good results were recorded for both datasets and for practically all topologies and methods (Figs. 3, 4).

The results for the feed-forward model were comparable for both datasets. The feed-forward model could be successfully trained even on an unbalanced dataset. Also, apart from the T3 topology, the best results were recorded for the MDgwPosF (M3) method. According to the descriptive statistics for all performance measures, the feed-forward neural network model was among the most successful models. Despite the success of the MDgwPosF (M3) method for most performance measures, the differences are not statistically significant.

Considering the data exploration analysis, a null hypothesis was established. The global null hypothesis is: there is no statistically significant difference in the models' performance in terms of classification correctness or performance measures (accuracy, precision, fake precision, real precision, recall, fake recall, real recall and F1-score).

To verify the hypothesis, Dunnett's one-sided tests were used, because it was necessary to demonstrate against which models the proposed model is more efficient in terms of classification accuracy. Null and alternative hypotheses were formulated, and one-tailed tests were used to find out against which models the proposed model is more effective in terms of classification correctness. To this end, many-to-one comparisons were performed.

The null hypothesis for the many-to-one comparison is: there is no statistically significant difference in efficiency/performance in terms of classification correctness between the proposed model and the existing models. The alternative hypothesis is: the proposed model (FF_T2_M3) is more effective than the existing models (FF_T1_M1, ..., GRU_T3_M3) in terms of classification correctness (Tables 6 and 7).

Table 4 Data exploration for auto dataset
Table 5 Data exploration for manual dataset
Table 6 Global null hypothesis testing
Table 7 Many-to-one comparisons for variable acc

From the proposed models, FF_T2_M3 was chosen. The feed-forward neural network is one of the traditional methods for classification tasks. Even though GRU and LSTM can also be used for classification tasks, their success is debatable.

With neural network topologies, it is natural to expect better results with a larger, richer topology. On the other hand, the risk of overfitting must always be taken into account, which can disqualify richer models. For this reason, the T2 topology was chosen as the ideal balance between network performance and the risk of overfitting.

The hypotheses were tested on all data as well as on the individual datasets. Based on the data exploration, a difference between the models can be seen: the feed-forward neural network performed much better than the other models.

The Kolmogorov–Smirnov test was used to verify the assumption of normality. The examined variables (acc(FF_T1_M1), acc(FF_T1_M2), ..., f1-sc(GRU_T3_M2), f1-sc(GRU_T3_M3)) have a normal distribution (total: \(N = 20, \max D < 0.293, p > 0.05\); auto: \(N=10, \max D < 0.306, p > 0.05\); manual: \(N = 10, \max D < 0.361, p > 0.05\)); therefore, a parametric model was used for evaluating the hypotheses.
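The normality check can be reproduced roughly as follows; this is a sketch assuming SciPy, and the exact parameter handling of the original statistical software may differ.

```python
# Sketch of the normality check; the K-S test with parameters
# estimated from the sample is approximated here by standardizing
# the data first (an assumption about the exact procedure used).
import numpy as np
from scipy import stats

def ks_normality(sample):
    z = (sample - np.mean(sample)) / np.std(sample, ddof=1)
    d, p = stats.kstest(z, "norm")   # max D statistic and p-value
    return d, p
```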

An adjusted test (Greenhouse–Geisser adjustment) was used to verify the models' effectiveness due to the violation of the sphericity condition of the covariance matrix. When the sphericity condition is not fulfilled, the type I error rate increases. Epsilon represents the degree to which the sphericity condition is breached: Epsilon equal to 1 represents fulfilment of the condition, and the smaller the Epsilon value, the more the sphericity condition is breached.

When comparing the proposed model against the existing models (Table 6), the Epsilon values were significantly smaller than one (total: G–G Epsilon \(< 0.385\), Adj. \(p < 0.001\); auto: G–G Epsilon \(< 0.265\), Adj. \(p < 0.001\); manual: G–G Epsilon \(< 0.275\), Adj. \(p < 0.001\)). The null hypotheses, which claim that there is no statistically significant difference in the values of the evaluation measures (acc, prec, prec_fake, prec_real, rec, rec_fake, rec_real and f1-sc) between the models, were rejected at the 0.001 significance level. The hypotheses were tested on all data as well as on the individual datasets.

Dunnett's one-sided tests (Table 7) were used to examine the effectiveness of the proposed model MDgwPosF (M3) against the existing models in many-to-one comparisons (existing models against the proposed neural network model). Significant p values indicate against which models the proposed model is more effective at the 0.05/0.01/0.001 significance level (*\(p < 0.05\), **\(p < 0.01\), ***\(p < 0.001\)).
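As an illustration, the comparison could be set up as follows; this is a sketch using SciPy's dunnett, which is not necessarily the software used in the paper.

```python
# Sketch of the many-to-one comparison with SciPy's Dunnett test
# (scipy >= 1.11); the proposed model FF_T2_M3 plays the role of the
# control group. Note: this treats the folds as independent samples,
# which only approximates the paper's repeated-measures design.
from scipy.stats import dunnett

def compare_to_proposed(acc_proposed, acc_existing):
    # acc_proposed: 10 accuracy values of FF_T2_M3 (one per fold)
    # acc_existing: list of arrays, one per existing model
    # one-sided: are the existing models less accurate than the proposed?
    res = dunnett(*acc_existing, control=acc_proposed, alternative="less")
    return res.pvalue          # one p-value per existing model
```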

It is clear from the results that statistically significant differences were recorded for acc(FF_T2_M3) against the other types of neural network models used.

It can be concluded that the differences in the performance measure results for the LSTM and GRU models are statistically significant in favour of the feed-forward models. No statistically significant differences were found between the individual methods and topologies for the feed-forward model. This means that the performance measures are most influenced by the neural network model used. It is clear from the descriptive statistics that the proposed FF_T2_M3 method achieved the best results compared to the FF_T2_M1 and FF_T2_M2 methods, but these differences are not statistically significant. Similarly, the proposed M3 method achieved better results for the other topologies T1 and T3 in the feed-forward models.

In addition to the one-sided tests for the accuracy variable, similar one-sided tests were carried out for the other variables (prec, prec_fake, prec_real, rec, rec_fake, rec_real and f1-sc). The results were very similar and also confirmed statistically significant differences between the models, but within the comparison between topologies and methods, the differences were not statistically significant.

5 Discussion

The results provide clear evidence of statistically significant differences in the type of neural network used and between the datasets. As LSTM and GRU are networks with loops that allow information to persist, this can lead to lower accuracy because of the large amount of unique information. The results did not confirm statistically significant differences between the investigated methods of preparing the input vector. From the results, it can be observed that morphological analysis improved the results for the identification of fake news; the contribution of syntactic analysis is uncertain. Information about the part of speech can improve the classification, whereas information about the role of individual words in sentences, i.e., word dependencies, did not bring significant improvement. The results show that the combination of syntactic and morphological analysis into one method (M3) brings the largest improvement.

For almost all models and topologies, on all data as well as on the individual datasets (Table 8), the M3 method yields better results in terms of classification accuracy, except for GRU_T1 on the auto dataset and LSTM_T2 on the auto dataset as well as on all data. Statistically significantly better results (p < 0.05) in terms of classification accuracy (Table 8) were demonstrated only in the case of the LSTM model with the T1 topology, where a statistically significant difference between the M1 and M3 methods was found in favour of the M3 method (p < 0.05). By contrast, a statistically significant difference between methods M2 and M3 (Table 8) was not demonstrated (p > 0.05). For the remaining models and topologies, statistically significantly better results in terms of classification accuracy (Table 8) were not demonstrated (p > 0.05), even though in most cases (except three) the M3 method yielded better results. To verify the effectiveness of the proposed M3 method, unadjusted tests for repeated measurements were used (Table 8), given the validity of the sphericity condition of the covariance matrix (p > 0.05).

Table 8 Summary of results for the M3 method with respect to model and topology

The feed-forward model achieved the best results of all the models. These differences appear to be statistically significant; however, their significance was not formally verified, as the main focus was on the methods rather than on the neural network model. The feed-forward model appears to be the most suitable model for classification tasks. The remaining two models, GRU and LSTM, can be used for classification tasks as well, but classification is not their main purpose. These two models were included only for a more robust comparison of the methods and for the evaluation of the results. The results confirmed these findings about the individual models.

A similar procedure was used in the case of the topologies, where a suitable topology (T2) was designed and compared to a simpler topology (T1) and a more robust topology (T3). The measurements show worse results for the simpler topology (T1). Surprisingly, the T3 topology performed similarly to T2 in most cases after accounting for methods and models.

The main part of the analysis focused on the methods. Looking at the results of the descriptive statistics, the M3 method (in combination with the FF model and the T2 topology) was the most successful on the manual dataset (mean = 0.86). It was also among the most successful methods (FF_T2_M3, FF_T1_M3, FF_T1_M2) on the auto dataset (mean = 0.94).

The only statistically significant difference in favour of the M3 method was observed on the manual dataset, with the LSTM model and the T1 topology. For the other models and topologies, M3 was among the most successful methods, but without statistically significant differences.

6 Conclusion

In this paper, an improved feature extraction method, MDgwPosF (M3), is proposed and compared to two baseline methods: TfIdf (M1) and TfIdf with POS tags (M2). An evaluation using multiple neural networks (feed-forward, LSTM, GRU) with different topologies (T1, T2 and T3) was performed. The effectiveness of the models was verified on two datasets about the Covid-19 virus.

The proposed method is based on syntactic and morphological analysis of the texts. Syntactic analysis is performed using dependency grammar. The morphological part of the feature extraction consists of the relative frequency of the POS tags.

The quality of the methods (M1, M2 and M3) was evaluated using three different neural networks (feed-forward, LSTM, GRU) with three different topologies (T1, T2 and T3) on two different datasets, auto and manual. All evaluation metrics (accuracy, precision, recall and F1-score) were calculated using 10-fold validation. The dataset that was created automatically achieved better outcomes. The descriptive statistics show that the most successful neural network model was the feed-forward model, which achieved the best results for both datasets and for all topologies and methods. Among the topologies, the results show that T2 performed best. A statistical evaluation was performed in which hypotheses were formulated. The Kolmogorov–Smirnov test was used to verify the assumption of normality, and an adjusted test (Greenhouse–Geisser adjustment) was used to verify the effectiveness of the models. The results confirm the presence of statistically significant differences in the type of neural network used and between the datasets. For almost all models and topologies, on all data as well as on the individual datasets, the M3 method yields better results in terms of classification accuracy.