1 Introduction

Digital technology produces a massive amount of data, especially text data on the internet, which is available in news articles, academic publications, emails, messages and other formats [1]. Recent world events show how strongly the news impacts our lives. Blogs, news, social media messages and posts contain a lot of truth, but they can also include fake information that seeks to manipulate people. In the absence of control mechanisms, many suspicious messages and accounts have reportedly been spread across multiple platforms. Identifying and labelling fake news is a demanding problem due to the massive amount of content [2]. Natural language processing gives computers the ability to understand text and spoken words in a manner similar to human beings. Machine learning methods combined with natural language processing are currently the best tools to automate the analytical process of identifying fake news. Researchers try to identify fake news using various techniques, from word-based analysis, through syntactic and semantic analysis, to different classification algorithms, both statistical and machine learning based [3].

Morphological and syntactic analysis seem to improve methods for analysing the content of texts [4]. Syntactic analysis can be performed using constituency grammar or dependency grammar. The common core of all varieties of dependency grammar is the assumption that syntactic structure consists primarily of binary asymmetrical relations that hold between words. This structure can be displayed in a dependency tree, where nodes represent words and labelled arcs represent different types of dependency relations. A dependency tree representation of syntactic structure emphasizes the functional role of a word in a sentence [5].

Morphological analysis is a basic task of natural language processing that segments an input sentence into tokens and annotates them with part-of-speech (POS) tags [6].
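To make the two kinds of analysis concrete, the following minimal sketch prints the POS tag and the dependency relation of each token; spaCy is used here as an assumption, since the tooling is not specified at this point.

```python
# Minimal illustration (using spaCy, which is an assumption here)
# of POS tagging and dependency relations.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("Researchers identify fake news with syntactic analysis.")

for token in doc:
    # token.pos_: POS tag, token.dep_: dependency relation,
    # token.head: the governing word in the dependency tree
    print(token.text, token.pos_, token.dep_, token.head.text)
```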

Morphological analysis can also be used for identifying fake news [7,8,9]. In this paper, a new feature extraction technique, MDgwPosF, is introduced; it combines the standard TfIdf feature extraction method, weighted by word depths, with the relative frequencies of POS tags. The MDgwPosF technique was evaluated on two different datasets about the Covid-19 virus, one of which was manually annotated and the other automatically.

The current state of research on different feature extraction methods is summarized in the second section. The third section presents the proposed feature extraction method as well as the datasets used for evaluation. The results are summarized in Sect. 4, and finally, the discussion and conclusions form the content of the last section of the paper.

2 Related work

Various research is based on different techniques of feature extraction from texts. The development of extraction techniques leads to classification with higher accuracy. Kadhim [10] used the commonly utilized TfIdf technique to extract term features. He compared different supervised machine learning classification algorithms, such as Naïve Bayes, Support Vector Machine, and k-nearest neighbours. His results show that different techniques perform differently depending on the dataset.

Szabo Nagy and Kapusta [11] proposed a novel technique for fake news classification named TwIdw. It is employed for feature extraction and is based on TfIdf, replacing term frequencies with the depths of words in documents. An increase in accuracy of up to 3.9% was observed with the feed-forward neural network method on the political dataset.

The TfIdf technique was also used by Gaydhani et al. [12] in their research for feature extraction. They performed experiments considering n-grams as features and passing their TfIdf values to multiple machine learning models. The results were evaluated on three different classification algorithms: Naïve Bayes, Logistic Regression and Support Vector Machine. Support Vector Machine performed worse than Naïve Bayes and Logistic Regression. The best results were achieved with the Logistic Regression model, which reached 95.6% accuracy on the test data after model tuning.

Das et al. [13] compared n-grams and TfIdf as feature extraction techniques for sentiment analysis. Support Vector Machine, Logistic Regression, Multinomial Naive Bayes, Random Forest, Decision Tree, and k-nearest neighbours were used for classification. Of the two feature extraction methods, TfIdf yielded significantly better results. TfIdf achieved the highest accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) with the Random Forest classifier.

N-grams from morphological tags were used by Kapusta et al. [9] for a classification task. Three techniques based on POS tags were proposed and applied to different groups of n-grams in the pre-processing phase of fake news detection. The results showed that the newly proposed techniques are comparable with the traditional TfIdf technique, and the morphological analysis can improve the baseline TfIdf technique.

Multiple works can be found that analyse techniques for improving TfIdf. Wu and Yuan [14] introduced an improved TfIdf algorithm based on word frequency distribution information and category distribution information. The improved algorithm introduces the concepts of word frequency distribution and class distribution to describe the weight of a feature item more accurately. The experimental results show that the improved algorithm achieves better classification results on both balanced and unbalanced text datasets. Its classification accuracy is 12.88% higher than that of the original algorithm.

Text classification plays a very important role in processing massive text data, but the accuracy of classification is often affected by the performance of term weighting. TfIdf is not effective enough for text classification, especially for processing text data with unbalanced distributions. For this reason, Jiang et al. [15] calculated the variance between the document frequency value of a particular term and the average of all document frequencies. The document frequency variance was proposed to enhance the ability to process text data with an unbalanced distribution. They proposed four techniques: TF-IADF, TF-IADF+, TF-IADF\(_\textrm{norm}\), and TF-IADF+\(_\textrm{norm}\).

Zhang and Ge [16] introduced a new algorithm named TF-IDF-\(\rho\). They utilized it to represent desensitized data for text classification. Their experiments show an increase in the F1 measure of up to 4.07% for TF-IDF-\(\rho\) in comparison with the traditional TfIdf. Another improvement of TfIdf was surveyed by Zhang et al. [17]. Their paper presents a new improved method called the TFIDFZ algorithm, which assigns different weights according to the word character and to different positions in the text; the further improved variant is named the TFIDFZW algorithm. The experimental results show that the precision and recall rates of the TFIDFZ and TFIDFZW algorithms are better than those of the traditional TfIdf.

Dependency grammar is one of the methods for syntactic analysis. Zhang et al. [18] classified Chinese short texts based on dependency grammar in their research. They trained word vectors based on sentence dependency triples. The results of the experiment show that the proposed algorithm remarkably improves the performance of short text classification. Nagy and Kapusta [3] used dependency grammar together with TfIdf values to improve the classification of fake news. The results show that dependency grammar information can be used with acceptable accuracy for the classification of fake news and that dependency grammar can improve existing techniques such as traditional TfIdf.

Dependency grammar together with POS tags was used by Zhi et al. [19] in a sentence classifier that filters out non-feature-containing sentences before feature extraction. To evaluate the performance of their classifier, they produced a dataset with corresponding annotations. The results show that their classifier can successfully filter out 79% of non-feature-containing sentences. Namdari and Durrani [20] investigated the predictability of fundamental and technical analyses using a multilayer feed-forward perceptron neural network (MLP). Historical stock prices and financial ratios of technology companies were utilized. The model incorporated self-organizing maps (SOMs) and underwent hyper-parameter optimization with a three-hidden-layer MLP. The hybrid model successfully predicted short-term stock trends with a directional accuracy of 70.36%, surpassing the performance of fundamental and technical analyses. Neural networks have found applications in various other fields, showcasing their versatility and effectiveness in handling data [21, 22].

3 Materials and methods

3.1 Feature extraction

The major objective of feature extraction is to convert text from any format into a representation that can be easily processed by supervised learning [6, 10, 23]. This paper focuses on two approaches to feature extraction: POS tags from morphological analysis and TfIdf improved by adding term depths in the sentences. Feature extraction and the creation of input vectors for the classification models are based on two research papers, the first by Nagy and Kapusta [3] and the second by Kapusta et al. [9].

Nagy and Kapusta [3] introduced an improved TfIdf technique for feature extraction and named it MultipleDgw.

MultipleDgw is calculated as follows

$$\begin{aligned} \vec {MultipleDgw(d)}=(t_1*w_1, t_2*w_2,\ldots ,t_n*w_n) \end{aligned}$$
(1)

where t is the TfIdf value of a term, w is the weight (depth) of the term in the sentence, and d is a document. MultipleDgw calculates the weights based on the knowledge that verbs and derived nouns, and occasionally adjectives, are more important within the sentence than prepositions, conjunctions, or other parts of speech. The weights are based on the order (depth) of words. A possible problem in calculating the weights is the fact that the analysed text can contain the same word at different depths. This is handled by calculating the average depth for words that occur more than once in the analysed records. The calculation is also derived from min-max normalization and is explained in detail in the cited paper.
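A simplified sketch of this weighting is given below. It assumes depth is measured as the distance from the root of the dependency tree (spaCy is used as a stand-in parser) and that shallower words, such as the root verb, receive larger weights; the exact normalization is the one defined in [3], so this is only an approximation.

```python
# Simplified sketch of MultipleDgw-style depth weighting; the exact
# normalization follows [3], this is an approximation.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in parser (an assumption)

def word_depths(text):
    """Average depth of each word across the document's dependency trees."""
    depths = defaultdict(list)
    for token in nlp(text):
        depth, t = 0, token
        while t.head is not t:          # climb to the sentence root
            t = t.head
            depth += 1
        depths[token.lemma_.lower()].append(depth)
    # the same word may occur at different depths -> average them
    return {w: sum(d) / len(d) for w, d in depths.items()}

def depth_weights(text):
    """Min-max normalized weights; shallower words (e.g. the root verb)
    receive larger weights in this sketch."""
    avg = word_depths(text)
    lo, hi = min(avg.values()), max(avg.values())
    span = (hi - lo) or 1.0
    return {w: 1.0 - (d - lo) / span for w, d in avg.items()}

def multiple_dgw(tfidf, weights):
    """MultipleDgw(d): each term's TfIdf value times its depth weight.
    Assumes tfidf keys are the same lower-cased lemmas as in weights."""
    return {w: v * weights.get(w, 0.0) for w, v in tfidf.items()}
```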

The second paper [9] introduces the PosF vector \(\vec {PosF(d)} = (p_1, p_2,\ldots , p_n)\), which represents the relative frequency of POS tags within the analysed list of POS tags in the document. This technique is an analogy of the Term Frequency technique, and the concrete calculation is explained in the cited paper.
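The calculation can be sketched as follows, assuming a fixed ordering of the tagset that defines the vector dimensions:

```python
# PosF sketch: relative frequency of each POS tag in a document.
from collections import Counter

def posf_vector(pos_tags, tagset):
    """pos_tags: list of POS tags of the document's tokens;
    tagset: fixed ordering of tags defining the vector dimensions."""
    counts = Counter(pos_tags)
    total = len(pos_tags) or 1
    return [counts[tag] / total for tag in tagset]

# e.g. posf_vector(["NOUN", "VERB", "NOUN"], ["NOUN", "VERB", "ADJ"])
# -> [0.666..., 0.333..., 0.0]
```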

In the pursuit of improving the classification of unstructured texts, a new vector is introduced. The vector MDgwPosF is a merge of MultipleDgw and PosF, and it is calculated as follows:

$$\begin{aligned} \vec {MDgwPosF(d)}=(t_1*w_1, t_2*w_2,\ldots ,t_n*w_n, p_1, p_2,\ldots , p_n) \end{aligned}$$
(2)

where \(t*w\) is the MultipleDgw value of a term, p is the PosF value of a morphological tag, and d is a document. In this paper, the proposed method is evaluated against the baseline models TfIdf and TfIdf with POS tags.
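In vector terms, the merge is a simple concatenation of the two parts, as the following sketch illustrates (the names are illustrative):

```python
# MDgwPosF sketch: the final input vector is the concatenation of the
# MultipleDgw term vector and the PosF tag-frequency vector.
import numpy as np

def mdgw_posf(multiple_dgw_vec, posf_vec):
    # multiple_dgw_vec: TfIdf values already multiplied by depth weights
    # posf_vec: relative frequencies of the POS tags
    return np.concatenate([multiple_dgw_vec, posf_vec])
```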

3.2 Datasets

Two datasets are used in the research (Table 1). Both datasets concern the Covid-19 pandemic and contain true and fake news. The first dataset was collected automatically by Li [24] and is more evenly distributed, as the ratio between true and fake information is almost 50:50. It contains true records from trusted news sources and fake records from well-known fake news websites that intentionally try to spread misinformation. This dataset is referred to as Data_auto in this research. The second dataset was collected by Koirala [25] and contains news from December 2019 to July 2020. It was collected using webhose.io and was manually labelled. It is referred to as Data_manual in this research. This dataset is not evenly distributed, as it contains 2061 true records and 659 fake records.

Table 2 presents the sizes of the generated vectors from the data sets using three different approaches: TfIdf (M1), TfIdf and relative frequencies of POS tags (M2), and MDgwPosF (M3). The table provides informative insights into the dimensions of the vectors produced by these methods.

Table 1 Number of records in datasets
Table 2 Vector shapes

3.3 Methods

Three neural networks were implemented in this research—feed-forward neural network, LSTM and GRU.

The feed-forward neural network is one of the basic neural network architectures, in which the output of one layer is forwarded to each neuron in the next layer; it thus works in a unidirectional way. As there are no connections back to previous layers, feed-forward neural networks cannot persist past information.

The LSTM neural network was introduced in 1997 by Hochreiter and Schmidhuber [26]. It consists of building blocks for the layers of a recurrent neural network [27, 28]. An LSTM unit is composed of a cell and three gates: input, output and forget. The cell "remembers" values over a time interval, so a word at the beginning of the text can influence the output for a word later in the text [29].

The GRU neural network was first introduced in 2014 [30]. It is a recurrent neural network like the LSTM, but less complex [31]. It also has a gating mechanism to control the information flow through the cell state, but it has fewer parameters and does not contain an output gate [32].

Three different neural networks were created for each type: a simpler one (T1), a moderate one (T2) and a more complex one (T3). The architectures of the neural networks in this research are shown in Table 3, and the architecture of each type is the same. The sigmoid activation function was used between hidden layers and hard-sigmoid on the output layers. Dropout of 0.25 was applied between layers. In the recurrent neural networks, recurrency was added to the layers. The Adam optimizer, 25 epochs and a batch size of 10 were used. In previous studies, experimental outcomes showed that optimizers such as RMSProp and Adam, which use adaptive moment estimation, yield improved results [33, 34]. The training was performed 10 times because of k-fold validation with k = 10. K-fold validation ensures that every observation from the original dataset appears in both the train and the test set. The first step of k-fold validation is a random shuffling of the dataset, so the inputs are not biased in any way. The original sample is randomly partitioned into k equal-sized subsamples. Training and testing are performed k times; in each iteration, one subsample is chosen for testing and the remaining \(k - 1\) subsamples are used for training the model. Each of the k subsamples is used exactly once for testing.

The Keras Python library was used for the models' implementation. Figure 1 summarizes all of the methods and models in the paper.
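A hedged sketch of how one such feed-forward configuration could look in Keras is shown below. The hidden-layer sizes are illustrative assumptions (the actual topologies are given in Table 3), while the activations, dropout, optimizer, epochs, batch size and 10-fold scheme follow the description above.

```python
# Sketch of one feed-forward configuration with 10-fold validation.
# Hidden-layer sizes are assumptions; the actual topologies (T1-T3)
# are listed in Table 3.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

def build_ff(input_dim, hidden=(64, 32)):          # sizes assumed
    model = Sequential([Input(shape=(input_dim,))])
    for units in hidden:
        model.add(Dense(units, activation="sigmoid"))
        model.add(Dropout(0.25))                   # dropout as in the text
    model.add(Dense(1, activation="hard_sigmoid")) # output layer
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def accuracy_10fold(X, y):
    accs = []
    # shuffle first so the folds are not biased, then split into 10 parts
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
        model = build_ff(X.shape[1])
        model.fit(X[train_idx], y[train_idx],
                  epochs=25, batch_size=10, verbose=0)
        accs.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
    return float(np.mean(accs))
```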

Fig. 1 The workflow of the experiment

Table 3 Used neural networks architectures

4 Results

The quality of the proposed methods (TfIdf labelled as M1, TfIdf and POS tags labelled as M2, the proposed MDgwPosF as M3) was evaluated using evaluation measures (accuracy, precision, recall, F1-score, precision_fake, recall_fake, precision_real, recall_real). Within the 10-fold validation, 10 measurements of each evaluation metric were obtained, one per fold.

Descriptive statistics of the accuracy values for each NN model (FF, LSTM, GRU), method (M1, M2 and M3), topology (T1, T2, T3) and dataset (auto, manual) are given in Table 4 for the auto dataset and Table 5 for the manual dataset. The tables are sorted by the mean values. The dataset that was created automatically achieved better outcomes. Although this dataset was labelled using classification methods, which may have influenced its accuracy, it still achieved better results even with other classification methods. On the other hand, more stable accuracy results were obtained on the manual dataset; the auto dataset has the most heterogeneous values across all characteristics. The results also show better values of the mean and of the confidence interval for the mean (\(-\)95.00%, +95.00%) for the automatically created dataset. The descriptive statistics show that the most successful neural network model was the feed-forward model. It achieved the best results for both datasets and for all topologies and methods. A more detailed view shows better results for the MDgwPosF (M3) method, which is evident from most measurements for the individual NN models and topologies. The most significant impact on the results came from the neural network model used: not only the mean but also the lower confidence bound (\(-\)95.00%) indicates that when the feed-forward neural network model is used, the lower bound of accuracy is above 0.89. From the perspective of the method and topology used, these differences are not large. Similar findings were obtained for most values of the other observed performance measures (precision, recall). Figure 2 presents the F1-score for all the results.

Fig. 2 Means with confidence intervals plot for F1-score

Given the results of recall and precision, this metric only confirmed a small difference in results for all three methods examined. The results for the F1-score confirm the better results on the auto dataset. Additionally, looking more closely at the results, the best F1-scores were observed for the MDgwPosF (M3) method independent of the model and topology, except for the LSTM model with the T2 topology. It was unclear whether the results support the M3 method on the manual dataset. The results also show that the best results for both datasets were achieved using the feed-forward model and the T2 topology.

From the perspective of the feed-forward neural network model, the best results were reported for the MDgwPosF (M3) method on the auto dataset, and for the TfIdf with POS tags (M2) and MDgwPosF (M3) methods on the manual dataset. It is also evident that topology T2 was the most successful, regardless of the model or method used.

An interesting perspective is obtained by evaluating the results for recall. This performance measure represents how many actually fake and actually true news items were correctly classified.

Fig. 3 Means with confidence intervals plot for rec_fake

Fig. 4 Means with confidence intervals plot for rec_real

The findings are confirmed by the results of rec_fake (Fig. 3). Very low F1-score values can be noticed for the LSTM and GRU models on the manual dataset. Conversely, for rec_fake, better results were observed on the manual dataset. This is mainly due to the unbalanced manual dataset. To ensure comparability of the methods, rebalancing methods were not applied; for this reason, the imbalance affected the results. Despite this fact, it is compelling that in the case of the feed-forward model, very good results were recorded for both datasets and for practically all topologies and methods (Figs. 3, 4).

The results for the feed-forward model were comparable for both datasets. The feed-forward model could be successfully trained even on an unbalanced dataset. Also, apart from the T3 topology, the best results were recorded for the MDgwPosF (M3) method. According to the descriptive statistics for all performance measures, the feed-forward neural network model was among the most successful models. Despite the success of the MDgwPosF (M3) method for most performance measures, the differences are not statistically significant.

Considering the data exploration analysis, a null hypothesis was established. The global null hypothesis is: there is no statistically significant difference in the models' performance in terms of classification correctness or performance measures (accuracy, precision, fake precision, real precision, recall, fake recall, real recall and F1-score).

To verify the hypothesis, Dunnett's one-sided tests were used, because it was necessary to demonstrate against which models the proposed model is more efficient in terms of classification accuracy. Null and alternative hypotheses were formulated, and one-tailed tests were used to find out against which models the proposed model is more effective in terms of classification correctness. To this end, many-to-one comparisons were performed.

The null hypothesis for the many-to-one comparison is: there is no statistically significant difference in efficiency/performance in terms of classification correctness between the proposed model and the existing models. The alternative hypothesis is: the proposed model (FF_T2_M3) is more effective than the existing models (FF_T1_M1, ..., GRU_T3_M3) in terms of classification correctness (Tables 6 and 7).

Table 4 Data exploration for auto dataset
Table 5 Data exploration for manual dataset
Table 6 Global null hypothesis testing
Table 7 Many-to-one comparisons for variable acc

From the proposed models, FF_T2_M3 was chosen. The feed-forward neural network is one of the traditional methods for classification tasks. Even though GRU and LSTM can also be used for classification tasks, their success is debatable.

With neural network topologies, it is natural to expect better results with a larger, richer topology. On the other hand, the risk of overfitting must always be taken into account, which can disqualify richer models. For this reason, the T2 topology was chosen as the ideal balance between network performance and the risk of overfitting.

The hypotheses were tested on all data as well as on the individual datasets. Based on the data exploration, a difference between the models can be seen: the feed-forward neural network performed much better than the other models.

The Kolmogorov–Smirnov test was used to verify the assumption of normality. The examined variables (acc(FF_T1_M1), acc(FF_T1_M2), ..., f1-sc(GRU_T3_M2), f1-sc(GRU_T3_M3)) have a normal distribution (total: \(N = 20, \max D < 0.293, p > 0.05\); auto: \(N=10, \max D < 0.306, p > 0.05\); manual: \(N = 10, \max D < 0.361, p > 0.05\)); therefore, a parametric model was used for evaluating the hypotheses.
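The normality check can be reproduced roughly as follows; this is a sketch assuming SciPy, and the exact parameter handling of the original statistical software may differ.

```python
# Sketch of the normality check; the K-S test with parameters
# estimated from the sample is approximated here by standardizing
# the data first (an assumption about the exact procedure used).
import numpy as np
from scipy import stats

def ks_normality(sample):
    z = (sample - np.mean(sample)) / np.std(sample, ddof=1)
    d, p = stats.kstest(z, "norm")   # max D statistic and p-value
    return d, p
```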

An adjusted test (Greenhouse–Geisser adjustment) was used to verify the models' effectiveness due to the violation of the sphericity condition of the covariance matrix. When the sphericity condition is not fulfilled, the type I error rate increases. Epsilon represents the degree to which the sphericity condition is breached: Epsilon equal to 1 represents fulfilment of the condition, and the smaller the Epsilon value, the more the sphericity condition is breached.

When comparing the proposed model against the existing models (Table 6), the Epsilon values were significantly smaller than one (total: G–G Epsilon \(< 0.385\), Adj. \(p < 0.001\); auto: G–G Epsilon \(< 0.265\), Adj. \(p < 0.001\); manual: G–G Epsilon \(< 0.275\), Adj. \(p < 0.001\)). The null hypotheses, which claim that there is no statistically significant difference in the values of the evaluation measures (acc, prec, prec_fake, prec_real, rec, rec_fake, rec_real and f1-sc) between the models, were rejected at the 0.001 significance level. The hypotheses were tested on all data as well as on the individual datasets.

Dunnett's one-sided tests (Table 7) were used to examine the effectiveness of the proposed model MDgwPosF (M3) against the existing models in many-to-one comparisons (existing models against the proposed neural network model). Significant p values indicate against which models the proposed model is more effective at the 0.05/0.01/0.001 significance level (*\(p < 0.05\), **\(p < 0.01\), ***\(p < 0.001\)).
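As an illustration, the comparison could be set up as follows; this is a sketch using SciPy's dunnett, which is not necessarily the software used in the paper.

```python
# Sketch of the many-to-one comparison with SciPy's Dunnett test
# (scipy >= 1.11); the proposed model FF_T2_M3 plays the role of the
# control group. Note: this treats the folds as independent samples,
# which only approximates the paper's repeated-measures design.
from scipy.stats import dunnett

def compare_to_proposed(acc_proposed, acc_existing):
    # acc_proposed: 10 accuracy values of FF_T2_M3 (one per fold)
    # acc_existing: list of arrays, one per existing model
    # one-sided: are the existing models less accurate than the proposed?
    res = dunnett(*acc_existing, control=acc_proposed, alternative="less")
    return res.pvalue          # one p-value per existing model
```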

It is clear from the results that statistically significant differences were recorded for acc(FF_T2_M3) against the other types of neural network models used.

It can be concluded that the differences in the performance measure results for the LSTM and GRU models are statistically significant in favour of the feed-forward models. No statistically significant differences were found between the individual methods and topologies for the feed-forward model. This means that the performance measures are most influenced by the neural network model used. It is clear from the descriptive statistics that the proposed FF_T2_M3 method achieved the best results compared to the FF_T2_M1 and FF_T2_M2 methods, but these differences are not statistically significant. Similarly, the proposed M3 method achieved better results for the other topologies T1 and T3 in the feed-forward models.

In addition to the one-sided tests for the accuracy variable, similar one-sided tests were carried out for the other variables (prec, prec_fake, prec_real, rec, rec_fake, rec_real and f1-sc). The results were very similar and also confirmed statistically significant differences between the models, but within the comparison between topologies and methods, the differences were not statistically significant.

5 Discussion

The results provide clear evidence of statistically significant differences in the type of neural network used and between the datasets. As LSTM and GRU are networks with loops that allow information to persist, this can lead to lower accuracy because of the large amount of unique information. The results did not confirm statistically significant differences between the investigated methods of preparing the input vector. From the results, it can be observed that morphological analysis improved the results for the identification of fake news; the contribution of syntactic analysis is uncertain. Information about the part of speech can improve the classification, whereas information about the role of individual words in sentences, i.e., word dependencies, did not bring significant improvement. The results show that the combination of syntactic and morphological analysis into one method (M3) brings the largest improvement.

For almost all models and topologies, on all data as well as on the individual datasets (Table 8), the M3 method yields better results in terms of classification accuracy, except for GRU_T1 on the auto dataset and LSTM_T2 on the auto dataset as well as on all data. Statistically significantly better results (p < 0.05) in terms of classification accuracy (Table 8) were demonstrated only in the case of the LSTM model with the T1 topology, where a statistically significant difference between the M1 and M3 methods was found in favour of the M3 method (p < 0.05). By contrast, a statistically significant difference between methods M2 and M3 (Table 8) was not demonstrated (p > 0.05). For the remaining models and topologies, statistically significantly better results in terms of classification accuracy (Table 8) were not demonstrated (p > 0.05), even though in most cases (except three) the M3 method yielded better results. To verify the effectiveness of the proposed M3 method, unadjusted tests for repeated measurements were used (Table 8), given the validity of the sphericity condition of the covariance matrix (p > 0.05).

Table 8 Summary of results for the M3 method with respect to model and topology

The feed-forward model achieved the best results of all the models. These differences appear to be statistically significant; however, their significance was not formally verified, as the main focus was on the methods rather than on the neural network model. The feed-forward model appears to be the most suitable model for classification tasks. The remaining two models, GRU and LSTM, can be used for classification tasks as well, but classification is not their main purpose. These two models were included only for a more robust comparison of the methods and for the evaluation of the results. The results confirmed these findings about the individual models.

A similar procedure was used in the case of the topologies, where a suitable topology (T2) was designed and compared to a simpler topology (T1) and a more robust topology (T3). The measurements show worse results for the simpler topology (T1). Surprisingly, the T3 topology performed similarly to T2 in most cases after accounting for methods and models.

The main part of the analysis focused on the methods. Looking at the results of the descriptive statistics, the M3 method (in combination with the FF model and the T2 topology) was the most successful on the manual dataset (mean = 0.86). It was also among the most successful methods (FF_T2_M3, FF_T1_M3, FF_T1_M2) on the auto dataset (mean = 0.94).

The only statistically significant difference in favour of the M3 method was observed on the manual dataset, with the LSTM model and the T1 topology. For the other models and topologies, M3 was among the most successful methods, but without statistically significant differences.

6 Conclusion

In this paper, an improved feature extraction method, MDgwPosF (M3), is proposed and compared to two baseline methods: TfIdf (M1) and TfIdf with POS tags (M2). An evaluation using multiple neural networks (feed-forward, LSTM, GRU) with different topologies (T1, T2 and T3) was performed. The effectiveness of the models was verified on two datasets about the Covid-19 virus.

The proposed method is based on syntactic and morphological analysis of the texts. Syntactic analysis is performed using dependency grammar. The morphological part of the feature extraction consists of the relative frequency of the POS tags.

The quality of the methods (M1, M2 and M3) was evaluated using three different neural networks (feed-forward, LSTM, GRU) with three different topologies (T1, T2 and T3) on two different datasets, auto and manual. All evaluation metrics (accuracy, precision, recall and F1-score) were calculated using 10-fold validation. The dataset that was created automatically achieved better outcomes. The descriptive statistics show that the most successful neural network model was the feed-forward model, which achieved the best results for both datasets and for all topologies and methods. Among the topologies, the results show that T2 performed best. A statistical evaluation was performed in which hypotheses were formulated. The Kolmogorov–Smirnov test was used to verify the assumption of normality, and an adjusted test (Greenhouse–Geisser adjustment) was used to verify the effectiveness of the models. The results confirm the presence of statistically significant differences in the type of neural network used and between the datasets. For almost all models and topologies, on all data as well as on the individual datasets, the M3 method yields better results in terms of classification accuracy.