1 Introduction

Electronic text processing is now ubiquitous, and the growth of large corpora, from instant messaging applications to virtual repositories, has created challenges that need to be addressed. Automatic classification of textual data is one such endeavour: it allows users to extract, retrieve and manipulate information to generate knowledge and recognize patterns. Text categorization draws on a combination of knowledge areas including Information Retrieval (IR), Natural Language Processing (NLP), Artificial Intelligence (AI), Machine Learning (ML), Data Mining and Statistics. Text classification is one of the fundamental tasks in NLP and reduces the processing complexity of huge text collections. It can be divided into two groups: multi-class and multi-label text classification. Assigning a review to exactly one sentiment class is multi-class classification, whereas multi-label classification allows an article to belong to several classes simultaneously (e.g. both finance and religion).

Text classification is applied in many contexts, such as news filtering and organization, document organization and retrieval, opinion mining, email classification, and spam filtering [23]. Many researchers are paying considerable attention to text classification. Various machine learning methods are used, such as support vector machines with rule-based features [39], a combination of SVMs and naive Bayes [48], and conditional random fields [31]. Representing text as a bag of words (unigrams, bigrams, or n-grams) is a basic approach to text classification. Different classifiers are trained on these representations, for instance, logistic regression, stochastic gradient descent [10, 53] and Naive Bayes [28].

Term Frequency–Inverse Document Frequency (TF-IDF) is another simple yet effective method for feature extraction. The main problem with these feature extraction approaches is that the context of the text is not considered [1, 37], which limits the quality of the results. For example, news articles always contain some words that signal the intended meaning, and ignoring these words in the classification process can hurt the results. In the sentence “He killed the man with a single blow of his cricket bat”, the single word “bat” (unigram) is ambiguous between a playing bat and the animal. The phrase “cricket bat” (bigram) could still be read as a piece of sports equipment or as the names of an insect and an animal written together. Only with more words, “single blow of his cricket bat” (6-gram), does it become easy to differentiate between the contextual meanings of the words. Hence, using information from the preceding text is likely to improve the accuracy of classifiers, and this problem can be addressed with multi-layer neural networks [36].
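
As a small illustration of how the n-gram range changes the features a classifier can see, the sketch below (using scikit-learn's CountVectorizer on an invented two-sentence toy corpus) prints the vocabulary produced for unigrams versus bigrams; with unigrams alone, "bat" is indistinguishable across the two contexts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration.
docs = [
    "he killed the man with a single blow of his cricket bat",
    "a bat flew out of the cave at dusk",
]

# Unigram features: the token "bat" is shared by both documents,
# so its two senses cannot be told apart.
unigram_vec = CountVectorizer(ngram_range=(1, 1))
unigram_vec.fit(docs)
print(sorted(unigram_vec.vocabulary_))

# Bigram features: "cricket bat" appears only in the first document,
# giving the classifier some local context to work with.
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram_vec.fit(docs)
print(sorted(bigram_vec.vocabulary_))
```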

Deep learning (DL) learns representations of data with multiple levels of abstraction, which enables it to produce promising results for various NLP tasks. Automatic classification applications of DL include visual surveillance, intelligent user interfaces, collection of demographic statistics for marketing, face recognition, and a variety of computer vision tasks. In deep learning models, a sentence is taken as a continuous stream of tokens processed sequentially from left to right, which the neural network memorizes in a fixed-size hidden layer. Besides general feed-forward networks, some specialized architectures are extensively used in industry, including CNNs and RNNs, which scale to high-resolution images and temporal sequences. Long Short-Term Memory (LSTM) is considered the most successful model for learning long-term dependencies, and many researchers have demonstrated its ability in their work [41, 42]. However, some researchers argue that LSTM is not task-specific but a generic learning model [5]. It is also biased towards the words at the end of a sentence or document, which can reduce accuracy since important features can occur anywhere in the sentence or document rather than only at the end [24]. A convolutional neural network (CNN), in contrast, is an unbiased model that extracts higher-level features from text [50]. By stacking multiple convolution layers with max-pooling, long-term dependencies of text can be captured. A CNN uses fewer parameters and is faster to train than other models [24], and might therefore perform better than other neural networks in some cases. CNN has become the gold standard for object recognition and is also a primary choice for computer vision tasks.

Word embedding is a neural representation technique in which each word is represented as a real-valued vector. The relatedness of words can be measured by the distance between two embedding vectors [24]. These vectors have tens or hundreds of dimensions, in contrast to the thousands or millions of dimensions of the above-mentioned feature extraction approaches. Neural networks have shown great results on many NLP tasks by using pre-trained word embeddings, but such embeddings often cover word vectors from only a limited number of domains, which constrains the results. FastText is a word representation learning library provided by the Facebook research team. It gives exceptional results owing to its highly optimized C++ implementation and simple classification algorithm [19].

This work introduces a model based on CNN for sequential short-text and long-text classification. Experiments are carried out over seven different datasets, which validate the feasibility of the proposed model. Six are benchmark datasets, namely AG News, Amazon Full, Amazon Polarity, Yahoo Question Answer, Yelp Full and Yelp Polarity, and one is a non-benchmark dataset, Twitter US Airlines. A few examples from these datasets are shown in Table 1. First, the text is converted into vectors using FastText word embedding. These vectors are then passed to the CNN model. The datasets are divided into a 70:30 ratio for training and testing. The experiments show that the proposed model achieves state-of-the-art results on five out of seven datasets.

Table 1 Examples of text samples and their labels

The major contributions of the proposed approach are presented below:

  • A model based on CNN is proposed for sequential short-text and long-text classification. Experiments are carried out over seven different datasets, which validate the feasibility of the proposed model.

  • The word embedding FastText is utilized with a CNN model to obtain better results for text classification.

  • For comparing the performance of the proposed CNN, five machine learning models including random forest (RF), logistic regression (LR), extra tree classifier (ETC), gradient boosting machine (GBM), and stochastic gradient descent (SGD) are also tested.

  • The proposed approach, FastText in combination with the 3-layered CNN model, outperformed the other models used in the experiments.

The rest of the paper is structured as follows. The Related Work section reviews research related to ours. The Dataset & Preprocessing section describes the datasets and the steps performed on them, the Supervised Machine Learning Algorithms section introduces the baseline classifiers, and the Proposed Methodology section summarizes the proposed model and the deep learning components used in this research. Results are discussed in the case study section, and the paper concludes with a discussion of future research directions in the Conclusions section.

2 Related work

There is a large body of research on text classification tasks using machine learning and deep learning models. Classical methods focused on feature engineering and classification steps. Common feature engineering techniques include bag-of-words and TF-IDF. Hand-crafted n-grams are also used to effectively exploit the order of words in text [47]. Other, more complex features such as noun phrases [26], part-of-speech tags, and tree kernels [32] have also been designed. There are also advanced approaches for selecting useful features, such as information gain and mutual information. Widely used machine learning algorithms include Naive Bayes [28], Logistic Regression and SVM [53]. However, these models suffer from data sparsity. Deep neural networks (DNN) have gained immense popularity in NLP tasks because they have strong expressive power and require less feature engineering than traditional models. CNN and RNN are variants of DNNs that are being used for text classification [52].

RNN is a sequence model that determines long-term dependencies in variable-length input sequences. Multiple variants of RNN, such as LSTM and gated recurrent units (GRUs), are used to better store and access memories [14]. Using LSTM, the study [54] proposed supervised and semi-supervised learning models for text classification. In [15], a Siamese MaLSTM is used to compute the semantic similarity of question pairs. Another study [43] used a gated RNN to learn semantic relations. CNN- and LSTM-based models are used to learn vector-based sentence representations for text classification. CNN and LSTM were combined in [57], where CNN was used to capture higher-level features and LSTM was used to understand the overall context of the sentence. Character and word models are combined in [45], where the Word CNN model is like the one used in [22] and the Char CNN model is similar to the model used in [56].

CNN is capable of extracting higher-level features and capturing local correlations in text classification as well as in image classification. A novel stacked CNN was proposed in [44], which enables CNN to model sentences from preceding and succeeding context windows. CNN models are applied directly to distributed [8, 22] or discrete [17] word embeddings, without knowledge of the semantic structure of a language. The research in [22] used a word-based CNN architecture with pre-trained word embeddings for sentence-level classification: a single convolution layer followed by a max-pooling layer and a fully connected layer with dropout forms the final classifier. With little tuning of hyperparameters, the results on various benchmark datasets were improved. Another system with 5 layers of CNN was proposed in [20]. It introduced multiple temporal k-max-pooling layers, which enable the system to extract the k most important features in the text irrespective of their position while preserving their relative order. The value of k is determined by the length of the sentence and the position of the layer. In [46], word vectors were clustered and then passed to a CNN for short-text classification.

A CNN-based model with up to 29 layers was proposed in [5]. It operated at the character level, used small convolution and pooling operations, and improved the results on several datasets by increasing the depth of the CNN. Another work [56] used a 9-layer character-based CNN model to address the unseen-word problem of word-level CNN models. It yielded a small accuracy gain but took a very long time to train due to the huge training corpus. The same unseen-word problem was addressed in [51], where the researchers combined character- and word-based CNN models to perform text classification efficiently. A shallow word-level CNN with more parameters was introduced in [18]; its performance and error rate were compared with [5], and the model achieved lower error rates with faster training. In [55], six convolutional layers were used, followed by three fully connected classification layers.

When text classification is performed with CNN models, the text is converted into vectors at the first layer of the model. Word embedding [22, 29], character embedding [56] and document embedding [6] are some of the embedding types used in recent years. Large text corpora are generally used to build the embedding vectors [29]. The choice of word embedding depends on the problem statement and the network architecture. Various CNN models use pre-trained word2vec embeddings as input [29], as do recursive tensor networks [40]. Recently, simpler and more efficient models have been proposed that directly learn task-specific word embeddings or fine-tune pre-trained ones, e.g. Deep Averaging Networks [16] and FastText [19].

Sentiment classification has been a thoroughly researched area in NLP for many years [49]. It has many real-world applications in fields such as finance [3], market research [34], social science [7] and politics [21]. A semi-supervised deep neural network was proposed in [25] using a small amount of labelled data and a large amount of unlabeled data; the experiments in that study show that using unlabeled data improved the performance of the model. In [27], a GRU-based multi-task learning method was proposed for sentiment and sarcasm classification, which outperformed the baseline CNN model. Several deep learning models including CNN, LSTM, MLP and CNN-LSTM were applied to the IMDB movie reviews dataset [2], and a significant increase in accuracy was observed compared with classical machine learning models. The study [4] employs BiLSTM-CRF to improve sentence-level sentiment analysis by extracting target expressions in opinionated sentences. The sentences are then classified into three types according to the number of targets extracted from them. The results show that separating sentences containing different opinion targets enhances the performance of sentence-level sentiment analysis.

In this paper, we stack only 3 layers of CNN for classification and use FastText word embedding for feature engineering. This is in contrast to the current trend in text classification, where significant improvements have been reported with much deeper CNNs [5, 20, 55, 56]. CNNs generally require a very large dataset to perform efficiently, so we carried out the experiments over 7 large datasets. Since the pre-trained FastText embedding contains word vectors from several domains, trained on billions of tokens, both the coverage of the word vectors and the range of training domains is greatly increased. As a result, the reported accuracies improved significantly on 5 out of 7 datasets.

3 Dataset & preprocessing

To show the effectiveness of our proposed model, we evaluate the results on both multi-class and multi-label datasets. The large-scale benchmark datasets used in this study were introduced in [54] and are freely available online. These datasets cover various classification tasks such as sentiment classification, question–answer classification, tweet classification and news classification. Out of the 7 datasets, 6 are benchmark datasets and 1 is a non-benchmark dataset. The datasets are divided into a 70:30 ratio for training and testing. The details of the datasets are shown in Table 2 and in the following sub-sections.
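
As an illustration of the split described above, the following sketch loads one dataset from a CSV file and divides it 70:30. The file name and column names are assumptions (the public releases of these corpora are usually distributed as CSV files with a class label and one or more text columns); adjust them to the actual files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical path and column names for the AG News CSV release.
df = pd.read_csv("ag_news.csv", names=["label", "title", "description"])

# This study uses only the article titles for AG News, not the descriptions.
texts = df["title"].astype(str)
labels = df["label"]

# 70:30 split for training and testing, stratified on the class labels.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)
```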

Table 2 Detailed description of large-scale text classification datasets used in our experiments
AG News:

The AG News dataset consists of internet news articles and their descriptions from more than 2000 news sources. It is divided into 4 categories: World, Entertainment, Sports and Business, each with an equal number of articles. For this study, we considered only the articles and ignored their descriptions. The dataset contains 120k training and 7.6k test samples.

Yahoo! Answers:

The Yahoo! Answers dataset consists of comprehensive questions and answers. It has 10 classes containing an equal number of records: Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships and Politics & Government. The dataset has 4 fields: class index (1 to 10), question title, question content and best answer. Training and testing are divided into 1,400k and 60k samples respectively.

Yelp Review Polarity:

Yelp Review Polarity dataset is obtained from the Yelp Dataset Challenge held in 2015. It consists of two classes: negative and positive. The negative polarity is represented by class 1, and positive polarity is represented by class 2. These polarity classes have 560k training samples and 38k test samples in total.

Yelp Review Full:

Yelp Review Full dataset is also obtained from the Yelp Dataset Challenge held in 2015. It is a subset of Yelp’s businesses, reviews and user data with five-star labels. The dataset has 650k training samples and 50k testing samples.

Amazon Review Polarity:

The Amazon review dataset is obtained from the Stanford Network Analysis Project (SNAP). It has 2 classes including negative and positive. Dataset is divided into 3600k training and 400k testing samples.

Amazon Review Full:

The Amazon review dataset is also obtained from the Stanford Network Analysis Project (SNAP). It consists of five star labels (1 to 5). This dataset is divided into 3,000k training and 650k testing samples.

Twitter US Airline Sentiment:

Twitter US Airline Sentiment data was collected from “Crowdflower’s data for everyone library”. It consists of 3 classes including positive, negative and neutral. It has 10,248 training and 4,392 testing samples. The 15 fields in the dataset are tweet id, sentiment, sentiment confidence score, negative reason, negative reason confidence, airline, sentiment gold, retweet count, tweet text, tweet coordinates, time of the tweet, date of the tweet, tweet location, user time zone and name. However, the fields used are tweet text and sentiment only.

3.1 Preprocessing

Several preprocessing steps are performed on every dataset to remove missing, inconsistent and redundant values. The text in all datasets is converted to lower case and stopwords are removed. It is then tokenized and converted into word vectors with the help of freely available libraries such as NLTK and Keras. Eliminating unnecessary information improves the quality of the data. The word vectors are passed to FastText word embedding for the extraction of high-quality features. The maximum sequence length is set to the maximum length of text in the dataset, and shorter samples are zero-padded.
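
A minimal sketch of these preprocessing steps, assuming NLTK's English stopword list and the Keras tokenizer; the variable names are illustrative and `X_train`/`X_test` come from the dataset split sketched above.

```python
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean(text):
    # Lower-case the text and drop English stopwords.
    return " ".join(w for w in text.lower().split() if w not in stop_words)

train_texts = [clean(t) for t in X_train]   # X_train / X_test from the split above
test_texts = [clean(t) for t in X_test]

# Tokenize and convert the texts into integer index sequences.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)
train_seqs = tokenizer.texts_to_sequences(train_texts)
test_seqs = tokenizer.texts_to_sequences(test_texts)

# Zero-pad every sequence to the maximum text length in the dataset.
max_len = max(len(s) for s in train_seqs)
train_pad = pad_sequences(train_seqs, maxlen=max_len, padding="post")
test_pad = pad_sequences(test_seqs, maxlen=max_len, padding="post")
```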

4 Supervised machine learning algorithms

In this section, we discuss the machine learning algorithms used for the text classification of the benchmark datasets. The models are implemented in Python using the scikit-learn library and NLTK. We used tree-based, regression-based and ensemble-based models to check the efficacy of the proposed system, and the following five machine learning algorithms are evaluated in conjunction with our proposed methodology.

4.1 Random Forest (RF)

RF is an advanced version of the decision tree. Random forest is a supervised learning algorithm consisting of a number of decision trees working individually to predict a class, where the final prediction is based on the class that gets the majority of votes. Its error rate is very low compared with other models because of the low correlation between its trees [13]. The random forest used in this study was trained with different parameters, and depending on the problem, multiple algorithms can be used to decide a split in a decision tree.

4.2 Logistic Regression (LR)

Logistic regression is a statistical method in which one or more variables are used to compute the final result. LR computes the probability of each class and is therefore well suited when the target variable is categorical [30]. It models the relationship between one or more predictor variables and a categorical dependent variable by estimating probabilities with the logistic function. LR uses the sigmoid function to transform the output into a probability value, and the aim is to achieve the optimal probability with a low value of the cost function.

4.3 Extra Tree Classifier (ETC)

Extra Tree Classifier (ETC) is an ensemble learning model. Its working principle is quite similar to RF; the only difference is in the construction of the trees in the forest. In ETC, every tree is built from the original training samples. At each node, a random sample of k features is drawn and the Gini index is used to select the best feature to split the data. This approach results in the construction of de-correlated trees in ETC [38].

4.4 Gradient Boosting Machine (GBM)

Gradient Boosting Machine is based on boosting; it is a powerful ensemble model extensively used for classification problems. In GBM, many weak classifiers work together to build a strong learning model, usually based on decision trees [11]. GBM builds its trees sequentially, which makes it a computationally expensive and time-consuming choice. Owing to its probably approximately correct (PAC) learning properties, it works well on unprocessed data, and it is also a good choice for dealing with missing values.

4.5 Stochastic Gradient Descent (SGD)

The working principle of SGD is based on optimizing convex loss functions such as those of logistic regression and SVM. It is a good choice for multi-class classification problems because it combines multiple binary classifiers using the one-versus-all (OvA) scheme. SGD works well on large datasets because it uses a single sample per iteration, and it is easy to understand and implement. Despite these benefits, SGD also has drawbacks: its hyperparameters need careful attention and must be set correctly to achieve good accuracy [12].
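
For reference, the five classifiers above can be instantiated and evaluated with scikit-learn as in the sketch below. The TF-IDF features and all hyperparameters shown are placeholders, not the exact settings used in the experiments, and the text variables come from the preprocessing sketch in Section 3.1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score

# TF-IDF features for the classical baselines (the CNN uses FastText embeddings instead).
tfidf = TfidfVectorizer(max_features=50000)
X_tr = tfidf.fit_transform(train_texts)
X_te = tfidf.transform(test_texts)

models = {
    "RF": RandomForestClassifier(n_estimators=100),
    "LR": LogisticRegression(max_iter=1000),
    "ETC": ExtraTreesClassifier(n_estimators=100),
    "GBM": GradientBoostingClassifier(),
    "SGD": SGDClassifier(),
}

for name, model in models.items():
    model.fit(X_tr, y_train)
    preds = model.predict(X_te)
    print(name, accuracy_score(y_test, preds))
```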

5 Proposed methodology

This section presents the proposed framework used in this study, which is shown in Fig. 1. The use of deep learning-based classifiers has received great attention during the last few years, and deep models can potentially increase classification accuracy over traditional classifiers. For this reason, this study utilizes a convolutional neural network (CNN) for text classification. The architecture of the proposed neural network is shown in Fig. 2.

Fig. 1 Architecture diagram of the proposed framework

Fig. 2 Architecture of the proposed CNN model

Seven datasets related to text classification are utilized in the experiments. Five preprocessing steps are performed on each dataset before training, and each dataset is split into a 70:30 ratio for training and testing. The proposed approach, FastText word embedding in combination with a 3-layered CNN, is then applied for training. The approach is evaluated with four measures: accuracy, precision, recall and F1-score.
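
The four measures can be computed with scikit-learn as sketched below. Macro averaging is used here as one common choice for multi-class data; the paper does not state the averaging mode, so this is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report(y_true, y_pred):
    # Macro averaging weights every class equally (an assumption, not the paper's stated setting).
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```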

5.1 Embedding layer

Neural networks do not take text directly as input, so the text must be vectorized to make it understandable to the model. Recently, word embedding has gained much attention for text classification [22] and sentiment analysis [4]. Pre-trained vectors [29] or random values can be used to initialize word embeddings. The proposed model uses the embedding layer as the first layer: it takes the text as input and converts it into a matrix \(W \in \mathbb{R}^{k \times m}\), where k is the maximum number of words in the text and m is the dimension of the word embedding. The embedding dimension is set to 300 and the maximum sequence length is equal to the maximum length of text in the dataset; shorter texts are zero-padded.

$$ W_{k} = {v_{1}^{w}} \oplus {v_{2}^{w}} \oplus ... \oplus {v_{r}^{w}} $$
(1)

In (1), \(W_{k}\) is the embedding of the words and ⊕ is the concatenation operation. \(W_{k}\) is constructed by concatenating the vectors \({v_{r}^{w}}\), where \({v_{r}^{w}} \in \mathbb{R}^{m}\) is the m-dimensional vector of the rth word in the text. The FastText word embedding technique is used in this work, and its output is later fed to the convolutional layer.

5.2 FastText

FastText is a word representation library provided by the Facebook research team. The pre-trained model contains 2 million word vectors with 300 dimensions, trained on Common Crawl data comprising 600 billion tokens. It uses n-grams as features in addition to single words. Text classification is performed very effectively and efficiently thanks to its simple architecture [33]. Different word embedding techniques have been used in various text classification tasks. Pre-trained word embeddings work in an unsupervised manner to predict the context of words, and closely placed words are considered to have a similar context. FastText uses morphological features to handle difficult words, which makes it a suitable choice for representing vectors and also improves its generalizability. FastText generates vectors from character n-grams, which helps in dealing with unknown words.
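
A hedged sketch of how the pre-trained FastText vectors can be loaded and turned into an embedding matrix aligned with the tokenizer built in Section 3.1. The file name corresponds to the publicly released crawl-300d-2M vectors, and out-of-vocabulary words simply fall back to zero vectors here rather than to FastText's subword reconstruction.

```python
import numpy as np

EMBED_DIM = 300

# Load pre-trained FastText vectors from the released .vec text file.
embeddings = {}
with open("crawl-300d-2M.vec", encoding="utf-8") as f:
    next(f)  # skip the header line (vocabulary size and dimension)
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Build the embedding matrix indexed by the Keras tokenizer's word indices.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector
```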

5.3 Convolution feature maps

The basic operation of the convolution layer is to extract multi-resolution features from the input matrix; different types of filters are applied to obtain different kinds of features. The convolutional layer aims to extract patterns, i.e., discriminative word sequences found in the input word vectors that are common across the training instances. Let \(x_{i} \in \mathbb{R}^{d}\) be the d-dimensional word vector of the ith word in a sentence, and let \(x \in \mathbb{R}^{L \times d}\) be the input sentence, where L is the length of the sentence. Let k be the length of the filter, and let \(m \in \mathbb{R}^{k \times d}\) be the filter for the convolution operation. For each position j in the sentence, the window vector \(w_{j}\) of k consecutive word vectors is denoted as

$$ w_{j} = [x_{j}, x_{j+1}, \ldots, x_{j+k-1}] $$
(2)

where the commas represent row-vector concatenation. A filter m convolves with the window vectors (k-grams). In our work, the kernel size is 7 and 64 filters are used, so each filter operates over 7-word combinations.

The ReLU activation function is applied to the output of each CNN neuron. It sets all negative values to zero, introducing non-linearity into the network. After applying ReLU, the output shape of the CNN layer remains the same as that of its input.

The purpose of a pooling layer is to further abstract the features by combining the scores for each filter. In this model, we apply two max-pooling layers over the feature maps. Each captures the most important features by selecting the highest value along each dimension of the vector, so the size of the input features is greatly reduced: the output length shrinks by the pool size p, where p = 2. A global max-pooling layer follows the convolution with kernel size 5.

The dropout layer reduces overfitting by randomly dropping a fraction of the input units during training. The dropout rate d for our model is 0.5.

A dense layer is the last layer of the model and produces the final result. It is followed by a softmax activation function, which is used for multi-class classification; we use this activation since most of the datasets are multi-class.
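
Putting the pieces of this section together, a minimal Keras sketch of a 3-layer CNN of the kind described above is given below. The exact arrangement of the convolution blocks, the "same" padding, the filter counts of the later layers and the Adam optimizer are assumptions where the text is not explicit; the stated details (kernel size 7 with 64 filters, pool size 2, a kernel of 5 before global max pooling, dropout 0.5, softmax output) are kept.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D, Dropout, Dense)

num_classes = len(set(y_train))  # assumes integer labels 0 .. num_classes-1

model = Sequential([
    Input(shape=(max_len,)),
    # Embedding layer holding the FastText matrix built above, kept frozen.
    Embedding(vocab_size, EMBED_DIM, trainable=False, name="fasttext_embedding"),
    # First convolution block: kernel size 7, 64 filters, ReLU activation.
    Conv1D(64, 7, padding="same", activation="relu"),
    MaxPooling1D(pool_size=2),
    # Second convolution block (kernel size 5 here is an assumption).
    Conv1D(64, 5, padding="same", activation="relu"),
    MaxPooling1D(pool_size=2),
    # Third convolution block followed by global max pooling.
    Conv1D(64, 5, padding="same", activation="relu"),
    GlobalMaxPooling1D(),
    Dropout(0.5),
    # Softmax output for multi-class classification.
    Dense(num_classes, activation="softmax"),
])

# Load the pre-trained FastText embedding matrix into the frozen embedding layer.
model.get_layer("fasttext_embedding").set_weights([embedding_matrix])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```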

6 Case study: evaluation of CNN model

A case study was performed to evaluate the impact of using external word embedding with a few layers of CNN for text classification. We performed a series of experiments on different benchmark datasets, and this section discusses the experimental results for all of them. The results show that our model outperforms state-of-the-art methods on most of the datasets, as shown in Table 3.

Table 3 Proposed methodology comparison with best published results from previous work

It is evident from the results shown in Table 3 that the hand-crafted feature engineering-based models [56] obtain the lowest results on all the datasets compared with the rest of the models. One of the main reasons is that feature engineering suffers from data sparsity; it also cannot take advantage of training-set supervision [9]. Character-based models [50, 55] achieve better accuracies because, compared with word-based models, they have deeper supervision from characters. N-grams, stems, words and phrases formed by combining characters assist in sentiment classification. Overall, the two Amazon datasets achieve the highest results with character-based models, because both contain millions of training samples. Of the three character-based baselines, the best results are achieved with the VDCNN [5] architecture: having 29 layers of CNN, the model learns more combinations of characters.

Word-based models, on the other hand, lose on the two Amazon datasets and exceed the other models on three datasets. The reason is that these three datasets are categorized largely through keywords, and word-based models use word embeddings directly without combining characters. Next comes W.C RegionEmb, which exceeds all the variants on all five datasets owing to its ability to learn region embeddings and thereby exploit the n-gram features of the text. EXAM outperforms all the word-based baselines because it considers more fine-grained interaction features between classes and words. Finally, a voting classifier introduced in [35] achieved 80.4% accuracy on the Twitter US Airline Sentiment dataset. It is important to note that no external word embedding was used in any of the aforementioned methods.

One can observe that the proposed model achieves the best performance on five datasets: AG News, Yahoo! Answers, Yelp Reviews Full, Amazon Review Full and Twitter US Airline Sentiment. On the Yelp Reviews Full and Amazon Review Full datasets, the proposed model improves performance by 20.0% to 22.0%. Additionally, the proposed model beats all baselines on AG's News with a performance gain of 3.0%, and the accuracy on Yahoo! Answers also increases drastically, by nearly 20.0%. Finally, the performance on Twitter US Airline Sentiment improves by 6% because we use FastText word embedding, whose large number of pre-trained word vectors is particularly useful for this task. The detailed performance of our proposed model on all the datasets is shown in Table 4.

Table 4 Detailed result of proposed model on all 7 datasets

Deep learning models have shown robust results on all seven datasets, as shown in Table 3, but to the best of our knowledge this is the first time a CNN with pre-trained word embedding has been used to analyze these reviews. CNN has mostly shown superior performance compared with other deep learning models: it extracts features directly from raw data by applying filters and explores spatial relationships among variables by assigning weights. Overfitting in the CNN model is prevented by reducing the complexity of the model; our proposed model is simple yet efficient. A result comparison of the proposed model on all 7 datasets is presented in Fig. 3. It can be concluded that the proposed CNN along with FastText outperformed the other models on all datasets in terms of accuracy, precision, recall and F-score. The proposed approach shows the highest accuracy of 0.96 on the AG's News dataset, the highest precision of 0.94 on AG's News and Yelp Review Polarity, the highest recall of 0.96 on Yelp Review Polarity, and the highest F-score of 0.95 on Yelp Review Polarity. FastText word embedding achieves good results because it considers character-level information, which helps to represent rare words appropriately. However, it is also important to consider the limitations of this study: the datasets come from different domains with different vocabularies, which is a reason for the varying results. In future, we aim to apply other word embedding techniques in combination with deep learning models on these datasets (Table 5).

Table 5 Proposed methodology accuracy comparison with machine learning models
Fig. 3 Result comparison of the proposed model on all 7 datasets

7 Conclusions

A novel framework for short- and long-text classification using FastText word embedding followed by a 3-layer CNN model is proposed. Experiments over 7 benchmark datasets validate the effectiveness of the proposed model, and the experimental results verify that the use of FastText word embedding increases accuracy. The proposed framework, which combines FastText with CNN, is simple, effective and efficient. The results reveal robust performance on all datasets using raw data without any manual feature extraction or feature selection. It can be concluded that the easy implementation and low complexity of the CNN model make it an efficient method for the classification of short and long text.

Moreover, there is no need to stack many layers of CNN to obtain promising results; better results are obtained with merely three CNN layers. Future research entails testing the proposed methodology using a fusion of multiple word embeddings rather than a single word embedding, i.e., FastText. This may yield results comparable to, or better than, those presented in this article.