Impact of convolutional neural network and FastText embedding on text classification

Efficient word representation techniques (word embeddings) combined with modern machine learning models have shown reasonable improvement on automatic text classification tasks. However, the effectiveness of such techniques has not yet been evaluated when the word vector representation available for training is insufficient. Convolutional Neural Networks (CNNs) have achieved significant results in pattern recognition, image analysis, and text classification. This study investigates the application of the CNN model to text classification problems through experimentation and analysis. We trained our classification model with a prominent word embedding generation model, FastText, on six publicly available benchmark datasets: AG News, Amazon Full, Amazon Polarity, Yahoo Question Answer, Yelp Full, and Yelp Polarity. Furthermore, the proposed model has been tested on the non-benchmark Twitter US Airlines dataset as well. The analysis indicates that using FastText as the word embedding is a very promising approach.


Introduction
Electronic text processing is ubiquitous nowadays, from instant messaging applications to virtual repositories, and the resulting large corpora have created challenges that need to be addressed. Automatic classification of textual data is one such endeavour; it lets users extract, retrieve and manipulate information to generate knowledge and recognize patterns. Text categorization draws on a combination of knowledge areas including Information Retrieval (IR), Natural Language Processing (NLP), Artificial Intelligence (AI), Machine Learning (ML), Data Mining and Statistics. Text classification is one of the fundamental tasks in NLP and reduces the processing complexity for huge texts. It can be categorized into two groups: multi-label and multi-class text classification. Classifying a review into its corresponding sentiment is referred to as multi-class classification, whereas classifying an article into different classes (e.g. finance or religion) is known as multi-label classification.
Text classification is applied in many contexts, such as news filtering and organization, document organization and retrieval, opinion mining, email classification, and spam filtering [23]. Many researchers are paying considerable attention to text classification. Various machine learning methods are used, such as support vector machines (SVMs) with rule-based features [39], a combination of SVMs and naive Bayes [48], and conditional random fields [31]. Representing text as a bag of words (unigrams, bigrams, or n-grams) is a basic approach to text classification. Different classifiers are trained on these representations, for instance logistic regression, stochastic gradient descent [10,53] and Naive Bayes [28].
Term Frequency-Inverse Document Frequency (TF-IDF) is another simple yet effective method for feature extraction. The main problem with these feature extraction approaches is that the context of the text is not considered [1,37], which limits the quality of the results. For example, there are always some words in a text that indicate the tendency of the result, as in news articles. If we ignore these words in the classification process, we might not get the desired results. Consider the sentence "He killed the man with a single blow of his cricket bat". If we take the single word "bat" (unigram), it is unclear whether it refers to a "playing bat" or the "animal bat". If we take the phrase "cricket bat" (bigram), it might be read as a cricket bat or as the names of an insect and an animal written together. When we take more words, "single blow of his cricket bat" (6-gram), it becomes rather easy to differentiate between the contextual meanings of the words. Hence, using information from the preceding text is likely to improve the accuracy of the classifiers, and this problem can be addressed by using multi-layer neural networks [36].
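The unigram, bigram and 6-gram views of the example sentence can be illustrated with a minimal sketch. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration only; the paper does not specify how n-grams were extracted.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "He killed the man with a single blow of his cricket bat"
tokens = sentence.lower().split()  # naive whitespace tokenization (assumption)

unigrams = ngrams(tokens, 1)   # single words, no context
bigrams = ngrams(tokens, 2)    # pairs such as ("cricket", "bat")
sixgrams = ngrams(tokens, 6)   # wider context windows

print(bigrams[-1])
print(sixgrams[-1])
```

The last 6-gram covers "single blow of his cricket bat", which is the context that disambiguates "bat" in the discussion above.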
Deep learning (DL) allows models to learn representations of data with multiple levels of abstraction, which enables it to produce promising results for various NLP tasks. Automatic classification applications using DL include visual surveillance, intelligent user interfaces, collection of demographic statistics for marketing, face recognition, and a variety of computer vision tasks. In deep learning models, a sentence is taken as a continuous stream of tokens that are processed in sequential order from left to right, and the neural network memorizes it in a fixed-size hidden layer. Besides general feed-forward networks, some specialized architectures are used extensively in industry, including CNN and RNN, which can scale to high-resolution images and temporal sequences. Long Short-Term Memory (LSTM) is considered to be the most successful model for learning long-term dependencies, and many researchers have demonstrated its ability in their work [41,42]. However, some researchers argue that LSTM is not task-specific but rather a generic learning model [5]. It is also biased towards the words at the end of a sentence or a document. This can reduce the accuracy, since important features can occur anywhere in the sentence or document rather than only at the end [24]. A convolutional neural network (CNN) does not have this positional bias and is used to extract higher-level features from text [50]. By stacking multiple convolution layers, long-term dependencies of text can be captured with a max-pooling layer. A CNN uses fewer parameters and is faster to train than other models [24]. Therefore, CNN might perform better than other neural networks in some cases. CNN has become the gold standard for object recognition and is also a primary choice for computer vision tasks.
Word embedding is a neural representation technique in which each word is represented as a real-valued vector. The relatedness of words can be measured by the distance between two embedding vectors [24]. These vectors might have tens or hundreds of dimensions, in contrast to the thousands or millions of dimensions in the above-mentioned feature selection approaches. Neural networks have shown great outcomes in many NLP tasks by using pre-trained word embeddings, but these pre-trained word embeddings often contain word vectors from only a small number of domains, which limits the results. FastText is a word representation learning library provided by the Facebook research team. It gives exceptional results owing to its efficient implementation in C++ and its simple classification algorithm [19].
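As a minimal sketch of how relatedness can be measured from embedding vectors, the snippet below computes cosine similarity. The three-dimensional `toy_vectors` are invented for illustration and are not taken from any real embedding; cosine similarity is one common choice of measure and is assumed here rather than stated in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings (real ones have hundreds of dimensions).
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "flight": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["flight"])
print(sim_close > sim_far)
```

Related words are expected to score closer to 1 than unrelated ones, as the toy example shows.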
This work introduces a model based on CNN for sequential short-text and long-text classification. Experiments are carried out over seven different datasets, which validate the feasibility of the proposed model. Six are benchmark datasets (AgNews, AmazonFull, AmazonPolarity, Yahoo Question Answer, Yelp Full and Yelp Polarity) and one is a non-benchmark dataset named Twitter US Airlines. A few examples from these datasets are shown in Table 1. First, the text is converted into vectors using FastText word embedding. These vectors are then passed to the CNN model. The datasets are divided in a 70:30 ratio for training and testing. The experiments show that the proposed model achieves state-of-the-art results on five out of seven datasets.
The major contribution of the proposed approach is as follows:
- A model based on CNN is proposed for sequential short-text and long-text classification. Experiments carried out over seven different datasets validate the feasibility of the proposed model.

The rest of the paper is structured as follows. The Related work section reviews prominent research related to our work. The Dataset & preprocessing and Proposed methodology sections summarize the datasets, the steps performed on them, the proposed model, and the learning models used in this research. The results are then discussed, and the paper concludes with a discussion of future research directions in the Conclusions section.

Related work
There is a large body of research on text classification tasks using machine learning and deep learning models. Classical methods focused on the feature engineering and classification steps. Common feature engineering techniques include bag-of-words and TF-IDF. Handcrafted n-grams are also used to make effective use of the order of words in text [47]. More complex features such as noun phrases [26], part-of-speech tags, and tree kernels [32] have also been designed. There are also advanced approaches for selecting useful features, such as information gain and mutual information. Widely used machine learning algorithms include Naive Bayes [28], Logistic Regression and SVM [53]. However, these models suffer from data sparsity. Deep neural networks (DNNs) have gained immense popularity in NLP tasks because they have strong expressive power and need less feature engineering than the traditional models. CNN and RNN are variants of DNNs that are used for text classification [52].
RNN is a sequence model that captures long-term dependencies in variable-length input sequences. Multiple variants of RNN, such as LSTM and gated recurrent units (GRUs), are used to better store and access memories [14]. Using LSTM, the study [54] proposed a supervised and semi-supervised learning model for text classification. In [15], Siamese MaLSTM is used to compute the semantic similarity of question pairs. Another study [43] used a gated RNN to learn semantic relations. CNN and LSTM based models are used to learn vector-based sentence representations for text classification. CNN and LSTM were combined in [57], where CNN was used to capture higher-level features and LSTM was used to understand the overall context of the sentence. Character and word models are combined in [45], where the Word CNN model is like the one used in [22] and the Char CNN model is similar to the model used in [56].
CNN is capable of extracting higher-level features and capturing local correlations in both text classification and image classification. A novel stacked CNN was proposed in [44], which enables the CNN to model sentences from preceding and succeeding context windows. CNN models are applied directly to distributed [8,22] or discrete [17] word embeddings, without knowledge of the semantic structure of a language. The study in [22] used a word-based CNN architecture with pre-trained word embeddings for sentence-level classification. A single convolution layer followed by a max-pooling layer and a fully connected layer with dropout is used in the final classifier. With little tuning of hyperparameters, the results on various benchmark datasets were improved. Another system, proposed in [20], used 5 CNN layers and introduced multiple temporal k-max-pooling layers. This enables the system to extract the k most important features in the text irrespective of their position while preserving their relative order. The length of the sentence and the position of the layer determine the value of k. In [46], word vectors were clustered and passed to a CNN for short-text classification.
A CNN-based model with up to 29 layers was proposed in [5]. It operated at the character level and used small convolution and pooling operations. It improved the results on several datasets by increasing the depth of the CNN. Another study [56] used a 9-layer character-based CNN model to solve the unseen-word problem of word-based CNN models. It yielded a small gain in accuracy but took a very long time to train because of the huge training corpus. The same unseen-word problem was addressed in [51], where the researchers proposed combining character-based and word-based CNN models to perform text classification efficiently. A shallow word-level CNN with more parameters was introduced in [18]. Its error rate and speed were compared with [5], and the model was shown to achieve lower error rates while being faster. In [55], six CNN layers were used, followed by three fully connected classification layers.
When text classification is performed using CNN models, the text is converted into vectors at the first layer of the model. Word embedding [22,29], character embedding [56] and document embedding [6] are some of the embedding types that have been used in recent years. Large text corpora are generally used to form the embedded vectors [29]. The choice of word embedding depends on the problem statement and the network architecture. Various neural models, such as the recursive tensor network [40], used pre-trained word2vec embeddings [29] as input. Recently, some simpler and more efficient models have been proposed that directly learn task-specific word embeddings or fine-tune pre-trained word embeddings, e.g. Deep Averaging Networks [16] and FastText [19].
Sentiment classification has been a thoroughly researched area in NLP for many years [49]. It has many real-world applications in fields such as finance [3], market research [34], social science [7] and politics [21]. A semi-supervised deep neural network was proposed in [25] that uses a small amount of labelled data and a large amount of unlabelled data. The experiments in that study show that using unlabelled data improved the performance of the model. In [27], a GRU-based multi-task learning method was proposed for sentiment and sarcasm classification, which outperformed the baseline CNN model. Several deep learning models including CNN, LSTM, MLP and CNN-LSTM were applied to the IMDB movie reviews dataset in [2], and a significant increase in accuracy was observed when compared with classical machine learning models. The study [4] employs BiLSTM-CRF to improve sentence-level sentiment analysis by extracting target expressions in opinionated sentences. The sentences are then classified into three types according to the number of targets extracted from them. The results show that separating sentences containing different opinion targets enhances the performance of sentence-level sentiment analysis.
In this paper, we stack only 3 CNN layers for classification and use fastText word embedding for feature engineering. This is in contrast to the current trend in text classification, where significant improvements have been reported by using much deeper CNNs [5,20,55,56]. CNNs generally require a very large dataset to perform efficiently. Therefore, we carried out the experiments over 7 large datasets. Since the fastText word embedding is trained on billions of tokens from several domains, the number of available word vectors and the range of covered domains are greatly increased. As a result, the reported accuracies are significantly improved on 5 out of 7 datasets.

Dataset & preprocessing
To show the effectiveness of our proposed model, we evaluated the results on both multi-class and multi-label tasks. The large-scale benchmark datasets used in this study were introduced in [54] and are freely available online. These datasets relate to various classification tasks such as sentiment classification, question answer classification, tweet classification and news classification. Out of the 7 datasets, 6 are benchmark datasets and 1 is a non-benchmark dataset. The datasets are divided in a 70:30 ratio for training and testing. The details of the datasets are shown in Table 2 and in the following sub-sections.

Yelp Review Full:
It is a subset of Yelp's businesses, reviews and user data with five-star labels. The dataset has 650k training samples and 50k testing samples.

Amazon Review Polarity:
The Amazon review dataset is obtained from the Stanford Network Analysis Project (SNAP). It has 2 classes, negative and positive. The dataset is divided into 3,600k training and 400k testing samples.

Amazon Review Full:
The Amazon review dataset is also obtained from the Stanford Network Analysis Project (SNAP). It consists of 5 star labels. This dataset is divided into 3,000k training and 650k testing samples.

Twitter US Airline Sentiment:
The Twitter US Airline Sentiment data was collected from "Crowdflower's data for everyone library". It consists of 3 classes: positive, negative and neutral. It has 10,248 training and 4,392 testing samples. The 15 fields in the dataset are tweet id, sentiment, sentiment confidence score, negative reason, negative reason confidence, airline, sentiment gold, retweet count, tweet text, tweet coordinates, time of the tweet, date of the tweet, tweet location, user time zone and name. However, only the tweet text and sentiment fields are used.
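The 70:30 split can be sketched as follows for the Twitter US Airline Sentiment data, whose 10,248 training and 4,392 testing samples sum to 14,640. The function name `split_indices`, the shuffling step and the seed are assumptions for illustration; the paper does not state how the split was randomized.

```python
import random

def split_indices(n_samples, test_ratio=0.3, seed=42):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # reproducible shuffle (assumption)
    n_test = round(n_samples * test_ratio)    # 30% held out for testing
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(14640)
print(len(train_idx), len(test_idx))
```

With 14,640 samples, a 30% test share reproduces the training and testing sizes reported above.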

Preprocessing
Several preprocessing steps are performed on every dataset to remove missing, inconsistent and redundant values. The text in all datasets is converted to lower case and stopwords are removed. It is then tokenized and converted into word vectors with the help of freely available libraries such as NLTK and Keras. Eliminating unnecessary information improves the quality of the data. The word vectors are passed to the FastText word embedding for the extraction of high-quality features. The maximum sequence length is set to the maximum length of a text in the dataset. Texts shorter than this length are zero-padded.
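The preprocessing steps described above (lower-casing, stopword removal, tokenization, integer encoding and zero-padding) can be sketched in plain Python. This is an illustrative stand-in, not the NLTK/Keras pipeline used in the study: the stopword list, the regular expression, the helper names and the choice to pad at the end of the sequence are all assumptions.

```python
import re

# Tiny hypothetical stopword list; the study relies on library-provided lists.
STOPWORDS = {"the", "a", "an", "is", "was", "to", "and", "of"}

def preprocess(text):
    """Lower-case, tokenize on word characters, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def encode_and_pad(token_lists, vocab):
    """Encode tokens as ids and zero-pad to the longest text in the dataset."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [[vocab[t] for t in tokens] + [0] * (max_len - len(tokens))
            for tokens in token_lists]

texts = ["The flight was late and the crew was rude", "Great service"]
token_lists = [preprocess(t) for t in texts]
vocab = build_vocab(token_lists)
padded = encode_and_pad(token_lists, vocab)
print(padded)
```

Every encoded text ends up with the same length, which is what a fixed-size embedding layer input requires.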

Supervised machine learning algorithms
In this section, we discuss the machine learning algorithms used for text classification of the benchmark datasets. The scikit-learn library and NLTK are used to implement the machine learning models. Five machine learning algorithms were implemented in Python using scikit-learn. We used tree-based, regression-based and ensemble-based models to check the efficacy of the proposed system. The following machine learning algorithms were used alongside our proposed methodology so that their predictions could be compared with it.

Random Forest (RF)
RF is an advanced version of the decision tree. Random Forest is a supervised learning algorithm that consists of a number of decision trees, each predicting a class independently; the final prediction is the class that receives the majority of votes. The error rate of RF is very low compared with the other models because of the low correlation between its trees [13]. The random forest used in this study was trained using different parameters. Depending on the problem, different algorithms can be used to decide a split in a decision tree.

Logistic Regression (LR)
Logistic regression is a statistical method in which one or more variables are used to compute the final result. LR is widely used to compute class probabilities, so it is well suited to problems in which the target class is categorical [30]. It models the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using the logistic function. LR uses the sigmoid function to transform the output into a probability value. The aim is to obtain the optimal probabilities by minimizing the cost function.
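For reference, the sigmoid function and the cross-entropy cost commonly minimized in binary logistic regression can be written as below. The symbols $w$, $b$, $x_i$, $y_i$ and $N$ (weights, bias, feature vector, binary label and number of samples) are notation assumed here and do not appear in the original text.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y}_i = \sigma\left(w^\top x_i + b\right)

J(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```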

Extra Tree Classifier (ETC)
Extra Tree Classifier (ETC) is an ensemble learning model. The working principle of ETC is quite similar to that of RF; the only difference is in the construction of the trees in the forest. In ETC, every tree is built from the original training sample. At each decision node, a random sample of k features is drawn, and the Gini index is used to select the best feature on which to split the data. This approach results in the construction of de-correlated trees in ETC [38].
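The Gini index used to score candidate splits can be sketched as follows. The function names are hypothetical, and the snippet only scores a given split; it does not reproduce the random feature and threshold selection of an actual extra-trees implementation.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

pure_split = weighted_gini(["pos", "pos"], ["neg", "neg"])
mixed_split = weighted_gini(["pos", "neg"], ["pos", "neg"])
print(pure_split, mixed_split)
```

A lower weighted impurity indicates a better split, so the first split would be preferred over the second.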

Gradient Boosting Machine (GBM)
Gradient Boosting Machine is based on boosting and is a powerful ensemble model used extensively for classification problems. In GBM, many weak classifiers work together to produce a strong learning model. It usually uses decision trees as the weak learners [11]. GBM builds its trees sequentially, with each new tree fitted to correct the errors of the previous ones, so it is an expensive and time-consuming choice. Owing to its grounding in probably approximately correct (PAC) learning, it works well on unprocessed data. GBM is also a good choice for data with missing values.

SGD (Stochastic Gradient Descent)
SGD fits linear models such as logistic regression and SVM by minimizing a convex loss function. It is a good choice for multi-class classification problems because it combines multiple binary classifiers using the OvA (One versus All) method. SGD works well on large datasets because it takes the idea of mini-batch training to the extreme and uses a single sample in each iteration. It is easy to understand and easy to implement. Although SGD has many benefits, it also has some drawbacks: its hyperparameters need careful attention and must be set correctly to achieve good accuracy [12].
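The single-sample update described above can be sketched for binary logistic regression as follows. This is a minimal illustration under assumed settings (learning rate 0.5, 200 epochs, a four-point toy dataset, hypothetical function names); it is not the scikit-learn estimator used in the study.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(samples, labels, lr=0.5, epochs=200, seed=0):
    """Fit weights and bias with one-sample-at-a-time gradient steps."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)                 # visit samples in random order
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = p - y                    # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Toy data: the label equals the first feature.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic(X, y)
preds = [predict(w, b, x) for x in X]
print(preds)
```

For more than two classes, one such binary classifier would be trained per class in the One versus All scheme mentioned above.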

Proposed methodology
This section presents the proposed framework used in this study, which is shown in Fig. 1. The use of deep learning-based classifiers has received great attention during the last few years. Deep models can potentially increase the classification accuracy over traditional classifiers. For this reason, this study utilizes a Convolutional Neural Network (CNN) for text classification. The architecture of the proposed neural network is shown in Fig. 2. Seven datasets related to text classification (six benchmark and one non-benchmark) are utilized in the experiments. Five preprocessing steps are performed on each dataset before training. The datasets are split in a 70:30 ratio for training and testing. Then, the proposed approach, FastText word embedding in combination with a 3-layer CNN, is applied for training. The proposed approach is evaluated on four evaluation measures: Accuracy, Precision, Recall and F1-score.

Fig. 1 Architecture diagram of the proposed framework
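The four evaluation measures can be computed from true and predicted labels as in the sketch below. Macro averaging over classes is an assumption made here, since the averaging scheme is not stated; the function name and the toy labels are likewise illustrative.

```python
def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {"accuracy": accuracy, "precision": sum(precisions) / n,
            "recall": sum(recalls) / n, "f1": sum(f1s) / n}

y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
scores = evaluate(y_true, y_pred)
print(scores)
```

On imbalanced datasets, macro averaging weights every class equally, whereas accuracy is dominated by the majority class.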

Embedding layer
Neural networks do not take input in the form of text directly, so the text must be vectorized to make it understandable for the model. Recently, word embedding has gained much attention for text classification [22] and sentiment analysis [4]. Pre-trained vectors [29] or random values can be used to initialize the word embedding. The proposed model uses the embedding layer as the first layer, which takes the text as input and converts it into a matrix denoted as $W \in \mathbb{R}^{k \times m}$, where k is the maximum number of words in the text and m is the dimension of the word embedding. The embedding dimension is set to 300 and the maximum sequence length is equal to the maximum length of a text in the dataset. If the text is shorter than the maximum sequence length, zero-padding is used.

Fig. 2 Architecture of the proposed CNN model
$$W_k = v_{w_1} \oplus v_{w_2} \oplus \cdots \oplus v_{w_k} \qquad (1)$$
In (1), $W_k$ is the embedding of the words and $\oplus$ is the concatenation operation. $W_k$ is constructed by concatenating the vectors $v_{w_r}$, where $v_{w_r} \in \mathbb{R}^{m}$ is the m-dimensional vector of the $r$-th word in the text. The FastText word embedding technique is used in this work, and its output is later fed to the convolutional layer.
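The construction of the embedded text from per-word vectors, with zero-padding up to k words, can be sketched as follows. The toy three-dimensional vectors stand in for the 300-dimensional FastText vectors, and mapping unknown words to a zero vector is a simplifying assumption of this sketch (FastText itself can compose vectors for unseen words from subwords).

```python
def embed_text(tokens, vectors, max_len, dim):
    """Stack word vectors row by row and zero-pad to max_len rows."""
    zero = [0.0] * dim
    rows = [list(vectors.get(tok, zero)) for tok in tokens[:max_len]]
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows

# Hypothetical embedding table with m = 3 (the paper uses m = 300).
toy_vectors = {"good": [0.1, 0.2, 0.3], "flight": [0.4, 0.5, 0.6]}
W = embed_text(["good", "flight"], toy_vectors, max_len=4, dim=3)
print(len(W), len(W[0]))
```

The result has k rows and m columns, matching the shape of the embedding layer output described above.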

FastText
FastText is a word representation library provided by the Facebook research team. The pre-trained embedding contains 2 million 300-dimensional word vectors trained on 600 billion tokens of Common Crawl data. It uses n-grams as features in addition to single words. Text classification is performed very effectively and efficiently because of its simple architecture [33]. Different word embedding techniques have been utilized in various text classification tasks. Pre-trained word embeddings are learned in an unsupervised manner by predicting the context of words, and words placed close together are considered to have similar contexts. FastText embedding uses morphological (subword) information to represent difficult words, which makes it a suitable choice for representing text as vectors. This ability also improves its generalizability. FastText generates word vectors using character n-grams, which helps in dealing with unknown words.
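The way character n-grams help with unknown words can be illustrated with a short sketch. The boundary markers "<" and ">" and the n-gram range of 3 to 6 follow the commonly described FastText subword scheme; treating them as the configuration used in this study is an assumption, and the helper name is hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)

# An unseen word still shares subwords with known words.
shared = set(char_ngrams("where")) & set(char_ngrams("wherever"))
print(sorted(shared)[:5])
```

Because a word vector is composed from the vectors of its n-grams, an out-of-vocabulary word such as "wherever" can still receive a meaningful representation through the subwords it shares with seen words.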

Convolution feature maps
The basic operation of the convolution layer is to extract multi-resolution features from the input matrix. Different types of filters are applied to get different kinds of features. The convolutional layer aims to extract patterns, i.e., discriminative word sequences found in the input word vectors that are common throughout the training instances. Let $x_i \in \mathbb{R}^{d}$ be the d-dimensional word vector for the $i$-th word in a sentence. Let $x \in \mathbb{R}^{L \times d}$ be the input sentence, where L is the length of the sentence. Let k be the length of the filter, and let the vector $m \in \mathbb{R}^{k \times d}$ be a filter for the convolution operation. For each position j in the sentence, the window vector $w_j$ with k consecutive word vectors is denoted as
$$w_j = [x_j, x_{j+1}, \ldots, x_{j+k-1}]$$
where the commas represent row vector concatenation. A filter m convolves with the window vectors (k-grams). In our work, the kernel size is 7 and 64 filters are used, so each filter covers 7-word combinations. The ReLU activation function is applied to the output of each CNN neuron. This function sets all negative values to zero and thereby introduces non-linearity into the network. Applying ReLU does not change the shape of the output of the CNN layer.
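A single filter sliding over the window vectors, followed by ReLU, can be sketched as below. The example uses a toy sentence with L = 4, d = 2 and a filter of length k = 2 rather than the kernel size of 7 used in the paper; the absence of padding at the sentence borders and the function names are assumptions of this sketch.

```python
def relu(value):
    """Set negative values to zero and keep positive values unchanged."""
    return value if value > 0 else 0.0

def conv1d_single_filter(sentence, filt, bias=0.0):
    """Slide one k x d filter over an L x d sentence and apply ReLU."""
    k = len(filt)
    feature_map = []
    for j in range(len(sentence) - k + 1):
        window = sentence[j:j + k]          # k consecutive word vectors
        total = bias
        for word_vec, filt_row in zip(window, filt):
            total += sum(a * b for a, b in zip(word_vec, filt_row))
        feature_map.append(relu(total))
    return feature_map

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, -2.0]]  # L = 4, d = 2
filt = [[1.0, 0.0], [0.0, 1.0]]                               # k = 2
feature_map = conv1d_single_filter(sentence, filt)
print(feature_map)
```

Each entry of the feature map scores one k-word window, and a layer with 64 filters would produce 64 such maps.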
The purpose of a pooling layer is to further abstract the features by combining the scores for each filter. In this model, we apply two max-pooling layers over each feature map. A max-pooling layer captures the most important feature by selecting the highest value in each pooling window. As a result, the size of the input features is greatly reduced: the output of this layer is reduced by a factor equal to the pool size p, where p = 2. Kernel = 5 with different size "global max pooling".
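Max pooling with p = 2 and global max pooling can be sketched on a single feature map as follows. Non-overlapping windows (stride equal to the pool size) and dropping an incomplete trailing window are assumptions of this sketch.

```python
def max_pool_1d(feature_map, pool_size=2):
    """Keep the maximum of each non-overlapping window of pool_size values."""
    return [max(feature_map[i:i + pool_size])
            for i in range(0, len(feature_map) - pool_size + 1, pool_size)]

def global_max_pool(feature_map):
    """Reduce a whole feature map to its single largest value."""
    return max(feature_map)

fm = [2.0, 1.0, 0.0, 3.0, 0.5, 0.2]
pooled = max_pool_1d(fm)
top = global_max_pool(fm)
print(pooled, top)
```

With p = 2 the feature map is halved in length, while global max pooling keeps one value per filter regardless of the text length.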
The dropout layer reduces overfitting by randomly dropping input features during training, with each feature dropped with a probability equal to the dropout rate. The dropout rate d for our model is 0.5.
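A minimal sketch of dropout with rate d = 0.5 is given below. It uses the common "inverted dropout" formulation, in which the surviving features are rescaled by 1 / (1 - d) during training; that rescaling, the function signature and the fixed seed are assumptions, not details given in the paper.

```python
import random

def dropout(features, rate=0.5, training=True, rng=None):
    """Randomly zero features with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(features)              # dropout is disabled at inference
    if rng is None:
        rng = random.Random()
    keep = 1.0 - rate
    # Survivors are scaled so the expected activation stays unchanged.
    return [x / keep if rng.random() < keep else 0.0 for x in features]

out = dropout([1.0] * 1000, rate=0.5, rng=random.Random(0))
dropped = sum(v == 0.0 for v in out)
print(dropped)
```

Roughly half of the features are zeroed on each training pass, and a different random subset is dropped every time.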
A dense layer is the last layer of this model and it produces the final result. This layer is followed by a softmax activation function. Softmax activation is used for multi-class classification. Therefore, we have used this activation since most of the datasets used are multi-class datasets.
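The softmax activation that turns the dense layer outputs into class probabilities can be sketched as follows. Subtracting the maximum logit before exponentiation is a standard numerical-stability step added here; the example logits are arbitrary.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))    # class with the highest probability
print(probs, predicted_class)
```

The predicted label is the index of the largest probability, which for a five-class dataset such as Yelp Full would correspond to one of the five star ratings.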

Case study: evaluation of CNN model
A case study has been performed to evaluate the impact of using an external word embedding with a few CNN layers on text classification. We performed a series of experiments on different benchmark datasets, and the experimental results for all datasets are discussed in this case study. The results show that our model outperforms state-of-the-art methods on five of the seven datasets, as shown in Table 3.
It is evident from the results shown in Table 3 that the hand-crafted feature engineering-based models [56] obtained the lowest results on all the datasets compared with the rest of the models. One of the main reasons is that feature engineering suffers from data sparsity. It also cannot take advantage of training set supervision [9]. Character-based models [50,55] achieve better accuracies because, compared with word-based models, they have deeper supervision from characters. N-grams, stems, words and phrases formed by combining characters assist in sentiment classification. Overall, the character-based models obtain their highest results on the two Amazon datasets, because both Amazon datasets contain millions of training samples. Out of the three character-based baselines, the best results are achieved by the VDCNN [5] architecture. With 29 CNN layers, the model learns more combinations of characters.
In contrast, word-based models lose on the two Amazon datasets and exceed other models on three datasets. The reason is that these three datasets perform categorization through keywords, and word-based models do not combine characters but directly use word embeddings. Then comes W.C RegionEmb, which exceeds all the variants on all five datasets because of its ability to learn region embeddings that use the n-gram features from the text. EXAM outperforms all the word-based baselines because of more fine-grained interaction.
One can observe that the proposed model achieves the best performance on five datasets: AG News, Yahoo! Answers, Yelp Reviews Full, Amazon Review Full and Twitter US Airline Sentiment. For the Yelp Reviews Full dataset, the proposed model improves the performance by 20.0% to 22.0%. Additionally, the proposed model beats all the baselines on AG's News with a performance gain of 3.0%. The accuracy on Yahoo! Answers also increased drastically, by nearly 20.0%. Finally, the performance on Twitter US Airline Sentiment improved by 6% because we use the FastText word embedding, whose word vectors are trained on billions of tokens and are very useful in this task. The detailed performance of our proposed model on all the datasets is shown in Table 4.
Deep learning models have shown robust results on all seven datasets, as shown in Table 3, but to the best of our knowledge this is the first time that a CNN with pre-trained word embedding has been used to analyze reviews. CNN has mostly shown superior performance compared with other deep learning models. CNN extracts features directly from raw data by applying filters and also explores spatial relationships among variables by assigning weights. Overfitting in the CNN model is prevented by reducing the complexity of the model; our proposed model is simple but efficient. A comparison of the results of the proposed model on all 7 datasets is presented in Fig. 3. It can be concluded that the proposed CNN along with FastText performs well on all datasets in terms of accuracy, precision, recall and F-score. The proposed approach achieves its highest accuracy of 0.96 on the AG's News dataset, its highest precision of 0.94 on the AG's News and Yelp Reviews Polarity datasets, its highest recall of 0.96 on Yelp Reviews Polarity and its highest F-score of 0.95 on the Yelp Reviews Polarity dataset. FastText word embedding is able to achieve good results because it considers character-level information, which helps in representing rare words appropriately. However, it is also important to consider the limitations of this study: the datasets come from different domains with different types of vocabulary, which is the reason for the differing results. In future, our aim is to apply other word embedding techniques in combination with deep learning models on these datasets (Table 5).

Conclusions
A novel framework for short-text and long-text classification is proposed that uses FastText word embedding followed by a 3-layer CNN model. Experiments over 7 datasets validate the effectiveness of our proposed model. The experimental results verify that the use of FastText word embedding increases the accuracy. The proposed framework, which combines FastText with CNN, is simple, effective and efficient. The results reveal that the proposed approach is robust on all datasets using raw data, without any manual feature extraction or feature selection method. It can be concluded that the easy implementation and low complexity of the CNN model make it an efficient method for the classification of short and long text. Moreover, there is no need to stack many CNN layers to get promising results; better results are obtained by using merely three CNN layers. Future research entails testing the proposed methodology using a fusion of multiple word embeddings rather than a single word embedding, i.e., FastText. This may provide further results for comparison with those presented in this article.

Conflict of Interests
The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.