1 Introduction

Natural language processing (NLP) involves text processing and extracting the key patterns from the natural/human languages. It involves various tasks that rely on various statistics and data-driven computation techniques. One of the important tasks in NLP is text classification. It is a classical problem where the prime objective is to classify (assign labels or tags) the textual contents [1]. Textual contents can either be sentences, paragraphs, or queries [1, 2]. There are many real-world applications of text classification, such as sentiment analysis [3], news classification [4], intent classification [5], spam detection [6], and so on.

Text classification can be done by manually labeling the textual data. However, with the exponential growth of text data in the industry and over the Internet, automated text categorization has become very important. Automated text classification approaches can be broadly classified into rule-based, data-driven (machine learning/deep learning-based), and hybrid approaches. Rule-based approaches classify text into different categories using a set of pre-defined rules. However, they require complete domain knowledge [7, 8]. Alternatively, machine learning-based approaches have proven to be significantly effective in recent years. Classical machine learning approaches work in two stages: first, they extract handcrafted features from the text; next, these features are fed into a machine learning model. Bag of words, n-gram models, term frequency-inverse document frequency (TF-IDF), and their extensions have been popularly used for extracting the handcrafted features. For the second stage, many classical machine learning algorithms such as Support Vector Machine (SVM), Decision Tree (DT), conditional probability-based methods such as Naïve Bayes, and ensemble-based approaches have been used [9, 10].

Recently, some deep learning methods, specifically Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN), have shown significant results in text classification [11,12,13,14,15,16]. CNN-based models are trained to recognize patterns in text automatically, such as key phrases. Most CNN-based models utilize one-dimensional (1-D) convolution followed by a one-dimensional pooling operation (average or max) to extract a feature vector from the input word embeddings. This feature vector is fed into the classification layer as an input for classification purposes. The input word embedding is a word matrix where each row represents a word vector. Therefore, one-dimensional (1-D) convolution extracts n-gram based features by performing the convolution operation on two or more word vectors at a time.

However, improving text classification results by utilizing the n-gram features between different sentences using the convolution operation remains an open research question. Furthermore, the input matrix structure remains a point to ponder, as it could be revamped to apply multidimensional convolution. This paper presents TextConvoNet, a new CNN-based architecture for text classification. In contrast to the existing works, the proposed architecture uses 2-dimensional convolutional filters to extract the intra-sentence and inter-sentence n-gram features from text data. First, it represents the text data as a paragraph-level (multi-sentence) embedding matrix, which makes it possible to apply 2-dimensional convolutional filters. After that, multiple convolutional filters are applied to extract the features. The resultant features are concatenated and fed into the classification layer. To evaluate the performance of the presented TextConvoNet, we perform a thorough experimental analysis on five different benchmarked binary-class and multi-class classification datasets. Evaluation of the TextConvoNet is based on eight performance measures: accuracy, precision, recall, f1-score, specificity, G-means (Gmean1 and Gmean2), and MCC. Furthermore, we compare the performance of the TextConvoNet with state-of-the-art classification models, including attention-based models, BERT, and deep-learning-based models.

1.1 Contributions

The main contributions of the presented work are as follows.

  1. This work presents TextConvoNet, a CNN-based architecture to represent input text data as a multidimensional word embedding. The presented architecture extracts both intra-sentence and inter-sentence features of the text.

  2. The presented architecture is comprehensively evaluated on five benchmarked text datasets, including binary-class and multi-class datasets. This analysis helps in generalizing the findings of the presented work.

  3. An extensive comparison of the presented TextConvoNet with existing machine learning models, deep learning-based models, attention models, and state-of-the-art CNN-based models is performed to validate the performance of the presented TextConvoNet.

The rest of the article is organized as follows. Section 2 discusses the literature review. Section 3 presents information on text classification using CNN models and provides details of the presented TextConvoNet architecture. Section 4 provides the details of the experimental setup and analysis. It includes details of used datasets followed by performance measures and implementation details. Section 5 presents experimental results and comparison results of the TextConvoNet with the state-of-the-art text classification models. Section 6 discusses the main findings and limitations of the presented work. Section 7 concludes the paper along with the details of the future directions of the presented work.

2 Literature review

Text classification is one of the important tasks in Natural Language Processing (NLP). Many data-driven approaches have been suggested for text classification. Recently, deep learning-based approaches have emerged and performed significantly well in text classification. This section discusses some of the relevant deep learning-based models suggested for text classification.

Two neural networks have been prevalent in NLP problems: Long Short Term Memory (LSTM) and Convolutional Neural Network (CNN). LSTM can extract current information and remember the past data points in a sequence [13, 14, 17, 18]. However, LSTM-based models have a very high training time because they process the input one token per step. Recently, attention-based networks have proven effective for Natural Language Processing [19, 20]. However, LSTM with attention networks introduces an additional computation burden because of the exponential function and the normalized alignment score computation over all the words in the text [21].

One of the early CNN-based models was Dynamic CNN (DCNN) for text classification [22]. It used dynamic max-pooling, where the first layer of DCNN makes a sentence matrix using word embeddings. Then it has a convolutional architecture that uses repeated convolutional layers with dynamic k-max-pooling to extract feature maps on the sentence. These feature maps are capable of capturing the short-range and long-range relationships between the words. Later, Kim [23] gave the simplest CNN-based model for text classification, which has become the benchmark architecture for many recent models for text classification. Kim’s model used a single layer of convolution on top of the input word vectors obtained from an unsupervised neural language model (word2vec). It was shown to improve upon the state of the art in binary and multi-class text classification problems. Recently, some attempts have been made to enhance the architectures of CNN-based models [15, 24,25,26]. Instead of using pre-trained low-dimensional word vectors as an input to CNN, the authors of [26] directly applied CNNs to high-dimensional text data to learn the embeddings of small text regions for the classification. Most existing works have shown that convolutions of sizes 2, 3, 4, or 5 give significant results in text classification.

Some of the research explored how word embeddings and CNN architectures affect model performance. Encouraged by VGG [27] and ResNets [28], Conneau et al. [29] proposed a Very Deep CNN (VDCNN) model for text processing. It applies CNN directly at the character level and uses small convolutions and pooling functions. The research showed that the performance of VDCNN improves with the increase in depth. In another work, Le et al. [30] showed that deep architectures could outperform shallow architectures where text data is represented as a sequence of characters. Later on, Squeezed-VDCNN was suggested [15], which adapted VDCNN to work on mobile platforms. However, a basic shallow-and-wide network outperforms deep models, such as DenseNet [31], with word embeddings as inputs. Some application-specific models have also been proposed for text classification. Some researchers used deep learning-based models for short text classification [32, 33]. In some research, language-specific (for example, Arabic and Urdu) deep learning models have been proposed [34, 35]. Moreover, a combination of CNN and bi-LSTM has been used for medical text classification [36]. In another work, the authors used CNN with a hierarchical encoder for defect text classification [37].

To the best of our knowledge, all the CNN-based networks discussed above extract n-gram based features using kernels/filters of varied sizes within a sentence. In light of the above works, we present a novel CNN-based architecture that extracts the intra-sentence n-gram features and also captures the inter-sentence n-gram features.

3 Proposed TextConvoNet architecture

This section first discusses the existing CNN-based approach of text classification with an example (Section 3.1). Next, the section presents details of the proposed TextConvoNet architecture (Section 3.2). The mathematical symbols used in this section are given in Table 1.

Table 1 Symbols used in proposed framework

3.1 Text classification using existing CNN models (background)

Text classification problems can be formally defined as follows [38].

Definition 1

Given a text dataset \(\mathcal{T}\) consisting of labelled text articles, each text article has a particular label/class \(l \in L\) that depends on the NLP task at hand. In the case of binary-class classification, there are two labels for the text dataset. A text article \(te \in \mathcal{T}\) consists of sentences and words. Let us say that a text article \(te_{i}\) contains m sentences \(s_{1}, \ldots, s_{m}\) and each sentence \(s_{j}\) \((1 \le j \le m)\) contains n words.

The objective of text classification is to learn a model M that can correctly classify any new text article \(t_{new}\) into a label \(l \in L\).

Kim [23] presented a simple and effective architecture for text classification. For simplicity, we refer to it as Kim’s CNN model throughout the paper. This architecture served as a guiding light and basis for many CNN-based architectures for text classification. Many recent architectures internally use this model [39,40,41,42]. In Kim’s CNN model, sentences are mapped to embedding vectors that are made available as an input matrix to the model. It uses only a single layer of convolution, with kernel sizes 3, 4, and 5, on top of word vectors obtained from an existing pre-trained model. The resultant feature maps are further processed using a max-pooling layer to distill or summarize the extracted features, which are subsequently sent to a fully connected layer. Figure 1 shows a simple example of text classification using Kim’s CNN model.

Fig. 1 Example: Text classification using CNN [23]

As shown in Fig. 1, the input to the model is a sentence represented as a matrix. Each row of the matrix is a vector that represents a word. 1D convolution is performed on the matrix with kernel sizes 3, 4, and 5. Max-pooling is performed on the feature maps, which are then concatenated and sent to the last fully connected layer for classification. Formally, the sentence modeling is as follows.

Sentence modelling

In each sentence, \(we_{p} \in \mathbb{R}^{z}\) denotes the word embedding (a vector) for the pth word in the sentence, where z is the word embedding dimension. If a sentence has n words, it can be represented as an embedding matrix \(E_{we} \in \mathbb{R}^{n\times z}\), i.e., a word matrix in which every row is the vector of a particular word of the sentence. Let \(we_{p:p+q}\) represent the concatenation of the vectors \(we_{p}, we_{p+1}, \ldots, we_{p+q}\). The convolution operation is performed on this input embedding layer. It involves a filter \(k\in \mathbb{R}^{x\times z}\) that is applied to a window of x words to produce a new feature. For example, a feature \(c_{p}\) is generated from the window of words \(we_{p:p+x-1}\) by (1).

$$ c_{p} = f(we_{p:p+x-1} \cdot k + b) $$
(1)

Here, \(b\in \mathbb{R}\) and f denote the bias and the non-linear activation function, respectively. The filter (kernel) k is applied to all possible windows using the same weights to create the feature map (a 1-D vector) given by (2).

$$ C = [c_{1}, c_{2}, \ldots, c_{n-x+1}] $$
(2)
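As an illustrative sketch of (1) and (2), and not the original implementation, the following pure-Python snippet slides a single kernel over a toy sentence matrix. The function names, the choice of ReLU for f, the zero bias, and all numeric values are assumptions made only for this example.

```python
def relu(value):
    """Non-linear activation f; ReLU is an assumed choice for this sketch."""
    return max(0.0, value)


def conv1d_feature_map(sentence, kernel, bias=0.0, activation=relu):
    """Slide an x-by-z kernel over an n-by-z sentence matrix.

    Returns C = [c_1, ..., c_{n-x+1}] as in (2), where each c_p follows (1).
    """
    n, x = len(sentence), len(kernel)
    feature_map = []
    for p in range(n - x + 1):
        window = sentence[p:p + x]  # word vectors we_p, ..., we_{p+x-1}
        # Element-wise product of window and kernel, summed over all entries.
        total = sum(
            w * k
            for word_vec, kernel_row in zip(window, kernel)
            for w, k in zip(word_vec, kernel_row)
        )
        feature_map.append(activation(total + bias))
    return feature_map


# Toy sentence: n = 4 words, embedding dimension z = 2 (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
# Bigram kernel: x = 2 rows, z = 2 columns (made-up weights).
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature_map = conv1d_feature_map(sentence, kernel)
print(feature_map)  # one feature per bigram window, n - x + 1 = 3 values
```

Because the kernel spans the full embedding dimension z, it can only move along the word axis, which is why the result is a 1-D vector.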

3.2 Proposed TextConvoNet architecture

The proposed TextConvoNet architecture extracts n-gram features between words of different sentences in addition to the intra-sentence n-gram features. The reason is that text data containing multiple sentences may carry useful n-gram features that span sentences. Capturing them is only possible by using a paragraph matrix instead of a sentence matrix and applying 2-D filters. Thus, the motivation and the research question are to explore “if combining n-gram-based inter-sentence characteristics with n-gram-based intra-sentence features is beneficial or not”. In real-world scenarios, the sentences of a paragraph are strung together in a very complex manner, making it very difficult for any model to come up with the correct label, whether it is a sentiment or a news category. Therefore, there may be instances when a model cannot extract the inter-sentence features and hence fails to come up with a suitable result. Motivated by this shortcoming, we present an alternative input structure for the model and propose a novel CNN model that uses this input structure and employs 2-D convolution [43].

3.2.1 Input representation

We propose a new input representation for text data. In existing works, each sentence is represented as a two-dimensional matrix where each row represents an embedding vector for a word. In our model, by contrast, the input is represented as a three-dimensional matrix. In this representation, each row corresponds to one sentence of a paragraph, each cell to a single word, and the third dimension holds the embeddings (word vectors). This representation may be termed a paragraph matrix. The formal description of our input structure is as follows. For each sentence in a paragraph, let \(E_{w_{i}} \in \mathbb{R}^{z}\) represent the word embedding for the ith word in the sentence, where z is the dimension of the word embedding. Given that a paragraph has m sentences and n words in each sentence, the paragraph can be represented as an embedding matrix W of size (m,n,z) such that \(W \in \mathbb{R}^{m\times n \times z}\).
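The input structure can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the original implementation: the paragraph is assumed to be already split into sentences and words, shorter sentences and paragraphs are zero-padded (and longer ones truncated) to fixed m and n, and the helper `build_paragraph_tensor`, the `PAD` token, and the toy embedding table are hypothetical.

```python
PAD = "<pad>"  # hypothetical padding token, mapped to a zero vector


def build_paragraph_tensor(paragraph, embeddings, m, n, z):
    """Return a nested list W of shape (m, n, z) for a tokenized paragraph.

    paragraph  : list of sentences, each a list of word strings
    embeddings : dict mapping a word to a list of z floats
    Extra sentences/words are truncated; missing or unknown ones become zeros.
    """
    zero = [0.0] * z
    tensor = []
    for i in range(m):
        words = paragraph[i] if i < len(paragraph) else []
        row = []
        for j in range(n):
            word = words[j] if j < len(words) else PAD
            row.append(list(embeddings.get(word, zero)))
        tensor.append(row)
    return tensor


# Toy embedding table with z = 2 (made-up values, not GloVe vectors).
toy_embeddings = {
    "good": [1.0, 0.0],
    "food": [0.0, 1.0],
    "bad": [-1.0, 0.0],
    "service": [0.5, 0.5],
}
paragraph = [["good", "food"], ["bad", "service", "today"]]
W = build_paragraph_tensor(paragraph, toy_embeddings, m=3, n=3, z=2)
print(len(W), len(W[0]), len(W[0][0]))  # dimensions m, n, z
```

Row i of W holds the ith sentence, so a filter that spans two rows sees words from two adjacent sentences at once.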

The overall architecture of our proposed model, TextConvoNet, is shown in Fig. 2. The presented TextConvoNet model uses an alternate input structure for the paragraph, 2D convolution instead of 1D convolution, and differing kernel sizes. TextConvoNet sends the input matrix into 4 parallel pathways of convolution layers. The first two layers (intra-sentence layers), with 32 filters each and kernel sizes of 1 × 2 and 1 × 3, respectively, are concatenated and have the role of extracting the intra-sentence n-gram features. The other two layers (inter-sentence layers), with 32 filters each and kernel sizes of 2 × 1 and 2 × 2, are concatenated together and have the sole purpose of extracting the inter-sentence n-gram features. The intra-sentence and inter-sentence outputs are further concatenated and fed into a fully connected layer consisting of 64 neurons, which subsequently performs the relevant classification task. A detailed explanation of the architecture is given as follows.

Fig. 2 Proposed TextConvoNet architecture
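To make the four pathways concrete, the following sketch computes the size of each pathway's feature map and of the concatenated feature vector for a paragraph matrix with m sentences and n words per sentence. The kernel sizes and the 32 filters come from the description above; stride 1, no padding ("valid" convolution), no pooling, and flattening each pathway before concatenation are assumptions of this sketch, as are the function names.

```python
INTRA_KERNELS = [(1, 2), (1, 3)]  # slide along the words of one sentence
INTER_KERNELS = [(2, 1), (2, 2)]  # span two adjacent sentences
FILTERS = 32


def conv2d_output_shape(m, n, kernel, filters, padding="valid"):
    """Shape (height, width, channels) of a stride-1 2-D convolution output."""
    kh, kw = kernel
    if padding == "same":
        return (m, n, filters)
    return (m - kh + 1, n - kw + 1, filters)


def concatenated_feature_size(m, n):
    """Total number of features after flattening and joining all pathways."""
    total = 0
    for kernel in INTRA_KERNELS + INTER_KERNELS:
        height, width, channels = conv2d_output_shape(m, n, kernel, FILTERS)
        total += height * width * channels
    return total


# Example: a paragraph padded to m = 4 sentences of n = 10 words each.
for kernel in INTRA_KERNELS + INTER_KERNELS:
    print(kernel, conv2d_output_shape(4, 10, kernel, FILTERS))
print(concatenated_feature_size(4, 10))
```

Under these assumptions, the 1 × k kernels keep all m rows, whereas the 2 × k kernels produce m − 1 rows, one for each pair of adjacent sentences.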

3.2.2 Convolution layer

This layer applies filters to the input to create feature maps that summarize the features detected in the input. A small matrix of numbers (called a kernel or filter) is passed over the paragraph matrix, which is transformed based on the values of the filter. Let \(E_{w}\) be an input paragraph matrix of size m × n and let H be a two-dimensional kernel of size (2g + 1) × (2d + 1), where g and d are constants. The outcome of the convolutional layer is given by (3).

$$ r_{i,j}=\sum\limits_{u=-g}^{g}\left( \sum\limits_{v=-d}^{d}H[u,v]\,E_{w}[i-u,j-v]\right) $$
(3)

Here, \(r_{i,j}\) is the value at location (i,j) in the feature map.
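The double sum in (3) can be evaluated directly, as in the following sketch. For readability it uses one scalar per cell instead of a z-dimensional embedding, treats entries outside the matrix as zero, and uses odd-sized kernels so that the index ranges −g…g and −d…d apply; these simplifications, the function name, and the toy values are assumptions of the example rather than details of TextConvoNet.

```python
def conv2d(E, H, g, d):
    """Evaluate (3) at every location (i, j) of the m-by-n matrix E.

    H is a (2g+1)-by-(2d+1) kernel stored so that H[u + g][v + d] holds the
    weight for offsets u in [-g, g] and v in [-d, d]. Entries of E that fall
    outside the matrix are treated as 0 (an assumption of this sketch).
    """
    m, n = len(E), len(E[0])
    R = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            total = 0.0
            for u in range(-g, g + 1):
                for v in range(-d, d + 1):
                    a, b = i - u, j - v
                    if 0 <= a < m and 0 <= b < n:
                        total += H[u + g][v + d] * E[a][b]
            R[i][j] = total
    return R


# Toy paragraph matrix: m = 2 sentences, n = 3 word positions.
E = [[1, 2, 3],
     [4, 5, 6]]

# 1 x 3 kernel (g = 0, d = 1): mixes neighbouring words of the same sentence.
R_intra = conv2d(E, [[1, 0, -1]], g=0, d=1)
# 3 x 1 kernel (g = 1, d = 0): mixes the same word position across sentences.
R_inter = conv2d(E, [[1], [1], [1]], g=1, d=0)
print(R_intra)
print(R_inter)
```

The first kernel never leaves a row, so it yields intra-sentence features; the second combines rows, so every output value depends on more than one sentence.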

3.2.3 ReLU activation layer

The ReLU activation layer after each convolution layer introduces non-linearity by setting negative responses to zero. This helps the model learn complex patterns with a reduced possibility of vanishing gradients and at a cheap computation cost. The activation function for ReLU is given in (4), where \(r_{i,j}\) is the input to the ReLU function.

$$ f(r_{i,j})=\max(0,r_{i,j}) $$
(4)

3.2.4 Classification

The feature maps generated by using different kernel sizes are concatenated and fed into the fully connected layer. The fully connected layer is a multilayer perceptron connected to all the activations from the previous layers. The activation of these neurons is calculated by a matrix multiplication with its weights followed by the addition of an offset (bias) value. A dropout layer is also used, which randomly deactivates (sets to 0) the outgoing edges of hidden units at each update of the training phase and thereby helps to reduce overfitting. In the end, the classification layer performs classification based on the features extracted by the previous layers. It is a traditional ANN layer with softmax or sigmoid as the activation function.

3.2.5 Loss function

For the binary-class text classification task, TextConvoNet is trained by minimizing the binary cross-entropy (5) over a sigmoid activation function. For the multi-class classification task, TextConvoNet is trained by minimizing the categorical cross-entropy (6) over a softmax activation function. The loss functions can be formulated as

$$ BCE = -\frac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{c}\left[ y_{ij}\log\left(\sigma(\hat{y}_{ij})\right) + (1 - y_{ij})\log\left(1 - \sigma(\hat{y}_{ij})\right)\right] $$
(5)
$$ CCE = -\frac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{c} y_{ij}\log\left(\frac{e^{\hat{y}_{ij}}}{\sum_{r=1}^{c} e^{\hat{y}_{ir}}}\right) $$
(6)

Here, i is the index of a training instance, j is the index of a label (class), m is the number of training instances, c is the number of labels, \(\hat{y}_{ij}\) is the output of the final fully connected layer, and \(y_{ij}\) is the ground truth (actual value) of the ith training sample for the jth class.
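For reference, the standard binary and categorical cross-entropy losses can be computed from raw outputs (logits) as in the sketch below. It is a plain-Python illustration with hypothetical function names and toy inputs; a deep learning framework would normally provide numerically hardened versions of both losses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(y_true, logits):
    """Mean over m instances of the summed per-label binary cross-entropy."""
    m = len(y_true)
    log_likelihood = 0.0
    for y_row, z_row in zip(y_true, logits):
        for y, z in zip(y_row, z_row):
            p = sigmoid(z)
            log_likelihood += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -log_likelihood / m


def categorical_cross_entropy(y_true, logits):
    """Mean over m instances of the cross-entropy with softmax outputs."""
    m = len(y_true)
    log_likelihood = 0.0
    for y_row, z_row in zip(y_true, logits):
        probs = softmax(z_row)
        log_likelihood += sum(y * math.log(p) for y, p in zip(y_row, probs))
    return -log_likelihood / m


# Uninformative logits (all zeros) give log(2) and log(c), respectively.
bce = binary_cross_entropy([[1], [0]], [[0.0], [0.0]])
cce = categorical_cross_entropy([[1, 0, 0]], [[0.0, 0.0, 0.0]])
print(bce, cce)
```

Both losses are non-negative and shrink as the predicted probability of the true label approaches 1.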

3.3 Analysis of TextConvoNet

Real-life conversations, reviews, and remarks are often long and complex and may convey a different perspective in each line. However, only a single deep-rooted sentiment is attached to the whole paragraph. To preserve the semantics, the paragraph is converted into a paragraph-level sentence embedding without any preprocessing of the text. The embedding matrix is then sent into 4 parallel pathways subdivided into the intra-sentence layers (kernel sizes 1 × 2 and 1 × 3) and the inter-sentence layers (kernel sizes 2 × 1 and 2 × 2), with 32 filters in every layer. These hyperparameters were selected through the GridSearchCV method from among the other candidate hyperparameter choices. The results also supported our approach of selecting small window sizes to capture fine-grained details. Similarly, using the GridSearchCV method, the learning rate was chosen to be 0.01 and the number of neurons to be 64 for the final fully connected layer. The convolutional layers were limited to four, as a larger number of layers led to overfitting.

3.4 Variants of TextConvoNet

We have created variants of the proposed model to develop an effective text classification framework. Figure 2 shows the baseline model. However, it might be interesting to see whether increasing the number of intra-sentence and inter-sentence n-gram kernels improves the model’s efficacy. Thus, we consider two versions of TextConvoNet: TextConvoNet_4 and TextConvoNet_6.

  • TextConvoNet_4: The base/parent model with 4 convolution layers (with different kernel sizes), 2 for extracting the intra-sentence n-gram features and the other 2 for extracting the inter-sentence n-gram features.

  • TextConvoNet_6: The same framework as above, but with the convolutional pathways extended to 6, 3 for extracting the intra-sentence n-gram features and the other 3 for extracting the inter-sentence n-gram features.

We have also modified various parameters of TextConvoNet, such as the number of filters, dropout rate, kernel sizes, number of nodes in the fully connected layer, and optimizers. The effectiveness of these modifications is experimentally validated in Section 5.

4 Experimental setup and analysis

This section first describes the used datasets. The model building and evaluation using the presented TextConvoNet is conducted via an experimental analysis on various binary-class and multi-class text classification datasets. This section also discusses the used performance measures and baseline machine-learning and deep-learning-based models used for comparison.

4.1 Used datasets

We have performed experiments on various publicly available binary-class and multi-class text datasets. Only a subset of instances from the datasets has been included for training and testing. The details of the used datasets are given in Table 2. No additional changes have been made to the datasets, and no preprocessing has been applied to the text. For the experiments, we have used two binary-class and three multi-class datasets. The binary-class datasets are the well-known SST-2 and Amazon Review datasets. The multi-class datasets are the R8 (a subset of Reuters-21578), Twitter Airline Sentiment, and Coronavirus Tagged datasets. All the datasets are publicly available and are sourced from Kaggle. The details of the datasets are mentioned below.

  • DATASET-1 (Footnote 1): Binary SST-2: This dataset is similar to the Stanford Sentiment Treebank dataset but contains only positive and negative reviews. We have removed all neutral reviews from the dataset.

  • DATASET-2 (Footnote 2): Amazon Review for Sentiment Analysis Dataset: This dataset contains a few million customer reviews and the star ratings.

  • DATASET-3 (Footnote 3): R8 Dataset: This is a subset of the Reuters-21578 dataset containing 8 categories for multi-class classification.

  • DATASET-4 (Footnote 4): Twitter User Airline Sentiment: The dataset contains tweets about six different airlines labeled as positive, negative, or neutral.

  • DATASET-5 (Footnote 5): Covid Tweets: The tweets have been pulled from Twitter and then manually tagged as Extremely Negative, Negative, Neutral, Positive, or Extremely Positive.

Table 2 Details of the used experimental datasets

4.2 Performance evaluation measures

In the experimental evaluation, we used eight different performance measures: accuracy, precision, recall, f1-score, specificity, g-means (Gmean1 and Gmean2), and MCC (Matthews Correlation Coefficient). Various previous works related to text classification tasks have used these measures to evaluate the performance of the prediction models. Therefore, we have chosen these measures due to their broad applicability. A detailed description of these performance measures is given in Appendix B, Table 9.
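As a quick reference, the sketch below computes these measures for the binary case from the four confusion-matrix counts. The definitions of accuracy, precision, recall, specificity, f1-score, and MCC are the standard ones; the two g-mean formulas (geometric mean of precision and recall, and of recall and specificity) are assumed here because Appendix B is not reproduced in this section. The function name and the counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. each class occurs and is
    predicted at least once.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "gmean1": math.sqrt(precision * recall),    # assumed definition
        "gmean2": math.sqrt(recall * specificity),  # assumed definition
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For multi-class datasets, such measures are typically computed per class and then averaged; the averaging scheme is not covered by this sketch.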

To assess whether the performance of the presented TextConvoNet_4 and TextConvoNet_6 differs significantly from that of the other considered machine learning and deep learning techniques, we have performed the Wilcoxon Signed-Rank paired sample test. It is a non-parametric test that does not assume the normality of within-pair differences. It tests the hypothesis of whether the median difference between the tested pair is zero or not. We have used a confidence level of 95% (i.e., a significance level of α = 0.05) for all the tests. The framed Null Hypothesis (H0) and Alternative Hypothesis (Ha) are as follows.

H0: There is no statistically significant difference between the paired groups at α = 0.05.

Ha: There is a statistically significant difference between the paired groups at α = 0.05.

The null hypothesis is rejected when the experimental p-value is less than α, and it can then be concluded that there is a significant difference between the paired groups. Otherwise, the null hypothesis cannot be rejected.

Further, we have performed an effect size analysis using the Pearson effect size r. The effect size shows the magnitude of the performance difference between the groups: the larger the effect size, the stronger the relationship between the two variables. It is defined by (7).

$$ r = \frac{z}{\sqrt{2n}} $$
(7)

Where 2n = number of observations, including the cases where the difference is 0 and z is the z-score value defined by (8).

$$ z = \frac{|U-\mu|-0.5}{\sigma} $$
(8)

According to Cohen [44], the effect size is: Low, if r ≈ 0.1; Medium, if r ≈ 0.3; and Large, if r ≈ 0.5.
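The test and the effect size can be sketched together in pure Python using the normal approximation. Several points are assumptions of this sketch rather than statements from the text: U in (8) is taken to be the sum of positive ranks, μ and σ are its mean and standard deviation under the null hypothesis, zero differences are dropped before ranking but counted in 2n, no tie correction is applied to σ, and at least one difference is non-zero. The function name and the scores are made up.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test via the normal approximation."""
    diffs = [x - y for x, y in zip(a, b)]
    nonzero = [diff for diff in diffs if diff != 0]
    n_r = len(nonzero)

    # Rank |diff| in ascending order, giving tied values their average rank.
    order = sorted(range(n_r), key=lambda idx: abs(nonzero[idx]))
    ranks = [0.0] * n_r
    i = 0
    while i < n_r:
        j = i
        while (j + 1 < n_r
               and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(rank for rank, diff in zip(ranks, nonzero) if diff > 0)
    mu = n_r * (n_r + 1) / 4
    sigma = math.sqrt(n_r * (n_r + 1) * (2 * n_r + 1) / 24)
    z = (abs(w_plus - mu) - 0.5) / sigma               # as in (8)
    p_value = min(1.0, math.erfc(z / math.sqrt(2)))    # two-sided
    effect_r = z / math.sqrt(2 * len(diffs))           # as in (7)
    return {"w_plus": w_plus, "z": z, "p_value": p_value, "r": effect_r}


# Made-up paired scores (in percent) of two models on eight measurements.
model_a = [90, 85, 80, 95, 70, 88, 92, 76]
model_b = [80, 83, 81, 91, 59, 80, 85, 70]
result = wilcoxon_signed_rank(model_a, model_b)
print(result)
print("reject H0" if result["p_value"] < 0.05 else "cannot reject H0")
```

With so few pairs the normal approximation is rough, so an exact test from a statistics library would be preferable in practice.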

4.3 Machine learning and deep learning models used for comparison

For a comprehensive performance evaluation of the proposed TextConvoNet, we have used seven different machine learning techniques namely, Multinomial Naive Bayes [45], Decision Tree (DT) [46], Random Forest (RF) [47], Support Vector Classifier (SVC) [48], Gradient Boosting classifier [49], K-Nearest Neighbour (KNN) [50], and XGBoost [51]. An evaluation of the proposed TextConvoNet using these techniques helps establish the usability of the TextConvoNet and increases the generalization of the results. Since TextConvoNet is a convolutional neural network-based deep learning architecture, we have included some deep learning-based approaches for performance comparison. Specifically, we have implemented Kim’s CNN model [23], Long Short Term Memory (LSTM) [13, 52] and VDCNN [27] based model proposed for text classification (Table 4) and compared our model with these models. We have also compared our model with other recent attention and/or transformer-based deep learning models such as BERT [53], Attention-based BiLSTM [17], Hierarchical attention networks (HAN) [18], and hybrid models such as BerConvoNet [38] and CNN-BiLSTM [54, 55]. The description and implementation details of these techniques are given as follows. All implementation has been carried out using Python libraries.

4.3.1 Kim’s CNN model [23]

A detailed description of Kim’s model has been provided in Section 3.1. The implementation details of this model are as follows. In Kim’s model, sentences are mapped to embedding vectors that are made available as an input matrix to the model. It uses 3 parallel layers of convolution, with 100 filters of kernel sizes 3, 4, and 5, on top of word vectors obtained from an existing pre-trained model. It is followed by a dense layer of 64 neurons and a classification layer.

4.3.2 Long short term memory (LSTM) [13, 52]

Long Short-Term Memory Networks (LSTMs) are a special form of recurrent neural network (RNN) that can handle long-term dependencies. Also, LSTMs have a chain-like RNN structure, but the repeated module has a different structure. Rather than having a single layer of the neural network, there are four, which communicate in a unique way. Some of the works used the LSTM model for different text classification tasks [13, 14, 17, 18]. For the comparison with our model, we used a single LSTM layer with 32 memory cells followed by the classification layer.

4.3.3 Very deep convolutional neural networks (VDCNN) [27]

Unlike TextConvoNet, which is a shallow network, VDCNN uses multiple layers of convolution and max-pooling operations. Inspired by VDCNN, we implemented a version of it based on word embeddings. This model uses four different pooling processes, each of which reduces the resolution by half, resulting in four different feature map tiers: 64, 128, 256, and 512, followed by a max-pooling layer. After the end of the 4 convolution pair operations, the 512 × k resulting features are transformed into a single vector, which is the input to a three-layer fully connected classifier (4096,2048,2048) with ReLU hidden units and softmax outputs. Depending on the classification task, the number of output neurons varies.

4.3.4 Attention+BiLSTM model [17]

It starts with an input layer that tokenizes input sentences into indexed lists, followed by an embedding layer. Bidirectional LSTM cells (100 hidden units) follow, whose outputs are concatenated to get a representation of each token. Attention weights are obtained from a linear projection followed by a non-linear activation. The final sentence representation is a weighted sum of all token representations. The final classification output is derived from a simple dense and softmax layer.
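The attention step can be illustrated with a small pure-Python sketch. It uses one common formulation (a tanh-activated linear projection scored against a context vector, followed by a softmax); whether the compared model uses exactly this form is an assumption, and the function name, weights, and token vectors are made up.

```python
import math


def attention_pool(tokens, projection, context):
    """Return (attention weights, weighted sum of the token vectors).

    tokens     : T token representations, each a list of d floats
    projection : d-by-d weight matrix of the linear projection
    context    : length-d vector used to score each projected token
    """
    scores = []
    for h in tokens:
        projected = [
            math.tanh(sum(w * x for w, x in zip(row, h)))
            for row in projection
        ]
        scores.append(sum(c * p for c, p in zip(context, projected)))

    # Softmax over the token scores (shifted for numerical stability).
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(tokens[0])
    pooled = [
        sum(a * h[k] for a, h in zip(weights, tokens)) for k in range(dim)
    ]
    return weights, pooled


# Three toy token representations with d = 2 (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, sentence_vector = attention_pool(tokens, identity, [1.0, 1.0])
print(weights, sentence_vector)
```

Tokens whose projection aligns better with the context vector receive larger weights and therefore contribute more to the sentence representation.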

4.3.5 BERT model [53]

The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks, and we have used the same strategy to compare TextConvoNet with a BERT-based model. We used pre-trained bert-base-uncased embeddings to encode the text, followed by a dense layer of 712 neurons and a classification layer.

4.3.6 Hierarchical attention networks (HAN) model [18]

In our experiment, we set the word embedding dimension to 300 and the GRU dimension to 50. A combination of forward and backward GRU provides us with 100 dimensions for word/sentence annotation in this scenario. The word/sentence context vectors have a dimension of 100 and are randomly initialized. We use a mini-batch size of 32 for training.

4.3.7 BERT+CNN [38]

Some recent works used BERT as a text embedder and CNN as a classifier [38, 56]. In our experiments, the BERT+CNN model uses kernel sizes of 2, 3, & 4 and the number of kernels (filters) were set to 100. The word vector size was 768. The model is trained on a batch size of 100 with a learning rate of 0.001. Adam optimizer is used with the Binary Cross-Entropy loss function (BCE).

4.3.8 CNN+BiLSTM [54, 55]

Some recent works used a hybrid model based on CNN and BiLSTM for text classification. In our experiments, we use a densely connected BiLSTM layer after the two convolution layers of kernel size 3 and one max pooling layer of size 3. The embedding dimension was set to 300 and the learning rate to 0.001.

4.3.9 Graph neural network based models [57, 58]

In [57], a Graph Neural Network (GNN) based model, TLGNN, is proposed. It generates a text-level graph for each input text. It builds graphs by assuming smaller windows in the text for extracting more local features. The details of the methodology are given in [57]. For comparison purposes, except for the max length parameter, we use the exact implementation details and hyperparameter settings provided in the paper. The max length parameter was set to 100. In another work, the Sequential GNN (SeqGNN) model is proposed for short text classification [58]. It builds individual graphs for each document based on word co-occurrence and uses a bidirectional long short-term memory network (Bi-LSTM) to extract the sequential features. After that, it uses a Graph Convolutional Network (GCN) for learning word representations. For comparison purposes, we used the default settings of SeqGNN given in the original paper.

4.4 Implementation details

All the experiments to examine the performance of TextConvoNet_4 and TextConvoNet_6 are carried out on a system with a dual-core Intel Core i5 64-bit processor and 8 GB RAM, running the Macintosh operating system, with access to an NVidia K80 GPU kernel. All experiments were performed in Python 3.0. The models are trained with a mini-batch size of 32 using Adam as the optimizer. The learning rate is chosen to be 0.1, and the models are trained over 10 epochs with early stopping to avoid overfitting. All these hyperparameters are chosen using a hyperparameter optimization technique called GridSearchCV. We use GloVe (Footnote 6), a pre-trained word embedding model, for generating word vectors from sentences.

5 Results and analysis

This section presents the results of TextConvoNet on five datasets (Section 4.1) for different performance measures. Additionally, the results of comparing TextConvoNet with other machine learning and deep learning models are reported in this section.

First, the performance comparison of TextConvoNet with other baseline models is presented. A statistical test is conducted to assess whether TextConvoNet performed significantly differently from the baseline models. Additionally, the performance of the presented model is analyzed by varying the number of sentences in a paragraph. After that, the experimental results with respect to dataset size are discussed, i.e., how the presented TextConvoNet performs with minimal data (few-shot learning) and in challenging scenarios.

5.1 Results of TextConvoNet architecture

Table 3 shows the results of the presented TextConvoNet architecture (TextConvoNet_6 and TextConvoNet_4) compared to different machine learning and deep learning-based models. The results are reported in terms of the performance measures used: accuracy, precision, recall, f1-score, specificity, Gmean1, Gmean2, and MCC. Table 3 reports results for the binary classification datasets (dataset 1 and dataset 2) and the multi-class classification datasets (dataset 3, dataset 4, and dataset 5). The following inferences can be drawn from the results.

Table 3 Classification results of the proposed TextConvoNet and other baseline models for different performance measures (MNB: Multinomial Naive Bayes, GBC: Gradient Boosting Classifier)

The presented TextConvoNet achieved high values for the considered performance measures across the different datasets. For accuracy, precision, and recall, the average values of TextConvoNet_6 are 0.889, 0.829, and 0.807, respectively. These values are higher than those of the other machine learning and deep learning-based models. TextConvoNet also produced strong results for the f1-score measure: the highest f1-score value is 0.969, with an average value of 0.819. Similarly, the highest specificity value is 0.996, and the average value is 0.921. For the remaining measures, Gmean1, Gmean2, and MCC, TextConvoNet produced average values of 0.816, 0.86, and 0.73, respectively. On average, the maximum improvement in accuracy achieved by TextConvoNet over the other models is 16%. Similarly, for precision, recall, and f1-score, the maximum average improvement achieved by TextConvoNet is 24%, 21%, and 23.9%, respectively. For the other measures, the improvement achieved by TextConvoNet is, on average, 24% in terms of specificity, 22% in terms of Gmean1, 27% in terms of Gmean2, and 44% in terms of MCC. Overall, on the multi-class datasets (3, 4, and 5), variants of TextConvoNet perform better than all the other models in terms of all the performance measures.
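For readers who want to reproduce measures of this kind, the following sketch computes them from the four cells of a binary confusion matrix. The definitions of Gmean1 (geometric mean of precision and recall) and Gmean2 (geometric mean of recall and specificity) are assumptions based on common usage, since they are not restated in this section, and the counts in the example are hypothetical rather than taken from Table 3.

```python
import math


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute the evaluation measures from binary confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": specificity,
        "gmean1": math.sqrt(precision * recall),    # assumed definition
        "gmean2": math.sqrt(recall * specificity),  # assumed definition
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, used only to exercise the function.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```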

5.2 Comparison of the TextConvoNet with recent attention-based models, BERT model and graph based models

This section compares the performance of the presented TextConvoNet with two attention-based models, BiLSTM followed by attention (Attention+BiLSTM) and the Hierarchical Attention Network (HAN); one transformer-based model, BERT; and two CNN-based hybrid text classification models, BERT-CNN (BerConvoNet) and CNN-BiLSTM. Additionally, it is compared with two graph-based models: Text Level Graph Neural Network (TLGNN) and Sequential GNN (SeqGNN). Section 4.3 provides details of these models. Table 4 shows the results of TextConvoNet and the other considered models (Attention+BiLSTM, BERT, HAN, BerConvoNet, CNN-BiLSTM, TLGNN, and SeqGNN) on different performance measures. From the table, it can be observed that the BERT and HAN models produced relatively poor results on all the datasets compared to the other models. On datasets 1 and 3, TextConvoNet_4 produced the best results. On datasets 2, 4, and 5, TextConvoNet_6 yielded better performance than the others. Overall, TextConvoNet performed better than the other models. The performance of BerConvoNet is comparable to that of the presented TextConvoNet models. The presented models performed better than the graph-based models on all the datasets except Dataset-1, where SeqGNN produced better recall and f1-score values than the presented models.

Furthermore, Table 3 shows that, for the non-attention-based competing models, the values of the different performance measures generally differ from one another. However, for these models, the precision, recall, and F1-score values are identical on datasets 3, 4, and 5, which are the multiclass datasets. This is because we adopted micro-averaging to calculate the metrics for multiclass classification. Micro-averaging is well suited here because it does not discriminate between classes based on their population. In a single-label multiclass setting with all labels included, every misclassified instance counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the total numbers of false positives and false negatives are equal. Hence, micro-averaging produces equal precision and recall (and thus F1-score) for datasets 3, 4, and 5.
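The equality of micro-averaged precision, recall, and F1-score can be checked with a small sketch. The function `micro_scores` and the label sequences are hypothetical and serve only to illustrate the counting argument; they are not the evaluation code or data used in the experiments.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for single-label multiclass data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    # Pool true positives, false positives, and false negatives over all classes.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical three-class labels and predictions.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 1, 2, 0]
precision, recall, f1 = micro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Because each wrong prediction adds exactly one false positive and one false negative to the pooled counts, all three scores coincide with plain accuracy in this setting.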

Table 4 Result comparison of TextConvoNet with attention and/or transformer-based deep learning models

Overall, the presented TextConvoNet outperformed all the other attention and non-attention models on the multiclass datasets (Datasets 3, 4, and 5) for all performance measures. For the binary-class datasets, TextConvoNet produced relatively lower performance than other models in only 2-3 cases. The results of the presented TextConvoNet model are comparable to or better than those of the other attention models and the BERT model.

5.3 Effect of different parameter values on the performance of the TextConvoNet

The presented TextConvoNet has been evaluated over the variety of parameters given in Table 5 to analyze their effect on its performance for various measures. Between any two versions, there is a change in the kernel size. Within a version, between any two sub-versions, there are changes in the number of filters, the dropout rate before the classification layer, the optimizer, and the number of units in the fully connected layer. Generally, the appropriate performance measures and configuration depend on the needs, the application, and the model type. In practice, each NLP application is unique and may therefore need its own approach/model. Hence, this analysis has been performed on all five datasets for the accuracy, precision, and recall measures. Table 6 shows the results of this analysis. The following observations have been drawn from the results of the various versions of TextConvoNet.

  • It has been found that Adam is the best optimizer for the presented model. All other used optimizers took a large amount of time to train or did not give good results, as in the case of RMSProp (V1.2, V2.2, V3.2, V4.2).

  • A dropout rate of 0.4 was found to be optimal; with a value of 0.5, the model was slightly overfitting.

  • In datasets with longer texts (such as Dataset-2 and Dataset-5), versions 3 and 4 give slightly better results than versions 1 and 2. The larger kernel sizes in versions 3 and 4 might be a possible reason for this observation. In contrast, datasets with comparatively smaller paragraphs (Datasets 1, 3, and 4) do well with versions 1 and 2.

Optimum value of number of sentences in a paragraph (m)

To evaluate our models’ run-time performance, we tracked the training time while varying the value of m. The rationale is that increasing the value of m from \(\frac {m}{4}\) to m leads to an increase in the computation time. The results for the run-time performance of TextConvoNet are shown in Fig. 3. The value of m has been varied from \(\frac {m}{4}\) to m, and the performance metrics for each of these m-values were calculated. The results are given in Table 7. The observations obtained from the results are as follows.

  • TextConvoNet_6 improves marginally for all the performance measures when the value is varied from \(\frac {m}{4}\) to m on datasets in which the maximum number of sentences in a paragraph is considerably small, as shown for Dataset-4.

  • For datasets in which the maximum number of sentences in a paragraph is considerably larger, TextConvoNet_6 performs well for the lower values of m \(\left (\frac {m}{4},\frac {m}{2},\frac {3m}{4}\right )\).
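As an illustration of how the number of sentences per paragraph might be varied, the sketch below derives the four settings from a maximum value m and fits a paragraph to each of them. Keeping the leading sentences and padding with empty sentences are assumptions, as are the helper names `m_settings` and `fit_paragraph`; the exact truncation and padding scheme is not described in this section.

```python
def m_settings(m):
    """Return the sentence counts m/4, m/2, 3m/4, and m (integer division, at least 1)."""
    return [max(1, (m * k) // 4) for k in (1, 2, 3, 4)]


def fit_paragraph(sentences, num_sentences):
    """Keep the first num_sentences sentences, padding with empty ones if too short."""
    kept = [list(sentence) for sentence in sentences[:num_sentences]]
    kept.extend([] for _ in range(num_sentences - len(kept)))
    return kept


# Hypothetical paragraph of three tokenized sentences and a hypothetical m of 8.
paragraph = [
    ["the", "food", "was", "great"],
    ["service", "was", "slow"],
    ["would", "return"],
]
for size in m_settings(8):
    fitted = fit_paragraph(paragraph, size)
    print(size, [len(sentence) for sentence in fitted])
```

Smaller settings shrink the first dimension of the paragraph matrix, which is why the training time in Fig. 3 is expected to grow with m.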

Fig. 3 Training time with different values of m

Table 5 Different versions of TextConvoNet
Table 6 Results of accuracy, precision, and recall evaluation measures on different datasets
Table 7 Optimum value of m

A possible reason for the above observations could be that TextConvoNet_6 finds it difficult to extract features from smaller paragraphs because they contain less data, and hence it requires a higher number of sentences to work on, as seen in Dataset-4. On the other hand, for datasets with larger paragraphs, such as Dataset-2, TextConvoNet_6 can drop a portion of the sentences because ample textual data is already present. Therefore, the training time reduces for TextConvoNet_6.

5.4 Performance evaluation of TextConvoNet for few-shot learning

In scenarios with minimal data, it is important that the model can train well on a small dataset (i.e., a dataset with few training instances) and still perform reasonably well on the test set [59]. Few-shot learning for text classification is a scenario in which only a small amount of labeled data is available for each category. The goal of the prediction model is to generalize to new unseen examples in the same categories quickly and effectively. In the experiment, the size of the test dataset remains constant, as mentioned in Table 2, for all the training percentages, and the test error rate is plotted against those training percentages. We evaluate TextConvoNet_6, TextConvoNet_4, Kim’s CNN model, LSTM, and VDCNN on one binary dataset and one multi-class dataset with varying proportions of training examples. The results are shown in Figs. 4 and 5.
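A setup of this kind, in which only the training proportion changes while the test set stays fixed, can be sketched as follows. Sampling each class separately (stratified sampling), the function name `stratified_subsample`, and the toy examples are assumptions for illustration; the section does not say how the training subsets were drawn.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, fraction, seed=0):
    """Keep roughly `fraction` of the (text, label) pairs of each class, at least one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    subset = []
    for label in sorted(by_label):
        items = list(by_label[label])
        rng.shuffle(items)
        keep = max(1, round(len(items) * fraction))
        subset.extend(items[:keep])
    return subset


# Hypothetical training pool: 10 examples for each of two classes.
train_pool = [(f"text {i}", i % 2) for i in range(20)]
subset_sizes = {
    fraction: len(stratified_subsample(train_pool, fraction))
    for fraction in (0.1, 0.5, 1.0)
}
print(subset_sizes)
```

Each subset would then be used to train a model from scratch, with the error rate always measured on the same untouched test set.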

Fig. 4 Test error rates on Dataset-1 (Binary-class)

Fig. 5 Test error rates on Dataset-4 (Multi-class)

It is observed that the TextConvoNet model performs better than all the other baseline models, achieving lower test error rates even at lower proportions of training examples. Furthermore, TextConvoNet achieved these lower test error rates without any change in its parameter space. TextConvoNet extracts not only the n-gram based features between the words of the same sentence, as a 1-D CNN does, but also the inter-sentence n-gram based features. As a result, TextConvoNet can extract additional features that 1-D CNN models cannot. This strengthens our claim that the proposed TextConvoNet performs reasonably well even with fewer training examples.
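The difference between intra-sentence and inter-sentence features can be made concrete with a small 2-D convolution sketch. It assumes the paragraph is stored as a grid of sentences by words by embedding dimensions and that kernels slide over the first two axes; the kernel shapes, the all-ones weights, and the toy values are hypothetical and do not reproduce the actual TextConvoNet layers.

```python
def conv2d_valid(paragraph, kernel):
    """Slide a (ks x kw x d) kernel over an (m x n x d) paragraph grid without padding.

    Returns an (m - ks + 1) x (n - kw + 1) feature map as nested lists.
    """
    m, n, d = len(paragraph), len(paragraph[0]), len(paragraph[0][0])
    ks, kw = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(m - ks + 1):
        row = []
        for j in range(n - kw + 1):
            total = 0.0
            for a in range(ks):          # sentences covered by the kernel
                for b in range(kw):      # words covered by the kernel
                    for c in range(d):   # embedding dimensions
                        total += paragraph[i + a][j + b][c] * kernel[a][b][c]
            row.append(total)
        feature_map.append(row)
    return feature_map


def ones_kernel(ks, kw, d):
    """Build an all-ones kernel of the given shape (placeholder weights)."""
    return [[[1.0] * d for _ in range(kw)] for _ in range(ks)]


# Toy paragraph: 3 sentences, 4 words each, 2-dimensional "embeddings".
paragraph = [[[float(i), float(j)] for j in range(4)] for i in range(3)]

# A kernel of height 1 sees two adjacent words of one sentence (intra-sentence bigram).
intra = conv2d_valid(paragraph, ones_kernel(1, 2, 2))
# A kernel of height 2 also spans two adjacent sentences (inter-sentence feature).
inter = conv2d_valid(paragraph, ones_kernel(2, 2, 2))
print(len(intra), len(intra[0]), len(inter), len(inter[0]))
```

A 1-D convolution over a single sentence corresponds to the height-1 case only, whereas any kernel taller than one row mixes information from neighbouring sentences.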

5.5 Statistical test results

Table 8 reports the Wilcoxon signed-rank test results in terms of p-values and effect size (r) for TextConvoNet_6 and TextConvoNet_4 compared with the other techniques. If the p-value is less than 0.05, the null hypothesis is rejected; the performance difference is then statistically significant and is marked with an asterisk (*). The effect size r shows the magnitude of the performance difference between the comparison groups. From Table 8, it is observed that there is a statistically significant difference between the presented TextConvoNet_6 and all other considered techniques. The experimental p-values are less than the significance level of 0.05 in all groups. Further, the effect r values are higher than 0.45 in all the groups, showing a large magnitude of performance difference between TextConvoNet_6 and the other techniques. Similarly, Table 8 shows that the performance difference between the presented TextConvoNet_4 and the other techniques is also statistically significant at the given significance level in all cases, as the p-values are below 0.05. Further, the effect r values are higher than 0.40 for all the groups, showing a large magnitude of performance difference.
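The test procedure can be sketched in a few lines. The version below uses the normal approximation without a continuity correction and computes the effect size as r = |Z| / sqrt(N), with N the number of non-zero paired differences; both choices are assumptions, since the exact variant behind Table 8 is not specified here, and the paired scores are hypothetical. In practice, a library routine such as `scipy.stats.wilcoxon` would normally be preferred, especially for small samples.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (w_plus, w_minus, z, p_value, effect_r). Zero differences are
    dropped and tied absolute differences receive their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    effect_r = abs(z) / math.sqrt(n)            # assumed effect-size definition
    return w_plus, w_minus, z, p_value, effect_r


# Hypothetical paired scores for two models, not values from Table 3 or Table 8.
model_a = [0.91, 0.88, 0.85, 0.93, 0.80, 0.87, 0.90, 0.84]
model_b = [0.86, 0.87, 0.81, 0.85, 0.78, 0.80, 0.79, 0.87]
w_plus, w_minus, z, p_value, effect_r = wilcoxon_signed_rank(model_a, model_b)
print(w_plus, w_minus, round(z, 3), round(p_value, 4), round(effect_r, 3))
```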

Table 8 Results of Wilcoxon signed-rank paired sampled test between the presented TextConvoNet_6, TextConvoNet_4, and other techniques/models (* showing the groups with the statistically significant difference)

6 Discussion

This work has presented TextConvoNet, a CNN-based architecture for text classification. The essence of a text classification model is to extract the key phrases from the text in order to assign it an appropriate label. Most existing models utilize one-dimensional (1-D) convolution followed by a one-dimensional pooling operation to extract a feature vector from the input word embeddings. The existing 1-D convolution only extracts n-gram-based features from two or more words in a sentence; these models do not extract the n-gram features between different sentences. The presented TextConvoNet architecture uses a 2-dimensional convolutional filter and extracts both the intra-sentence and the inter-sentence n-gram features of the text data. This results in a richer feature set for text classification. A comprehensive evaluation of the presented TextConvoNet using five different datasets and eight performance measures showed that TextConvoNet produced state-of-the-art performance for text classification. We summarize the main findings of the presented work as follows.

  • We found that the presented TextConvoNet architecture produced a maximum average improvement in the performance of around 20% or more for different performance measures compared to the used machine learning models and deep learning-based models.

  • When compared with the existing attention-based models and the BERT model, we found that the presented TextConvoNet outperformed all the other attention and non-attention models on the multiclass datasets (Datasets 3, 4, and 5). For the binary-class datasets, the performance of TextConvoNet is approximately equal to or better than that of the other models.

  • When the presented TextConvoNet was evaluated in the few-shot learning scenario, we found that it produced lower test error rates than all the other baseline models with a lower proportion of training examples.

7 Conclusion and future work

This paper presented a convolutional neural network-based deep learning architecture for text classification. The important feature of the presented TextConvoNet is that it extracts not only the intra-sentence n-gram features from the text data but also the inter-sentence n-gram features. We provided an alternate input representation for text data (the paragraph matrix) and applied a 2-D CNN model to it. An extensive performance evaluation of the presented TextConvoNet architecture on five different text datasets has been carried out. The results showed that the presented TextConvoNet yielded better performance than the baseline machine learning models and state-of-the-art deep-learning-based models. The improved performance of TextConvoNet has been recorded for both binary-class and multi-class classification problems. The analysis showed that extracting the inter-sentence features along with the intra-sentence features improves the performance of the CNN model on the text classification task. In future work, we will explore the idea of representing the input in higher dimensions so that convolution operations can capture various features from the textual data.