
A Novel BGCapsule Network for Text Classification


Several text classification tasks such as sentiment analysis, news categorization, multi-label classification and opinion classification are challenging problems even for modern deep learning networks. Capsule Networks (CapsNets) were recently proposed for image classification. It has been shown that CapsNets have several advantages over Convolutional Neural Networks (CNNs), while their validity in the domain of text has been less explored. In this paper, we propose a novel hybrid architecture viz., BGCapsule, which is a Capsule model preceded by an ensemble of Bidirectional Gated Recurrent Units (BiGRU), for several text classification tasks. The BiGRU ensemble serves as the feature extraction layer preceding the primary capsule layer. After basic pre-processing, the hybrid architecture consists of six layers: an embedding layer based on GloVe, a BiGRU-based ensemble layer, a primary capsule layer, a flatten layer, a fully connected ReLU layer and a fully connected softmax layer. To evaluate the effectiveness of BGCapsule, we conducted extensive experiments on five benchmark datasets (ranging from 10,000 records to 700,000 records): the Movie Review (MR Imdb 2005) dataset, the AG’s News dataset, the Dbpedia ontology dataset, the Yelp Review Full dataset and the Yelp Review Polarity dataset. These benchmarks cover several text classification tasks such as news categorization, sentiment analysis, multiclass classification, multi-label classification and opinion classification. We found that our proposed architecture (BGCapsule) achieves better accuracy than the existing methods without the help of any external linguistic knowledge such as positive sentiment keywords and negative sentiment keywords. Further, BGCapsule converged faster than other extant techniques.


Text classification is one of the most basic and important applications of machine learning. Traditionally, the use of term frequency inverse document frequency (tf-idf) in forming the document-term matrix as a representation of documents followed by invoking general classifiers such as naïve bayes, support vector machines (SVM) or logistic regression has been the de facto standard for text classification. Recently, many novel and powerful neural embedding approaches have been proposed in the literature which made it possible to find distributed representations of words and documents in an efficient manner [1] leading to higher accuracies in text classification. The major deep learning models employed in text classification are largely based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

Despite great success, these deep neural networks have some inadequacies. In the case of text data, these deep learning models depend heavily on the quality of the instance representation of the text. Here, an instance could be a sentence, a document, or a paragraph. CNN- and RNN-based text classification requires a huge amount of training data to obtain reasonable accuracies, and these models do not perform well on small datasets.

Meanwhile, in the image classification domain, capsule networks proposed by Sabour et al. [2] proved to be effective at understanding spatial relationships in high levels of data by employing a whole vector of instantiation parameters. We applied a modified and extended version of this network structure to the classification of text, and argue that it has advantages in this field.


Text classification using deep learning is based on the concept of feature extraction from text data. Text feature extraction can be accomplished using a CNN or an RNN. A CNN performs n-gram-based feature extraction and an RNN employs window-based feature extraction. If the length of a sentence is less than 10, then CNN and Gated Recurrent Unit (GRU) yield comparable performance. While a CNN captures local features, it does not capture sentiments from long sentences. Another problem with these deep learning models is that they require large amounts of training data to produce good results [3].

Further, CNN- and RNN-based methods use max pooling for feature routing, which is called static routing. It tends to lose important context and information and is, hence, not very useful for text datasets.

Therefore, we propose a new hybrid architecture, named BGCapsule, to overcome these issues. BGCapsule, a hybrid of a bidirectional GRU ensemble and a capsule network in tandem, performs better than a CNN on small amounts of data. Sabour et al. [2] proposed a dynamic routing algorithm to pass features from a lower layer to a higher layer. Dynamic routing algorithms, unlike static routing such as max pooling, yield better performance as they achieve equivariance.


In this work, we try to build a faster and more robust text classification model based on the capsule network. The main contributions are as follows:

  • To the best of our knowledge, this is the first attempt to use a hybrid of a BiGRU ensemble and a capsule network for multi-label and multiclass text classification. While the BiGRU ensemble takes care of feature extraction, the capsule net takes care of classification.

  • We demonstrate that the BGCapsule achieves state-of-the-art performance without any external knowledge of the dataset.

  • We compared the performance of BGCapsule with that of the extant methods on five benchmark datasets of different sizes and different tasks such as binary classification and multiclass classification. Results show that the capsule model is competitive and robust. Our model performed well on small as well as on big datasets compared to other state-of-the-art methods.

  • We also performed an ablation study by changing the composition of the hybrid, where we compared the performance of BGCapsule with that of BiGRU + Max Pooling and CNN + Capsule Network.

The rest of the paper is organized as follows: “Related Work” presents the related work; “Proposed Model: BGCapsule Network” presents in detail our proposed model; “Description of the Datasets and Performance Metric” describes the benchmark datasets on which we tested our model; “Results and Discussion” presents the discussion of the results and finally, “Conclusions and Future Work” concludes the paper and presents future directions.

Related Work

Text classification tasks have also been impacted by the recent deep learning revolution. Text classification using traditional machine learning mainly focuses on feature engineering. For achieving better performance, text classification heavily depends on the choice of feature representation of text because many representations reflect the semantic meaning of neighboring words. Hence, it is better to capture the context as well.

Feature representation is a critical pre-processing phase of text classification. The most prominent feature representation is the document-term matrix, which, in turn, has a few variations like term count, term frequency, and tf-idf scores [4]. Despite that, it is an open research challenge to obtain better feature representation for text corpora [5]. Neural network-based methods met with great success in Natural Language Processing (NLP) tasks by offering simple and effective approaches to learn distributed representations of words and phrases [1]. Many deep learning models have been applied to text classification, including Recursive Autoencoder [6, 7], Recursive Neural Tensor Network [7], Recurrent Neural Network [8], LSTM [9], and GRU [10].

In all these works, deep learning networks are used as feature extractors. For instance, a CNN is used for n-gram extraction, where the n-grams (features) are fed to a multilayer perceptron to classify the text corpora. On the other hand, sequential networks like LSTM, GRU, BiLSTM, etc. are also used for feature extraction. The advantage of these sequential networks is that the features extracted using them capture some context information. These networks are window-based feature extractors.

Kim [11] proposed the first CNN-based text classification model. He showed how a CNN is able to detect n-gram features for text classification-based sentiment analysis and introduced three variants of CNN viz., random CNN, static CNN and non-static CNN. Zhang et al. [12] developed a character-based embedding method for text classification. Character-based embedding helps in out-of-vocabulary (OOV) cases, i.e., cases where the pre-trained embedding does not cover particular words. Conneau et al. [13] proposed the Very Deep Convolutional Neural Network (VDCNN) using skip connections. They used residual blocks to take advantage of very deep models. Miyato et al. [9] proposed a semi-supervised learning-based classifier, in which they used LSTM in an adversarial manner. Kim et al. [14] proposed a capsule net for text classification using two types of routing algorithms. Wang et al. [15] proposed a hybrid capsule net for text classification, but they used only small datasets to check the performance of the network. Zhao et al. [16] proposed a capsule network with dynamic routing for text classification. They reported that the capsule network yields a significant improvement in accuracy when transferring from single-label to multi-label text classification. Single-label classification corresponds to assigning a text document to one class out of N classes, and multi-label classification corresponds to assigning a given sample to more than one class. Zhang et al. [17] proposed an attention-based capsule network for relation extraction. They mentioned that the capsule network converts the multi-label classification problem into multiple binary classification problems. Miyato et al. [18] proposed adversarial training methods for semi-supervised text classification. They mentioned that adversarial and virtual adversarial training have good regularization performance in sequence models on text classification tasks. Feng et al. [19] proposed image classification using a capsule network guided by external textual knowledge. Their proposed model, which uses additional text data, performed better than the existing capsule network.

Quite coincidentally, Du et al. [20] proposed a similar capsule-based hybrid neural network for short text classification. They used a capsule network with attention and CNN, RNN architectures. They performed experiments on two datasets including the Movie Review (MR) dataset [11], which we also analyzed. Our proposed architecture is different from theirs on several counts: (1) In terms of feature extraction, we used an ensemble of two BiGRUs of different unit sizes, so that they capture different text features, and concatenated their outputs for use by the next layer, whereas they used only one GRU layer followed by attention. (2) While we validated our model’s performance on five datasets including the Movie Review (MR) dataset, they tested on only two datasets. (3) Our proposed model achieved better accuracy on the MR dataset than that of Du et al. [20] despite not using an attention layer. (4) The largest dataset tested by us has 650,000 samples, whereas the largest dataset tested by them has 1000 samples. (5) We performed tenfold cross validation for the MR dataset, while they used the hold-out method of testing.

Our proposed BGCapsule model has advantages over CNN-based text classification models [11,12,13]: a CNN uses the same text representation strategies, but its feature routing is static. It is handled using pooling operations, which can lose the spatial relationship information. The capsule net, in contrast, handles feature routing using a dynamic routing algorithm and learns the spatial relationship.

Compared to other capsule net-based text classification models [15, 16], we proposed a BiGRU ensemble layer as the feature extractor. It has an extra edge when the entire sequence or long-range semantic dependencies matter. An ensemble of two BiGRUs can extract different features and turns out to be a better feature extractor. Table 1 summarizes the related work reported in this area.

Table 1 Summary of the related work

Proposed Model: BGCapsule Network

The architecture of the proposed BGCapsule network, which is a Capsule model preceded by an ensemble of Bidirectional Gated Recurrent Units (BiGRU), is depicted in Fig. 1. It is a variant of the capsule network proposed by Sabour et al. [2]. It consists of six layers: an embedding layer, a BiGRU-based ensemble layer, a capsule network (which has deep learning layers too), a flatten layer, a fully connected ReLU layer and a fully connected softmax layer. We elaborate on the key components as follows:

Fig. 1

Proposed architecture of BGCapsule network for text

We performed lowercasing, tokenization and padding on the text documents as part of the pre-processing tasks. In the lowercasing phase, we converted every sentence to lowercase. Then, we tokenized the sentences and assigned a particular integer index to each token. We then selected a maximum length limit for each sentence and performed pre-padding with zeros up to that maximum length. A padded zero indicates that there is no word; it serves only to make every document the same length. After that, we fed the padded sentence to the embedding layer. The embedding layer then converts each token of a document to an n-dimensional vector and each padded zero to an n-dimensional zero vector.
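The pre-processing steps above can be sketched in a few lines of plain Python. This is only an illustration: whitespace tokenization, dropping unknown tokens, and truncating from the end of long documents are assumptions made here, and the helper names `build_vocab` and `encode` are hypothetical.

```python
def build_vocab(docs):
    """Assign an integer index to each distinct lowercased token; 0 is reserved for padding."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():  # assumed tokenizer: split on whitespace
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # indices start at 1
    return vocab


def encode(doc, vocab, max_len):
    """Lowercase, tokenize, map tokens to indices, truncate, then pre-pad with zeros."""
    ids = [vocab[t] for t in doc.lower().split() if t in vocab]  # unknown tokens are dropped
    ids = ids[:max_len]                      # assumed: keep the first max_len tokens
    return [0] * (max_len - len(ids)) + ids  # zeros are placed in front (pre-padding)


docs = ["A great movie", "Not good"]
vocab = build_vocab(docs)
encoded = [encode(d, vocab, max_len=5) for d in docs]
```

Every encoded document has the same length, so a batch can be stacked into a rectangular array before it reaches the embedding layer.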

Embedding Layer

Word embeddings are obtained through a distributed context vector model and dimensionality reduction. The distributed context vector model captures the contexts in which words occur in the corpus. These vectors have the dimensionality of the vocabulary size of the corpus [21]. They are obtained by training over the corpus and calculating the co-occurrences of a word. Mikolov et al. [1] proposed Word2Vec in two variants: (1) continuous bag of words and (2) skip-gram model. Both models capture the co-occurrences within one window at a time. More recently, Global Vectors for word representation (GloVe) [22], an open-source project at Stanford, captures the overall corpus statistics of how often words co-occur.

In this paper, for each sentence, we use pre-trained GloVe word embedding [23]. It is an unsupervised technique for obtaining vector representation of words. Training is conducted on aggregated global word–word co-occurrence statistics from a corpus, and the representations, thus, obtained portray interesting linear substructures of the word vector space. We used GloVe model because it has the benefits of the Word2Vec's skip-gram model [24] for word analogy tasks, and matrix factorization methods which exploit global statistical information. Finally, each sentence is collapsed into a matrix M of size p × v.

$${x}_{i} = \left\{{w}_{1}, {w}_{2},\dots {w}_{p}\right\}\in {M}^{ p\times v},$$

where \({w}_{1}, {w}_{2},\dots , {w}_{p}\) are the words of a sentence padded up to a user-defined length p, and v is the length of the word vector representation. We considered N such matrices, where N is the batch size representing the number of text documents. Thus, our data become a three-dimensional array of size \(N\times p\times v\).
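As a small illustration of how a batch of index sequences becomes a three-dimensional array, the sketch below performs the embedding lookup with documents on the first axis, token positions on the second, and vector components on the third. The lookup table holds toy values rather than real GloVe vectors, and mapping unknown indices to the zero vector is an assumption made here; `embed` is a hypothetical helper.

```python
def embed(batch_ids, table, dim):
    """Replace each index by its vector; index 0 (padding) and unknown indices become zeros."""
    zero = [0.0] * dim
    return [[list(table.get(i, zero)) for i in ids] for ids in batch_ids]


table = {1: [0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6]}  # toy 3-dimensional vectors, not GloVe
batch = [[0, 1, 2], [0, 0, 2]]                    # 2 documents, 3 token positions each
x = embed(batch, table, dim=3)
shape = (len(x), len(x[0]), len(x[0][0]))         # (documents, positions, vector length)
```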

BiDirectional Gated Recurrent Unit Layer (BiGRU)

GRU [10] is a type of Recurrent Neural Network (RNN), a class of neural networks that can handle temporal information with temporal inputs and outputs [25]. A conventional neural network has connections between units in different layers, whereas an RNN also has connections between hidden units [25], forming a directed cycle within the same layer. Due to the recurrent connections, it is able to transmit temporal (sequential) information. Therefore, an RNN outperforms other networks at extracting temporal features. A Bidirectional Recurrent Neural Network (BiRNN) [26] connects two hidden layers in opposite directions to the same output. The output layer can get information from the past (backwards) and future (forward) states simultaneously.

BiRNN increases the amount of input information available to the network. BiGRU [27] is the most advanced RNN and is less complex compared to BiLSTM. BiGRU works as a better window-based feature extractor. LSTM architecture is very effective, but complex. Due to the complexity, LSTM is hard to analyze and it is slow as well. GRU was recently introduced by Cho et al. [10] as an alternative to LSTM. It was subsequently shown by Chung et al. [28] that it performed comparably to the LSTM on several (non-textual) datasets. We make use of ensemble of two BiGRUs to deeply learn the semantic meaning of the sentences.

A conventional RNN can capture temporal information in only one direction; a BiRNN, however, can capture temporal information in both the forward and backward directions. In the BiRNN diagram depicted in Fig. 2, \({x}_{t-1}, {x}_{t}, {x}_{t+1}\) are temporal inputs and \({y}_{t-1}, {y}_{t}, {y}_{t+1}\) are temporal outputs. \({\overrightarrow{h}}_{t-1}, {\overrightarrow{h}}_{t}, {\overrightarrow{h}}_{t+1}\) are the states of the forward sequence and \({\overleftarrow{h}}_{t-1}, {\overleftarrow{h}}_{t}, {\overleftarrow{h}}_{t+1}\) are the states of the backward sequence.

Fig. 2

Bidirectional RNN architecture
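To make the two directions concrete, the following sketch runs a one-unit tanh RNN over a sequence from left to right and from right to left, then pairs the two states at each position. It is a toy stand-in rather than the BiGRU used in this work: the scalar weights and the helper names `rnn_pass` and `birnn` are assumptions chosen for illustration.

```python
import math


def rnn_pass(xs, w_x, w_h):
    """Run a one-unit tanh RNN over xs and return the hidden state at every step."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def birnn(xs, fwd=(1.0, 0.5), bwd=(1.0, 0.5)):
    """Pair forward and backward states so each position sees both past and future."""
    forward = rnn_pass(xs, *fwd)
    backward = rnn_pass(xs[::-1], *bwd)[::-1]  # reverse again to re-align with the input
    return list(zip(forward, backward))


out = birnn([1.0, 0.0, 0.0])
```

With a single non-zero input at the first position, the forward state at the last position is still non-zero because it carries the past, whereas the backward state there is zero because nothing follows it.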

We employed an ensemble BiGRU layer for feature extraction. It extracts features better than a CNN, which detects only n-grams, whereas a BiGRU also detects context and patterns in a sequential manner.

GRU has two gates: a reset gate and an update gate. These are useful in handling long-term dependencies. Figure 3 depicts the GRU cell architecture. Here, \({h}_{t}\) and \({h}_{t-1}\) are the outputs of the current state and the previous state, \({x}_{t}\) is the input to the current state, and \([{h}_{t-1}, {x}_{t}]\) is the concatenation of \({h}_{t-1}\) and \({x}_{t}\). The update gate \({z}_{t}\) and the reset gate \({r}_{t}\) are obtained through the products of \({W}_{z}\) with \([{h}_{t-1}, {x}_{t}]\) and of \({W}_{r}\) with \([{h}_{t-1}, {x}_{t}]\), respectively, for time stamp t. σ and tanh are the sigmoid and tanh layers, respectively. Using \({r}_{t}\) we calculate the candidate state \({\tilde{h}}_{t}\), and using \({z}_{t}\) we combine it with \({h}_{t-1}\) to obtain the output of the cell state \({h}_{t}\) for time stamp t.

Fig. 3

GRU cell architecture

$${h}_{t} = \left(1- {z}_{t}\right)\times {h}_{t-1} + {z}_{t} \times {\tilde{h}}_{t},$$
$${\tilde{h}}_{t} = \mathrm{tanh} \left(W\cdot \left[{r}_{t} \times {h}_{t-1}, {x}_{t}\right]\right),$$
$${r}_{t} = \sigma \left({W}_{r}\cdot \left[{h}_{t-1}, {x}_{t}\right]\right),$$
$${z}_{t} = \sigma \left({W}_{z}\cdot \left[{h}_{t-1}, {x}_{t}\right]\right).$$

The main property of the GRU cell state, the horizontal line running through the top of Fig. 3, is that information can be removed from or added to it based on the update and reset gates. With the help of these two gates, we can control the information passing from the previous state to the next state.

In the update gate, we get a vector whose values lie between 0 and 1, because a sigmoid activation squashes values into this range. This gate is applied through point-wise multiplication. It helps to update or forget data because any vector multiplied point-wise by a vector of zeros disappears, i.e., is “forgotten”. On the other hand, any vector multiplied point-wise by a vector of ones keeps the same values, i.e., is “kept”. The network can learn which data are not important and can, therefore, be forgotten, and which data are important to keep. The reset gate is another gate, used to decide how much past information can be forgotten. A GRU cell has fewer tensor calculations than an LSTM cell. Hence, it is faster. We used a bidirectional GRU network, which is made of two GRU cells.
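A single GRU time step, as described by the gate equations in this section, can be sketched in plain Python as follows. Vectors are lists, weight matrices are lists of rows, and bias terms are omitted because the equations above do not include them. The function name `gru_step` and the toy weights are assumptions for illustration, not the trained parameters of BGCapsule.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gru_step(h_prev, x_t, W_z, W_r, W):
    """One GRU step; each weight matrix has shape hidden x (hidden + input)."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, concat)]          # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]          # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t   # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in matvec(W, gated)]      # candidate state
    # Point-wise blend of the previous state and the candidate state.
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # hidden size 2, input size 1
h_next = gru_step([1.0, -2.0], [0.3], zeros, zeros, zeros)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so the new state is exactly half of the previous one, which makes the blending role of the update gate easy to see.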

Capsule Network

Capsule network is proposed by Sabour et al. [2] for image classification task, which was demonstrated to be a better image classifier for learning spatial relationship. Our goal is to hybridize a capsule network with an ensemble of BiGRUs for text classification. Capsules have the ability to represent attributes of partial entities, and express semantic meanings in a wider space by expressing the entities with a vector rather than a scalar. In this regard, capsules are suitable to express a sentence or document as a vector. Figure 1 depicts the general structure of the proposed model. The architecture of the capsule network, depicted in Fig. 4, is described as follows:

Fig. 4

Capsule layers connections

Primary Capsule Layer This is the first capsule layer in which the capsules replace the scalar-output feature detectors (CNN or BiGRU) with vector-output capsules to preserve the instantiated parameters for each feature, such as the local order of words and semantic representations of words.

Connection Between Capsule Layers The capsule network generates the capsules in the next layer using the principle of “routing-by-agreement”. This process predominantly replaces the pooling operation. The pooling operation loses some important information such as angle and position, and cannot capture equivariance. Equivariance means that the internal representation captures the properties of the object, so that a change in the object produces a corresponding change in the internal representation. If no information is lost, the network can make robust predictions; otherwise, fooling a network that uses pooling becomes easy [29]. In Fig. 4, \({u}_{i}\) is the output of the ith capsule of the lower layer L and \({v}_{j}\) is the output of the jth capsule of the next layer L + 1.

Between two neighboring layers L and (L + 1), a “prediction vector” \({u}_{j|i}\) is first computed from the output \({u}_{i}\) of a capsule in the lower layer L by multiplying it with a weight matrix \({W}_{ij}\). In the capsule network, \({u}_{i}\) is the input vector and \({v}_{j}\) is the output vector of the jth capsule of layer L + 1. That is, to obtain the input to a capsule of the next layer, the transformation \({W}_{ij}\) is applied to the output of the lower-layer capsule:

$${u}_{j|i} = {W}_{ij}{u}_{i}.$$

Then, in the parent layer (L + 1), the total input \({s}_{j}\) of capsule j is generated as a linear combination of all the prediction vectors with weights \({c}_{ij}\).

$${s}_{j} = \sum_{i}{c}_{ij}\cdot {u}_{j|i},$$

where \({c}_{ij}\) represents the coupling coefficients. A dynamic routing algorithm is required to calculate the coupling coefficients, which are designed so that, for each lower-layer capsule i, they sum to one over the parent capsules, i.e., \(\sum_{j}{c}_{ij} = 1\). The coupling coefficient \({c}_{ij}\) reflects the effectiveness of capsule i in activating capsule j.

To introduce non-linearity, instead of applying a sigmoid, tanh or ReLU [30] activation function, the capsule network applies a squashing function to \({s}_{j}\). It rescales the activity vector \({v}_{j}\) (the output vector of the capsule in the next layer L + 1) to a length between 0 and 1. Equations (9) and (10) show that the squashing function shrinks short vectors to almost zero length and long vectors to almost unit length. When the length of \({s}_{j}\) is very small (large), the length given by Eq. (8) tends to zero (one). Hence, we can approximate Eq. (8) by Eqs. (9) and (10). The squashing function thus limits the length of a capsule in a non-linear way.

$${v}_{j} = \frac{{\left|{s}_{j}\right|}^{2}}{1+ {\left|{s}_{j}\right|}^{2}}\frac{{s}_{j}}{\left|{s}_{j}\right|},$$
$$v_{j} \approx \left| {s_{j} } \right|s_{j} ,{\text{when}}\;s_{j} \;{\text{is}}\;{\text{very}}\;{\text{small}};$$
$$v_{j} \approx \frac{{s_{j} }}{{\left| {s_{j} } \right|}},{\text{when}}\;s_{j} \;{\text{is}}\;{\text{large}}.$$
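The behaviour of the squashing function at the two extremes can be checked numerically with a short sketch. The helpers `squash` and `length` are hypothetical names, and returning a zero vector for a zero input is a convention assumed here to avoid division by zero.

```python
import math


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def squash(s):
    """Keep the direction of s and map its length n to n^2 / (1 + n^2)."""
    norm_sq = sum(c * c for c in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)  # assumed convention for the zero vector
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * c for c in s]


long_len = length(squash([3.0, 4.0]))    # input length 5
short_len = length(squash([0.01, 0.0]))  # input length 0.01
```

A vector of length 5 is mapped to a length of 25/26, close to one, while a vector of length 0.01 is mapped to a length of roughly 0.0001, close to zero.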

Dynamic Routing The capsule network updates the coupling coefficients through an iterative dynamic routing process [2], which determines the degree to which lower capsules are directed to upper capsules. The coupling coefficient is determined by the degree of similarity between the actual output of the upper capsule and the prediction made for it by the lower capsule.

Recall that the prediction vector \({u}_{j|i}\) and the activity vector \({v}_{j}\) (the output of the jth capsule of the next layer L + 1) are already computed. The prediction vector \({u}_{j|i}\) represents the vote of capsule i for the output capsule j above. If the prediction vector is highly similar to the output vector, we conclude that both capsules are highly related. As the similarity measure, we use the dot product between the prediction vector and the activity vector.

$${b}_{ij}\leftarrow {u}_{j|i}\cdot {v}_{j}.$$

Because of the dot product, \({b}_{ij}\) takes into account not only likeliness but also feature properties. The value of \({b}_{ij}\) will not be high if the activation \({u}_{i}\) of capsule i is low, since the length of \({u}_{j|i}\) is proportional to that of \({u}_{i}\); i.e., \({b}_{ij}\) should remain low between the lower layer L capsule and the parent layer capsule if the lower layer capsule is not activated. The value of \({c}_{ij}\) is obtained using the softmax of \({b}_{ij}\):

$${c}_{ij} = \frac{\mathrm{exp}\left({b}_{ij}\right)}{\sum_{k}\mathrm{exp}\left({b}_{ik}\right)}.$$

The algorithm updates the value of \({b}_{ij}\) over multiple iterations to make it more accurate.

$${b}_{ij} \leftarrow {b}_{ij} + {u}_{j|i}\cdot {v}_{j}.$$

In this paper, we have used the dynamic routing algorithm proposed by Sabour et al. [2] for feature routing. Dynamic routing is used to update parameters like the coupling coefficient \({c}_{ij}\), which expresses the connection between a capsule and its parent capsule. As the routing algorithm alone is not enough to update all parameters, backpropagation is used to learn the weight matrix \({W}_{ij}\). The value of \({c}_{ij}\) has to be re-initialized before the dynamic routing calculation begins.

In the dynamic routing algorithm, we first initialize to zero the logits \({b}_{ij}\) of the coupling coefficients, which are the log prior probabilities that capsule i should be coupled with capsule j. The coupling coefficients \({c}_{ij}\) are the softmax of \({b}_{ij}\). Hence, the initial values of all \({c}_{ij}\) are the same, which means that every input capsule i is equally effective in activating capsule j when the algorithm starts. For learning the coupling coefficients, we repeat the following steps:

Step 1. Calculate the value of coupling coefficients using the current value of log prior probabilities \({b}_{ij}\).

Step 2. For the parent layer (L + 1), the total input \({s}_{j}\) is a weighted sum of all prediction vectors \({u}_{j|i}\), which are obtained by applying the transformation \({W}_{ij}\) to the output of the lower-layer capsule as mentioned in Eq. (6).

Step 3. Squash the length of \({s}_{j}\) to between 0 and 1 to maintain the non-linearity, as mentioned in Eq. (8). We call the squashed vector the output vector of the parent layer (L + 1) and represent it by \({v}_{j}\).

Step 4. Now calculate the dot product of each output \({v}_{j}\) of the parent layer (L + 1) with every individual prediction vector \({u}_{j|i}\) of the lower layer L. If the dot product is greater than 0, then the prediction vector and the output vector are similar. If the dot product is less than 0, then they are not similar. Based on the dot product similarity, we update the log prior probabilities \({b}_{ij}.\)

These updated values of \({b}_{ij}\) impact the coupling coefficients \({c}_{ij}\). In this way, we learn the values of the coupling coefficients using the dynamic routing algorithm. To start with, we initialize all coupling coefficients to the same value and update them over the iterations. A coupling coefficient increases if the contribution of its prediction vector is positive, meaning that the dot product of the prediction vector and the output vector is positive.
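Steps 1 to 4 can be sketched in plain Python as follows. This is a minimal illustration of routing-by-agreement in the style of Sabour et al. [2], not the implementation used in the experiments: the prediction vectors are given directly instead of being computed from learned weight matrices, three iterations are an assumed default, and `dynamic_routing` is a hypothetical name.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def squash(s):
    norm_sq = sum(c * c for c in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * c for c in s]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from lower capsule i to parent capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits start at zero
    for _ in range(iterations):
        c = [softmax(row) for row in b]       # Step 1: coupling coefficients
        v = []
        for j in range(n_out):
            # Step 2: weighted sum of the prediction vectors sent to parent j.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
            v.append(squash(s_j))             # Step 3: squash to length below one
        for i in range(n_in):
            for j in range(n_out):
                # Step 4: agreement (dot product) raises or lowers the logit.
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c


u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[1.0, 0.0], [0.0, 0.1]],
    [[-1.0, 0.0], [0.0, -0.1]],
]
v, c = dynamic_routing(u_hat)
```

In this toy input, lower capsules 0 and 1 make the same prediction for parent 0 while capsule 2 predicts the opposite direction, so routing raises the coupling of capsules 0 and 1 to parent 0 above one half and lowers that of capsule 2 below one half.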

Capsule Flattening Layer

The capsules in this layer are flattened into a list of capsules. If we have N number of capsules each with dimension D, then the output of flatten layer is a vector of dimension \(N\times D.\) Output of the flatten layer represents important features. These are passed on to the multilayer perceptron (MLP) layer.
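A flattening step of this kind amounts to concatenating the capsule vectors one after another. The sketch below uses toy values, and `flatten_capsules` is a hypothetical helper name.

```python
def flatten_capsules(capsules):
    """Concatenate N capsule vectors of dimension D into one vector of length N * D."""
    return [component for capsule in capsules for component in capsule]


caps = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # N = 3 capsules, D = 2
flat = flatten_capsules(caps)
```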

Fully Connected ReLU/Softmax Layer

Here, we used two fully connected layers: a layer with ReLU activation [30] for classifying the features, and a final layer with the softmax activation function for obtaining the class probabilities in multiclass classification.

Description of the Datasets and Performance Metric

To validate the effectiveness of our BGCapsule network, we performed experiments on five benchmark datasets. Out of the five datasets, AG’s News, DBPedia, Yelp Review Polarity and Yelp review Full are introduced by Zhang et al. [12]. The fifth dataset is Movie Review (MR) taken from [11]. These benchmark datasets cover several text classification tasks such as sentiment analysis, ontology classification, news categorization and opinion classification. The details are presented in Table 2.

Table 2 Statistics of the datasets analyzed
Table 3 Test set accuracy of BGCapsule compared to the state-of-the-art
Table 4 Test set accuracy in the ablation study

Movie Review (MR) MR (Imdb 2005) is a movie review dataset which contains reviews of movies in English [11]. It contains 5331 positive and 5331 negative reviews. We performed tenfold cross validation (10FCV) on this dataset. This dataset is used for sentiment analysis task.

AG’s news corpus This is an English news categorization dataset introduced by Zhang et al. [12]. It is a multiclass dataset with four classes, drawn from a corpus of 496,835 categorized news articles. It has 30,000 training samples and 1900 test samples for each class. It is a balanced dataset with three fields: title, description and target class.

DBPedia ontology dataset This is ontology classification dataset introduced by Zhang et al. [12]. It is a multiclass dataset with 14 non-overlapping classes. It is constructed by DBPedia in 2014. It has 560,000 training and 70,000 test samples. Each class has 40,000 training samples and 5000 test samples.

Yelp Review Polarity This is a sentiment analysis dataset with binary classification. It is obtained from the Yelp Dataset Challenge in 2015 [12]. This dataset is converted to polarity dataset based on rating in yelp reviews. The polarity of a review is measured from rating. Based on rating, it is constructed to have two labels namely positive polarity and negative polarity. It has 560,000 training samples and 38,000 test samples.

Yelp Review Full The Yelp Review Full dataset [12] is also obtained from the Yelp Dataset Challenge in 2015. It is a multiclass sentiment analysis dataset with five classes. It has 650,000 training samples and 50,000 test samples. Each class has 130,000 training samples and 10,000 test samples.

Performance Metrics We have used only accuracy as the performance metric for the five benchmark datasets, namely Movie Review (MR) [11], AG’s News [12], DBPedia [12], Yelp Review Polarity [12] and Yelp Review Full [12], because other metrics like precision, recall and F1-score are not reported for the state-of-the-art baselines, Character-level CNN [12] and VDCNN [13].
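For reference, accuracy is the fraction of samples whose predicted label equals the true label. A minimal sketch, with toy labels and a hypothetical helper name `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # 3 of 4 predictions match
```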

Experimental Details

For data pre-processing, all the datasets are tokenized, and all words are converted to lowercase. We set a limit of 200 words for each document. If a document has fewer than 200 words, we pre-pad it with zeros after converting its words into the corresponding integer indices. Thereafter, all the words, represented by index numbers, are transformed using the pre-trained GloVe [23] embedding model. This GloVe model was trained on 840 billion tokens of web data from Common Crawl and has a vocabulary of 2.2 million words. It projects each word to a 300-dimensional vector, so each integer index, which represents a token, is mapped to a 300-dimensional vector using the pre-trained word embedding. The zero index represents a padded token, which is mapped to a 300-dimensional zero vector. The dimensions of BiGRU1 and BiGRU2 are 256 and 200, respectively. The number of capsules and the dropout rate for the layers are set to 20 and 0.25, respectively.
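The settings above imply the following tensor widths, under two assumptions that the text does not spell out: that 256 and 200 are the numbers of units per direction, and that the forward and backward outputs of each BiGRU, and then the outputs of the two BiGRUs, are concatenated. The sketch is only shape arithmetic and the constant names are hypothetical.

```python
MAX_LEN = 200       # words per document after truncation and pre-padding
EMBED_DIM = 300     # GloVe vector size
BIGRU1_UNITS = 256  # assumed to be units per direction
BIGRU2_UNITS = 200  # assumed to be units per direction


def bigru_width(units):
    """Output width when forward and backward states are concatenated (assumption)."""
    return 2 * units


embedded_shape = (MAX_LEN, EMBED_DIM)                                     # per document
ensemble_width = bigru_width(BIGRU1_UNITS) + bigru_width(BIGRU2_UNITS)   # per time step
```

Under these assumptions each document enters the ensemble as a 200 × 300 matrix and each time step leaves it as a 912-dimensional feature vector.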

The following are the hyper-parameter settings: we used a recurrent dropout value of 0.25, which was determined after fine-tuning. We used a low dropout rate of 0.25 because we wanted to reduce the variance while keeping most of the neurons active. Further, Rathnayaka et al. [31] also used a dropout value of 0.25 for the capsule network. We also used the SeLU activation function for the multilayer perceptron, the ADAM optimizer [32] and a batch size of 1000. We used the validation loss as the early stopping criterion with a patience of 5, a min_delta (the minimum change in the loss value that is accepted as a change) of 1% and mode min (training stops when the loss stops decreasing). We ran the experiments for 20–25 epochs. All code is implemented in Keras and Tensorflow.
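The early stopping rule described above can be illustrated with a small framework-independent sketch. It interprets the min_delta of 1% as an absolute change of 0.01 in the validation loss, which is an assumption, and `early_stopping_epoch` is a hypothetical helper rather than a Keras function.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.01):
    """Return the 1-based epoch at which training would stop under the patience rule."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:  # counts as an improvement only beyond min_delta
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:     # no real improvement for `patience` epochs
                return epoch
    return len(val_losses)           # never triggered: all epochs were run


losses = [1.0, 0.8, 0.7, 0.699, 0.698, 0.697, 0.696, 0.695, 0.6]
stop_at = early_stopping_epoch(losses)
```

In this toy loss curve the last accepted improvement happens at epoch 3; the next five epochs change the loss by less than 0.01 in total, so training stops at epoch 8 and the later drop is never seen.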

Results and Discussion

The hybrid model was developed and executed on a workstation with a 16 GB NVIDIA Quadro P5000 GPU, which has 20 multiprocessors with 128 CUDA cores each. The system has an Intel Xeon(R) E5-2640 v4 CPU (2.4 GHz, 40 cores) and 32 GB RAM, and runs Ubuntu 16.04.

Classification Accuracies

We compared BGCapsule with two state-of-the-art supervised text classification methods on the five benchmark datasets. We report Character-level CNN [12] and VDCNN [13] as baselines. The results, in terms of accuracy, for all datasets are presented in Table 3, where the best results are marked in bold.

Dynamic Routing (Capsule Net) Over Static Routing (Max Pooling)

We used a capsule layer with dynamic routing in place of max pooling. The disadvantage of max pooling is that it loses information. Max pooling and other pooling operations, such as average pooling and k-max pooling, are static routing algorithms because they apply a fixed rule that determines which features are routed to the upper layer, and this fixed rule causes information loss. The capsule network, on the other hand, uses dynamic routing, which takes a weighted average of the extracted features instead of selecting only the strongest features as max pooling does. Thus, our hybrid model has a better feature routing algorithm in addition to a better feature extractor, which yields better results than the state-of-the-art models.
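To make the weighted-average idea concrete, the following is a minimal pure-Python sketch of routing-by-agreement in the style of Sabour et al. [2]. It assumes that this is the routing variant meant here. The three iterations, the function names and the toy prediction vectors are illustrative choices rather than settings reported in this paper.

```python
import math


def squash(v):
    """Keep the direction of v and map its norm into [0, 1)."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0] * len(v)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction of input capsule i for output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    for _ in range(iterations):
        # Coupling coefficients: each input capsule spreads a total weight of 1.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            # Weighted average of the predictions, then squash.
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        # Raise the logit where a prediction agrees with the output capsule.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


# Toy input: capsules 0 and 1 agree on output 0, capsule 2 disagrees.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[1.0, 0.0], [0.0, 0.1]],
    [[-1.0, 0.0], [0.0, 0.1]],
]
v, c = dynamic_routing(u_hat)
print(c[0][0] > c[2][0])  # agreeing capsules get the larger coupling
```

Unlike max pooling, no input is discarded outright: the disagreeing capsule still contributes, only with a smaller weight.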

Ablation Study

An ablation study [33] is useful for the comparative study of architectures. We modified the proposed architecture to perform the ablation study [16], which resulted in two more architectures, namely BiGRU + Max pooling and CNN-Capsule net. To demonstrate the effect of the capsule network and dynamic routing, we removed the dynamic routing and the primary capsule layer. We trained and tested both ablation architectures on all benchmark datasets and compared the results with those of our proposed architecture, BGCapsule. The results of the ablation study are presented in Table 4, from which we can see that BGCapsule outperforms both BiGRU + Max pooling and CNN-Capsule net in terms of accuracy.

BiGRU + Max pooling We used BiGRU for feature extraction followed by max pooling for routing the features. In max pooling, we selected the dominant feature out of every four features and dropped the other three.
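The max-pooling rule of keeping one feature out of four can be sketched as follows. The non-overlapping windows (stride equal to the pool size), the function name and the toy feature values are assumptions made for illustration.

```python
def max_pool_1d(features, pool_size=4):
    """Keep the largest value in each non-overlapping window of pool_size."""
    return [
        max(features[i:i + pool_size])
        for i in range(0, len(features) - pool_size + 1, pool_size)
    ]


# Hypothetical feature values: two windows of four features each.
features = [0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.8, 0.7]
print(max_pool_1d(features))  # [0.9, 0.8]
```

Six of the eight values are discarded here, which is the information loss that the ablation is meant to expose.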

CNN-Capsule Network We used a CNN as the feature extractor in place of the BiGRU ensemble.

The BGCapsule network has more trainable parameters than the max-pooling variant, because max pooling has no parameters to decide how to route the features. In the capsule network, dynamic routing selects features dynamically and assigns a weight to each feature, with larger weights for important features. Therefore, dynamic routing is better than max pooling, because the latter drops the non-dominant features and hence loses information.

We can conclude from Table 4 that BGCapsule achieves the best accuracy among the three combinations of feature extractor and routing algorithm.

Our approach outperformed all the previous studies on all datasets. We obtained a larger margin on the Yelp Review Full dataset for the sentiment analysis task, where our accuracy is 4.87% higher than the one reported by Zhang et al. [12].

We can see from Table 4 that our proposed BGCapsule network achieves better accuracy than state-of-the-art models without using any linguistic knowledge, such as knowledge of positive and negative words. The IMDB and AG’s News datasets are small compared to the other datasets. Therefore, VDCNN [13], which has 49 CNN layers, was unable to outperform the plain character-level CNN [12] model. However, our proposed BGCapsule outperforms both these models because the capsule network can capture the feature information correctly and hence, unlike a CNN, does not require a lot of data to learn. Our model also performed well on large datasets such as the Yelp Review dataset, where we obtained 4.87% higher accuracy compared to Zhang et al. [12]. Thus, we can infer that our model performs well on small as well as large datasets.

We used a BiGRU ensemble layer instead of a CNN layer for feature extraction. Both CNNs and RNNs are good feature extractors. A CNN is an n-gram detector, whereas a GRU captures long-range semantic dependencies. We find that the GRU performs better than the CNN in cases where text has to be categorized based on the entire sequence or a long-range semantic dependency rather than on some local phrases [12]. For example, if a long phrase contains negative words like “not” or “bad” but the whole sentence has positive sentiment, the GRU will perform better. We opted for BiGRU because it is a window-based feature extractor and extracts context successfully. Consequently, to extract better features, we employed an ensemble of two BiGRUs.

The reasons for the superior performance of the proposed model are as follows:

BGCapsule uses an ensemble of two BiGRU networks. We used two BiGRUs with different unit sizes so that they capture different text features, and we concatenate their outputs for use by the next layer. Concatenating the two BiGRUs makes a better feature extractor. The number of BiGRUs that can be used in the ensemble and their unit sizes depend on the available RAM and affect the computation. Therefore, we started with an ensemble of two BiGRUs.

In the experiments, the dimension of the capsule is an important hyper-parameter. A capsule with a larger dimension can contain more feature information, but it also increases the computational complexity. We tried different capsule dimensions and observed that, for large dimensions, the loss decreases slowly. Hence, we set the capsule’s dimension to 20.

We obtained better results than the state-of-the-art studies for all datasets. Our model performs very well on multi-label classification, multiclass classification, topic classification and sentiment analysis. On the Yelp Review Full dataset, our model outperformed the character-based convolutional neural network [12] and VDCNN [13] because the text feature extraction of BGCapsule is carried out by the BiGRU layer.
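As an illustration of the BiGRU ensemble named in the first reason above, the short sketch below works out the feature width passed to the capsule layer. It rests on assumptions the text does not confirm: that 256 and 200 are the units per direction, that each BiGRU concatenates its forward and backward outputs, and that both return one vector per time step. The function names are hypothetical.

```python
def bigru_output_width(units):
    """Width of a BiGRU output when both directions are concatenated."""
    return 2 * units


def ensemble_output_shape(seq_len, unit_sizes):
    """Shape after concatenating several BiGRU outputs along the feature axis."""
    return (seq_len, sum(bigru_output_width(u) for u in unit_sizes))


# 200 time steps (the word limit) and the two unit sizes reported earlier.
print(ensemble_output_shape(200, [256, 200]))  # (200, 912)
```

Under these assumptions every additional BiGRU widens the concatenated features by twice its unit size, which is consistent with the remark that the ensemble size is bounded by the available memory.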

Conclusions and Future Work

In this paper, we proposed a new hybrid architecture comprising an ensemble of Bidirectional Gated Recurrent Units and a Capsule Network for the text classification domain. We used dynamic routing in the capsule network and an ensemble of two BiGRUs for feature extraction in place of a CNN. We compared the proposed model with the character-based convolutional neural network and VDCNN. Based on five popular benchmark datasets, we observed that our proposed architecture, BGCapsule, is indeed useful for text classification and achieves the best performance compared to the state-of-the-art.

In the future, we would like to investigate the effect of routing algorithms on various tasks in multi-task learning by exploiting the potential of the BGCapsule network. We will also explore SOM- and K-means-based routing algorithms. We would also like to investigate its usefulness in classifying numerical and image datasets.


References

1. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. 2013. arXiv:1310.4546.
2. Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. 2017. arXiv:1710.09829.
3. Patrick MK, Adekoya AF, Mighty AA, Edward BY. Capsule networks–a survey. Journal of King Saud University-Computer and Information Sciences. 2019. ISSN 1319-1578.
4. Prager J. Open-domain question-answering. Found Trends® Inf Retr. 2006;1:91–231.
5. Goodfellow I, Bengio Y, Courville A. Deep learning. Nature. 2016.
6. Wang Y, Sun A, Han J, Liu Y, Zhu X. Sentiment analysis by capsules. In: Proceedings of the 2018 World Wide Web conference (WWW’18), ACM Press, New York, USA, 2018, pp. 1165–1174.
7. Socher R, Perelygin A, Wu JY, Chuang J, Manning CD. Recursive deep models for semantic compositionality over a sentiment treebank. PLoS ONE. 2013.
8. Tang D, Qin B, Liu T. Document modeling with gated recurrent neural network for sentiment classification. In: Proceedings of the 2015 conference on empirical methods in natural language processing, Association for Computational Linguistics, Stroudsburg, PA, USA, 2015, pp. 1422–1432.
9. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–80.
10. Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014. arXiv:1406.1078.
11. Kim Y. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing, Association for Computational Linguistics, Stroudsburg, PA, USA, 2014, pp. 1746–1751.
12. Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. 2015. arXiv:1509.01626.
13. Conneau A, Schwenk H, Barrault L, Lecun Y. Very deep convolutional networks for natural language processing. EACL. 2016.
14. Kim J, Jang S, Choi S, Park E. Text classification using capsules. 2018. arXiv:1808.03976.
15. Wang Y, Sun A, Han J, Liu Y, Zhu X. Sentiment analysis by capsules. 2018.
16. Zhao W, Ye J, Yang M, Lei Z, Zhang S, Zhao Z. Investigating capsule networks with dynamic routing for text classification. 2018. arXiv:1804.00538.
17. Zhang N, Deng S, Sun Z, Chen X, Zhang W, Chen H. Attention-based capsule networks with dynamic routing for relation extraction. 2018. arXiv:1812.11321.
18. Miyato T, Dai AM, Goodfellow I. Adversarial training methods for semi-supervised text classification. 2016. arXiv:1605.07725.
19. Feng Y, Zhu X, Li Y, Ruan Y, Greenspan M. Learning capsule networks with images and text. In: Proc. 32nd Conf. Neural Inf. Process. Syst. (NIPS), 2018, pp. 1–5.
20. Du Y, Zhao X, He M, Guo W. A novel capsule based hybrid neural network for sentiment classification. IEEE Access. 2019;7:39321–8.
21. Henry S, Cuffy C, McInnes BT. Vector representations of multi-word terms for semantic relatedness. J Biomed Inform. 2018;77:111–9.
22. Pennington J, Socher R, Manning C. Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing. 2014.
23.
24. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. 2013.
25. Short-term load forecasting by artificial intelligent technologies. MDPI. 2019.
26. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45:2673–81.
27. Zhang L, Zhou Y, Duan X, Chen R. A hierarchical multi-input and output Bi-GRU model for sentiment analysis on customer reviews. IOP Conf Ser Mater Sci Eng. 2018;322:062007.
28. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. 2014. arXiv:1412.3555.
29. Wei X, Zhu J, Su H. Sparse adversarial perturbations for videos. 2018. arXiv:1803.02536.
30. Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified activations in convolutional network. 2015. arXiv:1505.00853.
31. Rathnayaka P, Abeysinghe S, Samarajeewa C, Manchanayake I, Walpola M. Sentylic at IEST 2018: gated recurrent neural network and capsule network based approach for implicit emotion detection. arXiv:1809.01452.
32. Kingma DP, Ba J. Adam: a method for stochastic optimization. 2014. arXiv:1412.6980.
33. Liu J, Chang W-C, Wu Y, Yang Y. Deep learning for extreme multi-label text classification. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval (SIGIR’17), ACM Press, New York, USA, 2017, pp. 115–124.

Author information



Corresponding author

Correspondence to Vadlamani Ravi.

Ethics declarations

Conflict of Interest

On behalf of both authors, the corresponding author states that there is no conflict of interest whatsoever.

Cite this article

Gangwar, A.K., Ravi, V. A Novel BGCapsule Network for Text Classification. SN COMPUT. SCI. 3, 81 (2022).

Keywords


  • Capsule network
  • Text classification
  • BiDirectional Gated Recurrent Unit
  • Sentiment analysis
  • Word embedding
  • Deep learning