CGSPN: cascading gated self-attention and phrase-attention network for sentence modeling


Sentence modeling is a critical step in feature generation for many natural language processing (NLP) tasks. Most recent works generate sentence representations with models based on Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and attention mechanisms. However, these models have two limitations: (1) they represent a sentence for only one individual task by fine-tuning network parameters, and (2) their sentence modeling considers only the concatenation of words and ignores the function of phrases. In this paper, we propose a Cascading Gated Self-attention and Phrase-attention Network (CGSPN) that generates a sentence embedding by considering both the contextual words and the key phrases in a sentence. Specifically, we first present a word-interaction gating self-attention mechanism that identifies important words and builds the relationships between them. We then cascade a phrase-attention structure that abstracts the semantics of phrases to generate the sentence representation. Experiments on different NLP tasks show that the proposed CGSPN model achieves the highest accuracy among most sentence-encoding methods. It improves the latest best result by 1.76% on the Stanford Sentiment Treebank (SST) and achieves the best test accuracy on several sentence classification data sets. In the Natural Language Inference (NLI) task, CGSPN without phrase-attention performs better than the full CGSPN model and obtains competitive performance against state-of-the-art baselines, which shows the differing applicability of the proposed components. In other NLP tasks, we also compare our model with popular methods to explore our direction.
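The cascade described in the abstract, gated self-attention over words followed by phrase-attention pooling into one sentence vector, might be sketched as below. This is a minimal illustrative sketch only: the sigmoid gate matrix `Wg`, the fixed phrase width, and the mean-pooled n-gram phrase vectors are assumptions for exposition, not the paper's exact equations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(X, Wg):
    """Word-interaction gated self-attention (sketch).
    X: (n, d) word embeddings; Wg: (d, d) gate weights (hypothetical)."""
    scores = X @ X.T / np.sqrt(X.shape[1])   # pairwise word interactions
    A = softmax(scores, axis=-1)             # attention weights over words
    H = A @ X                                # contextualized word vectors
    gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))   # per-word sigmoid gate
    return gate * H + (1.0 - gate) * X       # gate decides how much context to keep

def phrase_attention(H, width=3):
    """Phrase-attention (sketch): build sliding n-gram phrase vectors,
    score their salience, and pool them into one sentence embedding."""
    n, d = H.shape
    phrases = np.stack([H[i:i + width].mean(axis=0)
                        for i in range(n - width + 1)])  # (n-width+1, d)
    w = softmax(phrases.sum(axis=1))                     # salience per phrase
    return w @ phrases                                   # (d,) sentence vector

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 16))            # 7 words, 16-dim embeddings
Wg = rng.standard_normal((16, 16)) * 0.1
sent = phrase_attention(gated_self_attention(X, Wg))
print(sent.shape)                           # prints (16,)
```

In this sketch the gate interpolates between each word's raw embedding and its attention-contextualized version, and the phrase stage attends over overlapping n-grams rather than individual words, mirroring the two cascaded stages named in the abstract.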


[Figures 1–6 appear in the full article.]






This work was supported by the Fundamental Research Funds for the Central Universities (No. 2019YJS006) and the National Key Research and Development Program of China (No. 2018YFC0831300).

Author information



Corresponding author

Correspondence to Yun Liu.



Cite this article

Fu, Y., Liu, Y. CGSPN: cascading gated self-attention and phrase-attention network for sentence modeling. J Intell Inf Syst (2020).



Keywords

  • Sentence modeling
  • Gated self-attention
  • CNN
  • Phrase-attention mechanism
  • NLP