Information geometry enhanced fuzzy deep belief networks for sentiment classification


With the growth of the internet, more and more people share reviews online. Sentiment analysis of such reviews using deep learning has become an active research topic in the natural language processing community. However, how to improve the performance of a deep neural network remains an open question. In this paper, we propose an algorithm that combines deep learning, fuzzy clustering and information geometry. Specifically, the distribution of the training samples is treated as prior knowledge and is encoded in fuzzy deep belief networks using an improved Fuzzy C-Means (FCM) clustering algorithm. To improve FCM, we use information geometry to construct a geodesic distance between the distributions over features for classification. Based on the clustering results, we then embed the fuzzy rules learned by FCM into the fuzzy deep belief networks to improve their performance. Finally, we evaluate the proposed method on data sets dedicated to sentiment classification. The results show that our algorithm yields a significant improvement over existing methods.
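As a rough illustration of the clustering step, the following Python sketch shows the standard FCM membership update with a pluggable distance function. It is not the improved FCM of this paper: the function name `fcm_memberships`, the fuzzifier default `m=2.0` and the Euclidean distance in the usage note are assumptions made for the example. A geodesic distance could be passed in place of the Euclidean one.

```python
import math


def fcm_memberships(points, centers, distance, m=2.0):
    """Return u[i][k], the membership of point i in cluster k.

    Standard FCM update: u[i][k] is proportional to d(x_i, c_k)^(-2/(m-1)),
    normalized so that each row sums to 1. `distance` is any callable
    taking (point, center) and returning a non-negative float.
    """
    if m <= 1:
        raise ValueError("fuzzifier m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [distance(x, c) for c in centers]
        if any(d == 0 for d in dists):
            # The point coincides with one or more centers:
            # share the full membership equally among those centers.
            hits = [1.0 if d == 0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        inv = [d ** (-exponent) for d in dists]
        total = sum(inv)
        memberships.append([v / total for v in inv])
    return memberships
```

For example, `fcm_memberships([(1.0, 0.0)], [(0.0, 0.0), (4.0, 0.0)], math.dist)` assigns most of the membership to the nearer center. Making the distance a parameter keeps the update unchanged when the metric is swapped.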




1. Shoushan L, Lee SYM, Chen Y, Huang C, Zhou G (2010) Sentiment classification and polarity shifting. In: Proceedings of the 23rd international conference on computational linguistics, pp 635–643
2. Taboada M, Brooke J, Tofiloski M, Voll K, Stede M (2010) Lexicon-based methods for sentiment analysis. Comput Linguist 37(2):267–307
3. Ravishankar N, Raghunathan S (2017) Corpus based sentiment classification of Tamil movie tweets using syntactic patterns. Comput Sci 8(2):172–178
4. HaCohen-Kerner Y, Badash H (2016) Positive and negative sentiment words in a blog corpus written in Hebrew. Procedia Comput Sci 96(50):733–743
5. Gao K, Su S, Wang J (2015) A sentiment analysis hybrid approach for microblogging and E-commerce corpus. In: 7th international conference on modelling, identification and control (ICMIC), pp 1–6
6. Bo P, Lillian L, Shivakumar V (2002) Thumbs up? Sentiment classification using machine learning techniques. Proc EMNLP-02 10(2):79–86
7. Turney P (2002) Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Annual meeting of the association of computational linguistics, pp 417–424
8. Turney PD, Littman ML (2003) Measuring praise and criticism: inference of semantic orientation from association. ACM Trans Inf Syst 21(1):315–346
9. Da Silva NFF, Coletta LFS, Hruschka ER, Hruschka ER Jr (2016) Using unsupervised information to improve semi-supervised tweet sentiment classification. Inf Sci 355(1):348–365
10. Torresani L (2014) Weakly supervised learning. Comput Vis A Ref Guide 10(2–3):883–885
11. Guan Z, Chen L, Zhao W, Zheng Y, Tan S, Cai D (2016) Weakly-supervised deep learning for customer review sentiment classification. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence (IJCAI-16)
12. Hady MFA, Schwenker F (2013) Semi-supervised learning. In: Bianchini M, Maggini M, Jain L (eds) Handbook on neural information processing. Intelligent systems reference library, vol 49. Springer, Berlin
13. Li S, Wang Z, Zhou G, Lee SYM (2017) Semi-supervised learning for imbalanced sentiment classification. J R Stat Soc 172(2):530–530
14. Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(1):1527–1554
15. Zhou S, Chen Q, Wang X (2014) Fuzzy deep belief networks for semi-supervised sentiment classification. Neurocomputing 131(1):312–322
16. Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353
17. Basseville M (2013) Divergence measures for statistical data processing: an annotated bibliography. Signal Process 93(4):621–633
18. Zhao K, Alavi A, Wiliem A, Lovell BC (2005) A novel information geometric approach to variable selection in MLP networks. Neural Netw 18(2):1309–1318
19. Amari SI (1998) Natural gradient works efficiently in learning. Neural Comput 10(2):251–276
20. Zhao J (2015) Natural gradient learning algorithms for RBF networks. Neural Comput 27(2):481–505
21. Bezdek AC, Ehrlich R, Full W (1984) FCM: the Fuzzy C-means clustering algorithm. Comput Geosci 10(2–3):191–203
22. Zhuang L, Jing F, Zhu Z (2006) Movie review mining and summarization. In: Proceedings of the 15th ACM international conference on information and knowledge management, pp 43–50
23. Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pp 1–12
24. Wu F, Song Y, Huang Y (2015) Microblog sentiment classification with contextual knowledge regularization. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence, pp 2332–2338
25. Xia Y, Wang AL, Wong KF, Xu M (2008) Lyric-based song sentiment classification with sentiment vector space model. In: Annual meeting of the association of computational linguistics, pp 133–136
26. Mcdonald R, Hannan K, Neylon T (2007) Structured models for fine-to-coarse sentiment analysis. In: Annual meeting of the association of computational linguistics, pp 432–439
27. Deng Z, Luo K, Yu H (2014) A study of supervised term weighting scheme for sentiment analysis. Expert Syst Appl 41(1):3506–3513
28. Aue A, Gamon M (2005) Customizing sentiment classifiers to new domains: a case study. In: International conference on recent advances in natural language processing, pp 210–231
29. Tan S, Wu G, Tang AH, Cheng X (2007) A novel scheme for domain-transfer problem in the context of sentiment analysis. In: ACM conference on information & knowledge management, pp 979–982
30. Li S, Zong C (2008) Multi-domain sentiment classification. In: Annual meeting of the association of computational linguistics, association for computational linguistics, pp 257–260
31. Pan J, Ni X, Sun J, Yang Q, Chen Z (2010) Cross-domain sentiment classification via spectral feature alignment. In: International World Wide Web Conference, ACM, pp 751–760
32. Biagioni R (2016) Unsupervised sentiment classification. Springer, Cham
33. Read J, Carroll J (2009) Weakly supervised techniques for domain-independent sentiment classification. In: Proceedings of the 1st international CIKM workshop on topic-sentiment analysis for mass opinion, TSA’09, pp 45–52
34. Zhao W, Guan Z, Chen L, He X, Cai D, Wang B, Wang Q (2018) Weakly-supervised deep embedding for product review sentiment analysis. IEEE Trans Knowl Data Eng 30(1):1–23
35. Zhu X (2007) Semi-supervised learning literature survey. Ph.D. thesis
36. Goldberg AB, Zhu X (2006) Seeing stars when there aren’t many stars: graph-based semi-supervised learning for sentiment categorization. In: Proceedings of text graphs: the first workshop on graph based methods for natural language processing, association for computational linguistics, pp 45–52
37. Sindhwani V, Melville P (2008) Document-word co-regularization for semi-supervised sentiment analysis. In: IEEE international conference on data mining, pp 1025–1030
38. Zhou S, Qingcai C, Xiaolong W (2010) Active deep networks for semi-supervised sentiment classification. In: International conference on computational linguistics, poster, pp 1515–1523
39. Smolensky S (1986) Information processing in dynamical systems: foundations of harmony theory. Parallel Distrib Process Explor Micro Struct Cognit 1:194–281
40. Park K-J, Lee J-P, Lee DY (2012) Optimal design of fuzzy clustering-based fuzzy neural networks for pattern classification. Int J Grid Distrib Comput 5(3):361–831
41. Rubio JJ, Pacheco J (2009) An stable online clustering fuzzy neural network for nonlinear system identification. Neural Comput Appl 18(1):633–641
42. Anuar N, Zakaria Z (2012) Electricity load profile determination by using Fuzzy C-means and probability neural network. Energy Procedia 14(5):1861–1869
43. Kass RE, Vos PW (1997) Geometrical foundations of asymptotic inference. Wiley, New York
44. Amari S, Kawanabe M (1997) Information geometry of estimating functions in semiparametric statistical models. Bernoulli 3:29–54
45. Dasgupta S, Ng V (2009) Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification. In: Joint conference of the 47th annual meeting of the association for computational linguistics and 4th international joint conference on natural language processing of the Asian federation of natural language processing, pp 701–709
46. Sergey I, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Comput Sci 3(21):15–23
47. Frieden BR (2004) Science from Fisher information: a unification. Cambridge Univ. Press, Cambridge
48. Devroye L, Gyorfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, Berlin. ISBN:0-3879-4618-7
49. Nielsen F, Garcia V (2009) Statistical exponential families: a digest with flash cards
50. Nielsen F (2013) Pattern learning and recognition on statistical manifolds. Int Workshop Similarity Based Pattern Recognit 7953:1–25
51. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
52. Bengio YA (2009) Learning deep architecture for AI. Found Trends Mach Learn 2:1–127
53. Kamvar S, Klein D, Manning C (2003) Spectral learning. In: International joint conferences on artificial intelligence. AAAI, Catalonia, pp 561–566
54. Xiong X, Chan KL, Tan KL (2012) Similarity-driven cluster merging method for unsupervised fuzzy clustering. In: Proceedings of the 20th international conference on uncertainty in artificial intelligence, pp 55–67
55. Smith LN (2017) Cyclical learning rates for training neural networks. In: Applications of computer vision (WACV), 2017 IEEE winter conference on, pp 464–472. IEEE
56. Amari S (2001) Information geometry on hierarchy of probability distributions. IEEE Trans Inf Theory 47(5):1701–1711


Author information



Corresponding author

Correspondence to Zhen-Hu Ning.




Proof of Theorem 1. By the central limit theorem, for any distribution and sufficiently large j, we have

$$({\mu _2} - {\mu _1})/{\sigma _1}\sim N(0,\,1/j)$$
$${({\sigma _2})^2}/{({\sigma _1})^2}\sim N(1,\,1/(j - 1))$$

Then there exists a positive function c(j), decreasing to zero as j grows, such that

$$p\{ |\mu | \leq c(j)\}>1 - \varepsilon$$
$$p\{ |\sigma | \leq c(j)\}>1 - \varepsilon$$

For large j, we deduce that

$$\begin{aligned} {d_F}(({\mu _1},{\sigma _1}),({\mu _2},{\sigma _2})) &= \sqrt 2 \ln \left[ \frac{F(({\mu _1},{\sigma _1}),({\mu _2},{\sigma _2}))+{({\mu _1} - {\mu _2})^2}+2(\sigma _{1}^{2}+\sigma _{2}^{2})}{4{\sigma _1}{\sigma _2}} \right] \\ &= \sqrt 2 \ln \left[ \frac{\sigma _{1}^{2}\sqrt {({\mu ^2}+2{\sigma ^2})({\mu ^2}+8+O(\sigma ))} +\sigma _{1}^{2}{\mu ^2}+4\sigma _{1}^{2}(1+\sigma +O({\sigma ^2}))}{4{\sigma _1}{\sigma _2}} \right]. \end{aligned}$$

Then,

$$\sqrt 2 \ln [{c_1}(r+o(r))+1] \leq {d_F}(({\mu _1},{\sigma _1}),({\mu _2},{\sigma _2})) \leq \sqrt 2 \ln [{c_2}(r+o(r))+1]$$

where \(r=\sqrt {{\mu ^2}+{\sigma ^2}}\), and \(c_1\) and \(c_2\) are positive constants. Hence the result holds.
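The two-sided bound can be checked numerically. The Python sketch below is illustrative only and makes two assumptions that are not stated in this appendix: that F takes the standard Fisher-Rao form \(F=\sqrt{[(\mu_1-\mu_2)^2+2(\sigma_1-\sigma_2)^2][(\mu_1-\mu_2)^2+2(\sigma_1+\sigma_2)^2]}\), and that the normalized quantities are \(\mu=(\mu_1-\mu_2)/\sigma_1\) and \(\sigma=\sigma_2/\sigma_1-1\). Both are consistent with the expansion above. The function names are hypothetical.

```python
import math


def fisher_rao_distance(mu1, sigma1, mu2, sigma2):
    """Geodesic distance d_F between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    dmu2 = (mu1 - mu2) ** 2
    # Assumed form of F (see lead-in); it is not defined in the appendix.
    f = math.sqrt((dmu2 + 2 * (sigma1 - sigma2) ** 2)
                  * (dmu2 + 2 * (sigma1 + sigma2) ** 2))
    ratio = (f + dmu2 + 2 * (sigma1 ** 2 + sigma2 ** 2)) / (4 * sigma1 * sigma2)
    return math.sqrt(2) * math.log(ratio)


def d_f_over_r(mu, sigma, sigma1=1.0):
    """Return d_F / r for a small normalized perturbation (mu, sigma)."""
    r = math.hypot(mu, sigma)
    d = fisher_rao_distance(0.0, sigma1, mu * sigma1, sigma1 * (1.0 + sigma))
    return d / r


# d_F / r should stay between two positive constants as r shrinks.
for scale in (1e-1, 1e-2, 1e-3):
    print(f"scale {scale:g}: d_F/r = {d_f_over_r(scale, scale):.4f}")
```

Under these assumptions the ratio \(d_F/r\) stays bounded away from zero and infinity for small perturbations, which is the behaviour the bound describes.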

Next, we show that \(d_F\) is more sensitive than the Kullback–Leibler divergence (KLD).

For two univariate normal distributions, the KLD has the closed form [56]

$$KLD(({\mu _1},{\sigma _1})||({\mu _2},{\sigma _2}))=\frac{1}{2}\left[2\ln ({\sigma _2}/{\sigma _1})+\sigma _{1}^{2}/\sigma _{2}^{2}+{({\mu _1} - {\mu _2})^2}/\sigma _{2}^{2} - 1\right].$$

Then for large j, we have

$$KLD(({\mu _1},{\sigma _1})||({\mu _2},{\sigma _2})) \leq o(\sqrt {{\mu ^2}+{\sigma ^2}} ).$$
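To see where this bound comes from, the closed form can be expanded to second order. The sketch below assumes the normalized quantities \(\mu=({\mu _1} - {\mu _2})/{\sigma _1}\) and \(\sigma={\sigma _2}/{\sigma _1} - 1\), which are inferred from the expansion of \(d_F\) rather than stated in this appendix, and writes \(r=\sqrt {{\mu ^2}+{\sigma ^2}}\):

$$\begin{aligned} KLD(({\mu _1},{\sigma _1})||({\mu _2},{\sigma _2})) &= \frac{1}{2}\left[2\ln (1+\sigma )+\frac{1+{\mu ^2}}{{(1+\sigma )}^2} - 1\right] \\ &= \frac{1}{2}\left[(2\sigma - {\sigma ^2})+(1+{\mu ^2})(1 - 2\sigma +3{\sigma ^2}) - 1\right]+O({r^3}) \\ &= \frac{1}{2}({\mu ^2}+2{\sigma ^2})+O({r^3}) \leq {r^2}+O({r^3}). \end{aligned}$$

Under these assumptions the KLD is of order \(r^2\), and hence \(o(r)\), as \(r \to 0\).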

Then, according to Theorem 1, with probability at least \(1 - \varepsilon\),

$$\mathop {\lim }\limits_{j \to \infty } KLD(({\mu _1},{\sigma _1})||({\mu _2},{\sigma _2}))/\sqrt {{\mu ^2}+{\sigma ^2}} =0,$$

which implies that the KLD is less sensitive than \(d_F\) to small differences between the two distributions.
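The vanishing ratio can also be observed numerically. The following Python sketch evaluates the closed-form KLD given above for shrinking perturbations. As an assumption not stated in this appendix, it takes \(\mu=(\mu_1-\mu_2)/\sigma_1\) and \(\sigma=\sigma_2/\sigma_1-1\). The function names are hypothetical.

```python
import math


def kld_normal(mu1, sigma1, mu2, sigma2):
    """KLD of N(mu1, sigma1^2) from N(mu2, sigma2^2), closed form from the text."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    return 0.5 * (2.0 * math.log(sigma2 / sigma1)
                  + sigma1 ** 2 / sigma2 ** 2
                  + (mu1 - mu2) ** 2 / sigma2 ** 2
                  - 1.0)


def kld_over_r(mu, sigma, sigma1=1.0):
    """Return KLD / r for a small normalized perturbation (mu, sigma)."""
    r = math.hypot(mu, sigma)
    kld = kld_normal(0.0, sigma1, mu * sigma1, sigma1 * (1.0 + sigma))
    return kld / r


# KLD / r should shrink toward zero as the perturbation shrinks.
for scale in (1e-1, 1e-2, 1e-3):
    print(f"scale {scale:g}: KLD/r = {kld_over_r(scale, scale):.3e}")
```

In this sketch the ratio decreases roughly in proportion to the perturbation size, in line with the limit above.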


Cite this article

Wang, M., Ning, Z., Li, T. et al. Information geometry enhanced fuzzy deep belief networks for sentiment classification. Int. J. Mach. Learn. & Cyber. 10, 3031–3042 (2019).



  • Fuzzy neural networks
  • Information geometry
  • Semi-supervised learning
  • Sentiment classification