
Efficient feature selection techniques for sentiment analysis

Published in Multimedia Tools and Applications

Abstract

Sentiment analysis is a field of study that focuses on identifying and classifying opinions expressed in text into positive, negative and neutral polarities. Feature selection is a crucial process in machine learning. In this paper, we study the performance of different feature selection techniques for sentiment analysis. Term Frequency-Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for creating the feature vocabulary. Various Feature Selection (FS) techniques are evaluated to select the best subset of features from the feature vocabulary. The selected features are used to train different machine learning classifiers: Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree (DT) and Naive Bayes (NB). The ensemble techniques Bagging and Random Subspace are applied to these classifiers to enhance their performance on sentiment analysis. We show that the best FS techniques, when combined with ensemble methods, achieve remarkable results on sentiment analysis. We also compare the performance of FS methods trained using Bagging and Random Subspace against varied neural network architectures. We show that FS techniques trained using ensemble classifiers outperform neural networks while requiring significantly less training time and far fewer parameters, thereby eliminating the need for extensive hyper-parameter tuning.


Notes

  1. Code is available at https://github.com/avinashsai/MTAP

  2. Train/test splits can be found at https://github.com/avinashsai/Cross-domain-sentiment-analysis/tree/master/Dataset/Actualdata

  3. Pre-trained GloVe embeddings: http://nlp.stanford.edu/data/glove.840B.300d.zip


Author information


Corresponding author

Correspondence to Avinash Madasu.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Madasu, A., Elango, S. Efficient feature selection techniques for sentiment analysis. Multimed Tools Appl 79, 6313–6335 (2020). https://doi.org/10.1007/s11042-019-08409-z

