EnSWF: effective features extraction and selection in conjunction with ensemble learning methods for document sentiment classification

Applied Intelligence

Abstract

With the rise of Web 2.0, a huge amount of unstructured data is generated on a regular basis in the form of comments, opinions, etc. This unstructured data contains useful information and can play a significant role in business decision making. In this context, sentiment analysis (SA) is an active research area that has recently attracted the attention of the research community. The aim of SA is to classify user-generated content into positive and negative classes. State-of-the-art techniques for sentiment classification rely on traditional bag-of-words approaches. Such approaches are simple, but they completely ignore semantics and the order between words, and they lead to the curse of dimensionality. Researchers have also proposed semantic-based SA techniques that account for word order by employing high-order n-grams, part-of-speech (POS) patterns, and dependency relation features. But does every word or phrase drawn from high-order n-grams, POS patterns, or dependency relation features carry a sentiment clue? And if all of them are incorporated, what happens to the dimensionality? To investigate and tackle these issues, in this paper we propose EnSWF, a novel POS and n-gram based ensemble method for SA that considers semantics, sentiment clues, and the order between words. EnSWF is a four-phase process. Our main contributions are four-fold: (a) Appropriate Feature Extraction: we investigate and validate the extraction of various appropriate features for sentiment classification. (b) Dimensionality Reduction: we decrease the dimensionality of the feature space by selecting the subset of the most meaningful and effective features. (c) Ensemble Model: we propose an ensemble learning method for both filter-based feature selection and classification using a simple majority voting technique. (d) Practicality: we validate our claims by applying our model to benchmark datasets. We also show that EnSWF outperforms existing techniques in terms of classification accuracy and reduces the high-dimensional feature space.
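
The simple majority voting mentioned in contribution (c) can be illustrated with a minimal sketch. The abstract does not say which filter methods or classifiers are combined, so the function names, the "more than half" selection threshold, the tie-breaking rule, and the example features below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence, Set


def majority_vote(votes: Sequence[Hashable]) -> Hashable:
    """Return the label predicted by the most voters.

    Ties are broken in favour of the label that appears first in `votes`
    (an assumption; the abstract does not describe tie handling).
    """
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    best = max(counts.values())
    # Walk the votes in their original order so ties resolve deterministically.
    for label in votes:
        if counts[label] == best:
            return label


def select_features_by_vote(selections: Sequence[Iterable[str]]) -> Set[str]:
    """Keep a feature only if more than half of the filter methods chose it."""
    # set(chosen) ensures each filter contributes at most one vote per feature.
    counts = Counter(f for chosen in selections for f in set(chosen))
    threshold = len(selections) / 2
    return {f for f, n in counts.items() if n > threshold}


# Hypothetical top-ranked features from three filter methods (made-up values).
filter_a = ["not_good", "excellent", "waste", "the"]
filter_b = ["excellent", "waste", "boring"]
filter_c = ["not_good", "excellent", "plot"]
selected = select_features_by_vote([filter_a, filter_b, filter_c])

# Hypothetical per-classifier predictions for a single document.
document_label = majority_vote(["positive", "negative", "positive"])
```

In this sketch a feature survives only when at least two of the three hypothetical filters agree on it, which is one simple way voting can shrink the feature space before the classifiers cast their own votes on the document label.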

Acknowledgments

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Consilience Creative program (IITP-2019-2015-0-00742) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).

Author information

Corresponding author

Correspondence to Young-Koo Lee.

About this article

Cite this article

Khan, J., Alam, A., Hussain, J. et al. EnSWF: effective features extraction and selection in conjunction with ensemble learning methods for document sentiment classification. Appl Intell 49, 3123–3145 (2019). https://doi.org/10.1007/s10489-019-01425-4

  • DOI: https://doi.org/10.1007/s10489-019-01425-4
