Data-Driven vs. Dictionary-Based Word n-Gram Feature Induction for Sentiment Analysis

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8105)

Abstract

We address the question of which word n-gram feature induction approach yields the most accurate discriminative model for machine learning-based sentiment analysis within a specific domain: purely data-driven word n-gram feature induction, or word n-gram feature induction based on a domain-specific or domain-non-specific polarity dictionary. We evaluate both approaches in document-level polarity classification experiments in two languages, English and German, each across four analogous domains: user-written product reviews of books, DVDs, electronics and music. We conclude that while dictionary-based feature induction leads to large dimensionality reductions, purely data-driven feature induction yields more accurate discriminative models.
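
The contrast between the two induction approaches can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the n-gram order, the feature weighting, how dictionary entries are matched, or which classifier is used, so the bigram limit, the binary presence weights, the exact-match lookup, and all function names below are assumptions. The two reviews and the four-entry polarity dictionary are invented toy data.

```python
from collections import Counter


def word_ngrams(tokens, n_max=2):
    """Return all word n-grams of order 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def induce_data_driven(documents, n_max=2, min_count=1):
    """Data-driven induction (assumed form): every n-gram that occurs at
    least min_count times in the training documents becomes a feature."""
    counts = Counter(
        gram
        for doc in documents
        for gram in word_ngrams(doc.lower().split(), n_max)
    )
    return sorted(gram for gram, count in counts.items() if count >= min_count)


def induce_dictionary_based(documents, polarity_dictionary, n_max=2):
    """Dictionary-based induction (assumed form): keep only those observed
    n-grams that exactly match an entry of the polarity dictionary."""
    observed = induce_data_driven(documents, n_max)
    return [gram for gram in observed if gram in polarity_dictionary]


def vectorize(document, features, n_max=2):
    """Map a document to a binary presence vector over the induced features."""
    present = set(word_ngrams(document.lower().split(), n_max))
    return [1 if feature in present else 0 for feature in features]


# Hypothetical toy data, for illustration only.
toy_reviews = [
    "a great book with a gripping plot",
    "boring plot and a terrible ending",
]
toy_dictionary = {"great", "gripping", "boring", "terrible"}

data_driven = induce_data_driven(toy_reviews)
dictionary_based = induce_dictionary_based(toy_reviews, toy_dictionary)

# Compare the dimensionality of the two feature spaces.
print(len(data_driven), len(dictionary_based))
```

On this toy input the dictionary-based feature set is a small subset of the data-driven one, which mirrors the dimensionality reduction described in the abstract. The sketch says nothing about classification accuracy, which depends on the learner and the evaluation data.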

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Remus, R., Rill, S. (2013). Data-Driven vs. Dictionary-Based Word n-Gram Feature Induction for Sentiment Analysis. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_18

  • DOI: https://doi.org/10.1007/978-3-642-40722-2_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40721-5

  • Online ISBN: 978-3-642-40722-2

  • eBook Packages: Computer Science, Computer Science (R0)
