Opinion mining from noisy text data

Original Paper


The proliferation of the Internet has led not only to huge volumes of unstructured information in the form of web documents, but also to large amounts of text in the form of emails, blogs, and customer feedback. Data generated from online communication is a potential gold mine for discovering knowledge, particularly for market researchers. Text analytics has matured and is being successfully employed to mine important information from unstructured text documents. The chief bottleneck in designing text mining systems for blogs arises from the fact that online communication text is often noisy. These texts are informally written and suffer from spelling mistakes, grammatical errors, improper punctuation, and irregular capitalization. This paper focuses on opinion extraction from noisy text data. It aims to extract and consolidate customer opinions from blogs and feedback at multiple levels of granularity. We propose a framework in which these texts are first cleaned using domain knowledge and then subjected to mining. Ours is a semi-automated approach, in which the system aids knowledge assimilation for knowledge-base building and also performs the analytics. Domain experts ratify the knowledge base and provide training samples from which the system automatically gathers more instances for ratification. The system identifies opinion expressions as phrases containing opinion words, opinionated features, and opinion modifiers. These expressions are categorized as positive or negative, with membership values varying from zero to one. Opinion expressions are identified and categorized using localized linguistic techniques. Opinions can be aggregated at any desired level of specificity, e.g. feature level or product level, user level or site level. We have developed a system based on this approach, which provides the user with a platform to analyze opinion expressions crawled from a set of pre-defined blogs.
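
The categorization and aggregation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon entries, the modifier weights, the clamping rule, and the names `OpinionExpression`, `categorize`, and `aggregate` are all assumptions made for illustration. It only shows how an expression made of an opinion word, a feature, and an optional modifier could receive a polarity with a membership value in [0, 1], and how such values could then be averaged at a chosen level.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical lexicons (assumptions); the paper's knowledge base is
# built semi-automatically and ratified by domain experts.
OPINION_WORDS = {
    "good": ("positive", 0.6),
    "excellent": ("positive", 0.9),
    "poor": ("negative", 0.6),
}
MODIFIERS = {"very": 1.3, "slightly": 0.5}  # assumed scaling factors


@dataclass(frozen=True)
class OpinionExpression:
    product: str
    feature: str
    opinion_word: str
    modifier: Optional[str] = None


def categorize(expr):
    """Return (polarity, membership) with membership clamped to [0, 1]."""
    polarity, membership = OPINION_WORDS[expr.opinion_word]
    strength = MODIFIERS.get(expr.modifier, 1.0)
    return polarity, max(0.0, min(1.0, membership * strength))


def aggregate(expressions, level):
    """Average memberships per polarity, grouped by an attribute name
    such as "feature" or "product"."""
    groups = defaultdict(lambda: {"positive": [], "negative": []})
    for expr in expressions:
        polarity, membership = categorize(expr)
        groups[getattr(expr, level)][polarity].append(membership)
    return {
        key: {pol: (sum(vals) / len(vals) if vals else 0.0)
              for pol, vals in by_polarity.items()}
        for key, by_polarity in groups.items()
    }
```

Calling `aggregate(expressions, "feature")` groups the scores by feature, while `aggregate(expressions, "product")` rolls the same expressions up to product level. User-level or site-level aggregation would follow the same pattern if those attributes were added to the record.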


Noisy text, Context-dependent cleaning, Opinion mining, WordNet, Text analytics for market knowledge discovery





Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Innovation Labs, Tata Consultancy Services, Udyog Vihar, Gurgaon, India
