Abstract
We describe an experiment in detecting emotions in texts from the Chinese microblog service Sina Weibo (www.weibo.com), using distant supervision based on author-supplied emotion labels (emoticons and smilies). Existing word segmentation tools proved unreliable; character-based features gave better accuracy, and higher-order n-grams proved to be useful features. Accuracy varied by label type and by emotion: smilies are used more often, but emoticons are more reliable. Happiness is the most accurately predicted emotion, with accuracies around 90% on both distant and gold-standard labels. The approach achieves high accuracy for happiness and anger but is less effective for sadness, surprise, disgust and fear, which human annotators also find difficult to detect.
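To make the two ideas in the abstract concrete, the following is a minimal sketch of character n-gram feature extraction and emoticon-based distant labelling. It is not the authors' implementation: the `EMOTICON_LABELS` map, the function names, and the choice to strip the marker from the text before feature extraction (a common practice in distant supervision) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical emoticon-to-emotion map (an assumption for illustration);
# the label inventory actually used in the chapter is not listed here.
EMOTICON_LABELS = {":)": "happiness", ":(": "sadness"}


def char_ngrams(text, max_n=3):
    """Count character n-grams of length 1..max_n, ignoring whitespace.

    No word segmentation is required, which is the appeal for Chinese text.
    """
    chars = [c for c in text if not c.isspace()]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    return counts


def distant_label(text):
    """Return (label, text_without_markers), or None if no unique label applies."""
    found = {label for mark, label in EMOTICON_LABELS.items() if mark in text}
    if len(found) != 1:
        return None  # no marker, or conflicting markers: skip this message
    # Remove the markers so a classifier cannot simply memorise them.
    for mark in EMOTICON_LABELS:
        text = text.replace(mark, "")
    return found.pop(), text.strip()
```

For example, `distant_label("今天真开心 :)")` yields a `"happiness"` label with the emoticon removed, and `char_ngrams` applied to the remaining text produces the sparse counts that a linear classifier could consume.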
Keywords
- Sentiment Analysis
- Word Segmentation
- Social Media Data
- Lexical Feature
- Human Annotator
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Notes
- 5. Available at: http://www.sojump.com/jq/1935017.aspx?npb=1.
- 9. That is how we constructed our training datasets for previous experiments.
Appendix
Fifty-six individuals completed our survey; the detailed results are presented in Table 9.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Yuan, Z., Purver, M. (2015). Predicting Emotion Labels for Chinese Microblog Texts. In: Gaber, M., Cocea, M., Wiratunga, N., Goker, A. (eds) Advances in Social Media Analysis. Studies in Computational Intelligence, vol 602. Springer, Cham. https://doi.org/10.1007/978-3-319-18458-6_7
DOI: https://doi.org/10.1007/978-3-319-18458-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18457-9
Online ISBN: 978-3-319-18458-6
eBook Packages: Engineering, Engineering (R0)