Abstract
We investigate the use of sentiment dictionaries to estimate sentiment for large document collections. Our goal is a semiautomatic method for extending a general sentiment dictionary to a specific target domain in a way that minimizes manual effort. General sentiment dictionaries may not contain terms important to the target domain, or may score terms in ways that are inappropriate for it. We combine statistical term identification with term evaluation using Amazon Mechanical Turk to extend the EmoLex sentiment dictionary for a domain-specific study of dengue fever. The same approach can be applied to any term-based sentiment dictionary or target domain. We explain how terms are identified for inclusion or re-evaluation and how Mechanical Turk is used to generate scores for the identified terms. We provide examples that compare EmoLex sentiment estimates before and after the dictionary is extended. We conclude by describing how our sentiment estimates can be integrated into an epidemiological surveillance system that includes sentiment visualization, and by discussing the strengths and limitations of our work.
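The idea of extending a general dictionary with domain-specific scores can be illustrated with a minimal sketch. It assumes a simplified dictionary that maps each term to a single valence score in [-1, 1]; EmoLex itself records term associations with emotions and sentiments, so this format, the function names, and every score below are illustrative assumptions rather than values or procedures from the study.

```python
from typing import Dict, List, Optional


def extend_dictionary(general: Dict[str, float],
                      domain: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `general` in which domain scores add or replace terms."""
    extended = dict(general)   # leave the general dictionary untouched
    extended.update(domain)    # domain scores take precedence
    return extended


def estimate_sentiment(tokens: List[str],
                       dictionary: Dict[str, float]) -> Optional[float]:
    """Mean score of the tokens found in the dictionary, or None if none match."""
    scores = [dictionary[t] for t in tokens if t in dictionary]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Toy scores, invented for illustration only (not taken from EmoLex or the paper).
general_scores = {"positive": 0.8, "relief": 0.6}
domain_scores = {"positive": -0.7, "dengue": -0.8}
extended_scores = extend_dictionary(general_scores, domain_scores)

tokens = "patient tested positive for dengue".split()
before = estimate_sentiment(tokens, general_scores)    # only "positive" matches
after = estimate_sentiment(tokens, extended_scores)    # "positive" and "dengue" match
```

In this toy case the general dictionary scores the sentence as pleasant, while the extended dictionary scores it as unpleasant, because one term is re-scored for the domain and one missing term is added.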
Notes
https://www.brandwatch.com, formerly Crimson Hexagon, a subscription service that provides “insights from 100 million sources and 1.4 trillion posts”
The odds ratio: the odds that a particular outcome occurs given a particular exposure, compared with the odds of the outcome in the absence of that exposure.
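The quantity described in this note is commonly called the odds ratio. A minimal sketch of the calculation for a 2×2 table of counts follows; the function name, argument names, and the example counts are assumptions made for illustration and are not taken from the study.

```python
def odds_ratio(exposed_with: int, exposed_without: int,
               unexposed_with: int, unexposed_without: int) -> float:
    """Odds of the outcome given exposure, divided by the odds absent exposure.

    The four arguments are the cell counts of a 2x2 table:
    exposed/unexposed crossed with outcome present/absent.
    """
    if exposed_without == 0 or unexposed_with == 0 or unexposed_without == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    odds_exposed = exposed_with / exposed_without
    odds_unexposed = unexposed_with / unexposed_without
    return odds_exposed / odds_unexposed


# Hypothetical counts: the outcome occurs in 30 of 100 exposed cases
# and in 10 of 100 unexposed cases.
example = odds_ratio(30, 70, 10, 90)   # (30/70) / (10/90) = 27/7, about 3.86
```

A value above 1 indicates that the outcome is more likely with the exposure than without it, and a value of 1 indicates no association.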
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Villanes, A., Healey, C.G. Domain-specific text dictionaries for text analytics. Int J Data Sci Anal 15, 105–118 (2023). https://doi.org/10.1007/s41060-022-00344-x
DOI: https://doi.org/10.1007/s41060-022-00344-x