Domain-specific text dictionaries for text analytics

Villanes, Andrea; Healey, Christopher G.

doi:10.1007/s41060-022-00344-x

Regular Paper
Published: 11 July 2022

Volume 15, pages 105–118 (2023)

International Journal of Data Science and Analytics

Abstract

We investigate the use of sentiment dictionaries to estimate sentiment for large document collections. Our goal in this paper is a semiautomatic method for extending a general sentiment dictionary for a specific target domain in a way that minimizes manual effort. General sentiment dictionaries may not contain terms important to the target domain or may score terms in ways that are inappropriate for the target domain. We combine statistical term identification and term evaluation using Amazon Mechanical Turk to extend the EmoLex sentiment dictionary to a domain-specific study of dengue fever. The same approach can be applied to any term-based sentiment dictionary or target domain. We explain how terms are identified for inclusion or re-evaluation and how Mechanical Turk generates scores for the identified terms. Examples are provided that compare EmoLex sentiment estimates before and after it is extended. We conclude by describing how our sentiment estimates can be integrated into an epidemiology surveillance system that includes sentiment visualization and discussing the strengths and limitations of our work.
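To make the idea of extending a term-based dictionary concrete, the following is a minimal sketch, not the authors' implementation. It assumes a dictionary that maps each term to a single score in [-1, 1]; the terms, the scores, and the function names `estimate_sentiment` and `extend_dictionary` are hypothetical and do not come from EmoLex or from the paper. It shows how a domain-specific score can add a missing term or override a general score, and how the document-level estimate shifts as a result.

```python
from statistics import mean


def estimate_sentiment(tokens, dictionary):
    """Return the mean score of tokens found in the dictionary, or None if none match."""
    scores = [dictionary[t] for t in tokens if t in dictionary]
    return mean(scores) if scores else None


def extend_dictionary(general, domain_scores):
    """Return a copy of the general dictionary with domain scores added or overriding."""
    extended = dict(general)
    extended.update(domain_scores)
    return extended


# Hypothetical general-purpose scores (illustrative only).
general = {"outbreak": -0.5, "positive": 0.8, "fever": -0.4}

# Hypothetical domain scores: "positive" is re-scored for a disease-testing
# context, and "dengue" is a term the general dictionary lacks.
domain = {"positive": -0.6, "dengue": -0.7}

tokens = "patient tested positive for dengue fever".split()

before = estimate_sentiment(tokens, general)
after = estimate_sentiment(tokens, extend_dictionary(general, domain))
print(before, after)
```

In this toy case the general dictionary reads "positive" as favorable, so the estimate drops once the domain score replaces it and "dengue" is included.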

Notes

https://www.brandwatch.com, formerly Crimson Hexagon, a subscription service that provides “insights from 100 million sources and 1.4 trillion posts”
The odds a particular outcome occurs given a particular exposure, versus the odds of the outcome absent the exposure.
https://www.brandwatch.com
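The odds ratio described in the second note can be computed from a 2×2 table of counts. The sketch below is illustrative only: the function name `odds_ratio`, its parameter names, and the example counts are assumptions, and how the paper applies the measure is not visible in this excerpt.

```python
def odds_ratio(exposed_with, exposed_without, unexposed_with, unexposed_without):
    """Odds of the outcome given the exposure, divided by the odds without it.

    Arguments are the four cell counts of a 2x2 contingency table.
    """
    if exposed_without == 0 or unexposed_with == 0 or unexposed_without == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    odds_exposed = exposed_with / exposed_without
    odds_unexposed = unexposed_with / unexposed_without
    return odds_exposed / odds_unexposed


# Hypothetical counts: the outcome occurs in 30 of 100 exposed cases
# and in 10 of 200 unexposed cases.
example = odds_ratio(30, 70, 10, 190)
print(round(example, 3))
```

A value above 1 indicates the outcome is more likely with the exposure than without it; a value near 1 indicates little association.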

Author information

Andrea Villanes and Christopher G. Healey have contributed equally to this work.

Authors and Affiliations

Department of Computer Science and Institute for Advanced Analytics, North Carolina State University, 890 Oval Drive, Raleigh, NC, 27695-8206, USA
Andrea Villanes & Christopher G. Healey

Authors

Andrea Villanes
Christopher G. Healey

Corresponding author

Correspondence to Christopher G. Healey.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Villanes, A., Healey, C.G. Domain-specific text dictionaries for text analytics. Int J Data Sci Anal 15, 105–118 (2023). https://doi.org/10.1007/s41060-022-00344-x

Received: 16 December 2021
Accepted: 24 June 2022
Published: 11 July 2022
Issue Date: January 2023
DOI: https://doi.org/10.1007/s41060-022-00344-x
