Yong, Kuan Shyang; Liew, Jasy Suet Yan

doi:10.1007/s10844-023-00791-3

The more "similar" the happier: Augmenting text using similarity scoring with neural embeddings for happiness classification

Research
Published: 02 May 2023

Journal of Intelligent Information Systems, Volume 60, pages 631–653 (2023)

Kuan Shyang Yong¹ &
Jasy Suet Yan Liew¹

Abstract

Measuring the happiness of populations of interest via Twitter offers social scientists an alternative way to gauge the level of happiness in and across different nations, but machine learning models are needed to scale happiness classification to millions of tweets. A well-performing happiness classifier requires a fair amount of training data with minimal noise. Our study introduces a similarity-based text augmentation method to efficiently expand data for the emotion “happiness” from an existing emotion corpus (EmoTweet-28): the most similar positive examples are selected from happiness tweets collected using distant supervision (DS) and added to an augmented corpus as training data. Six neural embeddings, in addition to the baseline bag-of-words (BoW) representation, were explored to compute the cosine similarity score between 100,000 DS tweets and 1,024 gold standard happiness tweets in EmoTweet-28 (ET). Our results show that the augmented training set obtained from the USE embedding with a similarity threshold of 0.7, used to train a BiLSTM, produced the best model for predicting whether a tweet contains expressions of happiness or not (F1 score = 0.599). However, most augmented training sets obtained from the InferSent-GloVe embedding produced BiLSTM classifiers with more consistent F1 scores above the base classifier in the fixed increment experiments. We show that our proposed text augmentation strategy can improve or maintain classification performance with small but cleaner increment sets, as opposed to adding DS tweets randomly as training data.
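
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional vectors, and the choice to score each DS tweet by its maximum cosine similarity to any gold standard tweet are all assumptions made for illustration, since the abstract does not state how the similarity scores are aggregated. In practice the vectors would come from an embedding such as USE or InferSent rather than being written by hand.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return pairwise cosine similarities between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalise each row to unit length; the clip guards against zero vectors.
    a_unit = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_unit = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return a_unit @ b_unit.T


def select_similar(ds_vecs, gold_vecs, threshold=0.7):
    """Pick DS tweets whose best match among the gold tweets reaches the threshold.

    Assumption: a DS tweet is scored by its maximum similarity to any gold tweet.
    """
    sims = cosine_similarity_matrix(ds_vecs, gold_vecs)
    best = sims.max(axis=1)
    keep = np.flatnonzero(best >= threshold)
    return keep, best


# Toy stand-ins for sentence embeddings (hypothetical values).
gold = np.array([[1.0, 0.0], [0.0, 1.0]])
ds = np.array([[0.9, 0.1], [1.0, 1.0], [-1.0, 0.0], [0.2, -1.0]])

selected, scores = select_similar(ds, gold, threshold=0.7)
print(selected.tolist())
print(np.round(scores, 3).tolist())
```

Only the rows returned in `selected` would be appended to the gold standard training set; raising the threshold yields a smaller but cleaner increment, which is the trade-off the abstract reports on.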

Data availability

The data that support the findings of this study are available from the corresponding author upon request.

Notes

Twitter: https://twitter.com/home
GloVe pre-trained word embedding: https://nlp.stanford.edu/projects/glove/
USE pre-trained sentence embedding: https://tfhub.dev/google/universal-sentence-encoder/4
InferSent pre-trained sentence embeddings: https://github.com/facebookresearch/InferSent


Acknowledgements

This research was supported by Universiti Sains Malaysia Short Term Grant (304/PKOMP/6315171). We would also like to thank the contributors of EmoTweet-28.

Funding

Universiti Sains Malaysia Short Term Grant (304/PKOMP/6315171).

Author information

Authors and Affiliations

School of Computer Sciences, Universiti Sains Malaysia, USM 18000, Penang, Malaysia
Kuan Shyang Yong & Jasy Suet Yan Liew

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Yong Kuan Shyang. The first draft of the manuscript was written by Yong Kuan Shyang and Jasy Liew Suet Yan was responsible for reviewing and editing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jasy Suet Yan Liew.

Ethics declarations

Ethical approval

Not applicable.

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yong, K.S., Liew, J.S.Y. The more "similar" the happier: Augmenting text using similarity scoring with neural embeddings for happiness classification. J Intell Inf Syst 60, 631–653 (2023). https://doi.org/10.1007/s10844-023-00791-3

Received: 06 March 2023
Revised: 20 April 2023
Accepted: 21 April 2023
Published: 02 May 2023
Issue Date: June 2023
DOI: https://doi.org/10.1007/s10844-023-00791-3
