Abstract
Measuring the happiness of populations of interest via Twitter offers social scientists an alternative way to gauge the level of happiness within and across nations, but machine learning models are needed to scale happiness classification to millions of tweets. A well-performing happiness classifier requires a fair amount of training data with minimal noise. Our study introduces a similarity-based text augmentation method that efficiently expands the data for the emotion “happiness” in an existing emotion corpus (EmoTweet-28): the happiness tweets collected using distant supervision (DS) that are most similar to the positive examples in the corpus are selected and added to an augmented corpus as training data. In addition to the baseline bag-of-words (BoW) representation, six neural embeddings were explored to compute the cosine similarity score between 100,000 DS tweets and the 1,024 gold-standard happiness tweets in EmoTweet-28 (ET). Our results show that a BiLSTM trained on the augmented training set obtained from the USE embedding with a similarity threshold of 0.7 produced the best model for predicting whether a tweet contains expressions of happiness (F1 score = 0.599). However, in the fixed increment experiments, most augmented training sets obtained from the InferSent-GloVe embedding produced BiLSTM classifiers whose F1 scores were more consistently above that of the base classifier. We show that our proposed text augmentation strategy can improve or maintain classification performance using small but cleaner increment sets, in contrast to adding DS tweets at random as training data.
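The selection step described in the abstract can be illustrated with a minimal sketch. It uses only the baseline BoW representation and keeps a DS tweet when its cosine similarity to at least one gold-standard tweet reaches the threshold. The function names, the whitespace tokenisation, the toy tweets, and the choice of the maximum similarity over all gold tweets as the selection criterion are assumptions made for illustration; the abstract does not specify how the scores are aggregated, and the neural embeddings (USE, InferSent) are not reproduced here.

```python
import math
from collections import Counter


def bow_vector(text):
    """Term-frequency vector from lowercased, whitespace-split tokens (assumed tokenisation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(ds_tweets, gold_tweets, threshold=0.7):
    """Keep DS tweets whose best similarity to any gold tweet meets the threshold.

    Using the maximum over gold tweets is an assumption, not the paper's stated rule.
    """
    gold_vectors = [bow_vector(t) for t in gold_tweets]
    selected = []
    for tweet in ds_tweets:
        vec = bow_vector(tweet)
        best = max((cosine_similarity(vec, g) for g in gold_vectors), default=0.0)
        if best >= threshold:
            selected.append((tweet, best))
    return selected


# Hypothetical toy data, not taken from EmoTweet-28 or the DS collection.
gold = ["so happy today", "feeling happy and blessed"]
ds = ["so happy today yay", "traffic is terrible", "feeling blessed"]
augmented = [tweet for tweet, _ in select_similar(ds, gold, threshold=0.7)]
```

Swapping `bow_vector` for a sentence encoder that returns dense vectors would leave the thresholding logic unchanged; only the similarity function would need to operate on dense arrays.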
Data availability
The data that support the findings of this study are available from the corresponding author upon request.
Notes
Twitter: https://twitter.com/home
GloVe pre-trained word embedding: https://nlp.stanford.edu/projects/glove/
USE pre-trained sentence embedding: https://tfhub.dev/google/universal-sentence-encoder/4
InferSent pre-trained sentence embeddings: https://github.com/facebookresearch/InferSent
Acknowledgements
This research was supported by Universiti Sains Malaysia Short Term Grant (304/PKOMP/6315171). We would also like to thank the contributors of EmoTweet-28.
Funding
Universiti Sains Malaysia Short Term Grant (304/PKOMP/6315171).
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Yong Kuan Shyang. The first draft of the manuscript was written by Yong Kuan Shyang, and Jasy Liew Suet Yan was responsible for reviewing and editing the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethical approval
Not applicable.
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Yong, K.S., Liew, J.S.Y. The more "similar" the happier: Augmenting text using similarity scoring with neural embeddings for happiness classification. J Intell Inf Syst 60, 631–653 (2023). https://doi.org/10.1007/s10844-023-00791-3
DOI: https://doi.org/10.1007/s10844-023-00791-3