The more "similar" the happier: Augmenting text using similarity scoring with neural embeddings for happiness classification

  • Research
Journal of Intelligent Information Systems

Abstract

Measuring the happiness of populations of interest via Twitter gives social scientists an alternative way to gauge the level of happiness within and across nations, but machine learning models are needed to scale happiness classification to millions of tweets. A well-performing happiness classifier requires a fair amount of training data with minimal noise. Our study introduces a similarity-based text augmentation method that efficiently expands the data for the emotion “happiness” in an existing emotion corpus, EmoTweet-28 (ET): from happiness tweets collected using distant supervision (DS), the positive examples most similar to the gold standard are selected and added to an augmented training corpus. Six neural embeddings, in addition to the baseline bag-of-words (BoW) representation, were explored to compute the cosine similarity score between 100,000 DS tweets and the 1,024 gold standard happiness tweets in ET. Our results show that a BiLSTM trained on the augmented training set obtained from the USE embedding with a similarity threshold of 0.7 produced the best model for predicting whether a tweet contains expressions of happiness (F1 score = 0.599). However, most augmented training sets obtained from the InferSent-GloVe embedding produced BiLSTM classifiers whose F1 scores were more consistently above that of the base classifier in the fixed increment experiments. We show that our proposed text augmentation strategy can improve or maintain classification performance with small but cleaner increment sets, as opposed to adding DS tweets randomly as training data.
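The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the toy 2-D vectors stand in for real sentence embeddings (such as USE or InferSent output), and scoring each DS tweet by its maximum cosine similarity to any gold standard tweet is an assumption, since the abstract does not say how scores are aggregated across the gold standard set.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) matrix of cosine similarities between rows."""
    # Clip the norms so an all-zero vector does not cause a division by zero.
    a_unit = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_unit = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return a_unit @ b_unit.T


def select_similar(ds_vecs: np.ndarray, gold_vecs: np.ndarray, threshold: float = 0.7):
    """Return indices of DS rows to keep, plus the per-row scores.

    Assumption (not stated in the abstract): a DS tweet is scored by its
    maximum similarity to any gold standard tweet.
    """
    scores = cosine_similarity_matrix(ds_vecs, gold_vecs).max(axis=1)
    keep = np.flatnonzero(scores >= threshold)
    return keep, scores


def top_k_similar(scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring DS tweets (a fixed-size increment)."""
    return np.argsort(-scores, kind="stable")[:k]


# Toy vectors standing in for embeddings of gold standard and DS tweets.
gold = np.array([[1.0, 0.0], [0.0, 1.0]])
ds = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])

keep, scores = select_similar(ds, gold, threshold=0.7)
best_one = top_k_similar(scores, k=1)
```

With real data, `ds` would hold one embedding per DS tweet and `gold` one per gold standard happiness tweet; the rows selected by `keep` (threshold-based) or `top_k_similar` (fixed increment) would then be appended to the training set as additional positive examples.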

Data availability

The data that support the findings of this study are available from the corresponding author upon request.

Notes

  1. Twitter: https://twitter.com/home

  2. GloVe pre-trained word embedding: https://nlp.stanford.edu/projects/glove/

  3. USE pre-trained sentence embedding: https://tfhub.dev/google/universal-sentence-encoder/4

  4. InferSent pre-trained sentence embeddings: https://github.com/facebookresearch/InferSent

Acknowledgements

This research was supported by Universiti Sains Malaysia Short Term Grant (304/PKOMP/6315171). We would also like to thank the contributors of EmoTweet-28.

Funding

Universiti Sains Malaysia Short Term Grant (304/PKOMP/6315171).

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Yong Kuan Shyang. The first draft of the manuscript was written by Yong Kuan Shyang, and Jasy Liew Suet Yan reviewed and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jasy Suet Yan Liew.

Ethics declarations

Ethical approval

Not applicable.

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yong, K.S., Liew, J.S.Y. The more "similar" the happier: Augmenting text using similarity scoring with neural embeddings for happiness classification. J Intell Inf Syst 60, 631–653 (2023). https://doi.org/10.1007/s10844-023-00791-3

  • DOI: https://doi.org/10.1007/s10844-023-00791-3
