Abstract
Including emojis when solving natural language processing problems (e.g., text-based emotion detection, sentiment classification, topic analysis) improves the quality of the results. However, the existing literature focuses only on the general meaning conveyed by emojis and has not examined emojis in the context of investor sentiment classification. This article provides a comprehensive study of the impact that including emojis has on predicting stock investors’ sentiment. We found that a classifier that incorporates domain-specific emoji vectors, which capture the syntax and semantics of emojis in the financial context, could improve the accuracy of investor sentiment classification. Also, when domain-specific emoji vectors are considered, the daily time series of investor sentiment demonstrated additional marginal explanatory power for returns and volatility. Further, a cluster analysis comparing domain-specific with domain-independent emoji vectors showed different natural groupings of emojis, reflecting domain specificity when the special meaning of emojis is considered. Finally, domain-specific emoji vectors could result in the development of significantly superior emoji sentiment lexicons. Given the importance of domain-specific emojis in investor sentiment classification of social media data, we have developed an emoji lexicon that can be used by other researchers.
Notes
Emojis have been officially supported on the Stocktwits platform since August 2016.
Loughran and McDonald (2011) report that three quarters of the words that are denoted as having negative sentiment by the Harvard Dictionary are not generally considered negative in financial contexts.
The main categories of emojis are: people and facial expressions, animals and nature, food and drinks, activities, travel and places, objects, symbols, and flags.
The measures of classification performance are discussed in Sect. 3.3.1.
The formula to measure the level of investor sentiment is discussed in Sect. 3.3.3.
We used the scikit-learn library to implement all of these classifiers with their default parameters.
Note that the goal of this research is not to produce state-of-the-art results in investor sentiment analysis; our aim is to show whether the domain-specific emoji vectors improve the classification performance of investor sentiment.
Realized volatility is measured using the 5-min sub-sampled intra-day volatility measure available at https://realized.oxford-man.ox.ac.uk/.
Available at http://www.policyuncertainty.com/us_daily.html.
Available at http://www.cboe.com/vix.
References
Aalborg HA, Molnár P, de Vries JE (2019) What can explain the price, volatility and trading volume of bitcoin? Finance Res Lett 29:255–265. https://doi.org/10.1016/j.frl.2018.08.010
Antweiler W, Frank MZ (2004) Is all that talk just noise? The information content of internet stock message boards. J Finance 59(3):1259–1294. https://doi.org/10.1111/j.1540-6261.2004.00662.x
Atkins A, Niranjan M, Gerding E (2018) Financial news predicts stock market volatility better than close price. J Finance Data Sci 4(2):120–137. https://doi.org/10.1016/j.jfds.2018.02.002
Baker M, Wurgler J (2006) Investor sentiment and the cross-section of stock returns. J Finance 61(4):1645–1680. https://doi.org/10.1111/j.1540-6261.2006.00885.x
Baker M, Wurgler J (2007) Investor sentiment in the stock market. J Econ Perspect 21(2):129–152. https://doi.org/10.1257/jep.21.2.129
Barbieri F, Kruszewski G, Ronzano F, Saggion H (2016) How cosmopolitan are emojis?: exploring emojis usage and meaning over different languages with distributional semantics. In: Proceedings of the 24th ACM international conference on multimedia. Association for Computing Machinery, Amsterdam, pp 531–535
Bishop CM (2006) Pattern recognition and machine learning, 1st edn. Springer, New York
Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS ONE 12(6):1–17. https://doi.org/10.1371/journal.pone.0177678
Brown GW, Cliff MT (2004) Investor sentiment and the near-term stock market. J Empir Finance 11(1):1–27. https://doi.org/10.1016/j.jempfin.2002.12.001
Cavallo M, Demiralp ÇA (2019) Clustrophile 2: guided visual clustering analysis. IEEE Trans Visual Comput Graph 25(1):267–276. https://doi.org/10.1109/TVCG.2018.2864477
Chau F, Deesomsak R, Koutmos D (2016) Does investor sentiment really matter? Int Rev Financ Anal 48:221–232. https://doi.org/10.1016/j.irfa.2016.10.003
Cookson JA, Niessner M (2020) Why don’t we agree? Evidence from a social network of investors. J Finance 75(1):173–228. https://doi.org/10.1111/jofi.12852
Corsi F (2009) A simple approximate long-memory model of realized volatility. J Financ Econom 7(2):174–196. https://doi.org/10.1093/jjfinec/nbp001
Da Z, Engelberg J, Gao P (2015) The sum of all FEARS investor sentiment and asset prices. Rev Financ Stud 28(1):1–32. https://doi.org/10.1093/rfs/hhu072
Danesi M (2016) The semiotics of emoji: the rise of visual language in the age of the internet, 1st edn. Bloomsbury Academic, London
Das SR, Chen MY (2007) Yahoo! for Amazon: sentiment extraction from small talk on the web. Manage Sci 53(9):1375–1388. https://doi.org/10.1287/mnsc.1070.0704
De Long JB, Shleifer A, Summers LH, Waldmann RJ (1990) Noise trader risk in financial markets. J Polit Econ 98(4):703–738
De Vries NJ, Olech ŁP, Moscato P (2019) Introducing clustering with a focus in marketing and consumer analysis. In: De Vries NJ, Moscato P (eds) Business and consumer analytics: new ideas. Springer, Berlin, pp 154–175
Deng L, Wiebe J, Choi Y (2014) Joint inference and disambiguation of implicit sentiments via implicature constraints. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, pp 79–88
Deveikyte J, Geman H, Piccari C, Provetti A (2020) A sentiment analysis approach to the prediction of market volatility. arXiv preprint arXiv:2012.05906
Dimson T (2015) Emojineering part 1: machine learning for emoji trends. Instagram Eng Blog 30:52
Eisner B, Rocktäschel T, Augenstein I, Bosnjak M, Riedel S (2016) emoji2vec: learning emoji representations from their description. In: Proceedings of the fourth international workshop on natural language processing for social media. Association for Computational Linguistics, Austin, pp 48–54
Esuli A, Sebastiani F (2006) SENTIWORDNET: a publicly available lexical resource for opinion mining. In: Proceedings of the fifth international conference on language resources and evaluation (LREC'06). European Language Resources Association (ELRA), Genoa, pp 417–422
Felbo B, Mislove A, Søgaard A, Rahwan I, Lehmann S (2017) Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In: Proceedings of the 2017 conference on empirical methods in natural language processing. Association for Computational Linguistics, Copenhagen, pp 1615–1625
Fernández-Gavilanes M, Juncal-Martínez J, García-Méndez S, Costa-Montenegro E, González-Castaño FJ (2018) Creating emoji lexica from unsupervised sentiment analysis of their descriptions. Expert Syst Appl 103:74–91. https://doi.org/10.1016/j.eswa.2018.02.043
Godin F, Vandersmissen B, De Neve W, Van de Walle R (2015) Multimedia Lab @ ACL WNUT NER shared task: named entity recognition for Twitter microposts using distributed word representations. In: Proceedings of the workshop on noisy user-generated text. Association for Computational Linguistics, Beijing, pp 146–153
Goldman E (2018) Emojis and the law. Wash Law Rev 93(3):1227–1291
Grabowski P (2016) Could a smiley make you buy? How using emoji in marketing affects conversions [AdEspresso’s experiment]. Retrieved from https://adespresso.com/blog/emoji-marketing-affects-conversions/
Gupta S, Singh R, Singh J (2020) A hybrid approach for enhancing accuracy and detecting sarcasm in sentiment analysis. In: 2020 IEEE international conference on computing, power and communication technologies (GUCON), 2–4 Oct 2020
Hamilton WL, Clark K, Leskovec J, Jurafsky D (2016) Inducing domain-specific sentiment lexicons from unlabeled corpora. In: Proceedings of the 2016 conference on empirical methods in natural language processing. Association for Computational Linguistics, Austin, pp 595–605
Hovy D (2015) Demographic factors improve classification performance. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: long papers). Association for Computational Linguistics, Beijing, pp 752–762
Kamps J, Marx M, Mokken RJ, de Rijke M (2004) Using WordNet to measure semantic orientations of adjectives. In: Proceedings of the fourth international conference on language resources and evaluation (LREC'04). European Language Resources Association, Lisbon, pp 1115–1118
Katarya R, Meena SK (2021) Machine learning techniques for heart disease prediction: a comparative study and analysis. Health Technol 11(1):87–97. https://doi.org/10.1007/s12553-020-00505-7
Keim D, Andrienko G, Fekete J-D, Görg C, Kohlhammer J, Melançon G (2008) Visual analytics: definition, process, and challenges. In: Kerren A, Stasko JT, Fekete J-D, North C (eds) Information visualization: human-centered issues and perspectives. Springer, Berlin, pp 154–175
Kho SJ, Padhee S, Bajaj G, Thirunarayan K, Sheth A (2019) Domain-specific use cases for knowledge-enabled social media analysis. In: Agarwal N, Dokoohaki N, Tokdemir S (eds) Emerging research challenges and opportunities in computational social network analysis and mining. Springer International Publishing, Cham, pp 233–246
Kim S-H, Kim D (2014) Investor sentiment from internet message postings and the predictability of stock returns. J Econ Behav Organ 107:708–729. https://doi.org/10.1016/j.jebo.2014.04.015
Kim N, Lučivjanská K, Molnár P, Villa R (2019) Google searches and stock market activity: evidence from Norway. Finance Res Lett 28:208–220. https://doi.org/10.1016/j.frl.2018.05.003
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning, vol 32. Beijing
Lebduska L (2014) Emoji, emoji, what for art thou? Harlot: A revealing look at the arts of persuasion 1(12)
Lerner JS, Li Y, Valdesolo P, Kassam KS (2015) Emotion and decision making. Annu Rev Psychol 66(1):799–823. https://doi.org/10.1146/annurev-psych-010213-115043
Liang W-L (2016) Sensitivity to investor sentiment and stock performance of open market share repurchases. J Bank Finance 71:75–94. https://doi.org/10.1016/j.jbankfin.2016.06.003
Liang C, Tang L, Li Y, Wei Y (2020) Which sentiment index is more informative to forecast stock market volatility? Evidence from China. Int Rev Financ Anal 71:101552. https://doi.org/10.1016/j.irfa.2020.101552
Linderman GC, Steinerberger S (2017) Clustering with t-SNE, provably. arXiv preprint arXiv:1706.02582
Liu K-L, Li W-J, Guo M (2012) Emoticon smoothed language models for Twitter sentiment analysis. In: Proceedings of the twenty-sixth AAAI conference on artificial intelligence. AAAI Press, Toronto, pp 1678–1684
Ljubešić N, Fišer D (2016) A global analysis of emoji usage. In: Proceedings of the 10th web as corpus workshop. Association for Computational Linguistics, Berlin, pp 82–89
Loughran T, McDonald B (2011) When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. J Finance 66(1):35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
Mahmoudi N, Docherty P, Moscato P (2018) Deep neural networks understand investors better. Decis Support Syst 112:23–34. https://doi.org/10.1016/j.dss.2018.06.002
McCulloch G, Gawne L (2018) Emoji grammar as beat gestures. In: Proceedings of the 1st international workshop on emoji understanding and applications in social media (Emoji2018). CEUR workshop proceedings, Stanford
Miah Y, Prima CNE, Seema SJ, Mahmud M, Shamim Kaiser M (2021) Performance comparison of machine learning techniques in identifying dementia from open access clinical datasets. In: Saeed F, Al-Hadhrami T, Mohammed F, Mohammed E (eds) Advances on Smart and Soft Computing. Advances in Intelligent Systems and Computing, vol 1188. Springer, Singapore. https://doi.org/10.1007/978-981-15-6048-4_8
Mian GM, Sankaraguruswamy S (2012) Investor sentiment and stock market response to earnings news. Acc Rev 87(4):1357–1384. https://doi.org/10.2308/accr-50158
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems. Curran Associates Inc., Lake Tahoe, pp 3111–3119
Milanović S, Marković N, Pamučar D, Gigović L, Kostić P, Milanović SD (2021) Forest fire probability mapping in Eastern Serbia: logistic regression versus random forest method. Forests 12(1):5
Miller H, Thebault-Spieker J, Chang S, Johnson I, Terveen L, Hecht B (2016) "Blissfully happy" or "ready to fight": varying interpretations of emoji. In: Proceedings of the 10th international conference on web and social media. AAAI Press, Cologne, pp 259–268
Mohammadi M, Rashid TA, Karim SHT, Aldalwie AHM, Tho QT, Bidaki M, Rahmani AM, Hosseinzadeh M (2021) A comprehensive survey and taxonomy of the SVM-based intrusion detection systems. J Netw Comput Appl 178:102983. https://doi.org/10.1016/j.jnca.2021.102983
Naeem MA, Farid S, Faruk B, Shahzad SJH (2020) Can happiness predict future volatility in stock markets? Res Int Bus Finance 54:101298. https://doi.org/10.1016/j.ribaf.2020.101298
Novak PK, Smailović J, Sluban B, Mozetič I (2015) Sentiment of emojis. PLoS ONE 10(12):1–22. https://doi.org/10.1371/journal.pone.0144296
Olech ŁP, Paradowski M (2016) Hierarchical Gaussian mixture model with objects attached to terminal and non-terminal dendrogram nodes. In: Burduk R, Jackowski K, Kurzyński M, Woźniak M, Żołnierek A (eds) Proceedings of the 9th international conference on computer recognition systems CORES 2015. Springer International Publishing, Wroclaw, pp 191–201
Olech ŁP, Spytkowski M, Kwaśnicka H, Michalewicz Z (2021) Hierarchical data generator based on tree-structured stick breaking process for benchmarking clustering methods. Inf Sci 554:99–119. https://doi.org/10.1016/j.ins.2020.12.020
Oliveira N, Cortez P, Areal N (2013) On the predictability of stock market behavior using StockTwits sentiment and posting volume. In: Correia L, Reis LP, Cascalho J (eds) Progress in artificial intelligence. Springer, Berlin, pp 355–365
Oliveira N, Cortez P, Areal N (2016) Stock market sentiment lexicon acquisition using microblogging data and statistical measures. Decis Supp Syst 85:62–73. https://doi.org/10.1016/j.dss.2016.02.013
Pavalanathan U, Eisenstein J (2016) More emojis, less :) the competition for paralinguistic function in microblog writing. First Monday. https://doi.org/10.5210/fm.v21i11.6879
Prakash KB, Kanagachidambaresan GR (2021) Introduction to tensorflow package. In: Prakash KB, Kanagachidambaresan GR (eds) Programming with tensorFlow: solution for edge computing applications. Springer International Publishing, Cham, pp 1–4. https://doi.org/10.1007/978-3-030-57077-4_1
Rao D, Ravichandran D (2009) Semi-supervised polarity lexicon induction. In: Proceedings of the 12th conference of the European Chapter of the Association for computational linguistics. Association for Computational Linguistics, Athens, pp 675–682
Reis PMN, Pinho C (2020) A new European investor sentiment index (EURsent) and its return and volatility predictability. J Behav Exp Finance 27:100373. https://doi.org/10.1016/j.jbef.2020.100373
Renault T (2017) Intraday online investor sentiment and return patterns in the U.S. stock market. J Bank Finance 84:25–40. https://doi.org/10.1016/j.jbankfin.2017.07.002
San Vicente I et al (2014) Simple, robust and (almost) unsupervised generation of polarity lexicons for multiple languages. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics. Association for Computational Linguistics, Gothenburg, pp 88–97. https://doi.org/10.3115/v1/E14-1010
Seok SI, Cho H, Ryu D (2019) Firm-specific investor sentiment and the stock market response to earnings news. N Am J Econ Finance 48:221–240. https://doi.org/10.1016/j.najef.2019.01.014
Shaham U, Steinerberger S (2017) Stochastic neighbor embedding separates well-separated clusters. arXiv preprint arXiv:1702.02670
Shynkevich Y, McGinnity TM, Coleman S, Belatreche A (2015) Predicting stock price movements based on different categories of news articles. In: 2015 IEEE symposium series on computational intelligence, 7–10 Dec 2015
Spytkowski M, Kwasnicka H (2012) Hierarchical clustering through bayesian inference. In: Nguyen NT, Hoang K, Jȩdrzejowicz P (eds) Computational Collective Intelligence. Technologies and Applications. ICCCI 2012. Lecture Notes in Computer Science, vol 7653. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34630-9_53
Spytkowski M, Olech ŁP, Kwaśnicka H (2016) Hierarchy of groups evaluation using different F-score variants. In: Nguyen NT, Trawiński B, Fujita H, Hong T-P (eds) Intelligent information and database systems. Springer, Berlin, pp 654–664
Stambaugh RF, Yu J, Yuan Y (2012) The short of it: investor sentiment and anomalies. J Financ Econ 104(2):288–302. https://doi.org/10.1016/j.jfineco.2011.12.001
Turney PD, Littman ML (2003) Measuring praise and criticism: inference of semantic orientation from association. ACM Trans Inf Syst 21(4):315–346. https://doi.org/10.1145/944012.944013
van der Maaten L (2014) Accelerating t-SNE using tree-based algorithms. J Mach Learn Res 15:3221–3245
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Vidakovic B (2013) Engineering biostatistics: an introduction using MATLAB and WinBUGS, 1st edn. Wiley, Hoboken
Weiss GM (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explor Newsl 6(1):7–19
Widdows D, Dorow B (2002) A graph model for unsupervised lexical acquisition. In: Proceedings of the 19th international conference on computational linguistics. Association for Computational Linguistics, Sinica, Taipei, pp 1–7
Wijeratne S, Balasuriya L, Sheth A, Doran D (2017) A semantics-based measure of emoji similarity. In: Proceedings of the international conference on web intelligence. ACM, New York, pp 646–653
Willoughby JF, Liu S (2018) Do pictures help tell the story? An experimental test of narrative and emojis in a health text message intervention. Comput Hum Behav 79:75–82. https://doi.org/10.1016/j.chb.2017.10.031
Wu Q-W, Xia J-F, Ni J-C, Zheng C-H (2021) GAERF: predicting lncRNA-disease associations by graph auto-encoder and random forest. Brief Bioinform. https://doi.org/10.1093/bib/bbaa391
Yang Y, Eisenstein J (2015) Putting things in context: community-specific embedding projections for sentiment analysis. arXiv preprint arXiv:1511.06052
Yu J, Yuan Y (2011) Investor sentiment and the mean–variance relation. J Financ Econ 100(2):367–381. https://doi.org/10.1016/j.jfineco.2010.10.011
Zhao G, Liu Z, Chao Y, Qian X (2020) CAPER: context-aware personalized emoji recommendation. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2020.2966971
Acknowledgements
We would like to thank Zbigniew Michalewicz for his contribution during various stages of the paper preparation. Additionally, we are grateful to the editor and the anonymous reviewers for careful reading of the manuscript and their insightful critical observations that have helped improve our paper. We also express our sincere gratitude to Stocktwits® for providing their data.
Appendix: Details of Hk++
Hk++ (Olech and Paradowski 2016) adopts the Gaussian mixture model (GMM), in which clusters (mixtures) are defined as Gaussian distributions (Bishop 2006). In a GMM, a clustering with \(n\) clusters is described by the following probability distribution function (pdf):
\[p\left(\mu ,\varSigma ,\psi \right)=\sum _{i=1}^{n}{\psi }_{i}N\left({\mu }_{i},{\varSigma }_{i}\right)\quad (16)\]
where \(\mu \) is a set of all cluster centers (\({\mu }_{i}\in \mu \)), \(\varSigma \) is a set of all cluster covariance matrices (\({\varSigma }_{i}\in \varSigma \)) and \(\psi \) is a set of mixture weights (\({\psi }_{i}\in \psi \)) such that \({\psi }_{i}\) ranges from 0 to 1 (inclusive on both sides) and \(\sum _{i=1}^{n}{\psi }_{i}=1\). \(N\left({\mu }_{i},{\varSigma }_{i}\right)\) is a multivariate Gaussian distribution of the \(i\)-th mixture with center \({\mu }_{i}\) and covariance matrix \({\varSigma }_{i}\). Hk++ uses multiple GMMs and recursively (breadth-first) arranges them in a hierarchical structure. That hierarchical structure is an object cluster hierarchy, where every parent with its immediate children is represented by an extended GMM pdf. In the extended version, the GMM models the child nodes in the presence of an additional background component \(N\left({\mu }_{B},{\varSigma }_{B}\right)\) representing the parent node:
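A form of this extended pdf that is consistent with the surrounding definitions can be written as follows. It assumes that \(\alpha \) weights the background component and \(1-\alpha \) weights the child mixtures; this assignment is an assumption and is not confirmed by the text:

\[p\left(\mu ,\varSigma ,\psi ,\alpha \right)=\alpha N\left({\mu }_{B},{\varSigma }_{B}\right)+\left(1-\alpha \right)\sum _{i=1}^{n}{\psi }_{i}N\left({\mu }_{i},{\varSigma }_{i}\right)\]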
where the balance between a parent and its children is defined using parameter \(\alpha \) that ranges from 0 to 1 (inclusive on both sides), \({\mu }_{B}\) is the center of a parent node, \({\varSigma }_{B}\) is the covariance matrix of a parent node, and the remaining variables are defined as in Eq. (16). The role of the background component is to capture sparsely distributed data, enabling the child clusters to be discovered on a filtered (denoised) set of data points (Olech and Paradowski 2016). Model parameters are estimated using the expectation maximization (EM) algorithm (Bishop 2006). This algorithm consists of two steps: reassignment of points to mixtures (expectation step), and recalculation of \({\mu }_{i}\), \({\varSigma }_{i}\) for every mixture (except for the background component) based on the updated assignment (maximization step). These steps are repeated until the pdf converges or a predefined number of iterations is reached. In the implemented visual clustering procedure, the Hk++ parameters, including the number of iterations, are set dynamically by the human operator.
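The two EM steps described above can be illustrated with a minimal NumPy sketch. It is not the Hk++ implementation. The function names `gaussian_pdf` and `em_with_background` are hypothetical, and three choices are assumptions made for illustration: \(\alpha \) weights the background component and \(1-\alpha \) the children, the mixture weights \({\psi }_{i}\) are re-estimated in the maximization step as in standard EM, and a small ridge term is added to each covariance for numerical stability.

```python
import numpy as np


def gaussian_pdf(X, mean, cov):
    """Multivariate Gaussian density N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))


def em_with_background(X, mu, sigma, psi, mu_b, sigma_b, alpha,
                       n_iter=100, tol=1e-8):
    """EM for child Gaussians plus a fixed background (parent) Gaussian."""
    mu = np.array(mu, dtype=float)
    sigma = np.array(sigma, dtype=float)
    psi = np.array(psi, dtype=float)
    d = X.shape[1]
    k = len(psi)
    # The background component is never updated, so evaluate it once.
    # Assumption: alpha weights the background, (1 - alpha) the children.
    bg = alpha * gaussian_pdf(X, np.asarray(mu_b, dtype=float),
                              np.asarray(sigma_b, dtype=float))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation step: soft reassignment of points to mixtures
        # (column 0 is the background, columns 1..k are the children).
        dens = np.column_stack(
            [bg] + [(1 - alpha) * psi[i] * gaussian_pdf(X, mu[i], sigma[i])
                    for i in range(k)]
        )
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # Maximization step: recompute mu_i and Sigma_i for children only.
        for i in range(k):
            r = resp[:, i + 1]
            mu[i] = r @ X / r.sum()
            diff = X - mu[i]
            # Small ridge term (an assumption) keeps covariances invertible.
            sigma[i] = (r[:, None] * diff).T @ diff / r.sum() + 1e-6 * np.eye(d)
        # Assumption: mixture weights are re-estimated too, as in standard EM.
        psi = resp[:, 1:].sum(axis=0)
        psi = psi / psi.sum()
        # Stop when the log-likelihood converges or after n_iter iterations.
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu, sigma, psi, resp
```

The function could be called with initial child centers, covariances and weights, together with the parent's mean and covariance as the background component. The first column of the returned responsibility matrix then indicates how strongly each point is attributed to the parent rather than to any child.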
Cite this article
Mahmoudi, N., Olech, Ł.P. & Docherty, P. A comprehensive study of domain-specific emoji meanings in sentiment classification. Comput Manag Sci 19, 159–197 (2022). https://doi.org/10.1007/s10287-021-00407-7
DOI: https://doi.org/10.1007/s10287-021-00407-7