Abstract
We use a large dataset of one million messages sent on the microblogging platform StockTwits to evaluate the performance of a wide range of preprocessing methods and machine learning algorithms for sentiment analysis in finance. We find that adding bigrams and emojis significantly improves sentiment classification performance. However, more complex and time-consuming machine learning methods, such as random forests or neural networks, do not improve the accuracy of the classification. We also provide empirical evidence that the preprocessing method and the size of the dataset have a strong impact on the correlation between investor sentiment and stock returns. While investor sentiment and stock returns are highly correlated, we do not find that investor sentiment derived from messages sent on social media helps in predicting the returns of large-capitalization stocks at a daily frequency.
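As an illustration of the kind of preprocessing the abstract refers to, the following minimal sketch tokenizes a message while keeping emojis as tokens and then appends bigrams to the unigram features. The token pattern, the cashtag handling, the emoji ranges, and the function names are assumptions made for this example; they are not taken from the article.

```python
import re

# Hypothetical token pattern (an assumption, not the article's tokenizer):
# cashtags, plain words, or single characters from common emoji blocks.
TOKEN_RE = re.compile(
    r"\$[A-Za-z]+"                              # cashtags such as $AAPL
    r"|[A-Za-z']+"                              # plain words
    r"|[\U0001F300-\U0001FAFF\u2600-\u27BF]"    # common emoji blocks (not exhaustive)
)


def tokenize(message):
    """Lowercase the message tokens, keeping each emoji as its own token."""
    return [token.lower() for token in TOKEN_RE.findall(message)]


def unigrams_and_bigrams(tokens):
    """Return the unigrams followed by all adjacent-token bigrams."""
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = unigrams_and_bigrams(tokenize("$AAPL to the moon 🚀"))
```

Features built this way could then be counted and passed to any standard classifier; dropping the emoji alternative from the pattern gives a words-only baseline for comparison.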
Notes
We do not explore the impact of word embeddings (GloVe, Word2Vec) in this article. We leave this for future research.
The accuracy increases by only 0.31 percentage points when the size of the dataset increases from 500,000 messages to 1 million messages.
Natural Language Toolkit: https://www.nltk.org/.
We also find that the accuracy reaches a plateau around 250,000–500,000 messages when a Logistic Regression algorithm is used.
Ranco et al. (2015) state that “to achieve the performance of human experts, a large enough set of tweets has to be manually annotated, in our case, over 100,000”.
The time will differ depending on the computing power and the optimization of the script.
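The diminishing returns to dataset size mentioned in the notes above can be quantified with a small helper that reports the accuracy gain, in percentage points, between successive training-set sizes. This is a minimal sketch: the function names and the 0.5-point plateau threshold are assumptions chosen for illustration, and no accuracy figures from the article are reproduced here.

```python
def percentage_point_gains(sizes, accuracies):
    """Return (larger_size, gain) pairs between successive dataset sizes.

    Accuracies are expected in percent, so each gain is in percentage points.
    """
    pairs = sorted(zip(sizes, accuracies))
    return [
        (size_b, round(acc_b - acc_a, 2))
        for (_, acc_a), (size_b, acc_b) in zip(pairs, pairs[1:])
    ]


def plateau_size(sizes, accuracies, threshold=0.5):
    """Return the first size after which the gain drops below `threshold`.

    The threshold is a hypothetical cutoff; returns None if no gain is that small.
    """
    pairs = sorted(zip(sizes, accuracies))
    for (size_a, acc_a), (_, acc_b) in zip(pairs, pairs[1:]):
        if acc_b - acc_a < threshold:
            return size_a
    return None
```

Calling `percentage_point_gains` on the measured accuracy at each training-set size gives the marginal benefit of each increase in data, and `plateau_size` flags where that benefit becomes negligible under the chosen threshold.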
References
Ahmad, K., Han, J., Hutson, E., Kearney, C., & Liu, S. (2016). Media-expressed negative tone and firm-level stock returns. Journal of Corporate Finance, 37, 152–172.
Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance, 59(3), 1259–1294.
Avery, C. N., Chevalier, J. A., & Zeckhauser, R. J. (2015). The “CAPS” prediction system and stock market returns. Review of Finance, 20(4), 1363–1381.
Bartov, E., Faurel, L., & Mohanram, P. S. (2017). Can Twitter help predict firm-level earnings and stock returns? The Accounting Review, 93(3), 25–57.
Chen, C. Y.-H., Despres, R., Guo, L., & Renault, T. (2019). What makes cryptocurrencies special? Investor sentiment and return predictability during the bubble. Working paper.
Das, S. R., & Chen, M. Y. (2007). Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53(9), 1375–1388.
Garcia, D. (2013). Sentiment during recessions. The Journal of Finance, 68(3), 1267–1300.
Leung, H., & Ton, T. (2015). The impact of internet stock message boards on cross-sectional returns of small-capitalization stocks. Journal of Banking & Finance, 55, 37–55.
Li, F. (2010). The information content of forward-looking statements in corporate filings—A naïve bayesian machine learning approach. Journal of Accounting Research, 48(5), 1049–1102.
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65.
Mahmoudi, N., Docherty, P., & Moscato, P. (2018). Deep neural networks understand investors better. Decision Support Systems.
Oliveira, N., Cortez, P., & Areal, N. (2016). Stock market sentiment lexicon acquisition using microblogging data and statistical measures. Decision Support Systems, 85, 62–73.
Price, S. M., Doran, J. S., Peterson, D. R., & Bliss, B. A. (2012). Earnings conference calls and stock returns: The incremental informativeness of textual tone. Journal of Banking & Finance, 36(4), 992–1011.
Ranco, G., Aleksovski, D., Caldarelli, G., Grčar, M., & Mozetič, I. (2015). The effects of Twitter sentiment on stock price returns. PloS One, 10(9), e0138441.
Renault, T. (2017). Intraday online investor sentiment and return patterns in the US stock market. Journal of Banking & Finance, 84, 25–40.
Saif, H., Fernández, M., He, Y., & Alani, H. (2014). On stopwords, filtering and data sparsity for sentiment analysis of Twitter.
Sprenger, T. O., Sandner, P. G., Tumasjan, A., & Welpe, I. M. (2014). News or noise? Using Twitter to identify and understand company-specific news flow. Journal of Business Finance & Accounting, 41(7–8), 791–830.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139–1168.
Tetlock, P. C., Saar-Tsechansky, M., & Macskassy, S. (2008). More than words: Quantifying language to measure firms’ fundamentals. The Journal of Finance, 63(3), 1437–1467.
Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Volume 2 (pp. 90–94). Association for Computational Linguistics.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Renault, T. Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digit Finance 2, 1–13 (2020). https://doi.org/10.1007/s42521-019-00014-x
DOI: https://doi.org/10.1007/s42521-019-00014-x