Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages

  • Original Article
  • Digital Finance

Abstract

We use a large dataset of one million messages sent on the microblogging platform StockTwits to evaluate the performance of a wide range of preprocessing methods and machine learning algorithms for sentiment analysis in finance. We find that adding bigrams and emojis significantly improves sentiment classification performance. However, more complex and time-consuming machine learning methods, such as random forests or neural networks, do not improve the accuracy of the classification. We also provide empirical evidence that the preprocessing method and the size of the dataset have a strong impact on the correlation between investor sentiment and stock returns. While investor sentiment and stock returns are highly correlated, we do not find that investor sentiment derived from messages sent on social media helps in predicting large-capitalization stock returns at a daily frequency.
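As a rough illustration of the preprocessing choices mentioned in the abstract, the sketch below builds unigram and bigram features from a message and optionally keeps emojis as tokens. It is not the article's script: the function names, the token pattern, and the emoji code-point ranges (a partial subset of Unicode) are assumptions made for this example.

```python
import re

# Assumption: a rough, non-exhaustive subset of Unicode emoji ranges.
EMOJI_CLASS = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
EMOJI_RE = re.compile(EMOJI_CLASS)
# A token is either a single emoji or a run of letters, digits and apostrophes.
TOKEN_RE = re.compile(EMOJI_CLASS + r"|[a-z0-9']+")


def tokenize(message, keep_emojis=True):
    """Lowercase the message and split it into word and emoji tokens."""
    tokens = TOKEN_RE.findall(message.lower())
    if not keep_emojis:
        tokens = [t for t in tokens if not EMOJI_RE.fullmatch(t)]
    return tokens


def features(message, use_bigrams=True, keep_emojis=True):
    """Return unigram features, optionally followed by adjacent-pair bigrams."""
    unigrams = tokenize(message, keep_emojis)
    if not use_bigrams:
        return unigrams
    bigrams = [f"{a} {b}" for a, b in zip(unigrams, unigrams[1:])]
    return unigrams + bigrams


if __name__ == "__main__":
    print(features("Not good \U0001F4C9"))
```

With bigrams enabled, a message such as "Not good" yields the feature "not good" in addition to "not" and "good", which lets a bag-of-words classifier see the negation that the two unigrams alone would hide.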

Notes

  1. We do not explore the impact of word embeddings (GloVe, Word2Vec) in this article. We leave this for future research.

  2. The accuracy only increases by 0.31 percentage points when the size of the dataset increases from 500,000 messages to 1 million messages.

  3. Natural Language Toolkit: https://www.nltk.org/.

  4. https://scikit-learn.org/stable/.

  5. https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html.

  6. We also find that the accuracy reaches a plateau around 250,000–500,000 messages when a Logistic Regression algorithm is used.

  7. Ranco et al. (2015) state that “to achieve the performance of human experts, a large enough set of tweets has to be manually annotated, in our case, over 100,000”.

  8. The time will differ depending on the computing power and the optimization of the script.
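Notes 2 and 6 concern how accuracy changes with the size of the training set. The sketch below shows one way such a curve could be computed. It uses a small dependency-free multinomial Naive Bayes with add-one smoothing rather than the scikit-learn MultinomialNB class linked in note 5, and the function names, the "bullish"/"bearish" labels, and the toy messages are all assumptions made for illustration, not the article's data or code.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab, alpha


def predict_nb(model, tokens):
    """Return the label with the highest log posterior for one document."""
    class_counts, token_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(token_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for t in tokens:
            if t in vocab:  # tokens unseen in training are ignored
                smoothed = token_counts[label][t] + alpha
                score += math.log(smoothed / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def learning_curve(train_docs, train_labels, test_docs, test_labels, sizes):
    """Train on the first n documents for each n and report test accuracy."""
    curve = []
    for n in sizes:
        model = train_nb(train_docs[:n], train_labels[:n])
        hits = sum(predict_nb(model, d) == y
                   for d, y in zip(test_docs, test_labels))
        curve.append((n, hits / len(test_docs)))
    return curve


if __name__ == "__main__":
    # Hypothetical toy data, invented for this example.
    train_docs = [["buy", "moon"], ["sell", "crash"],
                  ["buy", "rally"], ["sell", "dump"]]
    train_labels = ["bullish", "bearish", "bullish", "bearish"]
    test_docs = [["buy"], ["sell"]]
    test_labels = ["bullish", "bearish"]
    print(learning_curve(train_docs, train_labels,
                         test_docs, test_labels, sizes=[2, 4]))
```

On real data, the sizes would span several orders of magnitude and the test set would be held out once and reused for every training size, so that differences between points reflect the amount of training data only.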

References

  • Ahmad, K., Han, J., Hutson, E., Kearney, C., & Liu, S. (2016). Media-expressed negative tone and firm-level stock returns. Journal of Corporate Finance, 37, 152–172.

  • Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance, 59(3), 1259–1294.

  • Avery, C. N., Chevalier, J. A., & Zeckhauser, R. J. (2015). The “CAPS” prediction system and stock market returns. Review of Finance, 20(4), 1363–1381.

  • Bartov, E., Faurel, L., & Mohanram, P. S. (2017). Can Twitter help predict firm-level earnings and stock returns? The Accounting Review, 93(3), 25–57.

  • Chen, C. Y.-H., Despres, R., Guo, L., & Renault, T. (2019). What makes cryptocurrencies special? Investor sentiment and return predictability during the bubble. Working Paper.

  • Das, S. R., & Chen, M. Y. (2007). Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53(9), 1375–1388.

  • Garcia, D. (2013). Sentiment during recessions. The Journal of Finance, 68(3), 1267–1300.

  • Leung, H., & Ton, T. (2015). The impact of internet stock message boards on cross-sectional returns of small-capitalization stocks. Journal of Banking & Finance, 55, 37–55.

  • Li, F. (2010). The information content of forward-looking statements in corporate filings—A naïve Bayesian machine learning approach. Journal of Accounting Research, 48(5), 1049–1102.

  • Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65.

  • Mahmoudi, N., Docherty, P., & Moscato, P. (2018). Deep neural networks understand investors better. Decision Support Systems.

  • Oliveira, N., Cortez, P., & Areal, N. (2016). Stock market sentiment lexicon acquisition using microblogging data and statistical measures. Decision Support Systems, 85, 62–73.

  • Price, S. M., Doran, J. S., Peterson, D. R., & Bliss, B. A. (2012). Earnings conference calls and stock returns: The incremental informativeness of textual tone. Journal of Banking & Finance, 36(4), 992–1011.

  • Ranco, G., Aleksovski, D., Caldarelli, G., Grčar, M., & Mozetič, I. (2015). The effects of Twitter sentiment on stock price returns. PLoS ONE, 10(9), e0138441.

  • Renault, T. (2017). Intraday online investor sentiment and return patterns in the US stock market. Journal of Banking & Finance, 84, 25–40.

  • Saif, H., Fernández, M., He, Y., & Alani, H. (2014). On stopwords, filtering and data sparsity for sentiment analysis of Twitter.

  • Sprenger, T. O., Sandner, P. G., Tumasjan, A., & Welpe, I. M. (2014). News or noise? Using Twitter to identify and understand company-specific news flow. Journal of Business Finance & Accounting, 41(7–8), 791–830.

  • Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139–1168.

  • Tetlock, P. C., Saar-Tsechansky, M., & Macskassy, S. (2008). More than words: Quantifying language to measure firms’ fundamentals. The Journal of Finance, 63(3), 1437–1467.

  • Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Volume 2 (pp. 90–94). Association for Computational Linguistics.

Author information

Corresponding author

Correspondence to Thomas Renault.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Renault, T. Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digit Finance 2, 1–13 (2020). https://doi.org/10.1007/s42521-019-00014-x

  • DOI: https://doi.org/10.1007/s42521-019-00014-x
