1 Introduction

Can the stock market be predicted by analyzing social media? Recent developments in machine learning and the growing quantities of text data available from online news, social media and annual reports have triggered intensive research in finance. In their pioneering paper, Antweiler and Frank (2004) compute a bullishness measure from 1.5 million messages posted on Yahoo! Finance and Raging Bull and find that stock messages help predict market volatility. Their results clearly reject the hypothesis that all that talk is just noise: there is financially relevant information present in social media. In a similar vein, Tetlock (2007) constructs a measure of media pessimism from a Wall Street Journal column and finds that it predicts downward pressure on market prices.

Most previous financial studies of social media rely on pre-defined or manually annotated sentiment dictionaries. Such approaches are limited in various ways. How can one create a sentiment classifier that understands the vocabulary of messages posted by investors? For instance, a “bull” is an animal in everyday language but someone optimistic in financial jargon. Loughran and McDonald (2012) create a word list that helps classify the tone of a financial document. However, this might not be sufficient in the context of social media: posted messages contain many typos, abbreviations and slang, so an additional layer of data preprocessing is needed. For instance, the word “goooooood” would not be recognized by the model unless it is first corrected to “good”. On the other hand, manually annotating and validating dictionaries is not a scalable approach to handling social media content. On another note, Cookson and Niessner (2020) employ an entropy classifier that sorts StockTwits messages into bullish versus bearish, and then use the resulting messages to form a stock-daily disagreement measure. However, their approach does not allow for sentiment-neutral messages, which arguably exist.

Our paper overcomes these limitations. We develop a machine learning algorithm to classify the sentiment of a large sample of StockTwits messages as bullish, bearish, or neutral. The sample consists of all messages referring to US and Canadian stocks, including ETFs and other types of securities available on CRSP/Compustat, from January 2010 to March 2020. We train our machine learning classifier on the set of all user sentiment-labeled messages, which constitute about one third of the sample, and then classify the sentiment of all remaining messages. Our method scales and performs very well. It achieves an out-of-sample accuracy of 85.9%, which compares favorably with the 80–85% rate at which human annotators typically agree on the sentiment of a document; see, e.g., Wilson et al. (2005) and Chen et al. (2020).

We then construct a stock-aggregate daily sentiment polarity measure and relate it to daily stock returns. We find that polarity is positively associated with contemporaneous returns, even when controlling for lagged returns. However, unconditionally, polarity cannot predict next-day returns, which is in line with the efficient market hypothesis (EMH). We then conduct an event study. We define events as days of sudden peaks in message volume for individual tickers, and classify events as bullish, bearish, or neutral depending on the prevailing polarities. We find that bullish (bearish) events are strongly associated with large positive (negative) abnormal returns. Cumulative abnormal returns over the 20 days preceding an event have no predictive power for the type of event, and returns normalize immediately after the jump on the event date, which again is in line with the EMH. In contrast, remarkably, we find that cumulative abnormal polarity has statistically significant predictive power for the type of event. We assess the economic relevance of our findings with the performance of portfolios ranked on cumulative abnormal polarity. We find that for appropriate choices of thresholds, cumulative abnormal polarities provide valuable signals for stock market investments.

Our results and method are of broad interest for researchers who analyze social media and their interplay with stock markets. We collect and process a large dataset of messages from StockTwits posted between January 2010 and March 2020. We generate a vocabulary of one million investor sentiment-labeled terms, consisting of up to three words, that frequently appear in StockTwits messages.Footnote 1 As a method, we develop a simple and efficient sentiment classifier of micro-blogs for imbalanced data. This addresses the stylized fact that users post more bullish- than bearish-labeled messages; in our sample, the ratio is five to one. Moreover, we find that not all messages carry a substantial stock-market-relevant sentiment. Rather than re-sampling from the underrepresented bearish class, we thus introduce an auxiliary neutral class. We then train two independent binary classifiers. The first (second) classifies messages as bullish versus non-bullish (bearish versus non-bearish). We aggregate the two binary outcomes and classify a message as bullish (bearish) for the concordant combination bullish/non-bearish (non-bullish/bearish), and neutral otherwise. This approach is very simple and efficient, and eliminates the class imbalance bias at the same time. It can be built on any traditional binary classifier. In this study, we use logistic regression on Term Frequency-Inverse Document Frequency (TFIDF)-vectorized messages. TFIDF is a weighting scheme gauging the importance of a word in a document.

Our paper contributes to the growing literature on machine learning classification of social media and its interaction with the stock market. Most previous financial studies use Twitter as their primary source of data. Twitter has the advantage of being used by a wide range of people across the world, and a few influencers can attract the attention of many investors. In 2013, following a meeting with Tim Cook (Apple CEO), Carl Icahn tweeted that he had bought a large position in Apple and believed that the company was extremely undervalued. This bullish tweet caused the market capitalization of Apple to jump by $12 billion. In 2019, JPMorgan created the Volfefe Index to track the impact of Donald Trump’s tweets on the stock market. However, it is more difficult to disentangle noise from relevant tweets on Twitter than on other, more focused social media. Results from Ghoshal and Roberts (2016) show that StockTwits is significantly more informative than Twitter data. This is not surprising, as StockTwits is a finance-focused platform whereas Twitter also captures irrelevant opinions on a wide range of non-finance related matters.

Our paper is the first to analyze the predictive power of StockTwits messages on stock returns both unconditionally and around specific events. Renault (2017) builds an intraday investor sentiment indicator from messages and finds that the change in investor sentiment over the first half-hour of a trading day helps forecast the market return over the last half-hour of that day. However, his classifier is based on a dictionary of 8000 manually validated and modified terms, which limits its scalability. Renault (2020) uses larger data sets and compares various classifiers, including machine learning methods.

Our approach is in parts similar to Ranco et al. (2015), who also study the relation of micro-blog sentiments with stock returns. However, they use Twitter data, whereas the finance-tailored StockTwits data we use results in higher contemporaneous correlations between stock returns and polarity. They manually annotate 100 thousand tweets, which limits the scalability of their approach. Cookson and Niessner (2020) also use logistic regression on a list of words to classify StockTwits messages that were unclassified in the original sample as either bearish or bullish.Footnote 2 Our sample is much larger (90 million versus 1 million messages in Ranco et al. (2015) and 1.5 million messages in Cookson and Niessner (2020)) and covers a longer period (10 years versus 13 months in Ranco et al. (2015) and 21 months in Cookson and Niessner (2020)).

Earlier studies of textual analysis and stock prices also include Das and Chen (2007), who provide a dictionary-based approach trained on hand-classified messages extracted from the Yahoo message board from July to August 2001. Boudoukh et al. (2013) use articles on selected S&P500 companies from the Dow Jones Newswire from 2000 to the end of 2009. Using a proprietary textual analysis methodology available on the Dow Jones platform, they show that returns respond more to relevant news, both by type and by tone. Heston and Sinha (2017) measure the sentiment of Reuters news with a proprietary Thomson Reuters neural network. They find that positive news stories are quickly incorporated into positive stock returns, whereas negative news stories take longer to be incorporated into prices. Ke et al. (2020) extract sentiment from news articles on the Dow Jones Newswires, training a sentiment score directly on returns. In contrast, we use user sentiment-labeled StockTwits messages as training and validation set for our sentiment classifier.

Our paper also contributes to the EMH literature by gauging how cumulative average abnormal returns and abnormal polarities behave around sudden peaks of message activity.

The remainder of the paper is structured as follows. Section 2 discusses StockTwits and stock market data. Section 3 develops our sentiment classifier based on TFIDF vectorization. Section 4 introduces the sentiment polarity measure. Section 5 relates it to stock returns. Section 6 contains the event study. Section 7 discusses the sentiment-sorted portfolio performance. Section 8 concludes. The appendix contains additional statistics and background material.

2 StockTwits and stock market data

StockTwits is a large social media platform similar to Twitter but designed for investors and traders. Users register online and can post messages about any listed stock by using the prefix $ followed by the ticker of the stock. StockTwits was created in 2008 as an app built on Twitter’s API and was later detached from Twitter to become a standalone social network. As of April 2019, it had over two million registered users, and the number of daily posted messages has been growing exponentially, see Fig. 1.

Fig. 1 Number of messages posted daily on StockTwits

StockTwits describes itself as “the voice of social finance and the best way to find out what is happening right now in the markets and stocks you care about”. In practice, it is used mostly by finance professionals to express their opinions on individual stocks and the market as a whole. Importantly, users have the option to label their posted messages as either bullish or bearish.Footnote 3 This feature is key for our approach, as it allows for sentiment classification of all messages using machine learning trained on the user-labeled messages.Footnote 4

There are at least three reasons for using StockTwits rather than other social media data for financial studies. First, a major challenge in applying natural language processing is the creation of an appropriate labeled vocabulary. Loughran and McDonald (2011) show that it is essential to have a specific vocabulary to interpret finance documents (i.e., many words have a different meaning in finance than in everyday English, such as “bear trap”). In addition, social media slang adds another layer of language complexity. In this respect, the functionality to self-tag bullish and bearish messages that StockTwits offers is extremely valuable, as it allows the creation of a specific labeled vocabulary out of labeled messages. We are not aware of any other social media platform in finance offering this functionality. Second, text data from StockTwits is more reliable and less noisy than that from general purpose platforms, such as Twitter, because messages focus on finance and economics matters only. Micro-bloggers have incentives to post valuable information in order to maintain or increase mentions and retweets, and thus gain a greater share of voice in the forum (Sprenger et al. 2014). On the other hand, StockTwits messages might be biased and subject to malicious users who try to manipulate the market. However, market manipulations likely happen only rarely, as the SEC closely monitors potential influencers to prevent any market abuse. Third, extracting data from StockTwits is easy because of its API, which is designed to query the database and download messages via JSON requests. We provide a short tutorial in “Tutorial for StockTwits messages extraction” of the appendix.
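To illustrate, the following minimal Python sketch queries the public streams endpoint for a single ticker. The endpoint path and the JSON layout (including the `entities.sentiment.basic` field holding the user label) are assumptions based on the public API as of our sample period and may have changed; the appendix tutorial gives the full extraction procedure.

```python
# Minimal sketch of a StockTwits API query (endpoint path and JSON layout
# are assumptions based on the public API at the time of our sample).
import json
import urllib.request

def fetch_messages(ticker: str) -> dict:
    """Download the most recent messages for one ticker as parsed JSON."""
    url = f"https://api.stocktwits.com/api/2/streams/symbol/{ticker}.json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

for m in fetch_messages("AAPL").get("messages", []):
    # the user-entered label lives in entities.sentiment.basic (or is null)
    label = (m.get("entities", {}).get("sentiment") or {}).get("basic")
    print(m["id"], label, m["body"][:60])
```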

We use stock market data from CRSP/Compustat. We extract daily closing prices, daily volume of transactions and the number of shares outstanding for all US and Canadian stocks, as well as ETFs and some other types of securities, from January 2010 to March 2020. Stock prices and the number of shares are adjusted to account for any distribution (i.e., dividends, stock splits) so that a comparison can be made on an equivalent basis before and after the distribution. We use as risk-free rate the 3-month US T-bill rate, converted into daily risk-free returns, and henceforth refer to daily stock excess returns over the risk-free rate simply as returns. Using a Python script, we then extract all messages from StockTwits for the list of tickers corresponding to the sample of US and Canadian stocks. This results in 90 million messages, which we download and store as JSON files.Footnote 5 Overall, our sample covers 8843 tickers, whereof 75% refer to ordinary common shares, 15% to ETFs, and the remaining 10% to other types of securities. Henceforth, we interchangeably refer to any of these securities as either a stock or a ticker.

Every StockTwits message includes eight features: (1) the reference ticker(s), (2) a timestamp, (3) a unique message identifier, (4) the body of the message, (5) the sentiment label (bearish, bullish, or none) entered by the user, (6) a unique identifier of the user who posted the message, (7) the number of messages published by that user, and (8) the number of followers of that user. Our sentiment analysis builds on the first five features. The last three provide additional information on the network structure, which we briefly discuss in “User summary statistics” of the appendix.

Figure 2 shows a screenshot of the StockTwits website as of 3rd March 2020, for a query of AAPL, which is the ticker for Apple. The first message is labeled as bullish by the user “satkaru”, the next two are unlabeled messages that will be classified by our machine learning algorithm, and the last message is labeled as bearish by the user “Etrading”.

Fig. 2 Screenshot of StockTwits as of 3rd March 2020, for a query of AAPL

The left plot of Fig. 3 shows the top 30 most discussed tickers on StockTwits. SPY, a large ETF that tracks the S&P500 stock market index, is the most discussed ticker, followed by Apple and other large tickers. The messages about the 15 (30) biggest tickers represent 20% (25%) of the total number of messages, which indicates that users talk about a wide range of tickers and not only big firms. The right graph shows a histogram of the number of messages per ticker. The x-axis is log-scaled because the distribution is highly skewed due to extreme values.

Fig. 3 Ticker summary statistics. Left graph shows the top 30 most discussed tickers on StockTwits. SPY is the ticker of a large ETF tracking the S&P 500 and AAPL is the ticker for Apple. Right graph shows the distribution of the number of messages across tickers

Text messages need to be transformed into quantitative vectors before they can be fed into our machine learning classifier, which in turn computes a sentiment score. This transformation consists of several steps. First, we apply some preprocessing operations to the text messages: an apostrophe handler, a contraction handler (e.g., “aren’t” becomes “are not”), removal of tickers, stop words (e.g., “a”, “the”, “of”),Footnote 6 user mentions and URLs, lemmatization, and a simple spell corrector dealing with more than two repeated characters (e.g., “soooo goooood” becomes “soo good”). Table 1 shows five examples of messages before and after preprocessing.

Table 1 Preprocessing of five sample messages
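The following Python sketch illustrates these preprocessing steps. The concrete regular expressions and the abbreviated stop-word and contraction lists are our own simplified stand-ins, not the paper's exact handlers, and lemmatization (e.g., via NLTK) is omitted for brevity.

```python
# Illustrative preprocessing pipeline; the handlers are simplified stand-ins
# for those described above (lemmatization is omitted here).
import re

STOP_WORDS = {"a", "an", "the", "of", "is", "to"}        # abbreviated list
CONTRACTIONS = {"aren't": "are not", "isn't": "is not"}  # abbreviated map

def preprocess(text: str) -> str:
    text = text.lower()
    for short, long in CONTRACTIONS.items():             # expand contractions
        text = text.replace(short, long)
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"\$[a-z.]+", " ", text)               # remove $-tickers
    text = re.sub(r"@\w+", " ", text)                    # remove user mentions
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # "goooood" -> "good"
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("$AAPL isn't going dooooown http://x.co @bob"))
# -> "not going doown"
```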

The next step is tokenization: the slicing of a text message into smaller units called terms or tokens. In financial lingo, some words only have meaning when associated with other words (e.g., “bad apple” or “bear flag”). N-gram models account for words that frequently occur together with other words. The main hyperparameter in an N-gram model is the number N of words that form a term: a unigram is a term with only one word, a bigram is a term with two consecutive words, etc. Larger N-gram models dramatically increase the size of the vocabulary (i.e., the collection of all terms considered). We select \(N=1,2,3\) and truncate the resulting vocabulary such that it consists of the one million most frequent terms in the union of all unigrams, bigrams and trigrams.
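A minimal sketch of this vocabulary construction with scikit-learn follows; `max_features` keeps the most frequent terms corpus-wide, which matches the truncation described above, and the two-message corpus is just a placeholder.

```python
# Build a unigram-to-trigram vocabulary truncated to the one million most
# frequent terms (toy corpus for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["bullish divergence room to grow", "bear flag short setup"]
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=1_000_000)
X = vectorizer.fit_transform(corpus)   # sparse matrix, one row per message
print(X.shape, list(vectorizer.vocabulary_)[:5])
```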

Figure 4 shows the bullish and bearish word clouds. These represent the most frequent terms in all user-labeled bullish and bearish messages relative to their total appearance, respectively. The size of a term represents its relative weight in the cloud. In the bullish cloud, we see terms such as “bullish divergence”, “room to grow”, “lot potential”, which we can clearly interpret as bullish signals. In the bearish cloud, we find terms such as “recent resistance”, “short setup”, “bad apple”, which again we can directly interpret as bearish signals. These findings are reassuring in the sense that the content of the messages on StockTwits is consistent with their labels. We checked messages at random and did not find significant issues; “Anomalies” of the appendix discusses two such anomalies.

Fig. 4 Bullish word cloud (left), bearish word cloud (right). These correspond to the most frequent terms (up to trigrams) in user-labeled bullish (bearish) messages relative to their total appearance. The size of the terms represents their importance in the cloud

3 Sentiment classification

The left plot of Fig. 5 shows the proportions of user sentiment-labeled messages across time. In the early years of the platform, most messages were unlabeled, presumably because users were not yet familiar with the sentiment label. Although the proportion of unlabeled messages declines monotonically over the years, almost 60% of the more recent messages are still unlabeled. Overall, around 30 million messages are user-labeled and 60 million messages are unlabeled. We conjecture that far from all unlabeled messages reflect market-neutral opinions. Indeed, the right plot of Fig. 5 reveals that a substantial part of the user-unlabeled messages is classified by our machine learning algorithm as bullish or bearish. Hence these user-unlabeled messages do contain market-relevant information, which our algorithm is able to capture.

Fig. 5 Left plot shows the proportions of user-labeled messages: bullish (green), bearish (red), and unlabeled (gray) across time. Right plot shows the proportions of machine learning classified messages: bullish (light green predicted, green user-labeled), bearish (light red predicted, red user-labeled), and neutral (gray) across time. Proportions are aggregated monthly (color figure online)

Among the user-labeled messages we find five times more bullish than bearish ones. This ratio indicates that investors are on average optimistic about the market, which is consistent with findings in the literature, e.g., Renault (2017). Such an imbalance is a well-known issue in machine learning classification, as it creates a bias towards the over-represented class, see Chawla et al. (2004). There are various ways to tackle class imbalance. An all-purpose standard approach in machine learning is to over-sample the minority class, that is, to randomly re-sample from the minority class and thus artificially re-balance the class sizes in the data. We use a different approach, which is tailored to our setup. As not every message carries a substantial stock market relevant sentiment, we deviate from the bullish–bearish dichotomy. Instead, we create an auxiliary neutral sentiment class to account for messages that do not take a clear stand. See “Examples of classified messages” of the appendix for examples of such neutral messages.

We first randomly select 80% of the user sentiment-labeled messages as training set (in-sample) and keep the remaining 20% as test set (out-of-sample). On the training set, we then train two independent binary classifiers. The first (second) classifies messages as bullish versus non-bullish (bearish versus non-bearish). We aggregate the two binary outcomes and classify a message as bullish (bearish) for the concordant combination bullish/non-bearish (non-bullish/bearish), and neutral otherwise. This approach is very simple and efficient, and eliminates the class imbalance bias at the same time.Footnote 7 It can be built on any traditional binary classifier.

Here we use logistic regression on Term Frequency-Inverse Document Frequency (TFIDF)-vectorized messages, as in, e.g., Yildirim et al. (2018), Qasem et al. (2015), Erdemlioglu et al. (2017). We choose TFIDF over more sophisticated algorithms such as BERT for simplicity and because it already shows good performance. Also, TFIDF is easier to interpret and generates, as a side product, a dictionary of finance-related terms. TFIDF is a widely used method to transform a text, in our case a message m, into a numerical vector, \(TFIDF_m\). The dimension of this vector is equal to the size of the vocabulary (the collection of all terms across all messages). The components of the vector encode the importance of the corresponding terms t in the message m, as formally defined by \(TFIDF_{m,t} = TF_{m,t} \cdot IDF_t\). The first factor measures how frequently term t appears in the message,

$$\begin{aligned} TF_{m,t} =\dfrac{\sum _{i=1}^{N_m} {\textbf {1}}_{t=t_{m,i}}}{N_m}, \end{aligned}$$
(1)

where \(N_m\) denotes the number of terms \(t_{m,i}\) in message m. The second factor measures how rare, and hence how informative, term t is across all messages,

$$\begin{aligned} IDF_t = \log \left( \dfrac{V}{\sum _{j=1}^{V} {\textbf {1}}_{t \in m_j}}\right) , \end{aligned}$$
(2)

where V denotes the total number of messages \(m_j\). A term t appearing in many documents (such as “the”, “is”, “of”) is likely to have low information content, hence a low \(IDF_t\). Logistic regression estimates the conditional probability of a message m being bullish given \(TFIDF_m\),

$$\begin{aligned} {{\mathbb {P}}}[ \text {message}\,m\hbox { is bullish}\mid TFIDF_m] = \frac{ e^{\beta _0+\beta _1 \cdot TFIDF_m}}{1+e^{\beta _0+\beta _1 \cdot TFIDF_m}}. \end{aligned}$$
(3)

The parameters \((\beta _0,\beta _1)\) are estimated by maximum likelihood, and we use the Python scikit-learn library (Pedregosa et al. 2011) for the implementation. Message m is classified as bullish (non-bullish) if the conditional probability (3) is larger (smaller) than a suitably chosen threshold, as discussed below. In a similar way, we perform an independent logistic regression for the bearish versus non-bearish classification.
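A sketch of one of the two binary classifiers is given below. The toy messages and labels are placeholders, and scikit-learn's TfidfVectorizer uses a smoothed IDF with l2-normalized rows, a common variant of (1)-(2).

```python
# Sketch of the bullish versus non-bullish classifier: logistic regression
# on TFIDF vectors, fitted with scikit-learn (toy data for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["to the moon", "bullish divergence", "short setup",
         "bear flag", "earnings call today", "watch this level"]
bull = [1, 1, 0, 0, 0, 0]   # toy user labels: 1 = bullish, 0 = non-bullish

txt_train, txt_test, y_train, y_test = train_test_split(
    texts, bull, train_size=0.8, random_state=0, stratify=bull)

vec = TfidfVectorizer(ngram_range=(1, 3), max_features=1_000_000)
X_train = vec.fit_transform(txt_train)   # fit the vocabulary in-sample only
X_test = vec.transform(txt_test)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # estimate of (3): P[m is bullish]
```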

Every message then falls into one of the following combinations: (non-bullish, bearish), (bullish, bearish), (non-bullish, non-bearish), (bullish, non-bearish). For the first and last combinations, the two algorithms agree, and the final classification is defined to be bearish (non-bullish, bearish) or bullish (bullish, non-bearish), respectively. For the two middle combinations, (bullish, bearish) and (non-bullish, non-bearish), the two algorithms disagree, so that the final classification is defined to be neutral. Formally, every message m is mapped onto either

$$\begin{aligned} m\mapsto {\left\{ \begin{array}{ll} \text {(non-bullish, bearish)} &{} =: \text {bearish} \\ \text {(bullish, bearish)}&{} =: \text {neutral} \\ \text {(non-bullish, non-bearish)}&{} =: \text {neutral} \\ \text {(bullish, non-bearish)} &{}=: \text {bullish.} \end{array}\right. } \end{aligned}$$
(4)
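A minimal sketch of the aggregation rule (4); the boolean inputs are the thresholded outputs of the two binary classifiers.

```python
# Aggregation rule (4): concordant outcomes decide, discordant outcomes
# (bullish & bearish, or neither) default to neutral.
def combine(is_bullish: bool, is_bearish: bool) -> str:
    if is_bullish and not is_bearish:
        return "bullish"
    if is_bearish and not is_bullish:
        return "bearish"
    return "neutral"

assert combine(True, False) == "bullish"
assert combine(False, True) == "bearish"
assert combine(True, True) == combine(False, False) == "neutral"
```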

To select optimal classification thresholds, we maximize the F1 scores.Footnote 8 The F1 scores of the two binary classifiers differ because they depend on which class is defined as the positive one. We recap the definition of the F1 score in “Classifier performance” of the appendix. Figure 6 shows the F1 scores as functions of the threshold. Circles indicate the maximal F1 scores, along with the corresponding optimal thresholds, 0.50 and 0.72, respectively.
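The threshold search can be sketched as a simple grid sweep over held-out predicted probabilities; the grid resolution and the toy inputs below are our own choices.

```python
# Sweep candidate thresholds and keep the one maximizing the F1 score,
# as done separately for each of the two binary classifiers.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, proba, grid=np.linspace(0.05, 0.95, 181)):
    scores = [f1_score(y_true, (proba >= t).astype(int)) for t in grid]
    i = int(np.argmax(scores))
    return grid[i], scores[i]

y = np.array([1, 0, 1, 1, 0])             # toy labels
p = np.array([0.9, 0.4, 0.6, 0.7, 0.55])  # toy predicted probabilities
print(best_threshold(y, p))
```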

Fig. 6 Optimal classification thresholds. The green (red) line is the F1 score for the bullish versus non-bullish (bearish versus non-bearish) classifier as a function of the threshold. Circles indicate the maximal F1 scores (color figure online)

If the sentiment score of a message is larger than 0.72 (smaller than 0.50), both classifiers agree and the sentiment of the message is classified as bullish (bearish). If the sentiment score lies between 0.50 and 0.72, the classifiers disagree, (bullish, bearish), and we consider the message as neutral. Finally, we overwrite the predicted sentiment of any message by the user-labeled sentiment whenever the latter is available. Research in sentiment classification shows that human annotators tend to agree about 80–85% of the time when evaluating the sentiment of a document (see, e.g., Wilson et al. (2005) and Chen et al. (2020)). This is a benchmark that a sentiment classifier should meet or beat. The out-of-sample accuracy of our combined classifier is 85.9%. “Classifier performance” of the appendix provides in-sample and out-of-sample confusion matrices for our combined classifier.

The right plot of Fig. 5 shows the proportions of our machine learning classifications across time. The percentages of bearish (user-labeled and classified as bearish) and bullish (user-labeled and classified as bullish) messages are stable over time, suggesting that our classification method is robust. Even though most messages were not user-labeled in the early years of the platform, as seen in the left plot of Fig. 5, we are now able to classify the sentiment of all messages in the sample, including a neutral class. Consistent with the over-representation of bullish messages observed among the user-labeled messages, there are substantially more messages classified as bullish than bearish. Examples of classified messages are given in “Examples of classified messages” of the appendix.

4 Polarity

We next aggregate message sentiments on a daily ticker-level and across the market. Thereto, we denote by \(C_{i,t,j}=1,0,-1\) for bullish, neutral, bearish, respectively, the sentiment of the jth message about ticker i on day t.Footnote 9 We follow Antweiler and Frank (2004) and define an average sentiment measure, which we call the sentiment polarity of ticker i on day t, as

$$\begin{aligned} P_{i,t} = \dfrac{\sum _{j=1}^{V_{i,t}} ({\textbf {1}}_{C_{i,t,j} = 1} - {\textbf {1}}_{C_{i,t,j} = -1})}{\sum _{j=1}^{V_{i,t}} ({\textbf {1}}_{C_{i,t,j} = 1} + {\textbf {1}}_{C_{i,t,j} = -1})}, \end{aligned}$$
(5)

where \(V_{i,t}\) denotes the number of messages about ticker i on day t.Footnote 10

As an aggregate, we define the market polarity as a weighted average over all tickers

$$\begin{aligned} P_t^M = \dfrac{\sum _{i} V_{i,t} \cdot P_{i,t}}{V^M_t}, \end{aligned}$$
(6)

where \(V^M_t = \sum _{i} V_{i,t}\) denotes the number of messages on day t. Figure 7 shows a scatter plot of the market polarity \(P_{t}^M\) versus the polarity of SPY. The slope coefficient of the regression line is statistically significantly positive and the contemporaneous Pearson correlation coefficient is 0.53, suggesting that the market polarity is an accurate measure of the aggregated sentiment of the market.Footnote 11 Also, consistent with Fig. 5, SPY and market polarities are bullish-biased.
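A pandas sketch of the daily ticker-level polarity (5) and the market polarity (6) is given below. The four-message DataFrame is a placeholder, and ticker-days without any signed (bullish or bearish) message yield an undefined (NaN) polarity.

```python
# Compute the sentiment polarity (5) per ticker-day and the message-volume
# weighted market polarity (6); toy classified messages for illustration.
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL", "SPY"],
    "date": pd.to_datetime(["2020-03-03"] * 4),
    "C": [1, -1, 1, 1],   # 1 = bullish, 0 = neutral, -1 = bearish
})

grp = df.groupby(["ticker", "date"])["C"]
bull = grp.apply(lambda c: (c == 1).sum())
bear = grp.apply(lambda c: (c == -1).sum())
polarity = (bull - bear) / (bull + bear)              # Eq. (5)

V = df.groupby(["ticker", "date"]).size()             # messages per ticker-day
market = ((V * polarity).groupby(level="date").sum()
          / V.groupby(level="date").sum())            # Eq. (6)
print(polarity, market, sep="\n")
```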

Fig. 7 Market polarity versus SPY polarity. The red line shows the linear regression line and coefficients (color figure online)

5 Relation to stock returns

For the following time-series analysis and event studies we restrict our sample, for two reasons. First, we keep the computational cost at a reasonable level. Second, and more importantly, the time series of ticker-individual polarities exhibit spikes and are too noisy when the daily message volumes \(V_{i,t}\) are too small. In fact, StockTwits is neither regulated nor moderated, so one needs to filter the information before credibly relating it to stock returns. Even though StockTwits carries valuable information from respected contributors, a blogFootnote 12 describes the concerns that may arise when using StockTwits as a financial information provider, namely self-promotion, lack of credibility and other noise. To diversify noise and better extract information, we therefore exclude rarely discussed tickers from our sample. To do this, we compute the median of daily message volumes \(V_{i,t}\) for each ticker and exclude tickers with a median of less than 50. Decreasing the median threshold increases the coverage at the expense of more noise in the daily polarity. Figure 8 shows the coverage as a function of the median threshold. To increase the coverage one would need to decrease the threshold a lot (e.g., decreasing the median threshold from 50 to 40 would increase the number of covered tickers from 19 to merely 22). A median threshold of 50 results in a balanced compromise between noise and coverage.
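The filter itself is a one-liner; in the sketch below, `volume` is an assumed DataFrame of daily message counts with one column per ticker.

```python
# Keep only tickers whose median daily message volume reaches the threshold;
# `volume` (rows: days, columns: tickers) is an assumed input layout.
import pandas as pd

def filter_tickers(volume: pd.DataFrame, threshold: int = 50) -> pd.DataFrame:
    return volume.loc[:, volume.median() >= threshold]
```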

Fig. 8 Coverage as a function of the median threshold. A lower threshold increases the coverage at the expense of a bigger bias in the polarity

Table 2 shows the list of the 19 tickers above this threshold and their associated market capitalizations as of 31st of December 2019. These most discussed tickers cover all stock sizes, so we avoid a big-firm bias. The list also includes not only individual firms but also ETFs on alternative investments. Many tickers are technology stocks, which arguably are the most discussed on social media. However, other sectors are also covered, so even though the universe is restricted, it is well diversified. And even though the restricted sample contains a relatively small number of tickers, the event studies in the following section are still based on more than 1000 events.

Table 2 Coverage after the trimming process

Now, to understand how our polarity measure is related to investor sentiment, we perform linear regressions of contemporaneous and next-day returns on polarity, with lagged returns as control variables:

$$\begin{aligned} R_{i,t}&= \alpha ^{{\textrm{cont}}} + \beta ^{{\textrm{cont}}} \cdot P_{i,t} + \gamma _{1}^{{\textrm{cont}}} \cdot R_{i,t-1} + \gamma _{2}^{{\textrm{cont}}} \cdot R_{i,t-2} +\epsilon ^{{\textrm{cont}}}_{i,t}, \end{aligned}$$
(7)
$$\begin{aligned} R_{i,t+1}&= \alpha ^{{\textrm{next}}} + \beta ^{{\textrm{next}}} \cdot P_{i,t} + \gamma _{1}^{{\textrm{next}}} \cdot R_{i,t-1} + \gamma _{2}^{{\textrm{next}}} \cdot R_{i,t-2} + \epsilon ^{{\textrm{next}}}_{i,t}. \end{aligned}$$
(8)
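A sketch of regressions (7)-(8) with statsmodels follows. The tiny pooled panel is a placeholder, and the paper's exact estimation details (pooling and standard errors) may differ.

```python
# Regress contemporaneous (7) and next-day (8) returns on polarity,
# controlling for lagged returns; toy long-format panel for illustration.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.DataFrame({
    "ticker": ["AAPL"] * 6 + ["SPY"] * 6,
    "R": [0.010, -0.020, 0.005, 0.012, -0.008, 0.020,
          0.004, -0.006, 0.002, 0.007, -0.003, 0.009],
    "P": [0.6, -0.1, 0.3, 0.5, 0.0, 0.4,
          0.2, -0.2, 0.1, 0.3, -0.1, 0.2],
})
g = panel.groupby("ticker")["R"]
panel["R_lag1"], panel["R_lag2"] = g.shift(1), g.shift(2)
panel["R_next"] = g.shift(-1)

cont = smf.ols("R ~ P + R_lag1 + R_lag2", data=panel).fit()      # Eq. (7)
nxt = smf.ols("R_next ~ P + R_lag1 + R_lag2", data=panel).fit()  # Eq. (8)
print(cont.params["P"], nxt.params["P"])
```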

Table 3 shows that the regression coefficient \(\beta ^{{\textrm{cont}}}\) is positive and significant for contemporaneous returns. This indicates that polarity is a good contemporaneous proxy for the sentiment of investors. Further supporting evidence is given by the correlation between polarity and contemporaneous returns at the ticker level.

Table 3 Results from linear regressions of contemporaneous and next-day stock returns on polarity

Figure 9 shows the time series of daily polarity and returns during 2019 for the top 6 most discussed tickers. Correlations are always positive and range between 0.1 and 0.3. In contrast, the next-day regressions reveal that polarity has no unconditional predictive power for next-day stock returns. In the following section we show that polarity does have predictive power around specific events.

Fig. 9 Time series of daily polarity (red, left axis) and daily stock returns (blue, right axis) since 1st of January 2019 for the top 6 most discussed tickers. The Pearson correlation between the two time series is shown in the title (color figure online)

6 Event study

Event studies constitute a statistical method widely used in financial econometrics, see, e.g., MacKinlay (1997). In general, they are used to measure the effect of events on the market value of stocks. Well-known applications of event studies include the testing of various forms of the efficient market hypothesis (EMH) [see Fama et al. (1969) and Fama (1991)].

6.1 Events

We define events as days with an unusually large number of messages for individual tickers. We conjecture that a sudden peak in StockTwits message volume indicates that an important corporate or stock market event is happening on the day of the peak. Figure 10 shows that increases (decreases) in message volumes are positively associated with increases (decreases) in contemporaneous weekly stock transaction volumes. These co-movements suggest that message volume peaks are a good proxy for corporate and stock market events.

Fig. 10 Changes in weekly volume of transactions on the y-axis versus changes in message activity on the x-axis. Activity is measured in weekly messages posted per ticker

To measure unusual activity peaks, we use as benchmark model a one-year rolling window regression of daily relative message volume changes of ticker i on daily relative total message volume changes.Footnote 13 Formally,

$$\begin{aligned} \dfrac{\Delta V_{i,t}}{V_{i,t-1}} = \alpha _i^V + \beta _i^V \cdot \dfrac{\Delta V_t^M}{V_{t-1}^M} + \epsilon _{i,t}, \end{aligned}$$
(9)

which gives the one-year rolling estimates \(\hat{\alpha }_i^V\) and \(\hat{\beta }_i^V\). We then define the abnormal message volume changes for ticker i on day t as

$$\begin{aligned} AV_{i,t} = \dfrac{\Delta V_{i,t}}{V_{i,t-1}} - \left( \hat{\alpha }_i^V + \hat{\beta }_i^V \cdot \dfrac{\Delta V_t^M}{V_{t-1}^M}\right) . \end{aligned}$$
(10)

We define an event for ticker i as any day t where the standardized abnormal volume exceeds two,

$$\begin{aligned} \dfrac{AV_{i,t} - \hat{\mu }_{AV_i}}{\hat{\sigma }_{AV_i}} > 2, \end{aligned}$$
(11)

where \(\hat{\mu }_{AV_i}\) and \(\hat{\sigma }_{AV_i}\) denote the one-year rolling empirical mean and standard deviation.
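The event detection (9)-(11) can be sketched with rolling-window formulas for the simple-regression coefficients. In the sketch, `v` and `v_m` are assumed Series of daily message counts for one ticker and for the whole market, 252 trading days approximate one year, and the exact window alignment (whether day t itself enters the rolling estimates) is our own choice.

```python
# Flag days where the standardized abnormal message volume change exceeds
# two, per Eqs. (9)-(11); rolling alpha/beta come from the closed-form
# simple-regression formulas beta = cov(x, y)/var(x), alpha = E[y] - beta E[x].
import pandas as pd

def events(v: pd.Series, v_m: pd.Series, window: int = 252) -> pd.Series:
    y = v.pct_change()                    # ticker relative volume change
    x = v_m.pct_change()                  # market relative volume change
    beta = y.rolling(window).cov(x) / x.rolling(window).var()
    alpha = y.rolling(window).mean() - beta * x.rolling(window).mean()
    av = y - (alpha + beta * x)           # abnormal volume change, Eq. (10)
    z = (av - av.rolling(window).mean()) / av.rolling(window).std()
    return z > 2                          # event indicator, Eq. (11)
```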

Next we define the type of the event as either bullish, neutral or bearish. We use the abnormal polarity \(AP_{i,t}\) of the event date, formally defined in (18) below, to assess how investors on average perceive the event. Figure 11 shows the distribution of abnormal polarities on event dates.

Fig. 11 Empirical distribution of abnormal polarities on event dates. Red dashed lines show the one-third (\(-\) 0.03) and two-thirds (0.07) percentiles

We use the one-third (\(-\) 0.03) and two-thirds (0.07) percentiles of the distribution of abnormal polarities as thresholds for the event type, and define the type of the event for ticker i at t as

$$\begin{aligned} Type_{i,t}= {\left\{ \begin{array}{ll} {\text {Bullish}} &{} \quad {\text {if}}\ \ AP_{i,t}> 0.07, \\ {\text {Neutral}} &{} \quad {\text {if}}\ \ AP_{i,t} \in [-\,0.03, 0.07], \\ {\text {Bearish}} &{} \quad {\text {if}}\ \ AP_{i,t} < -\,0.03. \\ \end{array}\right. } \end{aligned}$$
(12)

Overall, across the 19 tickers, we identify 1131 events, whereof 454 are bullish, 294 neutral, and 383 bearish. This coverage is on par with previous studies (e.g., MacKinlay (1997) analyzes 30 stocks and 600 events). Figure 12 shows the aggregate events and their types across time. The count of events looks stationary over time, apart from a build-up phase of the platform until 2014. The distribution of event types is also balanced across time.

Fig. 12 Number of events of each type across time. Numbers are aggregated monthly

As an illustration, Fig. 13 shows the time series of message volume for Apple and the corresponding events. Between January 2011 and March 2020, our algorithm identified 73 events for Apple. What are these events? Remarkably, we capture a variety of corporate events and disclosures. Earnings announcements constitute about half of the events for Apple. Other events include Apple Keynotes (presentations that Apple gives to the press, often presenting new products) and CEO letters addressed to investors. Table 4 lists a few selected events for Apple.

Fig. 13 Daily message volume for Apple. Events are days with an unusually high number of messages. Green upper-triangles show bullish events, gray circles are neutral events and red down-triangles represent bearish events (color figure online)

Table 4 Selected events and associated description and types for Apple

6.2 Abnormal stock returns

How do stock returns behave around events? Similar to the relative message volume changes, we use as benchmark model a one-year rolling window regression of the daily returns of ticker i on the daily market returns, \(R^M_{t}\), i.e., daily excess returns of the S&P500, and lagged returns as control variables,

$$\begin{aligned} R_{i,t}= \alpha _i^R + \beta _i^R \cdot R_t^M + \beta _i^{(1)} \cdot R_{i,t-1} + \beta _i^{(2)} \cdot R_{i,t-2} + \beta _i^{(3)} \cdot R_{i,t-3} + \epsilon _{i,t}, \end{aligned}$$
(13)

which gives the one-year rolling estimates \(\hat{\alpha }_i^R\), \(\hat{\beta }_i^R\) and \(\hat{\beta }_i^{(j)}\). This implies the abnormal returns

$$\begin{aligned} AR_{i,t} = R_{i,t} - \left( \hat{\alpha }_i^R + \hat{\beta }_i^R \cdot R_t^M + \hat{\beta }_i^{(1)}\cdot R_{i,t-1} + \hat{\beta }_i^{(2)} \cdot R_{i,t-2} + \hat{\beta }_i^{(3)} \cdot R_{i,t-3}\right) . \end{aligned}$$
(14)

We define the cumulative abnormal returns (CAR) around a ticker i event \(\tau\) as

$$\begin{aligned} CAR_i(\tau ,t) = \sum _{s=- 20}^{t} AR_{i,\tau + s}, \end{aligned}$$
(15)

and the cumulative average abnormal returns (CAAR) across all \(N=1131\) events as

$$\begin{aligned} CAAR(t) = \dfrac{1}{N} \sum _{j=1}^N CAR_{i_j}(\tau _j,t). \end{aligned}$$
(16)
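A sketch of (15)-(16) follows; `ar` is an assumed DataFrame of abnormal returns (rows: trading days, columns: tickers) and `event_list` an assumed list of (ticker, event date) pairs.

```python
# Cumulative abnormal returns (15) in event time and their average (16).
import numpy as np
import pandas as pd

def car(ar: pd.DataFrame, ticker: str, tau, pre: int = 20, post: int = 20):
    k = ar.index.get_loc(tau)                       # position of event day
    assert k >= pre, "event needs at least `pre` prior trading days"
    window = ar[ticker].iloc[k - pre:k + post + 1]
    out = window.cumsum()
    out.index = np.arange(-pre, post + 1)           # event-time index
    return out

def caar(ar: pd.DataFrame, event_list) -> pd.Series:
    cars = [car(ar, i, tau) for i, tau in event_list]
    return pd.concat(cars, axis=1).mean(axis=1)     # average across events
```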

The left plot of Fig. 14 shows the CAAR around the events. This plot is consistent with MacKinlay (1997). It shows that the CAAR related to bearish (bullish) events exhibits a significant downward (upward) jump at the event date. These jumps are followed by a flat CAAR during the 20 days after the event. Interestingly, there is a systematic shift in the CAAR already 1 day before the event. However, this shift is relatively small compared to the jump on the event day: one day before the event, the bullish (bearish) CAAR equals 0.020 (\(-\) 0.021). The CAAR related to the neutral events exhibits a slight upward shift around the event date, but it fades away after a few days. The CAAR related to bearish events shifts already a few days before the event, but this shift is not statistically significant. This is in line with Fig. 15, which shows that the CAR distributions prior to the events are not significantly different from zero, as confirmed by the Mann–Whitney U-tests shown in Table 5. CAR has no predictive power for the type of the event: five days before an event, the median of the CAR distribution of the bullish events is not statistically different from the median of the neutral events. The same holds for the bearish events.

Fig. 14 Cumulative average abnormal returns (left plot) and cumulative average abnormal polarity (right plot) around identified events. CAAR and CAAP related to bearish, neutral and bullish events are displayed with the red, gray and green lines, respectively. Areas around lines show confidence intervals at the 95% level

6.3 Abnormal polarity

How does sentiment polarity behave around events? Similar to the above, we use as benchmark model a one-year rolling window regression of the daily polarity of ticker i on the daily market polarity defined in (6),

$$\begin{aligned} P_{i,t}= \alpha _i^P + \beta _i^P \cdot P_t^M + \epsilon _{i,t}, \end{aligned}$$
(17)

which gives the one-year rolling estimates \(\hat{\alpha }_i^P\) and \(\hat{\beta }_i^P\). This implies the abnormal polarity

$$\begin{aligned} AP_{i,t} = P_{i,t} - \left( \hat{\alpha }_i^P + \hat{\beta }_i^P \cdot P_t^M\right) . \end{aligned}$$
(18)

We define the cumulative abnormal polarity (CAP) around a ticker i event \(\tau\) as

$$\begin{aligned} CAP_i(\tau ,t) = \sum _{s=- 20}^{t} AP_{i,\tau + s}, \end{aligned}$$
(19)

and the cumulative average abnormal polarities (CAAP) across all \(N=1131\) events as

$$\begin{aligned} CAAP(t) = \dfrac{1}{N} \sum _{j=1}^N CAP_{i_j}(\tau _j,t). \end{aligned}$$
(20)

The right plot of Fig. 14 shows the CAAP around the events. There are two main findings. First, in contrast to the CAAR, the CAAP for bullish and bearish events is not constant after the event date, suggesting that users’ sentiments about stocks tend to be biased towards recent past events. A possible explanation is that users might still post bullish (bearish) messages about a bullish (bearish) event for several days after the event. This is in contrast to the returns, which immediately normalize after the event. Second, and more interestingly, the CAAP for bullish and bearish events shifts several days earlier than the CAAR. This indicates that investors are on average able to anticipate the type of an event in the near future. However, this sentiment manifests itself only on social media, not in abnormal returns.

Figure 15 illustrates this striking finding with box plots [see Dekking et al. (2005) and Tukey (1977)] showing the distributions of the CAR and CAP, for all three event types, 5 days before the event, at the event date, and 5 days after the event, respectively.

Fig. 15 Distributions of CAP and CAR 5 days before an event, on the event date, and 5 days after an event. The line inside a box shows the median, while the edges of each box represent the 25% and 75% quantiles of the distribution. Whiskers extend from the box edges to the most extreme observations within 1.5 times the interquartile range (the third quartile minus the first quartile)

To check statistical significance, we use the Mann–Whitney U-test [see Mann and Whitney (1947) and Sheskin (1998)] to test whether the three samples (bullish, neutral and bearish) represent populations with different median values.Footnote 14 Table 5 shows U-test statistics for pairwise comparisons. The null is rejected in every case except for CAR at \(\tau -5\). That is, 5 days before the event, CAR has no predictive power for the type of event. This is consistent with the EMH. In contrast, 5 days before the event, CAP can predict the type of event. At the event date, the medians of the CAR shift as the abnormal returns jump for both bullish and bearish events. This, too, is consistent with the EMH. Finally, 5 days after the event, the distributions of the CAR are very similar to the ones at the event date. Again, this is consistent with the EMH, as all new information is instantaneously embedded into prices and returns normalize immediately after the event. The medians of the CAP 5 days after the event exhibit an extended shift compared to the ones at the event date, as investors continue to post about recent past events.

Table 5 Mann–Whitney U-test statistics for pairwise significant differences between distribution medians

7 Sentiment-sorted portfolios

We assess the economic relevance of the sentiment polarity and construct sorted portfolios. Thereto, we define for every ticker i and day t

$$\begin{aligned} CAP_{i,t}=\sum _{s=t-14}^t AP_{i,s}, \end{aligned}$$
(21)

which is the running CAP over the last 14 days plus the current day t (we rebalance the portfolio at the close on day t).Footnote 15 Note the difference from (19). While we cannot predict the arrival of an event, we assume that the more \(CAP_{i,t}\) deviates from zero, the more likely there will be an event on the next day. We will thus use \(CAP_{i,t}\) as a baseline signal for market timing. However, as we have seen above, CAP continues to shift after an event. To avoid exposure to short-term reversals, we thus reset the running CAP after every event. Formally, let \(\tau _{i,t}\le t\) denote the most recent past event date of ticker i by day t. Then we define the reset CAP

$$\begin{aligned} CAP_{i,t}^{(R)} = \sum _{s=\max \{t-14,\tau _{i,t}+1\}}^t AP_{i,s} = {\left\{ \begin{array}{ll} CAP_{i,t},&{}\text {if}\,\tau _{i,t}<t-14, \\ \sum _{s= \tau _{i,t}+1}^t AP_{i,s},&{}\text {if}\,t-14\le \tau _{i,t}<t,\\ 0,&{}\text {if}\,\tau _{i,t}=t,\end{array}\right. } \end{aligned}$$
(22)

where we used the convention that \(\sum _{s=t+1}^t \cdot =0\).
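A sketch of the reset running CAP (22) is below; `ap` is an assumed Series of daily abnormal polarities for one ticker and `event_days` an assumed boolean Series marking that ticker's event dates.

```python
# Running CAP over the last 14 days plus the current day, reset after every
# event as in Eq. (22); both inputs share the same daily index.
import pandas as pd

def reset_cap(ap: pd.Series, event_days: pd.Series, lag: int = 14) -> pd.Series:
    out = pd.Series(0.0, index=ap.index)
    last_event = None                      # position of the most recent event
    for k, t in enumerate(ap.index):
        if event_days.loc[t]:
            last_event = k
            continue                       # CAP is reset to zero on event days
        start = k - lag if last_event is None else max(k - lag, last_event + 1)
        out.loc[t] = ap.iloc[max(start, 0):k + 1].sum()
    return out
```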

We also define time-varying thresholds on the reset CAP for market timing. For every day t, we estimate the mean \(\mu _t\) and standard deviation \(\sigma _t\) of \(CAP_{i,t}^{(R)}\) across the 19 tickers i. For a fixed multiplier x, we define the upper threshold \(U_t(x) = \mu _t + x \cdot \sigma _t\) and the lower threshold \(L_t(x) = \mu _t - x \cdot \sigma _t\). Figure 16 shows the time series of the cross-sectional mean \(\mu _t\) and the 99% confidence interval, \(L_t(x)\) and \(U_t(x)\), for \(x = 2.58\). As a robustness check of our approach, we observe that the mean is well centered at zero. We also see a regime change in early 2015: in the first regime the standard deviation is much larger (and more volatile) than in the second regime.Footnote 16 “Sentiment-sorted portfolios for various thresholds” of the appendix contains the results for the 95% (\(x = 1.96\)) and 99.5% (\(x = 2.81\)) confidence intervals.

Fig. 16 Cross-sectional statistics of \(CAP_{i,t}^{(R)}\)

Based on these signals, we now construct reset-CAP-sorted portfolios. Formally, we define the ticker sets \(I^{{\textrm{bull}}}_t=\{ i\mid CAP_{i,t}^{(R)} > U_t(x)\}\) and \(I^{{\textrm{bear}}}_t=\{ i\mid CAP_{i,t}^{(R)} < L_t(x)\}\). At the close of any day t, we form the equally weighted bullish (bearish) portfolio consisting of tickers in \(I^{{\textrm{bull}}}_t\) (in \(I^{{\textrm{bear}}}_t\)), and realize the 1-day returns. If any of the index sets is empty, we set the corresponding return to zero. This strategy is out-of-sample and easily implementable in practice. At the end of day t and for every stock, we extract and classify all messages to compute the polarity using (5), the abnormal polarity using (18), and the running reset CAP using (22). Next, we compute the lower threshold \(L_t(x)\) and upper threshold \(U_t(x)\) by estimating the cross-sectional mean and standard deviation of the reset CAP. Finally, we sell short all tickers whose reset CAP is smaller than the lower threshold, and buy all tickers whose reset CAP is larger than the upper threshold.
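The daily portfolio formation can be sketched as below; `cap_r` is an assumed DataFrame of reset CAPs (rows: days, columns: the 19 tickers), `ret` holds the 1-day returns realized after each rebalancing, aligned on the signal date, and the sign convention for the short book is our own.

```python
# Form the equally weighted bullish (long) and bearish (short) books from
# the cross-sectional thresholds U_t(x) and L_t(x); empty books earn zero.
import pandas as pd

def daily_portfolios(cap_r: pd.DataFrame, ret: pd.DataFrame, x: float = 2.58):
    mu, sigma = cap_r.mean(axis=1), cap_r.std(axis=1)
    bull = cap_r.gt(mu + x * sigma, axis=0)              # long signals
    bear = cap_r.lt(mu - x * sigma, axis=0)              # short signals
    r_bull = ret.where(bull).mean(axis=1).fillna(0.0)    # long book return
    r_bear = -ret.where(bear).mean(axis=1).fillna(0.0)   # short book return
    return r_bull, r_bear
```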

The top-left plot of Fig. 17 shows the cumulative log returns of the bullish and bearish portfolios as well as of the S&P500. Overall, the portfolio performance is consistent with our approach: the bullish (bearish) portfolio outperforms (under-performs) the market. Remarkably, the upward (downward) steps suggest that our portfolio strategy succeeds in taking the right positions just before an event. The remaining plots of Fig. 17 show the number of positions of our portfolios across time. Most of the returns are earned with portfolios consisting of very few tickers. This is a result of our market timing and stock picking strategy: we only invest in the top/bottom percentiles of CAP, whenever our signal is strong enough.

Fig. 17 Bullish and bearish portfolios for \(x=2.58\)

The portfolio performance arguably depends on the choice of the multiplier x. In particular, the smaller x is, the more likely the bearish portfolio exhibits positive returns. On the other hand, the larger x is, the more likely the bullish portfolio misses opportunities for positive returns. A careful gauging of x, possibly asymmetric between the bullish and bearish sides, is therefore required for a real-world implementation of these strategies. Results could also improve for a larger cross-section of stocks than the 19 of our reduced sample, which is left for future research.

8 Conclusion

We extract 90 million messages from StockTwits from January 2010 to March 2020, covering a large sample of US and Canadian stocks. Messages are either user-labeled as bullish or bearish, or left unlabeled. Using the user-labeled messages as training set, we perform logistic regressions on TFIDF-vectorized messages to classify all unlabeled messages as either bullish, neutral or bearish. We observe a 5-to-1 bullish-to-bearish ratio, indicating that investors are on average optimistic. We build time series of daily sentiment polarity for individual tickers and the aggregate market. We show that daily polarity is positively associated with contemporaneous stock returns, but this result loses its significance against next-day returns. We then define events as days with sudden peaks of message volume and relate them to corporate and stock market events. We show that cumulative abnormal polarity has significant predictive power for the type of event, in contrast to cumulative abnormal returns. We also note that investor sentiment about a ticker tends to be biased towards recent past events. As a robustness check, we show that our event study on cumulative abnormal returns is consistent with previous literature on the efficient market hypothesis. The performance of sentiment-sorted portfolios illustrates the economic relevance of our sentiment measure.