1 Introduction

The advent of neo-brokers like Robinhood has led to a new era of retail investing, potentially democratizing stock markets and bringing in a massive influx of small-scale investors. At the same time, so-called “Meme Stocks” have emerged as a new phenomenon in the market. These stocks are largely driven by the collective sentiment and enthusiasm of online communities rather than by financial and economic fundamentals or technical analysis. Investors often use specific Social Media channels like Reddit’s “Wallstreetbets” Subreddit (WSB) or Stocktwits to share their short-term investment decisions and strategies, which can result in partial collusion and jointly planned investment decisions. A prominent case is the GameStop (GME) short squeeze, which we discuss in detail and which, as we will show, merits in-depth exploration.

Short-selling is an investment strategy in which traders profit from a price drop in an asset. The process involves borrowing shares, selling them at the current market price, and buying them back at a lower price in the future to return them to the lender. In the presence of substantial short interest, this can culminate in a so-called “short squeeze”: a stock price hike may cause some short-sellers to close their positions early to cut potential losses, which increases the share price even further and can set off a feedback loop.

Throughout the latter half of the 2010s, GameStop—one of the largest video game retailers in the world—exhibited poor financial performance, recording a consistent revenue decline between 2015 and 2020. This made GameStop a prime target for short-selling.

An influential Reddit user by the name of “DeepFuckingValue” took notice of the short interest in GME and began discussing the possibility of a short squeeze.Footnote 1 Retail investors rallied behind his ideas and began buying GME shares at large scale, with the goal of driving the stock price up and thereby putting pressure on short sellers to cover their positions by purchasing shares at inflated prices.

Indeed, GME’s stock price experienced a rapid increase in late January 2021.Footnote 2 The surge in price attracted mainstream media attention and further fueled interest in the ongoing short squeeze, which eventually spilled over to other stocks, such as Palantir and BlackBerry. It also led to the bust of several GME-shorting hedge funds, the most prominent being Melvin Capital.Footnote 3 It quickly became clear that Social Media could be a potent fire-starter for coordinated retail investor operations. As a result, this new group of investors has shown that it can be a significant actor in financial markets, with the ability to disrupt the efficient price determination of certain stocks in the short term. For this reason, a deeper understanding of this new subculture is important. In addition, we investigate the potential of using Social Media discussions to predict share price movements.

We analyze Social Media text data from different sources to gain insights into the topics and shape of the discussion over time on several prominent platforms (Reddit, Stocktwits, Seeking Alpha) used by small-scale investors. Using NLP approaches like Sentiment Analysis and Topic Modelling, we investigate the discussion in these communities and evaluate how NLP methods can be leveraged to gain insight into the investment decision-making processes of users. Furthermore, we assess the impact of small-scale investors on market developments and the potential for other investors to leverage the group’s collective trading decisions by predicting the trend of “Meme Stocks” such as GameStop or Blackberry as prominent examples.

Accordingly, our approach can be differentiated into two dimensions. Firstly, a predictive-analytical dimension, which is concerned with predicting the short-term price trend of affected stocks in order to analyze the potential of this kind of text data for investment decision making. Secondly, a descriptive-analytical dimension, which applies Topic Modelling approaches over time to describe how the contents of the individual investors’ discourse evolve. Each dimension is independent in both methodology and scope. All applications presented in this paper are viewed as complementary additions to the sum of methods included in potential future investment decision-making procedures. Fig. 1 gives a high-level overview of our analysis. A detailed figure describing the whole pipeline is provided in Sect. 4.

Finally, we would like to stress that the classification conducted in the predictive-analytical dimension of our research is not suitable as a stand-alone tool for making investment decisions. We acknowledge that, from a statistical standpoint, stock prices follow a random walk and cannot be accurately forecasted, as shown by Van Horne and Parker (1967) and Fama (1995), among others. For an informed investment decision, economic, company-related, and technical factors, as well as many others, are crucial. Therefore, our only goal in the predictive-analytical dimension is to show that the information created during the discussions inside the small-scale investor communities can be utilized with machine learning approaches to support the investment decision-making process of investors. In other words, we do not expect any causality, but we also do not expect independence between a stock’s price and the information retrieved from the small-scale investor communities. By predicting the trend of a stock using only the data from said communities, we display the informational potential of the data at hand and the partial influence that those small-scale investors have on the development of the observed stock prices. Our goal is therefore not to provide a comparative framework that challenges existing models in their predictive performance, but to evaluate the potential of using the information generated by these online communities.

Fig. 1 A view on the conducted analysis on two dimensions

2 Related literature

Our research draws on the literature on textual analysis of Social Media data, especially the Topic Modelling literature. In addition, we discuss literature that investigates the effect of Social Media platforms on financial markets and share prices.

Topic Modelling Topic Modelling describes techniques for extracting and identifying hidden topics within extensive text corpora. Topic models make it possible to uncover the semantic structure of large text collections on a scale that goes beyond what humans can reasonably read, process, and comprehend. Given a collection of documents, a Topic Model first detects which topics are present in the corpus. Second, it allocates each document to one or more of the detected topics. Historically, topic models are implemented as probabilistic generative models such as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) or Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 2013). While these are still used today (Yu and Xiang 2023; Gupta et al. 2022; Egger and Yu 2022), the rise of continuous Bag-of-words (CBOW) word embeddings such as word2vec (Mikolov et al. 2013) enabled models to integrate contextualized representations of words into their modelling (Das et al. 2015; Angelov 2020). Further adaptations of LDA, like the Embedded Topic Model (ETM) (Dieng et al. 2020) or ProdLDA (Srivastava and Sutton 2017), also model the topics themselves as representations in the embedding space. Furthermore, the inherent power of these contextualized word embeddings allowed for completely different modelling architectures, leveraging simple clustering techniques such as K-Means on embedded words (Sia et al. 2020). The ZeroShot Topic Model (Bianchi et al. 2021) and the Contextualized Topic Model (Bianchi et al. 2020) additionally integrate embedded documents (Reimers and Gurevych 2019) into their modelling. These embedded documents also allow for simple clustering techniques such as HDBSCAN or Gaussian Mixture models (Grootendorst 2022; Angelov 2020; Thielmann et al. 2024) and create topics that are on par with state-of-the-art probabilistic models (Thielmann et al. 2024, 2022). While the underlying clustering techniques create only clustered documents, the topics can easily be extracted from them by utilizing either distances in the embedding space (Angelov 2020; Thielmann et al. 2024) or class-based term frequency-inverse document frequency scores (Grootendorst 2022).

Topic models have long been used to analyze the latent topics discussed in social networks, since they make it possible to uncover trends and latent topics in these exceedingly large document collections (Hong and Davison 2010; Curiskis et al. 2020; Kant et al. 2022; Weisser et al. 2023). Models such as LDA (Blei et al. 2003), Non-negative Matrix Factorization (NMF) (Lee and Seung 1999) or BERTopic (Grootendorst 2022) (where BERT stands for Bidirectional Encoder Representation from Transformers) are some of the prominent models in these applications (Egger and Yu 2022). Nowadays, tools like TTLocVis (Kant et al. 2020) or Twitmo (Buchmüller et al. 2022) make it easy to analyze Social Media data.

Social Media and Stock Prices In recent years, the influence of retail investors on financial markets through Social Media has garnered significant attention. Researchers have delved into this phenomenon, seeking to understand its implications and potential warning signals for market stability. For example, Agrawal et al. (2022) leverage BERTopic to identify unique traits of Social Media users in finance communities, whereas Schou et al. (2022) and Thukral et al. (2022) analyze the effect of social network communities on financial behavior using the LDA model. Similarly, Sidhu et al. (2022) investigate multiple online communities, using a mixture of LDA and Support Vector Machines (Cortes and Vapnik 1995). Concurrently, Gianstefani et al. (2022) conduct a study focusing on the impact of retail investors using Social Media, leveraging simple linear regression techniques and focusing solely on the Subreddit “WallStreetBets” (WSB). It is worth noting, however, that their analysis primarily revolves around identifying events without delving into whether the social network activity will ultimately lead to positive or negative outcomes for stocks. Aloosh et al. (2021) analyze the behavior of meme stock traders in the context of herd behavior, but do not analyze the activities in social networks and focus exclusively on the stock data. Wang and Luo (2021) estimate the price trend of the GameStop share by analyzing several thousand WSB-Subreddit posts using Sentiment Analysis and textual features. However, their research covers a very short timeframe of less than 3 months, and they only focus on accurately predicting the direction of the effect that sentiments have on the stocks. Related to this line of research, Costola et al. (2021) utilize a regime-switching co-integration model to identify stock momentum. They use a simple approach, relying solely on counting stock name mentions in TwitterFootnote 4 discussions, distinct from finance-specific platforms like WSB. While they are able to identify correlations, their study’s main focus is on characterizing Social Media patterns rather than predicting stock price directions, since they do not consider the sentiment of the Social Media discussions. Padalkar (2021) leverages text mining and semantic analysis methods to analyze Reddit posts. However, their Sentiment Analysis approach relies only on bag-of-words techniques developed by Antweiler and Frank (2004), rather than state-of-the-art word and document embeddings. Their findings suggest a correlation between higher Reddit sentiment and increased stock volume. Furthermore, they observe that a high frequency of posts has significant impacts on stock returns, traded volume, and volatility. Zhao et al. (2023) also find that Social Media Sentiment Analysis can be utilized to approximate stock prices. Similar to Aloosh et al. (2021), they identify some form of herd behavior in stock traders. In a related study, Thormann et al. (2021) employ a combination of neural networks and Sentiment Analysis, alongside traditional financial indicators, to forecast Apple stock prices. Their findings suggest an enhanced performance over a baseline model that relies solely on historical closing prices, although, given the random walk nature of stock prices, accurate prediction remains inherently challenging and uncertain. Matthies et al. (2023) leverage Sentiment Analysis and perform simple regression based on the extracted sentiment scores. However, they also focus only on Twitter data. Koltun and Yamshchikov (2023) demonstrate the effectiveness of Social Media Sentiment Analysis by incorporating sentiment information into price prediction, which enhances model performance for cryptocurrency forecasting. Shiri et al. (2023), on the other hand, specifically analyze the effect that emojis have on stock price prediction. In contrast to the aforementioned approaches, they do not find an effect.

The literature suggests that Social Media platforms can have a significant impact on financial markets and share prices. Moreover, it underscores the potential role that state-of-the-art textual analysis can play as one of several key tools in aiding more informed decision-making when approximating stock price movements. However, a comprehensive investigation of multiple platforms that draws on the latest natural language processing methodologies and machine-learning-based forecasting is missing.

3 Data collection and pre-processing

The data for our study was provided by FIDA Software GmbH and was collected via the platforms’ service APIs. The data covers the period from May 2020 to May 2021 and includes 5.1 million comments from Stocktwits, 1.08 million comments from Reddit, and 14 thousand comments and 317 articles from Seeking Alpha. The chosen timeframe covers the months leading up to the GameStop short squeeze, the event itself, and its aftermath. Several bot detection measures are applied to ensure that we are dealing with human-made content.

The first source, Reddit, is a social news aggregator website with thousands of subforums, called “Subreddits”, covering a vast variety of topics. We collect data from the WSB Subreddit, where the GameStop short squeeze was discussed and coordinated. In this forum, we collect the so-called “Daily Discussion Threads” and the “What are your moves Tomorrow?” threads. In both moderated threads, serious discussions about investment decisions are held, in contrast to many other threads in the Subreddit, where a lot of satire, jokes, and shitposting are present. The data has the structure of a discussion: users can reply to posts made in the threads, promoting discussions on a topic that has been brought up. We filter the posts by mentions of the stocks we are interested in. This means that if a stock or its ticker (like “GME”) is mentioned, we include the post in our analysis. Additionally, if a stock is mentioned in a lower-level reply to a post, the whole discussion is added to our data. Another noteworthy attribute of the Reddit data is that users are able to rate and award comments. We use this metadata in the further analysis.

The second data source we use is Stocktwits, a social news ticker service. This platform is design-wise similar to Twitter: users post short messages, restricted to a few words only. The major difference to Twitter lies in the topical focus: on Stocktwits, users almost exclusively talk about stocks, investments, and cryptocurrencies. Users can use so-called “Cashtags” to tag the stocks they are talking about in their posts. On average, posts on Stocktwits are very short, often consisting of only about one sentence. An interesting feature of Stocktwits posts is the option for a user to mark their posts as “bearish” or “bullish”, providing us with a de-facto user-labeled sentiment dataset with regard to the stocks we are interested in. We collected the data by filtering for the Cashtags of the stocks we use in our analysis. One problem with the Stocktwits data is the high prevalence of bots posting and promoting products or sentiments with regard to specific stocks. For our analysis, we therefore filter our collected data for bots. We assume that bot-driven accounts have several attributes that make them easily identifiable. First, bots have a very high posting frequency: if a user posted more than 30 posts per day, we remove them from our dataset. Second, we assume that the content posted by bots follows a syntactical pattern; we therefore check, for each user, the syntactical equality of several of their posts using the Hamming distance measure (Hamming 1986). If the similarity of two posts (measured via the Hamming distance between the two sequences) exceeded a predefined threshold, we remove all posts of such users from our dataset.
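To make the two heuristics concrete, the following minimal Python sketch illustrates how such a bot filter could look. The 30-posts-per-day limit is from our setup, while the similarity threshold, sample size, and function names are illustrative assumptions rather than our exact implementation.

```python
from itertools import combinations

def hamming_similarity(a: str, b: str) -> float:
    """Share of matching characters. The Hamming distance is defined for
    equal-length sequences, so we compare over the shorter post's length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

def is_probable_bot(posts_per_day: list[int], posts: list[str],
                    max_daily: int = 30, sim_threshold: float = 0.9,
                    sample_size: int = 10) -> bool:
    """Flag a user if they exceed the posting-frequency limit, or if any two
    of their sampled posts are near-identical. Thresholds are illustrative."""
    if posts_per_day and max(posts_per_day) > max_daily:
        return True
    sample = posts[:sample_size]
    return any(hamming_similarity(a, b) > sim_threshold
               for a, b in combinations(sample, 2))
```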

The final data source in our analysis is Seeking Alpha, a social news blog. Seeking Alpha mainly provides longer, essay-like texts on equities, exchange-traded funds, and investment strategies. These articles are often written by selected investors or experts. Additionally, all articles have a comment section for discussing the article’s topics. We collect the articles with regard to all analyzed stocks, including all comments belonging to the collected articles. For our analysis, we only use text chunks that directly refer to the stocks of interest; more on that can be found in Sect. 4. The comments on the articles are structurally similar to those from Reddit.

In short, we have three rather diverse sources of content. The preliminary outlook is that the long-format content of Seeking Alpha caters to a smaller niche of “serious” and professional investors, while WSB is more accessible and reaches a broader, more amateur audience that espouses a more archaic, gamified approach to investing. In many ways, Stocktwits appears to be a middle ground between the two.

We collect data about the “Meme Stocks” GameStop (ticker: GME), an American retail chain specializing in video games, consumer electronics, and gaming merchandise; Blackberry (BB), a technology company for enterprise software and the Internet of Things; and Palantir (PLTR), a software company focused on Big Data analytics. All of these stocks were widely discussed in the communities, especially on Reddit’s WSB. Additionally, we add two so-called “Blue Chip” stocks, FacebookFootnote 5 (FB) and Amazon (AMZN). We add these two established large-scale companies as a baseline for our analysis, specifically for our trend prediction approach. We expect a less prevalent discussion about these stocks in the targeted communities, and a lower impact of the communities’ actions on these stocks due to the huge number of investors holding their shares.

The processing of the data consists of several steps. The raw data collected via the platforms’ APIs is cleaned of everything not used in our analysis. Consequently, we remove all hyperlinks and images from the text. We harmonize the datetime of all data sources to the CET timezone. Additionally, we filter falsely flagged posts, which are especially present on Stocktwits. These posts mostly carry many Cashtags to gain a wide audience for the shared content, even though the post does not thematically align with the Cashtags. We therefore remove all posts that carry more than two Cashtags from the dataset.
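A minimal pandas sketch of these cleaning steps is given below; the column names, the assumed UTC source timezone, and the Cashtag pattern are illustrative assumptions.

```python
import re
import pandas as pd

URL = re.compile(r"https?://\S+")
CASHTAG = r"\$[A-Za-z]{1,6}"

def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Remove hyperlinks from the text body (images are dropped upstream).
    out["text"] = out["text"].str.replace(URL, "", regex=True)
    # Harmonize timestamps to CET; the source timezone is assumed to be UTC.
    out["created_at"] = pd.to_datetime(out["created_at"], utc=True).dt.tz_convert("CET")
    # Drop likely falsely flagged spam posts carrying more than two Cashtags.
    return out[out["text"].str.count(CASHTAG) <= 2]
```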

Descriptive statistics Before starting our analysis, we provide an overview of the data. Figure 2 presents the daily frequency of posts on Reddit and Stocktwits (we omit the Seeking Alpha data from this plot since its numbers are relatively small). We omit the Reddit data from April 2021 onwards since it contains many missing values (NAs).

Table 1 presents the descriptive statistics regarding comment length by source. Unsurprisingly, Seeking Alpha articles are by far the data source with the longest texts. It is notable that Seeking Alpha comments are also considerably longer than their Reddit and Stocktwits counterparts—it would appear that the long-format nature of the articles also invites more verbose discussions in their comment section. Interestingly, Reddit and Stocktwits comments are remarkably close when it comes to word length, despite Reddit comments having no character limit.

Table 1 Length of comments/articles in words by source

4 Methodology

Our analytical framework consists of a pipeline, which covers all steps from data collection and cleaning to feature generation and modelling. As stated before, we focus on two dimensions: a predictive-analytical dimension to assess the value of the data for forecasting purposes, and a descriptive-analytical dimension to outline the actual contents of the discussion, providing a foundation for a better understanding of the subculture. A detailed overview of the implemented system can be found in Fig. 3.

4.1 Classification

Our classification method aims to show that the discussions in our sources contain crucial information about how small-scale stock investors make decisions. We estimate a simple bi-directional (positive or negative) price trend on a daily basis; in other words, the classification answers the question of whether a given stock’s share price will increase or decrease on the following day. To this end, we aggregate the data on a daily level for each of our three data sources. The aggregation is a simple daily average for most features; an overview of the features can be found in Table 8. Note that, since we are forecasting the development of stock prices, we work with time series. In each estimation, we include all the available data for the respective stock.
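The following sketch illustrates the daily aggregation and the construction of the binary trend target; the data layout and column names are assumed for illustration.

```python
import pandas as pd

def build_daily_dataset(posts: pd.DataFrame, prices: pd.Series) -> pd.DataFrame:
    """posts: one row per post with a 'date' column and numeric feature
    columns; prices: daily closing prices indexed by date (illustrative)."""
    # Aggregate post-level features to a simple daily average.
    daily = posts.groupby("date").mean(numeric_only=True)
    # Binary target: does the closing price rise on the following day?
    trend = (prices.shift(-1) > prices).astype(int).rename("target")
    return daily.join(trend, how="inner")
```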

Given our task of modeling a binary stock market time series target with the help of multiple features in a highly irregular market period, we selected a Random Forest and an LSTM model for our analysis.

Classical approaches for time series modeling, such as ARIMA and its variants, have two main disadvantages in our use case: (1) they are traditionally used for regression tasks, as shown in Hyndman and Athanasopoulos (2018), not for classification tasks such as ours; (2) they are built on the premise that future values of a time series can be predicted as a linear function of past observations and stochastic error terms, and their usage is thus suboptimal when it comes to capturing non-linearities, as indicated in Brockwell et al. (2016). Since our considered timeframe centers around the short-squeeze period and includes its prelude and aftermath, an altogether volatile setting, we would ideally like our model(s) of choice to be able to effectively capture non-linearities.

Random Forest models, on the other hand, are well suited for non-linear classification tasks and offer interpretability in the form of feature importance. And while they cannot innately handle temporal dependencies within the data, this can be easily (albeit only partially) amended by using lagged versions of our features.

Finally, LSTM models seem to fit all of our criteria: their ability to adeptly navigate the multivariate, complex, and non-linear temporal relationships characteristic of stock market dynamics, across both short- and long-term horizons, has been well documented empirically in Selvin et al. (2017), Althelaya et al. (2018a), and Althelaya et al. (2018b).

In the following, we briefly cover the two classifiers: a Random Forest model (Ho 1995) and an LSTM-based (Hochreiter and Schmidhuber 1997) neural network.

Combining multiple decision trees, Random Forest classification is an ensemble learning method that improves predictive accuracy and reduces overfitting compared to a single decision tree. Assuming a binary classification case with a dataset of I samples and J features, the classifier builds a group of G decision trees. Each tree is trained on a different subset of the data and a randomly sampled subset of the features at hand. At each node of a decision tree, the Random Forest classifier selects the best feature to split the data based on the Gini impurity or entropy criterion. The Gini impurity assesses the likelihood of wrongly classifying a random sample from a node, whereas entropy evaluates the uncertainty of the class distribution. A new sample’s class prediction is determined by a majority vote of the individual decision trees after the new data has been processed by them. Formally, let \({x}_i \in {\mathbb {R}}^{J}\) describe the feature vector of the i-th sample and \(y_i \in \{0,1\}\) denote its corresponding binary label. Additionally, let the collection of decision trees be described by \({G} = \{G_1, G_2, \ldots , G_G\}\), each trained on a differing subset of the data and a randomly sampled subset of the features. The classifier then predicts the class \({\hat{y}}_i\) of a new sample \({x}_i\) as:

$$\begin{aligned} {\hat{y}}_i = \arg \max _{l} \sum _{g=1}^{G} {\mathbb {I}}(G_g({x}_i) = l) \end{aligned}$$
(1)

where \({\mathbb {I}}(G_g({x}_i)=l)\) describes an indicator function that returns 1 if the g-th decision tree predicted class l for \({x}_i\), and 0 otherwise. During the training process, the most important hyperparameters to optimize are the number and the depth of the decision trees, as indicated in Breiman (2001), Breiman (2017).
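In practice, this majority-vote classifier is readily available in standard libraries. The scikit-learn sketch below mirrors the setup on synthetic stand-in data; apart from the 101 trees used in our direct classification run (see below), all settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))     # stand-in for the aggregated daily features
y = rng.integers(0, 2, size=200)   # stand-in for the binary trend target

# 101 trees with Gini impurity as the split criterion; the depth is left
# unrestricted here purely for illustration.
rf = RandomForestClassifier(n_estimators=101, criterion="gini", random_state=0)
rf.fit(X[:150], y[:150])
y_hat = rf.predict(X[150:])         # majority vote over the trees, Eq. (1)
print(rf.feature_importances_[:3])  # interpretability via feature importance
```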

An LSTM-based neural network for binary classification can be defined as follows:

Assuming a set of input sequences \({x} = ({x}_1, {x}_2, \ldots , {x}_I)\) and a set of corresponding binary labels \({y} = (y_1, y_2, \ldots , y_I)\), a sequence of hidden states \({h} = ({h}_1, {h}_2, \ldots , {h}_I)\) is obtained by processing the input. Finally, the last hidden state \({h}_I\) is fed into a sigmoid activation function to obtain the prediction \({\hat{y}}_I\) as follows:

$$\begin{aligned} {\hat{y}}_I = \sigma ({A}_o{h}_I + b_o), \end{aligned}$$
(2)

where \({A}_o\) and \(b_o\) represent the weights and bias of the output layer and \(\sigma \) is the sigmoid activation function. Using a binary cross-entropy loss function, the predicted labels \(\hat{{y}}\) and the true labels \({y}\) are compared:

$$\begin{aligned} L({y},\hat{{y}}) = -\frac{1}{I}\sum _{i=1}^{I}\left( y_i\log {\hat{y}}_i + (1-y_i)\log (1-{\hat{y}}_i)\right) , \end{aligned}$$
(3)

where \({y}\) and \(\hat{{y}}\) are the true and predicted labels, respectively. The weights and biases of the LSTM cells, the output layer, and other parameters are learned through a combination of backpropagation and gradient descent optimization.
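For illustration, a minimal PyTorch sketch of such an LSTM-based binary classifier is shown below; the layer and input sizes are illustrative and do not reproduce the 13-layer architecture used in our experiments.

```python
import torch
import torch.nn as nn

class TrendLSTM(nn.Module):
    """Minimal LSTM binary classifier following Eqs. (2) and (3)."""
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)        # weights A_o and bias b_o

    def forward(self, x):                      # x: (batch, seq_len, features)
        _, (h_last, _) = self.lstm(x)          # last hidden state h_I
        return torch.sigmoid(self.out(h_last[-1])).squeeze(-1)   # Eq. (2)

model = TrendLSTM(n_features=12)
loss_fn = nn.BCELoss()                         # binary cross-entropy, Eq. (3)
x = torch.randn(8, 5, 12)                      # 8 samples of 5 lagged days
y = torch.randint(0, 2, (8,)).float()
loss = loss_fn(model(x), y)
loss.backward()                                # backpropagation step
```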

For both classifiers, we implement two approaches. Firstly, a “direct” classification for all stocks, using 101 decision trees in the Random Forest model and a 13-layer architecture for the LSTM-based neural network. For this approach, the training data contains the features listed in Table 8 from the 1st of November 2020 to the 29th of March 2021, while the test data covers the 30th of March 2021 to the 7th of May 2021. Secondly, we check our results for robustness using the Time Series Split Cross Validation approach, a special case of k-Fold Cross Validation, outlined in Hyndman and Athanasopoulos (2018). This method divides a time series dataset into several sequential folds, where each fold comprises a continuous subset of the data. The model is trained on the initial fold and assessed on the time-wise subsequent fold, with the process being repeated for all folds; each succeeding fold contains more recent data. In each fold, the data is split into training, validation, and test sets; hyperparameter optimization is applied, and the best parameters from the validation set are applied to the test set. A graphical representation can be found in Fig. 4. For this approach, we use 5 splits with a training set starting size of 1 month, a validation set size of 13 days each, and a test set size of 15 days each. Consequently, the overall data usage shrinks, with the 28th of April 2021 as the last day in the final test set, as dictated by the split outline.
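This splitting scheme corresponds to what scikit-learn provides as TimeSeriesSplit. The sketch below, on stand-in data, illustrates the growing training window; our additional validation carve-out within each training window is noted in a comment.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(120).reshape(-1, 1)   # stand-in for 120 daily feature rows

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # In our setup, the tail of each training window additionally serves as
    # a validation set for hyperparameter tuning before touching the test set.
    print(f"fold {fold}: train 0..{train_idx[-1]}, "
          f"test {test_idx[0]}..{test_idx[-1]}")
```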

A challenge for the design of the experimental outline was the handling of non-trading days on the stock markets, especially weekends. While the discussions about investment decisions continue on all data sources on these days, investments themselves can only be conducted on days on which the stock exchanges are open. For the first, non-cross-validated approach, we evaluate two different ways of handling days with a missing response variable: firstly, backward-fill, where we assign the trend of the following Monday to Saturday and Sunday; secondly, removing the weekends altogether, including all input data of those days. For the Time Series Split Cross Validation approach, we decided to remove the weekends.
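A small pandas sketch of the two weekend-handling variants, on illustrative data:

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=8, freq="D")   # Friday to Friday
df = pd.DataFrame({"trend": [1, None, None, 1, 0, 1, 1, 0]}, index=idx)

# Variant 1: backward-fill, assigning Monday's trend to Saturday and Sunday.
backfilled = df.assign(trend=df["trend"].bfill())

# Variant 2: remove non-trading days (and their input data) altogether.
weekdays_only = df[df.index.dayofweek < 5]
print(backfilled, weekdays_only, sep="\n\n")
```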

4.2 Topic modelling

In a separate pre-processing step, we utilize LDA Topic Modelling on the Seeking Alpha articles to pinpoint aspects relevant to the stocks of our interest, and we filter out sections of the articles that do not align with these specific topics. In Natural Language Processing, “aspects” typically refer to the specific features or attributes of a product, service, or entity discussed in text data. In the context of our analysis, aspects can be understood as particular characteristics and traits that are closely associated with the stock we are investigating. For instance, we may consider “cinema” an aspect relevant to the stock of “AMC”. In short, we use LDA to create topics from the Seeking Alpha articles and then carefully select words based on their close affiliation with a specific stock to avoid any ambiguities. After generating a list of such aspects, we only incorporate sentences from the Seeking Alpha articles that contain at least one relevant aspect. This ensures that we only use content strictly related to our targeted stocks in case the articles have a multi-topical focus. This procedure is only conducted for the Seeking Alpha articles, since they are the only longer document type in our project, and it is considered a mere pre-processing step before the analysis.
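The aspect-based sentence filter can be summarized by the following sketch; the naive sentence splitting and the example aspect list are simplified assumptions.

```python
def filter_aspect_sentences(article: str, aspects: set[str]) -> list[str]:
    """Keep only sentences mentioning at least one stock-related aspect.
    The split on '.' is a simplification; in practice a proper sentence
    tokenizer would be used."""
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return [s for s in sentences
            if any(a.lower() in s.lower() for a in aspects)]

# Hypothetical example: "cinema" as an aspect tied to the AMC stock.
print(filter_aspect_sentences(
    "Cinema attendance is recovering. Cloud revenue grew.", {"cinema"}))
```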

4.2.1 LDA

First applied in a Machine Learning context by Blei et al. (2003), the Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for Topic Modelling. It is a three-level hierarchical Bayesian model and, given a corpus of documents, it aims to discover a set of topics, where each topic is characterized by a distribution over words. The relevant notation can be found in Table 9.

The generative process of LDA can then be summarized as follows:

  1. 1.

    For each document d:

    1. (a)

      Sample a topic distribution \(\theta _d \sim \text {Dir}(\alpha )\)

    2. (b)

      For each word w in the document d:

      1. (i)

        Sample a topic assignment \(z_{w,d} \sim \text {Multinomial}(\theta _d)\)

      2. (ii)

        Sample a word assignment \(c_{w,d} \sim \text {Multinomial}(\beta _{z_{w,d}})\), where \(\beta _{k}\sim \text {Dir}(\eta )\) is the word distribution conditioned on the topic k assigned in (i)

In practice, the topic and word distributions are unknown to us—we have only our observed corpus, D, and our chosen hyperparameters, \(\alpha \), \(\eta \), and K. Given this information, LDA works backwards to infer the set of topics that is most likely to have generated the corpus. This involves inferring both the topic mixtures for each document and the word distributions for each topic. Blei et al. (2003) give the general formulation of the posterior of the hidden variables (topic assignments and distributions) given a document:

$$\begin{aligned} p(\theta ,z,\beta \mid d, \alpha , \eta ) = \frac{p(\theta , z, \beta , d \mid \alpha , \eta )}{p(d \mid \alpha , \eta )} \end{aligned}$$
(4)

By estimating this posterior distribution, LDA enables us to discover the underlying topics in a collection of documents. Blei et al. (2003) show that the denominator of the posterior is often intractable to compute. Therefore, the estimation of the posterior is usually done through Gibbs sampling or variational inference.
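In practice, this inference is handled by libraries such as gensim, which we use in Sect. 5.2. A minimal sketch on a toy corpus, with the symmetric priors \(\alpha = \eta = 1/K\) that we also use later; the corpus and topic number are illustrative.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["gamestop", "short", "squeeze"],
        ["amazon", "cloud", "revenue"],
        ["gamestop", "shares", "moon"]]          # toy tokenized corpus

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

K = 2                                            # number of topics
lda = LdaModel(corpus, num_topics=K, id2word=dictionary,
               alpha=[1.0 / K] * K, eta=1.0 / K, random_state=0)
print(lda.print_topics())                        # word distributions per topic
```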

4.2.2 NMF

The Non-Negative Matrix Factorization (NMF), introduced by Lee and Seung (1999), is a dimensionality reduction technique commonly used for data analysis and feature extraction. The relevant notation can be found in Table 9.

Given the TF-IDF matrix U, where each row represents a document and each column represents a unique term, NMF seeks to factorize U into two non-negative matrices: the document-topic matrix \(\theta \) and the topic-term matrix \(\beta \).

To obtain the factorization, NMF minimizes the reconstruction error or the approximation loss between U and \(\theta \beta \), for example via the following objective function:

$$\begin{aligned} \min _{\theta ,\beta } \Vert U - \theta \beta \Vert _{F}^2, \end{aligned}$$
(5)

where \(\Vert \cdot \Vert _{F}\) represents the Frobenius norm, and \(\theta \) and \(\beta \) are subject to the following constraints:

$$\begin{aligned} \theta \ge 0 \quad \text {and} \quad \beta \ge 0. \end{aligned}$$
(6)

As per Berry et al. (2007), the optimization problem can be solved using various algorithms, such as multiplicative updates:

  1. 1.

    Initialize \(\theta \) as a random dense matrix, \(\theta \) = rand(m,k)

  2. 2.

    Initialize \(\beta \) as a random dense matrix, \(\beta \) = rand(k,n)

  3. 3.

    Then, until algorithm converges or for a preset number of iterations:

    1. (a)
      $$\begin{aligned} \beta _{ij} \leftarrow \beta _{ij} \frac{(\theta ^TU)_{ij}}{(\theta ^T\theta \beta )_{ij}} \end{aligned}$$
      (7)
    2. (b)
      $$\begin{aligned} \theta _{ij} \leftarrow \theta _{ij} \frac{(U\beta ^T)_{ij}}{(\theta \beta \beta ^T)_{ij}} \end{aligned}$$
      (8)

In a Topic Modelling context, by applying NMF to the document-term matrix, the resulting document-topic matrix represents the topic proportions for each document, while the topic-term matrix indicates the term distributions for each topic. The entries in \(\theta \) and \(\beta \) can then be interpreted as the strengths of the associations between documents and topics, and topics and terms, respectively.
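A minimal scikit-learn sketch of NMF-based Topic Modelling on a toy corpus; we select the multiplicative-update solver to match the algorithm outlined above, while the corpus and hyperparameters are illustrative.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gamestop short squeeze", "amazon cloud revenue",
        "gamestop shares rally", "amazon prime sales"]    # toy corpus

U = TfidfVectorizer().fit_transform(docs)   # TF-IDF matrix U (docs x terms)
nmf = NMF(n_components=2, solver="mu",      # multiplicative updates, Eqs. (7)-(8)
          init="random", max_iter=500, random_state=0)
theta = nmf.fit_transform(U)                # document-topic matrix
beta = nmf.components_                      # topic-term matrix
```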

4.2.3 BERTopic

Created by Grootendorst (2022), BERTopic is a modular framework for Topic Modelling. The relevant notation can be found in Table 9. The BERTopic implementation typically consists of four steps.Footnote 6

  1. 1.

    Compute EmbeddingsFootnote 7 At this stage, input documents are converted into numerical real-valued vectors, making them more suitable for neural network-based models. This representation is called “embedding” and encodes the meaning of the word (or sentence, depending on the type of embeddings we use) in the vector space. A close proximity of two embeddings in the vector space implies a similar meaning. By default, BERTopic utilizes a sentence-transformer model (we used the default: “all-MiniLM-L6-v2”)—a modification of the pretrained BERT model presented by Reimers and Gurevych (2019)—for its embeddings.

  2. 2.

    Reduce dimensionsFootnote 8 Since embeddings are often high-dimensional, the clustering that is later required to construct topics could become computationally highly expensive and complex. Therefore, BERTopic employs by default the dimension reduction technique UMAP, developed by McInnes et al. (2018). UMAP has the capability to effectively capture the complex relationships and structures present in both the local and global aspects of high-dimensional embedding space when reducing the dimensionality. By default, BERTopic sets the number of components to 5, meaning the data is embedded into a reduced space with a dimensionality of 5.

  3. 3.

    ClusterFootnote 9 Having reduced the dimensionality of the embeddings, we next cluster them into groups by similarity, in order to extract meaningful topics. This is a crucial step, as the quality of the clustering is paramount to the quality of our Topic Model (in the context of Topic Modelling, “cluster” and “topic” are interchangeable). By default, BERTopic employs HDBSCAN, implemented by McInnes et al. (2017), for its clustering.

  4. 4.

    Obtain topic representationsFootnote 10 The topic representation should showcase what makes one topic different from the others. The default approach for this in BERTopic is the class-based Term Frequency-Inverse Document Frequency (henceforth c-TF-IDF):

    $$\begin{aligned} Q_{w,k} = \left\| M_{w,k} \right\| \log \left( 1+\frac{A}{f_w}\right) \end{aligned}$$
    (9)

    where \(M_{w,k}\) is the frequency of word w in class (or cluster) k, \(f_w\) is the frequency of word w across all classes, and A is the average number of words per class. In this context, the terms “cluster” and “class” can be used interchangeably. The result, \(Q_{w,k}\), can be interpreted as an importance score for word w in the cluster/topic k: intuitively, the more important a word is within a cluster, the more representative it is of that topic (a minimal usage sketch of the full pipeline follows after this list).
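The four steps map directly onto BERTopic’s default pipeline. The sketch below shows a minimal usage, where `load_corpus()` is a hypothetical loader returning the cleaned documents; BERTopic’s defaults require a reasonably large corpus for UMAP and HDBSCAN to behave well.

```python
from bertopic import BERTopic

# `load_corpus()` is a hypothetical loader returning the cleaned posts
# (tickers already mapped to company names).
docs: list[str] = load_corpus()

# Defaults: all-MiniLM-L6-v2 embeddings -> UMAP (5 components) -> HDBSCAN
# clustering -> c-TF-IDF topic representations, i.e. steps 1-4 above.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())     # topic sizes and top words
topic_model.reduce_topics(docs, nr_topics=20)  # ex-post hierarchical reduction
```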

4.2.4 The problem of evaluating topic models

Since Topic Modelling is a form of unsupervised learning, there are no “objective” metrics, such as accuracy, that help us evaluate its performance.

Other metrics such as coherence (Newman et al. 2010), diversity (Dieng et al. 2020), and perplexity (Zhao et al. 2015) could be used. Coherence measures the semantic coherence of the words within a topic, assessing how well they relate to each other. Diversity measures the distinctiveness of topics, ensuring that they cover different aspects of the corpus. Perplexity quantifies how well a Topic Model predicts unseen data. Grootendorst (2022) outlines that while these metrics can provide some guidance and assist in hyperparameter tuning, they are still proxies and may not fully capture the subjective evaluation of topics. Indeed, Chang et al. (2009) find evidence that many of these performance metrics negatively correlate with what humans interpret as coherent topics.

The decision of what constitutes a good Topic Model should then be more results-oriented. Chang et al. (2009) suggest that researchers should focus more on the real-world task performance of models. The statistical outputs of Topic Modelling algorithms can serve as a starting point, but the interpretation and validation of topics should involve human evaluation and expertise. It is therefore up to the researchers to define goals and evaluate the models with regard to them. For our research, we define two goals for the topic models:

  1. 1.

    Detect market-wide bearish/bullish trends or other interesting contemporary phenomena

  2. 2.

    Successfully separate the texts into stock-centric topics with as little overlap as possible, ideally:

    1. (a)

      assign each stock to only one topic

    2. (b)

      ensure this topic does not contain other stocks

The main task of the stock-separation goal is to act as a sanity check for the stock filtering step of our data pre-processing. This can help us to evaluate whether we have a clean separation of stocks in our data, which is crucial for our subsequent modelling of stock prices. Moreover, clearly defined stock topics allow us to pinpoint stock-specific discussions and track their fluctuations over time. This can provide intriguing insights when paired with current stock return trends and events.

To make a model’s performance with regard to our stock-separation goal easier to track, we implement a metric for our research that we call “stock-specificity”: the number of stocks that appear in exactly one topic, where that topic contains no other stock. Expressed mathematically:

$$\begin{aligned} \Psi = \mid \{ s \in S: \mid K(s) \mid = 1 \ \wedge \ \mid S(K(s)) \mid = 1 \} \mid \end{aligned}$$
(10)

The relevant notation can be found in Table 9.
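A small Python sketch of this metric, assuming topics have already been mapped to the stocks appearing in their representations (the data structure is illustrative):

```python
def stock_specificity(topic_stocks: dict[int, set[str]]) -> int:
    """Eq. (10): count stocks that appear in exactly one topic, where that
    topic contains no other stock. `topic_stocks` maps topic id -> set of
    stocks mentioned in its representation."""
    all_stocks = set().union(*topic_stocks.values())
    score = 0
    for s in all_stocks:
        hosting = [k for k, stocks in topic_stocks.items() if s in stocks]
        if len(hosting) == 1 and topic_stocks[hosting[0]] == {s}:
            score += 1
    return score

# GME and AMZN each live alone in one topic; BB and PLTR share one -> score 2.
print(stock_specificity({0: {"GME"}, 1: {"BB", "PLTR"}, 2: {"AMZN"}}))
```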

4.3 Sentiment analysis

For the predictive analysis, we generate a classification-based trend estimation using features from both the metadata and the text content itself, as detailed in Table 8. For all data sources, we apply the Valence Aware Dictionary and sEntiment Reasoner (VADER), a rule-based Sentiment Analysis framework specialized for short, sparse Social Media data, implemented by Hutto and Gilbert (2014). We use VADER’s “compound” value for all posts.
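For illustration, querying VADER’s compound score via the vaderSentiment package looks as follows; the example post is invented.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("GME to the moon, buying more!")  # toy post
print(scores["compound"])   # normalized score in [-1, 1]; the value we use
```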

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018) is a Transformer model, which consists of layers with a self-attention mechanism as well as a position-wise fully connected feed-forward neural network. We employ both BERT-Sentiment and VADER to offset any potential shortcomings of either method, especially given the unique language characteristics in our dataset.

4.4 Granger causality test

The Granger causality test, first proposed in Granger (1969), is a statistical test used to determine whether one time series can linearly forecast another. It is important to note that this does not imply causation in the strict sense. In our case, we use the test to determine whether the daily average stock-specific sentiments have predictive value for stock price movements. We consider a bivariate example with two time series variables, denoted as X and Y (the relevant notation can be found in Table 9), and implement the following testing procedure (a code sketch follows after the list):

  1. 1.

    We test each time series for non-stationarity, for example via an Augmented Dickey-Fuller (ADF) test.

  2. 2.

    We determine the lag length p, which represents the number of past observations of X and Y used in the model, via AIC or BIC.

  3. 3.

    We specify the vector autoregression (VAR) models as:

    $$\begin{aligned} Y(t) = \tau + \delta _1 Y(t-1) + \cdots + \delta _p Y(t-p) + \epsilon (t) \end{aligned}$$
    (11)
    $$\begin{aligned} Y(t) = \tau + \delta _1 Y(t-1) + \cdots + \delta _p Y(t-p) + \gamma _1 X(t-1) + \cdots + \gamma _p X(t-p) + \epsilon (t) \end{aligned}$$
    (12)

    We test for autocorrelation via a Durbin-Watson test, and if it is present, increase p by 1 until we eliminate it.Footnote 11

  4. 4.

    We test for the absence of Granger causality using our specified VAR model, where the null hypothesis \(H_0\) is: the past values of X do not have any additional predictive power for forecasting Y beyond the information contained in the past values of Y alone. Expressed mathematically: \(H_0\): \(\gamma _1 = \cdots = \gamma _p = 0\)

  5. 5.

    We perform hypothesis testing, for example using an F-test or t-test, to compare the two models.

    • If the null hypothesis \(H_0\) is rejected, it suggests that X Granger-causes Y, indicating that the past values of X provide additional information for predicting Y.

    • If the null hypothesis \(H_0\) is not rejected, it implies that X does not Granger-cause Y.

  6. 6.

    Repeat steps 2. through 5. whilst switching X and Y around, to explore whether the Granger causality is bidirectional.
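The core of this procedure is available in statsmodels. The sketch below runs the ADF test and the Granger F-tests on synthetic stand-in series in which past sentiment leaks into returns by construction; the series, lag order, and effect size are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

rng = np.random.default_rng(0)
sent = pd.Series(rng.normal(size=120), name="sentiment")   # stand-in X
ret = pd.Series(rng.normal(size=120), name="returns")      # stand-in Y
ret = ret + 0.5 * sent.shift(1).fillna(0.0)                # X leaks into Y

# Step 1: ADF test on each series (p < 0.05 rejects the unit-root null).
print("ADF p-values:", adfuller(sent)[1], adfuller(ret)[1])

# Steps 3-5: column 0 is the series to forecast (Y); the F-tests check the
# joint null that all gamma coefficients on lagged X are zero, per lag order.
results = grangercausalitytests(pd.concat([ret, sent], axis=1), maxlag=3)
```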

5 Results

5.1 Classification

Table 11 shows the classification accuracy for all stocks for the first, non-cross-validated classification approach. We find 72% classification accuracy for GME (GameStop) using BERT-Sentiments and removing all weekends with the LSTM model, and a surprisingly good performance of 68% for FB (Facebook) with the same outline using the Random Forest approach. AMZN (Amazon) performs best when using the LSTM model with backfilled weekends; whether BERT-Sentiments are included has no effect, with 64% accuracy in both cases. BB (Blackberry) performs best with the LSTM model without BERT-Sentiments and with backfilled weekends, scoring 66% accuracy. For Palantir (PLTR), we reach a high 74% accuracy for backfilled weekends, with and without BERT, using the Random Forest model.

Table 2 shows the classification accuracy for all stocks for the second, Time Series Split Cross Validation based classification approach. A first surprise is the 67% overall mean accuracy of AMZN with the LSTM model, contrary to the premise that “Blue Chip” stocks might be very difficult to predict based on the data at hand. FB, on the other hand, performs as poorly as expected: the results are no better than a random guess for both LSTM and Random Forest. BB also disappoints, with an overall mean accuracy of 57% for the LSTM model. PLTR, in contrast, scores very high with 68% for both models, suggesting a robust data foundation for classifying the trend of this stock using our database. A very interesting case is GME. The mean accuracy drops significantly to 56% for Random Forest and 53% for LSTM, respectively. Nevertheless, the absolute accuracies for each split reveal an interesting insight. The GME short squeeze began on the 21st of January 2021. For the split in which the training data consists only of November and December 2020 and the test set exactly covers the time of the short squeeze (14th of January to 28th of January), it is not surprising that the split’s accuracy hits rock bottom, heavily decreasing the overall mean accuracy. Encouragingly, the split accuracy increases significantly in the following month for the LSTM model, once the hype around the GME stock starts to accelerate. This suggests a stronger predictive potential of our model than the mean accuracy indicates, given the unexpected structural break, which is arguably by far the most extreme one among the stocks we focused on in the given time period.

Table 2 Summary of time series cross-validation results

Error analysis In the following, we conduct a brief error analysis. Traditionally, an error analysis considers the features of wrong/high-loss test set predictions and attempts to ascertain an underlying pattern. Since we are working with time series, we note that the features for any given prediction are actually the lagged features from the preceding n_lag days, where n_lag is model-specific.

We commence the error analysis by plotting the test set loss from all folds for each of our 5 considered stocks in Fig. 5. The correlation matrix of these loss series can be found in Table 3. We find that the PLTR test loss is weakly correlated with the GME test loss (corr. 0.284), the BB test loss (corr. 0.243), and the FB test loss (corr. 0.311). The BB test loss is moderately correlated with the AMZN test loss (corr. 0.44), as well as with the PLTR loss as outlined above. This suggests the presence of some temporal trends affecting the losses across different stocks, which could potentially be attributed to market-wide shocks in the meme stock domain or even beyond, and which our models were unable to capture through online discourse alone. Further research could include traditional market indices like the S&P 500 in the hope of capturing some of these temporal trends.

Table 3 Correlation matrix for the test-set losses of each stock specific model

We now turn to an examination of the feature-loss relationship. Table 13 presents the five features that are, on average, most correlated with the loss across all models. While we identify some features with small to moderate correlation with the loss in specific stock models, such as \(red\_reddit\_dd\_mean\_compound\_senti\_vader\_lag4\) (in plain words: the mean VADER sentiment in the daily discussion thread on Reddit from four days ago) with the test loss in the PLTR model, these correlations rarely hold across stock models. We therefore do not gain any generalizable insight. While we could try to address the model deficits on a stock level, for example by detecting and removing outliers within the \(red\_reddit\_dd\_mean\_compound\_senti\_vader\_lag4\) variable in the PLTR model specifically, this would mostly serve as a form of targeted optimization that risks overfitting to the nuances of historical data for a specific stock.

5.2 Topic modelling

We fit the LDA, NMF, and BERTopic models separately on the Stocktwits, Reddit, and Seeking Alpha data. We utilize the gensim implementation of LDA and the scikit-learn implementation of NMF. Since fitting BERTopic on the Stocktwits dataset proved too memory-intensive for our hardware capacity, we utilize a reduced sample of 1 million Stocktwits observations for all Topic Models for comparison purposes. As outlined earlier, the usual performance metrics can often be unreliable. Therefore, we only use coherence and perplexity to inform our intra-model hyperparameter decisions and do not rely on them for inter-model comparisons.

Table 4 provides an overview of the model specifications. We initially run BERTopic with no pre-processing barring stop-word and digit removal, as advised by the BERTopic developers.Footnote 12

Table 4 Specifications of the LDA, NMF, and BERTopic model runs

5.2.1 LDA

As outlined earlier, LDA requires the number of topics, K, as a hyperparameter. We implement models with default \(\alpha \) and \(\eta \) of 1/K. We perform hyperparameter tuning, where we evaluate the performance of different models with K = [10, 15, 20, 30, 40, 50] for Reddit, Stocktwits and Seeking Alpha comments and K = [5, 6, 7, 8, 9, 10] for Seeking Alpha articles. The perplexity and coherence scores of the LDA model over the different number of topics and sources are showcased in Fig. 6. The topic representations of the best performing LDA model w.r.t. coherence are displayed in Tables 12, 14, 15 and 16. For LDA, the “Count” column indicates the number of documents where the corresponding topic is most prominent.

5.2.2 NMF

Similar to LDA, NMF also requires K as an input. We conducted the same hyperparameter tuning as described above. The coherence scores of the NMF model over the different number of topics and sources are showcased in Fig. 7. The topic representations of the best performing NMF model w.r.t. coherence are showcased in Tables 17, 18, 19 and 20.

5.2.3 BERTopic

For the BERTopic implementation, we use the default sentence-embedding model all-MiniLM-L6-v2. For the dimensionality reduction and clustering, we take cuML’s implementations of UMAP and HDBSCAN in order to leverage the GPU and speed up our training. Initial runs of BERTopic show that the model is not able to establish the connection between a company and its stock ticker, such as “GME”. To amend this, we add another pre-processing step for BERTopic, in which we convert the stock tickers to the actual company names. Thanks to the HDBSCAN clustering model, we do not need to provide the number of topics ex-ante. Instead, we utilize the hierarchical topic structure that BERTopic builds to reduce the number of topics ex-post, thereby collapsing some outlier topics that are adjacent on the lower hierarchy level into a bigger topic. The modularity of BERTopic also allows for different ways to model topic representations. We use a KeyBERTFootnote 13-based topic representation in Table 21, a part-of-speech topic representation in Table 22, and the default c-TF-IDF representations in Tables 23 and 24. The reason for the different representational forms for the different data sources lies in an evaluation of all options for each data source: we use the best-performing representation for each source individually w.r.t. stock-specificity.

5.2.4 Stock specificity

The stock-specificity of the three models on the respective datasets is presented in Table 5. NMF achieves the highest stock-specificity on the Reddit data set, whereas BERTopic is the best-performing model on the Stocktwits, Seeking Alpha articles and Seeking Alpha comments dataset.

Table 5 Stock-specificity score of the models (the higher, the better, with the maximum possible score being 6 in our case) on the different datasets

5.3 Sentiment analysis

Part of our data comes with its sentiment self-labeled, since Stocktwits prompts users to indicate whether their comments are bullish or bearish. We utilize this to fine-tune our BERT model and to assess the performance of VADER and BERT on our dataset. We also self-label a random sample of around 200 Reddit comments, which we later use for testing.

We accomplish the fine-tuning of BERT by equipping a custom classification layer on top of the base pre-trained BERT model, employing the TFBertForSequenceClassificationFootnote 14 model provided by HuggingFace. We summarize the performance of both models on the held-out 36,000 Stocktwits and 200 Reddit comments in Table 25.
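The fine-tuning follows the usual HuggingFace/Keras pattern; the sketch below shows the setup on two toy posts, with the model checkpoint, learning rate, and epoch count chosen for illustration only, not reflecting our exact training configuration.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)           # bearish (0) vs. bullish (1)

texts = ["to the moon!", "this stock is done for"]   # toy self-labeled posts
labels = tf.constant([1, 0])

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dict(enc), labels, epochs=1, batch_size=2)  # fine-tune on the labels
```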

The fine-tuned BERT model significantly outperformed VADER by 21.19 percentage points on the Stocktwits test set and 13 percentage points on the Reddit test set, as can be seen in Table 25.

Next, we use this Sentiment Analysis model in conjunction with our topic models. Namely, we select a stock-specific topic, for example “gme_thread_gang” from our NMF Reddit model, and extract the comments associated with that topic. We then compute their sentiments with our model and obtain a daily average stock-specific sentiment time series, an example of which is shown in Fig. 8 (sentiments are defined on the unit scale; the more positive a sentiment, the closer it is to 1).

5.4 Granger causality test

We utilize the best-performing topic models w.r.t. stock-specificity for each dataset (NMF for Reddit, BERTopic for the rest) and use the topics from these models to obtain our stock-specific sentiment time series. In cases where we could not obtain a unique topic for a stock, we simply took the most prominent one in terms of its count, if applicable. Due to the low volume of our Seeking Alpha data, we combined the articles and comments datasets into one.

We also perform an additional experiment in which we circumvent having to rely on Sentiment Analysis and instead directly utilize the NMF Reddit model, simply taking the proportion of the bullish topic expressed each day compared to all topics—expressed as \(\frac{\#\text { of daily comments in the bullish topic}}{\#\text { of all daily comments}}\).

Given that Palantir’s IPO made the PLTR stock available to trade only from the beginning of October 2020,Footnote 15 and considering that we have little to no data before that point in time, we decided, for uniformity’s sake, to model the data for all stocks only from 01/10/2020 to 30/03/2021 (taking into account the NAs in the Reddit data in April).

The results of the stationarity tests are displayed in Tables 26, 27, 28 and 29. Using a significance level of 0.05, we can reject the null hypothesis of a unit root in all but one of the relevant time series and thereby infer that they are stationary.

The only series for which we could not reject the unit root was the PLTR returns in the combined Seeking Alpha dataset. As the PLTR returns were stationary in all other models, this brought to our attention that, even with the combined data, the Seeking Alpha dataset still did not have enough volume to cover all days in our considered timeframe. We visualize this in Fig. 9, where gaps in sentiment indicate days without PLTR-related articles or comments. Since we only consider stock returns on days where we have sentiment information, this results in a PLTR return series with the same gaps, which appears to have induced the non-stationarity. Upon differencing once, we can reject the ADF null hypothesis of a unit root. While we include the results for PLTR in the subsequent tables, we caution that a sensible interpretation of these results may not be possible due to the random gaps.

The results of the Durbin-Watson autocorrelation tests are displayed in Tables 30, 31, 32 and 33. For our research purposes, we accept a DW value between 1.5 and 2.5 as a sign of the absence of positive or negative serial autocorrelation. We are only forced to make an adjustment once, in the case of the PLTR sentiment time series from the NMF data: we increase the lag in the specified VAR model by 1, which appears to resolve the autocorrelation issue.

The results of the bi-directional Granger-causality tests are showcased in Tables 34, 35, 36, and 37. We summarize the significant results in Table 6. Please be aware that results from multiple-testing settings need to be interpreted with caution (Romano et al. 2010).

Table 6 Summary of significant findings in Granger-causality tests

In the Stocktwits dataset, we identify a significant unidirectional predictive relationship between the daily average sentiments for FB and its respective daily returns. In other words, the past daily average sentiments for Facebook provide statistically significant information on its future returns. We also find a significant unidirectional predictive relationship in the opposite direction for GME and AMZN—between their daily returns and their respective daily sentiments.

In the Reddit dataset, we identify a predictive relationship between the daily GME returns and the GME-specific sentiments, and between the daily GME returns and the bullish topic ratio. In the combined Seeking Alpha dataset, we identify a significant unidirectional predictive relationship between the daily average sentiments for AMZN and its respective daily returns. We also identify a bi-directional predictive relationship between PLTR daily average sentiments and the daily PLTR returns.

6 Discussion

6.1 Classification

In the results, we found surprisingly good performance for AMZN (67% CV mean accuracy) with the LSTM model, even though it is considered a Blue Chip stock. The CV mean accuracy for FB was no better than a random guess, as expected. PLTR was estimated fairly well (68% CV mean accuracy). The extreme shock (short squeeze) of GME could not be estimated, but the period afterwards, when the hype around the stock really started to gain wide attention, was estimated fairly well by the LSTM model (60% and 67% accuracy for the last two splits).

In general, we found that our models are better at forecasting losses than gains, as can be seen in Table 38. In formal classification terms, our models exhibit a relatively high specificity (78.57% across all stocks) and a more modest sensitivity (47.89% across all stocks).

Additionally, we emphasize the importance of the right stock selection: the fact that the estimation for FB is no better than a random guess can be related to the relatively lower amount of discussion about Blue Chip stocks and to the lower ratio of small-scale investors to other investors in the market for those stocks. Furthermore, there needs to be constant discussion in the online investment community about rather unknown, smaller-scale stocks in order to utilize the information for forecasting. This issue affects our model during the short squeeze period of GME, where the number of users discussing the topic on the platform skyrockets after the beginning of the short squeeze, leading to poor performance during the structural break in the stock price and the drastic change in the user base over the same period. Figure 10 shows the trading and posting volume for GME on Reddit. It is clearly visible that the number of posts is strongly correlated with the trading volume, suggesting a close connection between the involvement of small-scale investors and their effect on price volatility, a finding underlined by the results of our Granger causality tests below. Hence, revisiting the point made in Sect. 1 regarding the random walk hypothesis, it is important to emphasize again that this model of trend analysis should not be used in isolation by investors. Rather, they may consider it as an additional quantitative measure to support their decision-making process. The same holds for any conclusions drawn from the Topic Modelling in the following section.
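For concreteness, the specificity and sensitivity figures quoted above correspond to the standard confusion-matrix definitions; a minimal sketch with toy labels (1 = gain, 0 = loss; not the study's actual predictions):

```python
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions, 1 = "gain", 0 = "loss".
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # share of actual gains correctly identified
specificity = tn / (tn + fp)  # share of actual losses correctly identified
print(f"sensitivity={sensitivity:.2%}, specificity={specificity:.2%}")
```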

6.2 Topic modelling

6.2.1 LDA

Overall, the LDA model delivered suboptimal results w.r.t. our research goals. As shown in Table 5, the model failed to uniquely isolate a single stock in the Reddit and Stocktwits datasets. A closer look at its topic representations shows that it was also unable to capture any general bearish or bullish trends, nor any other interesting phenomena. This is not particularly surprising, since the average Reddit and Stocktwits comment in our dataset consists of around 15 words, as shown in Table 1. This certainly falls within the realm of what can be considered short text, and LDA has already been documented in Chen et al. (2019) to underperform in this domain. Indeed, LDA did fare somewhat better on our long-text datasets, the Seeking Alpha articles and comments, but so did NMF and BERTopic.

6.2.2 NMF

The NMF model consistently outperformed LDA across three of our four datasets and performed rather well w.r.t. our research goals. Taking the example of Reddit, Table 17 presents a shortened version of the Reddit topic representations with particular topics of interest.
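As a point of reference for how such a model is typically fit, here is a minimal scikit-learn sketch of an NMF topic model; the toy corpus, topic count and vectorizer settings are illustrative, not the paper's configuration:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy short-text corpus standing in for Reddit comments.
docs = [
    "gme to the moon, diamond hands",
    "bb is undervalued, holding bb",
    "paper hands selling gme today",
]

# TF-IDF features; NMF factorizes the document-term matrix into
# non-negative document-topic and topic-term matrices.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(X)
terms = tfidf.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = [terms[i] for i in row.argsort()[::-1][:5]]
    print(f"topic #{k}: {top}")
```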

On the Reddit dataset, NMF was the only model able to isolate any stock-specific topics: GME (topic #0) and BB (#5). In both of those topics we find words related to small-scale investors, like “robinhood” or “(GME) gang”. The model also performed well in the traditional sense of Topic Modelling and found other coherent and well-specified topics that captured relevant contemporary discussions of the community, such as a “Big Tech” topic (#28), a topic centering on the motivation to “hold the stocks” (later also referred to as the bullish topic) (#4), a topic capturing bearish market sentiments (#6), and what appears to be a topic voicing the discontent of users regarding the Robinhood trading restriction (#9). Topic #38 seems to represent concerns surrounding potential bot activity. Realized or potential short squeezes are also discussed (#33). In topics #23, #10 and #39 we find different types of slang used by the community,Footnote 16 ranging from common online words like “lmao” to more community-specific terms like “moon”, “rocket” and “retard”, respectively. In summary, we find that the Reddit community mainly discusses investment strategies for “meme stocks” in a very serious manner, as underlined by the large number of closely investment-related terms, combined with a motivational undertone and the usage of community-specific terms.

Plotting the topic frequency over time for some topics also provides interesting insights. Figure 11 showcases the daily topic frequency of the aforementioned “bearish” and “bot” topics, together with the GME returns for the same timeframe. Interestingly, the frequency of concerns regarding bot activity generally appears to rise every time the frequency of bearish comments rises. This might indicate a concern or fear on the part of Reddit WSB users that the postings amplifying negative sentiments during this critical time were actually bot-based. It is difficult to determine whether this is justified, as existing research into bot activity amid the short squeeze focused on bots that were artificially hyping the GME stock price, rather than bots trying to downplay its future prospects.Footnote 17 Interestingly, the two highest peaks of both topics occurred on the same days as the two all-time-lowest returns of GME, on 28/01/2021 and 02/02/2021, respectively.Footnote 18 If there were indeed bots trying to derail the “GME hype train”, they might have been successful. This also raises a question for further research: how easily might a herd sentiment be artificially promoted within the community to manipulate the price of a certain stock in a desired direction?

Figure 12 compares the topic frequency of the bullish and bearish topics against the backdrop of the meme stocks' closing prices. The bullish sentiment topic dominates in late January and early February, a period throughout which the stock prices of all four of our meme stocks reach their peaks in the analyzed timeframe. Shortly thereafter, the bearish sentiment topic frequency matches and occasionally overtakes it, which is also accompanied by falling stock prices. We interpret the motivational tone of the bullish topic as an expression of a “We are in this together” mentality of the community, which is reassuring itself that it is on the right path.

NMF also performed well on the Stocktwits dataset; we present relevant topic representations in Table 18.

While the model did successfully separate our meme stocks from one another in topics #0, #2, #3 and #16, this time it did so less accurately than on the Reddit data. In particular, other stocks, namely Nokia (henceforth NOK) and Naked Brand Group (formerly NAKD, currently CENN), are intertwined in our otherwise isolated stock-specific topic representations. Nevertheless, the other identified topics still present interesting information: topic #1 centers on cryptocurrencies, topics #5 and #6 represent market-wide bullish and bearish sentiments, topic #11 isolates discussions surrounding Virgin Galactic Holdings (stock ticker SPCE), topic #17 summarizes discussions on Big Tech companies, and topic #25 likely centers on the alleged connection between hedge funds and Robinhood's decision to halt trading of GameStop's shares. Additionally, we find many words that also occur on Reddit, though overall the language seems somewhat less economic-technical. We also find further community-specific terms like “hodl” and “diamond” (referring to “diamond hands”, someone who “hodls”, i.e., holds the stock) in topic #28, for example. We can therefore assume that the community active on Reddit at least partly overlaps with the users active on Stocktwits w.r.t. discussing the selected stocks. Overall, our findings support existing evidence that NMF excels in the short-text realm. It appears that the sparse representations induced by the non-negativity constraint were a good fit for our research goals, and they also enabled the model to produce the quickest results on all datasets.

NMF also arguably outperformed LDA even in the long-text domain. As can be seen in Table 19, NMF successfully isolated FB, AMZN and BB into their own topics. The model did provide two topics centering on GME, and it is interesting to briefly consider the possible reasoning behind this. Topic #1 mentions GME by its company name, GameStop, and seems to center more on the business of the company, with keywords such as “console”, “game” and “hardware”. Topic #6, on the other hand, refers to GME by its stock market ticker and accordingly features stock-market-oriented verbiage, such as “gme”, “short”, “option”, “stock”.

6.2.3 BERTopic

With regards to our goal of stock separation, BERTopic outperformed its competitors in three out of the four datasets, as was shown in Table 5.

Notably, BERTopic performed poorly on the Reddit dataset, not only in isolating stocks, but also in identifying other interesting discussions in general. It is likely that the model's understanding of natural language was less advantageous on the Reddit dataset, since Reddit, and in particular WSB, is well known for its usage of specific lingo, like “hodl” (a misspelling of 'hold', meaning to keep your investment even when the price drops), “diamond hands” (holding a stock even when the price is falling), and “tendies” (profits). It is highly probable that any existing embedding model is not particularly well-versed in such niche expressions. Training an embedding model on such a corpus may be an interesting avenue for future researchers.
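To illustrate how such a domain-adapted embedding model could be plugged in, a hedged BERTopic sketch follows; the checkpoint name and toy corpus are illustrative, and a WSB-finetuned SentenceTransformer would simply replace the generic one:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Toy corpus standing in for WSB comments; real data is needed for
# meaningful topics. Repetition just gives the clusterer enough points.
docs = (["hodl gme, diamond hands to the moon"] * 50
        + ["bb earnings look strong, buying more"] * 50
        + ["selling everything, market is crashing"] * 50)

# A generic embedding model; a corpus-finetuned one could be swapped in here.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedder, min_topic_size=10)

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```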

BERTopic had a better showing on our other short-form text corpus, Stocktwits, as can be seen in Table 7. It successfully isolated stock-specific topics for GME (#2) and BB (#4). Much like NMF, it was also able to identify interesting topics beyond our meme stocks, such as a DOGE-related topic (#1), a bullish topic (#3), a Big Tech topic (#5), a SPCE topic (#6), and two topics expressing frustration regarding the Robinhood trade-halting debacle: topic #16, which addresses Robinhood directly, and #12, which appears to center on Robinhood's CEO, Vlad Tenev.

Table 7 Selected topic representations of the BERTopic model fit on Stocktwits data

BERTopic was the only model to earn “perfect” marks in stock separation, uniquely separating all of our 5 stocks into their own topics in the Seeking Alpha articles dataset. Arguably, its topic representations, shown in Table 23, also provide the best and most coherent summary of the 5 stocks out of any model-dataset combination. It is worth mentioning that the BERTopic results for the Seeking Alpha comments also yield fairly clear topics that display sophisticated investment-related terms and aspects of the specific industries in which the selected companies operate. We can conclude that the discussion on Seeking Alpha is more topically and technically sophisticated than on our other sources.

Based on our findings, it becomes clear that the Seeking Alpha articles data lends itself the most to Topic Modelling, as all of our considered models had their best showing there. The Reddit and Stocktwits data proved considerably tougher for all of our models, largely due to their highly corpus-specific language, which we could not fully account for in the pre-processing and which, accordingly, presented a considerable challenge for BERTopic's word embeddings w.r.t. coherent clustering.

6.3 Sentiment analysis

It appears that, on our dataset, the heuristics and sentiment lexicon of VADER were insufficient to make up for the loss of contextual information caused by its token-level approach. The strong performance of the finetuned pre-trained BERT model was largely made possible by the self-labeled Stocktwits data; training a Sentiment Analysis model on a bigger self-labeled Stocktwits sample could provide even better results, though its domain of usability would be rather niche. However, extending the idea of websites prompting users en masse to self-label their content to other Social Media, such as Twitter or Reddit, would go a long way toward practically solving Sentiment Analysis on Social Media.
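The contrast between the two approaches is easy to demonstrate. The sketch below scores the same comment with VADER and with a generic pre-trained transformer; the transformer checkpoint is the library default, not the finetuned model used in this study:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

text = "GME to the moon, this stock cannot be stopped"

# VADER: lexicon- and rule-based, scores tokens largely without context.
vader = SentimentIntensityAnalyzer()
print(vader.polarity_scores(text))

# Transformer: contextual, sentence-level prediction (default checkpoint).
clf = pipeline("sentiment-analysis")
print(clf(text))
```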

6.4 Granger causality test

We found plentiful evidence for returns causing stock-specific and market-wide sentiments in the Reddit data, especially in the case of GME. On the other hand, we found limited evidence that meme stock sentiments Granger-cause stock returns. It is important to remember that Granger causality tests for a linear relationship between the variables in question. However, Hsieh (1991) and Abhyankar et al. (1997) present evidence that stock markets might in fact be non-linear systems, especially during market shocks. The GME short squeeze, around whose time frame our dataset is built, was in many ways a black swan event and certainly a huge market shock. In hindsight, it seems likely that its effects could be better modeled through techniques capable of capturing and predicting non-linear relationships, such as machine learning algorithms like random forests or neural networks. This could prove a fruitful avenue for future researchers.
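As an illustration of this suggestion, a minimal sketch of a random forest predicting the sign of next-day returns from lagged sentiments follows (toy data; features, lags and model settings are illustrative, not a validated strategy):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy daily sentiment and return series.
rng = np.random.default_rng(3)
df = pd.DataFrame({"sentiment": rng.uniform(0, 1, 200),
                   "returns": rng.normal(0, 0.02, 200)})

# Lagged sentiments as non-linear features.
for lag in (1, 2, 3):
    df[f"sent_lag{lag}"] = df["sentiment"].shift(lag)

# Binary target: does the next day's return end up positive?
future = df["returns"].shift(-1)
df["target"] = (future > 0).astype(int)
df = df[future.notna()].dropna()

X = df[[f"sent_lag{lag}" for lag in (1, 2, 3)]]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, df["target"])  # a time-aware train/test split would be needed in practice
```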

We did find evidence for the online stock-specific sentiment Granger-causing the returns of PLTR and FB. We initially believed companies like FB to be too big and their stock price too strongly grounded in fundamental value for online sentiments to have any predictive power over their stock price development. In an interesting turn of events, it may have been exactly this size and fundamental value, which kept FB outside the turbulence of the GME short squeeze, that enabled us to find a linear relationship via the Granger causality test. It is plausible that while the meme stocks were caught up in a whirlwind of non-linear effects, FB was in a way “anchored” by its fundamental value. This shows promise that the utilization of sentiments to model stock returns may be applicable beyond meme stocks and specific online events. Future studies may use these results to try to model more “traditional” stocks in different timeframes.

It is of interest to briefly consider how our results fit into the existing landscape of Granger causation of stocks. Previous studies on the matter, such as Hiemstra and Jones (1994), commonly feature trading volume as the predicting variable. That is natural: trading volume is a tangible, kinetic marker of market activity. Contrasting this, our research ventured into the realm of potential market change, its magnitude embodied by the sheer comment volume and its direction modeled through our Sentiment Analysis.

7 Conclusion

This research explored the relationship between investor activity on online platforms such as Reddit, Stocktwits, and Seeking Alpha during the GME short squeeze period, and the price fluctuation of the associated stocks. By performing Topic Modelling and Sentiment Analysis on over 6 million investment-related comments across these social networks from May 2020 to May 2021, we generated multi-dimensional insights in the domains of Topic Modelling, Sentiment Analysis and stock prediction.

Our analysis of the predictive-analytical dimension showed new possibilities for monitoring the investment discussions of small-scale investors for financial gains on the markets. Using aggregated data in this study proved informative enough for classifiers to work fairly well on certain stocks. It is important to keep in mind that the selection of stocks seems crucial for the approach to work properly; this includes ensuring that the selected stock is regularly discussed in the community. External shocks and events that strongly influence a stock's price can have a heavy effect on the user base and therefore on its predictive potential, as seen in the growing GME discussion after the initial short squeeze.

On the other hand, our classification approach has clear limitations. The utilized test set time interval is rather small; therefore, we cannot make a generalized statement on the full potential of using this kind of data for predictive purposes. Additionally, the informational potential of the textual data is not yet utilized to its full extent. In our experiments, we generated metadata like sentiment information and word counts, but did not make further efforts to utilize the textual information itself, for example via Word Embeddings. First experiments using a more complex neural network on the GloVe Word Embeddings of each individual text exhausted significant computational resources, even with just a small sample size. Nevertheless, our results make it easy to imagine that such an extensive analysis holds considerable potential for improvement. The integration of this potential will be the subject of future research.
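To make the embedding idea concrete, a minimal sketch of its simplest variant, representing each comment by the average of its pre-trained GloVe vectors via gensim, is given below; this is an illustration, not the experiment described above:

```python
import numpy as np
import gensim.downloader as api

# Load pre-trained 50-dimensional GloVe vectors from gensim's downloader.
glove = api.load("glove-wiki-gigaword-50")

def embed(text: str) -> np.ndarray:
    """Average the GloVe vectors of all in-vocabulary tokens in a comment."""
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(50)

features = embed("diamond hands holding gme")  # a 50-dim feature vector
```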

Analyzing the descriptive-analytical dimension of the phenomenon of “Social Media informed small-scale investors”, we adopted traditional models like LDA, NMF and VADER and compared them with newer, neural-net-based approaches such as BERTopic and a fine-tuned BERT model. Our analysis demonstrated the superiority of BERTopic in the field of Topic Modelling and affirmed the robust performance of NMF in the short-text domain. The Sentiment Analysis results revealed the edge of the neural-net-based fine-tuned BERT over its traditional counterparts. We found that the discussion on Reddit and Stocktwits seems less technical than in the analyzed Seeking Alpha comments. Additionally, on Reddit and Stocktwits it is apparent that users employ subculture-specific terms and phrases, motivate each other to invest in or hold stocks, thereby creating a “We are in this together” mentality, and voice concerns about possible bot interventions in their threads and posts.

Finally, we utilized the results of a stock-specific sentiment time series analysis based on the Topic Model results. For this analysis, we extracted the stock-specific conversations on Reddit through NMF, and those on Stocktwits and Seeking Alpha (articles and comments) through BERTopic, as these models provided the best results on their respective datasets w.r.t. our Topic Modelling goals. We then obtained the sentiments of these comments and subjected them to a Granger causality test to further evaluate their predictive potential over stock returns. Notably, we found significant evidence that online sentiments Granger-cause stock returns for both Facebook and Palantir. These findings emphasize the role of online sentiments in influencing stock market trends. The study opens up new avenues for future research; in particular, investigating the non-linear relationships between sentiments and returns could be valuable. This could be achieved by employing machine-learning-based techniques like random forests or neural-net-based models to further expand our understanding of this dynamic relationship.