1 Introduction

The advent of neo-brokers like Robinhood has led to a new era of retail investing, potentially democratizing stock markets and bringing in a massive influx of small-scale investors. At the same time, so-called “Meme Stocks” have emerged as a new phenomenon in the market. These stocks are largely driven by the collective sentiment and enthusiasm of online communities rather than by financial and economic fundamentals or technical analysis. Investors often use specific Social Media channels like Reddit’s “Wallstreetbets” Subreddit (WSB) or Stocktwits to share their short-term investment decisions and strategies, which can result in partial collusion and jointly planned investment decisions. A prominent case is the GameStop (GME) short squeeze, which we discuss in detail and which, as we will show, merits in-depth exploration.

Short-selling is an investment strategy in which traders profit from a price drop in an asset. The process involves borrowing shares, selling them at the current market price, and buying them back at a lower price in the future to return them to the lender. In the presence of substantial short interest, this can culminate in a so-called “short squeeze”: a stock price hike may cause some short-sellers to close their positions early to cut potential losses, which increases the share price even further and can set off a feedback loop.

Throughout the latter half of the 2010s, GameStop—one of the largest video game retailers in the world—exhibited poor financial performance, recording a consistent revenue decline between 2015 and 2020. This made GameStop a prime target for short-selling.

An influential Reddit user by the name of “DeepFuckingValue” took notice of the short interest in GME and began discussing the possibility of a short squeeze.Footnote 1 Retail investors rallied behind his ideas and began buying GME shares at large scale, with the goal of driving the stock price up and thereby putting pressure on short sellers to cover their positions by purchasing shares at inflated prices.

Indeed, GME’s stock price experienced a rapid increase in late January 2021.Footnote 2 The surge in price attracted mainstream media attention and further fueled interest in the ongoing short squeeze, which eventually spilled over to other stocks, such as Palantir and BlackBerry. It also led to the bust of several GME-shorting hedge funds, the most prominent being Melvin Capital.Footnote 3 It quickly became clear that Social Media could be a potent fire-starter for coordinated retail investor operations. As a result, this new group of investors has shown that it can be a significant actor in financial markets, with the ability to disrupt the efficient price determination of certain stocks in the short term. For this reason, a deeper understanding of this new subculture is important. In addition, we investigate the potential of using Social Media discussions to predict share price movements.

We analyze Social Media text data from different sources to gain insights into the topics and shape of the discussion over time on several prominent platforms (Reddit, Stocktwits, Seeking Alpha) used by small-scale investors. Using NLP approaches like Sentiment Analysis and Topic Modelling, we investigate the discussion in these communities and evaluate how NLP methods can be leveraged to gain insight into the investment decision-making processes of users. Furthermore, we assess the impact of small-scale investors on market developments and the potential for other investors to leverage the group’s collective trading decisions by predicting the trend of “Meme Stocks” such as GameStop or Blackberry as prominent examples.

Accordingly, our approach can be differentiated into two dimensions. Firstly, a predictive-analytical dimension, which is concerned with predicting the short-term price trend of affected stocks in order to analyze the potential of this kind of text data for investment decision making. Secondly, a descriptive-analytical dimension, which applies Topic Modelling approaches over time to describe how the contents of the individual investors’ discourse evolve. Each dimension is independent in both methodology and scope. All applications presented in this paper are viewed as complementary additions to the sum of methods included in potential future investment decision-making procedures. Fig. 1 gives a high-level overview of our analysis. A detailed figure describing the whole pipeline is provided in Sect. 4.

Finally, we would like to stress that the classification conducted in the predictive-analytical dimension of our research is not suitable as a stand-alone tool for making investment decisions. We acknowledge that, from a statistical standpoint, stock prices follow a random walk and cannot be accurately forecasted, as shown by Van Horne and Parker (1967) and Fama (1995), among others. For an informed investment decision, economic, company-related, and technical factors, as well as many others, are crucial. Therefore, our only goal in the predictive-analytical dimension is to show that the information created during the discussions inside the small-scale investor communities can be utilized with machine learning approaches to support the investment decision-making process of investors. In other words, we do not expect any causality, but we also do not expect independence between a stock’s price and the information retrieved from the small-scale investor communities. By predicting the trend of a stock using only the data from said communities, we display the informational potential of the data at hand and the partial influence that those small-scale investors have on the development of the observed stock prices. Our goal is therefore not to provide a comparative framework that challenges existing models in their predictive performance, but to evaluate the potential of using the information generated by these online communities.

Fig. 1 A view on the conducted analysis on two dimensions

2 Related literature

Our research draws on the literature on textual analysis of Social Media data, especially the Topic Modelling literature. In addition, we discuss literature that investigates the effect of Social Media platforms on financial markets and share prices.

Topic Modelling Topic Modelling describes techniques for extracting and identifying hidden topics within extensive text corpora. Topic models make it possible to uncover the semantic structure of large text collections on a scale that goes beyond what humans can reasonably read, process, and comprehend. Given a collection of documents, a Topic Model first detects which topics are present in the corpus. Second, it allocates each document to one or more of the detected topics. Historically, topic models are implemented as probabilistic generative models such as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) or Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 2013). While these are still used today (Yu and Xiang 2023; Gupta et al. 2022; Egger and Yu 2022), the rise of continuous Bag-of-words (CBOW) word embeddings such as word2vec (Mikolov et al. 2013) enabled models to integrate contextualized representations of words into their modelling (Das et al. 2015; Angelov 2020). Further adaptations of LDA, like the Embedded Topic Model (ETM) (Dieng et al. 2020) or ProdLDA (Srivastava and Sutton 2017), also model the topics themselves as representations in the embedding space. Furthermore, the inherent power of these contextualized word embeddings allowed for completely different modelling architectures, leveraging simple clustering techniques such as K-Means on embedded words (Sia et al. 2020). The ZeroShot Topic Model (Bianchi et al. 2021) and the Contextualized Topic Model (Bianchi et al. 2020) additionally integrate embedded documents (Reimers and Gurevych 2019) into their modelling. These embedded documents also allow for simple clustering techniques such as HDBSCAN or Gaussian Mixture models (Grootendorst 2022; Angelov 2020; Thielmann et al. 2024) and create topics that are on par with state-of-the-art probabilistic models (Thielmann et al. 2024, 2022). While the underlying clustering techniques create only clustered documents, the topics can easily be extracted from them by utilizing either distances in the embedding space (Angelov 2020; Thielmann et al. 2024) or class-based term frequency-inverse document frequency scores (Grootendorst 2022).

Topic models have long been used to analyze the latent topics discussed in social networks, since they make it possible to uncover trends and latent topics in these exceedingly large document collections (Hong and Davison 2010; Curiskis et al. 2020; Kant et al. 2022; Weisser et al. 2023). Models such as LDA (Blei et al. 2003), Non-negative Matrix Factorization (NMF) (Lee and Seung 1999) or BERTopic (Grootendorst 2022) (where BERT stands for Bidirectional Encoder Representation from Transformers) are some of the prominent models in these applications (Egger and Yu 2022). Nowadays, tools like TTLocVis (Kant et al. 2020) or Twitmo (Buchmüller et al. 2022) make it easy to analyze Social Media data.

Social Media and Stock Prices In recent years, the influence of retail investors on financial markets through Social Media has garnered significant attention. Researchers have delved into this phenomenon, seeking to understand its implications and potential warning signals for market stability. For example, Agrawal et al. (2022) leverage BERTopic to identify unique traits of Social Media users in finance communities, whereas Schou et al. (2022) and Thukral et al. (2022) analyze the effect of social network communities on financial behavior using the LDA model. Similarly, Sidhu et al. (2022) investigate multiple online communities, using a mixture of LDA and Support Vector Machines (Cortes and Vapnik 1995). Concurrently, Gianstefani et al. (2022) conduct a study focusing on the impact of retail investors using Social Media, leveraging simple linear regression techniques and focusing solely on the Subreddit “WallStreetBets” (WSB). It is worth noting, however, that their analysis primarily revolves around identifying events without delving into whether the social network activity will ultimately lead to positive or negative outcomes for stocks. Aloosh et al. (2021) analyze the behavior of meme stock traders in the context of herd behavior, but do not analyze the activities in social networks and focus exclusively on the stock data. Wang and Luo (2021) estimate the price trend of the GameStop share by analyzing several thousand WSB-Subreddit posts using Sentiment Analysis and textual features. However, their research covers a very short timeframe of less than 3 months, and they only focus on accurately predicting the direction of the effect that sentiments have on the stocks. Related to this line of research, Costola et al. (2021) utilize a regime-switching co-integration model to identify stock momentum. They use a simple approach, relying solely on counting stock name mentions in TwitterFootnote 4 discussions, distinct from finance-specific platforms like WSB. While they are able to identify correlations, their study’s main focus is on characterizing Social Media patterns rather than predicting stock price directions, since they do not consider the sentiment of the Social Media discussions. Padalkar (2021) leverages text mining and semantic analysis methods to analyze Reddit posts. However, their Sentiment Analysis approach relies only on bag-of-words techniques developed by Antweiler and Frank (2004), rather than state-of-the-art word and document embeddings. Their findings suggest a correlation between higher Reddit sentiment and increased stock volume. Furthermore, they observe that a high frequency of posts has significant impacts on stock returns, traded volume, and volatility. Zhao et al. (2023) also find that Social Media Sentiment Analysis can be utilized to approximate stock prices. Similar to Aloosh et al. (2021), they identify some form of herd behavior in stock traders. In a related study, Thormann et al. (2021) employ a combination of neural networks and Sentiment Analysis, alongside traditional financial indicators, to forecast Apple stock prices. Their findings suggest an enhanced performance over a baseline model that relies solely on historical closing prices, although, given the random walk nature of stock prices, accurate prediction remains inherently challenging and uncertain. Matthies et al. (2023) leverage Sentiment Analysis and perform simple regression based on the extracted sentiment scores. However, they also focus only on Twitter data. Koltun and Yamshchikov (2023) demonstrate the effectiveness of Social Media Sentiment Analysis by incorporating sentiment information into price prediction, which enhances model performance for cryptocurrency forecasting. Shiri et al. (2023), on the other hand, specifically analyze the effect that emojis have on stock price prediction. In contrast to the aforementioned approaches, they do not find an effect.

The literature suggests that Social Media platforms can have a significant impact on financial markets and share prices. Moreover, it underscores the potential role that state-of-the-art textual analysis can play as one of several key tools in aiding more informed decision-making when approximating stock price movements. However, a comprehensive investigation of multiple platforms that draws on the latest natural language processing methodologies and machine-learning-based forecasting is missing.

3 Data collection and pre-processing

The data for our study was provided by FIDA Software GmbH and was collected via the platforms’ service APIs. The data covers the period from May 2020 to May 2021 and includes 5.1 million comments from Stocktwits, 1.08 million comments from Reddit, and 14 thousand comments and 317 articles from Seeking Alpha. The chosen timeframe covers the months leading up to the GameStop short squeeze, the event itself, and its aftermath. Several bot detection measures are applied to ensure that we are dealing with human-made content.

The first source, Reddit, is a social news aggregator website with thousands of subforums, called “Subreddits”, covering a vast variety of topics. We collect data from the WSB Subreddit, where the GameStop short squeeze was discussed and coordinated. In this forum, we collect the so-called “Daily Discussion Threads” and the “What are your moves Tomorrow?” threads. In both moderated threads, serious discussions about investment decisions are held, in contrast to many other threads in the Subreddit, where a lot of satire, jokes, and shitposting are present. The data has the structure of a discussion: users can reply to posts made in the threads, promoting discussions on a topic that has been brought up. We filter the posts by mentions of the stocks we are interested in. This means that if a stock or its ticker (like “GME”) is mentioned, we include the post in our analysis. Additionally, if a stock is mentioned in a lower-level reply to a post, the whole discussion is added to our data. Another noteworthy attribute of the Reddit data is that users are able to rate and award comments. We use this metadata in the further analysis.

The second data source we use is Stocktwits, a social news ticker service. This platform is design-wise similar to Twitter: users post short messages, restricted to a few words only. The major difference to Twitter lies in the topical focus: on Stocktwits, users almost exclusively talk about stocks, investments, and cryptocurrencies. Users can use so-called “Cashtags” to tag the stocks they are talking about in their posts. On average, posts on Stocktwits are very short, often consisting of only about one sentence. An interesting feature of Stocktwits posts is the option for a user to mark their posts as “bearish” or “bullish”, providing us with a de-facto user-labeled sentiment dataset with regard to the stocks we are interested in. We collected the data by filtering for the Cashtags of the stocks we use in our analysis. One problem with the Stocktwits data is the high prevalence of bots posting and promoting products or sentiments with regard to specific stocks. For our analysis, we therefore filter our collected data for bots. We assume that bot-driven accounts have several attributes that make them easily identifiable. First, bots have a very high posting frequency: if a user posted more than 30 posts per day, we remove them from our dataset. Second, we assume that the content posted by bots follows a syntactical pattern; we therefore check, for each user, the syntactical equality of several of their posts using the Hamming distance measure (Hamming 1986). If the similarity of two posts (measured via the Hamming distance between the two sequences) exceeded a predefined threshold, we remove all posts of such users from our dataset.
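To make the two heuristics concrete, the following minimal Python sketch illustrates how such a bot filter could look. The 30-posts-per-day limit is from our setup, while the similarity threshold, sample size, and function names are illustrative assumptions rather than our exact implementation.

```python
from itertools import combinations

def hamming_similarity(a: str, b: str) -> float:
    """Share of matching characters. The Hamming distance is defined for
    equal-length sequences, so we compare over the shorter post's length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

def is_probable_bot(posts_per_day: list[int], posts: list[str],
                    max_daily: int = 30, sim_threshold: float = 0.9,
                    sample_size: int = 10) -> bool:
    """Flag a user if they exceed the posting-frequency limit, or if any two
    of their sampled posts are near-identical. Thresholds are illustrative."""
    if posts_per_day and max(posts_per_day) > max_daily:
        return True
    sample = posts[:sample_size]
    return any(hamming_similarity(a, b) > sim_threshold
               for a, b in combinations(sample, 2))
```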

The final data source in our analysis is Seeking Alpha, a social news blog. Seeking Alpha mainly provides longer, essay-like texts on equities, exchange-traded funds, and investment strategies. These articles are often written by selected investors or experts. Additionally, all articles have a comment section for discussing the article’s topics. We collect the articles with regard to all analyzed stocks, including all comments belonging to the collected articles. For our analysis, we only use text chunks that directly refer to the stocks of interest; more on that can be found in Sect. 4. The comments on the articles are structurally similar to those from Reddit.

In short, we have three rather diverse sources of content. The preliminary outlook is that the long-format content of Seeking Alpha caters to a smaller niche of “serious” and professional investors, while WSB is more accessible and reaches a broader, more amateur audience that espouses a more archaic, gamified approach to investing. In many ways, Stocktwits appears to be a middle ground between the two.

We collect data about the “Meme Stocks” GameStop (ticker: GME), an American retail chain specializing in video games, consumer electronics, and gaming merchandise; Blackberry (BB), a technology company for enterprise software and the Internet of Things; and Palantir (PLTR), a software company focused on Big Data analytics. All of these stocks were widely discussed in the communities, especially on Reddit’s WSB. Additionally, we add two so-called “Blue Chip” stocks, FacebookFootnote 5 (FB) and Amazon (AMZN). We add these two established large-scale companies as a baseline for our analysis, specifically for our trend prediction approach. We expect a less prevalent discussion about these stocks in the targeted communities, and a lower impact of the communities’ actions on these stocks due to the huge number of investors holding their shares.

The processing of the data consists of several steps. The raw data collected via the platforms’ APIs is cleaned of everything not used in our analysis. Consequently, we remove all hyperlinks and images from the text. We harmonize the datetime of all data sources to the CET timezone. Additionally, we filter falsely flagged posts, which are especially present on Stocktwits. These posts mostly carry many Cashtags to gain a wide audience for the shared content, even though the post does not thematically align with the Cashtags. We therefore remove all posts that carry more than two Cashtags from the dataset.
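A minimal pandas sketch of these cleaning steps is given below; the column names, the assumed UTC source timezone, and the Cashtag pattern are illustrative assumptions.

```python
import re
import pandas as pd

URL = re.compile(r"https?://\S+")
CASHTAG = r"\$[A-Za-z]{1,6}"

def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Remove hyperlinks from the text body (images are dropped upstream).
    out["text"] = out["text"].str.replace(URL, "", regex=True)
    # Harmonize timestamps to CET; the source timezone is assumed to be UTC.
    out["created_at"] = pd.to_datetime(out["created_at"], utc=True).dt.tz_convert("CET")
    # Drop likely falsely flagged spam posts carrying more than two Cashtags.
    return out[out["text"].str.count(CASHTAG) <= 2]
```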

Descriptive statistics Before starting our analysis, we provide an overview of the data. Figure 2 presents the daily frequency of posts on Reddit and Stocktwits (we omit the Seeking Alpha data from this plot since its numbers are relatively small). We omit the Reddit data from April 2021 onwards since it contains many missing values (NAs).

Table 1 presents the descriptive statistics regarding comment length by source. Unsurprisingly, Seeking Alpha articles are by far the data source with the longest texts. It is notable that Seeking Alpha comments are also considerably longer than their Reddit and Stocktwits counterparts—it would appear that the long-format nature of the articles also invites more verbose discussions in their comment section. Interestingly, Reddit and Stocktwits comments are remarkably close when it comes to word length, despite Reddit comments having no character limit.

Table 1 Length of comments/articles in words by source

4 Methodology

Our analytical framework consists of a pipeline, which covers all steps from data collection and cleaning to feature generation and modelling. As stated before, we focus on two dimensions: a predictive-analytical dimension to assess the value of the data for forecasting purposes, and a descriptive-analytical dimension to outline the actual contents of the discussion, providing a foundation for a better understanding of the subculture. A detailed overview of the implemented system can be found in Fig. 3.

4.1 Classification

Our classification method aims to show that the discussions in our sources contain crucial information about how small-scale stock investors make decisions. We estimate a simple bi-directional (positive or negative) price trend on a daily basis; in other words, the classification answers the question of whether a given stock’s share price will increase or decrease on the following day. To this end, we aggregate the data on a daily level for each of our three data sources. The aggregation is a simple daily average for most features; an overview of the features can be found in Table 8. Note that, since we are forecasting the development of stock prices, we work with time series. In each estimation, we include all the available data for the respective stock.
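The following sketch illustrates the daily aggregation and the construction of the binary trend target; the data layout and column names are assumed for illustration.

```python
import pandas as pd

def build_daily_dataset(posts: pd.DataFrame, prices: pd.Series) -> pd.DataFrame:
    """posts: one row per post with a 'date' column and numeric feature
    columns; prices: daily closing prices indexed by date (illustrative)."""
    # Aggregate post-level features to a simple daily average.
    daily = posts.groupby("date").mean(numeric_only=True)
    # Binary target: does the closing price rise on the following day?
    trend = (prices.shift(-1) > prices).astype(int).rename("target")
    return daily.join(trend, how="inner")
```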

Given our task of modeling a binary stock market time series target with the help of multiple features in a highly irregular market period, we selected a Random Forest and an LSTM model for our analysis.

Classical approaches for time series modeling, such as ARIMA and its variants, have two main disadvantages in our use case: (1) they are traditionally used for regression tasks, as shown in Hyndman and Athanasopoulos (2018), not for classification tasks such as ours; (2) they are built on the premise that future values of a time series can be predicted as a linear function of past observations and stochastic error terms, and their usage is thus suboptimal when it comes to capturing non-linearities, as indicated in Brockwell et al. (2016). Since our considered timeframe centers around the short-squeeze period and includes its prelude and aftermath, an altogether volatile setting, we would ideally like our model(s) of choice to be able to effectively capture non-linearities.

Random Forest models, on the other hand, are well suited for non-linear classification tasks and offer interpretability in the form of feature importance. And while they cannot innately handle temporal dependencies within the data, this can be easily (albeit only partially) amended by using lagged versions of our features.

Finally, LSTM models seem to fit all of our criteria: their ability to adeptly navigate the multivariate, complex, and non-linear temporal relationships characteristic of stock market dynamics, across both short- and long-term horizons, has been well documented empirically in Selvin et al. (2017), Althelaya et al. (2018a), and Althelaya et al. (2018b).

In the following, we briefly cover the two classifiers: a Random Forest model (Ho 1995) and an LSTM-based (Hochreiter and Schmidhuber 1997) neural network.

Combining multiple decision trees, Random Forest classification is an ensemble learning method that improves predictive accuracy and reduces overfitting compared to a single decision tree. Assuming a binary classification case with a dataset of I samples and J features, the classifier builds a group of G decision trees. Each tree is trained on a different subset of the data and a randomly sampled subset of the features at hand. At each node of a decision tree, the Random Forest classifier selects the best feature to split the data based on the Gini impurity or entropy criterion. The Gini impurity assesses the likelihood of wrongly classifying a random sample from a node, whereas entropy evaluates the uncertainty of the class distribution. A new sample’s class prediction is determined by a majority vote of the individual decision trees after the new data has been processed by them. Formally, let \({x}_i \in {\mathbb {R}}^{J}\) describe the feature vector of the i-th sample and \(y_i \in \{0,1\}\) denote its corresponding binary label. Additionally, let the collection of decision trees be described by \({G} = \{G_1, G_2, \ldots , G_G\}\), each trained on a differing subset of the data and a randomly sampled subset of the features. The classifier then predicts the class \({\hat{y}}_i\) of a new sample \({x}_i\) as:

$$\begin{aligned} {\hat{y}}_i = \arg \max _{l} \sum _{g=1}^{G} {\mathbb {I}}(G_g({x}_i) = l) \end{aligned}$$
(1)

where \({\mathbb {I}}(G_g({x}_i)=l)\) describes an indicator function that returns 1 if the g-th decision tree predicted class l for \({x}_i\), and 0 otherwise. During the training process, the most important hyperparameters to optimize are the number and the depth of the decision trees, as indicated in Breiman (2001), Breiman (2017).
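In practice, this majority-vote classifier is readily available in standard libraries. The scikit-learn sketch below mirrors the setup on synthetic stand-in data; apart from the 101 trees used in our direct classification run (see below), all settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))     # stand-in for the aggregated daily features
y = rng.integers(0, 2, size=200)   # stand-in for the binary trend target

# 101 trees with Gini impurity as the split criterion; the depth is left
# unrestricted here purely for illustration.
rf = RandomForestClassifier(n_estimators=101, criterion="gini", random_state=0)
rf.fit(X[:150], y[:150])
y_hat = rf.predict(X[150:])         # majority vote over the trees, Eq. (1)
print(rf.feature_importances_[:3])  # interpretability via feature importance
```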

An LSTM-based neural network for binary classification can be defined as follows:

Assuming a set of input sequences \({x} = ({x}_1, {x}_2, \ldots , {x}_I)\) and a set of corresponding binary labels \({y} = (y_1, y_2, \ldots , y_I)\), a sequence of hidden states \({h} = ({h}_1, {h}_2, \ldots , {h}_I)\) is obtained by processing the input. Finally, the last hidden state \({h}_I\) is fed into a sigmoid activation function to obtain the prediction \({\hat{y}}_I\) as follows:

$$\begin{aligned} {\hat{y}}_I = \sigma ({A}_o{h}_I + b_o), \end{aligned}$$
(2)

where \({A}_o\) and \(b_o\) represent the weights and bias of the output layer and \(\sigma \) is the sigmoid activation function. Using a binary cross-entropy loss function, the predicted labels \(\hat{{y}}\) and the true labels \({y}\) are compared:

$$\begin{aligned} L({y},\hat{{y}}) = -\frac{1}{I}\sum _{i=1}^{I}\left( y_i\log {\hat{y}}_i + (1-y_i)\log (1-{\hat{y}}_i)\right) , \end{aligned}$$
(3)

where \({y}\) and \(\hat{{y}}\) are the true and predicted labels, respectively. The weights and biases of the LSTM cells, the output layer, and other parameters are learned through a combination of backpropagation and gradient descent optimization.
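For illustration, a minimal PyTorch sketch of such an LSTM-based binary classifier is shown below; the layer and input sizes are illustrative and do not reproduce the 13-layer architecture used in our experiments.

```python
import torch
import torch.nn as nn

class TrendLSTM(nn.Module):
    """Minimal LSTM binary classifier following Eqs. (2) and (3)."""
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)        # weights A_o and bias b_o

    def forward(self, x):                      # x: (batch, seq_len, features)
        _, (h_last, _) = self.lstm(x)          # last hidden state h_I
        return torch.sigmoid(self.out(h_last[-1])).squeeze(-1)   # Eq. (2)

model = TrendLSTM(n_features=12)
loss_fn = nn.BCELoss()                         # binary cross-entropy, Eq. (3)
x = torch.randn(8, 5, 12)                      # 8 samples of 5 lagged days
y = torch.randint(0, 2, (8,)).float()
loss = loss_fn(model(x), y)
loss.backward()                                # backpropagation step
```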

For both classifiers, we implement two approaches. Firstly, a “direct” classification for all stocks, using 101 decision trees in the Random Forest model and a 13-layer architecture for the LSTM-based neural network. For this approach, the training data contains the features listed in Table 8 from the 1st of November 2020 to the 29th of March 2021, while the test data covers the 30th of March 2021 to the 7th of May 2021. Secondly, we check our results for robustness using the Time Series Split Cross Validation approach, a special case of k-Fold Cross Validation, outlined in Hyndman and Athanasopoulos (2018). This method divides a time series dataset into several sequential folds, where each fold comprises a continuous subset of the data. The model is trained on the initial fold and assessed on the time-wise subsequent fold, with the process being repeated for all folds; each succeeding fold contains more recent data. In each fold, the data is split into training, validation, and test sets; hyperparameter optimization is applied, and the best parameters from the validation set are applied to the test set. A graphical representation can be found in Fig. 4. For this approach, we use 5 splits with a training set starting size of 1 month, a validation set size of 13 days each, and a test set size of 15 days each. Consequently, the overall data usage shrinks, with the 28th of April 2021 as the last day in the final test set, as dictated by the split outline.
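This splitting scheme corresponds to what scikit-learn provides as TimeSeriesSplit. The sketch below, on stand-in data, illustrates the growing training window; our additional validation carve-out within each training window is noted in a comment.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(120).reshape(-1, 1)   # stand-in for 120 daily feature rows

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # In our setup, the tail of each training window additionally serves as
    # a validation set for hyperparameter tuning before touching the test set.
    print(f"fold {fold}: train 0..{train_idx[-1]}, "
          f"test {test_idx[0]}..{test_idx[-1]}")
```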

A challenge for the design of the experimental outline was the handling of non-trading days on the stock markets, especially weekends. While the discussions about investment decisions continue on all data sources on these days, investments themselves can only be conducted on days on which the stock exchanges are open. For the first, non-cross-validated approach, we evaluate two different ways of handling days with a missing response variable: firstly, backward-fill, where we assign the trend of the following Monday to Saturday and Sunday; secondly, removing the weekends altogether, including all input data of those days. For the Time Series Split Cross Validation approach, we decided to remove the weekends.
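A small pandas sketch of the two weekend-handling variants, on illustrative data:

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=8, freq="D")   # Friday to Friday
df = pd.DataFrame({"trend": [1, None, None, 1, 0, 1, 1, 0]}, index=idx)

# Variant 1: backward-fill, assigning Monday's trend to Saturday and Sunday.
backfilled = df.assign(trend=df["trend"].bfill())

# Variant 2: remove non-trading days (and their input data) altogether.
weekdays_only = df[df.index.dayofweek < 5]
print(backfilled, weekdays_only, sep="\n\n")
```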

4.2 Topic modelling

In a separate pre-processing step, we utilize LDA Topic Modelling on the Seeking Alpha articles to pinpoint aspects relevant to the stocks of our interest, and we filter out sections of the articles that do not align with these specific topics. In Natural Language Processing, “aspects” typically refer to the specific features or attributes of a product, service, or entity discussed in text data. In the context of our analysis, aspects can be understood as particular characteristics and traits that are closely associated with the stock we are investigating. For instance, we may consider “cinema” an aspect relevant to the stock of “AMC”. In short, we use LDA to create topics from the Seeking Alpha articles and then carefully select words based on their close affiliation with a specific stock to avoid any ambiguities. After generating a list of such aspects, we only incorporate sentences from the Seeking Alpha articles that contain at least one relevant aspect. This ensures that we only use content strictly related to our targeted stocks in case the articles have a multi-topical focus. This procedure is only conducted for the Seeking Alpha articles, since they are the only longer document type in our project, and it is considered a mere pre-processing step before the analysis.
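The aspect-based sentence filter can be summarized by the following sketch; the naive sentence splitting and the example aspect list are simplified assumptions.

```python
def filter_aspect_sentences(article: str, aspects: set[str]) -> list[str]:
    """Keep only sentences mentioning at least one stock-related aspect.
    The split on '.' is a simplification; in practice a proper sentence
    tokenizer would be used."""
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return [s for s in sentences
            if any(a.lower() in s.lower() for a in aspects)]

# Hypothetical example: "cinema" as an aspect tied to the AMC stock.
print(filter_aspect_sentences(
    "Cinema attendance is recovering. Cloud revenue grew.", {"cinema"}))
```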

4.2.1 LDA

First applied in a Machine Learning context by Blei et al. (2003), the Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for Topic Modelling. It is a three-level hierarchical Bayesian model and, given a corpus of documents, it aims to discover a set of topics, where each topic is characterized by a distribution over words. The relevant notation can be found in Table 9.

The generative process of LDA can then be summarized as follows:

  1. 1.

    For each document d:

    1. (a)

      Sample a topic distribution \(\theta _d \sim \text {Dir}(\alpha )\)

    2. (b)

      For each word w in the document d:

      1. (i)

        Sample a topic assignment \(z_{w,d} \sim \text {Multinomial}(\theta _d)\)

      2. (ii)

        Sample a word assignment \(c_{w,d} \sim \text {Multinomial}(\beta _{z_{w,d}})\), where \(\beta _{k}\sim \text {Dir}(\eta )\) is the word distribution conditioned on the topic k assigned in (i)

In practice, the topic and word distributions are unknown to us—we have only our observed corpus, D, and our chosen hyperparameters, \(\alpha \), \(\eta \), and K. Given this information, LDA works backwards to infer the set of topics that is most likely to have generated the corpus. This involves inferring both the topic mixtures for each document and the word distributions for each topic. Blei et al. (2003) give the general formulation of the posterior of the hidden variables (topic assignments and distributions) given a document:

$$\begin{aligned} p(\theta ,z,\beta \mid d, \alpha , \eta ) = \frac{p(\theta , z, \beta , d \mid \alpha , \eta )}{p(d \mid \alpha , \eta )} \end{aligned}$$
(4)

By estimating this posterior distribution, LDA enables us to discover the underlying topics in a collection of documents. Blei et al. (2003) show that the denominator of the posterior is often intractable to compute. Therefore, the estimation of the posterior is usually done through Gibbs sampling or variational inference.
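In practice, this inference is handled by libraries such as gensim, which we use in Sect. 5.2. A minimal sketch on a toy corpus, with the symmetric priors \(\alpha = \eta = 1/K\) that we also use later; the corpus and topic number are illustrative.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["gamestop", "short", "squeeze"],
        ["amazon", "cloud", "revenue"],
        ["gamestop", "shares", "moon"]]          # toy tokenized corpus

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

K = 2                                            # number of topics
lda = LdaModel(corpus, num_topics=K, id2word=dictionary,
               alpha=[1.0 / K] * K, eta=1.0 / K, random_state=0)
print(lda.print_topics())                        # word distributions per topic
```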

4.2.2 NMF

The Non-Negative Matrix Factorization (NMF), introduced by Lee and Seung (1999), is a dimensionality reduction technique commonly used for data analysis and feature extraction. The relevant notation can be found in Table 9.

Given the TF-IDF matrix U, where each row represents a document and each column represents a unique term, NMF seeks to factorize U into two non-negative matrices: the document-topic matrix \(\theta \) and the topic-term matrix \(\beta \).

To obtain the factorization, NMF minimizes the reconstruction error or the approximation loss between U and \(\theta \beta \), for example via the following objective function:

$$\begin{aligned} \min _{\theta ,\beta } \Vert U - \theta \beta \Vert _{F}^2, \end{aligned}$$
(5)

where \(\Vert \cdot \Vert _{F}\) represents the Frobenius norm, and \(\theta \) and \(\beta \) are subject to the following constraints:

$$\begin{aligned} \theta \ge 0 \quad \text {and} \quad \beta \ge 0. \end{aligned}$$
(6)

As per Berry et al. (2007), the optimization problem can be solved using various algorithms, such as multiplicative updates:

  1. 1.

    Initialize \(\theta \) as a random dense matrix, \(\theta \) = rand(m,k)

  2. 2.

    Initialize \(\beta \) as a random dense matrix, \(\beta \) = rand(k,n)

  3. 3.

    Then, until algorithm converges or for a preset number of iterations:

    1. (a)
      $$\begin{aligned} \beta _{ij} \leftarrow \beta _{ij} \frac{(\theta ^TU)_{ij}}{(\theta ^T\theta \beta )_{ij}} \end{aligned}$$
      (7)
    2. (b)
      $$\begin{aligned} \theta _{ij} \leftarrow \theta _{ij} \frac{(U\beta ^T)_{ij}}{(\theta \beta \beta ^T)_{ij}} \end{aligned}$$
      (8)

In a Topic Modelling context, by applying NMF to the document-term matrix, the resulting document-topic matrix represents the topic proportions for each document, while the topic-term matrix indicates the term distributions for each topic. The entries in \(\theta \) and \(\beta \) can then be interpreted as the strengths of the associations between documents and topics, and topics and terms, respectively.
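A minimal scikit-learn sketch of NMF-based Topic Modelling on a toy corpus; we select the multiplicative-update solver to match the algorithm outlined above, while the corpus and hyperparameters are illustrative.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gamestop short squeeze", "amazon cloud revenue",
        "gamestop shares rally", "amazon prime sales"]    # toy corpus

U = TfidfVectorizer().fit_transform(docs)   # TF-IDF matrix U (docs x terms)
nmf = NMF(n_components=2, solver="mu",      # multiplicative updates, Eqs. (7)-(8)
          init="random", max_iter=500, random_state=0)
theta = nmf.fit_transform(U)                # document-topic matrix
beta = nmf.components_                      # topic-term matrix
```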

4.2.3 BERTopic

Created by Grootendorst (2022), BERTopic is a modular framework for Topic Modelling. The relevant notation can be found in Table 9. The BERTopic implementation typically consists of four steps.Footnote 6

  1. 1.

    Compute EmbeddingsFootnote 7 At this stage, input documents are converted into numerical real-valued vectors, making them more suitable for neural network-based models. This representation is called “embedding” and encodes the meaning of the word (or sentence, depending on the type of embeddings we use) in the vector space. A close proximity of two embeddings in the vector space implies a similar meaning. By default, BERTopic utilizes a sentence-transformer model (we used the default: “all-MiniLM-L6-v2”)—a modification of the pretrained BERT model presented by Reimers and Gurevych (2019)—for its embeddings.

  2. 2.

    Reduce dimensionsFootnote 8 Since embeddings are often high-dimensional, the clustering that is later required to construct topics could become computationally highly expensive and complex. Therefore, BERTopic employs by default the dimension reduction technique UMAP, developed by McInnes et al. (2018). UMAP has the capability to effectively capture the complex relationships and structures present in both the local and global aspects of high-dimensional embedding space when reducing the dimensionality. By default, BERTopic sets the number of components to 5, meaning the data is embedded into a reduced space with a dimensionality of 5.

  3. 3.

    ClusterFootnote 9 Having reduced the dimensionality of the embeddings, we next cluster them into groups by similarity, in order to extract meaningful topics. This is a crucial step, as the quality of the clustering is paramount to the quality of our Topic Model (in the context of Topic Modelling, “cluster” and “topic” are interchangeable). By default, BERTopic employs HDBSCAN, implemented by McInnes et al. (2017), for its clustering.

  4. 4.

    Obtain topic representationsFootnote 10 The topic representation should showcase what makes one topic different from the others. The default approach for this in BERTopic is the class-based Term Frequency-Inverse Document Frequency (henceforth c-TF-IDF):

    $$\begin{aligned} Q_{w,k} = \left\| M_{w,k} \right\| \log \left( 1+\frac{A}{f_w}\right) \end{aligned}$$
    (9)

    where \(M_{w,k}\) is the frequency of word w in class (or cluster) k, \(f_w\) is the frequency of word w across all classes, and A is the average number of words per class. In this context, the terms “cluster” and “class” can be used interchangeably. The result, \(Q_{w,k}\), can be interpreted as an importance score for word w in the cluster/topic k: intuitively, the more important a word is within a cluster, the more representative it is of that topic (a minimal usage sketch of the full pipeline follows after this list).
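The four steps map directly onto BERTopic’s default pipeline. The sketch below shows a minimal usage, where `load_corpus()` is a hypothetical loader returning the cleaned documents; BERTopic’s defaults require a reasonably large corpus for UMAP and HDBSCAN to behave well.

```python
from bertopic import BERTopic

# `load_corpus()` is a hypothetical loader returning the cleaned posts
# (tickers already mapped to company names).
docs: list[str] = load_corpus()

# Defaults: all-MiniLM-L6-v2 embeddings -> UMAP (5 components) -> HDBSCAN
# clustering -> c-TF-IDF topic representations, i.e. steps 1-4 above.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())     # topic sizes and top words
topic_model.reduce_topics(docs, nr_topics=20)  # ex-post hierarchical reduction
```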

4.2.4 The problem of evaluating topic models

Since Topic Modelling is a form of unsupervised learning, there are no “objective” metrics, such as accuracy, that help us evaluate its performance.

Other metrics such as coherence (Newman et al. 2010), diversity (Dieng et al. 2020), and perplexity (Zhao et al. 2015) could be used. Coherence measures the semantic coherence of the words within a topic, assessing how well they relate to each other. Diversity measures the distinctiveness of topics, ensuring that they cover different aspects of the corpus. Perplexity quantifies how well a Topic Model predicts unseen data. Grootendorst (2022) outlines that while these metrics can provide some guidance and assist in hyperparameter tuning, they are still proxies and may not fully capture the subjective evaluation of topics. Indeed, Chang et al. (2009) find evidence that many of these performance metrics negatively correlate with what humans interpret as coherent topics.

The decision of what constitutes a good Topic Model should then be more results-oriented. Chang et al. (2009) suggest that researchers should focus more on the real-world task performance of models. The statistical outputs of Topic Modelling algorithms can serve as a starting point, but the interpretation and validation of topics should involve human evaluation and expertise. It is therefore up to the researchers to define goals and evaluate the models with regard to them. For our research, we define two goals for the topic models:

  1. 1.

    Detect market-wide bearish/bullish trends or other interesting contemporary phenomena

  2. 2.

    Successfully separate the texts into stock-centric topics with as little overlap as possible, ideally:

    1. (a)

      assign each stock to only one topic

    2. (b)

      ensure this topic does not contain other stocks

The main task of the stock-separation goal is to act as a sanity check for the stock filtering step of our data pre-processing. This can help us to evaluate whether we have a clean separation of stocks in our data, which is crucial for our subsequent modelling of stock prices. Moreover, clearly defined stock topics allow us to pinpoint stock-specific discussions and track their fluctuations over time. This can provide intriguing insights when paired with current stock return trends and events.

To make a model’s performance with regard to our stock-separation goal easier to track, we implement a metric for our research that we call “stock-specificity”: the number of stocks that appear in exactly one topic, where that topic contains no other stock. Expressed mathematically:

$$\begin{aligned} \Psi = \mid \{ s \in S: \mid K(s) \mid = 1 \ \wedge \ \mid S(K(s)) \mid = 1 \} \mid \end{aligned}$$
(10)

The relevant notation can be found in Table 9.
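A small Python sketch of this metric, assuming topics have already been mapped to the stocks appearing in their representations (the data structure is illustrative):

```python
def stock_specificity(topic_stocks: dict[int, set[str]]) -> int:
    """Eq. (10): count stocks that appear in exactly one topic, where that
    topic contains no other stock. `topic_stocks` maps topic id -> set of
    stocks mentioned in its representation."""
    all_stocks = set().union(*topic_stocks.values())
    score = 0
    for s in all_stocks:
        hosting = [k for k, stocks in topic_stocks.items() if s in stocks]
        if len(hosting) == 1 and topic_stocks[hosting[0]] == {s}:
            score += 1
    return score

# GME and AMZN each live alone in one topic; BB and PLTR share one -> score 2.
print(stock_specificity({0: {"GME"}, 1: {"BB", "PLTR"}, 2: {"AMZN"}}))
```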

4.3 Sentiment analysis

For the predictive analysis, we generate a classification-based trend estimation using features from both the metadata and the text content itself, as detailed in Table 8. For all data sources, we apply the Valence Aware Dictionary and sEntiment Reasoner (VADER), a rule-based Sentiment Analysis framework specialized for short, sparse Social Media data, implemented by Hutto and Gilbert (2014). We use VADER’s “compound” value for all posts.
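For illustration, querying VADER’s compound score via the vaderSentiment package looks as follows; the example post is invented.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("GME to the moon, buying more!")  # toy post
print(scores["compound"])   # normalized score in [-1, 1]; the value we use
```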

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018) is a Transformer model, which consists of layers with a self-attention mechanism as well as a position-wise fully connected feed-forward neural network. We employ both BERT-Sentiment and VADER to offset any potential shortcomings of either method, especially given the unique language characteristics in our dataset.

4.4 Granger causality test

The Granger causality test, first proposed in Granger (1969), is a statistical test used to determine whether one time series can linearly forecast another. It is important to note that this does not imply causation in the strict sense. In our case, we use the test to determine whether the daily average stock-specific sentiments have predictive value for stock price movements. We consider a bivariate example with two time series variables, denoted as X and Y (the relevant notation can be found in Table 9), and implement the following testing procedure (a code sketch follows after the list):

  1. 1.

    We test each time series for non-stationarity, for example via an Augmented Dickey-Fuller (ADF) test.

  2. 2.

    We determine the lag length p, which represents the number of past observations of X and Y used in the model, via AIC or BIC.

  3. 3.

    We specify the vector autoregression (VAR) models as:

    $$\begin{aligned} Y(t) = \tau + \delta _1 Y(t-1) + \cdots + \delta _p Y(t-p) + \epsilon (t) \end{aligned}$$
    (11)
    $$\begin{aligned} Y(t) = \tau + \delta _1 Y(t-1) + \cdots + \delta _p Y(t-p) + \gamma _1 X(t-1) + \cdots + \gamma _p X(t-p) + \epsilon (t) \end{aligned}$$
    (12)

    We test for autocorrelation via a Durbin-Watson test, and if it is present, increase p by 1 until we eliminate it.Footnote 11

  4. 4.

    We test for the absence of Granger causality using our specified VAR model, where the null hypothesis \(H_0\) is: the past values of X do not have any additional predictive power for forecasting Y beyond the information contained in the past values of Y alone. Expressed mathematically: \(H_0\): \(\gamma _1 = \cdots = \gamma _p = 0\)

  5. 5.

    We perform hypothesis testing, for example using an F-test or t-test, to compare the two models.

    • If the null hypothesis \(H_0\) is rejected, it suggests that X Granger-causes Y, indicating that the past values of X provide additional information for predicting Y.

    • If the null hypothesis \(H_0\) is not rejected, it implies that X does not Granger-cause Y.

  6. 6.

    Repeat steps 2. through 5. whilst switching X and Y around, to explore whether the Granger causality is bidirectional.
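The core of this procedure is available in statsmodels. The sketch below runs the ADF test and the Granger F-tests on synthetic stand-in series in which past sentiment leaks into returns by construction; the series, lag order, and effect size are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

rng = np.random.default_rng(0)
sent = pd.Series(rng.normal(size=120), name="sentiment")   # stand-in X
ret = pd.Series(rng.normal(size=120), name="returns")      # stand-in Y
ret = ret + 0.5 * sent.shift(1).fillna(0.0)                # X leaks into Y

# Step 1: ADF test on each series (p < 0.05 rejects the unit-root null).
print("ADF p-values:", adfuller(sent)[1], adfuller(ret)[1])

# Steps 3-5: column 0 is the series to forecast (Y); the F-tests check the
# joint null that all gamma coefficients on lagged X are zero, per lag order.
results = grangercausalitytests(pd.concat([ret, sent], axis=1), maxlag=3)
```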

5 Results

5.1 Classification

Table 11 shows the classification accuracy for all stocks for the first, non-cross-validated classification approach. We find 72% classification accuracy for GME (GameStop) using BERT-Sentiments and removing all weekends with the LSTM model, and a surprisingly good performance of 68% for FB (Facebook) with the same outline using the Random Forest approach. AMZN (Amazon) performs best when using the LSTM model with backfilled weekends; whether BERT-Sentiments are included has no effect, with 64% accuracy in both cases. BB (Blackberry) performs best with the LSTM model without BERT-Sentiments and with backfilled weekends, scoring 66% accuracy. For Palantir (PLTR), we reach a high 74% accuracy for backfilled weekends, with and without BERT, using the Random Forest model.

Table 2 shows the classification accuracy for all stocks for the second, Time Series Split Cross Validation based classification approach. A first surprise is the 67% overall mean accuracy of AMZN with the LSTM model, contrary to the premise that “Blue Chip” stocks might be very difficult to predict based on the data at hand. FB, on the other hand, performs as poorly as expected: the results are no better than a random guess for both LSTM and Random Forest. BB also disappoints, with an overall mean accuracy of 57% for the LSTM model. PLTR, in contrast, scores very high with 68% for both models, suggesting a robust data foundation for classifying the trend of this stock using our database. A very interesting case is GME. The mean accuracy drops significantly to 56% for Random Forest and 53% for LSTM, respectively. Nevertheless, the absolute accuracies for each split reveal an interesting insight. The GME short squeeze began on the 21st of January 2021. For the split in which the training data consists only of November and December 2020 and the test set exactly covers the time of the short squeeze (14th of January to 28th of January), it is not surprising that the split’s accuracy hits rock bottom, heavily decreasing the overall mean accuracy. Encouragingly, the split accuracy increases significantly in the following month for the LSTM model, once the hype around the GME stock starts to accelerate. This suggests a stronger predictive potential of our model than the mean accuracy indicates, given the unexpected structural break, which is arguably by far the most extreme one among the stocks we focused on in the given time period.

Table 2 Summary of time series cross-validation results

Error analysis In the following, we conduct a brief error analysis. Traditionally, an error analysis considers the features of wrong/high-loss test set predictions and attempts to ascertain an underlying pattern. Since we are working with time series, we note that the features for any given prediction are actually the lagged features from the preceding n_lag days, where n_lag is model-specific.

We commence the error analysis by plotting the test set loss from all folds for each of our 5 considered stocks in Fig. 5. The correlation matrix of these loss series can be found in Table 3. We find that the PLTR test loss is weakly correlated with the GME test loss (corr. 0.284), the BB test loss (corr. 0.243), and the FB test loss (corr. 0.311). The BB test loss is moderately correlated with the AMZN test loss (corr. 0.44), as well as with the PLTR loss as outlined above. This suggests the presence of some temporal trends affecting the losses across different stocks, which could potentially be attributed to market-wide shocks in the meme stock domain or even beyond, and which our models were unable to capture through online discourse alone. Further research could include traditional market indices like the S&P 500 in the hope of capturing some of these temporal trends.

Table 3 Correlation matrix for the test-set losses of each stock specific model

We now turn to an examination of the feature-loss relationship. Table 13 presents the five features that are, on average, most correlated with the loss across all models. While we identify some features with small to moderate correlation with the loss in specific stock models, such as \(red\_reddit\_dd\_mean\_compound\_senti\_vader\_lag4\) (in plain words: the mean VADER sentiment in the daily discussion thread on Reddit from four days ago) with the test loss in the PLTR model, these correlations rarely hold across stock models. We therefore do not gain any generalizable insight. While we could try to address the model deficits on a stock level, for example by detecting and removing outliers within the \(red\_reddit\_dd\_mean\_compound\_senti\_vader\_lag4\) variable in the PLTR model specifically, this would mostly serve as a form of targeted optimization that risks overfitting to the nuances of historical data for a specific stock.

5.2 Topic modelling

We fit the LDA, NMF, and BERTopic models separately on the Stocktwits, Reddit, and Seeking Alpha data. We utilize the gensim implementation of LDA and the scikit-learn implementation of NMF. Since fitting BERTopic on the Stocktwits dataset proved too memory-intensive for our hardware capacity, we utilize a reduced sample of 1 million Stocktwits observations for all Topic Models for comparison purposes. As outlined earlier, the usual performance metrics can often be unreliable. Therefore, we only use coherence and perplexity to inform our intra-model hyperparameter decisions and do not rely on them for inter-model comparisons.

Table 4 provides an overview of the model specifications. We initially run BERTopic with no pre-processing barring stop-word and digit removal, as advised by the BERTopic developers.Footnote 12

Table 4 Specifications of the LDA, NMF, and BERTopic model runs

5.2.1 LDA

As outlined earlier, LDA requires the number of topics, K, as a hyperparameter. We implement models with default \(\alpha \) and \(\eta \) of 1/K. We perform hyperparameter tuning, where we evaluate the performance of different models with K = [10, 15, 20, 30, 40, 50] for Reddit, Stocktwits and Seeking Alpha comments and K = [5, 6, 7, 8, 9, 10] for Seeking Alpha articles. The perplexity and coherence scores of the LDA model over the different number of topics and sources are showcased in Fig. 6. The topic representations of the best performing LDA model w.r.t. coherence are displayed in Tables 12, 14, 15 and 16. For LDA, the “Count” column indicates the number of documents where the corresponding topic is most prominent.

5.2.2 NMF

Similar to LDA, NMF also requires K as an input. We conducted the same hyperparameter tuning as described above. The coherence scores of the NMF model over the different number of topics and sources are showcased in Fig. 7. The topic representations of the best performing NMF model w.r.t. coherence are showcased in Tables 17, 18, 19 and 20.

5.2.3 BERTopic

For the BERTopic implementation, we use the default sentence-embedding model all-MiniLM-L6-v2. For the dimensionality reduction and clustering, we take cuML’s implementations of UMAP and HDBSCAN in order to leverage the GPU and speed up our training. Initial runs of BERTopic show that the model is not able to establish the connection between a company and its stock ticker, such as “GME”. To amend this, we add another pre-processing step for BERTopic, in which we convert the stock tickers to the actual company names. Thanks to the HDBSCAN clustering model, we do not need to provide the number of topics ex-ante. Instead, we utilize the hierarchical topic structure that BERTopic builds to reduce the number of topics ex-post, thereby collapsing some outlier topics that are adjacent on the lower hierarchy level into a bigger topic. The modularity of BERTopic also allows for different ways to model topic representations. We use a KeyBERTFootnote 13-based topic representation in Table 21, a part-of-speech topic representation in Table 22, and the default c-TF-IDF representations in Tables 23 and 24. The reason for the different representational forms for the different data sources lies in an evaluation of all options for each data source: we use the best-performing representation for each source individually w.r.t. stock-specificity.

5.2.4 Stock specificity

The stock-specificity of the three models on the respective datasets is presented in Table 5. NMF achieves the highest stock-specificity on the Reddit data set, whereas BERTopic is the best-performing model on the Stocktwits, Seeking Alpha articles and Seeking Alpha comments dataset.

Table 5 Stock-specificity score of the models (the higher, the better, with the maximum possible score being 6 in our case) on the different datasets

5.3 Sentiment analysis

Part of our data comes with its sentiment self-labeled, since Stocktwits prompts users to indicate whether their comments are bullish or bearish. We utilize this to fine-tune our BERT model and to assess the performance of VADER and BERT on our dataset. We also self-label a random sample of around 200 Reddit comments, which we later use for testing.

We accomplish the fine-tuning of BERT by equipping a custom classification layer on top of the base pre-trained BERT model, employing the TFBertForSequenceClassificationFootnote 14 model provided by HuggingFace. We summarize the performance of both models on the held-out 36,000 Stocktwits and 200 Reddit comments in Table 25.
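The fine-tuning follows the usual HuggingFace/Keras pattern; the sketch below shows the setup on two toy posts, with the model checkpoint, learning rate, and epoch count chosen for illustration only, not reflecting our exact training configuration.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)           # bearish (0) vs. bullish (1)

texts = ["to the moon!", "this stock is done for"]   # toy self-labeled posts
labels = tf.constant([1, 0])

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dict(enc), labels, epochs=1, batch_size=2)  # fine-tune on the labels
```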

The fine-tuned BERT model significantly outperformed VADER by 21.19 percentage points on the Stocktwits test set and 13 percentage points on the Reddit test set, as can be seen in Table 25.

Next, we use this Sentiment Analysis model in conjunction with our topic models. Namely, we select a stock-specific topic, for example “gme_thread_gang” from our NMF Reddit model, and extract the comments associated with that topic. We then compute their sentiments with our model and obtain a daily average stock-specific sentiment time series, an example of which is shown in Fig. 8 (sentiments are defined on the unit scale; the more positive a sentiment, the closer it is to 1).

5.4 Granger causality test

We utilize the best-performing topic models w.r.t. stock-specificity for each dataset (NMF for Reddit, BERTopic for the rest) and use the topics from these models to obtain our stock-specific sentiment time series. In cases where we could not obtain a unique topic for a stock, we simply took the most prominent one in terms of its count, if applicable. Due to the low volume of our Seeking Alpha data, we combined the articles and comments datasets into one.

We also perform an additional experiment in which we circumvent having to rely on Sentiment Analysis and instead directly utilize the NMF Reddit model, simply taking the proportion of the bullish topic expressed each day compared to all topics—expressed as \(\frac{\#\text { of daily comments in the bullish topic}}{\#\text { of all daily comments}}\).

Given that Palantir’s IPO made the PLTR stock available to trade only from the beginning of October 2020,Footnote 15 and considering that we have little to no data before that point in time, we decided, for uniformity’s sake, to model the data for all stocks only from 01/10/2020 to 30/03/2021 (taking into account the NAs in the Reddit data in April).

The results of the stationarity tests are displayed in Tables 26, 27, 28 and 29. Using a significance level of 0.05, we can reject the null hypothesis of a unit root in all but one of the relevant time series and thereby infer that they are stationary.

The only series for which we could not reject the unit root was the PLTR returns in the combined Seeking Alpha dataset. As the PLTR returns were stationary in all other models, this brought to our attention that, even with the combined data, the Seeking Alpha dataset still did not have enough volume to cover all days in our considered timeframe. We visualize this in Fig. 9, where gaps in sentiment indicate days without PLTR-related articles or comments. Since we only consider stock returns on days where we have sentiment information, this results in a PLTR return series with the same gaps, which appears to have induced the non-stationarity. Upon differencing once, we can reject the ADF null hypothesis of a unit root. While we include the results for PLTR in the subsequent tables, we caution that a sensible interpretation of these results may not be possible due to the random gaps.

The results of the Durbin-Watson autocorrelation tests are displayed in Tables 30, 31, 32 and 33. For our research purposes, we accept a DW value between 1.5 and 2.5 as a sign of the absence of positive or negative serial autocorrelation. We are only forced to make an adjustment once, in the case of the PLTR sentiment time series from the NMF data: we increase the lag in the specified VAR model by 1, which appears to resolve the autocorrelation issue.

The results of the bi-directional Granger-causality tests are showcased in Tables 34, 35, 36, and 37. We summarize the significant results in Table 6. Please be aware that results from multiple-testing settings need to be interpreted with caution (Romano et al. 2010).

Table 6 Summary of significant findings in Granger-causality tests

In the Stocktwits dataset, we identify a significant unidirectional predictive relationship between the daily average sentiments for FB and its respective daily returns. In other words, the past daily average sentiments for Facebook provide statistically significant information on its future returns. We also find a significant unidirectional predictive relationship in the opposite direction for GME and AMZN—between their daily returns and their respective daily sentiments.

In the Reddit dataset, we identify a predictive relationship between the daily GME returns and the GME-specific sentiments, and between the daily GME returns and the bullish topic ratio. In the combined Seeking Alpha dataset, we identify a significant unidirectional predictive relationship between the daily average sentiments for AMZN and its respective daily returns. We also identify a bi-directional predictive relationship between PLTR daily average sentiments and the daily PLTR returns.

6 Discussion

6.1 Classification

In the results, we found surprisingly good performance for AMZN (67% CV mean accuracy) with the LSTM model, even though it is considered a Blue Chip stock. The CV mean accuracy for FB was no better than a random guess, as expected. PLTR was estimated fairly well (68% CV mean accuracy). The extreme shock (short squeeze) of GME could not be estimated, but the period afterwards, when the hype around the stock really started to gain wide attention, was estimated fairly well by the LSTM model (60% and 67% accuracy for the last two splits).

In general, we found that our models are better at forecasting losses than gains, as can be seen in Table 38. In formal classification terms, our models exhibit a relatively high specificity (78.57% across all stocks) and a more modest sensitivity (47.89% across all stocks).

Additionally, we emphasize the importance of the right stock selection: the fact that the estimation for FB is no better than a random guess can be related to the relatively lower amount of discussion about Blue Chip stocks and to the lower ratio of small-scale investors to other investors in the market for those stocks. Furthermore, there needs to be constant discussion in the online investment community about rather unknown, smaller-scale stocks in order to utilize the information for forecasting. This issue affects our model during the short squeeze period of GME, where the number of users discussing the topic on the platform skyrockets after the beginning of the short squeeze, leading to poor performance during the structural break in the stock price and the drastic change in the user base over the same period. Figure 10 shows the trading and posting volume for GME on Reddit. It is clearly visible that the number of posts is strongly correlated with the trading volume, suggesting a close connection between the involvement of small-scale investors and their effect on price volatility, a finding underlined by the results of our Granger causality tests below. Hence, revisiting the point made in Sect. 1 regarding the random walk hypothesis, it is important to emphasize again that this model of trend analysis should not be used in isolation by investors. Rather, they may consider it as an additional quantitative measure to support their decision-making process. The same holds for any conclusions drawn from the Topic Modelling in the following section.
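For concreteness, the specificity and sensitivity figures quoted above correspond to the standard confusion-matrix definitions; a minimal sketch with toy labels (1 = gain, 0 = loss; not the study's actual predictions):

```python
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions, 1 = "gain", 0 = "loss".
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # share of actual gains correctly identified
specificity = tn / (tn + fp)  # share of actual losses correctly identified
print(f"sensitivity={sensitivity:.2%}, specificity={specificity:.2%}")
```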

6.2 Topic modelling

6.2.1 LDA

Overall, the LDA model delivered suboptimal results w.r.t. our research goals. As shown in Table 5, the model failed to uniquely isolate a single stock in the Reddit and Stocktwits datasets. A closer look at its topic representations shows that it was also unable to capture any general bearish or bullish trends, nor any other interesting phenomena. This is not particularly surprising, since the average Reddit and Stocktwits comment in our dataset consists of around 15 words, as shown in Table 1. This certainly falls within the realm of what can be considered short text, and LDA has already been documented in Chen et al. (2019) to underperform in this domain. Indeed, LDA did fare somewhat better on our long-text datasets, the Seeking Alpha articles and comments, but so did NMF and BERTopic.

6.2.2 NMF

The NMF model consistently outperformed LDA across three of our four datasets and performed rather well w.r.t. our research goals. Taking the example of Reddit, Table 17 presents a shortened version of the Reddit topic representations with particular topics of interest.
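As a point of reference for how such a model is typically fit, here is a minimal scikit-learn sketch of an NMF topic model; the toy corpus, topic count and vectorizer settings are illustrative, not the paper's configuration:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy short-text corpus standing in for Reddit comments.
docs = [
    "gme to the moon, diamond hands",
    "bb is undervalued, holding bb",
    "paper hands selling gme today",
]

# TF-IDF features; NMF factorizes the document-term matrix into
# non-negative document-topic and topic-term matrices.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(X)
terms = tfidf.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = [terms[i] for i in row.argsort()[::-1][:5]]
    print(f"topic #{k}: {top}")
```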

On the Reddit dataset, NMF was the only model able to isolate any stock-specific topics: GME (topic #0) and BB (#5). In both of those topics we find words related to small-scale investors, like “robinhood” or “(GME) gang”. The model also performed well in the traditional sense of Topic Modelling and found other coherent and well-specified topics that captured relevant contemporary discussions of the community, such as a “Big Tech” topic (#28), a topic centering on the motivation to “hold the stocks” (later also referred to as the bullish topic) (#4), a topic capturing bearish market sentiments (#6), and what appears to be a topic voicing the discontent of users regarding the Robinhood trading restriction (#9). Topic #38 seems to represent concerns surrounding potential bot activity. Realized or potential short squeezes are also discussed (#33). In topics #23, #10 and #39 we find different types of slang used by the community,Footnote 16 ranging from common online words like “lmao” to more community-specific terms like “moon”, “rocket” and “retard”, respectively. In summary, we find that the Reddit community mainly discusses investment strategies for “meme stocks” in a very serious manner, as underlined by the large number of closely investment-related terms, combined with a motivational undertone and the usage of community-specific terms.

Plotting the topic frequency over time for some topics also provides interesting insights. Figure 11 showcases the daily topic frequency of the aforementioned “bearish” and “bot” topics, together with the GME returns for the same timeframe. Interestingly, the frequency of concerns regarding bot activity generally appears to rise every time the frequency of bearish comments rises. This might indicate a concern or fear on the part of Reddit WSB users that the postings amplifying negative sentiments during this critical time were actually bot-based. It is difficult to determine whether this is justified, as existing research into bot activity amid the short squeeze focused on bots that were artificially hyping the GME stock price, rather than bots trying to downplay its future prospects.Footnote 17 Interestingly, the two highest peaks of both topics occurred on the same days as the two all-time-lowest returns of GME, on 28/01/2021 and 02/02/2021, respectively.Footnote 18 If there were indeed bots trying to derail the “GME hype train”, they might have been successful. This also raises a question for further research: how easily might a herd sentiment be artificially promoted within the community to manipulate the price of a certain stock in a desired direction?

Figure 12 compares the topic frequency of the bullish and bearish topics against the backdrop of the meme stocks' closing prices. The bullish sentiment topic dominates in late January and early February, a period throughout which the stock prices of all four of our meme stocks reach their peaks in the analyzed timeframe. Shortly thereafter, the bearish sentiment topic frequency matches and occasionally overtakes it, which is also accompanied by falling stock prices. We interpret the motivational tone of the bullish topic as an expression of a “We are in this together” mentality of the community, which is reassuring itself that it is on the right path.

NMF also performed well on the Stocktwits dataset; we present relevant topic representations in Table 18.

While the model did successfully separate our meme stocks from one another in topics #0, #2, #3 and #16, this time it did so less accurately than on the Reddit data. In particular, other stocks, namely Nokia (henceforth NOK) and Naked Brand Group (formerly NAKD, currently CENN), are intertwined in our otherwise isolated stock-specific topic representations. Nevertheless, the other identified topics still present interesting information: topic #1 centers on cryptocurrencies, topics #5 and #6 represent market-wide bullish and bearish sentiments, topic #11 isolates discussions surrounding Virgin Galactic Holdings (stock ticker SPCE), topic #17 summarizes discussions on Big Tech companies, and topic #25 likely centers on the alleged connection between hedge funds and Robinhood's decision to halt trading of GameStop's shares. Additionally, we find many words that also occur on Reddit, though overall the language seems somewhat less economic-technical. We also find further community-specific terms like “hodl” and “diamond” (referring to “diamond hands”, someone who “hodls”, i.e., holds the stock) in topic #28, for example. We can therefore assume that the community active on Reddit at least partly overlaps with the users active on Stocktwits w.r.t. discussing the selected stocks. Overall, our findings support existing evidence that NMF excels in the short-text realm. It appears that the sparse representations induced by the non-negativity constraint were a good fit for our research goals, and they also enabled the model to produce the quickest results on all datasets.

NMF also arguably outperformed LDA even in the long-text domain. As can be seen in Table 19, NMF successfully isolated FB, AMZN and BB into their own topics. The model did provide two topics centering on GME, and it is interesting to briefly consider the possible reasoning behind this. Topic #1 mentions GME by its company name, GameStop, and seems to center more on the business of the company, with keywords such as “console”, “game” and “hardware”. Topic #6, on the other hand, refers to GME by its stock market ticker and accordingly features stock-market-oriented verbiage, such as “gme”, “short”, “option”, “stock”.

6.2.3 BERTopic

With regards to our goal of stock separation, BERTopic outperformed its competitors in three out of the four datasets, as was shown in Table 5.

Notably, BERTopic performed poorly on the Reddit dataset, not only in isolating stocks, but also in identifying other interesting discussions in general. It is likely that the model's understanding of natural language was less advantageous on the Reddit dataset, since Reddit, and in particular WSB, is well known for its usage of specific lingo, like “hodl” (a misspelling of 'hold', meaning to keep your investment even when the price drops), “diamond hands” (holding a stock even when the price is falling), and “tendies” (profits). It is highly probable that any existing embedding model is not particularly well-versed in such niche expressions. Training an embedding model on such a corpus may be an interesting avenue for future researchers.
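To illustrate how such a domain-adapted embedding model could be plugged in, a hedged BERTopic sketch follows; the checkpoint name and toy corpus are illustrative, and a WSB-finetuned SentenceTransformer would simply replace the generic one:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Toy corpus standing in for WSB comments; real data is needed for
# meaningful topics. Repetition just gives the clusterer enough points.
docs = (["hodl gme, diamond hands to the moon"] * 50
        + ["bb earnings look strong, buying more"] * 50
        + ["selling everything, market is crashing"] * 50)

# A generic embedding model; a corpus-finetuned one could be swapped in here.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedder, min_topic_size=10)

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```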

BERTopic had a better showing on our other short-form text corpus, Stocktwits, as can be seen in Table 7. It successfully isolated stock-specific topics for GME (#2) and BB (#4). Much like NMF, it was also able to identify interesting topics beyond our meme stocks, such as a DOGE-related topic (#1), a bullish topic (#3), a Big Tech topic (#5), a SPCE topic (#6), and two topics expressing frustration regarding the Robinhood trade-halting debacle: topic #16, which addresses Robinhood directly, and #12, which appears to center on Robinhood's CEO, Vlad Tenev.

Table 7 Selected topic representations of the BERTopic model fit on Stocktwits data

BERTopic was the only model to earn “perfect” marks in stock separation, uniquely separating all of our 5 stocks into their own topics in the Seeking Alpha articles dataset. Arguably, its topic representations, shown in Table 23, also provide the best and most coherent summary of the 5 stocks out of any model-dataset combination. It is worth mentioning that the BERTopic results for the Seeking Alpha comments also yield fairly clear topics that display sophisticated investment-related terms and aspects of the specific industries in which the selected companies operate. We can conclude that the discussion on Seeking Alpha is more topically and technically sophisticated than on our other sources.

Based on our findings, it becomes clear that the Seeking Alpha articles data lends itself the most to Topic Modelling, as all of our considered models had their best showing there. The Reddit and Stocktwits data proved considerably tougher for all of our models, largely due to their highly corpus-specific language, which we could not fully account for in the pre-processing and which, accordingly, presented a considerable challenge for BERTopic's word embeddings w.r.t. coherent clustering.

6.3 Sentiment analysis

It appears that, on our dataset, the heuristics and sentiment lexicon of VADER were insufficient to make up for the loss of contextual information caused by its token-level approach. The strong performance of the finetuned pre-trained BERT model was largely made possible by the self-labeled Stocktwits data; training a Sentiment Analysis model on a bigger self-labeled Stocktwits sample could provide even better results, though its domain of usability would be rather niche. However, extending the idea of websites prompting users en masse to self-label their content to other Social Media, such as Twitter or Reddit, would go a long way toward practically solving Sentiment Analysis on Social Media.
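The contrast between the two approaches is easy to demonstrate. The sketch below scores the same comment with VADER and with a generic pre-trained transformer; the transformer checkpoint is the library default, not the finetuned model used in this study:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

text = "GME to the moon, this stock cannot be stopped"

# VADER: lexicon- and rule-based, scores tokens largely without context.
vader = SentimentIntensityAnalyzer()
print(vader.polarity_scores(text))

# Transformer: contextual, sentence-level prediction (default checkpoint).
clf = pipeline("sentiment-analysis")
print(clf(text))
```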

6.4 Granger causality test

We found plentiful evidence for returns causing stock-specific and market-wide sentiments in the Reddit data, especially in the case of GME. On the other hand, we found limited evidence that meme stock sentiments Granger-cause stock returns. It is important to remember that Granger causality tests for a linear relationship between the variables in question. However, Hsieh (1991) and Abhyankar et al. (1997) present evidence that stock markets might in fact be non-linear systems, especially during market shocks. The GME short squeeze, around whose time frame our dataset is built, was in many ways a black swan event and certainly a huge market shock. In hindsight, it seems likely that its effects could be better modeled through techniques capable of capturing and predicting non-linear relationships, such as machine learning algorithms like random forests or neural networks. This could prove a fruitful avenue for future researchers.
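As an illustration of this suggestion, a minimal sketch of a random forest predicting the sign of next-day returns from lagged sentiments follows (toy data; features, lags and model settings are illustrative, not a validated strategy):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy daily sentiment and return series.
rng = np.random.default_rng(3)
df = pd.DataFrame({"sentiment": rng.uniform(0, 1, 200),
                   "returns": rng.normal(0, 0.02, 200)})

# Lagged sentiments as non-linear features.
for lag in (1, 2, 3):
    df[f"sent_lag{lag}"] = df["sentiment"].shift(lag)

# Binary target: does the next day's return end up positive?
future = df["returns"].shift(-1)
df["target"] = (future > 0).astype(int)
df = df[future.notna()].dropna()

X = df[[f"sent_lag{lag}" for lag in (1, 2, 3)]]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, df["target"])  # a time-aware train/test split would be needed in practice
```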

We did find evidence for the online stock-specific sentiment Granger-causing the returns of PLTR and FB. We initially believed companies like FB to be too big and their stock price too strongly grounded in fundamental value for online sentiments to have any predictive power over their stock price development. In an interesting turn of events, it may have been exactly this size and fundamental value, which kept FB outside the turbulence of the GME short squeeze, that enabled us to find a linear relationship via the Granger causality test. It is plausible that while the meme stocks were caught up in a whirlwind of non-linear effects, FB was in a way “anchored” by its fundamental value. This shows promise that the utilization of sentiments to model stock returns may be applicable beyond meme stocks and specific online events. Future studies may use these results to try to model more “traditional” stocks in different timeframes.

It is of interest to briefly consider how our results fit into the existing landscape of Granger causation of stocks. Previous studies on the matter, such as Hiemstra and Jones (1994), commonly feature trading volume as the predicting variable. That is natural: trading volume is a tangible, kinetic marker of market activity. Contrasting this, our research ventured into the realm of potential market change, its magnitude embodied by the sheer comment volume and its direction modeled through our Sentiment Analysis.

7 Conclusion

This research explored the relationship between investor activity on online platforms such as Reddit, Stocktwits, and Seeking Alpha during the GME short squeeze period, and the price fluctuation of the associated stocks. By performing Topic Modelling and Sentiment Analysis on over 6 million investment-related comments across these social networks from May 2020 to May 2021, we generated multi-dimensional insights in the domains of Topic Modelling, Sentiment Analysis and stock prediction.

Our analysis of the predictive-analytical dimension showed new possibilities for monitoring the investment discussions of small-scale investors for financial gains on the markets. Using aggregated data in this study proved informative enough for classifiers to work fairly well on certain stocks. It is important to keep in mind that the selection of stocks seems crucial for the approach to work properly; this includes ensuring that the selected stock is regularly discussed in the community. External shocks and events that strongly influence a stock's price can have a heavy effect on the user base and therefore on its predictive potential, as seen in the growing GME discussion after the initial short squeeze.

On the other hand, our classification approach has clear limitations. The utilized test set time interval is rather small; therefore, we cannot make a generalized statement on the full potential of using this kind of data for predictive purposes. Additionally, the informational potential of the textual data is not yet utilized to its full extent. In our experiments, we generated metadata like sentiment information and word counts, but did not make further efforts to utilize the textual information itself, for example via Word Embeddings. First experiments using a more complex neural network on the GloVe Word Embeddings of each individual text exhausted significant computational resources, even with just a small sample size. Nevertheless, our results make it easy to imagine that such an extensive analysis holds considerable potential for improvement. The integration of this potential will be the subject of future research.
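To make the embedding idea concrete, a minimal sketch of its simplest variant, representing each comment by the average of its pre-trained GloVe vectors via gensim, is given below; this is an illustration, not the experiment described above:

```python
import numpy as np
import gensim.downloader as api

# Load pre-trained 50-dimensional GloVe vectors from gensim's downloader.
glove = api.load("glove-wiki-gigaword-50")

def embed(text: str) -> np.ndarray:
    """Average the GloVe vectors of all in-vocabulary tokens in a comment."""
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(50)

features = embed("diamond hands holding gme")  # a 50-dim feature vector
```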

Analyzing the descriptive-analytical dimension of the phenomenon of “Social Media informed small-scale investors”, we adopted traditional models like LDA, NMF and VADER and compared them with newer, neural-net-based approaches such as BERTopic and a fine-tuned BERT model. Our analysis demonstrated the superiority of BERTopic in the field of Topic Modelling and affirmed the robust performance of NMF in the short-text domain. The Sentiment Analysis results revealed the edge of the neural-net-based fine-tuned BERT over its traditional counterparts. We found that the discussion on Reddit and Stocktwits seems less technical than in the analyzed Seeking Alpha comments. Additionally, on Reddit and Stocktwits it is apparent that users employ subculture-specific terms and phrases, motivate each other to invest in or hold stocks, thereby creating a “We are in this together” mentality, and voice concerns about possible bot interventions in their threads and posts.

Finally, we utilized the results of a stock-specific sentiment time series analysis based on the Topic Model results. For this analysis, we extracted the stock-specific conversations on Reddit through NMF, and those on Stocktwits and Seeking Alpha (articles and comments) through BERTopic, as these models provided the best results on their respective datasets w.r.t. our Topic Modelling goals. We then obtained the sentiments of these comments and subjected them to a Granger causality test to further evaluate their predictive potential over stock returns. Notably, we found significant evidence that online sentiments Granger-cause stock returns for both Facebook and Palantir. These findings emphasize the role of online sentiments in influencing stock market trends. The study opens up new avenues for future research; in particular, investigating the non-linear relationships between sentiments and returns could be valuable. This could be achieved by employing machine-learning-based techniques like random forests or neural-net-based models to further expand our understanding of this dynamic relationship.