1 Introduction

Popular content from social media sites such as Reddit provides a window into the opinions and interests of internet users. As people spend more time on the internet, the question of what they engage with has come to the forefront. However, the forces underlying the emergence of popular internet content are still poorly understood. In particular, the influence of memes’ topics on their popularity has not yet been analyzed. Do social media sites discuss the same issues as verified news sources? Are topical subjects currently prevailing in the news more likely to go viral on social media too? Is it worth “riding the wave” of topicality, or does interest in a topical subject decline rapidly on social media sites?

We investigate these questions using a dataset of image-with-text memes from Reddit and published articles from the New York Times. Image-with-text memes are usually images with overlaid text that are copied and shared extensively across the internet. The widespread prevalence of image-with-text memes online shows how well-suited the internet is to multi-modal reasoning and communication. Video, sound, and image data are more complicated to analyze than pure text data, making them less studied in the scientific literature, but our understanding of online communication would be incomplete without these analyses (Halevy et al. 2022). To address this gap, we analyze both the image and text content of meme data collected from Reddit.

In the digital age, many are concerned about losing the depth of conversation in the short-form discourse of the internet. However, social media content may simply achieve nuance in different ways than longer-form written communication. Multi-modality is one way of conveying nuance online. The level of nuance in online communication is further enhanced by referencing relatable and popular topics in society. Often, social media users are participating in larger discourses that are happening across many internet platforms. Topic analysis algorithms such as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) and BERTopic (Grootendorst 2022) provide the means for analyzing the larger conversations referenced.

The mechanism of reference is especially important to image-with-text memes, which have been defined precisely as a reference to reality followed by a punchline (Spitzberg 2014). By identifying meme topics, we connect individual posts to the broader discourse they are part of and analyze overall trends in internet discourse across 5 years of data. Our method moves beyond individual posts to analyze the ideas propagated by memes, the importance of which is highlighted by Mascarenhas et al. (2024).

Our previous work highlighted the prevalence of COVID-19-related content in viral Reddit memes in March 2020, at the beginning of the pandemic (Barnes et al. 2021). Other events likely exhibit similar properties. We assume that, in general, topical posts about subjects that are currently widely discussed in society are more likely to go viral online. In the present work, we investigate this hypothesis using the New York Times (NYT) as a reference document for topicality. Importantly, this study does not establish a causal relationship between NYT and Reddit. The correlations found simply indicate that the two data sources are tuned to similar topics. Specific contributions are listed below.

  • Identifying Reddit- and NYT-based topics.

  • Comparing the temporal distributions of 120 topics on both platforms.

  • Showing that posts about topics that are gaining prevalence in the NYT receive more upvotes on Reddit, including an innovator’s advantage for being among the first to post about a topic.

  • Demonstrating that topic-based features have significant predictive power in the machine learning identification of viral memes.

The research workflow used to establish these findings is summarized in Fig. 1. In Sect. 2, we review related literature. Section 3 discusses the data sets and the engineering of control features describing the Reddit memes. Topicality features, describing topics extrinsic and intrinsic to the Reddit data, are discussed in Sect. 4. Section 5 discusses the training of a CatBoost classification model (Prokhorenkova et al. 2018) to predict viral memes using all engineered features. In the same section, we analyze the incremental predictive power of topic-based features over other content-based control features, as well as the importance of intrinsic vs. extrinsic topicality features. Our findings are discussed and summarized in Sects. 6 and 7.

Fig. 1

Research workflow including examples of OCR text extraction, BLIP image captioning, and meme topic assignments

2 Related work

Previous work has shown that image-with-text memes often reference cultural and political topics (Brodie 2009; Du et al. 2020) such as those discussed in the NYT. Memes can be used to consolidate group identities (Dynel and Messerli 2020), influence political opinions (McClure 2016), and inspire political actions (Leach and Allen 2017). To give a current example, TikTok has been used to increase public engagement in the Israeli-Palestinian conflict (Yarchi and Boxman-Shabtai 2023). The work of Zannettou et al. (2018) investigated political content on social media more broadly. The authors labeled memes from four social media sites with annotations from the Know Your Meme (KYM) website, indicating whether a meme template has political content. They found that memes on many social media sites contain political content, and that on Reddit, political memes are more popular than non-political memes. These findings suggest that the NYT article corpus can appropriately serve to model some of the topics found in Reddit memes. Moreover, over half of U.S. adults now get news from social media (Liedke and Wang 2023), making work highlighting the connections between social media and the news increasingly important.

A prior model by Leskovec et al. (2009) showed that mimicry and novelty are key features of the news cycle. The dynamics of news stories in internet discourse can be traced through the replication of quotes across the web. Using this method, Leskovec et al. (2009) showed that new stories are more widely shared than out-of-date stories. In fact, the window within which a news story is relevant is typically very small. Most news-related quotes maintain a nontrivial volume for only 40 h around the moment of peak publication volume (Yang and Leskovec 2011), and blog adoption of a quote peaks, on average, only 2.5 h after the phrase peaks in the news media (Leskovec et al. 2009). Yang and Leskovec (2011) furthermore found that temporal features achieve the best accuracy for predicting how the popularity of a news-related quote will vary with time. Importantly, online information is not always trustworthy, and memes containing quotes and factual information have been shown to change as they replicate across the internet (Simmons et al. 2011; Leskovec et al. 2009). To the best of our knowledge, our study is the first to analyze image-with-text memes in relation to the news cycle, but topic analysis has been used in connection with social media data in other ways.

Popularity is a common framework through which to analyze social media data. Liking and re-posting are implicit endorsements of memes’ topics and opinions (Spitzberg 2014), implying that topics may influence memes’ popularity and diffusion across the web. Indeed, previous research shows that individuals are more likely to spread information about topics of interest to them. For example, Twitter users are more likely to adopt hashtags that align with their own topical interests (Grabowicz et al. 2016) and retweet hashtags that they have tweeted about in the past (Weng et al. 2012). Typically, the diffusion of memes through the internet is modeled as a simple contagion process, similar to the spread of disease (Wang and Wood 2011); however, topical content clearly exhibits more complex diffusion mechanisms. Poux-Médard et al. (2022) studied the interactions between topics in the diffusion of information on Reddit news pages, showing that the topics of previous posts have minor explanatory power for predicting the topics of subsequent posts. These analyses suggest that topics are interrelated and exhibit popularity trends.

The majority of previous research using topic analysis models such as LDA or BERTopic on social media data focuses on the topical interests of internet users. For example, preferential attachment can be described via topic analysis: Instagram users tend to connect with other users who share their interests (Ferrara et al. 2014). Weng and Menczer (2015) show that if a particular hashtag on Twitter is adopted by people with diverse topical interests, the hashtag is more likely to go viral, due to exposure to a diverse, and presumably broadly connected, community. In other cases, homogeneous communities with strong central figures promote content virality when the topic of the shared content matches the interests of the community members, implying the importance of shared beliefs and attitudes (Cialdini and Goldstein 2004).

Our analysis combines themes from previous research about the topicality of posts on social media with research about political events and the news cycle. Many studies compare multiple data sources to gain a fuller picture of internet content (Zannettou et al. 2018; Leskovec et al. 2009), but to the best of our knowledge, ours is the first study to use one dataset (NYT) as a source document for defining topics that are then assigned to another, distinct dataset (Reddit). Using this technique, we are able to observe significant alignment between data sets that are otherwise very different (community-run vs. institutional, comedic vs. informational). Our primary contribution to the literature is to study the influence of topicality on social media popularity more broadly. Informally, when a particular topic is widely prevailing, or increasingly published in the NYT, it is said to be topical. Following our formal definitions of topicality in Sect. 4, we show that memes about topics with growing topicality in NYT receive more upvotes on average on Reddit.

3 Data description and preparation

Data was collected from two sources: image-with-text memes from a popular social media site called Reddit, discussed in Sect. 3.1, and archived article metadata from the New York Times (NYT), see Sect. 3.2. Sections 3.3 and 3.4 discuss the engineering of content-based control features such as the color content of meme images and the sentiment of meme texts. Both data sets can be found on Github (HSDS 2024).

3.1 Reddit

Reddit, nicknamed “the front page of the Internet”, is the source of much viral internet content, making it likely that the viral content analyzed here also circulates elsewhere on the internet (Sanderson and Rigby 2013). Data from r/Memes, the largest community dedicated to sharing memes on Reddit, was collected using the Pushshift API (Baumgartner et al. 2020; Boe 2016; Podolak 2021). We limited the collection to a maximum of 1000 randomly sampled posts per day. Due to issues with the API, we were unable to collect data for 37 dates. After removing items with broken image links and GIFs, the Reddit data set contained 899,766 memes posted between January 1, 2018 and November 14, 2022. The r/Memes subreddit is the 12th largest community on Reddit, with more than 26 million subscribers. Based on the above characteristics, we believe that the collected data is representative, and our conclusions can be applied to memes appearing elsewhere.

Fig. 2

Distribution of upvotes in Reddit data and corresponding number of comments each meme received

Table 1 Features table. Shaded rows indicate features used for the CatBoost identification of viral memes, including topic-related features (red) and control features (gray)

In addition to the meme images, the API provides several metadata features, including the title and caption posted with the meme, the publication date, an over-18 content indicator variable, the number of comments, and the score. On Reddit, the score of a post is calculated as the number of upvotes minus the number of downvotes it received. These attributes provide opportunities to engineer various other features.

As visualized in the memes’ score distribution in Fig. 2, few memes receive a lot of attention, while the vast majority go unnoticed. In general, memes with high scores also receive more comments, as reflected by the Spearman rank-order correlation coefficient of \(r=0.65\) (\(p<0.001\)). In many places on the internet, including Reddit, popular content, whether measured by likes, followers, or views, is sorted to the top of users’ news feeds (Figueiredo et al. 2014). Reddit’s default content sorting algorithm, used to curate the “hot” tab, prioritizes content based on a mixture of how recently it was posted and the logarithm of its score (Salihefendic 2015). Content ranking methods such as this create a feedback loop between what is popular and what is visible online. While unpopular memes are visible only briefly when they are first posted, popular memes are widely viewed. One study manipulated the scores received by Reddit posts, finding that providing 10 upvotes soon after content was posted greatly increased the posts’ chances of going viral (Carman et al. 2018). This feedback can explain the heavy-tailed distribution of popularity measures on many sites, including in our Reddit data.
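The ranking rule described above can be sketched as follows. This is a minimal reconstruction based on Reddit’s open-sourced “hot” formula (logarithm of the score plus a recency term); the 45,000-second divisor comes from that published code, and the production values may differ.

```python
import math
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hot_rank(ups: int, downs: int, posted: datetime) -> float:
    """Sketch of Reddit's 'hot' score: log10 of the net score plus a
    recency bonus of one unit per 45,000 seconds (~12.5 hours)."""
    score = ups - downs
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = (posted - EPOCH).total_seconds()
    return round(sign * order + seconds / 45000, 7)
```

Under this rule, a 10x score advantage is worth the same as being roughly 12.5 hours newer, which is how the “hot” tab trades popularity against freshness.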

It is a common choice to model the heavy-tailed popularity distributions on social media as binary variables (Weng and Menczer 2015; Ling et al. 2021). Here, we define virality as a percentile: if a given meme is in the top 5 percent of memes posted within a \(\pm 7\)-day period around it, it is labeled viral (1); otherwise, it is labeled non-viral (0). The choice of percentile was based on the Reddit upvote distribution shown in Fig. 2. This percentile is further discussed and explored in Sect. 5, where the viral/non-viral target variable is used in a binary classification task. By measuring virality locally, over a two-week period, we avoid replicating seasonal popularity trends on Reddit. Additionally, as the Reddit userbase is constantly growing (Biachi 2023), this ensures we do not overstate the influence of memes from recent years.
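The local virality label can be sketched as follows; the `(posted_date, score)` pair representation is illustrative, not the paper’s actual data format.

```python
from datetime import date, timedelta

def label_viral(memes, top_frac=0.05, window_days=7):
    """Label each meme viral (1) if its score is in the top 5% of all
    memes posted within +/- window_days of it, else non-viral (0).
    `memes` is a list of (posted_date, score) pairs."""
    labels = []
    for day, score in memes:
        lo, hi = day - timedelta(days=window_days), day + timedelta(days=window_days)
        window = [s for d, s in memes if lo <= d <= hi]
        # viral iff fewer than top_frac * |window| memes beat this score
        beaten_by = sum(s > score for s in window)
        labels.append(1 if beaten_by < top_frac * len(window) else 0)
    return labels
```

A quadratic scan like this is only for exposition; over ~900k memes one would bucket posts by day first.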

Table 1 summarizes all features describing the Reddit meme data. Shaded rows indicate features used for training the predictive model, including topicality-related features (red) and control features (gray).

3.2 New York Times

Data from the same timeframe as the Reddit memes, January 1, 2018 to November 14, 2022, was collected from the New York Times Archives API. The API returns all article metadata for a given month, including the abstract, snippet, lead paragraph, headline, publication date, keywords, and type of material. Most (72%) of the collected data was from the news section, but other types of material such as crosswords, obituaries, sports, and book reviews were also included. The NYT published roughly 200 articles per day on weekdays and 100 articles per day on weekends, and publishing rates showed a slight decline over the 5-year period we examined. On a few dates, the archives contained more than 500 items on a single day, either because the NYT updated its archived podcast episodes or, during the first year of the COVID-19 pandemic and during US elections, because of state-by-state statistics reports. These statistics reports and podcasts were removed to prevent the topic models from forming topics specific to this type of post and to ensure an even daily distribution of articles. We identified no other entries requiring similar handling. After removing these entries, our data set contained 255,783 NYT articles in total.

3.3 Text features

We extracted text from the Reddit meme images using Optical Character Recognition (OCR), and generated image captions using Bootstrapping Language-Image Pre-training (BLIP) (Li et al. 2022), discussed in the next section. Examples of OCR and image caption results can be seen in Fig. 1. The OCR text and image captions were combined with the title and caption posted with the meme image on Reddit into one feature containing all text associated with the meme. This all-text feature was subsequently used for topic extraction and assignment in the Reddit data. In the NYT data set, the abstract, snippet, lead paragraph, headline, and keywords were all combined into one all-text feature for topic extraction.

The all-text features for the Reddit and NYT data were cleaned identically to ensure alignment when modeling the topics in both document corpora. We removed punctuation, made the text lowercase, removed stop words, and stemmed the words to their root forms using NLTK (Bird et al. 2009). Additionally, we made custom edits to the NYT text data based on observed differences between the NYT and Reddit vernaculars. For example, the NYT referred to the COVID-19 pandemic with the word “coronavirus”, whereas posts on Reddit tended to use “covid”, “rona”, “corona”, and “pandemic”. This is important because the LDA algorithm identifies and assigns topics based on shared words; a post on Reddit referring to “covid” would not necessarily be associated with an NYT topic that only included the word “coronavirus”. To solve this issue, we added the words “covid”, “rona”, “corona”, and “pandemic” to the all-text feature of every NYT article tagged with the NYT keyword “Coronavirus”. Other examples include adding “RBG” to articles tagged with Ruth Bader Ginsburg and “BLM” to articles tagged with Black Lives Matter. Additionally, acronyms of the form “G.O.P.” and “N.F.L.” were edited to “GOP” and “NFL” in order to preserve this information when punctuation was removed in the text cleaning process.
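The cleaning pipeline can be sketched as follows. The toy stopword list and suffix-chopping stemmer below are simplified stand-ins for NLTK’s stopword corpus and Porter stemmer, so the sketch runs without external downloads.

```python
import string

# Toy stand-ins for NLTK's stopword corpus and PorterStemmer.
STOP = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def toy_stem(word: str) -> str:
    """Chop a few common suffixes; a crude proxy for Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop stop words, stem."""
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return [toy_stem(w) for w in text.split() if w not in STOP]
```

The vocabulary-alignment edits described above (appending “covid” etc. to keyword-tagged NYT articles) would be applied to the raw all-text strings before this function is called.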

In addition to text cleaning, we extracted numerical features from the Reddit text data. These features were used as control features describing the Reddit memes for classification. As seen in Table 1, we recorded the number of characters in the title posted with the meme and in the OCR text extracted from the meme image; other studies show that on Twitter, the length of a post is strongly correlated with popularity (Tsur and Rappoport 2015). A binary variable indicating whether or not the all-text feature contained emojis was engineered using the Emoji 2.8.0 Python library (Kim and Wurster 2023). The valence of the Reddit title and OCR-extracted text was analyzed with the NLTK sentiment model (Shreyas 2019); sentiment scores close to 1 indicate more positive sentiment, while scores close to 0 indicate more negative sentiment.
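A minimal sketch of these numeric text features follows. The codepoint-range emoji check is a rough stand-in for the Emoji library used in the paper, and the NLTK sentiment score is omitted to keep the sketch dependency-free.

```python
def text_features(title: str, ocr_text: str) -> dict:
    """Character counts and an emoji flag for one meme's text fields."""
    def has_emoji(s: str) -> bool:
        # Rough check: Misc Symbols & Pictographs through the
        # Symbols & Pictographs Extended blocks (U+1F300-U+1FAFF).
        return any(0x1F300 <= ord(ch) <= 0x1FAFF for ch in s)
    return {
        "title_len": len(title),
        "ocr_len": len(ocr_text),
        "has_emoji": int(has_emoji(title + ocr_text)),
    }
```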

3.4 Image features

In addition to text features, we extracted numerical features from the meme images following a similar procedure to our previous work (Barnes et al. 2021). Due to higher information density, images are more succinct than pure text content. They grab our attention and are able to rapidly communicate complicated messages (Nelson et al. 1976). Like many forms of online media, the topic of a meme is not only communicated through the text caption but also through the underlying image, so it was important to determine meme topics based on the image as well as the text content. In total, we used 19 image-related features, including high-level features such as BLIP image captions and face identification, and low-level features such as the color content of the images.

Bootstrapping Language-Image Pre-training (BLIP) (Li et al. 2022) is a pre-trained, transformer-based model for describing images. The model generates an image caption of usually 5–10 words, such as “a skeleton sitting on a cart in front of a house”, to quote the example in Fig. 1. This allowed us to identify objects in the meme images that may be relevant to their topics, such as “skeleton”, to use for topic assignment. The BLIP caption was added to the memes’ all-text feature and cleaned identically to the other text data.

The pre-trained Multi-Task Cascaded Convolutional Neural Network (MTCNN) was used to detect faces in the meme images (Zhang et al. 2016). The model returned the probability that a face was present and the number of faces present in the image. For simplicity, we elected to use a binary variable indicating whether or not the meme image contained a face. Using the OpenCV image segmentation technique to mask the meme images, we calculated what percentage of the image area contained each of 10 colors (Stone 2018). The average hue, saturation, and value components of the HSV representation of the images and the average red, green, and blue components of the RGB representation were also used as control features.
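The average color features can be sketched with the standard library as follows; the paper computes them with OpenCV, and the `pixels` list-of-tuples format here is illustrative.

```python
import colorsys

def color_features(pixels):
    """Mean R, G, B and mean H, S, V over an image's pixels.
    `pixels` is a list of (r, g, b) tuples with components in 0-255."""
    n = len(pixels)
    r, g, b = (sum(p[i] for p in pixels) / n for i in range(3))
    # colorsys expects components in [0, 1]; H, S, V are returned in [0, 1]
    hsv = [colorsys.rgb_to_hsv(pr / 255, pg / 255, pb / 255) for pr, pg, pb in pixels]
    h, s, v = (sum(x[i] for x in hsv) / n for i in range(3))
    return {"avg_r": r, "avg_g": g, "avg_b": b, "avg_h": h, "avg_s": s, "avg_v": v}
```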

Previous work has shown that low-level image features such as those mentioned above can have a large effect on popularity (Russakovsky et al. 2015; Khosla et al. 2014). However, while an earlier study on the aesthetics of images shows that high-definition, bright-colored images are more appealing (Datta et al. 2006), in the case of memes, it appears to be the opposite. Popular memes generally contain dull colors (Barnes et al. 2021). For memes, these low-level features could encode the template image used to create memes, indicating that meme templates, rather than the low-level features themselves, have an impact on popularity (Coscia 2014).

Table 2 Most common LDA-identified topics in NYT and Reddit corpora. Words are listed in order of importance to the topic

4 Topic modeling

Using two topic models, LDA and BERTopic, we analyzed the topics of Reddit memes and NYT articles. For comparison, we used leading models from both paradigms of topic modeling: the probabilistic approach (LDA) and the embedding-based approach (BERTopic) (Egger and Yu 2022). Each topic is represented by a list of its most common words, as shown in the example topics in Tables 2 and 3. To identify “extrinsic topics” (NYT-based), we trained the topic models on the NYT data and performed inference on the Reddit data. To determine “intrinsic topics” (Reddit-based), the topic models were both trained and run for inference on the Reddit text data. From the distributions of each topic over time, we defined topicality features for the Reddit data. In total, 12 topicality features were added for each topic model. This is twice the number listed in Table 1 because the features were calculated using both the extrinsic and intrinsic topic models.

In this section, we discuss the LDA topic model (Sect. 4.1), the BERTopic model (Sect. 4.2), and the formalization of topicality into numerical features (Sect. 4.3). At the end of Sect. 4.3 we report statistics describing the relation between topicality features and the popularity of the Reddit memes. We display plots comparing topicality trends from the past 5 years of NYT and Reddit memes data. The interested reader can see the complete sets of NYT-based topics identified by LDA and BERTopic on our GitHub page (HSDS 2024).

Table 3 Most common BERTopic-identified topics in NYT and Reddit corpora. Words are listed in order of importance to the topic

4.1 Latent Dirichlet allocation

LDA is a statistical language model used to distill topics from a corpus of documents using a bag-of-words approach in which word order is not considered. Words that commonly co-occur in documents are grouped into topics. Topics are then represented as a probability distribution over all of the words in the document corpus, with the words most highly associated with the given topic receiving the highest probabilities. In the inference step, the word distributions of the topics are compared with the word distributions of each document to assess the probability with which each document discusses each topic. Documents are then represented as probability distributions over the set of topics (Blei et al. 2003).

One limitation of the LDA algorithm is the requirement to specify the number of topics in advance. We used the keywords metadata provided by the NYT to estimate an appropriate number of topics for the NYT data. The NYT indexed 1,583,868 unique keywords covering 92% of the articles, the most used keywords being “US Politics and Government”, “Trump, Donald J.”, and “Coronavirus (2020-nCov)”. We filtered the keywords list based on three criteria: a keyword had to be used (1) at least 100 times in total, (2) on at least 50 distinct dates, and (3) at least 5 times on at least one day. Only 127 unique keywords met these criteria, providing an estimated number of topics for the NYT corpus, which we rounded down to 120 for simplicity.
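The three-criteria keyword filter can be sketched as follows; the `(keyword, date)` pair format is illustrative, not the NYT API’s actual response shape.

```python
from collections import Counter

def filter_keywords(usages, min_total=100, min_dates=50, min_peak=5):
    """Keep keywords used >= min_total times overall, on >= min_dates
    distinct dates, and >= min_peak times on at least one single day.
    `usages` is a list of (keyword, date) pairs, one per tagged article."""
    per_day = Counter(usages)                    # (keyword, date) -> count
    total = Counter(kw for kw, _ in usages)      # keyword -> total count
    days = Counter(kw for kw, _ in set(usages))  # keyword -> distinct dates
    peak = {}
    for (kw, _), c in per_day.items():
        peak[kw] = max(peak.get(kw, 0), c)
    return {
        kw for kw in total
        if total[kw] >= min_total and days[kw] >= min_dates and peak[kw] >= min_peak
    }
```

The size of the surviving set (127 here) then serves as the topic-count estimate handed to LDA.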

The NYT all-text feature was used to train the LDA model, which was parameterized to find 120 topics in 1500 iterations over the corpus. We then labeled articles in the NYT corpus according to their highest probability topic. Table 2 shows that the most prevalent topics identified in the NYT data are about US politics & government and Coronavirus. This aligns with the most common NYT-applied keywords, lending confidence to our model. Using the trained LDA model, we performed inference on the Reddit all-text feature, resulting in a topic distribution assigned to each meme in the Reddit data, which described the probability that the meme is about each of the 120 NYT-based topics. Later, we will refer to this topic distribution as the vector, \(\underline{\tau }_{m}\).

We labeled the Reddit memes according to their highest probability topic, saving that topic’s probability of being assigned to the meme as well. Additionally, we calculated the Shannon entropy of the memes’ topic distributions. Memes assigned nearly uniformly to all topics receive high entropy values, while memes with a few dominant topics receive low entropy values. These are the “topic”, “prob”, and “topic entropy” features described in Table 1.
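The topic entropy feature can be sketched as follows; base-2 logarithms are assumed here, as the paper does not specify the base.

```python
import math

def topic_entropy(tau):
    """Shannon entropy of a meme's topic probability vector tau_m.
    Near-uniform distributions score high; a single dominant topic
    scores near zero. Zero-probability topics contribute nothing."""
    return -sum(p * math.log2(p) for p in tau if p > 0)
```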

In the same manner, a set of 120 intrinsic topics (500 iterations over the Reddit corpus) was extracted by training the LDA algorithm on the Reddit all-text feature. In this case, we chose 120 topics to be consistent with the NYT-based topic set. Again, the highest probability intrinsic topic, its associated probability, and the Shannon entropy of the topic distribution were saved for the Reddit memes.

Both the intrinsic and extrinsic topic models created one miscellaneous topic whose most important words were common words like “already”, “life”, “way”, “time” and “person”. This topic was the most prevalent in both the NYT and Reddit corpora, but it did not dominate the data sets; in both cases, the miscellaneous topic was assigned to less than 10% of the documents. Table 2 shows the top 2 most frequent topics in the Reddit and NYT data, excluding this miscellaneous topic.

Fig. 3

Monthly average topicality in the NYT (blue) and Reddit (red) as identified by the LDA (top row) and BERTopic (bottom row) topic models. Word clouds above the distributions show the top ten words associated with the topic

4.2 BERTopic

BERTopic is a newer topic modeling technique that uses a transformer-based approach (Grootendorst 2022). The model generates high-dimensional document embeddings, reduces the dimensionality of these embeddings with UMAP, and then clusters the vectors into topic groups using HDBSCAN. Word order is important for the transformer-based model to understand the context of texts. Therefore, word order was not changed during the text cleaning steps described in Sect. 3.3. To ensure comparability, we used the same text data for both the BERTopic and LDA models.

Fig. 4

Stream charts showing 5-year trends for 20 NYT-based topics identified by BERTopic

A set of 120 extrinsic topics was extracted using BERTopic from the NYT all-text document corpus, and inference was run on the Reddit all-text corpus to assign these extrinsic topics to the Reddit memes. Although it is not necessary to specify the number of topics before fitting BERTopic, the number of topics can be reduced after training. Our BERTopic model was parameterized to cluster topics containing at least 100 documents, resulting in 194 NYT topics. This was reduced to a set of 120 topics, to match the LDA model, using automatic and manual approaches. We applied BERTopic’s default topic-reduction technique, which merges the most similar HDBSCAN clusters. Additionally, we qualitatively assessed the similarity of topics’ word distributions and manually merged those that were most similar. We trained a second BERTopic model to find 120 intrinsic topics, using only the Reddit all-text features, following the same procedure.

As with the LDA model, BERTopic initially generated miscellaneous topics with common words such as “like” and “also”. These topics were manually combined into one miscellaneous topic, which, again, was the most prevalent topic assigned to both the NYT and Reddit data. However, this topic again accounted for less than 10% of the data sets.

A disadvantage of BERTopic is the number of outliers produced, i.e., documents that are not assigned to any topic. To address this, we used BERTopic’s calculate_probabilities parameter to compute the probability that each document belonged to every topic. We then automatically assigned every document to its highest probability topic, leaving no outlier documents. This parameter also made the BERTopic output match the LDA results, in which every document is represented by a probability distribution over the topics, \(\underline{\tau }_{m}\). As in the case of the LDA model, we supplemented the meme data with the highest probability topic, its probability, and the Shannon entropy of the topic distribution.
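The outlier reassignment reduces to an argmax over each document’s topic probability vector; a minimal sketch:

```python
def assign_top_topic(tau):
    """Map a document's topic probability vector (as produced under
    BERTopic's calculate_probabilities option) to its argmax topic and
    that topic's probability, mirroring the LDA representation so that
    no document is left as an outlier."""
    best = max(range(len(tau)), key=lambda t: tau[t])
    return best, tau[best]
```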

The BERTopic and LDA topic models obtained similar results. As shown in Fig. 3, there was a considerable amount of overlap between the topics found by the LDA and BERTopic models, and their 5-year temporal distributions. Table 3 shows the top 2 most common topics identified by BERTopic in the NYT and Reddit data sets. In the NYT, the most prevalent topics were those that the NYT publishes about regularly, such as arts events, real estate, and sports. COVID-19 and the invasion of Ukraine were the only two current events that appeared in the top 10 NYT topics. Notably, topics about police brutality and sexual assault were among the top 10 most prevalent NYT-based topics in the memes data as identified by both models, but these topics were not as prevalent in the NYT data.

Although NYT and Reddit follow many of the same topicality trends, the proportions with which they publish about these topics differ. This can be seen in the stream charts in Fig. 4. The NYT publishes much more about the arts, economics, and real estate than Reddit, whereas Reddit posts are more likely to discuss animals and technology than the NYT. Notably, current events are discussed at length in the NYT, but in Reddit memes interest in these topics declines rapidly after the date of the event. The impact of COVID-19 on NYT publishing is also evident in Fig. 4 as live events such as theater performances, weddings, and fashion week were canceled due to the pandemic. Moreover, the effect of Reddit outages, when the website was down for 5–10 h in 2019 and 2021, can also be observed.

4.3 Topicality features

Fig. 5

BERTopic-identified topical alignments between the NYT and Reddit

Fig. 6

LDA-identified topical alignments between the NYT and Reddit

In addition to saving the topic, probability, and topic entropy for each Reddit meme, we developed variables to assess whether trending, “topical” topics were more popular than non-trending topics. These variables were calculated for both the LDA and BERTopic models and for both the NYT-based, extrinsic topics and the Reddit-based, intrinsic topics.

4.3.1 Defining trending topics

First, we define trending topics in terms of how frequently they were published on given dates. This topicality variable, \(\tau _{t}(d)\), was calculated as the normalized sum of the probability with which documents were assigned to a given topic t on a given date d. Formally,

$$\begin{aligned} \tau _{t}(d) = \dfrac{\sum \limits _{a \in A_d}^{}p_a(t)}{|A_d|}, \end{aligned}$$
(1)

where \(A_d\) is the set of articles published on date d and \(p_a(t)\) is the probability that document a is about topic t. Note that \(\tau _{t}(d)\) can be calculated for either daily or monthly granularity. Figures 3 and 4 show monthly topic trends, meaning d in Eq. (1) was the month and year in which the document was posted. Figures 5 and 6, on the other hand, show daily topic trends, meaning d was the date on which the document was posted. Furthermore, \(\tau _{t}(d)\) was calculated for both NYT articles and Reddit memes. We use “document” to refer to either an article or a meme. The blue lines in the above-mentioned figures show NYT distributions in which a in Eq. (1) represents NYT articles, whereas, for the red lines, a represents Reddit memes. The plots serve to compare topical trends in the NYT and Reddit and are thus only based on the sets of extrinsic topics.
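In code, Eq. (1) amounts to grouping documents by date and averaging their assignment probabilities for the topic. A minimal sketch, assuming each document is stored as a (date, topic-probability mapping) pair; this data layout is an illustrative assumption, not the paper's actual pipeline:

```python
from collections import defaultdict

def daily_topicality(docs, topic_id):
    """tau_t(d) from Eq. (1): the per-date average of the probability
    with which reference documents are assigned to topic `topic_id`.

    `docs` is an iterable of (date, topic_probs) pairs, where topic_probs
    maps topic ids to assignment probabilities; dates can be days or
    months, giving the daily or monthly granularity."""
    prob_sum = defaultdict(float)   # numerator: sum of p_a(t) per date
    n_docs = defaultdict(int)       # denominator: |A_d|
    for date, topic_probs in docs:
        prob_sum[date] += topic_probs.get(topic_id, 0.0)
        n_docs[date] += 1
    return {d: prob_sum[d] / n_docs[d] for d in prob_sum}
```

Running this once per topic over a corpus yields per-topic trend lines of the kind plotted in Figs. 3–6.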

Topics about current events typically exhibit a rapid spike followed by a slower decline. Oscillating patterns appear for topics tied to sporting events, holidays, and other recurring celebrations, such as Pride. For many topics, like the invasion of Ukraine and the Christmas holidays shown in Fig. 3, the distribution of Reddit memes posted about a topic aligns closely with the distribution of NYT articles. For other topics, such as the marriage and romance topic in Fig. 3, the two sources differ. We can also see that while the topic models identify the documents associated with each topic well, they are not perfect. For example, the LDA model associates some NYT articles about Russian interference in the 2020 US presidential elections with the topic about the invasion of Ukraine. This can be seen in the 2020 spike in the LDA Ukraine invasion topic in Fig. 3 and is likely due to words about Russia being prevalent in both topics. Despite such imperfections, the identified topics describe the data sets well for both algorithms.

4.3.2 Defining topical memes

Equation (1) and the corresponding charts describing trending topics can be used to identify which memes are topical and which are not. Here, we consider Reddit memes topical if they are about events that are currently on the rise in NYT publishing. Visually, memes published about a given topic when there is a spike in the corresponding NYT distribution (e.g., on the date of the current event) are considered topical.

Applying Eq. (1) for each topic on the NYT document corpus results in topicality vectors with a length of 120 for each date d, \(\underline{\tau }_{d}\), describing how frequently each topic was being discussed on a given month or day in the 5 years that our data span. This vector can be formalized as,

$$\begin{aligned} \underline{\tau }_{d} = \sum \limits _{t \in T}^{}{\tau _{t}(d)}\cdot {\underline{i}_{t}}, \end{aligned}$$
(2)

where T is the set of topics, and \(\underline{i}_{t}\) is the unit vector on |T| coordinates with the coordinate corresponding to topic t being 1.
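Because Eq. (2) simply stacks the per-topic values \(\tau _{t}(d)\) into one vector, \(\underline{\tau }_{d}\) can equivalently be computed as the mean of the per-document topic distributions on that date. A numpy sketch under that assumption:

```python
import numpy as np

def topicality_vector(doc_topic_probs):
    """tau_d from Eq. (2): one entry per topic, each equal to tau_t(d).

    `doc_topic_probs` is an (n_docs, n_topics) array holding the topic
    distribution of every reference document published on date d; the
    column-wise mean reproduces the normalized sums of Eq. (1)."""
    P = np.asarray(doc_topic_probs, dtype=float)
    return P.mean(axis=0)
```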

Using the topicality vector \(\underline{\tau }_{d}\) and the topic distribution \(\underline{\tau }_{m}\) assigned to a meme m in the model inference step, we define the topicality of a Reddit meme m as,

$$\begin{aligned} \textrm{top}_m = {\underline{\tau }_{d_m}}\cdot {\underline{\tau }_{m}} \end{aligned}$$
(3)

where \(\underline{\tau }_{d_m}\) is the topicality distribution of the reference data (NYT for extrinsic topics, Reddit for intrinsic topics) on the given date \(d_m\) that the meme m was posted, and \(\underline{\tau }_{m}\) is the topic distribution assigned to the particular meme instance.

Note that, again, this can be calculated at either a daily or monthly granularity, meaning \(d_{m}\) is either the day or the month in which the meme m was posted. Thus, Eq. (3) was used to define the “daily topicality” and “monthly topicality” features listed in Table 1. We expect the monthly topicality variable to be less noisy. For example, the monthly average topicality for the holidays topic, seen in Fig. 3, is elevated for all memes posted in December due to the effect of Christmas, whereas the daily average topicality variable takes high values only on the few days around Christmas and the other holidays.
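Equation (3) is then a dot product between the reference vector for the meme's posting date and the meme's own topic distribution. A minimal sketch, where `tau_by_date` is a hypothetical lookup from dates (or months) to reference topicality vectors:

```python
import numpy as np

def meme_topicality(tau_by_date, meme_date, meme_topic_dist):
    """top_m from Eq. (3): dot product of the reference topicality vector
    on the meme's posting date with the meme's topic distribution. Keying
    `tau_by_date` by day or by month yields the daily or monthly feature."""
    tau_d = np.asarray(tau_by_date[meme_date], dtype=float)
    return float(tau_d @ np.asarray(meme_topic_dist, dtype=float))
```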

In addition to the daily and monthly topicality features, reflecting how frequently a given topic was being discussed in the NYT or Reddit, we designed a feature describing the slope of the daily topicality distributions (e.g., in Fig. 6). This was calculated as the average slope of the reference data distribution for the \(\pm 2\) days around when the meme was posted. Given that topics about current events tend to spike rapidly, we used a small time frame. This is the “slope topicality” feature in Table 1. We suspect that a topic will be more popular on Reddit when the topicality distribution has a positive slope, indicating increasing topicality trends.
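The slope feature can be sketched as the mean day-to-day difference of the reference series in a small window around the posting day; the integer indexing scheme here is an illustrative assumption:

```python
import numpy as np

def slope_topicality(tau_series, meme_day, window=2):
    """Average slope of the daily reference topicality in the +/-`window`
    days around `meme_day` (an integer index into `tau_series`)."""
    lo = max(meme_day - window, 0)
    hi = min(meme_day + window, len(tau_series) - 1)
    seg = np.asarray(tau_series[lo:hi + 1], dtype=float)
    # mean of consecutive differences == total rise over the window / run
    return float(np.diff(seg).mean())
```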

Two binary topicality features were engineered based on the slope and daily topicality features. These binary variables were used for statistical tests showing that topical memes are more popular than non-topical memes. The first binary topicality variable, visualized in Fig. 7a, indicates whether the meme’s topic is being posted about more (labeled 1) or less (labeled 0) than average in the NYT on the date the meme was posted. The second binary variable, visualized in Fig. 7b, was calculated based on the slope topicality: it takes the value 1 when the slope of the reference topicality distribution is positive and 0 when the slope is negative.

The cross-tabulation of these two binary topicality variables yields a categorical topicality feature with 4 categories. This categorical feature indicates both whether the slope of the reference distribution was positive or negative on the date the meme was posted and whether the topic was being published about more or less than average on that date. Figure 7c highlights the case in which the slope of the reference distribution is positive but the topic is not yet being posted about more than average, which we call “combined”.
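The two binary flags and the “combined” category can be derived directly from a meme's daily topicality value, the long-run mean for its topic, and the local slope. A minimal sketch:

```python
def binary_topicality_features(tau_value, tau_mean, slope):
    """Binary topicality flags for one meme:
    above_avg - topic posted about more than average on the meme's date (Fig. 7a),
    rising    - reference distribution has a positive slope (Fig. 7b),
    combined  - rising but not yet above average, the innovator window (Fig. 7c)."""
    above_avg = int(tau_value > tau_mean)
    rising = int(slope > 0)
    combined = int(rising and not above_avg)
    return above_avg, rising, combined
```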

Fig. 7

Reddit extrinsic COVID-19 topicality distribution colored according to three binary topicality features (red=topical, black=non-topical).

Table 4 Pooled t-test results comparing memes’ scores, with groups determined by three binary topicality features.

Table 4 reports t-tests comparing the number of upvotes received by topical and non-topical memes according to these binary variables. On average, memes about topics with increasing (positive-slope) topicality trends receive higher scores than memes about topics with decreasing topicality trends. There is an additional innovators’ advantage for memes posted right when a topic is beginning to gain popularity in the NYT (Fig. 7c), as shown by the results for the combined topicality variable. Differences between the other groups, for which t-test results are not reported, were not significant. Taken together, these results indicate that memes posted about a topic that has recently hit the news receive more upvotes on average. Topicality has a positive effect on Reddit popularity; however, this effect does not endure long after the date of the event, as indicated by the non-significant results for the Fig. 7a variable.
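The comparison behind Table 4 can be reproduced with a standard pooled two-sample t statistic. A sketch; the paper's exact test configuration (e.g., one- vs. two-sided) is an assumption:

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Equal-variance (pooled) two-sample t statistic comparing the mean
    scores of two groups of memes, e.g. topical vs. non-topical."""
    na, nb = len(a), len(b)
    # pooled variance: the two sample variances weighted by their degrees of freedom
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
```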

5 Identifying viral memes

In order to gain a better understanding of how topicality influences the virality of a meme, we trained a CatBoost classifier. CatBoost, an ensemble learning algorithm for gradient boosting on decision trees, has been shown to outperform other algorithms on a variety of problems (Ibrahim et al. 2020). The classifier can parse categorical features natively, meaning the 120-category topic features did not have to be one-hot-encoded manually before training. We used the image, text, and topicality features summarized in the shaded rows of Table 1 for the binary classification task of identifying viral memes in our Reddit data. Figure 8 shows the algorithm’s performance in identifying viral memes.

Fig. 8

ROC curve and feature importance results for identifying Reddit meme virality with the CatBoost model

Fig. 9

SHAP values for extrinsic topicality features, indicating that topicality increases the probability with which the CatBoost model predicts memes to be viral

The binary target variable was highly unbalanced because it was defined by a percentile: only the top 5 percent of Reddit memes were considered viral. We used the CatBoost class-weighting parameter to improve performance on the imbalanced class. Additionally, using the feature set of the top-performing model, we experimented with defining the viral target variable at different percentiles, marking the top 1%, 5%, 10%, 15%, 20%, or 25% of memes as viral. These cut-off values resulted in AUC scores of 0.73, 0.73, 0.72, 0.71, 0.71, and 0.71, respectively. These experiments indicate a slight advantage to defining viral memes as the top 1% or 5% of the data, but overall the choice of threshold has a small impact on the outcome. The fact that the AUC is largely insensitive to the threshold used to define the viral class indicates that the machine learning model can effectively rank memes by their scores. This suggests that approaching the problem through quantile regression could be a viable alternative.
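Labeling the viral class at a percentile cut-off and deriving a positive-class weight can be sketched as follows; the helper and its weighting scheme are illustrative assumptions, with the resulting weight passed to CatBoost's class-weight parameter:

```python
import numpy as np

def viral_labels(scores, top_pct=5.0):
    """Binary viral target: 1 for memes above the (100 - top_pct)th score
    percentile, 0 otherwise. Also returns a positive-class weight
    (negatives per positive) to offset the class imbalance."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.percentile(scores, 100.0 - top_pct)
    y = (scores > cutoff).astype(int)
    pos_weight = (y == 0).sum() / max(int((y == 1).sum()), 1)
    return y, float(pos_weight)
```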

The CatBoost classifier was trained using data from 763,388 memes and tested using 84,821 memes. The model was fine-tuned with 5-fold cross-validation, using sklearn’s GridSearchCV for parameter selection. Notably, the title length feature is the most important for classifying viral memes: memes posted with short titles perform better than memes accompanied by a lot of text. This mirrors earlier work showing that brevity is the most important feature of viral tweets (Tsur and Rappoport 2015). The reason could be that messages that are easier to convey are understood better. Given the importance of the text length variable, we used CatBoost’s per_float_feature_quantization parameter to increase the number of decision boundaries allowed for this feature, taking advantage of its predictive power.
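A configuration sketch of the setup described above; the parameter values, grid, and feature index are illustrative assumptions, not the paper's tuned settings:

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

# Feature 0 is assumed to be the title-length column; it gets a finer
# quantization so the model can place more decision boundaries on it.
model = CatBoostClassifier(
    loss_function="Logloss",
    class_weights=[1.0, 19.0],  # upweight the ~5% viral class
    per_float_feature_quantization=["0:border_count=1024"],
    verbose=False,
)
grid = GridSearchCV(
    model,
    param_grid={"depth": [4, 6, 8], "learning_rate": [0.03, 0.1]},
    cv=5,
    scoring="roc_auc",
)
# grid.fit(X_train, y_train) would then run the 5-fold search.
```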

The classifier was trained separately for the LDA-based and BERTopic-based topicality feature sets. In order to estimate the importance of the topic-based features for predicting meme virality, we trained the CatBoost model with different subsets of features. The first models were trained using only topicality features, highlighted in red in Table 1. Alone, the topicality features (AUCs: LDA=0.64, BERTopic=0.57) are already able to predict meme virality better than a random guess (AUC=0.5).

It could be the case that the explanatory power of the topicality features is already contained in the control feature set. To test this, we trained the model with only the control features described in Sects. 3.3 and 3.4. Then, we assessed the incremental predictive power of topicality features over control features. Models trained with the entire feature set, all shaded rows in Table 1, were better able to identify viral memes than the model trained with only the control feature set. We tested the statistical significance of these AUC differences using a procedure from previous research (Hanley and McNeil 1982), finding that the engineered topicality features have significant predictive power over the control features for both BERTopic (\(AUC=0.71\), \(p<0.01\)) and LDA (\(AUC=0.73\), \(p<0.001\)). Results were more significant for the LDA topic model, and topicality features ranked higher in the LDA feature importances than in those for BERTopic, as seen in Fig. 8.
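The significance test for the AUC differences builds on the Hanley–McNeil standard error of an AUC. A sketch of that standard error; comparing two correlated AUCs additionally requires a correlation term, omitted here:

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC per Hanley & McNeil (1982), given the
    number of positive (viral) and negative (non-viral) test examples."""
    q1 = auc / (2.0 - auc)           # P(two positives both outrank one negative)
    q2 = 2.0 * auc**2 / (1.0 + auc)  # P(one positive outranks two negatives)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(var)
```

A z statistic for the gap between the control-only and full models then follows from the AUC difference divided by the combined standard error.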

Figure 9 shows the SHapley Additive exPlanations (SHAP) values for the extrinsic topicality features (Lundberg and Lee 2017). The text length feature, which is negatively correlated with meme virality, is provided for reference: memes with fewer words have a higher probability of being viral. The topicality features, on the other hand, are positively associated with the CatBoost estimate of meme virality. Memes with higher daily and monthly topicality values are assigned the viral label (1) with higher probability by the CatBoost model than memes with lower topicality values. Interestingly, higher values of topic entropy and topic probability also have a positive effect on the CatBoost outcome. This suggests that memes with a strong dominant topic (high probability) but an otherwise mixed topic distribution (high entropy) have a higher probability of going viral.

6 Discussion

The findings of this paper can be interpreted from the perspective of agenda-setting theory, which states that media sources shape the interests of society by selecting which topics to publish about. Our finding that news topicality is positively correlated with social media virality indicates that it may be harder to get attention for and spread non-topical information. This study does not establish a causal link between topics in the NYT and on Reddit; however, the substantial overlap between the sources indicates that they exert similar influence on their audiences.

Previous work performing topic analysis on social media data has focused on user topical interests (Grabowicz et al. 2016; Weng et al. 2012; Ferrara et al. 2014). Our analysis of general topics across a social media platform, and the use of a reference corpus to assess the topicality of those topics, is a unique contribution. We furthermore show that social media sites may align significantly with the news and should be considered part of the news cycle. Earlier findings from research on the news cycle suggest that early adoption is of crucial importance (Leskovec et al. 2009; Yang and Leskovec 2011), which we confirm here in our analysis of Reddit data. Our finding that topicality plays a role in meme popularity suggests that topics may influence a more complex diffusion process in online media. This complicates the existing account of meme diffusion as a simple contagion process (Wang and Wood 2011).

Finally, the results presented here contribute to the growing body of research modeling multi-modal data. With the rise of internet communication, multi-modality represents a paradigm shift in the forms discourse takes. Predicting the popularity of multimedia content is more difficult than for text-only data. For example, impressive results have been achieved in predicting viral tweets (Weng et al. 2012; Tsur and Rappoport 2015), but we have not yet seen such success in predicting the popularity of image-with-text memes. Apart from one article that worked with a small, hand-labeled data set (Ling et al. 2021), our AUC of 0.73 is, to our knowledge, the best result in the literature. Recent improvements in generative AI will soon make modeling multi-modal content more tractable.

With the increasing availability of internet data, social media has become a popular source of data for academics. Reddit is an especially popular data source, and the ethical guidelines around using this data are not yet firmly established. In light of this, we heed recommendations from studies surveying social media users about research data use. Namely, we collected data from a large, public community, ensured that no personally identifying information was associated with the data, and plan to share our results back to Reddit upon completion (Proferes et al. 2021).

7 Conclusion

While NYT and Reddit are qualitatively very different sources of information, they do show significant alignment. NYT-based topics are good descriptors and predictors of popular Reddit memes. Here we showed:

  • Many topics on Reddit and the NYT exhibit similar topicality distributions.

  • Memes that are about topics with increasing topicality trends in the NYT receive more upvotes on average.

  • “Innovator” memes posted when a topic is just beginning to be prevalent in the news receive the most upvotes on average.

  • Interest in topics from the news cycle declines more rapidly on Reddit than in the NYT.

  • A set of engineered features that identifies viral memes with a leading AUC of 0.73.

The scope of topicality in this study is shaped by its reliance on NYT data. Although the NYT covers a range of global events and subjects beyond the news, this choice of source may produce a focus somewhat more centered on U.S. and political themes. The USA-skewed results were appropriate for our purposes, matching the large proportion of Reddit users, 48%, based in the USA (Tiago Biachi 2023). However, this means our results are less relevant to other parts of the globe.

Our use of BLIP (Li et al. 2022) image captioning ensured that the topics assigned to Reddit memes were based on image as well as text content. Given the importance of the image to a meme, this was a crucial step. One drawback to this method is that BLIP cannot recognize celebrities. Identifying politicians in meme images doubtlessly would have improved our topic assignment. Furthermore, our research did not take community structure into account, although previous research has shown that accounting for community structure can improve topic assignment (Kalyanam et al. 2015), especially when modeling a social media phenomenon such as memes.

Many interesting implications of this research could be expanded on in future studies. Our work highlighted significant topical alignment between Reddit and NYT but also showed that the most common topics differ between the two sources. Further work could identify which topics from the news cycle are most likely to appear and gain popularity on Reddit. Similarly, our work indicated that interest in news topics declines more rapidly on social media than in the news, but future research could quantify these differences in topic dynamics.