Introduction

Can we predict whether a social media post will attract a large number of reposts or responses? Is it possible to predict the structure of the information cascade triggered by a specific post? Many models have been proposed to predict the virality of individual pieces of information (Weng et al. 2013; Cheng et al. 2014; Zhao et al. 2015; Li et al. 2017; Subbian et al. 2017; Kefato et al. 2018). Other approaches have aimed at generating ensembles of online conversations, providing population-level insights about how and why information spreads (Kumar et al. 2010; Wang et al. 2012; Gómez et al. 2013; Nishi et al. 2016; Lumbreras et al. 2017; Aragón et al. 2017b). This manuscript presents work that lies at the intersection of these two approaches. We propose a platform-agnostic generative model to grow conversation (cascade) trees that focuses on predicting each microscopic event (e.g., a comment or a retweet). The model is based on a supervised machine learning approach that leverages information about the tree generated up to a given point.

We conceptualize the conversation structures in various platforms as trees (Fig. 1). For instance, on Reddit, a post may attract comments, and each of these comments may induce further threaded conversations. On Twitter, similar conversation trees can be constructed from a more diverse set of interactions, such as retweets and replies.

Fig. 1
figure 1

Examples of conversation trees on a Reddit—a post followed by comments, and on b Twitter—a tweet followed by retweets, replies, and/or quotes. Link directions represent attention toward previous posts

We propose a generative model for predicting the size and structure of conversation trees on social media, which first predicts the final size of the tree and then iteratively adds nodes to the tree until it reaches the predicted size. To evaluate our model, at each point in a conversation history, we measure the likelihood of correctly predicting which post will be replied/shared/commented upon next. We also assess the accuracy of our method by measuring the relative deviation in terms of depth, breadth, structural virality, and size of our simulated trees from those in the ground-truth data.

Fig. 2
figure 2

A few example conversation trees from a–d Reddit and e–h Twitter. Each node represents a comment, retweet, or reply. Node size represents popularity (number of children) and node color represents temporal order in the tree sequence, from the root (blue) to the last leaf (red)

We test our model on Reddit and Twitter conversation datasets, made available as part of a challenge run by the Defense Advanced Research Projects Agency (DARPA). The conversation trees in the dataset have diverse sizes and structural features, as illustrated in Fig. 2. While our method outperforms a simple baseline and is competitive with a state-of-the-art baseline, its accuracy dramatically decreases for longer sequences of microscopic predictions due to accumulating errors. These results highlight the challenges of detailed reconstruction of online diffusion networks (Goel et al. 2012).

In the next section we review relevant studies about both the prediction of important features of information cascades and the simulation of their evolution. The prediction task is defined and discussed in “Problem and models” section, along with our model and the baselines adopted for evaluation. “Feature description” and “Datasets” sections provide details about the features used by our model and the data used for training and testing. The results are presented in “Results” section and discussed in “Discussion” section.

Related work

Cascades have been characterized on many platforms (Gómez et al. 2008; boyd et al. 2010; Rossi and Magnani 2012; Dow et al. 2013; Weng et al. 2013, 2014; Weninger 2014; Choi et al. 2015; Hui et al. 2018). These studies revealed common properties among popular/viral conversations, suggesting the possibility of predicting viral content. Despite such efforts, the prediction of virality remains elusive, probably due to the intrinsic randomness of popularity (Salganik et al. 2006; Goel et al. 2012).

Some machine learning methods aim to predict macroscopic characteristics of individual cascades (mainly the size) from features observable at an early stage. Backstrom et al. (2013) identified features that describe novelty, arrival patterns, textual expression, and social influence to predict the size of media sharing cascades. Cheng et al. (2014) formalized size prediction as a binary classification problem, achieving an accuracy of 80% at predicting whether the number of shares of a photo on Facebook would double. Li et al. (2017) proposed DeepCAS, an end-to-end deep learning approach that leveraged raw data to directly predict logarithmic increments of the size of Twitter cascades and paper citations within finite time windows (1, 3, and 5 days). Using data from Twitter and Facebook, Subbian et al. (2017) proposed SansNet, an approach based on survival analysis (Klein and Moeschberger 2006) that predicts whether cascades will become viral. Their definition of virality is based on percentiles — the largest 0.5% trees are considered viral. Beck et al. (2019) applied various machine learning models to monitor cryptocurrency news and predict the number of article mentions given their Twitter activity in the previous day. Other methods leverage node characteristics (Pei et al. 2014), network structure (Weng et al. 2013), temporal patterns (Pinto et al. 2013), or a combination of them (Guo et al. 2015).

Generative approaches aim at identifying underlying mechanisms that can reproduce stylized traits of cascades. Aragón et al. (2017a) published an extensive survey of approaches of this kind. Here we mention just a few to highlight their main ingredients, beyond what one may expect from a pure branching process à la Galton–Watson (1875). Kumar et al. (2010) combined the concept of novelty with preferential attachment. Wang et al. (2012) explored reaction times and lifespans of cascades by continuous-time modeling of dynamics. Gómez et al. (2013) added a root bias to popularity and novelty, recognizing the special role that the original post plays in the spreading process. Aragón et al. (2017b) built upon this model by adding reciprocity among users. They focused on how the structural features of conversations are affected by their presentation in a platform’s interface. Nishi et al. (2016) observed that the structural diversity observed among cascades was not described adequately by previous generative models. Such a diversity was manifest in the co-existence of deep conversation-like threads, shallow star-like trees, and irregular patterns. To reproduce the conversation-like patterns, they introduced the concept of segments, long chains of posts with little or no branching involved. Lumbreras (2016); Lumbreras et al. (2017) were able to capture different user roles in the cascade formation, and to use this information to predict user engagement.

Some deep learning approaches model a cascade by predicting the next action and its timestamp. The next action could be one of a given set of events, or one of a given set of users. CAS2VEC (Kefato et al. 2018) produces cascades as sequences of event timestamps. The methods proposed by Wang et al. (2017) and Islam et al. (2018) predict when users in a known social graph would participate in a cascade. Du et al. (2016) applied recurrent neural networks and temporal point processes to predict sequences of events such as stock-market orders and taxi trips. In these approaches, time is the most important component being predicted; the tree structure is not reconstructed.

Point-process models can simulate cascades by fitting parameters to historical data (Shen et al. 2014; Gao et al. 2016; Mishra et al. 2016; Cao et al. 2017; Rizoiu et al. 2017). These models can be used to predict the entire tree structure, and have been successfully applied to the study of Twitter cascades (Kobayashi and Lambiotte 2016) as well as conversation threads on Reddit (Medvedev et al. 2019; Krohn and Weninger 2019). SEISMIC (Zhao et al. 2015) predicted the final size of a retweet cascade with a 15% relative error on average after observing the first hour of the cascade. Most point-process models cannot work from just the initial post, and so they are not directly comparable to ours. An exception is the Cascade Tree Prediction Model (CTPM) proposed by Krohn and Weninger (2019), which uses a Hawkes process to predict the full tree structure from only the initial post by cleverly inferring the necessary model parameters from similar posts in the training data. To our knowledge, CTPM is the only model directly comparable to the one presented here, and therefore we use it as a baseline in our evaluation.

Problem and models

We represent a social media conversation as a tree of messages, with the root node being the initial post, and all subsequent comments, replies, or shares being additional nodes in the tree. Given some initial condition of a conversation, one goal is to predict its final macroscopic state, i.e., the size and structure of its tree. In addition, we aim to predict the longitudinal evolution of a cascade tree as a sequence of microscopic states.

We propose a Tree Growth Model (TGM) that starts from a partial tree containing the first k nodes, as well as information about the root post such as its textual content. TGM breaks the tree prediction problem into two sub-tasks: (1) predicting the final size of the tree based on the initial information given (“Size prediction” section); and (2) predicting what node in the tree the next node will attach to (“Node placement” section). TGM grows the conversation by attaching nodes (messages) until the tree reaches the predicted size. For each sub-task, we develop a model and propose a baseline for evaluation.

Fig. 3
figure 3

Complementary Cumulative Distributions (CCDF) of cascade size, breadth, depth, and structural virality within each dataset

Size prediction

We use regression to predict the final size of the tree based on features of the initial partial tree, which could be as minimal as a single first post (see “Feature description” section). The distribution of tree size is heavy-tailed, as shown in Fig. 3. Since it is difficult for a regression method to predict a quantity with such a broad distribution, we instead predict the size-quantile of the tree and then map that quantile back to a size using the empirical size distribution. This approach, similar to quantile regression (Koenker and Bassett 1978), allows our method to deal with imbalanced data without resampling large trees or excluding small ones.

We evaluated several regression models for the size prediction task, using cross validation within the training set. We selected Random Forest regression, which provided the best performance (see Table 1).

Table 1 Mean Relative Error in the size prediction problem by different regression models across our three datasets (cf. “Datasets” section), for \(k=2\) and \(k=10\)

We trained a different regression for each initial tree size k (for \(1 \le k \le 10\)), and trained each model based on the trees in the training data whose final size is larger than k. This performed better than simply using k as an additional feature in the regression.

We also considered an alternative model in which the size is not decided at the outset, but a decision whether to stop or continue is made at each step. However, without introducing strong assumptions and biases, this approach leads to unrealistic, exponentially narrow tree size distributions.

Node placement

Once we predict the final size, we iteratively attach additional nodes to the current conversation tree. To predict where the next node will be attached, we train an estimator to assign a likelihood score to each node in the current tree. We then draw a random node with probability proportional to its score. The selected node becomes the parent of the newly attached node. The process is illustrated in Fig. 4.

Fig. 4
figure 4

Diagram illustrating how nodes are added to the tree

The node classifier takes in a set of features from a given node and the tree, and estimates the node’s likelihood of becoming the parent of the next node in the tree. We considered various machine learning models for this task. As shown in Table 2, the Gradient Boosting Trees classifier achieved the best accuracy, as measured by the Area Under the ROC Curve (AUC) across our three datasets (cf. “Datasets” section). AUC captures the trade-off between maximizing the true positive rate and minimizing the false positive rate.

Table 2 AUC performance by various classifiers on the node placement problem

Based on this analysis, we selected the Gradient Boosting Trees model for the node placement task. The classifier outputs a score between 0 and 1, such that if the score is closer to 1, the classifier estimates that the node is more likely to receive the next reply in the cascade. For details on the features we used and how the training set for this classifier was constructed from the empirical tree data, see “Feature description” section and Appendix  “Extracting node placement classifier features”.

Treating the raw classifier score as a probability (as is often done) does not yield optimal results. Instead, we apply a Bayesian correction to the raw classifier output to obtain each node’s likelihood estimate. Specifically, given a node with classifier score S, let A be the case where the node receives the next reply, and \(\lnot A\) the case where the node does not receive it. Then we can compute the conditional probability that the node will receive the next reply given its score using Bayes theorem, \(P(A \mid S=s) = P(S=s \mid A) \, P(A) \, / \, P(S=s)\).

We compute the values on the right hand side of the equation from our training data and the classifier scores for the nodes in the training data. We set \(P(A)= N_A/N\), where \(N_A\) is the number of nodes in the training dataset that received the next reply and N is the total number of nodes in the training dataset. We compute \(P(S=s) = P(S=s \mid A) + P(S=s \mid \lnot A)\). Finally, both \(P(S=s \mid A)\) and \(P(S=s \mid \lnot A)\) are given by Gaussian Kernel Density Estimates produced from the the set of scores for nodes in our training data with condition A and \(\lnot A\), respectively.

Baseline models

Let us introduce two baseline models. One is a simple model that follows our approach with size prediction and next-node placement sub-tasks. The other baseline is a state-of-the-art method for predicting the full tree structure.

Random baseline:

A random size baseline model draws a random size for each prediction from the empirical distribution of sizes with replacement. We then use a simple random choice baseline for node placement, in which all nodes are equally likely to receive the next reply. Let j be the time-ordered index of existing nodes, and let n be the number of existing nodes in the tree. The probability of node \(x_j\) being the parent of the next node is given by \(P(x_j)=\frac{1}{n}\).

CTPM baseline:

The Cascade Tree Prediction Model uses Hawkes processes to reconstruct the structure of the cascade tree. While the details are found elsewhere (Krohn and Weninger 2019), here we explain the intuition behind the approach. Two distributions are fit, one for the root node and one for the other nodes, which determine branching factors at each node. CTPM infers the parameters of these distributions for each test post from similar posts in the training set. For each node, the number of children is drawn from its distribution, and the tree is constructed in a breadth-first way. The size is implicitly determined when the branching process stops after leaf nodes draw zero values for their numbers of children.

Feature description

Our models use four classes of features: structural, user, content, and temporal features. Each of these classes is described next. Table 3 outlines the features in each class. Appendix “Extracting node placement classifier features” discusses how the features are extracted, while Appendix “Feature importance” analyzes the importance of different features.

Structural features

We use two types of structural features for node placement and size regression. First, there are features about the current tree, including the initial size of the tree, its depth, and statistics about the placement of individual nodes (e.g., how many attached to the root node). Second, for the node placement task, there are features about the individual node, including its depth, how many children it has, how many siblings it has, and more.

Most of these features have standard definitions and are easily interpreted, but some are not. In particular, we introduce a normalized delay steps since parent measure, or simply the parent delay. It is an edge attribute capturing the relative difference in steps (actions) between a new node and its parent. Let the index of node i be ordered by the time of creation of the node. Then the parent delay of node i, \(d_{i}\), is given by

$$\begin{aligned} d_{i} = \frac{(i-\pi (i))-1}{i-1} \end{aligned}$$
(1)

where \(\pi (i)\) is i’s parent (\(i > \pi (i)\)) and \(d_{i} \in [0,1]\). The root index is 0, so the parent delay of the second node attached to the tree is undefined. That is, \(d_{i}\) is defined only for the second edge (third node) and on. This measure aims to capture the root bias and the novelty features in prior work (Gómez et al. 2013; Lumbreras 2016; Aragón et al. 2017a). When a node is connected to the root, this edge is the most delayed possible \(d_{i}=1\). Conversely, attaching a node to the most recent node previously added to the tree gives the shortest delay \(d_{i}=0\). Figure 5 shows a conversation tree and the parent delay for each edge. We derive several features from this measure, including the mean and median delay, and the percentage of edges connecting to the root and to the most recent node.

Fig. 5
figure 5

Representation of the conversation trees in Fig. 1 with parent delays (see Eq. 1) as edge weights. Each node links to its parent. Note the parent delay is undefined for the edge between nodes 0 and 1

User features

Table 3 Description of features, grouped by class, showing if they were used in the size prediction regression model, the node placement classifier, or both

User features are derived only from the author of the conversation’s initial post, or root. Since our approach is generative, incorporating information about other users would require the model to also associate users to each new node during the tree growth. We focus on the tree structure replication and leave user attribution for future work.

We extract structural properties of the conversations initiated by a user, such as max breadth, depth, and size (see Table 3 for the full list). The user features are the median and the mean of these properties.

Content features

As with user features, the content features are also limited to the root of the conversation, e.g., a Reddit post title or a Twitter text. Using any intermediate content would require the generative model to also produce content along the simulation, a task beyond the scope of this work.

We use fastText, a document embedding method developed by Facebook Research (Joulin et al. 2017), to represent the content of a conversation’s root post as a 25-component feature vector used by the size regression model and the node placement classifier.

Temporal features

We use various temporal features, including the time of day and day of week of the root post, and the mean time between replies in a conversation (see Table 3 for the full list). Temporal features prove very valuable for the size prediction task. However, we do not use temporal features for the node placement task, because this would introduce too much complexity, requiring our model to generate a time stamp for each new node as it is added to the tree.

Datasets

We use three social media cascade datasets that were made available as part of the DARPA SocialSim (DARPA 2018) December 2018 Challenge. Two originate from Reddit and the third from Twitter. All are anonymized. The Twitter dataset is a collection of public tweet cascades that contain Common Vulnerabilities and Exposure (CVE) codes and related keywords. Since retweets are always connected to the root tweet in the raw data, we use the follower data and timestamps to reconstruct the conversations (Goel et al. 2016; Pacheco 2019). One Reddit dataset was collected using the same CVE keywords as the Twitter dataset, while the other Reddit dataset consisted of cascades containing keywords related to several cryptocurrencies, including names, aliases, and cashtags of the coins.Footnote 1 Table 4 presents a statistical summary of the datasets.

Table 4 Statistics about the three cascade datasets

For each dataset, we held out a number of the most recent cascades as the test set and used the rest for training. For the two Reddit datasets, we used 500 cascades for the test set. For Twitter, we used 5000 cascades because most cascades were trivial (containing only one tweet), so we needed a greater number and diversity of cascades to test our method’s performance. For the Reddit Crypto and Twitter CVE datasets, we used 10,000 cascades for the training set. We decided to limit our training set to 10,000 cascades because extracting the features from these cascades during the training process was computationally expensive. For Reddit CVE, we were limited to 2911 cascades in the training set because the dataset only contains a total of 3411 cascades.

We also considered the use of some publicly available datasets that are often used to predict cascades of votes on the Digg news platform (Hogg and Lerman 2012), quote/phrase memes on the Web (Leskovec et al. 2009), and retweets on Twitter (Hodas and Lerman 2014). However, none of these datasets have a minimal set of features used in our model. In particular, they do not contain usable cascade tree structures or lack the information needed to reconstruct the tree structure of each cascade. Digg cascades are all trivial stars because all users in a cascade are voting for the same post, not creating a conversation tree. In addition, the Twitter and Digg datasets lack text content.

Evaluation

We evaluated our models for each sub-task, size prediction and next-node placement, as well as for the combined task of predicting the size and structure of whole conversation trees. In this section we describe our evaluation methods.

Size prediction

Given a partial conversation tree with k initial nodes and information about the initial post (post content, time of day, etc.), our model predicts the final size of the tree. We adopt the Relative Error as our error measure, calculated as \(\frac{|s - {\hat{s}}|}{|s|}\), where s is the true size and \({\hat{s}}\) is the estimated size of the tree. We report the Mean Relative Error (MRE) over the set of trees in our test data.

Node placement

We evaluated the model’s ability to predict where to place the next node and compared it to the random choice baseline model for node placement. Specifically, let \(\tau (n)\) be the n-th stage in the construction of tree \(\tau\) (n is the current size of tree \(\tau\)). Each node i in \(\tau (n)\) is assigned a probability \(p(i_n)\) that it will be the parent of the next node. If \(i^*_n\) is the correct next node from the empirical data, we use \(p(i^*_n)\) as our accuracy metric for this task because that is the probability of making the correct choice at step n in the evolution of the tree. This probability is averaged across all the steps across all the trees within our evaluation dataset.

For the random choice baseline, the probability of placing the next node correctly in a tree with n nodes is given by \(p(i^*_n) = \frac{1}{n}\). To compare the models more easily, we plot the mean probabilities normalized by the baseline model, and call this metric Relative Performance in the next-node placement task.

Whole-tree simulation

To test the ability of our Tree Growth Model to grow entire conversation trees, we start from a partial tree with k initial nodes and information about the initial post (content, time of day, etc.), and ask TGM to predict the size and structure of the final state of the conversation tree. First, the model predicts the final size of the tree using our size prediction model. Second, we add one node at a time to the tree using our node placement model, until we reach the predicted size of the tree.

We evaluate the performance of TGM using macroscopic measures of tree structure, namely depth, breadth, and structural virality. The latter, also known as the Wiener index, is defined as the average distance between all pairs of nodes in the tree (Goel et al. 2016). We measure how these quantities depend on k and the final tree size.

Cumulative node placement

To assess the effects of cumulative errors in consecutive node placement decisions, we also compute the probability of placing all of the next t nodes correctly, starting from an initial tree of size k. For the random choice baseline, the probability of placing t consecutive nodes correctly is \(\prod _{n=k}^{k+t-1} \frac{1}{n}\). Similarly for our TGM model, the probability of placing t consecutive nodes correctly is given as \(\prod _{n=k}^{k+t-1} p(i^*_n)\), i.e., the product of the probabilities of getting each node placement correct for nodes \(k+1\) to \(k+t\). For CTPM, this calculation is infeasible because the probabilities \(p(i^*_n)\) would have to be estimated by many partial simulations starting from each stage \(\tau _n\) of every tree.

We explore how these probabilities depend on the parameters k and t. Because the probabilities end up being very small for large values of t, we plot the log-probability of each model, and use log-ratios to compare the models.

Results

Size prediction

For the size prediction task, our regression model achieved substantially better Mean Relative Error than the random baseline model on all three datasets (see Fig. 6). In both Reddit datasets, the error monotonically decreases as k increases; this is expected because as k increases, the regressor has more information about each cascade in the dataset. However, for the Twitter dataset, the error increases with k for both models. We attribute this to training data sparsity. As k increases, the number of potential targets for new nodes increases while the number of examples decreases: there are far fewer Twitter cascades of size 10 than of size 2. Consequently, the regression model has too few examples to learn from, and the random baseline model fails to adapt to the change in initial tree size.

Fig. 6
figure 6

Mean Relative Error of size prediction versus number of nodes k in the given initial tree. Shaded areas in this and the following plots indicate 95% confidence intervals

Node placement

We computed the probability of choosing the correct next node, given the true tree up to each point. TGM achieves substantially improved performance compared to the random choice baseline (see Fig. 7).

Fig. 7
figure 7

Probability of placing the next node correctly, normalized by the random baseline model. Reddit Crypto and Reddit CVE results are binned over k (bin width \(\Delta k = 5\)) to improve readability and reduce noise

Whole-tree simulation

To asses the ability of the Tree Growth Model to predict the size and macroscopic structure of whole conversation trees, we grow entire trees and measure their depth, breadth, and structural virality. For each dataset, we plot the Mean Relative Error in the predictions of these structural properties against the initial tree size k and the final size of the ground truth trees. Here k is a proxy for the amount of initial information given to the model about the trees it is supposed to generate.

We observe in Fig. 8 that in general, TGM outperforms the random baseline model for all tested values of k on Reddit datasets, but struggles with larger k values for the Twitter dataset. This may be due to data sparsity; only about 1% of the trees in the training set have size larger than 5. Similarly, for \(k=64\) (the largest value tested here), the Reddit training sets are also sparse, hence the errors converge to near the random baseline model.

Fig. 8
figure 8

Mean Relative Error of whole-tree metrics (depth, breath, and structural virality) versus initial size k. For the Twitter CVE dataset, TGM predicts all trees to have size one when \(k=1\), so that the structural virality is undefined

The state-of-the-art CTPM baseline typically outperforms TGM for small k values on Reddit datasets due to its clever parameter inference method. We also find that CTPM performance typically degrades with increasing k, while TGM typically improves with k. On the Twitter dataset, there is less of a difference between the performance of TGM and CTPM.

Fig. 9
figure 9

Mean Relative Error of whole-tree metrics (depth, breath, and structural virality) versus final tree size of ground truth

Figure 9 plots errors against final tree size. Our model generally outperforms the random baseline. Again, it performs best for smaller trees, for which we have more training data, and converges to errors near those of the random baseline model for larger trees. The CTPM baseline typically outperforms TGM for small trees on Reddit datasets, whereas TGM outperforms CTPM for medium and large trees. Performance is comparable on the Twitter dataset.

Cumulative node placement

Having found that our model performs well in reproducing the macroscopic structure of conversation trees, let us examine its accuracy in predicting their microscopic evolution. Since it is infeasible to compare TGM with CTPM on this task (cf. “Cumulative node placement” section), we only report on results comparing TGM against the random baseline model.

Fig. 10
figure 10

Log-probability of placing a consecutive sequence of t nodes correctly, starting from an initial partial tree with k nodes, for each model and dataset. We also plot the difference in log-probabilities between TGM and the random baseline model, where blue regions show our model outperforming the baseline and red regions highlight the opposite case. Note that \(t+k\) is equal to the tree size, so that tree size is conserved on the diagonals of these plots

Figure 10 shows that our model outperforms the baseline in consecutive node placement on all datasets and all values of t and k; the difference is very small for very small t (short prediction sequences). However, the small values of the log-probabilities indicate that the model is very rarely capable of correctly reproducing the microscopic evolution of the conversation trees, especially for long node placement sequences (large t). Performance is better for Twitter cascades, which are easier to predict because they tend to be very shallow. We believe that this poor performance is due to the accumulation of errors: even if we have a decent chance of selecting the next node correctly, the odds of doing so many times in a row drop geometrically.

Discussion

Our model outperforms the random baseline and is competitive with the state-of-the-art CTPM baseline in predicting macroscopic features such as depth, breadth, and structural virality of conversation trees. Nevertheless, the present evaluation reveals that our microscopic predictions deviate significantly from the empirical conversations. This negative result highlights the challenges of the microscopic prediction task for cascade trees: we are using a rich feature set and a strong model, and yet still failing.

In particular, our results highlight the effects of error accumulation: the probabilities of predicting long sequences of events are extremely small. The accumulation of errors could play an important role in limiting the microscopic predictability of many online diffusion networks (Goel et al. 2012). This problem should be investigated further in other predictive models of social systems.

How can our model do so poorly at microscopic predictions far into the future, and yet do reasonably well in predicting the structure of the final trees? This apparent contradiction reveals a lack of sensitivity of the traditional structural measurements. When a tree has many nodes, there are many distinct microscopic configurations leading to almost identical macroscopic structural properties such as size, depth, breadth, and structural virality. In other words, predicting these structural features is a lot easier than reproducing the evolving conversations.

Although the data provided by DARPA was gathered using public application programming interfaces, filtering public subreddits and public tweets, we cannot share it with the research community due to a non-disclosure agreement designed to protect the privacy of social media users. The lack of public datasets makes direct comparisons between methods difficult, hinders the reproduction of results, and slows down scientific progress in this area (Pasquetto et al. 2020). Unfortunately, it was not possible to evaluate TGM on publicly available datasets used in the evaluation of virality prediction models (Hogg and Lerman 2012; Leskovec et al. 2009; Hodas and Lerman 2014). We hope that in the future it will be possible to evaluate the proposed method on more standard datasets.

As in most studies of this kind, the present results are somewhat limited by the scope of our data, i.e., cryptocurrency and cyber-security. Yet, given the differences between these two domains, we expect that our model would perform similarly in other domains. Further experiments are necessary to confirm this conjecture. Regarding platforms, we detected imbalances between Reddit and Twitter in our results. Although our model is designed to be platform-agnostic, further analyses on other social media platforms (e.g., Facebook and Instagram) are needed to explore the universal applicability of the proposed model. In general, models like the ones presented in this paper are limited by the availability of training data, especially a sufficient number of large trees.

Finally, our framework involves extensive computational efforts to calculate the features, node-by-node and step-by-step. It does not scale well to simulate large cascades beyond tens of thousands of nodes. Future work should be devoted to optimizing the method to handle larger-scale cascades.

Conclusions

We proposed a generative model to grow individual conversation trees starting from minimal information—as little as just the initial post. Our approach combines two machine learning sub-models: a size predictor and a next-node placement model. We used three datasets from two social media platforms, Reddit and Twitter, spanning the topical domains of cryptocurrency and cyber-security.

A contribution of the proposed machine learning framework is the engineering of features useful for the size prediction and node placement tasks. We introduced the parent delay edge property to capture the characteristics of deep conversation threads versus shallow star-like broadcasts. This property encapsulate structural and (relative) temporal features of cascades, without requiring the model to estimate inter-arrival times. In addition, as an edge-level property, it is more informative about the growth dynamics than traditional tree-level properties such as depth and breadth.

Extensive validation results were presented to measure the accuracy of the two machine learning sub-models as well as the full aggregated generative model. To the best of our knowledge, this work presents the first evaluation of models aiming to reproduce the microscopic, step-by-step evolution of individual conversations trees.

Our results suggest that while it’s possible to predict with some accuracy the global structural features of conversation cascades as well as microscopic dynamics in the short-term, predicting the long-term evolution of online conversations at the level of individual actions remains out of the reach of generative models.