1 Introduction

On social media, people share billions of posts, news items and videos with their friends or followers every day. These sharing behaviours lead to the rapid diffusion of unprecedented amounts of information (Chen et al. 2019) in the form of cascades. The prevalence of information cascades exposes people to information of their interest faster, but it also amplifies the damage of false information such as rumours (Guarino et al. 2021). The COVID-19 pandemic gives us a chance to rediscover the importance of social media not only as networking platforms but also as an information source that can actually interfere with our everyday decisions (Xu et al. 2021). Thus, it is crucial to understand and forecast cascade dynamics to effectively promote useful messages, e.g., for viral marketing (Wang et al. 2015), and proactively control the impact of misinformation (Song et al. 2017). The problem of cascade prediction covers two objectives along this direction: popularity prediction and adopter prediction. We say that a user adopts a message and becomes an active adopter if the user shares the message from at least one of his/her friends. Given an observation of early adopters, the goal of popularity prediction is to predict the number of final adopters, while adopter prediction forecasts who will adopt the message at a future time point. Final adopter prediction is required to ensure the real effectiveness of information diffusion in applications such as marketing and vaccination campaigns during the COVID-19 pandemic. In such cases, we need to ensure information reaches as many targeted users as possible in addition to a large number of recipients.

Cascade prediction has garnered attention from both industry and academia over the past decade (Cheng et al. 2014; Yu et al. 2015), and the solutions have evolved from methods based on diffusion models (Panagopoulos et al. 2020) to those based on cascade representation (Chen et al. 2019). Diffusion model-based methods characterise the interpersonal influences between users and simulate the diffusion process through social relations. These methods are not scalable to large networks due to the repetitive simulations of diffusion models. Moreover, they rely on unrealistic assumptions such as independent cascades and uniform influence probabilities between users (Panagopoulos et al. 2020). Therefore, despite their explainability, this class of methods is sub-optimal for cascade prediction. By contrast, methods based on cascade representation characterise the features of observed early cascades instead of modelling diffusion processes. Machine learning models are then employed for downstream predictions. These methods have become state-of-the-art due to their superior prediction performance, especially with the recent success of deep learning. Compared to earlier methods using hand-crafted predictive features, deep learning allows for automatic extraction of cascade representations which capture the heterogeneous information embedded in cascades (Xu et al. 2021). For instance, the application of recurrent neural networks (RNN) and graph node embedding simultaneously captures the temporal rankings of early adopters and the structural properties of their neighbours in social graphs (Yang et al. 2019). Despite their promising performance, deep learning methods confront a few inherent challenges as repeatedly emphasised in the literature, such as the imbalanced distribution of cascades (Tang et al. 2021) and cascade graph dynamics (Sun et al. 2022). Moreover, except for FOREST (Yang et al. 2019), they are designed either for popularity prediction or for microscopic prediction which infers the next adopter. Without modelling diffusion processes, they are thus sub-optimal for predicting final adopters.

In this paper, we aim to combine the advantages of the two classes of previous studies and apply deep learning to model the diffusion process of information on social media. This approach allows us to avoid the inherent challenges in embedding observed cascades, and to efficiently achieve the two objectives with a single method. The key to diffusion process modelling is to capture the interpersonal influences between a user and his/her friends before the user adopts a message. Cao et al. (2020D) made the first attempt with CoupledGNN, which models the cascading effect using only users’ influences. One shortcoming of this method is that it ignores the double roles simultaneously played by users in information diffusion, namely distributors and receivers, which have been widely accepted in the literature (Panagopoulos et al. 2020; Wang et al. 2015). In this paper, our goal is thus to exploit users’ profiles in these two roles to improve the performance of cascade prediction. Specifically, we will address the following perspectives on modelling the diffusion process which have not been well studied so far:

  • A user’s decision to forward a message should result from three factors: message content, influences of active friends and susceptibility of the user. Intuitively, a user’s influence measures his/her ability to convince another user to share his/her message while susceptibility measures how likely the user gets influenced by other users (Panagopoulos et al. 2020; Wang et al. 2015).

  • Users’ influences and susceptibilities are not only user-specific but also topic-specific. This phenomenon has not been discussed before. Social media users, especially on microblogging platforms such as Twitter and Sina, usually have multiple topics of interest and different sharing patterns. Consider a sports news reporter with a hobby of pop music. He will be more influential for sports-related tweets than for those about music. As an information receiver, the reporter will be more cautious about spreading sports news compared to music-related tweets.

  • Influences and susceptibilities are context-dependent (Wang et al. 2015). In other words, they spread through social relations during diffusion processes. A user becomes more susceptible to a message when he/she sees that message shared by a larger number of users. Similarly, when more users have adopted the message a user shared, the user becomes more influential to his/her friends due to the accumulated trust in the message.

To the best of our knowledge, we are the first to integrate users’ topic-specific and context-dependent susceptibilities and influences into cascade prediction. We start by validating our hypothesis that users’ influences and susceptibilities are topic-specific with our collected Twitter data. Then we propose a new deep learning cascade prediction model, which leverages the social network structure and simulates the propagation of messages from early adopters through social relations. The model can be effectively trained to achieve the two cascade prediction objectives at the same time. In this model, we explicitly embed users’ susceptibility and influence profiles as two representation vectors. With graph neural networks (GNN) (Kipf and Welling 2017), we model the activation of users according to topic-specific susceptibilities and influences as well as the dynamics of susceptibilities and influences. Through comprehensive experiments with four real-life datasets, we show our model outperforms state-of-the-art baselines in both popularity and adopter prediction on almost all measurements.

2 Related works

2.1 Diffusion model-based methods

This line of methods iteratively runs diffusion models to simulate the information propagation process as viral contamination (Panagopoulos et al. 2020). Two typical diffusion models are used: Independent Cascade (IC) (Song et al. 2017; Wang et al. 2015) and Linear Threshold (LT) (Kempe et al. 2003). Earlier stochastic methods require manual assignment of influence probabilities for each user pair, which is not tractable in practice. To address this deficiency, embedding learning-based methods have been proposed, such as TIS (Wang et al. 2015), EMBED-IC (Bourigault et al. 2016) and CELFIE (Panagopoulos et al. 2020). User-specific susceptibility and influence are represented as latent parameters which are estimated from observed cascades. The activation of a user can thus be determined by his/her susceptibility and influence vectors. One advantage of such methods is that they characterise the diffusion process well and output the activation state of each user. However, they suffer from high computation overhead, and the strong assumptions of their diffusion models make them sub-optimal for cascade prediction (Yang et al. 2019; Sun et al. 2022; Chen et al. 2019).

2.2 Generative methods

With the time stamps of users’ sharing behaviours, a cascade of early adopters is abstracted as an event sequence, and thus temporal point processes can be applied to simulate the arrivals of events. According to the employed point processes, there are two types of generative methods: those based on the Reinforced Poisson process (Shen et al. 2014) and those based on the self-exciting Hawkes process (Cao et al. 2017). Due to the assumptions of temporal point processes, this line of methods over-simplifies information diffusion and is thus limited in prediction performance.

2.3 Cascade representation-based models

This class of methods extracts features of observed cascades as representation vectors and employs machine learning models to infer cascade dynamics. Earlier works rely on manually crafted features from user profiles (Cui et al. 2013) and message contents (Hong et al. 2011). Deep learning has recently overtaken feature-engineering methods due to its superior performance. DeepCas (Li et al. 2017) is the first end-to-end deep learning method for popularity prediction. It samples diffusion paths from cascade graphs and makes use of recurrent neural networks (RNN) to embed these sequential paths. Following DeepCas, a number of methods have been proposed by extending RNNs to calculate cascade representations (Wang et al. 2017; Yang et al. 2019; Wang et al. 2018). With social relations between adopters, some studies model cascades as cascade graphs and calculate their embedding vectors with more effective sampling methods (Tang et al. 2021) or graph embedding methods (Chen et al. 2019; Sun et al. 2022; Xu et al. 2021). In spite of their promising performance, deep learning cascade prediction faces some inherent challenges as stated in Chen et al. (2019); Tang et al. (2021); Zhou et al. (2021). New methods are continuously developed to address them. For instance, Tang et al. (2021) addressed the impacts of hub structures and deep cascade paths in cascade graphs, while Chen et al. (2019) identified the challenges of combining cascade structures with temporal information. Zhou et al. (2021) studied the impact of the long-tailed distribution of cascade sizes on cascade prediction. In addition, except for FOREST (Yang et al. 2019), deep learning based methods focus on either popularity prediction or microscopic prediction, i.e., forecasting the next single adopter, and thus cannot predict popularity and final adopters simultaneously. Cao et al. (2020D) proposed a different approach, CoupledGNN, which models the cascading effects with graph neural networks (GNN) (Kipf and Welling 2017), i.e., users’ sharing behaviours are influenced by their neighbours in social networks. However, this method oversimplifies the diffusion process by ignoring users’ double roles in information diffusion and thus produces sub-optimal prediction performance. Inspired by Cao et al. (2020D), we explore users’ dual roles as information receivers and distributors and propose a new deep learning model that can predict not only the size of cascades but also the final adopters.

3 Problem definition

Let \(\mathcal {M}\) be a set of messages. We use the term “message” to refer to a piece of information that can be disseminated over social media. It can be a tweet on Twitter or an image on Instagram. In this paper, we focus on textual messages; our approach can be straightforwardly extended to other message types if their representations can be effectively calculated. For any message \(m\in \mathcal {M}\), we have the set of early adopters that had shared this message up to time \(t_0\) after the message was first posted, denoted by \(C_m^{t_0}\). The observation time \(t_0\) depends on the requirements of downstream applications as well as the popularity of social media platforms. It can be hours on Twitter and Sina, and years for citation networks. We use \(G=(\mathcal {V},\mathcal {E})\) to denote the social graph recording the social relations between users. Specifically, \(\mathcal {V}\) is the set of nodes representing the set of users and \(\mathcal {E}\subset \mathcal {V}\times \mathcal {V}\) is the set of edges indicating the social relations. The network can be directed or undirected depending on the social media platform. For instance, the following relationships on Twitter are directed while the friendships on Facebook are undirected.

3.1 Popularity prediction

The problem of popularity prediction is to predict the final number of adopters, i.e., \(n_m^\infty = \vert C_m^\infty \vert\), according to the early adopters in \(C_m^{t_0}\) and the social graph. In practical applications, the final time can be instantiated as a given fixed time t. Formally, given a set of messages \(\mathcal {M}\) and their observed cascades \(\{C_m^{t_0}\vert m\in \mathcal {M}\}\), the problem of popularity prediction can be solved by minimising the mean relative square error (MRSE) loss:

$$\begin{aligned} \begin{array}{l} \mathbf {\mathcal {L}_{ pop}} = \frac{1}{\vert \mathcal {M}\vert } \sum _{m\in \mathcal {M}}\Bigl (\frac{\tilde{n}^\infty _m-n_m^\infty }{n^\infty _m}\Bigr )^2 \end{array} \end{aligned}$$
(1)

where \(\tilde{n}^\infty _m = f_{\Theta ,G}(C_m^{t_0})\). Note that \(f_{\Theta ,G}:\mathcal {V}^{\mathcal {P}}\rightarrow \mathbb {Z}\) is the regression function customised to graph G and parameterised by the set of trainable parameters \(\Theta\), where \(\mathcal {V}^{\mathcal {P}}\) denotes the power set of \(\mathcal {V}\). It takes the set of early adopters as input and outputs the predicted final size of the cascade. We select relative error over absolute error to avoid the potential negative impact of widely varying cascade sizes, e.g., assigning unnecessary weight to more popular messages.
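As a concrete reference, a minimal sketch of this loss in PyTorch; the tensor layout is our own assumption, and the regression function \(f_{\Theta ,G}\) itself is the model described in Sect. 5:

```python
import torch

def mrse_loss(pred_sizes: torch.Tensor, true_sizes: torch.Tensor) -> torch.Tensor:
    """Mean relative square error of Eq. (1) over a batch of cascades.

    pred_sizes: predicted final cascade sizes, shape (batch,)
    true_sizes: true final cascade sizes, shape (batch,)
    """
    relative_error = (pred_sizes - true_sizes) / true_sizes
    return (relative_error ** 2).mean()
```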

3.2 Final adopter prediction

The goal is to predict the set of users who will forward the target message. This is different from the microscopic cascade prediction in the literature (Yu et al. 2015; Yang et al. 2019) which aims to predict the next adopter according to the observed ones. The problem of final adopter prediction can be solved by minimising the following loss function:

$$\begin{aligned} { \mathbf {\mathcal {L}_{adp}} = - \frac{1}{\vert \mathcal {M}\vert }\sum _{m\in \mathcal {M}}\Bigl (\sum _{v\in C_m^\infty }\log q_{\Theta ,G}(C_m^{t_0}, v) +\sum _{v\not \in C_m^\infty } \log \left( 1-q_{\Theta ,G}(C_m^{t_0}, v)\right) \Bigr ) } \end{aligned}$$

where \(q_{\Theta ,G}:\mathcal {V}^\mathcal {P}\times \mathcal {V}\rightarrow [0,1]\) is the trainable function customised to social graph G and parameterised by \(\Theta\) that predicts the probability of a specific user adopting the message. In the end, we can select the users with probabilities larger than a predefined threshold as the output set of final adopters. An alternative is to output the top \(\tilde{n}_m^\infty\) users with the largest activation probabilities.
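A corresponding sketch of this negative log-likelihood for a single message, again in PyTorch, under the assumption that \(q_{\Theta ,G}\) has already produced one adoption probability per user; the average over messages is taken by the caller:

```python
import torch

def adopter_loss(adopt_prob: torch.Tensor, is_adopter: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    """Negative log-likelihood of the final adopter set for one message.

    adopt_prob: predicted adoption probability per user, shape (|V|,)
    is_adopter: 1 if the user is in C_m^infty, else 0, shape (|V|,)
    """
    log_lik = (is_adopter * torch.log(adopt_prob + eps)
               + (1 - is_adopter) * torch.log(1 - adopt_prob + eps))
    return -log_lik.sum()   # averaged over the messages in M by the caller
```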

4 Topic-specific susceptibility and influence

In this section, we validate our hypothesis that a user’s susceptibility and influence vary according to the topics of messages. This hypothesis contains an implicit claim that users adopt messages on multiple topics on social media. In other words, users have their own topic preferences. We start with validating this claim and then examine the dependence of susceptibility and influence on topics. Before our validation, we present our collection of the social media data needed to support our analysis. We conduct our own data collection instead of using existing publicly available datasets because they do not provide all the inputs, over a sufficiently long period, required for our analysis.

4.1 Twitter data collection

We use Twitter as the data source of our analysis because of its popularity and its friendly data-sharing interfaces for data analysts. The metadata downloaded with retweeted messages includes the original tweets’ IDs, which allow us to retrieve the corresponding cascades. To efficiently obtain a sufficiently large set of users, we refer to the Twitter dataset published in Chen et al. (2022). The dataset contains tweets related to COVID-19 vaccination from four Western European countries: Germany, France, Belgium and Luxembourg. In order to obtain the Twitter users, we first crawled the tweets according to the published tweet IDs and extracted the account IDs of the originating users. Then for each originator, we queried and downloaded his/her followers and followees, from which we constructed the social network. Specifically, if user v follows user \(v'\), then an edge is created from v to \(v'\). We calculated the largest weakly connected subgraph of the social network as the final set of users to eliminate isolated users and ensure interconnectivity between users. In the end, we downloaded the tweets together with their metadata from the remaining users in two time periods, each of which spans three months. One period starts from March 1st, 2020 while the other starts from March 1st, 2021. With these two periods separated by one year, we can examine the consistency of our empirical analysis over time. We summarise the statistics of the final social networks and tweets in Table 6 in “Appendix A”. Note that the tweet counts include both tweets posted and tweets shared by the users. On Twitter, sharing an existing tweet generates a new tweet with a unique ID and metadata containing the ID of the original one.

In this paper, we focus on the texts of retweeted messages and thus remove all other content such as ‘@’ mentions, hyperlinks and ‘RT’ which stands for ‘retweet’. For quoted tweets, we only consider the quoted tweets and ignore the comments added by users. In our analysis, we only consider the users who have shared more than 5 different messages in our dataset to ensure the reliability of our analysis. In practice, users’ social connections evolve over time by adding new connections or removing existing ones. In the following analysis, we do not model this dynamic nature of social graphs, assuming that Twitter users do not frequently cancel their following relationships and that their topics of interest stay relatively stable. In our collection, the social graph is built according to the data collected in early 2020.

Fig. 1: Clustering retweets into topics

4.2 Users’ topic preferences

We validate our observation that users simultaneously participate in discussions on multiple topics on social media.

Fig. 2: User topic preferences and distribution

4.2.1 Topic modelling

Topic modelling has evolved from the traditional LDA method to machine learning methods (Greene and Cunningham 2006). In Zhang et al. (2022), it has been shown that the combination of high-quality text embeddings and clustering methods is more efficient in learning topics of the same quality as complex neural network models. In this paper, we adopt the most effective combination in Zhang et al. (2022), i.e., RoBERTa+UMAP+K-Means, to cluster tweets with similar topics. RoBERTa (Liu et al. 2019) is a pre-trained transformer-based text embedding method, UMAP (McInnes and Healy 2018) is used to conduct dimension reduction of text embeddings, and K-Means is one of the most widely used classical clustering methods. Besides the textual tweets, the number of topics is required as an input parameter. In our dataset, we only consider the textual content of messages. As a result, original tweets and retweets have the same embeddings.
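A rough sketch of this clustering pipeline is given below, assuming the Hugging Face transformers, umap-learn and scikit-learn packages; mean-pooling the token embeddings and reducing to five UMAP dimensions are our own simplifications, not choices stated here:

```python
import torch
import umap
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

def embed_tweets(texts, model_name="xlm-roberta-base", max_len=128):
    """Embed each tweet with a pre-trained RoBERTa encoder (mean-pooled tokens)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    embeddings = []
    with torch.no_grad():
        for text in texts:
            batch = tokenizer(text, truncation=True, max_length=max_len,
                              return_tensors="pt")
            hidden = model(**batch).last_hidden_state        # (1, seq_len, 768)
            embeddings.append(hidden.mean(dim=1).squeeze(0))  # mean-pool tokens
    return torch.stack(embeddings).numpy()

def cluster_topics(texts, n_topics=25):
    """RoBERTa embeddings -> UMAP dimension reduction -> K-Means topic labels."""
    embeddings = embed_tweets(texts)
    reduced = umap.UMAP(n_components=5).fit_transform(embeddings)
    return KMeans(n_clusters=n_topics, n_init=10).fit_predict(reduced)
```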

We cluster the collected tweets in each period with the selected topic modelling method. After several trials, we select 25 topics due to the relatively higher quality of the output clusters. In the end, we have 25 clusters of retweets, the i-th of which is denoted by \(\mathcal {S}_i\). In Fig. 1, we depict retweets as data points and lay them out according to their text embedding vectors mapped to a 2-D space with UMAP (McInnes and Healy 2018) in the two selected periods. The colours indicate their clusters. With the widely accepted measurements C_V and Normalised Pointwise Mutual Information (NPMI), we measure the coherence of the extracted topics, obtaining 0.649 and 0.138 for the first period, and 0.704 and 0.142 for the second period. According to the criteria adopted in topic modelling works such as Zhang et al. (2022), these numbers indicate good topic coherence. We extract the representative keywords with their TF-IDF rankings, and manually examine the topics of the clusters. We find that in general, the tweets in these clusters are about specific topics such as the death and infection numbers of COVID-19, the Black Lives Matter movement and COVID-19 policies (see “Appendix E” for the top 10 keywords and our coarse annotation).

4.2.2 User topic preference

We represent the topic preferences of user v by a vector \(\textbf{v}\) counting the proportions of his/her retweets in each topic. Formally, let \(\mathcal {R}_{v}\) be the set of retweets of user v and recall that \(\mathcal {S}_j\) is the set of retweets in the j-th topic; then the j-th element of \(\textbf{v}\) is calculated as \(\frac{\vert \mathcal {R}_{v}\cap \mathcal {S}_j\vert }{\vert \mathcal {R}_v\vert }\). In Fig. 2a, b we lay out users as data points according to their preference vectors mapped to 2 dimensions in the two periods with UMAP (McInnes and Healy 2018). We can see that users’ vectors scatter all over the space. This confirms the diversity of Twitter users’ topics of interest. Another observation is that users cluster naturally, which indicates that users group together with others who have similar interests. Both observations help demonstrate the validity of the users’ topic vectors we calculate. We consider a user to be interested in the j-th topic if the j-th element of his/her representation vector is over 0.08, which is double the value of the null model where users have equal preferences over the 25 topics. Figure 2c shows the distributions of the number of topics users prefer in the two periods. We observe that about \(86\%\) of users actively participate in at least 2 topics. On average, each user is interested in 3 topics. According to the above discussion, we can conclude that users are interested in multiple topics.
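A small sketch of this computation in plain NumPy; the data layout is our own assumption:

```python
import numpy as np

def topic_preferences(user_retweet_topics, n_topics=25):
    """Map each user to the proportion of his/her retweets that fall in each topic.

    user_retweet_topics: dict mapping a user id to the list of topic labels
    assigned to his/her retweets by the clustering in Sect. 4.2.1.
    """
    preferences = {}
    for user, topics in user_retweet_topics.items():
        counts = np.bincount(topics, minlength=n_topics).astype(float)
        preferences[user] = counts / counts.sum()
    return preferences

# A user is considered interested in topic j if preferences[user][j] > 0.08,
# i.e. twice the uniform null-model value of 1/25.
```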

4.3 Topic-specific susceptibility and influence

Whether a user retweets a message is determined by his/her susceptibility and the influences received from his/her followees who have shared the message. We hypothesise that the interplay between susceptibility and influences is not only user-specific but also topic-specific. Many methods have been proposed to learn latent representations for users’ susceptibilities and influences from past observed cascades (Panagopoulos et al. 2020; Wang et al. 2015). However, we cannot validate our hypothesis by directly comparing the representations extracted from past cascades of different topics. This is because the learning processes on different topics are independent, so the learned representations do not belong to the same space and are not comparable. Instead, we select an intuitive approach based on a heuristic utilised in the literature (Bourigault et al. 2016): if users’ susceptibility and influence are topic-specific, we will have two observations:

  1. As an information receiver, a user will have different patterns regarding sharing messages from his/her followees between topics;

  2. As an information distributor, a user’s followers will have different patterns regarding sharing messages retweeted or posted by the user.

If these two differences are present in our dataset, we can infer that the interaction of a user’s susceptibility and influence varies between topics. After a user shares a message, the user will have an influence on each follower’s decision whether to share the message. According to this intuition, we use the frequency with which a follower forwards messages after the user’s sharing to measure the strength of the interplay between the user’s influence and the follower’s susceptibility. In the following, we first present our measurements for a user’s susceptibility pattern as a receiver and his/her influence pattern as a distributor, and then discuss our analysis of our dataset.

Fig. 3: Distributions of susceptibility pattern \({ SP@K}\)

4.3.1 Measuring topic-specific susceptibility

Intuitively, given a topic, we use the relative frequencies with which a user forwards messages from his/her followees to quantitatively capture the user’s sharing pattern as an information receiver. Consider a user v with the set of his/her followees \(\mathcal {U}^+_{v}=\{v'\in \mathcal {V}\vert (v, v')\in \mathcal {E}\}\). We assume a pre-defined order between the followees of user v and use \(v_i\) to denote the i-th followee. Let \(\mathcal {R}_{v,j}\) be the set of tweets that v retweeted about the j-th topic, and let \({ t}(m, v)\) denote the time when m is posted or retweeted by user v. The susceptibility vector of v for topic j is denoted by \(\textbf{s}_{v,j}\in \mathbb {Z}^{\vert \mathcal {U}^+_{v}\vert }\) whose i-th element is the number of messages retweeted by the i-th followee in \(\mathcal {U}^+_{v}\) before v retweets the same message, i.e., \(\vert \{m\in \mathcal {R}_{v,j}\cap \mathcal {R}_{v_i}\vert t(m,v)>t(m,v_i)\}\vert\). Analogously to the influence similarity defined in Eq. (3) below, the topic-wise susceptibility similarity of user v over his/her top K favourite topics \(\mathcal {T}_K^{v}\) is defined as \({ SP@K}(v) = \frac{2}{K\cdot (K-1)}\sum _{j,k\in \mathcal {T}_K^v\wedge j<k} \frac{\textbf{s}_{v,j}\cdot \textbf{s}_{v,k}}{\Vert \textbf{s}_{v,j}\Vert \cdot \Vert \textbf{s}_{v,k}\Vert }\). A lower value indicates that a user’s susceptibility varies more between topics.

4.3.2 Measuring topic-specific influence

We use the frequencies with which a user’s followers share his/her retweeted messages to quantify the influence patterns of the user as an information distributor. Suppose a user v with the set of followers \(\mathcal {U}^-_v = \{v'\in \mathcal {V}\vert (v',v)\in \mathcal {E}\}\) ranked according to a pre-defined order. Let \(\textbf{h}_{v,j}\) denote the influence vector of user v of the j-th topic. Then the i-th element is the number of retweets conducted by the follower \(v_i\) in \(\mathcal {U}^-_v\) after seeing the same message posted or retweeted by user v, i.e., \(\vert \{m\in \mathcal {R}_{v,j}\cap \mathcal {R}_{v_i}\vert t(m,v)<t(m,v_i)\}\vert\). Similar to the definition of topic-wise susceptibility similarity, we also consider the top K favourite topics of user v, i.e., \(\mathcal {T}_K^{v}\). Then the topic-wise influence similarity of user v is defined as follows:

$$\begin{aligned} \begin{array}{l} { IP@K}(v) = \frac{2}{K\cdot (K-1)}\sum _{ j,k\in \mathcal {T}_K^v\wedge j<k} \frac{\textbf{h}_{v,j}\cdot \textbf{h}_{v,k}}{\Vert \textbf{h}_{v,j}\Vert \cdot \Vert \textbf{h}_{v,k}\Vert }. \end{array} \end{aligned}$$
(3)

The domain of \({ IP@K}\) is between 0 and 1. A lower value indicates that a user’s influence varies more between topics.
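Both SP@K and IP@K reduce to the average pairwise cosine similarity of a user’s per-topic vectors over his/her top-K topics; a minimal sketch, assuming the per-topic vectors have already been built as above and are non-zero:

```python
import numpy as np
from itertools import combinations

def pattern_similarity(topic_vectors, top_k_topics):
    """Average pairwise cosine similarity over the top-K topics (Eq. (3)).

    topic_vectors: dict mapping a topic id to the user's susceptibility vector
                   s_{v,j} (for SP@K) or influence vector h_{v,j} (for IP@K).
    top_k_topics:  the user's K favourite topics T_K^v.
    """
    similarities = []
    for j, k in combinations(top_k_topics, 2):
        a, b = topic_vectors[j], topic_vectors[k]
        similarities.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # 2 / (K * (K - 1)) * sum over pairs == mean over the K*(K-1)/2 pairs
    return float(np.mean(similarities))
```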

Fig. 4: Distributions of influence pattern \({ IP@K}\)

4.3.3 Experimental analysis

We re-use the topics extracted in Sect. 4.2 to analyse the topic dependence of susceptibility and influence. In Figs. 3 and 4, we show the distributions of \({ SP@K}\) and \({ IP@K}\) values over the users when K is set to 2 and 3, respectively. For either measurement, we construct a null model as a reference to capture the distributions when the topic-specific phenomenon is absent. Take susceptibility patterns as an example. For any user v and each topic (e.g., the j-th topic), we construct a null vector \(\textbf{s}'_{v,j}\). Its k-th element is a uniformly sampled random number between 0 and \(\vert \mathcal {R}_{v_k,j}\vert\), representing the number of messages of the j-th topic of user \(v_k\in \mathcal {U}^+_v\) that have ever been shared by user v. A general observation is that users’ susceptibility and influence patterns roughly follow normal distributions. The curves of the distributions become narrower and shift right when larger K values are set. This is natural, since considering more topics leads to smaller average mutual similarity. We can see that, for all selected K values, users have smaller values for both the susceptibility pattern and the influence pattern than the null models. On average, users’ \({ IP@K}\) and \({ SP@K}\) fall into the range between 0.3 and 0.4, which is only about half of the values obtained when susceptibilities and influences are not topic-specific. The difference indicates that users’ sharing behaviours and their influences on friends differ between the topics of their interest.

5 Our CasSIM model

The propagation of a message can be interpreted as a process of multiple sequential generations. In each generation, each user first updates his/her influence and susceptibility according to the current activation states of users. Then the user decides whether to forward the target message according to his/her updated susceptibility and the influences of his/her friends who have forwarded the message. Inspired by CoupledGNN (Cao et al. 2020D), we use multi-layered graph neural networks to model this iterative process. We depict our framework in Fig. 5. At each layer, three sequential tasks are accomplished. The first task is to update susceptibility and influence by aggregating the profiles of social network neighbours. This task simulates the spread of influence and susceptibility and thus captures their context-dependence property. The second task is to calculate the topic-specific influence and susceptibility according to the user’s topic preferences and the target message’s content. The last task is to update each user’s activation state by aggregating the interplay between his/her susceptibility and the influences of all the active friends. Note that the number of layers can to some extent indicate the depth of propagation simulated in our model. According to the small-world property of social networks, a relatively small number of layers is sufficient to cover the major component of the network and maintain accuracy. We also add a self-activation mechanism to allow users to adopt the message without neighbour influence, which further mitigates the impact of this hyperparameter.

For each user \(v\in \mathcal {V}\), we use \({ State}_v\in [0,1]\) to store his/her activation state indicating the probability that user v is activated. Furthermore, for each generation \(\ell\), user v is associated with three embedding vectors \(\textbf{r}^{(\ell )}_v\), \(\textbf{h}^{(\ell )}_v\) and \(\textbf{p}_v\) indicating his/her susceptibility, influence and topic preferences, respectively.

Fig. 5: Framework of the CasSIM model

5.1 Influence and susceptibility update

As users’ influences and susceptibilities propagate through social relations, we make use of a graph neural network to first aggregate the profiles from their friends and then combine the aggregation with their own profiles. We start by describing the update of susceptibility vectors. We use the idea of graph attention networks (Velickovic et al. 2018) to take into account the various contributions of friends to the update. Let \(\mathcal {N}(v)\) be the neighbours of user v, i.e., \(\{v'\in \mathcal {V}\vert (v,v')\in \mathcal {E}\}\). Formally, the aggregated susceptibility of user v can be calculated as follows:

$$\begin{aligned} \begin{array}{l} \textbf{a}_{v,s}^{(\ell )}= \textbf{W}^{(\ell )}\sum _{v'\in \mathcal {N}(v)} \textbf{h}^{(\ell )}_{v'}\cdot \text {StateGate}({ State}_{v'}^{(\ell )})\cdot \phi _{v,v'}^{(\ell )}\end{array} \end{aligned}$$
(4)

where \(\textbf{W}^{(\ell )}\in \mathbb {R}^{d_r^{(\ell +1)}\times d_h^{(\ell )}}\) is the weight matrix and \(d_r^{(\ell )}\) defines the dimension of a user’s susceptibility vector at the \(\ell\)-th layer, i.e., \(\textbf{r}^{(\ell )}_v\). The function \(\text {StateGate}()\) is the state-gating mechanism (Cao et al. 2020D) to reflect the non-linearity of activation states. In our implementation, we use a 2-layered MLP. The attention \(\phi _{v,v'}^{(\ell )}\) calculates the contribution of the influence of user v’s neighbour \(v'\). This attention is determined not only by \(v'\)’s susceptibility but also by v’s influence vector. Formally, the attention is calculated as follows:

$$\begin{aligned} \begin{array}{l} \phi _{v,v'}^{(\ell )}= \frac{\exp (\textbf{e}_{v,v'}^{(\ell )})}{\sum _{v''\in \mathcal {N}(v)}\exp (\textbf{e}_{v,v''}^{(\ell )})} \end{array} \end{aligned}$$
(5)

where \(\textbf{e}_{v,v'}^{(\ell )}=\varvec{\psi }_{s}^{(\ell )}\bigl ( \textbf{W}^{(\ell )}_r\textbf{r}^{(\ell )}_{v'}\parallel \textbf{W}^{(\ell )}_h\textbf{h}^{(\ell )}_v\bigr )\). Note that \(\parallel\) is the concatenation function of two vectors, and \(\varvec{\psi }_s\in \mathbb {R}^{d_h^{(\ell )}+d_r^{(\ell )}}\) where \(d_h^{(\ell )}\) is the dimension of influence vectors at layer \(\ell\). Moreover, \(\textbf{W}^{(\ell )}_h\in \mathbb {R}^{d_h^{(\ell )}\times d_h^{(\ell )}}\) and \(\textbf{W}^{(\ell )}_r\in \mathbb {R}^{d_r^{(\ell )}\times d_r^{(\ell )}}\) are two weight matrices. In the end, we combine the aggregated susceptibility of neighbours with the user’s own susceptibility:

$$\begin{aligned} \begin{array}{l} \textbf{r}_{v}^{(\ell +1)} = \text {relu}\left( \textbf{W}^{(\ell )}\textbf{r}^{(\ell )}_v+ \varvec{a}_{v,s}^{(\ell )}\right) \end{array} \end{aligned}$$
(6)

where \(\textbf{W}^{(\ell )}\in \mathbb {R}^{d_r^{(\ell +1)}\times d_r^{(\ell )}}\) and \(\text {relu}\) is the non-linear activation function. The update of user v’s influence is similar to that of his/her susceptibility. We first aggregate the influences of his/her friends according to their activation states with attention networks. The aggregated influence \(\textbf{a}_{v,h}^{(\ell )}\) is calculated as follows:

$$\begin{aligned} \begin{array}{l} \textbf{a}_{v,h}^{(\ell )}= \textbf{W}^{(\ell )}\sum _{v'\in \mathcal {N}(v)}\textbf{h}^{(\ell )}_{v'}\cdot \text {StateGate}({ State}_{v'}^{(\ell )}) \cdot \lambda ^{(\ell )}_{v,v'} \end{array} \end{aligned}$$
(7)

where \(\textbf{W}^{(\ell )}\in \mathbb {R}^{d_h^{(\ell +1)}\times d_h^{(\ell )}}\) is the weight matrix. We calculate the attention \(\lambda ^{(\ell )}_{v,v'}\) as follows:

$$\begin{aligned} \begin{array}{l} \lambda _{v,v'}^{(\ell )}= \frac{\exp (\textbf{o}_{v,v'}^{(\ell )})}{\sum _{v''\in \mathcal {N}(v)}\exp (\textbf{o}_{v,v''}^{(\ell )})}, \end{array} \end{aligned}$$
(8)

where \(\textbf{o}_{v,v'}^{(\ell )}=\mathbf {\psi }_h^{(\ell )}(\textbf{W}^{(\ell )}_h \textbf{h}^{(\ell )}_{v'} \parallel \textbf{W}^{(\ell )}_r \textbf{r}^{(\ell )}_v ).\) Note that \(\varvec{\psi }_h\in \mathbb {R}^{d_h^{(\ell )}+d_r^{(\ell )}}\) and \(\textbf{W}^{(\ell )}_h\in \mathbb {R}^{d_h^{(\ell )}\times d_h^{(\ell )}}\) and \(\textbf{W}^{(\ell )}_r\in \mathbb {R}^{d_r^{(\ell )}\times d_r^{(\ell )}}\) are two weight matrices which are different from those used in updating susceptibility. The influence vector of user v at the layer \(\ell +1\) is calculated as follows:

$$\begin{aligned} \begin{array}{l} \textbf{h}_v^{(\ell +1)} = \text {relu}(\textbf{W}^{(\ell )}\textbf{h}^{(\ell )}_v + \varvec{a}_{v,h}^{(\ell )}) \end{array} \end{aligned}$$
(9)
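A compact PyTorch sketch of one susceptibility-update layer (Eqs. (4)-(6)) is given below; the influence update (Eqs. (7)-(9)) mirrors it with its own parameters. The dense adjacency matrix, the small two-layer StateGate MLP and the hidden sizes are our own simplifications for readability, not the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SusceptibilityLayer(nn.Module):
    """One susceptibility-update layer (Eqs. (4)-(6)); assumes every user has
    at least one neighbour so the masked softmax is well defined."""

    def __init__(self, d_r: int, d_h: int, d_out: int):
        super().__init__()
        self.w_agg = nn.Linear(d_h, d_out, bias=False)    # W in Eq. (4)
        self.w_self = nn.Linear(d_r, d_out, bias=False)   # W in Eq. (6)
        self.w_r = nn.Linear(d_r, d_r, bias=False)        # W_r in Eq. (5)
        self.w_h = nn.Linear(d_h, d_h, bias=False)        # W_h in Eq. (5)
        self.psi = nn.Linear(d_r + d_h, 1, bias=False)    # psi_s in Eq. (5)
        self.state_gate = nn.Sequential(                  # 2-layer StateGate MLP
            nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, r, h, state, adj):
        # r: (N, d_r) susceptibilities, h: (N, d_h) influences,
        # state: (N,) activation probabilities, adj: (N, N) binary adjacency.
        n = r.size(0)
        # Attention logits e_{v,v'} built from the neighbour's susceptibility
        # r_{v'} and the user's own influence h_v (Eq. (5)).
        pair = torch.cat([self.w_r(r).unsqueeze(0).expand(n, -1, -1),
                          self.w_h(h).unsqueeze(1).expand(-1, n, -1)], dim=-1)
        logits = self.psi(pair).squeeze(-1)               # entry (v, v')
        logits = logits.masked_fill(adj == 0, float("-inf"))
        phi = torch.softmax(logits, dim=1)                # attention over neighbours
        gate = self.state_gate(state.unsqueeze(-1)).squeeze(-1)  # StateGate(State_{v'})
        weights = phi * gate.unsqueeze(0)                 # per-neighbour weight
        aggregated = self.w_agg(weights @ h)              # a_{v,s} in Eq. (4)
        return F.relu(self.w_self(r) + aggregated)        # Eq. (6)
```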

5.2 Calculating topic-specific influence and susceptibility

We show how to customise a user’s susceptibility and influence according to the topic of the message under diffusion in order to capture their topic-specific property. Suppose \(m\in \mathcal {M}\) is the message being propagated. We take user \(v\not \in C_m^{t_0}\) at the \(\ell\)-th generation as an example and illustrate how to convert the vectors \(\textbf{r}^{(\ell )}_v\) and \(\textbf{h}^{(\ell )}_v\) into \(\textbf{r}^{(\ell )}_{v,m}\) and \(\textbf{h}^{(\ell )}_{v,m}\). We use \(\textbf{x}_m\in \mathbb {R}^{d_x}\) to denote the embedding vector of message m. As emphasised previously, we concentrate on messages in the form of texts in this paper; the model can be extended to integrate other formats such as images if their representations can be effectively calculated. In our model, we use the pre-trained RoBERTa model (Liu et al. 2019) to calculate the embedding vectors of textual messages.

As empirically validated in the previous section, most users have multiple topics of interest on social media and their preferences vary between topics. Although the focus of the topics may shift over time as pointed out in Yuan et al. (2020), users’ interests remain relatively stable. For instance, a sports news reporter may switch from reporting on a local football team to national teams due to the opening of the FIFA World Cup, but the topic still remains around football. Users’ topic preferences are extracted from their past sharing behaviours. We use \(\textbf{p}_v\in \mathbb {R}^{d_p}\) to denote the embedding vector of his/her topic preferences. Intuitively, given a target message m, we capture its related topic by referring to users’ past topic preferences and utilise an MLP module to calculate the adjustments that should be applied to the user’s susceptibility and influence vectors. Starting with susceptibility, we calculate the corresponding topic-specific susceptibility vector as follows:

$$\begin{aligned} \begin{array}{l} \textbf{r}^{(\ell )}_{v,m} = \textbf{W}^{(\ell )}\bigl (\textrm{MLP}(\textbf{p}_v \parallel \textbf{x}_m) \circ \textbf{r}^{(\ell )}_{v}\bigr ) \end{array} \end{aligned}$$
(10)

where \(\circ\) represents the element-wise (Hadamard) product of two vectors. In addition, \(\textrm{MLP}\) is a multi-layer perceptron that takes an input vector of dimension \(d_p+d_x\) and outputs a vector of dimension \(d_r^{(\ell )}\). The weight matrix \(\textbf{W}^{(\ell )}\in \mathbb {R}^{d^{(\ell )}_{r, m}\times d_r^{(\ell )}}\) conducts the linear transformation of the susceptibility vector to dimension \(d_{r,m}^{(\ell )}\). Similarly, we have the user’s topic-specific influence calculated as follows:

$$\begin{aligned} \begin{array}{l} \textbf{h}^{(\ell )}_{v,m} = \textbf{W}^{(\ell )}\bigl (\textrm{MLP}(\textbf{p}_v\parallel \textbf{x}_m) \circ \textbf{h}^{(\ell )}_{v}\bigr ). \end{array} \end{aligned}$$
(11)

Note \(\textbf{W}^{(\ell )}\in \mathbb {R}^{d^{(\ell )}_{h, m}\times d_h^{(\ell )}}\) and the output of \(\textrm{MLP}\) has the dimension of \(d_h^{(\ell )}\).
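Both adjustments share the same gating pattern, sketched below; the hidden size of the gating MLP is an assumption:

```python
import torch
import torch.nn as nn

class TopicModulation(nn.Module):
    """Topic-specific adjustment of a susceptibility or influence vector
    (Eqs. (10)-(11)): an MLP gate computed from the user's topic preferences
    and the message embedding modulates the vector element-wise."""

    def __init__(self, d_p: int, d_x: int, d_vec: int, d_out: int, d_hidden: int = 64):
        super().__init__()
        self.gate_mlp = nn.Sequential(nn.Linear(d_p + d_x, d_hidden), nn.ReLU(),
                                      nn.Linear(d_hidden, d_vec))
        self.w = nn.Linear(d_vec, d_out, bias=False)

    def forward(self, p, x_m, vec):
        # p: (N, d_p) topic preferences, x_m: (d_x,) message embedding,
        # vec: (N, d_vec) susceptibility r_v or influence h_v at layer l.
        message = x_m.expand(p.size(0), -1)                  # broadcast to all users
        gate = self.gate_mlp(torch.cat([p, message], dim=-1))
        return self.w(gate * vec)                            # Hadamard product, then W
```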

5.3 User state update

With users’ topic-specific susceptibility and influence, we can model their interplay which changes their activation states. The influences of each user’s active neighbours are first aggregated as the total amount of topic-specific influence exposed to the user. Then we use an MLP module to capture the likelihood, denoted by \(\gamma _v^{(\ell )}\), of the user adopting the message according to the exposed influences and his/her topic-specific susceptibility:

$$\begin{aligned} \begin{array}{l} \gamma ^{(\ell )}_v = \textrm{sigmoid}\Bigl (\textrm{MLP}\Bigl (\Bigl (\sum _{v'\in \mathcal {N}(v)} \textbf{h}^{(\ell )}_{v',m} \cdot { State}_{v'}^{(\ell )}\Bigr ) \parallel \textbf{r}^{(\ell )}_{v,m}\Bigr ) + \beta _v\Bigr ). \end{array} \end{aligned}$$
(12)

where \(\beta _v\in \mathbb {R}\) is a self-activation parameter. Intuitively, the adoption probability depends on the user’s topic preferences. In our model, we use a one-layer MLP followed by a sigmoid function to capture this dependence. In the end, we combine the above activation probability with the user’s current activation status into the user’s new activation state:

$$\begin{aligned} \begin{array}{l} { State}^{(\ell +1)}_v = {\left\{ \begin{array}{ll} 1, &{} \text {~if~} v\in C^{t_0}_m \\ \textrm{sigmoid}\bigl (\mu ^{(\ell )}_1 { State}^{(\ell )}_v+ \mu ^{(\ell )}_2 \gamma ^{(\ell )}_v\bigr ), &{} \text {~if~} v\notin C^{t_0}_m. \end{array}\right. } \end{array} \end{aligned}$$
(13)

Note that \(\mu ^{(\ell )}_1, \mu ^{(\ell )}_2\in \mathbb {R}\) are two weight parameters to be trained. The initial state \({ State}_v^{(0)}\) is set to 1 if \(v\in C_m^{t_0}\) and 0 otherwise. In the end, we calculate the predicted final size of the cascade \(\tilde{n}_m^\infty\) as the sum of all users’ final activation states, i.e., \(\sum _{v\in \mathcal {V}}{ State}_v\).
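A sketch of this state update for one layer, continuing the dense-adjacency convention of the earlier sketches; the one-layer MLP follows the text above, while the parameter initialisations are our own assumptions:

```python
import torch
import torch.nn as nn

class StateUpdate(nn.Module):
    """Activation-state update of Eqs. (12)-(13). Early adopters in C_m^{t_0}
    are pinned to probability 1 via seed_mask."""

    def __init__(self, d_h: int, d_r: int, n_users: int):
        super().__init__()
        self.mlp = nn.Linear(d_h + d_r, 1)                 # one-layer MLP in Eq. (12)
        self.beta = nn.Parameter(torch.zeros(n_users))     # self-activation beta_v
        self.mu = nn.Parameter(torch.ones(2))              # mu_1, mu_2 in Eq. (13)

    def forward(self, h_topic, r_topic, state, adj, seed_mask):
        # h_topic: (N, d_h) topic-specific influences, r_topic: (N, d_r)
        # topic-specific susceptibilities, state: (N,) activation states,
        # adj: (N, N) adjacency, seed_mask: (N,) bool for early adopters.
        exposed = adj @ (h_topic * state.unsqueeze(-1))    # active neighbours' influence
        gamma = torch.sigmoid(
            self.mlp(torch.cat([exposed, r_topic], dim=-1)).squeeze(-1) + self.beta)
        new_state = torch.sigmoid(self.mu[0] * state + self.mu[1] * gamma)
        return torch.where(seed_mask, torch.ones_like(new_state), new_state)

# After the last layer, the predicted popularity is the sum of activation
# states: pred_size = final_state.sum()
```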

5.4 User profiling

From the above discussion, we can see that our model uses three input vectors for each user v at the 0-th layer: \(\textbf{p}_v^{(0)}\), \(\textbf{r}_v^{(0)}\) and \(\textbf{h}_v^{(0)}\). A few methods have been proposed in the literature to learn users’ susceptibility and influence embeddings from users’ sharing histories (Wang et al. 2015; Panagopoulos et al. 2020). In this paper, we pre-train a simple but effective model to prepare the three types of initial vectors. Suppose we have the cascades of the past messages in \(\mathcal {M}_{ hist}\). We interpret them as the ultimate states of users in the corresponding information diffusion processes. In other words, for each \(m\in \mathcal {M}_{ hist}\), we have the final cascade \(C_m^\infty\). We set \({ State}_v^\infty =1\) if \(v\in C_m^\infty\) and 0 otherwise. We calculate the activation state of each user \(v\in C_m^\infty\) based on his/her topic-specific susceptibility and his/her active friends’ influences, denoted by \(\widetilde{{ State}}_{v,m}\). Formally,

$$\begin{aligned} \begin{array}{l} \widetilde{{ State}}_{v,m} = \textrm{sigmoid}\Bigl (\mathbf {\alpha }\cdot \sum _{v'\in \mathcal {N}(v)\cap C_m^\infty } \bigl (\textrm{MLP}(\textbf{p}^{(0)}_v \parallel \textbf{x}_m) \circ (\textbf{h}^{(0)}_{v'}\parallel \textbf{r}_v^{(0)})\bigr )\Bigr ) \end{array} \end{aligned}$$
(14)

where \(\mathbf {\alpha }\in \mathbb {R}^{d_r^{(0)}+d_h^{(0)}}\) and \(\textrm{MLP}\) outputs a vector of dimension \(d_r^{(0)}+d_h^{(0)}\). In the end, \(\textbf{p}_v^{(0)}\), \(\textbf{r}_v^{(0)}\) and \(\textbf{h}_v^{(0)}\) are trained by minimising the objective function:

$$\begin{aligned} \begin{array}{l} \mathcal {L}_{ initial} = -\frac{1}{\vert \mathcal {M}_{ hist}\vert }\sum _{m\in \mathcal {M}_{ hist}} \sum _{v\in C_m^\infty } \log (\widetilde{{ State}}_{v,m}). \end{array} \end{aligned}$$
(15)

There may exist users who do not participate in any cascades. For such users, we use the zero vector \(\textbf{0}\) for all three profile vectors.
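A sketch of this pre-training step (Eqs. (14)-(15)); the embedding tables, the MLP hidden size and the per-message loop are illustrative assumptions rather than the actual implementation:

```python
import torch
import torch.nn as nn

class ProfilePretrainer(nn.Module):
    """Learns the initial profiles p_v, r_v, h_v from historical cascades."""

    def __init__(self, n_users, d_p, d_r, d_h, d_x, d_hidden=64):
        super().__init__()
        self.p = nn.Embedding(n_users, d_p)   # topic preferences
        self.r = nn.Embedding(n_users, d_r)   # susceptibility
        self.h = nn.Embedding(n_users, d_h)   # influence
        self.mlp = nn.Sequential(nn.Linear(d_p + d_x, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_r + d_h))
        self.alpha = nn.Parameter(torch.randn(d_r + d_h))

    def activation(self, v, active_neighbours, x_m):
        # v: adopter index (scalar LongTensor); active_neighbours: 1-D LongTensor
        # of indices in N(v) that are also in C_m^infty; x_m: (d_x,) message
        # embedding. Implements Eq. (14).
        gate = self.mlp(torch.cat([self.p(v), x_m]))                      # (d_r + d_h,)
        pairs = torch.cat([self.h(active_neighbours),
                           self.r(v).expand(active_neighbours.size(0), -1)], dim=-1)
        return torch.sigmoid((self.alpha * (gate * pairs)).sum())

# Eq. (15): the loss is the average of -log(activation(v, ...)) over all
# adopters v of all historical messages in M_hist.
```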

5.5 Model training

In order to achieve the two objectives of cascade prediction, popularity and final adopter prediction, we aggregate the two corresponding objective functions into our final loss function to guide the parameter optimisation: \(\mathcal {L} = \theta _1\mathcal {L}_{adp} + \theta _2 \mathcal {L}_{pop} + \theta _3\mathcal {L}_{reg}\) where \(\theta _1\), \(\theta _2\) and \(\theta _3\) are hyperparameters. The term \(\mathcal {L}_{ reg}\) is the L2 norm of all model parameters, added for regularisation.
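A minimal sketch of this combined objective, with the theta values as hyperparameters (the defaults here are placeholders, not tuned values):

```python
def total_loss(model, loss_adp, loss_pop, theta=(1.0, 1.0, 1e-5)):
    """Weighted sum of the adopter loss, popularity loss and L2 regularisation."""
    l2_reg = sum((param ** 2).sum() for param in model.parameters())
    return theta[0] * loss_adp + theta[1] * loss_pop + theta[2] * l2_reg
```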

6 Experimental evaluation

6.1 Datasets

We leverage four real-life datasets in our experiments: Sina, AMINER, Twitter2012 and Twitter2020. Sina and AMINER are publicly available and widely exploited in the validation of previous works related to cascade prediction (Li et al. 2017; Cao et al. 2020D). The Twitter2020 dataset is an extension of our collection described in Sect. 4 while Twitter2012 is a public Twitter dataset collected in 2012 (Weng et al. 2013). Each dataset has two components: a social graph and a text dataset consisting of diffused messages. We select these datasets to ensure a comprehensive evaluation that covers as many practical scenarios as possible. Sina and Twitter represent the social media platforms characterised by microblogs. The users of the Sina dataset are more densely connected. This dense social graph will benefit cascade prediction models with a more complete view of the sources of influences. AMINER is a citation network instead of a social media platform and stores the citation relations between academic authors. We use AMINER to test whether our CasSIM model can also predict cascades in more general settings. Moreover, in order to check the performance of our model for different lengths of observation periods, i.e., \(t_0\), for each dataset we construct three sets of cascades by cutting the cascades according to three given time periods. For Twitter and Sina, due to their fast propagation speed, the observation periods are set to 1 h, 2 h and 3 h. For AMINER, we select 1 year, 2 years and 3 years. More details can be found in “Appendix B” and the detailed statistics are summarised in Table 7 in “Appendix B”.

6.2 Baselines

Considering the large number of methods for macroscopic and microscopic prediction, we select representative methods for comparison. A method is representative if it is typical of a class of methods or reports strong performance. For instance, we use SEISMIC (Zhao et al. 2015) and feature-based methods (Cao et al. 2020D) as representative baselines for machine learning methods without deep learning. We reuse the implementations of these models whenever they are accessible and implement them ourselves otherwise. A brief description of our baselines can be found in “Appendix C”.

6.3 Experimental settings

6.3.1 Evaluation measurements

We use three widely adopted measurements to evaluate the prediction performance regarding popularity. MSLE (mean square log-transformed error) is a standard evaluation metric (Chen et al. 2019) defined as: \({ MSLE} = \frac{1}{\vert \mathcal {M}\vert }\sum _{m\in \mathcal {M}}(\log n_m^\infty -\log \tilde{n}_m^\infty )^2\). We use mean absolute percentage error (MAPE) and wrong percentage error (WroPerc), which is introduced and used in Cao et al. (2020D), to evaluate prediction performance in terms of relative errors. MAPE measures the average relative error and is defined as: \({ MAPE}= \frac{1}{\vert \mathcal {M}\vert }\sum _{m\in \mathcal {M}} \frac{\vert \tilde{n}_m^\infty - n_m^\infty \vert }{n_m^\infty }\). WroPerc measures the percentage of cascades that are poorly predicted and is defined as: \({ WroPerc} = \frac{1}{\vert \mathcal {M}\vert }\sum _{m\in \mathcal {M}}\mathbbm {1} \left( \frac{\vert \tilde{n}_m^\infty - n_m^\infty \vert }{n_m^\infty } \ge \varepsilon \right)\). We set the threshold \(\varepsilon\) to 0.5 in our experiments. Note that \(\mathbbm {1}(*)\) is an indicator function which outputs 1 when the input proposition is true and 0 otherwise. For each measurement, a lower value indicates better prediction performance.
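A small sketch of these three popularity metrics (the natural logarithm is assumed for MSLE, since the base is not specified above):

```python
import numpy as np

def popularity_metrics(pred_sizes, true_sizes, threshold=0.5):
    """MSLE, MAPE and WroPerc over predicted and true final cascade sizes."""
    pred = np.asarray(pred_sizes, dtype=float)
    true = np.asarray(true_sizes, dtype=float)
    msle = np.mean((np.log(pred) - np.log(true)) ** 2)
    relative_error = np.abs(pred - true) / true
    mape = np.mean(relative_error)
    wroperc = np.mean(relative_error >= threshold)
    return {"MSLE": msle, "MAPE": mape, "WroPerc": wroperc}
```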

With regard to evaluating the prediction performance of final adopters, we use the standard metrics: precision, recall and F1 score.

6.3.2 Hyperparameter settings

For each dataset, we randomly split the cascades into training, validation and testing sets according to the ratio 8:1:1. For the text embedding model RoBERTa, we utilise the implementation XLM-RoBERTa (Conneau et al. 2020). We set the maximum size of input strings to 128, and the text embedding dimension is 768. For all models including the benchmark models, we tune their hyperparameters to obtain the best performance on validation sets. Early stopping is employed for tuning when validation errors do not decline for 20 consecutive epochs. The learning rate and L2 coefficient are chosen from \(10^{-1},10^{-2},\dots ,10^{-8}\). The hidden units for MLPs are chosen from {32, 64}. The batch size is 32. We train our model for 500 epochs and utilise Adam (Kingma and Ba 2015) for optimisation. We use the first three months’ cascades in the Sina and Twitter2020 datasets to pre-train users’ initial profiles: their topic preferences, susceptibilities and influences. For the Twitter2012 dataset, the first week’s cascades are used. For the AMINER dataset, the first 2 years’ cascades are used. All other hyperparameters of the baselines remain as recommended in the original papers or the published source code.

6.4 Overall prediction performance

We compare the performance of our CasSIM model to the baselines for both cascade prediction objectives: popularity prediction and final adopter prediction. As discussed previously, not all baselines can achieve these two objectives simultaneously. As a result, for each objective, we compare with the baselines that can achieve it. We independently train each model 5 times and report the average results on testing sets.

6.4.1 Popularity prediction

We outline the performance of all the benchmarks and our CasSIM model on the selected datasets in Tables 1, 2, 3 and 8. Due to limited space, we put the results on Twitter2012, i.e., Table 8, in “Appendix D”. We do not consider DyHGCN in this comparison since it can only conduct microscopic prediction, i.e., predicting the next single adopter. Our objective is to examine whether our CasSIM model outperforms the baselines in different scenarios. If not, we analyse the possible causes so as to understand the scenarios where our model works best. In general, we can observe that FOREST, CasFlow and TempCas are the best baselines in terms of popularity prediction. In addition, the prediction becomes more accurate when observation periods are longer. These two observations are consistent with the experimental evaluation in the literature (Tang et al. 2021; Chen et al. 2019). We highlight the best performance in bold and italicise the second best.

We have three main observations. First, our CasSIM model outperforms almost all the baselines according to the three measurements on the four datasets. TempCas only marginally outperforms CasSIM when the observation periods are set to 1 h and 2 h. This may be caused by the relatively large variance of cascade lengths in the Sina dataset. The performance improvements show that our model can accurately predict the final size of cascades on both social media and citation networks where the cascading phenomenon exists. Second, compared to CoupledGNN, our CasSIM model produces markedly more accurate predictions, especially when measured by WroPerc. For instance, on the Sina dataset, the improvement is larger than 17%. It can even reach 35% on our Twitter2020 dataset. This means the performance of CasSIM is more stable than that of CoupledGNN. We can also infer that the consideration of users’ dual roles in information diffusion is necessary and that our CasSIM model effectively captures the interactions between users’ susceptibilities and influences. Last, the improvement of our CasSIM model is more significant when observation periods are shorter. For instance, on the Sina dataset, CasSIM improves the performance measured by MSLE by \(6\%\) compared to TempCas when observation periods are set to 1 h. The improvement drops to \(3\%\) for 2-h observation periods and further decreases to \(2\%\) when observation periods are 3 h. We infer that this results from our consideration of users’ topic preferences and message contents in CasSIM. When shorter observation periods are set, the baselines which only rely on early adopters’ co-occurrences in cascades do not have sufficient information for prediction.

Table 1 Popularity prediction performance on Sina dataset
Table 2 Popularity prediction performance on AMINER dataset
Table 3 Popularity prediction performance on Twitter2020

6.4.2 Final adopter prediction

Table 4 Final adopter prediction performance

In the literature, only FOREST can predict the final adopters while predicting the popularity. It uses a microscopic prediction module to calculate the probability distribution over users to be the next activated user. FOREST iteratively samples the next adopters until a special virtual user named ‘STOP’ is sampled. Compared to FOREST, CoupledGNN and our CasSIM model assign an activation probability to each user. As both models can predict the number of final adopters, i.e., \(\tilde{n}_m^\infty\), we can use the \(\tilde{n}_m^\infty\) users with the largest activation probabilities as the set of final adopters. Considering the inevitable prediction errors, we use a tolerance parameter \(\eta\) to add a certain percentage of extra adopters. It may be argued that microscopic models can also be applied to predict final adopters by iteratively predicting the next adopters, similar to FOREST. However, different from FOREST, such models do not have a mechanism to terminate the sampling. In order to ensure the comprehensiveness of our validation, we manually add an unfair terminating condition: sampling stops once the true number of final adopters has been reached. We use the state-of-the-art microscopic models DyHGCN and TopoLSTM as representatives. Note that \(\eta\) only applies to CoupledGNN and CasSIM since it is introduced to counter the potential errors of their predicted popularity. In Table 4, we list the performance regarding final adopter prediction when observation periods are 3 h for Twitter and Sina, and 3 years for AMINER. For the tolerance parameter \(\eta\), we use \(10\%\), \(20\%\), \(30\%\), \(40\%\) and \(50\%\) in our experiments.

We can see that CasSIM already performs better than all the baselines except DyHGCN when using the originally predicted popularity, i.e., with \(\eta\) set to 0. DyHGCN only performs slightly better than CasSIM on the Sina dataset and Twitter2012. Although the improvement over FOREST is somewhat marginal, CasSIM performs much better than CoupledGNN. With the relatively high-quality cascades in Sina and Twitter2012, CasSIM improves the three measurements by about \(18\%\). The improvement can reach more than \(30\%\) on AMINER and Twitter2020. With positive \(\eta\) values, we observe an obvious performance increase for both CoupledGNN and CasSIM. As expected, too large an \(\eta\) eventually compromises the performance. In our experiments, we achieve the best performance when \(\eta\) equals \(30\%\) or \(40\%\), and the performance starts to fall when \(\eta\) is \(50\%\).

6.4.3 Discussion

From the above analysis, we can see that our CasSIM model produces promising performance for both popularity prediction and final adopter prediction. Moreover, it effectively models the two roles of users in information diffusion. The integration of message contents into our model also helps improve popularity prediction when observation periods are short.

Table 5 Ablation study of popularity prediction performance on all datasets

6.5 Ablation study

We examine the contributions of the components which are implemented in our CasSIM model and missing in previous works. As we emphasised previously, the novelty of CasSIM is the diffusion process modelling which considers users’ profiles as two roles, message contents and topic-specific susceptibilities and influences. We design three variants of CasSIM to study the components related to these factors:

  • CasSIM-h/r We do not distinguish users’ dual roles in diffusion and use the same vectors for users’ susceptibilities and influences.

  • CasSIM-up We remove the pre-training process for the initial user profiles and use random assignments.

  • CasSIM-x We remove users’ topic preference vectors, i.e., \(\textbf{p}_v\), and do not consider the content of messages under diffusion, i.e., \(\textbf{x}_m\).

Table 5 outlines the performance comparison between CasSIM and its variants in terms of popularity prediction. We have three major observations: i) CasSIM performs considerably better than its variants; ii) ignoring users’ two roles in information diffusion consistently leads to the largest degradation in prediction performance; iii) except for Sina, message content consistently ranks as the second most influential component.

6.6 Hyperparameter test

We examine the influence of three important hyperparameters of CasSIM. The first is the number of GNN layers, which can intuitively be interpreted as the number of diffusion generations. The other two relate to the pre-trained user profiles. In CasSIM, we assume that users’ profiles are stable over a sufficiently long time, especially users’ topic profiles, i.e., \(\textbf{p}_v\). In the previous experiments, we use the first three months’ retweets in our Twitter dataset to pre-train users’ susceptibility and influence vectors, and keep them fixed for the subsequent cascade predictions. We would like to test whether this is reasonable in practice and when user profiles should be retrained. We take our Twitter dataset as an example in our investigation. We start by examining how many months of data are needed for this pre-training process and then track the performance changes when predicting cascades in different periods after user profile training. Figure 6 shows the results. We vary the number of GNN layers z over {2, 3, 4, 5}. We can see that the MSLE curve reaches its minimum when \(z=3\), then slowly climbs as more layers are added. We vary the number of months whose retweets are used for user profiling over \(\{1, 2, 3, 4, 5\}\) and the results show that the period for user profiling can be neither too short nor too long. On our Twitter dataset, three months work best for popularity prediction. To test the effectiveness of pre-trained user profiles, we train and test our CasSIM model on tweets posted 1, 3, 6, 9 and 12 months after the tweets used for user profile training. We can see the popularity prediction performance decreases when the trained user profiles are used to predict cascades more than 3 months later. However, a closer look reveals that the range of the change is rather small. This is consistent with our expectation that user preferences and interests are relatively stable in spite of the vast changes in trending social news.

Fig. 6: The influence of hyperparameters

7 Conclusion

In this paper, we proposed a new deep learning model, CasSIM, which can simultaneously achieve the two most demanded cascade prediction objectives: popularity prediction and final adopter prediction. Compared to previous models, CasSIM explores the dual roles of users in diffusion processes as both receivers and distributors and models the three basic factors in users’ decisions to become active: susceptibilities, influences and message contents. With effective user profiling, CasSIM successfully models the topic-specific property of susceptibilities and influences. In addition, the introduction of GNNs allows CasSIM to capture the dynamics of susceptibilities and influences during information diffusion. With extensive experiments on four real-life datasets, we validated the effectiveness of CasSIM in predicting popularity and final adopters. The results showed that CasSIM outperforms state-of-the-art methods, especially when shorter cascades are observed, in both social media and other scenarios where cascades are present.

We identify a few limitations of our CasSIM model which can be addressed in the future. First, we focused on messages in the form of texts and only considered their topics. Second, CasSIM does not consider the temporal ranks of the early adopters. It would be interesting to extend and test CasSIM in cascade prediction by combining other types of information in messages such as images and quotations, considering other semantic features such as sentiments, and improving the performance by integrating the time stamps of early adopters.