1 Introduction

An abundance of beneficial information is available on social media (Fan et al. 2013). Social media users can disseminate information posted by other users through functionalities such as retweeting on Twitter and sharing on Facebook (Doerr et al. 2012). Through such word-of-mouth information diffusion, information about important breaking news and newly created useful content is disseminated quickly and widely on social media. Because social media users can access such useful information, social media is considered to be an important platform for both those who produce information and those who consume it.

In contrast, the dissemination of harmful information, including the so-called fake news, misinformation, and hate speech, has become a serious issue (Kimura et al. 2008; Sharma et al. 2019; Mathew et al. 2019). For example, during the coronavirus disease 2019 (COVID-19) outbreak, there was widespread dissemination of misleading and false information on social media, resulting in people believing in false theories about the disease and the virus that causes it (Jolley and Paterson 2020). In addition, it has been reported that an increasing number of people are relying on misleading information on social media and as a result are hesitant to get vaccinated without first seeking professional advice (van der Linden et al. 2020).

Many researchers have devoted substantial effort to limiting the diffusion of harmful information on social media (Pérez-Rosas et al. 2018; Shah and Zaman 2011; Leskovec et al. 2009; Sharma et al. 2019). Pérez-Rosas et al. (2018) proposed a method to automatically detect fake news on social media. Shah and Zaman (2011) proposed a method to estimate the users who are the source of rumors on social media. Leskovec et al. (2009) proposed a framework for tracking the spread of fake news.

Methods to block information diffusion between (a small number of) specific users have been actively studied as promising approaches to limiting harmful information diffusion (Yan et al. 2019; Kimura et al. 2008; Tong et al. 2012; Yao et al. 2015). Blocking one user’s information from spreading to other users can reduce the final size of information diffusion cascades. Figure 1 shows an example of blocking information diffusion between users. As shown in Fig. 1, by blocking information diffusion between only users A and B, the number of users who finally receive the information will decrease from 4 to 2. However, if information diffusion is blocked too heavily, there is a risk of blocking diffusion of useful information. Therefore, it is desirable to control the diffusion of harmful information by blocking the diffusion of information between as few users as possible.

Fig. 1
figure 1

Example of blocking information diffusion between users

The problem with limiting the size of information diffusion cascades by blocking information diffusion between a limited number of users is formulated as a link deletion problem in social networks (Kimura et al. 2008). A social network is defined as a graph \(G=(V,E)\), in which social media users are represented by nodes and relationships among them are represented by links (Domingos and Richardson 2001). Information posted by nodes in a social network spreads through links between nodes. Given a social network G and an integer k, the link deletion problem aims to determine the combination of k links that should be removed in order to minimize the size of information diffusion cascades on social network G. Many studies have proposed various link deletion methods to control the size of information diffusion cascades (Kimura et al. 2008; Tong et al. 2012; Khalil et al. 2014; Yan et al. 2019). Previous studies have used synthetic information diffusion models to evaluate the effectiveness of the link deletion methods (Tong et al. 2012; Yan et al. 2019; Zareie and Sakellariou 2022). For example, Tong et al. (2012) evaluated the effectiveness of link deletion methods in reducing the size of information diffusion using a popular synthetic diffusion model called the susceptible-infectious-susceptible (SIS) model.

In this study, we aim to reveal the benefits and limitations of link deletion in limiting the spread of information diffusion on real social media. Contrary to previous studies that used synthetic models, we utilize the actual log data to evaluate the effectiveness of link deletion methods. Synthetic information diffusion models cannot fully reproduce the characteristics of information diffusion on actual social media (Tripathy et al. 2010). Therefore, even if the link deletion method is effective in controlling the size of information diffusion generated by the synthetic information diffusion model, its effectiveness in controlling actual information diffusion on social media remains unclear. We examine the size of information diffusion cascades when it is possible to block information diffusion between a certain number of user pairs identified using link deletion methods. Because it is not easy to conduct an experiment that actually blocks information diffusion, we propose a method to estimate the size of retweet diffusion cascades after link deletion from a who-follows-whom network users and log data of information diffusion among them. Based on the proposed method, we examine the effectiveness of link deletion methods including NetMelt (Tong et al. 2012), a popular link deletion method, and other heuristics for limiting the size of information diffusion cascades on social media (Kashyap and Ambika 2019).

This paper is an extended version of our previous conference short paper (Furukawa and Tsugawa 2021). We have conducted experiments on two additional datasets, and two additional link deletion methods for comprehensive examination of the effectiveness of link deletion methods. Moreover, we propose a new method to estimate the size of diffusion cascade after link deletion. In our previous paper, we only evaluated the effectiveness of link deletion under the worst-case scenario. In this paper, we evaluate the effectiveness of link deletion both under the worst-case and more optimistic scenarios.

The contributions of the present study are summarized as follows.

  • We use four large-scale datasets of cascades rather than synthetic cascade models to evaluate the effectiveness of link deletion. We propose a method to estimate the size of a given cascade after link deletion. Using the proposed method, we examine the benefits and limitations of link deletion methods when applying them to actual social media.

  • Our results suggest that there are limitations to link deletion methods in limiting the size of retweet cascades on social media. While existing studies (Tong et al. 2012; Yan et al. 2019) have demonstrated the effectiveness of link deletion methods in limiting the size of cascades generated by synthetic models, our results suggest that link deletion methods are not so effective in limiting the size of real diffusion cascades on social media both under the pessimistic and optimistic scenarios.

  • We show that most of the cascades examined in this study have multiple sources, which makes limiting their sizes by link deletion difficult.

The remainder of this paper is organized as follows. In Sect. 2, we introduce related work. In Sect. 3, we propose a method to estimate the size of cascades when link deletion methods are applied. In Sect. 4, we explain the datasets and methodologies for the analyses. In Sect. 5, we present the results and discuss their implications. Finally, in Sect. 6, we summarize this study and discuss future work.

2 Related work

Many researchers have conducted research to detect harmful information on social media (Pérez-Rosas et al. 2018; Shah and Zaman 2011; Liu et al. 2020; Helmstetter and Paulheim 2018). Pérez-Rosas et al. (2018) proposed a method to automatically detect fake news on social media. They analyzed actual fake news texts and developed a model to predict whether a given news text is fake news by combining information on vocabulary, syntax, and semantics that are frequently used in actual fake news. Ghanem et al. (2020) proposed an approach to detect non-factual twitter accounts. Shah and Zaman (2011) proposed a method to estimate the source of rumors using maximum likelihood estimation. To control the spread of harmful information, both detecting harmful information and limiting its spread are required (Sharma et al. 2019). While the above mentioned studies focused on the former, the present study focuses on the latter.

To limit the spread of information diffusion on a social network, several studies have proposed methods to change the network topology of the given network so that the size of the spread of information diffusion cascades on the network are minimized (Khalil et al. 2014; Kimura et al. 2008). Some of these studies have proposed removing several nodes from the network (Wang et al. 2013; Alorainy et al. 2022). In particular, previous studies have shown that removing high centrality nodes is an effective strategy (Kitsak et al. 2010). However, deleting high centrality nodes in the network means deleting influential users on social media. Therefore, such methods are considered to be costly (Zareie and Sakellariou 2021) and may even have a considerably negative effect because famous celebrities may be removed.

Deleting a certain number of links in a social network is another approach to limiting the spread of information diffusion (Tong et al. 2012; Yan et al. 2019; Zareie and Sakellariou 2022; Yao et al. 2015; Kanwar et al. 2022). Here, deleting a link between two nodes means blocking information diffusion between them. Tong et al. (2012) addressed the NetMelt problem of selecting links for deletion in order to minimize the spread of rumors. Because the largest eigenvalue of the adjacency matrix of a social network graph is positively correlated with the size of information diffusion on the network, they proposed an algorithm to remove k links so that the largest eigenvalue of the network is minimized. Yan et al. (2019) proposed a rumor spread minimization (RSM) problem and its algorithms. RSM algorithms aim to identify a set of links so that the size of the information diffusion cascade from given source nodes is minimized. In contrast, Zareie and Sakellariou (2022) proposed a source-ignorant method to identify a set of critical links for limiting the sizes of information diffusion cascades from arbitrary seed nodes. These studies used synthetic information diffusion models and demonstrated the potential of link deletion methods in limiting the size of information diffusion. Our study follows and extends these studies, evaluating the effectiveness of link deletion methods by using actual logs of retweet cascades on social media rather than using synthetic diffusion models.

3 Method of estimating size of diffusion cascades after link deletion

We propose a method to estimate the size of a given information diffusion cascade when deleting several links in the social network. Notations used in this paper are summarized in Table 1. Let T be a set of tweets or topics posted on social media during a certain period of time, and let L be a set of links to be deleted. Here, deleting link \((u,v) \in L\) means blocking information diffusion from user u to user v. We consider a method to estimate the size of each diffusion cascade regarding tweet or topic \(t \in T\) after deleting links in L. Although each t can be either a specific tweet or a specific topic, from here on, we use tweet t rather than tweet or topic t for simplicity. We denote the set of users who post or repost tweet t as \(R_t\), and the set of users involved in tweet set T as \(V= \bigcup _{t \in T} R_t\). The time at which user \(u \in V\) posts t or repost t is denoted as \(\tau (u,t)\), and the social network that expresses who-follows-whom relationships among users belonging to user set V is denoted as \(G=(V,E)\). The link \((u,v) \in E\) represents user u following user v.

Table 1 Notations used in this paper

First, for each tweet \(t \in T\), we construct a diffusion graph \(H_t=(R_t,E_t)\) that represents the diffusion paths of tweet t. The set of nodes in \(H_t\) is equal to \(R_t\), which is the set of users who tweeted or retweeted tweet t. The link \((u,v) \in E_t\) in \(H_t\) represents tweet t spreading from user u to user v. To obtain \(E_t\) is not obvious because each cascade does not contain information about the specific diffusion paths of the tweet. Here, we construct three diffusion graphs, one of which is non-tree diffusion graph (i.e., each node can have multiple parents), and the others are trees (i.e., each node has at most a single parent).

In the non-tree diffusion graph, we consider a tweet to have spread from user u to user v when user v is following user u and the timing of user u’s retweeted (or tweeted) tweet t is earlier than that of user v. In other words, if \((v,u) \in E\) and \(\tau (u,t) < \tau (v,t)\), then \((u,v) \in E_t\).

While a user can be affected by multiple users in the non-tree diffusion graph, a user is affected by a only single user in the diffusion trees. We construct two-types of diffusion trees, which we call diffusion tree (first) and diffusion tree (last). In the diffusion tree (first), we consider a tweet to have spread from user u to user v when user v is following user u and the timing of user u’s retweeted tweet t is earlier than that of user v and the earliest among user v’s followees. In contrast, in the diffusion tree (last) we consider a tweet to have spread from user u to user v when user v is following user u and the timing of user u’s retweeted tweet t is earlier than that of user v and the latest among user v’s followees.

We call the node that does not have any incoming links in \(H_t\) the seed user of tweet t. The user who posted tweet t is the seed user. User u who retweeted tweet t is also a seed user if none of their followees retweeted tweet t before user u retweeted tweet t. We denote the set of seed users of tweet t as \(S_t\).

Next, we use the link deletion method to select a set of links L (\(|L|=k\)) to be deleted and construct a diffusion graph after link deletion. By removing these links from diffusion graph \(H_t\), we can estimate how the number of users who receive each tweet \(t \in T\) changes. We remove the selected link set L from \(H_t\). Namely, we construct \(H'_t=(R_t,E'_t)\), where \(E'_t=E_t \setminus L\). Deleting link (uv) corresponds to blocking information propagation from user u to user v. Therefore, if user v receives tweet t from only user u, that is, if node v has no incoming link in \(H'_t\), then deleting link (uv) will prevent user v from receiving tweet t.

Finally, we use diffusion graph \(H'_t\) to find the number of users who receive tweet t after the link deletion. In graph \(H'_t\), seed users can receive tweet t even after the link deletion. Users who follow the seed users can also receive this information. Furthermore, users who follow those users (i.e., the users following the seed users) can also receive the information. Therefore, we assume that all users within reach of the seed users in diffusion graph \(H'_t\) can receive tweet t. Then, we estimate the number of users who receive tweet t after link deletion as the number of nodes that are within the reach of the seed nodes \(s \in S_t\) in diffusion graph \(H'_t\)

Note that there is a only single diffusion path for a given node in the diffusion trees whereas there are multiple diffusion paths in the non-tree diffusion graph. Hence, when using diffusion trees, the estimated sizes of diffusion cascades after link deletion are smaller than those when using non-tree diffusion graphs. Using the non-tree diffusion graph and diffusion tress, we examine the effectiveness of link deletion both under the pessimistic and optimistic scenarios.

An example of non-tree diffusion graph \(H_t=(R_t,E_t)\) of tweet t is shown in Fig. 2a. We consider the case where the set of links to be deleted is \(L = \{ (1,5),(3,6) \}\). Graph \(H' _t\) after deleting these links from \(H _t\) is shown in Fig. 2b.

Fig. 2
figure 2

Examples of original diffusion graph \(H_t\) and the diffusion graph after link deletion \(H'_t\). Numbers on the nodes indicate the time at which the user posted a tweet or retweet regarding t. Link (1, 2) indicates that user 2 is following user 1, and the time at which user 1 posted a tweet or retweet regarding t was earlier than the time at which user 2 did

4 Dataset and methodology

4.1 Datasets

To evaluate the effectiveness of link deletion in limiting the size of cascades, we use four datasets containing both cascades and social networks representing who-follows-whom relationships among users involved in the cascades. A statistical summary of the datasets is shown in Table 2. The datasets are explained in detail as follows.

  1. 1.

    Ordinary dataset (Tsugawa 2019) This dataset contains 10,000 tweets and their retweets posted during November 19–25, 2018. The 10,000 tweets were randomly selected from a collection of English tweets with one or more retweets posted during the above period. Note that there were no specific events during the period. In total, 114,249 users tweeted or retweeted these tweets, and these users are target user set V. This dataset also contains the social network of who-follows-whom relationships among the target users. The social network is used to construct the diffusion graph and to determine links to be deleted. To examine the effectiveness of link deletion in limiting the size of retweet cascades, tweets that have a large cascade are preferable for analysis. Therefore, among the 10,000 tweets, we extracted those that were retweeted 100 times or more, which resulted in 90 original tweets. These 90 tweets that were retweeted 100 times or more are the set of tweets T that we analyzed.

  2. 2.

    Higgs dataset (Domenico et al. 2013) This dataset contains tweets and retweets about the discovery of the Higgs boson in 2012. In this dataset, the discovery of the Higgs boson is regarded as topic t. That is, all sets of tweets and retweets in the dataset are regarded as a single cascade of information diffusion. Therefore, the number of topics to be analyzed \(T=1\) in this dataset. The social network representing who-follows-whom relationships among users who tweet and retweet is available in this dataset and is used to construct the diffusion graph and to determine links to be deleted.

  3. 3.

    URL dataset (Hodas and Lerman 2014) This dataset contains tweets containing URLs and retweets during October 2010. Each URL is considered as a topic propagated among users. As with the ordinary dataset, we extract tweets that were retweeted 100 times or more.

  4. 4.

    Douban dataset (Zhong et al. 2012) This dataset contains data from a Chinese social website. This website allows users to post their book reading status and also allows them to follow the status of other users. Each book is considered as a topic propagated among users. As with the ordinary dataset, we extract cascades whose size were 100 or more.

Table 2 Statistics of the datasets used in the analysis

4.2 Link deletion methods

In this study, we focus on the effectiveness of three popular link deletion methods, NetMelt (Tong et al. 2012), Betweenness (Kashyap and Ambika 2019), and Edge-Degree (Kashyap and Ambika 2019) in limiting the size of actual diffusion cascades. We also use a random baseline. The details of the link deletion methods are explained as follows.

  1. 1.

    NetMelt (Tong et al. 2012) We apply the NetMelt to the social network G representing who-follows-whom relationships among the target users and obtain \(|L|=k\) links to be deleted.

  2. 2.

    Betweenness (Kashyap and Ambika 2019) Betweenness selects \(|L|=k\) links to be deleted in descending order of link betweenness of the links in social network G.

  3. 3.

    Edge Degree (Kashyap and Ambika 2019) The edge degree of link (uv) is defined as the product of in-degree of node u and out-degree of node v (Kashyap and Ambika 2019). This method selects \(|L|=k\) links to be deleted in descending order of edge degree of the links in social network G.

  4. 4.

    Random Random selects \(|L|=k\) links to be deleted randomly from all links in social network G.

Using these methods, we determine \(|L|=k\) links to be deleted and estimate the size of cascade \(t \in T\) after link deletion.

5 Results and discussion

First, we analyze how the total size of the cascades in the four datasets changes by link deletion. Figure 3 shows the estimated total size of retweet cascades against the number of deleted links in each dataset when using non-tree diffusion graph. From Fig. 3, we can see that although the total cascade size decreases by link deletion, the decrease is not large relative to the number of deleted links. For example, in the case of ordinary data, when 1.5 million links are deleted from the social network using NetMelt, the total cascade size is about 50% that of the original cascade. 1.5 million links are equivalent to approximately 50% of the links in the original social network. It is practically difficult to block information diffusion between such a large number of user pairs. Therefore, this result suggests that the effects of link deletion methods to limit the size of tweet diffusion cascades are limited. Looking at the difference among link deletion methods, the NetMelt and the Betweenness are generally effective. In particular, except for the URL dataset, the NetMelt achieves the best results in the three of four datasets.

Fig. 3
figure 3

Total size of cascades after link deletion versus the number of deleted links (non-tree diffusion graph)

We next examine the effectiveness of link deletion under more optimistic scenarios by using diffusion trees. Figures 4 and 5 compare the total size of cascades when using non-tree diffusion graph, diffusion tree (first), and diffusion tree (last). Figure 4 shows the results for the NetMelt, and Fig. 5 shows the results for the Betweenness. These results show that although the estimated sizes of cascades when using the diffusion trees are smaller than those when using the non-tree diffusion graph, the estimated sizes of cascades are still large. Even under the best case for each dataset, 10%–50% links are necessary to be deleted for limiting the estimated diffusion sizes to be under 50% of the original sizes. These results again suggest the limitations of link deletion methods for limiting the sizes of real diffusion cascades. It is suggested that even under the optimistic estimation, many links should be deleted to reduce the cascade sizes.

Fig. 4
figure 4

Comparison among different diffusion graphs (NetMelt)

Fig. 5
figure 5

Comparison among different diffusion graphs (Betweenness)

We next analyze how the link deletion affects the size of each cascade. Because the Higgs dataset contains only a single cascade, the other three datasets are used in the following analyses. Figures 6 and 7 show the relation between the original cascade size and the estimated cascade size after link deletion for each tweet when using the NetMelt and the Betweenness, respectively. The number of deleted links is 50% of the original links, and the diffusion tree (last) is used as the diffusion graph. These results show that the cascade size of some tweets is substantially decreased by link deletion. However, we should note that this result is obtained when 50% of links are deleted. Considering the fact that a large number of links are deleted, this result also suggests that the effect of link deletion on information diffusion is limited.

Fig. 6
figure 6

Relation between the original cascade size and the cascade size after link deletion for each tweet (link deletion method: NetMelt, diffusion graph: diffusion tree (last), number of deleted links \(|L|=0.5|E|\))

Fig. 7
figure 7

Relation between the original cascade size and the cascade size after link deletion for each tweet (link deletion method: Betweenness, diffusion graph: diffusion tree (last), number of deleted links \(|L|=0.5|E|\))

Finally, we examine the reason why the link deletion does not substantially reduce the size of diffusion cascades. Figure 8 shows the relationship between the original cascade size of each tweet and the number of seed users of each tweet. For the sake of visibility, we only show the tweets with less than 1000 retweets. These figures show that many tweets have a large ratio of seed users relative to the total cascade size. A seed user is a user who retweets a tweet without being influenced by their followers. Twitter users may obtain information about tweets from sources other than their followers. For example, if a tweet is trending on Twitter, or if it is introduced on another web page, Twitter users may obtain information about the tweet not from their social relationships on Twitter. Link deletion only limits information diffusion through social relationships and cannot prevent seed users from obtaining information from other sources. If a tweet spreads through only social relationships among users, link deletion should be effective in limiting its cascade size. However, if many users obtain information about a tweet from sources other than their social relationships, link deletion may not be effective in limiting its cascade size.

Fig. 8
figure 8

Relation between the original cascade size and the number of seed nodes for each tweet (tweets with less than 1000 retweets)

From the examples of diffusion graphs shown in Fig. 9, we can also confirm that limiting the cascade sizes by removing a small number of links is difficult. Seed users are represented as light green nodes in the figure. If seed users will be isolated from other users in the diffusion graph, the cascade size can be reduced. However, in the example of Fig. 9a, many seed users involve in the tweet diffusion, and isolating these seed users requires many links to be deleted. Also for the example of Fig. 9b, a seed user shown in the center of the figure is connected to many other users, and isolating the seed user requires many links to be deleted.

Fig. 9
figure 9

Examples of non-tree diffusion graph (ordinary dataset): Light green nodes represent seed users

In summary, our results suggest that link deletion methods are limited in efficiently limiting the size of diffusion cascades. NetMelt has been shown to be effective in limiting the size of synthetic cascades in (Tong et al. 2012) and requires many links (e.g., 50% of links) to be deleted to limit the size of real tweet diffusion cascades. In practice, it is difficult to block information diffusion between such a large number of user pairs on social media, which suggests that link deletion on social media is limited in practice. Our results also suggest that the existence of many seed users is one of the reasons for the inefficiency of link deletion methods in actual social media. We show that the information cascades analyzed in this paper have many seed users who may receive the information, not from their social relationships, but from other sources. Link deletion methods cannot prevent seed users from receiving information, so other strategies (e.g., advertising counter campaigns (Budak et al. 2011; Erd et al. Erd et al.)) are suggested as necessary methods for limiting the size of diffusion cascades with many seed users. Moreover, we recognize that other link deletion methods that delete links by using the information about the seed nodes are available in the literature (Yan et al. 2019), and their effectiveness should be evaluated using actual diffusion cascade logs in future research.

6 Conclusion and future work

In this study, we used actual logs of retweet cascades and evaluated the effectiveness of link deletion methods to suppress the spread of information. We proposed a method to estimate the size of a given retweet cascade after link deletion and examined the potential and limitations of link deletion methods when applying them to actual social media. Our results suggest that link deletion methods are limited in limiting the size of retweet cascades on social media. We showed that most of the retweet cascades have multiple sources, which makes limiting their size by link deletion difficult.

In future work, we plan to evaluate the effectiveness of link deletion methods that incorporate the information of seed nodes (Yan et al. 2019). In addition, validating the proposed estimation method of cascade size is also an important future work.