Reply trees in Twitter: data analysis and branching process models
DOI: 10.1007/s13278-016-0334-0
 Cite this article as:
 Nishi, R., Takaguchi, T., Oka, K. et al. Soc. Netw. Anal. Min. (2016) 6: 26. doi:10.1007/s13278-016-0334-0
Abstract
The structure of networks constructed from mentioning relationships between posts in online media may be valuable for understanding how information and opinions spread in these media. We crawled Twitter to collect tweets and replies and constructed a large number of so-called reply trees, each of which was rooted at a tweet and joined by replies. Consistent with the previous literature, we found that the empirical trees were characterized by some long path-like reply trees, large star-like trees, and long irregular trees, although their frequencies were not high. We tested several branching process models to explain the empirical frequency of these types of reply trees as well as more basic quantities such as the distributions of the size and depth of the reply tree. Based on our modeling results, we suggest that the indegree of the tweet that initiates a reply tree (i.e., the number of times that the tweet is directly mentioned by other reply posts) may play an important role in forming the global shape of the reply tree.
Keywords
Reply tree · Twitter · Branching process · Data analysis
1 Introduction
Information spreading plays a fundamental role in triggering collective actions in human society on a large scale. A classical example is diffusion of technological innovation, in which individuals receiving information on a new technology from other peers may decide to adopt the technology (Rogers 2003; Easley and Kleinberg 2010). Other examples include fads (Gladwell 2000), social mobilization (Lotan et al. 2011; Baños et al. 2013; Conover et al. 2013), marketing (Leskovec et al. 2007a; Easley and Kleinberg 2010), voter turnout (Bond et al. 2012), responses to natural disasters (Sano et al. 2013; Sasahara et al. 2013), and circulation of new scientific publications (Thelwall et al. 2013), to name but a few.
Network analysis has been a useful tool for understanding information spreading both online and offline. In particular, owing to increasing amounts of users’ activity and availability of data, various online social media ranging from microblogging services (e.g., Twitter) to social networking services (e.g., Facebook) have been analyzed as networks. In networks of users, a node represents a user, and a link represents a relatively static dyadic relationship between two users such as followership in Twitter and mutual friendship in Facebook. An alternative construct, which we focus on in the present study, is networks of posts; a node represents a post by a user, and a link represents a reference relationship from a post to a previous post. Such a network is usually tree-like with possible branching and without confluence. The reference relationship implies that a post spreads information relayed from a previous post. Therefore, a network of posts is considered to be a direct derivative of information spreading. It should be noted that, in contrast to the case of user networks, a user may appear as different nodes in a network of posts. Networks of posts have been studied in Twitter (Kumar et al. 2010; Kwak et al. 2010; Bakshy et al. 2011; Cogan et al. 2012), Facebook (Sun et al. 2009; Cheng et al. 2014), blogs (Leskovec et al. 2007b; McGlohon et al. 2007; Götz et al. 2009), Flickr (Cha et al. 2009), discussion threads (Gómez et al. 2008; Kumar et al. 2010; Gómez et al. 2011; Wang et al. 2012; Gómez et al. 2013), and email (Liben-Nowell and Kleinberg 2008; Golub and Jackson 2010; Wang et al. 2011).
In a network of posts, an initial post located at the root of the network may induce a cascade of responses of different magnitudes and spatiotemporal patterns. The structure of such a network seems to inform us of the nature of the cascade (Leskovec et al. 2007b; Iribarren and Moro 2009; Liben-Nowell and Kleinberg 2008; Cha et al. 2010; Golub and Jackson 2010; Kumar et al. 2010; Iribarren and Moro 2011; Wang et al. 2011; Wang et al. 2012). For example, the size of the cascade defined by the number of nodes in the network is a simple measure of the extent to which the initial post has involved other users. In addition, networks of the same size may have different shapes. An initial post may diffuse by forming a long chain-like network to eventually involve 100 other posts. A different initial post may receive 100 direct replies, and then, the cascade may terminate without further diffusion, resulting in a star network. Although the size of the cascade is the same in the two cases, the way information is communicated during the cascade may be different. A long chain-like network may be formed when two users reply to each other in alternation and thereby end up discussing the topic in detail; a star network does not allow this interpretation (Cogan et al. 2012). The structure of networks of posts may also tell us the importance of individual users and posts involved in information cascades (Cha et al. 2010; Kwak et al. 2010; Weng et al. 2010; Bakshy et al. 2011; Wang et al. 2011; Baños et al. 2013). Previous studies used structural information obtained from networks of posts for practical applications. Examples include classification of topics without text mining (McGlohon et al. 2007; Kumar et al. 2010; Gómez et al. 2011), quantification of how controversial a post is in online discussion threads (Gómez et al. 2008), and predictions of the final size of an information cascade (Cheng et al. 2014).
In the present study, we analyze a large data set of trees formed by posts in Twitter, which we call reply trees. We operationally distinguish three types of posts in Twitter in the present paper: tweet, reply, and retweet. By convention, we do not include replies and retweets in the definition of tweet. Then, we construct trees from the tweets and replies that we have collected. In short, a reply tree is rooted at a tweet and involves replies that refer to the tweet directly or indirectly. We analyze structural properties of empirical reply trees and propose branching process models for them.
We use Twitter data because Twitter is suitable for studying diffusion processes for several reasons (Kwak et al. 2010; Bakshy et al. 2011; Bollen et al. 2011; Dodds et al. 2011; Lotan et al. 2011; Bliss et al. 2012; Cogan et al. 2012; Baños et al. 2013; Conover et al. 2013; Sasahara et al. 2013). First, Twitter is devoted to information diffusion. This situation contrasts with that for other media such as Facebook in which mutual endorsement is more emphasized. Second, Twitter users communicate in standardized ways. Tweets are restricted to 140 characters, and retweets and replies, which have to follow a given standardized format, are the only modes allowed with which users can directly respond to previous posts. Third, Twitter data can be collected on a large scale with the use of the application programming interface (API).
A majority of previous literature on networks of posts in Twitter seems to have focused on networks of retweets (e.g., Kwak et al. 2010; Bakshy et al. 2011) rather than those of replies (but see Kumar et al. 2010; Cogan et al. 2012). However, we focus on replies in the present study for two reasons. First, replies are considered to be more informative about the relationships between users than retweets are (Sousa et al. 2010; Gonçalves et al. 2011; Bliss et al. 2012). Second, replies are suggested to convey emotional responses of users (Dodds et al. 2011; Bliss et al. 2012), and collective emotions and moods in Twitter often covary with the results of collective actions, presumably induced by information spreading, such as dynamics of stock prices (Bollen et al. 2011).
2 Methods
2.1 Data
We collected mentioning relationships (i.e., one post mentioning another post) between pairs of public posts in Twitter from December 1 to 9, 2011, using the Twitter API as follows. First, on March 15, 2011, we manually selected 26 Japanese celebrity users with many followers as seed users. Second, we collected the posts, i.e., tweets (excluding replies and retweets by definition), replies, and retweets, made by the seed users using the user timeline API provided by Twitter. We collected all the posts made by the seed users between March 15 and December 9, 2011. For the period between March 11 and 15, we collected the 3000 most recent posts for each user due to the limitation imposed by Twitter on the user timeline API. Third, we added about 1000 users who received the largest number of responses within the data collected up to the previous step. Here, a response to a user X is operationally defined as either a retweet of a post by X containing at least one Japanese character or a reply, containing a Japanese character, to a post by X. Fourth, we collected posts made by each of the newly added users. We collected all the posts of each newly added user between the time at which the newly added user was detected for the first time and December 9, 2011. We also collected the most recent 3000 posts of the newly added user before the time the user was detected for the first time, under the condition that the posts are dated March 11, 2011 or later. Fifth, we repeated the third and fourth steps a large number of times to expand the set of the users and posts.
We excluded the replies and retweets that did not explicitly contain the IDs of the posts that these replies and retweets referred to, because construction of directed links was not straightforward for these replies and retweets. As a result, restricted to the period between December 1 and 9, 2011, we obtained 505,557 users and 57,982,740 posts including 24,280,912 replies and 5,478,846 retweets. Before analyzing the data, we anonymized the user IDs and discarded the contents of the posts while keeping the information about the mentioning relationships between all pairs of posts. We discarded retweets and defined tweets and replies as nodes. A directed link is defined as a dyadic relationship from the mentioning post, which is a reply (because we have discarded retweets), to the mentioned post, which is either a tweet or a reply.
2.2 Reply tree
Unless otherwise stated, we exclude isolated tweets, i.e., those never mentioned by any reply post within the observation period, from the definition of the reply tree. Therefore, the size of a reply tree in terms of the number of nodes, denoted by S, is at least two. Owing to our data collection method, we exhaustively collected all reply trees containing at least one sampled user unless the tweet at the root of the reply tree occurred before the observation period. We discarded reply trees whose root (i.e., tweet) was dated before the observation period. Then, there are 2,170,021 reply trees, which are by definition as many as the tweets that are posted in the observation period and have been mentioned at least once. The number of replies summed over all the reply trees is equal to 6,903,147.
We cannot exclude the possibility that a reply tree grows after the observation period by receiving a new reply. If a reply tree starts from a tweet located near the end of the observation period, the tree is likely to grow even after the observation period. We confirmed that statistics of reply trees calculated from the entire data set did not considerably differ from those calculated from the partial data set composed of the reply trees whose roots were located in the first half of the observation period (Appendix 1). We focus on the entire data set in the following.
3 Results of data analysis
The depth of a reply tree, denoted by D, is defined as the maximal distance from the root (Kumar et al. 2010; Wang et al. 2011; Gómez et al. 2013). For example, the reply tree shown in Fig. 1 has \(D=5\). The survivor function of D is shown by the solid line in Fig. 2b. The distribution of D is short-tailed with a CV value of 1.01, which is consistent with previous results (Kumar et al. 2010).
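As a minimal sketch of how S and D could be computed from the directed links defined in Sect. 2.2, the following Python function takes a list of (reply, mentioned post) pairs and returns the size and depth of each reply tree. The function name, the input format, and the toy post IDs are illustrative assumptions and are not taken from the original analysis code.

```python
from collections import defaultdict, deque

def size_and_depth(links):
    """Return {root: (S, D)} given (reply_id, mentioned_id) pairs (hypothetical format)."""
    children = defaultdict(list)
    has_parent = set()
    for reply, mentioned in links:
        children[mentioned].append(reply)
        has_parent.add(reply)
    # A root is a mentioned post that does not itself mention any post, i.e., a tweet.
    roots = [node for node in children if node not in has_parent]
    result = {}
    for root in roots:
        size, depth = 0, 0
        queue = deque([(root, 0)])
        while queue:  # breadth-first traversal from the root
            node, dist = queue.popleft()
            size += 1
            depth = max(depth, dist)
            for child in children.get(node, ()):
                queue.append((child, dist + 1))
        result[root] = (size, depth)
    return result

# Toy example: tweet "t" receives replies "r1" and "r2"; "r3" replies to "r1".
toy_links = [("r1", "t"), ("r2", "t"), ("r3", "r1")]
print(size_and_depth(toy_links))
```

Isolated tweets never appear as a mentioned post in the link list, so they are excluded automatically, in line with the convention that \(S\ge 2\).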
It should be noted that the results shown in Fig. 3a are not comparable with those in Gómez et al. (2013), which has also investigated the relationship between S and D. This is because D values averaged over discussion trees possessing the same value of S are examined in Gómez et al. (2013). In contrast, we are concerned with distributions of S and D for individual trees.
The survivor function of the length of segment, \({\rm P}_{\mathrm{surv}}(\lambda )\), is shown by the solid line in Fig. 5. The mean of \(\lambda\) is equal to 2.15. The distribution is roughly approximated by a lognormal distribution \({\rm P}(\lambda ) = \exp \left\{ -\left[ \ln (\lambda -1) - \mu \right] ^2/2\sigma ^2 \right\} / \left[ \sqrt{2\pi }\sigma (\lambda -1)\right]\) with \(\mu =0\) and \(\sigma =1\) (dashed line showing the survivor function of the fitted lognormal distribution). The empirical distribution \(\mathrm{P}(\lambda )\) is not long-tailed, with a CV value of 0.97. Although there are segments whose \(\lambda\) is much larger than the mean, their frequency is too small to qualify the distribution as long-tailed.
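The two summary statistics used throughout this section, the survivor function and the coefficient of variation (CV), can be sketched as follows. This is an illustration under stated assumptions: the survivor function is taken here as the fraction of samples greater than or equal to a given value, and the CV as the population standard deviation divided by the mean; the text does not specify either convention, and the sample values are toy data.

```python
from collections import Counter
from statistics import mean, pstdev

def survivor_function(values):
    """Return sorted (x, P(X >= x)) pairs; the '>=' convention is an assumption."""
    counts = Counter(values)
    n = len(values)
    remaining = n
    pairs = []
    for x in sorted(counts):
        pairs.append((x, remaining / n))
        remaining -= counts[x]
    return pairs

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

toy_lengths = [1, 1, 2, 4]  # hypothetical segment lengths
print(survivor_function(toy_lengths))
print(coefficient_of_variation(toy_lengths))
```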
4 Modeling with branching processes
Table 1 Empirical distributions used in each model. For models 3, 4, and 5, copula variants were also examined in Sect. 4.4.
Model  Distributions used 
Galton–Watson (model 1)  \({\rm P}(k)\) 
Correlated Galton–Watson (model 2)  \({\rm P}(k_\mathrm{t})\), \({\rm P}(k\mid k_{\mathrm{prev}})\) 
Model 3  \({\rm P}(k_\mathrm{t})\), \({\rm P}(\lambda )\), \({\rm P}(k_\mathrm{e})\) 
Model 4  \({\rm P}(k_\mathrm{t})\), \({\rm P}(\lambda \mid k_\mathrm{t})\), \({\rm P}(k_\mathrm{e})\) 
Model 5  \({\rm P}(k_\mathrm{t})\), \({\rm P}(\lambda \mid k_\mathrm{t})\), \({\rm P}(k_\mathrm{e}\mid k_\mathrm{t})\) 
It is worth noting at this point how reply trees in Twitter were previously modeled. In Kumar et al. (2010), three models were considered. In the first model, a new reply chooses which tweet or reply to attach to with the probability proportional to a linear combination of its indegree and age. The second model extends the first model by considering the authorship of each reply. The third model is a branching process model with multiple types of replies, each of which is associated with a separate indegree distribution. The type of each reply is estimated by the expectation–maximization algorithm. It is difficult to conclude which of the three models best fits Twitter reply trees, because the main focus of Kumar et al. (2010) was on data different from those obtained from Twitter and the comparison between the model results and the Twitter data was not provided in quantitative terms.
4.1 Measurements
For each model, we generate the joint distribution of the size and depth of reply tree, P(S, D), using the same number of samples as that for the empirical data (i.e., \(N=2,170,021\)), and compare it with the empirical distribution shown in Fig. 3a.
For quantitative comparisons, we also generate \(N=5\times 10^7\) synthetic reply trees from each model and carry out the following analysis. First, we measure marginalized survivor functions of the size and depth of reply trees, i.e., \({\rm P}_{\mathrm{surv}}(S)\) and \({\rm P}_{\mathrm{surv}}(D)\). Second, we measure the fraction of long path-like reply trees, large star-like reply trees, and large irregular reply trees as follows. We define the long path-like tree as a reply tree satisfying \(S-d_1\le D\le S-1\) and \(S\ge 50\), where \(d_1\) is a threshold value and presumably much smaller than D. Only exact paths are counted if \(d_1=1\). Similarly, the large star-like tree is defined by \(1\le D\le d_2\) and \(S\ge 50\), where \(d_2\) is a presumably small threshold. The large irregular tree is defined by \(d_3 \le D \le S-d_4\) and \(S\ge 50\), where \(d_3\) and \(d_4\) are thresholds. We measure the fraction of long path-like trees, that of large star-like trees, and that of large irregular trees, relative to all generated reply trees for various threshold values.
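The three definitions above translate directly into a small classifier. The following Python sketch is illustrative: the function name and the threshold values in the usage lines are assumptions chosen for the example, and the function returns a set of labels because, depending on the thresholds, the three conditions are not guaranteed to be mutually exclusive.

```python
def classify_tree(size, depth, d1, d2, d3, d4, min_size=50):
    """Return the set of tree types whose defining inequalities hold for (S, D)."""
    labels = set()
    if size < min_size:
        return labels  # all three types require S >= 50
    if size - d1 <= depth <= size - 1:
        labels.add("path-like")
    if 1 <= depth <= d2:
        labels.add("star-like")
    if d3 <= depth <= size - d4:
        labels.add("irregular")
    return labels

# Hypothetical thresholds, for illustration only.
thresholds = dict(d1=1, d2=2, d3=10, d4=10)
print(classify_tree(101, 100, **thresholds))  # an exact path with 101 nodes
print(classify_tree(101, 1, **thresholds))    # an exact star with 100 direct replies
print(classify_tree(201, 100, **thresholds))  # two long branches from the root
```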
4.2 Galton–Watson process and its correlated variant
4.2.1 Galton–Watson process (model 1)
Given the moderately heterogeneous indegree distribution of the reply trees, the simplest model is probably the Galton–Watson branching process, in which we draw the indegree of each node from the empirical degree distribution P(k) (Harris 1963; Kimmel and Axelrod 2002). The Galton–Watson process defines model 1 (Table 1). Because it always holds that \(S\ge 2\) according to our convention, we discard samples that have yielded an isolated root node, which would result in \(S=1\).
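A minimal simulation sketch of model 1 is given below. It generates a tree generation by generation, tracks only S and D, and rejects isolated roots. The indegree distribution `toy_pk` is a hypothetical stand-in for the empirical P(k), and the `max_nodes` cap is an added safeguard against runaway trees; neither is part of the original model description.

```python
import random

def sample_gw_tree(pk, rng, max_nodes=10**6):
    """Simulate one Galton-Watson tree and return (S, D).

    pk maps an indegree k to its probability. Growth stops once the
    tree reaches max_nodes (a safeguard assumed for this sketch).
    """
    ks = list(pk)
    weights = [pk[k] for k in ks]
    size, depth = 1, 0
    current = 1  # number of nodes in the current generation
    while current > 0 and size < max_nodes:
        # Each node in the current generation draws its indegree independently.
        offspring = sum(rng.choices(ks, weights, k=current))
        if offspring > 0:
            depth += 1
        size += offspring
        current = offspring
    return size, depth

def sample_reply_tree(pk, rng):
    """Discard isolated roots so that S >= 2, following the stated convention."""
    while True:
        size, depth = sample_gw_tree(pk, rng)
        if size >= 2:
            return size, depth

toy_pk = {0: 0.6, 1: 0.25, 2: 0.1, 3: 0.05}  # hypothetical, mean indegree 0.6
rng = random.Random(0)
print([sample_reply_tree(toy_pk, rng) for _ in range(5)])
```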
The distributions of S and D produced by model 1 are compared with the empirical distribution in Fig. 2a, b, respectively. The model overestimates the probability at large S and underestimates the probability at large D.
The joint distribution of S and D obtained from model 1 is shown in Fig. 3b. We observe that the model does not generate long path-like reply trees (i.e., near the diagonal for large S and D), which contrasts with the empirical data (Fig. 3a). This result is consistent with Fig. 2b, which indicates the lack of trees with large D for model 1. More quantitatively, the fraction of long path-like trees as defined in Sect. 4.1 is almost equal to zero for the range of \(d_1\) shown in Fig. 6a. Long path-like trees are absent because the length of segment, \(\lambda\), for model 1 (i.e., Galton–Watson process) obeys the geometric distribution \(\mathrm{P}(\lambda ) = (1-p)p^{\lambda -1}\), where \(p={\rm P}(k=1)\). The geometric distribution with the value of p estimated from the empirical data is shown by the dotted line in Fig. 5, confirming that model 1 does not produce long path-like trees as observed in the empirical data. The CV for \(\lambda\) obtained from model 1 is equal to 0.77, which is considerably smaller than that for the empirical data, i.e., 0.97 (Sect. 3). Model 1 does not produce a realistic fraction of large star-like reply trees, either (see Fig. 6b where \(d_2\) is small). Finally, model 1 overestimates the frequency of large irregular trees relative to the empirical data (insets of Fig. 6c, d). In summary, the standard Galton–Watson process does not reproduce chief statistical characteristics of reply trees observed in the empirical data.
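The CV of the geometric distribution can be worked out in closed form, which makes explicit why it stays below one. For \(\lambda \in \{1, 2, \ldots \}\) with \(\mathrm{P}(\lambda ) = (1-p)p^{\lambda -1}\), the standard moments are

\[
\mathrm{E}[\lambda ] = \frac{1}{1-p}, \qquad \mathrm{Var}[\lambda ] = \frac{p}{(1-p)^2}, \qquad \mathrm{CV} = \frac{\sqrt{\mathrm{Var}[\lambda ]}}{\mathrm{E}[\lambda ]} = \sqrt{p} .
\]

Because \(p<1\), the CV of a geometrically distributed segment length is always smaller than one. As a rough consistency check, and assuming that the geometric form holds exactly, a CV of 0.77 would correspond to \(p\approx 0.77^2\approx 0.59\).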
4.2.2 Correlated Galton–Watson process (model 2)
In an attempt to improve fitting of the model to the empirical data, we consider the so-called correlated Galton–Watson process (model 2). In this model, the indegree of replies is drawn from the conditional distribution \({\rm P}(k\mid k_{\mathrm{prev}})\), where \(k_{\mathrm{prev}}\) is the degree of the previous node (defined as the node that the focal reply node mentions). By convention, \({\rm P}(X\mid Y)\) here and in the following indicates the distribution of X conditioned on the value of Y. The correlated Galton–Watson process is a special case of the so-called macro process model (Olofsson 1996). In fact, Fig. 7a indicates that the indegree of a node considerably decreases on average as the indegree of the previous node increases, which is consistent with the assumption of model 2. To initiate a reply tree, we draw the indegree of the tweet, \(k_\mathrm{t} (\ge 1)\), from the empirical distribution of the indegree constructed from all tweets with \(k_\mathrm{t}\ge 1\), i.e., \({\rm P}(k_\mathrm{t})\), because a tweet does not have a previous node.
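The generation rule of model 2 can be sketched as follows, tracking only S and D. The dictionaries `toy_root` and `toy_cond` are hypothetical stand-ins for the empirical distributions of the root indegree and of the conditional indegree, and the `max_nodes` cap is a safeguard assumed for this sketch.

```python
import random

def sample_correlated_gw(p_root, p_cond, rng, max_nodes=10**6):
    """Return (S, D) for one tree of a correlated Galton-Watson sketch.

    p_root: dict mapping k_t (>= 1) to its probability.
    p_cond: dict mapping k_prev to a dict {k: probability}.
    """
    def draw(dist):
        ks = list(dist)
        return rng.choices(ks, [dist[k] for k in ks])[0]

    size, depth = 1, 0
    # Each frontier entry is (indegree of a node, distance of that node from the root).
    frontier = [(draw(p_root), 0)]
    while frontier and size < max_nodes:
        k_prev, dist = frontier.pop()
        for _ in range(k_prev):
            # The child's indegree is conditioned on the indegree of the node it mentions.
            k = draw(p_cond[k_prev])
            size += 1
            depth = max(depth, dist + 1)
            if k > 0:
                frontier.append((k, dist + 1))
    return size, depth

toy_root = {1: 0.7, 2: 0.2, 3: 0.1}  # hypothetical distribution of k_t
toy_cond = {                         # hypothetical conditional distributions
    1: {0: 0.5, 1: 0.4, 2: 0.1},
    2: {0: 0.7, 1: 0.25, 2: 0.05},
    3: {0: 0.8, 1: 0.2},
}
rng = random.Random(0)
print([sample_correlated_gw(toy_root, toy_cond, rng) for _ in range(5)])
```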
Figure 2a indicates that model 2 produces a distribution of S similar to the empirical one despite some noticeable deviation in a middle range of S. Figure 2b indicates that model 2 underestimates the probability of D at large values of D relative to the empirical data. Figure 6a, together with the joint distribution P(S, D) shown in Fig. 3c, indicates that model 2 does not produce long path-like reply trees, similarly to model 1. The CV value for \(\lambda\) obtained from model 2 is equal to 0.79, which is close to the value for model 1 and smaller than that for the empirical data. This result is consistent with the fact that model 2 produces the geometric distribution of \(\lambda\), i.e., \({\rm P}(\lambda ) = (1-p)p^{\lambda -1}\), where \(p={\rm P}(k=1\mid k_{\mathrm{prev}}=1)\). Model 2 produces a realistic frequency of large star-like reply trees across a range of threshold \(d_2\) (Fig. 6b). However, model 2 by far underestimates the frequency of large irregular trees in the entire range of \(d_3\) and \(d_4\) (Fig. 6c, d).
4.3 Models that explicitly use the empirical distribution of the segment length
At best, the Galton–Watson processes (models 1 and 2) produce a realistic fraction of large star-like reply trees but not long path-like trees or large irregular trees. The models do not produce realistic distributions of S and D, either. Therefore, we explore models that go beyond the family of conventional branching processes. In the models considered in this and the following sections, we draw \(\lambda\) from empirically determined distributions. In fact, segments are generated by users’ microscopic behavior. We have decided not to model this factor, and the limitation of the present approach will be discussed in Sect. 5.
4.3.1 Model 3
We extend the Galton–Watson process to define model 3 as follows. First, we draw the indegree of the tweet, \(k_\mathrm{t}\), from the empirical distribution \({\rm P}(k_\mathrm{t})\), as is done in model 2. Second, we draw the length of each of the \(k_\mathrm{t}\) segments starting from the root independently from the empirical distribution, \({\rm P}(\lambda )\). Third, for each segment, the indegree of the end node, denoted by \(k_\mathrm{e}\), is drawn from \({\rm P}(k_\mathrm{e})\), which is constructed from all end nodes of segments in all empirical reply trees. The use of different indegree distributions for tweets and replies is motivated by a clear difference between \({\rm P}(k_\mathrm{t})\) and \({\rm P}(k_\mathrm{r})\) in the empirical data shown in Fig. 4. It should be noted that \({\rm P}(k_\mathrm{e} = k^\prime ) \propto \mathrm{P}(k_\mathrm{r} = k^\prime )\) for \(k^\prime \ge 2\). It should be also noted that, because the end node of a segment is either a leaf or a branching node, \({\rm P}(k_\mathrm{e}=1)=0\). Fourth, if the end node of a segment attains \(k_\mathrm{e}\ge 2\), the lengths of \(k_\mathrm{e}\) segments starting from this node are independently drawn from \(\mathrm{P}(\lambda )\). We repeat the procedure until all branches terminate.
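The four steps above can be sketched as a short simulation that tracks only S and D. The sketch assumes that \(\lambda\) counts the nodes of a segment excluding its start node, so that a segment of length \(\lambda\) adds \(\lambda\) nodes and ends \(\lambda\) steps below its start. The sampler functions, the toy distributions, and the `max_nodes` cap are hypothetical and serve only as an illustration.

```python
import random

def make_sampler(dist, rng):
    """Return a zero-argument function that draws from a dict {value: probability}."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return lambda: rng.choices(values, weights)[0]

def sample_model3(draw_kt, draw_lambda, draw_ke, max_nodes=10**6):
    """Return (S, D) of one tree generated by a sketch of model 3."""
    size, depth = 1, 0
    # Each stack entry: (number of segments starting at a node, depth of that node).
    stack = [(draw_kt(), 0)]
    while stack and size < max_nodes:
        n_segments, start_depth = stack.pop()
        for _ in range(n_segments):
            lam = draw_lambda()   # segment length, drawn independently
            size += lam
            end_depth = start_depth + lam
            depth = max(depth, end_depth)
            k_e = draw_ke()       # 0 for a leaf, >= 2 for a branching node
            if k_e >= 2:
                stack.append((k_e, end_depth))
    return size, depth

rng = random.Random(0)
draw_kt = make_sampler({1: 0.7, 2: 0.2, 3: 0.1}, rng)       # hypothetical P(k_t)
draw_lambda = make_sampler({1: 0.5, 2: 0.3, 5: 0.2}, rng)   # hypothetical P(lambda)
draw_ke = make_sampler({0: 0.8, 2: 0.15, 3: 0.05}, rng)     # hypothetical P(k_e)
print([sample_model3(draw_kt, draw_lambda, draw_ke) for _ in range(5)])
```

With a root of indegree 2, two segments of length 100, and no further branching, the function returns \(S=201\) and \(D=100\).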
The joint distribution of S and D obtained from model 3 is shown in Fig. 3d. The model produces some long path-like reply trees (i.e., \(S\approx D\) and large S). In addition, the distribution of D is similar between the model and data (Fig. 2b). However, model 3 is still unsatisfactory for the following reasons. First, model 3 overestimates the probability of S at large S (Fig. 2a). Second, as shown in Fig. 6a, long path-like trees are much fewer in model 3 than in the empirical data for the entire range of \(d_1\) examined in the figure. Third, the model does not produce sufficiently many large star-like trees (at small \(d_2\) in Fig. 6b). Fourth, the model overestimates the fraction of large irregular trees (Fig. 6c, d).
4.3.2 Model 4
Empirically, Fig. 7b indicates that \(\lambda\) decreases on average with the indegree of the root tweet (i.e., \(k_\mathrm{t}\)). Therefore, we extend model 3 by assuming that the distribution of \(\lambda\) depends on the indegree of the tweet at the root of the reply tree. In the extended model, which we refer to as model 4, we draw the length of each segment from \(\mathrm{P}(\lambda \mid k_\mathrm{t})\). Then, the indegree of the end node of each segment is drawn from \({\rm P}(k_\mathrm{e})\) constructed from the empirical data, which is the same as in model 3 (Table 1).
The distributions of both S (Fig. 2a) and D (Fig. 2b) are close between model 4 and the empirical data. The joint distribution of S and D obtained from the model is shown in Fig. 3e. The fraction of long path-like trees for small \(d_1\) is similar between the model and data (Fig. 6a). However, large star-like trees (see Fig. 6b where \(d_2\) is small) and large irregular trees (Fig. 6c, d) are considerably fewer for the model than for the empirical data.
4.3.3 Model 5
We consider a further extension of the model in which \(k_\mathrm{e}\) for each end node of a segment is drawn from the empirically constructed conditional distribution \({\rm P}(k_\mathrm{e}\mid k_\mathrm{t})\) instead of the unconditional distribution \({\rm P}(k_\mathrm{e})\) employed in models 3 and 4. It should be noted that the end node does not have to be that for the segment emanating from the root. We refer to the extended model as model 5. This extension of the model is empirically supported; the indegree of the end node of a segment considerably depends on the indegree of the tweet that initiates the reply tree (Fig. 7c).
Similarly to model 4, model 5 produces distributions of S (Fig. 2a) and D (Fig. 2b) that are close to the empirical data. The joint distribution P(S, D) for model 5 is shown in Fig. 3f. The fraction of long path-like trees with small \(d_1\) (Fig. 6a) and that of large star-like trees for a range of \(d_2\) (Fig. 6b) are not far from those for the empirical data. However, the model produces far fewer large irregular trees than the empirical data (Fig. 6c, d).
4.4 Models with correlated segment lengths
4.4.1 Empirical evidence of correlated segment lengths
The models introduced so far are incapable of producing a realistic frequency of large irregular trees. Although large irregular trees are rare even in the empirical data (Fig. 6c, d), they are suggestive of mechanisms that generate an entire reply tree. If the \(\lambda\) values for the \(k_\mathrm{s}\) segments starting from the same node are positively correlated, large irregular trees are expected to occur relatively easily. This is because a large \(\lambda\) value in one branch implies a relatively high probability of large \(\lambda\) values in other branches in the same reply tree. For example, if the root has indegree 2, both of the two segments have \(\lambda =100\), and no further branching occurs, we obtain a large irregular tree with \(S=201\) and \(D=100\).
Denote by \(\langle\lambda \rangle_\mathrm{s}\) the average of \(\lambda\) over the \(k_\mathrm{s}\) segments starting from the same node. If the \(\lambda\) values for the \(k_\mathrm{s}\) segments are positively correlated, \(\langle\lambda \rangle_\mathrm{s}\) statistically fluctuates more than realizations of \(\langle\lambda \rangle_\mathrm{s}\) calculated on the basis of independent \(\lambda\) values as we assumed in models 3, 4, and 5. In the independent case, \(\langle\lambda \rangle_\mathrm{s}\) has standard deviation equal to \(\sigma (\lambda )/\sqrt{k_\mathrm{s}}\), where \(\sigma (\lambda )\) is the standard deviation of \(\lambda\) calculated from \({\rm P}(\lambda )\). It should be noted that the mean of \(\langle\lambda \rangle_\mathrm{s}\) is the same between the empirical and the independent cases, because we use the empirical \(\mathrm{P}(\lambda )\) to independently draw \(\lambda\) values for the \(k_\mathrm{s}\) segments in models 3, 4, and 5.
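The effect of correlation on the fluctuation of \(\langle\lambda \rangle_\mathrm{s}\) can be made explicit with a short calculation. As an illustrative assumption (not a model used in the paper), suppose that every pair of the \(k_\mathrm{s}\) segment lengths has the same correlation coefficient \(r\). Then

\[
\mathrm{Var}\left[ \langle\lambda \rangle_\mathrm{s}\right] = \frac{1}{k_\mathrm{s}^2}\left[ k_\mathrm{s}\,\sigma (\lambda )^2 + k_\mathrm{s}(k_\mathrm{s}-1)\, r\, \sigma (\lambda )^2\right] = \frac{\sigma (\lambda )^2}{k_\mathrm{s}}\left[ 1 + (k_\mathrm{s}-1)\, r\right] .
\]

Setting \(r=0\) recovers the independent case, with standard deviation \(\sigma (\lambda )/\sqrt{k_\mathrm{s}}\), whereas any \(r>0\) with \(k_\mathrm{s}\ge 2\) yields a larger standard deviation.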
The standard deviation of \(\langle\lambda \rangle_\mathrm{s}\) calculated from the empirical data and that calculated from \({\rm P}(\lambda )\) under the independence assumption are plotted against \(k_\mathrm{s}\) in Fig. 7d. The figure suggests that the fluctuation of \(\langle\lambda \rangle_\mathrm{s}\) is larger for the empirical data than under the independence assumption unless \(k_\mathrm{s}\) is large. The amount of fluctuation is the same between the two cases when \(k_\mathrm{s}\) is large. Therefore, \(\lambda\) observed in the empirical data may be positively correlated across segments sharing a start node.
4.4.2 Models 3, 4, 5 with copula
Motivated by the results shown in Sect. 4.4.1, we extend models 3, 4, and 5 (Sect. 4.3) to allow \(\lambda\) to be correlated among segments emanating from the same starting node as follows. For each start node, we use a \(k_\mathrm{s}\)-dimensional multivariate normal distribution to generate \(k_\mathrm{s}\) correlated variables denoted by \((x_1, \ldots , x_{k_\mathrm{s}})\). We assume that each \(x_i\) (\(1\le i\le k_\mathrm{s}\)) is distributed according to the standard normal distribution (i.e., mean zero and standard deviation one) when marginalized. Then, we transform each \(x_i\) to \(\lambda _i\), the value of \(\lambda\) for the ith segment, such that the marginal distribution of \(\lambda _i\) coincides with the empirical \({\rm P}(\lambda )\).
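One standard way to realize such a transformation is sketched below. It rests on two assumptions that are not stated in the text above: all pairs of latent variables share a single correlation coefficient `rho` between 0 and 1, which allows a one-factor construction of the multivariate normal vector, and each \(x_i\) is mapped to \(\lambda _i\) through the standard normal CDF followed by the inverse CDF of the target distribution. The distribution `toy_p_lambda` is a hypothetical stand-in for the empirical \({\rm P}(\lambda )\).

```python
import bisect
import random
from statistics import NormalDist

def make_inverse_cdf(p_lambda):
    """Build the inverse CDF of a discrete distribution {value: probability}."""
    values = sorted(p_lambda)
    cum, total = [], 0.0
    for v in values:
        total += p_lambda[v]
        cum.append(total)

    def inverse_cdf(u):
        # Smallest value whose cumulative probability is at least u.
        idx = bisect.bisect_left(cum, u * total)
        return values[min(idx, len(values) - 1)]

    return inverse_cdf

def correlated_lengths(k_s, rho, inverse_cdf, rng):
    """Draw k_s segment lengths whose latent normals have pairwise correlation rho.

    Assumes 0 <= rho <= 1 and an equicorrelated structure (sketch only).
    """
    phi = NormalDist().cdf
    shared = rng.gauss(0.0, 1.0)  # factor common to all segments of this node
    lengths = []
    for _ in range(k_s):
        # x has unit variance, and any two x's have covariance rho.
        x = (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
        lengths.append(inverse_cdf(phi(x)))
    return lengths

toy_p_lambda = {1: 0.5, 2: 0.3, 5: 0.2}  # hypothetical P(lambda)
inverse_cdf = make_inverse_cdf(toy_p_lambda)
rng = random.Random(0)
print(correlated_lengths(4, 0.7, inverse_cdf, rng))
```

Because the uniform variable \(\Phi (x_i)\) is passed through the inverse CDF, each \(\lambda _i\) follows the target marginal distribution regardless of the value of `rho`.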
We numerically examine the copula variants of models 3, 4, and 5, which by definition employ distributions of \(\lambda\). We set the correlation parameter of the copula to \(\rho =0.7\), which we found to produce results relatively close to the empirical data. Figure 2c indicates that model 3 with copula overestimates the probability of S at large S, whereas models 4 and 5 with copula produce distributions of S close to the empirical one. All three models with copula produce distributions of D close to the empirical one (Fig. 2d). The joint distributions P(S, D) for the three copula models are shown in Fig. 3g–i. The fraction of long path-like reply trees, that of large star-like reply trees, and that of irregular reply trees are shown in Fig. 6e–h. Similar to the distribution of S (Fig. 2c), the figures indicate that introduction of the copula improves models 4 and 5, but not model 3. Figure 6e indicates that the fraction of long path-like trees is similar among model 4 with copula, model 5 with copula, and the empirical data when \(d_1\) is small, although significant discrepancies remain for large \(d_1\). Figure 6f indicates that the fraction of large star-like trees is close between model 5 with copula and the empirical data over the entire range of \(d_2\). It should be noted that model 4 with \(\rho =0\) and \(\rho =0.7\) produces far fewer star-like trees than the empirical data when \(d_2\) is small. Figure 6g, h indicate that the fraction of large irregular trees is also similar among model 4 with copula, model 5 with copula, and the empirical data for the entire parameter range explored by our numerical simulations. We conclude that model 5 with \(\rho =0.7\) captures main statistical properties of reply trees observed empirically, despite a notable discrepancy in the frequency of long path-like trees when \(d_1\) is not small (Fig. 6e).
5 Conclusions
We analyzed the structure of reply trees observed in Twitter. We examined a suite of branching process models to capture properties of empirical data in terms of the frequency of long path-like reply trees, large star-like reply trees, and large irregular reply trees, which are typologies proposed in Cogan et al. (2012), as well as the distributions of the size and depth of the reply tree. The Galton–Watson process and its correlated variant did not produce realistic statistics of reply trees. Our final model (i.e., model 5 with copula) assumed that the segment length (i.e., \(\lambda\)) and the degree of end nodes of segments depended on the indegree of the tweet located at the root of the reply tree. These assumptions imply that the tweet at the root, whose first-order properties may be encoded in its indegree, seems to be a strong determinant of the shape of the reply tree (Wang et al. 2011; Li et al. 2012; Gómez et al. 2013). The final model also assumed that \(\lambda\) was positively correlated among segments starting from the same node. This assumption is also in line with the idea that the indegree of the tweet affects the entire topology of the reply tree for the following reason. Owing to their contents, some tweets may tend to induce long segments in the reply trees rooted at them. Other tweets may tend to induce short segments. If this is the case, different segments in a reply tree would be positively correlated. For simplicity, in our models, we introduced positive correlation only to segments sharing the start node.
Copulas have been used for generating correlated networks (Gleeson 2008; Raschke et al. 2014). In these studies, two-dimensional copulas were used for defining the joint degree distribution of an adjacent pair of nodes. In contrast, the present study employed a Gaussian copula of a general dimension to produce correlated segments sharing a start node.
A serious limitation of the present development is that we have plugged the empirical distribution of the length of segments, e.g., \({\rm P}(\lambda )\), directly into models 3, 4, and 5. Then, we focused on other structural properties of reply trees such as correlation between segments sharing the start node. However, the mechanisms governing such correlations are not clear. In addition, users seem not to care about the length of segments when deciding whether or not to reply to other posts. Branching process models have also been criticized for not being able to explain other aspects of networks of posts (Kumar et al. 2010; Wang et al. 2011). An alternative, agent-based approach is growing network models, in which a node with outdegree one joins an existing tree according to a certain attachment rule. This approach, which has been used for modeling networks of posts (Götz et al. 2009; Kumar et al. 2010; Li et al. 2012; Wang et al. 2012; Gómez et al. 2013; Gleeson et al. 2014), may also be useful for understanding the current data set.
Acknowledgments
N.M. acknowledges the support provided through CREST, JST. M.T. was partially supported by JSPS KAKENHI Grant Number 25280111.
Funding information
Funder Name  Grant Number  Funding Note 
Japan Science and Technology Agency (JP) 
Japan Society for the Promotion of Science (JP) 
Japan Science and Technology Agency (JP) 
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.