1 Introduction

Wind energy is one of the major renewable sources, but it has spurred a significant discussion about its environmental impact since its deployment, as investigated, e.g., in Saidur et al. (2011), Leung and Yang (2012), Dai et al. (2015). A large part of the debate has been channelled through social media, among which Twitter is probably the most popular, with 330 million monthly active users (see https://www.oberlo.com/blog/twitter-statistics), due to its short messages (aka tweets, shorter than 140 characters) (O’Reilly and Milstein 2011). Tweets may include hashtags (identified by the preceding # symbol), which allow associating a topic to the message: all messages related to a specific topic can be retrieved at once by searching for the associated hashtag. Users may decide to follow somebody and receive all his/her updates, i.e. by an opt-in mechanism.

Though anybody may send a message over Twitter, stakeholders in the wind energy debate may act more persistently to send their message through.

In fact, people do not constantly scroll Twitter all day and may miss tweets. Time zones further complicate the matter since a message sent in one’s morning may be received by the recipient during his/her night, increasing the chance of missing it. Twitterers may therefore post their tweets again and again, on different days and at different times, increasing their chance of reaching their intended readers and hammering their message (like in the old Latin adage repetita iuvant). In the following, we refer to this practice as self-reposting to distinguish it from the general retweeting mechanism (at the basis of Twitter), where you share a tweet originally posted by others. Tools like Buffer (https:buffer.com, HootSuite (https:www.hootsuite.com), and Sprout Social (https://sproutsocial.com) make it easy easy to re-share the same tweet so as to publish it over and over. We guess this practice is heavily carried out by people having a strong interest in pushing the discussion about wind energy.

Twitter is aware of the practice and its potentially harmful consequences. The recent overhaul of its terms of service forbids users to schedule tweets with identical content across multiple accounts. This move was intended to get more original content on the platform and reduce spam and bots.

Though self-reposting is not necessarily spurred by obnoxious intentions, analysing its characteristics is helpful for several reasons. It helps build a deeper picture of twitterers’ behaviour and the traffic they generate. It may also help Twitter’s management staff tune their policies by identifying its features against the widely accepted retweeting mechanism. In our case, it helps us understand who wishes to take the lead in the discussion about wind energy over Twitter.

While some work has been conducted for mining textual content that users generate or analysing the social network structure in social perception on energy economics (Austmann and Vigne 2021; Reboredo and Ugolini 2018; Agarwal et al. 2020), no publication has studied the underlying mechanism of the retweeting behaviour yet. This lack is even more true in the context of renewable energies. This is not a secondary issue because retweeting became an established convention inside Twitter, and the literature sees it as a conversational practice in which “authorship, attribution, and communicative fidelity are negotiated in diverse ways” (Suh et al. 2010). For this reason, in our paper, we investigate the characteristics of self-reposting in the specific case of wind energy. We aim to understand where self-reposting is different from retweeting others’ messages. Of course, we do not refer to their definition (i.e., retweeting one’s own tweets rather than other people’s ones). Instead, we wish to find out the meta-characteristics of messages (e.g. their length) and their occurrence (e.g., the day of the week when they are posted) that make them more likely to be self-reposted. In other words, we wish to identify the determinants of self-reposting.

In order to accomplish that task, we formulate the following three research questions (RQ):

  1. RQ1

    When do self-reports occur?

  2. RQ2

    Does retweeting frequency impact self-reposting behaviour?

  3. RQ3

    Are tweets meta-characteristics associated with self-reposting?

For our purpose, we employ both an exploratory analysis and a machine learning-based classification approach (through a decision tree), employing a dataset obtained by collecting tweets over the topic of wind energy.

After an analysis of the literature (Sect. 2), we report our original contribution:

  • Self-reposting is mostly carried out by individual twitterers, in some cases having their account suspended due to some violation of the norms of conduct;

  • International or governmental agencies, associations, and organisations resort to self-reposting less massively, while companies involved in building or managing wind power plants play quite a negligible role;

  • The most relevant features to discriminate self-reposted messages vs non-self-reposted ones appear to be the day of the week, the time of the day, the length of the tweet, the number of hashtags, and the intertweet time interval;

  • Weekends (especially Sundays) are the days of choice for self-reposting;

  • Self-reposting takes place in afternoons mostly;

  • Messages longer and containing more hashtags are more likely to be self-reposted;

  • Twitterers tend to repost their tweets when their retweeting by others gets less frequent.

2 Literature review

We wish to analyse self-reposting behaviour as opposed to retweeting and consider wind energy as a debated topic where we observe self-reposting in action. In this section, we review the literature concerning retweeting behaviour and the social perception of wind energy separately. We also analyse the use of social media in connection to energy and environmental issues.

2.1 Retweeting

Social media have attracted a great deal of attention as non-conventional sources of data. They contain timely data capable of capturing public awareness and insights into human behaviours. Among them, Twitter is undoubtedly one of the most used to investigate a wide range of topics, such as investor sentiment (Reboredo and Ugolini 2018), social impact assessment (Sherren et al. 2017), NIMBY conflicts (Wang et al. 2019, 2021), social perception of energy (Li et al. 2019; Austmann and Vigne 2021) and social appreciations (Resce and Maynard 2018).

A way of expressing the support for an opinion expressed through a tweet is by retweeting it. The reasons for retweeting have been investigated, e.g., by Boyd et al. (2010), Webberley et al. (2011). Several factors have been identified in the literature to influence retweeting: number of followers (Kim et al. 2016; Li and Liu 2017), content features such as hashtags and URLs (Pang and Law 2017), linguistic features (Wang et al. 2012), topic and/or user virality (Hoang and Lim 2013; Li and Liu 2017; Park and Kaye 2019; Shi et al. 2017), users’ and tweets’ emotional status (Chen et al. 2017; Kim et al. 2016; Park and Kaye 2019), social behaviors performed by the sender and the receiver of the retweet (Chen and Deng 2020; Shi et al. 2017), different types of network structures (Kim et al. 2016). Though retweets often represent uncritical support of the original tweet, Gruber (2017) has analysed the commented variant, where the retweeter (critically) evaluates content (or author) of the retweeted text.

Several approaches have been proposed to model and predict the retweeting dynamics. Gao et al. (2015) propose an exponential reinforcement mechanism characterising the “richer-get-richer” phenomenon by capturing a power-law temporal relaxation function corresponding to the ageing in the ability of the message to attract new retweets. Wang et al. (2013) employ the classic Susceptible-Infectious-Susceptible (SIS) epidemic model. Other approaches are based on networks (Achananuparp et al. 2012), classification trees with recursive partitioning procedure (Nesi et al. 2018), social exchange theory and game theory (O’Leary 2016), factor graph model (Yang et al. 2010), probabilistic matrix factorization (Zhang et al. 2016). Other papers, like, e.g. Lin et al. (2016); Hu et al. (2018); Li et al. (2013); Xiong et al. (2012); Wang and Lee (2020); Kim et al. (2014); shin Lim and Lee-Won (2017); Cruz and Lee (2014) focus on the importance of retweeting, analysing in some cases the evolution of its propagation.

With the rise in popularity of social media platforms like Twitter, exerting great influence on such platforms allows influencing many decisions and shaping many opinions. (Gao et al. 2016) has shown that the popularity of tweets is distributed very unevenly, with a handful of key nodes dominating the scene and gaining a high number of retweets. A few efforts have been made in the literature to describe the phenomenon of influencers, who act through a constant production of data, attempting to make themselves algorithmically recognisable (Gillespie 2017) by social media algorithms and so respond to the coercive force of the threat of invisibility (O’Meara 2019; Cotter 2019). However, some bad practices have emerged, representing a threat to the credibility of these social networking platforms. For example, Arora et al. (2020) has investigated collusive retweeting activities, e.g., those services provided by paying for them to gain influence inorganically. The impact of spammers on Twitter networks has instead been addressed by Fronzetti Colladon and Gloor (2019), while Liu et al. (2019) has analysed the crowd-retweeting spam.

However, the literature does not appear to have investigated self reposting, so far, i.e., the practice by which influencers reshare their own same content multiple times (as if it were original) to increase their retweet count and their visibility.

2.2 Wind energy

Most of the literature on wind energy concerns its diffusion and the environmental policies adopted by governments.

The need to move towards sustainable, low-carbon and affordable energy systems is driven by the urgency to reduce greenhouse gas emissions and address global environmental problems. The required international efforts to combat climate change may include a shift in the national energy mix towards renewable energies and a new portfolio of electricity generation technologies (Hoffert et al. 2002; De Jesus et al. 2018; Dhakouani et al. 2019; Aleixandre-Tudó et al. 2019). An important contributor to that energy transition is the development and implementation of a wind energy infrastructure, as shown by Muñoz and Márquez (2018), Stephens et al. (2009), Zhao et al. (2016), Jethani (2016), both in its onshore and offshore implementations as described in Weinzettel et al. (2009), Sun et al. (2012). Countries are taking steps in that direction. We cite, e.g., the USA (where 29 states require minimum levels of wind generation through renewable portfolio standards Lamy et al. (2020)), Switzerland (where citizens approved a national energy strategy in May 2017, and wind energy plays a fundamental role Vuichard et al. (2019)), and Denmark (where parliament has agreed to promote the establishment of a longer-term goal of satisfying 50% of Denmark’s electricity needs through wind power by 2020 Borch et al. (2020)). From the year 2013 to the end of 2019, the total amount of wind energy capacity grew up at a rate of 12.6%, helped by steadily falling generation costs and an increase in the size of wind turbines as well as a low average construction time (Ali Sayigh 2020).

However, all renewable energy options have their own negative impacts, and wind power is no exception. Acceptance-related issues have hindered the deployment of wind energy. For example, people living near such projects may oppose their development (Horbaty et al. 2012; Vuichard et al. 2019). We are talking about the onshore wind farms, which are still the most popular type of wind farm in the world, but also the offshore wind farms seem to spur signs of conflict in the local communities (Lamy et al. 2020; van der Loos et al. 2020). In short, the social acceptance of wind plants continues to be the key challenge facing the governments and industry (Frantál 2015).

2.3 Discussion on social media

The use of social media in relation to energy and environmental issues has a relatively long history in the literature. Social media play various roles in that relationship, where social media users may play both active and passive roles.

An example of the former is the use of social media to express protests and one’s social activism. Valenzuela (2013) have analysed that role in the framework of the massive demonstrations taking place in Chile to change the government’s energy policy.

Social media also exert an influence on one’s attitudes towards energy sources. Such influence may be driven by an intentional behaviour to influence other social media users or by the sheer pressure of the crowd of users expressing their opinion. The variety of ways whereby social media can develop their influence has been explored by Sivarajah et al. (2015) in the context of energy efficiency practices. More recently, Zobeidi (2020) has shown that trust in the information obtained from social media affects the attitudes towards conventional energy/fuels. Luedecke and Boykoff (2017) have shown that social media are largely overcoming traditional media in that role.

The overall sentiment about energy sources can now be measured on social media. Jain and Jain (2019) has employed Twitter as a data source to measure the sentiment towards renewable energy sources. Similarly, Boutakidis et al. (2014) has used questionnaires to elicit the opinions of social media users on the same issue.

Opinions expressed on social media can also be used as a proxy for physical measurements. Wang et al. (2017) have proposed to derive an environmental quality index from those opinions. For example, metrics could be obtained to measure water quality and air pollution.

3 Method

In this section, we describe the method we have applied to the analysis of tweets. We first provide a mathematical definition of self-reposting and its related variables. We then describe how we collect our data and the dataset we obtain.

3.1 Self-reposting of tweets

As hinted in the Introduction, we are interested in the phenomenon of self-reposting, i.e. in people posting again a message they have posted on Twitter in the (recent) past. Self-reposting is distinguished from the well-studied phenomenon of retweeting, where the same message is posted again by somebody other than the original twitterer. We can also refer to the same phenomenon as self-retweeting but in the following, we stick to using self-reposting to avoid confusion. In this section, we provide a mathematical description of the phenomenon.

We represent the set of all tweets by S. Tweets may be original or reposted. In the latter category, we include both tweets self-reposted by the author of the original message and tweets retweeted by somebody else. We group the original tweets in the set \(S_T\) and the reposted ones in the set \(S_{R}\). Those two are subsets of S and represent a partition of S, so that \(S=S_T\cup S_{R}\) and \(S_T\cap S_{R}=\emptyset \).

Let N be the cardinality of \(S_T\) and M be the cardinality of \(S_{R}\).

The generic element \(t_i\) of \(S_T\), \(i=1,2,\ldots , N\) is a 3-tuple \(t_i:=(t_i^{\text {text}},t_i^{\text {time}},t_i^{\text {user}})\), where \(t_i^{\text {text}}\) is the text enclosed in that tweet, \(t_i^{\text {time}}\) is the concatenation of date and UTC time when that tweet was created (e.g. 2020-03-18 11:00:29), and \(t_i^{\text {user}}\) is the user who posted the tweet.

As to the set \(S_R\), we may distinguish retweets based on the identities of the retwitterer and the original twitterer. If those identities coincide, we have self-reposting, where the twitterer of the original tweet posts it again. We group those self-reposted tweets in the \(S_{SR}\) subset. We collect those retweets that are instead retweeted by another twitterer in the \(S_{RR}\) subset. Those two subsets represent a partition of \(S_R\), so that \(S_R = S_{RR} \cup S_{SR}\) and \(S_{RR} \cap S_{SR} = \emptyset \). We denote the cardinalities of the two subsets respectively as \(M_{RR}\) and \(M_{SR}\), with \(M=M_{RR} + M_{SR}\).

The generic element \(rr_j\) of \(S_{RR}\), \(j=1,2, \ldots , M_{RR}\), is a 4-tuple \(rr_j:=(rr_j^{\text {text}},rr_j^{\text {time}},rr_j^{\text {user}},rr_j^{\text {orig}})\), where \(rr_j^{\text {text}}\) is the reposted text, \(rr_j^{\text {time}}\) is the day and time of reposting, \(rr_j^{\text {user}}\) is the user who reposted the tweet, and \(rr_j^{\text {orig}}\) is the user who posted the original tweet. Since all the elements in \(S_{RR}\) and \(S_{SR}\) are repostings of tweets in \(S_T\), the following statement holds:

$$\begin{aligned} \forall j \in (1:M_{RR}) \quad \exists ! i \in (1:N) / rr_j^{\text {text}}=t_i^{\text {text}}, rr_j^{\text {orig}}=t_i^{\text {user}}, rr_j^{\text {time}}>t_i^{\text {time}}. \end{aligned}$$

Figure 1 illustrates the relationship between those sets.

We can now define the set \(S_{SR}\) of self-reposted tweets, i.e. those tweets that are reposted by the original twitterer itself. The generic element \(sr_j\) of \(S_{SR}\) is represented by the 3-tuple \(sr_j:=(sr_j^{\text {text}},sr_j^{\text {time}},sr_j^{\text {user}})\) with obvious meaning of the elements. We do not need a fourth element here since the original twitterer and the retwitterer coincide. The following statement holds similarly to what we have stated for \(S_{RR}\)

$$\begin{aligned} \forall j \in (1:M_{SR}) \quad \exists ! i \in (1:N) / sr_j^{\text {text}}=t_i^{\text {text}}, sr_j^{\text {user}}=t_i^{\text {user}}, sr_j^{\text {time}}>t_i^{\text {time}}. \end{aligned}$$

These are exactly the subsets we are interested in, i.e., the tweets making the self-reposting phenomenon. Since we are interested in the union of the two sets \(S_T\) and \(S_{SR}\), we can conveniently define that union as \(S_{TSR} = S_T \cup S_{SR}\). Each element \(s_k \in S_{TSR}\) is defined by the 3-tuple \(s_{k}:=(s_k^{\text {text}},s_k^{\text {time}},s_k^{\text {user}})\). It is useful to order those elements by time, so that \(s_{k+1}^{\text {time}}>s_{k}^{\text {time}}\). Each element in \(S_{TSR}\) is an original tweet or a self-reposted one. For each element \(s_k \in S_{TSR}\) we can identify the subset \(S_{RR}^{(k)}\) of its retweets by other twitterers, i.e. all the elements \(rr_j \in S_{RR} : (rr_{j}^{\text {text}}=s_k^{\text {text}}) \wedge (s_k^{\text {time}}< rr_j^{\text {time}} < s_{f(k)}^{\text {time}})\), where f(k) is the function pointing to the next element in \(S_{TSR}\) that is a self-repost of \(s_k\), i.e. \(f(k) = \min j : (s_j^{\text {text}}=s_k^{\text {text}}) \wedge (s_j^{\text {time}} > s_k^{\text {time}})\). If the tweet \(s_k\) is the last self-repost in its series, then the subset \(S_{RR}^{(k)}\) is made of all the tweets \(rr_j \in S_{RR} : (rr_{j}^{\text {text}}=s_k^{\text {text}}) \wedge (rr_j^{\text {time}} > s_k^{\text {time}})\). It is also expedient to define some other quantities that will be useful when we later define the features we employ. For each subset \(S_{RR}^{(k)}\) we define the ordered set \(U^{(k)}\) of all the indices of the elements of \(S_{RR}^{(k)}\), i.e. the set of all j for which \(rr_{j}\in S_{RR}^{(k)}\). The generic element of \(U^{(k)}\) is \(u_{i}^{(k)}\), \(i=1,2, \ldots , \vert S_{RR}^{(k)}\vert \). We also introduce the set \(D^{(k)}\) of all time differences between subsequent retweets of the same tweet, whose elements are \(d_{i}^{(k)} = rr_{u_{i+1}^{(k)}}^{\text {time}} - rr_{u_{i}^{(k)}}^{\text {time}}\), \(i = 1, 2, \ldots , \vert S_{RR}^{(k)} \vert - 1\). We can finally introduce the rank of each time difference, i.e., the set \(RD^{(k)}\) of all \(rd_{j}^{(k)} = \#d_{i}^{k}\quad : \quad d_{i}^{(k)}\ge d_{j}^{(k)}\).

Fig. 1
figure 1

The set of all tweets, S. \(S_{T}\) (yellow), \(S_{SR}=\cup _{i=1}^{N} S_{SR}^{(i)}\) (blue) and \(S_{RR}=\cup _{i=1}^{N} S_{RR}^{(i)}\) (green) form a partition of S. We can spot the presence of two different applications that map a tweet in \(S_T\) to a self-repost in \(S_{SR}\) or a retweet in \(S_{RR}\)

3.2 Twitter scraping

We have scraped the tweets by Twitter’s API (Application Programming Interface), accessible upon opening a Twitter developer account. Tweets were retrieved using the R package twitteR developed by Gentry (2015). The search interval has a 7-day limit, which means that only tweets posted in the latest seven days can be retrieved. Our inspection interval falls in the weeks from November 30, 2019, to March 23, 2020, for an overall collection period of 16 weeks.

We have focussed our investigation on tweets concerning wind power. For that reason, we have searched for all the tweets containing either of the following word combinations:

  • Wind AND power;

  • Wind AND energy.

Since many tweets contain both the above combinations, our basket after the retrieval phase may contain duplicates (i.e., tweets exhibiting the same 3- or 4- tuple identifying the tweet as described in Sect. 3.1). Before proceeding further, we have removed all duplicates.

We must, however, recognize the possibility of including tweets that, despite containing the words wind or power or energy, are not relevant to our actual theme, i.e., the use of wind to get electrical power.

The number of relevant tweets by week is shown in Fig. 2.

Fig. 2
figure 2

Number of relevant tweets by week

3.3 Dataset

As previously mentioned, our dataset has been built from tweets in English posted over roughly four months. This dataset contains 51,580 tweets (\(\vert S \vert = 51,580\)). The subsets of original and self-reposted tweets amount to 11,723 tweets. Precisely, we have \(\vert S_T \vert = 10,230\) and \(\vert S_{SR}\vert = 1,493\).

The target variable, which our supervised learning engine tries to predict, is the status of being self-reposted. Namely, the target variable \(\textit{self\_repost}\) is 1 if the tweet has been self-reposted and 0 otherwise. The definition of the target variable is then as follows: Fix \(i=1,2,\ldots , N\) and consider \(t_i\in S_T\). Let us denote by \(S_{SR}^{(i)}\) the set of all \(sr_j\) that satisfy Eq. (2). Then, intuitively, the union of \(S_{SR}^{(i)}\) and \(\{t_i\}\) is the set of tweets with \(\textit{self\_repost}=1\). Let us now denote by \(\vert S_{SR}^{(i)}\vert \) the cardinality of this set and define:

$$\begin{aligned} \mathcal {I}_0&\triangleq \left\{ i\in \left( 1:\left( N+M_{SR}\right) \right) \Bigl | \vert S_{SR}^{(i)}\vert =0 \right\} \nonumber \\ \mathcal {I}_1&\triangleq \left\{ i\in \left( 1:\left( N+M_{SR}\right) \right) \Bigl | \vert S_{SR}^{(i)}\vert \ge 1 \right\} \end{aligned}$$

Accordingly, the sets of tweets with \(\textit{self\_repost}=0\) and \(\textit{self\_repost}=1\) are, respectively \(\left\{ t_i\Bigl | i\in \mathcal {I}_0\right\} \) and \(\left\{ t_i\cup S_{SR}^{(i)}\Bigl | i\in \mathcal {I}_1\right\} \).

For each tweet in \(S_{TSR}\), we have nine features, obtained either by simple retrieving from the scraped data or by processing those data. The features of the generic element \(s_k \in S_{TSR}\) are:

  • The retweet count \(n_{\text {RT}}(k)\), i.e. the number of times the tweet has been retweeted before being reposted, which is precisely defined as

    $$\begin{aligned} n_{\text {RT}}(k) = \vert S_{RR}^{(k)} \vert . \end{aligned}$$
  • The average time \(\overline{t}(k)\) between two retweets, which can be computed as the ratio of the time between two subsequent self reposts to the number of retweets:

    $$\begin{aligned} \overline{t}(k) = \frac{s_{f(k)}^{\text {time}}-s_{k}^{\text {time}}}{\vert S_{RR}^{(k)}\vert }. \end{aligned}$$
  • The weekday wd(k), i.e., the day of the week when the tweet has been posted, extracted from \(s_{k}^{time}\);

  • The time dt(k) of the day when the tweet has been posted, extracted again from \(s_{k}^{time}\);

  • The length nc(k) of \(s_k^{\text {text}}\) in characters;

  • The number nh(k) of hashtags contained in \(s_k^{\text {text}}\);

  • The number nl(k) of hyperlinks (URLs) contained in \(s_k^{\text {text}}\);

  • The number ne(k) of exclamation points contained in \(s_k^{\text {text}}\).

  • The rank \(o(k) = rd_{\vert S_{RR}^{(k)}\vert - 1}^{(k)}\) of the last inter-retweet interval (i.e., after sorting all the inter-retweet intervals since the latest self-repost).

4 Determinants of self-reposting

In this section, we analyse the dataset to extract the most relevant features associated with self-reposting. We carry out a dual approach. We first employ an exploratory analysis, which may give us a clue about which features take different values in self-reposted vs non-self-reposted tweets. Then we apply a machine learning approach based on a decision tree to classify tweets by employing the nine features we have described in Sect. 3.3.

4.1 Exploratory analysis

Before applying a thorough machine learning approach, we conduct an exploratory analysis by seeing if any single feature actually takes different values for self-reposted and non-self-reposted tweets. For the feature to be really useful in discriminating between the two types of tweets, hence characterizing self-reposting, we expect it to behave differently.

We start by examining how frequent self-reposting is. We can draw two metrics from the composition of the dataset illustrated in Sect. 3.3. First, out of all original tweets, those that are self-reposted are \(\vert S_{SR}\vert /\vert S_T \vert = 14.6\%\), which is a significant fraction. On the other hand, we may consider self-reposting as a special case of retweeting, i.e., where the twitterer tweeting the original tweet and that tweeting the retweet coincides. In that case, we can compute the fraction of retweets that are actually self-reposts as \(\vert S_{SR} \vert / (\vert S_{SR} \vert + \vert S_{RR} \vert ) = 3.6\%\).

We can also see how much self-reposting is resorted to by each twitterer. In Fig. 3, we see that the overwhelming majority of twitterers limit themselves to reposting their tweet twice, but there are (rare) cases where the twitterer applies self-reposting massively. Though we reported just the cases up to fourteen retweets, in a small percentage of cases (4.46% of the total), the authors have reposted their same post more than fifteen times. We mention the cases of single users who reposted the same tweet 35, 40, 50 or even 114 times. The last case concerned a tweet reposted by the same author nearly every day, roughly at the same time, but without ever earning a retweet.

If we look at the users who have applied self-reposting massively, we notice that in the top 20, there are: only 1 (small) company; 1 expert; 1 climate activist; 3 leading international or governmental agencies, associations, organizations; 3 users who seem not to be on Twitter any longer; 2 suspended accountsFootnote 1; 9 generalist individual twitterers. The latter is a class of twitterers that promote websites on cryptocurrencies and appear to be related (they publish or retweet the same posts); their behaviour is reminiscent of social bots.

Fig. 3
figure 3

Distribution of twitterers by the number of self-reposts carried out

We can now examine the impact of each feature on self-reposting behaviour. In the following, we consider the two values of the target variables (i.e., being a self-repost or not) and report the statistical characteristics of each feature for the two values of the target variable.

We start with the number of retweets (by twitterers other than the original poster) taking place between two subsequent self-reposts. In the case of non-self-reposted tweets, we consider the overall number of retweets instead as a comparison. We report the two resulting probability density functions in Fig. 4. The comparison does not show significant differences.

Fig. 4
figure 4

Relative frequency of retweet count

If we take a look at how often retweets are posted in Fig. 5, we see that retweets after self-reposts take place more frequently than for tweets that are not self-reposted. However, the mode of the density (the most frequent inter-retweet time) is quite similar for both cases, being located at 0.47 hours (for self-reposted tweets) and 1.41 hours (for non-self-reposted tweets), respectively.

Fig. 5
figure 5

Density plot of the average time between two successive retweets of the same tweet

We can now examine when retweets are posted. We start with the day of the week. In Fig. 6. Here we see significant differences between the two subsets. Non-self-reposted tweets occur more frequently during weekdays, while self-reposts peak on Sunday. We can imagine a reviewing habit taking place for twitterers who review the status of their posts on Sundays and then decide whether to self-repost their tweets.

Fig. 6
figure 6

Distribution of tweets by the week day in which they are posted

We can go deeper by examining the time of the day at which tweets are posted. In Fig. 7, we see that the distribution is far from uniform but different for the two cases. Tweets that are not self-reposted take place relatively uniformly during the daytime. Self-reposts concentrate in the afternoon hours.

Fig. 7
figure 7

Distribution of tweets by the hour of the day in which they are posted

We can now move to the actual features of the tweet itself.

We first consider the length of the tweet, reported in Fig. 8. We see that there are no highly significant differences, except some occasional peaks around shorter lengths for tweets that are self-reposted.

Fig. 8
figure 8

Density plot of tweets length in characters

Another important feature of tweets is the number of hashtags they contain. In Fig. 9, we see that self-reposted tweets have more embedded hashtags than non-self-reposted ones. We can conclude that twitterers who repost their own tweets are more sensitive to catching readers’ attention.

Fig. 9
figure 9

Distribution of tweets by the number of hashtags contained in the tweet

Instead, no significant differences appear if we turn to another embedded element such as hyperlinks. In Fig. 10, we see that the overwhelming majority of tweets contain just one hyperlink, with most of the remaining tweets containing just two. Though the presence of a single hyperlink is more frequent for self-reposts (and the opposite can be said for the case of two hyperlinks), the number of hyperlinks embedded in tweets does not seem a helpful feature to discriminate between self-reposts and non-self-reposted tweets.

Fig. 10
figure 10

Distribution of tweets by the number of links which they contain

We can reach the same conclusion for the number of exclamation points, which is shown in Fig. 11. Here, we have no exclamation points for 94% of the non-self-reposted tweets and 96% of self-reposts, quite a slight difference to be considered for discrimination (Fig. 12).

Fig. 11
figure 11

Distribution of tweets by the number of exclamation points which they contain

Fig. 12
figure 12

Distribution of tweets by rank feature

4.2 Classification metrics

After performing the exploratory analysis, we now consider a machine learning approach to identify the determinants of self-reposting behaviour. The task of predicting whether a message is going to be self-reposted can be seen as a binary classification task, where we label self-reposted messages as positive instances. The resulting confusion matrix is reported in Table 1.

Table 1 Confusion matrix

In order to evaluate the performance of our classifier, we employ the following metrics, which are well established in the literature:

$$\begin{aligned}&\text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
$$\begin{aligned}&\text {Sensitivity} = \frac{TP}{TP + FN} \end{aligned}$$
$$\begin{aligned}&\text {Specificity} = \frac{TN}{TN + FP} \end{aligned}$$
$$\begin{aligned}&\text {Balanced accuracy} = \dfrac{\text {Sensitivity} + \text {Specificity}}{2} \end{aligned}$$

Since our dataset is imbalanced, accuracy may not be the best performance metric; hence we will mostly rely on the true positive and true negative rates (i.e., Sensitivity and Specificity).

4.3 Classification methodology

The task we are considering is a classification task, where we try to link the features identified in Sect. 3.3 to the target variable self_repost.

We employ the labelled dataset described in Sect. 3.3. We consider inputs \(\mathbf{x} \) and output y, merged in the training set \(\mathcal {D}=\left\{ \left( \mathbf{x} _{i}, y_{i}\right) \right\} _{i=1}^{N}\), where N is the number of training tweets. In our case, \(\mathbf{x} \) contains the values of the 9 features introduced in Sect. 3.3 and \(y_{i}\) is the binary categorical variable \(\textit{self\_repost} \in \{0, 1\}\). Formally, given the training set \(\mathcal {D}\) of N tweets, the k-th tweet is represented by 9 features and a target variable:

$$\begin{aligned} \left( n_{\text {RT}}(k), \, \overline{t}(k), \, wd(k), \, dt(k),\, nc(k), \, nh(k), \, nl(k), \, ne(k), \, o(k), \, \textit{self\_repost}_k\right) \end{aligned}$$

We opt for a classification tree as the classification model due to its simplicity and easy interpretability. The tree is built using the greedy procedure shown in Algorithm 1, which looks for the optimal partitioning of the data. This method is used by CART (Breiman et al. 1984; Murphy 2012), and its popular implementation (Therneau et al. 1997; Therneau and Atkinson 2019; Kuhn 2020, 2008). The split function chooses the best feature and the best value for that feature, as follows:

$$\begin{aligned} \left( j^{*}, t^{*}\right)&=\arg \min _{j \in \{1, \ldots , D\}} \min _{t \in \mathcal {T}_{j}} \Bigl [{\text {cost}}\left( \left\{ \mathbf {x}_{i}, y_{i}: x_{i j} \le t\right\} \right) +\nonumber \\&+{\text {cost}}\left( \left\{ \mathbf {x}_{i}, y_{i}: x_{i j}>t\right\} \right) \Bigr ] \end{aligned}$$

where \(\mathcal {T}_{j}\) is the set of possible thresholds for feature j; it can be obtained by sorting the unique values of \(x_{i j}\). For example, if feature 1 has the values \(\{4.5,-12,72,-12\}\), then we set \(\mathcal {T}_{1}=\{-12,4.5,72\}\). For real-valued inputs, it makes sense to compare a feature \(x_{i j}\) to a numeric value t. In the case of categorical inputs we consider splits of the form \(x_{i j}=c_{k}\) and \(x_{i j} \ne c_{k}\), for each possible class label \(c_{k}\). The training set \(\mathcal {D}\) is then splitted in two subsets, the left subtree \(\mathcal {D}_{L}\) and the right one \(\mathcal {D}_{R}\), using a single feature j and a threshold t.

figure a

The function that checks if a node is worth splitting uses a stopping heuristic based on the reduction in cost \(\Delta \):

$$\begin{aligned} \Delta \triangleq {\text {cost}}(\mathcal {D})-\left( \frac{\left| \mathcal {D}_{L}\right| }{|\mathcal {D}|} {\text {cost}}\left( \mathcal {D}_{L}\right) +\frac{\left| \mathcal {D}_{R}\right| }{|\mathcal {D}|} {\text {cost}}\left( \mathcal {D}_{R}\right) \right) \end{aligned}$$

To define the classification cost we resort to the so-called impurity functions. Let f be some impurity function and define the impurity of a node A as

$$\begin{aligned} I(A)=\sum _{i=1}^{C} f\left( p_{i A}\right) \end{aligned}$$

where \(p_{i A}\) is the proportion of those in A that belong to class i (in our case we have \(C=2\)).

Two candidates for f are the information entropy \(f(p)=-p \log (p)\) and the Gini index \(f(p)=p(1-p)\). The algorithm we built in this paper split the tree based on information entropy. Since we wish to define the variable importance, the reduction in the information entropy attributed to each variable at each split is tabulated, and the sum is returned.

The partitioning we employ is binary (it splits the data into two regions) and recursive (each splitting rule depends on the splits above it). However, we must decide how long we keep growing the tree. Though we could proceed till exploiting all the features, a very large tree could result in overfitting (of course, we may also face the opposite problem, since a small tree might not capture the important structure hidden in the data) (Friedman et al. 2001).

The adopted strategy here is to grow a large tree \(T_0\) and then prune it back to find an optimal subtree.

Let us define a subtree \(T \subset T_{0}\) to be any tree that can be obtained by pruning \(T_{0}\) and denote, by |T|, the number of terminal nodes in T. Each node \(A_m\) represents a region with \(N_m\) observations; for example, the node A considered in (13) is just one of them.

Hence, we define a cost complexity criterion as follows (Friedman et al. 2001)

$$\begin{aligned} L_{c_p}(T)=\sum _{m=1}^{|T|} N_{m} I(A_m)+c_p|T| \end{aligned}$$

where \(I(A_m)=I(A_m,T)\) is defined by (13) and \(c_p\) is the complexity parameter. The idea is to find the optimal subtree to minimize \(L_{c_p}(T)\). Large (small) values of \(c_p\) result in smaller (bigger) subtrees. Intuitively, with \(c_p=0\) the solution is the full tree \(T_0\).

5 Experiments and results

We aim to identify the drivers of self-reposting. For that purpose, we have introduced a decision-tree model in Sect. 4.3. In this section, we report the results of applying that model. In addition, we examine the importance of each feature in the model. These results allow us to identify the most significant factors associated with self-reposting.

5.1 Testing and validation

As stated earlier, the subset of interest to us is \(S_{TSR}\), which contains both the original tweets and the self-reposted ones. The overall size of \(S_{TSR}\) is 11,723. As reported in Sect. 3.3, the dataset is heavily imbalanced since the number of original tweets is much larger than the self-reposted ones. For the purpose of testing, we split that set into two parts, employed respectively for training the machine learning model and testing it. In order to avoid overfitting, the composition of both sets was not fixed, but tenfold cross-validation was employed Schaffer (1993).

5.2 Full-feature decision tree classification

We first build a decision tree as described in Sect. 4.3 with the full set of features. The resulting performance metrics are reported in Table 2. As can be seen, the accuracy is close to 90%, which we can consider as a good value. We also see that the performance for the two classes, embodied by the sensitivity and specificity, are not far apart. In particular, the high sensitivity value, now a bit over 90%, shows that the full features model can capture the overwhelming majority of self reposts (here labelled as positive). Several non-self-reposted tweets are instead disguised as positive cases, as shown by the lower value of the specificity. However, though being dragged towards this lower value, the balanced accuracy is still above 85%.

Table 2 Performance of the decision tree model with full features

We have built the tree using both impurity measures (Gini index and entropy). The resulting trees are shown respectively in Figs. 13 and  14. We can consider the variables employed in the levels closest to the tree root as the most significant ones. The first level split is carried out by employing the number nh(k) of hashtags. On the second level, we use the rank o(k) on the left subtree and the length nc(k) of the tweet in characters on the right subtree. This partly agrees with the clue provided by the exploratory analysis in Sect. 4.1.

Fig. 13
figure 13

Decision Tree for the full set of features (Gini index)

Fig. 14
figure 14

Decision Tree for the full set of features (Information Entropy)

5.3 Feature selection

Though we have achieved good accuracy with the tree model built by considering the full set of features, we wish to extract those features that really matter in characterizing the self-reposting phenomenon.

We have already performed an exploratory analysis in Sect. 4.1, where the day of the week, the time of the day, the number of embedded hashtags, and the rank of the last inter-retweet interval appeared as the most discriminating features. The decision trees built for the full set of features identified the number of hashtags, the rank, and the length of tweets as the most relevant variables to split the dataset.

We now take a step forward by considering some metrics that allow us to support (or disprove) those findings under a more quantitative approach. In Duboue (2020) the following set of metrics is suggested, with our choices between parentheses:

  • Chi-square test (disregarded);

  • Pearson correlation coefficient (replaced by Cramer’s V);

  • Mutual information.

The chi-square test would be employed by setting the null hypothesis as that there is no difference between the distribution of the feature when the tweet is self-reposted or not. The features to be retained as significant are those for which the null hypothesis is rejected. However, it is well known that such significance tests are very sensitive to the sample size since even trivial differences may be considered as significant Bergh (2015). In that case, the null hypothesis is rejected for most variables, concluding that all variables are significant. This is precisely what happened in our case: the chi-square test was not helpful, judging all the variables to be significant.

As to the second suggestion in the bulleted list above, the Pearson correlation coefficient is not very meaningful since most of our variables are categorical. We have opted then for Cramer’s V association index, which is built for categorical variables Acock and Stavig (1979). We have discretized the intrinsically continuous variables by binning them as typically done in the computation of the chi-square statistics. In Table 3, we show the results, where the three features most strongly associated with the target variable are the length of the tweet, the day of the week, and the number of hashtags. These results confirm the suggestions provided by our exploratory analysis and the decision tree.

Table 3 Feature ranking according to Cramer’s V

Finally, we have computed the mutual information between each feature and the target variable. The results are shown in Fig. 15, where now the day of the week, the length of the tweet, and the rank appear as the three top features.

Fig. 15
figure 15

Mutual Information between features and target variable

Summing up the outcomes of the exploratory analysis and the three quantitative feature selection metrics, we can conclude that the most significant features to assess the propensity of a tweet to be self-reposted are

  • The day of the week;

  • Length of the tweet;

  • Number of hashtags;

  • Rank of the intertweet time interval;

  • The time of the day.

The day of the week has received the broadest support, i.e., by the exploratory analysis, Cramer’s V, and the mutual information. The tweet’s length is supported by both Cramer’s V and the mutual information. The exploratory analysis supports the number of hashtags and Cramer’s V. The rank is supported again by the exploratory analysis and the mutual information. Finally, the time of the day is supported by the exploratory analysis only.

We can now build a new decision tree using just the features that have been deemed significant. We show them in Figs. 16 and  17 when the Gini index and the information entropy are employed respectively.

Fig. 16
figure 16

Decision Tree for the reduced set of features (Gini index)

Fig. 17
figure 17

Decision Tree for the reduced set of features (Information Entropy)

The performance achieved with the reduced set of features is shown in Table 4. If we compare these results with those obtained with the full set of features (shown in Table 2), we notice the performance is practically unaltered (actually, there is even a tiny improvement). The removed features add nothing to the prediction performance of the decision tree. We can then safely conclude that the features impacting the self-repost behaviour are the day of the week, the tweet’s length, the number of hashtags, the rank of the intertweet time interval, and the time of the day. In particular, as can be seen from both decision trees in Figs. 16 and  17, self-reposts take place more often in the weekend and on times other than the central part of the day, for longer tweets containing a higher number of hashtags, and when a long time has elapsed since the last retweet.

Table 4 Classification performance of the decision tree with the reduced set of features

Looking at Fig. 16 we can spot that the characterization of all the tweets reposted multiple times (for the reduced set of features) is the following

$$\begin{aligned}&\Bigl \{\text {nh}\ge 7\, , \, \text {nc}\in [135,136)\Bigr \}\ \vee \ \Bigl \{\text {nh}\ge 7\, , \, \text {nc}\in [139,140)\, , \, \text {dt}<19\Bigr \}\ \vee \nonumber \\&\Bigl \{\text {nh}< 7\, , \, \text {o}<99\, ,\, \text {dt}<8\, , \ \text {wd}\in \bigl \{\text {Monday}\bigr \} \Bigr \} \ \vee \ \nonumber \\&\Bigl \{\text {nh}< 7\, , \, \text {o}<99\, , \, \text {wd}\in \bigl \{\text {Sunday}\bigr \} \Bigr \} \ \vee \nonumber \\&\Bigl \{\text {nh}\in [5,7)\, , \, \text {o}\ge 99\, , \, \text {nc}\in [117,118)\Bigr \} \end{aligned}$$

6 Discussion and conclusions

In this paper, we have described an analysis framework to understand the phenomenon of self-reposting in Twitter discussions. For that purpose, we have examined a large set of tweets concerning wind energy. Though those tweets concern a specific topic, the framework and most conclusions can be considered generally valid.

The combined use of exploratory analysis and a machine learning approach (through decision trees) has shown that self-reposts typically take place on weekends and in the afternoon hours, have a length close to the limit, contain more hashtags, and occur when retweets from other twitterers become less frequent. Their typical placement over time suggests that they fall into a planning habit. The combination of day and time reveals that self-reposting is not a major activity to be conducted in the core times. Instead, it is part of a regular survey of social media activities. People review the status of their tweets every week and take some time off their non-working hours to examine their posts and resubmit some. A major criterion for self-reposting that emerges from our analysis is suggested by the prominent feature of the rank of the inter-tweet interval. It looks like twitterers decide to repost their own messages when retweets start to get rarer. We cannot go as far as identifying a threshold that acts as a trigger; we can imagine that, though implicitly, this can be related to the popularity of the tweet to be self-reposted. However, the intention to revive the interest in their messages is clear. It is not worthwhile to repost when the interest is still high, as embodied by a high number of retweets. As to the characteristics of the tweets that are self-reposted, they appear to be richer (more content and more hashtags) than non-self-reposted ones. It is not easy to say whether this is a driving feature for self-reposting or a natural characteristic of tweets posted by people who strongly wish to put their message across.

Regarding the specific topic of wind energy, we see that wind energy companies do not apply self-reposting. Indeed, among the users who have applied self-reposting massively, we noticed that the majority is formed by suspended accounts or accounts which seem to no longer be on Twitter and by generalist individual twitterers. It is to be seen whether the massive use of reposting by individuals, especially by temporary or banned users, is a feature of the specific topic or is a general result.

Aside from the characterization of the self-reposting phenomenon, the implications of this study concern the management of social platforms like Twitter. In fact, self-reposting may take an annoying or vexing flavour (or even overlap with a phenomenon like mail bombing), which may compel platform managers to take actions against massive self-reposters. Possible actions by Twitter could include limitations to twitterers reposting their own tweets. This could be done based on one or more indicators, which express related policies. For example, self-reposts could be blocked when they exceed a threshold set for a time interval (e.g., no more than three self-reposts per month). The rationale, in this case, would be to limit self-reposting straightforwardly, considering it a vexing activity anyway. A second possible measure would be to limit self-reposts when the original tweet has not received enough retweets (which triggers self-reposting in many cases). The rationale would be to avoid reposting tweets that have not caught the interest of twitterers. Finally, a similar measure by Twitter could block self-reposts when the tweet has not been retweeted for a long time. This third measure would avoid reposting tweets that do not longer catch the interest of twitterers though they did in the past.

Our analysis has shown that self-reposting is not a random behaviour but is triggered by some events (which we have identified as the timing of past retweets) and is linked to some features of the tweet itself (e.g., its length and hashtag content). At the same time, it happens more frequently on weekends and in non-central times. Two branches appear to be most interesting after this initial study: (a) looking for additional features of self-reposted tweets; (b) extending the analysis to other social media platforms to examine if self-reposting is employed and exhibits characteristics different from Twitter.