Analyzing history-related posts in twitter

Sumikawa, Yasunobu; Jatowt, Adam

doi:10.1007/s00799-020-00296-2

Open access
Published: 28 October 2020

Volume 22, pages 105–134, (2021)

International Journal on Digital Libraries

Abstract

Microblogging platforms such as Twitter are increasingly used to share information between users. They are also a convenient means for propagating content related to history. Hence, from the research viewpoint, they offer opportunities to analyze the way in which users refer to the past, how and when such references appear, and what purposes they serve. Such a study could make it possible to quantify the degree of interest and the mechanisms behind content dissemination. We report the results of a large-scale exploratory analysis of history-oriented posts in microblogs based on a 28-month-long snapshot of Twitter data. The results can increase our understanding of the characteristics of history-focused content sharing in Twitter. They can also be used for guiding the design of content recommendation systems as well as time-aware search applications.

1 Introduction

History is regarded as knowledge that plays a special role in our society, because the comprehension of history is useful for multiple reasons. First, one can better understand the processes impacting the present world. Second, history forms the basis for the development of coherent national and local identities. Third, history offers support for decision making and provides guidance as to what may await us in the future [1, 23]. For these and other reasons, history is one of the key subjects taught in elementary schools as well as in the subsequent stages of education.

Recently, social media, and microblogs in particular, have often been used as a convenient source for understanding public attitudes towards entities or events (e.g., the US elections [57]). Microblogs are also a useful platform for finding and sharing history-related content. Computational studies of references to the past in microblogs can therefore offer novel perspectives for understanding the formation of collective memories and the pursuit of public history.

Collective memory analyses based on large-scale data and computational methods have already been carried out on either news article collections [5, 16] or Wikipedia data [19, 20, 35, 36]. However, when it comes to microblogs, little research has been done. One notable project is the analysis of memories related to the First World War in Twitter [14] from a multi-cultural perspective. Our work also focuses on Twitter, which constitutes a popular social media platform frequently utilized for a variety of studies in the computational social sciences and other domains. The analysis we perform is exploratory in character and aims to offer a broad investigation of the practices of sharing history-related content in microblogging platforms.

The following questions are considered in our study:

1. How do users write about history in Twitter?
2. What does the time horizon of history-related references look like?
3. In what way are collective memories expressed in Twitter?
4. What are the key tweeted and re-tweeted past events and entities?
5. How different are the collective memories expressed in tweets from the ones in re-tweets?

These and other related questions are investigated based on a compiled dataset of tweets issued from March 2016 to July 2018. We collect such posts by searching for tweets that contain history-related hashtags. To increase the coverage, we apply bootstrapping, an iterative process of collecting relevant hashtags starting from an initial set of seed history-focused hashtags. Thanks to this procedure, we collected a sufficient number of history-related hashtags, which allowed us to gather over 2 million tweets that contain different kinds of references to history.

Based on the collected data, we then examine the characteristics of history-related tweets. We study their time horizons, mentioned entities, hashtag popularity, as well as several other related aspects. Moreover, we describe our novel taxonomy of history-related hashtags and analyze the different hashtag categories. In this way, we try to organize and give structure to user activities related to referencing, evaluating and sharing history-related information in social networks like Twitter.

Besides answering specific research questions, the results of our analysis can be useful for several practical applications. First, specialized content detection and recommendation systems, whose objective would be to facilitate the sharing of historical knowledge, can be better designed thanks to the results we report. Historical content recommendation in social media is an attractive and informal way of learning history. Building effective, dedicated recommendation systems requires understanding the characteristics and types of popular history-related content in social media and the context in which this content is shared. Indeed, several existing projects already employ online social platforms like Twitter to stimulate interest in history and to teach history.^{Footnote 1} An interesting idea is automatic content dissemination enabled by history-focused chatbots such as HistoChatbot.^{Footnote 2} Tweets, due to their relatively short content and the simple yet effective methods for measuring their popularity (e.g., re-tweet counts and user response analysis), could constitute a useful source of data for such systems.

Naturally, some history-focused tweets are directly triggered by current events or current popular entities. Studying their formation and popularity could be useful for understanding the conditions and circumstances that would allow for “historification” of different types of documents. In practice, this would mean recommending relevant historical references and grounding for any present events and topics mentioned in these documents.

Besides providing answers to the research questions on history-related content dissemination in social media, our work may also offer clues about collection building for historians or other researchers who are interested in using tweet collections. The proposed categorization of history-related hashtags could be used for generating collections that contain content of special characteristics. In this context, we also discuss particular types of tools that can be used (temporal tagger, NER method) for effective analysis of collected datasets.

To sum up, we make the following contributions:

1. We study how users refer to history in social networks based on collected large-scale data.
2. We perform tweet- and re-tweet-based analyses.
3. We provide novel findings which offer a better understanding of how collective memories are maintained and formed in microblogging.
4. We propose a novel categorization of historical references in Twitter.
5. We outline novel research directions and potential applications that can utilize history-related content in microblogs.
6. We release our dataset of history-related tweets for further research.

This work is an extended version of the paper published at the JCDL 2018 conference [55]. We analyze here a larger dataset (a data collection span of close to three years instead of 1 year as in our previous work). This allows us to undertake comparative analyses for different years (2016, 2017 and 2018). Besides the larger scale and comparative focus of this work, we also contrast the results obtained from tweets with those coming from re-tweets. This allows for pinpointing differences between the active formulation of texts containing remembrances and their passive dissemination along social networks. Finally, in comparison with the JCDL 2018 paper, we additionally analyze the URLs included in tweets and report the results.

The remainder of this paper is structured as follows. We present related work in the next section. In Sect. 3, we detail the data collection and processing. Section 4 describes the findings of our analysis, while the next section introduces our novel categorization of hashtags and provides the results of the related analyses. We then provide discussions in Sect. 6. The last section concludes the paper and describes our future work.

2 Related work

In this section, we start with an overview of temporal information retrieval studies and temporal text analysis (Sect. 2.1) and then survey works on Twitter data analysis (Sect. 2.2), as our study uses temporal references in tweets. We then focus on broad studies of collective memory using computational approaches in Sect. 2.3.

2.1 Temporal analysis

The current Web contains numerous digital archives including historical images, documents and so on due to intensive digitization efforts carried out over the last years. Due to the ever increasing amount of temporal data, analyzing temporal information has become an important process in information retrieval (IR) to improve satisfaction of users. Recently, several kinds of studies were undertaken in the broad area of Temporal IR (T-IR), for example, detecting temporal expressions or information [28], retrieving history-related images [13], organizing information by creating timelines [3, 17, 29], or future-related IR [7, 34, 50]. A detailed survey of T-IR is given in [11].

Similar to our study, several past-oriented temporal analyses have been performed. These can be roughly grouped into several sub-areas of T-IR: supporting users in retrieving past-related data, extracting useful past data, and supporting or understanding historical sciences in general.

As for supporting data search and retrieval, various methods and algorithms to assist users in finding past content have been proposed [9, 47, 52]. For example, Singh et al. designed an IR framework to support historians in their searches [52]. According to the literature, when historians investigate an entity, they first try to see the big picture. Then, they search further for content on the entity according to some of its specific aspects. Thus, to support historians’ information seeking, it is useful not only to indicate important time information but also to display several kinds of aspects. Bogaard et al. proposed a data-driven partitioning process to identify user interests and search behavior based on interactions with a historical newspaper collection spanning 400 years that is available from the National Library of the Netherlands [9]. They confirmed that their approach can detect user interests and observed that the related search behavior varies across the different parts of the collection. Abujabal and Berberich [2] proposed a method to identify important past as well as future events based on frequent itemset mining and mutual information on sentences containing named entities and temporal expressions.

Works on finding analogical items over different temporal scopes are also related to our study. Zhang et al. proposed a framework for detecting counterparts of entities over time [61]. This framework bridges two different vector spaces that are created for different time ranges, such as [1900–1950] and [1960–2010], by applying an automatically learnt transformation matrix. The transformation matrix maps an entity in one vector space into the other one. The authors extended this approach to make use of hierarchical cluster structures [62]. In general, mining history-related knowledge is another popular direction of study. For example, several works try to find beneficial information in large amounts of data by evaluating the significance of historical entities [31], timestamping entities [32], analyzing trends [29], or trying to predict the future from past events [33, 49, 50].

2.2 Twitter analysis

Twitter is one of the most popular social media platforms for sharing information. As a tweet can have at most 280 characters, this platform poses several challenges caused by the short content of messages. For example, there are studies extending traditional IR/NLP techniques designed for long documents such as news articles to fit short texts, e.g., identifying central topic models from tweet streams [48], summarizing tweets [22], retrieving opinions [21], detecting communities [8], and building corpora [38, 42, 46]. In addition, Twitter contains not only texts but also unique features such as hashtags, followers and followees (i.e., Twitter users who follow or are followed by a particular user), and URLs. Using these features, past studies focused on (among others) automatic hashtag labeling by hashtag-based pooling of tweets [43], analyzing factors affecting response [15], readability of crisis communications [56], language diversity [41], language and locations [59], detecting influencers in Twitter [60], classifying users’ temporal intention when sharing resources [51], ranking users [58] or meme tracking in the blogosphere [40].

As discussed above, many Twitter-related studies use Twitter’s unique features, yet what these studies usually lack is a deep consideration of historical aspects.

2.3 Collective memory analysis

The concept of collective memory (or social memory) popularized by Halbwachs [25, 26] describes the shared reflection of the past within social groups. Collective memory can be contrasted with collective amnesia defined by Jacoby [30] as forceful or unconscious suppressions of memories, especially those related to disgraceful or inconvenient events for a particular social group or nation. In a similar fashion to personal memory [18], social memory is known to thin out over time and to be subject to temporal variations following the occurrence of memory triggers such as sudden events or anniversaries [5, 36, 37].

Studies of collective memory can help us to understand the mechanisms of forgetting and remembering as well as explain the role of the history and the past in our lives. In addition, they have direct implications on the archival selection by memory institutions such as national or dedicated archives [37]. Traditionally, research on collective memory has been based on manual approaches and small-scale investigations of personal accounts and the activities of political and cultural institutions. There is still relatively little literature on the use of computational approaches for the quantification of the characteristics of social memory over large text datasets. Cook et al. [16] investigated the decay of fame over time on the basis of the collection of news articles that span the twentieth century. Au Yeung and Jatowt [5] studied the way in which past year mentions appear in the datasets of recent news articles in order to understand which years are forgotten and which remain remembered, as well as the main topics associated with the remembering of past years.

Wikipedia has been quite often used as a reflection of collective memories and their formation processes. Ferron and Massa [19] and Kanhabua et al. [36] proposed to use Wikipedia as a global memory space. The latter work focused on memory triggers that cause forgotten or vaguely remembered events to be brought back into social attention. Anniversaries are natural examples of memory triggers. In another case, current events may also serve as triggers of the memories of similar, past events. García-Gavilanes et al. [20] revealed viewership statistics of Wikipedia articles on aircraft crashes and focused on memory triggering patterns. Miz et al. [44] proposed a new method that allows learning and remembering collective memories in an unsupervised manner by analyzing the Wikipedia Web network and hourly viewership history of its articles. The interests of Wikipedia visitors were also studied in [35] focusing on Wikipedia articles on historical persons. The authors have also investigated connectivity of Wikipedia articles about historical persons. Graus et al. [24] investigated about 80,000 entities emerging in online text streams before they got incorporated into Wikipedia analyzing in this way the processes behind collective memory formation.

Collective memories have also been researched in the context of particular items or objects. Strötgen et al. [53] performed a large-scale worldwide analysis of street names with date references according to the intuition that temporal streets are frequently used to commemorate important events of different regions. Similarly, Nielek et al. [45] analysed street name distributions as a window to nation-level collective memory in Poland. Candia et al. [12] analyzed the temporal decay of the attention received by cultural products such as academic articles, patents, songs, movies and biographies. The authors showed that the attention received by cultural products decays following a universal biexponential function and explained it by proposing a mathematical model based on communicative and cultural memory. The formation of collective memory has also recently been modeled by simulating opinion dynamics of collective agents including phenomena such as homophily [10]. Koutlis et al. [39] studied collective memory dynamics with regard to song recognition levels, leveraging chart data, YouTube views, Spotify popularity and forgetting curve dynamics.

Despite the above-listed efforts, to the best of our knowledge, few studies focus on history-oriented analysis in microblogging scenarios. Memory dynamics were investigated in Twitter data in [4] with regard to particular attributes of hurricanes. The authors tracked the use of n-grams involving hurricane name mentions and found that the most damaging and deadly storms of the 2010s generated the most attention and were remembered the longest. In another work, commemoration of the First World War was studied in relation to diverse countries [14]. In contrast to these works, we use relatively large data (at least for history-related studies) and longer time spans, and we investigate multiple aspects ranging from the types of references, intensity of remembering, key entities and dates to temporal patterns and so on. Lastly, our analysis uses three temporal snapshots of data, which allows comparison of collective memories in different years.

3 Data collection

This section describes the data collection and preprocessing procedures as well as general statistics of the dataset used for analysis. We also provide a few basic statistics and example results of entity mention detection.

Collecting hashtags and tweets. We used Twitter’s official search API^{Footnote 3} to collect tweets. Note that three kinds of posts are typically found in Twitter: tweets, re-tweets and quote tweets. A tweet is an original text issued as a post by a Twitter user. A re-tweet is a copy of an original tweet made for the purpose of propagating the tweet content to more users (i.e., one’s followers). Finally, a quote tweet copies the content of another tweet and also allows new content to be added. A quote tweet is sometimes called a re-tweet with a comment. In this work, we simply treat all quote tweets as original tweets since they include additional information/text. There were, however, only 1,877 (0.2%) tweets recognized as quote tweets in the collected data.
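The three kinds of posts, and the choice to fold quote tweets into original tweets, can be illustrated with a minimal sketch. It assumes tweets are available as decoded JSON objects in the Twitter API v1.1 layout, where a re-tweet carries a `retweeted_status` field and a quote tweet has `is_quote_status` set; the function names and the sample records are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tweet_kind(tweet: dict) -> str:
    """Return 're-tweet', 'quote' or 'tweet' for a decoded tweet object."""
    # Check the re-tweet marker first: a re-tweet of a quote tweet
    # carries both markers and should still count as a re-tweet.
    if "retweeted_status" in tweet:
        return "re-tweet"
    if tweet.get("is_quote_status", False):
        return "quote"
    return "tweet"


def analysis_group(tweet: dict) -> str:
    """Fold quote tweets into original tweets, as described in the text."""
    kind = tweet_kind(tweet)
    return "tweet" if kind == "quote" else kind


# Hypothetical sample records (not real data).
sample = [
    {"id": 1, "text": "#OTD in 1969 ..."},
    {"id": 2, "text": "RT @user: #OTD in 1969 ...", "retweeted_status": {"id": 1}},
    {"id": 3, "text": "Worth remembering", "is_quote_status": True},
]

counts = Counter(analysis_group(t) for t in sample)
print(dict(counts))  # {'tweet': 2, 're-tweet': 1}
```

Keeping `tweet_kind` separate from `analysis_group` makes it possible to count quote tweets on their own before merging them into the tweet group.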

To collect tweets that refer to the past and are related to the collective memory of past events/entities, we performed hashtag-based crawling together with a bootstrapping procedure. First, we gathered several historical hashtags selected by experts (e.g., #HistoryTeacher, #history, #WmnHist)^{Footnote 4}. In addition, we prepared several hashtags that are commonly used when referring to the past: #onthisday, #thisdayinhistory, #throwbackthursday, #otd. We then collected tweets that contain these hashtags by using Twitter’s official search API. The bootstrapping approach is shown in Procedure 1. T1 and T2 are conditions for collecting new seed hashtags and for stopping the tweet crawling, respectively.

These conditions depend on the data collection policy. For example, T2 can be bound to a pre-determined tweet collection period, and T1 may be implemented so as to trigger manual checking by an expert. $CandNewSeedHashtag$ serves as a difference set: it stores newly found hashtags (with the ones already used for tweet crawling removed) for their subsequent manual inspection.
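Since only the prose description of the bootstrapping is available here, the following is a sketch of one plausible reading of it, not the authors' Procedure 1. The callables `search` (hashtag to matching tweets) and `approve` (a stand-in for the expert's manual inspection), the predicates `t1` and `t2`, and the toy corpus are all assumptions made for illustration.

```python
import re

HASHTAG = re.compile(r"#(\w+)")


def hashtags_in(text):
    """Lower-cased hashtags (without '#') that occur in a tweet text."""
    return {h.lower() for h in HASHTAG.findall(text)}


def bootstrap(seeds, search, approve, t1, t2):
    """Crawl tweets for seed hashtags and grow the seed set iteratively.

    t1(candidates) says when candidates should be inspected (condition T1);
    t2(rounds) says when crawling should stop (condition T2).
    """
    used = {s.lower() for s in seeds}   # hashtags already used for crawling
    rejected = set()                    # hashtags the inspection discarded
    frontier = set(used)                # hashtags still to be crawled
    collected = {}                      # tweet id -> text, de-duplicated
    cand_new_seed_hashtag = set()       # the difference set from the text
    rounds = 0
    while frontier and not t2(rounds):
        for tag in sorted(frontier):
            for tweet_id, text in search(tag):
                collected[tweet_id] = text
                # Keep only hashtags that are new with respect to the seeds.
                cand_new_seed_hashtag |= hashtags_in(text) - used - rejected
        frontier = set()
        if t1(cand_new_seed_hashtag):
            approved = approve(cand_new_seed_hashtag)
            rejected |= cand_new_seed_hashtag - approved
            frontier = approved - used
            used |= frontier
            cand_new_seed_hashtag = set()
        rounds += 1
    return used, collected


# Toy corpus standing in for the search API (hypothetical data).
CORPUS = {
    1: "Apollo 11 landed #otd #history",
    2: "Old map of Kyoto #history #histmaps",
    3: "Treaty signed today #histmaps #archives",
    4: "Lunch time #food #otd",
}


def toy_search(tag):
    return [(i, t) for i, t in CORPUS.items() if tag in hashtags_in(t)]


def toy_approve(candidates):
    # Stand-in for the expert: everything except '#food' is history-related.
    return {c for c in candidates if c != "food"}


seed_set, tweets = bootstrap(
    ["otd"], toy_search, toy_approve,
    t1=lambda cands: len(cands) > 0,   # inspect whenever candidates exist
    t2=lambda rounds: rounds >= 10,    # stop after a fixed number of rounds
)
print(sorted(seed_set))  # ['archives', 'histmaps', 'history', 'otd']
print(sorted(tweets))    # [1, 2, 3, 4]
```

In this sketch the rejected hashtags are remembered so that they are not proposed to the expert again; whether the original procedure does the same is not stated in the text.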

The tweets we collected were issued from March 8, 2016, to July 2, 2018. Bootstrapping allowed us to search for other hashtags frequently used with the seed hashtags. Such hashtags were then included in the seed set after manual inspection of all the discovered hashtags for their relation to history and after filtering out the unrelated ones. In total, we gathered 147 history-related hashtags which allowed us to collect 2,370,252 tweet IDs pointing to 882,977 tweets and 1,487,275 re-tweets^{Footnote 5}. Table 1 shows the key statistics of the collected data. Table 2 shows the number of tweets we collected in each year. We gathered on average approximately 77k tweets per month in 2016 and 2018 and, on average, 89k tweets per month in 2017. The complete list of the used hashtags is shown in Table 23.
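As a quick consistency check, the reported counts can be recomputed with a few lines of arithmetic. The sketch below uses only figures quoted in this section; the variable names are ours, and the reading that the 0.2% share of quote tweets is relative to the 882,977 tweets (rather than to all collected IDs) is an inference from the numbers, not a statement in the text.

```python
from datetime import date

# Figures quoted in this section.
tweets = 882_977
retweets = 1_487_275
total_ids = 2_370_252
quote_tweets = 1_877

# Tweets and re-tweets together account for all collected tweet IDs.
assert tweets + retweets == total_ids

retweet_share = retweets / total_ids
quote_share_of_tweets = quote_tweets / tweets
span_days = (date(2018, 7, 2) - date(2016, 3, 8)).days

print(f"re-tweets among all IDs: {retweet_share:.1%}")          # 62.7%
print(f"quote tweets among tweets: {quote_share_of_tweets:.2%}")  # 0.21%
print(f"collection span in days: {span_days}")                  # 846
```

The span of 846 days is consistent with the 28-month snapshot mentioned in the abstract.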

Table 1 Dataset statistics
