Modeling Inter-country Connection from Geotagged News Reports: A Time-Series Analysis
The rapid development of big data techniques provides growing opportunities to investigate large-scale events that emerge over space and time. This research utilizes a unique open-access dataset, “The Global Data on Events, Location and Tone” (GDELT), to model how China has connected to the rest of the world and to predict how this connection may evolve over time using an autoregressive integrated moving average (ARIMA) model. Methodologically, we examined the effectiveness of traditional time series models in predicting trends in long-term mass media data. Empirically, we identified various types of ARIMA models to depict the connection patterns between China and its top 15 related countries. This study demonstrates the power of applying GDELT and big data analytics to investigate informative patterns for interdisciplinary researchers, and provides valuable references for interpreting regional patterns and international relations in the age of instant access.
Keywords: Time series analysis, ARIMA, Inter-country relations, Mass media events, GDELT
1 Introduction
In recent decades, the rapid development of techniques and theories in the big data field has introduced new challenges and opportunities for analyzing the large amount of information available online [1, 2, 3], including user-contributed (personalized) information such as social media data, and traditional mass media that targets a larger audience. Social media are best characterized by a series of Social Network Sites (SNS), such as Facebook and Twitter, that have attracted worldwide users to communicate, socialize, and share their daily lives, whereas mass media refers to various forms of media technologies that aim to reach a large audience via mass communication, including broadcast, print, film, and new channels developed with the growth of the World Wide Web (WWW), such as online news reports. Although many studies have focused on how user-generated content has revolutionized the traditional media landscape, especially in the marketing field [5, 6], there has not been sufficient study of how mass media datasets (e.g., massive online news archives) can be utilized to track, analyze, and model societal issues, such as the conflict and interaction between regions and countries. Recognizing the need to explore the geographic component of geotagged news reports, this research utilizes an open-source dataset, “The Global Data on Events, Location and Tone” (GDELT), to analyze how China’s inter-country connections have changed over time. GDELT monitors print, broadcast, and web news media in over 100 languages worldwide and automatically encodes such data into a structured database. Although researchers in various fields such as sociology and communication have explored the potential of such data in analyzing societal events [7, 8], there is very limited research utilizing these extracted mass media data in geography, such as analyzing the evolution of a geographic entity or the connection between geographic entities over time [9, 10].
We adopt an autoregressive integrated moving average (ARIMA) model to analyze the time series because it can handle autoregressive (AR), moving average (MA), and “integrated” (I) components. These models are appropriate for time series data, either to better interpret the autocorrelation of the data or to forecast future points in the series. Additionally, the ARIMA model is capable of dealing with non-stationary time series data, which is typically associated with long-term news events. This research concentrates on demonstrating the effectiveness of applying time series analysis to geotagged mass media data. We do not aim to interpret these patterns from a sociological or political perspective. The applied methodology can be further extended to other fields, such as public relations, communications, and political geography, as a data pre-processing strategy.
2 Related Work
In the age of instant access, the widespread usage of the Internet has introduced multiple new channels in the field of communication. On the one hand, researchers have investigated how “individual-oriented” SNS have revolutionized where, when, and how people communicate and share their daily life [12, 13, 14]. These social media platforms not only provide multiple avenues for communication, but they also generate rich data sources that allow researchers to analyze human behavior patterns from both individual and aggregated perspectives. On the other hand, compared to individual-oriented social media data, traditional mass media channels, such as newspapers and TV programs, concentrate on delivering information to a larger audience [4, 15]. Researchers have recognized that the advantage of mass media content lies in its professional nature: compared to social media, traditional mass media often addresses significant and aggregated events, thereby playing an important role in analyzing the social, economic, and cultural status of a society. Many researchers have applied mass media data to model short-term events, such as the response of the stock market to major social events. In addition, since traditional mass media has evolved for decades (or even centuries), and the data are often collected over a longer time span, they are more appropriate for investigating long-term socio-economic trends and patterns, such as the evolution of an urban system over decades, the collective patterns of a society, or the connection between societal systems. For example, the machine-coded GDELT dataset utilized in this research is updated daily and consists of over a quarter-billion news event records dating back to 1979. It captures what has happened and is happening worldwide [7, 17], and has therefore been utilized in many previous studies to analyze various long-term collective patterns.
One example is the study by Yonamine, which constructed a predictive model to explore conflict levels in Afghanistan by incorporating various socio-economic indicators such as unemployment levels and ethnic diversity. Yuan, Liu, and Wei also utilized GDELT to analyze how China was connected to other countries in the past few decades and how these patterns can be clustered into different categories. As mentioned in Sect. 1, this study focuses on constructing ARIMA time series models to predict the strength of international relations, which can be considered an extension of Yuan, Liu, and Wei from a time series modeling perspective.
The GDELT data used in this research include multiple columns such as the source, actors, time, and approximate location of recorded events. For instance, in a news report entitled “In Malaysia, Obama carefully calibrates message to Beijing,” Actor 1 would be “United States government” and Actor 2 would be “Chinese government”. The associated geographic locations of Actor 1, Actor 2, and the actual action are “Washington DC, United States”, “Beijing, China”, and “Kuala Lumpur, Malaysia”, respectively.
First, we extract all news records involving China and another country as two parties. Note that the location of the “action” is not a substantial factor here, since an event related to a certain country can happen inside or outside of that country. Based on the pre-processed data, we calculate descriptive statistics to provide a general interpretation of the trend at various spatio-temporal scales. For each year and each country, we calculate the frequency of “co-occurrence” with China (denoted as c) in the dataset. The frequencies are denoted as F_y(i, c), which stands for the “co-occurrence” frequency between China and country i in year y. Here we first define connection strength as follows:
To explore the changing dynamics of this pattern, we compute the yearly connection strength between China and the top 15 countries, represented as time series data. The following series provides an example for the United States and China, which indicates that the connection strength is 0.162 in the year 1979 and 0.179 in 2013:
US [0.162, 0.174, 0.191, 0.193, 0.189, 0.189, 0.181, 0.177, 0.174, 0.169, 0.17, 0.165, 0.162, 0.157, 0.157, 0.161, 0.17, 0.165, 0.164, 0.165, 0.166, 0.16, 0.164, 0.16, 0.159, 0.155, 0.153, 0.151, 0.153, 0.156, 0.162, 0.169, 0.175, 0.178, 0.179]
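The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(year, actor1_country, actor2_country)`, the code `"CH"` for China, and the toy records are all hypothetical, and the normalization used here (a country's share of all of China's co-occurrences in that year) is an assumed definition of connection strength, chosen because the reported values behave like yearly shares. It should not be read as the exact formula used in this study.

```python
from collections import Counter, defaultdict

CHINA = "CH"  # assumed two-letter country code for China (hypothetical choice)


def cooccurrence_counts(records, focal=CHINA):
    """Count, per year, how often each country appears together with `focal`.

    `records` is an iterable of (year, actor1_country, actor2_country) tuples;
    this layout is an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for year, actor1, actor2 in records:
        if actor1 == actor2:
            continue  # skip events where both actors are from the same country
        if actor1 == focal:
            counts[year][actor2] += 1
        elif actor2 == focal:
            counts[year][actor1] += 1
        # events that do not involve the focal country are ignored
    return counts


def connection_strength(counts):
    """Normalize yearly counts to shares (ASSUMED definition of strength)."""
    strength = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        strength[year] = {country: n / total for country, n in counter.items()}
    return strength


# Toy records, invented purely for illustration.
records = [
    (1979, "US", "CH"), (1979, "CH", "US"), (1979, "CH", "JA"), (1979, "RS", "CH"),
    (1979, "US", "JA"),  # does not involve China, so it is ignored
    (1980, "CH", "US"), (1980, "JA", "CH"),
]
counts = cooccurrence_counts(records)
strength = connection_strength(counts)
print(strength[1979])
```

Collecting `strength[y]["US"]` for each year y would then give a yearly series of the same form as the US example above.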
Modeling and interpreting time series data
p: the autoregressive parameter indicates how much the output variable depends linearly on its own previous values (e.g., how much the value in 2010 depends on the years 2009, 2008, etc.).
d: the integrated parameter is the number of non-seasonal differences needed to remove a long-term trend from the series. For instance, the random walk model Y(t) − Y(t − 1) = µ (where the average difference in Y over time t is a constant, denoted by µ) includes only a non-seasonal difference and a constant term, and is therefore classified as an “ARIMA(0,1,0) model with constant.”
q: the moving average parameter is the number of lagged forecast errors in the prediction. For instance, if a series µt can be represented as a weighted sum of the current and q previous white noise terms (Eq. 1, where εt is a white noise series and θ1 … θq are constants), then µt corresponds to ARIMA(0,0,q). q can be interpreted as a level of uncertainty in time series analysis:

$$\mu_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \qquad (1)$$
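As a small illustration of the d and q components, the following sketch differences a series with a constant step (the ARIMA(0,1,0)-with-constant example) and builds a moving average series from a given noise sequence using the standard MA(q) form µt = εt + θ1 εt−1 + … + θq εt−q. The function names and the toy inputs are hypothetical, and noise terms before the start of the sequence are assumed to be zero.

```python
def difference(series, d=1):
    """Apply d rounds of non-seasonal differencing: y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def moving_average_process(eps, theta):
    """Build mu_t = eps_t + theta_1*eps_{t-1} + ... + theta_q*eps_{t-q}.

    Noise terms before the start of `eps` are treated as zero (an assumption
    made to keep the sketch short).
    """
    q = len(theta)
    out = []
    for t in range(len(eps)):
        value = eps[t]
        for k in range(1, q + 1):
            if t - k >= 0:
                value += theta[k - 1] * eps[t - k]
        out.append(value)
    return out


# A series whose step is the constant mu = 0.5, i.e. Y(t) - Y(t-1) = mu.
walk = [1.0 + 0.5 * t for t in range(6)]
steps = difference(walk)
print(steps)  # every first difference equals the constant 0.5

# An MA(1) series with theta_1 = 0.5 built from a fixed toy noise sequence.
eps = [1.0, -1.0, 2.0, 0.0]
ma1 = moving_average_process(eps, theta=[0.5])
print(ma1)  # [1.0, -0.5, 1.5, 1.0]
```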
The construction of ARIMA models provides quantitative evidence of how China’s inter-country connections have changed over time, and the fitted parameters can be applied to predict and estimate future patterns.
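To show how fitted parameters translate into predictions, the sketch below fits an ARIMA(1,1,0) model with constant to the US series listed earlier and forecasts three further years. Several things here are assumptions rather than the procedure of this study: the order (1,1,0) is picked only for illustration, the estimator is plain conditional least squares on the differenced series instead of the maximum-likelihood fitting a statistics package would normally perform, and all function names are hypothetical.

```python
# US-China connection strength, 1979-2013, copied from the series above.
US = [0.162, 0.174, 0.191, 0.193, 0.189, 0.189, 0.181, 0.177, 0.174, 0.169,
      0.17, 0.165, 0.162, 0.157, 0.157, 0.161, 0.17, 0.165, 0.164, 0.165,
      0.166, 0.16, 0.164, 0.16, 0.159, 0.155, 0.153, 0.151, 0.153, 0.156,
      0.162, 0.169, 0.175, 0.178, 0.179]


def first_difference(series):
    """d = 1: replace the series by its year-to-year changes."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def fit_ar1(z):
    """Least-squares fit of z_t = c + phi * z_{t-1} + e_t (p = 1)."""
    x, y = z[:-1], z[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_arima110(series, c, phi, steps):
    """Forecast by predicting each change, then adding it back to the level."""
    level = series[-1]
    change = series[-1] - series[-2]
    out = []
    for _ in range(steps):
        change = c + phi * change  # AR(1) prediction of the next change
        level += change            # undo the differencing
        out.append(level)
    return out


c, phi = fit_ar1(first_difference(US))
predicted = forecast_arima110(US, c, phi, steps=3)
print("constant:", round(c, 5), "phi:", round(phi, 3))
print("forecast for 2014-2016:", [round(v, 3) for v in predicted])
```

A dedicated time series library would additionally report standard errors and information criteria, which are what one would normally compare when choosing among candidate (p, d, q) orders.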
4 Results and Discussion
ARIMA models and predicted results (‘Obs.’ indicates observed data)
Categorized ARIMA models and countries

| Model type | Parameters | Interpretation | Countries |
|---|---|---|---|
| Autoregressive models | (p > 0, d = 0, q = 0) | The output variable depends linearly on its own previous values | JA, RS, PK |
| Autoregressive integrated models | (p > 0, d > 0, q = 0) | Autoregressive models with non-stationary behavior (e.g., long-term trend) | US, IN, AS, VM, CA |
| Integrated moving average models | (p = 0, d > 0, q > 0) | For moving average models, the output variable is conceptually a linear regression on a linear combination of q + 1 white noise variables. Integrated moving average models are MA models with non-stationary behavior | KS, KN, GM, RP |
| Autoregressive moving average models | (p > 0, d = 0, q > 0) | A combination of MA and AR models without a non-stationary component | |
| General integrated models | (p = 0, d > 0, q = 0) | The output variable depends only on the orders of a non-stationary component | |
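The categories above are determined entirely by which of p, d, and q are non-zero, so the grouping can be expressed as a small lookup. The sketch below is illustrative only: the function name is hypothetical, the label for the (p > 0, d = 0, q = 0) case is inferred from its description, and orders that match none of the listed patterns are simply reported as "other".

```python
def categorize_arima(p, d, q):
    """Map an ARIMA order (p, d, q) to the category names used in the table."""
    if min(p, d, q) < 0:
        raise ValueError("ARIMA orders must be non-negative")
    pattern = (p > 0, d > 0, q > 0)
    categories = {
        (True, False, False): "autoregressive",  # label inferred from the description
        (True, True, False): "autoregressive integrated",
        (False, True, True): "integrated moving average",
        (True, False, True): "autoregressive moving average",
        (False, True, False): "general integrated",
    }
    # Patterns not listed in the table, e.g. (0, 0, q) or (p, d, q) all > 0.
    return categories.get(pattern, "other")


print(categorize_arima(1, 1, 0))  # autoregressive integrated
print(categorize_arima(0, 1, 0))  # general integrated
```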
This paper applied the GDELT dataset to examine the connection between China and foreign countries based on time series analysis. We examined the effectiveness of ARIMA models in predicting trends in long-term mass media data. Although ARIMA has been previously applied in fields such as political geography and communication, its application to inter-country relations in the big data era has been limited. We also demonstrated the power of applying GDELT and big data techniques to investigate informative patterns for interdisciplinary researchers. This research does not aim to provide in-depth interpretation of the causes and consequences of these international events from a political perspective; instead, we proposed a method to discover patterns that can provide insights in different research fields.
Potential future directions include extending this method to other countries to test its robustness. GDELT provides a rich data source for analyzing inter-region relations at various spatial scales, such as investigating the connection between different provinces in China. Another valuable direction is to compare the performance of mass media and social media in characterizing urban-level patterns. Future studies can also examine the correlation between connection strength and variables such as population and economic status, as well as the tone of each event record.
References
3. Yuan, Y., Liu, Y.: Exploring inter-country connection in mass media: a case study of China. In: International Conference on Location-based Social Media, Athens, Georgia (2015)
7. Leetaru, K., Schrodt, P.: GDELT: global data on events, language, and tone, 1979–2012. In: International Studies Association Annual Conference, San Diego, CA (2013)
8. Yonamine, J.E.: Predicting future levels of violence in Afghanistan district using GDELT. UT Dallas (2013)
9. Cohen, S.B.: Geopolitics: The Geography of International Relations. Rowman & Littlefield, Lanham (2009)
12. Gao, H., Liu, H.: Mining Human Mobility in Location-Based Social Networks. Morgan & Claypool Publishers (2015)
15. McQuail, D.: The influence and effects of mass media. In: Graber, D.A. (ed.) Media Power in Politics. CQ Press, Washington, D.C. (1979)
18. Yu, T., Jan, T., Debenham, J., Simoff, S.: Classify unexpected news impacts to stock price by incorporating time series analysis into support vector machine. In: 2006 IEEE International Joint Conference on Neural Network Proceedings, vols. 1–10, pp. 2993–2998 (2006)
19. Schrodt, P.: Conflict and Mediation Event Observations Event and Actor Codebook V.1.1b3 (2012)
20. Jiang, L., Mai, F.: Discovering bilateral and multilateral causal events in GDELT. In: International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction (2014)