Modeling Inter-country Connection from Geotagged News Reports: A Time-Series Analysis

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10387)

Abstract

The rapid development of big data techniques provides growing opportunities to investigate large-scale events that emerge over space and time. This research utilizes a unique open-access dataset, “The Global Data on Events, Location and Tone” (GDELT), to model how China has connected to the rest of the world and to predict how this connection may evolve over time using autoregressive integrated moving average (ARIMA) models. Methodologically, we examined the effectiveness of traditional time series models in predicting trends in long-term mass media data. Empirically, we identified various types of ARIMA models to depict the connection patterns between China and its top 15 related countries. This study demonstrates the power of applying GDELT and big data analytics to investigate informative patterns for interdisciplinary researchers, and provides valuable references for interpreting regional patterns and international relations in the age of instant access.

Keywords

Time series analysis, ARIMA, Inter-country relations, Mass media events, GDELT

1 Introduction

In recent decades, the rapid development of techniques and theories in the big data field has introduced new challenges and opportunities to analyze the large amount of information available online [1, 2, 3], including user-contributed (personalized) information such as social media data and traditional mass media that targets a larger audience. Social media are best characterized by a series of Social Network Sites (SNS) (such as Facebook and Twitter) that have attracted worldwide users to communicate, socialize, and share their daily lives, whereas mass media refers to various forms of media technologies that aim to reach a large audience via mass communication, including broadcast, print, film, and new channels developed with the growth of the world wide web (WWW), such as online news reports [4]. Although many studies have focused on how user-generated content has revolutionized the traditional media landscape, especially in the marketing field [5, 6], there has not been sufficient study on how these mass media datasets (e.g., massive online news archives) can be utilized to track, analyze, and model societal issues, such as the conflict and interaction between regions and countries. Realizing the necessity to explore the geographic component of these geotagged news reports, this research utilizes an open-access dataset, “The Global Data on Events, Location and Tone” (GDELT), to analyze China’s inter-country connections as time series. GDELT monitors print, broadcast, and web news media in over 100 languages worldwide and automatically encodes such data into a structured database. Although researchers in various fields such as sociology and communication have explored the potential of such data in analyzing societal events [7, 8], there is very limited research in utilizing these extracted mass media data in geography, such as analyzing the evolution of a geographic entity or the connection between geographic entities over time [9, 10].
We adopt an autoregressive integrated moving average (ARIMA) model to analyze the time series because it combines autoregressive (AR), moving average (MA), and “integrated” (I) components. These models are appropriate for time series data either to better interpret the autocorrelation of the data or to forecast future points in the series [11]. Additionally, the ARIMA model is capable of dealing with non-stationary time series data, which is typically associated with long-term news events. This research concentrates on demonstrating the effectiveness of applying time series analysis to geotagged mass media data. We do not aim to interpret these patterns from a sociological or political perspective. The applied methodology can be further extended to other fields as a data pre-processing strategy, such as public relations, communications, and political geography.

2 Related Work

In the age of instant access, the widespread usage of the Internet has introduced multiple new channels in the field of communication. On the one hand, researchers have investigated how “individual-oriented” SNS have revolutionized where, when, and how people communicate and share their daily life [12, 13, 14]. These social media platforms not only provide multiple avenues for communication, but they also generate rich data sources that allow researchers to analyze human behavior patterns from both individual and aggregated perspectives [2]. On the other hand, compared to individual-oriented social media data, traditional mass media channels, such as newspapers and TV programs, concentrate on delivering information to a larger audience [4, 15]. Researchers have recognized that the advantage of mass media content lies in its professional nature: compared to social media, traditional mass media often addresses significant and aggregated events [16], thereby playing an important role in analyzing the social, economic, and cultural status of a society. Many researchers have applied mass media data to modeling short-term events [17], such as the response of the stock market to major social events [18]. In addition, since traditional mass media has evolved for decades (or even centuries), and the data are often collected over a longer time span, they are more appropriate for investigating long-term socio-economic trends and patterns, such as the evolution of an urban system over decades, the collective patterns of a society, or the connection between societal systems [17]. For example, the machine-coded GDELT dataset [19] utilized in this research is updated daily and consists of over a quarter-billion news event records dating back to 1979. It captures what has happened/is happening worldwide [7, 17], and therefore has been utilized in many previous studies to analyze various long-term collective patterns [20].
One example study was conducted by Yonamine [8], in which the researchers constructed a predictive model to explore conflict levels in Afghanistan by incorporating various socio-economic indicators such as unemployment levels and ethnic diversity. Yuan, Liu, and Wei [17] also utilized GDELT to analyze how China was connected to other countries in the past few decades and how these patterns can be clustered into different categories. As mentioned in Sect. 1, this study focuses on constructing ARIMA time series models to predict the strength of international relations, which can be considered an extension of Yuan, Liu, and Wei [17] from a time series modeling perspective.

3 Methodology

The GDELT data used in this research include multiple columns such as the source, actors, time, and approximate location of recorded events. For instance, in a news report entitled “In Malaysia, Obama carefully calibrates message to Beijing,” Actor 1 would be “United States government” and Actor 2 would be “Chinese government”. The associated geographic locations of Actor 1, Actor 2, and the actual action are “Washington DC, United States”, “Beijing, China”, and “Kuala Lumpur, Malaysia”, respectively.

As discussed in Sect. 2, this research concentrates on the inter-country relatedness between China and foreign countries. The analyses will be conducted using the following two steps:
  • Data Preprocessing

First, we extract all news records involving China and another country as the two parties. Note that the location of the “action” is not a substantial factor here, since an event related to a certain country can happen inside or outside of that country. Based on the pre-processed data, we calculate descriptive statistics to provide a general interpretation of the trend at various spatio-temporal scales. For each year and each country, we calculate the frequency of “co-occurrence” with China (denoted as c) in the dataset. The frequencies are denoted as \( F_{y} (i,c) \), which stands for the “co-occurrence” frequency between China and country i in year y. We define connection strength as follows:

$$ Co_{y} (i,c) = \frac{{F_{y} (i,c)}}{{\sum\limits_{j \ne c} {F_{y} (j,c)} }}. $$
(1)
where \( \sum\limits_{j \ne c} {F_{y} (j,c)} \) is the total number of records that involve China and another country as two actors. Note that the connection strength is not normalized by the total occurrence of country i.
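Eq. 1 can be sketched in a few lines of Python. The record layout below (a year plus two country codes per event, with "CH" standing for China) and the function name are illustrative assumptions, not the GDELT schema or the authors' code, and the toy records are invented for the example.

```python
from collections import Counter, defaultdict

CHINA = "CH"  # assumed code for China, for illustration only


def connection_strength(records, china=CHINA):
    """Return {year: {country: Co_y(i, c)}} following Eq. 1.

    `records` is an iterable of (year, actor1_country, actor2_country)
    tuples; this layout is a hypothetical simplification of an event table.
    """
    counts = defaultdict(Counter)  # counts[year][country] = F_y(i, c)
    for year, a1, a2 in records:
        # Keep only events with China on exactly one side.
        if (a1 == china) == (a2 == china):
            continue
        other = a2 if a1 == china else a1
        counts[year][other] += 1

    strength = {}
    for year, freq in counts.items():
        total = sum(freq.values())  # sum over j != c of F_y(j, c)
        strength[year] = {country: n / total for country, n in freq.items()}
    return strength


# Toy records (made up): three China-US events, one China-Japan event,
# and one US-Japan event that is ignored because China is not involved.
toy = [
    (1979, "CH", "US"), (1979, "US", "CH"), (1979, "CH", "US"),
    (1979, "JA", "CH"), (1979, "US", "JA"),
]
result = connection_strength(toy)
print(result[1979])
```

Because the denominator only counts records that involve China, the strengths for a given year sum to 1 across countries, which is why the measure describes a country's share of China's connections rather than an absolute volume.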

To explore the changing dynamics of this pattern, we compute the yearly connection strength between China and each of the top 15 countries, represented as time series data. The following example series for the United States and China covers the years 1979 to 2013; the connection strength is 0.162 in 1979 and 0.179 in 2013:

  • US [0.162, 0.174, 0.191, 0.193, 0.189, 0.189, 0.181, 0.177, 0.174, 0.169, 0.17, 0.165, 0.162, 0.157, 0.157, 0.161, 0.17, 0.165, 0.164, 0.165, 0.166, 0.16, 0.164, 0.16, 0.159, 0.155, 0.153, 0.151, 0.153, 0.156, 0.162, 0.169, 0.175, 0.178, 0.179]

  • Modeling and interpreting time series data

As discussed in Sect. 1, ARIMA models can be applied to both stationary and non-stationary time series data. Because of this flexibility, this research constructed ARIMA models to better interpret the summarized time series. An ARIMA model is generally written as ARIMA(p,d,q), where the three parameters p, d, and q are non-negative integers. They refer to the autoregressive, integrated, and moving average parts of the model respectively, and are interpreted as follows:
  • p: the autoregressive parameter indicates how much the output variable depends linearly on its own previous values (e.g., how much the value in 2010 depends on the years 2009, 2008, etc.).

  • d: the integrated parameter is the number of non-seasonal differences applied to the series to remove a long-term trend. For instance, the random walk model Y(t) − Y(t − 1) = µ (where the average difference in Y between consecutive time points is a constant, denoted by µ) includes only one non-seasonal difference and a constant term, and is therefore classified as an “ARIMA(0,1,0) model with constant.”

  • q: the moving average parameter is the number of lagged forecast errors in the prediction equation. For instance, if a series µt can be represented by a weighted sum of the current and q previous white noise terms (Eq. 2, where εt is a white noise series and θ1, …, θq are constants), then µt corresponds to ARIMA(0,0,q). q can be interpreted as a level of uncertainty in time series analysis:

$$ \mu_{t} = \varepsilon_{t} + \theta_{1} \varepsilon_{t - 1} + \cdots + \theta_{q} \varepsilon_{t - q} . $$
(2)
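
For reference, the three parts can be combined into a single expression. The following is the standard textbook form of an ARIMA(p,d,q) process rather than a formula taken from this paper; the backshift operator B (with \( BY_{t} = Y_{t - 1} \)), the autoregressive coefficients \( \phi_{i} \), and the constant c are notation introduced here, while \( \theta_{j} \) and \( \varepsilon_{t} \) follow Eq. 2:

$$ \left( {1 - \sum\limits_{i = 1}^{p} {\phi_{i} B^{i} } } \right)(1 - B)^{d} Y_{t} = c + \left( {1 + \sum\limits_{j = 1}^{q} {\theta_{j} B^{j} } } \right)\varepsilon_{t} . $$

With p = 0, d = 1, and q = 0, this reduces to \( Y_{t} - Y_{t - 1} = c + \varepsilon_{t} \), the random walk with constant described above (c plays the role of µ); with p = d = 0 and c = 0, it reduces to Eq. 2.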

The construction of ARIMA models provides quantitative evidence of how the inter-country connection of China has changed over time, and the fitted parameters can be applied for predictions and estimates of future patterns.
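
To make the roles of d and p concrete, the sketch below fits an ARIMA(1,1,0) model to the China-US series quoted above and forecasts the last three years, which are held out from fitting. It rests on two loudly stated assumptions: the model has no constant term, and the autoregressive coefficient is estimated by ordinary least squares on the differenced series. The paper does not state which estimator or software produced its fitted values, so the numbers printed here need not match the published ones. All function names are hypothetical.

```python
# Yearly China-US connection strength, 1979-2013 (series quoted in Sect. 3).
US = [0.162, 0.174, 0.191, 0.193, 0.189, 0.189, 0.181, 0.177, 0.174, 0.169,
      0.17, 0.165, 0.162, 0.157, 0.157, 0.161, 0.17, 0.165, 0.164, 0.165,
      0.166, 0.16, 0.164, 0.16, 0.159, 0.155, 0.153, 0.151, 0.153, 0.156,
      0.162, 0.169, 0.175, 0.178, 0.179]
YEARS = list(range(1979, 2014))


def difference(series):
    """First-order differencing: the 'integrated' step with d = 1."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1_no_constant(diffs):
    """Least-squares slope of d_t on d_{t-1}, with no intercept (an assumption)."""
    num = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    return num / den


def forecast_arima_110(train, steps):
    """Forecast `steps` values ahead from an ARIMA(1,1,0) fitted to `train`."""
    diffs = difference(train)
    phi = fit_ar1_no_constant(diffs)
    level, last_diff = train[-1], diffs[-1]
    out = []
    for _ in range(steps):
        last_diff = phi * last_diff  # AR(1) forecast of the next difference
        level = level + last_diff    # undo the differencing
        out.append(level)
    return phi, out


train, test = US[:-3], US[-3:]  # fit on 1979-2010, hold out 2011-2013
phi, predicted = forecast_arima_110(train, steps=3)
print("phi =", round(phi, 3))
for year, f, y in zip(YEARS[-3:], predicted, test):
    print(year, "forecast", round(f, 4), "observed", y)
```

Differencing removes the long-term level, so the AR(1) part only has to describe how one year's change relates to the previous year's change; the forecast is rebuilt by adding the predicted changes back onto the last observed value.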

4 Results and Discussion

The ARIMA models are constructed based on the yearly connection strength defined in Sect. 3 step 1. Table 1 presents the models and fitted results. To test the effectiveness of the models, we utilized data from 1979–2010 as a training set and the years 2011, 2012, and 2013 as a testing set for model validation.
Table 1. ARIMA models and predicted results (‘Obs.’ indicates observed data)

| Country | Code | ARIMA model | Fitted 2011 | Obs. 2011 | Fitted 2012 | Obs. 2012 | Fitted 2013 | Obs. 2013 |
|---|---|---|---|---|---|---|---|---|
| United States | US | (1,1,0) | 0.172 | 0.175 | 0.1732 | 0.1781 | 0.1737 | 0.1792 |
| Japan | JA | (1,0,0) | 0.0991 | 0.0946 | 0.0984 | 0.0927 | 0.0977 | 0.0929 |
| Russia | RS | (1,0,0) | 0.0858 | 0.0806 | 0.087 | 0.0806 | 0.0881 | 0.0808 |
| South Korea | KS | (0,1,1) | 0.0525 | 0.0493 | 0.054 | 0.0467 | 0.0555 | 0.0465 |
| North Korea | KN | (0,1,1) | 0.0488 | 0.0455 | 0.0503 | 0.0423 | 0.0517 | 0.0424 |
| United Kingdom | UK | (1,0,2) | 0.0446 | 0.0409 | 0.0458 | 0.0413 | 0.0483 | 0.0414 |
| France | FR | (0,1,0) | 0.0295 | 0.0292 | 0.0289 | 0.029 | 0.0285 | 0.029 |
| Iran | IR | (0,1,0) | 0.0225 | 0.0215 | 0.0232 | 0.0235 | 0.0239 | 0.0238 |
| Pakistan | PK | (2,0,0) | 0.0242 | 0.0249 | 0.0234 | 0.0236 | 0.022 | 0.0236 |
| India | IN | (1,1,0) | 0.0236 | 0.0226 | 0.0245 | 0.0226 | 0.0252 | 0.0227 |
| Australia | AS | (1,1,0) | 0.022 | 0.022 | 0.0226 | 0.022 | 0.0232 | 0.0219 |
| Vietnam | VM | (1,2,0) | 0.0191 | 0.0208 | 0.0192 | 0.0195 | 0.0198 | 0.0193 |
| Germany | GM | (0,1,1) | 0.0183 | 0.0186 | 0.0175 | 0.0184 | 0.017 | 0.0184 |
| Philippines | RP | (0,1,1) | 0.0109 | 0.0127 | 0.0139 | 0.0151 | 0.0149 | 0.0152 |
| Canada | CA | (2,1,0) | 0.0128 | 0.0124 | 0.0133 | 0.0138 | 0.0135 | 0.014 |

The fitted ARIMA models in Table 1 show interesting patterns. The non-zero d value (integrated parameter) for most countries indicates that a non-stationary long-term trend exists in the connection between China and these countries. Figure 1 shows an example time series for South Korea with a clear increasing trend (d = 1). This reflects the rapidly growing connection between China and South Korea since the 1970s.

Fig. 1. Yearly connection strength between China and South Korea

Moreover, the fitted models in Table 1 indicate that the 15 countries can be grouped into the following categories (Table 2):
Table 2. Categorized ARIMA models and countries

| Model type | Parameters | Characteristics | Countries |
|---|---|---|---|
| Autoregressive models | (p > 0, d = 0, q = 0) | The output variable depends linearly on its own previous values | JA, RS, PK |
| Autoregressive integrated models | (p > 0, d > 0, q = 0) | Autoregressive models with non-stationary behavior (e.g., long-term trend) | US, IN, AS, VM, CA |
| Integrated moving average models | (p = 0, d > 0, q > 0) | In a moving average (MA) model, the output variable is conceptually a linear regression on q + 1 white noise variables (the current term and q lagged terms). An integrated moving average model is an MA model with non-stationary behavior | KS, KN, GM, RP |
| Autoregressive moving average models | (p > 0, d = 0, q > 0) | A combination of MA and AR models without a non-stationary component | UK |
| General integrated models | (p = 0, d > 0, q = 0) | The output variable depends only on the order of the non-stationary (differencing) component | FR, IR |
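
The grouping in Table 2 follows mechanically from the (p,d,q) orders in Table 1. The short sketch below reproduces it; the function name and the shortened category labels are illustrative choices, and the orders are copied from Table 1.

```python
def categorize(order):
    """Map an ARIMA (p, d, q) order to the categories used in Table 2."""
    p, d, q = order
    if p > 0 and d == 0 and q == 0:
        return "Autoregressive"
    if p > 0 and d > 0 and q == 0:
        return "Autoregressive integrated"
    if p == 0 and d > 0 and q > 0:
        return "Integrated moving average"
    if p > 0 and d == 0 and q > 0:
        return "Autoregressive moving average"
    if p == 0 and d > 0 and q == 0:
        return "General integrated"
    return "Other"  # combinations that do not occur among the 15 fitted models


# Country code -> fitted order, copied from Table 1 (same row order).
ORDERS = {
    "US": (1, 1, 0), "JA": (1, 0, 0), "RS": (1, 0, 0), "KS": (0, 1, 1),
    "KN": (0, 1, 1), "UK": (1, 0, 2), "FR": (0, 1, 0), "IR": (0, 1, 0),
    "PK": (2, 0, 0), "IN": (1, 1, 0), "AS": (1, 1, 0), "VM": (1, 2, 0),
    "GM": (0, 1, 1), "RP": (0, 1, 1), "CA": (2, 1, 0),
}

groups = {}
for code, order in ORDERS.items():
    groups.setdefault(categorize(order), []).append(code)

for name, codes in groups.items():
    print(f"{name}: {', '.join(codes)}")
```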

Table 2 indicates varying patterns between different countries and China. For instance, the connection strength between China and Russia is fitted as a stationary process, in which the connection strength for a certain year auto-correlates with the value of the previous year. However, between China and France, the connection strength follows a basic random walk model (ARIMA(0,1,0)), where the difference between two consecutive years can be modeled as a constant. To validate the models, we also computed the predicted connection strength for the testing years 2011, 2012, and 2013. The forecast accuracy of the models is evaluated using mean absolute percentage error (MAPE):
$$ MAPE = \frac{1}{n}\sum\nolimits_{t = 1}^{n} {\left| {\frac{{Y_{t} - F_{t} }}{{Y_{t} }}} \right|} . $$
(3)
where n is the number of time points, Ft is the forecast value at time t, and Yt is the actual value. For the forecasts in Table 1, the average MAPE values for the years 2011, 2012, and 2013 are 4.20%, 6.26%, and 7.79%, respectively. As can be seen, the error grows as the forecast horizon lengthens; however, the result still indicates a reliable model with low prediction error rates (<8% for all three testing years).
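
Eq. 3 is straightforward to compute. The sketch below applies it to the United States row of Table 1, averaging over the three testing years for that single country; note that this is a different aggregation from the per-year averages reported above, so the result is not expected to equal any of those three figures. The function and variable names are illustrative.

```python
def mape(actual, forecast):
    """Mean absolute percentage error (Eq. 3), returned as a fraction."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / len(actual)


# United States row of Table 1: observed and fitted values for 2011-2013.
us_observed = [0.175, 0.1781, 0.1792]
us_fitted = [0.172, 0.1732, 0.1737]

us_mape = mape(us_observed, us_fitted)
print(f"US MAPE over 2011-2013: {us_mape:.2%}")
```

Because each error is divided by the observed value, MAPE is scale-free, which makes it possible to compare countries whose connection strengths differ by an order of magnitude.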

5 Conclusion

This paper applied the GDELT dataset to examine the connection between China and foreign countries based on time series analysis. We examined the effectiveness of ARIMA models in predicting trends in long-term mass media data. Although ARIMA has been previously applied in fields such as political geography and communication, its use for characterizing inter-country relations in the big data era has been limited. We also demonstrated the power of applying GDELT and big data techniques to investigate informative patterns for interdisciplinary researchers. This research does not aim to provide in-depth interpretation of the causes and consequences of these international events from a political perspective; instead, we proposed a method to discover patterns that can provide insights in different research fields.

Potential future directions include extending this method to other countries to test its robustness. GDELT provides a rich data source to analyze inter-region relations at various spatial scales, such as investigating the connection between different provinces in China. Another valuable direction is to compare the performance of mass media and social media in characterizing urban-level patterns. Future studies can also look into the correlation between connection strength and variables such as population and economic status, as well as the tone of each event record.

References

  1. Eagle, N., Pentland, A., Lazer, D.: Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. USA 106, 15274–15278 (2009)
  2. Liben-Nowell, D., Novak, J., Kumar, R., Raghavan, P., Tomkins, A.: Geographic routing in social networks. Proc. Natl. Acad. Sci. USA 102, 11623–11628 (2005)
  3. Yuan, Y., Liu, Y.: Exploring inter-country connection in mass media: a case study of China. In: International Conference on Location-based Social Media, Athens, Georgia (2015)
  4. Mazzitello, K.I., Candia, J., Dossetti, V.: Effects of mass media and cultural drift in a model for social influence. Int. J. Mod. Phys. C 18, 1475–1482 (2007)
  5. Stephen, A., Galak, J.: The effects of traditional and social earned media on sales: a study of a microlending marketplace. J. Mark. Res. 49, 624–639 (2012)
  6. Meraz, S.: Is there an elite hold? Traditional media to social media agenda setting influence in blogs networks. J. Comput. Mediated Commun. 14, 682–707 (2009)
  7. Leetaru, K., Schrodt, P.: GDELT: global data on events, language, and tone, 1979–2012. In: International Studies Association Annual Conference, San Diego, CA (2013)
  8. Yonamine, J.E.: Predicting future levels of violence in Afghanistan district using GDELT. UT Dallas (2013)
  9. Cohen, S.B.: Geopolitics: The Geography of International Relations. Rowman & Littlefield, Lanham (2009)
  10. Liu, Y., Wang, F.H., Kang, C.G., Gao, Y., Lu, Y.M.: Analyzing relatedness by toponym co-occurrences on web pages. Trans. GIS 18, 89–107 (2014)
  11. Wilde, G.J.S.: Effects of mass-media communications on health and safety habits - an overview of issues and evidence. Addiction 88, 983–996 (1993)
  12. Gao, H., Liu, H.: Mining Human Mobility in Location-Based Social Networks. Morgan & Claypool Publishers (2015)
  13. Memon, I., Chen, L., Majid, A., Lv, M.Q., Hussain, I., Chen, G.C.: Travel recommendation using geo-tagged photos in social media for tourist. Wireless Pers. Commun. 80, 1347–1362 (2015)
  14. Wu, L., Zhi, Y., Sui, Z.W., Liu, Y.: Intra-urban human mobility and activity transition: evidence from social media check-in data. PLoS ONE 9, e97010 (2014)
  15. McQuail, D.: The influence and effects of mass media. In: Graber, D.A. (ed.) Media Power in Politics. CQ Press, Washington, D.C. (1979)
  16. Liebert, R.M., Schwartzberg, N.S.: Effects of mass-media. Annu. Rev. Psychol. 28, 141–173 (1977)
  17. Yuan, Y., Liu, Y., Wei, G.: Exploring inter-country connection in mass media: a case study of China. Comput. Environ. Urban Syst. 62, 86–96 (2017)
  18. Yu, T., Jan, T., Debenham, J., Simoff, S.: Classify unexpected news impacts to stock price by incorporating time series analysis into support vector machine. In: 2006 IEEE International Joint Conference on Neural Network Proceedings, vols. 1–10, pp. 2993–2998 (2006)
  19. Schrodt, P.: Conflict and Mediation Event Observations Event and Actor Codebook V.1.1b3 (2012)
  20. Jiang, L., Mai, F.: Discovering bilateral and multilateral causal events in GDELT. In: International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction (2014)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Geography, Texas State University, San Marcos, USA
