Introduction

Terrorism is a common threat to the international community [1]. Since the “9/11” attacks in particular, the international community has treated the response to and fight against terrorism as an important area of global governance. As one of the non-traditional factors endangering international security, terrorism has reached an unprecedented degree of internationalization in the era of globalization. The specific manifestations are as follows: (1) The number of terrorist attacks worldwide has been increasing since the start of the twenty-first century. (2) Terrorist attacks on innocent civilians are becoming more frequent. (3) Weak regions and zones have become hot spots for terrorist activities. (4) Collusion and contacts between terrorist organizations in different countries are increasing. The internationalization of terrorism means that its impact on national security spreads beyond national borders, affecting the political stability of countries in the region and leading to tension and deterioration of relations between countries. It also brings a series of social problems, such as the spread of drugs and organized crime, which have an extremely negative impact on regional security [2]. Therefore, casualty prediction of terrorist attacks at the regional level has become a key link in effectively preventing and responding to terrorist attacks and in evaluating the regional terrorism-related security situation [3,4,5].

Casualty prediction of terrorist attacks at the regional level is a typical spatial–temporal data prediction problem. This paper relies on two key assumptions: (1) The general direction and trend of the development of terrorist organizations or terrorists over a certain period of time can be obtained through data analysis. (2) The selected regions are less affected by external factors. Under these assumptions, casualty prediction of terrorist attacks at the regional level can be carried out meaningfully. In this prediction problem, the time series data of terrorist attacks are recorded at fixed points in time and at fixed locations in a continuous space. The data recorded at adjacent locations and times are not independent but dynamically correlated. Therefore, effectively extracting the spatial–temporal correlation of terrorist attack data and finding its inherent spatial–temporal regularities is the key to solving this problem.

The Global Terrorism Database (GTD) is an open-source database that is considered the most comprehensive terrorist attack database in the field of terrorism research. It contains more than 200,000 records of terrorist attacks worldwide with rich geographic information, which provides a solid data foundation for the spatial–temporal prediction of terrorist attacks [6]. Many researchers have made great efforts to solve such problems. In the early days, time series analysis models were applied to the prediction and early warning of terrorist attacks. However, the time series data of terrorist attacks often show complex spatial–temporal correlation, so they are difficult to model with traditional time series analysis methods. Subsequently, traditional machine learning methods were used to model complex terrorist attack data, but they cannot consider the spatial and temporal correlations of terrorist attack data at the same time. Moreover, the prediction performance of such methods depends to a large extent on feature engineering, which requires considerable expertise in related fields. In recent years, many researchers have used deep learning and network modeling methods to process terrorist attack data and have established different models to describe, explain and predict terrorist attacks. However, these methods still cannot model the spatial–temporal dynamic correlation of terrorist attack data jointly.

At the same time, the research problem of this paper is also an important topic in the field of spatial–temporal data mining. In the past, statistical models such as VAR [7] and ARIMA [8] were used to model time series. These methods require the data to satisfy specific stationarity assumptions, so they are not suitable for modeling nonlinear spatial–temporal data. Later, machine learning methods, including SVR [9] and XGBoost [10], were used to capture the nonlinear features of spatial–temporal data. However, machine learning models depend on manually engineered features, so their ability to represent features is limited. With the rapid development of deep neural networks, RNNs and their variants, such as LSTM [11] and GRU [12], have shown excellent performance in time series modeling tasks. However, in spatial–temporal data prediction, RNN-type models only capture correlation along the time dimension and ignore spatial correlation.

In recent years, more and more scholars have begun to try to apply the graph neural network model specially designed for processing graph-structured data to the spatial–temporal graph data prediction problem, to deal with the limitations of traditional methods in spatial–temporal data prediction [13,14,15,16]. These novel spatial–temporal graph neural networks can be divided into two categories: recurrent neural network-based methods and convolutional neural network-based methods.

The spatial–temporal graph neural network models based on recurrent neural networks use recurrent units to store and process the sequence information of historical data and use graph convolution operations to replace the original linear transformations in the recurrent units [16, 17]. Although recurrent neural networks were originally designed to learn the correlation of data in the time dimension, theoretical derivation and practical results [18, 19] show that, due to problems such as vanishing and exploding gradients, recurrent neural networks are limited in their ability to capture long-range correlation in the temporal dimension. In addition to recurrent neural networks, convolution operations can also be used to process the features of data along the time dimension. In recent years, many studies have used convolution along the time dimension to process the temporal information of spatial–temporal graph data and combined it with graph convolution to process the spatial information. STGCN [20] used ChebNet and 2D convolution networks to capture spatial and temporal correlations, respectively. Graph WaveNet [14] used dilated causal convolution and a gating mechanism to model temporal features while utilizing diffusion convolution to aggregate information from adjacent nodes in the spatial domain. Song et al. [15] proposed a novel spatial–temporal graph convolution model that can capture localized spatial–temporal correlation synchronously. In addition, the model constructs multiple modules to capture the heterogeneities in localized spatial–temporal graphs.

However, the above methods still face challenges in capturing complex spatial–temporal correlations. In the spatial dimension, there are complex spatial dependencies between different regions, which requires the model not only to handle the spatial correlation of adjacent regions but also to capture the spatial correlation between distant regions with similar contexts. In the time dimension, the prediction for a given time is usually associated with various historical observations, such as those from an hour, a day, or even a week earlier. This requires the model to capture the complex nonlinear relationship between observations at different times and to effectively process and remember information in long time series. Geng et al. [17] proposed a spatial–temporal multi-graph convolution network (ST-MGCN), which encodes the non-Euclidean pairwise correlation between regions into multiple graphs and then uses multi-graph convolution to explicitly model the spatial correlation. Zhang et al. [21] proposed a deep spatial–temporal residual network (ST-ResNet) to explicitly model the temporal dependence of crowd flow. These methods improve the performance of spatial–temporal data prediction in many respects, but they lack a comprehensive approach and cannot handle the complex temporal and spatial correlations in spatial–temporal data simultaneously and coherently.

To solve the above problems, we propose a deep learning model for predicting innocent civilian casualties in terrorist attacks: attention-based spatial–temporal multi-graph convolutional networks (AST-MGCN). Figure 1 illustrates the overall architecture of our model. Specifically, (1) based on the different social-spatial dynamic relationships of terrorist attacks, we construct a spatial adjacency graph and a spatial diffusion graph. The nodes of each graph correspond to the regions defined in GTD, and the specific connections between regions correspond to the edges of the graph. (2) We propose a multi-scale periodic analysis method based on wavelet transform, which is used to determine the multi-scale features of the time series of casualties in terrorist attacks. On this basis, we extract the temporal dynamic information of terrorist attack data from three kinds of temporal properties: trend, period and closeness. (3) We design a spatial–temporal attention mechanism to learn the most relevant spatial–temporal dynamic information of terrorist attack data. More specifically, spatial attention is used to capture the complex social-spatial correlation between different regions, and temporal attention is used to capture the dynamic temporal correlation between different times. (4) We propose a new spatial–temporal multi-graph convolution model, which uses spatial multi-graph convolution to capture rich social-spatial features from multiple views and uses temporal convolution to capture temporal dependencies. We evaluate the effectiveness of the AST-MGCN on the GTD database through experiments, and the results show that the model achieves state-of-the-art performance.

Fig. 1 Overall architecture of the AST-MGCN

The rest of this paper is organized as follows. In “Literature review”, we briefly review the relevant literature, summarize the research conclusions, analyze the shortcomings, and position this study. In “Preliminaries”, we define the research problem addressed in this paper. In “Attention-based spatial–temporal multi-graph convolutional networks”, we describe in detail the overall architecture and each module of the proposed model. In “Experiments”, extensive numerical experiments are carried out to evaluate the performance of the model, and the work of this paper is summarized in “Conclusion”.

Literature review

Terrorist activities are the specific forms of terrorism, and events related to terrorist activities are often referred to as “terrorist attacks” or “terrorist events”. As a kind of violent crime with both generality and particularity, the research scope of terrorist activities is extensive. It includes not only the research on the organization, planning, implementation and follow-up governance of terrorist activities but also the study of single or multiple terrorist attacks. The research on terrorist attack groups mainly includes terrorist attack prediction and risk analysis, spatial–temporal evolution and trend analysis.

The main idea of terrorist attack prediction is to extract the inertia model of terrorist acts from historical or current data and to comprehensively predict and evaluate possible terrorist attacks in the future through quantitative methods on a credible basis. Although predicting the occurrence of a terrorist attack has great complexity and uncertainty [22, 23], the data-driven model has achieved some success in many fields [24, 25], providing a new perspective to solve this problem.

In the field of terrorist attack prediction, the traditional approach is to use time series analysis to predict and analyze terrorist attacks. Based on an analysis of the trend and period of the time series of casualties in terrorist attacks, Enders and Sandler [26] used vector autoregression (VAR) to analyze the influence of the end of the Cold War, the decline of far-left terrorism, the export of religious fundamentalism and other important factors on the number of terrorist attack casualties. Enders et al. [27] first calibrated the GTD database to address inconsistent data coding, missing data and misestimation of data. They then used time series methods to explore the dynamic features and correlation of domestic and transnational terrorist attacks.

Another common method is prediction research based on machine learning. Specifically, Siebeneck et al. [28] analyzed the terrorist attacks in Iraq from 2004 to 2006 from three aspects: spatial autocorrelation analysis, hot spot analysis and spatial–temporal clustering. By analyzing the spatial–temporal evolution pattern of terrorist attacks, the authors examined the temporal trend and spatial distribution of terrorist attacks in Iraq. Based on the hijacking events that occurred around the world from 1948 to 2007, Fahey et al. [29] used the logistic regression (LR) method to predict whether a hijacking was intended as a terrorist attack. Tolan and Soliman [30] used Naïve Bayes (NB), support vector machine (SVM) and decision tree (DT) to predict the terrorist organizations that carried out terrorist attacks in Egypt between 1970 and 2013. Taking social, natural, geospatial and other factors into account, Ding et al. [31] used SVM, random forest (RF), neural network (NN) and other methods to predict the possible locations of terrorist attacks. Mo et al. [32] used SVM, NB and LR to predict terrorist attacks based on the GTD database, with a classification accuracy of 78.41%. Hao et al. [33] analyzed the spatial–temporal evolution of terrorist attacks in the Indochina Peninsula by using geospatial statistics. Using the RF method, they predicted the potential risk of terrorist attacks in the Indochina Peninsula at the spatial scale based on 15 driving factors. Meng et al. [34] proposed an optimized hybrid classifier based on K-nearest neighbor (KNN), SVM, Bagging and C4.5 algorithms to predict the types of terrorist attacks, with a genetic algorithm used to optimize the weight of each individual classifier. Kumar et al. [35] used Lazy Tree, multilayer perceptron (MLP) and NB to predict the organizations that carry out terrorist attacks. Feng et al. [36] proposed an XGBoost-based casualty prediction algorithm (RP-GA-XGBoost) to study whether terrorist attacks will cause casualties among innocent civilians. In the proposed algorithm, a hybrid feature selection method based on RF and principal component analysis (PCA) is designed, and a genetic algorithm is used to select the hyperparameters of the model. Python et al. [37] used machine learning algorithms such as the generalized additive model (GAM), RF and XGBoost to predict and explain terrorist attacks at a fine spatial–temporal scale.

In recent years, many researchers have used deep learning methods to predict terrorist attacks. Liu et al. [38] regarded terrorist attack prediction as a multi-behavioral sequential prediction problem and proposed a Time-Aware Recurrent Log-BiLinear (TA-RLBL) model to predict the location of possible attacks by terrorist organizations. Uddin et al. [39] established five different models based on deep neural networks (DNN) to predict the behavior of terrorist attacks and compared the performance of DNN with MLP, LR, SVM, NB and other algorithms. The results show that the DNN-based model has the highest accuracy. The authors therefore conclude that, compared with other state-of-the-art machine learning techniques, DNN is well suited to predicting terrorist attacks. Based on annual frequency data from 1989 to 2016, Hao et al. [40] analyzed the relationship between India’s economy and armed conflict using the Long Short-Term Memory (LSTM) deep learning algorithm for time series and applied LSTM to the prediction of armed conflict.

Prediction based on network modeling is a hot field of terrorist attack prediction [41]. On the basis of network modeling, the existing research combines network prediction with terrorist organization behavior prediction, which provides a new idea for terrorist attack prediction. Moon and Carley [42] modeled and simulated terrorist organizations from the perspective of social and spatial relations, and explored the law of spatial–temporal evolution of terrorist attacks. Carley et al. [43] constructed a multi-level network including people, organizations, events, tasks and geographical locations, and analyzed the spatial–temporal relationship of terrorist attacks based on the theory of multivariate relationship modeling. Desmarais and Cranmer [44] integrated the similarity-based link prediction framework into the probability modeling method to build a transnational terrorist attack network to predict the sources and targets of terrorist attacks. Based on the public terrorist attack data, Fang et al. [45] constructed the alliance network among terrorist organizations, and used the link prediction method to analyze the alliance relationship of terrorist organizations. Based on Rough Set Theory and Three-way Decisions Theory, Loia and Orciuoli [46] proposed a new method to extract terrorist organization networks from the GTD database to study the time evolution of terrorist organization behavior.

The existing research results have launched an extensive discussion on the prediction of terrorist attacks, which provides important theoretical support for this study. It is mainly manifested in the following four aspects: (1) terrorist attacks imply specific characteristic laws [47], which have both generality and particularity, and can be obtained from time series data by certain technical means. (2) Terrorist attacks change dynamically in social-spatial, temporal and other dimensions. (3) Network modeling and analysis is an important tool for quantitative analysis of terrorism, which can provide descriptive explanation and predictive conclusion for terrorist attack prediction research. (4) On the basis of extracting the spatial–temporal features of terrorist attacks, the dynamic early warning and detection of terrorist attacks can be effectively guided by combining network analysis with machine learning and deep learning.

Although there have been many studies on the prediction of terrorist attacks, the existing research has some shortcomings, which fall into the following three aspects: (1) Existing terrorist attack prediction research rarely addresses the prediction of terrorist attack casualties, although casualty prediction plays a very important role in the management and control of terrorist attacks. (2) At present, most terrorist attack prediction based on network modeling is static. Terrorist attacks have spatial–temporal dynamic correlation, so the spatial–temporal network of terrorist attacks needs to be analyzed with the time factor taken into account. (3) In research on the prediction of terrorist attacks, there is little fusion analysis of social-spatial and temporal relations.

To the best of our knowledge, this is the first attempt to apply spatial–temporal graph convolution networks to terrorist attack prediction, which can provide theoretical and methodological support for the effective early warning of terrorist attacks and the scientific construction of anti-terrorism strategies. Overall, the main contributions of the proposed model are as follows:

First of all, according to the different social-spatial dynamic relations of terrorist attacks, we construct the spatial adjacency graph and spatial diffusion graph. The complementarity of different views provides rich social-spatial dynamic information for prediction tasks.

Second, we propose a multi-scale periodic analysis method based on wavelet transform and classify the temporal properties of the time series of casualties in terrorist attacks into three categories: trend, period and closeness.

Third, we design a novel deep learning model (AST-MGCN), in which the spatial–temporal multi-graph convolution can effectively capture the spatial–temporal dynamic features of terrorist attacks, and the spatial–temporal attention mechanism can automatically capture the most relevant spatial–temporal dynamic information in prediction tasks.

Finally, experiments are carried out on the GTD database. The results show that the proposed model achieves state-of-the-art performance.

Preliminaries

Terrorist attack network

GTD divides the world into 12 regions in which terrorist attacks occur: North America, Central America and the Caribbean, South America, East Asia, Southeast Asia, South Asia, Central Asia, Western Europe, Eastern Europe, Middle East and North Africa, Sub-Saharan Africa, Australia and Oceania.

In this study, we define the terrorist attack network as a graph \(G = \left( {V,E,A} \right)\), where V represents a set of vertices, and each vertex in the network represents a region where the terrorist attack took place; \(\left| V \right| = N\) is the number of vertices in a terrorist attack network; E represents a set of edges; \(A \in {\mathbb{R}}^{N \times N}\) represents the adjacency matrix of the terrorist attack network G. As shown in Fig. 2, to predict the temporal and spatial distribution of innocent civilian casualties in terrorist attacks in different regions on a daily basis, we construct a spatial adjacency graph \(G^{{{\text{SA}}}}\) and a spatial diffusion graph \(G^{{{\text{SD}}}}\) of terrorist attacks. At the same time, we import the daily number of terrorist attacks in various regions and the total number of innocent civilians killed and injured into each view as the spatial–temporal distribution data of terrorist attacks.

Fig. 2 Multi-view terrorist attack network

Casualty prediction of terrorist attacks

Suppose that the m-th time series recorded on each node in the terrorist attack network G is the target time series, and \(m \in \left( {1, \ldots ,M} \right)\). \(x_{t}^{f,i} \in {\mathbb{R}}\) represents the f-th time series of node i at time t, and \(y_{t}^{i} = x_{t}^{m,i} \in {\mathbb{R}}\) represents the target time series m of node i at time t. \(X_{t} = \left( {x_{t}^{1} ,x_{t}^{2} , \ldots ,x_{t}^{N} } \right)^{T} \in {\mathbb{R}}^{N \times M}\) represents the value of all the features of all nodes at time t. \({\mathcal{X}} = \left( {X_{1} ,X_{2} , \ldots ,X_{h} } \right)^{T} \in {\mathbb{R}}^{N \times M \times h}\) represents the value of all the features of all nodes over h time slices.

The goal of this paper is to learn the mapping relationship between coding features and the number of innocent civilian casualties in terrorist attacks. Therefore, the problem of terrorist attack casualty prediction is defined as: Given \({\mathcal{X}}\), all the historical data of all nodes in the terrorist attack network over the past h time slices are used to predict the future time series \(Y = \left( {y^{1} ,y^{2} , \ldots ,y^{N} } \right)^{T} \in {\mathbb{R}}^{N \times p}\), that is, to predict the value of the m-th time series of all nodes over p time slices in the future. \(y^{i} = \left( {y_{h + 1}^{i} ,y_{h + 2}^{i} , \ldots ,y_{h + p}^{i} } \right)^{T} \in {\mathbb{R}}^{p}\) represents the predicted value of node i from time h + 1. Specifically, to predict the daily number of innocent civilian casualties caused by terrorist attacks in each region, for each previous period of time, we construct the spatial adjacency graph and spatial diffusion graph, respectively, and use them as the input of the model to predict the total number of innocent civilian casualties in each region over p time slices in the future.
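As an illustration of the tensor shapes in this definition, the following minimal Python sketch slides a window over a daily record array and returns pairs of history tensors and targets. The function name `make_samples`, the zero-based feature index `m`, and the toy array are assumptions made for illustration only and are not taken from the authors' implementation.

```python
import numpy as np

def make_samples(series, h, p, m):
    """Build (history, target) pairs from an array of shape (T, N, M).

    T is the number of time slices, N the number of nodes (regions),
    M the number of features per node, and m the zero-based index of
    the target feature.
    Returns X with shape (S, N, M, h) and Y with shape (S, N, p),
    where S = T - h - p + 1 is the number of sliding windows.
    """
    T = series.shape[0]
    X, Y = [], []
    for start in range(T - h - p + 1):
        history = series[start:start + h]               # (h, N, M)
        future = series[start + h:start + h + p, :, m]  # (p, N)
        X.append(np.transpose(history, (1, 2, 0)))      # (N, M, h)
        Y.append(future.T)                              # (N, p)
    return np.stack(X), np.stack(Y)

# Toy data: 10 days, 12 regions, 2 features (attack count, casualties).
toy = np.arange(10 * 12 * 2).reshape(10, 12, 2)
X, Y = make_samples(toy, h=4, p=2, m=1)
```

Each entry of `X` plays the role of \({\mathcal{X}}\) and each entry of `Y` the role of Y in the definition above.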

Attention-based spatial–temporal multi-graph convolutional networks

The overall architecture of our AST-MGCN is shown in Fig. 1, and the model implementation architecture is shown in Fig. 3. We construct the spatial adjacency graph and spatial diffusion graph using different aspects of inter-regional correlation to fully express the social-spatial dynamic information of terrorist attacks, and use three independent components with the same structure to form the main body of the model, which are designed to, respectively, model the temporal closeness, period, and trend properties of the historical data.

Fig. 3 Implementation architecture of the AST-MGCN

Suppose the data are collected on a daily basis, and the current time is \(t_{0}\). The multi-scale period of the target time series is determined by wavelet transform, and three time series segments of length \(T_{{\text{t}}}\), \(T_{{\text{p}}}\) and \(T_{{\text{c}}}\) are truncated along the time axis. Then, the three segments are input into three independent components as the trend component \({\mathcal{X}}_{{\text{t}}}\), the periodic component \({\mathcal{X}}_{{\text{p}}}\) and the closeness component \({\mathcal{X}}_{{\text{c}}}\), where the lengths of the input components are all integer multiples of the prediction window.
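The segment extraction can be sketched as follows. The exact indexing is not specified above, so this sketch assumes a common scheme in which the closeness segment is the most recent block of observations, while the period and trend segments are windows of length p taken at lags of one or more full periods before the forecast window. The names `take_segments`, `period_len` and `trend_len`, and the lag values used in the example, are hypothetical; in the paper the periods are determined by the wavelet analysis.

```python
import numpy as np

def take_segments(data, t0, p, n_c, n_p, n_t, period_len, trend_len):
    """Cut closeness, period and trend segments that end at or before t0.

    data: array whose first axis is time (one entry per day).
    t0: index of the last observed day; the forecast window is t0+1 .. t0+p.
    n_c, n_p, n_t: integer multiples, so the segments have lengths
        T_c = n_c * p, T_p = n_p * p and T_t = n_t * p.
    period_len, trend_len: assumed lags in days, each at least p.
    """
    if min(period_len, trend_len) < p:
        raise ValueError("lags must be at least as long as the prediction window")

    def lagged_windows(n, lag):
        # Window k starts k*lag days before the forecast window and has length p.
        parts = []
        for k in range(n, 0, -1):
            start = t0 + 1 - k * lag
            if start < 0:
                raise ValueError("not enough history for the requested lag")
            parts.append(data[start:start + p])
        return np.concatenate(parts, axis=0)

    start_c = t0 + 1 - n_c * p
    if start_c < 0:
        raise ValueError("not enough history for the closeness segment")
    closeness = data[start_c:t0 + 1]
    return closeness, lagged_windows(n_p, period_len), lagged_windows(n_t, trend_len)

# Day indices stand in for the data so that the selected days are visible.
days = np.arange(100)
x_c, x_p, x_t = take_segments(days, t0=99, p=2, n_c=3, n_p=2, n_t=1,
                              period_len=7, trend_len=30)
```

With these toy settings the closeness segment covers the last six days, and the period and trend segments are aligned with the forecast window shifted back by multiples of 7 and 30 days.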

Each of the three independent components is composed of multiple spatial–temporal blocks (ST-blocks) and a final convolution layer. First, each ST-block obtains the temporal attention matrix through the temporal attention module and multiplies it with the input component to obtain the temporal component adjusted by the temporal attention mechanism. Second, the spatial attention matrix is obtained through the spatial attention module and is fed, together with the adjusted temporal component, into the spatial multi-graph convolution module to capture different types of correlation between regions. Then, the temporal convolution module is used to aggregate observations at different times. After the ST-blocks, the final convolution layer outputs a feature tensor with the same size as the final prediction result. In addition, to improve network performance, we use the residual learning framework [48], as shown by the red arrow in Fig. 3. Finally, the outputs of the three independent components are further fused based on the parameter matrix, and the final prediction result is obtained. The overall structure of the model has been carefully designed to accurately predict the number of innocent civilian casualties in terrorist attacks.

Multi-scale periodic analysis based on wavelet transform

In the network prediction flow, the number of innocent civilian casualties in terrorist attacks follows a nonlinear and non-stationary process that is difficult to analyze with general time series models. We therefore design a multi-scale periodic analysis method based on wavelet transform.

Wavelet transform is well established in signal processing and other fields of time series analysis [49,50,51,52]. It can not only effectively identify the periodicity of time series but also effectively handle nonlinear and non-stationary processes with multiple time scales. Therefore, by using the multi-scale feature decomposition of the wavelet transform, the nonlinear non-stationary process of the target time series can be localized in the time–frequency space [49], and the main periodic features of the series can be determined.

The number of innocent civilian casualties in terrorist attacks is input into the wavelet transform as a time series signal, and the multi-scale features of the time series are obtained by multi-scale transform. The period of the large time scale can capture the trend change of historical data, while the period of the small time scale describes the periodic fluctuation at different times within the trend. Therefore, this method can be used to determine the input period of the trend component, period component and closeness component of the AST-MGCN.

Wavelet transform is a local transform in space (time) and frequency, which analyzes the time–frequency characteristics of a signal by stretching and translating a mother wavelet with finite length or fast decay. For a given time series \(x(t) \in L^{{2}} (R)\), its continuous wavelet transform (CWT) is [52, 53]:

$$\begin{aligned} W(a,b) &= \left\langle {x(t),\varphi_{a,b} (t)} \right\rangle = \int_{ - \infty }^{ + \infty } {x(t)\varphi_{a,b}^{ * } (t){\text{d}}t} \\&= \frac{1}{\sqrt a }\int_{ - \infty }^{ + \infty } {x(t)\varphi^{ * } \left( {\frac{t - b}{a}} \right){\text{d}}t} .\end{aligned} $$
(1)

where \(W(a,b)\) is the wavelet transform coefficient. \(\varphi_{a,b} (t)\) is a wavelet sequence obtained by the mother wavelet \(\varphi (t)\) transformed by a stretching factor a and translation factor b, and “\(*\)” stands for conjugation.

When using wavelet analysis to solve practical problems, choosing an appropriate wavelet basis function is a prerequisite. We choose the Morlet continuous complex wavelet transform because it not only provides both phase and amplitude information of signal changes but also eliminates the false oscillations that arise when real wavelet transform coefficients are used as a criterion. The Morlet mother wavelet function is defined as [52]:

$$ \varphi (t) = e^{{j\omega_{0} t}} e^{{ - t^{2} /2}} , $$
(2)

where \(\omega_{0}\) is the wavelet center frequency. In signal processing, computing Eq. (1) directly is complex and time-consuming and cannot meet practical requirements. According to Parseval’s theorem, for two signals \(x_{1} (t)\) and \(x_{2} (t)\) we have:

$$ \begin{aligned}\int_{ - \infty }^{ + \infty } {x_{1} (t)x_{2}^{ * } (t){\text{d}}t} &= \frac{1}{2\pi }\int_{ - \infty }^{ + \infty } {X_{1} (\omega )X_{2}^{*} (\omega ){\text{d}}\omega } \\&= \frac{1}{2\pi }\int_{ - \infty }^{ + \infty } {X_{1}^{ * } (\omega )X_{2} (\omega ){\text{d}}\omega } , \end{aligned}$$
(3)

Therefore, Eq. (1) can be transformed into:

$$ \begin{aligned} W(a,b) & = \int_{ - \infty }^{ + \infty } {x{(}t{)}\varphi_{a,b}^{ * } (t){\text{d}}t} \\ & = \frac{1}{2\pi }\int_{ - \infty }^{ + \infty } {X(\omega ){\text{FT}}^{ * } \left[ {\varphi_{a,b} (t)} \right]{\text{d}}\omega } \\ & = \frac{1}{2\pi }\int_{ - \infty }^{ + \infty } {X(\omega ){\text{FT}}^{ * } \left[ {\frac{1}{\sqrt a }\varphi \left( {\frac{t - b}{a}} \right)} \right]{\text{d}}\omega } \\ & = \frac{\sqrt a }{{2\pi }}\int_{ - \infty }^{ + \infty } {X(\omega ){\Phi }^{ * } (a\omega )e^{j\omega b} {\text{d}}\omega } \\ & = {\text{IFT}}\left[ {\sqrt a X(\omega ){\Phi }^{ * } (a\omega )} \right], \\ \end{aligned} $$
(4)

where \(X(\omega )\) and \(\Phi (\omega )\) are the Fourier transforms of \(x(t)\) and the Morlet mother wavelet \(\varphi (t)\), and FT and IFT represent the Fourier transform and inverse Fourier transform, respectively. Equation (4) shows that once the original signal and the Morlet mother wavelet function have been processed by FT, a series of non-orthogonal wavelet transform coefficients can be obtained by IFT.
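Equation (4) can be turned into a short numerical routine. The sketch below is a discrete approximation, assuming uniform sampling with step `dt`, periodic boundary handling through the FFT, and the closed-form Fourier transform \(\Phi (\omega ) = \sqrt {2\pi } e^{ - (\omega - \omega_{0} )^{2} /2}\) of the Morlet wavelet in Eq. (2). The function name `morlet_cwt`, the choice \(\omega_{0} = 6\), and the test signal are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def morlet_cwt(x, scales, dt=1.0, omega0=6.0):
    """Approximate W(a, b) of Eq. (4) on a uniform grid of translations b.

    Returns a complex array with one row per scale a and one column
    per sample position b = n * dt.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    X = np.fft.fft(x)                                  # discrete stand-in for X(omega)
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dt)      # angular frequency grid
    W = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        # Fourier transform of the Morlet mother wavelet evaluated at a * omega.
        Phi = np.sqrt(2.0 * np.pi) * np.exp(-0.5 * (a * omega - omega0) ** 2)
        # Eq. (4): W(a, .) = IFT[ sqrt(a) * X(omega) * conj(Phi(a * omega)) ].
        W[i] = np.fft.ifft(np.sqrt(a) * X * np.conj(Phi))
    return W

# Synthetic daily signal with a single 32-day cycle.
t = np.arange(512)
signal = np.cos(2.0 * np.pi * t / 32.0)
scales = np.arange(2, 65)
W = morlet_cwt(signal, scales)
mean_power = (np.abs(W) ** 2).mean(axis=1)
best_scale = int(scales[np.argmax(mean_power)])
```

For this synthetic signal the power concentrates at a scale close to the 32-sample period, which is the behavior the multi-scale periodic analysis relies on. Real casualty series are not periodic at the boundaries, so padding or edge-effect handling would be needed in practice.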

From the above formula, contour plots of the real part and the modulus of the wavelet transform coefficients \(W(a,b)\) can be obtained. The time–frequency variation of the time series on different time scales can be observed from these plots, which helps in deciding on the periods identified in the time series.

To determine the dominant periods of the time series accurately, the wavelet variance is also analyzed. By integrating the squared modulus of the wavelet coefficients over b, the wavelet variance is obtained:

$$ {\text{Var}}(a) = \int_{ - \infty }^{ + \infty } {\left| {W(a{,}b)} \right|}^{{2}} {\text{d}}b. $$
(5)

The variation process of wavelet variance with the expansion scale a is called the wavelet variance graph, which can be used to determine the relative intensity and the main time scale of different scale disturbances in the signal [50].
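A discrete version of Eq. (5), together with a simple peak search on the resulting curve, might look as follows. The rectangle-rule integration, the helper names `wavelet_variance` and `dominant_scales`, and the synthetic coefficient matrix are assumptions for illustration; any coefficient matrix with one row per scale and one column per translation b can be passed in.

```python
import numpy as np

def wavelet_variance(W, db=1.0):
    """Rectangle-rule approximation of Var(a) = integral of |W(a, b)|^2 db."""
    return np.sum(np.abs(W) ** 2, axis=1) * db

def dominant_scales(var, scales):
    """Return the scales at interior local maxima of Var(a), strongest first."""
    var = np.asarray(var)
    peaks = [i for i in range(1, len(var) - 1)
             if var[i] > var[i - 1] and var[i] >= var[i + 1]]
    peaks.sort(key=lambda i: var[i], reverse=True)
    return [scales[i] for i in peaks]

# Synthetic coefficients: 7 scales, 4 translations, constant modulus per scale.
scales = [1, 2, 3, 4, 5, 6, 7]
modulus = np.array([1.0, 3.0, 1.0, 1.0, 5.0, 2.0, 1.0])
W_toy = modulus[:, None] * np.ones((7, 4))
var = wavelet_variance(W_toy)
main_scales = dominant_scales(var, scales)
```

In this toy case the variance curve has two peaks, at scales 5 and 2, with the first one dominant; in the wavelet variance graph such peaks indicate the candidate main time scales.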

Multi-view on terrorist attack network

In this section, we introduce two different views of terrorist attack networks: the spatial adjacency graph and the spatial diffusion graph. The two views reflect different social-spatial dynamic relationships of terrorist attacks, which appear not only in the preparatory and outbreak periods of terrorist activities but also in their period of decline. During the preparatory period, terrorist organizations or terrorists acquire terrorist resources, learn terrorist attack skills and experience, and select targets for attacks; during the outbreak period, attacks transfer continuously in space; and during the decline period, they spread globally through transnational networks of terrorist organizations. All of these stages reflect the social-spatial dynamics of terrorist attacks.

Spatial adjacency graph of terrorist attacks

Terrorist attacks exhibit social-spatial dependence: terrorist organizations or terrorists take spatial geographical elements into account in their survival and attack choices. Therefore, we construct the spatial adjacency graph of terrorist attacks for 12 regions around the world based on Queen contiguity (two regions are adjacent if they share a border or a corner point), as shown in Fig. 4.

Fig. 4 Spatial adjacency graph of terrorist attacks

Spatial diffusion graph of terrorist attacks

To model the dynamic propagation law of terrorist attacks in social space, we introduce the spatial diffusion (SD) graph of terrorist attacks [54]. The first step in building an SD graph is to build an Organization-Region (OR) basic network. There are two types of nodes in the OR basic network, which are terrorist organizations and regions. The edges represent the relationship between terrorist organizations and regions.

Suppose that at time t, terrorist organization i is active in region A, then \({\text{OR}}_{iA}^{t} = 1\), otherwise \({\text{OR}}_{iA}^{t} = 0\). Therefore, according to the timing relationship, the temporal network set of OR basic network can be constructed.

Then, based on the principle of timing sharing of terrorist organizations, we take the OR basic network of adjacent time pair \(\left( {t - 1,t} \right)\) as the basic unit and derive it as SD graph by matrix multiplication. The principle of derivation is:

  (1) At time \(t - 1\), there are terrorist organizations operating in region A.

  (2) At time t, there are terrorist organizations operating in region B.

  (3) At least one terrorist organization is the same in (1) and (2).

According to the above principle, SD graph can be calculated by the following formula:

$$ {\text{SD}}_{AB} = \sum\limits_{t = 1}^{T} {\sum\limits_{i = 1}^{N} {{\text{OR}}_{iA}^{t - 1} {\text{OR}}_{iB}^{t} } } . $$
(6)

If there are n terrorist organizations operating in region A and B one after another in the whole time domain, then \({\text{SD}}_{AB} = n\). If there is no terrorist organization operating in region A and B one after another, then \({\text{SD}}_{AB} = 0\). The build process of SD graph in time pair \(\left( {t - 1,t} \right)\) is shown in Fig. 5.

Fig. 5 The construction process of SD graph in time pair (t−1, t)

It can be seen from the construction process that SD graph can express the spread relationship of terrorist attacks among regions through the transfer of terrorist organizations in space, so it can model the social-spatial dynamic relationship of terrorist attacks.
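The derivation in Eq. (6) can be sketched with a few lines of NumPy. This is only an illustration under stated assumptions: the function name `spatial_diffusion_graph`, the array layout `(time, organization, region)` with time indexed from 0, and the toy activity pattern are hypothetical and not taken from the GTD.

```python
import numpy as np

def spatial_diffusion_graph(OR):
    """Build SD from a temporal OR network.

    OR has shape (T, N_orgs, N_regions), with OR[t, i, A] = 1 if organization i
    is active in region A at time t and 0 otherwise (assumed layout).
    """
    n_steps, _, n_regions = OR.shape
    SD = np.zeros((n_regions, n_regions), dtype=int)
    for t in range(1, n_steps):
        # (OR[t-1].T @ OR[t])[A, B] = sum_i OR[t-1, i, A] * OR[t, i, B]
        SD += OR[t - 1].T @ OR[t]
    return SD

# Toy example: 2 organizations, 3 regions, 3 time steps (made-up activity).
OR = np.array([
    [[1, 0, 0], [0, 0, 1]],   # t = 0: org 0 in region 0, org 1 in region 2
    [[0, 1, 0], [0, 0, 1]],   # t = 1: org 0 in region 1, org 1 in region 2
    [[0, 1, 0], [1, 0, 0]],   # t = 2: org 0 in region 1, org 1 in region 0
])
SD = spatial_diffusion_graph(OR)
```

In this toy case, `SD[0, 1]` is 1 because organization 0 moves from region 0 to region 1, and the diagonal entries count organizations that stay in the same region across adjacent time steps.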

Spatial–temporal attention

The attention mechanism is often used to automatically extract the most relevant information. In this paper, we use spatial–temporal attention [55] to capture the spatial–temporal dynamic correlation on the terrorist attack network. Spatial–temporal attention includes temporal attention and spatial attention.

Temporal attention

In the temporal dimension, there is a correlation between terrorist attacks in different time periods, and this correlation varies with different scenarios. Therefore, we use the attention mechanism to adaptively capture the temporal dynamic information between terrorist attack networks. Take the temporal attention in the closeness component as an example:

$$ P = V_{{\text{p}}} \cdot \sigma \left( {\left( {\left( {\chi_{{\text{c}}}^{(l - 1)} } \right)^{T} Z_{1} } \right)Z_{2} \left( {Z_{3} \chi_{{\text{c}}}^{(l - 1)} } \right) + b_{{\text{p}}} } \right), $$
(7)
$$ P_{m,n}^{\prime } = {\text{softmax}}\left( {P_{m,n} } \right), $$
(8)

where \(\chi_{{\text{c}}}^{(l - 1)} = \left( {X_{1} ,X_{2} , \ldots X_{{T_{l - 1} }} } \right) \in {\mathbb{R}}^{{N \times C_{l - 1} \times T_{l - 1} }}\) is the input of the l-th ST-block and \(C_{l - 1}\) is the number of features of the input data in the l-th ST-block. When \(l = 1\), \(C_{0} = M\). \(T_{l - 1}\) is the length of the temporal dimension of the input data in the l-th ST-block. When \(l = 1\), the trend component \(T_{0} = T_{{\text{t}}}\), period component \(T_{0} = T_{{\text{p}}}\) and closeness component \(T_{0} = T_{{\text{c}}}\). \(V_{{\text{p}}} ,b_{{\text{p}}} \in {\mathbb{R}}^{{T_{l - 1} \times T_{l - 1} }}\), \(Z_{1} \in {\mathbb{R}}^{N}\), \(Z_{2} \in {\mathbb{R}}^{{C_{l - 1} \times N}}\) and \(Z_{3} \in {\mathbb{R}}^{{C_{l - 1} }}\) represent learnable parameters. \(\sigma\) represents the sigmoid activation function. \(P_{m,n}\) semantically represents the correlation intensity of the terrorist attack network between time m and n. Finally, the softmax function is used to normalize the temporal attention matrix P.

As shown in the following formula, the input of the AST-MGCN is adjusted by the normalized temporal attention matrix:

$$\begin{aligned} \hat{\chi }_{{\text{c}}}^{(l - 1)} &= \left( {\hat{X}_{1} ,\hat{X}_{2} , \ldots ,\hat{X}_{{T_{l - 1} }} } \right) \\ &= \left( {X_{1} ,X_{2} , \ldots ,X_{{T_{l - 1} }} } \right)P^{\prime } .\end{aligned} $$
(9)

Spatial attention

In the spatial dimension, the impact of different regions on terrorist attacks is highly dynamic. Therefore, spatial attention is used to capture the dynamic correlation between nodes in the spatial dimension. Take the spatial attention in the closeness component as an example:

$$ S = V_{{\text{s}}} \cdot \sigma \left( {\left( {\hat{\chi }_{{\text{c}}}^{(l - 1)} W_{1} } \right)W_{2} \left( {W_{3} \hat{\chi }_{{\text{c}}}^{(l - 1)} } \right)^{T} + b_{\rm s} } \right), $$
(10)
$$ S_{m,n}^{\prime } = {\text{softmax}} \left( {S_{m,n} } \right), $$
(11)

where \(V_{s} ,b_{s} \in {\mathbb{R}}^{N \times N}\), \(W_{1} \in {\mathbb{R}}^{{T_{l - 1} }}\), \(W_{2} \in {\mathbb{R}}^{{C_{l - 1} \times T_{l - 1} }}\) and \(W_{3} \in {\mathbb{R}}^{{C_{l - 1} }}\) represent learnable parameters. \(\sigma\) represents the sigmoid activation function. The spatial attention matrix S is dynamically calculated according to the input vector adjusted by the temporal attention matrix. The element \(S_{m,n}\) in S represents the correlation between node m and node n. Finally, the softmax function is used to normalize S. When carrying out spatial multi-graph convolution, we will combine adjacency matrix and spatial attention matrix to dynamically adjust the influence weight between nodes.
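The shapes in Eqs. (7)–(11) can be checked with a small NumPy sketch. It is a hedged illustration, not the model code: the function names, the random parameter values and the choice to normalize each column of the attention matrices with softmax are assumptions (the text does not state the normalization axis), and a single sample without a batch dimension is used.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))        # numerically stable sigmoid

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(X, Z1, Z2, Z3, Vp, bp):
    """Eqs. (7)-(8); X has shape (N, C, T)."""
    lhs = np.einsum("nct,n->tc", X, Z1) @ Z2      # (X^T Z1) Z2 -> (T, N)
    rhs = np.einsum("c,nct->nt", Z3, X)           # Z3 X        -> (N, T)
    P = Vp @ sigmoid(lhs @ rhs + bp)              # (T, T)
    return softmax(P, axis=0)                     # assumed axis: columns sum to 1

def spatial_attention(X_hat, W1, W2, W3, Vs, bs):
    """Eqs. (10)-(11); X_hat has shape (N, C, T)."""
    lhs = (X_hat @ W1) @ W2                       # (N, C) @ (C, T) -> (N, T)
    rhs = np.einsum("c,nct->nt", W3, X_hat)       # W3 X_hat        -> (N, T)
    S = Vs @ sigmoid(lhs @ rhs.T + bs)            # (N, N)
    return softmax(S, axis=0)                     # assumed axis: columns sum to 1

rng = np.random.default_rng(0)
N, C, T = 12, 2, 7                                # 12 regions; C and T are toy sizes
X = rng.normal(size=(N, C, T))

P_norm = temporal_attention(
    X, rng.normal(size=N), rng.normal(size=(C, N)), rng.normal(size=C),
    rng.normal(size=(T, T)), rng.normal(size=(T, T)))
X_hat = X @ P_norm                                # Eq. (9): reweight the time slices
S_norm = spatial_attention(
    X_hat, rng.normal(size=T), rng.normal(size=(C, T)), rng.normal(size=C),
    rng.normal(size=(N, N)), rng.normal(size=(N, N)))
```

With column-wise normalization, every adjusted time slice in `X_hat` is a convex combination of the original slices, which is one reasonable reading of Eq. (9).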

Spatial–temporal multi-graph convolution

The spatial–temporal attention makes terrorist attack networks automatically pay more attention to valuable information. The input regulated by the attention mechanism is fed into the spatial–temporal multi-graph convolution module. The spatial–temporal multi-graph convolution module is composed of spatial multi-graph convolution and standard temporal convolution, which is used to extract spatial and temporal dependencies at the same time. The spatial dependence is obtained by aggregating the information of the neighboring nodes of each terrorist attack network, while the time dependence is obtained using the correlation of the adjacent time.

Spatial multi-graph convolution

After constructing the spatial adjacency graph and spatial diffusion graph of terrorist attacks, we propose a spatial multi-graph convolution to model the spatial dependence.

Take the spatial multi-graph convolution in the closeness component as an example:

$$ \hat{\chi }_{{\text{c}}}^{(l)} = {\rm ReLU}\left( {\mathop {\sqcup}\limits_{{G^{i} \in G}} f\left( {G^{i} ;\theta^{i} } \right)\hat{\chi }_{{\text{c}}}^{(l - 1)} Q_{l - 1} } \right), $$
(12)

where the current input vector adjusted by the temporal attention in the l-th ST-block is \(\hat{\chi }_{\rm c}^{(l - 1)}\), G represents the set of graphs, \(f\left( {G^{i} ;\theta^{i} } \right) \in {\mathbb{R}}^{N \times N}\) represents the aggregation matrix based on graph \(G^{i} \in G\) and parameterized by \(\theta^{i}\). If \(f\left( {G^{i} ;\theta^{i} } \right)\) is the polynomial function of the Laplacian matrix L, then this will become ChebNet on multiple graphs [56]. If \(f\left( {G^{i} ;\theta^{i} } \right) = I_{N}\), i.e., the identity matrix, then this will fall back to the fully connected network. \(Q_{l - 1} \in {\mathbb{R}}^{{C_{l - 1} \times C_{l} }}\) denotes the feature transformation matrix. \(\sqcup\) denotes the aggregation function, e.g., sum, max, average, etc. Finally, the rectified linear unit (ReLU) is used as the activation function.

To make full use of the spatial topology characteristics of terrorist attack networks, we use spectral graph theory to extend grid-based data to graph-structured data. Specifically, we regard the features of each node as the signal on the graph, and the graph convolution based on graph theory is used to process the signal directly on each time slice.

In spectral analysis, graph G is represented by its Laplacian matrix L, defined as \(L = D - A\), with normalized form \(L = I_{N} - D^{-1/2} AD^{-1/2}\), where A is the adjacency matrix, \(I_{N}\) is the identity matrix, and the degree matrix D is a diagonal matrix of node degrees, \(D_{ii} = \sum\nolimits_{j} {A_{ij} }\). The eigenvalue decomposition of the Laplacian matrix is \(L = U\Lambda U^{T}\), where \(\Lambda = {\text{diag}}([\lambda_{0} , \cdots ,\lambda_{N - 1} ])\) is a diagonal matrix composed of eigenvalues and U is the Fourier basis.

Taking the graph signal \(f = X_{t}\) of the terrorist attack network at time t as an example, the graph Fourier transform of the signal is \(\hat{f} = U^{T} f\). Because U is an orthogonal matrix, the inverse graph Fourier transform is \(f = U\hat{f}\). Graph convolution is a convolution operation that uses diagonalized linear operators defined in the Fourier domain to replace classical convolution operators [57]. Therefore, the signal f on the graph can be filtered by the convolution kernel \(g_{\theta }\):

$$ g_{\theta } *_{G} f = g_{\theta } (L)f = g_{\theta } (U\Lambda U^{T} )f = Ug_{\theta } (\Lambda )U^{T} f, $$
(13)

where \(*_{G}\) represents the graph convolution operation. The above formula can be understood as transforming \(g_{\theta }\) and f to the spectral domain through the graph Fourier transform, multiplying the transformed results, and then applying the inverse graph Fourier transform to obtain the result of the graph convolution. However, when the graph is large, directly computing the eigenvalue decomposition of the Laplacian matrix is expensive. Therefore, we use Chebyshev polynomials as convolution kernels to solve this problem:

$$ g_{\theta } (\Lambda ) = \sum\limits_{k = 1}^{K} {\theta_{k} T_{k} (\tilde{\Lambda })} , $$
(14)

where the parameter \(\theta \in {\mathbb{R}}^{K}\) is the vector of Chebyshev coefficients, \(\tilde{\Lambda } = 2\Lambda /\lambda_{{{\text{max}}}} - I_{N}\), and \(\lambda_{{{\text{max}}}}\) is the maximum eigenvalue of the Laplacian matrix. The Chebyshev polynomials are defined recursively by \(T_{k + 2} (x) = 2xT_{k + 1} (x) - T_{k} (x)\), with \(T_{1} (x) = 1\) and \(T_{2} (x) = x\).

Substituting Eq. (14) into Eq. (13) gives:

$$\begin{aligned} g_{\theta } *_{G} f& = U\left( {\sum\limits_{k = 1}^{K} {\theta_{k} T_{k} (\tilde{\Lambda })} } \right)U^{T} f \\&= \sum\limits_{k = 1}^{K} {\theta_{k} T_{k} (U\tilde{\Lambda }U^{T} )} f\\& = \sum\limits_{k = 1}^{K} {\theta_{k} T_{k} (\tilde{L})} f, \end{aligned}$$
(15)

where \(\tilde{L} = 2L/\lambda_{{{\text{max}}}} - I_{N}\). The advantage of this transformation is that the Laplacian matrix does not need to be eigendecomposed during the calculation.
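The equivalence between the spectral form in Eq. (13) and the polynomial form in Eq. (15) can be verified numerically. The following NumPy sketch is illustrative only: the helper names, the 5-node ring graph and the coefficient values are assumptions, and \(\lambda_{\max}\) is computed exactly here for clarity although implementations often approximate it.

```python
import numpy as np

def scaled_laplacian(A):
    """L_tilde = 2 L / lambda_max - I for the normalized Laplacian of adjacency A."""
    n = A.shape[0]
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()     # exact value, used here for clarity
    return 2.0 * L / lam_max - np.eye(n)

def cheb_polynomials(L_tilde, K):
    """[T_1, ..., T_K] with T_1 = I, T_2 = L_tilde, T_{k+2} = 2 L_tilde T_{k+1} - T_k."""
    polys = [np.eye(L_tilde.shape[0]), L_tilde.copy()]
    for _ in range(2, K):
        polys.append(2.0 * L_tilde @ polys[-1] - polys[-2])
    return polys[:K]

def cheb_filter(L_tilde, theta, f):
    """Eq. (15): sum_k theta_k T_k(L_tilde) f, using matrix products only."""
    polys = cheb_polynomials(L_tilde, len(theta))
    return sum(th * (Tk @ f) for th, Tk in zip(theta, polys))

# Toy ring graph with 5 nodes and made-up coefficients (K = 3).
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
theta = np.array([0.5, -0.3, 0.2])
f = np.arange(5.0)

L_tilde = scaled_laplacian(A)
polynomial_form = cheb_filter(L_tilde, theta, f)

# Cross-check against the spectral form U g(Lambda_tilde) U^T f of Eq. (13).
lam, U = np.linalg.eigh(L_tilde)
g = theta[0] + theta[1] * lam + theta[2] * (2.0 * lam ** 2 - 1.0)
spectral_form = U @ (g * (U.T @ f))
```

Both routes give the same filtered signal, but only the cross-check requires the eigenvectors; the polynomial form needs nothing beyond repeated multiplication by \(\tilde{L}\).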

Therefore, in the implementation, \(f\left( {G^{i} ;\theta^{i} } \right)\) is chosen to be the K-order Chebyshev polynomial function of the graph Laplacian L, as shown in the following formula:

$$ f\left( {G^{i} ;\theta^{i} } \right){ = }\sum\limits_{k = 1}^{K} {\theta_{k}^{i} } \left( {T_{k} (\tilde{L}^{i} ) \odot S^{\prime } } \right). $$
(16)

To dynamically adjust the correlations between nodes, each term \(T_{k} (\tilde{L}^{i} )\) of the Chebyshev polynomial is multiplied element-wise by the spatial attention matrix \(S^{\prime }\), giving \(T_{k} (\tilde{L}^{i} ) \odot S^{\prime }\), where \(\odot\) is the Hadamard product.

Using the Chebyshev polynomial expansion in this formulation corresponds to extracting information from the 0 to (K−1)-th order neighbors centered on each node in the graph, because \(T_{k}\) as defined above is a polynomial of degree \(k - 1\). Figure 6 shows an example of the value transformation for a central region through the graph convolution layer.

Fig. 6 An example of the ChebNet graph convolution centralized at the black vertex
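Equations (12) and (16) can be combined into a short NumPy sketch with sum aggregation. It is a simplified illustration under stated assumptions: the function names, the two toy symmetric graphs and the random inputs are hypothetical, \(\lambda_{\max}\) is approximated by 2 so that \(\tilde{L} = -D^{-1/2}AD^{-1/2}\), and the coefficients \(\theta_{k}^{i}\) are scalars as written in Eq. (16).

```python
import numpy as np

def cheb_polys(A, K):
    """T_1..T_K of the scaled Laplacian, with lambda_max approximated by 2 (assumption)."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    L_tilde = -d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    polys = [np.eye(len(A)), L_tilde]
    for _ in range(2, K):
        polys.append(2.0 * L_tilde @ polys[-1] - polys[-2])
    return polys[:K]

def multi_graph_conv(X_hat, graphs, thetas, S_norm, Q):
    """X_hat: (N, C_in, T); graphs: list of (N, N) adjacency matrices;
    thetas: (n_graphs, K); S_norm: (N, N) attention; Q: (C_in, C_out)."""
    K = thetas.shape[1]
    f_sum = np.zeros_like(S_norm, dtype=float)
    for A, theta in zip(graphs, thetas):
        # Eq. (16): f(G^i) = sum_k theta_k^i (T_k(L_tilde^i) ⊙ S')
        f_sum += sum(th * (Tk * S_norm) for th, Tk in zip(theta, cheb_polys(A, K)))
    # Eq. (12) with sum aggregation, applied to all time slices at once
    out = np.einsum("mn,nct,cd->mdt", f_sum, X_hat, Q)
    return np.maximum(out, 0.0)                   # ReLU

rng = np.random.default_rng(1)
N, C_in, C_out, T, K = 4, 2, 3, 7, 3
A_adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
A_sd = np.array([[0, 2, 1, 0], [2, 0, 0, 1], [1, 0, 0, 3], [0, 1, 3, 0]], float)
X_hat = rng.normal(size=(N, C_in, T))
S_norm = rng.random(size=(N, N))
out = multi_graph_conv(X_hat, [A_adj, A_sd], rng.normal(size=(2, K)),
                       S_norm, rng.normal(size=(C_in, C_out)))
```

Because the operation is linear before the ReLU, summing the aggregation matrices of the graphs first is equivalent to summing the per-graph convolution results.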

Temporal convolution

After the spatial multi-graph convolution operations have captured neighboring information for each node on the graph in the spatial dimension, a standard convolution layer in the temporal dimension is further stacked to extract the time context information of the current terrorist attack network by merging the information at the neighboring time slice. Take the operation on the l-th ST-block in the closeness component as an example:

$$ \chi_{{\text{c}}}^{(l)} = {\rm ReLU}\left( {\Phi * \hat{\chi }_{{\text{c}}}^{(l)} } \right), $$
(17)

where \(\hat{\chi }_{{\text{c}}}^{(l)}\) is the input of the current ST-block after spatial multi-graph convolution, \(*\) denotes a standard convolution operation, \(\Phi\) denotes the parameters of the temporal convolution kernel, and the activation function is ReLU.
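Equation (17) can be illustrated with a small NumPy sketch of a convolution of width 3 along the temporal dimension. The function name, the zero padding that keeps the sequence length unchanged and the random inputs are assumptions of this example; as in most deep learning frameworks, the operation is implemented as a cross-correlation.

```python
import numpy as np

def temporal_conv(X, Phi):
    """X: (N, C_in, T); Phi: (C_out, C_in, 3).
    Zero padding of one step on each side keeps the length T (assumption)."""
    T = X.shape[-1]
    Xp = np.pad(X, ((0, 0), (0, 0), (1, 1)))
    # windows[n, c, t, j] = Xp[n, c, t + j], i.e. the slices at t-1, t and t+1
    windows = np.stack([Xp[:, :, j:j + T] for j in range(3)], axis=-1)
    out = np.einsum("nctj,ocj->not", windows, Phi)
    return np.maximum(out, 0.0)                   # ReLU

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 4, 7))                   # 12 regions, 4 channels, 7 steps
Phi = rng.normal(size=(8, 4, 3))                  # 8 output channels, kernel width 3
Y = temporal_conv(X, Phi)
```

Each output value mixes a node's features at the current time slice with those at the two neighboring slices, which is how the layer merges adjacent temporal context.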

In summary, the spatial–temporal multi-graph convolution module captures both the spatial and temporal features of terrorist attack data. A spatial–temporal attention module and a spatial–temporal multi-graph convolution module together form an ST-block. Multiple ST-blocks are stacked to extract a larger range of spatial–temporal dynamic correlations. Finally, a convolution layer is added to each separate component to ensure that the output of each component has the same shape as the prediction target.

Multi-component fusion

When the outputs of different components are fused, the impacting weights of the three components for each node are different, and they should be learned from the historical data. So, the final prediction result after multi-component fusion is:

$$ \hat{Y} = W_{{\text{t}}} \odot \hat{Y}_{{\text{t}}} + W_{{\text{p}}} \odot \hat{Y}_{{\text{p}}} + W_{{\text{c}}} \odot \hat{Y}_{{\text{c}}} , $$
(18)

where \(\odot\) is the Hadamard product. \(W_{{\text{t}}}\), \(W_{{\text{p}}}\) and \(W_{{\text{c}}}\) are learning parameters, reflecting the influence degrees of the three temporal-dimensional components on the predicted target.
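As a minimal sketch of Eq. (18), the fusion is an element-wise weighted sum. The function name `fuse`, the output shape (12 regions by 7 prediction steps) and the random values are assumptions for illustration; in the model the weights would be learned rather than sampled.

```python
import numpy as np

def fuse(Y_t, Y_p, Y_c, W_t, W_p, W_c):
    """Eq. (18): Hadamard-weighted sum of the trend, period and closeness outputs."""
    return W_t * Y_t + W_p * Y_p + W_c * Y_c

rng = np.random.default_rng(3)
shape = (12, 7)                                   # (regions, prediction steps)
Y_t, Y_p, Y_c = (rng.normal(size=shape) for _ in range(3))
W_t, W_p, W_c = (rng.normal(size=shape) for _ in range(3))
Y_hat = fuse(Y_t, Y_p, Y_c, W_t, W_p, W_c)
```

Setting all weights to 1 gives a plain sum of the components, and setting them to 1/3 gives their average, so simple unweighted fusions are special cases of this form.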

Algorithm and optimization

Algorithm 1 describes the calculation steps of AST-MGCN in detail. First, we construct training examples from the original time series data. Then, AST-MGCN is trained using backpropagation and the Adam optimizer to obtain the model parameters. Finally, the trained model is used to predict the casualties of terrorist attacks.

figure a

Experiments

Dataset and preprocessing

To evaluate the performance of the model, we use the Global Terrorism Database (GTD) for experiments. The GTD was established by the University of Maryland and the US Department of Homeland Security and is maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START). It is considered to be the most comprehensive database of terrorist incidents worldwide so far (data source: https://www.start.umd.edu/gtd/). All information in GTD comes entirely from publicly available, unclassified sources, including media articles, electronic news archives, existing databases, books, journals and legal documents. The database contains more than 200,000 records of terrorist attacks from 1970 to 2019 (updated annually), and each event has nearly 135 features, but not all of the data are used directly in this paper.

Since we are only concerned with the extent of innocent civilian casualties caused by terrorist attacks, casualties caused by non-terrorist attacks are outside the scope of this paper. Taking into account suspicious incidents such as guerrilla warfare, civil unrest and massacres, we use all terrorist attacks from January 1, 1995 to December 31, 2019 (9131 days) as original data.

From the original data, we mainly use the following features: “iyear”, “imonth”, “iday”, “region”, “gname” (name of terrorist organization), “nkill” (total number of deaths), “nkillter” (number of perpetrators killed), “nwound” (total number of injuries), “nwoundte” (number of perpetrators injured).

First of all, because there are many missing values in the original data, the missing value of the “gname” feature is filled with “Unknown”. To ensure that the missing data can be generated according to the real data distribution, the missing values of four types of features (“nkill”, “nkillter”, “nwound” and “nwoundte”) are completed using the Generative Adversarial Imputation Nets (GAIN) [58].

Secondly, to eliminate statistical outliers, we use a moving window with variable statistical boundaries to replace the outliers in the four features involving the number of deaths and injuries with averages. Note that the casualties considered in this paper include only civilians, not terrorists. Therefore, the number of innocent civilian casualties is obtained by adding the difference between the total number of deaths and the number of perpetrators killed to the difference between the total number of injuries and the number of perpetrators injured. In addition, the data are standardized (z-score normalization), as shown in the following formula:

$$ X^{\prime } = \frac{{X - {\text{mean}} (X)}}{{{\text{std}} (X)}}, $$
(19)

where \({\text{mean}} (X)\) and \({\text{std}} (X)\) are the average and standard deviation of historical time series, respectively.

Finally, we use the pivot-table or cross-tabulation functions in pandas to aggregate the processed data by the date and region of each incident, which gives the daily number of innocent civilian casualties in each region. As described in “Spatial diffusion graph of terrorist attacks”, the adjacency matrix of the spatial diffusion graph of terrorist attacks is calculated according to the terrorist organization that carried out the terrorist attack and the region where the terrorist attack occurred.
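The aggregation and standardization steps can be sketched with pandas as follows. This is an illustrative sketch, not the preprocessing script used in the experiments: the four rows are made-up toy records that only reuse the GTD column names listed above, imputation and outlier replacement are omitted, and computing the mean and standard deviation over the whole toy table is a simplifying assumption.

```python
import pandas as pd

# Toy rows that reuse the GTD column names (values are invented for illustration).
events = pd.DataFrame({
    "iyear": [2019, 2019, 2019, 2019],
    "imonth": [1, 1, 1, 1],
    "iday": [1, 1, 2, 4],
    "region": [10, 6, 10, 10],
    "nkill": [5, 2, 0, 3],
    "nkillter": [1, 0, 0, 3],
    "nwound": [4, 0, 2, 1],
    "nwoundte": [0, 0, 0, 0],
})

# Civilian casualties = (deaths - perpetrators killed) + (injuries - perpetrators injured)
events["civilian_casualties"] = (
    (events["nkill"] - events["nkillter"]) + (events["nwound"] - events["nwoundte"])
)
events["date"] = pd.to_datetime(
    events[["iyear", "imonth", "iday"]].rename(
        columns={"iyear": "year", "imonth": "month", "iday": "day"})
)

# One row per day, one column per region, summing casualties of same-day events.
daily = events.pivot_table(index="date", columns="region",
                           values="civilian_casualties", aggfunc="sum", fill_value=0)
all_days = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(all_days, fill_value=0)     # days without events become 0

# Eq. (19): z-score standardization with the mean and std of the series.
normalized = (daily - daily.values.mean()) / daily.values.std()
```

Real GTD records can contain unknown months or days, which would need to be handled before the date column is built; that step is outside this sketch.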

Multi-scale periodic analysis of time series

To facilitate data processing and drawing, after the original data is preprocessed, we carry out multi-scale periodic analysis on a monthly basis. To eliminate the boundary effect, we first extend the two ends of the target time series, and then after the wavelet transform, we remove the extended data at both ends and retain the wavelet coefficients in the original time series.

The real part isoline graph of wavelet coefficients can reflect the periodic changes of the prediction series at different time scales and their distribution in the time domain, and then judge the changing trend of the number of innocent civilian casualties in terrorist attacks on different time scales. Figure 7 shows the multi-scale features of the prediction series in the process of change. The outline of the isoline in the figure is sparse at the top and dense at the bottom, which means that the corresponding period has a longer average duration on a large time scale. On the time scale of about 117 m (month), there are four and a half “high-low” periods, and this periodic phenomenon lasts for the whole time domain. In the time scale of about 18 m, some “high-low” periods only appear in the time domain \(T_{121} - T_{176}\) and \(T_{202} - T_{300}\).

Fig. 7 The real part isoline graph of wavelet coefficients

The modulus of Morlet wavelet coefficients is the reflection of the energy density corresponding to different time-scale transformation periods in the time domain distribution. The larger the modulus is, the stronger the periodicity of the corresponding scale or period is. Figure 8 shows that the time scale modulus near 117 m is the largest, indicating that the periodic change of this time scale is the most obvious. The periodic variation of the time scale near 18 m takes second place.

Fig. 8 The modulus isoline graph of wavelet coefficients

The wavelet variance graph can reflect the distribution of the wave energy of the target time series with the time scale, and can be used to determine the main period in the transformation process. It can be seen from Fig. 9 that there are two obvious peaks in the wavelet variance graph of the target time series. The 117 m time scale corresponds to the highest peak, so it is the first significant period or large-scale period, and the 18 m time scale corresponds to the second significant period or small-scale period. This shows that the fluctuations of the above two periods control the variation features of the target time series in the whole time domain.

Fig. 9 The wavelet variance graph

According to the results of the wavelet variance test, we draw the variation graph of wavelet transform coefficients on a significant time scale. The multi-scale periodic features and average period of time series under different time scales can be analyzed from Fig. 10. On the 117 m time scale, the average period is about 74.33 months, and on the 18 m time scale, the average period is about 12.08 months.

Fig. 10 Wavelet transform coefficients on a significant time scale

According to the above analysis results, the periods of 117 and 18 m time scales describe the trend changes of historical data and the periodic fluctuations at different times within the trend, respectively. Therefore, the historical data under these two kinds of time scale periods can be used as the input of the trend and period components of the AST-MGCN. In addition, because the historical time series close to the forecast period will inevitably have an impact on future prediction, we take the segment of historical time series directly adjacent to the predicting period as the input of the closeness component of the model.

Settings

We use the last 928 days of data as the test set and the remainder as the training and validation sets. We use four weeks of historical data to predict the next week’s data, i.e., the past 28 consecutive time steps are used to predict the next 7 consecutive time steps. All experiments are repeated ten times.
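The construction of training examples from the daily series can be sketched as a sliding window. The function name `make_windows`, the stride of one day and the placeholder series are assumptions of this example; only the window lengths (28 input steps, 7 output steps) come from the setting above.

```python
import numpy as np

def make_windows(series, n_in=28, n_out=7):
    """series: (n_days, n_regions).
    Returns inputs of shape (S, n_in, R) and targets of shape (S, n_out, R)."""
    n_samples = series.shape[0] - n_in - n_out + 1
    X = np.stack([series[i:i + n_in] for i in range(n_samples)])
    Y = np.stack([series[i + n_in:i + n_in + n_out] for i in range(n_samples)])
    return X, Y

# Placeholder series: 40 days for 12 regions (values are arbitrary).
series = np.arange(40 * 12, dtype=float).reshape(40, 12)
X, Y = make_windows(series)
```

With 40 days of data this yields 6 examples; each target window starts on the day immediately after its input window ends.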

We implement the AST-MGCN using the PyTorch framework. The hyperparameters are determined by the model’s performance on the validation set. Considering computing efficiency and the degree of improvement in prediction performance, we set \(K = 3\). The number of ST-blocks in each independent component is 1. \(\sqcup\) is chosen to be the sum aggregation function. In the model, the spatial multi-graph convolution layer uses 64 convolution kernels, the temporal convolution layer uses 64 convolution kernels of size 1 × 3, the convolution of residual units uses 64 convolution kernels of size 1 × 1, and the final convolution layer uses p convolution kernels of size 1 × 64. The lengths of the three segments are set to \(T_{{\text{t}}} = 14\), \(T_{{\text{p}}} = 7\), and \(T_{{\text{c}}} = 7\). The size of the prediction window is 7. During the training phase, we use the Adam optimizer with a batch size of 64 and a learning rate of 0.001. Training is terminated when the validation error does not decrease for 100 consecutive rounds. All experiments are run independently in the same experimental environment with an Intel(R) Xeon(R) Gold 5118 host and an NVIDIA Tesla V100-SXM2-32 GB GPU. Unless otherwise noted, the following experiments use the Mean Absolute Error (MAE) loss function.
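The stopping rule described above can be sketched as a small helper that is independent of any deep learning framework. The class name `EarlyStopping` and its interface are hypothetical; the default patience of 100 rounds is the only value taken from the setting above.

```python
class EarlyStopping:
    """Signal a stop when the validation error has not decreased for `patience` rounds."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def step(self, val_error):
        """Record one validation error; return True if training should stop."""
        if val_error < self.best:
            self.best = val_error
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience

# Toy run with a short patience so that the rule triggers quickly.
stopper = EarlyStopping(patience=3)
flags = [stopper.step(err) for err in [5.0, 4.0, 4.5, 4.2, 4.1]]
```

In a training loop, `step` would be called once per round with the validation error, and the parameters from the best round would typically be kept.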

Performance comparison

We compare our model with the following six baselines. To make a fair comparison, we use the same experimental settings for all comparison models. Specifically, we use the last 28 consecutive time steps to predict the next 7 consecutive time steps. The same test set from the GTD data set is used to evaluate the performance of all models, and the optimal parameters are selected according to the prediction results on the validation set.

  • SES [59]: simple exponential smoothing method is a time series prediction method developed on the basis of the simple average method and weighted moving average method. The smoothing coefficient is automatically adjusted during the calculation and is determined by minimizing the Euclidean distance between the smoothing value and the actual value.

  • HOLT [60]: HOLT two-parameter linear exponential smoothing method is an advanced linear exponential smoothing method, which can consider both historical average and changing trend. Like SES, the algorithm automatically adjusts the horizontal smoothing coefficient and trend smoothing coefficient according to the actual situation in the calculation process to ensure that the model can be dynamically updated according to the latest data.

  • RNN [61]: recurrent neural network is a deep learning model that can capture temporal dependencies. In the experiment, we use the bi-directional RNN model, the dropout is 0.5 and the number of hidden layer units is 16. The parameters of the training phase are the same as those of the AST-MGCN model.

  • LSTM [11]: Long Short-Term Memory network is a special kind of RNN that can learn long-term temporal dependencies. In the experiment, a two-layer bi-directional LSTM model is selected. The output dimension of the first full connection layer is set to 20, and the other parameters are the same as the RNN model.

  • GRU [12]: gated recurrent unit networks are a new type of RNN, which can be used to capture long-term temporal dependencies. The two-layer bi-directional GRU model is used in the experiment, and the other parameters are the same as the LSTM model.

  • STGCN [20]: spatial–temporal graph convolutional network uses ChebNet and 2D convolutional networks to capture spatial dependencies and temporal correlations, respectively. In the experiment, the model uses the default hyperparameters of the original paper, and the training parameters are the same as those of the AST-MGCN model.

We use MAE, root mean squared error (RMSE) and mean absolute percentage error (MAPE) as evaluation metrics. MAE and RMSE measure the absolute error between the predicted value and the actual value, while MAPE measures the relative size (i.e., percentage) of the deviation from the actual value. MAE and MAPE are less sensitive to extreme values, whereas RMSE squares the errors, which makes it more sensitive to outliers and highlights large errors.
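The three metrics can be written in a few lines of NumPy. The function names and the toy values are illustrative; in particular, excluding zero actual values from MAPE is an assumption of this sketch, since the treatment of zero-casualty days is not specified here.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Percentage error over entries with a nonzero actual value (assumed masking)."""
    mask = y_true != 0
    return 100.0 * np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))

y_true = np.array([2.0, 4.0, 10.0])               # toy actual values
y_pred = np.array([1.0, 4.0, 14.0])               # toy predictions
scores = mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred)
```

In this toy case the single error of 4 raises RMSE well above MAE, which illustrates why RMSE is the more outlier-sensitive of the two.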

Table 1 shows that our AST-MGCN achieves the best performance on all evaluation metrics. The traditional time series analysis methods give poor predictions, indicating that their ability to model nonlinear and complex terrorist attack data is limited. The deep learning methods predict better overall than the traditional time series methods. LSTM and GRU have similar prediction performance, and because both can capture long-term temporal dependencies, their average performance is better than that of RNN. STGCN and AST-MGCN, which consider both temporal and spatial correlation, are superior to the recurrent deep learning models. AST-MGCN also outperforms STGCN, the strongest baseline. These results indicate that the AST-MGCN has advantages in describing the spatial–temporal features of terrorist attack data, and can effectively capture the dynamic changes of terrorist attack data.

Table 1 Performance comparison of different approaches

Experimental analysis and discussion

Ablation experiment

To verify the effect of each module in the model, we design different variant models. First of all, we use spatial–temporal multi-graph convolution and spatial adjacency graph as the basic model, and then gradually add spatial–temporal attention, multi-time independent component, spatial diffusion graph to form the AST-MGCN. The specific process is described as follows:

  • variant a (spatial–temporal multi-graph convolution + spatial adjacency graph (Base Model)): we use the spatial–temporal multi-graph convolution network of the spatial adjacency graph of terrorist attacks as the basic model.

  • variant b (+ spatial–temporal attention): we add the attention mechanism in both the spatial and temporal dimensions.

  • variant c (+ multi-time independent component): we add three independent components: trend, period and closeness.

  • variant d (+ spatial diffusion graph): on the basis of the spatial adjacency graph, we add a spatial diffusion graph of terrorist attacks to form the model proposed in this paper.

Figure 11 shows that the key modules in our model are effective for the prediction task. Specifically, the spatial–temporal attention helps to automatically capture valuable spatial–temporal features and thereby improves the prediction performance of the model. Our AST-MGCN models three kinds of temporal properties (trend, period and closeness) and dynamically fuses the output of the three independent components, which effectively extracts the dynamic features in the temporal dimension. In addition, the multi-view design of terrorist attack networks effectively captures the dynamic propagation law of terrorist attacks in the social-spatial dimension, and provides comprehensive supplementary information for the prediction task. In conclusion, the ablation experiment demonstrates the effectiveness of each module in our model.

Fig. 11 Comparison of the designed variant models

Multi-step ahead prediction

Figure 12 shows the multi-step ahead prediction results of AST-MGCN and the best baseline, STGCN, as the prediction interval increases. Generally speaking, the longer the prediction interval, the more difficult the prediction, and the prediction error of both methods increases; nevertheless, AST-MGCN achieves the best prediction performance at every interval. The comparison on the other two evaluation metrics, MAE and MAPE, leads to the same conclusion. This shows that the combination of the attention mechanism and spatial multi-graph convolution can better mine the dynamic spatial–temporal pattern of terrorist attack data.

Fig. 12 Multi-step ahead prediction

Impact of different network configurations

Figures 13, 14, 15 and 16 show the experimental results of different network configurations. It should be noted that to verify the impact of different loss functions on the performance of the model, the following experiments use RMSE as the loss function.

Fig. 13 Impact of different fusion mechanisms

Fig. 14 Impact of the order K of Chebyshev polynomials

Fig. 15 Impact of the number of convolution kernels

Fig. 16 Impact of network depth

We use a parametric-matrix-based fusion mechanism to fuse the temporal closeness, period and trend components, assigning different weights to the independent components. Two simpler alternatives are average fusion and straightforward summation, i.e., \(\hat{Y} = \hat{Y}_{{\text{t}}} + \hat{Y}_{{\text{p}}} + \hat{Y}_{{\text{c}}}\). As shown in Fig. 13, our mechanism is significantly better than both the average and the straightforward method, demonstrating the effectiveness of the proposed parametric-matrix-based fusion.
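The three fusion strategies can be contrasted with a small sketch. It assumes that the parametric-matrix-based fusion weights each component element-wise, as is common in related spatial–temporal models; the exact form used in AST-MGCN is not restated here, and the function names and weight values are hypothetical (in practice the weights would be learned).

```python
def fuse_sum(y_t, y_p, y_c):
    """Straightforward fusion: element-wise sum of the three outputs."""
    return [t + p + c for t, p, c in zip(y_t, y_p, y_c)]

def fuse_average(y_t, y_p, y_c):
    """Average fusion: element-wise mean of the three outputs."""
    return [(t + p + c) / 3.0 for t, p, c in zip(y_t, y_p, y_c)]

def fuse_parametric(y_t, y_p, y_c, w_t, w_p, w_c):
    """Parametric fusion (assumed element-wise): each component has its
    own weight per output position."""
    return [
        wt * t + wp * p + wc * c
        for t, p, c, wt, wp, wc in zip(y_t, y_p, y_c, w_t, w_p, w_c)
    ]

# Toy component outputs for two regions (hypothetical values).
y_t, y_p, y_c = [3.0, 6.0], [6.0, 3.0], [9.0, 12.0]
# Hypothetical weights; a trained model would learn these.
w_t, w_p, w_c = [0.5, 0.0], [0.5, 1.0], [1.0, 0.5]

summed = fuse_sum(y_t, y_p, y_c)
averaged = fuse_average(y_t, y_p, y_c)
weighted = fuse_parametric(y_t, y_p, y_c, w_t, w_p, w_c)
```

Unlike the sum and the average, the parametric variant lets each region rely on a different mix of trend, period and closeness, which is the flexibility the comparison in Fig. 13 evaluates.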

To study the impact of hyperparameters, we vary three important ones: the order K of the Chebyshev polynomials, the number of convolution kernels and the number of ST-blocks. As can be seen from Figs. 14 and 15, as the order K and the number of convolution kernels increase, the error first decreases and then increases. This indicates that a larger K or a larger number of convolution kernels lets the model capture more global correlations or implicit features, but also increases the complexity of the network and reduces the generalization performance of the model. Figure 16 presents the impact of network depth. As the network goes deeper (i.e., the number of ST-blocks increases), the model error gradually increases: the number of parameters and the amount of computation grow significantly, training becomes much harder, and the performance of the network degrades.
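To make the role of the order K concrete, the sketch below builds the Chebyshev basis with the standard recurrence \(T_k(\tilde{L}) = 2\tilde{L}T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L})\) and applies it to a graph signal. It is an illustration under assumptions: plain Python lists stand in for tensors, the helper names are hypothetical, the sum runs over \(k = 0, \ldots, K\) (some formulations stop at \(K-1\)), and \(\tilde{L}\) is taken to be an already scaled Laplacian.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def chebyshev_basis(l_tilde, order):
    """Return [T_0, ..., T_order] with T_k = 2 * L * T_{k-1} - T_{k-2}."""
    n = len(l_tilde)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    basis = [identity]
    if order >= 1:
        basis.append([row[:] for row in l_tilde])
    for _ in range(2, order + 1):
        prod = matmul(l_tilde, basis[-1])
        basis.append(
            [[2.0 * prod[i][j] - basis[-2][i][j] for j in range(n)] for i in range(n)]
        )
    return basis

def cheb_conv(l_tilde, x, theta):
    """Compute sum_k theta[k] * T_k(L) x for a signal x (one value per node)."""
    basis = chebyshev_basis(l_tilde, len(theta) - 1)
    out = [0.0] * len(x)
    for coeff, t_k in zip(theta, basis):
        for i in range(len(x)):
            out[i] += coeff * sum(t_k[i][j] * x[j] for j in range(len(x)))
    return out

# Scaled Laplacian of a two-node graph with a single edge.
l_tilde = [[0.0, -1.0], [-1.0, 0.0]]
x = [1.0, 3.0]                 # toy signal, one value per node
theta = [0.5, 0.25, 0.25]      # hypothetical coefficients for k = 0, 1, 2
basis = chebyshev_basis(l_tilde, 2)
y = cheb_conv(l_tilde, x, theta)
```

Each additional order adds one more matrix product and one more set of coefficients, so a larger K widens the neighbourhood a node can aggregate from while increasing model complexity, consistent with the trade-off observed in Fig. 14.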

Impact of temporal closeness, period, trend

We verify the impact of the temporal closeness, period and trend components on GTD, as shown in Fig. 17. Figure 17a shows the effect of temporal closeness, where we fix \(T_{{\text{p}}} = 7\) and \(T_{{\text{t}}} = 7\) but change \(T_{{\text{c}}}\). For example, \(T_{{\text{c}}} = 0\) means that we do not employ the closeness component, resulting in a very poor RMSE of 21.05. We can observe that RMSE first decreases and then increases as the length of closeness increases, with \(T_{{\text{c}}} = 7\) giving the best performance. Figure 17b depicts the effect of the period, where we set \(T_{{\text{c}}}\) to 7 and \(T_{{\text{t}}}\) to 7 but change \(T_{{\text{p}}}\). We can see that \(T_{{\text{p}}} = 7\) has the best RMSE. The model without the period component (i.e., \(T_{{\text{p}}} = 0\)) is worse than the model with \(T_{{\text{p}}} = 14\), but better than the models with \(T_{{\text{p}}} = 21\) and \(T_{{\text{p}}} = 28\), meaning that short-range periods are beneficial, whereas long-range periods may not be helpful for modeling. Figure 17c presents the effect of the trend, where \(T_{{\text{c}}}\) and \(T_{{\text{p}}}\) are both fixed to 7 and \(T_{{\text{t}}}\) changes from 0 to 21. We can observe that RMSE first decreases and then increases as the length of the trend increases, with \(T_{{\text{t}}} = 14\) giving the best performance. As with periods, long-range trends may be hard to model or not helpful.
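How the three lengths select input data can be illustrated with a hedged sketch. The sampling scheme below (the most recent \(T_{{\text{c}}}\) steps for closeness, and \(T_{{\text{p}}}\) or \(T_{{\text{t}}}\) earlier steps spaced by a fixed interval for period and trend) follows a convention common in related work and is an assumption, not necessarily the construction used in this paper; `build_inputs`, `period_len` and `trend_len` are hypothetical names and values.

```python
def build_inputs(series, target, t_c, t_p, t_t, period_len, trend_len):
    """Collect closeness, period and trend inputs for predicting series[target].

    Assumed scheme: closeness takes the t_c most recent steps, period takes
    t_p steps spaced period_len apart, trend takes t_t steps spaced
    trend_len apart. A length of 0 disables the component.
    """
    lookback = max(t_c, t_p * period_len, t_t * trend_len)
    if target - lookback < 0:
        raise ValueError("not enough history for the requested lengths")
    # Each list is ordered from the oldest to the most recent step.
    closeness = [series[target - i] for i in range(t_c, 0, -1)]
    period = [series[target - i * period_len] for i in range(t_p, 0, -1)]
    trend = [series[target - i * trend_len] for i in range(t_t, 0, -1)]
    return closeness, period, trend

# Toy series whose value equals its time index, so the output shows
# which time steps were selected.
series = list(range(100))
closeness, period, trend = build_inputs(
    series, target=60, t_c=3, t_p=2, t_t=2, period_len=7, trend_len=28
)
# Setting t_c = 0 drops the closeness component entirely.
no_closeness, _, _ = build_inputs(
    series, target=60, t_c=0, t_p=2, t_t=2, period_len=7, trend_len=28
)
```

Under this scheme, larger \(T_{{\text{p}}}\) or \(T_{{\text{t}}}\) values reach further into the past, which is one plausible reason why long-range settings add inputs that are only weakly related to the target.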

Fig. 17 Impact of temporal closeness, period, trend

Impact of different loss functions

To verify the impact of different loss functions on the performance of the model, we train the model with the MAE and RMSE loss functions. As shown in Fig. 18, the model using the MAE loss function reduces MAE by 26.67% while slightly reducing RMSE. This indicates that, because the data used in this paper contain many outliers, better results can be achieved with the MAE loss function, which is more robust to outliers.
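The different sensitivity of the two losses to outliers can be seen in a small numerical sketch. The values below are hypothetical toy data, not GTD records, and the functions are plain reference implementations rather than the training code of the model.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Four ordinary samples plus one extreme casualty count (the outlier).
y_true = [2.0, 3.0, 1.0, 4.0, 100.0]
y_pred = [3.0, 2.0, 2.0, 3.0, 4.0]

mae_all = mae(y_true, y_pred)
rmse_all = rmse(y_true, y_pred)
# The same metrics with the outlier removed.
mae_clean = mae(y_true[:4], y_pred[:4])
rmse_clean = rmse(y_true[:4], y_pred[:4])
```

Without the outlier both metrics equal 1.0; with it, RMSE rises far more than MAE because the error is squared before averaging. A model trained on a squared-error loss is therefore pulled more strongly toward fitting rare extreme events.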

Fig. 18 Impact of different loss functions

Conclusion

In this paper, a novel attention-based spatial–temporal multi-graph convolutional network called AST-MGCN is proposed and successfully applied to predict the number of innocent civilian casualties in terrorist attacks. The model provides an effective way to capture the spatial–temporal dynamic features of terrorist attack data. Specifically, we design a spatial adjacency graph and a spatial diffusion graph according to the different social-spatial dynamic relations of terrorist attacks. The difference and complementarity of the multi-view data are used to encode the social-spatial correlation of terrorist attacks, and spatial multi-graph convolution is used to capture the comprehensive social-spatial dynamic information. We propose a multi-scale periodic analysis method based on the wavelet transform to determine the multi-scale features of the target time series, and then model the temporal closeness, period and trend properties, which fully extracts the temporal dynamic information. Moreover, we combine spatial–temporal attention with spatial–temporal multi-graph convolution to capture the spatial–temporal dynamic features most relevant to the prediction task. Experiments on the GTD database show that AST-MGCN achieves state-of-the-art performance.