1 Introduction

In today’s world, advances in hardware and wireless network technology have opened the path for energy-efficient, multifunctional and low-cost sensors [1]. Spread across a large geographical region, a set of sensors can form a sensor network used for data collection and analysis [2], in particular for large-scale time series data. Example domains where such real-world sensor data are analyzed include traffic [3], weather [4] and seismology [5], in particular regarding time series regression and classification, e. g., [6, 7].

Recently, there have been considerable advances in deep learning methods, in particular regarding CNNs, with respect to their ability to automatically find structure and meaningful features in the data. This leads to powerful (implicit) feature construction and computationally efficient models, e. g., [8, 9] for time series. However, if only the time series data are examined, some aspects of the sensor data remain unseen, i. e., the spatial relations of the sensors in datasets that are geographically grounded.

Consequently, researchers have developed deep learning techniques to perform time series analysis like forecasting [10], anomaly detection [11] and imputation [12], with data arising from networks (i. e., graphs), called graph neural networks (referred to as GNNs from now on), which we also focus on in this paper. However, when the predicted value does not rely more on recent values from the input than on earlier ones, but instead depends on the whole series (a setting known as Time Series Extrinsic Regression (TSER) [7]), the aforementioned models are not adequate for the task.

Therefore, in this paper we tackle the problem of multivariate time series regression, for which we present a novel GNN-based architecture named TISER-GCN. Our evaluation uses high-frequency, network-based seismic data and demonstrates the efficacy of our proposed approach.

Previous attempts at tackling similar time series problems with graph-based methods have been made by [5, 13, 14], yet each has some shortcomings. van den Ende and Ampuero [5] designed a model for the localization of earthquakes from waveform data. However, they only append the (latitude, longitude) information to the time series handled by a CNN. Therefore, while prediction scores improved, no actual GNN layers were used. Second, [13] proposed a graph partitioning algorithm that works together with a CNN. However, they make use of classical graph theory techniques and no GNN method is applied. Lastly, [14] recently suggested a method that uses CNNs and GNNs for seismic event classification. However, (1) no spatial information is used at all, i. e., each edge has a weight of 1, (2) no meta information about the stations is added, and (3) only three nodes are examined for each observation, which can hardly be interpreted as a full-fledged complex network.

Therefore, we propose a larger scale GNN architecture that can process multivariate time series for such a regression task. By combining the capabilities of convolutional layers (feature extraction) and graph convolutional layers (spatial information), our model can manage the feature sizes that are common in high-frequency time series data arising from multiple sensors.

We test our proposed model on network-based seismic data, an intuitive domain for GNN models because the sensors are naturally geographically grounded. The model is inspired by the work presented in [15] and [16], which serves as our most prominent baseline. In their work, the maximum ground shaking at a set of seismic stations is predicted by treating this as a regression problem, using convolutional layers to extract useful features from a given time series. We take their work as a departure point and illustrate how to design a deep learning model using GNNs for such a task.

Our contributions are summarized as follows:

  1.

    We propose a method to perform multivariate regression on time series originating from graph-structured data. For this, we present an architecture utilizing convolutional and graph convolutional layers that is also adjustable for other use cases or datasets, e. g., time series classification tasks.

  2.

    We evaluate our model thoroughly on two seismological datasets that differ significantly from one another, evidencing the generality and potential of the proposed GNN-based architecture for this task. We discuss our results in detail and perform a comparison against several baseline models (in particular [15], but also [14, 17]) and traditional machine learning methods.

  3.

    Finally, we systematically analyze the capabilities of our model through comprehensive experimentation, adjusting several hyperparameters of our proposed workflow.

The article is further structured as follows: we discuss related work in Sect. 2, which provides the necessary background on deep learning, graphs and GNNs. Next, Sect. 3 introduces the dataset, our method and training settings. After that, Sect. 4 presents our results and discusses these in the context of a model-based comparison. Finally, Sect. 5 concludes with a summary and outlines interesting directions for future work.

2 Background and related work

This section briefly outlines the background and related work on graphs and deep learning in general, CNNs, GNNs and their utilization for time series, as well as the application of deep learning to seismic analysis.

2.1 Deep learning on complex data

Traditional machine learning often requires considerable effort from the user to construct meaningful features, which is usually time-consuming and error-prone [8]. Deep learning provides automatic feature extraction with the help of multiple layers that apply nonlinear processing. In particular, this also applies to complex representations such as multivariate time series and graphs. Therefore, deep learning offers powerful processing and learning on complex data.

Initially, the multilayer perceptron (MLP) was developed, in which all network layers are fully connected [18]. While powerful, its high computation time limits the depth of the network. Therefore, researchers have devised more advanced architectures for specific tasks. One of the most prominent and successful outcomes of this effort is the CNN.

A CNN is a regularized MLP that is specialized in handling data structures with multiple dimensions (e. g., pictures with color channels). It uses a feed-forward structure with convolutions instead of more general matrix multiplications. CNNs have been widely adopted in natural language processing and computer vision. A CNN has an advantage over MLPs due to its use of weight sharing, sampling and local receptive fields [19].

To create their output, the convolutional layers convolve the input using filters and activation functions. A convolution operation is defined as

$$\begin{aligned} y_{i}^{l+1}(j)=k_{i}^{l} \cdot M^{l}(j)+b_{i}^{l}, \end{aligned}$$
(1)

where \(y_{i}^{l+1}(j)\) denotes the input of the j-th neuron in the feature map i of layer \(l+1\), \(k_{i}^{l}\) the weights of the i-th filter kernel in layer l, \(M^{l}(j)\) the j-th local region in layer l and \(b_{i}^{l}\) the respective bias. An activation function is applied after each convolutional layer to retrieve the nonlinear features.
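
To make this operation concrete, the following minimal NumPy sketch applies Eq. (1) with a single filter over a one-dimensional signal; the variable names and the toy data are illustrative and not taken from the original implementation.

import numpy as np

def conv1d_single_filter(x, k, b, stride=1):
    """Apply Eq. (1), y(j) = k . M(j) + b, to every local region M(j) of the signal x."""
    width = len(k)
    n_out = (len(x) - width) // stride + 1
    y = np.empty(n_out)
    for j in range(n_out):
        region = x[j * stride : j * stride + width]   # local region M^l(j)
        y[j] = np.dot(k, region) + b                  # k_i^l . M^l(j) + b_i^l
    return np.maximum(y, 0.0)                         # ReLU activation applied after the convolution

# Toy usage: a smoothing kernel over a synthetic waveform channel
signal = np.sin(np.linspace(0, 4 * np.pi, 100))
features = conv1d_single_filter(signal, k=np.array([0.25, 0.5, 0.25]), b=0.1)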

2.2 Graphs

Before discussing the extension of deep learning models to graphs, we first introduce some background and basic notation. We define a graph G as \(G=(V,E)\), where V is the set of nodes and E the set of edges (see Fig. 1b for an example). An edge \(e_{ij}=(v_i,v_j)\) connects two nodes \(v_{i}, v_{j} \in V\). A common way to represent a graph is with an adjacency matrix \(A \in {\mathbb {R}}^{N \times N}\) with \(N=|V|\), a square matrix such that \(A_{ij}=1\) if there is an edge from node \(v_{i}\) to node \(v_{j}\), and 0 otherwise. The number of neighbors of a node v is known as the degree of v and is given by \(D_{ii} = \sum \nolimits _{j}A_{ij}\), where D is the diagonal degree matrix. Edges can be undirected or directed. Undirected edges contain no notion of source and destination, e. g., the absolute distance between two nodes is always equal no matter from which node the measurement starts. Directed edges do contain direction information, e. g., whether somebody follows someone else on a social network or not.
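
As a small illustration of this notation, the following snippet builds the adjacency matrix A and the diagonal degree matrix D for a hypothetical undirected four-node cycle graph.

import numpy as np

# Undirected 4-node cycle: edges (0,1), (1,2), (2,3), (3,0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # D_ii = sum_j A_ij, so every node here has degree 2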

In addition, nodes and edges (as well as entire graphs) can have features as well, such that a feature vector \(a=(a_1, a_2, \ldots , a_n)\) of individual features \(a_i \in \Omega _A\) out of a feature domain \(\Omega _A\) is assigned to the nodes (and/or edges). GNN problems therefore mostly consist of node-level, edge-level and graph-level tasks, utilizing the aforementioned feature types.

However, standard convolutional layers (e. g., in CNNs) are not applicable to graph-structured data due to its non-Euclidean nature. In particular, one cannot convolve an \(n \times n\) grid over a graph the same way as over an image. Figure 1 shows an example: both the red and blue boxes convolve over the same grid of \(3 \times 3\) numbers (Fig. 1a). The red box covers three nodes, while the blue box covers two nodes, as shown in Fig. 1b. Thus, extensive effort was put into finding ways to define convolutions over graphs.

Fig. 1 Examples: (a) a matrix-based convolution on an image/time series or (b) a graph, for which a convolution is a lot harder to define than in (a)

2.3 Graph neural networks

GNNs are deep learning-based methods adapted to the graph domain. In general, the history of deep learning models for graphs is surprisingly long. For example, recursive neural networks were already adapted to work on directed acyclic graphs in the 1990s [20]. However, one recent paper revived the interest in using deep learning on graphs [21]. It proposes two ways, based on hierarchical clustering and on the spectrum of the graph Laplacian, to perform convolutions on low-dimensional graphs. The approach from [21] falls into one of the two main historical families of methods for performing convolutions on graphs: (1) spectral methods and (2) spatial methods.

Spectral methods use the eigenvectors and eigenvalues of a matrix obtained via eigendecomposition, and perform convolutions using the graph Fourier transform and the inverse graph Fourier transform. These transformations of a signal x are defined as \(F(x)=U^{T}x\) and \(F^{-1}(x)=Ux\), where U is the matrix of eigenvectors of the normalized graph Laplacian \(L=I - D^{-1/2}AD^{-1/2}\), D is the degree matrix of the adjacency matrix A and I refers to the identity matrix of size \(|V|\) [22].
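
The following NumPy sketch illustrates this spectral filtering of a graph signal; the low-pass filter g in the usage comment is only an example and not a method used later in this paper.

import numpy as np

def spectral_filter(A, x, g):
    """Filter the graph signal x (one value per node) in the spectral domain of A."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(deg, 1e-12, None)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian
    eigvals, U = np.linalg.eigh(L)                     # eigendecomposition of L
    x_hat = U.T @ x                                    # graph Fourier transform F(x) = U^T x
    y_hat = g(eigvals) * x_hat                         # filtering on the graph frequencies
    return U @ y_hat                                   # inverse transform F^{-1}(x) = U x

# Usage: damp high graph frequencies with an exponential low-pass filter
# y = spectral_filter(A, x, g=lambda lam: np.exp(-2.0 * lam))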

Spatial methods use message passing techniques, which consider the local neighborhood of nodes and perform calculations on their top-k neighbors. With a node aggregation/update function f, an updated node representation Z can then be defined as \(Z=f(G)X\), where G refers to the adjacency or Laplacian matrix and X to the features of the nodes contained in G [23]. However, a serious issue with spatial methods is determining the convolution procedure for differently sized node neighborhoods [22].

To conclude, there are two typical approaches when designing GNNs: spatial methods focus more on the connectivity of the graph, while spectral methods focus on its eigenvalues and eigenvectors [23]. Both approaches were then simplified by Kipf and Welling [24] into the so-called graph convolutional networks (GCNs), which are also used in this paper. They define their propagation rule (a convolution on a graph) as follows:

$$\begin{aligned} H^{(l+1)}=\sigma \left( {\tilde{D}}^{-\frac{1}{2}}{\tilde{A}} {\tilde{D}}^{-\frac{1}{2}}H^{(l)}W^{(l)} \right) \end{aligned}$$
(2)

where \(H^{(l)}\in {\mathbb {R}}^{N\times D}\) is the matrix of activations of the l-th layer, \(\sigma \) denotes the selected activation function, \({\tilde{D}}\) with \({\tilde{D}}_{ii}={\textstyle \sum }_{j}{\tilde{A}}_{ij}\) refers to the degree matrix, \({\tilde{A}}=A+I_{N}\) is the adjacency matrix of the undirected graph G with added self-connections \(I_{N}\) to include a node’s own features, \(H^{(0)}=X\) where X are the node features, and \(W^{(l)}\) is the trainable weight matrix of a specific layer.
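
A single propagation step of Eq. (2) can be sketched in NumPy as follows; the adjacency matrix A, feature matrix H and weight matrix W are assumed to be given, and the toy dimensions in the usage comment are illustrative.

import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN step: sigma(D~^{-1/2} A~ D~^{-1/2} H W) as in Eq. (2)."""
    A_tilde = A + np.eye(len(A))               # add self-connections I_N
    d = A_tilde.sum(axis=1)                    # degrees D~_ii of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return activation(A_hat @ H @ W)

# Usage: 4 nodes with 3 input features mapped to 2 output features
# H1 = gcn_layer(A, H=np.random.rand(4, 3), W=np.random.rand(3, 2))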

A different method was proposed by [17] (graph attention networks, referred to as GAT from now on), in which the explicit structural information in A is dropped and instead inferred implicitly by applying self-attention over the node features. The authors motivate this by referencing previous work (e. g., transformers [25]) showing that self-attention is sufficient. Still, both techniques (GCNs and GATs) produce node-specific outputs of \(N \times F\) features, where F is the number of desired output features for each of the N nodes. Based on this, we discuss extensions for time series analysis below.

2.4 Graph neural networks for time series analysis

Considering the connection between GNNs and classical time series analysis, most effort has gone into time series forecasting [10, 26]. These approaches adapt existing neural network architectures to use operators from the graph domain. Examples are gated recurrent GNNs that utilise the spectral convolutions from [27]. Diffusion-convolutional networks take into account the in- and out-degrees of nodes to better capture their spatial dependencies, which is beneficial in, e. g., traffic prediction [28]. Later on, spatiotemporal graph convolutional neural networks were introduced that interchange the convolution procedure between the temporal and spatial dimensions [29]. In addition, GNNs have been used to perform anomaly detection in time series data: [11] propose an attention-based GNN that uses the results of a forecast to classify deviating predictions as anomalies. Furthermore, [12] propose GRIL (graph recurrent imputation layer), a spatial-temporal GNN that reconstructs missing data by learning spatial-temporal representations.

To conclude, a lot of progress has been made in combining GNNs with classical time series-related tasks. However, time series regression (and classification) tasks have not yet received the same amount of attention. Especially when the target value does not rely more on recent values from the input, but rather on the whole length of the time series, other model architectures are needed. As discussed below, seismic data are a typical domain where these data characteristics naturally arise.

2.5 Deep learning for seismic analysis

Over the past decades, huge volumes of continuous seismic data have been collected [30, 31]. With the availability of large datasets and advances in machine learning, the seismological community has also seen a rise in the use of machine and deep learning. Exemplary use cases are magnitude estimation [32] and earthquake detection [33]. Specifically for waveform analysis, CNNs have been applied several times. For example, [34] developed a CNN for single-station localization, magnitude and depth estimation. In addition, CNNs were developed for picking P- and S-wave arrival times [33, 35]. Others analyzed multistation waveforms to estimate earthquake locations [5, 36] or for early warning [37].

The work of [5, 13, 14] comes closest to the goals of our work. Each of these papers used time series-related data in combination with graphs to improve predictions. [5] used classical CNNs that attach the (latitude, longitude) locations of the sensors to the waveforms to improve predictions, which differs from our goal of using GNN layers. In other words, metadata was used to enhance their CNN model with spatial information, but no graph layers were applied. [13] propose a technique that combines CNNs with graph partitioning to group time series together based on spatial information. This procedure increases the quality of the within-group features and improves predictions, but no GNNs were utilised. Another recently proposed method for handling seismic data with GNNs is from [38], where the location and magnitude of earthquakes are predicted. However, their inputs are pre-calculated characteristics of the earthquakes for each station, which differs from our goal of using raw waveform data. Lastly, [14] present a method that uses CNNs and GNNs for seismic event classification. However, no spatial information is provided to the model (i. e., adjacency matrices containing only 1’s are created), no meta information about the nodes is added, only three stations are examined simultaneously, and their method focuses on time series classification instead of regression.

In this paper, we specifically propose a technique that will use the full power of GNNs to perform time series analysis. To the best of our knowledge, no GNN-based method was used to perform such a time series regression task before.

3 Method

In this section, we first introduce the datasets used in the experiments. With this as context information, we then define our problem formally. After that, we describe how to generate networks given the datasets. Next, we present the framework of our proposed model for multivariate time series regression (TISER-GCN). We first provide an intuition on an abstracted level, followed by a detailed discussion of the full architecture shown in Fig. 5, and its implementation. Finally, we discuss model training and the applied baseline models which are used for our evaluation.

3.1 Dataset

We perform regression on two datasets recorded by the Italian national seismic network [39, 40], described fully in [15, 16]. GNNs are an ideal candidate for the analysis of seismic data, since seismic measurements involve (1) an enormous amount of data and (2) sensors that are geographically grounded. Each sensor (i. e., a seismometer or accelerometer) in the dataset continuously records the amplitudes of the seismic waves resulting from earthquake occurrences along three components of ground motion (i. e., 3 dimensions): up-down, north-south and east-west. The input maximum (i. e., the greatest amplitude detected across all stations during the time window) is used to normalize the data, as done by [15]. The data recorded by the sensors located at the stations are crucial for seismologists to understand the nature of the recorded earthquakes (e. g., magnitude, location, focal mechanism, etc.).

Because information (via telecommunication) can be transmitted faster than seismic waves travel, seismologists have developed algorithms that predict the maximum intensity measurements (IMs) of ground shaking caused by an earthquake at a set of far-away stations, using only the first stations that have already recorded the earthquake. In the seismological literature, this objective is known as “earthquake early warning”. The IMs used here include peak ground acceleration (PGA), peak ground velocity (PGV) and spectral acceleration (SA) at 0.3, 1 and 3 s periods, and represent the labeled data of our model.

Therefore, the task with these datasets is as follows: using the earthquake recordings from the stations nearby the epicenter, recorded within 10 s from the origin time of the earthquake, we predict the IMs at all stations of the network. A large majority of the stations have not yet recorded the maximum earthquake-related ground motion, or any ground motion at all. Therefore, we hypothesize that GNNs are highly suited for this time series regression task of predicting IMs.

An example of an earthquake (red solid star) drawn from our dataset is shown on the left of Fig. 2. After the initial earthquake waves start to spread, only one station (called FEMA) has started recording part of the earthquake, while the other stations farther to the northwest have not yet recorded the first waves.

The CI dataset consists of 915 earthquakes recorded by a set of 39 stations (CI network) in central Italy. The earthquake epicenters and station locations lie within the area spanning latitudes [42\(^{\circ }\), 42.75\(^{\circ }\)] and longitudes [12.3\(^{\circ }\), 14\(^{\circ }\)], with earthquakes occurring from 01/01/2016 until 29/11/2016. It contains many spatially concentrated earthquakes and a dense network of stations. Earthquakes have depths in the range 1.6 km \(\le \) z \(\le \) 28.9 km and magnitudes in the range 2.9 < M \(\le \) 6.5.

The CW dataset consists of 266 earthquakes recorded by a set of 39 other stations (CW network) in central-western Italy. The earthquake epicenters and station locations lie within the area bounded by latitudes [41.13\(^{\circ }\), 46.13\(^{\circ }\)] and longitudes [8.5\(^{\circ }\), 13.1\(^{\circ }\)], with earthquakes spanning the time period between 01/01/2013 and 20/11/2017. All earthquakes have depths between 3.3 and 64.7 km, with magnitudes in the range 2.9 < M \(\le \) 5.1. The CW dataset therefore clearly covers a larger area than the CI dataset, and as Fig. 3 illustrates, the earthquakes of the CW dataset are scattered across a large part of central and northern Italy, whereas the CI dataset has earthquakes concentrated in one small area.

Fig. 2 Overview of the task tackled in this paper. An example earthquake (P (green) and S (orange) wavefronts after 10 s from the origin time) is shown on the left as a red star. By taking the initial input length (10 s) of X, we predict the Y values that characterize the earthquake at each node (representing a seismic station). Y has to be inferred by exploiting waveform patterns in X, since Y mostly reveals itself later on in the hidden part of the data (and does not necessarily occur at peaks). In addition, there can be, e. g., noise or sensor malfunctioning hindering the information. We progressively reduce the 10 s window length by 1 s (each red block) at a time, to further complicate the task in Sect. 4.1.1

3.2 Problem definition

The goal in this work is to regress various values from multivariate time series sensor data. We test our models on seismic data, where the maximum intensity measurements of shaking at each station should be predicted. We estimate values that are external to the input and do not necessarily depend on recent values, but rather on the whole length of the time series (see Fig. 2). Let L be a symmetrically normalized Laplacian matrix \(L \in {\mathbb {R}}^{N\times N}\), where N refers to the number of nodes in the graph, and Z a node feature matrix \(Z \in {\mathbb {R}}^{2 \times N}\) that holds the latitude and longitude location of each node. Given the input time series \(X \in {\mathbb {R}}^{E\times N\times T\times C}\), where E is the number of earthquakes, N the number of stations, T the length of the time series and C the number of channels, our goal is to predict \(Y \in {\mathbb {R}}^{5 \times N}\), which refers to the 5 target parameters of the time series, called PGV, PGA, SA(1 s), SA(0.3 s) and SA(3 s), for each node in the graph. The task is iteratively complicated (see Fig. 2) by reducing the input length T of X in Sect. 4.1.1.

Our final regression problem can then be formulated as follows:

$$\begin{aligned} f: L \times X \times Z \rightarrow Y \end{aligned}$$
(3)

where f denotes the learning function, L the graph, X the time series input, Z the node features and Y the regression targets.
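
For orientation, the shapes involved in this formulation can be laid out in a short sketch; the placeholder sizes correspond to the CI dataset and the 10 s input windows described in Sects. 3.1 and 3.5 (915 earthquakes, 39 stations, 3 channels sampled at 100 Hz).

E, N, T, C = 915, 39, 1000, 3   # earthquakes, stations, samples (10 s at 100 Hz), channels

shapes = {
    "L": (N, N),        # symmetrically normalized Laplacian of the station graph
    "Z": (2, N),        # node features: latitude and longitude of each station
    "X": (E, N, T, C),  # input waveforms
    "Y": (5, N),        # regression targets per station: PGV, PGA, SA(1 s), SA(0.3 s), SA(3 s)
}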

3.3 Network creation

Both undirected sensor networks were created by making use of the geographical locations of the seismic sensors. The adjacency matrix \(A_{i,j}\) was calculated by taking all pairwise geodesic distances (the shortest path between two points on a sphere) in km between the station (latitude, longitude) coordinates, and taking 1 − the (min, max)-scaled distance as the edge weight, since edges with a low distance should have a higher weight. Afterward, the resulting adjacency matrix can be filtered with a threshold k to adjust the sparsity of the graph: the higher the parameter k is set, the fewer edges remain in the graph (k has a range between 0 and 1). Experimentation showed that a threshold of 0.3 was optimal for the CI network and 0.6 for the CW network (see Sect. 4.2.1).
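
A sketch of this construction is shown below; it uses a haversine approximation of the geodesic distance, and the exact scaling and thresholding conventions of the released implementation may differ slightly.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def build_adjacency(coords, k):
    """coords: (N, 2) array of (lat, lon) per station; k: sparsity threshold in [0, 1]."""
    n = len(coords)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = haversine_km(*coords[i], *coords[j])
    A = 1.0 - (dist - dist.min()) / (dist.max() - dist.min())  # close stations get large weights
    np.fill_diagonal(A, 0.0)                                   # no self-loops at this stage
    A[A < k] = 0.0                                             # drop edges below the threshold k
    return A

# Usage with three example stations (coordinates are placeholders):
# A = build_adjacency(np.array([[42.1, 13.0], [42.5, 13.4], [42.3, 12.7]]), k=0.3)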

The adjacency matrix, however, still has to undergo some more changes to make it suitable for GNNs (especially GCNs). Therefore, we transform the adjacency matrix A into the symmetrically normalized Laplacian matrix \(L=I - D^{-1/2}AD^{-1/2}\), where D refers to the degree matrix containing the number of neighbors of each node and I refers to the identity matrix of size n, the number of nodes in the graph. A typical Laplacian would simply be \(L=D-A\); however, if nodes have widely varying connectivity, vanishing gradient problems can occur [24]. Therefore, the degree matrix is symmetrically normalized. Lastly, the identity matrix helps the GNN to also involve each node’s own features [24].
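
This transformation can be written as a minimal sketch following the definition above.

import numpy as np

def normalized_laplacian(A):
    """Symmetrically normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(A.sum(axis=1), 1e-12, None)))
    return np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt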

Fig. 3 Overview of the source-receiver geometries of the CI (a) and CW (b) seismic datasets. The blue solid dots correspond to the seismic stations (nodes) and the orange dots refer to the earthquake epicenters. Notice the larger geographical area and the sparseness of the epicenters of the CW dataset when compared to CI (visible in the map scale at the bottom of both figures). The thickness of the lines connecting the stations (i. e., the edges) is inversely proportional to the distance between the connected nodes, following the values of the adjacency matrix

Figure 3 visualizes the resulting graphs; the nodes resulting from the seismic stations of the CI and CW datasets are shown in panels (a) and (b), respectively. As mentioned before, looking at the geographical maps and the coordinates (latitude and longitude) on the axes of the plots, it is clear that the CW network covers a larger land area. In addition, the thickness of the edges in the figure is determined by the distance between two stations: the smaller the distance between two stations, the higher the edge weight. A higher weight helps the GNN determine which stations will most likely exhibit similar behavior in their sensor readings, improving the IM prediction.

3.4 Abstracted framework

Fig. 4 Abstract overview of our GNN implementation for multivariate time series processing

Figure 4 presents an abstracted overview of the building blocks of our proposed architecture, which can therefore also be instantiated for other tasks, such as time series classification. In summary, our proposed architecture of a GNN for time series regression (TISER-GCN) contains the following main contributions compared to previous work, as we will detail below:

  1.

    To obtain node features, we apply a 1D convolutional layer for feature extraction on the individual nodes using a wide kernel [8, 41] on the input data as in [15].

  2.

    To obtain the graph, the set of stations in the seismic area are considered as nodes, with the distance between them as edges.

  3.

    A GNN (utilizing GCN layers from [24]) of n layers is implemented for processing these feature vectors calculated by the convolutional layers as node features. While in other GNN papers the node features each measure a unique aspect about a node (e. g., the age and friend count in a social network), we demonstrate that GCNs can also learn from features that are sequential in time.

  4.

    In our case, as described below, we address a regression task on seismic data, predicting ground shaking at a set of seismic stations.

3.5 Model implementation

This section introduces the version of our abstracted framework applied to a regression task on seismic data. To provide a complete picture of the model, the source code is available.

The first block of our proposed model uses as input 10 s of 3-channel seismic waveform data sampled at 100 Hz, i. e., a time series from each station of each earthquake. See Fig. 5 and Sect. 3.1 for a full overview of the model and the dataset. After that, convolutional, graph-convolutional and post-processing layers are applied.

Fig. 5 Overview of the proposed architecture. The features from two convolutional layers are used as node features in the GCNs. After the two GCN layers (which are used for inter-station-related feature extraction), the data are flattened to retain as much information as possible. Afterward, the output is fed to fully connected layers

3.5.1 CNN for feature extraction

In the second block of our model, two 1D convolutional layers act as feature extractors by using wide kernel sizes, small strides, an increasing number of filters, kernel regularization and a ReLU activation function, which has proven useful for 1D time series data [15, 41]. The purpose of these convolutional layers is to learn the temporal patterns of each station. Afterward, the output of the second convolutional layer, which has shape (N, T, F), where N refers to the number of nodes, T the remaining length of the time series and F the number of filters, is reshaped to fit the dimensions expected by the graph convolutional layers. These layers typically need an input of shape (N, F), where F now refers to a one-dimensional feature vector \([x_{1},x_{2},\dots ,x_{n}]\) for each node in the graph.

To this reshaped feature vector, the (latitude, longitude) coordinates of each node are added as node metadata. Therefore, the feature vector of each node now consists of time series features from the convolutional layers and classical node features. The addition of this node metadata has been shown to improve performance in [16].

3.5.2 GCN processing

Next follow the graph convolutional layers from [24]. While in [15] a third convolutional layer gathers the cross-station information, here the graph convolutional layers take this role, since they use the features from the convolutional layers as node features for each node. More concretely, each of the N nodes receives one of the feature vectors of the (N, F) matrix as its node features, where F is the length of the feature vector (see Fig. 4). The two graph convolutional layers, each containing 64 filters, reduce these features to (N, 64). Considering the hyperparameters, experimentation revealed that starting with a ReLU activation function followed by a TanH works best. In addition, the bias was set to false (as suggested by [24]) and the same kernel regularizer was used as in the convolutional layers.

3.5.3 Postprocessing

A common practice in the graph literature is the use of global graph pooling operators such as max or average pooling [27, 42, 43]. These pooling techniques take the embeddings of all the nodes in a graph and globally pool them together with an aggregation function (max, sum, mean, ...). However, this procedure reduces the feature matrix from (N, F) to a single vector of length F, where N and F are the number of nodes and the number of filters used in the previous layer. This means that the graph is essentially reduced to one node, which is not desirable in our setting, because the problem is defined as a node-level regression task for all the nodes in the graph, in contrast to a graph-level task [42]. Therefore, the output of the final graph convolutional layer is directly flattened and then fed to the fully connected layer.

The output of this layer is then given to five fully connected regression layers. These layers represent the regression target variables called PGV, PGA, SA(0.3 s), SA(1 s) and SA(3 s) for each of the nodes.
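
To make the described layer layout concrete, the sketch below assembles a simplified TISER-GCN-like model in Keras: two wide-kernel 1D convolutions per station, concatenation of the (latitude, longitude) metadata, two graph convolutions following Eq. (2), a flattening step and five regression heads. All layer sizes, kernel widths and the minimal GraphConv layer are illustrative stand-ins; the released code relies on the Spektral GCN layer and the exact hyperparameters reported in the paper.

import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers, regularizers

N, T, C = 39, 1000, 3            # stations, samples (10 s at 100 Hz), channels


class GraphConv(layers.Layer):
    """Minimal graph convolution applying Eq. (2) with a fixed, pre-normalized adjacency."""

    def __init__(self, units, activation, a_hat):
        super().__init__()
        self.units = units
        self.activation = tf.keras.activations.get(activation)
        self.a_hat = tf.constant(a_hat, dtype=tf.float32)      # D~^{-1/2} A~ D~^{-1/2}, shape (N, N)

    def build(self, input_shape):
        self.w = self.add_weight(shape=(int(input_shape[-1]), self.units),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, h):                                          # h: (batch, N, F)
        hw = tf.matmul(h, self.w)                               # H W
        return self.activation(tf.einsum("ij,bjf->bif", self.a_hat, hw))


def build_model(a_hat):
    waveforms = layers.Input(shape=(N, T, C))                   # raw waveforms per station
    coords = layers.Input(shape=(N, 2))                         # (latitude, longitude) per station

    # Wide-kernel 1D convolutions applied to each station independently
    reg = regularizers.l2(1e-4)
    x = layers.TimeDistributed(layers.Conv1D(32, 125, strides=2, activation="relu",
                                             kernel_regularizer=reg))(waveforms)
    x = layers.TimeDistributed(layers.Conv1D(64, 125, strides=2, activation="relu",
                                             kernel_regularizer=reg))(x)
    x = layers.Reshape((N, -1))(x)                               # (batch, N, T' * F) node feature vectors
    x = layers.Concatenate(axis=-1)([x, coords])                 # append node metadata

    # Cross-station feature sharing through two graph convolutions
    x = GraphConv(64, "relu", a_hat)(x)
    x = GraphConv(64, "tanh", a_hat)(x)

    x = layers.Flatten()(x)                                      # keep all node embeddings (no global pooling)
    x = layers.Dense(128, activation="relu")(x)
    outputs = [layers.Dense(N, name=name)(x)                     # one regression head per IM, one value per station
               for name in ("pgv", "pga", "sa03", "sa1", "sa3")]
    return Model(inputs=[waveforms, coords], outputs=outputs)


model = build_model(a_hat=np.eye(N, dtype="float32"))            # identity adjacency used only as a placeholder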

3.6 Software and computer

Python was used in combination with Tensorflow and Keras to develop the proposed models. The GCN layer is derived from Spektral. Calculations are done with the support of Numpy and table formatting with Pandas. Furthermore, to reduce the overall training time, the models are trained on a dedicated server with two Intel Xeon CPUs (3.2 GHz), 256 GB RAM and an Nvidia Quadro RTX 6000 (24 GB) GPU. After training, the models are fairly small (around 6 MB) and are deployable, e. g., on standard PC hardware as well as edge computing platforms.

3.7 Model training

Regarding model training, 80% of each dataset was used for training and 20% for testing. The training set was then randomly split via k-fold cross-validation with \(k=5\). The average MAE, MSE and RMSE scores on the test set were taken as the results. This entire procedure was repeated 5 times with different seeds, which generates 5 distinct train-test splits to generalize the results (i. e., the large amplitudes from the less frequent large earthquakes can influence the scores if more of them are assigned to a test set).
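
A sketch of this evaluation protocol is given below; build_model and the data arrays are placeholders, and early stopping as well as the additional MAE/RMSE metrics are omitted for brevity.

import numpy as np
from sklearn.model_selection import KFold, train_test_split

def evaluate(build_model, X, Y, seeds=(0, 1, 2, 3, 4)):
    """Repeat the 80/20 split for several seeds and run 5-fold CV on the training portion."""
    test_scores = []
    for seed in seeds:
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=seed)
        for fold_train, fold_val in KFold(n_splits=5, shuffle=True, random_state=seed).split(X_train):
            model = build_model()
            model.compile(optimizer="rmsprop", loss="mse")
            model.fit(X_train[fold_train], Y_train[fold_train],
                      validation_data=(X_train[fold_val], Y_train[fold_val]),
                      batch_size=20, epochs=100, verbose=0)
            pred = model.predict(X_test)
            test_scores.append(np.mean((Y_test - pred) ** 2))   # MSE on the held-out test set
    return float(np.mean(test_scores))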

The model used a batch size of 20 and 100 training epochs with early stopping (patience of 10). The same optimizer as in our baseline [15] was used, namely RMSprop with mostly standard settings [44]. Lastly, the MSE (mean squared error) was used as the loss function when training the models:

$$\begin{aligned} \mathrm {MSE}=\frac{1}{n} \sum _{i=1}^{n}\left( y_{i}-{\hat{y}}_{i}\right) ^{2}, \end{aligned}$$
(4)

where \(y_i\) are the actual values and \(\hat{y_i}\) the predictions; the MSE penalizes larger errors more than, e. g., the mean absolute error.

3.8 Baseline models

We compare our work to the model presented by [15]. This baseline model was given the exact same input data except for the graph, i. e., both models received the time series per station for every earthquake combined with the latitude and longitude node features of every station, which was proven effective in [16]. In addition, we also examine the performance of GAT layers [17] and an adjusted version of [14] for our task. The GAT layers are set to 8 channels and 8 attention heads in order to match the number of parameters of our model; all other hyperparameters were unchanged. To adapt the model of [14] to our task, the last layers used for classification were altered for regression, a weighted initial adjacency matrix was supplied, and node features were added.

Table 1 Overview of possible parameter settings used for the grid search optimization of the ML models

In addition, our proposed model is also compared with traditional machine learning (ML) algorithms: k-nearest neighbors (K-NN), extreme gradient boosting (XGBoost), random forest (RF) and support vector machines (SVM). Since these models are not designed to process multidimensional data, features were calculated from both the time and the frequency domain. These features are derived from several studies [45, 46]. From the time domain, the mean, standard deviation, variance, median, minimum, maximum and range (maximum–minimum) are used. From the frequency domain, the signal energy \(E =\sum ({\text {fft}}\,x_{i})^2\) and the signal power \(P =\frac{1}{t}\sum ({\text {fft}}\,x_{i})^2\) (the energy per sample over the signal length t) were used. For each of the ML models, grid search optimization in combination with fivefold cross-validation was used to assess which models performed best. Table 1 describes all options of the grid search optimization.
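
As an illustration of this feature extraction step, the sketch below computes the listed features for a single waveform, interpreting the spectral terms as magnitudes of the FFT and the power as the energy per sample; the grid in the usage comment is a placeholder, the actual search space is given in Table 1.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def handcrafted_features(x):
    """Time- and frequency-domain features for one waveform x (1D array)."""
    spectrum = np.abs(np.fft.rfft(x))
    energy = np.sum(spectrum ** 2)                       # signal energy E
    power = energy / len(x)                              # signal power P (energy per sample)
    return np.array([x.mean(), x.std(), x.var(), np.median(x),
                     x.min(), x.max(), x.max() - x.min(), energy, power])

# Usage: one feature row per (station, earthquake) waveform, one target value per row
# F = np.vstack([handcrafted_features(w) for w in waveforms])
# grid = GridSearchCV(RandomForestRegressor(), {"n_estimators": [100, 300]}, cv=5)
# grid.fit(F, y)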

Lastly, we also compare the best performing deep learning models for each network without added node features, to examine their effect on performance. These results are reported in the ablation study in Sect. 4.2, where the impact of the spatial information can be observed.

4 Results

In this section, the results from the multivariate regression task are shown. In addition, our model is tested on different input window lengths to further complicate the task (Fig. 6).

4.1 IM prediction

The results of the IM prediction are shown in Table 2 and Figs. 7 and 8. All algorithms perform better on the CI network than on the CW network. Such behavior is expected since the CW dataset contains fewer earthquakes (266 against 915) than the CI dataset and, in contrast with CI, their spatial distribution is sparse. In addition, the CW network covers a larger area, with greater distances between the stations, a larger depth range of the hypocenters and a larger variability in the geological settings. As an extra test, 266 samples were taken from the 915 earthquakes in the CI dataset to mimic the conditions of the CW dataset while preserving the densely located earthquakes characteristic of the CI dataset. In general, this resulted in an increase in MSE of approximately 30% for our model (TISER-GCN) and approximately 42% for the CNN model from [15], showing the importance of having enough samples for learning.

Table 2 Individual results of each IM metric for the support vector machine (SVM), K-nearest neighbors (KNN), XGBoost, random forest (RF), the CNN model from [15], GAT layers, adjusted implementation of [14] and our proposed model (TISER-GCN)
Fig. 6 MSE at different input window lengths for each model in both networks. Read the figure from left (initial 10 s window from Sect. 4.1) to right. Our model is capable of achieving approximately the same MSE scores when given half the input, highlighting the power of GCNs when processing spatial information

When examining the individual performance of the models, TISER-GCN outperforms the best performing baselines (the model from [14] on the CI network and the CNN model from [15] on the CW network, respectively) by a large margin on each of the five ground motion metrics. This gain is especially visible for the PGA and SA(1 s) metrics. Our model improves by an average of 7% on MAE, 16.1% on MSE and 8.3% on RMSE compared to the best baseline for the CI network. Considering the CW network, an improvement of 9.1% on MAE, 16.5% on MSE and 8.4% on RMSE is achieved.

Lastly, it is interesting to see the relatively weak performance of the GAT-based model. A possible explanation could be that the explicitly defined spatial information in the graphs (the distances between the stations), which the GAT layers have to infer themselves via self-attention, is crucial to make sense of the time series data. Perhaps the time series features are too complex for the GAT layers to learn such representations, especially in the CW network.

Considering the results of the models on each individual earthquake and each IM metric, Fig. 7 shows the observed versus the predicted IM values of the CW network for both [15] and TISER-GCN as residual plots. The blue lines (and residuals) show the performance of the CNN model [15] and the green lines (and residuals) that of TISER-GCN. The lines were calculated with an ordinary least squares fit to better visualize the difference in prediction bias between the two models. We observe that the better performance of our model also comes with a slight reduction of the bias for large IM values (less underestimation), which is of great value for seismological applications (for more information, see the original CNN model [15]).

Fig. 7 Residual plots of the predicted against the true 5 IMs of the CW dataset (displayed in logarithmic (\(\log _{10}\)) form). The blue line and points display the results of the CNN from [15], the green line and points those of TISER-GCN. The dotted black line represents a perfect prediction. The units are m/s for PGV and \(\text {m/s}^{2}\) for PGA and SA

Lastly, the characteristics of almost all deep learning models are highly similar. The models do not deviate in ms/step per epoch. However, there is a small decrease in the number of parameters, since our TISER-GCN model has 6.7% fewer parameters than the CNN model from [15]. The GAT model has the same number of parameters as our model, and the adapted GCN model from [14] already has fewer parameters in its original form.

4.1.1 Variation in different window lengths

Section 4.1 shows the results based on a 10 s input window length. However, it is interesting to investigate the results of both the CNN and the best performing GNN-based models when this window length is reduced, since smaller window lengths could translate into earlier responses. However, a smaller input length creates a more difficult setting, since fewer stations have received enough information to predict the IMs at farther away stations. Therefore, all window lengths between 10 and 4 s (it is not reasonable to reduce the input window further for this specific application) and their corresponding MSE scores are displayed in Fig. 6.

We find that we can halve the entire input window while still achieving performance similar to the best performing baselines. This reduction also shrinks the model size from 1.26 million parameters to around 700 thousand (\(-44\)%).

Overall, these results highlight the power of the GCN layers and our implementation. Since GCN layers are designed to perform node feature sharing in their convolution procedure, plenty of information can still be transferred between the nodes even when only half the input is provided. In the context of analyzing streaming time series data, these benefits are crucial [47], since requiring less input translates into earlier responses.

Fig. 8 MSE scores for each metric, where the blue bar represents the original CNN model by [15] (our initial baseline) and the green bars our proposed model

Table 3 Ablation results of the minimum distance cutoff hyperparameter k for both networks

4.2 Experimentation: ablation study

4.2.1 Tuning hyperparameter k

The graph creation procedure described in Sect. 3 uses the hyperparameter k to cut off connections between stations that are too far away. k falls between 0 and 1, where 0 means that no edges are filtered and 1 means that all edges are filtered. Table 3 presents the results of our model when tuning this parameter on both networks, together with the distance at which the cutoff occurs. If the MSE scores for two values of k are highly similar, one would choose the higher k over the lower one, due to the computational advantages of sparser graphs.

While in the CI network the results are more promising for lower values of k, different results are visible in the CW network. An explanation can be found in the very different characteristics (see Sect. 3.1) of the two networks. That is, a too low cutoff has a bigger effect on a widely spaced network (i. e., CW) than on a more concentrated network (i. e., CI). In practice, having more edges than necessary in a widely spread network (the CW network) apparently confuses the GCN layers in this experiment, perhaps because stations in less dense networks start to show different behavior sooner than in more densely spaced networks (e. g., the CI network). These results highlight the importance of this preprocessing step when designing graphs. However, k is an easily interpretable hyperparameter that, once determined, can help GCNs achieve good performance.

4.2.2 Effect of node metadata

The impact of the node features can be examined by inspecting the results when no features (latitude, longitude) are supplied. Without node features, which have proven crucial in our difficult prediction task [16], the CNN from [15] and TISER-GCN score approximately equally. Both have an MSE of 0.26 on the CI network, while the GCN from [14] achieves a promising 0.24 MSE. On the CW network, [15] scores 0.50, TISER-GCN 0.51, and the GCN from [14] 0.37. These scores show how difficult the task is without the spatial information from the stations, confirming the insights from [16]. In addition, they show the promising results of the model from [14] when no metadata are added.

However, once the features are added, changes in performance become visible. The model from [15] improves by 5% on the CI network and 26% on the CW network, whereas our model improves by 24% on the CI network and 41% on the CW network. In contrast, the GCN from [14] does not improve at all, staying at 0.24 and 0.37 MSE, respectively. Therefore, our TISER-GCN appears to be more capable of using this spatial information to improve the learned feature representation, combining the graph information with the time series data while exploiting the information contained in the spatial metadata exceptionally well in the graph convolutional layers. However, we want to emphasize that the model from [14] was not originally optimized for this task, but for earthquake classification.

5 Conclusions

In this work, we presented multivariate time series regression with graph neural networks. Our method (TISER-GCN) leverages the features from convolutional layers as node features in a GCN in a unique way. The proposed model is tested on two seismic datasets with different characteristics, demonstrating its generalizability. Our model outperforms the best performing baselines by 16.1% on average on the CI dataset and 16.5% on the CW dataset in terms of MSE. Besides the original baseline of [15], the adjusted version of [14] and the GAT layers [17] also showed weaker performance compared to our model. The experiments demonstrate the power of TISER-GCN, i. e., of our proposed GCN model, when processing spatial information, since all tested deep learning models were provided with the same input. More importantly, especially in the context of early warning systems, our model can match the performance of the baselines using half of the input window length on both datasets. Such a reduction can help with faster earthquake early warnings (half the input means a faster response).

One important message we want to emphasize is that the architecture of the original CNN model by [15] could also be improved from a CNN-only perspective. However, to make the comparison fairer, and to observe more directly the actual effect of the GCN layers on the prediction results, we decided to alter the model architecture as little as possible in order to demonstrate the difference in performance between a classical CNN approach and our proposed model.

For future research, other methods of creating the initial adjacency matrix will be investigated, because the cutoff parameter experiments in the ablation study reveal the effect the graph creation steps have on the performance in the CW network. Examples include keeping the top-k edges for each node or using exponentially decaying functions as in [48]. Furthermore, other node features could be added to the node feature vector to improve predictions. For example, the angle between two stations could be added as an edge feature, or the absolute distance between two stations could be used. Here, explanation techniques [49, 50] could also yield interesting insights to guide feature construction and modeling.

In addition, we plan to test our model on other (types of) datasets. For example, both networks now consist of 39 nodes; it would be interesting to see how our architecture scales to datasets featuring 200 or more nodes. Also, to test the transfer learning capabilities of our model, adding the spatial information to a model pre-trained on one dataset could perhaps make it more adaptable to other networks with different data characteristics.

Lastly, our architecture was built to be easily adaptable to other tasks (involving regression or classification). Inspiration can be taken from the examples in [7]; however, there could also be many other types of datasets where scalar or vector quantities, associated either with a single node or with the entire graph, are to be predicted. Therefore, we encourage readers to take Fig. 4 as a departure point for other analysis tasks.