1 Introduction

The ubiquity of sensors [1] has led to a vast increase in available data, enabling comprehensive data collection on sensor networks [2], ultimately leading to complex spatiotemporal data. In such cases, sequential (e.g., time series) information is attributed with location information, forming relationships in the data.

For analyzing these relationships, graph neural networks (GNNs) provide sophisticated modeling options. For instance, in contrast to traditional methods for time series analysis such as the autoregressive integrated moving average (ARIMA) model, which makes strong assumptions such as stationarity of the process [3], GNNs enable the modeling of nonlinear dependencies and complex structure, including spatial information. Thus, GNNs, as specialized deep learning models, can better capture the dynamics of time series data.

GNNs are a type of neural network that operate on graph data structures, where nodes represent data points and edges represent relationships between them. GNNs are capable of learning representations of graphs, which can capture complex patterns of data and make predictions based on those patterns. This has made them a promising tool for time series analysis, as they can naturally handle irregularly spaced time series data, capture long-range dependencies, and handle missing data.

One of the key advantages of GNNs over traditional neural networks is their ability to incorporate domain knowledge into the model through the graph structure. Sometimes this information is already given [4], such as a graph representing a metro network. Here, the connections are part of the initial problem formulation, and therefore it is clear which objects or data points are neighbors.

However, not every situation is so clear. Imagine a set of sensors that are placed in a city to measure humidity. The problem here is deciding which sensors should share an edge [5,6,7]. Should only directly neighboring sensors have an edge? Or should all sensors in the city have an edge with each other? Connecting all sensors seems like a reasonable choice; however, the number of edges grows quadratically with the number of nodes, so even a relatively small sensor set quickly yields a computationally heavy graph. Therefore, defining a sparser but still meaningful edge set remains a subject of discussion (see Fig. 1 for a visualization of the problem).

The unfortunate truth is that data (associated with the nodes of a graph, whether location data or time series data) typically exhibit one or more of the following characteristics, making graph construction difficult in most cases [4]. The information may, for instance, contain many dimensions, which would subject it to the curse of dimensionality. There is also a chance that noise will degrade the data quality. This is particularly important if the graph must be built from the signal data alone with no knowledge of the node locations.

Therefore, in this paper, we explore various techniques to create graphs from a set of points in space, and test them with GNNs on time series analysis tasks. These techniques construct a graph from either location information or from the sensor data itself, since the underlying phenomenon can sometimes be more informative than the geographical position of the nodes in a network. The datasets used in this paper are examples of spatiotemporal data [8]. They are called spatiotemporal, as they “live” in the n-dimensional space formed by their spatial, temporal, and (multi-)attribute dimensions [9]. Common examples of this kind of data are day-to-day temperature measurements in a city over an extended period [10], or a collection of seismic signals from distributed sensor networks [11], which are of complex nature. Our contributions are as follows:

  1. We examine and evaluate the properties of the graphs produced by each construction algorithm.

  2. We perform a variability analysis to investigate where the graph was most crucial.

  3. We provide general insights into which techniques make the most sense to use (best-case analysis), in addition to reporting the best-scoring combinations, which might only work in a narrow use case.

The rest of the paper is structured as follows: related work is discussed in Sect. 2. Afterward, the general goal, the applied algorithms, the datasets and the models are described in Sect. 3. Section 4 describes and discusses the results. Finally, Sect. 5 concludes with a summary.

Fig. 1

The two basic approaches for graph creation investigated in this study: (a) location-based techniques use the (spatial) location information of nodes to define edges, and (b) signal-driven techniques rely on the (signal) data originating from the nodes in order to derive edges between node pairs

2 Background and related work

In this section, we first formally introduce basic notions about graphs and related concepts. Then, we provide an introduction to the field of GNNs, where we also discuss related approaches and architectures that have been proposed in this field. Finally, we discuss approaches for graph construction.

2.1 Graphs

A graph \(G=(V,E)\) consists of a set of nodes/vertices V and a set of edges \(E \subseteq V \times V\) (the connections between the nodes). A common way to represent a graph is with an adjacency matrix, which is an \(n \times n\) matrix, with \(n = |V|\). The (i, j) entry of the matrix, \(A_{i,j}\), indicates whether there is an edge between node i and node j. The weight of an edge can be a binary value (0 or 1) for an unweighted graph, or a real number for a weighted graph, which can be represented by the respective entry in the adjacency matrix.

Edges can be directed or undirected. Directed edges carry information with a direction, e.g., a road can be a one-way street. Undirected edges have no direction; an example is the geographical distance between two weather stations. Formally, a directed edge \(e \in E\) between nodes \(u \in V, v \in V\) is represented by an ordered tuple (u, v), whereas an undirected edge is given by the unordered variant \(\{u, v\}\). The number of neighbors of a node \(v_i\) is known as its degree and is denoted by \(d_{i}=\sum _j A_{ij}\).
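For illustration, a minimal numpy sketch of an adjacency matrix and the resulting node degrees (the graph and its weights are purely illustrative):

```python
import numpy as np

# Minimal sketch: a weighted adjacency matrix for a small undirected graph.
# A[i, j] holds the edge weight between node i and node j (0 means "no edge");
# symmetry (A[i, j] == A[j, i]) encodes undirectedness.
n = 4
A = np.zeros((n, n))
A[0, 1] = A[1, 0] = 1.0   # unweighted-style edge between nodes 0 and 1
A[1, 2] = A[2, 1] = 0.5   # weighted edge between nodes 1 and 2

degrees = A.sum(axis=1)   # d_i = sum_j A_ij
```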

Nodes and edges can also have a feature vector \(a=(a_{1}, a_{2}, \ldots , a_{n})\). These features could be informative about the characteristics of a node/edge. For example, node features could contain traffic speed measurements, and edge features could indicate the type of edge that exists between two nodes (which could be more than 1). Whenever a graph contains features (i.e., feature vectors) assigned to nodes and/or edges, we call it an attributed graph. Many graph-based methods rely on these attributes, e.g., GNNs.

2.2 Graph neural networks

The first GNNs were proposed in the early 2000s, but they were limited in their expressive power and scalability [12]. More recently, a new generation of GNNs has emerged that uses neural network layers to compute the message passing function. Popular architectures include the graph convolutional network (GCN) [13], graph attention networks (GAT) [14] and GraphSAGE [15]. The choice of method depends on the specific characteristics of the data and the task at hand. For example, GraphSAGE was designed to be scalable through neighbor subsampling.

Essentially, the basic idea behind GNNs is to use message passing to propagate information between nodes in the graph. Each node receives messages from its neighbors, updates its own representation, and passes messages back to its neighbors. The messages are typically computed based on the node and edge features, and the updated node representations are aggregated to form a graph-level embedding. More specifically, the information that is related to node v stored in the feature vector \({\textbf{h}}_{v}\) is updated by combining the features of its neighbors. After l iterations, the vector \({\textbf{h}}_v^l\) will embed all structural information and node features in the l-step neighborhood of node v. The output of the l-th layer in a GNN is then typically defined as:

$$\begin{aligned} {\textbf{h}}_v^l=\text {C}^{(l)}\left( {\textbf{h}}_v^{l-1}, \text {A}^{(l)}\left( \left\{ {\textbf{h}}_u^{l-1}, u \in {\mathcal {N}}[v]\right\} \right) \right) \end{aligned}$$
(1)

where \(\text {A}^{(l)}\) is a function that aggregates the node features from the neighborhood of v, and \(\text {C}^{(l)}\) is a function that combines a node's own features with those of its neighbors [16]. Afterward, a readout function R transforms the feature vectors into the final output:

$$\begin{aligned} \sigma = \text {R}\left( \{{\textbf{h}}_v^l,v \in {\mathcal {V}} \} \right) , \end{aligned}$$
(2)

which after enough iterations can be used for various downstream tasks, such as time series analysis, where GNNs have demonstrated promising results. For example, GNNs have been successfully used for traffic forecasting [17], air quality data imputation [18], seismic analysis [19], optimal sensor placement analysis [20], anomaly detection in soil moisture data from satellites [21] and general multivariate time series classification [22].
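To make Eqs. (1) and (2) concrete, the following minimal numpy sketch implements one message-passing layer with mean aggregation and a simple mean readout; the linear maps and the ReLU nonlinearity are illustrative choices, not the formulation of any specific architecture from the literature:

```python
import numpy as np

def gnn_layer(H, A, W_self, W_neigh):
    """One message-passing step: aggregate neighbor features (A^(l)) by a mean,
    then combine them with the node's own features (C^(l)) via linear maps + ReLU."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # avoid division by zero
    H_neigh = (A @ H) / deg                              # mean over the neighborhood N(v)
    return np.maximum(H @ W_self + H_neigh @ W_neigh, 0.0)

def readout(H):
    """Simple readout R: mean over all node embeddings (Eq. 2)."""
    return H.mean(axis=0)
```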

However, one of the key challenges in using GNNs is how to construct the graph itself. Previous research showed the drastic influence of the choice of neighborhood aggregation technique on graph-related tasks, such as clustering [23]. Also, the problem of constructing the graph is often ill-posed, since no ground truth is available. Therefore, there are various techniques for constructing the graph, each with its own strengths and weaknesses.

2.3 Graph construction

There are a variety of methods for constructing graphs depending on information assigned to the nodes, i.e., for constructing the appropriate connection between those nodes, e.g., [24,25,26].

Typically, the methods to connect nodes in N-dimensional space are split into two groups, depending on which information is used. The first group consists of techniques that use the location of points, i.e., the spatial information, to construct the graph. The second group consists of techniques that learn the graph from the signal data of the respective nodes. Essentially, both groups typically use distance (or similarity) estimates to derive edges between nodes; however, they do so from different perspectives on the given data, taking either spatial or temporal information into account. This is depicted in Fig. 1.

2.3.1 Location-based techniques

Location-based techniques use the spatial information (latitude and longitude) of the nodes for deriving edges between pairs of nodes. The applied algorithms can be characterized by the number of parameters needed [4]. There are parameter-free techniques, such as relative neighborhood and Gabriel graphs, which create graphs with relatively few edges. However, in many cases it is desirable to have a higher level of interconnectivity between localities so that they can both receive input from and contribute output to nearby areas. Techniques with only one parameter include K-NN and q-neighborhood graphs, the latter of which only includes an edge if two samples are within a distance q of each other [27]. Lastly, there are also techniques that require two or more parameters, such as DBSCAN, which needs the parameters MinPoints and eps to connect nodes. Out of the techniques in this first group (that use location data), the K-NN graph remains the most common approach, since it adapts better to larger-scale graphs and larger datasets, whereas an improper threshold value in the q-neighborhood graph could result in disconnected components or subgraphs, or even isolated nodes.

There are also location-based techniques that require a pre-calculated distance matrix, in order to apply a post-hoc processing technique to create the graph. A famous example of such a technique is a thresholded Gaussian kernel [28, 29]. Here, the edge weight between two connected samples \(v_{i}\) and \(v_{j}\) is computed as:

$$\begin{aligned} d(v_{i},v_{j}) = {\text {exp} \left( -\frac{\text {dist}(v_{i},v_{j})^{2}}{\sigma ^{2}} \right) } \end{aligned}$$
(3)

where the function dist(\(v_{i},v_{j}\)) evaluates the distance between two nodes and \(\sigma \) is the kernel bandwidth parameter, for which the standard deviation of the distances is often used.

2.3.2 Signal-driven techniques

The second group of graph construction techniques relies on the signal information assigned to the nodes of the graph to infer the graph structure, i.e., the set of edges. The clearest example involves calculating the correlation between the signals of nodes in a graph, as performed in [30], in order to create edges between nodes for which the correlation between the respective time series is strong. The correlation-based method uses correlation coefficients to measure the strength of the relationship between two time series. The graph is constructed by thresholding the correlation coefficients, where edges are only included if the correlation coefficient is above a certain threshold.

Notice that many of the techniques discussed so far focus on how to sparsify a graph, which is very important since sparse graphs result in better scalability and easier interpretation [31]. Scalability in particular is an important consideration when dealing with large graphs, and sparsity can play a critical role in enabling GNNs to scale to such graphs, because sparse graphs reduce the number of computations required by GNNs.

There is also a new class of techniques that can automatically learn the graph structure with, e.g., a latent correlation layer [17] or using the node embeddings [32]. In addition, Graph Attention Networks (GATs) have been developed that aim to learn additional edge weights for an initial binary adjacency matrix as input, which benefit the subsequent message passing operations [14]. However, these techniques either modify or learn the graph during training, and therefore cannot be seen as standalone graph construction techniques. Therefore, these techniques require very specific model architectures, making them hard to compare to other graph construction techniques.

In addition, previous work often considers the problem as a graph-denoising [33, 34] or graph-completion problem [35]. In the former, the graph might contain too many edges and the noisy edges should be removed. In the latter, a graph might be incomplete, and the missing edges should be inferred. These use-cases are examples of semi-supervised learning, which is not the focus of this work. In our context, the underlying graph is never available, and should be inferred completely from the information assigned to the nodes. Therefore, this relates either to the spatial information about node locations (location-based methods) or to the time series data of a node (signal-driven techniques). Thus, in contrast to other approaches, this rules out all the aforementioned automatic techniques, which need at least some prior information (existing edges) to work with.

2.3.3 Summary

Overall, despite the growing interest in GNNs and time series analysis, there is limited research on the comparative performance of the different techniques to construct the graph. However, this is a crucial aspect, since the accuracy and effectiveness of GNNs heavily rely on the quality of the graph structure [5]. Therefore, insights and guidelines about potentially effective graph construction methods and approaches are specifically relevant for an appropriate application of GNNs, in particular in the context of complex spatiotemporal data. This is the motivation for the analysis done in this paper. In particular, we compare various techniques to construct graphs from either location data or signal data and estimate their respective performance and impact.

Another potential reason for the lack of research on this topic is the relative newness of GNNs and their application in time series analysis. While GNNs have shown promising results in various domains, their performance in time series analysis is still an active area of research.

Our results provide insights into the strengths and weaknesses of different techniques for constructing graphs for time series data, and help guide the selection of appropriate methods for different applications. In the next section, we introduce each technique and explain their parameters and (dis)advantages.

3 Methods

In this section, we first introduce and explore a large set of techniques to generate graphs from geographically grounded nodes with assigned feature vectors. For that, we consider both location-based and signal-driven techniques. Each will be discussed individually in terms of algorithm formalization as well as its advantages and disadvantages (e.g., complexity).

Fig. 2

Three different graphs constructed using the CW dataset

3.1 General goal

The task in this paper is to learn a graph \(G=(V,E)\) where V are the nodes (objects/sensors) and E are the edges (connections between them) from location-data or signal data itself [6, 7, 36]. While this task shares a conceptual connection with related problems like graph generation (which focuses on creating unique and varied graphs [37, 38]), and graph completion (where missing edges have to be inferred [35]), in graph construction, the entire unknown graph is to be constructed according to a given approach. In our context, it is supposed to be learned from either location or signal data from the set of nodes.

Graph construction can be seen as an ill-posed problem, as there are multiple ways to accomplish the task and no definitive ground truth exists. An example is given in Fig. 1, where the edges are calculated from either location information or signal information. If the technique relies on the location information of the nodes, then the goal is: given a set of longitude and latitude locations of objects/sensors X, find a way to create edges/connections given certain criteria. However, if the technique relies on signal data, i.e., M observations of the N nodes (the data entities) represented in a data matrix \(X\in {\mathbb {R}}^{N\times M}\), each column of this matrix X contains a graph signal, and the graph is supposed to be learned from this data.

Both directions result in different graphs with varying characteristics. For example, Fig. 2 shows three versions of a graph constructed from a seismic sensor network called CW, which will be introduced in more detail in Sect. 3.5.1 (see also Table 2). Figure 2a, b shows techniques that create comparable graphs, but (c) shows a completely different graph, constructed by the clustering algorithm k-means, which only creates connections between nodes in the same cluster. Here, we can already observe quite different graph structures, also in terms of their properties: Fig. 2c even produces a disconnected graph, while the graphs in Fig. 2a, b differ in terms of their density, i.e., Fig. 2a shows a much sparser graph.

3.2 Location-based techniques

Location-based techniques rely on the location information of nodes for graph construction. This study used the latitude and longitude of stations for this procedure, and distances were calculated with measures that take the curvature of the earth into account (e.g., haversine or geodesic distance). The implemented location-based graph construction techniques are: MinMax, thresholded Gaussian kernel, K-NN (unweighted and weighted), k-means, OPTICS, relative neighborhood and Gabriel graphs.

3.2.1 Reversed MinMax

As performed in [19], the reversed MinMax technique takes the absolute distance between each pair of nodes and returns the reversed MinMax-scaled value thresholded by k. It is our own implementation of the well-known q-neighborhood class of graph construction techniques. In q-neighborhood algorithms, the cutoff distance is always on the scale of the dataset. In this version, we rescale the distances first with min-max scaling, so that the cutoff parameter k is interpretable within [0, 1]. The resulting graph consists of all the edges that are within the q-neighborhood of each node (see Algorithm 1).

The advantage of reversed MinMax is its easy-to-tune and interpretable hyperparameter k. However, a disadvantage of reversed MinMax is its computational complexity of \({\mathcal {O}}(n^{2})\), which does not easily scale to larger graphs.
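A minimal numpy sketch of this construction, assuming a precomputed pairwise distance matrix (a plausible reading of Algorithm 1, not necessarily the authors' exact implementation):

```python
import numpy as np

def reversed_minmax_graph(D, k):
    """D: (n, n) pairwise distance matrix; k in [0, 1] is the cutoff.
    Distances are min-max scaled and reversed, so closer nodes get weights near 1;
    weights below k are dropped."""
    n = len(D)
    off_diag = D[~np.eye(n, dtype=bool)]
    scaled = (D - off_diag.min()) / (off_diag.max() - off_diag.min())
    W = 1.0 - scaled                 # reverse: small distance -> large weight
    W[W < k] = 0.0                   # sparsify with the cutoff k
    np.fill_diagonal(W, 0.0)         # no self-loops
    return W
```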

Algorithm 1

Reversed MinMax

3.2.2 Thresholded Gaussian kernel

A Gaussian kernel is a mathematical function that describes a bell-shaped curve. It is commonly used in signal and image processing, for tasks such as smoothing, blurring, and feature extraction. The thresholded Gaussian kernel function can be used to compute a weight or strength of the connection between nodes, based on their relative distance to each other [28, 29]. The larger the value of \(\sigma \), the wider the curve, and the more nodes will be connected. The smaller the value of \(\sigma \), the narrower the curve, and the fewer nodes will be connected. In this study, the standard deviation of all distances \(d > 0\) was taken as \(\sigma \) (see Algorithm 2). Lastly, the parameter k acts as a threshold to sparsify the graph.

The advantage of using a thresholded Gaussian kernel is its easy-to-tune and interpretable hyperparameter k, similar to reversed MinMax. One downside is the time complexity of \({\mathcal {O}}(n^{2})\), since all the pairwise distances have to be calculated.
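A minimal numpy sketch of Eq. (3) combined with the thresholding described above (a plausible reading of Algorithm 2, assuming a precomputed distance matrix):

```python
import numpy as np

def thresholded_gaussian_graph(D, k):
    """D: (n, n) pairwise distance matrix; sigma is set to the standard deviation
    of the nonzero distances, as described above; weights below k are dropped."""
    sigma = D[D > 0].std()
    W = np.exp(-(D ** 2) / (sigma ** 2))
    W[W < k] = 0.0
    np.fill_diagonal(W, 0.0)
    return W
```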

Algorithm 2

Thresholded Gaussian Kernel

3.2.3 K-NN (un)weighted

The K-NN algorithm can be used to define an edge set from the K-nearest neighbors of each node in a graph (see Algorithm 3). This technique is one of the most popular options in the literature [27]. In the unweighted version, every edge is assigned a weight of 1, whereas the actual distance value is used in the weighted version. K-NN provides an intuitive parameter (k) that directly limits the number of edges in the graph. In addition, with the help of K-D tree or ball tree structures, it can be calculated very efficiently (\({\mathcal {O}}(n\log {n})\)) [39, 40]. However, a limitation of always connecting each node to its k closest neighbors is that an edge will be created regardless of the distance. Therefore, K-NN (especially the unweighted version) is sensitive to outliers.
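A short sketch using scikit-learn's `kneighbors_graph`; plain Euclidean distances on the coordinates are assumed here for brevity, whereas the paper uses great-circle distances:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

coords = np.random.rand(10, 2)  # stand-in for (latitude, longitude) pairs

# Unweighted K-NN graph: binary adjacency, each node linked to its 3 nearest neighbors.
A_unweighted = kneighbors_graph(coords, n_neighbors=3, mode="connectivity")

# Weighted version: edge entries hold the actual distances instead of 1s.
A_weighted = kneighbors_graph(coords, n_neighbors=3, mode="distance")
```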

Algorithm 3

K-NN-(Un)weighted

3.2.4 Relative neighborhood graph

The relative neighborhood graph (RNG) [41] is constructed by connecting two points \(v_{i}\) and \(v_{j}\), provided that there is no other point \(v_{k}\) that is closer to both \(v_{i}\) and \(v_{j}\) than they are to each other (see Algorithm 4). In other words, if two close neighbors share a common closer neighbor, then no edge is drawn between them. While it is typically applied for computing minimum spanning trees or creating planar graphs (graphs with no crossing edges, which is helpful in road design tasks such as urban road planning [42]), it is used in this study because it has zero parameters to tune, which gives it an advantage in graph construction. In addition, it has a time complexity of \({\mathcal {O}}(n \log n)\) [43].
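A naive \({\mathcal {O}}(n^{3})\) sketch of this definition (the cited \({\mathcal {O}}(n \log n)\) constructions are more involved):

```python
import numpy as np
from scipy.spatial.distance import cdist

def relative_neighborhood_graph(points):
    """Connect i and j unless some other point k is closer to both of them
    than they are to each other, i.e., max(d(i,k), d(j,k)) < d(i,j)."""
    D = cdist(points, points)
    n = len(points)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            blocked = any(max(D[i, k], D[j, k]) < D[i, j]
                          for k in range(n) if k != i and k != j)
            if not blocked:
                A[i, j] = A[j, i] = 1
    return A
```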

Algorithm 4

Relative Neighborhood Graph

3.2.5 Gabriel graph

The Gabriel graph is a geometric proximity graph that is often used in spatial data analysis [44, 45]. More formally, it is the undirected graph G with vertex set S in which any two distinct points \(v_{i} \in S\) and \(v_{j} \in S\) are adjacent precisely when the closed disk (the region in a plane bounded by a circle) having \(v_{i}v_{j}\) as a diameter contains no other points (see Algorithm 5). There are multiple implementations available, ranging from \({\mathcal {O}}(n^{3})\) to \({\mathcal {O}}(n \log n)\) [46]. The constructed graphs will have few edges, but the advantages of this technique are that it is parameter-free and that it creates a planar graph (no crossing edges in the visualization).
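A naive \({\mathcal {O}}(n^{3})\) sketch of this definition (the referenced implementations are asymptotically faster):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gabriel_graph(points):
    """Connect i and j unless another point lies strictly inside the closed disk
    that has the segment (i, j) as its diameter (d(i,k)^2 + d(j,k)^2 < d(i,j)^2)."""
    D2 = cdist(points, points) ** 2
    n = len(points)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            inside = any(D2[i, k] + D2[j, k] < D2[i, j]
                         for k in range(n) if k != i and k != j)
            if not inside:
                A[i, j] = A[j, i] = 1
    return A
```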

Algorithm 5

Gabriel Graph

3.2.6 K-means

K-means is a clustering algorithm. Clustering algorithms fall into two categories, i.e., partitional and hierarchical [47]. Hierarchical algorithms locate nested clusters either top-down or bottom-up. Instead of imposing a hierarchical structure, k-means (a partitional algorithm) simultaneously locates all the clusters as a partition of the data.

Clustering data points using the k-means algorithm [48] involves the partitioning of samples into N groups of equal variance, while minimizing the within-cluster sum-of-squares criterion (see Algorithm 6). The computational complexity of k-means is \({\mathcal {O}}(ktn)\), where k is the number of clusters, t is the number of data points, and n is the number of iterations needed for convergence. However, most often not too many iterations are necessary to reach convergence, making k-means a relatively fast clustering algorithm.

A key advantage of k-means is the intuitive parameter to tune, namely the number of desired clusters. This is especially useful if the number of clusters is something that can be inferred from domain knowledge of the problem at hand.

However, this k parameter could also be a disadvantage if the number of clusters is hard to define intuitively. Also, k-means uses the Euclidean distance metric by design [48], so the advantage of calculating distances on a curved surface is not used. Furthermore, the centroid initialization heavily influences the results [49]. Lastly, all points will be added to a cluster, meaning that outlier points will also always be joined.
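A minimal sketch of how a graph can be derived from k-means clusters, assuming (as suggested by Fig. 2c) that all nodes within the same cluster are connected:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def kmeans_graph(coords, n_clusters):
    """Cluster the node coordinates with k-means and connect every pair of nodes
    that falls into the same cluster, yielding disjoint cliques."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(coords)
    n = len(coords)
    A = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:
            A[i, j] = A[j, i] = 1
    return A
```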

Algorithm 6

K-Means

3.2.7 OPTICS

Algorithm 7

Optics

The basic idea of OPTICS [50] is similar to DBSCAN [51] (another popular density-based clustering algorithm), but it addresses one of DBSCAN’s major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the data points are (linearly) ordered such that spatially closest points become neighbors in the ordering. Clusters can then be found by finding regions where the reachability is below a certain threshold. OPTICS thus generates an enhanced ordering of the data that represents the density-based clustering structure, rather than directly producing a clustering (see Algorithm 7). Overall, the time complexity can go up to \({\mathcal {O}}(n^{2})\) [52].

In Algorithm 7, two other functions need to be explained: the core distance (\( core-d \)) is the smallest radius that a particular point must have to be designated as a core point. The core distance of the supplied point is undefined (U) if it is not a core point. The maximum of the core distance of p and the Euclidean distance (or another distance metric) between p and q is the reachability distance (\( reach-d \)) for those points.

The real advantage of OPTICS is that it can detect very different local densities to reveal clusters in different regions of the data space. In other words, it can detect a very dense cluster in one region, as well as relatively less dense clusters in another region of the data. In addition, it achieves this while also reducing the number of parameters to tune, making it an easier choice for most situations. However, OPTICS will likely fail if there is a very grid-like structure in the points (no peaks or drops in density).
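A minimal sketch using scikit-learn's OPTICS implementation, again assuming that nodes within the same extracted cluster are connected and that noise points remain isolated:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import OPTICS

def optics_graph(coords, min_samples=5):
    """Extract OPTICS cluster labels and connect nodes sharing a cluster;
    points labeled -1 (noise) stay isolated."""
    labels = OPTICS(min_samples=min_samples).fit_predict(coords)
    n = len(coords)
    A = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j] and labels[i] != -1:
            A[i, j] = A[j, i] = 1
    return A
```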

3.3 Signal-driven techniques

Signal-driven techniques rely on the signal data of nodes for graph construction (see Fig. 1b). The implemented techniques are: correlation, dynamic time warping (DTW) [53] and maximal information coefficient (MIC) [54]. Due to computational constraints, each of these techniques received 50 random data samples to construct the graph. In addition, for correlation and MIC we consider the top-\(m\) correlation values for each edge, thus taking the strongest correlation values into account. We empirically determined \(m=20\) via grid search in our experiments.

3.3.1 Correlation

Correlation is a statistical measure of the strength and direction of the linear relationship between two continuous variables. Given a pair of time series signals, the Pearson correlation coefficient is calculated as the covariance of the two variables divided by the product of their standard deviations (see Algorithm 8). The correlation coefficients are taken between the signals of every node, and ordered from large to small. The top \(m=20\) correlation scores are taken and averaged, as discussed above. After that, the graph is made more sparse by thresholding. Using correlation has the advantage of being perfectly bounded between \([-1, 1]\), and it can be easily calculated even for long sequence lengths with a time complexity of \({\mathcal {O}}(n^{2})\). Also, correlation is invariant to the scaling of the variables, and reports the direction of the correlation. However, it is also limited to detecting only linear dependencies and disregards various other types of relationships. The disadvantage of this technique is the extra parameter \(k\) that has to be tuned compared to the other techniques.
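A minimal numpy sketch of this procedure (a plausible reading of Algorithm 8: per-window correlations, top-\(m\) averaging, then thresholding with \(k\)); the exact windowing follows the description above rather than a verified reference implementation:

```python
import numpy as np

def correlation_graph(windows, k, m=20):
    """windows: list of (n_nodes, length) arrays, e.g., 50 random data samples.
    For every node pair, the top-m absolute Pearson correlations over the windows
    are averaged; weights below k are dropped."""
    scores = np.stack([np.abs(np.corrcoef(X)) for X in windows])  # (n_windows, n, n)
    W = np.sort(scores, axis=0)[-m:].mean(axis=0)                 # average of the top-m per edge
    W[W < k] = 0.0
    np.fill_diagonal(W, 0.0)
    return W
```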

Algorithm 8

Correlation

3.3.2 Dynamic time warping

Dynamic time warping (DTW) is a distance function that uses a warped path to align two time series and compare them [53]. It is a popular algorithm in the domains of signal processing and pattern recognition. For example, when sequence lengths are relatively small (due to computational complexity), it shows better results in time series classification than correlation measurements [55].

DTW aims to find the optimal alignment between two time series by stretching or compressing them in time. Each point is compared to each point in the other time series, and the alignment of corresponding points that minimizes the cumulative distance between them is searched for. The total distance is the cumulative distance over all observations in the signals (see Algorithm 9).

DTW has several advantages over Euclidean distance or correlation-based techniques. For example, it is more robust to noise and local distortions than Euclidean techniques. It can also handle time series of different lengths. However, the time complexity of DTW is \({\mathcal {O}}(n^{2})\), where \(n\) refers to the length of the input time series. In addition, the distances need to be reverse min-max scaled to \([0,1]\), since DTW has no upper bound.
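A self-contained sketch with the classic dynamic-programming DTW and the reverse min-max scaling described above (dedicated DTW libraries would be considerably faster):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic-programming DTW between two 1-D series."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def dtw_graph(signals, k):
    """Pairwise DTW distances, reverse min-max scaled to [0, 1]
    (small distance -> weight near 1), then thresholded at k."""
    n = len(signals)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dtw_distance(signals[i], signals[j])
    off = D[~np.eye(n, dtype=bool)]
    W = 1.0 - (D - off.min()) / (off.max() - off.min())
    W[W < k] = 0.0
    np.fill_diagonal(W, 0.0)
    return W
```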

Algorithm 9

DTW

3.3.3 Maximal information coefficient

Introduced by [54], the maximal information coefficient (MIC) is a non-parametric measure of similarity between two variables. It is based on the concept of mutual information, which measures how much information about \(X\) can be explained by knowing variable \(Y\). It considers all possible partitions of the data into bins and calculates the mutual information of each partition. In turn, the MIC score is the maximum value over all possible partitions (see Algorithm 10). The average of the top 20 MIC scores of each edge was used to calculate the final graph, as discussed above, i.e., the same \(m=20\) as used for the correlation.

MIC has various desirable properties. For example, it is scale-invariant, meaning that it is not affected by the scale of the variables. It can also detect nonlinear relationships (e.g., cubic or exponential), and it is robust to outliers. It is also symmetric, such that \(\text {MIC}(X,Y)=\text {MIC}(Y,X)\). Lastly, MIC is naturally bounded between \([0, 1]\), making it easy to interpret. However, it does not report the directionality or type of relationship it finds. A disadvantage is the extra parameter that determines the top scores, which has to be tuned. And most importantly, it is computationally expensive (\({\mathcal {O}}(n^{2.4})\)).
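A minimal sketch assuming the third-party `minepy` package for the MIC estimate; for brevity, a single window is used instead of the top-\(m\) averaging over several windows described above:

```python
import numpy as np
from minepy import MINE  # assumption: minepy provides the MIC estimator used here

def mic_graph(signals, k):
    """Pairwise MIC scores between node signals, thresholded at k."""
    n = len(signals)
    W = np.zeros((n, n))
    mine = MINE()
    for i in range(n):
        for j in range(i + 1, n):
            mine.compute_score(signals[i], signals[j])
            W[i, j] = W[j, i] = mine.mic()
    W[W < k] = 0.0
    return W
```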

Algorithm 10

MIC

Table 1 Overview of the applied algorithms and their tunable parameters, data type, possible edge value bounds, time complexity (\({\mathcal {O}}\)), whether a connected graph is created, and the possible parameter ranges (and their step sizes) examined in this study

3.4 Parameter tuning

In the following paragraph, the possible parameter values of each algorithm are discussed and described in Table 1. We perform a grid search of all the possible parameter values and report the best settings in the results, to make the comparison between all algorithms fair.

The first group of algorithms, which includes correlation, DTW, MIC, thresholded Gaussian kernel, and MinMax, all have a parameter range between 0.05 and 0.95 with a step size of 0.05. A higher value means that two data points must be more similar for an edge to be created. The second group of algorithms, which includes k-means, OPTICS and K-NN (un)weighted, has a parameter range between 2 and 40. These parameter values determine the size and shape of the clusters, or the number of neighbors. The last two algorithms, Gabriel and relative neighborhood, are parameter-free.

3.5 Datasets and tasks

The datasets used in this paper are from the Italian seismic network (CI and CW), METR-LA and PEMS-BAY. The datasets differ greatly in their data domain, task, number of nodes, geographical area (\(\text {km}^{2}\)) and sample rate.

Table 2 Dataset characteristics, including the type, task, number of nodes, geographical area expressed in \(\text {km}^{2}\), sample rate and the dimensions of the dataset

In short, Table 2 shows the general characteristics of the datasets in this paper. In total, there are four datasets, of which two are used for regression and two for forecasting. To clarify the difference between the two analysis tasks, Fig. 3 combines two examples. In Fig. 3, the entirety of segment A is used to predict a value for the regression task. In other words, the task is to predict numeric values that depend on the whole segment of the series, rather than the last few observations [59]. For the forecasting task, the red part of segment B is used to directly forecast what will happen in the blue segment.

3.5.1 Regression task

The two regression datasets (CI & CW) recorded by the Italian national seismic network [11] contain sensors that measure seismic waves on 3 channels. Both datasets contain 39 nodes, but they differ in the number of earthquakes, 915 (CI) versus 266 (CW), and in geographical area (see Figure 3 in [19]). The task in this dataset is to take an initial 10 s of input data (1000 data points) of an incoming earthquake from multiple stations, in order to predict the ground shaking intensity at other stations farther away from the earthquake’s epicenter. This 10-s input window makes sure that only the nearby stations had the opportunity to actually measure the earthquake. Five external target values of the input are calculated, which do not necessarily depend on recent values, but rather on the whole length of the time series. This kind of regression task is also referred to as time series extrinsic regression (TSER) [59], of which Fig. 3 (A) is an example.

3.5.2 Forecasting task

For the forecasting task in this paper, data from traffic sensors are used (two well-known benchmark datasets, see [17, 29, 60]). The objective of time series forecasting is to estimate the future values of \(Y\), given the historical values of \(Y\). Therefore, this task is different from the TSER task, as can be seen in Fig. 3.

The time series are divided with a sliding window of a 12-step lookback and the same horizon, with a stride of 1. Therefore, a single sample spans a total of 24 time steps. In the forecasting task, the signal-driven graphs used a signal length of 24, due to computational constraints and to be aligned with the sample span. The average error over all 12 steps ahead is calculated. In other words, a sample of shape \((N, 12)\) is compared to the ground truth of shape \((N, 12)\), where \(N\) refers to the number of nodes in the graph. Naturally, the longer the prediction horizon, the higher the error. Therefore, the first of the 12 predicted steps will have smaller errors than the last.
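A minimal sketch of this sample construction (the array layout is illustrative; the actual pipeline uses the TSL library):

```python
import numpy as np

def sliding_windows(series, lookback=12, horizon=12, stride=1):
    """series: array of shape (T, N). Each sample pairs a (lookback, N) input
    window with the following (horizon, N) target window."""
    inputs, targets = [], []
    for start in range(0, len(series) - lookback - horizon + 1, stride):
        inputs.append(series[start:start + lookback])
        targets.append(series[start + lookback:start + lookback + horizon])
    return np.stack(inputs), np.stack(targets)
```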

The first of the two forecasting datasets is METR-LA, which contains traffic information collected from 207 loop detectors on the highways of Los Angeles County [61]. The sample frequency was once every 5 min, for the period between March 2012 and June 2012. This dataset has a missing value percentage of 8.11%. These missing values are imputed with the last valid observation. The second forecasting dataset is PEMS-BAY, collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PEMS). 325 sensors were selected in the Bay Area, and 6 months of data ranging from January 1, 2017, to May 31, 2017, were used. The traffic speed readings are aggregated in 5-min windows. The dataset has fewer missing values than the METR-LA dataset, only 0.02%. Missing values are imputed in the same way as for METR-LA.

Fig. 3

Difference between (1) time series extrinsic regression (TSER): complete segment A is used for regression, and (2) forecasting: the red part of segment B is used to forecast the blue part

3.6 Models

Here, we introduce the two GNN models used for either regression or forecasting. Their general architectures and graph input are introduced. Afterward, their training procedures and evaluation are discussed.

3.6.1 Regression model

The same model as in [19] is used, which showed high performance on the task (consult [19] for more details). In short, the model uses two 1D convolutional layers (CNNs) (see Fig. 4) that act as feature extractors on the time series data. The output is then reshaped into (Nodes, Features) and node location metadata is concatenated, which was shown to improve performance on the task. Afterward, two GCN layers incorporate cross-station information, and the output is eventually funneled into five dense layers that predict the 5 output metrics.

The model requires a symmetrically normalized Laplacian matrix [13]. A typical Laplacian matrix can lead to vanishing gradient problems if nodes have different levels of connectivity. To circumvent this issue, we normalize the degree matrix symmetrically. Furthermore, we include each node’s unique features by adding the identity matrix to the Laplacian matrix, following the approach outlined in [13].
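A minimal sketch of the standard GCN renormalization from [13] (adding self-loops and symmetrically normalizing), which is one common way to realize the normalization described above:

```python
import numpy as np

def gcn_normalized_adjacency(A):
    """Compute D^{-1/2} (A + I) D^{-1/2}: the self-loops retain each node's own
    features, and symmetric degree normalization stabilizes message passing."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return np.diag(d_inv_sqrt) @ A_hat @ np.diag(d_inv_sqrt)
```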

Fig. 4

Architectures of the regression model (left) and the time series forecasting model (right). \(N\) refers to the number of nodes

3.6.2 Forecasting model

For the forecasting experiments, a time-then-space GNN is used. The model (see Fig. 4) consists of a linear and an RNN encoder and a nonlinear GCN readout. The first block (linear encoder) takes as input the tensor of shape [batch size, sequence length, nodes, input size], next to the adjacency matrix of the graph, and applies a linear transformation to it, resulting in a hidden representation of the inputs. The RNN encoder block outputs a sequence of hidden states, where each hidden state corresponds to one time step of the input tensor. Finally, the GCN decoder block transforms these tensors to the output shape [batch size, horizon, nodes, input size], where the horizon is the forecasting period.

3.7 Training and evaluation

Considering the training split, the regression datasets are first split in an 80–20% manner. Then, the 80% training portion is used for 3-fold cross-validation to generalize the results of the experiment. In each of these folds, the model is trained separately and tested on the unseen test set. The input maximum (i.e., the greatest amplitude detected across all stations during the time window) is used to normalize the data. Lastly, a batch size of 30 is used. For a more detailed explanation, we refer to [19, 62].

Considering the forecasting datasets, all data were standardized using the mean and standard deviation. A train/validation/test split is used, with a validation set of 10% and a test set of 20%. Lastly, a batch size of 64 is used.

All experiments are evaluated based on their mean absolute error (MAE):

$$\begin{aligned} \textrm{MAE}=\frac{1}{n} \sum _{i=1}^{n}|y_{i}-{\hat{y}}_{i}|, \end{aligned}$$
(4)

and mean squared error (MSE):

$$\begin{aligned} \textrm{MSE}=\frac{1}{n} \sum _{i=1}^{n}\left( y_{i}-{\hat{y}}_{i}\right) ^{2}, \end{aligned}$$
(5)

where \(y_i\) are the actual values and \(\hat{y}_i\) the predictions. A total of 100 epochs are run, without early stopping. The best model weights are restored after the last epoch.
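Both metrics in Eqs. (4) and (5) amount to a few lines of numpy:

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
```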

3.8 Software and resources

We used Python with TensorFlow and Keras, as well as PyTorch in combination with TSL [63], to develop and train the models. These models were trained on a dedicated server with two Intel Xeon CPUs (3.2 GHz), 256 GB RAM and an Nvidia Quadro RTX6000 (24 GB) GPU. In addition, a supplementary page (see Footnote 1) is made available containing the code of the experiments, including a Python package to construct the graphs.

4 Results

This section describes the results of our analysis on the performance of all graph construction techniques. We focus on understanding the different characteristics of each technique and inferring general guidelines for graph construction.

4.1 General statistics

In real-world networks, the majority of nodes often have a relatively low degree, while a few nodes, which are connected to many other nodes, have a very large degree. In order to obtain a better understanding of the structural properties of the constructed graphs, we investigate specific network characteristics via general properties and metrics. In particular, we examine the number of edges and the degree distributions of the graphs, shown in Table 3 and Fig. 5. The reported results correspond to the best hyperparameter settings in terms of their MSE. In addition, Table 3 also reports the graph densities. The density of a graph is the ratio of the number of (observed) edges to the number of all possible edges (in a completely connected graph), and is therefore 0 for a graph without edges and 1 for a complete graph.

Table 3 Total number of edges (and density in %) for each algorithm and corresponding dataset after the grid search on hyperparameter settings
Fig. 5

Degree Distributions of all graphs constructed on the PEMS-BAY dataset

Fig. 6

Weight distributions of all weighted graphs constructed on the PEMS-BAY dataset, where the red vertical lines represent the optimal cutoff point from the grid search

Immediately, it is visible that there are huge differences between the techniques in terms of the number of edges (see Table 3). Techniques such as DTW, MIC and MinMax produce graphs that contain many edges. In particular, the DTW and MIC graphs sometimes have densities that approach 1, which means that all nodes are connected. The best example of this is the PEMS-BAY graph constructed by MIC, which has 52,950 edges (with only 325 nodes). Other techniques, such as RNG, Gabriel and OPTICS, result in graphs with very few edges. For RNG and Gabriel this behavior is expected, since the goal of those algorithms is to construct optimally sparse graphs. Techniques that construct relatively complete graphs therefore essentially do not support data sparsity in terms of graph structure, since they contain very many edges. On the other hand, the sparser graphs naturally include fewer edges, thus potentially focusing on the stronger (and thus potentially more important) connections.

In addition, some graphs show scale-free degree distributions. The most notable characteristic of a scale-free network is the relative commonness of vertices with a degree that considerably exceeds the average. Examples of techniques producing such graphs are K-NN, correlation, k-means and RNG. From a complex systems perspective, those methods therefore tend to model the expected distribution, i.e., the scale-freeness, more closely than the other methods. There are also graphs whose degree distributions show a linear pattern, namely those constructed by the thresholded Gaussian kernel, MinMax and DTW. These techniques also belong to the ones that construct the most edges.

Another way to investigate the constructed graphs is by examining their edge weight distributions. Figure 6 shows the weight distributions on the PEMS-BAY dataset for the algorithms that construct weighted graphs, in contrast to graphs with just binary adjacency matrices. Some clear differences are visible between the signal-driven and location-based techniques. To start, the signal-driven techniques create edges with a much narrower weight distribution than the location-based techniques. This gives these techniques less fine-grained control over their thresholding parameter, since almost all the edges that are created have a weight of less than 0.50 (compared to the thresholded Gaussian kernel, K-NN-W and MinMax, where the distribution of weights covers the entire \([0, 1]\) range).

4.2 Regression and forecasting

Now that we have a general idea of the types of graphs that were constructed, we can investigate the performance of each technique. The MAE and MSE scores are shown in Table 4. We observe different high-performing graph construction techniques in each dataset. MIC and correlation perform best on the CI and CW datasets. On the METR-LA and PEMS-BAY datasets, MIC, correlation and DTW all work best (followed closely by MinMax). However, when taking into account only the MSE, the MinMax technique showed the best performance on METR-LA, indicating that the errors of MinMax are in general of smaller magnitude. This could be explained by the many edges constructed by the MinMax algorithm, which can have a damping effect on the predictions.

Clearly weak-performing graph construction algorithms are Gabriel, RNG and k-means. In the case of Gabriel and RNG, this could be because these algorithms create graphs that are too sparse for the GNN to function correctly. In the case of k-means, different reasons can be given. Clustering methods tend to create disconnected components, which here results in low performance. In addition, k-means is known to produce spherical clusters (also visible in Fig. 2), which may not easily be found in geographical sensor networks. Lastly, it is interesting to see that the thresholded Gaussian kernel seems to perform relatively poorly. Thresholded Gaussian kernels are, beside K-NN and \(q\)-neighborhood graphs, among the most popular options for constructing the graph [28, 29], especially in the spatiotemporal time series forecasting domain.

When we take a step back from individual scores to general characteristics of the techniques, some interesting patterns arise. To start, different performance is visible when characterizing techniques (see Table 4) based on their edge count (see Table 3). In almost all cases, having the most edges is most beneficial. This holds true for the CI, METR-LA and PEMS-BAY datasets. In the CW dataset, we see that having fewer edges seems to be beneficial for good performance. This could be due to the larger geographical area that the CW use-case covers, compared to the other datasets. Here, the graph construction algorithms that can handle larger distances (in this case by ignoring far-away stations) seem to show their strength, whereas techniques that create edges regardless of distance produce meaningless edges that weaken performance.

When we examine the performance of the different types of techniques (signal-driven or location-based), some general patterns emerge (see Table 4). In most cases, the signal-driven techniques show the highest performance in the tasks, suggesting that these techniques are able to discover very local patterns in the signal data which location-based techniques would not pick up on. For example, the traffic on one road segment could be totally different from that on a nearby road. Also, imagine an air temperature setting where a graph of weather stations has to be constructed. Figure 7 shows a 3D height plot of this scenario, where location-based techniques would likely connect node D to node C based on their proximity, whereas signal-driven techniques would likely not, due to the height difference, which would be observable in the signal data. The reason for this is that height is not captured by the latitude and longitude of a location. Therefore, signal-driven techniques would rather connect nodes D and B, due to their presumably similar measurements, while location-based methods would likely not.

A supplementary explanation for the relatively good performance of signal-driven techniques is their high edge count. Especially MIC shows this behavior, sometimes connecting all nodes. However, in the case of correlation, this does not necessarily hold true. Only on the PEMS-BAY dataset does the correlation technique create many edges; in the other cases, other techniques create many more edges.

Another method to examine the impact of the graphs is to look at the variability in performance across datasets. This is done with the coefficient of variation, which is the ratio of the standard deviation to the mean over all methods per dataset. The coefficient of variation proves to be valuable, as the context of the mean must always be taken into consideration while interpreting standard deviations. In general, it is noticeable that the variability in scores differs per dataset (see the bottom row of Table 4). The CI and CW datasets show the highest variability, with coefficients of variation of the MSE scores of \(13.4\%\) and \(14.4\%\). In these datasets, having a richer and more optimal graph seems to have the biggest impact on prediction performance. In the forecasting datasets (METR-LA and PEMS-BAY), the variability is lower. A reason for the difference between the two types of datasets could be related to the nature of the data and its sample size.

Table 4 MSE and MAE of all experiments, with the coefficient of variation (in \(\%\)) in the bottom row

Lastly, Table 5 shows the average ranking of each algorithm over all datasets. Algorithms are given a ranking between 1 and 11, corresponding to the 11 graph construction techniques used, depending on their performance (in terms of MSE) on each dataset, where the lowest performance gets 1 point and the highest 11 points. The model that constructed a graph with correlation shows the best average performance compared to all other algorithms, followed by MIC, DTW, MinMax and K-NN. The lowest-performing models constructed graphs with the Gabriel or relative neighborhood algorithms. The most important finding in this ranking is that signal-driven algorithms are overrepresented at the top of the rankings. However, out of the signal-driven techniques, the correlation graph is considerably more sparse. Therefore, it leads to lower computational complexity during training. Considering that correlation graphs are also more easily constructed in the first place, this seems to be the best general option. A reasonable second or baseline option seems to be an unweighted K-NN graph, which can be easily calculated and is relatively sparse.

One interesting thing to note from Table 5 is that the unweighted K-NN graph seems to generally outperform the weighted version. One would actually expect the benefit of the weighted K-NN technique to be the extra detail added to the graph. This is especially important for samples near the decision boundary, since they are vulnerable to noise-related effects. Nevertheless, our results indicate that placing greater trust in the immediate neighbors, as reflected by a weight of 1, leads to better results.

Fig. 7

Example of a 3D Contour plot showing the advantage of signal-driven techniques in situations with, e.g., height difference in an air temperature scenario

Table 5 Average rankings (and standard deviations in ranking) of each algorithm on all datasets, and the computation time on the BAY dataset (dataset with most nodes)

5 Conclusions

This work examined the influence of the underlying graph for a GNN in time series analysis tasks. Specifically, this work performed a thorough investigation of different kinds of graph construction techniques and their performance and characteristics. To accomplish this, we used two GNN models for either a (1) time series extrinsic regression (TSER) or (2) time series forecasting task. The models were both tested on sensor data with varying characteristics, which are shown in Table 2.

For each graph construction technique, a hyperparameter search was performed to compare the best performing settings to each other. In total 11 techniques were tested, including ones that construct graphs from either signal-driven information or location-based information (geographical location of the sensors). We reported the results individually, but also provided more broad results based on the characteristics of either the graph construction method or the dataset.

The results on the degree distribution of each constructed graph indicated that the graph construction techniques created graphs with varying characteristics. In general, signal-driven techniques calculated more edges compared to location-based techniques. However, graphs with either linear or long-tailed degree distributions showed the best results.

Considering the performance, signal-driven techniques showed the most promising results. The entire top 3 of the general ranking consists of signal-driven techniques. The correlation technique scored best, while also using the least computational resources of all signal-driven techniques. Therefore, if high-quality signal data are available, using correlation seems to be a great option.

With regard to variability in the results, some clear differences between the datasets can be observed. Most variability (measured by the coefficient of variation) was found in the regression datasets, hinting that the impact of the graphs was more prominent for the regression task.

In conclusion, our study highlights the importance of graph construction methods in graph neural network research. The choice of graph construction method has a considerable impact on the performance of GNNs in our tasks. Our findings provide insights into the strengths and weaknesses of different approaches and can serve as a useful guide for future researchers in this field.

Overall, we recommend that future research focus on developing more effective graph construction methods that can capture the underlying structure of the data while avoiding excessive computational complexity. For future work, we aim to investigate further methods for graph construction, e.g., using Granger causality to capture another type of dynamics between two signals. Furthermore, using a neural network to learn the most optimal graph from signal data is another interesting direction to consider. Also, applying further (larger) datasets is another prominent option for future work.