Introduction

The detection of anomalies in multivariate time-series data based on spatiotemporal modeling is an emerging research field, which aims to capture spatiotemporal dependencies from massive multivariate time-series data and achieve more sensitive anomaly detection through richer feature representations. In the real world, spatiotemporal data are present in various domains, including industry, transportation, meteorology, and finance [1]. These data not only exhibit time-series characteristics but also encompass diverse aspects such as physical properties and spatial–topological structures, and they are multivariate, high-dimensional, and complex. Hence, automatic anomaly detection on these massive spatiotemporal data can enhance the efficiency and accuracy of data analysis, effectively reduce the risk of accidents in practical industrial production processes, and hold significant economic and safety value [2,3,4].

Even when data features are considered from both temporal and spatial perspectives, it remains challenging to sensitively detect anomalies in massive multivariate time-series data through spatiotemporal modeling. Our experimental analysis suggests that this is primarily because the deep modeling process lacks sufficient abnormal-behavior information to guide training, for two reasons. First, regarding the data: in actual production processes, the system is usually reliable, so data from anomalous moments are scarce and hidden within large amounts of normal data, and the model must be trained under unsupervised conditions because labeling is expensive [5, 6]. Second, regarding modeling: series decomposition or normalization during temporal modeling reduces distribution differences along the time dimension [7, 8], which effectively improves mainstream prediction performance. However, similar operations may not be appropriate when the downstream task is anomaly detection, because some of the already sparse anomalous-behavior information is lost, reducing the model's acuity in capturing anomalous moments. Spatial-dependency modeling can compensate for this loss and provides a feasible path. In practice, however, the lack of prior knowledge of physical features that can be translated into effective spatial constraints, such as patterns of influence between features, makes it difficult for the model to obtain adequate representations of anomalous-behavior information.

Owing to the outstanding performance of the Transformer [9] in sequence modeling and prediction tasks, numerous Transformer-based anomaly detection methods have been proposed in recent years [10, 11]. Wiederer incorporated an external attention mechanism on top of self-attention to model the correlation among multivariate time-series, and proposed a regularization-based method to constrain model parameters and prevent overfitting [12]. Su focused on key issues in the prediction process, including the choice of feature embedding, the impact of model depth and width, and the combination of attention mechanisms and convolutional layers [13]. Anomaly Transformer leveraged the differences between abnormal time points and their local and global contexts to derive a distinguishable detection principle [14]. Some studies have deeply modeled the spatial dependencies among variables. GDN learns the correlation graph among features without prior knowledge and utilizes graph neural networks to model the information flow between feature nodes, helping anomaly detection by predicting the future behavior of features [15]. GTA combines the Transformer and GNNs in a hierarchical attention mechanism that considers the correlations between spatiotemporal features [16]. Han further advances this approach by integrating sparse autoencoders with graph neural networks, orchestrating a collaborative optimization of both the reconstruction and prediction tasks [17]. TranAD introduces a model for anomaly detection and diagnosis that leverages a deep transformer network; it integrates an attention-based sequence encoder, facilitating swift comprehension of overarching temporal patterns in the data [18]. The above literature models spatiotemporal features separately, obtains future feature-behavior expressions through prediction-based approaches, and conducts anomaly detection based on expression differences. However, the modeling process overlooks the loss of abnormal-behavior information caused by preprocessing methods such as normalization, and there is still room for improvement in modeling the spatial correlations between features and in exploring spatial constraints that strengthen model inference.

We propose an enhanced abnormal information expression spatiotemporal model for anomaly detection in multivariate time-series (EAIE-AD), which is capable of end-to-end anomaly detection under unsupervised training conditions. Our model performs deep temporal and spatial modeling simultaneously in a parallel manner. Through experiments, we found that while the Transformer and its variants demonstrate strong modeling capabilities in prediction tasks by optimizing the mean squared error (MSE) and mean absolute error (MAE), optimizing these metrics does not directly translate into better anomaly detection (as shown in Table 4). Hence, our goal in the prediction phase is not to minimize these metrics at all costs, but to learn more valid representations of anomalous-behavior information. We therefore focus on the non-stationary information that is present in the original data but potentially lost through normalization, and build on the non-stationary Transformer [19] by simplifying its model structure and changing what the attention mechanism operates on. In the spatial module, we learn the feature-association graph of the multivariate time-series data and extend the homogeneous graph of GDN [15] to a heterogeneous graph, which more finely models the physical characteristics of the features; on this graph structure, we use two contrastive learning strategies to find beneficial spatial constraints that strengthen the learning of anomalous-behavior representations. The contributions of our model are summarized as follows:

(1) We propose an end-to-end spatiotemporal anomaly detection model that simultaneously performs deep modeling of temporal dependencies and spatial correlations among features through a parallel architecture, and that guides full training of the model under unsupervised conditions by enhancing the expression of abnormal information.

(2) In the temporal dimension, we compensate for the effective modeling of the inherent non-stationary information in the original data and mitigate the loss of anomalous information by changing what the attention mechanism operates on; in the spatial dimension, we use graph neural networks and contrastive learning to model the physical properties of feature behaviors, from which we extract the hidden expression of anomalous-behavior information in the spatial topology.

(3) We achieve state-of-the-art anomaly detection results on multiple datasets, with F1-scores reaching 0.82 and 0.59 on the SWaT and WADI datasets, respectively. We then conduct thorough ablation experiments and data visualizations. Finally, we enhance the interpretability of the model by exploring the impact of the upstream multivariate time-series prediction task on the downstream anomaly detection task.

Related work

Time-series anomaly detection

The classical methods employed in multivariate time-series data anomaly detection tasks are primarily reconstruction- [20, 21] or prediction-based [22, 23], which can, respectively, compress data representation or model temporal correlation [24]. In addition, dimensionality-reduction methods, such as principal component analysis (PCA), singular value decomposition (SVD), and autoencoder (AE), as commonly used in machine learning, have been shown to be effective in assisting anomaly detection. In PCA, anomalies are defined as data points that deviate from the normal data space [25]; in SVD, anomalies are defined as high-dimensional data that have not been reconstructed [26]; AE detects anomalies by learning a self-encoder from the data, where the reconstruction error of anomalous data is usually greater than that of normal data [27].

As data dimensionality grows, researchers have turned to deep learning for anomaly detection, since it can automatically learn features and better model nonlinear relationships [28, 29]. Chen proposed an anomaly detection method for multivariate temporal data that uses a variational autoencoder (VAE) to learn the latent representation of the data, with the reconstruction error and the Kullback–Leibler (KL) divergence of the latent representation as anomaly metrics [30]. Kong proposed a long short-term memory (LSTM)-based method that uses an attention mechanism to assign weights to temporal features [31]. The Transformer, with the attention mechanism as its structural core, has enabled breakthroughs in deep time-series prediction. Its point-to-point attention mechanism is suitable for modeling temporal dependencies in time-series, and its stackable encoder–decoder blocks are conducive to capturing and aggregating temporal features at different time scales. Hence, improved Transformer-based anomaly detection methods have been proposed. Jeong proposed a self-supervised learning method that uses the transformer model to learn a representation of multivariate time-series data, and uses the distance between representations to measure the degree of anomaly of data points [32]. Wiederer used the Transformer to explore the variability of the association between anomalous moments and local and global moments, deriving a sensitive differentiation principle [12].

While the Transformer can model the relationships between moments well, the transfer of information between features at any given moment is also important for learning anomalous behavior. The spatial topology formed by this information transfer lies in a non-Euclidean space, and such spatial dependence is difficult to model with conventional neural networks because of the sparsity of the structure [33, 34].

Anomaly detection based on spatiotemporal modeling

Graph neural networks can address the limitations of modeling dependencies in non-Euclidean spaces, handling the relationships between features and enabling end-to-end learning with good robustness [35, 36], and are hence receiving increasing attention in anomaly detection. GGC-LSTM combines the advantages of graph convolutional neural networks and long short-term memory (LSTM) networks, considering both graph structure and time-series information [37]. Zhao used a deep graph convolutional neural network that can adaptively learn the relationships between time-series data with high computational efficiency [38]. GTA models spatiotemporal dependencies among multivariate sensor features in IoT systems in a tandem fashion and enhances model inference efficiency through a hierarchical attention mechanism [16].

Deng proposed the graph deviation network (GDN), a general framework for anomaly detection based on spatiotemporal modeling [15], which has received attention for its excellent results on several realistic datasets. GDN learns a spatial-association graph between sensor features in the absence of prior knowledge, uses an attention mechanism for information transfer and message updating on the graph structure, and applies the learned information to the temporal prediction task, which captures deviations in the future behavior of sensors more sensitively and improves detection performance. However, there is room for improvement. In the time-series prediction process, the model adopts a traditional normalization strategy, ignoring the expression of inherent non-stationarity in the data, which can be useful for learning behavior patterns at anomalous moments. In addition, the feature relationship graph is homogeneous and cannot adequately describe the physical characteristics of realistic scenarios. We propose an end-to-end spatiotemporal anomaly detection model that enhances abnormal information expression through modeling spatiotemporal dependencies, guiding the full training of the model.

Methods

This section describes our proposed model. Figure 1 shows a high-level overview of EAIE-AD, an end-to-end spatiotemporal model that incorporates both temporal and spatial dependencies.

Fig. 1

Model framework

Problem formulation

Assume a collection of multivariate time-series data obtained from \(d\) features at \({T}_{{\text{train}}}\) time stamps, denoted as \(S=\left\{{S}_{1},\dots ,{S}_{{T}_{{\text{train}}}}\right\}\), where \({S}_{i} \in {\mathbb{R}}^{d}\) for \(i\in \left\{1,\dots ,{T}_{{\text{train}}}\right\}\). Our model is trained in an unsupervised manner. The training and validation datasets consist of normal data, while the test dataset contains both normal and abnormal data. We seek to learn the behavior of the features from the training set and identify anomalous time stamps within the test set, assigning a label to each time step in the test set, where 0 and 1 denote normal and abnormal, respectively.

Temporal dependency modeling

Our goal is to predict the behavior of features through time-series modeling. While the Transformer is popular for temporal prediction due to its powerful long-sequence modeling capability, for downstream anomaly detection, learning sufficiently valid representations of anomalous-behavior information is more important than achieving small MSE and MAE values in prediction. Non-stationary information usually carries more expressions of abnormal behavior, but in the traditional Transformer input pipeline, operations such as normalization remove some of this non-stationary information, which may cause over-stationarization and degrade the performance of the attention mechanism.

Without normalization, however, feature scales are nonuniform and there are more noisy points, which reduces prediction performance. Therefore, we compensate for the information removed by normalization and change what the attention mechanism operates on. We first slice the original input data \(S\) with a sliding window. For any sample at moment \(t\), we take the time window of historical length \(w\) to obtain \(X = \left[ {X_{1} , \ldots ,X_{w} } \right]^{T} = \left[ {S_{t} , S_{t - 1} , \ldots ,S_{t - w + 1} } \right]^{T} , X \in {\mathbb{R}}^{w \times d} ,t \ge w\). For each time window \(X\), we normalize to obtain \(X{\prime} = \left[ {X_{1}^{\prime} , \ldots ,X_{w}^{\prime} } \right]^{T}\), where

$$ \mu_{x} = \frac{1}{w}\sum\limits_{i = 1}^{w} {X_{i} }, \quad \sigma_{x}^{2} = \frac{1}{w}\sum\limits_{i = 1}^{w} {\left( {X_{i} - \mu_{x} } \right)^{2} }, \quad X_{i}^{\prime} = \frac{1}{{\sigma_{x} }} \odot \left( {X_{i} - \mu_{x} } \right) \quad \text{for } i \in \left\{ {1, \ldots ,w} \right\}, $$
(1)

where \({\mu }_{x},{\sigma }_{x} \in {\mathbb{R}}^{d}\), and \(\odot\) denotes the element-wise product. At this point, to keep feature scales uniform, we apply the normalization module to the input time-series. However, normalization reduces the non-stationarity of the data distribution: on the one hand, some anomalous information is lost; on the other, the over-smoothed input may prevent the attention mechanism from producing differentiated attention, which degrades training performance. Hence, we change what the attention mechanism operates on. The standard self-attention mechanism is

$${\text{Att}}\left(Q,K,V\right)={\text{Softmax}}\left(\frac{Q{K}^{{\text{T}}}}{\sqrt{{d}_{k}}}\right)V,$$
(2)

where \(Q,K,V \in {\mathbb{R}}^{w\times {d}_{k}}\) are queries, keys, and values, respectively. \({\text{Softmax}}\) is an exponential normalization function. With the normalization of Eq. 1, each feature variable in the sequence has the same variance, so \({\sigma }_{X}\) can be converted to a scalar. Because the embedding and feedforward layers have linear properties, \(Q^{\prime} = \left( {Q - 1\mu_{Q}^{{\text{T}}} } \right)/\sigma_{X}\) is formed by the projection of \(X^{\prime}\), where \(\mu_{Q} \in {\mathbb{R}}^{{d_{k} }}\), and we can obtain

$$ {\text{Softmax}}\left( {\frac{{QK^{{\text{T}}} }}{{\sqrt {d_{k} } }}} \right) = {\text{ Softmax}}\left( {\frac{{\sigma_{x}^{2} Q^{\prime}(K^{\prime} )^{{\text{T}}} + 1\left( {\mu_{Q}^{{\text{T}}} K^{{\text{T}}} } \right) + \left( {Q\mu_{K} } \right)1^{{\text{T}}} - 1\left( {\mu_{Q}^{{\text{T}}} \mu_{K} } \right)1^{{\text{T}}} }}{{\sqrt {d_{k} } }}} \right), $$
(3)

where \(1\in {\mathbb{R}}^{w\times 1}\), \(Q{\mu }_{K}\in {\mathbb{R}}^{w\times 1}\), \({\mu }_{Q}^{{\text{T}}}{\mu }_{K}\) is a scalar, and \({\text{Softmax}}(\cdot )\) is invariant to the same translation along the row dimension of its input, so we have

$$ {\text{Softmax}}\left( {\frac{{QK^{{\text{T}}} }}{{\sqrt {d_{k} } }}} \right) = {\text{ Softmax}}\left( {\frac{{\sigma_{x}^{2} Q^{\prime}(K^{\prime} )^{{\text{T}}} + 1\left( {\mu_{Q}^{{\text{T}}} K^{{\text{T}}} } \right)}}{{\sqrt {d_{k} } }}} \right). $$
(4)

In this way, we obtain an improved attention calculation that benefits from the predictability of the stationarized sequence while maintaining the inherent temporal correlations of the original. However, the assumption that the embedding and feedforward layers are strictly linear rarely holds in practice, where numerous nonlinear activations are involved. We therefore compensate for this nonlinear information by using multilayer perceptrons (MLPs) to learn two de-stationary factors from \({\sigma }_{x}^{2}\) and \(1\left({\mu }_{Q}^{{\text{T}}}{K}^{{\text{T}}}\right)\), yielding the non-stationary attention calculation [19]

$$ {\text{Att}}\left( {Q^{\prime},K^{\prime},V^{\prime}} \right){ } = {\text{ Softmax}}\left( {\frac{{{\text{MLP}}\left( {\sigma_{x}^{2} } \right)Q^{\prime}(K^{\prime} )^{{\text{T}}} + MLP\left( {\mu_{Q}^{{\text{T}}} K^{{\text{T}}} } \right)}}{{\sqrt {d_{k} } }}} \right){ }V^{\prime}. $$
(5)
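For concreteness, the following is a minimal, single-head PyTorch sketch of the non-stationary attention in Eq. (5). The layer sizes and the `tau_net`/`delta_net` perceptrons are hypothetical stand-ins for the MLPs that learn the two de-stationary factors; the actual model applies this within multi-head encoder layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeStationaryAttention(nn.Module):
    """Single-head sketch of Eq. (5): attention over normalized projections
    Q', K', V', rescaled and shifted by factors learned from the raw-window
    statistics (hypothetical layer sizes)."""

    def __init__(self, d_k: int, hidden: int = 32):
        super().__init__()
        self.d_k = d_k
        # MLP(sigma_x^2): learns a positive multiplicative factor per window
        self.tau_net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # MLP(mu_Q^T K^T): learns an additive shift for each key position
        self.delta_net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, q_n, k_n, v_n, sigma2, mu_qk):
        # q_n, k_n, v_n: (batch, w, d_k) projections of the normalized window X'
        # sigma2:        (batch, 1)      mean variance of the raw window
        # mu_qk:         (batch, w)      the row vector mu_Q^T K^T from the raw keys
        tau = torch.exp(self.tau_net(sigma2)).unsqueeze(-1)            # (batch, 1, 1)
        delta = self.delta_net(mu_qk.unsqueeze(-1)).transpose(1, 2)    # (batch, 1, w)
        scores = (tau * (q_n @ k_n.transpose(1, 2)) + delta) / self.d_k ** 0.5
        return F.softmax(scores, dim=-1) @ v_n                         # (batch, w, d_k)
```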

To our knowledge, ours is the first work to simplify this attention mechanism and introduce it to multivariate time-series anomaly detection. We can then obtain the values of all features at any moment \(t\) from the historical time-window data through a feedforward network (FFN) [11] as

$$ Y_{t} = {\text{FFN}}\left( {{\text{Att}}\left( {Q^{\prime},K^{\prime},V^{\prime}} \right)} \right), $$
(6)
$${\text{FFN}}\left(x\right)=wx+b,$$
(7)

where \(w\) is the weight matrix, \(b\) is the bias term, and \({Y}_{t}\in {\mathbb{R}}^{d}\). Then, we calculate the MSE loss as:

$${\zeta }_{{\text{MSE}}}= \frac{1}{{T}_{{\text{train}}}} \sum_{t \in {T}_{{\text{train}}}}{({Y}_{t}-{S}_{t})}^{2}.$$
(8)
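The window construction of Eq. (1) and the prediction objective of Eq. (8) can be summarized in a short illustrative sketch; the tensor shapes and variable names are our own:

```python
import torch

def normalize_window(X, eps=1e-5):
    """Eq. (1): per-window normalization of X with shape (w, d); also returns
    the statistics later fed to the non-stationary attention."""
    mu = X.mean(dim=0, keepdim=True)
    sigma = X.std(dim=0, unbiased=False, keepdim=True) + eps
    return (X - mu) / sigma, mu.squeeze(0), sigma.squeeze(0)

def mse_loss(Y_pred, S_true):
    """Eq. (8): mean squared error between predicted and observed values,
    both of shape (T, d)."""
    return ((Y_pred - S_true) ** 2).mean()

# toy usage: slide a window of length w over a raw series S of shape (T, d)
w, S = 15, torch.randn(1000, 51)
X = S[100 - w:100]                     # window ending at t = 100
X_norm, mu_x, sigma_x = normalize_window(X)
```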

Spatial-dependency modeling

Understanding the interconnections and relationships among features enables us to acquire insights through contrastive learning on the graph structure, which provides useful supervisory information that enhances the learning of abnormal information expression.

Graph structure learning

Following GDN, the \(d\) sensors in the original training data are treated as graph nodes. Each sensor is randomly initialized to a \(d_{1}\)-dimensional embedding vector based on its sequence ID, represented as

$$ O_{i} \in {\mathbb{R}}^{{d_{1} }}, \quad \text{for } i \in \left\{ {1, \ldots ,d} \right\}. $$
(9)

We calculate the cosine similarity between the representations \(O_{i}\) and \(O_{j}\) of each pair of sensors

$$ e_{ij} = \frac{{O_{i}^{{\text{T}}} O_{j} }}{{\left\| {O_{i} } \right\| \cdot \left\| {O_{j} } \right\|}}. $$
(10)
$$ A_{ji} = 1\left\{ {j \in topK\left( {\left\{ {e_{ki} :k \in {\mathcal{C}}_{i} } \right\}} \right)} \right\}. $$
(11)

For a given sensor node \(i\), we select from its candidate relations \({\mathcal{C}}_{i}\) the top \(K\) nodes with the highest similarity to it, where \(A_{ji}=1\) indicates an edge between nodes \(j\) and \(i\), so as to obtain the homogeneous graph \(G\left( {O,A} \right)\) with only one node type and one edge type.

In actual industrial systems, sensors fall into one of two categories according to the nature of their work: those that perform control operations and those that monitor indicators of the working environment, which we call actuators and monitors, respectively; we classify nodes according to these attributes. This classification yields two types of edges, connecting nodes of either the same or different types, and constitutes our heterogeneous association graph \({G}_{0}(O,A,\alpha ,\beta )\), where \(\alpha \) and \(\beta \) denote node and edge types, respectively.
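A compact sketch of the graph structure learning in Eqs. (9)–(11) follows; here the candidate set is assumed to be all other nodes, and the actuator/monitor typing is illustrated with a toy label vector:

```python
import torch
import torch.nn.functional as F

def learn_topk_graph(node_emb, k=20):
    """Eqs. (10)-(11): cosine similarity between sensor embeddings O_i, then a
    directed top-K adjacency with A[j, i] = 1 if node j is among the K most
    similar candidates of node i (candidates = all other nodes in this sketch)."""
    normed = F.normalize(node_emb, dim=1)       # unit-norm embeddings
    sim = normed @ normed.t()                   # e_ij, shape (d, d)
    sim.fill_diagonal_(float('-inf'))           # exclude self-similarity
    topk_idx = sim.topk(k, dim=0).indices       # K best sources per target node i
    A = torch.zeros_like(sim)
    A.scatter_(0, topk_idx, 1.0)
    return A

# toy usage: d = 6 sensors with d1 = 64 dimensional random embeddings (Eq. 9)
O = torch.randn(6, 64)
A = learn_topk_graph(O, k=2)
# hypothetical node typing for the heterogeneous graph G0 (0 = actuator, 1 = monitor);
# an edge's type is "same" or "cross" depending on the types of its two endpoints
node_type = torch.tensor([0, 0, 1, 1, 1, 0])
```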

Graph contrastive learning

Our main goal in graph contrastive learning on the heterogeneous graph \({G}_{0}\) is to find beneficial spatial constraints. Its two main steps are data augmentation and sampling. We perform an initial spatial embedding of the original node data into the \({d}_{1}\)-dimensional space to obtain a representation of each node in \({G}_{0}\), denoted as \({v}_{i}, {v}_{i}\in {G}_{0}{, v}_{i} \in {\mathbb{R}}^{{d}_{1}}\). It is worth noting that we do not introduce the sensor sequence embeddings here; \({O}_{i}\) is only used to learn the graph structure.

We randomly drop edges and node features in the graph \({G}_{0}\) with mask ratio \(\varepsilon\), and repeat this operation twice to obtain two new graphs \({G}_{1}\) and \({G}_{2}\). We then perform message aggregation and node updating on the two graphs with a GNN, and for any node \({v}_{i}\), we obtain its re-characterization as

$${v}_{i}^{l+1}= \sigma \left(\sum_{r\in R}\sum_{j\in {N}_{i}^{r}}\frac{1}{{C}_{i,r}}{W}_{r}^{l}{v}_{j}^{l}+ {W}_{0}^{l}{v}_{i}^{l}\right),$$
(12)

where \({v}_{i}^{l+1}\) is the feature vector of node \(i\) in layer \(l+1\), \(R\) is the set of relations, \({W}_{r}^{l}\) is the weight matrix of relation \(r\) in layer \(l\), \({W}_{0}^{l}\) is the self-connection weight matrix in layer \(l\), \({C}_{i,r}\) is the normalization factor, \({N}_{i}^{r}\) is the set of nodes whose relation to node \(i\) is \(r\), and \(\sigma\) is a nonlinear activation function. In simpler terms, when encoding each node, we compute its new feature vector as a weighted sum of the feature vectors of the nodes connected to it in the previous layer, where the weights are determined by the weight matrix associated with each relation, together with a self-connection term. This weighted sum is passed through a nonlinear activation function to introduce nonlinearity, and the weights are normalized to mitigate the influence of node degrees.
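A minimal sketch of the relational message passing in Eq. (12) is shown below, with one weight matrix per edge type and a self-connection term; dense adjacency matrices are used only for readability, and the real graphs are sparse:

```python
import torch
import torch.nn as nn

class RelationalGraphLayer(nn.Module):
    """One layer of Eq. (12): sum over relations r of degree-normalized
    neighbor messages W_r^l v_j^l, plus the self term W_0^l v_i^l."""

    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(num_relations)])
        self.self_weight = nn.Linear(dim, dim, bias=False)   # W_0^l
        self.act = nn.ReLU()                                  # sigma(.)

    def forward(self, h, adj_per_rel):
        # h:           (d, dim) node features v^l
        # adj_per_rel: list of (d, d) adjacency matrices, one per relation r
        out = self.self_weight(h)
        for A_r, W_r in zip(adj_per_rel, self.rel_weights):
            deg = A_r.sum(dim=1, keepdim=True).clamp(min=1.0)  # normalization C_{i,r}
            out = out + (A_r @ W_r(h)) / deg
        return self.act(out)
```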

After data augmentation and graph node re-characterization, we perform positive and negative sampling. The traditional graph contrastive learning sampling strategy is to randomly fix the node of one of the views as the anchor point; only the same point in another view constitutes a positive sample pair, and the rest are negative sample pairs, with repeat traversal to obtain the set of all positive and negative sample pairs [39]. While this is intuitive and easy to implement, we found through experiments that such methods have limitations in heterogeneous graphs, and it is difficult to effectively use the representation of heterogeneous information between different node and edge types. To improve the sampling strategy, we sample positive and negative sample pairs in \({G}_{1}\) and \({G}_{2}\), and classify the relationship between sample pairs into two categories, one unrelated and the other inconsistent, where unrelated refers to sample pairs that are not directly connected, and inconsistent to those whose edges are connected but whose node types are inconsistent.

We first randomly sample a node in \(G_{1}\) to obtain \(v_{i} \in G_{1}, v_{i} \in {\mathbb{R}}^{{d_{1}}}\). Then, we find the set \({\text{M}}\) of neighboring nodes of node \(i\) in \(G_{2}\) to obtain \(v_{j}\). A positive sample pair can be expressed as \(P_{i} = (v_{i}, v_{j})\), with \(v_{i} \in G_{1}\), \(v_{j} \in G_{2}\), and \(v_{j} \in {\text{M}}\); a negative sample pair as \(N_{i} = (v_{i}, v_{j})\), with \(v_{i} \in G_{1}\), \(v_{j} \in G_{2}\), and \(v_{j} \notin {\text{M}}\). Then, we perform pooling on the sample pairs. For the generalization of the model, we use sum-pooling to obtain \(P_{i}, N_{i} \in {\mathbb{R}}^{{d_{1}}}\). We repeat sampling \(k_{1}\) times to obtain the sets of positive and negative sample pairs, denoted as \(P = \left\{ {P_{1}, \ldots, P_{{k_{1}}}} \right\}, P \in {\mathbb{R}}^{{k_{1} \times d_{1}}}\) and \(N = \left\{ {N_{1}, \ldots, N_{{k_{1}}}} \right\}, N \in {\mathbb{R}}^{{k_{1} \times d_{1}}}\), and the objective function is

$$ \zeta_{1} = - \log \mathop \sum \limits_{i = 0}^{{k_{1} }} \mathop \sum \limits_{j = 0}^{{\theta k_{1} }} \frac{{\exp \left( {P_{i} (P_{j} )^{T} } \right)}}{{\exp \left( {N_{i} (N_{j} )^{T} /\tau } \right)}}, $$
(13)

where \(\tau\) is the temperature coefficient. Note that the numbers of positive and negative samples are both \(k_{1}\) at this point. However, experimental analysis showed that increasing the number of negative samples improves the learning ability of the model, so we introduce a hyperparameter \(\theta \) to control the sampling ratio between each group of positive and negative samples. This first sampling strategy is based on unrelated node structures.
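The following sketch illustrates the first (structure-based) sampling strategy and a loose reading of Eq. (13), contrasting aggregate positive-pair agreement against temperature-scaled negative-pair agreement. Function and variable names are our own, and the augmented-graph representations `V1`, `V2` and the adjacency `A2` of \(G_2\) are assumed given:

```python
import torch

def sample_structure_pairs(V1, V2, A2, k1=10, theta=5):
    """Anchors drawn from G1; neighbors in G2 give positives, non-neighbors
    give negatives (theta negatives per anchor); sum-pooling per pair."""
    d = V1.shape[0]
    P, N = [], []
    for i in torch.randint(0, d, (k1,)):
        nbrs = A2[i].nonzero(as_tuple=True)[0]
        others = (A2[i] == 0).nonzero(as_tuple=True)[0]
        if len(nbrs) == 0 or len(others) == 0:
            continue
        j = nbrs[torch.randint(0, len(nbrs), (1,))][0]
        P.append(V1[i] + V2[j])                                   # positive pair, sum-pooled
        for j in others[torch.randint(0, len(others), (theta,))]:
            N.append(V1[i] + V2[j])                               # negative pairs
    return torch.stack(P), torch.stack(N)

def structure_contrastive_loss(P, N, tau=0.25):
    pos = torch.exp(P @ P.t()).sum()        # agreement among positive pairs
    neg = torch.exp(N @ N.t() / tau).sum()  # temperature-scaled negative agreement
    return -torch.log(pos / neg)
```

The second, attribute-based strategy described next only changes how pairs are selected (same-type versus different-type neighbors); the pooling and loss form stay the same.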

The second sampling strategy is based on the inconsistency of node attributes. The first strategy focuses on the characteristics of edges, in short, on whether or not two nodes are connected, as the basis for sampling positive and negative pairs across the two graphs. The second strategy focuses on node characteristics, sampling connected nodes across the two graphs as positive or negative pairs according to whether their node types match. Specifically, we randomly sample a node \(v_{i}^{\prime} \in G_{1}\) and find the set \({\text{M}}^{\prime}\) of its neighboring nodes in \(G_{2}\), from which a node of the same type is randomly selected to form a positive sample pair. As mentioned when learning the graph structure, our graph divides nodes into actuator and monitor types, and we denote the sets composed of the two types of sensor nodes as \({\mathcal{A}} = \left\{ {v_{i}^{\prime } \mid type\left( {v_{i}^{\prime } } \right) = actuator} \right\}\) and \({\mathcal{B}} = \left\{ {v_{i}^{\prime } \mid type\left( {v_{i}^{\prime } } \right) = monitor} \right\}\). A positive sample pair can be expressed as \(P_{i}^{\prime} = (v_{i}^{\prime}, v_{j}^{\prime})\), with \(v_{i}^{\prime} \in G_{1}\), \(v_{j}^{\prime} \in {\text{M}}^{\prime}\), and \(type\left( {v_{i}^{\prime} } \right) = type\left( {v_{j}^{\prime} } \right)\); a negative sample pair as \(N_{i}^{\prime} = (v_{i}^{\prime}, v_{j}^{\prime})\), with \(v_{i}^{\prime} \in G_{1}\), \(v_{j}^{\prime} \in {\text{M}}^{\prime}\), and \(type\left( {v_{i}^{\prime} } \right) \ne type\left( {v_{j}^{\prime} } \right)\). Then, we perform sum-pooling on the sample pairs to obtain \(P_{i}^{\prime}, N_{i}^{\prime} \in {\mathbb{R}}^{{d_{1}}}\). We repeat sampling \(k_{1}\) times to obtain the respective sets of positive and negative sample pairs \(P^{\prime} = \left\{ {P_{1}^{\prime}, \ldots, P_{{k_{1}}}^{\prime}} \right\}, P^{\prime} \in {\mathbb{R}}^{{k_{1} \times d_{1}}}\) and \(N^{\prime} = \left\{ {N_{1}^{\prime}, \ldots, N_{{k_{1}}}^{\prime}} \right\}, N^{\prime} \in {\mathbb{R}}^{{k_{1} \times d_{1}}}\), and obtain the objective function

$$ \zeta_{2} = - \log \mathop \sum \limits_{i = 0}^{{k_{1} }} \mathop \sum \limits_{j = 0}^{{\theta k_{1} }} \frac{{\exp \left( {P_{i}^{\prime} (P_{j}^{\prime} )^{T} } \right)}}{{\exp \left( {N_{i}^{\prime} (N_{j}^{\prime} )^{T} /\tau } \right)}}. $$
(14)

In this way, we obtain the spatial-dependency modeling objective function

$$ \zeta_{{{\text{spatio}}}} = { }\zeta_{1} + \zeta_{2} , $$
(15)

which introduces a constraint on our temporal prediction task, thereby yielding the comprehensive objective function for our model

$$ {\mathcal{L}} = \zeta_{{{\text{MSE}}}} + \zeta_{{{\text{spatio}}}} . $$
(16)

Anomaly detection

Having integrated spatial constraints into our framework for predicting sensor behavior, we compute an anomaly score to provide an explanation for anomalous behavior [15]. We calculate the discrepancy between the predicted and observed values of sensor \(i\) at each time stamp \(t\) within the test set

$${E}_{t,i}=\left|{Y}_{t,i}-{S}_{t,i}\right|, t\in {T}_{{\text{test}}},i\in \left\{1,...,d\right\},$$
(17)

where \({T}_{{\text{test}}}\) denotes all the time stamps in the test set, \({Y}_{t,i}\in {Y}_{t}\). Considering the varying sensitivities among the sensors, we apply robust normalization to the calculated deviations

$${\psi }_{t,i}= \frac{{E}_{t,i}-{\mu }_{i}}{{\sigma }_{i}},$$
(18)

where \({\mu }_{i}\) and \({\sigma }_{i}\) are the median and inter-quartile range of sensor \(i\)'s deviations, respectively. Subsequently, we employ the maximum function to aggregate the anomaly scores of all sensors at time \(t\), yielding the time-stamp anomaly score \({\psi }_{t}={\max }_{i}\, {\psi }_{t,i}\). If this surpasses the predefined threshold, we classify the moment as anomalous. Threshold selection can be optimized in different ways [40], but to ensure fairness with the baseline experiments, we use the maximum system anomaly score over all moments in the validation set as the threshold [15, 18]. It is important to note that our training and validation sets consist solely of normal data, and only the test set contains abnormal samples; this is an important condition for our unsupervised training and threshold setting.
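The detection procedure of Eqs. (17)–(18), together with the max-aggregation and thresholding step, can be sketched as follows; shapes and names are illustrative:

```python
import torch

def system_anomaly_scores(Y_pred, S_true):
    """Eqs. (17)-(18): absolute deviations per sensor, robustly normalized by
    the per-sensor median and inter-quartile range, then max-aggregated over
    sensors; Y_pred and S_true have shape (T, d)."""
    E = (Y_pred - S_true).abs()
    median = E.median(dim=0, keepdim=True).values
    iqr = (E.quantile(0.75, dim=0, keepdim=True)
           - E.quantile(0.25, dim=0, keepdim=True)).clamp(min=1e-6)
    psi = (E - median) / iqr
    return psi.max(dim=1).values                 # one score per time stamp

def detect(val_scores, test_scores):
    """Threshold = maximum system score on the (all-normal) validation set;
    test stamps above it are labeled 1 (abnormal), otherwise 0 (normal)."""
    threshold = val_scores.max()
    return (test_scores > threshold).long()
```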

Experiment

We performed experiments and conducted quantitative and qualitative analyses to compare our method with baseline approaches.

Dataset: The scarcity of high-dimensional series data originating from real-world industrial systems, incorporating anomalous instances, poses a challenge. However, there are two extensively employed cyber-physical systems (CPS) datasets available for research in time-series anomaly detection. These datasets, Secure Water Treatment (SWaT) and Water Distribution (WADI) [1], were generated and released by the iTrust Center for Research in Cybersecurity at the Singapore University of Technology and Design. Details are shown in Table 1.

Table 1 Details of SWaT and WADI datasets

Baselines: As our model is designed for anomaly detection based on multivariate time-series forecasting, our baselines fall into two categories. The first comprises outstanding work on detecting anomalies in multivariate time-series data, providing a direct comparison of our model's performance. The second consists of Transformer models that have recently demonstrated excellent performance in multivariate time-series forecasting. The baselines are as follows:

PCA [41]: Discovers a low-dimensional projection that effectively captures the majority of variance present in the data. The anomaly score, in this context, refers to the reconstruction error associated with this projection;

KNN [42]: Employs the distance between each data point and its top \(k\) nearest neighbors as an anomaly score;

DAGMM [43]: Combines deep autoencoders and a Gaussian mixture model to generate a low-dimensional representation and reconstruction error for each observation;

AE [44]: Consisting of an encoder and a decoder, reconstructs data samples, utilizing the reconstruction error as a metric for detecting anomalies;

LSTMVAE [45]: To leverage the advantages of both LSTM and VAE, the feedforward network in a VAE is replaced by LSTM, allowing for the computation of the reconstruction error, which serves as an error score;

Mad-GAN [21]: By employing generative adversarial networks (GANs) in conjunction with a reconstruction-based approach, error scores are computed for each sample;

GDN [15]: Can capture both spatio and temporal dependencies, representing multivariate time-series data as graphs, and utilizing GNNs to learn the representations of nodes and edges within them. Learned representations are fed into a sensor future behavior prediction module, which enables the detection of anomalies in time-series data;

TranAD [18]: Utilizing an innovative self-attentive mechanism, it incorporates self-regulation grounded in focus scores for resilient multi-modal feature extraction. The model employs adversarial training for stability and integrates reconstruction loss for anomaly detection.

Informer [46]: Employing a multilayer Transformer architecture, this model enhances the weight calculation method for attention, and incorporates techniques such as time-varying positional encoding and length masking, which enable efficient processing of long sequences and accurate predictions across multiple time steps;

Autoformer [47]: This model embeds series decomposition blocks into a deep architecture and replaces self-attention with an auto-correlation mechanism that discovers period-based dependencies, enhancing the accuracy and generalization of long-sequence prediction.

Non-stationary Transformer (Nsformer) [19]: Designed for non-stationary time-series, it normalizes the input series to improve predictability and re-injects the removed statistics into attention through learned de-stationary factors, resulting in improved accuracy of sequence prediction.


Evaluation metrics: To ensure generalizability and fairness, we adopt the evaluation metrics commonly used in the literature: precision, recall, and F1-score [15, 21]

$${\text{Pre}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}}$$
(19)
$${\text{Rec}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}}$$
(20)
$${F}_{1}=2\times \frac{{\text{Pre}}\times {\text{Rec}}}{{\text{Pre}}+{\text{Rec}}},$$
(21)

where TP is the number of correctly detected anomalies, FP the number of normal points falsely detected as anomalies, TN the number of correctly identified normal points, and FN the number of anomalies falsely labeled as normal.


Implementation: We used the PyTorch-1.8.1 library to train all the models, and split the training time-series into 90% training data and 10% validation data. We used the Adam optimizer with a learning rate of 0.01 and trained for 10 epochs. Some important hyperparameters are as follows: in the temporal module, the sliding window length was \(w = 15\), the number of transformer encoder layers was \(L = 3\), the number of heads was 4, and \(d_{k} = 64\); in the spatial module, \(\varepsilon = 0.2\), \(d_{1} = 64\), \(K = 20\), \(k_{1} = 10\), \(\theta = 5\), and \(\tau = 0.25\).
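For reference, the hyperparameters listed above can be gathered into a single configuration sketch (the key names are our own):

```python
config = {
    "window_length_w": 15,       # sliding window length
    "encoder_layers_L": 3,       # transformer encoder depth
    "num_heads": 4,
    "d_k": 64,                   # attention dimension
    "lr": 0.01,                  # Adam learning rate
    "epochs": 10,
    "mask_ratio_eps": 0.2,       # edge/feature mask ratio for graph augmentation
    "graph_embed_dim_d1": 64,
    "top_k_neighbors_K": 20,     # K in graph structure learning
    "sampling_times_k1": 10,     # contrastive sampling repetitions
    "neg_pos_ratio_theta": 5,    # negative-to-positive sampling ratio
    "temperature_tau": 0.25,
}
```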

Research question 1: anomaly detection performance

We present the anomaly detection performance of our model and the baseline approaches in Table 2, in terms of precision, recall, and F1-score on the SWaT and WADI datasets. The results indicate that our model outperforms the baselines in terms of recall and F1-score on both datasets, achieving F1-scores of 0.82 and 0.59 for SWaT and WADI, respectively. While the GDN baseline achieves higher precision scores, the trade-off between precision and recall is inevitable. In practice, maintenance technicians with domain expertise tend to prioritize high sensitivity over specificity to avoid missing any critical events worthy of future reference [36]. Therefore, the goal of our model optimization is to maximize the recall while optimizing the F1-score.

Table 2 Anomaly detection performance on SWaT and WADI datasets in terms of precision (Prec) (%), recall (Rec) (%), and F1-score (F1)

We observe that the improvement rate on the WADI dataset is higher than on the SWaT dataset. We attribute this to its larger data volume and feature dimensionality, which result in a more complex spatial topology. By utilizing graph-based contrastive learning, our model can uncover more valuable spatial constraints and guide the learning process to enhance the representation of anomalous behavior. It is also worth noting that TranAD performs very well on SWaT, especially on the Rec metric, but not as well as our model on the WADI dataset. TranAD learns feature relevance through adversarial learning and meta-learning; however, meta-learning uses limited data and lowers the learning threshold, so although TranAD detects more anomalous moments, it easily misclassifies some normal moments as anomalous, which makes its precision much lower. Our model, on the other hand, learns the non-stationarity of the original data well through the improved attention, which is more expressive on the higher-dimensional WADI dataset, and balances the Prec and Rec metrics to obtain excellent F1-scores.

Research question 2: ablation

To demonstrate the necessity of our model components in achieving the optimal detection performance shown in Table 2, we conducted an ablation study, whose results are presented in Table 3.

Table 3 Ablation test results

Temporal: When we replace the temporal modeling component of our model with a regular Transformer network that only includes an encoder, there is a significant decrease in precision, recall, and F1-score. Specifically, the F1-scores decreased by 0.07 and 0.03 on the SWaT and WADI datasets, respectively. This indicates that applying the attention mechanism directly to normalized data loses anomalous information present in the original data; it also causes excessive stationarity during deep model training, producing attention weights that are difficult to differentiate across sequences. Our model instead approximates the attention that would be computed on the original data, which helps capture and express information related to anomalous behavior.


Spatio: When we directly remove the spatial-dependency modeling by not calculating \(\zeta_{{{\text{spatio}}}}\), and rely solely on the time-series prediction results for detection, the F1-score decreases by 0.03 and 0.04 on the SWaT and WADI datasets, respectively. This indicates that using graph contrastive learning to search for supervisory signals can guide the model to learn spatial dependencies between features. It strengthens the constraints during model training, allowing the model to be trained more comprehensively.

Research question 3: interpretability

We investigate an interesting question that we discovered during our experiments. In our prediction-based multivariate time-series anomaly detection model, the upstream task is one of general multivariate time-series prediction, and downstream is a binary classification anomaly detection task. In multivariate time-series prediction tasks, MSE and MAE are commonly used performance evaluation metrics [46, 47]. We wish to explore whether the downstream anomaly detection task can directly benefit from the optimized MSE and MAE metrics in the prediction task. To study this question, we employed various state-of-the-art transformer models that have shown excellent performance in multivariate time-series prediction tasks. We observed the impact of MSE and MAE during the prediction phase on the precision, recall, and F1-score in the detection task. The results are presented in Table 4.

Table 4 Prediction task (MSE, MAE) and detection task (Prec, Rec, F1) performance test data

As mentioned in Research Question 1, we place more emphasis on recall and F1-score as our primary objectives in practice. On the SWaT dataset, Autoformer demonstrates the best performance in the prediction task, with MSE and MAE values of 0.13 and 0.17, respectively. However, its recall and F1-scores are not as high as those of Nsformer [19] and our model. On the WADI dataset, Informer performs best in the prediction task, with MSE and MAE values of 0.17 and 0.23, respectively; nevertheless, its final detection performance is still not optimal. From the experimental data, we argue that the gain the downstream detection task derives from optimizing the MSE and MAE metrics of the upstream prediction task is nonlinear, and that beyond a critical point there is no sustained gain. Hence, we introduced spatial-dependency modeling: although the MSE and MAE of our prediction task are numerically inferior to those of other models, the recall and F1-score of the final detection task surpass theirs. On further analysis, we believe the focus should be on whether the model can learn a sufficient representation of anomalous behavior and obtain enough constraints in the prediction phase. Hence, our model strengthens the expression of anomalous-behavior information at two levels. First, unlike the mainstream Transformer, we make the attention mechanism act on the original, un-normalized non-stationary information, avoiding the loss of anomalous-behavior information caused by normalization. Second, we obtain more spatial constraints on anomalous behavior by mining the spatial dependencies between features through graph contrastive learning.

To further enhance interpretability, we visualize self-attention maps on the WADI dataset using the standard attention mechanism and the non-stationary attention mechanism employed in our model.

As shown in Fig. 2, the horizontal and vertical axes are the normalized input data. Figure 2(a) and (c) visualize the standard attention, and Fig. 2(b) and (d) visualize the non-stationary attention mechanism. Comparing panels from the same dataset, we can clearly see that in Fig. 2(a) and (c), due to the effect of normalization, the attention weights concentrate on the diagonal and all input tokens tend to attend to themselves, producing an over-stationarity problem, whereas in Fig. 2(b) and (d), our model attends to more information from other tokens in the input sequence and produces effective variability in attention, helping it learn to express more information about the anomalous behavior.

Fig. 2

Attention visualization on WADI dataset

To further illustrate the ability of the error scores constructed by our model to discriminate anomalies, we visualize the distributions of anomaly scores for positive and negative samples. Given its balance of precision and recall, we also visualize the distribution of GDN anomaly scores for comparison, as shown in Fig. 3. Compared to GDN, our model obtains a better separation of normal and abnormal data, especially on the SWaT dataset, where the error scores of normal data remain low and concentrated, indicating that our model can more effectively separate normal from abnormal embeddings. On the WADI dataset, there are still many normal sample points with excessive error scores, and the presence of these noisy points is one of the main reasons why detection performance on WADI is inferior to that on SWaT.

Fig. 3

Sample error score distribution

To explore the sensitivity of the hyperparameters in our model, we tested six important ones, as shown in Fig. 4: the sliding window length (\(w\)), the number of transformer encoder layers (\(L\)), the number of node neighbors (\(K\)), the graph embedding dimension (\(d_{1}\)), the number of contrastive learning sampling repetitions (\(k_{1}\)), and the sampling ratio (\(\theta\)). Many hyperparameters have multiple near-optimal values; considering model inference speed, we chose relatively small values for all of them. The two most influential parameters are \(K\) and \(\theta\). For \(K\), each node needs enough neighbors so that information transfer in the graph can capture enough useful information, but when the number of neighbors exceeds 20, over-smoothing occurs and information flow in the graph approaches that of a fully connected graph. For \(\theta\), when it is too small, the difference between positive and negative samples cannot be fully expressed and the model underfits; when it is too large, a large amount of noise is learned and inference efficiency drops significantly. Finally, the time complexity and parameter count of each module are shown in Table 5.

Fig. 4

Hyperparametric sensitivity experiments

Table 5 Module time complexity and parameter counts

Conclusion

In this work, we proposed a novel end-to-end spatial–temporal anomaly detection model for multivariate time-series (EAIE-AD), which performs deep modeling in both the temporal and spatial dimensions and compensates for anomalous-behavior information. Our model is versatile, catering to both the temporal and graph domains. It employs self-supervised learning tailored for sparse data, making it well suited to scenarios characterized by data sparsity and complexity. This adaptability is particularly valuable in applications that demand the simultaneous capture of spatial–temporal characteristics, such as traffic flow detection and anomaly detection in intelligent systems. Through experiments, we verified that our model outperforms the baselines on two widely used real datasets, and we explored, through visualization and analysis of the experimental results, the relationship between prediction-task performance and the resulting gain in the final detection task for mainstream models, enhancing the interpretability of our model. In future work, we will dig deeper into the properties of the spatial topology, including spatial localization and path compensation, to provide more realistic application scenarios for our model.