1 Introduction

Anomaly detection is a challenging and important task in a variety of real-world domains. Its goal is to identify observations that are partially or entirely irrelevant, as they are not generated by an assumed and unknown stochastic model (Chandola et al., 2007). For these tasks, several statistical approaches were originally proposed, both parametric (Urvoy & Autrusseau, 2014; Rousseeuw & Leroy, 2005; McCallum et al., 2000; Eskin et al., 2002; Duda et al., 2006; Jordan & Jacobs, 1994) and non-parametric (Hofmeyr et al., 1998; Javitz et al., 1991; Desforges et al., 1998). A different class of approaches is that of information theory-based approaches, which analyze the information content of a data set using different information-theoretic measures such as entropy, relative entropy, etc. (Lee & Xiang, 2001; Arning et al., 1996; Li & Vitányi, 1993; He et al., 2005; Noble & Cook, 2003). Another perspective is that of machine learning-based approaches for anomaly detection, which aim to identify patterns or observations in data that deviate from the expected behavior. These approaches involve training a model based on historical data, which can then be used to identify instances that do not conform to the observed patterns. Anomaly detection has a wide range of applications in various fields, including fault detection in system diagnostics, credit card fraud detection, and intrusion detection in cybersecurity (Hale et al., 2019; Lebichot et al., 2021; Aldweesh et al., 2020).

In urban areas, anomalies can be identified in sensor network data. This is the case of anomalous traffic jams or pedestrian flows, which could be promptly detected so as to prevent potential security threats. The detected anomalies could be transmitted in a timely manner to security operators who could, in turn, make informed decisions for the situation at hand. Anomaly detection for pedestrian flow can be employed, for instance, to promptly detect large crowds in unsuitable areas (possibly due to unauthorized protests). Another useful application in this context is the detection of environmental anomalies, such as abnormal air pollutant levels, which can potentially compromise citizens’ health (Zhang et al., 2022). Anomaly detection is also relevant in the context of smart grids and power systems, which provide a wide range of relevant application scenarios, mainly concerning the detection of overloads, malfunctions, etc.

One common approach for anomaly detection in sensor network data is to use supervised or unsupervised machine learning techniques. The goal is to identify anomalous, unexpected behavior in one or more values from different sensors simultaneously, considering the specific temporal and spatial coordinates of the considered observations so as to take into account their spatio-temporal autocorrelation (Kou et al., 2006; Shekhar et al., 2001; Corizzo et al., 2021).

Although supervised and semi-supervised machine learning methods are often adopted for anomaly detection, one of the most challenging issues is the limited availability of labeled anomaly data. Indeed, the task of manually labeling anomalous events requires: i) the availability of human experts; ii) the ability to identify real events that could represent anomalous scenarios in the ground truth; iii) a consistent labeling effort of sequentially reviewing large-scale historical data, which is time-consuming and error-prone. Moreover, iv) labels can be affected by the so-called “contamination” problem, that is, they can be inaccurate up to a certain percentage. These motivations usually lead researchers to prefer unsupervised machine learning methods, which are able to work with unlabeled data, avoiding the effort of manually labeling normal and anomalous events.

A good compromise between supervised and unsupervised approaches is that of semi-supervised methods, which require a limited number of labels. They can also successfully deal with the contamination problem (Wang et al., 2022). However, a proper estimation of the contamination rate is required to yield satisfactory model performance; otherwise, an inaccurate estimation results in an inaccurate decision boundary for the models. For all these reasons, our study proposes an unsupervised approach to anomaly detection.

Additional challenges arise with large-scale, high-dimensional data, which is commonly present in sensor networks (Thudumu et al., 2020). Big data techniques such as parallel and distributed computing, as well as data stream monitoring and processing tools, should be used to efficiently process and analyze such data in a timely manner, allowing for the early detection of anomalies (Reddy Shabad et al., 2021).

The most popular category of approaches for machine learning-based anomaly detection tasks in the literature is that of one-class learning methods. Popular methods include One-Class Support Vector Machines (OCSVM) (Schölkopf et al., 2000), Isolation Forest (Liu et al., 2008), Auto-Encoders (Najafabadi et al., 2015; Sakurada & Yairi, 2014; Zhou & Paffenroth, 2017; Chong & Tay, 2017), Angle-Based Outlier Detection (ABOD) (Kriegel et al., 2008; Pham & Pagh, 2012; Jahromi et al., 2022), and COPula-based Outlier Detection (COPOD) (Li et al., 2020). A less popular but intriguing alternative is provided by Self-Organizing Maps (SOM). A SOM (Kohonen, 1990) is a neural-network-based model for prototype-based clustering, which works by mapping high-dimensional input data into a 2-dimensional space implemented as a grid of neurons called a feature map. The main difference with respect to classical neural networks is that in SOMs the final model is represented by the feature maps instead of a matrix of weights. SOM approaches present the advantage of supporting fully unsupervised model training, as well as the ability to analyze data presenting multiple densities and to visualize it through the learned feature map representations (Qu et al., 2021). However, both one-class learning methods and SOM-based approaches present a number of limitations that constrain their effectiveness in the complex real-world anomaly detection setting described above. First, they are usually not designed for handling large-scale data generated by sensor networks. Second, once the predictive models are trained, a complex threshold configuration for the final predictive function is required, which usually depends on a user-defined setting that is difficult to estimate and is subject to change over time. Third, they do not focus on the explainability of the detected anomalies, assuming that the domain experts will be able to understand the raised alerts and the underlying motivations that triggered them. Fourth, despite their broad applicability in many domains, they are scarcely adopted in real-world domains involving sensor network data, such as smart grids and urban public safety applications.

Although a number of research works have separately addressed such issues, there is a substantial lack of methods that holistically combine effective model capabilities to address the needs of sensor network data analysis. In this paper, we fill this gap by proposing an anomaly detection method that jointly addresses these issues.

Our SOM-based anomaly detection method exploits the attractive properties of SOMs described above and adapts them to fit the context of sensor network data. Specifically, we consider GHSOMs (Growing Hierarchical SOMs), which are particularly suitable for describing the space of normal instances around the neurons and are sensitive to data belonging to low-density regions. When a single SOM is not adequate to describe the normal data, the algorithm allows the model to be extended (with additional rows, columns, or entire SOMs) to better fit the data distribution under analysis. However, since GHSOMs are mostly limited to clustering and data visualization, we extend the Spark-GHSOM (Malondkar et al., 2018) algorithm to learn GHSOMs capable of solving anomaly detection tasks. To support the analysis of large-scale data arising in our domains of interest, we devise distributed algorithms based on the map-reduce programming paradigm, which allow us to perform model training efficiently using multiple computational nodes. After the training stage, our anomaly detection component leverages the learned model to identify anomalies in unseen data points based on their distance relationship with respect to the learned normal data distribution. To overcome the limitation of manually set threshold configurations, we devise an automated threshold tuning strategy that improves the robustness of the method to noisy instances (or anomalies) within the training set. Finally, to support practitioners and end users in better understanding model predictions, we propose an explainability component that leverages feature rankings to reveal which features determined the predicted outcome and should be given higher attention.

Our extensive experiments and comparative analysis with real-world datasets highlight the merits of our proposed method from both a quantitative and a qualitative viewpoint. Specifically, experiments were conducted for different sensor network data applications, i.e., renewable energy analysis and urban physical threat detection, such as vehicle traffic monitoring, pedestrian flow analysis, and monitoring of anomalous pollen levels in the air, which are less commonly explored in the anomaly detection literature despite their real-world relevance. Furthermore, we also analyzed time series of metrics from different Yahoo! web services.

The remainder of the paper is structured as follows. Section 2 surveys relevant related works. Section 3 describes our proposed method in terms of its different components. Section 4 describes the analyzed data and the experimental settings, and discusses the results obtained in our study. Finally, Sect. 5 concludes the paper and provides relevant directions for future work.

2 Background

2.1 Statistical and information-theoretic methods

Statistical approaches consider as outliers those observations that are partially or wholly irrelevant as they are not generated by an assumed stochastic model (Anscombe, 1960). These approaches require a training phase for the statistical model estimation (estimating the distribution parameters) and a testing phase where a test instance is compared to the model to determine whether it is an outlier. Parametric approaches leverage statistical tests, such as Grubbs’ test (Urvoy & Autrusseau, 2014), which assumes a normal data distribution. Parametric approaches also include regression techniques, which fit a regression model on the data (Rousseeuw & Leroy, 2005). A number of techniques assume a Markovian nature of the data when modeling sequential data (McCallum et al., 2000). Since real-world scenarios are often characterized by different distributions, some techniques take this aspect into consideration with mixtures of probability distributions (Eskin et al., 2002). Parameter estimation techniques can also be used to estimate the parameters in each of the above cases (Duda et al., 2006; Jordan & Jacobs, 1994). Non-parametric approaches do not assume any knowledge of the data distribution. One of the most widely used techniques is histogram analysis (Hofmeyr et al., 1998), which is only efficient with univariate data. Some approaches (Javitz et al., 1991) compute histograms for each feature separately and detect outliers independently in each dimension. A popular non-parametric approach is to estimate the probability density function using Parzen windows (Desforges et al., 1998).

From the information theory viewpoint, techniques involve different information-theoretic measures such as entropy, information gain, and information cost, typically in an unsupervised fashion. Lee and Xiang (2001) exploited information measures to detect outliers in a sequence of operating system calls. The main aim is to characterize the regularity of a data set in order to detect outliers as instances that induce irregularities in the data. Arning et al. (1996) proposed an overall dissimilarity measure by exploiting the Kolmogorov data complexity (Li & Vitányi, 1993). He et al. (2005) leverage the Local Search Algorithm (LSA) to identify the instances whose removal reduces the entropy of the remaining data set. In Noble and Cook (2003), the authors discovered patterns in networked data with the aim of detecting changes over time in terms of the previously mentioned information measures. A change in the encoding of a pattern indicates a meaningful change in the data.

Statistical and information-theoretic approaches have strong theoretical foundations and have shown robustness in different applications. However, they present limited effectiveness in the presence of complex sensor network data characterized by time-evolving, multi-density and large-scale data.

2.2 Anomaly detection with learned models

Anomalous data instances could be classified as point, collective, and contextual (Chandola et al., 2009). In this work, we mainly focus on detecting contextual anomalies, where the context is described through the spatial and temporal dimensions of data (Kou et al., 2006; Shekhar et al., 2001). An example could be represented by a sudden change in car traffic level during the weekend in a suburban area of the city.

One-Class Support Vector Machines (OCSVM) (Schölkopf et al., 2000) is a classical method that learns a separating hyperplane in a high-dimensional space. After the training stage, the hyperplane model produced by OCSVM can classify a new data observation as regular/normal or different/anomaly with respect to the training data distribution, according to its geometrical position within the decision boundary.
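As a minimal illustration of this behavior (using scikit-learn's `OneClassSVM`, not an implementation evaluated in this paper, and with toy synthetic data), a model fitted on background data classifies new observations as normal (+1) or anomalous (-1) depending on their position relative to the learned boundary:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))   # background ("normal") data

# nu bounds the fraction of training points allowed outside the boundary.
oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# +1: inside the decision boundary (normal); -1: outside (anomaly).
preds = oc.predict(np.array([[0.0, 0.1],    # close to the training distribution
                             [6.0, 6.0]]))  # far outside the boundary
```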

Isolation Forest (Liu et al., 2008) exploits an ensemble of tree-based models and computes an isolation score for each data observation. Such a score is calculated as the average path length from the root of the tree to the node containing the single observation. The shorter the path, the easier it is to isolate an observation from the others due to significant differences in values w.r.t. training data instances.
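This scoring behavior can be sketched with scikit-learn's `IsolationForest` (an illustrative example on toy data, not the setup used in this paper): the distant point is easier to isolate, so it receives a lower `score_samples` value:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))   # mostly "normal" observations

model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples returns higher values for normal points, lower for anomalies:
# a distant point has a short average path length in the trees.
scores = model.score_samples(np.array([[0.1, -0.2],   # near the training data
                                       [8.0, 8.0]]))  # far away
```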

Methods based on auto-encoders and stacked auto-encoders (Najafabadi et al., 2015; Sakurada & Yairi, 2014; Zhou & Paffenroth, 2017; Chong & Tay, 2017) have demonstrated superior performance by constructing representations with a low reconstruction error through non-linear combinations of input features (Bengio, 2009). Moreover, the rise in popularity of deep learning led to modern approaches for neural network model training, including Stochastic Gradient Descent (SGD), mini-batch training, and adaptive gradients, which resulted in increased efficiency for autoencoder-based anomaly detection approaches, among other neural network architectures (Kashyap, 2022; Kingma & Ba, 2015; Draxler et al., 2018). On the other hand, reconstruction-based approaches based on auto-encoders may present reduced accuracy when anomalies are caused by only a few of the available features, which results in minimal variations of reconstruction scores compared to benign data. The same consideration applies in the context of multi-density data, where a single model may be insufficient to capture multiple distributions.

Angle-Based Outlier Detection (ABOD) (Kriegel et al., 2008; Pham & Pagh, 2012) is another popular approach that computes the variance of weighted cosine scores between each data point and its neighbors and leverages it as the anomaly score. Its key advantages are the ability to efficiently identify outliers in high-dimensional spaces and the consideration of relationships between each point and its neighbors, instead of relationships among neighbors, which has the potential to reduce the number of false positive detections.
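The angle-variance intuition can be sketched as follows (a toy NumPy sketch of the ABOD idea, not the cited implementations; `abod_score` is a hypothetical helper): a point surrounded by its neighbors sees them under widely varying angles, while a point at the border of the data sees them under similar angles, yielding a low variance:

```python
import numpy as np

def abod_score(x, neighbors):
    """Toy sketch of the ABOD idea (hypothetical helper): variance of the
    weighted cosine scores over pairs of difference vectors from x to its
    neighbors. Border points yield a low variance (more anomalous)."""
    diffs = neighbors - x
    vals = []
    for i in range(len(diffs)):
        for j in range(i + 1, len(diffs)):
            a, b = diffs[i], diffs[j]
            # Cosine score weighted by the inverse squared norms.
            vals.append(float(a @ b) / (np.linalg.norm(a) ** 2 * np.linalg.norm(b) ** 2))
    return float(np.var(vals))

ring = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
inner = abod_score(np.array([0.1, 0.0]), ring)   # surrounded: high variance
outer = abod_score(np.array([10.0, 0.0]), ring)  # border point: low variance
```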

Another recent state-of-the-art anomaly detection method is COPula-based Outlier Detection (COPOD) (Li et al., 2020), which models data by constructing an empirical copula, i.e., a multi-variate cumulative distribution function with a uniform marginal probability distribution for each feature in the [0, 1] interval. During detection, the method leverages the copula to predict the tail probabilities of each data sample to determine the degree of its “extremeness”.
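The underlying intuition can be sketched with per-feature empirical CDFs (a toy, left-tail-only sketch of the copula idea, not the reference COPOD implementation, which combines left and right tails):

```python
import numpy as np

def empirical_tail_scores(X):
    """Toy sketch of the copula idea (left tail only, for brevity):
    per-feature empirical CDF values, summed as negative log tail
    probabilities into one "extremeness" score per sample."""
    n, _ = X.shape
    # Empirical CDF per feature: rank / n, in (0, 1].
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    u = ranks / n
    # Small tail probabilities map to large scores.
    return -np.log(u).sum(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(5.0, 1.0, size=(100, 3)),
               [[-10.0, -10.0, -10.0]]])   # one sample deep in the left tail
scores = empirical_tail_scores(X)
```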

Despite the diversity, robustness, and high performance showcased by these methods in a number of domains, they present one or more limitations, namely the inability to deal with large-scale data, the lack of explainability for their predicted anomalies, the dependence on users for threshold settings, and the inability to analyze data presenting multiple densities.

Similarly to auto-encoder neural network models, SOM-based approaches can be trained in a fully unsupervised manner using background data, without explicit use of labels. Another similarity is that, unlike more complex neural network approaches, they both adopt a simple cost function: while auto-encoders adopt the reconstruction error, SOM-based approaches leverage the quantization error. However, GHSOM approaches present a structural advantage over auto-encoder models, since they feature a tree-like, multi-level, expandable model structure that supports the analysis of data at multiple levels of density granularity. The authors in Muñoz and Muruzábal (1998) leverage the interneuron distance matrix and the projection of the trained map via Sammon’s mapping to detect outliers in artificial and real data. The work in Palomo et al. (2010) proposes a hierarchical SOM with a particular focus on the reduction of the number of user-dependent parameters. The authors showcase its effectiveness in detecting anomalies on the KDD Cup 1999 intrusion detection dataset. Another SOM-based work in the same application domain is proposed in Ippoliti and Zhou (2012), where the authors propose four enhancements: threshold-based training, dynamic input normalization, feedback-based quantization, and prediction confidence filtering. Common limitations of these works include the inability to deal with large-scale data and a lack of model explainability capabilities. In this work, we aim to simultaneously address these challenges by proposing a distributed, explainable SOM-based approach for anomaly detection.

2.3 Anomaly detection from sensor networks

Among the many application domains that analyze data generated by sensor networks in order to identify anomalies, below we examine some works specifically related to the application domains considered in this study, namely “anomaly detection in electrical networks”, “detection of anomalies from urban data”, and “detection of anomalies in the use of web services”.

2.3.1 Anomaly detection in electrical networks

The digitization of the energy infrastructure is a process that offers benefits both for consumers and service providers. Thanks first to the advent of industrial control systems, and more recently to the industrial internet of things, a growing level of control and supervision is possible (Borges Hink et al., 2014; Pan et al., 2015a, 2015b, 2015c; Shin et al., 2020). One of the most interesting results is the growing availability of applications for monitoring the huge amount of data generated in real time and promptly identifying malfunctions, thefts, or improper overloads. This has motivated a growing interest in anomaly detection from data generated in power grids (e.g., at power plants, distribution stations, etc.) (Himeur et al., 2020; Su et al., 2023).

In Himeur et al. (2020), the authors examined thirty-one databases with different features, such as the period of collection, geographical location, sampling rate of collected data, number of monitored households, and so forth. Among the different tasks analyzed, the authors also discussed energy consumption datasets for the task of anomaly detection aimed at reducing wasted energy. Similarly, in Su et al. (2023), the authors collected a dataset from real-world industrial solar-cell production lines, analyzed it by learning predictive models for the anomaly detection task, and performed a comparative analysis with four state-of-the-art methods. In De Benedetti et al. (2018), the authors analyzed energy production data streams from photovoltaic systems, comparing the output of a trained model with the observed data in order to obtain vectors of residuals, which are aggregated over one day and analyzed to detect potential system degradation. A Triangular Moving Average (TMA) is considered in the analyses to automatically determine the window size. The proposed anomaly detection method was used to generate daily maintenance alerts. In Reddy Shabad et al. (2021), the authors analyzed smart grid power system faults, focusing on discriminating among normal conditions, natural disturbances, and cyberattacks. In Malki et al. (2022), the authors integrated anomaly detection methods to improve the maintenance and control of power systems as a fundamental part of the smart city concept. The authors in Takiddin et al. (2022) detected electricity theft in smart grids through a deep auto-encoder for anomaly detection.

Although these works are specifically tailored for the analysis of energy data, they present one or more of the following limitations: i) Limited support in providing predictive models capable of explaining the reasons for which anomalies have been detected. ii) Inability to deal with large-scale data. This issue is typically tackled via sampling approaches, which limit their generalization capabilities to a narrow selection of data points. iii) Inability to take into account multi-sourced spatio-temporal information (Corizzo et al., 2021) such as ambient conditions, outside weather footprints, as well as energy plant characteristics such as voltages (Ceci et al., 2020), which would greatly increase the scope and the robustness of anomaly detection models (Himeur et al., 2021).

2.3.2 Urban anomaly detection

Anomaly detection tasks are particularly useful in urban areas, where data is continuously generated by geo-located sensors. Detecting anomalies provides security operators with the opportunity to understand potentially anomalous situations and take the appropriate actions in a timely manner.

Such data refers to physical information (e.g., temperature, number of vehicles crossing an intersection, number of pedestrians in a given area, PM10 level at certain physical positions in the city, etc.). The authors in Zhang et al. (2022) presented a relevant survey on urban anomaly detection. Specifically, they discussed different types of anomalies in the analyzed contexts, i.e., urban anomalies, traffic anomalies, unexpected crowds, environment anomalies, and individual anomalies. Furthermore, the authors emphasized that one of the open challenges that undermines detection accuracy is posed by noisy urban data.

From a methodological viewpoint, SOM-based approaches have shown promise in this domain. In Riveiro et al. (2017), the authors proposed a framework that provides decision support for the exploration of multidimensional road traffic data via visual artifacts. The method for anomaly detection is based on a classical SOM used for clustering. The number of clusters is optimized via Silhouette cluster analysis. This approach is inspired by a previous work presented in Kraiman et al. (2002) that used a classical approach for anomaly detection based on clustering algorithms, SOMs, and Gaussian Mixture Models (GMMs). However, such methods do not present growing capabilities, i.e. the number of neurons in the model is defined at the beginning and does not increase, which results in a significant modeling power reduction in dynamic data scenarios requiring adaptation.

Detecting anomalies with high accuracy in the urban context requires the simultaneous analysis of sensor data at multiple locations, exploiting the temporal and spatial coordinates of the considered observations (Corizzo et al., 2021; Mignone et al., 2022; Sofuoglu & Aviyente, 2022). In Zhang et al. (2019), the authors proposed a decomposition approach to detect urban anomalies of different types, such as abnormal pedestrian flows and traffic accidents with varying locations and times. Specifically, they distinguish between the normal component, i.e., urban dynamics decided by spatio-temporal features, and the abnormal component that is caused by anomalous events.

Overall, anomaly detection with urban data is a challenging task, since data generated by sensors is inherently large-scale, and the spatial proximity of sensors introduces spatial autocorrelation, which is in contrast with the typical assumption that observations are independently and identically distributed (Stojanova et al., 2012). As a result, anomalies such as traffic congestion or large crowds can be difficult to detect due to their rarity, and the fact that their definition varies based on spatial and temporal data characteristics. In other words, a common limitation of anomaly detection methods for urban data is the inability to consider the relationships between different points in space and time and specific domain-dependent characteristics of the anomalies (Sofuoglu & Aviyente, 2022). In addition, methods are often unable to deal with the additional challenges presented by large-scale data and model explainability.

2.3.3 Anomaly detection in the use of web services

Web applications generate real-time web content from online activities including Internet banking, email, social networking, and search engines. Such web services can be encoded as data streams for monitoring purposes. Anomaly detection tasks on Key Performance Indicators (KPIs) of web applications, e.g., number of orders, service response time, CPU utilization, network throughput, and page view counts, have received attention in several applications aiming to proactively protect web services from system failures (Tama et al., 2020). These applications are enabled by sensor networks allowing the collection of real-time data from different sources, such as physical sensors, mobile devices, or other data sources. Specifically, the bulk of this data is generated and transmitted by wireless sensor networks (Duan et al., 2019), which can be used to measure and evaluate the performance of web services through KPIs.

In Zhang et al. (2022), the authors proposed an unsupervised method, based on a variational auto-encoder, for learning predictive models that can be used to timely detect anomalies in KPI indicators. The method consists of a module for offline pre-training of shape models through clustering; an adaptive transfer learning strategy then selects only the shape models closest, in terms of centroid distance, to the online stream. In a similar manner, transfer learning was also considered in Duan et al. (2019) for anomaly detection tasks in web services: the method clusters historical KPIs and then feeds all KPIs in each cluster into a shared-hidden-layers variational auto-encoder model. Variational auto-encoders are also exploited effectively in Xu et al. (2018), where the authors analyzed the noise distribution in KPI anomaly detection problems.

In Hagemann and Katsarou (2020), the authors evaluated PCA, auto-encoders, and long short-term memory encoder-decoders for anomaly detection in cloud-specific metrics of various Yahoo! services. For the Yahoo! Webscope S5 dataset, the results showed that PCA is the most robust and fastest way to detect anomalies, while neural networks without regularization tend to overfit the data.

While these works are clearly effective for anomaly detection in the use of web services, they do not have growing capabilities, do not provide explanations associated with the detected anomalies and do not provide scalable solutions to deal with the challenges presented by large-scale data.

3 Proposed method

In this section we describe in detail our proposed distributed and explainable anomaly detection method. First, we provide an overview of SOM-based model training. Subsequently, three subsections focus on the training process in detail, how threshold auto-tuning is performed, and our approach to enrich model predictions with explanations.

A SOM layer contains a two-dimensional grid of neurons, where each neuron is described by a weight vector. Conceptually, the training stage incrementally presents instances to the model and takes place for a given number of iterations, i.e., epochs. During this process, the neuron whose weight vector is closest to a given instance (also known as the winner neuron) is identified, and its neighborhood is adapted to move closer to the instance.

For this purpose, the SOM and GHSOM algorithms leverage a metric called Mean Quantization Error (MQE) (Dittenbach et al., 2000). The MQE of a neuron is calculated as the average deviation of the input instances mapped to it from the neuron itself. Another important concept is the MQE associated with the entire SOM layer, which is calculated as the average MQE of all its constituent neurons.
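These two quantities can be sketched as follows (a minimal NumPy illustration with hypothetical helper names, not the paper's distributed implementation):

```python
import numpy as np

def neuron_mqe(weight, mapped_instances):
    """MQE of one neuron: average Euclidean deviation of the instances
    mapped to it from its weight vector (hypothetical helper name)."""
    if len(mapped_instances) == 0:
        return 0.0
    return float(np.mean(np.linalg.norm(mapped_instances - weight, axis=1)))

def layer_mqe(weights, assignments, X):
    """MQE of a whole SOM layer: average of the per-neuron MQEs over
    the neurons that have at least one mapped instance."""
    return float(np.mean([neuron_mqe(w, X[assignments == k])
                          for k, w in enumerate(weights)
                          if np.any(assignments == k)]))

weights = np.array([[0.0, 0.0], [10.0, 10.0]])          # two neurons
X = np.array([[0.0, 1.0], [1.0, 0.0], [10.0, 11.0]])    # three instances
# Winner neuron of each instance: the neuron with the shortest distance.
assignments = np.array([int(np.argmin(np.linalg.norm(weights - x, axis=1)))
                        for x in X])
```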

The first step for model training is to compute the MQE of the level-0 neuron with respect to all input instances, denoted as \(mqe_0\).

Subsequently, a first neuron map of \(2 \times 2\) neurons is created at level-1. This map is trained according to the conventional SOM training process. Once the training process is complete, the map is analyzed and the MQE for the map m denoted as \(MQE_m\) is calculated. High values of \(MQE_m\) indicate that the map m does not accurately represent the input data and, therefore, may require more neurons to reach this goal, which is formalized by the \(\uptau _1\) training criterion in Eq. (1):

$$\begin{aligned} MQE_m < \uptau _1 \times mqe_p, \end{aligned}$$
(1)

where \(mqe_p\) represents the MQE of the parent neuron responsible for the expansion of the map m, while \(\uptau _1\) is a weight governing the sensitivity of the expansion of a single SOM, i.e., the higher \(\uptau _1\), the smaller the SOM, which results in a faster model training phase.

The map growing process has the goal of reaching the condition in Eq. (1). Specifically, to realize the map growing process, the neuron presenting the highest MQE is identified and denoted as the error neuron e. Subsequently, its most dissimilar direct neighbouring neuron d is selected, and a new row or column of neurons is added into the grid between e and d.
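The growth step just described can be sketched as follows (an illustrative NumPy sketch with a hypothetical `grow_step` helper, not the paper's implementation; following the GHSOM algorithm, the inserted neurons are initialized as the average of the adjacent rows or columns):

```python
import numpy as np

def grow_step(grid, errors):
    """One illustrative GHSOM growth step (hypothetical helper).
    grid has shape (rows, cols, dim); errors holds each neuron's MQE.
    A new row or column of averaged neurons is inserted between the
    error neuron e and its most dissimilar direct neighbor d."""
    r, c = np.unravel_index(np.argmax(errors), errors.shape)   # error neuron e
    # Direct (4-connected) neighbors of e within the grid bounds.
    neighbors = [(r + dr, c + dc)
                 for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                 if 0 <= r + dr < grid.shape[0] and 0 <= c + dc < grid.shape[1]]
    # Most dissimilar direct neighbor d of e.
    d = max(neighbors, key=lambda n: np.linalg.norm(grid[n] - grid[r, c]))
    if d[0] != r:   # d lies above/below e: insert a row between them
        i = max(r, d[0])
        new = (grid[i - 1] + grid[i]) / 2.0   # average of the adjacent rows
        return np.insert(grid, i, new, axis=0)
    else:           # d lies left/right of e: insert a column
        j = max(c, d[1])
        new = (grid[:, j - 1] + grid[:, j]) / 2.0
        return np.insert(grid, j, new, axis=1)

grid = np.array([[[0.0], [1.0]],
                 [[10.0], [0.5]]])            # 2 x 2 map of 1-D neurons
errors = np.array([[0.0, 0.0], [5.0, 0.0]])   # neuron (1, 0) has the top MQE
grown = grow_step(grid, errors)               # a new row is inserted
```

In the full algorithm this step alternates with retraining until the \(\uptau _1\) criterion of Eq. (1) is met.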

Each newly inserted neuron is initialized as the average of the weight vectors of its adjacent neighbours. By doing so, the process yields an updated (grown) layer, which is evaluated and trained once again. This dual process of growth and training continues until the \(\uptau _1\) criterion is met. When the \(\uptau _1\) criterion is satisfied, each neuron in the map is then analyzed according to the criterion in Eq. (2) (\(\uptau _2\) criterion):

$$\begin{aligned} mqe_k < \uptau _2 \times mqe_0, \end{aligned}$$
(2)

where \(mqe_k\) represents the MQE of the \(k^{th}\) neuron under analysis, and \(\uptau _2\) is a weight governing the sensitivity of the hierarchy expansion, i.e., the higher \(\uptau _2\), the smaller the hierarchy, which results in a faster model training phase.

The neurons which do not satisfy the \(\uptau _2\) criterion are expanded into new maps at the next level of hierarchy. These new maps undergo the same process of training, growth and hierarchical expansion as the level-1 map.

The training of the GHSOM model stops when all the neurons in the maps at the last level of the hierarchy satisfy Eq. (2). The resulting GHSOM structure thus contains multiple SOM layers arranged as a hierarchy, with each SOM representing the data at a finer granularity than its parent layer.
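The hierarchical expansion decision of Eq. (2) reduces to a simple per-neuron check (a minimal sketch with a hypothetical helper name; in the full algorithm each selected neuron spawns a new \(2 \times 2\) child map that recursively undergoes the same training, growth, and expansion process):

```python
def neurons_to_expand(neuron_mqes, tau2, mqe0):
    """tau_2 criterion of Eq. (2): neurons whose MQE does not fall below
    tau2 * mqe0 are expanded into child maps at the next hierarchy level
    (hypothetical helper name; the recursion itself is omitted)."""
    return [k for k, mqe_k in enumerate(neuron_mqes)
            if not mqe_k < tau2 * mqe0]

# With tau2 = 0.1 and mqe0 = 1.0, neurons 0 and 2 still violate Eq. (2)
# and would each spawn a new child map at the next level.
expand = neurons_to_expand([0.50, 0.05, 0.20], tau2=0.1, mqe0=1.0)
```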

3.1 Detailed training process

The first step of the GHSOM algorithm consists of computing the global value of dissimilarity in the input dataset, denoted by \(mqe_0\) (i.e., the Mean Quantization Error of the level-0 neuron). First, the mean vector \(m_0\) of all input instances in the dataset is computed. Second, to obtain the value of \(mqe_0\), the mean distance of the input vectors from the mean vector is computed as follows:

$$\begin{aligned} mqe_0 = \frac{1}{n} \sum _{x_{(\cdot ,i)} \in C_n} \Vert m_0 - x_{(\cdot ,i)} \Vert , \end{aligned}$$
(3)

where \(C_n\) denotes the set of n input instances, \(x_{(\cdot ,i)}\) denotes the vector representing the i-th instance currently presented to the model, and \(mqe_0\) denotes the overall dissimilarity of the input dataset. In our work, we replace \(mqe_0\) with the classical variance, denoted as \(var_0\), as a measure of deviation that is more robust to outliers.
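The computation of \(mqe_0\) in Eq. (3), together with one plausible reading of the variance-based alternative \(var_0\) (the mean squared distance from \(m_0\); the exact definition is not spelled out in the text), can be sketched as:

```python
import numpy as np

def mqe0(X):
    """Overall dissimilarity of the dataset (Eq. 3): the mean
    Euclidean distance of the input vectors from their mean m_0."""
    m0 = X.mean(axis=0)
    return np.linalg.norm(X - m0, axis=1).mean()

def var0(X):
    """Hypothetical reading of the variance-based alternative:
    the mean squared distance from m_0."""
    m0 = X.mean(axis=0)
    return (np.linalg.norm(X - m0, axis=1) ** 2).mean()
```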

Subsequently, in the training process, a \(2 \times 2\) SOM layer is created. If this is the layer at level 1, the neuron weight vectors are initialized at random. For each subsequent level, SOM layers can be initialized as a function of their parent neuron and their neighbours. To realize this goal, we adopt the approach devised in Chan and Pampalk (2002). Figure 1 graphically describes the training of a single SOM starting from level 1.

Fig. 1
figure 1

Training process of a single SOM layer showing the adaptation of the layer throughout the presentation of multiple training instances (normal and noisy)

Let epochs be the number of desired iterations for SOM model training. For each training instance \(x_{(\cdot ,i)}\) in the dataset, the winner neuron \(w(x_{(\cdot ,i)})\) is identified. This neuron is used to adapt a SOM S at epoch t (hereinafter S(t)), represented as a matrix of L-dimensional neurons defined as:

$$\begin{aligned} S(t) = \{m_{(\cdot ,k,k')}(t)\}_{(k,k')}, \end{aligned}$$
(4)

where \(m_{(\cdot ,k,k')}(t)\) represents an L-dimensional neuron at position \((k, k')\) of S(t). The adaptation process can be formalized as:

$$\begin{aligned} \small m_{(\cdot ,k,k')}(t+1) = m_{(\cdot ,k,k')}(t) + \left[ \displaystyle \frac{\sum _{i=1}^n h(w(x_{(\cdot ,i)}),m_{(\cdot ,k,k')}(t)) \times (x_{(\cdot ,i)}-m_{(\cdot ,k,k')}(t))}{\sum _{i=1}^n h(w(x_{(\cdot ,i)}),m_{(\cdot ,k,k')}(t))} \right] _{l=1,\ldots ,L}, \end{aligned}$$
(5)

where \(h(\cdot , \cdot )\) is the neighborhood function defined as:

$$\begin{aligned} h(m^{(1)}_{(\cdot ,k_1,k_1')}(t),m^{(2)}_{(\cdot ,k_2,k_2')}(t)) = \exp \left( - \frac{ {(|k_1-k_2|+ |k_1'-k_2'|)}^2 }{2 \sigma (t)^2} \right) , \end{aligned}$$
(6)

and \(\sigma (t)\) corresponds to the width of the neighborhood function:

$$\begin{aligned} \sigma (t)= \left( \frac{\sqrt{R^2 + C^2}}{2}\right) \times \exp \left( -\frac{t}{{epochs}}\times {\log\left( \frac{\sqrt{R^2 + C^2}}{2}\right) } \right) , \end{aligned}$$
(7)

where R and C denote the number of rows and columns in the SOM, respectively.
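Eqs. (5)-(7) can be illustrated with the following sketch of a single batch adaptation step. Function names and the array layout are our own assumptions, and the neighborhood distance between two neurons is taken as the Manhattan distance between their grid positions:

```python
import numpy as np

def sigma(t, epochs, R, C):
    """Neighbourhood width (Eq. 7): shrinks from sqrt(R^2+C^2)/2
    at t=0 towards 1 at t=epochs."""
    s0 = np.sqrt(R ** 2 + C ** 2) / 2.0
    return s0 * np.exp(-(t / epochs) * np.log(s0))

def batch_update(weights, X, t, epochs):
    """One batch adaptation step of a SOM layer (Eqs. 5-6, sketch).

    weights: (R, C, L) neuron grid; X: (n, L) training instances.
    Each neuron moves towards a neighbourhood-weighted mean of the
    instances, weighted by its grid distance to each instance's winner.
    """
    R, C, L = weights.shape
    flat = weights.reshape(-1, L)
    # winner (best-matching) neuron index for every instance
    winners = np.argmin(np.linalg.norm(X[:, None, :] - flat[None], axis=2), axis=1)
    wk, wk_ = np.unravel_index(winners, (R, C))
    s = sigma(t, epochs, R, C)
    new = weights.copy()
    for k in range(R):
        for k_ in range(C):
            d = np.abs(wk - k) + np.abs(wk_ - k_)   # grid (Manhattan) distance
            h = np.exp(-(d ** 2) / (2 * s ** 2))    # neighbourhood function (Eq. 6)
            new[k, k_] = weights[k, k_] + \
                (h[:, None] * (X - weights[k, k_])).sum(axis=0) / h.sum()
    return new
```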

Once the SOM has been trained, the growing process (consisting of the creation of new rows and columns in the SOM) and the hierarchical growth process (consisting of the addition of new SOMs representing data at a finer level of granularity) follow the GHSOM training procedure described above. Each SOM layer is evaluated against the \(\uptau _1\) criterion for two-dimensional growth. Once the \(\uptau _1\) criterion is satisfied (Eq. 1), each neuron is evaluated for hierarchical growth according to the \(\uptau _2\) criterion (Eq. 2). Algorithm 1 and Fig. 2 describe the GHSOM training process.

Fig. 2
figure 2

A graphical representation of the GHSOM training phase following the \(\uptau _1\) and \(\uptau _2\) criteria. While \(\uptau _1\) controls the horizontal (rows) and vertical (columns) growth of a single SOM layer, \(\uptau _2\) controls the hierarchical growth, generating new SOM layers. The entire model is capable of analyzing data at multiple densities

Algorithm 1
figure a

GHSOM model training

Algorithm 2
figure b

Distributed SOM training

3.2 Multi-density anomaly detection

Once the entire set of training instances is processed, the learned SOM, with its constituent neurons, represents a spatial memory of the training instances. This allows our model to be trained on instances that mostly refer to normal cases, possibly marginally affected by noise (background data)Footnote 2, without representing anomalies in the model. As a result, the final predictive model can discriminate between normal and anomalous cases based on the distance between a new unlabeled instance and the part of the model that describes the subset of normal instances most closely related to it.

Once the hierarchical structure of SOMs is established, it can be used to tackle the anomaly detection task. In particular, when a new unlabeled instance is provided to the hierarchy, the algorithm looks for its winner neuron, which is then used to predict the class (anomaly/not anomaly) of the new instance, based on the distance between the instance and the winner neuron.

More formally, let \(x_{(\cdot , i)}\) be the new example to be considered, and \({\mathcal {N}}\) the entire set of neurons of the trained hierarchy of SOMs, defined as \({\mathcal {N}} = \bigcup _{S \in {\mathcal {G}}} \{ m_{(\cdot , k, k')} \in S \}\), where S is a single SOM after the training process is completed, and \({\mathcal {G}}\) represents the entire hierarchical GHSOM model with its constituent SOMs. The closest winner neuron to \(x_{(\cdot , i)}\) is defined as:

$$\begin{aligned} w(x_{(\cdot , i)}) = \mathop {\hbox {argmin}}\limits _{w \in {\mathcal {N}}} \{ dist(x_{(\cdot , i)},w) \}. \end{aligned}$$
(8)

Therefore, the unlabeled instance is considered an anomaly if the following inequality holds:

$$\begin{aligned} dist(x_{(\cdot , i)}, w(x_{(\cdot , i)})) > mqe_0 + tf \times \sigma , \end{aligned}$$
(9)

where dist(a, b) is the Euclidean distance between two input vectors a and b, \(\sigma\) is the standard deviation of the distances between the training instances and the neurons after training, and tf is the threshold factor governing the sensitivity of the final predictive function.
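The decision rule of Eqs. (8)-(9) can be sketched as follows; the function name and argument layout are illustrative, not taken from the original implementation:

```python
import numpy as np

def predict(x, neurons, mqe0, sigma_d, tf):
    """Anomaly decision rule of Eqs. (8)-(9), sketched.

    neurons: (N, L) array stacking every neuron of the trained hierarchy.
    sigma_d: standard deviation of the training-instance/neuron distances.
    Returns (is_anomaly, distance to the winner neuron).
    """
    d = np.linalg.norm(neurons - x, axis=1)
    dist_w = d.min()  # distance to the closest (winner) neuron, Eq. (8)
    return dist_w > mqe0 + tf * sigma_d, dist_w
```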

Conceptually, tackling anomaly detection with this hierarchical spatial memory makes it possible to naturally fit multi-density distributions, which can be composed of sub-distributions. In this context, samples can assume values within sub-ranges of a larger distribution, and anomalies may be hidden at different density levels. The multi-density distribution can be formalized by leveraging the notion of a mixture of distributions. More formally, assuming that the underlying data distribution is normal, a multi-density distribution can be defined using the following notation:

$$\begin{aligned} F(\textbf{X}) = \sum _{i=1}^{D} w_i \frac{\exp \left( -\frac{1}{2}(\textbf{X} - \mathbf {\mu _i})^T\mathbf {\Sigma _i}^{-1}(\textbf{X} - \mathbf {\mu _i})\right) }{\sqrt{(2\pi )^p |\mathbf {\Sigma _i}|}}=\sum _{i=1}^{D} w_i \, F_i(\textbf{X}), \end{aligned}$$
(10)

where \(F(\textbf{X})\) is the (multi-density) multidimensional probability density function (PDF), which combines several (single-density) multidimensional normal distributions \(F_i(\textbf{X})\). In the formula, \(\textbf{X} \in {\mathbb {R}}^p\) represents a data point as a p-dimensional vector, D is the total number of single-density multidimensional distributions, and \(\mathbf {\mu _i}\) and \(\mathbf {\Sigma _i}\) are the p-dimensional mean vector and the positive definite covariance matrix of size \(p \times p\) of the i-th distribution, respectively. \(|\mathbf {\Sigma _i}|\) represents the determinant of the covariance matrix. \(w_i\) represents the weight assigned to the i-th multidimensional single-density normal distribution, governing the various shapes and peaks that \(F(\textbf{X})\) can assume. A graphical representation of the resulting multi-density distribution \(F(\textbf{X})\) is depicted in Fig. 3.
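A multi-density dataset following Eq. (10) can be generated by drawing each point from one of the mixture components, chosen with probability \(w_i\). The sketch below uses hypothetical parameters for two components with different spreads:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multi_density(n, weights, means, covs):
    """Draw n points from the Gaussian mixture of Eq. (10): each point
    is generated by the i-th component with probability w_i."""
    weights = np.asarray(weights, dtype=float)
    comp = rng.choice(len(weights), size=n, p=weights / weights.sum())
    return np.array([rng.multivariate_normal(means[i], covs[i]) for i in comp])

# two well-separated densities with different spreads (hypothetical parameters)
X = sample_multi_density(
    500,
    weights=[0.7, 0.3],
    means=[np.zeros(2), np.array([10.0, 10.0])],
    covs=[np.eye(2), 0.1 * np.eye(2)],
)
```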

Fig. 3
figure 3

An example of a multi-density distribution following Eq. (10)

We note that, in our work, we assume that all parameters of the multi-density distribution are not known in advance, but can be automatically learned by the GHSOM model during the training process. We argue that identifying anomalies with a hierarchy of SOM models should, in principle, allow for a more precise detection of anomalies than a single flat model, which is the most widely adopted approach.

If we refer to a single-density distribution \(F_i\) from the entire multi-density distribution \(F(\textbf{X})\), a data point is classified as normal by the GHSOM model if it can be recognized by at least one single-density distribution. Therefore, the set of normal data points can be defined as

$$\begin{aligned} N = \bigcup _{i=1}^{D} N_i, \text { where } N_i \subset {\mathbb {R}}^p \text { is the set of examples generated according to } F_i \ . \end{aligned}$$
(11)

Therefore, we can define the set of anomalous instances as \(A = {\mathbb {R}}^p \setminus N\), that is, the set of all possible p-dimensional data points that are out-of-distribution w.r.t. \(F(\textbf{X})\) or, in other terms, that do not follow any single-density distribution \(F_i(\textbf{X})\).

This aspect of our method is particularly relevant considering that multi-density data and anomalies can be identified in many real-world applications with sensor data contexts. In vehicular traffic analysis, for instance, levels of traffic could be minimal during the night and reach their peak at noon, due to events such as the end of the school day. The same phenomenon can be observed in domains such as pedestrian flows, brightness levels, and so forth. In addition to variations of behavior due to different temporal phases, multiple densities can be also identified in geo-distributed settings. In these settings, multiple geographic locations naturally exhibit a varying degree of intensity for a phenomenon under analysis, due to their inherent spatial characteristics and external factors such as weather conditions, which are also subject to seasonal dependencies. Examples also include renewable energy production and air pollen distribution.

3.3 Threshold autotuning for anomaly detection

The key idea of our autotuning approach is to leverage the predictive function to compute distances between neurons (normal prototypes) and unlabeled instances. The assumption is that the distance between anomalous instances and prototypes is, in principle, greater than the distance between normal cases and prototypes. Following this assumption, our method identifies the correct threshold factor (tf) used in the predictive function according to a false positive rate \((fp\%)\) tolerance within the training set.
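A minimal sketch of this autotuning idea follows; the grid of candidate tf values and the function name are our own assumptions:

```python
import numpy as np

def autotune_tf(train_dists, mqe0, sigma_d, fp_tol,
                tf_grid=np.arange(0.0, 10.0, 0.1)):
    """Pick the smallest threshold factor tf whose false-positive rate
    on the (assumed normal) training set does not exceed the fp%
    tolerance (sketch).

    train_dists: distance of every training instance to its winner neuron.
    """
    for tf in tf_grid:
        fp_rate = (train_dists > mqe0 + tf * sigma_d).mean()
        if fp_rate <= fp_tol:
            return tf
    return tf_grid[-1]
```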

Algorithms 3 and 4 and Fig. 4 further illustrate the tf autotuning process w.r.t. the considered \(fp\%\).

Fig. 4
figure 4

The tf autotuning phase to determine the final predictive function w.r.t. the considered \(fp\%\) tolerance. From left to right, it is possible to observe a reduction of the fp rate, achieved through threshold adaptation. This process is illustrated by the growing radius of the decision boundary of the model

Since the precise configuration of tf is often impractical and error-prone in real settings, this step is crucial to properly calibrate the predictive function. Domain experts and final users can more easily understand \(fp\%\) as the rate of false alarms, which corresponds to the sensitivity of the final predictive model. Furthermore, \(fp\%\) can be configured to account for noisy or contaminated instances residing in the training data. Consider a case where background data only contains normal cases (free of anomalies): a straightforward choice is \(fp\% = 0\), so that no false alarms are issued on training data. When background data contains noise or contamination, setting a low percentage of false alarms, i.e., \(fp\% = 1,2,\dots ,n\), where n is the estimated contamination rate, allows the model to be more robust and account for the estimated anomalies residing in the training data.

Once the ratio of acceptable false positives w.r.t. the training set cardinality is achieved, the final predictive model is configured for detecting anomalies in real-domain sensor network data scenarios. Obviously, the higher \(fp\%\), the lower the precision and the higher the recall of anomalies. In any case, \(fp\%\) is independent of the data distribution, while tf is not. A graphical representation of the predictive phase is provided in Fig. 5.

Fig. 5
figure 5

A graphical illustration of the predictive phase. At inference time, all instances within the model’s decision boundary are considered normal, while all instances outside the decision boundary are classified as anomalies. Geometrically, the decision boundary is defined through the radius (right-hand side of Eq. 9), while the prediction is based on the distance between the instance and the winning neuron (left-hand side of Eq. 9)

3.4 Explainable anomaly detection

The anomaly detection step supports two types of output, depending on the desired level of detail. The simplest provides the classification of an unlabeled instance in the form of a Boolean response (normal/anomaly). This type of output is useful to raise alerts when the instance is an anomaly; its drawback is that interpreting the raised alert could be difficult for domain experts. To deal with this issue, we propose a second type of output that combines the Boolean response with a feature ranking indicating the importance that each feature had in the anomaly identification process. Feature importance is a numerical value between 0 and 1, indicating how anomalous the value expressed by the feature is with respect to the training data. The sum of all feature importance values in the ranking is equal to 1. To estimate feature importance, we compute the distance between the instance under analysis and the winner neuron. Specifically, in a normalized space, the ranking is proportional to the contribution provided by each single vector component to the Euclidean distance between \(x_{(\cdot , i)}\) and \(w(x_{(\cdot , i)})\). More formally, the feature importance function for the instance \(x_{(\cdot , i)}\), \(f_{imp}(x_{(\cdot , i)}, l)\), is computed as follows:

$$\begin{aligned} f_{imp}(x_{(\cdot , i)}, l) = \frac{(x_{(\cdot , i)}[l] - w(x_{(\cdot , i)})[l])^2 }{ \sum _{l'=1}^L (x_{(\cdot , i)}[l'] - w(x_{(\cdot , i)})[l'])^2} \end{aligned}$$
(12)

where l represents the feature index, \(l' \in \{1,2, ..., L\}\) a varying feature index, L the feature set cardinality, while the ranking is defined as follows:

$$\begin{aligned} rank(x_{(\cdot , i)}) = \nabla (\{f_{imp}(x_{(\cdot , i)}, 1), f_{imp}(x_{(\cdot , i)}, 2), ..., f_{imp}(x_{(\cdot , i)}, L)\}) \end{aligned}$$
(13)

where \(\nabla\) is a descending ordering operator.
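Eqs. (12)-(13) translate directly into code. The sketch below (function name ours) returns both the importance vector and the descending feature order:

```python
import numpy as np

def feature_ranking(x, w):
    """Per-feature importance (Eq. 12) and descending ranking (Eq. 13).

    Each importance is the share of the squared Euclidean distance
    between the instance x and its winner neuron w contributed by one
    feature; the importances sum to 1.
    """
    contrib = (x - w) ** 2
    imp = contrib / contrib.sum()
    order = np.argsort(-imp)  # descending ordering operator (Eq. 13)
    return imp, order
```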

This approach effectively supports the identification of the feature(s) that contributed the most to detecting the anomaly and, therefore, enhances our method with explainability capabilities. Algorithms 4 and 5 include the pseudo-code for the prediction and the explainability phases of our method. In Fig. 6, we graphically illustrate the instance-based feature ranking process for each detected anomaly.

Algorithm 3
figure c

Algorithm 4
figure d

Algorithm 5
figure e
Fig. 6
figure 6

A graphical illustration of the feature ranking for anomaly explanation in two dimensions (\(L=2\)). The features F1 and F2 have an importance of 0.75 and 0.25, respectively, for the detected anomaly in the top figure, while in the bottom figure, F1 and F2 have an importance of 0.2 and 0.8, respectively

4 Experiments

In this section, we present our experiments and discuss their quantitative and qualitative outcomes. We considered different application domains that could be supported by anomaly detection tasks. First, quantitative analyses were conducted for anomaly detection in the context of electrical grids with two real-world datasets (Wind NREL, PV Italy). To set up the training regimen in these experiments, we adopted sliding windows of different sizes (Gama, 2010). Second, we carried out experiments for urban anomaly detection by analyzing two real-world datasets: pedestrian flows in the city of Oslo and vehicular traffic flow in an Italian city. The training regimen followed for this task is that of landmark windows, where models are trained with all data available until a given time point (Gama, 2010).

Third, the experiments with landmark windows for anomaly detection in web services are presented considering the Yahoo! dataset (Hagemann & Katsarou, 2020; https://research.yahoo.com).

Further experiments are reported in Appendices 1 and 2. Specifically, in Appendix 1, we present the scalability analysis conducted to emphasize the capability of the proposed method to distribute the computational workload over multiple nodes. In Appendix 2, we present qualitative analyses for detecting anomalous allergenic pollens in the air in the Veneto region of Italy, emphasizing the capability of the proposed method to discriminate among the different types of anomalies occurring in this context.

4.1 Quantitative results

Comparative analyses were performed to quantitatively assess the effectiveness of the proposed method when compared with state-of-the-art anomaly detection methods. In line with our discussion of existing works, we considered popular methods from the most representative class of approaches for the task of interest in our study, i.e., one-class learning. Indeed, this class of approaches offers the flexibility to learn a model from an initial (regular) data distribution that is able to flag data that significantly differs from the learned distribution. In particular, we considered five well-known and widely adopted competitor methods: OCSVM (Schölkopf et al., 2000), Isolation Forest (Liu et al., 2008), ABOD (Kriegel et al., 2008; Pham & Pagh, 2012; Jahromi et al., 2022), COPOD (Li et al., 2020), and an Auto-encoder architecture that detects anomalies based on the reconstruction error (Beggel et al., 2019; Zhou & Paffenroth, 2017). These methods were configured by performing ablation analyses on a set of parameter values suggested in the respective original papers. Specifically, for OCSVM, we considered two different kernel functions, i.e., Linear and Radial Basis Function (RBF), and we varied the \(\nu\) parameter within the set \(\nu \in \{ 0.5, 1.0 \}\).

For Isolation Forest, we considered the number of trees in the ensemble \(n\_estimators \in \{10, 20, 50\}\), while the whole feature set was considered for every single tree.

For ABOD and COPOD, we used the default configuration proposed in the pyod (Zhao et al., 2019) library. For ABOD, the number of neighbors considered for each data point is set to 10 (Zhao et al., 2019).

For the auto-encoders, we used the architecture suggested in Bengio (2012). Specifically, we used a standard contractive model architecture with four layers using the Sigmoid activation function. Additional experiments using ReLU activations yielded neither a significant reduction in model training time nor differences in model performance. We experimented with different negative powers of 10 for the configuration of the \(learning\_rate\) and with different powers of 2 for the \(batch\_size\). Preliminary experiments suggested that the different configurations did not provide a significant difference in terms of performance metric values. Therefore, the experiments were executed with the following configuration: \(epochs=50\), \(learning\_rate=0.0001\), \(batch\_size=32\). Since auto-encoders are sensitive to the embedding space dimensionality, we considered two sizes proportional to the feature set cardinality of the input space. Specifically, let \(orig\_dim\) be the number of features of the original dataset; we considered embedding spaces of size \(dim \in \{orig\_dim/4, orig\_dim/2\}\). This choice allowed us to be flexible with respect to the different feature dimensionalities of the different datasets. During the inference phase, if the reconstruction error of an unlabeled instance exceeds the average reconstruction error observed on the training set by more than \(p \times \sigma\) (corresponding to a 3-sigma rule when \(p=3\)), the instance is marked as an anomaly.
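The reconstruction-error criterion used for the Auto-encoder baseline can be sketched as follows, assuming per-instance reconstruction errors have already been computed (the function name is ours):

```python
import numpy as np

def ae_flag(rec_errors_train, rec_error_new, p=3):
    """p-sigma rule for the Auto-encoder baseline: flag an instance
    whose reconstruction error exceeds the mean training reconstruction
    error by more than p standard deviations (sketch)."""
    mu, sd = rec_errors_train.mean(), rec_errors_train.std()
    return rec_error_new > mu + p * sd
```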

In our proposed method, the \(\uptau _1\) and \(\uptau _2\) parameters were set considering a range of values balancing computational time, model complexity (number of neurons), and model accuracy. We selected \(\uptau _1, \uptau _2 \in \{0.9, 0.8, 0.7, 0.6, 0.5\}\). This choice takes into account that model training time becomes unsustainable with values smaller than 0.5, since the quality criteria become too difficult to satisfy, and that, on the other hand, accuracy is likely negatively affected by values higher than 0.9, due to the shallow model that this choice implies. After preliminary experiments, the final configurations used in the experiments reported in our results are: Wind NREL (\(\uptau _1=0.9\), \(\uptau _2=0.9\)), PV Italy (\(\uptau _1=0.7\), \(\uptau _2=0.9\)), Oslo Pedestrian Flow (\(\uptau _1=0.5\), \(\uptau _2=0.9\)), vehicular traffic (\(\uptau _1=0.8\), \(\uptau _2=0.9\)), and Yahoo! web services (\(\uptau _1=0.8\), \(\uptau _2=0.8\)).

4.1.1 Energy results

In this section, we describe the experiments and analyze the results extracted for anomaly detection in the context of electrical grids. To this aim, we considered the following two datasets.

  • Wind NREL. This datasetFootnote 3 includes two years of wind speed and power production time series from five wind farms, recorded every 10 minutes from January 1, 2005 to December 31, 2006. The data was then aggregated at an hourly level.

  • PV Italy. The dataset includes data collected every 15 minutes (from 2:00 AM to 8:00 PM, every day) by sensors at 17 photovoltaic power plants located in Italy, spanning from January 1\(^{\textrm{st}}\), 2012 to May 4\(^{\textrm{th}}\), 2014. Anomalies consist of perturbations of the correct attribute values, applied to 25% of the instances and 50% of the features. Further information on the data preprocessing steps can be found in Corizzo et al. (2021); Ceci et al. (2016).

For both datasets, we consider the following features: latitude and longitude of each plant; day and hour; altitude and azimuth; weather conditions, i.e., ambient temperature, irradiance, pressure, wind speed, wind bearing, humidity, dew point, cloud cover, and a descriptive weather summary. Weather conditions are either measured (training phase) or forecasted (detection phase). In particular, all weather observations were extracted using the Forecast.io API, except for the expected altitude and azimuth, which were extracted from SunPositionFootnote 4, and the expected irradiance (PV Italy dataset only), that was extracted from PVGISFootnote 5.

We performed experiments considering sliding windows of 60 consecutive days for model training. Once models are trained, their anomaly detection performance is assessed on data observed the following day (61st day), considered as the prediction day. Experiments are repeated 10 times with 10 distinct sliding window selections, and the obtained results are averaged.

Figures 7 and 8 show the detailed results, in terms of macro F1-Score, for each of the 10 considered prediction days (for Wind NREL and PV Italy, respectively), while Fig. 9 shows the ablation analyses according to different \(fp\%\) tolerance settings chosen for the predictive models. According to our ablation analyses, we selected \(2\%\) and \(9\%\) of false positive tolerance for Wind NREL and PV Italy, respectively. On average, with the Wind NREL dataset, the proposed method achieved an improvement of \(9.5\%\) w.r.t. the second ranked method OCSVM. Regarding PV Italy, our method achieved an improvement of \(3.7\%\) compared to the second ranked method OCSVM (see Table 3).

Fig. 7
figure 7

Results in terms of macro F1-Score for each of the considered prediction days (Wind NREL dataset). The performance curves reported consider the best configuration for each method

Fig. 8
figure 8

Results in terms of macro F1-Score for each of the considered prediction days (PV Italy dataset). The performance curves reported consider the best configuration for each method

Fig. 9
figure 9

Ablation analysis for threshold autotuning conducted considering a range of false positive percentage for Wind NREL (left) and PV Italy (right)

4.1.2 Vehicular traffic results

The experiments were conducted considering the vehicle traffic observed in an Italian city. Specifically, 93 sensors located at every access point to the city center were considered for data collection. For each sensor, the GPS position was exploited for geo-tagging instances with latitude and longitude coordinates. Data was collected continuously (ISO 8601, resolution: seconds) for the time frame between November 8, 2021 and November 23, 2021. Instances were aggregated every 5 minutes in order to quantify the number of vehicles approaching or leaving the city center, for a total of 370,775 instances. Table 1 summarizes the descriptive features of the dataset.

Table 1 Descriptive features of the vehicular traffic dataset

Experiments were performed considering a landmark time window approach, where models are trained with all available data observed so far to predict on subsequent time steps. Specifically, for each hour of the time frame covered by the dataset, we trained a model to predict the possible anomalies occurring in the next hour. Multiple train and prediction sets were created, where the n-th split includes n hours for training and the \((n+1)\)-th hour for prediction. Since anomalies were not provided, 36 randomly selected prediction windows were perturbed by adding random values in the interval [0, 100] to the number of vehicles with direction approaching and unknown, leaving the original values for vehicles leaving the area unchanged. The idea was to simulate an anomalous but realistic scenario in which cars only approach the city center without the possibility to leave it. Therefore, we trained models from the beginning of the time series until the hour preceding the prediction window. By doing so, we collected 36 predictive models, trained with a progressively increasing amount of training data. Quantitative results are illustrated in Fig. 10.

Fig. 10
figure 10

Vehicular traffic analysis. Each curve considers the best configuration for each method in terms of macro F1-Score (y-axis). The x-axis represents the width of the time window spanning the entire time frame of the analyzed dataset

Results show that, at the beginning of the considered time frame, models are more unstable, possibly due to the distinct distribution of patterns observed during the day and the night. However, after 150 hours of training, predictive models appear more stable and accurate, as expected. In the latest time windows it is possible to observe, in some cases, an unsatisfactory model performance. This behavior can be observed in three specific situations with limited or no sunlight: November 19, 2021 (10:05 PM), November 20, 2021 (7:05 PM), and November 20, 2021 (11:05 PM), as shown by the three drops in the final part of the curve. This could be explained by the reduced number of observations available during the night, due to the reduced vehicular traffic.

It can be noted that Auto-encoder yielded perfectly accurate predictions with this dataset, resulting in a macro F1-Score of 1.0. This behavior was expected and can be attributed to the perturbation strategy adopted, which is fully compliant with the \(3 \times \sigma\) rule used for the Auto-encoder predictive function and results in a simplification of the task for this particular method. It should also be noted that experiments with Auto-encoder required a significantly higher training time, due to the training regimen followed for this dataset. Moreover, since the method does not provide explanations for its inferences, it fails to provide a trade-off between accuracy and explainability.

For the proposed method, experiments were conducted to automatically tune the tf parameter. To this aim, we considered different values for the \(fp\%\) parameter, i.e., the tolerance on the percentage of training instances labeled as anomalies by our method that actually represent normal cases in the training set. Considering \(fp\%\) rather than tf simplifies the adoption of the tuning mechanism from the perspective of non-expert security operators, since it is easier to understand. To set the correct value of \(fp\%\), we considered the latest window containing all the training instances (i.e., after 358 training hours). As shown in Fig. 11, the highest macro F1-Score was obtained by considering \(0.02\%\) of false positives. We note that the results improved as the number of false positives within the training set was reduced, as visible in the decreasing curve shown in the top-left sub-figure. This is an expected result, since a low number of errors on training data corresponds, in principle, to a more robust predictive model that avoids recognizing normal cases as anomalous. Figure 11 also highlights that with \(0\%\) of false positives the model has a tendency to become flat, considering all instances as normal cases. This behavior should be avoided, since the final models would become useless in practical real scenarios, being unable to discriminate between normal and anomalous instances. To avoid this phenomenon, we tuned the \(fp\%\) parameter between \(0\%\) and \(0.1\%\), as shown in Fig. 11. Values close to \(0\%\) emphasize that the proposed method is capable of tolerating a few noisy instances that could be significantly different from historical data. This behavior is motivated by a potential degree of contamination in the training data, i.e., anomalous instances considered as normal and not removed from the training set.
This phenomenon is frequent in real applications such as vehicle flow analysis, where the effort of removing car accidents or unconventional vehicle flows due to municipal road works is too high. As a result, such instances are likely to remain within the training set. It is nevertheless remarkable that our method was capable of handling this problem via the tuning of the \(fp\%\) parameter.

Fig. 11
figure 11

Ablation analysis for the proposed method (vehicular traffic data). A different number of false positives within the training set were considered to identify the ideal parameter configuration. Macro F1-Score values w.r.t. close to zero false positive percentages emphasized that a few false positives are necessary to better tune predictive models for this dataset

4.1.3 Oslo pedestrians flow analysis

We considered a dataset describing the flow of people in the city of Oslo between two areas identified by the following GPS coordinates:

  • \(\langle 59.91301053377869, 10.733979291595263 \rangle\)

  • \(\langle 59.912625158578805, 10.734914979884614 \rangle\)

The collected data covers the months of April and May for each year from 2019 to 2022. This time frame was chosen in order to capture the abrupt increase in pedestrian flows occurring during the Constitution Day of Norway on May 17. To this aim, we considered all instances of May 17 as anomalous, whereas the remaining instances were considered as normal cases, i.e., regular pedestrian flows. Detailed descriptions of the features covered by this dataset are reported in Table 2.

Table 2 Descriptive features of the Oslo pedestrians flow dataset

In Fig. 12, we show the results in terms of macro F1-Score. We note that, for this dataset, we did not resort to any perturbation strategy, since it already contained real anomalies.

Fig. 12
figure 12

Results for the Oslo pedestrian flows dataset (left) and ablation analysis for the \(fp\%\) parameter optimized w.r.t. the macro F1-Score (right)

4.1.4 Yahoo! web services

We considered the Yahoo! S5 dataset (https://research.yahoo.com), containing real time series with labeled anomalies for the metrics of various Yahoo! services. The dataset consists of 67 univariate time series that we joined on the temporal dimension to obtain a multivariate dataset. In the raw version of the dataset, the ground truth label (anomaly/normal) is available separately for each feature. Each time point refers to a specific hour of production traffic. Our experiments consider a training window of two weeks and a testing window of one week, resulting in landmark training windows of increasing size: two, three, and four weeks. We removed anomalies from the training windows, and we considered a testing instance as anomalous if its time window contained at least 3 out of the 67 features labeled as anomalous. We chose 3 features since choosing 1 or 2 would have caused several anomalies to be removed from the training sets, leading to empty or very small training windows, while choosing 4 would have resulted in testing windows without anomalies, complicating the evaluation. In Fig. 13, we show the results in terms of macro F1-Score. The results show that, on average, our method outperforms the competitor methods. The proposed method presents an opposite behavior w.r.t. Isolation Forest on this dataset. Our interpretation of this result is that the second evaluation window contains more peripheral anomalies that lie very close to normal instances. These anomalies are likely easier to detect with simple classification rules than with clusters of neurons that model multiple densities. However, our method outperforms Isolation Forest in the other two evaluation windows. Ablation analysis was conducted by considering \(fp\% \in [0, 1]\) with a step of 0.1.
Figure 13 (right) shows that the best results for this dataset are obtained with \(fp\% = 0.2\), suggesting that this dataset contains few to no noisy or anomalous instances within the training windows.
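The dataset preparation described above (joining 67 univariate series on the temporal key and labeling a time point anomalous when at least 3 features are anomalous) can be sketched as follows. The column names, series length, and anomaly rate are illustrative assumptions, not the actual Yahoo! S5 layout.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical raw form: one frame per univariate series, each carrying
# its own per-feature ground-truth anomaly label.
series = []
for i in range(67):
    s = pd.DataFrame({
        "timestamp": range(100),                       # one point per hour
        f"value_{i}": rng.normal(size=100),
        f"is_anomaly_{i}": (rng.random(100) < 0.02).astype(int),
    })
    series.append(s)

# Join on the temporal dimension to obtain a single multivariate dataset.
multi = series[0]
for s in series[1:]:
    multi = multi.merge(s, on="timestamp")

# A time point is anomalous if at least 3 of the 67 features are labeled.
label_cols = [c for c in multi.columns if c.startswith("is_anomaly_")]
multi["label"] = (multi[label_cols].sum(axis=1) >= 3).astype(int)

print(multi.shape)  # (100, 136): timestamp + 67 values + 67 labels + label
```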

Fig. 13

Results for the Yahoo! web services dataset (left) and ablation analysis for the \(fp\%\) parameter optimized w.r.t. the macro F1-Score (right)

4.2 Quantitative analysis

Overall, the experiments show that the proposed method outperforms the considered state-of-the-art methods in the majority of the scenarios described by the different datasets (see Table 3). In Table 4, we show the ranking achieved by each method on all considered datasets, together with the final average ranking. The proposed method is the best performing on all datasets except for the vehicular traffic analysis, where the Auto-encoder outperforms all other approaches. We attribute this result to the perturbation strategy followed for this dataset, which matches the biases of the predictive function of the Auto-encoder and, as a result, translates into a simplified setting. The OCSVM method positions itself in the middle, providing an acceptable second-best performance on Wind NREL and PV Italy, a sub-par performance on Oslo pedestrians, and an unsatisfactory performance on the Vehicular traffic and Yahoo! datasets. We also observe that the widely adopted Isolation Forest method suffers in most of the considered applications. Indeed, this method is robust when the anomalies lie at the boundaries of the training instances and are easy to isolate by means of very simple trees with limited depth. A similar behavior is observed for COPOD and ABOD, resulting in sub-par anomaly detection performance. Therefore, in cases where the anomalies lie very close to normal instances, better suited methods are those able to capture finer differences between normal and anomalous cases. In such a setting, we attribute the superior performance of our method to its ability to model multiple densities in the background data thanks to multiple SOMs.
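The per-dataset ranking and its average can be computed as in this sketch. The macro F1-Score values below are illustrative placeholders chosen only to mirror the qualitative ordering discussed above, not the paper's actual results.

```python
import pandas as pd

# Hypothetical macro F1-Scores: rows are datasets, columns are methods.
f1 = pd.DataFrame(
    {
        "Proposed": [0.91, 0.88, 0.86, 0.79, 0.90],
        "Auto-encoder": [0.85, 0.92, 0.80, 0.70, 0.84],
        "OCSVM": [0.88, 0.75, 0.83, 0.62, 0.66],
        "Isolation Forest": [0.70, 0.73, 0.64, 0.60, 0.75],
    },
    index=["Wind NREL", "Vehicular", "PV Italy", "Oslo", "Yahoo!"],
)

# Rank the methods on each dataset (1 = best macro F1-Score), then take
# the mean rank over the datasets to obtain the final average ranking.
ranks = f1.rank(axis=1, ascending=False)
avg_rank = ranks.mean(axis=0).sort_values()
print(avg_rank)
```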

Table 3 Macro F1-Score summary of the comparative analysis for all the datasets describing five different real-domain applications
Table 4 Average ranking of all methods considered in our experiments

5 Conclusion and future works

In this paper, we proposed a distributed and explainable SOM-based anomaly detection method. The method supports the analysis of background data characterized by multiple densities by means of multiple SOMs arranged in a hierarchy. Moreover, the method is able to process the large-scale data occurring in real-world domains by leveraging the map-reduce programming paradigm and the Apache Spark framework. Unlike popular anomaly detection systems, the proposed method automatically identifies the threshold for the classification rule during the training stage. Its sensitivity can also be configured to tolerate different levels of false positives during training. Furthermore, the method is enhanced with an explainability component that facilitates the interpretation of predicted anomalies by leveraging instance-based feature ranking. The results show the effectiveness of the proposed approach, both qualitatively and quantitatively, in five different real-world applications. The qualitative analysis emphasized that the explainability component is effective in highlighting the severity of the detected anomalies. In future work, we will extend our automatic threshold estimation mechanism to automatically learn the ideal false positive percentage, without any dependence on user input. Possible extensions of the method also include its adaptation to biological domains, where identifying variations in subcellular genetic information is important to detect the onset of diseases.