Introduction

Appliance-level energy consumption monitoring is a core component of the control system of smart buildings (Shah et al. 2019; Shaikh et al. 2014). The consumption data can be either directly collected with devices such as smart plugs, or inferred with non-intrusive load monitoring (NILM) algorithms able to break down the household aggregate consumption signal into the contributions of individual appliances (Azizi et al. 2021). The analysis of energy consumption data series enables forecasting and diagnostic applications, such as load prediction (Amasyali and El-Gohary 2018), anomaly detection (AD) (Fan et al. 2018) and predictive maintenance (Cheng et al. 2020).

AD in temporal data series is the task of identifying data points or intervals in which the time series deviates from normality. AD finds application in different fields: in healthcare, where it applies to the analysis of clinical images (Schlegl et al. 2019) and of ECG data (Chauhan and Vig 2015); in cybersecurity, where it is used for malware identification (Sanz et al. 2014); in manufacturing, where it helps monitor machines and prevent breakdowns (Kharitonov et al. 2022); and in the utility industry, where it supports the early identification of critical events such as appliance malfunctioning (Mishra et al. 2020) and water leakage (Seyoum et al. 2017; Muniz and Gomes 2022). In the energy field, AD may be combined with energy load forecasting to improve accuracy (Koukaras et al. 2021), or integrated as a component that detects non-nominal energy fluctuations to enhance decision making in energy transfer between microgrids (An interdisciplinary 2021). Energy consumption time series can be collected from home appliances and building systems with complex periodic or quasi-periodic behavior, such as coolers, water heaters and fridges, which present specific challenges when performing anomaly detection. Machine learning and neural models trained on normal data may overfit with respect to the length of the period. This phenomenon makes the model sensitive even to small variations of the cycle duration, which can happen during normal functioning (Liu et al. 2020). As a consequence, the detector may emit a high number of false positive alerts when such small variations occur, and its performance may degrade appreciably when it is used to detect anomalies of an appliance of the same type but with a different cycle duration.

The literature on AD in temporal data series still lacks a systematic comparison of algorithms belonging to different families on quasi-periodic data sets. Therefore, the development of an AD application in such a scenario must still confront design decisions such as the choice of the most effective algorithm, the minimum duration of the time series to use for training, the minimum size of the signal prediction/reconstruction window needed to identify the anomalous behavior, and the portability of the chosen algorithm from one appliance to another with “similar” behavior. This paper aims to fill this gap in the literature on AD in quasi-periodic time series by systematically comparing the performance of 12 algorithms representative of different families of approaches. The experiments were performed on 3 distinct data sets of fridge power consumption.

The aim of the experiments is to address the following questions:

  • Q1 How do the selected algorithms compare in the AD task on quasi-periodic time series under multiple performance metrics?

  • Q2 For the algorithms that require training, what is the relationship between the length of the training series and the performances?

  • Q3 For the algorithms that exploit a window-based approach for the prediction, what is the relationship between the length of the window and the performances?

  • Q4 What is the generalization capability of the methods? How does performance degrade when a method trained on an appliance is tested on the time series produced by a distinct appliance of the same type?

The essential findings can be summarized as follows:

  • The classical ML algorithms Isolation Forest (ISOF), One-Class SVM (OC-SVM), and Local Outlier Factor (LOF) outperform the best neural models (GRU/LSTM autoencoder and multisteps methods).

  • Two weeks of training data are sufficient for most methods, with the multisteps approaches attaining a modest improvement if one month of data is used.

  • The length of the prediction/reconstruction window has a different impact on neural and non-neural methods.

  • ISOF and OC SVM are less dependent on the training set than the neural models, which show a marked performance decay when tested on an appliance different from the one used for training.

  • The top result of all the experiments is attained by ISOF on the Fridge3 time series, trained with a sub-sequence of length equal to one month and with a window size of 2 \(\times\) period: Precision = 0.947, Recall = 0.965, \(\hbox {F}_{1}\) score = 0.956.

The above-mentioned findings can help to better understand the requirements and performances of AD algorithms on quasi-periodic data series so as to design more effective household energy consumption applications, e.g., by equipping the mobile apps that are nowadays bundled with smart plug products with functionalities for consumption monitoring, energy saving recommendations and alerting of potential appliance malfunctioning.

The rest of the article is organised as follows: Section “Related work” overviews the state of the art in anomaly detection. Section “Experimental settings” describes the experimental configuration, including the description of the dataset and of the evaluated algorithms. Section “Experimental results” discusses the results of the performed experiments. Section “Qualitative analysis of results” discusses qualitatively a few examples of the predictions made by the reviewed methods. Finally, Section “Conclusions” draws the conclusions and illustrates our future work.

Related work

Anomaly detection in temporal data series exploits data collected with a broad spectrum of sensors in diverse fields, such as weather monitoring, natural resources distribution and consumption (e.g., water and natural gas), network traffic surveillance, and electrical load measurement (Firth et al. 2017; A platform for Open 2022; Makonin et al. 2016; Shakibaei 2020). As an example, the work in Makonin et al. (2016) discusses the use of residential home smart meters for data collection and highlights how such series often exhibit anomalous behaviors. Raw data must be pre-processed before further analysis. Besides the usual operations of data cleaning and validation, a prominent task is data annotation, which associates data points or intervals with the specifications of significant events, such as change points and anomalies. For example, Rimor (Rashid et al. 2018) is a time-series data annotator supporting the labelling of data with anomaly tags, which can be used as ground truth for training and evaluating predictive models.

AD can be conducted in both univariate (Braei and Wagner 2020) and multivariate time series (Su et al. 2019; Li et al. 2018; Blázquez-García et al. 2021). In the case of multivariate time series, exploiting variable correlation may be necessary for reducing the number of parameters needed to model the problem (Pena and Poncela 2006). Examples of multivariate time series dimensionality reduction techniques are principal components analysis (Cook et al. 2019; Pena and Poncela 2006), canonical correlation analysis (Box and Tiao 1977), and factor modelling (Pena and Box 1987).

AD approaches can be classified into two main families (Cook et al. 2019): non-regressive and regressive. Non-regressive approaches rely on the fundamental statistical quantities computed on the time series (e.g., mean and variance) and combine them with fixed thresholds, but their effectiveness is limited (Cook et al. 2019). The authors of Kao and Jiang (2019) proposed a statistical AD framework using the Dickey-Fuller test, the Fourier transform, and the Pearson correlation coefficient to analyze periodic time series. Performance evaluation on five NAB datasets (Ahmad et al. 2017) showed that the proposed approach performs well on the NAB Jumps periodic data set and outperforms the models it was compared to. Other types of non-regressive techniques are ML methods for time series analysis. In Oehmcke et al. (2015) the Local Outlier Factor (LOF) method was employed to identify anomalous events in the marine domain and attained 83.4% precision. The Isolation Forest (ISOF) algorithm has been applied to streaming data in Ding and Fei (2013), achieving an AUC score of 0.98 on one of the test datasets. In Zhang et al. (2008) a One-Class Support Vector Machine (OC-SVM) was implemented for the identification of network anomalies; on the test set, the identified outliers perfectly match the result of human visual detection.

Regressive approaches compute a model of the time series generation process. In the case of AD, an autoregression model is used to forecast the variable of interest from its past values. Autoregressive models include methods based on Autoregressive Moving Average (ARMA) (Pincombe 2005; Kadri et al. 2016; Kozitsin et al. 2021) and on Neural Networks, such as Autoencoders (AE) (Yin et al. 2020; Li et al. 2020) and Recurrent Neural Networks (RNNs) (Canizo et al. 2019; Malhotra et al. 2015). Forecasting-based AD approaches are divided into single-step and multi-step methods depending on the number of predicted points. The former strategy is preferable for short-term forecasting (i.e., minutes, hours, and days) and the latter for long-term data series analysis.

In the electric load analysis domain, the work in Masum et al. (2018) studies the problem of time series forecasting for electric load measurements and shows that Long Short-Term Memory (LSTM), a deep learning model, outperforms AutoRegressive Integrated Moving Average (ARIMA), a statistical model, on three data sets obtained from the Open Power System Data on electric load in Great Britain, Poland, and Italy (A platform for Open 2022). Zhang et al. (2019) shows the importance of a Fast Fourier Transform (FFT) based periodicity pre-processor to extract the period in smart grid time series. Pereira et al. (2018) proposes the use of Variational Autoencoders (VAE) for unsupervised anomaly detection in solar energy generation time series, and the results show that the trained model is able to detect anomalous patterns by using the probabilistic reconstruction metrics as anomaly scores. Himeur et al. (2021) surveys several Artificial Intelligence methods for anomaly detection in buildings’ energy consumption, identifying several factors (e.g., occupancy and outdoor temperatures) that influence time series behavior.

In the specific field of periodic data series analysis, Zhang et al. (2020) employs a periodicity pre-processor to find the time series period and segment the data into windows. Then it exploits a combination of an RNN and a CNN to detect anomalies achieving an \(\hbox {F}_{1}\) score near 0.9 on all the test datasets. Zhang et al. (2019) also uses a periodicity pre-processor, based on the Fourier transform, and maps multiple periods onto a single cycle to identify deviations across subsequent periods. Pereira et al. (2018) uses Bi-LSTM to detect anomalies and proposes the use of attention maps to explain the results. Capozzoli et al. (2018) encodes periodic time series using letters as a data size reduction technique. The classification process led to robust results with a global accuracy that ranged between 80% and 90%. These works show the advantages of pre-processing to exploit the data periodicity and of dimensionality reduction techniques and discuss results interpretability.

The proliferation of time series analysis methods and of AD specific approaches has spawned a stream of research focused on comparing the performance of alternative techniques. For example, the work in Masum et al. (2018) compares the multi-step forecasting performance of ARIMA and LSTM-based RNN models and shows that the LSTM model outperforms the ARIMA model for multi-step electric load forecasting. Our preliminary work (Zangrando et al. 2022) compares CNN-powered and RNN-powered AD methods with One-Class Support Vector Machines and Isolation Forest techniques on one quasi-periodic data set, using standard metrics (precision, recall, \(\hbox {F}_{1}\) score). In this paper we deepen the analysis assessing performances under multiple metrics, investigating the impact of the training sub-sequence duration and of the analysis window size, and contrasting the generalization capacity of the reviewed approaches.

Experimental settings

Data set

The experiments exploit a fridge energy consumption data set collected using smart plugs. The energy consumption data have been collected in Greek residential households using the BlitzWolf BW-SHP2 smart plugs, which allow exporting the time series through an API. The data collection system, the assessed algorithms and the evaluation framework were all implemented in Python. The time series in the data set record the active power consumption of three fridges for over 2 months, with 1 minute data resolution. The time series have been divided into sub-sequences for training, validation, and testing of the methods. Table 1 summarizes the data split.

Table 1 The dataset collection period and the train-val and test split

When working in normal conditions, the energy consumption curve of a fridge displays a cyclic behavior alternating between a high consumption state (ON) and a low consumption state (OFF). Figure 1 shows an example of the consumption data of one appliance.

Fig. 1

Example of the fridge energy consumption data series. The time series is formed by subsequent ON-OFF cycles and is quasi-periodical

Data set analysis

Periodicity analysis. Normal fridge consumption shows a cyclic behavior. Periodicity analysis aims at detecting the mean period corresponding to an ON-OFF cycle and possibly other longer patterns (e.g., seasonal effects). It is a preliminary step before the application of AD and requires a non-anomalous sub-series, which can be created by manually removing anomalies from the training sub-sequence. The Fast Fourier Transform (FFT) is applied to the anomaly-free sub-sequence to map the data into the frequency domain, and the period is defined as the inverse of the frequency corresponding to the highest power in the FFT, as proposed in Kao and Jiang (2019). Table 2 summarizes the periods, expressed in minutes, of the three data sets. The periods range from 45 minutes to 1 h 40 minutes. No seasonal effect is found because the train set covers only one month. Figure 2 shows the power spectrum computed for one of the three appliances.
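As an illustration of the periodicity pre-processing step, the following minimal Python sketch estimates the dominant period of a signal as the inverse of the frequency with the highest spectral power. It assumes NumPy and a synthetic ON-OFF signal; the function name `dominant_period`, the mean removal before the transform, and the toy power levels are assumptions made for this example and are not taken from the evaluated implementation.

```python
import numpy as np


def dominant_period(series, sample_spacing=1.0):
    """Return the period (in units of sample_spacing) of the strongest
    non-zero frequency of a 1-D signal."""
    series = np.asarray(series, dtype=float)
    # Remove the mean so that the zero-frequency bin does not dominate.
    centered = series - series.mean()
    spectrum = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(centered.size, d=sample_spacing)
    # Skip bin 0 (frequency 0), whose period is undefined.
    peak = 1 + int(np.argmax(spectrum[1:]))
    return 1.0 / freqs[peak]


# Toy signal (assumption): 30 min ON at 100 W, 50 min OFF at 2 W,
# sampled once per minute and repeated for 40 cycles.
cycle = np.concatenate([np.full(30, 100.0), np.full(50, 2.0)])
signal = np.tile(cycle, 40)
period_minutes = dominant_period(signal, sample_spacing=1.0)
```

On this synthetic series the estimate is close to 80 minutes, the length of one ON-OFF cycle. On real data the anomaly-free training sub-sequence would take the place of `signal`.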

Table 2 The periods determined for the energy consumption time series, expressed in minutes
Fig. 2

The power spectrum computed by the periodicity pre-processor (right) on the fridge energy consumption time series (left). The period detected for an ON-OFF cycle is about 80 minutes for the analyzed data set

Ground truth annotation

For training and testing purposes, the energy consumption time series have been annotated with ground truth (GT) metadata to specify the points that deviate from normality. Three independent annotators have labeled the data points with a Boolean tag (normal/anomalous) and with a categorical label denoting the type of the anomaly, using the interface shown in Figure 3.

Fig. 3

The interface of the GT anomaly annotator at work on the fridge time series. The user can specify the anomalies and add meta-data to them. The user has annotated the currently selected GT anomaly, shown in red, with the Continuous ON state label

Anomaly classes and their distribution

The anomalies have been divided into the following categories: Continuous OFF state, when the appliance is in the low consumption state for a long time; Continuous ON state, when the appliance is in the high consumption state for an abnormally long time; Spike, when the appliance has an abnormal consumption peak, possibly preceded by a ramp and followed by a decay period; Spike + Continuous, when the appliance has a consumption peak followed by a prolonged ON state; Other, when the anomaly does not follow a well-defined pattern. Figure 4 shows the distribution of the anomaly categories in the data set of the three fridges. The plots highlight the different anomalous behavior of the appliances. Fridge2 is mainly subject to continuous ON cycles. Fridge1 shows a similar pattern, but the prolonged ON states are preceded by an abrupt increase in the consumption. Fridge3 is subject to a more detectable anomalous behavior because almost 95% of the anomalies are of the spike type, which is easier to detect, also visually.

Fig. 4

The anomaly type distribution on the three fridge energy consumption data series

GT anomaly duration distribution. Figure 5 shows the GT anomaly duration distribution on the data series of the three fridges. The distributions of Fridge1 and Fridge2 are centered close to the time series period, which suggests the presence of anomalies shorter than an ON-OFF cycle. The distribution of Fridge3 is centered around values higher than the mean ON-OFF cycle duration, which is typical of the transient behavior caused by high consumption spikes.

Fig. 5

The anomaly duration distribution on the fridge energy consumption data sets. The distributions of Fridge1 and Fridge2 are centered close to the time series period, which suggests the presence of anomalies shorter than an ON-OFF cycle, whereas the distribution of Fridge3 is centered around values higher than the mean ON-OFF cycle duration

Compared algorithms

Algorithm list and definitions

The algorithm selection considered the most common methods used in the reviewed studies and their nature (statistical, regressive, neural) so as to achieve a balanced representation of the different approaches.

  1. Basic Statistics is an extension of the method presented in Kao and Jiang (2019) for periodic series. The first step analyzes the anomaly-free training data series to determine the periodicity. Then, the anomaly-free train set is divided into non-overlapping windows of the same size as the period and the Pearson product-moment correlation coefficient is computed on all the pairs of contiguous windows to check whether the time series is periodic within the two windows. If it is periodic, the ratio \(R_{std} = \frac{|Std_{current} - Std_{previous}|}{Std_{previous}}\) is computed. An anomaly occurs if \(R_{std}\) exceeds a threshold \(\tau\), defined as follows. \(R_{std}\) is calculated for each window pair in the train set and the maximum value (\(R_{max}\)) allowed in a non-anomalous time series is found. Then the threshold \(\tau\) is determined on the validation set by performing a grid search. Given a set of possible thresholds \(\tau _\alpha = R_{max}(1+\alpha )\), with \(\alpha\) ranging from 0 to 10 with step 0.1, the threshold \(\tau\) is defined as the value corresponding to the best \(F_1\) score obtained by applying the anomaly definition rule on the validation set. Finally, the same rule is applied to the test set using the computed threshold value.

  2. AutoRegressive (AR) (Hyndman and Athanasopoulos 2021) is an autoregression model exploiting past data to predict current data. The prediction model is defined as:

    $$\begin{aligned} y_t = c + \sum _{i=1}^{p} \phi _i y_{t-i} + \varepsilon _t \end{aligned}$$
    (1)

    where \(c, \phi _i\) are the model parameters and \(\varepsilon _t\) is a white noise term. Anomalies are computed from the prediction error by thresholding.

  3. AutoRegressive Integrated Moving Average (ARIMA) (Hyndman and Athanasopoulos 2021; Masum et al. 2018) is a model exploiting past data, differencing of the original time series and a linear combination of white noise terms. A model ARIMA(p, d, q) is defined as:

    $$\begin{aligned} y^\prime _t=c + \sum _{i=1}^{p} \phi _i y_{t-i}^{\prime } + \sum _{j=1}^{q} \theta _j \varepsilon _{t-j} + \varepsilon _t \end{aligned}$$
    (2)

    where \(y^\prime _t\) is the differenced time series, \(\varepsilon _t\) is a white noise term and \(c, \phi _i, \theta _j\) are the model parameters. Anomalous points are defined as in AR.

  4. Local Outlier Factor (LOF) (Breunig et al. 2000) is a density-based algorithm that compares the local density of a point with that of its nearest neighbors in order to identify local outliers.

  5. One-Class SVM (OC SVM) (Schölkopf et al. 1999) applies support vector machines (SVMs) to novelty detection.

  6. Isolation Forest (ISOF) (Liu et al. 2008) is an ensemble method that creates different binary trees for isolating anomalous data points.

  7. Gated Recurrent Unit (GRU) (Chung et al. 2014) is a class of Recurrent Neural Networks (RNNs) that exploits an update gate and a reset gate to decide what information should be passed to the output.

  8. Gated Recurrent Unit multisteps (GRU-MS) is based on GRU and is used to predict multiple consecutive data points in the future.

  9. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) is another class of RNNs exploiting a cell with an input gate, an output gate and a forget gate. Both GRU and LSTM are designed to take advantage of the past context of the data and to avoid the vanishing gradient problem of RNNs.

  10. Long Short-Term Memory multisteps (LSTM-MS) is based on LSTM and is used to forecast several consecutive data points.

  11. GRU-Autoencoder (GRU-AE) (Zhang et al. 2019) is a hybrid model using an autoencoder and a GRU network.

  12. LSTM-Autoencoder (LSTM-AE) (Cho et al. 2014) is another hybrid model coupling an autoencoder and an LSTM network.
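The Basic Statistics rule described in item 1 of the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the evaluated implementation: the correlation cut-off `min_corr`, the choice of leaving non-periodic window pairs unscored, the helper names, and the synthetic ON-OFF data are all hypothetical. Only the ratio \(R_{std}\) and the grid \(\tau _\alpha = R_{max}(1+\alpha )\) follow the description in the text.

```python
import numpy as np


def std_ratios(series, period, min_corr=0.5):
    """R_std for every pair of contiguous, non-overlapping period-long windows.

    Pairs whose Pearson correlation is below min_corr (an assumed cut-off)
    are left as NaN, i.e. they are not scored by this rule.
    """
    series = np.asarray(series, dtype=float)
    n_win = series.size // period
    windows = series[: n_win * period].reshape(n_win, period)
    ratios = np.full(n_win - 1, np.nan)
    for i in range(1, n_win):
        prev, cur = windows[i - 1], windows[i]
        if np.corrcoef(prev, cur)[0, 1] >= min_corr:
            ratios[i - 1] = abs(cur.std() - prev.std()) / prev.std()
    return ratios


def f1_score(pred, truth):
    """F1 score of two Boolean arrays (0.0 when there are no true positives)."""
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def fit_threshold(train, valid, valid_labels, period):
    """Grid search over tau_alpha = R_max * (1 + alpha), alpha = 0, 0.1, ..., 10."""
    r_max = np.nanmax(std_ratios(train, period))
    valid_ratios = std_ratios(valid, period)
    truth = np.asarray(valid_labels, dtype=bool)
    taus = r_max * (1 + np.arange(101) / 10.0)
    scores = [f1_score(valid_ratios > tau, truth) for tau in taus]
    return taus[int(np.argmax(scores))]


def fridge(on_levels, on_len=30, off_len=50, off_level=2.0):
    """Toy ON-OFF cycles, one cycle per entry of on_levels (synthetic data)."""
    cycles = [
        np.concatenate([np.full(on_len, float(on)), np.full(off_len, off_level)])
        for on in on_levels
    ]
    return np.concatenate(cycles)


period = 80
train = fridge([100, 101, 102, 100, 101, 102])   # anomaly-free cycles
valid = fridge([100, 103, 100, 150])             # last cycle is anomalous
tau = fit_threshold(train, valid, [False, False, True], period)
test_flags = std_ratios(fridge([101, 100, 160, 100]), period) > tau
```

In this toy run the cycle with the abnormal ON level is flagged, and so is the pair that follows it, because \(R_{std}\) measures the change with respect to the previous window and the standard deviation drops back after the anomaly.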

Training procedure and parameter settings

The hyperparameters of the ISOF, OC SVM, LOF, and ARIMA models are set with Bayesian search employing the hold-out set method. For each configuration, the chosen hyperparameters are used to fit the model and the performances are evaluated on the validation set. LOF, OC SVM and ISOF are assessed using the maximum \(\hbox {F}_{1}\)-score, whereas the ARIMA models are assessed using the mean squared error (MSE) on predictions. The hyperparameters yielding the maximum \(\hbox {F}_{1}\) or the lowest MSE are selected.

ARIMA is trained on anomaly-free data to learn normal patterns as done in Yaacob et al. (2010).

ISOF, LOF and OC SVM work on spatial data and thus the univariate time series is projected onto a space \({\mathbb {R}}^n\) with \(n \ge 1\) (Braei and Wagner 2020; Oehmcke et al. 2015). A window of size n is used to extract from the time series \(N-n+1\) vectors of length n of consecutive points, where N is the length of the time series. Then, the spatial algorithms are trained on the projected vectors. At test time, the test set is projected onto \({\mathbb {R}}^n\) and the score of each projected vector is computed. The anomaly score of a point in the time series is defined as the average of all the anomaly scores of the vectors that contain the point. For all the neural models, training is performed on anomaly-free data.
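The projection and the per-point score averaging described above can be sketched as follows. The example assumes NumPy (version 1.20 or later for `sliding_window_view`). The scorer `nearest_train_distance` is only a hypothetical stand-in that returns the distance to the closest training vector; in the experiments this role is played by the scores of ISOF, LOF or OC SVM. The toy series are synthetic.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def project(series, n):
    """Return the N - n + 1 vectors of n consecutive points, one per row."""
    return sliding_window_view(np.asarray(series, dtype=float), n)


def nearest_train_distance(train_vectors, test_vectors):
    """Stand-in scorer (assumption): distance to the closest training vector."""
    diffs = test_vectors[:, None, :] - train_vectors[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def point_scores(vector_scores, n):
    """Average, for each point, the scores of all the vectors that contain it."""
    vector_scores = np.asarray(vector_scores, dtype=float)
    total = np.zeros(vector_scores.size + n - 1)
    count = np.zeros_like(total)
    for start, score in enumerate(vector_scores):
        total[start:start + n] += score
        count[start:start + n] += 1
    return total / count


n = 4
train = np.tile([5.0, 5.0, 0.0, 0.0], 10)   # normal toy cycles
test = np.array([5.0, 5.0, 0.0, 0.0, 5.0, 5.0, 9.0, 9.0, 5.0, 5.0, 0.0, 0.0])
test_vectors = project(test, n)
vector_scores = nearest_train_distance(project(train, n), test_vectors)
scores = point_scores(vector_scores, n)
```

In this toy series the two points with value 9.0 receive the highest averaged score, since every vector that contains them differs from all the training vectors.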

Table 3 summarizes the relevant features and parameters of the compared methods.

Table 3 Relevant configuration parameters of the compared methods

Anomaly definition, GT matching, and performance metrics

Anomaly definition strategies. An anomaly definition strategy specifies how the output of the anomaly detector and the data points of the time series are compared in order to identify whether a point is anomalous. AD algorithms adopt different strategies to identify abnormal points:

  • Confidence: an anomaly score is directly provided as output by the model.

  • Absolute and Squared Error (Munir et al. 2018): the anomaly score is defined as the absolute or squared error between the input and the predicted/reconstructed value.

  • Likelihood (Malhotra et al. 2015): each point in the time series is predicted/reconstructed l times and associated with multiple error values. The probability distribution of the errors made by predicting on normal data is used to compute the likelihood of normal behavior on the test data, which is used to derive an anomaly score.

  • Mahalanobis (Malhotra et al. 2016): each point in the time series is predicted/reconstructed l times. For each point, the anomaly score is calculated as the square of the Mahalanobis distance between the error vector and the Gaussian distribution fitted from the error vectors computed during validation.

  • Windows strategy (Keras 2022): a score vector of dimension l is associated with each point. Each element \(s_i\) of the score vector is the mean absolute or mean squared error of the i-th predicted/reconstructed window that contains the point.

A threshold \(\tau\) is then applied to the calculated score(s) for classifying the point as normal or anomalous. Table 4 shows the anomaly definition strategies of the compared methods.
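As an illustration of the Mahalanobis strategy followed by thresholding, the sketch below fits a Gaussian to the error vectors collected during validation and scores new error vectors with the squared Mahalanobis distance. It assumes NumPy; the function names, the toy error vectors, and the choice of taking the threshold as the largest validation score are assumptions made for this example only.

```python
import numpy as np


def fit_error_distribution(validation_errors):
    """Fit mean and inverse covariance of l-dimensional error vectors (rows)."""
    errors = np.asarray(validation_errors, dtype=float)
    mean = errors.mean(axis=0)
    cov = np.atleast_2d(np.cov(errors, rowvar=False))
    # The pseudo-inverse keeps the sketch usable with a singular covariance.
    return mean, np.linalg.pinv(cov)


def mahalanobis_sq(errors, mean, cov_inv):
    """Squared Mahalanobis distance of each error vector from the fitted Gaussian."""
    centered = np.asarray(errors, dtype=float) - mean
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


# Toy error vectors with l = 2: the second component is naturally noisier.
validation_errors = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
mean, cov_inv = fit_error_distribution(validation_errors)

# Assumed threshold: the largest score observed on the validation errors.
tau = mahalanobis_sq(validation_errors, mean, cov_inv).max()

test_errors = [[0.5, 0.5], [3.0, 0.0], [0.0, 3.0]]
scores = mahalanobis_sq(test_errors, mean, cov_inv)
flags = scores > tau
```

The same absolute error (3.0) yields a larger score on the first component than on the second, because the distance rescales each direction by the variability observed on normal data.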

Anomaly detection criteria and thresholds. The detection criteria determine how an anomaly is identified and are strongly related to the nature of the algorithm used. The anomaly identification criteria used by the compared methods are classified into:

  • Prediction error: prediction models identify anomalies based on the residual between the predicted value and the observed one; the larger the residual, the higher the likelihood of an anomaly.

  • Reconstruction error: this criterion applies to all the models that aim at generating an output as close as possible to the input, such as the autoencoder-based models. As for the prediction models, the larger the residual, the higher the probability of an anomaly.

  • Dissimilarity: dissimilarity models classify anomalous points by comparing them with the features or with the distribution of normal points, or by matching them with the clusters computed from the normal time series.

Table 4 summarizes the detection criteria used by the different algorithms.

Table 4 Anomaly detection criteria and definition strategies adopted for each algorithm

GT matching. To evaluate the predictions as true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), a Point to Point matching strategy has been adopted: each anomalous point is compared only to the corresponding one in the input data series using the GT label.

Performance metrics. The evaluation adopts the most widely used machine learning metrics, precision, recall, and \(\hbox {F}_{1}\) score, defined as follows:

$$\begin{aligned} \text {precision} = \frac{TP}{TP + FP}, \quad \text {recall} = \frac{TP}{TP + FN}, \quad F_1\ \text {score} = 2 \cdot \frac{\text {precision} \cdot \text {recall}}{\text {precision} + \text {recall}} \end{aligned}$$
(3)
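The point-to-point matching and the metrics of Eq. (3) can be sketched in plain Python as follows. The function names and the toy label sequences are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def confusion_counts(predicted, ground_truth):
    """Point-to-point matching of Boolean anomaly flags with the GT labels."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predictions and ground truth must have the same length")
    pairs = list(zip(predicted, ground_truth))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: 1 marks an anomalous point, 0 a normal one.
predicted = [0, 1, 1, 0, 1, 0, 0, 1]
ground_truth = [0, 1, 1, 1, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(predicted, ground_truth)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# Consistency check on the values reported for the top result.
f1_top = 2 * 0.947 * 0.965 / (0.947 + 0.965)
```

In the toy example TP = 3, FP = 1, FN = 1 and TN = 3, so precision, recall and \(\hbox {F}_{1}\) score all equal 0.75. The last line confirms that a precision of 0.947 and a recall of 0.965 give an \(\hbox {F}_{1}\) score of 0.956 after rounding.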

Experimental results

In this section we summarize the responses to the four questions introduced in the Introduction. For space reasons we condense the results of the 144 (12 methods \(\times\) 3 training periods \(\times\) 4 window sizes) experiments on 3 data sets and discuss only the essential findings. The complete list of results is published at the address: https://github.com/herrera-sergio/AD-periodic-TS.

Q1: comparative performances

Figure 6 shows the comparison of the methods over all the data sets and across all the training duration values and sizes of the sliding window. The ISOF method consistently achieves the best \(\hbox {F}_{1}\) score, followed by OC SVM and LOF. The AE and MS neural methods have comparable performances. The multi-step approaches exhibit a more consistent behavior, yielding smaller values of the standard deviation, and the GRU-AE method performs slightly worse than the other approaches. The neural methods that predict only one point in the future (LSTM and GRU) have low performance and a rather inconsistent behavior. This is expected due to the high sampling frequency, which makes one-step prediction ineffective to detect anomalies. Of the remaining non-neural methods, ARIMA and Basic Statistics are positioned at the low end of the performance range.

The top result on all the experiments is attained by ISOF on the Fridge3 time series, trained with a sub-sequence of length equal to one month and with a window size of 2 \(\times\) period: Precision = 0.947, Recall = 0.965, \(\hbox {F}_{1}\) score = 0.956.

A special case is that of AR. The training of the method converges only for the shortest duration of the training sub-sequence (a half period). However, the trained model delivers on average a good \(\hbox {F}_{1}\) score. It can be observed that AR grossly fails in the accuracy of the predicted values but nonetheless the error of the points that belong to a normal sub-sequence is very different from the error of the points that lie within an anomalous sub-sequence, which results in good AD performances.

Fig. 6

Comparison of the performances of all the algorithms on all the appliances and across all the training duration periods and window sizes. The methods are ordered in descending order of the median values of the \(\hbox {F}_{1}\) score

Figure 7 shows the performance breakdown by appliance. As expected, all methods except ARIMA and Basic Statistics perform better on the Fridge3 data set, which contains more recognizable anomalies, mostly of a single type (\(\approx 95\%\) of type spike). On the Fridge1 and Fridge2 data sets the performances follow the same ranking as in Fig. 6, with the same top-4 methods (ISOF, AR, OC SVM and LOF) and almost equivalent performances of the MS and AE methods. On the Fridge3 data set the methods that predict one step in the future (LSTM and GRU) work better. This analysis highlights that the performances of the models are affected by the considered appliance. Indeed, on Fridge1 the performances are more subject to variations, while on Fridge3 they are more consistent. Moreover, ARIMA and Basic Statistics show low performances independently of the complexity of the dataset, which suggests their inadequacy for this kind of problem.

The results are in line with those of the work of Kharitonov et al. (2022) in which the authors compare the performances of alternative techniques to detect failures using manufacturing machine logs and observed that k-nearest neighbors (KNN) and LOF performed better, while autoencoders could not be considered for deployment in a real-case scenario. Similarly, Elmrabit et al. (2020) found that classical machine learning techniques outperformed deep learning for the AD task in cybersecurity datasets.

Fig. 7

Breakdown of the performance of all the algorithms by appliance. The methods are ordered by descending median value of the \(\hbox {F}_{1}\) score

Q2: training sub-sequence duration

Figure 8 shows the variation of the \(\hbox {F}_{1}\) metrics for the 10 methods that could be trained with all the three sub-sequences (2 weeks, 3 weeks, one month). The results show that the 2 weeks training period is sufficient for most of the methods. Only the multisteps (MS) methods attain a very slight average performance improvement if the training period length extends to 1 month. The results on the time series of Fridge1 and Fridge2 show a similar trend. All the detailed results can be found in the mentioned project repository.

Fig. 8

Variation of the \(\hbox {F}_{1}\) score with the duration of the training sub-sequence. The AR and ARIMA methods did not complete the training with all the periods

Q3: window length

Fig. 9

Variation of the \(\hbox {F}_{1}\) score with the size (in periods) of the sliding window. The AR and ARIMA methods did not complete the training with all the periods

Figure 9 shows the variation of the \(\hbox {F}_{1}\) score with the sliding window size (half a period, one period, two periods and three periods), limited to the 9 methods that could be trained completely. The results show a difference in the pattern between neural and non-neural methods.

With ISOF and OC SVM the \(\hbox {F}_{1}\) score decreases when the window size increases. With a value greater than half a period the methods progressively lose effectiveness: the variance increases and the \(\hbox {F}_{1}\) score decreases. This is likely due to a worse trade-off between the noise and the contextual information enclosed in the window.
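To make the role of the window size concrete, the following minimal sketch shows how sliding windows expressed in multiples of the period could be built and scored with an Isolation Forest. It is an illustration under stated assumptions, not the pipeline used in the experiments: the synthetic ON-OFF signal, the period of 40 samples, and the helper names `make_windows` and `window_scores` are all hypothetical, and scikit-learn's `IsolationForest` is used with its default hyperparameters. The synthetic signal is not expected to reproduce the trends of Fig. 9.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.ensemble import IsolationForest


def make_windows(series, period, n_periods):
    """Cut a 1-D series into overlapping windows spanning n_periods cycles."""
    width = max(2, int(round(period * n_periods)))
    view = sliding_window_view(np.asarray(series, dtype=float), width)
    return view.copy()  # copy so that downstream code gets a writable array


def window_scores(train, test, period, n_periods, seed=0):
    """Fit an Isolation Forest on training windows and score the test windows.

    score_samples is negated so that higher values mean "more anomalous".
    """
    model = IsolationForest(random_state=seed)
    model.fit(make_windows(train, period, n_periods))
    return -model.score_samples(make_windows(test, period, n_periods))


# Synthetic ON-OFF signal standing in for a fridge power trace (hypothetical data).
period = 40  # samples per ON-OFF cycle, an assumed value
t = np.arange(period * 30)
rng = np.random.default_rng(0)
signal = 100.0 * ((t % period) < period // 2) + rng.normal(0.0, 1.0, t.size)

train, test = signal[: period * 20], signal[period * 20 :]

# The same four window sizes considered in the text, expressed in periods.
for n_periods in (0.5, 1, 2, 3):
    scores = window_scores(train, test, period, n_periods)
    print(n_periods, scores.shape[0], float(scores.mean()))
```

Each window becomes one feature vector, so a larger window raises the dimensionality of the input seen by the detector while reducing the number of windows available from a series of fixed length.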

The AE methods deliver the best \(\hbox {F}_{1}\) score when the window size equals twice the duration of the period. A similar trend is also displayed by the MS methods, with LSTM-MS showing a slight monotonic increase up to three periods. The one-step neural methods GRU and LSTM are rather insensitive to the window size, but their performance is at the lower end of the range. The LOF approach exhibits the same trend as the AE and MS neural methods.

The \(\hbox {F}_{1}\) score of the neural methods at the (2 \(\times\) period) point suggests that such a duration gives sufficient context for encoding the periodic features of the time series well and that going beyond that size is either counterproductive or yields a modest benefit. In the AE methods, the negative effect of extending the window size may also be due to the dimensionality reduction to a latent space operated by the neural architecture, which may become less effective when the dimension of the original space gets too large.

The results on the time series of Fridge2 and Fridge3 show a similar trend. All the detailed results can be found in the mentioned project repository.

Q4: generalization

The generalization experiments assess the top-5 methods (ISOF, OC SVM, LOF, LSTM-AE and GRU-AE) on a dataset different from the one on which the methods were originally trained. Each method is tested in two variants: the original version trained on the first appliance and a version in which the threshold value is fine-tuned on the validation data series of the target appliance.

Figure 10 contrasts the \(\hbox {F}_{1}\) scores obtained by the baseline version of the algorithm, i.e., the one trained and tested on the same dataset, the \(\hbox {F}_{1}\) scores achieved by fine-tuning the threshold on the validation set of the target appliance, and the \(\hbox {F}_{1}\) scores obtained without any fine-tuning. The top performing method (ISOF) is also the one that generalizes best, even without fine-tuning the threshold. In general, ISOF and OC SVM depend less on the training set than the neural models, whose performance decays noticeably when they are tested on a different appliance. The degradation is more pronounced when the test appliance is Fridge3, in which almost all anomalies are of type spike, a type that is absent in Fridge1 and Fridge2.
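The threshold fine-tuning step can be pictured with the minimal sketch below. It assumes that a detector trained on the source appliance outputs one anomaly score per point (higher meaning more anomalous) and that the threshold is chosen to maximize the \(\hbox {F}_{1}\) score on the labeled validation series of the target appliance. The selection criterion, the helper names `f1_binary` and `tune_threshold`, and the toy scores are assumptions made for illustration and are not taken from the experimental setup.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 score for boolean arrays, returning 0.0 when it is undefined."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def tune_threshold(scores, labels):
    """Pick the score threshold that maximizes F1 on a validation series.

    A point is flagged as anomalous when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_thr, best_f1 = None, -1.0
    for thr in np.unique(scores):  # candidate thresholds, in ascending order
        f1 = f1_binary(labels, scores >= thr)
        if f1 > best_f1:
            best_thr, best_f1 = float(thr), f1
    return best_thr, best_f1


# Toy validation scores of the target appliance (hypothetical values).
val_scores = np.array([0.10, 0.20, 0.30, 0.60, 0.70, 0.15])
val_labels = np.array([0, 0, 0, 1, 1, 0], dtype=bool)

source_thr = 0.8  # threshold carried over from the source appliance (assumed)
f1_without_tuning = f1_binary(val_labels, val_scores >= source_thr)
tuned_thr, f1_with_tuning = tune_threshold(val_scores, val_labels)
print(f1_without_tuning, tuned_thr, f1_with_tuning)
```

In this toy case the threshold inherited from the source appliance flags nothing, while the tuned one separates the two labeled anomalies. The model itself is left untouched, which is what distinguishes this variant from retraining on the target appliance.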

Fig. 10

Comparison of the generalization performance of the top-5 methods. The orange bar represents the baseline \(\hbox {F}_{1}\) score (i.e., training and testing done on the same dataset), the blue bar denotes the \(\hbox {F}_{1}\) score achieved by fine tuning the threshold on the validation set of the target appliance, and the green bar shows the performances obtained using the trained algorithm without fine tuning

Qualitative analysis of results

To get a qualitative appreciation of the different behavior of the best models, Fig. 11 directly compares the anomalies detected by ISOF, OC SVM and LSTM-AE with the GT anomalies. The detected anomalies are highlighted with a color that depends on the method and the GT anomalies are circled in red.

The plots in the left column show a situation in which all three methods detect more or less the same anomalous data points, and the detected points match the GT annotations well. The plots in the right column show how the methods react to a change in the duration of the ON-OFF cycle (an acceleration in the displayed example, which may be caused by a different load of the fridge or by a change in the set point of the thermostat). Only the ISOF method is robust to such an occurrence. The other methods instead flag many normal points as anomalous, because they treat the entire cycle variation as an anomaly. Given that the time series of the appliances are quasi-periodic, as shown in the power spectrum of Fig. 2, robustness to small variations of the ON-OFF cycle is a very relevant benefit of the ISOF method.

Fig. 11

Qualitative analysis of the predictions of three methods on Fridge1: ISOF, LSTM-AE, OC SVM. ISOF (top) is more robust to variations in the duration of the cycles, while the other two methods are weaker at identifying the anomalous points: LSTM-AE (middle) and OC SVM (bottom) label numerous normal points as anomalous

Conclusions

In this paper we have discussed the results of an experimental comparison of 12 AD methods on three quasi-periodic data series collected with smart plugs connected to three distinct fridges. The comparison first assessed the prediction performance, measured with the \(\hbox {F}_{1}\) score, which confirmed that the non-neural machine learning methods ISOF, OC SVM and LOF attain the best results, followed by the autoencoder-based and multi-step neural methods (GRU-AE, GRU-MS, LSTM-AE, LSTM-MS). In particular, the ISOF method trained with a sub-sequence of length equal to one month and with a window size of 2 \(\times\) period attained a very good result on a fridge data series containing mostly spike anomalies (Precision = 0.947, Recall = 0.965, \(\hbox {F}_{1}\) score = 0.956).
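As a consistency check on the three values reported above, recall that the \(\hbox {F}_{1}\) score is the harmonic mean of precision and recall, written here with the shorthand symbols \(P\) and \(R\) (introduced only for this check):

\[
\hbox {F}_{1} = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \times 0.947 \times 0.965}{0.947 + 0.965} = \frac{1.82771}{1.912} \approx 0.956
\]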

Next we evaluated the impact of the duration of the sub-sequence used for training the algorithms. The results show that the 2-week training period is sufficient for most of the methods and that the AR and ARIMA algorithms did not complete the training within a reasonable time on time series of longer duration.

The impact of the sliding window size was also investigated. Non-neural machine learning algorithms require a shorter window (half of the period is enough), whereas neural models deliver the best performance with a larger window size (two periods in most cases).

Finally, the generalization ability of the top performing methods was assessed. The best method (ISOF) is also the one that preserves its performance when applied to a different appliance, even without fine-tuning the threshold on the target appliance.

Future work will further pursue the investigation of AD algorithms on quasi-periodic data series, focusing also on their runtime performance on hardware with memory and processing constraints. The objective is to design a timely, accurate and efficient system for dispatching mobile phone alerts about the potential malfunctioning of home appliances to real-world users.