Navigating the Metric Maze: A Taxonomy of Evaluation Metrics for Anomaly Detection in Time Series

The field of time series anomaly detection is constantly advancing, with several methods available, making it a challenge to determine the most appropriate method for a specific domain. The evaluation of these methods is facilitated by the use of metrics, which vary widely in their properties. Despite the existence of new evaluation metrics, there is limited agreement on which metrics are best suited for specific scenarios and domains, and the most commonly used metrics have faced criticism in the literature. This paper provides a comprehensive overview of the metrics used for the evaluation of time series anomaly detection methods, and also defines a taxonomy of these based on how they are calculated. By defining a set of properties for evaluation metrics and a set of specific case studies and experiments, twenty metrics are analyzed and discussed in detail, highlighting the unique suitability of each for specific tasks. Through extensive experimentation and analysis, this paper argues that the choice of evaluation metric must be made with care, taking into account the specific requirements of the task at hand.


Introduction
With the growing trend of Industry 4.0, the amount of generated time series data increases, resulting in a huge demand for better time series analysis tools. The study of Time Series Anomaly Detection (TSAD) has become increasingly popular in recent years due to its widespread application in various fields such as cyber-physical systems [1], rail transit [2], online service systems [3], smart grids [4], spacecraft telemetry [5], Internet of Things [6] and healthcare [7]. The rapid advancement of machine learning technology has also opened up new opportunities for developing and improving TSAD methods. With the vast number of different machine learning architectures and techniques available, researchers are constantly exploring new ways to create more accurate anomaly detectors. Whether it be through trying out new algorithms, combining different approaches, or incorporating new data sources, the possibilities for improving TSAD are endless. This highlights the importance of careful evaluation of TSAD algorithms, and the need for proper selection of evaluation metrics. The choice of evaluation metric should be guided by the nature of the time series data and the specific requirements of the task at hand. Using the wrong metrics can lead to incorrect conclusions about the performance of an algorithm, potentially leading to incorrect decisions about its use in real-world applications. For example, Figure 1 shows a prediction evaluated by two of the most used metrics in the literature. They vastly disagree on the quality of the prediction. Despite this, most papers give very little attention to the choice of metric. It is important to understand the limitations and trade-offs of different evaluation metrics, and to make an informed choice when evaluating TSAD algorithms. Additionally, the development of new and improved evaluation metrics should continue to be a priority in the field of TSAD, to ensure that the best algorithms are selected and used in real-world applications.

Figure 1: A hypothetical scenario where there is one long anomalous event in the labels, but the detector predicts two short events, only one of which is within the labelled event. Two of the most used evaluation metrics score the same prediction very differently: the point-wise f1 score (PW f1) gives 0.17, while the point-adjusted f1 score (PA f1) gives 0.95. Both metrics output values between 0 and 1, where 1 is optimal.
TSAD has recently been the subject of criticism with regard to its conventional evaluation metrics. A number of studies have pointed out shortcomings in the commonly used metrics and proposed alternative metrics that address these issues [8, 9, 10, 11, 12, 13, 14, 15, 16].
For example, the work of [12] criticizes the point-adjust metric, showing that a detection algorithm outputting random noise is expected to produce very good scores, capable of outperforming state-of-the-art methods on most of the common benchmark datasets. The same conclusion is reached experimentally by [13]. The work of [17] includes a review of several TSAD evaluation metrics from the perspective of industrial control systems, and discusses several properties required of the metrics. The work of [18] analysed the most commonly used TSAD datasets and found that the majority suffered from flaws such as trivial anomalies, unrealistic anomaly density, mislabelled ground truth, and a high ratio of anomalies at the end of the time series. To address these issues, they introduced a new benchmark dataset, the UCR time series anomaly archive, and also discussed potential issues with the evaluation metrics. Finally, the work of [19] points out the lack of consensus regarding the appropriate datasets for benchmarking TSAD algorithms and presents a benchmark suite derived from a combination of previous TSAD datasets and transformed classification datasets, which have been subjected to various transformations to increase the complexity and difficulty of the benchmark. They include several evaluation metrics in their work to provide a comprehensive evaluation of the TSAD algorithms.
In this paper, we aim to fill the gap in the literature by providing a comprehensive review of the evaluation metrics used and proposed in the field of time series anomaly detection. To the best of our knowledge, no prior works have offered a thorough overview of all the metrics used in the field. The main contributions of this paper are:
• A comprehensive description of the existing evaluation metrics, highlighting their key properties, both desirable and undesirable.
• A novel and structured taxonomy of the metrics, based on their calculation methods, to facilitate understanding and comparison. To the best of our knowledge, this is the first time a systematic taxonomy for TSAD evaluation metrics is defined.
• An in-depth analysis of the impact of the choice of evaluation metric through a set of hypothetical case studies.
• A clear summary of each metric in terms of a set of defined properties.
In Section 2 we define and introduce terms and concepts central to the topic of evaluating TSAD algorithms. We state the scope and limitations of this work in Section 3. In Section 4 we define 10 different properties distinguishing the metrics, all of which are presented and described briefly in Section 5. In Section 5 we also present the taxonomy of these metrics. Section 6 presents a series of case studies for testing the properties of the metrics, resulting in a categorization of the metrics in Section 7, based on the properties from Section 4. Finally, we summarize our findings and draw some conclusions in Section 8.

Background
In this section, we provide an overview of the fundamental concepts necessary to understand the subsequent discussion in this work.
Time series. A time series is a sequence of numbers or vectors, indexed by time. We will refer to each time step as a point. Although not apparent in the definition, the underlying assumption when working with time series is that the values of the points depend on the time variable.

Time series anomaly. An anomaly in a time series is defined in various ways [20], but is in general a point or a subsequence of contiguous points with unexpected or abnormal values. We refer to the subsequence as an anomalous event, and each point in it as an anomalous point, not to be confused with a point anomaly, a term often used for events of length 1. In contrast to anomaly detection in independent data, the abnormality may stem from unsatisfied expectations of the time dependency. That is, a point can have a normal value for the time series in general, but be anomalous in the context of its preceding values 1. Furthermore, what is considered an anomaly depends on the domain and origin of the time series. Finally, it is often unclear just how anomalous an event should be in order to be considered an anomaly. This lack of an exact definition of time series anomalies is part of the reason it is difficult to come up with reliable evaluation metrics.

Time series anomaly detection (TSAD)
The goal of TSAD is to identify anomalies in a time series. A variety of techniques exist for detecting anomalies in time series data, ranging from simple to complex and encompassing both machine learning and other approaches; a detailed review can be found in the work of [20]. Discussing these techniques is not in the scope of this paper. Rather, our aim is to provide a comprehensive overview of the metrics used to evaluate these methods and offer a taxonomy of metrics based on their properties. In TSAD, the input data is typically a time series of data points and the output is a prediction indicating which instances are anomalous. In our work we will refer to the output of the detection algorithm as the prediction.
Evaluation. Evaluation is the task of assigning a score to each prediction, such that a higher (or lower) score means that the prediction is better. Since anomalies are rare events and can have different characteristics, detectors are usually evaluated on different datasets in order to cover a wider spectrum of possible anomalies. In order to easily and objectively rank anomaly detectors in terms of performance, the score must be a single scalar. While it is often useful to use several evaluation metrics, to get insights about which detector performs well in certain scenarios, we consider this another task, which we refer to as performance analysis, as opposed to performance evaluation.

Labels. Evaluation is done by comparing the prediction to a time series of binary labels that represents the ground truth (GT) of which points are anomalous. Note that the use of binary labels is a source of several kinds of errors and inaccuracies: when an anomaly starts, when it ends, and even what should be considered anomalous are questions that rarely have a definite answer, except for synthetic data. Therefore, there are several different labelling strategies that will lead to quite different labels on the same dataset, e.g. the Numenta labelling strategy discussed in Section 5.1.5. Furthermore, when labels are made manually by humans, they will often have inconsistencies.

Changes in labels will necessarily affect the evaluation scores, especially if an event is included or excluded, as there are usually very few anomalies. The impact of slight changes in the length and position of events, however, highly depends on the metric, and will be discussed and tested later in this article.
Due to the high variability in both what is considered an anomaly and how anomalies are labelled, the relevance of results on data from other domains is not obvious. When selecting a detector for use on a specific TSAD task, one should evaluate detectors on a dataset with similar time series and anomalies, and with a labelling strategy in line with the desired output of the detection algorithm 2.

Thresholding
An anomaly detector outputs an anomaly score, a time series with scalar values indicating how anomalous each time point is.In order to get a binary prediction, only time steps with anomaly score higher than some threshold are considered anomalous.This is visualized in Figure 2.
There are several ways of choosing a threshold, some fully automatic, like the non-parametric dynamic thresholding introduced in the work of [25], others as simple as choosing mean + n·std for some n [26, 27] 3.
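As an illustration, the mean + n·std rule mentioned above can be sketched in a few lines. This is a minimal sketch with hypothetical names and example values, not the implementation used in any of the cited works:

```python
import statistics

def threshold_scores(scores, n=3.0):
    """Binarize an anomaly score with the simple mean + n*std rule.

    Points whose score exceeds mean(scores) + n*std(scores) are
    flagged as anomalous (1); all other points are normal (0).
    """
    threshold = statistics.fmean(scores) + n * statistics.pstdev(scores)
    return [int(s > threshold) for s in scores]

# A mostly flat anomaly score with one sharp peak: only the peak
# crosses the threshold for n = 2.
scores = [0.1, 0.2, 0.1, 0.15, 5.0, 0.1, 0.2]
print(threshold_scores(scores, n=2.0))  # [0, 0, 0, 0, 1, 0, 0]
```

Note that the choice of n directly trades precision against recall: a larger n flags fewer points.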
Anomaly detections can be evaluated either before or after the thresholding, as shown in Figure 3. We define binary evaluation metrics as metrics evaluating the binary prediction, and non-binary evaluation metrics as those evaluating the anomaly score. While the latter class uses the anomaly score as input, thresholding is still done, but as part of the metric. This usually involves calculating a score at several or all thresholds 4, and either choosing the optimal score or combining the scores.
The difference between the classes may seem subtle, but involves a foundational difference in what is evaluated. Binary metrics evaluate the combination of the detector and the thresholding strategy, while non-binary metrics aim at evaluating only the detector. The argument for the latter class is that thresholding is a separate issue, and since any detector can be used with any thresholding strategy, detectors should be compared independently of this choice. Using non-binary metrics ensures that thresholding is done equally for all detectors, which might be more fair. However, as thresholding is indeed part of the non-binary metrics as well, this class of metrics is not independent of thresholds, but rather a compromise between them, and the metric might focus overly on irrelevant thresholds. Finally, as thresholding is done in practice, it may make more sense to evaluate the whole pipeline in unison, using a binary metric. This also allows for using thresholding strategies that work well with specific detectors.

Traditional evaluation metrics
Before embarking on the time series specific metrics, it is beneficial to understand some of the evaluation metrics used for anomaly detection and classification in general, most of which build on the confusion matrix. The confusion matrix considers the possible combinations of binary prediction and labels, and includes the number of
• true positives (TP): points that are labelled and predicted as anomalous,
• false positives (FP): points that are labelled normal but predicted as anomalous,
• false negatives (FN): points that are labelled anomalous but predicted as normal,
• true negatives (TN): points that are labelled and predicted as normal,
as seen in Figure 4. We refer to these four numbers as counting metrics. They are not used for evaluation directly, but are needed for calculating the following metrics:

Accuracy is the fraction of correctly predicted points, i.e. (TP + TN) / (TP + TN + FN + FP). Although simple, and to the uncritical eye informative, this metric should not be used for classification with imbalanced classes, which anomaly detection is by definition. Since most points are normal, a prediction of only normal points will get a high accuracy despite not being useful at all.
Recall, also known as sensitivity and true positive rate, is the fraction of true anomalies that are correctly classified, i.e. TP / (TP + FN). False positives are not penalized, so predicting all points as anomalous will get a perfect recall of 1. For this reason, recall is usually not used on its own.

Precision is the fraction of anomalous predictions that are actual anomalies, i.e. TP / (TP + FP). Like recall, this is not used on its own, since false negatives are not penalized, and only marking the most obvious anomaly would be the best strategy.

f1-score is the harmonic mean of precision and recall, 2PR / (P + R). The prioritization of precision and recall is a trade-off: a strict threshold yields few predicted anomalies, and thus high precision but low recall, and vice versa. Depending on the situation, it might be (very) preferable to have a false positive rather than a false negative, or the opposite. Thus, a more general definition is the f_β-score, defined by (1 + β²)PR / (β²P + R). The value of β is then chosen so that the score reflects the relative importance of precision and recall. We will use β = 1 in the examples of this paper, as is also common in the literature when comparing methods, but we highlight that an informed choice should be made for this parameter when using this metric for real world problems.
False positive rate is the fraction of truly normal points that are predicted as anomalous, FP / (FP + TN). Contrary to recall, the optimal score of 0 is obtained by predicting all the points as normal. This is used for calculating the AUC-ROC score described in Section 5.2.3.
Precision@k is the precision of the k points with the highest anomaly score. Although this is just precision with a specific thresholding strategy, it deserves some extra attention. Since the denominator TP + FP = k is predetermined, false positives are indeed penalized. Thus this becomes a valid metric in itself, not needing to be combined with recall. In fact, recall@k is the same value as precision@k, up to the predetermined constant k / (TP + FN) 5. Compared to the above metrics, this strategy requires a number of anomalies k instead of a threshold. This may be a simpler and more intuitive choice; a common practice is to use the number/fraction of anomalies in the dataset. It may also be more fair when comparing methods with differently distributed anomaly scores than many other threshold selection strategies.
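To make the definitions above concrete, the following minimal sketch (function names and example data are ours) computes the counting metrics, the f_β-score, and precision@k for a toy prediction:

```python
def counting_metrics(labels, pred):
    """Point-wise TP, FP, FN, TN from binary labels and predictions."""
    tp = sum(1 for l, p in zip(labels, pred) if l and p)
    fp = sum(1 for l, p in zip(labels, pred) if not l and p)
    fn = sum(1 for l, p in zip(labels, pred) if l and not p)
    tn = sum(1 for l, p in zip(labels, pred) if not l and not p)
    return tp, fp, fn, tn

def f_beta(precision, recall, beta=1.0):
    """f_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives the f1-score."""
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def precision_at_k(labels, scores, k):
    """Precision of the k points with the highest anomaly score."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top_k) / k

labels = [0, 0, 1, 1, 1, 0, 0, 1]
pred   = [0, 1, 1, 1, 0, 0, 0, 0]
tp, fp, fn, tn = counting_metrics(labels, pred)  # (2, 1, 2, 3)
precision, recall = tp / (tp + fp), tp / (tp + fn)  # 2/3 and 1/2
print(f_beta(precision, recall))  # f1 = 4/7

scores = [0.1, 0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.4]
print(precision_at_k(labels, scores, k=3))  # 2 of the 3 top-scoring points are labelled
```

Note how precision_at_k needs no explicit threshold: choosing k implicitly sets one.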
The metrics above are often used for time series without adaptation, by regarding every time stamp individually.A large number of the evaluation metrics designed specifically for time series are versions of precision and recall that are redefined to handle events in a different way, either by a redefined confusion matrix, or by redefining precision and recall to not use the counting metrics at all.These are then usually used either to calculate f-score, or an AUC score, which we will discuss in Section 5.2.3.

Method
Several choices were made for the purpose of limiting the scope of this paper and keeping it concise. We did not include metrics from similar domains like time series classification, anomaly detection for non-time series, or change point detection. The latter, although similar to TSAD, only contains point anomalies. Furthermore, we only consider single scalar metrics aimed at performance evaluation for detector selection, and not supplementary statistics for performance analysis. This means we will not consider the numerous variants of precision and recall as their own metrics, only as part of the f1-score or the AUC-PR score described in Section 5.2.3. Precision and recall are occasionally used for detector selection in situations where false positives and negatives have very different costs. However, due to the simple optimal strategies described in Section 2.2, f_β with a large/small β is a much better alternative. Other interesting statistics excluded by this choice are early detection [28], before/after true positives [29] and alert delay [30]. Combinations of these statistics with other statistics could result in evaluation metrics with valuable properties. ROC- and PR-curves (see Section 5.2.3) are often used for visualising properties of the anomaly score. We will only consider these for the purpose of calculating the much used single scalar AUC metrics.
There are several ways to vary each metric, by using techniques from one metric on one of the others. Indeed, some of the metrics are modified versions of other metrics, in such a way that all the other metrics could be modified in the same way. Studying all these combinations is not feasible without expanding the work substantially, so we will only study such modifications in their originally proposed, or most used, form. This should give an idea of the effect of each modification. Readers interested in a specific metric, either one included here or one that could be made by combining ideas from the ones included, are encouraged to conduct their own experiments.
Finally, for obvious reasons, we only consider metrics that either are rigorously defined in their original paper, or have open source implementations available.

Properties
In order to systematically evaluate the various metrics used in TSAD, we have defined several properties that differentiate the metrics, and analysed how these properties affect the results of the metrics. It is important to note that these properties are in general not inherently positive or negative; rather, the desirability of each property depends on the specific context and scenario. We have organized the properties into three categories: (1) Preferences: properties related to the kinds of predictions preferred by the metrics, (2) Requirements: requirements for utilising the metrics, and (3) Suitability: properties regarding the general suitability of the metrics in TSAD applications.

Preferences
As time series anomaly detection methods rarely produce perfect predictions, a good metric needs to be able to prefer the best imperfect detection available for the situation in which the detector will be used. We list five properties regarding what kinds of predictions are preferred by the metrics.
Early detection. In the literature and in practical scenarios, two distinct contexts can be identified. In the first context, detection of a possible anomaly should occur as soon as possible [31], such as when anomaly detection is used in real-time systems where an anomaly indicates there is an issue requiring immediate attention. In these cases, detecting the anomaly at a late stage is of no value since it is too late to rectify the problem. In the second context, data is analysed offline, or on a much larger time scale, where detection and reaction time is far greater than anomaly length, e.g. for diagnosis based on ECG monitoring [32, 33]. In these cases, the differences between early and late detection are of no practical relevance.
Long anomalies. Longer anomalies could indicate more serious problems which are also more important to detect, or they might just indicate more subtle anomalies which are harder to locate [17]. The shortest anomalies might also be the most important ones, e.g. if they indicate serious problems that were fixed quickly, while the less serious ones were ignored and therefore lasted much longer. In most metrics, the contribution of an anomaly to the final score is either proportional to its length or independent of its length. As many commonly used TSAD datasets have both long anomalies and single point anomalies, this difference has a great impact.

Short predicted anomalies. Some detectors, e.g. window-based ones 6, produce anomaly scores with short peaks, while other methods produce wider areas of high anomaly score. The latter will generally result in longer predicted events. This might not have a big impact on the value of the prediction, but some metrics have a strong preference for short predicted events, independent of the length of the labelled anomaly.
Partial detection. The ability to detect a subset of the anomaly (referred to as "partial detection") can be more important than correctly detecting its exact time span (referred to as "covering"). According to the work of [30], an operator receiving an alert of an anomaly will investigate the data manually, and the manual inspection will be the determining factor going forward, rendering the exact location and duration of the detection less relevant. However, [9] notes that the operator may not necessarily find the anomaly if it is subtle and of a much longer duration than the detection, which would make the location and duration of the detection significant.
Proximity. The start and end of an anomaly are often unclear [17], and when manually labelled, the labels might not be very reliable [18]. Furthermore, a predicted event that is off by a few time steps might still be very useful. Indeed, window-based detection methods might report the anomaly at either end of the window [18]. In offline anomaly detection, this should not overly affect the score. For these reasons, detecting an anomaly close to a labelled anomaly should be valued by the metric.

Requirements
Different metrics use different inputs, and require different degrees of parameter specification.
Require no thresholding. As discussed in Section 2.1, some metrics evaluate the anomaly score directly and thus require no thresholding [15].

Require no parameters [15]. Correctly specifying numerous parameters to reflect specific needs can be resource demanding. Furthermore, it is easier to compare results across research papers when they do not use different parameters. Nevertheless, TSAD tasks vary greatly, and parameters offer the flexibility needed for a metric to be useful for most specific cases.

Suitability
The different metrics are also meant for different kinds of use, and might not always be suitable. We list three properties related to the suitability of the metrics in different use cases.

Time aware. Metrics not made for time series or sequential data do not use the chronology when calculating the score. Awareness of the labels and predictions of surrounding points is necessary for capturing the underlying time dependency specific to time series.
Insensitivity to True Negatives.Given that anomalies are by definition rare events, a low score should be given when no anomalies are detected, even though the prediction is correct most of the time.Furthermore, it is useful not to be affected by how large the portion of true negative time points is, as this is a rather uninformative part of the data.
Generality. A metric that is appropriate for many real scenarios is also useful for research that is not domain/problem-specific, as the results would be relevant for more situations. However, since TSAD is used for such a large span of different problems, no metric can suit all situations.
Finally, we highlight that there are several possible desirable properties not included here due to our scope limitations. Such properties can provide valuable insights about the performance of the method, e.g. where it performs well or not [16], or how early the detections are [29], or, for multivariate time series, which signals are the most involved in the anomaly. The latter property is often measured using distinct explainability measures [34, 35, 14, 36].

TSAD Evaluation Metrics: a Taxonomy
In this section, a comprehensive examination of the evaluation metrics found through our research is presented. The metrics are divided into two categories: binary metrics in Section 5.1 and non-binary metrics in Section 5.2. For each category, a taxonomy based on their definitions is introduced, followed by a description of each metric, including their capabilities and potential limitations in utilization.
A rigorous definition of each metric is not included in this study, as some of them are quite complex, with details not necessary for this work.Readers are referred to the cited literature for further information.However, an effort has been made to provide a concise and intuitive understanding of the metrics.In addition, the most noteworthy, distinctive, or potentially problematic characteristics of the metrics are also discussed.

Binary evaluation metrics
We define binary evaluation metrics as metrics evaluating binary predictions, where each data point is classified as either normal or anomalous, aligning with the binary labelling.
Figure 5 shows the proposed taxonomy of binary evaluation metrics, based on how their definitions use counting metrics (TP, TN, FP, FN), precision, recall or f-score.This information is relevant when combining techniques from different metrics, as such techniques may only work on one type of metrics.

Figure 5: A taxonomy of binary evaluation metrics. A large number of these are f-scores based on various definitions of precision and recall. Precision and recall can be defined in many ways. Compared to the original point-wise definition, the difference can be present in the point-wise predictions, the counting metrics (TP, FP, TN, FN) or the formulas for precision and recall. The metrics can also be divided into point-based and event-based metrics, which count respectively individual points or contiguous events when aggregating to the total score. The point- and event-based metrics use both of these methods for parts of the total score.
Most of the metrics are based on the f-score, with some modification of the definitions. The point-wise counting metrics (Section 5.1.1) yield the f-score based on counting metrics calculated at each time point. The adjusted point-wise counting metrics (Section 5.1.2) also use counting metrics in each point, but an adjustment is made to the prediction before the counting, in order to be better suited for anomalous events. For the redefined counting metrics (Section 5.1.3), the counting itself is done in some other way. Redefined precision/recall (Section 5.1.4) are not based on counting metrics at all, but calculated from different formulas. They still use the terms precision and recall because the base concepts are the same. Finally, the other metrics (Section 5.1.5) are not based on the f-score at all.
The metrics are also categorized based on their calculation approach, as either point-based or event-based. All the metrics are computed by aggregating the contributing parts of the time series, but in different ways. The point-based metrics evaluate each time point individually, whereas the event-based metrics evaluate each entire event as a single subscore, regardless of the number of time points it comprises. This distinction has significant implications for what is considered a good prediction, as will be demonstrated in Section 6. Some metrics calculate part of the score in a point-based way and part in an event-based way. We name these metrics point- and event-based.

Point-wise metrics
Point-wise f-score (PW f_β). One of the most straightforward evaluation metrics involves treating each time point as a single observation and calculating the f-score as outlined in Section 2.2. This approach is exemplified in Figure 6.
Although not made for time series, the point-wise f-score is widely used in TSAD [37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]. It is a simple metric, making it easy to implement and the results simple to understand. Also, methods are rewarded for predicting all the points that are actually labelled as anomalies, and none of the others, which is exactly what an anomaly detector should do, as opposed to some of the metrics we will describe below. Nevertheless, as we will see in the experiments of Section 6, the uneven event weighting and lack of tolerance can be highly problematic.

Point-adjusted metrics

Point-adjusted f-score (PA f_β). The point-adjusted metrics build on the argument that an operator receiving an alert at any point during an anomalous event will be able to find the whole event. As a result, if at least one point of a labelled anomalous segment is predicted as anomalous, the entire contiguous segment is marked as anomalous in the prediction prior to calculating point-wise precision, recall, and f-score.
Previous works [58, 14, 13, 12] have shown that this metric can provide overly optimistic scores even if multiple anomalies are missed. In fact, the works of [13, 12] demonstrated that random guessing outperforms state-of-the-art methods under this metric. The cause of this is a seemingly unintended flaw of the metric, which is illustrated in Figure 6. Despite the argument that the whole anomaly is detected if an operator receives an alert within the anomaly, which legitimizes a recall of 1, only half of the alerts were correct, so the precision of the prediction should be 0.5. However, after adjustment, it is close to perfect. The greater the discrepancy between the duration of labelled and predicted anomalies, the more severe the problem becomes 7. Calculating precision prior to adjustment would avoid this issue and produce a precision-recall pair that aligns with the reason for the adjustment and the meaning of precision and recall. Nevertheless, we instead suggest using the composite f-score (Section 5.1.3), a more appropriate metric in cases where a warning during an anomaly is sufficient.
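For concreteness, the adjustment procedure can be sketched as follows. This is a minimal illustrative implementation (function names and example data are ours, assuming binary label and prediction sequences), showing how a single short detection inside a long labelled event inflates the score:

```python
def segments(binary):
    """(start, end) inclusive index pairs of contiguous runs of 1s."""
    out, start = [], None
    for i, v in enumerate(list(binary) + [0]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            out.append((start, i - 1))
            start = None
    return out

def point_adjust(labels, pred):
    """If any point of a labelled segment is predicted, mark the whole segment."""
    adjusted = list(pred)
    for start, end in segments(labels):
        if any(pred[start:end + 1]):
            adjusted[start:end + 1] = [1] * (end - start + 1)
    return adjusted

def f1(labels, pred):
    tp = sum(1 for l, p in zip(labels, pred) if l and p)
    fp = sum(1 for l, p in zip(labels, pred) if not l and p)
    fn = sum(1 for l, p in zip(labels, pred) if l and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One long labelled event (6 points) and a single predicted point inside it.
labels = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
pred   = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(f1(labels, pred))                        # point-wise f1: 2/7, about 0.29
print(f1(labels, point_adjust(labels, pred)))  # point-adjusted f1: 1.0
```

A single correct point out of six yields a mediocre point-wise score but a perfect point-adjusted score, mirroring the discrepancy discussed above.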
The works of [60] and [61] use an adaptation of the point-adjusted metrics (dtPA f_β), where a GT anomaly is only considered detected if an anomaly is predicted within the first k time steps of the anomaly. If not, all the points in the anomaly are marked as false negatives, even the ones predicted as anomalous. With this metric, precision can still be unreasonably high, but it is much more difficult to achieve, and the random guessing strategy that prevails for PA f_β will have a much harder time getting high scores.
Point-adjusted metrics at K% (K% PA f_β). The work of [12] suggests altering the point-adjusted metric by requiring a portion K% of the anomaly to be detected in order to make the adjustment. As with dtPA f_β, this effectively reduces the effectiveness of random guessing, and of short detections in general. Furthermore, as argued by [9] and [11], an expert receiving a short alert within a much longer anomaly might not be able to see the anomaly, but by requiring a substantial part of the anomaly to be detected, the chance that an expert would actually notice it is much larger.
Latency and sparsity-aware f-score (ls-f_β^n). The work of [10] notes that the point adjustment metrics do not value early detections, and changes the algorithm to only adjust the values of a contiguous anomaly segment after the first TP.
They also note that false positive points require more resources when they are spread out than when they lie in close proximity (so that they only require attention once). The prediction is therefore down-sampled by a user-specified factor n. This way of rewarding earliness reflects situations where the negative effects of an anomaly, which are proportional to its length, are avoided after the point at which it is detected.

Redefined counting metrics
Segment-wise f-score (S-f_β). The work of [25] introduced segment-wise precision, recall and f-score, where each contiguous segment of anomalous points is considered one event. One true positive is recorded for each true anomalous segment with at least one predicted anomalous point, one false negative for each of the remaining true anomalous segments, and one false positive for any predicted anomalous segment without any true anomalous points. Figure 6 shows an example of this. This metric is used by [26,29,62,63].
A problematic property of this metric is that extending the length of a predicted anomaly never gives a worse score, and often gives a better one. It thus favours detectors with long contiguous events, all the way to the extreme case: predicting every point in the time series as anomalous gives perfect precision and recall for any time series with at least one anomaly.
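A minimal sketch of the segment-wise counts (our own illustration, not the reference implementation of [25]) makes the degenerate case easy to reproduce:

```python
import numpy as np

def segments(y):
    """Return (start, end) pairs of contiguous runs of 1s, end exclusive."""
    starts = np.where(np.diff(np.r_[0, y]) == 1)[0]
    ends = np.where(np.diff(np.r_[y, 0]) == -1)[0] + 1
    return list(zip(starts, ends))

def segment_wise_pr(y_true, y_pred):
    true_segs, pred_segs = segments(y_true), segments(y_pred)
    tp = sum(1 for s, e in true_segs if y_pred[s:e].any())      # detected true segments
    fn = len(true_segs) - tp                                    # missed true segments
    fp = sum(1 for s, e in pred_segs if not y_true[s:e].any())  # entirely wrong predicted segments
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = np.array([0, 1, 1, 0, 0, 1, 0, 0])
y_all = np.ones_like(y_true)           # degenerate detector: flag every point
print(segment_wise_pr(y_true, y_all))  # (1.0, 1.0): perfect score for a useless detector
```

The single all-covering predicted segment overlaps a true anomaly, so no false positive is ever counted.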
Composite f-score (C-f_β). The work of [14] suggested combining point-wise and segment-wise metrics, and proposes the composite f-score, defined as the harmonic mean of point-wise precision and segment-wise recall. The point-wise precision ensures that false positive points are discouraged, whereas extra true positive points in an already partially detected anomaly are only rewarded through the increased precision.
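The composite score can be sketched as follows for β = 1; the label and prediction arrays are hypothetical:

```python
import numpy as np

def segments(y):
    """Return (start, end) pairs of contiguous runs of 1s, end exclusive."""
    starts = np.where(np.diff(np.r_[0, y]) == 1)[0]
    ends = np.where(np.diff(np.r_[y, 0]) == -1)[0] + 1
    return list(zip(starts, ends))

def composite_f1(y_true, y_pred):
    # Point-wise precision discourages false positive points ...
    precision_pw = (y_true & y_pred).sum() / y_pred.sum() if y_pred.sum() else 0.0
    # ... while segment-wise recall rewards detecting each event at all.
    true_segs = segments(y_true)
    recall_seg = sum(1 for s, e in true_segs if y_pred[s:e].any()) / len(true_segs)
    if precision_pw + recall_seg == 0:
        return 0.0
    return 2 * precision_pw * recall_seg / (precision_pw + recall_seg)

y_true = np.array([0, 1, 1, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 0, 1, 0, 0])  # one point per event, no false positives
print(composite_f1(y_true, y_pred))          # 1.0: both events found, all alerts correct
```

Detecting one point of each event already yields the maximal score; extra coverage only helps through precision.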
Time tolerant f-score (t-f_β^τ). The work of [65] defines (point-wise) precision and recall with temporal tolerance τ, essentially by counting a predicted anomaly point as a true positive when it is closer than τ to a labelled anomaly point. They then show that while the recall and precision of their example prediction increase drastically with the tolerance, the scores of a random prediction increase more, and the statistical significance decreases substantially. Hence results reported with temporal tolerance may be less significant than without, despite the scores looking more impressive. It should be noted, however, that their data contain many short anomalies. A tolerance of a few time steps has a much larger impact on the random prediction score with such a dataset than with fewer or longer anomalies. Although these evaluation metrics are not widely used, similar tolerance techniques are used elsewhere: either in the metric (as here), in the labelling of the data (as in NAB, explained in Section 5.1.5), or in detectors padding their predicted events before outputting them. Such significance tests can be useful when determining how much temporal tolerance to use.
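One plausible reading of the tolerant counting, as a sketch (for illustration we use a non-strict tolerance, distance ≤ τ, and hypothetical arrays):

```python
import numpy as np

def tolerant_pr(y_true, y_pred, tau):
    t_idx, p_idx = np.flatnonzero(y_true), np.flatnonzero(y_pred)
    if len(t_idx) == 0 or len(p_idx) == 0:
        return 0.0, 0.0
    # A predicted point counts as a TP if some labelled point lies within tau.
    precision = np.mean([np.abs(t_idx - p).min() <= tau for p in p_idx])
    # A labelled point is recalled if some predicted point lies within tau.
    recall = np.mean([np.abs(p_idx - t).min() <= tau for t in t_idx])
    return float(precision), float(recall)

y_true = np.array([0, 0, 1, 1, 0, 0, 0, 0])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0])  # two steps late

print(tolerant_pr(y_true, y_pred, tau=0))  # (0.0, 0.0): no overlap at all
print(tolerant_pr(y_true, y_pred, tau=2))  # (1.0, 0.5): the tolerance rescues the detection
```

The same slightly-late prediction swings from a zero score to high precision as τ grows, which is exactly why the significance check matters.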

Redefined precision and recall
Range-based f-score (R-f_β^(α,bias)). The work of [8] argues that point-wise precision and recall fail to address many aspects of anomaly detection in time series, and introduces range-based precision and recall. These metrics have been used in [66] and [62]. They are rather complex and highly customizable, with a tunable weight and up to six tunable functions to align the score with the goal of the detection task. Thorough guidelines, defaults and examples are provided in [8]. The score is based on up to four concepts: detecting the anomaly range with at least one anomaly point, while also covering as large a portion of the anomaly range as possible. High cardinality, i.e. the number of predicted segments within one labelled anomaly, can be punished, and a function rewarding the position of a detected anomaly within a labelled one can be specified. Although evaluation metrics that consider the relative positions of detection and label are mostly useful for rewarding early detection, in these metrics they can also be set to reward, e.g., detections at the middle or at the end of the labelled anomalies, which the authors argue can be useful in certain cases, e.g. as a way of preventing false positive alarms. We have not found the cardinality concept in any other TSAD evaluation metric, and thus have not considered it a desirable property. It may be more relevant for change point detection [67].
Time series aware f-score (Ta-f_β^δ). The work of [9] proposes time-series aware precision and recall metrics. These are similar to range-based precision and recall, but do not consider the concepts of cardinality and position. The metrics also require that a certain portion θ of the labelled anomaly be correctly predicted for it to count as a correct detection. The authors note that determining the end of a labelled anomaly can be challenging, and therefore include a region of length δ following the labelled event, with a positive but decreasing score, to account for this. This reduces the reliance on correct labelling and prediction at the end of the event and shortly after it. A slightly altered version of this metric can be found in [17], where the method for determining the length of ambiguous sections was changed.
Enhanced time series aware f-score (eTa-f_β). The author of [11] highlights that previous evaluation metrics may reward detections that overlap with actual anomalies even when they are too long or too short to be useful. To address this issue, they propose a metric that considers both a detection score and an overlap score. The metric requires that a certain part of the actual anomaly be detected and a certain part of the detected anomaly be true; two parameters control these portions. The precision calculation includes a weighting function that weights each event by the square root of its length, as a compromise between typical point-based and event-based weighting.
Affiliation-based f-score (A-f_β). The work of [16] tackles problems commonly seen in existing metrics and introduces a distance-based metric as a solution. They calculate the average of the local precision and recall for each anomaly event. Local precision is calculated by averaging the distance between each predicted anomaly point and its closest labelled anomaly point, and expressing it as the probability of outperforming a random prediction. Recall is calculated similarly, using the average distance from each labelled anomaly point to its closest predicted anomaly. By using distance, this metric evaluates the proximity of predicted and labelled anomalies even when they do not overlap. It also values detection over coverage in a natural way. Finally, by scoring locally, the results are more interpretable, since each anomaly and its impact on the score can be evaluated separately.

Other metrics
NAB score (NAB). The Numenta Anomaly Benchmark (NAB), presented by [31], includes a dataset for time series anomaly detection and a novel evaluation metric. The metric penalizes false positive points with a negative value, and rewards true anomalous segments with a positive value based on how early the first anomalous point was predicted. The score is normalized by comparing it to a scenario where no anomalies are detected.
Since only one of the true positive points in an anomalous segment contributes to the score, while every false positive point contributes negatively, the score favours detectors predicting short events; it is almost never beneficial to predict two contiguous points as anomalous.
NAB also introduced a different approach to labelling anomalies. This approach allows detectors predicting anomalies before they occur to be rewarded, and makes the score less dependent on the individuals who label the anomalies. A simplified explanation of the approach is provided here (see the work of [68] for the full details). The process involves a group of labellers deciding the first anomalous point for each anomalous event. Then, the points on both sides are marked anomalous, such that the original starting point is in the center of the event, each event has the same duration, and 10% of the dataset is labelled as anomalous.
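The relabelling can be sketched as follows; the window arithmetic is our simplification of the procedure in [68], and the inputs are hypothetical:

```python
import numpy as np

def expand_labels(n, first_points, anomaly_fraction=0.1):
    """Centre equal-length windows on the labeller-chosen first anomalous
    points, so that roughly anomaly_fraction of the series is anomalous."""
    window = int(round(n * anomaly_fraction / len(first_points)))
    half = window // 2
    labels = np.zeros(n, dtype=int)
    for c in first_points:
        labels[max(0, c - half): min(n, c - half + window)] = 1
    return labels

labels = expand_labels(n=100, first_points=[30, 70], anomaly_fraction=0.1)
print(labels.sum())  # 10: two 5-point events covering 10% of the series
```

Each labeller-chosen point ends up centred in a window of equal duration, as the text above describes.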
This strategy is similar to the temporal tolerance technique in t-f_β^τ. However, in this case it is part of the labelling strategy instead of the metric. It is thus not part of the implementation used in this paper, and we will not see its effects in the experiments in Section 6.
The NAB score is not widely used, but the NAB datasets are commonly used for benchmarking with other metrics [20,19]. The labelling strategy of this dataset highlights the importance of not blindly combining arbitrary metrics and datasets. Due to the labelling strategy, at least 50% of the points labelled anomalous were considered normal by the labellers, invalidating metrics counting each point individually, like PW-f_β.
Temporal distance (TD). Temporal distance, presented by [21], is a very simple metric: it sums the distances from each labelled anomaly point to the closest predicted anomaly point, and from each predicted anomaly point to the closest labelled anomaly point. The lower the score, the better. This metric prioritizes roughly finding all the correct anomalies over getting the detection exact, since any false positive/negative raises the score by the distance to the closest anomaly. As long FPs and FNs are punished roughly proportionally to their length, the metric prioritizes long labelled anomalies, and a method predicting short events has an advantage when predicting FPs. The work of [21] presents two versions of this metric, summing either absolute or squared distances. Generalizing this, one could use any positive power of the absolute distance. We consider this exponent a parameter, and use 1 in all experiments. High values of this parameter punish large distances more than low values.
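Since the metric is so simple, a direct sketch is short (assuming, for illustration, that both series contain at least one anomalous point):

```python
import numpy as np

def temporal_distance(y_true, y_pred, power=1):
    """Sum of distances label -> nearest prediction and prediction -> nearest
    label, each raised to `power`. Lower is better."""
    t_idx, p_idx = np.flatnonzero(y_true), np.flatnonzero(y_pred)
    to_pred = sum(np.abs(p_idx - t).min() ** power for t in t_idx)
    to_true = sum(np.abs(t_idx - p).min() ** power for p in p_idx)
    return to_pred + to_true

y_true = np.array([0, 0, 1, 1, 0, 0])
y_pred = np.array([0, 0, 0, 1, 1, 0])  # shifted one step to the right
print(temporal_distance(y_true, y_pred))  # 2: one unmatched point on each side, distance 1 each
```

Setting power=2 reproduces the squared-distance variant, which punishes large distances more heavily.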
Temporal distance might seem very similar to the affiliation f-score, but there are some important differences. Since A-f_β is calculated locally for every event, it is an event-based score, while TD is point-based; the effects of this will be clear from the experiments in Section 6. It may also lead to some odd situations when two or more anomalies are relatively close, as seen in Figure 7. While TD considers absolute distances, and therefore considers the first event in prediction 1 to be further from the labels than the second event in prediction 2, A-f_β considers relative distances within the local surroundings of each event, and therefore considers the distance in the last anomaly in prediction 2 to be larger than in the first anomaly in prediction 1.

Figure 7: We test the affiliation and temporal distance metrics on two predictions of the same label time series (prediction 1: A-f_1 = 0.91, TD = 14; prediction 2: A-f_1 = 0.9, TD = 2). The best score for each metric is shown in bold. The labels include two events, and each prediction is a bit early on one of them. The affiliation metric splits the time series into periods with one event each, and calculates the relative distance of the closest predicted event. In this example, the first anomaly in prediction 1 is seen as closer to a true anomaly than the second anomaly in prediction 2. TD, on the other hand, uses absolute distance, and prefers prediction 2.

Non-binary evaluation metrics
The non-binary evaluation metrics are those that evaluate the anomaly score, as opposed to a binary prediction obtained by thresholding the anomaly score. For these metrics, the thresholding step is part of the evaluation.
A taxonomy of non-binary evaluation metrics is proposed in Figure 8. The primary difference between these metrics lies in how they handle the threshold. Some metrics, such as P@K and binary metrics with optimal threshold, choose a single threshold, resulting in a single binary prediction. These metrics are still considered non-binary, as the threshold selection is part of the metric. The other non-binary metrics evaluate all possible thresholds (metrics based on all thresholds) and combine them into a single scalar score, either by calculating the area under a curve (AUC metrics) or the volume under a surface (VUS metrics). The choice of non-binary metric depends on the specific requirements and goals of the evaluation, and on the suitability of each metric for the task at hand.

Figure 8: A taxonomy of non-binary evaluation metrics, grouping P@K and binary metrics with optimal threshold (see Figure 5) as point- and/or event-based single-threshold metrics, and the AUC and VUS metrics as metrics based on all thresholds. Although the input differs from the binary metrics, they are quite similar, and indeed any binary metric can be made non-binary by using the optimal threshold strategy (see Section 5.2.2).

Precision at K (P@K)
The point-wise P@K metric defined in Section 2.2 is occasionally used for TSAD evaluation [15,19]. Definitions of precision other than point-wise could in principle be used; e.g., the works of [70,71] use an event-based variant of recall at K for spatiotemporal anomaly detection, although for precision this would require defining how the number K of anomalies included in the prediction is counted.
A variant of P@1 is the UCR score used by [72] and defined by [73]. The duration of the GT anomaly is extended at both ends to include some time tolerance before P@1 is calculated.

Binary metrics with optimal threshold
Binary metrics are typically used with the threshold that yields the best score [27,39,42,43,74,31]. This can be done with any binary evaluation metric. A metric combined with this thresholding strategy requires an anomaly score as input, resulting in non-binary evaluation. The optimal threshold is determined using the labels, and can only be determined during the evaluation phase; it thus provides an upper limit to the score achievable with the binary metric. The relevance of this upper limit depends on the situation and the chosen binary metric. For brevity, we only consider the point-wise f-score with the optimal threshold strategy (best-PW-f_β) in the remainder of this work.
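A sketch of this thresholding strategy for the point-wise f-score, scanning every distinct score value (the inputs are hypothetical):

```python
import numpy as np

def best_pw_f(y_true, score, beta=1.0):
    """Point-wise f-score at the label-informed optimal threshold."""
    best = 0.0
    for th in np.unique(score):
        y_pred = (score >= th).astype(int)
        tp = int((y_true & y_pred).sum())
        fp = int(((1 - y_true) & y_pred).sum())
        fn = int((y_true & (1 - y_pred)).sum())
        # f_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)
        denom = (1 + beta**2) * tp + beta**2 * fn + fp
        best = max(best, (1 + beta**2) * tp / denom if denom else 0.0)
    return best

y_true = np.array([0, 0, 1, 1, 0])
score = np.array([0.1, 0.4, 0.35, 0.8, 0.2])
print(best_pw_f(y_true, score))  # 0.8, achieved at threshold 0.35
```

Because the labels themselves select the threshold, the returned value is an upper bound on what any fixed-threshold deployment of the same detector could score.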

Area under the curve (AUC_ROC, AUC_PR)
The receiver operating characteristic (ROC) is an evaluation metric commonly used for TSAD, as well as in binary classification in general. For each choice of threshold, the prediction has a specific recall and false positive rate. Plotting these against each other results in the ROC curve. This curve is often inspected directly, as it visualizes the trade-off between recall and false positives, e.g. how large a false positive rate must be allowed for certain levels of recall. To obtain a single scalar evaluation metric from this curve, it is common to integrate the area under the curve (AUC), yielding AUC_ROC. This value summarizes the detection performance across all thresholds, and is widely used in TSAD [40,75,20,42,76,77,78,79,80,81,82,83,84,85,86,87]. An alternative is to apply the area-under-curve approach to precision and recall, resulting in the area under the precision-recall curve (AUC_PR), also known as average precision. This approach too is commonly used in TSAD [88,42,89,90,78,46,84,85,86]. In our experiments, we only consider point-wise precision and recall for the PR curve, as is by far the most common choice, although other pairs can be used, like the point-adjusted AUC_PR of [55] or the range-based AUC_PR of [20]. Variations of the ROC curve can be used as well, but the false positive rate is not defined for the event-based metrics.
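In practice these values typically come from library implementations such as sklearn's roc_auc_score and average_precision_score; purely as an illustration, both areas can also be computed directly from their probabilistic interpretations (hypothetical inputs):

```python
import numpy as np

def auc_roc(y_true, score):
    """Probability that a random anomalous point outscores a random normal
    point, counting ties as 1/2; equivalent to the area under the ROC curve."""
    pos, neg = score[y_true == 1], score[y_true == 0]
    pairs = pos[:, None] - neg[None, :]
    return float((pairs > 0).mean() + 0.5 * (pairs == 0).mean())

def average_precision(y_true, score):
    """AUC_PR as average precision: mean of the precision at each anomalous
    point, taken in decreasing order of anomaly score."""
    order = np.argsort(-score, kind="stable")
    y = y_true[order]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float(precision[y == 1].mean())

y_true = np.array([0, 0, 1, 1])
score = np.array([0.1, 0.4, 0.3, 0.8])
print(auc_roc(y_true, score))            # 0.75
print(average_precision(y_true, score))  # ≈ 0.833
```

The pairwise form of AUC_ROC makes its independence of the time dimension explicit: only the ranking of scores matters.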
The use of AUC_ROC has been criticized for its integration over all thresholds, which can result in a large portion of the score coming from thresholds that are not relevant for a specific use case [91,92,93]. A possible solution is to consider only part of the curve, as suggested by [91], although it can be hard to determine how much of it to use. Another possibility is to use AUC_PR instead. While AUC_PR also integrates over all thresholds, it has been argued to be more informative than the ROC for imbalanced datasets [94,95], which anomaly detection by definition involves. The reason is that precision and false positive rate respond differently to changes in false positives (FPs). In anomaly detection, the number of true negatives is typically very large compared to FPs, making the false positive rate low for all relevant choices of threshold. As a result, only a small part of the ROC curve is relevant in such cases.
We visualize this with an example. Assume a very large dataset has 2% anomalies, and that two detectors, named blue and green, produce anomaly scores from the normal distributions visualized in Figure 9. That is, each detector produces anomaly scores from the black distribution in Figure 9 for normal points, and from the red one for anomalous points. Note that since AUC_ROC and AUC_PR are independent of the time dimension, time is not included in this example. This results in the ROC curves in Figure 10a and the PR curves in Figure 10b.
From the ROC curves in Figure 10a we see that the green detector outperforms the blue detector for most values of the false positive rate, so AUC_ROC prefers the green detector. Inspecting the graph, however, we see that for smaller false positive rates the blue detector is better. Inspecting the PR curves in Figure 10b, we see that the blue detector would have by far the best AUC_PR, but for low precision the green detector is better. While the figures contain the same information [94], it is clear that the difference in x-axis is crucial, not only for AUC values but for inspection of the curves as well.
Figures 9a and 9b also show the thresholds yielding the optimal f-score at different values of β. The points on the curves corresponding to these values, and more, are shown in Figures 10a and 10b. We see that for β ≈ 1, the recall value has very little impact on AUC_ROC compared to AUC_PR. Indeed, the relevant values of β would have to be quite high for AUC_ROC to be more informative than AUC_PR. But from Figure 9, high β might seem more relevant, due to the large increase in TP, and high values of β make up a relatively small part of the PR curves in Figure 10b. As always, what is most suitable comes down to the situation. Since the ROC curve uses the fraction of FPs relative to all normal samples, instead of relative to anomalous predictions, the difference between ROC and PR scales with the imbalance of the data: when the anomalies make up an even smaller fraction of the data, AUC_ROC corresponds to even higher values of β.

Figure 9: Probability density functions for the anomaly scores of the positive (red) and negative (black) samples, for the blue and green detectors. Anomalous samples generally give higher anomaly scores, although many normal and anomalous points have similar scores, making it hard to set a threshold. Figures 9a and 9b show the counting metrics (as the shaded areas) for thresholds optimal for PW-f_β with β = 1 and β = 8 respectively. Keep in mind that only 2% of samples are anomalies, i.e. TN + FP = 49(FN + TP), which is not shown in the figures. FP and FN are of comparable sizes in 9a.

Volume under the surface (VUS_ROC^l, VUS_PR^l)
The concept of volume under the surface (VUS) was introduced by [15], extending AUC_ROC and AUC_PR. The authors recognized the need for some tolerance for predicted anomalies close to actual anomalies. They addressed this by adjusting the labels: instead of binary labels of 0 or 1, they use labels with real values in the range [0, 1]. The original labelled anomalies are still given a value of 1, and normal points a certain distance l away from anomalies are given a value of 0; labels closer to the original labelled anomalies gradually decrease as the distance from the anomaly increases. The authors refined the point-wise recall by multiplying it with the existence factor used in R-f_β^(α,bias). Using the new definitions of recall, precision, and false positive rate, they defined range versions of AUC_ROC and AUC_PR. However, since this approach depends heavily on the tolerance threshold l, they also introduced the volume under the surface metric. Inspired by the way the AUC metrics integrate away the dependency on the threshold by considering the area under a curve generated from all threshold values, the VUS metrics integrate over l to obtain the volume under the surface generated by the ROC or PR curve along an axis of values of l. This way, the final value takes multiple tolerance levels into account. Nevertheless, the metric still depends on the maximum value of l.

Figure 10: The ROC and PR curves for the blue and green detectors. The marked dots correspond to the optimal thresholds for PW-f_β with β ∈ {16, 8, 4, 2, 1, 1/2, 1/4, 1/8}. We observe that the blue detector is best for higher thresholds, which are optimal for lower values of β, and vice versa. The green detector has the higher AUC_ROC, while the blue one has the higher AUC_PR.

Case studies
In this section, we evaluate the presented evaluation metrics on 14 different case studies, in order to illustrate their different properties. It is important to note that the desirability of these properties is highly dependent on the specific domain and use case. Thus, there is no universal "correct" answer for which metrics are best, but for a specific use case there is often one that is most appropriate. By presenting examples and highlighting the properties of the metrics, we aim to provide a clearer understanding of how they can be used effectively in different situations.

Figure 11: Overview of all the metrics considered in the case studies.
To simplify reading the results, the name of each evaluation metric presented is repeated in Figure 11.
Here we outline the decisions made regarding the implementation of the evaluation metrics and the parameter selection. A majority of the metrics have parameters that need to be specified. To maintain consistency in our experiments, we have chosen the same evaluation metric parameters for most of the case studies. However, in some cases we adjust these parameters to highlight a specific effect.
The β in f_β is 1 for all f-score based metrics. For dtPA-f_β^k we use a delay threshold of k = 2 time points. For K%-PA-f_β we require 20% of the anomaly to be detected for adjustment. The downsampling factor of ls-f_β^n is set to 2, and the temporal tolerance of t-f_β^τ to τ = 2 for most experiments, except for the one in Figure 18, where we use τ = 10 to better visualize its effect.
For the range-based f-score R-f_β^(α,bias), we use a cardinality factor of 1, and specify α and the positioning bias in the metric name in the table for each experiment. See the work of [8] for the definitions of these parameters and functions. We use the same configuration for precision and recall.
For Ta-f_β^δ we set α = θ = 0.5 for all tests. We use δ = 0 in most cases, since this is more in line with the other tests, and δ = 10 for the graph in Figure 18 to show the effect of this region. For eTa-f_β, we use θ_p = 0.5 and θ_r = 0.1, to show the effect of using different values of these parameters. This effectively ignores any predicted event with less than 0.5 precision, i.e. when less than half of the predicted event overlaps with anomalies. On the other hand, an anomalous event only counts as undetected when less than 10% of it is detected. Using θ_r = θ_p = 0.5 would yield results similar to those of Ta-f_1^0 in most cases. NAB is implemented using the standard application profile [68]. As NAB is implemented for use with longer anomalies, it does not run in cases where the labels contain events of length 1; we do not include NAB in these cases.
P@K is the precision of the K points with the highest anomaly scores. We set K to the number of anomaly points in the labels. Due to many equal anomaly scores in the test cases, a threshold including K points will often include L > K points; in these cases we report P@L instead.
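The tie handling described here can be sketched as follows (the label and score arrays are hypothetical):

```python
import numpy as np

def precision_at_k(y_true, score, k=None):
    """Precision among the k highest-scoring points; with tied scores the
    threshold may include L > k points, in which case this is P@L."""
    k = int(y_true.sum()) if k is None else k
    threshold = np.sort(score)[::-1][k - 1]
    chosen = score >= threshold          # ties can pull in extra points
    return float(y_true[chosen].sum() / chosen.sum())

y_true = np.array([0, 1, 1, 0, 0])
score = np.array([0.2, 0.9, 0.5, 0.5, 0.1])
print(precision_at_k(y_true, score))  # 2/3: the tie at 0.5 pulls in a third point
```

Here K = 2 (two labelled anomaly points), but the tied score at 0.5 means three points cross the threshold, so P@3 is reported.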
For VUS_ROC^l and VUS_PR^l we use a maximum tolerance of l = 4.
While we have implemented the simple metrics ourselves, the more complicated ones were taken from open source implementations by the authors of the metrics. AUC_ROC and AUC_PR are from sklearn [96]. Our implementation of the metrics, along with the code for generating the tables and figures in this paper, is available on GitHub.

Binary cases
To test the preferences of the different metrics, we have made a series of simple experiments with one time series of labels and two imperfect prediction time series that resemble the labels in different ways. We then test which of the two predictions each metric prefers. For each test we refer to a figure showing the time series and the scores, with the optimal score for each metric shown in bold.

Partial detection vs covering
In anomaly detection, it may be sufficient to detect only a portion of an anomalous event. However, the correct duration of the event is still useful. Figure 12 illustrates the different ways in which these aspects are addressed by the various metrics. The point-wise f-score considers each point equally, regardless of whether the event has already been partially detected. In contrast, some metrics give the highest score to methods that detect only one point, providing no incentive to detect the entire event.

Effect of anomaly length
Most point-based metrics value each time point equally, while most event-based metrics value each event equally. Other options are eTa-f_β, which weights events by the square root of their length, C-f_β, which counts points and events for precision and recall respectively, and NAB, which counts TPs event-wise and FPs point-wise. These differences may lead to some unwanted prioritizations. Figure 13 shows a situation with two short anomalies and one longer one. For point-based metrics, it is better to predict the long one than both of the short ones. For datasets with high variance in anomaly length, or a combination of point anomalies and event anomalies, an event-based metric is often more appropriate. On the other hand, event-based evaluation metrics can be sensitive to sets of short anomalies close to each other, as seen in Figure 14, where the event-based metrics prioritize the cluster of three events over the single long one.

Preference for short predicted anomalies
For PA-f_β and NAB, there is no gain in having more than one TP point within an anomaly, while every FP is punished point-wise. This leads to a considerable preference for short predicted anomalies, as they can give a high reward at a comparatively low risk. As seen in Figure 15, if two detection methods find the same anomalous events, but one of them produces longer predicted anomalies, the scores may be very different. This may seem like the precision/recall trade-off in disguise: these two predictions could come from the same anomaly scores with different thresholds. However, some methods do predict shorter anomaly events than other methods, independent of the threshold.

Score as a function of position of the predicted event
To visualize how the different metrics value predicted events at different positions relative to a labelled event, we made a scenario with a time series of length 100, with one anomalous event from step 40 to 60, and a prediction with one anomalous event of length 5 at variable positions. We calculate the score for each position of the predicted anomaly and plot it in a graph, as visualized in Figure 17. Figure 18 visualizes the score for each metric as a function of the position of the predicted event. We include R-f_β^(α,bias) with two positioning bias functions to show their different effects on the score. As we see, the sensitivity to the position of the prediction varies considerably. S-f_1 only takes two values, and PA-f_1 and eTa-f_1 have almost the same shape, with only slightly reduced scores at the edges. Many of the other metrics have more gradually changing scores. As abnormality in reality is seldom a binary concept, gradually changing scores should be fairer in most cases.
Value earliness
We see that dtPA-f_1^2, ls-f_1^2, NAB and front-R-f_1^0 all value earliness, but in different ways and to varying degrees. NAB has only a slight preference for early detection, while ls-f_1^2 and front-R-f_1^0 have roughly linearly decreasing scores. dtPA-f_1^2 changes very abruptly, and only values very early detections.

Value proximity
In cases where ground truth labels are not precise, methods should be rewarded more/punished less for a false positive close to a true anomaly than for one farther away. Note that the value of earliness might interfere with this, so balancing these concepts can be difficult. We have not found any single metric considering both of these concepts.

Figure 12: Partial detection vs covering: Detecting all the anomalies can be more valuable than covering one of them. Some metrics reflect this, but not all. Some do not value covering at all, and give the optimal score to the bottom prediction.

Figure 13: Effect of anomaly length: Point-based metrics value anomalies by their length, thus giving a higher score to the top prediction than to the bottom prediction, even if the latter discovers more of the anomaly events.

Figure 14: Effect of anomaly length: Depending on the labelling strategy, some datasets may have several non-contiguous anomalies in close proximity. The event-based metrics value detecting the (whole) cluster of anomalies over detecting the later contiguous anomaly.

Figure 15: Preference for short predicted anomalies: Most metrics value the bottom prediction at least as highly as the top prediction, as it has the same precision but detects more of the anomaly. PA-f_β, TD and NAB, on the other hand, have a strong preference for short predicted events.
A-f_β and TD stand out as the only ones valuing relative proximity over the whole time series. Along with t-f_β^10, these are the only ones valuing detection of anomalies before the labelled anomaly, while Ta-f_1^10 and (barely) NAB also value detection after the labelled anomaly.
An effect of valuing proximity is that the score is less dependent on the labelling strategy. We show this with an example. The labels of a dataset are usually not perfect, and often it is not clear what is an anomaly, or where an anomaly starts or ends. While the score of an anomaly detector will always depend heavily on what is considered a GT anomaly, the sensitivity to the exact length and location of an anomaly varies. Figure 19 shows a situation where it is not clear where to put the anomaly labels. One possibility is to mark all the high-valued points as anomalous. Another strategy is to label only the points around the discontinuities, e.g. as done by [23]. Indeed, there may be nothing anomalous about the points in between these jumps. Yet, if the distance between the jumps is small enough, it makes more sense to view it as a single contiguous anomaly: as noted by [18], a single normal point between two anomalies is an anomaly in its own right. Thus at some time scale between these situations, it should be unclear how to label this event. Two possible labellings corresponding to this time series are shown in Figure 16, along with scores for predicting the labels from the opposite strategy. The metrics valuing proximity are more tolerant to the labelling strategy, and give good scores in both cases, as opposed to the other metrics.

Non-binary cases
As the non-binary metrics use the raw anomaly score as input, the space of possible inputs is much larger, making it more difficult to examine extensively how these metrics react to a representative variation of realistic inputs. Nevertheless, we attempt to visualize some properties of these metrics as well. Before presenting these tests, we emphasize that the results of these metrics depend only on the relative anomaly score at each point, and not on the actual values. This is shown in Figure 20, where the anomaly scores are both symmetric and decreasing in the distance from the middle. This gives the same scores for all the metrics, independent of the labels. For most experiments in this section, we have only very few possible values of the anomaly scores, and points that are not visually different have the same score. The exceptions are specified in the captions.

Effect of anomaly length
Figure 21 shows that the non-binary metrics mostly favour detecting the long anomalies, as these contain more points. However, the VUS metrics can favour detecting the short ones if there are more of them, since the anomaly events are effectively widened by the metric.

Preference for short predicted anomalies
Figure 22 shows predictions with short and wide anomalies, similar to the binary case shown in Figure 15. We see that none of these metrics have the short predicted anomaly preference of PAf_β and NAB.

Partial detection vs covering
As for the binary metrics, we test the value of detection compared to covering, in Figure 23. Since all the non-binary metrics considered are point-based, none of them value the detection of the second anomaly over covering the first one. P@K, however, values them equally in this case, since K is larger than the number of points with a positive anomaly score.
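For reference, P@K scores the K highest-scoring points by the fraction of them that are labelled anomalous; once K is at least the number of positively scored points, covering one anomaly and partially detecting both can come out identical. A minimal sketch with hypothetical data:

```python
def precision_at_k(scores, labels, k):
    """P@K: the fraction of the k highest-scoring points that are
    labelled anomalous (ties broken by position here)."""
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(labels[i] for i in top_k) / k

# Two GT anomalies: points 2-4 and point 7.
labels = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
covering = [0, 0, .9, .9, .9, 0, 0, 0, 0, 0]   # covers the first anomaly
detecting = [0, 0, .9, .9, 0, 0, 0, .9, 0, 0]  # partially detects both

# With k equal to the number of positively scored points, both
# strategies receive the same score.
assert precision_at_k(covering, labels, 3) == precision_at_k(detecting, labels, 3)
```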

Proximity
By smoothing out the labels, the VUS metrics value proximity of predicted and labelled anomalies, as seen in Figure 24. The other non-binary metrics do not value high anomaly scores close to an anomaly. However, since anomaly scores are often somewhat smooth, high anomaly scores close to the anomaly can indicate that the anomaly score is also relatively high at the anomaly. This is shown in Figure 25, where the anomaly scores are bell curves at different locations. This way of valuing proximity does, however, rely entirely on the form of the anomaly score, which is not necessarily fair.
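The mechanism can be sketched by dilating the binary labels over a range of window widths and averaging a base AUC over them. Note that the published VUS uses a graded, not binary, widening, so this is only a simplified sketch with hypothetical data:

```python
def roc_auc(scores, labels):
    """Rank-sum AUC_ROC, ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def dilate(labels, w):
    """Widen every labelled anomaly by w points on each side."""
    n = len(labels)
    out = [0] * n
    for i, y in enumerate(labels):
        if y:
            for j in range(max(0, i - w), min(n, i + w + 1)):
                out[j] = 1
    return out

def vus_roc_sketch(scores, labels, max_w):
    """Simplified VUS_ROC: average AUC_ROC over dilation widths 0..max_w."""
    aucs = [roc_auc(scores, dilate(labels, w)) for w in range(max_w + 1)]
    return sum(aucs) / len(aucs)

labels = [0, 0, 0, 1, 1, 0, 0, 0]
near = [0, 0, 1, 0, 0, 0, 0, 0]  # score peak adjacent to the anomaly
far = [1, 0, 0, 0, 0, 0, 0, 0]   # score peak far from the anomaly

# Plain AUC_ROC cannot tell the two predictions apart; the widened
# labels reward the nearby peak.
assert roc_auc(near, labels) == roc_auc(far, labels)
assert vus_roc_sketch(near, labels, 1) > vus_roc_sketch(far, labels, 1)
```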

Effect of class imbalance
Unlike all the other metrics we have considered, AUC_ROC and VUS_ROC^l include the number of TNs in their formulas. This means that including extra points that would not affect other metrics will affect these. This is shown in Figure 26. We see that the scores of AUC_ROC and VUS_ROC^l increase from about 0.05 to about 0.95 as the anomaly ratio is decreased from 4/8 to 4/64. This means that for low anomaly ratios, precise detections are less important. This can yield high scores that seem impressive, but are difficult to interpret correctly. An example of this changing the required precision is shown in Figure 27. While AUC_ROC and VUS_ROC^l prefer the short predicted event in the short time series, they prefer the less precise one in the long time series.

Figure 17: Score as a function of position of the predicted event: Each point on the graph is the output of the considered evaluation metric (PWf_1 in this case) for one full prediction, where the position of the predicted anomaly changes. The vertical marks indicate the start and end of areas where some, but not all, predicted anomaly points are correct. In other words, before the first and after the last mark, the predictions have no TP points, while between the second and third mark, the predictions have no FP points.
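This TN-sensitivity can be reproduced in a few lines: appending extra low-scored normal points raises AUC_ROC while leaving AUC_PR, computed here as average precision, untouched. A minimal sketch with hypothetical data:

```python
def roc_auc(scores, labels):
    """Rank-sum AUC_ROC, ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """AUC_PR approximated as average precision: mean precision over
    the positives, visiting points in order of decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    total = 0.0
    for i in order:
        if labels[i]:
            tp += 1
            total += tp / (tp + fp)
        else:
            fp += 1
    return total / sum(labels)

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.5]

# Pad the series with twelve trivially normal, low-scored points.
labels_long = labels + [0] * 12
scores_long = scores + [0.1] * 12

assert roc_auc(scores, labels) == 0.5
assert roc_auc(scores_long, labels_long) > 0.9
# Average precision ignores the added TNs entirely.
assert average_precision(scores, labels) == average_precision(scores_long, labels_long)
```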

Categorization
In this section we present how each metric relates to the properties presented in Section 4. Figure 28 shows the properties of each metric. In the following paragraphs we explain how these results were determined. If the result of any test depends on a parameter^16, we mark it by an asterisk. We also use an asterisk for partially obtained properties; what we mean by this is explained for the relevant properties below.
For Early detection, we consider the results shown in Figure 18 for the binary metrics. If the score at the second mark is higher than at the third mark, we consider the metric to value earliness. Since none of the non-binary metrics use the direction of the time series in their calculation, they cannot have this property, due to symmetry in the time dimension.
The Long anomalies preference property is based on the results in Figure 13 and Figure 21 for binary and non-binary metrics respectively; metrics not preferring the bottom model are considered to have this property. Similarly, the Short predicted anomaly preference property is based on Figures 15 and 22; metrics giving a better score to the top model have this property. Metrics with the Partial detection preference are those not preferring the top prediction in Figures 12 and 23. The Proximity property for binary metrics is based on Figure 18. Metrics that have a non-zero score at the first and last mark are considered to fully have this property, while metrics with a non-zero score only at the last mark partially have the property, and are marked with an asterisk. For non-binary metrics, metrics distinguishing the anomaly scores of Figure 24 are considered to have the property, while metrics only distinguishing the smoother anomaly scores of Figure 25 are marked with an asterisk. We do not consider this property to depend on parameters for tf_β^τ, VUS_ROC^l and VUS_PR^l, since τ = 0 or l = 0, the only values where the property is not obtained, would make the metrics identical to PWf_β, AUC_ROC and AUC_PR. The Requires threshold property simply distinguishes the binary from the non-binary metrics. # parameters indicates the number of parameters for each metric, including β for f-scores, all specifiable functions for Rf_β^{α,bias}, the distance exponent in TD, and all the TP, FP and FN weights in NAB.

^16 For Rf_β^{α,bias}, where the bias functions could in principle be anything, we have only used the ones suggested in the original paper.
Time aware indicates the metrics that consider adjacency in the time dimension in any way, Indifferent to imbalance are the metrics ignoring the number of true negatives, and General are the metrics suitable for all TSAD applications. It should be clear by now that the latter property is quite unachievable without an impractically large number of parameters. Nevertheless, the category is included for anyone skimming the article in search of simple answers.

Conclusion
Through an extensive literature review on time series anomaly detection (TSAD), we found several different ways to evaluate algorithms. While a rigorous discussion of several of the available metrics can be found in a few papers, some of which strongly disagree with each other on what the important properties of an evaluation metric are, most papers choose metrics that have been repeatedly faulted in the literature. We have tested 20 TSAD evaluation metrics in several case studies, and categorized them based on 10 different properties. As TSAD is a diverse field, no evaluation metric is appropriate in all cases, and the metric should be chosen with care in each case. For the same reason, it is difficult to provide detailed guidelines for how to do this. However, we summarize some of the main takeaways from our study:
• The choice of evaluation metric has a large impact on the rankings of TSAD methods, underscoring the need for careful alignment of evaluation metrics with specific problem requirements.
• Some metrics give high scores to certain prediction strategies that are not necessarily good strategies. For example, predicting only very long or very short anomalies can result in unreasonably high scores, leading to the selection of inappropriate methods and an overestimation of expected performance.
• Some metrics may give very bad scores to certain types of predictions even though the predictions are valuable, such as predicting long anomalous events, or predicting anomalies too early or too late. This can lead to selecting ineffective methods and underestimating the expected performance.
• Due to the way the labels are compared to the prediction, some metrics are less appropriate for certain kinds of labelling strategies.
Therefore, it is crucial to carefully select the appropriate evaluation metric for a given problem, taking into consideration its preferences for specific types of predictions. Simple case studies such as the ones presented in this work can be helpful for gaining such understanding.
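As an illustration of how strongly metrics can disagree on a single prediction (cf. Figure 1), consider point-wise versus point-adjusted f1 on a detector that flags a single point inside a long anomaly; the data below is hypothetical:

```python
def point_wise_f1(pred, labels):
    """PWf_1: f1 computed over individual points."""
    tp = sum(p and y for p, y in zip(pred, labels))
    fp = sum(p and not y for p, y in zip(pred, labels))
    fn = sum(not p and y for p, y in zip(pred, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def point_adjusted_f1(pred, labels):
    """PAf_1: if any point of a GT segment is predicted, the whole
    segment counts as detected before computing the point-wise f1."""
    adj, i, n = list(pred), 0, len(labels)
    while i < n:
        if labels[i]:
            j = i
            while j < n and labels[j]:
                j += 1
            if any(pred[i:j]):
                adj[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return point_wise_f1(adj, labels)

labels = [0] * 5 + [1] * 10 + [0] * 5   # one long anomaly
pred = [0] * 7 + [1] + [0] * 12         # a single detected point

# The two metrics disagree wildly on the same prediction.
assert round(point_wise_f1(pred, labels), 2) == 0.18
assert point_adjusted_f1(pred, labels) == 1.0
```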
There are several directions for future research based on this study. First of all, there is room for defining novel evaluation metrics. For example, valuing earliness and valuing proximity are two very useful traits, but none of the metrics we could find include both. Furthermore, much more investigation can be done of existing metrics that fell outside the scope of this work, e.g. supplementary performance analysis metrics, or combinations of techniques from the included metrics. Finally, when publishing results in TSAD research in general, we suggest including results from multiple metrics, as well as making both the code and the anomaly scores available, to enable easy comparison with any evaluation metric.

Figure 2: Visualization of thresholds. Each line of dots represents a possible binary prediction from the anomaly score, if the threshold is set to its respective value on the y-axis. Lowering the threshold increases the number of predicted anomalous points.

Figure 4: The confusion matrix for anomaly detection. Each point can have one of two label values and one of two prediction values, resulting in four different classes of points.
Figure 9: Probability density functions for the anomaly scores of the positive (red) and negative (black) samples, for the blue and green detectors. Anomalous samples generally give higher anomaly scores, although many normal and anomalous points have similar scores, making it hard to set a threshold. 9a and 9b show the counting metrics (as the shaded areas) for thresholds optimal for PWf_β for β = 1 and β = 8 respectively. Keep in mind that only 2% of samples are anomalies, i.e. TN + FP = 49(FN + TP), which is not shown in the figures. FP and FN are of comparable sizes in 9a.

Figure 18: Score as a function of position of the predicted event: For each metric, the graph shows the scores of the detection scenario shown in Figure 17, as a function of the position in the prediction. All graphs are scaled to the same interval for easy comparison.

Figure 19: Is there one anomaly at t ≈ 3 and another at t ≈ 4, or just one anomalous event from 3 to 4? Without any information about domain and time scale, we may only guess what is an anomaly here, or whether there even exists any.

Figure 27: Effect of class imbalance: The bottom two anomaly scores are extensions of the top two. The extra TNs change which anomaly score AUC_ROC and VUS_ROC^l prefer.

Figure 28: The properties of all the metrics. ✓ = has property, ✗ = does not have property, * = partially / parameter dependent.
Figure 16: Value proximity: Possible predictions for the time series in Figure 19. For each prediction, we use the other prediction as the labels. This shows how much metrics are affected if the detector and labeller use different labelling strategies. A metric not sensitive to this would give good scores in both these situations. Point-based metrics like PAf_β and PWf_β are heavily affected, event-based metrics a bit less. The metrics rewarding proximity are clearly the most tolerant to a differing labelling strategy.
Figure 20: Ordering the time stamps by the value of the anomaly scores yields the same order. Only the order matters for the non-binary metrics, not their values; hence these predictions have the same scores. This would be true for any labels.

Figure 21: Effect of anomaly length: Importance of anomaly length for non-binary metrics.

Figure 22: Preference for short predicted anomalies: None of the non-binary metrics prefer the short predicted anomalies.

Figure 24: Proximity: Only the VUS metrics value the proximity of predicted and labelled anomalies. The other metrics still give a positive score, since a low enough threshold marks every point as anomalous.

Figure 26: Effect of class imbalance: AUC_ROC and VUS_ROC^l scores are heavily affected by the number of TNs. In the shorter predictions, only the predicted part of the time series is evaluated. The anomaly score is strictly decreasing, ensuring that none of the added points are more anomalous than the previous ones. As only AUC_ROC and VUS_ROC^l are affected by this, the other metrics are not included.
Figure 25: Proximity: Non-binary metrics value FPs close to true anomalies indirectly, due to the anomaly score often being somewhat smooth. The scores are Gaussian functions with different shifts, meaning the anomaly scores at the anomalous points increase as the centre moves towards the anomaly.