1 Introduction

Outlier detection is a widely employed method for identifying unusual or exceptional events across a range of scenarios, including fault detection, fraud detection, real-time monitoring systems, and information fusion, where it pinpoints objects diverging from anticipated norms after fusion [1, 2]. This field has received extensive research attention for many years, resulting in the introduction of numerous approaches to anomaly detection [3].

Unsupervised machine learning techniques are crucial for outlier detection, allowing anomalies to be identified without relying on pre-labeled data. However, these techniques present challenges, including the requirement for complex algorithms and difficulties in handling noisy or high-dimensional data [4]. As the importance of online learning continues to grow, it becomes essential to develop statistical estimators that can incrementally compute higher moments and other statistical measures, addressing the complexities associated with implementing unsupervised techniques in dynamic environments.

In the realm of outlier detection, a boxplot stands out as a widely used and straightforward tool that effectively overcomes the limitations imposed by the complexities of other outlier detection methods. It relies on visualizing the data distribution and identifying any points that fall outside the expected range. The boxplot is particularly effective in identifying extreme outliers and is less susceptible to their influence due to its use of quartiles and interquartile ranges to define the box and whisker plot. As a result, the boxplot is a valuable tool for identifying outliers in various types of data and has become a standard approach in many fields [5].

Real-time data analysis has become increasingly important in recent years, as there is a need to analyze data that is received gradually over time. However, this presents a significant challenge in the big data age, as time and memory constraints have made it necessary to use incremental or recursive methods that do not require browsing through old data to update an estimator or analysis [4].

One of the fundamental challenges in the real-time analysis of big data is the computation of statistical estimators, including measures like the mean and median. While the mean can be computed incrementally, the median requires the ordering of all data points, which makes incremental computation difficult. To address this challenge, it is essential to develop online approaches for statistical estimators that enable the incremental computation of higher moments and other statistical estimators [6].
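For instance, the sample mean admits a simple recursive update that uses only the previous estimate and a counter:

$$\begin{aligned} \bar{x}_{n} = \bar{x}_{n-1} + \frac{x_{n} - \bar{x}_{n-1}}{n} \end{aligned}$$

No comparable closed-form recursion exists for the median, which is why approximate schemes such as the histogram-based approach developed in this paper are needed.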

In this study, we propose an online approach for detecting outliers on univariate data sets based on the adjusted boxplot and quartile skewness as a robust measure to reflect the asymmetry of a univariate continuous distribution. Our approach enables the computation of quartiles and the boxplot incrementally, allowing for the detection of outliers as new data points arrive. To estimate quartiles, we utilize an online learning algorithm that updates quartile estimates with each new data point, ensuring consistency and accuracy in our approach. We also update the skewness measure as quartiles are updated with new data points, making our approach more reliable for both symmetric and asymmetric distributions.

To define the whiskers, we adopt the adjusted boxplot approach that was previously proposed in batch mode. Specifically, we utilize the exponential model for defining the whiskers, which has been proven to be efficient in previous studies [7]. By combining these approaches, we have an online boxplot that does not make any assumptions about the distribution of the data. This allows for the accurate and efficient analysis of large data sets without the need to store all historical data, making our approach a valuable tool for online data analysis.

Our proposed algorithm consistently and reliably detects outliers in both symmetric and asymmetric distributions. Its almost real-time capabilities and lack of distribution assumptions make it useful in various fields, including finance, healthcare, and cybersecurity.

The subsequent sections of this paper are organized as follows: In Sect. 2, we provide a comprehensive review of prior work on outlier detection, specifically focusing on the batch mode implementation of the boxplot approach, and discuss the strengths and weaknesses of existing approaches in this context. Section 3 details our proposed methodology, which encompasses several key components: the online quartile estimation algorithm, the quartile skewness measure, the online boxplot approach, and the integration of these three approaches. In Sect. 4, we outline the experimental setup for our methodology, specifying the datasets we have utilized and explaining how the results will be assessed. Section 5 is dedicated to presenting and discussing the results obtained from applying our approach to various datasets, including simulation datasets as well as real-world IT datasets; we also compare our results with those achieved using the batch mode boxplot approach that retains all historical data. Finally, in Sect. 6, we conclude the paper by summarizing our findings, discussing the limitations of our approach, and proposing potential directions for future research in this field.

2 Background

In this section, we will delve into the fundamentals of outlier detection, explore various boxplot methods, and provide an overview of several online outlier detection algorithms that have been developed to address the challenges posed by streaming data. While we briefly discuss select online outlier detection algorithms, our primary emphasis remains on understanding the strengths and limitations of different box plot approaches. This exploration aims to inform our development of a novel online boxplot algorithm tailored to dynamic data environments.

Identifying and removing outliers play a crucial role in data analysis and decision-making processes. Outliers are data points that deviate significantly from the underlying distribution of a dataset  [8], often carrying valuable information or indicating anomalies. The definition of outliers varies in the literature and often relies on assumptions about the data distribution and analysis techniques [9,10,11].

Hawkins [9] provides a formal definition of an outlier as an observation that deviates to such an extent from other observations that it raises suspicion that it was generated by a different mechanism. Grubbs [10] describes an outlier as an observation that appears markedly different from the rest of the sample. Barnett et al. [11] define outliers as observations or subsets of observations that seem inconsistent with the remaining data.

Outliers can also be referred to as anomalies, discordant observations, exceptions, faults, defects, aberrations, noise, damage, abnormalities, or contaminants, depending on their cause. They can arise for various reasons, such as human or machine error, mechanical faults, or interesting events generated by a different mechanism [12].

In the field of outlier detection, a wide range of techniques are available. These techniques can be categorized into different groups based on how outliers are defined and detected. First, statistical-based techniques involve constructing a data distribution model and identifying outliers as points with the lowest probability generated from the global distribution [4].

Distance-based techniques determine the outlierness of an object by measuring its distances to its neighbors in the dataset. An example of a distance-based outlier detection method is the k-nearest neighbors (k-NN) approach, which determines the “outlierness” of a data point based on its distance to its k-nearest neighbors [13]. Density-based techniques estimate the density around data points and identify outliers as those surrounded by dissimilar densities compared to their local neighbors. The local outlier factor (LOF) algorithm falls under density-based techniques [14].

The clustering-based technique operates on the assumption that outliers exhibit one of the following characteristics: they do not belong to any cluster, they are located significantly far away from the center of their nearest cluster, or they belong to a small or sparse cluster. Classification-based techniques rely on a training dataset to develop a classification model, which is then used to classify unseen instances into normal or outlier classes. In unsupervised classification, the model learns to fit most of the training data [4].

Frequent pattern mining-based techniques aim to discover patterns that represent normal data behavior, and outliers are patterns that deviate from the established normal behavior [15]. Isolation-based techniques focus on isolating outliers from the rest of the data based on the assumption that outliers are few and different [16]. Information-theoretic techniques use entropy-like measurements or mutual information to identify outliers [17, 18]. Finally, mass-based approaches compute dissimilarity based on the data distribution rather than geometric position, and outliers are identified by recognizing data points in low-mass regions [19].

In statistical analysis, there are methods based on the mean, such as the mean absolute deviation (MAD) and the Z-Score, which provide robust estimations of dispersion and help identify outliers. MAD considers the average distance between each data point and the mean, while the Z-Score uses the mean and standard deviation to calculate the distance from the mean. However, the Z-Score may not be suitable for small datasets [20].

Alternatively, methods based on the median offer reliable approaches for outlier detection. The median absolute deviation (MADe) method and the median rule use the median and median absolute deviation to identify outliers. MADe defines intervals around the median, and the median rule employs a scale factor of the interquartile range to determine the intervals. These methods are less influenced by extreme values [20].

Boxplots are a powerful visualization tool for summarizing the distribution of a dataset. They offer a concise overview of the data by displaying key statistics such as the median, quartiles, and potential outliers. Boxplots are effective in identifying outliers and can reveal the skewness and symmetry of a distribution. Additionally, they are particularly useful for comparing distributions across different groups or categories. Boxplots are space-efficient and robust to extreme values, providing a representative summary of the data while showcasing the spread of the central portion of the distribution.

However, boxplots do have their limitations. They may not capture detailed information about the shape or multimodality of the distribution, which could be essential in some cases. Therefore, it is often beneficial to complement boxplots with other visualization methods to gain a more comprehensive understanding of the data. Despite this drawback, the boxplot introduced by Tukey [21] remains a popular graphical method for analyzing univariate datasets. The construction of a boxplot involves defining a box and two whiskers. The box represents the interquartile range (IQR), which is calculated as the difference between the third quartile (\(Q_3\)) and the first quartile (\(Q_1\)), and thus captures the spread of the central 50% of the data. Probable outliers are points that fall outside the range defined by:

$$\begin{aligned}{}[Q_1 - w \times \textrm{IQR}\,\,\,\,,\,\,\,\, Q_3 + w \times \textrm{IQR}] \end{aligned}$$
(1)

When w is set to 1.5, only a small fraction of data points from a normal (Gaussian) distribution, approximately 0.7%, would be classified as outliers according to probability theory [22]. Alternatively, by setting w to 3, the focus shifts to extreme outliers, resulting in almost no data points from a normal distribution being identified as outliers.
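This fraction can be verified directly for the standard normal distribution: with \(Q_1 \approx -0.6745\), \(Q_3 \approx 0.6745\), and hence \(\textrm{IQR} \approx 1.349\), the fences lie at \(\pm (0.6745 + 1.5 \times 1.349) \approx \pm 2.698\), so that

$$\begin{aligned} \mathbb {P}(|X| > 2.698) = 2\,\Phi (-2.698) \approx 0.007, \end{aligned}$$

that is, roughly 0.7% of observations fall outside the fences.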

It is important to highlight that this method tends to identify more observations as outliers when the data exhibits higher skewness [20]. However, it should be noted that observations falling outside the fence are not necessarily true outliers that behave differently from most of the data. In distributions with long tails and symmetry, many regular observations may extend beyond the whiskers, while in distributions with short tails, observations are unlikely to exceed the fence. This phenomenon applies similarly to skewed distributions as well [7].

Kimber proposed an adjustment to the fence rule for skewed data by replacing the interquartile range (IQR) with 2 times the lower semi-interquartile range (SIQRL) for the lower fence and 2 times the upper semi-interquartile range (SIQRU) for the upper fence. However, this adjustment only slightly accounts for the underlying skewness [23]. By considering \(w=1.5\) in Eq. 1, the fence rule is defined as:

$$\begin{aligned}{}[Q_1 - 3 \textrm{SIQRL}\,\,\,\,,\,\,\,\, Q_3 + 3 \textrm{SIQRU}] \end{aligned}$$
(2)

where \(\textrm{SIQRL} = Q_2 - Q_1\) and \(\textrm{SIQRU} = Q_3 - Q_2\). Although this semi-interquartile range (SIQR) boxplot approach has been introduced, it does not sufficiently adjust itself for skewness. This boxplot still marks many regular observations as outliers and fails to clearly distinguish between regular observations and outliers. The upper whiskers are slightly increased, resulting in fewer observations being highlighted as upper outliers, while the lower whiskers simply mark the smaller observations [7].

Hubert and Vandervieren proposed an adjustment to the boxplot that can be applied to all distributions, even those without finite moments [7]. Moreover, this boxplot uses the medcouple (MC), a robust measure of skewness defined by Brys, Hubert, and Struyf [24], to estimate the underlying skewness and avoid masking the real outliers. For a univariate sample \({x_1,..., x_n}\) from a continuous unimodal distribution, MC is defined as:

$$\begin{aligned} \textrm{MC} = \textrm{med}_{x_i \le Q_2 \le x_j} h(x_i,x_j) \end{aligned}$$
(3)

where \(Q_2\) is the sample median. For all \(x_i \ne x_j\), the median (\(\textrm{med}\)) of the kernel function h is calculated. The function h is given by:

$$\begin{aligned} h(x_i,x_j) = \frac{(x_j - Q_2) - (Q_2 - x_i)}{x_j - x_i} \end{aligned}$$
(4)

MC ranges between \(-1\) and 1. MC = 0 implies the data are symmetric, and the adjusted boxplot reduces to Tukey’s boxplot. If \(\textrm{MC} > 0\), the distribution of the data is right-skewed, and if \(\textrm{MC} < 0\), it is left-skewed. The whiskers of the adjusted boxplot, according to the MC value, are defined as follows:

$$\begin{aligned}&\text {if } \textrm{MC} \ge 0: \\&\quad \text {Lower bound} = Q_1 - 1.5\exp (a\,\textrm{MC})\,\textrm{IQR},\\&\quad \text {Upper bound} = Q_3 + 1.5\exp (b\,\textrm{MC})\,\textrm{IQR} \\&\text {if } \textrm{MC} < 0: \\&\quad \text {Lower bound} = Q_1 - 1.5\exp (-b\,\textrm{MC})\,\textrm{IQR}, \\&\quad \text {Upper bound} = Q_3 + 1.5\exp (-a\,\textrm{MC})\,\textrm{IQR} \end{aligned}$$
(5)

where IQR is the interquartile range, and \(Q_3\) and \(Q_1\) are the third quartile and the first quartile, respectively. In this model, constants a and b were determined through the analysis of different distribution families. To ensure the model’s simplicity and robustness, Hubert and Vandervieren [7] chose the values of \(a=-4\) and \(b=3\).
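As an illustration, the medcouple of Eqs. 3 and 4 can be computed naively as follows. This \(O(n^2)\) sketch is our own and, for simplicity, omits the special kernel used to break ties when several observations equal the median, so it is an approximation rather than a reference implementation (fast \(O(n \log n)\) algorithms exist).

```python
import numpy as np

def medcouple_naive(x):
    """Naive O(n^2) medcouple (Eqs. 3-4): median of the kernel h over
    all pairs (x_i, x_j) with x_i <= Q2 <= x_j and x_i != x_j.
    Ties at the median are not treated specially in this sketch."""
    x = np.sort(np.asarray(x, dtype=float))
    q2 = np.median(x)
    lower = x[x <= q2]
    upper = x[x >= q2]
    h = []
    for xi in lower:
        for xj in upper:
            if xj != xi:
                # h(x_i, x_j) = ((x_j - Q2) - (Q2 - x_i)) / (x_j - x_i)
                h.append(((xj - q2) - (q2 - xi)) / (xj - xi))
    return float(np.median(h))
```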

According to this study [7], the model focuses primarily on commonly occurring distributions with moderate skewness. However, constructing a comprehensive model that includes cases with \(\textrm{MC} > 0.6\) is acknowledged to be challenging. The adjusted boxplot mentioned in this context focuses on addressing skewness while not considering tail heaviness. The authors highlight various drawbacks associated with this methodology. Primarily, the complexity of the model increases with the addition of more estimators and parameters. Moreover, the model’s robustness diminishes as the tail measures exhibit a lower breakdown value, leading to increased variability in the length of the whiskers due to the variability of these tail measures [7].

Introduced in 2018 by Walker et al., the ratio-skewed boxplot is a practical method for handling skewed data and detecting outliers across various parametric distributions. It can be applied to univariate datasets, regardless of their symmetry or sample size [5]. The method incorporates the use of Bowley’s coefficient [25], derived from Kimber’s rule [23], which replaces the interquartile range with 2SIQR. This coefficient makes use of the quartiles to give an indication of skewness, given by:

$$\begin{aligned} B_{c}&= \frac{\textrm{SIQR}_{\textrm{U}} - \textrm{SIQR}_{\textrm{L}}}{\textrm{IQR}}\\&= \frac{(Q_{3} - Q_{2}) - (Q_{2} - Q_{1})}{\textrm{IQR}} \end{aligned}$$
(6)

The ratio-skewed boxplot can be constructed by employing this coefficient in the definition of fences, represented by:

$$\begin{aligned}{}[Q_1 - 1.5\,\textrm{IQR}\,R_\textrm{L}\,\,\,\,,\,\,\,\, Q_3 + 1.5\,\textrm{IQR}\,R_\textrm{U}] \end{aligned}$$
(7)

where \(R_\textrm{L}\) and \(R_\textrm{U}\) represent the lower and upper fence skewness adjustment factors, defined as follows:

$$\begin{aligned} R_\textrm{L}=\frac{1-B_{c}}{1+B_{c}}\,\,\,\,\,\text {and}\,\,\,\,\,R_\textrm{U}=\frac{1+B_{c}}{1-B_{c}} \end{aligned}$$
(8)

In scenarios where data arrives in streams, online learning algorithms are particularly useful for processing data incrementally. Traditional outlier detection methods often require the entire dataset upfront, which is not feasible or efficient when dealing with constantly evolving data [26].

Incremental processing is a key concept in online learning algorithms, where the algorithms update their internal state and make decisions based on each new data point, enabling efficient real-time processing and decision-making. Online learning algorithms play a crucial role in outlier detection by continuously analyzing the incoming data stream, identifying outliers based on evolving patterns and statistical characteristics, and adapting to changes in the data distribution over time. These methods provide timely alerts for unusual data points, enhancing the accuracy and reliability of data preprocessing and statistical analysis.

There are various online learning outlier detection algorithms available, and one library that offers a range of such algorithms is River [27]. River provides a collection of online machine learning algorithms, including several outlier detection methods, like half-space tree, local outlier factor, and one-class SVM. These algorithms utilize different techniques to identify outliers in an online setting and have been widely used in various domains.
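As a usage illustration, a half-space trees detector from River can be run over a stream in a few lines. This is a minimal sketch; parameter names and defaults may differ slightly across River versions, and half-space trees typically expect feature values scaled to the unit interval unless explicit limits are supplied.

```python
from river import anomaly

# A small half-space trees ensemble; hyperparameters are illustrative.
model = anomaly.HalfSpaceTrees(n_trees=10, height=8, window_size=250, seed=42)

data_stream = [0.50, 0.45, 0.43, 0.44, 0.95]  # toy stream; replace with real data
for v in data_stream:
    x = {"value": v}
    score = model.score_one(x)  # higher score = more anomalous
    model.learn_one(x)          # then update the model with the new point
```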

In addition to specific outlier detection algorithms, statistical online methods provide alternative approaches for detecting outliers in streaming data. These methods take advantage of incremental formulas for mean and standard deviation, allowing for efficient updates to these values without the need to store the entire dataset. By utilizing these incremental computations, outlier detection can be achieved in an efficient manner as new data points are received. Notable examples of such methods include the Cantelli inequality approach and the Chebyshev inequality approach. These methods leverage statistical bounds provided by the Cantelli inequality and the Chebyshev inequality, respectively, to identify outliers in real time. Applying these statistical bounds enables the assessment of the likelihood of data points deviating from the expected range, facilitating the detection of outliers as the data stream evolves [28,29,30].
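A minimal sketch of such a statistical online detector is given below: Welford's algorithm maintains the mean and variance incrementally, and Chebyshev's bound \(\mathbb {P}(|X - \mu | \ge k\sigma ) \le 1/k^2\) flags points beyond \(k\) standard deviations. The class and its defaults are illustrative choices, not a specific published implementation; a one-sided Cantelli variant would bound each tail separately.

```python
import math

class ChebyshevDetector:
    """Online detector sketch: Welford's incremental mean/variance
    combined with Chebyshev's bound P(|X - mean| >= k*std) <= 1/k^2."""

    def __init__(self, k=4.0):
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update_and_score(self, x):
        # Score the new point against the statistics seen so far
        is_outlier = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_outlier = std > 0 and abs(x - self.mean) >= self.k * std
        # Welford's update: no past observations need to be stored
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier
```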

In contrast to the mean and standard deviation, which are susceptible to the influence of extreme outliers, the robust measures of median and quartiles offer a valuable alternative for outlier detection. These measures demonstrate reduced sensitivity to extreme values, ensuring a more accurate depiction of the data’s central tendency and dispersion. However, calculating the median incrementally presents a challenge as there is no closed-form solution available.

To overcome this hurdle, approximate incremental computation methods enable reliable estimation of the median and its associated quartiles in online scenarios. Motivated by the idea of approximation techniques, this paper proposes a novel histogram-based algorithm for quartile estimation and subsequently introduces an innovative incremental boxplot approach for outlier detection. This methodology enhances the effectiveness of outlier detection in streaming data by providing a reasonable estimation of the median’s value and overcoming the limitations imposed by extreme outliers.

3 Online boxplot algorithm

In this section, we introduce an innovative online boxplot approach for outlier detection, combining the adjusted boxplot and Bowley’s coefficient or quartile skewness. This coefficient serves as a reliable measure to capture the asymmetry of a univariate continuous distribution. To enable online outlier detection, the proposed approach incorporates a histogram-based algorithm that adjusts the whiskers of the boxplot as new data points are added.

The online algorithm dynamically updates quartile estimates with each incoming data point, ensuring consistency and accuracy. Moreover, the approach incorporates updating the skewness measure and adjusting the whiskers along with quartile modifications. This enhancement significantly improves the reliability of the method, enabling it to effectively handle both symmetric and asymmetric distributions.

3.1 Online quartile estimation algorithm

The proposed online quartile estimation algorithm described in this paper comprises two distinct steps. The first step operates in batch mode, enabling the algorithm to process a buffer of collected data. This initialization step involves the computation of a histogram, which divides the range of the data into evenly spaced intervals or bins.

The number of bins serves as a tunable parameter in the algorithm, accommodating the data distribution and the desired accuracy of the quartile estimates. By adjusting the number of bins, the algorithm can effectively capture the nuances and variations in the dataset, leading to more precise quartile estimations. As a result, the selection of this parameter is crucial in ensuring accurate and reliable quartile calculations. To streamline the process and enhance its practicality, future iterations of the algorithm will automate the determination of the number of bins, removing the burden of manual specification and improving its applicability across diverse datasets.

Once the histogram is computed, the data’s cumulative distribution function (CDF) can be estimated. The CDF provides the fraction of data points up to and including the upper bound of each bin, and its values are used to estimate the quartiles of the data. The first quartile (\(Q_{1}\)) is taken as the mean of the boundaries of the first bin whose CDF value is greater than or equal to 0.25. Similarly, the second quartile (\(Q_{2}\)) corresponds to the first bin whose CDF value reaches 0.5, and the third quartile (\(Q_{3}\)) to the first bin whose CDF value reaches 0.75.

It is worth noting that the length of the data buffer is not of critical importance as long as it accurately represents the entire dataset. Nevertheless, utilizing a highly representative buffer can enhance the algorithm’s speed. The quartile estimates obtained during the initialization step are continually updated, allowing for real-time monitoring of the data stream. The initialization step plays a pivotal role in the algorithm. Algorithm 1 illustrates this initial stage, where the functions \(\mathrm {bin\_upper}(i)\) and \(\mathrm {bin\_lower}(i)\) provide the upper and lower bounds of the i-th bin, respectively.

Algorithm 1: Initializing quartiles using histogram and CDF
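A minimal Python sketch of this initialization step might look as follows; the NumPy-based histogram and the function names are our own illustrative choices, not the authors' code.

```python
import numpy as np

def quartiles_from_cdf(counts, edges):
    """Estimate (Q1, Q2, Q3) from a histogram: each quartile is the mean of
    the boundaries of the first bin whose CDF value reaches the target level."""
    cdf = np.cumsum(counts) / np.sum(counts)
    qs = []
    for level in (0.25, 0.50, 0.75):
        i = int(np.searchsorted(cdf, level))      # first bin with CDF >= level
        qs.append((edges[i] + edges[i + 1]) / 2.0)
    return tuple(qs)

def init_quartiles(buffer, n_bins=150):
    """Offline initialization (a sketch of Algorithm 1): build an evenly
    spaced histogram over the buffer, then read the quartiles off its CDF."""
    counts, edges = np.histogram(buffer, bins=n_bins)
    return list(counts), list(edges), quartiles_from_cdf(counts, edges)
```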

The second step of the quartile estimation algorithm is performed in online mode, where the algorithm continuously updates the histogram and quartiles as new data points arrive. Whenever a new data point arrives, the algorithm checks if it falls within any of the predefined intervals or bins. If it does, the corresponding counter for that bin is incremented. If the new data point falls outside the predefined bins, new bins are created, and the new data point is counted.

To prevent excessive memory usage, the algorithm defines a maximum number of bins as a hyperparameter, and if the number of bins exceeds this value, the algorithm aggregates bins in pairs starting from the left. The updated histogram is then used to calculate the quartiles using the cumulative distribution function, just like in the offline step.

The proposed online quartile estimation algorithm is efficient and can accurately estimate quartiles with each new data point, even with limited memory resources. Algorithm 2 represents this step, where the functions \(\mathrm {bin\_lower}(0)\) and \(\mathrm {bin\_upper}(-1)\) return the lower bound of the first bin and the upper bound of the last bin, respectively.

Algorithm 2: Online estimation of quartiles using histogram and CDF
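The online step can be sketched as follows, again with illustrative names; after each update, the quartiles are re-read from the CDF exactly as in the offline step (e.g., with quartiles_from_cdf above).

```python
import numpy as np

def update_histogram(x, counts, edges, max_bins=1000):
    """A sketch of Algorithm 2: count x in its bin, creating new equal-width
    bins when x falls outside [bin_lower(0), bin_upper(-1)], and aggregating
    bins in pairs from the left when max_bins is exceeded."""
    counts, edges = list(counts), list(edges)
    while x < edges[0]:                       # x below bin_lower(0): extend left
        edges.insert(0, edges[0] - (edges[1] - edges[0]))
        counts.insert(0, 0)
    while x >= edges[-1]:                     # x above bin_upper(-1): extend right
        edges.append(edges[-1] + (edges[-1] - edges[-2]))
        counts.append(0)
    i = int(np.searchsorted(edges, x, side="right")) - 1
    counts[i] += 1                            # increment the matching bin
    while len(counts) > max_bins:             # cap memory: merge pairs from the left
        counts = [sum(counts[j:j + 2]) for j in range(0, len(counts), 2)]
        edges = edges[::2] if len(edges) % 2 == 1 else edges[::2] + [edges[-1]]
    return counts, edges
```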

3.2 Quartile skewness measure

Skewness is a fundamental statistical concept used to measure the degree of asymmetry observed in a probability distribution. It provides information about the shape of a distribution, as distributions can exhibit varying degrees of right (positive) skewness or left (negative) skewness. A normal distribution, also known as a bell curve, is a symmetric distribution with zero skewness.

However, real-world data often exhibit non-normal distributions, and understanding the nature of the distribution is crucial for accurate statistical analysis. Therefore, skewness is an important tool for describing the distribution of data, and it is used to quantify the degree to which the distribution deviates from symmetry. In this regard, measuring skewness is important in order to properly understand and interpret the data, as well as to apply appropriate statistical techniques.

Furthermore, many statistical algorithms are designed based on the assumption of normality, and they may not perform well when the data is not normally distributed. Therefore, incorporating measures of skewness into statistical analysis is critical to ensure that the algorithm is robust and accurate, particularly in the presence of outliers or other non-normal features in the data. One such measure is quartile skewness, also known as Bowley’s coefficient [25], which is a robust measure of skewness that is resistant to outliers and can accurately reflect the asymmetry of a univariate continuous distribution [31].

The quartile skewness measure, or Bowley’s coefficient, is a robust and efficient method for quantifying the asymmetry of a univariate continuous distribution. According to Eq. 6, this measure is calculated by taking the difference between the lengths of the upper and lower semi-interquartile ranges, normalized by the length of the interquartile range. In this study, we utilize the abbreviation QSM to denote the quartile skewness measure, as defined by:

$$\begin{aligned} \text {QSM} = \frac{(Q_{3} - Q_{2}) - (Q_{2} - Q_{1})}{\text {IQR}}. \end{aligned}$$
(9)

This measure is not only robust against outliers but can also detect the degree and direction of skewness in the distribution. It remains effective as new data points arrive and the quartiles are updated through the online quartile estimation algorithm discussed in Sect. 3.1. The measure can therefore capture changes in the distribution, including shifts and changes in skewness, as the data evolves over time, making it a powerful tool for monitoring and analyzing data streams that exhibit non-normal distributions with varying degrees of skewness.

Although the quartile skewness measure is a robust and efficient way to reflect the asymmetry of a univariate continuous distribution and can resist up to \(25\%\) of outliers, it has one disadvantage: it cannot measure very small skewness in the data distribution [7]. This is because the quartile skewness measure is based on the difference between the upper and lower quartiles, which is relatively insensitive to minor departures from symmetry. In practice, however, this is not a major issue, as the quartile skewness measure is designed to detect significant departures from symmetry, which are often the most relevant for practical applications.

3.3 Online boxplot

The proposed online boxplot combines the adjusted boxplot methodology with quartile skewness calculation, resulting in an advanced visualization tool. This integration enables the acquisition of a more comprehensive representation of data distribution and skewness, thereby facilitating its effective utilization in an online environment. This innovative approach allows for real-time updates and dynamic analysis, making it particularly useful for monitoring and exploring data streams or continuously evolving datasets.

The adjusted boxplot approach is a statistical method that is employed to identify outliers in skewed data distributions. It was initially introduced by Hubert and Vandervieren [7]. This method extends the standard boxplot methodology by incorporating the medcouple skewness measure. It provides a visual representation of the dataset’s distribution and allows for the identification of potential outliers based on quartiles and the interquartile range. The adjusted boxplot method is known for its greater robustness and effectiveness compared to the original boxplot method, especially in scenarios where the distribution is skewed or contains outliers.

In their work, Hubert and Vandervieren [7] state that the medcouple skewness measure possesses the robustness of the quartile skewness measure. Therefore, in the proposed online approach, the medcouple measure is replaced with the quartile skewness measure defined by Eq. 9 to quantify the degree of skewness in the data distribution. This method defines the whiskers differently from the original boxplot method, using an exponential formula based on the interquartile range and the skewness measure. The whiskers in this online boxplot method are defined as follows:

$$\begin{aligned}&\text {if } \text {QSM} \ge 0:\\&\quad \text {Lower Bound} = Q_1 - w\exp (a\,\text {QSM})\,\textrm{IQR},\\&\quad \text {Upper Bound} = Q_3 + w\exp (b\,\text {QSM})\,\textrm{IQR}\\&\text {if } \text {QSM} < 0:\\&\quad \text {Lower Bound} = Q_1 - w\exp (-b\,\text {QSM})\,\textrm{IQR},\\&\quad \text {Upper Bound} = Q_3 + w\exp (-a\,\text {QSM})\,\textrm{IQR} \end{aligned}$$
(10)

where w, a, and b are constants that control the length of the whiskers. The value of w can be set to 1.5 or 3, as in the original boxplot method, while the values of a and b are chosen based on the degree of skewness in the data distribution. The selection process for these parameters will be further elaborated in the upcoming section.
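Putting Eqs. 9 and 10 together, the whiskers can be computed from the three quartile estimates in a few lines. This sketch uses the defaults \(a=-4\) and \(b=3\) from the batch adjusted boxplot; Sect. 4.3 describes how these values can be re-optimized. The function name is illustrative.

```python
import math

def adjusted_whiskers(q1, q2, q3, w=1.5, a=-4.0, b=3.0):
    """Whiskers of the online boxplot per Eq. 10, using the quartile
    skewness measure (QSM, Eq. 9) computed from the quartile estimates."""
    iqr = q3 - q1
    qsm = ((q3 - q2) - (q2 - q1)) / iqr if iqr > 0 else 0.0
    if qsm >= 0:
        lower = q1 - w * math.exp(a * qsm) * iqr
        upper = q3 + w * math.exp(b * qsm) * iqr
    else:
        lower = q1 - w * math.exp(-b * qsm) * iqr
        upper = q3 + w * math.exp(-a * qsm) * iqr
    return lower, upper
```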

The exponential formula used to define the whiskers in the online boxplot method provides a more flexible and adaptive approach to identifying outliers compared to the standard boxplot method. By adjusting the length of the whiskers based on the degree of skewness in the data distribution, this formula effectively handles extreme values and asymmetrical distributions.

In summary, the online boxplot inherits the robustness of the adjusted boxplot method and proves to be an effective statistical technique for identifying outliers incrementally in skewed data distributions. It utilizes the quartile skewness measure to quantify the degree of skewness in the data distribution. Additionally, the whiskers are defined using an exponential formula that better accommodates extreme values and asymmetrical distributions compared to the standard boxplot method. By incorporating this approach into our algorithm, we can more accurately detect and handle outliers in data streams.

3.4 Integration of approaches

The proposed approach for online outlier detection on univariate datasets integrates the online quartile estimation algorithm, online quartile skewness measure, and online boxplot approach. The online quartile estimation algorithm is used to estimate the quartile values incrementally, and the quartile skewness measure is used to quantify the degree of asymmetry in the distribution. The boxplot approach is then applied to identify outliers.

In the online quartile estimation algorithm, the quartile values are estimated using a cumulative distribution function based on the histogram of the data. The algorithm operates in two steps, with the first step performed offline to initialize the quartile values (Algorithm 1), and the second step performed in online mode to update the values as new data points arrive (Algorithm 2). This approach results in more accurate quartile estimates and enables the algorithm to operate with limited memory resources.

The quartile skewness measure is then used to quantify the degree of asymmetry in the distribution (Eq. 9). The quartile skewness measure is updated by incorporating new data points and adjusting the distribution’s quartiles accordingly. This measure is resistant to outliers and can accurately reflect the asymmetry of a univariate continuous distribution.

The exponential formula of the adjusted boxplot approach is then used to identify outliers based on the online quartile values and the online quartile skewness measure. The approach involves constructing a boxplot with adjusted whiskers, defined by Eq. 10, where the quartile skewness measure is used to adjust the whiskers to account for skewness in the distribution, which improves the accuracy of outlier detection. Data points beyond the whiskers are identified as potential outliers.

Since the algorithm aims to detect outliers incrementally while conserving memory, it selectively excludes highly extreme data points from the histogram update, aligning with the principles of Chebyshev’s inequality. This approach not only conserves memory, but also helps address another concern of losing precision due to excessive bin aggregation.

Theorem 1

(Chebyshev’s inequality) Let X be a random variable with finite mean \(\mu \) and finite nonzero standard deviation \(\sigma \). For any positive constant \(k > 1\), the probability that the absolute difference between X and its mean is greater than or equal to k times the standard deviation (i.e., \(|X - \mu | \ge k\sigma \)) is less than or equal to \(1/k^2\) [32].

$$\begin{aligned} \mathbb {P}(|X - \mu | \ge k\sigma ) \le \frac{1}{k^2} \end{aligned}$$
(11)

According to Chebyshev’s inequality, we selected a sufficiently large value for the parameter k, ensuring that data points lying beyond k standard deviations from the mean have a very low probability of belonging to the same distribution. In our study, we selected a probability lower than \(10^{-3}\%\). Interestingly, we discovered that the algorithm is not overly sensitive to this parameter. By employing this threshold, we can classify points outside this range as outliers without updating the histogram or the running statistics. This successfully addresses concerns related to excessive bin construction and the loss of precision caused by aggregation functions.

The proposed approach can be used in real time to detect outliers accurately and efficiently by updating the quartiles and skewness measures with each new data point. The approach is particularly useful for datasets that exhibit non-normal distributions and outliers, as it incorporates robust measures of skewness and is resistant to outliers. Overall, the approach provides a reliable method for outlier detection in univariate datasets while utilizing limited memory with minimal hyperparameter tuning.
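The following sketch ties the pieces together, reusing the illustrative helpers from the earlier sketches (init_quartiles, quartiles_from_cdf, update_histogram, and adjusted_whiskers). It is an outline of the integration under our naming assumptions, not the authors' exact implementation.

```python
import math

class OnlineBoxplot:
    """Integrated online outlier detector (sketch): histogram-based
    quartiles, QSM-adjusted whiskers (Eq. 10), and a Chebyshev gate
    that keeps extreme points out of the histogram update."""

    def __init__(self, buffer, n_bins=150, w=3.0, k=55.0, max_bins=1000):
        self.w, self.k, self.max_bins = w, k, max_bins
        self.counts, self.edges, (q1, q2, q3) = init_quartiles(buffer, n_bins)
        self.lower, self.upper = adjusted_whiskers(q1, q2, q3, w=w)
        # Running moments for the Chebyshev gate, seeded from the buffer
        self.n = len(buffer)
        self.mean = sum(buffer) / self.n
        self.m2 = sum((v - self.mean) ** 2 for v in buffer)

    def process(self, x):
        std = math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
        # Chebyshev gate: flag extreme points without touching the histogram
        if std > 0 and abs(x - self.mean) >= self.k * std:
            return True
        is_outlier = not (self.lower <= x <= self.upper)
        self.counts, self.edges = update_histogram(
            x, self.counts, self.edges, self.max_bins)
        q1, q2, q3 = quartiles_from_cdf(self.counts, self.edges)
        self.lower, self.upper = adjusted_whiskers(q1, q2, q3, w=self.w)
        # Welford update of the running mean and variance
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier
```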

4 Experiments

This section of the paper provides a comprehensive description of the employed experimental framework. It introduces the datasets used, outlines the evaluation approach, and then provides details of the experimental setup. This section establishes the fundamental groundwork for the subsequent analyses and results.

4.1 Dataset

In this subsection, we present a thorough introduction to the datasets utilized in this study, offering comprehensive insights into their composition, characteristics, and sources.

The dataset collection for this study encompasses simulation datasets with distinct skewness distributions, namely normal, highly right-skewed, and highly left-skewed distributions. This deliberate variation in skewness enables us to evaluate the algorithm’s ability to effectively consider the inherent distributional properties during the outlier detection process. It is important to note that conventional versions of the boxplot often struggle with skewed distributions, incorrectly identifying a substantial portion of the data as outliers. In response, our algorithm seeks to overcome this limitation by considering the specific characteristics of the distribution, resulting in enhanced accuracy in outlier detection. By incorporating datasets with diverse skewness distributions, we can thoroughly assess the algorithm’s performance in accurately identifying outliers across different types of distributions.

In addition to the simulation datasets, we integrate a real-world dataset. This dataset, which was obtained from an IT company, measures the hardware resources used by cloud infrastructure. The metrics offer quantifiable measurements of various hardware performance aspects, providing valuable insights into the operating environment. With a primary emphasis on outlier detection, our algorithm is applied to more than 200 unique time series. These time series encompass different facets of hardware performance, enabling a comprehensive evaluation of the algorithm’s effectiveness in practical scenarios.

It is noteworthy that the algorithm aligns successfully with the objectives and requirements of the IT company aimed at detecting faults in the system. To demonstrate the algorithm’s performance in an online setting and its adaptability to diverse distributions, we specifically select several time series from the private dataset that exhibit varying skewness characteristics. By incorporating these time series, which represent different distributional patterns, we aim to showcase the algorithm’s ability to accurately detect outliers while providing justifications for its performance across varying distribution types.

4.2 Evaluation

This section provides a thorough analysis of our algorithm’s performance, with particular emphasis on two critical elements: the evaluation procedure and the metrics utilized for assessment.

To evaluate the algorithm’s performance on the dataset, we conducted a comparative analysis between two scenarios: the online boxplot algorithm and the batch scenario. The online algorithm involves computing approximated quartiles, constructing a boxplot that considers skewness, and handling data incrementally. On the other hand, the batch scenario assumes access to the entire dataset. This comparison aims to assess the algorithm’s effectiveness in handling incremental data and its ability to approximate results achieved when all data are accessible.

In the batch scenario, memory usage is a significant consideration as the algorithm requires memory capacity equal to the size of the dataset for comprehensive processing and analysis. In contrast, the novel online algorithm operates with fixed memory requirements, which can be adjusted as a hyperparameter. This capability of fixed memory usage enables the online algorithm to efficiently handle large datasets, making it more scalable and adaptable to real-time applications.

Precision serves as the primary evaluation metric for outlier detection. By comparing the precision achieved by the online algorithm to that of the batch scenario, we can assess the algorithm’s ability to approximate the results obtained when all data are available. To provide a comprehensive evaluation, we also incorporate recall and F1 metrics. Recall measures the ability of the algorithm to identify all relevant instances of outliers, while the F1 score combines both precision and recall into a single metric, giving a balanced assessment of the algorithm’s performance. This comparison not only offers insights into the effectiveness of the online algorithm in identifying outliers, but also accounts for its performance with limited information available at each step of the analysis and with restricted memory usage. These metrics are calculated using the following equations:

$$\begin{aligned}&\textrm{Precision} = \frac{\text {True Positives}}{\text {True Positives} + \text {False Positives}}\\&\textrm{Recall} = \frac{\text {True Positives}}{\text {True Positives} + \text {False Negatives}}\\&F1 = 2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(12)
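In practice, these metrics can be computed directly from the two label sequences, treating the batch results as the reference. The snippet below uses scikit-learn, which is an implementation choice of ours; the labels shown are toy values.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy example: 1 = outlier, 0 = normal point.
batch_labels = [0, 0, 1, 1, 0, 1]    # reference: batch boxplot with full data
online_labels = [0, 1, 1, 1, 0, 1]   # flags produced by the online algorithm

precision = precision_score(batch_labels, online_labels)
recall = recall_score(batch_labels, online_labels)
f1 = f1_score(batch_labels, online_labels)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```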

4.3 Experimental setup

In this subsection, we present the parameters and details of the algorithm used in our experiment. According to Sect. 3, the algorithm comprises two steps: an offline phase and an online phase. During the offline phase, it is crucial to define the length of the buffer used to initialize the statistics; the buffer should represent the entire dataset effectively. Note that if the number of outliers in the buffer exceeds a certain threshold, the algorithm will take those points into account when determining the distribution of the dataset.

The subsequent parameter to consider is the number of bins employed during the initialization phase to construct the initial histogram. As outlined in Sect. 3, we sought to incorporate the data’s variability when selecting a value for this parameter. To streamline the process in future investigations, we intend to automate the determination of this parameter. For the simulation experiment, we set the bin width to the standard deviation of the buffer from the offline phase, while for the real datasets, we opted for 150 bins to ensure a satisfactory level of precision.

The definition of whiskers in the boxplot includes a parameter denoted as w (Eq. 10), which was set to 3 in our experiments with the aim of focusing on extreme outliers. This value plays a crucial role in determining the range within which outliers are identified.

Moreover, the online boxplot approach incorporates two additional parameters, namely a and b. To determine suitable values for these parameters, we employed an optimization process in our experiments. The objective was to minimize the occurrence of false positives compared to the batch definition of the algorithm within a limited time frame. For this optimization, we used a portion of the IT company data twice the length of the initialization buffer, made available all at once.

To enhance the efficiency of the outlier detection process, we incorporated the Chebyshev inequality into the algorithm. This inequality allows us to disregard extremely high values, for which the probability of following the same distribution is less than \(\frac{1}{k^2}\). For our experiment, we set the value of k to 55, reducing false positives while maintaining computational speed. Notably, to analyze this parameter’s impact, we conducted experiments with three different values of k, which demonstrated the algorithm’s robustness to variations in this parameter.

In terms of the experiment itself, we set the maximum number of bins to 1000. However, this value can be adjusted depending on the specific requirements of the analysis. It is important to consider the trade-off between precision and memory efficiency when choosing a lower value, as lower values result in more aggregated bins and reduced precision.

5 Result

In the following section, we present the results of applying our novel online outlier detection algorithm to various datasets and provide an in-depth analysis of the outcomes.

5.1 Simulation study

For the simulation study, we chose three datasets representing different skewness distributions: normal, highly right-skewed, and highly left-skewed. This deliberate variation in skewness allows us to assess the algorithm’s ability to accurately identify the distribution type and consider its inherent properties during the outlier detection process.

The first dataset exhibits a highly right-skewed distribution, generated using a Gamma distribution with a shape parameter (\(\alpha = 0.3\)) and an inverse scale parameter (\(\beta = 0.1\)). By calculating the medcouple skewness, a robust and sensitive measure, we find that this dataset has a skewness value of approximately 0.7. While the traditional Tukey boxplot flags a substantial portion of the data points (about 11.2%) as outliers, our new algorithm only detects the points at the far end of the right tail. Figure 1 illustrates the dataset’s histogram, where the black points represent the detected outliers.

Furthermore, we evaluated the algorithm’s performance by applying it in batch mode, where the whiskers are calculated using the entire data available up to that point, serving as a baseline. As shown in Fig. 2, both approaches yield nearly identical results, with precision, recall, and F1 values of 0.89, 1, and 0.94, respectively (see Table 1).

This finding validates the algorithm’s capability to identify distribution skewness and detect outliers, with only a slight variation between approximate statistical calculations and utilizing the entire dataset. Remarkably, the algorithm identifies a considerably smaller number of normal data points as outliers when compared to the traditional boxplot method: approximately 0.2% versus 11.2%, respectively.

Fig. 1: Histogram displaying the highly right-skewed distribution, represented as \(X \sim \Gamma (0.3,0.1)\), along with the outliers detected by the novel online algorithm (black points)

Fig. 2: Illustration of data points following a highly right-skewed distribution, represented as \(X \sim \Gamma (0.3,0.1)\). Outliers were identified using the online algorithm (black points) compared to the algorithm applied with access to the entire dataset (hollow points)

Table 1 Evaluation of algorithm performance under highly right-skewed and highly left-skewed distributions with varying parameter k in Chebyshev’s inequality

For the second dataset, we generated 10,000 data points from a normal distribution with a mean of zero (\(\mu = 0\)) and a standard deviation of one (\(\sigma = 1\)). We anticipate that the algorithm will align with the traditional Tukey’s boxplot, resulting in almost no points being identified as outliers (see Figs. 3 and 4). This expectation is based on the characteristic behavior of a normal distribution, where the majority of data points cluster around the mean and exhibit symmetric distribution.

Fig. 3: Histogram displaying a normal distribution, represented as \(X \sim \mathcal {N}(0,1)\). As anticipated, no outliers were detected using the algorithm

Fig. 4: Illustration of a normal distribution, represented as \(X \sim \mathcal {N}(0,1)\), where no point is detected as an outlier applying the online algorithm

The final dataset showcases a left-skewed distribution, generated by the Gamma distribution \(X \sim -\Gamma (0.4,0.2)\). This dataset exhibits a medcouple skewness value of approximately \(-0.6\). Unlike the traditional Tukey boxplot, which identifies approximately 9.3% of the data points as outliers, our novel algorithm solely detects the points located at the far end of the left tail, amounting to nearly 0.2%. Figure 5 provides a visualization of the dataset’s histogram, with black points denoting the detected outliers.

To evaluate the algorithm’s performance, we conducted the batch mode analysis where the whiskers were calculated using the entire available data up to that point, establishing a baseline. Figure 6 demonstrates that both approaches yield very similar outcomes, resulting in precision, recall, and F1 scores of 0.81, 0.94, and 0.87, respectively (see Table 1).

These results confirm the algorithm’s ability to discern distribution skewness and effectively detect outliers, with minimal disparity between the approximate statistical calculations and utilizing the entire dataset. Notably, the algorithm identifies fewer normal data points as outliers than the traditional boxplot method.

Fig. 5: Histogram displaying the highly left-skewed distribution, represented as \(X \sim -\Gamma (0.4,0.2)\), along with the outliers detected by the algorithm (black points)

Fig. 6: Illustration of data points following a highly left-skewed distribution, represented as \(X \sim -\Gamma (0.4,0.2)\). Outliers were identified using the online algorithm (black points) compared to the algorithm applied with access to the entire dataset (hollow points)

5.2 Hardware resources data

In this section, we provide a detailed analysis of the algorithm’s performance by applying it to a real-world dataset acquired from an IT company. The experimental dataset comprises four distinct metrics: Cache bytes, CPU utilization, State of Sysmon service, and Average disk read queue length. These metrics allow us to gain insights into different aspects of the IT system’s performance and resource utilization.

The Cache bytes metric provides valuable information about the utilization of cache memory, which plays a crucial role in enhancing data access speed and reducing latency.

The CPU utilization metric gives us an understanding of the workload imposed on the central processing unit, enabling us to identify potential bottlenecks or areas where system resources may be underutilized.

The State of Sysmon service metric evaluates the operational status and functionality of the Sysmon service, which plays a critical role in monitoring and logging system events for security and analysis purposes.

Lastly, the average disk read queue length metric provides insights into the efficiency of data retrieval from storage devices, giving us an indication of potential performance issues related to disk read operations.

By analyzing these metrics we aim to provide a comprehensive evaluation of the algorithm’s effectiveness in detecting anomalies and identifying potential issues within an IT infrastructure.

Fig. 7: Results of the online boxplot algorithm applied to the Cache bytes metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

Fig. 8: Results of the online boxplot algorithm applied to the CPU utilization metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

Fig. 9: Results of the online boxplot algorithm applied to the state of service metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

Fig. 10: Results of the online boxplot algorithm applied to the average disk read queue length metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

The algorithm’s efficacy in outlier detection is exemplified by Figs. 7, 8, 9 and 10, where it achieves high performance when compared to the batch scenario used as a baseline. Also, Fig. 11 illustrates the results of the online boxplot algorithm applied to the Cache bytes metric with varying values of the k parameter. This figure demonstrates the algorithm’s adaptability and robustness to parameter changes, ensuring reliable outlier detection across diverse datasets.

Moreover, Table 2 provides further insight into the algorithm’s performance, showcasing its precision, recall, and F1 scores across different IT metrics under varying k values. This exceptional precision underscores the algorithm’s capability to calculate approximate statistics incrementally, even with limited data access and a predefined memory constraint of 1000 bins. This attribute highlights a significant strength of the novel online boxplot algorithm.

Fig. 11: Results of the online boxplot algorithm applied to the Cache bytes metric with different values of the k parameter. The solid red line represents the batch boundary, while the other dashed lines demonstrate the dynamically updated whiskers for three values of the k parameter

Table 2 Evaluation of algorithm performance for the real-world dataset with varying parameter k in Chebyshev’s inequality

6 Discussion and conclusion

In conclusion, outlier detection plays a vital role in various contexts for identifying anomalous or exceptional events. However, the challenges posed by big data, including the massive volume of data and limited computational resources, call for innovative approaches. This paper has presented an incremental/online version of the boxplot algorithm as a solution.

By employing an approximation approach based on the numerical integration of the histogram and the calculation of the cumulative distribution function, the proposed algorithm has demonstrated its effectiveness. It considers the dataset’s distribution to define the whiskers, leading to exceptional outlier detection results even in the presence of skewed distributions. Moreover, the algorithm leverages robust measures like quartiles and median, surpassing the limitations of unsupervised outlier detection techniques.

A notable contribution of this research is the introduction of a histogram-based approach for online computation of these measures. This ensures accurate estimation and reliable outlier detection in real time. By addressing the challenges faced by traditional batch mode methods, the proposed algorithm opens doors for more efficient and scalable outlier detection in the era of big data.

In the evaluation phase, the algorithm underwent rigorous testing on both simulated datasets with varying degrees of skewness and a real-world dataset for software fault detection. These evaluations showcased the algorithm’s robustness and effectiveness in outlier detection. Furthermore, the algorithm’s computational efficiency was evident as it maintained a constant memory footprint, making it suitable for processing large datasets incrementally. Key evaluation metrics, including memory usage and precision, emphasized the algorithm’s suitability for real-time applications where immediate access to the entire dataset is not feasible.

In addition to its robust performance and efficiency, our proposed incremental/online boxplot algorithm possesses several notable strengths. It stands out for its ability to handle skewed distributions effectively, a common challenge in outlier detection. By leveraging a histogram-based approach for the online computation of measures such as quartiles and median, the algorithm ensures accurate estimation and reliable outlier detection in real time. Moreover, its simplicity and ease of interpretation make it accessible to users across different domains, regardless of their expertise in outlier detection techniques. These strengths collectively contribute to the algorithm’s versatility and applicability in various real-world scenarios, marking it as a valuable tool in the outlier detection toolkit.

While our proposed incremental/online boxplot algorithm demonstrates promising results in outlier detection, it is essential to acknowledge its limitations and areas for further improvement. One crucial aspect for future research is to apply the algorithm to datasets from various industries to showcase its adaptability and identify potential shortcomings. By subjecting the algorithm to different datasets, we can gain insights into its performance under diverse conditions and refine it accordingly.

Additionally, future work should focus on full automation, enabling the algorithm to automatically adjust its parameters to adapt to dynamic data streams. This automation not only enhances the algorithm’s usability, but also contributes to its scalability and applicability in real-world scenarios. Furthermore, there is a need to develop the algorithm for multivariate time series data and extend its capability to detect other types of events based on clear definitions. By addressing these areas, we can further enhance the algorithm’s effectiveness and broaden its utility across various domains.

In summary, this research introduces an incremental/online boxplot algorithm that effectively tackles the challenges of outlier detection in the realm of big data. The algorithm’s online outlier detection capability, consideration of distribution skewness, and efficient utilization of computational resources make it a valuable tool for various applications. As we explore future possibilities, automating the algorithm holds great potential for further enhancing its performance and usability in dynamic data environments.