1 Introduction

Outlier detection is a widely employed method for identifying unusual or exceptional events across a range of scenarios, including fault detection, fraud detection, real-time monitoring systems, and information fusion, where it pinpoints objects diverging from anticipated norms after fusion [1, 2]. This field has received extensive research attention for many years, resulting in the introduction of numerous approaches to anomaly detection [3].

Unsupervised machine learning techniques are crucial for outlier detection, allowing anomalies to be identified without relying on pre-labeled data. However, these techniques present challenges, including the requirement for complex algorithms and difficulties in handling noisy or high-dimensional data [4]. As the importance of online learning continues to grow, it becomes essential to develop statistical estimators that can incrementally compute higher moments and other statistical measures, addressing the complexities associated with implementing unsupervised techniques in dynamic environments.

In the realm of outlier detection, a boxplot stands out as a widely used and straightforward tool that effectively overcomes the limitations imposed by the complexities of other outlier detection methods. It relies on visualizing the data distribution and identifying any points that fall outside the expected range. The boxplot is particularly effective in identifying extreme outliers and is less susceptible to their influence due to its use of quartiles and interquartile ranges to define the box and whisker plot. As a result, the boxplot is a valuable tool for identifying outliers in various types of data and has become a standard approach in many fields [5].

Real-time data analysis has become increasingly important in recent years, as there is a need to analyze data that is received gradually over time. However, this presents a significant challenge in the big data age, as time and memory constraints have made it necessary to use incremental or recursive methods that do not require browsing through old data to update an estimator or analysis [4].

One of the fundamental challenges in the real-time analysis of big data is the computation of statistical estimators, including measures like the mean and median. While the mean can be computed incrementally, the median requires the ordering of all data points, which makes incremental computation difficult. To address this challenge, it is essential to develop online approaches for statistical estimators that enable the incremental computation of higher moments and other statistical estimators [6].
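For instance, the sample mean admits a simple recursive update that uses only the previous estimate and a counter:

$$\begin{aligned} \bar{x}_{n} = \bar{x}_{n-1} + \frac{x_{n} - \bar{x}_{n-1}}{n} \end{aligned}$$

No comparable closed-form recursion exists for the median, which is why approximate schemes such as the histogram-based approach developed in this paper are needed.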

In this study, we propose an online approach for detecting outliers on univariate data sets based on the adjusted boxplot and quartile skewness as a robust measure to reflect the asymmetry of a univariate continuous distribution. Our approach enables the computation of quartiles and the boxplot incrementally, allowing for the detection of outliers as new data points arrive. To estimate quartiles, we utilize an online learning algorithm that updates quartile estimates with each new data point, ensuring consistency and accuracy in our approach. We also update the skewness measure as quartiles are updated with new data points, making our approach more reliable for both symmetric and asymmetric distributions.

To define the whiskers, we adopt the adjusted boxplot approach that was previously proposed in batch mode. Specifically, we utilize the exponential model for defining the whiskers, which has been proven to be efficient in previous studies [7]. By combining these approaches, we have an online boxplot that does not make any assumptions about the distribution of the data. This allows for the accurate and efficient analysis of large data sets without the need to store all historical data, making our approach a valuable tool for online data analysis.

Our proposed algorithm consistently and reliably detects outliers in both symmetric and asymmetric distributions. Its almost real-time capabilities and lack of distribution assumptions make it useful in various fields, including finance, healthcare, and cybersecurity.

The subsequent sections of this paper are organized as follows: In Sect. 2, we provide a comprehensive review of prior work on outlier detection, specifically focusing on the batch mode implementation of the boxplot approach, and discuss the strengths and weaknesses of existing approaches in this context. Section 3 details our proposed methodology, which encompasses several key components: the online quartile estimation algorithm, the quartile skewness measure, the online boxplot approach, and the integration of these three approaches. In Sect. 4, we outline the experimental setup for our methodology, specifying the datasets we have utilized and explaining how the results will be assessed. Section 5 is dedicated to presenting and discussing the results obtained from applying our approach to various datasets, including simulation datasets as well as real-world IT datasets; we also compare our results with those achieved using the batch mode boxplot approach that retains all historical data. Finally, in Sect. 6, we conclude the paper by summarizing our findings, discussing the limitations of our approach, and proposing potential directions for future research in this field.

2 Background

In this section, we will delve into the fundamentals of outlier detection, explore various boxplot methods, and provide an overview of several online outlier detection algorithms that have been developed to address the challenges posed by streaming data. While we briefly discuss select online outlier detection algorithms, our primary emphasis remains on understanding the strengths and limitations of different box plot approaches. This exploration aims to inform our development of a novel online boxplot algorithm tailored to dynamic data environments.

Identifying and removing outliers play a crucial role in data analysis and decision-making processes. Outliers are data points that deviate significantly from the underlying distribution of a dataset  [8], often carrying valuable information or indicating anomalies. The definition of outliers varies in the literature and often relies on assumptions about the data distribution and analysis techniques [9,10,11].

Hawkins [9] provides a formal definition of an outlier as an observation that deviates to such an extent from other observations that it raises suspicion that it was generated by a different mechanism. Grubbs [10] describes an outlier as an observation that appears markedly different from the rest of the sample. Barnett et al. [11] define outliers as observations or subsets of observations that seem inconsistent with the remaining data.

Outliers can also be referred to as anomalies, discordant observations, exceptions, faults, defects, aberrations, noise, damage, abnormalities, or contaminants, depending on their cause. They can arise for various reasons, such as human or machine error, mechanical faults, or interesting events generated by a different mechanism [12].

In the field of outlier detection, a wide range of techniques are available. These techniques can be categorized into different groups based on how outliers are defined and detected. First, statistical-based techniques involve constructing a data distribution model and identifying outliers as points with the lowest probability generated from the global distribution [4].

Distance-based techniques determine the outlierness of an object by measuring its distances to its neighbors in the dataset. An example of a distance-based outlier detection method is the k-nearest neighbors (k-NN) approach, which determines the “outlierness” of a data point based on its distance to its k-nearest neighbors [13]. Density-based techniques estimate the density around data points and identify outliers as those surrounded by dissimilar densities compared to their local neighbors. The local outlier factor (LOF) algorithm falls under density-based techniques [14].

The clustering-based technique operates on the assumption that outliers exhibit one of the following characteristics: they do not belong to any cluster, they are located significantly far away from the center of their nearest cluster, or they belong to a small or sparse cluster. Classification-based techniques rely on a training dataset to develop a classification model, which is then used to classify unseen instances into normal or outlier classes. In unsupervised classification, the model learns to fit most of the training data [4].

Frequent pattern mining-based techniques aim to discover patterns that represent normal data behavior, and outliers are patterns that deviate from the established normal behavior [15]. Isolation-based techniques focus on isolating outliers from the rest of the data based on the assumption that outliers are few and different [16]. Information-theoretic techniques use entropy-like measurements or mutual information to identify outliers [17, 18]. Finally, mass-based approaches compute dissimilarity based on the data distribution rather than geometric position, and outliers are identified by recognizing data points in low-mass regions [19].

In statistical analysis, there are methods based on the mean, such as the mean absolute deviation (MAD) and the Z-Score, which provide robust estimations of dispersion and help identify outliers. MAD considers the average distance between each data point and the mean, while the Z-Score uses the mean and standard deviation to calculate the distance from the mean. However, the Z-Score may not be suitable for small datasets [20].

Alternatively, methods based on the median offer reliable approaches for outlier detection. The median absolute deviation (MADe) method and the median rule use the median and median absolute deviation to identify outliers. MADe defines intervals around the median, and the median rule employs a scale factor of the interquartile range to determine the intervals. These methods are less influenced by extreme values [20].

Boxplots are a powerful visualization tool for summarizing the distribution of a dataset. They offer a concise overview of the data by displaying key statistics such as the median, quartiles, and potential outliers. Boxplots are effective in identifying outliers and can reveal the skewness and symmetry of a distribution. Additionally, they are particularly useful for comparing distributions across different groups or categories. Boxplots are space-efficient and robust to extreme values, providing a representative summary of the data while showcasing the spread of the central portion of the distribution.

However, boxplots do have their limitations. They may not capture detailed information about the shape or multimodality of the distribution, which could be essential in some cases. Therefore, it is often beneficial to complement boxplots with other visualization methods to gain a more comprehensive understanding of the data. Despite this drawback, the boxplot introduced by Tukey [21] remains a popular graphical method for analyzing univariate datasets. The construction of a boxplot involves defining a box and two whiskers. The box represents the interquartile range (IQR), which is calculated as the difference between the third quartile (\(Q_3\)) and the first quartile (\(Q_1\)), and thus captures the spread of the central 50% of the data. Probable outliers are points that fall outside the range defined by:

$$\begin{aligned}{}[Q_1 - w \times \textrm{IQR}\,\,\,\,,\,\,\,\, Q_3 + w \times \textrm{IQR}] \end{aligned}$$
(1)

When w is set to 1.5, only a small fraction of data points from a normal (Gaussian) distribution, approximately 0.7%, would be classified as outliers according to probability theory [22]. Alternatively, by setting w to 3, the focus shifts to extreme outliers, resulting in almost no data points from a normal distribution being identified as outliers.
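This fraction can be verified directly for the standard normal distribution: with \(Q_1 \approx -0.6745\), \(Q_3 \approx 0.6745\), and hence \(\textrm{IQR} \approx 1.349\), the fences lie at \(\pm (0.6745 + 1.5 \times 1.349) \approx \pm 2.698\), so that

$$\begin{aligned} \mathbb {P}(|X| > 2.698) = 2\,\Phi (-2.698) \approx 0.007, \end{aligned}$$

that is, roughly 0.7% of observations fall outside the fences.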

It is important to highlight that this method tends to identify more observations as outliers when the data exhibits higher skewness [20]. However, it should be noted that observations falling outside the fence are not necessarily true outliers that behave differently from most of the data. In distributions with long tails and symmetry, many regular observations may extend beyond the whiskers, while in distributions with short tails, observations are unlikely to exceed the fence. This phenomenon applies similarly to skewed distributions as well [7].

Kimber proposed an adjustment to the fence rule for skewed data by replacing the interquartile range (IQR) with 2 times the lower semi-interquartile range (SIQRL) for the lower fence and 2 times the upper semi-interquartile range (SIQRU) for the upper fence. However, this adjustment only slightly accounts for the underlying skewness [23]. By considering \(w=1.5\) in Eq. 1, the fence rule is defined as:

$$\begin{aligned}{}[Q_1 - 3 \textrm{SIQRL}\,\,\,\,,\,\,\,\, Q_3 + 3 \textrm{SIQRU}] \end{aligned}$$
(2)

where \(\textrm{SIQRL} = Q_2 - Q_1\) and \(\textrm{SIQRU} = Q_3 - Q_2\). Although this semi-interquartile range (SIQR) boxplot approach has been introduced, it does not sufficiently adjust itself for skewness. This boxplot still marks many regular observations as outliers and fails to clearly distinguish between regular observations and outliers. The upper whiskers are slightly increased, resulting in fewer observations being highlighted as upper outliers, while the lower whiskers simply mark the smaller observations [7].

Hubert and Vandervieren proposed an adjustment to the boxplot that can be applied to all distributions, even those without finite moments [7]. Moreover, this boxplot uses the medcouple (MC), a robust measure of skewness defined by Brys, Hubert, and Struyf [24], to estimate the underlying skewness and avoid masking the real outliers. For a univariate sample \({x_1,..., x_n}\) from a continuous unimodal distribution, MC is defined as:

$$\begin{aligned} \textrm{MC} = \textrm{med}_{x_i \le Q_2 \le x_j} h(x_i,x_j) \end{aligned}$$
(3)

where \(Q_2\) is the sample median. For all \(x_i \ne x_j\), the median (\(\textrm{med}\)) of the kernel function h is calculated. The function h is given by:

$$\begin{aligned} h(x_i,x_j) = \frac{(x_j - Q_2) - (Q_2 - x_i)}{x_j - x_i} \end{aligned}$$
(4)

MC ranges between \(-1\) and 1. MC = 0 implies the data are symmetric, and the adjusted boxplot reduces to Tukey’s boxplot. If \(\textrm{MC} > 0\), the distribution of the data is right-skewed, and if \(\textrm{MC} < 0\), it is left-skewed. The whiskers of the adjusted boxplot, according to the MC value, are defined as follows:

$$\begin{aligned}&\text {if } \textrm{MC} \ge 0: \\&\quad \text {Lower bound} = Q_1 - 1.5\exp (a\,\textrm{MC})\,\textrm{IQR},\\&\quad \text {Upper bound} = Q_3 + 1.5\exp (b\,\textrm{MC})\,\textrm{IQR} \\&\text {if } \textrm{MC} < 0: \\&\quad \text {Lower bound} = Q_1 - 1.5\exp (-b\,\textrm{MC})\,\textrm{IQR}, \\&\quad \text {Upper bound} = Q_3 + 1.5\exp (-a\,\textrm{MC})\,\textrm{IQR} \end{aligned}$$
(5)

where IQR is the interquartile range, and \(Q_3\) and \(Q_1\) are the third quartile and the first quartile, respectively. In this model, constants a and b were determined through the analysis of different distribution families. To ensure the model’s simplicity and robustness, Hubert and Vandervieren [7] chose the values of \(a=-4\) and \(b=3\).
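As an illustration, the medcouple of Eqs. 3 and 4 can be computed naively as follows. This \(O(n^2)\) sketch is our own and, for simplicity, omits the special kernel used to break ties when several observations equal the median, so it is an approximation rather than a reference implementation (fast \(O(n \log n)\) algorithms exist).

```python
import numpy as np

def medcouple_naive(x):
    """Naive O(n^2) medcouple (Eqs. 3-4): median of the kernel h over
    all pairs (x_i, x_j) with x_i <= Q2 <= x_j and x_i != x_j.
    Ties at the median are not treated specially in this sketch."""
    x = np.sort(np.asarray(x, dtype=float))
    q2 = np.median(x)
    lower = x[x <= q2]
    upper = x[x >= q2]
    h = []
    for xi in lower:
        for xj in upper:
            if xj != xi:
                # h(x_i, x_j) = ((x_j - Q2) - (Q2 - x_i)) / (x_j - x_i)
                h.append(((xj - q2) - (q2 - xi)) / (xj - xi))
    return float(np.median(h))
```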

According to this study [7], the model focuses primarily on commonly occurring distributions with moderate skewness. However, constructing a comprehensive model that includes cases with \(\textrm{MC} > 0.6\) is acknowledged to be challenging. The adjusted boxplot mentioned in this context focuses on addressing skewness while not considering tail heaviness. The authors highlight various drawbacks associated with this methodology. Primarily, the complexity of the model increases with the addition of more estimators and parameters. Moreover, the model’s robustness diminishes as the tail measures exhibit a lower breakdown value, leading to increased variability in the length of the whiskers due to the variability of these tail measures [7].

Introduced in 2018 by Walker et al., the ratio-skewed boxplot is a practical method for handling skewed data and detecting outliers across various parametric distributions. It can be applied to univariate datasets, regardless of their symmetry or sample size [5]. The method incorporates the use of Bowley’s coefficient [25], derived from Kimber’s rule [23], which replaces the interquartile range with 2SIQR. This coefficient makes use of the quartiles to give an indication of skewness, given by:

$$\begin{aligned} B_{c}&= \frac{\textrm{SIQR}_{\textrm{U}} - \textrm{SIQR}_{\textrm{L}}}{\textrm{IQR}}\\&= \frac{(Q_{3} - Q_{2}) - (Q_{2} - Q_{1})}{\textrm{IQR}} \end{aligned}$$
(6)

The ratio-skewed boxplot can be constructed by employing this coefficient in the definition of fences, represented by:

$$\begin{aligned}{}[Q_1 - 1.5\,\textrm{IQR}\,R_\textrm{L}\,\,\,\,,\,\,\,\, Q_3 + 1.5\,\textrm{IQR}\,R_\textrm{U}] \end{aligned}$$
(7)

where \(R_\textrm{L}\) and \(R_\textrm{U}\) represent the lower and upper fence skewness adjustment factors, defined as follows:

$$\begin{aligned} R_\textrm{L}=\frac{1-B_{c}}{1+B_{c}}\,\,\,\,\,\text {and}\,\,\,\,\,R_\textrm{U}=\frac{1+B_{c}}{1-B_{c}} \end{aligned}$$
(8)

In scenarios where data arrives in streams, online learning algorithms are particularly useful for processing data incrementally. Traditional outlier detection methods often require the entire dataset upfront, which is not feasible or efficient when dealing with constantly evolving data [26].

Incremental processing is a key concept in online learning algorithms, where the algorithms update their internal state and make decisions based on each new data point, enabling efficient real-time processing and decision-making. Online learning algorithms play a crucial role in outlier detection by continuously analyzing the incoming data stream, identifying outliers based on evolving patterns and statistical characteristics, and adapting to changes in the data distribution over time. These methods provide timely alerts for unusual data points, enhancing the accuracy and reliability of data preprocessing and statistical analysis.

There are various online learning outlier detection algorithms available, and one library that offers a range of such algorithms is River [27]. River provides a collection of online machine learning algorithms, including several outlier detection methods, like half-space tree, local outlier factor, and one-class SVM. These algorithms utilize different techniques to identify outliers in an online setting and have been widely used in various domains.
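As a usage illustration, a half-space trees detector from River can be run over a stream in a few lines. This is a minimal sketch; parameter names and defaults may differ slightly across River versions, and half-space trees typically expect feature values scaled to the unit interval unless explicit limits are supplied.

```python
from river import anomaly

# A small half-space trees ensemble; hyperparameters are illustrative.
model = anomaly.HalfSpaceTrees(n_trees=10, height=8, window_size=250, seed=42)

data_stream = [0.50, 0.45, 0.43, 0.44, 0.95]  # toy stream; replace with real data
for v in data_stream:
    x = {"value": v}
    score = model.score_one(x)  # higher score = more anomalous
    model.learn_one(x)          # then update the model with the new point
```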

In addition to specific outlier detection algorithms, statistical online methods provide alternative approaches for detecting outliers in streaming data. These methods take advantage of incremental formulas for mean and standard deviation, allowing for efficient updates to these values without the need to store the entire dataset. By utilizing these incremental computations, outlier detection can be achieved in an efficient manner as new data points are received. Notable examples of such methods include the Cantelli inequality approach and the Chebyshev inequality approach. These methods leverage statistical bounds provided by the Cantelli inequality and the Chebyshev inequality, respectively, to identify outliers in real time. Applying these statistical bounds enables the assessment of the likelihood of data points deviating from the expected range, facilitating the detection of outliers as the data stream evolves [28,29,30].
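A minimal sketch of such a statistical online detector is given below: Welford's algorithm maintains the mean and variance incrementally, and Chebyshev's bound \(\mathbb {P}(|X - \mu | \ge k\sigma ) \le 1/k^2\) flags points beyond \(k\) standard deviations. The class and its defaults are illustrative choices, not a specific published implementation; a one-sided Cantelli variant would bound each tail separately.

```python
import math

class ChebyshevDetector:
    """Online detector sketch: Welford's incremental mean/variance
    combined with Chebyshev's bound P(|X - mean| >= k*std) <= 1/k^2."""

    def __init__(self, k=4.0):
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update_and_score(self, x):
        # Score the new point against the statistics seen so far
        is_outlier = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_outlier = std > 0 and abs(x - self.mean) >= self.k * std
        # Welford's update: no past observations need to be stored
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier
```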

In contrast to the mean and standard deviation, which are susceptible to the influence of extreme outliers, the robust measures of median and quartiles offer a valuable alternative for outlier detection. These measures demonstrate reduced sensitivity to extreme values, ensuring a more accurate depiction of the data’s central tendency and dispersion. However, calculating the median incrementally presents a challenge as there is no closed-form solution available.

To overcome this hurdle, approximate incremental computation methods enable reliable estimation of the median and its associated quartiles in online scenarios. Motivated by the idea of approximation techniques, this paper proposes a novel histogram-based algorithm for quartile estimation and subsequently introduces an innovative incremental boxplot approach for outlier detection. This methodology enhances the effectiveness of outlier detection in streaming data by providing a reasonable estimation of the median’s value and overcoming the limitations imposed by extreme outliers.

3 Online boxplot algorithm

In this section, we introduce an innovative online boxplot approach for outlier detection, combining the adjusted boxplot and Bowley’s coefficient or quartile skewness. This coefficient serves as a reliable measure to capture the asymmetry of a univariate continuous distribution. To enable online outlier detection, the proposed approach incorporates a histogram-based algorithm that adjusts the whiskers of the boxplot as new data points are added.

The online algorithm dynamically updates quartile estimates with each incoming data point, ensuring consistency and accuracy. Moreover, the approach incorporates updating the skewness measure and adjusting the whiskers along with quartile modifications. This enhancement significantly improves the reliability of the method, enabling it to effectively handle both symmetric and asymmetric distributions.

3.1 Online quartile estimation algorithm

The proposed online quartile estimation algorithm described in this paper comprises two distinct steps. The first step operates in batch mode, enabling the algorithm to process a buffer of collected data. This initialization step involves the computation of a histogram, which divides the range of the data into evenly spaced intervals or bins.

The number of bins serves as a tunable parameter in the algorithm, accommodating the data distribution and the desired accuracy of the quartile estimates. By adjusting the number of bins, the algorithm can effectively capture the nuances and variations in the dataset, leading to more precise quartile estimations. As a result, the selection of this parameter is crucial in ensuring accurate and reliable quartile calculations. To streamline the process and enhance its practicality, future iterations of the algorithm will automate the determination of the number of bins, removing the burden of manual specification and improving its applicability across diverse datasets.

Once the histogram is computed, the data’s cumulative distribution function (CDF) can be estimated. The CDF provides the fraction of data points up to and including the upper bound of each bin, and its values are used to estimate the quartiles of the data. The first quartile (\(Q_{1}\)) is taken as the mean of the boundaries of the first bin whose CDF value is greater than or equal to 0.25. Similarly, the second quartile (\(Q_{2}\)) corresponds to the first bin whose CDF value reaches 0.5, and the third quartile (\(Q_{3}\)) to the first bin whose CDF value reaches 0.75.

It is worth noting that the length of the data buffer is not of critical importance as long as it accurately represents the entire dataset. Nevertheless, utilizing a highly representative buffer can enhance the algorithm’s speed. The quartile estimates obtained during the initialization step are continually updated, allowing for real-time monitoring of the data stream. The initialization step plays a pivotal role in the algorithm. Algorithm 1 illustrates this initial stage, where the functions \(\mathrm {bin\_upper}(i)\) and \(\mathrm {bin\_lower}(i)\) provide the upper and lower bounds of the i-th bin, respectively.

Algorithm 1: Initializing quartiles using histogram and CDF
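A minimal Python sketch of this initialization step might look as follows; the NumPy-based histogram and the function names are our own illustrative choices, not the authors' code.

```python
import numpy as np

def quartiles_from_cdf(counts, edges):
    """Estimate (Q1, Q2, Q3) from a histogram: each quartile is the mean of
    the boundaries of the first bin whose CDF value reaches the target level."""
    cdf = np.cumsum(counts) / np.sum(counts)
    qs = []
    for level in (0.25, 0.50, 0.75):
        i = int(np.searchsorted(cdf, level))      # first bin with CDF >= level
        qs.append((edges[i] + edges[i + 1]) / 2.0)
    return tuple(qs)

def init_quartiles(buffer, n_bins=150):
    """Offline initialization (a sketch of Algorithm 1): build an evenly
    spaced histogram over the buffer, then read the quartiles off its CDF."""
    counts, edges = np.histogram(buffer, bins=n_bins)
    return list(counts), list(edges), quartiles_from_cdf(counts, edges)
```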

The second step of the quartile estimation algorithm is performed in online mode, where the algorithm continuously updates the histogram and quartiles as new data points arrive. Whenever a new data point arrives, the algorithm checks if it falls within any of the predefined intervals or bins. If it does, the corresponding counter for that bin is incremented. If the new data point falls outside the predefined bins, new bins are created, and the new data point is counted.

To prevent excessive memory usage, the algorithm defines a maximum number of bins as a hyperparameter, and if the number of bins exceeds this value, the algorithm aggregates bins in pairs starting from the left. The updated histogram is then used to calculate the quartiles using the cumulative distribution function, just like in the offline step.

The proposed online quartile estimation algorithm is efficient and can accurately estimate quartiles with each new data point, even with limited memory resources. Algorithm 2 represents this step, where the functions \(\mathrm {bin\_lower}(0)\) and \(\mathrm {bin\_upper}(-1)\) return the lower bound of the first bin and the upper bound of the last bin, respectively.

Algorithm 2: Online estimation of quartiles using histogram and CDF
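The online step can be sketched as follows, again with illustrative names; after each update, the quartiles are re-read from the CDF exactly as in the offline step (e.g., with quartiles_from_cdf above).

```python
import numpy as np

def update_histogram(x, counts, edges, max_bins=1000):
    """A sketch of Algorithm 2: count x in its bin, creating new equal-width
    bins when x falls outside [bin_lower(0), bin_upper(-1)], and aggregating
    bins in pairs from the left when max_bins is exceeded."""
    counts, edges = list(counts), list(edges)
    while x < edges[0]:                       # x below bin_lower(0): extend left
        edges.insert(0, edges[0] - (edges[1] - edges[0]))
        counts.insert(0, 0)
    while x >= edges[-1]:                     # x above bin_upper(-1): extend right
        edges.append(edges[-1] + (edges[-1] - edges[-2]))
        counts.append(0)
    i = int(np.searchsorted(edges, x, side="right")) - 1
    counts[i] += 1                            # increment the matching bin
    while len(counts) > max_bins:             # cap memory: merge pairs from the left
        counts = [sum(counts[j:j + 2]) for j in range(0, len(counts), 2)]
        edges = edges[::2] if len(edges) % 2 == 1 else edges[::2] + [edges[-1]]
    return counts, edges
```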

3.2 Quartile skewness measure

Skewness is a fundamental statistical concept used to measure the degree of asymmetry observed in a probability distribution. It provides information about the shape of a distribution, as distributions can exhibit varying degrees of right (positive) skewness or left (negative) skewness. A normal distribution, also known as a bell curve, is a symmetric distribution with zero skewness.

However, real-world data often exhibit non-normal distributions, and understanding the nature of the distribution is crucial for accurate statistical analysis. Therefore, skewness is an important tool for describing the distribution of data, and it is used to quantify the degree to which the distribution deviates from symmetry. In this regard, measuring skewness is important in order to properly understand and interpret the data, as well as to apply appropriate statistical techniques.

Furthermore, many statistical algorithms are designed based on the assumption of normality, and they may not perform well when the data is not normally distributed. Therefore, incorporating measures of skewness into statistical analysis is critical to ensure that the algorithm is robust and accurate, particularly in the presence of outliers or other non-normal features in the data. One such measure is quartile skewness, also known as Bowley’s coefficient [25], which is a robust measure of skewness that is resistant to outliers and can accurately reflect the asymmetry of a univariate continuous distribution [31].

The quartile skewness measure, or Bowley’s coefficient, is a robust and efficient method for quantifying the asymmetry of a univariate continuous distribution. According to Eq. 6, this measure is calculated by taking the difference between the lengths of the upper and lower semi-interquartile ranges, normalized by the length of the interquartile range. In this study, we utilize the abbreviation QSM to denote the quartile skewness measure, as defined by:

$$\begin{aligned} \text {QSM} = \frac{(Q_{3} - Q_{2}) - (Q_{2} - Q_{1})}{\text {IQR}}. \end{aligned}$$
(9)

This measure is not only robust against outliers but can also detect the degree and direction of skewness in the distribution. It remains effective as new data points arrive and the quartiles are updated through the online quartile estimation algorithm discussed in Sect. 3.1. The measure can therefore capture changes in the distribution, including shifts and changes in skewness, as the data evolves over time, making it a powerful tool for monitoring and analyzing data streams that exhibit non-normal distributions with varying degrees of skewness.

Although the quartile skewness measure is a robust and efficient way to reflect the asymmetry of a univariate continuous distribution and can resist up to \(25\%\) of outliers, it has one disadvantage: it cannot measure very small skewness in the data distribution [7]. This is because the quartile skewness measure is based on the difference between the upper and lower quartiles, which is relatively insensitive to minor departures from symmetry. In practice, however, this is not a major issue, as the quartile skewness measure is designed to detect significant departures from symmetry, which are often the most relevant for practical applications.

3.3 Online boxplot

The proposed online boxplot combines the adjusted boxplot methodology with quartile skewness calculation, resulting in an advanced visualization tool. This integration enables the acquisition of a more comprehensive representation of data distribution and skewness, thereby facilitating its effective utilization in an online environment. This innovative approach allows for real-time updates and dynamic analysis, making it particularly useful for monitoring and exploring data streams or continuously evolving datasets.

The adjusted boxplot approach is a statistical method that is employed to identify outliers in skewed data distributions. It was initially introduced by Hubert and Vandervieren [7]. This method extends the standard boxplot methodology by incorporating the medcouple skewness measure. It provides a visual representation of the dataset’s distribution and allows for the identification of potential outliers based on quartiles and the interquartile range. The adjusted boxplot method is known for its greater robustness and effectiveness compared to the original boxplot method, especially in scenarios where the distribution is skewed or contains outliers.

In their work, Hubert and Vandervieren [7] state that the medcouple skewness measure possesses the robustness of the quartile skewness measure. Therefore, in the proposed online approach, the medcouple measure is replaced with the quartile skewness measure defined by Eq. 9 to quantify the degree of skewness in the data distribution. This method defines the whiskers differently from the original boxplot method, using an exponential formula based on the interquartile range and the skewness measure. The whiskers in this online boxplot method are defined as follows:

$$\begin{aligned}&\text {if } \text {QSM} \ge 0:\\&\quad \text {Lower Bound} = Q_1 - w\exp (a\,\text {QSM})\,\textrm{IQR},\\&\quad \text {Upper Bound} = Q_3 + w\exp (b\,\text {QSM})\,\textrm{IQR}\\&\text {if } \text {QSM} < 0:\\&\quad \text {Lower Bound} = Q_1 - w\exp (-b\,\text {QSM})\,\textrm{IQR},\\&\quad \text {Upper Bound} = Q_3 + w\exp (-a\,\text {QSM})\,\textrm{IQR} \end{aligned}$$
(10)

where w, a, and b are constants that control the length of the whiskers. The value of w can be set to 1.5 or 3, as in the original boxplot method, while the values of a and b are chosen based on the degree of skewness in the data distribution. The selection process for these parameters will be further elaborated in the upcoming section.
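Putting Eqs. 9 and 10 together, the whiskers can be computed from the three quartile estimates in a few lines. This sketch uses the defaults \(a=-4\) and \(b=3\) from the batch adjusted boxplot; Sect. 4.3 describes how these values can be re-optimized. The function name is illustrative.

```python
import math

def adjusted_whiskers(q1, q2, q3, w=1.5, a=-4.0, b=3.0):
    """Whiskers of the online boxplot per Eq. 10, using the quartile
    skewness measure (QSM, Eq. 9) computed from the quartile estimates."""
    iqr = q3 - q1
    qsm = ((q3 - q2) - (q2 - q1)) / iqr if iqr > 0 else 0.0
    if qsm >= 0:
        lower = q1 - w * math.exp(a * qsm) * iqr
        upper = q3 + w * math.exp(b * qsm) * iqr
    else:
        lower = q1 - w * math.exp(-b * qsm) * iqr
        upper = q3 + w * math.exp(-a * qsm) * iqr
    return lower, upper
```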

The exponential formula used to define the whiskers in the online boxplot method provides a more flexible and adaptive approach to identifying outliers compared to the standard boxplot method. By adjusting the length of the whiskers based on the degree of skewness in the data distribution, this formula effectively handles extreme values and asymmetrical distributions.

In summary, the online boxplot inherits the robustness of the adjusted boxplot method and proves to be an effective statistical technique for identifying outliers incrementally in skewed data distributions. It utilizes the quartile skewness measure to quantify the degree of skewness in the data distribution. Additionally, the whiskers are defined using an exponential formula that better accommodates extreme values and asymmetrical distributions compared to the standard boxplot method. By incorporating this approach into our algorithm, we can more accurately detect and handle outliers in data streams.

3.4 Integration of approaches

The proposed approach for online outlier detection on univariate datasets integrates the online quartile estimation algorithm, online quartile skewness measure, and online boxplot approach. The online quartile estimation algorithm is used to estimate the quartile values incrementally, and the quartile skewness measure is used to quantify the degree of asymmetry in the distribution. The boxplot approach is then applied to identify outliers.

In the online quartile estimation algorithm, the quartile values are estimated using a cumulative distribution function based on the histogram of the data. The algorithm operates in two steps, with the first step performed offline to initialize the quartile values (Algorithm 1), and the second step performed in online mode to update the values as new data points arrive (Algorithm 2). This approach results in more accurate quartile estimates and enables the algorithm to operate with limited memory resources.

The quartile skewness measure is then used to quantify the degree of asymmetry in the distribution (Eq. 9). The quartile skewness measure is updated by incorporating new data points and adjusting the distribution’s quartiles accordingly. This measure is resistant to outliers and can accurately reflect the asymmetry of a univariate continuous distribution.

The exponential formula of the adjusted boxplot approach is then used to identify outliers based on the online quartile values and the online quartile skewness measure. The approach involves constructing a boxplot with adjusted whiskers, defined by Eq. 10, where the quartile skewness measure is used to adjust the whiskers to account for skewness in the distribution, which improves the accuracy of outlier detection. Data points beyond the whiskers are identified as potential outliers.

Since the algorithm aims to detect outliers incrementally while conserving memory, it selectively excludes highly extreme data points from the histogram update, aligning with the principles of Chebyshev’s inequality. This approach not only conserves memory, but also helps address another concern of losing precision due to excessive bin aggregation.

Theorem 1

(Chebyshev’s inequality) Let X be a random variable with finite mean \(\mu \) and finite nonzero standard deviation \(\sigma \). For any positive constant \(k > 1\), the probability that the absolute difference between X and its mean is greater than or equal to k times the standard deviation (i.e., \(|X - \mu | \ge k\sigma \)) is less than or equal to \(1/k^2\) [32].

$$\begin{aligned} \mathbb {P}(|X - \mu | \ge k\sigma ) \le \frac{1}{k^2} \end{aligned}$$
(11)

According to Chebyshev’s inequality, we selected a sufficiently large value for the parameter k, ensuring that data points lying beyond k standard deviations from the mean have a very low probability of belonging to the same distribution. In our study, we selected a probability lower than \(10^{-3}\%\). Interestingly, we discovered that the algorithm is not overly sensitive to this parameter. By employing this threshold, we can classify points outside this range as outliers without updating the histogram or the running statistics. This successfully addresses concerns related to excessive bin construction and the loss of precision caused by aggregation functions.

The proposed approach can be used in real time to detect outliers accurately and efficiently by updating the quartiles and skewness measures with each new data point. The approach is particularly useful for datasets that exhibit non-normal distributions and outliers, as it incorporates robust measures of skewness and is resistant to outliers. Overall, the approach provides a reliable method for outlier detection in univariate datasets while utilizing limited memory with minimal hyperparameter tuning.
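The following sketch ties the pieces together, reusing the illustrative helpers from the earlier sketches (init_quartiles, quartiles_from_cdf, update_histogram, and adjusted_whiskers). It is an outline of the integration under our naming assumptions, not the authors' exact implementation.

```python
import math

class OnlineBoxplot:
    """Integrated online outlier detector (sketch): histogram-based
    quartiles, QSM-adjusted whiskers (Eq. 10), and a Chebyshev gate
    that keeps extreme points out of the histogram update."""

    def __init__(self, buffer, n_bins=150, w=3.0, k=55.0, max_bins=1000):
        self.w, self.k, self.max_bins = w, k, max_bins
        self.counts, self.edges, (q1, q2, q3) = init_quartiles(buffer, n_bins)
        self.lower, self.upper = adjusted_whiskers(q1, q2, q3, w=w)
        # Running moments for the Chebyshev gate, seeded from the buffer
        self.n = len(buffer)
        self.mean = sum(buffer) / self.n
        self.m2 = sum((v - self.mean) ** 2 for v in buffer)

    def process(self, x):
        std = math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
        # Chebyshev gate: flag extreme points without touching the histogram
        if std > 0 and abs(x - self.mean) >= self.k * std:
            return True
        is_outlier = not (self.lower <= x <= self.upper)
        self.counts, self.edges = update_histogram(
            x, self.counts, self.edges, self.max_bins)
        q1, q2, q3 = quartiles_from_cdf(self.counts, self.edges)
        self.lower, self.upper = adjusted_whiskers(q1, q2, q3, w=self.w)
        # Welford update of the running mean and variance
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier
```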

4 Experiments

This section of the paper provides a comprehensive description of the employed experimental framework. It introduces the datasets used, outlines the evaluation approach, and then provides details of the experimental setup. This section establishes the fundamental groundwork for the subsequent analyses and results.

4.1 Dataset

In this subsection, we present a thorough introduction to the datasets utilized in this study, offering comprehensive insights into their composition, characteristics, and sources.

The dataset collection for this study encompasses simulation datasets with distinct skewness distributions, namely normal, highly right-skewed, and highly left-skewed distributions. This deliberate variation in skewness enables us to evaluate the algorithm’s ability to effectively consider the inherent distributional properties during the outlier detection process. It is important to note that conventional versions of the boxplot often struggle with skewed distributions, incorrectly identifying a substantial portion of the data as outliers. In response, our algorithm seeks to overcome this limitation by considering the specific characteristics of the distribution, resulting in enhanced accuracy in outlier detection. By incorporating datasets with diverse skewness distributions, we can thoroughly assess the algorithm’s performance in accurately identifying outliers across different types of distributions.

In addition to the simulation datasets, we integrate a real-world dataset. This dataset, which was obtained from an IT company, measures the hardware resources used by cloud infrastructure. The metrics offer quantifiable measurements of various hardware performance aspects, providing valuable insights into the operating environment. With a primary emphasis on outlier detection, our algorithm is applied to more than 200 unique time series. These time series encompass different facets of hardware performance, enabling a comprehensive evaluation of the algorithm’s effectiveness in practical scenarios.

It is noteworthy that the algorithm aligns successfully with the objectives and requirements of the IT company aimed at detecting faults in the system. To demonstrate the algorithm’s performance in an online setting and its adaptability to diverse distributions, we specifically select several time series from the private dataset that exhibit varying skewness characteristics. By incorporating these time series, which represent different distributional patterns, we aim to showcase the algorithm’s ability to accurately detect outliers while providing justifications for its performance across varying distribution types.

4.2 Evaluation

This section provides a thorough analysis of our algorithm’s performance, with particular emphasis on two critical elements: the evaluation procedure and the metrics utilized for assessment.

To evaluate the algorithm’s performance on the dataset, we conducted a comparative analysis between two scenarios: the online boxplot algorithm and the batch scenario. The online algorithm involves computing approximated quartiles, constructing a boxplot that considers skewness, and handling data incrementally. On the other hand, the batch scenario assumes access to the entire dataset. This comparison aims to assess the algorithm’s effectiveness in handling incremental data and its ability to approximate results achieved when all data are accessible.

In the batch scenario, memory usage is a significant consideration as the algorithm requires memory capacity equal to the size of the dataset for comprehensive processing and analysis. In contrast, the novel online algorithm operates with fixed memory requirements, which can be adjusted as a hyperparameter. This capability of fixed memory usage enables the online algorithm to efficiently handle large datasets, making it more scalable and adaptable to real-time applications.

Precision serves as the primary evaluation metric for outlier detection. By comparing the precision achieved by the online algorithm to that of the batch scenario, we can assess the algorithm’s ability to approximate the results obtained when all data are available. To provide a comprehensive evaluation, we also incorporate recall and F1 metrics. Recall measures the ability of the algorithm to identify all relevant instances of outliers, while the F1 score combines both precision and recall into a single metric, giving a balanced assessment of the algorithm’s performance. This comparison not only offers insights into the effectiveness of the online algorithm in identifying outliers, but also accounts for its performance with limited information available at each step of the analysis and with restricted memory usage. These metrics are calculated using the following equations:

$$\begin{aligned}&\textrm{Precision} = \frac{\text {True Positives}}{\text {True Positives} + \text {False Positives}}\\&\textrm{Recall} = \frac{\text {True Positives}}{\text {True Positives} + \text {False Negatives}}\\&F1 = 2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(12)
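In practice, these metrics can be computed directly from the two label sequences, treating the batch results as the reference. The snippet below uses scikit-learn, which is an implementation choice of ours; the labels shown are toy values.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy example: 1 = outlier, 0 = normal point.
batch_labels = [0, 0, 1, 1, 0, 1]    # reference: batch boxplot with full data
online_labels = [0, 1, 1, 1, 0, 1]   # flags produced by the online algorithm

precision = precision_score(batch_labels, online_labels)
recall = recall_score(batch_labels, online_labels)
f1 = f1_score(batch_labels, online_labels)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```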

4.3 Experimental setup

In this subsection, we present the parameters and details of the algorithm used in our experiment. According to Sect. 3, the algorithm comprises two steps: an offline phase and an online phase. During the offline phase, it is crucial to define the length of the buffer used to initialize the statistics; the buffer should represent the entire dataset effectively. Note that if the number of outliers in the buffer exceeds a certain threshold, the algorithm will take those points into account when determining the distribution of the dataset.

The subsequent parameter to consider is the number of bins employed during the initialization phase to construct the initial histogram. As outlined in Sect. 3, we sought to incorporate the data’s variability when selecting a value for this parameter. To streamline the process in future investigations, we intend to automate the determination of this parameter. For the simulation experiment, we set the bin width to the standard deviation of the buffer from the offline phase, while for the real datasets, we opted for 150 bins to ensure a satisfactory level of precision.

The definition of whiskers in the boxplot includes a parameter denoted as w (Eq. 10), which was set to 3 in our experiments with the aim of focusing on extreme outliers. This value plays a crucial role in determining the range within which outliers are identified.

Moreover, the online boxplot approach incorporates two additional parameters, namely a and b. To determine suitable values for these parameters, we employed an optimization process in our experiments. The objective was to minimize the occurrence of false positives compared to the batch definition of the algorithm within a limited time frame. For this optimization, we used a portion of the IT company data twice the length of the initialization buffer, made available all at once.

To enhance the efficiency of the outlier detection process, we incorporated the Chebyshev inequality into the algorithm. This inequality allows us to disregard extremely high values, for which the probability of following the same distribution is less than \(\frac{1}{k^2}\). For our experiment, we set the value of k to 55, reducing false positives while maintaining computational speed. Notably, to analyze this parameter’s impact, we conducted experiments with three different values of k, which demonstrated the algorithm’s robustness to variations in this parameter.

In terms of the experiment itself, we set the maximum number of bins to 1000. However, this value can be adjusted depending on the specific requirements of the analysis. It is important to consider the trade-off between precision and memory efficiency when choosing a lower value, as lower values result in more aggregated bins and reduced precision.

5 Result

In the following section, we present the results of applying our novel online outlier detection algorithm to various datasets and provide an in-depth analysis of the outcomes.

5.1 Simulation study

For the simulation study, we chose three datasets representing different skewness distributions: normal, highly right-skewed, and highly left-skewed. This deliberate variation in skewness allows us to assess the algorithm’s ability to accurately identify the distribution type and consider its inherent properties during the outlier detection process.

The first dataset exhibits a highly right-skewed distribution, generated using a Gamma distribution with a shape parameter (\(\alpha = 0.3\)) and an inverse scale parameter (\(\beta = 0.1\)). By calculating the medcouple skewness, a robust and sensitive measure, we find that this dataset has a skewness value of approximately 0.7. While the traditional Tukey boxplot flags a substantial portion of the data points (about 11.2%) as outliers, our new algorithm only detects the points at the far end of the right tail. Figure 1 illustrates the dataset’s histogram, where the black points represent the detected outliers.

Furthermore, we evaluated the algorithm’s performance by applying it in batch mode, where the whiskers are calculated using the entire data available up to that point, serving as a baseline. As shown in Fig. 2, both approaches yield nearly identical results, with precision, recall, and F1 values of 0.89, 1, and 0.94, respectively (see Table 1).

This finding validates the algorithm’s capability to identify distribution skewness and detect outliers, with only a slight variation between approximate statistical calculations and utilizing the entire dataset. Remarkably, the algorithm identifies a considerably smaller number of normal data points as outliers when compared to the traditional boxplot method: approximately 0.2% versus 11.2%, respectively.

Fig. 1: Histogram displaying the highly right-skewed distribution, represented as \(X \sim \Gamma (0.3,0.1)\), along with the outliers detected by the novel online algorithm (black points)

Fig. 2: Illustration of data points following a highly right-skewed distribution, represented as \(X \sim \Gamma (0.3,0.1)\). Outliers were identified using the online algorithm (black points) compared to the algorithm applied with access to the entire dataset (hollow points)

Table 1 Evaluation of algorithm performance under highly right-skewed and highly left-skewed distributions with varying parameter k in Chebyshev’s inequality

For the second dataset, we generated 10,000 data points from a normal distribution with a mean of zero (\(\mu = 0\)) and a standard deviation of one (\(\sigma = 1\)). We anticipate that the algorithm will align with the traditional Tukey’s boxplot, resulting in almost no points being identified as outliers (see Figs. 3 and 4). This expectation is based on the characteristic behavior of a normal distribution, where the majority of data points cluster around the mean and exhibit symmetric distribution.

Fig. 3: Histogram displaying a normal distribution, represented as \(X \sim \mathcal {N}(0,1)\). As anticipated, no outliers were detected using the algorithm

Fig. 4: Illustration of a normal distribution, represented as \(X \sim \mathcal {N}(0,1)\), where no point is detected as an outlier applying the online algorithm

The final dataset showcases a left-skewed distribution, generated by the Gamma distribution \(X \sim -\Gamma (0.4,0.2)\). This dataset exhibits a medcouple skewness value of approximately \(-0.6\). Unlike the traditional Tukey boxplot, which identifies approximately 9.3% of the data points as outliers, our novel algorithm solely detects the points located at the far end of the left tail, amounting to nearly 0.2%. Figure 5 provides a visualization of the dataset’s histogram, with black points denoting the detected outliers.

To evaluate the algorithm’s performance, we conducted the batch mode analysis where the whiskers were calculated using the entire available data up to that point, establishing a baseline. Figure 6 demonstrates that both approaches yield very similar outcomes, resulting in precision, recall, and F1 scores of 0.81, 0.94, and 0.87, respectively (see Table 1).

These results confirm the algorithm’s ability to discern distribution skewness and effectively detect outliers, with minimal disparity between the approximate statistical calculations and utilizing the entire dataset. Notably, the algorithm identifies fewer normal data points as outliers than the traditional boxplot method.

Fig. 5: Histogram displaying the highly left-skewed distribution, represented as \(X \sim -\Gamma (0.4,0.2)\), along with the outliers detected by the algorithm (black points)

Fig. 6: Illustration of data points following a highly left-skewed distribution, represented as \(X \sim -\Gamma (0.4,0.2)\). Outliers were identified using the online algorithm (black points) compared to the algorithm applied with access to the entire dataset (hollow points)

5.2 Hardware resources data

In this section, we provide a detailed analysis of the algorithm’s performance by applying it to a real-world dataset acquired from an IT company. The experimental dataset comprises four distinct metrics: Cache bytes, CPU utilization, State of Sysmon service, and Average disk read queue length. These metrics allow us to gain insights into different aspects of the IT system’s performance and resource utilization.

The Cache bytes metric provides valuable information about the utilization of cache memory, which plays a crucial role in enhancing data access speed and reducing latency.

The CPU utilization metric gives us an understanding of the workload imposed on the central processing unit, enabling us to identify potential bottlenecks or areas where system resources may be underutilized.

The State of Sysmon service metric evaluates the operational status and functionality of the Sysmon service, which plays a critical role in monitoring and logging system events for security and analysis purposes.

Lastly, the average disk read queue length metric provides insights into the efficiency of data retrieval from storage devices, giving us an indication of potential performance issues related to disk read operations.

By analyzing these metrics we aim to provide a comprehensive evaluation of the algorithm’s effectiveness in detecting anomalies and identifying potential issues within an IT infrastructure.

Fig. 7: Results of the online boxplot algorithm applied to the Cache bytes metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

Fig. 8: Results of the online boxplot algorithm applied to the CPU utilization metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

Fig. 9: Results of the online boxplot algorithm applied to the state of service metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

Fig. 10: Results of the online boxplot algorithm applied to the average disk read queue length metric. Outliers detected by the online and batch scenarios of the proposed algorithm are visually depicted as black solid points and hollow points, respectively. The dynamically updated whiskers, denoted by dotted lines, are characteristic of the online scenario. In contrast, the dashed lines represent the whiskers calculated when access to the entire dataset is available at the time of analysis

The algorithm’s efficacy in outlier detection is exemplified by Figs. 7, 8, 9 and 10, where it achieves high performance when compared to the batch scenario used as a baseline. Also, Fig. 11 illustrates the results of the online boxplot algorithm applied to the Cache bytes metric with varying values of the k parameter. This figure demonstrates the algorithm’s adaptability and robustness to parameter changes, ensuring reliable outlier detection across diverse datasets.

Moreover, Table 2 provides further insight into the algorithm’s performance, showcasing its precision, recall, and F1 scores across different IT metrics under varying k values. This exceptional precision underscores the algorithm’s capability to calculate approximate statistics incrementally, even with limited data access and a predefined memory constraint of 1000 bins. This attribute highlights a significant strength of the novel online boxplot algorithm.

Fig. 11: Results of the online boxplot algorithm applied to the Cache bytes metric with different values of the k parameter. The solid red line represents the batch boundary, while the other dashed lines demonstrate the dynamically updated whiskers for three values of the k parameter

Table 2 Evaluation of algorithm performance for the real-world dataset with varying parameter k in Chebyshev’s inequality

6 Discussion and conclusion

In conclusion, outlier detection plays a vital role in various contexts for identifying anomalous or exceptional events. However, the challenges posed by big data, including the massive volume of data and limited computational resources, call for innovative approaches. This paper has presented an incremental/online version of the boxplot algorithm as a solution.

By employing an approximation approach based on the numerical integration of the histogram and the calculation of the cumulative distribution function, the proposed algorithm has demonstrated its effectiveness. It considers the dataset’s distribution to define the whiskers, leading to exceptional outlier detection results even in the presence of skewed distributions. Moreover, the algorithm leverages robust measures like quartiles and median, surpassing the limitations of unsupervised outlier detection techniques.

A notable contribution of this research is the introduction of a histogram-based approach for online computation of these measures. This ensures accurate estimation and reliable outlier detection in real time. By addressing the challenges faced by traditional batch mode methods, the proposed algorithm opens doors for more efficient and scalable outlier detection in the era of big data.

In the evaluation phase, the algorithm underwent rigorous testing on both simulated datasets with varying degrees of skewness and a real-world dataset for software fault detection. These evaluations showcased the algorithm’s robustness and effectiveness in outlier detection. Furthermore, the algorithm’s computational efficiency was evident as it maintained a constant memory footprint, making it suitable for processing large datasets incrementally. Key evaluation metrics, including memory usage and precision, emphasized the algorithm’s suitability for real-time applications where immediate access to the entire dataset is not feasible.

In addition to its robust performance and efficiency, our proposed incremental/online boxplot algorithm possesses several notable strengths. It stands out for its ability to handle skewed distributions effectively, a common challenge in outlier detection. By leveraging a histogram-based approach for the online computation of measures such as quartiles and median, the algorithm ensures accurate estimation and reliable outlier detection in real time. Moreover, its simplicity and ease of interpretation make it accessible to users across different domains, regardless of their expertise in outlier detection techniques. These strengths collectively contribute to the algorithm’s versatility and applicability in various real-world scenarios, marking it as a valuable tool in the outlier detection toolkit.

While our proposed incremental/online boxplot algorithm demonstrates promising results in outlier detection, it is essential to acknowledge its limitations and areas for further improvement. One crucial aspect for future research is to apply the algorithm to datasets from various industries to showcase its adaptability and identify potential shortcomings. By subjecting the algorithm to different datasets, we can gain insights into its performance under diverse conditions and refine it accordingly.

Additionally, future work should focus on full automation, enabling the algorithm to automatically adjust its parameters to adapt to dynamic data streams. This automation not only enhances the algorithm’s usability, but also contributes to its scalability and applicability in real-world scenarios. Furthermore, there is a need to develop the algorithm for multivariate time series data and extend its capability to detect other types of events based on clear definitions. By addressing these areas, we can further enhance the algorithm’s effectiveness and broaden its utility across various domains.

In summary, this research introduces an incremental/online boxplot algorithm that effectively tackles the challenges of outlier detection in the realm of big data. The algorithm’s online outlier detection capability, consideration of distribution skewness, and efficient utilization of computational resources make it a valuable tool for various applications. As we explore future possibilities, automating the algorithm holds great potential for further enhancing its performance and usability in dynamic data environments.