Online Percentile Estimation (OPE)

Kilpi, Jorma; Kyntäjä, Timo; Räty, Tomi

doi:10.1007/s11265-021-01673-z

Open access
Published: 21 June 2021

Volume 93, pages 1085–1100, (2021)

Journal of Signal Processing Systems

Abstract

We describe a control loop over a sequential online algorithm. The control loop either computes or uses the sequential algorithm to estimate the temporal percentiles of any univariate input data sequence. It transforms an input sequence of numbers into an output sequence of histograms. This is performed without storing or sorting all of the observations. The algorithm continuously tests whether the input data is stationary, and reacts to events that do not appear to be stationary. It also indicates how to interpret the histograms since their information content varies according to whether a histogram was computed or estimated. It works with parameter-defined, fixed-size small memory and limited CPU power. It can be used for a statistical reduction of a numerical data stream and, therefore, applied to various Internet of Things applications, edge analytics or fog computing. The algorithm has a built-in feasibility metric called applicability. The applicability metric indicates whether the use of the algorithm is justified for a data source: it works for an arbitrary univariate numerical input, but it is statistically feasible only under some requirements, which are explicitly stated here. We also show the results of a performance study, which was executed on the algorithm with synthetic data.

1 Introduction

The Internet of Things (IoT) concept reflects the tens of billions of connected devices expected on the Internet in the near future. Many such devices produce measurement or monitoring data periodically or even continuously. The growth of data will be huge. This offers engineering challenges and business opportunities for the future. The opportunities, however, first require solutions to the challenges, especially to the management of the increased data caused by the number of networked sensors or measurement agents. Often, at least a portion of the data needs to be stored for later use while, simultaneously, supporting real-time situation awareness is equally important. The transmission of this big data from the network edge to the cloud servers for processing consumes bandwidth and adds latency [2]. Instead of performing all the computations in a cloud server, some computations can be done already in the proximity of the origins of the data. However, memory, CPU power and communication capacity can be limited where data is measured and/or collected. Occasionally, the observed historical data needs to be recovered with sufficient detail.

When a statistical event appears to be predictable, its information content can be assumed to be small and, therefore, it does not need many bits of information to be stored. A stationary state is such a condition in which the amount of stored data bytes per event can be reduced. Very rare events occur also in a stationary state. However, reacting to all rare events is compulsory, since it is not possible to know beforehand whether an event is significant or not. Afterwards, it may be possible to judge whether the rare event was significant or not, provided that the stored data support the judgement. Responding to a rare event is possible if the latest history of events is available. A rare event is construed as rare in relation to the history of occurred events within a designated period of time. We exploit a probability distribution to identify the rarity of the events, because the percentiles of the distribution provide several fast ways to distinguish a rare event. The process of specifying the probability distribution is informative in itself as it presents indications of both stationary and non-stationary events. The utilization of probability distributions also facilitates the comparison of current events to previously occurred events simply by comparing the different distributions.

In this paper, we provide the details of an algorithm, which attempts to perform the heuristic described above. It continuously tests whether the input data appear stationary and reacts to events that do not appear to be stationary. The algorithm transforms an input sequence of numbers into an output sequence of histograms. It also indicates how to interpret the histograms since their information content varies according to whether the histogram was computed by sorting a small amount of input data or estimated by assuming stationary input data.

The fields of smart grids and rural industrial systems serve as examples in which the proposed method can be applied. These systems produce massive amounts of data to be analysed, but sending it onward is not always cost effective. An example of this would be industrial big data related to the predictive maintenance of systems [22]. These sensing processes (e.g. metering of smart grid cable conditions, machine and/or tool vibrations) produce large amounts of data to be analysed to predict possible system faults. Our method can produce both a reduced amount of data and an indication of the changes within the measurements. Furthermore, this method can be integrated in 5G communication systems, in which Mobile Edge Computing (MEC) can be used as a service [1, 12].

This paper is structured as follows. In Section 2, we provide background to our solution, relate it to existing, similar research and further motivate our approach. In Section 3, we present our design assumptions, which are simultaneously requirements for our algorithm. Section 4 gives a top-level view of the algorithm and Section 5 its details. In Section 6, we describe the output of the algorithm. Sections 7 and 8 contain the statistical details. In Section 9, we present results of selected performance tests and finally the conclusions are drawn in Section 10.

2 Background and Related Work

Our control loop utilizes a sequential algorithm called the extended P²-algorithm. We will also discuss other alternatives later in this section. The P²-algorithm was first developed by Jain and Chlamtac [10] for the sequential estimation of a single percentile of a distribution without storing the observations, then extended by Raatikainen in [15] for the simultaneous estimation of several quantiles^{Footnote 1}, and further analyzed by Raatikainen in [17]. Raatikainen used the extended P²-algorithm as a part of his PhD thesis [16]. The publication [15] of Raatikainen contains the remarkably short pseudocode of the algorithm in such detail that its implementation is straightforward. Recently, the P²-algorithm was included in the comparison analysis of Tiwari and Pandey in [21].

The extended P²-algorithm is a very fast algorithm since it does not sort the input. However, there exist several computationally simpler algorithms which we will shortly discuss. The P²-algorithm approximates the inverse of the empirical cumulative distribution function (CDF) by utilizing a piecewise-parabola^{Footnote 2} to adjust the estimates of the percentiles. If the parabolic adjustment cannot be applied, then linear adjustment is used instead.
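As an illustration only, the parabolic adjustment and its linear fallback can be sketched as follows. The formulas are the usual ones associated with the P²-algorithm of [10]; they are not reproduced in this article, so their exact form here is an assumption, and the function names `parabolic`, `linear` and `adjust` are hypothetical. A marker at position `n` with height `q` is moved by `d` (+1 or −1) between its two neighbours.

```python
def parabolic(q_prev, q, q_next, n_prev, n, n_next, d):
    """Height predicted by the parabola through the three neighbouring markers
    after the middle marker position n is moved by d (+1 or -1)."""
    return q + d / (n_next - n_prev) * (
        (n - n_prev + d) * (q_next - q) / (n_next - n)
        + (n_next - n - d) * (q - q_prev) / (n - n_prev)
    )


def linear(q, q_toward, n, n_toward, d):
    """Linear fallback: interpolate towards the neighbour in direction d."""
    return q + d * (q_toward - q) / (n_toward - n)


def adjust(q_prev, q, q_next, n_prev, n, n_next, d):
    """Use the parabolic value if it keeps the heights strictly ordered,
    otherwise use the linear adjustment (assumed acceptance rule)."""
    candidate = parabolic(q_prev, q, q_next, n_prev, n, n_next, d)
    if q_prev < candidate < q_next:
        return candidate
    if d > 0:
        return linear(q, q_next, n, n_next, d)
    return linear(q, q_prev, n, n_prev, d)
```

When the three markers lie exactly on a parabola, for example heights 1, 9, 36 at positions 1, 3, 6, moving the middle marker to position 4 reproduces the parabola value.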

The extended P²-algorithm has some drawbacks:

1. It is suitable only for strictly stationary data sources, that is, for sources whose distribution really has percentiles which do not depend on the observation time [4, 5, 14]. If the data source is not strictly stationary, then the estimates need not have any meaningful statistical interpretation [21]. This reduces the applicability of the extended P²-algorithm.
2. The estimates of the percentiles near the median are more accurate, but the tail percentiles are less accurate. However, this is not a problem of the extended P²-algorithm alone, as the estimation of the extreme tail probabilities is in general very hard due to a lack of observations [9]. Our solution to this is simply to avoid the use of the extended P²-algorithm in the estimation of the extreme tail percentiles.
3. The extended P²-algorithm requires initialization data. As in [17], a good statistical representativeness of the initialization values is required. This may be hard to achieve in applications. It is worthwhile to use some computational resources to achieve better representativeness.
4. Initialization values must also be unequal, distinct values. This may cause problems in applications in which this cannot be guaranteed. Putting effort into the previous item (3) also helps with this issue.

At first sight, these restrictions may sound hard for useful applications. However, we will show that the area of application can be significantly extended to cover some (not all) of the data sources which are not strictly stationary. The data source is allowed to change from a stationary state to another stationary state and the number of stationary states is not restricted.

Statistical background knowledge includes the research of Heidelberger and Lewis [9], which examines the quantile estimation for dependent data and presents many practical methods, such as the Maximum Transformation and the Nested Group Quantiles. The work of Feldman and Tucker [7] gives insight into what can happen when we try to estimate a non-unique quantile, and Zieliński [24] shows that estimation problems can occur even if the quantile is unique.

Alternatives to the extended P²-algorithm are the various recursive algorithms based on the stochastic approximation method of [18]. These algorithms include the Stochastic Approximation (SA) by Tierney [20], the Exponentially Weighted Stochastic Approximation (EWSA) by Chen et al. [6], the Log Odds Ratio Algorithm (LORA) by Bakshi and Hoeflin [3] and the recent Dynamic Quantile Tracking using Range Estimation (DQTRE) of Tiwari and Pandey [21]. Yazidi and Hammer [8, 23] present recursive approaches based on stochastic learning. A common property of all the estimation algorithms is the need to obtain good initialization data.

The recursive algorithms EWSA, LORA and DQTRE adapt, to some extent, to some specific types of non-stationary data streams. However, the data source can be non-stationary in a myriad of different ways. If the distribution of data depends on the time of the observation, this dependency can be smooth or non-smooth. For example, EWSA was shown to adapt to a linear trend in [6], but in [21] DQTRE performed better in the simulated case of a sudden non-smooth change of the distribution. The estimation of time-dependent quantiles is conceptually difficult since, if the quantiles depend on time, then also their estimates depend on time. We are not aware of any general theory on the estimation of time-dependent quantiles.

Our contribution is to provide a control loop around the extended version of the P²-algorithm. The controlled extended P²-algorithm will be referred to as Online Percentile Estimation (OPE) and it consists of two separate parts: the extended P²-algorithm of [15] and our control loop. Our intention is to keep the control loop as computationally simple as possible. We emphasize the separation into two parts, because it is possible to replace the P²-algorithm with any of the recursive algorithms discussed above. For example, the authors of DQTRE [21] sketch the estimation of several quantiles simultaneously. The extended P²-algorithm has a built-in property to keep the estimates of the quantiles strictly monotone while DQTRE temporarily allows non-monotone quantile estimates.

The purpose of the control loop is to detect temporal behavior which is not strictly stationary and to avoid the use of the extended P²-algorithm in these situations. In cases which are not strictly stationary, the sample percentiles, obtained by sorting a relatively small sample of data, have the most reliable interpretation. For example, the sample median, denoted in this paper as x_(⌈0.5n⌉) in which n is the sample size, always has the interpretation that 50% of the sample values have been smaller than x_(⌈0.5n⌉) and 50% have been larger than x_(⌈0.5n⌉). This holds true because the sample is sorted, regardless of whether the data source was (strictly) stationary when the sample values were observed. In the non-stationary case, the data sample also represents the specific interval of time in which it was observed, and additional information is required for its correct statistical interpretation.

In the strictly stationary case the situation is the opposite: by definition, the data source has a time-invariant distribution. Then the uncertainty, which results from the estimation instead of the computation of percentiles, can be tolerated. In this paper, the term source percentile indicates the percentile of the distribution of the (endless) data source. The temporal existence of such source percentiles can be assumed whenever there is evidence that the assumption of the strictly stationary case holds.

The authors of the P²-algorithm [10] considered a hardware implementation (quantile chip) of their algorithm. Tiwari and Pandey [21] suggested a similar, for example FPGA or ASIC, implementation of their DQTRE algorithm. Our viewpoint is that all online quantile estimation algorithms benefit from a simple control loop which attempts to verify that the assumptions of the algorithms are not violated too heavily.

3 Assumptions and Requirements for OPE

For further reference, we present the assumptions that were made during the design process of OPE. These assumptions are also requirements for applying the algorithm.

1. It is assumed that there is random variability in the data.
2. It is assumed that the data source can be in some steady state for periods of time long enough that the stationary model can be applied. The number of different transient steady states is unlimited. Momentary periods of behavior, which is not strictly stationary, are allowed but, if the data source cannot ever reach any steady state in which the P²-algorithm can be deployed, then one must try to replace the P²-algorithm part of OPE by one of the mentioned recursive algorithms which adapt to non-stationary data. A control loop similar to that of OPE could then be defined to ensure that the changes in the data distribution are such that the adaptation makes sense, for example, the percentile estimates must remain monotone.
3. OPE performs similarly to a low pass filter. It is assumed that the interesting information is not in the variations on small timescales. Exceptionally small or large values will be visible from the output of OPE, but their exact occurrence in time may be lost unless a time-stamping feature is added to the algorithm. The handling of time stamps can be added but, for brevity, this issue is not discussed here. Also, trends can be observed by comparing the current histogram to the previous histogram(s), but a trend within a single histogram is not necessarily observed.

The design principle has been that in all circumstances the algorithm must perform rapidly and never produce unreliable data, even if it is applied in a situation in which its applicability is not the best possible.

4 The Control Loop

The control is obtained by applying specific, statistical counter models and periodical testing of the estimates of the extended P²-algorithm against the model percentiles that are computed at the initialization phase. The model consists of

1. the model values of the specified source percentiles,
2. three counters and
3. two alarm mechanisms to test each observed value against the model. The alarm mechanisms include four threshold values and two more counters.

The model will be described in Section 5.1.

The model is discarded immediately if the correspondence with the data source is deemed insufficient. When the model is discarded, a new model is built from the new observations. The alarm mechanisms may invalidate the model even during the ongoing observations. The flow chart of Fig. 1 provides an overview of the control loop.

If the model remains valid, it is updated periodically. The updating is done only if the observed, new data fits the model. In this case we can assume that the new data contains information which can improve the existing model.

4.1 Phases of the Control Loop

The control loop consists of configuration, initialization, model building and valid model phases. We will describe the idea of these phases next.

4.1.1 Configuration

In the configuration phase, certain statistical and computational parameters are set. These values are never changed during the execution of OPE. The main configuration parameter is m, which is the number of percentiles that will be estimated. Section 5.3 will present more details on the configuration parameters. The data structure of the extended P²-algorithm requires 2m + 3 initial values. Table 2 contains a quick reference to the numeric values which depend on m.

4.1.2 Initialization

In the initialization phase, we need a buffer of size M_b ≥ 2m + 3 observations and an in-place sorting algorithm to sort the buffer. The intention is to obtain 2m + 3 unequal and statistically representative initial values for the extended P²-algorithm. The sample percentiles are selected from the buffered M_b values after sorting. The sample percentiles will also be used as initialization values of the model for the source percentiles. Figure 2 is a sketch of the initialization process.

The word “marker” in Fig. 2 refers to the data structure of the extended P²-algorithm described in [15]. In the marker selection

$$ q_{j}^{*}=\frac{j}{2m+2}, j=1,\ldots,2m+1, $$

(1)

and in the model selection

$$ q_{j}=\frac{j}{m+1}, j=1,\ldots,m. $$

(2)

Both selections are made from the sorted buffer, hence the initial values of the model are included in the initial values of the marker:

$$ q_{2j}^{*}=q_{j}, j=1,\ldots,m. $$

(3)

The notation ⌈⋅⌉ in Fig. 2 refers to the ceiling function and x_(i) is the i:th order statistic, that is, the i:th smallest value. The q:th sample percentile is just the ⌈qn⌉:th order statistic x_(⌈qn⌉) in which n ≥ 1 is the sample size and 0 < q < 1.
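The marker and model selections of Eqs. 1 to 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the marker is represented as a plain list of 2m + 3 values (minimum, the 2m + 1 sample percentiles, maximum), and the index ⌈qn⌉ is evaluated in integer arithmetic to avoid floating-point rounding in the ceiling.

```python
def order_statistic_index(j, denom, n):
    """1-based index ceil(j * n / denom), computed with integers only."""
    return -(-j * n // denom)


def select_initial_values(buffer, m):
    """Return (marker_values, model_values) selected from the buffer.

    marker_values: minimum, the 2m+1 sample percentiles of Eq. 1, maximum.
    model_values:  the m sample percentiles of Eq. 2.
    """
    xs = sorted(buffer)  # the text assumes an in-place sort; sorted() copies
    n = len(xs)
    if n < 2 * m + 3:
        raise ValueError("buffer must hold at least 2m + 3 observations")
    marker = [xs[0]]
    marker += [xs[order_statistic_index(j, 2 * m + 2, n) - 1]
               for j in range(1, 2 * m + 2)]
    marker.append(xs[-1])
    model = [xs[order_statistic_index(j, m + 1, n) - 1]
             for j in range(1, m + 1)]
    return marker, model


marker, model = select_initial_values(range(1, 41), m=3)
# Eq. 3: every model value is also the marker value with index 2j.
assert all(marker[2 * j] == model[j - 1] for j in range(1, 4))
```

With this list layout, the marker value for probability q*_k sits at list index k, which makes the relation (3) a simple index doubling.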

The buffer, the marker and the model are each separate data structures. Each of these structures requires memory allocations. The amount of memory that is needed depends on m alone, hence this memory allocation can be done in the configuration phase as it is not changed during the execution.

In this publication, the numbers q_j, j = 1,…, m are called selected probabilities. They always satisfy (2) and are thus equally spaced. Equal spacing is not necessary, but it is difficult to justify any unequal spacing without making any prior assumptions about the data source. Also, the spacings are organized for the quartiles to always be included in them. The inclusion of the quartiles is not necessary either, but it will simplify the presentation and the programming of the algorithm considerably. The consequence of these agreements is that only values m = 3,7,15,31 are considered, which may sound restrictive, but should turn out very natural. Splitting the intervals that the selected probabilities determine into halves is the most efficient approach to have qualitatively different granularities. Other choices for m or q_j do not convey any essential, additional prior value.
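The effect of restricting m to 3, 7, 15, 31 can be checked with exact fractions. The sketch assumes only Eq. 2; the helper names are hypothetical.

```python
from fractions import Fraction


def selected_probabilities(m):
    """Equally spaced selected probabilities q_j = j/(m+1), j = 1..m (Eq. 2)."""
    return [Fraction(j, m + 1) for j in range(1, m + 1)]


def contains_quartiles(m):
    """True if 1/4, 1/2 and 3/4 are among the selected probabilities."""
    quartiles = {Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)}
    return quartiles <= set(selected_probabilities(m))


for m in (3, 7, 15, 31):
    assert contains_quartiles(m)
    # Halving every interval (m -> 2m + 1) keeps all previous probabilities.
    assert set(selected_probabilities(m)) <= set(selected_probabilities(2 * m + 1))
```

A value such as m = 5 gives the probabilities 1/6, 1/3, 1/2, 2/3, 5/6, which contain the median but not the other two quartiles.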

Sorting the buffer, which is an array of floating point numbers, is computationally the most challenging part of the algorithm, given the assumption that computational resources are limited, especially in the online computation, in which the sorting will be regularly repeated. When M_b > 2m + 3 is sufficiently large, the statistical representativeness of the initialization data usually improves and also the chances that unequal values are obtained increase. Since sorting is computationally hard and memory usage needs to be minimized, the value of M_b cannot be too large. Also, the values $x_{M_{b}+1},x_{M_{b}+2},\ldots $ can arrive during the sorting, which means that they need to be buffered somewhere else. We assume that the input queue is a first-in-first-out (FIFO) queue and can store a few values. This storage size and the arrival intensity of the observations also affect the size of M_b. Therefore, M_b is a parameter whose value is chosen according to the given computational resources.

The initialization phase ends successfully only if

$$ x_{(1)}<x_{(\lceil q_{1}^{*}M_{b}\rceil)}<\ldots<x_{(\lceil q_{2m+1}^{*}M_{b}\rceil)}<x_{(M_{b})}. $$

(4)

Note that the initial values of the model will then also be distinct. However, the status of the model is not valid after initialization. The status may change while building the model.
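Condition (4) amounts to a strict-monotonicity check on the selected order statistics. A minimal sketch, with a hypothetical function name and the same integer evaluation of ⌈q*_j M_b⌉ as an assumption about how the ceiling is computed:

```python
def initialization_succeeds(buffer, m):
    """Check condition (4) on a buffer of M_b >= 2m + 3 observations."""
    xs = sorted(buffer)
    n = len(xs)
    if n < 2 * m + 3:
        return False
    denom = 2 * m + 2
    # 1-based order-statistic indices ceil(q*_j * M_b) for j = 1..2m+1.
    indices = [-(-j * n // denom) for j in range(1, denom)]
    chosen = [xs[0]] + [xs[i - 1] for i in indices] + [xs[-1]]
    # Strict inequality between every pair of neighbours.
    return all(a < b for a, b in zip(chosen, chosen[1:]))
```

A buffer in which half of the values are tied, such as twenty zeros followed by the values 1 to 20, fails the check for m = 3 because several of the lower selected order statistics coincide.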

4.1.3 Model Building

The phase of model building is divided into two subphases to compromise with the two conflicting criteria that

1st subphase: bad initialization data should be recognized and ignored as soon as possible,
2nd subphase: while checking the correspondence of the model with the data source requires numerous observations.

In the 1st subphase, we test that the model median $x_{0.5}^{M}$ behaves like the source median. Taking into account the random fluctuations, approximately 50% of the observed, new values should fall in both of the two bins

$$ \left]-\infty,x_{0.5}^{M}\right]\text{ and }\left]x_{0.5}^{M},\infty\right[ $$

defined by the model median. The superscript “M” in $x_{0.5}^{M}$ stands for the “model”. Before the new value is given as input to the extended P²-algorithm, its position relative to the model median is recorded in the bin counts. After N₁ observations, the bin counts are tested. The test is described in Section 7.1.

In the 2nd subphase, we test that the model quartiles behave like the source quartiles. Approximately 25% of the observed, new values should fall in each of the four bins

$$ \left]-\infty,x_{0.25}^{M}\right], \left ]x_{0.25}^{M},x_{0.5}^{M}\right], \left ]x_{0.5}^{M},x_{0.75}^{M}\right]\text{ and }\left]x_{0.75}^{M},\infty\right[. $$

Before the new value is given as input to the extended P²-algorithm, this bin information is recorded. After N₂ observed values, the bin counts are tested. The test is described in Section 7.2.

If either of these two tests fails, the control goes back to the initialization and a new attempt is performed. Otherwise the model status is stated as valid.
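The bin counting used in both subphases can be sketched with a binary search over the bin boundaries. Only the counting is shown; the acceptance tests themselves belong to Sections 7.1 and 7.2 and are not reproduced here. The function name `bin_counts` is hypothetical.

```python
from bisect import bisect_left


def bin_counts(values, boundaries):
    """Count values in the right-closed bins
    ]-inf, b_1], ]b_1, b_2], ..., ]b_k, inf[ for sorted boundaries b_1 < ... < b_k."""
    counts = [0] * (len(boundaries) + 1)
    for x in values:
        # bisect_left returns the number of boundaries strictly below x,
        # so a value equal to a boundary falls in the bin closed at that boundary.
        counts[bisect_left(boundaries, x)] += 1
    return counts


# 1st subphase: two bins defined by the model median.
print(bin_counts([1, 2, 3, 4], [2]))
# 2nd subphase: four bins defined by the model quartiles.
print(bin_counts([1, 2, 3, 4, 5, 6, 7, 8], [2, 4, 6]))
```

For a model that matches the source, the counts should be roughly equal across bins, up to the random fluctuation that the tests of Section 7 account for.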

We use the phrase model building instead of model validation, since the model is updated (changed) after the successful termination of the 2nd subphase. We then trust that the initialization phase was successful and that the marker values contain representative information about the source percentiles. The details of the update are described in Section 5.2.

4.1.4 Valid Model

In the model building phase, the model is already tested twice and, in successful cases, it is updated once. After this, the validity of the model is tested periodically after each n values in which n, the period length, is a variable. There is a lower bound for a meaningful period length which depends on m: n ≥ n_min(m), but there is no upper bound. See Table 2 for values of n_min(m) and Section 8.1 on how these values are computed.

We use the notation $\widehat {x}_{q_{2j}^{*}}$ for the current values of marker variables to distinguish them from the current values of the model variables $x_{q_{j}}^{M}$. The test is twofold: 1) the model quartiles are tested to check that they still behave like the source quartiles and 2) the model percentiles $x_{q_{j}}^{M}$ and the current estimates of the same percentiles, which are the marker values $\widehat {x}_{q_{2j}^{*}}$, are tested to check that they are not too far from each other. The first test is called the bin frequency test and the latter is called the bin boundary test. These tests are explained in more detail in Sections 7 and 8.

Upon the termination of the model building phase, the value of n is first set to n = n_min(m). After each successful test, the size of n is incremented by a step: $n\leftarrow n+s$. As additional tests are accepted by the model, we have more evidence on the strict stationarity and more confidence in the extended P²-algorithm. After each successful test the model is updated.
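The growth of the period length under repeated successful tests can be illustrated as follows. The sketch assumes the values quoted elsewhere in this paper for m = 7, namely n_min(7) = 107 and the suggested step s(m) = 2m + 3; the generator name is hypothetical.

```python
from itertools import islice


def period_lengths(n_min, step):
    """Yield the period lengths n_min, n_min + s, n_min + 2s, ...
    that are used as long as every periodic test succeeds."""
    n = n_min
    while True:
        yield n
        n += step


m = 7
first_periods = list(islice(period_lengths(107, 2 * m + 3), 4))
print(first_periods)
```

A failed test ends this sequence, and the next valid model starts again from n_min(m).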

There are also two alarm mechanisms that may invalidate the model during the period. A single value above or below either of the absolute alarm thresholds may invalidate the model. Too many values above or below either of the adaptive alarm thresholds may invalidate the model.

5 Technical Details of the Control Loop

5.1 The Model

The model of the control loop consists of the m model percentiles

$$ x_{q_{1}}^{M},\ldots,x_{q_{m}}^{M}, $$

(5)

which, for simplicity, are assumed to include the model quartiles $x_{0.25}^{M}$, $x_{0.5}^{M}$, $x_{0.75}^{M}$. In addition to these, the model contains three periodwise bin counters

$$ \begin{array}{@{}rcl@{}} n_{1}&=&\#\{x_{i} | x_{i}\leq x_{0.25}^{M}, i=1,\ldots,n\},\\ n_{2}&=&\#\{x_{i} | x_{i}\leq x_{0.5}^{M}, i=1,\ldots,n\},\\ n_{3}&=&\#\{x_{i} | x_{i}\leq x_{0.75}^{M}, i=1,\ldots,n\}, \end{array} $$

(6)

which depend on the period length n, two absolute alarm thresholds $A_{abs}^{-}$ and $A_{abs}^{+}$, two adaptive alarm thresholds $A_{ada}^{-}$ and $A_{ada}^{+}$ and two periodwise counters for the adaptive thresholds:

$$ \begin{array}{@{}rcl@{}} r^{-}&=&\#\{x_{i} | x_{i}< A_{ada}^{-}, i=1,\ldots,n\},\\ r^{+}&=&\#\{x_{i} | x_{i}>A_{ada}^{+}, i=1,\ldots,n\}. \end{array} $$

(7)

The negative sign (−) refers to the lower tail of the empirical distribution and the positive sign (+) to the upper tail. All the counters related to the period are naturally initialized to zero in the beginning of the period.
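The five periodwise counters of Eqs. 6 and 7 can be kept in a small record that is updated once per observation. This is an illustrative sketch; the class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeriodCounters:
    """Cumulative bin counters (Eq. 6) and adaptive alarm counters (Eq. 7)."""
    n1: int = 0
    n2: int = 0
    n3: int = 0
    r_minus: int = 0
    r_plus: int = 0

    def observe(self, x, quartiles, a_ada_minus, a_ada_plus):
        """Account for one observation x against the model quartiles
        and the adaptive alarm thresholds."""
        q25, q50, q75 = quartiles
        self.n1 += int(x <= q25)
        self.n2 += int(x <= q50)
        self.n3 += int(x <= q75)
        self.r_minus += int(x < a_ada_minus)
        self.r_plus += int(x > a_ada_plus)

    def bin_counts(self, n):
        """Counts of the four quartile bins after n observations."""
        return [self.n1, self.n2 - self.n1, self.n3 - self.n2, n - self.n3]


counters = PeriodCounters()
for x in range(1, 9):
    counters.observe(x, quartiles=(2, 4, 6), a_ada_minus=1.5, a_ada_plus=7.5)
print(counters, counters.bin_counts(8))
```

Because the counters in Eq. 6 are cumulative, the counts of the individual quartile bins follow by differencing, as in `bin_counts`. Creating a fresh `PeriodCounters()` corresponds to resetting the counters at the beginning of a period.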

The natural choices for the thresholds of the adaptive alarm are

$$ A_{ada}^{-}=x_{q_{1}}^{M}\text{ and }A_{ada}^{+}=x_{q_{m}}^{M}, $$

(8)

but, for example, the marker values

$$ A_{ada}^{-}=\widehat{x}_{q_{1}^{*}}\text{ and }A_{ada}^{+}=\widehat{x}_{q_{(2m+1)}^{*}} $$

(9)

can also be selected. The latter choice (9) makes sense if m = 3, otherwise (8) is preferable as the accuracy of its values is typically better. If Eq. 8 is chosen, the expected number of observations in each of the bins $]-\infty ,A_{ada}^{-}[$ and $]A_{ada}^{+},\infty [$ is

$$ \lceil q_{1}n\rceil=\lceil(1-q_{m})n\rceil=\left\lceil\frac{n}{m+1}\right\rceil. $$

(10)

The period length n is always set and known before the next period begins. The expected values of the counters are also known beforehand.

The purpose of the absolute alarm thresholds is that, given the model, the events $x<A_{abs}^{-}$ and $x>A_{abs}^{+}$ are extremely rare. Since this may require some unavailable prior knowledge, it is always possible to switch off the absolute alarm mechanism by setting $A_{abs}^{-}=-\infty $ and $A_{abs}^{+}=\infty $.

The thresholds of the absolute alarms may also be defined in such a way that they depend only on Eq. 5. We define them as

$$ A_{abs}^{-}=x_{q_{1}}^{M}-K^{-}(x_{q_{m}}^{M}-x_{q_{1}}^{M}) $$

(11)

and

$$ A_{abs}^{+}=x_{q_{m}}^{M}+K^{+}(x_{q_{m}}^{M}-x_{q_{1}}^{M}), $$

(12)

in which K⁻ and K⁺ are configurable parameters. Here the distance $x_{q_{m}}^{M}-x_{q_{1}}^{M}$ serves as a yardstick, a concept similar to the “sigma” (σ) of the normal distribution N(μ, σ²). The choice of the parameters K⁻ and K⁺ requires either prior knowledge or an estimation about the tail behavior of the data source.

The interval $[x_{q_{1}}^{M},x_{q_{m}}^{M}]$ is called the body bin of the model distribution. If the model is valid, then it can be assumed that this interval contains the majority of the source distribution mass. Its complement consists of two bins which are called the tail bins.
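The thresholds of Eqs. 8, 11 and 12 and the expected tail count of Eq. 10 follow directly from the model percentiles. A minimal sketch, assuming the model is a sorted list of its m percentiles; the function names and the dictionary keys are hypothetical.

```python
def alarm_thresholds(model, k_minus=2.0, k_plus=2.0):
    """Adaptive thresholds (Eq. 8) and absolute thresholds (Eqs. 11 and 12)
    computed from the sorted model percentiles."""
    lowest, highest = model[0], model[-1]
    yardstick = highest - lowest  # width of the body bin
    return {
        "ada_minus": lowest,
        "ada_plus": highest,
        "abs_minus": lowest - k_minus * yardstick,
        "abs_plus": highest + k_plus * yardstick,
    }


def expected_tail_count(n, m):
    """Expected number of observations in each tail bin, ceil(n / (m + 1)) (Eq. 10)."""
    return -(-n // (m + 1))


print(alarm_thresholds([10.0, 20.0, 30.0]))
print(expected_tail_count(107, 7))
```

Setting `k_minus` and `k_plus` to `float("inf")` reproduces the switched-off absolute alarm whenever the body bin has positive width.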

5.2 The Updating of the Model

After successful test(s), the model percentiles are updated by the rule

$$ x_{q_{j}}^{M}\leftarrow(1-\alpha)x_{q_{j}}^{M}+\alpha\widehat{x}_{q_{2j}^{*}}, j=1,\ldots,m, $$

(13)

where 0 ≤ α ≤ 1 is a configurable parameter.

The boundary value α = 0 means that the initialization values are assumed perfect and the model is never changed. In practice, this choice expresses no confidence that the model could ever be improved by the marker values. The boundary value α = 1 means that the model is updated according to the marker values only. This would be plausible if we could know or trust that the marker values converge to the true percentiles. It can be expected that good choices of α will take into account the size of M_b or the ratio M_b/(2m + 3). This is studied in Section 9.2. The parameter α is called a learning parameter, which is indicative of the confidence assigned to the marker values when a data source is fed as input.

In the phase of model building, Eq. 13 is the only action. In the phase of valid model, the adaptive and absolute alarm threshold values must also be updated, if they depend on the model percentiles as suggested in Eqs. 8, 11 and 12.
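The update rule (13) is a convex combination of the old model value and the corresponding marker value. A minimal sketch, assuming the marker heights are stored as a list of 2m + 3 values in which index k holds the estimate for q*_k (index 0 is the minimum and index 2m + 2 the maximum); this layout and the function name are assumptions made for the illustration.

```python
def update_model(model, marker, alpha):
    """Apply Eq. 13 to all m model percentiles and return the new model."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    m = len(model)
    if len(marker) != 2 * m + 3:
        raise ValueError("marker must hold 2m + 3 values")
    # Model percentile j pairs with marker value 2j, see Eq. 3.
    return [(1.0 - alpha) * model[j - 1] + alpha * marker[2 * j]
            for j in range(1, m + 1)]


model = [10.0, 20.0, 30.0]
marker = [0.0, 5.0, 12.0, 15.0, 22.0, 25.0, 28.0, 35.0, 40.0]
print(update_model(model, marker, alpha=0.5))
```

With α = 0 the function returns the model unchanged, and with α = 1 it returns the even-indexed marker values.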

5.3 Configuration Parameters

Table 1 contains the main configurable parameters. Changing their values affects the output of OPE. In Table 2, we have collected a quick reference on the numerical values that depend on m.

Table 1 The main configuration parameters.

Table 2 Quick reference to the values that depend on m.

The message of n_min(m) in Table 2 is that a simultaneous estimation of several percentiles from the same data requires a lot of data. The default values for N₁ and N₂ in Table 1 are obtained by first setting N₁ + N₂ = n_min(m = 7) = 107, see the last column of Table 2. For K⁺ and K⁻, values like 1, 2, 3, 4 or 5 can be considered, analogously to “one sigma”, “two sigmas”, and so on, and the defaults are chosen as K⁺ = K⁻ = 2. The test level β affects the rate of false positives of the bin boundary test that is explained in Section 8. In addition, we suggest that the increment step could be s = s(m) = 2m + 3.

The data structure of the marker of the extended P²-algorithm requires memory of the size of 4 × (2m + 3) = 8m + 12 floating point numbers. The data structure of the model requires m + 4 floating point numbers and 5 counters which can be integer variables.
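The memory figures above can be tabulated for the admissible values of m. The sketch only evaluates the stated formulas, together with the suggested step s(m) = 2m + 3; the function name and dictionary keys are hypothetical.

```python
def memory_footprint(m):
    """Sizes that depend on m alone: marker floats, model floats,
    model counters and the suggested increment step s(m)."""
    return {
        "marker_floats": 4 * (2 * m + 3),  # equals 8m + 12
        "model_floats": m + 4,
        "model_counters": 5,
        "step": 2 * m + 3,
    }


for m in (3, 7, 15, 31):
    print(m, memory_footprint(m))
```

The buffer of M_b values comes on top of these figures and is chosen separately according to the available resources.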

6 The Output of OPE

The percentiles are output periodically. However, there are various choices in defining what the output of OPE should be. First of all, the output should depend on the phase of the control loop.

After n observations, where n ≥ n_min(m) is a variable, at least the m model percentiles are output. In the initialization phase, M_b input values are reduced to 2m + 3 marker values and to m initial model values and it is natural to output at least the m model values. In model building, the period lengths are configurable parameters. It is natural to output the model values after updating. In all cases, the number of input values is reduced to a very small number of output values. There is plenty of room to add metadata without affecting much of the data reduction.

After the initialization, the structure of the marker contains information about the minimum and the maximum observed values, denoted in Eq. 4 as x₍₁₎ and $x_{(M_{b})}$. Until the next initialization, the minimum can only become smaller and the maximum can only become larger. Moreover, they are true observed values, not estimates of any kind. It is natural to include them in the output, so at least m + 2 data values are output. We will use the notation x₍₁₎ and x_(n) for the minimum and maximum of the period, respectively.

Let us consider, for a moment, what values may be computed from the m + 2 output values. In many of the applications, the mean value is of interest. Assuming that the model cumulative distribution function (CDF) F^M is the piecewise linear CDF determined by the m + 2 parameters

$$ \left( x_{(1)},x_{q_{1}}^{M},\ldots,x_{q_{m}}^{M},x_{(n)}\right), $$

(14)

see Section 8.2, then the expected value of F^M is

$$ \overline{x}^{M}=\underbrace{\frac{1}{m+1}\sum\limits_{j=1}^{m}x_{q_{j}}^{M}}_{\text{robust part}}+\underbrace{\frac{x_{(1)}+x_{(n)}}{2(m+1)}}_{\text{noisy part}}. $$

(15)

In addition to the mean $\overline {x}^{M}$, it is easy to compute the variance of the model distribution F^M. The formula of the variance is presented in Section 8.3. Thus, the uncertainty of $\overline {x}^{M}$ can also be quantified. Also, the robust and the noisy parts are separated in Eq. 15, which allows one to check whether a change in the mean is merely due to noise or not.
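As an illustration of Eq. 15, the following Python sketch computes the mean of F^M from the m + 2 output values of (14) and returns the robust and the noisy parts separately. The function name `model_mean` and the list layout are assumptions made for this example; they are not taken from the PoC.

```python
def model_mean(values):
    """Return (mean, robust_part, noisy_part) of the piecewise linear model F^M.

    `values` is assumed to hold the m + 2 output values of (14) in increasing
    order: [x_(1), x_q1, ..., x_qm, x_(n)].
    """
    if len(values) < 3:
        raise ValueError("need the minimum, at least one percentile and the maximum")
    m = len(values) - 2
    # Robust part of Eq. 15: the m model percentiles, each weighted by 1/(m+1).
    robust = sum(values[1:-1]) / (m + 1)
    # Noisy part of Eq. 15: the observed extremes, each weighted by 1/(2(m+1)).
    noisy = (values[0] + values[-1]) / (2 * (m + 1))
    return robust + noisy, robust, noisy
```

Comparing the two parts between consecutive records is one way to see whether a shift in the mean comes from the percentiles or only from the extremes.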

If the model is valid, then Eq. 15 has a well-justified interpretation as an estimate of the source mean, otherwise it is still the mean of the observed sample. Therefore, we need to have the information about the validity of the model together with the percentile data (14). If the model is invalidated, then the failure reason is informative metadata.

In Table 3, an output called phase code is introduced. These are the numbers in the flow chart of Fig. 1. The purpose of the phase code is to indicate the exact place in the algorithm which caused the output record (16). This indicates the validity of the model.

Table 3 The output phase codes.

After introducing the phase code, the output format can be defined as a record with a generic format

$$ (\text{phase code},\text{percentile data},\text{metadata}). $$

(16)

This means that OPE will transform the input sequence of values into an output sequence of records. The phase code generally helps in interpreting the record correctly and in parsing the metadata. The phase code also provides an applicability criterion, see assumption 2) of Section 3. If the output never contains records with the phase code 32, briefly 32-records, then this algorithm is not applicable. Moreover, the proportion of 32-records is a measure of the applicability of OPE. This metric of applicability is used in Section 9.
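The record format (16) and the applicability metric can be sketched in Python as follows. The class `Record`, its field names and the function `applicability` are hypothetical names chosen for this illustration; only the three-part layout and the special role of the phase code 32 come from the text.

```python
from typing import NamedTuple, Sequence


class Record(NamedTuple):
    """One output record in the generic format (16)."""
    phase_code: int
    percentile_data: Sequence[float]
    metadata: dict


VALID_MODEL_CODE = 32  # phase code of the records called 32-records in the text


def applicability(records):
    """Proportion of 32-records among all output records (0.0 if there are none)."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for r in records if r.phase_code == VALID_MODEL_CODE)
    return hits / len(records)
```

A value of 0.0 corresponds to the case in which the output never contains 32-records, that is, the case in which the algorithm is not applicable to the data source.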

The phase code is an informative form of metadata. Other metadata, which is beneficial to obtain, is the period length. In the initialization and model building, the period lengths are M_b and N₁ + N₂, respectively, both of which are determined by configuration parameters. In the valid model phase, the period length n is a variable whose final value depends on whether there was an alarm during the period of execution. The information of the period length helps to keep account of the total number of observations handled. In the cases in which the model is invalidated, it is plausible to add the test parameters and the alarm-causing observations to the metadata. Serial numbers for the output data integrity control and time stamps, if available, are also natural metadata.

We emphasize that OPE is not meant to be a change detection algorithm. If change detection is important, then a change detection algorithm can be applied to the output of OPE. The control loop is defined in such a way that the statistical tests in it will regularly give false positives. This happens on purpose, as it is possible to check if there has been a significant change in the source distribution from the history of the output data. Reducing the amount of testing will increase the risk of false negatives which is much worse, because it may not be possible to detect them afterwards from the recorded data alone.

Since n is typically considerably larger than m, the data reduction target is fulfilled. What remains to be shown is that the computational requirements are small and that the information loss depends on the parameters. The latter issue is studied in Section 9. The computational complexity results from the regular sorting of the buffer. Other computations that the control loop requires are described in Sections 7 and 8.

7 Bin frequency tests

The definition of the counters n₁, n₂ and n₃ was given in Eq. 6. In Section 2.7 of [19], a standard approach to bin frequencies for i.i.d. data is given. The exact multinomial model of the bin frequency vector

$$ (n_{1},n_{2}-n_{1},n_{3}-n_{2},n-n_{3}) $$

is approximated by a multivariate normal model. This approach is not computationally fast, so we seek a simpler approach.

The theoretical model behind the bin frequency tests presented below is based on an assumption of independent trials, reducing the use of the multinomial distribution to the binomial distribution by considering conditional counter values and the normal distribution approximation of the binomial distribution.

7.1 The Testing in the 1st Subphase

Consider the following simple, classical model for the first subphase of model building. Let X be a random variable modeling the observations and let $x_{0.5}^{M}=x_{(\lceil 0.5M_{b}\rceil )}$ be the median value of the model obtained in the initialization process of Fig. 2. Let

$$ p=\mathbb{P}\left\{X\leq x_{0.5}^{M}\right\} $$

and as in Eq. 6

$$ n_{2}=\#\left\{X_{i} | X_{i}\leq x_{0.5}^{M}, i=1,\ldots,n\right\}. $$

If the strict stationarity assumption holds, then the distribution of X_i does not change and, therefore, p is a constant. If the X_i are independent and n is fixed beforehand, then n₂ is a binomially distributed random variable with parameters n and p, denoted as $n_{2}\sim \text {Bin}(n,p)$. The expected value of n₂ is np and the variance is np(1 − p). Moreover, $\widehat {p}=\frac {n_{2}}{n}$ is the maximum likelihood estimator of p. Based on the central limit theorem (see [19]), an approximation can be done as follows

$$ \begin{array}{@{}rcl@{}} \mathbb{P}\left\{|\widehat{p}-p|<a\right\}&=&\mathbb{P}\left\{\frac{|n_{2}-np|}{\sqrt{np(1-p)}}<\frac{na}{\sqrt{np(1-p)}}\right\}\\ &&\approx 2{\Phi}\left( a\sqrt{\frac{n}{p(1-p)}}\right)-1=0.95, \end{array} $$

(17)

if $a=z_{0.975}\sqrt {\frac {p(1-p)}{n}}$, where Φ is the CDF of the standard normal distribution and z_0.975 is the 97.5% percentile of Φ. Furthermore, since z_0.975 ≈ 1.96 ≤ 2 and p(1 − p) ≤ 1/4, the 95% confidence interval of $\widehat {p}$ can be written in a simplified form

$$ \left]\widehat{p}-a,\widehat{p}+a\right[\subset \left]\widehat{p}-\frac{1}{\sqrt{n}},\widehat{p}+\frac{1}{\sqrt{n}}\right[. $$

We set a requirement that 0.5 = 1/2 belongs to this simplified approximative 95% confidence interval of $\widehat {p}$:

$$ \widehat{p}-\frac{1}{\sqrt{n}}<\frac{1}{2}<\widehat{p}+\frac{1}{\sqrt{n}} $$

which is equivalent to

$$ n-2\sqrt{n}<2n_{2}<n+2\sqrt{n}. $$

(18)

Recall that the motivation of the 1st subphase is to quickly recognize and ignore bad initialization data. The length of the first subphase period, n = N₁, can be short provided that the normal approximation (17) used above is plausible. Since we can assume that p ≈ 0.5, this is not a hard requirement. Another criterion comes from the extended P²-algorithm. The markers next to the median affect the accuracy of the estimated median. Therefore, the minimum length of the period (the number of trials) should satisfy N₁ ≥ n_min(3).
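The acceptance condition of Eq. 18 needs only one square root per test, which the following Python sketch makes concrete. The function name `median_count_test` is an assumption for this example; the inequality itself is Eq. 18.

```python
import math


def median_count_test(n, n2):
    """Eq. 18: accept if n - 2*sqrt(n) < 2*n2 < n + 2*sqrt(n).

    n  : number of observations in the first subphase (n = N1)
    n2 : number of observations not exceeding the model median
    """
    if n <= 0 or not 0 <= n2 <= n:
        raise ValueError("require n > 0 and 0 <= n2 <= n")
    slack = 2.0 * math.sqrt(n)
    return n - slack < 2 * n2 < n + slack
```

For n = 100, the test accepts counts n₂ from 41 to 59, since both inequalities are strict.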

7.2 Testing of the Quartiles

First notice that the counters n₁, n₂ and n₃ of Eq. 6 are not independent: n₁ ≤ n₂ ≤ n₃. We can start with n₂ and get the same condition as Eq. 18, but then we must consider conditioned events n₁|n₂ and n − n₃|n − n₂. Luckily the same method applies to conditional confidence intervals, that is, we can set a requirement that

$$ \frac{n_{1}}{n_{2}}-\frac{1}{\sqrt{n_{2}}}<\frac{1}{2}<\frac{n_{1}}{n_{2}}+\frac{1}{\sqrt{n_{2}}} $$

(19)

which is equivalent to

$$ n_{2}-2\sqrt{n_{2}}<2n_{1}<n_{2}+2\sqrt{n_{2}}, $$

(20)

and set a requirement that

$$ \frac{n-n_{3}}{n-n_{2}}-\frac{1}{\sqrt{n-n_{2}}}<\frac{1}{2}<\frac{n-n_{3}}{n-n_{2}}+\frac{1}{\sqrt{n-n_{2}}} $$

(21)

which is equivalent to

$$ n+n_{2}-2\sqrt{n-n_{2}}<2n_{3}<n+n_{2}+2\sqrt{n-n_{2}}. $$

(22)

Note the use of 1/2 instead of 1/4 in Eqs. 19 and 21, respectively, which is due to conditioning.

Together Eqs. 18, 20 and 22 form the test of a conditional approach started from n₂ which we describe symbolically as

$$ n_{1}|n_{2} \leftarrow n_{2} \rightarrow (n-n_{3})|(n-n_{2}). $$

The notation “|” refers to conditioning and the arrow reads as, for example, “if n₂ then n₁ given n₂” ($n_{2}\rightarrow n_{1}|n_{2}$). However, there are four more cases that have to be considered and we list them symbolically below:

$$ \begin{array}{@{}rcl@{}} n_{3}&& \rightarrow n_{2}|n_{3} \rightarrow n_{1}|n_{2} \\ n_{3}&& \rightarrow n_{1}|n_{3} \rightarrow n_{2}|(n_{3}-n_{1}) \\ n_{1}&& \rightarrow (n-n_{2})|(n-n_{1}) \rightarrow (n-n_{3})|(n-n_{2})\\ n_{1}& &\rightarrow (n-n_{3})|(n-n_{1}) \rightarrow n_{2}|(n_{3}-n_{1}) \end{array} $$

Each of these leads to a different group of three inequality chains listed below:

$$ \begin{array}{@{}rcl@{}} 3n-4\sqrt{n}&<&4n_{3}<3n+4\sqrt{n}\\ 2n_{3}-3\sqrt{n_{3}}&<&3n_{2}<2n_{3}+3\sqrt{n_{3}}\\ n_{2}-2\sqrt{n_{2}}&<&2n_{1}<n_{2}+2\sqrt{n_{2}}\\\\ 3n-4\sqrt{n}&<&4n_{3}<3n+4\sqrt{n}\\ n_{3}-3\sqrt{n_{3}}&<&3n_{1}<n_{3}+3\sqrt{n_{3}}\\ n_{3}-n_{1}-\sqrt{n_{3}-n_{1}}&<&n_{2}<n_{3}-n_{1}+\sqrt{n_{3}-n_{1}}\\\\ n-4\sqrt{n}&<&4n_{1}<n+4\sqrt{n}\\ n+2n_{1}-3\sqrt{n-n_{1}}&<&3n_{2}<n+2n_{1}+3\sqrt{n-n_{1}}\\ n_{2}+n-2\sqrt{n-n_{2}}&<&2n_{3}<n_{2}+n+2\sqrt{n-n_{2}}\\\\ n-4\sqrt{n}&<&4n_{1}<n+4\sqrt{n}\\ 2n+n_{1}-3\sqrt{n-n_{1}}&<&3n_{3}<2n+n_{1}+3\sqrt{n-n_{1}}\\ n_{3}-n_{1}-\sqrt{n_{3}-n_{1}}&<&n_{2}<n_{3}-n_{1}+\sqrt{n_{3}-n_{1}} \end{array} $$

The bin frequency test accepts a triple (n₁, n₂, n₃) if at least one of the five inequality chain triples is satisfied. If all of them fail, then the test rejects the triple. The combined length n = N₁ + N₂ of both subphases must be such that the normal approximation is simultaneously plausible near all of the quartiles, which is one criterion. The extended P²-algorithm requires that the four markers next to the quartiles should also have some accuracy. Therefore, N₁ + N₂ ≥ n_min(7) is adequate.
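As a minimal sketch, the first of the five chain triples, the one started from n₂ and formed by Eqs. 18, 20 and 22, can be written in Python as below. The function names are assumptions for this example, and the four remaining triples are not implemented here; a full bin frequency test would accept when any of the five triples holds.

```python
import math


def _strictly_within(center, value, half_width):
    """True if center - half_width < value < center + half_width."""
    return center - half_width < value < center + half_width


def quartile_chain_from_median(n, n1, n2, n3):
    """Chain n1|n2 <- n2 -> (n - n3)|(n - n2), that is, Eqs. 18, 20 and 22.

    n1, n2, n3 are the counts of observations not exceeding the lower
    quartile, the median and the upper quartile of the model.
    """
    if not 0 <= n1 <= n2 <= n3 <= n or n <= 0:
        raise ValueError("require 0 <= n1 <= n2 <= n3 <= n and n > 0")
    eq18 = _strictly_within(n, 2 * n2, 2.0 * math.sqrt(n))
    eq20 = _strictly_within(n2, 2 * n1, 2.0 * math.sqrt(n2))
    eq22 = _strictly_within(n + n2, 2 * n3, 2.0 * math.sqrt(n - n2))
    return eq18 and eq20 and eq22
```

Because the inequalities are used in their multiplied-out form, no division by n₂ or n − n₂ is needed, so degenerate counts simply fail the strict inequalities.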

7.3 Adaptive Alarm Bin Counts

Assume that Eq. 8 are selected and let the r⁻ and r⁺ denote the adaptive alarm bin counters. Since n is fixed, r⁻ and r⁺ are not independent. The binomial model is again applicable in which the success probability can be assumed to be p = 1/(m + 1) and it can be assumed that n > (m + 1)². In the beginning of a period, we can set approximate 95% bounds to, for example, r⁺ as

$$ \frac{n}{m+1}-\sqrt{n}<r^{+}<\frac{n}{m+1}+\sqrt{n} $$

(23)

but the conditioning r⁻|r⁺ leads to

$$ \frac{n-r^{+}}{m+1}-\sqrt{n-r^{+}}<r^{-}<\frac{n-r^{+}}{m+1}+\sqrt{n-r^{+}} $$

(24)

which does not work similarly since the final value of r⁺ is not known when the period begins. However, if the r⁺ in Eq. 24 is understood as the current value, not the final value, then it can be implemented. Naturally, the roles of r⁺ and r⁻ in Eqs. 23 and 24 must also be considered in the opposite way:

$$ \frac{n}{m+1}-\sqrt{n}<r^{-}<\frac{n}{m+1}+\sqrt{n} $$

(25)

and

$$ \frac{n-r^{-}}{m+1}-\sqrt{n-r^{-}}<r^{+}<\frac{n-r^{-}}{m+1}+\sqrt{n-r^{-}}. $$

(26)

Finally, the lower bounds in Eqs. 24 and 26 can be dropped since the bin frequency test at the end of the period will handle them. During the period, we are only concerned about drifts away from the current body bin of the observations and take care that the sum r⁻ + r⁺ will not become too large.
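The upper bounds that remain from Eqs. 23 to 26 after the lower bounds are dropped can be evaluated with the current counter values, as in the following Python sketch. The function name, the dictionary keys and the decision to only report the bounds are assumptions of this example; how the bounds are combined into an alarm decision is not specified in this section, so the sketch leaves that to the caller.

```python
import math


def alarm_upper_bounds(n, m, r_minus, r_plus):
    """Upper bounds of Eqs. 23-26, evaluated with the current counter values.

    n       : period length
    m       : number of model percentiles, success probability p = 1/(m + 1)
    r_minus : current count of the lower alarm bin
    r_plus  : current count of the upper alarm bin
    """
    if m < 1 or n <= 0 or r_minus < 0 or r_plus < 0 or r_minus + r_plus > n:
        raise ValueError("inconsistent arguments")
    p = 1.0 / (m + 1)

    def upper(trials):
        # Expected count plus the simplified 95% margin sqrt(trials).
        return trials * p + math.sqrt(trials)

    return {
        "r_plus_unconditional": upper(n),            # Eq. 23
        "r_minus_given_r_plus": upper(n - r_plus),   # Eq. 24
        "r_minus_unconditional": upper(n),           # Eq. 25
        "r_plus_given_r_minus": upper(n - r_minus),  # Eq. 26
    }
```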

8 Bin Boundary Test

The bin boundary test is based on the following theorem:

Theorem 1

Let 0 < q₁ < … < q_m < 1. If X₁,…, X_n are i.i.d., $X_{i}\sim F$ in which the q_j percentiles of F are $x_{q_{j}}$ and F is strictly increasing in the neighborhoods of each $x_{q_{j}}$, then for all ε > 0

$$ \mathbb{P}\left\{\left|\frac{1}{m}\sum\limits_{j=1}^{m}\left( X_{(\lceil q_{j}n\rceil)}-x_{q_{j}}\right)\right|>\varepsilon\right\}\leq 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon,j}^{2}} $$

where

$$ \delta_{\varepsilon,j}=\min\left\{F(x_{q_{j}}+\varepsilon)-q_{j}, q_{j}-F(x_{q_{j}}-\varepsilon)\right\}. $$

(27)

The proof of Theorem 1 in case m = 1 is given, for example, in Section 2.3.2. of the book [19]. Extending the proof from [19] to the more general case, stated above, requires basic probabilistic reasoning based on the triangle inequality. We omit the proof since it is more important to describe how it is used.

We use the marker $\widehat {x}_{q_{2j}^{*}}$ of the extended P²-algorithm to approximate $X_{(\lceil q_{j}n\rceil )}$. Furthermore, the CDF F^M is strictly increasing with percentiles $x_{q_{j}}^{M}$, so we use F^M in place of F and $x_{q_{j}}^{M}$ in place of $x_{q_{j}}$.

The probabilistic interpretation of δ_{ε, j} is that the smaller it is, the more uncertain the estimation of $x_{q_{j}}$ is. Theorem 1 states that the most uncertain percentile dominates the convergence of all the estimates. Figure 3 visualizes the relationship between ε and δ_{ε, j} when F = F^M.

What remains to be specified is the ε > 0. When the piecewise linear F^M is used as the model, the selection of ε depends on the scale of the bins. In order to simplify the next formulas, the following convenient notation is now taken into use:

$$ \begin{array}{@{}rcl@{}} q_{0}&=&0\text{ and }x_{q_{0}}^{M}=x_{(1)} \end{array} $$

(28)

$$ \begin{array}{@{}rcl@{}} q_{m+1}&=&1\text{ and }x_{q_{m+1}}^{M}=x_{(n)}. \end{array} $$

(29)

Note that x₍₁₎ and x_(n), the period’s minimum and maximum observations, are not part of the model of the control loop, but they are part of the piecewise linear CDF model F^M. The superscript M is, therefore, justified.

With the notations Eqs. 28 and 29, the Eq. 27 of δ_{ε, j} simplifies to

$$ \delta_{\varepsilon,j}=\min\left\{\frac{1}{(x_{q_{j}}^{M}-x_{q_{j-1}}^{M})}, \frac{1}{(x_{q_{j+1}}^{M}-x_{q_{j}}^{M})}\right\}\frac{\varepsilon}{m+1}, $$

j = 1,…, m. Also, with the Eqs. 28 and 29 we define the maximum and the minimum bin widths as follows:

$$ \begin{array}{@{}rcl@{}} \varepsilon_{1}&=&\max\left\{x_{q_{j}}^{M}-x_{q_{j-1}}^{M}| j=1,\ldots,m+1\right\} \end{array} $$

(30)

$$ \begin{array}{@{}rcl@{}} \varepsilon_{2}&=&\min\left\{x_{q_{j}}^{M}-x_{q_{j-1}}^{M}| j=1,\ldots,m+1\right\}. \end{array} $$

(31)

Now, if ε ≥ ε₁, then for all j = 1,…, m

$$ \delta_{\varepsilon,j}\geq\frac{1}{m+1}, $$

(32)

and if ε ≤ ε₂, then for all j = 1,…, m

$$ \delta_{\varepsilon,j}\leq\frac{1}{m+1}. $$

(33)

We want to minimize the use of F^M since it is only one out of infinitely many CDFs which have the same percentiles $x_{q_{j}}^{M}$, j = 1,…, m. For computational feasibility with F^M, choosing any ε < ε₂ would sound attractive, but it also means that we assume that F^M is an accurate model of the source distribution. This is hardly true, at least if m = 3. The accuracy improves with larger values of m.

The choices of ε = ε₁/m and ε = ε₂/m both seem quite plausible. When the model is updated, the values of ε₁ or ε₂ are recomputed and the test will remain the same.

Assume that either ε = ε₁/m or ε = ε₂/m has been selected. The bin boundary test accepts the current marker values $\widehat {x}_{q_{2j}^{*}}$, j = 1,…, m, if

$$ \left|\frac{1}{m}\sum\limits_{j=1}^{m}\left( \widehat{x}_{q_{2j}^{*}}-x_{q_{j}}^{M}\right)\right|\leq\varepsilon $$

(34)

or else it accepts if

$$ 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon,j}^{2}}>\beta $$

(35)

or else in all the other cases it results in rejection. The parameter β is assumed to have values such as β = 0.05 or β = 0.01 (default).
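The two-stage decision of Eqs. 34 and 35 can be sketched in Python as follows, here for the choice ε = ε₂/m. With this choice ε does not exceed any bin width, so the simplified expression for δ_{ε, j} applies directly. The function name and argument layout are assumptions for this example, and the sketch assumes strictly increasing bin boundaries so that no bin has zero width.

```python
import math


def bin_boundary_test(markers, model, x_min, x_max, n, beta=0.01):
    """Accept (True) or reject (False) the current marker values.

    markers : current marker values approximating the m percentiles
    model   : the m model percentiles x^M_{q_1}, ..., x^M_{q_m}
    x_min, x_max : minimum and maximum observation, see Eqs. 28 and 29
    n       : number of observations in the current period
    beta    : test level
    """
    m = len(model)
    if m < 1 or len(markers) != m:
        raise ValueError("markers and model must have the same length m >= 1")
    knots = [x_min] + list(model) + [x_max]
    widths = [knots[j] - knots[j - 1] for j in range(1, m + 2)]
    if min(widths) <= 0:
        raise ValueError("bin boundaries must be strictly increasing")
    eps = min(widths) / m  # epsilon = epsilon_2 / m, see Eq. 31

    # Eq. 34: accept if the average deviation is within epsilon.
    mean_dev = abs(sum(a - b for a, b in zip(markers, model)) / m)
    if mean_dev <= eps:
        return True

    # Simplified delta for the piecewise linear model; widths[j - 1] and
    # widths[j] are the bins to the left and right of the j-th percentile.
    deltas = [min(1.0 / widths[j - 1], 1.0 / widths[j]) * eps / (m + 1)
              for j in range(1, m + 1)]
    # Eq. 35: accept if the probability bound of Theorem 1 still exceeds beta.
    bound = 2.0 * sum(math.exp(-2.0 * n * d * d) for d in deltas)
    return bound > beta
```

In this sketch a large deviation is rejected only when n is large enough for the bound of Theorem 1 to fall below β; for short periods the second stage accepts because the bound is not yet informative.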

8.1 The Minimum Period Length n_min(m)

Based on Theorem 1 we define $n_{min}(m,\varepsilon )$ such that

$$ 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon,j}^{2}}<\frac{1}{2}\text{ when }n\geq n_{min}(m,\varepsilon). $$

(36)

The idea is that the probability estimate of Theorem 1 is not trivial when n ≥ n_min(m, ε).

The choice ε = ε₁ of Eq. 30 gives a lower bound Eq. 32 to δ_{ε, j} which does not depend on ε or j and, therefore, it gives an upper bound to the probability bound of Theorem 1. We can find a candidate for n_min if we set a requirement that

$$ 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon_{1},j}^{2}}\leq2me^{-2n/(m+1)^{2}}<\frac{1}{2} $$

from which n_min(m) = n_min(m, ε₁) is

$$ n_{min}(m)=\left\lceil\frac{1}{2}(m+1)^{2}\log 4m\right\rceil. $$
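The closed form above is straightforward to evaluate; the following Python sketch does so with the natural logarithm. The function name `n_min` is chosen for this example.

```python
import math


def n_min(m):
    """Minimum period length: ceil(0.5 * (m + 1)^2 * log(4 m)), natural log."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return math.ceil(0.5 * (m + 1) ** 2 * math.log(4 * m))
```

For m = 7 the function returns 107, the value used for the default of N₁ + N₂.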

8.2 The Piecewise Linear CDF Model

The piecewise linear model F^M is used mainly for visualization purposes. However, in the bin boundary test, it is also used to model the source distribution near the bin boundary points $x_{q_{j}}^{M}$, j = 1,…, m. The formula, with the Eqs. 28 and 29, is the following:

$$ F^{M}(x)=\left\{\begin{array}{rl} 0, & x<x_{q_{0}}^{M}\\ \frac{x-x_{q_{j-1}}^{M}}{(m+1)(x_{q_{j}}^{M}-x_{q_{j-1}}^{M})}+\frac{j-1}{m+1},&x_{q_{j-1}}^{M}\leq x<x_{q_{j}}^{M}\\ 1, & x\geq x_{q_{m+1}}^{M} \end{array}\right. $$
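A direct evaluation of this piecewise linear CDF can be sketched in Python as below. The function name `model_cdf` and the `knots` list, which holds the m + 2 values of (14) including the end points of Eqs. 28 and 29, are assumptions for this example.

```python
from bisect import bisect_right


def model_cdf(x, knots):
    """Evaluate the piecewise linear CDF F^M at x.

    knots : [x_(1), x^M_{q_1}, ..., x^M_{q_m}, x_(n)] in non-decreasing order,
            with x_(1) < x_(n).
    """
    bins = len(knots) - 1  # this is m + 1
    if bins < 2 or knots[0] >= knots[-1]:
        raise ValueError("need at least three knots with knots[0] < knots[-1]")
    if x < knots[0]:
        return 0.0
    if x >= knots[-1]:
        return 1.0
    # bisect_right returns j with knots[j - 1] <= x < knots[j], 1 <= j <= m + 1.
    j = bisect_right(knots, x)
    left, right = knots[j - 1], knots[j]
    return ((x - left) / (right - left) + (j - 1)) / bins
```

Each bin carries the probability mass 1/(m + 1), so the function returns j/(m + 1) at the j-th model percentile.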

8.3 The Variance of F^M

Let $X\sim F^{M}$. Since $\text {Var}(X)=\mathbb {E}(X^{2})-(\mathbb {E}X)^{2}$ and, since the formula $\mathbb {E}(X)=\overline {x}^{M}$ is already given in Eq. 15, only the second moment remains to be computed. According to Eqs. 28 and 29, the formula is

$$ \mathbb{E}(X^{2})=\frac{1}{3(m+1)}\sum\limits_{j=1}^{m+1}\left[\left( x_{q_{j}}^{M}\right)^{2}+x_{q_{j}}^{M}x_{q_{j-1}}^{M}+\left( x_{q_{j-1}}^{M}\right)^{2}\right]. $$
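Combining the second moment with the mean of Eq. 15 gives the variance of F^M. The Python sketch below assumes that all three terms of the second-moment formula belong to the summand; the function name `model_variance` and the `knots` layout (the m + 2 values of (14)) are assumptions for this example.

```python
def model_variance(knots):
    """Variance of the piecewise linear model F^M.

    knots : [x_(1), x^M_{q_1}, ..., x^M_{q_m}, x_(n)] in increasing order.
    """
    bins = len(knots) - 1  # this is m + 1
    if bins < 2:
        raise ValueError("need the minimum, at least one percentile and the maximum")
    # Mean of F^M as in Eq. 15.
    mean = (sum(knots[1:-1]) + 0.5 * (knots[0] + knots[-1])) / bins
    # Second moment: each bin is uniform on [a, b] with mass 1/(m + 1),
    # and a uniform distribution on [a, b] has second moment (a^2 + ab + b^2)/3.
    second = sum(b * b + a * b + a * a
                 for a, b in zip(knots, knots[1:])) / (3 * bins)
    return second - mean * mean
```

As a check, equally spaced knots on [0, 4] describe the uniform distribution on that interval, whose variance is 16/12.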

9 Testing of OPE with the PoC

We have implemented a version of OPE using the programming language C [11]. It contains all the features described here except for the time stamps and the FIFO input queue which is not part of OPE itself. In this publication, this implementation is called the proof of concept (PoC). In this Section, we will present a collection of tests which are based on this PoC.

Even though OPE is designed to be an online algorithm, only offline testing is discussed here. Offline testing allows running the PoC with exactly the same input, but with different values for the parameters. Also, the focus is on the correspondence between the source percentiles and the model percentiles of OPE. Our testing framework is to consider it as a deterministic (time) one pass (or single pass) algorithm, because this framework already has some theoretical results about the space complexity, see [13]. The space complexity of a deterministic one pass algorithm which computes the (population) median of a file with size N numbers is Ω(N), that is, at least of order N asymptotically.

If the global estimates of the source percentiles are computed from all 32-records’ temporally local estimates, (e.g., the mean of the percentiles), this whole offline process can be considered as a deterministic, one-pass algorithm with a large input file as a source. This is achieved when the offline data source file is an output of a simulated, stationary time series. The abstract source percentiles are then practically the same as the population percentiles of the simulated file. In this special case, the size of all the output records (16) corresponding to a given input file is considered as a measure of the space complexity of OPE and it is Ω(N).

We mention just briefly that the PoC has been tested in internal projects at VTT with light sensor data, wireless mesh network delay measurement data, and vibration sensor data from a revolving machine.

9.1 Performance Metrics

First, we present the performance metrics which are considered within the above described framework of a deterministic, single-pass algorithm. These metrics are accuracy, bandwidth savings and the buffer size.

9.1.1 Accuracy

Measuring accuracy is reasonable only when the true source percentiles $x_{q_{j}}^{src}$, j = 1,…, m, exist and are known. Also, only 32-records can be assumed to contain accurate information about the source percentiles. For brevity, we will only report an overall accuracy which we define next. First, let

$$ X_{q_{j}}^{32}=\frac{1}{n_{32}}\sum\limits_{l=1}^{n_{32}}X_{q_{j},l}^{M},j=1,\ldots,m, $$

(37)

in which n₃₂ is the number of 32-records and $X_{q_{j},l}^{M}$ is the q_j:th model percentile value reported in the l:th 32-record. The averaging in Eq. 37 is reasonable in the test framework since all the 32-records model the same stationary state. The overall accuracy can be defined as the average over the m differences between the estimates $X_{q_{j}}^{32}$ and the true values:

$$ \frac{1}{m}\sum\limits_{j=1}^{m}\left( X_{q_{j}}^{32}-x_{q_{j}}^{src}\right). $$

(38)

This is a single number and the average over m allows a comparison between different m.
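Eqs. 37 and 38 can be computed as in the following Python sketch. The function name and the representation of the 32-records as plain lists of m model percentiles are assumptions for this example.

```python
def overall_accuracy(records_32, source_percentiles):
    """Overall accuracy of Eq. 38, based on the averages of Eq. 37.

    records_32         : list of 32-records, each a list of m model percentiles
    source_percentiles : the m true source percentiles
    """
    n32 = len(records_32)
    m = len(source_percentiles)
    if n32 == 0 or m == 0:
        raise ValueError("need at least one 32-record and one percentile")
    if any(len(r) != m for r in records_32):
        raise ValueError("every record must contain m percentiles")
    # Eq. 37: average each percentile over all 32-records.
    averaged = [sum(r[j] for r in records_32) / n32 for j in range(m)]
    # Eq. 38: average of the signed differences to the source percentiles.
    return sum(a - s for a, s in zip(averaged, source_percentiles)) / m
```

Since Eq. 38 averages signed differences, deviations of opposite sign at different percentiles can cancel each other.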

9.1.2 Bandwidth Savings

The IoT application idea of OPE is that the output records (16) are not stored where they are computed, but transmitted to a cloud server. Therefore, bandwidth savings, defined as

$$ \frac{\text{Input}(Bytes)-\text{Output}(Bytes)}{\text{Input}(Bytes)}, $$

is a plausible metric. The output is as provided by the PoC, which includes the phase code, the model percentiles, the period lengths and some other metadata on failure records. The output size depends on the parameters of the PoC. All the output records are included in Output(Bytes).

9.1.3 The Size of the Buffer M_b

As the content of the buffer is sorted frequently and sorting is computationally expensive, the buffer size is a performance metric.

9.2 Study Case: Buffer Size vs. The Learning Parameter

As we discussed in Section 5.2, the relationship between the buffer size M_b and the learning parameter α is of interest. The effect of α is included in the performance case study which is described here.

9.2.1 Input Files

The input files were simulated realizations from a certain moving average (MA) model of order r which we will specify here. See Chapter 3.5.5 of [14], Section 3.4.3 of [5] or Chapter 3 in [4] for general technical details of MA(r) time-series models. The different orders r = 2,4,8,16,32 and 64 are considered. For each r, there was a separate input file generated with N = 500000 values. The specific MA(r) model used here is the following. Assume r is chosen and let Z_t be an infinite sequence of independent N(0,1) distributed random variables and define

$$ X_{t}=Z_{t}+{\sum}_{i=1}^{r}\beta_{i}Z_{t-i}, $$

(39)

in which the parameters β_i are chosen as

$$ \beta_{i}=\frac{1}{\sqrt{r}}, i=1,\ldots,r. $$

(40)

With these parameters, the corresponding MA(r) model is invertible. Invertibility ensures that there is a unique MA(r) model for a given autocorrelation function (ACF) [5]. With these choices $\mathbb {E}(X_{t})=0$ for all t and, due to independence of Z_t,

$$ \text{Var}(X_{t})=1+\sum\limits_{i=1}^{r}{\beta_{i}^{2}}=1+r\left( \frac{1}{\sqrt{r}}\right)^{2}=2. $$

(41)

The ACF at lag k ≥ 0 is

$$ \rho(k)=\left\{\begin{array}{cl} 1 & k=0,\\ \frac{\sqrt{r}+r-k}{2r} & k=1,\ldots,r,\\ 0 & k>r. \end{array}\right. $$

(42)

For lags k < 0, the ACF is defined as ρ(−k) and, therefore, ρ(k) ≥ 0 for all lags k. The time series (X_t) is strictly stationary and, since the Z_t are N(0,1) distributed, by Eq. 41 the X_t are N(0,2) distributed. The source percentiles $x_{q_{j}}^{src}$ of the distribution of X_t are the same for all the input files and their numerical values are known with high precision.
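For illustration, input data of this kind can be generated with the Python sketch below, which implements Eqs. 39 and 40 together with the theoretical ACF of Eq. 42. This is not the generator used for the reported input files; the function names, the seed handling and the use of the standard `random` module are assumptions of this example.

```python
import math
import random


def ma_acf(k, r):
    """Theoretical autocorrelation of Eq. 42 at lag k for the MA(r) model."""
    k = abs(k)
    if k == 0:
        return 1.0
    if k <= r:
        return (math.sqrt(r) + r - k) / (2.0 * r)
    return 0.0


def simulate_ma(r, n, seed=0):
    """Simulate n values of X_t = Z_t + sum_{i=1}^{r} Z_{t-i} / sqrt(r)."""
    if r < 1 or n < 1:
        raise ValueError("require r >= 1 and n >= 1")
    rng = random.Random(seed)
    beta = 1.0 / math.sqrt(r)
    z = [rng.gauss(0.0, 1.0) for _ in range(n + r)]
    out = []
    window = sum(z[:r])  # sum of the r innovations preceding time t
    for t in range(r, n + r):
        out.append(z[t] + beta * window)
        window += z[t] - z[t - r]  # slide the window one step forward
    return out
```

The sample variance of a long realization should be close to 2, in agreement with Eq. 41.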

The previously defined MA(r) models with the ACF (42) were chosen because, with larger orders r, they have the property that the realizations tend to have long sequences of consecutive values which are either above or below zero. Zero is also the source median for all of the specified MA(r) models. Therefore, the property ρ(k) > 0 for all the lags |k|≤ r challenges the initialization and the model building phases of the control loop, which are the novel features. This is visualized in Fig. 4. If it were that ρ(k) < 0 with some small lag k, then the initialization and the model building would benefit from this.

9.2.2 Parameter Ranges

For each of the six input files, there are different values considered for the parameter m = 3,7,15. For each of these m, the buffer sizes

$$ M_{b}(k)=k\times(2m+3), k=1,\ldots,12, $$

(43)

are considered. Thus, the buffer sizes are multiples of 2m + 3, which is the minimal possible buffer size. For each pair (m, M_b(k)), the discretized learning parameter values

$$ \alpha(h)=\frac{h}{24}, h=0,\ldots,24, $$

(44)

are considered. Then α(h) − α(h − 1) = 1/24 ≈ 0.04. Note that the end points 0 = α(0) and 1 = α(24) are included. Each time-series file is then given as input to the PoC with different (k, h) values for the parameters (m, M_b(k), α(h)). The accuracy and the bandwidth savings are then considered as functions of (m, M_b, α). Other parameters were fixed to their defaults shown in Table 1.
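The parameter ranges of Eqs. 43 and 44 can be enumerated as in the following Python sketch, which yields 3 × 12 × 25 = 900 parameter triples per input file. The function name and the tuple layout are assumptions for this example.

```python
def parameter_grid(m_values=(3, 7, 15), k_max=12, h_max=24):
    """Enumerate the triples (m, M_b(k), alpha(h)) of Eqs. 43 and 44."""
    grid = []
    for m in m_values:
        for k in range(1, k_max + 1):
            buffer_size = k * (2 * m + 3)      # Eq. 43
            for h in range(h_max + 1):
                alpha = h / h_max              # Eq. 44, end points 0 and 1 included
                grid.append((m, buffer_size, alpha))
    return grid
```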

9.2.3 Results

First, a study of the applicability of OPE for the specified MA(r) input files with different orders r was made. The proportion of 32-records in the output is the criterion of applicability as defined in Section 6. In Fig. 4, an exhaustive search was made over all pairs (k, h), k = 1,…,12 and h = 0,…,24, to find the pair with the maximum applicability with the given m and r. If every initialization were successful and always ended up in exactly one 32-record, then the proportion of 32-records would be almost exactly 25%. The proportion of 32-records can exceed the 25% level if the control loop model stays valid, recall the flow chart of Fig. 1. The applicability decreases as the autocorrelation (42) becomes stronger (larger r), since the statistical representativeness of the available data in the initialization and model building phases becomes poorer.

In Fig. 5, the discretizing parameter pairs (k, h), which produced the maximum applicability in Fig. 4, with the given m and r, are visualized in the space of (k, h). The pairs (k, h) are not necessarily unique. In the non-unique case, the pair with the smallest k is shown. The cases m = 3,7,15 are considered and for each m the points corresponding to input data of orders r = 2,4,8,16,32 and 64 are plotted and joined. The other end points, with r = 64, are marked. The horizontal axis is in scale k = M_b(k)/(2m + 3) only. The vertical scale is given in h and in α(h) scale. There is not much structure in Fig. 5, but it appears that two simple policies may be behind the best applicability: either the buffer size M_b is large and α is small, or a smaller buffer size is accompanied by a larger α. The notable things are that the choice of the parameters M_b and α can significantly affect the performance when the applicability is low (r = 64) and that the maximum considered buffer size M_b(12) = 12(2m + 3) is not automatically needed.

Figure 6 shows how the accuracy depends on the applicability. The gray area is a rough estimate of the 99% confidence interval in the case m = 15, obtained from Eq. 38 by assuming in Eq. 37 that (see Chapter 2.1.1 of [19])

$$ X_{q_{j}}^{32}\sim N\left( x_{q_{j}}^{src},\frac{(1-q_{j})q_{j}}{n_{32}}\right),j=1,\ldots,m, $$

(45)

which is considered as the null hypothesis of accuracy in our test framework.

Finally, the bandwidth savings are presented in Fig. 7. They are computed for the maximizing pairs (k, h). If the applicability of OPE is high and the data source is stationary, then the bandwidth savings with m = 15 can be more than 97%. This indicates that there is not much information in the simulated input data with r = 2, which can be interpreted as almost white noise with a minor memory effect. Recall that all the output records are included in the bandwidth savings.

If the buffer size M_b is limited beforehand in an application, but some time-series samples of the application data are available in the design phase, one can try to estimate a sufficient value for the α of OPE in a similar way as presented in this section. This would involve estimating the sample autocorrelation function of the application data, simulating data with a similar autocorrelation structure, feeding it to OPE and then choosing the α with the best applicability, that is, the largest proportion of 32-records.

9.3 Complexity Issues

Our control loop alternates between sorting a fixed size buffer M_b and the extended P²-algorithm. Even though the general complexity of sorting is of the order $O(N\log N)$, since the buffer size is constant, the complexity of the sorting phases is at most of constant order, $O(M_{b}\log M_{b})=O(1)$. Also, the control loop adds complexity to the whole and our main effort has been to guarantee that this complexity of the control loop is also constant. Thus, the overall complexity of OPE is of constant order, but this constant is not the minimal possible. Indeed, there is room for further research on developing an algorithm with lower complexity. For example, replacing the P²-algorithm by the other algorithms that were discussed in Section 2 may provide a controlled algorithm with lower complexity.

10 Conclusion

We have described a sequential, online algorithm OPE, which either computes or estimates the temporal percentiles of the observed, empirical distribution of the data. This is done without storing all of the observations. It works with a parameter-defined, fixed-size small amount of memory and limited CPU power. It can be used for the statistical reduction of a numerical data stream. It transforms the univariate input stream into an output stream of vectors in which each vector can be interpreted as an empirical distribution of the data starting immediately after the previous vector. For IoT applications, there is a significant reduction in the amount of data (bytes) that needs to be transmitted to a cloud server. The output data is structured in a (cumulative) histogram format, which makes it quick to recognize such trends or anomalies that are visible in the histogram scale. We provided requirements that should be satisfied when OPE is applied and we illustrated that OPE contains a built-in applicability metric, the proportion of 32-records. We also presented a performance study, which was specially designed to challenge the control loop part of OPE. A lesson learned from this performance study is that 32-records are self-informative in the sense that their interpretation does not require more metadata than the knowledge that they are 32-records, and they remain sufficiently accurate even when the applicability of OPE decreases.

Code Availability

The C-code of the PoC is available from the corresponding author on reasonable request.

Notes

The words percentile, quantile and fractile are synonyms.
The name P² = PP comes from the two P-letters in Piecewise-Parabola [10].

References

Abbas, N., Zhang, Y., Taherkordi, A., & Skeie, T. (2018). Mobile edge computing: a survey. IEEE Internet of Things Journal, 5(1), 450–465. https://doi.org/10.1109/JIOT.2017.2750180.
Article Google Scholar
Adegbija, T., Rogacs, A., Patel, C., & Gordon-Ross, A. (2018). Microprocessor optimizations for the internet of things: a survey.. In IEEE Transactions on Computer-Aided Design of Integrated Circuits And Systems, (Vol. 37 pp. 7–20), DOI https://doi.org/10.1109/TCAD.2017.2717782.
Bakshi, Y., & Hoeflin, D.A. (2006). Quantile estimation: a minimalist approach. In Proceedings of the 2006 winter simulation conference.
Brockwell, P.J., & Davis, R.A. (1991). Time series: theory and methods, 2nd edn. Berlin: Springer.
Book Google Scholar
Chatfield, C. (2004). The analysis of time series, an introduction. Florida, US: CRC Press.
MATH Google Scholar
Chen, F., Lambert, D., & Pinheiro, J.C. (2000). Incremental quantile estimation for massive tracking. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’00. https://doi.org/10.1145/347090.347195 (pp. 516–522). New York: ACM.
Feldman, D., & Tucker, H.G. (1965). Estimation of non-unique quantiles. The Annals of Mathematical Statistics, 37, 451–457.
Article MathSciNet Google Scholar
Hammer, H.L., & Yazidi, A. (2017). Incremental quantile estimators for tracking multiple quantiles. In Benferhat, S., & et al. (Eds.) IEA/AIE 2017 Part I, no. 10350 in LNAI. https://doi.org/10.1007/978-3-319-60042-0_23(pp. 202–210).
Heidelberger, P., & Lewis, P.A.W. (1984). Quantile estimation in dependent sequences. Operations Research, 32(1), 185–209. https://doi.org/10.1287/opre.32.1.185.
Jain, R., & Chlamtac, I. (1985). The P² algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM, 28(10), 1076–1085.
Kernighan, B.W., & Ritchie, D.M. (1988). The C programming language, 2nd edn. Prentice Hall Professional Technical Reference.
Leligou, H.C., Zahariadis, T., Sarakis, L., Tsampasis, E., Voulkidis, A., & Velivassaki, T.E. (2018). Smart grid: a demanding use case for 5G technologies. In 2018 IEEE international conference on pervasive computing and communications workshops (PerCom Workshops) (pp. 215–220). https://doi.org/10.1109/PERCOMW.2018.8480296.
Munro, J.I., & Paterson, M.S. (1980). Selection and sorting with limited storage. Theoretical Computer Science, 12(3), 315–323. https://doi.org/10.1016/0304-3975(80)90061-4. https://www.sciencedirect.com/science/article/pii/0304397580900614.
Priestley, M.B. (1981). Spectral analysis and time series. Academic Press: London, New York.
Raatikainen, K. (1987). Simultaneous estimation of several percentiles. Simulation, 49(4), 159–163. https://doi.org/10.1177/003754978704900405.
Raatikainen, K. (1990). Modelling and analysis techniques for capacity planning. Department of Computer Science University of Helsinki.
Raatikainen, K. (1990). Sequential procedure for simultaneous estimation of several percentiles. Transactions of the Society for Computer Simulation, 7(1), 21–44.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
Serfling, R.J. (1980). Approximation theorems of mathematical statistics. New York: Wiley.
Tierney, L. (1983). A space-efficient recursive procedure for estimating a quantile of an unknown distribution. SIAM Journal on Scientific and Statistical Computing, 4(4), 706–711. https://doi.org/10.1137/0904048.
Tiwari, N., & Pandey, P.C. (2018). A technique with low memory and computational requirements for dynamic tracking of quantiles. Journal of Signal Processing Systems. https://doi.org/10.1007/s11265-017-1327-6.
Yan, J., Meng, Y., Lu, L., & Li, L. (2017). Industrial big data in an industry 4.0 environment: challenges, schemes, and applications for predictive maintenance. IEEE Access, 5, 23484–23491. https://doi.org/10.1109/ACCESS.2017.2765544.
Yazidi, A., & Hammer, H. (2016). Quantile estimation in dynamic and stationary environments using the theory of stochastic learning. Applied Computing Review, 16(1), 15–24. https://doi.org/10.1145/2924715.2924717.
Zieliński, R. (1998). Uniform strong consistency of sample quantiles. Statistics & Probability Letters, 37, 115–119.

Acknowledgements

The authors would like to thank Heikki Ailisto and Heli Helaakoski for supporting this study at VTT.

Funding

Open access funding provided by Technical Research Centre of Finland (VTT). This study was funded by ECSEL-MegaM@Rt2 (MegaModelling at Runtime - scalable model-based framework for continuous development and runtime validation of complex systems) and BusinessFinland-RAGE (Real-Time AI-Supported Ore Grade Evaluation for Automated Mining) projects.

Author information

Authors and Affiliations

VTT Technical Research Centre of Finland, Espoo, Finland
Jorma Kilpi, Timo Kyntäjä & Tomi Räty

Contributions

JK is responsible for the algorithm development and the theoretical foundations of the control loop. JK programmed the PoC with valuable help from TK. JK, TK and TR all contributed to the writing and revising of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jorma Kilpi.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kilpi, J., Kyntäjä, T. & Räty, T. Online Percentile Estimation (OPE). J Sign Process Syst 93, 1085–1100 (2021). https://doi.org/10.1007/s11265-021-01673-z

Received: 09 March 2020
Revised: 14 December 2020
Accepted: 25 May 2021
Published: 21 June 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s11265-021-01673-z
