1 Introduction

The Internet of Things (IoT) concept reflects the tens of billions of connected devices expected on the Internet in the near future. Many such devices produce measurement or monitoring data periodically or even continuously. The growth of data will be huge. This offers engineering challenges and business opportunities for the future. Opportunities, however, first require solutions to the challenges, especially the management of the increased data volume caused by the number of networked sensors or measurement agents. Often, at least a portion of the data needs to be stored for later use while, simultaneously, supporting real-time situation awareness is equally important. The transmission of this big data from the network edge to the cloud servers for processing consumes bandwidth and adds latency [2]. Instead of performing all the computations in a cloud server, some computations can be done in the proximity of the origins of the data. However, memory, CPU power and communication capacity can be limited where data is measured and/or collected. Occasionally, the observed historical data needs to be recovered with sufficient detail.

When a statistical event appears to be predictable, its information content can be assumed to be small and, therefore, it does not need many bits of information to be stored. A stationary state is a condition in which the amount of stored data bytes per event can be reduced. Very rare events also occur in a stationary state. However, reacting to all rare events is compulsory, since it is not possible to know beforehand whether an event is significant or not. Afterwards, it may be possible to judge whether the rare event was significant or not, provided that the stored data support the judgement. Responding to a rare event is possible if the latest history of events is available. A rare event is construed as rare in relation to the history of occurred events within a designated period of time. We exploit a probability distribution to identify the rarity of the events, because the percentiles of the distribution provide several fast ways to distinguish a rare event. The process of specifying the probability distribution is informative in itself as it presents indications of both stationary and non-stationary events. The utilization of probability distributions also facilitates the comparison of current events to previously occurred events simply by comparing the different distributions.

In this paper, we provide the details of an algorithm that attempts to perform the heuristic described above. It continuously tests whether the input data appear stationary and reacts to events that do not appear to be stationary. The algorithm transforms an input sequence of numbers into an output sequence of histograms. It also indicates how to interpret the histograms, since their information content varies according to whether the histogram was computed by sorting a small amount of input data or estimated by assuming stationary input data.

The fields of smart grids and rural industrial systems serve as examples in which the proposed method can be applied. These systems produce massive amounts of data to be analysed, but sending it onward is not always cost effective. An example of this would be industrial big data related to the predictive maintenance of systems [22]. These sensing processes (e.g. metering of smart grid cable conditions, machine and/or tool vibrations) produce large amounts of data to be analysed to predict possible system faults. Our method can produce both a reduced amount of data and an indication of the changes within the measurements. Furthermore, this method can be integrated in 5G communication systems, in which Mobile Edge Computing (MEC) can be used as a service [1, 12].

This paper is structured as follows. In Section 2, we provide background to our solution, relate it to existing, similar research and further motivate our approach. In Section 3, we present our design assumptions, which are simultaneously requirements for our algorithm. Section 4 gives a top-level view of the algorithm and Section 5 its details. In Section 6, we describe the output of the algorithm. Sections 7 and 8 contain the statistical details. In Section 9, we present results of selected performance tests and finally the conclusions are drawn in Section 10.

2 Background and Related Work

Our control loop utilizes a sequential algorithm called the extended P2-algorithm. We will also discuss other alternatives later in this section. The P2-algorithm was first developed by Jain and Chlamtac [10] for the sequential estimation of a single percentile of a distribution without storing the observations, then extended by Raatikainen in [15] for the simultaneous estimation of several quantiles, and further analyzed by Raatikainen in [17]. Raatikainen used the extended P2-algorithm as a part of his PhD thesis [16]. The publication [15] of Raatikainen contains the remarkably short pseudocode of the algorithm in such detail that its implementation is straightforward. Recently, the P2-algorithm was included in the comparison analysis of Tiwari and Pandey in [21].

The extended P2-algorithm is a very fast algorithm since it does not sort the input. However, there exist several computationally simpler algorithms which we will discuss shortly. The P2-algorithm approximates the inverse of the empirical cumulative distribution function (CDF) by utilizing a piecewise parabola to adjust the estimates of the percentiles. If the parabolic adjustment cannot be applied, then linear adjustment is used instead.

The extended P2-algorithm has some drawbacks:

  1. It is suitable only for strictly stationary data sources, that is, for sources whose distribution really has percentiles which do not depend on the observation time [4, 5, 14]. If the data source is not strictly stationary, then the estimates need not have any meaningful statistical interpretation [21]. This reduces the applicability of the extended P2-algorithm.

  2. The estimates of the percentiles near the median are more accurate, but the tail percentiles are less accurate. However, this is not a problem of the extended P2-algorithm alone, as the estimation of the extreme tail probabilities is in general very hard due to a lack of observations [9]. Our solution to this is simply to avoid the use of the extended P2-algorithm in the estimation of the extreme tail percentiles.

  3. The extended P2-algorithm requires initialization data. As noted in [17], a good statistical representativeness of the initialization values is required. This may be hard to achieve in applications. It is worthwhile to use some computational resources to achieve better representativeness.

  4. Initialization values must also be unequal, distinct values. This may cause problems in applications in which this cannot be guaranteed. Effort spent on the previous item (3) also helps with this issue.

At first sight, these restrictions may sound hard for useful applications. However, we will show that the area of application can be significantly extended to cover some (not all) of the data sources which are not strictly stationary. The data source is allowed to change from a stationary state to another stationary state and the number of stationary states is not restricted.

Statistical background knowledge includes the research of Heidelberger and Lewis [9], which examines the quantile estimation for dependent data and presents many practical methods, such as the Maximum Transformation and the Nested Group Quantiles. The work of Feldman and Tucker [7] gives insight into what can happen when we try to estimate a non-unique quantile and Zieliński [24] shows that estimation problems can occur even if the quantile is unique.

Alternatives to the extended P2-algorithm are the various recursive algorithms based on the stochastic approximation method of [18]. These algorithms include the Stochastic Approximation (SA) by Tierney [20], the Exponentially Weighted Stochastic Approximation (EWSA) by Chen et al. [6], the Log Odds Ratio Algorithm (LORA) by Bakshi and Hoeflin [3] and the recent Dynamic Quantile Tracking using Range Estimation (DQTRE) of Tiwari and Pandey [21]. Yazidi and Hammer [8, 23] present recursive approaches based on stochastic learning. A common property of all the estimation algorithms is the need to obtain good initialization data.

The recursive algorithms EWSA, LORA and DQTRE adapt, to some extent, to some specific types of non-stationary data streams. However, the data source can be non-stationary in a myriad of different ways. If the distribution of data depends on the time of the observation, this dependency can be smooth or non-smooth. For example, EWSA was shown to adapt to a linear trend in [6], but in [21] DQTRE performed better in the simulated case of a sudden non-smooth change of the distribution. The estimation of time-dependent quantiles is conceptually difficult since, if the quantiles depend on time, then their estimates also depend on time. We are not aware of any general theory on the estimation of time-dependent quantiles.

Our contribution is to provide a control loop around the extended version of the P2-algorithm. The controlled extended P2-algorithm will be referred to as Online Percentile Estimation (OPE) and it consists of two separate parts: the extended P2-algorithm of [15] and our control loop. Our intention is to keep the control loop as computationally simple as possible. We emphasize the separation into two parts, because it is possible to replace the P2-algorithm with any of the recursive algorithms discussed above. For example, the authors of DQTRE [21] sketch the estimation of several quantiles simultaneously. The extended P2-algorithm has a built-in property to keep the estimates of the quantiles strictly monotone while DQTRE temporarily allows non-monotone quantile estimates.

The purpose of the control loop is to detect temporal behavior which is not strictly stationary and to avoid the use of the extended P2-algorithm in these situations. In cases which are not strictly stationary, the sample percentiles, obtained by sorting a relatively small sample of data, have the most reliable interpretation. For example, the sample median, denoted in this paper as x(⌈0.5n⌉) in which n is the sample size, always has the interpretation that 50% of the sample values have been smaller than x(⌈0.5n⌉) and 50% have been larger than x(⌈0.5n⌉). This holds true because the sample is sorted, regardless of whether the data source was (strictly) stationary when the sample values were observed. In the non-stationary case, the data sample also represents the specific interval of time in which it was observed, and additional information is required for its correct statistical interpretation.

In the strictly stationary case, the opposite holds: by definition, the data source has a time-invariant distribution. Then the uncertainty, which results from the estimation instead of the computation of percentiles, can be tolerated. In this paper, the term source percentile indicates the percentile of the distribution of the (endless) data source. The temporal existence of such source percentiles can be assumed whenever there is evidence that the assumption of the strictly stationary case holds.

The authors of the P2-algorithm [10] considered a hardware implementation (quantile chip) of their algorithm. Tiwari and Pandey [21] suggested a similar hardware implementation, for example on an FPGA or ASIC, of their DQTRE algorithm. Our viewpoint is that all online quantile estimation algorithms benefit from a simple control loop which attempts to verify that the assumptions of the algorithms are not violated too heavily.

3 Assumptions and Requirements for OPE

For further reference, we present the assumptions that were made during the design process of OPE. These assumptions are also requirements for applying the algorithm.

  1. It is assumed that there is random variability in the data.

  2. It is assumed that the data source can remain in some steady state for periods long enough that the stationary model can be applied. The number of different transient steady states is unlimited. Momentary periods of behavior that is not strictly stationary are allowed but, if the data source can never reach any steady state in which the P2-algorithm can be deployed, then one must try to replace the P2-algorithm part of OPE by one of the mentioned recursive algorithms which adapt to non-stationary data. A control loop similar to that of OPE could then be defined to ensure that the changes in the data distribution are such that the adaptation makes sense; for example, the percentile estimates must remain monotone.

  3. OPE performs similarly to a low pass filter. It is assumed that the interesting information is not in the small-timescale variations. Exceptionally small or large values will be visible from the output of OPE, but their exact occurrence in time may be lost unless a time-stamping feature is added to the algorithm. The handling of time stamps can be added but, for brevity, this issue is not discussed here. Also, trends can be observed by comparing the current histogram to the previous histogram(s), but a trend within a single histogram is not necessarily observed.

The design principle has been that in all circumstances the algorithm must perform rapidly and never produce unreliable data, even if it is applied in a situation in which its applicability is not the best possible.

4 The Control Loop

The control is obtained by applying specific, statistical counter models and periodical testing of the estimates of the extended P2-algorithm against the model percentiles that are computed at the initialization phase. The model consists of

  1. the model values of the specified source percentiles,

  2. three counters and

  3. two alarm mechanisms to test each observed value against the model. The alarm mechanisms include four threshold values and two more counters.

The model will be described in Section 5.1.

The model is discarded immediately if the correspondence with the data source is deemed insufficient. When the model is discarded, a new model is built from the new observations. The alarm mechanisms may invalidate the model even during the ongoing observations. The flow chart of Fig. 1 provides an overview of the control loop.

Figure 1 The high-level flow chart of the control loop. The numbers in the flow chart will be explained in Section 6.

If the model remains valid, it is updated periodically. The updating is done only if the observed, new data fits the model. In this case we can assume that the new data contains information which can improve the existing model.

4.1 Phases of the Control Loop

The control loop consists of configuration, initialization, model building and valid model phases. We will describe the idea of these phases next.

4.1.1 Configuration

In the configuration phase, certain statistical and computational parameters are set. These values are never changed during the execution of OPE. The main configuration parameter is m, which is the number of percentiles that will be estimated. Section 5.3 will present more details on the configuration parameters. The data structure of the extended P2-algorithm requires 2m + 3 initial values. Table 2 contains a quick reference to the numeric values which depend on m.

4.1.2 Initialization

In the initialization phase, we need a buffer of size Mb ≥ 2m + 3 observations and an in-place sorting algorithm to sort the buffer. The intention is to obtain 2m + 3 unequal and statistically representative initial values for the extended P2-algorithm. The sample percentiles are selected from the buffered Mb values after sorting. The sample percentiles will also be used as initialization values of the model for the source percentiles. Figure 2 is a sketch of the initialization process.

Figure 2 Initialization process.

The word “marker” in Fig. 2 refers to the data structure of the extended P2-algorithm described in [15]. In the marker selection

$$ q_{j}^{*}=\frac{j}{2m+2}, j=1,\ldots,2m+1, $$
(1)

and in the model selection

$$ q_{j}=\frac{j}{m+1}, j=1,\ldots,m. $$
(2)

Both selections are made from the sorted buffer, hence the initial values of the model are included in the initial values of the marker:

$$ q_{2j}^{*}=q_{j}, j=1,\ldots,m. $$
(3)

The notation ⌈⋅⌉ in Fig. 2 refers to the ceiling function and x(i) is the i:th order statistic, that is, the i:th smallest value. The q:th sample percentile is just the ⌈qn⌉:th order statistic x(⌈qn⌉) in which n ≥ 1 is the sample size and 0 < q < 1.
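As an illustration of the selections (1) to (3), the following minimal Python sketch picks the marker and model values from a sorted buffer. The function names `order_statistic_index` and `select_initial_values`, and the list layout with the minimum first and the maximum last, are assumptions made for this example and are not taken from [15].

```python
def order_statistic_index(j, denominator, n):
    """1-based index ceil(q * n) for q = j / denominator, in integer arithmetic."""
    return -(-j * n // denominator)


def select_initial_values(buffer, m):
    """Select 2m + 3 marker values and m model values from a buffer.

    Assumed layout: markers[0] is the minimum, markers[k] is the
    q*_k sample percentile for k = 1..2m+1, markers[2m+2] is the maximum.
    """
    xs = sorted(buffer)
    n = len(xs)
    if n < 2 * m + 3:
        raise ValueError("buffer must hold at least 2m + 3 values")
    # Eq. 1: q*_j = j / (2m + 2), j = 1..2m+1
    marker_idx = [order_statistic_index(j, 2 * m + 2, n) for j in range(1, 2 * m + 2)]
    # Eq. 2: q_j = j / (m + 1), j = 1..m
    model_idx = [order_statistic_index(j, m + 1, n) for j in range(1, m + 1)]
    markers = [xs[0]] + [xs[i - 1] for i in marker_idx] + [xs[-1]]
    model = [xs[i - 1] for i in model_idx]
    return markers, model


markers, model = select_initial_values(
    [9, 3, 14, 1, 7, 16, 5, 11, 2, 13, 8, 4, 15, 6, 12, 10], m=3
)
print(markers)  # 2m + 3 = 9 values
print(model)    # m = 3 values; by Eq. 3 they equal markers[2], markers[4], markers[6]
```

The integer form of the ceiling avoids floating-point rounding when computing ⌈qn⌉.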

The buffer, the marker and the model are each separate data structures. Each of these structures requires memory allocations. The amount of memory that is needed depends on m alone, hence this memory allocation can be done in the configuration phase as it is not changed during the execution.

In this publication, the numbers qj, j = 1,…, m are called selected probabilities. They always satisfy (2) and are thus equally spaced. Equal spacing is not necessary, but it is difficult to justify any unequal spacing without making any prior assumptions about the data source. Also, the spacings are chosen so that the quartiles are always included. The inclusion of the quartiles is not necessary either, but it will simplify the presentation and the programming of the algorithm considerably. The consequence of these agreements is that only the values m = 3,7,15,31 are considered, which may sound restrictive, but turns out to be natural. Splitting the intervals that the selected probabilities determine into halves is the most efficient approach to have qualitatively different granularities. Other choices for m or qj do not convey any essential, additional prior value.
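The halving argument can be checked with exact fractions. The sketch below, in which the helper name `selected_probabilities` is an assumption of this example, lists the selected probabilities of Eq. 2 and confirms that the considered values of m contain the quartiles and that each refinement from m to 2m + 1 keeps the earlier probabilities.

```python
from fractions import Fraction

QUARTILES = {Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)}


def selected_probabilities(m):
    """Equally spaced selected probabilities q_j = j / (m + 1), j = 1..m (Eq. 2)."""
    return [Fraction(j, m + 1) for j in range(1, m + 1)]


for m in (3, 7, 15):
    coarse = set(selected_probabilities(m))
    fine = set(selected_probabilities(2 * m + 1))
    # Halving every interval keeps all coarse probabilities in the finer grid.
    print(m, QUARTILES <= coarse, coarse <= fine)
```

A value such as m = 5 gives the probabilities j/6, which contain the median but not the other two quartiles.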

Sorting the buffer, which is an array of floating point numbers, is computationally the most challenging part of the algorithm, given the assumption that computational resources are limited. This holds especially in online computation, in which the sorting will be regularly repeated. When Mb > 2m + 3 is sufficiently large, the statistical representativeness of the initialization data usually improves and the chances of obtaining unequal values also increase. Since sorting is computationally hard and memory usage needs to be minimized, the value of Mb cannot be too large. Also, the values \(x_{M_{b}+1},x_{M_{b}+2},\ldots \) can arrive during the sorting, which means that they need to be buffered somewhere else. We assume that the input queue is a first-in-first-out (FIFO) queue and can store a few values. This storage size and the arrival intensity of the observations also affect the size of Mb. Therefore, Mb is a parameter whose value is chosen according to the given computational resources.

The initialization phase ends successfully only if

$$ x_{(1)}<x_{(\lceil q_{1}^{*}M_{b}\rceil)}<\ldots<x_{(\lceil q_{2m+1}^{*}M_{b}\rceil)}<x_{(M_{b})}. $$
(4)

Note that the initial values of the model will then also be distinct. However, the status of the model is not valid after initialization. The status may change while building the model.
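Condition (4) amounts to a strict monotonicity check on the 2m + 3 selected order statistics. A minimal sketch, assuming a hypothetical function name `initialization_succeeds` and a plain Python list as the buffer, could look as follows.

```python
def initialization_succeeds(buffer, m):
    """Check Eq. 4: the 2m + 3 selected order statistics are strictly increasing."""
    xs = sorted(buffer)
    n = len(xs)
    # 1-based indices ceil(q*_j * n) with q*_j = j / (2m + 2), j = 1..2m+1
    indices = [-(-j * n // (2 * m + 2)) for j in range(1, 2 * m + 2)]
    values = [xs[0]] + [xs[i - 1] for i in indices] + [xs[-1]]
    return all(a < b for a, b in zip(values, values[1:]))


print(initialization_succeeds(list(range(1, 17)), m=3))             # distinct values
print(initialization_succeeds([0] * 8 + list(range(1, 9)), m=3))    # many ties
```

With heavily tied data, several selected order statistics coincide and the initialization has to be repeated with new observations.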

4.1.3 Model Building

The phase of model building is divided into two subphases to compromise with the two conflicting criteria that

1st subphase: bad initialization data should be recognized and ignored as soon as possible,

2nd subphase: while checking the correspondence of the model with the data source requires numerous observations.

In the 1st subphase, we test that the model median \(x_{0.5}^{M}\) behaves like the source median. Taking into account the random fluctuations, approximately 50% of the observed, new values should fall in both of the two bins

$$ \left]-\infty,x_{0.5}^{M}\right]\text{ and }\left]x_{0.5}^{M},\infty\right[ $$

defined by the model median. The superscript “M” in \(x_{0.5}^{M}\) stands for the “model”. Before the new value is given as input to the extended P2-algorithm, its position relative to the model median is recorded. After N1 observations, the bin counts are tested. The test is described in Section 7.1.

In the 2nd subphase, we test that the model quartiles behave like the source quartiles. Approximately 25% of the observed, new values should fall in each of the four bins

$$ \left]-\infty,x_{0.25}^{M}\right], \left ]x_{0.25}^{M},x_{0.5}^{M}\right], \left ]x_{0.5}^{M},x_{0.75}^{M}\right]\text{ and }\left]x_{0.75}^{M},\infty\right[. $$

Before the new value is given as input to the extended P2-algorithm, this bin information is recorded. After N2 observed values, the bin counts are tested. The test is described in Section 7.2.

If either of these two tests fails, the control goes back to the initialization and a new attempt is performed. Otherwise the model status is stated as valid.
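The bookkeeping of both subphases reduces to counting observations in half-open bins that are closed on the right. The following sketch shows only this counting step; the statistical tests of Sections 7.1 and 7.2 are not reproduced here, and the names `bin_index` and `bin_counts` are assumptions of this example.

```python
from bisect import bisect_left


def bin_index(x, boundaries):
    """Index of the bin containing x for bins ]-inf, b0], ]b0, b1], ..., ]b_last, inf[."""
    # bisect_left returns the first position whose boundary is >= x,
    # so a value equal to a boundary falls in the bin below it.
    return bisect_left(boundaries, x)


def bin_counts(values, boundaries):
    """Count how many values fall in each of the len(boundaries) + 1 bins."""
    counts = [0] * (len(boundaries) + 1)
    for x in values:
        counts[bin_index(x, boundaries)] += 1
    return counts


observations = list(range(1, 17))
median_counts = bin_counts(observations, [8])            # 1st subphase: two bins
quartile_counts = bin_counts(observations, [4, 8, 12])   # 2nd subphase: four bins
print(median_counts, quartile_counts)
```

In an online setting the counters would be incremented one observation at a time, before the value is passed to the extended P2-algorithm.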

We use the phrase model building instead of model validation, since the model is updated (changed) after the successful termination of the 2nd subphase. We then trust that the initialization phase was successful and that the marker values contain representative information about the source percentiles. The details of the update are described in Section 5.2.

4.1.4 Valid Model

In the model building phase, the model is already tested twice and, in successful cases, it is updated once. After this, the validity of the model is tested periodically after each n values in which n, the period length, is a variable. There is a lower bound for a meaningful period length which depends on m: n ≥ nmin(m), but there is no upper bound. See Table 2 for values of nmin(m) and Section 8.1 on how these values are computed.

We use the notation \(\widehat {x}_{q_{2j}^{*}}\) for the current values of marker variables to distinguish them from the current values of the model variables \(x_{q_{j}}^{M}\). The test is twofold: 1) it is tested that the model quartiles still behave like the source quartiles and 2) it is tested that the model percentiles \(x_{q_{j}}^{M}\) and the current estimates of the same percentiles, which are the marker values \(\widehat {x}_{q_{2j}^{*}}\), are not too far from each other. The first test is called the bin frequency test and the latter is called the bin boundary test. These tests are explained in more detail in Sections 7 and 8.

Upon the termination of the model building phase, the value of n is first set to n = nmin(m). After each successful test, n is incremented by a step: \(n\leftarrow n+s\). As the model passes additional tests, we have more evidence of strict stationarity and more confidence in the extended P2-algorithm. After each successful test the model is updated.

There are also two alarm mechanisms that may invalidate the model during the period. A single value above or below either of the absolute alarm thresholds may invalidate the model. Too many values above or below either of the adaptive alarm thresholds may invalidate the model.
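The growth of the period length can be sketched as a simple generator. The example values below use nmin(7) = 107, which is quoted in Section 5.3, and the step s = 2m + 3 = 17 suggested there; the generator name `period_lengths` is an assumption of this example.

```python
from itertools import islice


def period_lengths(n_min, step):
    """Yield the period lengths n_min, n_min + s, n_min + 2s, ... of consecutive passed tests."""
    n = n_min
    while True:
        yield n
        n += step  # n <- n + s after each successful test


m = 7
first_periods = list(islice(period_lengths(n_min=107, step=2 * m + 3), 4))
print(first_periods, sum(first_periods))
```

In OPE the sequence ends as soon as a test fails or an alarm invalidates the model, after which the control returns to the initialization.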

5 Technical Details of the Control Loop

5.1 The Model

The model of the control loop consists of the m model percentiles

$$ x_{q_{1}}^{M},\ldots,x_{q_{m}}^{M}, $$
(5)

which, for simplicity, are assumed to include the model quartiles \(x_{0.25}^{M}\), \(x_{0.5}^{M}\), \(x_{0.75}^{M}\). In addition to these, the model contains three periodwise bin counters

$$ \begin{array}{@{}rcl@{}} n_{1}&=&\#\{x_{i} | x_{i}\leq x_{0.25}^{M}, i=1,\ldots,n\},\\ n_{2}&=&\#\{x_{i} | x_{i}\leq x_{0.5}^{M}, i=1,\ldots,n\},\\ n_{3}&=&\#\{x_{i} | x_{i}\leq x_{0.75}^{M}, i=1,\ldots,n\}, \end{array} $$
(6)

which depend on the period length n, two absolute alarm thresholds \(A_{abs}^{-}\) and \(A_{abs}^{+}\), two adaptive alarm thresholds \(A_{ada}^{-}\) and \(A_{ada}^{+}\) and two periodwise counters for the adaptive thresholds:

$$ \begin{array}{@{}rcl@{}} r^{-}&=&\#\{x_{i} | x_{i}< A_{ada}^{-}, i=1,\ldots,n\},\\ r^{+}&=&\#\{x_{i} | x_{i}>A_{ada}^{+}, i=1,\ldots,n\}. \end{array} $$
(7)

The negative sign (−) refers to the lower tail of the empirical distribution and the positive sign (+) to the upper tail. All the counters related to the period are naturally initialized to zero in the beginning of the period.

The natural choices for the thresholds of the adaptive alarm are

$$ A_{ada}^{-}=x_{q_{1}}^{M}\text{ and }A_{ada}^{+}=x_{q_{m}}^{M}, $$
(8)

but, for example, the marker values

$$ A_{ada}^{-}=\widehat{x}_{q_{1}^{*}}\text{ and }A_{ada}^{+}=\widehat{x}_{q_{(2m+1)}^{*}} $$
(9)

can also be selected. The latter choice (9) makes sense if m = 3, otherwise (8) is better as the accuracy of the model percentiles is typically better. If the thresholds (8) are chosen, the expected number of observations in each of the bins \(]-\infty ,A_{ada}^{-}[\) and \(]A_{ada}^{+},\infty [\) is

$$ \lceil q_{1}n\rceil=\lceil(1-q_{m})n\rceil=\left\lceil\frac{n}{m+1}\right\rceil. $$
(10)

The period length n is always set and known before the next period begins. The expected values of the counters are also known beforehand.

The purpose of the absolute alarm thresholds is that, given the model, the events \(x<A_{abs}^{-}\) and \(x>A_{abs}^{+}\) are extremely rare. Since this may require some unavailable prior knowledge, it is always possible to switch off the absolute alarm mechanism by setting \(A_{abs}^{-}=-\infty \) and \(A_{abs}^{+}=\infty \).

The thresholds of the absolute alarms may also be defined in such a way that they depend only on the model percentiles (5). We define them as

$$ A_{abs}^{-}=x_{q_{1}}^{M}-K^{-}(x_{q_{m}}^{M}-x_{q_{1}}^{M}) $$
(11)

and

$$ A_{abs}^{+}=x_{q_{m}}^{M}+K^{+}(x_{q_{m}}^{M}-x_{q_{1}}^{M}), $$
(12)

in which \(K^{-}\) and \(K^{+}\) are configurable parameters. Here the distance \(x_{q_{m}}^{M}-x_{q_{1}}^{M}\) serves as a yardstick, a concept similar to “sigma” (σ) of the normal distribution N(μ, σ²). The choice of the parameters \(K^{-}\) and \(K^{+}\) requires either prior knowledge or an estimation about the tail behavior of the data source.

The interval \([x_{q_{1}}^{M},x_{q_{m}}^{M}]\) is called the body bin of the model distribution. If the model is valid, then it can be assumed that this interval contains the majority of the source distribution mass. Its complement consists of two bins which are called the tail bins.
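To make the model of this section concrete, the sketch below computes the alarm thresholds of Eqs. 8, 11 and 12 and the periodwise counters of Eqs. 6 and 7 for a batch of observations. It is only an illustration: the function names and dictionary keys are assumptions of this example, the counters are computed in batch rather than incrementally, and the decision rule for "too many" tail values is not specified here and is therefore left out.

```python
def alarm_thresholds(model, k_minus=2.0, k_plus=2.0):
    """Adaptive thresholds (Eq. 8) and absolute thresholds (Eqs. 11 and 12)."""
    low, high = model[0], model[-1]
    yardstick = high - low  # width of the body bin
    return {
        "ada_minus": low,
        "ada_plus": high,
        "abs_minus": low - k_minus * yardstick,
        "abs_plus": high + k_plus * yardstick,
    }


def period_counters(values, model, thr):
    """Counters n1, n2, n3 (Eq. 6), r-, r+ (Eq. 7) and the absolute alarm flag."""
    m = len(model)  # assumes m + 1 is divisible by 4 so the quartiles are in the model
    q1, q2, q3 = (model[k * (m + 1) // 4 - 1] for k in (1, 2, 3))
    return {
        "n1": sum(x <= q1 for x in values),
        "n2": sum(x <= q2 for x in values),
        "n3": sum(x <= q3 for x in values),
        "r_minus": sum(x < thr["ada_minus"] for x in values),
        "r_plus": sum(x > thr["ada_plus"] for x in values),
        "absolute_alarm": any(
            x < thr["abs_minus"] or x > thr["abs_plus"] for x in values
        ),
    }


def expected_tail_count(n, m):
    """Expected number of observations in each tail bin, ceil(n / (m + 1)) (Eq. 10)."""
    return -(-n // (m + 1))


model = [4.0, 8.0, 12.0]
thr = alarm_thresholds(model)
counters = period_counters(list(range(1, 17)), model, thr)
print(thr, counters, expected_tail_count(16, len(model)))
```

The default values K⁻ = K⁺ = 2 follow the defaults mentioned in Section 5.3.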

5.2 The Updating of the Model

After successful test(s), the model percentiles are updated by the rule

$$ x_{q_{j}}^{M}\leftarrow(1-\alpha)x_{q_{j}}^{M}+\alpha\widehat{x}_{q_{2j}^{*}}, j=1,\ldots,m, $$
(13)

where 0 ≤ α ≤ 1 is a configurable parameter.

The boundary value α = 0 means that the initialization values are assumed perfect and the model is never changed. In effect, this choice expresses no confidence that the marker values could ever improve the model. The boundary value α = 1 means that the model is updated according to the marker values only. This would be plausible if we could know or trust that the marker values converge to the true percentiles. It can be expected that good choices of α will take into account the size of Mb or the ratio Mb/(2m + 3). This is studied in Section 9.2. The parameter α is called a learning parameter, which is indicative of the confidence assigned to the marker values when a data source is fed as input.

In the phase of model building, Eq. 13 is the only action. In the phase of valid model, the adaptive and absolute alarm threshold values must also be updated, if they depend on the model percentiles as suggested in Eqs. 8, 11 and 12.
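The update rule (13) is a convex combination of the model percentile and the corresponding marker value. A minimal sketch follows; the function name `update_model` and the marker layout (minimum at index 0, so that the marker for \(q_{2j}^{*}\) sits at index 2j) are assumptions of this example.

```python
def update_model(model, markers, alpha):
    """Apply Eq. 13: x_qj <- (1 - alpha) * x_qj + alpha * marker(q*_2j), j = 1..m."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    m = len(model)
    if len(markers) != 2 * m + 3:
        raise ValueError("expected 2m + 3 marker values")
    return [
        (1.0 - alpha) * model[j - 1] + alpha * markers[2 * j]
        for j in range(1, m + 1)
    ]


model = [4.0, 8.0, 12.0]
markers = [1.0, 2.0, 5.0, 6.0, 9.0, 10.0, 13.0, 14.0, 16.0]
updated = update_model(model, markers, alpha=0.5)
print(updated)
```

With α = 0 the function returns the model unchanged, and with α = 1 it returns the even-indexed marker values, matching the two boundary cases described above.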

5.3 Configuration Parameters

Table 1 contains the main configurable parameters. Changing their values affects the output of OPE. In Table 2, we have collected a quick reference on the numerical values that depend on m.

Table 1 The main configuration parameters.
Table 2 Quick reference to the values that depend on m.

The values of nmin(m) in Table 2 show that a simultaneous estimation of several percentiles from the same data requires a lot of data. The default values for N1 and N2 in Table 1 are obtained by first setting N1 + N2 = nmin(m = 7) = 107, see the last column of Table 2. For \(K^{+}\) and \(K^{-}\), values like 1, 2, 3, 4 or 5 can be considered, analogously to “one sigma”, “two sigmas”, and so on, and the defaults are chosen as \(K^{+}=K^{-}=2\). The test level β affects the rate of false positives of the bin boundary test that is explained in Section 8. In addition, we suggest that the increment step could be s = s(m) = 2m + 3.

The data structure of the marker of the extended P2-algorithm requires memory of the size of 4 × (2m + 3) = 8m + 12 floating point numbers. The data structure of the model requires m + 4 floating point numbers and 5 counters which can be integer variables.
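The memory figures above can be tabulated for the considered values of m. The helper name `memory_footprint` is an assumption of this example, and the buffer of Mb values is not included in the count.

```python
def memory_footprint(m):
    """Number of stored values for the marker and the model, as functions of m."""
    return {
        "marker_floats": 4 * (2 * m + 3),  # equals 8m + 12
        "model_floats": m + 4,             # m percentiles and 4 alarm thresholds
        "model_counters": 5,               # n1, n2, n3, r-, r+
    }


for m in (3, 7, 15, 31):
    print(m, memory_footprint(m))
```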

6 The Output of OPE

The percentiles are output periodically. However, there are various choices in defining what the output of OPE should be. First of all, the output should depend on the phase of the control loop.

After n observations, where n ≥ nmin(m) is a variable, at least the m model percentiles are output. In the initialization phase, Mb input values are reduced to 2m + 3 marker values and to m initial model values and it is natural to output at least the m model values. In model building, the period lengths are configurable parameters. It is natural to output the model values after updating. In all cases, the number of input values is reduced to a very small number of output values. There is plenty of room to add metadata without affecting the data reduction much.

After the initialization, the structure of the marker contains information about the minimum and the maximum observed values, denoted in Eq. 4 as x(1) and \(x_{(M_{b})}\). Until the next initialization, the minimum can only become smaller and the maximum can only become larger. Moreover, they are true observed values, not estimates of any kind. It is natural to include them in the output, so at least m + 2 data values are output. We will use the notation x(1) and x(n) for the minimum and maximum of the period, respectively.

Let us consider, for a moment, what values may be computed from the m + 2 output values. In many of the applications, the mean value is of interest. Assuming that the model cumulative distribution function (CDF) \(F^{M}\) is the piecewise linear CDF determined by the m + 2 parameters

$$ \left( x_{(1)},x_{q_{1}}^{M},\ldots,x_{q_{m}}^{M},x_{(n)}\right), $$
(14)

see Section 8.2, then the expected value of \(F^{M}\) is

$$ \overline{x}^{M}=\underbrace{\frac{1}{m+1}\sum\limits_{j=1}^{m}x_{q_{j}}^{M}}_{\text{robust part}}+\underbrace{\frac{x_{(1)}+x_{(n)}}{2(m+1)}}_{\text{noisy part}}. $$
(15)

In addition to the mean \(\overline {x}^{M}\), it is easy to compute the variance of the model distribution \(F^{M}\). The formula of the variance is presented in Section 8.3. Essentially, the uncertainty of \(\overline {x}^{M}\) can also be quantified. Also, the robust and the noisy parts are separated in Eq. 15, which allows checking whether a change in the mean occurs merely due to noise or not.
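Formula (15) can be checked numerically against a direct computation. The sketch below assumes, in line with the equally spaced probabilities of Eq. 2, that the piecewise linear CDF assigns mass 1/(m + 1) uniformly to each interval between consecutive parameters of (14); the function names are assumptions of this example.

```python
import math


def model_mean_parts(x_min, model, x_max):
    """Return the robust and noisy parts of the model mean (Eq. 15)."""
    m = len(model)
    robust = sum(model) / (m + 1)
    noisy = (x_min + x_max) / (2 * (m + 1))
    return robust, noisy


def piecewise_linear_mean(x_min, model, x_max):
    """Mean computed directly: each interval has mass 1/(m + 1) and mean at its midpoint."""
    knots = [x_min, *model, x_max]
    m = len(model)
    return sum((a + b) / 2 for a, b in zip(knots, knots[1:])) / (m + 1)


robust, noisy = model_mean_parts(1.0, [4.0, 8.0, 12.0], 16.0)
direct = piecewise_linear_mean(1.0, [4.0, 8.0, 12.0], 16.0)
print(robust, noisy, robust + noisy, direct)
```

Only the noisy part depends on the observed minimum and maximum, so a change in the mean that appears in the noisy part alone can be attributed to the extremes of the period.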

If the model is valid, then Eq. 15 has a well-justified interpretation as an estimate of the source mean, otherwise it is still the mean of the observed sample. Therefore, we need to have the information about the validity of the model together with the percentile data (14). If the model is invalidated, then the failure reason is informative metadata.

In Table 3, an output called phase code is introduced. These are the numbers in the flow chart of Fig. 1. The purpose of the phase code is to indicate the exact place in the algorithm which caused the output record (16). This indicates the validity of the model.

Table 3 The output phase codes.

After introducing the phase code, the output format can be defined as a record with a generic format

$$ (\text{phase code},\text{percentile data},\text{metadata}). $$
(16)

This means that OPE will transform the input sequence of values into an output sequence of records. The phase code generally helps in interpreting the record correctly and parsing the metadata. The phase code also provides an applicability criterion, see assumption 2 of Section 3. If the output never contains records with the phase code 32, briefly 32-records, then this algorithm is not applicable. Moreover, the proportion of 32-records is a measure of the applicability of OPE. This metric of applicability is used in Section 9.

The phase code is an informative form of metadata. Other metadata, which is beneficial to obtain, is the period length. In the initialization and model building, the period lengths are Mb, N1 + N2, all of which are configuration parameters. In the valid model phase, the period length n is a variable whose final value depends on whether there was an alarm during the period of execution. The information of the period length helps to keep account of the total number of observations handled. In the cases in which the model is invalidated, it is plausible to add the test parameters and the alarm-causing observations to the metadata. Serial numbers for the output data integrity control and time stamps, if available, are also natural metadata.
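
The record format (16) and the applicability metric can be illustrated with a short Python sketch. Only the phase code 32 is taken from the text; the class name `Record`, the metadata key `n`, and the other phase codes in the example are placeholders, since the actual codes are those of Table 3.

```python
from dataclasses import dataclass, field

VALID_MODEL = 32  # phase code of a valid-model record ("32-record")


@dataclass
class Record:
    """One output record: (phase code, percentile data, metadata)."""
    phase_code: int
    percentiles: tuple
    metadata: dict = field(default_factory=dict)


def applicability(records):
    """Proportion of 32-records among all output records."""
    if not records:
        return 0.0
    valid = sum(1 for r in records if r.phase_code == VALID_MODEL)
    return valid / len(records)


# Hypothetical output stream; phase codes 1, 2 and 3 are placeholders.
stream = [
    Record(1, (0.9, 2.1, 3.0), {"n": 45}),
    Record(2, (1.0, 2.0, 3.1), {"n": 120}),
    Record(VALID_MODEL, (1.0, 2.0, 3.0), {"n": 800}),
    Record(3, (1.4, 2.6, 3.9), {"n": 310, "reason": "alarm"}),
]
print(applicability(stream))  # one 32-record out of four records
```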

We emphasize that OPE is not meant to be a change detection algorithm. If change detection is important, then a change detection algorithm can be applied to the output of OPE. The control loop is defined in such a way that the statistical tests in it will regularly give false positives. This happens on purpose, as it is possible to check if there has been a significant change in the source distribution from the history of the output data. Reducing the amount of testing will increase the risk of false negatives which is much worse, because it may not be possible to detect them afterwards from the recorded data alone.

Since n is typically considerably larger than m, the data reduction target is fulfilled. What remains to be shown is that the computational requirements are small and that the information loss depends on the parameters. The latter issue is studied in Section 9. The computational complexity results from the regular sorting of the buffer. Other computations that the control loop requires are described in Sections 7 and 8.

7 Bin frequency tests

The definition of the counters n1, n2 and n3 was given in Eq. 6. In Section 2.7 of [19], a standard approach to bin frequencies for i.i.d. data is given. The exact multinomial model of the bin frequency vector

$$ (n_{1},n_{2}-n_{1},n_{3}-n_{2},n-n_{3}) $$

is approximated by a multivariate normal model. This approach is not computationally fast, so we seek a simpler approach.

The theoretical model behind the bin frequency tests presented below is based on an assumption of independent trials, reducing the use of the multinomial distribution to the binomial distribution by considering conditional counter values and the normal distribution approximation of the binomial distribution.

7.1 The Testing in the 1st Subphase

Consider the following simple, classical model for the first subphase of model building. Let X be a random variable modeling the observations and let \(x_{0.5}^{M}=x_{(\lceil 0.5M_{b}\rceil )}\) be the median value of the model obtained in the initialization process of Fig. 2. Let

$$ p=\mathbb{P}\left\{X\leq x_{0.5}^{M}\right\} $$

and as in Eq. 6

$$ n_{2}=\#\left\{X_{i} | X_{i}\leq x_{0.5}^{M}, i=1,\ldots,n\right\}. $$

If the strict stationarity assumption holds, then the distribution of Xi does not change and, therefore, p is a constant. If the Xi are independent and n is fixed beforehand, then n2 is a binomially distributed random variable with parameters n and p, denoted as \(n_{2}\sim \text {Bin}(n,p)\). The expected value of n2 is np and its variance is np(1 − p). Moreover, \(\widehat {p}=\frac {n_{2}}{n}\) is the maximum likelihood estimator of p. Based on the central limit theorem (see [19]), the following approximation can be made:

$$ \begin{array}{@{}rcl@{}} \mathbb{P}\left\{|\widehat{p}-p|<a\right\}&=&\mathbb{P}\left\{\frac{|n_{2}-np|}{\sqrt{np(1-p)}}<\frac{na}{\sqrt{np(1-p)}}\right\}\\ &&\approx 2{\Phi}\left( a\sqrt{\frac{n}{p(1-p)}}\right)-1=0.95, \end{array} $$
(17)

if \(a=z_{0.975}\sqrt {\frac {p(1-p)}{n}}\), where Φ is the CDF of the standard normal distribution and \(z_{0.975}\) is the 97.5% percentile of Φ. Furthermore, since \(z_{0.975}\approx 1.96\leq 2\) and p(1 − p) ≤ 1/4, we have \(a\leq 1/\sqrt{n}\), so the 95% confidence interval of \(\widehat {p}\) can be written in a simplified form

$$ \left]\widehat{p}-a,\widehat{p}+a\right[\subset \left]\widehat{p}-\frac{1}{\sqrt{n}},\widehat{p}+\frac{1}{\sqrt{n}}\right[. $$

We set a requirement that 0.5 = 1/2 belongs to this simplified approximative 95% confidence interval of \(\widehat {p}\):

$$ \widehat{p}-\frac{1}{\sqrt{n}}<\frac{1}{2}<\widehat{p}+\frac{1}{\sqrt{n}} $$

which is equivalent to

$$ n-2\sqrt{n}<2n_{2}<n+2\sqrt{n}. $$
(18)

Recall that the motivation of the 1st subphase is to quickly recognize and ignore bad initialization data. The length of the first subphase period, n = N1, can be short provided that the normal approximation (17) used above is plausible. Since we can assume that p ≈ 0.5, this is not a hard requirement. Another criterion comes from the extended P2-algorithm. The markers next to the median affect the accuracy of the estimated median. Therefore, the minimum length of the period (the number of trials) should satisfy \(N_{1}\geq n_{\min}(3)\).
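
The acceptance condition (18) needs only the two counters n and n2. A minimal Python sketch follows; the function name `median_test` is hypothetical.

```python
import math


def median_test(n, n2):
    """Check Eq. 18: n - 2*sqrt(n) < 2*n2 < n + 2*sqrt(n).

    n  : number of observations in the first subphase
    n2 : number of those observations at or below the model median
    """
    half_width = 2.0 * math.sqrt(n)
    return n - half_width < 2 * n2 < n + half_width


# With n = 100 the accepted range is 80 < 2*n2 < 120.
print(median_test(100, 50))  # balanced sample
print(median_test(100, 61))  # too many observations below the median
```

Because the inequalities are strict, the boundary cases such as n2 = 40 with n = 100 are rejected.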

7.2 Testing of the Quartiles

First notice that the counters n1, n2 and n3 of Eq. 6 are not independent: \(n_{1}\leq n_{2}\leq n_{3}\). We can start with n2 and get the same condition as Eq. 18, but then we must consider the conditioned events \(n_{1}|n_{2}\) and \((n-n_{3})|(n-n_{2})\). Luckily, the same method applies to conditional confidence intervals, that is, we can set a requirement that

$$ \frac{n_{1}}{n_{2}}-\frac{1}{\sqrt{n_{2}}}<\frac{1}{2}<\frac{n_{1}}{n_{2}}+\frac{1}{\sqrt{n_{2}}} $$
(19)

which is equivalent to

$$ n_{2}-2\sqrt{n_{2}}<2n_{1}<n_{2}+2\sqrt{n_{2}}, $$
(20)

and set a requirement that

$$ \frac{n-n_{3}}{n-n_{2}}-\frac{1}{\sqrt{n-n_{2}}}<\frac{1}{2}<\frac{n-n_{3}}{n-n_{2}}+\frac{1}{\sqrt{n-n_{2}}} $$
(21)

which is equivalent to

$$ n+n_{2}-2\sqrt{n-n_{2}}<2n_{3}<n+n_{2}+2\sqrt{n-n_{2}}. $$
(22)

Note the use of 1/2 instead of 1/4 in Eqs. 19 and 21, respectively, which is due to conditioning.

Together, Eqs. 18, 20 and 22 form the test of a conditional approach started from n2, which we describe symbolically as

$$ n_{1}|n_{2} \leftarrow n_{2} \rightarrow (n-n_{3})|(n-n_{2}). $$

The notation “|” refers to conditioning and the arrow reads as, for example, “if n2 then n1 given n2” (\(n_{2}\rightarrow n_{1}|n_{2}\)). However, there are four more cases that have to be considered and we list them symbolically below:

$$ \begin{array}{@{}rcl@{}} n_{3}&& \rightarrow n_{2}|n_{3} \rightarrow n_{1}|n_{2} \\ n_{3}&& \rightarrow n_{1}|n_{3} \rightarrow n_{2}|(n_{3}-n_{1}) \\ n_{1}&& \rightarrow (n-n_{2})|(n-n_{1}) \rightarrow (n-n_{3})|(n-n_{2})\\ n_{1}& &\rightarrow (n-n_{3})|(n-n_{1}) \rightarrow n_{2}|(n_{3}-n_{1}) \end{array} $$

Each of these leads to a different group of three inequality chains listed below:

$$ \begin{array}{@{}rcl@{}} 3n-4\sqrt{n}&<&4n_{3}<3n+4\sqrt{n}\\ 2n_{3}-3\sqrt{n_{3}}&<&3n_{2}<2n_{3}+3\sqrt{n_{3}}\\ n_{2}-2\sqrt{n_{2}}&<&2n_{1}<n_{2}+2\sqrt{n_{2}}\\\\ 3n-4\sqrt{n}&<&4n_{3}<3n+4\sqrt{n}\\ n_{3}-3\sqrt{n_{3}}&<&3n_{1}<n_{3}+3\sqrt{n_{3}}\\ n_{3}-n_{1}-\sqrt{n_{3}-n_{1}}&<&n_{2}<n_{3}-n_{1}+\sqrt{n_{3}-n_{1}}\\\\ n-4\sqrt{n}&<&4n_{1}<n+4\sqrt{n}\\ 2n+n_{1}-3\sqrt{n-n_{1}}&<&3n_{2}<2n+n_{1}+3\sqrt{n-n_{1}}\\ n_{2}+n-2\sqrt{n-n_{2}}&<&2n_{3}<n_{2}+n+2\sqrt{n-n_{2}}\\\\ n-4\sqrt{n}&<&4n_{1}<n+4\sqrt{n}\\ 2n+n_{1}-3\sqrt{n-n_{1}}&<&3n_{3}<2n+n_{1}+3\sqrt{n-n_{1}}\\ n_{3}-n_{1}-\sqrt{n_{3}-n_{1}}&<&n_{2}<n_{3}-n_{1}+\sqrt{n_{3}-n_{1}} \end{array} $$

The bin frequency test accepts a triple (n1, n2, n3) if at least one of the five inequality chain triples is satisfied. If all of them fail, then the test rejects the triple. The combined length n = N1 + N2 of both subphases must be such that the normal approximation is simultaneously plausible near all of the quartiles, which is one criterion. The extended P2-algorithm requires that the four markers next to the quartiles should also have some accuracy. Therefore, \(N_{1}+N_{2}\geq n_{\min}(7)\) is adequate.
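
The "accept if at least one triple holds" logic can be sketched in Python as follows. For brevity, only two of the five triples are transcribed: the one rooted at n2 (Eqs. 18, 20 and 22) and the first one rooted at n3. The remaining three would be added to the tuple `triples` in the same manner. All function names are hypothetical.

```python
import math


def chain(center, value, half_width):
    """One inequality chain: center - half_width < value < center + half_width."""
    return center - half_width < value < center + half_width


def n2_rooted(n, n1, n2, n3):
    """Triple of Eqs. 18, 20 and 22 (start from the median counter n2)."""
    return (chain(n, 2 * n2, 2 * math.sqrt(n))
            and chain(n2, 2 * n1, 2 * math.sqrt(n2))
            and chain(n + n2, 2 * n3, 2 * math.sqrt(n - n2)))


def n3_rooted(n, n1, n2, n3):
    """First listed triple rooted at n3: n3 -> n2|n3 -> n1|n2."""
    return (chain(3 * n, 4 * n3, 4 * math.sqrt(n))
            and chain(2 * n3, 3 * n2, 3 * math.sqrt(n3))
            and chain(n2, 2 * n1, 2 * math.sqrt(n2)))


def bin_frequency_test(n, n1, n2, n3, triples=(n2_rooted, n3_rooted)):
    """Accept the counters if at least one inequality chain triple holds."""
    return any(triple(n, n1, n2, n3) for triple in triples)


print(bin_frequency_test(400, 100, 200, 300))  # ideal quartile counts
print(bin_frequency_test(400, 110, 221, 319))  # n2-rooted fails, n3-rooted holds
print(bin_frequency_test(400, 400, 400, 400))  # all data below the first quartile
```

The second example shows why several starting points are useful: the median chain alone would reject counters that are still consistent when the test is started from the upper quartile.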

7.3 Adaptive Alarm Bin Counts

Assume that Eq. 8 are selected and let \(r^{-}\) and \(r^{+}\) denote the adaptive alarm bin counters. Since n is fixed, \(r^{-}\) and \(r^{+}\) are not independent. The binomial model is again applicable, in which the success probability can be assumed to be p = 1/(m + 1) and it can be assumed that \(n>(m+1)^{2}\). In the beginning of a period, we can set approximate 95% bounds to, for example, \(r^{+}\) as

$$ \frac{n}{m+1}-\sqrt{n}<r^{+}<\frac{n}{m+1}+\sqrt{n} $$
(23)

but the conditioning \(r^{-}|r^{+}\) leads to

$$ \frac{n-r^{+}}{m+1}-\sqrt{n-r^{+}}<r^{-}<\frac{n-r^{+}}{m+1}+\sqrt{n-r^{+}} $$
(24)

which does not work similarly since the final value of \(r^{+}\) is not known when the period begins. However, if the \(r^{+}\) in Eq. 24 is understood as the current value, not the final value, then it can be implemented. Naturally, the roles of \(r^{+}\) and \(r^{-}\) in Eqs. 23 and 24 must also be considered in the opposite way:

$$ \frac{n}{m+1}-\sqrt{n}<r^{-}<\frac{n}{m+1}+\sqrt{n} $$
(25)

and

$$ \frac{n-r^{-}}{m+1}-\sqrt{n-r^{-}}<r^{+}<\frac{n-r^{-}}{m+1}+\sqrt{n-r^{-}}. $$
(26)

Finally, the lower bounds in Eqs. 24 and 26 can be dropped since the bin frequency test at the end of the period will handle them. During the period, we are only concerned about drifts away from the current body bin of the observations and take care that the sum \(r^{-}+r^{+}\) will not become too large.
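
A possible way to evaluate the upper bounds of Eqs. 23 to 26 with the current counter values is sketched below in Python. The text does not state how the two orderings are combined into a single alarm decision, so the hypothetical function `alarm_counts_within_bounds` returns both results separately and leaves that policy to the caller.

```python
import math


def alarm_counts_within_bounds(n, m, r_minus, r_plus):
    """Evaluate the upper bounds of Eqs. 23-26 for the current counters.

    n       : planned period length
    m       : number of model percentiles, so p = 1/(m + 1)
    r_minus : current count in the lower alarm bin
    r_plus  : current count in the upper alarm bin
    Returns (plus_first, minus_first): the ordering r+ then r-|r+
    (Eqs. 23 and 24), and the ordering r- then r+|r- (Eqs. 25 and 26).
    """
    def upper(trials):
        return trials / (m + 1) + math.sqrt(trials)

    plus_first = r_plus < upper(n) and r_minus < upper(n - r_plus)
    minus_first = r_minus < upper(n) and r_plus < upper(n - r_minus)
    return plus_first, minus_first


# n = 400 and m = 3 give the unconditional upper bound 100 + 20 = 120.
print(alarm_counts_within_bounds(400, 3, r_minus=90, r_plus=100))
print(alarm_counts_within_bounds(400, 3, r_minus=0, r_plus=130))
```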

8 Bin Boundary Test

The bin boundary test is based on the following theorem:

Theorem 1

Let 0 < q1 < … < qm < 1. If X1,…, Xn are i.i.d., \(X_{i}\sim F\) in which the qj percentiles of F are \(x_{q_{j}}\) and F is strictly increasing in the neighborhoods of each \(x_{q_{j}}\), then for all ε > 0

$$ \mathbb{P}\left\{\left|\frac{1}{m}\sum\limits_{j=1}^{m}\left( X_{(\lceil q_{j}n\rceil)}-x_{q_{j}}\right)\right|>\varepsilon\right\}\leq 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon,j}^{2}} $$

where

$$ \delta_{\varepsilon,j}=\min\left\{F(x_{q_{j}}+\varepsilon)-q_{j}, q_{j}-F(x_{q_{j}}-\varepsilon)\right\}. $$
(27)

The proof of Theorem 1 in case m = 1 is given, for example, in Section 2.3.2. of the book [19]. Extending the proof from [19] to the more general case, stated above, requires basic probabilistic reasoning based on the triangle inequality. We omit the proof since it is more important to describe how it is used.

We use the marker \(\widehat {x}_{q_{2j}^{*}}\) of the extended P2-algorithm to approximate \(X_{(\lceil q_{j}n\rceil )}\). Furthermore, the CDF \(F^{M}\) is strictly increasing with percentiles \(x_{q_{j}}^{M}\), so we use \(F^{M}\) in place of F and \(x_{q_{j}}^{M}\) in place of \(x_{q_{j}}\).

The probabilistic interpretation of \(\delta_{\varepsilon,j}\) is that the smaller it is, the more uncertain the estimation of \(x_{q_{j}}\) is. Theorem 1 states that the most uncertain percentile dominates the convergence of all the estimates. Figure 3 visualizes the relationship between ε and \(\delta_{\varepsilon,j}\) when \(F=F^{M}\).

Figure 3 Interpretation of \(\delta_{\varepsilon,j}\) is easy with \(F^{M}\).

What remains to be specified is ε > 0. When the piecewise linear \(F^{M}\) is used as the model, the selection of ε depends on the scale of the bins. To simplify the formulas that follow, we adopt the following notation:

$$ \begin{array}{@{}rcl@{}} q_{0}&=&0\text{ and }x_{q_{0}}^{M}=x_{(1)} \end{array} $$
(28)
$$ \begin{array}{@{}rcl@{}} q_{m+1}&=&1\text{ and }x_{q_{m+1}}^{M}=x_{(n)}. \end{array} $$
(29)

Note that \(x_{(1)}\) and \(x_{(n)}\), the period’s minimum and maximum observations, are not part of the model of the control loop, but they are part of the piecewise linear CDF model \(F^{M}\). The superscript M is, therefore, justified.

With the notation of Eqs. 28 and 29, Eq. 27 for \(\delta_{\varepsilon,j}\) simplifies to

$$ \delta_{\varepsilon,j}=\min\left\{\frac{1}{(x_{q_{j}}^{M}-x_{q_{j-1}}^{M})}, \frac{1}{(x_{q_{j+1}}^{M}-x_{q_{j}}^{M})}\right\}\frac{\varepsilon}{m+1}, $$

for j = 1,…, m. Also, using the notation of Eqs. 28 and 29, we define the maximum and the minimum bin widths as follows:

$$ \begin{array}{@{}rcl@{}} \varepsilon_{1}&=&\max\left\{x_{q_{j}}^{M}-x_{q_{j-1}}^{M}| j=1,\ldots,m+1\right\} \end{array} $$
(30)
$$ \begin{array}{@{}rcl@{}} \varepsilon_{2}&=&\min\left\{x_{q_{j}}^{M}-x_{q_{j-1}}^{M}| j=1,\ldots,m+1\right\}. \end{array} $$
(31)

Now, if \(\varepsilon\geq\varepsilon_{1}\), then for all j = 1,…, m

$$ \delta_{\varepsilon,j}\geq\frac{1}{m+1}, $$
(32)

and if \(\varepsilon\leq\varepsilon_{2}\), then for all j = 1,…, m

$$ \delta_{\varepsilon,j}\leq\frac{1}{m+1}. $$
(33)

We want to minimize the use of \(F^{M}\) since it is only one out of infinitely many CDFs which have the same percentiles \(x_{q_{j}}^{M}\), j = 1,…, m. For computational feasibility with \(F^{M}\), choosing any ε < ε2 would seem attractive, but it also means assuming that \(F^{M}\) is an accurate model of the source distribution. This is hardly true, at least if m = 3. The accuracy improves as m increases.

The choices of ε = ε1/m and ε = ε2/m both seem quite plausible. When the model is updated, the values of ε1 or ε2 are recomputed and the test will remain the same.

Assume that either ε = ε1/m or ε = ε2/m has been selected. The bin boundary test accepts the current marker values \(\widehat {x}_{q_{2j}^{*}}\), j = 1,…, m, if

$$ \left|\frac{1}{m}\sum\limits_{j=1}^{m}\left( \widehat{x}_{q_{2j}^{*}}-x_{q_{j}}^{M}\right)\right|\leq\varepsilon $$
(34)

or else it accepts if

$$ 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon,j}^{2}}>\beta $$
(35)

or else in all the other cases it results in rejection. The parameter β is assumed to have values such as β = 0.05 or β = 0.01 (default).
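
The decision rule of Eqs. 34 and 35 can be sketched in Python as follows, using the simplified expression of \(\delta_{\varepsilon,j}\) under the piecewise linear model. The function names and the example numbers are hypothetical, and the sketch assumes that the knots \(x_{(1)}<x_{q_{1}}^{M}<\ldots<x_{q_{m}}^{M}<x_{(n)}\) are strictly increasing, so that no bin has zero width.

```python
import math


def deltas(x_min, percentiles, x_max, eps):
    """delta_{eps,j}, j = 1..m, under the piecewise linear model."""
    knots = [x_min, *percentiles, x_max]  # x_{q_0}, ..., x_{q_{m+1}}
    m = len(percentiles)
    return [
        min(1.0 / (knots[j] - knots[j - 1]),
            1.0 / (knots[j + 1] - knots[j])) * eps / (m + 1)
        for j in range(1, m + 1)
    ]


def bin_boundary_test(markers, x_min, percentiles, x_max, n, eps, beta=0.01):
    """Return True (accept) or False (reject) for the current markers."""
    m = len(percentiles)
    deviation = abs(sum(a - b for a, b in zip(markers, percentiles)) / m)
    if deviation <= eps:                      # Eq. 34
        return True
    bound = 2.0 * sum(math.exp(-2.0 * n * d * d)
                      for d in deltas(x_min, percentiles, x_max, eps))
    return bound > beta                       # Eq. 35


model = [1.0, 2.0, 3.0]          # hypothetical model percentiles, m = 3
eps = 1.0 / len(model)           # eps = eps_2 / m with minimum bin width 1
shifted = [2.0, 3.0, 4.0]        # markers drifted by one bin width

print(bin_boundary_test([1.1, 2.0, 2.9], 0.0, model, 4.0, n=1000, eps=eps))
print(bin_boundary_test(shifted, 0.0, model, 4.0, n=10, eps=eps))
print(bin_boundary_test(shifted, 0.0, model, 4.0, n=1000, eps=eps))
```

In the second call the deviation exceeds ε, but with only n = 10 observations the bound of Theorem 1 is still above β, so the markers are accepted. With n = 1000 the same deviation leads to rejection.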

8.1 The Minimum Period Length \(n_{\min}(m)\)

Based on Theorem 1, we define \(n_{\min}(m,\varepsilon)\) such that

$$ 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon,j}^{2}}<\frac{1}{2}\text{ when }n\geq n_{min}(m,\varepsilon). $$
(36)

The idea is that the probability estimate of Theorem 1 is not trivial when \(n\geq n_{\min}(m,\varepsilon)\).

The choice \(\varepsilon=\varepsilon_{1}\) of Eq. 30 gives the lower bound (32) for \(\delta_{\varepsilon,j}\), which does not depend on ε or j and, therefore, gives an upper bound for the probability bound of Theorem 1. We can find a candidate for \(n_{\min}\) if we set a requirement that

$$ 2\sum\limits_{j=1}^{m}e^{-2n\delta_{\varepsilon_{1},j}^{2}}\leq2me^{-2n/(m+1)^{2}}<\frac{1}{2} $$

from which \(n_{\min}(m)=n_{\min}(m,\varepsilon_{1})\) is

$$ n_{min}(m)=\left\lceil\frac{1}{2}(m+1)^{2}\log 4m\right\rceil. $$
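
The closed form above is cheap to evaluate. A minimal Python sketch, in which the function name `n_min` is hypothetical and `log` is taken to be the natural logarithm as in the derivation:

```python
import math


def n_min(m):
    """Minimum period length: ceil(0.5 * (m + 1)^2 * log(4 m))."""
    return math.ceil(0.5 * (m + 1) ** 2 * math.log(4 * m))


for m in (3, 7, 15):
    print(m, n_min(m))
```

For the requirements of Section 7, this gives \(n_{\min}(3)=20\) as the lower limit of N1 and \(n_{\min}(7)=107\) as the lower limit of N1 + N2.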

8.2 The Piecewise Linear CDF Model

The piecewise linear model \(F^{M}\) is used mainly for visualization purposes. However, in the bin boundary test, it is also used to model the source distribution near the bin boundary points \(x_{q_{j}}^{M}\), j = 1,…, m. With the notation of Eqs. 28 and 29, the formula is the following, in which the middle case applies for j = 1,…, m + 1:

$$ F^{M}(x)=\left\{\begin{array}{rl} 0, & x<x_{q_{0}}^{M}\\ \frac{x-x_{q_{j-1}}^{M}}{(m+1)(x_{q_{j}}^{M}-x_{q_{j-1}}^{M})}+\frac{j-1}{m+1},&x_{q_{j-1}}^{M}\leq x<x_{q_{j}}^{M}\\ 1, & x\geq x_{q_{m+1}}^{M} \end{array}\right. $$
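
A minimal Python sketch of evaluating this piecewise linear CDF from one record is given below. The function name `model_cdf` and the example record are hypothetical, and strictly increasing knots are assumed.

```python
from bisect import bisect_right


def model_cdf(x, x_min, percentiles, x_max):
    """Evaluate the piecewise linear model CDF at x."""
    knots = [x_min, *percentiles, x_max]  # x_{q_0}, ..., x_{q_{m+1}}
    m = len(percentiles)
    if x < knots[0]:
        return 0.0
    if x >= knots[-1]:
        return 1.0
    # j is the bin index with knots[j-1] <= x < knots[j], 1 <= j <= m + 1.
    j = bisect_right(knots, x)
    lo, hi = knots[j - 1], knots[j]
    return ((x - lo) / (hi - lo) + (j - 1)) / (m + 1)


# Hypothetical record with m = 3: minimum 0, quartiles 1, 2, 10, maximum 20.
for x in (-1.0, 1.0, 2.0, 6.0, 10.0, 20.0):
    print(x, model_cdf(x, 0.0, [1.0, 2.0, 10.0], 20.0))
```

By construction, the model percentiles are reproduced exactly: the value at the j:th percentile is j/(m + 1).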

8.3 The Variance of \(F^{M}\)

Let \(X\sim F^{M}\). Since \(\text {Var}(X)=\mathbb {E}(X^{2})-(\mathbb {E}X)^{2}\) and, since the formula \(\mathbb {E}(X)=\overline {x}^{M}\) is already given in Eq. 15, only the second moment remains to be computed. According to Eqs. 28 and 29, the formula is

$$ \mathbb{E}(X^{2})=\frac{1}{3(m+1)}\sum\limits_{j=1}^{m+1}\left[\left( x_{q_{j}}^{M}\right)^{2}+x_{q_{j}}^{M}x_{q_{j-1}}^{M}+\left( x_{q_{j-1}}^{M}\right)^{2}\right]. $$
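
The variance can be sketched in Python as follows. Every term of the sum is read as belonging to the summation over the m + 1 bins, which is what a uniform distribution within each bin implies. The function name `model_variance` is hypothetical.

```python
def model_variance(x_min, percentiles, x_max):
    """Variance of the piecewise linear model: E(X^2) - (E X)^2."""
    knots = [x_min, *percentiles, x_max]  # x_{q_0}, ..., x_{q_{m+1}}
    m = len(percentiles)
    # Mean as in Eq. 15.
    mean = sum(percentiles) / (m + 1) + (x_min + x_max) / (2 * (m + 1))
    # Second moment: each bin [a, b] contributes (b^2 + a*b + a^2) / 3
    # with probability weight 1 / (m + 1).
    second = sum(b * b + a * b + a * a
                 for a, b in zip(knots, knots[1:])) / (3 * (m + 1))
    return second - mean * mean


# Equally wide bins on [0, 4] reduce the model to a uniform distribution,
# whose variance is (4 - 0)^2 / 12.
print(model_variance(0.0, [1.0, 2.0, 3.0], 4.0))
```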

9 Testing of OPE with the PoC

We have implemented a version of OPE using the programming language C [11]. It contains all the features described here except for the time stamps and the FIFO input queue which is not part of OPE itself. In this publication, this implementation is called the proof of concept (PoC). In this Section, we will present a collection of tests which are based on this PoC.

Even though OPE is designed to be an online algorithm, only offline testing is discussed here. Offline testing allows the PoC to be run with exactly the same input but with different values for the parameters. Also, the focus is on the correspondence between the source percentiles and the model percentiles of OPE. Our testing framework is to consider OPE as a deterministic (time) one-pass (or single-pass) algorithm, because this framework already has some theoretical results about the space complexity, see [13]. The space complexity of a deterministic one-pass algorithm which computes the (population) median of a file of N numbers is Ω(N), that is, at least of order N asymptotically.

If the global estimates of the source percentiles are computed from all 32-records’ temporally local estimates, (e.g., the mean of the percentiles), this whole offline process can be considered as a deterministic, one-pass algorithm with a large input file as a source. This is achieved when the offline data source file is an output of a simulated, stationary time series. The abstract source percentiles are then practically the same as the population percentiles of the simulated file. In this special case, the size of all the output records (16) corresponding to a given input file is considered as a measure of the space complexity of OPE and it is Ω(N).

We mention just briefly that the PoC has been tested in internal projects at VTT with light sensor data, wireless mesh network delay measurement data, and vibration sensor data from a revolving machine.

9.1 Performance Metrics

First, we present the performance metrics which are considered within the above described framework of a deterministic, single-pass algorithm. These metrics are accuracy, bandwidth savings and the buffer size.

9.1.1 Accuracy

Measuring accuracy is reasonable only when the true source percentiles \(x_{q_{j}}^{src}\), j = 1,…, m, exist and are known. Also, only 32-records can be assumed to contain accurate information about the source percentiles. For brevity, we will only report an overall accuracy which we define next. First, let

$$ X_{q_{j}}^{32}=\frac{1}{n_{32}}\sum\limits_{l=1}^{n_{32}}X_{q_{j},l}^{M},j=1,\ldots,m, $$
(37)

in which n32 is the number of 32-records and \(X_{q_{j},l}^{M}\) is the qj:th model percentile value reported in the l:th 32-record. The averaging in Eq. 37 is reasonable in the test framework since all the 32-records model the same stationary state. The overall accuracy can be defined as the average over the m differences between the estimates \(X_{q_{j}}^{32}\) and the true values:

$$ \frac{1}{m}\sum\limits_{j=1}^{m}\left( X_{q_{j}}^{32}-x_{q_{j}}^{src}\right). $$
(38)

This is a single number and the average over m allows a comparison between different m.
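
Eqs. 37 and 38 can be sketched in Python as follows. The function names and the numbers are hypothetical; each 32-record is represented only by its tuple of m model percentiles.

```python
def averaged_percentiles(records_32):
    """Eq. 37: average each model percentile over all 32-records."""
    n32 = len(records_32)
    m = len(records_32[0])
    return [sum(rec[j] for rec in records_32) / n32 for j in range(m)]


def overall_accuracy(records_32, src_percentiles):
    """Eq. 38: mean signed difference between estimates and true values."""
    estimates = averaged_percentiles(records_32)
    m = len(src_percentiles)
    return sum(e - s for e, s in zip(estimates, src_percentiles)) / m


records = [(-1.0, 0.0, 1.0), (-0.9, 0.1, 0.9)]   # two hypothetical 32-records
source = (-0.95, 0.0, 0.95)                      # hypothetical true percentiles
print(overall_accuracy(records, source))
```

Since the differences in Eq. 38 are signed, errors of opposite sign at different percentiles can cancel each other in this single number.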

9.1.2 Bandwidth Savings

The IoT application idea of OPE is that the output records (16) are not stored where they are computed, but transmitted to a cloud server. Therefore, bandwidth savings, defined as

$$ \frac{\text{Input}(Bytes)-\text{Output}(Bytes)}{\text{Input}(Bytes)}, $$

is a plausible metric. The output is as produced by the PoC, which includes the phase code, the model percentiles, the period lengths and some other metadata on failure records. The output size depends on the parameters of the PoC. All the output records are included in Output(Bytes).
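
As a small worked example of this ratio, the Python sketch below uses purely illustrative sizes: 8 bytes per input value and a total output of 100000 bytes are assumptions, not measurements from the PoC.

```python
def bandwidth_savings(input_bytes, output_bytes):
    """(Input - Output) / Input, with both sizes in bytes."""
    return (input_bytes - output_bytes) / input_bytes


input_bytes = 500000 * 8   # assumed: 500000 values stored as 8-byte numbers
output_bytes = 100000      # assumed total size of all output records
print(bandwidth_savings(input_bytes, output_bytes))
```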

9.1.3 The Size of the Buffer \(M_{b}\)

As the content of the buffer is sorted frequently and sorting is computationally expensive, the buffer size is a performance metric.

9.2 Study Case: Buffer Size vs. The Learning Parameter

As we discussed in Section 5.2, the relationship between the buffer size Mb and the learning parameter α is of interest. The effect of α is included in the performance case study which is described here.

9.2.1 Input Files

The input files were simulated realizations from a certain moving average (MA) model of order r which we will specify here. See Chapter 3.5.5 of [14], Section 3.4.3 of [5] or Chapter 3 in [4] for general technical details of MA(r) time-series models. The different orders r = 2,4,8,16,32 and 64 are considered. For each r, there was a separate input file generated with N = 500000 values. The specific MA(r) model used here is the following. Assume r is chosen and let Zt be an infinite sequence of independent N(0,1) distributed random variables and define

$$ X_{t}=Z_{t}+{\sum}_{i=1}^{r}\beta_{i}Z_{t-i}, $$
(39)

in which the parameters βi are chosen as

$$ \beta_{i}=\frac{1}{\sqrt{r}}, i=1,\ldots,r. $$
(40)

With these parameters, the corresponding MA(r) model is invertible. Invertibility ensures that there is a unique MA(r) model for a given autocorrelation function (ACF) [5]. With these choices \(\mathbb {E}(X_{t})=0\) for all t and, due to independence of Zt,

$$ \text{Var}(X_{t})=1+\sum\limits_{i=1}^{r}{\beta_{i}^{2}}=1+r\left( \frac{1}{\sqrt{r}}\right)^{2}=2. $$
(41)

The ACF at lag k ≥ 0 is

$$ \rho(k)=\left\{\begin{array}{cl} 1 & k=0,\\ \frac{\sqrt{r}+r-k}{2r} & k=1,\ldots,r,\\ 0 & k>r. \end{array}\right. $$
(42)

For lags k < 0, the ACF is defined as ρ(−k) and, therefore, ρ(k) ≥ 0 for all lags k. The time series (Xt) is strictly stationary and, since Zt are N(0,1) distributed, by Eq. 41 the Xt are N(0,2) distributed. The source percentiles \(x_{q_{j}}^{src}\) of the distributions of Xt are the same with all the input files and their numerical values are known with high precision.
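
The following Python sketch shows one way to generate such input data and to evaluate Eq. 42 and the source percentiles. It is not the generator used for the reported tests: the function names and the seed are hypothetical, and the equally spaced levels \(q_{j}=j/(m+1)\) are an assumption that matches the bin probability 1/(m + 1) used earlier.

```python
import math
import random
from statistics import NormalDist


def simulate_ma(r, n, seed=0):
    """Simulate n values of X_t = Z_t + (1/sqrt(r)) * sum_{i=1..r} Z_{t-i}."""
    rng = random.Random(seed)
    beta = 1.0 / math.sqrt(r)
    z = [rng.gauss(0.0, 1.0) for _ in range(n + r)]
    window = sum(z[:r])              # Z_{t-r} + ... + Z_{t-1} for t = r
    xs = []
    for t in range(r, n + r):
        xs.append(z[t] + beta * window)
        window += z[t] - z[t - r]    # slide the window one step forward
    return xs


def theoretical_acf(r, k):
    """Autocorrelation of Eq. 42, extended to negative lags by symmetry."""
    k = abs(k)
    if k == 0:
        return 1.0
    if k <= r:
        return (math.sqrt(r) + r - k) / (2 * r)
    return 0.0


def source_percentiles(m):
    """Percentiles of N(0, 2) at the levels j/(m + 1), j = 1..m."""
    dist = NormalDist(mu=0.0, sigma=math.sqrt(2.0))
    return [dist.inv_cdf(j / (m + 1)) for j in range(1, m + 1)]


xs = simulate_ma(r=4, n=1000)
print(len(xs), theoretical_acf(4, 1), source_percentiles(3))
```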

The previously defined MA(r) models with the ACF (42) were chosen because, with larger orders r, they have the property that the realizations tend to have long sequences of consecutive values which are either above or below zero. Zero is also the source median for all of the specified MA(r) models. Therefore, the property ρ(k) > 0 for all the lags |k|≤ r challenges the initialization and the model building phases of the control loop, which are the novel features. This is visualized in Fig. 4. If it were that ρ(k) < 0 with some small lag k, then the initialization and the model building would benefit from this.

Figure 4 Applicability when maximized over all parameters (k, h).

9.2.2 Parameter Ranges

For each of the six input files, there are different values considered for the parameter m = 3,7,15. For each of these m, the buffer sizes

$$ M_{b}(k)=k\times(2m+3), k=1,\ldots,12, $$
(43)

are considered. Thus, the buffer sizes are multiples of 2m + 3, which is the minimal possible buffer size. For each pair (m, Mb(k)), the discretized learning parameter values

$$ \alpha(h)=\frac{h}{24}, h=0,\ldots,24, $$
(44)

are considered. Then α(h) − α(h − 1) = 1/24 ≈ 0.04. Note that the end points 0 = α(0) and 1 = α(24) are included. Each time-series file is then given as input to the PoC with different (k, h) values for the parameters (m, Mb(k), α(h)). The accuracy and the bandwidth savings are then considered as functions of (m, Mb, α). Other parameters were fixed to their defaults shown in Table 1.
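
The parameter grid of Eqs. 43 and 44 can be enumerated with a few lines of Python. The generator name `parameter_grid` is hypothetical; running the PoC for each combination is outside the scope of this sketch.

```python
def parameter_grid(ms=(3, 7, 15), k_max=12, h_steps=24):
    """Yield (m, M_b(k), alpha(h)) for all combinations of Eqs. 43 and 44."""
    for m in ms:
        for k in range(1, k_max + 1):
            buffer_size = k * (2 * m + 3)    # Eq. 43
            for h in range(h_steps + 1):
                yield m, buffer_size, h / h_steps   # Eq. 44


grid = list(parameter_grid())
print(len(grid))             # combinations per input file
print(grid[0], grid[-1])     # smallest and largest configuration
```

With three values of m, twelve buffer sizes and 25 values of α, there are 900 combinations per input file, that is, 5400 runs for the six files.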

9.2.3 Results

First, a study of the applicability of OPE for the specified MA(r) input files with different orders r was made. The proportion of 32-records in the output is the criterion of applicability as defined in Section 6. In Fig. 4, an exhaustive search was made over all pairs (k, h), k = 1,…,12 and h = 0,…,24, to find the pair with the maximum applicability for the given m and r. If every initialization were successful and always ended up in exactly one 32-record, then the proportion of 32-records would be almost exactly 25%. The proportion of 32-records can exceed the 25% level if the control loop model stays valid, recall the flow chart of Fig. 1. The applicability decreases when the autocorrelation (42) becomes worse, since the statistical representativeness of the available data in the initialization and model building phases becomes poorer.

In Fig. 5, the discretizing parameter pairs (k, h) which produced the maximum applicability in Fig. 4, for the given m and r, are visualized in the space of (k, h). The pairs (k, h) are not necessarily unique. In the non-unique case, the pair with the smallest k is shown. The cases m = 3,7,15 are considered and for each m the points corresponding to input data of orders r = 2,4,8,16,32 and 64 are plotted and joined. The end points with r = 64 are marked. The horizontal axis is in the scale k = Mb(k)/(2m + 3) only. The vertical axis is given in both the h and the α(h) scale. There is not much structure in Fig. 5, but it appears that two simple policies may be behind the best applicability: either the buffer size Mb is large and α is small, or a smaller buffer size is accompanied by a larger α. The notable things are that the choice of the parameters Mb and α can significantly affect the performance when the applicability is low (r = 64) and that the maximum considered buffer size Mb(12) = 12(2m + 3) is not automatically needed.

Figure 5 The parameter pairs (k, h) which provide the maximum applicability.

Figure 6 shows how the accuracy depends on the applicability. The gray area is a rough estimate of the 99% confidence interval in the case m = 15, obtained from Eq. 38 by assuming in Eq. 37 that (see Chapter 2.1.1 of [19])

$$ X_{q_{j}}^{32}\sim N\left( x_{q_{j}}^{src},\frac{(1-q_{j})q_{j}}{n_{32}}\right),j=1,\ldots,m, $$
(45)

which is considered as the null hypothesis of accuracy in our test framework.

Figure 6 Accuracy vs. applicability.

Finally, the bandwidth savings are presented in Fig. 7. They are computed for the maximizing pairs (k, h). If the applicability of OPE is high and the data source is stationary, then the bandwidth savings with m = 15 can be more than 97%. This indicates that there is not much information in the simulated input data with r = 2, which can be interpreted as almost white noise with a minor memory effect. Recall that all the output records are included in the bandwidth savings.

Figure 7 Bandwidth savings.

If the buffer size Mb is limited beforehand in an application, but some time-series samples of the application data are available in the design phase, one can try to estimate a sufficient value for the α of OPE in a similar way as presented in this section. This would involve estimating the sample autocorrelation function of the application data, simulating data with a similar autocorrelation structure, feeding it to OPE and then choosing the α with the best applicability, that is, the largest proportion of 32-records.

9.3 Complexity Issues

Our control loop alternates between sorting a fixed-size buffer Mb and the extended P2-algorithm. Even though the general complexity of sorting is of the order \(O(N\log N)\), since the buffer size is constant, the complexity of the sorting phases is at most of constant order, \(O(M_{b}\log M_{b})=O(1)\). Also, the control loop adds complexity to the whole and our main effort has been to guarantee that this complexity of the control loop is also constant. Thus, the overall complexity of OPE is of constant order, but this constant is not the minimal possible. Indeed, there is room for further research on developing an algorithm with lower complexity. For example, replacing the P2-algorithm by the other algorithms that were discussed in Section 2 may provide a controlled algorithm with lower complexity.

10 Conclusion

We have described a sequential, online algorithm OPE, which either computes or estimates the temporal percentiles of the observed, empirical distribution of the data. This is done without storing all of the observations. It works with a small, fixed-size amount of memory defined by parameters and with limited CPU power. It can be used for the statistical reduction of a numerical data stream. It transforms the univariate input stream into an output stream of vectors in which each vector can be interpreted as an empirical distribution of data starting immediately after the previous vector. For IoT applications, there is a significant reduction in the amount of data (bytes) that needs to be transmitted to a cloud server. The output data is structured in a (cumulative) histogram format, which makes it possible to quickly recognize such trends or anomalies that are visible in the histogram scale. We provided requirements that should be satisfied when OPE is applied and we illustrated that OPE contains a built-in applicability metric, the proportion of 32-records. We also presented a performance study, which was specially designed to challenge the control loop part of OPE. A lesson learned from this performance study is that 32-records are self-informative in the sense that their interpretation does not require more metadata than the knowledge that they are 32-records and they remain sufficiently accurate even when the applicability of OPE decreases.