Abstract
The article presents an algorithm for fast and errorfree determination of statistics such as the arithmetic mean and variance of all contiguous subsequences and fixedlength contiguous subsequences for a sequence of industrial measurement data. Additionally, it shows that both floatingpoint and integer representation can be used to perform this kind of statistical calculations. The author proves a theorem on the number of bits of precision that an arithmetic type must have to guarantee errorfree determination of the arithmetic mean and variance. The article also presents the extension of Welford’s formula for determining variance for the sliding window method—determining the variance of fixedlength contiguous subsequences. The section dedicated to implementation tests shows the running times of individual algorithms depending on the arithmetic type used. The research shows that the use of integers in calculations makes the determination of the aforementioned statistics much faster.
1 Introduction
The development of measurement, computational and data storage technologies provided us with huge amounts of information that can form the basis for various analyses. Statistics is one of the basic tools used to achieve this purpose. The need to process everincreasing sets, however, means that the methods developed a few decades ago are no longer useful today due to their low efficiency. Consequently, a need arises to constantly improve the existing algorithms and propose new solutions. Data from production processes of enterprises serve as a good example of an area where the amount of data collected has been growing rapidly. Currently, we are able to measure the manufactured goods and the condition of the production line at almost every stage of the process. This way, huge amounts of numerical data are collected and can be used, among others, in advanced methods of quality control and predictive maintenance. In this context, the concept of statistical process control created by Shewart (1931), Deming (1986) takes on a whole new dimension—a statistical analysis of large data sets. In general, analysis of statistical properties of a sequence of measurement data is a wellknown issue. There are many proven methods to verify specific statistical hypotheses (Selvamuthu and Das 2018). They all assume, however, that we have a welldefined data sequence that we verify. The situation is completely different when we look for contiguous subsequences having specific statistical properties within a certain data sequence. The number of potential possibilities poses a problem of high computational complexity of the approach consisting in independent analysis of all possible contiguous subsequences. To illustrate this fact, we will use the example of the arithmetic mean calculated for all possible contiguous subsequences of a sequence consisting of \(10^5\) measurements. This task will take approximately \(1.7 \cdot 10^{14}\) floatingpoint operations. Assuming that a single core of a typical processor is now capable of performing about \(5 \cdot 10^9\) operations, it would take about 9.5 hours to make these calculations. Consequently, it is reasonable to look for such algorithmic solutions that will allow us to accelerate this type of analyses.
The question obviously arises: why set statistics for all contiguous subsequences? This is due to the fact that the analysis of statistics for the entire measurement series does not give us an answer if there were any situations during production that should be checked more closely. However, if we analyze all measurement contiguous subsequences, then we will be able to indicate the time periods in which something unexpected happened. In this way, for example, you can try to detect employee mistrust, technical problems with machines or problems with the quality of raw material. In addition, the analysis of individual contiguous subsequences gives us accurate information about what time period interesting events took place. To better present this idea, we decided to examine a series of \(10^5\) measurements from the production line of one of the factories. This series corresponds to about 30 days of production. The measured value was the diameter of the aluminum piston of an internal combustion engine. The measurements were carried out for each item produced. Ideally, these measurements should have a normal distribution. Therefore, we decided to test about \(10^6\) overlapping measuring contiguous subsequences of lengths \(10^3, 2\cdot 10^3, 3\cdot 10^3, \dots , 10\cdot 10^3\). Each of the sequences \((x_1, x_2, \dots , x_n)\) was subjected to a quick statistical test, which verified that all the elements of the sequence belong to a specific range \([\mu d, \mu +d]\). The range was chosen in such a way that, assuming the normality of samples distribution, all samples were in the range with a probability of \(1\alpha \). Verification of this property is very quick, because it comes down to checking the conditions \(\min \{x_i\} < \mu d\) and \(\max \{x_i\} > \mu + d\) with \(d = \sigma \sqrt{2}\;\text {erf}\;^{1}((1\alpha )^{1/n})\). Each time the test was rejected, we increased the counter of participating samples by 1. Each sample participated in approximately 50,000 tests. The results obtained are presented in Fig. 1.
It can be deduced from this plot that potentially interesting phenomena occurred around samples \(0.6 \cdot 10^4, 2.5 \cdot 10^4, 3.2 \cdot 10^4, 4.9 \cdot 10^4, 5.8 \cdot 10^4, 6.1 \cdot 10^4\) and \(9 \cdot 10^4\), which corresponds to specific periods during production. Figures 2 and 3 show histograms for samples in the ranges (57500, 58500) and (90000, 91000). In addition, there is a graph showing the histogram of all tested samples (\(10^5\) measurements) scaled by factor 0.01. The above ranges were chosen due to the fact that in the most obvious way they show evident deviations from what can be observed for entire data set. The real values taken by the samples were linearly converted into numbers in the range \([15,15]\). It was necessary due to the inability to show the real dimensions of the measured object. The presented histograms show an unusually large number of products with dimensions matching the design specification. This can be caused, for example, by the replacement of the machining tool with a new or deliberate action of employees who modified the settings of the measuring machine in order to achieve the appropriate effect during quality control. The search for such regularities in the production process is the main reason for the development of the calculation methods presented in this article.
It should be emphasized here that the problem of finding anomalies in time series has been the subject of intense research for the last twenty years. There are many algorithms and techniques used to search for anomalies provided that we know the details of their location or the size of the time window to be analyzed (Keogh et al. 2005, 2006; Senin et al. 2015).
The current methods of finding anomalies in time series can be divided into three basic categories in terms of the type of results obtained. We distinguish algorithms for finding point anomalies, structural anomalies and anomalous series (in the case of many series). By point anomalies we mean single, significant deviations from the values for a given time series (Hawkins 1980; Evans et al. 2015). On the other hand, structural anomalies are such contiguous time series subsequences that deviate from the values (Keogh and Lin 2005; Zhang and Meratnia 2010; Senin et al. 2015). In contrast, by anomalous series we mean whole series that are somehow different from other observed series (Hyndman et al. 2015; Laptev et al. 2015). In many cases, one of the most important element that allows us for search structural anomalies is determining the value of appropriate statistics for a large class of contiguous subsequences. Therefore, the algorithms presented in this article may be of considerable importance for such methods. Anomaly search algorithms using artificial intelligence methods are currently very popular. The publication based on the experience gathered by the Intel Wang et al. (2018) is highly interesting from the point of view of the production process. Beside article Chalapathy and Chawla (2019) presents a comprehensive overview of anomaly detection methods using machine learning. For this type of method, it is also very important to be able to quickly determine the value of statistics and divide the data set into appropriate clusters.
Article Ding et al. (2016) introduces the division of algorithms that detect anomalies by the type of mechanisms used to detect them. The authors distinguish four basic methods for detecting anomalies: classification, nearest neighbor, clustering and statistical. It should be emphasized that the last three methods more or less directly use functions that are either statistics or certain measures originating in the theory of probability. Quick determination of various types of statistics is therefore crucial from the point of view of the computational complexity of these algorithms.
The next sections of this article focus primarily on determining the arithmetic mean \(\mu _n = \frac{1}{n}\sum _{i=1}^{n}{x_i}\) and variance \(\nu _n = \frac{1}{n}\sum _{i=1}^{n}{(x_i\mu _n)^2}\), but the methods proposed therein can also be used to determine other statistics. It should be emphasized that the use of the biased estimator for the variance is appropriate in this case. This is due to its computational properties that allow for more effective determination of its numerical value. When its value is determined, it is very easy to use it to calculate the value of the unbiased estimator by multiplying the result by \(\frac{n}{n1}\). This is the reason why we will consistently use the biased variance estimator later in this article. Before we get to that, however, we briefly summarise the current state of knowledge on the calculation of variance, because it is undoubtedly one of those parameters whose calculation poses a lot of numerical problems. The first algorithm, which significantly reduces the numerical error in the determination of variance, was proposed by Welford (1962). His approach is modelled on the idea put forward by Box and Hunter (1959), who proposed a similar technique for determining residual sums. The Welford formula was improved by Knuth (1981), who gave it a very elegant and computationally effective form
Next, Neely (1966) developed the idea presented by Kahan (1965) and proposed an algorithm that needs double access to data; its main element was the correction of input data by the mean computed during the first round. This way, he developed a procedure highly resistant to numerical errors in many practical applications. In search for solutions offering the highest resistance to rounding errors, Rodden (1967) presented the concept of replacing floatingpoint numbers with integers. We will elaborate on his proposal later on in this article, adding estimates on the number of bits necessary for errorfree computation of variance. Then, Van Reeken (1968), Youngs and Cramer (1971) extended the set of algorithms described by Neely. Ling (1974) summarised the methods described above and compared their operation for many different input data distributions. Next, Hanson (1975), West (1979) and Cotton (1975) added an option to determine weighted variance to the existing methods and increased the numerical stability of the results obtained. A substantial contribution to the estimation of numerical errors for individual methods was made by Chan and Lewis (1979), Chan et al. (1982), Chan et al. (1983). Recent years have seen a fairly intense development of algorithms and methods focused on the processing of large data sets. The two basic directions of research work are worth mentioning. The first one includes online calculation methods developed by Kreinovich et al. (2007), Pébay (2008), Villaverde and Xiang (2008) and Terriberry (2008). The second one involves parallel data processing methods proposed by Schubert et al. (2015), Schubert and Gertz (2018) and Gouin et al. (2016). Peaby, Terriberry, Kolla and Bennett published very valuable results that combine both of these research directions. In article Pébay et al. (2016) they defined general recursive formulas for determining higher order moments. Their approach can be the basis for many parallel and numerically stable algorithms. This article focuses on computationally effective methods for determining statistics for all possible contiguous subsequences or for a specific group of contiguous subsequences.
2 Computation of the arithmetic mean and variance after conversion of measurement data into integer values
A classic algorithm for determining variance based on the following relation
poses a multitude of numerical problems. In production control, calculation errors grow along with rising product quality. This is due to the fact that the expected value is calculated by adding numbers that are very similar in value. However, calculation of the variance under the aforesaid method involves deducting the square of the arithmetic mean from the root mean square, which poses a number of problems connected with calculation precision.
The arithmetic and logic unit of modern processors usually supports both integer and floating point operations (mathematical coprocessor). Nowadays, a 64bit application processor is able to perform very fast operations on 32 and 64bit integers, and 32 and 64bit floating point numbers (single and double precision). Floating point numbers are given by
where S is a sign bit, M is the mantissa and C is the exponent. In the case of floating point numbers, the IEEE754 standard (IEEE 2008) specifies that in the singleprecision format, the mantissa and the exponent have 23 and 8 bits, respectively. In the doubleprecision format, the mantissa and the exponent have 52 and 11 bits, respectively. The speed of multiplication of numbers in particular formats depends on the type of processor. Table 1 presents the times of scalar multiplication of vectors consisting of \(10^8\) elements of a given type.
This table clearly shows that integer multiplication is more efficient than floating point multiplication. Therefore, by switching to integer arithmetic, we can achieve two goals: speed up calculations and eliminate rounding errors. Such an approach may raise doubts, as the measurement results are, by their nature, floating point values. However, if we assume that we know the nominal value of the measured element and assume a specific measurement precision, then we can easily calculate the relevant statistics by using integer operations. If, for example, the measured value has a nominal value of \(\delta _0 = 125.950\) [mm] and the measurement precision is \(\varepsilon _0 = 0.001\) [mm], then the measurement results \(\delta _i\) can be represented as integers \(m_i\), where
is the actually measured value. By determining \(m_i\) from the above formula, we get
The great advantage of this representation is that it allows for computing the expected value and variance without rounding errors. Some of them cannot be avoided in the case of floating point representation, because if, for example, a measuring device produces results with a precision of 0.1, then the floating point representation of number \(\frac{1}{10}\) in the binary system is infinite and is given by:
We can avoid such problems by switching to integer representation. Its correct operation, however, is conditional on precise control of the factor by which the integers being processed had been previously multiplied. This approach will be most effective for sets that have a welldefined and appropriately small range of possible values. For example, such a set is the measurement results. If we perform the measurement using a 16bit sensor, we can be sure that we will get no more than 65536 possible values. However, if variables can potentially take any floating point values, then in such cases the transition to an integer representation will not make much sense.
Later on in the article we focus on determining the number of bits of the mantissa necessary for correct calculation of the arithmetic mean \(\mu _n\) and variance \(\nu _n\). In doing so, we assume that only integer numbers are used, i.e. that the exponent has zero bits. Under these assumptions, the following theorem is true:
Theorem 1
Let \(\delta _0\) be the nominal value of the measured value and \(\varepsilon _0\) the calculation precision. We also assume that the measuring range is set and the results fall within the range \([\delta _0r_0, \delta _0+r_0)\). If n is the number of measurements, \(A_n(\varepsilon _0, r_0)\) is the number of integer bits necessary to correctly determine the arithmetic mean and \(V_n(\varepsilon _0, r_0)\) is the number of integer bits necessary for correct determination of the variance, then:
Proof
To prove the theorem it suffices to show that the largest values appearing in the calculation of \(\mu _n\) and \(\nu _n\) are contained in bit registers with lengths \(A_n(\varepsilon _0, r_0)\) and \(V_n(\varepsilon _0, r_0)\), respectively. Note that if the calculation precision is \(\varepsilon _0\), then there are \([\delta _0r_0, \delta _0+r_0)\) possible measurement results in the interval \(\left\lfloor \frac{2r_0}{\varepsilon _0}\right\rfloor \). This means that the measurement results \(x_i\) stand in a onetoone relation with integers \(m_i = \frac{x_i  \delta _0}{\varepsilon _0}\) satisfying the following inequality
Observe that
Therefore, to precisely calculate the arithmetic mean it is necessary to precisely compute the sum \(m_1+\dots +m_n\), satisfying the relation
Remembering that the number of bits necessary to represent the nonnegative integer m is expressed by the formula \(\left\lfloor {\log _2{m}}\right\rfloor + 1\) and considering the sign bit, we obtain
To calculate the variance, we use the fact that it is invariant to shifts by a fixed value and can be expressed as
From Jensen’s inequality (Jensen 1906) for convex function it follows that \(\frac{1}{n}\sum _{i=1}^{n}{m_i^2} \ge \left( \frac{1}{n}\sum _{i=1}^{n}{m_i}\right) ^2\), therefore, to compute the variance precisely, it is necessary to accurately determine the sum of squares \(m_1^2+\dots +m_n^2\) satisfying the following relation
Consequently, we obtain
which ends the proof. \(\square \)
Corollary 1
If we have a sequence of measurement data \((x_1, \dots , x_n)\) and there exist such constants \(\delta _0, \varepsilon _0\) and \(r_0\) that
for certain integers \(m_i\), then the arithmetic mean \(\mu _n\) and variance \(\nu _n\) of the sequence of measurement data can be computed precisely using the formulas
on condition that for the purpose of computing the arithmetic mean we use the integer data type with a length of at least \(A_n(\varepsilon _0,r_0)\) bits, and for the purpose of computing the variance we use the integer data type with a length of at least \(V_n(\varepsilon _0,r_0)\) bits.
To illustrate the importance of the result, imagine an optical measuring system on the production line. Let’s assume that it is able to measure manufactured elements with an accuracy of \(2^{20}\) [m] (about 1 [\(\upmu \)m]), assuming that their dimensions do not exceed \(2^{4}\) [m] (about 6 [cm]) . If we use the 64bit integer type to operate on measurement data, then the Theorem 1 shows that we will be able to correctly determine the variance for about \(2 \cdot 10^9\) mesured elements. As for the production process, this number is so large that it gives a real opportunity to analyze this process in a fairly long time horizon.
3 The algorithm for computing the arithmetic mean and variance of all contiguous subsequences
In this section we assume that we have a certain sequence of n observations
We define contiguous subsequence s of sequence t to be sequence \(s = (x_{l+1}, x_{l+2}, \dots , x_{l+k})\) such that for every i satisfying the condition \(l+1 \le i \le l+k\), \(x_i \in s\). We also assume that the order of elements of contiguous subsequence s is the same as in sequence t. The contiguous subsequence can therefore be treated as a kind of time window. Because the contiguous subsequence is precisely defined by the index of its first and last term, we also use the notation \(s_{l+1,l+k}\) to denote contiguous subsequence \((x_{l+1}, x_{l+2}, \dots , x_{l+k})\).
Let \(t_k\) signify the number of klong contiguous subsequences of sequence s. Note that this number is given by \(t_k = n+1k\). Therefore, the number of all possible contiguous subsequences T of sequence t equals
Now, let us assume that we want to calculate a certain statistic for all possible contiguous subsequences of an nelement sequence. Let the statistics be, for example, the arithmetic average and variance. Note that if kelement contiguous subsequence \(s = (x_{l+1}, x_{l+2}, \dots , x_{l+k})\), is given, then the time needed to compute both the arithmetic mean and variance is proportional to the number of its elements. This arises directly from the definitions of these values:
In order to compute the arithmetic mean, one needs \(k1\) additions and one division (k operations in total), and to compute the variance, one needs \(k1\) additions, \(k+1\) squarings, one subtraction, one division and operations needed to determine the arithmetic mean (\(3k+2\) operations in total). We can therefore assume that the number of floating point operations needed to calculate the arithmetic mean and the variance is limited by Ck, where C is a constant and k is the number of elements of the contiguous subsequence. Consequently, the complexity of the algorithm for computing the arithmetic mean and variance of all possible contiguous subsequences of an nelement sequence may be expressed by
Note that the operations of determining the sum and the sum of squares materially affect the complexity of computing the arithmetic mean and variance. These are the only operations that depend on the number of elements in the contiguous subsequence. The number of the remaining operations is constant and independent of the length of the processed sequence. Therefore, the only way to reduce the computational complexity of the algorithm referred to above is to speed up the calculation of sums and sums of squares. It turns out that if the contiguous subsequences are processed in the right order, it takes a constant amount of time to compute the values of the arithmetic mean and variance. It is similar for other statistics. Consequently, the whole algorithm will be presented in the most general form possible.
Let us set all contiguous subsequences in the order shown below:
Note that each row begins with a contiguous subsequence of length 1 and that the difference in length between neighbouring contiguous subsequences in one row is also 1. This means that the time needed to compute the expected value and variance for the first contiguous subsequence in a row is constant. In the case of neighbouring contiguous subsequences located in one row, the expected value and variance of the next contiguous subsequence can also be computed in a constant time.
Definition 1
Let F be a function defined for every contiguous subsequence \(s = (x_{l+1}, \dots , x_{l+k})\). We say that F is forward computationally separable function, if it meets the condition:
for a certain function h having constant computational complexity. Moreover, we say that F is backward computationally separable function, if it meets the condition:
for a certain function g having constant computational complexity. If function is both forward and backward computationally separable, then it is computationally separable function.
Corollary 2
Functions
are computationally separable functions.
Proof
In the case of \(M_1\) we simply assume that \(h(x,y) = x+y\) and \(g(x,y) = xy\). For \(M_2\) we assume that \(h(x,y) = x+y^2\) and \(g(x,y) = xy^2\). \(\square \)
Before we proceed any further, it is worth showing that there are functions that are either only forward or only backward computionally separable. We can use continued fractions as an example. If \(F_1\) and \(F_2\) are functions defined as follows
then \(F_1\) is a forward computationally separable function and \(F_2\) is backward computationally separable function. It results directly from the following relations
In other words, function h for \(F_1\) takes the form \(h(x,y) = \frac{1}{y+x}\), while function g for \(F_2\) takes the form \(g(x,y) = \frac{1}{x}  y\).
Let R be any statistic such that for every contiguous subsequence \(s = (x_{l+1}, x_{l+2}, \dots , x_{l+k})\) meets the condition \(R(s) = f_s(F(s))\), where \(f_s\) is a function dependent on s having constant computational complexity and \(F(s) = (F_1(s), \dots , F_m(s))\) is a forward computationally separable function. Since s is explicitly designated by l and k, then we can define \(f_s\) as \(f_{l+1,l+k}\).
Using these designations, we can define Algorithm 1, that computes statistic R for all contiguous subsequences.
Theorem 2
If \(s = (x_{l+1}, \dots , x_{l+k})\) is a contiguous subsequence of sequence \(t = (x_1, \dots , x_n)\), statistic R is given by \(R(s) = f_s(F(s)) = f_s(F_1(s), \dots , F_m(s))\), functions \(f_s, h\) have fixed computational complexity O(1) and values \(F_i((x_j))\) can also be computed in a constant time O(1), then Algorithm 1 allows for computing R for all contiguous subsequences t in time \(O(n^2)\).
Proof
First, we prove that the algorithm operates correctly. For this purpose, it suffices to prove that \(r_{i,j} = R(s_{i,j}) = R((x_i, \dots , x_j))\). Note that lines (2) and (5) of Algorithm 1 define the following recursive relation
We prove by induction on j that \(u_{i,j}\) was computed correctly. The first induction step for every i is true because this is how \(u_{i,i}\) values are initiated. In step two we assume that for \(j1\) the following is true
Consequently, we obtain
Therefore, R is computed correctly, which results directly from step (6) of Algorithm 1, because
This ends first part of the proof. In order to estimate computational complexity, note that the assumptions show that steps (2), (3), (5) and (6) of the Algorithm have constant computational complexity O(1). Consequently, the determination of every \(r_{i,j}\) also has constant computational complexity O(1). Finally, we conclude that the computational complexity of the entire algorithm equals the number of computed \(r_{i,j}\) values which is exactly the same as the number of twoelement combinations from an nelement set, that is \(\frac{n(n+1)}{2}\). Therefore, we proved that the algorithm has complexity \(O(n^2)\), which ends the proof. \(\square \)
Theorem 3
If \(s = (x_{l+1}, \dots , x_{l+k})\) and \(R(s) = \mu (s) = \frac{1}{k}M_1(s)\), then Algorithm 1 makes it possible to compute the arithmetic mean for all contiguous subsequences of sequence \((x_1, \dots , x_n)\) in time \(O(n^2)\).
Proof
If \(s_{i,j} = (x_i, x_{i+1}, \dots , x_j)\), then from definition of R it follows that \(F_1(s_{i,j}) = M_1(s_{i,j}) = x_i + \dots + x_j\) and \(f_{i,j}(x) = \frac{1}{ji+1}(x)\). Using these designations, Algorithm 1 takes the form presented as Algorithm 2. First, we prove that this algorithm works correctly. For this purpose, it suffices to prove that \(r_{i,j}\) contains the arithmetic mean of the elements of sequence \((x_i, x_{i+1}, \dots , x_j)\). To do this, let us note that for a given i the elements of \(u_{i,j}\) meet the following recursive relation:
By expanding this recursion, we obtain that \(u_{i,j} = x_i + x_{i+1} + \dots + x_j\). But
which proves that \(r_{i,j}\) is the arithmetic mean of sequence \((x_i, x_{i+1}, \dots , x_j)\). The estimation of the computational complexity arises directly from Theorem 2, because the individual computations of \(u_{i,j}\) and \(r_{i,j}\) have constant computational complexity. \(\square \)
Theorem 4
If \(s = (x_{l+1}, \dots , x_{l+k})\) and \(R(s) = \nu (s) = \frac{1}{k}M_2(s)  \left( \frac{1}{k}M_1(s)\right) ^2\), then Algorithm 1 makes it possible to compute the variance for all contiguous subsequences of sequence \((x_1, \dots , x_n)\) in time \(O(n^2)\).
Proof
If \(s_{i,j} = (x_i, \dots , x_j)\), then from the definition of R it follows that
Using these designations, Algorithm 1 takes the form presented as Algorithm 3. First, we prove that this algorithm works correctly. For this purpose, it suffices to prove that \(r_{i,j}\) contains the variance of the elements of sequence \((x_i, \dots , x_j)\). To do this, let us note that for a given i the elements of \(u_{i,j}\) meet the following recursive relation:
By expanding this recursion, we obtain that \(u_{i,j} = (x_i + \dots + x_j, x_i^2 + \dots + x_j^2)\). But
which is the variance of sequence \((x_i, \dots , x_j)\). The estimation of the computational complexity arises directly from Theorem 2, because the individual computations of \(u_{i,j}\) and \(r_{i,j}\) have constant computational complexity. \(\square \)
The variance computation method used in Theorem 4 is not numerically stable. That is why it should be applied only in cases where the values of functions \(M_1\) and \(M_2\) are determined precisely (i.e. are free from numerical errors). The conditions that must be met in this regard have been set out in Theorem 1 and Corollary 1. When numerical errors cannot be avoided, it is advisable to compute the variance using the method proposed by Welford (1962), which also defines a forward computationally separable function.
Theorem 5
If \(s = (x_{l+1}, \dots , x_{l+k})\) and \(R(s) = (\mu (s),\nu (s))\), then the Welford recursive formula and Algorithm 1 allow us to simultaneously determine the arithmetic mean and variance for all contiguous subsequences of sequence \((x_1, \dots , x_n)\) in time \(O(n^2)\).
Proof
If \(s_{i,j} = (x_i, \dots , x_j)\), then the following relations for the arithmetic mean and variance result from the definitions of \(\mu \) and \(\nu \) and the Welford formulas Welford (1962)
The Welford formulas shown in Eqs. 4 and 6 confirm that \(s_{i,j} \mapsto (\mu (s_{i,j}), \nu (s_{i,j}))\) is a forward computationally separable function. This means that Algorithm 1 takes the form presented as Algorithm 4 and proves correctness of presented method. Complexity \(O(n^2)\) esults from the fact that all operations in the loop have complexity O(1). This ends the proof. \(\square \)
4 The algorithm for computing the arithmetic mean and variance for fixedlength contiguous subsequences
Theorem 2 which has been proven, is of general nature and applies to many different statistics. Theorems 3, 4 and 5 , which use Algorithm 1 to compute the arithmetic mean and variance for all contiguous subsequences of a given sequence \(t = (x_1, x_2, \dots , x_n)\), serve as an example. However, it is not always necessary to calculate statistics for all possible contiguous subsequences. Sometimes we just want to scan a series of observations using a window of a predefined width. If such a window covers w observations, then the analysed contiguous subsequences are arranged in the following series
In order to effectively compute statistics for such a set of contiguous subsequences, it is necessary to assume that a statistic is made up of certain computationally separable functions. The weaker assumption about forward computationally separability is insufficient.
Let R be any statistic such that for every contiguous subsequence \(s = (x_{l+1}, x_{l+2}, \dots , x_{l+k})\) meets the condition \(R(s) = f_s(F(s))\), where \(f_s\) is a function dependent on s having constant computational complexity and \(F(s) = (F_1(s), \dots , F_m(s))\) is a computationally separable function. Since s is explicitly designated by l and k, then we can define \(f_s\) as \(f_{l+1,l+k}\). Using these designations, we can define Algorithm 5, that computes statistic R for all contiguous subsequences of length w.
Theorem 6
If \(s = (x_{l+1}, \dots , x_{l+k})\) is a contiguous subsequence of sequence \(t = (x_1, \dots , x_n)\), statistic R is given by \(R(s) = f_s(F(s)) = f_s(F_1(s), \dots , F_m(s))\), functions \(f_s, h, g\) have fixed computational complexity O(1) and values \(F_i((x_j))\) can also be computed in a constant time O(1), then Algorithm 5 allows for computing statistic R for all contiguous subsequences given by \(s_{l+1,l+w} = (x_{l+1},\dots ,x_{l+w})\) in time O(n) for any w.
Proof
First, we prove that the algorithm operates correctly. For this purpose, it suffices to show that \(r_{i} = R(s_{i,i+w1}) = R((x_i, \dots , x_{i+w1}))\). We prove the above by induction, which first step is to show that \(r_1 = R(s_{1,w})\). Note that lines (1) and (3) of Algorithm 5 define the following recursive relation for \(u_{1,i}\)
Since \(h\left( (F_1(s_{1,i1}), \dots , F_m(s_{1,i1})), x_i\right) = ((F_1(s_{1,i}),\) \(\dots , F_m(s_{1,i}))\), then by expanding this simple recursion we conclude that
Consequently, we proved that
Now, we assume that \(r_{i1}\) was calculated correctly. By definition of function R it follows that
Considering the properties of function g, we conclude that step (8) of Algorithm 5 defines
Then, using the properties of function h in step (9) we obtain
This means that
Using inductive reasoning we proved that Algorithm 5 correctly determines statistics \(r_i\) for \(i \in \{1,\dots ,w\}\). In order to estimate computational complexity, note that the assumptions show that steps (1), (3), (5–6) and (8–10) of the Algorithm have constant computational complexity O(1). Consequently, the total complexity of the Algorithm equals the total number of iterations of loop (2) and (7) and amounts to \(O(w+(nw)) = O(n)\), which ends the proof. \(\square \)
Corollary 3
If \(s = (x_{l+1}, \dots , x_{l+k})\) i \(R(s) = \mu (s) = \frac{1}{k}M_1(s)\), then Algorithm 5 makes it possible to compute the arithmetic mean for all contiguous subsequences \(s_{l+1,l+w} = (x_{l+1},\dots ,x_{l+w})\) in time O(n) for any w.
Corollary 4
If \(s = (x_{l+1}, \dots , x_{l+k})\) and \(R(s) = \nu (s) = \frac{1}{k}M_2(s)  \left( \frac{1}{k}M_1(s)\right) ^2\), then Algorithm 5 makes it possible to compute the variance for all contiguous subsequences \(s_{l+1,l+w} = (x_{l+1},\dots ,x_{l+w})\) in time O(n) for any w.
The proof of the above conclusions is very similar to the proof of Theorems 3 and 4. It must be stressed that the method presented in Corollary 4 should be used to compute the variance only when we are certain that the calculations will be free from numerical errors. Otherwise, we advise to use the method presented below.
Lemma 1
If \(s = (x_1, \dots , x_m)\), and \(s' = (x_2, \dots , x_{m+1})\), then the following relations are true
Proof
Equation 7 can be easily proved using the definition of arithmetic mean
Equation 8 is slightly more complex and so to prove it we use alternative expressions for the variance \(\nu (s) = \frac{x_1^2+x_2^2+\dots +x_m^2}{m}  \mu (s)^2\) and \(\nu (s') = \frac{x_2^2+\dots +x_m^2+x_{m+1}^2}{m}  \mu (s')^2\). Using the above relations, we can write that
Using Eq. 7 proved above, we obtain that
This ends the proof of the lemma. \(\square \)
Theorem 7
If \(s = (x_{l+1}, \dots , x_{l+k})\) and \(R(s) = (\mu (s),\) \(\nu (s))\), then Algorithm 5 allows for computing the variance for all contiguous subsequences \(s_{l+1,l+w} = (x_{l+1},\dots ,x_{l+w})\) in time O(n) for any w.
Proof
If \(s_{i,j} = (x_i, \dots , x_j)\), then the Welford formulas can be used to compute the arithmetic mean and variance in loop (2)–(4) of Algorithm 5. Since in this loop we calculate \(\mu (s_{1,w})\) and \(\nu (s_{1,w})\), then the formulas are reduced to
In turn, for loop (7)–(11) of Algorithm 5 we use the formulas from Lemma 1, which for a window of width w are given by
The above equations prove that function \(R(s) = (\mu (s),\) \(\nu (s))\) is determined correctly for every \(s = s_{l+1,l+w} = (x_{l+1},\dots ,x_{l+w})\) using Algorithm 5, which takes the form presented as Algorithm 6. Since all operations within the two loops have the complexity O(1), then by Theorem 6 we conclude that the complexity of the presented algorithm is O(n), which ends the proof. \(\square \)
5 Practical implementation and its results
In order to compare the methods proposed in the previous section, in C++ we implemented Algorithm 7 calculating the minimum, maximum, arithmetic mean and variance for all contiguous subsequences of a given sequence of measurement data.
The first objective of the implementation was to determine the speed of calculation using the available data types. The following five arithmetic types were analysed:

1.
32bit integer type int32_t,

2.
64bit integer type int64_t,

3.
32bit floatingpoint type with a 23bit mantissa float (single precision),

4.
64bit floatingpoint type with a 53bit mantissa double (double precision),

5.
80bit floatingpoint type with a 63bit mantissa long double (extended precision).
Bearing in mind the application of Algorithm 7 to the analysis of large sets of measurement data, we focused on int64_t and long double. arithmetic types. These are the only types allowing us to precisely determine the variance for samples consisting of more than \(10^6\) elements and represented with a 16bit precision. However, in order to compare all options, each of the previously mentioned data types has been subjected to a performance analysis. Table 2 presents the time needed to perform the following operations: determination of extremes, determination of the arithmetic mean, determination of the variance, and determination of all these statistics simultaneously.
An analysis of running times showed that int64_t is more than twice as fast as long double. Due to the nature of our calculations (determination of extremes, expected value and variance), both types guarantee virtually identical precision. In addition, the integer type allows us to avoid rounding errors when the precision of measurements is based on the powers of 10. Table 3 presents running times of Algorithm 7 and of the algorithm determining statistics for each contiguous subsequence separately (denoted as std.). Figure 4 gives a more accurate picture of the speed of the algorithms for particular arithmetic types. It clearly shows the difference in the complexity of both algorithms. The time needed to determine the basic statistics grows quickly with the size of the analysed data. For a sequence of around \(10^6\) elements, the running time of Algorithm 7 is approximately 30 hours. For comparison, it would take almost 4 years to perform this analysis using a nonoptimised algorithm. The basic condition for the reliability of Algorithm 7 is the selection of an arithmetic type that allows for errorfree calculation of sums and sums of squares for a large number of elements. That is why we suggest using integer types instead of floatingpoint types.
Figure 5 compares the running times of the proposed algorithm for two types of data: long double and int64_t. Note that in terms of speed, the advantage of the integer type over the floatingpoint type has been significantly reduced. There are three reasons behind this:

1.
a large number of short contiguous subsequences for which the statistics are calculated,

2.
the need to store a large number of calculated statistics, which translates into a large number of read and write memory operations,

3.
a larger number of conditional instructions in the loops determining individual statistics (the processor spends more time on analysing conditions than on making calculations).
Naturally a question arises how the performance time will change for sequences consisting of \(10^5\) or \(10^6\) elements. Another important issue is how to optimise access to memory or structure the algorithm so that it does not remember statistics, but only locations that should be analysed more closely. Figure 6 presents the ratio of the running time of the algorithm determining the statistics for each sequence separately to the running time of Algorithm 7 for two arithmetic types: long double and int64_t. The plot shows that the integer type guarantees a much higher calculation efficiency than the floating point type, and it can also be seen that this disproportion increases as the size of the input sequence being processed increases.
6 Conclusions
The article proposes a new algorithm for determining statistics such as the arithmetic mean and variance for all contiguous subsequences and fixedlength contiguous subsequences. Its main element is to reduce floating point data to their integer representation. As a result, an algorithm was obtained that works without a numerical error for many practical applications—especially for the analysis of industrial measurement data. The developed method significantly speeds up the calculations and gives the possibility of its effective application in machine learning and accurate statistical inference.
The first section of this article presents a brief introduction to the issues of statistical analysis of measurements in manufacturing enterprises and a description of the current state of knowledge on the determination of the arithmetic mean and variance.
The second section outlines calculation methods based on measurement data with the use of both floatingpoint representation and integer representation. In Theorem 1 we laid down the conditions that an arithmetic type (its mantissa) has to meet so that it can be used to precisely determine statistics such as the expected value and variance. Additionally, this sections presents a method of converting measurement data to a corresponding sequence of integers that can be used to determine the said statistics, which is expressed in Corollary 1.
The third section presents an algorithm for fast statistical analysis of all contiguous subsequences of a certain numerical sequence. Algorithm 1 has been presented in the general version using the proposed concept of computationally separable functions. Theorem 2 proves the correctness of the proposed method and determines its computational complexity at \(O(n^2)\), where n is the number of elements of the analysed sequence. Theorems 3 and 4 prove the usefulness of Algorithm 1 for calculating the arithmetic mean and variance. It is worth stressing that these methods have been presented in a form that allows errorfree calculations, provided that the assumptions of Theorem 1 are met. For numerically unstable cases, we presented Theorem 5, which uses Algorithm 1 to determine arithmetic means and variances based on the Welford formula. This theorem applies when the mantissa of a number is too small to enable the calculations to be free from numerical errors.
The fourth section deals with the problem of determining statistics for a set of fixedlength contiguous subsequences. In this case, we proposed Algorithm 5, which was proved to operate correctly in Theorem 6. The theorem also proved that if the statistics can be defined using computationally separable functions, then the determination of statistics for all welement contiguous subsequences of an nelement sequence has complexity O(n) regardless of the value of w. Next, Corollaries 3 and 4 show how to apply Algorithm 5 to determine arithmetic means and variances, avoiding numerical errors in the process. Lemma 1 proved the correctness of the new formula for calculating the arithmetic mean and variance for fixedlength contiguous subsequences. The lemma was used to prove Theorem 7, that proved that Algorithm 5 operates correctly based on the proposed relation.
The last section of the article compares the running times of the proposed algorithms for various data types. The conclusion is that by converting measurement data from floatingpoint to integer representation, we can not only speed up calculations, but also make them more precise. Although it is possible to achieve this goal only when the conditions of Theorem 1, are met, practice shows that it is very often feasible, especially in the case of a welldefined set of measurement data.
References
Box G, Hunter J (1959) Condensed calculations for evolutionary operation programs. Technometrics 1(1):77–95
Chalapathy R, Chawla S (2019) Deep learning for anomaly detection: a survey. CoRR. abs/1901.03407
Chan T, Golub G, LeVeque R (1982) Updating formulae and a pairwise algorithm for computing sample variances. COMPSTAT. https://doi.org/10.1007/9783642514616_3
Chan T, Golub G, LeVeque R (1983) Algorithms for computing the sample variance: analysis and recommendations. Am Stat 37(3):242–247. https://doi.org/10.2307/2683386
Chan T, Lewis J (1979) Computing standard deviations: accuracy. Commun ACM 22(9):526–531. https://doi.org/10.1145/359146.359152
Cotton I (1975) Remark on stably updating mean and standard deviation of data. Commun ACM 18(8):458. https://doi.org/10.1145/360933.360981
Deming W (1986) Out of the crisis. MIT Press, Cambridge
Ding J, Liu Y, Zhang L, Wang J, Liu Y (2016) An anomaly detection approach for multiple monitoring data series based on latent correlation probabilistic model. Appl Intell 44:340–361. https://doi.org/10.1007/s1048901507137
Evans K, Love T, Thurston S (2015) Outlier identification in modelbased cluster analysis. J Classif 32:63–84. https://doi.org/10.1007/s0035701591715
Gouin F, Ancourt C, Guettier C (2016) Threewise: a local variance algorithm for gpu. In: 2016 IEEE International Conference on CSE, IEEE International Conference on EUC, International Symposium DCABES, pp 257–262. https://doi.org/10.1109/CSEEUCDCABES.2016.194
Hanson R (1975) Stably updating mean and standard deviation of data. Commun ACM 18(1):57–58. https://doi.org/10.1145/360569.360662
Hawkins D (1980) Identification of outliers. Springer, Dordrecht
Hyndman R, Wang E, Laptev N (2015) Largescale unusual time series detection. In: 2015 IEEE international conference on data mining workshop (ICDMW), pp 1616–1619. https://doi.org/10.1109/ICDMW.2015.104
IEEE: Ieee std 7542008. Tech. rep., IEEE (2008). https://doi.org/10.1109/ieeestd.2008.4610935
Jensen J (1906) Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math 30:175–193. https://doi.org/10.1007/BF02418571
Kahan W (1965) Pracniques: further remarks on reducing truncation errors. Commun ACM 8(1):40. https://doi.org/10.1145/363707.363723
Keogh E, Lin J (2005) Clustering of timeseries subsequences is meaningless: implications for previousand future research. Knowl Inf Syst 8(2):154–177. https://doi.org/10.1007/s1011500401727
Keogh E, Lin J, Fu A (2005) Hot sax: efficiently finding the most unusual time series subsequence. In: Fifth IEEE international conference on data mining (ICDM’05), p 8. https://doi.org/10.1109/ICDM.2005.79
Keogh E, Lin J, Lee S, Herle H (2006) Finding the most unusual time series subsequence: algorithms and applications. Knowl Inf Syst 11(1):1–27. https://doi.org/10.1007/s1011500600346
Knuth D (1981) The art of computer programming, vol II: seminumerical algorithms, 2nd edn. AddisonWesley
Kreinovich V, Nguyen H, Wu B (2007) Online algorithms for computing mean and variance of interval data, and their use in intelligent systems. Inf Sci 177:3228–3238. https://doi.org/10.1016/j.ins.2006.11.007
Laptev N, Amizadeh S, Flint I (2015) Generic and scalable framework for automated timeseries anomaly detection. In: KDD’15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 1939–1947. https://doi.org/10.1145/2783258.2788611
Ling R (1974) Comparison of several algorithms for computing sample means and variances. J Am Stat Assoc 69(348):859–866. https://doi.org/10.2307/2286154
Neely P (1966) Comparison of several algorithms for computation of means, standard deviations and correlation coefficients. Commun ACM 9(7):496–499. https://doi.org/10.1145/365719.365958
Pébay P (2008) Formulas for robust, onepass parallel computation of covariances and arbitraryorder statistical moments. Tech. Rep. SAND20086212, Sandia National Laboratories. https://doi.org/10.2172/1028931
Pébay P, Terriberry T, Kolla H, Bennett J (2016) Numerically stable, scalable formulas for parallel and online computation of higherorder multivariate central moments with arbitrary weights. Comput Stat 31(4):1305–1325. https://doi.org/10.1007/s001800150637z
Rodden B (1967) Errorfree methods for statistical computations. Commun ACM 10(3):179–180. https://doi.org/10.1145/363162.363205
Schubert E, Gertz M (2018) Numerically stable parallel computation of (co)variance. In: SSDBM’18. https://doi.org/10.1145/3221269.3223036
Schubert E, Koos A, Emrich T, Züfle A, Schmid K, Zimek A (2015) A framework for clustering uncertain data. Proc VLDB Endowment 8(12):1976–1979. https://doi.org/10.14778/2824032.2824115
Selvamuthu D, Das D (2018) Introduction to statistical methods, design of experiments and statistical quality control. Springer Nature Singapore Pte Ltd., Singapore
Senin P, Lin J, Wang X, Oates T, Gandhi S, Boedihardjo A, Chen C, Frankenstein S (2015) Time seriesanomaly discovery with grammarbased compression. In: Proceedings of the 18th international conference on extending database technology, EDBT 2015, pp 481–492. https://doi.org/10.5441/002/edbt.2015.42
Shewart W (1931) Economic control of quality manufactured product. D. Van Nostrand, New York
Terriberry T (2008) Computing higherorder moments online. http://people.xiph.org/tterribe/notes/homs.html
Van Reeken A (1968) Letters to the editor: dealing with Neely’s algorithms. Commun ACM 11(3):149–150. https://doi.org/10.1145/362929.362961
Villaverde K, Xiang G (2008) Estimating variance under interval and fuzzy uncertainty: Parallel algorithms. In: IEEE International Conference on Fuzzy Systems, pp 1030–1033. https://doi.org/10.1109/FUZZY.2008.4630496
Wang X, Lin J, Patel NMB (2018) Exact variablelength anomaly detection algorithmfor univariate and multivariate time series. Data Mind Kowl Discov 32:1806–1844. https://doi.org/10.1007/s1061801805697
Welford B (1962) Note on a method for calculating corrected sums of squares and products. Technometrics 4(3):419–420. https://doi.org/10.2307/1266577
West D (1979) Updating mean and variance estimates: an improved method. Commun ACM 22(9):532–535. https://doi.org/10.1145/359146.359153
Youngs E, Cramer E (1971) Some results relevant to choice of sum and sumofproduct algorithms. Technometrics 13(3):657–665. https://doi.org/10.1080/00401706.1971.10488826
Zhang Y, Meratnia NPH (2010) Outlier detection techniques for wireless sensor networks: a survey. IEEE Commun Surv Tutor 12(2):159–170. https://doi.org/10.1109/SURV.2010.021510.00088
Acknowledgements
The study was carried out with the use of apparatus purchased from funds of the project Establishment of the Scientific and Research InterUniversity Laboratory in Stalowa Wola, realized as part of the Operational Programme Eastern Poland 20072013, Priority axis I Modern Economy, Measure 1.3 Supporting Innovativeness as per contract No. POPW.01.03.0018016/1200.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chmielowiec, A. Algorithm for errorfree determination of the variance of all contiguous subsequences and fixedlength contiguous subsequences for a sequence of industrial measurement data. Comput Stat 36, 2813–2840 (2021). https://doi.org/10.1007/s00180021010961
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180021010961
Keywords
 Variance
 Arithmetic mean
 Errorfree variance computation
 Measurement data representation