Constant-Time

Abstract. We demonstrate a technique allowing for constant-time calculation of low order Fourier moments, applicable in detection tasks. Real and imaginary parts of the moments can be used as features for machine learning and classification of image windows. The technique is based on a set of special integral images, prepared prior to the scanning procedure. The integral images are constructed as cumulative inner products between the input image and suitable trigonometric terms. The additional time invested in the preparation of such integral images is amortized later at the scanning stage. The extraction of each moment then requires only 21 operations, regardless of the number of pixels in the detection window, and is thereby an O(1) calculation. As an application example, face detection experiments are carried out with detectors based on Haar-like features serving as opponents to the proposed Fourier-based detectors.


Introduction
Constant-time computational complexity is the most attractive complexity for a computer scientist. Unfortunately, favourable opportunities to apply algorithms of that complexity are rare; typically, they pertain to selected data structures, e.g. hash tables or Union-Find^1 [2], and constitute a narrow fragment of a larger piece of software. Often, one in fact deals with so-called amortized constant-time complexity. This means that alongside the essential operations, some auxiliary operations are also performed, meant to guarantee the speed of future operations.
Not so long ago, an algorithmic idea of that class appeared in the field of computer vision and works remarkably well, namely the idea of Haar-like features due to Viola and Jones (2001, 2004) [9,10]. Haar-like features are now commonly applied to detect objects (faces, people, vehicles, road signs, etc.) in digital images [1,8].

Footnote: This work was financed by the National Science Centre, Poland; research project no. 2016/21/B/ST6/01495.
Footnote 1: For strictness, the 'Find' operation in this data structure is of amortized complexity O(\log^* n), the iterated logarithm of n, wherein \log^*_2 n is not greater than 5 for all quantities n observable in the universe; in particular, \log^*_2 2^{65536} = 5.

One should be aware that the fast performance of Haar-like features is not owed to the nature of these features as such; they are simple differential features that can be viewed as rough contours (e.g. the difference in average pixel intensity between the forehead and eyes regions). Instead, the fast performance is in fact a consequence of a computational trick known as the integral image. For an image i(x, y) the elementary integral image is:

ii(x, y) = \sum_{1 \le j \le x} \sum_{1 \le k \le y} i(j, k).
Once such a cumulant is prepared, the sum of intensities over any image window can be calculated in constant time, O(1), regardless of the number of pixels, using 2 subtractions and 1 addition. This allows for very fast feature extraction.
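As a minimal illustration of this trick (our own Python/NumPy sketch, not code from the paper), the cumulant is built once and every window sum afterwards costs 2 subtractions and 1 addition:

```python
import numpy as np

def integral_image(img):
    """Elementary integral image: ii[x, y] = sum of img over pixels (1..x, 1..y).

    A zero guard row/column is prepended so that ii[x1 - 1, ...] lookups
    stay valid for windows starting at the image border."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def window_sum(ii, x1, y1, x2, y2):
    """O(1) sum over the rectangle (x1, y1)-(x2, y2), 1-based inclusive bounds:
    2 subtractions and 1 addition, regardless of the window size."""
    return ii[x2, y2] - ii[x1 - 1, y2] - ii[x2, y1 - 1] + ii[x1 - 1, y1 - 1]
```

The one-time preparation is linear in the number of pixels; all subsequent window sums are constant-time, which is exactly the amortization argument of the paper.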
There exist a few modifications of that idea. For example, a cumulant of squares, ii(x, y) = \sum_{1 \le j \le x} \sum_{1 \le k \le y} i^2(j, k), is useful for calculations of variance. In turn, cumulants of so-called vote matrices allow for the extraction of HOG (histogram of oriented gradients) features [5,7]. Yet, other propositions of that kind are scarce and, in general, approaches which would allow for constant-time extraction of more advanced features, exhibiting better approximation properties, are not known.
In this paper we demonstrate that it is possible to prepare a set of cumulants of the form

ii(x, y) = \sum_{1 \le j \le x} \sum_{1 \le k \le y} i(j, k) \cos f(j, k, \cdots) and ii(x, y) = \sum_{1 \le j \le x} \sum_{1 \le k \le y} i(j, k) \sin f(j, k, \cdots),

with f being a suitably chosen function, and then to use the cumulants to extract Fourier moments of low orders in constant time, using 21 operations, regardless of the size and position of the detection window.
We omit the topic of classifier cascades in this paper.

Haar-Like Features - Short Review
In this section we briefly review Haar-like features and point out their connection to Haar wavelets. Recall the mother Haar wavelet \psi(x), defined to yield: 1 for 0 \le x < 1/2, -1 for 1/2 \le x < 1, and 0 otherwise. The descendant wavelets are generated as follows:

\psi_{j,k}(x) = \psi(2^j x - k).

Thus, descendants are narrowed and shifted versions of the mother wavelet. For any continuous function f (to be approximated), the orthogonality of wavelets allows one to write down the expansion

f(x) \approx c_0 + \sum_{j,k} c_{j,k} \psi_{j,k}(x),

where the best coefficients can be found through inner products of the Haar bases and the target function: c_{j,k} = 1/\|\psi_{j,k}\|^2 \langle f, \psi_{j,k} \rangle and c_0 = \langle f, 1 \rangle. Viola and Jones [9,10] proposed two-dimensional templates resembling Haar wavelets. The templates can be mapped to features by anchoring them within an image window at different positions and scales, and then calculating the difference in average intensity of pixels under the white (+1) and black (-1) regions. We depict the templates and their connection to wavelets in Fig. 1. The intention of Viola and Jones was to generate a massive multitude of features (e.g. ~10^5), so that some of them might happen to represent good characteristics of target objects (e.g. for faces: differences between forehead and eyes, nose and cheeks, etc.). Therefore, the way Haar-like features are actually embedded inside a window (i.e. the setting of their positions and scales) is fairly arbitrary. One may allow feature supports to overlap and neglect orthogonality. On the other hand, we remark that it is straightforward to define orthogonal two-dimensional wavelets via the products \psi_{j,k;l,m}(x, y) = \psi_{j,k}(x) \cdot \psi_{l,m}(y), and to write down a polynomial in wavelets to approximate some fragment of an image function i(x, y).
Note that in the formula for the coefficients, c_{j,k;l,m} = 1/\|\psi_{j,k;l,m}\|^2 \langle i, \psi_{j,k;l,m} \rangle, the expression \langle i, \psi_{j,k;l,m} \rangle is equivalent to taking the white-black difference, as in the definition of Haar-like features, whereas the normalization constant 1/\|\psi_{j,k;l,m}\|^2 plays the role of averaging (provided that the white and black supports are of the same size).
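To make the white-black differencing concrete, here is a hedged sketch of a single vertical 'edge' feature computed through the integral image (our own illustration; function names and the particular half-split layout are our assumptions, one of several template layouts):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums with a zero guard row/column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x1, y1, x2, y2):
    """O(1) sum over a rectangle (1-based, inclusive bounds)."""
    return ii[x2, y2] - ii[x1 - 1, y2] - ii[x2, y1 - 1] + ii[x1 - 1, y1 - 1]

def edge_feature(ii, x1, y1, x2, y2):
    """Vertical 'edge' Haar-like feature: difference in average intensity
    between the top (white, +1) and bottom (black, -1) halves of the
    rectangle, each half read out with one O(1) rectangle sum."""
    xm = (x1 + x2) // 2                   # split row (equal halves assumed)
    area = (xm - x1 + 1) * (y2 - y1 + 1)  # pixels per half
    white = rect_sum(ii, x1, y1, xm, y2)
    black = rect_sum(ii, xm + 1, y1, x2, y2)
    return (white - black) / area
```

For a window whose top half is bright and bottom half dark (e.g. forehead over eyes), the feature responds strongly; a handful of such rectangle sums replaces a pixel-wise inner product with the template.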

Constant-Time Fourier Moments via Integral Images
Consider the following approximation, by a partial Fourier sum, of an image fragment restricted to a rectangle spanning from (x_1, y_1) to (x_2, y_2):

i(x, y) \approx \sum_{-n \le k_x, k_y \le n} c^{k_x,k_y}_{x_1,y_1,x_2,y_2} \, e^{\mathrm{i} 2\pi \left( k_x (x - x_1)/N_x + k_y (y - y_1)/N_y \right)},    (4)

where: n is the harmonic order of approximation (variable-wise), \mathrm{i} = \sqrt{-1} is the imaginary unit (please note the calligraphic difference from i denoting the image), the coefficients c are complex numbers, and N_x = x_2 - x_1 + 1, N_y = y_2 - y_1 + 1 are the rectangle widths in pixels. The superscripts k_x, k_y of the c coefficients indicate the particular harmonic indexes, while the subscripts represent the boundaries of the rectangle. In the current context the boundaries are fixed, but they shall vary later, when partitioning of the detection window becomes involved. To avoid confusion, we explain that throughout the paper N_x, k_x and similar subscript notations should not be treated as functions of the specific subscript value, but rather as an indication of what coordinate the quantity is associated with.
Due to the orthogonality of the Fourier bases, the optimal complex coefficients for (4) can be derived as

c^{k_x,k_y}_{x_1,y_1,x_2,y_2} = \frac{1}{N_x N_y} \sum_{x_1 \le x \le x_2} \sum_{y_1 \le y \le y_2} i(x, y) \, e^{-\mathrm{i} 2\pi \left( k_x (x - x_1)/N_x + k_y (y - y_1)/N_y \right)}.    (5)

From now on, we shall refer to the coefficients as Fourier moments, and we intend to use their real and imaginary parts as features for learning and detection.
Let us define two collections of trigonometric integral images,

ii^{k_x,k_y}_{\cos\,N_x,N_y}(x, y) = \sum_{1 \le j \le x} \sum_{1 \le k \le y} i(j, k) \cos\big( 2\pi (k_x j / N_x + k_y k / N_y) \big),    (6)

ii^{k_x,k_y}_{\sin\,N_x,N_y}(x, y) = \sum_{1 \le j \le x} \sum_{1 \le k \le y} i(j, k) \sin\big( 2\pi (k_x j / N_x + k_y k / N_y) \big),    (7)

where the indexes (k_x, k_y) iterate over the set:

\{(0, 0)\} \cup \{(k_x, k_y) : 1 \le k_x \le n,\ -n \le k_y \le n\} \cup \{(0, k_y) : 1 \le k_y \le n\}.    (8)

We remark that a single integral image of form (6) or (7) can be calculated by induction in linear time with respect to the total number of pixels in the input image (i.e. with one pass).
Let us now define the growth operator for any integral image ii taken from either of the sets \{ii_{\cos}\}, \{ii_{\sin}\}:

\Delta(ii;\, x_1, y_1, x_2, y_2) = ii(x_2, y_2) - ii(x_1 - 1, y_2) - ii(x_2, y_1 - 1) + ii(x_1 - 1, y_1 - 1).

Note that \Delta returns a subsum over a given cuboid in constant time, using 2 subtractions and 1 addition, instead of \Theta(N_x N_y) operations.
The following proposition constitutes the main contribution of the paper.

Proposition 1. Suppose the two sets of integral images, \{ii^{k_x,k_y}_{\cos\,N_x,N_y}\} and \{ii^{k_x,k_y}_{\sin\,N_x,N_y}\}, defined as in (6) and (7), have been prepared prior to the scanning procedure. Then, each Fourier moment c^{k_x,k_y}_{x_1,y_1,x_2,y_2} can be extracted in constant time as:

\mathrm{Re}\, c^{k_x,k_y}_{x_1,y_1,x_2,y_2} = \frac{1}{N_x N_y} \Big( \cos\big(2\pi(k_x x_1/N_x + k_y y_1/N_y)\big)\, \Delta\big(ii^{k_x,k_y}_{\cos\,N_x,N_y};\, x_1, y_1, x_2, y_2\big) + \sin\big(2\pi(k_x x_1/N_x + k_y y_1/N_y)\big)\, \Delta\big(ii^{k_x,k_y}_{\sin\,N_x,N_y};\, x_1, y_1, x_2, y_2\big) \Big),    (10)

\mathrm{Im}\, c^{k_x,k_y}_{x_1,y_1,x_2,y_2} = \frac{1}{N_x N_y} \Big( \sin\big(2\pi(k_x x_1/N_x + k_y y_1/N_y)\big)\, \Delta\big(ii^{k_x,k_y}_{\cos\,N_x,N_y};\, x_1, y_1, x_2, y_2\big) - \cos\big(2\pi(k_x x_1/N_x + k_y y_1/N_y)\big)\, \Delta\big(ii^{k_x,k_y}_{\sin\,N_x,N_y};\, x_1, y_1, x_2, y_2\big) \Big).    (11)

As one can note, both parts, real (10) and imaginary (11), require the calculation of two growth operations and two trigonometric functions. It is easy to check that this comprises a total of 21 operations: 8 additions (or subtractions), 8 multiplications, 3 divisions, and 2 trigonometric functions for either of the two formulas. Note that it is sufficient to calculate the argument under the trigonometric functions only once. Furthermore, it is worth noting that this argument depends on the offset (x_1, y_1) of the rectangle, but does not depend on the rectangle contents (the pixels), thereby making the overall calculation a constant-time one. The proof of the proposition is a straightforward derivation.
Proof. Rewriting the moments from (5) using Euler's identity leads to:

c^{k_x,k_y}_{x_1,y_1,x_2,y_2} = \frac{1}{N_x N_y} \sum_{x_1 \le x \le x_2} \sum_{y_1 \le y \le y_2} i(x, y) \Big( \cos\big(2\pi(k_x (x - x_1)/N_x + k_y (y - y_1)/N_y)\big) - \mathrm{i} \sin\big(2\pi(k_x (x - x_1)/N_x + k_y (y - y_1)/N_y)\big) \Big).    (12)

The argument of the trigonometric functions can be parted into a group of terms independent of the pixel index (x, y) and a group dependent on it, as follows:

2\pi\big(k_x (x - x_1)/N_x + k_y (y - y_1)/N_y\big) = \underbrace{-2\pi(k_x x_1/N_x + k_y y_1/N_y)}_{\alpha} + \underbrace{2\pi(k_x x/N_x + k_y y/N_y)}_{\beta}.

Now, one can apply in (12) the trigonometric identities for \cos(\alpha+\beta) and \sin(\alpha+\beta). Simultaneously, the \cos\alpha and \sin\alpha terms can be pulled out as factors in front of the summations, as they are independent of the pixel index (x, y). Finally, by splitting the expression into real and imaginary parts, one obtains:

\mathrm{Re}\, c^{k_x,k_y}_{x_1,y_1,x_2,y_2} = \frac{1}{N_x N_y} \Big( \cos\alpha \underbrace{\sum\sum i(x, y) \cos\beta}_{\Delta(ii^{k_x,k_y}_{\cos\,N_x,N_y};\, x_1, y_1, x_2, y_2)} - \sin\alpha \underbrace{\sum\sum i(x, y) \sin\beta}_{\Delta(ii^{k_x,k_y}_{\sin\,N_x,N_y};\, x_1, y_1, x_2, y_2)} \Big),

\mathrm{Im}\, c^{k_x,k_y}_{x_1,y_1,x_2,y_2} = -\frac{1}{N_x N_y} \Big( \sin\alpha\, \Delta\big(ii^{k_x,k_y}_{\cos\,N_x,N_y};\, x_1, y_1, x_2, y_2\big) + \cos\alpha\, \Delta\big(ii^{k_x,k_y}_{\sin\,N_x,N_y};\, x_1, y_1, x_2, y_2\big) \Big).

The underbraces show how the expensive summations over pixels get replaced by cheap (constant-time) growths of integral images. Since \cos\alpha = \cos\big(2\pi(k_x x_1/N_x + k_y y_1/N_y)\big) and \sin\alpha = -\sin\big(2\pi(k_x x_1/N_x + k_y y_1/N_y)\big), this yields (10) and (11).
The form of the index set (8) is implied by the conjugate symmetry of the moments, c^{-k_x,-k_y}_{\cdot} = \overline{c^{k_x,k_y}_{\cdot}}, and also by the fact that the zeroth order moment is a real number: \mathrm{Im}(c^{0,0}_{\cdot}) = 0. Hence, it suffices to calculate roughly only a half of all the moments. More precisely, the effective number of distinct moments is

\big((2n + 1)^2 - 1\big)/2 + 1,    (15)

which yields 2n^2 + 2n + 1 and corresponds to the size of set (8). In fact, any such set of 2n^2 + 2n + 1 coefficients will do to uniquely reconstruct all the coefficients. As regards the needed number of integral images, it is equal to the double of expression (15), yielding (2n + 1)^2 + 1, since two kinds of integral images, related to the cosine and sine functions, are required for each (k_x, k_y) pair. Hence, the calculation of all the cumulants is potentially expensive. That is why, when using Proposition 1, one should in practice limit oneself to low harmonic orders, so that the time invested in the preparation of integral images remains reasonably small.
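To make the mechanics of Proposition 1 concrete, here is a hedged NumPy sketch (our own illustration, not the authors' code): we build the cosine and sine cumulants for one harmonic pair, extract Re c and Im c for an arbitrary rectangle via two growth operations and two trigonometric calls, and the result can be checked against the moment computed directly from definition (5).

```python
import numpy as np

def trig_integral_images(img, kx, ky, Nx, Ny):
    """Cumulants of img * cos(...) and img * sin(...), cf. (6) and (7).
    Pixel indices are 1-based in the formulas; a zero guard row/column
    is prepended so that ii[x1 - 1, y1 - 1] lookups stay valid."""
    H, W = img.shape
    j = np.arange(1, H + 1)[:, None]
    k = np.arange(1, W + 1)[None, :]
    arg = 2 * np.pi * (kx * j / Nx + ky * k / Ny)
    ii_cos = np.zeros((H + 1, W + 1))
    ii_sin = np.zeros((H + 1, W + 1))
    ii_cos[1:, 1:] = np.cumsum(np.cumsum(img * np.cos(arg), axis=0), axis=1)
    ii_sin[1:, 1:] = np.cumsum(np.cumsum(img * np.sin(arg), axis=0), axis=1)
    return ii_cos, ii_sin

def growth(ii, x1, y1, x2, y2):
    """Growth operator: constant-time subsum (2 subtractions, 1 addition)."""
    return ii[x2, y2] - ii[x1 - 1, y2] - ii[x2, y1 - 1] + ii[x1 - 1, y1 - 1]

def fourier_moment(ii_cos, ii_sin, kx, ky, Nx, Ny, x1, y1, x2, y2):
    """Constant-time (Re c, Im c) for a rectangle, cf. (10) and (11).
    The trigonometric argument depends only on the offset (x1, y1),
    never on the pixels under the rectangle."""
    alpha = 2 * np.pi * (kx * x1 / Nx + ky * y1 / Ny)
    dc = growth(ii_cos, x1, y1, x2, y2)
    ds = growth(ii_sin, x1, y1, x2, y2)
    re = (np.cos(alpha) * dc + np.sin(alpha) * ds) / (Nx * Ny)
    im = (np.sin(alpha) * dc - np.cos(alpha) * ds) / (Nx * Ny)
    return re, im
```

Note the amortization at work: one pair of cumulants serves every rectangle of size Nx x Ny in the image, so the per-window cost stays constant no matter how many windows the scan visits.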

Window Partitioning - Piecewise Approximations
Apart from n, let us now introduce an additional integer parameter p > 0, responsible for the partitioning of the detection window and affecting the final number of features. Let the window be partitioned into a regular p × p grid of rectangles. The moments shall be extracted from each rectangle independently, and their concatenation shall form the final vector of features. This approach can be understood as a piecewise Fourier approximation of the window under detection.
Consider a single image pass with a detection window of size w_x × w_y. The partitioning leads to a grid of pieces with widths equal to: N_x = \lfloor w_x / p \rfloor, N_y = \lfloor w_y / p \rfloor. We denote the corresponding division remainders as: m_x = w_x mod p, m_y = w_y mod p. Now, for a window starting at a point (x_1, y_1) and for fixed numbers N_x, N_y, we define the collection of features f^{x_1,y_1}_{N_x,N_y}(\cdots) as follows:

f^{x_1,y_1}_{N_x,N_y}(k_x, k_y, p_x, p_y, r) =
\begin{cases}
\mathrm{Im}\, c^{k_x,k_y}_{x_1' + p_x N_x,\; y_1' + p_y N_y,\; x_1' + (p_x + 1) N_x - 1,\; y_1' + (p_y + 1) N_y - 1}, & r = 0, \\
\mathrm{Re}\, c^{k_x,k_y}_{x_1' + p_x N_x,\; y_1' + p_y N_y,\; x_1' + (p_x + 1) N_x - 1,\; y_1' + (p_y + 1) N_y - 1}, & r = 1,
\end{cases}

where: (x_1', y_1') = (x_1 + \lfloor m_x/2 \rfloor, y_1 + \lfloor m_y/2 \rfloor) represents a shifted starting point, taking into account small corrections due to the partitioning remainders; the indexes k_x, k_y iterate over the set defined in (8); 0 \le p_x, p_y \le p - 1 represent the index (and hence the offset) of a particular rectangle; and r is a flag switching between the imaginary (r = 0) and real (r = 1) parts. Since \mathrm{Im}\, c^{0,0}_{\cdot} = 0, the value f(0, 0, p_x, p_y, 0) is also zero for any (p_x, p_y) pair, and therefore should not be taken as an actual feature. Finally, the total number of features is: d(n, p) = (2n + 1)^2 p^2.

Figures 2 and 3 show example reconstructions of an image from Fourier moments. The reconstructions are carried out according to formula (4) (piecewise reconstructions for p > 1). Obviously, image reconstruction as such is not a needed step in a detection procedure; yet, the quality of the reconstructions helps to understand the descriptive capability of the features. Under each reconstruction we report the mean absolute error (MAE) and the ratio of the number of features to the number of pixels (feats/pxs).

To compare accuracy, we have introduced opponents for our Fourier-based detectors, namely additional detectors trained on the same learning material but using Haar-like features. Our intention was to impose a similar feature space parameterization in both approaches, but to slightly favour the Haar-like features in terms of their quantity. To achieve this, we used the 5 Haar templates discussed earlier (Fig. 1) and anchored the Haar-like features within the detection window on p × p grids (p = 5, 7); hence, the grids were of the same sizes as in the case of Fourier moments. The lengths of the Haar-like features were scaled independently along each axis, and the number of scales was controlled by an additional parameter q > 0.
More precisely, for a window of size w_x × w_y, the lengths of the features were changing according to: w_x \lambda^{s_x} and w_y \lambda^{s_y}, with 1 \le s_x, s_y \le q and the scaling factor chosen as \lambda = \sqrt{2}/2. Hence, the final number of generated Haar-like features became d_{HF}(q, p) = 5 q^2 p^2. Recall that the corresponding number for Fourier moments is d_{FM}(n, p) = (2n + 1)^2 p^2.
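As a quick sanity check of this combinatorics and of the grid geometry (our own sketch; the function names are illustrative, not from the paper):

```python
def partition(wx, wy, p, x1, y1):
    """Grid geometry for a p x p partition of a wx-by-wy window anchored
    at (x1, y1): piece widths, division remainders, and the shifted
    starting point that centers the grid inside the window."""
    Nx, Ny = wx // p, wy // p                # piece widths
    mx, my = wx % p, wy % p                  # division remainders
    x1s, y1s = x1 + mx // 2, y1 + my // 2    # shifted starting point
    return Nx, Ny, mx, my, x1s, y1s

def d_fm(n, p):
    """Number of Fourier-moment features: (2n + 1)^2 * p^2."""
    return (2 * n + 1) ** 2 * p ** 2

def d_hf(q, p):
    """Number of Haar-like features: 5 * q^2 * p^2."""
    return 5 * q ** 2 * p ** 2
```

For instance, d_fm(3, 7) gives the 2 401 features of the richest Fourier variant reported later, while d_hf(2, 7) gives 980 Haar-like features for the same grid.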

Face Detection Experiments
A learning material of moderately large size was used. It contained 7 258 positive examples (windows with faces marked manually in 3 000 images) and 100 000 negative examples (windows sampled randomly from non-face images). Accuracy measures were evaluated on test data consisting of 500 images with 1 000 faces and a total of 70 252 859 windows. In order to produce ROC curves for the detectors on the basis of the test material, a test set with a limited number of negatives was randomly selected (to fit in RAM). We imposed 2 · 10^6 negative windows in that set, thereby making the precision along the FAR (false alarm rate) axis equal to 5 · 10^{-7}. Details of the experimental setup are listed in Table 1.
We applied a boosted learning algorithm known as RealBoost+bins, see e.g. [6], with additional weight trimming [3]. In this variant, an ensemble consists of partial (weak) classifiers, each based on a single selected feature. The classifiers' responses are real-valued, equal to half the logit transform, and a binning mechanism is introduced to store those responses. We set up 8 bins of equal widths per feature. Finally, T = 256 or T = 512 rounds of boosting were carried out, yielding ensembles with at most T distinct features selected.
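To illustrate the kind of weak classifier involved, here is a hedged sketch (our own simplification of a RealBoost-with-bins learner, not the authors' implementation): a feature's range is split into 8 equal-width bins, and each bin stores half the logit of the weighted positive-to-negative mass ratio, smoothed by a small eps to avoid log(0).

```python
import numpy as np

class BinnedWeakClassifier:
    """Weak classifier over one feature: per-bin responses equal to half
    the logit transform of the weighted class masses (a common
    RealBoost-with-bins formulation)."""

    def __init__(self, n_bins=8, eps=1e-9):
        self.n_bins = n_bins
        self.eps = eps

    def fit(self, feature, labels, weights):
        """feature: 1-D feature values; labels: +1/-1; weights: boosting weights."""
        self.lo, self.hi = feature.min(), feature.max()
        bins = self._bin(feature)
        w_pos = np.bincount(bins, weights * (labels == 1), minlength=self.n_bins)
        w_neg = np.bincount(bins, weights * (labels == -1), minlength=self.n_bins)
        # half the logit transform of the weighted class ratio, per bin
        self.response = 0.5 * np.log((w_pos + self.eps) / (w_neg + self.eps))
        return self

    def _bin(self, feature):
        """Map feature values to equal-width bin indices in [0, n_bins - 1]."""
        idx = (feature - self.lo) / (self.hi - self.lo + 1e-12) * self.n_bins
        return np.clip(idx.astype(int), 0, self.n_bins - 1)

    def predict(self, feature):
        """Real-valued response; its sign is the class vote."""
        return self.response[self._bin(feature)]
```

In boosting, each round would select the feature whose weak classifier best reduces the weighted loss, then reweight the examples; weight trimming drops the lowest-weight examples from subsequent rounds to save time.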
The software was programmed in C#, with the key procedures (e.g. integral images, feature extraction) implemented for efficiency in C++ as DLL libraries.
We start the review of the results by showing in Fig. 4 some example outcomes produced by a Fourier-based detector (variant: n = 3, p = 7). The left-hand side images contain all single positive indications. Their counterparts on the right-hand side are postprocessed outcomes, i.e. after grouping of window clusters has been performed. Figure 5 shows some examples of false alarms, interesting because of their resemblance to faces (we encourage the reader to zoom in on the document). ROC curves for all detectors are presented in Fig. 6. To distinguish the curves better, a logarithmic scale was imposed on the FAR axis. Operational decision thresholds for the detectors were taken as averages of the threshold values registered for the two left-most points on the ROCs, with the smallest FAR values (≈5 · 10^{-7}).
Detailed accuracy results are reported in Table 2. In particular, AUC (area under ROC) measures are stated. In learning tasks with strongly imbalanced classes, it is the left-most part of the ROC curve that is of crucial importance. Therefore, we decided to report normalized AUCs obtained up to the several first orders of magnitude of FAR. More precisely, AUC_\alpha should be understood as \frac{1}{\alpha} \int_0^\alpha s(f) \, df, where s and f represent sensitivity and FAR, respectively. The test results clearly show that detectors based on Fourier moments surpass in accuracy their counterparts based on Haar-like features, even though fewer features were at their disposal at the learning stage. This is experimental evidence that Fourier moments have better approximation properties for the face detection problem. Naturally, the best (and definitely satisfactory) accuracy was achieved by the variant with the richest space, consisting of 2 401 features (n = 3, p = 7). Please note that the number 2.4 · 10^3 is decidedly smaller than the total of 1.8 · 10^5 features originally used by Viola and Jones in their experiment [9]. Table 3 reports the time performance achieved on our machine (Intel i7 Q 720, 4 × 2-core, 1.60 GHz CPU). Please remember that cascades of classifiers were purposely not involved in the experiment. The observed amortized extraction time for a single Fourier feature was approximately 37 ns, about two times longer than for a Haar-like feature. This is related to the 21 operations declared in Proposition 1, and is roughly proportional to the number of operations needed for Haar-like features (8 operations for an 'edge' template, 12 for a 'diagonal' template). We remark that by involving a cascade of classifiers and a processor with more cores/threads (to save time on the preparation of integral images), the real-time regime could be achieved without difficulty in both approaches (FM and HF).

Conclusions
We have proposed a computational technique, based on special integral images, for constant-time extraction of low order Fourier moments. The technique is suitable for detection tasks. Our experiments have shown that fairly small sets of Fourier features (real and imaginary parts of moments) can lead to face detectors superior in accuracy to Haar-based detectors. The proposed approach could be beneficial in other machine learning applications where accuracy is of primary importance (e.g. medical diagnosis, image-based fault detection in production, landmine detection [4,5]), and where one is willing to invest some additional time in the preparation of special integral images in order to improve accuracy.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.