
1 Introduction

Phase Correlation (PhC) is one of the four classical methods for local motion estimation, together with discrete matching (a.k.a. block matching), differential matching, and spatio-temporal optical flow measurement (structure tensor). Although the fundamentals of PhC date back to the 1970s [2, 4, 5], the precise relations between the listed families of approaches have not been analyzed thoroughly in the literature so far. Depending on the characteristics of the data to be processed, and to some degree also on the scientific community in question (computer vision, geophysical data analysis, time delay estimation, ...), different families of methods are preferred for the task of estimating displacements or 2D motion. For instance, [6, 7] are early papers proposing the normalized cross-correlation metric; how to optimize such a metric is a separate issue. Discrete matching stands in contrast to differential approaches, led by the classic Lucas & Kanade method [1].

Fig. 1.

Exemplary result of our enhanced PhC on data from the Middlebury Stereo Dataset. Three different motions (depths) are present in the marked cell (brush, bust and background), causing 3 distinct peaks in the delta array.

Since the Fourier transform is the main element of the PhC method, it shows strong robustness against geometric and photometric distortions [8–10]. Like the classical differential matching schemes, the PhC method, with certain extensions, can achieve subpixel matching accuracy [11]; accuracy of better than 1/100 pixel was claimed by [12, 13]. Additional details regarding different variants of PhC algorithms can be found in [10, 14–17, 21, 23]. Some of them describe different ways to achieve subpixel matching accuracy; others emphasize the advantages of PhC for estimating homogeneous displacements over larger images (image registration). Some recent papers considered the use of PhC-based stereo algorithms for remote sensing tasks applied to aerial imagery [18] and for interferometric SAR image co-registration [19]. We refer also to [20], where the PhC stabilizes video sequences against illumination changes and camera shaking. Besides some completely novel approaches, we extend in the present paper several ideas that already appear in [20] and put them on a more systematic basis.

We emphasize that the method presented here does not aim at the computation of dense motion fields, but a) makes the classical PhC robust, and b) extends the PhC method towards being able to obtain distributions of the motion vectors that appear in a given patch. In applications where the patch is assumed to be subjected to a homogeneous translational motion (image registration), this is already the desired result, whereas for complex motion fields these distributions give valuable prior information that allows one to systematically initialize and guide a subsequent sparse or (semi-)dense motion estimation procedure.

2 Approach

This section embeds the plain PhC method, as described in the literature, into a framework that checks for potentially problematic situations (due to invalid or ambiguous input data) and performs a series of self-checks and filtering steps that are necessary to employ the method in an autonomous mode without user intervention. We provide solid and proven procedures for tuning the different parameters that appear in the enhanced PhC method. The PhC method and the proposed extensions are described here for one-dimensional signals; the generalization to higher dimensions is straightforward.

Let \(y[x_{n}]\) and \(z[x_{n}]\) be two observations of the same discrete signal \(s[x_{n}]\), where \(z[x_{n}]\) contains a shift by a displacement d:

$$\begin{aligned} y[x_{n}] = s[x_{n}]&\quad \text {and} \quad z[x_{n}] = s[x_{n}] * \delta [x_{n}-d] = s[x_{n}-d]. \end{aligned}$$
(1)

The orthonormal Fourier transform over a discrete area of size N yields:

$$\begin{aligned} Y[f_{k}] = S[f_{k}]&\quad \text {and} \quad Z[f_{k}] = S[f_{k}] \cdot \frac{1}{\sqrt{N}} \cdot \exp (-2\pi i \cdot \frac{f_{k} \cdot d}{N}). \end{aligned}$$
(2)

For further examination, we isolate the displacement and frequency dependent phase shift between the two signals and introduce the cross-power spectrum \(P[f_{k}]\) and its inverse Fourier transform, the delta array \(p[x_{n}]\):

$$\begin{aligned} P[f_{k}]&= \frac{\overline{Y[f_{k}]} \cdot Z[f_{k}]}{\left| \overline{Y[f_{k}]} \cdot Z[f_{k}]\right| } = \exp (-2\pi i \cdot \frac{f_{k} \cdot d}{N}), \end{aligned}$$
(3)
$$\begin{aligned} p[x_{n}]&= {\mathcal {F}}^{-1}\left( P[f_{k}]\right) = \sqrt{N} \cdot \delta [x_{n}-d]. \end{aligned}$$
(4)

The delta array \(p[x_{n}]\) consists of an ideal \(\delta \)-impulse which indicates the relative shift between the two signals \(y[x_{n}]\) and \(z[x_{n}]\). In a realistic setting, with noise, multiple motions, and without periodicity of the images, the delta array is more complex and needs to be analyzed in detail to obtain reliable results.
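The pipeline of Eqs. (1)–(4) can be sketched in a few lines of NumPy. `phase_correlation_1d` is a hypothetical helper, not the authors' implementation, and assumes a noise-free, circularly shifted input:

```python
import numpy as np

def phase_correlation_1d(y, z):
    """Plain 1-D phase correlation: returns |p[x_n]|, the delta array.

    The shift d between y and z shows up as a peak at index d.
    """
    Y = np.fft.fft(y, norm="ortho")                # orthonormal Fourier transform
    Z = np.fft.fft(z, norm="ortho")
    cross = np.conj(Y) * Z                         # isolates exp(-2*pi*i*f_k*d/N)
    P = cross / np.maximum(np.abs(cross), 1e-12)   # cross-power spectrum
    p = np.fft.ifft(P, norm="ortho")               # delta array
    return np.abs(p)

# Shift a random signal circularly by d = 7 and recover the peak position.
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
z = np.roll(s, 7)                                  # z[x_n] = s[x_n - 7]
p = phase_correlation_1d(s, z)
print(int(np.argmax(p)))                           # prints 7
```

For an ideal circular shift, all of the unit energy of the cross-power spectrum collapses into the single peak, which is what the later self-checks exploit.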

In the following Sects. 2.1–2.4 we introduce several checks and filtering steps which must be performed to let the PhC actually yield reliable and precise results. Steps which need to be applied separately for both patches (\(y[x_{n}]\), \(Y[f_{k}]\) or \(z[x_{n}]\), \(Z[f_{k}]\)) are only denoted for the first patch (second patch accordingly).

2.1 Structure Check

First we check if both image patches show sufficient structure to allow the displacement estimation. We compute the gray scale variance of the patches in a weighted manner using the weights \(w[x_{n}]\) of the anti-leakage window:

$$\begin{aligned} \hat{\sigma }^{2}&= \left( \sum _{n = 1}^{N} w[x_{n}] \right) ^{-1} \sum _{n = 1}^{N} w[x_{n}] \cdot \left( y[x_{n}] - \hat{\mu } \right) ^{2} \text{ with } \end{aligned}$$
(5)
$$\begin{aligned} \hat{\mu }&= \left( \sum _{n = 1}^{N} w[x_{n}] \right) ^{-1} \sum _{n = 1}^{N} w[x_{n}] \cdot y[x_{n}]. \end{aligned}$$
(6)

Then we compare it against a threshold \(\tau _{1}\) which was experimentally determined:

$$\begin{aligned} \hat{\sigma }^{2}&\ge \tau _{1}. \end{aligned}$$
(7)

In our experiments with different datasets we found \(\tau _{1} \approx 90\) to be a good threshold to distinguish between structured and unstructured patches. Of course, this value varies with the noise level of the input images. Due to the normalization of the weights \(w[x_{n}]\), it is independent of the chosen patch size.
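The structure check can be sketched as follows; `has_structure`, the Hann window, and the toy patches are illustrative assumptions, with \(\tau _1 = 90\) taken from the text:

```python
import numpy as np

def has_structure(y, w, tau1=90.0):
    """Weighted gray-value variance test on a patch y with window weights w."""
    w = w / np.sum(w)                  # normalize the anti-leakage weights
    mu = np.sum(w * y)                 # weighted mean
    var = np.sum(w * (y - mu) ** 2)    # weighted variance
    return var >= tau1

rng = np.random.default_rng(1)
flat = np.full(64, 128.0)              # unstructured: constant gray value
textured = 128.0 + 40.0 * rng.standard_normal(64)
w = np.hanning(64)                     # e.g. a Hann anti-leakage window
print(has_structure(flat, w), has_structure(textured, w))   # prints False True
```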

2.2 Spectral Significance Filtering

After the transition to the frequency domain, we need to identify those significant spectral coefficients \(Y[f_{k}]\) and \(Z[f_{k}]\) which represent the main structure of the image patches and thus allow us to determine the displacement d. Therefore we need to suppress the influence of the DC (\(f_{k}=0\)) spectral component of the signal (mean value compensation) as well as the components whose spectral magnitudes are dominated by noise (noise suppression).

Mean Compensation. Since most of the structural information of the image is encoded in the low-frequency AC (\(f_{k} \ne 0\)) spectral components, it is important to compensate for the gray-value mean before the anti-leakage window \(w[x_{n}]\) is applied. Otherwise, these low-frequency components would be superimposed by the gray-value mean of the original image patch when the convolution with the Fourier transform of the anti-leakage window \(w[x_{n}]\) is performed:

$$\begin{aligned} Y[{f_{k}}] = {\mathcal {F}}\left( w[{x_{n}}] \cdot (y[{x_{n}}] - \left\langle y[x_{n}] \right\rangle ) \right) . \end{aligned}$$
(8)
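A minimal sketch of the mean compensation of Eq. (8), assuming a Hann window as the anti-leakage window; `windowed_spectrum` is an invented helper name:

```python
import numpy as np

def windowed_spectrum(y, w):
    """Subtract the patch mean first, then window and transform (Eq. 8)."""
    y_ac = y - np.mean(y)                       # remove the DC part
    return np.fft.fft(w * y_ac, norm="ortho")   # orthonormal FFT

y = np.array([10.0, 12.0, 9.0, 11.0, 10.5, 9.5, 11.5, 10.0])
w = np.hanning(len(y))
Y = windowed_spectrum(y, w)
# Since the DC part was removed before windowing, the AC structure
# dominates the spectrum instead of being buried under the mean.
print(abs(Y[0]) < abs(Y[1:]).max())             # prints True
```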

Noise Suppression. We also need to suppress those spectral components of \(Y[f_{k}]\) and \(Z[f_{k}]\) whose magnitudes are on the order of the noise floor, because their phases are dominated by noise and do not carry any information. To do so, we compute the distributions of the magnitudes \({|Y[f_{k}]|}\) and \({|Z[f_{k}]|}\) and look for the first interval which is mainly dominated by noise. For a fast approximation, we compute the mean \(\tau _2\) of those magnitudes which lie in the smaller half of the magnitude distribution. In general, \(\tau _2\) may be slightly too large, but this effect is negligible.

$$\begin{aligned} {|Y[f_{k}]|}&> \tau _{2}. \end{aligned}$$
(9)
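The fast noise-floor approximation might look as follows; the function name and the toy spectrum are illustrative, and, as noted above, the resulting threshold can be slightly generous:

```python
import numpy as np

def significant_components(Y):
    """Mask of spectral coefficients above the approximated noise floor."""
    mags = np.abs(Y)
    smaller_half = np.sort(mags)[: len(mags) // 2]
    tau2 = smaller_half.mean()     # mean of the smaller half of the magnitudes
    return mags > tau2

# Three strong coefficients over a weak noise floor.
Y = np.array([10.0, 0.1, 0.2, 8.0, 0.15, 0.05, 6.0, 0.1])
mask = significant_components(Y)
print(mask.sum())                  # the strong (and a few borderline) survive
```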

2.3 Delta Array Check

After significance filtering has been applied, the cross-power spectrum \(P[f_{k}]\) and the delta array \(p[x_{n}]\) are computed for significant components (see Eqs. 3 and 4). The inverse Fourier transform is an orthonormal transformation, thus:

$$\begin{aligned} \sum _{k=1}^{N}{|P[f_{k}]|}^{2} = \overbrace{N_{\text {sig}}}^{\text {no. of significant components}} =\sum _{n=1}^{N} {|p[x_{n}]|}^{2} \le N. \end{aligned}$$
(10)

In the ideal case, all the energy concentrates in one \(\delta \)-impulse which represents the displacement d. Hence we are only interested in those values of \(p[x_{n}]\) which hold a significant amount of the energy known beforehand (see Eq. 10) and thus represent a dominant motion. The other values, which possess a much lower energy, are suppressed by computing a threshold \(\tau _{3}\) based on the histogram of the distribution of \({|p[x_{n}]|}^{2}\). We set the histogram range to \([0, N_{\text {sig}}]\) (see Eq. 10), the number of bins to the geometric mean \( m_{win} \) of the window side lengths, and \(\tau _3\) to the right border of the first bin. In our experiments we verified that energies which represent a relevant motion always lie above this threshold. The check fails if the energies of all values of the delta array lie below \(\tau _3\).

$$\begin{aligned} {|p[x_{n}]|}^{2}&> \tau _{3}. \end{aligned}$$
(11)
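Under the stated histogram construction, \(\tau _3\) reduces to \(N_{\text {sig}}\) divided by the number of bins \(m_{win}\). A sketch with an invented helper and a toy delta array (the energies are not meant to satisfy Eq. 10 exactly):

```python
import numpy as np

def delta_array_threshold(p, n_sig, win_shape=(128, 64)):
    """tau3 = right border of the first histogram bin over |p|^2."""
    m_win = int(round(np.sqrt(win_shape[0] * win_shape[1])))  # geometric mean
    tau3 = n_sig / m_win          # range [0, N_sig] split into m_win bins
    energy = np.abs(p) ** 2
    return energy > tau3, tau3

# Toy delta array: one dominant peak over a weak noise floor.
p = np.full(8192, 0.01)
p[1234] = 10.0                    # dominant motion peak
mask, tau3 = delta_array_threshold(p, n_sig=4096)
print(mask.sum(), int(np.flatnonzero(mask)[0]))   # prints 1 1234
```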

2.4 Delta Array Clustering

So far, the tests were described for the one-dimensional case, but for the next check we need the actual two-dimensional representation of the signal. Therefore, the delta array is written as \(p[\mathbf {x}_{n}]\). In the absence of noise, the inverse Fourier transform of \(P[\mathbf {f}_{k}]\) contains a single \(\delta \)-peak, or multiple \(\delta \)-peaks in case of multiple motions. For real data, these peaks are smeared out and there is some background noise in the delta array. Hence we only examine the significant values (Eq. 11) of the delta array \(p[\mathbf {x}_{n}]\). We define the sets:

$$\begin{aligned} \mathcal {X}&= \left\{ \mathbf {x}_{n} : {|p[\mathbf {x}_{n}]|}^{2}> \tau _{3} \right\} \quad \text {and} \quad \mathcal {W} = \left\{ {|p[\mathbf {x}_{n}]|}^{2} : \mathbf {x}_{n} \in \mathcal {X} \right\} . \end{aligned}$$
(12)

These two sets will serve as input for a weighted K-means clustering algorithm.

Initial Phase. The first mean chosen is the point with the largest weight. We iteratively determine \(K-1\) further candidates as the points with the largest cumulative Euclidean distance to the already chosen ones. A set of K covariance matrices \(\mathbf {\Sigma }_{k}\) is initialized with two-dimensional identity matrices.

Labeling Phase. For each point we calculate the Mahalanobis distance to the current K means \(\mathbf {m}_{k}\) and assign the point to the cluster of the mean with minimum distance. Subsequently, we compute means and covariance matrices of the updated clusters using the values of the delta array as weights. This is repeated until either the clusters converge or a predefined maximum number of iterations is reached.
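The two phases above can be sketched as a weighted K-means with Mahalanobis distances. This is a simplified reading of the algorithm, with invented names and a small covariance regularizer added for numerical stability:

```python
import numpy as np

def weighted_kmeans(points, weights, K, max_iter=50):
    """Weighted K-means on 2-D points with Mahalanobis labeling."""
    # Initial phase: first mean = point with the largest weight, the rest
    # maximize the cumulative Euclidean distance to the already chosen ones.
    means = [points[np.argmax(weights)]]
    for _ in range(K - 1):
        dists = sum(np.linalg.norm(points - m, axis=1) for m in means)
        means.append(points[np.argmax(dists)])
    means = np.array(means, dtype=float)
    covs = np.array([np.eye(2)] * K)

    for _ in range(max_iter):
        # Labeling phase: assign each point to the Mahalanobis-nearest mean.
        d2 = np.stack([
            np.einsum("ni,ij,nj->n", points - m, np.linalg.inv(c), points - m)
            for m, c in zip(means, covs)
        ])
        labels = np.argmin(d2, axis=0)
        new_means = means.copy()
        for k in range(K):
            sel = labels == k
            if not np.any(sel):
                continue
            wk = weights[sel] / weights[sel].sum()
            new_means[k] = wk @ points[sel]                 # weighted mean
            diff = points[sel] - new_means[k]
            covs[k] = (wk[:, None] * diff).T @ diff + 1e-6 * np.eye(2)
        if np.allclose(new_means, means):                   # convergence
            break
        means = new_means
    return means, covs, labels

# Two well-separated synthetic clusters with uniform weights.
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
                 rng.normal((10, 5), 0.5, (50, 2))])
means, covs, labels = weighted_kmeans(pts, np.ones(len(pts)), K=2)
print(np.sort(means[:, 0]))      # cluster centers near x = 0 and x = 10
```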


This algorithm returns K means \(\mathbf {m}_{k}\) and covariance matrices \(\mathbf {\Sigma }_{k}\) which describe the distribution within each cluster. To find the optimal K, the algorithm is run for different values of K and a cost function which sums up the areas of the covariance ellipses and penalizes large values of K (Occam’s razor) is minimized:

$$\begin{aligned} {K_{opt}} = {\mathop {\text{ arg } \text{ min }}\limits _{K}} \, \sum _{k = 1}^{K} \det (\mathbf {\Sigma }_{k}) + a \cdot \exp (b\cdot K). \end{aligned}$$
(13)

We determined the values of the parameters in our experiments to be

$$\begin{aligned} a \approx 2.5 \cdot \det (\mathbf {\Sigma }_{0})&\quad \text {and} \quad b \approx 0.5, \end{aligned}$$
(14)

where \(\det (\mathbf {\Sigma }_{0})\) is the area of the covariance ellipse in the case of only one cluster (\(K = 1\)).
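The model selection of Eq. (13) can be sketched as below; here a and b are passed explicitly (the paper couples a to \(\det (\mathbf {\Sigma }_{0})\), Eq. 14), and the determinants are toy values chosen purely to illustrate the trade-off:

```python
import numpy as np

def select_k(dets_per_k, a, b=0.5):
    """dets_per_k maps K to the list of det(Sigma_k) of the K-cluster run."""
    costs = {K: sum(dets) + a * np.exp(b * K)   # ellipse areas + penalty
             for K, dets in dets_per_k.items()}
    return min(costs, key=costs.get)

# Toy determinants: one big ellipse for K=1, two tight ones for K=2,
# and no real gain for K=3, so the exponential penalty favors K=2.
dets = {1: [625.0], 2: [0.25, 0.25], 3: [0.25, 0.2, 0.2]}
print(select_k(dets, a=10.0))    # prints 2
```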

2.5 Multiresolution

Since the estimation of a relative displacement of the signal in two regarded patches is limited by the patch size and works best when most of the image content is present in both patches, the previously presented steps are performed iteratively on different resolution scales of the image. We employed a Gaussian pyramid with two levels and a scaling factor of 2 for each image dimension. We used the same patch size on both pyramid levels, performed a first motion estimation on the upper (lower-resolution) pyramid level, and transferred the result to the original scale by shifting the patch windows relative to each other according to the (correctly scaled) motion vector determined on the upper pyramid level. This way we ensure that we can also deal with large displacements.
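The coarse-to-fine scheme can be illustrated in 1-D with plain (unenhanced) phase correlation as a stand-in for the full method; `estimate_shift` and `pyramid_shift` are hypothetical helpers, and the naive subsampling assumes a sufficiently smooth signal:

```python
import numpy as np

def estimate_shift(y, z):
    """Plain PhC shift estimate: signed, integer, circular."""
    cross = np.conj(np.fft.fft(y)) * np.fft.fft(z)
    P = cross / np.maximum(np.abs(cross), 1e-12)
    d = int(np.argmax(np.abs(np.fft.ifft(P))))
    return d if d <= len(y) // 2 else d - len(y)

def pyramid_shift(y, z):
    # Upper level: crude 2x downsampling, estimate, rescale to full resolution.
    d_coarse = 2 * estimate_shift(y[::2], z[::2])
    # Original level: pre-align the second patch and refine.
    return d_coarse + estimate_shift(y, np.roll(z, -d_coarse))

rng = np.random.default_rng(3)
s = np.convolve(rng.standard_normal(512), np.ones(9) / 9, mode="same")
z = np.roll(s, 75)               # large displacement of 75 pixels
print(pyramid_shift(s, z))       # prints 75
```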

3 Experiments

Our enhanced PhC approach allows us to estimate multiple motion distributions. The proposed method is evaluated on the optical flow dataset from the KITTI Vision Benchmark Suite [22] and on the Middlebury Stereo Dataset [3]. Since the PhC, by construction, aims at determining the distribution of motion vectors rather than a dense motion field, we could not apply the metrics of these benchmarks, which expect a dense motion field. Therefore we compare our method against the PhC implementation of OpenCV, which is based on the work of Stone et al. [16], and against the ground truth data of the training datasets of the two mentioned benchmarks.

3.1 Middlebury Stereo Dataset

In this experiment, we intend to show that our proposed approach is able to estimate multiple motions within a defined patch. We also want to demonstrate that these estimates are correct and precise. However, the PhC can of course only detect motions if the moving objects show enough structure. Therefore, we chose the dataset from 2001, as its images exhibit well-structured elements.

Each of the 6 image pairs of this stereo dataset is divided into 6 centered patches of size \(128 \times 128\) pixels. These patches are shown as black rectangles in Fig. 2. Since this dataset was originally created for a stereo benchmark, the images were recorded by a left and a right camera. Thus we can assume that the captured scene is only translated horizontally, although the PhC is of course not aware of this. The provided disparity maps express exactly this behavior: objects farther away from the camera exhibit a lower displacement than objects in the near field. The disparity values (represented by the gray values) describe the 'motion' of an object between the two images. For example, two different motions are present in the first patch of the disparity map 2a. The aim of this experiment is to detect exactly these multiple motions within a patch. The total number of displacements occurring over all patches of a specific image pair is listed in the second column of Table 1. Another aspect to consider is that the objects do not necessarily lie in a fronto-parallel plane w.r.t. the camera, hence the translation of an object cannot always be described by one single disparity value. This means that we observe disparity value ranges rather than singular values. In our experiment, we computed all such ranges in each patch; they serve as ground truth ranges.

Fig. 2.

All disparity maps from the Middlebury Stereo Dataset 2001 [3], divided into patches. In each patch, the estimated motion is encoded by a colored line. Note: the length of each line (i.e., the magnitude of the motion) is scaled by a factor of two for better visualization.

Table 1. Comparison of the number of correctly estimated motions of our PhC and the OpenCV PhC [16] w.r.t. the motions occurring within each dataset.

For each patch we executed our enhanced PhC algorithm. The results are shown in Table 1, where they are compared against the OpenCV version of the PhC. Our enhanced PhC is clearly able to estimate a significantly larger number of motions than the OpenCV PhC. Moreover, the calculated displacements are more precise and more reliable than the OpenCV ones. We estimated every single detected motion correctly, meaning that our computed displacements fall within the above-stated ranges of the ground truth data. In contrast, the OpenCV version does not determine all of its detected motions correctly, as can be seen in the last column of Table 1.

Using the Middlebury Stereo Dataset, we showed that our enhanced PhC can detect and correctly estimate multiple motions within a patch if the individual objects possess enough structure and cover a reasonable percentage of the patch.

Fig. 3.

Six exemplary results taken from the KITTI Training Dataset which show the performance of our method. The colored lines indicate the estimated motions. A red overlay on a patch indicates that one of our checks yielded a negative result and thus no reliable and precise motion estimation is possible.

3.2 KITTI Optical Flow Dataset

In the second part of our experiments, we use real-world driving scenes and show that our algorithm outperforms the OpenCV PhC in terms of precision and reliability while simultaneously measuring the uncertainties of the estimated motions. Furthermore, we show that our four self-diagnostic checks are both useful and work correctly. For the evaluation of our new approach, we took all 194 image pairs from the KITTI Optical Flow Training Dataset [22]. We did not evaluate our PhC method on the Test Dataset, because ground truth data is not available there and the benchmark only accepts sparse or dense motion fields, which we cannot provide, since we only estimate motion distributions within a defined patch. We therefore analyze the performance of our PhC by dividing each image into 45 non-overlapping patches of \(128\times 64\) pixels (cf. Fig. 3) and compare the results of our PhC against those of OpenCV and the ground truth from the training dataset. Unfortunately, KITTI does not provide ground truth for the entire image, because the LIDAR scanner has only a limited field of view. For this reason we could only perform the evaluation on 6942 of the 8730 possible patches. The KITTI data provides an almost dense motion field within a patch, whereas we can only compare motion distributions characterized by a mean \(\mathbf {m}_{n,gt}\) and a covariance matrix \(\mathbf {\Sigma }_{n,gt}\). Therefore, for each patch \(c_n\) (\(n=1,\ldots ,3752\)) in which K motions \(\mathbf {x}_i\) occur, we determine the parameters of the motion distribution from the ground truth data in the following way:

$$\begin{aligned} \mathbf {\hat{m}}_{n,gt} = \frac{1}{K} \sum _{i=1}^{K} \mathbf {x}_i \quad \text {and} \quad \mathbf {\hat{\Sigma }}_{n,gt} = \frac{1}{K-1} \sum _{i=1}^{K} (\mathbf {x}_i-\mathbf {\hat{m}}_{n,gt})(\mathbf {x}_i-\mathbf {\hat{m}}_{n,gt})^T. \end{aligned}$$
(15)
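Equation (15) amounts to the sample mean and the unbiased sample covariance of the flow vectors in a patch; a small sketch with toy flow vectors:

```python
import numpy as np

def gt_distribution(flows):
    """Sample mean and unbiased sample covariance of K flow vectors."""
    m = flows.mean(axis=0)                 # \hat{m}_{n,gt}
    d = flows - m
    Sigma = d.T @ d / (len(flows) - 1)     # \hat{Sigma}_{n,gt}
    return m, Sigma

flows = np.array([[2.0, 0.0], [4.0, 0.0], [3.0, 1.0], [3.0, -1.0]])
m, Sigma = gt_distribution(flows)
print(m, np.diag(Sigma))    # mean (3, 0), variance 2/3 per axis
```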

With the described self-diagnosis checks, we determined that on 3752 of the 6942 patches the PhC can provide a reliable motion estimate. On the other patches, one of our proposed checks failed due to low structure, too much noise, or too dissimilar image patches caused by large displacements. In Fig. 4a and b, the ordered displacements in the horizontal and vertical directions, respectively, are shown for all possible patches. These plots show that our PhC follows the trend of the ground truth much more closely than the OpenCV implementation. Many of the motions computed by OpenCV are either outliers or lie close to the zero line, whereas our PhC produces only a few outliers. Considering the given integer precision, our results comply well with the ground truth.

Fig. 4.

Ordered horizontal and vertical displacements occurring over the 3752 patches where ground truth data and a motion estimate are available.

In the last part of this experiment, we want to show that our motion distribution parameters are well estimated. We chose a relatively pessimistic approach by evaluating \(\mathbf {m}_{n, eval}\) and \(\mathbf {\Sigma }_{n, eval}\) for each patch \( c_n \) in the following way:

$$\begin{aligned} \mathbf {m}_{n, eval} = \mathbf {\hat{m}}_{n, gt} - \mathbf {m}_{n, k, PhC}&\quad \text {and} \quad \mathbf {\Sigma }_{n, eval} = \mathbf {\hat{\Sigma }}_{n, gt} + \mathbf {\Sigma }_{n, k, PhC}. \end{aligned}$$
(16)
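Equation (16) can be evaluated per patch as sketched below; `evaluate_patch` is an invented helper, and the inputs are toy values:

```python
import numpy as np

def evaluate_patch(m_gt, S_gt, m_phc, S_phc):
    """Mean deviation and summed-covariance 'area' per Eq. (16)."""
    m_eval = m_gt - m_phc                  # deviation of the means
    S_eval = S_gt + S_phc                  # pessimistic combined covariance
    return np.linalg.norm(m_eval), np.linalg.det(S_eval)

dev, area = evaluate_patch(np.array([3.0, 4.0]), np.eye(2),
                           np.array([0.0, 0.0]), np.eye(2))
print(dev, area)    # deviation 5.0, ellipse 'area' det(2I) = 4.0
```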

Figure 5a and b show the histograms of the Euclidean lengths of the deviations \(\{\mathbf {m}_{n, eval}\}_{n=1,\ldots ,3752}\) between the ground truth and our PhC and the OpenCV PhC, respectively. The results show that our PhC provides fewer estimates than the OpenCV PhC, simply because we recognize patches where no reliable motion estimate is possible. Secondly, slightly more displacements are estimated correctly, as can be seen in the histogram bins [0, 5]. However, the main advantage of our PhC is that only very few estimates deviate by more than 30 pixels from the ground truth. The OpenCV PhC, on the other hand, yields more than 2000 motion estimates with a deviation of more than 30 pixels w.r.t. the ground truth. To evaluate the uncertainty of an estimate, expressed by the 'size' of the covariance matrix, we compute the area \(A_n\) of \(\mathbf {\Sigma }_{n, eval}\) as \(A_n = \det (\mathbf {\Sigma }_{n, eval})\), which corresponds to the \(1\sigma \)-area covered by the covariance ellipse. Figure 5c shows the histogram of \(\{A_{n}\}_{n=1,\ldots ,3752}\). Some motions have a relatively high uncertainty (large \(A_{n}\)), but most of the estimated motion distributions are compact (small \(A_{n}\)), which means that their displacements are reliable and precise.

Fig. 5.

In Fig. 5a and b, the Euclidean length of the deviation between the ground truth and the motions estimated by the two PhC variants is shown. Figure 5c shows the \(1\sigma \)-area covered by the covariance ellipse of the PhC result.

We have evaluated our work on two different datasets. Our enhanced PhC clearly outperforms the OpenCV implementation. We achieve good estimates of motion distributions if the moving objects possess enough structure and cover a significant part of the patch. As already stated, the purpose of our approach is not to compute a dense optical flow field, but to estimate the dominant motions and their uncertainties. The particular advantage of our approach is that, with very moderate computational effort, we obtain reliable information about the distribution of the optical flow vectors within a patch, including the case of multiple motions. The runtime of our PhC is roughly 1 ms for a \( 256 \times 256\) pixel patch, without any use of multithreading or GPU support, on a common PC.

4 Summary and Conclusion

We have shown that the classical PhC method can be made significantly more robust against different sources of malfunction. This has been achieved by a systematic analysis of the effects of noise and of the conditioning of the input data (texture, similarity). Obviously, the spatial precision of the method can be extended into the subpixel range by using existing schemes for providing subpixel resolution to phase correlation [11–13]. This, however, is independent of the method improvements presented here. We refrained from using any of these schemes in order to present the effects of our modifications under 'clean room conditions', unaffected by other modifications. We emphasize that the standard PhC is a good motion estimator for patches with a homogeneous translational motion field, whereas our extended PhC provides distributions for multiple motions in a patch, which can be used by local methods that need a good initialization.