Introduction

Image stabilization [1,2,3,4,5] is a well-known process used to reduce undesired motion in image sequences, which occurs due to shaking or jiggling of the camera, or rapidly moving objects while the shutter rolls. These motion anomalies, or jitters, are caused by various external sources that shake the camera and lead to unpleasant visual effects in video sequences. Typical sources include unsteady handling of the camera by an operator, rapidly moving sports cameras, and camera-mounted vehicles or robots maneuvering on uneven surfaces. Stabilization techniques can be categorized as (1) optical image stabilization (OIS) and (2) digital image stabilization (DIS). OIS systems reduce apparent motion in image sequences by controlling the optical path based on sensors such as gyroscopes or accelerometers. Researchers have reported lens-shift OIS systems that shift the optical path using optomechatronic devices such as a lens-barrel-shift mechanism [6, 7], a fluidic prism [8], a magnetic 3-DOF platform [9], and a deformable mirror [10]; sensor-shift OIS systems that shift their image sensors using voice coil actuators [11,12,13,14,15]; and hand-held OIS systems with multi-DOF gimbal control systems [16,17,18,19]. Recent consumer digital cameras include OIS functions to remove the inevitable and undesired fluctuating motion that arises while capturing video. These OIS systems can stabilize input images by reducing the motion blur induced by camera shake. However, owing to the physical limitations of the lens-shift or sensor-shift mechanisms, conventional systems have difficulty in perfectly reducing large and quick apparent motion by controlling the optical path with sensors that cannot detect apparent motion in images. For frame-by-frame image stabilization, DIS systems produce a compensated video sequence. The residual fluctuating motion in images can be reduced using various image processing techniques to estimate the local motion vectors, such as block matching [20,21,22,23], bit-plane matching [24, 25], Kalman-filter-based prediction [26,27,28,29,30], DFT filtering [31], particle filters [32], scale-invariant features [33, 34], feature point matching [35,36,37,38,39], and optical flow estimation [40,41,42,43,44,45]. These systems do not require any additional mechanism or optical device for video stabilization, and they have been used as low-cost video stabilizers in various applications such as airborne shooting [46,47,48,49,50,51,52], off-road vehicles [53], and teleoperated applications [54,55,56,57], including commercial applications [58,59,60,61,62]. Researchers have reported various approaches to real-time DIS systems [63,64,65,66,67,68,69] that stabilize a video sequence with simultaneous video processing at conventional frame rates, but most of them have a limited ability to reduce large and quick apparent motion in images owing to the heavy computation of the frame correspondence process.

With rapid advancements in computer vision technologies, various real-time high-frame-rate (HFR) vision systems operating at 1000 fps or more have been developed [70,71,72,73], and their effectiveness has been demonstrated in tracking applications such as robot manipulation [74,75,76,77], multi-copter tracking [78, 79], optical flow [80], camshift tracking [81], multi-object tracking [82], feature point tracking [83], and face tracking [84]. These systems were computationally accelerated through parallel implementation on field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) to achieve real-time HFR video processing. If a real-time HFR vision system could simultaneously estimate the apparent motion in images at a high frame rate in a manner similar to that of conventional sensors, it could be made to function as an HFR jitter sensor for DIS-based video stabilization even when the camera or the targeted scene moves quickly.

In this paper, we introduce a concept of real-time digital video stabilization with HFR video processing, in which an HFR vision system simultaneously estimates apparent translational motion in image sequences as an HFR jitter sensor and is hybridized with a high-resolution camera to assist in compensating high-resolution image sequences. We developed a hybrid-camera system for real-time high-resolution video stabilization that can stabilize \(2048\times 2048\) images captured at 80 fps by executing frame-by-frame feature point tracking in real time at 1000 fps on a \(512\times 512\) HFR vision system. Its performance was demonstrated by experimental results for several moving scenes.

Video stabilization using HFR jitter sensing

Concept

Most feature-based DIS methods are realized by executing (1) feature extraction, (2) feature point matching, (3) frame-by-frame transform estimation, and (4) composition of jitter-compensated image sequences. In steps (1)–(3), feature-based motion estimation at the frame rate of conventional cameras is not always stable, and large apparent motions may be reduced inaccurately when a camera moves rapidly, because the large image displacements between frames lead to heavy computation. Narrowing the search range by exploiting the temporal redundancy in HFR image sequences can accelerate frame-by-frame motion estimation, whereas video stabilization using HFR image sequences has shortcomings in image-space resolution and brightness; the former is restricted by the specification of the image sensor as well as the processing power available for motion estimation, and the latter depends on the short exposure time, which is less than the frame cycle time of the HFR camera. Thus, we introduce the concept of hybrid-camera-based digital video stabilization, which resolves this trade-off between the tracking accuracy of real-time motion estimation and the spatial resolution of the compensated video sequence. The hybrid-camera-based system consists of a high-speed vision system that extracts and tracks feature points in consecutive images in real time at thousands of frames per second for fast apparent motion estimation as an HFR jitter sensor in steps (1)–(3), and a high-resolution camera system that composes compensated high-resolution sequences at dozens of frames per second, convenient for human eyes, in step (4). It is assumed that these camera systems have overlapping views of scenes or objects in the view field. Our approach has the following advantages over conventional methods:

  (a) Motion estimation accelerated by assuming HFR image sequences

    When M feature points are selected in input images of \(N_x \times N_y\) pixels, a computational complexity of the order of \(O(M^2)\) is required for feature point matching if all the feature points are matched against each other between consecutive frames. The image displacement between frames is considerably smaller in an HFR image sequence, which allows a smaller neighborhood search range for matching feature points between the current and preceding frames. Assuming that one or a small number of feature points of the previous frame are detected in the narrowed neighborhood of each feature point of the current frame, this narrowed neighborhood search reduces the computational complexity of feature point matching to the order of O(M) (a minimal sketch of such a narrowed candidate search is given after this list).

  (b) Stabilization of high-resolution image sequences

    Generally, real-time video stabilization aims to reduce fluctuating motions in image sequences to generate compensated videos convenient for human eyes on a computer display. Most displays are designed to operate at tens of frames per second, which is enough for human eyes to perceive a smooth movie. If a high-resolution camera of \(N'_x \times N'_y\) pixels captures a video sequence at tens of frames per second for a view similar to that of the HFR image sequence while mounted on the same platform, both cameras experience the same desired and undesired motion at the same time. Hence, a jitter-compensated \(N'_x \times N'_y\) image sequence can be composed in real time without the heavy computational complexity of HFR image synthesis; the high-speed vision system works as an HFR jitter sensor that determines the jitter-compensation parameters.
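
As a concrete illustration of advantage (a), the following Python sketch bins the feature points of the previous frame into a coarse grid so that each feature point of the current frame is compared only with candidates inside its \(b\times b\) neighborhood. This is our own illustrative example, not the implementation on the actual system; the function name `gather_candidates` and the toy data are assumptions introduced for this sketch.

```python
import numpy as np

def gather_candidates(prev_pts, curr_pts, b):
    """For each current-frame feature point, collect the indices of
    previous-frame points lying inside its b x b neighborhood, using a
    coarse grid so that the total work stays close to O(M) rather than
    the O(M^2) of exhaustive all-pairs matching."""
    cell = b                          # grid cell size tied to the search range
    buckets = {}
    for j, (x, y) in enumerate(prev_pts):
        buckets.setdefault((int(x // cell), int(y // cell)), []).append(j)

    half = b // 2
    candidates = []
    for (x, y) in curr_pts:
        cx, cy = int(x // cell), int(y // cell)
        found = []
        # only the 3 x 3 block of grid cells around the current point is scanned
        for gx in (cx - 1, cx, cx + 1):
            for gy in (cy - 1, cy, cy + 1):
                for j in buckets.get((gx, gy), []):
                    px, py = prev_pts[j]
                    if abs(px - x) <= half and abs(py - y) <= half:
                        found.append(j)
        candidates.append(found)
    return candidates

# toy usage: M = 300 points in a 512 x 496 image, b = 31 as in the later experiments
rng = np.random.default_rng(0)
prev_pts = rng.uniform([0.0, 0.0], [512.0, 496.0], size=(300, 2))
curr_pts = prev_pts + rng.normal(0.0, 2.0, size=(300, 2))   # small HFR displacement
pairs_checked = sum(len(c) for c in gather_candidates(prev_pts, curr_pts, b=31))
print(pairs_checked, "candidate pairs instead of", 300 * 300)
```

Because the HFR image displacement is only a few pixels, almost every current point gathers at most a handful of candidates, which is the source of the O(M) behavior discussed above.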

Algorithm for jitter sensing and stabilization

Our algorithm for hybrid-camera-based digital video stabilization consists of the following processes. In steps (1) feature point extraction and (2) feature point matching, we used the same algorithms as those used in real-time image mosaicking with an HFR video [83], considering the implementation of parallelized gradient-based feature extraction on an FPGA-based high-speed vision platform.

Feature point detection

The Harris corner feature [85], \(\lambda ({{\varvec{x}}},t_k)=\text{ det }\,C({{\varvec{x}}},t_k) - \kappa (\text{ Tr }\,C({{\varvec{x}}},t_k))^2\) at time \(t_k\), is computed using the following gradient matrix:

$$\begin{aligned} C({{\varvec{x}}},t_k)=\sum _{{{\varvec{x}}}\in N_a ({{\varvec{x}}})} \left[ \begin{array}{cc} I_x'^{2}({{\varvec{x}}},t_k) & I'_x({{\varvec{x}}},t_k) I'_y({{\varvec{x}}},t_k) \\ I'_x({{\varvec{x}}},t_k)I'_y({{\varvec{x}}},t_k) & I_y'^{2}({{\varvec{x}}},t_k) \end{array} \right] , \end{aligned}$$
(1)

where \(N_a({{\varvec{x}}})\) is the \(a\times a\) adjacent area of pixel \({{\varvec{x}}}=(x,y)\), and \(t_k=k\Delta t\) indicates the time when the input image \(I({{\varvec{x}}},t)\) at frame k is captured by a high-speed vision system operating at a frame cycle time of \(\Delta t\). \(I'_x({{\varvec{x}}},t)\) and \(I'_y({{\varvec{x}}},t)\) indicate the positive values of the x and y differentials of the input image \(I({{\varvec{x}}},t)\) at pixel \({{\varvec{x}}}\) at time t, \(I_x({{\varvec{x}}},t)\) and \(I_y({{\varvec{x}}},t)\), respectively. \(\kappa\) is a tunable sensitivity parameter; values in the range 0.04–0.15 have been reported as feasible.

The number of feature points in the \(p\times p\) adjacent area of \({{\varvec{x}}}\) is computed as the density of feature points by thresholding \(\lambda ({{\varvec{x}}},t_k)\) with a threshold \(\lambda _T\) as follows:

$$\begin{aligned} P({{\varvec{x}}},t_k)=\sum _{{{\varvec{x}}}' \in N_p({{\varvec{x}}})} R({{\varvec{x}}}',t_k), \quad R({{\varvec{x}}},t_k)=\left\{ \begin{array}{ll} 1 & (\lambda ({{\varvec{x}}},t_k) > \lambda _T) \\ 0 & \text{(otherwise)} \end{array} \right. , \end{aligned}$$
(2)

where \(R({{\varvec{x}}},t)\) is a map of feature points.

Closely crowded feature points are excluded by counting the number of feature points in the neighborhood. The reduced set of feature points is calculated as \(R'(t_k)=\left\{ {{\varvec{x}}}\,|\,P({{\varvec{x}}},t_k) \le P_0 \right\}\) by thresholding \(P({{\varvec{x}}},t_k)\) with a threshold \(P_0\). It is assumed that the number of feature points is less than M.
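
A minimal NumPy sketch of this detection step (Eqs. 1–2) is given below for clarity. It is an illustrative software reimplementation, not the FPGA circuit used in the actual system; the function name `detect_features` and the border handling are assumptions, while the parameter names follow the text (a, p, \(\kappa\), \(\lambda_T\), \(P_0\)).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def detect_features(img, a=3, p=8, kappa=0.0625, lam_T=5e7, P0=15):
    """Harris-corner detection with density-based thinning (Eqs. 1-2)."""
    img = img.astype(np.float64)
    # positive parts of the x / y differentials of the input image (I'_x, I'_y)
    Ix = np.clip(np.diff(img, axis=1, append=img[:, -1:]), 0, None)
    Iy = np.clip(np.diff(img, axis=0, append=img[-1:, :]), 0, None)

    # elements of the gradient matrix C, summed over the a x a neighborhood
    Sxx = uniform_filter(Ix * Ix, size=a) * a * a
    Syy = uniform_filter(Iy * Iy, size=a) * a * a
    Sxy = uniform_filter(Ix * Iy, size=a) * a * a

    lam = (Sxx * Syy - Sxy * Sxy) - kappa * (Sxx + Syy) ** 2   # Harris response
    R = lam > lam_T                                            # feature map R(x, t_k)
    P = uniform_filter(R.astype(np.float64), size=p) * p * p   # density P(x, t_k)

    # keep features whose p x p neighborhood is not overly crowded (P <= P0)
    ys, xs = np.nonzero(R & (np.round(P) <= P0))
    return np.stack([xs, ys], axis=1)   # (x, y) coordinates of the feature points
```

With the parameter values reported later (\(a=3\), \(p=8\), \(\kappa=0.0625\)), this reproduces the detection logic in software; the thresholds \(\lambda_T\) and \(P_0\) remain scene dependent.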

Feature point matching

To enable correspondence between feature points at the current time \(t_k\) and those at the previous time \(t_{k-1}=(k-1)\Delta t\), template matching is conducted for all the selected feature points in an image.

To enable the correspondence of the i-th feature point at time \(t_{k-1}\) belonging to \(R'(t_{k-1})\), \({{\varvec{x}}}_i(t_{k-1})\) \((1\le i \le M)\), to the \(i'\)-th feature point at time \(t_k\) belonging to \(R'(t_k)\), \({{\varvec{x}}}_{i'}(t_k)\) \((1\le i' \le M)\), the sum of squared differences is calculated in the window \(W_m\) of \(m\times m\) pixels as follows:

$$\begin{aligned} E(i',i;t_k,t_{k-1})=\sum _{\varvec{\xi }=(\xi ,\eta )\in W_m} \left\| I({{\varvec{x}}}_{i'}(t_k)+\varvec{\xi },t_k)-I({{\varvec{x}}}_i(t_{k-1})+\varvec{\xi },t_{k-1})\right\| ^2 . \end{aligned}$$
(3)

To decrease the number of mismatched points, \(\hat{{{\varvec{x}}}}({{\varvec{x}}}_{i}(t_{k-1});t_k)\) and \(\hat{{{\varvec{x}}}}({{\varvec{x}}}_{i'}(t_k);t_{k-1})\), which indicate the feature point at time \(t_k\) corresponding to the i-th feature point \({{\varvec{x}}}_{i}(t_{k-1})\) at time \(t_{k-1}\), and the feature point at time \(t_{k-1}\) corresponding to the \(i'\)-th feature point \({{\varvec{x}}}_{i'}(t_k)\) at time \(t_k\), respectively, are bidirectionally searched so that \(E(i',i;t_k,t_{k-1})\) is minimal in their adjacent areas as follows:

$${\hat{\varvec{x}}} ({\varvec{x}}_{i}(t_{k-1});t_k)=\varvec{x}_{i'(i)}(t_k)= \mathop{\text{arg min}}\limits_{{\varvec{x}_{i'}}(t_k) \in N_b(\varvec{x}_{i}(t_{k-1}))} E(i',i;t_k,t_{k-1}),$$
(4)
$$\begin{aligned}& \hat{\varvec{x}}(\varvec{x}_{i^{\prime}}(t_k);t_{k-1})=\varvec{x}_{i(i^{\prime})}(t_{k-1}) =\mathop {\text{arg min}}\limits_{\varvec{x}_{i}(t_{k-1}) \in N_b(\varvec{x}_{i^{\prime}}(t_k))} E(i^{\prime},i;t_k,t_{k-1}),\end{aligned}$$
(5)

where \(i'(i)\) and \(i(i')\) are the index numbers of the feature point at time \(t_k\) corresponding to \(\varvec{x}_{i}(t_{k-1})\), and that at time \(t_{k-1}\) corresponding to \(\varvec{x}_{i'}(t_k)\), respectively. According to mutual selection of the corresponding feature points, the pair of feature points between time \(t_k\) and \(t_{k-1}\) are selected as follows:

$$\begin{aligned} \tilde{\varvec{x}}_i(t_k) = \left\{ \begin{array}{ll} \hat{\varvec{x}}(\varvec{x}_{i}(t_{k-1});t_k) & (i=i(i'(i))) \\ \emptyset & \text{(otherwise)} \end{array} \right. , \end{aligned}$$
(6)
$$\begin{aligned} f_{i}(t_k)=\left\{ \begin{array}{ll} 1 & (i=i(i'(i))) \\ 0 & \text{(otherwise)} \end{array} \right. , \end{aligned}$$
(7)

where \(f_i(t_k)\) indicates whether or not a feature point at time \(t_k\) corresponds to the i-th feature point \({{\varvec{x}}}_{i}(t_{k-1})\) at time \(t_{k-1}\).

On the assumption that the frame-by-frame image displacement between times \(t_k\) and \(t_{k-1}\) is small, the feature point \({{\varvec{x}}}_{i}(t_k)\) at time \(t_k\) is matched with a feature point at time \(t_{k-1}\) in the \(b \times b\) adjacent area of \({{\varvec{x}}}_{i}(t_k)\); the computational load of feature point matching is reduced to the order of O(M) by setting a narrowed search range. For all the feature points belonging to \(R'(t_{k-1})\) and \(R'(t_k)\), the processes described in Eqs. (4)–(7) are conducted, and \(M'(t_k)\,(\le M)\) pairs of feature points are selected for jitter sensing, where \(M'(t_k)=\sum _{i=1}^M f_i(t_k)\).
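
The sketch below illustrates the matching step of Eqs. (3)–(7) in Python. It is an illustrative reimplementation under the assumption of grayscale images and integer pixel coordinates; the helper names `ssd` and `match_bidirectional` are our own, and the hardware/software partitioning of the actual system is omitted.

```python
import numpy as np

def ssd(img_a, pa, img_b, pb, m):
    """Sum of squared differences over an m x m window (Eq. 3)."""
    h = m // 2
    wa = img_a[pa[1]-h:pa[1]+h+1, pa[0]-h:pa[0]+h+1].astype(np.float64)
    wb = img_b[pb[1]-h:pb[1]+h+1, pb[0]-h:pb[0]+h+1].astype(np.float64)
    if wa.shape != (m, m) or wb.shape != (m, m):
        return np.inf            # skip windows truncated by the image border
    return float(np.sum((wa - wb) ** 2))

def match_bidirectional(img_prev, pts_prev, img_curr, pts_curr, m=5, b=31):
    """Mutual-selection matching of feature points (Eqs. 4-7).
    pts_* are integer (x, y) coordinates; returns index pairs (i, i')."""
    half = b // 2

    def nearest(src_img, src_pt, dst_img, dst_pts):
        # best SSD match among destination points inside the b x b neighborhood
        best, best_e = -1, np.inf
        for j, q in enumerate(dst_pts):
            if abs(q[0] - src_pt[0]) <= half and abs(q[1] - src_pt[1]) <= half:
                e = ssd(src_img, src_pt, dst_img, q, m)
                if e < best_e:
                    best, best_e = j, e
        return best

    fwd = [nearest(img_prev, p, img_curr, pts_curr) for p in pts_prev]   # Eq. (4)
    bwd = [nearest(img_curr, q, img_prev, pts_prev) for q in pts_curr]   # Eq. (5)

    pairs = []
    for i, ip in enumerate(fwd):
        if ip >= 0 and bwd[ip] == i:     # mutual selection: i = i(i'(i)), Eqs. (6)-(7)
            pairs.append((i, ip))
    return pairs
```

The candidate gathering could reuse the grid structure sketched earlier, and \(M'(t_k)\) is simply the number of returned pairs.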

Jitter sensing

Assuming that the image displacement between times \(t_k\) and \(t_{k-1}\) is translational motion, the velocity \(\varvec{v}(t_k)\) at time \(t_k\) is estimated by averaging the displacements of the selected pairs of feature points as follows:

$$\begin{aligned} \varvec{v}(t_k)=\frac{1}{\Delta t} \cdot \frac{1}{M'(t_k)} \sum _{i=1}^M f_i(t_k)(\tilde{\varvec{x}}_i(t_k)-\varvec{x}_i(t_{k-1})). \end{aligned}$$
(8)

Jitter displacement \(\varvec{d}(t_k)\) is computed at time \(t_k\) by accumulating the estimated velocity \(\varvec{v}(t_k)\) as follows:

$$\begin{aligned} \varvec{d}(t_k)=\varvec{d}(t_{k-1})+\varvec{v}(t_k) \cdot \Delta t, \end{aligned}$$
(9)

where the displacement at time \(t=t_0=0\) is initially set to \(\varvec{d}(t_0)=\varvec{d}(0)=\varvec{0}\). The high-frequency component of the jitter displacement, \(\varvec{d}_{cut}(t_k)\), which is the camera jitter movement intended for removal, is extracted using the following high-pass IIR filter:

$$\begin{aligned} \varvec{d}_{cut}(t_k)=\text{ IIR }(\varvec{d}_k,\varvec{d}_{k-1},\ldots ,\varvec{d}_{k-D};f_{cut}), \end{aligned}$$
(10)

where D is the order of the IIR filter; the filter is designed to exclude the low-frequency components of the displacement below a cut-off frequency \(f_{cut}\).
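
The jitter sensing of Eqs. (8)–(10) can be sketched in Python as follows. Since the text specifies only "a 5th-order Butterworth high-pass filter," the sketch uses SciPy's `butter`/`sosfilt` as a stand-in and processes a recorded velocity sequence offline, whereas the real system filters causally frame by frame; the function names are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def estimate_velocity(pairs, pts_prev, pts_curr, dt=1e-3):
    """Average translational velocity of the matched feature points (Eq. 8)."""
    if not pairs:
        return np.zeros(2)
    disp = np.array([pts_curr[ip] - pts_prev[i] for i, ip in pairs], dtype=np.float64)
    return disp.mean(axis=0) / dt     # pixels per second

def jitter_high_frequency(velocities, dt=1e-3, f_cut=0.5, order=5):
    """Accumulate velocity into displacement (Eq. 9) and extract its
    high-frequency component with a Butterworth high-pass filter (Eq. 10)."""
    v = np.asarray(velocities, dtype=np.float64)    # shape (T, 2), one row per frame
    d = np.cumsum(v * dt, axis=0)                   # d(t_k), with d(0) = 0
    sos = butter(order, f_cut, btype='highpass', fs=1.0 / dt, output='sos')
    d_cut = sosfilt(sos, d, axis=0)                 # causal high-pass along time
    return d, d_cut
```

With \(\Delta t=\) 1 ms and \(f_{cut}=\) 0.5 Hz, as used in the experiments reported later, `d_cut` corresponds to the high-frequency jitter component to be canceled.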

Composition of jitter-compensated image sequences

When the high-resolution input image \(I'(\varvec{x}',t'_{k'})\) at frame \(k'\) is captured at time \(t'_{k'}=k'\Delta t'\) by a high-resolution camera operating at a frame cycle time of \(\Delta t'\), which is much larger than that of the high-speed vision system, \(\Delta t\), the stabilized high-resolution image \(S(\varvec{x}',t'_{k'})\) is composed by displacing \(I'(\varvec{x}', t'_{k'})\) with the high-frequency component of jitter displacement \(\varvec{d}_{cut}(\hat{t}'_{k'})\) as follows:

$$\begin{aligned} S(\varvec{x}',t'_{k'})=I'(\varvec{x}'- l \cdot \varvec{d}_{cut}(\hat{t}'_{k'}),t'_{k'}), \end{aligned}$$
(11)

where \(\varvec{x}'=l\varvec{x}\) indicates the image coordinate system of the high-resolution camera; its resolution is l times that of the high-speed vision system. \(\hat{t}'_{k'}\) is the time at which the high-speed vision system captures the nearest frame at or after the time \(t'_{k'}\) at which the high-resolution camera captures its image:

$$\begin{aligned} \hat{t}'_{k'}=\left\lceil \frac{t'_{k'}}{\Delta t}\right\rceil \Delta t , \end{aligned}$$
(12)

where \(\lceil a \rceil\) indicates the minimum integer that is not smaller than a.

In this way, video stabilization of high-resolution image sequences can be achieved in real time by composing the input sequences based on the high-frequency displacement component sensed by the high-speed vision system operating as an HFR jitter sensor.
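
A minimal Python sketch of this composition step (Eqs. 11–12) follows. The bilinear shift via `scipy.ndimage.shift`, the zero padding at the borders, and the default scale factor \(l=4\) (2048/512) are illustrative choices; the interpolation and border handling of the actual implementation are not specified here.

```python
import math
import numpy as np
from scipy.ndimage import shift as nd_shift

def nearest_hfr_index(t_prime, dt=1e-3):
    """Index k of the HFR frame captured at the nearest time at or after
    t'_{k'} (Eq. 12); the corresponding time is k * dt."""
    return math.ceil(t_prime / dt)

def compose_stabilized(img_hr, d_cut, l=4.0):
    """Compose S(x', t') = I'(x' - l * d_cut, t') (Eq. 11) by shifting the
    high-resolution image; regions shifted in from outside are left black."""
    dx, dy = float(l * d_cut[0]), float(l * d_cut[1])
    # ndimage.shift uses (row, col) = (y, x) ordering and implements
    # output[y, x] = input[y - dy, x - dx], which matches Eq. (11)
    return nd_shift(img_hr, shift=(dy, dx), order=1, mode='constant', cval=0.0)
```

For each \(2048\times 2048\) frame captured at \(t'_{k'}\), the jitter value \(\varvec{d}_{cut}\) of the HFR frame indexed by `nearest_hfr_index(t_prime)` would be passed to `compose_stabilized`.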

Real-time video stabilization system

System configuration

To realize real-time high-resolution video stabilization, we implemented our algorithm on a hybrid-camera system. It consists of an FPGA-based high-speed vision device (IDP Express [72]), a high-resolution USB 3.0 camera (XIMEA MQ042-CM), and a personal computer (PC). Figure 1 shows (a) the system configuration, (b) an overview of its dual-camera head mounted on a monopod, and (c) its top-view geometric configuration. The IDP Express consists of a camera head that can capture gray-level 8-bit \(512\times 512\) images at 2000 fps, and a dedicated FPGA board for hardware implementation of user-specific algorithms. The image sensor of the camera head is a \(512\times 512\) CMOS sensor of \(5.12\times 5.12\) mm size with a \(10\times 10\) \(\upmu \hbox {m}\) pitch.

Fig. 1 Hybrid-camera system for real-time video stabilization: a configuration, b dual-camera head, c top-view geometric configuration

On the dedicated FPGA board, the 8-bit gray-level \(512\times 512\) images could be processed in real time with circuit logic on the FPGA (Xilinx XC3S5000); the captured images and processed results could be transferred to memory allocated in the PC. The high-resolution camera MQ042-CM can capture gray-level 8-bit \(2048\times 2048\) images and transfer them at 90 fps to the PC via a USB 3.0 interface; its sensor size and pixel pitch are \(11.26\times 11.26\) mm and \(5.5\times 5.5\,\upmu \hbox {m}\), respectively. We used a PC (Hewlett Packard Z440 workstation) running 64-bit Windows 7 with the following specifications: Intel Xeon E5-1603v4 at 2.8 GHz, 10 MB cache, 4 cores, 16 GB DDR4 RAM, two 16-lane PCI-e 3.0 buses, and four USB 3.0 ports. As illustrated in Fig. 1, the camera head of the IDP Express (camera 1) and the high-resolution camera MQ042-CM (camera 2) were installed such that the optical axes of their lenses were parallel; the distance between the two axes was 48 mm. The hybrid-camera system was attached to a monopod whose length is adjustable from 55 to 161 cm for hand-held operation. Identical CCTV lenses of \(f=\) 25 mm were attached to both cameras 1 and 2. As shown in Fig. 2, when the hybrid-camera module was placed 5 m away from the patterned scene, (a) the high-resolution image captured by camera 2 observed a \(2.20\times 2.20\) m area and (b) camera 1 observed a \(1.01\times 1.01\) m area. When scenes are observed such that the measurement area of camera 1 is completely contained within that of camera 2, the high-speed vision system works as an HFR jitter sensor for stabilizing the \(2048\times 2048\) images of camera 2, as discussed earlier.

Fig. 2 Input images of patterned objects captured at a distance of 5 m: a high-resolution camera (camera 2: \(2048\times 2048\)), b high-speed camera (camera 1: \(512\times 512\))

Specifications

The feature point extraction process in step (1) was accelerated by hardware implementation of a feature extraction module [83] on the FPGA. The dedicated FPGA extracts feature points from an ROI input image of \(512\times 496\) pixels, and the xy coordinates of the extracted feature points are appended in the bottom 16 rows of the \(512\times 512\) image. The implemented Harris corner feature extraction module is illustrated in Fig. 3. In step (1), the area size and the sensitivity parameter for computing the Harris corner features were set to \(a=\) 3 and \(\kappa =\) 0.0625, respectively. The area size for counting the number of feature points was set to \(p=\) 8. According to the experimental scene, the parameters \(\lambda _T\) and \(P_0\) were determined such that the number of feature points was less than \(M=\) 300.

Fig. 3 Hardware circuit module for Harris corner feature extraction

Steps (2)–(4) were software-implemented on the PC. In step (2), we assumed that the number of selected feature points was less than \(M=\) 300, and \(5\times 5\) \((m=5)\) template matching with bidirectional search in the \(31\times 31\) \((b=31)\) adjacent area was executed. In step (3), the high-frequency component of the frame-by-frame image displacement was extracted as the jitter displacement to be compensated by applying a 5th-order Butterworth high-pass filter (\(D=5\)).

The high-speed vision system captured and processed \(512\times 496\) images (\(N_x=512\), \(N_y=496\)) at 1000 fps, corresponding to \(\Delta t=\) 1 ms, whereas the high-resolution camera was set to capture \(2048\times 2048\) images (\(N'_x=N'_y=2048\)) at 80 fps in step (4), corresponding to \(\Delta t'=\) 12.5 ms.

Table 1 Execution times on our hybrid camera system (unit: ms)

Table 1 summarizes the execution times of steps (1)–(4) when our algorithm was implemented on the hybrid-camera system with the parameters stated above. The execution time of step (1) includes the image acquisition time for a \(512\times 512\) image on the FPGA board of the high-speed vision system. The total execution time of steps (1)–(3) was less than the frame cycle time of the high-speed vision system, \(\Delta t=\) 1 ms. Owing to the higher cost of synthesizing \(2048\times 2048\) image sequences, the execution time of step (4) was much larger than that of the other steps, but it was still less than the frame cycle time of the high-resolution camera, \(\Delta t'=\) 12.5 ms. Steps (2)–(4) were software-implemented as multithreaded processes to achieve real-time jitter sensing at 1000 fps in parallel with real-time composition of jitter-compensated high-resolution images at 80 fps, which are simultaneously shown on a computer display.

We compared our algorithm with conventional methods for feature-based video stabilization using SURF [5], SIFT [86], FAST [87], and the Harris corner [88], which are distributed in the OpenCV standard library [89]. Table 2 shows the execution times for step (1) and steps (2)–(4) when the conventional methods were executed for \(512\times 496\) and \(2048\times 2048\) images on the same PC as that used in our hybrid-camera system. These methods involved the processes for steps (2)–(4), such as descriptor matching, affine transformation for displacement estimation, Kalman filtering for jitter removal, and stabilized image composition. In the evaluation, we assumed that the number of feature points selected in step (1) was less than \(M=300\) for both the \(512\times 496\) and \(2048\times 2048\) images. As shown in Table 2, the computational cost for synthesizing \(2048\times 2048\) images is considerably higher than that for \(512\times 496\) images.

Table 2 Comparison in execution time (unit: ms)

Our algorithm accelerates the execution time of steps (2)–(4) for video stabilization of \(2048\times 2048\) images to 12.41 ms by hybridizing it with the hardware-implemented feature extraction of \(512\times 496\) images in step (1). We confirmed that our method could sense the jitter of several HFR videos, in which the frame-by-frame image displacements are small, at the same accuracy level as the conventional methods. The latter involve a matching process with predictions, such as the Kalman filter, to compensate for certain image displacements between frames in a standard video at dozens of frames per second. As with the feature point extraction process, such a matching process with prediction is so time consuming that the conventional methods cannot be executed for real-time video stabilization of \(2048\times 2048\) images at dozens of frames per second. Thus, our hybridized algorithm for video stabilization of high-resolution images has computational advantages over conventional feature-based stabilization methods.

Experiments

Checkered pattern

First, we evaluated the video stabilization performance of our system by observing a static checkered pattern while the hybrid-camera system was mechanically vibrated in the pan direction, as illustrated in Fig. 4. The hybrid-camera system was mounted on a direct-drive AC servo motor (Yaskawa SGM7F-25C7A11) so that its pan angle could be changed mechanically, and a checkered pattern with a \(12\times 7\) mm pitch was installed 1000 mm in front of the camera system. The measurement area observed in the \(512\times 496\) image of camera 1 corresponded to \(202\times 192\) mm on the checkered pattern.

Fig. 4 Experimental setup for observation of a checkered pattern: a overview, b checkered pattern

In the pan direction, the hybrid-camera system was vibrated along 2.5-degree-amplitude sinusoidal trajectories at frequencies ranging from 0.1 to 3 Hz in increments of 0.1 Hz. This camera ego-motion produced a 120-pixel displacement in the horizontal direction in the camera 1 image. The threshold parameters in the feature extraction step were set to \(\lambda _T=\) \(5\times 10^7\) and \(P_0=\) 15, and the cut-off frequency in the jitter sensing step was set to \(f_{cut} =\) 0.5 Hz. Figure 5 shows (a) the response of a 5th-order Butterworth high-pass filter with a cut-off frequency of 0.5 Hz, and (b) the relationship between the vibration frequencies and the damping ratios in the jitter cancellation on our system. The damping ratio was computed as the ratio of the standard deviation of the filtered high-frequency component to that of the jitter displacement in the horizontal direction over 10 s. Figure 6 shows the pan angles of the hybrid-camera system, the jitter displacements (JDs), their filtered high-frequency component displacements (HDs), and the stabilized displacements (SDs) in the horizontal direction in the camera 1 image for 10 s when the hybrid-camera system was vibrated at 0.1, 0.5, and 1.0 Hz. The SDs were computed by canceling the HDs from the JDs. In accordance with the cut-off frequency of 0.5 Hz, the SD almost matched the JD when the system was vibrated at 0.1 Hz, whereas the SD tended toward zero under 1.0 Hz vibration. This tendency is also confirmed in Fig. 5, where it can be observed that our system detected and canceled the specified high-frequency camera jitter displacement and that the damping ratio varied largely from 1 to 0 around the cut-off frequency \(f_{cut} =\) 0.5 Hz.
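
The damping-ratio evaluation can be reproduced with the short sketch below; the sinusoidal test signal and its amplitude are placeholders for the recorded jitter displacements, and the filter is the same 5th-order, 0.5 Hz Butterworth high-pass applied at 1000 fps as described above.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def damping_ratio(jd, hd):
    """std(HD) / std(JD): ~0 means the vibration remains in the stabilized
    output, ~1 means it is almost entirely canceled."""
    return float(np.std(hd) / np.std(jd))

# toy check with synthetic sinusoidal jitter displacements (amplitude arbitrary)
fs, f_cut = 1000.0, 0.5                                   # 1000 fps, 0.5 Hz cut-off
sos = butter(5, f_cut, btype='highpass', fs=fs, output='sos')
t = np.arange(0, 10, 1 / fs)                              # 10 s record
for f_vib in (0.1, 0.5, 1.0, 3.0):                        # vibration frequencies [Hz]
    jd = 60.0 * np.sin(2 * np.pi * f_vib * t)             # jitter displacement [pixel]
    hd = sosfilt(sos, jd)                                  # filtered high-frequency component
    print(f"{f_vib} Hz -> damping ratio {damping_ratio(jd, hd):.2f}")
```

This reproduces the expected trend: a ratio near 0 below the cut-off frequency and near 1 above it.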

Fig. 5 Frequency responses in jitter cancellation when \(f_{cut}=\) 0.5 Hz: a 5th-order Butterworth high-pass filter, b damping ratios in jitter cancellation

Fig. 6 Pan angles, jitter displacements (JDs), filtered high-frequency components (HDs), and stabilized displacements (SDs) when the hybrid-camera system vibrated at: a 0.1 Hz, b 0.5 Hz, c 1.0 Hz

Photographic pattern

Next, we evaluated the video stabilization performance by observing a printed photographic pattern while the hybrid-camera system drifted with vibration at a certain frequency in the pan direction. Figure 7a shows the experimental setup, which was arranged in the same way as that in the previous subsection. A printed cityscape photographic pattern of dimensions \(1200\times 900\) mm was placed 1000 mm in front of the hybrid-camera system mounted on a pan-tilt motor head. On the pattern, the \(440\times 440\) mm area observed by camera 2 and the \(202\times 192\) mm area observed by camera 1, which it contains, are illustrated in Fig. 7b. In the experiment, the pan angle varied with a 1 Hz vibration, as illustrated in Fig. 8. The parameters in the feature extraction step and the cut-off frequency in the jitter sensing step were set to the same values as those in the previous subsection. Figure 9 shows the JD, the HD, and the SD in the horizontal direction in the camera 1 image for 16 s when the hybrid-camera system drifted with 1 Hz vibration in the pan direction. Figure 10 shows (a) the extracted feature points (green '+') and (b) the pairs of matched feature points between the previous and current frames (blue and red dots), plotted on the \(512\times 496\) input images of camera 1. Figure 11 shows (a) the \(2048\times 2048\) input images and (b) the stabilized images of camera 2. The images in Figs. 10 and 11 were taken for \(t =\) 0–14 s at intervals of 2 s. Figure 12 shows (a) the \(2048\times 2048\) input images and (b) their stabilized images of camera 2 from \(t=\) 0 to 0.7 s, taken at intervals of 0.1 s. In Fig. 9, the SD was obtained as the DC component by reducing the 1 Hz vibration in the camera drift, which is higher than the cut-off frequency of 0.5 Hz. The stabilized images of camera 2 over 0.7 s in Fig. 12b were compensated so as to cancel the 1 Hz vibration, whereas the apparent left-to-right motion of the cityscape scene over 14 s in the stabilized images of camera 2, which corresponded to the DC component of the camera drift, was not canceled, as illustrated in Fig. 11b. These experimental results show that our hybrid-camera system can automatically stabilize \(2048\times 2048\) images of complex scenes so as to cancel the high-frequency components of the camera ego-motion.

Fig. 7 Experimental setup for observation of a cityscape photographic pattern: a overview, b scenes observed by camera 1 and camera 2

Fig. 8 Pan angle when observing a photographic pattern

Fig. 9 Jitter displacement (JD), filtered high-frequency component (HD), and stabilized displacement (SD) when observing a photographic pattern

Fig. 10 Feature points plotted on \(512\times 496\) input images of camera 1 when observing a photographic pattern: a extracted feature points, b pairs of matched feature points

Fig. 11 \(2048\times 2048\) images of camera 2 when observing a photographic pattern (\(t=\) 0–14 s): a input images, b stabilized images

Fig. 12 \(2048\times 2048\) images of camera 2 when observing a photographic pattern (\(t=\) 0–0.7 s): a input images, b stabilized images

Outdoor scene

To demonstrate the performance of our proposed system in a real-world scenario, we conducted an experiment in which an operator held the hand-held dual-camera head of our hybrid-camera system while walking on outdoor stairs, where undesired camera ego-motion usually induces unpleasant jitter displacements in video shooting. Figure 13 shows the experimental scene when walking down outdoor stairs holding the dual-camera head, which was mounted on a 70 cm-long monopod. In the experiment, we captured an outdoor scene of multiple walking persons with background trees; they were walking on the stairs at a distance of 2 to 4 m from the operator. Induced by left-and-right hand-arm movement and up-and-down body movement while walking, the dual-camera head was repeatedly panned in the horizontal direction and moved in the vertical direction at approximately 1 Hz. At a distance of 3 m from the operator, an area of \(1.30\times 1.30\) m corresponded to a \(2048\times 2048\) input image of camera 2, which contained an area of \(0.60\times 0.55\) m observed in a \(512\times 496\) image of camera 1. The threshold parameters in the feature extraction step were set to \(\lambda _T=\) \(5\times 10^7\) and \(P_0=\) 15, and \(M=\) 300 feature points or fewer were selected for feature point matching. The cut-off frequency in the jitter sensing step was set to \(f_{cut}=\) 0.5 Hz to reduce the 1 Hz camera jitter in the experiment.

Figure 14 shows the JDs, the HDs, and the SDs in (a) the vertical direction and (b) the horizontal direction in the camera 1 image for \(t =\) 0–7 s. Figure 15 shows (a) the extracted feature points and (b) the pairs of matched feature points, plotted on the \(512\times 496\) input images of camera 1. Figure 16 shows (a) the \(2048\times 2048\) input images and (b) the stabilized images of camera 2. The \(2048\times 2048\) images were stabilized in real time at an interval of 12.41 ms; thus, the fastest rate of our stabilization is 80.6 fps. The images in Figs. 15 and 16 for \(t =\) 0–6.16 s, at intervals of 0.88 s, were used to monitor whether the camera ego-motion at approximately 1 Hz was reduced in the stabilized images. Owing to raster scanning from the upper left to the lower right of the camera 1 image, feature points in its upper region were selected for feature point matching when their total number was much larger than 300. Thus, as illustrated in Fig. 15b, only 300 feature points located on the background trees in the upper region of the camera 1 image were selected for feature point matching in all the frames, and those around the walking persons in the center and lower regions were ignored. Video stabilization was therefore conducted based on the static background trees, ignoring the dynamically changing appearances of the walking persons in the center and lower regions of the camera 1 image. In Fig. 14, the JDs in both the horizontal and vertical directions varied over time at approximately 1 Hz, corresponding to the frequency of the camera ego-motion, which was determined by the relative geometrical relationship between the dual-camera head and the static background trees. It can be observed that the SDs were obtained as the low-frequency component by reducing the high-frequency jitter component, and that the \(2048\times 2048\) images were stabilized so as to significantly reduce the apparent motion of background objects such as the trees and the handrail of the stairs, as illustrated in Fig. 16b. We confirmed that the camera jitter induced by the operator's quick hand motion and the 1 Hz camera jitter in the experiment were correctly measured; the background objects were always observed with the naked eye, in real time, as semi-stationary objects in the stabilized images displayed on a computer monitor. By selecting feature points in the static background for feature point matching, our hybrid-camera system can correctly stabilize \(2048\times 2048\) images in real time without disturbance from the dynamically changing appearances around the walking persons, assisted by feature-point-based HFR jitter sensing at 1000 fps, even when a walking operator moves the hand-held dual-camera head quickly. The frequency of the camera jitter may increase depending on the operator's motion; however, our system is capable of stabilizing frequencies much higher than 1 Hz, and operator motion in the frequency range from 0.5 to 10 Hz can be compensated for by our system.

Conclusions

In this study, we developed a hybrid-camera-based video stabilization system that can stabilize high-resolution images of \(2048\times 2048\) pixels in real time by estimating the jitter displacements of the camera with the assistance of an HFR vision system operating at 1000 fps. Several experiments were conducted with real scenes in which the hybrid-camera system underwent certain jitter displacements due to its mechanical movement, and the experimental results verified its performance for real-time video stabilization with HFR video processing. Our method was designed only to reduce translational movements in images; it cannot perfectly reduce camera jitter with large rotational movements. Moreover, the accuracy of jitter sensing will decrease significantly when feature points around moving targets are selected for feature point matching. Based on these results, we aim to improve our video stabilization system for more robust use in complicated scenes with 3-D translational and rotational movements under time-varying illumination, incorporating object recognition and motion segmentation to segregate the camera motion by intelligently ignoring feature points around moving objects such as persons and cars, and to extend it toward embedded and consumer camera systems for mobile robots and a variety of other applications.

Fig. 13 Experimental scene when walking on stairs with a dual-camera head

Fig. 14 Jitter displacement (JD), filtered high-frequency component (HD), and stabilized displacement (SD) in outdoor stair experiment: a horizontal displacements, b vertical displacements

Fig. 15 Feature points plotted on \(512\times 496\) input images of camera 1 in outdoor stair experiment: a extracted feature points, b pairs of matched feature points

Fig. 16 \(2048\times 2048\) images of camera 2 in outdoor stair experiment: a input images, b stabilized images