1 Introduction

Complex traffic surveillance systems are often used for controlling traffic and detecting unusual situations, such as traffic congestion or accidents. Most such systems are built around high-resolution cameras connected via a high-bandwidth link to a processing center. The need for automated processing of video data is obvious, and many traffic-analysis systems can be found in the literature [1–10]. This paper describes an approach that uses a single, stationary, constant-zoom, monochrome camera to detect moving and recently stopped vehicles. The result of the algorithm is a binary mask image of blobs representing the detected objects and a table with the parameters of those objects. A low-resolution camera is sufficient, since the detected objects (vehicles) are large enough and their details are not important for this application. The camera is mounted high above the road, e.g. on a street-lamp pole, to reduce occlusions between vehicles and to provide a large field of view. It observes dynamic events in a scene with a fixed or slowly changing background, locally occluded by moving vehicles. No a priori knowledge about the scene is assumed; possible camera vibrations have to be compensated by a separate block.

The proposed algorithm supports day and night operation, where the scene may be illuminated by an additional light source (e.g. street lamps) or an infra-red camera may be used. The algorithm runs at video rate and enables a low-power realization in a sensor network node, which can be powered from a solar cell. Most of the design decisions were made taking into account a possible implementation in a sensor network node with limited hardware and power resources. Low-power operation is achieved thanks to the low resolution of the processed image, which allows a low-frequency clock to be used. The algorithm has been developed for, and tested in, an autonomous low-cost sensor network node for car traffic flow evaluation; a set of such nodes makes it possible to estimate traffic over a large area of a city. Some early results of the authors' work were presented in [6]; in this paper the final algorithm is described in detail, the Hough transform block has been added, the final processing block has been revised and improved, and the edge detection blocks have been introduced.

The layout of this paper is as follows: an overview of the most important developments in the image segmentation area is presented in Section 2. In Section 3 the authors present a low-level image-processing algorithm for the detection of moving objects. Section 4 describes the transformation of the blobs obtained from the image-processing algorithm into a table containing the coordinates of the detected moving objects together with their basic shape parameters. The results of the hardware implementation and the conclusions are presented in Sections 5 and 6, respectively.

2 Related Work

Moving object detection and segmentation techniques have been investigated for many years. Two main approaches to the problem of recognizing vehicles in a video image can be distinguished:

  1. A model-based approach using feature-extraction methods, where recognition is achieved by matching 2-D features extracted from the image with the features of a 3-D model [1, 2, 11–15].

  2. A non-model-based approach, where three major methods of moving object segmentation can be distinguished: optical flow [16–18], frame differencing and background subtraction [4, 19–25].

Background subtraction is probably the most frequently used technique for moving object segmentation. Its idea can be described with the following inequality, which is satisfied by pixels matching the background:

$$ \left| {{{\mathbf{I}}_t} - {{\mathbf{B}}_t}} \right| < \theta $$
(1)

where I t is the matrix of pixel intensities of the current frame, B t denotes the current background image at time t and θ is a constant threshold.

The main effort in the background subtraction method is concentrated on maintaining a correct background image. The running average [3] enables the quick calculation of an approximate value of the background (2):

$$ {{\mathbf{B}}_t} = \alpha \cdot {{\mathbf{I}}_{{t - 1}}} + \left( {1 - \alpha } \right) \cdot {{\mathbf{B}}_{{t - 1}}} $$
(2)

where α is the learning rate.

For image processing, a median usually gives better results than the average; thus the running median was introduced in [27] and [4], where the running estimate of the median is incremented by one if the input pixel's intensity is larger than the estimate and decremented by one if it is smaller. The estimate converges to the median value, since half of the input pixels are larger and half are smaller than the estimated value. Calculated in this way, the median estimate can also be realized very efficiently in hardware.
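As a rough illustration, a per-pixel update of this running median estimate could look like the following C sketch (the function and variable names are illustrative, not taken from [4] or [27]):

#include <stdint.h>

/* Running median estimate: the estimate is nudged by one intensity step
 * towards each new pixel value, so over time it converges to the median
 * of the pixel's history. */
static inline uint8_t update_running_median(uint8_t estimate, uint8_t pixel)
{
    if (pixel > estimate)
        return estimate + 1;
    if (pixel < estimate)
        return estimate - 1;
    return estimate; /* equal: leave the estimate unchanged */
}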

Methods using a Gaussian distribution or a running Gaussian average [28–30] provide a solution to the threshold selection problem. Each pixel is modeled as an independent statistical process with a Gaussian distribution, and a pixel is considered to match the background when the following inequality holds (non-background pixels are those violating it):

$$ \left| {{{\mathbf{I}}_t} - {\mu_t}} \right| < k \cdot {\sigma_t} $$
(3)

where μ t and σ t are the mean and standard deviation matrices of the Gaussian distributions of the image pixel intensities, and the constant k typically has a value between 2 and 3.

The background image is updated with the running Gaussian average as shown in the following equations:

$$ {\mu_t}\left( {x,y} \right) = \alpha \cdot {I_{{t - 1}}}\left( {x,y} \right) + \left( {1 - \alpha } \right) \cdot {\mu_{{t - 1}}}\left( {x,y} \right) $$
(4)
$$ \sigma_t^2\left( {x,y} \right) = \alpha {\left[ {{I_{{t - 1}}}\left( {x,y} \right) - {\mu_{{t - 1}}}\left( {x,y} \right)} \right]^2} + \left( {1 - \alpha } \right) \cdot \sigma_{{t - 1}}^2\left( {x,y} \right) $$
(5)

In further considerations, the pixel’s coordinates (x, y) will be omitted to aid the readability of this paper, unless necessary.

Simplification of the Gaussian models can be found in [31], where the absolute maximum, minimum and the largest consecutive difference values for every pixel are used. Adding color information can improve the sensitivity of the moving object detection. For this purpose, several color models can be used, such as RGB or HSV [29, 32].

Modeling each pixel as an independent statistical process with a single Gaussian distribution cannot cope with a non-stationary background (waving trees or water waves), where the background pixels may follow several distributions. This problem can be solved by using a Mixture of Gaussians (MOG) [5, 33–37], where each pixel is modeled by a few distributions which are constantly updated. In [38] the authors, apart from MOG, use a statistical model of gradients within a hierarchical approach to image analysis with pixel-level, region-level and frame-level processing. An interesting FPGA implementation (using off-chip RAM) of a modified MOG algorithm is presented in [39], where blob labeling is also implemented in hardware.

There are also other methods in the literature, such as kernel density estimators based on a simple adaptive filter introduced in [40], a method using a Kalman filter [34], linear prediction (a Wiener filter) with an autoregressive process of order 30 [41], mean-shift based estimation [42] and eigenbackgrounds.

Background subtraction enables the detection of both moving and stopped objects. Depending on the background update technique, stopped objects can be detected for a certain amount of time, until they become part of the background. A disadvantage of this approach is that the place where a stopped object that was already part of the background starts to move is also detected; such a relocated background object is called a ghost. The empty place where the object stood before it started to move keeps being detected until it becomes part of the background. Another issue is how to update the background. Simple approaches use a non-selective background update: information from every pixel of the current image is included in the background model, regardless of the segmentation result. In this way, pixels belonging to moving objects are also included in the background, decreasing the selectivity of the segmentation. In [3] a selective background update was introduced, where only pixels that are not recognized as moving objects are allowed to enter the background model. Selectivity improves the quality of the background model, but it also creates a risk of the dead-lock phenomenon: when some part of the background changes, the pixels falsely detected as belonging to a moving object are never included in the background and remain permanently marked in the segmented picture. To reduce this problem, two backgrounds can be used [7, 40, 43, 44], combining the selective and non-selective approaches.

Owing to algorithm imperfections, some pixels of the original image that belong to a moving vehicle are not indicated in the binary mask. Such pixels are called false negatives (FN). Pixels of the original image that belong to the stationary background but are recognized as part of the moving objects are called false positives (FP). TP denotes the number of true positive pixels, i.e. pixels that belong to a moving object and are correctly identified. The following detection quality measures can then be defined, the fill ratio FIL and the precision ratio PR, similarly as in [46]:

$$ FIL = \frac{{TP}}{{TP + FN}} \cdot 100\% $$
(6)
$$ PR = \frac{{TP}}{{TP + FP}} \cdot 100\% $$
(7)
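For clarity, the two measures can be computed directly from the accumulated pixel counts; the short C sketch below assumes TP, FN and FP have already been counted against the ground truth:

/* Quality measures (6) and (7); tp, fn and fp are the pixel counts
 * accumulated over a frame (or over a whole sequence). */
double fill_ratio(unsigned long tp, unsigned long fn)
{
    return 100.0 * (double)tp / (double)(tp + fn);   /* FIL, Eq. (6) */
}

double precision_ratio(unsigned long tp, unsigned long fp)
{
    return 100.0 * (double)tp / (double)(tp + fp);   /* PR, Eq. (7) */
}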

One of the reasons for errors in image segmentation is the existence of shadows of various types: shadows cast by objects onto the background, shadows cast by objects onto themselves and shadows cast by other objects. The most important are the shadows cast by objects onto the background, which move along with them. At night, highlights can be observed, such as reflections of car lights from background surfaces. Even snow can cause additional problems [47]. Some authors divide the image into squares and manipulate the mean values of the pixels' intensities to eliminate shadow while preserving texture [48], or use color, spatial and temporal information with an a posteriori probabilistic estimator to determine shadows [49]. In [50], a set of 17 image region types is used and heuristic rules are applied to classify pixels as shadow. Many solutions characterize shadow by the same color but a lower hue or brightness [8, 26, 53]. A detailed review of shadow elimination techniques can be found in [8] and [54]. Elimination of shadows and highlights is very important in a non-model-based approach, since they could cause object merging or shape distortions.

The binary mask image obtained from the segmentation algorithm should be post-processed to delete single erroneously classified pixels with morphological operations or other methods using additional information obtained from object-level and frame-level processing [41].

The hardware implementation of image processing algorithms is becoming more popular with the constant development of more sophisticated FPGAs. Examples of implementation of image processing algorithms can be found in [51] and [52].

3 Image Segmentation Algorithm

In this paper, a non-model-based approach is presented, which transforms the camera image into a binary mask containing moving blobs. The aim of the authors was to develop a pipelined, iteration-less algorithm, implementable in hardware, that performs simple segmentation of traffic objects with a monochrome camera mounted above the road. The use of a monochrome camera decreases the segmentation sensitivity and excludes the use of color information for shadow and highlight detection, but it reduces the complexity of the hardware. The contributions of this paper also include a novel method of combining the binary masks obtained from two background subtraction results and the use of non-linear functions for the detection of highlights.

The general diagram depicting the idea of the algorithm is presented in Fig. 1, where the block structure of the system and the data flow are shown. Each block is described in detail in the next sections. Since the background subtraction technique is used, an image stabilization circuit might be needed at the input; this circuit is outside the scope of this paper.

Figure 1

General diagram depicting the idea of the algorithm for the FPGA implementation.

3.1 Models for Selective and Non-selective Background

The presented algorithm is based on the background subtraction technique and uses two background models: a long-term model with non-selective update and a short-term model with selective update [6, 7, 40, 43]. The models for the selective and non-selective backgrounds are similar; the difference lies only in how the background is updated with data from the current image. For simplicity of the hardware realization, the models assume a single Gaussian distribution of pixel intensities. A pixel is classified as foreground when it does not satisfy (3), and the results are stored as the masks m S and m N for the selective and non-selective background, respectively.

Depending on the auto-exposure system implemented in the camera, sudden changes in the illumination of the scene can cause background subtraction-based algorithms to detect all the regions where the brightness has changed. Such a situation can be observed at night, for example near periodically flashing city neon lights. In this case an additional average-brightness control block, adjusting the average brightness of the background models and of the previous frame to the current image, might be needed; this block is outside the scope of this paper.

3.2 Non-selective Background Update Block

The non-selective background update block, together with the selective background update block, performs the main task of detecting moving and recently stopped objects. To enable easy implementation in hardware, the running mode [7] was chosen as the background update function:

$$ {\mu_{{N,t}}}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} {{\mu_{{N,t - 1}}}\left( {x,y} \right) + {\delta_{{N1}}}} & {\hbox{if}} & {{I_t}\left( {x,y} \right) > {\mu_{{N,t - 1}}}\left( {x,y} \right)} \\{{\mu_{{N,t - 1}}}\left( {x,y} \right) - {\delta_{{N1}}}} & {\hbox{if}} & {{I_t}\left( {x,y} \right) < {\mu_{{N,t - 1}}}\left( {x,y} \right)} \\{{\mu_{{N,t - 1}}}\left( {x,y} \right)} & {} & {\hbox{otherwise}} \\\end{array} } \right. $$
(8)

where:

I t (x, y):

the brightness of a pixel situated at coordinates (x, y) of the input monochrome image at time t

μ N,t (x, y):

the brightness of a pixel situated at coordinates (x, y) of the background image, updated non-selectively

δ N1 = 2^−5 = 0.03125 is a small constant evaluated experimentally. It is assumed that the brightness of the input image is in the range: \( {I_t}\left( {x,y} \right) \in \left\langle {0,255} \right\rangle \).

As can be seen from (8), the calculation of the background requires only a few simple operations. The running mode is also used for updating the standard deviation σ t . Experimental results show that this approach works correctly and also enables a fast and easy implementation in hardware. The update of σ N,t , which is σ t from (3) for the non-selective model, is given in (9):

$$ {\sigma_{{N,t}}} = \left\{ {\begin{array}{*{20}{c}} {{\sigma_{{N,t - 1}}} + {\delta_{{N2}}}} & {\hbox{if}} & {\left| {{I_t} - {\mu_{{N,t - 1}}}} \right| > {\sigma_{{N,t - 1}}}} \\{{\sigma_{{N,t - 1}}} - {\delta_{{N2}}}} & {\hbox{if}} & {\left| {{I_t} - {\mu_{{N,t - 1}}}} \right| < {\sigma_{{N,t - 1}}}} \\{{\sigma_{{N,t - 1}}}} & {} & {\hbox{otherwise}} \\\end{array} } \right. $$
(9)

where δ N2 is also a small constant, experimentally evaluated as 0.00390625 (i.e. 2^−8).
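A software-style sketch of the non-selective running-mode update (8)–(9) is given below. Floating-point variables are used here only for readability, whereas the hardware stores μ N and σ N as fixed-point values; the function name is illustrative:

#include <math.h>
#include <stdint.h>

#define DELTA_N1 0.03125f    /* 2^-5, step for the mean, Eq. (8)      */
#define DELTA_N2 0.00390625f /* 2^-8, step for the deviation, Eq. (9) */

/* One per-pixel update of the non-selective background model. */
static void update_nonselective(float *mu, float *sigma, uint8_t pixel)
{
    float mu_prev = *mu;

    /* Eq. (8): move the background mean one small step towards the pixel. */
    if ((float)pixel > mu_prev)
        *mu = mu_prev + DELTA_N1;
    else if ((float)pixel < mu_prev)
        *mu = mu_prev - DELTA_N1;

    /* Eq. (9): adapt the deviation estimate towards |I - mu|. */
    float diff = fabsf((float)pixel - mu_prev);
    if (diff > *sigma)
        *sigma += DELTA_N2;
    else if (diff < *sigma)
        *sigma -= DELTA_N2;
}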

3.3 Selective Background Update Block

The selective background update block works similarly to the non-selective one, but uses information from the final steps of the algorithm, as can be seen in (10) and (11):

$$ {\mu_{{S,t}}}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} {{\mu_{{S,t - 1}}}\left( {x,y} \right) + {\delta_{{S1}}}} & {\hbox{if}} & {{I_{{t - 1}}}\left( {x,y} \right) > {\mu_{{S,t - 1}}}\left( {x,y} \right)} & {\hbox{and}} & {{m_{{VTS,t - 1}}}\left( {x,y} \right) = 0} \\{{\mu_{{S,t - 1}}}\left( {x,y} \right) - {\delta_{{S1}}}} & {\hbox{if}} & {{I_{{t - 1}}}\left( {x,y} \right) < {\mu_{{S,t - 1}}}\left( {x,y} \right)} & {\hbox{and}} & {{m_{{VTS,t - 1}}}\left( {x,y} \right) = 0} \\{{\mu_{{S,t - 1}}}\left( {x,y} \right)} & {} & {\hbox{otherwise}} & {} & {} \\\end{array} } \right. $$
(10)
$$ {\sigma_{{S,t}}}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} {{\sigma_{{S,t - 1}}}\left( {x,y} \right) + {\delta_{{S2}}}} & {\hbox{if}} & {\left| {{I_{{t - 1}}}\left( {x,y} \right) - {\mu_{{S,t - 1}}}\left( {x,y} \right)} \right| > {\sigma_{{S,t - 1}}}\left( {x,y} \right)} & {\hbox{and}} & {{m_{{VTS,t - 1}}}\left( {x,y} \right) = 0} \\{{\sigma_{{S,t - 1}}}\left( {x,y} \right) - {\delta_{{S2}}}} & {\hbox{if}} & {\left| {{I_{{t - 1}}}\left( {x,y} \right) - {\mu_{{S,t - 1}}}\left( {x,y} \right)} \right| < {\sigma_{{S,t - 1}}}\left( {x,y} \right)} & {\hbox{and}} & {{m_{{VTS,t - 1}}}\left( {x,y} \right) = 0} \\{{\sigma_{{S,t - 1}}}\left( {x,y} \right)} & {} & {\hbox{otherwise}} & {} & {} \\\end{array} } \right. $$
(11)

where:

$$ {m_{{VTS,t}}}\left( {x,y} \right) = {m_{{V,t}}}\left( {x,y} \right) \vee {m_{{ET,t}}}\left( {x,y} \right) \vee {m_{{ES,t}}}\left( {x,y} \right) $$
μ S,t (x, y):

the brightness of a pixel at coordinates (x, y) of background image updated using selectivity

m V (x, y):

an element of the detected-vehicle mask image, of value 0 or 1, where 1 denotes a detected moving object

m ET (x, y) and m ES (x, y):

the elements, of value {0, 1}, obtained from the temporal and spatial edge detection blocks, respectively.

The values of the constants δ S1 and δ S2 were established experimentally: δ S1  = 0.25, δ S2  = 0.03125 for \( {I_{{t - 1}}}\left( {x,y} \right) \in \left\langle {0,255} \right\rangle \). The input image I t-1 is used instead of I t to maintain coherence with the masks m V, t-1, m ES, t-1 and m ET, t-1. The sizes of the matrices μ S , μ N , m V , m ET and m ES are equal to the size of I t .
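The selective update differs from the non-selective one only in the gating by the mask m VTS,t-1 . A hedged C sketch of (10) and (11), with illustrative names and the same floating-point simplification as before, could be:

#include <math.h>
#include <stdint.h>

/* One per-pixel update of the selective background model, Eqs. (10)-(11).
 * The update is frozen wherever the combined mask
 * m_VTS = m_V | m_ET | m_ES flagged the pixel in the previous frame. */
static void update_selective(float *mu, float *sigma, uint8_t pixel_prev,
                             int m_v, int m_et, int m_es)
{
    const float DELTA_S1 = 0.25f, DELTA_S2 = 0.03125f;

    if (m_v | m_et | m_es)   /* pixel belonged to a detected object:   */
        return;              /* leave the selective model unchanged    */

    float mu_prev = *mu;
    if ((float)pixel_prev > mu_prev)
        *mu = mu_prev + DELTA_S1;          /* Eq. (10) */
    else if ((float)pixel_prev < mu_prev)
        *mu = mu_prev - DELTA_S1;

    float diff = fabsf((float)pixel_prev - mu_prev);
    if (diff > *sigma)
        *sigma += DELTA_S2;                /* Eq. (11) */
    else if (diff < *sigma)
        *sigma -= DELTA_S2;
}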

The use of two background update blocks has a very important advantage: the fast-adapting selective background gives a better sensitivity, while the non-selective background helps to avoid the dead-lock phenomenon. An example of a frame where additional pixels are detected by the selective background is presented in Fig. 2.

Figure 2

An example illustrating the better sensitivity of the selective background model, (a)—input image I t , (b)—mask m N from non-selective background, (c)—mask m S from selective background, (d)—pixels detected in the mask m S and not detected in the mask m N .

Recently stopped objects can be detected for some time, which is often required in car detection (e.g. detecting a traffic jam). The non-selective background model has longer adaptation times than the selective one, i.e. δ N1 < δ S1 and δ N2 < δ S2 , so recently stopped objects are not added to the quickly adapting selective model, because the update is blocked by the mask m V ∨ m ES ∨ m ET . After some time, the stopped objects become part of the non-selective background and are then quickly included in the selective background, since the mask m V ∨ m ES ∨ m ET no longer blocks the update. This process is illustrated in Fig. 3. Using constant adaptation rates (δ N1 , δ S1 , δ N2 , δ S2 ) results in simpler hardware, but rapid changes of the scene caused by sudden weather changes lead to temporary detection problems. It has been observed that, typically, the backgrounds adapt to the new light conditions after a few seconds. This situation can be detected at the final stage of object segmentation, as the total area of the detected objects becomes comparable to the area of the whole image.

Figure 3

Detection of a recently stopped object by the selective and non-selective background update blocks: (a)—the input image I t with added mask m V (darker areas indicated by the white rectangles), the recently stopped car in the center of the image is being detected by the algorithm, (b)—the non-selective background μ N , the new car has not yet been added to the non-selective background, (c)—the selective background μ S , the new car has not yet been included in the selective background, (d)—the car is still being detected, (e)—the car is slowly being added to the non-selective background, (f)—the car is not included in the selective background, because the update is blocked by the mask m V , (g)—the car is not detected any more, (h)—the new car is fully included in the non-selective background, (i)—the selective background is quickly updating, because the mask m V no longer blocks the update of the car.

3.4 Binary Mask Combination Block

Detection results from both models have to be combined into a single binary mask m B . With a simple and operation, all the pixels that are not detected simultaneously by both models would be lost. Therefore, a special combination of and and or operations can be used to improve the detection. In this paper, the authors refine the idea described in [43]: when in the proximity of the inspected pixel there is at least one pixel detected by both models, the or operation is used, otherwise the and operation is used, as shown in (12).

$$ {m_B}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} {{m_S}\left( {x,y} \right) \vee {m_N}\left( {x,y} \right)} & {\hbox{if}} & {\left( {{m_S}\left( {x - 1,y} \right) \wedge {m_N}\left( {x - 1,y} \right)} \right) \vee } \\{} & {} & {\left( {{m_S}\left( {x - 1,y - 1} \right) \wedge {m_N}\left( {x - 1,y - 1} \right)} \right) \vee } \\{} & {} & {\left( {{m_S}\left( {x,y - 1} \right) \wedge {m_N}\left( {x,y - 1} \right)} \right) \vee } \\{} & {} & {\left( {{m_S}\left( {x + 1,y - 1} \right) \wedge {m_N}\left( {x + 1,y - 1} \right)} \right)} \\{{m_S}\left( {x,y} \right) \wedge {m_N}\left( {x,y} \right)} & {\hbox{otherwise}} & {} \\\end{array} } \right. $$
(12)

The operation of the Binary Mask Combination Block is presented in Table 1, where single FN and FP pixels are considered. In these situations the results are better than with the simple binary operations (AND, OR). As can be seen in Fig. 4, the noise observed in the mask m S (Fig. 4b) does not appear in the resulting mask m B (Fig. 4d). It must be noted that, for simplicity of the hardware, apart from the current pixel only the four previously analyzed neighboring pixels are used in (12).
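A per-pixel C sketch of the combination rule (12) is shown below; the row-major mask layout is an assumption of the sketch and bounds checking at the image border is omitted for brevity:

#include <stdint.h>

/* Combination rule (12): only the causal neighbours (left, upper-left,
 * upper, upper-right) of the current pixel are inspected, which maps
 * directly onto the w+1 shift register used in the hardware.
 * mS and mN are W-wide binary masks stored row-major. */
static int combine_masks(const uint8_t *mS, const uint8_t *mN,
                         int x, int y, int W)
{
    int neighbour_agrees =
        (mS[ y      * W + (x - 1)] & mN[ y      * W + (x - 1)]) |
        (mS[(y - 1) * W + (x - 1)] & mN[(y - 1) * W + (x - 1)]) |
        (mS[(y - 1) * W +  x     ] & mN[(y - 1) * W +  x     ]) |
        (mS[(y - 1) * W + (x + 1)] & mN[(y - 1) * W + (x + 1)]);

    return neighbour_agrees
        ? (mS[y * W + x] | mN[y * W + x])   /* OR: keep the extra detections */
        : (mS[y * W + x] & mN[y * W + x]);  /* AND: suppress isolated noise  */
}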

Table 1 Binary mask combination improving the final m B mask quality.
Figure 4

An example of Mask Combination Block operation: (a)—input picture, (b)—mask m S from selective background, (c)—mask m N from non-selective background, (d)—mask m B . The differences in FP pixels between (b) and (c) are caused by the differences in the background models (different update rates of selective and non-selective background, specified by the constants: δ N1 , δ N2 , δ S1 and δ S2 ).

The proposed background subtraction has been compared with Stauffer's and Grimson's MOG method [35] with K = 3 distributions. Two test sequences have been used for the comparison: the publicly available PETS2001 dataset [45] and a sequence taken from a bridge above a highway. As can be seen in Fig. 5, the obtained results are comparable to, or even better than, those of the standard background subtraction method using MOG.

Figure 5

Comparison of the proposed background subtraction algorithm with the standard algorithm [35]: (a)—input image (frame #949) from PETS2001 Camera 1 sequence [45] resized to 128 × 128 pixels, (b)—manually marked ground truth, (c)—unfiltered result from MOG algorithm [35] with K = 3 distributions, (d)—result from the proposed background subtraction algorithm, (e)—input image (frame #1314) from obw2_d3 sequence, (f)—manually marked ground truth, (g)—unfiltered result from MOG algorithm [35] with K = 3 distributions, (h)—result from the proposed background subtraction algorithm.

3.5 Temporal and Spatial Edge Detection Blocks

Pure background subtraction misses many TP pixels, especially in dark scenes. In the worst case, most of a moving vehicle may not be detected at night, except for its lights. To overcome this problem, an additional detection scheme based on edge detection has been introduced. The edge detection improves the segmentation quality by increasing the number of TP pixels. Two edge detection blocks are used: a temporal and a spatial edge detection block. The temporal edge detection block detects edges in the image obtained as the difference between the current and the previous frame:

$$ \Delta {{\mathbf{I}}_T} = \left| {{{\mathbf{I}}_t} - {{\mathbf{I}}_{{t - 1}}}} \right| $$
(13)

The spatial edge detection block uses the difference between the current image and the background:

$$ \Delta {{\mathbf{I}}_S} = \left| {{{\mathbf{I}}_t} - {\mu_{{N,t}}}} \right| $$
(14)

To avoid locking up the background update by continuously detected edges, the non-selective background is used in (14). The temporal edge mask image m ET is described by (15):

$$ {m_{{ET}}}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} 1 & {{\hbox{if}}\,\left| {\Delta {I_T}\left( {x,y} \right) - \Delta {I_T}\left( {x - 1,y} \right)} \right| > {\theta_{{ET}}} \vee } \\{} & {\left| {\Delta {I_T}\left( {x,y} \right) - \Delta {I_T}\left( {x,y - 1} \right)} \right| > {\theta_{{ET}}}} \\0 & {\hbox{otherwise}} \\\end{array} } \right. $$
(15)

An equation similar to (15) can be written for m ES :

$$ {m_{{ES}}}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} 1 & {{\hbox{if}}\,\left| {\Delta {I_S}\left( {x,y} \right) - \Delta {I_S}\left( {x - 1,y} \right)} \right| > {\theta_{{ES}}} \vee } \\{} & {\left| {\Delta {I_S}\left( {x,y} \right) - \Delta {I_S}\left( {x,y - 1} \right)} \right| > {\theta_{{ES}}}} \\0 & {\hbox{otherwise}} \\\end{array} } \right. $$
(16)

where θ ET and θ ES are constant thresholds evaluated experimentally. An example of the edge detection is shown in Fig. 6. The background detection (mask m B ) has problems finding a dark car at night, but the edge detection adds more pixels, improving the overall result. Some new FP pixels are also introduced (the lower part of Fig. 6c), but they can easily be filtered out in one of the next processing steps.
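A minimal C sketch of the temporal edge test (15) is given below; the spatial test (16) is identical except that the frame difference (13) is replaced by the background difference (14). The array layout and names are illustrative, and border handling is omitted:

#include <stdint.h>
#include <stdlib.h>

/* Temporal edge test (15). dIT holds the absolute frame difference (13),
 * W is the image width and theta_et the experimental threshold. */
static int temporal_edge(const uint8_t *dIT, int x, int y, int W, int theta_et)
{
    int c    = dIT[y * W + x];
    int left = dIT[y * W + (x - 1)];
    int up   = dIT[(y - 1) * W + x];

    return (abs(c - left) > theta_et) || (abs(c - up) > theta_et);
}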

Figure 6

Additional pixels found by the edge detection in a dark scene, the moving car is marked with the circle: (a)—original picture (highway at night), (b)—result of background detection m B , (c)—result of spatial edge detection m ES for θ ES  = 20, (d)—result of temporal edge detection m ET for θ ET  = 20.

3.6 Shadow and Highlight Detection Blocks

Basic shadow detection in monochrome images can be done simply by measuring the decrease in brightness relative to the background [26]:

$$ {m_{{SH}}}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} 1 & {\hbox{if}} & {\alpha \leqslant \frac{{{I_t}\left( {x,y} \right)}}{{{\mu_{{N,t}}}\left( {x,y} \right)}} \leqslant \beta } \\0 & {} & {\hbox{otherwise}} \\\end{array} } \right. $$
(17)

where α and β are constant coefficients, both evaluated experimentally: α = 0.55, β = 0.95.
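The shadow test (17) reduces to a simple ratio comparison per pixel, as in the following C sketch (illustrative names; in fixed-point hardware the division would be replaced by multiplications):

#include <stdint.h>

/* Shadow test (17): a pixel is marked as shadow when it is moderately
 * darker than the non-selective background at the same position. */
static int is_shadow(uint8_t pixel, float mu_n)
{
    const float ALPHA = 0.55f, BETA = 0.95f;

    if (mu_n <= 0.0f)
        return 0;                          /* avoid division by zero */
    float ratio = (float)pixel / mu_n;
    return (ratio >= ALPHA) && (ratio <= BETA);
}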

At night, the illumination of the scene changes drastically, and reflections of car lights force the detection of many FP pixels. A highlight detection working analogously to the shadow detection would cause many errors during the day. To solve this problem, the authors propose non-linear brightness transformations f, providing different behavior of the highlight detection block during the day and at night. The idea of this method is presented in Fig. 7.

Figure 7

Flow diagram of highlight detection block.

The input image and the background image are first transformed with a non-linear function, which maps dark pixels into bright ones and vice versa. For example, the hyperbolic function in (18) can be used:

$$ f\left( {I\left( {x,y} \right)} \right) = \frac{{2047}}{{I\left( {x,y} \right) + 1}} $$
(18)

where I(x, y) represents the brightness of the pixel at (x,y), \( I\left( {x,y} \right) \in \left\langle {0,255} \right\rangle \).

At night, when the background pixels are mainly dark and very sensitive to any highlights, the difference after the transformation between a highlight (small value) and the background (large value) is large, so the highlights can easily be detected and stored as the mask m HI . During the day, the difference between the transformed background (low value) and a bright object (also a low value after the transformation) is smaller than the constant threshold τ H1 . An additional threshold τ H2 was introduced to exclude very bright pixels from being classified as highlights during the day. A further improvement in the number of TP pixels can be achieved by detecting very dark pixels on a bright background, also using non-linear transformations (mask m X calculated according to Fig. 7). The values of τ H1 , τ H2 , τ X1 , τ X2 have to be determined experimentally; the authors used the following values: τ H1  = −8, τ H2  = 120, τ X1  = 25, τ X2  = 70. The results of shadow and highlight detection are shown in Figs. 8 and 9. As can be seen in Fig. 8, the simple shadow detection technique is not perfect, but it detects the major part of the shadow and appears sufficient for this application. More reliable, and thus more complex, shadow detection techniques are widely described in the literature, e.g. [8]. The highlights detected from the car lights are shown in Fig. 9c as the mask m HI . Finally, the mask m X (Fig. 9d) identifies a few more pixels of the moving object.
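The following C sketch illustrates a highlight test built on the transformation (18). Since the exact arrangement of the comparators is defined by Fig. 7 rather than by the text, the particular combination of the τ H1 and τ H2 comparisons below is an assumption, not the authors' exact circuit:

#include <stdint.h>

/* Non-linear transformation (18): dark pixels become bright and vice versa. */
static float f_nl(uint8_t v)
{
    return 2047.0f / ((float)v + 1.0f);
}

/* Highlight test. ASSUMPTION: the transformed input is compared with the
 * transformed background against tau_H1, and tau_H2 cuts off very bright
 * pixels; the real comparator wiring follows Fig. 7 and may differ. */
static int is_highlight(uint8_t pixel, uint8_t mu_n)
{
    const float TAU_H1 = -8.0f;  /* threshold on the transformed difference */
    const int   TAU_H2 = 120;    /* brightness cut-off                      */

    float diff = f_nl(pixel) - f_nl(mu_n); /* strongly negative at night    */
    return (diff < TAU_H1) && (pixel < TAU_H2);
}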

Figure 8

Simulation results of the shadow detection: (a)—original picture, (b)—result of the background detection m B , (c)—the detected shadow mask m SH .

Figure 9

Simulation results of the detection of the highlights: (a)—original picture, (b)—result of the background detection m B , (c)—the detected highlights in mask m HI , (d)—mask m X .

The shadow and highlight detection blocks operate continuously, day and night, which adds some noise at night; this noise is removed by the morphological operations at the final processing stage. The highlight detection depends on the brightness of the pixels, so its effect is limited during the day, when the pixels are brighter.

3.7 Final Processing

The masks obtained in the previous steps of the algorithm are combined into a single mask m BEHSX in accordance with (19) and (20):

$$ {{\mathbf{m}}_{{HS}}} = dil\left( {ero\left( {\neg \left( {\left( {{{\mathbf{m}}_{{ET}}} \wedge {{\mathbf{m}}_{{ES}}}} \right) \vee {{\mathbf{m}}_X}} \right) \wedge \left( {{{\mathbf{m}}_{{HI}}} \vee {{\mathbf{m}}_{{SH}}}} \right)} \right)} \right) $$
(19)
$$ {{\mathbf{m}}_{{BEHSX}}} = ero\left( {dil\left( {\left( {{{\mathbf{m}}_B} \wedge \neg {{\mathbf{m}}_{{HS}}}} \right) \vee \left( {\left( {{{\mathbf{m}}_{{ET}}} \wedge {{\mathbf{m}}_{{ES}}}} \right) \vee {{\mathbf{m}}_X}} \right)} \right)} \right) $$
(20)

where dil() and ero() denote 2 × 2 morphological dilation and erosion operation, respectively.

The blobs representing moving objects in the mask m BEHSX usually contain holes (FN pixels) and many FP pixels (Fig. 10b). To improve the shape of the blobs, the authors apply a generalized Hough transform with a rectangular structuring element. The size of the structuring element should correspond to the size of the objects to be detected; in this case, a square of 4 × 4 pixels was appropriate. For every bright pixel, the structuring element is positioned at all positions overlapping that pixel and the corresponding elements of the voting matrix are updated. The voting matrix is finally compared with a constant threshold, as shown in (21):

Figure 10

Simulation results showing the effect of the Hough transform: (a)—original picture, (b)—mask m BEHSX , (c)—final mask m V for Θ H  = 180.

$$ {m_V}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{c}} 1 & {\hbox{if}} & {Houg{h_{{4 \times 4}}}{{\left[ {{{\mathbf{m}}_{{BEHSX}}}} \right]}_{{\left( {x,y} \right)}}} > {\Theta_H}} \\0 & {} & {\hbox{otherwise}} \\\end{array} } \right. $$
(21)

It must be noted that the Hough transform tends to connect blobs which are very close together. However, this transformation significantly improves many other aspects of the final mask: the transformed blobs usually have a shape more convenient for the labeling and speed estimation described in the next section, so the use of this transformation is important to the overall efficiency of the algorithm.
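One way to read the voting step is that casting a vote from every bright pixel for every overlapping 4 × 4 placement is equivalent to correlating the binary mask with the 7 × 7 autocorrelation of a 4 × 4 square, i.e. a small pyramid of weights (4 − |dx|)(4 − |dy|). This interpretation, which matches the 7 × 7 matrix of 5-bit values used in the hardware (Section 5), is sketched in C below:

#include <stdint.h>
#include <stdlib.h>

/* Accumulated votes for one pixel of m_BEHSX; the weight (4-|dx|)*(4-|dy|)
 * counts the 4x4 placements covering both pixels (maximum 16, hence the
 * 5-bit values in the hardware). m_V(x,y) = 1 when the returned value
 * exceeds THETA_H, e.g. THETA_H = 180. */
static int hough_votes(const uint8_t *m, int x, int y, int W, int H)
{
    int votes = 0;
    for (int dy = -3; dy <= 3; dy++) {
        for (int dx = -3; dx <= 3; dx++) {
            int xx = x + dx, yy = y + dy;
            if (xx < 0 || yy < 0 || xx >= W || yy >= H)
                continue;
            if (m[yy * W + xx])
                votes += (4 - abs(dx)) * (4 - abs(dy));
        }
    }
    return votes;
}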

4 Blob Analysis and Speed Estimation

The blobs obtained from the previously described blocks have to be analyzed in order to detect the moving vehicles and to measure their speed. Since the camera usually observes the scene at an angle, additional transformations of the image are needed.

For speed estimation, knowledge of the relationship between the blobs' dimensions found in the image and the real-world coordinates is necessary. The authors assume a model as in [55], where the camera is located above the ground and pointed towards the road, and the ground level is assumed to be planar. The model is presented in Fig. 11. If we additionally assume that the observed objects are also planar, so that their heights are Z = 0, then the X, Y coordinates on the road can be transformed into the x CAM , y CAM coordinates of the image on the camera sensor as [55]:

Figure 11

Geometrical model of camera and road [55].

$$ {x_{{CAM}}} = f\frac{X}{{Y\cos \left( \varphi \right) + h/\sin \left( \varphi \right)}} $$
(22)
$$ {y_{{CAM}}} = f\frac{{Y\sin \left( \varphi \right)}}{{Y\cos \left( \varphi \right) + h/\sin \left( \varphi \right)}} $$
(23)

where:

φ :

tilt angle [rad]

h :

height of the camera above the road [m] and

f :

focal length of the camera [m].

Thus, knowing the parameters f, h and φ, one can calculate the real-world coordinates X, Y from the x CAM , y CAM coordinates. The input image, indexed with the pixel coordinates x map , y map of the camera sensor, is transformed into an image that provides a linear correspondence between its pixels x lin , y lin and the real-world coordinates X, Y. An example of the transformation is presented in Fig. 12; for a better view, the original image is shown instead of the blobs.
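A C sketch of the forward mapping used to fill such a coordinate look-up table is given below; the sensor pixel pitch and the principal point (cx, cy), needed to go from the sensor coordinates (22)–(23) to pixel indices, are assumptions of the sketch, since they are not specified in the text:

#include <math.h>

typedef struct { double px; double py; } PixelCoord;

/* Road-plane point (X, Y) in metres -> sensor coordinates (22)-(23) ->
 * pixel indices. pitch is the sensor pixel size in metres per pixel and
 * (cx, cy) is the principal point (both assumed). */
static PixelCoord road_to_pixel(double X, double Y,
                                double f, double h, double phi,
                                double pitch, double cx, double cy)
{
    double denom = Y * cos(phi) + h / sin(phi);
    double x_cam = f * X / denom;               /* Eq. (22) */
    double y_cam = f * Y * sin(phi) / denom;    /* Eq. (23) */

    PixelCoord p = { cx + x_cam / pitch, cy - y_cam / pitch };
    return p;
}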

Figure 12

Example of original image (a) and result of transformation (b) for φ = 43°, h = 15 m, f = 2.8 mm using the inverse of (22) and (23).

The detected blobs are labeled and the following parameters are estimated: the object's boundaries, the center of the object, its area in pixels and its fill factor (the percentage of object pixels with respect to the bounding rectangle area). The parameters are calculated using simple operations during a pixel-by-pixel scan of the image. After this stage, a table is created with as many columns as there are indexed objects; the rows contain the estimated parameters.

Objects which are small and have a small fill factor are discarded. Blobs which overlap in two subsequent frames are detected and marked; such blobs are treated as the same object in motion. The speed and direction of the objects are estimated from the distance between the centers of the objects marked in the previous stage.
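Assuming the blob centers are expressed on the geometrically corrected image, where pixels correspond linearly to road-plane metres, the speed estimate reduces to a centroid displacement per frame; the scale factor and frame rate in the C sketch below are assumed inputs of the example:

#include <math.h>

/* Speed of a blob matched in two consecutive frames, from the displacement
 * of its center on the geometrically corrected image. metres_per_pixel and
 * fps describe the transformed image and the frame rate. */
static double estimate_speed_kmh(double cx_prev, double cy_prev,
                                 double cx_curr, double cy_curr,
                                 double metres_per_pixel, double fps)
{
    double dx = cx_curr - cx_prev;
    double dy = cy_curr - cy_prev;
    double metres_per_frame = sqrt(dx * dx + dy * dy) * metres_per_pixel;

    return metres_per_frame * fps * 3.6;   /* m/s -> km/h */
}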

5 Implementation Results

The algorithm described in Section 3 has been tested with various video streams. A ground truth reference has been prepared for a set of video streams by manually extracting, frame by frame, all the pixels of each moving vehicle. The simulation results show that 57–94% of the pixels (depending on the stream) belonging to the moving vehicles in the ground truth images are correctly identified by the algorithm (fill ratio FIL, as defined in (6)). Moreover, the precision ratio PR defined in (7), indicating how many of the detected pixels belong to the moving objects, is about 56–88%. Simulation results for several frames of the selected video streams are shown in Fig. 13. As can be seen from Fig. 13, the algorithm is able to properly detect the moving vehicles under various scene conditions. Worse detection usually occurs in dark scenes, for gray-colored vehicles, or in strong sunlight causing intensive shadows. Such problematic situations, with the detection errors indicated by arrows, are collected in Fig. 14. The shapes of the resulting blobs are in many situations different from the shapes of the real moving objects, but for the purpose of simplified tracking and traffic measurement the detailed shape is not very important.

Figure 13

Simulation results of the algorithm (a)—frame #638 from “obwodnica_1” movie, (b)—frame #85 from “obwodnica_2” movie, (c)—frame #1879 from “wrzeszcz” movie, (d)—frame #176 from “obwodnica_noc” movie, (e)—frame #718 from “obwodnica_6” movie. Frames “result” on the right contain the input image with the added mask of the detected blobs. Rectangles indicating the detected blobs were introduced during simulation for improved visibility.

Figure 14

Simulation results of the algorithm, white arrows show the problematic situations for the detection algorithm (a)—a gray vehicle with minority of pixels detected, frame #1778 from “obwodnica_4” movie, (b)—the shadow detected as moving object in a sunny day, frame #42 from “obwodnica_6” movie, (c)—the highlight detected as moving object at night, frame #309 from “obwodnica_noc” movie. Frames “result” on the right contain the input image with the added mask of the detected blobs. Rectangles indicating the detected blobs were introduced during simulation for improved visibility.

The algorithm has been implemented in real hardware using a Xilinx Virtex-4 SX FPGA on the Virtex-4 Evaluation Kit prototype board from Avnet, utilizing approximately 1700 LUTs, 1200 flip-flops and 2.3 Mbit of built-in RAM. The design was written in VHDL and was synthesized and implemented using Xilinx ISE 9.1.03i. The analog signal from the camera was captured by a Philips SAA7113H video input processor. The on-chip implementation included: selective background, non-selective background, background mask combination, temporal and spatial edge detection, highlight, shadow and extra-pixel detection, Hough transform, and geometrical transformation with indexing (i.e. blob labeling and blob parameter evaluation), as shown in Fig. 15. The interface to the external processor was used to collect the table with the detected objects' parameters.

Figure 15

A simplified structure of the realized algorithm in FPGA.

The details of the implementation of the non-selective background update and subtraction are shown in Fig. 16. The values of μ N and σ N are stored in dual-port RAMs and are updated with every new pixel. The selective background block is realized in a similar way, with the selectivity information added.

Figure 16

Simplified schematic diagram of non-selective background block implementation.

The implementation of the binary mask combination (Fig. 17) contains a shift register of length w + 1, where w is the length of a single video line. The previously analyzed pixels, stored in the shift register, are used to calculate the mask m B . Similar shift registers have also been used for calculating the erosion and dilation of the masks m HS and m BEHSX , the edges in the edge detection blocks and the indexes in the blob indexing block.

Figure 17

Simplified schematic diagram of the implementation of the binary mask combination.

Owing to the properties of the algorithm, all the remaining blocks from Fig. 1, except for the Hough block, are implemented in a similar way to the blocks shown in Figs. 16 and 17; they require only a few cycles of the main 1.79 MHz clock to calculate their result, which makes a pipelined implementation possible.

The Hough block requires that, for each pixel, a rectangular structuring element of size 4 × 4 is moved around the pixel and added to the voting matrix. Instead of moving the 4 × 4 structuring element, a 7 × 7 rectangular matrix of 5-bit values has been used; in this way the pixel-centered matrix is stationary for every pixel and is stored in the Hough matrix memory, as shown in Fig. 18.

Figure 18

Simplified schematic diagram of the implemented Hough block.

The Hough block requires 49 clock cycles to calculate all the elements of the matrix, so a 28.5 MHz clock has been used for this block to keep its operation coherent with the other blocks. The indexing block also works with the faster clock, as it too requires many internal iterations.

The image transformation block uses two memories for transforming the coordinates of each pixel of the input image. At system start-up, the mapping memories are programmed with values calculated by the external processor. During normal operation, the transformation of a single pixel takes only 3 cycles of that block's main clock.

The system has been tested without the image stabilization block, which is only needed in extreme situations, e.g. in strong wind. The relative distribution of hardware resources among the blocks is shown in Table 2. The algorithm makes it possible to use only integer values in all calculations. All the constants required by the algorithm are read from the on-chip memory and stored in the control registers. Table 2 also contains the typical number of clock cycles required to process a single pixel.

Table 2 Relative usage of hardware resources and relative power consumption.

All the simulation results presented in this paper were obtained using an 8-bit image representation. As already shown in Fig. 16, only the 4 most significant bits were used in the implemented system, which was forced by the limited resources of the FPGA. To show the influence of this reduction, a simulation has been made using an artificial test scene with a linearly changing background of μ A  = 0…255, with rectangular objects casting simulated shadows of intensity γ k μ A and moving from left to right, as shown in Fig. 19. The parameters of the artificial moving objects are shown in Table 3.

Figure 19

Simulation results of algorithm operation for artificial scene for 8- and 4-bit versions; (a)—input image with added objects’ indices and arrows indicating direction of movement, (b)—ground truth for objects, (c)—object detection results (mask m V ) for 8-bit version of the algorithm, (d)—object detection results (mask m V ) for 4-bit version of the algorithm, (e)—ground truth for shadows, (f)—shadow detection results (mask m SH ) for 8-bit version of the algorithm, (g)—shadow detection results (mask m SH ) for 4-bit version of the algorithm.

Table 3 Parameters of the objects in the artificial test scene.

To better illustrate the influence of the data reduction, the FIL and PR ratios have been calculated and are presented in Fig. 20. As can be seen from the simulations, the data reduction resulted in a lower sensitivity of the algorithm in detecting objects and shadows. However, the relative precision of the 4-bit version of the algorithm (the PR ratios) has slightly increased. The reduction of the data width saved a lot of FPGA resources and power at the expense of decreased sensitivity. Nevertheless, the algorithm's results still appear sufficient for this application.

Figure 20

Calculated values of the FIL and PR ratios for object and shadow detection in the artificial scene from Fig. 19; (a)—FIL and PR ratios for objects using 8-bit data representation, (b)—FIL and PR ratios for shadows using 8-bit data representation, (c)—FIL and PR ratios for objects using 4-bit data representation, (d)—FIL and PR ratios for shadows using 4-bit data representation.

Photographs showing the results of the algorithm are presented in Fig. 21. More results are available on-line at http://www.ue.eti.pg.gda.pl/sn. The hardware was designed to work with a main 1.79 MHz clock and an additional 28.5 MHz clock for the Hough and indexing blocks, to process 25 frames per second of a low-resolution 128 × 128 pixel monochrome image. The main clock frequency has been set as low as possible while still allowing each block to process the pixel data. The estimated dynamic power consumption was about 600 mW, with 600 mW of quiescent power. The core elements realizing the algorithm were estimated to consume 400 mW; this power is distributed among the blocks as shown in the last column of Table 2. Since FPGAs are known to have a large power demand, implementing the algorithm in an ASIC would further decrease the power consumption. The obtained maximum clock frequency was about 135 MHz, which would permit the processing of up to 117 fps; this indicates a great potential for increasing the processing speed of the algorithm, for example higher resolutions of the input video stream could easily be used. As can be seen from Table 2, the limiting stages of the algorithm are the Hough block and the indexing block. The authors decided to use the generalized Hough transform because of its ability to detect rectangular objects, but for simpler implementations morphological operations could also be used. The state machine in the indexing block requires many clock cycles per pixel to communicate with its memories and to index the blobs.

Figure 21

Photo of implementation results of the algorithm (a)—input image I t , (b)—non-selective background mask m N , (c)—selective background mask m S , (d)—combined background mask m B , (e)—temporal edge mask m ET , (f)—spatial edge mask m ES , (g)—shadow and highlight mask m HS , (h)—mask m V after final processing, (i)—mask after geometrical transformation for φ=29.8º, h = 7 m, f = 2.8 mm.

To demonstrate the efficiency of the FPGA realization, a software implementation of the same algorithm was developed in C. The software version processed about 160 fps on an Intel dual-core processor with a 2.13 GHz clock and a maximum power dissipation of 65 W, running the Linux operating system; the speed is similar, but the FPGA implementation uses much less power.

A comparison of selected parameters of the proposed implementation with the solution presented in [39] is shown in Table 4. The implementation described in [39] uses monochrome images of VGA resolution and its segmentation algorithm is based on the MOG method, which should work better for non-stationary backgrounds. The implementation presented in this paper works with lower-resolution images, but it additionally contains the geometrical image transformation block; moreover, the highlight and shadow detection blocks, together with the edge detection blocks, should provide a better detection sensitivity. To show the potential speed of the presented algorithm, a speed-optimized version of reduced functionality has also been included in the comparison in Table 4. In the speed-optimized version, the Hough block is removed and the indexing block is reduced to the first phase of the connected component algorithm with label equivalence table generation. No power information is given in [39], so power has not been compared.

Table 4 The comparison of the selected design parameters of the algorithm with the implementation presented in [39].

6 Conclusions

In this paper, a combined algorithm for extracting moving objects from a real-time video stream has been proposed. The processing steps were carefully selected and adapted to provide a simple and straightforward realization in specialized hardware, such as an FPGA or ASIC. A few novel ideas enhancing the algorithm have also been developed, increasing its robustness while maintaining its simplicity for hardware implementation. The proposed method of background calculation, using the running mode, is very fast and requires only basic operations. The novel combination of the masks from the selective and non-selective backgrounds improves the detection quality. The non-linear brightness transformations enable correct detection of shadows and highlights in various lighting conditions. A further improvement could include automatic recognition of day and night, with switching between shadow and highlight detection. The application of the generalized Hough transform significantly improves the final blob mask; however, to simplify the hardware, the Hough block could be replaced with a set of morphological operations. The proposed algorithm has been implemented in an FPGA and tested in a real environment. The test results proved the usability of the presented idea for recognizing moving vehicles at low power consumption: the system properly found almost all of the moving vehicles during the day and at night.