1 Introduction

Intelligent video surveillance systems, devoted to detecting and tracking moving objects, can accomplish unsupervised results using background modeling methodologies, in which a representation of the background is estimated and the regions that diverge from it are subtracted and labeled as moving objects, named foreground [1]. Afterwards, surveillance systems interpret the activities and behaviors of the foreground objects to support computer vision analysis (e.g., object classification, tracking, and activity understanding) [2]. To achieve proper results, background modeling approaches focus on the elaboration of a background model that suitably represents the pixel dynamics of real-world scenarios [3]. Among the developed background modeling approaches, the most widely used are those derived from the conventional pixel-wise Gaussian Mixture Models (GMM), since they provide a trade-off between robustness to real-world video conditions and computational burden [4]. To date, several adapted versions have been proposed; in fact, the authors of [5] provide a survey and propose a classification of the most salient GMM-based background modeling approaches.

Regarding the updating rules of the GMM parameters, improvements have mainly been reported for the learning rate parameter, which aims to adapt the background model by observing the pixel dynamics through time. Originally, the derivation of the GMM parameter updating rules yields a Gaussian kernel term that provides smoothness to the updating rules of the mean and variance parameters; nonetheless, the cost function of the weights is set using a binary ownership value. This updating rule may lead to noisy foreground masks, especially when the pixel labels are uncertain, as in dynamical scenarios. Zivkovic et al. proposed an improvement of the original GMM that uses Dirichlet priors to update the weight parameters. Nonetheless, this improvement was mainly intended to decrease the computational cost, and its foreground/background discrimination performance remains similar to that of the original GMM.

Here, we propose a new cost function for the GMM weight updating. Using the Euclidean divergence (ED), we compare the instant and cumulative probabilities of each Gaussian model of the GMM fitting the pixel input samples. Then, employing Least Mean Squares (LMS), we minimize the ED of the obtained probabilities to adjust the weight values through time. By doing so, we provide non-linear smoothness to the whole set of GMM parameter updating rules, reducing the number of false positives in the obtained foreground masks. The proposed cost function is coupled into the traditional GMM approach, producing a new background modeling approach, named ED-GMM, which improves the foreground/background discrimination in real-world scenarios, especially in dynamical environments.

2 Methods

Background Modeling Based on GMM: The probability that a query input pixel \({\varvec{x}}_t{{\mathrm{\,\in \,}}}\mathbb {R}^{C}\) (where \(C{{\mathrm{\,\in \,}}}\mathbb {N}\) is the color space dimension), at time \(t{{\mathrm{\,\in \,}}}T\), belongs to a given GMM-based background model is computed as:

$$\begin{aligned} p\left( {\varvec{x}}_t|{\varvec{\mu }}_t,{\varvec{\Sigma }}_t\right) = \sum _{m=1}^{M} w_{m,t}\ \mathcal {N}\left( {\varvec{x}}_t|{\varvec{\mu }}_{m,t},{\varvec{\Sigma }}_{m,t}\right) , \end{aligned}$$
(1)

where \(M{{\mathrm{\,\in \,}}}\mathbb {N}\) is the number of Gaussian models of the GMM, \(w_{m,t}\) is the weight related to the m-th Gaussian model, and \(\mathcal {N}\left( \cdot |{\varvec{\mu }}_{m,t},{\varvec{\Sigma }}_{m,t}\right) \) is a Gaussian distribution with mean value \({\varvec{\mu }}_{m,t}\) and covariance matrix \({\varvec{\Sigma }}_{m,t}\). For computational burden alleviation, all elements of the color representation set are assumed independent and with the same variance value \(\sigma ^{2}_{m,t}\) [4], that is, \({\varvec{\Sigma }}_{m,t} {{\mathrm{\,=\,}}}\sigma ^{2}_{m,t}{\varvec{I}},\) where \({\varvec{I}}\) is the identity matrix. Afterwards, each query pixel, \({\varvec{x}}_t\), is evaluated until it matches a Gaussian model of the GMM. Here, a match occurs whenever the pixel value ranges within a 2.5 standard-deviation interval of the Gaussian model. However, if \({\varvec{x}}_t\) does not match any Gaussian model, the least probable model is replaced by a new one having a low initial weight, a large initial variance, and mean \({\varvec{\mu }}_{m,t} {{\mathrm{\,=\,}}}{\varvec{x}}_t\) [4]. In the case that the m-th model matches a new input pixel, its parameters are updated as follows:

$$\begin{aligned} w_{m,t}&= w_{m,t-1}+\alpha (o_{m,t}-w_{m,t-1}) \end{aligned}$$
(2a)
$$\begin{aligned} {\varvec{\mu }}_{m,t}&= {\varvec{\mu }}_{m,t-1}+\rho _{m,t} o_{m,t} ({\varvec{x}}_{t}-{\varvec{\mu }}_{m,t-1}) \end{aligned}$$
(2b)
$$\begin{aligned} \sigma ^{2}_{m,t}&= \sigma ^{2}_{m,t-1}+\rho _{m,t} o_{m,t} \left( ({\varvec{x}}_{t}-{\varvec{\mu }}_{m,t-1})^{\top } ({\varvec{x}}_{t}-{\varvec{\mu }}_{m,t-1}) - \sigma ^{2}_{m,t-1} \right) \end{aligned}$$
(2c)

where \(\alpha \) is the weight learning rate, \(o_{m,t}{{\mathrm{\,\in \,}}}\{0,1\}\) is a binary number indicating the membership of a sample to a model, and \(\rho _{m,t}\) is the mean and variance learning rate, set as a version of the \(\alpha \) parameter smoothed by the Gaussian kernel \(g({\varvec{x}}_t;\cdot ,\cdot ),\) i.e.: \(\rho _{m,t} {{\mathrm{\,=\,}}}\alpha g\left( {\varvec{x}}_t;{\varvec{\mu }}_{m,t}, {\sigma _{m,t}}\right) .\) Lastly, the derived models are ranked according to the ratio \(w/\sigma \) to determine those most likely produced by the background, making the further foreground/background discrimination suitable [6].
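To make the procedure concrete, the following Python sketch, our own illustration rather than the BGS-library implementation, runs one time step of the matching and updating rules in Eqs. (2a)-(2c) for a single grayscale pixel; the constants ALPHA, MATCH_SIGMAS, INIT_W, and INIT_VAR are hypothetical default-style values, and w, mu, and var are length-M NumPy arrays holding the mixture parameters of one pixel.

```python
import numpy as np

ALPHA = 0.005        # weight learning rate alpha (hypothetical default)
MATCH_SIGMAS = 2.5   # match threshold: 2.5 standard deviations
INIT_W, INIT_VAR = 0.05, 15.0 ** 2   # low initial weight, large initial variance

def gaussian_kernel(x, mu_m, var_m):
    """Gaussian kernel g(x; mu, sigma) used to smooth the learning rate."""
    return np.exp(-0.5 * (x - mu_m) ** 2 / var_m)

def update_pixel(x, w, mu, var):
    """One time step of Eqs. (2a)-(2c) for a single pixel and M models."""
    matched = (x - mu) ** 2 < (MATCH_SIGMAS ** 2) * var   # 2.5-sigma match test
    if matched.any():
        m = int(np.argmax(matched))              # index of the matching model
        o = np.zeros_like(w)
        o[m] = 1.0                               # binary ownership o_{m,t}
        rho = ALPHA * gaussian_kernel(x, mu[m], var[m])   # rho_{m,t}
        delta = x - mu[m]
        w += ALPHA * (o - w)                     # Eq. (2a)
        mu[m] += rho * delta                     # Eq. (2b)
        var[m] += rho * (delta ** 2 - var[m])    # Eq. (2c)
    else:
        m = int(np.argmin(w / np.sqrt(var)))     # replace the least probable model
        w[m], mu[m], var[m] = INIT_W, x, INIT_VAR
    w /= w.sum()                                 # keep the weights normalized
    return w, mu, var
```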

Enhanced GMM-Based Background Using Euclidean Divergence (ED-GMM): The updating rules of the GMM parameters, used in Eq. (2), can be derived within the conventional Least Mean Square (LMS) formulation framework as follows:

$$\begin{aligned} \theta _{t} = \theta _{t-1}-\eta _{\theta _{t-1}} \partial _{\theta _{t-1}} \{\varepsilon ^{2}_{\theta _{t-1}}\}, \end{aligned}$$
(3)

where \(\theta _t{{\mathrm{\,\in \,}}}\{w_t, {\varvec{\mu }}_t, \sigma _t\}\) denotes each of the estimated parameters, updated with the corresponding learning rate:

$$\begin{aligned} \eta _{\theta _{t-1}}{{\mathrm{\,\in \,}}}\left\{ \begin{array}{l} \eta _{w_{t-1}}{{\mathrm{\,=\,}}}\alpha /2\\ \eta _{\mu _{t-1}}{{\mathrm{\,=\,}}}\alpha \sigma ^2_{t-1}/2\\ \eta _{\sigma _{t-1}}{{\mathrm{\,=\,}}}\alpha g\left( {\varvec{x}}_t;{\varvec{\mu }}_{t},\sigma _{t}\right) /2\\ \end{array}\right. \end{aligned}$$
(4)

and the following cost functions, respectively:

$$\begin{aligned} \varepsilon _{\theta _{t-1}} {{\mathrm{\,\in \,}}}\left\{ \begin{array}{l} \varepsilon _{w_{t-1}} {{\mathrm{\,=\,}}}o_t-w_{t-1}\\ \varepsilon _{{\varvec{\mu }}_{t-1}} {{\mathrm{\,=\,}}}g\left( {\varvec{x}}_t;{\varvec{\mu }}_{t},\sigma _{t}\right) \\ \varepsilon _{\sigma _{t-1}} {{\mathrm{\,=\,}}}|{\varvec{x}}_t-{\varvec{\mu }}_{t}|^2-\sigma _{t-1}^2\\ \end{array}\right. \end{aligned}$$
(5)
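As a consistency check, substituting the weight learning rate and cost function from Eqs. (4) and (5) into Eq. (3) recovers the original weight updating rule of Eq. (2a):

$$\begin{aligned} w_{t} = w_{t-1}-\frac{\alpha }{2}\,\partial _{w_{t-1}} \left\{ (o_t-w_{t-1})^{2}\right\} = w_{t-1}+\alpha \left( o_t-w_{t-1}\right) . \end{aligned}$$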

It is worth noting that the \({\varvec{\mu }}_t\) and \(\sigma _t\) updating rules, grounded on the kernel similarities \(g({\varvec{x}}_t;\cdot ,\cdot )\) (Eqs. (4) and (5)), provide smoothness to encode the uncertainty of a pixel belonging to either the background or the foreground. In contrast, the cost function of the weights is set using a binary reference (i.e., the membership \(o_t\)). This updating rule may lead to noisy foreground masks when the ownership value is uncertain, especially in environments holding dynamical sources such as trees waving, water flowing, or snow falling. To cope with this, we propose to set the cost function of \(w_{t}\) using the ED as follows:

$$\begin{aligned} \varepsilon _{w_{t-1}} {{\mathrm{\,=\,}}}\bar{g}\left( {\varvec{x}}_t;{\varvec{\mu }}_{m,t},{\sigma _{m,t}}\right) -w_{t-1}. \end{aligned}$$
(6)

The ED allows measuring the difference between two probabilities [7]. So, back in the LMS scheme, we aim to minimize the ED between an instant probability, determined by the Gaussian kernel \(\bar{g}\left( {\varvec{x}}_t;{\varvec{\mu }}_{m,t},{\sigma _{m,t}}\right) \), and a cumulative probability, encoded by \(w_{t-1}\). This is grounded in the fact that, if a model has a high cumulative probability \(w_{t-1}\), such a model has suitably adapted to the pixel dynamics through time; then, \(\bar{g}\left( {\varvec{x}}_t;{\varvec{\mu }}_{m,t},{\sigma _{m,t}}\right) \) should have a high value too. Since the difference between both is expected to be low, the following updating rule is introduced:

$$\begin{aligned} w_{m,t} = w_{m,t-1}+\alpha (\bar{g}\left( {\varvec{x}}_t;{\varvec{\mu }}_{m,t},{\sigma _{m,t}}\right) -w_{m,t-1}) \end{aligned}$$
(7)

where the kernel term, \(\bar{g}\left( {\varvec{x}}_t;{\varvec{\mu }}_{m,t},{\sigma _{m,t}}\right) {{\mathrm{\,=\,}}}{\mathbb {E}}\left\{ g\left( {\varvec{x}}^{c}_{t};{\varvec{\mu }}^{c}_{m,t},\sigma _{m,t}\right) {{\mathrm{{:}}}}\forall c{{\mathrm{\,\in \,}}}C\right\} ,\) measures the average similarity along color channels. Also, since we aim to incorporate the information about each new input sample into all the GMM Gaussian models, we exclude the ownership \(o_t\) from the weight updating.
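A minimal sketch of the resulting weight update, again our own illustration rather than the authors' code, is given below; it assumes a C-channel pixel x, per-model mean vectors mu (an M-by-C array), and per-model scalar variances var.

```python
import numpy as np

ALPHA = 0.005  # hypothetical learning rate, as in the sketch above

def mean_kernel(x, mu_m, var_m):
    """g-bar: Gaussian kernel similarity averaged over the C color channels."""
    return float(np.mean(np.exp(-0.5 * (x - mu_m) ** 2 / var_m)))

def update_weights_ed(x, w, mu, var):
    """ED-GMM weight update of Eq. (7), applied to all M models.

    Unlike Eq. (2a), no binary ownership o_t is used: every model moves
    toward its own instant probability g-bar, so uncertain pixels yield
    smooth, non-binary weight corrections.
    """
    for m in range(len(w)):
        w[m] += ALPHA * (mean_kernel(x, mu[m], var[m]) - w[m])   # Eq. (7)
    return w / w.sum()   # renormalize the mixture weights
```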

3 Experimental Set-Up

Aiming to validate the proposed cost function, the ED-GMM approach is compared against the traditional GMM (GMM1) and the Zivkovic GMM (ZGMM) proposed in [8], which introduces Dirichlet priors into the weight updating rules to automatically set the number of Gaussians M. The following three experiments are performed: (i) visual inspection of the temporal weight evolution to clarify the performance of the background model and the foreground/background discrimination through time; (ii) foreground/background discrimination over a wide variety of real-world videos that hold ground-truth sets; and (iii) robustness against variations of the learning rate parameter in foreground/background discrimination tasks. The following datasets are employed for the experiments:

  • DBa- Change Detection: (at http://www.changedetection.net/) Holds 31 video sequences of indoor and outdoor environments, where spatial and temporal regions of interest are provided. Ground-truth labels are background, hard shadow, outside region of interest, unknown motion, and foreground.

  • DBb- A-Star-Perception: (at http://perception.i2r.a-star.edu.sg) Recorded in both indoor and outdoor scenarios, contains nine image sequences with different resolutions. The ground-truths are available for random frames in each sequence and hold two labels: background and foreground.

Measures: The foreground/background discrimination is assessed only for two ground-truth labels (foreground and background) through supervised pixel-based measures: Recall, \(r{{\mathrm{\,=\,}}}t_p/(t_p + f_n)\), Precision, \(p{{\mathrm{\,=\,}}}t_p / (t_p + f_p)\), and \(F_1{{\mathrm{\,=\,}}}2 p r/(p + r)\), where \(t_p\) is the number of true positives, \(f_p\) the number of false positives, and \(f_n\) the number of false negatives, all obtained by comparing against the ground-truth. The measures range within [0, 1], where the higher the attained measure, the better the achieved segmentation.
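For reference, these measures can be computed from a predicted binary mask and its ground-truth as in the following sketch (ours, for illustration; pred and gt are boolean NumPy foreground masks):

```python
import numpy as np

def segmentation_scores(pred, gt):
    """Pixel-based Recall, Precision, and F1 from boolean foreground masks."""
    tp = np.sum(pred & gt)     # true positives
    fp = np.sum(pred & ~gt)    # false positives
    fn = np.sum(~pred & gt)    # false negatives
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```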

Implementation and Parameter Tuning: The ED-GMM algorithm is developed using the C++ BGS library [9] as a basis. Parameters are left at their defaults for all the experiments, except for the third task, which requires varying the learning rate \(\alpha \). We set three mixing models, \(M{{\mathrm{\,=\,}}}3\) (noted as Model1, Model2, and Model3). The GMM1 and ZGMM algorithms are also taken from the BGS library.

Fig. 1. Temporal evolution of \({\varvec{w}}_t\) and \({\varvec{\mu }}_t\) for pixel (150, 250, 1) from the DBa-snowFall video. (a) GMM1 \({\varvec{\mu }}_t\), (b) ED-GMM \({\varvec{\mu }}_t\), (c) GMM1 \({\varvec{w}}_t\) and label, (d) ED-GMM \({\varvec{w}}_t\) and label (Color figure online)

4 Results and Discussion

Temporal Analysis: We conduct a visual inspection of the temporal evolution of the estimated parameters to clarify the contribution of the proposed weight cost function. Testing is carried out on the DBa-snowFall video, for which a single pixel in the red color channel is tracked, as seen in Fig. 1, showing the temporal evolution of \({\varvec{\mu }}_t\) (top row) and \({\varvec{w}}_t\) (bottom row). Also, the inferred foreground/background labels and the ground-truth are shown in Fig. 1(c) and (d) ('1': foreground, '0': background). It can be observed that the \({\varvec{\mu }}_t\) parameter estimated by either GMM1 (Fig. 1(a)) or ED-GMM (Fig. 1(b)) is similar for the three considered mixing models. However, the weights estimated by ED-GMM are better updated along time. Particularly, the ED-GMM weight increases around the 500th frame, where Model2 (in green) is generated (see Fig. 1(d)). Then, the model properly reacts to the pixel change occurring close to the 800th frame, obtaining labels corresponding to the ground-truth (background). In contrast, the GMM1 updating rule makes the \({w}_t\) weight remain almost zero even if Model2 gets very close to the pixel value (see Fig. 1(c)). As a consequence, this strategy wrongly infers foreground labels.

Table 1. Foreground discrimination performance

Performance of the Foreground/Background Discrimination Task: Aiming to check the generalization ability of the ED-GMM method, we test 25 videos embracing a wide variety of dynamics. The videos are grouped into two categories, a and b: the former holds videos where the background is mostly static, and the latter, videos where the background exhibits highly dynamical variations. The total average seen in Table 1 shows that ED-GMM reaches higher precision in the discrimination of the foreground/background labels, decreasing the amount of false positives. This fact is explicable since the proposed weight updating rule (see Eq. (7)) allows the ED-GMM models to adapt faster to changes in the pixel dynamics. The above is even more remarkable for videos with dynamical background sources, as seen in Category b, in which the precision is improved by 10% compared against the other two methods. On the other hand, GMM1 and ZGMM attain very similar results, since the main proposal of Zivkovic was focused on reducing the computational cost. As a result, the foreground masks attained by ED-GMM have fewer false positives and are more similar to the ground-truth masks, as seen in Fig. 2, which shows concrete scenarios with highly dynamical background sources related to snow falling (DBa-snowFall, DBa-winterDriveway) and water flowing (DBa-fountain02, DBb-waterSurface).

Fig. 2. Foreground masks of interest. (a) Original frame, (b) Ground-truth, (c) GMM1, (d) ZGMM, (e) ED-GMM.

Robustness Against Variation of the Learning Rate Parameter: The influence of the learning rate variation on the foreground/background discrimination is assessed through the supervised measures, estimated from the Category a videos (DBa-highway, DBa-office, DBa-pedestrians, and DBa-pets2006) and the Category b videos (DBa-boats, DBa-canoe, DBa-fountain02, and DBa-overpass).

Figure 3 shows the obtained supervised measures, averaged over these videos, where the x axis is the logarithm of the employed \(\alpha \) rate, ranging within {0.0005, 0.001, 0.005, 0.01, 0.03, 0.05, 0.07, 0.1, 0.15, 0.2}. It can be seen that the proposed ED-GMM (continuous lines) behaves similarly to the traditional GMM1 method (dashed lines) and ZGMM (dotted lines). However, the obtained Precision and F1 measures are everywhere higher than those reached by GMM1 and ZGMM. A reliable operating interval is found within \(\alpha {{\mathrm{\,\in \,}}}[0.005, 0.01]\), where the highest F1 measure is reached.

Fig. 3. Supervised measures changing the learning rate value for GMM1, ZGMM, and ED-GMM.

5 Conclusions

We propose a new cost function for the GMM weight updating based on the Euclidean divergence. The proposed cost function is coupled into the traditional GMM, producing a new approach, named ED-GMM, that supports the background modeling task for videos recorded in highly dynamical scenarios. The Euclidean divergence allows comparing the instant and cumulative probabilities of a GMM model fitting the pixel input samples. Then, employing LMS, we minimize the Euclidean divergence of such probabilities to adjust the weight values through time. The carried-out experiments show that ED-GMM reduces the amount of false positives in the obtained foreground masks compared against the traditional GMM and the Zivkovic GMM, especially for videos holding dynamical background sources: water flowing, snow falling, and trees waving. Additionally, the proposed cost function proved robust when varying the learning rate parameter value, always achieving better results than the traditional GMM. Consequently, the proposed cost function can be coupled into more complex GMM-based background modeling approaches to improve the foreground/background discrimination. As future work, the authors plan to test the proposed cost function using selective updating strategies to improve the discrimination in scenarios holding motionless foreground objects.