1 Introduction

Video object segmentation is used in many application areas of computer vision, including video surveillance, AI-enabled traffic monitoring, path detection, robotics, autonomous navigation, activity-based human recognition, and many others. Surveillance is one of the most critical of these areas and involves the detection, tracking and classification of a moving object or group of objects, as well as the recognition of various motions or poses. The effectiveness of such a system is characterized first by how accurately, in shape and size, it can detect an object or any suspicious behavior of an object (human), and second by how reliable it is under different environmental conditions, such as lighting and background.

In computer vision, sensors capture the real-time scene. The generated images can contain several moving or stationary objects in front of a static or dynamic background. A common assumption is that the background is static, since in many surveillance systems the camera remains fixed. Some researchers [13] have shown that a background changes only due to camera motion, which can be compensated globally. However, there are situations with continuous motion in the background, such as tree leaves, water bodies, or a moving object that later becomes stationary.

Therefore, instead of assuming a static background, we define the foreground object as the set of pixels that are not stationary and change position and direction between frames. Moving foreground objects can be detected in two different ways: (1) by motion detection and (2) by motion estimation. In motion detection, we identify changed regions between frames when the camera is fixed. Motion estimation computes the motion vector, i.e., the expected position of the moving object in the next frame. In surveillance, it is sometimes also required to determine the speed and acceleration of the moving object.

In practical situations, object extraction is difficult because of noise generated by the capturing device, textural similarity between foreground and background, changes in lighting conditions, dynamic background (water bodies, tree leaves, rain, wind), moving objects that become static or suddenly start moving, occlusion between objects, and, last but not least, the presence of shadow.

There are several approaches for foreground object extraction, such as temporal differencing, spatial homogeneity, optical flow, and change detection [1, 4, 5], which can be pixel based or region based. Pixel-based detection algorithms are sensitive to pixel variations (noise, illumination changes). The methods in [6-8] handle noise and illumination change using an adaptive background model. Tsai and Lai [9] do not use a background model and instead analyze independent components. Region-based change detection methods, on the other hand, measure the characteristics of a region around a pixel location. The likelihood ratio test [10] applies a hypothesis test to the intensity distribution of a region. The shading model [11] considers the ratio of intensities in a region. Liu et al. [12] consider the reflectance component of image intensity, and Li and Leung [13] combine texture and intensity differences.

The main restriction on change detection algorithms is the requirement of a reference frame (a frame with no object). Cases where the moving objects differ in speed, or where an object moves and then stops for some time, make identification with change detection difficult. The presence of cast shadow in the background region can also cause detection problems.

Among all the above approaches, the most widely used one for object detection with a fixed camera, in the absence of any prior knowledge of the foreground object or the background, is background subtraction [6, 7, 14-20]. A background reference frame is computed either by averaging background frames or by taking an initial estimate of the background frame and iteratively updating it to obtain the final estimate. Pfinder [6] uses a Gaussian distribution at each pixel as the background model. Haritaoglu et al. [14] model the background by representing each pixel by its minimum and maximum intensity and the maximum intensity difference between frames. Marko and Matti [21] present a texture-based method in which each pixel is modeled as a group of adaptive local binary pattern histograms. The background model should reflect the real background as accurately as possible and should adapt to sudden scene changes such as the start or stop of a moving object. Ghosts and shadows also affect the detection of an object.

Whether the object detection method is pixel based or region based, thresholding the difference image is the most challenging task. In many cases a single threshold is used, but a single threshold can only separate two classes. From a classification point of view, applying P thresholds results in \(P + 1\) classes; for \(P = 1\), we have two categories, background and foreground. Consider a histogram of pixel intensities of a given frame (background plus foreground). In the ideal case (a bimodal distribution), the histogram has a deep and sharp valley between two peaks (representing object and background), and a single threshold T1 is enough to separate the two classes. In real cases, however, the valleys are flat, broad and noisy, and sharp valleys are hard to obtain, making it difficult to find the threshold value for segmentation. Instead of selecting a threshold by trial and error, several adaptive algorithms [22-29] have been proposed. To overcome the limitations of a single threshold, it is better to consider multiple thresholds.

Table 1 Examples for multivariate units and variables

In this paper, our aim is to detect a moving object with high accuracy by reducing false negatives and false positives as much as possible. The organization of the paper is as follows. Section 2 explains the algorithm for object extraction. Section 3 explains multivariate analysis of variance using the Chi-square distribution. Section 4 explains hypothesis generation for object detection. Section 5 provides experimental results and analysis. Section 6 describes shadow detection and removal approaches. The conclusion is given in Sect. 7.

2 Proposed algorithm for object detection

The basic idea of our algorithm is change detection; however, the moving object region is not obtained directly by background subtraction. Instead, our estimation of the background is based on multivariate analysis of the data. Multivariate analysis consists of methods that can be used when several measurements are made on each object in one or more samples. The measurements are known as variables, and the individuals or objects are the units (research units, sampling units, or experimental units) or observations. Some real-world examples of multivariate data units and their variables are given in Table 1. Similarly, the RGB image that we use for object extraction is a multivariate data unit, and the variables are the R, G and B color components.

Sometimes it is sensible to extract each variable and study it separately. However, variables may be correlated with one another, and in many cases they are entangled in such a way that studying them individually does not provide enough information. Multivariate analysis provides methods to examine the behavior of correlated variables simultaneously, so that we can assess the key features of the process that produced them.

Multivariate methods help us (1) to find the joint behavior of the multivariate (M.V.) variables and (2) to identify the effect of one variable on another. Multivariate analysis provides both descriptive and inferential procedures, with which we can search for patterns in the data or test hypotheses about such patterns. Several methods are available that focus primarily on variance, covariance and ratios of variances. The most commonly used methods, MANOVA and ANOVA, deal with the variance of the variables. Variance is a numerical representation of the spread of a variable in the population. If two variables are associated or correlated with one another, they share some common property that makes them vary together.

This concept of multivariate analysis can easily be extended to extract the foreground object. The input images (containing the moving object) are correlated M.V. units, and the color components, in our case R, G and B, act as M.V. variables. A block diagram of the proposed scheme is shown in Fig. 1. The complete process is divided into five major steps. The first step is to generate a background model; in our algorithm we use a very simple method, namely averaging the background frames. The second step is to generalize the M.V. variables (R, G, B color components) as a multivariate Gaussian distribution. The third step uses MANOVA and the Chi-squared distribution to identify the correlation between the variables, which is used in the fourth step to generate the hypothesis. Finally, in the fifth step, each input image pixel is checked against the generated hypothesis. Each step is further explained in the following subsections.

Fig. 1 Block diagram of the proposed object extraction method
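As a concrete illustration of the first step, the background model can be obtained by averaging a set of object-free frames while also recording the per-channel variances that the later MANOVA step needs. The following is a minimal numpy sketch under that assumption; the function and variable names are hypothetical, not part of the original implementation.

```python
import numpy as np

def build_background_model(frames):
    """Average object-free RGB frames and keep per-channel variances.

    frames: array-like of shape (N, H, W, 3).
    Returns (mean, variance), each of shape (H, W, 3).
    """
    stack = np.asarray(frames, dtype=np.float64)
    mean = stack.mean(axis=0)         # averaged background model (step one)
    var = stack.var(axis=0) + 1e-6    # per-channel variance; epsilon avoids division by zero
    return mean, var
```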

2.1 Multivariate Gaussian generalization for RGB color components

The multivariate procedures used here are based on the multivariate normal distribution, which has the following basic properties:

  • The distribution can be completely described using only means, variances and covariance.

  • If the variables are uncorrelated, they are independent.

  • The dependent variable should be normally distributed within groups.

Salvador et al. [30] state that if we find a unit vector (\(r^{*}, g^{*}, b^{*}\)) in the RGB space and project each pixel color vector \((R,G,B)_{(x,y)}\) onto this vector, the projected length is the intensity \(I_{(x,y)}\), and the residual vector is perpendicular to the color vector and lies on a 2D plane \(\beta \). By analyzing the distribution of residuals in plane \(\beta \), it is found that the residuals can be modeled by a 2D normal distribution whose isovalue curve is an ellipse in the plane.

Based on these observations, we consider the R, G and B components as random variables with their respective means and variances. The multivariate Gaussian generalization in d-dimensional space (\(d=3\)) is given by:

$$\begin{aligned} p(x)=\frac{1}{(2\pi )^{\frac{d}{2}} \vert \Sigma \vert ^{\frac{1}{2}}} \exp \left( -\frac{1}{2}(x-\mu )^{T}\Sigma ^{-1}(x-\mu )\right) \end{aligned}$$
(1)

where \(\mu = E[x]\) is the mean vector and \(\Sigma \) is the (\(d \times d\)) covariance matrix given by

$$\begin{aligned} \text {Cov}(x_{1}, x_{2}, x_{3}) = \Sigma = \left[ \begin{array}{ccc} \sigma _{1}^{2} &{} \sigma _{12} &{} \sigma _{13} \\ \sigma _{21} &{} \sigma _{2}^{2} &{} \sigma _{23}\\ \sigma _{31} &{} \sigma _{32} &{} \sigma _{3}^{2} \end{array} \right] \end{aligned}$$
(2)
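For illustration, \(\Sigma \) in Eq. (2) can be estimated from the background samples of a pixel (its R, G, B values across the background frames) with numpy; this is only a sketch with placeholder data, not the authors' code.

```python
import numpy as np

# samples: (n, 3) array of the (R, G, B) values of one pixel across n
# background frames; random placeholder data is used here for illustration.
samples = np.random.default_rng(0).normal(size=(100, 3))

sigma = np.cov(samples, rowvar=False)  # 3 x 3 covariance matrix, Eq. (2)
variances = np.diag(sigma)             # sigma_R^2, sigma_G^2, sigma_B^2
```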

Considering the case of a diagonal covariance matrix, the isovalue curves are equivalent to

$$\begin{aligned} x^{T}\Sigma ^{-1}x = \left[ x_{1},x_{2},x_{3}\right] \left[ \begin{array}{ccc} \frac{1}{\sigma _{1}^{2}} &{} 0 &{} 0\\ 0 &{} \frac{1}{\sigma _{2}^{2}} &{} 0\\ 0 &{} 0 &{} \frac{1}{\sigma _{3}^{2}} \end{array} \right] \left[ \begin{array}{c} x_{1} \\ x_{2}\\ x_{3} \end{array} \right] = C \end{aligned}$$
(3)
$$\begin{aligned} \frac{x_{1}^{2}}{\sigma _{1}^{2}} + \frac{x_{2}^{2}}{\sigma _{2}^{2}} + \frac{x_{3}^{2}}{\sigma _{3}^{2}} = C \end{aligned}$$
(4)

This is the equation of an ellipse whose axes are determined by the variances of the involved features. In our case, with the three features R, G and B, the above equation becomes

$$\begin{aligned} \left( \frac{x_{R}}{\sigma _{R}}\right) ^{2} + \left( \frac{x_{G}}{\sigma _{G}}\right) ^{2} + \left( \frac{x_{B}}{\sigma _{B}}\right) ^{2} = C \end{aligned}$$
(5)

The distribution of RGB components (blue colored samples) in the background model is shown in Fig. 2.

Fig. 2 Representation of the RGB distribution and the \(99\,\%\) confidence ellipse obtained using the Chi-squared distribution

3 Multivariate analysis of variance using Chi-squared distribution

In Eq. (5), 'C' defines the scale of the ellipse and could be any arbitrary number. The question is how to choose C such that the ellipse corresponds to a given confidence level (e.g., 95 or 99 %). The left-hand side of Eq. (5) is a sum of squares of normally distributed random variables, each normalized by its standard deviation. The Chi-square distribution is suitable here: if \(x_{i}\), \(i=1,2,\ldots ,N\), are samples of a standard Gaussian distribution, then y below is a Chi-square distributed variable with N degrees of freedom.

$$\begin{aligned} y=x_{1}^{2} + x_{2}^{2} + \cdots + x_{N}^{2} \end{aligned}$$
(6)

The Chi-square distribution is parameterized by its degrees of freedom ('Df'), which represent the number of unknowns; in our case there are three unknowns and, therefore, three degrees of freedom. We can thus obtain the probability that the above sum, and hence 'C', equals a particular value from the Chi-square density. Since we are interested in a confidence interval, we look for the probability that 'C' is less than or equal to a particular value, which can be obtained from the cumulative Chi-square distribution.

Using the Chi-square probabilities in Table 2 with degrees of freedom \(=\) 3, we find that

$$\begin{aligned} P(C < 7.815) = 1- 0.05 = 0.95 \end{aligned}$$
(7)

And similarly,

$$\begin{aligned} P(C < 11.345) = 1- 0.01 = 0.99 \end{aligned}$$
(8)

From Eqs. (7) and (8) it is clear that the value of the constant 'C' varies from 7.815 to 11.345. A \(99\,\%\) confidence ellipse (red colored) is displayed along with the RGB distribution in Fig. 2. The confidence ellipse does not cover all the data; the main reason is the presence of outliers, i.e., values that are very low or very high compared with most values in the data set. Outliers should be removed before performing MANOVA.
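The constant C for a desired confidence level can be read from Table 2 or computed from the inverse cumulative Chi-square distribution. A short sketch (assuming scipy is available):

```python
from scipy.stats import chi2

df = 3                       # three unknowns: R, G and B
c_95 = chi2.ppf(0.95, df)    # ~7.815, cf. Eq. (7)
c_99 = chi2.ppf(0.99, df)    # ~11.345, cf. Eq. (8)
print(c_95, c_99)
```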

Table 2 Chi-square probabilities

4 Hypothesis generation

The multivariate analysis of variance performed on the background model is used to generate a hypothesis that is evaluated for every differenced pixel obtained from the input image and the background model. The equation used is

$$\begin{aligned} \frac{x_{R}^{2}}{C \cdot V_{R}} + \frac{x_{G}^{2}}{C \cdot V_{G}} + \frac{x_{B}^{2}}{C \cdot V_{B}} = 1 \end{aligned}$$
(9)

where \(x_{R}, x_{G}, x_{B}\) are the R, G and B components of the differenced pixel, \(V_{R},V_{G},V_{B}\) are the variances obtained from the background model, and the constant 'C' varies from 7.815 to 11.345.

The hypotheses are stated as:

  • NULL HYPOTHESIS: If Eq. (9) is satisfied, the pixel belongs to the background and is assigned the value ZERO.

  • ALTERNATIVE HYPOTHESIS: If the null hypothesis is false, the pixel belongs to the foreground and is assigned the value ONE.

Before testing the hypothesis on the differenced pixels, we have to remove the outliers as discussed in Sect. 3. Several thresholding approaches exist, such as single thresholds, multiple thresholds and adaptive thresholding, as discussed in Sect. 1. Since a single threshold is not very useful in practical situations, two different thresholds are used in our algorithm. The only disadvantage is that the thresholds are chosen through trial and error for object extraction.
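A minimal sketch of the per-pixel test of Eq. (9) is given below. The background mean and variances come from the model of Sect. 2 and C from Table 2; the outlier removal and the second threshold used in the experiments are omitted here, so this is an illustrative simplification rather than the exact implementation.

```python
import numpy as np

def classify_foreground(frame, bg_mean, bg_var, C=11.345):
    """Label pixels as foreground (1) or background (0) using Eq. (9).

    frame, bg_mean, bg_var: float arrays of shape (H, W, 3).
    """
    diff = frame.astype(np.float64) - bg_mean        # differenced pixels
    stat = np.sum(diff ** 2 / (C * bg_var), axis=2)  # left-hand side of Eq. (9)
    # Null hypothesis (stat <= 1): background, value ZERO; otherwise foreground, value ONE
    return (stat > 1).astype(np.uint8)
```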

Fig. 3 Case 1: indoor scene with static background and low illumination

5 Experimental results and analysis

The proposed method, which uses multivariate analysis of the background variance for object detection, has been evaluated on images with different illumination conditions in both indoor and outdoor cases. Simulations are carried out on image frames obtained from the CAVIAR, MOT challenge benchmark and change detection datasets.

Cases with a static background are taken from the CAVIAR (Context Aware Vision using Image-based Active Recognition) project, which includes people walking alone, fighting and passing out. The resolution is half-resolution PAL standard (384 \(\times \) 288 pixels, 25 frames per second) compressed using MPEG2, with file sizes mostly between 6 and 12 MB and a few up to 21 MB. The change detection and MOT challenge datasets are used for cases exhibiting dynamic background motion.

The segmentation results are displayed in Figs. 3, 4, 5, 6, 7 and 8 with the following layout: (a) input image, (b) averaged background model, (c) obtained result and (d) ground truth. Figures 3 and 7 are cases for which ground truth is not available in the dataset, so the performance of the method is evaluated solely from the obtained results. The results show that the proposed method works well on indoor and outdoor scenes with static and dynamic backgrounds under varying illumination. The only observed problems are the presence of cast shadow in high-illumination cases (Case 3) and false holes inside the object silhouette (Case 4). The output can be further processed using morphological operations to reduce the effect of false holes.

Fig. 4 Case 2: indoor scene with static background and moderate illumination

Fig. 5 Case 3: outdoor scene with static background and high illumination

Fig. 6 Case 4: outdoor scene with dynamic background having water bodies, leaves and moving objects

Fig. 7 Case 5: a floating object

Fig. 8 Case 6: a canoe with a group of people

5.1 Error rate

The error rate is used to evaluate the effectiveness of the algorithm and is given by the following equation:

$$\begin{aligned} \mathrm{Error\,Rate} = \frac{\mathrm{Error\,Pixel\, Count}}{\mathrm{Frame\,Size}} \end{aligned}$$
(10)

where the error pixel count is the number of false positive and false negative pixels. Figures 9 and 10 show the error rate for Case 4; the error rate is reduced after refinement, as shown in Fig. 10. In object detection, the error pixel count should be kept as small as possible for accurate results. The accuracy of the proposed algorithm is evaluated using the ROC curve in Fig. 11, which is generated by fixing one of the thresholds and varying the other. As the ROC curve shows, the false positive count, i.e., the number of background pixels detected as object pixels, does not change over a large range, which is one of the strengths of the proposed algorithm.
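The error rate of Eq. (10) can be computed directly from the binary result and its ground truth; a minimal sketch with hypothetical variable names:

```python
import numpy as np

def error_rate(result, ground_truth):
    """Eq. (10): (false positives + false negatives) / frame size."""
    result = result.astype(bool)
    ground_truth = ground_truth.astype(bool)
    fp = np.count_nonzero(result & ~ground_truth)   # background labeled as object
    fn = np.count_nonzero(~result & ground_truth)   # object labeled as background
    return (fp + fn) / result.size
```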

Fig. 9 Error rate in each frame for Case 4 before refinement

Fig. 10 Error rate in each frame for Case 4 after refinement

Fig. 11 ROC curve for object detection

5.2 Boundary displacement error

The boundary displacement error reflects the discrepancy between the obtained boundary and the actual (ground truth) boundary. The boundary of the result is obtained by first removing the holes inside the object using morphological operations and then applying Canny edge detection; the ground truth boundary is obtained using Canny edge detection alone. The boundary shown in white is the obtained boundary and the red-colored one is the actual boundary. The displacement error for four consecutive frames is shown in Fig. 12. If a detected boundary pixel lies exactly on the ground truth boundary, the error is zero; if it overlaps a point on the ground truth boundary dilated or eroded with radius 'r', the displacement error is 'r' or '\(-r\)' pixels, respectively. The output shows that we obtain either zero or negative displacement error, but no positive displacement error.
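One possible realization of this measurement is sketched below: holes are filled morphologically, both boundaries are extracted with Canny, and the displacement of each detected boundary pixel is read from a distance transform of the ground-truth boundary, with a negative sign inside the object. This is an illustrative approximation of the described procedure (the kernel size and Canny thresholds are assumptions), not the exact implementation.

```python
import cv2
import numpy as np

def boundary_displacement(result_mask, gt_mask):
    """Signed displacement of detected boundary pixels from the ground-truth boundary."""
    kernel = np.ones((5, 5), np.uint8)
    filled = cv2.morphologyEx(result_mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    det_edge = cv2.Canny((filled > 0).astype(np.uint8) * 255, 100, 200) > 0
    gt_edge = cv2.Canny((gt_mask > 0).astype(np.uint8) * 255, 100, 200) > 0
    # Distance of every pixel to the nearest ground-truth boundary pixel
    dist = cv2.distanceTransform((~gt_edge).astype(np.uint8), cv2.DIST_L2, 3)
    sign = np.where(gt_mask > 0, -1.0, 1.0)   # negative when the detected pixel lies inside the object
    return (sign * dist)[det_edge]            # one signed value per detected boundary pixel
```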

Fig. 12 Boundary displacement for four continuous frames

6 Shadow detection and removal

Segmentation of a moving object from its background has been an important research topic in the recent past. However, as shown in the experimental results (Figs. 4, 5, 6), the major issues with segmentation are cast shadows and self shadows, which make the segmentation inaccurate. Numerous shadow detection algorithms are available, based on several color models, the major ones being RGB, HSV, HSI and YCbCr. In this section we discuss some shadow detection and removal approaches from the literature (Sects. 6.1, 6.2, 6.3) for a better understanding of shadow removal, and finally we propose an automatic shadow removal approach for the RGB color model in Sect. 6.4.

A shadow occurs when an object totally or partially occludes the light coming from the light source. Cast shadow is the darkened region on the background of an image caused by the foreground object blocking the light source; its presence can modify the perceived object shape. Self shadow is the part of the object that is not illuminated by direct light; its presence modifies the perceived object shape and color. A shadow has two parts, the umbra and the penumbra: the umbra corresponds to the area where the direct light is almost totally blocked, whereas the penumbra is the area where the light is only partially blocked.

6.1 Shadow identification and classification using luminance and chrominance edge map

This method [30] proposes to exploit color information for shadow detection by using the invariance properties of some color transformations. Among the traditional color features, normalized RGB, hue (H) and saturation (S) are invariant to shadows and shading. In addition to these well-known color spaces, the invariant color models \(c_{1}c_{2}c_{3}\) and \(l_{1}l_{2}l_{3}\) are proposed in [31].

Optimum results were obtained using the \(c_{1}c_{2}c_{3}\) color model, whose components are defined as:

$$\begin{aligned} c_{1}=\text {arctan}\left( \frac{R}{\text {max}(G,B)}\right) \end{aligned}$$
(11)
$$\begin{aligned} c_{2}=\text {arctan}\left( \frac{G}{\text {max}(R,B)}\right) \end{aligned}$$
(12)
$$\begin{aligned} c_{3}=\text {arctan}\left( \frac{B}{\text {max}(R,G)}\right) \end{aligned}$$
(13)
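The conversion of Eqs. (11)-(13) is straightforward to implement; a minimal numpy sketch (arctan2 is used to avoid division by zero, which is an implementation choice rather than part of the original formulation):

```python
import numpy as np

def rgb_to_c1c2c3(img):
    """Eqs. (11)-(13): shadow-invariant c1c2c3 components of an RGB image."""
    r, g, b = (img[..., i].astype(np.float64) for i in range(3))
    c1 = np.arctan2(r, np.maximum(g, b))
    c2 = np.arctan2(g, np.maximum(r, b))
    c3 = np.arctan2(b, np.maximum(r, g))
    return np.stack([c1, c2, c3], axis=-1)
```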

The first step is to convert the input image to a color model sensitive to shadow and to obtain an edge map by applying the Sobel operator to the luminance component of the input image. Morphological operations can be applied if the edge map does not form closed contours. The edge map restricts the search for shadow pixels to the portion of the image occupied by the object and its cast shadow. The second step is shadow classification, in which the shadow pixels identified in the first step are classified into cast and self shadow. A color edge map is obtained by a logical OR of the edge maps produced by applying the Sobel operator to each color component. The color edge map detects the shadow points occupied by the object, i.e., the self shadow; the remaining shadow pixels from the first step, excluding the self shadow pixels found in the second step, are classified as cast shadow. The flowchart for this method is given in Fig. 13.

Fig. 13 Shadow identification and classification using two edge maps

6.2 Shadow detection using local and spatial information (statistical parametric approach)

There are several other ways to detect objects and shadows. One approach is to use information from the local, spatial or temporal domain. Local information is obtained from the appearance of individual pixels (a point in shadow becomes darker compared with its appearance when illuminated). Spatial information is obtained from neighboring pixels, since objects and shadows occupy compact regions in an image. Temporal information is obtained from the relation between the current frame and the previous frame.

The statistical parametric approach to shadow detection [32], developed for the ATON project, makes use of local information and can further combine it with spatial information. The flowchart of this approach is given in Fig. 14. The approach is based on the idea that the probability density function of a shadowed pixel can be computed from the change in the appearance of the pixel when shadowed, given its appearance when illuminated. An approximated linear transformation [33, 34]

$$\begin{aligned} \overrightarrow{V}=D\cdot V, \quad \text {where} \quad V=[R\ G\ B]^{T} \end{aligned}$$
(14)

is used to obtain the change in appearance. The diagonal matrix D is obtained from the slopes of the lines fitted to the plots of shadow versus background values for the three color components. Given the mean and variance of the color channels of a reference point, we can determine the mean and variance of the same pixel under shadow.

Fig. 14 Shadow identification using statistical parametric approach

Given \((\mu _{\text {IL}}^{R},\mu _{\text {IL}}^{G},\mu _{\text {IL}}^{B},\sigma _{\text {IL}}^{R},\sigma _{\text {IL}}^{G},\sigma _{\text {IL}}^{B})\), the means and variances of the reference point, and \(D=\text {diag}(d_{R},d_{G},d_{B})\), the diagonal matrix, we have

$$\begin{aligned}&\mu _{\text {SH}}^{i}=\mu _{\text {IL}}^{i}\cdot d_{i} \end{aligned}$$
(15)
$$\begin{aligned}&\sigma _{\text {SH}}^{i}=\sigma _{\text {IL}}^{i}\cdot d_{i}, \quad i\in \{R,G,B\} \end{aligned}$$
(16)

where IL and SH denote illuminated and shadowed, respectively.

Pixel segmentation is performed by estimating the a-posteriori probabilities separately for the background, foreground and shadow classes. A pixel is then assigned to the class with the maximum a-posteriori probability:

$$\begin{aligned} p(C_{i}\vert v)=\frac{p(v\vert C_{i})p(C_{i})}{\sum _{j=1,2,3}p(v\vert C_{j})p(C_{j})} \end{aligned}$$
(17)

where v is the feature vector of a given pixel, \(p(C_{i})\) is the prior probability of the ith class, and \(C_{1}=\) background, \(C_{2}=\) shadow and \(C_{3}=\) foreground.

Spatial constraints can be imposed on top of the local information by updating the class membership probability of each pixel based on the results of its neighboring pixels, which is then used to obtain new a-posteriori probabilities for all pixels.
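For a single pixel, the maximum a-posteriori decision of Eq. (17) reduces to a few lines; the sketch below assumes the class-conditional likelihoods \(p(v\vert C_{i})\) have already been evaluated (the inputs are hypothetical).

```python
import numpy as np

def map_classify(likelihoods, priors):
    """Eq. (17): posteriors for background, shadow and foreground; return the argmax class.

    likelihoods: [p(v|C1), p(v|C2), p(v|C3)], priors: [p(C1), p(C2), p(C3)].
    """
    posterior = np.asarray(likelihoods, float) * np.asarray(priors, float)
    posterior /= posterior.sum()
    classes = ("background", "shadow", "foreground")
    return classes[int(np.argmax(posterior))], posterior
```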

6.3 Shadow detection using temporal information

This approach [35] is based on the idea that shadow points can be detected as points that are static over a short temporal sequence and are characterized by a constant luminosity change with respect to the reference background image. The first step is temporal image analysis, in which two successive images \(I_{t-1}\) and \(I_{t}\) are compared to detect static points. Static point detection uses the radiometric similarity between two points:

$$\begin{aligned} R(p_{i},q_{i})=\frac{m[W_{1}(p_{i})W_{2}(q_{i})]-m[W_{1}(p_{i})]m[W_{2}(q_{i})]}{\sqrt{v[W_{1}(p_{i})]v[W_{2}(q_{i})]}} \end{aligned}$$
(18)

where m and v are the mean and variance estimated over the small windows \(W_{1}\) and \(W_{2}\). Two points \(p_{i}\) and \(q_{i}\) are said to be static if their radiometric similarity is greater than 0.9.
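Eq. (18) can be written directly for two equally sized windows centred on the candidate points; a minimal numpy sketch:

```python
import numpy as np

def radiometric_similarity(w1, w2):
    """Eq. (18): normalized cross-covariance of two image windows W1 and W2.

    Points are considered static when the returned value is greater than 0.9.
    """
    w1 = w1.astype(np.float64)
    w2 = w2.astype(np.float64)
    num = (w1 * w2).mean() - w1.mean() * w2.mean()
    den = np.sqrt(w1.var() * w2.var()) + 1e-12   # epsilon guards against flat windows
    return num / den
```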

The shadow points are the stationary points \(s_{i}\) of the temporal sequence (between the current frame \(I_{t}\) and the previous frame \(I_{t-1}\)) that differ from the corresponding background reference points \(b_{i}\) by a constant factor A due to the change of luminosity (photometric gain), i.e.

$$\begin{aligned} A=\alpha _{i}=s_{i}/b_{i}. \end{aligned}$$
(19)

Among all the recovered static points, the shadow points selected are those with a photometric gain less than 0.9. An iterative relaxation labeling method is then used to remove points that are not part of the static shadow but still satisfy the constant-luminosity-gain constraint. The algorithm searches for neighboring points that are mutually compatible according to this constraint, and the mutually compatible points with optimal photometric gain are selected as the final static shadow points.

The next step is to replace the static shadow points with the corresponding points from the background reference frame, yielding an image with the static shadow removed. Temporal image analysis is then performed again, between the background reference frame and the shadow-removed frame, to obtain moving points. These moving points are compared with the moving points previously obtained between \(I_{t}\) and \(I_{t-1}\), and the points common to both are taken as foreground points. The complete flowchart of the process is given in Fig. 15.

Fig. 15 Shadow identification using temporal information

6.4 Automatic shadow removal using texture, luminance and chrominance differences

Cast shadow is the darkened region on the background of an image that is due to the foreground objects blocking the light source.

Consider the textural, luminance and chrominance properties of background and shadow pixels. The luminance values of cast shadow pixels are normally lower than those of the corresponding background pixels, whereas their chrominance values are similar to those of the corresponding background pixels. In terms of texture, the textural features of a cast shadow are also very similar to those of the background; in other words, a cast shadow does not alter the difference in textural properties between the background and foreground pixels.

Texture-based segmentation methods [36] make use of the differences in textural properties between the background pixels, the shadow pixels and the object pixels, rather than just the intensity differences between them. It is therefore better to use all three properties (textural, luminance and chrominance differences) for object segmentation. The proposed method for automatic shadow removal [37] considers all three differences and merges the outputs using a logical OR operation to remove the shadow.

This method comprises two major steps.

  • The first step is to calculate the texture, luminance and chrominance differences (\(T_{\text {diff}},L_{\text {diff}},C_{\text {diff}}\)).

  • In the second step, threshold values are estimated from the histograms of these differences, and \(\text {TT}_{\text {diff}}\), \(\text {TL}_{\text {diff}}\) and \(\text {TC}_{\text {diff}}\) are computed by the isodata thresholding method. An \(\text {OR}_{\text {map}}\) is then constructed by performing a logical OR operation on these thresholded differences.

The texture description of an image block is commonly calculated using the following autocorrelation function R:

$$\begin{aligned} R(u,v)&=\frac{(2M+1)(2N+1)}{(2M+1-u)(2N+1-v)} \nonumber \\&\quad \times \frac{\sum _{m=0}^{2M-u}\sum _{n=0}^{2N-v}p(m,n)p(m+u,n+v)}{\sum _{m=0}^{2M}\sum _{n=0}^{2N}p^{2}(m,n)}, \nonumber \\&\quad \quad 0\le u \le 2M,\; 0 \le v \le 2N \end{aligned}$$
(20)

where u, v are the position displacements in the m, n directions, \(2M + 1\) and \(2N + 1\) are the dimensions of the image block, and p(m,n) is the intensity value at (m,n).
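A direct (unoptimized) translation of Eq. (20) for a \((2M+1)\times (2N+1)\) block is sketched below; it follows the equation as written and is meant only to illustrate the computation.

```python
import numpy as np

def autocorrelation(block):
    """Eq. (20): normalized autocorrelation of a (2M+1) x (2N+1) image block."""
    p = block.astype(np.float64)
    rows, cols = p.shape                      # 2M+1, 2N+1
    energy = np.sum(p ** 2)                   # denominator of Eq. (20)
    R = np.zeros((rows, cols))
    for u in range(rows):
        for v in range(cols):
            overlap = np.sum(p[: rows - u, : cols - v] * p[u:, v:])
            R[u, v] = (rows * cols) / ((rows - u) * (cols - v)) * overlap / energy
    return R
```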

Fig. 16 Shadow identification using texture, luminance and chrominance information

Fig. 17 Illustration of the proposed method. The T-map is obtained by applying textural autocorrelation to the input image; the L-map and C-map are obtained from the luminance and chrominance differences; the OR-map is the logical OR of (b)-(d). Morphological operations are used to remove the shadow part and obtain the final silhouette

Fig. 18 Background and object frames from four different videos

Fig. 19 Silhouettes extracted using the automatic shadow removal approach. \(\text {TT}_{\text {diff}}\) is the thresholded texture difference, \(\text {TL}_{\text {diff}}\) the thresholded luminance difference, \(\text {TC}_{\text {diff}}\) the thresholded chrominance difference, and \(\text {OR}_{\text {map}}\) the logical OR of the previous three results

The texture difference \(T_{\text {diff}}\) between two image blocks is calculated as the mean square difference of their autocorrelation functions \(R_i\) and \(R_j\):

$$\begin{aligned} T_{\text {diff}}=\frac{1}{(2M+1)(2N+1)} \sum _{u=0}^{2M} \sum _{v=0}^{2N}[R_{i}(u,v)-R_{j}(u,v)]^{2} \end{aligned}$$
(21)

The YCbCr color model is used to separate the luminance and chrominance components of the images in order to calculate \(L_{\text {diff}}\) and \(C_{\text {diff}}\). The luminance difference \(L_{\text {diff}}\) between the input frame \(f_{i}\) and the background reference frame \(f_{b}\) is computed according to the following equation,

$$\begin{aligned} L_{\text {diff}}=\left\{ \begin{array}{ll} Y_{b}(x,y)-Y_{i}(x,y) &{}\quad \text {if } Y_{b}(x,y)-Y_{i}(x,y)> 0 \\ 0 &{}\quad \text {otherwise} \end{array}\right. \end{aligned}$$
(22)

The chrominance difference \(C_{\text {diff}}\) between the input frame \(f_{i}\) and the background reference frame \(f_{b}\) is computed using both the Cb and Cr components according to the following equation,

$$\begin{aligned} C_{\text {diff}}=[\text {Cb}_{i}(x,y)-\text {Cb}_{b}(x,y)]^{2}+[\text {Cr}_{i}(x,y)-\text {Cr}_{b}(x,y)]^{2} \end{aligned}$$
(23)
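The luminance and chrominance differences of Eqs. (22) and (23) and the OR-map construction can be sketched as follows. OpenCV's YCrCb conversion is used, Otsu's method stands in for the isodata thresholding of the original description, and the texture-difference mask is assumed to be precomputed from Eqs. (20) and (21).

```python
import cv2
import numpy as np

def or_map(frame_bgr, background_bgr, t_diff_mask):
    """Logical OR of the thresholded luminance (Eq. 22) and chrominance (Eq. 23)
    differences with a precomputed texture-difference mask."""
    f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    b = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    y_f, cr_f, cb_f = f[..., 0], f[..., 1], f[..., 2]
    y_b, cr_b, cb_b = b[..., 0], b[..., 1], b[..., 2]

    l_diff = np.clip(y_b - y_f, 0, None)               # Eq. (22)
    c_diff = (cb_f - cb_b) ** 2 + (cr_f - cr_b) ** 2    # Eq. (23)

    def threshold(diff):                                # Otsu stand-in for isodata thresholding
        d8 = cv2.normalize(diff, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        _, mask = cv2.threshold(d8, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return mask.astype(bool)

    return threshold(l_diff) | threshold(c_diff) | t_diff_mask.astype(bool)
```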

6.4.1 Experimental results

The complete flowchart of this approach is given in Fig. 16, and Fig. 17 illustrates the algorithm on an example. The approach is tested on different real-life samples extracted from various videos. Figure 18a-d shows the background reference frames used, whereas Fig. 18e-h shows the frames containing moving objects. The results of silhouette extraction using the automatic shadow removal method are depicted in Fig. 19. The first row, Fig. 19a-d, shows the \(\text {TT}_{\text {diff}}\), \(\text {TL}_{\text {diff}}\), \(\text {TC}_{\text {diff}}\) and \(\text {OR}_{\text {map}}\) outputs for the first image frame in Fig. 18e; similarly, Fig. 19e-h shows the outputs for the frame in Fig. 18f, and so on. The generated outputs show that the method handles shadows well and that no separate shadow removal is needed as a preprocessing step of silhouette extraction.

7 Conclusion

In this paper, we present a MANOVA-based foreground object extraction method with multiple thresholds. The threshold values are learned through experiments; this process can be further improved using adaptive thresholds. Cast shadows and false holes are the areas that need extra effort to remove. The method works with static or dynamic backgrounds and with varying degrees of illumination, and the incurred error rate remains low because the false positive count stays almost constant and does not vary over a large range compared with the true positive count.

Several shadow detection and removal methods have also been discussed in detail, and an automatic shadow removal method using texture, luminance and chrominance has been explained with results on different image frames. Experimental results show that silhouette extraction using the statistical parametric, edge map and temporal approaches needs a pre-processing step of shadow removal, while the proposed silhouette extraction method removes shadow by applying texture, luminance and chrominance differences as an inherent step, so no pre- or post-processing is required. The noise in the silhouette extraction results can be removed by further filtering or by morphological operations such as erosion and pruning.