1 Introduction

Effective recognition of objects for tracking in a video stream involves the integration of background modelling, shadow removal, analysis of objects segmented from the video frames, and reliable object detection. The detected objects are then recognized by extracting features following the machine-learning-inspired Bag of Words principle.

In this paper, we use a Visual Vocabulary Model based on Bag of Words to extract the necessary features of object instances through rigorous training. We then apply the extracted feature sets to the test domain to recognize and locate objects of interest in video scenes. Using visual instance occurrences and their probabilistic presence to imply a certain domain, we also obtain high accuracy in domain recognition.

The contributions of this paper are:

  • Background modelling and extraction of accurate shadow-free images using a color-invariant approach.

  • Extraction of features from the objects captured in the blobs using the Bag of Words principle.

  • Classification of the objects in a certain domain of interest using probabilistic word occurrence for domain recognition.

The paper is organized as follows: Sect. 2 briefly reviews related work; Sect. 3 describes the proposed detection and recognition method, with Sect. 3.3 presenting the Visual Vocabulary Model for object recognition. Experimental results on several datasets and a comparative analysis with state-of-the-art algorithms are presented in Sect. 4. Section 5 concludes the paper and discusses possibilities for further improvement.

2 Brief Review of Related Works

Numerous color-histogram-based object detection algorithms have been proposed in recent years. He et al. [4] developed a locality-sensitive histogram at each pixel for a finer distribution of visual feature points in object tracking. Haar-like features have been proposed for appearance-based object tracking [5,6,7, 9]. Spatiotemporal representation combined with a genetic algorithm has also been used for feature extraction [1]. More recently, pixel-based segmentation has been applied to tracking [2].

In recent years, the classifiers most extensively used for object tracking have been ranking SVM [7], semi-supervised boosting [14], support vector machines (SVM) [12], boosting [13], structured output SVM [8], and online multi-instance boosting [6]. Various detection and tracking codes are available for evaluation thanks to significant effort by their authors, e.g., MIL, IVT, TLD, FCT, and VTD.

3 Proposed Method

Initially, we model the objects segmented from the video frames and subtract the shadow-free background model to obtain each object's blob. Before recognizing the object inside a blob, we train a machine-learning-inspired Visual Vocabulary Model with a set of objects representing our domain of interest for recognition and tracking. We extract the features of the objects in both the training and test data using the Bag of Words principle, in the training and testing phases respectively.

3.1 Background Modeling

In [10], Li et al. proposed an approach for background modelling. In our work, we introduce some modifications over that approach and proceed as follows. At each time step, an image \( I_{m}^{t} \) is obtained by subtracting two successive video frames, and \( F_{m}^{t} \) is obtained by subtracting the current video frame from the background model. To deal with sudden illumination variation, an AND operation is performed over \( I_{m}^{t} \) and \( F_{m}^{t} \). The current frame \( I^{t} \) is compared with its previous frame \( I^{t-1} \) in order to obtain \( I_{m}^{t} \): the radiometric similarity \( R(x,y) \) between the corresponding pixel values \( I^{t}(x,y) \) and \( I^{t-1}(x,y) \) is measured, and the temporal binary image of the moving object \( I_{m} \) is obtained by thresholding this similarity value, formally expressed as:

$$ I_{m}(x,y) = \begin{cases} 1, & \text{if } R(x,y) > T_{b}, \\ 0, & \text{otherwise}. \end{cases} $$
(1)

Similarly, \( F_{m}^{t} \) is obtained by thresholding the difference between the current frame and the background frame with the threshold \( T_{b} \), formally:

$$ F_{m}^{t}(x,y) = \begin{cases} 1, & \text{if } \left| I^{t}(x,y) - B^{t}(x,y) \right| > T_{b}, \\ 0, & \text{otherwise}. \end{cases} $$
(2)

The pixels \( (x,y) \) of moving objects are obtained by an AND operation on \( I_{m}(x,y) \) and \( F_{m}^{t}(x,y) \):

$$ M^{t}(x,y) = \begin{cases} 1, & \text{if } I_{m}(x,y) \cap F_{m}^{t}(x,y) = 1, \\ 0, & \text{otherwise}. \end{cases} $$
(3)

The moving pixels in video frames are identified by \( M^{t} \left( {x,y} \right) \).
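
The following minimal NumPy/OpenCV sketch illustrates how Eqs. (1)–(3) can be combined in practice. It assumes grayscale uint8 frames and an available background model, approximates the radiometric similarity test of Eq. (1) by a simple absolute difference, and uses an illustrative threshold value.

```python
import cv2
import numpy as np

def moving_pixel_mask(frame_prev, frame_curr, background, T_b=25):
    """Approximate Eqs. (1)-(3): frame differencing AND background subtraction."""
    # Temporal binary image I_m (Eq. 1), using |I^t - I^{t-1}| in place of R(x, y)
    I_m = (cv2.absdiff(frame_curr, frame_prev) > T_b).astype(np.uint8)
    # Background difference image F_m^t (Eq. 2)
    F_m = (cv2.absdiff(frame_curr, background) > T_b).astype(np.uint8)
    # Moving pixels M^t are those flagged by both tests (Eq. 3)
    return cv2.bitwise_and(I_m, F_m)
```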

In our implementation, a history vector V, cumulatively updated with the last six values, is maintained as:

$$ V = \left[ E(t), E(t-1), E(t-2), E(t-3), E(t-4), E(t-5) \right]. $$
(4)

Here E(t) is the mean pixel intensity of the frame at time t. For each frame, we calculate a suitable learning rate \( \alpha \) based on this vector:

$$ \alpha = a + b \, \frac{\left| E(t) - E(t-5) \right|}{\max\left( E(t), E(t-5) \right)}. $$
(5)
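
A small sketch of how the history vector V and the learning rate of Eqs. (4)–(5) could be maintained per frame; the constants a and b are not specified in the text, so the values below are placeholders.

```python
from collections import deque
import numpy as np

class LearningRate:
    """Maintain the history vector V of Eq. (4) and compute alpha via Eq. (5)."""
    def __init__(self, a=0.01, b=0.05):        # a, b: illustrative values only
        self.a, self.b = a, b
        self.history = deque(maxlen=6)          # [E(t), E(t-1), ..., E(t-5)]

    def update(self, frame):
        self.history.appendleft(float(np.mean(frame)))   # E(t): mean frame intensity
        if len(self.history) < 6:
            return self.a                        # not enough history yet
        e_t, e_t5 = self.history[0], self.history[-1]
        return self.a + self.b * abs(e_t - e_t5) / max(e_t, e_t5, 1e-6)
```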

Let d be a pixel of the image and h(d) its value in the gray histogram, and let \( I_{B} \) and \( I_{F} \) denote the background and foreground pixels, respectively. The probabilities of a background pixel being misidentified as foreground, and vice versa, are:

$$ P_{F|B} = \sum\nolimits_{d \in I_{F}} p(d \mid B) \quad \text{and} \quad P_{B|F} = \sum\nolimits_{d \in I_{B}} p(d \mid F), $$
(6)

where \( p(d \mid B) \) is the probability of d under the background model and \( p(d \mid F) \) is its probability under the foreground model.

Our goal is to make \( P_{F|B} \) and \( P_{B|F} \) as small as possible.

Minimizing \( P_{F|B} \) is the more significant of the two, since \( P_{B|F} \) is further reduced by the morphological operations performed in post-processing.

The prior probability of the background, \( p(B) \), is calculated from the gray histogram of the image \( I_{m}^{t} \):

$$ p(B) = \sum\nolimits_{d = -T}^{T} h(d), \quad \mu = 0. $$
(7)
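
As an illustration of Eq. (7), the sketch below estimates p(B) from the gray histogram of a signed difference image, assuming the background differences are centred at zero (μ = 0); the window half-width T is an illustrative parameter.

```python
import numpy as np

def background_prior(diff_image, T=10):
    """Estimate p(B) per Eq. (7): mass of the zero-centred gray histogram
    of the signed difference image that falls inside [-T, T]."""
    edges = np.arange(-255.5, 256.5)                  # one bin per signed gray level
    h, _ = np.histogram(diff_image.ravel(), bins=edges, density=True)
    centres = np.arange(-255, 256)
    return float(h[np.abs(centres) <= T].sum())       # h sums to 1 with unit-width bins
```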

3.2 Shadow Removal

As described by Xu et al. in [11], a shadow-free color-invariant image can be constructed by normalizing the pixels in the r, g, b color space:

$$ r' = \frac{r}{\sqrt{r^{2} + g^{2} + b^{2}}}, \quad g' = \frac{g}{\sqrt{r^{2} + g^{2} + b^{2}}}, \quad b' = \frac{b}{\sqrt{r^{2} + g^{2} + b^{2}}}, $$
(8)

where r, g, b are the input image color channels and r′, g′, b′ are the corresponding normalized, color-invariant channels.
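
A minimal sketch of the normalization in Eq. (8), assuming a floating-point conversion of the input image; the small epsilon guards against division by zero on black pixels.

```python
import numpy as np

def color_invariant(image_rgb):
    """Normalize each pixel onto the unit r, g, b sphere (Eq. 8)."""
    img = image_rgb.astype(np.float32)
    norm = np.sqrt(np.sum(img ** 2, axis=2, keepdims=True))
    return img / np.maximum(norm, 1e-6)   # r', g', b' channels
```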

A Gaussian smoothing filter is applied to suppress high-frequency textures in both the invariant and the original images, and edge maps are then computed, formally:

$$ E_{ori} = \left\| \mathrm{edge}\left( I_{ori} \right) \right\|, \quad E_{inv(i)} = \left\| \mathrm{edge}\left( I_{inv(i)} \right) \right\|, $$
(9)

where \( E_{ori} \) is the edge map of the original image \( I_{ori} \) after smoothing, and \( E_{inv(i)} \) is the edge map of the i-th color-invariant channel \( I_{inv(i)} \) after smoothing. The hard shadow edge mask is constructed by choosing the strong edges of the original image that are absent in the invariant image. Thus, we get:

$$ HS(x,y) = \begin{cases} 1, & \text{if } E_{ori}(x,y) > t_{1} \ \text{and}\ \min\nolimits_{i} E_{inv(i)}(x,y) < t_{2}, \\ 0, & \text{otherwise}, \end{cases} $$
(10)

where \( t_{1} \) and \( t_{2} \) are thresholds set manually from an empirical analysis of the datasets, and \( HS(x,y) \) is the estimated hard shadow edge mask. In (10), \( t_{1} \) selects the strong edges of the original image that correspond to hard shadows, while \( t_{2} \) retains only the edges that are absent from the invariant channels, i.e., edges belonging to shadows, as shown in Fig. 1.
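
The sketch below shows one possible realization of Eqs. (9)–(10), using Gaussian smoothing followed by a Sobel gradient magnitude as the edge operator; the operator choice and the threshold values t1 and t2 are assumptions for illustration.

```python
import cv2
import numpy as np

def hard_shadow_mask(original_gray, invariant_channels, t1=60.0, t2=15.0):
    """Eq. (10): edges strong in the original image but weak in every
    color-invariant channel are taken as hard shadow edges."""
    def edge_mag(img):
        # Gaussian smoothing, then Sobel gradient magnitude (Eq. 9)
        img = cv2.GaussianBlur(img.astype(np.float32), (5, 5), 1.0)
        gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
        return np.sqrt(gx ** 2 + gy ** 2)

    E_ori = edge_mag(original_gray)
    E_inv_min = np.min([edge_mag(ch) for ch in invariant_channels], axis=0)
    return ((E_ori > t1) & (E_inv_min < t2)).astype(np.uint8)
```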

Fig. 1. (a) Video Frame, (b) Segmented Object Model, (c) Foreground Model.

3.3 Visual Vocabulary Model for Object Recognition

The Visual Vocabulary Model is a machine-learning-based image classification model that treats images as documents: specific features are labelled as visual words, and an image is characterized by the presence of such feature key words.

First, we localize the key words by extracting features of the object of interest that are distinct and invariant under different scales and illumination conditions, even in the presence of noise. We use a nonlinear (cubic) Support Vector Machine (SVM) as the feature classifier. The polynomial kernel of the cubic SVM is:

$$ K(x,y) = \left( x^{T} y + c \right)^{3}. $$
(11)

Here x and y are input feature vectors computed from the training samples. The free parameter c ≥ 0 controls how far the kernel is from being homogeneous.
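
In practice, a cubic polynomial kernel of this form can be instantiated with a standard SVM library; the scikit-learn call below is a sketch in which coef0 plays the role of c, and the Bag-of-Words histograms and labels are placeholders.

```python
from sklearn.svm import SVC

# Cubic polynomial kernel of Eq. (11): K(x, y) = (x^T y + c)^3
clf = SVC(kernel="poly", degree=3, coef0=1.0, gamma=1.0)   # gamma=1 keeps x^T y unscaled
# clf.fit(bow_histograms, labels)        # training descriptors and class labels
# predictions = clf.predict(test_bow)    # classify test-image descriptors
```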

The following equation expresses the contribution of a feature f observed at location l to an object of class \( o_{n} \) at position x, where the matched visual keyword index \( C_{i} \) indicates how likely the feature is to belong to class \( o_{n} \). Thus, we get:

$$ p\left( o_{n}, x \mid f, l \right) = \sum\nolimits_{i} p\left( o_{n}, x \mid C_{i}, l \right) \, p\left( C_{i} \mid f \right). $$
(12)

Mean-shift mode estimation with a scale-adaptive kernel K is used to obtain the maxima in this voting space:

$$ \hat{p}\left( o_{n}, x \right) = \frac{1}{V_{b}\left( x_{s} \right)} \sum\nolimits_{k} \sum\nolimits_{j} p\left( o_{n}, x_{j} \mid f_{k}, l_{k} \right) K\!\left( \frac{x - x_{j}}{b\left( x_{s} \right)} \right). $$
(13)

Here b denotes the kernel bandwidth and \( V_{b} \) the kernel volume, both of which vary with the kernel radius. The size and scale coordinate \( x_{s} \) of the hypothesized object of interest is updated in parallel. This strategy makes it easier to deal with partial occlusions and typically requires fewer training examples.
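
A simplified sketch of the mode seeking in Eq. (13): each feature casts a weighted vote for an object position, and mean-shift with a Gaussian kernel climbs to the nearest mode. A fixed bandwidth is used here instead of the scale-adaptive b(x_s); all names and values are illustrative.

```python
import numpy as np

def mean_shift_mode(votes, weights, x_init, bandwidth=16.0, iters=20):
    """Mean-shift over the voting space of Eq. (13).
    votes   : (N, 2) positions x_j cast by the features
    weights : (N,) probabilities p(o_n, x_j | f_k, l_k)"""
    x = np.asarray(x_init, dtype=np.float64)
    for _ in range(iters):
        d2 = np.sum((votes - x) ** 2, axis=1)
        k = weights * np.exp(-0.5 * d2 / bandwidth ** 2)   # kernel responses
        if k.sum() < 1e-12:
            break
        x = (k[:, None] * votes).sum(axis=0) / k.sum()     # weighted mean update
    return x
```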

The pictorial structure model represents an object of interest as a collection of parts connected in pairs, defined by a graph G = (V, E), where the nodes \( V = \{ v_{1}, \ldots, v_{n} \} \) define the parts and the edges \( (v_{i}, v_{j}) \in E \) describe the corresponding connections.

Let \( L = \{ l_{1}, \ldots, l_{n} \} \) be a particular arrangement of part locations in the frame. Matching the model to a video frame is then formulated as an energy minimization:

$$ l_{1}^{*} = \arg\min\nolimits_{l_{1}} \left( m_{1}\left( l_{1} \right) + \sum\nolimits_{i = 2}^{n} \min\nolimits_{l_{i}} \left( m_{i}\left( l_{i} \right) + \left\| l_{i} - T_{1i}\left( l_{1} \right) \right\|_{M_{ij}}^{2} \right) \right), $$
(14)

where \( m_{i}(l_{i}) \) is the appearance matching cost of part i at location \( l_{i} \), and \( M_{ij} \) is the diagonal covariance between the transformed locations \( T_{ij}(l_{i}) \) and \( T_{ji}(l_{j}) \).
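
Equation (14) can be evaluated exactly for a star-shaped model by brute force over candidate part locations, as in the sketch below; the distance-transform speed-up of the standard pictorial structure formulation is omitted, and the data layout is an assumption for illustration.

```python
import numpy as np

def match_star_model(root_candidates, part_candidates, transforms, M):
    """Brute-force minimization of Eq. (14) for a star-shaped model.
    root_candidates : list of (l1, m1) pairs for the root part
    part_candidates : part_candidates[i] is a list of (li, mi) pairs for part i
    transforms      : transforms[i](l1) -> expected location of part i given l1
    M               : M[i] is a 2x2 (diagonal) matrix defining the deformation metric"""
    best_l1, best_energy = None, np.inf
    for l1, m1 in root_candidates:
        energy = m1
        for i, candidates in enumerate(part_candidates):
            expected = np.asarray(transforms[i](l1), dtype=float)
            # min over l_i of appearance cost plus squared Mahalanobis deformation
            energy += min(
                mi + float((np.asarray(li) - expected) @ M[i] @ (np.asarray(li) - expected))
                for li, mi in candidates
            )
        if energy < best_energy:
            best_l1, best_energy = l1, energy
    return best_l1, best_energy
```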

To further improve the validation score by approximating the similarity measures, we use a discriminatively trained linear-time matching function, the Pyramid Match Kernel (PMK), to compare feature sets of variable cardinality. Let the input be a feature set X ∈ S with histogram pyramid \( \Psi(X) = [H_{0}(X), \ldots, H_{L-1}(X)] \), where L is the number of pyramid levels and \( H_{i}(X) \) is the histogram of X at level i.

The similarity between two input feature sets Y and Z is expressed as:

$$ \kappa_{PMK}\left( \Psi(Y), \Psi(Z) \right) = \sum\nolimits_{i = 0}^{L - 1} \omega_{i} \left( I\left( H_{i}(Y), H_{i}(Z) \right) - I\left( H_{i - 1}(Y), H_{i - 1}(Z) \right) \right), $$
(15)

where \( I\left( H_{i}(Y), H_{i}(Z) \right) \) is the histogram intersection of the two input feature sets Y and Z at the i-th level of the pyramid, and \( \omega_{i} \) is the weight assigned to that level.
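
A compact sketch of Eq. (15), assuming the pyramids are ordered from finest (i = 0) to coarsest level and using the common choice ω_i = 1/2^i to down-weight matches found only at coarse resolutions.

```python
import numpy as np

def pyramid_match_kernel(pyr_Y, pyr_Z):
    """Eq. (15): weighted sum of the new histogram intersections per level."""
    score, prev = 0.0, 0.0
    for i, (hy, hz) in enumerate(zip(pyr_Y, pyr_Z)):
        current = float(np.minimum(hy, hz).sum())   # histogram intersection I(.,.)
        score += (1.0 / 2 ** i) * (current - prev)  # count only newly matched mass
        prev = current
    return score
```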

Finally, the features of the recognized objects are tracked using a classical Kalman filter, which can also efficiently handle tracking under partial occlusions, as shown in Fig. 2. The performance of the proposed algorithm is evaluated on available benchmark datasets, and we obtain competitive results.
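
The tracking step can be realized with OpenCV's Kalman filter; the constant-velocity state model and the noise covariances below are illustrative choices, not values prescribed by the paper.

```python
import cv2
import numpy as np

def make_kalman(x0, y0):
    """Constant-velocity Kalman filter over (x, y, vx, vy) for the object centroid."""
    kf = cv2.KalmanFilter(4, 2)                        # 4 state variables, 2 measurements
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = 1e-3 * np.eye(4, dtype=np.float32)
    kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)
    kf.statePost = np.array([[x0], [y0], [0], [0]], np.float32)
    return kf

# Per frame: prediction = kf.predict(); during partial occlusion the prediction is kept,
# otherwise kf.correct(np.array([[x], [y]], np.float32)) refines it with the detection.
```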

Fig. 2. Tracking results on the INRIA dataset.

4 Experimental Results and Analysis

We test our algorithm on various benchmark datasets [3] with the aforementioned settings. Using the trained model as a reference for recognizing newly arrived objects, we compare our algorithm with other state-of-the-art algorithms on additional datasets to validate our experiments. Tracking results of our algorithm on the INRIA Person dataset and on other datasets, over multiple frames and under various challenges, are shown in Figs. 2 and 3, respectively.

Fig. 3. Sample tracking results of the eight top-performing trackers on challenging sequences. (a) Result samples on BlurBody, Boy and Crossing sequences. Challenging factors: background clutter and deformation. (b) Result samples on David, David2 and Dog1 sequences. Challenging factors: scale variation, motion blur, and occlusion. (c) Result samples on Dudek, FaceOcc1 and Human9 sequences. Challenging factors: deformation and occlusion. (d) Result samples on Jogging, Mhyang and Walking2 sequences. Challenging factors: fast motion, scale variation, and occlusion.

Table 1 reports the average overlap rate of the tracking methods, which indicates the stability of each algorithm by taking the pose and size of the target object into consideration. Our algorithm achieves competitive, satisfactory results compared with the other state-of-the-art tracking algorithms [3]. Figure 4 presents a comparative analysis of the overlap rate over the video frames against the other state-of-the-art methods, again showing competitive and satisfactory outcomes.

Table 1. Average overlap rate of tracking methods. The red, yellow and orange colors indicate the results ranked first, second and third, respectively.
Fig. 4. Comparative analysis of overlap rate against the state-of-the-art methods on various benchmark datasets and challenges.

5 Conclusion

This paper presents object detection and recognition of the detected objects based on a Visual Vocabulary Model. We train on different objects separately across several images with multiple aspects and camera viewpoints to find the best key word points for recognition. We then verify the extracted features of the training images after classifying the feature sets, and these key word points are applied to image regions through visual feature point analysis. The performance of the proposed algorithm is measured against available benchmark data, and we obtain very satisfactory and competitive results. The approach has great potential for problem solving that integrates vision and pattern recognition with increased robustness and variability, with exciting opportunities to explore in the near future.