1 Introduction

Human tracking is one of the core problems in computer vision. The fundamental idea behind a tracking system is to estimate the target position across a sequence of frames, and real-time depth cameras have simplified this process to a great extent. The growing number of surveillance cameras and their continual improvement have made tracking systems increasingly popular and accessible. A number of discriminative methods [1,2,3] and generative methods [4,5,6] track the appearance model of the target and use it to estimate the target position in the next frame. Human tracking frameworks support applications such as human–computer interaction, security, telepresence, military, and health-care systems [7,8,9,10].

Various skin detection methods use machine learning techniques such as the AdaBoost algorithm and cascade classifiers trained with AdaBoost. These techniques discriminate between skin and non-skin pixels, making the skin detector robust and practicable [11, 12]. Ghaziasgar et al. [13] adopted a process for filling skin holes. For computer vision and medical image analysis, Criminisi and Shotton [14] presented a unified and efficient decision forest model used for scene recognition from photographs, object recognition in images, and automatic diagnosis from radiological scans with supervised or unsupervised machine learning techniques. Random forests are highly non-linear learners that are usually extremely fast during both learning and evaluation.

In this paper, the harmonious polling of patched correlation (HPPC) technique is presented as an improvement of the kernelized correlation filter (KCF)-based tracking algorithm. The proposed framework makes four significant contributions. The first is an improved patch-based tracking approach, in which 50 patches are extracted from the bounding box of every image. In the second, each patch is processed with a windowing mechanism and HOG feature extraction and is tracked using the kernelized correlation filter, which generates 50 correlation scores. The third is a polling mechanism, the next novel concept in the proposed framework: the 50 correlation scores from the KCF are provided as input, and a confidence map is drawn from all the correlation scores. The position of the maximum score in the confidence map is taken as the target.

The correlation filter is further trained by applying the same procedure to all the patches. The trained correlation filter is used continuously throughout the sequence to track the target. The fourth contribution is the evaluation of the proposed framework on the online object tracking benchmark OOTB-100 [15, 16] and the National Laboratory of Pattern Recognition NLPR_MCT [17] dataset, both of which are openly available. Moreover, the system is compared with existing correlation filter-based tracking algorithms.

The rest of the paper is organized as follows. Section 2 presents previous work related to the proposed HPPC tracking framework. Section 3 describes the harmonious polling of patch confidence tracker. Section 4 presents the quantitative and qualitative analysis of the HPPC tracker, which is compared with existing correlation filter-based tracking algorithms and applied to [15,16,17]. Section 5 concludes the paper.

2 Related work

Human tracking has long been a popular topic in computer vision. A large number of trackers have been proposed, and standardized quantitative and qualitative evaluation metrics are accelerating the pace of development in this field. The approaches that form the basis of existing trackers can be used for tracking under many unusual circumstances. Henriques et al. [1] introduced an improved MOSSE filter with a circulant structure and a kernel matrix, showing that correlation filters can be effectively kernelized to improve tracking performance. Bolme et al. [18] presented an adaptive training algorithm, the minimum output sum of squared error (MOSSE) filter, which is a robust and successful method for tracking. After the MOSSE filter, various advancements were made to improve the performance of tracking systems. Danelljan et al. [16] applied color attributes for a proper representation of the input data and improved the baseline intensity-based tracker by 24% in median distance precision. The correlation filter-based tracking algorithms (CFT) [19] built on the concept of correlation filters and presented various algorithms for tracking objects and pedestrians [1, 20,21,22,23].

Further, advanced CFTs increased the efficiency and robustness of correlation filter-based tracking frameworks [16, 24,25,26]. The context-aware correlation filter-based tracking system [27] was built by incorporating global context into the CFT. Trackers based on a learned, task-specific convolutional neural network (CNN) operate without pre-training, which averts the issues caused by offline training [28], where the CNN is treated as a black box. The perceptual hash (pHash) algorithm [29] is a straightforward and fast technique for dynamically updating the observation model using image similarity. Chen et al. [30] worked on a face tracking algorithm with an online feature selection mechanism, which chooses the most discriminative feature during tracking with unconstrained correlation filters. Later, Yang et al. [31] introduced an online feature selection mechanism that chooses the most discriminative feature during tracking to make the tracker more accurate. The Boolean map representation for visual tracking is a simple and effective representation that exploits connectivity cues [32]. More recently, in 2018, Yang et al. [33, 34] presented the spatiotemporal nonlocally regularized correlation filter and the parallel attentive correlation filter for tracking.

Viola and Jones [35] described a machine learning approach to visual object detection that processes images rapidly and achieves high detection rates. The work introduced the integral image representation, an AdaBoost-based learning algorithm, and a cascade of classifiers. Further, Pooya and Yazdi used a training set selection method based on histograms generated from AdaBoost to select the features [36]. A variant of the Viola and Jones face detection method that uses a simple procedure to select a few features in the early cascade stages is proposed in [37]. Moreover, a cascaded classifier using the AdaBoost algorithm is trained in [38] with two edge detectors.

Several classification and regression methods are used for analyzing different types of data [39, 40]. These methods classify subjects using the technique of “classification trees” or “recursive partitioning”, as defined by Breiman et al. [39].

Feature descriptor algorithms such as SIFT, HAAR, and HOG take an image and output features that encode the information as a series of numbers and differentiate one feature from another [41, 42]. In spite of the significant progress in this area, tracking systems still face many challenging situations such as occlusion, complex motion, fast motion, illumination variation, deformation, image blur, background clutter, scale variation, and rotation, which degrade the overall performance of the framework [43, 44].

3 Harmonious polling of patch confidence tracker

In the proposed framework, shown in Fig. 1, the main idea is to process each patch from the bounding box separately. In the processing stage, the boundaries of the patches extracted from the bounding box are smoothed by passing each patch through a cosine window. The HOG feature description algorithm simplifies the image by extracting the useful information about the patches, working on gradient and orientation information. Patches are tracked using KCF, which provides a correlation score. HOG features are extracted from the positive and negative training samples, which are then used to train the correlation filters for the next frames. The process is repeated 50 times, producing 50 correlation scores at the output of the correlation filters, which are passed to the polling mechanism. Polling is the next innovative mechanism used in the proposed system, where the correlation scores are used to draw the confidence map. The highest point of matching in the confidence map is taken as the position of the target. Patch tracking with a kernelized correlation filter followed by a polling mechanism effectively improves the accuracy and the overall performance of the system.

Fig. 1

The workflow of the harmonious polling of patched correlation tracker. Each patch is processed separately, and its correlation score is passed to the polling mechanism

In this paper, 50 patches are extracted from the bounding box, as shown in Fig. 2. Information about the target position, patch position, and target size for all the patches is stored in the context field.

Fig. 2

Patches are cropped from the bounding box. Each patch is treated separately for processing

3.1 Improved patch based tracking

The improved patch-based tracking algorithm used in the proposed framework treats each patch separately throughout the process. If the entire bounding box is considered at once for tracking, occlusion affects the whole bounding box, which degrades the correlation score. In the patch-based tracking approach, when occlusion is detected on a patch, it affects only that patch and not the complete image. The patch-based approach therefore significantly reduces the effect of occlusion. In the proposed framework, patches are extracted from the bounding box, and each patch is tracked separately. Information about each patch is stored in the context field. That is, 50 patches are sampled from one weak patch, as shown in Fig. 2, and each small patch is treated separately for processing. Cropping a patch from an image introduces an abrupt change at the patch boundaries, and it is essential to nullify the effect of these abrupt changes to obtain smooth patch boundaries. In the HPPC tracker, a cosine window is applied to the patches to nullify this effect [19]. A feature description algorithm, the histogram of oriented gradients (HOG), is then applied to the patches to extract features that provide gradient and orientation information [41].
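
The windowing step can be sketched as follows. This is a minimal illustration, assuming a 2-D Hann window (a common choice of cosine window in correlation filter trackers) and a grayscale patch stored as a NumPy array. The function name `apply_cosine_window` and the 8 × 8 patch size are hypothetical and are not taken from the HPPC implementation.

```python
import numpy as np

def apply_cosine_window(patch):
    """Taper a 2-D patch toward zero at its borders with a Hann window."""
    h, w = patch.shape
    # The outer product of two 1-D Hann windows gives a separable 2-D window.
    window = np.outer(np.hanning(h), np.hanning(w))
    return patch * window

# Stand-in for a cropped image patch (constant intensity, assumed 8 x 8).
patch = np.ones((8, 8))
smoothed = apply_cosine_window(patch)
```

After windowing, the border pixels are zero and the centre is almost unchanged, so the discontinuity introduced by cropping no longer dominates the response of the correlation filter.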

3.1.1 Kernelized correlation filter

Kernelized correlation filters (KCF) produce a response peak at the target for each patch cropped from the image. The KCF used in the proposed system for human tracking provides the correlation score for the polling mechanism. A preliminary version of this work was presented earlier [3]. The connection between ridge regression with cyclically shifted samples and classical correlation filters is explained in [1].

Ridge regression This method has a simple closed-form solution and is closely related to the support vector machine. The aim is to find a function \(f(z) = w^{T}z\) that minimizes the squared error over the training samples \(x_{i}\) and their regression targets \(y_{i}\), where \(\lambda\) is the regularization parameter.

$$\min_{w} \sum_{i} \left( f\left( x_{i} \right) - y_{i} \right)^{2} + \lambda \left\| w \right\|^{2}$$
(1)

Equation (1) measures the squared error between the output for each training sample and its regression target, plus a regularization term. This error should be as low as possible, which is the same objective that underlies the minimum output sum of squared error (MOSSE) filter [1]. Minimizing Eq. (1) is therefore the objective of the regression.
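
As an illustration of Eq. (1), the sketch below solves the ridge regression problem with the standard closed-form solution \(w = (X^{T}X + \lambda I)^{-1}X^{T}y\), where the rows of X are the training samples. The data, the regularization value, and the function name `ridge_fit` are assumptions chosen for the example and do not come from the HPPC tracker.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of sum_i (w^T x_i - y_i)^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solve the regularized normal equations instead of forming an inverse.
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 training samples x_i, 3 features each
y = X @ np.array([1.0, -2.0, 0.5])    # regression targets y_i (noise-free toy data)
lam = 1e-3                            # regularization parameter (assumed value)
w = ridge_fit(X, y, lam)
```

At the returned w, the gradient of Eq. (1) vanishes, which is what makes it the minimizer; a larger \(\lambda\) shrinks w toward zero.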

Circulant matrix To compute a regression with shifted samples, consider an \(n \times 1\) vector x representing a patch that contains the target, where x is referred to as the base sample. The goal is to train a classifier with positive, negative, and base samples. By the cyclic property [1], the shifted samples are obtained as \(P^{u}x\), where P is the permutation matrix that cyclically shifts x by one element, which gives the set

$$\left\{ {P^{u} x|u = 0, \ldots ,n - 1} \right\}$$
(2)
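
The set in Eq. (2) can be illustrated with a short sketch. It stacks all cyclic shifts of a 1-D base sample into a matrix and checks the property that makes correlation filters fast: multiplying by this matrix is equivalent to an element-wise product in the Fourier domain. The 1-D signal, its length, and the helper name `circulant_shifts` are assumptions made for the example.

```python
import numpy as np

def circulant_shifts(x):
    """Stack all cyclic shifts P^u x, u = 0..n-1, as the rows of a matrix."""
    return np.stack([np.roll(x, u) for u in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=8)   # base sample
v = rng.normal(size=8)   # any other real vector of the same length

C = circulant_shifts(x)
# Direct evaluation: one dot product per shift, O(n^2) operations.
direct = C @ v
# Same result through the FFT, O(n log n) operations.
via_fft = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(v)).real
```

Here `direct` and `via_fft` agree up to round-off, so the n shifted samples never need to be formed explicitly.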

Due to the cyclic property, the first half of the overall set is shifted in the positive direction and the second half in the negative direction. The full kernel correlation is given by the following equation, as explained in Henriques et al. [1],

$$k^{xx'} = \exp \left( - \frac{1}{\sigma^{2}} \left( \left\| x \right\|^{2} + \left\| x' \right\|^{2} - 2 F^{-1} \left( \hat{x}^{*} \odot \hat{x}' \right) \right) \right)$$
(3)
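
A minimal sketch of Eq. (3) for 1-D real signals is given below, together with a brute-force check that evaluates the Gaussian kernel on every cyclic shift explicitly. The signal length, the value of \(\sigma\), and the function name `gaussian_kernel_correlation` are assumptions for illustration; practical trackers apply the same idea to 2-D, multi-channel HOG features.

```python
import numpy as np

def gaussian_kernel_correlation(x, xp, sigma):
    """Eq. (3): Gaussian kernel between xp and every cyclic shift of x."""
    # Cross term <P^u x, xp> for all shifts u, computed in the Fourier domain.
    cross = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(xp)).real
    dist2 = np.dot(x, x) + np.dot(xp, xp) - 2.0 * cross
    dist2 = np.maximum(dist2, 0.0)   # guard against tiny negative round-off
    return np.exp(-dist2 / sigma**2)

rng = np.random.default_rng(1)
x = rng.normal(size=16)
xp = rng.normal(size=16)
sigma = 2.0
k = gaussian_kernel_correlation(x, xp, sigma)

# Brute-force reference: apply the Gaussian kernel to each shifted copy of x.
k_ref = np.array([
    np.exp(-np.sum((np.roll(x, u) - xp) ** 2) / sigma**2)
    for u in range(len(x))
])
```

The two results match, which shows that the three terms inside the exponent of Eq. (3) are simply the expanded squared distance between \(x'\) and each shifted copy of x.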

For the next frame, the target is detected using the trained parameters, and the training samples are maintained. For a new sample z, the confidence map is,

$$y = C\left( k^{xz} \right) a = F^{-1} \left( \hat{k}^{xz} \odot \hat{a} \right)$$
(4)

The position of the maximum value in y is therefore taken as the new target position. Equations (1) to (4) are from [1, 8].
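
The training and detection steps can be sketched end to end on a 1-D toy signal. The sketch assumes the Fourier-domain training rule \(\hat{a} = \hat{y} / (\hat{k}^{xx} + \lambda)\) from [1], which is not written out in this section, and a Gaussian regression target peaked at zero shift. All names and parameter values are hypothetical, and this is not the HPPC implementation.

```python
import numpy as np

def kernel_corr(a, b, sigma):
    """Gaussian kernel between b and all cyclic shifts of a (Eq. 3, 1-D)."""
    cross = np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real
    d2 = np.maximum(a @ a + b @ b - 2.0 * cross, 0.0)
    return np.exp(-d2 / sigma**2)

def train(x, y, sigma, lam):
    """Dual ridge-regression coefficients in the Fourier domain, as in [1]."""
    return np.fft.fft(y) / (np.fft.fft(kernel_corr(x, x, sigma)) + lam)

def detect(a_hat, x, z, sigma):
    """Eq. (4): response over all cyclic shifts and the index of its maximum."""
    response = np.fft.ifft(np.fft.fft(kernel_corr(x, z, sigma)) * a_hat).real
    return response, int(np.argmax(response))

n, shift = 64, 5
rng = np.random.default_rng(0)
x = rng.normal(size=n)                          # base sample (toy 1-D "patch")
d = np.minimum(np.arange(n), n - np.arange(n))  # cyclic distance to index 0
y = np.exp(-0.5 * (d / 2.0) ** 2)               # regression target, peak at shift 0
a_hat = train(x, y, sigma=4.0, lam=1e-4)

z = np.roll(x, shift)                           # next frame: content moved by 5 samples
response, peak = detect(a_hat, x, z, sigma=4.0)
```

In this toy setting the response peaks at index 5, the displacement applied to the signal. Index conventions for the shift direction differ between implementations, so the mapping from peak index to displacement should be checked against the convention in use.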

3.2 Polling mechanism

The most appealing part of the proposed system is the polling mechanism. It describes the possible position of each patch in every frame, which is expressed as a confidence score obtained from the KCF equation. Combining all the confidence scores robustly gives the maximum point of matching, which is the target position. The polling mechanism is an innovative idea that gives the position of the target and improves accuracy. Polling for tracking the target is based on the spatial and temporal evaluation of patches.

  1. KCF provides 50 correlation scores, one from each of the 50 patches.

  2. The length of trajectory is determined from the matched patch locations.

  3. From the length of trajectory and the correlation score, the poll (weight) w of each patch is calculated using Eq. (5).

  4. The final confidence map \(\psi_{t}\) is obtained by normalizing the polls.

  5. The position of the maximum in the confidence map is taken as the position of the target.

The polling score increases with the persistence of a patch across successive frames. Contexts are provided as input to the polling. During polling, the length of trajectory is calculated from the count of matched patch locations. For example, if the first four patch locations match and the fifth does not, the length of trajectory is 4.
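
The length-of-trajectory rule can be sketched as a count of consecutive matches. The sketch assumes that the match results for one patch are available as an ordered list of booleans, counted from the first entry; the function name `trajectory_length` is hypothetical.

```python
def trajectory_length(matches):
    """Count consecutive matched patch locations before the first mismatch."""
    length = 0
    for matched in matches:
        if not matched:
            break
        length += 1
    return length

# The example from the text: four matches followed by a mismatch.
n_i = trajectory_length([True, True, True, True, False])
```

For this input `n_i` is 4, in agreement with the example in the text.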

To calculate the weight of a patch, its correlation score is divided by the lobe width. Let \(y_{t}^{i}\) be the confidence map (response map) of the ith patch at frame t, and let \(l_{t}^{i}\) be the side lobe of the corresponding confidence map.

$$w_{t}^{i} = n_{t}^{i} \frac{{y_{t}^{i} }}{{l_{t}^{i} }}$$
(5)

Here \(n_{t}^{i}\) is the length of trajectory of the ith patch, that is, the number of consecutive frames in which the patch appears. The polling score, or weight, of the patch is given by Eq. (5). The final confidence map representing the target using a set of N patches is given by,

$$\psi_{t} = \sum\limits_{i = 1}^{N} {w_{t}^{i} }$$
(6)

The process is repeated for each patch. The polling scores of all the patches are combined and normalized. Finally, the combined map is plotted, and the position of its maximum value is taken as the detected target for the frame. The observed performance of the system indicates that the polling mechanism substantially improves the accuracy of the tracking system.
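
Equations (5) and (6) can be sketched as below. Several details are assumptions, since the text does not fix them: the response maps of all patches are taken to be already aligned on a common grid, the side lobe of each patch is supplied as a single positive number, and normalization divides the combined map by its sum. The function name `poll_confidence_map` and the toy values are hypothetical.

```python
import numpy as np

def poll_confidence_map(responses, side_lobes, traj_lengths):
    """Combine per-patch response maps into one normalized confidence map."""
    y = np.asarray(responses, dtype=float)      # shape (N, H, W)
    l = np.asarray(side_lobes, dtype=float)     # shape (N,), assumed positive
    n = np.asarray(traj_lengths, dtype=float)   # shape (N,)
    w = (n / l)[:, None, None] * y              # Eq. (5): poll of each patch
    psi = w.sum(axis=0)                         # Eq. (6): sum over the N patches
    total = psi.sum()
    return psi / total if total > 0 else psi    # normalization (assumed form)

# Two toy 3 x 3 response maps that vote for different positions.
y1 = np.zeros((3, 3))
y1[1, 1] = 1.0    # patch seen in 4 consecutive frames, votes for (1, 1)
y2 = np.zeros((3, 3))
y2[0, 2] = 1.0    # patch seen in 1 frame only, votes for (0, 2)

psi = poll_confidence_map([y1, y2], side_lobes=[0.5, 0.5], traj_lengths=[4, 1])
target = np.unravel_index(np.argmax(psi), psi.shape)
```

The long-lived patch outweighs the short-lived one, so the maximum of the combined map lies at (1, 1). Patches with a long trajectory and a low side lobe dominate the vote under this weighting.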

4 Quantitative and qualitative analysis

Extensive quantitative and qualitative evaluations are presented, showing the precision and success rate of the proposed system against correlation filter-based trackers with publicly available source code: KCF [1], STC [23, 25, 45], CN [16], and MUSTer [24]. The proposed framework is also compared with several state-of-the-art human tracking algorithms: the Particle Filter (PF), Kalman filter (KF), Camshift (CS) algorithm, and Mean shift (MS) algorithm [46, 47]. The system is applied to more than 100 test sequences from the online object tracking benchmark OOTB [15, 16] and the National Laboratory of Pattern Recognition NLPR_MCT dataset [17]. The evaluation protocols are the area under the precision curve (APC) and the area under the success curve (ASC).

4.1 Quantitative evaluation

The plots in Figs. 3, 4 and 5 show the APC and ASC for a few sequences from [15,16,17]. Figure 3 shows the Girl sequence, where the HPPC tracker successfully tracks the target even under scale variation, occlusion, in-plane rotation, and out-of-plane rotation. Figure 4 shows the Surfer sequence, which suffers from scale variation, fast motion, in-plane rotation, out-of-plane rotation, and low resolution. Figure 5 shows the APC and ASC plots for the NLPR_MCT dataset. The HPPC tracker successfully tracks the target in all these unfavorable conditions. The quantitative results in the APC and ASC plots show that the performance of the proposed system has improved.

Fig. 3

APC and ASC of the OOTB-100 dataset, girl [15, 16]

Fig. 4

APC and ASC of the OOTB-100 dataset, surfer [15, 16]

Fig. 5

APC and ASC of the NLPR dataset sequence 2 [17]

Tables 1 and 2 show quantitative evaluations on [15,16,17], comparing the HPPC tracker with correlation filter-based trackers that have publicly available source code. Tables 3 and 4 show quantitative evaluations on [15,16,17], comparing the HPPC tracker with state-of-the-art human tracking algorithms. These experimental results indicate that the proposed technique outperforms the compared state-of-the-art methods.

Table 1 OOTB benchmark for the CFT’s few currently available source codes
Table 2 NLPR benchmark for the CFT’s few currently available source codes
Table 3 OOTB benchmark for state-of-the-art human tracking algorithms
Table 4 NLPR benchmark for state-of-the-art human tracking algorithms

4.2 Qualitative evaluation

The tracker is applied to the online object tracking benchmark OOTB-100 [15, 16] and the National Laboratory of Pattern Recognition NLPR_MCT dataset [17]. For the evaluation, all the challenging attributes have been selected, such as image blur, occlusion, change in illumination, in-plane rotation, out-of-plane rotation, and deformation, which make the database extremely challenging.

Despite these critical scenarios, the proposed framework works appropriately even in crowded areas. In the sequences shown below, the top-performing trackers, which give their best performance in almost every sequence, i.e., KCF, CFT, CN, and MUSTer, are compared with the proposed HPPC tracker and the human tracking algorithms. The evaluation in Fig. 6 shows that in the Jump sequence the KCF tracker loses the target in many frames, because KCF is limited by a fixed window.

Fig. 6

Qualitative analysis for the HPPC tracker, compared with top-performing algorithms MUSTer, KCF, STC, CN using OOTB-100 benchmark [15, 16]

The same issue is observed in sequences such as Bolt-2, Walking, and Basketball. The proposed system shows a much better result in such cases.

In sequences with partial occlusion, such as Girl, KCF loses the target for some time, whereas the proposed algorithm keeps tracking from the first frame to the last. It works well even when the face rotates through 360°, and it continues tracking properly under partial occlusion.

On the sequences of the NLPR_MCT dataset, the HPPC tracker performs very well, as shown in Fig. 7. Figure 8 shows the comparison of the proposed system with correlation filter-based algorithms and a few state-of-the-art human tracking algorithms. It is observed that the HPPC tracker gives its best performance in almost every sequence.

Fig. 7

Qualitative analysis for the HPPC tracker, compared with top-performing algorithms MUSTer, KCF, STC, CN using the NLPR_MCT benchmark [17]

Fig. 8

Qualitative analysis for the proposed HPPC tracker, compared with state-of-the-art human tracking algorithms Particle Filter, Kalman Filter, Camshift, Mean shift using the NLPR_MCT benchmark [17]

5 Conclusion

The HPPC tracker includes an innovative technique in which the bounding box is divided into 50 patches, and each patch is tracked separately using the kernelized correlation filter. A novel polling methodology is used in the proposed framework, where the maximum score in the confidence map gives the position of the target. The results have been validated through qualitative and quantitative evaluations. The tracker is applied to the online object tracking benchmark OOTB-100 [15, 16] and the National Laboratory of Pattern Recognition NLPR benchmark [17]. The algorithm is also compared with the correlation filter-based tracking algorithms MUSTer, KCF, STC, and CN and with the human tracking algorithms Particle Filter, Kalman Filter, Camshift, and Mean shift. The experiments show that the HPPC tracker successfully tracks the human in challenging situations such as occlusion, background clutter, illumination variation, scale variation, fast motion, in-plane rotation, and out-of-plane rotation. The precision value is improved by 15%, and the success rate is improved by 19% as compared to the existing techniques.

One limitation of the HPPC tracker is that it takes more run time than the existing trackers, since a higher number of patches is used to improve accuracy. However, this cost is justified by the high values of APC and ASC. A further limitation of the framework is that the performance of the tracker degrades under long-term occlusion.

Human tracking therefore remains a challenging topic, and there is scope for further improvement.