1 Introduction

Target recognition and tracking is applied in many fields, such as motion analysis [1] and behavior recognition [2]. However, occlusion, similar backgrounds, lighting changes, and surface appearance variations pose great challenges for target recognition and tracking, causing the target to drift or the tracking to fail entirely [3]. Appearance-model-based tracking algorithms [4, 5] represent targets with the scale-invariant feature transform or histograms of oriented gradients, but these features cannot capture the essential characteristics of targets, and mismatches often appear during tracking. Moreover, complex appearance models lead to very high computational cost.

The combination of appearance models and traditional machine learning techniques casts target tracking as a binary classification problem [6, 7]; this approach can exploit background information effectively and thus improve tracking effectiveness. However, when there is not enough training data for the classification model, its ability to recognize the target is low and misclassification often occurs. Deep learning is an active research topic in image and visual processing: by constructing a deep non-linear network model [8, 9], the essential features of images can be learned, which improves classification accuracy.

The flock-of-trackers approach [10] combines local trackers with a global motion model and can handle occlusion and local changes of non-rigid targets. The cell flock of trackers [11] tracks targets with the selected optimal local tracker, which alleviates target drift and makes tracking more robust.

Multiple-instance learning was first proposed by Dietterich et al. [12] and is regarded as a fourth machine learning paradigm besides supervised learning, unsupervised learning, and reinforcement learning. Zhang et al. [13] propose embedding multiple-instance learning into the AnyBoost framework and construct the MILBoost classifier for target detection. Babenko et al. [14] use multiple-instance learning for target tracking and obtain good tracking results, which has made multiple-instance learning a popular research direction in target tracking. Zeisl et al. [15] apply semi-supervised multiple-instance learning to target tracking, in which the target and background of the first frame are assumed to be tagged samples and the targets of subsequent frames are assumed to be untagged samples. Once the first frame is processed, the tagged sample and the correctly tracked untagged samples serve as priors for the following frame, which improves the stability of target tracking [16]. In addition, Babenko et al. [17] analyze visual tracking with online multiple-instance learning, but they aim to track a predefined target, whereas our method can recognize any target from its background.

However, the original multiple-instance learning approach suffers from low classification effectiveness and poor real-time performance. To address these weaknesses, we propose a new weak classifier that assigns different weights to different positive samples and different weights to different weak classifiers. In addition, we propose a strong classifier to improve the accuracy and real-time performance of target tracking.

The rest of the paper is organized as follows. In Section 2, we present our proposed target tracking algorithm based on multiple-instance learning. Experiments and conclusions are given in Sections 3 and 4, respectively.

2 Multiple-instance learning target tracking algorithm

The flowchart of the tracking system is shown in Fig. 1: we use all previous frames as training data to train a classifier and use this classifier to classify the (t + 1)-th frame; once the (t + 1)-th frame is classified, we add it to the training data for future prediction. The classifier thus evolves over time.

Fig. 1 The flowchart of the tracking system

2.1 Selection of positive and negative samples

During traditional target tracking, the target is usually a single candidate object. When the target changes a lot or is occluded, the tracking frame shifts easily. To overcome the limitation of a single candidate target, we consider multiple candidate targets. Here, we regard the target as a positive sample and the background as negative samples. The samples, including both positive and negative ones, are denoted as X. Let the location of a sample be l_t at time t; then the category of the sample is y ∈ {0, 1}, where y = 1 if X is the target and y = 0 if X is the background. Let the location of the target be \( {l}_{t-1}^{*} \) at time t − 1; then the sample set waiting for classification at time t is

$$ {X}^s=\left\{X\Big|\left\Vert l(X)-{l}_{t-1}^{*}\right\Vert <s\right\}, $$
(1)

where l(X) is the location of sample X and s is the search radius.

In order to acquire the location \( {l}_t^{*} \) of the target at time t, we compute the probability p(y = 1 | X) that each sample X is a positive sample. Assuming that the target appears with uniform probability inside a circular region of radius s, we have

$$ p\left({l}_t^{*}\Big|{l}_{t-1}^{*}\right)=\left\{\begin{array}{ll}1 & \left\Vert l(X)-{l}_{t-1}^{*}\right\Vert <s\\ 0 & \mathrm{otherwise}\end{array}\right.. $$
(2)

Then, the new location of the target is

$$ {l}_t^{*}=l\left(\underset{X\in {X}^s}{ \arg \max }p\left(y=1\Big|X\right)\right). $$
(3)

Once the new location has been computed, we select new positive and negative samples to update the classifier. The positive sample set X^+ contains N samples drawn from a circle of radius α centered at \( {l}_t^{*} \), that is

$$ {X}^{+}=\left\{{X}_{1i}\Big|\left\Vert l(X)-{l}_t^{*}\right\Vert <\alpha \right\}. $$
(4)

The negative sample set X^− contains L samples drawn from an annulus centered at \( {l}_t^{*} \) with inner radius β and outer radius γ, that is

$$ {X}^{-}=\left\{{X}_{0i}\Big|\beta <\left\Vert l(X)-{l}_t^{*}\right\Vert <\gamma \right\}. $$
(5)
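To make the sampling step concrete, the following sketch (in Python with NumPy; all function and variable names are our own illustrative choices, not from the original implementation) gathers the candidate set X^s of Eq. 1, localizes the target via Eq. 3, and draws the positive and negative sample sets of Eqs. 4 and 5 around the new location. Here `prob_positive` stands for the classifier response p(y = 1 | X) trained in Section 2.2.

```python
import numpy as np

def locations_in_ring(center, inner, outer, height, width):
    """Return all (row, col) pixel locations whose distance to `center`
    lies in the half-open interval [inner, outer)."""
    rows, cols = np.mgrid[0:height, 0:width]
    dist = np.hypot(rows - center[0], cols - center[1])
    return np.argwhere((dist >= inner) & (dist < outer))

def candidate_set(prev_loc, s, height, width):
    """X^s of Eq. (1): all locations within radius s of the previous target location."""
    return locations_in_ring(prev_loc, 0.0, s, height, width)

def localize(candidates, prob_positive):
    """Eq. (3): choose the candidate with the largest p(y = 1 | X)."""
    scores = np.array([prob_positive(loc) for loc in candidates])
    return candidates[int(np.argmax(scores))]

def draw_samples(new_loc, alpha, beta, gamma, height, width):
    """Eqs. (4)-(5): positives inside radius alpha, negatives in the ring
    between beta and gamma, both centred at the new target location."""
    positives = locations_in_ring(new_loc, 0.0, alpha, height, width)
    negatives = locations_in_ring(new_loc, beta, gamma, height, width)
    return positives, negatives
```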

2.2 Training a classifier

To train the classifier, we use the selected positive and negative sample sets X^+ and X^−. The probability that a sample is a positive sample is then as follows [14]:

$$ p\left(y=1\Big|X\right)=\frac{e^{H(X)}}{e^{H(X)}+{e}^{-H(X)}}=0.5 \tanh \left(H(X)\right)+0.5, $$
(6)

where \( \tanh (z)=\frac{e^{z}-{e}^{-z}}{e^{z}+{e}^{-z}} \) and H(X) is a strong classifier of the samples that consists of K weak classifiers.

The definition of H(X) is in the following equation:

$$ H(X)={\displaystyle \sum_{k=1}^K{\lambda}_k{h}_k(X)}, $$
(7)

where h_k(X) is the kth weak classifier and λ_k is its weight. The weak classifiers are selected according to their classification ability: a weak classifier that classifies well receives a large weight, and one that classifies poorly receives a small weight. Let \( {\lambda}_k={e}^{\frac{1-k}{K}} \); the weak classifiers are then selected from a pool Φ = {h_1, …, h_M} with M > K. The pool is generated as follows: let \( {h}_k= \log \left(\frac{p\left(y=1\Big|{f}_k(X)\right)}{p\left(y=0\Big|{f}_k(X)\right)}\right) \), where f_k(X) is a Haar-like feature [18]. Assuming p(y = 0) = p(y = 1), Bayes' rule gives \( {h}_k= \log \left(\frac{p\left({f}_k(X)\Big|y=1\right)}{p\left({f}_k(X)\Big|y=0\right)}\right) \), where p(f_k(X)|y = 1) and p(f_k(X)|y = 0) follow Gaussian distributions [19], that is

$$ p\left({f}_k(X)\Big|y=1\right)\sim N\left({\mu}_1,{\sigma}_1\right), $$
(8)
$$ p\left({f}_k(X)\Big|y=0\right)\sim N\left({\mu}_0,{\sigma}_0\right), $$
(9)

where μ_1, σ_1, μ_0, and σ_0 are the means and standard deviations of the two Gaussian distributions.
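As a minimal illustration of Eqs. (6)-(9), the sketch below (Python with NumPy/SciPy; all names are our own) evaluates each selected weak classifier as a Gaussian log-likelihood ratio of its Haar-like feature response, combines the responses into the strong classifier of Eq. (7) with the weights λ_k = e^{(1−k)/K}, and converts H(X) into p(y = 1 | X) via Eq. (6). `feature_vals` stands for the precomputed Haar-like feature responses f_k(X) of one sample.

```python
import numpy as np
from scipy.stats import norm

def weak_response(f_val, mu1, sigma1, mu0, sigma0):
    """h_k(X) = log( p(f_k(X) | y=1) / p(f_k(X) | y=0) ) with the
    Gaussian class models of Eqs. (8)-(9)."""
    eps = 1e-12  # guard against log(0) in the distribution tails
    p1 = norm.pdf(f_val, loc=mu1, scale=sigma1) + eps
    p0 = norm.pdf(f_val, loc=mu0, scale=sigma0) + eps
    return float(np.log(p1 / p0))

def strong_response(feature_vals, gauss_params, K):
    """H(X) = sum_{k=1}^{K} lambda_k h_k(X) with lambda_k = exp((1-k)/K), Eq. (7)."""
    H = 0.0
    for k in range(1, K + 1):
        mu1, s1, mu0, s0 = gauss_params[k - 1]
        lam = np.exp((1.0 - k) / K)
        H += lam * weak_response(feature_vals[k - 1], mu1, s1, mu0, s0)
    return H

def prob_positive(H):
    """Eq. (6): p(y = 1 | X) = 0.5 * tanh(H(X)) + 0.5."""
    return 0.5 * np.tanh(H) + 0.5
```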

During the training of the classifier, we use the gradient descent method, and the iterative updates of μ_i and σ_i are as follows:

$$ {\mu}_i=\eta {\mu}_i+\left(1-\eta \right)\frac{1}{N}{\displaystyle \sum_{j\Big|y=1}f\left({X}_j\right)}, $$
(10)
$$ {\sigma}_i=\eta {\sigma}_i+\left(1-\eta \right)\sqrt{\frac{1}{N}{\displaystyle \sum_{j\Big|y=1}{\left(f\left({X}_j\right)-{\mu}_i\right)}^2}}, $$
(11)

where i = 0, 1 and η is the learning coefficient.
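A minimal sketch of the parameter updates in Eqs. (10)-(11) follows; the names `responses` and `eta` are illustrative, with `responses` playing the role of the feature values f(X_j) of the relevant class.

```python
import numpy as np

def update_gaussian(mu, sigma, responses, eta):
    """Online update of one class-conditional Gaussian, Eqs. (10)-(11).

    `responses` holds the feature values f(X_j) of the newly labelled
    samples of that class, and `eta` in (0, 1) controls how much of the
    previous estimate is kept."""
    responses = np.asarray(responses, dtype=float)
    batch_mu = np.mean(responses)
    # Eq. (11) measures the spread of the new responses around the old mean.
    batch_sigma = np.sqrt(np.mean((responses - mu) ** 2))
    mu_new = eta * mu + (1.0 - eta) * batch_mu
    sigma_new = eta * sigma + (1.0 - eta) * batch_sigma
    return mu_new, sigma_new
```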

2.3 Selecting weak classifiers

As Eq. 7 shows, target tracking requires selecting K weak classifiers from the pool Φ, and the selection rule should yield an optimal strong classifier [20]. Babenko et al. [14] propose to determine the weak classifier h by maximizing the log-likelihood function over both the positive and negative sample sets, that is

$$ {h}_k=\underset{h\in \varPhi }{ \arg \max }L\left({H}_{k-1}+{\lambda}_kh\right), $$
(12)

where L(H) is computed as follows:

$$ \begin{array}{r}L(H)={\displaystyle \sum_{s=0}^1\left({y}_s \log \left(p\left(y=1\Big|{X}^{+}\right)\right)\right.}+\\ {}\left.\left(1-{y}_s\right) \log \left(p\left(y=0\Big|{X}^{-}\right)\right)\right),\end{array} $$
(13)

where \( p\left(y=1\Big|{X}^{+}\right)={\sum}_{j=0}^{N-1}{w}_jp\left(y=1\Big|{X}_{1j}\right) \). Since positive samples farther from the target location resemble the background more, we define the following similarity coefficient:

$$ {w}_j=\frac{1}{c}{e}^{-\left\Vert l\left({X}_{1j}\right)-l\left({X}_{10}\right)\right\Vert }, $$
(14)

where c is the normalization constant.

Similarly, we have

$$ \begin{array}{c}p\left(y=0\Big|{X}^{-}\right)={\displaystyle {\sum}_{j=N}^{N+L-1}{w}_j^{\hbox{'}}p\left(y=0\Big|{X}_{0j}\right)}\\ {}=w{\displaystyle {\sum}_{j=N}^{N+L-1}\left(1-p\left(y=1\Big|{X}_{1j}\right)\right)}.\end{array} $$
(15)

In Eq. 15, the similarities between negative samples are small, so we let w be a constant.
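The weighted bag probabilities of Eqs. (13)-(15) can be sketched as below, assuming `H_pos` and `H_neg` hold the strong-classifier responses H(X) on the positive and negative samples, `pos_locs` their locations, and `w_const` the constant weight w used for the negatives; these names are our own.

```python
import numpy as np

def positive_weights(pos_locs):
    """Eq. (14): weight each positive sample by its distance to the
    target-centred sample X_10 and normalise so the weights sum to one."""
    pos_locs = np.asarray(pos_locs, dtype=float)
    d = np.linalg.norm(pos_locs - pos_locs[0], axis=1)
    w = np.exp(-d)
    return w / w.sum()  # division by the sum plays the role of 1/c

def bag_log_likelihood(H_pos, H_neg, pos_locs, w_const):
    """Eq. (13) with the weighted bag probabilities of Eqs. (14)-(15)."""
    p_pos = 0.5 * np.tanh(np.asarray(H_pos)) + 0.5    # p(y = 1 | X_1j), Eq. (6)
    p_neg = 0.5 * np.tanh(np.asarray(H_neg)) + 0.5    # p(y = 1 | X_0j)
    w = positive_weights(pos_locs)
    p_bag_pos = float(np.dot(w, p_pos))               # p(y = 1 | X^+)
    p_bag_neg = w_const * float(np.sum(1.0 - p_neg))  # p(y = 0 | X^-), Eq. (15)
    eps = 1e-12
    return np.log(p_bag_pos + eps) + np.log(p_bag_neg + eps)
```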

Computing h with Eq. 12 consumes a lot of computing resources, so we use a more efficient approach. Expanding L(H_{k−1} + λ_k h) with the first-order Taylor formula, we have

$$ L\left({H}_{k-1}+{\lambda}_kh\right)\approx L\left({H}_{k-1}\right)+{\left.\left\langle {\lambda}_kh,\nabla L(H)\right\rangle \right|}_{H={H}_{k-1}}, $$
(16)

where \( \left\langle {\lambda}_kh,\nabla L(H)\right\rangle =\frac{\lambda_k}{N+L}{\sum}_{j=0}^{N+L-1}h\left({X}_{ij}\right)\nabla L(H)\left({X}_{ij}\right) \), and the per-sample gradient evaluates to

$$ \begin{array}{ll}\nabla L(H)\left({X}_{ij}\right)&={\left.\frac{\partial L\left(H+\theta {1}_{X_{ij}}\right)}{\partial \theta}\right|}_{\theta =0}\\ &={y}_i\frac{{w}_j\left(1-{\tanh}^2\left(H\left({X}_{1j}\right)\right)\right)}{{\sum}_{m=0}^{N-1}{w}_m\left(\tanh \left(H\left({X}_{1m}\right)\right)+1\right)}-\left(1-{y}_i\right)\frac{1-{\tanh}^2\left(H\left({X}_{0j}\right)\right)}{{\sum}_{m=N}^{N+L-1}\left(1-\tanh \left(H\left({X}_{0m}\right)\right)\right)},\end{array} $$

where y_i = i for i = 0, 1.

Since L(H_{k−1}) is already known, in order to maximize L(H_{k−1} + λ_k h) we only need to maximize \( {\left.\left\langle {\lambda}_kh,\nabla L(H)\right\rangle \right|}_{H={H}_{k-1}} \); Eq. 12 can therefore be rewritten as follows:

$$ {h}_k=\underset{h\in \varPhi }{ \arg \max}\left\langle {\lambda}_kh,\nabla L(H)\right\rangle . $$
(17)
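A minimal sketch of the greedy selection rule of Eq. (17) follows: every weak classifier in the pool is scored by the inner product of its responses with the per-sample gradient ∇L(H)(X_ij) derived above, and the best-scoring one is chosen. The array names are illustrative.

```python
import numpy as np

def select_weak_classifier(pool_responses, grad_L, lam_k):
    """Eq. (17): h_k = argmax_{h in Phi} <lambda_k h, grad L(H)>.

    pool_responses : (M, N + L) array, pool_responses[m, j] = h_m(X_j)
    grad_L         : (N + L,) array of per-sample gradients of L(H)
    lam_k          : weight lambda_k assigned to the classifier being added
    """
    n_samples = grad_L.shape[0]
    # <lambda_k h, grad L(H)> = (lambda_k / (N + L)) * sum_j h(X_j) grad_L(X_j)
    scores = (lam_k / n_samples) * pool_responses @ grad_L
    best = int(np.argmax(scores))
    return best, float(scores[best])
```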

In the MIL algorithm proposed by Babenko et al. [14], Eq. 13 must be maximized directly, which requires computing, for each of the M candidate weak classifiers, the probability that every sample belongs to the positive or the negative bag, so the computational complexity is very high. In this paper, we propose an algorithm for computing \( H(X)={\sum}_{k=1}^K{\lambda}_k{h}_k(X) \), summarized in Algorithm 1. From the first frame of a video, we locate the target to be tracked and generate the positive and negative sample sets {X^+, X^−}, where X^+ = {X_{1j}, y_1 = 1, j = 0, 1, …, N − 1} and X^− = {X_{0j}, y_0 = 0, j = N, N + 1, …, N + L − 1}. Next, according to Eqs. 8 and 9, we compute p(f(X_{1j})|y = 1) and p(f(X_{0j})|y = 0), and then compute h_m for m = 1, …, M to generate the weak classifier pool Φ = {h_1, …, h_M}.

Algorithm 1
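For orientation, the sketch below outlines one tracking-and-update step in the spirit of Algorithm 1; it strings together the helpers sketched earlier in this section and takes the classifier-specific operations as caller-supplied callables, so it is only a structural outline under our own naming, not the authors' implementation.

```python
def track_and_update(frame, prev_loc, classifier_response, update_classifier,
                     s, alpha, beta, gamma):
    """One tracking step: localise the target in `frame`, then refresh
    the classifier with new positive and negative samples.

    classifier_response(frame, loc) -> strong-classifier score H(X) at `loc`
    update_classifier(frame, positives, negatives) -> re-estimates the
    Gaussians (Eqs. 8-11) and re-selects the K weak classifiers (Eq. 17)."""
    height, width = frame.shape[:2]

    # 1. Search within radius s and localise the target (Eqs. 1-3).
    candidates = candidate_set(prev_loc, s, height, width)
    new_loc = localize(candidates,
                       lambda loc: prob_positive(classifier_response(frame, loc)))

    # 2. Draw new positive and negative samples around the new location (Eqs. 4-5).
    positives, negatives = draw_samples(new_loc, alpha, beta, gamma, height, width)

    # 3. Online update of the classifier with the new samples.
    update_classifier(frame, positives, negatives)

    return new_loc
```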

3 Experiments

3.1 Experimental setup

In the experiments, we use two public datasets, iCoseg [21] and MSRC [22]. The iCoseg dataset contains a series of related images for each object; for example, an athlete moving on a horizontal bar. The MSRC dataset monitors a forest environment, in which a panda appears in and disappears from the camera view. We test target recognition and tracking in these two scenes.

The baseline algorithms are MIL [14], OAB [23], and SBT [6]. The MIL algorithm is a classical multiple-instance learning approach for target tracking. The OAB algorithm is a boosting approach for target classification in image sequences. The SBT algorithm is a semi-supervised machine learning approach that uses a large amount of untagged data to improve classification accuracy.

3.2 Experimental results

To evaluate the performance of the proposed algorithm, we use two metrics: precision and recall. Here, we use “Jumping” to denote the sequence of a woman moving on a horizontal bar and “panda” to denote the sequence of a panda appearing in the camera.

First, we compare the precision of the four algorithms on the two datasets; the result is shown in Fig. 2. As the figure shows, the OAB and SBT algorithms achieve better precision on the Jumping dataset than on the panda dataset, whereas the MIL algorithm achieves better precision on the panda dataset than on the Jumping dataset. This indicates that different tracking algorithms yield different precision in different scenes. Our algorithm, because it uses multiple-instance learning to separate the target from its background, achieves the best precision on both datasets.

Fig. 2 Comparison of precision

Second, we compare the recall of the four algorithms on the two datasets; the result is shown in Fig. 3. As the figure shows, the OAB and SBT algorithms achieve lower recall on the Jumping dataset than on the panda dataset, whereas the MIL algorithm achieves better recall on the Jumping dataset than on the panda dataset. This again indicates that different tracking algorithms yield different recall in different scenes. Our algorithm, because it uses multiple-instance learning to separate the target from its background, achieves the best recall on both datasets.

Fig. 3 Comparison of recall

Next, we illustrate the target recognition results in these two scenes; the results are shown in Fig. 4. The images in the first row capture the panda. Whether the panda sits down, walks, or crosses a river, it is recognized easily; even when part of the panda lies outside the image, the panda is still recognized. The images in the second row illustrate the recognition of a woman moving on a horizontal bar. In this scene, the backgrounds of the images are almost identical while the woman performs different actions, a situation much easier than the previous one, so classification accuracy is assured. In this dataset, even when parts of the woman are occluded, she is still recognized clearly.

Fig. 4 Illustration of target recognition results in image series

Finally, we compare the execution time and memory usage of the algorithms on the two datasets. Figure 5 shows the execution time comparison: our proposed algorithm consumes the least execution time on both datasets, the OAB algorithm is the second fastest, and the other two algorithms take longer. Comparing SBT and MIL, the MIL algorithm takes the longest execution time on the Jumping dataset, while the SBT algorithm takes the longest on the panda dataset. Figure 6 shows the memory usage comparison: our proposed algorithm consumes the least memory while recognizing and tracking targets on the two datasets, and the OAB algorithm consumes the second least on both. For SBT and MIL, SBT needs more memory than MIL on the Jumping dataset, while MIL needs more memory than SBT on the panda dataset.

Fig. 5 Comparison of execution time

Fig. 6 Comparison of memory

4 Conclusions

In this paper, we studied target recognition and tracking in image sequences based on the multiple-instance learning technique. In the target tracking framework, we use image frames to generate positive and negative samples to train a classifier, and we use the classifier to differentiate the target from its background. A set of weak classifiers is combined into a strong classifier. Experiments on two public datasets show that the proposed approach achieves better precision and recall than related works.