1 Introduction

Object tracking is an extensively studied topic in visual sensor networks (VSNs). A VSN is a network of smart cameras that capture, process, and analyze image data locally and exchange the extracted information with each other [1]. The main applications of VSNs are indoor and outdoor surveillance, e.g., of airports, large waiting areas, forests, deserts, and other inaccessible or natural environments [2]. In general, the typical task of a VSN is to detect and track specific objects. An object is usually described by a state that comprises characteristics such as position, velocity, appearance, behavior, shape, and color, and this state is used to detect and track the object. Recursive state estimation algorithms are predominantly used to track objects in a VSN [3].

In [4–11], the authors presented several Kalman filter (KF)-based object tracking methods. An extended Kalman filter (EKF)-based object tracking method is proposed in [12]. The unscented Kalman filter (UKF) is applied to visual contour tracking in [13] and to object tracking in [14]. For object tracking in a VSN, the cubature Kalman filter (CKF) was applied in our previous work [15]. In [16–24], the authors presented particle filter (PF)-based object tracking. The object tracking methods based on these conventional Bayesian filters have varying degrees of complexity and accuracy.

In general, the performance of tracking algorithms suffers from adverse effects such as the distance or orientation of the camera and occlusions. However, a VSN with overlapping fields of view (FOVs) is capable of providing multiple observations of the same object simultaneously. The authors in [25] presented a distributed and collaborative sensing mechanism that improves the observability of the objects by dynamically changing the camera’s pan, tilt, and zoom. Other examples of distributed object tracking methods are presented in [26] and [27].

Recently, information filters have emerged as suitable methods for multi-sensor state estimation [28]. In information filtering, the information vector and matrix are computed and propagated over time instead of the state vector and its error covariance. The information matrix is the inverse of the state error covariance matrix, and the information vector is the product of the information matrix and the state vector. Information filters have an inherent information fusion mechanism, which makes them particularly suitable for multi-camera object tracking; a more detailed description of information filters is given in Section 3. The authors in [29] and [30] presented information-weighted consensus-based distributed object tracking with an underlying KF or a distributed maximum likelihood estimation. In our work [31], we presented robust cubature information filter (CIF)-based distributed object tracking in VSNs. However, the limited processing, communication, and energy capabilities of the cameras in a VSN present a major challenge.

Nowadays, VSNs tend to evolve into large-scale networks with limited bandwidth and energy reservoirs, in which a large number of cameras can observe a single object. Although this improves the tracking accuracy, exchanging the large number of observations among the cameras increases the communication overhead and energy consumption. Hence, allowing only a desired number of cameras to participate in the information exchange is a way to meet the stringent bandwidth and energy requirements.

Estimating an object’s state with a selected set of cameras is a well-investigated topic. Several camera selection mechanisms have been proposed in the literature to minimize and/or maximize different metrics such as estimation accuracy, monitored area, number of transmissions, and amount of data transfer. In [32], the authors presented an object tracking method based on a fuzzy automaton for camera hand-over to expand the monitored area. This method selects a single best camera to control and track the objects by comparing its rank with those of the neighboring cameras. It cannot select multiple cameras, and the cameras have to communicate with each other to determine the best one. In [33], the authors presented an efficient camera-tasking approach that minimizes the visual hull area (the maximal area that could be occupied by objects) for a given number of objects and cameras. They also presented several methods to select a subset of cameras based on the positions of the objects and cameras so as to minimize the visual hull area. If objects are recognized in the vicinity of a certain location, the subset of cameras best suited to observe this location performs the tracking. This method is capable of selecting multiple cameras, but not a desired number of cameras on average. In [34], the authors presented a framework for dynamically selecting a subset of cameras to track people in a VSN with limited network resources while achieving the best possible tracking performance. However, the camera selection decision is made at the fusion center (FC) based on training data, and the selection is broadcast to the cameras in the VSN. Hence, this selection process does not depend on the actual observations.

The observations received by the cameras in the VSN are typically realizations of a random variable. Hence, they contain varying degrees of information about the state of the object and can be broadly classified into informative and non-informative observations. The non-informative observations do not contribute significantly to the tracking accuracy. Hence, a camera selection strategy that allows only a desired number of cameras with the most informative observations to participate in the information exchange, and discards the cameras with non-informative observations, is an efficient way to meet the bandwidth and energy requirements.

In [35], the authors presented an entropy-based algorithm that dynamically selects multiple cameras to reduce transmission errors and, subsequently, the communication bandwidth. In this work, the cameras in the VSN use the extended information filter (EIF) as the local filter and calculate the expected information gain (EIG) in the form of a logarithmic ratio of the expected and posterior information matrices. If the information gain is greater than the cost of the transmissions, the cameras participate in the information fusion. The calculated EIG does not depend on the measurements directly, and the cluster head has to run an optimization step to select the best possible cameras at each time step. Moreover, this method is not capable of selecting only a desired number of cameras on average. In [36], a camera set is selected based on an individual image quality metric (IQM) for spherical objects. The cameras that detect the spherical target are ranked by their local IQM, and the required number of cameras with the highest IQM are chosen. Although this approach is formulated for spherical objects, it can be extended to non-spherical ones. Its major disadvantage is that either all the cameras in the VSN or the FC must know the IQM of every other camera. Hence, the cameras cannot make independent decisions, which restricts scalability.

In our work, a multi-camera object tracking method based on the CIF is proposed in which the cameras make independent decisions on whether or not to participate in the information exchange. Furthermore, the proposed method ensures that, on average, only a desired number of cameras participate in the information exchange to meet bandwidth requirements. We model the state of an object with a dynamic state representation that includes its position and velocity on the ground plane. Further, we consider a VSN with overlapping FOVs; thus, multiple cameras can observe an object simultaneously. Each camera in the VSN has a local CIF on board and can therefore calculate the local information metrics (information contribution vector and matrix) from its observations. The cameras that can observe a specific object form a cluster (observation cluster) with an elected FC. In this paper, we use the concept of surprisal [37] to evaluate the amount of information in the observations received by the cameras in the VSN. The surprisal of the measurement residual indicates the amount of new information received from the corresponding observation. The observations of a camera are informative only if the corresponding surprisal of the measurement residual is greater than a threshold. The threshold is calculated as a function of the ratio of the number of desired cameras to the total number of cameras in the observation cluster. This ensures that, on average, only the desired number of cameras are selected as cameras with informative observations (surprisal cameras). The surprisal cameras calculate the local information metrics with the CIF and transmit them to the FC. The FC then fuses the local information metrics of the surprisal cameras into the global state estimate by using the inherent fusion mechanism of the CIF. The proposed selection mechanism only requires knowledge of the total number of cameras in the observation cluster and the desired number of cameras. Further, we compare the proposed method with multi-camera object tracking with random and fixed cameras using simulated and experimental data.

The paper is organized as follows: Section 2 describes the considered VSN with motion and observation models. Section 3 presents theoretical concepts of information filtering. Section 4 describes the camera selection based on the surprisal of the measurement residual and the calculation of the surprisal threshold. Section 5 explains the proposed CIF-based multi-camera object tracking with surprisal cameras. Section 6 evaluates the proposed method based on simulation and experimental data. Finally, Section 7 presents the conclusions.

2 System model

In this work, we consider a VSN consisting of a fixed set of calibrated smart cameras \(c_{i}\), where \(i\in\{1,2,\cdots,M\}\), with overlapping FOVs as illustrated in Fig. 1. The task of the cameras in the VSN is to monitor the given environment and to identify and track an object. As the cameras are calibrated, there exists a homography to calculate the object’s position on the ground plane. The cameras \(c_{i}\) that can observe the object at time k form the observation cluster \(C_{k}\). The state of the object comprises its position \((x_{k}, y_{k})\) and velocity \((\dot{x}_{k},\dot{y}_{k})\) on the ground plane. Thus, the state at time k is described as \(\mathbf{x}_{k}=\left[x_{k}\ y_{k}\ \dot{x}_{k}\ \dot{y}_{k}\right]^{T}\). The motion model of the object at camera \(c_{i}\) at time k is given as

$$\begin{array}{*{20}l} \mathbf{x}_{k} &=\mathbf{f}_{i,k}\left(\mathbf{x}_{k-1}, \mathbf{w}_{i,k}\right) \\ &=\left[\begin{array}{c} x_{k-1} + \delta\dot{x}_{k-1} + \frac{\delta^{2}}{2} \ddot{x}_{i,k} \\ y_{k-1} + \delta\dot{y}_{k-1} + \frac{\delta^{2}}{2} \ddot{y}_{i,k} \\ \dot{x}_{k-1} + \delta \ddot{x}_{i,k} \\ \dot{y}_{k-1} + \delta \ddot{y}_{i,k} \end{array}\right], \end{array} $$
(1)
Fig. 1 Visual sensor network

where \(\ddot{x}_{i,k}\) and \(\ddot{y}_{i,k}\) represent the acceleration of the object in the x and y directions, modeled by the independent and identically distributed (IID) white Gaussian noise vector \(\mathbf{w}_{i,k}=\left[\ddot{x}_{i,k}\ \ddot{y}_{i,k}\right]^{T}\) with covariance \(\mathbf{Q}_{i,k}=\text{diag}\left(q_{x_{i}}, q_{y_{i}}\right)\), and \(\delta\) is the time interval between two observations. The state transition model (1) can further be written as

$$ \textbf{{x}}_{k} = \left[ \begin{array}{llll} 1 & 0 & \delta & 0 \\ 0 & 1 & 0 & \delta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right] \textbf{{x}}_{k-1} +\textbf{w}^{s}_{i,k}, $$
(2)

where \(\mathbf{w}^{s}_{i,k}\) is an IID white Gaussian noise vector with covariance

$$ \textbf{Q}^{s}_{i,k} = \left[ \begin{array}{cccc} \frac{q_{x_{i}} \delta^{4}}{4} & 0 & \frac{q_{x_{i}} \delta^{3}}{2} & 0 \\ 0 & \frac{q_{y_{i}} \delta^{4}}{4} & 0 & \frac{q_{y_{i}} \delta^{3}}{2} \\ \frac{q_{x_{i}} \delta^{3}}{2} & 0 & q_{x_{i}} \delta^{2} & 0 \\ 0 & \frac{q_{y_{i}} \delta^{3}}{2} & 0 & q_{y_{i}} \delta^{2} \end{array}\right]. $$
(3)
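To make the motion model concrete, the following is a minimal Python/NumPy sketch (not from the paper; the function names are illustrative) that builds the transition matrix of (2) and the process noise covariance of (3) and draws a ground-truth trajectory from the model:

```python
import numpy as np

def transition_matrices(delta, qx, qy):
    """Transition matrix F of (2) and process noise covariance Q^s of (3)."""
    F = np.array([[1.0, 0.0, delta, 0.0],
                  [0.0, 1.0, 0.0, delta],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    d2, d3, d4 = delta**2, delta**3, delta**4
    Q = np.array([[qx * d4 / 4, 0.0,         qx * d3 / 2, 0.0        ],
                  [0.0,         qy * d4 / 4, 0.0,         qy * d3 / 2],
                  [qx * d3 / 2, 0.0,         qx * d2,     0.0        ],
                  [0.0,         qy * d3 / 2, 0.0,         qy * d2    ]])
    return F, Q

def simulate_track(x0, steps, delta=1.0, qx=5.0, qy=5.0, seed=0):
    """Draw a ground-truth trajectory x_1, ..., x_steps from (2)."""
    rng = np.random.default_rng(seed)
    F, Q = transition_matrices(delta, qx, qy)
    x, track = np.asarray(x0, dtype=float), []
    for _ in range(steps):
        x = F @ x + rng.multivariate_normal(np.zeros(4), Q)
        track.append(x)
    return np.asarray(track)
```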

The state of the object is estimated from observations taken at each time step k. The observation model of the object at camera \(c_{i}\) and time k is given as

$$ \mathbf{z}_{i,k} = \mathbf{h}_{i,k}\left(\mathbf{x}_{k}\right)+\mathbf{v}_{i,k}, $$
(4)

where \(\mathbf{v}_{i,k}\) is an IID measurement noise vector with covariance \(\mathbf{R}_{i,k}\). The measurement function \(\mathbf{h}_{i,k}\) is the non-linear homography that converts the object’s coordinates from the ground plane to the image plane. The considered motion model (1) and measurement model (4) are adapted from [27].
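The observation function of (4) can be sketched in the same spirit, continuing the NumPy sketch above. The 3×3 homography `H` below is a placeholder for a camera’s calibration; the paper does not specify its entries:

```python
def h_homography(x_state, H):
    """Project the ground-plane position (x, y) of the state to image
    coordinates with a 3x3 homography H (homogeneous normalization)."""
    p = H @ np.array([x_state[0], x_state[1], 1.0])
    return p[:2] / p[2]

def observe(x_state, H, R, rng):
    """Noisy camera observation according to the model (4)."""
    return h_homography(x_state, H) + rng.multivariate_normal(np.zeros(2), R)
```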

3 Information filtering

The information filter is an alternative formulation of Bayesian state estimation. In information filtering, the information vector and the information matrix are computed and propagated instead of the estimated state vector and the error covariance. The estimated global information matrix \(\mathbf{Y}_{k-1|k-1}\) and information vector \(\widehat{\mathbf{y}}_{k-1|k-1}\) at time k−1 are given as

$$ \mathbf{Y}_{k-1|k-1} = \mathbf{P}^{-1}_{k-1|k-1}, $$
(5)
$$ \widehat{\mathbf{y}}_{k-1|k-1} = \mathbf{Y}_{k-1|k-1} \widehat{\mathbf{x}}_{k-1|k-1}, $$
(6)

where \(\widehat{\mathbf{x}}_{k-1|k-1}\) and \(\mathbf{P}_{k-1|k-1}\) are the estimated global state vector and error covariance matrix at time k−1. At time k and camera \(c_{i}\), the information filter has two steps: the time update and the measurement update.
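The conversions (5)-(6) between the state form and the information form amount to a matrix inversion; a minimal sketch (illustrative names, NumPy):

```python
import numpy as np

def to_information_form(x_hat, P):
    """(5)-(6): Y = P^{-1} and y = Y x."""
    Y = np.linalg.inv(P)
    return Y @ x_hat, Y

def to_state_form(y_hat, Y):
    """Inverse conversion: P = Y^{-1} and x = P y."""
    P = np.linalg.inv(Y)
    return P @ y_hat, P
```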

3.1 Time update

The information form of the predicted state and the corresponding information matrix are computed as

$$ \mathbf{Y}_{i,k|k-1} = \mathbf{P}^{-1}_{i,k|k-1}, $$
(7)
$$ \widehat{\mathbf{y}}_{i,k|k-1} = \mathbf{Y}_{i,k|k-1} \widehat{\mathbf{x}}_{i,k|k-1}, $$
(8)

where \(\widehat{\mathbf{x}}_{i,k|k-1}\) and \(\mathbf{P}_{i,k|k-1}\) are the predicted state vector and the error covariance matrix, respectively.

3.2 Measurement update

Upon receiving the measurement \(\mathbf{z}_{i,k}\), the information contribution matrix \(\mathbf{I}_{i,k}\) and information contribution vector \(\mathbf{i}_{i,k}\) are computed as

$$ \mathbf{I}_{i,k} = \mathbf{Y}_{i,k|k-1}\mathbf{P}_{\mathbf{xz},i,k}\mathbf{R}^{-1}_{i,k}\mathbf{P}^{T}_{\mathbf{xz},i,k}\mathbf{Y}^{T}_{i,k|k-1}, $$
(9)
$$ \begin{aligned} \mathbf{i}_{i,k} = \mathbf{Y}_{i,k|k-1}\mathbf{P}_{\mathbf{xz},i,k}&\mathbf{R}^{-1}_{i,k}\\ &\left(\mathbf{e}_{i,k} + \mathbf{P}^{T}_{\mathbf{xz},i,k}\widehat{\mathbf{y}}_{i,k|k-1}\right), \end{aligned} $$
(10)

where \(\mathbf{P}_{\mathbf{xz},i,k}\), \(\mathbf{R}_{i,k}\), and \(\mathbf{e}_{i,k}\) are the cross-covariance of the state and measurement vector, the measurement noise covariance, and the measurement residual, respectively. The measurement residual is defined as

$$ \mathbf{e}_{i,k} = \mathbf{z}_{i,k}-\widehat{\mathbf{z}}_{i,k|k-1}, $$
(11)

where \(\widehat{\mathbf{z}}_{i,k|k-1}\) is the predicted measurement. In this work, the CIF is used at the cameras to track the objects locally. We refer to Appendices 1 and 2 and to [38] for the CIF algorithm.
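Once the local filter has supplied the predicted quantities, the measurement update (9)-(11) reduces to a few matrix products. A hedged sketch (names are illustrative; the predicted cross-covariance and measurement come from the CIF of the appendices):

```python
import numpy as np

def information_contribution(Y_pred, y_pred, P_xz, R, z, z_pred):
    """Local information contribution of one camera, following (9)-(11)."""
    e = z - z_pred                                         # residual (11)
    R_inv = np.linalg.inv(R)
    I_mat = Y_pred @ P_xz @ R_inv @ P_xz.T @ Y_pred.T      # matrix (9)
    i_vec = Y_pred @ P_xz @ R_inv @ (e + P_xz.T @ y_pred)  # vector (10)
    return i_vec, I_mat, e
```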

3.3 Information fusion

In multi-camera networks, multiple cameras have overlapping FOVs and can thus observe an object simultaneously. Hence, each camera \(c_{i}\), where \(i \in C_{k}\), that observes the object computes its own information contribution matrix \(\mathbf{I}_{i,k}\) and information contribution vector \(\mathbf{i}_{i,k}\) as shown in (9) and (10), respectively. If each camera sends its local information metrics to an elected FC, the global information equivalents of the estimated state and error covariance at the FC \(c_{o}\), where \(o \in C_{k}\), are calculated as

$$ \mathbf{Y}_{k|k} = \mathbf{Y}_{o,k|k-1}+\sum^{|C_{k}|}_{i=1}\mathbf{I}_{i,k}, $$
(12)
$$ \widehat{\mathbf{y}}_{k|k} = \widehat{\mathbf{y}}_{o,k|k-1}+\sum^{|C_{k}|}_{i=1} \mathbf{i}_{i,k}, $$
(13)

where \(\widehat{\mathbf{y}}_{o,k|k-1}\) and \(\mathbf{Y}_{o,k|k-1}\) are the predicted information vector and matrix at the FC, respectively.
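The fusion (12)-(13) at the FC is a plain summation of the received contributions onto the predicted information; a minimal sketch reusing the helper above:

```python
def fuse_at_fc(y_pred_fc, Y_pred_fc, contributions):
    """(12)-(13): add all local contributions (i_vec, I_mat) to the
    FC's predicted information vector and matrix."""
    y_glob, Y_glob = y_pred_fc.copy(), Y_pred_fc.copy()
    for i_vec, I_mat in contributions:
        y_glob += i_vec
        Y_glob += I_mat
    return y_glob, Y_glob
```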

4 Surprisal camera selection

VSNs usually have limited bandwidth and energy reservoirs. Therefore, it might be necessary that only a desired number of cameras (a subset) transmit their local information to the FC. On the other hand, this can decrease the tracking accuracy. A better tracking accuracy can be achieved by selecting the cameras based on the information associated with their observations. This strategy improves the accuracy of the global state estimate under the given bandwidth and energy constraints. The information content associated with the observations can be quantified by applying the concept of self-information or surprisal.

4.1 Surprisal

The surprisal H is a measure of the information associated with the outcome x of a random variable. It is calculated as

$$ H = -\text{log}\left(\text{Pr}(x)\right), $$
(14)

where Pr(x) is the probability of the outcome x and the base of the logarithm can be considered as 2, 10, or e. In this paper, the surprisal is calculated with the natural logarithm (base e) for the sake of mathematical simplification. The surprisal of the outcome of a random variable depends only on the probability of the corresponding outcome Pr(x). A highly probable outcome of a random variable is less surprising and vice versa.

4.2 Surprisal of measurement residual

In multi-camera object tracking, the local observations \(\mathbf{z}_{i,k}\) of each camera \(c_{i}\) are random variables because of the additive Gaussian noise and the random initial state. Hence, they contain varying degrees of information about the state of the object. Within the framework of information filtering, the measurement residual \(\mathbf{e}_{i,k}\) at camera \(c_{i}\) and time k is the disagreement between the predicted observation and the actual observation (see (11)). Hence, the surprisal of the measurement residual \(\mathbf{e}_{i,k}\) gives the additional information associated with the received observations that is not available in the predicted observations obtained through the predicted state. The surprisal of the measurement residual \(\mathbf{e}_{i,k}\) at camera \(c_{i}\) and time k can be computed as¹

$$ H_{i,k} = -\text{log}_{e}\left(\mathrm{p}\left(\mathbf{e}_{i,k}\right)\right). $$
(15)

Under the assumption of IID additive Gaussian observation noise, the measurement residual is approximately Gaussian distributed with zero mean and covariance \(\mathbf{P}_{\mathbf{zz},i,k}\), called the innovation covariance:

$$ \mathbf{e}_{i,k} \sim \mathcal{N}\left(0, \mathbf{P}_{\mathbf{zz},i,k} \right). $$
(16)

By substituting (16) in (15), the surprisal of the measurement residual \(\mathbf{e}_{i,k}\) becomes

$$ \begin{aligned} H_{i,k} &= -\text{log}_{e}\left(\frac{\text{exp}\left({-\frac{1}{2}\mathbf{e}^{T}_{i,k} \mathbf{P}^{-1}_{\mathbf{zz},i,k}\mathbf{e}_{i,k}}\right)}{\left(2\pi\right)^{\frac{n_{z}}{2}} \text{det}^{\frac{1}{2}}(\mathbf{P}_{\mathbf{zz},i,k})}\right) \\ & =\alpha_{i,k} +\frac{1}{2} \mathbf{e}^{T}_{i,k}\mathbf{P}^{-1}_{\mathbf{zz},i,k}\mathbf{e}_{i,k}, \end{aligned} $$
(17)

where \(\alpha_{i,k}\) is

$$ \alpha_{i,k} = \frac{n_{z}}{2} \text{log}_{e}\left(2\pi\right) + \frac{1}{2}\text{log}_{e}\left(\text{det}(\mathbf{P}_{\mathbf{zz},i,k})\right), $$
(18)

and \(n_{z}\) is the length of the observation vector of camera \(c_{i}\) at time k. The observations of camera \(c_{i}\) at time k are informative enough if the surprisal of the corresponding measurement residual \(H_{i,k}\) is greater than a threshold \(\chi_{k}\):

$$ H_{i,k} \left\{ \begin{array}{ll} \geq \chi_{k} & \quad \text{informative}\\ < \chi_{k} & \quad \text{non-informative}. \end{array} \right. $$
(19)

The cameras with sufficiently informative measurements are called surprisal cameras. The threshold \(\chi_{k}\) has to be chosen, based on the bandwidth and energy constraints, such that at each time k, on average, only a given number of cameras are selected as surprisal cameras.
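A sketch of the surprisal computation (17)-(18) and the decision rule (19) (illustrative; `chi_k` is the threshold whose choice is derived in Section 4.3):

```python
import numpy as np

def surprisal(e, P_zz):
    """Surprisal H of a measurement residual, following (17)-(18)."""
    n_z = e.shape[0]
    _, logdet = np.linalg.slogdet(P_zz)
    alpha = 0.5 * n_z * np.log(2.0 * np.pi) + 0.5 * logdet  # (18)
    return alpha + 0.5 * e @ np.linalg.solve(P_zz, e)       # (17)

def is_informative(e, P_zz, chi_k):
    """Decision rule (19): informative iff H >= chi_k."""
    return surprisal(e, P_zz) >= chi_k
```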

4.3 Surprisal threshold

Let \(\mathbf{s}_{k} =\left(s_{1,k},s_{2,k}, \cdots, s_{\left|C_{k}\right|,k}\right)\) be the indication vector at time k, where \(|C_{k}|\) is the number of cameras in the observation cluster. Each element \(s_{i,k}\) of the indication vector is either 1 or 0:

$$ s_{i,k} = \left\{ \begin{array}{ll} 1 & \quad \text{surprisal camera}\\ 0 & \quad \text{non-surprisal camera.} \end{array} \right. $$
(20)

From (17), (19), and (20), the expected value of \(s_{i,k}\), i.e., the probability that camera \(c_{i}\) becomes a surprisal camera, is given as

$$ \begin{aligned} \mathbb{E}\left[s_{i,k}\right]&=\text{Pr}\left(s_{i,k}=1\right)\\ & =\text{Pr}\left(H_{i,k} \geq \chi_{k}\right)\\ & =\text{Pr}\left(\alpha_{i,k} +\frac{1}{2} \mathbf{e}^{T}_{i,k}\mathbf{P}^{-1}_{\mathbf{zz},i,k}\mathbf{e}_{i,k} \geq \chi_{k}\right)\\ & =\text{Pr}\left(\mathbf{e}^{T}_{i,k}\mathbf{P}^{-1}_{\mathbf{zz},i,k}\mathbf{e}_{i,k} \geq 2\left(-\alpha_{i,k} +\chi_{k}\right)\right)\\ & =\text{Pr}\left(\mathbf{e}^{T}_{i,k}\mathbf{P}^{-1}_{\mathbf{zz},i,k}\mathbf{e}_{i,k} \geq \beta_{k}\right), \end{aligned} $$
(21)

where \(\beta_{k} = 2\left(\chi_{k} - \alpha_{i,k}\right)\). Since \(\mathbf{e}_{i,k} \sim \mathcal{N}\left(0, \mathbf{P}_{\mathbf{zz},i,k}\right)\),

$$ \mathbf{e}^{T}_{i,k}\mathbf{P}^{-1}_{\mathbf{zz},i,k}\mathbf{e}_{i,k} \sim \chi^{2}_{n_{z}}, $$
(22)

where \(\chi^{2}_{n_{z}}\) denotes the chi-square distribution with \(n_{z}\) degrees of freedom. The surprisal threshold \(\beta_{k}\) in (21) should be chosen such that, on average, \(|l_{k}|\) cameras are selected as surprisal cameras. Thus,

$$ \begin{aligned} \mathbb{E}\left[\sum^{\left|C_{k}\right|}_{i=1}s_{i,k}\right] &= \sum^{\left|C_{k}\right|}_{i=1}\mathbb{E}\left[s_{i,k}\right] \\ & = \left|C_{k}\right|\mathbb{E}\left[s_{i,k}\right] =\left|l_{k}\right|. \end{aligned} $$
(23)

From (21)–(23), it follows that

$$ \text{Pr}\left(\mathbf{e}^{T}_{i,k}\mathbf{P}^{-1}_{\mathbf{zz},i, k}\mathbf{e}_{i,k} \geq \beta_{k}\right) = \frac{\left|l_{k}\right|}{\left|C_{k}\right|}. $$
(24)

The surprisal threshold \(\beta_{k}\) is therefore the value for which the upper-tail probability of the chi-square distributed normalized squared measurement residual equals \(|l_{k}|/|C_{k}|\):

$$ \beta_{k} = \mathrm{F}^{-1}_{\chi^{2}_{n_{\mathbf{z}}}}\left(1-\frac{\left|l_{k}\right|}{\left|C_{k}\right|}\right), $$
(25)

where \(\mathrm{F}^{-1}_{\chi^{2}_{n_{z}}}\) is the inverse cumulative distribution function of the chi-square distribution \(\chi^{2}_{n_{z}}\) with \(n_{z}\) degrees of freedom.

Hence, the surprisal threshold \(\beta_{k}\) at time k can be calculated from the number of cameras in the observation cluster \(|C_{k}|\) and the desired number of surprisal cameras \(|l_{k}|\). Thus, the cameras \(c_{i}\) in the cluster can independently decide whether their local observations are informative or not.
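In code, (25) is a single call to the inverse CDF of the chi-square distribution; a sketch using SciPy (assuming 2-D image-plane observations, so \(n_{z} = 2\)):

```python
from scipy.stats import chi2

def surprisal_threshold(l_k, C_k, n_z=2):
    """beta_k of (25): the (1 - |l_k|/|C_k|) quantile of chi^2_{n_z}."""
    return chi2.ppf(1.0 - l_k / C_k, df=n_z)

# Example: |C_k| = 10 cameras, |l_k| = 3 desired surprisal cameras,
# 2-D observations: beta_k = chi2.ppf(0.7, 2) ~ 2.41. By (21), testing
# e^T P_zz^{-1} e >= beta_k is equivalent to H >= chi_k with
# chi_k = alpha_{i,k} + beta_k / 2.
```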

5 Multi-camera object tracking with surprisal cameras (MOTSC)

In the proposed scheme, the cameras \(c_{i}\), where \(i\in\{1,2,\cdots,M\}\), in the network that can observe an object at time k form a cluster (observation cluster) \(C_{k}\) with an FC \(c_{o,k}\), as shown in Fig. 2. The dynamic clustering can be achieved in several ways; one such method is presented in [39]. Further, each camera in the VSN has an on-board CIF algorithm. At each time k, each camera in the observation cluster \(C_{k}\) except the FC independently decides whether it is a surprisal camera or not, as discussed in Section 4. All surprisal cameras in the cluster \(C_{k}\) transmit their information contribution vectors and matrices to the FC. Moreover, the FC also performs local filtering with its on-board CIF. The locally calculated and received information contribution metrics are then fused to obtain the estimated global state of the object at time k.

Fig. 2 The VSN with the observation and surprisal clusters and a fixed FC. All the cameras (dots) inside the blue cluster can observe the object at a given time and form the observation cluster. The cameras (dots) inside the red cluster are the surprisal cameras and form the surprisal cluster. The blue dot with a cross represents the FC

The FC is initialized with the global initial information vector and matrix \(\left(\widehat{\mathbf{y}}_{0|0}, \mathbf{Y}_{0|0}\right)\). At each time step k, it has four main functions: surprisal threshold calculation, local filtering, information fusion, and global state dissemination, as summarized below.

  • Surprisal threshold calculation: The surprisal threshold can be calculated from the size \(|C_{k}|\) of the observation cluster and the desired size \(|l_{k}|\) of the surprisal cluster as shown in (25). Hence, the FC, which knows this information, calculates and broadcasts the surprisal threshold whenever the observation or surprisal cluster size changes.

  • Local filtering: The FC performs local estimation based on its measurement \(\mathbf{z}_{o,k}\) by using the on-board CIF. First, the FC predicts the information vector and matrix \(\left(\widehat{\mathbf{y}}_{o,k|k-1}, \mathbf{Y}_{o,k|k-1}\right)\) from the prior global information vector and matrix \(\left(\widehat{\mathbf{y}}_{k-1|k-1}, \mathbf{Y}_{k-1|k-1}\right)\) as shown in Appendix 1. Then, it computes the information contribution vector and matrix \((\mathbf{i}_{o,k}, \mathbf{I}_{o,k})\) from its own local observations \(\mathbf{z}_{o,k}\) as shown in Appendix 2.

  • Information fusion: The FC receives a set of information contribution metrics \((\mathbf{i}_{i,k}, \mathbf{I}_{i,k})\), where \(i=1,2,\cdots,|l_{k}|\), from the surprisal cameras in the cluster. The global information vector and matrix \(\left(\widehat{\mathbf{y}}_{k|k}, \mathbf{Y}_{k|k}\right)\) are obtained by fusing the received surprisal information contributions and its own information contributions \((\mathbf{i}_{o,k}, \mathbf{I}_{o,k})\) with the predicted information vector and matrix \(\left(\widehat{\mathbf{y}}_{k|k-1}, \mathbf{Y}_{k|k-1}\right)\).

  • Global state dissemination: After the information vector and matrix \(\left(\widehat{\mathbf{y}}_{k|k}, \mathbf{Y}_{k|k}\right)\) are computed, the FC broadcasts them in the network. Hence, the cameras in the network have the global knowledge, which can be used as prior information for the local filtering at time step k+1.

The cameras in the observation cluster \(C_{k}\) at time k have two main functions to perform, summarized below: the time update and the surprisal update. The cameras in the observation cluster know the prior global information of the object \(\left(\widehat{\mathbf{y}}_{k-1|k-1}, \mathbf{Y}_{k-1|k-1}\right)\). At each time step k, they perform the following:

  • Time update: The camera predicts the information vector and matrix \(\left(\widehat{\mathbf{y}}_{i,k|k-1}, \mathbf{Y}_{i,k|k-1}\right)\) from the prior global information vector and matrix \(\left(\widehat{\mathbf{y}}_{k-1|k-1}, \mathbf{Y}_{k-1|k-1}\right)\) using the CIF time update as shown in Appendix 1.

  • Surprisal update: Each camera receives the surprisal threshold \(\beta_{k}\) from the FC whenever the observation and/or surprisal cluster size changes. Upon receiving the measurement \(\mathbf{z}_{i,k}\), each camera \(c_{i}\) calculates the corresponding measurement residual and innovation covariance \((\mathbf{e}_{i,k}, \mathbf{P}_{\mathbf{zz},i,k})\). The surprisal threshold rule of Section 4.3 then determines whether it is a surprisal camera or not (see the sketch below). If the camera is a surprisal camera, the information contribution vector and matrix \((\mathbf{i}_{i,k}, \mathbf{I}_{i,k})\) are calculated according to (9) and (10) and transmitted to the FC. If the camera is not a surprisal camera, the surprisal update is aborted.

After the surprisal update, each camera \(c_{i}\) in the network receives the global information \(\left(\widehat{\mathbf{y}}_{k|k}, \mathbf{Y}_{k|k}\right)\) from the FC. Hence, each camera in the network has the knowledge of the global state of the object, which can also be used as the prior information in the local estimation at the next time step k+1.
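A sketch of one camera-side MOTSC step under this protocol. The `cif` object, the `send_to_fc` callback, and the method names are hypothetical stand-ins for the CIF routines of the appendices; `information_contribution` is the helper sketched in Section 3.2:

```python
import numpy as np

def camera_step(cif, y_glob, Y_glob, z, beta_k, send_to_fc):
    """One MOTSC step at a non-FC camera: time update, then surprisal update."""
    # Time update from the disseminated global prior (Appendix 1).
    y_pred, Y_pred = cif.time_update(y_glob, Y_glob)
    # Predicted measurement, residual, and (cross-)covariances (Appendix 2).
    e, P_zz, P_xz, z_pred = cif.predict_measurement(y_pred, Y_pred, z)
    # Surprisal test, written via (21) as a quadratic-form threshold.
    if e @ np.linalg.solve(P_zz, e) >= beta_k:   # surprisal camera
        i_vec, I_mat, _ = information_contribution(
            Y_pred, y_pred, P_xz, cif.R, z, z_pred)
        send_to_fc(i_vec, I_mat)
    # Otherwise, the surprisal update is aborted and nothing is sent.
```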

In this paper, the FC is assumed to be fixed and not affected by node failures. It is also assumed that the delays in transmitting local information to the FC are all smaller than the sampling interval of the cameras; thus, the FC can fuse the arriving information contributions in time. The communication links in the network are assumed to be perfect. Hence, the only reason for a missing information metric from a camera is that the corresponding observations are not informative enough.

6 Results

In this section, the efficiency of the proposed MOTSC method is evaluated on simulation and experimental data. In our approach, the efficiency is defined in terms of the sum of the root mean square errors (RMSEs) between the estimated global state and the ground truth in the x and y directions. Moreover, the energy and bandwidth efficiency is calculated in terms of the average number of transmissions from the cameras in the observation cluster to the FC.

6.1 Simulation results

The simulation considers a VSN with cameras having overlapping FOVs as shown in Fig. 2. All the cameras that can observe the xy-plane, where x∈[−500,500] and y∈[−500,500], form an observation cluster with an FC. The motion of the object is modeled with Gaussian distributed acceleration as given in (1). The ground truth of the object position is simulated assuming that the process noise covariance \(\mathbf{Q}_{k}\) and measurement noise covariance \(\mathbf{R}_{k}\) are diag(5,5) and diag(1,1), respectively. Each camera \(c_{i}\) in the cluster has its own homography function \(\mathbf{h}_{i}\). Since we assume static cameras, the homographies do not change with time k or with the object. The algorithms are evaluated on 1000 different trajectories with different initializations. Figure 3 shows some of the simulated trajectories of the object.

Fig. 3 Simulated trajectories of the object

6.1.1 Scenario 1

In this scenario, the accuracy of the CIF- and EIF-based object tracking methods in the VSN is compared. In this comparison, the proposed surprisal selection method is not employed; hence, all the cameras in the observation cluster participate in the information fusion. In the simulation setup described above, each camera calculates the local information metrics based on its local observations, and the information metrics of the local cameras are fused at the FC. Moreover, the process noise covariance \(\mathbf{Q}_{k}\) and measurement noise covariance \(\mathbf{R}_{k}\) are assumed to be known to all the cameras in the cluster. The cluster is also assumed to be fully connected with perfect communication links to the FC.

Under the above conditions, Fig. 4 shows the average RMSE (ARMSE) of the multi-camera object tracking methods based on the CIF and EIF for different observation cluster sizes. To achieve statistical reliability, the RMSE is averaged over 1000 simulation runs and the 1000 simulated trajectories to yield the ARMSE. From Fig. 4, we can infer that the CIF-based object tracking outperforms the EIF-based method, while the tracking accuracy of both methods improves with increasing cluster size.

Fig. 4 Tracking accuracy of the multi-camera object tracking methods based on the CIF and EIF in terms of average RMSE for different cluster sizes

6.1.2 Scenario 2

In this scenario, the accuracy of the proposed MOTSC is analyzed in comparison with multi-camera object tracking with random cameras, fixed cameras, best cameras, and active sensing cameras.

  • Multi-camera object tracking with random cameras (MOTRC): A random subset of cameras in the observation cluster transmit their local information metrics to the FC independent of the information contained in their measurements.

  • Multi-camera object tracking with fixed cameras (MOTFC): A fixed subset of cameras in the observation cluster transmit their local information metrics to the FC.

  • Multi-camera object tracking with best cameras (MOTBC): All the cameras in the observation cluster \(C_{k}\) send the surprisal of their measurement residual to the FC. The FC ranks the cameras by their surprisal score and informs the \(|l_{k}|\) best cameras to share their local information metrics. The informed cameras then send their local information metrics to the FC. The total number of transmissions to and from the FC in this method is \(|C_{k}|+2|l_{k}|\). The MOTBC method is adapted from [36].

  • Multi-camera object tracking with active sensing cameras (MOTAC): The FC activates or deactivates cameras for the information exchange by maximizing a reward-cost utility function as given in [35], where the reward is the expected information gain (EIG). At each time k, the FC evaluates the utility function for all possible combinations of activated and deactivated cameras before activating the best cameras to participate in the information fusion. We refer to [35] for the complete details.

Figure 5 shows the RMSE of the MOTSC, MOTRC, MOTFC, MOTBC, and MOTAC methods. The x-axis of the figure represents the average number of cameras that participated in the information fusion at each time k. The total number of cameras \(|C_{k}|\) in the observation cluster remains 10. From Fig. 5, we can infer that the tracking accuracy of these methods improves with increasing size of the subset that can participate in the information fusion. However, the proposed MOTSC method outperforms both the MOTRC and the MOTFC for the same number of cameras \(|l_{k}|\) that can transmit to the FC. The MOTSC, MOTBC, and MOTAC methods achieve approximately the same tracking accuracy. However, in the MOTAC method, at each time k, the FC has to evaluate the reward-cost utility function for all possible combinations of activated and deactivated cameras (\(2^{10}\) in this case) before selecting the best possible cameras to participate in the information fusion. Moreover, the camera selection at time k in the MOTAC method does not depend on the current measurements. In the MOTBC method, in order to select the best possible cameras, the FC has to receive the surprisal scores from all the cameras in the observation cluster. The centralized and complex camera selection restricts the scalability of both the MOTAC and the MOTBC methods. In the proposed MOTSC method, on the other hand, the cameras independently decide whether to participate in the information fusion or not.

Fig. 5 Tracking accuracy of the MOTSC, MOTRC, MOTFC, MOTBC, and MOTAC methods. The size \(|C_{k}|\) of the observation cluster is 10, and the size of the fixed, random, or surprisal subset varies from 1 to 10

On the other hand, Fig. 6 shows the number of transmissions sent to the FC in the MOTSC and MOTRC methods. The x-axis shows the theoretical number \(|l_{k}|\) of surprisal cameras, which is used to calculate the surprisal threshold. The y-axis shows the number of transmissions to the FC from the surprisal and random cameras in the corresponding methods. The figure shows that, on average, the number of transmissions to the FC is approximately equal for both methods and matches the theoretical requirement. Even though the MOTBC achieves the same performance as the MOTSC, the number of transmissions in the MOTBC is \(|C_{k}|+2|l_{k}|\), which can be significantly higher than the average number of transmissions \(|l_{k}|\) in the MOTSC.

Fig. 6 The average number of transmissions to the FC in the MOTSC and MOTRC methods. The observation cluster size \(|C_{k}|\) is 10, and the size \(|l_{k}|\) of the random or surprisal subset varies from 1 to 10

6.2 Experimental results

The experimental setup consists of a self-aware multi-camera cluster built in the lab of our institute. The camera cluster consists of four Atom-based cameras (1.6 GHz processor, 2 GB RAM, 30 GB internal SSD) from SLR Engineering and two PandaBoards on which the middleware system ELLA [40] runs. The cameras in the cluster can perform object detection and tracking together with state estimation locally, and they are connected via Ethernet. In the experimental setup, the four cameras in the network have overlapping FOVs. The motion of the object follows predefined tracks; the experiment considers ten such predefined tracks within the overlapping FOV of the four cameras. Figure 7 shows some of the object tracks used for evaluating the proposed MOTSC method. The x- and y-axes represent the dimensions of the lab where the experimental setup is built. Each track has a duration of 120 s. Each camera \(c_{i}\) in the cluster has its own homography function \(\mathbf{h}_{i}\). Since we assume fixed cameras, the homographies do not change with time k. The process noise covariance \(\mathbf{Q}_{k}\) and measurement noise covariance \(\mathbf{R}_{k}\) are set to diag(10,10) and diag(2,2), respectively.

Fig. 7 Examples of the predefined object tracks used for evaluating the MOTSC method

Figure 8 shows the average RMSE of the MOTSC, MOTBC, and MOTRC methods. The x-axis of the figure represents the size \(|l_{k}|\) of the random or surprisal subset of cameras that transmit their local information metrics to the FC at each time k. The total number of cameras \(|C_{k}|\) in the observation cluster remains four irrespective of the desired subset size. To achieve statistical reliability, the RMSE is averaged over the ten predefined tracks discussed above. From Fig. 8, we can infer that the proposed MOTSC outperforms the MOTRC for the same number of cameras \(|l_{k}|\) that can participate in the information fusion. Even though the MOTBC method achieves approximately the same tracking accuracy as the MOTSC method, its number of transmissions to the FC is always \(|C_{k}|+2|l_{k}|\).

Fig. 8 Tracking accuracy of the MOTSC, MOTRC, and MOTBC methods in the experimental setup described above. The size of the observation cluster is 4, and the size \(|l_{k}|\) of the random or surprisal subset varies from 1 to 4

On the other hand, Fig. 9 shows the average number of transmissions sent to the FC in the MOTSC and MOTRC methods. The x-axis shows the theoretical number \(|l_{k}|\) of surprisal cameras, which is used to calculate the surprisal threshold. The y-axis shows the average number of transmissions to the FC by the corresponding methods during the experiment. The figure shows that, on average, the number of transmissions to the FC is approximately equal for both methods and matches the theoretical requirement. Hence, the proposed MOTSC achieves better accuracy than the MOTRC for the same average number of transmissions.

Fig. 9 The average number of transmissions sent to the FC in the multi-camera object tracking with surprisal and random cameras. The observation cluster size \(|C_{k}|\) is 4, and the size \(|l_{k}|\) of the random or surprisal subset varies from 1 to 4

7 Conclusions

In this work, a multi-camera object tracking method with surprisal cameras in a VSN is proposed. The cameras in the VSN that can observe an object form an observation cluster with a fixed FC. Due to bandwidth constraints and energy limitations, it is usually desirable that only a subset of cameras exchange their local information with the fusion center. In our approach, each camera runs a local object tracking algorithm based on the on-board CIF. Each camera independently determines whether its observations are informative enough by using the surprisal of its measurement residual. Only if a camera’s measurements are informative enough (i.e., it is a surprisal camera) does it calculate and transmit the local information vector and matrix to the fusion center. The global state of the object is obtained by fusing the local information from the surprisal cameras at the fusion center. The proposed scheme also ensures that, on average, only a desired number of cameras participate in the information exchange. The proposed multi-camera object tracking with surprisal cameras shows a considerable improvement in tracking accuracy over multi-camera object tracking with random and fixed cameras for the same number of transmissions to the fusion center.

8 Endnote

¹ In general, the surprisal is defined for discrete random variables (DRVs). Hence, we consider the innovation to be a DRV.

9 Appendices

The multi-sensor CIF consists of two main steps at each sensor i and time k: the time update and the measurement update.

9.1 Appendix 1: time update (TU)

Calculate the predicted information vector and information matrix \(\left [\widehat {\mathbf {y}}_{i,k|k-1}, \mathbf {Y}_{i, k|k-1}\right ]\) from global prior information \(\left [\widehat {\mathbf {y}}_{k-1|k-1}, \mathbf {Y}_{k-1|k-1}\right ]\).

  1. Calculate the state estimate

    $$ \widehat{\mathbf{x}}_{k-1|k-1} = \mathbf{Y}^{-1}_{k-1|k-1}\widehat{\mathbf{y}}_{k-1|k-1}. $$
  2. Compute the cubature points for \(m = 1,2,\ldots,2n_{\mathbf{x}}\)

    $$\mathbf{cp}_{m,k-1\mid k-1} = \sqrt{\mathbf{Y}^{-1}_{k-1\mid k-1}}\xi_{m} + \widehat{\mathbf{x}}_{k-1\mid k-1}, $$

    where \(n_{\mathbf{x}}\) is the length of the state vector, and \(\xi_{m}\) represents the mth intersection point of the surface of the \(n_{\mathbf{x}}\)-dimensional unit sphere and its axes.

  3. Propagate the cubature points through the motion model

    $$\mathbf{x}^{*}_{m,i,k \mid k-1} = \mathbf{f}_{i,k}\left(\mathbf{cp}_{m,k-1\mid k-1}\right). $$
  4. Calculate the predicted state as

    $$\widehat{\mathbf{x}}_{i,k\mid k-1} = \frac{1}{2n_{\mathbf{x}}}\sum^{2n_{\mathbf{x}}}_{m=1} \mathbf{x}^{*}_{m,i,k\mid k-1}. $$
  5. Calculate the predicted error covariance as

    $$\textbf{P}_{i,k|k-1} = \textbf{M}_{i,k|k-1}\textbf{M}^{T}_{i,k|k-1} + \textbf{Q}^{s}_{i,k}, $$

    where \(\mathbf{Q}^{s}_{i,k}\) is the process noise covariance. The weighted centered matrix of the predicted cubature points \(\mathbf{M}_{i,k|k-1}\) is given as

    $$\begin{aligned} \mathbf{M}_{i,k|k-1} &= \frac{1}{\sqrt{2n_{\mathbf{x}}}} \left[\mathbf{x}^{*}_{1,i,k \mid k-1}- \widehat{\mathbf{x}}_{i,k\mid k-1} \quad \mathbf{x}^{*}_{2,i,k \mid k-1}\right. \\ &\quad\left. -\widehat{\mathbf{x}}_{i,k\mid k-1} \cdots \mathbf{x}^{*}_{2n_{\mathbf{x}},i,k \mid k-1}- \widehat{\mathbf{x}}_{i,k\mid k-1}\right]. \end{aligned} $$
  6. Compute the predicted information matrix and predicted information vector

    $$ \mathbf{Y}_{i,k|k-1} = \mathbf{P}^{-1}_{i,k|k-1}, $$
    $$ \widehat{\mathbf{y}}_{i,k|k-1} = \mathbf{Y}_{i,k|k-1}\widehat{\mathbf{x}}_{i,k|k-1}. $$
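The time update above can be condensed into a short NumPy routine. This is a sketch under the standard third-degree cubature rule (points at \(\pm\sqrt{n_{\mathbf{x}}}\) along the axes of the unit sphere); the function names are illustrative:

```python
import numpy as np

def cubature_points(x_hat, P):
    """2*n cubature points: x_hat +/- sqrt(n) * columns of chol(P)."""
    n = x_hat.shape[0]
    S = np.linalg.cholesky(P)
    xi = np.sqrt(n) * np.hstack([np.eye(n), -np.eye(n)])   # (n, 2n)
    return x_hat[:, None] + S @ xi                          # (n, 2n)

def cif_time_update(y_hat, Y, f, Q):
    """CIF time update (Appendix 1); f is the motion model, Q = Q^s."""
    P = np.linalg.inv(Y)
    x_hat = P @ y_hat                                       # step 1
    cp = cubature_points(x_hat, P)                          # step 2
    Xs = np.column_stack([f(cp[:, m]) for m in range(cp.shape[1])])  # step 3
    x_pred = Xs.mean(axis=1)                                # step 4
    M = (Xs - x_pred[:, None]) / np.sqrt(cp.shape[1])
    P_pred = M @ M.T + Q                                    # step 5
    Y_pred = np.linalg.inv(P_pred)                          # step 6
    return Y_pred @ x_pred, Y_pred
```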

9.2 Appendix 2: measurement update (MU)

Each sensor calculates its information contribution vector and matrix \([\mathbf{i}_{i,k}, \mathbf{I}_{i,k}]\) from the predicted information vector and matrix \(\left[\widehat{\mathbf{y}}_{i,k|k-1}, \mathbf{Y}_{i,k|k-1}\right]\) and the measurement \(\mathbf{z}_{i,k}\).

  1. Calculate the cubature points

    $$ \mathbf{cp}_{m,i,k\mid k-1} = \sqrt{\mathbf{P}_{i,k\mid k-1}}\xi_{m} + \widehat{\mathbf{x}}_{i,k\mid k-1}. $$
  2. Propagate the cubature points through the observation function

    $$\mathbf{z}^{*}_{m,i,k\mid k-1} = \mathbf{h}_{i,k}\left(\mathbf{cp}_{m,i,k\mid k-1}\right). $$
  3. Calculate the predicted measurement

    $$\widehat{\mathbf{z}}_{i,k\mid k-1} = \frac{1}{2n_{\mathbf{x}}}\sum^{2n_{\mathbf{x}}}_{m=1} \mathbf{z}^{*}_{m,i,k\mid k-1}. $$
  4. Calculate the measurement residual

    $$\mathbf{e}_{i,k} = \mathbf{z}_{i,k} - \widehat{\mathbf{z}}_{i,k\mid k-1}. $$
  5. Calculate the cross-covariance

    $$\begin{aligned} \mathbf{P}_{\mathbf{xz},i,k\mid k-1} &= \frac{1}{2n_{\mathbf{x}}}\sum^{2n_{\mathbf{x}}}_{m=1} \mathbf{cp}_{m,i,k\mid k-1}\mathbf{z}^{*T}_{m,i,k\mid k-1} \\ &\quad-\widehat{\mathbf{x}}_{i,k\mid k-1}\widehat{\mathbf{z}}^{T}_{i,k\mid k-1}. \end{aligned} $$
  6. Calculate the information contribution matrix

    $$ \mathbf{I}_{i,k} = \mathbf{Y}_{i,k|k-1}\mathbf{P}_{\mathbf{xz},i,k\mid k-1}\mathbf{R}^{-1}_{i,k}\mathbf{P}^{T}_{\mathbf{xz},i,k\mid k-1}\mathbf{Y}^{T}_{i,k|k-1}, $$

    where R i,k is the measurement noise covariance matrix.

  7. Compute the information contribution vector

    $$\begin{aligned} \mathbf{i}_{i,k} &= \mathbf{Y}_{i,k|k-1}\mathbf{P}_{\mathbf{xz},i,k\mid k-1}\mathbf{R}^{-1}_{i,k} \\ &\quad\left(\mathbf{e}_{i,k}+\mathbf{P}^{T}_{\mathbf{xz},i,k\mid k-1}\mathbf{Y}^{T}_{i,k|k-1}\widehat{\mathbf{x}}_{i,k\mid k-1}\right). \end{aligned} $$
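Analogously, the measurement update can be sketched as follows, reusing `cubature_points` from the time update sketch and `information_contribution` from Section 3.2 (`h` is the camera's homography-based observation function and `R` its noise covariance; this is a sketch, not the paper's reference implementation):

```python
import numpy as np

def cif_measurement_update(y_pred, Y_pred, z, h, R):
    """CIF measurement update (Appendix 2) for one camera."""
    P_pred = np.linalg.inv(Y_pred)
    x_pred = P_pred @ y_pred
    cp = cubature_points(x_pred, P_pred)                          # step 1
    Zs = np.column_stack([h(cp[:, m]) for m in range(cp.shape[1])])  # step 2
    z_pred = Zs.mean(axis=1)                                      # step 3
    e = z - z_pred                                                # step 4
    P_xz = cp @ Zs.T / cp.shape[1] - np.outer(x_pred, z_pred)     # step 5
    # Steps 6-7 coincide with (9)-(10):
    return information_contribution(Y_pred, y_pred, P_xz, R, z, z_pred)
```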