Our approach first extracts object detections from every frame of a generic image sequence. Given all the detections in each frame, we use a modified tracking-by-detection method [12] to associate the bounding boxes across frames. This algorithm computes a distance matrix using patch appearance and associates detections using the Hungarian method for bipartite matching. We relaxed the term that enforces smoothness of the object trajectory, because camera motion may not be consistent between consecutive frames, which can place the corresponding consecutive bounding boxes far apart. Note that bounding boxes are commonly not precisely aligned with the true object centre and often include a portion of background.
We then assume that the object is bounded by a rectangular region \(\mathcal{B}_i\) in image i. In 3D space, each region \(\mathcal{B}_i\) defines a semi-infinite pyramid \(\mathcal{Q}_i\) with its apex at the camera centre (see Fig. 1), which bounds the possible locus of the object. In the case of two views, assuming that the object’s projections are bounded by rectangles \(\mathcal{B}_1\) and \(\mathcal{B}_2\) in the two images, the object in space must lie within a polyhedron \(\mathcal{D}\) as in Fig. 1. Geometrically, \(\mathcal{D}\) is obtained by intersecting the two semi-infinite pyramids defined by the two rectangles \(\mathcal{B}_1\) and \(\mathcal{B}_2\) and the respective centres of projection \(\mathrm {C}_1\) and \(\mathrm {C}_2\).
In the general case of n views, the object is localised inside the polyhedron formed by the intersection of the n semi-infinite pyramids generated by the rectangles \(\mathcal{B}_1, \dots , \mathcal{B}_n\):
$$\begin{aligned} \mathcal{D}= \mathcal{Q}_1\cap \mathcal{Q}_2 \dots \cap \mathcal{Q}_n. \end{aligned}$$
(1)
Analytically, the polyhedron \(\mathcal{D}\) is defined as the following set:
$$\begin{aligned} \mathcal{D}= \{{X} \in \mathbb {R}^3 : \exists {x}_i \in \mathcal{B}_i, i = 1\dots n \text { s.t. }\forall i : {x}_i = \varPi _i({X}) \} \end{aligned}$$
(2)
where \(\varPi _i\) is the known perspective projection onto the i-th image.
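As an illustration, Eq. (2) can be read directly as a membership test. The following minimal Python sketch assumes that each projection \(\varPi _i\) is given as a 3×4 matrix scaled so that points in front of the camera have positive depth, and that each box is stored as (u_min, v_min, u_max, v_max); the function names and the two-camera setup are hypothetical and serve only as an example.

```python
def project(P, X):
    """Project the 3D point X with a 3x4 matrix P; return homogeneous (u, v, w)."""
    Xh = (X[0], X[1], X[2], 1.0)
    return tuple(sum(p * x for p, x in zip(row, Xh)) for row in P)


def in_solution_set(X, cameras, boxes):
    """Eq. (2): X belongs to D iff every projection falls inside its box."""
    for P, (umin, vmin, umax, vmax) in zip(cameras, boxes):
        u, v, w = project(P, X)
        if w <= 0:  # behind the camera: outside the semi-infinite pyramid
            return False
        if not (umin <= u / w <= umax and vmin <= v / w <= vmax):
            return False
    return True


# Hypothetical two-view setup: unit focal length, second camera
# translated by 1 along the x axis.
P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
P2 = [[1, 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]]
boxes = [(0.0, -0.1, 0.2, 0.1), (-0.2, -0.1, 0.0, 0.1)]

inside = in_solution_set((0.5, 0.0, 5.0), [P1, P2], boxes)   # True
outside = in_solution_set((0.5, 0.0, 1.0), [P1, P2], boxes)  # False
```

Such a test only decides membership for a given point; it does not describe the shape of \(\mathcal{D}\), which is the purpose of the methods discussed next.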
3.1 Vertex Enumeration Solution
The semi-infinite pyramid \(\mathcal{Q}_i\) can be written as the intersection of the four negative half-spaces \(\mathcal{H}^i_1,\mathcal{H}^i_2,\mathcal{H}^i_3, \mathcal{H}^i_4\) defined by its supporting planes. Thus, the solution set \(\mathcal{D}\) can be expressed as the intersection of 4n negative half-spaces:
$$\begin{aligned} \mathcal{D}= \bigcap _{\begin{array}{c} i=1\dots n\\ \ell =1 \dots 4 \end{array}} \mathcal{H}^i_\ell . \end{aligned}$$
(3)
These half-space inequalities implicitly represent the polyhedron \(\mathcal{D}\); this is called the H-representation of \(\mathcal{D}\). However, we aim at an explicit description of \(\mathcal{D}\) in terms of vertices and edges, called a V-representation. In Computational Geometry, the problem of producing a V-representation from an H-representation is known as the vertex enumeration problem. The vertices and the faces of \(\mathcal{D}\) can be enumerated in \(O(n \log n)\) time, where n is the number of cameras [22]. In particular, we used the implementation of the reverse search vertex enumeration algorithm described in [2], which is available on the web (Footnote 1).
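To make the H-to-V conversion concrete, the sketch below enumerates the vertices of a bounded polyhedron by brute force: every triple of supporting planes is intersected, and the intersection point is kept if it satisfies all inequalities. This is not the reverse search algorithm of [2]; it is a cubic-time illustration that assumes a bounded polyhedron, with each half-space given as a hypothetical pair (a, b) meaning a·X ≤ b.

```python
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(A, b):
    """Solve the 3x3 system A x = b by Cramer's rule; None if singular."""
    d = det3(A)
    if abs(d) < 1e-12:
        return None
    cols = []
    for k in range(3):
        Ak = [[b[r] if c == k else A[r][c] for c in range(3)] for r in range(3)]
        cols.append(det3(Ak) / d)
    return tuple(cols)


def enumerate_vertices(halfspaces, tol=1e-9):
    """Return the vertices of the intersection of half-spaces a.X <= b."""
    vertices = []
    for triple in combinations(halfspaces, 3):
        p = solve3([h[0] for h in triple], [h[1] for h in triple])
        if p is None:
            continue  # the three planes do not meet in a single point
        feasible = all(
            sum(ai * pi for ai, pi in zip(a, p)) <= b + tol for a, b in halfspaces
        )
        duplicate = any(
            max(abs(x - y) for x, y in zip(p, q)) < tol for q in vertices
        )
        if feasible and not duplicate:
            vertices.append(p)
    return vertices


# Example: the unit cube as six half-spaces.
cube = [
    ((1, 0, 0), 1), ((-1, 0, 0), 0),
    ((0, 1, 0), 1), ((0, -1, 0), 0),
    ((0, 0, 1), 1), ((0, 0, -1), 0),
]
cube_vertices = enumerate_vertices(cube)  # the 8 corners of the cube
```

For the 4n half-spaces of Eq. (3) a dedicated vertex enumeration implementation remains preferable, since the brute-force scan grows cubically with the number of planes and does not report the rays of an unbounded polyhedron.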
In the following, this approach based on Computational Geometry (proposed in [10]) will be referred to as the “CG approach”. In the next section, following [9], we shall describe how the solution set can be enclosed with an axis-aligned box using an approach based on Interval Analysis, henceforth dubbed “IA approach”.
3.2 Bounded Computational Geometry Method
The polyhedron generated by the CG approach can effectively approximate the 3D volume occupied by a detected object if several images of the object with a large baseline between cameras are available. Conversely, when there are few images with a narrow baseline between cameras, the computed polyhedron can easily overestimate the occupancy volume. To reduce this effect, we bound the estimated volume by including a prior on its maximum elongation. First, the centroid of the object is found by triangulating the centres of the bounding boxes in different views [4]. Then, the final polyhedron is obtained by cutting the pyramid generated by CG with two planes whose normals are aligned with the optical axis of the camera, placed before and after the 3D centroid of the object at a distance equal to half of the maximum size of the object (Footnote 2). We will henceforth refer to this variation as the CG\(_{b}\) method.
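The two cutting planes can be written as two additional half-spaces in the same form a·X ≤ b used for the H-representation. The sketch below is a minimal illustration under stated assumptions: the centroid, the optical axis and the maximum object size are supplied by the caller, and the function names are hypothetical.

```python
import math


def bounding_planes(centroid, optical_axis, max_size):
    """Return (near, far) half-spaces, each as (normal, offset) with normal.X <= offset."""
    norm = math.sqrt(sum(a * a for a in optical_axis))
    n = tuple(a / norm for a in optical_axis)           # unit optical axis
    depth = sum(a * c for a, c in zip(n, centroid))     # centroid position along the axis
    far = (n, depth + max_size / 2.0)                   # n.X <= depth + size/2
    near = (tuple(-a for a in n), -(depth - max_size / 2.0))  # n.X >= depth - size/2
    return near, far


def satisfies(point, halfspace):
    """Check whether a point lies in the half-space normal.X <= offset."""
    normal, offset = halfspace
    return sum(a * p for a, p in zip(normal, point)) <= offset


# Example: centroid 5 units in front of a camera looking along +z,
# with an assumed maximum object size of 2 units.
near, far = bounding_planes((0.0, 0.0, 5.0), (0.0, 0.0, 2.0), 2.0)
kept = satisfies((0.0, 0.0, 5.0), near) and satisfies((0.0, 0.0, 5.0), far)     # True
clipped = satisfies((0.0, 0.0, 7.0), near) and satisfies((0.0, 0.0, 7.0), far)  # False
```

Under this reading, the pair (near, far) could simply be appended to the list of half-spaces before enumerating the vertices.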
3.3 Interval Analysis
Interval Analysis [18] is an arithmetic defined on intervals, rather than on real numbers. It was first introduced for bounding the measurement errors of physical quantities for which no statistical distribution was known. In the remainder of this section we denote intervals in boldface. Underscores and overscores represent the lower and upper bounds of intervals, respectively. \(\mathbb {IR}\) stands for the set of real intervals. If f(x) is a function defined over an interval \(\varvec{ x}\), then \(\mathrm {range}({f}, \varvec{ x})\) denotes the range of f(x) over \(\varvec{ x}\).
If \(\varvec{ x} =\left[ \underline{{x}}, \overline{{x}}\right] \) and \(\varvec{ y}=\left[ \underline{{y}}, \overline{{y}}\right] \), a binary operation between \(\varvec{ x}\) and \(\varvec{ y}\) is defined in interval arithmetic as:
$$\begin{aligned} \varvec{ x} \circ \varvec{ y} = \left\{ x \circ y\ |\ x \in \varvec{ x} \wedge y \in \varvec{ y} \right\} , \forall \; \circ \in \left\{ +, - , \times , \div \right\} . \end{aligned}$$
Operationally, interval operations are defined by the min-max formula:
$$\begin{aligned} \varvec{ x} \circ \varvec{ y} = \left[ \min \left\{ \underline{x} \circ \underline{y}, \underline{x} \circ \overline{y}, \overline{x} \circ \underline{y}, \overline{x} \circ \overline{y} \right\} , \right. \left. \max \left\{ \underline{x} \circ \underline{y}, \underline{x} \circ \overline{y}, \overline{x} \circ \underline{y}, \overline{x} \circ \overline{y}\right\} \right] \end{aligned}$$
(4)
Interval division \(\varvec{ x}/\varvec{ y}\) is undefined when \(0 \in \varvec{ y}\).
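The min-max formula of Eq. (4) translates almost literally into code. The following sketch is a minimal illustration: the class name `Interval` is hypothetical, and floating-point results are not rounded outwards, so the enclosures are rigorous only up to rounding error (a dedicated interval library takes care of this).

```python
from dataclasses import dataclass
import operator


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def _minmax(self, other, op):
        # Eq. (4): apply the operation to all four endpoint pairs.
        c = [op(a, b) for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(min(c), max(c))

    def __add__(self, other):
        return self._minmax(other, operator.add)

    def __sub__(self, other):
        return self._minmax(other, operator.sub)

    def __mul__(self, other):
        return self._minmax(other, operator.mul)

    def __truediv__(self, other):
        if other.lo <= 0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains zero")
        return self._minmax(other, operator.truediv)


s = Interval(1, 2) + Interval(3, 5)    # Interval(lo=4, hi=7)
p = Interval(-1, 2) * Interval(3, 4)   # Interval(lo=-4, hi=8)
q = Interval(1, 2) / Interval(2, 4)    # Interval(lo=0.25, hi=1.0)
```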
In general, for arbitrary functions, interval computation cannot produce the exact range, but only approximate it.
Definition 1
(Interval extension [23]). A function \(\varvec{ f}: \mathbb {IR} \rightarrow \mathbb {IR} \) is said to be an interval extension of \(f: \mathbb {R} \rightarrow \mathbb {R} \) provided that \( \mathrm {range}({f},\varvec{ x}) \subseteq \varvec{ f}(\varvec{ x}) \) for all intervals \(\varvec{ x} \in \mathbb {IR}\) within the domain of \(\varvec{ f}\).
Such a function is also called an inclusion function. So, given a function f and a domain \(\varvec{ x}\), the inclusion function yields a rigorous bound (or enclosure) on \({{\mathrm{\mathrm {range}}}}({f},\varvec{ x})\). This property is particularly suited for error propagation: If \(\varvec{ x}\) bounds the input error on the variable x, \(\varvec{ f}(\varvec{ x})\) bounds the output error. Therefore, if the exact value is contained in interval data, the exact value will be contained in the interval result.
Definition 2
(Natural interval extension [23]). Let us consider a function f computable as an arithmetic expression \(\mathsf {f}\), composed of a finite sequence of operations applied to constants, argument variables or intermediate results. A natural interval extension of such a function, denoted by \(\mathsf {f}(\varvec{ x})\), is obtained by replacing variables with intervals and executing all arithmetic operations according to the rules above.
Note that different expressions for the same function yield different natural interval extensions. For instance, \(\varvec{ \mathsf {f}_1}(\varvec{ x})={\varvec{ x}}^2 - \varvec{ x}\) and \(\varvec{ \mathsf {f}_2}(\varvec{ x})={\varvec{ x}(\varvec{ x} - 1)}\) are both natural interval extensions of the same function \(f(x) = x^2 - x\). The cause is the dependency between multiple occurrences of the same variable. Consider the expression \(\mathsf {f}(x) = x -x\), which is identically 0. Evaluating it on the interval [1,2] gives \(\varvec{ \mathsf {f}}([1,2]) = [1,2] - [1,2] = [-1, 1]\), because the information that the two intervals represent the same variable is lost. In general, although the ranges of individual interval arithmetic operations are exact, this is no longer so when operations are composed. For example, if \(\varvec{ x} = \left[ 0, 1\right] \) we have \( \varvec{ \mathsf {f}_2}(\varvec{ x}) = \left[ 0, 1\right] (\left[ 0, 1\right] - 1 ) = \left[ 0,1\right] \left[ -1, 0\right] = \left[ -1, 0\right] , \) which strictly includes \(\mathrm {range}(f,\left[ 0, 1\right] )= \left[ -1/4, 0\right] \).
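The effect of the chosen expression can be checked numerically. The sketch below is a minimal illustration that represents intervals as hypothetical (lo, hi) tuples, evaluates the two natural extensions on [0, 1], and compares them with a densely sampled range of \(f(x) = x^2 - x\).

```python
def isub(a, b):
    """Interval subtraction: [a_lo - b_hi, a_hi - b_lo]."""
    return (a[0] - b[1], a[1] - b[0])


def imul(a, b):
    """Interval multiplication by the min-max formula."""
    c = [x * y for x in a for y in b]
    return (min(c), max(c))


def f1(x):
    """Natural extension of x^2 - x (x occurs three times)."""
    return isub(imul(x, x), x)


def f2(x):
    """Natural extension of x (x - 1) (x occurs twice)."""
    return imul(x, isub(x, (1, 1)))


x = (0, 1)
enclosure_f1 = f1(x)                 # (-1, 1)
enclosure_f2 = f2(x)                 # (-1, 0)
zero_expr = isub((1, 2), (1, 2))     # (-1, 1), although x - x is identically 0

# Sampled range of f on [0, 1]: the minimum -1/4 is attained at x = 1/2.
samples = [(k / 100) ** 2 - k / 100 for k in range(101)]
sampled_range = (min(samples), max(samples))  # (-0.25, 0.0)
```

Both enclosures contain the true range \([-1/4, 0]\), but the factored form, in which the variable occurs fewer times, gives the tighter one.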
It is well-known that Interval Analysis systematically overestimates the bound on the results of a computation: this is the price to pay for its simplicity.
3.4 Interval-Based Triangulation
Let us assume that we can write a closed form expression that relates the 3D point \({X}\) to its projections \({x}_1 = \varPi _1(X)\) and \({x}_2 = \varPi _2(X)\) in two images (see [9]):
$$\begin{aligned} {X} = \mathsf {f} ( {x}_1, {x}_2) \end{aligned}$$
(5)
If we let \({x}_1\) and \({x}_2\) in Eq. (5) vary in \(\mathcal{B}_1\) and \(\mathcal{B}_2\) respectively, then \(\mathrm {range}(\mathsf {f}, \mathcal{B}_1 \times \mathcal{B}_2)\) describes the polyhedron \(\mathcal{D}\) that contains the object. Interval Analysis gives us a way to compute an axis-aligned bounding box containing \(\mathcal{D}\) by simply evaluating \(\varvec{ \mathsf {f}}(\varvec{ {x}}_1, \varvec{ {x}}_2)\), the natural interval extension of \(\mathsf {f}\), with \(\varvec{ {x}}_1 = \mathcal{B}_1\) and \(\varvec{ {x}}_2 = \mathcal{B}_2\), i.e. with each image coordinate replaced by the interval spanned by the corresponding side of the bounding box.
The 3D interval \(\varvec{ \mathsf {f}}(\varvec{ {x}}_1, \varvec{ {x}}_2)\) encloses the polyhedron \(\mathcal{D}\), and, in general, it is an overestimate. In fact, intervals can model only axis-aligned rectangular boxes; moreover, as seen in the examples, interval evaluation inevitably introduces overestimation.
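To give a feeling for how such an evaluation proceeds, the sketch below uses a deliberately simplified setting that is an assumption of this example and not the closed-form expression of [9]: a rectified stereo pair with focal length `focal` and baseline `baseline`, for which \(Z = fb/(u_1 - u_2)\), \(X = u_1 Z / f\) and \(Y = v_1 Z / f\). Intervals are hypothetical (lo, hi) tuples and only the vertical extent of the first box is used.

```python
def isub(a, b):
    """Interval subtraction."""
    return (a[0] - b[1], a[1] - b[0])


def imul(a, b):
    """Interval multiplication by the min-max formula."""
    c = [x * y for x in a for y in b]
    return (min(c), max(c))


def idiv(a, b):
    """Interval division; undefined if the divisor contains zero."""
    if b[0] <= 0 <= b[1]:
        raise ZeroDivisionError("divisor interval contains zero")
    c = [x / y for x in a for y in b]
    return (min(c), max(c))


def triangulate_rectified(u1, v1, u2, focal, baseline):
    """Natural interval extension of rectified-stereo triangulation."""
    disparity = isub(u1, u2)
    fb = focal * baseline
    Z = idiv((fb, fb), disparity)
    X = idiv(imul(u1, Z), (focal, focal))
    Y = idiv(imul(v1, Z), (focal, focal))
    return X, Y, Z


# Horizontal extents of the two boxes and vertical extent of the first one.
X, Y, Z = triangulate_rectified(
    u1=(0.125, 0.25), v1=(-0.125, 0.125), u2=(-0.25, -0.125),
    focal=1.0, baseline=1.0,
)
# X == (0.25, 1.0), Y == (-0.5, 0.5), Z == (2.0, 4.0)
```

In this toy case the variable \(u_1\) enters both the disparity and the product \(u_1 Z\), so the resulting box is wider than the tightest axis-aligned box around the true solution set, in line with the overestimation discussed above.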
The approach extends easily to the general n-view case. As defined in Sect. 3, the sought polyhedron \(\mathcal{D}\) is formed by the intersection of the semi-infinite pyramids generated by back-projecting in space the sets \(\mathcal{B}_1, \dots , \mathcal{B}_n\). Thanks to the associativity of intersection (together with its commutativity and idempotence), \(\mathcal{D}\) can be obtained by first intersecting pairs of such pyramids and then intersecting the results. Let \(\mathcal{D}^2_{i,j}\) be the solution set of the triangulation between view i and view j. Then:
$$\begin{aligned} \mathcal{D}= \bigcap _{\begin{array}{c} i=1,\dots ,n\\ j=i+1,\dots ,n \end{array}} \mathcal{D}^2_{i,j}. \end{aligned}$$
(6)
An enclosure of the solution set \(\mathcal{D}\) is obtained by intersecting the \(n(n-1)/2\) enclosures of \(\mathcal{D}^2_{i,j}\) computed with the IA method described above. Since each enclosure contains the respective solution set \(\mathcal{D}^2_{i,j}\), their intersection contains \(\mathcal{D}\). In summary, the IA approach yields a rectangular axis-aligned bounding box that contains the polyhedron \(\mathcal{D}\): the box \(\varvec{ \mathsf {f}}(\varvec{ {x}}_1, \varvec{ {x}}_2)\) for two views, and the intersection of the pairwise boxes for n views. This method is faster and easier to implement than the CG one (it can be built on an interval arithmetic library such as INTLAB [25]), but the enclosure is, in general, an overestimate.
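Intersecting axis-aligned boxes reduces to taking, per coordinate, the largest lower bound and the smallest upper bound. The sketch below is a minimal illustration; the box representation (three (lo, hi) pairs) and the numerical values are hypothetical.

```python
def intersect_boxes(boxes):
    """Intersect axis-aligned boxes given as ((x_lo, x_hi), (y_lo, y_hi), (z_lo, z_hi)).

    Returns None when the intersection is empty.
    """
    result = boxes[0]
    for box in boxes[1:]:
        result = tuple(
            (max(a_lo, b_lo), min(a_hi, b_hi))
            for (a_lo, a_hi), (b_lo, b_hi) in zip(result, box)
        )
        if any(lo > hi for lo, hi in result):
            return None
    return result


# Example with n = 3 views, i.e. n(n-1)/2 = 3 pairwise enclosures.
pairwise = [
    ((0, 4), (0, 4), (1, 9)),
    ((1, 5), (-1, 3), (2, 8)),
    ((-2, 3), (1, 6), (0, 7)),
]
enclosure = intersect_boxes(pairwise)  # ((1, 3), (1, 3), (2, 7))
```

An empty result would indicate that the pairwise enclosures are mutually inconsistent, for instance because of a wrong association between detections.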