In this section, we highlight the shortcomings of using only intensity images or disparity images to detect flying bees at the beehive entrance. Accordingly, we introduce our hybrid intensity depth segmentation (HIDS) method, a hybrid segmentation based on depth and intensity images.
In terms of intensity images, many motion detection methods are based on background modeling. Depending on the conditions, simple methods (e.g., approximated median filtering) can perform nearly as well as more complex techniques (e.g., Gaussian mixture models) [24]. However, most methods based on intensity images reach their limits under difficult conditions. In our application, intensity values were strongly affected by recurrent and rapid changes in lighting, shadows, and reflections. When the thresholds for motion detection are lowered, adaptive methods tend to absorb near-static elements into the background too quickly. Even when focusing on small temporal windows, the results are unsatisfactory.
Disparities were computed by a stereo pair matching algorithm. The disparity map contains holes (unmatched areas), and there is no certainty that a given hole corresponds to a target. These holes are caused by textures that cannot be matched because they are too uniform, too different between the two views, or simply outside the disparity range. Under favorable conditions, flying targets are represented by a peak on the depth map; under difficult conditions, they are represented by holes. However, holes do not necessarily indicate the presence of a target. Figure 7 shows different effects that can be observed on the depth map depending on the situation. The most common reason for holes is that the part of the background hidden by a flying target differs depending on the viewpoint from which the target is observed, so it becomes unmatchable.
The strength of our segmentation method is that it relies first on the depth map, on which potential targets (peaks and holes) are detected, and then confirms these candidates using the motion calculated from the corresponding intensity images. Depending on the light, flying bees project shadows onto the flight board, which may be detected as motion on the intensity map. However, a hole is unlikely to be observed on the depth map in such an area: a detected motion indicates a significant change in texture, and significant changes in texture allow matching for disparity computation in most cases rather than resulting in holes. The combination of disparity maps and intensity images therefore prevents the false detections that are generally triggered by the shadows of flying bees.
3.1 Flying target detection
Our segmentation method is an extension of standard motion detection methods with adaptive background modeling. The main improvement is the use of the depth information to drive the adaptation of the background intensity model.
The stereo camera provides a pair of grayscale images (left and right) and a corresponding disparity map. Below, $I_{t,u,v}$ refers to the intensity of the pixel at time t and position (u,v), while $D_{t,u,v}$ refers to the distance from the camera at time t and position (u,v). The objective here is to compute two binarized masks based on I and D: a determined depth target mask DDTM and an undetermined depth target mask UDTM. The DDTM represents targets with depth information that may be recovered, and UDTM represents targets with no direct recoverable depth information.
3.1.1 DDTM
The determined depth target mask DDTM is based on background subtraction between D and the computed depth background image, DBG, (see below):
(3)
where Δ d is a threshold. A morphological opening is then applied to remove the noise in the depth map D.
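The thresholding step can be illustrated with a minimal NumPy sketch. It is not a reproduction of equation (3): the use of an absolute difference, the sentinel value `e = 0.0` for undetermined depth, the millimeter-like units, and the function name are all assumptions made for illustration, and the morphological opening is omitted.

```python
import numpy as np

def determined_depth_target_mask(d, dbg, delta_d, e=0.0):
    """Binary mask of pixels whose depth differs from the background depth.

    d, dbg  : 2-D arrays (current depth map and depth background)
    delta_d : threshold on the absolute depth difference
    e       : sentinel marking undetermined depth (assumed value)
    """
    determined = (d != e) & (dbg != e)          # both depths must be known
    return determined & (np.abs(d - dbg) > delta_d)

dbg = np.full((4, 4), 500.0)   # flat background (illustrative values)
d = dbg.copy()
d[1, 1] = 350.0                # a target in front of the background
d[2, 2] = 498.0                # small jitter, below the threshold
d[3, 3] = 0.0                  # hole: undetermined depth, not a DDTM pixel
ddtm = determined_depth_target_mask(d, dbg, delta_d=20.0)
```

Only the pixel at (1, 1) is kept: the jitter stays below the threshold and the hole is left for the UDTM branch.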
The depth background image DBG is computed over several frames by a non-evolutive temporal median:
(4)
Unlike intensities, disparity values are generally stable over time regardless of changes in lighting. A small jitter effect (a few millimeters) is caused by imperfections in intensity image matching, but the values remain close to an average that corresponds to the real depth. The quality of the depth background DBG depends on how crowded the scene is. Increasing the number of frames k used in the median computation, and increasing the time Δ t between two frames, improves the robustness of the depth background computation with respect to passing (flying or walking) targets.
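A per-pixel temporal median over a fixed stack of frames can be sketched as follows. The function name and the toy values are hypothetical, and the sketch does not treat undetermined pixels specially, which a real implementation would probably need to do.

```python
import numpy as np

def depth_background(depth_frames):
    """Per-pixel temporal median over a fixed stack of k depth frames."""
    stack = np.stack(depth_frames, axis=0)   # shape (k, H, W)
    return np.median(stack, axis=0)

# k = 5 frames of a flat background; a target crosses one pixel in one frame.
frames = [np.full((3, 3), 500.0) for _ in range(5)]
frames[2][1, 1] = 350.0
dbg = depth_background(frames)
```

Because the target occupies the pixel in only one of the five frames, the median discards it. This is why a larger k and a longer Δ t make the background more robust to passing targets.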
3.1.2 UDTM
To compute the UDTM, we first computed an undetermined depth mask UDM, which contained regions of the depth map D with undetermined depth and excluded regions of the depth background DBG with undetermined depth. UDM was computed as the intersection of the undetermined pixels of D and the determined pixels of DBG:
(5)
where e is the value assigned by the stereo camera to pixels that have an undetermined depth and ∩ is the logical conjunction operator.
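Following the verbal description above, the UDM can be sketched with two comparisons and a logical conjunction. The sentinel `E = 0.0` and the function name are assumptions; the actual value assigned by the stereo camera is not stated here.

```python
import numpy as np

E = 0.0  # assumed sentinel for undetermined depth

def undetermined_depth_mask(d, dbg, e=E):
    """Pixels undetermined in the current depth map but determined in the background."""
    return (d == e) & (dbg != e)

dbg = np.array([[500.0, 500.0],
                [E,     500.0]])
d = np.array([[E, 500.0],
              [E, 480.0]])
udm = undetermined_depth_mask(d, dbg)
```

The pixel at (1, 0) is excluded because it is already a hole in the background, so its hole in the current frame carries no information about a target.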
Then, an intensity absolute motion mask IAMM was computed based on the absolute difference between an intensity background image IBG (see below) and I. A morphological closing was applied to enlarge potential motion regions and merge them with their close neighbors. IAMM was computed as
(6)
where ⊕ is a dilation using the structuring element $S_1$, ⊖ is an erosion using the structuring element $S_2$, ∗ is a convolution with the mean filter M, and Δ m is the threshold for binarization. We chose $S_1$ to be bigger than $S_2$.
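The sketch below shows one possible reading of this step using NumPy only. The order of operations (absolute difference, mean filter, threshold, dilation, erosion), the square structuring elements, the border handling, and all function names are assumptions rather than the exact form of equation (6).

```python
import numpy as np

def _windows(img, k, pad_value):
    """Stack the k*k shifted copies of img that cover each pixel's k x k neighbourhood."""
    r = k // 2                                   # k is assumed odd
    p = np.pad(img, r, mode="constant", constant_values=pad_value)
    h, w = img.shape
    return np.stack([p[i:i + h, j:j + w] for i in range(k) for j in range(k)])

def mean_filter(img, k):
    return _windows(img.astype(float), k, 0.0).mean(axis=0)

def dilate(mask, k):
    return _windows(mask, k, False).any(axis=0)

def erode(mask, k):
    # Padding with True keeps the image border from being eroded away.
    return _windows(mask, k, True).all(axis=0)

def intensity_absolute_motion_mask(i, ibg, delta_m, m=3, s1=5, s2=3):
    """Smoothed absolute difference, thresholded, then dilated (s1) and eroded (s2 < s1)."""
    diff = mean_filter(np.abs(i.astype(float) - ibg.astype(float)), m)
    return erode(dilate(diff > delta_m, s1), s2)

ibg = np.full((9, 9), 100.0)
i = ibg.copy()
i[4, 4] = 200.0                                  # a single changed pixel
iamm = intensity_absolute_motion_mask(i, ibg, delta_m=10.0)
```

Because the dilation element is larger than the erosion element, the 3 x 3 response left by the mean filter grows to a 5 x 5 region, which is the enlarging effect described above.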
Finally, UDTM was computed as the intersection of UDM and IAMM:
(7)
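The intersection itself is a single logical conjunction. The toy example below, with hypothetical values, also shows why a shadow is rejected: it produces motion, but not a hole in the depth map.

```python
import numpy as np

def undetermined_depth_target_mask(udm, iamm):
    """Holes in the depth map that are confirmed by intensity motion."""
    return udm & iamm

# Three pixels: hole with motion, hole without motion, motion without hole.
udm = np.array([[True, True, False]])
iamm = np.array([[True, False, True]])
udtm = undetermined_depth_target_mask(udm, iamm)
```

The first pixel is kept as a target. The second is a static unmatched area, and the third is a moving area with determined depth, such as a shadow on the flight board.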
The evolutive temporal median intensity background IBG used for the computation of IAMM was initialized as follows:
(8)
The intensity relative motion mask IRMM corresponds to the relative motion map. It was computed by
(9)
where ∘ is a morphological opening using the structuring element $S_3$ to enlarge potential motion regions, and Δ rm is a threshold for binarization.
The foreground mask FG included all the potential targets except areas that exhibited no motion. FG was computed as
(10)
where EDDTM was obtained by applying a morphological dilation to DDTM to ensure that entire targets were included in FG.
(11)
FG was used to perform a selective adaptation of IBG as defined by
(12)
where δ is the learning rate used for the adaptation.
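Selective adaptation can be sketched as an update that is applied only outside the foreground mask. The exponential blend used here is an assumption standing in for equation (12), whose exact update rule may differ (for example, an approximated-median step); the function name is hypothetical.

```python
import numpy as np

def adapt_intensity_background(ibg, i, fg, delta):
    """Move the background toward the current frame, except under the foreground mask."""
    blended = (1.0 - delta) * ibg + delta * i    # assumed update rule
    return np.where(fg, ibg, blended)            # foreground pixels stay frozen

ibg = np.full((2, 2), 100.0)
i = np.full((2, 2), 120.0)
fg = np.array([[True, False],
               [False, False]])
new_ibg = adapt_intensity_background(ibg, i, fg, delta=0.25)
```

Freezing the background under FG is what prevents slow or hovering targets from being absorbed into the background model, while the rest of the image still follows lighting changes.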
Figure 8 illustrates the segmentation process described above. Our segmentation works especially well in outdoor conditions because typically non-uniform textures are present, which is favorable for the stereo matching algorithm.
3.2 Target extraction
The previous step provides two binarized masks, DDTM and UDTM, which represent targets with recoverable depth information and targets with no direct recoverable depth information, respectively. The centroid (center of mass) of each region of DDTM and UDTM is associated with a target. For regions of DDTM, the depth can be recovered by computing the median of the depth values corresponding to the region in the depth map. Moreover, an ellipse is fitted to each region, giving information on the orientation and the size of the target. Figure 9 shows examples of extracted ellipses for the following cases: available and unavailable depth for targets over a clear and a cluttered background.
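The descriptors named above can be sketched for a single region as follows. The ellipse is derived here from second-order moments, which is one common way to fit an ellipse and not necessarily the one used in this work. The function name, the sentinel `e`, and the assumption that the region contains at least two pixels are illustrative, and connected-component labeling is left out.

```python
import numpy as np

def describe_region(region, depth, e=0.0):
    """Centroid, median depth, and moment-based ellipse of one boolean region mask."""
    vs, us = np.nonzero(region)                      # rows are v, columns are u
    centroid = (float(us.mean()), float(vs.mean()))
    vals = depth[region]
    vals = vals[vals != e]                           # ignore undetermined depths
    median_depth = float(np.median(vals)) if vals.size else None
    cov = np.cov(np.vstack([us, vs]), bias=True)     # 2 x 2 second-order moments
    evals, evecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    minor, major = 4.0 * np.sqrt(np.maximum(evals, 0.0))  # full axis lengths
    vx, vy = evecs[:, 1]                             # direction of the major axis
    angle = float(np.degrees(np.arctan2(vy, vx)) % 180.0)
    return centroid, median_depth, float(major), float(minor), angle

region = np.zeros((8, 10), dtype=bool)
region[2:5, 1:8] = True                              # 3 x 7 block, elongated along u
depth = np.full((8, 10), 400.0)
centroid, median_depth, major, minor, angle = describe_region(region, depth)
```

For a region taken from UDTM, every depth value under the mask is undetermined, so the sketch returns `None` for the depth while still providing the centroid and the ellipse.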
Raw statistics on depths and ellipse sizes were collected in a preliminary segmentation without constraints. We identified an exploitable relation between the lengths of the major axes of the ellipses and the depths. This relation appeared to be almost linear at this depth scale, although it cannot be linear on a bigger scale. Figure 10 shows that, using a polynomial regression, this relation can be approximated by two functions (mean value $\mu_d$ and standard deviation $\sigma_d$). Thus, we considered that observations from our segmentation fitted this model. The farther an observation lies from the model, the more likely it is to be a false alarm. The degree of truth that an observation belongs to false alarms is given by the following membership function:
(13)
where d is the depth and s is the size of the target. For targets without depth information, our initial idea was to approximate their depth from their size, but we abandoned this idea because the standard deviation of the models was generally quite high.
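To illustrate how such a membership function could be evaluated, the sketch below uses polynomial models for the mean and standard deviation of the size as functions of depth. The Gaussian-shaped membership is an assumption and does not reproduce equation (13). The polynomial coefficients are hypothetical placeholders, not values fitted in this work.

```python
import numpy as np

# Hypothetical coefficients (highest degree first), for illustration only.
MU_COEFFS = [-0.05, 60.0]     # expected major-axis length as a function of depth
SIGMA_COEFFS = [0.0, 5.0]     # standard deviation of the length at that depth

def false_alarm_membership(d, s, mu_coeffs=MU_COEFFS, sigma_coeffs=SIGMA_COEFFS):
    """Degree in [0, 1) to which an observation (depth d, size s) looks like a false alarm."""
    mu = np.polyval(mu_coeffs, d)
    sigma = np.polyval(sigma_coeffs, d)
    z = (s - mu) / sigma                     # normalized distance to the model
    return float(1.0 - np.exp(-0.5 * z ** 2))

on_model = false_alarm_membership(400.0, 40.0)    # size matches the model mean
off_model = false_alarm_membership(400.0, 55.0)   # three standard deviations away
```

With these placeholder values, an observation on the model mean gets a membership close to 0, and the membership increases toward 1 as the size departs from the value expected at that depth.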