MVDet: multi-view multi-class object detection without ground plane assumption

Although many state-of-the-art methods for object detection in a single image have achieved great success in the last few years, they still suffer from false positives in the crowded scenes of real-world applications such as automatic checkout. To address the limitations of single-view object detection in complex scenes, we propose MVDet, an end-to-end learnable approach that can detect and re-identify multi-class objects in multiple images captured by multiple cameras (multi-view). Our approach is based on the premise that incorrect detection results in a specific view can be eliminated using precise cues from other views, given the availability of multi-view images. Unlike most existing multi-view detection algorithms, which assume that objects belong to a single class on the ground plane, our approach can classify multi-class objects without such assumptions and is thus more practical. To classify multi-class objects, we propose an integrated architecture for region proposal, re-identification, and classification. Additionally, we utilize the epipolar geometry constraint to devise a novel re-identification algorithm that does not rely on the ground plane assumption. Our model demonstrates competitive performance compared to several baselines on the challenging MessyTable dataset.


Introduction
With the significant progress in deep learning, several deep learning-based methods have been proposed for various applications of computer vision, such as automatic checkout [1,2], autonomous driving [3], and robotics [4]. Among these applications, automatic checkout employs different types of sensors, including cameras, LIDARs, microphones, and scales, to automatically recognize the items bought by a customer in a store without the need for scanning barcodes. In comparison with other types of sensors, camera-based automatic checkout is not only less expensive but also able to take advantage of recent promising outcomes of computer vision research.
However, Rigner et al. [2] have shown that the performance of some representative vision-based detection algorithms, such as Mask R-CNN [5], YOLO [6], and RetinaNet [7], degrades when scenes become crowded in automatic checkout. While several methods have attempted to solve this issue in a single-view setting [8][9][10], a single-view cue is insufficient to address the occlusion problem in complex scenes. To improve detection accuracy under occlusion, recent approaches have utilized depth information [11][12][13], LIDAR point clouds [14,15], or multiple camera views (multi-view) [16][17][18][19]. In this paper, we focus on multi-class object detection from multiple RGB camera views.
In this study, the multi-view images capture overlapping fields of view and are characterized by the intrinsic and extrinsic camera parameters. Each scene contains multiple classes of objects, and the positions of the cameras are randomized for every scene to obtain a general setting. The objective is to locate objects in each view (region proposal), compare the objects across views to determine if they have the same identity (re-identification), and classify the re-identified objects (classification).
Existing multi-view detection studies have two limitations: (i) they cannot process multi-class objects, and (ii) they assume a ground plane for re-identification. For instance, Cai et al. [20] have proposed to separately use the state-of-the-art methods in the fields of single-view detection and multi-view re-identification for multi-view detection. However, this approach cannot determine the final class if the object's classes are different in each view. Similarly, other studies [17,19,21] have integrated the detection and re-identification procedures in an end-to-end manner, but they also assume only one class, and all objects in the scenes are considered to be of the same class without classification. Furthermore, multi-view pedestrian detection methods [17,21], which are the main focus of multi-view detection research, assume a reference ground plane, which is not applicable to scenarios where objects do not stand on the ground plane, such as automatic checkout and unmanned stores. Therefore, a general model that performs multi-class object detection and re-identification without the ground plane assumption is required.
This paper proposes a method to address two issues in object detection: multi-class classification and re-identification without the ground plane assumption. For multi-class classification, the proposed method simultaneously performs region proposal, re-identification, and classification in an end-to-end manner. Specifically, we use Faster R-CNN's region proposal network to locate objects in each view, and then a view embedding network (VEN) trained with triplet loss [22] to re-identify the region proposal boxes. Finally, a classification network determines their class after pooling the re-identified regions. We save time and memory by sharing the initial features across the three stages. In addition, to improve detection accuracy at inference time, the re-identification and classification networks are trained on the imperfect detection results generated by the region proposal network instead of the error-free ground truth.
For re-identification without ground plane assumption, the proposed method adopts epipolar geometry to deal with occlusions and view variations. We calculate the embedding distance of a pair of region proposal boxes using the features extracted by VEN, and re-identify them as the same instance if they have the smallest distance across views and satisfy the epipolar constraint. Without ground plane assumption, our method achieves accurate re-identification.
Our model has been extensively tested on the challenging MessyTable dataset [20], which contains complex scenes with multi-view multi-class objects. The experiments demonstrate that our model improves the detection performance (MODA) of Faster R-CNN by 16 percentage points. Moreover, our jointly optimized model outperforms the simple combination of detection and re-identification by +21% MODA and +25.9% AP, respectively. In summary, our contributions are as follows:
• We propose MVDet, an end-to-end learnable object detector that is capable of handling multi-class objects in multi-view scenarios. To the best of our knowledge, this is the first attempt in this domain.
• We have developed a novel algorithm that can re-identify objects across multi-view images under the epipolar geometry constraints, without relying on the ground plane assumption. Our method can be applied to various scenarios where objects are not on the ground plane, such as automatic checkout and unmanned stores.
• Our proposed MVDet outperforms the single-view detection model and separately optimized multi-view detection models by jointly optimizing the region proposal, re-identification, and classification networks, without the ground plane assumption.

Single-view object detection
Object detection on a single image has made significant progress with deep learning, with methods such as Faster R-CNN [23] proposing regions where objects are expected to be and performing classification on those regions. Other detectors like YOLO [24], SSD [25], and EfficientDet [26] combine both steps by simultaneously localizing and classifying objects. However, accurate object detection in a single image is limited when objects are partially or completely occluded.

Multi-view re-identification
Multi-view re-identification research has primarily focused on person retrieval, with many studies exploiting the parts of the objects [27][28][29][30]. In contrast, FaceNet [22] has introduced a triplet loss to minimize the distance between an anchor and a positive input, while maximizing the distance between the anchor and a negative input. ASNet [20], which performs well on the MessyTable dataset, leverages context of instances in complex scenarios. However, these approaches significantly suffer when inaccurate single-view detection results are used as input since they are trained and optimized on accurate ground truth boxes.

Multi-view classification
Research on 3D object detection has focused on classifying a group of 2D images that represent a 3D object [31][32][33][34][35]. One study relevant to our paper is MVCNN [36], which combines the multi-view images and performs pooling for classification. However, this method only classifies bundles of images with the same class and does not provide complete multi-view detection since it does not locate individual objects.

Multi-view object detection
Roig et al. [16] have proposed multi-class object detection under multi-camera settings by applying conditional random fields to object detection results. In contrast to this study, which does not include re-identification, several studies [17,18,21] have proposed simultaneous localization and re-identification of single-class objects in multi-view images. Baque et al. [17] have demonstrated the effectiveness of integrating a CNN and a conditional random field to improve the robustness to occlusion in multi-view multi-target detection. Chavdarova et al. [18] have used a predefined occlusion mask to partially mask input images during training and fuse multi-view features. Hou et al. [21] have proposed an anchor-free multi-view pedestrian detection method using perspective transformation of the feature map. However, these studies assumed a ground plane and could not handle multi-class objects. There have also been approaches to recognize instances without the ground plane assumption using image appearance and geometric information of cameras in the multi-view setting [19,37], but they could not handle multi-class objects. To address these limitations, we propose a multi-view multi-class object detector without a ground plane assumption.

Method
In this section, we provide a detailed introduction to our method, which has two distinct characteristics: a network capable of learning multi-class multi-view object detection in an end-to-end manner, and an epipolar geometry-based re-identification algorithm without the assumption of a ground plane. As shown in Fig. 1, MVDet takes multi-view RGB images as input and outputs the position and class of the objects in each view. We explain the architecture of the multi-class multi-view object detector in Sect. 3.1 and the re-identification algorithm based on epipolar geometry in Sect. 3.2.

Fig. 1 MVDet is a multi-view multi-class object detector that comprises three networks: RPN, VEN, and classifier. All three networks share the initial features that are extracted from VGG. The RPN generates region proposal boxes using the shared features, which are then passed through the ROI pooling layer and resized to the same size. Meanwhile, the VEN extracts view embedding features from the shared features and then selects the view embedding vectors corresponding to the region proposal boxes. Re-identification across the views is performed using these vectors and epipolar geometry, as specified in Algorithm 1. Finally, the classifier network determines the classes for the sets of re-identified boxes. The dotted lines indicate that the weights of the networks are shared for each camera

Architecture of MVDet
MVDet consists of a region proposal network (RPN), a view embedding network (VEN), and a classification network, as shown in Fig. 1. These networks work together to detect, re-identify, and classify objects in multiple views simultaneously. In the following sections, we explain each network in detail.

RPN
In the first step of MVDet, the RPN generates box-shaped region proposals where objects are likely to be located, similar to the Faster R-CNN approach [23]. To extract the feature maps from the multi-view images that will be shared in the subsequent steps, the VGG network [38] is used. We choose VGG because it is a lightweight network, but other networks could be used as well. The weights of the VGG network are shared across all cameras to ensure memory efficiency and to extract features that are robust to changes in viewpoint. Following this, the RPN locates regions in the feature map that correspond to the predicted object locations, and the features within these regions are cropped and resized to a uniform size using the ROI pooling layer.

VEN
VEN is a network that extracts view embedding features from a shared feature map. When the region proposal boxes extracted by the RPN are provided across all views, the view embedding vector that corresponds to each region proposal box is indexed from the view embedding feature. The distances between these vectors across all views are used as the similarity between objects, and these distances are utilized for re-identification with epipolar geometry. This section provides a detailed explanation of VEN.

View embedding. VEN extracts the embedding feature map $Y \in \mathbb{R}^{W \times H \times A \times F_2}$ from the shared feature map $X \in \mathbb{R}^{W \times H \times F_1}$, where $W$ and $H$ represent the number of rows and columns of grid cells in an image, and $A$ denotes the number of anchors in a grid cell. $F_1$ represents the channel size of the shared feature, and $F_2$ is the dimension of the view embedding vector. The view embedding vector $y \in \mathbb{R}^{F_2}$ corresponding to a specific region proposal box is chosen from $Y$ based on the coordinates of the grid cell and the anchor index of that region proposal box. Indexing the view embedding feature map to find the embedding vectors that correspond to the region proposal boxes is faster than embedding all of the cropped region proposal images from the beginning, which reduces the running time of the algorithm.

VEN training. Inspired by the approach used in FaceNet [22], our goal is to ensure that the embedding distance between region proposal boxes belonging to the same object is small, while the distance between boxes of different instances is large. In other words, a region proposal box $B^a$ (anchor) of a specific object in a particular view should be closer to the region proposal boxes $B^p$ (positive) of the same object in other views than to the region proposal boxes $B^n$ (negative) of any other object.
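To achieve this, we aim to minimize the VEN loss, which is formulated as follows:

$$L_{VEN} = \sum_{i=1}^{N} \left[ \left\| f(Y, B_i^a) - f(Y, B_i^p) \right\|_2^2 - \left\| f(Y, B_i^a) - f(Y, B_i^n) \right\|_2^2 + \alpha \right]_+ \qquad (1)$$

where $f$ is a mapping function from the view embedding feature map $Y$ to a view embedding vector using the index of the given region proposal box $B$, $\alpha$ is a margin between positive and negative pairs, and $N$ is the number of samples.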
To train the embedding vectors of the region proposal boxes using Eq. (1), we first sample an anchor in a specific view: the region proposal box that has the largest Intersection Over Union (IOU) with a ground truth object in that view. Then, we search for the positive and negative samples corresponding to this anchor in the other views. The positive sample is the region proposal box with the largest IOU with the anchor. The negative sample, on the other hand, is randomly selected from the region proposal boxes whose embedding distances to the anchor are larger than the embedding distance between the anchor and the positive sample.
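The training objective and the negative sampling rule described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a FaceNet-style hinge on squared L2 distances summed over the samples, and the function names `ven_triplet_loss` and `sample_negative` are hypothetical.

```python
import numpy as np

def ven_triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss summed over N (anchor, positive, negative) triplets.

    Each argument is an (N, F2) array of view embedding vectors.
    Assumption: squared L2 distances, as in FaceNet.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.sum(np.maximum(d_pos - d_neg + margin, 0.0)))

def sample_negative(anchor_vec, positive_vec, candidate_vecs, rng):
    """Pick a random candidate that lies farther from the anchor than the positive.

    Returns the index of the chosen candidate, or None if no candidate qualifies.
    """
    anchor_vec = np.asarray(anchor_vec, dtype=float)
    candidate_vecs = np.asarray(candidate_vecs, dtype=float)
    d_pos = np.sum((anchor_vec - np.asarray(positive_vec, dtype=float)) ** 2)
    d_cand = np.sum((candidate_vecs - anchor_vec) ** 2, axis=1)
    eligible = np.flatnonzero(d_cand > d_pos)
    if eligible.size == 0:
        return None
    return int(rng.choice(eligible))
```

A triplet whose negative is already farther away than the positive by more than the margin contributes zero to the loss, so only violating triplets produce a gradient.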
VEN architecture. The VEN architecture consists of two ReLU-activated convolutional layers, each with a 3 × 3 kernel and 512 output channels, and one sigmoid-activated convolutional layer with a 1 × 1 kernel and $A \cdot F_2$ output channels. The resulting feature map $Z \in \mathbb{R}^{W \times H \times (A \cdot F_2)}$ is then reshaped to the embedding feature map $Y \in \mathbb{R}^{W \times H \times A \times F_2}$, followed by $L_2$ normalization.
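The reshape, normalization, and per-box lookup can be illustrated with a short NumPy sketch. It covers only the post-processing of the final convolution output, not the convolutional layers themselves. The anchor-major channel layout and the function names `embedding_map` and `lookup_embedding` are assumptions made for illustration.

```python
import numpy as np

def embedding_map(z, num_anchors, embed_dim, eps=1e-12):
    """Reshape a (W, H, A*F2) head output to (W, H, A, F2) and L2-normalise.

    Assumption: channels are laid out anchor-major, so a plain reshape
    splits them into A groups of F2 values.
    """
    w, h, c = z.shape
    if c != num_anchors * embed_dim:
        raise ValueError("channel count must equal A * F2")
    y = z.reshape(w, h, num_anchors, embed_dim)
    norm = np.linalg.norm(y, axis=-1, keepdims=True)
    return y / np.maximum(norm, eps)

def lookup_embedding(y, row, col, anchor):
    """Return the F2-dimensional vector of the proposal at (row, col, anchor)."""
    return y[row, col, anchor]
```

Because every region proposal box is tied to one grid cell and one anchor, its embedding is a single array lookup, which is the source of the speed advantage over embedding each cropped region separately.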
The technical specifications for VEN are as follows:
• The size of the shared feature map is $W \times H \times 512$ ($F_1 = 512$).
• The size of the embedding feature map is $W \times H \times 9 \times 128$ ($A = 9$, $F_2 = 128$).
• The margin between positive and negative pairs ($\alpha$) is set to 0.3.
• We exclude region proposal boxes with an IOU smaller than 0.3 during VEN training.
• We limit the number of VEN training samples to a maximum of 16 per iteration.

Classifier
After re-identifying the region proposal boxes using VEN, the next step is to classify them. For classification, we use a modified network based on Faster R-CNN. First, a 1 × 1 convolution operation is applied to the features of each region proposal box to reduce their channel size to 1/N of the original, where N is the number of views. Then, the features of the same instance are concatenated. However, if a view is missing in a re-identified instance, the view is replaced with a black image. Next, two fully connected layers with two dropout layers are used for pooling. The resulting feature map is passed through two parallel paths. The first path is a fully connected layer followed by softmax for object classification. The second path is a linearly activated fully connected layer for localization. The localization step refines the localization result obtained in the region proposal step.
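The channel reduction and concatenation step can be sketched as below. This is a simplified illustration under stated assumptions: the 1 × 1 convolution is written as a per-pixel matrix product with a hypothetical weight matrix, and a missing view is filled with zero features as a stand-in for the black image mentioned above. The name `fuse_views` is not from the paper.

```python
import numpy as np

def fuse_views(view_feats, weights, num_views):
    """Concatenate channel-reduced ROI features of one re-identified instance.

    view_feats: dict mapping view index -> (P, P, C) pooled feature map;
                views in which the instance was not matched are absent.
    weights:    (C, C // num_views) matrix standing in for the 1x1 convolution.
    """
    c_out = weights.shape[1]
    pooled = next(iter(view_feats.values())).shape[:2]
    parts = []
    for v in range(num_views):
        if v in view_feats:
            # A 1x1 convolution is a linear map applied at every spatial cell.
            parts.append(view_feats[v] @ weights)
        else:
            # Assumption: a missing view contributes all-zero features.
            parts.append(np.zeros(pooled + (c_out,)))
    return np.concatenate(parts, axis=-1)
```

With N views and a reduction to C/N channels per view, the concatenated tensor has C channels again, so the fully connected layers that follow see an input of fixed size regardless of how many views were matched.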
Finally, after classification, a novel multi-view non-maximum suppression (NMS) is used. If the overlapping area between two objects of the same class is more than 0.3 in at least one view, the object with lower confidence is removed.
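A minimal sketch of this suppression rule is given below. It assumes that the overlap is measured as IOU and that each re-identified instance carries one box per view in which it was matched. The data layout and the name `multiview_nms` are illustrative only.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def multiview_nms(instances, threshold=0.3):
    """Suppress same-class instances that overlap in at least one shared view.

    instances: list of dicts with keys 'cls', 'score', and 'boxes',
               where 'boxes' maps a view index to a box.
    """
    kept = []
    # Visit instances from most to least confident so the stronger one survives.
    for inst in sorted(instances, key=lambda d: d["score"], reverse=True):
        suppressed = False
        for k in kept:
            if k["cls"] != inst["cls"]:
                continue
            shared = set(k["boxes"]) & set(inst["boxes"])
            if any(iou(k["boxes"][v], inst["boxes"][v]) > threshold for v in shared):
                suppressed = True
                break
        if not suppressed:
            kept.append(inst)
    return kept
```

Compared with per-view NMS, a single overlapping view is enough to remove the whole lower-confidence instance, including its boxes in the other views.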
Our proposed classifier is designed to have shallow layers, which provides several advantages compared to existing multi-view classifiers with deep structures such as MVCNN [36]. One advantage is the efficient use of memory during training and inference, which is beneficial for practical applications. Moreover, our simple classifier can still achieve high accuracy because it utilizes the shared feature map that has already extracted the key features of objects, as opposed to starting from scratch for each view.
Additionally, our classifier includes a localization layer that refines the inaccurate region proposal results from RPN. This is in contrast to MVCNN, which only performs classification. The ability to refine the localization results can further improve the overall accuracy of our method.

Re-identification based on VEN and epipolar geometry
In this section, we explain our approach for re-identification without assuming a ground plane. We use the embedding vector from VEN and epipolar geometry for this task. We note that the term re-identification in this paper is slightly different from the term used in person re-identification. The latter refers to the task of retrieving person images in one view, given a query target person in another view. In contrast, re-identification in our paper refers to the task of associating the same instances when objects are detected across multiple views. Our approach to re-identification differs from person re-identification in that it deals with inaccurate detection results and associates objects in all views. The objective of this section is to describe our re-identification algorithm that leverages VEN and epipolar geometry.
To apply our re-identification algorithm, we first gather region proposal boxes from all views and select the top M boxes with the highest confidence scores. These M boxes are then defined as reference boxes, denoted as $R_{top}$. Given a reference box $B_1$ in the first view, we use its center coordinates to compute an epipolar line in the second view, using the intrinsic and extrinsic parameters of the cameras. The pixel distances between all region proposal boxes and the epipolar line in the second view are then calculated to exclude unlikely region proposal boxes that are farther from the epipolar line than a threshold $\theta_1$. This threshold is a hyperparameter, and we empirically determine its value. After excluding unlikely candidates, we identify the box $B_2$ with the smallest embedding distance from $B_1$ as the same instance.
Once a pair of matched region proposal boxes $B_1$ and $B_2$ is identified, two epipolar lines in the third view can be calculated from $B_1$ and $B_2$, respectively, using the cameras' intrinsic and extrinsic parameters. A feasible match in the third view is a region proposal box that is close to the intersection of these two epipolar lines. We set a threshold $\theta_2$ to exclude candidates that are too far from the intersection. Among the remaining region proposal boxes, the one with the shortest embedding distance from $B_1$ is considered to be the same instance. This process is repeated for the rest of the views.
Algorithm 1 outlines the re-identification process. The EMD(a, b) function calculates the embedding distance between the region proposal boxes a and b. The EPD_1(a, b) function calculates the pixel distance between the center coordinate of region proposal box a and the epipolar line derived from the region proposal box b. The EPD_2(a, b, c) function calculates the pixel distance between the center coordinate of region proposal box c and the intersection of two epipolar lines, which are derived from region proposal boxes a and b, respectively.
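The epipolar filtering between two views can be sketched with standard two-view geometry. The sketch assumes world-to-camera extrinsics of the form x_cam = R X + t and pixel coordinates obtained through the intrinsic matrix K. All function names are hypothetical, and the calibration convention of a given dataset may differ.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, R1, t1, K2, R2, t2):
    """Fundamental matrix mapping pixels of view 1 to epipolar lines in view 2."""
    R_rel = R2 @ R1.T              # rotation from camera 1 to camera 2
    t_rel = t2 - R_rel @ t1        # translation from camera 1 to camera 2
    E = skew(t_rel) @ R_rel        # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_line(F, center):
    """Homogeneous line (a, b, c) in view 2 for a box center (x, y) in view 1."""
    return F @ np.array([center[0], center[1], 1.0])

def point_to_line_distance(center, line):
    """Pixel distance from a box center (x, y) to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * center[0] + b * center[1] + c) / np.hypot(a, b)
```

A candidate box in the second view would then be kept only if `point_to_line_distance` does not exceed the chosen threshold; the surviving candidate with the smallest embedding distance is taken as the match.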
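For the third and later views, the intersection of two epipolar lines can be computed with a cross product of their homogeneous coordinates. The following sketch assumes lines given as (a, b, c) with a*x + b*y + c = 0. The names `line_intersection` and `distance_to_intersection` are illustrative, and the handling of near-parallel lines is an assumption rather than something specified in the text.

```python
import numpy as np

def line_intersection(l1, l2, eps=1e-9):
    """Intersection (x, y) of two homogeneous lines, or None if they are parallel."""
    p = np.cross(np.asarray(l1, dtype=float), np.asarray(l2, dtype=float))
    if abs(p[2]) < eps:
        return None
    return p[:2] / p[2]

def distance_to_intersection(l1, l2, center):
    """Pixel distance from a box center to the intersection of two epipolar lines.

    Returns infinity when the lines do not intersect, so the candidate
    is rejected by any finite threshold.
    """
    p = line_intersection(l1, l2)
    if p is None:
        return float("inf")
    return float(np.hypot(center[0] - p[0], center[1] - p[1]))
```

When the two epipolar lines are nearly parallel, the intersection becomes numerically unstable; a practical implementation might fall back to a point-to-line test in that case.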

Algorithm 1 Re-identification
Require: $R_1, R_2, \ldots, R_N$, where each $R_i$ is a set of region proposal boxes for the $i$th view and $N$ is the number of views; $\theta_1$ and $\theta_2$, which are the thresholds to remove the candidates that do not meet the epipolar constraint.
Ensure: Re-identified region proposal boxes
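The following Python sketch follows the prose description of the re-identification procedure. It is not a transcription of Algorithm 1: the distance functions are passed in as callables, boxes are represented by hashable identifiers, and the rule that an already matched box cannot be reused is an assumption made for this illustration.

```python
def reidentify(proposals, scores, emd, epd1, epd2, theta1, theta2, top_m):
    """Greedy cross-view association of region proposal boxes.

    proposals: list where proposals[i] holds the box ids of view i.
    scores:    dict mapping box id -> confidence score.
    emd(a, b):     embedding distance between boxes a and b.
    epd1(a, b):    distance from the center of a to the epipolar line of b.
    epd2(a, b, c): distance from the center of c to the intersection of the
                   epipolar lines of a and b.
    Returns a list of dicts mapping view index -> box id.
    """
    all_boxes = [(v, b) for v, boxes in enumerate(proposals) for b in boxes]
    refs = sorted(all_boxes, key=lambda vb: scores[vb[1]], reverse=True)[:top_m]
    used, groups = set(), []
    for ref_view, ref in refs:
        if ref in used:
            continue
        group = {ref_view: ref}
        for view, boxes in enumerate(proposals):
            if view == ref_view:
                continue
            free = [b for b in boxes if b not in used]
            matched = [b for v, b in group.items() if v != ref_view]
            if not matched:
                # Second view: keep boxes near the epipolar line of the reference.
                cands = [b for b in free if epd1(b, ref) <= theta1]
            else:
                # Later views: keep boxes near the intersection of two lines.
                cands = [b for b in free if epd2(ref, matched[0], b) <= theta2]
            if cands:
                group[view] = min(cands, key=lambda b: emd(ref, b))
        used.update(group.values())
        groups.append(group)
    return groups
```

Passing the distance functions as arguments keeps the association logic independent of how embeddings and camera calibration are stored.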

Dataset
MessyTable [20] is a multi-camera object dataset designed for the instance re-identification task. It consists of 120 object classes with varying sizes, colors, and materials. The dataset comprises 5,579 scenes captured by nine synchronized cameras, with 6 to 67 instances randomly placed on a table under different lighting conditions and backgrounds. The camera poses are set randomly in 567 configurations. The scenes are categorized into three difficulty levels, with harder scenes featuring more occluded objects, similar-looking instances, or fewer instances in the overlapping field of view of the cameras. A total of 50,211 images are labeled with 1,219,240 bounding boxes, each annotated by class and instance IDs. The dataset also provides the calibrated intrinsic and extrinsic camera parameters.
The annotations in MessyTable, including camera parameters, bounding boxes with class labels for objects, and instance IDs for each bounding box, are utilized to evaluate the effectiveness of the proposed multi-view detection model. However, due to high memory usage, only 16,737 images from three cameras are utilized in our experiments. The training, validation, and test sets are randomly divided in a 1:1:1 ratio, following the original setup of MessyTable.

Implementation details
Our multi-view object detection model is based on Keras-FasterRCNN, which is an implementation of single-view Faster R-CNN using Keras. The source code for Keras-FasterRCNN and Keras can be found at https://github.com/you359/Keras-FasterRCNN and https://github.com/keras-team/keras, respectively. For our implementation, we choose VGG16 [38], which was pre-trained on ImageNet [39], as the backbone network. The anchor boxes used in our model have sizes of [128, 256, 512] and aspect ratios of [1:1, 1:2, 2:1]. Region proposal boxes are resized to 7 × 7 in the ROI pooling layer. We used dropout layers [40] with a drop probability of 0.5. We used the ADAM optimizer [41] with an initial learning rate of 0.00001 for the region proposal, view embedding, and classification networks. All of our experiments were conducted on a single NVIDIA 1080 Ti GPU.
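The anchor configuration above yields nine anchor shapes per grid cell, which matches A = 9 in the VEN specification. The short sketch below enumerates them under the assumption that each size is simply multiplied by the two components of the aspect ratio; the exact scaling used in the referenced implementation may differ, and `anchor_shapes` is a hypothetical helper.

```python
from itertools import product

ANCHOR_SIZES = [128, 256, 512]
ANCHOR_RATIOS = [(1, 1), (1, 2), (2, 1)]

def anchor_shapes(sizes=ANCHOR_SIZES, ratios=ANCHOR_RATIOS):
    """Return (width, height) for every size and aspect-ratio combination."""
    # Assumption: width = size * ratio_w and height = size * ratio_h.
    return [(s * rw, s * rh) for s, (rw, rh) in product(sizes, ratios)]
```

Calling `anchor_shapes()` returns nine (width, height) pairs, one per anchor index in a grid cell.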

Metrics
To evaluate the effectiveness of our model, we employ detection and re-identification metrics. For detection, we use MODA, MODP, and F1-score (F1). MODA takes into account both false positives and false negatives, while MODP measures the localization error of true positives [42]. F1 is a harmonic mean of recall and precision. We use a threshold of 0 to compute MODA and F1. Re-identification performance is evaluated using AP and FPR-95, as in [20]. AP is calculated as a weighted sum of precisions, counting the number of positive and negative matches at each threshold. FPR-95, commonly used in patch-based matching, is the false positive rate when recall is 95% [43], and complements AP.
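For reference, the metrics can be computed from match counts and scores as sketched below. The sketch uses the common definitions (MODA as one minus the normalized sum of misses and false positives with unit weights, MODP as the mean overlap of matched pairs). The function names are illustrative, and the matching between detections and ground truth is assumed to be done beforehand.

```python
import math

def moda(false_positives, false_negatives, num_ground_truth):
    """Multiple object detection accuracy with unit error weights."""
    if num_ground_truth == 0:
        raise ValueError("MODA is undefined without ground truth objects")
    return 1.0 - (false_positives + false_negatives) / num_ground_truth

def modp(overlaps):
    """Multiple object detection precision: mean overlap of matched pairs."""
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denom if denom else 0.0

def fpr_at_recall(pos_scores, neg_scores, recall=0.95):
    """False positive rate at the score threshold that reaches the given recall."""
    ranked = sorted(pos_scores, reverse=True)
    k = max(1, math.ceil(recall * len(ranked)))
    threshold = ranked[k - 1]
    return sum(s >= threshold for s in neg_scores) / len(neg_scores)
```

Note that MODA can become negative when the number of false positives and misses exceeds the number of ground truth objects, which is why it penalizes over-detection more visibly than F1.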

Baselines
In this section, we explain the baselines used in our experiments. To the best of our knowledge, there are no existing studies on multi-view multi-class object detection without the ground plane assumption. Therefore, we employ heuristics or deep learning-based methods for single-view detection, re-identification, and multi-view classification. We then integrate the results from each step to generate the final results for multi-view detection, which serve as the baselines for our experiments. We note that the performance of single-view Faster R-CNN, which forms the backbone of our model, represents the lower bound in our experiments.
To re-identify the objects detected in single views, we use ASNet and TripleNet, which are state-of-the-art methods for re-identification on the MessyTable dataset. ASNet uses neighboring information around a bounding box when the appearance features of a pair of boxes are dissimilar, while TripleNet is a feature extractor trained with triplet loss [22] that measures the feature similarity of a pair of boxes. In our experiments, we train ASNet and TripleNet on the ground truth labels and refine the similarity scores using epipolar geometry in the inference step for the fair comparison, following the methodology of FaceNet [22].
To evaluate the effect of ground plane assumption, we also utilize homographic projection as another method of re-identification. Homographic projection is a widely used technique in multi-view pedestrian detection [21] and tracking [44,45], which is based on the assumption of a ground plane. It projects the coordinates of objects from each view onto a 2D ground plane and determines whether they represent the same object based on the distance between their projected locations. We calculate the similarity scores as the reciprocal of the distances between the projected locations.
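The homography baseline can be sketched as follows, assuming that each view comes with a 3 × 3 homography that maps image pixels to ground plane coordinates. The function names and the small lower bound on the distance, which avoids division by zero, are assumptions made for illustration.

```python
import numpy as np

def project_to_ground(H, point):
    """Map an image point (x, y) to the ground plane with a 3x3 homography."""
    p = H @ np.array([point[0], point[1], 1.0])
    return p[:2] / p[2]

def homography_similarity(H_a, point_a, H_b, point_b, eps=1e-6):
    """Reciprocal of the ground-plane distance between two projected points."""
    d = np.linalg.norm(project_to_ground(H_a, point_a) - project_to_ground(H_b, point_b))
    # Assumption: clamp the distance so coincident projections stay finite.
    return 1.0 / max(float(d), eps)
```

The choice of the representative image point (box center or bottom center) is left to the caller, since it depends on the viewpoint.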
Since ASNet, TripleNet, and homographic projection generate only the similarity scores between two boxes, a method for complete re-identification of the boxes across all views is required. Therefore, we build a graph in which the nodes represent the detection boxes and the edges are weighted by the similarity scores of the corresponding nodes. We only include edges with similarity scores higher than 0.5 to ensure reliable re-identification. Next, maximum bipartite graph matching is applied to identify the valid paths in the graph, where all nodes in a valid path are considered to be the same instance. Finally, the paths with
To assign a class label to the re-identified instances, we use majority voting and MVCNN [36]. The majority voting method selects the class label for a re-identified instance based on the class that appears in the highest number of bounding boxes. The confidence score is calculated by averaging the detection scores of the boxes whose class is the majority class. If there is a tie between multiple classes, the final class is randomly selected from the tied classes.
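The majority voting rule can be written in a few lines of Python. This is an illustrative sketch of the rule as described, with a hypothetical function name and an input format chosen for convenience.

```python
import random
from collections import Counter

def majority_vote(detections, rng=random):
    """Assign a class and confidence to one re-identified instance.

    detections: list of (class_label, detection_score) pairs, one per box.
    Returns (winning_label, mean score of the boxes carrying that label).
    """
    counts = Counter(label for label, _ in detections)
    top = max(counts.values())
    tied = sorted(label for label, n in counts.items() if n == top)
    # Ties between classes are broken at random.
    winner = tied[0] if len(tied) == 1 else rng.choice(tied)
    scores = [score for label, score in detections if label == winner]
    return winner, sum(scores) / len(scores)
```

Boxes that disagree with the majority class do not affect the confidence score, so a single high-scoring outlier cannot raise the confidence of the winning class.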
MVCNN is a deep learning-based method that classifies multi-view images of a single object. However, in the original MVCNN, the number of input views must be fixed. Therefore, if a view is missing from the previous re-identification step, we compensate for the missing view by duplicating the other views and feeding them as inputs to MVCNN with the fixed number of views. Additionally, to enable background classification, which is not possible in the original MVCNN, we cut out patches without objects from the images and use them as background image samples during training. Finally, we also apply non-maximum suppression (NMS), which is also used in MVDet, to both the majority voting and MVCNN.
Table 1 demonstrates that our MVDet model achieves a MODA performance gain of +16% over single-view Faster R-CNN (SVDet). The reason for this improvement is illustrated in Fig. 2, where we can see that MVDet successfully removes many false positives that occur in SVDet. This is achieved by utilizing re-identification and classification methods to remove false region proposal boxes.
The results from Table 1 and Fig. 3 indicate that SVDet+Homography+Majority and SVDet+Homography+MVCNN have poor performance due to the presence of viewpoint variation and elevated surfaces. Homographic projection assumes that an object is located at the center of the occupied area on the ground. Therefore, the top-down view should represent the object with the center coordinates of the bounding box, while in other views, it should use the bottom center coordinates of the bounding box to represent the object. If multiple views with different coordinate matching methods are combined, re-identification performance deteriorates, as illustrated in Fig. 3. Additionally, objects not on the ground plane can hinder accurate homographic projection as they violate the ground plane assumption.
Also, MVDet surpasses the detection and re-identification performance of SVDet+TripleNet+Majority by +21% MODA and +25.9% AP, respectively, as shown in Table 1. It is worth noting that the detection accuracy of the separate multi-view detection models is even lower than that of SVDet due to the lack of robustness to false positives generated by SVDet. This is because the re-identification and classification networks are trained using error-free ground truth boxes, which are not robust to the false positives produced by SVDet. Therefore, the performance of GT SVDet+TripleNet+Majority, which assumes that the SVDet results are accurate, is much better than SVDet+TripleNet+Majority. In other words, the performance of re-identification and classification networks heavily relies on the accuracy of SVDet results. In contrast, MVDet uses region proposal boxes instead of ground truth boxes in the re-identification training process, making it robust against localization errors in the boxes, as shown in Fig. 4.
Furthermore, since the accuracy of the MVCNN model is highly dependent on the accuracy of the SVDet and ReID results, GT SVDet+GT ReID+MVCNN achieves 87% MODA, which is +14% higher than SVDet+TripleNet+MVCNN in Table 1. However, GT SVDet+GT TripleNet+Majority performs better than GT SVDet+GT TripleNet+MVCNN in Table 1 because MVCNN often misclassifies instances with minor localization errors. Therefore, the MVCNN model, which is trained on the ground truth boxes, performs well only when the accuracy of the SVDet and ReID results is guaranteed.
We conducted additional validation of our model on Easy, Medium, and Hard test sets, which are divided based on the scene complexity. The more difficult the test set, the more it contains similar objects and occlusions. As shown in Table 2, MVDet outperforms SVDet and SVDet+TripleNet+Majority on all test sets.

Conclusion
This paper addresses the problem of multi-view multi-class object detection that does not assume a ground plane. The proposed MVDet model performs region proposal, re-identification, and classification simultaneously in an end-to-end manner, using Faster R-CNN and triplet loss. The model also employs an epipolar constraint-based re-identification algorithm to avoid the ground plane assumption. Experimental results on the MessyTable dataset demonstrate that MVDet outperforms both single-view detectors and separate multi-view detectors in terms of detection and re-identification accuracy. Overall, the proposed MVDet model presents a promising solution to the multi-view multi-class object detection problem in the absence of a ground plane assumption.

Data availability
Publicly available data are used.

Conflict of interest
The authors have no relevant financial or nonfinancial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.