1 Introduction

Gaining a high-level and three-dimensional understanding of digital images is one of the major challenges in the field of artificial intelligence. Applications like augmented reality, autonomous driving and other robotic navigation systems are pushing research in this field faster than ever [3, 4, 14]. Participating in real-life road traffic, self-driving vehicles need to gain a comprehensive understanding of their surroundings. Hence, a vehicle not only needs to recognize other road users and other objects, but also comprehend their pose and location to avoid collisions. This objective is well known as 3D object detection (3DOD) [5].

Meanwhile, 2D object detection (2DOD) has obtained impressive results in terms of precision and inference time, and is able to compete with or even surpass human vision [164]. However, to fully grasp the scene in a real 3D world, 2D recognition and detection results alone are no longer sufficient. 3DOD now extends this approach into the three-dimensional space by adding the desired parameters of dimension and orientation of the object to the established location and classification results.

The literature volume for 3DOD has increased significantly over the past years [31, 34]. Against the backdrop of highly sophisticated 2DOD models, it is apparent that the focus of research is shifting to 3DOD as the necessary hardware in terms of sensors and computing units becomes increasingly available.

Since 3DOD is a steadily growing field of investigation, there are several promising approaches and trends, including a large pool of various design options for the object detection pipeline. Providing an overview of relevant approaches and seminal achievements may offer orientation and can help to initiate further development in the research community. For this reason, we present a comprehensive review of 3DOD models and methods with exemplary applications and aim to conceptualize the full range of 3DOD approaches along a multi-stage pipeline.

With our work, we complement related surveys in the field (e.g., Arnold et al. [5]; Guo et al. [45]; Fernandes et al. [31]), which often focus on a particular domain (e.g., autonomous driving), specific data input (e.g., point cloud data), or a certain set of methods (e.g., deep learning techniques).

To carry out our review, we investigated papers that were published in the period from 2012 to 2021. In total, our literature corpus comprises more than one hundred papers which we examined in detail to provide a classification of all approaches. Throughout our review, we describe representative examples along the 3DOD pipeline, while highlighting seminal achievements.

This survey is structured as follows. In Sect. 2, we provide all relevant foundations and subsequently refer to related work in Sect. 3. Thereafter, the identified literature is discussed and analyzed in detail. This constitutes the main part of our survey, for which we propose a structured framework along the 3DOD pipeline in Sect. 4. The framework consists of several stages with corresponding design options, which are examined in Sects. 5–9 due to their thematic depth. Afterward, we leverage our framework and classify the examined literature corpus in Sect. 10. Finally, we draw a conclusion of this work and give an outlook in Sect. 11.

2 Foundations

To give orientation in the following sections, we introduce major concepts of computer vision and in particular of 3DOD that are regarded as background and foundational knowledge.

2.1 Object detection

A core task in the field of computer vision is to recognize and classify objects in images. This general task can further be subdivided into several sub-tasks as summarized in Table 1.

Table 1 Different tasks in computer vision that focus on object instances

Following this distinction, object detection is the fusion of object recognition and localization. In detail, the approach tries to simultaneously classify and localize specific object instances in an image [164]. Detected objects are classified and usually marked with bounding boxes. These are imaginary boxes that describe the objects of interest. They are defined as the coordinates of the rectangular border that fully encloses the target object.

Object detection can be considered as a supervised learning problem which defines the process of learning a function that can map input data to known targets based on a given set of training data [10]. For this task, different kinds of machine learning algorithms can be applied.

Conventional machine learning approaches first require the extraction of representative image features based on feature descriptors, such as the Viola-Jones method [143], the scale-invariant feature transform (SIFT) [90] or the histogram of oriented gradients (HOG) [23]. These are low-level features which are manually designed for the specific use case. On their basis, prediction models such as support vector machines (SVMs) can be trained to perform the object recognition task [120]. However, the diversity of image objects and use cases in terms of pose, illumination and background makes it difficult to manually create a robust feature descriptor that can describe all kinds of objects [54].

For this reason, recent efforts are increasingly directed toward the application of artificial neural networks with deep network architectures, broadly summarized under the term deep learning [66]. Deep neural networks are able to perform object recognition without having to manually define specific features in advance. Their multi-layered architecture allows them to be fed with high-dimensional raw input data and then automatically discover internal representations at different levels of abstraction that are needed for recognition and detection tasks [7, 66].

A common type of deep neural network architecture, which is widely adopted by the computer vision community, is the convolutional neural network (CNN). Due to their nested design, CNNs are able to process high-dimensional data that come in the form of multiple arrays, such as color images composed of arrays containing pixel intensities in different color channels [85]. These techniques have proven to be superior in 2DOD, offering more complex and robust features while being applicable to a wide range of use cases [66, 85].
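To make this concrete, the following minimal PyTorch sketch shows a convolutional feature extractor of the kind described above; the layer sizes and depth are purely illustrative assumptions and do not correspond to any particular model from the surveyed literature.

```python
import torch
import torch.nn as nn

class TinyConvExtractor(nn.Module):
    """Illustrative CNN that maps an RGB image to a feature map."""
    def __init__(self, in_channels: int = 3, out_features: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            # low-level features (edges, blobs) learned from raw pixel arrays
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            # higher-level, more abstract representations
            nn.Conv2d(16, out_features, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)

# Example: a batch of two 3-channel 128x128 images -> 64-channel feature maps
feats = TinyConvExtractor()(torch.randn(2, 3, 128, 128))
print(feats.shape)  # torch.Size([2, 64, 32, 32])
```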

2.2 3D vision and 3D object detection

3D vision aims to extend the previously discussed concepts of object detection by adding data of the third dimension. This leads, on the one hand, to six possible degrees of freedom (6-DoF) instead of three and, on the other hand, to an accompanying increase in the number of scenery configurations. While methods in 2D space are good for simple visual tasks, more sophisticated approaches are required to improve, for instance, autonomous driving or robotics applications [78, 82]. The full understanding of an environment composed of real 3D objects implies the interpretation of scenes in which items may appear in arbitrary positions and orientations across all 6-DoF. This requires a substantial amount of computing power and increases the complexity of the performed operations [24].

3DOD transfers the task of object detection into the three-dimensional space. The idea of 3DOD is to output dimension, location and rotation of 3D bounding boxes and the corresponding class labels for all relevant objects within the sensor’s field of view [121]. 3D bounding boxes are rectangular cuboids in the three-dimensional space. To ensure relevancy, their size should be minimal, while still containing all relevant parts of an object. One common way to parameterize a 3D bounding box is (x, y, z, h, w, l, c), where (x, y, z) represents the 3D coordinates of the bounding box center, (h, w, l) refers to the height, width and length of the box, and c stands for the class of the box [19]. Further, most approaches add an orientation parameter to the 3D bounding box defining the rotation of each box (e.g., Shi et al. [125]).
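As a minimal illustration of this parameterization, the sketch below encodes a box as (x, y, z, h, w, l, c) plus an optional yaw angle; the field names and the single-angle orientation are simplifying assumptions, since individual papers differ in their exact conventions.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    # center of the box in 3D space
    x: float
    y: float
    z: float
    # extents: height, width, length
    h: float
    w: float
    l: float
    # class label of the enclosed object
    c: str
    # orientation (yaw around the vertical axis), used by most approaches
    yaw: float = 0.0

    def volume(self) -> float:
        return self.h * self.w * self.l

car = Box3D(x=12.4, y=-1.6, z=0.9, h=1.5, w=1.8, l=4.2, c="car", yaw=0.3)
print(car.volume())  # 11.34
```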

2.3 Sensing technologies

To capture 3D scenes, commonly used monocular cameras are no longer sufficient. Therefore, special sensors have been developed to capture depth information. RGB-Depth (RGB-D) cameras like Intel’s RealSense use stereo vision, while Light Detection and Ranging (LiDAR) sensors such as Velodyne’s HDL-32E use laser beams to infer depth information. The data acquired by these 3D sensors can be converted to a more generic structure, the point cloud, which can be understood as a set of points in vector space. Further details on the different data inputs are provided in Sect. 5. Typically, 3DOD models rely on data captured by various active and passive optical sensor modalities, with cameras and LiDAR-sensors being the most popular representatives.

2.3.1 Cameras

Stereo cameras: Stereo cameras are inspired by the human ability to estimate depth by capturing images with two eyes. Depth is reconstructed by exploiting the disparity between two or more camera images that record the same scene from different points of view. To do so, stereoscopy leverages triangulation and epipolar geometry theory to create range information [37]. The acquired depth map is normally appended to an RGB image as a fourth channel; together they form an RGB-D image. This sensor variant yields a dense depth map, although its quality depends heavily on the depth estimation, which is also computationally expensive [28].

Time-of-Flight cameras: Instead of deriving depth information from different perspectives, the time-of-flight (TOF) principle can directly estimate the device-to-target distance. TOF systems are based on the LiDAR principle of sending light signals to the scene and measuring the time until they return. The difference between LiDAR and camera-based TOF is that LiDAR creates point clouds with a pulsed laser, whereas TOF cameras capture depth maps with an RGB-like camera. The data captured by stereo and TOF cameras can either be transformed into a 2.5D representation like RGB-D or into a 3D representation by generating a point cloud (e.g., Song and Xiao [134], Qi et al. [107], Sun et al. [137], Ren and Sudderth [118]).

In general, camera sensors such as stereo and TOF have the advantage of low integration costs and relatively low arithmetic complexity. However, all these methods experience considerable quality volatility through environmental conditions like light and weather.

2.3.2 LiDAR sensors

LiDAR sensors emit laser pulses onto the scene and measure the time from emitting the beam to receiving the reflection. In combination with the constant speed of light, the measured time reveals the distance to the target. By assembling the 3D spatial information from the reflected laser at a 360° angle, the sensor constructs a full three-dimensional map of the environment. This map is a set of 3D points, also called a point cloud.
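The underlying range computation is simple: the measured interval covers the round trip of the pulse, so the distance follows from half the travel time multiplied by the speed of light. A minimal sketch:

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def range_from_time_of_flight(round_trip_time_s: float) -> float:
    """Distance to the reflecting surface from a pulsed time-of-flight measurement."""
    # the pulse travels to the target and back, hence the division by two
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

# e.g., a reflection received after 400 ns corresponds to roughly 60 m
print(range_from_time_of_flight(400e-9))  # ~59.96
```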

The respective reflection values represent the strength of the received pulses; LiDAR does not capture RGB color information [67]. The HDL-64E, as a common LiDAR system, outputs 120,000 points per frame, which adds up to a huge amount of data, namely 1,200,000 points per second at a 10 Hz frame rate [5].

The advantages of LiDAR sensors are long-range detection abilities, high resolution compared to other 3D sensors and independence of lighting conditions, which are counterbalanced by its high costs and bulky devices [5, 31].

2.4 Domains

Looking at the extensive research on 3DOD in the past ten years, the literature can be roughly summarized into two main areas: indoor applications and autonomous vehicle applications. These two domains are the main drivers for the field, even though 3DOD is not strictly limited to these two specific areas, as there are also other applications conceivable and already in use, such as in retail, agriculture and fitness.

Both domains face individual challenges and opportunities which led to this differentiation. However, it should also be noted that research in both areas is not mutually exclusive, as some 3DOD models offer solutions that are sufficiently generic and therefore do not focus on a particular domain (e.g., Qi et al. [108], Xu et al. [153], Tang and Lee [139], Wang and Jia [148]).

A fundamental difference between indoor and autonomous vehicle applications is that objects in indoor environments are often stacked on top of each other. This provides opportunities for learning inter-object relationships between the target and base/carrier objects. In research, this is referred to as a holistic understanding of the indoor scene, which enables better communication between service robots and people [51, 118]. The challenges with indoor applications are that scenes are often cluttered and many objects occlude each other [118].

Autonomous vehicle applications are characterized by long distances to potential objects and difficult weather conditions such as snow, rain and fog, which make the detection process more difficult [5]. Objects may also occlude each other, but since objects like cars, pedestrians and traffic lights are unlikely to be on top of each other, techniques such as bird’s-eye view projection can efficiently compensate for this disadvantage (e.g., Beltrán et al. [9], Wang et al. [149]).

2.5 Datasets

As seen in 2DOD, a crucial prerequisite for continuous development and fast progress of algorithms is the availability of publicly accessible datasets. Extensive data are required to train learning-based state-of-the-art models. Further, they are used to apply benchmarks to compare one’s own results with those of others. For instance, the availability of the dataset ImageNet [25] accelerated the development of 2D image classification and 2DOD models remarkably. The same phenomenon is observable in 3DOD and other tasks based on 3D data: More available data lead to a larger coverage of possible scenarios.

Similar to the domain focus, the most commonly used datasets for 3DOD developments can be roughly divided into two groups, distinguishing between autonomous driving scenarios and indoor scenes. In the following, we describe some of the major datasets that are publicly available. In addition, Table 2 provides an overview of the described datasets including the main reference and the environment of the recordings as well as the number of scenes, frames, 3D bounding boxes and object classes.

Table 2 Overview of the described datasets

2.5.1 Autonomous driving datasets

KITTI: The most popular dataset for autonomous driving applications is KITTI [36]. It consists of stereo images, LiDAR point clouds and GPS coordinates, all synchronized in time. Recorded scenes range from highways over complex urban areas to narrow country roads. The dataset can be used for various tasks such as stereo matching, visual odometry, 3D tracking and 3D object detection. For object detection, KITTI provides 7481 training and 7518 test frames including sensor calibration information and annotated 3D bounding boxes around the objects of interest in 22 video scenes. The annotations are categorized into easy, moderate and hard cases, depending on the object size, occlusion and truncation levels. Drawbacks of the dataset are the limited sensor configurations and light conditions: all recordings were made during daytime and mostly under sunny conditions. Moreover, the class frequencies are quite unbalanced: 75% of the annotations belong to the class car, 15% to the class pedestrian and 4% to the class cyclist. This missing variety makes it difficult to evaluate how well the latest methods perform in more diverse natural scenarios.

Waymo Open: The Waymo Open dataset [138] focuses on providing a diverse and comprehensive dataset. It consists of 1,150 videos that are exhaustively annotated with 2D and 3D bounding boxes in images and LiDAR point clouds, respectively. The data collection was conducted by using five cameras providing front and side views of the recording vehicle, as well as a LiDAR sensor for a 360° view. Further, the data were recorded in three different cities with various light and weather conditions, providing a diverse scenery.

nuScenes: NuScenes [12] comprises 1,000 video scenes, 20 s each, in the context of autonomous driving. Each scene is represented by six different camera views, LiDAR and radar data with a full 360° field of view. It is significantly larger than the pioneering KITTI dataset with more than seven times as many annotations and 100 times as many images. Further, the nuScenes dataset also provides nighttime and bad weather scenarios, which are neglected in the KITTI dataset. On the downside, the dataset has limited LiDAR sensor quality with 34,000 points per frame and, covering an effective area of only five square kilometers, limited geographical diversity compared to the Waymo Open dataset.

2.5.2 Indoor datasets

NYUv2 & SUN RGB-D: NYUv2 [128] and its successor SUN RGB-D [133] are datasets commonly used for indoor applications. The goal of these datasets is to encourage methods focused on total scene understanding. The datasets were recorded using four different RGB-D sensors to ensure the generalizability of applied methods for different sensors. Even though SUN RGB-D inherited the 1449 labeled RGB-D frames from the NYUv2 dataset, NYUv2 is still occasionally used by current methods. SUN RGB-D consists of 10,335 RGB-D images that are labeled with about 146,000 2D polygons and around 64,500 3D bounding boxes with accurate object orientation measures. Additionally, there is a room layout and scene category provided for each image. To improve image quality, short videos of every scene have been recorded. Several frames of these videos were then used to create a refined depth map.

Objectron: Recently, Google released the Objectron dataset [1], which is composed of object-centric video clips capturing nine different object categories in indoor and outdoor scenarios. The dataset consists of 14,819 annotated video clips containing over four million annotated images. Each video is accompanied by a sparse point cloud representation.

3 Related reviews

As of today, to the best of our knowledge, there are only a limited number of reviews that aim to organize and classify the most important methods and pipelines for 3DOD.

Arnold et al. [5] were some of the first to propose a classification for 3DOD approaches with a particular focus on autonomous driving applications. Based on the input data that is passed into the detection model, they divide the approaches into (i) monocular image-based methods, (ii) point cloud methods and (iii) fusion-based methods. Furthermore, they break down the point cloud category into three subcategories of data representation: (ii-a) projection-based methods, (ii-b) volumetric representations and (ii-c) PointNets. The data representation states which kind of input the model consumes and which information it contains, so that the subsequent stage can process it according to the respective design choice.

While considering various applications, such as 3D object classification, semantic segmentation and 3DOD, Liu et al. [87] focus on feature extraction methods, i.e., the properties and characteristics that the model derives from the passed data. They classify deep learning models on point clouds into (i) point-based methods and (ii) tree-based methods. The former directly use the raw point cloud, while the latter first employ a k-dimensional tree to preprocess the corresponding data representation.

Griffiths and Boehm [44] consider object detection as a special type of classification and thus provide relevant information for 3DOD in their review on deep learning techniques for 3D sensed data classification. They differentiate the approaches based on the data representation into (i) RGB-D methods, (ii) volumetric approaches, (iii) multi-view CNNs, (iv) unordered point set processing methods and (v) ordered point set processing techniques.

Huang and Chen [53] touch lightly upon 3DOD in their review paper about autonomous driving technologies using deep learning methods. They suggest a similar classification of methods as Arnold et al. [5] by distinguishing between (i) camera-based methods, (ii) LiDAR-based methods, (iii) sensor-fusion methods and additionally (iv) radar-based methods. While giving a coarse structure for 3DOD, the paper does not explain the rationale behind this classification.

Bello et al. [8] consider the field from a broader perspective by providing a survey of deep learning methods on 3D point clouds. The authors organize and compare different methods based on a structure that is task-independent. Subsequently, they discuss the application of exemplary approaches for different 3D vision tasks, including classification, segmentation and object detection.

Addressing likewise the higher-level topic of deep learning for 3D point clouds, Guo et al. [45] give a more detailed look into 3DOD. They structure the approaches for handling point clouds according to their model design choice into (i) region proposal-based methods, (ii) single-shot methods and (iii) other methods. Additionally, the region proposal-based methods are split along their data representation into (i-a) multi-view, (i-b) segmentation, (i-c) frustum-based and again (i-d) other methods. Likewise, the single-shot category comprises the subcategories (ii-a) bird’s-eye view, (ii-b) discretization and (ii-c) point-based approaches.

Most recently, Fernandes et al. [31] presented a comprehensive survey which might be the most similar to this work. They developed a detailed taxonomy for point-cloud-based 3DOD. In general, they divide the detection models along their pipeline into three stages, namely data representation, feature extraction and detection network modules.

The authors note that in terms of data representation, the existing literature either converts the point cloud data into voxels, pillars, frustums or 2D projections, or consumes the raw point cloud directly. Feature extraction is emphasized as the most crucial part of the 3DOD pipeline. Suitable features are essential for optimal feature learning, which in turn has a great impact on the appropriate object localization and classification in later steps. The authors classify the extraction methods into point-wise, segment-wise, object-wise and CNN-based methods, the latter being further divided into 2D CNN and 3D CNN backbones. The detection network module consists of the multiple output tasks of object localization and classification, as well as the regression of 3D bounding box parameters and object orientation. As in 2DOD, these modules are categorized into the architectural design principles of single-stage and dual-stage detectors.

Although all preceding reviews provide some systematization for 3DOD, they operate, with the exception of Fernandes et al. [31], at a high level of abstraction. They tend to lose some of the information which is crucial to fully map relevant trends in this vivid research field.

Moreover, as mentioned above, all surveys are limited to either domain-specific aspects (e.g., autonomous driving applications) or focus on a subset of methods (e.g., point cloud-based approaches). Monocular-based methods, for example, are neglected in almost all existing review papers.

4 3D object detection pipeline

Intending to structure the research field of 3DOD from a broad perspective, we propose a systematization that enables the classification of current 3DOD approaches at an appropriate abstraction level, neither losing relevant information through an overly high level of abstraction nor becoming too specific and complex through an overly fine-granular perspective. Likewise, our systematization aims at being sufficiently robust to allow a classification of all existing 3DOD pipelines and methods as well as of future works without the need for major adjustments to the general framework.

Figure 1 provides an overview of our systematization. It is structured along the general stages of an object detection pipeline, with several design choices at each stage. It starts with the choice of input data (Sect. 5), followed by the selection of a suitable data representation (Sect. 6) and corresponding approaches for feature extraction (Sect. 7). For the latter steps, it is possible to apply fusion approaches (Sect. 8) to combine different data inputs and take advantage of multiple feature representations. Finally, the object detection module is defined (Sect. 9).

Fig. 1 Systematization of a 3D object detection pipeline with its individual fine-branched design choices

The structuring along the pipeline enables us to order and understand the underlying principles of this field. Furthermore, we can compare the different approaches and are able to outline research trends in different stages of the pipeline. To this end, we carry out a qualitative literature analysis of proposed 3DOD approaches in the following sections along the pipeline to examine specific design options, benefits, limitations and trends within each stage.

5 Input data

In the first stage of the pipeline, a model consumes the input data which already restricts the further processing. Common inputs for 3DOD pipelines are (i) RGB images (Sect. 5.1), (ii) RGB-D images (Sect. 5.2) and (iii) point clouds (Sect. 5.3). 3DOD models using RGB-D images are often referred to as 2.5D approaches (e.g., Deng and Latecki [27], Sun et al. [137], Maisano et al. [94]), whereas 3DOD models using point clouds are regarded as true 3D approaches.

5.1 RGB images

Monocular or RGB images provide a dense pixel representation in the form of texture and shape information [5, 37, 80]. A 2D image can be seen as a matrix, containing the dimensions of height and width with the corresponding color values.

Especially for subtasks of 3DOD applications such as lane line detection, traffic light recognition or object classification, monocular-based approaches enjoy the advantage of real-time processing by 2DOD models. Probably the most severe disadvantage of monocular images is the lack of depth information. 3DOD benchmarks have shown that depth data is essential for accurate 3D localization [59]. Furthermore, monocular images face the problem of object occlusion as they only capture a single view.

5.2 RGB-D images

RGB-D images can be created with stereo or TOF cameras that provide depth information in addition to color information (cf. Section 2.3). RGB-D images consist of an RGB image with an additional depth map [28]. The depth map is comparable to a grayscale image, except that each pixel represents the actual distance between the sensor and the surface of the scene object. An RGB image and a depth image ideally have a one-to-one correspondence between pixels [147].

RGB-D images, also known as range images, are convenient to use with the majority of 2DOD methods, treating depth information similarly to the three RGB channels [37]. However, as with monocular images, RGB-D faces the problem of occlusion since the scene is only presented through a single perspective. In addition, objects are presented at different scales depending on their position in space.

5.3 Point cloud

The data acquired by 3D sensors can be converted to a more generic structure, the point cloud. It is a three-dimensional set of points that has an unorganized spatial structure [101]. The point cloud is defined by its points, which comprise the spatial coordinates of a sampled surface of an object. However, further geometric and visual attributes can be added to each point [37].

As described in Sect. 2.3, point clouds can be obtained from LiDAR sensors or transformed RGB-D images. Yet, point clouds obtained from RGB-D images are typically noisier and sparser compared to LiDAR-generated point clouds due to low resolution and perspective occlusion [92].
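To illustrate the conversion from an RGB-D image to a point cloud mentioned above, the following sketch back-projects a depth map through the standard pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) are placeholder values that would normally come from the sensor calibration.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a depth map (meters per pixel) into an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx   # pinhole model: u = fx * x / z + cx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels without a valid depth

# placeholder intrinsics for a 640x480 depth sensor
cloud = depth_to_point_cloud(np.random.uniform(0.5, 5.0, (480, 640)),
                             fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (307200, 3) when every pixel carries a valid depth
```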

The point cloud offers a fully three-dimensional reconstruction of the scene, providing rich geometric, shape and scale information. This enables the extraction of meaningful features that boost the detection performance. Nevertheless, point clouds face severe challenges which are based on their nature and processability. Common deep learning operations, which have proven to be the most effective techniques for object detection, require data to be organized in a tensor with a dense structure (e.g., images, videos) which is not fulfilled by point clouds [169]. In particular, point clouds exhibit irregular, unstructured and unordered data characteristics [8].

Fig. 2 Characteristics of point cloud data: (a) irregular collection with sparse and dense regions, (b) unstructured cloud of independent points without a fixed grid, (c) unordered set of points that are invariant to permutation [8]

  • Irregular means that the points of a point cloud are not evenly sampled across the scene. Hence, some of the regions have a denser distribution of points than others. Especially distant objects are usually represented sparsely by very few points because of the limited range recording ability of current sensors.

  • Unstructured means that points are not on a regular grid. Accordingly, the distances between neighboring points can vary. In contrast, pixels in an image always have a fixed position to their neighbors, which is evenly spaced throughout the image.

  • Unordered means that the point cloud is just a set of points that is invariant to permutations of its members. Particularly, the order in which the points are stored does not change the scene that it represents. In other formats, e.g., an image, data usually get stored as a list [8, 106]. Permutation invariance, however, means that a point cloud of N points has N! permutations and the subsequent data processing must be invariant to each of these different representations.

Figure 2 provides an illustrative overview of the three challenging characteristics of point cloud data.
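The permutation invariance described above can be handled by symmetric aggregation functions, i.e., operations whose result does not depend on the order of their arguments. The small numpy illustration below (not taken from any of the surveyed models) shows that an element-wise maximum yields the same global feature for any ordering of the points; point-based networks discussed later (Sect. 7.3) build on exactly this property.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))        # an unordered set of 3D points
shuffled = rng.permutation(points)         # one of the N! possible orderings

# a symmetric (order-independent) aggregation, e.g., an element-wise maximum,
# produces the same global feature regardless of the point order
print(np.allclose(points.max(axis=0), shuffled.max(axis=0)))  # True

# a non-symmetric operation, e.g., concatenating the first ten points,
# would change with the ordering and is therefore unsuited for raw point sets
```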

6 Data representation

To ensure a correct processing of the input data by 3DOD models, it must be available in a suitable representation. Due to the different data formats, 3DOD data representations can be generally classified into 2D and 3D representations. Beyond that, we assign 2.5D representations to either 2D if they come in an image format (regardless of the number of channels), or 3D if the data are described in a spatial structure. 2D representations generally cover (i) monocular representations (Sect. 6.1) and (ii) RGB-D front views (Sect. 6.2). 3D representations cover (iii) grid cells (Sect. 6.3) and (iv) point-wise representations (Sect. 6.4).

6.1 Monocular representation

Despite the lack of information about the range, monocular representation enjoys a certain popularity among 3DOD methods due to its efficient computation. Additionally, the required cameras are affordable and simple to set up. Hence, monocular representations are attractive for applications where resources are limited [56, 61].

The vast majority of monocular representations use the widely known frontal view, which is limited by the viewing angle of the camera. In contrast, Payen de La Garanderie et al. [63] tackle monocular 360° panoramic imagery using equirectangular projections instead of the rectilinear projections of conventional camera images. To enable true 360° processing, they fold the panoramic imagery into a 360° ring by stitching the left and right edges together with a 2D convolutional padding operation.
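This edge-stitching step can be approximated with circular padding along the horizontal image axis before each convolution, so that the left and right borders of the equirectangular image are treated as neighbors; the following sketch is one possible implementation of this idea, not necessarily the exact operation used in [63].

```python
import torch
import torch.nn.functional as F

def conv_360(x: torch.Tensor, conv: torch.nn.Conv2d) -> torch.Tensor:
    """Apply a 2D convolution with circular (wrap-around) padding in width only."""
    k = conv.kernel_size[1] // 2
    # pad = (left, right, top, bottom); wrap horizontally, zero-pad vertically
    x = F.pad(x, (k, k, 0, 0), mode="circular")
    x = F.pad(x, (0, 0, k, k), mode="constant", value=0.0)
    return conv(x)

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=0)
panorama = torch.randn(1, 3, 256, 1024)    # equirectangular 360-degree image
print(conv_360(panorama, conv).shape)      # torch.Size([1, 8, 256, 1024])
```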

Only a small proportion of 3DOD monocular approaches exclusively use an image representation (e.g., Jörgensen et al. [56]) for 3D spatial estimations. Most models leverage additional data and information to substitute the missing depth information (more details in Sect. 7.2.2). Additionally, representation fusion techniques are quite popular to compensate for these disadvantages. For instance, 2D candidates are initially detected from monocular images before a 3D bounding box for the spatial object is predicted based on the initial proposals. In general, the latter step processes an extruded 3D subspace derived from the 2D bounding box. In case of representation fusion, the monocular representation is usually not used for full 3DOD but rather as a support for increasing efficiency by limiting the search space for heavy three-dimensional computations or for delivering additional features such as texture and color. These methods are described in depth in Sect. 8 (Fusion Approaches).

6.2 RGB-D front view

RGB-D data can either be transformed to a point cloud or kept in its natural form of four channels. Therefore, we can distinguish between an RGB-D (3D) representation (e.g., Chen et al. [18]; Tang and Lee [139], Ferguson and Law [30]), which exploits the depth information in its spatial form of a point cloud, and an RGB-D (2D) representation (e.g., Chen et al. [17], He et al. [49], Li et al. [75], Rahman et al. [112], Luo et al. [92]), which holds an additional 2D depth map in an image format.

Thus, as mentioned in Sect. 5.2, RGB-D (2D) images represent monocular images with an appended fourth channel of the depth map. The data are compressed along the z-axis generating a dense projection in the frontal view. The 2D depth image can be processed similarly to RGB channels by common 2D CNN models.

The RGB-D (2D) representation is often referred to as front view (FV) in 3DOD research. However, front and range view (RV) are occasionally equated with each other in current research. For clarification: In this work, the FV is considered as an RGB-D image generated by a TOF, stereo or similar camera, while the RV is defined as the natural frontal projection of a point cloud (see also Sect. 6.3.2).

6.3 Grid cells

Processing a point cloud (cf. Section 5.3) poses a particular challenge for CNNs because convolution operations require a structured grid, which is not present in point cloud data [8]. Thus, to take advantage of advanced deep learning methods and leverage highly informative point clouds, they must first be transformed into a suitable representation.

Current research presents two ways of handling point clouds. The first and more natural solution is to fit a regular grid onto the point cloud, producing a grid cell representation. Many approaches do so by either quantizing point clouds into 3D volumetric grids (Sect. 6.3.1) (e.g., Song and Xiao [134], Zhou and Tuzel [169], Shi et al. [125]) or by discretizing them into (multi-view) projections (Sect. 6.3.2) (e.g., Li et al. [72], Chen et al. [19], Beltrán et al. [9], Zheng et al. [165]).

The second and more abstract way to solve the point cloud representation problem is to process the point cloud directly by grouping points into point sets. This approach does not require convolutions and thus allows the point cloud to be processed without transformation, yielding a point-wise representation (Sect. 6.4) (e.g., Qi et al. [108], Shi et al. [124], Huang et al. [52]).

Along these directions, several state-of-the-art methods have been proposed for 3DOD pipelines, which we will describe exemplarily in the following.

6.3.1 Volumetric grids

The main idea behind the volumetric representation is to subdivide the point cloud into equally distributed grid cells, called voxels, which allows further processing in a structured form. For this purpose, the point cloud is converted into a 3D fixed-size voxel structure of dimensions (x, y, z). The resulting voxels either contain raw points or already encode the occupied points into a feature representation such as point density or intensity per voxel [8]. Figure 3 illustrates the transformation of a point cloud into a voxel-based representation.

Fig. 3 Voxelization of a point cloud [8]
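As a rough sketch of this voxelization step, the snippet below assigns each point of a cloud to a fixed-size voxel grid and encodes every voxel by its point density; the grid extent and voxel size are arbitrary example values rather than settings from a specific model.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float,
             range_min: np.ndarray, range_max: np.ndarray) -> np.ndarray:
    """Quantize an (N, 3) point cloud into a dense grid of per-voxel point counts."""
    grid_shape = np.ceil((range_max - range_min) / voxel_size).astype(int)
    grid = np.zeros(grid_shape, dtype=np.float32)
    # keep only points inside the chosen range
    mask = np.all((points >= range_min) & (points < range_max), axis=1)
    idx = ((points[mask] - range_min) / voxel_size).astype(int)
    # point density per voxel as a simple feature encoding
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid

pts = np.random.uniform(0.0, 40.0, size=(120_000, 3))
grid = voxelize(pts, voxel_size=0.4,
                range_min=np.array([0.0, 0.0, 0.0]),
                range_max=np.array([40.0, 40.0, 4.0]))
print(grid.shape)  # (100, 100, 10)
```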

Usually, voxels have a cuboid shape (e.g., Li [70], Zhou and Tuzel [169], Ren and Sudderth [118]). However, there are also approaches applying other forms such as pillars (e.g., Lang et al. [65], Lehner et al. [68]).

Feature extraction networks utilizing voxel-based representations are computationally more efficient and reduce memory needs. Instead of extracting low- and high-dimensional features for each point individually, clusters of points (i.e., voxels) are used to extract such features [31]. Despite reducing the dimensionality of a point cloud through discretization, the spatial structure is kept and allows to make use of the geometric features of the scene.

Volumetric approaches using cuboid transformation of the point cloud scene have been used, for example, by Song and Xiao [134], Li [70], Engelcke et al. [29], Zhou and Tuzel [169] and Ren and Sudderth [118].

To speed up computation, Lang et al. [65] propose a pillar-based voxelization of the point cloud, instead of using the conventional cubical quantization. The vertical column representation allows to skip expensive 3D convolutions in the following steps, since pillars have unlimited spatial extent in z-direction and can therefore be projected directly onto 2D pseudo-images. As a result, all feature extraction operations are processable by efficient 2D CNNs. Kuang et al. [62] extend this approach even further by using a learning-based feature encoding approach as opposed to relying on handcrafted feature initialization.

6.3.2 Projection-based representation

In addition to the need to transform the point cloud into a processable state, several approaches seek to leverage the expertise and power of 2DOD processing. Especially for reasons of better inference times for point cloud models, projection approaches became popular. They project the point cloud onto an image plane whilst preserving the depth information. Subsequently, the representation can be processed by efficient 2D extractors and detectors. Commonly used representations are the previously mentioned range view, using an image plane projection, and the bird’s eye view, projecting the point cloud onto the ground plane.

Range View: Whereas the 2D FV corresponds to monocular and stereo cameras, the RV is a native 2D representation of the LiDAR data. The point cloud is projected onto a cylindrical 360° panoramic plane exactly as the data are captured by the LiDAR sensor. Since the LiDAR projection is still not in a directly processable state and does not contain any discriminative features such as RGB information, the projected RV is partitioned into a fine-grained grid and encoded in the successive feature initialization step (see Sect. 7.4.1).

Meyer et al. [97] and Liang et al. [81] emphasize that the naturally compact RV results in a more efficient computation in comparison to other projections. In addition, the information loss of the projection is considerably small since the RV constitutes the native representation of a rotating LiDAR sensor [81]. At the same time, the RV suffers from distorted object size and shape due to its cylindrical image character [157]. Inevitably, RV representations face the same problem as camera images, in that the size of objects is closely related to their range and occlusions may occur due to perspective [168]. Zhou et al. [81] argue that the “range image is a good choice for extracting initial features, but not a good choice for generating anchors”. Moreover, its performance in 3DOD models does not match that of state-of-the-art bird’s eye view (BEV) projections. Nevertheless, the RV enables more accurate detection of small objects [96, 97].

Exemplary approaches using RV are proposed by Li et al. [72], Chen et al. [19], Zhou et al. [168] and Liang et al. [81].

Bird’s Eye View: While the RV discretizes the data onto a panoramic image plane, the BEV is an orthographic view of the point cloud scene, projecting the points onto the ground plane. Therefore, the data get condensed along the y-axis. Figure 4 shows an exemplary illustration of a BEV projection.

Fig. 4 Visualization of bird’s eye view projection of a point cloud without discretization into grid cells [156]

Chen et al. [19] were among the first to introduce the BEV to 3DOD in their seminal work MV3D. They organized the point cloud as a set of voxels and then transformed each voxel column through an overhead perspective into a 2D grid with a specific resolution and encoding for each cell. As a result, a dense pseudo-image of the ground plane is generated which can be processed by standard image detection architectures.
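The following sketch illustrates such a BEV discretization in the spirit of the procedure described above, collapsing the points onto the ground plane and storing a simple per-cell encoding (here the maximum height and the point count); the resolution, range and encoding are illustrative assumptions rather than the exact MV3D configuration.

```python
import numpy as np

def bev_pseudo_image(points: np.ndarray, cell_size: float = 0.1,
                     x_range=(0.0, 40.0), y_range=(-20.0, 20.0)) -> np.ndarray:
    """Project an (N, 3) point cloud onto the ground plane as a 2-channel BEV map."""
    w = int((x_range[1] - x_range[0]) / cell_size)
    h = int((y_range[1] - y_range[0]) / cell_size)
    bev = np.zeros((h, w, 2), dtype=np.float32)  # channel 0: max height, 1: point count
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    xs = ((points[mask, 0] - x_range[0]) / cell_size).astype(int)
    ys = ((points[mask, 1] - y_range[0]) / cell_size).astype(int)
    np.maximum.at(bev[:, :, 0], (ys, xs), points[mask, 2])  # max height per occupied cell
    np.add.at(bev[:, :, 1], (ys, xs), 1.0)                  # point density per cell
    return bev

pts = np.column_stack([np.random.uniform(0.0, 40.0, 100_000),    # x: forward
                       np.random.uniform(-20.0, 20.0, 100_000),  # y: lateral
                       np.random.uniform(0.0, 3.0, 100_000)])    # z: height
print(bev_pseudo_image(pts).shape)  # (400, 400, 2)
```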

Unlike RV, BEV offers the advantage that the object scales are preserved regardless of the range. Further, BEV perspective eases the typical occlusion problem of object detection since objects are displayed separately from each other in a free-standing position [168]. These advantages let the networks exploit priors about the physical dimension of objects, especially for anchor generation [157].

On the other hand, BEV data gets sparse at larger distances, which makes it unfavorable for small objects [65, 153]. Furthermore, the assumption that all objects lie on one mutual ground plane often turns out to be infeasible in reality, especially in indoor scenarios [168]. Also, the often coarse voxelization of the BEV may remove fine-granular information, leading to inferior detection of small objects [96].

Exemplary models using BEV representation can be found in the work from Wang et al. [149], Beltrán et al. [9], Liang et al. [80], Yang et al. [157], Simon et al. [130], Zeng et al. [162], Li et al. [74], Ali et al. [2], He et al. [48], Liang et al. [81] and Wang et al. [145].

Multi-View Representation: Often BEV and RV are not used as single data representations but as a multi-view approach, meaning that RV and BEV but also monocular-based images are combined to represent the spatial information of a point cloud.

Chen et al. [19] were the first to integrate this concept into a 3DOD pipeline, followed by many other models adapting to fuse 2D representations from different perspectives (e.g., Ku et al. [60], Li et al. [74], Wang et al. [145, 146], Liang et al. [81]).

Although the representations of BEV and RV are compact and efficient, they are always limited by the loss of information, originating from the discretization of the point cloud into a fixed number of grid-cells and the respective feature encoding of the cells’ points.

6.4 Point-wise representation

Either way, discretizing the point cloud into a projection or volumetric representation inevitably leads to information loss. Against this backdrop, Qi et al. [106] introduced PointNet, and thus a new way to consume the raw point cloud in its unstructured form while having access to all of the recorded information.

In point-wise representations, points are isolated and sparsely distributed in a spatial structure representing the visible surface, while preserving precise localization information. PointNet handles this representation by aggregating neighboring points and extracting a compressed feature from the low-dimensional point features of each set, enabling a raw point-based representation for 3DOD models. A more detailed description of PointNet and its successor PointNet++ is given in Sects. 7.3.1 and 7.3.2, respectively.
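A heavily simplified sketch of this principle (not the original PointNet architecture of Qi et al. [106]): a shared per-point MLP lifts every point into a higher-dimensional feature space, and a symmetric max pooling aggregates the set into a single global feature that is invariant to the point order.

```python
import torch
import torch.nn as nn

class TinyPointFeatureNet(nn.Module):
    """Shared per-point MLP followed by symmetric max pooling (PointNet-style)."""
    def __init__(self, point_dim: int = 3, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(point_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3); the same MLP is applied to every point
        per_point = self.mlp(points)
        # max pooling over the point dimension is permutation invariant
        global_feat, _ = per_point.max(dim=1)
        return global_feat  # (batch, feat_dim)

net = TinyPointFeatureNet()
cloud = torch.randn(2, 1024, 3)
perm = cloud[:, torch.randperm(1024), :]
print(torch.allclose(net(cloud), net(perm)))  # True: order does not matter
```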

However, PointNet was developed and tested on point clouds containing 1,024 points [106], whereas realistic point clouds captured by a standard LiDAR sensor such as Velodyne’s HDL-64E usually consist of 120,000 points per frame. Thus, applying PointNet on the whole point cloud is a time- and memory-consuming operation. In practice, point clouds are rarely consumed in their entirety. As a consequence, further techniques are required to improve efficiency, such as cascading fusion approaches (see Sect. 8.1) that crop point clouds to the region of interest and pass only subsets of the point cloud to the point-based feature extraction stage.

In general, it can be stated that point-based representations retain more information than voxel- or projection-based methods. On the downside, point-based methods are inefficient when the number of points is large. Yet, a reduction of the point cloud, as in cascading fusion approaches, always comes with a decrease in information. In summary, maintaining both efficiency and performance is not achievable for any of the representations to date.

7 Feature extraction

Feature extraction is emphasized as the most crucial part of the 3DOD pipeline and the focus of much current research (Fernandes et al. [31]). It follows the paradigm of reducing the dimensionality of the data representation with the intention of representing the scene by a robust set of features. Features generally depict the unique characteristics of the data used to bridge the semantic gap, which denotes the difference between the human comprehension of the scene and the model’s prediction. Suitable features are essential for optimal feature learning, which in turn has a great impact on the detection in later steps. Hence, the goal of feature extraction is to provide a robust semantic representation of the visual scene that ultimately leads to the recognition and detection of different objects [164].

As with 2DOD, feature extraction approaches can be roughly divided into (i) handcrafted feature extraction (Sect. 7.1) and (ii) feature learning via deep learning methods. Regarding the latter, we can distinguish the broad body of 3DOD research depending on the respective data representation. Hence, feature learning can be performed either in a (ii-a) monocular (Sect. 7.2), (ii-b) point-wise (Sect. 7.3), (ii-c) segment-wise (Sect. 7.4) or in a (ii-d) fusion-based approach (Sect. 8).

7.1 Handcrafted feature extraction

Although the vast majority of 3DOD approaches have moved toward hierarchical deep learning, which can generate more complex and robust features, there are still cases where features are created manually.

Handcrafted feature extraction differs from feature learning in that the features are selected individually and are usually used directly for the final determination of the scene. Features like edges or corners are tailored by hand and serve as the ultimate characteristics for object detection. There is no algorithm that independently learns how these features are constructed or how they can be combined, as is the case with CNNs. The feature initialization already represents the feature extraction step. Often, these handcrafted features are then scored by SVMs or random forest classifiers, which are deployed exhaustively over the entire image or scene.

A few exemplary 3DOD models using handcrafted feature extraction shall be introduced in the following. For instance, Song and Xiao [134] use four types of 3D features which they exhaustively extract from each voxel cell, namely point density, 3D shape feature, 3D normal feature and truncated signed distance function feature. The features are used to handle the problem of self-occlusion (see Fig. 5).

Fig. 5 Visualization of manually crafted features [134]

Wang and Posner [144] also use a fixed-dimensional feature vector containing the mean and variance of the reflectance values of all points that lie within a voxel and an additional binary occupancy feature. The features are not processed further and are used directly for detection purposes in a voting scheme.

Ren and Sudderth [116] introduce their discriminative cloud of oriented gradients (COG) descriptor, which they further develop in their subsequent work by proposing LSS (latent support surfaces) [117] and COG 2.0 [118]. Additionally, the approach was also adopted by Liu et al. [88]. In general, the COG feature is able to describe complex 3D appearances within every orientation, as it consists of a gradient computation, 3D orientation bins and a normalization. For each proposal, the point cloud density, surface normal features and a COG feature are calculated in a sliding-window fashion and then scored with pre-trained SVMs.

In addition, manually configured features are typically used in combination with matching algorithms for detection purposes. For example, Yamazaki et al. [154] create gradient-based image features by applying principal component analysis to the point cloud, which are subsequently used to compute the normalized cross-correlation. The main idea is to use the spatial relationships between image projection directions to discover the optimal template matching for detection. Similarly, Teng and Xiao [140] use handcrafted features, namely color histogram and key point histogram for template matching purposes. Another example is the approach proposed by He et al. [49]. The authors create silhouette gradient orientations from RGB and surface normal orientations from depth images.

7.2 Monocular feature learning

Probably the most difficult form of feature extraction for 3DOD is exercised in monocular representation. Since there is no direct depth data, well-informed features are difficult to obtain, and thus, most approaches attempt to compensate for this lack of information with various depth substitution techniques.

7.2.1 Solely monocular approaches

In our literature corpus, there is only one paper which takes up the challenge of performing 3DOD exclusively with monocular images [56]. The authors use a single-shot detection (SSD) framework ([86]; see also Sect. 9.3.2) to generate per-object canonical 3D bounding box parameters. They start from a classical bounding box detector and add new output heads for 3DOD features. More specifically, they add distance, orientation, dimension and 3D corner outputs to the already available class score and 2D bounding box heads. The feature extraction itself is the same as in the 2D framework.
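The general idea of attaching additional 3D output heads to a 2D single-shot detector can be sketched as follows; the channel sizes, number of anchors and chosen parameterization are assumptions for illustration and do not reproduce the architecture of [56].

```python
import torch
import torch.nn as nn

class Det3DHeads(nn.Module):
    """Per-location prediction heads on top of a shared 2D feature map."""
    def __init__(self, in_ch: int = 256, num_classes: int = 3, num_anchors: int = 6):
        super().__init__()
        def head(out_per_anchor: int) -> nn.Conv2d:
            return nn.Conv2d(in_ch, num_anchors * out_per_anchor, kernel_size=3, padding=1)
        self.cls_score = head(num_classes)  # classical 2D outputs ...
        self.box2d = head(4)
        self.distance = head(1)             # ... plus additional 3D outputs
        self.orientation = head(2)          # e.g., sin/cos of the yaw angle
        self.dimensions = head(3)           # height, width, length

    def forward(self, feat: torch.Tensor) -> dict:
        # apply every head to the shared feature map
        return {name: layer(feat) for name, layer in self.named_children()}

outs = Det3DHeads()(torch.randn(1, 256, 38, 38))
print({k: tuple(v.shape) for k, v in outs.items()})
```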

Models that use only monocular inputs follow the straightforward way of predicting spatial parameters directly without any range information. While the objectives of dimension and pose estimation are relatively easy to fulfill, as they rely heavily on appearance features in 2D images, location estimation remains challenging since monocular inputs do not provide a natural feature for spatial location [46, 87].

7.2.2 Informed monocular approaches

The absence of depth information cannot be fully compensated for in purely monocular approaches. Therefore, many state-of-the-art monocular models supplement the 2D representation by an auxiliary depth estimation network or additional external data and information. The idea is to use prior knowledge of the target objects and the scene, such as shape, context and occlusion patterns to compensate for missing depth data.

These depth substitution techniques can be roughly divided into (i) depth estimation networks, (ii) geometric constraints and (iii) 3D template matching. Often, they are used in combination.

While depth estimation networks are applied directly to the original representation to generate a new input for the model, geometric constraints and 3D model matching tackle the lack of range information in the later detection steps of the pipeline.

Depth Estimation Networks: Depth estimation networks generate informed depth features or even entirely new representations that possess range information, such as point clouds derived synthetically from monocular imagery. These new representations are then subsequently exploited by depth-aware models. A representative model is Mono3D [16]. It uses additional instance and semantic segmentation along with further features to reason about the pose and location of 3D objects.

Srivastava et al. [136] modify the generative adversarial network approach of BirdNet (Beltrán et al. [9]) to create a BEV projection from the monocular representation. All following operations such as feature extraction and prediction are then performed on this new representation.

Similarly, Roddick et al. [119] transform a monocular representation to the BEV perspective. They introduce an orthographic feature transformation network that maps the features from the RGB perspective to a 3D voxel map. The features of the voxel map are eventually reduced to the 2D orthographic BEV feature map by consolidation along the vertical dimension.

Payen de La Garanderie et al. [63] adapt the monocular depth recovery technique of Godard et al. [40], called Mono Depth, for their special case of monocular 360\(^\circ \) panoramic processing. They predict a depth map by training a CNN on left-right consistency inside stereo image pairs. However, at the time of inference, the model only requires single monocular images to estimate a dense depth map.

Even more advanced, a few approaches devote themselves to generate a point cloud from monocular images. Xu and Chen [152] use a stand-alone network for disparity prediction which is similarly based on Mono Depth to predict a depth map. Unlike Payen de La Garanderie et al. [63], they further process the depth map to estimate a LiDAR-like point cloud.

An almost identical procedure is presented by Ma et al. [93]. First, they generate a depth map through a self-defined CNN and then proceed to generate a point cloud by using the camera calibration files. While Xu and Chen [152] mainly take the depth data as auxiliary information of RGB features, Ma et al. [93] focus on taking the generated depth as a core feature and explicitly using its spatial information.

Similarly, Weng and Kitani [150] adapt the deep ordinal regression network (DORN) by Fu et al. [35] for their pseudo LiDAR point cloud generation and then exploit a Frustum-PointNet-like model [108] for the object detection task. Further information on the Frustum PointNet approach is given in Sect. 8.1.

Instead of covering the entire scene in a point-based representation, Ku et al. [61] primarily reduce the space by lightweight predictions and then only transform the candidate boxes to point clouds, preventing redundant computation. To do so, they exploit instance segmentation and available LiDAR data for training to reconstruct a point cloud in a canonical object coordinate system. A similar approach for instance depth estimation is pursued by Qin et al. [110].

Geometric Constraints: Depth estimation networks offer the advantage of closing the gap of missing depth in a direct way. Yet, errors and noise occur during depth estimation, which may lead to biased overall results and contribute to a limited upper bound of performance [6, 11]. Hence, various methods try to skip the naturally ill-posed depth estimation and tackle monocular 3DOD as a geometrical problem of mapping 2D into 3D space.

Especially in autonomous driving applications, the 3D box proposals are often constrained by a flat ground assumption, namely the street. It is assumed that all possible targets are located on this plane, since automotive vehicles do not fly. Therefore, these approaches force the bounding boxes to lie on the ground plane. In indoor scenarios, on the other hand, objects are located at various height levels. Hence, the ground plane constraint does not receive the same attention as in autonomous driving applications. Nevertheless, plane fitting is frequently applied in indoor scenarios to determine the room orientation.

Zia et al. [170] were among the first to assume a common ground plane in their approach, which helped them to extensively reconstruct the scene. Further, a ground plane drastically reduces the search space by leaving only two degrees of freedom for translation and one for rotation. Other representative examples for implementing ground plane assumptions in monocular representations are given by Chen et al. [16], Du et al. [28] and Gupta et al. [46]. All of them leverage the random sample consensus (RANSAC) approach by Fischler and Bolles [33], a popular technique that is applied for ground plane estimation.
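A compact sketch of such a RANSAC plane fit is shown below: three points are repeatedly sampled, a candidate plane is estimated, and the plane supported by the most inliers is kept. The iteration count and inlier threshold are illustrative choices, not values taken from the cited works.

```python
import numpy as np

def ransac_ground_plane(points: np.ndarray, iters: int = 200,
                        threshold: float = 0.1):
    """Fit a plane n·p + d = 0 to an (N, 3) point cloud with RANSAC."""
    rng = np.random.default_rng(0)
    best_normal, best_d, best_inliers = None, 0.0, -1
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:            # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ p1
        inliers = np.sum(np.abs(points @ normal + d) < threshold)
        if inliers > best_inliers:
            best_normal, best_d, best_inliers = normal, d, inliers
    return best_normal, best_d

# mostly flat ground around z = 0 with some off-plane clutter
pts = np.random.uniform(-10.0, 10.0, size=(5000, 3))
pts[:4500, 2] = np.random.normal(0.0, 0.02, size=4500)
normal, d = ransac_ground_plane(pts)
print(normal, d)  # normal close to (0, 0, ±1), d close to 0
```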

A different geometrical approach that is used to recover the under-constrained monocular 3DOD problem is to establish consistency between the 2D and 3D scenes. Mousavian et al. [99], Li et al. [71], Liu et al. [84] and Naiden et al. [100] do so by projecting the 3D bounding box onto a previously determined 2D bounding box. The core notion is that the 3D bounding box should fit tightly to at least one side of its corresponding 2D box detection. Naiden et al. [100], for instance, use a least-squares method for the fitting task.
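The tight-fit constraint can be checked by projecting the eight corners of a candidate 3D box into the image and comparing the enclosing rectangle with the independently detected 2D box. The sketch below assumes camera-frame coordinates and illustrative pinhole intrinsics.

```python
import numpy as np

def project_box_corners(corners_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (8, 3) camera-frame box corners to (8, 2) pixel coordinates."""
    uvw = corners_3d @ K.T                 # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]

def enclosing_2d_box(corners_2d: np.ndarray) -> np.ndarray:
    u_min, v_min = corners_2d.min(axis=0)
    u_max, v_max = corners_2d.max(axis=0)
    return np.array([u_min, v_min, u_max, v_max])

# illustrative intrinsics and a 3D box roughly 10 m in front of the camera
K = np.array([[700.0, 0.0, 620.0],
              [0.0, 700.0, 190.0],
              [0.0, 0.0, 1.0]])
corners = np.array([[x, y, z] for x in (-1.0, 1.0)
                    for y in (-0.75, 0.75) for z in (9.0, 11.0)])
proj_box = enclosing_2d_box(project_box_corners(corners, K))
# the 2D-3D consistency constraint requires proj_box to fit tightly
# around the separately detected 2D bounding box
print(proj_box)
```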

Other methods deploy 2D-3D consistency by incorporating geometric constraints such as room layout and camera pose estimations through an entangled 2D-3D loss function (e.g., Huang et al. [51], Simonelli et al. [131], Brazil and Liu [11]). For example, Huang et al. [51] define the 3D object center through corresponding 2D and camera parameters. A physical overlap between 3D objects and the 3D room layout is then penalized. Simonelli et al. [131] first disentangle the 2D and 3D detection losses to optimize each loss individually. Subsequently, they also leverage the correlation between 2D and 3D in a combined multi-task loss.

Another approach is presented by Qin et al. [111]. The authors exploit triangulation, which is well known for estimating 3D geometry in stereo images. They use 2D detection of the same object in a left and right monocular image for a newly introduced anchor triangulation, where they directly localize 3D anchors based on the 2D region proposals.

3D template matching: An additional way of handling monocular representations for 3DOD is to match the images with 3D object templates. The idea is to have a database of object images from different viewpoints and their underlying 3D depth features. One popular approach for creating templates is to render synthetic images from computer-aided design (CAD) models, whereby images are created from all sides of the object. The monocular input image is then searched and matched against this template database. On this basis, the object pose and location can be inferred.

Fidler et al. [32] address the task of monocular object detection by representing an object as a deformable 3D cuboid. The 3D cuboid consists of faces and parts, which are allowed to deform with respect to their anchors in the 3D box. Each of these faces is modeled by a 2D template that corresponds to the appearance of the object from an orthogonal point of view. It is assumed that the 3D cuboid can be rotated so that the image view from a defined set of angles can be projected onto the respective cuboid's face and subsequently scored by a latent SVM.

Chabot et al. [13] initially use a network to generate 2D detection results, vehicle part coordinates and a 3D box dimension proposal. Thereafter, they match the dimensions against a 3D CAD dataset consisting of a fixed number of object models to assign the corresponding 3D shapes in the form of manually annotated vertices. Those 3D shapes are then used to perform 2D-to-3D pose matching in order to recover 3D orientation and location. Barabanau et al. [6] base their approach on sparse but salient features, namely 2D key points. They match CAD templates based on 14 key points. After assigning one of five distinct geometric classes, they perform instance depth estimation via a vertical plane passing through two of the visible key points to lift the predictions into 3D space.

3D templates provide a potentially powerful source of information. However, a sufficient number of models is not always available for each object class. Therefore, these methods tend to focus on a small number of classes. The limitation to shapes covered by the selection of 3D templates makes it difficult to generalize or extend 3D template matching to classes for which no models are available [61].

In summary, monocular 3DOD achieves promising results. Nevertheless, the lack of depth data prevents monocular 3DOD from reaching state-of-the-art results. The depth substitution techniques may limit the detection performance because errors in depth estimation, geometrical assumptions or template matching are propagated to the final 3D box prediction. In addition, informed monocular approaches may be comparatively vulnerable to external attacks. Cheng et al. [21] showed that both physical and digital attacks on depth estimation networks have serious impacts on 3DOD performance.

7.3 Point-wise feature learning

As described above, the application of deep learning techniques on point cloud data is not straightforward due to the irregularity of the point cloud (cf. Section 5.3). Many existing methods try to leverage the expertise of convolutional feature extraction by either projecting point clouds onto 2D image views or by converting them into regular grids of voxels. However, projecting point clouds onto a specific viewpoint discards valuable information, which is particularly important in crowded scenes. The voxelization, on the other hand, leads to high computational costs due to the sparse nature of point clouds and also suffers from information loss in point-crowded voxels. Either way, manipulating the original data may have a negative effect.

To overcome the problem of irregularity in point clouds with an alternative approach, Qi et al. [106] proposed PointNet, which is able to learn point-wise features directly from the raw point cloud. It is based on the assumption that points which lie close to each other can be grouped together and compressed into a single point. Shortly after, Qi et al. [109] introduced its successor PointNet++, adding the ability to capture local structures in the point cloud. Both networks were originally designed for classification tasks on the whole point cloud, in addition to being able to predict semantic classes for each point of the point cloud. Thereafter, Qi et al. [108] introduced a way to integrate PointNet into 3DOD by proposing Frustum-PointNets. By now, many of the state-of-the-art 3DOD methods are based on the general PointNet architectures. Therefore, it is crucial to understand the underlying architecture and how it is used in 3DOD methods.

7.3.1 PointNet

PointNet consists of three key modules: (i) a max-pooling layer serving as a symmetric function, (ii) a local and global information combination structure in the form of a multi-layer perceptron (MLP) and (iii) two joint alignment networks for the alignment of input points and point features, respectively [106].

To deal with the unordered nature of point clouds, PointNet is built with symmetric functions in the form of max-pooling operations. Symmetric functions produce the same output regardless of the input order. The max-pooling operation results in a global feature vector that aggregates information from all points of the point cloud. Since the max-pooling function follows a “winner takes all” paradigm, it does not consider local structures, which is the main limitation of PointNet.

Subsequently, PointNet accommodates an MLP that uses the global feature vector for the classification task. Beyond that, the global features can also be combined with local point features for segmentation purposes.

The joint alignment networks make the predictions invariant to geometric transformations of the point cloud (e.g., rotation and translation). PointNet aligns the input points to a canonical space by pose normalization through a spatial transformer network, called T-Net. The same operation is deployed again by a separate network for the alignment of the point features in feature space. Both operations are crucial for the network's predictions to be invariant to transformations of the input point cloud.
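To make the basic design concrete, the following minimal PyTorch sketch shows the two core ingredients, a shared per-point MLP and a symmetric max-pooling, while omitting the T-Net alignment modules; it is an illustrative reduction of the architecture, not the reference implementation of Qi et al. [106].

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style classifier: a shared per-point MLP followed by
    an order-invariant max-pooling over all points (T-Nets omitted)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(      # applied identically to every point
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1),
        )
        self.head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                                  nn.Linear(256, num_classes))

    def forward(self, points):                        # points: (B, N, 3)
        x = self.point_mlp(points.transpose(1, 2))    # (B, 1024, N)
        global_feat = x.max(dim=2).values             # symmetric aggregation
        return self.head(global_feat)                 # class logits
```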

7.3.2 PointNet++

PointNet++ is built in a hierarchical manner on several set abstraction layers to address the original PointNet's missing ability to consider local structures. At each level, a set of points is further abstracted to produce a new set with fewer elements, thereby summarizing the local context. Each set abstraction layer is in turn composed of three key layers: (i) a sampling layer, (ii) a grouping layer and (iii) a PointNet layer [109]. Figure 6 provides an overview of the architecture.

Fig. 6: Architecture of PointNet++ [109]

The sampling layer is employed to reduce the resolution of the points. PointNet++ uses farthest point sampling (FPS), which iteratively selects the point that is most distant from the already sampled points [8]. Thereby, FPS identifies and retains the centroids of the local regions for a set of points.
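A minimal NumPy sketch of farthest point sampling is given below; it follows the standard iterative farthest-point rule and is meant as an illustration rather than the optimized implementation used in PointNet++.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already selected set,
    yielding n_samples well-spread centroid indices from an (N, 3) array."""
    n = len(points)
    selected = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)                   # distance to nearest selected point
    selected[0] = np.random.randint(n)          # arbitrary seed point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = dist.argmax()             # farthest remaining point
    return selected
```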

Subsequently, the grouping layer forms local patches around the representative points obtained from the sampling operation. More precisely, it constructs local region sets by finding neighboring points around the sampled centroids; these sets are further exploited to compute the local feature representation of the neighborhood. PointNet++ adopts ball query, which searches a fixed-radius sphere around each centroid and groups all points lying within it.
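The ball query step can be sketched as follows in NumPy, assuming an (N, 3) point array and previously sampled centroids; the cap on the number of neighbors and the fallback to the nearest point are illustrative choices.

```python
import numpy as np

def ball_query(points, centroids, radius, max_neighbors):
    """For each centroid, gather the indices of up to max_neighbors points
    that lie within a sphere of the given radius around it."""
    groups = []
    for c in centroids:                          # centroids: (M, 3), points: (N, 3)
        d2 = np.sum((points - c) ** 2, axis=1)
        idx = np.flatnonzero(d2 < radius ** 2)[:max_neighbors]
        if idx.size == 0:                        # guarantee at least one neighbor
            idx = np.array([np.argmin(d2)])
        groups.append(idx)
    return groups
```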

The grouping and sampling layers represent preprocessing tasks to capture local structures before the abstracted points are passed to the PointNet layer. This layer consists of the original PointNet-architecture and is applied to generate a feature vector of the local region pattern. The input for the PointNet layer is the abstraction of the local regions, i.e., the centroids and local features that encode the centroids’ neighborhood.

The process of sampling, grouping and applying PointNet is repeated in a hierarchical fashion, with points being down-sampled further and further until the last layer yields a final global feature vector [8]. In this way, PointNet++ can work with the same input data at different scales and generate higher-level features at each set abstraction layer, thus capturing local structures.

7.3.3 PointNet-based feature extraction

In the following, no explicit distinction is made between PointNet and PointNet++. Instead, we summarize both under the term PointNet-based approaches, considering that we primarily want to imply that point-wise methods for feature extraction or classification purposes are used.

Many state-of-the-art models use PointNet-like networks in their pipeline for feature extraction. An exemplary selection of seminal works includes the proposals from Qi et al. [108], Yang et al. [159], Zhou and Tuzel [169], Xu et al. [153], Shi et al. [124], Shin et al. [127], Pamplona et al. [102], Lang et al. [65], Wang and Jia [148], Yang et al. [158], Li et al. [73], Yoo et al. [161], Zhou et al. [167] and Huang et al. [52].

The models adopting PointNet for feature extraction change very little in the way it is used, indicating that it is already a well-designed and mature technique. Yet, Yang et al. [158] observe that the up-sampling operation in the feature propagation layers and the refinement modules consume about half of the inference time of existing PointNet approaches. Therefore, they abandon both processes to drastically reduce inference time. However, predicting only on the surviving representative points of the last set abstraction layer leads to large performance drops. As a remedy, they propose a novel sampling strategy based on feature distance and merge this criterion with the common Euclidean distance sampling to retain meaningful features.

PointNet-based 3DOD models generally show superior classification performance compared to models using other feature extraction methods. The set abstraction operation brings the crucial advantage of flexible receptive fields for feature learning, obtained by setting different search radii within the grouping layer [124]. Flexible receptive fields can better capture the relevant content or features because they adapt to the input. Fixed receptive fields such as convolutional kernels are limited in their extent, so features and objects of varying sizes may not be captured as well. However, PointNet operations, especially set abstractions, are computationally expensive, which translates into long inference times compared to convolutions or fully connected layers [158, 160].

7.4 Segment-wise feature learning

Segment-wise feature learning follows the idea of a regularized 3D data representation. In contrast to the point-wise feature extraction of the previous section, it does not take every single point of the point cloud into account. Instead, it processes an aggregated set of grid-like representations of the 3D scene (cf. Section 6.3), which are therefore referred to as segments.

In the following, we describe central aspects and operations of segment-wise feature learning, including (i) feature initialization (Sect. 7.4.1), (ii) 2D and 3D convolutions (Sect. 7.4.2), (iii) sparse convolution (Sect. 7.4.3) and (iv) the voting scheme (Sect. 7.4.4).

7.4.1 Feature initialization

Volumetric approaches discretize the point cloud into a specific volumetric grid during preprocessing. Projection models, on the other hand, typically lay a relatively fine-grained grid over the 2D mapping of the point cloud scene. In either case, the representation is transformed into segments. In other words, segments can either be volumetric grids like voxels and pillars (cf. Section 6.3.1) or discretized projections of the point cloud like RV and BEV projections (cf. Section 6.3.2).

The generated segments enclose a set of points that is not yet in a processable state. Therefore, an encoding is applied to the individual segments that aggregates the points they enclose.

The intention of the encoding is to fill the formulated segments or grids with discriminative features that provide information about the set of points that lie in each individual grid. This process is called feature initialization. Through the grid, these features are now available in a regular and structured format. In contrast to handcrafted feature extraction methods (cf. Section 7.1), these grids are not yet used for detection, but are made accessible to CNNs or other extraction mechanisms to further condense the features.

For volumetric approaches, current research can be divided into two streams of feature initialization. The first and probably more intuitive approach is to manually encode the voxels. However, handcrafted feature encoding introduces a bottleneck that discards spatial information and may prevent these approaches from effectively utilizing 3D shape information. This led to the latter approach, where models apply a lightweight PointNet to each voxel to learn point-wise features and assign them in aggregated form as voxel features, which is referred to as voxel feature encoding (VFE) [169].

Volumetric feature initialization: Traditionally, voxels are encoded into manually selected features such that each voxel contains one or more values consisting of statistics computed from the points within that voxel cell. The selection of features is already a crucial task, as they should capture the important information within a voxel and describe the corresponding points in a sufficiently discriminative way.

Early approaches such as Sliding Shapes [134] use a combination of four types of 3D features to encode the cells, namely point density, a 3D shape feature, a surface normal feature and a specifically designed feature to deal with the problem of self-occlusion, called the truncated signed distance function. Others, however, rely solely on statistical encodings, such as Wang and Posner [144] as well as their adaptation by Engelcke et al. [29], who propose three shape factors, the mean and variance of the point reflectance values as well as a binary occupancy feature.

Much simpler approaches are pursued by Li [70] and Li et al. [77] using a binary encoding to express whether a voxel contains points or not. To avoid too much information loss due to a rudimentary binary encoding, the voxel size is usually chosen comparatively small to generate high-resolution 3D grids [77].
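The following NumPy sketch illustrates such handcrafted voxel feature initialization, assigning each occupied voxel a binary occupancy flag, a point-density count and a mean intensity; the concrete feature choice and all parameter names are illustrative and not tied to a specific cited model.

```python
import numpy as np

def voxelize_handcrafted(points, intensities, voxel_size, grid_min, grid_shape):
    """Discretize an (N, 3) point cloud into a voxel grid (grid_shape is a
    length-3 tuple) and initialize each occupied voxel with simple handcrafted
    features: binary occupancy, point density and mean intensity."""
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, intensities = idx[valid], intensities[valid]

    occupancy = np.zeros(grid_shape, dtype=np.float32)
    density = np.zeros(grid_shape, dtype=np.float32)
    mean_int = np.zeros(grid_shape, dtype=np.float32)

    for (i, j, k), r in zip(idx, intensities):
        density[i, j, k] += 1.0
        mean_int[i, j, k] += r
    occupancy[density > 0] = 1.0
    mean_int[density > 0] /= density[density > 0]
    return np.stack([occupancy, density, mean_int], axis=-1)   # (X, Y, Z, 3)
```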

More recently, voxel feature initialization has shifted more and more toward deep learning approaches for similar reasons as feature extraction in other computer vision tasks, where manually selected features are not as performant as learned ones.

While segment-wise approaches prove to be comparatively efficient 3D feature extraction methods, point-wise models show impressive results in detection accuracy as they have recourse to the full information of each point of the point cloud. By manually encoding the voxels into standard features, a lot of information is usually lost.

As a response, Zhou and Tuzel [169] introduced the seminal idea of VFE and a corresponding deep neural network which moves the feature initialization from hand-crafted voxel feature encoding to deep-learning-based encoding. More specifically, they proposed VoxelNet, which is able to extract point-wise features from each segment of a voxelized point cloud through a lightweight PointNet-like network. Subsequently, the individual point features are stacked together with a locally aggregated feature at voxel level. Finally, this volumetric representation is fed into 3D convolutional layers for further feature aggregation.

The use of PointNet allows VoxelNet to capture inter-point relations within a voxel and therefore yields more discriminative features than a conventionally encoded voxel consisting of statistical values. A schematic overview of VoxelNet's architecture is shown in Fig. 7.

Fig. 7: Architecture of VFE module in VoxelNet [169]
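A simplified PyTorch sketch of a single VFE layer is given below: a shared linear layer produces point-wise features per voxel, a voxel-wise max-pooling yields the locally aggregated feature, and both are concatenated per point. The input shapes and the masking of padded points are assumptions made for illustration; the original VoxelNet stacks several such layers before the 3D convolutions.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """Sketch of a VoxelNet-style voxel feature encoding (VFE) layer."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, out_dim // 2),
                                nn.BatchNorm1d(out_dim // 2), nn.ReLU())

    def forward(self, voxel_points, mask):
        # voxel_points: (V, T, C) padded points per voxel, mask: (V, T) valid flags
        V, T, C = voxel_points.shape
        pw = self.fc(voxel_points.reshape(V * T, C)).reshape(V, T, -1)
        pw = pw * mask.unsqueeze(-1)                      # zero out padded points
        voxel_feat = pw.max(dim=1, keepdim=True).values   # locally aggregated feature
        return torch.cat([pw, voxel_feat.expand_as(pw)], dim=-1)  # (V, T, out_dim)
```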

The seminal idea of VFE is used, in modified versions, in many subsequent approaches. Several works focus on improving the performance and efficiency of VoxelNet. In terms of performance, Kuang et al. [62] developed a novel feature pyramid extraction paradigm. To speed up the model, Yan et al. [155] and Shi et al. [125] combined VFE with more efficient sparse convolutions. In addition, Sun et al. [137] and Chen et al. [20] simplified the original VFE architecture to shorten inference times.

Projection-based feature initialization: In projection approaches, feature initialization of cells is usually done by hand. Both RV and BEV utilize fine-grained grids that are primarily filled with statistical quantities of the points lying within them. Only recently have the first models begun to encode features using deep learning methods (e.g., Lehner et al. [68], Wang et al. [145], Liang et al. [81]).

For RV, it is most popular to encode the projection map into three-channel features, namely height, distance and intensity [19, 81, 168]. Instead, Meyer et al. [96, 97] form a five-channel image with range, height, azimuth angle, intensity and a flag indicating whether a cell contains a point. In contrast to manual encoding, Wang et al. [145] use a point-based, fully connected layer to learn high-dimensional point features of the LiDAR point cloud, and then apply a max-pooling operation along the z-axis to obtain the cells’ features in RV.

Similar to RV, Chen et al. [19] encode each cell of a BEV representation in height, intensity and density. To increase significance of this representation, the point cloud is divided into M slices along the y-axis, resulting in a BEV map with \(M + 2\) channel features.
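As an illustration of such a BEV encoding, the following NumPy sketch rasterizes a point cloud into a map with M height-slice channels plus one density and one intensity channel, in the spirit of Chen et al. [19]. It assumes LiDAR coordinates with z as the height axis and uses simple per-cell maxima and counts; the normalization details of the original work are omitted.

```python
import numpy as np

def bev_encode(points, intensities, m_slices, x_range, y_range, z_range, res):
    """Encode an (N, 3) point cloud as a BEV map with m_slices height channels
    plus a density and an intensity channel (m_slices + 2 channels in total)."""
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((H, W, m_slices + 2), dtype=np.float32)

    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    zi = ((points[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * m_slices).astype(int)
    ok = (xi >= 0) & (xi < H) & (yi >= 0) & (yi < W) & (zi >= 0) & (zi < m_slices)

    for x, y, z, h, r in zip(xi[ok], yi[ok], zi[ok], points[ok, 2], intensities[ok]):
        bev[x, y, z] = max(bev[x, y, z], h)                        # max height per slice
        bev[x, y, m_slices] += 1.0                                 # point density (count)
        bev[x, y, m_slices + 1] = max(bev[x, y, m_slices + 1], r)  # intensity
    return bev
```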

After its introduction to 3DOD by Chen et al. [19], the BEV representation became quite popular and many other approaches followed their proposal of feature initialization (e.g., Beltrán et al. [9], Liang et al. [80], Wang et al. [149], Simon et al. [130], Li et al. [74]). Yet, others choose a simpler setup, encoding only the maximum height and density without slicing the point cloud [2] or even using a binary occupancy encoding [48, 157].

To avoid information loss, more recent approaches first extract features using deep learning approaches and then project these features into BEV (e.g., Wang et al. [145], Liang et al. [81]).

Analogous to their RV approach, Wang et al. [145] again use the learned point-wise features of the point cloud, but now apply the max pooling operation along the y-axis for feature aggregation in BEV. Liang et al. [81], on the other hand, first extract features in RV and then transform them into a BEV representation, adding more high-level information compared to directly projecting the point cloud to BEV.

After feature initialization of either volumetric grids or projected cells, segment-wise solutions usually utilize 2D/3D convolutions to extract features of the global scene.

7.4.2 2D and 3D convolutions

Preprocessed 2D representations such as feature-initialized projections, monocular images and RGB-D (2D) images all have the advantage that they can leverage mature 2D convolutional techniques to extract features.

Volumetric voxel-wise representations, on the other hand, describe the spatial scene in a regular format that is accessed by 3D convolutions. However, directly applying convolutions in 3D space is a very inefficient procedure due to the additional spatial dimension of the search space.

Early approaches to extend the traditional 2D convolutions to 3D were applied by Song and Xiao [135] as well as Li [70] by placing 3D convolutional filters in 3D space and performing feature extraction in an exhaustive operation. Since the search space increases drastically from 2D to 3D, this procedure involves immense computational costs.

Further examples using conventional 3D CNNs can be found in the models of Chen et al. [17], Sun et al. [137], Zhou and Tuzel [169] and Sindagi et al. [132].

Despite delivering state-of-the-art results, the conventional 3D CNN lacks efficiency. Given the fact that the sparsity of point clouds leads to many empty and non-discriminative voxels in a volumetric representation, the exhaustive 3D CNN operations perform a large number of redundant computations. This issue can be addressed by sparse convolution.

7.4.3 Sparse convolution

Sparse convolution was proposed by Graham [41, 42]. First, a ground state is defined for the input data. The ground state expresses whether a spatial location (site) is active or not. A site in the input representation is active if it has a non-zero value, which in the case of a regularized point cloud is a voxel enclosing at least a certain threshold of points. Furthermore, a site in the following layers is active if any of the spatial locations from the foregoing layer, from which it receives its input, is active. Therefore, sparse convolution only needs to process the sites that differ from the ground state of the preceding convolution layer, focusing computational power on the meaningful and new information.

To reduce resource costs and speed up feature extraction, irrelevant regions need to be skipped when processing point clouds. Yan et al. [155] were the first to apply sparse convolutions in 3DOD, thereby not suffering from but exploiting sparsity.

However, sparse convolution has the disadvantage that it continuously dilates the data, since every active site activates its entire neighborhood in the output. The deeper a network becomes, the more the sparsity is reduced and the data are dilated. For this reason, Graham et al. [43] introduced submanifold sparse convolution, where the input is first padded so that the output retains the same dimensions. Moreover, an output site is active only when the central site of its receptive field is active, thus preserving the efficiency of sparse convolution while simultaneously maintaining sparsity.
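The difference between the two active-site rules can be sketched as follows, using a boolean occupancy grid and SciPy's binary dilation; this only illustrates which output sites would be computed, not the convolution arithmetic itself, and the function name is chosen for illustration.

```python
import numpy as np
from scipy import ndimage

def active_output_sites(active, kernel=3, submanifold=False):
    """Given a boolean grid of active input sites, return the boolean grid of
    active output sites for a sparse convolution with a cubic kernel.
    Regular sparse convolution activates an output wherever any input site in
    its receptive field is active (the active set dilates); submanifold sparse
    convolution keeps an output active only if its central input site is active."""
    if submanifold:
        return active.copy()                       # sparsity pattern preserved
    footprint = np.ones((kernel,) * active.ndim, dtype=bool)
    return ndimage.binary_dilation(active, structure=footprint)
```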

Relevant works using sparse 3D convolutions are proposed by Yan et al. [155], Chen et al. [20], Shi et al. [125], Pang et al. [104], Yoo et al. [161], He et al. [48], Zheng et al. [165] and Deng et al. [26].

7.4.4 Voting scheme

Another approach to exploit the sparsity of point clouds is to apply a voting scheme, as exemplified by Wang and Posner [144], Engelcke et al. [29] and Qi et al. [107]. The idea behind voting is to let each non-zero site in the input cast a set of votes to its surrounding cells in the output layer. The voting weights are computed by flipping the convolutional weights of the filter along its diagonal [31]. As with sparse convolution, processing only needs to be performed on non-zero sites. Hence, the computational cost is proportional to the number of occupied sites rather than to the dimension of the scene. Facing the same dilation problem as sparse convolution, Engelcke et al. [29] argue for carefully selecting non-linear activation functions: rectified linear units help to maintain sparsity because only features with values greater than zero, rather than merely nonzero, are allowed to cast votes.

Mathematically, feature-centric voting is equivalent to the submanifold sparse convolution, as Wang and Posner [144] prove in their work.

8 Fusion approaches

Single-modality pipelines for 3DOD have developed well in recent years and have shown remarkable results. Yet, unimodal models still reveal shortcomings that prevent them from reaching full maturity and human-like performance. For instance, camera images lack depth information and suffer from truncation and occlusion, while point clouds lack texture information and are sparse at longer distances. To overcome these problems, recent research is increasingly focusing on fusion models that attempt to leverage the combined information from different modalities.

The main challenges of fusion approaches are the synchronization of the different representations and the preservation of relevant information during the fusion process. Further, keeping the additional complexity at a computationally reasonable level must also be taken into account.

Fusion methods can be divided into two classes depending on the orchestration of the modality integration, namely (i) cascaded fusion (Sect. 8.1) and (ii) feature fusion (Sect. 8.2). The former combines different sensor data and their individual features or predictions across different stages, whereas the latter jointly reasons about multi-representation inputs.

8.1 Cascaded fusion

Fig. 8: Cascaded fusion scheme of Frustum-PointNets [108]

Cascaded fusion methods use consecutive single-modality detectors to restrict the second-stage detection by the results of the first detector. Typically, monocular-based object detectors are leveraged in the first stage to define a reduced subset of the point cloud containing only 3D points that are likely to belong to an object. Hence, in the second stage, 3D detectors only need to reason over a limited 3D search space.

Two seminal works in this regard are the fusion frameworks proposed by Lahoud and Ghanem [64] and Qi et al. [108]. Both approaches use the detection results from the 2D image to extrude a corresponding frustum into 3D space. For each 2D proposal, a frustum is generated. The popular Frustum-PointNets by Qi et al. [108] then processes the frustum with PointNet for instance segmentation. Finally, the amodal 3D box is predicted based on the frustum and the extracted foreground points (see Fig. 8). Lahoud and Ghanem [64], on the other hand, first estimate the orientation of each object within the frustum. In the last step, they apply an MLP regressor for the 3D boundaries of the object.
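The frustum extrusion itself reduces to selecting the points whose image projection falls inside the 2D detection, as the following NumPy sketch shows; it assumes points already transformed into camera coordinates and a pinhole intrinsic matrix K, and it leaves out the subsequent segmentation and box estimation stages.

```python
import numpy as np

def points_in_frustum(points_cam, box_2d, K):
    """Return the points (given in camera coordinates, (N, 3)) whose image
    projection falls inside a 2D detection box (x_min, y_min, x_max, y_max),
    i.e., the points lying inside the frustum extruded from that 2D box."""
    in_front = points_cam[:, 2] > 0          # keep only points in front of the camera
    pts = points_cam[in_front]
    uvw = pts @ K.T                          # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]
    x_min, y_min, x_max, y_max = box_2d
    inside = ((uv[:, 0] >= x_min) & (uv[:, 0] <= x_max) &
              (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max))
    return pts[inside]
```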

Several approaches follow this basic idea in a similar way. For example, both Yang et al. [159] and Ferguson and Law [30] use 2D semantic segmentation and then project the foreground pixels of the image into the point cloud. The selected points are subsequently exploited for proposal generation through PointNet or convolution operations. Du et al. [28] leverage the restricted 3D space by applying a model matching algorithm for detection purposes. In contrast, Shin et al. [127] attempt to improve the 3D subset generation by creating point cloud region proposals in the shape of standing cylinders instead of frustums, which is more robust against sensor synchronization errors.

While the described models above mainly focus on the frustum creation process, Wang and Jia [148], Zhang et al. [163] as well as Shen and Stamos [122] seek to advance the processing of the frustums.

Due to its modular nature, Frustum-PointNet is not able to provide an end-to-end prediction. To overcome this limitation, Wang and Jia [148] subdivide the frustums to eventually make use of a fully convolutional network allowing a continuous estimation of oriented boxes in 3D space. They generate a sequence of frustums by sliding along the frustum axis, and then aggregate the grouped points of each respective section into local point-wise features. These frustum-level features are arranged as a 2D feature map, enabling the use of a subsequent fully convolutional network.

Other than that, Shen and Stamos [122] aim to integrate the advancements of voxelization into frustum approaches by transforming regions of interests (ROIs) within the point frustums into 3D volumetric grids. Thus, only relevant regions are voxelized, allowing a high resolution that improves the representation while still being efficient. In this case, the voxels are then fed to a 3D fully convolutional network.

More recently, Zhang et al. [163] observed that point-based 3DOD does not perform well at longer ranges because of the increasing sparsity of point clouds. Therefore, they take advantage of RGB images, which contain enough information to recognize distant objects with mature 2D detectors. While following the idea of frustum generation, they first estimate which objects are distant. Since such objects are defined by very few points in the point cloud and thus do not provide sufficient discriminative information for neural networks, 2D detectors are applied on the corresponding images. For close objects, in contrast, Zhang et al. [163] use conventional neural networks to process the frustum.

8.2 Feature fusion

Since the performance of cascaded fusion models is always limited by the accuracy of the detector at each stage, some researchers try to increase performance by arguing that the models should infer more jointly across modalities.

To this end, feature fusion methods first concatenate the information from different modalities before reasoning about the combined features, trying to exploit the diverse information in their combination. Within feature fusion, it can further be distinguished between (i) early, (ii) late and (iii) deep fusion approaches [19], depending on the stage of the 3DOD pipeline at which fusion occurs. Figure 9 provides an illustrative overview of the different fusion schemes.

Early fusion merges multi-view features in the input stage before any feature transformation takes place, and proceeds with a single network to predict the results. Late fusion, in contrast, uses multiple subnetworks that process the individual inputs separately up until the last stage of the pipeline, where they get concatenated in the prediction stage. Beyond that, deep fusion allows an interaction of different input modalities at several stages in the architecture and alternately performs feature transformation and feature fusion.

Although the following approaches can all be categorized as feature fusion methods, the classification between the various subclasses of early, late and deep fusion is not trivial and can be fluid. Nevertheless, the concepts help to convey a better understanding of feature fusion processes.

Fig. 9: Early, late and deep feature fusion scheme [19]

The pioneers among 3DOD fusion approaches are Chen et al. [19], introducing the multi-view approach. They take multiple perspectives, specifically BEV, RV and the camera image, as input representations. The BEV representation is used to generate 3D candidates, followed by a region-wise feature extraction by projecting these 3D proposals onto the respective feature maps of each view. A deep fusion scheme is then used to combine the information element-wise over several intermediate stages.

Ku et al. [60] use 3D anchors mapped to both the BEV representation and the 2D image. Subsequently, a crop and resize operation is applied to every projected anchor, and the feature crops from both views are fused via an element-wise mean operation at an intermediate convolutional layer. Unlike Chen et al. [19], Ku et al. [60] not only merge features in the late refinement stage but already in the region proposal stage to generate positive proposals. They specifically use the full-resolution feature map to improve prediction quality, particularly for small objects. Similar approaches to Chen et al. [19] and Ku et al. [60] are performed by Chen et al. [18], Li et al. [74] and Wang et al. [146], who also fuse the regions of interest of the input data representations element-wise. Yet, Chen et al. [18] use segment-wise 3D detection for box proposals in the first stage.

In contrast, Rahman et al. [112] not only fuse the regions of interest, but combine the entire feature map of processed monocular images and FV representations already at the region proposal stage.

Further, Ren et al. [115] want to leverage not only object detection information but also context information. Therefore, they simply concatenate the processed features of 2D scene classification, 2D object detection and 3D object detection of the voxelized scene before feeding them to a conditional random field model for joint optimization.

Rather than just combining regular representations through multi-view fusion models, some approaches also aim to merge raw point features with other representations.

For example, Xu et al. [153] process the raw point cloud with PointNet. They then concatenate each point-wise feature with a global scene feature and the corresponding image feature. Each point is then used as a spatial anchor to predict the offset to the 3D bounding box. Likewise, Yang et al. [160] use features from a PointNet++ backbone to extract semantic context features for each point. The sparse features subsequently get condensed by a point pooling layer to take advantage of a voxel-wise representation applying VFE.

Shi et al. [123] first use 3D convolution on a voxel representation to summarize the scene into a small set of key points. In the next step, these voxel-feature key points are fused with the grid-based proposals for refinement purposes.

The previously presented models all perform late or deep fusion procedures. Instead of fusing multi-sensor features per object after the proposal stage, Wang et al. [149] follow the idea of early fusion of BEV and image views using sparse, non-homogeneous pooling layers over the full resolution. Similarly, Meyer et al. [96] also employ early fusion, but they use RV images for the point-cloud-related representation. The approach associates the LiDAR point cloud with camera pixels by projecting the 3D points onto the 2D image. The image features are then concatenated and further processed by a fully convolutional network.

Furthermore, several works apply a hybrid approach of early and late fusion schemes. For example, Sindagi et al. [132] first project LiDAR onto an RV representation and concatenate image features with the corresponding points in an early fusion fashion. Then they apply a VFE layer to the voxelized point cloud and append the corresponding image features for each non-empty voxel. While early concatenation already fuses features, late fusion aggregates image information for volumetric-based representation, where voxels may contain low-quality information due to low point cloud resolution or distant objects.

Similarly, Liang et al. [79] initially conduct feature fusion of RGB image and BEV feature maps. Thereby, they incorporate multi-scale image features to augment the BEV representation. In the refinement stage, the model fuses image and augmented BEV features again, but in contrast to the first stage, the fusion occurs element-wise for the regions of interest. They further add ground estimation and depth estimation to the fusion framework to advance the fusion process.

Simon et al. [129] extend their previous work Complex-YOLO [130] with Complexer-YOLO by exchanging the BEV map input for a voxel representation. To leverage all inputs, they first create a semantic segmentation of the RGB image and then fuse this segmented image point-wise with the LiDAR frame to generate a semantic voxel grid.

Lately, Zheng et al. [165] first initialize features of a voxelized point cloud by calculating the mean coordinates and intensities of points in each voxel. They then apply sparse convolution and transform the representation into a dense feature map before condensing it on the ground plane to produce a BEV feature map.

Continuous Fusion: LiDAR points are continuous and sparse, whereas cameras capture dense features at a discrete state. Fusing these modalities is not a trivial task due to the one-to-many projection. In other words, there is not a corresponding LiDAR point for every image pixel in every projection and vice versa.

To overcome this discontinuous mapping of images into point-cloud-based representations such as BEV, Liang et al. [80] propose a novel continuous convolution that is applied to create a dense feature map through interpolation. They propose to project the image feature map onto a BEV space and then fuse the original LiDAR BEV map in a deep fusion manner through continuous convolution over multiple resolutions. The fused feature map is then further processed by a 2D CNN to solve the discrepancy between image and projection representations.

Attention Mechanism: A common challenge among fusion approaches is the occurrence of noise and the propagation of irrelevant features. Previous approaches simply fuse multiple features by concatenation or element-wise summation and/or mean operations. Thereby, noise such as truncation and occlusion is propagated to the resulting feature maps and inferior point features are obtained through fusion. Attention mechanisms can cope with these difficulties by determining the relevance of each feature so that only features that improve the representation are fused.

Lu et al. [91] use a deep fusion approach for BEV and RGB images in an element-wise way, but additionally incorporate attention modules over both modalities to leverage the most relevant features. Spatial attention adapts pooling operations over different feature map scales, while the channel-wise fusion applies global pooling. Both create an attention map that expresses the importance of each feature. These attention maps are then multiplied with the feature map and finally fused.
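A minimal PyTorch sketch of such attention-weighted fusion is shown below: global pooling summarizes each modality, a small gating network predicts per-modality weights, and the weighted feature maps are summed. This is a generic illustration of the idea, not the specific spatial and channel-wise modules of Lu et al. [91]; all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-gated fusion of two modality feature maps of equal shape."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU(),
                                  nn.Linear(channels, 2), nn.Softmax(dim=-1))

    def forward(self, feat_lidar, feat_image):           # both: (B, C, H, W)
        pooled = torch.cat([feat_lidar.mean(dim=(2, 3)),  # global summary per modality
                            feat_image.mean(dim=(2, 3))], dim=1)   # (B, 2C)
        w = self.gate(pooled)                             # (B, 2) modality weights
        return (w[:, 0, None, None, None] * feat_lidar
                + w[:, 1, None, None, None] * feat_image)
```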

In analogy, Wang et al. [145] use an attentive point-wise fusion module to estimate the channel-wise importance of BEV, RV and image features. In contrast, they deploy the attention mechanism after the concatenation of the multi-view feature maps to consider the mutual interference and the importance of the respective features. They specifically address the issue of ill-posed information introduced by the front view of images and RV. To compensate the inevitable loss of geometric information through the projection of LiDAR points, the authors finally enrich the fused point features with raw point features through an MLP network.

Subsequent to the attention-based fusion, Yoo et al. [161] use first-stage proposals from the attentive camera-LiDAR feature map to extract the single-modality LiDAR and camera features of the proposal regions. Using a PointNet encoding, these are subsequently fused element-wise with the joint feature map for refinement.

In contrast, Huang et al. [52] operate directly on the LiDAR point cloud, introducing a point-wise fusion. In their deep fusion approach, they process a PointNet-like geometric stream and a convolution-based image stream in parallel. Between each abstraction stage, the point features are fused with the semantic image features of the corresponding convolutional layer by applying an attention mechanism.

Moreover, Pang et al. [104] observed that element-wise fusion takes place after non-maximum suppression (NMS), which can result in useful candidates of each modality being incorrectly suppressed. NMS is used to suppress duplicate candidates after the proposal and prediction stage, respectively. Therefore, Pang et al. [104] use a much-reduced threshold for proposal generation for each sensor and combine detection candidates before NMS. The final prediction is based on a consistency operation between 2D and 3D proposals in a late fusion fashion.

Other representative examples exploiting attention mechanisms for effective fusion of features are proposed by Chen et al. [20] and Li et al. [73].

A totally different approach to combining different inputs is presented by Chen et al. [15]. For the specific case of autonomous vehicles, they propose to connect surrounding vehicles and combine their sensor measurements. More specifically, LiDAR data collected from different positions and angles of the connected vehicles are fused together to provide the vehicles with a collective perception of the scene.

In summary, the fusion of different modalities is a vibrant research area within 3DOD. With continuous convolutions and attention mechanisms, potential solutions for common issues such as image-to-point-cloud discrepancies and noisy data representations have already been introduced. Nevertheless, fusion approaches still face several unsolved challenges. For example, 2D-driven fusion approaches such as cascaded methods are always constrained by the quality of the 2D detection in the first stage. Therefore, they may fail in cases that can only be observed properly from the 3D space. Feature fusion approaches, on the other hand, generally face the difficulty of fusing different data structures. Consider the example of fusing images and LiDAR data: while images provide a dense, high-resolution structure, LiDAR point clouds show a sparse structure with a comparably low resolution. The workaround of transforming point clouds into another representation inevitably leads to a loss of information. Another challenge for fusion approaches is that crop and resize operations used to fuse proposals of different modalities may destroy the feature structure derived from each sensor. Thus, a forced concatenation into a fixed feature vector size could result in imprecise correspondence between the different modalities.

9 Detection module

The detection module depicts the final stage of the pipeline. It uses the extracted features to perform the multi-task of classification and localization, comprising bounding box regression and object orientation estimation.

Early 3DOD approaches relied either on (i) template and keypoint matching algorithms, such as matching 3D CAD models to the scene (Sect. 9.1), or on (ii) handcrafted SVM classifiers using sliding window approaches (Sect. 9.2).

More recent research mainly focuses on detection frameworks based on deep learning due to their flexibility and superior performance (Sect. 9.3). Detection techniques of this era can be further divided into (i) anchor-based detection (Sect. 9.4), (ii) anchorless detection (Sect. 9.5) and (iii) hybrid detection (Sect. 9.6).

9.1 Template and keypoint matching algorithms

A natural approach to classifying objects is to compare and match them against a template database. These approaches typically leverage 3D CAD models to synthesize object templates that guide geometric reasoning during inference. Applied matching algorithms use parts or whole CAD models of the objects to classify the candidates.

Teng and Xiao [140] follow a surface identification approach. To this end, they accumulate a surface object database of RGB-D images taken from different viewpoints. The specific 3D surface segment obtained by segmenting the current scene is then matched with the surface segments in the database. Key points between the matched database surface and the observed surface are subsequently aligned for pose estimation.

Crivellaro et al. [22] initially perform a part detection. For each part, seven so-called 3D control points are projected to represent the pose of the object. Finally, a bounding box matching the constraints on the part and the control points is estimated from a small set of learned objects.

Kehl et al. [57] create a codebook of local RGB-D patches from synthetic CAD models. These patches, consisting of a variety of different views, are matched with feature descriptors from the scene to classify the object.

Another matching approach is designed by He et al. [49] extending LINE-MOD [50], which combines surface normal orientations from depth images and silhouette gradient orientations from RGB images to represent object templates. LINE-MOD is first used to produce initial detection results based on lookup tables for similarity matching. To exclude the many false positive and duplicate detections, He et al. [49] cluster templates that matched with a similar spatial location and only then score the matchings.

Further, Yamazaki et al. [154] applied template matching to point cloud projections. The key novelty is the use of constraints imposed by the spatial relationship between image projection directions, which are linked through the shared point cloud. This allows achieving consistency of the object throughout the multi-viewpoint images, even in cluttered scenes.

Another approach is proposed by Barabanau et al. [6]. The authors introduce a compound solution of key point and template matching. They observe that depth estimation on monocular 3DOD is naturally ill-posed. For that reason, they propose to use sparse but salient key point features. They initially regress 2D key points and then match them with 3D CAD models to predict object dimension and orientation.

9.2 Sliding window approaches

The sliding window technique was largely adopted from 2DOD to 3DOD. Here, an object detector slides in the form of a specified window over the feature map and directly classifies each window position. For 3DOD pipelines, this idea is extended by replacing the 2D window with a spatial rectangular box that slides through a discretized 3D space. However, tested solutions have revealed that traversing a window over the entire 3D space is a very exhaustive task, leading to heavy computations and long inference times.

Popular pioneers in this area were Song and Xiao [134] with their Sliding Shapes approach. They run previously trained SVM classifiers exhaustively over a voxelized 3D space.

Similarly, Ren and Sudderth [116,117,118] use a sliding window approach in all of their works and extensively leverage pre-trained SVMs. In COG 1.0, for example, they use SVMs along with a cascaded classification framework to learn contextual relationships among objects in the scene. To this end, they train SVMs for each object category with handcrafted features such as surface orientation. Furthermore, they integrate a Manhattan space layout, which assumes an orthogonal space structure, to estimate walls, ceilings and floors for a more holistic understanding of the 3D scene and to restrict the detection. Finally, the contextual information is used in a Markov random field formulation to consider object relationships during detection [116].

In the successor model, namely LSS, Ren and Sudderth [117] observe that the height of the support surface is the primary cause of style variation for many object categories. Therefore, they add support surfaces as a latent part for each object, which they use in combination with an SVM and additional constraints from the predecessor model.

Even in their latest work, COG 2.0, Ren and Sudderth [118] still apply an exhaustive sliding window search for 3DOD, laying their focus on robust feature extraction rather than detection techniques.

Similarly, Liu et al. [83] use SVMs learned for each object class based on the feature selection proposed by Ren and Sudderth [116]. Through a pruning of candidates by comparing the cuboid size of the bounding boxes with the distribution of the physical size of the objects, they further reduce inference time of detection.

However, an exhaustive sliding window approach tends to be computationally expensive, since the third dimension significantly increases the search space. Therefore, Wang and Posner [144] exploit the sparsity of 3D representations by adding a voting scheme that is activated only for occupied cells, reducing the computational complexity while preserving mathematical equivalence. Whereas the sliding window approach of Song and Xiao [134] scales linearly with the total number of cells in the 3D grid, the voting approach by Wang and Posner [144] restricts the operations exclusively to the occupied cells. The voting scheme is explained in more detail in Sect. 7.4.4.

Engelcke et al. [29] build on the success of Wang and Posner [144] and propose to exploit feature-centric voting in even deeper networks to detect objects in point clouds and boost performance.

9.3 Detection frameworks based on deep learning

All of the above detection techniques, which are solid solutions for their specific use cases, are based on manually developed features and are difficult to transfer. Thus, to exploit more robust features and improve detection performance, most modern detection approaches are based on deep learning models.

As with 2DOD, detection networks for 3DOD relying on deep learning can be basically grouped into two meta frameworks: (i) two-stage detection frameworks (Sect. 9.3.1) and (ii) single-stage detection frameworks (Sect. 9.3.2).

To provide a basic understanding of these two concepts, we will briefly revisit the major developments for 2D detection frameworks in the following subsections.

9.3.1 Two-stage detection frameworks

As the name indicates, two-stage frameworks perform the object detection task in two stages. In the first stage, spatial sub-regions of the input image are identified that contain object candidates, commonly known as region proposals. The proposed regions are coarse predictions that are scored based on their “objectness”. Regions with a high probability of containing an object will achieve a high score and are used as input to the second stage. These unrefined predictions often lack localization precision. Therefore, the second stage mainly improves the spatial estimation of the object through a more fine-grained feature extraction. The following multi-task head then outputs the final bounding box estimation and classification score.

A seminal work following this central idea is that of Girshick et al. [39], who introduced region-based CNN (R-CNN). Instead of dealing with a huge amount of region proposals via an exhaustive sliding window procedure, R-CNN integrates the selective search algorithm [142] to extract just about 2,000 category-independent candidates. More specifically, selective search is based on a hierarchical segmentation approach which recursively combines smaller regions into larger ones based on similarity in color, texture, size and fill. Subsequently, the 2,000 generated region proposals are cropped and warped into fixed size images in order to be fed into a pre-trained and fine-tuned CNN. The CNN acts as feature extractor to produce a feature vector with a fixed length, which can then be consumed by binary SVM classifiers trained independently for each object class. At the same time, the CNN features are used for the class-specific bounding box regression.

The original R-CNN framework proved to be time and memory consuming due to a lack of shared computations between each training step (i.e., CNN, SVM classifiers, bounding box regressors). To this end, Girshick [38] developed an extension, called Fast R-CNN, in which the individual computations were integrated into a jointly trained framework. Instead of feeding the region proposals generated by selective search to the CNN, the two operations are swapped, so that the entire input image is now processed by the CNN to produce a joint convolutional feature map. The region proposals are then projected onto the joint feature map and a fixed-length feature vector is extracted from each region proposal using a region-of-interest pooling layer. Subsequently, the extracted features are consumed by a sequence of fully connected layers to predict the final object classes and the bounding box offset values for refinement purposes [38]. This approach saves memory and improves both the accuracy and efficiency of object detection models [85, 164].

Both R-CNN and Fast R-CNN have the disadvantage of relying on external region proposals generated by selective search, which is a time-consuming process. Against this backdrop, Ren et al. [114] introduced another extension, called Faster R-CNN. As an innovative enrichment, the detection framework contains a region proposal network (RPN) as a sub-network for nominating regions of interest. The RPN is a CNN by itself and replaces the functionality of the selective search algorithm. To classify objects, the RPN is connected to a Fast R-CNN model with which it shares convolutional layers and the resulting feature maps.

The RPN initializes multiple reference boxes, called anchors, with different sizes and aspect ratios at each possible feature map position. These anchors are then mapped to a lower-dimensional vector, which is used for "objectness" classification and bounding box regression via fully connected layers. The results are in turn passed to the Fast R-CNN for bounding box classification and fine-tuning. Due to the convolutional layers used simultaneously by the RPN and the Fast R-CNN, the architecture provides a highly efficient solution for region proposals [114]. Furthermore, since Faster R-CNN forms one continuous CNN, the network can be trained end-to-end using backpropagation, and handcrafted features are no longer necessary [85, 164].

9.3.2 Single-stage detection frameworks

Single-stage detectors present a simpler network by transforming the input into a structured data representation and employing a CNN to directly estimate bounding box parameters and class scores in a fully convolutional manner.

Object detectors based on region proposals are computationally intensive and have long inference times, especially on mobile devices with limited memory and computational capacities [85, 113]. Therefore, single-stage frameworks with significant time advantages have been designed, while having acceptable drawbacks in performance in comparison to the heavyweight two-stage region proposal detectors of the R-CNN family. The speed improvement results from the elimination of bounding box proposals as well as the feature resampling [86]. Two popular approaches which launched this development are YOLO (you only look once) [113] and SSD (single-shot multibox detector) [86].

The basic idea of YOLO is to divide the original input image into an \(S\times S\) grid. Each grid cell is responsible for both classifying the objects within it and predicting the bounding boxes and their confidence values. In doing so, YOLO uses features of the entire input image and not only those of proposed local regions. The use of only a single neural network by omitting the RPN allows the faster YOLO variant Fast YOLO to run in real time at up to 155 frames per second. However, YOLO exhibits disadvantages in the form of comparably lower-quality results, such as more frequent localization errors, especially for smaller objects [113].

The SSD framework also has real-time capability but does not suffer from such severe performance losses as YOLO. Similarly, the model consists of a single continuous CNN but uses the idea of anchors from the RPN. Instead of fixed grids as in YOLO, anchor boxes of various sizes are used to determine the bounding boxes. To detect objects of different sizes, the predictions of several generated feature maps of descending resolution are combined, with the earlier, higher-resolution feature maps responsible for smaller objects and the later, coarser maps for larger objects [86].

Thanks to regressing bounding boxes and class scores in one stage, single-stage networks are faster than two-stage frameworks. However, features are not learned from predicted bounding-box proposals but from predefined anchors. Hence, resulting predictions are usually not as accurate as those from two-stage frameworks.

Compared to single-stage approaches, proposal-based models can leverage finer spatial information in the second stage by focusing only on the narrowed-down regions of interest predicted by the first stage. Features are re-extracted for each proposal, which achieves more accurate localization and classification but, in turn, increases the computational costs.

The single- and two-stage paradigm can be transferred from 2DOD to 3DOD. Beyond this distinction, we further differentiate between the detection techniques described in the following.

9.4 Anchor-based detection

Many modern object detectors make use of anchor boxes, which serve as the initial guess for the bounding box prediction. The main idea behind anchor boxes is to define a certain set of boxes with different scales and ratios that are mapped densely across the image. This exhaustive selection should be able to capture all relevant objects. The boxes that best contain and match the objects are finally retained.

Anchor boxes are boxes of predefined width and length. Both are important hyperparameters, since they must match the dimensions of the objects in the dataset. To cover all variations of ratios and scales, it is common to choose a collection of anchor boxes in multiple sizes (see Fig. 10).
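A simple way to picture this is the dense anchor generation sketched below, which places one anchor per scale and aspect ratio at every feature-map cell; the scale and ratio values as well as the box parameterization are illustrative defaults rather than settings of any particular model.

```python
import numpy as np

def generate_anchors(feature_size, stride, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate a dense grid of 2D anchors (x_min, y_min, x_max, y_max):
    one anchor per scale/ratio combination, centered on every feature-map cell."""
    anchors = []
    for y in range(feature_size[0]):
        for x in range(feature_size[1]):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # cell center in pixels
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # aspect ratio r ~ w / h
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(anchors)   # (H * W * len(scales) * len(ratios), 4)
```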

Fig. 10: Visualization of several anchors of different scales and ratios centering on the same feature point [76]

Anchor boxes are proposed in high numbers across the image. Typically, they are initialized at the center of each cell of the final feature map after the feature extraction stage. For localization and classification, a network uses these anchor boxes to learn to predict the offsets between the anchors and the ground truth. Typically, a combination of the classification confidence of each box and its overlap with the ground truth box, called intersection-over-union (IoU), decides which anchor boxes are discarded and which are kept for refinement purposes [166].
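The IoU-based assignment can be sketched as follows for axis-aligned 2D boxes; the positive and negative thresholds are common illustrative values, and refinements such as always keeping the best-matching anchor per ground-truth box are omitted.

```python
import numpy as np

def iou_2d(anchors, gt):
    """Axis-aligned IoU between (N, 4) anchors and one ground-truth box,
    both given as (x_min, y_min, x_max, y_max)."""
    ix1 = np.maximum(anchors[:, 0], gt[0]); iy1 = np.maximum(anchors[:, 1], gt[1])
    ix2 = np.minimum(anchors[:, 2], gt[2]); iy2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def assign_anchors(anchors, gt, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors as positive (1), negative (0) or ignored (-1) by their
    IoU overlap with a ground-truth box."""
    iou = iou_2d(anchors, gt)
    labels = np.full(len(anchors), -1)
    labels[iou >= pos_thresh] = 1
    labels[iou < neg_thresh] = 0
    return labels
```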

In Faster R-CNN, three aspect ratios and three scales are used by default, resulting in nine anchors per location. The approach significantly reduces the number of anchors in comparison to existing solutions. This is of particular importance as it enables a computationally acceptable integration of region proposals into the huge search space of 3DOD candidates [135].

9.4.1 Two-stage detection: 2D anchors

For 2D representations of the spatial space (e.g., BEV, RV and RGB-D), the previously described R-CNN frameworks can be applied directly and are therefore widely used. Most models first predict 2D candidates from monocular representations. These predictions are then passed to a subsequent network that transforms the proposals into 3D bounding boxes by applying techniques such as depth estimation networks, geometric constraints or 3D template matching (cf. Section 7.2.2).

As a representative example of a 2D anchor-based detection approach in 3DOD, Chen et al. [19] apply a 2D RPN to considerably sparse and low-resolution input data such as FV or BEV projections. As this data may not carry enough information for proposal generation, Chen et al. [19] assign four 2D anchors per frame and class to every pixel in the BEV, RV and image feature maps, and combine these crops in a deep fusion scheme. The 2D anchors are derived from representative 3D boxes, which were obtained by clustering the ground truth objects in the training set by size and restricting the orientation to 0\(^\circ \) and 90\(^\circ \). Leveraging sparsity, they only compute non-empty anchors on the last convolutional feature map.
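
The idea of deriving a small set of representative anchor dimensions from training-set statistics can be sketched as follows; the use of k-means and the hypothetical box dimensions are illustrative assumptions, not the exact procedure of Chen et al. [19].

```python
import numpy as np
from sklearn.cluster import KMeans

def derive_anchor_sizes(gt_dimensions, n_clusters=2):
    """Cluster ground-truth box sizes (length, width, height) to obtain a few
    representative anchor dimensions, roughly following the idea of deriving
    anchors from training-set statistics."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(np.asarray(gt_dimensions))
    return km.cluster_centers_  # one (l, w, h) prototype per cluster

# Hypothetical ground-truth dimensions in meters (cars and cyclists)
sizes = [[3.9, 1.6, 1.5], [4.2, 1.7, 1.6], [1.8, 0.6, 1.7], [1.7, 0.6, 1.8]]
print(derive_anchor_sizes(sizes, n_clusters=2))
# Each prototype would then be placed at 0 and 90 degree orientation.
```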

For 3DOD, the 2D anchors are then reprojected to their original spatial dimensionality, which was derived from the ground truth boxes. Subsequently, these 3D proposals serve as input for the final refinement regression of the bounding boxes.

Further examples are proposed by Deng and Latecki [27], Zeng et al. [162], Maisano et al. [94] and Beltrán et al. [9].

9.4.2 Two-stage detection: 3D anchors

While 2D anchor-based detectors offer relatively fast detection, they are less suited for high-precision detection. Therefore, a growing part of research is devoted to new and more complex 3D anchor-based detection.

An initial attempt to deploy region proposals with 3D anchors was made by Chen et al. [17] when introducing 3DOP, which is based on handcrafted features and priors. 3DOP uses depth features in RGB-D point clouds to score candidate boxes in spatial space. A little later, Song and Xiao [135] exploit more powerful deep learning features for candidate creation in their seminal work on Deep Sliding Shapes. Inspired by Faster R-CNN, Deep Sliding Shapes divides the 3D scene, obtained from RGB-D data, into voxels and then designs a 3D convolutional RPN to learn objectness for spatial region proposals. The authors define the anchor boxes for each class based on statistics. For each anchor with a non-square ground plane, they define an additional anchor of the same size but rotated by 90\(^\circ \). This results in a set of 19 anchors for their indoor scenarios. Given their experiments on the SUN RGB-D [133] and NYUv2 [128] datasets, the total number of anchors per image is about 1.4 million, compared to roughly 2,000 proposals per RGB image frame generated by the selective search algorithm in R-CNN. Thus, the huge number of anchors leads to extreme computational costs.

Apart from regular representations in the form of a voxelized space, Xu et al. [153] leverage a point-wise representation for anchor-based detection. The input 3D points are used as dense spatial anchors, and a prediction is performed on each of the points with two connected MLPs. Similarly, Yang et al. [159] define two anchors to propose a total of six 3D candidates on each point of the point cloud. To reduce the number of proposals, they apply a 2D semantic segmentation network whose output is mapped to the 3D space and eliminates all proposals made on background points. Subsequently, the proposals are refined and scored by a lightweight PointNet prediction network.

9.4.3 Single-stage detection: 2D anchors

Similar to two-stage architectures, the 2D anchor-based single-stage detector framework can be directly applied to 2D-based representations. Exemplary representatives can be found in the work from Liang et al. [80], Meyer et al. [96], Ali et al. [2] and He et al. [48].

Ali et al. [2], for instance, use the average box dimensions for each object class from the ground truth dataset as 3D reference boxes and derive 2D anchors. Then a single-stage YOLO framework is applied to a BEV representation and two regression branches are added to produce the z-coordinate of the center of the proposal as well as the height of the box.

To enhance prediction quality, He et al. [48] perform the auxiliary detection task of point-wise foreground segmentation prior to exploiting anchor-based detection. Subsequently, they estimate the object center with a 3D CNN and only then reshape the feature maps to BEV and employ anchor-based 2D detection.

For further improvement, Gustafsson et al. [47] design a differentiable pooling operator for 3D to extend the SA-SSD approach of He et al. [48] by a conditional energy-based regression approach instead of the commonly used Gaussian model.

9.4.4 Single-stage detection: 3D anchors

Single-stage detection networks that use anchors based on 3D representations particularly focus on extracting meaningful and rich features in the first place, since 3D detection methods are not yet as mature as their 2D equivalents. Likewise, the missing performance boost of an additional refinement stage must be compensated for.

As an exemplary approach, Zhou and Tuzel [169] introduce the seminal VFE module as an approach for discriminative feature extraction (cf. Section 7.4.1). As of today, it is the state-of-the-art encoding for voxel-wise detection models. Having access to these meaningful features, they only use a simple convolutional middle layer in combination with a slightly modified RPN for a single-stage detection purpose.

Observing the problem of high inference times of 3D CNNs in volumetric representations, Sun et al. [137] introduce a single-stage 3D CNN, treating detection and recognition as one regression problem in a direct manner. Therefore, they develop a deep hierarchical fusion network capturing rich contextual information.

Further exemplary representatives of single-stage 3D anchor detection are proposed by Yan et al. [155] and Li et al. [77], which mainly build upon the success of VoxelNet.

9.5 Anchorless detection

Anchorless detection methods are commonly based on point- or segment-wise detection estimates. Instead of generating candidates, the whole scene is densely classified, and the individual objects and their respective position are derived directly. Apart from a larger group of approaches that use fully convolutional networks (FCNs) (Sect. 9.5.1), there exist several other individual solutions (Sect. 9.5.2) that propose anchor-free detection models.

9.5.1 Approaches based on fully convolutional networks

Rather than exploiting anchor-based region proposal networks, Li et al. [72] pioneered the idea of extending fully convolutional networks (FCNs) [89] to 3DOD. The proposed 3D FCN does not require candidate regions for detection but implicitly predicts objectness over the entire image. Instead of generating multiple anchors over the feature map, the bounding box is then directly determined over the objectness regions. Li [70] further extends this approach in a successor model by going from depth map data to a spatial volumetric representation derived from a LiDAR point cloud.

Kim and Kang [58] use a two-stage approach, initially predicting candidates in a projection representation based on edge filtering. Leveraging edge detection, objects get segmented and unique box proposals are generated based on the edge boundaries. In the second stage, the authors then apply a region-based FCN to the region of interest.

Meyer et al. [96, 97] both employ mean shift clustering for detection. They use an FCN to predict a distribution over 3D boxes for each point of the feature map independently. Consequently, points on the same object should predict a similar distribution. To eliminate the natural noise of these predictions, they combine the per-point predictions through mean shift clustering. Since all distributions are class-dependent and multimodal, the mean shift has to be performed separately for each class and mode. For efficiency reasons, mean shift clustering is performed over box centers instead of box corners, thereby reducing dimensionality.
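
A simple way to illustrate this clustering step is to run an off-the-shelf mean shift over the predicted box centers; the bandwidth and the toy center predictions below are assumptions for illustration and are not taken from Meyer et al. [96, 97].

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_box_centers(predicted_centers, bandwidth=0.5):
    """Group noisy per-point box-center predictions (N x 3) with mean shift so
    that points belonging to the same object collapse to a single detection."""
    ms = MeanShift(bandwidth=bandwidth)
    labels = ms.fit_predict(np.asarray(predicted_centers))
    return labels, ms.cluster_centers_

# Two noisy groups of center predictions yield two clusters (two objects)
centers = np.array([[10.0, 2.0, 0.9], [10.1, 2.1, 1.0],
                    [25.0, -3.0, 0.8], [24.9, -3.1, 0.9]])
labels, modes = cluster_box_centers(centers)
print(labels, modes)
```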

Further representatives using FCNs are Yang et al. [156, 157], who use hierarchical multi-scale feature maps, and Wang and Jia [148], who apply an FCN in a sliding-window fashion. These networks output pixel-wise predictions in a single stage, with each prediction corresponding to a 3D object estimate.

9.5.2 Other approaches

Since point-based representations do not admit standard convolutional networks, models processing the raw point cloud need to find other ways to apply detection mechanisms. In the following, we summarize some of these innovative developments.

A seminal work that offers a pipeline for directly working on raw point clouds was proposed by Qi et al. [107] when introducing VoteNet. The approach integrates the synergies of 3D deep learning models for feature learning, namely PointNet++ [109] and Hough Voting [69]. Since the centroid of a 3D bounding box is most likely far from any surface point, the estimation of bounding box parameters that are based solely on point clouds is a difficult task. By considering a voting mechanism, the authors generate new points that are located close to object centroids, which are used to produce spatial location proposals for the corresponding bounding box. They argue that a voting-based detection is more compatible with sparse point sets as compared to RPNs since RPNs have to carry out extra computations to adjust the bounding box without having an explicit object center. Furthermore, the center is likely to be in an empty space of the point cloud.

Fig. 11 Illustration of the architecture and the steps performed by VoteNet [107]

Figure 11 illustrates the approach. First, a backbone network based on PointNet++ is used to learn features on the points and derive a subset of points (seeds). Each seed proposes a vote for the centroid by using Hough voting (votes). The votes are then grouped and processed by a proposal module to provide refined proposals (vote clusters). Eventually, the vote clusters are classified and bounding boxes are regressed.
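
The voting idea can be sketched in a few lines: seed points cast votes by adding learned offsets toward the object centroid, and the votes are then grouped into proposals. In the sketch below, mean shift serves only as a simple stand-in for VoteNet's learned sampling-and-grouping module, and the offsets are assumed to be given rather than predicted by a network.

```python
import numpy as np
from sklearn.cluster import MeanShift  # illustrative substitute for VoteNet grouping

def votes_from_seeds(seed_xyz, predicted_offsets):
    """Each seed casts a vote for an object centroid by adding its predicted
    offset (in VoteNet the offsets come from a learned voting module)."""
    return np.asarray(seed_xyz) + np.asarray(predicted_offsets)

def group_votes(votes, bandwidth=0.3):
    """Group votes into clusters that act as object proposals. VoteNet uses
    farthest point sampling and ball grouping; mean shift is used here only
    for illustration."""
    ms = MeanShift(bandwidth=bandwidth)
    labels = ms.fit_predict(votes)
    return labels, ms.cluster_centers_

# Two objects: surface seeds vote toward their respective centroids
seeds = np.array([[10.2, 1.9, 1.1], [9.8, 2.2, 0.7], [25.3, -3.1, 0.6], [24.8, -2.9, 0.4]])
offsets = np.array([[-0.2, 0.1, -0.1], [0.2, -0.2, 0.3], [-0.3, 0.1, 0.4], [0.2, -0.1, 0.6]])
labels, centers = group_votes(votes_from_seeds(seeds, offsets))
print(labels, centers)  # two vote clusters, one per object
```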

VoteNet provides accurate detection results even though it relies solely on geometric information. To further enhance this approach, Qi et al. [105] propose ImVoteNet. The successor complements VoteNet by utilizing the high resolution and rich texture of images to fuse 3D votes in point clouds with 2D votes in images.

Another voting approach is proposed by Yang et al. [158]. The authors first use a voting scheme similar to Qi et al. [107] to generate candidate points as representatives. The candidate points are then further treated as object centers and the surrounding points are gathered to feed an anchor-free regression head for bounding box prediction.

Pamplona et al. [102] propose an on-road object detection method in which they eliminate the ground plane points and then establish an occupancy grid representation. Bounding boxes are then extracted for occupied regions that contain a certain number of points. For classification purposes, PointNet is applied.

Shi et al. [124] use PointNet++ for a foreground point segmentation. This contextual dense representation is further used for a bin-based 3D box regression and then refined through point cloud region pooling and a canonical transformation.

Besides their anchor-based solution, Shi et al. [125] also conduct experiments on an anchor-free solution, reusing the detection head of PointRCNN [124]. They found that while the anchorless variant is more memory efficient, the anchor-based strategy results in a higher object recall.

Similar to Shi et al. [124], Li et al. [73] also exploit a foreground segmentation of the point cloud. For each foreground point, an IoU-sensitive proposal is produced, which leverages the attention mechanism. This is done by only taking the most relevant features into account as well as further geometrical information about the surroundings. For the final prediction, they add a supplementary IoU-perception branch to the commonly used classification and bounding box regression branch for a more accurate instance localization.

Furthermore, Zhou et al. [167] introduce spatial embedding-based object proposals. A point-wise semantic segmentation of the scene is used in combination with a spatial embedding for instance segmentation. The embedding assembles all foreground points around their corresponding object centers. After clustering, a mean bounding box is derived for each instance, which is further refined by a network based on PointNet++.

9.6 Hybrid detection

Next to representation and feature extraction fusion (cf. Section 8), there are also approaches for fusing detection modules. Two-stage detection frameworks, especially representation-fusion-driven models, generally prefer to exploit 3D detection methods such as anchor-based 3D CNNs, 3D FCNs or PointNet-like architectures for the refinement stage after an initially lightweight 2D-based estimation has been performed. This offers the advantage of precise prediction in spatial space through 3D detection frameworks, which would otherwise be too time-consuming to apply to the entire scene. In the following, we classify models that use multiple detection techniques as hybrid detection modules.

As an example, Wang and Jia [148] apply multiple methods to arrive at a final prediction. First, they use anchor-based 2D detection to generate proposals, which are then extruded as frustums into 3D space. Thereafter, an anchorless FCN detection technique is applied to classify the frustum in a sliding-window fashion along the frustum axis and output the final prediction.

A frequently exercised approach is to extrude 2D proposals into the spatial domain of point clouds. The pioneer for this technique is Frustum-PointNet [108], enabling PointNet for the task of object detection. Since PointNet can effectively segment the scene, but not produce location estimations, the authors use preceding 2D anchor-based proposals, which then are classified by PointNet.

Likewise, Ferguson and Law [30] as well as Shen and Stamos [122] propose a quite similar idea. They first reduce the search space by extruding a frustum from a 2D region proposal into 3D space and then use a 3D CNN for the detection task.
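
The frustum extrusion shared by these approaches can be sketched as a back-projection of the 2D box corners at a near and a far depth using the camera intrinsics; the intrinsics and the depth range below are hypothetical values chosen for illustration.

```python
import numpy as np

def frustum_corners(box_2d, K, d_near=1.0, d_far=60.0):
    """Lift a 2D box (u1, v1, u2, v2) into a 3D frustum by back-projecting its
    four corners at a near and a far depth using the camera intrinsics K."""
    u1, v1, u2, v2 = box_2d
    K_inv = np.linalg.inv(K)
    corners = []
    for depth in (d_near, d_far):
        for u, v in [(u1, v1), (u2, v1), (u2, v2), (u1, v2)]:
            ray = K_inv @ np.array([u, v, 1.0])  # viewing ray through the pixel
            corners.append(ray * depth)          # point on the ray at this depth
    return np.array(corners)                     # (8, 3) frustum corner points

# Hypothetical intrinsics and a 2D detection
K = np.array([[720.0, 0.0, 620.0], [0.0, 720.0, 190.0], [0.0, 0.0, 1.0]])
print(frustum_corners((300, 150, 420, 260), K).shape)  # (8, 3)
```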

Apart from extruding frustums from the initial 2D candidates, proposals in projection representations are often converted into fully specified 3D proposals in space. This is possible because the projection retains depth information, enabling the mapping of the 2D representation to 3D.

For example, Zhou et al. [168], Shi et al. [123] and Deng et al. [26] transfer the proposals of the 2D representation into a spatial representation not by extruding an unrestricted search space along the z-axis, but as fully defined anchor-based 3D proposals obtained through a 2D-to-3D conversion of the projections. 3D-compatible detection methods such as anchorless and anchor-based 3D CNNs or PointNets can then be deployed for refinement.
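
A minimal sketch of such a 2D-to-3D conversion for a BEV proposal is given below; the discretization parameters and the estimated z-center and height are assumptions for illustration, whereas in an actual model they come from the BEV encoding and a regression branch, respectively.

```python
import numpy as np

def bev_box_to_3d(bev_box_px, bev_resolution, bev_origin_xy, z_center, height):
    """Convert a BEV proposal in pixel coordinates (u1, v1, u2, v2) into a
    fully specified axis-aligned 3D box (x, y, z, l, w, h)."""
    u1, v1, u2, v2 = bev_box_px
    res = bev_resolution                 # meters per BEV pixel (assumed)
    x0, y0 = bev_origin_xy               # metric position of BEV pixel (0, 0)
    x_center = x0 + res * (u1 + u2) / 2.0
    y_center = y0 + res * (v1 + v2) / 2.0
    length = res * abs(u2 - u1)
    width = res * abs(v2 - v1)
    return (x_center, y_center, z_center, length, width, height)

# Hypothetical 0.1 m/pixel BEV grid starting at (0 m, -40 m)
print(bev_box_to_3d((200, 310, 240, 328), 0.1, (0.0, -40.0), z_center=-1.0, height=1.6))
```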

Chen et al. [20] first generate voxelized candidate boxes that are further processed in a point-wise representation during a second stage. 3D and 2D CNNs are stacked upon a VFE-encoded representation for proposals in the first stage, and only then a PointNet-based network is applied for the refinement of the proposals.

Table 3 Classification of monocular 3DOD models (360\(^\circ \): 360\(^\circ \)-monocular image, PC: point cloud, 2D-3D: 2D-3D consistency, GP: groundplane)
Table 4 Classification of RGB-D front view-based 3DOD models
Table 5 Classification of projection-based 3DOD models
Table 6 Classification of volumetric grid-based 3DOD models
Table 7 Classification of point-based 3DOD models

Ku et al. [61] use an anchor-based monocular 2D detection to estimate the spatial centroid of the object. An object instance is then reconstructed and laid on the proposal in the point cloud, helping to regress the final 3D bounding box.

In contrast to anchor-based detection, Gupta et al. [46] do not execute a dense pixel-wise regression of the bounding box but initially estimate key points in the form of the bottom center of the object. Only those key points and their nearest neighbors are then used to produce a comparatively low number of positive anchors, accelerating the detection process.

Similar to other techniques, matching algorithms are likewise fused in hybrid frameworks. For example, Chabot et al. [13] use a network to output 2D bounding boxes, vehicle part coordinates and 3D bounding box dimensions. Then they match the dimensions and parts derived in the first step with CAD templates for final pose estimation. Further, Du et al. [28] score the 3D frustum region proposals by matching them with a predefined selection consisting of three car model templates. Similarly, Wang and Jia [148] use a combination of PointNet and 2D-3D consistency constraints within the frustum to locate and classify the objects.

Instead of fusing detection techniques hierarchically, Pang et al. [104] use 2D and 3D anchor-based detection in a parallel fashion and fuse the candidates in an IoU-sensitive way. No non-maximum suppression (NMS) is performed before fusing the proposals, because the 2D-3D consistency constraint between the proposals already eliminates most candidates.

In summary, hybrid detection approaches try to compensate for the weaknesses of a single detection framework. While some of these models reach remarkable results, the harmonization of two different systems represents a major challenge: next to the advantages, the disadvantages of the specific techniques also need to be handled. In the case of a compound solution between a 2D- and a PointNet-like technique, the result offers an obvious improvement in inference speed, as the initial prediction is usually performed in a lightweight 2D detection framework that limits the search space for the PointNet. Yet, the accuracy and precision of detection are less favorable in comparison to full 3D region proposal networks, since possible uncertainties of the 2D detection are inherited by the next hierarchical step.

10 Classification of 3D object detection pipelines

In the previous sections, we gave a comprehensive review of different models and methods along the 3DOD pipeline and emphasized representative examples for every stage with their corresponding design options. In the following, we use our proposed pipeline framework from Fig. 1 (see Sect. 4) to classify each 3DOD approach of our literature corpus to derive a thorough systematization.

For better comparability, we distinguish all 3DOD models according to their data representation. That is, we provide separate classification schemes for (i) monocular models (Table 3), (ii) RGB-D front-view-based models (Table 4), (iii) projection-based models (Table 5), (iv) volumetric grid-based models (Table 6), (v) point-based models (Table 7) and (vi) fusion-based models (Table 8).

For each 3DOD approach, we provide information on the authors, the year and the name, classify the underlying domain and benchmark dataset(s) and categorize the specified design choices along the 3DOD pipeline.

The resulting classification can help researchers and practitioners alike to get a quick overview of the field, spot developments and trends over time and identify comparable approaches for further development and benchmark purposes. As such, our classification delivers an overview of different design options, provides structured access to knowledge in terms of a 3DOD pipeline catalog and offers a setting to position individual configurations of novel solutions on a more comparable basis.

11 Concluding remarks and outlook

3D object detection is a vivid research field with a great variety of approaches. The additional third dimension compared to 2D vision forces the exploration of completely new methods, while mature 2DOD solutions can only be adopted to a limited extent. Hence, new ideas and new ways of using data are emerging to handle the advanced problem of 3DOD, resulting in a fast-growing research field that is finely branched in its trends and approaches.

From a broader perspective, we could observe several global trends within the field. For instance, a general objective of current research is to optimize the increased computation and memory requirements caused by the extra dimension of 3DOD, with the ultimate goal of finally reaching real-time detection.

More recent approaches increasingly focus on fully leveraging point-wise representations, since these promise the best conception of 3D space. Within the literature corpus of this work, PointNet-based approaches remain the only methods so far that can directly process the raw point representation.

Furthermore, we observe that the fusion of feature extraction and detection techniques as well as of data representations is the most popular approach to address common problems of object detection, such as amodal perception, instance variety and noisy data. For feature fusion approaches, the development of attention mechanisms to efficiently fuse features based on their relevance is a major trend. Additionally, the introduction of continuous convolutions facilitates complex modality mapping. In general, hybrid detection models enjoy popularity for exploiting lightweight proposals to restrict the search space for more powerful but heavier refinement techniques.

Table 8 Classification of fusion-based 3DOD models (BEV: bird’s eye view, RV: range view, V: voxel, PC: point cloud, 2D-3D: 2D-3D consistency, H: handcrafted, VFE: voxel feature encoding, DL: deep learning, P: PointNet, P++: PointNet++, 3D S.-CNN: 3D sparse CNN, Vote: voting scheme)

In summary, this work aimed to complement previous surveys, such as those by Arnold et al. [5], Guo et al. [45] and Fernandes et al. [31], by not restricting the focus to a single domain and/or a specific set of methods of 3D object detection. Accordingly, our search was narrowed only to the extent that the relevant literature should provide a design for an entire 3D object detection pipeline. We purposely included all available approaches independent of the varieties of data inputs, data representations, feature extraction approaches and detection methods. To this end, we reviewed an exhaustively searched literature corpus published between 2012 and 2021, including more than 100 approaches from both indoor applications and autonomous driving applications. Since these two application areas cover the vast majority of existing literature, our survey should not be subject to the risk of missing major developments and trends.

A particular goal of this survey was to give an overview of all aspects of the 3DOD research field. Therefore, we provided a systematization of 3DOD methods along the model pipeline with a proposed abstraction level that is meant to be neither too coarse nor too specific. As a result, it was possible to classify all models within our literature corpus to structure the field, highlight emerging trends and guide future research.

At the same time, however, it should be noted that the several stages of the 3DOD pipeline can be designed with a much broader variety and that each stage, therefore, deserves a much closer investigation in subsequent studies. Fernandes et al. [31], for instance, go into further details for the feature extraction stage by aiming to organize the entanglement of different extraction paradigms. Yet, we believe that a full conception and systematization of the entire field has not been reached.

In addition, we would like to acknowledge that, in individual cases, it might be difficult to draw a strict boundary for the classification of models regarding design choices and stages. 3D object detection is a highly complex and multi-faceted field of research, and knowledge from 2D and 3D computer vision as well as continuous progress in artificial intelligence and machine learning are getting fused.

Thus, our elaboration of the specific stages marks a broad orientation within the configuration of a 3DOD model and should rather be seen as a collection of possibilities than an ultimate and isolated choice of design options. Modern models in particular often jump between these stages and do not follow a linear path along the pipeline, making a strict classification challenging.

Furthermore, as with any review, our work represents only a snapshot in time of the extremely fast-evolving 3DOD research area. Only recently, new deep learning methods have entered the field of 3DOD that can handle point cloud processing, such as graph-based neural networks (e.g., Shi and Rajkumar [126]), kernel point convolutions (e.g., Thomas et al. [141]) and Transformer-based networks (e.g., Misra et al. [98], Mao et al. [95], Pan et al. [103]). Likewise, researchers have come up with novel innovations along the 3DOD pipeline, such as adaptive spatial feature aggregation [55] or semantical point-voxel feature interaction [151]. These new methods and networks are meant to overcome the limitation of previous architectures. Yet, the potential of existing methods such as PointNet has probably not been reached. A consideration of these fairly new concepts and methods in this survey would have exceeded the scope of this work. However, for future work, a further investigation into these directions could be of great interest. To this end, our proposed systematization offers a great starting point to classify and compare existing as well as emerging approaches on a structured basis.

For future research, we suggest looking into a combination of methods along all stages of the 3DOD pipeline. We recommend examining these aspects independently of the pipeline, since the fusion of techniques often occurs in a non-linear way.

Finally, this work could support the practical creation of individual modules or even a whole new 3DOD model, since the systematization along the pipeline can serve as an orientation for design choices within the specific stages.