1 Introduction

6D position estimation is an essential task in many Computer Vision (CV) applications. It concerns, among others, robotics [35], autonomous driving [20], and virtual/augmented reality (VR/AR) applications [27], and is extensively used in the entertainment and medical care industries [20]. The problem itself is simple to state: it consists of determining the 3D rotation and translation, relative to the camera, of an object whose shape is known, using details observable in the reference 2D image. However, solving this problem is not trivial [35]. Firstly, due to auto-occlusions or symmetries, objects may not be clearly and unequivocally identifiable. Moreover, the image conditions are not always optimal in terms of lighting and occlusions between the objects represented in the picture [2, 20, 73]. In these situations, it is often necessary to add an earlier object detection or localization stage to isolate the area of the image containing the object before estimating its position.

Although researchers have studied this problem for many years, it experienced a rebirth with the advent of Deep Learning (DL) [19], in the same way as other fields of application, such as the medical [52,53,54] or face recognition [38] domains. Early pose estimation methods were based on geometrical approaches, such as Feature-based methods, which tried to establish correspondences between 3D models and 2D images of objects using manually annotated local features. With texture-less or geometrically complex objects, selecting local features was not easy; in these cases, the matching phase usually took much time, and it might still fail or provide an inaccurate result [69].

In contrast to these methods, researchers introduced Template-based methods, which represent the object in 2D from different points of view and compare these representations with the original image to establish position and orientation. Although these approaches could handle texture-less objects, they were very susceptible to variations in lighting and occlusions, and they required many comparisons to reach a given accuracy level, increasing the execution time [27]. With the diffusion of DL, researchers improved traditional methods by introducing Learning-based methods, making them more efficient and better performing. The basic idea of these systems is to use Convolutional Neural Networks (CNNs) to learn a mapping between images annotated with three-dimensional pose information and the object's 6D position. Some of these systems employ a CNN to predict the 2D projections of the 3D bounding box corners and then apply the PnP algorithm. The PnP algorithm is extensively used in CV to calculate the 6D position from matches between 2D features on the test image and 3D points on the CAD model [16]. Other types of Learning-based methods, instead, need only a CNN to solve a classification or a regression problem. For this reason, Learning-based methods are referred to as Bounding box prediction and PnP algorithm-based, Classification-based, and Regression-based methods, respectively. These methods can reach very high levels of precision but need a large amount of data to train the network accurately and to work well in real cases. Alternatively, CNNs can be used to execute the most critical steps of traditional methods, joining the advantages of the various strategies into the final solution [69].
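To make the PnP step concrete, the following minimal Python sketch (assuming OpenCV and NumPy; the intrinsic matrix, bounding-box corners, and ground-truth pose are invented placeholder values, not taken from any cited work) fabricates eight 2D-3D correspondences, the typical output of a corner-predicting CNN, and recovers the rotation and translation from them:

```python
import numpy as np
import cv2

# Hypothetical pinhole camera intrinsics (fx, fy, cx, cy) -- placeholders only.
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]], dtype=np.float64)

# 3D points on the object model: the 8 corners of a 10 cm cube in object coordinates.
object_points = np.array([[sx * 0.05, sy * 0.05, sz * 0.05]
                          for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                         dtype=np.float64)

# Ground-truth pose used only to fabricate the 2D points a CNN would predict.
rvec_gt = np.array([0.1, -0.2, 0.3])           # axis-angle rotation
tvec_gt = np.array([0.0, 0.0, 0.5])            # half a metre in front of the camera
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)
image_points = image_points.reshape(-1, 2)

# Recover the 6D pose from the 2D-3D correspondences alone.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)                     # convert to a 3x3 rotation matrix
print(ok, rvec.ravel(), tvec.ravel())          # should match rvec_gt / tvec_gt
```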

Referring to the methods mentioned above, this literature review focuses on the classification of 6D position estimation methods from a single RGB image. The main goal of this work is to supply a baseline for the development of new applications which can work even under boundary conditions, namely auto-occlusions, symmetries, occlusions between multiple objects, and bad lighting conditions. These conditions, indeed, are widespread in real application domains, for example autonomous driving and the medical field. However, the literature review did not reveal a one-size-fits-all method for every case. Consequently, an attempt was made to establish guidelines for new applications, considering the context and related implementation conditions on the one hand, and the availability of data and computing power on the other.

The paper is organized as follows: Section 2 describes the methodology used to select the articles, Section 3 illustrates Feature-based methods, Section 4 describes Template-based methods and Section 5 focuses on Learning-based methods. The latter methods have in turn been classified into three categories: Bounding box prediction and PnP algorithm-based methods (Section 5.1); Classification-based methods (Section 5.2); Regression-based methods (Section 5.3). Finally, Section 6 summarizes and discusses the study.

2 Methodology

In this literature review, the articles have been selected using different strategies to establish, among the hundreds of existing papers on the topic, the most suitable ones. In general, only papers published from 2016 onwards have been considered. A second discard criterion concerned input data, since only systems taking a single RGB image as input, and optionally a 3D model of the object, have been kept. Multiple-camera or RGB-D setups are hard to use: they are expensive, the calibration process is troublesome, and the equipment may become too heavy. As a result, articles including stereo images, RGB-D images, or data from sensors such as LIDAR have been ignored, since they supplement the input with depth information, lessening the problem complexity.

Specifically, the articles have been searched on Google Scholar, IEEE, ACM Digital Library, Springer and Science Direct using as keywords “6D pose estimation from RGB images”, “Viewpoint prediction”, “Position and Orientation estimation”, “3D point anchoring”, “Automatic Registration”. Indeed, these keywords appear consistent with the references of recent surveys on the 6D pose estimation topic, such as the one proposed by Sahin et al. in [48]. From a first analysis, n = 891 articles have been selected, using as constraints: (1) Number of citations > 5; (2) Year of publication > 2015. These exclusion criteria have been chosen because, in general, articles with fewer than 5 citations were considered related to a narrower research field, i.e., they analyzed objects with peculiar features and, consequently, the proposed method was hardly generalizable to other contexts. However, although some recent papers did not satisfy this condition, they were included because they were considered relevant to our work. Furthermore, given the ongoing innovation of these technologies, this review focuses only on recent studies. In a second step, n = 815 studies have been excluded after screening titles and abstracts, leaving n = 76 articles for a first full-text review. Then, starting from the bibliographies of the full-text analyzed papers, n = 13 new references have been obtained, reaching a total of n = 89 valuable articles for a second full-text review. Finally, after this phase, n = 32 papers have been excluded for three main reasons: (1) n = 14 systems using RGB-D input, (2) n = 12 works concerning old technologies, (3) n = 6 papers which showed poor accuracy and, consequently, could not be exploited for applications related to critical research fields such as medicine.

Finally, although all the examined articles satisfied all the requirements, a further screening has been performed by considering the number of citations of each paper, as reported by Google Scholar. The PRISMA [32] flowchart is shown in Fig. 1. Once selected, the articles were categorized according to the pipeline and architecture of the proposed method.

Fig. 1 PRISMA flowchart

Therefore, this review has three main features:

  1. It considers only RGB images as input, excluding RGB-D images and data from LIDAR sensors, because this information is often not available.

  2. It first classifies the selected articles into three main categories: Feature-based, Template-based, and Learning-based methods.

  3. Then, as the Learning-based approaches are of great importance in research, it focuses on this class of methods and, in turn, categorizes them according to the task solved by the CNN: Bounding box prediction and PnP algorithm-based methods, where a CNN predicts the 3D bounding box and a PnP algorithm then calculates the 6D position from matches between 2D features on the test image and 3D points on the CAD model [16]; Classification-based and Regression-based methods, in which the CNN solves a classification or a regression problem, respectively.

To achieve the aim of the paper and provide a baseline for the development of new applications, the methods were analyzed considering specific parameters, shown in the columns of the summary tables (Tables 1, 2, 3, 4 and 5). In addition to general information and the selection criteria mentioned above, the following parameters were considered:

  • Pose Refinement. Methods requiring an additional step to refine the estimated coordinates commonly need more processing time. Therefore, this characteristic is discriminating when an application needs to run in real time.

  • Dataset. Information regarding the dataset is an indicator of the ability of the system to be generalizable in different circumstances. For example, approaches that employ a dataset of only computer-generated images sometimes do not work satisfactorily with real images.

  • Dataset size. The size of the dataset is a helpful parameter because some specific strategies, such as training a neural network, require a large amount of data to perform satisfactorily.

  • Real-time. Information about processing time is essential as a guideline for a new application since it is a fundamental prerequisite in some domains, such as autonomous driving.

  • Accuracy. Accuracy is essential for some critical research fields, such as medicine. Consequently, the appropriate level of precision should be determined considering the specific application.

    Table 1 Feature-based methods
    Table 2 Template-based Methods
    Table 3 BB prediction and PnP algorithm-based Methods
    Table 4 Classification-based Methods
    Table 5 Regression-based Methods

3 Feature-based methods

Methods in this category take advantage of local features (keypoints, grey values, edges, or intersections of straight lines) extracted from regions of interest or from all pixels in the image, which are then compared with the features annotated on a 3D model of the object to establish 2D-3D matches [16, 20, 66]. Therefore, the pipeline includes two stages: the first stage extracts local features and compares them with the 3D keypoints; the second stage uses the 2D-3D correspondences to solve a geometric problem, e.g. via the PnP algorithm, and obtain the 6D position [69]. These techniques combine traditional CV approaches with CNNs, which are harnessed in different stages of the pipeline to improve the overall performance of the system. Figure 2 presents a schematic illustration of these methods.
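The first stage of this pipeline can be sketched as follows. The snippet below is a simplified illustration, assuming ORB descriptors and OpenCV; `model_descriptors` and `model_keypoints_3d` are hypothetical arrays prepared offline, holding one descriptor and one 3D point per feature annotated on the CAD model. Real systems may instead use learned features, as discussed below.

```python
import numpy as np
import cv2

def match_2d_3d(image_gray, model_descriptors, model_keypoints_3d, ratio=0.75):
    """Return (points_2d, points_3d) correspondences between an image and a 3D model.

    model_descriptors / model_keypoints_3d are hypothetical arrays built offline:
    one binary ORB descriptor and one 3D point per annotated model feature.
    """
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints, descriptors = orb.detectAndCompute(image_gray, None)
    if descriptors is None:
        return np.empty((0, 2)), np.empty((0, 3))

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(descriptors, model_descriptors, k=2)

    points_2d, points_3d = [], []
    for pair in knn:
        if len(pair) < 2:
            continue
        best, second = pair
        if best.distance < ratio * second.distance:      # Lowe's ratio test
            points_2d.append(keypoints[best.queryIdx].pt)
            points_3d.append(model_keypoints_3d[best.trainIdx])
    return (np.asarray(points_2d, dtype=np.float64),
            np.asarray(points_3d, dtype=np.float64))

# The resulting correspondences would then be passed to a robust geometric solver,
# e.g. cv2.solvePnPRansac, which tolerates outlier matches, to obtain the 6D pose.
```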

Fig. 2 Schematic representation of feature matching methods. The input image is first passed through a feature extraction step; then the extracted features are compared with those annotated on a 3D model of the object to find 2D-3D matches. Finally, the 2D-3D correspondences are fed to a pose refinement step, which estimates the 6D position by solving a geometric problem

Advantages:

  • They are fast and robust to occlusions between objects and cluttered scenes [27, 55, 62, 69, 70].

Disadvantages:

  • Objects should have rich, well-defined and distinctive textures for computation of local features [25, 27, 62, 70];

  • They do not work well with symmetrical objects [27];

  • The quality of the extracted keypoints directly affects the accuracy of the position estimation [27, 65];

  • Usually, these methods require a multi-stage pipeline which takes considerable time to perform the task: since the 2D-3D matches yield only a coarse 6D position, a supplementary stage is generally needed to obtain the final pose [7, 27].

In [41, 43], the authors used CNNs to extract features and a shape-fitting algorithm to determine the final pose. In particular, the system in [41] proposed a pipeline including object detection, keypoint localization, and pose refinement. Peng et al. [43] introduced a CNN, called Pixel-wise Voting Network (PVNet), to predict the 2D-3D correspondences by regressing pixel-wise vectors to keypoints. The output was a spatial probability distribution for each keypoint, then fed to a PnP algorithm to obtain the result. This work was robust to occlusion while running at a real-time frame rate. Zhu et al. [71] tried to further improve the performance of PVNet in case of severe occlusions by introducing the Atrous Spatial Pyramid Pooling and Distance-Filtered PVNet. Furthermore, You et al. [65] built a system on top of PVNet, which used a projection loss and a discriminative refinement network to obtain good performance.
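To make the voting idea behind PVNet more concrete, the sketch below is a loose NumPy approximation (not the authors' implementation): given per-pixel unit vectors pointing toward a keypoint, it intersects pairs of rays to generate keypoint hypotheses and keeps the one supported by the most pixels.

```python
import numpy as np

def vote_keypoint(mask, vectors, num_hypotheses=128, inlier_cos=0.99, rng=None):
    """RANSAC-style voting for one keypoint.

    mask:    (H, W) boolean foreground mask of the object.
    vectors: (H, W, 2) unit vectors, each pixel pointing toward the keypoint
             (as a pixel-wise regression network such as PVNet would predict).
    Returns the (x, y) hypothesis supported by the largest number of pixels.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)   # (N, 2) pixel positions
    dirs = vectors[ys, xs].astype(np.float64)                # (N, 2) predicted directions

    best_hyp, best_score = None, -1
    for _ in range(num_hypotheses):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        A = np.stack([dirs[i], -dirs[j]], axis=1)            # 2x2 system for the ray intersection
        if abs(np.linalg.det(A)) < 1e-6:                     # near-parallel rays: skip
            continue
        t = np.linalg.solve(A, pixels[j] - pixels[i])
        hyp = pixels[i] + t[0] * dirs[i]                     # candidate keypoint location

        # A pixel is an inlier if its direction agrees with the direction to the hypothesis.
        to_hyp = hyp - pixels
        to_hyp /= (np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-9)
        score = np.sum(np.sum(to_hyp * dirs, axis=1) > inlier_cos)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp, best_score
```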

Like most of these methods, Zhao et al. [68] leveraged a multi-stage pipeline. They identified the target object using YOLOv3, selected a set of keypoints on the target object, and then trained a ResNet101-based keypoint detector (KPD) to locate them. The 6D pose was then retrieved using a PnP algorithm fed with the 2D-3D keypoint correspondences.

Following the pipeline categorization of the previous methods, Zhao et al. [69, 70] also refined the final output by means of geometrical algorithms. For the first part, instead, they employed a CNN to perform both object detection and keypoint estimation. In [69], the authors introduced an end-to-end framework with a ResNet architecture trained with viewpoint transformation information and salient regions. The goal was to learn geometrically and semantically consistent viewpoints. In [70], the same authors proposed the OK-POSE (Object Keypoint-based pose estimation) network, which learned 3D keypoints from relative transformations between pairs of images rather than from explicit 3D labelling information and 3D CAD models.

In some cases, to remedy the lack of training data, systems trained the network with synthetic images. In this context, Nath Kundu et al. [18] introduced a two-stage pipeline: a first CNN learned position-invariant local descriptors to obtain the corresponding keypoints; a second CNN, combining the information coming from multiple correspondence maps, output the final pose estimate.

Finally, concerning objects that are usually complex to handle, the system proposed by Chen et al. [4] focused on metallic targets, which are texture-less and made of shiny materials. The process included three stages: object detection, feature detection, and pose estimation.

Table 1 shows a schematic recap of each analyzed work according to the main features described in Section 2.

4 Template-based methods

These methods include a first off-line stage which builds a template database from a 3D model of the object. This database includes a set of synthetic renderings, obtained by varying position and orientation, resulting in a group of patches from different points of view. These patches can be imagined as distributed over a virtual sphere surrounding the 3D model of the object. The second stage is a test phase, executed on-line to establish the 6D position: the current image is compared with all the database patches generated in the previous step through a sliding-window algorithm. These systems compute a similarity value to choose the best match [48, 66, 69, 73]. Figure 3 shows an example of how these methods operate.
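A toy version of the on-line stage can be written in a few lines. The sketch below (assuming OpenCV; `templates` is a hypothetical list of grayscale renderings paired with the pose each was rendered from) slides every template over the test image with normalized cross-correlation and keeps the pose of the best-scoring match; production systems replace this brute-force search with hierarchical or coarse-to-fine strategies, as discussed below.

```python
import cv2

def best_template_pose(image_gray, templates):
    """templates: list of (template_image, pose) pairs rendered offline from the 3D model.

    Returns (pose, location, score) of the best normalized cross-correlation match.
    """
    best = (None, None, -1.0)
    for template, pose in templates:
        # Slide the template over the image and take the peak correlation score.
        result = cv2.matchTemplate(image_gray, template, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)
        if max_val > best[2]:
            best = (pose, max_loc, max_val)
    return best
```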

Fig. 3 Schematic representation of template matching methods. A first offline phase builds a template database from a 3D model of the object; then, during an online phase, the input image is compared with all the elements in the template to calculate the 6D position as the best match

Advantages:

  • They work well in case of texture-less objects [27, 69, 71];

  • If the database is exhaustive, they can achieve high accuracy [6].

Disadvantages:

  • They are very sensitive to variations in lighting and occlusions between items, as these circumstances affect the rate of similarity, which is very low when the lighting is scarce or when the object is occluded [7, 25, 27, 69, 71];

  • The execution speed is inversely proportional to the number of elements belonging to the template [28]. However, this number is directly proportional to the accuracy of the method [35, 36]. A rich set of images is required to cover as many positions of the object as possible and to have a high probability of obtaining the correct pose. Therefore, a trade-off between performance degradation and desired accuracy is required [6]. Many approaches, based on CNNs, implement changes to the cost functions by adding ad-hoc terms to solve these problems [28].

As already discussed, older methods applied geometric approaches. In this context, in [1], the authors employed color transformation and vectorization to obtain a more compact representation and to compute the best match. Unlike most methods, called instance-based methods, that try to recover the position of a known instance, in [42] the authors worked at the level of object categories, trying to estimate the position of unknown instances. They introduced features called Bags of Boundaries (BOB), which performed matching only on a summary of edges. Edges were also used by Ulrich et al. [58], who first estimated a discrete position, which was then refined, using a 2D match based on edge features and the corresponding 3D camera position, with the Levenberg-Marquardt algorithm (LMA) [33]. The system automatically produced a hierarchical model from the 3D CAD model of the object to find the item in the image in an efficient and time-saving way. Another solution to the execution-time problem was introduced by Konishi et al. [16]: the Perspectively Cumulated Orientation Feature (PCOF), used to handle a specific range of 3D object positions. Moreover, Hierarchical Pose Trees (HPT) were constructed by clustering the 3D object poses and reducing template resolution.

A big obstacle for this class of approaches is the handling of texture-less objects. To overcome this, Muñoz et al. [36] used edge matches, given coarse position information from a detector. In [35], instead, the authors employed the Cascade Forest Template (CFT): they used regression forests for each template to learn the misalignment between the initial layout and the current one. 6D position estimation strategies are also fundamental for tracking: here, pose estimation is used for tracking initialization and for pose recovery when the algorithm loses the object due to occlusions or when it leaves the camera's field of view. For this purpose, in [56], a new segmentation strategy was proposed, based on a consistent local color histogram.

Another problem of applications requiring the exact position of an object is the management of symmetrical objects. Corona et al. [5] addressed this problem by introducing a particular loss function. As mentioned earlier, the most recent methods use neural networks to make the adopted algorithms more efficient and better performing; in this case, a CNN received an RGB image and a depth map for each viewpoint, corresponding to the model renderings, to predict the 6D position. In [51], the authors used a particular neural network, a denoising autoencoder, for 6D pose estimation, trying to learn representations from rendered 3D model views. Finally, neural networks were also used in [31], where the authors proposed a cross-domain adaptation approach which trained the same CNN, CaffeNet, for both the off-line and the on-line stages.

An overview of the described methods is represented in Table 2.

5 Direct prediction or learning-based methods

These methods predict the 6D pose using CNNs [7, 69], hence requiring a training phase that needs large amounts of labelled data but allows CNNs to produce significant improvements in 3D position and rotation estimation. DL-based methods can be one-stage or two-stage, depending on whether a further step is used to refine the pose parameters, e.g. through the Perspective-n-Point (PnP) algorithm (Fig. 4) [73]. The PnP algorithm can provide worse results when correspondences are degenerate because of occlusions [27]. In general, two-stage CNNs are more accurate than single-shot ones, notably on small objects and multiple objects. Computing the cell size and the number of items occupying the same cell is challenging in single-shot object detectors. Moreover, with many objects, occlusions between them affect the precision of some single-shot methods, which employ correspondences between an object's 3D bounding box corners and their 2D projections [27].

Fig. 4 One-stage and two-stage methods. One-stage methods directly output the 6D pose from the input image by using a CNN; two-stage methods require a further step, fed with the CNN output, to refine the pose parameters

Specific strategies try to address the lack of training data by using synthetic images for training or data augmentation processes, known as Domain Randomization in the context of 6DoF (Degrees of Freedom) pose estimation. For example, as described in [23], this consists of complementing the data with semi-realistic synthetic images. To do so, the authors rendered a 3D model of the object on a real background and then applied different augmentation techniques, such as varying lighting conditions, contrast, blur, and occlusion by removing small image blocks and replacing them with monochrome patches. While Domain Randomization improves the pose estimation accuracy, its benefits on real test images remain limited, mostly because existing Domain Randomization strategies do not tackle the severe-occlusion problem, which is one of the main challenges in pose estimation. Figure 5 shows a schematic overview of these techniques.
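The augmentations described in [23] can be approximated with a few NumPy/OpenCV operations. The snippet below is only a schematic rendition: the parameter ranges are invented, a 3-channel image is assumed, and the compositing of the rendered object onto a real background is assumed to have already happened.

```python
import numpy as np
import cv2

def randomize(image, rng=None, max_patches=5):
    """Apply simple Domain-Randomization-style augmentations to a synthetic 3-channel image."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.astype(np.float32)

    # Random brightness / contrast (illustrative ranges).
    alpha = rng.uniform(0.7, 1.3)          # contrast
    beta = rng.uniform(-30, 30)            # brightness
    out = np.clip(alpha * out + beta, 0, 255).astype(np.uint8)

    # Random Gaussian blur.
    if rng.random() < 0.5:
        k = int(rng.choice([3, 5, 7]))
        out = cv2.GaussianBlur(out, (k, k), 0)

    # Occlusion: replace small random blocks with monochrome patches, as in [23].
    h, w = out.shape[:2]
    for _ in range(rng.integers(0, max_patches + 1)):
        ph, pw = rng.integers(h // 10, h // 4), rng.integers(w // 10, w // 4)
        y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
        out[y:y + ph, x:x + pw] = rng.integers(0, 256, size=3)
    return out
```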

Fig. 5 Schematic representation of learning-based methods. In a first training phase, the CNN is trained with a large set of labelled data; then the trained model can estimate the 6D position from an input image, optionally refined through a refinement step

Learning-based methods can be classified into three categories [69, 70]:

  • Bounding box prediction and PnP algorithm-based methods (Section 5.1).

  • Classification-based methods (Section 5.2).

  • Regression-based methods (Section 5.3).

Advantages:

  • They are powerful and can provide excellent results [46, 66];

  • They have high performance even if the object is partially occluded or in case of cluttered backgrounds [6].

Disadvantages:

  • They require a time-consuming training process [6, 12];

  • They are not very robust to severe occlusions because covering the space of all possible occlusions with real images is unmanageable [23, 39];

  • Their ability to generalize is still a problem in some cases [43, 51, 73].

5.1 Bounding box prediction and PnP algorithm-based methods

These approaches use a pipeline for 6D pose prediction composed of a CNN architecture for object category detection and for the prediction of the projected vertices of the object's bounding box [69]. The methods belonging to this category are two-stage, i.e. in a first stage they regress the 2D projections of the corresponding 3D keypoints of the target object in the image and then, in a second stage, calculate the actual 6D pose using the PnP algorithm [73]. All the systems described have the last stage in common, so they differ only in how they prepare the data, i.e., the 2D-3D correspondences fed as input to the PnP algorithm to obtain the final estimate. These methods require expensive manual annotations of bounding boxes [69].
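The regression target of the first stage, i.e. the 2D projections of the eight 3D bounding-box corners, can be generated from a ground-truth pose with the standard pinhole model. The sketch below (intrinsics, box extent and pose are placeholder values) shows this forward projection; at test time, the corners predicted by the CNN would replace `corners_2d` and be passed to the PnP step sketched in the Introduction.

```python
import numpy as np

def box_corners_3d(extent):
    """Eight corners of an object's 3D bounding box, centred on the object frame.
    extent = (dx, dy, dz) full side lengths."""
    dx, dy, dz = np.asarray(extent) / 2.0
    return np.array([[sx * dx, sy * dy, sz * dz]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)], dtype=np.float64)

def project(points_3d, R, t, K):
    """Pinhole projection of Nx3 object-frame points given rotation R, translation t, intrinsics K."""
    cam = points_3d @ R.T + t           # object frame -> camera frame
    uv = cam @ K.T                      # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]       # perspective divide -> Nx2 pixel coordinates

# Illustrative values only.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 0.6])
corners_2d = project(box_corners_3d((0.1, 0.1, 0.1)), R, t, K)  # training target for the CNN
```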

Rad and Lepetit [45] proposed BB8, a cascade of multiple CNNs for the object pose estimation task. A first CNN performed semantic segmentation, a second CNN predicted the projections of the eight corners of the 3D bounding box and, finally, after PnP processing, a third CNN per object refined the pose. The authors of BB8 extended their work in [37], trying to manage position estimation in case of severe occlusions: the proposed solution calculated heatmaps from small patches independently and then combined them to obtain robust predictions. Liu and He [24, 25] exploited the advantages of BB8 for regression, trying to avoid the use of the PnP algorithm in order to reduce errors and implementation cost. The former introduced a novel layer, called the Collinear Equation layer, which provided the 2D projections of the 3D bounding box corners and a new representation of the 3D rotation. The latter exploited a new algorithm, called Bounding Box Equation, to achieve accurate and efficient translation estimation. Tekin et al. [55], building on the ideas of YOLO [47] and BB8, introduced YOLO6D, a neural network with a fully convolutional architecture capable of efficient and precise object detection and pose estimation without refinement. As for BB8, the key feature here was the regression of the reprojected bounding box corners in the image. Moreover, in contrast to SSD6D [14], it did not suffer from pose discretization, resulting in much more accurate pose estimates without refinement.

Most techniques are potentially vulnerable to occlusion as they treat the object as a global entity and calculate a unique pose estimate. In contrast, Hu et al. [9] introduced a segmentation-based 6D pose estimation framework in which each visible part of the object contributed to the 2D keypoint estimation performed by a local pose predictor. To obtain a more robust and accurate result, Li et al. [21] introduced CDPN. This method treated the prediction of rotation and translation separately: it handled occluded or texture-less objects and resolved rotation using a two-stage object-level coordinate estimation and a Masked Coordinate-Confidence Loss (MCC loss), while translation was estimated directly from the image using Scale-Invariant Translation Estimation (SITE). The approach was very accurate, fast, and scalable. Another solution, named Pix2Pose, was introduced by Park et al. [39]; it predicted the 3D coordinates of each object pixel using 3D models without textures during training. The method estimated 3D coordinates and errors per pixel using an auto-encoder architecture. These pixel-wise predictions were then used in multiple stages to calculate 2D-3D matches and obtain the final pose. The method introduced a novel loss function, called transformer loss, for managing occlusions and symmetries. Zakharov et al. [66] also proposed a new system, called DPOD (Dense Pose Object Detector): an encoder-decoder network regressed the mask and the 2D-3D matches, the training phase worked with both real and synthetic data, and a refinement step, implemented via a CNN, predicted a refined pose from the coarse proposal.

To manage the problem of occlusions and the lack of labelled real images, Li et al. [23] introduced a robust 6-DoF position estimation approach which exploits a Domain Randomization (DR) strategy. The method employed a first network to locate the pixels of the object. Next, a Self-supervised Siamese Pose Network (SSPN) produced the coordinates and segmentation information.

The methods proposed in [13, 67], and [29] have the advantage of working in real time. The first was an end-to-end framework that used CNNs to obtain 2D-3D matches and worked both with texture-less objects and in case of occlusions between objects. The second exploited a client-server architecture for robots, which used YOLOv3 for object detection, keypoint detection and pose estimation. In the third, Liu et al. introduced a new network called TQ-Net: an object detection algorithm located the target and its bounding box, and this information was fed into TQ-Net to predict the translation vector T and the quaternion Q; in the end, Q was converted into a rotation matrix R. TQ-Net was easy to implement, ran in real time efficiently and accurately, and worked with all previous CNN-based object detection methods.
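The final conversion mentioned for TQ-Net, from a predicted quaternion Q to a rotation matrix R, is the standard formula below (a generic NumPy rendition, not the authors' code; the quaternion is assumed to be in (w, x, y, z) order and is re-normalized, since a regressed quaternion is rarely exactly unit-length):

```python
import numpy as np

def quaternion_to_rotation_matrix(q):
    """Convert a (w, x, y, z) quaternion to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)   # re-normalize the network output
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

# Sanity check: the identity quaternion maps to the identity rotation.
assert np.allclose(quaternion_to_rotation_matrix(np.array([1.0, 0.0, 0.0, 0.0])), np.eye(3))
```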

Finally, among the most recent methods, Yang et al. [64] introduced DSC-PoseNet to obtain the pose from 2D bounding boxes. The framework learned to segment objects from real and synthetic data, then it predicted object poses through a differentiable renderer.

Table 3 illustrates a summary of the systems described above.

5.2 Classification-based methods

These methods aim to solve the 6D position estimation as a single-shot classification problem by discretizing the pose space. They leverage CNNs to obtain a probability distribution over the pose space and associate it with the 3D model information to acquire the 3D position and rotation [69].
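As a concrete illustration of this discretization, the sketch below enumerates viewpoint bins over azimuth, elevation, and in-plane rotation so that a classifier's predicted bin index can be mapped back to an approximate rotation matrix. The bin sizes and the Euler-angle convention are arbitrary assumptions chosen for illustration, not those of any cited method.

```python
import numpy as np
from itertools import product
from scipy.spatial.transform import Rotation

# Arbitrary discretization: 24 azimuth x 9 elevation x 12 in-plane bins (2592 classes).
AZIMUTHS   = np.arange(0, 360, 15)
ELEVATIONS = np.arange(-60, 61, 15)
INPLANES   = np.arange(0, 360, 30)
POSE_BINS = list(product(AZIMUTHS, ELEVATIONS, INPLANES))   # one class label per bin

def class_to_rotation(class_index):
    """Map a classifier output (bin index) back to an approximate rotation matrix."""
    az, el, ip = POSE_BINS[class_index]
    # Assumed convention: azimuth and in-plane rotation about z, elevation about x;
    # individual works may use different axis conventions.
    return Rotation.from_euler("zxz", [az, el, ip], degrees=True).as_matrix()
```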

In SSD-6D [14], the authors extended the SSD detection framework [26] to 3D detection and 3D rotation estimation. A neural network performed object recognition from an RGB image, returning its 2D bounding box; each box was provided with a set of the most likely 6D poses for that instance. The method decomposed the 3D rotation space into discrete viewpoints and in-plane rotations, so the rotation estimation was treated as a classification problem. In this work the authors utilized both a real dataset, for the bounding box prediction, and a synthetic one, for the rotation estimation. By contrast, in [49] and [12], the authors relied only on a synthetic dataset. Su et al. [49] introduced a neural network trained by rendering synthetic 3D objects superimposed on real images; the trained neural network could then estimate the viewpoints of items in real situations. The method proposed in [12] combined the robustness of CNNs with high-resolution instance-based 3D pose estimation. The model used a modular architecture consisting of a detector and a viewpoint estimator. The output of the architecture did not directly provide a 6DoF pose: the PnP algorithm was used to combine the intrinsic parameters of the camera and the 3D model. In opposition, in [22], the authors introduced GS3D and showed that 6D estimation could be solved without the use of a synthetic dataset. A modified Faster R-CNN detector, based on a CNN called 2D + O, classified the rotation from RGB images and the 2D bounding box parameters. Then, the obtained 2D bounding box and orientation were used, together with knowledge of the guidance scenario, to generate a basic cuboid called guidance, which was then projected onto the image plane. Another CNN, called 3D Subnet, received these features to refine the guidance. In the same research field, Zou et al. [72] proposed 6D-VNet to estimate traffic participants' poses for autonomous driving applications.

Most of the approaches discussed separate the object detection phase from the pose estimation one by making them run on two separate networks. These methods require resampling the image at least three times: (1) to find region proposals, (2) for detection and (3) for pose estimation. The method proposed by Poirson et al. [44] did not require resampling of the image and used convolutions to detect the object and its pose in a single forward step. It provided an acceleration in execution time because it did not require image resampling, and the computation for detection and pose estimation was shared; the scheme employed a Single Shot Detector. Mousavian et al. [34] estimated the position and size of an object's 3D bounding box from its 2D bounding box and the surrounding pixels. This method used a detector extended to regress the orientation and size of the item by training a CNN; these predictions were combined with geometric constraints to produce the final 3D pose, estimating the translation and the 3D bounding box. Finally, the network used by Xu et al. [63] contained two parts: one for the generation of the 2D region proposal through a Region Proposal Network (RPN), and the other for the simultaneous prediction of position, orientation and dimensions of 2D objects and 3D poses.

An overview of Classification-based approaches is shown in Table 4.

5.3 Regression-based methods

These systems solve 6D pose as a regression problem and use CNNs to estimate the position [69]. They directly regress the 6D pose parameters of the target object from the input image [73]. Usually, there is a preliminary stage of object detection to simplify the position estimation process [40]. These systems belong to the category of one-stage methods, i.e. they design a neural network which receives an input image for training and solves the problem by learning the 3D rotation and translation of the object represented in it [73]. PoseCNN [62] is one of the current top performers for this task on RGB images. Xiang et al. designed a fully convolutional network composed of two stages to jointly segment objects, estimate the rotation, and estimate the distance from the camera. The first stage extracted and integrated feature maps with different resolutions from the input image; the network output semantic labels, the 3D translation, and the 3D rotation. PoseCNN did not address input images containing multiple instances of the same object and might require further refinement steps to improve the accuracy. The method proposed in [30] focused only on the rotation between the object and the camera using a modified version of the VGG-M network; the pipeline contained a feature network and a pose network.
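A minimal PyTorch-style sketch of the direct-regression idea follows. This is a generic illustration rather than PoseCNN or any specific cited architecture: the feature dimension and loss weighting are placeholders, and the backbone producing the features is omitted. The head regresses a unit quaternion and a translation, and the loss combines a translation error with a quaternion distance that is insensitive to the q/-q sign ambiguity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseRegressionHead(nn.Module):
    """Regress a unit quaternion and a 3D translation from a feature vector."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.fc_rot = nn.Linear(feature_dim, 4)    # quaternion (w, x, y, z)
        self.fc_trans = nn.Linear(feature_dim, 3)  # translation (x, y, z)

    def forward(self, features):
        q = F.normalize(self.fc_rot(features), dim=-1)  # enforce unit norm
        t = self.fc_trans(features)
        return q, t

def pose_loss(q_pred, t_pred, q_gt, t_gt, beta=1.0):
    """Translation L1 loss + quaternion distance (invariant to the sign ambiguity q == -q)."""
    rot_loss = 1.0 - torch.abs(torch.sum(q_pred * q_gt, dim=-1))
    trans_loss = F.l1_loss(t_pred, t_gt, reduction="none").sum(dim=-1)
    return (rot_loss + beta * trans_loss).mean()

# Usage with features from any CNN backbone (e.g. a globally pooled feature map):
features = torch.randn(8, 512)                        # batch of 8 hypothetical feature vectors
q_gt = F.normalize(torch.randn(8, 4), dim=-1)         # ground-truth quaternions
t_gt = torch.randn(8, 3)                              # ground-truth translations
head = PoseRegressionHead()
q, t = head(features)
loss = pose_loss(q, t, q_gt, t_gt)
loss.backward()
```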

The methods described below, although less prominent than the previous ones, have their own relevance in research. Do et al. [7] introduced Deep-6DPose, an end-to-end deep learning pipeline consisting of an RPN to derive the Regions of Interest and a Mask R-CNN; it decoupled the pose parameters into translation and rotation. For autonomous driving applications, Hara et al. [8] considered objects seen approximately sideways in the center of an image. The authors proposed three approaches for estimating the rotation: the first two differed only in the loss function, represented angles as points on a unit circle and trained a regression function; the third approach employed a discretization process. For the same research field, Ku et al. [17] introduced MonoPSR, a method for 3D object detection that used proposals and shape reconstruction. Rambach et al. [46] trained a CNN to directly regress the object 6D pose using only single-channel synthetic images with improved edges, obtained from rendering the 3D object. It used a modified version of the PoseNet architecture [15] with a new loss function to facilitate the training process. In contrast to other CNN-based approaches for pose estimation, which require a large amount of data to be trained, in [61] training was done only with synthetic pose data and then extended to real data. The process consisted of two cascading components: a segmentation network (DRN: Dilated Residual Network) that generated the segmentation masks, and a pose interpreter network; the image and the segmentation result were the inputs of the pose interpreter network.

In this last part, the most recent and less known methods are described. Hu et al. [10] assumed the objects were rigid and that their 3D model was available; the proposed network directly regressed the pose from groups of 2D-3D correspondences associated with each keypoint. The system used three main modules to infer the pose: a local feature extractor, a feature aggregation module, and a global inference module. The CNN proposed in [59] computed both the mask and the 6D pose. The system was divided into two distinct networks, a segmentation network and a pose estimation network, to overcome the effects caused by the lack of training data. Liu et al. [28] used rendered binary images in the training phase to generate triplets. The triplets were fed to a triplet network to capture the features, while the poses served as reference information; the regression network provided the final pose. Capellen et al. [2] introduced ConvPoseCNN, an architecture derived from PoseCNN [62], described above. At first, a VGG16 convolutional backbone extracted the features. The system first performed pixel-wise semantic segmentation through a fully convolutional branch; then a fully convolutional vertex branch estimated the central direction and depth. The results of these two branches located the centers of the objects and their bounding boxes. A fully convolutional architecture, like the other two branches, replaced the PoseCNN quaternion estimation branch to estimate quaternions for each pixel. Wang et al. [60] proposed the Geometry-Guided Direct Regression Network (GDR-Net) to unify direct and geometry-based indirect methods. The system first detected all the objects and, for each detection, zoomed in to the corresponding Region of Interest (RoI). Each RoI was fed to the network to predict several intermediate geometric feature maps; then the Patch-PnP algorithm directly regressed the 6D object pose from Dense Correspondences and Surface Region Attention. Trabelsi et al. [57] introduced an end-to-end 6D object pose estimation method, made of a pose proposal module and a pose refinement module: the former output an object classification and an initial pose estimate, while the latter embodied a differentiable renderer and an iterative refiner called MARN. Hu et al. [11] involved a Feature Pyramid Network for multi-scale 6D pose regression of space objects. Finally, Su et al. [50] introduced SynPo-Net, a CNN trained exclusively with synthetic images, which tried to improve pose estimation accuracy by replacing pooling layers with convolutional layers.

Regression-based techniques are summarized in Table 5.

6 Discussion and conclusion

6D pose estimation of an object from a single RGB image is a central issue in the computer vision community, especially after the introduction of deep learning solutions, which sped up the diffusion of new applications. This review analyzed the most recent and relevant methods available in the literature and classified them according to the procedure adopted, in order to define a series of guidelines related to this problem. To summarize the methods, Tables 1, 2, 3, 4 and 5 show the main feature values of every study. The tables contain some general features of each article, such as the publication year, the journal or conference, the number of citations, the highlights, and the research field. Furthermore, the system input, the neural network, and the pre-processing and refinement methods, if used, have been specified. Finally, the tables indicate the dataset and its size, the accuracy, whether the method is one-stage or two-stage, and whether it works at the instance or category level. These parameters, as specified in Section 2, have been chosen as they highlight the main features of each group of methods and the differences among the different classes. For this reason, they can be used as guidelines to choose the correct approach for a new specific task, relying on previous works.

Feature-based methods (Table 1) could be the appropriate solution if the target object has a recognizable shape, but the keypoints must be accurately chosen, and a refinement step is often required [4, 41, 43, 65, 68,69,70,71]. Even though the refinement step of two-stage methods is time-consuming, some of them can run in real-time [43, 65, 71] or near real-time [41]. These systems have been evaluated on LINEMOD dataset [43, 65, 68,69,70,71], PASCAL3D+ dataset [18, 41], KITTI dataset [3] or custom datasets [4].

Template-based methods (Table 2) can reach high accuracy values with an exhaustive template database [35, 58], but the matching process is time-consuming, so they rarely work in real time [35, 51]. Except for one study [42], which is also the only category-level method, all the approaches belonging to this category require a 3D model of the object in addition to the image as input. Moreover, most systems create a custom dataset and do not require pose refinement; on the contrary, in [16, 58], the methodology needs a further refinement step to estimate the pose. Finally, only three systems [5, 31, 51] involve CNNs to solve the problem.

In recent years, research has focused on Learning-based methods, which allow the training of a classification-based or a regression-based neural network tailored for a specific task. For Learning-based methods, three different tables have been created, one for each subcategory described in the previous sections. Apart from the methods in [24, 25], Bounding box prediction and PnP algorithm-based methods (Table 3) are two-stage and calculate the actual 6D pose using the PnP algorithm. Only in [64] does the system need a supplementary pre-processing step, which detects the 2D bounding box. Despite the time-consuming multi-stage pipeline, some approaches can work in real time [9, 13, 21, 29, 66, 67]. Most of the studies have been evaluated on the LINEMOD dataset and its variation, named Occluded LINEMOD; some of them, instead, exploit the T-LESS dataset [39, 45], the YCB-Video dataset [9, 37], the ACCV dataset [29], or a custom dataset [64].

Classification-based methods (Table 4) are one-stage, and they rarely need a pre-processing [14] or a pose refinement step [14, 22]. The work described in [44] is the only category-based approach, and it can work in real-time. These systems have been mainly tested on PASCAL3D+ Dataset [34, 44, 49, 72] and KITTI Dataset [22, 34, 63].

The most known and performant systems belong to the Regression-based methods (Table 5), which do not require either pre-processing or post-processing steps, as they directly regress the 6D position through a single-stage pipeline.

The training process of Learning-based methods is time-consuming and requires computational power. For this reason, some Regression-based systems propose end-to-end trainable networks to simplify the process and obtain real-time working methods [55, 60, 61].

Starting from these values, the most remarkable methods were analyzed and classified, deriving their main characteristics, strengths, and weaknesses. Therefore, it has been possible to put together all the pros found in the different articles and define what the correct approach should be for the 6D position estimation of an object from a single RGB image, one that can work even under boundary conditions, namely auto-occlusions, symmetries, occlusions between multiple objects, and bad lighting conditions. To summarize the findings of this work, Table 6, based on the technical prerequisites and boundary conditions of a potential future application, provides guidelines on which category among those described in Sections 3, 4 and 5 is most appropriate to meet the required needs. It can be inferred that algorithms belonging to Feature-based and Template-based methods are outdated and can only be used as support for learning algorithms, i.e. as single steps of a larger system based on neural networks. They should be exploited only when a large dataset is not available or the computational power is not enough to train a neural network; in these cases, the solution is to turn to classical geometric algorithms. On the contrary, learning algorithms are achieving better and better results, often employing artificial datasets and thus avoiding the expensive data retrieval phase. In terms of real-time speed, accuracy, and pipeline complexity, Regression-based approaches are the best performing but, at the same time, more specific, whereas the other two groups can provide a more generic implementation scheme. Furthermore, the network's ability to generalize is still a challenge in some cases. This limit leads research to move towards new efficient training databases and new techniques for automatic labelling to obtain increasingly accurate solutions.

Table 6 The table provides a summary of the results that emerged from this work, indicating for each feature the most appropriate method among Feature-based, Template-based, and Learning-based