1 Introduction

See [5] for video demonstration of this work.

Accurate 6D pose estimation of rigid objects from color images is a paramount challenge with diverse applications spanning robotics and close-contact aircraft operations. This study explores the techniques to apply the YOLOv5 object detection CNN and Solve Perspective-N-Point (Solve-PnP) for the precise localization of aircraft 6D poses using color imagery. Conventional methods for object detection labeling encounter inaccuracies stemming from perspective geometry and are typically restrained to visible key points. This research demonstrated that with meticulous labeling, a CNN can achieve near-pixel accuracy in predicting object features, effectively grasping the distinct appearance of objects distorted by a camera’s perspective. Additionally, the pivotal role of understanding occluded features in the training process is emphasized. While the inclusion of occluded features in the training slightly reduces the pixel precision, it leads to a substantial increase in the number of predicted features, including those initially hidden from view. Ultimately, this holistic approach results in an overall improvement in 6D pose estimation.

To underscore the necessity and significance of this approach, close-contact aircraft maneuvers, specifically aerial refueling, are examined. The cost, risk, and necessity associated with aerial refueling have previously limited its adoption to military operations. However, if the process can be reliably and cheaply automated, then there is the potential to democratize this technology by making it accessible beyond military circles. Doing so would realize the economic advantages highlighted by Nangia [29] and Parry [33], who show that lightweight civilian aircraft could perform longer range flights with enhanced efficiency thanks to the prospect of automated aerial refueling (AAR). Past attempts at AAR relied on differential Global Positioning System (GPS) and custom data-links to conduct their demonstrations [14, 44], making the cost untenable and failing to meet the military’s need to operate in signal-degraded environments.

Close-contact aircraft operations like aerial refueling require robust solutions in scenarios where human controllers, global positioning system data, and distance-based sensors may be unreliable or unavailable. Imagery-based sensors emerge as a potential solution, but their efficacy hinges on their ability both to serve as the primary source of information and to be processed onboard. This paper identifies the necessary techniques for labeling and training YOLOv5, a CNN, to locate aircraft components with the precision required for the Solve-PnP algorithm. The objective is to achieve an aircraft’s pose prediction accuracy within 7 cm and 1\(^\circ\) of error. With these techniques in place, automated aircraft can extend their autonomous range, thereby reducing the need for human operators in high-risk missions and enhancing overall flight safety.

Beyond close-contact aircraft operations, this work addresses a fundamental challenge in the deployment of artificial intelligence (AI) on autonomous aircraft: the need for airworthiness certification. The existing certification processes assume deterministic system behaviors with humans in the loop, and a clear path for certifying algorithms designed for autonomous flight is currently lacking [18]. Such a certification will be required to operationalize algorithms such as those designed in this article. Certifications prescribe the conditions under which algorithms can operate, such as proximity to population centers, times of day, aircraft compatibility, range between aircraft, presence of precipitation, and potentially many others. This process can be easier if a system’s components are modular, understandable, individually testable, and individually certifiable. These are some of the reasons why object detection is used for the approach in this paper, instead of relying on a monolithic neural network that provides the 6D localization of an observed object. However, typical object detection still presents some issues for this application.

In typical object detection, a classified bounding box is drawn over the visible portion of an object in an image. This causes a mismatch between the bounding box’s 2D center and the projection of the 3D geometric center of the item being observed, resulting in inaccurate 6D pose localization when these mismatched bounding boxes are used. There are two reasons for this mismatch. First, because only the visible portion of an object is labeled, the center point is not guaranteed to match, and hence training a neural network on such labels would lead to imprecise 6D localization. Second, perspective distortion introduces disparity between the 2D center of a bounding box and the 3D geometric center of an object when projected onto a 2D image plane, a concept elaborated in the Background section. The broader question that emerges is whether a CNN, such as YOLOv5, can predict bounding boxes that correct for perspective distortion and occlusions while maintaining low pixel error.

To show the necessity of both correcting for perspective distortion and training with the knowledge of occluded components, an ablation study was conducted. This study serves to underscore the importance of these techniques in achieving accurate 6D pose estimation. Additionally, the impact of various data augmentation parameters was examined, focusing particularly on those that may hinder the network’s ability to learn perspective distortion correction. There is an analysis in the Background regarding which parameters may interfere with learning perspective distortion, with detailed equations in Appendix 2. To further explore the intuition of what the neural network is learning, a final experiment took images and tiled them onto a larger image in different positions. This allowed for an accuracy and 2D error comparison between the various positions within an image, as well as against the original image, in order to explore how the CNN was learning to correct for perspective distortion.

In summary, this paper showcases the capability of the YOLOv5 CNN to effectively correct perspective distortion and predict occluded features. This investigation highlights the proper intuition for perspective distortion and demonstrates a method for correcting such distortion. This research shows the necessity for each of the steps in the process to attain an error of less than \(1^\circ\) and 7 cm at 7.5–35 m from the camera. This is significant because it demonstrates that the system is both accurate and fast enough to potentially be used on autonomous aircraft or for aerial refueling in the civilian sector. Additionally, this method can be applied to 6D localization of objects with occlusions, making it useful for robotics. It could also be applied for use with even more accurate and faster object detection neural networks which have yet to be invented. A video showing these findings is available at [5].

The contributions of the paper are:

  1. A challenge to current best practice, augmenting scale during CNN training, via experiments showing degradation of pixel precision.
  2. A novel set of mathematical operations, exploring how perspective geometry causes distortion on projection of 3D components to 2D bounding boxes.
  3. Experimental results demonstrating that a CNN-Solve-PnP system is performant when trained with knowledge of occluded components and equivariant to translation.
  4. Experimental results showing ability of a CNN to learn distinct appearance caused by perspective geometry.
  5. An application-specific trade-off where 6D performance is enhanced via an increase in true positives from occluded components, despite a slight decrease in 2D precision.

2 Background

2.1 Aerial refueling use case

Aerial refueling has been a crucial capability for military and civilian aircraft for over 90 years, enabling extended flight durations and acting as a true force multiplier. The history of aerial refueling dates back to the mid-1920s when the first successful aerial refueling was performed between two Airco DH-4B biplanes [27]. Past attempts in 2006 and 2015 with Defense Advanced Research Projects Agency, National Aeronautics and Space Administration, and Northrop Grumman have succeeded in demonstrations of AAR systems [14, 44], but these relied on expensive custom sensor and communication suites between the receiver and tanker aircraft.

The two main aerial refueling systems in use are the United States Air Force (USAF) boom and the Navy (also NATO) probe and drogue system. With the USAF system, stereo vision cameras are used by a boom operator to look backwards towards a receiver aircraft and move the boom into the proper position. Once the receiver is in the correct position, the boom operator extends the boom into a receptacle on the receiver aircraft. The Navy system employs a flexible hose that trails from the tanker with a drogue, also referred to as a basket, behind it into which a receiver aircraft must guide its probe.

In order to operate in GPS denied environments, receivers and tankers need sensors and the ability to process the sensor data. Tankers already tend to have cameras facing backward, so if those cameras can be used, modifications to the aircraft could be minimal. Modifying receivers could be kept to a minimum if only upward facing cameras are needed and local processing could be used. Hence, the approach proposed here focuses on implementing 6D localization with only images and knowledge of the aircraft being observed; this paper focused on a tanker camera facing a receiver.

Inexpensively adding sensors and creating reliable enough autonomy could extend AAR to the civilian community [29, 33]. Parry and Hubbard focused on using a vision based approach exclusively for finding the refueling basket immediately before contact, while relying on GPS for all other aspects of the refueling process [33]. This is due to certification concerns with neural networks.

2.2 Center point correction - perspective geometry

One of the key features to this process is the proper training of the CNN by correcting for a mismatch in the 2D bounding box center and the projection of a component’s 3D geometric center to the image plane. This mismatch occurs because the length and proportions between projected points on a flat image sensor are not invariant to projective transformations. This is often called projective distortion, but typically that term is not used to emphasize the effects of the image sensor being flat.

Perspective distortion on a flat image sensor is due to three main factors: 1) the distance from the camera, 2) the angle of the object away from the principal axis, and 3) the rotation of the object itself. Perspective distortion ultimately occurs because one part of an object is closer to the camera than another part. The amount of distortion depends on the ratio between the distance to the object and the amount by which one part is closer than another. The key nuance is that perspective changes the proportion of mismatch, and hence the distortion correction required, based on these criteria, not just the number of pixels by which an item may be distorted.
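The effect of distance and rotation can be illustrated with a small numerical sketch. It is not taken from the paper's Appendix 2; it assumes an ideal pinhole camera, a flat segment centered on the principal axis, and a hypothetical focal length of 600 pixels. The function names are illustrative only.

```python
import math


def project_x(x_m, z_m, focal_px):
    """Pinhole projection of a point at lateral offset x_m and depth z_m to pixels."""
    return focal_px * x_m / z_m


def center_mismatch(distance_m, half_length_m, yaw_deg, focal_px=600.0):
    """Pixel gap between the 2D extent's center and the projected 3D center.

    A flat segment of half-length half_length_m is centered on the principal
    axis at distance_m and yawed by yaw_deg about the vertical axis.
    focal_px is an assumed value, not one reported in the paper.
    """
    yaw = math.radians(yaw_deg)
    dx = half_length_m * math.cos(yaw)  # lateral extent of each half
    dz = half_length_m * math.sin(yaw)  # how much nearer or farther each end is
    near_end = project_x(+dx, distance_m - dz, focal_px)
    far_end = project_x(-dx, distance_m + dz, focal_px)
    box_center = 0.5 * (near_end + far_end)
    true_center = project_x(0.0, distance_m, focal_px)
    return box_center - true_center


if __name__ == "__main__":
    for d in (7.5, 15.0, 35.0):
        print(d, round(center_mismatch(d, 2.0, 30.0), 2))
```

Under these assumptions the mismatch vanishes when the segment is parallel to the image plane and shrinks as the segment moves away from the camera, which matches the general rules described in this section.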

Each of these criteria inspires some general rules for perspective distortion and the mismatch it causes in the bounding box center points. First, the further an object is from the camera, the lower the proportion of distortion; the closer it is, the greater the proportion of distortion, as observed in Figs. 16 and 17 and Table 10. Second, the distortion is minimized to zero when the object itself is rotated such that it is parallel to the image plane. Minimization also occurs when the angle of an object’s rotation is complementary to the angle between the principal axis and the ray from the optical center to the object’s center. The rotation effect has local maxima for the error at approximately the bisector of the two angles. This is stated as approximate due to the complex nature of the equation and the nonlinear effects of distortion; however, the effects of rotation are still useful as a general rule to develop intuition. These effects are detailed via equations in Tables 12, 13, and 14 in Appendix 2.

Two general observations about the angle between an object and the principal axis are useful here. First, as a camera changes its orientation, the object’s angle relative to the image plane also changes, causing different distortions. This particular observation is difficult for humans to intuit because the retina is curved; thus, when an individual changes their eye’s orientation, the item does not change in appearance at all. Second, this angle affects where the minimum and maximum values occur for the angle at which an object itself is rotated relative to the image plane. Additional examples, figures, equations, and tables of proportion changes are provided in Appendix 2 to assist readers with developing their own intuitions.

These behaviors of perspective distortion are critical to designing experiments that test how a CNN may learn to correct the distortion and how best to teach a neural network. For the aerial refueling use case, the observed object is typically similar in orientation to the camera, with only slight perturbations. It is possible that a network may learn to correlate the pixel position in an image to the amount of correction needed. If this is the learned method, then clipping an image could lose vital information about the position of the object in the image. Further examination revealed this was likely incorrect due to the amount of random rotation of the observed object in the testing data. Compared to a camera with a 56\(^\circ\) field of view, a similar amount of distortion could occur for 25% of the image due to random rotation of the observed object. This leads to another insight: perspective distortion gives observed objects a distinct appearance depending on their 3D location and orientation; hence, the correction required could be calculated and learned based on this distinct appearance regardless of location in an image. This motivates the experiment in the Latent Space Perspective Distortion Experiment section, which tests the pixel precision of the network with respect to the position in an image.

This analysis of perspective geometry and how it creates distortion also provides insights into the relationship between data augmentation parameters used during training and their potential effectiveness. Earlier it was noted that the distance of an item from the camera can affect the distortion. Hence, the size of an object could indicate the amount of correction required. Much of the existing research [3, 6, 8, 9, 12, 13, 15, 16, 28, 40, 45, 46, 53, 58, 59] on object detectors improves results by varying the scale during training, which helps generalization across large data sets, but could also hinder pixel precision by confusing the neural network during training. There is also an apparent correlation to the position in the image, leading to a potential for mosaic and translation as data augmentation techniques to affect the results. Mosaic is a technique where multiple images are combined into a single training sample, and translation involves shifting the object within the image to different positions. Both techniques are typically utilized to create more diverse training data. This led to the experiment detailed in the Training Data Augmentation Optimization section, which tests networks trained with different training-time data augmentation.

3 Related work

With many tankers having stereo vision on them already, a previous approach by Anderson et al. was to utilize these cameras for the 6D pose estimate. Here, the stereo block matching algorithm created 3D point clouds that would be compared to a truth point cloud via the iterative closest point method to gain a pose estimate [1, 2]. While this was able to meet the error threshold, stereo block matching is restricted to stereo vision systems, requires extremely precise calibration, and requires extensive filtering of noise and occlusions to be effective. Modern approaches have shown that it is possible to use neural networks with Solve-PnP on monocular color imagery to meet the error threshold required [26]; the experiments here extend this work.

3.1 Key point vs object detection

Typical approaches which use Solve-PnP involve using key point finders with known 2D-3D correspondences. Many of the simpler key point finders like Scale-Invariant-Feature-Transform or Speeded-Up-Robust-Features will find points and create descriptions with their own methods, which can really only be matched to other 2D points found in similar images with those same methods, lacking a bridge to making 3D correspondences reliably. Some more modern approaches have tried to solve this blind Solve-PnP problem by using neural networks to create descriptors and create these correspondences [23, 43]. Without correspondences, the task of using Solve-PnP becomes an optimization problem to search for the best set of correspondences [42], which can be slow and unreliable.

Similarly, CNNs have been shown to be able to identify key points with their own descriptors that are more resilient to image changes than the legacy methods and have extremely low latency [10, 11, 21, 25]. However, all of these key point identification methods thus far hide information inside their respective algorithms with custom descriptors for features that are found, which cause difficulty if used for 2D-3D correspondences. These custom descriptors are less explainable and suboptimal for usage with 6D localization. Consequently, generic key point finders and descriptors were not employed.

Other CNNs solve the correspondence issue by using landmark detection [15, 24, 54] and object detection. Landmark detection is often also called key point detection, without distinction from the earlier methods. Landmark and object detectors are able to both classify and provide 2D localization of individual features. In landmark detection, a neural network is trained to find pixel locations of generically classified features in an image, for example a person’s knee or nose. At the time of writing, landmark detectors did not exhibit comparable execution speed, scalability, or the level of adoption achieved by leading object detectors. Nevertheless, landmark detectors present a promising alternative to object detection for 6D localization, and recent advancements in object detection CNN design could certainly be leveraged for landmark detection.

You Only Look Once (YOLO) [37] is a family of algorithms originally presented by Joseph Redmon at the Conference on Computer Vision and Pattern Recognition in 2016. Redmon released further versions [38, 39] over the following two years, after which development was continued by others in the computer vision community, with many more iterations addressing modernizations in neural networks such as skip connections, cross-stage partial connections, inception modules, and different training techniques.

In general, the YOLO family of CNNs has a pre-trained classification network as its backbone. The neck, comprising the middle layers of the network, utilizes a feature pyramid, spatial pyramid pooling, and path aggregation networks to identify features at various scales as well as feed multiple output convolutional nodes to produce multiple bounding boxes for a given image. After the backbone is pre-trained, the neck is attached and the entire network is further trained on imagery which has bounding boxes labeled. For the work in this paper, YOLOv5 [19, 57] was chosen. This was due to its ease of adaptability, flexibility in precision vs speed, and precision achievable at speed.

3.2 Other pose detectors

In recent years, there has been a surge in other approaches that use CNNs for 6D pose localization. Kehl [20] developed a single shot 6D pose estimator which works in two stages: first, creating bounding boxes around objects to clip them into their own images at a specific size, then running other neural networks against that, which are essentially trained as classifiers based on the viewpoint. Xiang [52] developed a single neural network that is able to output the segmentation, translation, and orientation of objects by being trained on that data.

Pix2Pose [31] takes an innovative approach: a custom coloring is applied to each object in an image, where that coloring is associated with the 3D coordinate of the skin of the object, hence creating unique correspondences between color and positions on that object. Networks are then trained to create images with that coloring, and those correspondences are sent to Solve-PnP to estimate the 6D localization. Finally, another neural network, BB8 [36], is trained to identify 3D bounding boxes for items, overlay them onto an image, and use the 8 corners of the resulting 3D bounding box as inputs to Solve-PnP to get a final 6D localization. The methods in [4, 7, 32, 34, 35, 47, 55, 56] are very similar to these approaches with slight variations, all essentially performing direct pose regression from a CNN. There are many other examples which use combinations of these approaches and add in depth data [17, 22, 41, 48, 49, 50, 51]; however, depth data is outside the scope of the desired solution here.

Each of these methods shows great promise using neural networks, but they suffer from a similar issue: the learned features are concealed inside the networks, making them difficult for humans to understand. This could lead to difficulty certifying or testing them. The difficulty is increased for monolithic systems like these, which lack modularity: if a change occurs, then the entire system would need to be re-tested and re-certified, as opposed to a single module. These are some of the reasons our approach utilizes an object detector with Solve-PnP, as they can be made small and fast with the ability to identify components in human readable ways, and each step of the process can be tested independently and as a whole.

4 Methodology

4.1 Overall process

In order to conduct an ablation study on the importance of each step in the system, one must first understand the overall system designed. The system utilizes a trained CNN, YOLOv5, to draw bounding boxes around aircraft components in an image. The CNN provides a classification used later for 2D-3D correspondences and the 2D center point of the bounding boxes. If at least four components are found, then those correspondences and center points are utilized by Solve-PnP to calculate a pose for the aircraft. The ablation study is conducted by training multiple neural networks from the same set of images; different labeling techniques are used to train the multiple neural networks.
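The hand-off from detector to Solve-PnP described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the component names are hypothetical, the detector output format is assumed, and the EPnP solver flag is an assumption because the paper does not state which Solve-PnP variant was used. OpenCV and NumPy are imported only when a pose is actually solved.

```python
def build_correspondences(detections, model_centers):
    """Pair each detected component's 2D box center with its known 3D center.

    detections: iterable of (class_name, center_x_px, center_y_px)
    model_centers: dict mapping class_name -> (X, Y, Z) in the model frame
    Returns (object_points, image_points) as parallel lists.
    """
    object_points, image_points = [], []
    for name, cx, cy in detections:
        if name in model_centers:  # the class label provides the 2D-3D link
            object_points.append(model_centers[name])
            image_points.append((cx, cy))
    return object_points, image_points


def estimate_pose(detections, model_centers, camera_matrix, min_points=4):
    """Return (rvec, tvec) from Solve-PnP, or None if too few components are found."""
    object_points, image_points = build_correspondences(detections, model_centers)
    if len(object_points) < min_points:
        return None  # at least four components are required
    import cv2
    import numpy as np

    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(object_points, dtype=np.float64),
        np.asarray(image_points, dtype=np.float64),
        np.asarray(camera_matrix, dtype=np.float64),
        None,  # no lens distortion: a perfect calibration is assumed
        flags=cv2.SOLVEPNP_EPNP,  # assumed solver choice, not from the paper
    )
    return (rvec, tvec) if ok else None
```

Keeping the correspondence step separate from the solver reflects the modularity argument made earlier: each stage can be inspected and tested on its own.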

In the ablation study, the labeling techniques can be grouped into two main categories: visibility and perspective correction. For visibility, neural networks are trained from labels created either from only visible points or both visible and occluded points using “x-ray” vision, which was conceptually created via simulation and perfect truth data. For perspective correction, neural networks are trained either from labels created from data which corrects the perspective distortion in the bounding box centers or from data which does not. The metrics used to compare these neural networks are the average center point error (in pixels) for each bounding box in an image, the true and false positives a network identifies, and the position and orientation of the 3D model prediction.
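The 2D metrics can be sketched as below. The matching rule is an assumption: predictions and truth are keyed by component name, a true positive is a predicted component that also appears in the truth labels, and a false positive is one that does not. The paper may define these counts differently.

```python
import math


def mean_center_error(predicted, truth):
    """Average Euclidean pixel distance between predicted and true box centers.

    predicted, truth: dicts mapping component name -> (x_px, y_px).
    Only components present in both are scored. Returns None if none match.
    """
    shared = predicted.keys() & truth.keys()
    if not shared:
        return None
    return sum(math.dist(predicted[k], truth[k]) for k in shared) / len(shared)


def count_detections(predicted, truth):
    """Return (true_positives, false_positives) under the assumed matching rule."""
    true_positives = len(predicted.keys() & truth.keys())
    false_positives = len(predicted.keys() - truth.keys())
    return true_positives, false_positives
```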

4.2 Data generation

To train a CNN and assess the system, labeled imagery must be generated. This data consists of images, a label truth file for each image, and a pose truth file for each data set. This data is generated with AftrBurner [1, 26, 30], a 3D graphics engine maintained by a team at the Air Force Institute of Technology. AftrBurner imports a “.stl” file defining an aircraft model, renders that model, imports point clouds defining 3D components of the previously imported model file, saves rendered scenes as images, and saves the associated truth files. All aircraft model files used were publicly available files purchased and customized for this application; no official Air Force information or files were utilized. The images are sized at 640x480 pixels, with 56\(^\circ\) horizontal field of view, and a perfect camera calibration is assumed.
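For reference, the stated image size and field of view determine the intrinsic matrix of an ideal pinhole camera. The sketch below assumes square pixels and a principal point at the image center; neither is stated explicitly in the text, although both are consistent with a perfect calibration of a rendered image.

```python
import math


def pinhole_intrinsics(width_px=640, height_px=480, hfov_deg=56.0):
    """Build a 3x3 intrinsic matrix from image size and horizontal field of view."""
    fx = (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    fy = fx  # assumption: square pixels
    cx = width_px / 2.0  # assumption: principal point at the image center
    cy = height_px / 2.0
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    for row in pinhole_intrinsics():
        print(row)
```

With these values the focal length comes out to roughly 602 pixels.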

To create 2D bounding boxes, the 3D model required a file defining the components on the aircraft. This was accomplished by creating point clouds on the aircraft by identifying the points on the skin of the 3D model in the model’s coordinate system. Each of these point clouds is named accordingly with the component they resemble on the aircraft; 50 were created. These 3D model points defining the components are projected to the image in order to create bounding boxes for each component. At this point, occlusion checks are able to be made or omitted. Additionally, the perspective distortion which causes the center point from the 2D bounding box to misalign with the 3D component’s geometric center point projection can be corrected during this process.

See Figs. 1 and 2 for an example point cloud of dots, with the red, or off-color, dot indicating the 3D component’s correct center point projection on the image and a small green crosshair indicating the uncorrected bounding box center. Figure 3 includes an example of expanding the bounding box to correct for the perspective distortion. This example had approximately 9 pixels between misaligned and corrected bounding box centers.

Fig. 1: Non-corrected bounding box and component point cloud

Fig. 2: Example bounding box without correction

Fig. 3: Example bounding box with correction

The occlusion data allow filtering of point clouds so that bounding boxes are generated from either the visible portion or the entire component. A component is considered visible only if at least 25% of its points are visible and its center point is visible. The process for the center point correction utilized in this study was to project the 3D geometric center point, \(True_x\) and \(True_y\), to the screen and then expand the bounding box about this corrected center point by increasing its width and height. The process is shown in Algorithm 1.

[Algorithm 1]
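Algorithm 1 itself is not reproduced in this text, so the following is only a minimal sketch of one way the described steps could work. It assumes the box is grown symmetrically about the projected center (\(True_x\), \(True_y\)) until it still covers the original extent; the exact expansion rule used by the authors may differ. The function names are illustrative.

```python
def bounding_box(points_2d):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around projected points."""
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]
    return min(xs), min(ys), max(xs), max(ys)


def correct_center(box, true_x, true_y):
    """Expand box so that its center is (true_x, true_y).

    The half-width and half-height are set to the larger distance from the
    true center to either side, so the original box stays fully covered.
    """
    x_min, y_min, x_max, y_max = box
    half_w = max(true_x - x_min, x_max - true_x)
    half_h = max(true_y - y_min, y_max - true_y)
    return true_x - half_w, true_y - half_h, true_x + half_w, true_y + half_h


def is_visible(point_visible_flags, center_visible, min_fraction=0.25):
    """Visibility rule from the text: enough visible points and a visible center."""
    if not point_visible_flags or not center_visible:
        return False
    visible = sum(1 for flag in point_visible_flags if flag)
    return visible / len(point_visible_flags) >= min_fraction
```

Under this rule the corrected box is never smaller than the uncorrected one, which is consistent with the expansion shown in Fig. 3.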

The result of these labeling strategies is 4 label files being generated for each image, with the resulting trained networks being named with the following convention:

  1. Vs_CC - only visible points with center point correction
  2. Vs_Nc - only visible points with NO center point correction
  3. Xr_CC - All points regardless of occlusions with center point correction
  4. Xr_Nc - All points regardless of occlusions with NO center point correction

To robustly train the neural network, various attributes were randomized within the AftrBurner engine to generate the data set. The orientation of the receiving aircraft was randomly rotated up to 5\(^\circ\) in roll, pitch, and yaw. The position was randomly set between 7.5 and 35 m from the tanker camera in such a way as to ensure the center of the aircraft was in the viewing frustum of the camera. Images without aircraft were used 5% of the time. The background sky box, the orientation of the camera versus the ground, and the lighting position were also randomly set for each image. Finally, the position of the refueling boom was randomly set in each image, causing potential occlusions of the aircraft in the image. Examples of these images are seen in Fig. 4.
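The randomization of the receiver pose can be sketched as below. This is not AftrBurner code. It assumes uniform sampling, rotations drawn from plus or minus 5 degrees, the stated distance interpreted as depth along the camera axis, and a 4:3 image with square pixels for the frustum check. Sky box, lighting, camera attitude, and boom position are omitted.

```python
import math
import random


def sample_scene(rng, hfov_deg=56.0, aspect=640 / 480):
    """Sample one receiver pose in the camera frame, or None for an empty image."""
    if rng.random() < 0.05:
        return None  # image without an aircraft, used 5% of the time
    depth = rng.uniform(7.5, 35.0)  # metres along the camera axis (assumed)
    half_w = depth * math.tan(math.radians(hfov_deg) / 2.0)
    half_h = half_w / aspect  # vertical half-extent of the frustum at this depth
    return {
        "position_m": (
            rng.uniform(-half_w, half_w),  # keeps the center inside the frustum
            rng.uniform(-half_h, half_h),
            depth,
        ),
        "roll_pitch_yaw_deg": tuple(rng.uniform(-5.0, 5.0) for _ in range(3)),
    }


if __name__ == "__main__":
    print(sample_scene(random.Random(0)))
```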

Fig. 4: Variability in training data for CNN: synthetic image examples

4.3 Neural network training

The main experiment utilized 40,000 images, generated with the randomization of imagery attributes described in the Data Generation section, and the associated truth data. The first 30,000 images were used for training, and a further 5000 were used for validation during training to evaluate the early stopping criterion. The last 5000 images were used for testing and comparing each of the trained models across the metrics mentioned in the Introduction section. Training was capped at 1000 epochs, with a patience of 20 epochs for early stopping.

In addition to the data augmentation detailed in the Data Generation section, training-time data augmentation was also utilized. This includes randomizing up to 70% saturation, 1.5% hue, 40% value, 3\(^\circ\) rotation, 10% mixup, 10% translation, mosaic, and 90% scale for each training image. This follows the defaults recommended by the YOLOv5 team [19], with the exception of left-right flipping of the image; flipping was excluded because the object being detected is nearly left-right symmetrical, with few identifying symbols on each side, and the components being predicted are specific to one side of the aircraft. An additional three networks were trained with different training hyperparameters in order to explore the best method of teaching perspective distortion of a 3D object within an image, as further detailed in Sect. 4.5. Each of these additional networks was trained with labeling of all components in the image regardless of occlusions and with center point correction enabled.

The YOLOv5 team has various pre-trained models available for transfer learning, with larger models being more accurate but slower. The YOLOv5 small model, YOLOv5s, was chosen for transfer learning for this project due to the precision, accuracy, and speed it was able to provide. Other relevant parameters used for training were a batch size of 32, a weight decay of 0.0005, a learning rate of 0.01, and a momentum of 0.937. The two visible-only networks were halted by the early stopping criterion, while the Xr_CC and Xr_Nc networks trained for the full 1000 epochs; details of all networks trained are in Fig. 8.

4.4 Neural network execution

The object detection neural network exposes various parameters for trading precision against recall, with the aim of maximizing true positives and minimizing false positives. The main parameter to be modified is the objectness confidence. The confidence threshold chosen for the experiments here was 0.85 because that was the maximum before a precipitous drop in F1 score for many of the networks used. This chosen value allowed no false positives when there was no aircraft in the image and allowed adequate correspondences for Solve-PnP. Further fine tuning of the confidence could be beneficial for a future deployed system.

In this experiment it is assumed that only one of each component may appear in each image, allowing us to choose the largest confidence result for each component as the method to remove duplicates. In a deployed system, techniques like masking, clipping, or segmentation may be required in concert with the system proposed here to remove observations from other aircraft in an image.
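The thresholding and duplicate removal described in this section can be sketched as follows. The tuple layout of a detection is an assumption made for illustration; it is not the YOLOv5 output format.

```python
def select_detections(detections, min_confidence=0.85):
    """Keep the most confident detection per component above the threshold.

    detections: iterable of (class_name, confidence, center_x_px, center_y_px)
    Returns a dict mapping class_name -> (confidence, center_x_px, center_y_px).
    """
    best = {}
    for name, confidence, cx, cy in detections:
        if confidence < min_confidence:
            continue  # below the chosen confidence threshold
        if name not in best or confidence > best[name][0]:
            best[name] = (confidence, cx, cy)  # only one of each component is allowed
    return best
```

A deployed system with several aircraft in view would need an additional association step before this one, as noted above.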

4.5 Data augmentation training hyperparameter optimization

To address perspective distortion, an experiment was designed to explore which data augmentation to use during training, because the typical data augmentation used for training object detection CNNs utilizes methods that could make learning perspective distortion difficult.

Some of the potentially troublesome techniques are mosaic, translation, and scale. Mosaic takes patches of four images and stitches them into one, essentially making the position of components in the image change. Translation also moves components within an image. These two techniques will matter greatly if the network is learning based on the position in the image, but will not if the network is learning based on the distinct appearance of the item as the indicator of perspective distortion. In addition, the distance an object is away from the camera changes the proportional amount of distortion, therefore scaling the image directly affects the perspective distortion.

The analysis of how perspective affects the appearance of objects in imagery led to the training of an additional three networks. The Xr_CC network mentioned previously is trained with complete knowledge of occluded components, center point correction, and the data augmentation hyperparameters outlined above. Each of the other networks was also trained with complete knowledge of occluded components and with center point correction enabled. The Xr_CC_nS network was trained like Xr_CC but without scale augmentation; nM_nT was trained without mosaic and without translation; and nM_nT_nS was trained without mosaic, translation, or scale. These four models were used to assess the best technique with which to train in light of perspective distortion requiring correction.

  1. Xr_CC - Default hyperparameters, no left to right flip
  2. Xr_CC_nS - Same hyperparameters as Xr_CC, no modification to scale
  3. nM_nT - Same hyperparameters as Xr_CC, no mosaic or translation
  4. nM_nT_nS - Same hyperparameters as Xr_CC, no mosaic, translation, or scale
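
The four configurations can be written down as overrides on a shared baseline. The sketch below is illustrative only: the key names mirror the augmentation entries of a YOLOv5 hyperparameter file, and the numeric baseline values are assumptions, not the settings used in the study.

```python
# Assumed baseline: default-style augmentation with left-right flip disabled.
# The numeric values are placeholders, not the values used in the study.
BASELINE = {"mosaic": 1.0, "translate": 0.1, "scale": 0.5, "fliplr": 0.0}


def variant(**overrides):
    """Return a copy of the baseline with selected augmentations changed."""
    hyp = dict(BASELINE)
    hyp.update(overrides)
    return hyp


MODELS = {
    "Xr_CC": variant(),
    "Xr_CC_nS": variant(scale=0.0),                              # no scale
    "nM_nT": variant(mosaic=0.0, translate=0.0),                 # no mosaic/translation
    "nM_nT_nS": variant(mosaic=0.0, translate=0.0, scale=0.0),   # none of the three
}
```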

4.6 Latent space perspective distortion investigation

To further explore how the neural network was learning to correct for perspective distortion in the center points, another experiment was designed. To correct the position of a 3D object’s center point, a neural network could learn based on two potential measures: the 2D location in the image, and/or the distinct appearance and proportions of the object due to the distance and orientation from the camera.
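
The center point mismatch that the network must learn to correct can be reproduced with a simple pinhole model. The sketch below is a toy illustration under assumed values (a 1 m cube, a 500 pixel focal length, and a 640x480 principal point at the image center); none of these come from the study. It compares the center of the 2D bounding box around the projected corners with the projection of the true 3D center.

```python
from itertools import product


def project(point, f=500.0, cx=320.0, cy=240.0):
    """Ideal pinhole projection of a camera-frame point (x, y, z) to pixels."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def center_mismatch(center, half=0.5):
    """Pixel offset between the bounding-box center and the projected 3D center.

    `center` is the 3D center of an axis-aligned cube with half-width `half`.
    """
    corners = [
        (center[0] + dx, center[1] + dy, center[2] + dz)
        for dx, dy, dz in product((-half, half), repeat=3)
    ]
    us, vs = zip(*(project(c) for c in corners))
    box_center = ((min(us) + max(us)) / 2, (min(vs) + max(vs)) / 2)
    true_center = project(center)
    return (box_center[0] - true_center[0], box_center[1] - true_center[1])


on_axis = center_mismatch((0.0, 0.0, 5.0))   # cube on the optical axis
off_axis = center_mismatch((2.0, 0.0, 5.0))  # same cube shifted 2 m sideways
```

In this toy setup the on-axis cube shows essentially no offset, while the shifted cube shows an offset of several pixels in the horizontal direction, which is the kind of discrepancy center point correction targets.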

To assess the impact of position in the image, test images were superimposed onto a larger image of the virtual sky for analysis. This test took 520 images from the test data set of 640x480 sized images and superimposed them onto nine different positions of a 4096x3072 sized image, creating 9 more sets of images for testing. These positions included the combinations of top, bottom, middle, left, and right; a notional example is in Fig. 5. No black boxes are used in actual test images, rather there would be sky in those locations; it is only used here to show possible positions. Then inference was run with the Xr_CC_nS model on each set of images for comparison; this model is defined in the hyperparameter study. The comparison conducted here was for the average center point error and the true/false positives; the 3D error was not assessed as part of this comparison.
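
The nine paste positions can be enumerated as below. This is a sketch under the assumption that the edge positions sit flush with the canvas border and the middle positions are centered; the exact offsets used in the study are not stated.

```python
def paste_offsets(canvas=(4096, 3072), tile=(640, 480)):
    """Top-left pixel offsets for a 3x3 grid of paste positions.

    Keys combine a row letter (T, M, B) with a column letter (L, M, R).
    Flush-to-edge placement is an assumption made for illustration.
    """
    canvas_w, canvas_h = canvas
    tile_w, tile_h = tile
    cols = {"L": 0, "M": (canvas_w - tile_w) // 2, "R": canvas_w - tile_w}
    rows = {"T": 0, "M": (canvas_h - tile_h) // 2, "B": canvas_h - tile_h}
    return {r + c: (x, y) for r, y in rows.items() for c, x in cols.items()}


OFFSETS = paste_offsets()
```

A truth center at `(u, v)` in the original image moves to `(u + x, v + y)` on the canvas, so the original labels can be reused after a constant shift.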

Fig. 5 Notional example superimposed images

4.7 Solve-PnP error characterization

The usage of a CNN to detect individual components of an aircraft requires truth label files to be created, which allows for error characterization of the truth data itself. The truth label data for each of the labeling techniques was utilized as inputs to the Solve-PnP algorithm to characterize Solve-PnP and the system. The intent of this was to show the behavior of Solve-PnP across the ablation study. This allowed the limiting of error analysis to images for which the truth data would allow solutions. The Error from Truth Labels section of the results shows the 3D error when utilizing the truth labels and the number of images with valid results for further comparison.

5 Results and discussion

This section is composed of four subsections detailing the results of the various experiments and discussing what those results could indicate. The first is an examination of the truth values from the different labeling techniques and how Solve-PnP performs given those perfect truth values. The next is an analysis of the ablation experiment, with results from the entire system using each of the different labeling techniques. Then details are provided on the experiment where different training-time data augmentation parameters were used, including which performs best with regard to perspective distortion. Finally, there is a discussion of the experiment designed to test whether the neural network was learning to correct for perspective distortion through the position in the image or the distinctive appearance of the object. This last experiment led to a potential cause of false positives that is also discussed at the end of the section.

5.1 Error from truth labels

In order to evaluate the error of the overall system, the results of each labeling scheme using its truth labels were analyzed. These truth labels were used to directly make Solve-PnP predictions and to limit the further analysis to the images in which solutions are possible. This was useful in characterizing the potential of the system. Figures 6 and 7 show the position and orientation error magnitudes for each of the labeling techniques. Each plot contains a dashed line representing the error threshold for the system of 7 cm and 1\(^\circ\), respectively. Table 1 explicitly presents the mean and standard deviation error of the Xr_CC labeling technique with center point correction.

The Vs_CC labeling technique exhibited outliers in rotation and position, which can be attributed to the sensitivity or fragility of the Solve-PnP algorithm when given only 4 points. To address this, a threshold of 6 components was established and is used in Fig. 8 to distinguish between adequate and inadequate numbers of components. Additionally, when Solve-PnP utilizes 6 points, it allows for direct linear transformations, enabling an estimate of the camera’s intrinsic matrix and further refinement of the results. Importantly, when fewer than 6 components are found in an image, the system does still make a prediction, and the various figures and metrics all include such results.
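
The 6-point threshold is consistent with the standard counting argument for the direct linear transformation, given here as background rather than as the authors' derivation. The \(3 \times 4\) projection matrix \(P\) has 12 entries defined up to scale, hence 11 degrees of freedom, and each of the \(n\) 3D-to-2D correspondences contributes two independent linear equations:

```latex
\lambda_i \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix}
  = P \begin{pmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{pmatrix},
\qquad P \in \mathbb{R}^{3 \times 4},
\qquad 2n \ge 11 \;\Rightarrow\; n \ge 6 .
```

Here \((X_i, Y_i, Z_i)\) is a component's 3D model point, \((u_i, v_i)\) its detected pixel location, and \(\lambda_i\) an unknown projective scale; these symbols are introduced only for this illustration.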

Fig. 6 Solve-PnP position error from truth labels

Fig. 7 Solve-PnP rotation error from truth labels

Table 1 Error from best truth data

The label data analysis revealed that predictions were possible for 4729 of the images. This occurred because 5% of the images in both the training and test data sets were intentionally generated without any aircraft, in line with best practices. Interestingly, none of the trained models generated false positives on the images without aircraft when using a confidence threshold of 0.85.

The Xr_CC labeling strategy proved effective in providing sufficient truth labels, enabling the Solve-PnP algorithm to make predictions for all 4729 images when provided truth data. In contrast, when using a labeling technique restricted to visible components, predictions were possible for 4586 images (96.98% of predictable images) with center point correction and for 4715 images (99.70% of predictable images) without center point correction. The difference in performance between these two visible labeling techniques can be attributed to the center point correction technique’s requirement for the center point to be visible, resulting in slightly fewer visible points compared to the data without center point correction.

This analysis yields two interesting observations. The first is that a neural network trained on data without center point correction is unlikely to reliably predict under the desired error threshold. The second is that the position error through the system is potentially 8% of the threshold in the best case, as observed in Table 1. This exemplifies the difficulty of the problem space. This error is likely due to the compounding of numerical precision issues between the simulation, the various truth files, and the limits on floating point variables at the scale at which the simulation was running. It is left to future work to optimize this aspect of the system.

The next interesting observation is that the orientation predictions were fairly accurate, even for the data sets without center point correction. This is likely due to the nature of Solve-PnP with RANSAC minimizing the error from the data it is sent. Additionally, when only provided 4 features Solve-PnP cannot use multiple samples to reduce the error, hence is more likely to produce a spurious output even when provided very good data. This is an indication that the Solve-PnP part of the pipeline is robust to some error, but it is vital to provide adequate amounts of data in order to allow it to perform the processes which make it robust. For Solve-PnP, when provided 6 or more points it can potentially even correct for intrinsic errors of the camera.

Overall, this initial analysis shows that the labeling scheme with center point correction is able to meet the error thresholds, while omitting center point correction typically causes Solve-PnP to exceed them. Thus, a system designed with this approach should have sufficiently low 6D error if it is able to predict these labels with sufficient precision.

5.2 Labeling strategy ablation study

This subsection details the results of the ablation study, examining the effects of training a network with knowledge of occluded components or not, and correcting for perspective distortion in the centers of the bounding boxes or not. First the performance of models which were trained with and without knowledge of occluded features are compared. Then, the ability of the model to learn perspective distortion is shown via discussion of the amount of error and amount of correction required.

5.2.1 Training without regard to occlusions

The first comparison is between the models trained with knowledge of occluded components and trained without that knowledge. For this part of the analysis, the center point corrected models were used. Figure 8 shows box and whisker plots comparing the two networks. The dotted line indicates 6 points being detected. That is a significant number because it is the threshold for Solve-PnP to utilize direct linear transformations and conduct iterations to reduce error from noise in the measurements.

Fig. 8 Visibility comparison of true positives

Figure 8 reveals that the labeling technique has a large effect on the number of features that can possibly be observed. The technique used with Xr_CC found nearly three times as many components on average. This is significant because it allows the Solve-PnP algorithm more opportunity to reduce error and potentially even correct for camera calibration issues. This is also an indication of the difficult nature of labeling components on a 3D object. Labeling 3-dimensional components situated on an object makes it difficult to meet the visibility threshold, even when using a small threshold of 25% for an item to be considered visible. Furthermore, this is the first indicator that the neural network is able to precisely identify components it is trained on even if many of those features are less than 25% visible; a full comparison of the neural networks’ metrics is provided in Appendix 1 for typical CNN metrics like precision.

Table 2 Visibility predictions comparison

As outlined in Table 2, the occlusion labeling strategy resulted in observable differences in the number of Solve-PnP observations between the two networks. The Xr_CC network was able to make predictions on 99.6% of images whereas the Vs_CC network achieved only 88.3%; equivalently, the share of images without a prediction fell from 11.7% to 0.4%, roughly a 29x reduction. The more important factor was that the Xr_CC network was able to make predictions under the error threshold 15% more often than the network trained on only visible components; this was a substantial increase in ability. The box and whisker plots for these can be observed in Figs. 9 and 10, with detailed metrics in Table 3 and Appendix 1; outlier counts are included in Table 9.

When reporting 3D error, an observation was treated as an outlier if it was more than 10 m or 5\(^\circ\) from truth. This approach is reasonable because Solve-PnP can behave poorly at times; when it does, the system would be expected to detect and discard those observations. The system could detect poor observations in multiple ways. First, if the orientation is multiple degrees in error, the aircraft would be expected to be flying away quickly. If the position has a large error, the aircraft may not be in the region for which the visual algorithm is supposed to make observations. Also, in a deployed system there would be a time ordering to the observations, so an observation that differed too much from the previous few could be discarded; the aircraft is expected to move at most approximately 10 cm between frames during these operations.
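
The reporting rule and the frame-to-frame check described above can be sketched as follows. The function names are hypothetical, and the `max_jump_m` tolerance is an assumed value chosen for illustration; only the 10 m, 5\(^\circ\), 7 cm, and 1\(^\circ\) limits come from the text.

```python
import math


def classify_error(pos_err_m, rot_err_deg):
    """Label a pose error using the limits quoted in the text."""
    if pos_err_m > 10.0 or rot_err_deg > 5.0:
        return "outlier"            # excluded when reporting 3D error
    if pos_err_m < 0.07 and rot_err_deg < 1.0:
        return "within_threshold"   # meets the 7 cm and 1 degree goal
    return "valid"                  # kept, but above the goal


def is_plausible_step(prev_xyz, new_xyz, max_jump_m=0.30):
    """Reject a pose that jumps too far from the previous frame.

    max_jump_m is a hypothetical tolerance, set above the roughly 10 cm
    of motion expected between frames.
    """
    return math.dist(prev_xyz, new_xyz) <= max_jump_m
```

Note that `classify_error` needs truth data and so applies to offline evaluation, whereas `is_plausible_step` only needs the previous estimate and could run in a deployed system.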

Table 3 Visibility pose error comparison

The next interesting metric was the pixel error of the Vs_CC versus Xr_CC trained networks on finding the 3D geometric center of components. Table 4 details the findings. The network trained on only visible components had less error. This showed statistical significance via a Wilcoxon Rank-Sum test with an alpha of 0.01 and P-values significantly less than 0.001. While statistically significant, the amount was a very small fraction of a pixel difference.
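
For readers unfamiliar with the test, a minimal Wilcoxon Rank-Sum sketch is shown below. It uses the large-sample normal approximation without a tie correction and is only an illustration of the statistic; it is not the code used in the study, where a statistics library would normally be used instead.

```python
import math


def rank_sum_test(a, b):
    """Two-sided Wilcoxon Rank-Sum test via the normal approximation.

    Returns (z, p). Ties receive the average of the ranks they span.
    """
    n1, n2 = len(a), len(b)
    # Sort all values, remembering which original index each came from.
    ordered = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * (n1 + n2)
    pos = 0
    while pos < len(ordered):
        end = pos
        while end + 1 < len(ordered) and ordered[end + 1][0] == ordered[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(pos, end + 1):
            ranks[ordered[k][1]] = avg_rank
        pos = end + 1
    w = sum(ranks[:n1])                          # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2                # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value
    return z, p
```

With an alpha of 0.01, a difference would be declared significant when the returned `p` is below 0.01.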

Fig. 9 Center point correction - position error

Fig. 10 Center point correction - RPY error

While its center point pixel error is lower, the system with a model trained on only visible components performed significantly worse. The Vs_CC model was worse in both 3D position and orientation error, shown in Figs. 9 and 10 and Table 3, as well as in the number of observations under the threshold, evidenced in Table 2. While this use case is somewhat unusual because it predicts components of a larger rigid object, for this case it is recommended to train on the full set of components of that object regardless of occlusions in an image.

Table 4 Center point error (pixels)

To summarise this analysis, a neural network intended for this use should be trained without regard for occlusions. There is a slight degradation in average 2D pixel precision; however, the massive increase in the quantity of observations allows Solve-PnP to perform better. This held in terms of both the 6D error and the total number of images for which predictions were possible.

5.2.2 Correcting for perspective distortion labeling comparison

Figures 9 and 10 confirm the analysis on the truth data. Neural networks trained without correcting for perspective distortion are not able to meet the error threshold when used in the overall system. This was confirmed as statistically significant by Wilcoxon Rank-Sum tests, with an alpha of 0.01 and P-values significantly less than 0.001. If anything, it is impressive that the Solve-PnP algorithm was able to perform so well given distorted points.

The interesting result of examining the correction made for perspective distortion is the amount the networks are capable of learning. In Fig. 11, each network’s center point error is shown, calculated against the labeling strategy on which that network was trained. This showed that the underlying neural network was able to learn both methods with similarly low error, under a pixel on average. Figure 12 shows the amount of correction the networks are applying; here the Xr_CC model was evaluated against both corrected and uncorrected labels, with the error shown in the figure. This reveals that the models are measurably correcting for the perspective distortion, arguably indicating that perspective distortion can be learned and corrected for by a neural network.

Fig. 11 Example of ability of networks to learn uncorrected and corrected bounding boxes

Fig. 12 Pixel amount corrected by network

5.2.3 Ablation study summary

In conclusion, the ablation study of neural network models had a clear result. The four models came from varying whether the training data respected occlusions and whether the bounding box was corrected for perspective distortion. The Xr_CC model, which was trained without regard for occlusions and with perspective distortion corrected, performed the best. This shows that perspective distortion was measurable, correctable, and learnable by a neural network. The initial analysis quantitatively showed that center point correction is required to perform within the error threshold when using object detection. Additionally, this result showed that training without regard for occlusions could improve the overall performance of the system by vastly increasing the number of observable components. This leads to other questions, such as how best to augment training data when training to correct for perspective distortion and how the network is learning to correct for the distortion.

5.3 Training data augmentation optimization

This section explores the results of assessing neural networks trained with different data augmentation parameters.

The number of true positives was improved with the Xr_CC_nS model versus the Xr_CC model, but was degraded for the models trained without mosaic and translation. These data sets did show statistical significance in the comparison of their true positives versus the Xr_CC model. The Appendix 1 table also shows that the typical CNN metrics like F1, accuracy, precision, and recall were consistently degraded for the models trained without mosaic or translation, and were improved for the Xr_CC_nS model. The Xr_CC_nS model was significantly better than Xr_CC for all but the average number of false positives. For the use case here a small increase in false positives is acceptable, both because Solve-PnP is able to maintain accuracy with false positives and because of observations made during the next and final experiment.

The model that performed the best for average 2D center point error was Xr_CC_nS, which used the original hyperparameters but without scale; the average values showing this improvement are in Table 5. The difference was statistically significant when compared between the original model and each of the others, as well as between the Xr_CC_nS model and the others; significance was shown via Wilcoxon Rank-Sum test with an alpha of 0.01, and all P-values significantly less than 0.001.

Table 5 Hyperparameter average errors comparison

The Wilcoxon Rank-Sum tests also showed statistical significance for the improvement to the 3D position and orientation, with an alpha of 0.01 and all P-values significantly less than 0.001. The Xr_CC_nS model showed improvement over the original settings for the 6D localization estimates, while the other networks showed degradation, as evidenced in Table 5. These results indicate that scale is the primary factor which should not be varied during training. Removing mosaic and translation did not yield any improvement but did degrade the detection and classification metrics; hence training should use mosaic and translation for CNNs being used for 6D pose localization.

5.4 Latent space perspective distortion experiment

With the best performing model, Xr_CC_nS, we can now examine how it is learning to correct for the error. In this experiment, predictions made further away from the edge, or with fewer adjacent edges, showed both a lower 2D center point error and a reduced number of false positives. Figures 13 and 14 show plots of the error for running the inference against the various data sets with their means detailed in Table 6; the abbreviations indicate the positions top (T), bottom (B), middle (M), left (L), and right (R) with respect to the notional example in Fig. 5. The ‘orig’ annotation indicates the original 520 images sized at 640x480. In the average 2D center point error case, the improvement showed statistical significance from a Wilcoxon Rank-Sum test with an alpha of 0.01 and all P-values significantly less than 0.001.

Table 6 Average center point errors and false positives
Fig. 13 Average center point error vs image location

Fig. 14 Average false positive count vs image location

To conclude that the network was learning the distinct appearance of an object, rather than its position in the image, to correct for perspective distortion, the error must not be made worse by moving the object around the image. In fact, in every case in Table 6, the false positives decreased and the 2D center point error decreased. This indicates the neural network is learning to correct for the perspective distortion by the distinct appearance an object has, regardless of the location in the image. Additionally, this leads to an insight about the few false positives that there are.

5.5 False positive analysis

The implication of the previous observation is that the false positives are not actually false positives. The component is there, but its center point is slightly off the edge of the image, causing it to be labeled as absent in the truth data. These false positives did not have a large negative effect because Solve-PnP with RANSAC is robust to outliers when provided enough points. This was indicated and confirmed in multiple ways.
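
The labeling behavior behind these apparent false positives can be sketched as a simple bounds check on the projected center. The function name and the `margin_px` band are hypothetical, introduced only to separate "barely off the image" from "clearly absent"; the study does not define such a margin.

```python
def label_state(center, width=640, height=480, margin_px=20.0):
    """Classify a component by where its projected center falls.

    'present'        : center lies inside the image, so a truth label exists
    'just_off_image' : center lies outside but within margin_px of the border,
                       so the component may be partly visible yet unlabeled
    'absent'         : center lies well outside the image
    """
    u, v = center
    if 0 <= u < width and 0 <= v < height:
        return "present"
    if -margin_px <= u < width + margin_px and -margin_px <= v < height + margin_px:
        return "just_off_image"
    return "absent"
```

A detection matched to a `just_off_image` component would count as a false positive against the truth data even though the component is partly in view.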

First, the latent space experiment made the false positives disappear when there were pixels in existence to provide negative feedback, i.e., if an object was barely off the edge of the image and the neural network saw the sky background there and not the component, then it was given evidence the component was not there and made the decision to not identify it. Second, an interactive 3D plot was utilized in Python to view the position in 3D space where the false positives occurred. They all occurred when the aircraft was near the edge of the image.

Finally, this was confirmed with a manual verification of the points which were falsely positive. Figure 15 is a 2D version of that 3D interactive plot; it plots the location of the geometric center of the aircraft in the image for the entire data set, not just the latent space analysis data set. The false positives that appear towards the middle in Fig. 15 are from instances where the aircraft was near the camera, hence the geometric center of the aircraft was near the center while its components may have been near the edge. Overall, this is merely an interesting observation that may lead to future work; it does not detract from the experiment’s original result demonstrating that the model learned the distinct appearance of the aircraft due to perspective geometry.

Fig. 15 False positive counts vs location on image

5.6 Running time

For 640x480 input images, the algorithm runs at 65 fps on a desktop, which is more than sufficient for real-time pose estimation. The desktop utilized an AMD 7950X CPU, 32 GB of DDR5 6400 MHz RAM, an RTX 3080 GPU, and CUDA 11.7, with inference from ONNX files via the OpenCV C++ API. Specifically, averaged across the 5000-image test data, this implementation took 3.4 ms to read an image file, 1.4 ms to resize and pre-process the data, 6.8 ms for forward propagation, 0.26 ms for post-processing the results, and 0.21 ms for Solve-PnP.
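
The stage timings above can be tallied against the per-frame budget implied by 65 fps. The dictionary keys are hypothetical labels for the reported stages; the sum covers only the listed stages and says nothing about any overhead between them.

```python
# Reported average stage times in milliseconds (values from the text).
STAGE_MS = {
    "read_image": 3.4,
    "preprocess": 1.4,
    "forward_pass": 6.8,
    "postprocess": 0.26,
    "solve_pnp": 0.21,
}

total_ms = sum(STAGE_MS.values())                  # sum of the listed stages
budget_ms = 1000.0 / 65                            # time per frame at 65 fps
forward_share = STAGE_MS["forward_pass"] / total_ms

print(f"listed stages: {total_ms:.2f} ms, budget at 65 fps: {budget_ms:.2f} ms")
print(f"forward pass share of listed stages: {forward_share:.1%}")
```

The listed stages sum to about 12.07 ms against a budget of about 15.38 ms per frame, with forward propagation accounting for a little over half of the listed time.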

6 Conclusions and future work

In summary, this study illustrated critical insights within the field of computer vision and 6D object localization. The three experiments underscored the importance of correcting perspective distortion and of training without regard for occlusions, showed the degradation of performance from image scale augmentation during training, and demonstrated translational equivariance. These findings underscore the need for the computer vision, robotics, and neural network communities to consider perspective geometry and the distortions created by flat image sensors, which give objects distinct appearances based on their location and orientation relative to a camera.

The issue in typical object detection is the mismatch between the 2D center of the bounding box and the projection of the 3D geometric center of the item being observed. This mismatch has two causes in the typical approach to object detection. The first is labeling only the visible portion of an object and the second is perspective distortion. Each of these causes a mismatch and leads to imprecise 6D localizations. Correcting for these two items was explored in the first study.

The first study showed that the technique of expanding a bounding box was necessary for the performance of the Solve-PnP algorithm, was learnable with near-pixel precision by the neural network, and that occluded features of the aircraft should be utilized. While their pixel precision was slightly worse, using the occluded features ultimately resulted in a massive increase in the quantity of observations and improved the 6D estimate. The second study demonstrated that scale augmentation should not be used as a training-time data augmentation strategy, as augmenting scale decreased the true positives found and increased the 2D error, resulting in worse 6D estimation. This study also confirmed that mosaic and translation should be utilized. The final study found the model to be translationally equivariant, indicating that the distinct appearance of the item due to perspective was being learned to correct for the distortion, reinforcing the theory in the Background section on how perspective behaves.

These results suggest that future studies could be useful to fully round out these concepts. First, the false positive analysis revealed a potential to extend the concepts featured here to the training of future neural networks by also allowing knowledge of features off the image. Future systems could also investigate other center point correction strategies like shifting the bounding boxes instead of expanding them. Those could provide additional insights to larger computer vision systems that may be able to utilize the width and height data for extremely occluded predictions or filtering of spurious results, each of which may operate better with different perspective distortion correction strategies. An eventual system could even potentially generate occlusion metrics to be used as weighting factors to create confidence, or filters for individual components’ 2D precision.

For the purposes of this study, it was beneficial to use purely synthetic data. The synthetic data allowed for meticulous sub-pixel labeling, ample data sets, and extremely precise error assessments. However, these may prove to be limitations for real-world applications, as such precise labels and error metrics are extremely difficult to obtain with real imagery. Additionally, if the system is deployed for a fleet of aircraft, there are likely to be variations in the textures, paint schemes, and configurations of the aircraft. This is compounded by operating aircraft not being truly rigid objects; their wings flex and their control surfaces move. Each of these limitations presents opportunities for future research in synthetic environments as well as with real-world imagery.

While the approach shown in this article is not likely to solve the issues of certifiability of a neural network for aircraft, it exemplifies many of the traits relevant to certifiability: modularity, understandability, and individual testability. The system was designed with the neural network predicting human-readable components of an aircraft and a separate process predicting the 6D localization. Each of the system’s processes was individually assessed and the system was assessed as a whole, resulting in a system able to meet the original goal of predicting a 6D localization to less than 1\(^\circ\) and 7 cm.

In conclusion, this study underscores the significance of considering perspective distortion in training neural networks. To optimize an object detector’s performance for usage by Solve-PnP it is crucial to correct for perspective distortion, train with knowledge of occluded components, and refrain from augmenting image scale. These findings shed light on how a neural network can precisely learn and perform best in concert with Solve-PnP when armed with comprehensive knowledge, ultimately benefiting a wide array of applications in computer vision and object localization.

7 Supplementary content

This paper has an accompanying video which can be found at [5].