1 Introduction

Human decision-making often relies on various visual viewpoints [1]. Despite this, most computer vision models and visual inspection systems still rely on a single viewpoint [2]. Seeland and Mäder [1], Van Ruitenbeek [3], Carrasco and Mery [4], and Wang et al. [5] point out that visual damage inspection is strongly influenced by the camera angle and light reflection. As a result, accurate localization and classification of damages remain challenging. Therefore, a multi-view inspection system that mimics human decision-making could improve detection performance. However, using multiple camera locations and viewpoints increases the complexity, as damages can be identified multiple times. Identification of similar detections across different viewpoints is therefore required.

Both Stent et al. [6] and Tang et al. [7] show that fragmenting original images into smaller tiles increases the recall on small objects. Although they used multiple fragments, they did not include multiple viewpoints. The effect of multi-view image classification or object localization has been touched upon in previous research, showing that the use of multiple viewpoints can significantly improve both classification and localization accuracy [8,9,10].

Although multi-view damage detection can improve both precision and recall, Nassar et al. [11] point out that aligning detections across different images can be challenging, especially in situations, such as assembly lines, where the object of interest moves along the cameras. The overlap between different viewpoints, as well as the moving object within a static environment, complicates the widely implemented image stitching approaches [11]. Alternative approaches use point clouds for damage inspection [12, 13]. Point clouds have the benefit of first combining the multiple viewpoints into a point cloud and eventually detecting the objects or damages in 3D. The construction of point clouds from a multi-view camera setup is widely studied [13,14,15,16,17]. However, constructing these point clouds is costly at high resolutions, which makes point clouds less suitable for the identification of small damages.

We propose the use of ray tracing to construct a fast multi-view detection system. Our proposed solution projects damages from each individual viewpoint onto a 3D vehicle representation. With this, each face in the 3D representation receives an associated class probability from the single-view detection model. The final multi-view prediction is obtained by directly weighting the class probabilities on the 3D representation. Therefore, our novel approach solely requires a single-view CNN and a 3D representation of the object. With this, our method benefits greatly from the extensively researched and optimized single-view object detection models of the last decade. While we focus on vehicle inspection, our generic approach can transform any single-view inspection model into a multi-view inspection system without the need to modify the existing model or underlying training data.

Our proposed method enables the re-identification of similar damages without requiring a minimum amount of overlapping context between the individual views. Moreover, it does not require the vehicle of interest to be captured in the same environment by each camera, as our approach is independent of the surroundings and solely requires the location and orientation of both the camera and the vehicle.

We contribute to previous research in three ways. Firstly, we implement a single-view inspection system and quantify how our single-view CNN benefits from improved label consistency, by revising labels using multiple annotation rounds. Secondly, we present a uniform way to combine inspections from different viewpoints, without the need for multi-view training data or hardware for depth estimation, as used by Kim et al. [16], Liu et al. [18], Chen et al. [19], and Deng and Czarnecki [20]. Lastly, we evaluate our approach in practice, on more than 3000 images, and show that our approach successfully merges similar damages across different viewpoints. Our approach merges almost \(99\%\) of the duplicate detections across all viewpoints and is able to remove \(96\%\) of the false positives due to the multi-view setup.

We first describe related work in Sect. 2, followed by a detailed explanation of our proposed multi-view inspection in Sect. 3. We conduct a practical evaluation of our proposed method and present the findings in Sect. 4.

2 Related work

Various research focuses on multi-view object detection to increase the detection and localization accuracy. Some encode multiple viewpoints in a single CNN [7, 8, 10, 11], whereas others combine the detections from each individual viewpoint in a later stage [2, 4, 21]. Alternative approaches combine multiple viewpoints into a point cloud to perform object detection directly on a point cloud level [12,13,14,15,16,17].

2.1 Multi-view object detection

Tang et al. [7] applied multi-view object detection to improve detection capabilities for small objects. They segmented a single image into multiple overlapping fragments and applied either Single Shot Detector (SSD) [22] or You Only Look Once (YOLO) v2 [23] on each segment. Although they used a fragmented object detection, they did not use different camera locations or camera angles, making the merging step relatively easy.

Accounting for different viewpoints, Zhao et al. [8] proposed M-YOLO to achieve higher positioning accuracy. They detect an object from two different viewpoints, using a front and a side view. They map the two-dimensional bounding boxes onto the reconstructed three-dimensional scene to form the three-dimensional object box. They perform the mapping using the transformation matrix between the viewpoints, in combination with the three-dimensional space coordinates of the camera.

Nassar et al. [11] developed a multi-view object detection model to accurately identify trees within the landscape. Their task differs from the object detection of Savarese and Fei-Fei [9] and Liebelt and Schmid [10], since their images are not captured in a fixed setting. They extended the SSD model of Liu et al. [22] to receive camera pose information. They developed a Geo Regression network to estimate the real-world geographical positions of the objects. Nassar et al. [11] argue that strong changes in viewpoints, light conditions, and scale complicate the detection task. Furthermore, they point out that training a multi-view object detection model requires large-scale multi-view object detection datasets, which are mostly not available.

An alternative approach is taken by Kaichi et al. [2], where a rotating table is used to capture different viewpoints for surface inspection on industrial parts. They used a fixed camera and light source location and used 3D CAD models, in combination with the predefined rotations, to track scratches and dents over different views. Rubino et al. [21] used a minimum of 2 and ideally 3 viewpoints to construct a 3D object location. They used conics instead of bounding boxes from the 2D images to formulate a quadric ellipsoid of the object location. Although they did not focus on surface inspection and solely focused on combining detections of large objects, they were able to successfully merge detections across different viewpoints.

To overcome the need for optical and geometrical parameters when combining multiple images, Carrasco and Mery [4] developed Automatic Multiple View Inspection (AMVI). Their proposed AMVI model firstly identifies defects in each image, then extracts keypoints of the object, and uses the keypoints to match the identifications across the images. They validate their proposed method on aluminum wheels, while rotating the wheel. The AMVI method takes approximately 38 s to combine the detections across 3 views, making it less applicable in real situations.

2.2 Object detection using point clouds

Constructing point clouds or 3D geometry from multiple viewpoints has been widely studied. Fender and Müller [24] present Velt, a uniform framework to combine multiple RGB-D viewpoints into a single point cloud. They show that multiple viewpoints can successfully be combined at both high and low resolutions. Similarly, Lemkens et al. [25] evaluate the construction of point clouds using multiple RGB-D viewpoints and specifically focus on cross-camera interference and noise influence when constructing the point cloud. They show that a larger angle between the light source and camera increases the noise and that a smaller distance between the individual cameras increases the interference.

Other research uses depth map fusion for 3D reconstructions from multiple viewpoints [26]. The proposed solution of Weder et al. [26] constructs voxel grids at a \(320 \times 240\) pixel resolution at 15 FPS using an Nvidia Titan GPU and is therefore applicable for real-time processing at low resolutions.

A variety of researchers focus solely on RGB images to overcome the need for additional depth information and, therefore, specialized cameras or sensors [27, 28]. Wen et al. [27] implement a single-view CNN to predict semantic and geometry features, which are used to leverage cross-view feature pooling. With this, they achieve an impressive speed of 0.32 s to generate a single mesh from 3 viewpoints on a \(214 \times 214\) pixel resolution, using an NVIDIA Titan Xp.

Much research has been done on object detection in point clouds [13,14,15,16,17]. Both Li et al. [14] and Feng et al. [15] focus on indoor object detection and exploit the relationship between objects to improve 3D object detection. Kim et al. [16] use RGB-D images for mask segmentation. They generate multiple foreground masks for each located bounding box and use a corresponding point cloud to estimate the 3D location using a Support Vector Machine (SVM) formulation.

Madrigal et al. [12] presented a method for recognizing surface defects in 3D point clouds. They introduced a Model Point Feature Histogram (MPFH) to detect defects using principal components and classify the point clouds into primitives for final surface defect recognition. Their 3D surface inspection has higher discriminative capabilities than single-view surface inspection. In line with Madrigal et al. [12], Makuch and Gawronek [13] automate inspections of cooling towers using point cloud analysis. They construct high-quality point clouds with fixed 3D laser scanners with 0.1 mm accuracy. Using a sequence of algorithms, among which principal component analysis (PCA), they achieve a 100% recognition rate for the targeted damage types.

2.3 Ray tracing

Ray casting, initially pioneered by Appel [29], finds the closest object on the path of a ray. His approach made it possible, for the first time, to go beyond traditionally shaped cones and spheres and provided a way to trace rays for differently shaped objects. Foley et al. [30] extended the algorithm of Appel [29] by adding recursive ray tracing to trace reflection, refraction, and/or shadow rays emerging from each hit. This extension enables more realistic image rendering with shadows and reflections. As both ray tracing and ray casting require a ray–primitive intersection calculation, the complexity is \(O(\text {rays} \times \text {primitives})\). Consequently, a brute-force ray tracing algorithm is computationally expensive. To reduce this cost, several algorithms have been proposed to minimize the number of triangle validations per ray.

Bentley [31] proposed a multidimensional binary search tree (kd-tree), which successively splits each dimension of the search space to construct the binary tree. The binary search algorithm requires on average \(O(\log (n))\) operations for search actions. As kd-trees enable range operations, they are especially useful for ray casting/tracing. Other approaches, considered less efficient, are the Bounding Volume Hierarchy (BVH) of Rubin and Whitted [32] and the Three-Dimensional Digital Differential Analyzer (3DDDA) [33].
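To make the per-primitive cost concrete, the sketch below gives a standard Möller–Trumbore ray–triangle intersection test in Python/NumPy. It is a generic reference routine under our own naming, not the specific algorithm of Appel [29] or Foley et al. [30]; a brute-force loop of this test over all triangles is exactly the \(O(\text {rays} \times \text {primitives})\) cost that kd-trees and BVHs are meant to avoid.

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: return the ray parameter t of the hit, or None.

    origin, direction: 3-vectors describing the ray.
    v0, v1, v2:        the triangle vertices.
    """
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                 # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return t if t > eps else None      # hit must lie in front of the origin

# Toy example: a ray along +z hitting a triangle in the z = 0 plane at t = 5.
origin = np.array([0.0, 0.0, -5.0])
direction = np.array([0.0, 0.0, 1.0])
tri = (np.array([-1.0, -1.0, 0.0]), np.array([1.0, -1.0, 0.0]), np.array([0.0, 1.0, 0.0]))
print(ray_triangle_intersect(origin, direction, *tri))  # -> 5.0
```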

3 Methodology

Figure 1 presents the same damage from two different viewpoints, illustrating the complexity of multi-view detection systems. The complexity increases with the number of viewpoints and the number of identified damages per viewpoint. To overcome this, we use single-view object detection, in combination with a 3D object representation, to construct a multi-view object detection system. We apply a sequence of operations to project the single-view damage locations onto the 3D object. These operations are graphically summarized in Fig. 2. We define the camera and vehicle scene representation and its mathematical formulation in Sect. 3.1. Using this formulation, the bounding boxes of the single-view damage detection (step one) are projected onto the vehicle by ray tracing in step two. Each projection is weighted with a bivariate Gaussian distribution to generate a weighted class probability on the vehicle in step three. The multi-view detection is constructed by averaging the bivariate Gaussian distributions over all detections and taking the dominant class per face in step four. In this step, we provide both the single-class and multi-class damage predictions.

We evaluate the effectiveness of the proposed approach in a practical setting, described in Sect. 3.5.

Fig. 1

Single damage from two different viewpoints

Fig. 2

Graphical representation of single-view damage projection onto 3D objects

3.1 Camera scene

We define the scene and mathematical formulation to be independent of the number of cameras, camera placement, camera angles, and vehicle make or model. We solely assume the location of the vehicle and cameras within the scene to be known.

We define the scene as an xyz-plane, where the vehicle is oriented in the z direction, as visualized in Fig. 3a. \({\varvec{O}}_{c_i} = \left\langle x_{c_i}, y_{c_i}, z_{c_i} \right\rangle \) defines the origin of camera i within the xyz-plane and \({\varvec{{\varTheta }}}_{c_i} = \left\langle {\varTheta }^{y}_{c_i}, {\varTheta }^{z}_{c_i} \right\rangle \) the angle of camera i with respect to the y-axis and z-axis, respectively. Both \({\varvec{O}}_{c_i}\) and \({\varvec{{\varTheta }}}_{c_i}\) are of major importance for the proposed multi-view inspection and can be obtained at installation of the cameras.

The camera angle encoding is graphically presented in Fig. 3b.
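For illustration, the scene description can be captured with two small records. The sketch below is a minimal Python illustration of our own (the field names and the NamedTuple layout are not part of the proposed system); the viewing angle and resolution are the values reported later in Sect. 3.5, and the orientation values are placeholders.

```python
from typing import NamedTuple

class CameraPose(NamedTuple):
    """Origin O_ci and orientation Theta_ci of camera i within the xyz-plane."""
    x: float          # O_ci: x-coordinate (e.g., in centimeters)
    y: float          # O_ci: y-coordinate
    z: float          # O_ci: z-coordinate (height above the ground)
    theta_y: float    # Theta_ci: angle w.r.t. the y-axis (degrees)
    theta_z: float    # Theta_ci: angle w.r.t. the z-axis (degrees)

class CameraSpec(NamedTuple):
    """Manufacturer-specified viewing angle alpha_ci and image dimensions D_ci."""
    alpha_y: float    # viewing angle component (degrees)
    alpha_z: float    # viewing angle component (degrees)
    height: int       # H_ci in pixels
    width: int        # W_ci in pixels

# One camera of the gate described in Sect. 3.5; the orientation is a placeholder.
pose = CameraPose(x=-175.0, y=-100.0, z=100.0, theta_y=0.0, theta_z=45.0)
spec = CameraSpec(alpha_y=58.0, alpha_z=80.0, height=1920, width=2560)
```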

Fig. 3

Scene representation with vehicle orientation in xy-plane (a) and angle encoding in xyz-plane (b)

3.2 Single-view damage detection

Our approach aims to provide a fast and scalable multi-view inspection system, without the requirement of extensive multi-view training data or multi-view computer vision models. We achieve this by leveraging existing single-view object detection models and combining the individual detections afterward. This approach removes the need for advanced hardware such as lidar sensors to estimate depth in images, as implemented by Kim et al. [16], Liu et al. [18], Chen et al. [19], and Deng and Czarnecki [20]. We use YOLOv5l, developed by Jocher et al. [34], for the single-view damage detection. Although we outline the proposed method using object detection, our method can easily be adapted to instance segmentation models.

We aim to construct a fast multi-view object detection algorithm, which can be executed on the edge with either GPU or CPU devices. To achieve this, we avoid computationally expensive point cloud construction. In line with this, we leverage bounding box models instead of instance segmentation models to reduce the computational complexity. To date, the state-of-the-art object detection model (YOLOv5) achieves 50.1 \(AP_{box}\) with 218.8 GFLOPS at 167 FPS (V100), whereas the state-of-the-art instance segmentation model (SOLO V2) achieves 41.4 \(AP_{box}\) at 10 FPS (V100) [34, 35].

3.3 Bounding box projection

We denote the damage locations, obtained from the single-view damage detection model, by \(\mathbf{d }_{ijk} = \left\langle x_1, y_1, x_2, y_2 \right\rangle \), with damage k of camera i in iteration j. Iteration j defines the capture count of the camera over time. We encode the coordinates as the top-left and bottom-right corners of the bounding box and use \({\varvec{{{\Delta }}}}{\mathbf {d}}_{ijk} = \left\langle {\varDelta } d^{x}_{ijk}, {\varDelta } d^{y}_{ijk} \right\rangle = \left\langle x_2 - x_1, y_2 - y_1 \right\rangle \) for its dimensions. The image dimensions are represented by \({\mathbf {D}}_{c_i} = \left\langle H_{c_i}, W_{c_i} \right\rangle \), where we assume a constant image dimension over different iterations, but not necessarily across different cameras. Similarly, we define the camera viewing angle by \(\varvec{\alpha }_{c_i} = \left\langle \alpha _{c_i}^{y}, \alpha _{c_i}^{z} \right\rangle \). The camera viewing angle depends on the camera type, size, and focal length of the lens used within the camera. This variable is therefore determined upfront, as it is specified by the manufacturer.
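This notation maps directly onto a small record per detection. The sketch below is our own illustrative encoding and not part of the original implementation.

```python
from typing import Dict, NamedTuple, Tuple

class Detection(NamedTuple):
    """Damage k of camera i in iteration j: d_ijk = <x1, y1, x2, y2>."""
    camera: int                    # i
    iteration: int                 # j: capture count of the camera over time
    damage: int                    # k
    x1: int                        # top-left corner (pixels)
    y1: int
    x2: int                        # bottom-right corner (pixels)
    y2: int
    class_probs: Dict[str, float]  # class probabilities from the single-view model

def box_dimensions(d: Detection) -> Tuple[int, int]:
    """Delta d_ijk = <x2 - x1, y2 - y1>."""
    return d.x2 - d.x1, d.y2 - d.y1
```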

The use of rectangular boxes adds redundancy around the location of interest. This results in lower localization accuracy of the projected area. Figure 4a illustrates the incorrect projection when solely using bounding box coordinates. Wang et al. [36], Li et al. [37], and Liu et al. [38] tried to overcome this redundancy of bounding boxes by encoding the rotation angle in the prediction. Despite this rotation angle, the projection is still affected when projecting onto outward-facing parts of the vehicle, such as mirrors.

To overcome this, we project the bounding box onto the vehicle by ray tracing through each individual pixel within the bounding box. This approach ensures that the full location can be extrapolated, as displayed in Fig. 4b. The downside of this approach is an increased number of rays from 4 to \({\varDelta } d^{x}_{ijk} \cdot {\varDelta } d^{y}_{ijk}\). This implementation is further described in Sect. 3.3.1. To further reduce the effect of redundancy around the damage object, we weight the area within the bounding box with a bivariate Gaussian distribution to decrease the class probability as the distance to the bounding box center increases, which is further described in Sect. 3.3.2.

Fig. 4

Projection of a 2D damage location onto a 3D vehicle, with incorrect projection when using bounding box coordinates (a) and improved localization when using all pixels within the bounding box (b)

3.3.1 Ray tracing projection

We project the bounding box from the single-view image onto the 3D representation of the vehicle. This step is performed using ray tracing, where the ray angles are obtained from two known components. Firstly, we obtain the angle of each pixel with respect to the camera center. Secondly, the camera angles with respect to the xyz-plane are added to the ray directions.

We index all pixels within the bounding box \({\mathbf {d}}_{ijk}\) and define this matrix as \({\mathbf {M}}_{ijk}\) (Eq. 1). The distance of each pixel with respect to the image center is then expressed by Eq. 2. The first term normalizes the pixel coordinates of \({\varvec{M}}_{ijk}\) with respect to the bounding box dimension, where subtracting 0.5 shifts the normalized pixels to the origin. The angles of the pixels with respect to the image center are subsequently obtained by multiplying the normalized pixel coordinates with the camera viewing angle (\(\varvec{\alpha }_i\)) in Eq. 3.

$$\begin{aligned} {\mathbf {M}}_{ijk}= & {} \begin{bmatrix} x_{u}\\ y_{v} \end{bmatrix}^T_{u=x_1, x_1 +1, \ldots , x_2, v=y_1, y_1 + 1, \ldots , y_2}, \end{aligned}$$
(1)
$$\begin{aligned} {\varvec{\tilde{M}}}_{ijk}= & {} \frac{{\varvec{M}}_{ijk}}{{\varDelta } {\varvec{d}}_{ijk}} - \frac{1}{2}, \end{aligned}$$
(2)
$$\begin{aligned} {\varvec{\tilde{M}}}^{\theta }_{ijk}= & {} {\varvec{\tilde{M}}}_{ijk} \cdot \varvec{\alpha _{i}}. \end{aligned}$$
(3)

In the second step, the pixel angles within the xyz-plane are obtained by adding the camera orientation (\({\varvec{{\varTheta }}}_{c_{i}}\)). This results in the angle matrix \(\tilde{{\varvec{M}}}^{{\varTheta }_{xyz}}_{ijk}\), consisting of a single ray angle per pixel. This final computation is given in Eq. 4, leading to the ray directions.

$$\begin{aligned} \tilde{{\mathbf {M}}}^{\theta _{xyz}}_{ijk} = \tilde{{\mathbf {M}}}^{{{\varvec{\theta }}}}_{ijk} + {\varvec{{\varTheta }}}_{c_i}. \end{aligned}$$
(4)

Using ray tracing, we shoot rays at all angles of \(\tilde{{\mathbf {M}}}^{\theta _{xyz}}_{ijk}\). We use the real-time ray tracing algorithm developed by Foley et al. [30] to determine the ray–triangle intersection point. Using this, we obtain the face matrix \({\varvec{F}}_{ijk}\), containing a 1 if ray r hits face l, and 0 otherwise. We represent \({\varvec{F}}_{ijk}\) as an \(n_f \times n_r\) matrix, with \(n_f\) the number of faces and \(n_r\) the number of rays.
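Equations 1–4 translate into a few lines of NumPy. The sketch below is an illustration under our own naming; the ray–model intersection is left abstract, and the `intersect` callback (e.g., a triangle test wrapped in a kd-tree or BVH, as discussed in Sect. 2.3) is an assumption rather than part of the equations above.

```python
import numpy as np

def ray_directions(box, viewing_angle, cam_orientation):
    """Eqs. 1-4: one (theta_y, theta_z) ray angle per pixel inside the box.

    box:             bounding box d_ijk = (x1, y1, x2, y2) in pixel coordinates.
    viewing_angle:   camera viewing angle alpha_ci (degrees).
    cam_orientation: camera orientation Theta_ci (degrees).
    """
    x1, y1, x2, y2 = box
    # Eq. 1: index all pixels within the bounding box.
    xs, ys = np.meshgrid(np.arange(x1, x2 + 1), np.arange(y1, y2 + 1))
    M = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    # Eq. 2: normalize by the bounding box dimensions and shift toward the origin.
    M_tilde = M / np.array([x2 - x1, y2 - y1], dtype=float) - 0.5
    # Eq. 3: angles of the pixels with respect to the image center.
    M_theta = M_tilde * np.array(viewing_angle, dtype=float)
    # Eq. 4: add the camera orientation to obtain the final ray directions.
    return M_theta + np.array(cam_orientation, dtype=float), M_tilde

def face_hit_matrix(directions, cam_origin, intersect, n_faces):
    """Build F_ijk (n_faces x n_rays): 1 where ray r hits face l, 0 otherwise.

    `intersect(origin, direction)` is assumed to return the index of the closest
    face hit by the ray, or None if the ray misses the 3D model.
    """
    F = np.zeros((n_faces, len(directions)), dtype=int)
    for r, direction in enumerate(directions):
        face = intersect(cam_origin, direction)
        if face is not None:
            F[face, r] = 1
    return F

# Toy usage with a dummy intersect that maps every ray to face 0.
directions, M_tilde = ray_directions((100, 50, 140, 90), (58.0, 80.0), (0.0, 45.0))
F = face_hit_matrix(directions, (-175.0, -100.0, 100.0), lambda o, d: 0, n_faces=4)
```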

3.3.2 Probability map

We weight each bounding box projection with a bivariate Gaussian distribution to increase the weight given to the center of the box. We combine the projected bounding boxes from different cameras and iterations and weight each detection with its class probability. We identify the damaged vehicle areas on the 3D representation by computing the class probability \(p^{\gamma }\) per face l using Eq. 5.

Each ray is weighted by the bivariate Gaussian probability density function given in Eq. 6, with \({\varvec{x}}\) the normalized pixel distances toward the center (\({\varvec{\tilde{M}}}_{ijk}\)). This results in an intensity vector \(f({\varvec{\tilde{M}}}_{ijk})\) of dimension \(1 \times n_r\), which we expand to \(n_f \times n_r\) to eventually multiply element-wise (Hadamard product) with \({\varvec{F}}_{ijk}\). As a result, the numerator defines the damage intensity obtained from ray r on face l.

The number of rays that hit a 3D face strongly depends on the size of the face. To compensate for this, we average over the number of ray hits per face in the denominator. Based on this, each face receives the average hit intensity from all rays that project onto the face. Figure 5 shows the resulting heatmap projection for two different damages, based on Eq. 5.

$$\begin{aligned} {\varvec{P}}_{ijk}^{\gamma }= & {} \frac{\big ( {\varvec{F}}_{ijk} \circ \big ( {\varvec{1}}_{n_f \times 1}\, f({\varvec{\tilde{M}}}_{ijk}) \big ) \big )\, {\varvec{1}}_{n_r \times 1}}{{\varvec{F}}_{ijk}\, {\varvec{1}}_{n_r \times 1}}, \end{aligned}$$
(5)
$$\begin{aligned} f({\mathbf {x}})= & {} \frac{1}{2\pi } |{{\varvec{\Sigma }}} |^{-\frac{1}{2}} \exp \Big (-\frac{1}{2} ({\mathbf {x}} - \varvec{\mu })^T {\varvec{{\Sigma }}}^{-1} ({\mathbf {x}} - \varvec{\mu }) \Big ), \nonumber \\ {\mathbf {x}}= & {} \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}, \qquad \varvec{\mu } = \begin{bmatrix} \mu _{x_1} \\ \mu _{x_2} \end{bmatrix}, \qquad {{\varvec{\Sigma }}} = \begin{bmatrix} \sigma ^2_{x_1} &{} \rho \sigma _{x_1} \sigma _{x_2} \\ \rho \sigma _{x_1} \sigma _{x_2} &{} \sigma ^2_{x_2} \\ \end{bmatrix}. \end{aligned}$$
(6)
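A minimal NumPy sketch of Eqs. 5 and 6 is given below, reusing the normalized pixel offsets \(\tilde{M}_{ijk}\) and the hit matrix \(F_{ijk}\) from the previous step. The function names and default parameters are placeholders of our own; note that the covariance matrix must remain invertible for the density in Eq. 6 to be defined.

```python
import numpy as np

def gaussian_weights(M_tilde, mu=(0.0, 0.0), sigma=(5.0, 5.0), rho=0.0):
    """Eq. 6: bivariate Gaussian density at each normalized pixel offset."""
    cov = np.array([[sigma[0] ** 2, rho * sigma[0] * sigma[1]],
                    [rho * sigma[0] * sigma[1], sigma[1] ** 2]])
    diff = np.asarray(M_tilde) - np.asarray(mu)
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def face_class_probability(F, weights, class_prob):
    """Eq. 5: average hit intensity per face, weighted by the class probability.

    F:          n_faces x n_rays binary hit matrix F_ijk.
    weights:    per-ray Gaussian intensities f(M_tilde_ijk), length n_rays.
    class_prob: class probability of detection d_ijk from the single-view model.
    """
    weights = np.asarray(weights, dtype=float)
    hits_per_face = F.sum(axis=1)                    # rays hitting each face
    intensity = (F * weights[None, :]).sum(axis=1)   # summed intensity per face
    p = np.divide(intensity, hits_per_face,
                  out=np.zeros_like(intensity, dtype=float),
                  where=hits_per_face > 0)
    return class_prob * p                            # P_ijk^gamma per face
```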
Fig. 5

Damage projection with ray tracing on each pixel, weighted by a heatmap probability. Damage located on left mirror (a) and left front bumper (b)

3.4 Combine multiple views

We combine multiple camera viewpoints to move from single-view damage detection to multi-view damage detection. Non-max suppression is not suitable, as it does not account for the irregular shapes of the 3D vehicle representation. Instead, we can directly apply Eq. 5 to all damages from the single-view model for all available viewpoints.

A bias arises toward false negative detections from the single-view model. To overcome this bias, we average the detection probability (\({\varvec{p}}_{ijk}^{\gamma }\)) over the number of cameras that have a non-blocking view of the damaged face. We denote the number of cameras that could have seen damaged face l by \(t_l\). We obtain the vector \({\varvec{t}}\) by projecting a single ray from each camera to the face and counting the number of hits on the face. Using this, we define the multi-view class prediction \(\varvec{\tilde{p}}^{\gamma }_{l}\), obtained by Eq. 7 (Fig. 6).

$$\begin{aligned} \varvec{\tilde{p}}^{\gamma }_{l} = \frac{1}{{\varvec{t}}} \sum _{ijk}{\varvec{p}}_{ijk}^{\gamma }. \end{aligned}$$
(7)
Fig. 6

Damage projection from three viewpoints with rays departing from camera

The final dominant class prediction for face l (\(\gamma _{l}\)) is then defined by Eq. 8, where the background class is set to 0. For multi-class prediction, the predicted classes are defined by \(\{\gamma \mid \tilde{p}^{\gamma }_{l} > \tau \}\). Redundant predictions are easily removed by the threshold variable \(\tau \), which serves as a confidence threshold. In other words, if the damage does not surpass threshold \(\tau \), the face is marked by Eq. 8 as background.

$$\begin{aligned} \gamma _{l} = {\left\{ \begin{array}{ll} \mathop {\arg \max }\limits _{\gamma } \; \tilde{p}^{\gamma }_{l}, &{} \text {if } \max _{\gamma } \tilde{p}^{\gamma }_{l} > \tau , \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(8)
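The combination of Eqs. 7 and 8 can be sketched as follows. The function name, the dictionary layout, and the default threshold are our own illustrative choices, not part of the published pipeline.

```python
import numpy as np

def multi_view_prediction(face_probs, visible_counts, tau=0.2):
    """Eqs. 7-8: average per-face probabilities over visible cameras and take
    the dominant class, with 0 denoting the background class.

    face_probs:     dict mapping a class id (positive int) to a list of per-face
                    probability arrays, one per single-view detection
                    (the P_ijk^gamma of Eq. 5).
    visible_counts: array t with, per face, the number of cameras that have a
                    non-blocking view of that face.
    tau:            confidence threshold (placeholder value).
    """
    classes = sorted(face_probs)
    t = np.maximum(np.asarray(visible_counts, dtype=float), 1.0)   # guard /0
    # Eq. 7: average the summed detections over the number of visible cameras.
    P = np.stack([np.sum(face_probs[c], axis=0) / t for c in classes])
    best = np.argmax(P, axis=0)                   # index of the dominant class
    best_prob = P[best, np.arange(P.shape[1])]
    # Eq. 8: faces whose dominant class does not exceed tau become background (0).
    labels = np.where(best_prob > tau, np.asarray(classes)[best], 0)
    return labels, P
```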

3.5 Practical evaluation

We construct a camera gate to evaluate the proposed solution in practice. We configure the camera gate and scene by manually selecting \({\varvec{O}}_{c_i}\) and \({\varvec{{\varTheta }}}_{c_i}\). At each of the four locations \((-175, -100)\), \((-175, 100)\), \((175, -100)\), and (175, 100), we place three cameras at 20, 100, and 200 centimeters above the ground surface. The selected cameras are from the same manufacturer, which reports the viewing angle \(\varvec{\alpha }_{c_i} = \left\langle 58, 80 \right\rangle \) and the resolution \({\mathbf {D}}_{c_i} = \left\langle 1920, 2560 \right\rangle \).
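For reference, this configuration translates into twelve camera origins (three mounting heights at each of the four locations). The sketch below only reproduces the values stated above; the orientations \(\varTheta_{c_i}\), which are selected manually at installation, are deliberately left out as they are site-specific.

```python
# Twelve camera origins O_ci: three mounting heights at each of four locations.
heights_cm = [20.0, 100.0, 200.0]
locations_xy = [(-175.0, -100.0), (-175.0, 100.0), (175.0, -100.0), (175.0, 100.0)]
camera_origins = [(x, y, z) for (x, y) in locations_xy for z in heights_cm]
assert len(camera_origins) == 12

viewing_angle = (58.0, 80.0)   # alpha_ci, as reported by the manufacturer
resolution = (1920, 2560)      # D_ci = <H_ci, W_ci>
```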

We compare our multi-view approach with a single-view inspection process. The comparison is conducted using 50 vehicles, 12 cameras, and 5 captures per camera. Each vehicle is captured while driving through the gate at a speed of approximately 5 kilometers per hour. With this, each vehicle scan consists of exactly 60 images, and we conduct the evaluation on 50 vehicle scans with a total sample size of 3000 images.

As described in Sect. 3.2, YOLOv5 [34] is used to train the single-view object detection model. A total of 42,313 images are used for the training of the single-view object detection model. These images are captured by various users, with various mobile phones in different settings. This ensures that the dataset contains different camera angles, image resolutions, vehicle types, vehicle colors, and lighting and background conditions. The resulting dataset is diverse, ensuring that the single-view model is not trained on a single camera setup only. Therefore, the single-view model is more robust against different camera angles and light conditions when implemented in the multi-view approach.

A total of 42,313 images are manually labeled by four individuals in a first annotation round, with disjoint label sets per individual. All annotations are validated in a second annotation round by shuffling the label sets over the individuals. We use this extensive approach to improve label quality and consistency. During the second annotation round, a total of 26.2% of the images received a change to at least one of the labels (e.g., a removed/added bounding box or a changed bounding box shape). To validate whether the consistency improved, we conducted an additional round on a randomized subset of 1000 images, resulting in a significantly lower change rate of 2.8%.

The full labeling process results in a total of 119,701 bounding boxes. We use transfer learning from the COCO-trained weights and fine-tune the detection head of YOLOv5 for 100 epochs. The full network is then trained for an additional 50 epochs. We make use of Stochastic Gradient Descent, an image size of 640, and Mosaic augmentation, and apply a grid search over \(\{(lr, bs) \mid lr \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\},\; bs \in \{8, 16, 32, 64\}\}\). The learning rate, momentum, image scale, rotation, shearing, and HSV saturation are optimized using Bayesian Optimization.
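The grid search over learning rate and batch size can be expressed as a simple loop. The `run_training` function below is a hypothetical placeholder for the actual YOLOv5 training and validation run [34] and is not part of the published pipeline.

```python
from itertools import product

learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]
batch_sizes = [8, 16, 32, 64]

def run_training(lr: float, batch_size: int) -> float:
    """Hypothetical stub: fine-tune the YOLOv5 detection head for 100 epochs from
    COCO weights, train the full network for 50 more epochs, and return the
    validation score. Replace with the actual training/validation procedure."""
    return 0.0  # placeholder score

scores = {(lr, bs): run_training(lr, bs)
          for lr, bs in product(learning_rates, batch_sizes)}
best_lr, best_bs = max(scores, key=scores.get)
```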

We apply the Gaussian heatmap using \(\mu _1 = \mu _2 = 0\), \(\sigma _1 = \sigma _2 = 5\), and \(\rho = 1\). We use a relatively high \(\sigma \) to put slightly more attention on the bounding box center compared with the corners. To enable fast ray tracing, we leverage the latest ray tracing implementation of Parker et al. [39].

4 Results

We first outline the performance for single-view damage detection and present the performance improvement that is achieved from the label revision. Secondly, we draw the comparison between the single-view damage detection approach and our proposed multi-view damage detection method.

4.1 Single-view damage detection

Table 1 presents the damage detection performance after parameter optimization. The performed hyperparameter optimization yields \(lr = 10^{-3}\), \(bs = 32\), \(momentum = 0.89\), \(\text {image scale} = \pm 40\%\), \(rotation = \pm 20^{\circ }\), and \(shearing = \pm 5^{\circ }\). The results show that the improved label consistency from the second annotation round is reflected in a strong performance improvement: the recall and precision improve by almost 23% and 12%, respectively. The annotation revision does not solely improve the precision and recall, but also improves the alignment of the bounding boxes. Figure 7 presents an excerpt of the label revision, where two false positives are removed and the alignment of one bounding box is improved.

Table 2 presents the confusion matrix for single-view damage inspection. A total of 3000 images of 50 unique vehicles are used in the evaluation. The test set contains 931 damages, corresponding to 224 unique damages (i.e., some damages are captured multiple times under different angles). The presented results seem reasonable at first sight, but the underlying problem arises as soon as the individual detections are aggregated per unique damage. Table 3 shows that 34 unique damages (15%) have not been identified in any of the views. Furthermore, it points out that 73% of the unique damages are identified more than once. Based on the used test set, a human operator would receive each damage on average 3.5 times. These duplicate identifications directly affect the applicability in real-world situations, since duplicates have to be removed manually.

Fig. 7

Excerpt of a modified class label with iteration 1 (a) and iteration 2 (b): two false positives removed and one bounding box dimension revised

Table 1 Performance comparison of single-view damage detection, before and after label revision
Table 2 Single-view confusion matrix for 3000 Ultra HD images of 50 unique vehicles, under confidence threshold 0.2
Table 3 Number of identifications for all 224 unique damages, based on 3000 images, for varying confidence thresholds for the single-view detection model

4.2 Multi-view damage detection

Table 3 shows that the single-view approach contains many damages that are identified more than once. A strong reduction in duplicate identifications is achieved using the multi-view approach, where only 8 damages are marked as two separate damages while belonging to the same unique damage. In other words, our proposed method reduces the number of duplicate identifications from 576 to 8. This strong reduction in duplicate detections is achieved because the single-view model is not able to identify whether the damage in the first viewpoint is equivalent to the damage in the second viewpoint. Furthermore, Table 4 shows that the number of false positives is reduced by \(79\%\) for the multi-view approach, compared with the single-view approach. This reduction is again obtained from the ability to share information across the different viewpoints. We previously argued that the viewpoint matters for the detection of damages, since light reflections can result in false positives. Applying the multi-view model enables us to evaluate and average the damage from multiple viewpoints, according to Eq. 7. This approach reduces the number of false positives notably, since a false positive from one viewpoint is outweighed by the other viewpoints.

Table 3 shows that a lower confidence threshold reduces the number of false negatives (0 identifications) for the multi-view approach, while retaining the ability to remove almost all duplicate identifications. In addition, a lower confidence threshold reduces the number of false negatives at the cost of a slightly increased number of false positives. The effect of a lower confidence threshold on the number of false positives is presented in Table 4. Therefore, our approach retains the same false positive/false negative trade-off as the single-view approach, while being able to remove almost all duplicate detections.

Table 4 Number of false positives, based on 224 unique damages with 3000 images

It has to be noted that our proposed approach does not solely remove duplicate identifications across multiple views; it also directly removes almost all false positives on the surroundings of the vehicle. The trained single-view damage inspection model generates false positives on the surroundings when cracks or scratches are present in the concrete walls and floors of the setup. A total of 71 false positives were generated on the surroundings, based on the 3000 test images. The multi-view algorithm automatically removes 68 of these false positives (96%) without the need for additional algorithms such as vehicle segmentation, which would normally be required.

Both Figs. 8 and 9 visualize the multi-view prediction for a single dent. The heatmap predictions are combined from multiple detections, where the damage of Fig. 8a is seen by 6 out of 6 cameras and the damage of Fig. 9a by 4 out of 5 cameras. Both visualizations are constructed with probability threshold 0.10.

Fig. 8

Excerpt of a single-view (a) and associated multi-view prediction (b)

Fig. 9

Excerpt of a single-view (a) and associated multi-view prediction (b)

4.3 Robustness of the proposed model

We previously presented how a change in the confidence level of the single-view model (YOLOv5l) impacts the number of duplicate detections (Table 3) and the number of false positives (Table 4). Besides this, other factors play an important role in the performance of the multi-view model. Changing the underlying training dataset for the single-view model largely influences the outcome of the multi-view model.

Table 5 Effect of the training dataset size for the single-view damage detection model on the number of duplicate identifications and the number of false positives, under confidence threshold 0.2

Table 5 shows that a smaller single-view training dataset directly affects the performance of the multi-view model. This results from decreased diversity and decreased detection performance. As a result, both the number of false positives and the number of false negatives increase. However, the number of duplicate detections does not increase to a large extent: with the original dataset, a total of 8 damages are not successfully merged, against 18 with a dataset of 10,000 images. In addition, the number of false positives from the multi-view model with only 10,000 training images is still lower than that of the single-view model trained on the full dataset, showing that our proposed model can already add value for smaller datasets.

4.4 Computation time

We evaluate the computation time of our proposed solution on an Nvidia RTX 3090 GPU with an AMD 5800 CPU and 64 GB memory. We break the computation time into two segments: the single-view stage and the multi-view stage. We report the total time to process the 50 vehicles (3000 images), as well as the average time per vehicle (60 images). The resulting computation time for both the GPU and the CPU implementation is presented in Table 6. The largest computational components are the single-view YOLOv5l model and the ray tracing component. The ray tracing implementation requires a relatively low number of rays to process: no rebounds are required to find the 3D location, and rays are only computed over the pixels of the damage locations. In our practical evaluation, the single-view model predicts 1085 (773 + 312) bounding boxes, resulting in 1085 boxes for ray tracing. Despite this, the ray tracing component is still relatively large, mainly due to the context setup required in our proposed solution. The majority of the ray tracing time is spent loading the 3D vehicle context from disk into the render environment. Therefore, the reported results are an indication of performance that has not yet been optimized.

Table 6 Computation time for both the GPU and CPU implementation, reported in seconds

5 Discussion

Our implementation uses pre-constructed 3D models to reduce computation time. Therefore, the mirrors or wipers can be oriented differently in the real situation than in the pre-constructed 3D models. As a result, the damage projection becomes slightly less accurate for these areas. The same holds for different rim types, which are not taken into account in our approach. The latter can easily be resolved in further research, as most vehicles have a fixed set of available rims, giving the option to render 3D models for each make, model, and rim combination.

The detail level of the 3D object affects the damage projection accuracy. More detailed 3D models lead to more detailed damage projections. However, a more detailed 3D model increases the number of faces, which increases the computation time of the ray tracing. Future work can focus on quantifying the trade-off between 3D model granularity and computation time.

6 Conclusion

We presented how computer vision models benefit from improved label consistency, by comparing the detection performance before and after label revision by a second annotator. We showed that the recall and precision improve by almost 23% and 12%, respectively.

Furthermore, we presented a novel approach to leverage existing single-view training data and single-view deep learning models for multi-view damage inspection. Our practical evaluation, conducted in a dynamic environment, shows that similar damages are successfully grouped together, reducing the number of duplicate identifications by 99%. Furthermore, we showed that the multi-view approach reduces the false positive rate by more than 96%, while maintaining the same number of true positives. Moreover, we showed that almost all false positives on the surroundings are automatically removed, without the need for additional segmentation models. Therefore, single-view damage detection models can significantly benefit from our novel multi-view inspection methodology without requiring specialized depth hardware, multi-view deep learning architectures, or multi-view training data.

The proposed method requires relatively little computational power, as only viewpoints with identified damages are used in the merging process. Ray tracing is therefore performed solely on the damaged areas, making the method applicable to CPU and edge devices.

With the reduced number of duplicate detections, the reduced number of false positives, and the low computational requirements, we are able to reduce the barrier for practical implementation of multi-view damage inspection systems.