Evaluation of deep learning algorithms for semantic segmentation of car parts

Evaluation of car damage from an accident is one of the most important processes in the car insurance business. Currently, it still requires a manual examination of every basic part. It is expected that a smart device will be able to do this evaluation more efficiently in the future. In this study, we evaluated and compared five deep learning algorithms for semantic segmentation of car parts. The baseline reference algorithm was Mask R-CNN, and the other algorithms were HTC, CBNet, PANet, and GCNet. Runs of instance segmentation were conducted with those five algorithms. HTC with ResNet-50 was the best algorithm for instance segmentation on various kinds of cars such as sedans, trucks, and SUVs. It achieved a mean average precision of 55.2 on our original data set, which assigned different labels to the left and right sides, and 59.1 when a single label was assigned to both sides. In addition, the models from every algorithm were tested for robustness by running them on images of parts in a real environment with various weather conditions, including snow, frost and fog, and various lighting conditions. GCNet was the most robust; it achieved a mean performance under corruption (mPC) of 35.2 and a relative performance under corruption (rPC), which measures the degradation of performance on corrupted data compared to clean data, of 64.4% when left and right sides were assigned different labels, and mPC = 38.1 and rPC = 69.6% when left- and right-side parts were considered the same part. The findings from this study may directly benefit developers of automated car damage evaluation systems in their quest for the best design.


Introduction
Recently, the insurance business has grown rapidly because more people have started to insure their life and property seriously to control the risk of extensive repair costs for a damaged car or property after an accident. Car insurance is a major insurance business; it is mandatory for cars that have not been fully paid off yet. A crucial process in the operation of a car insurance company is the intricate car damage evaluation process, which requires evaluators to have comprehensive experience and skills in handling car damage. The evaluators base their task on evidence, e.g., video recorded by the car's camera, photos of the damage taken with mobile phones, and log data from IoT devices such as telematics units [1,2]. They must also present their damage evaluation to several parties and estimate the repair cost. This process not only takes a long time, but is also prone to human error, fatigue and bias. Insurance companies want to make this process more accurate without needing to hire many highly paid damage evaluators.
1 Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok 10520, Thailand
New technology has made computers more powerful: machine learning enables a computer to learn from big data and provide clues for decision makers, and computer vision enables a computer to recognize objects in an image or a video clip, which is directly applicable to the insurance business. Edge computing enables front-end devices, e.g., smartphones, to analyze images in real time. This technique pushes heavy computation tasks, e.g., artificial intelligence, computer vision and complex algorithms, from centralized computing to the edge of the network, i.e., a front-end device. The front-end device benefits from privacy, reliability and lower network latency [3][4][5]. Evaluators can use a smartphone to capture complete views of a car and analyze the captured image or video in real time to evaluate damage and estimate the repair cost instantly. Every insurance company requires photos of damage to an insured car or property as evidence. Therefore, we applied these new computer technologies to automate some steps of damage evaluation from photos of the damaged car: (1) identification of car parts; (2) identification of damaged parts; (3) damage evaluation for each part; and (4) repair cost estimation. These steps are illustrated in the schematic diagram in Fig. 1.
Here, we used image segmentation to automatically identify car parts. An image segmentation technique is similar to object detection; it detects where, in an image, an object is located, but adds recognition of the context of the object. An essential difference between the two techniques is that image segmentation works at the pixel level, whereas object detection works at the level of bounding boxes around the object. Image segmentation can be either semantic segmentation, where identical objects in the image are considered to be the same object, or instance segmentation, where identical objects are recognized as different instances. In particular, we used instance segmentation, since we wanted to differentiate different instances of the same object. For example, some car parts come in a left and right pair; instance segmentation enabled us to differentiate between the two members of this pair. A literature review showed that papers on car part segmentation are still limited, and no standards or criteria for this process have been established. Therefore, we tested a set of state-of-the-art deep-learning algorithms on a self-developed car part data set, containing images annotated with descriptions of the objects in them. Our contributions are:
1. Development of an extensive car part data set of annotated images of car parts from multiple viewpoints; some images were taken from the Internet and some were taken by our team.
2. Comparative evaluation, in terms of mean average precision, between Mask R-CNN (the baseline technique) with a ResNet backbone and four state-of-the-art instance segmentation algorithms, namely the top four algorithms reported by paperswithcode.com [6].
3. Robustness testing, in terms of mPC and rPC, of models from the four state-of-the-art instance segmentation algorithms and the baseline model against real weather elements and lighting conditions in the photos.
The rest of this paper is arranged as follows: the second section briefly describes related works; the third section briefly describes the tested algorithms; the fourth section discusses the experimental setup and the data sets; the fifth section reports and discusses results, and the final section concludes.

Related works
Edge computing has emerged to push computation capability closer to end-devices. It can improve response times and reduce the required network bandwidth. With a combination of front-end devices, edge nodes and cloud computing, many applications that use machine learning and computer vision techniques have been successfully deployed. Many researchers have developed algorithms that operate fully on front-end devices to enhance system efficiency. Velichko et al. [7] proposed a lightweight neural network algorithm called "LogNNet", which used filters based on logistic mapping for image recognition tasks. It can be employed in low-memory devices. Howard et al. [8] and Sandler et al. [9] developed MobileNets and MobileNetV2, which are efficient lightweight Convolutional Neural Network (CNN) models designed to work on mobile devices. Tuli et al. [10] developed an object detection framework, EdgeLens, that integrated IoT, fog and cloud computing.
Applications of instance segmentation have included detection of individual humans in an image based on their posture. For example, Zhang et al. [11] presented an instance segmentation method for human detection based on a human pose skeleton. It enabled recognition of the context of a posture even when another human was nearby or overlapping in the image. This capability differentiated it from other instance segmentation algorithms, e.g., Mask R-CNN [12]. Other instance segmentation applications include identification of biological objects in an image. In one instance, Yi et al. [13] presented an instance segmentation method for biological objects that worked on heat map images.
Recently, several new instance segmentation algorithms have been proposed. For instance, CenterMask [14] did not use a bounding box but used a spatial attention-guided mask. It differed from algorithms that use a fully connected layer, e.g., Mask R-CNN. In addition, it used a fully convolutional one-stage object detector (FCOS) [15] rather than Faster R-CNN [16] in the object detection task, resulting in a higher detection accuracy on both still images and video frames. In another example, Wang et al. [17] developed an instance segmentation algorithm, "SOLO", a one-step algorithm that did not use bounding boxes in object detection but, instead, divided an image into a number of squares and detected the object of interest in each square. It used a semantic category branch to determine the semantic category as well as an instance mask branch to determine the instance category. SOLO was later improved into SOLOv2 [18]. Recently, one-stage instance segmentation methods, e.g., PolarMask [19], RDSNet [20] and YOLACT++ [21], which do not have different branches for performing different functions, have gained more attention from researchers than two-stage methods. A two-stage method performs object detection first, then constructs a mask branch to predict each mask in a bounding box. Examples of these methods are Mask R-CNN, PANet [22] and Mask Scoring R-CNN [23]. Chen et al. [24] proposed BlendMask, with an improved FCOS object detector. They added a blender module to an attention map. The blender module included both high- and low-resolution masks in every bounding box mask, enabling the model to predict the mask more accurately and rapidly than Mask R-CNN or other two-stage algorithms.
Among studies on car part segmentation, Lu et al. [25] presented a semantic segmentation method for car parts, based on landmark assignment and the boundaries of each part. They used a graphical model to find relationships between car parts, then segmentation by weighted aggregation (SWA) [26] to pair two nearby landmarks, then a Segment Appearance Consistency (SAC) technique to connect segments of nearby landmarks in every level of a hierarchical segmentation and to determine whether the same segment was represented in every hierarchical level. The outcome was a group of pixels that could classify various car parts. Nevertheless, in SAC and hierarchical segmentation, the meaning of a car part differed between hierarchical levels. In other words, SAC, after only one round of SWA, was not able to segment all car parts in an image. Singh et al. [27] built a system to detect different car parts and localize their damage. However, the algorithms used in their system (Mask R-CNN, PANet and an ensemble model based on Mask R-CNN and PANet) did not perform well: the mAP was lower than 0.5 across all algorithms. Dhieb et al. [28] used Inception-ResNetV2 to classify damage severity level and to localize and detect part damage. Patil et al. [29] and Dwivedi et al. [30] used various CNN models to classify car part damage, but these works focused only on a small set of car parts.
A website, paperswithcode.com, ranked all instance segmentation methods and determined the state-of-the-art ones [6]. They were benchmarked on various data sets, e.g., PASCAL VOC [31] and the Common Objects in Context (COCO) Challenge [32]. Since we needed the best model for instance segmentation of car parts, we evaluated several algorithms ranked on the large COCO test-dev data set, which has a large number of categories, using Mask R-CNN with ResNet as the baseline. The evaluated methods were the top four, as ranked by paperswithcode.com on 30 September 2019, that also used ResNet as the backbone: HTC [33], CBNet [34], PANet [22] and GCNet [35]. These algorithms are briefly described in the next section.

Mask region-based convolutional neural network (Mask R-CNN)
The Mask R-CNN instance segmentation algorithm [12] was a development of Faster R-CNN [16]. Faster R-CNN was only able to detect where a target object was in an image and recognize it, whereas Mask R-CNN was also able to perform instance segmentation. Mask R-CNN had two main parts: (1) a backbone that extracted features from an image with a Residual Neural Network (ResNet), a CNN 50-101 layers deep [36], in combination with a Feature Pyramid Network (FPN) [37]

Global context network (GCNet)
GCNet [35] had a similar structure to Mask R-CNN, as can be seen in Fig. 2. However, the ResNet-FPN backbone was augmented with a global context block (Fig. 3). The Non-local Network (NLNet), a part of the block, addressed the long-range dependency issue of deep neural networks [38]. NLNet worked in combination with a Squeeze-Excitation Network (SENet) to find the relationships between channels of each feature [39]. GCNet was as effective as NLNet, but computed faster, because it used fewer convolution and operation layers than NLNet. It was ranked number four by paperswithcode.com.

Path aggregation network (PANet)
PANet was developed by Liu et al. [22]. It had a similar structure to Mask R-CNN, as shown in Fig. 4  every layer, concatenated all of its output, then sent them to the head component, consisting of many fully connected layers, to detect objects, construct masks and bounding boxes and classify detected objects. Because of those processes, PANet was highly accurate. It was able to take advantage of all levels of feature maps, from low to high level features in each feature map. PANet was ranked number three by paperswithcode.com.

Cascade mask R-CNN with composite backbone network (CBNet)
This method combined Cascade Mask R-CNN [40] and the Composite Backbone Network [34]. First, CBNet improved the feature extraction step, using a number of connected backbones called Assistant Backbones. Each connected backbone extracted some features and sent a feature map to the next backbone, which also extracted some features and sent a new feature map to the following backbone, and so on. The last backbone was called the 'Lead Backbone'. It generated the final feature map, which was consecutively concatenated with features extracted from all previous backbones in the connection. Because of this repeated extraction step, low-level and high-level features were extracted into a more effective mix than the one that Mask R-CNN generated. Second, Cascade Mask R-CNN, whose head was modified from that of Mask R-CNN, improved prediction accuracy. The output of the bounding box head in a previous branch was forwarded to the ROI Extractor of the next branch to improve prediction accuracy, as illustrated in Fig. 5. This method was ranked number two by paperswithcode.com.

Hybrid task cascade for instance segmentation (HTC)
This algorithm was developed by Chen et al. [33] to improve the efficiency of the instance segmentation task. In this algorithm, the bounding box head, mask head and ROI extractor were interleaved in a cascade, as illustrated in Fig. 6, so that the bounding box prediction and mask prediction tasks proceeded in parallel instead of independently. A multi-stage mask branch technique was introduced. It took into account the mask from a previous branch in the generation of a mask in the current branch to improve information flow. Lastly, a semantic mask branch was connected to the head of every mask to enable the model to better understand the context of the information in every mask. All of these features improved the information flow in every task. This method was at the top of the paperswithcode.com ranking.

Data set
The data set contained 500 images of sedans, pickups and sports utility vehicles (SUVs), collected from the Internet and taken in public parking spaces. Images of these vehicles were taken from multiple views: front, back and angled views. The car identification number was blurred to hide individual vehicle details. Each image was annotated with instance masks and bounding boxes for the following 18 categories: back_bumper, back_glass, back_left_door, back_left_light, back_right_door, back_right_light, front_bumper, front_glass, front_left_door, front_left_light, front_right_door, front_right_light, hood, left_mirror, right_mirror, tailgate, trunk (of trucks and SUVs), and wheel (wheel and tire). The number of instances per category is shown in Fig. 7 and examples of the images in the data set are in Fig. 8. The DSMLR Car Part data set contains images and annotations in COCO Challenge format and is available for download at https://github.com/dsmlr/Car-Parts-Segmentation.
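Because the annotations are in COCO format, a per-category instance count such as the one plotted in Fig. 7 can be recomputed directly from the annotation file. The following is a minimal sketch, assuming the standard COCO keys (`categories`, `annotations`, `category_id`); the toy dictionary and the helper names `load_coco` and `count_instances_per_category` are illustrative and are not taken from the released repository.

```python
import json
from collections import Counter


def load_coco(path):
    """Read a COCO-style annotation file (the path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def count_instances_per_category(coco):
    """Return {category_name: number_of_annotated_instances}."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    # Keep categories without annotations so the summary lists every label.
    return {name: counts.get(name, 0) for name in id_to_name.values()}


# Hypothetical in-memory stand-in for an annotation file.
toy_coco = {
    "images": [{"id": 1, "file_name": "car_0001.jpg", "width": 1024, "height": 768}],
    "categories": [
        {"id": 1, "name": "hood"},
        {"id": 2, "name": "wheel"},
        {"id": 3, "name": "left_mirror"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100, 200, 300, 150]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [80, 500, 120, 120]},
        {"id": 12, "image_id": 1, "category_id": 2, "bbox": [700, 500, 120, 120]},
    ],
}

instance_counts = count_instances_per_category(toy_coco)
print(instance_counts)
```

With a downloaded annotation file, the same function would be called as `count_instances_per_category(load_coco(path))`, where `path` points to whichever JSON file holds the annotations.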

Experimental procedures and settings
We evaluated five algorithms, Mask R-CNN [12], HTC [33], CBNet [34], PANet [22] and GCNet [35], each with ResNet-50 and ResNet-101 as backbones, in terms of correctness and robustness on the car part data set. The algorithms were implemented with the MMDetection toolbox [41]. The experimental steps were as follows. First, we resized all input images to 1024 × 1024 pixels, maintaining the aspect ratio by zero-padding. Next, we randomly partitioned the car part data set into a training set (80% of the entire data set) and a test set (20%). Then, since it was necessary to determine the best number of epochs for training the model for every evaluated algorithm, we ran a five-fold cross-validation, training for 200 epochs on each fold. The best number of epochs for each algorithm was the number that provided the lowest average five-fold validation loss. Validation loss was computed from five types of loss: (1) loss in the classification task, (2) loss in the bounding box task, (3) loss in the segmentation task (mask loss), (4) RPN loss in the classification task and (5) RPN loss in the bounding box task. Validation losses (4) and (5) were calculated by a cross-entropy loss algorithm embedded in the RPN. Next, we used Stochastic Gradient Descent (SGD) to find the optimal parameters, setting the learning rate at 0.02 and the weight decay at 0.0001. The optimal models were trained with the training set for the optimal number of epochs. The experiment was run five times with different random splits. Furthermore, we evaluated the robustness of the algorithms for semantic segmentation and object detection tasks on corrupted data, simulating four real weather and lighting conditions, i.e., snow, frost, fog and ambient light, at five levels of severity. The corrupted examples were generated by methods described by Hendrycks and Dietterich [42] (visualized in Fig. 10).
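The data preparation steps above can be sketched in a few lines. This is an illustrative sketch, not the MMDetection pipeline used in the experiments: it assumes that the padding is placed at the right and bottom of the image, that the five folds are drawn from the 80% training portion, and that the per-epoch validation losses are already available as lists; all function names are hypothetical.

```python
import random


def letterbox_geometry(width, height, target=1024):
    """Scale the longer side to `target` and zero-pad the shorter side.

    Returns the resized size and the padding (right, bottom) in pixels.
    Placing the padding at the right/bottom is an assumption of this sketch.
    """
    longer = max(width, height)
    new_w = round(width * target / longer)
    new_h = round(height * target / longer)
    return {"size": (new_w, new_h), "pad": (target - new_w, target - new_h)}


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and split into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(items, k=5):
    """Yield (train, validation) pairs; every item is validated exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def best_epoch(val_losses_per_fold):
    """Return the 1-based epoch with the lowest validation loss averaged over folds."""
    n_folds = len(val_losses_per_fold)
    n_epochs = len(val_losses_per_fold[0])
    mean_loss = [
        sum(fold[e] for fold in val_losses_per_fold) / n_folds for e in range(n_epochs)
    ]
    return min(range(n_epochs), key=mean_loss.__getitem__) + 1


geometry = letterbox_geometry(1920, 1080)      # a landscape photo
train_ids, test_ids = train_test_split(range(500))
cv_folds = list(k_fold_splits(train_ids, k=5))
print(geometry, len(train_ids), len(test_ids), len(cv_folds))
```

Repeating `train_test_split` with five different `seed` values would mirror the five runs with different random splits.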

Correctness
Each algorithm was evaluated for average precision (AP), based on the COCO Challenge, an established evaluation method for object detection tasks. AP was calculated from the Intersection over Union (IoU) of each object of interest. IoU was calculated by

$$\mathrm{IoU} = \frac{\text{area of overlap}}{\text{area of union}}$$

where the overlap and the union are taken between the predicted region and the ground-truth region. A model was considered to have successfully detected an object if the IoU was equal to or higher than an assigned threshold. $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ denote the AP when the IoU threshold is 0.50 and 0.75, respectively. The mean average precision (mAP), based on the COCO Challenge, is the average of AP over the IoU thresholds from 0.50 to 0.95 in steps of 0.05, computed as:

$$\mathrm{mAP} = \frac{1}{10} \sum_{t \in \{0.50, 0.55, \ldots, 0.95\}} \mathrm{AP}_{t}$$

Since car parts have different sizes, we also evaluated the AP across scales of the car part, i.e., $\mathrm{AP}_{S}$ for small parts, with an area lower than $32^2$ pixels, $\mathrm{AP}_{M}$ for medium parts, with an area between $32^2$ and $96^2$ pixels, and $\mathrm{AP}_{L}$ for large parts, with an area greater than $96^2$ pixels. Note that AP on the COCO Challenge is reported in percent.
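The quantities used in this subsection can be illustrated with a small sketch. It shows the IoU of two axis-aligned boxes, the ten COCO IoU thresholds, the averaging step that turns per-threshold AP values into mAP, and the size buckets. It does not reproduce the full COCO evaluation (matching of predictions to ground truth and precision-recall integration are omitted), and the function names are chosen here for illustration only. For masks, the same ratio applies with pixel counts in place of box areas.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the COCO-style mAP.
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def is_detected(iou, threshold):
    """A prediction counts as a detection when its IoU reaches the threshold."""
    return iou >= threshold


def coco_map(ap_per_threshold):
    """Average the AP values obtained at each IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


def size_bucket(area):
    """Assign a part to the small / medium / large bucket by pixel area."""
    if area < 32 ** 2:
        return "small"
    if area > 96 ** 2:
        return "large"
    return "medium"


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap 1, union 7
print(iou, is_detected(iou, 0.50), size_bucket(50 * 50))
```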

Robustness
Robustness was measured using two metrics: mean performance under corruption (mPC) and relative performance under corruption (rPC) [43]. mPC is calculated as

$$\mathrm{mPC} = \frac{1}{N_c} \sum_{c=1}^{N_c} \frac{1}{N_s} \sum_{s=1}^{N_s} P_{c,s}$$

where $N_c = 4$ is the number of corruption types and $N_s = 5$ is the number of severity levels (as set in this work), and $P_{c,s}$ is the performance measure evaluated on test data corrupted with corruption type $c$ at severity level $s$.
Although several metrics could be used for $P$, in this work $P$ was calculated using mAP. A higher mPC indicates a more robust algorithm. rPC measures the relative degradation of performance on corrupted data compared to the original data. It is calculated by

$$\mathrm{rPC} = \frac{\mathrm{mPC}}{P_{\mathrm{original}}}$$

where $P_{\mathrm{original}}$ is the performance of the algorithm on the original data, that is, the mAP on the original data, and $\mathrm{rPC} \in [0, 1]$. rPC = 1 represents 'perfect' robustness, while 0 represents negligible robustness.
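As a worked illustration of the two metrics, the sketch below averages a table of mAP scores over severity levels and corruption types and then divides by the clean-data score. The scores are made up for the example and are not results from this study; the function names are likewise illustrative.

```python
def mean_performance_under_corruption(perf):
    """perf maps each corruption type to its scores at the severity levels."""
    per_corruption = [sum(levels) / len(levels) for levels in perf.values()]
    return sum(per_corruption) / len(per_corruption)


def relative_performance_under_corruption(mpc, p_original):
    """Fraction of the clean-data performance retained under corruption."""
    return mpc / p_original


# Hypothetical mAP values: 4 corruption types x 5 severity levels.
toy_scores = {
    "snow":       [30.0, 25.0, 20.0, 15.0, 10.0],
    "frost":      [35.0, 30.0, 25.0, 20.0, 15.0],
    "fog":        [50.0, 48.0, 45.0, 42.0, 40.0],
    "brightness": [54.0, 52.0, 50.0, 48.0, 46.0],
}
toy_clean_map = 56.0  # hypothetical mAP on uncorrupted test data

mpc = mean_performance_under_corruption(toy_scores)
rpc = relative_performance_under_corruption(mpc, toy_clean_map)
print(f"mPC = {mpc:.1f}, rPC = {100 * rpc:.1f}%")
```

Multiplying rPC by 100 gives the percentage form in which the metric is reported in the results.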

Experimental results and discussion
In this section, several comparisons are made and discussed:
1. We compare overall algorithm performance on two tasks: object detection and semantic segmentation.
2. We discuss robustness under potential real weather elements and lighting conditions.
3. We further discuss performance and robustness when left- and right-side parts are grouped under one label.

Overall performance of object detection and semantic segmentation tasks
The performances of all the algorithms are illustrated in Table 1. In terms of performance related to the size of the car part, the models performed better on large parts, followed by medium and small parts. The average $\mathrm{AP}_{L}$, $\mathrm{AP}_{M}$ and $\mathrm{AP}_{S}$ across all the models in the object detection task were 55.3, 46.9, and 32.1, respectively, and, in semantic segmentation, the scores were 59.2, 48.7, and 33.2, respectively. Larger parts led to better performance. Figure 9 shows a sample of object detection and semantic segmentation by the models with ResNet-50 and ResNet-101 encoders.
To determine which combination of model and encoder achieved the best overall performance, we used Kendall's coefficient of concordance ($W$) to measure agreement between evaluation metrics. We ranked the 10 candidate models (5 models with 2 encoders each) on 12 performance metrics (2 tasks with 6 metrics each). Then, we reported the sum of the ranks of each candidate model, as shown in Table 1, which leads to the ranking of the candidate models. Next, we calculated $W$ by

$$W = \frac{12 \sum_{i=1}^{n} R_i^2 - 3k^2 n (n+1)^2}{k^2 n (n^2 - 1) - kT}$$

where $n$ is the number of candidate models, $R_i$ is the sum of ranks for the $i$-th candidate, $k$ is the number of performance metrics, and $T$ is a correction factor based on tied ranks (see [44] for more details). Here, $n = 10$ and $k = 12$. Thus, $W = 0.5079$, which is transformed to a $\chi^2$ value for significance testing against a null hypothesis of no agreement:

$$\chi^2 = k (n - 1) W$$

Thus, $\chi^2 = 54.8350$, which leads to $p < 0.01$ for 9 degrees of freedom, and we rejected the null hypothesis. Therefore, we confirmed that HTC with ResNet-50 and HTC with ResNet-101 ranked first and second, respectively.
Fig. 9 Sample of object detection and semantic segmentation results: (a) ResNet-50 encoder and (b) ResNet-101 encoder
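The concordance computation can be sketched as follows. The sketch assumes the tie-corrected form of Kendall's $W$, with $T$ taken as the sum of $t^3 - t$ over all groups of $t$ tied ranks within each metric, and the chi-square statistic $k(n-1)W$ with $n-1$ degrees of freedom. The toy scores are invented for the example, and the function names are not from the original analysis.

```python
from collections import Counter


def average_ranks(scores, higher_is_better=True):
    """Rank candidates 1..n (1 = best); tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def tie_correction(rank_rows):
    """T: sum of (t^3 - t) over every group of t tied ranks in every metric."""
    return sum(t ** 3 - t for row in rank_rows for t in Counter(row).values())


def kendalls_w(rank_rows):
    """rank_rows[j][i] is the rank of candidate i under metric j."""
    k, n = len(rank_rows), len(rank_rows[0])
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    numerator = 12 * sum(r * r for r in rank_sums) - 3 * k ** 2 * n * (n + 1) ** 2
    denominator = k ** 2 * n * (n ** 2 - 1) - k * tie_correction(rank_rows)
    w = numerator / denominator
    chi_square = k * (n - 1) * w  # compared against chi-square with n - 1 d.o.f.
    return w, chi_square, rank_sums


# Hypothetical scores of 3 candidates on 2 metrics (higher is better).
toy_metric_scores = [[0.9, 0.7, 0.5], [0.8, 0.6, 0.4]]
toy_ranks = [average_ranks(row) for row in toy_metric_scores]
w, chi_square, rank_sums = kendalls_w(toy_ranks)
print(w, chi_square, rank_sums)
```

The candidate with the smallest rank sum is the overall best. For the setting in this section ($n = 10$), the statistic would be compared against a chi-square distribution with 9 degrees of freedom, whose 0.01 critical value is about 21.67.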

Robustness
In this section, the models used in the previous subsection were further evaluated. They were tested on the modified test data, which included the set of real weather elements and lighting conditions at different severity levels, as shown in Fig. 10. The overall robustness test results, for different types of noise in the object detection and semantic segmentation tasks, are given in Table 2. GCNet with the ResNet-50 encoder was the best contender; it achieved the highest robustness, based on rPC, in object detection at 64.8% and in semantic segmentation at 64.4%. It yielded the best mPC in all weather conditions for both object detection and semantic segmentation tasks, except for brightness changes in the object detection task. The worst was clearly CBNet with the ResNet-50 encoder, as it retained only 48.1% and 47.3% of the performance in the object detection and semantic segmentation tasks, respectively. In the object detection task, HTC with ResNet-101 achieved the highest mAP on normal-condition images; although it retained only 53.2% of that performance when the images were corrupted, its mPC of 28.9 still ranked second, after GCNet with ResNet-50. Moreover, HTC with ResNet-101 obtained the highest mPC under brightness changes, at 42.4. The same held for the semantic segmentation task: HTC with ResNet-101 ranked second in overall performance, based on mPC = 29.3, similar to Mask R-CNN with ResNet-101. We also found that snow and frost were the conditions that degraded all algorithms the most: they reduced performance to less than 50% of the performance without corruption in both tasks. However, the algorithms tolerated changes in brightness and fog conditions well: they were still able to keep performance at 79.3%

Merging left- and right-side car parts under one label
After evaluating overall performance and robustness, we ran an error analysis to seek a way to improve the task. We found that the algorithms often confused left- and right-side parts, e.g., predicting left_mirror as right_mirror or vice versa. Therefore, we created a new data set that assigned a single label to the left and right sides of a part. Then we fine-tuned each model pre-trained on the original labels for 100 epochs; other settings remained the same. Table 3 shows the overall performance on both object detection and semantic segmentation with left- and right-side part labels merged. All performances were higher than when left- and right-side parts were considered separately (Table 1): mAP increased by 5.76% for object detection and 5.27% for semantic segmentation for all models. The table shows that, in object detection, HTC with the ResNet-101 encoder yielded the highest mAP = 59.4, followed by HTC with the ResNet-50 encoder, with mAP = 59.1. HTC with ResNet-101 performed best on large car parts, with the highest $\mathrm{AP}_{L}$ at 65.4, while HTC with the ResNet-50 encoder achieved the best performance on small and medium car parts, with $\mathrm{AP}_{S}$ = 34.5 and $\mathrm{AP}_{M}$ = 53.5. In addition, HTC with ResNet-50 was the best contender on the strictest metric, with $\mathrm{AP}_{75}$ = 68.6. Although Mask R-CNN with ResNet-50 received the highest $\mathrm{AP}_{50}$ score, it was still worse than HTC with ResNet-50 or ResNet-101 on the strictest metric. In semantic segmentation, HTC with ResNet-101 also ranked first, with mAP = 60.1, followed by Mask R-CNN with ResNet-50 or ResNet-101. Mask R-CNN performed well in semantic segmentation, achieving the highest performance on $\mathrm{AP}_{50}$ = 81.9, $\mathrm{AP}_{75}$ = 71.3, and
We also evaluated algorithm robustness in the merged-sides scenario on both tasks, as shown in Table 4. The overall picture was very much the same as when the left and right sides were considered separately. GCNet was still the most robust algorithm, while the worst was CBNet. Moreover, snow and frost were still the most challenging corruptions for the algorithms.
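The label merging described in this subsection can be sketched for COCO-style annotations as follows. The naming rule (dropping the `left`/`right` token from each category name) and the resulting merged names such as `back_door` and `mirror` are assumptions made for illustration; the names used in the released merged data set may differ.

```python
PART_NAMES = [
    "back_bumper", "back_glass", "back_left_door", "back_left_light",
    "back_right_door", "back_right_light", "front_bumper", "front_glass",
    "front_left_door", "front_left_light", "front_right_door",
    "front_right_light", "hood", "left_mirror", "right_mirror",
    "tailgate", "trunk", "wheel",
]


def merged_name(name):
    """Drop the side token, e.g. 'back_left_door' -> 'back_door' (assumed rule)."""
    return "_".join(tok for tok in name.split("_") if tok not in ("left", "right"))


def merge_left_right(coco):
    """Return a copy of a COCO-style dict with left/right categories merged."""
    new_ids = {}      # merged category name -> new category id
    old_to_new = {}   # original category id -> new category id
    for cat in coco["categories"]:
        name = merged_name(cat["name"])
        new_ids.setdefault(name, len(new_ids) + 1)
        old_to_new[cat["id"]] = new_ids[name]
    merged = dict(coco)
    merged["categories"] = [{"id": i, "name": n} for n, i in new_ids.items()]
    merged["annotations"] = [
        {**ann, "category_id": old_to_new[ann["category_id"]]}
        for ann in coco["annotations"]
    ]
    return merged


# Hypothetical annotation set: one left mirror and one right mirror.
toy_coco = {
    "categories": [{"id": i + 1, "name": n} for i, n in enumerate(PART_NAMES)],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 14},  # left_mirror
        {"id": 2, "image_id": 1, "category_id": 15},  # right_mirror
    ],
}
merged_coco = merge_left_right(toy_coco)
print(len(merged_coco["categories"]), [c["name"] for c in merged_coco["categories"]])
```

Under this assumed rule, both mirror annotations end up with the same category id, which is the effect the merged-label experiment relies on.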

Conclusion
Computer network technology and end-devices are becoming more powerful. Also, the car insurance business is rapidly growing. Thus, an automated system for damage evaluation is necessary. In this work, we described an automatic, image-based car part identification system that uses deep learning techniques. We compared the performance of several state-of-the-art deep learning algorithms on a part segmentation task, using a car part data set created for this work that is now publicly available. Our experiments showed that HTC was the best model, followed by Mask R-CNN and GCNet, in both object detection and semantic segmentation tasks under normal weather conditions. We also evaluated algorithm robustness under real environmental and lighting conditions, simulating conditions that would occur in the field when a photo is taken with a smartphone. GCNet was the most robust model, because it achieved the best performance overall and in each real condition, except under varying brightness. Edge computing has become more practical and able to overcome the limitations of end-devices. Therefore, edge computing can enable the models to operate on the end-device, leading to a solution for real-time image analysis. In future work, we will focus on developing a lighter-weight model for semantic segmentation to ease the load on the end-device without sacrificing accuracy and robustness. We also aim to extend the work to detect, localize and estimate the severity of damage on different parts.