1 Introduction

Industrial surface defect detection is a critical step in ensuring product quality. To relieve inspectors of laborious work and improve the consistency of inspection, much effort has been dedicated over the past decades to automating the inspection process with computer vision approaches [1,2,3]. However, conventional computer vision approaches have been limited in their effectiveness by varying illumination conditions and by similarities between surface textures and defects. Recently, Industry 4.0 has become a popular notion for achieving intelligent manufacturing [4], wherein manufacturing technologies shift from automation to “smart manufacturing,” with the goal of reaching the best performance and highest efficiency in the production process.

Nowadays, with the rising popularity of deep learning techniques for visual recognition, deep learning–based defect detection has been extensively applied to surface defect inspection systems [5,6,7,8,9,10,11,12].

Deep learning–based defect inspection systems fall into three categories, based on classification, object detection, and object segmentation. Classification-based defect inspection systems [13,14,15,16,17,18,19,20,21] categorize an input image as defect or non-defect and calculate the class probability using stacks of convolutional neural networks. This approach is simple, but it cannot localize defects. To localize various defects in an input image, object detection–based defect inspection systems, called defect detectors, were developed [5, 7, 11, 22, 23]. With this approach, defect locations are predicted as bounding boxes, along with defect class labels and confidence scores. There are two types of object detection–based defect detectors [24]. The first is the one-stage method [12, 25,26,27,28,29,30,31], which classifies and localizes defects in a single network pass; it achieves fast inference at the cost of lower precision. The second is the two-stage method [7, 12, 32,33,34], which first generates region proposals for possible defect locations and then passes them along a pipeline for defect classification and bounding-box regression; it is slower but reaches higher detection accuracy. Finally, segmentation-based defect inspection systems [6, 12, 35] identify the shape of defects with a pixel-wise mask. This provides a far more granular understanding of the defects in an image, but inference is much slower than with object detection–based methods because pixel-level defect and non-defect predictions are required. Based on this discussion, object detection–based defect detection offers a better trade-off between accuracy and complexity and is hence more suitable for industrial applications.

In this paper, we propose a novel object detection–based defect detection system, called the “forceful defect detector” (FDD). The proposed FDD begins by enriching the point-of-view of input images: we propose a data preprocessing pipeline which involves a random scaling scheme [36] in the training stage and an ultimate scaling technique in the inference stage. Next, we adopt Cascade R-CNN [32] as the object detection network and enhance it with deformable operations and a guided anchoring region proposal network (GA-RPN) [37]. Finally, our experimental studies show that the proposed method achieves higher defect detection accuracy on well-known defect datasets [7, 12, 38] than existing models while maintaining a processing speed that meets the standard for steel surface inspection systems [39, 40]. The remainder of this paper is organized as follows. Section 2 presents related work on defect detection. In Section 3, we elaborate on the proposed FDD in detail. In Section 4, the effectiveness of the proposed method is demonstrated through experimental studies. Section 5 concludes the paper.

2 Related works

Object detection–based defect inspection systems locate the defective area in an image by generating bounding boxes. State-of-the-art one-stage object detectors include the you-only-look-once (YOLO) family and transformer-based networks. For example, YOLOv4 [26] improves on the classical YOLOv3 [41] by selecting suitable components to enhance detection performance, such as Mosaic data augmentation, CSPDarknet53 [42] with the spatial pyramid pooling (SPP) block [43] as the backbone, and the PANet [44] path-aggregation neck, followed by the YOLOv3 anchor-based prediction head. More recently, YOLOv5 [27] improved the inference speed of YOLOv4, and YOLOX [28] incorporated the anchor-free concept to improve the accuracy of YOLOv3’s prediction head.

Other one-stage object detection methods also improved YOLOv3 by incorporating new elements. For example, CP-YOLOv3-dense [31] combined YOLOv3 with the Dense Convolutional Network (DenseNet) [45] to detect surface defects of steel strips; the model receives multi-layer convolutional features output by the densely connected blocks before making predictions, thereby enhancing feature reuse and feature fusion. A lightweight end-to-end network for surface defect detection (LSSDN) [46] was proposed to identify defects on textured surfaces. This lightweight network comprises three major parts: a stem part that quickly reduces the size of the feature maps, a trunk part composed of three stages for multi-level feature extraction, and a YOLOv3 detection part.

In addition, RetinaNet [47] introduced the focal loss to address the class imbalance problem, and DEA_RetinaNet [30], a RetinaNet variant with difference channel attention and adaptively spatial feature fusion, improved its performance for steel defect detection. DCC-CenterNet [29], a variant of CenterNet [48] with a dilated feature enhancement model (DFEM), center-weight, and CIoU loss, enlarges the receptive field of features so that the network can effectively detect defects of different scales.

In recent years, the transformer [49] has emerged as an effective architecture to explore global correlations among a sequence of inputs. It was successfully applied to image classification [50], semantic segmentation [51, 52], and object detection [25, 53]. The pyramid vision transformer (PVT) [54] combined transformer layers with the feature pyramid network (FPN) [55] to extract features and adopted the RetinaNet [47] as the detection head. The updated version PVTv2 [25] improved PVT by adding three designs: overlapping patch embedding, convolutional feed-forward networks, and linear complexity attention layers.

In contrast to one-stage object detectors, two-stage object detectors can achieve higher detection accuracy. Among well-known two-stage object detectors, Faster R-CNN [56] uses the FPN as a neck between the backbone network and the prediction head to enable multi-scale feature extraction. Cascade R-CNN [32] improved upon Faster R-CNN by proposing iterative prediction heads with IoU thresholds increasing from small to large; this iterative process uses cascade regression as a sampling procedure to pass good samples from one stage to the next. DetectoRS [33] improved the FPN structure with a recursive feature pyramid and switchable atrous convolution, sequentially repeating the FPN process and using feedback signals to improve the accuracy of each stage.

For defect detection, the two-stage defect detection network (DDN) [7] proposed a multi-level feature fusion network (MFN) that integrates lower-level and higher-level features to retain more location details of defects. The defect inspection network (DIN) [34] utilized deformable convolution [57], a balanced feature pyramid [58], and a Fast R-CNN head [59] to accommodate steel defects with arbitrary shapes. Another two-stage model [60] was developed to handle complicated defect detection on steel surface datasets, which pose critical challenges such as vague and tiny defects. Moreover, the inspection model proposed in [22] combined the two-stage method with an attention network to detect defects in solar cell electroluminescence (EL) images, a challenging task on the manufacturing side due to the similarity between background and foreground features.

3 The proposed forceful defect detector

Our proposed defect inspection system, named the forceful defect detector (FDD), consists of a data preprocessing pipeline involving a random scaling scheme for model training, the baseline detector, deformable operations, guided anchoring, and an ultimate scaling scheme for the inference stage. The design is motivated by the following considerations. First, to enrich the point-of-view of input images, we propose a data preprocessing pipeline involving a random scaling scheme [36] in the training stage, which randomly resizes the input image while maintaining its original aspect ratio; in the inference stage, we propose the ultimate scaling technique to scale the input image to an optimal size. Second, to ensure high-quality feature extraction, our proposed network begins with a 5-stage feature pyramid network (FPN) [55], each stage involving the aggregated residual transformation block (ResNeXt) [61]. Third, instead of standard convolution [62], the proposed system utilizes deformable convolution [57] at stages 3 to 5 and deformable region of interest (RoI) pooling, which are better suited to extracting features from the geometric shape of a defect. Fourth, we replace the general region proposal network (RPN) head with the guided anchoring region proposal network (GA-RPN) [37] to generate more precise bounding boxes. The remainder of this section discusses each part of the system in detail.

3.1 Data preprocessing pipeline

As shown in Fig. 1, steel defects usually have different features, such as various scales, directions, and shapes. To train a detection model robust to these features, we propose a data preprocessing pipeline as shown in Fig. 2. The first two blocks are the standard image loading and annotation. We followed the Pascal VOC format [63] to generate bounding box annotations for the datasets.

Fig. 1
figure 1

Owing to the randomness of defects, a and d show varying scales, b and e show varying directions, and c and f show arbitrary shapes

Fig. 2
figure 2

The proposed data preprocessing pipeline

The third block in Fig. 2 is our proposed random scaling scheme, which helps train a more generalized model for detecting defects of various scales.

As shown in Fig. 3, image resizing is widely used as an important technique in deep learning–based object detection. Three image resizing methods exist: uniform scaling [64], progressive resizing [65], and sampling scaling [36]. The uniform scaling method shown in Fig. 3a resizes all original training images to a single size. For object detection, this often loses key defect features, because the network only learns defect features at one fixed size and may therefore identify only small or only large defects while ignoring the others. Progressive resizing, on the other hand, scales the input image up through three different sizes: the network is trained at the first size and then fine-tuned at the other two. It has become a powerful way to boost object detection performance [36] and was recently adopted to help screen Covid-19 cases [66]. In contrast, the sampling scaling method shown in Fig. 3b randomly resizes the input images and their bounding boxes to different sizes in different epochs during training. This helps in detecting defects of different sizes while avoiding the high computational complexity that progressive resizing incurs by using multiple scales for the same input image.

Fig. 3
figure 3

a Uniform scaling; b sampling scaling

Inspired by the idea of sampling scaling, in the third block of Fig. 2, we propose a random scaling scheme that randomly resizes the input image while maintaining its original aspect ratio. For example, Fig. 4a shows that if an input image has an aspect ratio \(\alpha =w:h\), where \(w\) and \(h\) are the original width and height of the image, then the new height \({h}^{^{\prime}}\) is randomly chosen from a set of integers as in Eq. (1), and the new width is determined by Eq. (2).

Fig. 4
figure 4

The proposed a random scaling for training and b ultimate scaling for inference

$$h^{\prime}=\mathrm{randint}\left(h_1,\;\dots\;,h_n\right)$$
(1)
$$w^{\prime}=\mathrm{round}\left(\alpha \times h^{\prime}\right)$$
(2)

Once we have resized the input image, the height and width of each bounding box in the original input image are also resized by the following:

$${h}_{B}^{^{\prime}}=round\left({h}_{B}\times \frac{{h}^{^{\prime}}}{h}\right), {w}_{B}^{^{\prime}}=round\left({w}_{B}\times \frac{{w}^{^{\prime}}}{w}\right).$$
(3)

where \({h}_{B}\) and \({w}_{B}\) are the original height and width of the bounding box, and \(h'_B\) and \(w'_B\) are the resized height and width of the bounding box.
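
To make the scheme concrete, below is a minimal sketch of the random scaling step in Python; the use of PIL and the function name random_scale are our own illustrative choices, not part of the original implementation.

```python
import random
from PIL import Image

def random_scale(image: Image.Image, boxes, heights):
    """Randomly rescale an image and its bounding boxes per Eqs. (1)-(3).

    image:   PIL image of size (w, h).
    boxes:   list of (x_min, y_min, x_max, y_max) tuples.
    heights: the candidate heights {h_1, ..., h_n} of Eq. (1).
    """
    w, h = image.size
    alpha = w / h                       # aspect ratio alpha is preserved
    new_h = random.choice(heights)      # Eq. (1)
    new_w = round(alpha * new_h)        # Eq. (2)
    image = image.resize((new_w, new_h))

    # Eq. (3): scale box coordinates by the same height/width ratios
    # (scaling the corners is equivalent to scaling h_B and w_B).
    sx, sy = new_w / w, new_h / h
    boxes = [(round(x0 * sx), round(y0 * sy),
              round(x1 * sx), round(y1 * sy)) for x0, y0, x1, y1 in boxes]
    return image, boxes
```

For the Severstal settings of Table 1, for example, the candidate heights would range from 100 up to the original 256.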

The fourth block in Fig. 2 performs horizontal flipping, followed by a padding operation in the fifth block. The sixth block applies shift-scale-rotate, and the last block normalizes the input image.
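
The paper does not name an augmentation library; the following is a minimal sketch of blocks four through seven using Albumentations, whose transforms map directly onto these operations (the specific probability and limit values are illustrative assumptions).

```python
import albumentations as A

# Blocks 4-7 of Fig. 2: flip, pad, shift-scale-rotate, normalize.
# Parameter values here are illustrative assumptions.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.PadIfNeeded(min_height=256, min_width=1600),
        A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.1,
                           rotate_limit=15, p=0.5),
        A.Normalize(),  # ImageNet mean/std by default
    ],
    # Bounding boxes follow the Pascal VOC convention used for annotation.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```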

3.2 The baseline detector

Figure 5 shows our proposed forceful steel defect detector (FDD). We adopt the two-stage detector Cascade R-CNN [32] as our baseline architecture. First, the FPN backbone extracts and combines features of small and large scales, which correspond to small and large defects, respectively.

Fig. 5
figure 5

The proposed forceful defect detector network, where M is the preprocessing blocks in the training stage and ultimate scaling in the inference stage, DEF-FPN is the backbone feature pyramid network with deformable convolution, GA-RPN is the guided anchoring region proposal network head, def pool is the deformable RoI pooling, H is the R-CNN head, C is the output class label, and B is the output bounding box

Next, the RPN generates proposals based on the multi-scale feature maps produced by the FPN. An output feature map indicates whether a defect is present at a particular location in the input image, together with its estimated size. This is done by placing a set of “anchors” on the output feature map of the backbone network; the anchors represent bounding boxes of different sizes and aspect ratios that could contain a defect at that location.

Prior R-CNN baselines such as Faster R-CNN [56] define RoIs as defect proposals having at least 0.5 IoU with a ground-truth bounding box. In contrast, Cascade R-CNN uses three successive R-CNN modules, as shown in Fig. 5, which refine the proposals sequentially with increasing IoU thresholds of 0.5, 0.6, and 0.7 and generate the final predicted bounding boxes and defect classes.
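
Conceptually, the cascade can be summarized by the sketch below; heads, assign_targets, roi_pool, and apply_deltas are hypothetical helpers standing in for the usual detection components, so this illustrates the control flow rather than the actual implementation.

```python
# Hypothetical sketch of Cascade R-CNN's refinement loop.
def cascade_forward(feats, proposals, gt_boxes, heads):
    losses = []
    for head, iou_thr in zip(heads, (0.5, 0.6, 0.7)):
        # Proposals matched at an increasing IoU threshold provide
        # higher-quality positives for each successive, stricter stage.
        targets = assign_targets(proposals, gt_boxes, iou_thr)   # hypothetical
        cls_scores, deltas = head(roi_pool(feats, proposals))    # hypothetical
        losses.append(head.loss(cls_scores, deltas, targets))
        proposals = apply_deltas(proposals, deltas)  # refined boxes feed the next stage
    return proposals, losses
```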

Furthermore, we improve the baseline Cascade R-CNN with the following components. First, we process the original input image by module “M,” which represents the preprocessing blocks of Fig. 2 during the training stage and the ultimate scaling scheme during the inference stage; the ultimate scaling scheme is introduced in Section 3.6. Second, we propose a deformable feature pyramid network (DEF-FPN) with deformable convolutions and use deformable RoI pooling in the three cascaded R-CNN modules instead of regular RoI pooling. Third, we adopt the guided anchoring RPN (GA-RPN) instead of the regular RPN.

3.3 Deformable operations

Inspired by the deformable convolutional network (DCN) [57], our proposed FDD, as shown in Fig. 5, adopts deformable convolution in the FPN module and deformable RoI pooling in the cascaded R-CNN modules. Both augment the spatial sampling locations of their regular counterparts with additional offsets, and these offsets are learned without additional supervision. The regular convolution operation is written in Eq. (4), where \(X\) denotes the input feature map, \(Y\) denotes the output feature map, \(W\) denotes the weights, \({L}_{0}\) is a location in \(Y\), and \(\mathcal{K}\) denotes a regular sampling grid. For instance, \(\mathcal{K}=\{(-1,-1), (-1,0), \dots, (0,1), (1,1)\}\) defines a \(3\times 3\) kernel with a dilation rate of 1, and \({L}_{n}\) enumerates the locations in \(\mathcal{K}\). In deformable convolution, formulated as Eq. (5), the regular sampling grid \(\mathcal{K}\) is augmented by offsets \(\Delta {L}_{n}\), which are generated by a sibling branch of the regular convolution. As shown in Fig. 6b, the 2D offsets added to the regular grid sampling locations make the convolution highly adaptive to the geometric variation of the defect.

Fig. 6
figure 6

a The defect location. b The deformable convolution sampling the defect location

The process of deformable RoI pooling is similar: offsets are added to the spatial binning positions of the pooling operation. As shown in Eq. (6), regular RoI pooling splits the RoI into \(s\times s\) bins and produces an \(s\times s\) feature map \((0\le i,j < s)\), where \(X\) is the input, \({L}_{0}\) is the top-left corner of the RoI, \(bin(i,j)\) is the set of locations in the \({(i,j)}^{th}\) bin, and \({n}_{i,j}\) is the number of pixels in the bin. The generated offsets \(\{\Delta {L}_{i,j} \mid 0\le i,j < s\}\) are then added to the spatial binning positions as in Eq. (7) to obtain the deformable RoI pooling operation.

$$Y\left({L}_{0}\right)= \sum_{{L}_{n}\in \mathcal{K}}W\left({L}_{n}\right) \cdot X({L}_{0}+ {L}_{n}).$$
(4)
$$Y\left({L}_{0}\right)= \sum_{{L}_{n}\in \mathcal{K}}W\left({L}_{n}\right) \cdot X({L}_{0}+ {L}_{n}+ \Delta {L}_{n}).$$
(5)
$$Y\left(i,j\right)= \sum_{{L}_{n} \in bin(i,j)}\frac{X\left({L}_{0}+ {L}_{n}\right)}{{n}_{i,j}}.$$
(6)
$$Y\left(i,j\right)= \sum_{{L}_{n} \in bin(i,j)}\frac{X\left({L}_{0}+ {L}_{n}+ \Delta {L}_{i,j}\right)}{{n}_{i,j}}.$$
(7)
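
As an illustration of Eq. (5), the following is a minimal sketch wiring a deformable convolution with torchvision’s DeformConv2d, where the offsets \(\Delta {L}_{n}\) come from a sibling regular convolution; the channel sizes and module name are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Deformable 3x3 convolution per Eq. (5): a sibling regular conv
    predicts the 2D offsets added to the regular sampling grid K."""

    def __init__(self, in_ch=256, out_ch=256):
        super().__init__()
        # 2 offsets (dx, dy) for each of the 3x3 = 9 sampling locations.
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        offsets = self.offset_conv(x)        # Delta L_n in Eq. (5)
        return self.deform_conv(x, offsets)

# Example: a 256-channel FPN feature map.
y = DeformBlock()(torch.randn(1, 256, 64, 64))
```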

3.4 Guided anchoring RPN

In defect detection problems, we need to account for the non-uniform distribution of defect locations and shapes. Therefore, we adopt region proposal via guided anchoring (GA) [37], which works as follows. First, the anchor location prediction branch yields a probability map predicting the locations where the centers of objects of interest are likely to exist [56]. This step dramatically reduces the number of anchors compared with the sliding-window scheme. Next, the anchor shape prediction branch determines the shape of the object that may exist at each location. Finally, an anchor-guided feature adaptation component transforms the feature at each individual location based on the underlying anchor shape, using a 3 × 3 deformable convolution layer, so that the features and the anchors match more closely.
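
A minimal sketch of the two GA-RPN prediction branches, following the design described above (the channel sizes and class name are our own assumptions):

```python
import torch.nn as nn

class GuidedAnchoringHead(nn.Module):
    """Sketch of the two GA-RPN branches of [37]: a 1x1 conv predicting
    a center-probability map and a 1x1 conv predicting the anchor shape
    (w, h) at each location; feature adaptation (a 3x3 deformable conv
    conditioned on the predicted shape) is omitted for brevity."""

    def __init__(self, in_ch=256):
        super().__init__()
        self.loc_branch = nn.Conv2d(in_ch, 1, kernel_size=1)    # P(center)
        self.shape_branch = nn.Conv2d(in_ch, 2, kernel_size=1)  # (dw, dh)

    def forward(self, feat):
        loc_prob = self.loc_branch(feat).sigmoid()
        shape = self.shape_branch(feat)
        return loc_prob, shape
```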

3.5 Loss functions

The proposed architecture is optimized in an end-to-end training procedure that minimizes the sum of the GA-RPN head loss \({\mathcal{L}}_{rga}\) and the Cascade R-CNN loss \({\mathcal{L}}_{rcnn}\). More specifically, the GA-RPN loss \({\mathcal{L}}_{rga}\) in Eq. (8) is a multi-task loss, in which the conventional classification loss \({\mathcal{L}}_{cls}\) and regression loss \({\mathcal{L}}_{reg}\) are combined with the anchor localization loss \({\mathcal{L}}_{loc}\) (measuring how well the center of the foreground is predicted) and the anchor shape loss \({\mathcal{L}}_{shp}\) (measuring how well the width and height of the anchor are predicted). The three-stage Cascade R-CNN loss \({\mathcal{L}}_{rcnn}\) in Eq. (9) comprises the classification loss \({\mathcal{L}}_{cls1}\) and regression loss \({\mathcal{L}}_{reg1}\) of H1, the classification loss \({\mathcal{L}}_{cls2}\) and regression loss \({\mathcal{L}}_{reg2}\) of H2, and the classification loss \({\mathcal{L}}_{cls3}\) and regression loss \({\mathcal{L}}_{reg3}\) of H3. The total loss \(\mathcal{L}\) in Eq. (10) is the sum of \({\mathcal{L}}_{rga}\) and \({\mathcal{L}}_{rcnn}\).

$${\mathcal{L}}_{rga}= {\mathcal{L}}_{cls}+ {\mathcal{L}}_{reg}+ {\mathcal{L}}_{loc}+ {\mathcal{L}}_{shp}.$$
(8)
$${\mathcal{L}}_{rcnn}= {\mathcal{L}}_{cls1}+ {\mathcal{L}}_{reg1}+ {\mathcal{L}}_{cls2}+ {\mathcal{L}}_{reg2}+ {\mathcal{L}}_{cls3}+ {\mathcal{L}}_{reg3}.$$
(9)
$$\mathcal{L}={\mathcal{L}}_{rga}+ {\mathcal{L}}_{rcnn}.$$
(10)

3.6 Ultimate scaling

In object detection [44, 64, 67,68,69,70,71,72,73,74,75,76], multi-scale testing is often employed in the inference stage: a test image is resized into several scales, each re-scaled image goes through the detection network, and the final result is obtained by a voting mechanism. Although this significantly increases accuracy, it also increases inference time. Instead of such a multi-scale testing scheme, we propose a more efficient technique in the inference stage, shown in Fig. 4b, called “ultimate scaling.” After the model is trained, we apply it to the training set and determine the optimal image size by trying all sizes within the scale range of the training images, with a step of 100 pixels along the height direction. The size yielding the highest mAP on the training set is selected as the optimal size, and all test images are resized to this size before testing. The combination of random scaling in the training stage and ultimate scaling in the inference stage effectively improves detection accuracy.
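
The selection procedure amounts to a simple grid search, sketched below; evaluate_map and resize_dataset are hypothetical helpers, so this illustrates the procedure rather than reproducing the actual code.

```python
def ultimate_scale(model, train_set, h_min, h_max, aspect, step=100):
    """Pick the test-time input size: try every height in the training
    scale range at a 100-pixel step and keep the one maximizing mAP on
    the training set (evaluate_map/resize_dataset are hypothetical)."""
    best_h, best_map = h_min, -1.0
    for h in range(h_min, h_max + 1, step):
        w = round(aspect * h)
        current = evaluate_map(model, resize_dataset(train_set, (w, h)))
        if current > best_map:
            best_h, best_map = h, current
    # All test images are then resized to this optimal (w, h).
    return round(aspect * best_h), best_h
```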

4 Experimental results

4.1 Parameter configuration and datasets

Our proposed FDD is implemented in PyTorch and trained on a single NVIDIA GeForce RTX 2080Ti GPU. The batch size is 1. The initial learning rate is set to 0.001 and is reduced by a factor of 10 at epochs 16 and 19. The total number of epochs is 40.
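
In PyTorch terms, the stated schedule corresponds to something like the sketch below; the paper specifies only the learning-rate values and epochs, so the optimizer choice and the model, loader, and train_one_epoch names are assumptions.

```python
import torch

# Schedule from Section 4.1: start at 1e-3, divide by 10 at epochs 16
# and 19, train 40 epochs with batch size 1. The optimizer type and the
# model/loader/train_one_epoch objects below are placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[16, 19], gamma=0.1)

for epoch in range(40):
    train_one_epoch(model, loader, optimizer)  # hypothetical helper
    scheduler.step()
```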

In this experiment, we use three well-known surface defect datasets. The first, Severstal, is a steel defect dataset [38]. As shown in Fig. 7, it has 4 defect categories. According to a study in [77], class 1 (marked in yellow) features pitted surface defects, class 2 (marked in blue) crazing defects, class 3 (marked in purple) scratch defects, and class 4 (marked in red) patch defects. We use 5332 images for training and 667 images for testing. The original labels in this dataset are pixel-wise annotations; to perform defect detection, we converted them into bounding box labels in the Pascal VOC format [78].
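
The mask-to-box conversion can be done per connected component; below is a minimal sketch using OpenCV, which is our own assumption since the paper does not describe its tooling.

```python
import cv2
import numpy as np

def masks_to_boxes(mask: np.ndarray):
    """Convert a binary defect mask to Pascal VOC boxes
    (x_min, y_min, x_max, y_max), one per connected component."""
    num, labels = cv2.connectedComponents(mask.astype(np.uint8))
    boxes = []
    for k in range(1, num):  # label 0 is the background
        ys, xs = np.where(labels == k)
        boxes.append((int(xs.min()), int(ys.min()),
                      int(xs.max()), int(ys.max())))
    return boxes
```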

Fig. 7
figure 7

The Severstal defect dataset

The second dataset is the Northeastern University (NEU) surface defect dataset [7], shown in Fig. 8. It contains 1800 images of hot-rolled steel strips belonging to six classes: crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. Following [7], we use 1260 images as the training set and 540 images as the test set. To evaluate the performance of our proposed FDD on general defects in addition to steel surface defects, we also use the synthesized DAGM dataset [12] (named after the Deutsche Arbeitsgemeinschaft für Mustererkennung e.V., the German chapter of the International Association for Pattern Recognition), shown in Fig. 9. Each of its six classes contains 150 defect images for developing surface defect inspection systems. The anomalies in this dataset occur on a variety of statistically textured backgrounds, as described in [77], and vary in shape and size, making them difficult to distinguish from the complex textures. The dataset thus reflects real-world defects and can demonstrate the effectiveness of our proposed method on general defects. There are 900 images in total; we adopt 720 for training and 180 for testing.

Fig. 8
figure 8

The NEU defect dataset. a Crazing. b Inclusion. c Patches. d Pitted surface. e Rolled-in scale. f Scratches

Fig. 9
figure 9

The DAGM defect dataset. a Class 1. b Class 2. c Class 3. d Class 4. e Class 5. f Class 6

4.2 Performance evaluation metrics

For each dataset, the defect detection accuracy is evaluated by the average precision (AP) of each defect class, the mean average precision (mAP) over all classes, and the average recall (AR).

For a specific defect class, the precision and recall are first calculated according to Eqs. (11) and (12), where TP, FP, and FN represent the number of true positives, false positives, and false negatives of this class, respectively. The average precision (AP) is then computed over different levels of recall achieved by varying the confidence score threshold.

$$Precision=\frac{TP}{TP+FP}=\frac{TP}{all\;detections},$$
(11)
$$Recall=\frac{TP}{TP+FN}=\frac{TP}{all\;groundtruths}.$$
(12)

Finally, we calculate the mean of the AP across all defect classes, resulting in the mAP value, and calculate the mean of the Recall across all defect classes as the AR value.
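
For concreteness, a minimal sketch of the per-class AP computation from ranked detections is given below (all-point interpolation; matching detections to ground truths is assumed to be done upstream, and the function name is our own):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: rank detections by confidence, sweep the
    threshold, and integrate precision over recall (Eqs. 11-12)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # Make precision monotonically non-increasing, then take the
    # area under the resulting precision-recall curve.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```

The mAP is then the mean of this quantity over all defect classes.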

4.3 Objective performance and discussion

Table 1 shows the scaling parameters for each dataset in our experimental studies, which are applied to random scaling in the training stage and ultimate scaling in the inference stage. We set \(w_1\), \(h_1\), \(w_n\), and \(h_n\) based on the image sizes of the three datasets. The original images of the Severstal dataset have a high resolution: \((w,h)=(1600, 256)\). To enrich the scale variation of the training set, we set the initial width and height \((w_1, h_1)\) to \((625, 100)\), smaller than the original resolution, and set the maximum width and height \((w_n, h_n)\) to the original size, so the size of the training images varies from small to large. The second dataset, NEU, has a small width and height of (200, 200). Since low-resolution images do not have a positive impact on the detection result, for this dataset the initial width and height \((w_1, h_1)\) are set to (500, 500) and the maximum width and height \((w_n, h_n)\) to (1200, 1200). The third dataset, DAGM, for general defects, has a medium resolution: the original width and height \((w, h)\) are (512, 512). For this dataset, the initial width and height \((w_1, h_1)\) are set to (400, 400) and the maximum width and height \((w_n, h_n)\) to (1200, 1200). These ranges allow the network to learn from lower- to higher-resolution images.

Table 1 The scaling set parameters

A validation study of each component of the proposed FDD model was conducted on the Severstal dataset, as shown in Table 2. We adopt Cascade R-CNN [32] with a ResNet-50 backbone [79] as the baseline, which achieves an AR of 85.5% and an mAP of 67.5%. Adding the proposed preprocessing blocks (including random scaling) and ultimate scaling improves the AR and mAP to 96.7% and 72.2%, respectively. This illustrates the benefit of the proposed random scaling operation in the training stage, which enhances the variation of defect features, and of the ultimate scaling in the testing stage, which selects the optimal size for the input image. After integrating the proposed deformable convolution, deformable RoI pooling, and guided anchoring region proposal network (GA-RPN), the AR and mAP further increase to 96.9% and 78.3%. This improvement is attributable to the deformable operations adapting to the geometric variation of defect shapes and to GA-RPN localizing the bounding boxes more precisely.

Table 2 The ablation study for accuracy improvement of FDD

In Tables 3, 4 and 5, we objectively compare the performance of our proposed FDD method to state-of-the-art methods: YOLOv4 [26], YOLOv5 [27], YOLOX [28], PVTv2 [25], DetectoRS [33], DDN [7], CP-YOLOv3-dense [31], DEA_RetinaNet [30], DIN [34], DCC-CenterNet [29], and Deep Reg [12]. Our proposed method outperformed these methods by significant margins in terms of the AR and mAP values.

Table 3 Comparison results for the Severstal dataset
Table 4 Comparison results for the NEU dataset
Table 5 Comparison results for the DAGM dataset

Table 3 shows that the proposed method achieved an AR of 96.9% and an mAP of 78.3% with the ResNet-50 backbone, surpassing the accuracy of YOLOv4 [26], YOLOv5 [27], YOLOX [28], PVTv2 [25], and DetectoRS [33].

Table 4 shows the comparison study on the NEU steel surface defect dataset, for which the optimal test image size chosen by the ultimate scaling process is \(700\times 700\). The DDN [7], DIN [34], and DCC-CenterNet [29] models run on the standard ResNet-50 backbone and achieved mAPs of 82.3%, 80.5%, and 79.4%, respectively. Cascade R-CNN [32], also with the ResNet-50 backbone, achieved an AR of 95.7% and an mAP of 79.3%. DEA_RetinaNet ran on the deeper ResNet-152 backbone and achieved an mAP of 79.1%, while the CP-YOLOv3-dense model [31] integrated YOLOv3 with the DenseNet backbone to achieve an AR of 82.3% and an mAP of 76.7%. In contrast, although our proposed FDD is implemented with the standard ResNet-50 backbone, it achieved the highest mAP of 83.4% and the highest AR of 99.7% among all compared methods.

With the aim of developing a widely applicable defect detection system, the proposed method was also tested on the DAGM dataset with an image size of \(800\times 800\) as chosen by the ultimate scaling scheme. This experiment mimics a real-world defect detection scenario. As shown in Table 5, the segmentation-based defect detection system Deep Reg [12] achieved an AR of 99% and an mAP of 98% on this dataset, while Cascade R-CNN [32] achieved an AR of 99.7% and an mAP of 98.7%. The proposed FDD, implemented with the ResNet-50 backbone, achieved an AR of 100% and an mAP of 100%.

Moreover, as shown in Table 6, the proposed FDD with the ResNet-50 backbone achieves an inference speed of 12 frames per second (fps) on a single NVIDIA GeForce RTX 2080Ti GPU, which is on par with the state-of-the-art DetectoRS [33] and meets the criteria for steel surface inspection systems explained in [39, 40], namely an inference speed of at least 10 fps. Although the other methods in Table 6 achieve higher speeds, Tables 3, 4 and 5 show that they have lower detection accuracy than our proposed FDD.

Table 6 Comparison results for run time

4.4 Subjective performance and discussion

Figures 10, 11 and 12 show sample testing results that demonstrate the subjective performance of the proposed FDD. The system accurately detects defects across small, medium, and large scales. Each image in Figs. 10, 11 and 12 contains a green bounding box indicating the location of a detected defect along with its predicted class label.

Fig. 10
figure 10

Detection results for the Severstal defect dataset: a class 1, b class 2, c class 3, and d class 4

Fig. 11
figure 11

The sample detection results on the NEU defect dataset: a inclusion, b patches, c scratches, d crazing, e pitted surface, and f rolled-in scale

Fig. 12
figure 12

The sample detection results on the DAGM dataset: a class 1, b class 2, c class 3, d class 4, e class 5, and f class 6

Figure 10 shows that our proposed model accurately predicts the four classes of defects in the Severstal dataset. The defects in this dataset are quite challenging since they appear at varying scales: classes 1 and 2, shown in Fig. 10 a and b, are small defects, while classes 3 and 4, shown in Fig. 10 c and d, are larger defects. Figure 10 demonstrates that the proposed system can not only detect defects of various scales but also accurately predict defect locations when the defective pixels resemble the background.

Figure 11 shows the detection results on the NEU defect dataset, which is widely used in steel surface inspection research. Among the six classes of defects, inclusion, patches, and scratches appear at varying scales, whereas crazing, pitted surface, and rolled-in scale defects have pixels similar to the background.

To demonstrate that our proposed FDD model can detect general defects, in Fig. 12, we tested our model on the DAGM dataset. The results show that our model accurately localized six classes of defects with different textures.

5 Conclusion

In this paper, we designed a new defect detection system called the FDD. We proposed the novel concepts of random scaling in the data preprocessing pipeline for training and ultimate scaling for testing, and combined them with an improved Cascade R-CNN detector involving deformable convolution, deformable RoI pooling, and a guided anchoring RPN. To our knowledge, this is the first time such a system has been applied to defect detection in the literature. The experimental results demonstrate that the proposed model significantly improves defect detection accuracy on steel surface defect datasets and on a general defect dataset compared to existing methods, while meeting the processing speed criteria required for steel surface inspection systems. In future work, we will extend our network structure toward a more generalized defect detector.