1 Introduction

In industrial production, the detection of workpiece surface defects is essential for ensuring product quality [1]. In metal forming processes, for example, surface scratching in contact sliding is critical because it degrades the surface quality of the workpiece and shortens the service life of the tooling system. Traditionally, however, defect detection has been a manual process, which is inefficient, inaccurate and unreliable [2]. Recent advances in artificial intelligence provide a promising approach to tackling tough engineering problems, such as fault detection [3], nonlinear system control [4, 5] and surface defect detection [6]. For instance, the image features of surface defects can be learned by machine learning techniques such as the support vector machine (SVM) and the artificial neural network (ANN). The SVM algorithm has been utilised to analyse and classify surface defects on steel surfaces [7] and cutting tools [8]. Similar applications of the ANN algorithm have been reported for defect recognition in cold rolling [9] and colour-filter production [10]; for example, the performance of different neural networks for surface crack detection in fracture experiments has been tested and compared [11]. However, the separate feature-extraction and classification operations of these methods significantly restrict detection efficiency.

CNN-based deep learning has demonstrated its capability in image classification because it can automatically detect and extract high-level image features from labelled image data [12]. Several CNN networks have been developed and applied to classify image data, e.g. AlexNet [13], VGG-16 [14], GoogleNet [15] and EfficientNet [16]. Applications of CNN networks to the detection of surface defects [17, 18], rolling bearing degradation [19] and roll marks on hot-rolled steel plates/strips [20] have also been reported. For example, an image detection model based on the R-CNN network was proposed to identify the wear location and wear mechanism in tribological tests [21]. In addition, different optimisation algorithms have been proposed to train CNN networks appropriately. For instance, an Adam optimiser with a power-exponential learning rate was proposed to control the iteration direction and step size in order to tackle the problems of local minima and overshoot in network training [22]. Although CNN networks have achieved high classification accuracy in recent studies, their large model sizes and complicated structures limit classification speed and bring about high latency.

Therefore, lightweight CNN networks, e.g. SqueezeNet [23], MobileNet [24] and ShuffleNet [25], have been developed to reduce the number of network parameters and the model size without sacrificing classification accuracy. For example, a new fire module was utilised in SqueezeNet to considerably reduce the computation consumption and communication cost [23]. Such lightweight networks can feasibly be deployed on hardware with limited memory, e.g. mobile devices, to complete real-time object detection in automatic vision-based systems [26]. However, a lightweight CNN network for detecting surface scratches in contact sliding is not yet available.

Recently, embedded systems have been used to deploy CNN networks for real-time recognition and detection tasks, e.g. vehicle plate recognition [27], fire detection [28], handwriting recognition [29] and action recognition [30, 31]. Embedded hardware normally has limited computation capacity and on-board memory; thus, lightweight CNN architectures are more suitable for deployment in the embedded environment. For example, an anamorphic depth lightweight CNN, Anam-Net [32], was proposed to segment anomalies in COVID-19 chest CT images. It is therefore expected that deploying a lightweight CNN network for surface scratch detection in sheet metal forming will help to improve the level of automation and efficiency.

This paper aims to develop a lightweight CNN structure, called WearNet, for surface scratch detection in contact sliding. A customised convolutional block will be developed to reduce the number of training parameters and network layers and to simplify the network structure while retaining classification accuracy. To train the WearNet, cylinder-on-flat sliding tests will be conducted to provide a large-scale surface scratch dataset. The network response and decision mechanism will be investigated to reveal the capability of WearNet. The WearNet will then be compared with existing advanced CNN structures to demonstrate its advantages in classification accuracy, model size and computation efficiency. In addition, the performance of the developed WearNet will be compared against other existing CNN networks on a public image database, i.e. the NEU surface defect database [20]. Finally, a Linux-based embedded system will be utilised as the deployment environment to further test the detection performance of WearNet.

2 Image database of surface scratches

2.1 Experimental setup for data collection

To develop a reliable CNN-based detection model, a large-scale surface scratch database is essential. To extend the database scale, cylinder-on-flat sliding tests (see Fig. 1) were conducted under the wide range of operation conditions listed in Table 1. The cylinder-on-flat setup has been used to mimic the contact conditions encountered in a metal forming process [33, 34]. To match industrial production conditions, two typical types of high-strength steel, DP980 and QP980, and DC53 tools with nitriding and vacuum heat treatment were selected as the sliding contact pair. Both one-way and cyclic sliding tests were carried out to mimic practical sliding types. The ranges of the testing parameters (tool radius, normal load, sliding speed and contact width) are listed in Table 1.

Fig. 1

Illustration of the experimental setup for data collection

Table 1 Conditions of the contact sliding experiments

2.2 Image data processing

After each sliding test, the surface topography was measured by a digital microscope (OLYMPUS DSX 510). The measured area of each image is 750 × 750 µm. The surface images of both DP980 and QP980 workpieces were divided into five categories (see Fig. 2):

  1. The surface images prior to contact sliding were labelled as intact surface.

  2. After certain cycles in a sliding test, if material transfer was identified on the workpiece surface without obvious scratches, the measured surface images were denoted as material transfer.

  3. The images with a maximum scratch depth (hmax) below 2 µm, between 2 and 4 µm, and above 4 µm were labelled as minor, mild and severe scratch, respectively.

Fig. 2

Surface scratch labels in the image database: a intact surface, b material transfer, c minor scratch, d mild scratch, e severe scratch

The surface images in Fig. 2c–e were classified based on hmax, because hmax plays an important role in determining the severity of scratching damage [35, 36]. Overall, the database of 10,500 surface images was organised into ten labels, as shown in Table 2. These images were randomly divided into training, validation and testing datasets with a ratio of 4:1:1 (7000:1750:1750). The image resolution was normalised to 227 × 227 prior to training.

Table 2 Image numbers of different surface image labels
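For illustration, the scratch-severity labelling rule and the 4:1:1 dataset split described above can be written as a short script. The actual pipeline of this study was implemented in MATLAB (see Sect. 3.2); the following Python sketch is purely illustrative, and names such as label_from_depth and split_dataset are hypothetical.

```python
import random

def label_from_depth(h_max_um):
    """Map the maximum scratch depth hmax (in micrometres) to a severity
    label, following the thresholds used for Fig. 2c-e."""
    if h_max_um < 2.0:
        return "minor scratch"
    if h_max_um < 4.0:
        return "mild scratch"
    return "severe scratch"

def split_dataset(samples, seed=0):
    """Randomly split labelled images into training/validation/testing
    subsets with the 4:1:1 ratio used in this study."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * 4 / 6)
    n_val = round(len(shuffled) / 6)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Example: 10,500 image file names -> 7000 / 1750 / 1750
dataset = [f"img_{i:05d}.png" for i in range(10500)]
train, val, test = split_dataset(dataset)
print(len(train), len(val), len(test))  # 7000 1750 1750
```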

3 WearNet for surface scratch detection

3.1 Structure of WearNet

The existing advanced CNN networks were designed to classify the 1000 labels of ImageNet [37, 38], a database of over 14 million images. For the surface scratch identification in the current study, the image database is much smaller than ImageNet and far fewer image labels are used. As such, a lightweight WearNet was developed based on a novel convolutional block to prevent overfitting, effectively minimise the network parameters and reduce the model size. The architecture and specifications of WearNet are outlined as follows (see Table 3 and Fig. 3):

  1. Convolutional layer: This layer plays an important role in extracting image features from the input data, which is achieved by the convolution kernel. The convolution kernel is a square filter that scans the input image and outputs feature maps. The kernel sizes for conv-1 and conv-2 are 3 × 3 and 1 × 1, respectively. Each convolutional layer is followed by a ReLU activation function.

Table 3 Specifications of the WearNet network
Fig. 3

Architecture of the WearNet network

  2. Max-pooling layer: The role of the pooling layer is to reduce the feature map size by downsampling. There are two common pooling methods: average pooling and max pooling. Max pooling is more suitable for image feature processing as it preserves the maximum output in a rectangular region; therefore, the max-pooling strategy was selected in this paper. In addition, the network conducts max pooling with a stride of 2 to ensure late downsampling.

  3. Convolutional block: This block takes advantage of separable convolution and squeeze-expand operations, as shown in Fig. 4. It starts with a squeeze convolution layer with 1 × 1 filters, which restricts the total number of channels, n1, fed into the following expand module. The expand module consists of batch normalisation, separable convolution and an expand convolution with a 3 × 3 kernel. In the separable convolution, a channel-wise 3 × 3 spatial convolution is followed by a point-wise 1 × 1 convolution, which brings higher computation efficiency because fewer convolution operations are conducted. In the concatenation layer, the channel number increases from n1 to 4 × n1. As such, the network parameters and model size are decreased significantly (a code sketch of this block is given after this list).

Fig. 4

Illustration of a customised convolutional block in the WearNet structure

  4. Dropout layer: The dropout layer helps to avoid overfitting during network training [39]. The strategy is to deactivate some hidden-layer nodes of the neural network so that their effects are excluded from the current training step. In this study, a dropout layer with a ratio of 0.5 was applied after max pool-3.

  5. GAP layer: The global average pooling (GAP) layer is used to replace the fully connected layer of traditional CNN networks. The GAP layer averages each feature map to enforce the correspondence between feature maps and image categories. Since it contains no trainable parameters, it further reduces the network parameters and mitigates the overfitting problem.
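To make the customised block in Fig. 4 concrete, the sketch below (referenced in item 3 above) outlines one possible implementation in PyTorch. The WearNet itself was built in MATLAB, so this is only an illustration: in particular, how the 4 × n1 output channels are divided between the separable-convolution branch and the 3 × 3 expand branch (assumed here to be 2 × n1 each) and the exact placement of the activations are assumptions, not details taken from Table 3.

```python
import torch
import torch.nn as nn

class WearNetBlock(nn.Module):
    """Squeeze-expand block with separable convolution (cf. Fig. 4).

    A 1x1 squeeze convolution limits the channel count to n1; two parallel
    branches (a depthwise-separable 3x3 convolution and a plain 3x3 expand
    convolution) are concatenated to give 4*n1 output channels."""

    def __init__(self, in_channels, n1):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.Conv2d(in_channels, n1, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.bn = nn.BatchNorm2d(n1)
        # Channel-wise 3x3 spatial convolution followed by a point-wise 1x1.
        self.separable = nn.Sequential(
            nn.Conv2d(n1, n1, kernel_size=3, padding=1, groups=n1),
            nn.Conv2d(n1, 2 * n1, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Plain 3x3 expand convolution.
        self.expand = nn.Sequential(
            nn.Conv2d(n1, 2 * n1, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        s = self.bn(self.squeeze(x))
        # Concatenation: n1 -> 4 * n1 channels.
        return torch.cat([self.separable(s), self.expand(s)], dim=1)

# Shape check: n1 = 16 gives 64 output channels (4 * n1).
block = WearNetBlock(in_channels=32, n1=16)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

A full WearNet-style network would then stack conv-1 (3 × 3) with ReLU, max-pooling layers with a stride of 2, several such blocks, a dropout layer (ratio 0.5) after max pool-3 and conv-2 (1 × 1), followed by global average pooling and a softmax output, as specified in Table 3 and Fig. 3.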

In this study, the WearNet was developed by using the customised convolutional block. The network layers and parameters were effectively minimised to bring about a smaller model size and a higher classification speed. The comparison between WearNet and other CNN networks was also conducted on an embedded system to demonstrate the distinguished performance of the proposed WearNet for practical applications.

3.2 Training details

The WearNet was trained and evaluated in MATLAB on a PC with an Intel i5-10600 CPU (3.3 GHz, 16 GB RAM) and an NVIDIA RTX 3080 GPU (10 GB). The Deep Learning Toolbox in MATLAB provides a friendly framework for building network structures, setting training parameters and monitoring the training process. GPU computing was utilised in the network training to speed up the iterations. In general, there are three learning paradigms in machine learning: supervised learning [40], unsupervised learning [41] and reinforcement learning [42]. In this paper, the supervised learning approach was adopted, and the dataset of labelled images listed in Table 2 was used for network training. The selection of the training parameters (e.g. learning rate, gradient algorithm and mini-batch size) is discussed in Sect. 4.1.

3.3 Evaluation protocol

The evaluations of deep neural networks are based on the following aspects:

  1. Training time T: The training time is related to the network architecture, database scale and training platform, as well as the training parameters.

  2. Classification accuracy p: The prediction result is considered accurate when the predicted category with the highest confidence is consistent with the ground truth. Thus, the classification accuracy (p) is given by

     $$p={N}_{a}/N$$
     (1)

     where Na and N refer to the numbers of accurately classified images and total images, respectively.

  3. Recognition rate r: For a specific image class c, if Ma and M denote the number of images correctly classified as class c and the total number of images of class c, respectively, the recognition rate (r) of class c is defined as

     $$r={M}_{a}/M$$
     (2)
  4. Classification time t: The classification time plays an essential role in evaluating network performance, particularly for industrial production involving fast recognition. In this study, the average classification time (t) for each surface image was calculated for further analysis and comparison.

  5. Model size: The model size determines the applicability of the WearNet in production. Typically, larger CNN architectures require more transmission bandwidth and higher communication costs.

Classification accuracy is one of the most important evaluation metrics, as it indicates the overall classification performance of a CNN network. However, classification accuracy alone can be misleading if the numbers of surface images in the individual classes are unequal. Therefore, the recognition rate and the confusion matrix are used to check the network performance on each image class and to reveal how the CNN network is confused when making classification decisions. The training and classification times play an important role when the computation efficiency of different networks is investigated. Moreover, the model size should be taken into consideration when a CNN network is deployed in the embedded environment, as the on-board memory of an embedded device is usually limited.
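As a concrete illustration of Eqs. (1) and (2), the snippet below computes the overall classification accuracy, the per-class recognition rates and the confusion matrix from predicted and ground-truth class indices. It is a minimal NumPy sketch for illustration, not the evaluation code used in this study.

```python
import numpy as np

def evaluate(y_true, y_pred, num_classes):
    """Return the accuracy p = Na/N, per-class recognition rates r = Ma/M
    and the confusion matrix (rows: true class, columns: predicted class)."""
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1

    accuracy = np.trace(confusion) / confusion.sum()              # Eq. (1)
    recognition = confusion.diagonal() / confusion.sum(axis=1)    # Eq. (2)
    return accuracy, recognition, confusion

# Toy example with three classes.
acc, rec, cm = evaluate([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], num_classes=3)
print(acc)   # 0.666...
print(rec)   # [0.5  1.   0.666...]
```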

4 Results and discussions

In training the WearNet, it is crucial to select appropriate training parameters by considering the network structure, the surface image database and the available computation resources. In this section, the selection of the optimised training parameters for WearNet is explored first. The WearNet is then examined with a focus on the network response, layer activations and decision mechanism. Finally, the comparison between WearNet and other CNN networks is conducted by using the evaluation protocol described in the last section.

4.1 Selection of training parameters

The effects of different training parameters, e.g. learning rate, gradient algorithm and mini-batch size, were investigated in this paper, as they have been reported to have a considerable influence on training results [43, 44]. In the training experiments, it was found that the validation accuracy usually reached a stable stage after around 150 training epochs; therefore, a series of network training experiments were conducted with a maximum of 200 epochs. An epoch refers to one complete pass of the entire image database through the network, and the mini-batch size (Nb) refers to the number of images used for network training in a single iteration. Generally, the network parameters are trained and updated a greater number of times for a higher epoch number. For a given training database size D, the total number of iterations Ni is given by

$${N}_{i}=(D/{N}_{b})\times 200$$
(3)
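For example, with the training set of this study (D = 7000 images, Sect. 2.2) and the mini-batch size of 16 selected below, Eq. (3) evaluates to

$${N}_{i}=(7000/16)\times 200=87{,}500$$

iterations in total; in practice, the exact count depends on how the final, partially filled batch of each epoch is handled.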
  1. Learning rate: The learning rate determines the convergence speed of the iteration. In network training, it is essential to find an optimal learning rate that achieves a reasonable balance between training speed and validation accuracy. Different learning rates, ranging from 0.001 to 0.00001, were tested and compared, as shown in Fig. 5. In addition, a piecewise learning rate decaying from 0.001 to 0.0001 was also utilised in the training experiments. The descent algorithm and mini-batch size were set as stochastic gradient descent with momentum (SGDM) and 16, respectively. According to the training results, a larger learning rate enables the model to learn faster but carries a risk of sub-optimal results. When the learning rate becomes smaller, the convergence speed in the initial stage is lower and it takes longer to reach the stable stage. In particular, if the learning rate is too small (e.g. 0.00001), the final validation accuracy after 200 training epochs is relatively low. Therefore, the piecewise learning rate from 0.001 to 0.0001, which combines the helpful characteristics of the larger and smaller learning rates, brings about fast convergence at the beginning and ensures a high validation accuracy in the final stage. Figure 5 also presents the average training time per epoch, indicating that the influence of learning rate variations on the training time is negligible.

  2. Gradient algorithm: Gradient descent algorithms are used to train deep neural networks [45, 46]. This section compares the training performance of two typical algorithms, SGDM and Adam, and selects the appropriate one by considering the convergence speed, computation efficiency and generalisation ability. Compared with traditional gradient descent, SGDM computes the gradient of the loss function on a small random subset instead of the whole dataset and then performs a parameter update, which helps to improve the computation efficiency. The Adam algorithm utilises squared gradients to scale the learning rate and takes advantage of momentum through the moving average of the gradient. Figure 6 presents the training processes of the two algorithms with the mini-batch size fixed at 16 and the piecewise learning rate described above.

Fig. 5

Iteration process of network training with different learning rates

Because of the stochastic gradient computation, SGDM usually has a lower convergence speed at the beginning and reaches its stable stage after a higher number of iterations than the Adam algorithm. However, SGDM consumes less training cost and leads to a higher validation accuracy than Adam. This is because more frequent updates are conducted for SGDM, so there are more chances to jump out of a local minimum and search for better solutions. Hence, the SGDM algorithm was adopted in this study.

Fig. 6

Iteration process of network training with different gradient algorithms

  3. Mini-batch size: In training CNN networks, the scale of the image database is usually very large, and the computational cost would be unaffordable if the whole database were swept in each iteration. Hence, a proper selection of the mini-batch size is important to reduce the training cost and refine the classification performance [47]. Using SGDM and the piecewise learning rate, Fig. 7 demonstrates how the mini-batch size affects the training cost and validation accuracy. In general, the influence of batch size variations is more significant at the beginning of network training. A smaller batch size usually brings about a faster convergence speed because the network parameters are updated more frequently within each epoch; meanwhile, the larger number of iterations associated with a smaller batch size also increases the training time. However, overfitting should be taken into consideration if the batch size is too small. It is also found that a larger batch size may lead to poorer generalisation ability; for example, as shown in Fig. 7, the validation accuracy drops gradually when the batch size increases from 16 to 64. Considering the balance between the training cost and validation accuracy shown in Fig. 7, the mini-batch size was fixed at 16. A training-configuration sketch reflecting these selections is given after Fig. 7.

Fig. 7

Iteration process of network training with different mini-batch sizes
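The training configuration selected above (SGDM, a piecewise learning rate dropping from 0.001 to 0.0001, a mini-batch size of 16 and 200 epochs) could be expressed as in the sketch below. The study itself used the MATLAB Deep Learning Toolbox, so this PyTorch loop is only an equivalent illustration; the momentum value (0.9) and the epoch at which the learning rate is dropped (100) are assumptions, as they are not specified here.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader

def train(model, train_set, device="cuda"):
    """Train with SGD + momentum, a piecewise learning rate (1e-3 -> 1e-4),
    a mini-batch size of 16 and 200 epochs, mirroring Sect. 4.1."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=[100], gamma=0.1)

    for epoch in range(200):
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
        # The validation accuracy would be monitored here after each epoch,
        # e.g. with the evaluate() helper sketched in Sect. 3.3.
    return model
```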

4.2 Examination of developed WearNet

With the above selected parameters, the WearNet for surface scratch detection was trained. The network responses, layer activations and decision mechanism of the WearNet are examined as follows:

  1. Network responses: To examine the WearNet responses, the t-distributed stochastic neighbour embedding (t-SNE) function [48] was used in this study. Three hundred surface images with six different labels, as listed in Table 4, were used to investigate the responses of different layers, i.e. maxpool-1, conv-2 and softmax, in the WearNet. Figure 8 illustrates the t-SNE plots for the three layers, where the six colours of the solid dots refer to the six image labels. For the maxpool-1 layer, the labels were not correctly grouped because only low-level features, e.g. colours and edges, are processed in such an early layer. The conv-2 layer refined the clustering of these labels to some extent, but the separation was still not satisfactory. For the softmax layer, the t-SNE plot demonstrates that an appropriate classification of the different labels was achieved as the network went deeper, which validates the high accuracy of the developed WearNet.

Table 4 Surface scratch images used for t-SNE plotting
Fig. 8

t-SNE plotting for different layers in WearNet: maxpool-1, conv-2 and softmax

  2. Examination of layer activations: The layer activations play an important role in training the WearNet. To check which features the network has learned and whether the representative features have been correctly detected and preserved, it is necessary to visualise and examine the activation maps of different layers, i.e. conv-1, the squeeze layer in conv-block-2 and conv-2. A testing image, a QP980 surface with a severe scratch, was fed into the trained WearNet.

There are 32 channels in the first convolution layer (conv-1), and the 32 corresponding feature maps are shown in Fig. 9a. Similarly, Fig. 9b, c present the feature maps of the squeeze layer in conv-block-2 and of the last convolution layer (conv-2), which have 32 and 10 channels, respectively. Figure 10 illustrates the image features with the largest activations in the three layers. Clearly, the discriminative features for characterising the severe scratch were extracted step by step as the neural network went deeper; for example, the deep and long ploughings are typical features of severe-scratch images (Fig. 10).

Fig. 9

Feature maps from three different layers

Fig. 10

Image features with the largest activations in three different layers

  3. Decision mechanism: To figure out how the WearNet makes a reliable classification decision, the gradient-weighted class activation mapping (Grad-CAM) technique [49] was employed. It utilises the gradients of the final classification score with respect to the convolutional features to determine the parts of a tested image that are most influential for the classification. Figure 11 illustrates the Grad-CAM map for a test image from the class QP980-severe scratch. Blue regions indicate a low influence, while red regions denote a high influence; in general, a larger gradient corresponds to a red zone on which the final score relies most. It should be noted that the red rectangular zone also corresponds to the area with a deep and long scratch, which has the greatest impact on classifying the test image as a severe scratch. A minimal implementation sketch of Grad-CAM is given after Fig. 11.

Fig. 11

Grad-CAM gradient map for a test image (QP980-severe scratch)
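The Grad-CAM map in Fig. 11 can in principle be reproduced with the short routine below (referenced in item 3 above): it hooks the final convolutional layer of a trained network, weights the layer's feature maps by the spatially averaged gradients of the target class score and keeps only the positive contributions. This is a generic PyTorch sketch rather than the MATLAB implementation used in this study, and the layer handle passed as target_layer (e.g. a conv-2 module) is a placeholder.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: return a heat map in [0, 1] highlighting the image
    regions that most influence the score of the chosen class."""
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output

    def bwd_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.eval()
        scores = model(image.unsqueeze(0))            # (1, num_classes)
        if class_idx is None:
            class_idx = int(scores.argmax(dim=1))
        model.zero_grad()
        scores[0, class_idx].backward()

        acts = activations["value"]                   # (1, C, H, W)
        grads = gradients["value"]                    # (1, C, H, W)
        weights = grads.mean(dim=(2, 3), keepdim=True)
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam.squeeze().detach()
    finally:
        h1.remove()
        h2.remove()

# Usage (names are placeholders):
# heatmap = grad_cam(wearnet, test_image, target_layer=wearnet.conv2)
```

The same forward-hook mechanism can also be used to export the intermediate feature maps of the kind shown in Figs. 9 and 10.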

4.3 Comparison with other CNN networks

A series of classification experiments were conducted to compare the performance of WearNet with other state-of-the-art networks under identical training conditions, i.e. the same image dataset (see Table 2), training platform and parameter settings. Table 5 compares the network performance in terms of the training time, validation accuracy and testing accuracy; here, the training time refers to the average iteration time for a single epoch. It can be found that the validation and testing accuracy of WearNet outperform those of the other networks while the minimum training time is consumed. Figure 12 presents the evolution of the validation accuracy during network training, in which WearNet has the highest value throughout the whole training process. Furthermore, Table 6 presents the recognition rates of the individual image labels when the different CNN networks are tested. In general, the recognition rates of most image labels are over 85%. For the labels DP980-severe scratch and QP980-mild scratch, WearNet is still able to provide reliable classification results, whereas the recognition rates of the other networks drop significantly, especially that of AlexNet. Figure 13 therefore shows the confusion matrices for AlexNet and WearNet, which help to reveal how the two CNN networks are confused when making classification decisions. It can be found that, for the scratch images with the label DP980-severe scratch, AlexNet achieves a recognition rate of only 65%, with around 7% and 26% of the images incorrectly classified as DP980-minor scratch and DP980-mild scratch, respectively. When WearNet is employed, the classification error is reduced significantly, as shown in Fig. 13b.

Table 5 Comparison of network performance
Fig. 12

Training process of WearNet and other state-of-the-art networks

Table 6 Comparison of recognition rates of individual image labels
Fig. 13

Confusion matrices for AlexNet and WearNet

Table 7 compares the complexity of WearNet and the other CNN networks in terms of the layer number, parameter quantity, model size and classification time. It is found that the model size is closely related to the quantity of network parameters, while the number of network layers has a considerable impact on the classification time. As shown in Table 7, the structure of WearNet is simpler than that of its convolutional counterparts, which brings about the smallest model size and the fastest classification speed. Nevertheless, WearNet still delivers excellent classification performance, which is attributed to its well-designed lightweight architecture. In conclusion, the WearNet proposed in this study has shown its advantages in computational efficiency and model size, as well as excellent classification performance.

Table 7 Comparison of the complexity of different CNN networks
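For reference, the quantities compared in Table 7 can be measured as follows for any CNN: the trainable parameter count, the serialised model size and the average per-image classification time. The snippet is a generic PyTorch sketch; the values in Table 7 were obtained from the MATLAB models of this study.

```python
import io
import time
import torch

def profile_model(model, input_size=(1, 3, 227, 227), runs=100):
    """Report the parameter count, serialised size (MB) and mean per-image
    classification time (ms) of a CNN on the CPU."""
    n_params = sum(p.numel() for p in model.parameters())

    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    size_mb = buffer.getbuffer().nbytes / 1024 ** 2

    model.eval()
    x = torch.randn(*input_size)
    with torch.no_grad():
        model(x)                                    # warm-up run
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        ms_per_image = (time.perf_counter() - start) / runs * 1e3

    return n_params, size_mb, ms_per_image
```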

4.4 Classification performance on a public dataset

To comprehensively demonstrate the effectiveness of WearNet, its performance was also investigated on the NEU surface defect dataset. This dataset contains six kinds of typical surface defects on hot-rolled steel strips (see Fig. 14), i.e. rolled-in scale, patches, crazing, pitted surface, inclusion and scratches, with 300 labelled images for each type. The surface defect images in the database are greyscale and were converted into RGB images before being used for network training. All the images were randomly divided into training, validation and testing datasets with a ratio of 4:1:1 (1200:300:300). A smaller number of training epochs (100) was adopted because of the smaller size of the image database, while the other training parameters were identical to those in the last section. Table 8 presents the classification performance of four different CNN networks. All four networks achieved high validation and testing accuracy (over 98%), but WearNet outperformed the others in terms of training time and classification speed.

Fig. 14

Surface defects of hot-rolled steel strips in the NEU database

Table 8 Classification performance on NEU surface defect database
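The preprocessing described above, i.e. replicating each greyscale NEU image into three RGB channels and normalising the resolution, amounts to only a few lines; the sketch below uses Pillow and is illustrative only, with a hypothetical file path, and the 227 × 227 input size is carried over from Sect. 2.2.

```python
from PIL import Image

def preprocess_neu(path, size=(227, 227)):
    """Load a greyscale NEU defect image, convert it to three-channel RGB
    and resize it to the network input resolution."""
    return Image.open(path).convert("RGB").resize(size)

# img = preprocess_neu("NEU/scratches/Sc_001.bmp")  # hypothetical path
```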

5 Deployment of CNN networks

Currently, embedded systems are widely used in industry owing to their advantages of high efficiency, good affordability, continuous production and low energy consumption. In this study, a Linux-based embedded system with a Raspberry Pi 4B as the core hardware (see Fig. 15) was used to further demonstrate the application of the CNN networks. In this section, 600 surface images selected from the testing dataset were used in the surface defect detection test.

Fig. 15

Illustration of an embedded system
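One common route for running a trained CNN on a Raspberry Pi class device is to export the network to ONNX and execute it with ONNX Runtime; this is not necessarily the deployment path used in this study, which started from MATLAB, but it illustrates how the per-image classification time reported in Table 9 can be measured on the device. The model file name and the input layout are placeholders.

```python
import time
import numpy as np
import onnxruntime as ort

def classify_on_device(model_path, images):
    """Classify a batch of preprocessed images (N x 3 x 227 x 227, float32)
    on the embedded CPU and report the average time per image in ms."""
    session = ort.InferenceSession(model_path,
                                   providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    predictions = []
    start = time.perf_counter()
    for img in images:
        scores = session.run(None, {input_name: img[None, ...]})[0]
        predictions.append(int(np.argmax(scores)))
    avg_ms = (time.perf_counter() - start) / len(images) * 1e3
    return predictions, avg_ms

# preds, t_ms = classify_on_device("wearnet.onnx", test_images)  # placeholders
```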

Table 9 compares the classification of surface scratches in the embedded system with the four different CNN structures. The folder size refers to the total size of the configuration files that enable the embedded system to run the detection programmes independently. In the embedded environment, the folder size and classification time of WearNet were significantly lower than those of the other networks, while its testing accuracy was still the highest. Hence, WearNet is expected to have promising prospects in industrial production.

Table 9 Comparison of network performance in embedded system

6 Conclusions

In this study, a lightweight CNN structure, called WearNet, has been developed based on a well-designed convolutional block. The WearNet is designed for surface scratch detection in contact sliding, and the surface scratch images in the database were collected from cylinder-on-flat tribological tests. A detailed investigation of the parameter selection for network training and examinations of the network response and decision mechanism have been carried out. The performance comparison between WearNet and other commonly used CNN networks has been conducted on different databases. The main contributions of this paper are summarised as follows:

  1. The developed lightweight WearNet has minimised network layers and parameters, with distinguished advantages in model size and classification speed, while guaranteeing a high classification accuracy and recognition rate.

  2. Training parameter variations have a significant influence on the network training process. The selected combination of training parameters achieves a good balance between computation consumption and network performance.

  3. The developed WearNet is able to extract and learn discriminative features for surface scratch classification step by step. The examination results demonstrate the excellent capability of WearNet to correctly classify different scratch images with the appropriate labels.

  4. The application of WearNet in an embedded system shows that WearNet has promising application prospects in production.