1 Introduction

At present, research on computer vision has gradually shifted from the laboratory to embedded devices. Although many algorithms achieve high accuracy, they usually have large parameter counts and computational costs and require high-performance hardware. Many chips offer special convolutional acceleration modes, but they are relatively expensive. Therefore, using a lightweight network is an effective way to balance performance and speed.

The advantages of the lightweight convolutional neural network (CNN) model are as follows:

1. A lightweight model has fewer parameters and less data exchange during training, so it can achieve good results in a significantly shorter training time.

2. A model with fewer parameters and less computation is more suitable for embedded devices. Embedded devices are relatively small, so their heat dissipation and hardware performance differ greatly from laboratory machines. Therefore, a model with moderate accuracy and fast operation suits embedded devices better than one with high accuracy and slow operation.

3. At present, Transformer-based architectures achieve better accuracy than CNNs, but at the cost of enormous training time and parameter counts, whereas hardware optimization for convolution is already mature, so CNN inference can be accelerated more easily than Transformer structures.

Lightweight modeling usually relies on knowledge distillation [1], pruning [2] and quantization [3], or on modifying the network structure. There have been many studies on network structure optimization. AlexNet used group convolution, splitting training across two GPUs to compensate for the low-performance hardware of the time, which reduced parameters, computation and training time [4]. Group convolution also increases the diagonal correlation of the convolution kernels, making the network harder to overfit [5]. Iandola et al. constructed SqueezeNet, which uses 1 × 1 convolution kernels to reduce the number of channels and adds parallel multi-scale convolutions to preserve more information [6]; SqueezeNet has only one-fiftieth of AlexNet's parameters but similar performance. The MobileNet series adopts depthwise separable convolution, which greatly reduces the number of parameters by separating spatial filtering from channel mixing, so that the features of different channels are integrated by a separate per-channel operation [7]. ShuffleNet proposed four criteria for network design and, by shuffling the outputs of group convolution, lets information circulate between groups to avoid information isolation [8, 9].

However, the number of parameters and computational cost of the above networks are still too high, so this study proposes a novel model based on the characteristics of previous lightweight models, whose parameter count and computational cost are, respectively, 12% and 64% of those of MobileNetV3. Both a traditional dataset and the defect dataset built in this study are used to test the proposed algorithm, further verifying its effect under practical conditions.

For image classification tasks, structure optimization can significantly improve network speed or performance, while object detection must consider the speed of its different parts. As an example, a common object detection model was selected and the inference time of its components was measured; the results are shown in Fig. 1.

Fig. 1

Proportion of inference time and GFLOPS

In Fig. 1, (a) shows the inference time required by each part of the model, and (b) shows each part's proportion of computation (multiply-adds). Backbone, Neck and Head denote the corresponding structures of the model, NMS denotes non-maximum suppression, and Other covers overheads such as loading the model onto the GPU and data memory throughput.

From Fig. 1, it can be concluded that the Backbone accounts for a large share of the inference time and computational cost of an object detection model, so optimizing the backbone is a good approach. The Neck usually adopts the Feature Pyramid Network (FPN) structure [10]; optimizing the FPN requires weighing the impact of each branch, whereas optimizing the Backbone is more direct.

2 Materials and Methods

The purpose of a lightweight network is to minimize the number of parameters and the size of the network without reducing accuracy too much. Although the number of parameters in MobileNetV3 is already small, it still has about 4 million training parameters, and a large number of parameters occupies a large amount of video memory in the prediction stage. For embedded devices, the less video memory an algorithm consumes, the more images it can process in the same time, so such algorithms have a higher optimization ceiling. Many improvements have been made to reduce the size of the network. We mainly use depthwise separable convolution, shown in Fig. 2, which is adopted by many lightweight networks because it significantly reduces the number of parameters and increases speed [7, 11].

Fig. 2

Depthwise separable convolution

Compared with ordinary convolution, depthwise separable convolution does not sum the convolution results across the channel direction; each channel is filtered and output independently.
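For illustration, a minimal PyTorch sketch of a depthwise separable convolution in the common MobileNet form [7], i.e., a per-channel depthwise convolution followed by a 1 × 1 pointwise convolution (the exact block layout here is illustrative):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes each filter see only its own channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```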

2.1 The Model Proposed in This Study

Our network is similar to the CSP structure and is called LightCSPNet because it expands the number of channels through convolution [12]. Four CSPlayer modules are connected in the network. Each CSPlayer includes two structures: a Down layer for expanding the number of channels and an Extraction layer for feature extraction. The Down layer uses our lightweight CSP structure, shown in Fig. 3, which stacks feature layers produced by depthwise separable convolution, similar to GhostNet but used to expand the number of channels [13]. In Fig. 3, Depthwise denotes depthwise separable convolution, Conv denotes convolution followed by normalization and an activation function, and Concat denotes concatenation along the channel direction.

Fig. 3

The structure of Downlayer

If ordinary convolution is used on the residual edge, much redundant information is generated, as shown in Fig. 4. According to reference [13], these redundant features can be reduced by using depthwise separable convolution, and the resulting reduction in parameters and computation also helps avoid overfitting. After the Down layer, the width and height of the feature map are halved and the number of channels is doubled; this down-sampling improves the speed of the network.
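A minimal PyTorch sketch of one possible Down layer consistent with Fig. 3 (the exact branch layout and kernel sizes are illustrative assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, k=1, stride=1, groups=1):
    """Conv + BatchNorm + activation, the 'Conv' block of Fig. 3."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class DownLayer(nn.Module):
    """Halves width/height and doubles channels.
    Primary features come from a strided conv; cheap extra features come
    from a depthwise conv, GhostNet-style, and the two are concatenated."""
    def __init__(self, in_ch):
        super().__init__()
        self.primary = conv_bn_act(in_ch, in_ch, k=3, stride=2)
        self.cheap = conv_bn_act(in_ch, in_ch, k=3, groups=in_ch)  # depthwise

    def forward(self, x):
        y = self.primary(x)                           # (B, C, H/2, W/2)
        return torch.cat([y, self.cheap(y)], dim=1)   # (B, 2C, H/2, W/2)

x = torch.randn(1, 64, 56, 56)
print(DownLayer(64)(x).shape)  # torch.Size([1, 128, 28, 28])
```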

Fig. 4

Feature map using ordinary convolution

Redundant features are marked with boxes of the same color: one such feature can be derived from another by a simple transformation, so it provides little information but occupies memory.

The Extraction layer is shown in Fig. 5. It uses the inverted residual structure, which appears in many networks mainly because expanding the number of channels reduces the information lost through the activation function [14, 15].

Fig. 5

The structure of Extraction layer

The channel magnification of the center part of the inverted residual is a hyperparameter, and we choose 4 × here, so the number of channels in the middle layer is 1024. Expanding the number of channels by convolution decomposes the features of the middle layer; the subsequent channel reduction then compresses those features, preserving the more distinct parts.

In Fig. 5, left: inverted residual; right: Resmax structure. "256, 1 × 1 Conv, 1024" indicates a convolution that changes the number of channels from 256 to 1024, and 1 × 1 Conv means the convolution kernel size is 1.
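A minimal sketch of this inverted residual (4 × expansion, e.g., 256 → 1024 → 256; the 3 × 3 depthwise middle stage is an assumption based on common practice [14]):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expand (4x), depthwise filter, 1x1 project back."""
    def __init__(self, ch, expansion=4):
        super().__init__()
        mid = ch * expansion  # e.g. 256 -> 1024
        self.block = nn.Sequential(
            nn.Conv2d(ch, mid, 1, bias=False),                          # expand
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1, bias=False),                          # project
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        # the residual edge (direct connection or Resmax, Fig. 6) is
        # added by the enclosing CSPlayer, cf. Fig. 7
        return self.block(x)
```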

On the residual edge, the Resmax structure is used, as shown in Fig. 6; it consists of two branches. One branch extracts the most salient feature of the current feature layer, and the other obtains a weight factor for each channel. Max denotes taking the maximum over all values along the channel direction, adaptive max pooling denotes taking the maximum over all pixels along the width and height directions, and Linear denotes a fully connected layer. Resmax scales the feature map of each channel with adaptive parameters. This structure screens out the more salient network features and better distinguishes foreground from background.

Fig. 6

Resmax structure
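A minimal sketch of a Resmax block consistent with the description above; how the two branches are fused, and the Sigmoid gate on the channel weights, are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Resmax(nn.Module):
    """Branch 1: channel-direction max (salience map).
    Branch 2: per-channel weights via adaptive max pooling + Linear.
    The weights rescale the salience map per channel (assumed fusion)."""
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        salience = x.max(dim=1, keepdim=True).values               # (B, 1, H, W)
        weights = self.fc(F.adaptive_max_pool2d(x, 1).flatten(1))  # (B, C)
        return salience * weights.view(b, c, 1, 1)                 # (B, C, H, W)
```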

Compared with a direct connection, the Resmax structure adds a feature-filtering capability and achieved better results in our experiments. Three different CSPlayer variants were tested, as shown in Fig. 7: (a) Resmax on the residual edge; (b) direct addition; (c) the Resmax module connected directly in series.

Fig. 7

CSPlayer module

The test results are shown in Table 1: structure (a), which uses the Resmax proposed in this study, performs better than the counterparts without it. Although the Resmax structure can extract features that differ clearly from the background, it also removes objects that resemble the background. If Resmax is connected directly in series, as in structure (c), a large number of similar features are removed, which degrades performance. However, when Resmax serves only as the residual edge while the trunk keeps a direct connection, this disadvantage is effectively overcome, because the trunk retains a wealth of similar features [16, 17]. That is why structure (a) is selected in this study.
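Putting the pieces together, a sketch of a CSPlayer wired in the selected form, reusing the DownLayer, InvertedResidual and Resmax sketches above (this wiring is our reading of Fig. 7a, not the authors' exact implementation):

```python
import torch.nn as nn

class CSPLayer(nn.Module):
    """Down layer for channel expansion, then an Extraction layer whose
    residual edge is a Resmax block (variant (a) in Fig. 7)."""
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch * 2
        self.down = DownLayer(in_ch)           # from the sketch above
        self.trunk = InvertedResidual(out_ch)  # Extraction layer trunk
        self.resmax = Resmax(out_ch)           # residual edge

    def forward(self, x):
        y = self.down(x)                       # halve H/W, double channels
        return self.trunk(y) + self.resmax(y)  # trunk plus Resmax residual
```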

Table 1 The accuracy of three models for ImageNet100

Top 1 and Top 5 are indicators measuring the prediction accuracy of a model. ImageNet100, a subset of ImageNet1k containing 120 thousand images, was selected as the experimental dataset. The network structure is shown in Fig. 8.

Fig. 8

The structure of LightCSPNet

The Down layer and Extraction layer in Fig. 8 are described in Figs. 3 and 5, and Conv means convolution followed by normalization and an activation function.

2.2 Additional Improvements

At present, Transformer models have proved very effective in the computer vision (CV) field [18,19,20], so we tried adding the multi-head self-attention (MHSA) structure of the Transformer to the network proposed in this study. The MHSA structure, shown in Fig. 9, was first used in the natural language processing (NLP) field to improve accuracy by letting the network itself select useful information [21]. MHSA has proved well suited to image classification, segmentation and object detection tasks, and the combination of CNN and Transformer models has been explored by many scholars [22, 23]. Building on the above improvements, this study replaces the Extraction layer in the last CSPlayer with the MHSA structure.

Fig. 9

The structure of MHSA

In this paper, four heads are used in practice; only one head is shown in Fig. 9 for simplicity. q, k and v represent the query (to match others), the key (to be matched) and the information to be extracted, respectively. P is the relative position code. ⊕ and ⊗ represent element-wise sum and matrix multiplication, respectively. \(W^{q}\), \(W^{k}\) and \(W^{v}\) are all implemented as 1 × 1 convolutions [24].

The Softmax operation in Fig. 9 is shown in Eq. (1):

$$y_{i} = \mathrm{Softmax}\left( x_{i} \right) = \frac{e^{x_{i}}}{\sum\nolimits_{j = 1}^{C} e^{x_{j}}}, \quad i = 1, \ldots, C,$$
(1)

where \(x = \{ x_{1}, x_{2}, \ldots, x_{C} \}\) denotes the C input values.

Softmax generates a series of weight coefficients in the range [0, 1] and can be regarded as a normalization operation, through which the network increases the weight of the foreground and reduces the weight of the background.

The MHSA structure is shown in Eq. (2):

$$\mathrm{output} = \mathrm{Softmax}\left( Q \cdot \left( P^{T} + K^{T} \right) \right)^{T} \cdot V,$$
(2)

where Q, K, V and P represent the tensors denoted by the corresponding lowercase letters in Fig. 9.
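A minimal single-head sketch of Eq. (2) (the content term QKᵀ plus a position term QPᵀ; the transpose layout follows standard attention conventions, and the position code here is a learnable per-position vector rather than the 2-D relative encoding of [24], which is a simplification):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHSA(nn.Module):
    """Single-head sketch of Eq. (2) over the H*W spatial positions."""
    def __init__(self, ch, h, w):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)  # W^q as 1x1 conv
        self.k = nn.Conv2d(ch, ch, 1)  # W^k
        self.v = nn.Conv2d(ch, ch, 1)  # W^v
        # learnable position code P, one vector per spatial position
        self.p = nn.Parameter(torch.randn(1, ch, h * w))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)    # (B, HW, C)
        k = self.k(x).flatten(2)                    # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)    # (B, HW, C)
        attn = F.softmax(q @ (k + self.p), dim=-1)  # Q(K^T + P^T), (B, HW, HW)
        out = attn @ v                              # (B, HW, C)
        return out.transpose(1, 2).reshape(b, c, h, w)
```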

Adding the MHSA structure further reduces the size of the model, but the structure needs a lot of training time to fit. In the end, classification accuracy drops somewhat after adding MHSA, yet the structure achieves good results in the object detection network.

2.3 Dataset and Hyperparameter Setting

Defect detection is a common industrial computer vision task, but most defect datasets have too few categories, which makes models easy to overfit. To address this, a dedicated defect dataset was built in this study for the image classification task, covering 12 defect types on materials such as cloth, concrete, bridges and steel [25,26,27]. Sample images from the dataset are shown in Fig. 10.

Fig. 10

Several different categories of defect images

The dataset draws on the work of many previous scholars, who adopted a variety of models for defect detection, most of them not as lightweight as ours [28,29,30]. Different defects share some characteristics but also differ fundamentally, so training them together is conducive to network convergence. For example, telling whether a crack comes from a bridge, the ground or concrete is similar to telling a cat from a dog: they share some features but differ in detail. Putting multiple defect types in one dataset is equivalent to adding negative samples, which improves the generalization ability of the model. Moreover, a model trained on many categories can serve as pre-training weights for continued training on fewer categories, similar to knowledge distillation, where a large model is used to train a small one. The pseudo-code for training LightCSPNet is given in Algorithm 1.

Algorithm 1: Pseudo-code for training LightCSPNet

The functions in Algorithm 1 correspond to the structures in Figs. 3 and 5. The hyperparameters of LightCSPNet are set according to experimental experience: the number of epochs is 200 to let the algorithm converge, Adam and cross entropy are adopted as the optimizer and loss function respectively, and cosine annealing automatically adjusts the learning rate from an initial value of 0.001.
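A minimal PyTorch sketch of this training configuration (the LightCSPNet constructor and train_loader are hypothetical placeholders for the model and dataset defined elsewhere):

```python
import torch
import torch.nn as nn

# LightCSPNet and train_loader: assumed defined elsewhere (model and dataset)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = LightCSPNet(num_classes=100).to(device)  # hypothetical constructor

epochs = 200
criterion = nn.CrossEntropyLoss()                # cross entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # cosine annealing: one step per epoch
```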

The classification capacity of the proposed model was tested on the ImageNet100 dataset. In addition, the model was introduced as the backbone of an object detection model, and the resulting AP50 was compared with the counterparts of MobileNetV2 and MobileNetV3.

3 Results

Some small optimizations were made to further refine our model: EL (increasing the number of Extraction layers), dropout and MHSA. For EL, Extraction Layer 3 in Fig. 8 was repeated three times, following the 1:1:3:1 module ratio of [31], but the effect was not good. Dropout was also applied to MobileNetV2, MobileNetV3 and the proposed LightCSPNet, improving accuracy to a certain extent [32]. The experimental conditions were (224, 224) inputs, an Intel(R) Xeon(R) Platinum 8255C CPU, and RTX 2080 Ti and RTX 3090 GPUs. The ablation experiments in Table 2 verify the effect of each improvement, and the experimental results on the two datasets are shown in Table 3.

Table 2 Ablation experiment
Table 3 Model performance on the ImageNet100 and Defect datasets

It can be seen from Table 3 that our model removes a large number of parameters and much computational cost while accuracy remains almost unchanged. Although adding the MHSA structure causes a notable performance drop in classification, the MHSA structure achieves better performance in the object detection model.

Compared with the other lightweight algorithms, LightCSPNet reduces the number of parameters by 75% while performing almost the same after applying these techniques, and combining LightCSPNet with MHSA reduces the parameters further. In addition, the Vision Transformer, which consists of multiple MHSA structures, was included in our tests. In fact, the VGG16 and Vision Transformer networks were so large that they were difficult to converge on the RTX 3090, so pre-training weights for the large models were used to accelerate convergence. Finally, the ablation experiment in Table 2 shows that our improvements are effective.

LightCSPNet was then used as the backbone of object detection models and compared further with MobileNetV2 and V3. The backbone was tested in one anchor-based and one anchor-free algorithm, YOLOv4 [33] and YOLOX [34] respectively; their structures are very similar and both use the same PAN (Path Aggregation Network) structure [35], shown in Fig. 11, the main difference being whether anchors are used.

Fig. 11

The structure of YOLOV4 and YOLOX

The detection results are shown in Table 4. The input image size is (416, 416), and the dataset is VOC2012, which consists of 10,000 images in 20 classes. All models in this paper use pre-training weights from ImageNet100.

Table 4 Performance comparison among different backbone models

For the VOC dataset, the AP of LightCSPNet improved by 0.4% and 0.6%, respectively. Although LightCSPNet greatly reduces the number of parameters and computational cost, its inference time on the GPU does not decrease much. The reason is that the Nvidia GPUs used in PCs have hardware optimizations and libraries such as cuDNN for ordinary convolution and contiguous data, but not for depthwise separable convolution and discrete data. On the CPU, which lacks such optimization mechanisms, the inference time of LightCSPNet is only half that of MobileNetV2, which suggests LightCSPNet can perform even better on embedded devices without cuDNN.

4 Conclusion

Given the relatively limited computing power of embedded devices, a more lightweight and better-performing network, LightCSPNet, is proposed to accommodate low-performance devices. A dataset containing multiple defect types was constructed to test the algorithm on the defect detection target. Two public datasets, ImageNet100 and VOC2012, were used to train and test the algorithms for image classification and object detection, respectively. The main conclusions are as follows:

1. The proposed Down layer and Extraction layer, combining depthwise separable convolution with the CSP structure, reduce computation and the number of parameters without degrading performance.

2. For the classification task, LightCSPNet obtained better results on the ImageNet100 and defect datasets; compared with the other algorithms, it has fewer parameters and less computation with almost the same performance.

3. The new 12-category dataset constructed in this paper can be better used for defect detection tasks.

4. With techniques such as dropout and the MHSA structure, LightCSPNet achieves a further performance improvement.

5. For the object detection task, replacing the backbone with LightCSPNet increases the mAP on the VOC2012 dataset, and its running time on CPU is only half that of MobileNetV2. LightCSPNet can therefore be well applied to embedded devices that have only a CPU or a low-performance GPU.