Skip-YOLO: Domestic Garbage Detection Using Deep Learning Method in Complex Multi-scenes

Identifying all types of domestic garbage quickly and intelligently is of great significance for improving people's quality of life. Based on a visual analysis of how feature maps change in different neural networks, a Skip-YOLO model is proposed for real-life garbage detection, targeting the problem of recognizing garbage with similar features. First, the receptive field of the model is enlarged by a large convolution kernel, which enhances the shallow information of images. Second, the high-dimensional features of the garbage maps are extracted by dense convolutional blocks. Strengthening the sharing of shallow low-level and deep high-level semantic information increases the sensitivity to similar features within the same type of garbage. Finally, multiscale high-dimensional feature maps are integrated and routed to the YOLO layer to predict garbage type and location. Compared with YOLOv3, the overall detection accuracy increases by 22.5% and the average recall rate increases by 18.6%. In qualitative comparison, the model successfully detects domestic garbage in complex multi-scenes. In addition, this approach alleviates the overfitting problem of deep residual blocks. An application case on a waste sorting production line further highlights the generalization performance of the method.


Introduction
The intelligent recycling of municipal solid waste not only brings economic benefits, but also has research value in the social and natural sciences [1,2]. The recycling system for domestic garbage can be simply divided into four stages: garbage collection, transportation, transfer, and recycling. A reasonable waste management process can produce social and economic benefits. For example, a survey of rural areas in developing countries calls for an increased density of garbage collection points, which can reduce environmental degradation at the source [3]. Optimizing the placement of garbage bins in urban communities can reduce investment costs and improve recycling efficiency [4]. The attractiveness and visibility of trash cans help the correct collection and classification of waste [5]. Formulating the collection and transportation of municipal solid waste (MSW) as an integer program can provide the best solution for waste collection and transportation [6]. A multi-level solid waste collection and transportation system with operating stations and a reliability framework can be used to manage the uncertainty of multi-level solid waste management (SWM) systems [7]. Improving the harmless treatment of domestic waste or extracting valuable organic matter from organic waste helps solid waste recycling [8,9]. Processing domestic waste intelligently is important because it further coordinates urban development and environmental protection [10]. A simulation of various MSW management policies from a dynamic and complex perspective found that economic policies have a great impact on the future management of municipal solid waste [11]. The sensor-based Internet of Things (IoT) can improve the handling of food waste across generation, collection, transportation, and final disposal [12]. A nonlinear autoregressive (NAR) model can be used to predict the future generation of MSW [13]. In general, many scholars are searching for efficient waste disposal systems, which require efficient waste sorting methods.
With the rapid development of computer vision technology, deep learning methods have gradually attracted the attention of researchers. For example, a trained convolutional neural network (CNN) can intelligently classify glass and metal in garbage bags [14]. Optimizing the fully connected layer of a CNN with a genetic algorithm (GA) can improve the performance of waste detection [15]. However, the fully connected layer requires many parameters to be optimized, so fast, real-time recognition is difficult to achieve. An intelligent waste classification system based on ResNet can classify waste accurately [16,17]. However, it can neither identify the remains nor locate the object. An intelligent waste management architecture that combines IoT and CNN can monitor digestible and non-digestible waste in real time, but it needs a lot of hardware support [18]. In addition, deep learning has been successfully applied to the separation and classification of waste electrical and electronic equipment (WEEE) batteries [19], E-waste collection [20], construction solid waste classification [21], and automatic detection of waste in water [22]. The efficient and accurate detection of domestic waste will assist the intelligent development of waste treatment. A garbage collection robot that uses neural networks to identify garbage and a surface cleaning robot based on the YOLOv3 algorithm can both replace manual garbage collection [23][24][25]. However, the number of garbage types detected is limited and the background is uniform, which differs from real life.
Object detection performance is challenged when domestic waste is diverse and the scene in which it appears is uncertain. For example, disposable chopsticks and banana peels are both non-recyclable garbage, but their attributes differ greatly. The problem of target occlusion can be solved by designing high-performance filters to extract high-quality feature maps [26,27]. Within a deep learning framework, multi-level feature fusion performs well in extracting the detailed features of targets [28,29]. The skip connection based on the residual convolution module has significant advantages in image super-resolution reconstruction. In response to this problem, this paper proposes a method that combines YOLOv3 with densely connected convolutional blocks [30,31]. This method can be applied to domestic garbage detection in multiple natural scenes and under different quantity distributions. First, we analyze different types of neural networks by visualizing the feature mappings during network training. Second, a feature extractor is constructed from densely connected convolutional blocks to obtain high-dimensional feature mappings. Finally, multi-scale high-dimensional feature mappings are merged, and three different YOLO layers are used to predict various types of domestic waste. In addition, we provide an application case in the experimental part as a reference for intelligent waste management.
The main contributions of this paper are as follows: (

Deep Learning Methods
The different types of debris are the objects to be detected, and the model usually needs to be trained before making predictions. The deep learning model is composed of four parts: the Shallow network, the Backbone network, the Neck, and the Head. Among these, the Shallow network and the Backbone network are mainly responsible for extracting semantic information such as shape, color, and location from the input and converging it into high-dimensional feature mappings. The Neck optimizes the extracted high-dimensional feature mappings, which helps the Head decode higher-quality features.

Object Detection Algorithm
YOLOv3 is a one-stage, anchor-based object detector, which is mainly composed of the Darknet architecture, three convolution sets, and three YOLO layers. The YOLOv3 algorithm is widely used in fields such as construction [32], agriculture [33], and transportation [34]. As shown in Fig. 1, Darknet is mainly composed of five residual blocks and several convolutional layers connected to the residual blocks. A convolution set alternates 1 × 1 × c and 3 × 3 × 2c convolutional layers (1 × 1 × c means that the filter kernel size is 1 × 1 and the number of channels is c) to effectively extract and merge the mapping information. Among them, the 1 × 1 × c convolutional layers compress the feature information that was expanded by the previous convolutional layer. The 3 × 3 × 2c convolutional layers expand the feature information and reduce the number of model parameters without changing the scale of the input feature.
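The channel bookkeeping of a convolution set can be sketched in a few lines of Python. This is only an illustration: the five-layer length, the input width of 512, and c = 256 are assumptions chosen for the example, and the helper names are hypothetical rather than part of Darknet.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def convolution_set(c_in, c, n_layers=5):
    """Alternate 1x1xc and 3x3x2c layers.

    Returns one (kernel, c_in, c_out, weights) tuple per layer.
    n_layers=5 is an assumed length, not a value taken from the paper.
    """
    layers = []
    channels = c_in
    for i in range(n_layers):
        if i % 2 == 0:
            kernel, c_out = 1, c        # 1x1 layer compresses to c channels
        else:
            kernel, c_out = 3, 2 * c    # 3x3 layer expands to 2c channels
        layers.append((kernel, channels, c_out,
                       conv_weights(kernel, channels, c_out)))
        channels = c_out
    return layers


if __name__ == "__main__":
    for kernel, cin, cout, w in convolution_set(512, 256):
        print(f"{kernel}x{kernel}: {cin:4d} -> {cout:4d}  weights={w}")
```

Under these assumptions each 3 × 3 layer reads 256 channels instead of 512, so it needs 9 · 256 · 512 weights rather than 9 · 512 · 512, which is the parameter saving the 1 × 1 compression provides.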
After each down-sampling step, the feature mapping enters the corresponding residual block to achieve multi-scale feature extraction. Finally, three groups of high-dimensional feature mappings with scales of 52 × 52, 26 × 26, and 13 × 13 are output. YOLOv3 draws on the idea of multi-scale feature fusion in FPN [35]: the 13 × 13 feature mappings are up-sampled and fused with the corresponding feature mappings of the two other scales. After multi-scale feature fusion, three YOLO layers are used for prediction and regression at the same time.
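The three output scales follow directly from the input resolution. A minimal sketch, assuming the usual YOLOv3 down-sampling strides of 8, 16, and 32 and three anchors per grid cell (neither value is stated in the text above):

```python
def grid_sizes(input_size=416, strides=(8, 16, 32)):
    """Side length of each output feature map for a square input."""
    return [input_size // s for s in strides]


def total_predictions(input_size=416, strides=(8, 16, 32), anchors_per_cell=3):
    """Number of candidate boxes over all scales (anchors_per_cell is assumed)."""
    return sum(g * g * anchors_per_cell for g in grid_sizes(input_size, strides))


if __name__ == "__main__":
    print(grid_sizes())         # [52, 26, 13]
    print(total_predictions())  # 10647
```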

Related Convolution Block
The residual block is composed of multiple residual units for image feature extraction, as shown by the black dashed box in Fig. 2. Each residual unit is composed of a 1 × 1 convolution kernel with k channels and a 3 × 3 convolution kernel. The input feature and the output of the 3 × 3 convolution kernel are joined by a residual connection, which continuously overlays input features of the same dimension. The calculation of the residual connection is as follows:

$$X_n = Y_n(X_{n-1}) + X_{n-1}$$

where $X_n$ represents the input feature of the nth layer and $Y_n(X_{n-1})$ indicates the convolutional output of the (n-1)th layer. Therefore, each residual unit is affected by the output of the previous residual unit.
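The element-wise addition performed by a residual connection can be illustrated with plain Python lists. This is a toy sketch: `transform` stands in for the 1 × 1 and 3 × 3 convolutions of a unit and is a hypothetical callable, not the paper's implementation.

```python
def residual_unit(x, transform):
    """Add the unit's transformed output to its own input, element by element."""
    y = transform(x)
    if len(y) != len(x):
        raise ValueError("residual addition needs matching dimensions")
    return [xi + yi for xi, yi in zip(x, y)]


def residual_block(x, transforms):
    """Chain several residual units; each one sees the previous unit's output."""
    for transform in transforms:
        x = residual_unit(x, transform)
    return x


if __name__ == "__main__":
    zero = lambda v: [0.0] * len(v)
    # With a zero transform the block reduces to an identity mapping.
    print(residual_block([1.0, 2.0], [zero, zero, zero]))
```

The zero-transform case shows why the input can pass through a deep stack unchanged, which is the property usually credited with countering network degradation.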
In contrast, the dense block is similar to an enhanced version of the residual block. Each dense block consists of several dense units, as shown by the red dashed box in Fig. 2. Each unit contains a 1 × 1 convolution kernel with k channels and a 3 × 3 convolution kernel with 4k channels. The number of input channels of each dense block is $k_0$. After n rounds of convolution stacking, a feature mapping with $k_0 + (n-1)k$ channels is the final output. Among them, the non-linear function $Y_n$ is obtained by applying batch normalization (BN) [36], the ReLU [37] activation function, and a 3 × 3 convolutional layer in turn. The calculation between densely connected convolutional blocks is as follows:

$$X_n = Y_n([X_0, X_1, \dots, X_{n-1}]) \tag{1}$$

where $[X_0, X_1, \dots, X_{n-1}]$ represents the concatenated input mappings from layer 0 to layer (n-1), $X_n$ represents the output of the nth layer, and $Y_n$ denotes the non-linear function. Therefore, the output of each dense block layer is related to all previous input layers.
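The channel growth of a dense block can be checked with a short Python sketch. The values k_0 = 64 and k = 32 are illustrative assumptions, and the helper names are hypothetical. The count k_0 + (n-1)k in the text matches the input width of the nth unit under the standard DenseNet concatenation; if that unit's own k maps are appended as well, the width becomes k_0 + nk. Which of the two the authors mean by the final output is not stated, so the sketch reports both.

```python
def dense_block_channels(k0, k, n_units):
    """Return (input width seen by each unit, width after all units are concatenated)."""
    inputs = [k0 + i * k for i in range(n_units)]  # unit i+1 sees k0 + i*k channels
    output = k0 + n_units * k                      # assumes every unit's k maps are kept
    return inputs, output


def dense_forward(x0, units):
    """Toy dense connectivity: every unit receives the concatenation of all earlier outputs."""
    features = [x0]
    for unit in units:
        concat = [v for f in features for v in f]
        features.append(unit(concat))
    return [v for f in features for v in f]


if __name__ == "__main__":
    inputs, output = dense_block_channels(k0=64, k=32, n_units=4)
    print(inputs, output)  # [64, 96, 128, 160] 192
```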

Overview
The main work of this paper is shown in Fig. 3, which summarizes the whole pipeline of the model. In stage 1, indoor and outdoor garbage images are collected in different scenarios and divided into two types: single-class and multi-class. Then, all images are resized to 416 × 416 and randomly allocated to the training, validation, and test data sets. In stage 2, we analyze the parameter transfer forms of three classic convolutional neural networks. Garbage of the same type can have quite different characteristics, which makes garbage detection more complicated. Therefore, capturing the most important pixels among similar characteristics is the key to garbage detection. This paper looks for an advantageous solution from the perspective of feature mapping. In stage 3, we use the results of the feature mapping analysis to improve the backbone network of YOLOv3. We conduct two different tests: one is based on the test data set, and the other is an application case that simulates a waste sorting production line. To test whether the proposed model has sufficient generalization ability, we include some untrained garbage in the production line application case.

Analysis of Feature Mapping
During the training of a deep learning model, the shallow network carries rich feature information, such as edge contours, brightness, and color. However, the lack of a sufficient receptive field limits shallow feature extraction. The deep network can not only express the global features of each object in the image, but also recognize the detailed information inside the object. As the network deepens, network degradation easily occurs because the useful features gradually become saturated. For example, when the shallow network detects expired drugs, only simple features such as rough outlines, colors, backgrounds, and shadows can be obtained. In contrast, the deep network focuses on semantic and detailed features, such as graphical information on the packaging.
The parameter transfer process of different networks is shown in Fig. 4. As shown by the black dashed box in Fig. 4, when a plain neural network (such as VGG16 [38]) learns parameters, linear transmission can reduce the impact of data fluctuations on the output. However, the continuous increase of network depth also leads to the gradual saturation of useful features and gradually causes network degradation. Therefore, some key pixels are missing in the high-dimensional feature mapping. As shown by the red dashed box in Fig. 4, in the residual network the output of the lower layer is affected by the input of the upper layer, which can generate more feature mappings. Therefore, the residual network is more sensitive to data fluctuations. This ability to use the features of the previous layer for identity mapping solves the problem of network degradation. However, the data description of the residual network is prone to overfitting in deep networks, which eventually affects the detection accuracy. The densely connected network is shown in the green dashed box in Fig. 4. The parameters of the upper layer can jump to later layers at will, so that each layer of the densely connected network contains the information of all previous layers when learning its parameters. Both densely connected networks and residual networks use the features of previous layers for mapping learning, but each layer of a densely connected network learns fewer features. Therefore, it can choose more flexibly which effective information to learn when the data fluctuates. This method can effectively alleviate deep network overfitting while reducing network redundancy.

Fig. 3 Illustration of the main research work throughout this study. In stage 1, a domestic waste data set is built, which contains simple and complex garbage objects. In stage 2, high-quality feature mappings have more important pixels. However, the quality of feature mapping is affected by the characteristics of the object itself. Therefore, we design an improved model for the feature extraction of domestic waste in stage 3. Finally, we evaluate the performance of this approach on the test data set and apply it to real scenarios

Improved Model
With the help of the residual network structure and multi-scale feature fusion, YOLOv3 performs significantly better than other neural networks in detecting animals or people. However, the features of the domestic garbage data set differ markedly from those of public data sets (such as ImageNet and PASCAL VOC), which leads to poor detection performance in real applications. In addition, household garbage is arranged chaotically, and the same kind of garbage contains many different objects. Therefore, the features of garbage in the same class are quite different, which produces a certain degree of data fluctuation. This data fluctuation causes the network to overfit while the deep residual network is learning. The training results then lack sufficient generalization, which ultimately affects the average accuracy. To solve the problem of overfitting, we propose a Skip-YOLO model for domestic garbage detection. This model uses dense blocks to extract high-dimensional feature maps and combines them with the multi-feature fusion of the YOLOv3 algorithm.
The Skip-YOLO model is shown in Fig. 5. First, a 7 × 7 convolution kernel is used to extract the original image information, which enlarges the receptive field of the shallow features. The sensitivity of the shallow network to garbage of the same type is also improved. Second, the Darknet framework is improved with the skip connections of dense blocks, which allows shallow and deep high-level semantic information to be shared. To improve the detailed information and the expressive ability of the model, a deeper dense block is constructed to extract deep detailed features at different scales. Finally, the high-dimensional feature mappings of different scales are fused and sent to the YOLO layer to achieve end-to-end regression. Assuming that there are 4 convolution units in a dense block and each unit outputs m-dimensional features, each dense block will output 4m-dimensional features. Among them, each layer of the dense block learns fewer garbage features and the output information is relatively scattered. Therefore, the actual dimension of the output feature mappings is much larger than the theoretical estimate. To make more effective use of the features extracted by dense blocks, this paper designs a CBL convolution block that compresses the scattered convolution information in the layer preceding the YOLO layer. CBL consists of one 3 × 3 convolutional layer whose number of convolution kernels equals the dimension of the output feature convolution. The BN operation and the Leaky ReLU activation function are applied after the convolution.
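The two operations that follow the convolution in a CBL block can be sketched on a flat list of activations. This is a simplified illustration, not the Darknet implementation: the convolution itself is omitted, the normalization uses the statistics of the list it is given, and the negative slope of 0.1 is an assumed value.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize to zero mean and unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def leaky_relu(values, slope=0.1):
    """Pass positive values through; scale negative values by `slope` (assumed 0.1)."""
    return [v if v > 0 else slope * v for v in values]


def cbl_post(conv_out):
    """BN followed by Leaky ReLU, applied to an already computed convolution output."""
    return leaky_relu(batch_norm(conv_out))


if __name__ == "__main__":
    print(cbl_post([1.0, 2.0, 3.0]))
```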

Data Set
We photographed common domestic garbage on grass, on pavement, in dormitories, and in other scenes using a Nikon D5300 camera. The original size of each image is 3020 × 3020 × 3, and the color channels are RGB. The outdoor backgrounds mainly include dark grass, light grass, mud, concrete, and streets. Likewise, the indoor backgrounds mainly include dormitories and black garbage bags. In addition, this work divides garbage into recyclable garbage, non-recyclable garbage, harmful garbage, and other garbage. Therefore, each category in the data set contains two or more kinds of similar garbage. For example, harmful garbage includes batteries and medicine bottles. To better test the generalization ability of this approach, the sample distribution of the data set needs to be complicated. Therefore, the data set contains 304 single-class images and 914 multi-class images. Among them, 80% of the images are used for training, 10% are used for validation during training, and the remaining 10% are used for testing after model training is completed. The results of the ablation experiment show that a deeper backbone increases the complexity of the deep learning model, which helps to extract the characteristics of garbage in different backgrounds. However, background information that resembles garbage characteristics can easily be misidentified. The ability to distinguish background information from object features can be improved by a suitable K value.
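The 80/10/10 allocation can be reproduced with a seeded shuffle. In this sketch the file names and the seed are hypothetical, and giving the rounding remainder to the test split is an assumption, since the paper does not report exact split sizes.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle and split into 80% train, 10% validation, and the rest for test."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded so the split is repeatable
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test split
    return train, val, test


if __name__ == "__main__":
    # 304 single-class + 914 multi-class images; names are placeholders.
    names = [f"img_{i:04d}.jpg" for i in range(304 + 914)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))  # 974 121 123
```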

Implementation Details
All experiments in this paper were performed on a laptop computer with the following configuration: Intel(R) Core(TM) i7-9700H CPU @ 2.6 GHz, GeForce RTX 1660Ti GPU (6 GB), 16 GB RAM, and 64-bit Windows 10. The deep learning framework is Darknet with CUDA 10.1 and the cuDNN 7.6.4 neural network acceleration library. All images are resized to 416 × 416 before training. The training process uses multi-scale and iterative learning strategies. All experiments set the initial learning rate to 0.001, the learning rate decay coefficient to 0.1, the weight decay coefficient to 0.0005, and the momentum to 0.9.
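How these hyperparameters interact can be sketched as follows. The milestone iterations (8000 and 9000) are hypothetical because the paper does not say when the decay is applied, and the update rule is one common formulation of SGD with momentum and weight decay, not necessarily the exact rule Darknet uses.

```python
def learning_rate(iteration, base_lr=0.001, decay=0.1, milestones=(8000, 9000)):
    """Step schedule: multiply the rate by `decay` at each milestone passed.

    The milestones are placeholders, not values from the paper.
    """
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * decay ** passed


def sgd_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One scalar update of SGD with momentum and L2 weight decay."""
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity


if __name__ == "__main__":
    for it in (0, 8000, 9500):
        print(it, learning_rate(it))
```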

Evaluation Metrics
In the field of object detection, common model evaluation metrics include precision (P), recall rate (R), F1 score, and average precision (AP, mAP, etc.). They are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

where TP is the number of positive samples predicted correctly, FP is the number of negative samples mistaken for positive samples, and FN is the number of positive samples mistaken for negative samples. Both AP and mAP represent the average precision of the model. The difference is that AP measures the performance of the model on a single category, while mAP measures the overall performance of the model. In practical applications, AP50, AP75, and similar indicators are commonly used to evaluate model performance. For example, AP50 represents the detection accuracy of the model when the IoU threshold is 50%. The relation between AP and mAP is as follows:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where N is the number of categories and $AP_i$ is the AP of the ith category.
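The metrics above can be computed with a few small functions. This is a minimal sketch: the counts and per-class AP values in the usage example are made-up illustrative numbers, not results from the paper, and boxes are assumed to be given as (x1, y1, x2, y2) corners.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def mean_ap(ap_per_class):
    """mAP as the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)
    print(p, r, f1_score(p, r))
    # Illustrative AP values for the R, N, H, O categories (not paper results).
    print(mean_ap({"R": 0.90, "N": 0.80, "H": 0.95, "O": 0.85}))
```

A detection would count as a true positive for AP50 when its `iou` with a ground-truth box of the same class is at least 0.5.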

Ablation Study
This work explores the performance of Skip-YOLO through ablation experiments. The backbone of Skip-YOLO consists of four different dense blocks, each of which can be given a different growth rate K. At the same scale, the dense skip network stacks the input features in order along the channel dimension. The influence of the amount of information contributed by different feature stacks on the model is further analyzed by choosing appropriate K values. Therefore, the ablation experiment first obtains three different models by adjusting the backbone depth. Second, starting from the backbone with the best performance, further models are obtained by adjusting the growth rate K. Finally, we analyze the qualitative and quantitative comparison results of the experiments.
The results of the quantitative comparison are shown in Table 1 and Fig. 6. The AP50 of Skip-YOLO-0, which has the deepest backbone, is 0.26% lower than that of Skip-YOLO-4. However, its mAP50 reaches 90.38%, indicating that a deeper backbone is good at fitting complex data in complex backgrounds. Fig. 6a, c show that the average accuracy of Skip-YOLO-0 increases steadily and that its loss function fluctuates little. Fig. 6b, d show that the accuracy curves differ little after the K value is adjusted. When the K value is 32, the loss curve is the most stable. The K value has a more significant impact on model complexity than the depth of the backbone.
As shown in the first row of Fig. 7, Skip-YOLO-0 with the deepest network and Skip-YOLO-4 with the maximum K value both detect the white-light background as recyclable garbage and hazardous garbage. Thus, both types of garbage have some characteristics similar to white light. When the model has a deeper backbone or a larger K value, its ability to express detailed information is stronger, but irrelevant background information may be wrongly detected as an object. As shown in the third row of Fig. 7, when the Skip-YOLO-0, 1 and 2 models detect a large single object, the shallower backbones have poor regression performance. Among them, the shallowest network (Skip-YOLO-1) even produces a false detection. Comparing Skip-YOLO-0, 3 and 4 shows that the prediction box becomes smaller as the K value decreases. However, when the value of K increases, the model eventually learns more features, and it may mistake redundant background for features of the detected object. As shown in the second row of Fig. 7, the false and missed detections of Skip-YOLO-1 are obvious when the objects are concentrated. When the number of network layers decreases, the probability of false detection increases. However, Skip-YOLO-3 has the best detection effect because its K value is the smallest. In the fourth row of Fig. 7, the objects are evenly distributed. Comparing Skip-YOLO-0, 1 and 2 shows that Skip-YOLO-1 has no false detection. Comparing Skip-YOLO-0, 3 and 4 also shows that reducing the K value can effectively improve the detection performance of the deep network.

Analysis of Different Models
As presented in Table 2, the average accuracy values AP-N and AP-O are lower. The reason is that the characteristics of garbage within these categories are very different, which makes the data fluctuate greatly during training and ultimately affects the detection accuracy of the model. In addition, the improved model is much deeper than YOLOv3, which makes detection more time-consuming but more accurate. Fig. 8 also shows that the classification and recognition ability of this approach is significantly better than that of YOLOv3. Comparing the first and second columns of Fig. 9 shows that YOLOv3 mistakenly detects hazardous waste (expired drugs, batteries, etc.) as other waste in single-category detection. When the object is small and the positions are relatively scattered, the redundant background is wrongly detected as garbage. Comparing the third and fourth columns of Fig. 9 shows that YOLOv3 has serious missed detections. Although the current approach has some false detections, its improved generalization performance allows more targets to be identified successfully. The comparison of generalization performance is shown in Fig. 10. Both YOLOv3 and this approach perform relatively well when trained images are selected for testing. However, YOLOv3 shows different degrees of false or missed detection when untrained images are used for the same test. For example, in the second column, other garbage is detected as recyclable garbage. The confusion matrix generated from the untrained images shows that this approach recognizes the various types of garbage better. This is because a dense block can selectively learn a small amount of effective information when fitting a data set, which effectively alleviates data fluctuations and improves generalization performance.

Comparison of Current Advance Detection Methods
The proposed method (Skip-YOLO-0) is compared with current models to show its importance and advantages, and the model performance is compared in Table 3. Among them, FasterDet is a new lightweight model that combines ShuffleNetV2 [37] and multi-scale fusion. DETR is an object detection framework that uses a Transformer for encoding and decoding. FasterDet, DETR, YOLOv5, and Skip-YOLO-0 are selected for qualitative comparison in this study. As shown in the first row of Fig. 11, the lightweight FasterDet model makes many recognition errors because the features it extracts are limited. DETR, which uses a Transformer for end-to-end prediction, has the best recognition effect, and the other methods perform well. However, the second row of Fig. 11 shows that both the lightweight model and the Transformer-based method miss some detections. Besides, the YOLO-based methods perform well in dense object detection in this case study. The third row of Fig. 11 shows that the Transformer is not well suited to large object detection, while Skip-YOLO has the best recognition effect. The other methods show different degrees of recognition error. In the fourth row of Fig. 11, where the objects are relatively scattered, all methods have a good recognition effect. The proposed method performs well in large object recognition and dense multi-object classification. Therefore, the proposed method is more applicable and robust in real scenarios.

Instance of Application
As shown in Fig. 11, a common industrial camera is used to detect moving domestic waste on a conveyor belt. This setup simulates the application of the method to a waste sorting production line. Part of the detection effect is shown in Fig. 12. Although some objects and backgrounds have not been trained, the deep learning methods still have a certain detection effect. Among them, YOLOv3 and FasterDet mistake the background for other garbage, and their false detections are more serious. In contrast, Skip-YOLO is not affected by unfamiliar backgrounds, although it has some missed and false detections. DETR also has some false and missed detections. Figure 13 shows part of the test results; from top to bottom, the rows are Skip-YOLO, YOLOv3, FasterDet, and DETR.

Conclusion
This paper reports a Skip-YOLO model for the intelligent detection of domestic waste, aimed at the problems of low similarity among domestic waste characteristics and complex scenes. First, this paper visualizes the feature mappings in different neural networks. Second, the backbone network is improved with dense blocks, which helps to extract high-quality, high-dimensional feature mappings and to suppress deep network overfitting. Finally, high-dimensional feature mappings of different scales are fused, and garbage detection is completed by the YOLO layer. The ablation experiments show that a deeper backbone has a stronger feature expression ability. However, it also carries a risk of falsely detecting redundant background. Therefore, setting a reasonable growth rate for the dense blocks can prevent excessive learning of background features and control the size of the bounding box. The experimental results indicate that, compared with the YOLOv3 model, this approach increases the mAP50 by 22.5% and the average recall rate by 18.6%. Among them, the precision for non-recyclable garbage and other garbage reaches 81.48% and 88.77%, respectively. The approach performs well in the qualitative experiments and on the waste sorting production line. The proposed method also suggests the following improvements for future work. For example, the essence of parameter skipping is to further optimize the back-propagation path of the neural network. Therefore, deep learning strategy optimization can also be achieved by connecting different network parts or embedding similar attention mechanisms. In addition, complex data set labeling is time-consuming, which makes it necessary to develop an intelligent

Fig. 2 Illustration of different convolution blocks

Fig. 4 Parameter transfer process of different networks. The characteristics of the input images used for visualization are quite different. Each visualization image in the figure comes from a representative channel image in the feature mappings

Fig. 5 Network structure of architecture

Fig. 6 Quantitative comparison of ablation study

Fig. 7 Qualitative comparison of ablation study. From left to right: Skip-YOLO-0, Skip-YOLO-1, Skip-YOLO-2, Skip-YOLO-3, and Skip-YOLO-4

YOLOv5 and YOLOv7 are the current advanced detection methods of the YOLO series. All experiments use the same hyperparameters and train on the same data set. When the loss value converges, the training is terminated. From the comparison

Fig. 8 Quantitative comparison of different models. a F1-score curves. The F1 score is a common indicator for classification problems. b PR curves. The curve with good performance will completely cover the curve with poor performance

Fig. 10 Generalization performance comparison of different models. From top to bottom: YOLOv3, Skip-YOLO

Table 2 Quantitative comparison of different models. The recyclable garbage, non-recyclable garbage, harmful garbage, and other garbage are denoted by R, N, H, and O, respectively. In addition, average accuracy is represented by AP-R, AP-N, AP-H and AP-O, respectively

Table 3 Quantitative comparison of current models