1 Introduction

With the advancement of industrialization and socialization, it is estimated that the total volume of waste will increase by 20% to 2.6 billion tons per year after 2030 [17]. Environmental pollution caused by inadequate waste management poses a global challenge that requires immediate attention. For instance, around 6 trillion tons of plastic waste litter the global oceans, and the difficulty of degrading plastic contributes to long-term pollution hazards. Moreover, soil and water pollution are also globally prevalent [9, 27]. By 2020, waste recycling rates in a few countries remained unsatisfactory, barely reaching 35% [34].

At present, waste treatment plants primarily employ semi-manual and semi-automated waste classification methods [25]. These methods, however, are not only inefficient but also pose significant health risks to workers, such as inhaling excessive dust and harmful chemical gases emanating from waste. Therefore, there is an urgent need to incorporate deep learning and computer vision into waste classification and recycling [12], which promises a healthier approach by significantly enhancing waste classification efficiency and recyclability and ultimately fostering environmental protection and economic growth.

Although waste classification models [8, 12, 24, 33, 42] have demonstrated remarkable classification accuracy, waste classification and detection in real scenarios remain insufficiently researched. These scenarios involve a wide variety of waste types, intricate backgrounds, diverse waste forms, and different real-world light sources. Most waste classification models were refined and improved on the Labeled Waste in the Wild [32], Trash Annotation in Context (TACO), and TrashNet [39] datasets. These datasets share three characteristics: a small amount of waste data (due to the high cost of manual labeling), pristine backgrounds, and a rather simplistic representation of waste scenes. Despite their high accuracy, these models largely overlook the complexities of real waste classification scenarios, which could result in overfitting.

Therefore, in this paper, we propose a deep-learning-based semi-supervised semantic segmentation model for the Zerowaste dataset (the first real-world industrial-grade waste dataset) [1], specifically designed for highly complex real waste classification scenarios. We choose semantic segmentation over enhancing related visual object detection methods because of the extreme clutter we observed in real waste classification scenarios. When waste is abundant and varied, items obscure one another. Semantic segmentation partitions the entire image into regions without gaps, with each region belonging to one category, and it is sensitive to the size, shape, and state of objects. Our method not only improves waste classification and reduces the extensive pixel-level manual annotation work but also alleviates the data-hunger problem [23] during model training. Semi-supervised learning methods have been developed for a considerable time, from collaborative training [5, 30] and self-training [15, 38] to generative adversarial networks [23] and the currently popular consistency regularization [6, 16]. How can we offset the scarcity of waste datasets, the shortcomings of waste datasets, and the complexity of waste dataset scenarios? How can we employ semi-supervised learning to improve waste classification efficiency and accuracy? How can we ameliorate the extreme lack of pixel-level annotation in semi-supervised semantic segmentation? These questions are addressed in this paper.

In this paper, we concentrate on the consistency regularization framework and tackle the aforementioned issues by introducing non-uniform data augmentation tailored to the characteristics of real waste classification scenarios. We found that existing data augmentation methods, such as Cutout, Cutmix [40], and Mixup [13], have limitations. For instance, Cutout omits portions of the image and cannot use the full image information, while images produced by Mixup are often unnatural due to local blurring. In contrast, our non-uniform data augmentation neither adds nor removes image information. It is smooth and continuous, and each changed pixel remains correlated with its neighbors, which avoids adding noise directly to the image and preserves the generalization ability of the model. Moreover, we found that two categories in the Zerowaste dataset have low data volume, which leads to a substantial class imbalance.

To counteract the escalating misclassification costs, we design an adaptive weighted loss function that assigns different weights according to the data volume of each category. We also add a mask that adaptively adjusts the number of positive and negative samples, filtering out features that the model has already learned particularly well without imposing a limit on their number. Although this improvement is motivated by the waste dataset, it can be applied to other imbalanced datasets. Our contributions are summarized as follows:

  1. We construct a semi-supervised semantic segmentation waste detection model applied to real waste classification scenarios. Our experimental results demonstrate its state-of-the-art performance.

  2. A novel non-uniform data augmentation method is proposed. It simulates natural light, making the model better adapted to the waste classification scenario. Additionally, it increases the amount of data and reduces network overfitting. Our method is smooth and continuous, which preserves the generalization ability of our proposed models.

  3. An adaptive weighted loss function is designed to counter the lack of model robustness that results from severe data imbalance.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 describes our methodology. The experiments and ablation studies are detailed in Sections 4 and 5. Finally, our conclusions are presented in Section 6.

2 Related work

2.1 Waste detection

Over the past few years, waste datasets have been updated continuously. The TrashNet dataset for recyclable waste [39] contains more than 2,000 annotated images in six classes, i.e., cardboard, plastic, paper, metal, glass, and trash, all photographed under natural light. Each image contains only one object, and the background is mostly white. Subsequently, UAVVaste [18] was publicly released. It is a collection of outdoor waste images captured by drones in urban environments, with about 770 images and over 3,700 waste objects, all labeled with the single category “waste”.

Finally, a more popular waste dataset is TACO. Its images have intricate backgrounds, including indoor, beach, street, vegetation, and water scenes, with 1,500 images in 28 categories and about 4,800 labeled objects. However, some annotations in TACO do not belong to a waste category, which can interfere with model learning. Although these three typical waste datasets each have their own characteristics, they do not fully reflect real waste classification scenarios and are more suitable for waste detection in the wild.

Meanwhile, the performance of waste detection with deep learning models has gradually improved. A genetic algorithm was employed to optimize a traditional CNN model, DenseNet121, which achieved an accuracy of 99.6% on the TrashNet dataset [24]. In another work, a model was verified on the RGBD-based MJU-waste segmentation dataset by first performing a coarse segmentation of the data and then selecting the target region for fine segmentation. With ResNet-101 [14] as the backbone network and DeepLabv3 [4] as the baseline, the mean IoU was 97.14 [33]. Finally, a lightweight improvement of the one-stage deep learning model YOLO achieved an mAP of 94.79% on the waste classification task. All three waste classification models achieved an accuracy of more than 90% [8]. However, they were applied to datasets with simple backgrounds, single waste items, and little stacking or occlusion, so their detection results may be uneven in more complex settings. These improved algorithms lack relevance to real scenarios and give little consideration to the real form and characteristics of waste. Therefore, we improve model performance through data augmentation and an improved loss function designed for real waste plant classification scenarios, based on the Zerowaste dataset.

2.2 Semi-supervised learning

The core idea of semi-supervised learning is to maximize the use of unlabeled data to advance model learning. Consistency regularization [23, 29, 35] and entropy minimization [2, 3, 10] are two popular semi-supervised learning paradigms. Consistency regularization reduces overfitting: if unlabeled data is perturbed, its prediction should not change significantly, i.e., it has predictive consistency. Entropy minimization combines unlabeled data, labeled data, and pseudo-labeled data to make the network predictions more confident. In addition, data augmentation strategies such as Cutout, Cutmix [40], and Mixup [13] are applied in semi-supervised learning to increase the utilization of unlabeled data and improve model performance. Nevertheless, these methods may not preserve the localization and generalization ability of the data. In contrast, our non-uniform data augmentation focuses on smooth, continuous changes at the pixel level, which preserves the generalization ability of the proposed model.

2.3 Semi-supervised semantic segmentation

Unlike object detection, semantic segmentation operates at the level of individual pixels: a semantic label is assigned to every pixel position in the image. However, pixel-level annotated data is much more difficult to obtain, and models trained on little of it suffer from severe overfitting, so the development of semi-supervised semantic segmentation is inevitable. In the early stage, typical Generative Adversarial Network (GAN) [11] models extracted valid training signals from unlabeled data [28]. Later, semi-supervised semantic segmentation followed semi-supervised learning paradigms such as entropy minimization and consistency regularization for model training [2, 3, 10, 26, 29, 35]. Data augmentation methods have likewise been shown to be crucial for semi-supervised semantic segmentation. Therefore, our approach greatly reduces the labor cost required for pixel-level labeling, improves the performance of the semi-supervised semantic segmentation model, and further addresses the lack of data for waste classification tasks.

3 Our method

3.1 The structure of our framework

In this paper, we use U-Net as the baseline and ResNet-50 as the backbone network to validate our proposed non-uniform data augmentation and adaptive weighted loss function. The network structure is illustrated in Fig. 1. When training with unlabeled data, an L1 loss is computed between the predictions for data without and with non-uniform data augmentation. This loss enforces consistency between the two predictions: the smaller the L1 loss, the better the model. Finally, drop perturbation is added to both the original input channel and the feature channel to improve the generalization ability of the model.
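The consistency term described above can be sketched as follows. This is a minimal NumPy illustration rather than the training code of this paper: the function names, the use of softmax probabilities, and the pixel-by-pixel comparison are assumptions made for the example (a pixel-by-pixel comparison strictly suits augmentations that leave pixel positions unchanged).

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def l1_consistency_loss(logits_plain, logits_augmented):
    """Mean absolute difference between two per-pixel class distributions.

    Both inputs have shape (num_classes, H, W): the network outputs for an
    unlabeled image without and with augmentation (hypothetical names).
    """
    p_plain = softmax(logits_plain, axis=0)
    p_augmented = softmax(logits_augmented, axis=0)
    return float(np.abs(p_plain - p_augmented).mean())

# Toy example: 5 classes (4 waste classes + background) on a 4x4 image.
rng = np.random.default_rng(0)
logits_plain = rng.normal(size=(5, 4, 4))
logits_augmented = logits_plain + 0.5 * rng.normal(size=(5, 4, 4))

loss_identical = l1_consistency_loss(logits_plain, logits_plain)
loss_perturbed = l1_consistency_loss(logits_plain, logits_augmented)
```

Identical predictions give a loss of zero, and the loss grows as the two predictions diverge, which is the behavior the consistency objective relies on.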

3.2 Non-uniform data augmentation

Our non-uniform data augmentation encompasses not only lighting tasks to simulate real scenes, but also polymorphic tasks to emulate waste objects. These correspond respectively to non-uniform color data augmentation and non-uniform offset data augmentation.

Fig. 1 The structure of our network

Fig. 2 Typical samples of non-uniform color data augmentation, which is applied with a completely random strategy

Non-uniform color data augmentation

Non-uniform color data augmentation randomly simulates real-world natural lighting, which differs from random brightness data augmentation [36]. Random brightness simply increases or decreases all pixel values by the same random amount, while our data augmentation changes each pixel by a different random amount, as shown in Fig. 2. Figure 2a to d show the light being shaded from the top, bottom, left, and right of the object. Figure 2e has darker light near the edges of the image and brighter light near the middle, while Fig. 2f is exactly the opposite. Figure 2g and h represent intensely brighter and deeply darker light, respectively. Figure 2 shows only eight typical samples of non-uniform color data augmentation; similar but not identical samples are shown in Fig. 3.

Fig. 3 Selected samples of atypical non-uniform color data augmentation

Fig. 4 Comparison of the original image and the image after data augmentation. a is the original image in the Zerowaste dataset, in which the light is also blocked: the lower left corner is darker than the upper right corner because the light is blocked on the left side. b is the image after non-uniform color data augmentation, in which the light is blocked from the upper left and the lower right corner is brighter

When non-uniform color data augmentation is applied to the Zerowaste dataset, the positions of the pixels are left unchanged, and the augmented samples are obtained by randomly and continuously changing the pixel values. As shown in Fig. 4b, the light from the top left of the image is blocked, while the light from the bottom right is enhanced after augmentation. Compared with Fig. 4a, the upper left part of Fig. 4b is darkened and the lower right part is brightened, simulating waste samples with the same number of objects, the same objects, and the same shooting angle under different lighting.
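To make the contrast with uniform random brightness concrete, the sketch below scales brightness by a gain that varies smoothly across the image. It is only an illustration under stated assumptions: the text does not give the exact color formula, so the linear ramp, the `strength` and `angle` parameters, and the function name are all hypothetical.

```python
import numpy as np

def nonuniform_brightness(image, strength=0.4, angle=0.0):
    """Scale brightness by a gain that changes linearly across the image.

    image: float array in [0, 1] with shape (H, W, C).
    strength: hypothetical maximum relative change along the ramp.
    angle: direction in radians along which the light gets brighter.
    Pixel positions are untouched; only values change, and smoothly.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Centered coordinates scaled to [-1, 1] along each axis.
    xs = (xs - (w - 1) / 2) / max((w - 1) / 2, 1)
    ys = (ys - (h - 1) / 2) / max((h - 1) / 2, 1)
    ramp = xs * np.cos(angle) + ys * np.sin(angle)
    gain = 1.0 + strength * ramp
    return np.clip(image * gain[..., None], 0.0, 1.0)

# A flat gray image becomes darker on the left and brighter on the right.
image = np.full((4, 6, 3), 0.5)
lit_from_right = nonuniform_brightness(image, strength=0.4, angle=0.0)
```

Drawing `angle` and `strength` at random for each sample would give a different lighting direction and intensity every time, while neighboring pixels still change by nearly the same amount.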

Non-uniform offset data augmentation

Figure 5b appears as a compressed version of Fig. 5a, or as if all of its pixels were translated upward. Figure 5d appears as an enlarged version of Fig. 5c, or as if all of its pixels were panned to the right. This is our non-uniform offset data augmentation, which differs from simple scale and aspect-ratio-based transformations for image augmentation. Specifically, each pixel in the image is given a different, random offset, such that one pixel is shifted by 1 pixel and another by 3 pixels, and the offset is gradually reduced once it reaches a maximum peak value.

Fig. 5 Comparison of the original image with the image after data augmentation. a and c are original images in the Zerowaste dataset, in which almost all the waste is distorted and folded in different forms. After non-uniform offset data augmentation, the morphology of the waste objects in b and d is smoothly varied

Let \(i \in I^{W\times H\times C}\) denote a training sample with two spatial dimensions x and y. The purpose of non-uniform offset data augmentation is to generate a new training image \(\stackrel{\sim}{i}\)(x,y), which we define as follows:

$$\stackrel{\sim}{i}\left(x, y\right)= \varDelta i\left(x,y\right)\ast\sin\left(2\pi\ast r/{v}_{c}\right)+ i(x,y),$$
(1)

and

$${v}_{c}= 1200+200*\left(\left(r-0.5\right)*2\right)$$
(2)

where x and y are the coordinates of each pixel along the x-axis and y-axis, respectively. \(\varDelta i\)(x, y) is the initial value of the offset that we randomly assign to the pixels; it is multiplied by the sine function to obtain the final pixel offset. r is a random number drawn from a uniform distribution over the range 0 to 1.0, which makes the value of the sine function random as well. Finally, \({v}_{c}\) indicates the random peak of the sine wave. The equation of \(\varDelta i\)(x,y) is shown as follows:

$$\begin{array}{cc}x\sim Unif\left(0,W\right),&y\sim Unif\left(0,H\right),\end{array}$$
(3)

and

$$\varDelta i\left(x,y\right)=\left\{\begin{array}{lc}\varDelta x=v_o+15\ast\left(\left(r-0.5\right)\ast2\right)&r<0.5\\\varDelta y=v_o+15\ast\left(\left(r-0.5\right)\ast2\right)&r\geq0.5\end{array}\right.$$
(4)

where \({v}_{o}\) is a fixed value of 70, to which a random amount is added or subtracted to obtain the final value of \(\varDelta i\)(x, y). Equation 4 means that Δi = Δx when r < 0.5 and Δi = Δy when r ≥ 0.5. Thus, we keep the offset within a range. In this experiment, the initial value of the offset lies between 55 and 85 (this range can be adjusted for different datasets). The value 70 and the range of randomly selected offsets are the best choices found in our experiments. As illustrated in Eq. 1, we calculate the final offset by using the initial value and a sine function (other functions such as cosine, cubic, and quadratic functions can also be employed, depending on the characteristics of different dataset scenes).
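As a quick check of the ranges implied by Eqs. 2 and 4, the following sketch evaluates both expressions at the ends of the interval of r. The function names are hypothetical, and the sketch takes r as an explicit argument instead of sampling it, so that the values are reproducible.

```python
import math

V_O = 70.0  # fixed initial offset value stated in the text

def initial_offset(r, v_o=V_O):
    """Eq. 4: v_o + 15 * ((r - 0.5) * 2) for r in [0, 1]."""
    return v_o + 15.0 * ((r - 0.5) * 2.0)

def offset_axis(r):
    """Eq. 4 branch: the offset applies to x when r < 0.5, otherwise to y."""
    return "x" if r < 0.5 else "y"

def sine_parameter(r):
    """Eq. 2: v_c = 1200 + 200 * ((r - 0.5) * 2) for r in [0, 1]."""
    return 1200.0 + 200.0 * ((r - 0.5) * 2.0)

def offset_term(delta, r, v_c):
    """Offset term of Eq. 1, evaluated literally: delta * sin(2 * pi * r / v_c)."""
    return delta * math.sin(2.0 * math.pi * r / v_c)

offset_range = (initial_offset(0.0), initial_offset(1.0))  # (55.0, 85.0)
v_c_range = (sine_parameter(0.0), sine_parameter(1.0))     # (1000.0, 1400.0)
```

The initial offset therefore stays within 70 ± 15, matching the 55 to 85 range given above, and \({v}_{c}\) stays within 1200 ± 200.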

Table 1 The offset values

Table 1 presents the offsets along the x-axis. Our data augmentation method is continuous, with naturally smooth transitions. The Zerowaste dataset contains approximately 6,212 unlabeled images, and the deficiency in data volume can be substantially addressed by non-uniform data augmentation. Because our method maps each pixel of the augmented image back to the original image, the variations it creates in pixel value and position are distinct from noise addition, which merely darkens individual pixels and causes abrupt pixel changes. The changes are not independent per-pixel noise but vary continuously with a natural smooth transition. This enhances the generalization ability of our proposed model and brings it closer to actual waste classification scenarios. Furthermore, since non-uniform data augmentation neither deletes nor adds pixels, it circumvents the problem of erroneous positive and negative sample assignment due to image resizing.
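One way to picture a smooth, position-dependent offset is a warp in which every output pixel is read from a shifted location in the original image. The sketch below is an assumption-laden illustration, not the implementation used in this paper: it shifts each row horizontally by a sinusoid of the row index (the row index stands in for the varying quantity inside the sine), rounds to whole pixels, and clamps at the border. The function name and both simplifications are hypothetical.

```python
import numpy as np

def sinusoidal_offset_warp(image, delta=70.0, v_c=1200.0):
    """Shift each row horizontally by round(delta * sin(2*pi*row / v_c)) pixels.

    Every output pixel is looked up in the original image, so the output has
    the same shape and neighboring rows receive similar shifts.
    Returns the warped image and the per-row integer shift.
    """
    h, w = image.shape[:2]
    rows = np.arange(h)
    cols = np.arange(w)
    shift = np.rint(delta * np.sin(2.0 * np.pi * rows / v_c)).astype(int)
    # Source column for each output pixel, clamped to the image border.
    src_cols = np.clip(cols[None, :] - shift[:, None], 0, w - 1)
    return image[rows[:, None], src_cols], shift

# Small values chosen so the effect is visible on an 8x10 toy image.
image = np.arange(80, dtype=float).reshape(8, 10)
warped, shift = sinusoidal_offset_warp(image, delta=3.0, v_c=8.0)
```

In this toy setting the shift rises from 0 to its peak and then falls again, so the offset changes gradually from row to row instead of jumping as independent noise would.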

Adaptive weighted loss function

While training the model, we observed that two classes (Metal and Rigid Plastic) in the dataset had few waste images, which results in an extreme imbalance of the data distribution, and the model did not extract the features of these two categories well. To solve this problem, we designed a new adaptive weighted loss function, given in Eqs. 5, 6, and 7.

$${L}_{cls}=-\frac{1}{Z}\left\{\sum\nolimits _{i=1}^{N}loss\left({p}_{i}\right)\right\}$$
(5)
$$loss\left(p_i\right)=\left\{\begin{array}{lc}e^{w-p_i}\cdot\mathrm{ln}\;p_i,&p_i<\eta\\0,&p_i\geq\eta\end{array}\right.$$
(6)
$$Z=\sum\nolimits _{i=1}^{N}[{p}_{i}<\eta ]$$
(7)

where i indexes a pixel, Z is the number of nonzero entries in the mask (Eq. 7), which acts as a normalization coefficient, and η is a hyperparameter set to 0.99. Moreover, w denotes the weight and \({p}_{i}\) is the prediction probability of a pixel. \({e}^{w}\) is the overall weight assigned to each waste class, and w is set by us in the experiments.

Since there are four waste classes and one background class, we assign w = 3 to the Metal and Rigid Plastic classes because they have less data. All other classes have w = 1.0. These are the best values found in a number of ablation experiments. \({e}^{-{p}_{i}}\) is the weight assigned to the pixel itself. If a pixel is predicted stably, \({p}_{i}\) is large, so the value of \({e}^{-{p}_{i}}\) becomes small. Conversely, if \({p}_{i}\) is small, the weight assigned to the pixel is larger. In this way, the data is brought toward a balanced state.
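As a worked illustration of the weighting factor in Eq. 6 (the probabilities 0.5 and 0.9 are chosen only for this example), compare a pixel from a class with w = 3 against a pixel from a class with w = 1:

$$e^{w-p_i}\Big|_{w=3,\,p_i=0.5}=e^{2.5}\approx 12.18,\qquad e^{w-p_i}\Big|_{w=1,\,p_i=0.5}=e^{0.5}\approx 1.65,\qquad e^{w-p_i}\Big|_{w=3,\,p_i=0.9}=e^{2.1}\approx 8.17$$

At the same predicted probability, the pixel from the class with w = 3 is weighted \(e^{2}\approx 7.39\) times more heavily, and within one class the weight shrinks as \({p}_{i}\) grows.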

Furthermore, we set the mask to 0 if \({p}_{i}\) is greater than or equal to η, as given in Eq. 6. The mask thus self-adjusts the number of positive and negative samples by filtering out features that the model has already learned particularly well, without limiting their number. Finally, the weights are multiplied by the cross entropy to obtain the final loss value. This differs from OhemCELoss [31], a typical weighting loss method that restricts the proportion of positive and negative samples. Because we do not restrict them, the positive samples also contribute valuable features, which gives a simple and effective way to improve the performance of the model.
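Equations 5 to 7 can be sketched in a few lines of NumPy. This is an illustrative reading, not the training code of this paper: it assumes that \({p}_{i}\) is the predicted probability of the true class of pixel i, that w is looked up from that class, and that the loss is zero when every pixel is filtered out. The function name and the `eps` clipping are also hypothetical.

```python
import numpy as np

def adaptive_weighted_loss(probs, weights, eta=0.99, eps=1e-12):
    """Eqs. 5-7 over a flat list of pixels.

    probs: p_i, assumed to be the predicted probability of each pixel's true class.
    weights: w for each pixel's true class (e.g. 3 for rare classes, 1 otherwise).
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    weights = np.asarray(weights, dtype=float)
    mask = probs < eta                      # indicator [p_i < eta] from Eq. 7
    z = int(mask.sum())                     # Z: number of pixels kept by the mask
    if z == 0:
        return 0.0                          # assumption: nothing left to learn
    per_pixel = np.exp(weights - probs) * np.log(probs)  # Eq. 6 (non-positive)
    return float(-per_pixel[mask].sum() / z)             # Eq. 5

# Two pixels: one uncertain (p = 0.5) and one already well learned (p = 0.995).
loss_rare = adaptive_weighted_loss([0.5, 0.995], [3.0, 1.0])
loss_common = adaptive_weighted_loss([0.5, 0.995], [1.0, 1.0])
loss_all_confident = adaptive_weighted_loss([0.99, 1.0], [3.0, 1.0])
```

The well-learned pixel is dropped by the mask in every case, and the same uncertain pixel contributes more loss when it belongs to a class with the larger w.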

4 Our experiments

4.1 Datasets

We propose a novel non-uniform data augmentation and an adaptive weighted loss function to train semantic segmentation on the Zerowaste dataset, with the aim of improving the accuracy and efficiency of waste classification in real waste factories. The Zerowaste dataset contains three branches, namely Zerowaste-f, Zerowaste-s, and Zerowaste-w. We mainly use the Zerowaste-s dataset, which is intended for semi-supervised tasks and contains 6,212 unlabeled images. Additionally, the Zerowaste dataset has four waste categories, i.e., Cardboard, Soft plastic, Rigid plastic, and Metal. We evaluate the model with the mean(IoU) metric and perform ablation experiments to verify the effectiveness of our proposed method.
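For reference, the mean(IoU) metric can be computed as in the following sketch. It is a generic illustration on a single toy label map: the function name is hypothetical, and skipping classes that appear in neither prediction nor ground truth is an assumption, since evaluation protocols differ on that point and usually accumulate counts over the whole test set.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average of per-class intersection over union for integer label maps."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (assumption)
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5.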

4.2 Implementation details

Our experiments were run on a server with an RTX A5000 GPU and an AMD EPYC 7543 CPU. The software environment comprised the VSCode code editor, the PyTorch neural network framework, CUDA, cuDNN, and OpenCV. The specific experimental parameters are listed in Table 2.

Table 2 Training parameters of our experiments

4.3 Results

Our method targets real waste factory classification scenarios on the Zerowaste dataset, so we follow the result comparison rules of the Zerowaste baseline to ensure the fairness and validity of the experiments. In Table 3, the mean(IoU) on the test set of our semi-supervised U-Net method with ResNet-50 is 55.37%, which is about 3.74% higher than the result of the Zerowaste baseline. If we replace ResNet-50 with EfficientNet, the mean(IoU) is reduced by 0.39%. Besides, to obtain comparable results, we applied our method to DeepLabv3+ [4] (the same baseline method as Zerowaste, whose backbone network is ResNet-101) and obtained a mean(IoU) of 54.77%, which is 3.14% higher than the original value. This shows that our method noticeably improves the mean(IoU). After that, following the experimental approach of Zerowaste, we compared our method with UniMatch [37], AugSeg [41], ReCo, CCT [28], and EPS [19]. The mean(IoU) values on the semi-supervised test set of UniMatch, AugSeg, and ReCo are 54.65%, 53.88%, and 44.12%, respectively, all lower than that of our method. The mean(IoU) values of CCT and EPS are all below 33%, while our method reaches 55.37%. Moreover, since Transformer models have improved the performance of classification and segmentation models, we also tested CLUSTERFORMER [20] and Swin Transformer [22], and the obtained mAP values were 52.76% and 53.21%, respectively. We see that our method is also applicable to Transformer models.

Table 3 Comparisons of mean(IoU) results of different methods

With U-Net, the mean(IoU) is only 0.6% higher than with DeepLabv3+, but we still choose U-Net as our semi-supervised waste semantic segmentation network. This is because U-Net consists of an encoder and a decoder, which not only capture the global features of the image but also restore its spatial information, and it performs remarkably well on smaller datasets, while DeepLabv3+ is suitable for large datasets. The experimental results show that our method yields impressive results. Figure 6a and b illustrate the mean(IoU) and loss curves of our method, respectively, over the whole training process. The second, darker and smoother line in each plot is the approximate trend of the curve, which is included to highlight its main changes.

Furthermore, we also evaluated the SGD optimizer because, although it is less efficient than the AdamW optimizer, it usually gives more stable training results. To our surprise, the SGD optimizer performed unsatisfactorily in our model and had a negative effect on the training outcomes, as shown in Fig. 7. Its loss values are on average higher than those of the AdamW optimizer.

Fig. 6 The mean(IoU) and loss values of our method

Fig. 7 The loss value of our method when using the SGD optimizer

5 Ablation study

5.1 Analysis of the adaptive weighted loss function

Table 4 Comparisons of mean(IoU) with the application of adaptive weighted loss function

The loss function evaluates the deviation of the model prediction from the true value, so the choice of loss function affects the training results. In this paper, we propose an adaptive weighted loss function and conduct experiments, as shown in Table 4. The mean(IoU) with our loss function is 1.25% higher than without it. This simple weighting method improves the performance of the model effectively and efficiently. There is still room for improvement in the adaptive weighted loss function, and we believe that exploring more dynamic adaptive weighting schemes is an interesting direction for future work.

5.2 Analysis of w values

The Zerowaste dataset has four categories, namely Cardboard, Soft plastic, Rigid plastic, and Metal. In semantic segmentation, a background category must also be added, so our experiment contains five classes. However, we observed that the Metal class and the Rigid Plastic class account for only 3.6% and 16.6% of the total dataset, respectively, which shows the extreme imbalance of the dataset. After the initial training, the mean(IoU) of these two categories was only about 20.00%, much lower than that of the other categories, which lowered the overall results of the model. Our adaptive weighted loss function mitigates this problem by assigning the same weight value to both categories. In this subsection, we perform ablation experiments on the w values in the adaptive weighted loss function.

Five classes are to be weighted, and the initial w value is set to 1.0 for all of them, with which the model obtains a mean(IoU) of 52.69%. We conjecture that the mean(IoU) is not positively correlated with the value of w: if the Metal and Rigid Plastic classes are given much greater weights, the model will overfit. Therefore, we did not assign large values to these two categories, as detailed in Table 5. The best result is obtained when w is 3, with a mean(IoU) as high as 55.37%. The lowest mean(IoU), 52.60%, is obtained when w is set to 5, which is 0.09% lower than the value with w equal to 1. The second highest mean(IoU) is 54.02% (when w is 3.5), which is 1.35% lower than 55.37%. This verifies our conjecture that w cannot be set too large.

Table 5 Comparisons of mean(IoU) with different w values

Next, since the Metal and Rigid Plastic classes also have different amounts of data, we assign different w values to these two categories. For example, we let the w values be 3 and 2 (or 4 and 2) for the Metal and Rigid Plastic categories, respectively. However, the mean(IoU) did not increase with this finer differentiation of w values and stayed at about 53% in both cases, contrary to our conjecture. We speculate that this may be because the difference in data volume between the Metal class and the Rigid Plastic class is small, so that assigning separate w values does not raise the mean(IoU). In future work, we will further investigate the results with a variety of w values. Figure 8 shows typical losses with different weights.

Fig. 8 The typical loss values with different w values

5.3 Analysis of non-uniform data augmentation

Table 6 Comparisons of mean(IoU) with various data augmentation methods

In this paper, we study non-uniform data augmentation in depth and compare it with other data augmentation methods. The results are listed in Table 6. The best result, 55.37%, is obtained with our non-uniform data augmentation. We then kept the U-Net model and replaced our augmentation with Cutout, Cutmix, and Mixup, obtaining mean(IoU) values of 52.21%, 53.86%, and 54.08%, respectively, all lower than our method. For a fair comparison with the Zerowaste baseline, we also conducted experiments with DeepLabv3+ as the segmentation network. Since this was only to compare our method with the Zerowaste method, Cutout, Cutmix, and Mixup were not tested in that setting. Overall, our non-uniform data augmentation increases the mean(IoU) of the model by about 2.49% on average.
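For readers unfamiliar with the compared baselines, the sketch below shows the core operation of Cutout and Mixup in their textbook form. The patch position and the mixing coefficient are fixed here for reproducibility, which is a simplification; in practice both are sampled at random, and this is not the exact configuration used in our experiments.

```python
import numpy as np

def cutout(image, top, left, size):
    """Zero out a square patch, discarding the information inside it."""
    out = image.copy()
    out[top:top + size, left:left + size] = 0.0
    return out

def mixup(image_a, image_b, lam):
    """Blend two images as a convex combination with coefficient lam."""
    return lam * image_a + (1.0 - lam) * image_b

image_a = np.ones((4, 4))
image_b = np.zeros((4, 4))

erased = cutout(image_a, top=1, left=1, size=2)   # 4 of 16 pixels removed
blended = mixup(image_a, image_b, lam=0.25)       # every pixel becomes 0.25
```

Cutout removes part of the image outright, while Mixup changes every pixel by blending in unrelated content; non-uniform data augmentation, by contrast, keeps all of the original image content and only varies it smoothly.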

5.4 Analysis of the initial offset value

In Section 3.2, we described setting the initial value for non-uniform offset data augmentation to 70, which is the best setting obtained in our ablation experiment. Table 7 shows that if the initial value is set to 30, the mean(IoU) reaches its lowest value of 51.97%. If the initial value is set to 100, the mean(IoU) also decreases. Figure 9 compares the mean(IoU) curves with the initial value set to 30 and 100. Given that the pixel values of the images range from 0 to 255, we suppose that if the initial value is too large or too small, the augmented data cannot capture the global information of the images, which causes the network to ignore image features. Therefore, in this experiment, we choose 70 as the initial offset value of data augmentation.

Fig. 9 Comparison of the mean(IoU) with the initial value set to 30 and 100

Table 7 Comparison of mean(IoU) with different initial offset value

5.5 Analysis of the application of non-uniform data augmentation

Since our non-uniform data augmentation transforms both pixel colors and pixel positions, we compared these two components. With only non-uniform color data augmentation, the mean(IoU) is 54.39%, which improves on the Zerowaste baseline by 1.41%. With only non-uniform offset data augmentation, the mean(IoU) improves on the baseline by 1.08%. Table 8 shows that both methods are effective and that data augmentation for simulated light is more useful than data augmentation for simulated polymorphism. The reason for the difference needs to be explored in depth. We speculate that it is due to the hyperparameter settings of the two methods, which we will investigate further in future work.

Afterwards, to verify the generality of the proposed method, we used the ResNet-101 model to compare it with other data augmentation methods [21] on the CIFAR-10 dataset; the results are given in Table 9. Non-uniform data augmentation also enhances classification performance, by about 0.12%. In our experiments, our method shows robustness to image disturbances such as illumination variation, scale variation, rotation, and tilt. This implies that it may also be resistant to some adversarial attacks. However, an assessment against specific attacks, such as Adversarial Patch and the Physical-object-oriented MDE Attack [7], is still needed. Therefore, we plan to test our method’s resistance to these advanced adversarial attacks in future work and to further adjust and improve it to address these challenges.

Table 8 Comparison of mean(IoU) of the application of non-uniform data augmentation
Table 9 Comparison of mean(IoU) of the application of non-uniform data augmentation applied to image classification task

6 Conclusion

We present a pioneering approach to waste classification that incorporates a novel non-uniform data augmentation technique. This method excels in simulating various environmental conditions such as natural lighting and object polymorphism, significantly enhancing the model’s robustness. Coupled with an adaptive weighted loss function, our method, when applied to the U-Net architecture, elevates the mean Intersection over Union (IoU) by 3.74%. This underscores the efficacy and simplicity of our method in handling real-world waste classification challenges.

To address the scarcity of large labeled datasets in waste classification, our approach leverages semi-supervised semantic segmentation. This not only mitigates the financial burden of manual annotation but also enriches the Zerowaste dataset, thereby amplifying the model’s efficiency and reliability in waste classification.

Our methodology, though tailored for waste classification, offers transferable insights that could be adapted to other domains, including traffic detection tasks on datasets like COCO and Pascal. Specifically, our non-uniform offset data augmentation method employs sinusoidal computations, but it can also accommodate other mathematical functions, such as cosine or quadratic functions, for broader applications.

Looking ahead, we aspire to extend our method to additional deep learning challenges and to keep improving it. Moreover, there is potential for further refinement of non-uniform color data augmentation. For instance, beyond altering light and shadow, we plan to explore how manipulating object colors could further improve the model’s performance. Finally, we suspect that the simplicity of the adaptive weighted loss function is a limitation of our work. Although it improved the accuracy of the model in this paper, there is still room for improvement. In the future, we will conduct further research on adaptive weighted loss functions to design more efficient and effective loss functions.