1 Introduction

In recent years, an increasing number of computer-aided applications have been deployed to help farmers perform their daily work (Rehman et al. 2019), attempting to deal with challenges caused by climate change. Robots equipped with object detection techniques have been used in various tasks (Dhaya et al. 2021), such as plant disease detection (Ampatzidis et al. 2017), pest classification (Thenmozhi and Reddy 2019), and water resource management (Reis et al. 2019), to name a few. Such tools improve both the effectiveness and efficiency of farming work (Pilarski et al. 2002; Lehnert et al. 2017; Bargoti and Underwood 2017), as they can fulfill these tasks in a short time with high accuracy (Deepa and Ganesan 2019).

Vietnam is a developing country, and despite the recent structural transformation process, agriculture remains a staple of the country’s economy. Nevertheless, there is a worrying shortage of labor in the agriculture sector (Sabzi et al. 2017; Liu et al. 2020), as descendants of farmers prefer to seek their fortune in big cities rather than continuing the tradition. In fact, Vietnam is among the world’s major rice exporters, and rice accounts for around 90% of the country’s food production (Tam and Shimada 2019). However, due to the country’s geographical position and weather conditions, rice fields in Vietnam severely suffer from the invasion of pests and, in particular, several types of wild weeds. The country has sought different countermeasures to improve crops and combat pests and weeds, including fertilizers and herbicides. Still, the effectiveness of herbicides is far from optimal, as weeds quickly adapt and become resistant to them. Moreover, overuse of herbicides causes contamination and harms the environment. Given these circumstances, there is a need for automatic techniques and tools to deal with the lack of labor, as well as to support the detection and elimination of weeds. A sound technical basis for recognizing weeds is an important prerequisite for deploying in-field automatic robots.

The classification of weeds under realistic conditions turns out to be a daunting task. Although various studies have been conducted to recognize weeds and good results have been obtained, most existing studies deal with the detection of weeds in a laboratory setting. As a matter of fact, the classification of weeds in rangeland environments has not received adequate attention (Olsen et al. 2019a). In reality, images captured in the field are heterogeneous and contain background noise. In this respect, there is a need for robust mechanisms for detecting weeds in situ.

Machine learning (ML) has made profound progress in the past decade, thanks to the proliferation of several disruptive deep learning algorithms (Duong et al. 2023). Applications exploiting ML have been on the rise across several domains. Among others, ML techniques have been used to solve different issues in the agriculture sector (Espejo-Garcia et al. 2021; Duong et al. 2020).

In this work, we propose a practical solution to weed recognition, in terms of both efficiency and effectiveness. We use two variants of the EfficientNet family (Tan and Le 2019) as the classification engine. More importantly, we incorporate different optimization functions and transfer learning strategies, aiming to find the best configuration. The performance of our approach has been evaluated using a real weed dataset, i.e., DeepWeeds (Olsen 2020). The results obtained so far are promising: for all configurations, the accuracy is always larger than 97%, with 99.62% being the maximum. Compared to two state-of-the-art baselines, our approach achieves a better prediction performance with respect to various quality metrics. The aim of our work is to build an expert system that paves the way for a machine able to automatically detect and eliminate weeds from rice fields.

In this sense, our paper makes the following contributions:

  • A practical solution to weed classification adopting cutting-edge deep neural networks and transfer learning techniques.

  • A comprehensive evaluation of the conceived framework on a real dataset, also comparing it with two well-established baselines, namely ResNet-50 (Olsen et al. 2019a) and Inception-V3 (dos Santos Ferreira et al. 2019).

  • A software prototype, provided as a mobile app, ready for download.

The paper is organized as follows. In Sect. 2, we provide a literature review on the related topics. Afterward, Sect. 3 presents the proposed approach in detail. In Sect. 4, we explain the dataset and metrics used for evaluation. Section 5 presents the main results and discussions, as well as the possible threats to the validity of the outcome. Finally, we present future work and conclude the paper in Sect. 6.

2 Related work

In this section, we present a literature review on the related topics. In particular, we review work for automation in agriculture, and notable studies on weed classification.

Recently, different techniques have been conceived to solve issues in agriculture (Deepa and Ganesan 2019). A survey of studies exploiting Deep Learning in agriculture and food production has recently been conducted (Kamilaris et al. 2018). Through a detailed examination of various agricultural problems, it has been shown that Deep Learning algorithms obtain better prediction accuracy and outperform conventional image processing techniques. Similarly, Rehman et al. (2019) present a survey on machine learning techniques for various agricultural areas. The work summarizes the pros and cons of statistical machine learning techniques for certain purposes, and also provides a discussion of future trends in statistical machine learning applications.

Convolutional neural network models (Ferentinos 2018) have been applied to help farmers detect plant diseases. The approach has been evaluated using an open database covering 25 different plants. The framework achieved a high prediction accuracy, suggesting that the model can support an identification system working in real-world scenarios. The approach presented in this paper is related to this work, as we also attempt to assist farmers in their daily tasks using Deep Learning techniques. We suppose that EfficientNet can also be used to automatically recognize plant diseases.

A survey (Zhao et al. 2016) presented the major techniques used in fruit and vegetable harvesting robots, and discussed the challenges and trends of deploying different automatic techniques in robots. There exists an approach to the identification of soybean leaves and herbivorous pests from images captured by unmanned aerial vehicles (Amorim et al. 2019). The system aims to support specialists and farmers in pest control management in soybean fields, especially when only a limited amount of labeled instances is available. We assume that our approach, built on top of EfficientNet and transfer learning, can be used to tackle the same issue (Amorim et al. 2019).

The automatic identification of tree species has great potential in agriculture, and there is a tree species recognition method based on the fusion of multiple deep learning models (Hu et al. 2018). The dataset was built from image datasets published on the Internet and autonomous photography. The experimental results showed that the recognition accuracy for tree species in complex backgrounds with the proposed method reached 93.75%. A recent work (Sharpe et al. 2020) leveraged the advantages of tiny-You Only Look Once 3 (YOLOv3-tiny) as a potential detector to aid goosegrass identification and spraying in situ. The approach was evaluated with two annotation techniques, i.e., annotation of the entire plant (EP) and annotation of partial sections of the leaf blade (LB). The reported performance is as follows: F\(_1\)-scores of 0.75 and 0.85 for EP and LB goosegrass detection in strawberry, respectively.

To assist farmers in their daily tasks, a number of approaches have been conceived to classify fruits, making use of various machine learning techniques. For instance, a feature learning-based algorithm has been used to build a system for classifying fruits (Hung et al. 2015). The algorithm extracts the most representative features of fruit images by means of pixel classification. Similarly, a deep neural network to detect fruit was conceived (Sa et al. 2016), aiming to support yield estimation and automated harvesting. A multi-modal Faster R-CNN model (Ren et al. 2015) was designed and implemented, achieving improved accuracy compared to various state-of-the-art approaches. Furthermore, the approach is also time efficient, as it requires bounding box annotation instead of pixel-level annotation. The model was retrained to detect seven fruits, and the entire process takes 4 h to annotate and train the new model for a fruit. In our recent work (Duong et al. 2020), we proposed a workable solution to fruit classification using EfficientNet and MixNet. The experimental results show that our proposed approach obtains a high prediction accuracy, thereby outperforming a well-established baseline.

Recently, several studies have focused on boosting productivity in agriculture. In a recent work (Olsen et al. 2019b), the authors proposed an approach to recognize various weed species in complicated rangeland environments. The work introduced a baseline for the weed classification task using the deep learning models Inception-v3 and ResNet-50 on the DeepWeeds dataset, which includes 17,509 images of eight weed species from across northern Australia. The proposed framework achieved average classification accuracies of 95.1% and 95.7%, respectively.

A deep learning architecture named Graph Weeds Net (GWN) (Hu et al. 2020) has been recently introduced to detect multiple types of weeds from images. The proposed method improved the performance of weed identification tasks by establishing fine-grained deep representations and assigning high probability to the true-positive weed rather than the others. The problem of plant recognition is still challenging due to background noise from the living environment. The technique proposed by Zhu et al. (2019) is able to recognize plants on four datasets, i.e., MalayaKew, ICL, Flowers 102, and CFH plant, by exploiting a two-way attention model with a deep convolutional neural network. The approach obtained accuracies of 99.8%, 99.9%, 97.2%, and 79.5% on the aforementioned datasets, respectively. The first attention way is based on the taxonomy of plants to recognize a plant’s family, whereas the second focuses on the discriminative features of plant images based on the feature maps generated by convolutional neural networks. Thanks to the compatibility of the two attention ways, discriminative feature learning and part-based attention are combined, obtaining promising results.

3 A practical solution to weed classification

This section introduces the proposed approach based on EfficientNet (Tan and Le 2019), EfficientNet-Lite4, and transfer learning (Huang et al. 2017; Weiss et al. 2016). We systematically present the related technologies, i.e., EfficientNet in Sect. 3.1 and transfer learning in Sect. 3.2. The architecture conceived to build an expert system to automatically recognize weeds is presented in Sect. 3.3.

3.1 EfficientNet

EfficientNet (Tan and Le 2019) is a recently developed family of deep neural networks that aims to transcend the main limitations of existing CNN technologies related to prediction accuracy. EfficientNet imposes a balance between all network dimensions, i.e., width, depth, and resolution, by means of a set of fixed scaling coefficients that meet specific constraints (Tan and Le 2019). The EfficientNet family comprises different versions, the simplest one being EfficientNet-B0. The other EfficientNet configurations are generated from EfficientNet-B0 with different scaling values. For illustration purposes, the most compact configuration, i.e., EfficientNet-B0, is shown in Fig. 1: It consists of 18 convolution layers in total, each using a kernel of either k(3,3) or k(5,5). Input images are made of three color channels, i.e., R, G, B, each scaled to a size of \(224\times 224\). Subsequent layers are reduced in resolution but increased in width to enhance accuracy. For example, the second convolutional layer is equipped with W=16 filters, and in the next layer the number of filters is W=24. The maximum number of filters is D=1,280 in the last layer, which is fed to the fully connected layer.
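For reference, the compound scaling rule of Tan and Le (2019) determines depth d, width w, and resolution r from a single compound coefficient \(\phi \):

$$\begin{aligned} d = \alpha ^{\phi },\quad w = \beta ^{\phi },\quad r = \gamma ^{\phi },\qquad \text {s.t. } \alpha \cdot \beta ^{2} \cdot \gamma ^{2} \approx 2,\; \alpha ,\beta ,\gamma \ge 1, \end{aligned}$$

where \(\alpha \), \(\beta \), and \(\gamma \) are constants found by a small grid search on the baseline network; larger variants such as EfficientNet-B4 are obtained by increasing \(\phi \).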

Fig. 1 The EfficientNet-B0 architecture

Similar to the EfficientNet backbone, the EfficientNet-Lite models are designed to run on mobile CPUs, GPUs, and EdgeTPUs, while maintaining an accuracy comparable to quantized versions of popular image classification models. EfficientNet-Lite architectures do not apply the squeeze-and-excite function of the original backbone, and they use ReLU6, a modification of the Rectified Linear Unit, instead of the Sigmoid Linear Unit (SiLU). We opted for EfficientNet-Lite4, the largest variant, which achieves 80.4% ImageNet top-1 accuracy while still running in real time, e.g., 30ms/image on a Pixel 4 CPU. Within a CNN, different optimization functions can be used to speed up the learning process. In the scope of this paper, three optimization functions, namely SGD, ADAM, and SLS, are incorporated into our evaluation.
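As an illustration of the mobile-oriented workflow, the following minimal sketch instantiates EfficientNet-Lite4 via the timm library (Wightman 2019a) and traces it to TorchScript for on-device inference. The model name follows timm's naming at the time of writing and may vary across releases; the nine-class head matches the DeepWeeds categories, and the output file name is merely illustrative.

```python
import timm
import torch

# EfficientNet-Lite4 (no squeeze-and-excite, ReLU6 activations),
# with a 9-class head matching the DeepWeeds categories.
model = timm.create_model("tf_efficientnet_lite4", pretrained=True, num_classes=9)
model.eval()

# Trace with a dummy 224x224 RGB batch and save for mobile deployment.
example = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, example)
scripted.save("weeds_lite4.pt")  # illustrative file name
```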

3.2 Transfer learning

An important requirement for deep neural networks is acquiring enough data for the training process. Nevertheless, labeled data must typically be obtained by manual annotation, which requires time and human labor. Thus, transfer learning has been adopted as a practical solution to overcome the lack of data, since it allows for the re-use of weights and biases trained on large datasets. Compared to using only random weights, transfer learning brings better performance in terms of effectiveness and efficiency, even when the target domain is quite different from the one where the weights were obtained (Huang et al. 2017). In the scope of this work, we consider the following three learning strategies.

  • ImageNet: We import pre-trained weights from the ImageNet dataset (Russakovsky et al. 2015), which consists of more than 14 million images, covering various categories;

  • Adversarial propagation (AdvProp) (Xie et al. 2019): Adversarial propagation has been proposed as an enhanced training scheme, and it treats adversarial examples as additional examples, with the ultimate aim of minimizing overfitting;

  • Noisy Student (NS) (Xie et al. 2019): This aims at boosting ImageNet classification through Noisy Student training by: (i) making the trainee/student model equal to or larger than the trainer/teacher, so as to force the student to learn better on a large dataset, and (ii) adding noise to the student, thus enabling it to learn more robustly.
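For illustration, the three strategies can be selected in timm simply by the pre-trained variant name. The sketch below is an assumption based on the model names shipped with timm at the time of writing; they may differ in newer releases.

```python
import timm

NUM_CLASSES = 9  # eight weed species plus the Negative category

# (i) plain ImageNet weights
net_imagenet = timm.create_model("tf_efficientnet_b4", pretrained=True,
                                 num_classes=NUM_CLASSES)
# (ii) AdvProp weights (adversarial-example training)
net_advprop = timm.create_model("tf_efficientnet_b4_ap", pretrained=True,
                                num_classes=NUM_CLASSES)
# (iii) Noisy Student weights (semi-supervised teacher/student training)
net_ns = timm.create_model("tf_efficientnet_b4_ns", pretrained=True,
                           num_classes=NUM_CLASSES)
```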

3.3 Architecture

The conceived system is shown in Fig. 2; it comprises two main phases, namely training and testing. The architecture allows for the inclusion of weights pre-trained on other datasets, e.g., ImageNet (Russakovsky et al. 2015).

Fig. 2 System architecture

Input images are already labeled, and they are transformed into a feature vector, which is then handled by the Extractor component. During training, an input image is augmented with rotation and one resized crop, as well as horizontal flipping applied with a random chance. The rotation is done so as to change at least \(30\%\) of the input image. The resulting image is then rescaled to fit into a frame of size \(224 \times 224\) and fed as training data. The feature vectors and their labels are used to train the system by means of the Weights Calculator component. Pre-trained weights from other datasets can be imported to deploy transfer learning.
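A minimal torchvision sketch of this training-time augmentation is given below; the parameter values are assumptions, as the text only prescribes rotation, one resized crop, and a random horizontal flip, followed by rescaling to \(224 \times 224\).

```python
from torchvision import transforms

# Training-time augmentation: rotation, one resized crop, random
# horizontal flip, rescaled to 224x224 (degrees/probabilities assumed).
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```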

The resulting trained parameters can then be used to classify unlabeled weed images. The testing phase (or deployment) can be performed on lightweight devices, e.g., laptops or smartphones. Each time an unknown input weed image is put into the system, a feature vector is extracted from it using the Feature Extractor. The feature tensor is then fed to the neural network, which runs the classification engine to label the input image. The final result is a label that classifies the input weed image into a certain category.
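A sketch of this deployment-time flow follows, assuming a trained checkpoint and the nine DeepWeeds category names; the file paths are placeholders.

```python
import timm
import torch
from PIL import Image
from torchvision import transforms

CLASSES = ["Chinee apple", "Lantana", "Parkinsonia", "Parthenium",
           "Prickly acacia", "Rubber vine", "Siam weed", "Snake weed",
           "Negative"]  # DeepWeeds categories

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

model = timm.create_model("tf_efficientnet_b4", num_classes=len(CLASSES))
model.load_state_dict(torch.load("trained_weights.pt"))  # placeholder path
model.eval()

img = preprocess(Image.open("weed.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    print(CLASSES[model(img).argmax(dim=1).item()])  # predicted category
```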

4 Evaluation materials and methods

In this section, we explain in detail the experiments conducted to evaluate the proposed approach and to compare it with two existing studies. First, the research questions studying the systems’ characteristics are enumerated in Sect. 4.1. Section 4.2 gives an overview of the DeepWeeds dataset (Olsen 2020), which has been used as input data for the evaluation. The experimental settings are described in Sect. 4.3, while Sect. 4.4 presents the metrics used to measure prediction performance.

4.1 Research questions

By performing a series of experiments on the given dataset, we attempt to answer three research questions pertinent to this work:

  • RQ\(_1\): Which optimization function brings a better classification performance for EfficientNet-B4? We examine different EfficientNet-B4 configurations to find the one that obtains the best prediction accuracy.

  • RQ\(_2\): Which optimization function brings a better classification performance for EfficientNet-Lite4? Similarly, we perform experiments on the EfficientNet-Lite4 network, aiming to determine the setting that brings the best prediction accuracy on the given dataset.

  • RQ\(_3\): How does the proposed approach perform compared to the baselines? Finally, we are interested in understanding whether our proposed approach achieves a better prediction performance than two state-of-the-art baselines (Olsen et al. 2019a; dos Santos Ferreira et al. 2019).

4.2 Dataset and baselines

The DeepWeeds dataset was collected using a ground-based weed control robot (Olsen 2020). There are 17,509 images in total across nine categories, which are summarized in Fig. 3. Among them, eight categories are real weeds, while the Negative category consists of images other than weeds, i.e., soil and vegetation. Figure 4a–i depicts representative examples of the categories from the original dataset. To the naked eye, it is difficult to distinguish the weeds from each other. In this respect, an automatic mechanism to detect weeds is highly desirable.

Fig. 3 A summary of the DeepWeeds dataset (Olsen 2020)

Fig. 4 Real weed images (extracted from the original dataset (Olsen 2020))

To better study the performance of our proposed approach in relation to state-of-the-art studies, we compare EfficientNet-B4 and EfficientNet-Lite4 with two baselines as follows. The first one is built on ResNet-50, a convolutional neural network originally designed and implemented by the creators of the DeepWeeds dataset to articulate their contributions. The tool works on top of the TensorFlow framework, and its source code is available online (Olsen 2020). The second baseline is a tool developed by dos Santos Ferreira et al. (2019), which is based on the Inception-V3 deep neural network.

4.3 Experimental settings

For all the experiments, we split the original dataset into three independent parts, i.e., 80% for training, 10% for validation, and 10% for testing. Moreover, we applied various data augmentation techniques: images are transformed using rotation, blur, random noise, horizontal flip, and vertical flip, exploiting an existing tool. The parameters and their corresponding values are listed in Table 1. The ultimate aim of the augmentation process is to enrich the dataset that feeds the classification engine.
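A reproducible 80/10/10 split can be obtained, for instance, as in the sketch below; the folder layout is hypothetical, and the actual split procedure used in the experiments may differ.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Hypothetical layout: one subfolder per weed category.
data = datasets.ImageFolder("deepweeds/", transform=transforms.ToTensor())

n = len(data)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    data, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(42))  # fixed seed for reproducibility
```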

We adopted a recent implementation (Wightman 2019a) of EfficientNet and EfficientNet-Lite4 built on top of the PyTorch framework. We also imported pre-trained weights from the ImageNet dataset (Russakovsky et al. 2015), as well as NS and AdvProp weights (cf. Sect. 3.2), with the final aim of speeding up and increasing the effectiveness of the learning process. We trained on a server with the following configuration: Intel Xeon CPU E5-2678v3 @ 2.50GHz \(\times \) 12 cores, 96GiB RAM, NVIDIA GeForce GTX 1080Ti, running Ubuntu 20.04.1 LTS.

Table 1 Augmentation parameters
Table 2 Experimental configurations
Table 3 Precision, Recall, and F1-score using EfficientNet-B4 with different optimizers

Table 2 depicts the experimental configurations for EfficientNet-B4 and EfficientNet-Lite4, as well as for the two baselines, i.e., ResNet-50 and Inception-V3. The # Params column corresponds to the number of parameters needed to store the weights and biases of the network. The # MAC column measures the computational complexity, counted as the number of multiply-accumulate operations. While both EfficientNet-B4 and EfficientNet-Lite4 need fewer than 20 million parameters and 2.10G MAC, ResNet-50 and Inception-V3 are larger, i.e., they have more than 21.80 million parameters and 2.85G MAC. Correspondingly, the File size used to store the parameters (in MB) varies with the number of parameters. We also use three different optimizers in our evaluation, i.e., SGD, ADAM, and SLS. For EfficientNet-B4, we combine the three transfer learning strategies introduced in Sect. 3.2, i.e., ImageNet, NS, and AdvProp, while for EfficientNet-Lite4, we can use only weights pre-trained on the ImageNet dataset.

4.4 Evaluation metrics

Given a set of weed images, we have their corresponding labels, i.e., \(G=(G_{1},G_{2},..,G_{N})\). By running a classifier, we obtain a set of predicted labels, i.e., \(C=(C_{1},C_{2},..,C_{N})\). We compare the two sets to measure the prediction performance, exploiting the following metrics: accuracy, precision, recall, and F\(_1\) score, defined as follows.

Accuracy: The metric is measured as the ratio of the number of correct predictions to the total number of items:

$$\begin{aligned} accuracy =\frac{ \sum _{i}^{N} match_{i} }{\sum _{i}^{N} \left| G_{i} \right| } \times 100\%. \end{aligned}$$
(1)

Precision and Recall: Precision measures how many of the predicted instances are correct, whereas recall expresses the ability to find all relevant instances in the dataset. The metrics are computed using the following formulas:

$$\begin{aligned} precision_{i} = \frac{match_{i}}{\left| C_{i} \right| } \end{aligned}$$
(2)
$$\begin{aligned} recall_{i} = \frac{match_{i}}{\left| G_{i} \right| }. \end{aligned}$$
(3)

F\(_1\) score (F-Measure): The F\(_1\)-score is computed as the harmonic mean of Precision and Recall as follows:

$$\begin{aligned} F_{1} = \frac{2 \cdot precision_{i}\cdot recall_{i}}{precision_{i} + recall_{i}}. \end{aligned}$$
(4)
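For clarity, the metrics of Eqs. (1)–(4) can be computed per category as in the following sketch, which is a direct transcription of the formulas rather than the evaluation code used in the experiments:

```python
def per_class_metrics(y_true, y_pred, labels):
    """Precision, recall, and F1 per category, following Eqs. (2)-(4)."""
    scores = {}
    for c in labels:
        match = sum(1 for t, p in zip(y_true, y_pred) if t == p == c)
        n_pred = sum(1 for p in y_pred if p == c)   # |C_i|
        n_true = sum(1 for t in y_true if t == c)   # |G_i|
        prec = match / n_pred if n_pred else 0.0
        rec = match / n_true if n_true else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores

def accuracy(y_true, y_pred):
    """Overall accuracy in percent, following Eq. (1)."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```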

In the next section, we analyze the experimental results by answering the research questions in Sect. 4.1.

5 Experimental results

We report and analyze the results obtained by conducting a series of experiments on the given dataset in Sects. 5.1, 5.2, and 5.3. An ablation study follows in Sect. 5.4. Afterward, Sect. 5.5 discusses the threats that may adversely impact the validity of our findings.

5.1 RQ\(_1\): Which optimization function brings a better classification performance for EfficientNet-B4?

We study the performance by calculating precision, recall, and F\(_1\) scores for EfficientNet-B4, exploiting three optimization functions, i.e., SGD, ADAM, and SLS. The obtained results are shown in Table 3. To facilitate reading, we compare the performance category-wise and use different colors to mark the maximum values for each metric. In particular, light red, light gray, and light green highlight the maximum precision, maximum recall, and maximum F\(_1\), respectively.

The EfficientNet family uses either SGD, ADAM, or SLS (Vaswani et al. 2019) as the optimizer. The training phase runs for 120 epochs. Moreover, an initial learning rate of 0.001, decayed by a factor of 10 every 30 epochs, is set for SGD and ADAM. SLS does not use a fixed learning rate, and its batch size is smaller than that of SGD and ADAM. Although SLS converges faster in terms of the number of training epochs, the total training time is not significantly reduced due to the batch size settings, and SLS does not seem to achieve a saturated prediction performance on the independent test set. For these reasons, the file sizes of weights trained with the SLS optimizer are smaller than those trained with either SGD or ADAM (see Table 2). Though SLS does not obtain the best performance, it is still useful as a baseline for further experimental benchmarks.
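The SGD/ADAM schedule described above corresponds to the following PyTorch sketch; the momentum value and the empty training loop body are assumptions for illustration.

```python
import timm
import torch

model = timm.create_model("tf_efficientnet_b4", num_classes=9)

# Initial learning rate 0.001, decayed by a factor of 10 every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(120):  # 120 training epochs
    # ... one pass over the training set would go here ...
    scheduler.step()
```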

Table 4 Precision, Recall, and F1-score using EfficientNet-Lite4 with different optimizers
Fig. 5 Confusion matrices of EfficientNet-B4 and EfficientNet-Lite4 with SGD

Overall, the table demonstrates that the ADAM optimization function yields a mediocre performance compared to its counterparts. Concretely, ADAM achieves the maximum recall score only for some categories, e.g., R=1.00 for the Parkinsonia, Parthenium, and Prickly Acacia categories; meanwhile, it does not yield any maximum values for Precision or F\(_1\). In contrast, the SLS optimizer contributes to a much better prediction than ADAM, i.e., it allows EfficientNet-B4 to yield a good performance in terms of Recall for many categories. Moreover, using SLS is also beneficial to the Precision and F\(_1\) scores.

In comparison to the others, the SGD optimization function brings the best performance with respect to all the evaluation metrics. As the table shows, most of the maximum Precision scores are obtained using SGD, i.e., several cells are marked with light red, indicating the best values. Similarly, for the other metrics, i.e., Recall and F\(_1\), SGD demonstrates its superiority by helping EfficientNet-B4 achieve the best predictions compared to ADAM and SLS. In particular, for Recall, SGD enables EfficientNet-B4 to yield the maximum performance for many categories.

Altogether, the results of RQ\(_1\) reveal an interesting outcome: the ADAM optimizer has been widely used in various classification tasks; however, as this evaluation shows, at least for weed classification, it does not gain the upper hand over the other functions.


5.2 RQ\(_2\): Which optimization function brings a better classification performance for EfficientNet-Lite4?

The precision, recall, and F\(_1\) scores obtained using EfficientNet-Lite4 are reported in Table 4. As in RQ\(_1\), we use the same set of colors to highlight the maximum values of the quality metrics, i.e., Precision, Recall, and F\(_1\).

Table 5 Accuracy for EfficientNet-B4, EfficientNet-Lite4, ResNet-50, and Inception-V3 (%)
Table 6 Precision, Recall, and F1-score (using the ADAM optimization function)
Table 7 Timing performance

The table shows that training EfficientNet-Lite4 with the ADAM optimization function brings a mediocre performance compared to using SGD and SLS. Running EfficientNet-Lite4 with SLS is beneficial to the final classification, as this configuration brings good Precision and Recall for various categories. Overall, since most of the colored cells relate to SGD, we conclude that it is the most suitable optimizer for this network configuration, since it brings the maximum values for all the evaluation metrics. In particular, with respect to recall, the combination of EfficientNet-Lite4 and SGD yields the maximum score, i.e., 1.00, for eight of the nine weed categories. Concerning the F\(_1\) metric, SGD also helps EfficientNet-Lite4 reach the maximum value for all categories.

To further study the performance of the SGD function, Fig. 5a–d shows the confusion matrices for running EfficientNet-B4 and EfficientNet-Lite4 with various transfer learning strategies. The figures show that training EfficientNet-B4 with NS and AdvProp brings a superior performance compared to the other configurations. Concretely, with both configurations we obtain correct predictions for all the weed categories, i.e., the accuracy is equal to 1.00; only in the Negatives category are there some mispredictions. Altogether, this demonstrates that using the SGD optimizer is beneficial to the final outcome.

Table 8 Summary of recent state-of-the-art algorithms

5.3 RQ\(_3\): How does the proposed approach perform compared to the baselines?

Transfer learning has been shown to benefit classification with deep neural networks (Thenmozhi and Reddy 2019; Duong et al. 2020). In this work, we investigate whether the pre-trained weights obtained by various learning algorithms are beneficial to the recognition of weeds. We ran the networks, i.e., EfficientNet-B4, EfficientNet-Lite4, ResNet-50, and Inception-V3, on the given dataset and obtained a set of categories as results. We computed the evaluation metrics for each of the systems following the descriptions in Sect. 4.4. Table 5 reports the accuracy obtained by all the systems.

In relation to the two baselines, our approach gains a much better performance in terms of accuracy. Concretely, the first baseline (Olsen et al. 2019a) obtained 96.08% with its best configuration using ResNet-50, while the second baseline (dos Santos Ferreira et al. 2019) obtained a maximum accuracy of 94.96% using the VGG16 deep neural network. In particular, using EfficientNet-B4 with the SGD function and weights pre-trained with AdvProp and Noisy Student, we obtain a prediction accuracy of 99.26%, the maximum score among all systems.

To further compare the approaches, from the predicted categories we calculated the precision, recall, and F\(_1\) scores following Eqs. (2), (3), and (4), respectively. The final results are shown in Table 6, where we use the same colors as in RQ\(_1\) to mark the maximum values. From the table, it is evident that EfficientNet-B4 is the best classifier overall, as most of the colored cells are concentrated in the rows representing the evaluation metrics for this configuration. Both baselines obtain an inferior performance compared to EfficientNet-B4 and EfficientNet-Lite4 with respect to precision, recall, and F\(_1\).

For instance, Inception-V3 gets two maximum precision scores, for Categories 3 and 6, and ResNet-50 gains only one maximum recall value. In contrast, most of the best predictions are achieved using EfficientNet-B4. EfficientNet-Lite4 also reaches maximum scores for precision, recall, and F\(_1\).

Table 7 depicts the timing performance of the proposed approach in comparison with the baselines. Among the systems, ResNet-50 is far from optimal, as it can predict the results for only 836 images per second. Inception-V3 is quite efficient, generating predictions for 975 images per second. However, ConvNeXt_Femto and ConvNeXt_Atto are more efficient than Inception-V3, returning 993 and 994 predictions per second, respectively.
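Throughput figures such as those in Table 7 can be estimated with a simple timing loop. The sketch below shows one possible methodology; batch size, warm-up, and CPU/GPU placement are assumptions and may differ from the measurements reported here.

```python
import time
import torch

@torch.no_grad()
def images_per_second(model, batch_size=32, n_batches=50, size=224):
    """Rough inference throughput estimate (images/s)."""
    model.eval()
    x = torch.randn(batch_size, 3, size, size)
    for _ in range(5):          # warm-up runs
        model(x)
    start = time.perf_counter()
    for _ in range(n_batches):
        model(x)
    elapsed = time.perf_counter() - start
    return batch_size * n_batches / elapsed
```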


5.4 Ablation study

To study the generalizability of our approach, we conduct further experiments on the DeepWeeds dataset using state-of-the-art backbones, including the ConvNeXt family (Liu et al. 2022), CoAtNet (Dai et al. 2021), and GCViT (Hatamizadeh et al. 2022). An Nvidia 1080Ti is not suitable for real deployment because of its energy consumption; we therefore carried out additional experiments on edge and very light architectures, including ConvNeXt_Atto and ConvNeXt_Femto, which gives us more choices for deploying the research results on devices with limited computational resources.

The supplementary experiments are implemented using the latest version of the timm codebase (Wightman 2019b). These configurations are summarized in Table 8. For all the model settings, we use a transfer learning strategy with weights trained on ImageNet-1K and the SGD optimization function. The experimental results of the models on the independent test set are shown in Fig. 6. Within the ConvNeXt family, the best performance belongs to ConvNeXt_Femto (see Fig. 6b), which obtains an accuracy of 99.11%, equal to that of ConvNeXt_Nano (see Fig. 6d) while being smaller than ConvNeXt_Nano in terms of parameter count, MAC, and weight file size. The equal accuracy suggests that prediction performance is saturated for the ConvNeXt variants.
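The lightweight ConvNeXt variants are available in timm under the names used in the sketch below, which instantiates them and reports their parameter counts; the model names reflect the timm release we know of and may vary across versions.

```python
import timm

# Edge-friendly ConvNeXt variants considered in the ablation study.
for name in ["convnext_atto", "convnext_femto", "convnext_pico", "convnext_nano"]:
    model = timm.create_model(name, pretrained=True, num_classes=9)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.2f}M parameters")
```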

Fig. 6 Confusion matrices of recent state-of-the-art models on the independent test set of DeepWeeds

Obviously, the training and inference costs for ConvNeXt_Femto are lower than those of ConvNeXt_Nano. ConvNeXt_Pico obtains an accuracy of 99.00%, while the smallest architecture, ConvNeXt_Atto, brings an accuracy of 98.66% (see Fig. 6a). Furthermore, GCViT_XXTiny achieves the maximum accuracy of 99.34% on the independent test set (see Fig. 6f). As shown in Fig. 6e, the second biggest model, CoAtNet_Nano_RW_224, gives a baseline accuracy of 97.72%. This suggests that the synergy between architectures and typical datasets needs to be further investigated in future work.

5.5 Threats to validity

This section explains the possible threats to the validity of our evaluation as follows.

Internal validity. These are internal factors that could influence the evaluation. A possible threat is the comparison with the baselines. Such a threat is minimized, since we ran the experiments using the original implementations, executed the three systems on the same dataset, and compared them using the same set of metrics.

External validity. The threat to external validity is related to the generalizability of our findings, i.e., whether they remain valid outside the scope of this study. We mitigated the threat by evaluating EfficientNet and EfficientNet-Lite4 using a dataset collected in situ, covering different categories of weeds. This aims at evaluating the feasibility of the approach in real-world settings.

Construct validity. These threats pertain to the experimental configurations presented in the paper, with respect to the simulated scenario used to evaluate the system. We conducted the evaluation using a training set and a test set, which might not reflect real-world usage well. To aim for a reliable comparison, we used the same settings to evaluate and compare the systems.

Conclusion validity. This is related to the factors that may influence the obtained outcome. The evaluation metrics, namely accuracy, precision, recall, and F\(_1\) score, might pose a threat to conclusion validity. To tackle the issue, we used the same metrics for comparing our proposed approach with the two baselines.

6 Conclusions and future work

In this paper, we proposed a solution to weed classification exploiting EfficientNet-B4 and EfficientNet-Lite4 as the engine, together with various transfer learning techniques. Our proposed approach has been validated on a real-world dataset. Through the empirical evaluation, we see that the conceived framework outperforms two well-established baselines in terms of various quality metrics. In this respect, we conclude that the combination of EfficientNet and transfer learning brings a substantial improvement in performance compared to using ResNet-50 and Inception-V3. Interestingly, we found that the ADAM optimizer, which has been widely used in deep neural networks, does not gain the upper hand over the SLS and SGD functions. We conclude that the SGD optimizer is the most suitable one for EfficientNet when it comes to weed classification. Altogether, we contribute to an advancement in the classification of weeds in real-world scenarios.

We plan to deploy the system on a robot to detect weeds from images captured live in the field. Such a robot would benefit farmers in Vietnam, as it would help them automatically find and eliminate weeds in rice fields. We are also going to extend the evaluation by considering additional weed datasets, as well as comparing EfficientNet and EfficientNet-Lite4 with other neural network approaches. Last but not least, we will select the most suitable network model and export it to Android, making it an independent tool that works on a smartphone.