1 Introduction

Around two-thirds of all workers on the continent are employed in the agricultural sector, which has a significant impact on the economies of African nations. Changing rain patterns, droughts, flooding, and the geographical dispersion of pests and diseases are just a few of the direct and indirect effects of climate change on agricultural output (Food and Agriculture Organization of the United Nations, 2020). Because yields are exposed to so many shifting factors, crop yield prediction is one of the most difficult problems in precision agriculture. Agriculture's future is negatively affected by changing environmental conditions, particularly global warming and climate variability [1, 2].

Since crop output depends on a wide range of variables, including climate, weather, soil, application of fertilizer, and seed variety [3], this challenge necessitates the use of many datasets [4]. Accurately forecasting crop production using statistical models is a time-consuming and challenging process [5, 6]. The introduction of big data in this decade has increased the adoption of more sophisticated analysis methods such as machine learning. Depending on the study challenge and research questions, a machine learning model may be either descriptive or predictive [7, 8].

Predictive models are used to generate future forecasts, whereas descriptive models are used to learn from the data gathered and explain what has occurred [9]. Machine learning has been employed to find solutions in a variety of fields, including finance [10], biology [11], medicine [12], and, most recently, agriculture.

The Internet of Things (IoT) is a network of interconnected physical objects that includes electronics, software, sensors, and network connectivity [13]. These devices can collect and exchange data, enabling remote sensing and control over an already built network infrastructure [14]. By allowing for more direct interaction between the physical world and computer-based systems, IoT can enhance productivity, accuracy, and financial gain [15].

IoT applications typically require geo-distribution, mobility support, and location awareness [16]. Fog computing (FC) [17], which was introduced by Cisco, shares the virtualization, network, processing, and computing properties of cloud computing (CC). To overcome the problem of high delay, these capabilities are provided close to the Things and are designed for delay-sensitive applications such as healthcare [18] and smart agriculture [19]. The connection between IoT and smart agriculture is shown in Fig. 1.

Fig. 1 Big data-IoT-smart agriculture relationship

The main contribution of this paper is a Crop Yield Prediction Algorithm (CYPA) for precision agriculture based on IoT techniques.

The following are the primary contributions of this paper:

  • A decision support tool is provided to assist farmers and decision makers in forecasting agricultural yields based on climatic conditions in their zones in order to successfully combat climate change and ensure future food security.

  • A sophisticated machine learning-based agricultural yield prediction method is suggested.

  • We also propose a new algorithm which is based on active learning.

  • Incorporating active learning in the proposed Crop Yield Prediction Algorithm (CYPA) can enhance the performance of the model by reducing the number of labeled data required for training.

  • Active learning selects the most informative samples for labeling and updates the model iteratively, leading to a reduction in labeling costs and an improvement in prediction accuracy. By actively selecting the most informative samples to train the model, the CYPA can achieve high accuracy with fewer labeled samples, making it more efficient and practical for use in Precision Agriculture.

  • Additionally, active learning can help the CYPA adapt to changing field conditions, such as weather or pest outbreaks, by actively selecting new samples for labeling that better represent the current conditions. Overall, incorporating active learning into the CYPA can improve the efficiency and accuracy of crop yield prediction, leading to better decision-making at the international, regional, and local levels.
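The active learning idea in the contributions above can be sketched as follows. The text does not state which query strategy CYPA uses, so this is only an illustration under an assumption: query-by-committee, where a committee of bootstrap-trained linear models votes and the pool sample with the largest prediction variance is labeled next. The data are synthetic, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares fit with an intercept column; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def committee_disagreement(X_lab, y_lab, X_pool, n_members=10):
    """Variance of bootstrap-committee predictions for every pool sample."""
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))  # bootstrap resample
        preds.append(predict_linear(fit_linear(X_lab[idx], y_lab[idx]), X_pool))
    return np.var(np.stack(preds), axis=0)

def active_learning_loop(X, y, n_initial=5, n_queries=10):
    """Start from a few labeled rows, then repeatedly label the most disputed row."""
    labeled = list(range(n_initial))
    pool = list(range(n_initial, len(X)))
    for _ in range(n_queries):
        scores = committee_disagreement(X[labeled], y[labeled], X[pool])
        chosen = pool.pop(int(np.argmax(scores)))  # the "oracle" reveals y[chosen]
        labeled.append(chosen)
    return labeled, fit_linear(X[labeled], y[labeled])

# Synthetic stand-in for (temperature, pesticide, rainfall) -> yield; not real data.
X = rng.normal(size=(200, 3))
y = 2.0 - 1.5 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.1, size=200)
labeled, coef = active_learning_loop(X, y)
```

Only 15 of the 200 rows are ever labeled in this run; in a real deployment the "oracle" step would correspond to collecting a yield measurement in the field.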

The remainder of this paper is structured as follows. Section 2 presents some of the most recent research in the field of precision agriculture. The proposed framework is presented in Sect. 3. Section 4 offers the experimental evaluation. Section 5 concludes the paper.

2 Related work

Machine learning (ML), a branch of artificial intelligence, allows computers to learn from data without having to be explicitly programmed. ML has become more useful thanks to big data technology. "Big data" simply refers to a considerable amount of data that is produced quickly from multiple sources. ML, which aids in comprehending data, is fundamentally based on the construction of mathematical models [20].

The three main types of problems that machine learning is used to solve are supervised learning problems, unsupervised learning problems, and reinforcement learning problems [21,22,23]. The mathematical input for machine learning is a tuple (X, Y), where X represents the so-called independent features and Y represents the dependent variable or target. We have a supervised learning problem when the variable Y is known before training. In supervised learning, regression and classification are the two most frequent problems.

A well-known regression problem is house price prediction [24]. Unsupervised learning is used when the variable Y is unknown before training. Its methods are widely applied to clustering problems. Climate change is the long-term alteration of regional or global temperatures or weather patterns. It is challenging to combat global warming and reduce greenhouse gas emissions because of the legal and regulatory concerns involved [25].

Millions of people worldwide, especially in South Asia, Sub-Saharan Africa, and small islands, are projected to see an increase in food insecurity, hunger, and malnutrition [26]. Climate change is one of the main dangers to Africa's agricultural development [27]. Crop yield forecasting continues to be a hot topic for researchers worldwide.

Using data from remote sensing, You et al. [28] developed a deep learning system for agricultural yield prediction. To forecast annual crop production in developing countries, this method combined a Convolutional Neural Network (CNN) with a Gaussian process component and a dimensionality reduction technique. The method was applied to a soybean dataset created by fusing US soil, sensing, and climate data. The authors showed that the Gaussian process component reduced the model's Root Mean Square Error (RMSE) from 6.27 to 5.83 on average with the Long Short-Term Memory (LSTM) model and from 5.77 to 5.57 with the CNN model.

Paudel et al. [29] combined machine learning with agronomic crop modelling techniques to create a machine learning baseline for large-scale crop production prediction. They started with a workflow that prioritised accuracy, modularity, and reuse. The MARS Crop Yield Forecasting System (MCYFS) database was used to create the crop simulation outputs' features along with weather, remote sensing, and soil data.

Sun et al. [30] presented a novel multilevel deep learning model coupling a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN) to extract both spatial and temporal features for agricultural productivity prediction. Their main goals were to evaluate the usefulness of the proposed strategy for US Corn Belt yield prediction and the influence of different datasets on the prediction task. Data on soil characteristics as well as time-series remote sensing data were used as inputs. They carried out their studies in the US Corn Belt states, predicting the county-level corn yield from 2013 to 2016.

Using a 43-year benchmark dataset, Bali and Singla [6] projected wheat crop yield in northern India using a deep learning-based Recurrent Neural Network (RNN) model. Additionally, LSTM was used in their study to address the RNN model's inherent vanishing gradient problem. Results from several models, including RNN-LSTM (RMSE = 147.12, MAE = 60.50), Artificial Neural Networks (RMSE = 732.14, MAE = 623.13), Random Forest (RMSE = 540.88, MAE = 449.36), and Multivariate Linear Regression (RMSE = 915.64, MAE = 796.07), demonstrated the effectiveness of the proposed model.

The authors in [31] propose a novel edge-based spatiotemporal data preprocessing (ESDP) framework for enhancing the accuracy of air quality prediction. The proposed framework utilizes a combination of deep learning models and edge computing technologies to preprocess and analyze large volumes of spatiotemporal air quality data in real time, thereby reducing the computational burden and improving the efficiency of air quality prediction. The authors demonstrate the effectiveness of the framework by conducting experiments on a real-world dataset of air quality measurements from the city of Beijing, China. The results show that the ESDP framework outperforms other state-of-the-art approaches in terms of prediction accuracy and computational efficiency. This article is relevant to our work because its combination of edge computing and deep learning for spatiotemporal data prediction can be applied to various domains, including precision agriculture.

3 Crop yield prediction algorithm (CYPA)

This section presents the implementation in two cases: (i) using machine learning methods, and (ii) using statistical methods (correlation).

3.1 Using machine learning methods

In this study, the yields of the ten most consumed crops globally are predicted using machine learning approaches. Among these crops are wheat, paddy, rice, sorghum, sweet potatoes, soy beans, plantains, cassava, and others.

Our prediction models use well-known and commonly used features that significantly affect the majority of agricultural operations: (i) Temperature: the nation's average annual temperature. Throughout the twenty-first century, temperature increases may cause yield declines across many soil and crop species [32]. (ii) Precipitation: the average volume of precipitation each year. Variations in water availability may affect agricultural productivity and profitability because water is a crucial input in agricultural production [33].

(iii) Pesticide: the quantity of pesticides used annually per square mile of land in a given nation. Crops are treated with pesticides, especially in low-income areas. The most typical justification for applying pesticides is to boost yield. (iv) Yield: the production rate per area, year, and nation. (v) Country: the area under discussion. (vi) Year: the year the data were gathered. The approaches described in this work include Gaussian Naive Bayes (GNB), Gradient Boosting (GB), Decision Tree (DT), Random Forest (RF), and Multinomial Naive Bayes (MNB).

The overall scheme describing the effect of climate change on crop yield is shown in Fig. 2. The machine learning models employed are (i) Gradient Boosting, (ii) Decision Tree, (iii) Random Forest, and (iv) SVR. Four models are trained and validated using the best hyperparameter settings for each of the machine learning techniques in the study. The pipeline for the crop yield prediction system is shown in Fig. 3. This process comprises ETL, feature engineering, model training, evaluation, and, as the last step, model deployment.

Fig. 2 The overall scheme to describe the effect of the climate change on the crop yield

Fig. 3 An overview of the crops yield prediction pipeline

  i. DecisionTreeRegressor

DecisionTreeRegressor is a simple and interpretable algorithm. Tree-based methods can work with both numerical and categorical data, and some implementations also handle missing values; they are relatively insensitive to outliers in the input features. However, a single tree is prone to overfitting, high variance, and instability, and may not generalize well to new data if the tree is too deep or complex. To overcome this, techniques such as pruning and ensemble methods, such as Random Forest, can be used. Overall, DecisionTreeRegressor is a widely used algorithm for regression tasks, especially when interpretability and simplicity are desired.
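To make the mechanism concrete, the following minimal sketch shows how a regression tree might choose a single split on one feature. It assumes the common squared-error criterion; the function names and the toy temperature and yield values are illustrative and are not taken from the dataset.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) of the best single-feature split."""
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy values: yield drops sharply once temperature exceeds roughly 20 degrees.
temps = [10.0, 12.0, 14.0, 25.0, 27.0, 30.0]
yields = [50.0, 52.0, 51.0, 30.0, 28.0, 29.0]
threshold, error = best_split(temps, yields)
```

A full tree applies this search recursively to each resulting subset; limiting the recursion depth is one way to control the variance discussed above.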

  ii. RandomForestRegressor

RandomForestRegressor is a machine learning algorithm used for regression tasks, which predicts a continuous output variable based on input features. It is an ensemble learning method that combines multiple decision trees to create a more accurate and robust model. The algorithm works by constructing a large number of decision trees and training each tree on a random subset of the training data and a random subset of the input features. Each tree in the ensemble produces a prediction, and the final prediction is obtained by averaging the predictions of all the trees.

RandomForestRegressor can model non-linear relationships between the input features and the output variable, and some implementations also handle missing data. It is more robust to overfitting, a common problem in machine learning models, than a single decision tree, and it can handle high-dimensional data with many input features. Additionally, it can provide a measure of feature importance, indicating which input features are most relevant for the prediction.
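The averaging step and the feature importance measure can be checked with a short sketch using scikit-learn. The data here are synthetic and the hyperparameter values are arbitrary choices for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the first feature dominates the target by construction.
X = rng.uniform(size=(300, 3))
y = 10 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.1, size=300)

# In scikit-learn, max_features limits the features considered at each split.
forest = RandomForestRegressor(n_estimators=50, max_features=2, random_state=0)
forest.fit(X, y)

# The forest prediction is the mean of the individual tree predictions.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])

# Impurity-based importances, normalized to sum to one.
importances = forest.feature_importances_
```

Note that scikit-learn draws the random feature subset at every split rather than once per tree, which is a slightly different scheme from the per-tree description above.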

  iii. ExtraTreeRegressor

The Extra Tree model builds an ensemble of unpruned decision or regression trees. It differs from other tree-based ensemble models in two ways. First, it splits nodes using cut-points that are chosen at random. Second, it grows the trees using the entire learning sample rather than a bootstrap replica. The Extra tree splitting process is described in Algorithm 1.

Algorithm 1 Extra tree splitting process
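Because Algorithm 1 is not reproduced in the text, the sketch below shows one common formulation of random splitting in extremely randomized trees, and it may differ in detail from the authors' algorithm. For each of k randomly chosen features it draws one cut-point uniformly between the feature's minimum and maximum in the node, then keeps the candidate with the largest relative variance reduction. All names and the sample rows are hypothetical.

```python
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def random_split(rows, feature, rng):
    """Draw one cut-point uniformly between the feature's min and max in this node."""
    column = [row[feature] for row in rows]
    low, high = min(column), max(column)
    if low == high:
        return None  # constant feature: no split possible
    return rng.uniform(low, high)

def score(rows, targets, feature, cut):
    """Relative variance reduction achieved by the split (higher is better)."""
    left = [t for r, t in zip(rows, targets) if r[feature] < cut]
    right = [t for r, t in zip(rows, targets) if r[feature] >= cut]
    total = variance(targets)
    if not left or not right or total == 0:
        return 0.0
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(targets)
    return (total - weighted) / total

def pick_extra_tree_split(rows, targets, k, rng):
    """Try k randomly chosen features, one random cut each; keep the best-scoring one."""
    best = None
    for feature in rng.sample(range(len(rows[0])), k):
        cut = random_split(rows, feature, rng)
        if cut is None:
            continue
        s = score(rows, targets, feature, cut)
        if best is None or s > best[2]:
            best = (feature, cut, s)
    return best  # (feature index, cut-point, score) or None

# Made-up rows of (average temperature, rainfall) with made-up yields.
rows = [[16.4, 1485.0], [17.0, 1200.0], [25.1, 657.0], [26.3, 593.0]]
targets = [36000.0, 35000.0, 20000.0, 21000.0]
split = pick_extra_tree_split(rows, targets, k=2, rng=random.Random(0))
```

Unlike the exhaustive search in a standard regression tree, only k candidate splits are evaluated per node, which makes tree construction cheaper and the trees more diverse.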

3.1.1 Using statistical methods (correlation)

To test the impact of climate changes on crop yields, we follow two methods: (i) Pearson’s Correlation Coefficient (PCC), and (ii) Multiple Regression (MR).

3.1.1.1 Pearson’s correlation coefficient (PCC)

Correlation coefficients are used to gauge how closely two variables are related to one another. Of the several types of correlation coefficient, Pearson's correlation (sometimes known as Pearson's r) is the most often used, and it is frequently applied in linear regression [26].

To examine the effect of climatic changes on crop yields, we determine Pearson's correlation coefficient (r) between each of the most significant variables and crop yields (CY) as follows: (i) using Eq. (1), we calculate rt between the two variables Average Temperature (TA) and Crop Yields (CY).

$$rt = \frac{{n \left( {\sum TA \cdot CY} \right) - \left( {\sum TA} \right)\left( {\sum CY} \right)}}{{\sqrt {\left[ {n \sum \left( {TA} \right)^{2} - \left( {\sum TA} \right)^{2} } \right]\left[ {n \sum \left( {CY} \right)^{2} - \left( {\sum CY} \right)^{2} } \right]} }}$$
(1)

(ii) We calculate rp between the two variables Pesticides Tonnes (PT) and Crop Yields (CY) as shown in Eq. (2).

$$rp = \frac{{n \left( {\sum PT \cdot CY} \right) - \left( {\sum PT} \right)\left( {\sum CY} \right)}}{{\sqrt {\left[ {n \sum \left( {PT} \right)^{2} - \left( {\sum PT} \right)^{2} } \right]\left[ {n \sum \left( {CY} \right)^{2} - \left( {\sum CY} \right)^{2} } \right]} }}$$
(2)

(iii) We calculate rr between the two variables Average Rain Fall (RF) and Crop Yields (CY) as shown in Eq. (3).

$$rr = \frac{{n \left( {\sum RF \cdot CY} \right) - \left( {\sum RF} \right)\left( {\sum CY} \right)}}{{\sqrt {\left[ {n \sum \left( {RF} \right)^{2} - \left( {\sum RF} \right)^{2} } \right]\left[ {n \sum \left( {CY} \right)^{2} - \left( {\sum CY} \right)^{2} } \right]} }}$$
(3)

where rt is the coefficient between TA and CY, rp is the coefficient between PT and CY, and rr is the coefficient between RF and CY. TA is the average temperature, PT is the pesticides tonnes, RF is the average rain fall, CY is the crop yield, and n is the number of data points.
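The computational form of Pearson's coefficient used in Eqs. (1) to (3) can be sketched directly in Python. The function name is illustrative; the same function serves for rt, rp, and rr by passing TA, PT, or RF as the first argument and CY as the second.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from raw sums: n*Sxy - Sx*Sy over the root of the two variance terms."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator
```

For example, `pearson_r([1, 2, 3], [2, 4, 6])` evaluates to 1.0 (numerator 12, denominator 12), a perfect positive linear relationship.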

3.1.1.2 Multiple regression (MR)

The second phase uses multiple regression analysis to determine how strongly the chosen features affect crop yield. The analysis uses three explanatory variables, listed in decreasing order of partial correlation coefficient: average rainfall (RF), pesticide use (PT), and average temperature (TA). The MR formula is shown in Eq. (4).

$$NCY = \beta_{0} + \beta_{1} (TA) + \beta_{2} (PT) + \beta_{3} (RF) + \varepsilon$$
(4)

where NCY is the new crop yield (the predicted value of crop yield), β0 is the NCY-intercept (constant value), β1 is the change in NCY for each unit increase in TA, β2 is the change in NCY for each unit increase in PT, β3 is the change in NCY for each unit increase in RF, and ε is the model error term.
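The coefficients in Eq. (4) can be estimated by ordinary least squares. The sketch below uses NumPy and synthetic, noise-free data with coefficients chosen for illustration only; the function names are hypothetical and the values are not results from this study.

```python
import numpy as np

def fit_multiple_regression(ta, pt, rf, cy):
    """Return [beta0, beta1, beta2, beta3] minimizing the squared error of Eq. (4)."""
    design = np.column_stack([np.ones(len(ta)), ta, pt, rf])  # intercept + 3 features
    beta, *_ = np.linalg.lstsq(design, np.asarray(cy, dtype=float), rcond=None)
    return beta

def predict_yield(beta, ta, pt, rf):
    """Evaluate NCY = beta0 + beta1*TA + beta2*PT + beta3*RF (error term omitted)."""
    return beta[0] + beta[1] * np.asarray(ta) + beta[2] * np.asarray(pt) + beta[3] * np.asarray(rf)

# Synthetic inputs with known coefficients, so the fit can be checked by eye.
rng = np.random.default_rng(1)
ta = rng.uniform(5, 30, size=50)
pt = rng.uniform(0, 100, size=50)
rf = rng.uniform(100, 2000, size=50)
cy = 100.0 - 2.0 * ta + 0.5 * pt + 0.01 * rf

beta = fit_multiple_regression(ta, pt, rf, cy)
```

On real data the residuals `cy - predict_yield(beta, ta, pt, rf)` estimate the error term, and their distribution is what a normal probability plot inspects.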

4 Implementation and Experiments

The utilized dataset and the suggested algorithm's outcomes are introduced in this section.

4.1 Dataset

The Crop Yield Prediction Dataset [35] is used in this study. The pesticide and yield data are collected from FAO, and the rainfall and average temperature data are collected from the World Data Bank. Yield_df.csv is the final dataset, produced by cleaning and merging the pesticide, yield, rainfall, and average temperature data. Table 1 provides a sample of the utilized dataset.

Table 1 A sample of utilized dataset

Figure 4 displays the HeatMap for the dataset that was used.

Fig. 4 HeatMap

4.2 Performance assessment

This section presents the results of the implementation in two cases: (i) results of using machine learning methods, and (ii) results of using statistical methods (correlation).

4.2.1 Results of using machine learning methods

Table 2 displays the anticipated values for each crop using Gradient Boosting Regressor and Decision Tree Regressor. Figure 5 compares real values to expected ones when using a decision tree regression model, and Fig. 6 compares actual values to projected ones when using a gradient boosting regression model.

Table 2 Predicted data for each crop

Fig. 5 Actual Data versus Predicted Data using Decision Tree Regressor

Fig. 6 Actual Data versus Predicted Data using Gradient Boosting Regressor

The scores obtained are 0.9814 with DecisionTreeRegressor, 0.9903 with RandomForestRegressor, and 0.9933 with ExtraTreeRegressor.
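The text does not name the metric behind these scores. If they are coefficients of determination (R², which is what the `score` method of scikit-learn regressors returns by default), the calculation can be sketched as follows; this is an assumption about the metric, and the function name is illustrative.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the actual values are constant")
    return 1 - ss_res / ss_tot
```

Under this reading, a score of 1.0 means perfect predictions and a score of 0.0 means the model does no better than always predicting the mean yield.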

4.2.2 Results of using statistical methods (Correlation)

To test the impact of climate changes on crop yields, we follow two methods: (i) Pearson’s Correlation Coefficient (PCC), and (ii) Multiple Regression (MR).

4.2.2.1 Calculating PCC

In order to investigate the impact of average temperature on crop yield, we calculate PCC (rt) between TA and CY as shown in Fig. 7.

Fig. 7 Crop Yield (CY) versus average temperature (TA)

Figure 7 shows a weak negative relationship (inverse correlation) between the average temperature (TA) and crop yield (CY), with rt = −0.11477. We also calculate PCC between the pesticides tonnes (PT) and crop yield (CY) as shown in Fig. 8.

Fig. 8 Crop Yield (CY) versus pesticides tonnes (PT)

Figure 8 shows a weak relationship between the pesticides tonnes (PT) and crop yield (CY). The value rp = +0.06408 is positive, meaning that the two variables tend to move in the same direction, but only weakly. We also calculate PCC between the average rain fall (RF) and crop yield (CY) as shown in Fig. 9.

Fig. 9 Crop Yield (CY) versus average rain fall (RF)

Figure 9 shows a very weak relationship between the average rain fall (RF) and crop yield (CY). The value rr = +0.000962 is positive but very close to zero, so the two variables show almost no linear association.

4.2.2.2 Calculating MR

We employ Multiple Linear Regression (MLR) [36] to forecast crop yield from the aforementioned factors (TA, PT, and RF), as shown in Fig. 10.

Fig. 10 Normal probability plot (TA,PT,RF)

From Fig. 10, it is shown that the residuals are normally distributed.
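A normal probability plot pairs each sorted residual with the standard-normal quantile expected at its rank; points lying close to a straight line suggest approximately normal residuals. The sketch below computes those pairs with the Python standard library only. The plotting positions (i − 0.5)/n are one common convention among several and are an assumption here, as is the function name.

```python
from statistics import NormalDist

def normal_probability_points(residuals):
    """Pair each sorted residual with its theoretical standard-normal quantile."""
    ordered = sorted(residuals)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

# Illustrative residuals only; not values from the fitted model.
points = normal_probability_points([0.3, -1.2, 0.0, 1.1, -0.4])
```

Plotting the first element of each pair against the second reproduces the kind of display shown in Fig. 10.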

Table 3 compares the proposed algorithm (CYPA) with the state-of-the-art algorithm proposed in [37].

Table 3 Predicted data for each crop

One limitation of our study is that it is based on data collected from a specific geographical region, and the performance of the CYPA algorithm may vary for different regions. Additionally, the CYPA algorithm relies on accurate and timely data collection, and any errors or delays in data collection can affect the accuracy of crop yield prediction.

Moreover, the presented algorithm utilizes machine learning techniques, which are dependent on the quality and quantity of data used for training. Therefore, the accuracy of the CYPA algorithm may be affected by incomplete or biased data, as well as any changes in environmental conditions that are not captured in the training dataset.

5 Conclusion

This paper presented a novel approach to predicting crop yields using IoT techniques in precision agriculture. The proposed Crop Yield Prediction Algorithm (CYPA) incorporates multiple data sources, such as climate, weather, agricultural yield, and chemical data, so that policymakers and farmers can anticipate annual crop yields in their country. The study demonstrated the efficacy of CYPA by training and verifying five models using optimal hyperparameter settings for each machine learning technique. The results indicate that CYPA can achieve high accuracy in predicting crop yields: DecisionTreeRegressor, RandomForestRegressor, and ExtraTreeRegressor obtained scores of 0.9814, 0.9903, and 0.9933, respectively. Additionally, the paper introduces a new algorithm based on active learning that can enhance CYPA's performance by reducing the amount of labeled data needed for training. Incorporating active learning into CYPA can improve the efficiency and accuracy of crop yield prediction, thereby enhancing decision-making at international, regional, and local levels. Overall, this study highlights the potential of IoT and machine learning techniques in addressing the critical challenge of predicting crop yields. In the future, the proposed algorithm can be used with OCNN [37,38,39,40,41] and can make use of ResNet [42].