1 Introduction

Agriculture is a leading application area for emerging technologies and is advancing at a rapid pace. Most of the losses in agriculture occur because crops are selected that do not suit the available soil. Thus, the use of agricultural sensors, the IoT, machine learning, and deep learning in smart farming and precision agriculture is gaining importance [1,2,3]. Owing to the capability of machine learning to learn and analyze the features of the soil in cultivable land, numerous applications can be seen in several dimensions [4, 5]. Identifying the most suitable crop for the available land, taking into consideration the nutrient contents of the soil and the prevalent environmental conditions, is a persistent challenge which, when addressed, would be one of the most beneficial ways to enhance crop production [7]. This has motivated an extensive background study of the existing recommendation systems and further work in this direction [8].

The pomological industry contributes a major portion of the revenue that a nation earns from overseas business. The proposed model aims to maximize fruit growth by analyzing the existing soil type and recommending the most suitable fruit that can be cultivated in it. This, in turn, leads to a better fruit yield. Because the fruit export of any nation is directly affected by improved fruit yield, the proposed model contributes to improving the profitability of the pomology industry and boosting national economies.

The biggest challenge in cultivating fruits using vertical farming methods is the size of fully grown fruit trees. Fruit trees are mostly woody, i.e., they are large trees with woody stems that essentially need soil as the cultivation medium to bear fruit properly, and perennial, i.e., they take more than two years to grow fully and bear fruit. Vertical farming methods are soilless cultivation methods in which a substitute substrate is used instead of soil and the nutrients are supplied through nutrient-infused water. The grow tubes or trays are generally stacked one above the other to maximize yield in limited space. Because of these factors, herbaceous plants are best suited to vertical farming, and very few fruits can be grown this way. Precision agriculture can be implemented in pomology by making the field smart. Sensor-equipped fields can monitor conditions in real time, so that nutrition, moisture, etc. can be supplied to the plants precisely as the situation requires.

The main types of learning in machine learning are supervised, unsupervised, and reinforcement learning. Any machine-learning model extracts knowledge in two phases: training and testing [9]. In the training phase, the model is fitted to the data using the hypothesis function; in the testing phase that follows, the model is evaluated on the rest of the dataset and its accuracy is calculated. In addition, various classification techniques are available for use in machine learning [10,11,12].

Many researchers have worked on this issue and proposed solutions to help farmers in this regard. The novelty of this research lies in its use of an ensemble machine learning technique to make the fruit recommendation system more accurate [13].

The authors, in this research work, have focused mainly on fruits. A dataset has been created that records the nitrogen, phosphorous, potassium, temperature, humidity, rainfall, and pH requirements of various fruits. A hundred entries have been created in the dataset for each fruit. Traditional cultivation practice modifies the soil nutrient contents according to the requirements of the crop. The authors of this work have taken the reverse approach to enhance the cultivation output: soil health, in terms of nutrient contents and pH level, is analyzed together with temperature, rainfall, and humidity to recommend the crop that would thrive best in the available soil and weather conditions. This type of autonomous recommendation system for fruits helps in obtaining a yield higher than the estimated yield [14]. Machine learning offers a systematic approach to solving such real-time problems. It is an effective method that uses historically recorded data to extract hidden information and thereby answer queries that may arise in the future [15].

Because nearly all land suitable for agricultural production is already under cultivation, the demand for food and fodder increasingly has to be met by raising the yield per unit area. Pomology, or fruticulture, is an emerging category of agriculture that benefits the economy of many nations around the world. The authors of [16] state that the better the cultivation of fruits, the stronger the economic benefits to the nation will be. Various drawbacks and uncertainties of cultivation in soil have led to the concept of vertical farming. Smart agriculture involves not only smart monitoring and control of plant growth in various controlled-environment agricultural setups using the IoT but also the analysis of soil properties to determine the best-suited crop for the type of soil available [17,18,19]. Smart agriculture for fruit cultivation in substrate medium holds great significance for traditional pomologists. It makes it easier for proficient as well as amateur pomologists to obtain a more profitable yield. The information offered by current advances in smart fruticulture can be turned into profitable decisions by analyzing the soil properties and nutrient content along with the existing climatic conditions and thereby deciding the fruit to be cultivated [20].

The required technical awareness, the hardware components, cloud computing, internet connectivity, and, most importantly, the increased energy consumption of vertical farming methods make it difficult for a traditional farmer to shift towards them [21]. Significant development has been seen in the use of machine learning in various industries and research areas. Thus, a machine learning system for the betterment of the fruticulture sector is the need of the hour [22, 23].

The gaps identified in the existing literature can be listed as follows: identification of the best-suited nutrient-enriched soil and climatic conditions for a particular fruit tree; identification of the best machine learning algorithm for the fruit recommendation process and of the key parameters affecting that process; analysis of the advantages of Light GBM over other algorithms; knowledge of the combination of soil nutritional and climatic requirements for each fruit; and knowledge of the flexibility, scalability, and dynamic nature of the recommendation model used. These gaps are addressed in the course of this research article. The research questions to be analyzed and answered in this work are stated as follows:

RQ1 Which nutrient-enriched soil is best suited for a particular fruit?

RQ2 Which climatic condition is best suited for a particular fruit?

RQ3 Which machine learning algorithm is the best for fruit recommendation?

RQ4 What are the advantages of a Light Gradient Boosting Machine (Light GBM) based model over other tree-based algorithms?

RQ5 What are the key parameters affecting the efficient recommendation for fruticulture?

RQ6 What are the combinations of nutrient and climatic requirements for each fruit?

RQ7 How flexible, scalable, and dynamic is the proposed recommendation system?

The authors of this research work have created a database for various fruits. The database consists of fields for Nitrogen (N), Phosphorous (P), Potassium (K), temperature, humidity, pH, rainfall, and the corresponding crop. One hundred entries have been recorded for each crop. The fruits considered are raspberry, papaya, orange, apple, muskmelon, watermelon, grapes, mango, banana, and pomegranate. The nutrient requirements of the fruits and their respective growing conditions are obtained from a Kaggle dataset. The nitrogen, phosphorus, and potassium contents of the soil are determined using a soil NPK sensor. Temperature and humidity are measured using a temperature sensor, pH is measured using a pH meter, and rainfall is recorded manually. These sensor values are sent to the cloud for analysis, and the recommendations are made accordingly. The soil NPK sensor allows soil fertility to be assessed systematically. It is a robust, high-quality, rust- and corrosion-resistant sensor that can be buried in the soil for a prolonged period. The sensor also offers high accuracy, very low power consumption, and low latency, which makes it reliable. A system for checking and analyzing the soil type and its nutrient contents is proposed in this research work. A fruit recommender system then checks the soil for its type and nutrient contents and recommends the best fruit that can be grown in that soil under the prevalent climatic conditions. This will be of great help in minimizing fruit waste and providing the maximum possible quality yield of fruit using substrate cultivation methods. The end user will additionally have to review and interpret the fruit suggested by the recommendation system and act accordingly.

Light GBM differs from the other tree-based algorithms in the way it grows its trees. Light GBM grows the tree vertically, or leaf-wise, whereas the others grow horizontally, or level-wise. The leaf with the maximum delta loss is chosen for the next split, and because of this characteristic, the leaf-wise algorithm can reduce the loss more than the level-wise ones for the same number of splits.
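The contrast between the two growth strategies can be sketched with a toy example. This is only an illustration under stated assumptions: the `GAIN` table holds made-up delta-loss values, and `grow_leaf_wise` and `grow_level_wise` are hypothetical helpers, not part of the Light GBM API.

```python
import heapq
from collections import deque

# Made-up loss reduction ("delta loss") obtained by splitting each node.
# Node i has children 2*i + 1 and 2*i + 2; nodes missing from the table
# cannot be split further.
GAIN = {0: 10.0, 1: 6.0, 2: 1.0, 3: 5.0, 4: 0.5, 5: 0.4, 6: 0.3}


def grow_leaf_wise(gain, max_splits):
    """Always split the current leaf with the largest delta loss."""
    heap = [(-gain[0], 0)]  # max-heap emulated with negated gains
    order, total = [], 0.0
    while heap and len(order) < max_splits:
        neg_gain, node = heapq.heappop(heap)
        order.append(node)
        total += -neg_gain
        for child in (2 * node + 1, 2 * node + 2):
            if child in gain:
                heapq.heappush(heap, (-gain[child], child))
    return order, total


def grow_level_wise(gain, max_splits):
    """Split leaves level by level, left to right."""
    queue = deque([0])
    order, total = [], 0.0
    while queue and len(order) < max_splits:
        node = queue.popleft()
        order.append(node)
        total += gain[node]
        queue.extend(c for c in (2 * node + 1, 2 * node + 2) if c in gain)
    return order, total


leaf_order, leaf_total = grow_leaf_wise(GAIN, max_splits=3)
level_order, level_total = grow_level_wise(GAIN, max_splits=3)
print(leaf_order, leaf_total)    # [0, 1, 3] 21.0
print(level_order, level_total)  # [0, 1, 2] 17.0
```

With the same budget of three splits, the leaf-wise strategy follows the most promising branch and accumulates a larger total loss reduction in this toy table, which is the behaviour the paragraph above describes.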

The other advantages that Light GBM has over the other tree-based algorithms are:

  • Higher efficiency – Light GBM is found to perform with higher efficiency than the other tree-based algorithms, like XGBoost.

  • Faster training speed – Training speed is one of the major factors in assessing and selecting any ML algorithm. Light GBM trains very fast compared with other ML algorithms, which makes it more efficient.

  • Better accuracy – Light GBM is found to be more accurate than other existing ML models.

  • Lesser memory consumption – Optimized consumption of memory is another deciding factor for the selection of Light GBM over other ML models.

  • Scalable – Light GBM is scalable to a greater extent than other ML models.

Figure 1 is a depiction of the process of the proposed recommendation system. In the entire process of the pomological recommendation system, the initial phase is of collecting the parameters to form the dataset. The parameters include pH, nitrogen (N), phosphorous (P), potassium (K), and the average rainfall, temperature, and humidity requirements of each fruit. After the collection of data is done, the dataset is created accordingly. The next phase is to analyze the prevalent N, P, K, and pH contents of soil, the temperature, humidity, and the average rainfall conditions, and subsequently recommend the most suitable fruit that can be grown in the given soil and environmental conditions.

Fig. 1 Proposed pomological recommendation system using Light GBM

The major contributions of this work are: (1) creating a model for recommending suitable fruit trees for a given soil type and environmental condition; (2) creating a dataset of 11 varieties of fruits with 100 entries for each fruit against 7 parameters; (3) applying the model to the created dataset; (4) splitting the dataset into training and testing sets in the ratio of 7:3; and (5) evaluating the model and checking its efficiency.

The authors apply the proposed Light GBM-based fruit recommendation system to the generated fruit dataset. The model takes into consideration the suitable soil nutrient contents and the environmental requirements of each crop from the dataset fed to it. For a particular soil type and the prevalent environmental conditions, the model helps farmers to identify the best-suited fruit that can be grown with the available soil nutrient content and environmental conditions. Apart from analyzing the correlation and confusion matrices, the performance of the model is also tested using precision, recall, and F1-score metrics.

This research work is organized as follows: Section 1 introduces the research domain and its importance; Section 2 presents the background study. Materials and methods are covered in Section 3, where dataset collection and the method used are discussed in detail. The obtained results are discussed in Section 4, and finally, the conclusion and future scope of the work are discussed in Section 5.

2 Background study

The authors of [24,25,26] created a neural network in which static soil information is handled by fully connected layers, whereas dynamic information is handled by recurrent LSTM layers. The model was trained on historical data for several soil properties, maximum and minimum temperatures, and precipitation, against historical yield labels at the country level. The model was tested on a separate dataset and produced results comparable to those of approaches that use detailed remote sensing data. The study also examined the potential of data mining techniques for crop yield prediction based on the input environmental parameters. The authors of [27] have hybridized Q-learning with deep learning so that the raw data are mapped directly to the predicted values. A dataset was created by the authors of [28] which consists of images of highly consumed and exported Indian fruits. A user-friendly web page is developed in [29, 30], which attains a prediction accuracy of more than 75%. The work done by the authors of [31] introduces a computationally economical and efficient crop recommendation system using a naïve mathematical model. The scalability of the system is shown by the fact that it takes very little time to implement on different crops. Time of sowing, plant growth, etc., can be read easily from the yield graphs. The best and worst conditions can also be identified, and smaller farms may benefit from this as well. In the research work [32], the authors address data classification for optimized crop recommendation, analyzing the potential of various data mining algorithms for soil classification. The experiment is carried out in the Kasur district of Pakistan, where a comparative analysis of the algorithms at various levels of accuracy determines the effectiveness of the predictions [33]. 
A better understanding of soil categories helps in improving productive farming and reduces the dependency on fertilizers. The authors of [34,35,36] have used two supervised classification ML algorithms, namely ID3 and KNNR, to uncover patterns in a dataset consisting of the average temperature and precipitation for six crops in 10 cities of Bangladesh over 12 years, and thereby provide the prediction. An ensemble model to help farmers is proposed by the authors of [6], which uses ANN, Random Forest, SVM, and KNN. The authors of [37,38,39] have combined satellite-derived datasets of precipitation and soil properties with data from climate prediction models. Maize and soybeans cultivated in Brazil and the USA are the crops taken into consideration, without using data from the Normalized Difference Vegetation Index. Crop prediction is addressed in [40, 41] by studying a fixed dataset using supervised ML methods. The authors of [42, 43] use big data for the accurate storage, processing, and analysis of data, which can be applied in substrate cultivation to benefit farmers and thereby enhance the economic growth of the country. The study in [44, 45] entails a case-based analysis, providing users with empirical evidence on various data mining algorithms for segmenting a dataset of agricultural regions based on soil properties. Light GBM is a decision-tree-based model used for classification and regression [24, 46,47,48]. Its utility has been demonstrated in selection-assisted breeding on a large maize dataset [48,49,50]. The authors of [51] use color slicing and the grey-level co-occurrence matrix to process and extract features of paddy leaves. Random forest, KNN, etc. are used, and the RF classifier performed best. In [52] the authors evaluate the necessary elements for crop recommendation. LGBM showed the best performance in terms of several benchmarks.

The gaps identified from the extensive background study concern fruit cultivation, or pomology [53]. No specific or dedicated work appears to have been done on fruits, although fruit cultivation helps the growth of the economy of any nation. Fruits are mainly cultivated in substrate medium, and thus recommending suitable fruits for a particular type of soil in a specific weather condition is more convenient than creating suitable conditions for the cultivation of a particular fruit. This work is thus aimed at recommending fruits [54] that can thrive in the available soil conditions and the prevailing weather conditions at a specific point in time.

3 Materials and methods

The material needed for this experiment is a complete, reasonably large dataset with entries for various fruits, recording the N, P, K, and pH values required of the soil and the temperature, humidity, and rainfall values required of the climate. This dataset is created by studying the fruit growth requirements and recording them for matching against the existing nutrient content of the soil and the existing climatic conditions. As the climatic conditions and soil nutrient content cannot readily be changed, this recommendation system will help pomologists to cultivate the fruit that can give the best yield.

The method used is a machine learning algorithm named Light GBM [55], which takes the CSV dataset with all numeric data values as the input and recommends the fruit giving the best growth estimation in the existing soil and climatic conditions.

3.1 Dataset collection

The authors have considered 11 fruits in the dataset, namely: raspberry, pomegranate, banana, mango, grapes, watermelon, muskmelon, apple, orange, papaya, and coconut. The corresponding values for the N, P, and K requirements of all the fruits, along with the temperature, humidity, rainfall, and pH values, are entered into the dataset. A hundred entries have been considered for each fruit. The dataset is obtained from Kaggle. Therefore, in answer to RQ. 5, it can be stated that the N, P, and K requirements of each fruit, along with the corresponding temperature, humidity, rainfall, and pH values, are the key parameters affecting the efficient recommendation for fruticulture.

A total of 7 data fields are included in the dataset for each of the 11 fruits considered. N represents the ratio of Nitrogen content in the soil in kg/ha, P represents the ratio of Phosphorous content in the soil in kg/ha, K represents the ratio of Potassium content in the soil in kg/ha, temperature represents the temperature in degrees Celsius, humidity represents the relative humidity in %, pH represents the pH value of the soil, and rainfall represents rainfall in mm. Thus, the dataset was collected by studying and analyzing the nutrient requirements of 11 fruit trees. A hundred entries are made for each fruit with the intention of enabling a detailed analysis of the fruit trees. Studies regarding fruit trees and their nutritional requirements helped in creating this dataset.
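The record layout described above can be sketched as follows. This is a minimal illustration that assumes the CSV header uses the column names `N`, `P`, `K`, `temperature`, `humidity`, `ph`, `rainfall`, and `label`; the source does not give the header verbatim, `FruitRecord` and `load_records` are hypothetical names, and the two sample rows contain made-up placeholder values rather than entries from the actual dataset.

```python
import csv
import io
from dataclasses import dataclass

# Assumed column names; the source does not list the CSV header verbatim.
FEATURES = ("N", "P", "K", "temperature", "humidity", "ph", "rainfall")


@dataclass
class FruitRecord:
    features: dict  # feature name -> numeric value
    label: str      # fruit name


def load_records(handle):
    """Parse CSV rows into FruitRecord objects with float-valued features."""
    records = []
    for row in csv.DictReader(handle):
        values = {name: float(row[name]) for name in FEATURES}
        records.append(FruitRecord(values, row["label"]))
    return records


# Two placeholder rows with made-up values, only to show the layout.
SAMPLE = """N,P,K,temperature,humidity,ph,rainfall,label
20,130,200,22.5,92.0,6.0,110.0,apple
100,80,50,27.0,80.0,6.2,100.0,banana
"""

records = load_records(io.StringIO(SAMPLE))
print(len(records), records[0].label, records[0].features["K"])  # 2 apple 200.0
```

Passing an open file handle instead of the `io.StringIO` wrapper would read the same layout from disk.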

3.2 Methodology

Light GBM is a gradient boosting framework that uses tree-based learning algorithms such as decision trees. Its design increases the efficiency of the model and reduces memory usage. Two novel techniques used in Light GBM are Gradient-based One Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which compensate for the limitations of the histogram-based algorithm used by Gradient Boosting Decision Tree (GBDT) frameworks [56]. Together, these two techniques characterize the Light GBM algorithm and enhance the efficiency of the model.

An instance with a small gradient already has a small training error and is therefore well trained. Gradient-Based One Side Sampling (GOSS) is a novel dataset sampling method in Light GBM that uses gradients to down-sample the instances, so that only a subset of the data is used while training. GOSS retains the data points with larger gradients, since these under-trained instances contribute more to the information gain, and randomly samples from the instances with smaller gradients. To do so, GOSS first sorts the training data by the magnitude of the loss-function gradients under the current model and then selects a subset of the data on the basis of these magnitudes.
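The sampling step can be sketched in a few lines. This is a simplified illustration, not Light GBM's internal implementation: `goss_sample`, `top_rate`, and `other_rate` are names chosen here, the gradients are made-up numbers, and the up-weighting factor (1 − top_rate) / other_rate for the sampled small-gradient instances follows the usual description of GOSS.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling.

    top_rate   -- fraction of instances kept for having the largest |gradient|
    other_rate -- fraction of instances drawn at random from the remainder
    """
    n = len(gradients)
    # Sort instance indices by gradient magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)

    top = order[:n_top]
    rest = order[n_top:]
    sampled = random.Random(seed).sample(rest, min(n_other, len(rest)))

    # Sampled small-gradient instances are up-weighted so that they still
    # represent the small-gradient part of the data in the gain estimate.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return top + sampled, weights


# Made-up gradients for ten training instances.
grads = [0.9, -0.05, 0.02, -0.8, 0.01, 0.03, -0.04, 0.06, 0.07, -0.02]
indices, weights = goss_sample(grads, top_rate=0.2, other_rate=0.3)
print(indices[:2])   # [0, 3], the two largest-gradient instances
print(len(indices))  # 5 instances used instead of 10
```

Only half of the instances are passed on to the split-finding step in this toy run, while the two instances with the largest gradients are always retained.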

Exclusive Feature Bundling (EFB) bundles sparse, mutually exclusive features, i.e., features that rarely take non-zero values simultaneously, into a single feature. This reduces the number of features indirectly, without affecting the learner's performance. So that the values of the constituent features can still be told apart, an offset is added to them, which keeps the exclusive features in different bins of the bundled feature.
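The offset idea can be illustrated with two toy features. This sketch assumes strict exclusivity (no row where both features are non-zero) and uses a hypothetical helper `bundle_exclusive`; it is not how Light GBM exposes the technique, and the feature values are made up.

```python
def bundle_exclusive(feature_a, feature_b, max_a):
    """Merge two mutually exclusive features into one bundled feature.

    Values of feature_b are shifted by max_a so that they occupy a value
    range (and hence histogram bins) disjoint from those of feature_a.
    """
    bundled = []
    for a, b in zip(feature_a, feature_b):
        if a != 0 and b != 0:
            raise ValueError("features are not mutually exclusive")
        if a != 0:
            bundled.append(a)
        elif b != 0:
            bundled.append(b + max_a)
        else:
            bundled.append(0)
    return bundled


# Two sparse toy features that are never non-zero in the same row.
feature_a = [3, 0, 0, 5, 0]
feature_b = [0, 2, 0, 0, 4]
bundled = bundle_exclusive(feature_a, feature_b, max_a=5)
print(bundled)  # [3, 7, 0, 5, 9]
```

In the bundled column, values from 1 to 5 can only come from the first feature and values above 5 only from the second, so a tree split on the bundle can still separate them.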

The associations and correlations among the dataset values are an added advantage for the users, serving as a pattern or reference for future use. The information provided by the recommendation system will also help pomologists to cultivate the suggested fruit by predicting probabilities of yield. Further, in response to RQ. 4, the advantages of Light GBM over the other tree-based algorithms are those listed in Section 1.

The formulae to calculate different performance metrics used to evaluate the performance of the proposed model are as follows:

$$\mathrm{Precision}=\frac{TP}{TP+FP}= \frac{\mathrm{no}.\,\mathrm{ of \,correctly \,predicted \,positive \,instances}}{\mathrm{no}.\,\mathrm{ of \,total \,positive \,predictions \,made}}$$
(1)

Precision is the ratio of true positives to the total number of positive predictions made by the machine learning model. The formula for precision is given in Eq. 1. It measures the proportion of positive predictions that are actually positive.

$$\mathrm{Recall\, or\, Sensitivity}=\frac{TP}{TP+FN}= \frac{\mathrm{no}.\,\mathrm{ of\, correctly \,predicted \,positive \,cases}}{\mathrm{no}.\,\mathrm{ of\, total \,positive \,cases \,in \,the \,dataset}}$$
(2)

Sensitivity, or recall, is the ratio of true positives (TP) to the number of actual positive cases. The formula for recall is given in Eq. 2. It is used to assess how well the model identifies the actual positive outcomes.

$$\mathrm{F}1 =2\times \frac{\mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(3)

F1 is the harmonic mean of precision and recall. It accounts for both false positives and false negatives and remains informative on an imbalanced dataset. The formula to calculate F1 is shown in Eq. 3.

$$\mathrm{Accuracy} =\frac{TP+TN}{TP+FP+FN+TN}$$
(4)
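Equations 1 to 4 can be checked with a short worked example. The counts below are made-up numbers for a single class treated as "positive", and `classification_metrics` is a hypothetical helper written for this illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Made-up counts: 8 true positives, 2 false positives,
# 2 false negatives, 88 true negatives.
precision, recall, f1, accuracy = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.8 0.8 0.8 0.96
```

The example also shows why accuracy alone can look optimistic when negatives dominate: accuracy is 0.96 here although one in five positive cases is missed.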

The developed recommendation system introduces a model to predict the fruit yield using the dependencies among soil features, climatic conditions, and the requirements of the fruits under consideration. Although there are several existing techniques for obtaining farming recommendations, the model used in this research improves crop selection and yield by considering the most relevant crop requirements that match the available soil parameters. Therefore, in response to RQ. 3, it can be stated that Light GBM is the best-suited machine learning algorithm for recommending fruits for a particular soil type and a particular environmental condition.

The outcome of this research model can further benefit the agriculturists or the end users by helping them decide upon the investment capital and the probable benefit, well before the beginning of the sowing process. The scope of predictive financial investment and gain analysis can further be expanded to various financial institutions to allocate funds or fiscal loans to farmers. Among the other advantages, the system is scalable to a large extent. It can be used to recommend a wide variety of crops on the given soil.

The workflow of the proposed fruit recommender system is depicted in Fig. 2. The foremost task is to generate a dataset, for which the authors have considered 11 types of fruits. 100 entries have been considered for each fruit, each described by 7 different parameters.

Fig. 2 Workflow of the proposed fruit recommendation system

The dataset is further analyzed by the recommendation model. Subsequently, the entire process of the work can be divided into six segments:

  1. Modules, datasets, and datatypes of input variables: there are seven modules or parameters under which data values for each fruit are considered, namely: N, P, K, temperature, humidity, rainfall, and pH. The input dataset is a CSV file in which all the values are numeric.

  2. Data preprocessing and cleaning: this involves cleaning the dataset and removing the outliers to normalize the sample.

  3. Technique: Light GBM is used as the technique to implement the fruit recommendation system.

  4. Split dataset into training and test sets: 70% of the dataset is used to train the model and 30% of the dataset is used for its testing.

  5. Evaluate model performances: precision, recall, F1-score, and accuracy are the metrics for evaluation of the performance of the model.

  6. Results: the result shows which fruit is the most suitable for a given combination of soil type and existing climatic conditions. The recommender system recommends the best-suited fruit to be cultivated in the available conditions.
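The 70:30 split in the steps above can be sketched as follows. The source does not say whether the split is stratified by fruit; this illustration assumes a per-class (stratified) split so that every fruit keeps the same 70:30 proportion, and `stratified_split`, the seed, and the placeholder labels are hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_indices, test_indices), split separately per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# 11 placeholder classes with 100 rows each, mirroring the dataset shape.
labels = [f"fruit_{k}" for k in range(11) for _ in range(100)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 770 330
```

The resulting index lists can then be used to select the training and testing rows before fitting and evaluating the classifier.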

4 Results and discussions

This section deals with the analysis of the results obtained in this experiment. Initially, the nitrogen, phosphorous, and potassium content requirements are analyzed for each fruit separately. Then the proportions of the N, P, and K requirements for each fruit are calculated and depicted as pie charts for easier understanding. A scatter plot is drawn to show how close the requirements of the fruits are to each other, and the temperature, humidity, and rainfall requirements of each fruit are also plotted. Many fruits may have similar growing conditions, so a table is created to find which other fruits can be cultivated in soil with a given N, P, and K content under given environmental conditions. Finally, the correlation and confusion matrices are created and analyzed.

4.1 Nitrogen requirement analysis

The plot showing the nitrogen requirement of the fruit trees under consideration is depicted in Fig. 3. The graph shows that the fruit that needs the maximum nitrogen among all the fruits in the dataset is the pomegranate. Coconut is the fruit that requires the least amount of nitrogen. The nitrogen requirement of raspberry lies midway between that needed by muskmelon, banana, and watermelon on one end and papaya on the other end. Orange, mango, and apple are also among the fruits that need the least nitrogen in the dataset.

Fig. 3 Analysis of nitrogen content required in the soil concerning each fruit

4.2 Phosphorous requirement analysis

Figure 4 shows the plot of the phosphorous requirement of the fruit trees in the dataset. Pomegranate, along with orange, mango, coconut, and muskmelon, is among the fruit trees that require the least phosphorous to grow properly. Apple and grapes, on the other hand, require the maximum phosphorous content in the soil among all fruits.

Fig. 4 Analysis of phosphorous content required in the soil concerning each fruit

4.3 Potassium requirement analysis

Figure 5 gives us the graphical representation of potassium requirement in the soil for the growing condition of the fruits considered in the dataset. From the graph, it can be inferred that muskmelon, banana, watermelon, and papaya are the ones that require almost equal quantities of potassium to grow in a soil medium. Orange is the fruit that needs the least quantity of potassium whereas the fruits that demand the maximum quantity of potassium are apples and grapes.

Fig. 5 Analysis of potassium content required in the soil concerning each fruit

4.4 A precise proportion of N, P, and K needed by each fruit

The combined proportions of N, P, and K for each fruit are derived from the plots of the N, P, and K content requirements. Thus, RQ. 1 can be answered in terms of pie charts for each fruit, as shown in Fig. 6. A precise proportion of N, P, and K is needed for a particular fruit to grow in the best possible way. This can act as a basis for better recommendations of the fruits best suited to a certain quality of available soil.
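The arithmetic behind each pie chart is a simple normalization of the three nutrient values. The sketch below uses made-up N, P, and K values and a hypothetical helper `npk_proportions`; it is not taken from the authors' code.

```python
def npk_proportions(n, p, k):
    """Return the percentage share of N, P and K in their combined total."""
    total = n + p + k
    if total == 0:
        raise ValueError("at least one nutrient value must be non-zero")
    return {name: 100.0 * value / total
            for name, value in (("N", n), ("P", p), ("K", k))}


# Made-up requirement values in kg/ha, for illustration only.
shares = npk_proportions(n=25, p=25, k=50)
print(shares)  # {'N': 25.0, 'P': 25.0, 'K': 50.0}
```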

Fig. 6 The proportion of N, P, K in kg/ha as needed by each fruit: a muskmelon; b banana; c watermelon; d papaya; e pomegranate; f orange; g mango; h apple; i raspberry; j grapes; k coconut

4.5 Fruit growth concerning temperature and humidity

Figure 7 is a scatterplot obtained by plotting each fruit in a graph where the horizontal axis represents the temperature in degrees Celsius and the vertical axis represents the percentage of humidity in the atmosphere. It can be seen that mango needs very low humidity but a relatively high temperature, whereas crops like orange and papaya need very humid weather but a relatively low temperature. On the other hand, coconut needs high humidity as well as high temperature.

Fig. 7 Scatterplot of fruits based on temperature and humidity

4.6 Effect of rainfall, temperature, and humidity on fruit growth

Along with the nutrient content of the soil in the specified proportions, the climatic conditions prevalent at that period are another major factor affecting the fruit yield in substrate medium. Temperature, humidity, and rainfall, together with soil pH, are the conditions that must be analyzed by the recommendation system to recommend the correct fruit to the pomologists. In response to RQ. 2, Fig. 8 shows bar plots of the rainfall, temperature, and humidity requirements for the fruit trees to produce the best yield, thus showing the climatic conditions required for a particular fruit to grow.

Fig. 8 Temperature, rainfall, and humidity requirement for fruits

Table 1 is created in response to RQ. 6 and lists the nutrient and climatic requirements for each fruit. In response to RQ. 7, the combination of nutrient content and climatic conditions of every fruit is compared with the others to check the feasibility of fruticulture in other similar conditions; it has been found that a few of the requirements for the cultivation of one fruit may also be suitable for others. This makes the proposed recommendation system more dynamic, flexible, and scalable.

Table 1 Suitable fruits based on soil health and climatic conditions
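The comparison of requirements across fruits can be illustrated with a minimal sketch. It is not the proposed Light GBM recommender; it only shows how overlapping requirement ranges allow more than one fruit to be feasible for the same field. The function name `suitable_fruits`, the fruit labels `fruit_a` to `fruit_c`, and all numeric ranges are hypothetical placeholders and are not values from Table 1.

```python
def suitable_fruits(conditions, requirements):
    """Return the fruits whose every required range contains the measured value."""
    matches = []
    for fruit, ranges in requirements.items():
        if all(low <= conditions[name] <= high for name, (low, high) in ranges.items()):
            matches.append(fruit)
    return matches


# Hypothetical requirement ranges (low, high); NOT taken from the paper's Table 1.
requirements = {
    "fruit_a": {"temperature": (20, 30), "humidity": (80, 95), "ph": (5.5, 6.5)},
    "fruit_b": {"temperature": (25, 35), "humidity": (85, 99), "ph": (6.0, 7.0)},
    "fruit_c": {"temperature": (10, 18), "humidity": (40, 60), "ph": (5.0, 6.0)},
}

# Hypothetical field measurement.
field = {"temperature": 27, "humidity": 90, "ph": 6.2}

result = suitable_fruits(field, requirements)
print(result)
```

Because the ranges of `fruit_a` and `fruit_b` overlap at the measured conditions, both are returned, which mirrors the observation that conditions suitable for one fruit may also suit others.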

4.7 Correlation between different features

Fig. 9 depicts the matrix of correlations among all the variables considered in this research work. The matrix is used to determine the form and the degree of correlation between each pair of variables. The form of correlation may be positive or negative: a positive correlation means that two variables tend to increase together, and a negative correlation means that one tends to decrease as the other increases. In this matrix, each variable has a correlation of 1 with itself.

Fig. 9 Correlation matrix of N, P, K, pH, temperature, rainfall, and humidity requirement for fruits

Although no variable is treated as dependent on any other, the correlation between the phosphorus and potassium content of the soil is close to 1. Nitrogen content is negatively correlated with potassium, phosphorus, and average rainfall and positively correlated with temperature, humidity, and pH. Phosphorus content, on the other hand, is negatively correlated with temperature and pH as well as with nitrogen content, and is positively correlated with potassium content, humidity, and average rainfall. Potassium content is negatively correlated with nitrogen content, temperature, pH, and rainfall, but positively correlated with phosphorus content and humidity.
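As a minimal sketch of how such a matrix can be computed, the following code evaluates the Pearson correlation coefficient for every pair of columns. The use of the Pearson coefficient is an assumption, since the text does not name the coefficient used; the helper names `pearson` and `correlation_matrix` and the toy values are likewise hypothetical and are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_i: {name_j: r_ij}} for a dict of equal-length columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values, NOT taken from the paper's dataset.
toy = {
    "P": [10.0, 20.0, 30.0, 40.0],
    "K": [12.0, 19.0, 33.0, 41.0],
    "N": [80.0, 60.0, 50.0, 20.0],
}

corr = correlation_matrix(toy)
for name, row in corr.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In this toy data the diagonal entries are 1, P and K are strongly positively correlated, and N is negatively correlated with both, which is the same kind of pattern the matrix is read for.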

4.8 Confusion matrix or error matrix

A confusion matrix is a matrix of counts that shows where the model gets confused. It is a systematic way of mapping the predicted values to the actual values. It is used in supervised learning frameworks and helps characterize the performance of a classification algorithm.

The confusion matrix obtained in this experiment is shown in Fig. 10. A 10 × 10 matrix is obtained which evaluates the performance of the fruit recommendation model developed here. It compares the actual values against the values predicted by the developed model.

Fig. 10 Confusion matrix

The confusion matrix obtained in this study shows that the model is working accurately. All the non-diagonal elements are 0.

Running this model produces two outcomes:

  • Binary 0–False

  • Binary 1–True

The y-axis corresponds to the actual values and the x-axis corresponds to the predicted values. There are four potential outcomes in a confusion matrix:

  1. True positive (TP)–correct prediction when the actual observation is positive

  2. True negative (TN)–correct prediction when the actual observation is negative

  3. False positive (FP)–wrong prediction of positive when the actual observation is negative

  4. False negative (FN)–wrong prediction of negative when the actual observation is positive
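For a multi-class problem such as fruit recommendation, the four outcomes are usually counted per class by treating that class as positive and all others as negative. The sketch below builds a confusion matrix with actual labels as rows and predicted labels as columns and derives these counts for one class. The helper names and the toy labels are hypothetical and are not the study's predictions.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def one_vs_rest_counts(matrix, k):
    """TP, FP, FN, TN for class k, treating every other class as negative."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually something else
    tn = total - tp - fn - fp
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical toy labels, NOT the predictions reported in the paper.
labels = ["apple", "banana", "mango"]
actual = ["apple", "apple", "banana", "banana", "mango", "mango"]
predicted = ["apple", "banana", "banana", "banana", "mango", "apple"]

cm = confusion_matrix(actual, predicted, labels)
apple_counts = one_vs_rest_counts(cm, labels.index("apple"))
print(cm)
print(apple_counts)
```

A model with no errors yields a matrix whose off-diagonal entries are all 0, so FP and FN are 0 for every class.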

4.9 Performance metrics of the proposed model and its accuracy

Evaluating the performance of the proposed Light GBM machine learning model is an important step in building an effective ML model. The metrics used for this purpose, known as performance metrics or evaluation metrics, quantify how well the model performs on the given data. They also guide hyperparameter tuning, through which the model's performance can be improved.

Figure 11 shows how the classifier predictions and the training data connect to the F1 score. As the F1 score depends on precision and recall, the intermediate stages are also shown to make the interrelation clear.

Fig. 11 A bottom-up approach showing the relation of metrics from raw data to F1 score

Table 2 shows the performance metrics of all the fruits in the dataset in terms of precision, recall, and F1 score.

Table 2 Performance metrics

Accuracy is the most commonly used parameter for judging a machine learning model. The formula to calculate accuracy is shown in Eq. 4. The accuracy of the model is found to be 99%, the macro average is 0.99 and the weighted average is 0.99.
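The relation between these metrics can be sketched from a confusion matrix alone. The code below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean, accuracy as correct predictions over all predictions) and averages F1 in the macro and support-weighted sense. Averaging the F1 score in particular is an assumption, since the text does not state which metric the macro and weighted averages refer to. The function name `classification_metrics` and the toy matrix are hypothetical.

```python
def classification_metrics(matrix):
    """Per-class precision/recall/F1 plus accuracy, macro F1, and weighted F1.

    Rows of `matrix` are actual classes and columns are predicted classes.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    per_class = []
    for k in range(n):
        tp = matrix[k][k]
        support = sum(matrix[k])                    # TP + FN
        predicted_k = sum(row[k] for row in matrix) # TP + FP
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": support}
        )
    accuracy = sum(matrix[k][k] for k in range(n)) / total
    macro_f1 = sum(c["f1"] for c in per_class) / n
    weighted_f1 = sum(c["f1"] * c["support"] for c in per_class) / total
    return per_class, accuracy, macro_f1, weighted_f1


# Hypothetical toy confusion matrix, NOT the one reported in Fig. 10.
toy_matrix = [[1, 1, 0], [0, 2, 0], [1, 0, 1]]

per_class, accuracy, macro_f1, weighted_f1 = classification_metrics(toy_matrix)
for k, c in enumerate(per_class):
    print(k, {name: round(value, 2) for name, value in c.items()})
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average weights every class equally, whereas the weighted average weights each class by its number of test samples; the two coincide when all classes have the same support, as in this toy matrix.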

5 Conclusions and future scope

The fruit recommendation model proposed in this experiment is found to be efficient, dynamic, flexible, and scalable, and it can help pomologists identify the most suitable fruit to cultivate in the available soil under the prevalent environmental conditions. The dataset created contains the N, P, K, and pH requirements of the soil and the temperature and humidity requirements of the environment that each fruit needs to grow to the fullest. The Light GBM algorithm uses 70% of the data for training and 30% for testing. The available soil and environmental conditions are analyzed and checked for correlation to effectively recommend one or more fruits that can be grown in the prevalent conditions. Table 1 shows the alternative fruits that require similar soil and climatic conditions and that the system can recommend to pomologists for better qualitative and quantitative yields. The correlation and confusion matrices, in turn, show the efficiency of the proposed recommendation model. The precision, recall, and F1 score are also calculated to verify the proper working of the proposed model. The accuracy of the model is found to be 99%, the macro average is 0.99 and the weighted average is 0.99.

Although the proposed model can positively impact the profitability of the economy, its practical implementation in the real world faces several challenges. The soil may erode, or its condition may deteriorate over time, for several reasons, so continuous monitoring of the farm is needed. The dynamic changes in soil and climatic conditions need to be considered every time this model is used for recommendation; failing this, the yield may decline. The presence of pests and rodents in the soil also needs to be monitored continuously.

In the future, the model can be extended in scope to accommodate crops other than fruits. It can be enhanced in two respects. Firstly, it can be expanded to recommend other crops such as grains, lentils, and flowering plants. Secondly, its scalability and utility can be extended to determine and recommend the fertilizer and irrigation needs of crops for more precise yield estimation.

One way to extend the scalability of the existing model is to accommodate other vegetation such as cereals, legumes, and grains. The nutrient, pH, temperature, and humidity requirements of these crops can be analyzed, and a dataset has to be created for all of those plants. The prevalent soil and weather conditions can then be analyzed so that the recommender model recommends the best-suited crop for that region. The model can further be scaled to recommend the fertilizer and irrigation needs of the plants considered. The dataset can be expanded to track the nutrient content of the soil in real time and compare it with the requirements of the plant, so that fertilizer can be recommended as and when needed.