1 Introduction

From a climatic and topographic perspective, the Earth is divided into eight main biogeographic realms: Nearctic, Neotropical, Palearctic, Afrotropical, Indo-Malayan, Australian, Oceanic, and Antarctic. Colombia, due to its geographic location, lies within the Neotropical realm, a region rich in biological diversity that encompasses South America, Central America and Mexico [1].

Corn plays a crucial role as a global staple food, consumed daily by more than 4.5 billion people. It is highly relevant due to its caloric content and environmental adaptability, and its demand is expected to grow significantly in the coming years as the global population increases [2,3,4,5,6]. In the specific case of Colombia, corn accounts for 9% of the daily calorie supply; however, the country relies on imports for 74% of its demand [7]. This situation highlights the urgent need to strengthen domestic production, ensure crop availability, and reduce dependence on imports. Ensuring effective corn production with high yields requires preventive, information-based policies, and that information is commonly obtained from predictive data models that guide decision-making [8,9,10,11,12].

The estimation of crop yields is a task of paramount importance for food security [12,13,14,15]. With this information, farmers, commercial improvement organizations, and government agencies make informed decisions that allow for proper crop management, the implementation of development policies, the promotion of national food policies, and the promotion of international trade [4, 16,17,18,19]. In the case of corn crops, estimating their yield helps to understand their response to different environmental stresses [20, 21] and, thus, provides relevant information for their management in a sustainable agriculture environment [22,23,24]. However, making estimates with a high degree of accuracy is a complex task. This process involves multiple factors that directly and indirectly affect plant growth [4, 25,26,27].

The constant spatial and temporal changes in planting environments, as well as the continuous interaction between factors, result in highly complex and non-linear effects that, in practice, make it difficult to provide accurate estimates [28,29,30]. To achieve successful predictions, a representative dataset for each study case is required, containing a subset of features capable of appropriately describing the target concept. Predictive models that employ many features often include irrelevant and noisy ones, which lowers their precision and accuracy; the features to be used must therefore be analyzed and selected to ensure the accuracy of the results obtained [31].

The objective of this research is to identify the climatic factors that are critical in the accurate prediction of corn crop yields in Colombia, a country belonging to the Neotropical zone, to guide the construction of more accurate predictive models with regional applicability. To achieve this, a feature relevance estimation technique is employed, and prediction algorithms are used to validate their performance. While some studies have revealed highly influential factors that can improve model accuracy [32, 33], very few have focused their efforts on identifying those that are predominant in the Neotropical zone for this type of crop.

The rest of the document is organized as follows. Section 2 presents the works related to the case study. Section 3 describes the methodology used in the research. Section 4 describes the development and the results obtained. Section 5 discusses the results. Section 6 presents the conclusions, and Section 7 discusses future work, followed by the references.

2 Related works

A systematic review was conducted in Scopus, ScienceDirect, Web of Science, PubMed, IEEE and Google Scholar. Among the studies that examine the influence of the various factors involved in corn crop yield, the approaches used to estimate the relevance of factors vary. The quality of this estimation depends, to a large extent, on the availability of data and on the methods used, whose complexity and performance significantly affect the selection of attributes and, therefore, the precision of the proposed predictive models.

A common approach is the use of statistical methods. In [34], the “C–D production function model” was applied to evaluate the relevance of factors such as the application of fertilizers, pesticides, sown area and precipitation with respect to the yield of corn crops in Daqing city, China. It was concluded that the application of pesticides and fertilizers significantly influences yield, as does precipitation, whose impact varies depending on the variety of corn planted. Similarly, in [35], the Pearson correlation coefficient and the coefficient of determination were used to analyze variables associated with soil and topography in the states of Illinois and Indiana, USA, highlighting elevation and terrain curvature as the most influential factors on yield. In [36], multiple linear regression was used to evaluate 18 factors related to crop growth in the USA, highlighting precipitation and late-season temperature as the most influential.

Another approach checks the relevance of factors by testing different subgroups of attributes directly on predictive models. In [37], the Random Forest (RF) algorithm is used to estimate the yield of corn crops in the USA, testing different combinations of variables until the one that gives the best predictive performance is found. In this case, year, region, irrigation and seasonal climate are the most relevant factors for predicting, with high accuracy, the yield obtained at the end of the season. The authors of [38] evaluate the influence of 20 attributes related to soil, topography and type of corn crop in two fields in Illinois, USA, using different variations of Artificial Neural Networks (ANN). They use an intelligent problem solver to randomly test 150 combinations of attributes and ANN configurations in order to find the model with the best precision. The results indicate that corn hybrid, relative terrain elevation and cation exchange capacity are the factors with the highest degree of influence on the estimation of crop yield. Finally, in [39], the Hybrid-Maize model, which simulates corn crop yield and the influence of each factor on it, is used to analyze 12 factors related to corn hybrids, crop management and climate in Huanghuaihai, China. By testing multiple variations of these factors, precipitation and temperature were identified as the most relevant, explaining approximately 50% of the yield obtained at the end of the season.

3 Methodology

This research is based on the CRISP-DM methodology, widely used by various authors to describe the life cycle of standard data mining projects [40]. It aims to develop the necessary mechanisms for identifying and selecting the climatic factors with the highest degree of influence on predicting corn crop yield in Colombia.

Below, we present a detailed description of the phases that comprise the methodology used:

  1. Understanding the Problem: This phase addresses the lack of precision in predictive models for corn crop yield, emphasizing the importance of selecting influential factors for model training. This identifies the knowledge gap to be addressed, leading to the case study and the research objective.

  2. Understanding the Data: Involves the search and collection of data in the form of historical records related to climatic factors and corn crop yield in the region. Through analysis, the necessary processing and transformation procedures are identified to create a dataset ready for use.

    There is a fundamental relationship between the phase of understanding the problem and understanding the data; understanding the problem implies the need to have access to the data and its adequate interpretation.

  3. Data Preparation: Various techniques are employed to transform the data according to the study’s requirements. Initially, attribute cleansing is carried out on the datasets by analyzing their frequency of use in research related to the case study, establishing a starting point for transformation processes. Subsequently, the climatic dataset is adjusted to a semi-annual periodicity corresponding to the yield data. Next, the datasets are integrated by matching them based on date and location, outlier values are removed, and normalization is performed. Finally, using RReliefF as a method to estimate feature relevance, a subset of attributes representing the best configuration for the predictive model is selected, resulting in a single transformed dataset ready for use in the modeling phase.

  4. Modeling: ANN and Linear Regression (LR) are used to build two predictive models, each with two different configurations: one employing the selected subset of variables and the other using the total available variables in the study. This is done to validate the performance of the selected attributes.

    In this context, the modeling phase is closely linked to the data preparation phase. During model configuration and testing, it may be necessary to make additional adjustments to the data set to ensure efficient and accurate integration with the required information.

  5. Evaluation: The results obtained in the modeling phase are evaluated to verify the performance of the selected subset of factors and determine if there is a significant improvement compared to using the total available variables.

    This phase seeks to ensure the fulfillment of the research objective. Depending on the result of the model evaluation, it is determined whether the process can advance to the next phase. If the result is insufficient, the process returns to the phase of understanding the problem to make the necessary changes and adjustments.

  6. Deployment: The obtained results are discussed, highlighting the most important findings regarding the factors with the highest degree of relevance in estimating corn crop yield, potential associations between features, and the overall performance of the models with and without the use of the selected subset of attributes.

Figure 1 illustrates the general structure of the methodology and the interaction between its phases.

Fig. 1 Research methodology; adapted from [40]

4 Results

4.1 Data description

The dataset used in this study consists of historical records of climatic factors and corn crop performance in Colombia. Yield data has been recorded semi-annually from 2006 to 2021 in various regions of the country.

4.1.1 Climatic data

These data are provided by the Consultation and Download of Hydrometeorological Data system of the Institute of Hydrology, Meteorology, and Environmental Studies of Colombia [41]. They originate from over 4,400 meteorological stations located throughout the country. The dataset comprises 20 climatological variables, including Maximum Temperature, Minimum Temperature, Average Temperature, Precipitation, Vapor Pressure, Solar Radiation, Sunshine Hours, Cloud Cover, Evaporation, Wind Speed, and Relative Humidity, with over 1.5 million daily records.

4.1.2 Yield data

These are obtained from the Ministry of Agriculture and Rural Development [42] and come from historical records of traditional maize harvests across the country. The dataset includes 17 variables related to production data, including Yield, Planted Area, Harvested Area, Physical Production Status, and Production, totaling 22,440 individual records with semi-annual periodicity, the periodicity commonly used for transitory (short-cycle) crops.

4.2 Data preparation

Data preparation begins with the refinement of features in the previously described datasets based on their frequency of use in related research concerning the case study. This reduces the number of attributes to be transformed and analyzed in later phases.

To determine which attributes are commonly used in other research, the results obtained from 19 related studies are reviewed, with reference to the variables available in the climatic dataset. Table 1 presents the results, with attributes analyzed horizontally and the reviewed research studies listed vertically, marked with an (X) for attributes used in each study.

Table 1 Frequency of use of climatic factors for corn crop yield prediction

The results reveal that precipitation, minimum temperature, maximum temperature, average temperature, vapor pressure, solar radiation, and evaporation are the most frequently used climatic characteristics, with a total of 17, 17, 17, 8, 8, 5, and 3 appearances, respectively. In contrast, cloud cover, wind speed, and dew point temperature are excluded from the study due to their low utilization in the reviewed research. Next, the climatic dataset is transformed by calculating the arithmetic mean of each attribute semi-annually, reducing the annual records from 364 to 2, corresponding to each year’s A and B semesters, resulting in a total of 10,225 records with semi-annual periodicity. Subsequently, using the Department, Municipality, Year, and Period fields as reference points, the climatic and yield datasets are integrated, resulting in 2,984 records and 13 attributes, including yield. Temporal and location attributes are then removed, leaving the dataset with 8 variables.
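The aggregation and integration steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used in the study: the record layout, the field names, the toy values, and the assumption that semester A covers January to June and semester B covers July to December are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

KEY_FIELDS = ("department", "municipality", "year", "period")

def semester(d: date) -> str:
    # Assumption: semester A = January-June, semester B = July-December.
    return "A" if d.month <= 6 else "B"

def aggregate_semiannual(daily_records):
    """Average every climatic variable per (department, municipality, year, period)."""
    groups = defaultdict(lambda: defaultdict(list))
    for rec in daily_records:
        key = (rec["department"], rec["municipality"],
               rec["date"].year, semester(rec["date"]))
        for name, value in rec["values"].items():
            groups[key][name].append(value)
    return {key: {name: mean(vals) for name, vals in cols.items()}
            for key, cols in groups.items()}

def integrate(climate, yields):
    """Inner join: keep only the keys present in both datasets."""
    return [{**dict(zip(KEY_FIELDS, key)), **climate[key], "yield": y}
            for key, y in yields.items() if key in climate]

# Toy daily records (made-up values, for illustration only).
daily = [
    {"department": "Cauca", "municipality": "Popayan", "date": date(2020, 1, 15),
     "values": {"precipitation": 4.0, "tmax": 24.0}},
    {"department": "Cauca", "municipality": "Popayan", "date": date(2020, 3, 2),
     "values": {"precipitation": 6.0, "tmax": 26.0}},
    {"department": "Cauca", "municipality": "Popayan", "date": date(2020, 9, 9),
     "values": {"precipitation": 10.0, "tmax": 23.0}},
]
yields = {("Cauca", "Popayan", 2020, "A"): 1.8}

climate = aggregate_semiannual(daily)
dataset = integrate(climate, yields)
```

Because the join keeps only keys found in both sources, the integrated dataset can be much smaller than either input, which is consistent with the drop in record count reported above.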

Through the calculation of the Z-Score by attribute, which indicates how far a particular value deviates from its arithmetic mean, outlier values accounting for 5.46% of the data are identified and removed, resulting in a total of 2,821 records. The Min–Max data normalization method is applied, using the minimum and maximum values of each attribute as reference. Table 2 displays a segment of the resulting dataset after applying these transformation techniques.
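A minimal sketch of these two transformations is shown below. The text does not state the Z-Score cut-off or whether the population or sample standard deviation was used, so the threshold of 3 and the use of the population standard deviation are assumptions, as are the column name and the toy values.

```python
from statistics import mean, pstdev

def remove_outliers(rows, columns, z_max=3.0):
    """Drop rows where any attribute has |z| > z_max (z_max = 3 is an assumption)."""
    stats = {c: (mean([r[c] for r in rows]), pstdev([r[c] for r in rows]))
             for c in columns}

    def is_inlier(row):
        return all(sd == 0 or abs(row[c] - mu) / sd <= z_max
                   for c, (mu, sd) in stats.items())

    return [r for r in rows if is_inlier(r)]

def min_max(rows, columns):
    """Rescale each attribute to [0, 1] using its own minimum and maximum."""
    lo = {c: min(r[c] for r in rows) for c in columns}
    hi = {c: max(r[c] for r in rows) for c in columns}
    return [{c: (r[c] - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.0
             for c in columns}
            for r in rows]

# Toy column with one extreme value (made-up numbers).
rows = [{"precipitation": v} for v in [9.0, 11.0] * 10 + [100.0]]
cleaned = remove_outliers(rows, ["precipitation"])
normalized = min_max(cleaned, ["precipitation"])
```

Removing outliers before normalizing matters here: a single extreme value would otherwise define the maximum and compress every other observation into a narrow band near zero.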

Table 2 Fragment of the resulting dataset after data preparation

As a final step, the RReliefF algorithm is applied to identify the variables with the highest relevance to the dependent variable. RReliefF extends the Relief and ReliefF methods, which identify statistically influential attributes with respect to a target attribute through instance-based learning. Relief assigns a weight to each attribute and updates it according to the attribute's difference between randomly selected instances and their nearest neighbors (found using the Euclidean distance), both from the same class (near-hit) and from a different class (near-miss). Finally, attributes whose weight exceeds a predefined threshold are selected [52]. RReliefF, in turn, adds the ability to address regression problems, where the target variable is continuous [53].
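To make the weighting idea concrete, the following is a simplified NumPy sketch of RReliefF. It is not the implementation used in this study: it weights the k nearest neighbors uniformly instead of by rank, and the function name, the defaults (200 iterations, 10 neighbors) and the synthetic data are assumptions chosen only for illustration.

```python
import numpy as np

def rrelieff(X, y, n_iter=200, k=10, seed=0):
    """Simplified RReliefF: uniform neighbor weights, Euclidean neighbor search."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    x_range = X.max(axis=0) - X.min(axis=0)
    x_range = np.where(x_range == 0, 1.0, x_range)
    y_range = float(y.max() - y.min()) or 1.0

    n_dc = 0.0             # accumulated "different target" evidence
    n_da = np.zeros(p)     # accumulated "different attribute" evidence
    n_dcda = np.zeros(p)   # accumulated joint evidence

    for _ in range(n_iter):
        i = rng.integers(n)
        diff_a = np.abs(X - X[i]) / x_range          # per-attribute differences
        dist = np.sqrt((diff_a ** 2).sum(axis=1))    # Euclidean distance to instance i
        dist[i] = np.inf                             # exclude the instance itself
        nbrs = np.argsort(dist)[:k]
        diff_y = np.abs(y[nbrs] - y[i]) / y_range    # target differences

        n_dc += diff_y.sum() / k
        n_da += diff_a[nbrs].sum(axis=0) / k
        n_dcda += (diff_y[:, None] * diff_a[nbrs]).sum(axis=0) / k

    return n_dcda / n_dc - (n_da - n_dcda) / (n_iter - n_dc)

# Synthetic check: only the first attribute drives the target.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.random((300, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.01 * demo_rng.standard_normal(300)
weights = rrelieff(X_demo, y_demo)
```

An attribute gains weight when differences in its value coincide with differences in the target among neighboring instances, and loses weight when it differs between neighbors whose target values are similar.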

Table 3 presents the results of applying the RReliefF method, which indicates that the attribute with the highest degree of relevance to yield is solar radiation, followed by precipitation, vapor pressure, and maximum and minimum temperatures. Average temperature and evaporation have a lower degree of influence compared to the other attributes.

Table 3 Selection of relevant characteristics using the RReliefF method

4.3 Predictive model

In order to evaluate the effectiveness of the chosen attributes in estimating corn crop yields, two predictive models were developed: one using ANN and another using LR. For each, two different configurations are presented: one involving all the variables used in the research and another based on the selected subset of attributes. Each configuration is detailed below:

  • Artificial Neural Network: The Multi-layer Perceptron type was used, which, due to its structure and high performance in pattern association for predictions [54], is well suited to the needs of this research. The input layer has 10 or 5 neurons, matching the number of input variables when all available attributes or the selected subset are used, respectively. The network has 3 hidden layers with 64, 32, and 16 neurons and an output layer with 1 neuron. Adam is used as the optimizer, ReLU as the activation function, and Mean Squared Error (MSE) as the loss function, with 200 training epochs.

  • Linear Regression: The Ridge algorithm (L2) is employed, an alternative regularized version of least squares that reduces variance and mean absolute error [55], configured with a penalty coefficient (Alpha) of 0.0001.
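As a rough illustration of the size of the two ANN configurations, the short sketch below counts the trainable parameters of a fully connected network with the layer sizes given above. It assumes dense layers with one bias per neuron, which the text does not state explicitly; the helper name is hypothetical.

```python
def mlp_parameter_count(n_inputs, hidden=(64, 32, 16), n_outputs=1):
    """Weights plus biases of a fully connected network (assumes dense layers with biases)."""
    sizes = [n_inputs, *hidden, n_outputs]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

all_variables = mlp_parameter_count(10)    # configuration with all attributes
selected_subset = mlp_parameter_count(5)   # configuration with the selected subset
```

Under these assumptions the two configurations differ only in the first layer (5 fewer inputs times 64 neurons), so the reduction in model size is modest; any difference in performance comes mainly from the information carried by the inputs rather than from network capacity.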

70% of the data, randomly selected from the dataset, were used for model training. The remaining 30% was used to validate the accuracy in each case. Table 4 presents the results obtained, specifying each model’s configuration and performance, evaluated using metrics such as MSE, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE), both in the training and validation phases.
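The evaluation protocol can be sketched as follows: a random 70/30 split, a Ridge fit, and the three error metrics. This is an illustrative NumPy sketch on synthetic data, not the code used in the study; the closed-form Ridge solution with an unpenalized intercept, the random seed, and all function names are assumptions. Only the 70% training share and the penalty coefficient of 0.0001 come from the text.

```python
import numpy as np

def train_val_split(X, y, train_frac=0.7, seed=0):
    """Random split into training and validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(round(train_frac * len(X)))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

def fit_ridge(X, y, alpha=1e-4):
    """Closed-form Ridge (L2) fit; the intercept is left unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    return {"MSE": mse, "RMSE": mse ** 0.5, "MAE": float(np.mean(np.abs(err)))}

# Synthetic stand-in for the normalized dataset (5 inputs, made-up coefficients).
rng = np.random.default_rng(1)
X = rng.random((200, 5))
y = X @ np.array([0.3, 0.2, 0.1, 0.0, 0.1]) + 0.1 + 0.02 * rng.standard_normal(200)

X_tr, X_va, y_tr, y_va = train_val_split(X, y)
coef, intercept = fit_ridge(X_tr, y_tr)
train_scores = regression_metrics(y_tr, X_tr @ coef + intercept)
val_scores = regression_metrics(y_va, X_va @ coef + intercept)
```

Reporting the metrics on both partitions, as Table 4 does, makes it possible to spot overfitting: a training error far below the validation error would suggest that the model memorizes the training records.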

Table 4 Performance of predictive models in the training and validation phase

The results reveal that the ANN model with the selected subset of variables exhibited the best performance, achieving an RMSE of 0.1216, which corresponds to an accuracy of 87.84%. In contrast, the LR model with the same subset of variables achieved an RMSE of 0.1417, equivalent to an accuracy of 85.83%.

On the other hand, the ANN model that used all available variables performed worse than its counterpart that used the selected subset of variables, registering an RMSE of 0.1403 and an accuracy of 85.97%. Similarly, the LR model with all available variables obtained an RMSE of 0.1424, equivalent to an accuracy of 85.76%.
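The reported accuracies are numerically consistent with computing accuracy as 1 - RMSE on the Min–Max normalized yield scale. The text does not state this formula, so it is an inference from the figures; the short check below reproduces the percentages from the reported RMSE values and also expresses the gain from feature selection as a relative reduction in RMSE.

```python
# Validation RMSE values reported in the text for each model and configuration.
rmse = {
    ("ANN", "subset"): 0.1216,
    ("ANN", "all"): 0.1403,
    ("LR", "subset"): 0.1417,
    ("LR", "all"): 0.1424,
}

# Assumption: accuracy (%) = (1 - RMSE) * 100 on the normalized [0, 1] scale.
accuracy = {key: round((1 - value) * 100, 2) for key, value in rmse.items()}

# Relative RMSE reduction (%) obtained by using the selected subset.
reduction = {
    model: round((rmse[(model, "all")] - rmse[(model, "subset")])
                 / rmse[(model, "all")] * 100, 1)
    for model in ("ANN", "LR")
}
```

Under this reading, the LR model with the selected subset reaches 85.83%, the value quoted in the discussion section, and feature selection lowers the RMSE by roughly 13% for the ANN but by less than 1% for LR.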

5 Discussion of results

Most research on the factors affecting corn crop yields focuses on two groups of biogeographic zones. The first is the Nearctic, represented by the United States, whose climate ranges from arid in the southwest to temperate on the east coast and whose topography encompasses vast plains, mountains and extensive plateaus. The second comprises the Palearctic and Indo-Malayan zones, represented by China, whose climate ranges from arid and cold in the north to tropical in the south, with vast plains in the east, mountains in the west and plateaus in the center [1].

According to the literature review, there is little research on countries near the equator, which belong to the Neotropical zone and are characterized by tropical and equatorial climates, with warm temperatures throughout the year and rainy seasons. Climate variability has a differentiated impact in each region, affecting corn crop yields in a unique way in each geographic area. The presence of seasons in countries such as the United States and China introduces specific challenges, such as seasonal droughts, which can have direct consequences on corn production [56], while in equatorial countries, the absence of marked seasons minimizes these climatic risks, offering distinct conditions for corn cultivation. This diversity in corn response to climate highlights the need for research that analyzes the relevance of climatic factors on crop yields, considering the biogeographic and topographic heterogeneity of the region.

Studies such as [35, 38], which focus on topographic factors and soil properties, are not directly comparable with the present study because they do not address climatic aspects. They nevertheless make it possible to understand and compare equally relevant elements, including the type of methods used to evaluate the relevance of characteristics, the methodology employed and the relative influence of other types of factors on corn crop yields.

This study proposes an approach that identifies the most influential factors in the prediction of corn crop yields by evaluating the degree of relevance of each attribute with respect to crop yield, considering the interaction between attributes by assigning weights based on their neighborhood, using the RReliefF algorithm. The results show that in the Neotropical zone, solar radiation exerts the greatest degree of influence, followed closely by precipitation. In addition, vapor pressure and maximum and minimum temperature exhibit values greater than 0.020. Although their magnitudes are smaller than those corresponding to solar radiation and precipitation, they have a high influence on the estimation of corn crop yields. In the studies that consider the influence of climatic factors, a high correspondence is evidenced in relation to precipitation and temperature, when accurately estimating the yield to be obtained at the end of the season, as presented in Table 5, ratifying the results obtained in this research.

Table 5 Most influential factors in corn crop yields by research

The choice of RReliefF in feature selection before implementing a predictive model offers key advantages compared to other approaches such as using statistical methods [34,35,36] or direct implementation of predictive models [37,38,39]. RReliefF stands out for its sensitivity to local interactions, predictive model independence, robustness to noise, interpretability, and computational efficiency. By focusing on evaluating feature relevance at the local level, RReliefF can capture specific patterns and provide more robust and efficient feature selection, independent of the subsequent prediction algorithm. These features make RReliefF an attractive option in situations where feature interpretation, noise resilience and computational efficiency are valued.

On the other hand, the values obtained for average temperature and evaporation are slightly lower compared to the other attributes. Although these factors have also been identified in previous studies as relevant for estimating corn crop yield, their relative influence in the case study is lower. It is important to note that despite their lower degree of influence, these attributes can still play a significant role in yield prediction when considered together with other climatic factors.

According to the results, solar radiation is the most important factor in estimating corn crop yield. It plays a fundamental role by providing the necessary energy for the photosynthesis process in plants, which directly influences carbohydrate production and crop growth. The energy captured through solar radiation is essential for driving the biological processes that determine corn crop production and yield [57].

Precipitation also plays a crucial role due to its significant influence on yield, and its variability during the crop growth cycle strongly affects the outcomes. This factor is the primary source of water for crops, making it essential for meeting their water needs. Unlike factors such as temperature, solar radiation, and wind, which drive water consumption, precipitation directly supplies the water needed for plant development and growth [58].

Vapor pressure is another determinant factor in this estimation process. It describes the pressure exerted by the water vapor in the air in a specific area and, consequently, how much water vapor is present. It is also directly related to temperature: when temperature increases, both the air’s capacity to hold water molecules and the vapor pressure increase [59].

Maximum and minimum air temperatures are essential characteristics in the yield forecasting process as they regulate plant development rates and the duration of growth processes. Moreover, they control the capacity of the air to hold water molecules, and their variation over time determines a significant part of the corn crop growth stages [60].

During the training and validation process of the models, an increase in accuracy was observed when reducing the input variable set from 10 to 5. In the case of the ANN-based model, the configuration with 5 variables achieved an accuracy of 87.84%. In contrast, the accuracy of the model with 10 variables was 85.97%. Similarly, in the LR model, the configuration with 5 variables resulted in an accuracy of 85.83%, surpassing the accuracy of the model with 10 variables, which was 85.76%. The reduction of variables allowed the models to more effectively capture the relationships between climatic factors and corn crop yield, confirming the importance of the selection process in the predictive capability of the models.

The use of climatic factors and models based on them plays a fundamental role in addressing the issue of corn production. By identifying the relevance of attributes such as solar radiation, precipitation, vapor pressure, and maximum and minimum temperatures in predicting corn crop yield, a more comprehensive understanding of how climate changes affect agricultural production is achieved. This information is crucial for making informed decisions in crop planning, irrigation, and other agricultural practices. Anticipating the values of these attributes and their influence on corn crop yield allows farmers to take necessary management measures to improve results, adapting agricultural practices according to forecasted weather conditions.

6 Conclusions

The climatic and topographic diversity between biogeographic zones creates specific challenges and different patterns in corn crop yields. The Nearctic biogeographic zone experiences seasonal challenges such as droughts, the Palearctic and Indo-Malayan zones have climates ranging from arid and cold to tropical, and the Neotropical zone, to which the equatorial countries belong, presents warm temperatures throughout the year and rainy seasons. From the study carried out in Colombia, a country in the Neotropical zone, it is evident that solar radiation, precipitation, vapor pressure and maximum and minimum temperature are the climatic factors that have the greatest influence on the estimation of corn crop yields, with relevance weights (RReliefF) of 0.033, 0.032, 0.026, 0.022 and 0.021, respectively. These factors, both individually and in their interaction during the crop growth cycle, play a determining role in the yield obtained at the end of the cycle. The significant relevance of these variables in the estimation of agricultural yield is essential for the construction of high-precision predictive models, which are crucial for improving production processes at the regional level.

When contrasting the performance of predictive models that used the complete set of variables with the subset representing the most relevant climatic factors, the importance of identifying the fundamental variables that should be considered when defining high-precision and reliable models becomes evident. In this case study, the ANN model achieved higher accuracy in yield estimation when using the selected variable set, achieving an RMSE of 0.1216, compared to the configuration that used all available variables, which obtained an RMSE of 0.1403. Similarly, the LR-based model showed better performance when using the subset of variables, obtaining an RMSE of 0.1417 compared to the configuration that used all variables and obtained an RMSE of 0.1424.

7 Future works

To optimize the predictive capacity of the models and adapt them to specific contexts, future work could consider additional variables such as soil quality, fertilizer use, and the presence of pests. Detailed temporal analyses of corn crops could establish how climatic factors affect each phase of the growth cycle. Accuracy and efficiency may be improved through techniques such as deep learning algorithms and optimization methods, and by integrating satellite datasets for more accurate measurements and improved spatial resolution. These improvements would result in a better understanding of the factors that influence corn crop yield estimation and facilitate effective application of the results in agricultural decision making.