Population Dynamics and Estimation of Damage of the Spittlebug Aeneolamia varia on Sugarcane in Colombia by Using remote Sensing and Machine Learning Tools

The spittlebug (Aeneolamia varia) is one of the most important sugarcane pests in Colombia, where a recent increase in population and distribution specially in southwestern Colombia have led to the need for new technologies for integrated pest management. The objectives of this study were to determine the spatial distribution of this pest in commercial sugarcane fields and to validate machine learning (ML) tools for indirect injury detection and impact on yield (damage) using satellite images. This study was carried out in fields grown with the CC 01-1940 variety in El Cerrito, Valle del Cauca, Colombia, where systematic sampling of the populations (number of adults and nymphs per stem) was carried out. The spatial aggregation and distribution were determined using Moran’s index and point patterns, sequence observations, and analysis with distance indicators (Sadie). The indirect injury detection and quantification of the impact on production were carried out with a ML approach using satellite image products with 10 m spatial and five days temporal resolutions, obtained from a Sentinel-2 sensor using Google Earth Engine. The results indicated that spittlebug populations had an aggregate spatial behavior and high spatial dependence. In addition, the ML algorithms predicted spittlebug injury, and the effect on production was estimated at 26.4 tons of cane per hectare, which represented a 17% reduction in the expected yield. The use of spatial analysis and remote sensing tools are an alternative for indirect detection of injury and for understanding population dynamics of the pest in sugarcane, so they can become instrumental for decision-making on an integrated pest management program.


Introduction
The global sugarcane (Saccharum officinarum L.) agroindustry has a large socioeconomic impact given its many products, including the production of sugar, "panela" (unrefined sugar), electricity, biofuels, paper, among others (Aguilar-Rivera 2019). In Colombia, sugar production is concentrated mainly in the southwest, in the departments of Valle del Cauca, Cauca, Quindío, Caldas, and Risaralda (Asocaña 2020). This sector represents 0.7% of the gross domestic product (GDP) per capita and generates more than 180,000 direct and indirect jobs, benefiting more than 1.2 million families in Colombia (Asocaña 2020).
Currently, a series of phytosanitary problems threatens the sugarcane agroindustry in Colombia, where the spittlebug Aeneolamia varia (F) (Hemiptera: Cercopidae) stands out (Vargas and Gutiérrez 2017). In southwestern Colombia, this pest is emerging, increasing its population and distribution, and presenting a considerable risk given its impact on production, which can be up to 25% of biomass production (Salazar and Proaño 1989), without mentioning the costs associated with pest management and the environmental impact from the use of chemical insecticides (Vargas and Gutiérrez 2017). The injury of spittlebugs in sugarcane is associated with necrosis, portions of reddish brown spots and in advanced stages tissue death (Vargas and Gutiérrez 2017). This kind of injury it is related to nymphs and adults that can inject toxins into plant stems (Taliaferro et al. 1967).
Generally, the implementation of an integrated pest management program entails: (1) sampling and monitoring; (2) determination of the impact on the crop in terms of injury, damage, and loss; (3) application of treatments and evaluation of management practices; (4) forecasts of future population dynamics; and (5) implementation of site-specific agriculture (Brenner et al. 1998;Arbogast et al. 2000;Binns et al. 2000;Ferguson et al. 2003). Monitoring and subsequent spatial analysis are key to generating basic information on population dynamics, species detection, behavior, and pest distribution (Binns et al. 2000;Deleon et al. 2017;Arias et al. 2019).
Spatial analysis of sugarcane monitoring has optimized sampling schemes and, therefore, the integrated management of important arthropod pests, such as the weevil Sphenophorus levis Vaurie (Coleoptera: Curculionidae) in Brazil (Pavlu and Molin 2016); melolonthidae species (Coleoptera: Scarabaeidae) in the USA (Cherry 1984) and Australia (Allsopp and Bull 1989); and the sugarcane borer Diatraea saccharalis F (Lepidoptera: Crambidae) in the USA (Schexnayder Jr et al. 2001).
Remote sensing can optimize the spatio-temporal characterization of a pest attack, in a quickly, accurately, cheaply, and noninvasively ways (Khdery 2021). Specifically, it obtains key information on the interactions between pest attack, the structural changes of the crop, and its effect on the response in specific segments of the electromagnetic spectrum (Hatfield and Pinter Jr 1993). Currently, the combination of the Sentinel-2 A + B satellite constellation (European Space Agency-ESA-Copernicus Programme) and the massive processing platform Google Earth Engine (GEE) generates, validates, and updates numerous spectral monitoring tools for crops on multiple scales (Amani et al. 2020;Segarra et al. 2020). On the other hand, in the last decade numerous studies have reported on the usefulness of machine learning (ML) algorithms to data processing and analysis in the indirect detection of crop injury (Pardede et al. 2020).
The integration of ML tools and remote sensors to detect the effect of pests on sugarcane is incipient. However, recent work has focused on the detection of phytosanitary problems, stress conditions, and yield components (Abdel-Rahman and Ahmed 2008). In this regard, several studies have reported promising results in the prediction of plant growth and final yield (Natarajan et al. 2016;Ghazvinei et al. 2018), identification and mapping of weeds with spectral responses (Souza et al. 2020), and disease recognition with photographic images (Militante et al. 2019). Specifically, in analyses associated with spittlebugs in sugarcane, the use of a hybrid artificial intelligence model for the analysis of population parameters based on climate variables has been reported by Figueredo et al. (2021). In addition, several studies have reported that ML algorithms increase efficiency in time, precision and costs of injury detection, loss quantification, species identification, and monitoring (He et al. 2019;Partel et al. 2019;Rustia et al. 2020).
No studies are available yet, to the best of the authors knowledge, that validate the combination of ML algorithms and remote sensors in relation to spatial distribution and behavior of A. varia in sugarcane. Similarly, these technologies have not been used to classify the presence/absence of spittlebug injury and the subsequent damage, that is, its impact on sugarcane productivity. Finally, the quantification of the effect of spittlebug injury, in addition to generating key information on the affected productivity components, will generate a technical and economic base that will optimize integrated management. The objectives of this study were: (1) to determine and characterize the spatial-temporal dynamics of spittlebug populations and infestation in sugarcane crops; (2) to validate the use of ML algorithms for remote detection of spittlebug injury on sugarcane, using Sentinel 2 A + B satellite images; and (3) to quantify the effect of spittlebug damage on sugarcane yield.

Characterization of the Assessed Sugarcane Fields
This study was carried out in two 30 ha commercial fields (Field 1 and 2) planted with the sugarcane variety CC 01-1940 in a second ratoon (Fig. 1a). They were in the municipality of El Cerrito, Valle del Cauca, Colombia (latitude north: 3.693, longitude west: − 76.306; elevation: 960 m). The average monthly temperature varies between 19 and 30 °C, with minimums of 17 °C and maximums close to 32 °C, annual precipitation with bimodal distribution, and accumulated monthly values between 20 and 160 mm. The relative humidity is between 60 and 80%, and wind speeds are between 2 and 14 km h −1 . The agroecological zone is 6H1, which indicates less than 200 mm per year of excess precipitation and a predominantly sandy soil texture (Carbonell González 2011). All agronomic management practices were carried out according to the instructions of the technical staff in charge. The mechanical harvest was carried out for 13.1 months (06-01-2020), and the spatial obtaining yield was represented based on map (TCH: tons of cane per hectare) with a spatial resolution of 1 m, visualized using the QGIS version 3.16 geographic information system tool (QGIS Development 2016).

Monitoring Spittlebug Populations
At the middle of 2019, a spittlebug attack was detected in the fields described above ( Fig. 1a and b), with a crop of 5 months of age. So, monitoring of pest's immatures (nymphs) and adults was carried out. Then, a systematic sampling was designed in a regular georeferenced 40 × 40 m grid, selecting 51 and 86 sampling points for fields 1 and 2, respectively (Fig. 1c). Field monitoring was carried out on May 25 and 31 and July 2, of 2019. Variables associated with the population level of adults and nymphs were evaluated at each sampling point during three monitoring sessions. Each point was represented by row meter transects (10 m), where the number of stems and the number of adults present in the foliage were determined; then, the root zone was checked for spittle masses, assuming each one contained a nymph. Finally, estimates of the infestation were made by calculating the number of nymphs/stem and adults/stem (Vargas and Gutiérrez 2017). Additionally, injury to the foliar tissue on the plants was evaluated, quantifying it as a dichotomous variable (presence-absence). The injury at the foliar level was characterized as portions of reddish brown spots, chlorosis, and necrosis in a more advanced stages tissue death ( Fig. 1b) (Vargas and Gutiérrez 2017).

Spittlebug Population Inference and Level of Injury
With samplings executed in three different times (described above), a data set was obtained and used for estimates of population parameters using descriptive statistic methods. For pest population variables (number of nymphs and adults) and infestation (nymphs/stem and adults/stem), the mean, mode, minimum, maximum, standard deviation, variance, and coefficient of variation (CV) were estimated. In addition, with the aim to evaluate whether there were population changes over time, an one-way analysis of variance and subsequent means comparisons were performed using Tukey's test (P < 0.05). The Levene and Kolmogorov-Smirnov criteria were used to evaluate the homoscedasticity and normality of the data, respectively. Analyses were run in the free statistical software R version 4.2.1 (R Development Core Team 2022).

Analysis of the Spatio-temporal Dynamics of the Spittlebug Populations
The analysis of the spatial-temporal dynamics of nymphs and adults was carried out in two steps. The first step determined the aggregation (random, aggregate, or uniform) and spatial dependence of spittlebug populations using the Moran index (MI) (Moran 1948). In the second step, the spatial distribution was modeled, and since the number of nymphs and adults were counting types (discrete) and distributed randomly, spatial tools of the point and area pattern type were used (Cressie 1991). The methods used in this second step were: (1) analysis distribution patterns using the kernel density estimator, where the points were a function of the coordinates (xy), and the mark was a function of the value of the population count (nymphs and adults). This process was implemented with the statistical software R and the functions of the spatstat package (Baddeley et al. 2015).
(2) Area patterns using count sequence observations (population of nymphs and adults) in space (coordinates) based on the monitoring carried out at each sampling point in the field, implemented with R and the functions of the epiphy package (Gigot 2018). (3) Spatial analysis with distance indicators (Sadie) (Perry 1995;Li et al. 2012), implemented with R and the functions of the epiphy package (Gigot 2018). The Sadie index is designed for individual count analysis, obtaining an aggregation index (Ia) and associating the presence of neighbors with a high presence (Vi) or absences (Vj) (Perry 1995;Li et al. 2012). Likewise, the colors in the points provide a visual representation that is related to clusters associated with the size and dimension of the studied phenomenon (Perry et al. 1999a, b).

Acquisition of Multi-band Satellite Images with High Temporal and Satellite Resolution and Determination of Vegetation Indices
Our approach was based on evaluating two sugarcane production cycles: (1) without infestation and (2) with spittlebug infestation. The average length/cycle was 13 months. For both, monitoring was carried out using satellite images from planting to harvest. The time window evaluated was from years 2017 to 2020, specifically the first crop cycle occurred between October 2017 and November 2018, while the cycle of high infestation, between December 2018 and January 2020.
The multi-step scheme that was executed on the GEE platform consisted of: (1) selection of the luck area and obtaining all the available images for the 2017-2020 period; (2) cloud mask and threshold to select products that meet quality and information parameters; (3) clipping, band extraction and calculations of pixel values associated with reflectance to determinate the vegetation indices described in the next section; (4) visualization of vegetation indices for the lots and extraction of characteristics for the sampling points; and (5) export information. Additionally, since band 11-SWIR 1 had a spatial resolution of 20 m, an interpolation and resampling process was carried out at 10 m, supported by the 10 m resolution bands. For this process, a convolutional neural network (Wu et al. 2022) was optimized and applied, implemented in the free Python programming language using the TensorFlow and Keras libraries (Brownlee 2016).
In total, 39 (for 2017-2018) and 31 (For 2018-2020) Sentinel-2 satellite images were obtained for the fields, divided into two production cycles: (1) without spittlebug affectation (2017-2018) and (2) with spittlebug affectation (2018-2020), respectively. With the images, previously reported vegetation indices with a high potential for detecting alterations in plants caused by arthropod pests were calculated (Table 1). Additionally, blue, green, red, and near-infrared (NIR) bands were obtained. Subsequently, these products were projected to the evaluated fields. This process was executed with R with functions in the LSRS package (Sarparast 2018).

Indirect Detection of spittlebug Injury Using Machine Learning Tools
To indirectly detect the spittlebug injury in the host, a set of multi-band satellite images (described above) were used that had a temporal correspondence after field pest population evaluations between the fifth day and 15 days after the last field evaluation (May 25 and 31 and July 2, 2019). Then, the indices described in Table 1 were determined, and the bands were extracted for use as predictor variables. Healthy plants (absence of injury), soil, and injured plants were used as classes to be predictors. The presence class was obtained by extracting the mean of the sampled plants at any of the sampling points ( Supplementary Information 1 A and B), indicating presence class as plants presenting some level of injury (as described above), while absence means plants with no evidence of injury.
For the classification process of the previous classes, the ML approach was used. Specifically we selected the random forest (RF) classifier for simplicity, low computational cost and effectiveness (Pal 2005). We implemented an assembly model of the bootstrap Bagging aggregation (Breiman 1996). The data were randomly divided into two data sets: (1) training (75%) and (2) testing (25%), and in each run of the model, a seed was placed to stabilize the predictions and achieve reproducibility in the results.
Different variations of input data and parameters were made within the RF algorithm, seeking a balance between robustness, significance, and computational performance. A set of variable combinations (bands and bands + indices) and classes was evaluated. At the algorithm level, variations were made in the number of trees, number and depth of nodes (1-4000), and the hyperparameter alpha (0-20) (Henao-Rojas et al. 2021). Additionally, the evaluated classes were balanced, incorporating artificial class weights in the RF classifier algorithm (Liaw and Wiener 2002).
The classification result was evaluated using five statistical parameters: (1) the receiver operation characteristic (ROC), comparing the values on the scalar value of the area under the curve (AUC), a measure of the quality of the classification (Cortes and Mohri 2003); (2) confusion matrix (Bekkar et al. 2013;Mueller and Guido 2016), where the rows corresponded to the true classes and the columns to the predicted classes. With each matrix, the global accuracy (GA) was calculated using the ratio between the total number of correct samples over the total number of samples (Hasmadi et al. 2009); (3) the Kappa index; (4) inference error using a stratified 10 K-fold cross-validation (1-inference error = precision); and (5) significance (P < 0.05).
The supervised classification and the evaluation of the results were developed by building code in the free software R using functions from the caret (Kuhn et al. 2020); random forest (Liaw and Wiener 2018); Car (Fox et al. 2020); pROC (Robin et al. 2021); randomForestExplainer (Paluszynska and Biecek 2019); and Raster (Hijman 2016) packages.

Estimation of Spittlebug Infestation Effects on Yield Using Indirect Detection of Injury and a Machine Learning Approach
With the objective of quantifying some productivity parameters prior to harvest, a stratified sampling was designed on the 10th month of crop development (20 of September 2019) (Fig. 1d). The infestation level (IL) was associated with the normalized differential vegetation index (NDVI) according to the distribution of quartiles as follows: (1) high infestation and low NDVI values (Q4: > 0.451 IL and < 0.33 NDVI); (2) mean infestation and intermediate NDVI values (Q2-Q3: 0.141-0.45 IL and 0.34-0.43 NDVI); (3) low infestation and medium high NDVI values (Q1-Q2: 0.01-0.14 IL and 0.44-0.57 NDVI); and (4) no infestation and high NDVI values (QI: 0 IL and > 0.58 NDVI) (for further details, see Fig. 1f). The IL for each evaluation point was obtained as the mean of the pixel value after joining the interpolation rasters, using the kernel density method (described above) for the variables nymphs/stem and adults/stem and functions of the raster library, implemented in the free software R (Hijman 2016).
The free application Avenza Maps (Avenza Systems Inc) was used for the geolocation of each sampling point. Then, the number, height and diameter of the stems were determined in transects of 10 m each. The data were subject to an analysis of variance and post hoc tests, using the Tukey test (P < 0.05), after verifying the normality and homoscedasticity of the data using the Levene and Kolmogorov-Smirnov criteria, respectively, using statistical free software R.
The second analysis aimed to evaluate the use of multivariate regression methods and an ML approach for predicting infestation effects on sugarcane yield using predictive variables obtained from satellite images described above. Our approach was based on the sensitivity of NDVI obtained at a resolution of 10 m from Sentinel-2 sensor images for the  Normalized difference vegetation index (NDVI) Yang et al. (2005) characterization of phenological phases and productivity of sugarcane (Wang et al. 2019) (Fig. 1e).
With the 39 and 31 images obtained for the above-mentioned periods, the mean spectral responses of each index and band were calculated (Table 1) for each phenological phase for the two evaluated cycles (Fig. 1g) as predictor variables. The prediction variable was the yield, expressed in tons of cane per hectare (TCH) for the 2017-2018 (without infestation) and 2018-2020 harvests (with spittlebug infestation), obtained from the mechanized and visualized harvest with the raster format of continuous values using the free software QGIS.
With this information, some parameters of spatial descriptive statistics were determined. The behavior of the spectral indices and bands at different points of crop development and their relationship with the TCH were estimated by calculating the centroids associated with each polygon of the production area, and the subsequent extraction of values, whose visualization was carried out using histograms, was done using the sf library of the free software R (Pebesma 2018). In addition, histograms were generated per infestation level to compare the behavior between sampling days and the spectral response in the crop.
An estimation model was fitted from the values of each pixel for the variable to be predicted (TCH) and the predictors (bands and indices obtained from spectral images), comparing two approaches. In the first approach, we used a generalized additive model (GAM) (Eq. 1). Initially, a selection process of non-autocorrelated variables was carried out using the Pearson or Spearman coefficient, eliminating those with values less than 0.8. Subsequently, the selection of predictive variables was carried out with a stepped selection method, using the Akaike information criterion (AIC) until the variables were correlated with the residuals (Efron et al. 2004). The procedures were developed using R and the functions of the gam library (Hastie 2020).
where TCH is the tons of cane per hectare; β 0 is the intercept; and f (xs) is the standardized vegetation bands and indices according to z transformation methods using the expression reported in Eq. 2. (1) where χi = the raw variable value, μ = mean, and σ = standard deviation.
In the second approach, we implemented ML algorithms. We selected a regularized linear regression using the methods Lasso (stands for Least Absolute Shrinkage and Selection Operator) regression model because this method selects variables, excluding less relevant predictors (Friedman et al. 2010). The penalty of this model was controlled by evaluating a set of values of the hyperparameter λ (0-20), selecting the λ value that gave rise to the best model through crossvalidation using the statistical parameter of the mean square error (MSE). The calibration and validation of the model were developed with a data set, 75 and 25%, respectively. The procedures were developed using the software R and the functions in the glmnet library (Friedman et al. 2021).
The evaluation of the two models was carried out using the coefficient of determination (adjusted R 2 ), AIC, and root mean square error (RMSE). The procedures were developed using the free software R. The results of the selection of variables with the Lasso regression method (described above) showed that the biomass production (TCH) was highly influenced by the level of spittlebug infestation (adults and nymphs per stem) and the indirect injury detection associated with NDVI values. So, the TCH and infestation were fitted to a second-order linear and quadratic model as a function of the infestation level and the NDVI, respectively.
The TCH and infestation data for each sampling point were obtained by extracting the values according to the coordinates of the harvest map and the mean of the three samplings. The fit of second-order linear and quadratic models was evaluated with statistical significance (P < 0.05), coefficient of correlation (r), determination (R 2 ), and mean square error (MSE). Likewise, for all models, the assumptions of normality, independence, constant variances, and linearity were guaranteed. The procedures were developed using the statistical free software R.

Spittlebug Infestation Levels and Space-Time Population Dynamics
The monitoring of adult spittlebug populations and their level of infestation showed high variation over time, represented by values of the CV between 102.5 and 234%. The highest average populations were present in the first and second sampling times, with a reduction in the third evaluation. In this sense, maximum population values of 110 adults were observed in a linear row meter, along with maximum infestation values of 73 adults per stem, for all samples (Table 2). This variation gave rise to significant differences depending on the sampling time (Fig. 2). For the nymph population and infestation, the means were similar in all three samplings but with high variation, as represented by the CV and high number of atypical values, higher than Q4 of the box-and-whisker plot (Table 2, Fig. 2a  and c). In some sampling points, maximums of 151 nymphs were found, with infestation levels of 78 nymphs per stem. This high variation resulted in no statistical differences, depending on the sampling time (Table 1 and Fig. 2b and d).
For the spatial dynamics, it was found that, according to the Moran index, the adult populations had an aggregate distribution, highly aggregated distribution, and medium aggregation, with a random tendency, in samples 1, 2, and 3, respectively (Table 1). On the other hand, the populations of nymphs presented a very high, medium, and medium-low aggregation for the three samplings, respectively (Table 1) (Fig. 1a).
For the spatial distribution, populations of nymphs and adults were generally grouped in areas within the lot with variations depending on the sampling (Figs. 1a and 3). The higher concentrations were in the central and southwestern areas of the fields, with a tendency toward spatial aggregation. The three methods of spatial analysis used found coincidence in the above-mentioned. On the other hand, the interpolation with the kernel method assigned higher values of density or grouping, depending on the state type of the insect and the sampling time (Fig. 3a). Likewise, the area pattern method grouped centroids with population values for adults and nymphs between 1 and 7 individuals (Fig. 3b). The Sadie index was able to differentiate areas with a high risk of finding populations of adults and nymphs (red, Fig. 3c) with an aggregate distribution, those areas with low levels (blue dots, Fig. 3c), and the areas with movement of insects toward new areas (white color, Fig. 3c).
For the nymphs, higher levels of population grouping were observed in the first and third sampling, with greater intensity in the southeast, center, and northeast (Fig. 3), as corroborated by the interpolation using the kernel density (Fig. 3a). This result was corroborated with the Sadie index, which showed that, in these areas, there was aggregation and a greater probability of finding nymphs (Fig. 3c). The densities of adults were considerably higher in the first sampling, with a tendency to concentrate toward the edge of the lot in the southwest but with a more random distribution (Fig. 3b). Later, in samplings 2 and 3, although the population level decreased, the aggregation increased, moving from the edges to the central areas in the fields (Fig. 3).

Remote Spittlebug Injury Detection Using Satellite Imagery and ML Tools
It was found that the ability to correctly discriminate the classes associated with injured plants associated with spittlebugs, without injury, and soil using vegetation indices and bands obtained from satellite images varied depending on the number of variables used as predictors, the sampling time, and the optimization of the RF algorithm according to the number of nodes, minimum depth, and hyperparameter λ ( Fig. 4 and Table 3).
For the RF algorithm, the best prediction was based on a balance between the accuracy, the values of the kappa statistic, the ROC-AUC and the significance for the training and testing data set and was obtained with a combination of predictor variables between 4 and 10, a number of nodes between 120 and 334, a minimum depth of nodes between Table 2 Descriptive statistics and population of nymphs, adults per meter of crop row and infestation expressed as nymphs and adults per stem in sugarcane commercial plots under natural infestation of Aeneolamia varia CV coefficient ofvariation a Values close to 0, less than − 1, and close to 1 indicate a random, uniform, and aggregated spatial pattern, respectively (Moran 1948 (Table 3). For the discrimination of classes associated with injury associated with spittlebug adults or nymphs, it was found that the informative variables differed for each sampling time. First, a highly predictive capacity for the presence of spittlebugs injury in the NIR band was observed, followed by the AVI, EVI, and SAVI vegetation indices. Likewise, a good prediction for the absence of spittlebugs injury was obtained in the red, green, and blue bands. In the second sampling time, a low predictive capacity was observed for the bands and indices for pest injury, while for the absence of injury the more important predictors were the NDVI, AVI, ARVI, and NDWI vegetation indices. In the third sampling, only the green band had predictive importance for the presence of injury of spittlebugs, while for absence, the more important predictors were the EVI, NDVI, and SAVI vegetation indices and the blue and NIR bands. In the three samplings, the red, green, and blue bands presented predictive capacity for the soil class (Fig. 4a).
According to the values found in the confusion matrix (Fig. 4b), the results indicated that the trained RF model presented variations in the correct selection of classes as a function of time. In the first sampling, the presence of pest injury was moderately estimated, but the absence was underestimated since a higher value indicated a presence. On the other hand, in the second and third sampling times, the opposite occurred, and the classes associated with pest injury were underestimated.
The prediction accuracy of the evaluated classes, as well as the other evaluated statistics, varied depending on the number of classes, the combination of the predictive variables and the sampling time (Table 4). In general, a better fit was found using all bands and indices, all classes, and sampling times (Table 3). Based on the above, a best-case accuracy of 68, 70, and 72% was achieved for monitoring 1, 2, and 3, respectively (Table 3). Figure 4c represents the spatial distribution of the classification using the RF algorithm of the evacuated classes in the three samples. Here it is evident a considerable difference in the magnitude of the presence of injury associated with spittlebugs classified between the first sampling in comparison to the other two. Despite these differences, the prediction of presence of injury tended to be concentrated in similar areas during the three surveys and was highly associated with the population analyses of nymphs and adults using spatial statistic tools (Fig. 4). The soil class was identified with higher density in the second sampling, possibly because of the high infestation resulting from high population levels in those same areas during the first sampling (Table2 and Fig. 4).
There was a correspondence between the density of pixels for each of the seven vegetation indices and the bands evaluated with respect to the level of infestation of nymphs and adults ( Supplementary Information 1), explaining how the

Determination of Spittlebug Infestation and Its Effect on Yield
The more informative vegetation indices for discriminating differential values of TCH values were NDVI, GNDVI, EVI, AVI, and ARVI (Fig. 5a-c), explaining how the NDVI was highly consistent and sensitive in differentiating the different phenological phases of sugarcane during its crop cycle and   (Fig. 1g). For the relationship of the mean values of the vegetation indices during the crop cycle with the TCH, lower levels coincided with the areas where there was less yield (Fig. 5a, b). These results coincided with the fact that higher numbers of pixels with high index values, such as NDVI, GNDVI, EVI, AVI, and ARVI, were found in the classes with high production (120-124.9 and 130-142 TCH) (Fig. 5a-c). The latter results show consistency of areas with a lower population level of nymphs and adults (Fig. 3) and areas classified by the RF algorithm as non-infested areas (Fig. 4).
At the 10th month of crop development the comparison of relationships between the productivity components related to the number, weight and diameter of stems, and based on the categories of infestation levels from the NDVI indices: none (0 IL and > 0.58 NDVI), low (0.01-0.14 IL and 0.44-0.57 NDVI), medium (0.141-0.45 IL and 0.34-0.43 NDVI), and high (> 0.451 IL and < 0.33 NDVI), presented statistical differences (Figs. 1c and 6a-c). High infestation levels led to a lower number, height, and diameter of stems. Between the medium and low categories, there were no considerable differences in the productivity components in terms of relative magnitudes; the losses from the effect of high levels of spittlebug infestation occurred in descending order: number of stems (~ 70% lower), height of stems (~ 48% smaller), and stem diameter (~ 33% smaller) (Fig. 6a-c).
For the prediction of production using the two approaches in this study, the traditional method using a GAM-type model was inferior to the Lasso regression (Table 4). The prediction of TCH for both methods was slightly overestimated, but the value did not exceed 10% of the real performance, with a better fit for the Lasso regression with respect to the GAM (Table 4).
For both crops cycles (with and without infestation), the methods were sensitive in detecting changes in the estimated yield based on the mean values of the spectral indices and bands obtained from satellite images during a cycle of approximately 13 months. The ML method was able to predict that the losses associated with the spittlebugs injury were 26.4 TCH, when high infestations of nymphs and adults per stem were compared with similar areas but without infestation, while the real value of the reduction from the affectation was 24.7 TCH, corresponding to a 17% reduction in the expected production, and where the estimated loss value with the GAM method was lower (17.1 TCH) ( Table 4).
The simple linear models presented a good fit, indicating that the variation explained by the predictor variables was greater than a random effect (Fig. 7a-d). Negative relationships were found between infestation and plant productivity variables. In the case of the number of adults per stem, there was a first-order relationship between the infestation and TCH (F = 145.23; gl = 1, 135; P < 0.001; r = − 0.71; R 2 = 51.82; MSE = 5.60; Fig. 7a), while, between the infestation and the NDVI index, the relationship was quadratic (F = 141.8; df = 2, 134 = P < 0.001; R 2 = 67.92; MAE = 0.06; Fig. 7c). According to the relationship between adults per stem and TCH, an increase of one unit in the infestation would be associated with a decrease in TCH of 19.2%. In reference to the nymphs per stem, there was a first-order relationship between infestation and TCH (F = 36.95; df = 1, 135; P < 0.001; r = − 0.56; R 2 = 41.48; MAE = 7.32; Fig. 7b), while between the infestation and the NDVI index the relationship was of the second order (F = 25.95; df = 2, 134 P < 0.001; R 2 = 27.91; MAE = 0.08; Fig. 7d). According to the relationship between nymphs per stem and TCH, an increase of one unit in the infestation would be associated with a decrease in TCH of 3.4%. Table 4 Ability of traditional and machine learning methods to predict the yield on sugar cane without and with infestation of Aeneolamia varia GAM generalized additive model, Lasso least absolute shrinkage and selection operator, RMSEP root mean square error of prediction, AIC Akaike information criterion, R 2 coefficient of determination 1 The reduction by infestation was determined by means of the following expression: RI = (TCHn − 1 − (RR)) − TCHn , where RI: reduction by infestation, TCH n−1: tons of sugarcane per hectare of the previous period without infestation; RR ratoon reduction, for the case of an initial cut (n = 1) to a later one (n = 2), it was assumed to be 10 TCH; y TCH n : TCH of the period affected by the infestation

Discussion
The results confirmed that the spittlebug A. varia is a pest of high economic importance in sugarcane production systems. The infestation levels in this study (means ranging from 5.1 to 7.4 nymphs per stem and 0.9 to 9.5 adults per stem) were associated with injury that caused losses up to 24.7 tons of cane per hectare, representing an approximately 17% reduction in the expected yield. The foregoing can be aggravated if there is no rapid response in control actions  (Table 1). In this case, given the state of the infestation and a delay in detecting the problem, the implementation of management measures was relying mainly on spraying chemical insecticides (Fig. 1a, b). Spatial analyses using ML tools and satellite images obtained from the GEE platform were effective to assess spittlebug population patterns associated with injury and were also instrumental to estimate final damage (i.e., effects on yield). A. varia having an aggregated spatial structure and high space dependence moves relatively quickly from the infested areas, with high impact on neighboring plants evidenced on advanced injury in the surrounding areas (Fig. 1). The latter raises the need for timely detection and early warnings to avoid severe attacks from rapid establishment and population increases (Fig. 1a, b). The space-time dynamics of spittlebugs are highly sensitive to environmental conditions (Figueredo et al. 2021), which, in the case of the study area, have been found to be highly associated with transitions from dry to rainy periods (Ramirez-Gil et al. 2021), which sets the stage for coupling population dynamics and associated environmental factors to develop early warning systems (Méndez-Vázquez et al. 2019;L. Méndez-Vázquez et al. 2022).
In relation to the above-mentioned, populations of A. varia have been found associated with early development of sugarcane, either with those stands with few months after harvest (ratoon) or those during establishment (Vargas et al. 2013), with a greater incidence and relationship with low values in the vegetation indices found in this work (Supplementary Information 1), indicating that the first 6 months of crop development are more sensitive to pest attack, aggravated by the fact that some sugarcane varieties may be more susceptible than others (Cuaran et al. 2012).
As mentioned previously, the injury of spittlebugs in sugarcane is associated with chlorosis, necrosis, portions of reddish brown spots, and in advanced stages death of the stools (Vargas and Gutiérrez 2017). This kind of injury is related to the feeding behavior of both nymphs and adults that after injecting digestive enzymes trigger adverse effects primarily on leaves and then in other tissues (Taliaferro et al. 1967). These leaf lesions cause changes in the coloration of tissues, which can be identified by means of spectral techniques, where the amount of light absorbed by the object and its respective reflectance vary depending on the degree of injury (West et al. 2003;Chaerle et al. 2009;Mahlein et al. 2012). In our results, we identify that bands blue, green, red, and near-infrared, and vegetation index have the capacity to discriminate sugarcane plants with injury by A. varia. These bands have been associated with the ability to detect changes in biochemical, color, structural, and leaf density (Vogelmann et al. 1993;Huo et al. 2021).
Negative relationships between variables of infestation and those of crop yield (i.e., TCH and NDVI) corroborated the impact of the pest on production and showed that the negative effect of the infestation is more severe when the infestation is associated with adults than with nymphs. A one-unit increase in adults per stem impacted TCH about five times more than a one-unit increase in nymphs per stem. Notably, this affectation occurred at maximum levels of 1.6 adults per stem, while the nymphs reached maximum levels of 10 nymphs per stem. In studies on Bermuda grass attacked by Prosapia bicinta (Say) (Hemiptera: Cercopidae), symptoms of pest attack were only evident when the plants were exposed to adults (Byers and Wells 1966). In sugarcane, attack symptoms are largely caused by adult attacks, 1 to 3 weeks after detection (Fewkes 1969). As greater level of injury was more associated with the presence of adults, it is important to identify how sampling of the nymphal stage can help in preventing the ulterior injury of an increased population of adults. Under a scenario where the nymphs attack is no noticeable by either plain sight or remote sensing, there would be still need of executing field samplings of nymph populations, especially on those sugarcane fields with history of previous spittlebug attacks or in those plots planted with susceptible varieties, mostly on susceptible crop stages (< 6 months). According to the above-mentioned, a general approach for IPM (Integrated pest management) of this pest could have remote sensing as complementary tool for injury assessment while filed sampling of both adults and nymphs is sustained in a continuous monitoring program over time (Hutchins 1994). In this ecological aspects need to be considered (Kuno 1991), such as pest's aggregated distribution and its rapid movement in the field, so the detection via remote sensing of areas expressing injury in the field could be an early warning to prevent further increase in the area under attack. The latter while keeping in mind the potential high impact on final biomass production, mainly driven by attacks on early stages of crop development that in the case of the fields surveyed was around 5 months of age.
The approach used on this work represents a combination of promising emerging technologies for plant health monitoring. The GEE platform proved to be suitable for quickly, accurately, and cheaply obtaining a historical series of images for the evaluated batches that, after processing and extracting products such as bands and spectral indices, were incorporated as predictive variables into different ML models and algorithms. The versatility of our approach was due to the fact that, on the one hand, the Sentinel-2 sensor provides constant and uninterrupted multispectral satellite images at high temporal frequency, high spatial resolution and global coverage (Segarra et al. 2020). For its part, GEE efficiently combines numerous remote sensing data sets and processing algorithms, which are easily accessible at multiple spatial scales (Amani et al. 2020). The combination of these emerging technologies simultaneously provides processing infrastructure, storage, algorithms, software, and generation of Internet-based application programming interface (API) and a Web-based interactive development environment, providing unprecedented opportunities to generate effective computational tools for crop monitoring.
This study is illustrating how sampling tools do not only generate information for longitudinal analysis of populations, using elements of descriptive and inferential statistics (Binns et al. 2000), but also can incorporate the space dimension using location and elements of spatial statistics. Additionally, the remote sensing and GEE platform tools used to indirectly detect injury and damage associated with pest infestation is an example of a useful application Yield (TCH)=138.7 -26.6 (adults/stem)

Fig. 7
Fit models for the prediction of biomass (TCH) and NDVI vegetation index in sugar cane associated with infestation by nymphs and adults of Aeneolamia varia. A, B, C y D: fit to a linear and second-order quadratic models of biomass production (TCH) and infestation as a function of variables with greater weight determined by the Lasso regression model. Black line: prediction limits.

Conclusion
This study aimed to characterize spatio-temporal population dynamics of the spittlebug A. varia in commercial sugarcane plots based on the proposal and validation of a digital agriculture approach in which sampling tools, spatiotemporal analyses, satellite sensing big data, and ML analytical techniques are complementary. This approach provided a good approximation for the detection of the infestation level of nymphs and adults, as well as predicting the reduction on yield caused by this pest. The results suggested that the potential impact of A. varia infestation varies in sugarcane, generating losses of ~ 17% in biomass, having that further effects on sucrose content warrant additional studies. The GEE platform proved to be an ideal emerging technology to obtain massive satellite data that facilitates characterizations of the spatiotemporal dynamics of A. varia attacks on sugarcane. The precise geolocation of the impact of attacks and the magnitude on productivity using the proposed approach generates opportunities to implement emerging management practices that optimize the use of inputs and validate management practices and time control interventions. Future studies should validate the proposed technological approach in similar and different agroecological conditions. Likewise, the agronomic analysis of the results must be complemented with a focus on optimizing crop management practices to prevent or mitigate spittlebug infestations.