Introduction

Efficient pasture production and utilisation is often the most critical component in a pasture-based dairy operation (García et al. 2014). Despite the extensive literature available about the constituent factors of pasture production and management, most pasture-based grazing systems are not optimally managed due to the costly and time-consuming nature of standard methods for pasture measurement and monitoring. In practice, coordinating pasture growth rates and grazing events is crucial to optimise yield (Chapman et al. 2012) while avoiding nutritional value losses (Turner et al. 2006).

In a farm scenario, such coordination can be achieved through weekly measurements of pasture biomass (García et al. 2014). Currently, common methods, such as the rising plate meter (RPM), rely on linear relationships between canopy height (Allen et al. 2011) and biomass. However, these relationships are limited in their accuracy and biased due to plant-development stages, canopy architecture (erectophile or plagiophile) or canopy density (Nakagami 2016). Finally, such a method requires a trained observer constantly monitoring and sampling paddocks, which can lead to inconsistencies due to observer-bias (Thomson et al. 1997) and requires extensive time investment (Hall et al. 2019).

Alternatively, remote sensing (RS) techniques can be employed to provide semi- or fully-automated monitoring, provided that the ideal temporal-spatial scales are observed. This constraint has, historically, prevented widespread adoption, as satellite revisit intervals and pixel size are not adequate for on-farm management. Furthermore, persistent cloud cover is a pervasive constraint (Kawamura et al. 2008) during seasons of rapid growth rates, hence, decreasing the utility of satellite optical imagery (Ali et al. 2016).

Unmanned aerial vehicles (UAV) and multispectral (MS) cameras (those defined as multi-camera 2D imagers in Aasen et al. (2018)), provide a flexible and economical system for spectral observation at a time and spatial-scale which are ideal for agricultural practices. As an advantage, such multispectral cameras are readily available, producing radiometric calibrated imagery which can be easily integrated into agricultural management tools. A drawback, however, is that multi-camera 2D imagers are mostly restricted to a small number of bands, usually five as per Aasen et al. (2018), within the visible to near-infrared range (VIS–NIR) of the spectrum due to their silicon-based sensors. As a consequence, end users have focused on vegetation indices (VI) as enhanced predictors for biomass estimation.

Vegetation indices have been, to an extent, successfully employed in precision agriculture (PA) and RS to estimate different biophysical and biochemical attributes of vegetation. Accordingly, a vast number of VIs have been proposed (Xue and Su 2017), surpassing the possibility of optimally or correctly utilizing each one of them without extensive domain knowledge. This issue is found in UAV remote sensing, where most commercial sensors rely on a broadband VI such as the normalized difference vegetation index (NDVI) to estimate a range of different, non-correlated attributes. Notwithstanding its versatility, estimations from a multi-purpose VI (e.g. NDVI) are, logically, not optimal for attributes different from which the index was originally designed. Issues such as NDVI saturation, occurring after a certain biomass threshold, are well-described in the literature (Thenkabail et al. 2002).

Despite the saturation drawback with the traditional NDVI, Mutanga and Skidmore (2004) have shown that, through systematic generation of normalized ratio indices (NRI) for all band pairs of a hyperspectral dataset (acquired at handheld level), certain narrowband NRIs can overcome biomass saturation. In parallel, Burkart et al. (2014) demonstrated the equivalence between spectral observations at two different data acquisition scales: handheld and low-level flight. In a more recent study, Wang et al. (2019) presented a brief review on the difference between reflectance measurements from both acquisition-scales, indicating that such differences are negligible. Such findings fulfil a methodological gap for data collection, analysis and performance validation for UAV sensors (i.e. low-level flight) extrapolated from handheld spectral information.

Presently, these approaches (i.e. filtering and selection of optimal NRIs and known VIs) have not been tested and translated to custom-made commercial UAV sensors. Hence, canopy based approaches, such as the RPM, are still favoured in pasture-based systems rather than spectroscopy methods despite the unique potential, at a low-cost, offered by the adoption of UAVs, multispectral sensors and a satisfactory small set of VIs.

Given this research-gap, current commercial multispectral cameras and data analysis (Michez et al. 2019) have reported poor performances for pasture biomass estimation, both when employing spectral-based (R2 = 0.35) or canopy based (photogrammetric estimated height) techniques (R2 = 0.23). This is in spite of the potential reported in previous research (Mutanga and Skidmore 2004) and absence of scaling-up issues (Burkart et al. 2014; Wang et al. 2019).

The research objective of this study was to evaluate how many and which VIs (i.e. features) should be employed for perennial ryegrass (Lolium perenne) biomass estimation while comparing with two traditional techniques: the rising plate meter and broadband NDVI. Consequently, accuracy improvements can be quantified based on the nature of measurements (canopy height or spectra) and number of indices. Furthermore, different regression techniques (parametric and non-parametric algorithms) were analysed, so that increments in accuracy and precision of feature(s)-algorithm pairs can be evaluated.

Methods

Experimental design

The trial was undertaken at the Tasmanian Dairy Research Facility in Elliot (TAS, Australia—41° 04′ 57.3″ S, 145° 46′ 21.8″ E). The experimental layout was an array of 30 rainfed perennial ryegrass plots (dimensions of 2.0 × 7.5 m, with 0.35 m border at each side of the plot’s longitudinal axis), arranged as two rows by 15 columns (Fig. 1).

Fig. 1
figure 1

Experimental layout employed in each of the five data collection campaigns. Background textures refer to regrowth periods. Nitrogen levels are indicated within the plot. Each block (encased by a dashed rectangle) comprises all combinations of regrowth periods and nitrogen fertilizing regimes. Plots are numbered from left-to-right and bottom to top. Within, each plot, six different sample-sites were measured both by the spectroradiometer and the rising plate meter

Plots were grouped in three main blocks (10 plots per block). Each block was split in two different growth intervals: long and short or approximately 30 and 15 days, respectively. Each plot on the split-block was randomly allocated a different nitrogen (N) fertilising regime (0, 25, 50, 75 or 100 kg N ha−1). The fertiliser was manually applied (i.e. top-dressing) on each plot at the start of each regrowth cycle, having urea as N source. Prior to spring (end of August) and prior to installing the experiment, phosphorus (P), potassium (K) and sulphur (S) were broadcast throughout the trial area according to soil analysis to ensure that the lack of macronutrients would not impede pasture growth.

Data collection campaigns consisted of three subsequent stages: (1) spectral measurement, (2) canopy height measurement and (3) biomass determination. In each of these stages, attention was given to minimize confounding factors and ensure independence amongst measures, given that measurements of stages (1) and (2) overlapped spatially (Fig. 2). In total, five campaigns were carried out from December 2016 to November 2017 (as per the dates of spectral measurements).

Fig. 2
figure 2

Data collection workflow. Stages: (1) spectral measurements (spectroradiometer); (2) canopy height measurements (rising plate meter); 3 biomass determination, (a) drying (forced-air oven), (b) weighing (digital scale). Plots (1–30) dimensions: 2.0 × 7.5 m. Sample-sites (1–6) were measured both with the spectroradiometer and the rising plate meter

Spectral measurements

Spectral data was collected by a field spectroradiometer (ASD Hand- held 2, Colorado, USA) on five dates under clear-sky conditions and around solar noon: December 18th, 2016, February 06th, April 29th, October 22nd and November 28th, 2017. This instrument acquires data from 325 to 1075 nm, with a total of 750 bands and field of view (FOV) of 25°. Total time spent to obtain all spectral measurements (180 data points) was 1.5 to 2 h per field campaign with minimum warm-up of 30 min. The instrument setup follows the manufacturer’s recommendation: 30 scans for spectrum averaging, 60 scans for dark current and white reference. The sequence of measured plots was randomized to minimise any systematic effect of solar position across the plots during data collection. In addition, after finishing measuring the samples of each plot, a spectral measurement of the white reference (Spectralon®) was recorded. The intention of this procedure was twofold: (a) to monitor the stability of the instrument and (b) detect any possible change in atmospheric conditions. The instrument was recalibrated (against the white reference) after seven minutes of continuous usage or whenever the white-reference measurement deviated from 100% reflectance, whichever occurred first. Within each plot, six randomly allocated sample-sites were selected (Fig. 2 (1)) Spectral measurements were taken from approximately one-meter height, thus, yielding a circular footprint equal to 0.15 m2 (or 0.44 m diameter). Each sample-site was measured five times. Final sample spectral value (referred to as raw spectral data) was the average value of these five measurements.

Canopy height measurements

An analogue RPM with 5 mm resolution, described in Earle and McGowan (1979), was employed to measure (compressed) canopy height, once per sample-site (Fig. 2 (2)).

Biomass determination

Pasture biomass was mechanically defoliated above a residual height of 50 mm from the 0.15 m2 footprint used in stage 1 and 2 (Fig. 2 (1) and (2)). Harvested material was dried for a minimum of 48 h at 60 °C in a forced-air oven (Fig. 2 (3a)) immediately following each harvest for pasture dry-matter (DM) determination. Samples were weighed (Fig. 2 (3b)) using a digital scale (MassCal, 30 kg ± 0.5 g).

Data analysis

For reproducibility purposes, data analysis operations are introduced by the corresponding package::function format (italics typeface and accompanied by the double colon operator, i.e. the scope resolution operator).

Data analysis was performed in RStudio/R (versions 1.14 and 3.5.1, respectively). Necessary packages for the analysis, besides the base and dependencies packages, are hsdar (Lehnert et al. 2018), caret (Kuhn 2008) and caretEnsemble (Mayer and Knowles 2015) (Fig. 3).

Fig. 3
figure 3

Data analysis workflow. Statistical packages employed are indicated on the top-left of each dashed rectangle. Feature generation process is enclosed by the top dashed rectangle while feature filtering, model fitting and performance analysis are encased in the lower dashed rectangle. Single and double asterisks indicate where train/test and validation dataset were, respectively, employed

Feature generation

The raw spectra data were smoothed using the Savitzky–Golay filter (hsdar::smoothSpeclib, window size = 9 nm, polynomial-degree = 2). This operation aims to decrease the influence of random instrument noise without distorting the original reflectance values. As a consequence, this operation improves the signal-to-noise ratio.

Vegetation indices

The smoothed spectra were transformed into a set of 97 VIs (hsdar::vegindex) available in the literature. The full list of VIs can be found in Lehnert et al. (2018) and are listed n Table 2. As a baseline approach and as a complementary step for data analysis, all VIs were fitted in a univariate (single-index) ordinary linear regression against biomass values.

Normalized ratio indices

The smoothed resampled spectra were used in an exhaustive-search process, testing all available bands combinations in a normalized difference equation, as described in Mutanga and Skidmore (2004) and shown in Eq. 1. The smoothed spectra were resampled from 750 to 149 bands given that the necessary computation time to test all band combinations would be substantial, and most likely superfluous due to high spectral band correlation. Spectra were resampled (hsdar::spectralResampling) applying a gaussian response function and 10 nm bandwidth, reducing computational load to 4% of all possible combinations. In total, 11,026 combinations were assessed.

$$ {\text{NRI}}\left( {\lambda_{1} ,\lambda_{2} } \right) = \frac{{\lambda_{1} - \lambda_{2} }}{{\lambda_{1} + \lambda_{2} }} $$
(1)

All the NRIs were then fitted in a univariate (single-index) ordinary linear regression (stats::lm) against biomass values. From these, the best fitting linear model (hsdar::nri best performance, highest R2) was identified and the best performing NRI (λ1,λ2optimised NRI) included in the pool of filtered VIs.

Feature filtering and selection

The framework for data analysis consisted of three steps: (A) filtering of highly correlated and non-significant VIs (i.e. features); (B) recursive feature elimination and feature selection (optimal and satisfactory subsets) and (C) model fitting, stacking and validation. A detailed description and rationale of the feature filtering and selection workflow is discussed in Perez-Riverol et al. (2017) and a specific case-study for perennial ryegrass is available in Alckmin et al. (2019). The training/testing set (n = 630) comprises 70% of the entire dataset and was used in a repeated k-fold cross-validation (folds = 10 and repeats = 5).

Filtering

Pearson correlation among all VIs was calculated. A maximum cut-off of |0.95|, based on a sensitivity analysis developed in Alckmin et al. (2019), was applied to identify highly correlated VIs. Such VIs were evaluated in a pair-wise fashion: the one with the largest mean correlation (i.e. correlation with all other features) was removed. A minimum Pearson correlation removal cut-off equal to |0.2| between the remaining filtered VIs and DM values was applied.

Recursive feature elimination

After the filtering process, remaining VIs were centred, scaled and a recursive feature elimination process (caret::rfe) was performed using the training set. Two subsets were identified: optimal (minimal RMSE) and the satisfactory subset, which was the smallest group of features that presented results (in training–testing stages) which were below a 10% threshold from the minimum RMSE model (optimal subset). This workflow and guidelines were presented in Kuhn and Johnson (2013) and Perez-Riverol et al. (2017).

Ranking of VIs (variable importance, caret::varImp) for each different VI subset size, at each cross- validation fold, was calculated through a random forests routine using Gini Importance, introduced in Breiman et al. (2017).

Model fitting, stacking and validation

After the satisfactory feature set was determined (i.e. Selected VIs), different regression algorithms were fitted to the data. The following models were chosen: “Bagged MARS” (Friedman 1991), “Random Forests” (Breiman 2001) and “Support Vector Machines with Polynomial Kernels” (Cortes and Vapnik 1995) and “Ordinary Linear Regression” (referred to hereafter as MARS, RF, SVM and LM, respectively). The rationale for this selection was to test models that do not share the same core technique or are simply variations of the same technique. Tuning of hyper-parameters was performed automatically at the training-test stage (caret::train) through an embedded grid-search algorithm. Model ensemble/stacking consisted of training (caretEnsemble) an algorithm to combine the predictions of other previously trained algorithms (Caruana et al. 2004). Such an approach was generated through a generalized linear model of the three non-parametric models (referred to as STACK).

Finally, models were validated against an unseen dataset, corresponding to 30% of total observations (n = 270), where it is expected that model performance is similar (or superior) to the training-test stage. For benchmarking purposes, RPM, NDVI, optimised NRI and the satisfactory subset (i.e. selected VIs) were all fitted using a linear model and underwent the Performance Analysis protocol, allowing a comparison between features.

Performance analysis

In this study, the error of each feature(s)-algorithm pair, or the difference between predicted and observed/true values (sample weight), was assessed using three different metrics: root-mean-square error (RMSE), coefficient of determination (R2) and mean absolute error (MAE). Ultimately, this analysis aimed to (i) assess the accuracy (Joint Committee for Guides in Metrology 2008) for each feature(s)-algorithm pair; (ii) to test if the accuracy differences between the feature(s)-algorithm pairs were statistically significant (p-value > 0.05); (iii) if the error-distribution (i.e. precision) of these feature(s)-algorithm pairs were statistically different (Fig. 4).

Fig. 4
figure 4

Feature(s)-algorithm pair model fitting (left-side), performance and tests workflow (right-side). Selected VIs were used as features on non-parametric models. Non-parametric models were used as features on the STACK model

For items (i) and (ii), the algorithm (caret::diff.resamples) and workflow provided in Kuhn and Johnson (2013) was employed, which are descriptions of benchmark studies mostly based in one sample t-test. To check if the error-distribution was different between feature(s)-algorithm pairs, item (iii), the Kolmorogov-Smirnov test (Massey 1951) was employed. The two-sample Kolmogorov- Smirnov (KS) test is a non-parametric analysis that compares the cumulative distributions of two datasets. In this context, it was employed using error metrics (RMSE) derived from each feature(s)-algorithm pair. The error-metrics (n = 50) were generated through a repeated k -fold cross-validation (5 folds, 10 repeats) on the training set. Feature(s)-algorithm pairs are presented in Fig. 4 under the “Model Fitting” section. A schematic diagram of this workflow is presented in Fig. 4.

Results

Experimental setup—biomass and canopy height

Pasture biomasses ranged from 164 kg DM ha−1 (minimum) to 4663 kg DM ha−1 (maximum) and a mean of 1633 kg DM ha−1. The descriptive statistics data of biomass and canopy height values, per campaign, can be found in Table 1.

Table 1 Descriptive statistics for biophysical characteristics per campaign

Feature generation

Vegetation indices

The result of a baseline approach (i.e. single index linear regression), and consequent model fit metrics, can be examined in Table 2. The optimised NRI (“Opt NRI”), retrieved within this analysis ranks as the best performing VI (Table 2) as found by Mutanga and Skidmore (2004).

Table 2 Performance by vegetation index in an ordinary linear regression

Optimised normalized ratio indices

Figure 5 presents the R2 value for each NRI (λ1, λ2 combination) and pasture biomass values. The region with the highest correlation to pasture biomass (delimited by a dashed rectangle) occurred within the far end of the red-edge region (680–730 nm) until the NIR shoulder (up to 900 nm). The best performing NRI (λ1 = 745, λ2 = 755 nm) is then added to the feature-space in which the feature selection will be performed.

Fig. 5
figure 5

Normalized Ratio Index Correlogram (λ1, λ2 and biomass). Displaying the determination coefficients between biomass and narrow band NRI values calculated from all possible combinations spread across λ1 (350–1075 nm) and λ2 (350–1075 nm). Dashed rectangle limits λ1 700–900 and λ2 700–760 nm

Feature filtering and selection

The filtering selection protocol reduced the number of features from 97 to 19 VIs. Within the selection workflow, the optimal (filled circle) and satisfactory (filled triangle) features subset were identified in the feature selection protocol (Fig. 6). By this protocol, the number of VIs necessary to fulfil this constraint (i.e. 10% threshold above minimum RMSE) was equal to four and are listed in Table 3 as well as indicated in boldface in Table 2. Complementary, the optimal model, or the model with lowest RMSE value, includes all 20 VIs (i.e. filtered VIs plus the optimized NRI). However, the increments in performance are marginal from six features onward (Fig. 6).

Fig. 6
figure 6

Feature selection protocol. Average RMSE by number of VIs (features). Optimal (filled circle) and satisfactory (filled triangle) subsets have 20 and four VIs, respectively. Optimal subset corresponds to the lowest average RMSE found through the feature selection process. The satisfactory subset corresponds to the minimal number of features (i.e. VIs) below a ten percent tolerance threshold over the minimal RMSE

Table 3 Selected vegetation indices (satisfactory subset), variable importance, formulas and references

Variable importance

The four selected VIs (satisfactory subset) used as features for different regression techniques are: the optimised NRI, Chlorophyll Index, Simple Ratio 7 and Double Difference near the red-edge. Variable importance, extracted as per the description in Methods, respective formulas and bibliographical references are listed in Table 3.

Performance analysis

From the repeated k-fold cross-validation, the error metrics for each feature(s)-algorithm pair are summarised in Table 4. It presents, in descending order, the feature(s)-algorithm pair performances. Results range from (max) 770.6 to (min) 325.3 kg DM ha−1 (RMSE) from NDVI and SVM, respectively. In terms of precision, the standard-deviation (SD) ranges from 59.5 to 15.6 kg DM ha−1 (RMSE).

Table 4 Posterior analysis results for each feature(s)-algorithm pair

The performance of RF and STACK models are equivalent in terms of average accuracy. However, the STACK model is more precise with a standard-deviation of 15.6 kg DM ha−1 in contrast to 41.7 kg DM ha−1 of the RF model. Such difference in error-distribution is statistically significant (p-value < 0.01) as per the results of a two-sample KS-test results, displaying narrower error-distribution for the STACK model (Fig. 7b).

Fig. 7
figure 7

Density plot of the error-distribution RMSE for each model a different features fitted to a linear model. b Selected VIs (features) fitted using non-parametric methods

Figure 7 presents the density plot of error-distribution of each feature(s)-algorithm pair, from which a visual assessment of accuracy and precision can be extracted. This figure is divided into a density plot where only an ordinary linear regression (parametric) model was employed (Fig. 7a) and models where different non-parametric models and the selected VIs were employed (Fig. 7b).

The visual analysis provided a clear indication of the higher precision of the STACK model in comparison with other methods. In the interest of avoiding similar figures, only the RMSE density plot will be presented as the same information for MAE and R2 are also presented in Fig. 8. The ranking and performance (average error metrics and error-distribution) of each feature(s)-algorithm pair is presented in Fig. 8.

Fig. 8
figure 8

Error distribution presented through boxplots. The four top boxplots refer to non-parametric models. Conversely, the four boxplots on the bottom relate to ordinary linear models. The STACK model presents a narrower error-distribution

The results of the benchmark methodology developed by Kuhn and Johnson (2013) are presented in Table 5. With regard to accuracy differences, the best performing feature(s)-algorithm pair is the STACK of models with an average RMSE = 405.8 and MAE = 310.7 kg DM ha−1. The lowest performing feature-algorithm pair is NDVI based on an ordinary linear model. Likewise, the saturation of NDVI becomes noticeable at around 2.500 kg DM ha−1, whereas the optimised NRI does not present the same pattern (Fig. 9). As a consequence, the maximum RMSE difference between feature(s)-algorithm pairs is between the STACK models and NDVI (231.1 kg DM ha−1). Additionally, the optimised NRI has, on average, a (difference) in RMSE of 111.1 kg DM ha−1 smaller than the NDVI (Table 5).

Table 5 Posterior Analysis—RMSE, MAE and R2 for feature(s)-algorithm pairs
Fig. 9
figure 9

Scatterplots of predicted vs observed data of all feature(s)-algorithm pairs on the validation dataset (n = 270). The first row presents different features using an ordinary linear regression (parametric-regression). The second row uses the four selected VIs (i.e. satisfactory subset) in different (non-parametric) regression algorithms. Noticeably, the NDVI (top-left) scatterplot illustrates its saturation issue at around 2.500 kg DM ha−1. Scatterplots have been arranged according to accuracy improvements and follow a left-to-right and top-to-bottom order

Concerning error-distribution (precision), most noticeably the results of the t-tests (lower diagonal—Table 5), indicate that the average inaccuracy (RMSE and MAE) of (i) the STACK and RF models and (ii) RPM and optimised NRI are not statistically different (Table 5). Given such, it follows to check if each of these two pairs, (i) RPM/optimised NRI and (ii) RF/STACK, share equivalent error-distributions.

The p-value (0.717) for the two-sample KS-test indicates that the error-distribution between the RPM and the optimised NRI are not statistically different. These results indicates that both accuracy (Table 5t-test diagonal RMSE) and precision (KS-test) of the RPM and optimised NRI are equivalent.

In contrast, for the STACK and the RF algorithm, the KS-test p-value was 0.01. That indicates that both (error) distributions are unlikely (p-value < 0.05) to share the same distribution. In practical terms, these post-hoc tests indicate that the use of the STACK model provides more precise estimates, making it a more stable model, given that the standard-deviation of the error metrics (Table 4) for the STACK model is smaller than any other feature(s)-algorithm pair (Fig. 8).

Validation results, in which trained-tested models are applied to an unseen (validation) dataset, are presented in Fig. 9. The first row of plots presents predicted vs. observed results for different features in ordinary linear regressions (parametric-model). The second row presents the performance of selected VIs in different non-parametric regression algorithms.

Discussion

This study successfully selected and validated a subset of (four) VIs and regression technique (STACK) that largely outperforms the conventional method (RPM) for biomass estimation by an average RMSE of 116 kg DM ha−1 (Table 5).

This analysis has also shown that, in comparison to the most commonly adopted method on precision agriculture (i.e. standard NDVI), the best method substantially decreases the average inaccuracy and error-distribution (RMSE ± SD) of biomass estimation from 637 ± 60 to 406 ± 16 kg DM ha−1 (Table 4), while eliminating its characteristic saturation issue (Fig. 9—NDVI). In retrospect, given the worse performance of NDVI in comparison with the RPM ( RMSE = 115 kg DM ha−1), it is clear why end users have not adopted NDVI over RPM measurements.

The approach of an exhaustive-search of an optimised NRI identified a band combination that, both in terms of accuracy and precision, is equivalent to the RPM (Table 5). The performance of this particular VI (optimised NRI λ1 = 745, λ2 = 755 nm) is in-line with the results reported by Mutanga and Skidmore (2004). Equivalent performance and similar spectral band combinations were also presented in Cho et al. (2007), who reported that such combination overcomes the saturation issue presented by the standard NDVI (Fig. 9—optimised NRI).

An additional set of 97 other VIs was tested and compared based on a linear regression performance (Table 2): all were suboptimal in performance when compared to the optimised NRI. However, it is important to remark that the performances of these 97 VIs as predictors may be handicapped by a poor fit (Table 2), as the best fit function between a VI and biomass is not necessarily linear. Yet, the analysis demonstrated that the VIs extracted from the literature present a high degree of multicollinearity. The process of filtering eliminated 80.4% (78 out of 97) of features (VIs): a strong indication of redundancy and limitation of VIs. Also, the selection protocol indicated that models with more than four VIs only yield marginal accuracy improvements (Fig. 6).

This analysis also quantified the improvement in error-metrics when employing non-parametric models in comparison to a parametric model (i.e. ordinary linear regression—LM). On average, the use of ordinary linear regression, with the same selected VIs, presented an additional 38.3, 51.6, 67.6 and 79.9 kg DM ha−1 error (RMSE) compared to MARS, SVM, RF and STACK models, respectively. It is important to highlight that such improvement in performance is due exclusively to model selection and tuning, improving results without requiring additional features.

While there are accuracy improvements based solely on the choice of employing either para- metric or non-parametric models (e.g. STACK − LM =  79.9 kg DM ha−1), such improvements are not so substantial within the group of non-parametric models. The largest difference between non-parametric models (RF − MARS) was equal to 29.3 kg DM ha−1. Overall, the largest improvements in performance are due to a better selection of features (e.g. LM—NDVI =  − 151.2) rather than a better performing regression algorithm. This highlights the importance of feature selection and feature engineering (e.g. optimised NRI process).

It was not surprising that both the RF and STACK techniques share the same average accuracy. Random forests employs the technique of bootstrapping the dataset and aggregating results (known as bagging). The same core concept is used in stacked models. However, STACK models present a better precision level, given that their output is an ensemble of other models (not features).

In essence, this analysis has shown that the use of a small feature set of VIs coupled with non- parametric methods enhances the accuracy and precision of pasture biomass estimation. Such VIs are located in the region of the red-edge and NIR-shoulder (Fig. 10), regions which have been identified as important for chlorophyll and/or leaf area index (LAI) assessment (Darvishzadeh et al. 2008; Delegido et al. 2011) as well as for pasture biomass estimation (Clevers et al. 2007).

Fig. 10
figure 10

Selected VIs through the feature selection workflow. In the background, the mean (solid line), upper and lower boundaries (± SD, dashed line) of all spectral observations. Different line styles identify each selected VI. All VIs have at least one band within the regions of chlorophyll absorption, the red-edge and the NIR-shoulder

It is reasonable to hypothesize that the selection of VIs around the red-edge could be linked with an underlying physiological process as canopies with higher levels of biomass will present higher LAI and chlorophyll mass (drivers of shifts on the red-edge). Previous studies have reported similar findings: Horler et al. (1983) credits the red-edge shift, centred at around 740 nm, to leaf stacking, thus, showing a causal link with biomass. Tucker (1977) and Collins (1978), indicated the same spectral region (740 nm) and behaviour when estimating attributes linked to biomass. Guyot and Baret (1988) examined the effects of higher chlorophyll content and LAI on the spectral behaviour around the red-edge. Finally, Mokvist et al. (2014) discussed a wider range of the limits of the far-red spectral absorption, demonstrating a plasticity for light absorption at higher biomass levels, indicating a characteristic which might overcome saturation.

Given the similar results and the mechanistic link explored by these previous studies, the approach and findings become more robust and, most likely, generalisable to other locations or contexts.

The plateau of 430 kg DM ha−1 (RMSE) is equivalent to approximately 6.5 g per sample area (0.15 m2). Thus, a fraction of this error may well be due to the sampling/harvesting technique or even a drying process that, despite best efforts, has an inherent random noise. It is reasonable, therefore, to assume that this methodology was able to explain most of the variance which could be mapped by VIs. Further performance improvements in error metrics are mostly likely due to overfit or noise-fitting. As illustrated in Fig. 6, the addition of extra VIs will only yield marginal improvements. Under normal field-conditions, it seems unlikely that measurement error could be decreased further.

The ranges presented in Table 1 are in-line with growth rates, canopy architecture and height displayed in different seasons as per the intrinsic ecophysiological characteristics of cool-season grasses (Christie et al. 2018). Given the significant duration and number of data collection campaigns, coupled with the consequent different plant development stages, this dataset should provide an adequate surrogate for on-farm pasture conditions.

As spectral data is highly correlated, it is unlikely that the relationships found would not perform equally well whenever employed in perennial ryegrass pastures displaying the same range of biophysical and biochemical status to the sample site in this study, granted that no major factor which could influence spectral response is introduced (e.g. changing light/atmospheric conditions or soil-background at low biomass levels). This indicates a satisfactory approach to estimate a wide range of pasture biomass values throughout the year with a small number of VIs, which could be implemented on a UAV mounted sensor.

Conclusions

In this study, the accuracy and precision of vegetation indices and machine learning methods for biomass assessment were examined, while featuring the rising plate meter (RPM) as benchmark for pasture management. This study was able to present, test and evaluate an optimal method to select and employ spectral vegetation indices (VIs) for biomass assessment of perennial ryegrass under different fertilisation, regrowth periods and seasonal effects. Findings are in-line with previous results reported in the literature, while the best method largely outperforms standard practices (i.e. RPM) as well as known VIs.

The study has shown that the optimised NRI is equivalent to the RPM while providing a firm analytical ground for substitution between these techniques. Also, it is clear that reflectance can outperform canopy height as a feature for biomass estimation provided that the optimal spectral bands and methods are employed. Furthermore, multispectral systems can be employed in a multitude of data acquisition platforms, providing estimates at near or real-time.

This analysis could lead to the development of a multispectral UAV-mounted sensor with known added-value in relation to current on-farm practices and precision agriculture methods. Overall, the comparison of achievable accuracies by employing either canopy height or reflectance (through VIs) can indicate the most appropriated path when developing sensors and establishing optimal sensing techniques (either spectral or canopy based) for assessing pasture biomass.

Given the agreement of these findings with previous research, the best method (STACK and selected VIs, as feature(s)-algorithm pair) and validation results (i.e. assessed against an unseen validation set) should be transferable from this study site to other locations. Furthermore, from an end-user perspective, the accuracy and precision achieved is sufficient to efficient pasture-management.

Given the high accuracy achieved, further improvements to this methodology will necessarily have to take into account the accuracy of reference measurements and data-collection protocols. However, given the unsatisfactory results of current VI methods, a new UAV sensor (with a small number of bands) should benefit from the results and methods presented here. Finally, this study also indicated that the most well-known VI (NDVI) is not optimal for biomass estimation. It also indicated that better performing VIs which could be measured from a UAV-mounted multispectral sensor, which would allow mapping and monitoring of pasture biomass with high accuracy at very high spatial resolution and with complete coverage of management areas.