1 Introduction

Data-driven surrogate models are representations of the response surface of complex simulation models. They are estimated from a sample of input-output mappings simulated with the original model (van der Hoog, 2019; Kleijnen, 2017; Asher et al., 2015).

Data-driven surrogate modeling can considerably enhance the applicability of micro-scale (farm-level) agricultural economic models, such as whole-farm and agent-based models. Farm-level models (FLM) simulate the management decisions of a collection of individual farm holdings, allowing for detail and complexity, inter alia in representing heterogeneous farming conditions, preferences, risk behavior and technology. They are well-established tools to simulate fundamental changes in farm production and resource use choices in response to regime-shifting drivers, such as climate change, technological innovation or policy intervention (Kremmydas et al., 2018; Reidsma et al., 2018; Troost & Berger, 2015; Berger & Troost, 2014; van Wijk et al., 2014).

Recently, there has been an increased interest in linking such detailed micro-scale behavioral models into macro-scale integrated assessment models (Brown et al., 2021; Müller et al., 2020; Lippe et al., 2019; Müller-Hansen et al., 2017; van Wijk, 2014). This would, on the one hand, enhance global assessments with high-resolution models of complex and heterogeneous decision-making—overcoming overly restrictive behavioral assumptions for which current aggregate macro-models have been criticized. On the other hand, it would address the major constraint of micro-scale models: Even when implemented for all farm holdings in a study area, they are usually confined to smaller spatial extents and cannot simulate larger-scale feedback to local decisions. In economic terms, this implies a strict small-country assumption applied not only to product and resource markets, but also to global environmental effects. Neglecting such global feedback may result in biased simulation outcomes and misleading conclusions (Müller et al., 2020; Müller-Hansen et al., 2017).

Data-driven surrogate models can help overcome three challenges associated with the incorporation of micro-scale models into large-scale assessments: (i) Computational time demands of micro-scale models for a substantial number of farm agents are often prohibitive for frequent iterative reevaluations required for equilibrium search in a macro-scale model. (ii) Integrating micro-scale models that have been developed by different specialized research groups for different small-scale locations into a large-scale framework is often hampered by heterogeneous use of modeling methods and software between research groups. (iii) Micro-scale models often employ sensitive, privacy-constrained micro data that cannot simply be shared for direct model reuse by other research groups (Troost & Berger, 2015).

A statistical surrogate model that has been estimated from representative output of a farm-level model can stand in for (emulate) the original farm-level model, where the use of the latter is not possible: A surrogate is typically computationally much cheaper than the original model, can serve as a unified interface to the response of heterogeneous model implementations, and isolates the response surface from privacy-constrained microdata. In addition, surrogate models also facilitate model optimization, uncertainty and sensitivity analysis, calibration and interactive model exploration of micro-scale models (Mössinger et al., 2022; van der Hoog, 2019; Lamperti et al., 2018; Kleijnen, 2017; Baustert & Benetto, 2017; Asher et al., 2015).

The usefulness of surrogate modeling of farm-level models has been exemplified by Happe et al. (2006), Domínguez et al. (2009), Lengers et al. (2014) and Seidel & Britz (2019). While these previous applications used classical econometric models, van der Hoog (2019) and Storm et al. (2020) suggest the use of machine-learning methods for surrogate model estimation, as demonstrated e.g. by Lamperti et al. (2018) for financial economic applications.

In this article, we systematically evaluate the capacity of four different surrogate modeling approaches, including machine-learning methods, to capture the aggregate crop area response to price variation as simulated by a farm-level model. We address two important methodological gaps in the nascent literature on surrogate modeling of micro-scale agricultural economic models:

Firstly, we demonstrate a sequential sampling and evaluation design to ensure robust and efficient estimation of surrogate models from a limited sample of agricultural economic model outputs. We combine a quasi-random low discrepancy sequence with convergence and stability assessments to address the inevitable trade-off between a comprehensive coverage of the variation in original model output and the computational cost of original model evaluations implied by larger sample sizes.

Secondly, part of the micro-scale model output is not continuous, but compositional: Land use and crop areas, which are essential indicators for many environmental and economic assessments, must sum up to the total available land in the area. At the same time, response to external drivers is highly complex and characterized by variable interactions and nonlinearities. To capture this compositional response, we combine the isometric log-ratio transformation (ilr) suggested by Egozcue et al. (2003) with multivariate adaptive regression splines (MARS) (Friedman, 1991), random forest regression (RF) (Breiman, 2001), and extreme gradient boosting (XGB) (Chen & Guestrin, 2016). We compare these methods with multinomial-logistic regression (MNL) as one of the classical statistical methods for categorical and compositional data analysis.

2 Data and Methods

Figure 1 summarizes the basic concept underlying the use of surrogate models to interpolate or predict the simulation output of micro-scale models: A farm-level model (FLM) provides a theory-based representation of farm-economic decisions, which allows for simulation analysis of structural changes and regime shifts in the socioeconomic and biophysical conditions of farming. Simulating a sample of model input-output combinations provides a database for estimating a surrogate model (SM). This surrogate model predicts the FLM output for input combinations that have not been simulated yet, based on the estimated statistical relationships between FLM model inputs and outputs in the sample. In this way, the SM allows for quick interpolation of FLM results for uncertainty analysis, iteration with larger-scale models or interactive result exploration.

Fig. 1 The basic concept: Using a surrogate model to interpolate or emulate output of a farm-level model

We first describe the FLM that we use as a test case in this article and the model input and output variables that we varied. We then explain our evaluation criteria and the experimental design we employed for comparing different surrogate modeling methods. In the last part of this section, we present the four different surrogate modeling approaches we examined.

2.1 The Farm-Level Model

Our analysis used the MPMAS Central Swabian Jura (MPMAS-CSJ) model that was developed by Troost & Berger (2015) to simulate agricultural adaptation to climate change in the Central Swabian Jura, a low mountainous area in Southwest Germany. MPMAS-CSJ simulates the production decisions of all full-time farmers in the study area by solving a mixed integer programming (MIP) problem for each farm agent, allocating the production factors (land, labor and capital) such that expected farm income is maximized while respecting the agent-specific resource constraints and production options, sales and input prices, and the technical and agronomic constraints governing agricultural production. MPMAS-CSJ represents about 530 farm holdings with about 36,000 ha of agricultural land, of which about 22,000 ha are arable. It covers the full range of activities of the mixed crop-livestock farming systems in the area, such as crop production, dairy farming, bull fattening, pig production and biogas production, including the necessary investments in machinery and buildings and participation in agri-environmental measures. The spatial resolution of input data provided to the model is 1 ha, but agents can allocate crop production in arbitrarily small units. For an in-depth description of model equations, empirical parameterization, validation and uncertainty testing, please refer to Troost & Berger (2015) and Troost et al. (2015).
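To give a flavor of the optimization each agent solves, consider the following deliberately simplified sketch, which allocates land in a toy three-crop farm using a continuous linear program. MPMAS-CSJ solves a far richer mixed integer program per farm agent; all numbers below are invented for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

# Invented numbers: expected gross margins (EUR/ha) and peak-season labor
# requirements (h/ha) for three crops on a 40 ha farm with 420 h of labor.
margins = np.array([450.0, 380.0, 520.0])
labor = np.array([10.0, 6.0, 14.0])
res = linprog(c=-margins,                # linprog minimizes, so negate income
              A_ub=[np.ones(3), labor],  # land and labor constraints
              b_ub=[40.0, 420.0],
              bounds=[(0, None)] * 3)
print(res.x)                             # hectares allocated to each crop
```

Even in this toy case, the optimal plan is a crop mix determined by the binding land and labor constraints; with agent-specific constraints, integer investment choices, and interdependent activities, the response of the optimal mix to prices becomes far more complex.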

In MPMAS-CSJ, the interplay of agronomic constraints, soil-specific yields, crop utilization options (sale, forage, fermenter feedstock), agri-environmental support schemes, flexible allocation of machinery to field work, and flexible feed composition in animal nutrition leads to complex, heterogeneous and interdependent profitability profiles for the different crops: On the one hand, production options may complement each other, for example, as compatible or even mutually beneficial members of crop rotations, by providing complementary nutrients for animal feeding, by requiring field work at different points of time in the year, or by requiring the same type of agricultural equipment. On the other hand, they may compete with each other if they are incompatible in rotation, are substitutes in animal nutrition, coincide in peak labor demand or require very different agricultural equipment. This means that changes in prices, yields, timing or cost of one crop may potentially affect the relative profitability of other crops, and that these effects may considerably differ between farm holdings and at different price levels. The response surface representing the effects of price expectations on crop area choice was hence expected to be strongly nonlinear and characterized by structural breaks and segmentation.

2.2 Input and Output Variables

In the present article, we focus on the simulation of regional totals of crop areas on arable land as the output of interest, as would be relevant for integration into a large-scale assessment model. Regional totals result from the aggregation of simulated choices of the individual farm holdings. Nine crop categories were considered for arable production: winter wheat, spring malting barley, winter barley, spring fodder barley, winter rapeseed, silage maize, green winter wheat (for silage), clover/grass production on arable land, and fallow. Accordingly, for each simulation run we obtained \(J = 9\) output variables indicating the amount of area that was simulated to fall into the respective crop category (\(j = 1,\dots ,J\)) in that run.

With respect to model inputs, we focused on variations in product and input price expectations used by farm agents for production planning. Since we simulated farm agent decisions for one average year for the present analysis, our MPMAS-CSJ application does not endogenously simulate the formation of expectations: price expectations are set exogenously. We varied price expectations over the full range of potential price combinations for crops, animal products and important inputs and applied these to the baseline climate scenario of Troost & Berger (2015). Variation in the price for good g was expressed as coefficients (\({ pc }_g\)) relative to the 2000-2009 price average (\({{\bar{P}}}_g\)). For the simulations, we extended the ranges observed between 2000 and 2009 by about 20-30% at both ends, to capture potentially more extreme price relations in the future (cf. Table 1). As our simulations are intended to provide the basis for the estimation of surrogate models, we are interested in efficiently attributing changes in model outputs to changes in model inputs and their interaction. To avoid confounding of input factor effects, we varied price expectations for the individual items independently of each other, ignoring the existing correlations. This means that the resulting distribution of model outcomes should not be interpreted as a probability distribution for the crop choice. (A probability distribution for crop areas could be generated from the output sample by weighting each simulation run with the joint probability of input factor values employed in that run or, alternatively, by later applying the estimated surrogate model to a sample of correlated price factors.)
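That is, in this notation the price expectation supplied to the farm agents for good \(g\) is simply the scaled historical average:

$$\begin{aligned} P_g = { pc }_g \cdot {{\bar{P}}}_g \end{aligned}$$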

In addition to price variations, we also included variations of uncertain model parameters in our analysis: Farm-level models, such as MPMAS-CSJ, that are based on theoretical and empirical process knowledge are typically subject to considerable parameter and input uncertainty (Buysse et al., 2007; Troost & Berger, 2015). Good modeling practice requires clearly communicating the resulting uncertainty to readers and analyzing it in order to assess the robustness of results and, in the long run, improve process understanding (Jakeman et al., 2006). As Berger & Troost (2014) argue, one should therefore refrain from identifying a single parameter combination that best fits observation data. For MPMAS-CSJ, Troost & Berger (2015) reduced the model parameter space only where clearly superior settings could be determined in specially designed calibration experiments that tested parameter combinations against three structurally different observation years. They then used an elementary effects screening (Campolongo et al., 2007) to determine the 11 of the 19 unfixed parameters that caused the greatest variance in simulated differences between climate and price scenarios (cf. Table 1). As a consequence, MPMAS-CSJ had to be solved repeatedly to cover the space spanned by these 11 parameters, and model results were communicated as ranges over this parameter space.

Table 1 Price coefficients used for price variation in the experimental design and unfixed parameters representing uncertainty in MPMAS-CSJ

With 11 price expectation factors and 11 uncertain model parameters, we obtained \(L = 22\) input variables overall. Accordingly, the model input-output dataset that formed the basis for our surrogate model estimation consisted of \(J + L = 31\) variables (or columns), with each row corresponding to one simulation run. The number of simulation runs (and hence rows) and the choice of input factor settings for each row are discussed in the following section.

2.3 Experimental Design and Evaluation Criteria

We assumed that the main objective of using a surrogate model is to predict simulation model output outside of the sample of input combinations simulated with the original FLM.

2.3.1 Sequential Experimental Design

As the form of the multivariate model output distribution is typically unknown and does not follow common parametric distributions, the sample size (\(N_{ estim }\)) necessary to estimate robust and unbiased surrogate models cannot be calculated a priori (Lee et al., 2015). At the same time, the number of original model evaluations needed to generate the sample is an important determinant of the computational burden of creating a surrogate model, \(t_{{{ gen },{ SM }}}\) (Eq. 1), which consists of the time needed to estimate the surrogate model from a given FLM input-output dataset, \(t_{{{ estim },{ SM }}}(N_{ estim })\), plus the time needed to create this sample, which is the product of \(N_{ estim }\) and the runtime of the original model, \(t_{{{ sim },{ FLM }}}\).

$$\begin{aligned} t_{{ gen },{ SM }} = t_{{{ estim },{ SM }}}(N_{ estim }) + t_{{{ sim },{ FLM }}} \cdot N_{ estim } \end{aligned}$$
(1)

Hence, on the one hand, modelers will in practice try to keep \(N_{ estim }\) as small as possible. On the other hand, a smaller sample will inevitably lead to higher sampling error and stronger confounding of input factor effects, and increase the danger of overfitting and unstable surrogate model estimates. Specifying a value for \(N_{ estim }\) that is generally sufficient for all FLM and SM combinations is hardly possible, as the complexity of FLM responses differs considerably from application to application; a suitable \(N_{ estim }\) therefore has to be identified specifically for each FLM-SM combination.

A pragmatic procedure is to gradually increase the number of FLM simulation runs in batches. One can then estimate the surrogate model on the batches simulated so far and subsequently measure the marginal improvement in expected predictive performance and the stability of the estimated models in a separate validation sample (cross-validation). One can stop once sufficient predictive accuracy and model stability have been achieved, no further improvement is visible, or a computational resource constraint on the total number of FLM evaluations has been reached.

In this article, we implemented such an incremental procedure based on a Sobol’ sequence over \(L = 22\) input factors. Sobol’ sequences are S-optimal experimental designs that, like Latin hypercube samples (LHS; e.g. Salle & Yıldızoğlu, 2014), ensure representative coverage of a parameter space when computational requirements limit the number of simulation runs. Their advantage over LHS is that they are more easily extended or reduced in size, because the location of design points in the multidimensional parameter space remains the same for different sequence lengths. This means that extending a Sobol’ sample just requires simulating the additional repetitions, whereas a larger LHS requires a complete resimulation of all repetitions at the new sample size (Tarantola et al., 2012).

Our incremental procedure started by using the first 500 elements of the Sobol’ sequence as a training sample (TS) and the following 200 as a cross-method validation sample (VS1). (Note: We use the term training sample here to denote the full sample size provided to each surrogate modeling method at a certain iteration. Nonetheless, each surrogate modeling method may treat part of this full training sample as an intra-method training sample and part of it as an intra-method validation sample.)

We successively increased the TS size in steps (750, 1000, 1500, 2000, 2500, 3000 and 4000 design points) until including the first 5000 design points of the Sobol’ sequence, yielding nine TS sizes overall. The 200 design points following each TS were used as the cross-method validation sample VS1. At each iteration, we used VS1 to assess the convergence of the predictive performance as well as the stability of predictions and surrogate model structure compared with previous estimations.
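The following sketch illustrates this sequential design using SciPy's Sobol’ generator; the FLM simulation and surrogate estimation steps are placeholders, not the actual MPMAS-CSJ interface.

```python
from scipy.stats import qmc

L = 22                                   # number of input factors
sobol = qmc.Sobol(d=L, scramble=False)   # unscrambled: design points are fixed, so
                                         # extending the sample never relocates them
ts_sizes = [500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000]
points = sobol.random(max(ts_sizes) + 200)   # SciPy warns for non-power-of-two n
# design = qmc.scale(points, lower, upper)   # map [0, 1)^L to the factor ranges

for s in ts_sizes:
    TS, VS1 = points[:s], points[s:s + 200]
    # Placeholders: simulate the FLM only for design points not yet simulated,
    # estimate the surrogate on TS, evaluate accuracy and stability on VS1,
    # and stop once performance and stability have converged.
```

Because the unscrambled sequence is nested, each larger TS reuses all previously simulated runs, so only the newly added design points ever need to be evaluated with the FLM.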

2.3.2 Expected Predictive Performance

As described above, the outcome for each simulation run k consisted of a vector of \(J=9\) crop categories, into which the total arable area was classified. The accuracy of the surrogate model prediction for the distribution of the area over these nine categories was measured by the share of correctly classified simulated area for this run k:

$$\begin{aligned} { Scc }_k = 1 - \frac{\sum _j |{{ AreaSM }}_{j,k} - { AreaFLM }_{j,k} |}{2 \cdot \sum _j { AreaFLM }_{j,k}} \end{aligned}$$
(2)

with \({ AreaSM }_{j,k}\) the area in crop category j predicted by the surrogate model and \({ AreaFLM }_{j,k}\) the actual area for crop category j simulated by the original FLM.

We then assessed the predictive performance for a full input-output sample of K simulation runs by calculating the average and also the worst-case (minimum) \({ Scc }\) over the whole sample.

$$\begin{aligned} { avgScc }&= \frac{\sum _k^{K} { Scc }_k}{K} \end{aligned}$$
(3)
$$\begin{aligned} { minScc }&= \min \{ { Scc }_1, \dots , { Scc }_{K}\} \end{aligned}$$
(4)
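For illustration, Eqs. 2-4 translate directly into a few lines of NumPy (the array names are ours, not from the original implementation):

```python
import numpy as np

def scc(area_sm, area_flm):
    """Share of correctly classified area (Eq. 2) for one simulation run;
    area_sm and area_flm are length-J arrays of predicted and simulated areas."""
    return 1.0 - np.abs(area_sm - area_flm).sum() / (2.0 * area_flm.sum())

# Aggregation over a sample of K runs (Eqs. 3 and 4):
# scores = np.array([scc(pred[k], actual[k]) for k in range(K)])
# avg_scc, min_scc = scores.mean(), scores.min()
```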

2.3.3 Stability of Predictions and Feature Importance

It is important to assess not only the development of the predictive accuracy of the model, but also its stability: Does the prediction for an input combination k vary depending on whether the SM has been estimated from a larger or a smaller sample? The same type of model with the same functional form and hyperparameter settings, once estimated from a small training sample and once estimated from a slightly larger one, might predict very different outcomes and show different deviations from perfect prediction, even if both versions have a similar goodness-of-fit (as long as the fit is not perfect). And even if the predicted outcome is stable: Is the influence attributed to a certain input factor stable? Surrogate model predictions or input factor importance rankings that fluctuate strongly across training sample sizes indicate a strong influence of the training sample and hence likely overfitting; they do not inspire confidence in the use of the surrogate model for sensitivity analysis.

To assess the stability of predictions, we performed a pairwise comparison of the predictions for each k in VS1 between the model estimated from a certain TS and the same model estimated from the previous TS. As our dependent variable is a composition, we used Aitchison’s total variance (Pawlowsky-Glahn & Egozcue, 2001) to measure the variation in prediction between the models estimated from adjacent TS at each k. We then plotted the distribution of total variances over the K sample points in VS1. (A more detailed explanation, including formulas, is provided in Appendix Sect. B.) We expect the differences in prediction caused by the information added through increasing the sample size (and hence also the variance between adjacent sample sizes) to decline with increasing sample size, as confounding between input factor effects reduces with increasing length of the Sobol’ sequence.
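For reference, a minimal NumPy sketch of this stability measure, using the standard definition of Aitchison’s total variance as the sum of the variances of the clr coordinates (the exact formulation we used is given in Appendix Sect. B):

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of compositions (rows)."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

def total_variance(comps):
    """Aitchison's total variance of a set of compositions: the sum of the
    variances of their clr coordinates. For the pairwise stability check,
    comps holds the two predictions for the same VS1 point obtained from
    models estimated on adjacent TS sizes."""
    return clr(comps).var(axis=0).sum()
```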

To assess the stability of input factor influence in a way comparable across methods, we followed the permutation feature importance approach of Breiman (2001): to calculate the importance of input factor l in estimated model m, we randomly permuted the values of this input factor across simulation runs to break the relationship of this factor with the outputs, and re-evaluated the accuracy of m on this permuted sample. The difference between the resulting model accuracy and the original model accuracy is regarded as a measure of input factor importance. Input factors were then ranked by importance and the ranking subjected to a factorwise comparison with the ranking obtained at the previous, smaller TS.
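In Python, this procedure corresponds to scikit-learn’s permutation_importance; the following self-contained sketch uses invented toy data and a random forest stand-in rather than our actual surrogates:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                     # toy input factors
Y = np.column_stack([X[:, 0] * X[:, 1], X[:, 2]])  # toy multivariate response
model = RandomForestRegressor(random_state=0).fit(X, Y)

# Permuting one column of X breaks its relation to Y; the resulting drop in
# model score is that factor's importance.
result = permutation_importance(model, X, Y, n_repeats=10, random_state=0)
ranking = np.argsort(-result.importances_mean)     # factors ranked by importance
# Stability check: compare `ranking` with the ranking from the previous TS size.
```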

2.3.4 Meta-Evaluation of Sequential Approach

While the previous steps would be performed in any practical application of the approach, for this article we took an additional step to evaluate the performance of the whole workflow: We evaluated the performance of the surrogate models that were selected based on TS and VS1 in an additional 5000-element validation sample (VS2) corresponding to design points 5001-10000 of the Sobol’ sequence. In practice, we do not expect modelers to generate a second VS with the original FLM. Rather, it here represents input factor combinations for which modelers expect to receive a good prediction from the surrogate model without having simulated them with the FLM. With this meta-validation on VS2, we evaluated to what extent our iterative estimation and validation strategy (employing TS and VS1) had generated a generalizable surrogate model that provides robust interpolation for a wider sample.

2.4 Surrogate Modeling Methods

We explored the capacity of four different modeling methods to capture the compositional output of the farm-level model.

2.4.1 Multinomial-Logistic Regression

As a classical regression method, we employed multinomial-logistic regression (MNL), more specifically a baseline-category logistic model with observation-specific regressors and grouped data (Agresti, 2013)—making use of the fact that under certain simplifying assumptions, compositional output data such as simulated crop areas can be understood as multinomial data.

Estimation of a classical MNL model on our dataset was complicated by strong interaction and segmentation effects in the input-output mapping between prices and crop areas simulated by the FLM. Mutual price relations, e.g. the ratios between product sales prices and production input prices and the sales price ratios between different products, have a strong influence on crop choice. Effects of prices are often nonlinear and can be subject to breaks or shifts, i.e. show a segmented response: For example, there may be a specific ratio of their respective price expectations at which one crop becomes more profitable than another for a significant group of farm agents. Crop areas respond very differently to price variation above or below such a threshold ratio. The location of such segmentation points is not known a priori but must be determined during the analysis. Automatic backward and forward selection of regressors and interaction terms can hardly remedy these challenges, as the possible locations of segmentation points would still have to be manually pre-specified.

In our MNL application, we dealt with these challenges by manually and iteratively examining residuals and expanding functional forms. To avoid overfitting, we used Akaike’s Information Criterion (AIC) to decide between different proposed model structures. (See Sect. A.1 of the Appendix for the details on MNL model selection and estimation in this study.)

2.4.2 Non-parametric, Machine Learning Methods

As an alternative, we tested three non-parametric machine-learning methods, which are designed to largely avoid the pre-specification of functional forms and to detect input factor interactions and segmentations automatically: multivariate adaptive regression splines (MARS) (Friedman, 1991), random forest regression (RF) (Breiman, 2001; Hastie et al., 2009) and extreme gradient boosting (XGB) (Chen & Guestrin, 2016).

Fitting non-parametric methods always involves striking a balance between a better fit and the danger of overfitting to the sample, which in all three methods is controlled through internal cross-validation and hyperparameters that define limits on model complexity, learning rates, or minimum improvement thresholds. These hyperparameters can be tuned for a specific application, as illustrated in the sketch below. Details on the methods and the associated hyperparameter search and cross-validation procedures can be found in the Appendix (A.3, A.4, A.5).
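As an illustration of such tuning, the following sketch runs a random hyperparameter search for an XGB surrogate with 3-fold cross-validation; the search space shown is invented and far smaller than the spaces documented in the appendix.

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# Invented search space; the spaces actually used are documented in the appendix.
param_dist = {
    "estimator__n_estimators": [128, 256, 512],
    "estimator__max_depth": [4, 6, 8],
    "estimator__learning_rate": [0.05, 0.1, 0.3],
}
search = RandomizedSearchCV(
    MultiOutputRegressor(XGBRegressor()),        # one booster per ilr coordinate
    param_dist, n_iter=15, cv=3, random_state=0)  # 3-fold cross-validation
# search.fit(X_train, Y_train_ilr)  # Y: ilr-transformed crop area composition
```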

While we expected the non-parametric methods to detect the effects of input factor interactions such as price ratios automatically, we tested this capacity by estimating them with two different input factor sets, from which the algorithms could freely form predictor terms: In the first set (without ratios), we only included the 22 price coefficients and uncertain input factors. In the second set (with ratios), we additionally included the ratios of all other price factors to the wheat price and to the fodder barley price, leading to a set of 22 + 19 = 41 input variables (see the sketch below). It is important to note that the second set did not, in principle, provide more information to the algorithms than is already implicitly contained in the price coefficients of the first set.
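A minimal sketch of this feature construction (column names are hypothetical):

```python
import pandas as pd

def add_price_ratios(df: pd.DataFrame, price_cols, numeraires):
    """Append the ratio of every price coefficient to each numeraire price.
    With 11 prices and two numeraires this creates 20 ratio columns, of which
    the wheat/fodder-barley pair are reciprocals of each other; dropping one
    of the two yields the 19 additional inputs (22 + 19 = 41) used here."""
    out = df.copy()
    for num in numeraires:
        for col in price_cols:
            if col != num:
                out[f"{col}_over_{num}"] = out[col] / out[num]
    return out

# e.g.: add_price_ratios(design, price_cols, numeraires=["pc_wheat", "pc_barley_fodder"])
```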

All three methods were originally designed for continuous, unconstrained data and are not per se suitable for use with compositional data. We therefore applied the isometric log-ratio transformation (ilr) (Egozcue et al., 2003) to transform our J-dimensional compositional simulation output into an unconstrained, continuous \(J-1\)-dimensional Euclidean vector space, on which statistical approaches designed for continuous data can be used. (Outcomes predicted by the surrogate models were then back-transformed into the compositional simplex space by the inverse ilr.)

Compared with other transformation methods such as the softmax function, the additive log-ratio transformation, or the centered log-ratio transformation, the ilr transformation is at the same time symmetric, isometric (distance-preserving) and subcompositionally consistent (although not unique), which makes it theoretically the most suitable choice for transforming the estimation data (Egozcue et al., 2003). (For more details see Sect. A.2 in the Appendix.)

Like all logarithmic transformations, the ilr only works with strictly positive data, whereas our dataset contained a non-negligible number of zeros. We worked around this problem by adding a small quantity (one hectare) to all categories in all runs. With area totals around 25,000 hectares, the associated distortion was expected to be minimal.
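A minimal NumPy sketch of the resulting transformation pipeline, using one (non-unique) orthonormal ilr basis and the one-hectare zero replacement described above; the details of the transformation are given in Appendix Sect. A.2:

```python
import numpy as np

def _basis(d):
    """One orthonormal ilr basis of shape (d-1, d); the choice is not unique."""
    V = np.zeros((d - 1, d))
    for i in range(1, d):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] /= np.linalg.norm(V[i - 1])
    return V

def ilr(areas):
    """Map J crop areas to J-1 unconstrained coordinates (Egozcue et al., 2003)."""
    x = np.asarray(areas, dtype=float) + 1.0   # zero replacement: add one hectare
    logx = np.log(x)
    clr = logx - logx.mean(axis=-1, keepdims=True)
    return clr @ _basis(x.shape[-1]).T

def ilr_inv(coords, total_area):
    """Back-transform surrogate predictions to crop areas summing to total_area."""
    y = np.asarray(coords, dtype=float)
    shares = np.exp(y @ _basis(y.shape[-1] + 1))
    shares /= shares.sum(axis=-1, keepdims=True)
    return shares * total_area
```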

3 Results

3.1 Predictive Performance and Model Selection

Figure 2 provides an overview of the predictive performance of the best-performing surrogate model candidate generated by each surrogate modeling method for each training sample size. As a benchmark, we indicate the predictive accuracy of using the mean crop area shares of the respective training sample as a predictor (null model), which achieved a share of correctly classified area of 0.69 (without any major changes over the training sample sizes). All surrogate modeling methods achieved considerably higher predictive accuracy, surpassing 0.9 for a sufficiently large TS size. There is a clear ranking of methods, with XGB achieving up to 0.96, followed by RF and MNL with 0.94 and 0.93, respectively. The best MARS model reached a somewhat lower Scc of 0.91.

Even at the lowest TS size of 500, predictive accuracy was above 0.9, except for MARS, which started at 0.875. The response of performance to increases in training sample size was strongest below 1000 sample points and then flattened out, though smaller improvements were still visible up to 5000 (except for MNL). The ranking of methods remained stable over all TS sizes. The average predictive performance on VS2 (the meta-validation sample) hardly differed from performance on VS1. Worst-case performance was, as expected, worse in VS2 than in VS1. This is not surprising, as VS2 was much larger and hence more likely to contain an extreme combination of input values. Notably, however, the degradation of worst-case performance from VS1 to VS2 was smaller for XGB and RF than for MARS, which showed the poorest worst-case performance, even worse than the benchmark of using the sample average crop areas for prediction.

Fig. 2 Predictive accuracy of the best performing hyperparameter setting, resp. functional form, by surrogate modeling method and training sample size in validation samples VS1 and VS2. (All explicitly including price ratios in the set of independent variables)

3.2 Functional Forms and Hyperparameter Choices

Detailed outcomes of the model (respectively hyperparameter) selection process for each modeling method can be found in the appendix. Three observations common to all methods are worth highlighting:

(i) Complexity: Matching the FLM input-output mapping required complex surrogate models. The optimal number of trees converged to around 512 for RF and between 256 and 512 for XGB. The chosen MNL model comprises 720 coefficients. For MARS, allowing interactions of degree 2 increased predictive accuracy by 0.05 compared with no interactions. (Allowing higher degrees of interaction did not noticeably improve performance in the case of MARS.)

(ii) Explicit inclusion of price ratios: The explicit inclusion of price ratios as input factors for estimation considerably improved predictive performance. The left panel of Fig. 3 shows the lower predictive accuracy of the best models without explicitly included price ratios. Predictive accuracy decreased by more than 0.05 in the case of MARS and MNL. For RF and XGB the effect decreased with TS size, but was still noticeable at TS = 5000.

Fig. 3 Comparison of predictive accuracy of the best performing hyperparameter setting, resp. functional form, WITHOUT and WITH explicit inclusion of price ratios as input factors, by surrogate modeling method and training sample size

(iii) Robustness of selection: The ranking of the best-performing choices for the key hyperparameters was mostly stable over increasing TS sizes. The degree of interaction (MARS), number of trees (RF) and functional form (MNL) that ultimately performed best did so already at TS size 500, and the ranking between choices hardly fluctuated thereafter. For XGB, a more restricted number of trees performed somewhat better at lower TS sizes, but the difference from the performance of the number of trees found optimal for larger TS sizes was rather small.

In all cases, predictive performance on VS2 was similar to that on VS1.

3.3 Stability of Predictions

Figure 4 illustrates the stability of predictions over increasing training sample sizes. Each violin graph shows how the total variance in prediction between surrogate models estimated from two adjacent TS sizes (indicated above the distribution) was distributed over the sample points in the VS1 associated with the larger of the two TS. The sequence of violin graphs over increasing TS sizes depicts how the variance in prediction developed with increasing TS size. We can observe that the predictions of the best-fitting MNL model varied strongly between training sample sizes, while the models found by the other methods showed much lower variance in prediction between estimations from different sample sizes. (Variation was slightly lower when using a simpler functional form for MNL, but still considerably higher than for the other methods.) Total variance decreased with increasing size of the training sample. This decrease was much stronger for MNL than for the other methods.

Fig. 4 Total variance of prediction of the best selected model per method between adjacent training sample sizes. Each violin graph shows the distribution of Aitchison’s total variance between models estimated from \(\hbox {TS}_{\mathrm{s}}\) and \(\hbox {TS}_{\mathrm{s-1}}\) over the runs of \(\hbox {VS1}_{\mathrm{s}}\). (Note: The vertical axis has a logarithmic scale)

3.4 Stability of Input Factor Importance

Figure 5 shows differences in feature importance ranking between models estimated from adjacent TS sizes. Each violin graph shows the distribution of rank difference over the 41 input factors (22 original factors and explicit price ratios). A positive frequency at 10 means that at least one input factor moved in importance by ten ranks between the two TS sizes.

Fig. 5 Stability of feature importance of independent variables. Each violin graph shows the difference between the feature importance ranks of independent variables estimated from \(\hbox {TS}_{\mathrm{s}}\) vs. \(\hbox {TS}_{\mathrm{s-1}}\), as distributed over the 41 independent variables

Similar to the variance in prediction, the variance in input factor importance was high for the estimated MNL functional form. In this case, however, the MARS model also showed considerable variation in ranking. XGB showed some variation at the lower TS sizes, where the optimal number of trees also varied. RF showed no variation at all.

4 Discussion

Our experiment was motivated by the question: “Can we estimate an accurate surrogate model to predict compositional simulation outcomes of a farm-level model from a moderately sized sample?” Overall, the results look very encouraging. All tested methods achieved a share of correctly classified area beyond 0.9, considerably better than the null benchmark of using the respective training sample average (0.69). There is a clear ranking of methods, with XGB performing best, followed by RF. XGB achieved a very high fit of 0.96 at larger sample sizes, and already 0.93 at lower sample sizes. Whether this level of expected predictive accuracy is sufficient, and whether the differences between methods are practically relevant, cannot be answered in general: it very much depends on the purpose for which the surrogate model is intended to be used in practice, and it has to be seen in the context of the uncertainty and inaccuracy of the original FLM itself.

4.1 Robustness

The reported performance of MNL has to be interpreted with care, as it showed considerable instability in prediction and input factor effect estimation across samples. This underlines the importance of cross-validation and robustness diagnostics during training. While the other methods employed some form of cross-validation already in the training and selection process (GCV in MARS, inbuilt ensemble methods and explicit 3-fold cross-validation in RF and XGB), control of overfitting in MNL relied only on comparing AICs for functional form selection. Importantly, this type of instability did not become apparent through a decay of performance from training to validation sample - an alternative check typically used to detect potential overfitting. (The difference between training sample and validation sample performance was stronger for RF than for MNL, for example.)

Our experiment also highlights that even a small additional validation sample (here VS1) can provide a good estimate of performance further out of sample (here represented by VS2). We mainly attribute this to the representative sample structure ensured through the space-filling, increasingly dense coverage of the input space achieved by the Sobol’ sequence. (We emphasize that this validation sample does not make proper cross-validation during training obsolete, but serves as an additional control.)

4.2 The Importance of Domain Knowledge

The considerable performance gain achieved by explicitly including price ratios as predictors alongside the individual price coefficients was somewhat surprising for the automatic, non-parametric methods MARS, XGB and RF, from which we would have expected automatic discovery of these interactions (unlike for the manually specified MNL). Although the effect may diminish at larger TS sizes, this strongly hints at the usefulness and importance of including domain knowledge, i.e. knowledge about the underlying processes, also in sample interpolation tasks, at least for restricted sample sizes.

4.3 Computational Demand and Minimal Sample Size

Runtime for a single prediction (\(t_{\mathrm {pred}}\)) was about 0.1 s with RF and 0.003-0.005 s with the other methods, and in all cases much lower than the time for an original simulation run (\(t_{{\mathrm {simrun}}} = 3\) min).

The time required to estimate a surrogate model for each TS (\(t_{{{ estim },{ SM }}}\left( N_{ estim }\right) \)) generally depends on (i) the size of the hyperparameter search space considered (which was likely more extensive than necessary for RF and XGB in our case), (ii) the number of processors that can be used in parallel, which depends not only on the method but also on the available resources, and (iii) the specific software implementation of the algorithm.

Total pure computation time for training and validation over all candidate functional forms/hyperparameter settings was lowest for MNL (less than 1 h) and MARS (about 3 h), each employing a single processor. (Model selection in MNL regression, however, was not automated and involved a considerable amount of human work time that is difficult to quantify and stretched over weeks.) The RF tuning process required about 14 h employing 16 processors of a High-Performance Computing (HPC) node. Our very comprehensive search process for XGB required about 94 h using an HPC node with 40 cores. More information on time demands can be found in the respective appendices.

However, we did not systematically benchmark timings under comparable and reproducible conditions, and we did not optimize code structure, parallelization or the size of the hyperparameter search space towards time efficiency in training. (Moreover, in this particular study, MARS and MNL were run in R, while XGB and RF were run in Python.) All timing information should therefore be understood as a rough first indication. In particular, the results of our comprehensive random searches for RF and XGB showed large shares of hyperparameter combinations that hardly differed in performance, so that in practice search spaces, and hence computation times, can most likely be reduced by a factor of five without relevant loss of fit, especially if systematic searches are used instead of random searches.

We tested a sequential approach to determine a sufficient number of evaluations with the original model. While here we directly simulated an extended range of sample sizes, in practice a modeler will first simulate only the smallest TS size, estimate a surrogate model, test its performance with a small validation sample, then simulate the additional runs for the second TS size, estimate and test, simulate the additional runs for the third TS size, and so on. The modeler will stop increasing the TS size when computational resources are exhausted or when the fit in the validation sample remains stable and without fluctuations. What constitutes a negligible improvement will, again, depend on the intended purpose of the surrogate model and on trade-off considerations between additional effort and fit. In our example, convergence could arguably be attested at a TS size of 3000 based on graphical inspection for all approaches, but even stopping at 1500 runs could be defensible in practice. Nevertheless, even if a modeler had been able to simulate only 500 runs, RF and XGB would already have given decent approximations in our case.

One can argue that the use of a surrogate model becomes efficient only when the number of predictions for which it is eventually used surpasses the number of original model runs necessary to train it (plus the number of runs possible during the time needed for training). Nevertheless, in applications such as integration into large-scale models and interactive result exploration with stakeholders, the model response is time-critical and the input factor combinations to be evaluated cannot be anticipated and simulated in advance. Moreover, if the surrogate model is used to address data-privacy constraints or to harmonize heterogeneous model implementations, time spent on surrogate model estimation is usually well invested and not prohibitive.

4.4 Scope for Performance Improvement

Since the MPMAS-CSJ model is nominally deterministic, we can expect a fully adaptable surrogate modeling method such as XGB to eventually achieve near-perfect fit if the training sample size is sufficiently increased.

For the given sample sizes, performance and efficiency might potentially be improved by an adaptive sampling scheme that starts with a small initial space-filling design and subsequently prioritizes regions of high variance in prediction for additional farm-level model runs, instead of using the Sobol’ sequence throughout (Gramacy & Lee, 2009). Regions of poor fit would receive higher weight, similar to the resampling logic used in gradient boosting. However, while gradient boosting adapts the resampling weights of existing information, adaptive sampling would generate new, additional information for these regions by running the simulation model.

4.5 Generalizability

The sequential sampling approach to provide training data, combined with performance and stability assessments in validation samples, is generally applicable to support adequate surrogate model development for agroeconomic models. It builds on the theoretical properties of representative sampling by low discrepancy sequences and the control of sampling error through cross-validation, and it is model-free (non-parametric).

The exact relationship between achievable goodness-of-fit and the necessary number of original model evaluations observed in our experiment is most likely not generalizable. From a theoretical point of view, this relationship depends very much on the complexity of the simulated input-output relationships: the greater the complexity, the more repetitions will be needed, and complexity may differ considerably between farm-level model applications. Nevertheless, we believe that our test case provides a sufficiently complex benchmark that represents the typical input-output relationships in simulation data generated by farm-level models well. We observed a high level of input factor interactions and segmented response functions (illustrated in the description of the manual MNL estimation in the appendix) that required complex surrogate models. Ratios between price expectations for different items are important for understanding area response. Response to input factors and input factor ratios is not constant, but differs between segments along the input ranges and as a function of other input factors. Transitions between segments are not always smooth, but sometimes abrupt at breakpoints.

The good performance of XGB is in line with findings in machine-learning contests for a large variety of complex prediction tasks (Bentéjac et al., 2021). There is hence reason to expect this to hold also for many agricultural economic applications. Apart from the baseline scenario (B), we also tested the procedure with climate scenario C2 of Troost & Berger (2015). As Troost and Berger discussed, this scenario represents a major structural break compared with the baseline as it, inter alia, removes an important crop rotation constraint, reducing complementarity between crops. For the somewhat less complex price-crop area interrelationships in C2, the results we obtained were comparable, except that RF performed very similarly to XGB, even slightly better at the lowest sample sizes (see Appendix C).

While we focused on aggregated land use shares as a potential link to large-scale integrated assessment models, the principles are applicable to continuous output (omitting the ilr transformation) and should also be transferable to surrogate modeling of disaggregate farm-level land use shares. Two additional challenges will have to be addressed in the latter case: (i) Compared with regional aggregates, the share of crop area categories with zero area is typically considerably higher for individual farm agents, so the replacement of zeros by small positive amounts will require additional scrutiny. (ii) Structured sampling should be extended to the farm agent population to select samples of farm agents for training and validation, whereas in our aggregate version the full farm agent population was always used.

The estimated surrogate models themselves are not generalizable. They represent price-land use relationships only for the structural scenario, time frame, and study area simulated with the FLM. Relying on fundamental economic principles and a generic disaggregated formulation, the FLM itself can simulate farmer reactions under structurally very different conditions, such as climate change scenarios (Antle, 2019; Troost & Berger, 2015). Surrogate models estimated from one structural scenario cannot, however, simply be used to extrapolate to other structural scenarios. After simulating several structural scenarios with the FLM, either a separate surrogate model would have to be estimated for each structural scenario, or an encompassing surrogate model would have to be estimated over a sufficiently large number of structural variations.

5 Conclusions

We conducted a systematic experiment to analyze to what extent regional crop area share responses to price changes simulated by a detailed farm-level model can be efficiently and comprehensively summarized in the form of surrogate models using econometric and machine-learning methods.

We found that combining extreme gradient boosting with the isometric log-ratio transformation (XGB + ilr) provided a straightforward surrogate modeling approach that achieves a high and robust fit of the complex compositional response surface, albeit at higher computational cost. Random forest regression (RF + ilr) emerged as a viable compromise between computational cost and predictive performance, especially for less complex input-output relationships. In our experiments, both outperformed multivariate adaptive regression splines (MARS + ilr) as well as a manually specified multinomial logistic regression model.

In addition, our experiments highlighted important lessons for the practical use of surrogate modeling with micro-scale agricultural economic models: (i) Even when using a highly representative experimental design, explicit cross-validation and diagnostics for robustness must be an essential component in surrogate modeling in order to prevent overfitting in model selection and avoid unstable and biased predictions. (ii) Explicitly including ratio terms between input variables based on economic domain knowledge can strongly improve efficiency of surrogate model estimation even for highly automatic non-parametric methods and increase performance especially at lower sample sizes, when combined with proper cross-validation. (iii) Using an adequate sampling design in combination with appropriate cross-validation can help to keep the necessary number of original model runs low. The Sobol’ sequence provides a straightforward space-filling sequential design which avoids unnecessary re-evaluations of the original model when increasing the sample size. In our case, it allowed robust estimation from a moderately sized sample and we recommend it as a default choice. Adaptive sequential designs could potentially further improve efficiency.

Our results show that, if these lessons are heeded, reliable surrogate models for aggregate compositional outcomes of micro-scale agricultural economic models can be estimated. This greatly facilitates the integration of micro-scale models into large-scale integrated assessment models - not only through potential gains in computational efficiency, but also by providing an option to harmonize the integration of models of heterogeneous provenance and by allowing model sharing despite privacy-restricted simulation model input data. In this way, surrogate models can contribute significantly to the collaborative, structure-rich computational agricultural economics envisaged by Antle (2019). While our experiments focused on aggregate regional outcomes, the methods presented here could also substantially support calibration, sensitivity analysis, and interactive result exploration with stakeholders (Mössinger et al., 2022) if future research confirms their applicability at a more disaggregate level.