1 Introduction

Although theory supports the idea that firms do not operate on the basis of a common production function, i.e., that firms do not use homogeneous technology (Nelson and Winter 1982; Dosi 1988), a global production function is typically proposed in most empirical studies; thus, it is assumed that production technology is invariant over space and across firms. Indeed, especially in land-based industries, such as agriculture, the analysis of production often must be more nuanced and account for the variations in technology arising from locally-specific solutions that satisfy the environmental or social conditions within which firms operate (Mundlak 2001; Just and Pope 2001; Fezzi and Bateman 2011). Under these circumstances estimating a common production function (technology) to all farms can yield biased estimates of the technological characteristics and lead to wrong management advice or policy prescriptions.

For this reason, many empirical studies control for the possibility of heterogeneous technologies by classifying farms into groups on the basis of a priori exogenous information about their technological characteristics, and subsequently estimate separate different production functions for each group. This classification is based on either some a priori information (e.g., location of farms, etc.) or the application of cluster analysis (Álvarez et al. 2008). It has however been noted (Alvarez et al. 2012) that the use of single or even multiple characteristics to split a sample of observations can only be incomplete proxy descriptor for technologies, since differences in technologies can be the result of both observed and unobserved factors.

Modelling unobserved heterogeneity, instead, can be made via non-parametric and flexible mixtures methods. For instance, in several applications to agriculture, finite mixture models provide the opportunity to classify the sample into a finite number of technologies (classes) underlying the data, according to the estimated probabilities of class membership based on multiple specified characteristics or auxiliary/proxy variables (Orea and Kumbhakar 2004; O’Donnell and Griffiths 2006; Alvarez and del Corral 2010; Sauer and Paul 2013; Baráth and Fertő 2015).

The issue of dealing with unobserved heterogeneity has been recently accounted for by an increasing number of methodological papers (Greene 2005a, b; Emvalomatis 2012; Galán et al. 2014). Most of them relied on fixed- and random-effects panel data models, which former case capture unobserved heterogeneity through a set of firm specific intercepts that are simultaneously estimated with other parameters. These approaches are increasingly used in the context of stochastic frontier models, though distinguishing between heterogeneity and inefficiency is still an open problem (Amsler and Schmidt 2015, Kumbhakar et al. 2014, Colombi et al. 2014).

In this article, we focus the attention on a specific type of unobserved heterogeneity, i.e. spatial heterogeneity. Spatial heterogeneity, a term that has been coined by Anselin (1988), can refer to structural instability over space, in terms of changing functional forms or varying parameters (instability in the mean), or to heteroskedasticity, due to different forms of model misspecification that lead to non-constant error variances (instability in the variance).Footnote 1 Spatial heterogeneity can also be classified into discrete heterogeneity and continuous heterogeneity (Anselin 2010). In this article, we are particularly interested in studying discrete spatial heterogeneity, that is situations in which the relationship or the functional form varies across spatial subsets of the data that, in turn, might point to the existence of spatial regimes (Anselin 1988). Spatial regimes are then geographic subsets of data in which the model coefficients assume different values. It can be seen as simply a special case of group-wise heterogeneity.

The rationale behind our interest in spatial regimes is that in agriculture there is a long tradition of studies aimed at identifying agro-ecological zones (AEZ) i.e., homogenous and contiguous areas with similar soil, land and climate characteristics and having a specific range of potentials and constraints for land use. The AEZ methodology (FAO 1978) provides maximum potential and agronomically attainable crop yields, under assumed levels of inputs and management conditions. As a matter of fact, within each AEZ the long term, dynamic interactions among site-specific environmental variables which characterize a AEZ and farmer decision making about technology contributed to develop local specific varieties and production technologies, hence to give rise to local technology clusters, a concept similar to that of terroir used in oenology. In other words, environmental and social factors contribute to describe a patchy technological landscape. Borrowing from spatial econometric analysis, these local technology clusters can be defined as spatial regimes in farms technologies. However, the zoning of such local technology clusters or terroirs is largely unknown. This is mainly due to the fact that their identification needs comprehensive spatial modelling of soil, agronomical and climatic properties, including their changes through time, hence the processing of large quantities of data acquired at a very fine spatial resolution. In fact, researchers can often rely on a few control variables. For example, agricultural surveys usually do not carry information about cultivars grown, which are key to correctly associate the climatic variables collected in local meteorological stations with farms, in consideration of the differences in the timing of the phenological stages of each cultivar.

These issues motivate us to endogenously identify spatial regimes in farm technologies. To that end, the idea is to exploit the information derived from geographical coordinates (longitude and latitude) to approximate the effects of a variety of complex, dynamic interactions among site-specific environmental variables and farmer decision making about technology that are often not observed at the farm level. The starting point of our analysis is the hypothesis that there may be significant spatial dependence (Anselin 2002) and spatial heterogeneity in the natural clustering of farms around different geographical poles of attraction. Spatial dependence has been recently introduced also into stochastic frontier models, see Druska and Horrace (2004), Schmidt et al (2009)., Areal et al (2012)., Glass et al (2014, 2016)., Adetutu et al (2015). and Vidoli et al (2016).. However, the above references still assume the hypothesis of an aggregate production function, ignoring the effect of unobserved spatial heterogeneity. With the aim of accounting for both spatial dependence and spatial heterogeneity in the estimation of the production function of olive-growing farms in Italy, in this article we apply an approach based on the recent works by Andreano et al. (2017) and Billé et al (2017).

A main difficulty with cross-sectional data, as stressed by Anselin (2010), is that it is often hard separating spatial heterogeneity from spatial dependence. Spatial dependence is viewed as a special case of cross-sectional dependence for which a parametric structure of the covariance matrix is imposed by using a specific ordering of the spatial units. The essence of the problem is that cross-sectional data, while allowing the identification of clusters and patterns, do not provide sufficient information to identify the processes that led to the patterns. As a result, it is impossible to distinguish between the case where the cluster is derived from an apparent contagion due to structural change, hence discrete heterogeneity, or follows from a true contagious process, hence dependence. Moreover, when we deal with unobserved spatial regimes, no a priori information is typically available to identify them and to easily estimate the econometric model.

In this study, we employ a two-step estimation approach. In the first step, we employ the underlying spatial heterogeneity as the criterion to divide the whole sample into groups of contiguous observations that are homogeneous in terms of technology. We capture potential coefficient variations in the the estimated function by making use of an algorithm that builds on local estimation procedure, in the spirit of Cleveland and Devlin (1988), in conjunction with the adaptive weights smoothing (AWS) approach, see Polzehl and Spokoiny (2000). The AWS procedure, originally proposed in the context of image denoising, can be used in all applications where the regression function is likely to possess jumps or sharp edges. We use the AWS approach to make the local coefficients statistically converge into a discrete variation over space, hence identify the edges between different local technology clusters. In the second step of the estimation strategy, once the spatial regimes in technologies are identified, we contemporaneously account both for spatial regimes and spatial dependence (which in this case we suppose is determined by a true contagion process) with the aim to reduce as much as possible the consequences of model misspecifications. In the current research, we employ the above two step approach to investigate farm-level olive production in two regions in Italy, i.e. Tuscany and Apulia.

The remainder of the article is organized as follows. In Section 2 we explain the iterative procedure used to detect unobserved spatial heterogeneity and to identify the spatial technological regimes in a data-driven approach, we then specify the spatial econometric model with regimes used to simultaneously detect the effects of spatial dependence and spatial regimes, and finally we explain how we address the potential endogeneity issue that typically arises into production function estimation. In Section 3 we introduce the data set and show the main estimation results of the application of the procedure to two regional samples of olive growing farms in Italy. Finally, Section 4 concludes.

2 The model

In this section, we present the procedure used to detect unobserved spatial heterogeneity and to identify groups of spatial units (farms in our case) with homogeneous local production function parameters, i.e. spatial regimes in technologies. In particular, in subsection 2.1 we briefly explain the iterative procedure which allows us to control for unobserved spatial heterogeneity and to identify spatial regimes in a data-driven approach. In subsection 2.2 we specify the spatial econometric model with regimes used to simultaneously detect the effects of spatial dependence and spatial regimes. Whereas in subsection 2.3, we present the production function that we are going to estimate and explain how we address the potential endogeneity issue.

2.1 The iterative procedure used to identify spatial regimes

The iterative procedure used to identify spatial regimes is based on a first set of locally (geographically) weighted estimates (Cleveland and Devlin 1988), and the adaptive weights smoothing (AWS) approach (see Polzehl and Spokoiny 2000). The result of the combination of above-mentioned methods is an algorithm that is able to endogenously identify (i.e., in a data-driven approach) regimes over space, that is large homogeneous areas with sharp discontinuities. Overall, the iterative procedure tries to answer the following question: do the coefficient estimates of a spatial unit (farm in our case) statistically differ from the coefficient estimates of the nearby ones? Details on the proposed algorithm can be found in Andreano et al (2017).. Further improvements as well as the R code can be found in Billé et al (2017). In this article, we adopt the same two-step procedure as in Billé et al (2017). A brief explanation of the iterative procedure follows.

To account for local parameter estimates that vary over space, simple linear functions may provide a reasonable approximation of the local estimate as long as we use the information on a group of observations (farms in our case) close to observation i. The goal is to approximate the following model

$$y = \left( {\beta \odot X} \right)1 + \varepsilon,\,\varepsilon \sim iidN(0,\sigma _\varepsilon ^2I),$$
(2.1)

where y is an n-dimensional column of dependent variables, β is an n by k + 1 matrix of local coefficients with i-th row β i  = (βi0, βi1, βi2,…, β ik )' referred to observation i, \(\odot\) is the Hadamard product operator in which each element of β i is multiplied by the corresponding element of the i-th row in the matrix of regressors X, x i  = (xi0, xi1, xi2,…, x ik ), 1 is a (k + 1)-dimentional column vector of ones, and ε is an n-dimensional column vector of iid normal innovations.

Local estimates of β i (i.e. for each unit in space) can be directly obtained by repeated weighted least squares (WLS)

$$\widehat \beta _i^{WLS} = \left( {X^\prime W_iX} \right)^{ - 1}X^\prime W_iy,\,i = 1, \ldots n,$$
(2.2)

where W i is an n-by-n matrix whose non-zero elements denote the geographical distance weights (wi1, wi2, …, w ij ,…, w in ), obtained on the basis of longitude and latitude, of each of the n observed data for observation. So, the role of the W i , i = 1, …, n, weighting matrices is simply to give different distance-based weights in estimating the local coefficients i. We therefore have n diagonal spatial weighting matrices, one for each of the n observed data. The final result of the locally weighted least squares is obtained when n sets of local parameter estimates, that correspond to the local marginal effects, have been estimated. The subsets of observations (farms in our case) i.e., the neighborhood of i used for each local weighted least square fit, are selected by the “bandwidth” or “smoothing parameter” that determines how much of the data is used to fit each local regression.

Let now define y(i) =  i  + ε(i), \(\varepsilon ^{(i)}\sim N(0,\sigma _{\varepsilon ^{(i)}}^2I^{(i)})\), the model effectively and iteratively estimated for each i = 1, …, n. The WLS estimator for each observation i in Eq. (2.2) has the following expected value and variance-covariance matrix

$$\begin{array}{*{20}{l}} {E\left( {\hat \beta _i} \right)} \hfill & = \hfill & {E\left[ {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_iy^{(i)}} \right]} \hfill \\ {} \hfill & {} \hfill & { = E\left[ {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_i\left( {X\beta _i + \varepsilon } \right)} \right] = \beta _i} \hfill \\ {V\left( {\hat \beta _i} \right)} \hfill & = \hfill & {E\left[ {\left( {\hat \beta _i - \beta _i} \right)\left( {\hat \beta _i - \beta _i} \right)^\prime } \right]} \hfill \\ {} \hfill & {} \hfill & { = E\left[ {\left( {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_iy^{(i)} - \beta _i} \right)} \right.} \hfill \\ {} \hfill & {} \hfill & {\left. {\left( {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_iy^{(i)} - \beta _i} \right)^\prime } \right]} \hfill \\ {} \hfill & {} \hfill & { = E\left[ {\left( {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_i\left( {X\beta _i + \varepsilon } \right) - \beta _i} \right)} \right.} \hfill \\ {} \hfill & {} \hfill & {\left. {\left( {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_i\left( {X\beta _i + \varepsilon } \right) - \beta _i} \right)^\prime } \right]} \hfill \\ {} \hfill & {} \hfill & { = E\left[ {\left( {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_i\varepsilon } \right)\left( {\left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_i\varepsilon } \right)^\prime } \right]} \hfill \\ {} \hfill & {} \hfill & { = \left( {X{\prime}W_iX} \right)^{ - 1}X{\prime}W_iE\left( {\varepsilon \varepsilon {\prime}} \right)W_iX\left( {X{\prime}W_iX} \right)^{ - 1}} \hfill \end{array}$$

so that the WLS estimator is locally unbiased (i.e. given the optimal bandwidth value) with variance-covariance matrix in equation \(V\left( {\hat \beta _i} \right) = \sigma _{\varepsilon ^{(i)}}^2\left( {X\prime W_iX} \right)^{ - 1}X\prime W_i^2X\left( {X\prime W_iX} \right)^{ - 1}\).

Given our aim to identify spatial regimes i.e., large homogeneous regions separated by sharp discontinuities, we use the AWS approach to detect the greatest possible local neighborhood of every observation i in which the local parametric assumption is justified by the data. The AWS approach is based on a successive increase of local neighborhoods around every point i and a description of the local model within such neighborhoods by assigning weights. The weights describing the shape of the local model at the point i depend on the result of the previous step of the procedure. More in detail, we iteratively update the weights (geographical distances) in the main diagonal of W i and, at each iteration, we compare the estimated beta coefficients in (2.2) by using Wald test statistics in order to check whether pairs of spatial units (farms in our case) follow the same economic behavior (production function in our case). The initial W i matrix, i.e., the weights from which we start the procedure, is defined by using a bi-square kernel weighting function, whereas a Gaussian kernel weighting function is used during the updating procedure of the weights. Both these kernels are defined as functions of the Euclidean distanceFootnote 2 between observations i and j, i.e. d ij , and the bandwidth b, though alterative definitions of distances can be used.Footnote 3 In this article, we use an adaptive bandwidth because it ensures sufficient (and constant) local information for each local calibration.Footnote 4 We select the optimal adaptive bandwidth value (bopt) by a mix selection of the model specification for local estimates, the optimizing criterion (AIC in our case; see Fotheringham et al 2002., p 61) and the used kernel which, consistently with the function chosen to define the initial weights, we choose to be a bi-square kernel function. To this purpose, we used the package GWmodel in R (Lu et al. 2014). Details on the steps of the iteration procedure are in the appendix.

This procedure does not impose any restrictions and is fully adaptive in the sense that no prior economic information about the spatial structure is required.

2.2 Controlling for spatial dependence

Once the spatial regimes are identified by applying the iterative procedure described above, i.e., after controlling for spatial heterogeneity, we have to deal with spatial dependence among the observations. A huge literature is available on several types of spatial autoregressive models that can be alternatively used to control for spatial interactions (see e.g. LeSage and Pace 2009). Among the most general ones, the spatial autoregressive model with autoregressive disturbances (typically known as SARAR/SAC model) is one possible candidate, which includes both the spatially lagged dependent variables and a parametric correlation structure among the disturbances. Asymptotic properties of the maximum likelihood and quasi-maximum likelihood estimators has been proved by Lee (2004) and of the generalized method of moments by Kelejian and Prucha (2010). By adding the information about spatial regimes, this model can be written as

$$\begin{array}{l}\left[ {\begin{array}{*{20}{c}} {y_1} \\ \vdots \\ {y_c} \end{array}} \right] = \rho W_1\left[ {\begin{array}{*{20}{c}} {y_1} \\ \vdots \\ {y_c} \end{array}} \right] + \left[ {\begin{array}{*{20}{c}} {X_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & {X_c} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {\beta _1} \\ \vdots \\ {\beta _c} \end{array}} \right] + \left[ {\begin{array}{*{20}{c}} {u_1} \\ \vdots \\ {u_c} \end{array}} \right],\\ \left[ {\begin{array}{*{20}{c}} {u_1} \\ \vdots \\ {u_c} \end{array}} \right] = \lambda W_2\left[ {\begin{array}{*{20}{c}} {u_1} \\ \vdots \\ {u_c} \end{array}} \right] + \left[ {\begin{array}{*{20}{c}} {\varepsilon _1} \\ \vdots \\ {\varepsilon _c} \end{array}} \right]\end{array}$$
(2.3)

where \(\tilde y = \left\{ {\tilde y_j} \right\}\) is a partitioned n-dimensional column vector of dependent variables (production in our case), \(\tilde y_j = \left[ {\begin{array}{*{20}{c}} {y_j} \\ \vdots \\ {y_{n_j}} \end{array}} \right]\) for j = 1, …, c is the n j -dimensional vector of dependent variables relative to the unknown regime (cluster) j, with n = n1 + … + n j  + … + n c ; ρ and λ are the well-known spatial autoregressive coefficients, which capture the spillover effects in the dependent variables and the correlation structure among the disturbances, respectively; W1 and W2 are n by n spatial weighting matrices; which pre-specify the spatial structure among the observations; \(\tilde \beta = \left\{ {\tilde \beta _j} \right\}\) is the (k × c)-dimensional partitioned column vector of parameters (a vector for each cluster); \(\tilde X = \left\{ {\tilde X_j} \right\}\) is a block-diagonal matrix of k regressors (inputs in our case) of dimension n by (k × c); \(\tilde \varepsilon = \left\{ {\tilde \varepsilon _j} \right\}\) is a partitioned n-dimensional column vector of iid normal innovations, with \(\tilde \varepsilon _j \sim N\left( {0,\sigma _{\tilde \varepsilon _j}^2I_{n_j}} \right)\) each. Note that the weighting matrices are not partitioned, so that we do not assume unknown form of group-wise autocorrelation heterogeneity, but rather observations that belong to different clusters may be connected each other. The grade of connection directly depends on the choice of the criteria used for defining the weights. If dense matrices are assumed, all the n observations are connected with a different weight. Model identification issues require either W1 and W2 to be different or the vector \(\tilde \beta\) statistically significant.

The model in Eq. (2.3) reduces to a spatial autoregressive error (SAE) model with regimes by setting ρ = 0 and to a spatial autoregressive (SAR) model with regimes by setting λ = 0. Both the squared weighting matrices, i.e. W1 and W2, have diagonal elements w ii equal to zero, i.e. each spatial unit is not viewed as its own neighbor, and they are normalized so that the admissible parameter spaces of ρ and λ are known and less than unity in absolute value. In this article, we adopt the typical row-normalization rule for the weighting matrices.

It is worth nothing that robust covariance estimates can be properly obtained by models with heteroskedastic and autocorrelated consistent (HAC) structures (see Kelejian and Prucha 2007 for a spatial extension), which are able to properly handle unobserved factors from an econometric perspective (Dell et al. 2014; Deschênes and Greenstone 2007).

2.3 The estimation of a production function and potential endogeneity

In this article, we apply the iterative procedure described in Section 2.1 to identify the unobserved spatial production regimes, i.e. spatial groups of farms that are homogeneous in terms of technologies. For this purpose, we replace the general model in 2.1 with the following Cobb-Douglas production function in logs

$$y = \beta _1x_1 + \beta _2x_2 + \beta _3x_3 + \beta _4x_4 + \varepsilon,$$
(2.4)

where y is a column vector of logs of produced quantities and and x1, x2, x3, x4 are column vectors of logs of area (land used), capital inputs, labor inputs and intermediate inputs, while \(\varepsilon \sim N\left( {0,\sigma _\varepsilon ^2I_n} \right)\). Once the spatial regimes are identified, we also control for spatial dependence by estimating the spatial model with regimes (2.3). In the context of production function models the endogenous spatial interactions i.e., ρ capture the spatial dependence structure between farms’ production systems (Horrace et al. 2016). For example, farmers’ decisions on production innovations that spill-over among neighboring farms. The coefficient λ of the spatially autoregressive disturbances may capture an unobserved dependence structure, so that a shock on a farm production propagates through the entire system of farms’ production with a higher effect on the neighborhood.

When we consider the production function, a typical problem of endogeneity arises for one or more of the regressors (Ackerberg et al. 2015; Shee and Stefanou 2014; Latruffe et al. 2016). The endogeneity of inputs is mainly due to the presence of some determinants of the firm productions that are unobserved to the econometrician (omitted-variable bias) but observed by the farmer, leading to inconsistency of the standard least squares estimators. In a linear framework, the standard approach for addressing the potential endogeneity bias is to use instrumental variables. The two-stage least squares (2SLS) estimator (Hansen 1982) has been widely used for this purpose. To efficiently combine 2SLS with repeated local estimates, we regress the set of endogenous variables in (2.4) on the selected instruments before starting the iterative procedure in order to guarantee new orthogonality conditions. To briefly explain, consider the previous model of interest (2.4) in a generic form, y = X1β1 + X2β2 + ε, where X1 are k1 exogenous variables, whereas X2 are k2endogenous variables. Given some useful information on a set of k3 instruments for X2—say Z = [X1, P] with k P  ≥ k2 and P other instruments different from the exogenous ones in the model—we can obtain the predicted values of X = [X1, X2] from the first step as \(\hat X = Z(Z\prime Z)^{ - 1}Z\prime X\). These values are then used in the local regressions during the iterative procedure with \(\hat \beta _i^{WLS} = \left( {\hat X\prime W_i\hat X} \right)^{ - 1}\hat X\prime W_iy\).

Controlling for endogeneity becomes more difficult if we wish to concurrently consider spatial spillover effects (Rey and Boarnet 2004) due to the presence of two different sources of endogeneity, namely, endogenous regressors and spatially lagged variables. The endogeneity problem within the spatial framework can be addressed, for example, by using the spatial heteroscedasticity and autocorrelation consistent (SHAC) estimator proposed by Kelejian and Prucha (2007). In fact, in our local estimates endogeneity of the regressors is an issue prior to the application of the convergence procedure that allows us to endogenously identify the spatial regimes. During the first stage of the estimation procedure we only control for heterogeneous effects. It is only after this first stage that we estimate both a spatial error model and a spatial error model with regimes. Under these circumstances, in the first-stage estimates, we can appropriately address endogeneity by using the generalized instrumental variable method without incurring in the problems deriving from the concurrent existence of another source of endogeneity.

3 An application to olive farms in Italy

We use the above technique to examine the production of olives in Italy. Italy ranks second in the world after Spain for olive oil production. The Italian olive sector is still characterized by a large number of small operations. Specifically, Italy has the highest number of holdings (776,000) with the smallest average size (1.3 ha) in the European Union (EU) Mediterranean countries. Over time, the different microclimate conditions, soil formations and elevation levels have led to natural or manmade modifications (breeding and selection) of the olive tree into many location-specific varieties, each with different productivity levels, agronomic needs and adaptability to irrigation and mechanization. In the case of arable crops (e.g. maize and oil seeds), local varieties have been widely substituted by industrial global varieties whose production response does not vary greatly over space. In the case of olive trees, locally-specific varieties are still largely in use. The territorial anchorage of the production of location-specific olive varieties is further strengthened by social and marketing considerations because, similarly to the case of wine and grapes, farmers choose varieties on the basis of not only agronomic characteristics (e.g., disease resistance, climate preferences, high productivity) but also the aptitude for preserving local production knowledge (e.g., flavor, suitability for curing, etc.) and ability to guarantee the production of high-quality oil.

Based on all these considerations, it follows that the technology available to farms depends on the characteristics of the physical, social and economic environment in which production takes place. In other words, the underlying production technology is not the same for all olive farms; rather, it is location specific, and the group of farms sharing the same technology can be termed a local technology cluster, a concept similar to that of terroir in oenology. The local olive production function used by farmers operating in a territory results from the choice of locally-optimal technology from a given menu of technologies as a consequence of a process of localized technological change (Stiglitz and Atkinson 1969; Nelson and Winter 1982; Antonelli 2008; Acemoglu 2015). This view is consistent with evolutionary theories (Nelson and Winter 1982; Dosi 1988) according to which firms cannot be assumed to operate using a single common production function. These theories explain why permanent asymmetries exist across firms in terms of production technologies and quality of products (Dosi 1988). External inputs and the past accumulation of skills and knowledge guide the creation of technological knowledge. The technology prevailing in the local technology cluster is the efficient solution to the specific techno-economic problems experienced by the firms operating in the cluster. This solution consists of specific families of recipes and routines, and it is based on carefully selected principles derived from natural sciences jointly with specific rules aimed at acquiring related new knowledge (Dosi and Nelson 2013).

For example, in the case of olive trees, farmers grow different cultivars that have different yields, aptitude to the mechanization of the harvesting and input needs. As we noted above, the decision to grow a low-yield or higher-cost variety is partly connected to the characteristics of the local natural environment (e.g., climate, water, soil) and partly to the preservation of the local cultural heritage (e.g., flavor and quality of oil, landscape). Under these circumstances, it is difficult, if not impossible, to collect all the information needed to define the boundaries of the local technology cluster. A solution is offered by the application of the two-step procedure described in Section 2.

3.1 Data

This study relies on cross-sectional data collected by the 2013 Italian Farm Accountancy Data Network (FADN) survey. The data set consists in 233 observations for Tuscany and 227 for Apulia. The FADN sample is stratified according to the criteria of the geographical region, economic size and type of farming. The field of observation is the total number of commercial farms, that is farms large enough to provide a main activity for the farmer and a level of income sufficient to support his or her family. The survey gathers physical and structural data (e.g., location, crop areas, livestock numbers, labor force) and the economic and financial data needed for the determination of incomes and business analysis of agricultural holdings. We also exploit a peculiar characteristic of the Italian FADN, which provides longitude and latitude coordinates of farms, thus allowing us to use spatial econometric models and, in our specific case, account for farm spatial heterogeneity. Another advantage of the Italian FADN is that the information on input use and production results is collected by activity. Consequently, unlike most previous studies (see for example Dinar et al 2007. and Karagiannis and Tzouvelekas 2009 for applications to olive production), we do not need to rely on data that refer to the entire farm production. Instead, both output results and input use in our study are activity specific. The availability of information by activity has several advantages. First, we do not need to focus on farms that are highly specialized in the production of a specific crop—olives, in our case—thus preventing the loss of observations related to farms that grow the selected crops jointly with other products and allowing us to segment the regional samples in groups of farms large enough to produce reliable estimates. Second, given that we do not have to aggregate the farm crop outputs, we can measure output in physical rather than in monetary terms. This feature, along with the availability of activity-specific input uses, guarantees more accurate results.

The dependent variable in the production function in Eq. (2.4) is the olive production measured in kilograms. The inputs included as explanatory variables are (a) land (A) measured in hectares, including only the share of utilized agricultural area devoted to olive tree cultivation; (b) labor (L), comprising hired (permanent and casual) and family labor, measured in working hours; (c) capital (K), proxied by the hours of mechanical work employed in olive growing and harvesting; and (d) intermediate inputs (M)—including expenses for water, fertilizers, pesticides, fuel and electric power and other miscellaneous expenses—measured in euros and augmented with the expenses for contract work. The descriptive statistics are reported in Table 1.

Table 1 Descriptive statistics (year 2013)

To control for the potential endogeneity of inputs, we tested the suitability of a set of variables to be used as instruments. All the inputs, i.e. labor, capital, and intermediate inputs, are treated as endogenous. Recall that we proxy capital by the hours of mechanical work employed in olive growing, so we have to treat it as an endogenous variable. This set first contains, following the standard approach proposed by Pindyck and Rotenberg (1983), the lagged values of the three variables suspected of being endogenous, namely Lt−1, Kt−1, and Mt−1. Additionally, we consider the prices of nutrients (PF) and pesticides (PP),Footnote 5 the opportunity cost of laborFootnote 6 (OCL) and capitalFootnote 7 (OCK) provided in the Italian FADN, and a proxy of the investments in machinery (I) that is obtained as the variation in horsepower at the farm level observed between 2013 and 2012. After testing for the correlation among the endogenous variables and the above-mentioned available instruments by using simple linear regression models, we find that in the cases of Tuscany and Marche, the lagged values of inputs (Lt−1, Kt−1, and Mt−1) are valid instruments of the three variables suspected of being endogenous, while in Apulia, the valid instruments are the lagged value and opportunity cost in the case of labor (Lt−1, OCL); lagged value, opportunity cost and investment in machinery for capital (Kt−1, OCK, I); and finally, lagged value and price of fertilizers (Mt−1, PF) in the case of intermediate inputs.

3.2 Results

Our empirical results are based on year 2013, while the 2012 data set is used to check the robustness of the grouping of farms. To model production, we first correct for endogeneity, then we apply the iterative procedure described in Section 2.1 by using the Cobb–Douglas (CD) functional form in Eq. (2.4) in order to identify the spatial production regimes by repeated WLS. Finally, we estimate the model specified in (2.3) and its simplified version without regimes, to control for possible spatial dependence not already controlled for. We also estimate the CD model by OLS, with and without regimes, as benchmark models to check if including spatial effects improves the goodness of fit.

The optimal bandwidth values, i.e., the optimal numbers of nearest neighbors required for local estimates, suggested by the AIC and used to fit the models with regimes are 67 for Tuscany and 111 for Apulia. These values come from a proper combination of the assumed economic model specification and the trade-off problem between bias and variance in considering local estimation procedures, given the sample size. Therefore, the optimizing bandwidth criterion (AIC in our case) tries to find the best subsample size that is able to give the best local fit of the specified model, see Eq. 2.4, for each point regression i. It is worth noting that the model in Eq. 2.4 should be correctly specified.

We identify 4 spatial regimes in Tuscany (Fig. 1a) and 3 in Apulia (Fig. 1c).

Fig. 1
figure 1

Spatial technology regimes of olive farms. Note: The bi-square kernel function for the initial weights and Gaussian kernel function for the updated weights are used. The adaptive bandwidths are based on the AIC criterion: a \(b_{knn}^{AIC} = 67\), b \(b_{knn}^{AIC} = 83\), c \(b_{knn}^{AIC} = 111\), d \(b_{knn}^{AIC} = 135\)

The results show that the relationship between input use and olive output does not vary evenly across space; rather, it describes a patchy landscape due to spatial heterogeneity. After all, it is not surprising to find, for example, clusters of lower output elasticity of mechanical work on steeply sloping land or higher output elasticity of land when tall olive trees, i.e., those not suitable for mechanical harvesting and pruning, are grown. Input use and yields may also be locally influenced by private and public regulations. For example, farms located in areas covered by geographical indication such as Protected Denomination of Origin (PDO) may be required to comply with the requirement to limit yields to improve the quality of certified products, while those operating in environmentally sensitive areas may be required to reduce their nutrient loadings. At this regard, it is interesting to note that in both regions in both regions the spatial regimes conform well with expert based maps of the olive oil PDO. PDO rules—which aim to protect the names of quality agricultural products—define the geographic area covered and regulate the varieties of olives (usually local cultivar) that can be grown, and the methods of cultivation that can be used. As we already mentioned the local cultivars are the result of a long run process of selection and adaptation to the environment. The high degree of overlap between the clusters identified by the iterative procedure and the geographic distribution of local cultivars and PDO cross-validates our results and the suitability of geographic coordinates as a proxy for variety variations, soil type, climate, etc. In other words, this overlap validates the ability of farms’ longitude and latitude to account for the heterogeneity in spatial distribution of technology that, in turn, arises from the adaptation of production techniques to variables that are not usually detected by agricultural surveys at the farm level. In other word, environmental and social factors may contribute to describe a patchy technological landscape characterized by the presence of discrete spatial technological regimes.

To statistically assess the significance of the groupings of farms identified by the iterative procedure, i.e., to test whether the beta coefficients statistically differ across clusters, we computed the Chow test and its spatial version (Anselin 2005). In Tuscany, the results (Table 2) indicate strong evidence of significantly different coefficients in each of the spatial clusters of farms and suggest that the proposed partition can be accepted. In Apulia, the AIC criterion signals that the global model fits the data best. It is worth noting, however, that the (admittedly slightly) significant spatial Chow test statistic (p-value: 0.079) suggests that the clusters of farms identified by the iterative procedure can still be used for descriptive purposes and further investigation.

Table 2 A-spatial and spatial Chow tests

Finally, in order to check the robustness of the grouping of farms to short term variations in the output and inputs variables, due for example to variations in weather conditions, we tested the sensitivity of the clustering to the change in time by replicating the application of the iterative procedure on two regional samples of olive growing farms referred to year 2012 (Fig. 1b, d). It is easy to see that the zoning obtained using the data referred to year 2013 is very similar to the one obtained using the sample referred to the previous year. The minor differences in the borders are largely due to the fact that the samples are not balanced, hence part of the farms observed in year 2012 were replaced by different farms with different locations. In other words, as expected, we find the clustering of farms is time invariant, the reason for this is that it is the result of the long-term interactions of social and environmental variables that affect the farmers’ choice of technology.

After having controlled for spatial heterogeneity and identified the local technological clusters of farms, we estimate the model specified in (2.3) to control for possible spatial dependence in the data not already controlled for during the application of the iterative procedure. We did not find any statistically significant spatial autoregressive coefficient (ρ) in our empirical application, which result was also supported by the LR test statistic values in Table 3. As already noted, when ρ = 0, then the SARAR/SAC model described in (2.3) collapses to a spatial autoregressive error (SAE) model. The coefficient estimates of the global SAE and those of the SAE with regimes are reported in Tables 4 and 5. At this regard, it is worth noting that, like in OLS, the coefficient estimates of both SAE and a SAE with regimes can be interpreted as output elasticity. This is because the autocorrelation among disturbances does not affect the expected mean, but only the variance-covariance matrix of the spatial model, hence we do not need to calculate the direct, indirect and total effects typically needed for spatial marginal effects interpretations (LeSage and Pace 2009).

Table 3 LR test values between SARAR and SAE specifications
Table 4 Spatial global models
Table 5 Spatial autoregressive error models with regimes

The best fitting model in each region has been defined on the basis of the smallest AIC (Table 6). The best-fitting model in Tuscany is the one accounting for both spatial dependence in the disturbances (λ) and spatial heterogeneity. In fact, it is notable that in Tuscany, the measure of spatial error dependence is highly statistically significant both in the global model and the model with regimes though slightly lower in magnitude in the latter model. This means that spatial dependence is still at work even after controlling for spatial heterogeneity. Also in Apulia, we find spatial error dependence is statistically significant, though in this case the AIC criterion selects the global spatial model as the specification that fits the data best. This result does not mean that spatial heterogeneity is not at work in Apulia, it rather indicates that the model with spatially varying parameters is excessively complex, i.e. less parsimonious, with respect to the sample size.

Table 6 AIC values for model comparisons

The spatial dependence in the disturbances (λ) detected in both regions means that there are different unobserved factors, such as land characteristics (e.g., soil composition) and socio-economic aspects (e.g., information sharing, managerial abilities), that are spatially correlated. In other words, the coefficient λ of the spatially autoregressive disturbances captures an unobserved dependence structure, so that a shock on a farm production propagates through the entire system of farms’ production with a higher effect on the neighborhood. In the case of olive cultivation, for example, the closer the farm operates to the sea and the more farmers chose to grow salt- and wind-tolerant varieties. Similarly, the more the farm is located in the Northern or elevated areas, the more the farmer tends to choose frost-tolerant cultivars.

Overall, in Apulia the global model fits the data best, this means that the olive production technology does not vary over space in this region. On the contrary, in Tuscany we find the best fitting model is the one with regimes. This means that the long term, dynamic interactions among site-specific environmental variables and farmer decision making about technology have given rise to structural differences across space, hence to local technology clusters of farms. For example, Cluster 1 and 3 are very close one to the other, but the first is characterized by hills while the second by flatlands. They are among the most productive olive growing areas in Tuscany and this feature is reflected by the high and statistically significant elasticity of land. In both Cluster 1 and 3 the elasticity of capital is non-significant. In Cluster 1, this result is probably caused by the fact that the orchards are located in hilly and sloping land not suitable for mechanical operations, while in Cluster 3 it is n mainly due to the fact the trees are usually very tall and not suitable for mechanical harvesting. This latter feature is probably the cause also of the statistically significant elasticity of labor found in Cluster 3. In Cluster 2 the traditional orchards are characterized by low yields due to the habit to keep the trees low by severe pruning as a defense against frost damages. This latter feature explains the significant elasticity of labor (used for pruning and harvesting) found in this cluster. The low input management systems used in traditional orchards in this cluster explains the non-significant elasticity of other inputs and capital. In Cluster 4, the severe damage caused by harsh frosts recorded in the 1960s and 1985 led to the replacement of old trees with frost-tolerant cultivars that responded favorably to more intensive fertilizers and pesticides use. The presence of such modern plantations is probably the cause of the statistically significant response to changes in capital, i.e., mechanical work, and in other inputs, while changes in labor do not produce any statistically significant impact on output. Finally, our findings show that in all the clusters, there are decreasing returns to scale, i.e., output increases by less than the proportional change in inputs. Overall, we find that when there is spatial heterogeneity the identification of spatial regimes can help drawing a more detailed picture of the production environment and provide more accurate information to guide extension services and policy makers.

4 Conclusion

In agriculture, the long-term interactions among site-specific environmental variables and farmer decision making about technology have contributed to developing local specific varieties and technologies, hence to give rise to a patchy technological landscape characterized by the presence of a discrete number of spatial regimes in technologies. The clustering of farms in areas in which farms follow a local production econometric model can be the result of true contagious effects, i.e. spatial dependence, or apparent contagion effects due to structural changes, i.e. discrete spatial heterogeneity. Unfortunately, in the estimation of production functions the effects of spatial heterogeneity are often ignored. The result is that a global model is usually fitted to all the farms in the sample. Clearly, when that relationship varies over space, global parameter estimates may be very misleading and lead to wrong economic conclusions. On the other hand, it is often hard separating spatial heterogeneity from spatial dependence with only the statistical information coming from a cross-sectional data set.

In this article we use a two-step procedure that exploits the information carried by the geographic coordinates of farms to control for both spatial heterogeneity and spatial dependence. The main advantages of this procedure are that, first, it identifies spatial regimes in a data-driven approach, and, second, it estimates group-wise (regime-specific) coefficients. As a consequence, it avoids the problem of the lack of information on large sets of variables (variety grown, soil quality, etc.) causing the clustering of farms technologies in spatial regimes, and it provides more accurate information of the input-output relationships in the agricultural landscape than the global estimates..

We apply the above-mentioned two-step procedure to estimate the production function of olives on geocoded individual farm data as collected by the FADN sample survey in 2013 in 2 Italian Regions. The aim of the of this work is twofold. Firstly, we want to identify the unknown number of spatial regimes in olive growing technologies and, secondly, we want to detect the presence of spatial dependence in the data. For this latter purpose, once the spatial regimes are identified, we estimate a spatial autoregressive model with autoregressive disturbances and regimes, which captures both the spatial spillover effects (true contagion) and the heterogeneous ones (apparent contagion).

Our empirical results confirmed the existence of local technology clusters of farms, i.e. areas in which farms follow a local production econometric model. In other words, our results confirm the hypothesis that the presence of different spatial production regimes is related to the existence of a variety of latent unobserved factors, which are closely intertwined to the spatial location of the observed farms. In addition, we do not detect any statistically significant spatial autoregressive coefficient related to the dependent variables in both the regions under analysis. On the contrary, we find the spatial error dependence is at work in both regions, suggesting that the effect of a shock on a farm propagate on the other farms with the intensity given by the value of the autocorrelation coefficient.

The use of the two-step procedure to control for both spatial heterogeneity and spatial dependence can provide a detailed zoning of crop production systems that takes into account the effects of many unobserved agro-ecological and social factors at work in the territory. For example, the procedure can identify the geographical extension of specific terroirs and their borders, making this concept operative and relevant to land use and planning policies. In addition, by estimating local crop models with regimes we can provide regime-specific output elasticities that can be used for accurate analysis of farms’ efficiencies that operate in a specific terroir.

The conditions under which the proposed partitions imply a better fit of the regression model can be further investigated. In particular, it would be interesting to extend the procedure to panel data models with both endogenous spatial regimes and spatial spillover effects in order to account for both time- and space-variations of farms production technologies.