Background

Information on forest resources in tropical developing countries is challenging to acquire due to financial and logistical constraints (Holmgren and Marklund 2007; Skutsch and Ba 2010; Stringer et al. 2012). However, tropical forest resources are being exploited and converted to other uses at rates that seem to outpace the capacity for regrowth (Shearman et al. 2012; Brink et al. 2014; Sawe et al. 2014; Suberu et al. 2014). This is especially true in the miombo ecoregion of southern Africa (Cabral et al. 2010; Mayes et al. 2015), where forest productivity is marginal (Frost 1996). In this context, forest monitoring programs are required to provide information on forest resources for: i) planning use and management activities at local levels (Stringer et al. 2012); and ii) designing effective policies and measures at national levels. Above-ground biomass (AGB) is a key variable of interest in forest monitoring programs, where estimates of AGB are necessary for assessing fuel wood and timber availability, as well as for monitoring forest carbon stocks. Integrating ground plot data from a forest inventory with environmental and/or remotely-sensed predictor variables to predict AGB has shown increasing utility for estimating AGB (Moisen et al. 2006; McRoberts et al. 2010; Lu et al. 2016; GOFC-GOLD 2015).

Optical remotely-sensed data alone have been used to estimate miombo AGB with varying success (Samimi and Kraus 2004; Kashindye et al. 2013; Næsset et al. 2016). The use of these data is complicated by complex, nonlinear relationships between AGB and vegetation indices (Lu et al. 2016). Active sensor remotely-sensed data such as Airbone Laser Scanning (ALS) or radar may provide improved accuracy of AGB models (Ryan et al. 2012; Mitchard et al. 2013; Mauya et al. 2015; Næsset et al. 2016), although this is not guaranteed (Solberg et al. 2015). Another way to improve AGB estimation is to incorporate percent canopy cover (CC) as a predictor variable (Tiwari and Singh 1984; Lefsky et al. 2002; Hall et al. 2006; González-Roglich et al. 2014; GOFC-GOLD 2015), particularly since percent canopy cover is more readily estimated by optical remotely-sensed data (Lefsky and Cohen 2003; Lu et al. 2016).

González-Roglich and Swenson (2016) capitalized on the relationship between AGB and CC for modeling forest carbon in temperate savannahs of South America. They first developed a model to estimate CC using Landsat 5 Thematic Mapper along with topographic and climatic variables, and then used the estimated CC as the sole predictor to estimate forest carbon. Using this process, the percent root mean square error (RMSE) for prediction of forest carbon was 35 %. Similar studies in arid woodlands of Australia (Suganuma et al. 2006) and Sudan (Wu et al. 2013) also reported favorable results using estimated CC to estimate AGB. One complication is that the definition of CC varies (Jennings et al. 1999). In some cases, CC was defined as the percentage of the sky blocked by tree crowns over a hemispherical view (Jennings et al. 1999), often measured using a spherical densiometer. Alternatively, the vertical projection of tree crowns onto the ground is currently used in many countries as the basis for minimum forest area definition (FAO 2014). González-Roglich and Swenson (2016) used the former definition, while Suganuma et al. (2006) and Wu et al. (2013) used the latter. Therefore, it is an open question as to which measure of CC may better relate to AGB.

Models to estimate AGB that also incorporate information on topography, soils, and disturbances may further improve the accuracy (Moisen and Frescino 2002; Powell et al. 2010; Pflugmacher et al. 2014; GOFC-GOLD 2015). In the miombo ecoregion, as in other ecosystems, ecological and physiological factors limit vegetation establishment and survival, while disturbances result in changes to vegetation structure and composition (Frost 1996; Chidumayo et al. 1996; Sankaran et al. 2008). Soil characteristics, fire regimes, and anthropogenic use have all been identified as principal determinants of vegetation structure (Timberlake and Chidumayo 2011; Ryan and Williams 2011). Investigations of these and other variables for estimating miombo AGB appear to be sparse, and further exploration is warranted.

The main goal of this research is to investigate methods for estimating total AGB for a study area within the miombo ecoregion by using ecologically relevant predictor variables. For this purpose, we compared models established within a model-based framework, where inference is derived from the assumption that the y i observations of AGB are considered realizations of the random variable Y i , which in turn are realizations of a random process termed a superpopulation (Gregoire 1998; Ståhl et al. 2016). To assess these models, we compared outcomes to a simple design-based estimate commonly used in forest inventories. The design-based estimate assumes that the population of Y i is fixed, not random, while the sample of y i observations is the realization of a random process, and inference is independent from assumptions regarding the probability distribution of Y i in the population (Gregoire 1998; Ståhl et al. 2016). In this framework, we then addressed the following specific questions: 1) Which definition of CC is more useful as a single predictor variable to estimate AGB, the hemispherical view or the vertical projection view? 2) Does AGB estimation accuracy improve if CC is combined with other predictor variables? 3) What model-based AGB estimation method performs best in terms of fit and validation statistics? and 4) How do model-based methods compare to a simple design-based estimate of the sample mean? This research is intended as a case study to help inform development of forest monitoring programs which apply National Forest Inventory (NFI) data to estimation of AGB at sub-national or national scales.

Methods

Study area

This study was carried out in Nyimba District (14°-15°S, 30°-31°E, ~1,000,000 ha), Eastern Province, Zambia (Fig. 1). Nyimba lies in the center of the miombo ecoregion, a biome of diverse vegetation types dominated by tree species from the Caesalpinioideade sub-family of leguminous plants (Timberlake and Chidumayo 2011). Across the ecoregion, vegetation composition and structure varies depending on climate, soil, landscape position, and level of disturbance (Frost 1996; Timberlake and Chidumayo 2011). Nyimba is found within the dry miombo ecozone and is dominated by four vegetation types (Table 1; GRZ 1976; Timberlake and Chidumayo 2011). Approximately 75 % of the district can be characterized as forest land, based on canopy cover ≥ 10 % per 0.5 ha area (Halperin et al. 2016). Nonforest land uses are dominated by small-scale agriculture (Gumbo et al. 2016). Fire is a frequent disturbance, with an estimated return interval of one to three years (Frost 1996; Timberlake and Chidumayo 2011). Fires generally result from people engaged in agricultural land clearing, charcoal making, and hunting, with early dry-season fires exhibiting less intensity than late dry-season fires (Frost 1996). Average rainfall is approximately 600–900 mm per year (GRZ 1968); most rain falls in the December to February period, with a distinct dry season from May to November. Soils are generally characterized as Lithosol-Cambisols on the hills and plateaus, and Fluvisol-Vertisols in the valleys (GRZ 1986). Elevation ranges from 450 m along the Luangwa River valley bottom to 1000 m on the plateau area around Nyimba district capital, and higher on the mountain tops in the western half of the district.

Fig. 1
figure 1

Location of Nyimba District, Zambia within the miombo ecoregion. Inset: National Forest Inventory grid within Nyimba District

Table 1 Characteristics of the four dominant vegetation types found in Nyimba District, Eastern Province, Zambia

Ground plot measurements

Ground plot measurements were collected at 64, 0.1 ha permanent plots from May – December 2013 (Fig. 1). The ground plots were located on a systematic 10 km × 10 km grid corresponding to the Zambian Integrated Land Use Assessment 2 (ILUA2), an NFI program implemented by the Zambian Forest Department in collaboration with the FAO (GRZ 2014). Nationally, the Zambian Forest Department collects field data at approximately 20 % of the grid intersections using a cluster plot design. Each cluster plot has four 20 m × 50 m rectangular subplots, and each subplot has a nested 10 m × 20 m microplot starting at the same origin as the subplot. In this study, we collected data only for the first ground subplot of a cluster to increase the number of locations within time constraints, thereby capturing variations in conditions and land uses across the landscape. As such, we refer to the 20 m × 50 m area where we collected ground measurements as the plot, and the 10 m × 20 m area as the subplot.

Measurements within each ground plot followed the ILUA2 protocol (GRZ 2014). Species, total height, and diameter at breast height (DBH; outside bark diameter at 1.3 m above ground) were measured and recorded for each live tree with DBH ≥ 10 cm at the plot level and each live tree ≥ 5 cm and < 10 cm DBH at the subplot level. These tree attributes were used to calculate tree-level AGB, using species-specific allometric models developed by Chidumayo (2012) and utilized by the Zambian Forest Department (GRZ 2014). In some cases, species-specific models were not available and forest type models were used (Chidumayo 2012). We summarized tree-level AGB in plots and subplots to obtain plot-level AGB in tonnes per hectare (t∙ha−1).

We also measured percent CC for each ground plot using a spherical densiometer and recorded the dominant vegetation type (i.e., miombo woodland, mopane woodland, munga woodland, or riparian forest, GRZ 2014). We recorded ‘nonforest’ if the ground plot was dominated by nonforest land use with none of the vegetation types present. Ground plot coordinates were determined using a Trimble GeoXT, which were differentially corrected (+/− 1 m). Lastly, we gathered GPS line data for roads and point data for village locations. Most of the ground plots (49) were visited between 5 May and 30 July 2013, while the remaining plots (15) were visited between 27 Oct and 6 Dec 2013.

Predictor variables

We employed two types of estimated CC as a single predictor variable in separate models to estimate AGB. First, CC was estimated as the vertical projection of tree crowns onto the ground using a binomial generalized additive model (GAM) model developed for the same study area (Halperin et al. 2016), referred to as \( {\hat{CC}}_{\mathrm{VERT}} \). Second, CC was estimated as the percentage of the sky blocked by tree crowns over a hemispherical view by using the densiometer. For this purpose, we built a binomial GAM to estimate percent CC using the same predictor variable selection methods described in Halperin et al. (2016), and refer to its estimates as \( {\hat{CC}}_{\mathrm{HEMI}} \). We compared the results of estimating AGB using either \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \), and used the better performing model as the base upon which to add other predictor variables.

Disturbance factors, such as small scale conversion to agriculture, unmanaged harvesting for charcoal and timber, and fire are known to affect miombo AGB (Chidumayo 1988; Frost 1996; Timberlake and Chidumayo 2011). Proximity to roads and villages has been shown to be related to increasing levels of anthropogenic disturbance and decreased AGB (Helmer et al. 2008; Ahrends et al. 2010). We calculated Euclidean distance from each ground plot center to roads and to villages, based on the GPS line and point data collected during field surveys (Table 2). Fire frequency and intensity both impact miombo AGB, where low intensity burns are more characteristic of the early dry season when grass is still curing, and higher intensity burns are more likely to occur in the late dry season (Ryan and Williams 2011). We included monthly fire history data for 500 m pixels for 2003–2013, derived from the MODIS Burned Area Product (Boschetti et al. 2013). We processed the fire history data and determined the number of low intensity, early season fires and high intensity, late season fires per pixel (Table 2).

Table 2 Description of disturbance and environmental predictor variables for estimating above-ground biomass at the ground plot level

Underlying factors that affect miombo AGB include topography and soils (Frost 1996; Timberlake and Chidumayo 2011). We assessed landscape position using the compound topographic index (CTI), distance to perennial rivers, and slope percent. These variables were calculated based on 30 m Shuttle Radar Topography Mission Digital Elevation Model (USGS 2004) (Table 2). ISRIC – World Soil Information has produced a consistent and comprehensive data source for soils information at 250 m pixel size, covering the continent of Africa (Hengl et al. 2015). The data is provided as estimates at six depths (2.5, 10, 22.5, 45, 80, and 150 cm). We downloaded these data, clipped each dataset to the study area boundary, and calculated the mean across all depths for each variable (Table 2).

The use of predictor variables developed from optical satellite imagery in models to estimate AGB has a long history. However, the utility of various bands, indices, or texture values to use as predictor variables still remains an area of active research (Banskota et al. 2014; Lu et al. 2016). We downloaded Landsat 8 OLI data from the Landsat Ecosystem Disturbance Adaptive Process (LEDAPS) archive that spatially and temporally corresponded to the study area and the ground measurements (Path 170, Row 70; Path 171, Row 70). LEDAPS data are processed to surface reflectance (Masek et al. 2006); therefore, we did not perform atmospheric correction or radiometric normalization. We created one mosaic from the two scenes that were found to be in acceptable agreement with the GPS line data for roads based on visual assessment. We derived 24 predictor variables from Landsat 8 OLI satellite imagery in order to assess the use of a wide range of spectral and textural attributes for predicting AGB (Table 3).

Table 3 Description of remotely-sensed predictor variables generated from Landsat 8 OLI, used for estimating above-ground biomass at the ground plot level

Before selecting pixels to match with ground plots for all predictor datasets, we masked out major rivers and paved roads by on-screen digitizing using Landsat data at 1:10,000. All spatial data were clipped to the study area boundary. To account for spatial extent mismatches between ground plot size and predictor variables, we resampled all predictor variable spatial data to a 20 m × 50 m pixel size using a nearest neighbor method which maintained the actual pixel values. The values for each predictor variable were assigned to ground plots, based on proximity of the ground plot center to the nearest pixel center for each predictor variable. Preparation of spatial predictor variables was conducted using the R raster package v.2.3-41 (Hijmans et al. 2015) and ArcGIS v. 10.1 (ESRI 2012).

AGB Models

Models using CC only

As noted, we estimated AGB first using either \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \) as a single predictor variable using three models. A semiparametric GAM was used since it is well-suited to modeling complex relationships where an underlying nonlinear relationship is not known a priori (Wood 2006). Because of their flexibility, GAMs have been used to estimate forest attributes from remotely-sensed data (Moisen and Frescino 2002; Kattenborn et al. 2015). The form of the GAM is:

$$ \log \left({\hat{AGB}}_i\right)={\upbeta}_0 + {\displaystyle \sum_{j=1}^J}{f}_j\left({\mathrm{x}}_{ij}\right) + {\upvarepsilon}_i, $$
(1)

where i is the ground plot, AGB i is in t∙ha−1, β0 is the intercept, ε i is the residual error, and f j is the smoothing function for a predictor variable x j created via a regression spline (Wood 2006). The GAM was fit with a normal distribution, the log link, and the default variance function. Using the log link avoids negative AGB estimates. A value of 0.1 was added to the AGB for two ground plots with no trees. We found that setting the maximum smoothing parameter at 2 was sufficiently flexible to account for potential nonlinearity in the model, yet reduce the possibility of overfitting that may occur in GAM’s (Moisen et al. 2006; Halperin et al. 2016). The GAM was fit with the mgcv package v. 1.8-9 in R v. 3.2.2 (Wood 2015; R Development Core Team 2015).

We also fit two nonlinear models that restricted the AGB estimate to be positive with an upper limit. First, we used a sigmoidal model (e.g., McRoberts et al. 2015):

$$ {\hat{AGB}}_i = \frac{\upalpha}{1+ \exp \left({\upbeta}_0 + {\displaystyle {\sum}_{j=1}^J}{\upbeta}_j{\mathrm{x}}_{ij}\right)} + {\upvarepsilon}_i, $$
(2)

where parameters are α and the β’s, and the other terms are as in Eq. 1. Second, we used an exponential model, also used for estimating AGB (e.g., Anaya et al. 2009):

$$ {\hat{AGB}}_i = \upalpha\ x\ \exp \left({\displaystyle \sum_{j=1}^J}\raisebox{1ex}{${\upbeta}_j$}\!\left/ \!\raisebox{-1ex}{${\mathrm{x}}_{ij}$}\right.\right) + {\upvarepsilon}_i. $$
(3)

To fit these nonlinear models, we used the Gauss-Newton method in the nls function in R v.3.2.2. Starting parameters were estimated using the linearized form of each nonlinear model (i.e., log (AGB i )). After experimenting with a flexible or fixed asymptote α, we found that setting α = 500 produced realistic estimates. This is in line with Næsset et al. (2016), who truncated miombo AGB estimates from linear ordinary least squares models at twice the maximum observed value in their sample data. For each fitted model (three models and \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \)), we used 20-fold cross-validation and calculated the root mean square prediction error (RMSPE) and RMSPE as a percent of the mean (RMSPE%), defined as:

$$ \mathrm{RMSPE} = \sqrt{{\displaystyle \sum_{i=1}^n}\frac{{\left({y}_{iv} - {\hat{y}}_{iv}\right)}^2}{n}} $$
(4)

where y iv is the AGB (t∙ha−1) of ground plot i not used for fitting in fold v, \( {\hat{y}}_{iv} \) is its corresponding estimate, and n = the number of ground plots (64). RMSPE% was calculated as:

$$ \mathrm{RMSPE}\% = \left(\frac{\mathrm{RMSPE}}{\overline{y}}\right)*100, $$
(5)

where \( \overline{y} \) is the mean AGB (t∙ha−1) calculated from the ground plots. We then calculated and compared the mean RMSPE and mean RMSPE% across the three models using \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \). Based on these validation statistics, we included either \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \) in models with multiple predictor variables.

Models using multiple predictor variables

Next, we investigated whether expanding the models with increasingly diverse combinations of predictor variable sets could improve the accuracy of AGB estimates. For this purpose, we compared models using: \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \) alone; remotely-sensed (Landsat) variables alone; environmental and disturbance (Env/Dist) variables alone; and combinations of each predictor variable set (i.e., Landsat and Env/Dist, \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \) and Landsat, etc.). A genetic algorithm (GA) was used to perform predictor variable selection within each predictor variable set. The GA has been demonstrated to provide near optimality for predictor variable identification and subsetting when there are a large, diverse range of possible predictor variables (Tomppo and Halme 2004; Garcia-Gutierrez et al. 2014); a description of the algorithm is given in Scrucca (2013). We set the objective criteria (i.e., “fitness”) to minimize the RMSPE, and established the GA stopping criteria at 20 consecutive generations with no improvement in fitness, or a maximum of 100 generations. We restricted the number of variables under consideration to a maximum of six in each predictor variable set, in order to decrease the risk of near multicollinearity and overfitting. For each predictor variable set (except \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \) alone), we implemented the GA one time and recorded the lowest RMSPE. We then calculated and compared mean RMSPE and mean RMSPE% for each predictor variable set, and chose the predictor variable set that minimized these two fit statistics. The GA analysis was conducted using the GA package v.2.2 in R (Scrucca 2013).

Final variable selection

We performed final predictor variable selection using the predictor variable set that minimized RMSPE in the previous section. For this purpose, we implemented the GA 100 times for each model using the same fitness and stopping criteria as before, and selected the subset of predictor variables that minimized RMSPE across all 100 GA implementations. For the sigmoidal and exponential models, we also considered commonly used transformations (i.e., square, square root, inverse, etc.) for each predictor variable, once the predictor variables were chosen through this process. Use of transformations for predictor variables has been examined in nonlinear models for improving model performance (Wang et al. 2007; Timilsina and Staudhammer 2012). To compare performance among the three models, we computed the following fit statistics: root mean square error (RMSE), RMSE as a percent of the mean (RMSE%), mean difference (MD), and Pseudo-R2 as follows:

$$ \mathrm{RMSE} = \sqrt{{\displaystyle \sum_{i=1}^n}\frac{{\left({y}_i - {\hat{y}}_i\right)}^2}{n}}, $$
(6)
$$ \mathrm{RMSE}\% = \left(\frac{\mathrm{RMSE}}{\overline{y}}\right)\ \mathrm{x}\ 100, $$
(7)
$$ \mathrm{M}\mathrm{D} = {\displaystyle \sum_{i=1}^n}\frac{\left({y}_i - {\hat{y}}_i\ \right)}{n}, $$
(8)
$$ \mathrm{Pseudo}\hbox{-} {R}^2 = 1-\frac{{\displaystyle {\sum}_{i=1}^n}{\left({y}_i - {\hat{y}}_i\right)}^2}{{\displaystyle {\sum}_{i=1}^n}{\left({y}_i - \overline{y}\right)}^2}, $$
(9)

where y i and \( {\hat{y}}_i \) are the observed and estimated AGB (t∙ha−1) for ground plot i, and n and \( \overline{y} \) were defined in Eq. 4 and Eq. 5, respectively. We also calculated 20-fold validation statistics, namely RMSPE (Eq. 4) and the mean prediction difference (MPD) defined as:

$$ \mathrm{M}\mathrm{P}\mathrm{D} = {\displaystyle \sum_{i=1}^n}\frac{\left({y}_{iv} - {\hat{y}}_{iv}\ \right)}{n}, $$
(10)

where y iv , \( {\hat{y}}_{iv} \), and n were defined for Eq. 4.

Uncertainty assessment and model selection

After choosing the best performing predictor variable subset for each model, we estimated the AGB total (tonnes), 95 % confidence interval (CI) for the total, and AGB (t∙ha−1) for the entire study area. We accomplished this using a bootstrap method involving resampling n samples from the ground plot data with replacement using equal probability sampling, following recommendations from McRoberts et al. (2011) and based on work by Efron and Tibshirani (1986). Using 500 bootstrap samples, we refitted each model to the predictor variable data covering the entire study area, and estimated AGB for every 20 m × 50 m pixel. Because the GAM is not asymptotic, we limited the maximum estimated value for AGB (t∙ha−1) to the same value as the asymptote for the sigmoidal and exponential models, as is commonly done in non-asymptotic models (e.g., Næsset et al. 2016). For each model, we estimated total AGB (\( {\hat{\mathrm{Y}}}_{\mathrm{boot}} \)) using the following equation:

$$ {\hat{\mathrm{Y}}}_{\mathrm{boot}}={\displaystyle \sum_{k=1}^{{\mathrm{n}}_{\mathrm{boot}}}}\frac{{\hat{\mathrm{Y}}}_{\mathrm{boot}}^k}{{\mathrm{n}}_{\mathrm{boot}}}, $$
(11)

where \( {\hat{\mathrm{Y}}}_{\mathrm{boot}}^{\mathrm{k}} \) is the estimated total AGB by summing all pixels (in tonnes) across the study area for bootstrap k, and nboot is 500. AGB/ha was calculated from dividing \( {\hat{\mathrm{Y}}}_{\mathrm{boot}} \) by the number of ha in the study area. We estimated the variance for \( {\hat{\mathrm{Y}}}_{\mathrm{boot}} \) using the following equation:

$$ \hat{\mathrm{Var}}\left({\hat{\mathrm{Y}}}_{\mathrm{boot}}\right)={\displaystyle \sum_{\mathrm{k}=1}^{{\mathrm{n}}_{\mathrm{boot}}}}\frac{{\left({\hat{\mathrm{Y}}}_{\mathrm{boot}}^{\mathrm{k}} - {\hat{\mathrm{Y}}}_{\mathrm{boot}}\right)}^2}{{\mathrm{n}}_{\mathrm{boot}}-1}. $$
(12)

We used this variance estimate to calculate a 95 % confidence interval (CI) for estimated total AGB based on the normal distribution (Efron and Tibshirani 1986), and reported the half-width of the CI for each model. To establish a baseline for comparing model performance, we fit a null model with no predictor variables. Using this null model, we estimated the AGB total, 95 % CI for the total, and AGB (t∙ha−1) (Bater et al. 2009). Then, we calculated the relative efficiency for each model by dividing the null model variance estimate by each model’s variance estimate, following Næsset et al. (2016). Finally, we compared the four model-based estimates of total AGB, AGB/ha, and CI to a simple design-based estimate of AGB total, 95 % CI for the total, and AGB (t∙ha−1) using the NFI systematic sample of ground plots, assuming equal probability sampling (Cochran 1977).

Based on these statistics, along with observed versus predicted AGB graphs, we chose one model and used this to estimate AGB for each 20 m × 50 m pixel in the study area. To assess the pixel-level variability of the AGB estimates, we calculated the Coefficient of Variation (CV) in percent (i.e., standard deviation per pixel across the 500 resamples divided by the corresponding mean pixel value, multiplied by 100, which was also used by Carreiras et al. 2013; Faßnacht et al. 2014; Kattenborn et al. 2015), present a map of CV, and report the mean CV for the study area.

Results

Models using CC only

Observations of AGB (t∙ha−1) and CCVERT (%) for the miombo woodland vegetation type follow a more regular pattern than the mopane woodland and riparian forest vegetation types, which exhibit a large range of observed AGB values at low observed values of CCVERT (Fig. 2). Observations of CCHEMI greater than 50 % display a wide range of observed AGB values; however, observed AGB values corresponding to CCHEMI less than 50 % follow a more regular pattern. As expected, the six nonforest ground plots which were measured have low observed AGB and low observed CC for both forms of CC. A binomial GAM was used to fit separate models for CCVERT and CCHEMI; the model to estimate \( {\hat{CC}}_{\mathrm{VERT}} \) had better fit and validation statistics than the model to estimate \( {\hat{CC}}_{\mathrm{HEMI}} \) (Table 4). However, in predicting AGB, clear differences are apparent in the usefulness of CC as the percentage of the sky blocked by tree crowns over a hemispherical view (\( {\hat{CC}}_{\mathrm{HEMI}} \)) versus CC as the vertical projection of tree crowns onto the ground (\( {\hat{CC}}_{\mathrm{VERT}} \)). Across three model forms, \( {\hat{CC}}_{\mathrm{HEMI}} \) had an average RMSPE% that was 18 % lower than for \( {\hat{CC}}_{\mathrm{VERT}} \) (Table 5). The exponential model performed poorly with RMSPE (72.1 t∙ha−1) larger than the mean estimate of AGB t∙ha−1 from the sample data (64.1 t∙ha−1). Considering only the GAM and sigmoidal models, the average difference in RMSPE between \( {\hat{CC}}_{\mathrm{VERT}} \) and \( {\hat{CC}}_{\mathrm{HEMI}} \) as a single predictor variable was 5 t∙ha−1 in favor of \( {\hat{CC}}_{\mathrm{HEMI}} \). Because of these statistics and relationships, we chose \( {\hat{CC}}_{\mathrm{HEMI}} \) for estimating AGB in models with multiple predictor variables.

Fig. 2
figure 2

Top row: Scatterplot for two forms of observed canopy cover (CC) versus observed above-ground biomass (AGB), by vegetation type or land use. CCHEMI is the vertical projection of tree crowns onto the ground, derived from crown area-DBH models; CCHEMI is percentage of the sky blocked by tree crowns measured using a spherical densiometer. Bottom row: Scatterplot for two forms of predicted canopy cover (CC) versus observed above-ground biomass (AGB), by vegetation type or land use, with three models to estimate AGB based on \( {\hat{CC}}_{\mathrm{VERT}} \) or \( {\hat{CC}}_{\mathrm{HEMI}} \)

Table 4 Goodness-of-fit and validation statistics for two binomial generalized additive models to estimate percent canopy cover based on crown radius-DBH models (\( {\hat{CC}}_{\mathrm{VERT}} \)) and measurements using a spherical densiometer (\( {\hat{CC}}_{\mathrm{HEMI}} \))
Table 5 Above-ground biomass (AGB, t∙ha−1) root mean square prediction error (RMSPE) using different predictor variable dataset combinations and models, ordered by mean RMSPE and mean RMSPE%

Models using multiple predictor variables

A clear pattern in reduction of RMSPE can be seen by including more diverse combinations of predictor variable datasets (Table 5). Across models, \( {\hat{CC}}_{\mathrm{HEMI}} \) alone is a better predictor of AGB than models that use only Landsat 8 OLI predictor variables; however, the exponential model that uses Landsat 8 OLI has a slight edge over \( {\hat{CC}}_{\mathrm{HEMI}} \) (RMSPE 45.9 vs. 46.6, respectively). The predictor variable combination of environmental and disturbance data alone and \( {\hat{CC}}_{\mathrm{HEMI}} \) alone are comparable with two of the three combinations that have two predictor variable datasets (Landsat + Env/Dist and \( {\hat{CC}}_{\mathrm{HEMI}} \) + Landsat), with a difference in mean RMSPE% of only three percent. Noticeable improvements in AGB RMSPE% are seen when combining \( {\hat{CC}}_{\mathrm{HEMI}} \) with environmental and disturbance data, as mean RMSPE% drops seven percent. However, the greatest improvement is when performing variable selection using all three predictor variable datasets combined, with a mean RMSPE% of 58 %. Therefore, we performed final variable selection using this combination.

Final variable and model selection

One influential observation, which likely resulted from haze in the remotely-sensed data, was evident across all models. As the influence could be considered uncommon given this dataset, we discarded the observation and continued with the analysis, following the recommendation of Kutner et al. (2005, p.438). Using the final selected variables, we explored use of predictor variable transformations for the sigmoidal and exponential models. By incorporating squared terms in the sigmoidal model for two predictor variables (\( {{\hat{CC}}_{\mathrm{HEMI}}}^2 \) and pH2), we found an improvement of 4 % in terms of RMSE% and an overall better fit in graphs of observed versus estimated values of AGB. The exponential model did not improve by incorporating any transformed predictor variables. However, we found that the fit and validation statistics of the exponential model remained the same after removing one variable (EVI.texture); therefore, we used a subset of five predictor variables for this model. With the GAM, three variables were estimated with a smoothing parameter of 1 (NBR, CTI, bulk density); therefore we fit the model with these variables as parametric terms, as recommended by Wood (2006).

For each of the three models, average RMSE% was less than 60 % and average RMSPE% was less than 65 % (Table 6). The mean prediction difference (MPD) for the sigmoidal model was the highest at 3.2 t∙ha−1; however, mean differences (MD) across all models were less than 1 t∙ha−1. Despite the difference in MPD, the fit and validation statistics for the GAM and the sigmoidal models were otherwise similar, while the exponential model had the worst fit (Fig. 3). Each model had a tendency to under-estimate AGB at higher values, which was most pronounced in the exponential model; however, observations and estimations were relatively balanced along the 1:1 line up to approximately 100 t∙ha−1.

Table 6 Fit and validation statistics for predicting miombo above-ground biomass (AGB, t∙ha−1) using three models
Fig. 3
figure 3

Observed versus estimated above-ground biomass (AGB) using predicted canopy cover, environmental, disturbance, and remotely-sensed predictor variables. The solid line indicates 1:1 correlation between observed and estimated values

As expected, \( {\hat{CC}}_{\mathrm{HEMI}} \) was selected as a predictor variable in all three models. Vegetation indices were chosen more consistently over raw band values or texture indices among the Landsat 8 OLI predictor variables. The Normalized Difference Moisture Index (NDMI) was chosen in both the GAM and the sigmoidal model, the Normalized Burn Ratio (NBR) was selected as another index in the GAM, while the common Normalized Difference Vegetation Index (NDVI) was selected as the only vegetation index in the exponential model. In the GAM and the sigmoidal model, NDMI was selected in conjunction with topographic or soil variables related to soil moisture (AWC, CTI, D2hydro). Available soil water capacity (AWC) was selected in both the sigmoidal and exponential models. Frequency of late season fire was the only disturbance variable selected, and only in the sigmoidal model.

When each model was applied across the entire study area, AGB ranged from 61.9 to 67.3 t∙ha−1 and total AGB ranged from 64.3 million tonnes to 69.9 million tonnes (Table 7). For comparison, the null model estimate and the simple design-based estimate were both within this same range. All of the model-based methods, including the null model, exhibit confidence intervals that are more precise than the design-based estimate by an order of magnitude. Interestingly, even though the sigmoidal model exhibited the highest mean prediction bias, it had the smallest confidence interval based on bootstrap resampling. Conversely, while the GAM demonstrated the best fit and validation statistics across the three models where variable selection was performed, it had a confidence interval wider than the other models. The sigmoidal model and the exponential model were both more efficient than the null model, while the GAM was less efficient. Since the sigmoidal model had fit and validation statistics similar to the GAM, yet had higher relative efficiency than the GAM, we selected the sigmoidal model to estimate AGB and the corresponding CV for each 20 m × 50 m pixel in the study area.

Table 7 Bootstrap results for mean and total above-ground biomass (AGB) with corresponding 95 % confidence intervals for four model-based methods, compared to a design-based method

AGB estimations per pixel using the sigmoidal model ranged from 0 to 390 t∙ha−1, where lower values were more common around areas with higher population and higher values were more common in remote areas in the Luangwa River valley (Fig. 4). Estimates of CV per pixel generally were highest where AGB was the lowest, with a large area of high CV on the eastern boundary and two small valleys in the west half of the study area. The area of high CV on the eastern boundary is dominated by agricultural, nonforest land use; whereas the two small valleys appear to be dominated by grassland. The average CV per pixel across the whole study area was 31 %, with a standard deviation of 14.

Fig. 4
figure 4

Map of estimated above-ground biomass (AGB) per pixel (0.1 ha) and the Coefficient of Variation per pixel (CV). AGB estimates derived from a sigmoidal model

Discussion

This research provides encouraging results for using a model-based approach to estimate miombo AGB with National Forest Inventory data. The usefulness of canopy cover as a predictor variable in estimating AGB was clear. Similar results were found by Hall et al. (2006) for boreal forests, by Wu et al. (2013) for tropical woodlands in Sudan, and by González-Roglich and Swenson (2016) for wooded savannas in Argentina. Hall et al. (2006) and González-Roglich and Swenson (2016) used CC as the percentage of the sky blocked by tree crowns over a hemispherical view, whereas Wu et al. used CC as the vertical projection of tree crowns onto the ground. In our case study, CCHEMI (i.e., the hemispherical view of canopy cover) demonstrated a better relationship to AGB for two possible reasons. First, taller trees project more cover in a hemispherical measurement device such as the spherical densiometer (Jennings et al. 1999), and tree height has been shown to be an important covariate in predicting tree-level biomass (Chave et al. 2014). Second, CCVERT (i.e., vertical projection) relies on DBH-crown radius models (Burrows and Strang 1964; Fuller et al. 1997; Isango 2007) to estimate canopy cover per ground plot; these models are outdated and not localized in our study (Halperin et al. 2016). Updating DBH-crown radius models may improve the usefulness of CCVERT as a predictor variable in AGB models for the miombo ecoregion and should be an area of active research. This is especially true for less common vegetation types such as mopane woodlands and riparian forests.

Improvements in AGB estimates were observed when including a diverse range of ecologically meaningful covariates along with \( {\hat{CC}}_{\mathrm{HEMI}} \). In this context, two aspects of moisture stand out. First, soil moisture was characterized by available soil water content (AWC, Hengl et al. 2015) in the sigmoidal and exponential models, or through landscape position via the compound topographic index (CTI) and distance to perennial water in the GAM. Soil moisture has been identified as an underlying factor contributing to miombo vegetation dynamics (Frost 1996; D’Odorico et al. 2007). For example, higher levels of soil moisture have been found under closed canopy miombo woodland compared with open canopy miombo woodland (Campbell et al. 1988).

Second, vegetation moisture was captured through the Normalized Difference Moisture Index (NDMI), an important variable in both the GAM and sigmoidal models. NDMI has been shown to be correlated with tree canopy water content and is competitive with the more well-known Normalized Difference Vegetation Index (NDVI) for a variety of forest mapping purposes (Hunt and Rock 1989; Jin and Sader 2005). This study targeted collection of field and remotely-sensed data for the early dry season, which is considered a prime period for linking these observations in the miombo ecoregion. In this timeframe, grass layers have generally senesced while the deciduous tree canopies have not yet fully shed their leaves (Fuller et al. 1997). As such, NDMI appears to be an important vegetation index for characterizing vegetation moisture in tree canopies, and higher canopy moisture could be related to greater levels of AGB (D’Odorico et al. 2007).

Our study area is in the dry miombo ecozone where rainfall averages less than 1000 mm∙yr−1 (Timberlake and Chidumayo 2011). Comparing our results with similar studies in the same ecozone, Næsset et al. (2016) estimated mean AGB at 51.3 t∙ha−1, Mauya et al. (2015) at 65.8 t∙ha−1, Ryan et al. (2011) at 59.4 t∙ha−1, and Ribeiro et al. (2008) at 70 t∙ha−1, while Carreiras et al. (2013) estimated AGB at 25.7 t∙ha−1. However, Carreiras et al. only studied areas with canopy cover < 50 %. The review by Frost (1996) provided an AGB estimate of 55 t∙ha−1for dry miombo of Zambia and Zimbabwe, and in the most recent FAO Forest Resources Assessment for Zambia, the national estimate of AGB was 84 t∙ha−1 (FAO 2014). Across models, our AGB estimates ranged from 61.9 to 67.3 t∙ha−1, where the best performing model and the null model estimated 62.1 and 62.8 t∙ha−1, respectively. Clearly, our estimates of AGB for a study area in the Zambian dry miombo ecozone are in line with other studies.

Comparing accuracy statistics with other published results is challenging, given the number of studies using similar data sources in the same ecozone, as well as the use of a wide variety of statistics. Therefore, a comparison of similar studies estimating AGB in other miombo ecozones or dry forest ecoregions is necessary. Some researchers presented model fit statistics such as R 2 (Wu et al. 2013; Kashindye et al. 2013), while others emphasized RMSE (Mitchard et al. 2013; Solberg et al. 2015). The latter two estimated RMSE at 12.8 t∙ha−1 (22 % CV) in Mozambique and 40.3 t∙ha−1 (78 % CV) in Tanzania, respectively. However, fit statistics are known to be optimistic when concerning prediction (Guisan and Zimmermann 2000; Packalén et al. 2012). Using resampling methods to estimate prediction error for AGB in Mozambique, Ryan et al. (2012) and Carreiras et al. (2013) reported RMSPE of 19.6 t∙ha−1 and 5.0 t∙ha−1, respectively, where the latter reported a mean CV of 25 %. In a study site in Tanzania with similar vegetation and rainfall to our study area, Mauya et al. (2015) and Næsset et al. (2016) reported RMSPE% of 47 and 62 %, respectively. These latter studies all employed active sensor (radar or ALS) imagery as the single source for predictor variables.

Research on the use of optical sensor satellite imagery for estimating AGB in the miombo ecoregion is more limited. Næsset et al. (2016) used a linear ordinary least squares model and a stepwise predictor variable selection method. As potential predictor variables, these authors included canopy cover predictions and canopy cover gain or loss from Hansen et al. (2013), raw band values from Landsat 7 ETM+ and Landsat 8 OLI, NDVI calculated from the Landsat data, and square and square root transformations for all of these predictor variables. Based on their stepwise predictor variable selection procedure, squared canopy cover was the only predictor variable chosen. While their finding helps to corroborate our result of canopy cover and squared canopy cover in the sigmoidal model, their model resulted in poor AGB prediction with an RMSPE% of 87 %. It is possible that their model using Landsat data was disadvantaged by two aspects. First, they did not include other commonly used vegetation indices (i.e., NDMI, NBR) or texture variables (i.e. standard deviation in a 3 × 3 window around the ground plot), which we found to be useful in our study, as have others (e.g., Lu et al. 2014). Second, the Landsat data that Næsset et al. used differed by at least a year from the time of their ground plot measurements.

Two studies that estimated miombo tree volume (m3∙ha−1) using Landsat also provide a useful comparison. Employing cross-validation to compute RMSPE%, Pereira (2006) reported 48 % in Mozambique using Landsat and k-Nearest Neighbor imputation, while Gara et al. (2015) averaged 62 % using Landsat and nonlinear models. Our chosen model (sigmoidal) has estimates of RMSE% at 47 and RMSPE% at 58 %, indicating that our fit and validation statistics are comparable to similar studies using large-scale inventories in the miombo ecoregion using optical remotely-sensed data (Pereira 2006) and active remote sensing data such as ALS or radar (Solberg et al. 2015; Mauya et al. 2015; Næsset et al. 2016). However, more research is needed to compare AGB confidence intervals to other studies because application of a model-based approach appears relatively uncommon in the miombo ecoregion.

The use of GAMs and nonlinear models, such as the sigmoidal and exponential models in this study, for predicting AGB has shown promise in a number of studies in different biomes (e.g., Moisen and Frescino 2002; Hall et al. 2006; Anaya et al. 2009; McRoberts et al. 2015; Kattenborn et al. 2015). However, it appears that most approaches to estimate and map AGB in the miombo ecoregion used linear ordinary least squares (Ryan et al. 2012; Kashindye et al. 2013; Mitchard et al. 2013; Solberg et al. 2015; Næsset et al. 2016), despite the nonlinear relationships between forest attributes and remotely-sensed data (Sedano et al. 2008; Banskota et al. 2014; Lu et al. 2016 ; Gara et al. 2015). We found GAM’s to perform as well as nonlinear models with respect to fit and validation statistics; however, when applied to the population of predictor data the bootstrap confidence interval was less efficient than the null model.

We protected against overfitting of the GAM by using cross-validation in the variable selection process (Guisan and Zimmermann 2000; Hawkins 2004; Packalén et al. 2012); however, the smoothing splines of GAM’s may perform in an unpredictable manner when applied to population-level data which includes ranges of values not within the observation data (Guisan et al. 2002; Venables and Dichmont 2004). Because of this, nonlinear models may be preferred for several reasons. First, the sigmoidal and exponential models in this study are asymptotic, thereby preventing both negative and unreasonably high estimations (McRoberts et al. 2015). Second, nonlinear models are able to achieve more stable extrapolation at the ends of, and beyond, the ranges of the values of the predictor variables (Venables and Dichmont 2004). Third, judicious application of transformations may improve performance if there appears to be a lack of fit in nonlinear models.

One issue with the nonlinear models we used in this study is estimating the asymptote. We initially investigated a flexible asymptote (α), with α estimated in the model fitting process. For our dataset, α was often estimated at unrealistically high values or the nonlinear models would not converge, both of which may indicate that the data does not support estimating a biologically meaningful asymptote. We further investigated estimating α as a function of the predictor variables in the models (i.e., α0 + α1 x i ); however, no improvements in model fit or biologically meaningful values of α were noted. In this context, Burkhart and Tomé (2012 p. 123) stated that if an asymptote cannot be realistically fit with the data at hand, then expert judgement may be employed to fix the asymptote at a reasonable value and continue with model fitting. AGB t∙ha−1 for most vegetation types in the dry miombo ecozone tend to be less than 100 t∙ha−1 (Frost 1996), although upper limits generally appear to be understudied. Mauya et al. (2015) found an upper limit of 350 t∙ha−1 in dry miombo woodland vegetation type for a study site in southeast Tanzania, while the riparian forest vegetation type may approach 500 t∙ha−1 (Chidumayo personal communication 2016). Future modeling efforts should assess practical asymptotic limits for AGB in all miombo vegetation types.

While this study made progress in terms of ecologically meaningful variable selection and highlighted model advantages and disadvantages, improvements may be realized in several aspects. First, we found that canopy cover as the percentage of the sky blocked by tree crowns over a hemispherical view (\( {\hat{CC}}_{\mathrm{HEMI}} \)) was a better predictor for AGB than canopy cover as the vertical projection of tree crowns onto the ground (\( {\hat{CC}}_{\mathrm{VERT}} \)). However, values of CCVERT, used to predict \( {\hat{CC}}_{\mathrm{VERT}} \), were derived from DBH-crown radius models (Burrows and Strang 1964; Fuller et al. 1997; Isango 2007) that should be updated for conditions in Zambia (Halperin et al. 2016). Once updated, \( {\hat{CC}}_{\mathrm{VERT}} \) may prove to be as useful as \( {\hat{CC}}_{\mathrm{HEMI}} \) as a predictor of AGB. Second, the highest percent CV pixel values occurred where AGB was predicted to be the lowest, likely because there were very few ground plots with no trees. To accommodate such under-sampled areas, off-grid ground plots could be included along with the NFI ground plots in implementation of the model-based methods. Model-based methods are not restricted to one single sample design as long as the same variable of interest (i.e., AGB) is being sampled according to the same definitions; the main consideration is that sample selection is required to be uninformative in terms of Y (Gregoire 1998).

Third, active sensor remotely-sensed data allow tree heights to be estimated and height is known be correlated with tree-level (Chave et al. 2014) and plot-level biomass (Wulder et al. 2012). ALS data is a proven source of height estimates, yet remains operationally out-of-reach for many countries (Ribeiro et al. 2012). Testing of this source of predictor variables for use in operational monitoring programs is just now underway in the miombo ecoregion (Mauya et al. 2015; Næsset et al. 2016). Additionally, significant challenges remain in using more readily available radar data (Sinha et al. 2015; Solberg et al. 2015). However, space borne laser scanning systems are under development (Moussavi et al. 2014; GOFC-GOLD 2015) and may prove useful in future applications.

Fourth, a model-assisted approach may lend itself to use of the GAM, balancing the variability in population-level estimation with more conservative inference that takes advantage of the design-based properties of the sample (Opsomer et al. 2007; Næsset et al. 2016). Fifth, there is significant opportunity to explore model-assisted or model-based approaches in a small area estimation context for improving both estimates and inference at spatial scales where traditional design-based approaches are not possible (Magnussen et al. 2014). Lastly, if a map of AGB is a requirement, then use of georeferenced predictor variables is required. However, if a map of AGB is not required, then our null model would provide a viable alternative that could also be applied within a small area estimation framework.

Conclusion

Estimating the amount and distribution of AGB is crucial for improving land management planning and for engaging in international discussions on REDD+. Our understanding of AGB is enhanced by including ecologically meaningful predictor variables in a model-based approach to estimation. This study generated several insights for improving miombo AGB estimation using NFI data. First, the biophysical relationship between AGB and CC can be exploited to improve AGB estimation. Second, model-based methods improve precision of AGB estimates by an order of magnitude over simple design-based estimation. Third, if a map of AGB is not required, estimating the confidence interval of total AGB using bootstrap resampling for a null model is a viable alternative to the traditional design-based confidence interval. Fourth, if a map is required, expensive predictor variables such as those derived from ALS, are not necessarily required as there is freely available spatial data (i.e., Landsat, Digital Elevation Model, soils, etc.) which can be used as predictor variables in the model-based framework. Fifth, a genetic algorithm provides an efficient way to perform predictor variable selection. Lastly, future approaches to improve our understanding of AGB distribution in the miombo ecoregion should include small area estimation.

Abbreviations

AGB, above-ground biomass; ALS, airborne laser scanning; CC, canopy cover; CV, Coefficient of Variation; GA, genetic algorithm; GAM, generalized additive model; NFI, National Forest Inventory; OLI, Operational Land Imager; REDD+, Reducing Emissions from Deforestation and forest Degradation; RMSE, root mean square error; RMSPE, root mean square prediction error; t∙ha−1, tonnes per hectare