Introduction

Globally, indigenous forests are under threat (Thompson et al. 2009). In 1991, forests (including indigenous forests) covered about one-third of the global land surface; however, a declining trend has been observed over the last two decades (FAO 2015). Forests provide various ecosystem services such as erosion protection, carbon storage, and water recycling (DeFries, 2013; Gill et al. 2017). The decline in forest area is due to changes in woody vegetation resulting from deforestation (or degradation) and anthropogenic activities (DeFries, 2013; Gill et al. 2017). Quantifying changes in forests plays a critical role in understanding carbon cycles at regional, national, and global scales (Bonan 2008; Heckel et al. 2020). Such quantifications also support international protocols such as the United Nations Reducing Emissions from Deforestation and Forest Degradation (REDD +) (Mitchell et al. 2017) and the Kyoto Protocol (O'Neill and Oppenheimer 2002).

Forest structural parameters are presented by various metrics, such as woody canopy cover (CC), height, above-ground biomass, and canopy volume. To detect changes in woody cover and understand the effects of degradation and deforestation on forests, it is critical to conduct an accurate quantitative and spatially explicit assessment of forest cover. CC is defined as the percent area projected vertically onto a horizontal plane by woody plant canopies (Jennings et al. 1999). CC is an essential biophysical parameter important for estimating carbon content and vegetation dynamics (Pereira et al. 2013). In South Africa, there is a lack of locally calibrated and validated CC datasets for reliable detection of changes in time series. As a result, the lack of accurate information on CC in South Africa’s indigenous forests contributes to uncertainties in current carbon stock estimates and limits scientific understanding of their potential contribution to long-term climate change.

Conventionally, woody structural parameters are typically measured using field-based methods such as plot inventories, horizontal point sampling, and line intersect sampling (e.g., Bester, 1999; Buitenwerf et al. 2012). Notwithstanding the advantages of field-based measurements, particularly in validating and calibrating models, their spatiotemporal representativeness is limited, for example, due to high cost, labour-intensive nature, and resource demands. Alternatively, satellite remote sensing provides data with a high spatiotemporal resolution that is used to assess structural parameters at a regional or global scale. This type of data is suitable for assessing woody plant structure at local, regional (Brandt et al. 2016; Gill et al. 2017), and global scales (Hansen et al. 2013; Sexton et al. 2013).

More recently, Synthetic Aperture Radar (SAR) remote sensing has emerged as a preferred tool for estimating woody plant structural parameters in savannas (Urbazaev et al. 2015; Naidoo et al. 2016; Skowno et al. 2017) as well as boreal forests (Saatchi and Moghaddam 2000) and temperate forests (Lucas et al. 2006). This rapid increase in the use of SAR is due to challenges associated with optical datasets, such as cloud or haze coverage (Anchang et al. 2020). In addition, some optical sensors have a problem with signal saturation, as they do not respond to changes in multi-layered dense forests, resulting in the inability of these sensors to distinguish between woody and non-woody cover (Zhao et al. 2016). The ability of SAR to determine the structural parameters of forests depends on the type of polarisation and wavelength used (Kellndorfer et al. 2019). SAR microwave pulses penetrate clouds, interact with vegetation cover (Santoro et al. 2007), and provide volumetric backscatter information based on canopy structure and environmental conditions (Lucas et al. 2010). Longer wavelengths such as L-band have a stronger and more universal relationship with woody structure (volume, biomass, and cover) than optical or short-wavelength sensors SAR, suggesting that L-band data may be suitable for modelling woody cover in the indigenous forests of South Africa (Mitchard et al. 2011).

The estimation of structural parameters of woody plants can be improved by adding textural features based on the Grey Level Co-Occurrence Matrix (GLCM) (Haralick et al. 1973) and polarimetric features (e.g., Pereira et al. 2018). Recent studies (Beguet et al. 2014; Wessels et al. 2019) have shown that there is a strong relationship between forest structure and texture extracted from remote sensing data. For example, Wessels et al. (2019) added textural features when estimating fractional woody cover in Namibia and obtained a root mean square error (RMSE) of 14% when GLCMs were added. Perreira et al. (2018) compared quad-polarised and dual-polarised L-band with the addition of polarimetric features in predicting woody biophysical parameters. The study concluded that the quad-polarised L-band yielded an RMSE of 10%, while the results of the dual-polarised L-band yielded slightly lower accuracies, i.e., RMSE = 13% (Perreira et al. 2018). Watanabe et al. (2018, 2020) demonstrated the use of the HH/HV band ratio, referred to here as the polarimetric feature, in mapping forested and non-forested areas in the Amazon, achieving an accuracy of over 80 percent.

Field data from CC are often correlated with optical, SAR, or Light Detection and Ranging (LiDAR) signals, using regression or machine learning models to extrapolate CC over large areas (Urbazaev et al. 2018). In South Africa, where reliable CC ground truth data or LiDAR is lacking, alternative methods of obtaining plot data for model calibration and validation need to be found. Anchang et al. (2020) used a tool called Collect Earth Online (CEO) to access freely available high-resolution imagery (HRI) to generate field data for estimating CC in a savanna environment in Senegal. The CEO tool is key to developing land use change monitoring inventories, which is in line with the Intergovernmental Panel on Climate Change (IPCC) principles and has been used by several developing countries with limited inventories to meet their Nationally Determined Contributions (NDCs) (Tzamtzis et al. 2019).

The capabilities of machine learning algorithms such as Random Forest (RF) and Support Vector Machines (SVM) in estimating structural parameters of woody plants are well documented (Belgiu and Drăgut 2016; Lapini et al. 2020). Several studies have demonstrated the use of machine learning techniques and optical and SAR sensors to monitor different environments in southern Africa. Naidoo et al. (2014) evaluated different modelling algorithms for estimating CC in the savannah environments of South Africa and concluded that RF is a suitable algorithm. In addition, Urbazaev et al. (2015) used a multi-temporal approach, RF, and polarimetric L-band data to map woody cover in savannas. Ludwig et al. (2019) sought to monitor bush encroachment using RF, Sentinel 1, and Sentinel 2 data for modelling woody vegetation in semi-arid and arid environments of South Africa. The literature suggests that machine learning algorithms could improve accuracy in estimating woody structure metrics compared to parametric methods such as stepwise multiple linear regression (SMLR). Although there are a plethora of studies on the use of SAR and machine learning algorithms in estimating woody cover in southern Africa, the literature suggests that the majority of SAR-based research in southern Africa has primarily focused on savanna and semi-arid ecosystems. In particular, the potential and limitations of using the L-band to estimate CC in the indigenous forests of South Africa (with closed canopy) have not been extensively investigated.

Deforestation for agriculture, grazing, or urban development causes forest degradation in many parts of Africa (Van Wyk et al. 1996). Most often, the forest is fragmented into patches of different sizes and shapes, each surrounded by a variety of plants and/or land uses (Cho et al. 2015; Omer et al. 2016). Forest fragmentation has direct ecological impacts on soil erosion, loss of biodiversity, and invasion of alien species. Furthermore, it is difficult to find spatially precise data on how the above factors and forest fragmentation have affected carbon stocks at the local or regional level (Saunders et al. 1991; Cho et al. 20132015).

The Dukuduku forest in Kwa-Zulu Natal, one of the most fragmented indigenous forests in South Africa, requires continuous monitoring, management, and conservation. (Omer et al. 2017). Periodic quantitative data on these forests is essential for decision-making, public management, and re-evaluation of environmental policies for conservation. Therefore, it is crucial to develop an effective remote sensing technique for monitoring CC. A detailed and reliable CC product for indigenous forests can therefore be useful for detecting changes in time series, assessing biodiversity, and planning conservation measures that are essential for managing the impacts of climate change at local, regional, and global scales.

In this study, we sought to develop a method for monitoring CC change (2008–2018) using SAR L-band data and machine learning algorithms in the Dukuduku forest in the Kwa-Zulu Natal province of South Africa with training and validation data from CEO. LiDAR data was used to evaluate SAR-derived CC products. To achieve this aim, the following research questions were addressed in this study.

  1. 1.

    Does combining L-band ALOS PALSAR backscatter with textural and polarimetric features using machine learning techniques improve CC estimation in the Dukuduku Indigenous closed canopy forest?

  2. 2.

    Can CC products derived from SAR L-band data be used to detect changes in woody vegetation (2008–2018) in the Dukuduku indigenous forest?

Study area

The region under study is the Dukuduku Forest (Fig. 1) (28°25′ S, 32°17′ E), located between the towns of Mtubatuba and St Lucia in the northern part of KwaZulu Natal Province near iSimangaliso Wetland Park, South Africa (Ndlovu 2013; Omer et al. 2017). In the early 1950s, the Dukuduku forest comprised about 6000 hectares (ha) of indigenous tree species (Cho et al. 2015). However, due to deforestation, the forest was reduced to barely 3200 ha by 2011 (Cho et al. 2015; Sundnes, 2013). Most of the natural vegetation surrounding the forest was cleared for agricultural purposes, with sugar cane plantations in the south and eucalyptus plantations in the north (van Wyk et al. 1996; Cho et al. 2015). The climate is subtropical with warm, humid summers and mild winters with temperatures never below 10 °C and mean annual temperatures above 21 °C (Ndlovu et al. 2011). The area receives rainfall throughout the year, with an annual mean of about 1200 mm (Ndlovu et al. 2011; Mucina et al. 2003). The Dukuduku jungle is rich in biodiversity and is considered one of the last remnants of lowland forests along the South African coast (Ndlovu 2013). St Lucia, Africa’s largest estuary, and the Dukuduku Forest are part of the iSimangaliso Wetland Park, a United Nations Educational, Scientific, and Cultural Organisation World Heritage Site (UNESCO).

Fig. 1
figure 1

(A) Study area in KwaZulu Natal showing the Dukuduku indigenous (intact in yellow and fragmented in black) forest. (B) KwaZulu Natal Province relative to other provinces in South Africa. (C) Location of the study site in KwaZulu Natal Province

The Dukuduku Forest is subject to two forest management protocols: (i) fragmented forests (Fig. 1A) (black boundary) and (ii) intact forests (Fig. 1A). The fragmented forests are managed and maintained by tribal representatives and local people. The intact forests are managed and maintained by officials (e.g., Department of Environment, Forestry, and Fisheries) (Omer et al. 2016). Both the intact and fragmented forests were used to estimate canopy cover in the study area.

Materials and methods

Description of datasets used

ALOS PALSAR 1 and 2 Data

Global yearly, 25-metre ALOS PALSAR HH and HV dual-polarised mosaics, gamma-naught (γ0) backscatter coefficient were used as the primary predictor variables (SAR backscatter) for the prediction of canopy cover for 2008, 2015, and 2018. The ALOS 1 mosaics were derived using Fine Beam Dual (HH, HV) L-band SAR data collected by PALSAR on-board the Japanese ALOS platform launched by the Japan Aerospace Exploration Agency (JAXA). The ALOS 2 datasets were selected from Strip Map mode (SM1 with 3 m single/dual polarisation or SM3 with 10 m dual polarisation) and were chosen for each year and location based on visual assessment of browse mosaics available for that year. Strip data with the minimum response to surface moisture were chosen to reduce visible banding between adjacent strips (Shimada et al. 2014). The mosaic datasets for 2008, 2015, and 2018 were expressed as gamma (γ0) with backscatter normalised by illumination area under an assumption of scattering uniformity (Shimada & Ohtaki, 2010). The backscatter was also radiometrically and geometrically corrected for topography and standardised for incidence angle (θi) (γ0 = σ0/cos θi) (Shimada & Ohtaki, 2010).

Texture and polarimetric features

Preliminary modelling results showed that textural features and polarimetric features improved modelling accuracy. Four Haralick texture features, here referred to as GLCMs, namely variance, contrast, entropy, and homogeneity, were calculated over a 7 × 7 pixel window of data from SAR L-band data with a step size of 1 pixel. Bilinear weighting was applied within the sliding window so that pixels closer to the centre of the window were given a higher weight in the GLCM calculation. The GLCMs were calculated for both SAR polarisations HH and HV to obtain eight GLCMs. In this study, the HH/HV ratio is used as a polarimetric feature. The co-polarisation ratio HH/VV is commonly used to classify SAR images (Pereira et al. 2018; Sartori et al. 2011; Novo et al. 2010). In the absence of VV, we used HV instead: This provided eleven SAR input variables, i.e., γ0 HH and γ0 HV, eight texture features, and one polarimetric feature HH/HV ratio, which were used to generate CC maps for each year.

Collect Earth Online data

Calibration and validation data of the study area for 2008, 2015, and 2018 were collected using CEO. The measurement data collected from CEO was used to train and validate the canopy prediction models. CEO is a free and open-access land monitoring tool (https://collect.earth/) developed by the Food and Agriculture Organisation of the United Nations (FAO). CEO is based on Google Desktop and cloud storage technology. Various satellite image data are freely available on the CEO platform. CEO includes archives of imagery with extremely high spatial resolution and exceptionally high temporal resolution (e.g., Google Earth, Bing Maps, and Mapbox imagery) (Anchang et al. 2020; Bey et al. 2016). Figure 2 illustrates how the plot data was acquired using CEO.

Fig. 2
figure 2

Example of plot level of CC using Collect Earth Online (https://collect.earth/). Plot dimension is 25 m by 25 m with 6 by 6 (36) sample points with 5 m spacing

Stratified grid plots were created within the study area and assessed using the CEO tool. To ensure consistency with the SAR L-band dataset ALOS PALSAR, square plots of 25 m × 25 m with 1000 m spacing were created within the study area. The 1000 m spacing between plots was chosen to avoid spatial autocorrelation effects. Digital globe Bing Maps and Planet Norwegian International Climate and Forest Initiative (NICFI) images filtered for 2008 (winter), 2015 (winter), and 2018 (winter). Each plot was populated with spatially distributed gridded sample points at 5 m intervals, including plot edges (i.e., 6 × 6 = 36 samples points per plot). The percentage of CC in a plot was calculated by labelling each point in the sample as a tree or no tree and then adding them together to get the total for the whole plot (1 labelled point equals 1/36 or 2.77 per cent cover). The percentage of CC at the level of the 25 m × 25 m plot was subjected to the formula in Eq. (1):

$$\mathrm{CC}=\left(\sum Y/36\right)\ast 100$$
(1)

Here, Y stands for the presence of canopy cover treetops. The value 36 is the total number of sample points in the 25 m × 25 m plots. To ensure that the CEO analysis provides reliable results, the plots were assessed based on the highest visual quality in Mapbox and Planet NICFI in the respective years. However, additional images from Google Earth and RapidEye (https://www.planet.com/) were used in the plots where the CEO images did not have high visual quality. A total of 136 plot-level measurements for each year were used to calibrate and validate the CC predictive models for 2008, 2015, and 2018. The descriptive statistical summary of the CEO plot data for each year, i.e., 2008, 2015, and 2018, is shown in Table 1.

Table 1 Woody canopy cover statistics of field data plots collected from CEO

LiDAR data processing and LiDAR-derived CC

Three Canopy Height Model (CHM) datasets were used to derive CC for the study site, one for 2008, 2015, and 2018. The 2018 CHM was created from a ~1 m Digital Elevation Model (DEM) and top-of-canopy surface models (CSM) generated by processing raw LiDAR point clouds following the processing steps described by Asner et al. (2012). The CHM (pixel size of 1.12 m) was calculated by subtracting the DEM from the CSM. The CC model was derived from the processed CHM. A DSM was obtained from ALOS JAXA for the 2008 and 2015 datasets (https://www.eorc.jaxa.jp/ALOS/). The DSM was used to create a digital terrain model (DTM) following the methods of Estoque et al. (2017). A CHM was then derived by subtracting the DTM from the DSM. CC for 2008 and 2015 was derived from this CHM. The model for 2008 and 2015 CC is referred to as ALOS-derived CC. The LiDAR-derived CC and ALOS-derived CC were created by first applying a data mask to the LiDAR image CHM to create a spatial array of 0 s (non-woody canopy) and 1 s (woody canopy). The percentage of the woody canopy was determined by summing all 1 s and diving by 625. The percentage was created at a spatial resolution of 25 m. LiDAR-derived CC and ALOS-derived CC were validated using CEO plots. The results for 2008, 2015, and 2018 gave RMSE = 29.97% (R2 = 0.15), 30.80% (R2 = 0.09), and 20.33% (R2 = 0.46), respectively.

SAR, LiDAR, and ALOS registration

Visual inspection of the LiDAR-derived CC product and ALOS PALSAR global mosaics revealed variable registration errors, i.e., SAR datasets did not align with the LiDAR-derived CC products. The ALOS PALSAR mosaics, LiDAR-derived CC, and ALOS-derived CC product were co-registered using the 15 m Landsat 8 panchromatic band to obtain accurate co-registration. During LiDAR CC and ALOS PALSAR mosaics co-registration, an RMSE of 0.5 m was maintained to minimise errors. The co-registration is also to ensure alignment between ALOS PALSAR data to allow change mapping with minimal misalignment anomalies. Furthermore, this allowed aligned error assessment between LiDAR CC and SAR CC. Settlements, main roads, and water bodies such as dams and rivers were masked and excluded from the analysis.

Data integration, modelling, and mapping

The 25 m by 25 m CEO plots were integrated with SAR backscatter intensities (Fig. 3) for 2008, 2015, and 2018, as well as LiDAR-derived CC products for 2018 and ALOS-derived CC for 2015 and 2018. CEO plots were used to extract mean values from the SAR input variables (HH, HV, HH-GLCMs, HV-GLCMs, and HH/HV ratio). Four modelling scenarios were selected for RF and SVM, namely (i) L-band only, (ii) L-band + GLCM, (iii) L-band + GLCM + HH/HV ratio, and (iv) L-band + HH/HV ratio, to investigate the capabilities of RF and SVM in mapping CC in the Dukuduku indigenous forest.

Fig. 3
figure 3

Flowchart describing data integration and modelling process

Machine learning algorithms

The RF and SVM were used in this study. The RF was used because it accounts for non-linear relationships between variables and makes no assumptions about their statistical distribution (Breiman, 2001). RF is robust and efficient in terms of runtime and accuracy compared to other data mining, machine learning, and regression methods (Ismail et al. 2010). Unlike traditional and fast learning decision trees such as Classification and Regression Trees (CART), RF is insensitive to small changes in the training dataset and provides a lower probability of overfitting (Ismail et al. 2010). RF has become one of the most widely used methods for mapping forest structure parameters and carbon using various satellite and ancillary data (Wingate et al. 2018). RF requires two user-defined inputs: the number of trees or ntree formed in the “forest” and the number of possible partition variables for each node or mtry (Ismail et al. 2010).

The SVM algorithm is largely based on the principle of structural risk minimisation and statistical theory (Cortes and Vapnik, 1995; Cherkassky and Ma, 2004). The choice of a positive definite kernel function is very important in this method. In addition, the cost factor and gamma in the SVM algorithm affect the penalty imposed for misclassifying a sample and the complexity of the algorithm (Luo et al., 2020). SVMs are capable of handling complex and non-linear problems and a wide range of inputs and can achieve high accuracy even when little training data is available (Marabel and Taboada, 2013). In this study, the radial kernel function with parameters sigma and c was chosen.

R software was used to implement RF and SVM models with the packages randomForest and ksvm. Rattle, a graphical user interface (GUI) for data mining with R (Togaware Pty Ltd., Copyright© 2006-2014), was used in this study. Rattle presents statistical and visual summaries of data, transforms data so that it can be easily modelled, builds both unsupervised and supervised machine learning models from the data, graphs the performance of the models, and evaluates new datasets for use in production. Rattle enables data partitioning, pre-processing, model tuning through resampling, and variable importance estimation. The root mean square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE) were calculated during parameter tuning, and the minimum RMSE value was used to select the optimal model. Both the RF and SVM models were implemented with default settings.

To evaluate the performance of the above two machine learning models, the split sample validation method was used, where 80% of the data was used for training and 20% for model validation. Separate models were created for 2008, 2015, and 2018 using the same RF and SVM parameters. RMSE, R2, and MAE were determined to quantify the performance of the RF and SVM models under different modelling scenarios. A high R2 value, a low RMSE value, and a low MAE value indicate good model performance.

The model with the best performance for each year was applied to the relevant predictor variables (modelling scenarios), which were overlaid and truncated with the boundary of the test area using a mapping script. The script was developed in the statistical software R (version 3.6.1, The R Foundation for Statistical Computing, Copyright© 2019). Map products from CC were imported into ArcMap 10.4 (ESRI, Copyright© 1999–2014) and presented in discrete class intervals to best illustrate the CC distribution representative of the entire modelled areas.

Error assessment

Error assessment was carried out to examine the error caused by the different modelling scenarios and to determine the uncertainty of each modelling scenario. Error statistics and maps were produced by subtracting the LiDAR-derived CC, ALOS-derived CC, and SAR-derived CC (LiDAR-SAR) for both RF and SVM for all years and modelling scenarios. Both LiDAR-derived CC, ALOS-derived CC, and SAR-derived CC had spatial resolutions of 25 m. The error statistics from RF and SVM for all modelling scenarios were presented in statistics. For interpretation purposes, the error statistics were divided into five classes using intervals that best covered the observed error range (based on the quadratic sum) in the different modelling scenarios. These classes were major overestimation, minor overestimation, negligible error, minor underestimation, and major underestimation. Furthermore, the CC product derived from 2018 SAR was correlated with the CC product derived from 2018 LiDAR, and the CC product derived from 2008 and 2015 ALOS was correlated with the CC product derived from 2008 and 2015 SAR to evaluate the performance of the modelling scenarios of the two machine learning algorithms in estimating CC in the Dukuduku indigenous forest. The evaluation was done using XY scatter plots of the SAR-derived CC and LiDAR-derived CC, with R2 used to evaluate the relationship.

CC change mapping

The CC change maps were created by subtracting the earlier CC map (2008) from the more recent CC map (2018), which were created using the best models for each year. The RMSE, which is a measure of the spread of the residuals (regression error) calculated for a LiDAR-derived CC product, was used to assess the uncertainty of a CC product for a single year. The uncertainty of the change product was calculated using the uncertainty of the CC product for each year. To determine the uncertainty, it was assumed that the uncertainties of the two products (the earlier CC SAR product and the later CC SAR product) were not related. The quadratic sum formula (Simard et al. 2006) was used to calculate the uncertainty of the change product under this assumption (Eq. (2)):

$$\sigma \mathrm{CC}\Delta \mathrm{year}2\hbox{--} \mathrm{year}1=\sqrt{\sigma_{year2}^2+{\sigma}_{year1}^2}$$
(2)

where σ is the RMSE. In addition, we assessed canopy cover change and rate of change using five discrete classes for 2008, 2015, and 2018 CC products derived from SAR. The five classes were based on the percentage cover of each 25 m pixel. The five classes created for the Dukuduku indigenous forest are illustrated in Table 2.

Table 2 Five discrete classes, their description, and percentage cover were created in assessing woody cover change in the Dukuduku indigenous forest, adopted from Song et al. (2014)

The annual rate of change was calculated for the CC classes; according to Teferi et al. (2013), the net change is the difference between gain and loss and is always an absolute value. The annual change rate of CC for 2008–2015, 2015–2018, and 2008–2018 was calculated according to the approach introduced by Puyravaud (2003), denoted in Eq. (3):

$$r=\left(\frac{1}{\mathrm{t}2\hbox{--} \mathrm{t}1}\right)\mathrm{x}\ \mathit{\ln}\left(\frac{\mathrm{A}2}{\mathrm{A}1}\right)$$
(3)

where r is the annual rate of change for each class per year, A2 and A1 are the class area (ha) at time two (t2) and time one (t1) (in years) between periods and ln is the logarithm. The net change is the difference between the gain and the loss (Teferi et al. 2013). The gain and loss of CC during the study period were derived from the cross-tabulation of 2008–2015, 2015–2018, and 2008–2018.

Results

Performance of the models based on SAR and machine learning

The validation performance of the different machine learning algorithms under different modelling scenarios in predicting CC for the years 2008, 2015, and 2018 is shown in Table 3. The use of SAR variables (HH and HV) resulted in low accuracy for the RF model in 2008 (R2 = 0.24) and 2015 (R2 = 0.15) and low accuracy for the SVM model in 2008 (R2 = 0.36), 2015 (R2 = 0.27), and 2018 (R2 = 0.47). However, for 2018 (R2 = 0.55), the HH+HV modelling scenario resulted in high accuracy for the RF model. The (HH+HV+GLCM) modelling scenario resulted in higher accuracies than other modelling scenarios for 2008 (R2 = 0.37) and 2015 (R2 = 0.51) for RF. The modelling scenario (HH+HV+HH/HV_Ratio+GLCM) resulted in higher accuracies than other modelling scenarios for the SVM model for 2008 (R2 = 0.37), 2015 (R2 = 0.46), and 2018 (R2 = 0.49). For 2018, the modelling scenario (HH+HV+HH/HV_Ratio+GLCM) resulted in lower accuracies than using only SAR backscatter and SAR backscatter plus HH/HV ratio for RF; however, the modelling scenario with all variables resulted in high accuracies for SVM (Table 3). RF machine learning algorithms produced the best model for 2008, 2015, and 2018.

Table 3 Woody canopy cover (CC) modelling accuracy assessment (validation) results as obtained from the Random Forest and Support Vector Machines from four modelling scenarios

Overall model uncertainty of CC estimates

The model uncertainty of the SAR CC estimates was evaluated by correlating the LiDAR-derived CC and ALOS-derived CC estimates against the SAR-derived CC estimates. Due to low R2 and high RMSE, the estimation of CC was applied to all the modelling scenarios for RF and SVM. The 2008 and 2015 SVM and RF models produce low R2 values; therefore, only the 2018 results are presented. The RF and SVM results modelling scenario (HH+HV+HH/HV_Ratio+GLCM) and (HH+HV+GLCM) modelling scenarios are presented in Fig. 4. The validation results of SAR CC versus LiDAR CC for the RF model under different modelling scenarios: CCSAR (R2 = 0.55), CCSAR+R (R2 = 0.55), CCSAR+GLCMs (R2 = 0.59), CCSAR+R+GLCMs (R2 = 0.6). The SVM modelling scenarios when compared with LiDAR CC for 2018 produced similar but slightly low R2 relative to RF: CCSAR (R2 = 0.54), CCSAR+R (R2 = 0.53), CCSAR+GLCMs (R2 = 0.57), CCSAR+R+GLCMs (R2 = 0.6). Figure 4 illustrates the underestimations that the SAR-derived CC product yield when correlated with the LiDAR-derived CC. The SVM and RF model underestimation CC at lower coverage.

Fig. 4
figure 4

(AD) SAR-derived CC vs LiDAR-derived CC for 2018 under different modelling scenarios

SAR-derived CC maps and error statistics

CC was mapped for the study area using the best-performing models for each year (CCSAR+GLCMs for 2008, CCSAR+R+GLCMs for 2015, and CCSAR for 2018), but only the 2018 RF CC map is presented for the sake of brevity. Figure 5 illustrates the CC map derived using the best-performing model for 2018 SAR variables and RF for 2018. In the intact forest, there is a high CC with values between 60 and 100%, but low coverage is observed at the edges of the intact forest. The distribution of CC within the fragmented forest varies between 40 and 80%, with small patches of low and high cover.

Fig. 5
figure 5

SAR-derived CC map (2018) using RF and all predictor variables. Letters A–G represent selected areas of interest used as examples for LiDAR-SAR error statistics. DD – Dukuduku

Table 4 presents the LiDAR-SAR error statistics. For 2008 and 2015, both the RF and SVM models presented high major overestimations and high major underestimations. However, for 2018 estimates, major overestimations and major underestimations were significantly low, and negligible error was high for both RF and SVM over the four modelling scenarios. Key areas of discussion regarding overestimations and underestimations across LiDAR-SAR coverage for 2018 were selected and are presented in Fig. 5.

Table 4 CC percentage difference across the LiDAR-SAR coverage for the best-performing model four modelling scenarios and the two machine learning models

The addition of GLCMs and polarimetric features across the years for the estimates of RF and SVM CC did not reduce or increase the overestimations and underestimations of CC, implying that the overestimates and underestimates may be influenced by environmental characteristics. The use of the different CC products in determining uncertainty explains the significant differences in over- and underestimates between these years. The regions where the largest overestimates and underestimates observed from the 2018 RF modelling scenarios are shown in Fig. 6. Figure 6(ii) shows the largest underestimates and overestimates produced by the 2018 RF model when all predictor variables are used. These over- and underestimates correspond to the area of interest E from Fig. 5.

Fig. 6
figure 6

(i) 25 m LiDAR-derived CC maps. (iiv) 2018 SAR-derived CC error maps of areas of interest, as illustrated in Fig. 5

Majority of the overestimations and underestimations across all modelling scenarios for both RF and SVM models occurred in the fragmented forest and at the edges of the intact forest, as can be observed in Fig. 6 (i) areas of interest A–G (Fig. 5). A combination of major overestimations and negligible error plus minor overestimations in the CCSAR+GLCMs scenario was observed in the area of interest E on Fig. 5, which is further highlighted and illustrated in Fig. 6 (ii). Modelling scenarios (HH+HV+GLCMs) and (HH+HV) errors are also presented in Fig. 6 (iv, v), respectively.

CC Change, error estimation, and rate of change

The change SAR-derived product was computed using the Random Forest-modelled SAR-derived estimates CCSAR+R+GLCMs for 2008 and CCSAR for 2018 because they were the best-performing models. The uncertainty of the change product was calculated from the annual CC maps. The change uncertainty of σCC∆2018–2008 of 38.41 represents the average uncertainty of all individual pixels. The CC2018–2008 change was summarised into five classes (Table 3). Figure 7 illustrates the SAR-derived CC estimates when using all the predictor variables CCSAR+R+GLCMs. The RF-modelled CC estimates using all predictor variables were presented to observe the area covered by each CC class (Table 5).

Fig. 7
figure 7

Quantitative differences in woody canopy cover estimates area percentages for 2008, 2015, and 2018 using the RF algorithm with all predictor variables (CCSAR+R+GLCMs)

Table 5 CC change trend and the annual rate of change across the study area

In 2008, dense forest, forest, woodlands, wooded grassland, and shrubland covered 37%, 17%, 20%, 23%, and 2%, respectively. The classes dense forest, woodlands, and shrubland increased from 37% (3365.31 ha), 17% (1564.13 ha), 20% (2028.19 ha), and 2% (205.63 ha) in 2008 to 40% (3605.44 ha), 21% (1866.31 ha), 22% (1938.38 ha), and 6% (678.5 ha) in 2018. On the contrary, wooded grassland drastically decreased from 23% (2028.19 ha) to 10% (901.63 ha). Figure 7 illustrates the quantitative differences in woody CC estimates area percentages for 2008 and 2018 using the RF algorithm.

The annual rate of change varied for each CC class across the study period (Table 5). Between 2008 and 2018, for dense forest, forest, woodlands, wooded grassland, and shrubland experienced an annual rate of change of 0.69% ha−1(increase), 1.79% ha−1 (increase), 0.59% ha−1 (increase), 8.11% ha−1 (decrease), and 11.93% ha−1 (increase), respectively.

During the study period, wooded grasslands experienced the highest transition, with 80.64% of its total area in 2008 changing to other classes, with 24.90%, 21.45%, 18.16%, and 16.13% transitioning to woodlands, shrubland, forest, and dense forest, respectively. The other classes also experienced change with dense forest (40.54% increase, 36.3 decrease), high cover (79.44% increase, 75.47% decrease), woodlands (75.53% increase, 74.04% decrease), and shrubland (90.64% increase, 69.48% decrease).

Discussion

The results of the study show that RF provides the best model for 2008, 2015, and 2018. In 2008, it was RF which performed best with SAR variables and GLCMs. In 2015, it was the RF model that performed best with SAR variables, polarimetric features, and GLCMs. In 2018, it was the RF model with only SAR variables that performed best. These results are consistent with other studies that have used RF to estimate CC in different environments in Africa and southern Africa (Naidoo et al. 2014, 2015; Anchang et al. 2020). The use of CEO-derived field plots for model calibration and validation resulted in a low R2 < 0.5 and a high RMSE (29.5–36.5%) for both RF and SVM. However, significant changes were observed within the different percentage coverages of CC across the study area, providing only information on changes in CC in the Dukuduku indigenous forest. Polarimetric and textural features slightly improved the RF and SVM models (Table 3 and Fig. 4). These results indicate the importance of textural and polarimetric features in estimating CC. Textural features are important for estimating the structural parameters of woody vegetation because they can examine pixels within a specific neighbourhood associated with tree clumps (Gonzalez and Woods, 1992). They also reveal underlying physical variations in the image and provide information about the structural arrangement of the surface and its relationships to the surrounding environment (Haralick et al. 1973). Several studies have used textural features to map and assess CC (Wood et al. 2012; Madonsela et al. 2017; Wessels et al. 2019). These studies used optical datasets or were conducted in other environments, such as the savannah. This study therefore extends the body of knowledge by using SAR L-band, textural, and polarimetric features with machine learning models to estimate CC in the indigenous closed canopy forests and provides an opportunity to test this approach in other environments of Southern Africa. In addition, the study used the estimated SAR products to obtain time series of changes in CC.

Machine learning techniques used to estimate woody structural parameters require calibration and validation data. The calibration and validation data may contain inaccuracies or inconsistencies that result in inaccurate models. The low accuracy of the RF and SVM models can be attributed to the inconsistencies between the CEO plot data and the ALOS PALSAR annual mosaics. This is evident when the relationship between the SAR-derived CC and the ALOS-derived CC and the LiDAR-derived CC is determined for all the years. There was a moderate positive relationship between the 2018 SAR-derived CC and the LiDAR-derived CC, R2 > 0.5, and a poor positive relationship for 2008 and 2015, R2 < 0.4. The difference in the dates of the ALOS-derived CC and SAR-derived CC products caused a poor correlation between the datasets. It should be noted that for the 2008 and 2015 error statistics when determining the uncertainty of the RF and SVM SAR-derived CC products, the CC product used for correlation was derived from DTM and DSM products that were not produced with LiDAR point cloud data. However, for the 2018 RF and SVM SAR-derived CC products, the uncertainty of the products was determined using the CC derived from LiDAR point cloud data. The structural parameters of woody plants are influenced by environmental conditions such as precipitation, soil type and moisture, and topography (Sankaran et al. 2005). Model accuracy could be improved by integrating environmental variables with remote sensing data (Wessels et al. 2019).

The SAR-derived CC change statistics and their respective interpretations have successfully addressed the goal of demonstrating the potential of SAR-derived CC maps in monitoring CC change, especially when using free calibration/validation data from CEO. Significant changes were observed between the different CC percentages during the study period, with shrubland gaining the most area during this period. Forest and wooded grassland each recorded the largest decreases. Looking at the changes in CC classes and the rate of change between 2008 and 2015 and 2015 and 2018, it is clear that the Dukuduku indigenous forest is facing threats that need to be investigated. However, these changes need to be interpreted with caution as we acknowledge that there are certain sources of error that have not been thoroughly investigated. Errors introduce uncertainty in the estimated products and are the result of inaccurate measurements of plot data, measurement biases, and inaccuracies in the geometric rectification of remote sensing images (Wang et al. 2005). Different independent validation datasets were used in the study, hence the observed errors. These uncertainties raise the question of whether the variables derived from the plots are representative of the landscape in question (Marvin et al. 2014). The CEO plot data used for calibration and validation in this study were validated using LiDAR-derived CC for 2018 and ALOS-derived CC for 2008 and 2015 to verify the accuracy of the plot data. Multi-temporal and multi-seasonal SAR data can improve modelling results (Naidoo et al. 2015); however, the ALOS PALSAR L-band data used in this study were obtained as annual mosaics. Therefore, the potential and limitations of using multi-temporal SAR data in the closed canopy forests of Southern Africa are unknown. Future missions such as BIOMASS, which will provide longer P-band SAR imagery (Ho Tong Minh et al. 2016) and NISAR, which will generate quad-polarimetric L-band and S-frequency (Rosen et al. 2017), promise to improve and extend the work presented in this study to other forests and woody vegetation types to provide accurate and improved woody canopy cover estimates. This is due to the fact that the SAR signals from these sensors are able to penetrate deep into a complex, multi-layered, closed forest (Rosen et al. 2017).

Both the RF and SVM models resulted in under- and overestimations of CC in certain areas of the study area under the different modelling scenarios. However, this was expected due to the saturation of L-band backscatter at biomass greater than 90–110 Mg ha−1 associated with closed canopy forests (Yu and Saatchi, 2016). Several studies have reported this phenomenon (Urbazaev et al. 2018; Wessels et al. 2019; Naidoo et al. 2015). This phenomenon is also related to RF regressions due to the bias of ensemble trees towards the sample mean (Xu et al. 2016). Interestingly, the over- and underestimations in the SAR-derived CC estimates for 2018 were drastically reduced, fully supporting the use of LiDAR data for calibration and validation of SAR models through upscaling methods to reduce dependence on field plot data causing these errors and thus increase the accuracy of the SAR-derived CC estimates. Understanding the variables that cause some level of error is important for improving and building accurate models in the future. Most of the overestimations and underestimations, as shown in Fig. 5 and Fig. 6 (ii–iv), occurred in the fragmented forest and at the edges of the intact forest. Majority of over and underestimations occur at the edges of the intact forest (Fig. 5). The Dukuduku forest is categorised by dune ridges and is surrounded by floodplains to the east, south, and southwest, which explains the overestimations and underestimations that occur in these regions.

Observing the overestimation and underestimation error statistics over the study period and the fact that only the CEO plot data was used for calibration and validation of the machine learning algorithms. The uncertainty results obtained in the study confirm the recommendations of Naidoo et al. (2016) that the use of LiDAR for model calibration and validation is necessary to increase the accuracy of estimation of CC across different vegetation types. LiDAR data excels at delineating the understory of forests, which includes grasses, forbs, and shrubs distributed below the forest (Mahlangu et al. 2018). LiDAR metrics yield strong correlations with L-SAR data, as L-band signals interact with tree trunks and branches (Mitchard et al. 2011). However, in South Africa, there is limited LiDAR data due to the high costs associated with obtaining LiDAR data. Missions such as the Global Ecosystems Dynamics Investigations LiDAR (GEDI) data have the potential to address this problem. Data obtained from the GEDI mission can be used to extract other structural parameters of woody vegetation, such as canopy height and crown diameter, which are essential for ecological studies related to vegetation dynamics, biomass estimation, and spatial characterisation of vegetation (Ferraz et al. 2016).

Concluding remarks

The aim of this study was to develop a method for monitoring CC in Dukuduku indigenous forest. This was done using SAR L-band mosaics of 2008, 2015, and 2018, polarimetric and GLCMs with machine learning models calibrated and validated with CEO data. This is the first attempt to map and monitor changes in CC in the indigenous closed canopy forests in South Africa using SAR data. Considering the complexity of estimating CC in a closed canopy forest with highly fragmented forest and heterogeneous vegetation cover, we can conclude that the selected machine learning models and modelling scenarios have low R2 (0.2 < R2 < 0.6) and high RMSE (29.5–36.5%) values for estimating CC in the Dukuduku indigenous forest in South Africa.

In terms of machine learning algorithms suitable for monitoring CC in the Dukuduku indigenous forest, the results show that the machine learning algorithm RF provides the best model for estimating CC in 2008, 2015, and 2018; the addition of polarimetric and texture features improved the modelling accuracy. However, more accurate results for estimating CC could be obtained by using LiDAR data for calibration and validation. The methodology presented in this study can be used as an alternative for estimating forest cover when the availability of LiDAR data is limited. We also believe that the methodology can be adopted and implemented to estimate other structural parameters of woody vegetation if LiDAR data are used for calibration and training to improve modelling accuracy. The authors believe that CEO plot data is an alternative data source to LiDAR where LiDAR availability is limited or LiDAR data is not available for the study period. However, to confirm these results, further tests with more CEO plots integrated with environmental variables need to be conducted.

Beyond this study, multi-temporal SAR datasets can be used for long-term forest cover mapping and monitoring to assess the effects of seasonality and environmental conditions on woody canopy cover in different environments of Southern Africa. Finally, the results presented in this study can provide valuable information to the scientific community on the capabilities and limitations of SAR L-band and CEO data in estimating CC in indigenous forests of South Africa.