Diversifying modelling techniques to disentangle the complex patterns of species richness and diversity in the protected afromontane grasslands

Ecological research has focused on the importance of environmental factors on spatial biodiversity variations and organisation. This is important because of scant conservation resources. We used stepwise backward selection and random feature selection (RFE) to identify a parsimonious model that can predict species richness and diversity metrics in response to three models; biotic, abiotic, and topo-edaphic. Our results show that both metrics are good predictors of one another, mainly because species diversity is a combination of species richness and abundance, and further highlights the importance of biotic variables in predicting species distribution. The two modelling techniques selected soil texture and its interactions with topographic variables as the most important variables. However, random forest performed worse than multiple linear regression in the prediction of diversity metrics. This research highlights the importance of topographically controlled edaphic factors as drivers of species richness and diversity in mountainous grasslands where topography inherently controls the geomorphic, hydrological, and, as a result, ecological processes.


Introduction
Many environmental factors affect plant species richness and diversity in grassland communities (Auestad et al. 2008). Habitat fragmentation and loss are causing significant declines in grassland biodiversity, and ecosystem services are dwindling (Zulka et al. 2014). Ecological, topographic, and management intensity are the most important features influencing grassland biodiversity (Orlandi et al. 2016). However, in the Afromontane grasslands, these features are many and site-specific. Thus it is difficult for conservation management. Conservation resources are often scarce, and knowledge about which environmental drivers promote species diversity and richness in mountainous protected areas could allow for the prioritisation of the hotspots and allocation of conservation efforts (Olea et al. 2010). For example, nutrient-poor grasslands in Central Europe are identified as one of the biodiversity hotspots in the region and, therefore, highly valued by conservationists (Becker and Brändel 2007).
Vegetation response to environmental factors should be studied at specific yet multiple scales (Reitalu et al., 2012). Variations in grassland vegetation result from complex gradients of soil moisture and element concentrations locally and climatic variables regionally (Auestad et al. 2008). Research on vegetation-environment relationships is scanty for most parts of the Afromontane grasslands, especially locally; hence, conservation efforts may be haphazard and ad-hoc. For example, some edaphic variables, such as soil carbon content, increased species richness, while heavy metals and the C/N ratio were unrelated and decreased species richness, respectively (Becker and Brändel 2007). Grassland landscapes can be better understood when good predictors of species richness are used (Zulka et al. 2014). Predictors can be used to comprehend the complex patterns of species richness, especially in mountainous grasslands vulnerable to global environmental changes.
In alpine grassland, environmental factors, such as climate, topography and soil, are pivotal for maintaining plant diversity-for example, altitude and slope influence species composition. Inherently, topography in mountainous areas determines temperature, elevation, and hydrology; therefore, it is an important determinant of vascular plant diversity (Moeslund et al. 2013). In addition, topography in mountainous habitats facilitates geomorphological processes such as erosion and, thus, soil fertility along slope gradients, as well as aspects affecting ecological processes. Ultimately, this interplay of environmental factors influences vegetation variations in mountainous areas and can be used to identify biodiversity hotspots for direct conservation protocols in protected areas (Lee and Chun 2016). Mountain ranges in Southern Africa vary in topography and aspect, providing suitable habitats for various plants and animals (Brown and du Preez 2020). The role of topography needs to be investigated, especially in habitats where topographic interplays are poorly understood (Moeslund et al. 2013).
The most active research in ecology has been to understand the importance of environmental factors on spatial biodiversity variations and organisation. Species distribution models, which use a combination of species occurrence or abundance and ecological aspects, can aid in gaining insight into vegetation distribution across landscapes and allow for extrapolation in space and time. Furthermore, key modelling steps such as gathering data, selecting relevant predictors, and appropriate modelling algorithms can influence the robustness and realism of the model (Elith and Leathwick 2009). Modelling algorithms such as parametric generalised linear models and non-parametric random forest (RF) can be used to assess the uncertainty of modelling algorithms (Bittner et al. 2011). In protected areas of developing countries, such as Golden Gate Highlands National Park in South Africa, conservation resources are limited and should be used sparingly. Therefore, good models that can predict the potential distribution of vascular plant diversity linked to nutrient-rich forage can give initial insights into animal distribution and forage preference. Subsequently, precise carrying capacity models can be developed. This research aimed to gain insights into the main drivers of vascular plant diversity in the mountainous grassland of Golden Gate Highlands National Park by two different modelling techniques.

Study area
The study was conducted at Golden Gate Highlands National Park, in the North-Eastern part of the Free State province, South Africa (Fig. 1). The park comprises 32,758.35 ha and lies between 28°27' S -28°37' S and 28°33' E -28°42' E. The Park is located in mountainous grasslands at the foothills of the Drakensberg and forms part of the mesic highveld grassland with marked variation in geology, topography, and rainfall. The following soil types are identified in the park: shallow rocky soils (Glenrosa and Mispah), deep soil along drainage lines (Oakleaf), well-developed sand soils (Hutton and Clovelly), and clayey structured soils (Milkwood and Tambakulu) (SANParks 2020). The park is characterised by summer Fig. 1 A map of the study area rainfall, temperate summers, and cold winters. The rainfall season stretches from September to April, with a mean annual rainfall ranging from 800 mm to 2,000 mm (Kay et al. 1993). The park lies between 1,892 m and 2,829 m above sea level and, hence, comprises the following grasslands units: Eastern Free State Sandy grasslands (Gm 4), Basotho Montane Shrubland (Gm 5), Lesotho Highveld Basalt Grassland (Gd 8), and Northern Drakensberg Highveld (Gd5) (Mucina and Rutherford 2006).

Field data collection
The park was stratified into relatively homogenous physiographic-physiognomic units (Kay et al. 1993). Vegetation sampling was undertaken in a 30 × 30 m plot size randomly placed within homogeneous patches of grass communities (Fig. 2). Four 30 m transects were placed horizontally at every 10 m interval within the plot, the plot where generated using the "generate random points" tool in ArcMap 10.7. In each plot, a 1 × 1 m quadrat size was placed at every 10 m interval along the transect for vascular plant species identification and ariel visual cover estimations. In total, 142 plots with 16 quadrats were sampled in the study area. Data were collected in March, April, and May 2019, which are the rainy growing months for South African montane grasslands; this area receives late rains for summer, thus prolonged vegetation growing season, in addition to the xeromorphic characteristics of plant species. In each quadrat, all vascular plant species of the standing vegetation were identified

Environmental variables and fire severity
Various environmental predictor variables from different data sets were used to measure and model their influence on species richness and diversity. For soil, both chemical and physical variables were used; bulk density (BD), silt fragments (SF), pH, coarse fragments (CF), soil organic carbon (SOC), sand (SD), and nitrogen of topsoil (15 cm) were downloaded from International Soil Reference and Information Center (https://soilgrids.org). The park comprises shallow rocky soils, and while field sampling is encouraged in underrepresented areas (Hengl et al. 2017), physical soil sampling may be damaging in this park. Using this grided soil data with relatively good accuracies (Hengl et al. 2017) is non-destructive, therefore, ideal for a conservation area. The elevation data, i.e. SRTM DEM at 30 m resolution, was obtained from US Geological Survey's EROS data centre (https://earthexplorer.usgs.gov). The slope was derived from DEM using the Spatial Analyst Tool in ArcGIS 10. 4. Topographic and edaphic variables were multiplied to determine the topo-edaphic variables. Fire severity data were acquired from a study conducted in the area (Adagbasa et al. 2018). The study estimated fire severity using the Normalized Burn Ratio (NBR) index by analysing a pre/post-fire season (April -September 2017) from remote sensing images with burnt and unburnt pixels.

Diversity metrics
Species diversity was calculated using the Shannon diversity index H'=-pi*lnpi where pi is the proportion (species cover) of each species in the quadrat; this analysis was computed using the vegan package (Oksanen et al. 2012) in R studio (R core Team, 2022) Furthermore, total species richness was measured by counting all vascular plants species that were recorded for each quadrat. Subsequently, all the quadrat values were averaged to attain the plot level value of each diversity metric.

Simple multiple linear regression
Simple multiple linear regression was used to examine the relationship between diversity metrics (species diversity and richness) and a set of three models of explanatory environmental variables. The biotic model included diversity metrics as predictor variables amongst abiotic-based variables. The abiotic model had bulk density, silt fragments, sand, pH, soil organic content, coarse fragments, elevation, slope, fire severity, and soil nitrogen.
The topo-edaphic model included interactions between topographic and soil variables. The diversity metrics (species diversity and richness) were used as both response and predictor variables because of the high colinearity between the two. Subsequently, a backward selection was used to eliminate redundant variables and select the optimal variables explaining species richness and diversity from the set of environmental variables based on the lowest Akaike's Information Criterion. Statistical analyses were performed using R statistical software (R Core Team, 2022).

Random forest
A set of biotic, abiotic, and topo-edaphic variables was used to input the nonparametric RF method to predict species richness and diversity. Random forest is a highly recommended method for ecologists and remote sensing scientists and was developed to improve classification and regression trees (CART) by using a large set of regression trees (Ramoelo et al. 2015). The 'random forest' package in R statistical software was used to analyse the data. Optimising the number of variables required to predict species richness and diversity was determined using a random feature selection (RFE) based on leave-one-out cross-validation, and the root mean square error (RMSE). The RFE is a simple backward selection algorithm implemented via the 'caret' package in R statistical software (R Core Team, 2022).

Model performance
The statistical measure of model precision and robustness, the r-squared (R 2 ) and root mean square error (RMSE), was determined to test the performance of the modelling algorithms using the selected stepwise and RFE variables in predicting the diversity metrics. The parameters were used to assess the strength of the relationship between the observed and predicted species richness and diversity.

Results
The species and richness diversity metrics had a coefficient of variation of 25% and 39%, respectively (Table 1). All edaphic variables had a very low coefficient of variation (< 10%), except for coarse fragments (38%) and soil organic content (32%). For topographic variables, the slope had the highest (80%) and elevation with the lowest (0.06%) coefficient of variation. Fire severity had a coefficient variability of 26%. All the topo-edaphic variables exhibited a substantially high coefficient of variation (> 82%).
The optimal environmental variables explaining species richness and diversity in grassland communities in Golden Gate Highland National Park are given in Table 2. The stepwise backward selection retained species diversity, silt fragments, and fire severity as the optimal variables explaining species richness. However, species diversity had a significantly positive (7.04, p < 0.05), while silt fragments and fire severity had a significantly (p < 0.05) negative relationship (-0.24 and − 0.41, respectively). Bulk density (-0.02) and silt fragments (-0.03) were identified as the two optimal abiotic variables with a significantly (p < 0.05) negative relationship with species richness. Similarly, in the topo-edaphic model, the stepwise selected elevation, soil bulk density, silt fragments, sand, and pH had a significantly negative (p < 0.05) albeit minute relationship with species richness. Species diversity was optimally explained by species richness (0.10), soil pH (0.02), and fire severity (0.04). Coarse fragments had a significantly (p < 0.05) negative relationship (-0.03) with species diversity amongst the selected abiotic variables. However, two topographic variables were identified as optimally explaining species diversity, namely elevation interaction with soil bulk density and sand, but with very small estimate coefficients.
The multiple linear regression models (Fig. 4) performed better than the RF models (Fig. 5). The multiple linear regression biotic models explained 72% of the variation in species richness with an RMSE of 1.84, while abiotic and topo-edaphic models explained 12%

Discussion
Soil edaphic variables exhibited a very low coefficient of variation (< 10%), while slope showed a very high variation of 80%, second to all the topo-edaphic variables with a variation of > 82%. The low variation of the soil edaphic dataset in Golden Gate Highland National Park can be attributed to unvarying land-use activities, both current (protected area with grazing activities) and historic (farming: crop cultivation) because soil stoichiometry differs amongst and is influenced by land-use activities. For example, soil total carbon (C) and nutrients varied among different grassland types and land use in Alpine Qinghai-Tibetan Plateau (Han et al. 2019). Furthermore, a mean slope of 6.4% that ranged from completely flat (0.0%) to gentle (0.26%) was observed, despite the high variation in the dataset, which may have been because of the difficulty of steep sampling slopes > 45 degrees. The diversity metrics served as both response and predictor variables. This gives insight into forces controlling species richness and diversity in response to biotic and abiotic factors (Tilman 1993). The high collinearity between the two metrics was inevitable because species richness comprises the number of species, and the Shannon-Wiener diversity index considers both species richness and abundance. Despite having similar traits, the diversity metrics may differ in their sensitivity to biotic and abiotic factors. For example, species richness may be sensitive to rapid changes in vital rare species, while species diversity may react to the changes of abundances dominant species. Thus, Symstad and Jonas (2011) suggested that "the response of richness and diversity to drivers reflect different changes in the plant communities".
This research showed a significantly negative relationship between species richness, silt fragments, and fire severity. pH and fire severity, however, influenced species diversity via the stepwise algorithm in the biotic model. Edaphic factors are related to geomorphic heterogeneity, mainly slope stability, which can affect species richness at specific scales in mountainous areas (Malanson et al. 2020). Moreover, fire severity is one of the variables selected by the RF to explain species richness negatively. High fire severity will cause more physical/structural damage to the grassland vegetation community; hence vegetation seldom fully recovers to pre-fire conditions (Adagbasa et al. 2020). Generally, however, variations in fire regimes can serve as a stimulant for seedling establishment and emergence (Olea et al. 2010), while high fire severity, in particular, can obliterate existing species in grassland plant ecosystems. Interestingly, elevation and edaphic interaction variables were identified as optimal for influencing species richness and diversity. This emphasises the role of topography in influencing local plant diversity. For instance, elevation may limit spatial seed distribution, especially by animals that find it difficult to ascend steep slopes (Moeslund et al. 2013). The RF algorithm similarly identified silt fragments, elevation, and other soil texture variables important for explaining species richness. This also shows that topographically controlled edaphic factors such as silt fragments can be drivers of species richness in mountainous areas (Filibeck et al. 2019). Therefore, it is essential to prioritise soil conservation in protected areas to improve plant species diversity, especially in humid mountainous regions susceptible to erosion and nutrient leaching.
Species richness and diversity were affected differently by biotic, abiotic, and topoedaphic models. As all the multiple linear regression algorithms better predicted such, species richness compared to species diversity. The RF performed worse in predicting diversity metrics because of the high prediction error, i.e. RMSE. Our results concur with studies that question the predictive power of RF in estimating vegetation parameters (Filibeck et al. 2019;Kosicki 2020), mainly because it is a non-parametric technique that does not make assumptions about statistical distribution (Ramoelo et al. 2015). However, RF is recommended for ecological use because many variables can be tested, are robust to multicollinearity, and are independent of any assumption; in comparison, stepwise multiple regression is prone to multicollinearity and overfitting, while machine learning methods are prone to different specific conditions in an ecosystem and lack analytical assumptions, which influence their predictive power.

Conclusion
Conservation of natural resources warrants active research to gain insight into the distribution of species occurrence, abundance, and composition. Because conservation resources are dwindling with heightened threats to biodiversity, our study can provide a framework for identifying environmental drivers of species richness through the selection of parsimonious models. The two metrics were useful rangeland indicators. When species distribution is well understood, they can be used to model carrying capacity because they are linked to plant productivity and nutrition. Furthermore, the importance of topographically controlled edaphic grasslands as drivers of species richness and diversity in mountainous grasslands is highlighted. Topography inherently controls the geomorphic and ecological processes. Fire severity is one of the most concerning predictors of species diversity to park managers, which is susceptible to sporadic anthropogenic fire; drivers of fire severity ought to be investigated for abating species loss due to high fire severity. The findings show that diversifying model techniques can help disentangle species richness's complex patterns.
Author Contribution All author's conceived and designed the research; Katlego K Mashiane performed the fieldwork, data mining, and analysis; Katlego K Mashiane wrote and edited the manuscript; Abel Ramoelo and Samuel Adelabu reviewed the manuscripts.
Funding Open access funding provided by University of the Free State.

Conflict of Interest
The authors declare that no financial interests/personal relationships may be considered potential competing interests. Data used in this manuscript is available and can be provided upon request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.