1 Introduction

Natural forest regeneration is a key process in ensuring forest sustainability (Lucas-Borja 2014). However, in the Mediterranean basin, natural regeneration is hampered by the summer drought, high summer temperatures, long intervals between good seed crops, the presence of livestock and the existence of a too dense herb and litter layer (Barbeito et al. 2011; Calama and Montero 2007; Manso et al. 2013; Pardos et al. 2007). Therefore, understanding the dynamics, patterns and factors involved in the success or failure of regeneration and interactions at the seedling level can provide foresters with the fundamental knowledge required for decision-making in forest management (Lucas-Borja 2014).

Regeneration in forest stands has been studied using different statistical methods depending on the aim, data available or type of forest (Table 1). Analyses have been refined as statistical techniques have improved (Zuur et al. 2007). Regeneration studies often involve counts of the number of seedlings per sampling unit. However, a wider range of techniques can be applied to assess seedling abundance. An initial, simple approach is the linear model (del Cerro et al. 2009; Osem et al. 2013). However, regeneration data rarely satisfy the basic assumptions for linear models: normality of errors, linearity of parameters, homogeneity of variance and independence of the covariates (Zuur et al. 2010).

Table 1 Methods used to study natural pine regeneration in Mediterranean mountains

Explanatory spatial covariates that influence recruitment patterns have also been incorporated into different types of recruitment models, as explanatory variables. The covariates most used are soil nutrients and/or soil moisture (Barbeito et al. 2009), light availability (Adili et al. 2013; Fyllas et al. 2008), environmental conditions (Rathbun and Fei 2006) and stand structure (Manso et al. 2013). Vertical and horizontal forest structures determine light availability and, therefore, regeneration dynamics (Catovsky and Bazzaz 2002; Montgomery 2004) and the development of the stand (Boyden et al. 2005; Montes and Cañellas 2007). Some authors have assessed the influence of the retained trees on regeneration through the inclusion of structure influence indices in the model (Barbeito et al. 2011; Kuuluvainen and Pukkala 1989; Paluch 2005).

Regeneration studies often involve the consideration of a lot of explanatory variables that may be highly correlated. Collinearity among covariates increases the risk of inferring that these covariates have no explanatory power (increasing the risk of type I errors), and in addition, it can be especially problematic when ecological signals are weak (Zuur et al. 2010). Collinearity is often assessed through variance inflation factors, pairwise scatterplots comparing covariates or correlation coefficients (Zuur et al. 2010). In order to avoid correlation and to reduce the number of independent variables, scaling techniques can be used to create a dimensionally reduced data set with uncorrelated variables (Borcard and Legendre 2002; Legendre and Legendre 1998; Strobl et al. 2009). The most widely used tools in this regard are principal component analysis (PCA) and factor analysis or nonmetric multidimensional scaling procedure (NMDS). The obtained uncorrelated new covariates obtained can be then used in standard regression methods. The requirement of normality of the data in the PCA analysis is under discussion (Jolliffe 2002; Borcard and Legendre 2002; Zuur et al. 2010). In this respect, NMDS is not sensitive to normality, to outliers and to homoscedasticity assumptions of classical metric scaling (Casini et al. 2004; Legendre and Legendre 1998). In addition, NMDS allows the identification of non-linear gradients of the environmental covariates. However, the disadvantage of using ordination methods is that the original input variables are projected into a reduced set of components or axes, so that their individual effect is no longer identifiable (Strobl et al. 2009).

This interpretability problem can be avoided by using decision or regression trees, such as classification and regression trees (CART; Breiman et al. 1984), conditional inference trees (Hothorn et al. 2006) or chi-squared automatic interaction detection (CHAID; Kass 1980). Besides allowing the use of multiple variables in the data set, these nonparametric procedures are capable of identifying interactions between the variables which are too complex to be captured by parametric regression models (Strobl et al. 2009). This can be especially useful in the field of ecology. Forest data have been analysed using regression trees (CHAID, Álvarez-Álvarez et al. 2011; conditional inference trees, Barbeito et al. 2012), but their use in regeneration studies is scarce (Fei 2010). However, simple tree models are vulnerable to small changes in the learning data (Strobl et al. 2008). Machine learning methods, such as the random forests algorithm (Breiman 2001) or boosted regression trees (Elith et al. 2008), solve the problem of instability by averaging an ensemble of trees into a more robust composite model (Cutler et al. 2007; Strobl et al. 2009).

Natural, rather than artificial, regeneration is the preferred option in Mediterranean areas. The shelterwood system is commonly employed to ensure the establishment of new Scots pine seedlings (Pinus sylvestris L.) in Central Spain (Cañellas et al. 2000). However, regeneration is sometimes hampered by excessively short regeneration periods or by a number of abiotic and biotic factors, such as competition from grass or the large fluctuation in cone production over the years (Barbeito et al. 2011; Cañellas et al. 2000; González-Martínez and Bravo 2001) which is especially notable in masting species such as Pinus pinea L. (Calama et al. 2011; Manso et al. 2013). With regard to the abiotic factors, the survival of P. sylvestris seedlings is closely linked to climate. In particular, summer drought can cause complete seedling mortality during the first summer after the regeneration fellings (Pardos et al. 2007). In addition, light availability has been shown to be relevant for regeneration on a small scale, with moderate light conditions being optimum to maximize the regeneration density of Scots pine in the Central Mountain Range in Central Spain (Barbeito et al. 2009; Pardos et al. 2007). Topsoil variables, such as soil moisture and nutrients, as well as soil preparation have also been identified as important factors in forest regeneration (Barbeito et al. 2011).

The aim of this study is to employ and compare five different statistical techniques in order to evaluate the influence of environmental factors on natural regeneration, using as a case study of Scots pine forest in Central Spain. The techniques considered are (i) a generalized linear mixed model (GLMM), (ii) data reduction using the PCA ordination method prior to GLMM, (iii) ordination using the NMDS, (iv) CHAID algorithm and (v) random forests algorithm and the conditional inference trees. Additionally, we aimed to answer the following question: What is the most suitable method to evaluate tree recruitment and factors implicated? We compared the results obtained using the different methods, then assessed and discussed the strengths and weakness of each methodology. We calculated the goodness of fit for each methodology to assess how well the different models fit the observations. Measures of goodness of fit summarize the discrepancy between the observed values and the values expected under a statistical model (Maydeu-Olivares and García-Forero 2010). The importance of the methodology used in regeneration studies will be evident if not all the approaches identify the same biotic and abiotic variables that are associated with the regeneration abundance. Additionally, owing to the presence of complex interactive effects among the underlying variables and to correlations among the potential covariates used to represent those variables, it is expected that the parametric modelling strategies considered in this study would fail to identify the importance of covariates detected as important by decision trees.

2 Materials and methods

2.1 Study site

This study was conducted at the mountain forest of Navafría (41° 00′ N, 3° 48′ W), located on a north-facing slope in the Central Mountain Range in Spain. The altitude of this Scots pine forest ranges from approximately 1200 to 2200 m, and the altitudinal position of the timberline is around 1800 m. Annual rainfall exceeds 730 mm, and mean annual temperature is around 9.9 °C. These mountains are composed of granite and gneiss, with fairly homogeneous soils throughout the pinewoods, predominantly humic cambisol-type soils or leptosols at the higher sites (Forteza et al. 1988) according to the FAO taxonomic soil classification of 1989. The site index of these stands ranges between 21.4 and 29.9 m (Montes et al. 2007). Scots pine forms pure stands in the middle and high altitudes of the forest, whereas in the lower parts of the forest, this species grows alongside Quercus pyrenaica Willd. Some shrubs, such as Genista florida L., Cytisus scoparius (L.) Link and Cistus laurifolius L., appear in patches of forest which are at the reinitiation stage of the stand development.

The Navafría forest is managed using a permanent block plan with a 100-year rotation period. The management plans date from the end of the nineteenth century. The forest is divided into three working groups and these in turn into three wood production circles. Each wood production circle is made up of five blocks (García López 1994). Natural regeneration is achieved through the uniform shelterwood system over a 20-year regeneration period. Artificial regeneration is only employed in exceptional circumstances as a supplementary measure or aid. The silvicultural management applied consists of an intensive thinning from below regime from years 30 to 80 (removing subcanopy trees) or mixed (removing dominated and co-dominant trees). The canopy is then opened, leaving a density of 200–250 trees ha−1 at the beginning of the regeneration period. Additional trees are subsequently removed in a uniform manner throughout the regeneration block in two or three harvests over the regeneration period, leaving a low residual tree density. Thus, the seedlings become established under the protection of the older trees (Loftis 1990). The remaining trees (30–40 trees ha−1) are finally harvested at 100 years of age. In order to facilitate the establishment of the regeneration, soil preparation operations, subsoil and blade scarification are carried out before the final harvest. Through this management approach, natural regeneration of Scots pine throughout the forest is achieved successfully.

We installed a linear transect along a contour line in each of the five regeneration blocks. The five transects were installed after the first regeneration fellings and prior to the final felling, in the middle of the regeneration period. Square plots of 64 m2 in size were established for regeneration inventory purposes at 50-m intervals along these transects, making a total of 45 plots across the five blocks (Table 2). Each plot was subdivided into four square subplots where seedlings (individuals shorter than 130 cm) were counted. Additionally, all the trees and the stumps within a 15-m radius from the centre of each regeneration plot were mapped. Measurements included diameter at breast height (dbh) of the trees, total tree height and diameter of the stumps at ground level. The topsoil characteristics of each plot were also recorded, consisting of a mixture of samples taken from a depth of 20 cm and collected at the centre of each subplot to obtain the following topsoil variables: soil pH (in distilled water); percentage of sand, lime and silt according to the International Society of Soil Science; the percentage of oxidizable organic matter (Walkley and Black method); available P (Olsen method); colloidal K, Ca, Mg and Na (all atomic absorption); the electrical conductivity (in distilled water); and the stoniness (percentage of stones in the uppermost 20 cm of the soil) (Table 2). We took a hemispherical photograph at the centre of each plot at a height of 1.30 m above the ground to measure light availability (Table 2). The Global Site Factor (GSF) was then calculated using HemiView Canopy Analysis software (Delta-T Devices Ltd.). The GSF is the proportion of total radiation under a plant relative to that in the open, ranging from 0 (completely closed canopy) to 1 (completely open canopy). We also measured plot slope, expressed as a percentage and the altitude (Table 2). We considered the same altitude for all the plots in each transect.

Table 2 Summary of the main characteristics of the study area

In order to assess the influence of local stand structure, two indices for the measured trees were calculated: the influence potential (IPOT) and the index of influence (INF; Woods and Acer 1984) of the retained trees at a given point. IPOT is based on the concept of ecological field theory (Wu et al. 1985), empirically modified by Kuuluvainen and Pukkala (1989):

$$ {\mathrm{IPOT}}_j=1-{\mathrm{GPOT}}_j $$
(1)

where

$$ \begin{array}{l}{\mathrm{GPOT}}_j={\displaystyle \prod_{i=1}^{n_j}\left(1-{\beta}_{ij}\right)\kern0.5em }\\ {}\mathrm{and}\ {\beta}_{ij}=\left({\mathrm{dbh}}_{ij}/ \max \left(\mathrm{d}\mathrm{b}\mathrm{h}\right)\right) \exp \left(-{b}_{ij}\cdot {s}_{ij}\right)\end{array} $$

β ij is the potential influence of tree i at plot centre j, s ij is the distance from tree i to the plot centre and dbh ij is the diameter of tree i at the plot centre j. Max(dbh) is the maximum diameter at breast height in the data set; 95 cm in our data set. The parameter b ij was replaced by 1/(a h ij ), where h ij is the height (m) of tree i at plot centre j and a is a parameter to be estimated. Alternative values of a from 0 to 1 by 0.1 based on the correlation between regeneration density and IPOT were tested. The highest correlation between a and regeneration density was found for the value a = 0.4. IPOT ranges from 0 (no competition) to 1 (maximum competition).

Moreover, the INF competition index was defined as

$$ {\mathrm{INF}}_j={\displaystyle \sum_{i=1}^n{\mathrm{dbh}}_{ij} \exp \left(-2{s}_{ij}^2{\mathrm{dbh}}_{ij}^{-1}\right)} $$
(2)

where INF j represents the index of influence in a plot centre j, and s ij and dbh ij are defined above. INF acquires low values in gaps and high values in dense overstory patches (Paluch 2005).

The target variable in our case study was the number of seedlings per plot, i.e. count data. This is an important attribute that should be considered in the method selection. All the measured variables (edaphic, light, topographic and stand condition variables; Table 2) were included as explanatory variables in the following fitted models.

2.2 Parametric regression: GLMM using a negative binomial

Generalized linear mixed models (GLMMs) extend linear regression mixed models to accommodate non-constant variance through a variance function, non-linear relationships and certain types of non-normally distributed errors (Bolker 2008; Hardin and Hilbe 2012; Venables and Ripley 2002). There are three steps in GLMMs: (1) choose a distribution function for the response variable, which is a function from the exponential family; (2) define a linear predictor function specifying the covariates; and (3) link the predictor function and the mean of the distribution (Zuur et al. 2007). The connection of the mean of the distribution function to the linear predictor is done using link functions, e.g. logit, log or probit (Venables and Ripley 2002). Count data may be assumed to follow a Poisson distribution or a negative binomial distribution. However, in the Poisson distribution, a single parameter quantifies both the mean and the variance, so the use of Poisson distribution is reduced to data that matches the requirement of having a value of variance similar to the mean value (Crotteau et al. 2014; Lawless 1987). When the variance values are higher than the mean, the negative binomial distribution is more appropriate than the Poisson distribution (Bliss and Fisher 1953; Venables and Ripley 2002). Although the negative binomial distribution cannot strictly be included in the exponential family, negative binomial models can be fitted using a small extension of the GLMMs (Bolker 2008). The negative binomial distribution is characterized by a dispersion parameter, θ, which relates the mean (μ) of a response variable y and its variance: V(y) = μ + (μ 2/θ) (Bolker 2008; Crotteau et al. 2014). This parameterization of the variance, termed NB2 by Hardin and Hilbe (2012) or restricted negative binomial by Crotteau et al. (2014), assumes that the value of θ is constant across factors.

Observations recorded in forestry and ecology frequently present spatial or temporal correlation: several measurements are taken from the same subject or the sampling is carried out hierarchically. If these correlations are ignored, the assumption of independence of the observations and the error terms is violated and the standard error and p values of the covariates might be affected (Zuur et al. 2007). The correlation among measurements can be taken into account by including random effects in the model (Fisher 1918; Henderson et al. 1959; Zuur et al. 2009). Pure spatial and temporal correlation can be addressed through semivariograms or correlograms, respectively (Zuur et al. 2009). Nevertheless, in our study, subplots are nested within the plots and the plots in turn nested within the transect. Hence, a plot random intercept and a plot nested within transect random intercept effects were included in the model. Thus, the full negative binomial mixed model can be written as follows:

$$ y=\mathrm{negative}\ \mathrm{binomial}\left(\mu, \theta \right) $$
(3)
$$ \log \left(\mu \right)=X\beta +Zu $$
(4)

where X is the matrix of the covariates, β is the vector of the unknown (although estimable) parameters of fixed effects, Z is the random-effects design matrix and u is the vector of random effects with mean zero and variance to be estimated. After fitting the model, we used the McFadden’s pseudo-R 2 to assess the goodness of fit (Domenich and McFadden 1975):

$$ \mathrm{M}\mathrm{c}\ \mathrm{Fadden}\hbox{'}\mathrm{s}\ {R}^2=1-\frac{-2 \log {L}_{\mathrm{Full}}}{-2 \log {L}_{\mathrm{Null}}} $$
(5)

where L Full and L Null are the likelihood of the full model and the likelihood of the null model. McFadden’s pseudo-R 2 values can range between 0 and 1. Nevertheless, the resulting values are not equivalent to the classical R 2 values. Indeed, values of McFadden’s pseudo-R 2 between 0.2 and 0.4 should be taken to represent a very good model fit (Louviere et al. 2000).

To avoid over parameterization, the inclusion of the random effects, the covariates (Table 2) and the interaction of the variables in the model was carried out via a forward stepwise procedure according to the likelihood ratio test using a [Pr > (Chi)] = 0.05 as a confidence level. This analysis was conducted by using the “glmmadmb” function of the “glmmADMB” package (Fournier et al. 2012) in R 3.0.2. (R Core Team 2013).

2.3 Including uncorrelated variables in the GLMM. PCA as ordination method

In the former section (2.2), the explanatory variables were included sequentially to avoid collinearity. A second option to deal with this problem is presented in this section. The measured covariates represent a multivariate data set, a collection of sites positioned in a space where each variable defines one dimension (Borcard et al. 2011). In this case, an ordination method can be used to reduce the number of variables to a new set of uncorrelated variables, which can be therefore included as explanatory variables in the GLMM fitting. The PCA is a classic multivariate technique (Legendre and Legendre 1998). It reduces the dimensionality of a group of variables into a new set of uncorrelated variables, called the principal components. The principal components retain all of the variable information, but weight linear combinations of the original variables (Abdi and Williams 2010; Jolliffe 2002; Legendre and Legendre 1998). The first principal component axis data set is the line that goes through the most explanatory dimension describing a multinomial distribution. The following axes are orthogonal to one another and explain successively less (Abdi and Williams 2010; Borcard and Legendre 2002). The PCA is an eigenvector-based method and requires the construction of a disperse matrix S, which assesses the distance among observations. The mathematical details are beyond the scope of this study, but this information can be found in Jolliffe (2002) and Legendre and Legendre (1998). The requirement of normality of the original variables in the PCA analysis has been widely discussed, and some authors state that normality is not required (Jolliffe 2002; Zuur et al. 2010), whereas others argue that data must be normally distributed (Borcard and Legendre 2002).

We carried out two PCAs, one using edaphic variables and the other using the competition and light variables. The eigenvectors and the scores of the principal components were calculated after applying an orthogonal varimax rotation (Legendre and Legendre 1998). The PCA was conducted using the “principal” function of the “psych” package (Revelle 2013) in R. Seedling abundance was then modelled using Eq. (1). In this case, the predictors were the scores of the first and second principal components of each PCA and the topographic variables (slope and altitude). Two or three components are commonly used following ordination methods (Barbeito et al. 2011; González-Martínez and Bravo 2001) because those components typically explain a large amount of the total variance. The criteria described in the previous section (2.2) were also used for the inclusion of covariates. As a measure of goodness of fit, we used the statistics described in the previous section.

2.4 Ordination using NMDS

NMDS (Shepard 1962) is another ordination method. As with PCA, it is also an ordination technique, because it reduces a multivariable set into orthogonal univariate axes. However, unlike PCA and other ordination methods, NMDS is not based on an eigenvalue solution (Legendre and Legendre 1998). This method requires the construction of a community dissimilarity distance matrix (Legendre and Legendre 1998). For this purpose, any measure of association can be used, not only correlation, which sometimes is not the most appropriate measure (Zuur et al. 2007). Several distance measures can be employed to build the dissimilarity distance matrix, e.g. Bray-Curtis, Manhattan or Jaccard. Measure selection depends on the availability and nature of the data. To calculate those dissimilarities, numerical optimization methods have to be used. NMDS is an iterative method. The iterative procedure attempts to position the objects in the three dimensions so as to minimize a stress function (scaled from 0 to 1), which measures how far the distances in the reduced-space configuration are from the original distances (Borcard et al. 2011; Legendre and Legendre 1998). The mathematical details can be found in Legendre and Legendre (1998).

We selected the Bray-Curtis empirical dissimilarity distance matrix (Bray and Curtis 1957) to reduce the observed 19 variables (Table 2) in three dimensions or axes. We selected this distance because it is one of the most appropriate metric distance for community ecology data (Legendre and Fortin 1989), and it has been used in recruitment pattern assessment before (Ledo et al. 2015). We then plotted the Shepard diagram (Borcard et al. 2011; Legendre and Legendre 1998) to compare the original (empirical) distances with the ordination distances in the three-dimensional space. The goodness of fit of the ordination is measured as the R 2 of both a linear and a monotone (nonparametric) regression of the NMDS distances on the original ones (Borcard et al. 2011).

We grouped the number of seedlings for each subplot into a categorical variable with two levels: one corresponding to the fourth quartile (Q4) and the other to the first, second and third quartiles (Q123) of the distribution of the number of seedlings. We chose this division because the number of seedlings was relatively homogeneous in the first, second and third quartiles of the distribution, but increased greatly in the fourth quartile. We plotted the centroids of the quartiles to interpret the environmental variables with respect to the quartiles. We drew an ordination hull to enclose and cluster the environmental variables into the two levels (Q123 and Q4). Ellipses for the 95 % confidence areas of the class centroids were also calculated (Oksanen 2013). Furthermore, smooth surfaces of covariates to ordinations can be fitted using generalized additive models (Cayuela et al. 2006; Oksanen et al. 2013). NMDS was conducted using the function “metaMDS” implemented by the R package “vegan” (Oksanen et al. 2013).

2.5 Decision tree automatic classification method: CHAID algorithm

CHAID is a nonparametric procedure that splits a data set (root node) into groups, classes or segments that differ with respect to the response variable (Kass 1980; Kleppin et al. 2008). CHAID does not necessarily produce dichotomous categories, and therefore, a node can be split into more than two categories (Álvarez-Álvarez et al. 2011; van Diepen and Franses 2006). No distributional assumptions of the data are required in the CHAID, which implies a great advantage over other methods such as the GLMM parametric approach that we followed in Section 2.2. The CHAID mode operates as follows: in a first step, continuous variables are divided into a number of categories with approximately equal numbers of observations, since the CHAID procedure operates on nominal variables (Álvarez-Álvarez et al. 2011; van Diepen and Franses 2006). CHAID then splits the data for the specific predictors which exhibit the strongest degree of association with the response variable (Álvarez-Álvarez et al. 2011; Kleppin et al. 2008; van Diepen and Franses 2006). Since CHAID is a forward stepwise approach, it is possible that a better segmentation solution can be found by using another variable at an earlier stage (Vriens 2001). Mathematical details of the method are in van Diepen and Franses (2006) and Kass (1980).

The root node in the present study is the seedling density. We used a significance level of 5 % of the chi-squared test of independence to decide which categories of each predictor to merge and which predictor to split. As the chi-squared test is only approximately chi-squared distributed, a large sample size is required (van Diepen and Franses 2006). CHAID was carried out using SPSS (IBM Corp 2012).

We evaluated the goodness of fit using the classic R 2:

$$ {R}^2=1-\frac{{\displaystyle \sum_{i=1}^n{\left({y}_i-{\widehat{y}}_i\right)}^2}}{{\displaystyle \sum_{i=1}^n{\left({y}_i-\overline{y}\right)}^2}} $$
(5)

where y i is the observed density, ŷ i is the estimated seedling density, \( \overline{y} \) is the mean of the density and n is the number of observations.

2.6 The random forests algorithm and conditional inference tree

The random forests algorithm (Breiman 2001) is a nonparametric technique that combines the prediction of many independent decision tree models into a robust composite model (Cutler et al. 2007), thereby improving the accuracy of the prediction (James et al. 2013). To achieve this, the random forests procedure generated 500 bootstrapped data sets from the original data set. The procedure then constructs, by default, 500 bagged regression trees on the bootstrapped training samples and aggregates the bagged trees for prediction (James et al. 2013; Strobl et al. 2009). The remaining observations not used to fit a given bagged tree, referred to as out-of-bag (OOB) observations, are used to compute the prediction accuracy of the given tree (Strobl et al. 2009). In addition, the algorithm permutes a random sample of the predictors for determining each split in each tree, producing diverse uncorrelated trees (James et al. 2013; Strobl et al. 2009). Thus, the preference for certain predictors due to their scale of measurement or importance is avoided (Barbeito et al. 2012). The preference for a given predictor could result in the bagged trees being quite similar to each other and, hence, highly correlated predictions (James et al. 2013). The predictor importance is the difference in prediction accuracy before and after permuting, averaged over all trees (Breiman 2001; Strobl et al. 2009). A conditional permutation scheme for the computation of the variable importance measure was used in order to account for the correlation among the variables (Strobl et al. 2008). The variable importance measure is based upon the mean decrease in accuracy of predictions on the out-of-bag samples when a given variable is excluded from the model (James et al. 2013). According to Grömping (2009), the prediction of the random forests algorithm can be estimated via mean square error from the OOB data (OOB-MSE) as

$$ \mathrm{O}\mathrm{O}\mathrm{B}\hbox{-} \mathrm{M}\mathrm{S}\mathrm{E}=\frac{1}{n}{\displaystyle \sum_{i=1}^n{\left({y}_i-{\widehat{y}}_{i\mathrm{O}\mathrm{B}\mathrm{B}}\right)}^2} $$
(6)

where ŷ iOBB denotes the average prediction for the ith observation from all trees for which this observation has been OOB. As with linear regression, with the average of sum of squares \( \mathrm{S}\mathrm{S}\mathrm{T}=\frac{1}{n}{\displaystyle \sum_{i=1}^n{\left({y}_i-\overline{y}\right)}^2} \), OOB-R 2 can be obtained as 1 − OOB-MSE/SST.

Finally, we built an unbiased tree based on non-parametrical conditional inference procedures for testing independence between response and each predictor. The split selection criterion is based on conditional inference tests or permutation test (Hothorn et al. 2006). We used a significance level of 5 % as a splitting node criterion in tree construction. The goodness of fit of the conditional inference tree was evaluated thorough the R 2 described in the CHAID section. We employed the functions cforest and ctree of the R package “party” (Hothorn et al. 2006; Strobl et al. 2008) to perform the random forests analysis and the conditional inference tree.

3 Results

3.1 Parametric regression: GLMM using a negative binomial

We assessed the influence of environmental variables on the number of seedlings per subplot through a negative binomial regression with a log link. Seedling distribution was explained through various models in which the different covariates were sequentially included (Table 3). The most accurate model (in terms of lower log-likelihood) describing seedling abundance included the content of Na, P and the number of stumps within a 15-m radius (Table 3). The Na (estimation coefficient ± standard errors = −12.90 ± 3.20, p < 0.0001) and P (estimation coefficient ± standard errors = −0.07 ± 0.029, p = 0.012) content showed a negative association, while stump density had a positive association (estimation coefficient ± standard errors = 0.09 ± 0.028, p = 0.0001) on regeneration density (Table 3). Altitude, GSF, the logarithm of electric conductivity and stoniness improved the model only very slightly (0.05 < [Pr > (Chi)] < 0.1) in terms of log-likelihood, so these were not included in the model. GSF (estimation coefficient ± standard errors = 1.18 ± 0.71, p = 0.0941) and altitude (estimation coefficient ± standard errors = 0.016 ± 0.0008, p = 0.0579) were positively correlated with seedling density, whereas the logarithm of electric conductivity (estimation coefficient ± standard errors = −0.35 ± 0.20, p = 0.0832) and stoniness (estimation coefficient ± standard errors = −0.01 ± 0.01, p = 0.0657) was negatively correlated. As regards the interactions among variables, we did not find a significant effect of any of the three interactions tested (Table 3). The model including the plot random effect performed better, in terms of likelihood, than the model without random effects. This fact pointed out the importance of considering the spatial correlation within the plots in the model (Table 3). McFadden’s pseudo-R 2 was 0.022, indicating poor model fit (Louviere et al. 2000).

Table 3 Summary of the selection process of the environmental variables and the fitting statistics of the forward stepwise regression

3.2 Including uncorrelated variables in the GLMM. PCA as ordination method

The edaphic variables were reduced to two principal components, the first component was mainly related to the available Ca and Mg, whereas the second component was related to the pH and the electric conductivity (Table 4). Together, the two components explained 32 % of the total variance. Concerning the light and competition variables, the PCA reduced the number of variables to two principal components. The number of trees was related to the first component and the number of stumps to the second (Table 4). In this PCA, the two components explained 48 % of the total variance.

Table 4 Eigenvectors of the environmental variables obtained from the PCAs

We found that the second component of the PCA of the light and competition variables (related to the number of stumps) was positively associated (estimation coefficient ± standard errors = 0.342 ± 0.119, p = 0.004) in terms of likelihood ratio test (([Pr > (Chi)] < 0.05); Table 5) to the seedling density. We did not find a significant effect ([Pr > (Chi)] > 0.05) of the two other components or the two topographic variables (slope and altitude) on the number of seedlings. In this regression, the variance of the plot random intercept was 0.45. Regardless of the goodness of fit, McFadden’s pseudo-R 2 was 0.01 indicating a quite poor model fit (Louviere et al. 2000).

Table 5 Summary of the selection process of the principal components and the fitting statistics of the forward stepwise regression

3.3 Ordination with NMDS

The NMDS (Section 2.4) reduced the dimensionality of the 19 variables to three dimensions; one related to the light variables and to the indices of the influence of retained trees on regeneration, the second related to the soil variables, and the last related to the number of stumps. The R 2 of the linear ordination and the monotone of regression of the NMDS distances on the original ones were 0.88 and 0.98, respectively. The stress value was minimized to 0.14. The NMDS diagram shows that the entire hull of Q4 was enclosed by the hull of Q123 (Fig. 1). The ellipses for the 95 % confidence areas of both classes were almost overlapped, indicating only very slight differences between classes. IPOT and Mg centroids were enclosed by Q123 but not by Q4. The gradient (via contours) of the projected smooth surfaces of IPOT and Mg influenced regeneration density (Fig. 1). The projected smooth surfaces of the INF and the number of trees followed the same pattern as the IPOT, parallel with the IPOT projected smooth surfaces; whereas the surfaces of the number of stumps was also parallel with respect to the IPOT projected smooth surfaces but showed a positive association with the regeneration density. Moreover, both the contours of the projected smooth surfaces of the Ca and the Na followed the same pattern than the projected smooth surfaces of the Mg, parallel to the latter and displaying a negative relationship with the regeneration density. The rest of the other variables overlapped each other and were very close to the centroids of Q4 and Q123. Therefore, we assumed that there was no association with the regeneration density; hence, they are not shown in Fig. 1.

Fig. 1
figure 1

NMDS ordination of the variables showing the centroids of the quartiles of the number of seedlings. Variables within Q4 hull are not shown to facilitate the visualization. Red ellipse and red hull Q4 (fourth quartile of the number of seedlings). Blue ellipse and blue hull Q123 (first, second and third quartiles of the number of seedlings). Yellow smooth surfaces IPOT gradient. Green surfaces Mg gradient

3.4 Decision tree automatic classification method: CHAID algorithm

The CHAID procedure split the data into two main branches according to the Na concentration. Thus, this variable played the most statistically important role (p < 0.0001) in the subplot seedling density (Fig. 2). The lowest concentrations of Na (≤0.030 mEq/100 g) were related to the highest number of seedlings. In the case of higher concentrations of Na (>0.030 mEq/100 g), the algorithm split the data into six branches according to the number of retained trees (p < 0.0001). The distribution of the data in these six branches did not follow a linear pattern. The R 2 of the CHAID was 0.23, pointing to a poor model fit.

Fig. 2
figure 2

Regression tree as a result of the CHAID algorithm for seedling density. Mean indicates the mean number of seedlings per subplot and n the number of subplots per node

3.5 Random forests algorithm and conditional inference tree

Across all the trees considered in the random forests algorithm (Section 2.6), the content of sodium and the number of stumps were the two most important variables since they reached the greatest values of conditional variable importance (Fig. 3). The OOB-MSE and the OOB-R 2 of the random forests algorithm were 434.77 and 0.17, respectively.

Fig. 3
figure 3

Conditional variable importance measured following the permutation principle of the mean decrease in accuracy importance in the random forests algorithm. Larger values of conditional variable importance indicate more importance in the random forests model

The first node of the conditional inference tree split the data according to the content of Na (p < 0.001) (Fig. 4). The largest number of seedlings was related to the lowest concentrations of Na (≤0.040 mEq/100 g). For the highest concentrations of Na (>0.040 mEq/100 g), the branch was divided according to the number of stumps (p = 0.002), which were positively related to the number of seedlings per subplot (Table 6). The R 2 of the conditional inference tree was 0.17, indicating a poor model fit.

Fig. 4
figure 4

Implementation of conditional inference trees for seedling density into the defined theory of conditional inference procedures. Mean indicates the mean number of seedlings per subplot and n the number of subplots per node

Table 6 Significant variables involved in the regeneration process according to the alternative approaches employed

4 Discussion

4.1 Factors underlying Scots pine regeneration in Central Spain

A minimum of 2000 ha−1 (4 seedlings per subplot) denotes a sufficient regeneration density for Scots pine (e.g. Hyppönen et al. 2013). In our study, we found that at least 4 seedlings per subplot (2500 seedlings ha−1) were present in 99 % of the subplots suggesting regeneration success and good management practices. Therefore, we can state that the ranges of the measured variables are suitable for achieving natural regeneration. The current values of these variables are within a range that does not limit the regeneration process in the studied forest. Thus, some of the relationships found between the environmental variables and the regeneration density may be arbitrary rather than causal.

The different methods pointed to similar, although not identical, factors underlying the regeneration process (Table 6). The models obtained using the different methods had low goodness of fit values. This indicates that the low association between the environmental variables and the seedling density may be due to the lack of limiting factors in the regeneration process.

The parametric regression, the CHAID, the random forests algorithm and the conditional inference tree found a negative association between available Na and the seedling density. This negative relationship can be explained by the salt sensitivity of Scots pine (Bravo-Oviedo and Montero 2008). However, the sodium concentrations in the study area are lower than those found in other Scots pine forests (Bravo and Montero 2001). Furthermore, there was a positive relationship between the number of stumps and the seedling density according to the GLMM, the GLMM after PCAs reduction, the random forests algorithm and the conditional inference tree. The number of stumps is related to recently cut trees, so seedling density is directly related to canopy openness indicating that higher intensities in the regeneration fellings favoured the establishment of seedlings. In addition, the extraction of the wood resulted in a soil preparation effect, which facilitated the establishment of the Scots pine seedlings (Barbeito et al. 2011). The CHAID found an interaction between the highest levels of Na and the remaining trees. It indicates that the Na is the most important factor driving the regeneration, and when its concentration increases over the optimum, other factors, the remaining trees also appeared as a driver. The NMDS placed the two classes (the fourth quartile and the first, second and third quartiles of the distribution of the number of seedlings) quite close to each other reporting a weak negative relationship between the number of remaining trees and the competition indices with the regeneration density. The remaining trees play an important role in the regeneration, as Scots pine seedlings prefer moderate light conditions in Mediterranean areas (Barbeito et al. 2009; Pardos et al. 2007) rather than intensive light conditions as in Northern countries (Valkonen 2000). Furthermore, in Mediterranean areas, the success of Scots pine regeneration depends on summer droughts, so silviculture operations in our study area is also aiming at reducing the competition for resources thorough the soil preparation and thinnings (Barbeito et al. 2011).

4.2 Methods comparison

The variables used to predict the seedling recruitment were highly correlated. Including this in a regression model is a problem because it increases the risk of type I errors (Zuur et al. 2007). The PCA solved this problem reducing the data into uncorrelated components. Nevertheless, the full-dimensional GLMM had higher values of the goodness of fit statistics (McFadden’s pseudo-R 2) than the GLMM including the components of the PCAs. In addition, the reduced components present another flaw, which is that the effect of the individual covariates is not fully identifiable (Strobl et al. 2009).

NMDS could also have been used to reduce the dimensionality of the data for the parametric regression. However, since the NMDS is based directly on the dissimilarities, it does not provide correlations between derived axis scores and the variables, whereas techniques based on the eigenvalues and eigenvectors allow the researcher to relate, at least in part, the original variables to the components (Quinn and Keough 2001). Therefore, if the objective of the ordination technique is to reduce the data to use the scores of the axes in a regression model, the eigenvalue techniques may be more useful, since the correspondence to the component-variables is maintained. However, eigenvalue techniques are based only on correlation or covariate coefficients and limited to Euclidean distances, and this may not be the most appropriate measure of association (Legendre and Legendre 1998; Zuur et al. 2007). In addition, the method can also proceed with missing distance estimates as long as there are enough measures left to position each object with respect to a few of the others (Legendre and Legendre 1998). Furthermore, the NMDS allowed to fit non-linear relationships between the covariates to ordinations using generalized additive models, whereas other methods, such as PCA, imply linear relationships and it cannot always be appropriate (Oksanen 2013).

As regards the regression trees, the R 2 of the CHAID was larger than that of the conditional inference tree. Additionally, the CHAID has the main advantage of splitting each node into more than two branches, i.e. it is not a binary partitioning method (Kass 1980; Kleppin et al. 2008; van Diepen and Franses 2006) as the conditional inference tree (Hothorn et al. 2006). However, since the CHAID is a forward stepwise method, i.e. not all the variables are considered simultaneously but sequentially, there is the possibility that a better segmentation solution can be found using another variable at an earlier stage (Vriens 2001). Thus, CHAID cannot guarantee a single optimal solution (Perreault and Barksdale 1980). In relation to this, the bias in the variable selection is solved by the conditional inference trees using the permutation test principle (Hothorn et al. 2006). Both decision trees, the CHAID and the conditional inference tree, found second-order interactions. In addition, decision trees may outperform the classical approaches if there is a non-linear and complex interaction between the variables and the results of the decision trees are easier to interpret than those of other regression models (James et al. 2013; Strobl et al. 2009). Nevertheless, the main weakness of simple tree models is their instability to small changes in the learning data (Strobl et al. 2008). The random forests algorithm solves the problem of instability by averaging an ensemble of trees into a more robust composite model (Cutler et al. 2007; Strobl et al. 2009). However, the R 2 of the CHAID was larger than that of the random forests algorithm.

Forest regeneration studies in Mediterranean areas have often been conducted using parametric regression or point pattern analysis (Table 1). Despite all the assessed models performed in a similar way and given the similar ecological results, the election of the statistical technique must be done according to the characteristics and structure of the data. Ecological data often presents both spatial and temporal dependence among the observations. The hierarchical structure of our design indicates spatial dependence among measurements. The GLMM was the only method which allows the spatial correlation between subplots to be considered entering random effects (Fisher 1918; Henderson et al. 1959; Zuur et al. 2007). In studies with multiple hierarchies, like ours, random effects are especially important (Ten Have et al. 1999). We found a significant effect of the plot random effect indicating high correlation among subplots within the same plot. The model with the plot random effect captured more variability than the model without the random part. The other methods employed did not consider this correlation which can provide biased results and hide the effect of some explanatory covariates. Model misspecification is a frequent mistake disregarded in ecological modelling. Additionally, this method also identified the largest number of variables associated with seedling density. Thus, in studies with nested data, the parametric regression model is the most useful statistical approach. In studies with a large set of predictors, decision trees and the random forests algorithm through the conditional variable importance and ordination methods can be used to select and reduce the number of variables to be included in the model. If complex interactions are expected and the data set is large, then decision trees are advisable (Strobl et al. 2009). Whether the researcher had to analyse data sets from a hierarchal design with complex interactions among variables, the parametric regression models could account both issues.