The study period covers the years 2007–2014, and the dataset comprises ground observations of 10 species and 13 phenophases at 52 stations in Poland (Figs. 1 and 2 and Table 1). The phenological ground observations used in this study originate from the recently re-established observational network run by the Institute of Meteorology and Water Management - National Research Institute (IMGW-PIB), which constitutes an important part of national climate monitoring.
The phenological observational network follows the BBCH methodology (abbreviated from the German “Biologische Bundesanstalt, Bundessortenamt und CHemische Industrie”), which is akin to the approach used in most European countries for plant species with similar growth stages (Meier 1997; Koch et al. 2009).
To account for the differing reliability of the data records at individual stations (due to the subjective nature of this kind of observation) and for the fact that they were collected at different locations, GIS-based kriging with external drift (Hudson and Wackernagel 1994), together with expert knowledge, was applied to detect observational outliers. Additionally, the database was revised following several proposals on phenological data quality made by Schaber and Badeck (2002).
Three types of data sources, commonly applied in phenological modeling, were tested as potential predictors in this study:
Preprocessed gridded meteorological data
Satellite-derived (remote sensing) products
Spatial (geographical) features of monitoring sites
The rationale behind this grouping was to determine the skillful scale of each group of predictors for near-surface plant phenological modeling. The selected phenological phases might not be equally reflected in every dataset, because plant species respond physiologically in different ways in the selected phenophases, and the remote sensing, meteorological, and spatial data therefore differ in sensitivity. Moreover, given the wide range of possible data sources with varying spatio-temporal resolution, the authors used only free-of-charge and easy-to-access data in order to make this modeling approach applicable in all areas with similar phenological stages (e.g., in other Central European countries). Further details on feature preselection and calculated indices are described below in “Meteorological derived indices,” “Moderate-Resolution Imaging Spectroradiometer-derived products,” and “Spatial features.” A brief summary of the applied predictors is included in Table 2.
Meteorological derived indices
The timing of plant developmental events is highly dependent on temperature, precipitation, and photoperiod conditions; therefore, the most common strategy is to correlate plant phenophases with weather conditions (Yan and Hunt 1999). To detect plant reactions to changes in the atmospheric environment, archived station measurements are normally used. However, in this study, the authors decided to use the high-resolution (ca. 27 km) E-OBS gridded dataset provided by the European Climate Assessment & Dataset (ECA&D; Haylock et al. 2008). Using a gridded dataset instead of in situ measurements reduced potential problems with data inhomogeneity and with phenological observations made at a considerable distance from the nearest measurement station. Moreover, the E-OBS dataset (Hofstra et al. 2009) assures high quality of the applied data and makes the assumptions of the developed phenological models usable in other European regions.
A wide group of agrometeorological indices derived from the E-OBS gridded temperature and precipitation data was used as potential predictors. In this study, we calculated a set of cumulative growing degree days (GDD) with thresholds from 0 to 8 °C at intervals of 1 °C (accumulated from January 1st) to account for the wide range of thermal sensitivities of particular plant species. Similarly, we took into account the different water needs of plants in different phenophases, which should be reflected in the cumulative growing precipitation days (GPD) calculated from January 1st onwards. Complementary thermal and pluvial conditions were represented by seasonal and monthly air temperature averages and by seasonal and monthly precipitation sums for each month of the current and previous year. Altogether, 42 meteorologically based features were created.
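The accumulation of GDD over a range of thresholds can be sketched as follows. This is a minimal Python illustration, not the authors' R code. It assumes the common definition in which each day contributes the excess of the daily mean temperature over the threshold (and nothing when the mean is below it); the function name `cumulative_gdd` and the sample temperatures are hypothetical. GPD is not shown because its exact definition is not given in the text.

```python
def cumulative_gdd(daily_mean_temp, base_temp):
    """Running sum of the daily excess of mean temperature over base_temp.

    Assumed definition: a day adds max(0, T_mean - base_temp) degree days.
    """
    total = 0.0
    series = []
    for temp in daily_mean_temp:
        total += max(0.0, temp - base_temp)
        series.append(total)
    return series


# Hypothetical daily mean temperatures (deg C), starting on January 1st.
temps = [-2.0, 1.0, 3.5, 6.0, 9.0]

# One cumulative series per threshold from 0 to 8 deg C in 1 deg C steps.
gdd_by_base = {base: cumulative_gdd(temps, base) for base in range(0, 9)}
```

Each entry of `gdd_by_base` would then serve as one candidate predictor, so that species with different thermal requirements can be matched to different thresholds.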
Moderate-Resolution Imaging Spectroradiometer-derived products
Observing vegetation from space poses a number of challenges related to atmospheric and soil effects, pixel aggregation techniques, and observation geometry (Testa et al. 2014). Each of these affects the obtained data differently, and they become especially problematic in high and mid-latitudes (Hird and McDermid 2009). Despite such limitations, many previous studies have shown that remotely sensed observations can still be a robust tool for monitoring the seasonal cycle of vegetation, even in areas that are difficult for satellite imagery (Karlsen et al. 2008). Moderate-Resolution Imaging Spectroradiometer (MODIS) level-3 vegetation products were used for detecting onset dates of particular phenophases. The following indices were used: Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Leaf Area Index (LAI), and Fraction of Photosynthetically Active Radiation (fPAR) (Knyazikhin et al. 1999; Huete et al. 2002). NDVI and EVI contain information about live green vegetation and are delivered as the MYD13Q1 and MOD13Q1 MODIS products in sinusoidal projection at 250-m resolution and 16-day intervals. Because the Terra and Aqua composites are interleaved, using both sensors together makes it possible to combine them into a product with an 8-day temporal resolution. Because the NDVI and EVI data are rather noisy, especially in the colder part of the year (Hird and McDermid 2009), the authors decided to use pixel values aggregated within the 10 × 10-km AtPol grid cells commonly applied in national-scale geobotanical research (Fig. 1; Zajac 1978; Komsta 2016). To smooth the raw MODIS data into daily time series, a spline algorithm was applied.
The remaining vegetation indices, LAI and fPAR, are 1-km products provided on a daily basis (Knyazikhin et al. 1999) and were likewise recalculated for the wider extent of the 10 × 10-km grid cells. LAI was used as an index of an important structural property of a plant canopy, namely the one-sided leaf area per unit of ground area. The fPAR index measures the proportion of available radiation in the photosynthetically active wavelengths (400 to 700 nm) that a canopy absorbs (Knyazikhin et al. 1999).
Additionally, the Interactive Multisensor Snow and Ice Mapping System (IMS) 4-km daily products obtained from the National Snow and Ice Data Center were used to detect the occurrence of snow cover (Brubaker et al. 2005). Where snow cover was detected, the original vegetation indices in the corresponding time series were replaced with zeros. Where the surface was not visible to the MODIS sensors (mostly due to cloud cover), the original MODIS values, which often provide the mean climatology, were replaced by values linearly interpolated between valid observations from the preceding and following periods.
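The two corrections described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: missing (cloud-obscured) values are encoded as `None`, gaps are filled before the snow mask is applied (the text does not specify the order), and the function names and sample values are hypothetical.

```python
def fill_gaps_linear(values):
    """Fill None entries by linear interpolation between valid neighbours.

    Leading or trailing gaps stay None because they lack a neighbour
    on one side.
    """
    filled = list(values)
    valid = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(valid, valid[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def apply_snow_mask(values, snow_flags):
    """Set the index to zero on every day flagged as snow-covered."""
    return [0.0 if snow else v for v, snow in zip(values, snow_flags)]


# Hypothetical NDVI series with two cloud gaps and one snow day.
ndvi = [0.25, None, 0.75, 0.5, None]
snow = [False, False, False, True, False]

cleaned = apply_snow_mask(fill_gaps_linear(ndvi), snow)
```

In this sketch the trailing gap is deliberately left unfilled, since interpolation needs a valid observation on both sides.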
Besides the most probable NDVI, EVI, LAI, and fPAR values for each day, the authors also distinguished a set of derivative predictors consisting of the following: normalized values of MODIS indices for every single station, raw and corrected indices accounting for different pixel reliability, rate of change in an index value between monthly and 10-day measurements, and 1-week rolling mean. Combining all selected variables gives a total of 64 MODIS-derived plant phenology indices.
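Three of the derivative predictors listed above can be illustrated with a short Python sketch. The details are assumptions, since the text does not fix them: normalization is taken to be min-max scaling per station, the rolling mean is a trailing 7-day window, and the rate of change is a simple difference over a 10-day lag. All function names and the sample series are hypothetical.

```python
def min_max_normalize(series):
    """Scale one station's series to the range 0..1 (assumed normalization)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


def rolling_mean(series, window=7):
    """Trailing mean; None until a full window of days is available."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out


def rate_of_change(series, lag=10):
    """Difference between the current value and the value `lag` days earlier."""
    return [None if i < lag else series[i] - series[i - lag]
            for i in range(len(series))]


# Hypothetical daily index values for one station (steadily greening up).
daily_index = [d / 16 for d in range(12)]

normalized = min_max_normalize(daily_index)
weekly_mean = rolling_mean(daily_index, window=7)
change_10d = rate_of_change(daily_index, lag=10)
```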
This set of phenological products was supplemented by the operational IMS snow products. On the basis of the grid value nearest to each station's location, five measures were calculated: occurrence of snow cover (in binary 0–1 form), consecutive number of days with and without snow cover, number of days with snow cover in a month, and day of the year with the last snow cover.
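The five snow measures can be derived from a daily binary snow series, as in the following Python sketch. It assumes that the series starts on January 1st and that each measure is evaluated using only data up to the current day; the function names and the sample series are hypothetical.

```python
def run_lengths(flags, target):
    """For each day, length of the unbroken run of `target` ending on that day."""
    out, run = [], 0
    for flag in flags:
        run = run + 1 if flag == target else 0
        out.append(run)
    return out


def last_snow_day(flags):
    """1-based day of year of the most recent snow day so far (None if none)."""
    out, last = [], None
    for doy, flag in enumerate(flags, start=1):
        if flag == 1:
            last = doy
        out.append(last)
    return out


# Hypothetical binary snow occurrence (1 = snow) for the first days of a year.
snow = [1, 1, 0, 0, 1, 0, 0, 0]

days_with_snow = run_lengths(snow, 1)
days_without_snow = run_lengths(snow, 0)
last_snow = last_snow_day(snow)
snow_days_in_january = sum(snow[:31])  # slice covers the days of January
```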
Spatial features
To find spatial dependencies for the analyzed locations, four geographical variables were used: longitude and latitude calculated in the projected coordinate system, altitude based on the corrected Shuttle Radar Topography Mission (SRTM-3) dataset (Reuter et al. 2007), and the distance in kilometers from each monitoring site to the Baltic Sea coastline. The latter feature was added to capture local processes observed in the Baltic coastal zone that make this area climatologically unique (Czernecki and Mietus 2017) but are not fully reflected by temperature- or precipitation-related indices. Adding this variable was intended to improve the overall quality of the models for stations located up to about 100 km from the coastline.
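The distance-to-coast feature can be approximated as in the sketch below. It is a Python illustration under stated assumptions: station and coastline coordinates are given in meters in the same projected coordinate system, and the coastline is represented by a dense set of vertices, so the distance to the nearest vertex stands in for the distance to the line itself. The function name and all coordinates are hypothetical.

```python
import math


def distance_to_coast_km(station_xy, coastline_xy):
    """Distance (km) from a station to the nearest coastline vertex.

    Coordinates are assumed to be projected and expressed in meters.
    """
    x, y = station_xy
    nearest_m = min(math.hypot(x - cx, y - cy) for cx, cy in coastline_xy)
    return nearest_m / 1000.0


# Hypothetical coastline vertices and station position (meters).
coast = [(0.0, 100000.0), (50000.0, 100000.0), (100000.0, 100000.0)]
station = (50000.0, 60000.0)

coast_distance = distance_to_coast_km(station, coast)
```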
Six commonly used statistical methods were tested and evaluated against the observed onset dates of the selected phenophases:
multiple linear regression (lm)
multiple linear regression with stepwise selection (lmAIC)
least absolute shrinkage and selection operator (lasso)
principal component regression (pcr)
generalized boosted models (gbm)
random forest (rf)
This study splits the previously described total number of 102 potential predictors into four sub-groups that might be applied according to the needs of statistical modeling:
meteorologically derived variables and location features only (meteo)
MODIS-derived predictors (modis)
all available variables preprocessed with the Boruta algorithm to find all relevant features (Kursa and Rudnicki 2010). The role of the Boruta algorithm is to remove features that prove to be less important than a random variable (boruta)
all available variables without any preselection (all)
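The idea behind the boruta sub-group, comparing each feature with randomized "shadow" copies, can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the Boruta algorithm itself: the importance score here is the absolute Pearson correlation with the target instead of a random forest importance, and the final decision is a majority vote over iterations instead of a statistical test. The function names and the synthetic data are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_select(features, target, n_iter=30, seed=0):
    """Keep features that beat the best shuffled copy in most iterations."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)  # shuffling destroys any link to the target
            shadow_scores.append(abs_corr(shadow, target))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > threshold:
                hits[name] += 1
    return [name for name, count in hits.items() if count > n_iter / 2]


# Synthetic data: one informative feature and one pure-noise feature.
data_rng = random.Random(1)
signal = [data_rng.gauss(0, 1) for _ in range(200)]
noise = [data_rng.gauss(0, 1) for _ in range(200)]
target = [2 * s + data_rng.gauss(0, 0.1) for s in signal]

selected = shadow_select({"signal": signal, "noise": noise}, target)
```

A feature survives only if it is consistently more informative than the best randomized copy, which is the criterion of being "less important than a random variable" described in the list above.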
To avoid using “future” data when building predictive models, only predictors that could be calculated by the typical onset date of a particular phenophase were used. For example, the Corylus avellana flowering phase, typically observed in March, could be modeled using only indices obtainable before and during this month. This approach ensures that the created models may also be applied as supplementary information supporting the national phenological network or in further investigations related to the spatial prediction of phenological phases.
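This restriction amounts to a filter on the date by which each predictor becomes available. A minimal Python sketch follows, assuming that every predictor is tagged with the day of year on which its value is fully known and that the cut-off for a phase typically observed in March is the end of March (day 90 in a non-leap year). The predictor names and the function name are hypothetical.

```python
def usable_predictors(available_doy, cutoff_doy):
    """Keep predictors whose values are fully known by the cut-off day."""
    return [name for name, doy in available_doy.items() if doy <= cutoff_doy]


# Hypothetical predictors mapped to the day of year when each is complete.
available_doy = {
    "temp_mean_january": 31,
    "temp_mean_march": 90,
    "precip_sum_may": 151,
}

# Assumed cut-off for a phase typically observed in March: end of March.
march_predictors = usable_predictors(available_doy, cutoff_doy=90)
```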
A k-fold cross-validation strategy was used to avoid overfitting and to estimate the accuracy of the models. For that purpose, the dataset was divided into eight 1-year subsets (2007–2014). Next, the model was trained on seven (k-1) years, and the held-out subset (1 year) was used to evaluate the model. This procedure was repeated eight times, so that each year served once as the evaluation set. The overall performance was obtained by averaging the k estimates of the performance (Kuhn and Johnson 2013).
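The year-wise splitting can be sketched in Python as follows. The example assumes that each record carries a `year` field and uses the study period 2007 to 2014; the record layout and the function name are hypothetical, and model fitting is left out.

```python
def leave_one_year_out(records):
    """Yield (held_out_year, train, test) with one fold per distinct year."""
    years = sorted({r["year"] for r in records})
    for held_out in years:
        train = [r for r in records if r["year"] != held_out]
        test = [r for r in records if r["year"] == held_out]
        yield held_out, train, test


# Hypothetical records: two stations observed in each year of 2007 to 2014.
records = [{"year": y, "station": s}
           for y in range(2007, 2015) for s in ("A", "B")]

folds = list(leave_one_year_out(records))
```

Splitting by whole years, and not by random rows, keeps all observations from the evaluated year out of training, so each fold mimics predicting an unseen season.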
The models’ performances were characterized using the coefficient of determination (R2) and the root-mean-square error (RMSE). R2 is the squared correlation coefficient between the observed and predicted values. RMSE is the square root of the mean squared difference between the predicted and observed values. Additionally, the distributions of model errors for selected cases were presented as histograms and scatterplots.
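Both measures can be computed directly from paired observed and predicted onset dates, as in this short Python sketch. R2 is computed as the squared Pearson correlation, following the definition in the text, and RMSE as the square root of the mean squared difference. The function names and the sample onset days are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Square root of the mean squared difference (same units as the data)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Hypothetical onset dates expressed as day of year.
observed = [80, 85, 90, 95]
predicted = [82, 83, 92, 93]

error_days = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

With onset dates expressed as day of year, RMSE reads directly as a typical error in days.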
The general effect of the independent variables on the gradient boosted models was determined using each variable’s “relative influence” (Friedman 2001). Values of variable influence (importance) were obtained separately for the models based on all data for each phenophase. The ten best predictors were then selected and divided into meteorological, MODIS-derived, and spatial categories. Afterwards, for each category of predictors, the mean variable importance was calculated and scaled in order to estimate which predictors contribute most to a model’s prediction (Fig. 6). All calculations were carried out using the R programming language (R Core Team 2016) and packages such as “Boruta,” “ranger,” and “caret,” which support machine learning techniques (Venables and Ripley 2002; Kuhn 2008; Kursa and Rudnicki 2010; Wright 2015).
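The aggregation of variable importance by category can be sketched as below. This is a Python illustration under stated assumptions, not the authors' R code: the scaling step is taken to mean rescaling the category means so that they sum to 100, which the text does not specify, and the predictor names, importance values, and function name are hypothetical.

```python
def category_importance(importance, category_of, top_n=10):
    """Mean importance per category among the top_n predictors, scaled to 100."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    grouped = {}
    for name, value in ranked[:top_n]:
        grouped.setdefault(category_of[name], []).append(value)
    means = {cat: sum(vals) / len(vals) for cat, vals in grouped.items()}
    total = sum(means.values())
    return {cat: 100.0 * mean / total for cat, mean in means.items()}


# Hypothetical relative influence values for one phenophase model.
importance = {
    "gdd_5": 40.0,
    "gdd_3": 20.0,
    "ndvi_rolling": 20.0,
    "evi_raw": 10.0,
    "altitude": 5.0,
}
category_of = {
    "gdd_5": "meteorological",
    "gdd_3": "meteorological",
    "ndvi_rolling": "MODIS-derived",
    "evi_raw": "MODIS-derived",
    "altitude": "spatial",
}

scaled = category_importance(importance, category_of)
```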