Abstract
This study compares the statistical predictability of surface wind components from mid-tropospheric predictors obtained by linear regression with that obtained by three nonlinear regression methods: neural networks, support vector machines and random forests. The results, obtained at 2109 land stations, show that the more complex nonlinear regression methods cannot substantially outperform linear regression in cross-validated statistical prediction of surface wind components. Furthermore, predictive anisotropy (the variation of statistical predictive skill with direction) is generally similar for the linear and nonlinear regression methods. However, there is a modest trend of systematic improvement in nonlinear predictability for surface wind components with fluctuations of relatively small magnitude or large kurtosis, suggesting that weak nonlinear predictive signals may exist in these situations. Although nonlinear predictability tends to be higher for stations with low linear predictability and nonlinear predictive anisotropy tends to be weaker for stations with strong linear predictive anisotropy, these differences are not substantial in most cases. Overall, we find little justification for the use of complex nonlinear regression methods in the statistical prediction of surface wind components, as linear regression is much less computationally expensive and produces predictions of comparable skill.










References
Amari SI, Murata N, Müller KR, Finke M, Yang HH (1996) Statistical theory of overtraining-is cross-validation asymptotically effective? In: Advances in neural information processing systems, pp 176–182
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Csáji BC (2001) Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary 24:48
Culver AM, Monahan AH (2013) The statistical predictability of surface winds over western and central Canada. J Clim 26(21):8305–8322
Davy RJ, Woods MJ, Russell CJ, Coppin PA (2010) Statistical downscaling of wind variability from meteorological fields. Boundary-layer Meteorol 135(1):161–175
Hastie TJ, Tibshirani RJ, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
He Y, Monahan A, Jones C, Dai A, Biner S, Caya D, Winger K (2010) Land surface wind speed probability distributions in North America: observations, theory, and regional climate model simulations. J Geophys Res 115:D04103
Holtslag A, Svensson G, Baas P, Basu S, Beare B, Beljaars A, Bosveld F, Cuxart J, Lindvall J, Steeneveld G et al (2013) Stable atmospheric boundary layers and diurnal cycles: challenges for weather and climate models. Bull Am Meteorol Soc 94(11):1691–1706
Hsieh WW (2009) Machine learning methods in the environmental sciences: neural networks and kernels. Cambridge University Press, Cambridge
van der Kamp D, Curry CL, Monahan AH (2012) Statistical downscaling of historical monthly mean winds over a coastal region of complex terrain. II. Predicting wind components. Clim Dyn 38(7–8):1301–1311
Kanamitsu M, Ebisuzaki W, Woollen J, Yang S-K et al (2002) NCEP-DOE AMIP-II reanalysis (R-2). Bull Am Meteorol Soc 83(11):1631
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Mao Y, Monahan A (2017) Predictive anisotropy of surface winds by linear statistical prediction. J Clim 30(16):6183–6201
MathWorks (2017a) fitrsvm. https://www.mathworks.com/help/stats/fitrsvm.html
MathWorks (2017b) Getting Started with Neural Network Toolbox. https://www.mathworks.com/help/nnet/getting-started-with-neural-network-toolbox.html
MathWorks (2017c) Understanding Support Vector Machine Regression. https://www.mathworks.com/help/stats/understanding-support-vector-machine-regression.html
Mohandes M, Halawani T, Rehman S, Hussain AA (2004) Support vector machines for wind speed prediction. Renew Energy 29(6):939–947
Monahan AH (2012) Can we see the wind? Statistical downscaling of historical sea surface winds in the subarctic northeast Pacific. J Clim 25(5):1511–1528
Platt J (1998) Sequential minimal optimization: A fast algorithm for training support vector machines. Tech. rep, Microsoft Research
Python (2016) RandomForestRegressor-scikit-learn 0.18.1 documentation. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
Ripley B, Venables W (2016) Package 'nnet'. https://cran.r-project.org/web/packages/nnet/nnet.pdf
Sailor D, Hu T, Li X, Rosen J (2000) A neural network approach to local downscaling of GCM output for assessing wind power implications of climate change. Renew Energy 19(3):359–378
Salameh T, Drobinski P, Vrac M, Naveau P (2009) Statistical downscaling of near-surface wind over complex terrain in southern France. Meteorol Atmos Phys 103(1–4):253–265
Stull RB (2000) Meteorology for scientists and engineers: a technical companion book with Ahrens’ Meteorology Today. Brooks/Cole
Sun C, Monahan A (2013) Statistical downscaling prediction of sea surface winds over the global ocean. J Clim 26:7938–7956
Vapnik V (2013) The nature of statistical learning theory. Springer Science & Business Media, New York
Wolfram (2016) WeatherData source information. http://reference.wolfram.com/language/note/WeatherDataSourceInformation.html. Accessed 1 Jan 2016
Yuval, Hsieh WW (2002) The impact of time-averaging on the detectability of nonlinear empirical relations. Q J R Meteorol Soc 128(583):1609–1622
Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563
Acknowledgements
The authors gratefully acknowledge helpful comments and suggestions from two anonymous reviewers. This research was supported by the Discovery Grants program of the Natural Sciences and Engineering Research Council of Canada.
Appendix: Nonlinear regression methods
Three nonlinear regression methods are used in this study: neural networks (NN), support vector machines (SVM) and random forests (RF). All three are supervised learning methods, in which a function is inferred from a set of training data consisting of predictor–predictand pairs in a process referred to as training. After training, the estimated function can be used to make new predictions. Supervised learning methods differ from each other in the algorithms they use to infer the regression function from the training data. A brief introduction to the NN, SVM and RF algorithms is presented below.
1.1 Neural network
The NN approach to nonlinear regression is inspired by the structure of biological neurons. There are many types of neural network; in this study we use feed-forward neural networks, the most widely applied type. In a feed-forward NN, the predictand is related to the predictors by a sequence of linked computational elements known as hidden layers (Hsieh 2009). Figure 11 shows the structure of a feed-forward NN used to model a regression with P predictors and K predictands using a single layer of M hidden neurons. This structure is similar to that of a two-stage regression model (Hastie et al. 2009).
In the first stage of the NN regression, the values of the hidden neurons \(Z=[Z_1,\ldots ,Z_m,\ldots ,Z_M]\) are computed by applying a function s to linear combinations of the inputs \(X=[X_1,\ldots, X_p,\ldots ,X_P]\),

\(Z_m = s\left(\alpha _{0m} + \alpha _m^{T} X\right), \quad m = 1,\ldots ,M,\)

where \({\alpha }_{0m}\) are the offset scalars and \({\alpha }_{m}\) are the weight vectors of the predictors X. The function s(v) is usually chosen to be the sigmoid function \(s(v) = 1/(1+e^{-v})\), which asymptotically saturates for large positive and negative values of v: \(s(v) \rightarrow 0\) as \(v \rightarrow -\infty\) and \(s(v) \rightarrow 1\) as \(v \rightarrow +\infty\). In the second stage of the regression model, each predictand \(Y_{k}\) is computed as a linear combination of the hidden neurons \(Z=[Z_1,\ldots ,Z_m,\ldots ,Z_M]\),

\(\widehat{Y}_k = \beta _{0k} + \beta _k^{T} Z, \quad k = 1,\ldots ,K,\)

where \(\widehat{Y}_k\) is the modeled value of the target \(Y_k\), \({\beta }_{0k}\) are the offset scalars, and \({\beta }_{k}\) are the weight vectors. Together, the parameters \(\alpha _{0m}\), \(\alpha _m\) and \(\beta _{0k}\), \(\beta _k\) are referred to as the weights of the NN model.
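As a minimal illustration of this two-stage structure, the forward pass can be written in a few lines of Python/NumPy; the function and variable names below are our own and the dimensions are arbitrary examples, not the configuration used in this study.

```python
import numpy as np

def sigmoid(v):
    # s(v) = 1 / (1 + exp(-v)); saturates towards 0 and 1 for large |v|
    return 1.0 / (1.0 + np.exp(-v))

def nn_forward(X, alpha0, alpha, beta0, beta):
    """Two-stage forward pass of a single-hidden-layer feed-forward NN.

    X      : (N, P) array of predictors
    alpha0 : (M,)   hidden-layer offsets alpha_0m
    alpha  : (P, M) hidden-layer weight vectors alpha_m
    beta0  : (K,)   output-layer offsets beta_0k
    beta   : (M, K) output-layer weight vectors beta_k
    Returns an (N, K) array of modeled predictands.
    """
    Z = sigmoid(alpha0 + X @ alpha)   # first stage: hidden neurons Z_m
    return beta0 + Z @ beta           # second stage: modeled Y_hat_k
```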
Training of the NN uses an optimization algorithm to seek the weights of the model which minimize the objective function \(J = \sum _{k=1}^{K}\sum _{i=1}^{N}(Y_{ik} - \widehat{Y}_{ik})^2\), where N is the number of observations. The initial weights can be chosen randomly. The minimization of the objective function J proceeds iteratively until some convergence criterion is met. The primary issue we need to be mindful of in training a neural network is overfitting. Measures to prevent overfitting should be considered in both the model architecture (i.e. the number of hidden neurons) and the model training.
Hidden neurons The complexity of a neural network increases with the number of hidden neurons. A neural network with one hidden layer containing a finite (but sufficiently large) number of neurons can approximate any continuous function to arbitrary accuracy (Csáji 2001). The modeling challenge is choosing the right number of hidden neurons. Models with too few hidden neurons might not have the flexibility to capture the nonlinear signal in the data; models with too many hidden neurons might be so flexible that they fit the noise in the data. The appropriate number of hidden neurons can be chosen by empirical testing (as described in Sect. 2).
Training methods Early stopping is a common method of preventing overfitting, in which the training process stops well before J reaches its global minimum. In this method, the training data are divided into two subsets. All data in the first subset undergo the training process described above to update the weights of the model; each iteration of the training is referred to as an epoch, and the training process is repeated for many epochs. The second subset (the validation data) is used to evaluate the objective function at each epoch. As the number of training epochs increases, the value of J over the validation data generally decreases at first before increasing. Training beyond the point at which J starts to increase only contributes to overfitting and is therefore stopped; the model parameters are those obtained at the minimum of J over the validation data. Note that this separation into training and validation sets is part of the parameter estimation process, and is distinct from the data subsetting associated with cross-validation.
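The reference list cites both the MATLAB Neural Network Toolbox and the R package 'nnet' for NN regression; purely as an illustrative analogue in Python, the scikit-learn sketch below shows how a single hidden layer, early stopping on a held-out validation fraction, and cross-validated selection of the number of hidden neurons might be combined. The synthetic data and all parameter values are placeholders, not those used in the study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for mid-tropospheric predictors X and a surface
# wind component y; shapes and values are arbitrary examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 6))
y = X[:, 0] + 0.5 * np.tanh(X[:, 1]) + 0.1 * rng.standard_normal(1000)

# early_stopping=True holds out a validation fraction of the training data
# and stops when its error stops improving, as described above.
for m in (2, 5, 10):  # candidate numbers of hidden neurons M
    nn = MLPRegressor(hidden_layer_sizes=(m,), activation="logistic",
                      early_stopping=True, validation_fraction=0.2,
                      max_iter=2000, random_state=0)
    score = cross_val_score(nn, X, y, cv=5).mean()  # cross-validated R^2
    print(f"M = {m:2d} hidden neurons: mean CV R^2 = {score:.3f}")
```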
1.2 Support vector machine
Support vector machines are characterized by the use of kernel functions which represent the mapping of observations in the input space into a high-dimensional feature space where linear regression can be used. Therefore, a nonlinear model in the input space can be learned from a subset of observations (support vectors) by linear regression in the high-dimensional feature space. The following mathematical formulation is based on the documentation of the Statistics and Machine Learning Toolbox in MATLAB (MathWorks 2017c).
Suppose \(x_n\) represents one case of training data in the input space with observed response value \(y_n\), and \(g(x_n)\) represents a mapping function which maps \(x_n\) from the input space to the feature space. The function \(f(x_n)\) used to model \(y_n\) can be constructed as

\(f(x_n) = g(x_n)'\beta + b,\)

where \(\beta\) is a weight vector in the feature space and b is an offset scalar.
The goal of SVM regression is to find a function f(x) that deviates from y by a value no larger than \(\epsilon\) for each observation of the training data, while at the same time being as flat as possible. Flatness requires that the weight vector of f(x) be as small as possible. The problem is formulated as minimizing the objective function

\(J(\beta ) = \frac{1}{2}\beta '\beta,\)
subject to the constraint that all residuals are less than \(\epsilon\): that is, for all n, \(|y_n - (g(x_n)'\beta +b)|\le \epsilon\). The set of x values satisfying \(|y-f(x)|\le \epsilon\) is known as the \(\epsilon\)-tube. However, for a given \(\epsilon\), not all observations may satisfy the constraint of falling in the \(\epsilon\)-tube; therefore, slack variables \(\zeta _n\) and \(\zeta ^*_n\) are used to allow the regression model to tolerate errors up to the value of \(\epsilon +\zeta _n\) or \(\epsilon +\zeta ^*_n\), where \(\zeta _n\) and \(\zeta ^*_n\) represent the upper and lower limits of the extension of the \(\epsilon\)-tube respectively (Fig. 12).
By including the slack variables, the minimization of the objective function becomes

\(J(\beta ) = \frac{1}{2}\beta '\beta + C\sum _{n=1}^{N}\left(\zeta _n+\zeta ^*_n\right), \qquad (15)\)
subject to: for all n
\(y_n-(x_n'\beta +b)\le \epsilon +\zeta _n\),
\((x_n'\beta +b)-y_n \le \epsilon + \zeta ^*_n\),
\(\zeta _n\ge 0\)
\(\zeta ^*_n\ge 0\).
The constant \(C>0\) controls the penalty on deviations larger than \(\epsilon\), and hence the degree to which such deviations are tolerated. The formulation of \(J(\beta )\) in Eq. (15) is also known as the primal formula (Vapnik 2013).
Estimating the weights of f(x) in the SVM regression of Eq. (15) corresponds to minimizing the \(\epsilon\)-insensitive loss function defined as

\(L_{\epsilon } = \begin{cases} 0 & \text{if } |y-f(x)|\le \epsilon \\ |y-f(x)|-\epsilon & \text{otherwise,} \end{cases} \qquad (16)\)
for which the errors associated with observations within the \(\epsilon\)-tube are ignored (Vapnik 2013). This optimization problem is computationally simpler to solve in its Lagrange dual formulation (MathWorks 2017c). Constructing a Lagrangian function for Eq. (16) requires nonnegative multipliers \(\alpha _n\) and \(\alpha ^*_{n}\) for each observation \(x_n\). The dual formulation of minimizing Eq. (16) involves minimizing

\(L(\alpha ) = \frac{1}{2}\sum _{i=1}^{N}\sum _{j=1}^{N}(\alpha _i-\alpha ^*_i)(\alpha _j-\alpha ^*_j)<g(x_i),g(x_j)> + \epsilon \sum _{n=1}^{N}(\alpha _n+\alpha ^*_n) - \sum _{n=1}^{N}y_n(\alpha _n-\alpha ^*_n), \qquad (17)\)
where \(<g(x_i),g(x_j)>\) is the inner product of the predictors after mapping, and Eq. (17) is subject to
\(\sum _{n=1}^{N}(\alpha _n-\alpha ^*_n)=0\),
and that for all n:
\(0 \le \alpha _n \le C\),
\(0 \le \alpha ^*_n \le C\),
\(\alpha _n(\epsilon +\zeta _n-y_n+f(x_n))=0\),
\(\alpha ^*_n(\epsilon +\zeta ^*_n-y_n+f(x_n))=0\),
\(\zeta _n(C-\alpha _n)=0\),
\(\zeta ^*_n(C-\alpha ^*_n)=0\).
These conditions indicate that the Lagrange multipliers satisfy \(\alpha _n=0\) and \(\alpha ^*_n=0\) when observations are inside the \(\epsilon\)-tube. The dual formulation Eq. (17) is solved using quadratic programming techniques, the details of which are beyond the scope of this paper but can be found in Platt (1998). The solution has the form

\(f(x) = \sum _{n=1}^{N}(\alpha _n-\alpha ^*_n)<g(x_n),g(x)> + b. \qquad (19)\)
In other words, the solution f(x) depends only on those \(x_n\) for which \((\alpha _n - \alpha ^*_n) \ne 0\); this subset of the training data, denoted the support vectors, falls within a distance \(\zeta\) or \(\zeta ^*\) of the boundary of the \(\epsilon\)-tube, as shown in Fig. 12. Since, by construction of the SVM regression model, most cases of the training data lie inside the \(\epsilon\)-tube, the number of support vectors is small compared to the number of observations in the training data. The transformation function g(x) that maps each \(x_n\) of the training data from the input space to the feature space is unknown, but \(<g(x_n), g(x)>\) in Eq. (19) can be represented by a kernel function \(K(x_n,x) = <g(x_n), g(x)>\). There are many admissible kernel functions; some common kernel types are listed below:
- linear kernel: \(K(x_n,x) = <x_n,x>\)
- polynomial kernel: \(K(x_n,x) = (1+{x'}_nx)^p\), where \(p=2,3,4\ldots\)
- radial basis function kernel: \(K(x_n,x) = \exp (-\gamma {\left\| x_n-x \right\| }^2),\ \gamma >0\)
- sigmoid kernel: \(K(x_n,x) = \tanh (\gamma <x_n,x>+\tau ),\ \gamma >0\).
The class of functions that can be approximated is determined by the chosen kernel form.
Generally, three factors influence the accuracy of SVM regression: the values of C and \(\epsilon\), and the chosen kernel form. The parameter C determines the trade-off between model complexity (i.e. the flatness of f(x)) and the degree to which deviations larger than \(\epsilon\) are tolerated in the optimization of the loss function. Larger C penalizes such deviations more heavily, so the resulting model can be complex and overfit the data; on the other hand, smaller C can make the regression model prone to underfitting because the fit may be too flat to characterize the underlying structure. By controlling the width of the \(\epsilon\)-tube in the training data, the parameter \(\epsilon\) influences the number of support vectors used in the training process. Finally, depending on the properties of the training data, some kernel forms may work better than others.
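The SVM formulation above follows the MATLAB documentation cited in the references (fitrsvm; MathWorks 2017a, c). Purely as an illustrative analogue in Python, the scikit-learn sketch below shows how the kernel form, C and \(\epsilon\) appear directly as hyperparameters of an SVM regression; the synthetic data and parameter values are arbitrary placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))                      # stand-in predictors
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)   # stand-in predictand

# Radial basis function kernel; C controls the penalty on deviations
# larger than epsilon, and epsilon sets the width of the epsilon-tube.
model = make_pipeline(StandardScaler(),
                      SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale"))
model.fit(X, y)

# Only the support vectors contribute to the fitted function f(x).
svr = model.named_steps["svr"]
print("number of support vectors:", len(svr.support_), "of", len(y))
```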
1.3 Random forests
Random forests, first proposed by Breiman (2001), belong to the category of ensemble learning methods, which generate many regression models and aggregate their results (Hastie et al. 2009). As the name suggests, the individual models are tree-based regressions. The basic idea of tree-based regression is to partition the input space into a set of subspaces and then fit a simple model (usually a constant) in each subspace. The partition is generally done by recursive binary partitioning, in which the input space is first split into two regions and each sub-region is fit with a simple model. This splitting process is repeated in the resulting sub-regions until some stopping rule is met, as illustrated in Fig. 13. The following formulation is based on Hastie et al. (2009). Suppose there are p input variables and one response for each of N observations in a dataset: that is, \((x_i, y_i)\) for \(i=1,2,\ldots N\) with \(x_i = (X_{i1}, X_{i2}, \ldots X_{ip})\). Starting from the entire input space, the two regions after each split can be expressed as \(R_1(j,s) = \{X|X_j\le s\}\) and \(R_2(j,s) = \{X|X_j> s\}\), where \(X_j\) and s represent the splitting variable and splitting point respectively. The best fit is achieved by seeking the variable \(X_j\) and split point s which solve

\(\min _{j,s}\left[\min _{c_1}\sum _{x_i\in R_1(j,s)}(y_i-c_1)^2 + \min _{c_2}\sum _{x_i\in R_2(j,s)}(y_i-c_2)^2\right],\)
where \(c_1\) and \(c_2\) are the constants used to model values in \(R_1\) and \(R_2\) respectively. The split variable \(X_j\) and split point s which lead to the best fit can be determined by scanning through the input variables and candidate split points; depending on the objective of the regression problem, it is not necessary to use all p input variables in determining the best split. The response \(y_i\) can be modeled by the tree regression

\(f(X) = \sum _{m=1}^{M} c_m I(X\in R_m),\)
where \(R_1\), \(R_2\), ..., \(R_M\) are the regions resulting from the partition in the tree regression, \(c_1\), \(c_2\), ..., \(c_M\) are the constants used to model the responses in the corresponding regions, and \(I(\cdot)\) is the indicator function.
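A minimal scikit-learn sketch of such a regression tree is given below, intended only to illustrate recursive binary partitioning with a constant fit in each region; the data are hypothetical and the stopping rule (a maximum tree depth) is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))                       # two input variables
y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.1 * rng.standard_normal(300)

# Recursive binary partitioning; each leaf models y by a constant
# (the mean of the training responses falling in that region R_m).
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)
print("number of leaf regions R_m:", tree.get_n_leaves())
```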
Tree regression is generally prone to overfitting, as the training process uses all of the training data, including noise. The essential idea of random forests is to average many noisy but approximately unbiased models, thereby reducing the variance (Hastie et al. 2009). Averaging many regression trees constructed from random bootstrap samples, the resulting model can be expressed as

\(\widehat{f}(X) = \frac{1}{B}\sum _{b=1}^{B} T_b(X),\)

where B is the total number of trees and \(T_b(X)\) is the bth tree-based regression, built from the bth bootstrap sample. In general, there is no specific rule to determine the fraction of data used for bootstrap model construction. The Python function 'RandomForestRegressor', which we used in this study, builds each tree from a sample of the same size as the input training data, drawn with replacement. An estimate of the prediction error is obtained from predictions over the data not in each bootstrap sample (out-of-bag or OOB data).
One of the biggest merits of random forest analysis is its simplicity. There are only two parameters in RF, and the solution is not very sensitive to their values (Liaw and Wiener 2002). The first parameter is the number of input variables considered when seeking the best split in each tree regression. The second parameter is the number B of individual tree regressions used in the RF.
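Since the appendix notes that the Python function 'RandomForestRegressor' was used in this study, the sketch below shows how these two parameters, and the out-of-bag error estimate mentioned above, appear in that interface; the data and parameter values are hypothetical placeholders, not those used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 6))                          # stand-in predictors
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(1000)     # stand-in predictand

# The two main RF parameters: n_estimators (B, the number of trees) and
# max_features (the number of input variables considered at each split).
# With oob_score=True, each tree is evaluated on its out-of-bag data.
rf = RandomForestRegressor(n_estimators=200, max_features=2,
                           bootstrap=True, oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag R^2 estimate:", round(rf.oob_score_, 3))
```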



