Abstract
Throughout this book, various concepts from statistics and probability are used which are essential for understanding and constructing forecasts. This chapter presents some of the fundamental tools and definitions. It is assumed that the reader knows some basic probability and statistical concepts, and the chapter is only intended as a refresher of the main ideas used in other parts of the book. It covers basic definitions of distributions and methods for estimating them, and introduces some important concepts such as autocorrelation and cross-correlation. For a more detailed description of basic statistics the authors recommend an introductory text such as [1] (see also the further reading material listed in Appendix D).
3.1 Univariate Distributions
Real world data typically has some degree of uncertainty, with the values it takes distributed over some (potentially infinite) range of points. A major part of probabilistic forecasting is trying to accurately describe, or model, the distribution of the values of interest. For the purposes of this book, distributions will be used to describe probabilistic forecasts and hence to understand the uncertainty associated with the estimates they produce. Note that the focus will be on demand data and hence the methods will typically apply to real continuous variables (as opposed to discrete/categorical variables). In this section only univariate distributions will be considered, i.e., those which describe a single real variable. A random variable is defined here as a continuous variable whose values depend on a random process and which has a continuous distribution. In the following, a random variable will be denoted with a capital letter, e.g. X, whereas lower case letters, e.g. x, will refer to particular realisations/observations of that random variable.
One of the most common ways to describe the distribution of a random variable X is through its probability density function or PDF, \(f_X(x)\), over some (possibly infinite) interval \(x \in (a, b) \subseteq \mathbb {R}\). The PDF is a nonnegative function which describes the relative likelihood that any value \(x \in (a, b)\) will be observed and can be used to calculate the probability of the variable taking some value within an interval \((a_1, b_1) \subseteq (a, b)\) as follows
\[P(a_1< X < b_1) = \int_{a_1}^{b_1} f_X(x)\, dx.\]
Note that the integral of the PDF over the whole interval (a, b) is equal to one by definition, so the probability of any sub-interval is bounded above by one.
An alternative but equally important representation of the distribution is the cumulative distribution function or CDF. Again assume the CDF is defined over some (possibly infinite) interval \(x \in (a, b) \subseteq \mathbb {R}\). The CDF represents the probability that the random variable X will take a value less than or equal to some specified value x, and is often written as a function \(F_X(x) = P\{ X \le x\}\). It is related to the PDF via
\[F_X(x) = \int_{a}^{x} f_X(t)\, dt,\]
i.e. an integration of \(f_X(t)\) for t over the interval (a, x). Notice that the CDF is a monotonically increasing function, which means that if \(x_1\le x_2\) then \(F_X(x_1) \le F_X(x_2)\). It also satisfies the limits \(\lim _{x \longrightarrow a}F_X(x) = 0\) and \(\lim _{x \longrightarrow b}F_X(x) = 1\).
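As a rough numerical illustration of how the PDF, interval probabilities and the CDF fit together, the following Python sketch uses the exponential density \(f_X(x) = e^{-x}\) on \((0, \infty)\). The choice of density, the function names and the use of the trapezoidal rule are assumptions made purely for illustration and are not taken from the text.

```python
import math

def pdf(x):
    """Example density f_X(x) = exp(-x) on (0, inf); an assumed illustration."""
    return math.exp(-x) if x >= 0 else 0.0

def integrate(f, lo, hi, n=10_000):
    """Trapezoidal rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

def cdf(x, a=0.0):
    """F_X(x): integral of the PDF from the lower end of the domain, a, up to x."""
    return integrate(pdf, a, x)

# Probability of observing a value in (a1, b1), computed in two ways.
a1, b1 = 0.5, 2.0
p_interval = integrate(pdf, a1, b1)      # direct integral of the PDF
p_from_cdf = cdf(b1) - cdf(a1)           # difference of CDF values

# Integrating over (almost) the whole domain gives a value close to one.
total_mass = integrate(pdf, 0.0, 50.0, n=100_000)

print(p_interval, p_from_cdf, total_mass)
```

The two routes to the interval probability agree because the CDF is just the running integral of the PDF.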
The expected value and variance are two important derived values associated with a PDF/CDF. The expected value is defined as
\[\mathbb {E}(X) = \int_{a}^{b} x f_X(x)\, dx,\]
and the variance as
\[\mathbb {V}ar(X) = \int_{a}^{b} (x-\mu )^2 f_X(x)\, dx,\]
where \(\mu =\mathbb {E}(X)\). The expectation (or mean) essentially represents a weighted average of the values of the random variable X, with values weighted by the probability density \(f_X\). It acts as a typical, or expected, value of the random variable.
The variance is the expected value for the squared deviation from the mean. It is often used to represent the spread of the data from the mean and hence is a simple measure for the uncertainty of the random variable. The square root of the variance is known as the standard deviation (often denoted \(\sigma = \sqrt{\mathbb {V}ar(X)}\)).
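One of the most commonly studied, and important, distributions is the one dimensional Gaussian (also known as the Normal) distribution which has a PDF defined by
\[f_X(x) = \frac{1}{\sigma \sqrt{2\pi }} \exp \left( -\frac{(x-\mu )^2}{2\sigma ^2}\right) . \tag{3.5}\]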
Thus the Gaussian is defined entirely by two parameters, namely the mean \(\mu \) and the variance \(\sigma ^2\). When \(\mu =0\) and \(\sigma =1\) then (3.5) is known as the standard normal distribution.
An example of the Gaussian distribution for various means and standard deviations is shown in Fig. 3.1. Notice that the Gaussian distribution is always bell-shaped and is symmetric around the mean value. Also, the bigger the variance the wider the distribution, as expected. In the plot for the corresponding cumulative distribution, smaller variances translate to steeper gradients.
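The definitions of the expected value and variance can be checked numerically for the Gaussian density. The sketch below is only illustrative: the parameter values, the truncated integration range and the helper names are assumptions rather than anything specified in the text.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian (Normal) PDF with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, n=20_000):
    """Trapezoidal rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

mu, sigma = 1.5, 2.0                        # illustrative parameter values
lo, hi = mu - 10 * sigma, mu + 10 * sigma   # truncated stand-in for (-inf, inf)

# Total probability, expected value and variance from their integral definitions.
mass = integrate(lambda x: gaussian_pdf(x, mu, sigma), lo, hi)
mean = integrate(lambda x: x * gaussian_pdf(x, mu, sigma), lo, hi)
var = integrate(lambda x: (x - mean) ** 2 * gaussian_pdf(x, mu, sigma), lo, hi)

print(f"mass={mass:.6f}, mean={mean:.6f}, variance={var:.6f}")
```

The integrals recover the two parameters that define the distribution: the mean \(\mu \) and the variance \(\sigma ^2\).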
Not all variables are Gaussian distributed, or even symmetric. There are a whole host of other parametric families of distributions. The lognormal distribution is suitable for variables which are positively skewed, with long tails to the right, and has PDF defined by
\[f_X(x) = \frac{1}{x\sigma \sqrt{2\pi }} \exp \left( -\frac{(\ln x-\mu )^2}{2\sigma ^2}\right) , \quad x > 0,\]
for parameters \(\mu \) and \(\sigma \). Notice that this is simply a Gaussian distribution (3.5) but for the logarithm of the variable. There are also more general distributions such as the gamma distributions which can represent a whole range of different distribution shapes. The gamma PDF has the relatively complicated form
\[f_X(x) = \frac{1}{\Gamma (\alpha )\beta ^{\alpha }}\, x^{\alpha -1} e^{-x/\beta }, \quad x > 0,\]
for parameters \(\alpha , \beta \), often called the shape and scale parameters respectively, and where \(\Gamma (x) = \int _0^\infty t^{x-1}e^{-t} dt\) is the so-called gamma function. The PDFs for the lognormal and gamma distribution for various values of their parameters are shown in Fig. 3.2.
The Gaussian, gamma and lognormal distributions are examples of parametric distributions because they are defined completely in terms of their input parameters. There are also nonparametric distributions which do not assume that the data come from any specific parametric family of functions. Kernel density estimation is a popular method for nonparametrically estimating a distribution and will be described in Sect. 3.4. In this book, the main focus will be on nonparametric models, but the Gaussian distribution will also be commonly used, especially when modelling the distribution of errors.
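A short sketch of the lognormal and gamma densities may help make the parametric families concrete. It assumes the standard lognormal PDF and the shape/scale form of the gamma PDF; the parameter values and helper names are illustrative choices only.

```python
import math
from statistics import NormalDist

def lognormal_pdf(x, mu=0.0, sigma=1.0):
    """Lognormal PDF: log(X) is Gaussian with mean mu and standard deviation sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def gamma_pdf(x, alpha=2.0, beta=1.0):
    """Gamma PDF with shape alpha and scale beta (assumed parameterisation)."""
    if x <= 0:
        return 0.0
    return x ** (alpha - 1) * math.exp(-x / beta) / (math.gamma(alpha) * beta ** alpha)

def integrate(f, lo, hi, n=50_000):
    """Trapezoidal rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

mu, sigma = 0.0, 0.5      # illustrative lognormal parameters
alpha, beta = 2.0, 1.5    # illustrative gamma shape and scale

lognormal_mass = integrate(lambda x: lognormal_pdf(x, mu, sigma), 0.0, 50.0)
gamma_mass = integrate(lambda x: gamma_pdf(x, alpha, beta), 0.0, 60.0)
gamma_mean = integrate(lambda x: x * gamma_pdf(x, alpha, beta), 0.0, 60.0)

# The lognormal density at x is the Gaussian density at log(x), divided by x.
x = 1.7
via_gaussian = NormalDist(mu, sigma).pdf(math.log(x)) / x

print(lognormal_mass, gamma_mass, gamma_mean, lognormal_pdf(x, mu, sigma), via_gaussian)
```

Under this parameterisation the gamma mean is the product of the shape and scale, which the numerical integral reproduces.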
3.2 Quantiles and Percentiles
A continuous, strictly increasing CDF admits an inverse function \(F_X^{-1}\) which can take any \(q \in [0, 1]\) and give a unique value \(x_q = F_X^{-1}(q)\). This value is the qth quantile, or q-quantile, also known as the 100q percentile [2]. The most well known values are the median, which is the 0.5-quantile or 50th percentile, and the lower and upper quartiles, which are the 25th and 75th percentiles, or the 0.25- and 0.75-quantiles respectively. Essentially the q-quantile defines the value in the domain below which a proportion q of the data lies. In other words, the probability that the random variable X is less than or equal to \(x_q\) is q.^{Footnote 1}
Examples of the 50th and 90th percentiles for the standard Normal distribution are shown in Fig. 3.3. Notice that the q-quantile is simply the domain value corresponding to where the horizontal line at \(y=q\) intersects the CDF, as illustrated in Fig. 3.3. Often the complete CDF is unknown but a finite number of quantiles can be estimated. When enough quantiles are calculated, an accurate estimate can be formed of the overall distribution. A technique for estimating the quantiles from observations is given in Sect. 3.4.
Quantiles are important tools for estimating distributions when only samples of the overall population are available and can be used to create probabilistic forecasts as will be demonstrated in Sect. 11.4.
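The relationship between quantiles and the inverse CDF can be illustrated for the standard Normal distribution with Python's standard library class `statistics.NormalDist`. This is only a small sketch; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# The q-quantile is the inverse CDF evaluated at q.
median = std_normal.inv_cdf(0.5)           # 0.5-quantile (50th percentile)
x_90 = std_normal.inv_cdf(0.9)             # 0.9-quantile (90th percentile)
lower_quartile = std_normal.inv_cdf(0.25)
upper_quartile = std_normal.inv_cdf(0.75)

# Applying the CDF to a quantile recovers the probability level q.
recovered_q = std_normal.cdf(x_90)

print(median, x_90, lower_quartile, upper_quartile, recovered_q)
```

For the standard Normal the median is zero and the 90th percentile is approximately 1.28, and the quartiles are symmetric about the median.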
3.3 Multivariate Distributions
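Multivariate distributions are an extension of the univariate distributions introduced in Sect. 3.1 to distributions of more than one variable. This time consider N random variables \(X_1, X_2, \ldots , X_N\). Analogous to the PDF for the univariate distribution is the joint probability distribution
\[f_{X_1, \ldots , X_N}(x_1, \ldots , x_N),\]
which, like the PDF, describes the relative likelihood that the set of values \((x_1, x_2, \ldots , x_N)\) will be observed. The CDF for a multivariate distribution is defined as
\[F_{X_1, \ldots , X_N}(x_1, \ldots , x_N) = P(X_1 \le x_1, \ldots , X_N \le x_N),\]
and can be written in terms of the joint probability distribution
\[F_{X_1, \ldots , X_N}(x_1, \ldots , x_N) = \int _{-\infty }^{x_1} \cdots \int _{-\infty }^{x_N} f_{X_1, \ldots , X_N}(t_1, \ldots , t_N)\, dt_N \cdots dt_1.\]
If \(N=2\) then the multivariate distribution is known as a bivariate distribution. One of the simplest multivariate distributions is the multivariate Gaussian joint distribution defined by
\[f_{X_1, \ldots , X_N}(\textbf{x}) = \frac{1}{\sqrt{(2\pi )^N \det (\boldsymbol{\Sigma })}} \exp \left( -\frac{1}{2}(\textbf{x}-\boldsymbol{\mu })^T \boldsymbol{\Sigma }^{-1} (\textbf{x}-\boldsymbol{\mu })\right) ,\]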
where \(\textbf{x} =(x_{1},\ldots ,x_{N})^T \in \mathbb {R}^N\), \(\boldsymbol{\mu }=(\mu _{1},\ldots ,\mu _{N})^T \in \mathbb {R}^N\) where \(\mu _k\) is the expected value for the random variable \(X_k\), and \(\boldsymbol{\Sigma } \in \mathbb {R}^{N \times N}\) is the covariance matrix which describes the covariance between the variables. The diagonal elements of this matrix describe the variance of the random variables, i.e. \((\boldsymbol{\Sigma })_{k,k} = \mathbb {V}ar(X_k)\), and the off-diagonal elements describe the variation of one random variable in relation to another random variable. Consider two random variables \(X_k\) and \(X_m\), then the covariance between these two variables can be written in terms of the expectation as
\[Cov(X_k, X_m) = \mathbb {E}\left[ (X_k - \mathbb {E}(X_k))(X_m - \mathbb {E}(X_m))\right] .\]
Notice that the covariance matrix is symmetric and positive semidefinite.^{Footnote 2} The (Pearson) correlation is defined to be
\[Corr(X_k, X_m) = \frac{Cov(X_k, X_m)}{\sqrt{\mathbb {V}ar(X_k)}\sqrt{\mathbb {V}ar(X_m)}},\]
and is bounded by \([-1, 1]\) and is a measure of the linear dependence between two variables. In other words, the correlation between two variables is the covariance scaled by the standard deviations of the variables. If two variables are independent (i.e. the value of one variable does not affect the other) then they are uncorrelated and their covariance is equal to zero. For the special case of two variables, the bivariate covariance matrix can be written as
\[\boldsymbol{\Sigma } = \begin{pmatrix} \sigma _1^2 & \rho \sigma _1 \sigma _2 \\ \rho \sigma _1 \sigma _2 & \sigma _2^2 \end{pmatrix},\]
where \(\rho \in [-1, 1]\) is the correlation \(Corr(X_1,X_2)\) and \(\sigma _1, \sigma _2\) are the standard deviations of the random variables \(X_1\) and \(X_2\). An example of a bivariate Gaussian distribution is shown in Fig. 3.4 for \(\rho = 0.6\), \(\sigma _1^2=0.6\) and \(\sigma _2^2=1\). Here (a) is the joint density and (b) is the joint CDF. The correlation \(\rho \) is relatively large and hence the variables are somewhat correlated with each other.
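To make the link between the correlation, the standard deviations and the covariance matrix concrete, the sketch below builds the bivariate covariance matrix for the values quoted for Fig. 3.4 and draws samples with that structure. The sampling construction (mixing two independent standard Gaussians), the zero means, the sample size and the seed are assumptions chosen for illustration.

```python
import math
import random

def bivariate_covariance(rho, sigma1, sigma2):
    """2x2 covariance matrix built from the correlation and standard deviations."""
    cov12 = rho * sigma1 * sigma2
    return [[sigma1 ** 2, cov12], [cov12, sigma2 ** 2]]

def sample_bivariate_gaussian(n, rho, sigma1, sigma2, rng):
    """Draw n zero-mean pairs (x1, x2) with the requested covariance structure."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = sigma1 * z1
        x2 = sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        pairs.append((x1, x2))
    return pairs

def sample_covariance(xs, ys):
    """Sample covariance with the N-1 divisor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

rho, sigma1, sigma2 = 0.6, math.sqrt(0.6), 1.0   # values quoted for Fig. 3.4
Sigma = bivariate_covariance(rho, sigma1, sigma2)

rng = random.Random(0)                            # fixed seed for repeatability
pairs = sample_bivariate_gaussian(20_000, rho, sigma1, sigma2, rng)
x1s = [p[0] for p in pairs]
x2s = [p[1] for p in pairs]

var1 = sample_covariance(x1s, x1s)
var2 = sample_covariance(x2s, x2s)
corr = sample_covariance(x1s, x2s) / math.sqrt(var1 * var2)

print(Sigma, var1, var2, corr)
```

With enough samples the estimated variances and correlation should sit close to the values used to build the matrix.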
To simplify the discussion on multivariate distributions, the focus of the rest of this chapter will be on bivariate distributions, but the results extrapolate to more general multivariate distributions.
One of the most important substructures of a multivariate distribution is the marginal distribution. Given a joint bivariate distribution, \(f_{X_{1}, X_{2}}(x_{1},x_2)\), the marginal distribution of \(X_1\) describes the distribution of \(X_1\) given no knowledge of \(X_2\) and is found by integrating over \(X_2\)
\[f_{X_1}(x_1) = \int _{-\infty }^{\infty } f_{X_{1}, X_{2}}(x_{1},x_2)\, dx_2.\]
Similarly the marginal distribution of \(X_2\) can be defined
\[f_{X_2}(x_2) = \int _{-\infty }^{\infty } f_{X_{1}, X_{2}}(x_{1},x_2)\, dx_1.\]
The joint and marginal distributions for a bivariate Gaussian (the same joint as given in Fig. 3.4) are illustrated in Fig. 3.5. Notice that if the random variables \(X_1\) and \(X_2\) are independent then the joint density is simply a product of the marginals for each variable \(f_{X_{1}, X_{2}}(x_{1},x_2) =f_{X_{1}}(x_{1})f_{X_{2}}(x_2)\). The marginals can often be easier to estimate since they only require estimating each individual variable rather than needing to model any interdependencies between them.
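The marginalisation step can be checked numerically. The sketch below integrates a zero-mean bivariate Gaussian density over the second variable and compares the result with the univariate Gaussian density of the first variable; it also checks the product rule for the independent case. The zero means, parameter values and trapezoidal integration are assumptions for illustration.

```python
import math

def gaussian_pdf(x, sigma):
    """Zero-mean univariate Gaussian density."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def joint_pdf(x1, x2, rho, s1, s2):
    """Zero-mean bivariate Gaussian density with correlation rho."""
    norm = 2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho ** 2)
    q = (x1 / s1) ** 2 - 2.0 * rho * (x1 / s1) * (x2 / s2) + (x2 / s2) ** 2
    return math.exp(-q / (2.0 * (1.0 - rho ** 2))) / norm

def marginal_x1(x1, rho, s1, s2, half_width=12.0, n=4000):
    """Integrate the joint density over x2 using the trapezoidal rule."""
    lo, hi = -half_width * s2, half_width * s2
    h = (hi - lo) / n
    total = 0.5 * (joint_pdf(x1, lo, rho, s1, s2) + joint_pdf(x1, hi, rho, s1, s2))
    for i in range(1, n):
        total += joint_pdf(x1, lo + i * h, rho, s1, s2)
    return total * h

rho, s1, s2 = 0.6, math.sqrt(0.6), 1.0
# (x1, numerically integrated marginal, exact univariate density)
checks = [(x1, marginal_x1(x1, rho, s1, s2), gaussian_pdf(x1, s1))
          for x1 in (-1.0, 0.0, 0.7)]

# For independent variables (rho = 0) the joint is the product of the marginals.
independent_joint = joint_pdf(0.3, -0.4, 0.0, s1, s2)
product_of_marginals = gaussian_pdf(0.3, s1) * gaussian_pdf(-0.4, s2)

print(checks, independent_joint, product_of_marginals)
```

The marginal of \(X_1\) does not depend on the correlation, which is why only each individual variable needs to be estimated to obtain it.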
If one of the variables, say \(X_2\), is observed so that its value \(x_2\) is known for certain, then the distribution of \(X_1\) given this particular value is known as the conditional density and is written \(f_{X_{1} | X_{2}}(x_{1} | x_2)\).
The joint, marginal and conditional distributions are related by the following formula
\[f_{X_{1}, X_{2}}(x_{1},x_2) = f_{X_{1} | X_{2}}(x_{1} | x_2)\, f_{X_{2}}(x_2).\]
As a simple illustrative example, consider randomly sampling from a bag containing ten identical-looking balls, each with a unique number, one to ten, written on them. Since each ball has an equal chance of being sampled, the probability of drawing any of the balls is the same, 1/10. However, the conditional probability of drawing a three given that we know that the ball has an odd value written on it is 1/5.
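The ball example can be reproduced by direct enumeration, which also shows the discrete analogue of the relation between joint, conditional and marginal probabilities. The helper function below is a hypothetical convenience written for this illustration.

```python
from fractions import Fraction

balls = range(1, 11)   # ten balls numbered one to ten

def probability(event, given=lambda b: True):
    """P(event | given) for a ball drawn uniformly at random from the bag."""
    conditioned = [b for b in balls if given(b)]
    favourable = [b for b in conditioned if event(b)]
    return Fraction(len(favourable), len(conditioned))

def is_three(b):
    return b == 3

def is_odd(b):
    return b % 2 == 1

p_three = probability(is_three)                        # unconditional probability
p_odd = probability(is_odd)                            # marginal probability of "odd"
p_three_given_odd = probability(is_three, given=is_odd)
p_three_and_odd = probability(lambda b: is_three(b) and is_odd(b))

print(p_three, p_odd, p_three_given_odd, p_three_and_odd)
```

The joint probability of drawing an odd three equals the conditional probability multiplied by the marginal probability of an odd ball.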
3.4 Nonparametric Distribution Estimates
In this section basic nonparametric methods for estimating and understanding distributions from the available observations are introduced. Some of these, such as kernel density estimation methods, will be used as building blocks for some of the probabilistic forecasts in Chap. 11. These types of methods are useful when the data is known not to come from a specific parametric family of distributions, e.g. Gaussian or gamma (see Sect. 3.1).
One of the most common methods for estimating the PDF is through a histogram. A histogram is simply a count of the number of observations within some predefined discrete partitions (called bins) of a variable. Bins are defined by dividing the space into discrete groups. For a univariate random variable these are just intervals [a, b]; for multivariate data these are regions defined by intervals, e.g. \([a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_N, b_N]\). Often histograms are restricted to univariate and bivariate data due to the difficulty of visualising higher dimensions. Each bin is usually of equal size but this is not required. An example of a histogram using 20 equally spaced bins is shown in Fig. 3.6a.
A limitation of the histogram approach is the dependence on the position and size of the bins: small adjustments to them can change the shape of the plot significantly. Further, the histogram is a discrete representation of a continuous variable, which means some information is lost when binning. A preferable, but slightly more complicated, estimator is kernel density estimation (KDE). KDEs are summations of small smooth functions K, called kernels, which are defined around each observation \(x_k\) of the random variable X to contribute to the overall PDF. For N observations \(x_1, \ldots , x_N\) the estimate can be written as
\[\hat{f}_X(x) = \frac{1}{Nh}\sum _{k=1}^{N} K\left( \frac{x - x_k}{h}\right) ,\]
where h is the bandwidth hyperparameter. There are a variety of kernels but one of the most common is the Gaussian kernel defined as
\[K(x) = \frac{1}{\sqrt{2\pi }} \exp \left( -\frac{x^2}{2}\right) .\]
The most important parameter is the bandwidth h which determines the smoothness of the final distribution. The larger the h, the smoother the final estimate. The optimal value of this parameter is often found through cross-validation methods (see Sect. 8.2), although rules of thumb are sometimes used when there are assumptions about the underlying shape of the distribution. A representation of a KDE for two different bandwidths is shown in Fig. 3.6b (for the same data as in Fig. 3.6a). Notice that if the bandwidth is too small then the KDE will overfit to individual observations (see Sect. 8.1.3). In contrast, a bandwidth which is too large will mean features are lost due to underfitting to the observations. Extensions to KDEs to generate probabilistic forecasts will be explored in Sect. 11.5.
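The effect of the bandwidth can be explored with a minimal Gaussian kernel density estimator. The sketch below assumes the usual form of the estimator (an average of kernels centred on the observations and scaled by h); the synthetic two-cluster data, the bandwidth values and the mode-counting helper are all hypothetical choices for illustration.

```python
import math
import random

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, observations, h):
    """Kernel density estimate at x using bandwidth h."""
    n = len(observations)
    return sum(gaussian_kernel((x - xk) / h) for xk in observations) / (n * h)

def count_modes(values):
    """Number of strict local maxima in a list of density values."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i - 1] < values[i] > values[i + 1])

rng = random.Random(1)   # fixed seed for repeatability
# Hypothetical data: two clusters of observations.
observations = ([rng.gauss(-2.0, 0.5) for _ in range(100)]
                + [rng.gauss(2.0, 0.5) for _ in range(100)])

step = 0.02
grid = [-20.0 + step * i for i in range(2001)]

masses, modes = {}, {}
for h in (0.05, 0.4, 3.0):   # too small, moderate, too large
    density = [kde(x, observations, h) for x in grid]
    masses[h] = sum(density) * step      # Riemann-sum check that the estimate integrates to ~1
    modes[h] = count_modes(density)      # a crude measure of how "wiggly" the estimate is

print(masses, modes)
```

A very small bandwidth produces many spurious peaks around individual observations, while a very large one smooths away the structure in the data.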
Now consider estimation of the CDF for a univariate distribution with samples \(x_1, x_2, \ldots , x_N\). The CDF can be easily estimated using the empirical cumulative distribution function (ECDF) defined by
\[\hat{F}_X(x) = \frac{1}{N}\sum _{k=1}^{N} \textbf{1}_{x_k \le x},\]
where \(\textbf{1}_S\) is the indicator function which takes the value 1 if the statement S is true and 0 otherwise. In this case the statement is whether the observation is less than or equal to x. In other words, the empirical CDF simply counts the proportion of observations less than or equal to a particular value. An example of the CDF for the standard Normal and the corresponding empirical CDF (for 20 randomly sampled points) is shown in Fig. 3.7. Notice that the empirical CDF jumps at every observed point, and the more observations available the closer the approximation is to the true CDF.
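The empirical CDF is straightforward to compute directly from its definition as a proportion of observations. The following sketch compares it with the true standard Normal CDF for a small and a large sample; the sample sizes, seed, evaluation grid and the use of the less-than-or-equal convention are assumptions made for illustration.

```python
import random
from statistics import NormalDist

def ecdf(x, observations):
    """Proportion of observations less than or equal to x."""
    return sum(1 for xk in observations if xk <= x) / len(observations)

rng = random.Random(0)   # fixed seed for repeatability
small_sample = [rng.gauss(0.0, 1.0) for _ in range(20)]
large_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]

true_cdf = NormalDist().cdf
grid = [-3.0 + 0.1 * i for i in range(61)]

def max_error(observations):
    """Largest gap between the ECDF and the true CDF over the grid."""
    return max(abs(ecdf(x, observations) - true_cdf(x)) for x in grid)

error_small = max_error(small_sample)
error_large = max_error(large_sample)

print(error_small, error_large)
```

The gap to the true CDF generally shrinks as more observations are used.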
The quantiles (Sect. 3.2) can also be estimated from a finite sample of points. Suppose the observations are ordered, i.e. the samples \(x_1, x_2, \ldots , x_N\) are such that \(x_k < x_{k+1}\) for \(k=1, \ldots , N-1\) (these are also known as order statistics). Then the sample q-quantile, for \(q \in (0,1)\), is defined as the observation \(x_k\) whose index k is qN rounded to the nearest integer.
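A minimal sketch of this sample quantile is given below. The treatment of ties in the rounding (halves are rounded up) and the clamping of the index to the valid range are assumptions, since the definition above does not specify them; the data values are made up for illustration.

```python
import math

def sample_quantile(observations, q):
    """Sample q-quantile: the order statistic x_k with k equal to q*N rounded."""
    ordered = sorted(observations)
    n = len(ordered)
    k = math.floor(q * n + 0.5)   # round q*N to the nearest integer (halves round up)
    k = min(max(k, 1), n)         # keep the 1-based index within 1..N
    return ordered[k - 1]

data = [7.1, 2.3, 9.8, 4.4, 5.0, 1.2, 8.6, 3.9, 6.5, 10.7]

median = sample_quantile(data, 0.5)   # k = 5, the fifth smallest value
q25 = sample_quantile(data, 0.25)     # k = 3 after rounding 2.5 up
q90 = sample_quantile(data, 0.9)      # k = 9

print(q25, median, q90)
```

Other conventions, for example interpolating between neighbouring order statistics, are also in common use and give slightly different values for small samples.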
Of course the PDF estimate created from the KDE can also be easily turned into a CDF estimate using the definition of the CDF itself (Sect. 3.1). Moreover, since the kernels are often distributions themselves (as with the Gaussian), the CDF estimate is simply the normalised sum of the CDFs of each kernel.
Sometimes it is not necessary to visualise the entire distribution of points and a good general impression of the distribution can be understood from a few points. A box plot gives a visualisation of a few summary statistics of a distribution. An example of a box plot for two data sets is shown in Fig. 3.8 where the first data set is the same as that shown in Fig. 3.6, whilst data set 2 is a simple Gaussian distribution with mean and standard deviation equal to 0.5. What is included in a box plot can vary but they all typically show the following things:
• A centralised value which is given by a line within the main box. In the box plot in Fig. 3.8 this is the median and is given by the red line.
• The first and third quartiles, which are given by the bottom and the top of the box.
• Whiskers which show the span of the points to the smallest and largest values (often this does not include points considered outliers). These are the dotted lines in Fig. 3.8.
• Outlier values, defined as those which are more than 1.5 times the interquartile range from the top or bottom of the box. These are given by red crosses in the plot.
The box plot, although relatively simple, can be used to generate some insight into the data. Firstly, it gives a very basic representation of the spread of the data, including where the middle \(50\%\) of the data lies. The comparison of the box plots for the two data sets can also indicate whether there is significant overlap between the data. Finally, if the median line is not in the centre of the box then this indicates skewness in the data.
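The quantities drawn in a box plot can be computed with a few lines of code. The sketch below uses linearly interpolated quartiles, which is only one of several conventions and is an assumption here, as are the function names and the small made-up data set.

```python
def quantile(ordered, q):
    """Linearly interpolated quantile of already sorted data (one common convention)."""
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def boxplot_stats(observations):
    """Median, quartiles, whisker ends and outliers as typically drawn in a box plot."""
    ordered = sorted(observations)
    q1, median, q3 = (quantile(ordered, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whiskers": (inliers[0], inliers[-1]),   # smallest and largest non-outlier values
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(data)
print(stats)
```

Here the value 30 lies more than 1.5 times the interquartile range above the third quartile, so it is flagged as an outlier and the upper whisker stops at 9.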
3.5 Sample Statistics and Correlation
The expected value and variance are important values associated with distributions and are often used as estimates of ‘typical’ values and uncertainty respectively. However, in practice the distribution is not known and important features of a distribution can only be estimated from the available observations. Suppose \(x_1, \ldots , x_N\) is a sample of observations of a univariate continuous random variable X, and each of the samples is independent (i.e. none of the samples are dependent on each other). An estimate of the population mean is the sample mean, defined as
\[\hat{\mu } = \frac{1}{N}\sum _{k=1}^{N} x_k.\]
Similarly there is the sample variance
\[\hat{\sigma }^2 = \frac{1}{N-1}\sum _{k=1}^{N} (x_k - \hat{\mu })^2,\]
which is divided by \(N-1\) rather than N to ensure the estimator is unbiased.^{Footnote 3} As in Sect. 3.1, the sample standard deviation \(\hat{\sigma }\) is the square root of the sample variance. Other important summary statistics include the median (the 0.5-quantile), which is a measure of central tendency, and the interquartile range (the difference between the 0.75- and 0.25-quantiles), which is a measure of spread. These values also tend to be more robust (i.e. less sensitive) to outliers than the mean and variance.
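The sample mean and the unbiased sample variance can be computed directly from their definitions and compared with Python's `statistics` module. The sketch also shows how a single outlier moves the mean far more than the median; the data values are made up for illustration.

```python
import statistics

def sample_mean(xs):
    """Sample mean: the sum of the observations divided by N."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance: squared deviations divided by N - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sample_mean(data)
variance = sample_variance(data)
std_dev = variance ** 0.5

# Robustness check: add one large outlier and compare the shift in mean and median.
with_outlier = data + [100.0]
mean_shift = abs(sample_mean(with_outlier) - mean)
median_shift = abs(statistics.median(with_outlier) - statistics.median(data))

print(mean, variance, std_dev, mean_shift, median_shift)
```

The median moves only slightly when the outlier is added, whereas the mean is pulled strongly towards it.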
In Sect. 3.3 the concepts of covariance and correlation between continuous random variables were introduced for the case where the distributions are known. Consider observations \((x_{k, 1}, x_{k, 2})\), \(k=1, \ldots , N\) for the bivariate random variables \((X_1,X_2)\), then the sample Pearson correlation can be defined as
\[\rho = \frac{\sum _{k=1}^{N} (x_{k,1} - \bar{x}_1)(x_{k,2} - \bar{x}_2)}{\sqrt{\sum _{k=1}^{N} (x_{k,1} - \bar{x}_1)^2}\sqrt{\sum _{k=1}^{N} (x_{k,2} - \bar{x}_2)^2}}, \tag{3.22}\]
where \(\bar{x}_1, \bar{x}_2\) are the sample means for random variables \(X_1\) and \(X_2\) respectively.
In addition to Pearson's correlation, another common measure of correlation is Spearman's rank correlation coefficient. This is defined as simply the Pearson correlation of the ranks of the values in the two random variables. In other words, take the observations \(x_{1, 1}, x_{2, 1}, \ldots , x_{N, 1}\) for random variable \(X_{1}\) and assign them values based on their rank, i.e. value 1 for the largest value, 2 for the second largest and so on. Do the same for the second random variable, \(X_2\). Then simply calculate the Pearson correlation of these rankings using Eq. (3.22).
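The difference between the two correlation measures can be seen on data with a monotonic but non-linear relationship. The sketch below implements the sample Pearson correlation and obtains Spearman's coefficient by applying it to ranks; the ranking helper assumes there are no tied values, and the exponential example data are an illustrative choice.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs))
           * math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den

def ranks(values):
    """Rank 1 for the largest value, 2 for the second largest, ... (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [math.exp(x) for x in xs]   # monotonic but strongly non-linear in x

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(r_pearson, r_spearman)
```

Because the relationship is perfectly monotonic the rank correlation is one, while the Pearson correlation is lower since it only measures linear dependence.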
The concept of correlation can be extended to a univariate time series (Sect. 5.1), i.e. \((L_{1}, L_2, \ldots , L_{N})^T\) where \(L_t\) is a single value at time t and the points are ordered consecutively in time. First consider the autocovariance and the autocorrelation, i.e. the (Pearson) correlation between values in the time series and its lagged values. The sample autocovariance function at lag k is defined as
\[R(k) = \frac{1}{N}\sum _{t=1}^{N-k} (L_t - \hat{\mu })(L_{t+k} - \hat{\mu }),\]
where \(\hat{\mu }\) is the sample mean of the time series \((L_{1}, L_2, \ldots , L_{N})^T\). Similarly the autocorrelation function (ACF) at lag k is defined as
\[\rho (k) = \frac{R(k)}{\hat{\sigma }^2},\]
where \(\hat{\sigma }^2\) is the sample variance.
A plot of the autocorrelation at several consecutive lags \(\rho (0), \rho (1), \ldots \) is a common way to identify interdependencies between a time series and its lagged values, as well as to identify important features such as seasonalities. Such plots are commonly used to identify the correct orders for ARIMA models (see Sect. 9.4). The autocorrelation is bounded, \(-1 \le \rho (k) \le 1\) for \(k \in \mathbb {N}\), with \(\rho =1\) indicating a perfect positive correlation and \(\rho =-1\) a perfect negative correlation. No linear correlation is indicated by \(\rho =0\).
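A small sketch of the sample autocovariance and autocorrelation is given below for a synthetic series with a period of six time steps. It assumes the common 1/N normalisation of the autocovariance and divides by the lag-zero autocovariance so that \(\rho (0) = 1\) exactly, which differs slightly from dividing by a sample variance with an N-1 divisor. The sinusoidal test series is an illustrative assumption.

```python
import math

def autocovariance(series, k):
    """Sample autocovariance at lag k with the 1/N normalisation."""
    n = len(series)
    mu = sum(series) / n
    return sum((series[t] - mu) * (series[t + k] - mu) for t in range(n - k)) / n

def autocorrelation(series, k):
    """Sample autocorrelation at lag k, scaled so that the lag-0 value is one."""
    return autocovariance(series, k) / autocovariance(series, 0)

# Synthetic series with a seasonal period of 6 time steps (20 full periods).
series = [math.sin(2.0 * math.pi * t / 6.0) for t in range(120)]

acf = [autocorrelation(series, k) for k in range(13)]
for k, value in enumerate(acf):
    print(k, round(value, 3))
```

The autocorrelation peaks at multiples of the period (lags 6 and 12) and is strongly negative half a period away (lags 3 and 9), which is the kind of pattern used to spot seasonality.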
The partial autocorrelation function (PACF) evaluated at lag k describes the autocorrelation between the series \(L_t\) and the lagged series \(L_{t+k}\) conditional on the in-between values \(L_{t+1}, \ldots , L_{t+k-1}\). In other words, it describes the autocorrelation after removing the effects at shorter lags.
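The text does not state how the PACF is computed. One common approach, used here purely as an illustrative assumption, is the Durbin-Levinson recursion, which derives the partial autocorrelations from the autocorrelations. The sketch applies it to the theoretical autocorrelations of a first-order autoregressive process, \(\rho (k) = \phi ^k\), for which only the lag-one partial autocorrelation should be non-zero.

```python
def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations for lags 1..max_lag via the Durbin-Levinson recursion.

    rho is a sequence of autocorrelations rho[0..max_lag] with rho[0] == 1.
    """
    pacf = []
    prev = []   # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1} from the previous step
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
        else:
            num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf

phi = 0.7                                   # illustrative AR(1) coefficient
rho = [phi ** k for k in range(6)]          # theoretical ACF of an AR(1) process
pacf = pacf_from_acf(rho, 5)
print([round(value, 6) for value in pacf])
```

Although the autocorrelations decay slowly across all lags, the partial autocorrelations vanish beyond lag one, showing how the PACF removes the influence of shorter lags.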
An example of an autocorrelation and partial autocorrelation plot for 30 lags is shown in Fig. 3.9. Typically such plots also include two horizontal lines which indicate the level at which significant correlations exist. Values outside of the area defined by these lines indicate significant correlation values (although of course sometimes values can be outside these lines by chance alone).
The ACF plot shows some strong autocorrelations at lags 6, 12, 18 and 24, but also at other lags in between. The PACF plot shows the same strong lags at multiples of 6; however, many of the other correlations are much smaller (for example at lags 3, 4 and 5) in the PACF plot than in the ACF plot. This indicates that the influence of the other lags may have suggested a stronger autocorrelation than truly existed in the time series.
The cross-correlation is an extension of autocorrelation, but between two different time series. To illustrate this, consider a second time series \(M_1, M_2, \ldots , M_N\), defined at the same time steps as \((L_{1}, L_2, \ldots , L_{N})^T\). The cross-correlation between L and M at lag \(k\ge 0\) can be defined as
\[\rho _{L,M}(k) = \frac{1}{N \hat{\sigma }_1 \hat{\sigma }_2}\sum _{t=1}^{N-k} (L_t - \hat{\mu }_1)(M_{t+k} - \hat{\mu }_2),\]
and for \(k \le 0\) as
\[\rho _{L,M}(k) = \frac{1}{N \hat{\sigma }_1 \hat{\sigma }_2}\sum _{t=1}^{N+k} (L_{t-k} - \hat{\mu }_1)(M_{t} - \hat{\mu }_2),\]
where \(\hat{\mu }_1\) and \(\hat{\mu }_2\) are the sample means, and \(\hat{\sigma }_1\) and \(\hat{\sigma }_2\) the sample standard deviations, of the time series \(L_t\) and \(M_t\) respectively. The value at \(k=0\) represents the correlation between the two series without any lags. The function not only identifies which lags are the most important but also the temporal direction. For example, the temperature could be a strong indicator of the demand; however, if the nearest weather station is in the next town over then the recorded value may be related to the temperature in the town of interest but with a delay. In that case there will be a strong cross-correlation between the demand and the lagged temperature observed in the next town over.
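The way a cross-correlation function reveals a delayed relationship can be sketched with synthetic data. The convention assumed below pairs \(L_t\) with \(M_{t+k}\) for non-negative lags and normalises by N and the (1/N) standard deviations of the two series; the text may use a slightly different normalisation. The "temperature" and "demand" series, the delay of two time steps and the noise level are all hypothetical.

```python
import math
import random

def cross_correlation(l, m, k):
    """Sample cross-correlation pairing L_t with M_{t+k} (assumed convention)."""
    n = len(l)
    mu1, mu2 = sum(l) / n, sum(m) / n
    s1 = math.sqrt(sum((x - mu1) ** 2 for x in l) / n)
    s2 = math.sqrt(sum((x - mu2) ** 2 for x in m) / n)
    if k >= 0:
        pairs = ((l[t], m[t + k]) for t in range(n - k))
    else:
        pairs = ((l[t - k], m[t]) for t in range(n + k))
    return sum((a - mu1) * (b - mu2) for a, b in pairs) / (n * s1 * s2)

rng = random.Random(42)   # fixed seed for repeatability
n, delay = 500, 2

# Hypothetical temperature series and a demand series that follows it two steps later.
temperature = [rng.gauss(0.0, 1.0) for _ in range(n)]
demand = [
    (temperature[t - delay] if t >= delay else 0.0) + 0.1 * rng.gauss(0.0, 1.0)
    for t in range(n)
]

lags = range(-5, 6)
xcf = {k: cross_correlation(temperature, demand, k) for k in lags}
best_lag = max(lags, key=lambda k: xcf[k])
print(best_lag, round(xcf[best_lag], 3))
```

The largest cross-correlation appears at the lag equal to the delay, and its sign indicates which series leads the other.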
3.6 Questions

1. Generate samples of different sizes from a normal distribution with a mean and standard deviation of your choice. Calculate the sample mean and standard deviation of each sample and plot these values against the sample size. Do the values start to converge to the true values? At how many samples does this convergence begin? This demonstrates the law of large numbers. How much does the mean change for small samples? Now plot the median. How much does this change for small samples? Does one seem more robust to additions of new data than the other?
2. For the standard normal distribution, how much data lies within (a) one standard deviation, \(\sigma \), (b) two standard deviations, (c) three standard deviations, of the centre?
3. For a standard normal distribution, which quantiles most closely approximate the values one standard deviation from the centre?
4. If a distribution is not symmetric, what does the difference between the mean and median tell you about the skewness of the distribution?
5. Sample from a univariate distribution of your choice and plot the histogram (using the default bin size). Change the bin sizes and number of samples and compare them to the original distribution. What appears to be a good bin size for one of your sampling sets (assuming the sample size is big enough)?
6. Sample from the same univariate distribution as in the previous question. Plot a kernel density estimate with different bandwidths and different sample sizes. Observe the effects. What appears to be a good choice of bandwidth for one of your sample sets (make sure the number of samples is big enough)?
7. Plot the box plots for three data sets, with each data set generated from 1000 samples from three different univariate distributions (perhaps use the gamma, lognormal and Gaussian distributions described in the chapter). Adjust the parameters of each model so the plots can all be seen clearly on the figure. Compare these plots to the original distributions: How does the central value vary across the box plots? What about the whiskers and the box sizes?
8. Plot a bivariate Gaussian distribution with unit variance on both variables. Change the correlation between the variables and see how the plots change. How does the marginal distribution change for each variable as you change the correlation?
9. For the same bivariate distribution as in the previous question, consider the conditional distribution for one of the variables by binning samples of the data contained within a small interval of the other variable. Plot the histogram of these points. How does this change as you change the position of the interval?
10. Download some energy time series data (see Sect. D.4). Plot the autocorrelation and partial autocorrelations for a few of these time series (only consider lags for up to a couple of weeks). Where are the major correlation values? What is the difference between the autocorrelation plots and the partial autocorrelation plots? Compare these values to plots of the time series. Are there obvious seasonal patterns, and how do they relate to the autocorrelation plots?
11. Download the GEFCOM 2014 data or alternatively download any other demand time series data which also has some temperature data (see Sect. D.4 for a list of data sets). Plot one of the demand series against the temperature data in a scatter plot. What does the relationship look like? Is there a clear effect of temperature on this data? Calculate the cross-correlation plot for the demand and temperature data. Where are the major correlations? Are there some strong correlations at lagged values?
Notes
1. Note that some authors refer to the n-quantiles where the quantiles are defined in terms of dividing the domain into \(n \in \mathbb {N}\) sets. Hence the 2-quantile would be the median, the 4-quantiles would consist of the median, the lower quartile and the upper quartile, etc.
2. A symmetric real-valued matrix \(\textbf{A} \in \mathbb {R}^{N \times N}\) is said to be positive semidefinite if \(\textbf{x}^T\textbf{A}\textbf{x} \ge 0\) for all \(\textbf{x} \in \mathbb {R}^N\), and positive-definite if \(\textbf{x}^T\textbf{A}\textbf{x} >0\) for all nonzero \(\textbf{x} \in \mathbb {R}^N\).
3. Which essentially means the difference between the expected value of the estimator and the true value is zero.
References
[1] F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester, A Modern Introduction to Probability and Statistics: Understanding Why and How (Springer, London, 2005)
[2] D. Ruppert, D.S. Matteson, Statistics and Data Analysis for Financial Engineering: with R Examples. Springer Texts in Statistics (2015)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this chapter
Cite this chapter
Haben, S., Voss, M., Holderbaum, W. (2023). Primer on Statistics and Probability. In: Core Concepts and Methods in Load Forecasting. Springer, Cham. https://doi.org/10.1007/978-3-031-27852-5_3
DOI: https://doi.org/10.1007/978-3-031-27852-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27851-8
Online ISBN: 978-3-031-27852-5
eBook Packages: Energy, Energy (R0)