Neural network models
Artificial neural network models have proved to be powerful and efficient methods for dealing with complex problems of association, classification, and prediction (Ordieres et al. 2005). A neural network can be characterized by its architecture, represented by the network topology and the pattern of connections between nodes; its method of determining the connection weights; and the activation functions that it employs (Dibike and Coulibaly 2006). Multilayer perceptrons (MLPs) are probably the most widely used network architecture and have found wide application in atmospheric science (Gardner and Dorling 2000; Ordieres et al. 2005). They are composed of a hierarchy of processing units organized in a series of two or more mutually exclusive sets of neurons, or layers. Information flow in the network is restricted to a layer-by-layer flow from the input to the output; such a network is therefore called a feed-forward network. In temporal problems, however, measurements from physical systems are no longer an independent set of input samples but functions of time. To exploit the time series structure in the inputs, the neural network must have access to this time dimension (Dibike and Coulibaly 2006). While feed-forward neural networks are popular in many application areas, they are not well suited to processing temporal sequences because they lack the time delay and/or feedback connections necessary to provide a dynamic model; they can be used as pseudo-dynamic models only by supplying successively lagged multiple inputs selected through correlation and mutual information analysis of the input data. There are, however, various types of neural networks with internal memory structures that can store past values of input variables through time, and there are different ways of introducing such “memory” into a neural network in order to develop a temporal neural network.
Time-lagged feed-forward and recurrent networks are two major groups of dynamic neural networks mostly used in time series forecasting (Coulibaly et al. 2001a, b; Dibike and Coulibaly 2006).
The time-lagged feed-forward neural network (TLFN) is an extension of the standard MLP model, formulated by replacing the neurons in the input layer of an MLP with a memory structure known as a tap delay line (or time delay line). The size of the memory structure (tap delay line) depends on the number of past samples needed to describe the input characteristics in time, and it has to be determined on a case-by-case basis. The TLFN uses delay-line processing elements, which implement memory simply by holding past samples of the input signal, as shown in Fig. 1 (without the feedback connection). The output of such a network with one hidden layer is given by (Dibike and Coulibaly 2006):
$$y\left( n \right) = \phi _1 \left( {\sum\limits_{j = 1}^m {w_j y_j \left( n \right) + b_o } } \right)$$
(1)
$$y\left( n \right) = \phi _1 \left\{ {\sum\limits_{j = 1}^m {w_j \phi _2 \left[ {\sum\limits_{l = 0}^k {w_{jl} x\left( {n - l} \right) + b_j } } \right] + b_o } } \right\}$$
(2)
where m is the size of the hidden layer, n is the time step, \(w_j\) is the weight vector for the connection between the hidden and output layers, \(w_{jl}\) is the weight matrix for the connection between the input and hidden layers, \(\phi_1\) and \(\phi_2\) are the transfer functions at the output and hidden layers, respectively, and \(b_j\) and \(b_o\) are additional network parameters (biases) to be determined during training of the network with observed input/output data sets. For the case of multiple inputs (of size p), the tap delay line with memory depth k can be represented by:
$$X\left( n \right) = \left[ {x\left( n \right),x\left( {n - 1} \right), \ldots ,x\left( {n - k + 1} \right)} \right]$$
(3)
$$x\left( n \right) = \left( {x_1 \left( n \right),x_2 \left( n \right), \ldots ,x_p \left( n \right)} \right),$$
(4)
where x(n) represents the input pattern at time step n, \(x_j(n)\) is an individual input at the nth time step, and X(n) is the combined input to the processing elements at time step n. Such a delay line only “remembers” k samples in the past. An interesting attribute of the TLFN is that the tap delay line at the input does not have any free parameters; therefore, the network can still be trained with the classical back-propagation algorithm. The TLFN topology has been effectively used in nonlinear system identification, time series prediction (Coulibaly et al. 2001b), temporal pattern recognition, and parallel hybrid modeling.
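As an illustration, the forward pass of Eqs. 1–2 can be sketched in Python/NumPy as follows. This is a minimal sketch, not the implementation used in the study: tanh is assumed for both transfer functions \(\phi_1\) and \(\phi_2\), and all names and dimensions are illustrative.

```python
import numpy as np

def tlfn_forward(x_hist, W_in, b_h, w_out, b_o):
    """Forward pass of a single-output TLFN (Eqs. 1-2).

    x_hist : array (k+1, p), the tap delay line X(n) holding x(n) ... x(n-k)
    W_in   : array (m, (k+1)*p), input-to-hidden weights w_jl
    b_h    : array (m,), hidden biases b_j
    w_out  : array (m,), hidden-to-output weights w_j
    b_o    : float, output bias
    """
    z = x_hist.ravel()                    # flatten the delay line into one vector
    hidden = np.tanh(W_in @ z + b_h)      # phi_2: hidden-layer transfer (assumed tanh)
    return np.tanh(w_out @ hidden + b_o)  # phi_1: output transfer (assumed tanh)

# Illustrative usage with memory depth k = 3 and p = 2 inputs
rng = np.random.default_rng(0)
k, p, m = 3, 2, 5
x_hist = rng.standard_normal((k + 1, p))
y = tlfn_forward(x_hist,
                 rng.standard_normal((m, (k + 1) * p)) * 0.1,
                 np.zeros(m),
                 rng.standard_normal(m) * 0.1,
                 0.0)
```

Note that the delay line itself contributes no trainable parameters, consistent with the observation above that the network can be trained with standard back-propagation.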
The recurrent neural network model used in this work is the basic Elman-type RNN (Elman 1990), also known as the globally connected RNN. The network consists of four layers: the input layer, the hidden layer, and the context units, each with n nodes, and the output layer with one node. Each input unit is connected to every hidden unit, as is each context unit. Conversely, there are one-to-one downward connections between the hidden nodes and the context units, leading to an equal number of hidden and context units. The downward connections allow the context units to store the outputs of the hidden nodes (i.e., the internal states) at each time step; the fully distributed upward links then feed them back as additional inputs. The recurrent connections thus allow the hidden units to recycle information over multiple time steps and thereby discover temporal information contained in the sequential input that is relevant to the target function (Coulibaly et al. 2001b). The RNN therefore has an inherent dynamic (or adaptive) memory provided by the context units in its recurrent connections. The output of the network depends not only on the connection weights and the current input signal but also on the previous states of the network, as shown by the following equations (Coulibaly et al. 2001b):
$$y_j = Ax'\left( t \right)$$
(5)
$$x'\left( t \right) = G\left[ {W_h x'\left( {t - 1} \right) + W_{h_o } x\left( {t - 1} \right)} \right]$$
(6)
where x′(t) is the output of the hidden layer at time t given an input vector x(t), G(·) denotes a logistic function characterizing the hidden nodes, the matrix \(W_h\) represents the weights of the h hidden nodes that are connected to the context units, \(W_{h_o}\) is the weight matrix of the hidden units connected to the input nodes, \(y_j\) is the output of the RNN assuming a linear output node j, and A represents the weight matrix of the output layer neurons connected to the hidden neurons. The Elman-style RNN is a state-space model, since Eq. 6 performs the state estimation and Eq. 5 performs the evaluation (Coulibaly et al. 2001b).
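A single time step of Eqs. 5–6 can be sketched as follows. This is an illustrative sketch under the assumptions stated above (logistic hidden nodes, one linear output node); the dimensions and initialization are hypothetical, not taken from the study.

```python
import numpy as np

def sigmoid(u):
    """Logistic function G(.) for the hidden nodes."""
    return 1.0 / (1.0 + np.exp(-u))

def elman_step(x_prev, ctx_prev, W_h, W_ho, A):
    """One time step of an Elman RNN (Eqs. 5-6).

    x_prev   : x(t-1), previous input vector
    ctx_prev : x'(t-1), context units holding the previous hidden state
    W_h      : recurrent weights from the context units to the hidden nodes
    W_ho     : weights from the input nodes to the hidden nodes
    A        : weights from the hidden nodes to the linear output node
    """
    hidden = sigmoid(W_h @ ctx_prev + W_ho @ x_prev)  # Eq. 6: state update
    y = A @ hidden                                    # Eq. 5: linear readout
    return y, hidden

# Illustrative usage: run over a short input sequence, recycling the
# hidden state through the context units at each step
rng = np.random.default_rng(1)
n_hidden, n_in = 4, 3
ctx = np.zeros(n_hidden)                        # context units start at zero
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_ho = rng.standard_normal((n_hidden, n_in)) * 0.1
A = rng.standard_normal(n_hidden) * 0.1
for x in rng.standard_normal((10, n_in)):
    y, ctx = elman_step(x, ctx, W_h, W_ho, A)   # ctx feeds back as memory
```

The feedback through `ctx` is what gives the network its dynamic memory: the output at each step depends on all previous inputs, not just the current one.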
According to Coulibaly et al. (2001b), a major difficulty with RNNs, however, is the training complexity, because the computation of ∇E(W), the gradient of the error E with respect to the weights, is not trivial: the error is not defined at a fixed point but is rather a function of the network's temporal behavior. Here, in order to identify the optimal training method and to reduce computing time, each model was trained with a different algorithm using the same delayed inputs, and the delta-bar-delta algorithm was finally selected after several other methods were investigated. The delta-bar-delta algorithm is an improved version of the back-propagation algorithm. Unlike standard back-propagation, it uses a learning method in which each weight has its own self-adapting coefficient, and it does not use the momentum factor of the standard BP network. The essence of the rule is that past calculated error values for each weight are used to infer future calculated error values; knowing the probable errors, the system takes “intelligent” steps in adjusting the weights. Furthermore, each connection weight has its individual learning rate, which varies over time based on the current error information found with standard back-propagation; more degrees of freedom are thus obtained, which reduces the convergence time (NeuroSolutions 2004).
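The per-weight learning-rate adaptation can be sketched as follows. This follows the general delta-bar-delta rule (additive increase when the current gradient agrees in sign with the smoothed past gradient, multiplicative decrease when it disagrees); the constants `kappa`, `phi`, and `theta` are illustrative assumptions, not values reported in the study.

```python
import numpy as np

def delta_bar_delta_update(w, grad, lr, delta_bar,
                           kappa=0.01, phi=0.5, theta=0.7):
    """One delta-bar-delta step with per-weight adaptive learning rates.

    delta_bar is the exponentially smoothed average of past gradients,
    used to judge whether the error surface slope is consistent.
    """
    same_sign = grad * delta_bar > 0
    opp_sign = grad * delta_bar < 0
    lr = lr + kappa * same_sign             # additive increase when signs agree
    lr = np.where(opp_sign, lr * phi, lr)   # multiplicative decrease otherwise
    w = w - lr * grad                       # plain gradient step, no momentum
    delta_bar = (1 - theta) * grad + theta * delta_bar
    return w, lr, delta_bar

# Illustrative usage: minimize f(w) = 0.5*||w||^2, whose gradient is w
w = np.array([1.0, -2.0])
lr = np.full_like(w, 0.1)
dbar = np.zeros_like(w)
for _ in range(50):
    w, lr, dbar = delta_bar_delta_update(w, w, lr, dbar)
```

Because the slope is consistent on this simple quadratic, each learning rate grows steadily and the weights converge faster than with a fixed global rate.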
Bayesian neural network model
The Bayesian neural network model used in this work was developed by Khan and Coulibaly (2006). The Bayesian approach implements the conventional (standard) learning process but, instead of a single set of weights, considers a probability distribution over the weights. According to Khan and Coulibaly (2006), the process starts with a suitable prior distribution, p(w), for the network parameters (weights and biases). Once the data D are observed, Bayes' theorem is used to derive an expression for the posterior probability distribution of the weights, p(w|D), as follows:
$$p\left( {w|D} \right) = \frac{{p\left( {D|w} \right)p\left( w \right)}}{{p\left( D \right)}}$$
(7)
where p(D|w) is the dataset likelihood function and the denominator p(D) is a normalizing factor, which can be obtained by integrating over the weight space as follows:
$$p\left( D \right) = \int {p\left( {D|w} \right)p\left( w \right)dw} $$
(8)
The left-hand side of Eq. 7 gives unity when integrated over all weight space. Once the posterior has been calculated, every type of inference is made by integrating over that distribution. Therefore, implementing the Bayesian method requires expressions for the prior distribution, p(w), and the likelihood function, p(D|w). The prior distribution, p(w), which does not depend on the data, can be expressed in terms of the weight-decay regularizer, \(E_W = \frac{1}{2}\sum\limits_{i = 1}^W {w_i^2 } \), where W is the total number of weights and biases in the network. Similarly, the likelihood function in Bayes' theorem, which depends on the data, can be expressed in terms of the error function, \(E_D = \frac{1}{2}\sum\limits_{n = 1}^N {\left( {y^n \left( {x^n ;w} \right) - t^n } \right)^2 } \), where x is the input vector, t is the target value, and y(x;w) is the network output. Substituting the expressions for the prior and the likelihood into Eq. 7 yields the posterior distribution of the weights. The objective function in the Bayesian method corresponds to the inference of the posterior distribution of the network parameters. After defining the posterior distribution (objective function), the network is trained with a suitable optimization algorithm to maximize the posterior distribution p(w|D). The most probable value for the weight vector, \(w_{MP}\), thus corresponds to the maximum of the posterior probability. Using the rules of conditional probability, the distribution of outputs for a given input vector, x, can be written in the form
$$p\left( {t|x,D} \right) = \int {p\left( {t|x,w} \right)p\left( {w|D} \right)dw} $$
(9)
where p(t|x,w) is simply the model for the distribution of noise on the target data for a fixed value of the weight vector \(w_{MP}\), and p(w|D) is the posterior distribution of the weights. The posterior distribution over the network weights provides a distribution over the outputs of the network. If a single-valued prediction is needed, the mean of the distribution is used; when the uncertainty about the prediction is needed, the full predictive distribution is used to present the range of that uncertainty. A more detailed description of the BNN approach as used herein can be found in Khan and Coulibaly (2006).
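The maximization of p(w|D) is equivalent to minimizing the combined objective built from \(E_D\) and \(E_W\). The sketch below illustrates this under simplifying assumptions: a linear model stands in for the network output y(x;w), and the hyperparameters `alpha` and `beta` weighting the prior and likelihood terms are hypothetical, not values from Khan and Coulibaly (2006).

```python
import numpy as np

def neg_log_posterior(w, X, t, alpha=0.1, beta=1.0):
    """Unnormalized negative log-posterior beta*E_D + alpha*E_W.

    A linear model y(x; w) = X @ w stands in for the network output;
    alpha and beta are assumed hyperparameters of the prior and noise model.
    """
    y = X @ w
    E_D = 0.5 * np.sum((y - t) ** 2)   # data misfit (likelihood term)
    E_W = 0.5 * np.sum(w ** 2)         # weight-decay regularizer (prior term)
    return beta * E_D + alpha * E_W

# Illustrative usage: for this linear stand-in the maximizer of p(w|D),
# i.e. the most probable weight vector w_MP, has a closed form
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
t = X @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.standard_normal(20)
alpha, beta = 0.1, 1.0
w_mp = np.linalg.solve(beta * X.T @ X + alpha * np.eye(3), beta * X.T @ t)
```

For an actual neural network no closed form exists, so (as described above) a suitable optimization algorithm is used to maximize the posterior, and predictive uncertainty is obtained by integrating over the weight distribution rather than using \(w_{MP}\) alone.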