Deep Learning Model for Wind Forecasting: Classification Analyses for Temporal Meteorological Data

This paper proposes a multiple CNN architecture with multiple input features, combined with multiple LSTM layers and densely connected convolutional layers, for temporal wind nature analyses. The designed architecture is called Multiple features, Multiple Densely Connected Convolutional Neural Network with Multiple LSTM Architecture, i.e. MCLT. A total of 58 features in the input layers of the MCLT are designed using wind speed and direction values. These empirical features are based on percentage difference, standard deviation, correlation coefficient, eigenvalues, and entropy, and describe the wind trend efficiently. Two successive LSTM layers are used after four densely connected convolutional layers of the MCLT. The LSTM has memory units that utilise learnt features from the current as well as previous outputs of the neurons, thereby enhancing the learning of patterns in the temporal wind dataset. Each densely connected convolutional layer also learns from the features of the other convolutional layers. The MCLT is used to predict the dominant future speed and direction classes for the wind datasets of Stuttgart and Netherlands. Using the MCLT with 58 features, the maximum and minimum overall accuracies for dominant speed prediction are 99.1% and 94.9% (for Stuttgart) and 99.9% and 97.5% (for Netherlands), and for dominant direction prediction are 99.9% and 94.4% (for Stuttgart) and 99.6% and 96.4% (for Netherlands), respectively. The MCLT, therefore, with multiple features at different levels, i.e. the input layers, the convolutional layers, and the LSTM layers, shows promising results for the prediction of dominant speed and direction. Thus, this work is useful for proper wind utilisation and improving environmental planning. These analyses would also help in performing Computational Fluid Dynamics (CFD) simulations using wind speed and direction measured at a nearby meteorological station, for devising a new set of appropriate inflow boundary conditions.


Introduction
The requirement for green energy is expanding day by day with increasing population growth and development. Wind is a free, clean, renewable energy source that is naturally available in limitless supply (Lawan et al. 2014;Marović et al. 2017;Tarade and Katti 2011). In today's world, mankind seeks to become more environmentally friendly in its operations, and the wind is an important source of energy. Wind speed and direction are essential components that need to be tracked in order to monitor, predict, and maintain weather patterns and global climate (Colak et al. 2012;Vargas et al. 2010). The future wind trends are influenced by the past conditions of wind speed and direction. Moreover, to support the selection of new wind turbine installation sites, prior analysis of the wind nature and its prediction is required (Aissou et al. 2015;Reed et al. 2011). Wind speed and direction prediction methods are grouped into four categories based on the time scale (Yesilbudak et al. 2013;Yesilbodak et al. 2017), viz. (i) very short-term (predictions from a few seconds to 30 min ahead), (ii) short-term (predictions from 30 min to 6 h ahead), (iii) medium-term (predictions from 6 h to 1 day ahead), and (iv) long-term (predictions from 1 day to 1 week ahead). Machine Learning (ML) (Sapronova et al. 2016), Numerical Weather Prediction (NWP) models (Aslipour and Yazdizadeh 2019;Janssens et al. 2016;Louka et al. 2008), and models incorporating both NWP and ML (Vladislavleva et al. 2013) for wind prediction are presently the focus of research and commercial applications.
The ML concepts such as fuzzy logic (Martínez-Arellano et al. 2014;Monfared et al. 2009), Artificial Neural Networks (ANN) with several hidden layers (Birenbaum and Greenspan 2017;Daraeepour and Echeverri 2014;El-Fouly et al. 2008;Vogado et al. 2018;Yesilbodak et al. 2017), and statistical models (Jursa and Rohrig 2008;Louka et al. 2008;Miranda and Dunn 2006;Yang and Chen 2019) are used to design such wind prediction frameworks. Techniques like particle swarm optimisation, wavelet transform (Liu et al. 2018;Martínez-Arellano et al. 2014;Wang et al. 2017), REP tree, M5P tree, bagging tree, K-nearest neighbour algorithm (Jursa and Rohrig 2008;Kusiak et al. 2009a;Kusiak and Zhang 2010), principal component analysis, moving average models, and Markov chain (Kusiak et al. 2009b;Treiber et al. 2016;Vargas et al. 2010), combined with regression models using neural networks, have been used for wind analyses (Yang and Chen 2019). Moreover, wind speed forecasting with Support Vector Machines (SVM) and variations (Kang et al. 2017) such as Least Square Support Vector Machines (LSSVM) has also been proposed (De Giorgi et al. 2014;Harbola and Coors 2019a;Yuan et al. 2015). These works used only limited features based on wind speed, direction and power as input. The ML concept of deep learning based on Convolutional Neural Networks (CNNs) has achieved higher accuracy for classification of Two-Dimensional (2D) images and Three-Dimensional (3D) point clouds (Krizhevsky et al. 2012;Long et al. 2015;Szegedy et al. 2015). Convolutional layers in a CNN learn a large number of features automatically, so that they need not be designed manually (Jung et al. 2019;Kuo 2016;Qi et al. 2016). Variations of CNNs like single CNN, multiple CNN, and Residual Neural Network Architecture (ResNet) (He et al. 2016;Huang et al. 2017;Xie et al. 2017) with several convolutional layers have become popular for classification.
Further, One-Dimensional (1D) and 2D single CNNs have been employed for wind power and wind speed predictions (Liu et al. 2018;Wang et al. 2017). However, these models either smooth and filter the wind dataset by applying techniques like wavelet transform or convert the 1D wind dataset into 2D images (Liu et al. 2018;Wang et al. 2017). This leads to distortion of the original information present in the temporal wind dataset. To overcome this problem, 1D single CNN (1DS) and 1D multiple CNN (1DM), working directly on the original 1D temporal wind dataset without using smoothing techniques, were proposed by Harbola and Coors (2019b). The 1DM model showed better performance than the 1DS for prediction of the dominant class of wind speed and direction. However, only two features based on the speed and direction were included in the input layers of the 1DS and 1DM, and a limited number of classes (eleven) were used for prediction. This paper improves upon the 1DM model and proposes a deep multiple CNN architecture with multiple input features, along with multiple Long Short-Term Memory (LSTM) and densely connected convolutional layers. A larger number of features in a CNN architecture helps in learning the various properties of a sample from finer to coarser levels (de Andrade 2019). Therefore, a large number of features are used in this study. The new architecture is called Multiple features, Multiple Densely Connected Convolutional Neural Network with Multiple LSTM Architecture, i.e. MCLT, with the following novel contributions: (a) multiple features (58 in total) are used in the input layers for better representation of the temporal wind dataset, (b) fully connected layers are replaced by LSTM layers to provide memory for a longer period, thereby improving the training of the model, (c) convolutional layers are connected similarly to the 2D ResNet architecture (for images) (Duta et al. 2020) so that each convolutional layer learns features of previous convolutional layers as well, and (d) a higher number of classes (21) is used for analysing the detailed trend of the temporal wind dataset. The authors are unable to find any existing work that has used these four contributions for in-depth analyses and prediction of wind nature. The remainder of the paper is arranged as follows: Sect. 2.1 describes the MCLT architecture, followed by Sect. 2.2, which gives details of the wind datasets used in the experiments. Section 3 presents the results and Sect. 4 gives the conclusion and future recommendations.

Methodology
The proposed MCLT architecture is an advanced deep learning architecture, which is a combination of multiple features, multiple LSTM, and densely connected convolutional layers in a multiple CNN model for the wind nature analysis. A total of 58 features are based on the various combinations of two important temporal wind properties, i.e. wind speed and direction. This ensures that several details of the wind features are learnt by the MCLT. These features are designed based on time series data from the past. The features form the input of the MCLT that has to predict a representative wind speed or direction value for a period of time immediately after the last value of the input sample in the time series. The following sections discuss the design of these multiple features, along with the MCLT framework.
Further, the input to the MCLT is a time series (or temporal) data of wind speed and direction for a certain geographic location (i.e. spatial location). These time series data need to be acquired at regular intervals. The time stamp in the data helps to arrange the data in the increasing order of time. More details of the data are available in Sect. 2.2. Further, several features are designed using the wind speed and direction that are explained in Sect. 2.1. The prediction of the MCLT is the class label based on the dominant wind speed and direction. The multiple wind speed values for future points in time are grouped into 21 classes using the wind speed values. Amongst these classes, the class having maximum count, i.e. class of the speed values that occur most (viz. dominant speed amongst future points) in time forms the class label of the input sample (Harbola and Coors 2019b). Similarly, 21 classes for the wind direction are designed and the class label is assigned to the sample based on the class having maximum count of the wind direction values. It may be noted that grouping into 21 classes is a process of creating the class labels of training and testing samples, while the MCLT prediction represents one class label (for a given sample) that depicts wind speed or direction value for a certain period of time immediately following the time represented by the input sample. Also, there are two trained MCLT models, one for the wind speed and another for the wind direction. The proposed method can be short term, medium term as well as long term depending on the choice of the number of future points in time that are grouped into 21 classes. This concept is discussed in detail in Sect. 2.1.

Designing Multiple Features
Wind speed (given in m/s) and direction (in radians) are two input features (Harbola and Coors 2019b) to the proposed architecture. Besides these two features, 56 additional features also form part of the input. Suppose matrix M i, j has r rows and 58 columns, where r equals the number of temporal wind values present in the dataset (each row of M i, j is a time instance of the wind dataset comprising speed and direction values), and i, j denote the row and column number of a cell, respectively, in the matrix. Each column denotes a feature. The first feature (first column) and the second feature (second column) comprise the wind speed and direction values, respectively. M i, j=3 (third feature) is the percentage difference (per) between M i, j=1 (speed values) … Fig. 1, where values up to M i−7, j are used only because of hardware constraints in the present study; this depth could be decreased or increased as per the available hardware. Thus, each row of M i, j has column (or feature) values that depend on the current and previous rows, i.e. i to i − 7 . In Fig. 1, for example, std ( M i, j=2 , M i−1, j=2 , M i−2, j=2 ) means the standard deviation of the three quantities inside the brackets. The explanation of the other features in Fig. 1 is similar. These features are calculated using adjacent temporal values of wind speed and direction and help in describing trends like increase, decrease, stationarity, and deviation from the mean. The features can be varied depending on the available hardware for training the MCLT. This is discussed in more detail in Sect. 4.3.2. The M i, j matrix constructed above is further rescaled by dividing each cell's value by the maximum value amongst all the cells. This rescaling maps the values to a smaller range for better learning of the MCLT. This rescaled M i, j matrix is used in the steps described below.
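As a minimal sketch of two of the steps described above, the Python below computes a standard deviation over a value and its preceding rows, and rescales a matrix by its global maximum. The function names, the use of the population standard deviation, and the handling of the first rows (which have fewer predecessors) are assumptions made for illustration and are not taken from the paper.

```python
import statistics


def lagged_std(column, i, lags):
    """Std of column[i] and up to `lags` preceding values.

    Assumption: the population standard deviation is used, and rows near
    the start of the series simply use the values that are available.
    """
    start = max(0, i - lags)
    return statistics.pstdev(column[start:i + 1])


def rescale_by_global_max(matrix):
    """Divide every cell by the largest value found anywhere in the matrix."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        # Nothing to rescale; return a copy to avoid dividing by zero.
        return [list(row) for row in matrix]
    return [[value / peak for value in row] for row in matrix]


# Hypothetical direction values (radians) for four consecutive time instances.
direction = [0.0, 1.0, 2.0, 4.0]
print(lagged_std(direction, 2, 2))  # std of rows 0, 1 and 2
print(rescale_by_global_max([[1.0, 2.0], [4.0, 8.0]]))
```

The same windowing pattern would apply to the other lag-based features, with the lag depth limited to seven previous rows as stated in the text.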
Samples for training and testing the proposed architecture are designed using M i, j . A sample consists of input values and a corresponding class label. This class label is predicted by the MCLT. The sample's input consists of a matrix of dimension K B * 58 using values from M i, j=1..58 to M i+K B , j=1..58 , where K B is a scalar quantity that depends on the user. Therefore, rows from i to i + K B (and all columns of these rows) of M i, j form the input of the sample. The columns of matrix K B * 58 are treated as separate features, each of one dimension in the input layers of the MCLT as discussed in the next section.
The corresponding class label of the sample is a class reflecting the wind speed or the wind direction value for the future K F (a scalar value) time values immediately after the last time value (i + K B ) in the sample's input. The class label of the sample is designed using values of speed from M i+K B +1, j=1 to M i+K B +K F , j=1 . For this, the mean (μ) and standard deviation (σ) of the given historical temporal wind dataset are calculated, separately for speed and direction. Then, 21 classes are designed using the μ and σ of the wind speed values, as shown in Table 1. The μ and σ concepts provide statistical segregation of classes (Ghilani 2010). k i , where i → 1-10 as shown in Table 1, is decided empirically. Speed values from M i+K B +1, j=1 to M i+K B +K F , j=1 (these speed values are used without rescaling for class construction) are grouped into these 21 classes, and the count of values in each class is found. The class having the maximum count is assigned as the class label of the sample. This maximum count represents the dominant speed amongst the K F future points in time (i.e. the class of the speed values that occur most) (Harbola and Coors 2019b). Likewise, the class label of the sample based on the direction is determined by finding the maximum count of direction values from M i+K B +1, j=2 to M i+K B +K F , j=2 (these direction values are used without rescaling for class construction) among these 21 classes. The μ and σ based on the wind direction values are used for designing these 21 classes of the wind direction. Here, the second column of M i, j is used, which holds the wind direction values. As stated earlier, the grouping into 21 classes is a method of creating the class labels of training and testing samples, while the MCLT prediction represents one class label (for a given sample) that depicts the wind speed or direction value for the K F time period immediately after the last time value (i + K B ) in the sample's input.
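The labelling rule can be sketched in a few lines of Python. Table 1 is not reproduced here, so the bin layout below is an assumption: the 20 edges are placed at the mean minus and plus each k i times the standard deviation, which yields 21 intervals. The k i values are the ones reported in the test setup; the function names and the tie-breaking behaviour (the first class to reach the maximum count wins) are illustrative choices only.

```python
from bisect import bisect_right
from collections import Counter

# k_1 .. k_10 as reported in the test setup of the paper.
K = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 1.0]


def class_edges(mu, sigma, k=K):
    """Return 20 ascending edges (assumed layout: mu -/+ k_i * sigma)."""
    lower = [mu - ki * sigma for ki in reversed(k)]
    upper = [mu + ki * sigma for ki in k]
    return lower + upper


def dominant_class(future_values, edges):
    """Index (0..20) of the class holding most of the future values."""
    counts = Counter(bisect_right(edges, v) for v in future_values)
    return counts.most_common(1)[0][0]


# Hypothetical statistics of a historical speed series (m/s).
edges = class_edges(mu=5.0, sigma=2.0)
print(len(edges) + 1)                             # number of classes
print(dominant_class([5.0, 5.05, 9.0], edges))    # index of the dominant class
```

The same two functions would be applied to the direction column, with the mean and standard deviation computed from the direction values instead.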
Based on the definition of a sample, from a dataset consisting of matrix M i, j with r rows, training samples can be generated by varying i from 1 to r − K F with an increment of 1. This helps in performing the temporal wind data analysis over wind speed and direction.
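A sliding-window generator along these lines is sketched below. It is illustrative only: it uses 0-based indexing, takes exactly K B input rows followed by K F future rows, and stops at the last start index for which both windows fit inside the matrix. The name `make_samples` and the `label_fn` callback are hypothetical.

```python
def make_samples(rows, k_b, k_f, label_fn):
    """Build (input_window, label) pairs with a stride of one row.

    rows     : list of feature rows (one row per time instance)
    k_b      : number of past rows forming the sample's input
    k_f      : number of future rows used to derive the class label
    label_fn : callable mapping the future rows to a class label
    """
    samples = []
    last_start = len(rows) - k_b - k_f
    for i in range(last_start + 1):
        window = rows[i:i + k_b]
        future = rows[i + k_b:i + k_b + k_f]
        samples.append((window, label_fn(future)))
    return samples


# Toy series with a single feature per row; the label here is just the
# largest future value, standing in for the dominant-class rule.
rows = [[float(t)] for t in range(10)]
samples = make_samples(rows, k_b=3, k_f=2,
                       label_fn=lambda fut: max(r[0] for r in fut))
print(len(samples), samples[0])
```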

MCLT Architecture
The MCLT architecture is shown in Fig. 2. There are five input layers corresponding to each view CNN i (CNN 1 , CNN 2 , CNN 3 , CNN 4 , and CNN 5 ) as in the 1DM. For a given sample's input, five views corresponding to each input layer in the MCLT are formed as follows: (a) first view takes all K B values of the sample's input, i.e. rows from i to i + K B (and all columns of these rows) of M i, j , (b) second view takes half of K B values of the sample's input from rows i to i + K B at an interval of two (and all columns of these rows) of M i, j , (c) third view also takes half of K B values of the sample's input but from rows i + 1 to i + K B at an interval of two (and all columns of these rows) of M i, j , (d) fourth view takes one-third of K B values of the sample's input but from rows i to i + K B at an interval of three (and all columns of these rows) of M i, j , and (e) fifth view again takes one-third of K B values of the sample's input but from rows i + 1 to i + K B at an interval of three (and all columns of these rows) of M i, j , (Harbola and Coors 2019b). The input layer of each view is followed by four successive convolutional layers ( C 1 , C 2 , C 3 , C 4 ). The densely connected convolutional layers similar to ResNet are realised as follows, (a) C 3 directly takes as input features from both C 2 and C 1 (while in the 1DM model, C 3 took input only from previous layer C 2 ), and (b) C 4 directly takes input features from C 3 , C 2 and C 1 (while in traditional CNN models, C 4 takes input only from C 3 ) (Zhao et al. 2019).
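The five views amount to strided slices of the sample's input. The sketch below assumes the input window is held as a Python list of rows with 0-based indexing; the function name `five_views` is hypothetical.

```python
def five_views(window):
    """Return the five sub-sampled views of one sample's input rows."""
    return [
        window[0::1],  # view 1: every row
        window[0::2],  # view 2: every second row, starting at row i
        window[1::2],  # view 3: every second row, starting at row i + 1
        window[0::3],  # view 4: every third row, starting at row i
        window[1::3],  # view 5: every third row, starting at row i + 1
    ]


# With K_B = 60 rows the views hold 60, 30, 30, 20 and 20 rows.
window = list(range(60))
print([len(v) for v in five_views(window)])
```

Each view keeps all columns of the selected rows, so every view still carries the full set of features.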
The detailed pseudo code of the MCLT implementation is given in Algorithm 1. All the feature maps from the last convolutional layer C 4 of each view (total 5 views) are first flattened to 1D form (step 13 in Algorithm 1) and then appended one after another (step 14 in Algorithm 1). This appended feature vector is then passed to a common LSTM layer called LSTM 1 (step 16 in Algorithm 1), which in turn is followed by the second LSTM layer called LSTM 2 . In the 1DM model, fully connected layers were present in the place of LSTM 1 and LSTM 2 . The output layer (which is a dense or fully connected layer) comes after LSTM 2 . The output layer uses the softmax function for classification, and the number of neurons in this layer is the same as the number of classes in the dataset, i.e. 21 neurons corresponding to 21 classes (step 18 in Algorithm 1). Concatenate in Algorithm 1 means that C 1 and C 2 (step 9), and C 1 , C 2 and C 3 (step 11), are joined together one after another and then treated as input for the next step, i.e. making the densely connected convolutional layers.

Fig. 2 The MCLT architecture. Red arrows denote connections between different convolutional layers and LSTM layers. All the feature maps from C 4 of CNN 1 , CNN 2 , CNN 3 , CNN 4 and CNN 5 are appended to form a vector and passed into LSTM 1 . Multiple blue boxes in Input, C 1 , C 2 , C 3 and C 4 represent multiple features in that layer. Red circles in LSTM 1 and LSTM 2 represent neurons

Algorithm 1 MCLT
1: procedure MCLT
2: Output ← MCLT output layer
3:
4: Merged ← [ ] (Merged ← Empty list)
5: for i ← 1 to 5 do
6: CNN i processing
7: C 1 ← Conv1D(features, stride, input = CNN i Input, ELU, dropout)
8: C 2 ← Conv1D(features, stride, input = C 1 , ELU, dropout)
9: C 2concat ← Concatenate(C 1 , C 2 )
10: C 3 ← Conv1D(features, stride, input = C 2concat , ELU, dropout)
11: C 3concat ← Concatenate(C 1 , C 2 , C 3 )
12: C 4 ← Conv1D(features, stride, input = C 3concat , ELU, dropout)
13: C 4 ← Flatten(C 4 )
14: Merged.append(C 4 )
15: end for
16: LSTM 1 ← LSTM(neurons, input = Merged)
17: LSTM 2 ← LSTM(neurons, input = LSTM 1 )
18: Output ← Dense(neurons, input = LSTM 2 , softmax)
19: end procedure

Further, Merged in Algorithm 1 is initially defined as an empty list (step 4) and, in each iteration of the for loop, the flattened C 4 is appended to it (step 14). CNN i Input in step 7 means the input corresponding to CNN i . Conv1D in Algorithm 1 denotes a function representing the 1D convolutional operation, which takes values such as the number of features, stride (amount by which the 1D kernel shifts), input from a CNN layer, activation function and dropout (Srivastava et al. 2014) value. LSTM and Dense (steps 16-18 in Algorithm 1) denote LSTM and fully connected layers, respectively. LSTM units include a memory element that can maintain information in memory for long periods of time. Figure 3 shows the LSTM architecture in detail as available in (Chollet 2017;Hochreiter and Schmidhuber 1997). A set of gates (input, output, forget (memory element)) is used to control when information enters LSTM units, when it leaves, and when it is forgotten. Thus, these memory units aid in learning longer-term dependencies. The densely connected convolutional layers help C 3 directly learn features from both C 1 and C 2 , unlike in 1DM, where C 3 learnt features from C 2 only. Likewise, C 4 directly learns features from C 1 , C 2 , and C 3 , unlike a traditional CNN, where C 4 considers input only from C 3 .
Each input layer of the MCLT, thus, takes multiple 1D features. In this study, there are 58 features in each input layer. A higher number of features in a CNN architecture helps in learning the various properties of a sample from finer to coarser levels. Therefore, many features are used in this study. For a sample having input values from i to i + K B of M i, j , each column of these rows forms a 1D feature of the input layer. Thus, the MCLT incorporates multiple features and multiple views in the input layers, each convolutional layer takes input from several previous layers, and memory units are present in the LSTM layers. The output layer of the MCLT uses the sample's class label, either based on the wind speed or direction, for training and testing the architecture. The sample's class label is designed using M i+K B +1 to M i+K B +K F values as discussed in the above section. Accordingly, there are two trained MCLT models, one for the wind speed and another for the wind direction. The samples' inputs to these two models remain the same, but the class labels are based on the wind speed (when the model is trained to predict the wind speed) or the wind direction (when the model is trained to predict the wind direction). Further, the parameters determined in training comprise the weights and biases of neurons of the convolutional and LSTM layers as well as the output layer.
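To make the wiring of the densely connected layers concrete, the following Python sketch only does shape bookkeeping; it is not an implementation of the network. It assumes that every convolution preserves the sequence length (needed for channel-wise concatenation, and not stated in the paper), uses the feature-map counts 16, 28, 32 and 32 reported in the test setup, and takes view lengths of 60, 30, 30, 20 and 20 rows for K B = 60. All function and constant names are hypothetical.

```python
FEATURE_MAPS = (16, 28, 32, 32)        # C1..C4 feature maps (from the test setup)
VIEW_LENGTHS = (60, 30, 30, 20, 20)    # rows per view for K_B = 60 (assumed)
N_INPUT_FEATURES = 58


def dense_block_channels(n_inputs, maps):
    """Number of input channels seen by C1, C2, C3 and C4.

    C3 receives Concatenate(C1, C2); C4 receives Concatenate(C1, C2, C3).
    """
    c1, c2, c3, _ = maps
    return [n_inputs, c1, c1 + c2, c1 + c2 + c3]


def merged_length(view_lengths, maps):
    """Length of the vector formed by flattening C4 of every view.

    Assumption: convolutions keep the sequence length unchanged.
    """
    return sum(length * maps[-1] for length in view_lengths)


print(dense_block_channels(N_INPUT_FEATURES, FEATURE_MAPS))
print(merged_length(VIEW_LENGTHS, FEATURE_MAPS))
```

Under these assumptions C 3 sees 44 input channels and C 4 sees 76, compared with 28 and 32 in a purely sequential stack, which is the practical effect of the dense connections.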

Dataset
Historical temporal wind datasets spanning more than 30 years are considered as test cases for the proposed MCLT. The first case is the climate and air measuring station located at the corner of Hauptstaetter Strasse, 70173 Stuttgart, Germany, which is one of the sources for the wind data collected from 1987 to 2017 in Stuttgart. The temporal resolution of this dataset is thirty minutes, as wind speed and direction values are measured at an interval of thirty minutes. The second case is the dataset of Netherlands from the station 210 Valkenburg, with 37 years of historical data from 1981 to 2018. The datasets are split into subsets, each corresponding to the data for one month. This allows for an analysis of the data on a monthly basis. One matrix M i, j (Sect. 2.1) is generated for each of these subsets.

Experiments and Results
This section explains the results of the MCLT for the Stuttgart and Netherlands datasets. Section 4.1 provides the details of the hardware and software configuration along with the organisation of the training and testing samples. Section 4.2 presents the obtained accuracies for different datasets and features. Section 4.3 presents the qualitative discussion of the obtained results and a comparison with other existing methods.

Test Setup
The proposed MCLT architecture has been coded in Python using the Keras library (Chollet 2017) with TensorFlow as the backend and executed on an Intel® Core™ i7-4770 CPU @ 3.40 GHz having four cores. The total samples for a month were randomly divided into training and testing samples, with 30% of the total samples as the testing samples. This procedure of random division of the total samples into training and testing samples, followed by the training and testing of the MCLT, was repeated 20 times in order to determine the mean accuracy values. This procedure, thus, accounted for the randomness in splitting into training and testing. Moreover, the splitting technique was applied by ensuring that the input values of each testing sample do not overlap with (i.e. are disjoint from) the input values of the training samples.
Further, the Adaptive Synthetic Sampling (ADASYN) technique (He et al. 2008) was used to enhance the number of training samples for better learning of the MCLT. ADASYN generates synthetic samples of the minority classes according to their density distributions, rather than over-sampling every minority sample uniformly. The numbers of feature maps in C 1 , C 2 , C 3 and C 4 of each of CNN 1 , CNN 2 , CNN 3 , CNN 4 , and CNN 5 of the MCLT architecture are 16, 28, 32 and 32, respectively, whereas LSTM 1 and LSTM 2 have 200 neurons each. Values of k 1 , k 2 , k 3 , k 4 , k 5 , k 6 , k 7 , k 8 , k 9 and k 10 (Table 1) were empirically determined as 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80 and 1.0, respectively (same for both speed and direction), so that a sufficient number of samples occurs in each class (He et al. 2008), by observing the histograms comprising 21 bins corresponding to the 21 classes. Moreover, K B and K F were both taken as 60. K F multiplied by the temporal resolution gives the time frame of future prediction, which can be set as desired by the user. Figure 4 shows the variation in total accuracy of the MCLT with 58 features for varying K B (here K F = K B ). In this work, K B is taken as 60 because accuracy increases up to 60 and remains similar thereafter, as shown in Fig. 4. Exponential Linear Units (ELUs) (Clevert et al. 2017;Pedamonti 2018) with α of 3.0 have been used as the activation function in the MCLT. The higher value of 3.0 was chosen to avoid the dead neurons problem during training with highly variable wind datasets (Clevert et al. 2017;Nair and Hinton 2010). A kernel size of three along with a stride of one has been applied for all the convolutional layers. Batch normalisation (Jung et al. 2019) and dropout (Srivastava et al. 2014) of 0.45 have been employed after every convolutional layer. This helps to prevent over-fitting, so the MCLT architecture learns better. The parameters comprise weights and biases of neurons of the convolutional and LSTM layers that are learned during training.
The neurons in a feature map in a convolutional layer share weights and biases. Adam optimisation (Chollet 2017), which adapts the learning rate during training, has been used. Weights and biases were initialised using the method of He et al. (2015). The cross entropy loss function has been used during training of the MCLT (Chollet 2017;Nielsen 2015).
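Two of the quantities in this setup lend themselves to a short worked example: the ELU activation and the prediction horizon implied by K F . The sketch below assumes that the value 3.0 quoted for the ELU is its usual scale parameter (written `alpha` here), and uses the thirty-minute resolution stated for the Stuttgart dataset. Function names are hypothetical.

```python
import math


def elu(x, alpha=3.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise.

    Assumption: alpha = 3.0 is the scale parameter referred to in the text.
    """
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def horizon_hours(k_f, resolution_minutes):
    """Prediction time frame: K_F multiplied by the temporal resolution."""
    return k_f * resolution_minutes / 60.0


# Negative inputs saturate towards -alpha instead of being clipped to zero,
# so the unit keeps a non-zero gradient for negative pre-activations.
print(elu(2.0), elu(-1.0), elu(-50.0))

# K_F = 60 at a thirty-minute resolution (Stuttgart) covers 30 hours.
print(horizon_hours(60, 30))
```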

Model Accuracies
The total (overall) accuracy for different months of Stuttgart for the test samples, obtained using the MCLT is shown in Figs. 5 and 6. The total accuracy is the number of correct predictions divided by the total number of predictions (Congalton and Green 2010). In these figures, MCLT with 58   The maximum, minimum, and mean total accuracies for dominant speed prediction (for Stuttgart) using the MCLT with 58 features are 99.1%, 94.9%, and 97.2%, respectively, as shown in Table 2. The maximum, minimum, and mean total accuracies for dominant speed prediction (for Stuttgart) using the MCLT with 2 features are 96.8%, 92.4%, and 95.1%, respectively (Table 2). Similarly, the maximum, minimum, and mean total accuracies for dominant direction prediction (for Stuttgart) using MCLT with 58 features are 99.9%, 94.4%, and 98.7%, respectively (Table 3). The maximum, minimum, and mean total accuracies for dominant direction prediction (for Stuttgart) using MCLT with 2 features are 98.8%, 92.5%, and 97.0%, respectively (Table 3). Figures 5, 6, 7 and 8, Tables 2 and 3 also represent results when the 1DM architecture with 2 and 58 features is used for prediction. Learning curves and loss curves (for speed prediction) of January month's test samples of Stuttgart using the MCLT with 2 and 58 features are shown in Figs. 9 and 10, respectively.

Discussion
The proposed MCLT architecture shows promising results for dominant wind speed and direction prediction of temporal wind datasets from Stuttgart and Netherlands. Below subsections 4.3.1, 4.3.2 and 4.3.3 discuss the results with the help of rose plot, comparison among 2 and 58 features, and comparison with other suitable approaches, respectively.

Rose Plots
A wind rose plot helps in the visualisation of wind speed and direction in the same graph, in a circular format. The length of each spoke around the circle indicates the number of times (count) that the wind blows from the indicated direction. Colors along the spokes indicate classes of wind speed. The data of March (Mar) 2020 of Stuttgart are used to represent the real-world sensor's measurements (ground-truth values) and the prediction outcomes of the MCLT in Figs. 11 and 12, respectively. The high resemblance between Figs. 11 and 12 indicates that the prediction results are similar to the ground-truth values. This visually supports the accuracies reported in Sect. 4.2. In these figures, 21 different color ranges denote the wind speed divided into 21 classes, while the circular format of the wind rose shows the direction the winds blew from. The varying spoke length around the circle shows how often the wind blew from each direction, giving insight into the wind nature for the indicated directions in this study.

Comparison Between 2 and 58 Features
The 58 multiple features in the input layers help the MCLT to learn the temporal variations in the samples. These features are based on percentage difference, standard deviation, correlation coefficient, eigenvalues, and entropy, that are calculated by taking into account some of the nearby temporal values. As the temporal values adjacent to a time instance change, the values of these features also adapt to these changes. Thus, these features help in comprehensive description of wind speed and direction, describing the trends like increase, decrease, stationary, sudden turbulence, rate of increase and decrease, deviation from the mean, behaviour of speed with respect to direction (i.e. correlation), energy (i.e. entropy) of the adjacent temporal values and its variation. Therefore, they provide additional information about samples. Moreover, the movements of the 1D kernels in the convolutional layers further help the convolutional layers to learn their own features in the form of weights and biases during the training phase of the MCLT. When only two features were used in the input layers of the MCLT, maximum total accuracy was 96.8% and 97.4% for Stuttgart and Netherlands, respectively, for speed (Table 2) and 98.8% and 97.9% for Stuttgart and Netherlands, respectively, for direction (Table 3). The maximum total accuracy for MCLT with 58 features is increased by 2.3% and 2.5% for Stuttgart and Netherlands, respectively, for speed (Table 2) and by 1.1% and 1.6% for Stuttgart and Netherlands, respectively, for direction (Table 3) in comparison to MCLT with 2 features. Similarly, the effect of these 58 features over 2 features can also be seen in the case of 1DM (Table 2, Table 3) where maximum total accuracy for speed improved by 1.4% and 1.2% for Stuttgart and Netherlands, respectively, and by 1.2% and 1.0% for Stuttgart and Netherlands, respectively, for direction. 
Learning of the MCLT with 58 features is better than 2 features as shown by respective learning curves in Fig. 9 and by the loss curves in Fig. 10.
Convolutional layers ( C 1 , C 2 ) near the input layers learn the features in a smaller neighbourhood, while the convolutional layers ( C 3 , C 4 ) near the output layer learn features in a larger neighbourhood (He et al. 2016;Huang et al. 2017;Krizhevsky et al. 2012;Xie et al. 2017). C 3 takes as input the learnt features from both C 1 and C 2 , while C 4 takes as input the features from C 1 , C 2 , and C 3 . Therefore, the MCLT gets trained by learning features at different scales. Further, as the convolutional layers ( C 3 , C 4 ) are connected to all the previous convolutional layers, the vanishing gradient problem is avoided, i.e. MCLT learning does not slow down during training via back-propagation (He et al. 2016;Huang et al. 2017;Xie et al. 2017). Moreover, the LSTM layers after the last convolutional layer ( C 4 ) have memory units that retain the learnt features from the previous output of the neurons and operate upon them with features learnt from the current output of the neurons. This gives better learning than the fully connected layers (present in traditional CNNs) that lack these memory units. Additionally, the memory units in the LSTM help in finding correlations between patterns learnt at different times, as a recent pattern is a function of the pattern learnt at a previous time.

Comparison with Existing Related Work
The proposed MCLT architecture is compared with the 1DM. The MCLT with 2 features as well as with 58 features performs better than the 1DM with 58 features, as shown in Figs. 5, 6, 7 and 8 for both Stuttgart and Netherlands. The minimum, maximum and mean total accuracies of the MCLT with 58 features are compared with those of the 1DM with 2 features in Table 4. Thus, the MCLT performs better than the 1DM. Moreover, the MCLT with 58 features efficiently predicts for a larger time frame in the future (K_F = 60, multiplied by the temporal resolution of the wind dataset), whereas the 1DM with 2 features could only predict 50 values into the future (Harbola and Coors 2019b). Furthermore, the MCLT is also compared with methods in the existing literature that are close to the proposed architecture. The 1D CNN algorithm proposed by Liu et al. (2018) uses a regression technique working on smoothed and filtered data, thereby losing the originality of the wind dataset. The same samples, comprising K_B = 60 input values without smoothing or filtering, that were employed for the proposed MCLT are also used to train and test the regression CNN architecture (Liu et al. 2018). In this case, the Symmetric Mean Absolute Percentage Error (SMAPE) (Flores 1986) for wind speed in Stuttgart is 20.5% for K_B = 8 and reaches up to 25.5% for K_B = 60, while for wind speed in Netherlands it is 14.9% for K_B = 15 and reaches up to 21.2% for K_B = 60. The SMAPE values for wind direction followed similar patterns. It may be noted that, here, the labels of the samples are designed using the real values (i.e. regression), whereas the MCLT predictions are based on class labels (i.e. classification). SMAPE was also calculated for the MCLT prediction results. The center of the interval of each class (Table 1) was calculated by taking the average of its lower and upper limits. The class predicted by the MCLT for a test sample, along with the corresponding center of the interval of the predicted class, was noted.
This was done for all the test samples. SMAPE was calculated using the center of the interval of the predicted class and the center of the interval of the ground-truth class for all the test samples. SMAPE for wind speed in Stuttgart was 3.5% for K_B = 8, 1.4% for K_B = 35 and 0.4% for K_B = 60. The SMAPE values for wind direction were similar. As the future time frame of prediction increases, the error also increases for the state-of-the-art CNN-based regression method (Liu et al. 2018). However, the proposed classification-based MCLT shows high accuracy, with the mean total accuracy reaching up to 99.9% for K_B = 60 (and SMAPE = 0.4%), without smoothing or filtering the original wind data. Thus, the proposed MCLT method gives satisfactory results for predicting dominant speed and direction over a longer time duration in the future, unlike the method of Liu et al. (2018). The limit of 58 input features is due only to hardware constraints, and more features can be designed with more GPUs. The accuracies achieved using the designed MCLT can be further improved with better hardware resources, using a greater number of feature maps, neurons, convolutional layers and LSTM layers. Thus, the use of multiple features at various levels in the MCLT, viz. (a) 58 features in the input layers, (b) feeding each convolutional layer with features from all the previous convolutional layers, and (c) retaining the memory of learnt features by the LSTM from previous outputs (of neurons) during training, helps the proposed architecture to predict the dominant speed and direction classes with good accuracy. Further, as the number of classes of the samples increases, more detailed patterns of the nonlinear nature of the wind can be analyzed, but at the same time the ambiguity in classification also increases. However, the proposed MCLT architecture is able to overcome this ambiguity by learning multiple features and performs well even with 21 classes.
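The class-based SMAPE calculation described in this section can be sketched as follows. The class intervals in `CLASS_INTERVALS` are hypothetical placeholders, since the actual intervals are those of Table 1. The SMAPE variant used here, the mean of |F - A| / ((|A| + |F|) / 2) expressed in percent, is also an assumption, as several definitions of SMAPE are in use.

```python
def interval_center(lower, upper):
    """Center of a class interval: average of its lower and upper limits."""
    return (lower + upper) / 2.0

def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent.

    Assumed variant: mean of |F - A| / ((|A| + |F|) / 2).
    A term is taken as zero when both values are zero.
    """
    terms = []
    for a, f in zip(actual, predicted):
        denom = (abs(a) + abs(f)) / 2.0
        terms.append(0.0 if denom == 0 else abs(f - a) / denom)
    return 100.0 * sum(terms) / len(terms)

# Hypothetical speed classes in m/s; not the intervals of Table 1.
CLASS_INTERVALS = {0: (0.0, 2.0), 1: (2.0, 4.0), 2: (4.0, 6.0)}

def class_smape(true_classes, predicted_classes, intervals=CLASS_INTERVALS):
    """SMAPE between the interval centers of ground-truth and predicted classes."""
    centers = {c: interval_center(lo, hi) for c, (lo, hi) in intervals.items()}
    actual = [centers[c] for c in true_classes]
    predicted = [centers[c] for c in predicted_classes]
    return smape(actual, predicted)
```

Under this definition a correctly classified sample contributes zero error, so the class-based SMAPE falls as classification accuracy rises, and a misclassification into a neighbouring class costs less than one into a distant class.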

Conclusion
Wind speed and direction predictions are critical to new wind farm installations and to smart city planning for the proper utilisation of green and freely available energy resources.
In this paper, a deep learning architecture is designed and demonstrated to predict the dominant speed and direction classes in the future for temporal wind datasets. The proposed MCLT architecture uses 58 features in the input layers that are designed using wind speed and direction values. These features are based on percentage difference, standard deviation, correlation coefficient, eigenvalues, and entropy, for comprehensively and efficiently describing the wind trend and its variations. The LSTM layers placed after the last convolutional layer have memory units that employ features learnt from the current as well as the previous output of the neurons. Further, the densely connected convolutional layers in the MCLT help each convolutional layer to learn the features of the other convolutional layers as well. Two large wind datasets from Stuttgart and Netherlands are used for training and testing the MCLT. The maximum total accuracies for speed and direction prediction are 99.9% and 99.9%, respectively. The average total accuracies reach up to 98.9% and 98.7% for speed and direction prediction, respectively. The real-world prediction demonstration of the model supports the novelty of the work and is explained visually with the help of wind rose plots. Thus, the MCLT shows promising results for different wind datasets. Limited hardware resources restricted this study to 58 features in the input layers. The accuracies achieved in this work could be further improved with better hardware resources, using a greater number of feature maps, neurons, convolutional layers and LSTM layers. Most importantly, this analysis would help to devise a new set of inflow boundary conditions, which are prerequisites for obtaining reasonable wind flow fields. Computational Fluid Dynamics (CFD) simulations use wind speed and direction measured at a nearby meteorological station as the inflow boundary conditions, which could be decided using the proposed work.
The wind nature analysis performed here has the potential to help city development authorities and planners in identifying high wind areas, with detailed temporal wind information about magnitude and dominant direction, and in selecting the optimum wind energy conversion systems. In future, the authors will improve the proposed algorithm and work on the visual analysis of the temporal wind dataset. Moreover, the proposed deep learning concept for temporal data could be applied to other time-series datasets, such as those in finance, trend analysis, and sensor health monitoring applications.