Application of non-Gaussian multidimensional autoregressive model for climate data prediction

From the point of view of agriculture, ecology, or environmental engineering, the capability of forecasting meteorological variables in the long and short term is crucial. Short-term forecasts enabling the planning of field work in agriculture, management of mass events, or tourism are important, while long-term forecasts related to advancing climate change are also very interesting. In the literature, there are known many approaches that can be used to forecast climate time series. The most common is based on the statistical modelling of the corresponding data, and the prediction is made on the fitted model. There are known one-dimensional approaches, where single variables are modeled separately; however, in the last decade, there appears a new trend which assumes the importance of the relationship between different time series. This is the approach considered in this paper. We propose to examine the climate data (temperature and precipitation) using the multidimensional vector autoregressive model (VAR). However, because in the time series we observe non-Gaussian behaviour, the classical VAR model can not be applied and the multidimensional Gaussian noise is replaced by the α-\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha -$$\end{document}stable one. This model was previously analyzed by the authors in the context of financial data description where also non-Gaussian characteristics are observed. The main goal of this paper is to answer the question whether there are reasons to go from the Gaussian model to the generalized models, like α-\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha -$$\end{document}stable based. The second purpose is to link total precipitation data with temperature time series. In the classical approach, precipitation was treated as a variable not correlated with temperature, which, as we will show in the paper, is inconsistent with reality. We hope the presented in this paper results open new areas of interest related to climate data modelling and prediction.


Introduction
From the point of view of agriculture, ecology, or environmental engineering, the capability of forecasting meteorological variables in the long and short term is very useful. On the one hand, short-term forecasts enabling the planning of field work in agriculture, management of mass events or tourism are important; on the other hand, longterm forecasts related to the advancing climate change are also very interesting. These changes depend on many global factors, the most important of which is the amount of greenhouse gases in the atmosphere. According to the analyzes, greenhouse gases (mainly carbon dioxide) are responsible for the temperature increase observed in many places around the world. In this case, depending on the RCP (Representative Concentration Pathways) scenario [1,2], the content of these gases in the atmosphere is forecast using climatic indicators up to the year 2100. The ability to model climate variables can be used for forecasting as well as for generating synthetic data in any time horizon and in any resolution using weather generators. Meteorological variables in the short-term horizon are usually described using models that take into account a large number of atmospheric factors [3,4]. There are numerous models based on various types of statistical analysis [3][4][5], and from the point of view of new technologies, models based on artificial intelligence are becoming more and more popular [6]. From the point of view of generating meteorological data [7], the most popular models that are in use are models that in the case of precipitation, the model total precipitation by means of the first-order Markov chain to determine the occurrence of wet/dry days, and then for the amount of precipitation, the multidimensional two-parameter gamma distribution is used [2,8,9]. For other variables such as daily minimum temperature, maximum temperature, solar radiation, and wind speed, multivariate autoregression models are usually used [10]. In the first generators, it was assumed that the distributions of the described variables could be described by Gaussian distribution. However, it was soon observed that this assumption was not correct. In [11], the multivariate closed skew-normal distributions [12] were used for modeling of the residual skewness observed in climate data.
Following this path, one should ask whether the use of autoregressive Gaussian models to estimate model parameters is correct and whether there are reasons to go from the Gaussian model to the generalized models. Answering this question is the first goal of the paper. The second purpose is to link total precipitation data with temperature time series. In the first generator, precipitation was treated as a variable not correlated with temperature, which, as we will show in the paper, is inconsistent with reality.
To achieve the defined goals, in this paper, we propose a multidimensional approach for climate data description. To take into consideration the relation between the examined climate data (i.e., temperature and total precipitation), we apply a multidimensional vector autoregressive (VAR) model. It should be emphasized that this time series can be used to describe the dependence between analyzed variables, and at the same time, it takes into account the time dependence for single components. The VAR model is considered as the classical multidimensional time series [13]. However, because the analyzed data exhibit non-Gaussian behavior, the classical VAR model used in this paper, is modified to take into consideration the specific characteristics of the data. Thus, we propose to replace the Gaussian distributed innovations in the classical model by a more general class, namely aÀstable distributed variables. This class of distributions was introduced in 20's [14,15]. The stable probability laws are important in probability theory. According to the Generalized Central Limit Theorem, the stable laws attract distributions of sums of random variables with a diverging variance. It is a generalization of the Central Limit Theorem, which states that the Gaussian law attracts distributions with finite variance. The aÀstable distribution is considered as the generalization of the Gaussian one. Although the first application of this distribution appeared in the work of Mandelbrot [16], where the financial time series were analyzed, the aÀstable distributions and processes have found various applications, including economy [17][18][19][20][21][22][23], physics [24][25][26][27], signal processing [28][29][30][31], computer science [32][33][34][35], geology and geophysics [36][37][38], biology [39][40][41][42], and many other fields. The aÀstable distributions are also considered for climate data modelling, see, e.g., [43][44][45][46]. We also refer the readers to interesting bibliography positions [47,48]. The VAR time series based on the aÀstable distribution was considered, for instance, in [49][50][51][52]. This model takes into consideration the relationship between the multidimensional data, and by using the aÀstable distribution instead of the Gaussian one, it allows describing the data with the heavy-tailed behaviour.
The paper is organized as follows. In Sect. 2, we present the used model. Next, in Sect. 3, we describe the analyzed climate data indicating the relationship between variables. The relation between temperature and precipitation data is a starting point to apply the VAR model. The results of the fitting are presented in Sect. 4. In this section, we also discuss the prediction results and compare them with the predictions obtained in the Klimada project. The last section concludes the paper.

The aÀstable vector autoregressive model
In this section, we present the model under consideration. Because the general methodology was introduced in the literature, therefore, we remind only the basic information referring to the appropriate bibliography positions.
The model examined in this paper is called the vector autoregressive time series (VAR model) with the astable distribution. This time series can be treated as a generalization of the Gaussian VAR model, [13].
A m-dimensional time series fXðtÞg ¼ fðX 1 ðtÞ; . . .; X m ðtÞÞ T g is called a VAR model with the astable distribution, and if for each t 2 Z , it fulfills the following equation XðtÞ À H 1 Xðt À 1Þ À . . . À H p Xðt À pÞ ¼ ZðtÞ: In Eq. (1), the sequence fZðtÞg ¼ fðZ 1 ðtÞ; . . .; Z m ðtÞÞ T g constitutes a sample of independent m-dimensional vectors with independent components having one-dimensional and signðÁÞ denotes a sign function. The parameter 0\a 2 is the stability index which is responsible for the rate of convergence of the distribution tail. More precisely, for a 6 ¼ 2 , the probability tail PðZ [ zÞ $ z Àa and the corresponding second moment are infinite. Thus, the smaller a, the convergence is slower, and in the consequence, the aÀstable random variables take extreme values more likely than it in the Gaussian case. The aÀstable distribution for a 6 ¼ 2 belongs to the so-called heavy-tailed class of distributions. On the other hand, for a ¼ 2 , it reduces to the Gaussian distribution, and in this sense, it can be considered as its generalization. The model defined as in Eq. ( 1) can be considered as the generalisation of the VAR model with Gaussian innovations. The other parameters of aÀstable distribution are: the scale parameter r [ 0, the skewness parameter À1 b 1, and the shift parameter l 2 R. For more details about the aÀstable distribution, we refer to [53,[55][56][57][58][59]. It should be emphasized that for m ¼ 1 , the model defined in Eq. (1) is the one-dimensional autoregressive time series (AR model) with aÀstable innovations, [60]. For more details about the AR model and multidimensional VAR model with aÀstable innovations, we refer the readers to the bibliography positions [49][50][51][60][61][62].
In the Gaussian VAR model, one of the classical methods used for the estimation of the parameters is the Yule-Walker approach that utilizes the auto-covariance function [13]. For the VAR model with non-Gaussian aÀstable distribution, this classical measure of dependence is infinite, and thus, in the literature, there are considered different approaches to the parameters' estimation. One of these is the modified Yule-Walker approach that utilizes the so-called auto-covariation function, and one of the dependency measures properly defined for the aÀstable model. This method is described in detail in [49,63], and in this paper, it is used for multi-dimensional VAR models as well as for one-dimensional AR time series with aÀstable innovations. See also [28,64,65].

Climate data description
The analyzed data are the daily measurements of the maximum temperature (Tmax), minimum temperature (Tmin), average temperature (Tavg), and total precipitation (P) for the city of Wrocław from 1961-2020. The data come from the Institute of Meteorology and Water Management (IMGW) [66]. Each variable is described by a time series of length 21915 data. Figure 1 shows the graphs of all analyzed data. As can be easily seen, all data exhibit seasonal behavior. Therefore, they were divided into individual months, and thus, 12 time series for each variable were obtained. For the temperature data, an upward trend characterizing the data in individual months is visible; therefore, before the time series modelling, the long-term trend, based on a moving average averaging data from the moving 10 years, was removed. As it is known from various studies [67], this trend is related to long-term climate changes leading to global warming.
The first step of our study was to analyze the structure of the relationship between the individual time series. For this purpose, the behavior of correlation coefficients for 5-year windows sliding by one day taking into account all combinations of time series was investigated. The period of five years has been taken as long enough to detect a correlation while not being affected by the long-term trend. For the purposes of research, the Pearson, Spearman, and Kendall correlation coefficients were used, see [68]. The Person correlation coefficient is essential to study the linear relationship between two variables; however, its estimatior is sensitive to outliers. Therefore, this measure is useful, especially for the Gaussian (or light-tailed) distributed variables [69]. The Spearman correlation coefficient (called Spearman rank correlation coefficient) measures a monotonic relationship between variables. Its estimator is insensitive to large observations and thus in cases when the analyzed variables are heavy-tailed [70,71]. The Kendall correlation coefficient (called also Kendall rank correlation coefficient) indicates not only the strength, but also the direction of the dependence. Similarly to the Spearman correlation coefficient, it is resistant to outliers and is used especially for non-Gaussian distributed data [72].
In Fig. 6, there are presented the exemplary correlation coefficients for the sliding 5-year windows for July data   Fig. 7 for December. The coefficients were calculated for all combinations of the considered data. It can be clearly seen that the Spearman and Pearson coefficients coincide for the temperature data. The Kendall coefficient is usually lower due to the way it is calculated. This indicates a linear relationship between the data. The situation is different for the relationship between the temperature and precipitation data. In this case, we observe large discrepancies between the coefficients, which indicates a nonlinear relationship and probably non-Gaussian behavior. An important observation is the fact that all considered data are significantly correlated with each other, which justifies the consideration of multivariate data modeling using VAR-type models. On the other hand, the

Depenendence between Tavg and P
Pearson Spearman Kendall Fig. 7 The analyzed correlation coefficients for sliding fiveyears windows in December for each data combination  Month  T1  T2  T3  T4  T5  T1  T2  T3  T4  T5   1  nonlinear relationship between the data also indicates that considering a model other than Gaussian may be justified.

Model fitting
As a set of four time series was analyzed and the correlation coefficients between all data are significant, an attempt was made to fit the 4-dimensional VAR(1) model, with astable innovations, described in the previous section, to the data under consideration. The VAR(1) model was used as the optimal according to multiple literature positions analysing this type of data [2,8,9]. Due to the fact that in the case of the daily sum of precipitation, the VAR model based on maximum, minimum, and average temperatures cannot reflect the exact values and the methodology of forecasting precipitation has a different nature [2,8,9] (which does not change the fact that precipitation has a significant impact on daily temperatures), only the estimation of the model and forecasts for maximum and minimum temperature is considered. The two remaining variables will be treated as auxiliary variables that have a significant impact on the quality of the forecast As the relationships between 4-dimensional data are not always linear, the residuals distribution of the fitted non-Gaussian 4-dimensional VAR(1) model was investigated by testing whether these distributions are Gaussian or astable and fitting the aÀstable distribution. In the Tables 1 and 3, we present the values of the five statistics used for testing Gaussian and/or a-stable distributions of the residuals, namely Kolmogorov-Smirnov test (T1), Kuiper test (T2), Watson test (T3), Cramer-von Mises test (T4), and Anderson-Darling test (T5), see [73]. In Tables 2 and 4, we can see the parameters of fitted to residuals series the astable distribution. The results are presented for the minimum and maximum temperature.
In the case of daily maximum temperature, the residuals distributions of the fitted 4-dimensional non-Gaussian VAR (1) model are not Gaussian for all considered months. The results of the stability test statistics show that the residuals distribution is much closer to a-stable distribution which indicates a justified use of a non-Gaussian model. In Table 2, we can observe that the residuals for maximum temperature, especially for winter months, have heavier tails than in Gaussian distribution. Additionally, for most of the analysed months, the residuals b parameter is significantly positive, which indicates the right skewness of this distribution. Thus, in the case of maximum temperatures, we will observe the extreme values more often, with particular emphasis on the values with high maximum temperature.   Month  T1  T2  T3  T4  T5  T1  T2  T3  T4  T5   1  In the case of minimum temperature, the lack of Gaussianity of the residuals distributions is not as visible as in the case of maximum temperature. In some months, statistics for normality are lower than for stability. However, generally we can assume that the residuals distribution is closer to astable than to Gaussian. As in the case of maximum temperatures, the fitted a factor of a-stable distribution indicates heavier tails in winter months, and a negative b for almost all months indicates that special attention should be paid to the more frequent low values of minimum temperature.

Prediction
Based on the fitted model, the one-step prediction was made (one day ahead) for the minimum temperature and for the maximum temperature. The mean absolute error (MAE) measure, which is appropriate for non-Gaussian considerations, was used to measure the forecast error [74] where t pred means the predicted and t obs the observed value. The mean absolute forecast error was determined for the last ten years, i.e., for the period 2010-2020. The forecast errors obtained in this way were compared with the onedimensional model of autoregression of the first-order AR(1) with a-stable innovations for both minimum and maximum temperature separately. The forecast quality results are presented in Tables 5 and 6. The results presented in Tables 5 and 6 show that the multivariate model has an advantage over the one-dimensional model in the case of forecasting the daily maximum and minimum temperature values. However, the quality of the forecast for the minimum temperature is much better. The mean absolute forecast error for the minimum temperature for all months is about 0.5 degrees Celsius lower than for the univariate model, while for the maximum temperature, the difference is between 0.15 and 0.3. However, from the prognostic point of view, considering multivariate models, in which the correlation between the series under consideration is taken into account, is absolutely justified. In addition, it is also justified to use models with a nonlinear correlation structure and no assumption of Gaussianity.
Due to the fact that in the case of minimum temperature, the quality of the forecast is much better, and an attempt was made to estimate the future minimum temperature values for the year 2090, assuming the RCP 8.5 scenario [1,75]. In the RCP 8.5 scenario, it is assumed that the warming of the climate related to the content of greenhouse gases will continue as before. The determined values were presented with confidence intervals and compared with the forecast values in the Klimada project [76].
The calculations were made for 10, 000 trajectories simulated on the basis of the estimated 4-dimensional VAR(1) model using the Monte Carlo methodology for each considered month. Figure 8 presents a chart showing the scenario for the year 2090. In the figure, we can see the monthly average minimum temperature obtained in the Klimada project and using the 4-dimensional VAR(1) model and the 99% and 95% confidence intervals from the VAR model.
It is worth noting that the forecast based on a simple VAR model, in the case of the forecast for 2090 of the average monthly minimum temperature, assuming the RCP 8.5 scenario, gives results not much different from those from the Klimada project, where the model used is  certainly much more sophisticated. What is essential is the fact that the confidence intervals are shifted downwards against the mean (the distance between the lower limit of the confidence interval and the mean is in most cases greater than the distance between the mean and the upper limit of the interval). This asymmetry is related to the fact that non-Gaussian residuals are used for projection (thicker tails and left asymmetry). This affects the quality of the projection and the further information that the projection carries with it. As for the distribution of the structure of minimum temperatures, it is also worth noting that despite the asymmetry of the confidence intervals and a slightly lower average minimum temperature in the winter months, the percentage of days with a minimum temperature below 0 C in the VAR model is 10%, while in the Klimada project, it is about 12:6%, which indicates a certain narrowing of the confidence intervals. Projection using the VAR model shows that in the RCP 8.5 scenario, we can expect, for example, daily minimum temperatures in December even around 16 C , which clearly indicates the already observed upward trend.

Conclusions
In this paper, we have analyzed multidimensional climate data related to temperature (minimum, maximum, and averaged) and total precipitation. We have demonstrated the examined time series are related, and the structure of their relation changes over time. Moreover, we analysis of three different correlation coefficients clearly indicates their non-Gaussian behavior. The preliminary research was the starting point for the proposition of the stochastic model that shares similar properties as the data. More precisely, we have proposed the multidimensional VAR time series with non-Gaussian aÀstable innovations. This model was previously used by the authors for financial data modelling. We have shown that the proposed model more precisely reflects the nature of the data, and thus, it is justified to describe the time series by using non-Gaussian systems. Moreover, finally we have demonstrated that the model-based prediction corresponds to the RCP 8.5 scenario-based results. The proposed approach is universal and can be used in a more general case, e.g., to describe climate data corresponding to more variables. One of the main goals was to prove that the non-Gaussian multidimensional model is adequate to the analyzed data. This objective of this paper is fulfilled, and the obtained results clearly confirm this.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.