Forecasting VIX using Bayesian Deep Learning

In recent years, deep learning techniques have been gradually replacing traditional statistical and machine learning models as the first choice for price forecasting tasks. In this paper, we leverage probabilistic deep learning to infer the volatility index VIX. We employ the probabilistic counterparts of WaveNet, Temporal Convolutional Network (TCN), and Transformer models. We show that TCN outperforms all models with an RMSE of around 0.189. In addition, it is well known that modern neural networks provide inaccurate uncertainty estimates. To address this problem, we use standard deviation scaling to calibrate the networks. Furthermore, we find that MNF with a Gaussian prior outperforms the Reparameterization Trick and Flipout models in terms of precision and uncertainty predictions. Finally, we claim that MNF with Cauchy and LogUniform prior distributions yields well-calibrated TCN and WaveNet networks, with the former giving the best inference of the VIX values.


Introduction
Investors and regulators are concerned about financial market volatility and crashes.
For this reason, the Volatility index (VIX) was introduced in 1993 by the Chicago Board Options Exchange (CBOE) with the aim of assessing the expected financial market volatility in the short-run, i.e. for the next 30 days, since it is calculated as an implied volatility from the options on the S&P 500 index on this time-to-maturity [1].
The VIX has been proven to be a good predictor of expected stock index shifts, and therefore serves as an early warning of investor sentiment and financial market turbulence (see e.g., [1], and more recently, [2]). Due to its importance for asset managers and regulators, it would be useful to foresee the values of the index; however, the VIX is very difficult to forecast [3]. Several proposals to predict time series can be found in the literature, classified as conventional and modern methods (see e.g., [4] and the references therein). Among modern methods, deep learning techniques have been successfully applied to financial time series.
Given a probability space, a time series may be defined as a discrete-time stochastic process, in other words, a collection of random variables indexed by the integers [5]. A time series is a sequence of repeated observations of a given set of variables over a period of time [6], and sequences are data points that can be ordered, where past observations may provide relevant information about future ones. Deep learning models employed for other types of sequence data are therefore also useful for time series. Sequence models may be classified as (see e.g., [7]): (i) one-to-sequence, where a single input is employed to generate a sequence as an output (e.g., generating text from an image), (ii) sequence-to-one, where a sequence of data is used to generate a single output (e.g., sentiment classification), and (iii) sequence-to-sequence, where sequential data is the input used to produce a sequence as output (e.g., machine translation). Time series can be regarded as a special sequence-to-sequence case with trend, seasonality, autocorrelation, and noise characteristics [8]. Furthermore, financial time series are nonstationary, nonlinear, and noisy, which makes their prediction more challenging [4].
Though several deep learning models have been successfully applied to calculate point estimates of financial variables, all financial models are subject to modeling errors and uncertainty caused by inexact data inputs. Probabilistic models are therefore more adequate for achieving realistic financial inferences and predictions [9], and hence for optimal decision making [10]. Besides, it has recently been found that neural networks are miscalibrated [11]. Thus, our work intends to tackle the abovementioned drawbacks by contributing to the literature in the following aspects: (i) we employ three modern deep learning models to predict the VIX values in a deterministic framework, namely WaveNet, Temporal Convolutional Networks (TCN), and Transformer; (ii) we obtain the probabilistic version of the deterministic models by using three techniques: Reparameterization Trick (RT), Flipout, and Multiplicative Normalizing Flows (MNF); (iii) we calibrate the probabilistic models with a simple approach known as standard deviation scaling; and finally (iv) we find that the probabilistic models WaveNet-MNF and TCN-MNF, with LogUniform and Cauchy priors, respectively, are well calibrated.
The rest of the paper is organized as follows. Section 2 presents an overview of the literature related to the models examined in our study. Section 3 describes the WaveNet, TCN, and Transformer models. Section 4 briefly reviews Bayesian neural networks and the three approaches utilized: Reparameterization Trick (RT), Flipout, and Multiplicative Normalizing Flows (MNF). Section 5 presents the calibration problem. Section 6 presents the VIX dataset. Section 7 explains the methodology of our work. Section 8 presents the results for the deterministic and probabilistic models and their calibration. Finally, Section 10 concludes the paper.

Related Literature
Regarding deep learning models applied to financial time series forecasting, [12] performed an exhaustive review of the literature between 2005 and 2019, whereas [13] does so for 2020 and 2022. Among these studies, in relation to the VIX, Psaradellis and Sermpinis [14] proposed HAR-GASVR, which combines a Heterogeneous Autoregressive Process (HAR) with a Genetic Algorithm Support Vector Regressor (GASVR) model.
On the other hand, Huang et al. [4] and Yujun et al. [15] employ variational mode decomposition (VMD) methods combined with the long short-term memory (LSTM) model.
Among the neural networks analyzed in our study, WaveNet has been applied to the VIX [16] and in probabilistic models [17]. In this work, we also implement TCN for financial time series because of its adequate performance on time series [18], financial time series [19], high-frequency financial data [20], and probabilistic forecasting [21]. Transformer models have also been applied in finance [22] and in probabilistic developments for time series [23].
To the best of our knowledge, there are few attempts to apply probabilistic models specifically to financial time series [24], [25], [26].

Neural Networks
This section briefly reviews the neural networks employed. An artificial neural network is a special type of machine learning model that connects neurons organized in layers.
A deep learning model, in turn, is a kind of neural network with numerous layers and neurons [7].

WaveNet
The WaveNet model was introduced by [27] in 2016 to generate raw audio waveforms for the purpose of reproducing human voices and musical instruments. In short, there is a convolutional layer, which accesses the current and previous inputs. Moreover, there is a stack of dilated (aka atrous) causal one-dimensional convolutional layers, that is, when applying a convolutional layer some input values are skipped, with exponentially increasing dilation rates [28]. At the end of the architecture there are dense layers with an adequate activation function. Thus, this model learns short- and long-term patterns.
In the original paper, the authors stacked 10 convolutional layers with dilation rates of 1, 2, 4, 8, ..., 256, 512 [29]. Since audio is a type of sequential data, we apply WaveNet to financial time series, which is also a form of sequential data, as mentioned above.
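As a rough illustration of why stacked dilations give access to long histories, the following sketch computes the receptive field of a stack of causal convolutions. The kernel size of 2 is an assumption (it is not stated in the text above), and the function name `receptive_field` is hypothetical.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input steps visible to the last output of a causal conv stack."""
    # Each layer extends the visible history by (kernel_size - 1) * dilation steps.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Dilation rates 1, 2, 4, ..., 512 as in the ten-layer stack described above.
dilations = [2 ** i for i in range(10)]
rf = receptive_field(dilations)  # kernel_size=2 is an assumed value
```

Under these assumptions, a single stack of ten layers already covers on the order of a thousand past time steps, which is why few layers suffice for long-term patterns.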

Temporal Convolutional Network (TCN)
The Temporal Convolutional Network (TCN) was first developed by [30], whose authors unified the traditional two-step procedure for video-based action segmentation. The first step involves a Convolutional Neural Network (CNN) that encodes spatial-temporal information, and the second step involves a Recurrent Neural Network (RNN) that captures high-level temporal linkages. Therefore, a TCN may be summarized as a hierarchical temporal encoder-decoder network that allows for long-term patterns, since it is an adaptation of WaveNet [30]. The available Keras package for TCN, coded by Philippe Rémy and based on [31], is employed in our work.

Transformer
The standard Transformer model was developed in [32], "Attention is all you need". It is a non-recurrent encoder-decoder architecture that transforms one sequence into another (hence the name Transformer). The encoder is generally composed of multi-head attention (MHA) and feed-forward layers with residual connections in between. Though the decoder part is like the encoder, it has a self-attention layer (see e.g., [33] for more details about attention-based models). The attention mechanism is usually represented as Attention(Q, K, V), where Q contains the queries, K denotes the keys, and V stands for the values. The main component, MHA, allows for "attending" to long-term dependencies in a different way from the short-term dependencies simultaneously. Among its important applications are the Bidirectional Encoder Representations from Transformers (BERT) and GPT-3 models in natural language processing [34].
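To make the Attention(Q, K, V) notation concrete, the sketch below implements single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined in [32]. It is a NumPy illustration only, not the Keras layer used in this work; the array shapes and the function name are assumptions chosen for the example.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


# Illustrative shapes: 4 query steps, 6 key/value steps, key dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 3))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the key positions, so every output step is a weighted average of the value vectors.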

Bayesian Neural Networks
Probabilistic models like Bayesian Neural Networks (BNN) are more adequate for financial estimates since financial data are prone to measurement errors and are noisy.
A BNN treats the weights of the network as probability distributions rather than single values as in traditional neural networks. To this aim, a prior distribution is (in general) placed over the network weights. Therefore, an appropriate model should quantify the uncertainties to get a better understanding of the risk involved and improve the decision-making process [9]. There are two main uncertainty sources: aleatoric uncertainty (or data uncertainty) and epistemic uncertainty (or model uncertainty). An ideal BNN would yield more accurate uncertainty estimates, because high uncertainty is a sign of imprecise model predictions [35]. The total uncertainty of a new test output $y^*$ given a new test input $x^*$ may be expressed as (see e.g., [36], Section 2.2, and the references therein)
\[ \operatorname{Var}(y^* \mid x^*) \approx \frac{1}{T}\sum_{t=1}^{T} \sigma_t^2 + \frac{1}{T}\sum_{t=1}^{T} (\mu_t - \bar{\mu})^2, \tag{1} \]
where $\frac{1}{T}\sum_{t=1}^{T} \sigma_t^2$, the mean of the prediction variance, represents the aleatoric uncertainty and $\frac{1}{T}\sum_{t=1}^{T} (\mu_t - \bar{\mu})^2$, the variance of the prediction mean, represents the epistemic uncertainty, with $\bar{\mu} = \frac{1}{T}\sum_{t=1}^{T} \mu_t$.
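The decomposition of the total uncertainty can be sketched in a few lines of NumPy. The sketch assumes that the network is evaluated T times and that each pass returns a predictive mean and standard deviation per test point; the array layout and the function name are illustrative assumptions, not the code used in this work.

```python
import numpy as np


def decompose_uncertainty(mu, sigma):
    """Split the total predictive variance into aleatoric and epistemic parts.

    mu, sigma: arrays of shape (T, N), one row per stochastic forward pass.
    """
    aleatoric = np.mean(sigma ** 2, axis=0)                   # mean of the variances
    epistemic = np.mean((mu - mu.mean(axis=0)) ** 2, axis=0)  # variance of the means
    return aleatoric, epistemic, aleatoric + epistemic


# Toy example with T = 2 passes and N = 2 test points.
mu = np.array([[1.0, 2.0], [3.0, 2.0]])
sigma = np.array([[0.5, 1.0], [0.5, 1.0]])
aleatoric, epistemic, total = decompose_uncertainty(mu, sigma)
```

In the toy example the second test point has zero epistemic uncertainty because both passes agree on its mean, while the first point carries both components.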
For inference in probabilistic models, the Markov Chain Monte Carlo (MCMC) approach (e.g., Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), among others) and variational inference can be considered. The latter is employed in this work and is described as follows (based on [37] and its notation, where more details and further references can be found).
The output of a BNN is the posterior distribution of the network weights. MCMC methods may be applied to this end; however, they are computationally expensive.
Another approach, which is gaining interest in academia, is variational inference. Let $p(\omega)$ denote the prior distribution over a parameter $\omega$ (the network weights) on a parameter space $\Omega$. The posterior distribution of the parameter is given by
\[ p(\omega \mid D) = \frac{p(D \mid \omega)\, p(\omega)}{p(D)}, \]
where $p(D \mid \omega)$ is known as the likelihood and $p(D)$ as the marginal (or evidence) in the Bayesian inference framework. In detail, the dataset $D$ is denoted as $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the inputs and $y_i$ the outputs of the $N$ samples in the analyzed dataset.
The goal in variational inference is to find a variational distribution $q_\theta(\omega)$ (indexed by a variational parameter $\theta$ and taken from a family of distributions $Q$) that approximates the posterior distribution $p(\omega \mid D)$. This is done by minimizing the Kullback-Leibler (KL) divergence between the two aforementioned distributions, which is defined as
\[ \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid D)\big) = \int_{\Omega} q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid D)} \, d\omega . \]
It can be shown that minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO), which is given by
\[ \mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(\omega)}\big[\log p(D \mid \omega)\big] - \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big) . \]
The mean-field approximation with normal distributions may be a proposal for the $Q$ family of distributions [38], [39]. That is,
\[ q_\theta(\omega) = \prod_{i}\prod_{j} \mathcal{N}\big(\omega_{ij};\, \mu_{ij}, \sigma_{ij}^2\big), \]
where $i$ indicates the index of the neurons from the previous layer and $j$ the index of the neurons in the current layer. However, it poses a dimensionality problem in the parameters (mean $\mu_{ij}$ and variance $\sigma_{ij}^2$) to be estimated. Moreover, the KL divergence may be approximated by sampling the variational distribution $q_\theta(\omega)$, but it is not possible to perform backpropagation through a random variable. A solution to this problem is the Reparameterization Trick, which is our first approach.

Reparameterization Trick
An unbiased and efficient stochastic gradient-based variational inference is provided by the (non-local) Reparameterization Trick (RT), which was applied to variational autoencoders in [40] to make backpropagation possible; the output parameters are normally distributed [41], [42]. Rather than sampling $\omega$ directly, samples are generated from another variable $\epsilon_{ij}$, which follows a standard normal distribution, and then
\[ \omega_{ij} = \mu_{ij} + \sigma_{ij}\,\epsilon_{ij} \]
is calculated, allowing for backpropagation. More details can be found in [40], [43], [44] and the TensorFlow documentation at DenseReparameterization.
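The idea can be sketched in NumPy as follows: the randomness is isolated in a standard normal draw, so the sampled weight is a deterministic function of the variational parameters. The softplus parameterization of the standard deviation through `rho` is an assumption made for this illustration (it is a common choice, but the text above does not specify it), and `sample_weights` is a hypothetical name.

```python
import numpy as np


def sample_weights(mu, rho, rng):
    """Draw weights as a deterministic function of (mu, rho) and external noise."""
    sigma = np.log1p(np.exp(rho))        # softplus keeps sigma positive (assumption)
    eps = rng.standard_normal(mu.shape)  # noise does not depend on mu or rho
    return mu + sigma * eps, sigma, eps


rng = np.random.default_rng(0)
mu = np.full(100_000, 1.5)
rho = np.zeros(100_000)                  # softplus(0) = log(2)
w, sigma, eps = sample_weights(mu, rho, rng)
```

Because the sampled weight is linear in the mean and the standard deviation, its derivative with respect to the mean is 1 and with respect to the standard deviation is the noise draw, which is what lets gradients flow through the sampling step.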

Flipout
Flipout also provides an unbiased and efficient stochastic gradient estimator, but reduces the variance of the gradient estimates compared to RT. It was proposed by [45] and applied to LSTM and convolutional networks. The authors impose two constraints: (i) the perturbations are independent, and (ii) the perturbations are centered at zero and have a symmetric distribution. See more details in the TensorFlow documentation at DenseFlipout.

Multiplicative Normalizing Flows
Normalizing flows (NF) are probabilistic models useful to fit a complex distribution by learning a transformation (or flow) [42]. The NF can be represented as
\[ p_T(y) = p(x) \left| \det \frac{\partial T(x)}{\partial x} \right|^{-1}, \qquad y = T(x), \]
where $p_T(y)$ is the probability density function (pdf) of the transformed variable $y$, $T$ is the invertible mapping, and $p(x)$ is the pdf of the random variable (rv) $x$.
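A one-dimensional example may help to read the change-of-variables formula. The sketch below takes a standard normal base variable and an affine flow T(x) = a x + b, both chosen purely for illustration, and evaluates the transformed density by inverting the flow and dividing by the absolute derivative.

```python
import math


def std_normal_pdf(x):
    """Density of the base variable x, here a standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_pdf(y, mean, std):
    """Reference density of a normal variable with the given mean and std."""
    return std_normal_pdf((y - mean) / std) / std


def flow_density(y, a, b):
    """Density of y = T(x) = a * x + b obtained by change of variables."""
    x = (y - b) / a                    # invert the flow
    return std_normal_pdf(x) / abs(a)  # p(x) times |dT/dx|^{-1}
```

For this affine flow the result coincides with the density of a normal variable with mean b and standard deviation |a|, for instance `flow_density(0.7, 2.0, -1.0)` agrees with `normal_pdf(0.7, -1.0, 2.0)`.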
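By including auxiliary rv's $z \sim q_\theta(z)$ and a factorial Gaussian posterior for the weights with mean parameters conditioned on scaling factors that are modelled by NF, the multiplicative normalizing flows (MNF) are obtained [46]. Therefore, the variational posterior for fully connected layers (a similar result is obtained for convolutional layers) is given by
\[ q(\omega \mid z) = \prod_{i}\prod_{j} \mathcal{N}\big(\omega_{ij};\, z_i \mu_{ij}, \sigma_{ij}^2\big), \qquad z \sim q_\theta(z), \]
and then a distribution $q(z_K)$ is obtained by applying the transform in Eq. 6 successively as
\[ \log q(z_K) = \log q(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial T_k}{\partial z_{k-1}} \right| . \]
Finally, by incorporating an auxiliary distribution $r(z_K \mid \omega, \phi)$, with a new parameter $\phi$, the KL divergence may be bounded as follows
\[ -\mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big) \geq \mathbb{E}_{q(\omega, z_K)}\big[ -\mathrm{KL}\big(q(\omega \mid z_K) \,\|\, p(\omega)\big) + \log r(z_K \mid \omega, \phi) - \log q(z_K) \big] . \]
For more details, see e.g., [36], Section 2.3. The codes and references found at MNF are utilized in our work for the MNF model.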

Calibration
Since the seminal work of [47], more attention has been paid in academia to obtaining not only accurate forecasts but also reliable prediction confidence levels from robust neural networks. This is achieved by the so-called calibration process.
For classification tasks, well-known calibration techniques include Platt calibration, histogram binning, Bayesian binning into quantiles, temperature scaling, isotonic regression, and ensemble-based calibration methods; the usual metrics, such as expected calibration error (ECE), maximum calibration error (MCE), and negative log-likelihood (NLL), as well as visual reliability diagrams, are employed (see e.g., [47]). More recently, these techniques have been classified in the literature as post-hoc rescaling of predictions, averaging of multiple predictions, and data augmentation strategies ([48] and the references therein). For a comprehensive review of calibration methods see [49], [50], [51]. We follow a method similar to quantile recalibration for regression tasks in machine learning [52], which is regarded as a post-hoc rescaling method.
The standard deviation scaling method (proposed by [53]) is adapted in our work. It simply scales the total uncertainty (see Eq. 1) of the uncalibrated network by a factor that minimizes the root mean squared calibration error (RMSCE; [54], Eq. 19).

Data
Figure 1 shows the daily behavior of the historical VIX price from August 22, 2013 to July 31, 2023, and its descriptive statistics are presented in Table 1. As seen in the descriptive statistics, the maximum value of the VIX index was 82.

Methodology
The analyzed data consist of the volatility index VIX, downloaded from Yahoo Finance at daily frequency from August 22, 2013 to July 31, 2023. Thus, the total length of the data is 2500 observations. The methodology is described as follows.
In a first step, the VIX time series data are collected from Yahoo Finance, which is freely accessible. Since time series (with trend, seasonality, autocorrelation, and noise attributes) are a special case of the many-to-many sequence domain, they need a different treatment from the most common tasks in this domain. In particular, a windowed dataset is created as in [8] to provide a rolling window for forecasting purposes. We employ a window size of 20 days, i.e. a trading month.
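The windowing step can be sketched as follows. This is a minimal NumPy illustration rather than the pipeline of [8]; the one-step-ahead horizon and the function name `make_windows` are assumptions, while the window size of 20 follows the text.

```python
import numpy as np


def make_windows(series, window_size=20, horizon=1):
    """Turn a 1-D series into (input window, next values) pairs."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window_size - horizon + 1
    X = np.stack([series[i:i + window_size] for i in range(n)])
    y = np.stack([series[i + window_size:i + window_size + horizon] for i in range(n)])
    return X, y


# Toy series 0, 1, ..., 29 standing in for the VIX values.
series = np.arange(30.0)
X, y = make_windows(series)
```

Each row of `X` holds 20 consecutive observations and the matching row of `y` holds the value that immediately follows them, so no future information leaks into the inputs.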
Moreover, a scaler transformation that is robust to outliers is applied to the data. This transformation subtracts the median (instead of the mean, as usual) and scales the data by the interquartile range (rather than the standard deviation). The software employed is Python, TensorFlow, Keras Tuner, and TensorFlow Probability, the latter for the probabilistic models. Finally, code repositories for the models and for MNF replicability are also used in our work.
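A minimal sketch of the robust scaling described above is given below. Estimating the median and interquartile range on the training portion only, and reusing them for the other splits, is an assumption of this illustration; the function name is hypothetical.

```python
import numpy as np


def robust_scale(train, other):
    """Center by the training median and scale by the training IQR."""
    median = np.median(train)
    q1, q3 = np.percentile(train, [25, 75])
    iqr = q3 - q1
    return (train - median) / iqr, (other - median) / iqr


# Toy data with one extreme value, mimicking a volatility spike.
train = np.array([10.0, 12.0, 14.0, 16.0, 80.0])
other = np.array([18.0])
train_scaled, other_scaled = robust_scale(train, other)
```

The spike at 80 leaves the median (14) and the interquartile range (4) unchanged, whereas it would inflate both the mean and the standard deviation used by standard scaling.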

Results
This section presents the results for the deterministic and probabilistic models as well as their calibration. We also applied machine learning techniques to forecast the VIX price, and the results are found in Table 2.
Interestingly, the Naive Forecaster approach, which basically assumes that future values will behave similarly to past values, is the best model, followed by the Exponential Smoothing (ETS) algorithm. In particular, we follow the PyCaret tutorial for time series found at Pycaret-Github; more details are found at Pycaret-Doc.

Deterministic Models
After tuning the hyperparameters for the WaveNet model, the following values are obtained: seven (7) blocks, five (5) layers per block, and 96 filters. For more specific details about the code, see geron-github and wavenet.
For the TCN model, we found one stack (nb stack) and 64 filters to use in the convolutional layer (nb filters). The same number of units (64) is fixed for the LSTM, which is the layer connected after the TCN architecture; the dilations (dilation list) are set to [1, 2, 4, 8, 16], and the kernel size is equal to 3. See more details at tcn.
For the Transformer model, the Keras documentation for time series classification is adapted in our work. In the MHA part, we found 256 units for the size of each attention head for query and key (key dim), eight (8) attention heads (num heads), and a dropout probability of 0.10, according to the Grid Search run in Keras Tuner.
In the feed-forward part, eight (8) filters (ff dim) are used in the one-dimensional convolutional layer. Moreover, we stack eight (8) of these transformer encoder blocks. Finally, for the multilayer perceptron head, 264 units and a dropout probability of 0.10 are employed. For more details, see the Keras documentation: MHA and transformer.
Table 3 reports the performance metrics for the deterministic models. Furthermore, Figure 2, Figure 3 and Figure 4 present the predictions and actual data for the WaveNet, TCN, and Transformer models, respectively. In a visual analysis, the TCN seems to be the model that best fits the test dataset. For lower VIX values, i.e. in the last part of the plot, the TCN does not adequately predict the actual data, whereas the WaveNet and Transformer do a good job. However, the WaveNet behaves better than the Transformer for higher values of the VIX, that is, at the very beginning of the graph.

Probabilistic Models
The Bayesian techniques of RT, Flipout, and MNF reviewed in Section 4 are employed in the last layer of the previous deterministic networks to obtain their respective probabilistic models. Table 4 presents the metrics for the probabilistic models; for the sake of comparison, only the test dataset results are considered in our analysis.
• For the WaveNet case, the MNF is the model with the lowest value of loss and RMSE, and similar MAE and MSLE values are obtained for MNF and Flipout, and Flipout performs the best for MAPE.RT is outperformed in most of the metrics by the other two models.
• MNF has the minimum MAE, RMSE, and MSLE values for the TCN network, whereas RT outperforms in loss and MAPE metrics, and Flipout performs the worst in most of the metrics.
• The results of the metrics for the Transformer network show that MNF has the lowest MAPE value, RT for MAE, RMSE, and MSLE, and Flipout for the loss metric.
Consequently, despite the mixed results across the different models, a good performance of the MNF model is observed in general. An important result of [11] is that neural networks are miscalibrated, which affects the forecasting performance of a model. The next section deals with this issue, the calibration problem.

Calibration
This work implements three robust neural networks (WaveNet, TCN, and Transformer) that are widely employed in the literature for many-to-many sequence tasks. After fine-tuning the hyperparameters, these networks have been trained for VIX forecasting with good results in a deterministic manner. As mentioned in the Introduction, probabilistic models are more appropriate for achieving realistic financial inferences and predictions. To this aim, we implement three models (RT, Flipout, and MNF) in the last layer of the deterministic models and calculate their respective (total) uncertainties (see Eq. 1). However, these models are miscalibrated, which affects not only the point estimates but also the uncertainty around these point predictions.
To analyze (mis)calibration, the observed proportion of data falling inside an interval and the expected proportion of data under a standard normal distribution are calculated at different percentile levels (i.e., 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90%). Then, we plot the observed proportion of data against the expected proportion of data (as in [54], Fig. 12-b), before and after the calibration. This graph resembles a modified reliability diagram for classification tasks. Miscalibration is evidenced in this plot if the observed proportions of data lie far from the diagonal of the graph. On the other hand, perfect calibration is observed when all the observed proportions of data lie on the diagonal.
If a network model is miscalibrated, a post-hoc rescaling method is followed to calibrate the model. In other words, the total uncertainty (see Eq. 1) of the miscalibrated model is multiplied by a factor $c$ that minimizes the RMSCE ([54], Eq. 19), given by
\[ \mathrm{RMSCE} = \sqrt{\mathbb{E}_{p \in [0,1]}\big[(p - \hat{p}(p))^2\big]}, \]
where $p$ is the expected proportion of data and $\hat{p}(p)$ is the observed proportion of data that lies inside the calculated interval given by the total uncertainty.
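The rescaling procedure can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the intervals are central Gaussian intervals around the predictive mean, the factor multiplies the predictive standard deviation (the square root of the total uncertainty), the expectation is approximated by the mean over the nine percentile levels, and the factor is found by a simple grid search. All function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np

LEVELS = np.linspace(0.1, 0.9, 9)  # expected proportions 10%, ..., 90%


def observed_proportions(y, mu, sigma, levels=LEVELS):
    """Share of targets inside the central Gaussian interval at each level."""
    # Half-width of a central interval with coverage p is z_p * sigma.
    z = np.array([NormalDist().inv_cdf(0.5 + p / 2.0) for p in levels])
    inside = np.abs(y - mu)[None, :] <= z[:, None] * sigma[None, :]
    return inside.mean(axis=1)


def rmsce(y, mu, sigma, levels=LEVELS):
    """Root mean squared gap between expected and observed proportions."""
    gap = levels - observed_proportions(y, mu, sigma, levels)
    return float(np.sqrt(np.mean(gap ** 2)))


def fit_scaling_factor(y, mu, sigma, grid=None):
    """Grid search for the factor c that minimizes the RMSCE of c * sigma."""
    grid = np.linspace(0.1, 3.0, 291) if grid is None else grid
    errors = [rmsce(y, mu, c * sigma) for c in grid]
    return float(grid[int(np.argmin(errors))])


# Synthetic check: the predicted spread is twice the true one.
rng = np.random.default_rng(0)
y = rng.standard_normal(20000)
mu = np.zeros_like(y)
sigma = np.full_like(y, 2.0)
c = fit_scaling_factor(y, mu, sigma)
```

In this synthetic setting the fitted factor is close to 0.5, which shrinks the overly wide intervals back toward the diagonal of the calibration diagram.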
It is worth mentioning that the closer the scaling factor is to 1, the better the model, with 1 indicating perfect calibration. The initial results of the calibration are shown in Table 5. The MNF (with standard normal prior) presents the highest values of the scaling factor and the minimum RMSCE for the three models. These results are confirmed by the calibration diagrams and prediction plots.

The Role of Priors
The most common distribution for the prior is the normal pdf, but a better posterior approximation may be obtained by varying the prior. In our study, we also tested the Cauchy and Log-uniform pdf's (see Table 6). By changing to these prior distributions in the MNF setup, better results are obtained. For the TCN with a Cauchy prior and two hidden layers of 50 units each, the scaling factor is 0.9800. For the WaveNet, a scaling factor of 0.9859 is achieved with a LogUniform prior and three hidden layers of 50 units each. Figure 23 shows the calibration diagram for the

Key Takeaways
All in all, the main results of our work are:
• It was confirmed that more robust neural networks provide good forecasting performance for the volatility index VIX in both deterministic and probabilistic setups (as for other many-to-many sequence data), but these networks are miscalibrated [11].
• MNF with a standard normal prior provides better results than RT and Flipout for the calibration procedure in our case study.
• By varying the priors with heavier-tailed distributions in the MNF model, good calibration is found for the different networks. This is in line with the outstanding works of Fortuin and his team on BNN priors, see for instance [55] and [56]. More applied work will be needed to compare the performance of uninformative priors (like the standard normal) with heavy-tailed prior distributions, and our work sheds some light on the study of different priors for BNN in the financial time series field. Improved performance is obtained by varying the prior distributions, which is a promising direction for future research in financial time series forecasting with BNN.
Other methodologies related to the models analyzed in our study can be tested, such as the Knowledge-Driven Temporal Convolutional Network (KDTCN) proposed by [57], who include background knowledge, news, and asset price information in deep prediction models to mitigate the problems of asset trend forecasting and of explaining abrupt changes. Another model is the Seq-U-Net, which [58] claim is more efficient than other convolutional setups (including TCN and WaveNet). In the same vein, Retentive Networks (RetNet), which reduce the inference cost and memory complexity issues of transformer models [59], may also be tested. Furthermore, the probabilistic view may be applied to calculate value-at-risk (VaR), which is considered a high quantile of a financial loss distribution, and to contrast the results with the approach of [60].

Fig. 1 VIX historical price. Daily VIX price taken from August 22, 2013 to July 31, 2023. A peak is observed in March 2020 due to the effect on financial markets of the Covid pandemic declaration by the World Health Organization (WHO).

7 on
Figure A4, respectively. From the serial correlation plot of the VIX time series, a long-term dependence pattern can be observed. By observing both the ACF and PACF, an AR(2) model could be identified. This is important for traditional time series modelling and for the use of structural time series (STS) modeling in TensorFlow Probability, but this will be the focus of future research.
split dataset is done in chronological order, 80% for the training set, 10% for the validation set, and 10% for the test set. Thus, we analyze 2000 observations for training, 250 for validation, and 250 for testing, respectively. Before executing any model, it is important to get a better knowledge of the statistical properties of the analyzed data. The main descriptive statistics (mean, median, standard deviation, first and third quartiles, minimum and maximum) are calculated for the volatility index. In addition, useful graphical tools such as the histogram, boxplot, and autocorrelation function (ACF) plots are also obtained. Then, robust neural network models like WaveNet, TCN, and Transformer are applied, their respective hyperparameters are fine-tuned, and their performance is compared with the usual metrics for regression tasks (MSE, MAE, MSLE, MAPE). Bayesian neural networks for each of the deterministic models are obtained by implementing three Bayesian approaches in the last layer of the deterministic model: RT, Flipout, and MNF. Finally, the observed proportion of data falling inside an interval and the expected proportion of data at different percentile levels are calculated for each Bayesian neural network, and the models are calibrated following standard deviation scaling. That is, the total uncertainty (see Eq. 1) of each model is scaled by a factor that minimizes the root mean squared calibration error (RMSCE).
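The chronological 80/10/10 split described above can be sketched as follows. Integer percentages are used to avoid floating-point rounding at the boundaries; the function name and the use of slices are illustrative assumptions.

```python
def chronological_split(n_obs, train_pct=80, valid_pct=10):
    """Return index slices for train, validation and test, without shuffling."""
    n_train = n_obs * train_pct // 100
    n_valid = n_obs * valid_pct // 100
    train = slice(0, n_train)
    valid = slice(n_train, n_train + n_valid)
    test = slice(n_train + n_valid, n_obs)
    return train, valid, test


# 2500 daily observations, as in the dataset described above.
train, valid, test = chronological_split(2500)
sizes = [s.stop - s.start for s in (train, valid, test)]
```

Keeping the order intact means that the validation and test sets always lie in the future relative to the training set, which is what a forecasting evaluation requires.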

Fig. 2 Prediction of the deterministic WaveNet model for VIX test dataset. A good fit of the model is observed except for the peaks at the beginning of the graph.

Fig. 3 Prediction of the deterministic TCN model for VIX test dataset. A good fit of the model is observed except for the low values of the VIX at the end of the graph.

Fig. 4 Prediction of the deterministic Transformer model for VIX test dataset. A good fit of the model is observed except for the peaks at the beginning of the graph.

Figures 5 and 6
Figures 11 and 12 show the calibration diagram and fit for the TCN and RT model. Moreover, Figures 13 and 14 present the calibration diagram and fit for the TCN and Flipout model. Figures 15 and 16 exhibit the calibration diagram and fit for the TCN and MNF model. On top of that, Figures 17 and 18 show the calibration diagram and fit for the Transformer and RT model. Furthermore, Figures 19 and 20 present the calibration diagram and fit for the Transformer and Flipout model. Finally, Figures 21 and 22 depict the calibration diagram and fit for the Transformer and MNF model.

Fig. 5 Calibration diagram for the WaveNet with RT model. After minimizing the RMSCE, the scaling factor is equal to 0.7373. The dashed diagonal line represents a perfect calibration.

Fig. 6 Prediction of the probabilistic WaveNet and RT model for VIX test dataset

Fig. 8 Prediction of the probabilistic WaveNet and Flipout model for VIX test dataset

Fig. 10 Prediction of the probabilistic WaveNet and MNF model for VIX test dataset

Fig. 12 Prediction of the probabilistic TCN and RT model for VIX test dataset

Fig. 14 Prediction of the probabilistic TCN and Flipout model for VIX test dataset

Fig. 16 Prediction of the probabilistic TCN and MNF model for VIX test dataset

Fig. 18 Prediction of the probabilistic Transformer and RT model for VIX test dataset

Fig. 20 Prediction of the probabilistic Transformer and Flipout model for VIX test dataset

Fig. 22 Prediction of the probabilistic Transformer and MNF model for VIX test dataset

Fig. 24 Prediction of the probabilistic WaveNet model for VIX test dataset

Fig. 26 Prediction of the probabilistic TCN with MNF model and Cauchy prior for VIX test dataset. A good point estimate is observed, with higher uncertainty for higher values of the VIX, i.e., at the beginning of the graph.

Fig. 28 Prediction of the probabilistic Transformer with MNF model and LogUniform prior for VIX test dataset. A good point estimate is observed in general for VIX values.

Fig. A1 VIX histogram. The analyzed VIX values exhibit a positively skewed distribution with a maximum of 82.7 in March 2020 as a consequence of the Covid-19 pandemic.

Fig. A2 Box and whisker plot. Outliers may be identified above the VIX value of 40 and the Interquartile Range (IQR) is 8.11.

Fig. A3 VIX Autocorrelation Function Plot. Long-term dependence behavior can be observed in the VIX values.

Fig. A4 VIX Partial Autocorrelation Function Plot. PACF measures the remaining correlation after eliminating the correlation effect in between. Together with the ACF plot, an AR(2) may be identified for the VIX time series.

Table 1
Descriptive statistics for VIX price. The table shows the usual location and dispersion measures.

Table 2
Results of Machine Learning techniques to forecast the VIX price obtained by utilizing PyCaret. The table shows the traditional metrics for regression.

Table 3
Performance metrics for deterministic models. Usual metrics for regression tasks are calculated for train set, valid set, and test set.

Table 4
Performance metrics for probabilistic models. Usual metrics for regression tasks are calculated for train set, valid set, and test set.

Table 5
Results of the initial calibration. A good model has a scaling factor close to 1 and lower values for RMSCE.