Fractional differentiation and its use in machine learning

This article covers the implementation of fractional (non-integer order) differentiation on four datasets of stock prices of major international stock indexes: WIG 20, S&P 500, DAX and Nikkei 225. The concept, proposed by Lopez de Prado [5], seeks the most appropriate balance between zero differentiation and fully differentiated time series, with the aim of making a time series stationary while preserving its memory and predictive power. In addition, this paper compares fractional and classical differentiation in terms of the effectiveness of artificial neural networks. Root mean square error (RMSE) and mean absolute error (MAE) are employed in this comparison. Our investigations lead to the conclusion that fractional differentiation plays an important role and yields more accurate predictions in the case of ANNs.


Introduction
Many supervised machine learning algorithms applied to financial time series data require stationarity [1]. In the process of creating a predictive model, it is assumed that a given time series is generated by a stochastic process. To draw accurate inferences from such a model, it is crucial that this data generation process remains consistent. In statistical terms, the mean, variance and covariance should not change with time [1]; as a result, the analyzed time series reveals no trend over time. When this assumption is not fulfilled, the algorithm may assign a wrong prediction to a new observation.
In contrast to cross-sectional data, a time series is a specific kind of data in the sense that any observation reflects the history of observations that occurred in the past. The literature describes this as a distinctive memory of the past track record [2]. Due to the stationarity condition, this memory is often excluded from the dataset.
The most commonly used method for removing non-stationarity is differencing up to some integer order [2]. Subtracting from each observation its predecessor yields the first-order difference; the second-order difference is obtained by repeating this process on the resulting series, and similarly for higher orders. Admittedly, these transformations can lead to stationarity, but as a consequence all memory of the original series is erased [2]. On the other hand, the predictive power of a machine learning algorithm is based on this memory. Lopez de Prado [3] calls this the stationarity versus memory dilemma: is there a trade-off between the two concepts, or, in other words, does a solution exist that makes the time series stationary while keeping its predictive power? One way to resolve this dilemma, fractional differentiation, was proposed by Hosking [4]. Lopez de Prado extended this idea to find the optimal balance between zero differentiation and fully differentiated time series.
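As a minimal sketch of the integer-order differencing described above, the following Python snippet computes first- and second-order differences of a small hypothetical price series (the values are illustrative, not taken from the paper's datasets):

```python
import numpy as np

# Illustrative price series (hypothetical values, not from the paper's datasets)
x = np.array([100.0, 102.0, 101.0, 105.0, 104.0])

# First-order differencing: subtract from each observation its predecessor
d1 = np.diff(x, n=1)   # x_t - x_{t-1}

# Second-order differencing: difference the differenced series once more
d2 = np.diff(x, n=2)   # equals np.diff(d1)

print(d1)  # [ 2. -1.  4. -1.]
print(d2)  # [-3.  5. -5.]
```

Note how each differencing pass shortens the series by one observation and discards the original level, which is precisely the memory loss discussed above.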
The remainder of the paper is organized as follows. The next section gives an overview of fractional differentiation. The data used in this research are presented in Sect. 3, and the method applied to these data is described in Sect. 4. Section 5 discusses the results, and Sect. 6 concludes the paper.

Fractional differentiation: an overview
In this section, the concept of fractional differentiation is elaborated in more detail. Assume that a time series X runs throughout time t and is not stationary: X = {X_t, X_{t−1}, X_{t−2}, ..., X_{t−k}, ...}. As previously explained, computing the differences between consecutive observations is an approach to achieve a stationary time series [2]:

ΔX_t = X_t − X_{t−1}.

By defining a backshift operator B as B^k X_t = X_{t−k} for k ≥ 0 and t > 1, the above formula, first-order differentiation, can be expressed as:

ΔX_t = (1 − B) X_t.

When the differenced data still do not appear to be stationary, it may be necessary to difference the data a second time to obtain a stationary series:

Δ²X_t = ΔX_t − ΔX_{t−1} = X_t − 2X_{t−1} + X_{t−2},

which using the backshift operator may be represented as:

Δ²X_t = (1 − B)² X_t.

More generally, for an order of differentiation d we have:

Δ^d X_t = (1 − B)^d X_t.

For a real number d, using the binomial formula [3]:

(1 − B)^d = Σ_{k=0}^∞ (d choose k) (−B)^k,

the series can be expanded to [3]:

(1 − B)^d = 1 − dB + d(d − 1)/2! · B² − d(d − 1)(d − 2)/3! · B³ + ...

The current value of the time series is thus a function of all past values, and to each past value X_{t−k} a weight ω_k is assigned [3]:

Δ^d X_t = Σ_{k=0}^∞ ω_k X_{t−k}.

The application of fractional differentiation to a time series makes it possible to choose the weight assigned to each corresponding value. All weights produced by the fractional derivative can be expressed as [3]:

ω = {1, −d, d(d − 1)/2!, −d(d − 1)(d − 2)/3!, ..., (−1)^k ∏_{i=0}^{k−1} (d − i)/k!, ...}.

When d is a positive integer, there is a point where k = d; beyond it the factor d − k = 0 enters every product, so that ω_k = 0 for all k > d, which leads to the conclusion that memory beyond that point is removed. In first-order differencing (d = 1), the weights follow as (see Fig. 1 as a confirmation): ω = {1, −1, 0, 0, ...}. The coefficients for various orders of differencing are depicted in Fig. 1. For example, if d = 0.25, then, since k is always an integer, all weights take values other than 0, which means that the memory is preserved. From the above derivation, an iterative formula for the weights of the lags can be deduced [3]:

ω_0 = 1,  ω_k = −ω_{k−1} (d − k + 1)/k  for k ≥ 1.

For first-order differentiation this again gives ω = {1, −1, 0, 0, ...}. In conclusion, the main intention behind using fractional differentiation is to find the minimum fraction d needed to achieve stationarity while keeping the maximum amount of memory in the analyzed time series [3].
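The iterative weight formula can be sketched in a few lines of Python; the function name is ours, but the recursion is exactly the one cited from [3]:

```python
import numpy as np

def frac_diff_weights(d, n_weights):
    """Weights of the fractional difference operator (1 - B)^d,
    via the recursion w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, n_weights):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

# For d = 1 the weights reduce to {1, -1, 0, 0, ...}: plain first differencing
print(frac_diff_weights(1, 5))     # [ 1. -1.  0.  0.  0.]

# For a non-integer order, e.g. d = 0.25, no weight is exactly zero,
# so every past observation keeps some influence (memory is preserved)
print(frac_diff_weights(0.25, 5))
```

Running it for a few values of d reproduces the behavior shown in Fig. 1: integer orders truncate the weights to zero after lag d, while fractional orders produce slowly decaying, never-vanishing weights.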

Data
This study uses four datasets with main stock indexes from different countries: WIG20 (Poland), S&P 500 (USA), DAX (Germany) and Nikkei 225 (Japan). The stock indexes were recorded from 1 June 2010 to 30 June 2020. The empirical distributions of the mentioned indexes observed in each of the datasets are given in Figs. 2, 3, 4, 5. Table 1 shows the results of the unit root tests for all the analyzed stocks, including the augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (described in Appendix). These tests are the most widely used to determine whether a given time series is stationary [5]. The null hypothesis of the ADF test assumes non-stationarity, while the null hypothesis of the KPSS test is stationarity. The results show that all the stock price series are non-stationary.

Method
This section describes the approach selected to compare the statistical properties of fractional differentiation with first-order differencing. The authors of [6] (see also references therein) note that artificial neural networks (ANN) are commonly used for forecasting stock price indexes. In general, ANNs outperform other statistical models applied to time series due to their good nonlinear approximation ability [7]. They were inspired by the way the human brain processes information. One of the most frequently implemented neural network topologies is the multilayer perceptron.
Like the human brain, the neural network consists of simple processing elements called neurons. They are connected to each other by weighted, directed edges (see Fig. 6). Commonly, neurons are aggregated into layers. A typical multilayer perceptron has three layers of neurons: an input layer, a hidden layer and an output layer. In the simplest case of an artificial neural network, the edges between layers are restricted to forward edges (feedforward artificial neural networks), meaning that every element of a layer feeds all the elements of the succeeding layer.
The goal of the neural network is to map the values of the input layer to the values of the output layer using the hidden neurons. This mapping is achieved by modifying the weights of the connections to bring the result closer to the target output. To determine the value of a neuron's output, an activation function is applied to the weighted sum of its incoming values. The most widely used activation functions are the logistic function and the hyperbolic tangent (used in our study) [6].
In the first step, the system takes the values of the neurons in the input layer and combines them according to the assigned weights; in this first iteration, all weights are randomized. Then, for each iteration, an error is calculated as the difference between the achieved value and the target output value. This divergence between estimated and expected values is measured by a loss function. The calculated loss information is then propagated backward from the output layer to all the neurons in the hidden layer that contribute directly to the output.
The learning process, in which the total loss should be minimized, uses the propagated information to adjust the weights of the connections between neurons. In this study, the search for the minimum of the loss function was performed by the gradient descent method. This technique calculates the derivative of the loss function to find the direction of descent toward a minimum [8]. In practice, the calculation begins with initial values of the loss function's parameters and iteratively adjusts them to minimize the given function. More advanced learning techniques (based on the gradient descent method) used to train neural network models include scaled conjugate gradient, one-step secant, gradient descent with adaptive learning rate and gradient descent with momentum [9].
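The update rule at the heart of the learning process above can be sketched on a toy one-parameter loss; this is a minimal illustration, not the paper's training setup (where the same rule acts on all network weights via backpropagation):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - lr * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy quadratic loss L(w) = (w - 3)^2, whose derivative is 2(w - 3)
grad = lambda w: 2.0 * (w - 3.0)
w_min = gradient_descent(grad, w0=0.0)
print(round(w_min, 4))  # converges toward the minimizer 3.0
```

The learning rate `lr` controls the step size; the momentum and adaptive-rate variants mentioned above modify how this step is computed, not the underlying idea.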
In this study, we focus on predicting a stock's closing price for the next day, {close_{t+1}}, which forms the output layer, using an input layer consisting of the prices measured the day before: {low_t, high_t, open_t, close_t}. The structure of this ANN is presented in Fig. 6.
The performance of the resulting neural network is measured on the test set according to the following metrics: the root mean square error,

RMSE = sqrt( (1/N) Σ_{i=1}^N (ŷ_i − y_i)² ),

and the mean absolute error,

MAE = (1/N) Σ_{i=1}^N |ŷ_i − y_i|,

where N denotes the number of observations, ŷ_i is the model prediction and y_i is the observed value.
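The two metrics can be written directly from their definitions; the sample values below are illustrative only:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared prediction error."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute prediction errors."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.0]
print(rmse(y_true, y_pred))  # sqrt((0 + 0.25 + 1)/3) ≈ 0.6455
print(mae(y_true, y_pred))   # (0 + 0.5 + 1)/3 = 0.5
```

RMSE penalizes large errors more heavily than MAE, which is why the paper reports both.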

Results
In this section, the fractional differentiation method is applied to the stock indexes described in the previous section. For every stock index, we compute the minimum coefficient d needed to obtain a stationary fractionally differentiated series.
To find the minimum coefficient d, a combination of the augmented Dickey-Fuller test statistic and the Pearson correlation coefficient was used. This concept is illustrated in Figs. 7, 8, 9, 10 and Table 3 (in Appendix). The ADF statistic is on the left y-axis, with the correlation between the original series and the fractionally differenced series on the right y-axis.
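The scan over candidate orders d can be sketched as follows. This is a simplified fixed-window variant on a toy random walk, not the paper's data; the stationarity side of the procedure (the ADF statistic, available e.g. in statsmodels) is deliberately omitted here, and only the memory side, the correlation with the original series, is computed:

```python
import numpy as np

def frac_diff(x, d, n_weights=20):
    """Fractionally difference x using a truncated window of weights
    (a simplified fixed-window sketch of the method, not the paper's code)."""
    w = [1.0]
    for k in range(1, n_weights):
        w.append(-w[-1] * (d - k + 1) / k)
    w = np.array(w)
    # For each t, take the dot product of the weights with the most
    # recent n_weights observations, newest first: sum_k w_k * x_{t-k}
    out = [np.dot(w, x[t - n_weights + 1:t + 1][::-1])
           for t in range(n_weights - 1, len(x))]
    return np.array(out)

# Toy random-walk series standing in for a stock index
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500)) + 100

# Scan candidate orders; in the paper the smallest d whose differenced
# series passes the ADF test is kept, and the correlation below measures
# how much memory of the original series that choice retains.
for d in [0.0, 0.25, 0.5, 1.0]:
    fd = frac_diff(x, d)
    corr = np.corrcoef(x[len(x) - len(fd):], fd)[0, 1]
    print(f"d = {d:.2f}  corr = {corr:.3f}")
```

As d grows from 0 toward 1, the correlation with the original series falls away, which is the trade-off the figures visualize.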
The original series of the WIG20 index has an ADF statistic of −2.22, while its first-differenced equivalent has a statistic of −36.02. At a 95% confidence level, the critical value of the DF t-distribution is −2.8623. This value is presented as a dotted line in Fig. 7. The ADF statistic crosses this threshold near d = 0.1. At this point, the correlation has the high value of 0.9975, which shows that the fractionally differenced series is not only stationary but also holds considerable memory of the original series.
Similarly, the ADF statistic of the S&P 500 index reaches the 95% critical value when the differencing order is approximately 0.4 (for the DAX series d ≈ 0.3 and for Nikkei 225 d ≈ 0.4), and the correlation between the original series and the new fractionally differenced series is over 99% (the same holds for DAX and Nikkei 225). Figures 11, 12, 13, 14 show the original series together with the results of applying the minimum coefficient d indicated above. The high correlation indicates that the fractionally differenced time series retains meaningful memory of the original series.
The time series obtained above are used to build multilayer perceptrons for the proposed stock indexes. To begin with, the data for all stock indexes have been normalized and divided into training and testing datasets. The first 1681 days are used for training, and the last 813 for testing. Feedforward neural networks were created using Keras, an open-source neural network library in Python. Every network was fed the low, high, opening and closing price for each day t. The output layer consists of the closing price on the next day t + 1, meaning the artificial neural network predicts the next day's closing price using historical data from the day before.
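The preprocessing step can be sketched as follows. Min-max scaling to [0, 1] is a common choice for ANN inputs and is assumed here; the paper's exact normalization formula is not reproduced. The placeholder series merely has the right total length:

```python
import numpy as np

def min_max_normalize(x):
    """Min-max scaling to [0, 1]; assumed normalization, not quoted from
    the paper."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Chronological split matching the paper's 1681 training / 813 testing days
prices = np.linspace(50.0, 150.0, 2494)   # placeholder series of 2494 days
scaled = min_max_normalize(prices)
train, test = scaled[:1681], scaled[1681:]
print(scaled.min(), scaled.max())   # 0.0 1.0
print(len(train), len(test))        # 1681 813
```

Note that the split is chronological rather than random, which respects the temporal ordering of the data and avoids look-ahead leakage into the test set.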
As observed in Table 2, the analysis shows that for all stock indexes fractional differentiation yields better RMSE and MAE statistics on the test data.
The purpose of this research is not to evaluate the predictive performance of artificial neural networks as such, but rather to evaluate how much better fractional differentiation is compared to full (integer) differentiation.

Conclusions
In this study, the concept of fractional differentiation was evaluated on four time series datasets obtained from well-known stock exchanges (from Poland, the USA, Germany and Japan). In these fractionally differenced time series, the selected orders of differencing vary from 0.12 to 0.43, far from integer differencing. For all of them, we obtained high linear correlation coefficients (above 0.99), which indicates a strong association with the original series; nonetheless, these fractional time series are stationary. Using fractional differentiation, we have made the analyzed time series stationary while keeping their memory and predictive power. This study has therefore clearly demonstrated the potential of applying fractional differentiation to time series.
The results discussed above clearly show the benefit of fractional differentiation compared to classical differentiation in terms of the performance measures applied to the created artificial neural networks. Consequently, fractional differentiation used in preliminary data analysis has broad application prospects in the machine learning area, as confirmed by the predictive performance metrics.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix

See Table 3.

Tests
We assume that a time series is generated by an AR(1) process:

y_t = a y_{t−1} + e_t,

where e_t is a stationary disturbance term. If a = 1, then the analyzed process is non-stationary. The Dickey-Fuller test takes the following model:

Δy_t = δ y_{t−1} + e_t,

where δ = a − 1, and verifies the null hypothesis that δ = 0, which means that y_t is generated by a unit-root AR(1) process. The alternative hypothesis assumes that the time series is stationary and δ < 0. The estimate of δ is then obtained by OLS regression and divided by its standard error to compute the test statistic. The test is a one-sided left-tail test. The augmented Dickey-Fuller (ADF) test adds p > 0 lags of the dependent variable Δy_t to make the model dynamically complete.
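The Dickey-Fuller regression described above can be sketched directly. This is a bare-bones version (no constant, no augmentation lags) for illustration; in practice a library implementation such as statsmodels' `adfuller`, which the paper's critical values correspond to, should be used:

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of the OLS slope in Δy_t = δ·y_{t-1} + e_t
    (no intercept, no lag augmentation) -- a sketch, not a full ADF test."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)            # Δy_t
    ylag = y[:-1]              # y_{t-1}
    delta = np.dot(ylag, dy) / np.dot(ylag, ylag)   # OLS slope estimate
    resid = dy - delta * ylag
    s2 = np.dot(resid, resid) / (len(dy) - 1)       # residual variance
    se = np.sqrt(s2 / np.dot(ylag, ylag))           # standard error of delta
    return delta / se

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=1000))   # unit root: statistic near zero
noise = rng.normal(size=1000)             # stationary: large negative statistic
print(dickey_fuller_stat(walk))
print(dickey_fuller_stat(noise))
```

For the stationary series the statistic is far below the left-tail critical value, so the unit-root null is rejected, while for the random walk it is not.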
In the KPSS test, the following model is analyzed:

y_t = β t + r_t + e_t,  r_t = r_{t−1} + u_t,

where t = 1, 2, ..., T represents a deterministic trend, r_t is a random walk process, e_t is a stationary error term and the u_t are iid(0, σ_u²). If σ_u² = 0 and the initial value r_0 is fixed, then y_t is a trend-stationary process. If β = 0, the process is stationary around its mean (r_0) rather than around a trend. For σ_u² > 0 the process is non-stationary.