Inferring the fractional nature of Wu Baleanu trajectories

We infer the parameters of fractional discrete Wu Baleanu time series by using machine learning architectures based on recurrent neural networks. Our results shed light on how clearly one can determine that a given trajectory comes from a specific fractional discrete dynamical system by estimating the fractional exponent and the growth parameter μ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu $$\end{document}. With this example, we also show how machine learning methods can be incorporated into the study of fractional dynamical systems.


Introduction
The logistic equation introduced by May [27] models the behavior of a population that grows exponentially, but some constraints of the environment limit this growth. We can express it as v(n + 1) = ηv(n) ( where v(0) ∈ [0, 1] and η ∈ R. This equation provides the simplest example of a one-parameter nonlinear dynamical system with nontrivial dynamics. It is very well-known that if 0 ≤ η ≤ 4, we have a welldefined dynamical system on [0, 1]. For η > 4, we still can have a discrete dynamical system, but this will only be defined on the complementary of a particular Cantor set in [0, 1]; see for instance [34]. In order to incorporate some difference operator that we can later extend naturally with a discrete fractional derivative, we can transform the logistic equation by applying the change of variable v(n) = η η−1 u(n). In this way, instead of having a formula for computing the term u(n + 1) by recurrence, we express that the forward Euler operator is equal to the nonlinear right term of the logistic, that is u(n) := u(n + 1) − u(n). ( So, we obtain the logistic equation of parameter μ := η − 1 with the initial condition rescaled by a factor μ+1 μ . More precisely, we have Wu and Baleanu [36] considered a fractional version of the dynamical system generated by replacing the Euler operator by the left Caputo-like discrete difference operator ν defined as follows where k −ν ( j) is defined by the generating series In the last years, this definition of the difference operator ν has proven to be very useful due to its convolutional form and for being equivalent, up to translation, with others more commonly used in the literature. See [11,12,22,23] for an overview and properties. It is worth noting that for any ν ∈ R that is not a positive integer, we have the explicit formula [3] See also the references [35,37] for recent works showing the applications of discrete fractional calculus. It is interesting to observe that with the given definition of ν , the fractional version of the logistic equation adopts a convolutional form and reads as follows This representation gives an interpretation of the fractional version of the logistic equation as the one that incorporates a memory kernel in terms of a discrete parameter, thus incorporating a different measure of the trajectories. Given an arbitrary condition u(0) ∈ [0, 1], the trajectory {u(n)} N n=1 obtained with (7) will be called a Wu Baleanu trajectory. It is worth to mention that for ν = 1 and n = 1 we have u(0) = μu(0)(1 − u(0)), recovering (2) for n = 0. Fixing an initial condition and a fractional parameter ν, we can generate Feigenbaum diagrams in order to illustrate the dynamics of this dynamical system in terms of the parameter μ, see Figs. 1 and 2.
Recently, within the frame of the Andi Challenge [30], machine learning methods, alone or combined with some statistical measures, have demonstrated their efficiency in (i) classifying anomalous diffusion noisy trajectories according to one of the previous five generative models and (ii) inferring the anomalous diffusion exponent. We refer the reader to [29] and some subsequent works in which some of these models were fully developed [1,7,9]; see also [6,8,28]. It is also worth mentioning the recent interest in incorporating machine learning methods and intelligent algorithms in studying formal mathematical problems. We can find some examples of this approach incorporating these techniques for solving nonlinear models, such as artificial neural networks, swarm optimization, and activeset algorithms [31], or neuro-swarming heuristics [32].
Therefore, we wonder if this approach lets us infer some fractional-related trajectories' characteristics. In this work, we study if we can infer the μ parameter and the fractional derivative order ν of Wu Baleanu trajectories. We have chosen an architecture based on recurrent neural networks (RNN), the same that provided the best results in inferring the exponent α of one-dimensional trajectories in the Andi Challenge [7,29], for trying to infer these parameters and measuring up to which point there is a straightforward relation any given trajectory of this type and the corresponding parameters μ and ν involved in generating it. To the best of our knowledge, this paper is the first to seek this type of approach. In Sect. 2, we give some details of the model architecture, revisiting some basic fundamentals of our machine learning models. We set the training, validation, and test data sets, as long as the results, in Sect. 3. Finally, we draw some conclusions in Sect. 4.

Architecture of the method
We propose the architecture shown in Fig. 3 to infer the parameters μ and ν of a given trajectory are intro-duced as input. Such architecture has been successfully applied for analyzing trajectories [7] and time series [24]. It is a mixture of convolutional and recurrent neural networks. It consists of three parts.
1. First, we have two convolutional layers that permit the extraction of spatial features from the trajectories. The first convolutional layer is set with 32 filters and a sliding window (kernel) of size 5, which slides through each trajectory extracting spatial features from them. The second convolutional layer has 64 filters to extract higher-level features. 2. Second, the output of the convolutional layers feeds three stacked bidirectional LSTMs layers that permit learning the sequential information. After each of these layers, we include a dropout layer of the 10% neurons to avoid over-fitting. We tested several dropout levels, from 5% to 20%, being 10% the one with the best performance. 3. Finally, we use two fully connected dense layers: the first one with 20 units and the second one with Our model will be fed with trajectories of length between 10 and 50, which are the most frequent in experiments and the hardest to be classified [29]. Let us briefly describe each part of the model:

Convolutional neural networks (CNN)
Convolutional neural networks preserve the spatial structure of data. They do so by connecting a patch (or section) from data to single neurons, so every neuron learns the properties from this single patch, whose size is defined by the kernel size (5 in our model). By doing so, spatially close portions of data are likely to be related and correlated to each other since only a small region of the input data influences the output from each neuron [4,19]. The patch is slid across the input sequence, and each time we slide it, we have a new output neuron in the following layer. This lets us consider the spatial structure inherent to the input sequence [17,18]. Through these layers, we are able to learn trajectory features by weighting the connections between the patches and the neurons so that particular features can be extracted by each patch. By using multiple filters (32 and 64 in our case) the CNN layers are extracting multiple different features (linear and nonlinear) that feed our LSTM layers.

Recurrent neural networks (RNN)
Sequential information can be decomposed in singletime steps, such as words or characters in language, notes in music, codons in DNA sequences. So, if one considers sequential data, it is very likely that the output at a later time step will depend on the inputs at prior time steps. In practice, we need to relate the information from a particular time step also with prior time steps and pass this information to future times. Recurrent neural networks (RNN) address this problem by adding an internal memory or cell state, denoted by h, which is passed from the time t to the time t + 1, that is from h t to h t+1 . This recurrent relation is capturing some notion of memory of what the sequence looks like. Therefore, the RNN output is not only a function of the input at a particular time step but also a function of the past memory of the cell state. In other words, the output y t = f (x t , h t−1 ) depends on the current input x t and the previous inputs to the RNN h t−1 , as it can be seen in Fig. 4.
An RNN adapts the internal hidden state (or memory state) h t through the result of multiplying two weight matrices W hh and W xh to the previous cell state h t−1 and the current input x t , respectively. The weight matrix W hh is modified at each time step to let the cell learn how to fit the desired output, and W xh is the weight matrix that modules the contribution of the input at each time step to the learning process. The result is passed to an activation function tanh that modifies the current state at each time step, i.e., h t = tanh W T hh h t−1 + W T xh x t . The problem with RNNs arises when dealing with long sequences since composing multiple tanh functions entails that the hidden state tends to extinguish by reaching values very close or equal to zero. In practice, this means that only recent cell states will modify the current cell state or, in other words, that RNNs have short-term memory.

Long short-term memory (LSTM)
Long short-term memory (LSTM) [14,21] amends the aforementioned short-term memory problem implicit to RNN by including gated cells that allow them to maintain long-term dependencies in the data and to track information across multiple time steps. This improves the sequential data modeling. LSTM structure is shown in Fig. 4 where σ and tanh stand for the sigmoid and the hyperbolic tangent activation functions. The circles in red represent matrix multiplication and additions. An LSTM incorporates a new cell state channel c which can be seen as a transportation band where the info is selectively updated by the new gates and is independent of the previously defined hidden state h and, therefore, independent of what is outputted in the form of hidden state or current time step out.
One LSTM cell's composition can be seen in Fig. 4  (right), and the gates are used to control the flow of information as follows: -The first sigmoid gate decides what information is kept or rid of. Since the sigmoid output ranges from 0 to 1, this can be seen as a switch that modulates how much information from the previous state has to be kept. Further details about LSTM functioning and implementation can be found in [2,13].

A general model for inferring μ and ν parameters
We have built three independent data sets to infer both μ and ν simultaneously. The train, validation, and test data sets have been built with the following parameters: μ ∈ [2, 3.2] with increments of 0.001.
ν ∈ [0.01, 1] with increments of 0.01. We visit the μ range with higher accuracy, in order to capture the chaotic dynamics that appears in some regions, see Figs When computing the trajectories, if we attain a value lower than −0.5 or greater than 2.0 in the trajectory, we stop the trajectory generation and save the trajectory as it is, provided it has a length greater than 10. The whole pool of trajectories is split into training (65%), validation (15%), and test (20%). As a result of this procedure, we get a training data set containing 618199 trajectories, a validation data set of 142822 trajectories, and a test data set with 190334 trajectories. The data sets can be found in supplementary material.
In all data sets, we pad each trajectory with 0's at its beginning to make them of a fixed length equal to 50. This permits homogenizing the lengths and feeding the first convolutional layer of our proposed architecture, see [10,Ch. 5 & Ch. 9]. We used an early stopping callback with a patience value of 20, which in practice means that the model stops training when validation mean average error (MAE) does not improve after 20 consecutive epochs. We have used a computer with 16 cores configured with 128 GB RAM and Nvidia RTX 3090 GPU with 22 GB RAM, running Ubuntu 20.10. The complete training process took less than 2 h, running up to 23 epochs.
First, we provide a description of the MAE distribution in terms of the true value to be predicted in Fig. 5.  A point (μ, μ) in the diagonal represents that one tra-jectory was generated with μ as a parameter (x coordinate), and the model infers μ as the potential value used for generating the trajectory (y-coordinate). The color bar represents the density of points in the picture. Therefore, the accumulation of spots along the diagonal represents that the model predicts very well the parameters μ and ν. On the one hand, when inferring the parameter μ, the darker region along the diagonal indicates that the best results are given for medium values of μ, between 2.24 and 2.96; on the other hand, the results behave more homogeneously when predicting ν.
We point out that the results in Fig. 5 do not shed light on the importance of the trajectory length to improve the accuracy of the predictions. It is expected that the longer the trajectories are, the lower the MAE is. In order to check it, we compare the MAE results for the shortest trajectories, lengths between 10 and 19, and the longest ones, lengths between 40 and 50, in Fig. 6. We see that for long trajectories (lengths between 40 and 50), the spots concentrate along the diagonal, which indicates that the model improves its accuracy when predicting values of long trajectories. We also appreciate that in the μ case, the model improves quite a lot when predicting values of μ higher than 2.96. In contrast, in the ν case, the long trajectories show a performance improvement for values lower than 0.5.
Here, it is not easy to appreciate at first sight; however, looking at Table 1, we can see that it really holds. We can see that, except in the last case, as we increase the trajectory length, the results improve since the parameters can be better identified. In all cases, we are in MAEs of the order of 10 −2 , that is the same order of magnitude of the ν parameter discretization, and one order less than the μ one. This justifies our choice of a thinner discretization for μ with respect to ν.
In order to look for more insightful descriptions of the MAE, for each truth value of μ and ν we have represented the quartiles Q 1 , Q 2 , and Q 3 of the MAE    However, we see here more clearly that the models are less accurate close to the extreme values of ν = 0 and 1 due to the strong connection existing between both fractional derivative values. We provide a comparison of these MAE distributions for short and long trajectories in Fig. 8.
Due to the fractional nature of equation (7), the model has a memory component that is strongly dependent on the initial condition u(0). We show boxplots of the MAE distribution in terms of u(0) in Fig. 9. We can see that initial conditions are more influential for μ predictions since boxes (green) and whiskers (gray) are higher than for ν. Despite this, in both cases, the results are slightly worse for initial conditions closer to 1, due also to the nonlinear term of the logistic terms of (7). The same conclusion can be extracted when evaluating trajectories by length, as shown in Fig. 10, where we can see that the impact is of initial conditions close to 1 leads to much higher values of MAE for short trajectories, either for boxes and for whiskers.

Conclusions
The success of machine learning and deep learning models has arrived in almost all scientific fields. The development of mathematical proofs and arguments seems to be one of the most difficult challenges. Nevertheless, some barriers have already fallen with the discovery of new multiplication algorithms [5].
Machine learning methods can also help us in modeling tasks and in the search and fitting of parameters. In this line, we have shown these methods permit us to infer the fractional nature of a given trajectory. In particular, we have seen that such a model permits us to elucidate if, given a set of trajectories, we can propose a fractional model based on the logistic equation that would represent the underlying process with reliability. Moreover, with reasonable use of resources, we can tune the model in order to estimate the parameter of the model μ and the parameter ν of the fractional discretization, within a similar order of magnitude.
We expect that this study will foster the incorporation of machine learning tools into the study of dynamical systems and the modeling of real-life problems.
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Data availability
The datasets used/analyzed during the current study are available in the supplementary material.

Conflict of interest
The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/ by/4.0/.