Limitations of the Recall Capabilities in Delay-Based Reservoir Computing Systems

We analyse the memory capacity of a delay-based reservoir computer with a Hopf normal form as nonlinearity and numerically compute the linear as well as the higher order recall capabilities. A possible physical realization could be a laser with external cavity, for which the information is fed via electrical injection. A task-independent quantification of the computational capability of the reservoir system is done via a complete orthonormal set of basis functions. Our results suggest that even for constant readout dimension the total memory capacity is dependent on the ratio between the information input period, also called the clock cycle, and the time delay in the system. Optimal performance is found for a time delay about 1.6 times the clock cycle.


Introduction
Reservoir computing is a machine learning paradigm [1] inspired by the human brain [2], which utilizes the natural computational capabilities of dynamical systems. As a subset of recurrent neural networks it was developed to predict time-dependent tasks with the advantage of a very fast training procedure. Generally the training of recurrent neural networks is connected with high computational cost resulting e.g. from connections that are correlated in time. Therefore, problems like the vanishing gradient in time arise [3]. Reservoir computing avoids this problem by training just a linear output layer, leaving the rest of the system (the reservoir) as it is. Thus, the inherent computing capabilities can be exploited. One can divide a reservoir into three distinct subsystems, the input layer, which corresponds to the projection of the input information into the system, the dynamical system itself that processes the information, and the output layer, which is a linear combination of the system's states trained to predict an often time-dependent task.
Many different realizations have been presented in the last years, ranging from a bucket of water [4] over field programmable gate arrays (FPGAs) [5] to dissociated neural cell cultures [6], being used for satellite communications [7], real-time audio processing [8,9], bit-error correction for optical data transmission [10], amplitude of chaotic laser pulse prediction [11] and cross-predicting the dynamics of an injected laser [12]. Especially opto-electronic [13,14] and optical setups [15][16][17][18][19] were frequently studied because their high speed and low energy consumption make them preferable for hardware realizations.
The interest in reservoir computing was refreshed when Appeltant et al. showed a realization with a single dynamical node under influence of feedback [20], which introduced a time-multiplexed reservoir rather than a spatially extended system. A schematic sketch is shown in Fig. 1. In general the delay architecture slows down the information processing speed but reduces complexity of the hardware. Many neuron based, electromechanical, optoelectronic and photonic realizations [21][22][23][24][25][26] showed the Fig. 1 Schematic sketch of time-multiplexed reservoir computing scheme. The input is preprocessed by multiplication with a mask that induces the time-multiplexing and is then electrically injected. The laser in our case is governed by a Hopf normal form. The output dimension of the system is in this example 4 capabilities from time series predictions [27,28] over an equalization task on nonlinearly distorted signals [29] up to fast word recognition [30]. More general analysis showed the general and task-independent computational capabilities of semiconductor lasers [31]. A broad overview is given in [32,33].
In this paper we perform a numerical analysis of the recall capabilities and the computing performance of a simple nonlinear oscillator, modelled by a Hopf normal form, with delayed feedback. We calculate the total memory capacity as well as the linear and nonlinear contributions using the method derived by Dambre et al. in [34].
The paper is structured as follows. First, we shortly explain the concept of time-multiplexed reservoir computing and give a short overview of the method used for calculating the memory capacity. After that we present our results and discuss the impact of the delay time on the performance and the different nonlinear recall contributions.

Methods
Traditionally, reservoir computing was realized by randomly connecting nodes with simple dynamics (for example the tanh-function [1]) to a network, which was then used to process information. The linear estimator of the readouts is then trained to approximate a target, e.g. predict a timedependent task. The network thus transforms the input into a high dimensional space in which the linear combination can be used to separate different inputs, i.e. to classify the given data.
In the traditional reservoir computing setup a reaction from the system s n = (s 1n , s 2n , . . . , s Mn ) ∈ R M is recorded together with the corresponding input u n and the target o n . In this case n is the index for the nth input-output training datapoint, ranging from 1 to N, and M is the dimension of the measured system states. The goal for the reservoir computing paradigm is to approximate the target o n as close as possible with linear combinations of the states s n for all input-output pairs n, meaning that M m=1 w m s mn =ô n ≈ o n for all n, where w = (w 1 , w 2 , . . . , w M ) ∈ R M are the weights to be trained. We want to find the best solution for where s ∈ R N ×R M is the state matrix defined by all system state reactions s n to their corresponding inputs u n , w are the weights to train and o ∈ R N is the vector of targets to be approximated. This is equivalent to a least square problem which is analytically solved by [35] The capability of the system to approximate the target task can be quantified by the normalized root mean square difference between the approximated answersô n and the targets o n where NRMSE is the normalized root mean square error of the target task with var(o) being the variance of the target values o = (o 1 , o 2 , . . . , o N ) and N the number of sample points. An NRMSE of 1 indicates that the system is not capable of approximating the task better than approximating the mean value, a value NRMSE = 0 indicates that it is able to compute the task perfectly. For a successful operation N M needs to be fulfilled, where M is the number of output weights w m and N is the number of training data points. This corresponds to a training data set of size N being significantly bigger than the possible output dimension M to prevent overfitting.
Appeltant et al. introduced in [20] a time-multiplexed scheme for applying the reservoir computing paradigm on a dynamical system with delayed feedback. In this case, the measured states for one input-output pair reaction s n = (s 1n , s 2n , . . . , s Mn ) are recorded at different times t m = t n + mθ, with m = 1, 2, . . . , M, where t n is the time at which the nth input u n is fed into the system. θ is describing the distance between two recorded states of the system and is called the virtual node separation time. The time between two inputs t n+1 − t n is called the clock cycle T and describes the period length in which one input u n is applied to the system. To get different reactions between two virtual nodes a time-multiplexed masking process is applied. The information fed into the system is preprocessed by multiplying a T -periodic mask g on the inputs (see sketch Fig. 1), which is a piecewise constant function consisting of M intervals, each of length θ . This corresponds to the input weights in the spatially extended system with the difference that now the input weights are distributed over time. Dambre et al. showed in [34] that the computational capability of a system can be quantified completely via a complete orthonormal set of basis functions on a sequence of inputs u n = (. . . , u n−2 , u n−1 , u n ) at time n. In this case the index indicates the input n time steps ago. The goal is to investigate how the system transforms the inputs u n . For this the chosen basis functions z(u n ), forming a Hilbert space, are constructed and used to describe every possible transformation on the inputs u n . The system's capability to approximate these basis functions is evaluated. Consider the following examples: The function z(u n ) = u n−5 is chosen as a task o n . This is a transformation of the input sequence 5 steps back. The question this task asks is, how well the system can remember the input 5 steps ago. Another case would be o n = z(u n ) = u n−5 u n−2 , asking how well it can perform the nonlinear transformation of multiplying the input 5 steps into the past with the input 2 steps into the past. A useful quantity to measure the capability of the system is the capacity defined as A value of 1 corresponds to the system being perfectly capable of approximating the transformation task and 0 corresponds to no capability at all. A simpler method, giving equal results like Eq. (4) developed by Dambre et al. in [34] to calculate C is given by where T indicates the transpose of a matrix and −1 the inverse. We use Eq. (5) to calculate the memory capacity. In this paper we use finite products of normalized Legendre polynomials P d n as a full basis of the constructed Hilbert space for each input step combination. d is the order of the used Legendre polynomial and n the nth step into the past passed as value to the Legendre polynomial. Multiplying a set of those Legendre polynomials gives the target task y {d n } , which yields (see example below for clarification) This is directly taken from [34]. It is important that the inputs to the system are uniformly distributed random numbers u n , which are independent and identically drawn in [−1, 1] to match the used normalized Legendre polynomials. To calculate the memory capacity MC d for a degree d, a summation over all possible past input sets is done where { n} is the set of past input steps, C d { n} is the capacity of the system to approximate a specific transformation task z { n} (u n ) and d is the degree of all Legendre polynomials combined in the task z { n} (u n ). In the example from above with z {−5,−2} (u n ) = u n−5 u n−2 , it is d = 2 and { n} = {−5, −2}. For d = 1 we get the well known linear memory capacity. To compute the total memory capacity, a summation over all degrees d is done.
Dambre et al. showed in [34] that the MC is limited by the readout dimension M, given here by the number of virtual nodes N V . The simulation was written in C++ with standard libraries used except for linear algebra calculations, which were calculated via the library "Armadillo". A Runge-Kutta 4th-order method was applied to integrate numerically the delay-differential equation given by Eq. (10) with an integration step t = 0.01 in time units of the system. First, the system was simulated without any inputs to let transients decay. Afterwards a buffer time was applied with 100000 inputs, that were excluded from the training process. Then, the training and testing process itself was done with 250000 inputs to have sufficient statistics. The tasks are constructed via Eq. (6) and the corresponding capacities C d { n} were calculated via Eq. (5). All possible combinations of the Legendre polynomials up to degree D = 10 and n = 1000 input steps into the past were considered. C d { n} below 0.001 were excluded because of finite statistics. To calculate the inverse, the Moore-Penrose pseudoinverse from the C++ linear algebra library "Armadillo" was used.
We characterize the performance of our nonlinear oscillator by evaluating the total memory capacity MC, the contributions MC d as well as the NRMSE of the NARMA10 task. The latter is a benchmark test and combines memory and nonlinear transformations. It is given by an iterative formula Here, A n is an iteratively given number and u n is an independent and identically drawn uniformly distributed random number in [0, 0.5]. The reservoir is driven by the random numbers u n and has to be able to predict the value of A n+1 , o = A. The reservoir we use for our analysis is a Stuart-Landau oscillator, also called Hopf normal form [36], with delayed feedback. This is a generalized model applicable for all systems operated close to a Hopf bifurcation, i.e. close to the onset of intensity oscillations. One example would be a laser operated closely above threshold [37]. A derivation from the Class B rate equations is shown in the Appendix. The equation of motion is given bẏ and was taken from [18]. Here, Z is a complex dynamical variable (in the case of a laser |Z| 2 resembles the intensity), λ is a dimensionless pump rate, η the input strength of the information fed into the system via electrical injection, g is the masking function, I is the input, ω is the frequency with which the dynamical variable Z rotates in the complex plane without feedback (in case of a laser, this is the frequency of the emitted laser light), γ the nonlinearity in the system, κ is the feedback strength, φ the feedback phase and τ the delay time. The corresponding parameters used in the simulations are found in Table 1 if not stated otherwise.

Results
To get a first impression about how the system can recall inputs from the input sequence u n = (...u n−2 , u n−1 , u n ), we show the linear recall capacities C 1 { n} in Fig. 2. Here, each set of all inputs { n} consists of only one input step n, because d = 1 for which z { n} (u n ) consists of the Legendre polynomial P 1 (u n− n ) = u n− n . The capacities C 1 { n} are plotted over the step n to be recalled for 3 different delay times τ (blue, orange and green in Fig. 2) while the input period time T is kept fixed to 80 and the readout dimensions N V to 50. These timescale parameters were chosen to fit the characteristic timescale of the system, such that the time between two virtual nodes θ is long enough for the system to react, but short enough such that the speeding process is still as high as possible. For input period times T = τ (the blue solid line in Fig. 2) a high capacity is achieved for a few recall steps after which the recallability drops steadily down to 0 at about the 15th step ( n = 15) to recall. This changes when the input period time reaches values of 3 times the delay time τ = 3T (the orange solid line in Fig. 2). Here, the linear recallability C 1 { n} oscillates between high and low values as a function of n, while its envelope steadily decreases until it reaches 0 at around the 35th ( n = 35)  (7) plotted over the nth input step to recall for 3 different delay times τ . The input period time T = 80 step to be recalled. Considering that τ = 3T is a resonance between the input period time T and the delay time τ , one can also take a look at the case for off-resonant setups, which is shown by the green solid line in Fig. 2 with τ ≈ 3.06T . This parameter choice shows a similar behaviour as the T = 3τ one but with higher capacities for short recall steps and a faster decay of the recallability at around the 29th ( n = 29) step.
To get a more complete picture, we evaluated the linear capacity C 1 { n} and quadratic capacities C 2 { n} of the system and depicted these as a heatmap over the delay time τ and the input steps in Fig. 3 for a constant input period time T . The x-axis indicates the nth step to be recalled while the delay time τ is varied from bottom to top on the y-axis. In Fig. 3a the linear capacities C 1 { n} are shown, for which the red horizontal solid lines indicate the scan from Fig. 2 One can see a continuous capacity C 1 { n} for τ < 2T which forks into rays of certain recallable steps n that linearly increase with the delay time τ . This implies that specific steps n can be remembered while others inbetween are forgotten, a crucial limitation to the performance of the system. Generally the number of steps into the past that can be remembered increases with τ (at constant T ), while on the other hand also the gaps inbetween the recallable steps increase. Thus, the total memory capacity stays constant. This will be discussed later in Fig. 4. In Fig. 3b the pure quadratic capacity C 2,p { n} is plotted within the same parameter space as in Fig. 3a. Pure means that only Legendre polynomials of degree 2, i.e. P 2 (u n− n ) = 1 2 (3u 2 n− n −1) were considered, rather than also considering combinations of two Legendre polynomials of degree 1, i.e. P 1 (u n− n 1 )P 1 (u n− n 2 ) = u n− n 1 u n− n 2 . In the graph one can see the same behaviour as for the linear capacities C 1 { n} (Fig. 3a), but with less rays and thus less steps that can be remembered from the past. This indicates that the dynamical system is not as effective in recalling inputs and additionally transforming them nonlinearly as it is in just recalling them linearly. For the full quadratic nonlinear transformation capacity, all This is shown in Fig. 3c. Again, the capacities C 2 { n} are depicted as a heatmap and the delay time τ is varied along the y-axis. This time the x-axis shows the steps of the first Legendre polynomial n 1 , while inbetween two ticks of the x-axis, the second Legendre polynomial's step n 2 is scanned from 0 up to 45 steps into the past. For the steps of Fig. 4 Total memory capacity MC as defined by Eq. (8) (blue) and memory capacities MC 1,2,3,4 of degree 1 to 4 (orange, green, red, violet) plotted over the delay time τ for the same parameters as in Fig. 3. Resonances between the clock cycle T and the delay time τ are depicted as vertical red and green dashed lines. One can see the loss in memory capacity at the resonances, especially for degree 2. Higher order transformations with d > 3 are more effective in the regime where τ < 1.5 T the second Legendre polynomial n 2 the capacity exhibits the same behaviour as already discussed for Fig. 3a and b. This does also apply to the first Legendre polynomial which induces interference patterns in the capacity space of the two combined Legendre polynomials. The red dashed lines highlight the ray behaviour of the first Legendre polynomial. We therefore learn that the performance of a reservoir computer described by a Hopf normal form with delay drastically depends on the task. There are certain nonlinear transformation combinations u − n 1 u − n 2 of the inputs u − n 1 and u − n 2 which cannot be approximated due to the missing memory at specific steps. To overcome these limitations it would be recommended to use multiple systems with different parameters to compensate for each other.
To fully characterize the computational capabilities of our reservoir computer, a full analysis of the degree d memory capacities MC d and the total memory capacity MC as defined in Eq. (8) is done. The results are depicted in Fig. 4a as a function of the delay time τ . All other parameters are fixed as in Fig. 3. The orange solid line in Fig. 4 refers to the linear, the green, red and violet lines to the quadratic, cubic and quartic memory capacity MC 1,2,3,4 , respectively. The blue solid line shows the total memory capacity MC summed up over all degrees up to 10. Dambre et al. showed in [34] that the MC is limited by the number of read-out dimensions and equals it when all read-out dimensions are linearly independent. In our case the read-out dimension is given by the number of virtual nodes N V = 50. Nevertheless, the total memory capacity MC starts at around 15 for a very short time delay with respect to the input period time T . This low value arises from the fact that a short delay induces a high correlation between the responses of the dynamical system which induces highly linearly dependent virtual nodes. This is an important general result that has to be kept in mind for all delay-based reservoir computing systems: With τ < 1.5 T the capability of the reservoir computer is partially waisted. Increasing the delay time τ also increases the total memory capacity MC reaching the upper bound of 50 at around 1.5 times the input period time T .
For τ > 1.5 T an interesting behaviour emerges. Depicted by the vertical red dashed lines are multiples of the input period time T at which the total memory capacity MC drops again significantly to around 40. A drop in the linear memory capacity was discussed in the paper by Stelzer et al. [38] and explained by the fact that resonances between the delay time τ and the input period time T concludes in a sparse connection between the virtual nodes.
Our results now show that this effects the total memory capacity MC, by mainly reducing the quadratic memory capacity MC 2 . At the resonances the quadratic nonlinear transformation capability of the system is reduced. To conclude, delay-based reservoir computing systems should be kept off the resonances between T and τ to maximize the computational capability. A surprising result is that for the chosen Hopf nonlinearity the linear memory capacity MC 1 is only slightly influenced by the resonances. A result from Dambre et al. in [34] and analysed by Inubushi et al. in [39] showed that a trade-off between the linear recallability and the nonlinear transformation capability exists. This is clearly only the case if the theoretical limit of the total memory capacity MC is reached and kept constant, thus every change in the linear memory capacity MC 1 has to induce a change in the nonlinear memory capacities MC d , d > 1. In the case of resonances, a decrease in the total memory capacity MC happens and thus this loss can be distributed in any possible way over the different memory capacities MC d . In our case, we see that the influence on the quadratic memory capacity MC 2 is highest.
The system is capable of a small amount of cubic transformations, depicted by the solid red line in Fig. 4a, which also decreases at the resonances in a similar way as the quadratic contribution does. Higher order memory capacities MC d , with d > 3, have only small contributions for short delay times τ , dropping to 0 for increased time delay τ . A possible explanation is the fact that short delays induce an interaction of the last input directly with itself for k = T τ times, depending on the ratio between τ and T . As a result, short delay times τ enable highly nonlinear tasks in expense of a lower total memory capacity MC.
For more insights into the computing capabilities of our nonlinear oscillator we now also discuss the NARMA10 time series prediction task, shown in Fig. 4b. Comparing the memory capacities MC d with the NARMA10 computation error NRMSE in Fig. 4b, a small increase in the NARMA10 NRMSE can be seen at the resonances with nτ = mT , where n ∈ [0, 1, 2...] and m ∈ [0, 1, 2...]. For a systematic characterization a scan of the input period time T and the delay time τ was done and the total memory capacity MC (Fig. 5c), the memory capacities of degrees 1-4 MC 1,2,3,4 ( Fig. 5a, b, d, e) and the NARMA10 NRMSE (Fig. 5f) were plotted colorcoded over the two timescales. This is an extension of the results of Röhm et al. in [19], where only the linear memory capacity and the NARMA10 computation error were analysed. For short time delays τ and period input times T the memory capacities of degree 1-3 MC 1,2,3 and the total memory capacity MC are significantly below the theoretical limit of 50 as already seen in the results from Fig. 4, while the NARMA10 NRMSE also has high errors of around 0.8. This comes from the fact that short input period times T also mean short virtual node distances θ, which induces a high linear correlation between the read-out dimensions. Degree 4 on the other hand only has values as long as T > τ, a result coming from the fact that the input u n has to interact with itself to get a transformation of degree 4. A possible explanation comes from the fact that the dynamical system itself is not capable of transformations higher than degree 3, since the highest order in Eq. (10) is 3. If the delay time τ and the input period time T are long enough the total memory capacity MC reaches 50 with exceptions of resonances between τ and T . These resonances are also seen in the NARMA10 NRMSE for which higher errors occur. Looking at the memory capacity of degree 1 and 2 MC 1,2 and comparing it with the NARMA10 NRMSE one can see a tendency in which the NARMA10 NRMSE is lowest where both have the highest capacities, raising from the fact that the NARMA10 task is highly dependent on linear memory and quadratic nonlinear transformations. This can also be seen in the area below the τ = T -resonance. To conclude, one can use the parameter dependencies of the memory capacities MC d to make predictions of the reservoir capability to approximate certain tasks.

Conclusions
We analysed the memory capacities and nonlinear transformation capabilities of a reservoir computer consisting of an oscillatory system with delayed feedback operated close to a Hopf bifurcation, i.e. a paradigmatic model also applicable for lasers close to threshold. We systematically varied the timescales and found regions of high and low reservoir computing performing abilities. Resonances between the information input period time T and the delay time τ should be avoided to fully utilize the natural computational capability of the nonlinear oscillator. A ratio of τ = 1.6T was found to be the optimal for the computed memory capacities, resulting in a good NARMA10 task approximation. Furthermore, it was shown, that the recallability for high delay times τ T is restricted to specific past inputs, which rules out certain tasks. By computing the memory capacities of a Hopf normal form, one can make general assumptions about the reservoir computing capabilities of any system operated close to a Hopf bifurcation. This significantly helps in understanding and predicting the task dependence of reservoir computers.
Derivation of the Stuart-Landau Equation with delay from the Class B laser rate equationṡ where E is the non-dimensionalized complex eletrical field and N the non-dimensionalized carrier inversion, P the pump relativ to the threshold for P thresh = 0 and α the Henry factor. The reservoir computing signal is fed into the system via electrical injection ηgI . If fast carriers are considered, an adiabatic elimination of the charge carriers yields which after substituting into Eq. (11) giveṡ where we introduced the quantityP = P + ηgI for convenience purposes. This equation yields the full Class A rate equation for the non-dimensionalized complex electric field. Simulations with the full Class A rate equation close to the threshold show similar results to the reduced case.
Because we consider laser that are operated close to the threshold level, a taylor expansion of the denominator for |E| 2 ≈ 0 is done, where we set |E| 4 ≈ 0 as a neglectable term. As we consider a laser operated close to the threshold, it follows that the pumpP and the intensity |E| 2 are of Order O( ), where is a small factor. This holds true only if the input signal ηgI is a small electrical injection. After applying this the equation is given bẏ We can substituteP = P + ηgI back into the equation, change the rotating frame of the laser by setting E = Ze −i(ω−αP )t and introduce a complex factor γ = −(1+iα) that scales the nonlinearitẏ By addding feedback κe φ Z(t − τ ) to the system one arrives at Eq. (10).