1 Introduction

Recurrent neural networks (RNN) specialize in processing sequential data, such as natural speech or text, and, more broadly, multivariate time-series data [1,2,3]. With such omnipresent data, RNN address a wide range of prediction, generation, and translation tasks in applications including stock prices, markers of human body joints, music composition, spoken language, and sign language [4,5,6,7,8,9,10,11]. While RNN are ubiquitous, these networks cannot be easily explained in terms of the architectures they assume, the parameters they incorporate, and the learning process they undergo. As a result, it is not straightforward to associate an optimally performing RNN with a particular dataset and task. The difficulty stems from the characteristics that make RNN intricate dynamical systems. In particular, RNN can be classified as (i) nonlinear, (ii) high-dimensional, and (iii) non-autonomous dynamical systems with (iv) varying parameters, which are either global (hyperparameters) or trainable weights (connectivity).

Architectural variants of RNN that are more robust and accurate than unstructured RNN have been introduced, ranging from the long-established gated RNN, such as long short-term memory (LSTM) [12] and gated recurrent units (GRU) [13, 14], to more recent networks, such as anti-symmetric RNN (ASRNN) [15], orthogonal RNN (ORNN) [16], coupled oscillatory RNN (CoRNN) [17], Lipschitz RNN [18], and more. While such specific architectures are robust, they represent singular points in the space of RNN. In addition, the accuracy of these architectures still depends on hyperparameters, among which are the initial state of the network, the initialization of the connectivity weights, optimization settings, architectural parameters, and input statistics. Each of these factors has the potential to impact network learning and, as a consequence, task performance. Indeed, by fixing the task and randomly sampling hyperparameters and inputs, the accuracy of otherwise identical networks can vary significantly, leading to potentially unpredictable behavior.

A powerful method for characterizing the predictability of dynamical systems is the computation of Lyapunov exponents (LE) [19, 20]. Recent work identifying RNN as dynamical systems has extended LE calculation and analysis to these systems [21], but the connection between LE and network performance has not been explored extensively. Examples in [22] suggest that features of the LE spectrum are correlated with RNN robustness and accuracy. However, standard features such as the maximum and mean LE can have weak correlations, making consistent characterization of network quality using a fixed set of LE features infeasible.

Such inconsistency motivates this work, where we develop a data-driven methodology, called AeLLE, to infer LE spectrum features and associate them with RNN performance. The methodology implements an Autoencoder (Ae) which learns, through its Latent units, a representation of the LE spectrum (LLE) and correlates the spectrum with the accuracy of RNN on a particular task (see Fig. 1A). The Latent representation appears to be low-dimensional, such that even a simple linear embedding of the representation, denoted AeLLE, provides a classifier for selecting optimally performing RNN based on the LE spectrum (see Fig. 1B). We show that once AeLLE is trained, it generalizes to novel inputs, and we also investigate how much RNN training is required before AeLLE classification becomes accurate.

Fig. 1

AeLLE: LE spectrum Autoencoder and Latent representation embedding. (A) The Autoencoder takes Lyapunov exponents as input. This input is embedded into a Latent space (purple) by the encoder (blue). From this Latent space, the Autoencoder predicts the accuracy of the corresponding network, which is compared to the true accuracy of the network to obtain the prediction loss. Simultaneously, the Lyapunov spectrum is reconstructed from this Latent space by the decoder (red). The reconstructed Lyapunov exponents are then compared to the input Lyapunov exponents to obtain the reconstruction loss. The two losses are added together with weighting factor \(\alpha\) to form the total loss. (B) The Latent space of the LE spectrum Autoencoder correlates LE spectrum and accuracy. Embedding of the Latent space representation provides a low-dimensional clustering and classification space, leading to separation between high-accuracy (green) and low-accuracy (red) networks. We find that the Latent space clusters the LE spectra well, such that a simple embedding (PCA) and simple linear classifiers (hyperplane, hyperellipse, or threshold) classify RNN variants according to accuracy. Shown is an example linear classifier along the first PC dimension for networks trained on the CharRNN task (see Sect. 4 for more details)

The significance of the proposed AeLLE method is that it is a novel LE embedding that effectively facilitates interpretation and classification for a variety of RNN models. For example, we show that AeLLE separates high- and low-accuracy networks across varying weight initialization hyperparameters, network sizes, and network architectures with no knowledge about the underlying network besides its Lyapunov spectrum. Notably, such separation is not observed when AeLLE is omitted and the Lyapunov spectrum is used directly in conjunction with standard embeddings. Indeed, our results indicate that the AeLLE representation is a necessary step beyond computing the LE spectrum. While it requires additional computational power and time for training the Autoencoder, this step is able to identify subtle features of the Lyapunov spectrum. As a result, AeLLE implicitly identifies the network dynamics that are integral to performance across a wide variety of network hyperparameters. Furthermore, we show that AeLLE can be used to predict final network accuracy early in training. This predictability suggests that during training, AeLLE recovers Lyapunov spectrum features that are instrumental for trainability. These features are not evident in standard LE statistics, as discussed in [22,23,24].

2 Related work

2.1 Spectral analysis and model quality

Spectral methods have been used to characterize information propagation properties and, thus, the model quality of RNNs. Since vanishing and exploding gradients arise from long products of Jacobians of the hidden-state dynamics, whose norms can exponentially grow or decay, much effort has been made to mathematically describe the link between model parameters and the eigen- and singular-value spectra of these long products [25,26,27,28]. For architectures used in practice, these approaches appear to have limited scope [29, 30]. This is likely because the spectra have non-trivial properties reflecting intricate long-time dependencies within the trajectory and because the dynamic nature of RNN systems must be taken into account, which motivates our use of Lyapunov exponents for such characterization in this work.

Other techniques for inferring model quality include performing spectral analysis of deep neural network weights and fitting their distribution to truncated power laws to correlate with performance [31]. Another approach uses Koopman operators to linearize RNNs and demonstrates that linear representations of RNNs capture the dominant modes of the networks and are able to achieve performance comparable to their nonlinear counterparts [32].

2.2 Dynamical systems approaches to RNNs

Identifying RNN as dynamical systems and developing appropriate analyses appears to be a promising direction. Recently, dynamical systems methodology has been applied to introduce constraints into RNN architectures to achieve better robustness, such as orthogonal (unitary) RNN [15, 16, 33,34,35,36,37] and additional architectures such as coupled oscillatory RNN [17] and Lipschitz RNN [18]. These approaches set network weights to form dynamical systems that have the desired Jacobians for long-term information propagation. In addition, analyses such as stochastic treatment of the training procedure have been shown to stabilize various RNN [38]. Furthermore, a universality analysis of fixed points of different RNN, proposed in [39], hints that RNN architectures could be organized into similarity classes such that, despite having different architectural properties, they would exhibit similar dynamics when optimally trained. In light of this analysis, it remains unclear to which similarity classes in the space of RNN the constrained architectures belong and what the distribution of architectures within each class is. These unknowns warrant the development of dynamical systems tools that characterize and classify RNN variants.

2.3 Lyapunov spectrum

Lyapunov exponents [19, 20] capture the information generation by a system’s dynamics through measurement of the separation rate of infinitesimally close trajectories. The number of Lyapunov exponents of a system is equal to the dimension of that system, and the collection of all LE is called the LE spectrum. The maximal LE determines the linear stability of the system [40]. Further, a system with a maximal LE greater than zero exhibits chaotic dynamics, with the magnitude of the first exponent indicating the degree of chaos; as the magnitude decreases toward zero, the degree of chaos decreases. The dynamics converge toward a fixed-point attractor when all exponents are negative. Zero exponents represent limit cycles or quasiperiodic orbits [41, 42]. Additional, less direct features of LE also correspond to properties of the dynamical system. For example, the mean exponent determines the rate of contraction of full volume elements and is related to the Kolmogorov–Sinai (KS) entropy [43]. The LE variance measures heterogeneity in stability across different directions and can reflect the conditioning of the product of many Jacobians.

When LE are computed from the observed evolution of the system, the Oseledets theorem guarantees that LE characterize each ergodic component of the system, i.e., when a sufficiently long evolution of a trajectory in an ergodic component is sampled, the computed LE spectrum is guaranteed to be the same (see Methods section for details of the computation) [44, 45]. Efficient approaches and algorithms have been developed for computing the LE spectrum [46]. These have been applied to various dynamical systems, including the hidden states of RNN and variants such as LSTM and GRU [21]. This approach relies on the theory of random dynamical systems, which establishes the LE spectrum even for a system driven by a noisy random input sequence sampled from a stationary distribution [47]. It has been demonstrated that some features of LE spectra can have meaningful correlations with the performance of the corresponding networks on a given task. However, the selection of relevant features is task-dependent and challenging to determine a priori [22].

3 Methods

The proposed AeLLE methodology consists of three steps: (1) computation of LE spectrum, (2) Autoencoder for LE spectrum, and (3) embedding of Autoencoder Latent representation.

3.1 Computation of LE [21, 22]

We compute LE by adopting the well-established algorithm [48, 49] and follow the implementation in [21, 22]. For a particular task, each batch of input sequences is sampled from a set of fixed-length sequences of the same distribution. We choose this set to be the validation set. For each input sequence in a batch, a matrix \({{\textbf {Q}}}_0\) is initialized as the identity to represent an orthogonal set of nearby initial states whose evolution will be tracked in the sequence of matrices \({{\textbf {Q}}}_t\). The hidden states \(h_t\) are initialized as zeros.

To track the expansion and contraction of the vectors of \({{\textbf {Q}}}_t\), the Jacobian of the hidden states at step t, \({{\textbf {J}}}_{t}\), is calculated and then applied to the vectors of \({{\textbf {Q}}}_t\). The Jacobian \({{\textbf {J}}}_t\) can be found by taking the partial derivatives of the RNN hidden states at time t, \(h_t\), with respect to the hidden states at time \(t-1\), \(h_{t-1}\)

$$\begin{aligned} \left[ {{\textbf {J}}}_{t}\right] _{ij} = \frac{\partial {{\textbf {h}}}_{t}^j}{\partial {{\textbf {h}}}_{t-1}^i}. \end{aligned}$$
(1)

Beyond the hidden states, the Jacobian will depend on the input \(x_t\) to the network. This dependence allows us to capture the dynamics of a network as it responds to input. The expansion factor of each vector is calculated by finding the corresponding R-matrix that results from updating Q when computing the QR decomposition at each time step

$$\begin{aligned} {{\textbf {Q}}}_{t+1}, {{\textbf {R}}}_{t+1} = QR({{\textbf {J}}}_{t}{{\textbf {Q}}}_{t}). \end{aligned}$$
(2)

If \(r_t^i\) is the expansion factor of the \(i^{th}\) vector at time step t—corresponding to the \(i^{th}\) diagonal element of R in the QR decomposition—then the \(i^{th}\) LE \(\lambda _i\) resulting from an input signal of length T is given by

$$\begin{aligned} \lambda _i = \frac{1}{T}\sum _{t=1}^T {\log }(r_t^i) \end{aligned}$$
(3)

The LE resulting from each input \(x^m\) in the batch of input sequences are calculated in parallel and then averaged. For each experiment, the LE are calculated over a fixed number of time steps with n input sequences. The mean of the n resulting LE spectra is reported as the LE spectrum. To normalize the spectra across different network sizes and, consequently, different numbers of LE in the spectrum, we interpolate each spectrum to match the shape of the largest network size. Through this interpolation, we can represent the LE spectra as curves with the same number of points for small and large networks.

The expressions for the Jacobians \({\textbf{J}}_t\) used in these calculations can be found in the Supplemental Materials.
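For concreteness, the following is a minimal sketch of this QR-based procedure (Eqs. 1–3) for a vanilla RNN cell, written in PyTorch. The tanh cell, its Jacobian, and the single-sequence interface are illustrative assumptions; the experiments in this paper use the Jacobians of LSTM, GRU, and the other architectures (see Supplemental Materials) and average the spectra over a batch of input sequences.

```python
# Minimal sketch of the QR-based LE computation (Eqs. 1-3) for a vanilla
# RNN cell h_t = tanh(W h_{t-1} + U x_t + b); gated architectures would use
# the Jacobians given in the Supplemental Materials instead.
import torch

def lyapunov_spectrum(W, U, b, inputs):
    """inputs: (T, input_dim) sequence; returns an LE spectrum of size hidden_dim."""
    n = W.shape[0]
    h = torch.zeros(n)            # hidden state initialized to zeros
    Q = torch.eye(n)              # orthogonal perturbation basis Q_0
    log_r = torch.zeros(n)        # running sums of log expansion factors
    T = inputs.shape[0]
    for t in range(T):
        h = torch.tanh(W @ h + U @ inputs[t] + b)
        # Jacobian of h_t w.r.t. h_{t-1} for the tanh cell: diag(1 - h_t^2) W
        J = torch.diag(1.0 - h**2) @ W
        # Track expansion/contraction of the basis vectors (Eq. 2)
        Q, R = torch.linalg.qr(J @ Q)
        log_r += torch.log(torch.abs(torch.diag(R)))
    return (log_r / T).sort(descending=True).values   # Eq. 3, sorted spectrum
```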

3.2 Autoencoder for LE spectrum

An Autoencoder consists of two components: an encoder network \(\mathcal {\phi }\), which transforms the input into a representation in the Latent layer, and a decoder network \(\mathcal {\psi }\), which transforms the Latent representation into a reconstruction of the original input.

Over the course of training, the Latent layer becomes representative of the variance in the input data and extracts key features that might not immediately be apparent in the input.

In addition to the reconstruction task, it is possible to include constraints on the optimization by formulating a loss function for the Latent layer values (Latent space), e.g., a classification or prediction criterion. This can constrain the organization of values in the Latent space [50, 51].

We propose an adapted Autoencoder (Ae) methodology for correlating LE spectra and RNN task accuracy. In this setup, we consider the LE spectrum as the input Z. Our Autoencoder consists of a fully-connected encoder network \(\mathcal {\phi }\), a fully-connected decoder network \(\mathcal {\psi }\), and an additional linear prediction network \(\xi\) defined by

$$\begin{aligned} \begin{aligned} {\hat{Z}} = (\psi \circ \phi )Z, \\ {\hat{T}} = (\xi \circ \phi )Z, \end{aligned} \end{aligned}$$
(4)

where \(\hat{{{\textbf {Z}}}}\) and \(\hat{{{\textbf {T}}}}\) correspond to the output of the decoder and the predicted accuracy, respectively, with loss \(L = \Vert Z - {\hat{Z}} \Vert ^2 + \alpha \cdot \Vert T - {\hat{T}}\Vert _l\). Ae performs the reconstruction task, i.e., optimization of the first term of Eq. 5, the mean-squared reconstruction error of the LE spectrum, as well as prediction of the associated RNN accuracy T (best validation loss), the second term of Eq. 5.

$$\begin{aligned} \phi , \psi , \xi = \mathop {\mathrm {arg\,min}}\limits _{\phi , \psi , \xi } (\Vert Z - {\hat{Z}} \Vert ^2 + \alpha \cdot \Vert T - {\hat{T}}\Vert _l ). \end{aligned}$$
(5)

The parameter l can be defined based on the desired behavior. The most common choices are \(l=1\), indicating the 1-norm, and \(l=2\), indicating the 2-norm.

During the training of Ae, the weight \(\alpha\) of the prediction loss gradually increases so that Ae emphasizes RNN error prediction once the reconstruction error has converged. We found that this approach allows Ae to capture features of both RNN dynamics and accuracy. If \(\alpha\) is too small, the reconstruction loss dominates, and the correlation between LE spectrum and RNN accuracy is not captured. Conversely, when \(\alpha\) is initially set to a large value, the reconstruction, along with the prediction, diverges. The convergence of Ae for different RNN variants, as we demonstrate in the Results section, shows that correlative features between LE spectrum and RNN accuracy can be inferred. The dependency of Ae convergence on a delicate balance of the two losses reconfirms that these features are entangled and, thus, that the Ae embedding is needed. We describe the settings of \(\alpha\) and additional Ae implementation details in the Supplementary Materials.
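The following is a minimal PyTorch sketch of the Autoencoder and combined loss of Eqs. 4–5. The layer widths, Latent dimension, linear ramp for \(\alpha\), and dummy data are illustrative assumptions; the settings used in this work are given in the Supplementary Materials.

```python
import torch
import torch.nn as nn

class LEAutoencoder(nn.Module):
    """Sketch of the LE-spectrum Autoencoder of Eq. 4; sizes are illustrative."""
    def __init__(self, n_le=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_le, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))        # phi
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_le))              # psi
        self.predictor = nn.Linear(latent_dim, 1)                      # xi

    def forward(self, Z):
        latent = self.encoder(Z)                    # phi(Z)
        Z_hat = self.decoder(latent)                # (psi o phi) Z
        T_hat = self.predictor(latent).squeeze(-1)  # (xi o phi) Z
        return Z_hat, T_hat, latent

def total_loss(Z, Z_hat, T, T_hat, alpha, l=1):
    """Reconstruction plus alpha-weighted accuracy-prediction loss (Eq. 5)."""
    recon = ((Z - Z_hat) ** 2).sum(dim=-1).mean()
    pred = ((T - T_hat).abs() ** l).mean()
    return recon + alpha * pred

# Dummy data standing in for LE spectra and the associated RNN accuracies.
spectra = torch.randn(256, 64)
accuracy = torch.rand(256)

model = LEAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    alpha = min(1.0, epoch / 100.0)   # ramp alpha up once reconstruction settles
    Z_hat, T_hat, _ = model(spectra)
    loss = total_loss(spectra, Z_hat, accuracy, T_hat, alpha)
    opt.zero_grad()
    loss.backward()
    opt.step()
```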

3.3 Embedding of autoencoder latent representation

When the loss function of Ae converges, it indicates that the Latent space captures the correlation between the LE spectrum and RNN accuracy. However, an additional step is typically required to achieve an organization of the Latent representation based on the accuracy of RNN variants. For this purpose, a low-dimensional embedding of the Latent representation, denoted AeLLE, needs to be implemented. An effective embedding would indicate the number of dominant features needed for the organization, provide a classification space for the LE spectrum features, and connect them with RNN parameters. We propose to apply a principal component analysis (PCA) embedding to the Latent representation [52, 53]. The embedding consists of performing principal component analysis and projecting the representation onto the first few principal component directions (e.g., 2 or 3). While other nonlinear embeddings are possible, e.g., tSNE or UMAP [54, 55], the simple linear projection onto the first two principal components of the Latent space results in an effective organization. This indicates that the Latent representation has successfully captured the characterizing features of performance. In the Results section, we show that the PCA embedding is sufficient to provide an effective space for all examples of RNN architectures and tasks that we considered. In particular, in this space, the more accurate RNN variants (green) can be separated from other variants (red) through a simple clustering procedure.
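A sketch of this embedding and of the simple PC1 threshold classifier used in the Results is given below. The scikit-learn PCA, the use of the median as the threshold, and the choice of which side of the boundary denotes high accuracy are assumptions that would be checked on a validation set.

```python
import numpy as np
from sklearn.decomposition import PCA

def aelle_embedding(latent, val_loss):
    """latent: (n_networks, latent_dim) encoder outputs phi(Z);
    val_loss: (n_networks,) best validation loss of each RNN variant."""
    coords = PCA(n_components=2).fit_transform(latent)  # AeLLE 2D projection
    threshold = np.median(coords[:, 0])                 # classifier along PC1
    pred_high_acc = coords[:, 0] < threshold            # side is task-dependent
    true_high_acc = val_loss < np.median(val_loss)      # e.g., median-loss split
    return coords, pred_high_acc, true_high_acc
```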

4 Results

To investigate the applications and generality of our proposed method, we consider tasks with various inputs and outputs and various RNN architectures that have been demonstrated as effective models for these tasks. In particular, we choose three tasks: signal reconstruction, character prediction, and sequential MNIST. All three tasks involve learning temporal relations in the data with different forms of the input and objectives of the task. Specifically, the inputs range from low-dimensional signals to categorical character streams to pixel grayscale values. Nonetheless, across this wide variety of inputs and tasks, AeLLE space and clustering can consistently separate variants of hyperparameters according to accuracy in a more informative way than network hyperparameters alone.

More specifically, we consider (i) the Signal Reconstruction task, also known as target learning. In this task, a random RNN is trained to generate a target output signal from a random input [30, 56]. This task involves intricate time-dependent signals and a generic RNN whose dynamics are chaotic in the absence of training. With this example, we demonstrate that our method is able to distinguish between networks of high and low accuracy across initialization parameters.

(ii) Character Prediction is a common task that takes as an input a sequence of textual characters and outputs the next character of the sequence. This task is rather simple and is used to benchmark various RNN variants. With this task, we demonstrate that our method is able to distinguish across network sizes, in addition to initialization parameters.

(iii) Sequential MNIST is a more extensive benchmark for RNN classification accuracy. The task input is an image of a handwritten digit unrolled into a sequence of numerical values (pixels’ grayscale values), and the output is a corresponding label of the digit. We investigate the accuracy of various RNN variants on row-wise SMNIST, demonstrating that our method distinguishes according to performance across network architectures. We describe the outcomes of AeLLE application and the resulting insights per each task below.

For each task, we present the low-dimensional projection of AeLLE using the first two principal components. Furthermore, we reduce the projection to a single dimension by showing the distribution of the AeLLE in the first principal component as stacked histograms of the high- and low-accuracy networks or of the different hyperparameters. In these stacked histograms, each distribution is represented separately and drawn on top of the others, so that bar heights are not cumulative: the ordering of the colors is always the same, and the height of each bar indicates the number of networks in that bin, regardless of whether it is above or below the other colors.

4.1 Signal reconstruction via target learning with random RNN

To examine how AeLLE interprets a generic RNN with time-evolving signals as input and output, we test Rank-1 RNN. Such a model corresponds to training a single rank of the connectivity matrix, the output weights W, on the target learning task. We set the target signal (output) to be a four-sine wave, a benchmark used in [56]. A key parameter in Rank-1 RNN is the amplification factor of the connectivity, g, which controls the output signal in the absence of training. For \(g \le 1\), the output signal is zero, while for \(g \ge 1.8\), the output signal is strongly chaotic. The output signal is weakly chaotic in the interval \(1< g < 1.8\). Previous work has shown that the network can generate the target signal when it is in the weakly chaotic regime, i.e., \(1<g<1.8\), and trained with the FORCE optimization algorithm. Moreover, it has been shown that this training setup converges most consistently when the amplification factor g is in the interval \(1.2<g<1.5\) [56, 57].
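To make the role of g concrete, here is a minimal sketch of the free-running (untrained) rank-1 network dynamics in the spirit of [56]. The network size, time constant, Euler step, and weight scalings are illustrative assumptions rather than the exact setup used in our experiments.

```python
import numpy as np

def free_running_output(g, n=500, T=2000, dt=0.1, tau=1.0, seed=0):
    """Readout z(t) of an untrained random network with amplification factor g."""
    rng = np.random.default_rng(seed)
    J = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))   # fixed random connectivity
    w_out = rng.normal(0.0, 1.0 / np.sqrt(n), n)    # rank-1 readout (trained by FORCE)
    w_fb = rng.uniform(-1.0, 1.0, n)                # feedback of the readout
    x = rng.normal(0.0, 0.5, n)
    z_trace = np.zeros(T)
    for t in range(T):
        r = np.tanh(x)
        z = w_out @ r
        x += dt / tau * (-x + g * (J @ r) + w_fb * z)  # Euler step of the dynamics
        z_trace[t] = z
    return z_trace   # quiescent for g <= 1, increasingly chaotic as g grows
```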

However, not all samples of the random connectivity correspond to accurate target generation. Even for g values in the weakly chaotic interval, some Rank-1 RNN variants fail to follow the target. Thus, the target learning task, Rank-1 RNN architecture, and FORCE optimization are excellent candidates to test whether AeLLE can organize the variants of Rank-1 RNN models according to accuracy.

The candidate hyperparameters for variation are (1) the samples of the fixed connectivity weights (from a normal distribution) and (2) the parameter g within the weakly chaotic regime. We structure the benchmark set to include 1200 hyperparameter variants and compute the LE spectrum for each of them. After training is complete, LE in the validation set are projected onto the Autoencoder’s AeLLE space, depicted in Fig. 2, where each sample is a dot in the \(A_{PC_1}-A_{PC_2}\) plane.

Fig. 2

Clustering by performance across initialization parameter: AeLLE for networks trained on signal reconstruction task. Left: The signal reconstruction task involves recreating a target signal. Higher losses indicate a greater pointwise difference between the target and predicted signals. Center: In the AeLLE representation, most of the high-accuracy networks (green) cluster to the left of most of the low-accuracy networks (red). The black vertical line indicates the location of the median classifier along the first principal component of the AeLLE space. Moreover, the greatest concentration of the high-accuracy networks is in the bottom-left of the space shown, which is consistent with the stacked histogram indicating that there are relatively few high-accuracy networks to the right of the median compared to low accuracy, and many more to the left. Right: In comparison, the cluster to the bottom-left of the space shown contains a mixture of networks with initialization parameters ranging from 1.0 to 1.6. In general, networks with larger initialization parameter g (particularly once \(g>1.6\)) tend to cluster further to the right in this space. This is consistent with the fact that networks with \(g>1.6\) tend to have lower accuracy on this task, but networks with \(1<g<1.6\) tend to have similarly high accuracy (see Supplemental Materials)

Our results show that AeLLE organizes the variants in the 2D space of \(A_{PC_1}-A_{PC_2}\) according to accuracy. The variants with smaller error values (high accuracy) (\(<0.57\)) are colored in green, and variants with larger error values (low accuracy) (\(>0.57\)) are colored in red. We demonstrate the disparity in the signals that different error values correspond to in Fig. 2—left. The AeLLE space succeeds in correlating the LE spectrum with accuracy such that most low-error networks are clustered in the bottom-left of the two-dimensional projection (see Fig. 2—center), whereas large-error networks are concentrated primarily in the right and top of the region shown. This clustering of high-accuracy networks allows for identifying multiple candidates as top-performing variants in this space. Comparison of AeLLE clustering with a direct clustering according to g values, Fig. 2—right, shows that while low-error variants exist for most values of \(g< 1.7\), there are also high-error variants for each value of g. This is not the case for the variants in the low-error hyperellipse of the AeLLE space. These variants have different g and connectivity values, and sampling from the hyperellipse provides a higher probability of a Rank-1 RNN variant being accurate.

Notably, AeLLE contributes to this problem beyond the validation of AeLLE itself. The representation also selects variants that are “non-trivial”, i.e., configurations of models with g outside the interval \(1.2<g<1.5\) that AeLLE rightfully reports as successful at target learning, and, vice versa, models with g within \(1.2<g<1.5\) that do not reconstruct the target. AeLLE indeed identifies models that lie in the tails of the distribution of the parameter g.

For comparison, we calculated the F1 score of the AeLLE classifier, which uses the median value of the first principal component as the decision boundary, against classifiers that use simple statistics of the Lyapunov spectra. When we use the median value of the Lyapunov spectrum mean and of the maximum Lyapunov exponent as the decision boundaries, the resulting F1 scores are 0.705 and 0.504, respectively, meaning that the mean LE is much more indicative of performance than the maximum LE for this task. Additionally, we define another classifier by projecting the raw Lyapunov exponents onto their first two principal components and using the median of the first principal component as the decision boundary to obtain an LE PC classifier. For this task, the LE PC classifier performs similarly to the LE mean, with an F1 score of 0.703. Meanwhile, the AeLLE classifier achieves an F1 score of 0.724, indicating a significant improvement over the max LE classifier and a modest improvement over the mean LE and LE PC classifiers (see Supplementary Materials for more details).
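The baseline classifiers referenced above can be sketched as follows. The array names are placeholders, and choosing the better side of each median boundary on held-out data is an assumption of this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score

def median_threshold_f1(statistic, high_acc):
    """F1 of a median-threshold classifier on a scalar statistic per network."""
    pred = statistic < np.median(statistic)
    # keep whichever side of the boundary classifies better
    return max(f1_score(high_acc, pred), f1_score(high_acc, ~pred))

def baseline_scores(spectra, high_acc):
    """spectra: (n_networks, n_LE); high_acc: boolean labels (loss below threshold)."""
    return {
        "LE mean": median_threshold_f1(spectra.mean(axis=1), high_acc),
        "LE max": median_threshold_f1(spectra.max(axis=1), high_acc),
        "LE PC1": median_threshold_f1(PCA(2).fit_transform(spectra)[:, 0], high_acc),
    }
```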

4.2 Character prediction with LSTM RNNs of different sizes

Multiple RNN tasks are concerned with non-time-dependent signals, such as sequences of characters in a written text. Therefore, we test AeLLE on LSTM networks that perform the character prediction task (CharRNN), in which for a given sequence of characters (input), the network predicts the character (output) that follows. In particular, we train LSTM networks on the English translation of Leo Tolstoy’s War and Peace, similar to the setup described in [58]. In this setup, each unique character is assigned an index (the number of unique characters in this text is 82). The text is split into disjoint sequences of a fixed length \(l = 101\), where the first \(l-1 = 100\) characters represent the input, and the final character represents the output. The loss is computed as the cross-entropy loss between the expected character index and the output one.
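A sketch of this data preparation is shown below. The file name is a placeholder, and splitting the text into disjoint sequences by simple reshaping is an assumption consistent with the description above.

```python
import torch

# Build the character-prediction dataset: disjoint sequences of length 101,
# with 100 input characters and the final character as the prediction target.
text = open("war_and_peace.txt", encoding="utf-8").read()
chars = sorted(set(text))                        # 82 unique characters in this text
char_to_idx = {c: i for i, c in enumerate(chars)}
encoded = torch.tensor([char_to_idx[c] for c in text], dtype=torch.long)

seq_len = 101
n_seq = len(encoded) // seq_len
sequences = encoded[: n_seq * seq_len].view(n_seq, seq_len)
inputs, targets = sequences[:, :-1], sequences[:, -1]   # (n_seq, 100), (n_seq,)
# Training minimizes the cross-entropy between the LSTM output logits and `targets`.
```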

The hyperparameters of network size (number of hidden states) and initialization of weight parameters appear to impact the accuracy the most. We create 1200 variants of these parameters, varying the number of hidden units from 64 to 512 (by factors of 2) and sampling the initial weights from a symmetric uniform distribution whose half-width p ranges over [0.04, 0.4]. We split the variants into an Autoencoder training set (\(80\%\)) and a validation set (\(20\%\)).

Similar to the target learning task, we project the LE of the variant networks onto the first two PC dimensions of the Latent space of the trained Autoencoder and mark them according to accuracy. LSTM networks with loss below the median among these networks (loss <1.75) are considered high accuracy (green), while those with loss above the median are considered low accuracy (red).

We find that AeLLE in 2D space separates the spectra of the variants according to accuracy across network sizes. Performing principal component analysis of the AeLLE illustrates that the low- and high-accuracy networks are separated along the PC1 dimension, with higher-accuracy networks lying further to the left in the space and lower-accuracy ones clustering to the right (Fig. 3). For comparison, we show the median value of the first principal component across all networks (black line), showing that the vast majority of high-accuracy networks are to the left of this line. In contrast, the network sizes (Fig. 3—right) are more evenly distributed in this space. This demonstrates that the method is able to learn properties from the LE spectrum that correlate with performance across network sizes and that are more informative than network size alone.

Fig. 3

Clustering by performance across network size: AeLLE for LSTM RNNs trained on CharRNN task. Left: The CharRNN task involves predicting the next character in a sequence. Larger losses correspond to less confidence in predicting the next character. Center: In the AeLLE representation, the higher-accuracy networks, regardless of size, tend to cluster by performance, with the low-accuracy networks to the left in this representation. Right: By contrast, LSTMs of different sizes are often mixed together in this representation, with larger networks covering a wider range within the AeLLE space but still overlapping with smaller networks throughout the space

Comparing the separation in the AeLLE space with classifiers based on direct LE statistics, we find that using the median value of the mean Lyapunov exponent or the max Lyapunov exponent as the decision boundary gives classifiers with F1 scores of 0.834 and 0.859, respectively, suggesting that both statistics are strongly indicative of performance on this task. The LE PC classifier, with a decision boundary defined by the median value of the first principal component of the raw Lyapunov exponents, has a similar F1 score of 0.860. Meanwhile, the AeLLE classifier achieves an F1 score of 0.877, indicating an improvement over both direct statistics (see Supplementary Materials for more details). This shows that, while these metrics are indicative of performance, the AeLLE method achieves a slightly better discrimination of network performance.

4.3 Sequential MNIST classification with different network types

A standard benchmark for sequential models is the sequential MNIST task (SMNIST) [59]. In this task, the input is a sequence of pixel grayscale values unrolled from an image of handwritten digits from 0 to 9. The output is a prediction of the corresponding label (digit) written in the image. We follow the SMNIST task setup in [60], where each image is treated as sequential data: each row is the input at one time step, and the number of time steps is equal to the number of rows in the image. The loss corresponds to the cross-entropy between the predicted and the expected one-hot encoding of the digit.
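A sketch of the row-wise sequencing of MNIST images is given below. The use of torchvision and the batch size are assumptions, and the recurrent model call is left as a comment since any of the recurrent architectures considered in this section could be substituted.

```python
import torch
from torchvision import datasets, transforms

# Row-wise sequential MNIST: each 28x28 image becomes a sequence of 28 time
# steps with a 28-dimensional input per step.
mnist = datasets.MNIST(root="data", train=True, download=True,
                       transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(mnist, batch_size=128, shuffle=True)

criterion = torch.nn.CrossEntropyLoss()
for images, labels in loader:
    seq = images.squeeze(1)           # (batch, 28, 28): rows as time steps
    # logits = rnn_model(seq)         # any recurrent model returning class logits
    # loss = criterion(logits, labels)
    break
```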

We train a larger number of RNN variants on this task to demonstrate how the AeLLE properties translate across network architectures. The architectures trained on this task were: LSTM, gated recurrent unit (GRU), (vanilla) RNN, anti-symmetric RNN (ASRNN) [15], coupled oscillatory RNN (coRNN) [17], Lipschitz RNN [18], Noisy RNN [38], and long expressive memory network (LEM) [61]. For each network type, we train 200 variants of hidden size 64. Every network was trained for 10 epochs, and LE of post-trained networks were collected. This constitutes a set of 1600 variants, where we use \(70\%\) for Autoencoder training, \(10\%\) for validation, and \(20\%\) for testing. For more details on the training of this sequential MNIST task, see Appendix A.3. Similarly to previously described tests, we color code the variants according to accuracy. Networks with loss \(<2.2\times 10^{-3}\) are considered as high accuracy (green), which includes \(50\%\) of networks, while the rest of the networks with higher loss are considered as low accuracy (red).

As in the previous tests, AeLLE analysis is able to unravel the variants and their accuracy according to the LE spectrum. Using the median value of PC1 in the AeLLE plane (black line in Fig. 4), the plane can be divided into two clusters, where low-accuracy variants lie in the left part of the plane and high-accuracy variants in the right part. Through the distribution of network architectures in this representation (see Fig. 4—right), we find clusters of mixed architectures with similar accuracy. Moreover, we find that GRU, Lipschitz RNN, LEM, and, to a lesser extent, Noisy RNN all occupy a similar part of this space, with their best-performing variants generally located to the top-right of the space shown and moving to the left as the performance of the variants deteriorates. While no vanilla RNN variants achieve high accuracy, even architectures that have high-accuracy variants, such as LSTM and LEM, have low-accuracy variants that are projected onto the same region as the cluster of vanilla RNNs. Meanwhile, ASRNN and coRNN, both constrained to have dynamics that preserve information, are projected very close to each other in this space into relatively small clusters near the median boundary.

Fig. 4

Clustering across network architecture: AeLLE for networks trained on SMNIST task. Left: In this task, the network predicts the digit shown in a given image. Higher losses correspond to lower confidence in the correct digit label. Center: AeLLE representation shows that networks of similar error are clustered together. When classifying these networks according to high accuracy (green) and low accuracy (red), the high-accuracy networks, regardless of network architecture, are consistently located further to the right (more positive PC1) than low-accuracy networks. Right: Networks of the same type are often clustered together in AeLLE representation, with some overlap between similar architectures (such as coRNN and ASRNN). For each network architecture (except vanilla RNN), there are variants with high and low accuracy separated across the first PC dimension

The LE mean and LE max classifiers on this dataset achieve F1 scores of 0.609 and 0.566, respectively, suggesting that both statistics are non-optimal predictors of performance on their own. The LE PC classifier has an F1 score of 0.608, representing a minor improvement in accuracy. However, the AeLLE classifier using the median value of the first principal component achieves an F1 score of 0.859 (see Supplementary Materials for more details). This score is a significant improvement over the direct LE statistics. It demonstrates that AeLLE is particularly advantageous for this task, suggesting that the characteristics of the dynamics shared across architectures that determine network performance are non-trivial.

4.4 Pre-trained AeLLE for accuracy prediction across training epoch

In the three tests described above, we find that the same general approach of AeLLE allows the selection of variants of hyperparameters of RNN associated with accuracy. LE spectrum is computed for fully trained models to set apart the sole role of hyperparameter variation. Namely, all variants in these benchmarks have been trained prior to computing LE spectrum. Over the course of training, connectivity weight parameters vary, and as a result, the LE spectrum undergoes deformations. However, it appears that the general properties of LE spectrum, such as the overall shape, emerge early in training.

From these findings and the success of AeLLE, a natural question arises: How early in training can AeLLE identify networks that will perform well upon completion of training? To investigate this question, we use a pre-trained AeLLE classifier, i.e., one trained on a subset of variants that were fully trained for the task. We then test how such a pre-trained, fixed AeLLE represents variants that are only partially trained, e.g., those that underwent 0–50% of training. This test is expected to show how robust the features inferred by AeLLE, which correlate hyperparameters and LE spectrum, are to the optimization of connectivity weights. It would also provide insight into how long a network must be trained before the final accuracy of a hyperparameter variant can be predicted.

Table 1 Precision, recall, and F1 score of pre-trained AeLLE classifier for RNN final accuracy evaluated at different stages of training indicated by bold numbers

We select the SMNIST task with LSTM, GRU, RNN, ASRNN, CoRNN, Lipschitz RNN, Noisy RNN, and LEM models. Each model has a hidden state size of 64, and the initialization parameter p is the same as in the previous experiment (200 variants for each model). We then compute the LE spectrum for the first five epochs of training (out of 10 epochs in total) for all variants. Note that the LE spectrum before training is also computed. Therefore, 6000 LE spectra are considered. We then select \(20\%\) of the variants as a training set, spanning all epochs. We define the resulting AeLLE as a pre-trained AeLLE and investigate its performance.

We select another \(20\%\) of the data as the validation set (1800 variants) and apply the same loss threshold of \(2.2\times 10^{-3}\) as above. We use this validation set to define a simple threshold that classifies low- and high-accuracy variants according to AeLLE (green shaded region of \(PC1>-0.03\) in Fig. 5).

We then apply the same pre-trained AeLLE and accuracy threshold to variants in the test set (3600 variants) at different epochs (expressed as % of training), illustrated in Fig. 5 with training progressing from left to right, and in Table 1. We observe that the projection of the variants' LE spectra onto AeLLE (points in the AeLLE 2D embedding) does not change significantly over the course of training. Specifically, before training, we find that the recall, i.e., the fraction of low-error networks that fall within the low-error threshold region, is \(93.8\%\). The precision, i.e., the fraction of networks in the region that are low error, is \(77.0\%\). Then, after a single epoch (\(10\%\) of training), the recall becomes \(91.5\%\), and the precision improves to \(83.6\%\), indicating that more of the networks in the region are correctly identified as high accuracy. The recall and precision do not change substantially over the remainder of training. This indicates that the LE spectrum captures properties of network dynamics that emerge with network initialization and remain throughout training.

Fig. 5

Pre-trained AeLLE during RNN training on SMNIST task. Prediction of the final task accuracy of test-set variants by the pre-trained AeLLE threshold classifier (\(PC1 <-0.03\)) is shown at \(0, 10, 20, 50\%\) of training of the networks on SMNIST. Throughout training, the high-accuracy networks (green) tend to cluster to the left of the pre-trained AeLLE median classifier (black line). Furthermore, the distribution of networks in this space changes little over training, indicating that the dynamical properties that facilitate networks’ successful learning of a task emerge early in training

In summary, we find that pre-trained AeLLE is an effective classifier that predicts the accuracy of the given RNN when fully trained, even before it has undergone 50% training. To further quantify the effectiveness of AeLLE classifier prediction, we compare it with a direct feature of training, the loss value at each stage of training, see Table 1. We find that variants with low loss early in training correspond to variants that will be classified as low error, indicated by an almost perfect precision rate of 99–100%. However, many variants of high accuracy do not converge quickly. Indeed, the recall rate for a classifier based on loss values is 17–77% for 10–50% of training, while in contrast, AeLLE recall rate is consistently above \(88\%\) across all training epochs.

Table 2 Precision, recall, and F1 score of LE mean, LE max, and AeLLE classifiers throughout training

For further comparison, we construct classifiers using the LE mean, LE max, and the first PC of the raw LEs as training progresses. For each classifier, we select the value of the statistic that yields the best F1 score on the validation set as the decision boundary and then test on the test set how accurately the classifier determines whether a network will reach high or low accuracy at the end of training. The distribution of the LE statistics and of the first PC of the AeLLE classifier over training is shown in Fig. 6. We see that the distribution of the mean LE is heavily concentrated near its maximum value of 0 and does not change significantly over training. The max LE distribution exhibits more change, but there is a similar number of high- and low-accuracy networks on either side of the boundary throughout training. The first PC of AeLLE has the greatest concentration of high-accuracy networks on one side of the decision boundary, suggesting that it is a more useful classifier. In Table 2, we present the recall, precision, and F1 score of each of these classifiers. This comparison shows that the AeLLE classifier has significantly higher F1 scores than the other classifiers across all epochs.
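The threshold selection on the validation set can be sketched as follows. Scanning all unique values of the statistic and trying both sides of the boundary are assumptions about implementation details not specified above.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(stat_val, y_val):
    """Pick the decision boundary for a scalar statistic by best validation F1.
    stat_val: statistic per network (LE mean, LE max, or a first PC);
    y_val: boolean labels (final accuracy above/below the threshold)."""
    best_thr, best_side, best_f1 = None, None, -1.0
    for thr in np.unique(stat_val):
        for side, pred in (("below", stat_val < thr), ("above", stat_val >= thr)):
            score = f1_score(y_val, pred)
            if score > best_f1:
                best_thr, best_side, best_f1 = thr, side, score
    return best_thr, best_side
```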

Fig. 6

Comparison of LE mean, LE max, and AeLLE classifiers and distributions throughout network training. The distributions of the mean LE (first row), max LE (second row), first principal component of the raw Lyapunov exponents (third row), and first principal component of the AeLLE (bottom) are shown from 0% to 50% training on the SMNIST task. The decision boundary value of each metric, used as a classifier in Table 2, is shown as a vertical black bar

4.4.1 AeLLE contribution to classification

To delineate the necessity of the AeLLE representation, we extend the classifier from the simple classifier (PC1) used in earlier sections to a linear regression classifier. We then perform a comparative study to test whether the AeLLE representation is responsible for the effective classification we observe, or whether a more advanced classifier applied directly to LE could achieve a similar outcome. To test this, we use a linear regression classifier in conjunction with AeLLE and compare the classification accuracy with the scenario of using the classifier directly on LE (both the full spectrum (Raw) and a low-dimensional PCA projection). Table 3 shows the results in terms of F1 score and demonstrates that AeLLE is a required step in classification. In particular, in the two scenarios where the AeLLE representation is used, classification in conjunction with PC1 or with linear regression achieves similar F1 scores (the last two rows in Table 3). These scores are significantly higher than those obtained by applying linear regression directly to LE values (row 1) or to LE values projected to a lower-dimensional space with PCA (rows 2, 3). We find that for all tested training durations (from 0 to 50%), classifiers applied to AeLLE achieve an F1 score higher by 0.12, on average, than the score computed directly on LE or its PCA embedding. In addition, the study suggests that an extension of the classifier, such as extending from PC1 to linear regression, does not replace AeLLE but rather provides additional means, in conjunction with AeLLE, to enhance classification.
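A sketch of this comparison is given below, where the linear regression classifier is implemented as least squares on the binary accuracy label thresholded at 0.5; the train/test arrays and the two-component PCA are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score

def linreg_f1(X_train, y_train, X_test, y_test):
    """Least-squares fit to the binary label, thresholded at 0.5."""
    reg = LinearRegression().fit(X_train, y_train.astype(float))
    return f1_score(y_test, reg.predict(X_test) > 0.5)

def compare_representations(le_train, le_test, latent_train, latent_test,
                            y_train, y_test):
    """F1 of the linear-regression classifier on raw LE, PCA(LE), and AeLLE."""
    pca = PCA(n_components=2).fit(le_train)
    return {
        "Raw LE + linreg": linreg_f1(le_train, y_train, le_test, y_test),
        "PCA(LE) + linreg": linreg_f1(pca.transform(le_train), y_train,
                                      pca.transform(le_test), y_test),
        "AeLLE + linreg": linreg_f1(latent_train, y_train, latent_test, y_test),
    }
```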

Table 3 Comparison of F1 score of PC1 and linear regression classifiers in conjunction with LE-related representations (Raw—row 1, PCA—rows 2, 3, and AeLLE—rows 4, 5) throughout training duration (from 0% training to 50%)
Table 4 F1 score of pre-trained AeLLE classifier on networks at various levels of training using median from increasing numbers of principal components (PCs)

4.4.2 Higher dimensions of AeLLE

The AeLLE space is not restricted to two dimensions. In general, the more dimensions are used, the more accurate the representation is expected to be. To test this hypothesis, we extend the classification to higher dimensions. Specifically, we use the first d PCs, where the median of each PC divides the space into \(2^d\) subspaces. In each subspace, we count the number of optimal and non-optimal network samples. We then take the union of all optimal subspaces (those that contain more optimal networks than sub-optimal ones) to be the overall optimal region for the classifier, and test the classifier on the test dataset for each epoch. Our results are reported in Table 4, showing the F1 score for \(0\%\) and \(50\%\) training, with the number of PCs set to \(d=1,2,4,10\).
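The construction of this higher-dimensional classifier can be sketched as follows; the encoding of each subspace as a binary cell index is an implementation assumption.

```python
import numpy as np

def fit_subspace_classifier(pcs, high_acc, d):
    """pcs: (n_networks, >=d) AeLLE PC coordinates; high_acc: boolean labels.
    The median of each of the first d PCs splits the space into 2^d cells; a cell
    is 'optimal' if it contains more high- than low-accuracy networks."""
    medians = np.median(pcs[:, :d], axis=0)
    cells = (pcs[:, :d] > medians).astype(int) @ (2 ** np.arange(d))
    optimal_cells = {c for c in range(2 ** d)
                     if np.sum(high_acc[cells == c]) > np.sum(~high_acc[cells == c])}
    return medians, optimal_cells

def predict_subspace(pcs, medians, optimal_cells, d):
    """Predict high accuracy if a network's cell belongs to the optimal region."""
    cells = (pcs[:, :d] > medians).astype(int) @ (2 ** np.arange(d))
    return np.isin(cells, list(optimal_cells))
```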

The results show that as additional PCs are included, the F1 score increases both at the initial phase before training and during training. Hence, the higher dimensions of the full AeLLE space include additional features of the LE and the corresponding networks. The results also show that the first-order PC classifier is able to capture a significant portion of the high- and low-accuracy networks.

4.5 LE features visualization in AeLLE space

The Lyapunov spectrum of different architectures can have highly variable maximum values, slopes, variances, means, and more. Specific properties of these spectra would intuitively correlate with performance, but the relation between each feature and accuracy is unclear. This is evident from the comparison of the Lyapunov spectra of LSTM, RNN, GRU, and LEM with those of coRNN, ASRNN, Lipschitz RNN, and Noisy RNN trained on SMNIST in Fig. 7 (left). Whereas coRNN, ASRNN, Lipschitz RNN, and Noisy RNN all have exponents close to zero for all indices, the spectra of LSTM, GRU, RNN, and LEM all dip well below zero, with RNN decreasing much faster than the others. Meanwhile, GRU maintains a far greater number of exponents close to zero, similar to coRNN, for about half of the indices. It would then be natural to assume that the performance of all networks with all LEs close to zero would be most similar, and that LSTM, GRU, and LEM would all be most similar to RNN in performance.

Fig. 7

Comparison of Lyapunov spectrum curves, AeLLE, and loss for different network architectures on SMNIST task. Lyapunov spectrum curves (left) of eight different architectures show that ASRNN, coRNN, Lipschitz RNN, and Noisy RNN have very similar spectra with many exponents close to zero. Meanwhile, the spectra of LSTM, GRU, RNN, and LEM dip well below zero, but at different rates. These networks in AeLLE space (center) are thus grouped such that RNN, LSTM, and LEM are closer to each other than to other network types, but GRU is grouped with the networks whose spectra stay close to zero. Such mapping appears to be correlated with accuracy (right; colors indicate loss from low (green) to high (red)). Despite the similar spectra of coRNN, ASRNN, Lipschitz RNN, and Noisy RNN, the loss of coRNN is larger than that of the others by an order of magnitude, causing it to be separated from some of the other networks in this space. While the LE curves of LSTM, RNN, and LEM are distinct, their loss values are much higher (red), and they are clustered in the same vicinity, farther from the more optimal networks

The representation in AeLLE space (Fig. 7—center) shows two clusters, with one tight cluster containing RNN, LSTM, and LEM and another, looser cluster containing all other tested networks. Within this looser cluster, we find both the coRNN network, which has a spectrum similar to ASRNN and Lipschitz RNN, and the GRU, whose spectrum is visually much closer to LSTM and LEM. However, we find that, within this cluster, GRU is located closest to Lipschitz RNN and Noisy RNN, all of which are found to the right of the median boundary along with ASRNN. Meanwhile, coRNN is located just to the left of the boundary, in the direction of RNN, LSTM, and LEM. While visual inspection of the spectra does not immediately indicate the reason for this ordering, it becomes more apparent when we observe the loss of the networks (Fig. 7—right). Since ASRNN, Lipschitz RNN, Noisy RNN, and GRU obtain optimal accuracy (indicated by green color), they are mapped to the right of coRNN and the other networks in the AeLLE plane. Furthermore, while LSTM, RNN, and LEM differ in architecture and dynamics, they are mapped to the same cluster in AeLLE. This non-intuitive mapping can again be explained by accuracy, since on this task RNN, LSTM, and LEM all exhibit low accuracy (indicated by red color).

Such experiments indicate that easily observable LE features, such as the number of exponents near zero or overall spectrum shape, do not vary uniformly with performance. Instead, a more complex, nonlinear combination of LE features extracted by AeLLE would be required to determine this relation.

5 Discussion

LE methodology is an effective toolset for studying nonlinear dynamical systems, since LE measure the divergence of nearby trajectories and indicate the degree of stability and chaos in the system. Indeed, LE have been applied to various dynamical systems and applications, and there exists a theoretical underpinning for the characterization of these systems by the LE spectrum. However, it is unclear how to relate the LE spectrum to such system characterization in practice. Our results demonstrate that the information that LE contain regarding the dynamics of a network can be related to network accuracy on various tasks through an Autoencoder embedding, called AeLLE. In particular, we show that the AeLLE representation encodes information about the dynamics of recurrent networks (represented by their Lyapunov exponents) along with their performance. We demonstrated that this relation to performance can be learned across choices of weight initialization, network size, network architecture, and even training epoch. Effectively, the AeLLE representation discovers the implicit parameters of the network.

Such a representation is expected to be invaluable in uniting research that looks to assess and predict model quality on particular tasks and those that emphasize and constrain model dynamics to encourage particular solutions. Our approach allows mapping the dynamics of a network to accuracy through the Latent space representation of the LE Autoencoder. This mapping captures multiple characteristics of the networks, with some direct, such as network type, accuracy, and number of units, and some implicit. All these are contained in AeLLE representation and effectively provide salient features/parameters for the networks being considered.

Specifically, the significance of our results is that they show that AeLLE representation is able to distinguish networks according to accuracy across a choice of network architectures with greater accuracy than using simple spectrum statistics alone. Such findings suggest that the features of the dynamics that correspond to optimal performance on given tasks are consistent across network architectures, whether they are RNN, gated architectures, or dynamics-constrained architectures, even if they are not immediately apparent upon visual inspection or through the first-order statistics of the spectrum. AeLLE is able to capture these.

While the accuracy of such a classifier is enhanced over those using direct statistics, this comes at the cost of the training time of the Autoencoder and the interpretability of the extracted features. Analysis of the individual components of this Autoencoder methodology (Eq. 4) could provide more insight into the interpretation of these AeLLE features. Namely, the representations of the Lyapunov exponents that AeLLE produces (\(\phi (Z)\)) could be analyzed directly, statistically or otherwise. The contribution of each dimension of the Latent space of the Autoencoder to the predicted loss value could be extracted from the linear prediction layer (\(\xi\)). Furthermore, the corresponding LE spectrum features for these Latent dimensions could be analyzed using the decoder (\(\psi\)). Additional experiments exploring the AeLLE representation beyond those outlined here could also be carried out to provide further insights. These studies are left for future work, to be applied within the optimization and analysis of particular machine learning tasks, based on the generic methodology presented in this work.

Multiple recent works have demonstrated the effectiveness of transformers in processing sequential data [62]. While there is significant overlap between the datasets and problems to which RNN and transformers are often applied, the mechanisms by which they accomplish these tasks are not directly comparable. Notably, transformers process entire sequences at once and use attention mechanisms to emphasize the key points without employing the explicit time-stepping of RNNs. Since the computation of the LE method developed for RNN measures the expansion and contraction over time of the hidden states, a direct comparison between the LE of RNNs and transformers using AeLLE would require identifying analogous dynamic variables for transformers. LE could be computed for activation variables transitioning between layers; however, these variables are not indicative of sequential stepping through the sequence, and their relevance to sequence processing is unclear. The identification of suitable analogous variables for LE spectrum across network types and analysis and comparison of their characteristics through methods such as AeLLE represented in this work offer exciting directions for future research.

While applying AeLLE for the search of optimal networks given a task is outside of the scope of this work, exploration into AeLLE representation to find predicted optimal dynamics for a task would be a natural extension of the results reported here. Furthermore, an extension of AeLLE methodology is to search and unravel novel architectures with desired dynamics defined by a particular LE spectrum. The AeLLE approach can also be adapted to analyze other complex dynamical systems. For example, long-term forecasting of temporal signals from dynamical systems is a challenging problem addressed with a similar data-driven approach using Autoencoders and spectral methods along with linearization or physics-informed constraints [63,64,65,66,67]. Application of AeLLE could unify such approaches for dynamical systems representing various physical systems. The key building blocks in AeLLE that would need to be established for each of these extensions are efficient computation of Lyapunov exponents, sufficient sampling of data to train the Autoencoder to form an informative Latent representation, and stable back-propagation of gradients across many iterations of QR decomposition in the LE calculation.