Introduction

Real-time sensor signal processing is in growing demand in everyday life. High-frequency sensor data are available in a wide range of embedded applications, including, for example, speech recognition, battery protection in electric cars, and the monitoring of production facilities. Hence, many application domains could benefit from intelligent, real-time sensor signal analysis on low-cost and low-power devices. To adapt to changing environmental and operational conditions of the target system, machine learning-based approaches have to be employed, since classical signal processing techniques often reach their limits under changing external influences. At the same time, however, efficient hardware implementation and acceleration of such intelligent methods are needed to fulfill real-time constraints for applications requiring inference times in the \(\upmu \)s to ns range.

Currently, Neural Network-based models, including Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, form the state of the art for numerous time series processing tasks. However, these models typically consist of several different layers stacked in a deep architecture, have millions of trainable parameters, and include hard-to-implement nonlinear mappings. Such deep architectures are unfavorable for hardware implementation and induce long inference times, which is disadvantageous for real-time applications.

During the last two decades, Reservoir Computing (RC) emerged as a promising alternative to deep and Recurrent Neural Networks for time series analysis. In contrast to the latter, RC models have a shallow architecture and trainable parameters in only a single layer, making them generally much easier to design and train. Despite their comparatively simple architecture, RC models have proven their capabilities in many application domains, such as biomedical, audio, and social applications [72]. Besides temporal data, RC models can also deal with other sequential data, enabling tasks like image recognition [35, 72]. While RC models require relatively few computational resources compared to deep Neural Networks, they still have to be optimized for the unique requirements of, e.g., Field-programmable Gate Array (FPGA)-based implementations.

Due to the discrete nature of their reservoirs, Reservoir Computing using Cellular Automata (ReCA) models form a subset of the RC framework that is well suited for implementation on FPGAs. As with other RC models, training ReCA models is easy and fast. Nevertheless, they require extensive hyperparameter tuning.

One major challenge that we address in this paper is that, for most ReCA models, the hyperparameter search space is too large for current heuristic search and optimization algorithms. This is especially true for the selection of suitable Cellular Automaton (CA) rules in the reservoir.

Because of this, we conducted the first mathematical analysis of the influence of Linear CA rules on model performance in the ReCA framework and identified common analytical properties of suitable linear rules to be used in the reservoir. We backed our mathematical analysis with the results of almost one million experiments, amounting to a sequential runtime of nearly one year on an NVIDIA RTX A4000 GPU (which allowed us to run three experiments in parallel).

In the research community, the ReCA framework has been tested almost solely on pathological datasets that allow conclusions neither about the generality of the conducted studies nor about the generalization capabilities of the ReCA models themselves. In the context of this study, we performed an extensive analysis using several benchmark datasets. The result of our research is the Reservoir Computing using Linear Cellular Automata design algorithm (ReLiCADA), which specifies Reservoir Computing using Linear Cellular Automata (ReLiCA) models with fixed hyperparameters and thus immensely simplifies the overall design process. The selected ReLiCA models achieve lower errors than comparable state-of-the-art time series models, while maintaining low computational complexity.

Compared to other analyses, we do not try to classify the CA rules using the four Wolfram classes [80], the Langton lambda [40], or one of their many derivatives [73, 76]. These have been proven to lack formal definitions and are sometimes even undecidable, hindering their usefulness [16, 76]. Furthermore, we do not restrict our analyses to elementary CAs, which rules out grid-search approaches for rule selection. To the best of our knowledge, the only optimization algorithm that has been used for rule selection is the Genetic Algorithm (GA) [5]. However, as we will show later in this paper, GAs do not produce convincing results for rule selection and are outperformed by our proposed method.

The rest of this paper is structured as follows. We start with an introduction to the fundamental topics of this paper in “Background and related work”, including RC (“Reservoir Computing”), CAs (“Cellular Automata”), and finally the ReCA framework (“ReCA framework”). After that, we introduce our refined version of the ReCA model architecture and describe all of its parts in “Refined ReCA architecture”. In “Mathematical parameters”, we define the mathematical and topological parameters used in our analysis. This is followed by an explanation of our novel Reservoir Computing using Linear Cellular Automata design algorithm in “Proposed ReLiCA design algorithm”. The datasets and models we use to compare and validate our algorithm are listed in “Experiments”, before we discuss the experiments in “Results”. The paper closes with a conclusion in “Conclusion”.

Background and related work

The in-depth analysis of ReCA models draws on concepts and methods from different research fields, ranging from abstract algebra and automata theory over the properties of dynamical systems to machine learning. In the following sections, we summarize the required background knowledge and related work on RC, CAs, and ReCA. Furthermore, we define the mathematical parameters that we use to characterize the ReCA models.

Reservoir Computing

Fig. 1
figure 1

Echo State Network as an example for Reservoir Computing

The main idea of RC is to transform the input \({{\textbf{x}}}\) into a higher-dimensional space \({{\textbf{s}}}\) to make the input linearly separable. This transformation is performed by a dynamic system which is called the reservoir (center part of Fig. 1).

The readout layer (right part of Fig. 1) is then used to linearly transform the reservoir state into the desired output \({{\textbf{y}}}\) [53]. Generally, RC models can be described using

$$\begin{aligned} {\textbf{s}}^{(t)}&= g(\varvec{\textsf{V}}{\textbf{x}}^{(t)}, \varvec{\textsf{W}}{\textbf{s}}^{(t-1)}), \nonumber \\ {\textbf{y}}^{(t)}&= h(\varvec{\textsf{U}}{\textbf{s}}^{(t)}), \end{aligned}$$
(1)

with the reservoir state \({{\textbf{s}}},\) the input \({{\textbf{x}}},\) and the output \({{\textbf{y}}}\) at the discrete time t. The function g depends on the reservoir type, while the function h describes the readout layer and is typically a linear mapping. During model training, only the output weights \({\varvec{\textsf{U}}}\) are trained, while the input weights \({\varvec{\textsf{V}}}\) and reservoir weights \({\varvec{\textsf{W}}}\) are fixed and usually generated under some model-specific constraints. In Fig. 1, we depict an Echo State Network (ESN) [32] using a single-layer RNN [66] as the reservoir.
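To make Eq. (1) concrete, the following minimal NumPy sketch implements an ESN-style RC model. The tanh nonlinearity, the uniform weight initialization, and the spectral-radius rescaling are common ESN design choices that we assume here; they are not prescribed by Eq. (1) itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 1, 100, 1

# Fixed, untrained weights: input weights V and reservoir weights W.
# Rescaling W to a spectral radius below 1 is a common ESN heuristic.
V = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

U = rng.uniform(-0.5, 0.5, (n_out, n_res))  # readout: the only trained weights

def rc_step(s, x):
    """One step of Eq. (1) with g = tanh and h = identity."""
    s = np.tanh(V @ x + W @ s)  # reservoir update
    return s, U @ s             # new state s^(t) and output y^(t)

s = np.zeros(n_res)
for x in np.sin(np.linspace(0, 6, 50)):
    s, y = rc_step(s, np.array([x]))
```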

Further simplifications of the reservoir were proposed by Rodan and Tino [64], resulting in, e.g., the Delay Line Reservoir (DLR) and the Simple Cycle Reservoir (SCR). These types of reservoirs require fewer computations during the inference step than general ESNs. Nevertheless, they are still not suited for implementation on, e.g., FPGAs due to the required floating-point calculations. To eliminate the floating-point operations in the reservoir, stochastic bitstream neurons can be used [75]. However, stochastic bitstream neurons trade inference speed for simplicity of implementation on FPGAs and are thus not suited for our use case [3].

In this paper, we focus on a class of RC models that use CAs as the reservoir, which has been termed ReCA [48, 85]. One of the main advantages of ReCA models compared to other RC models is that the reservoir naturally uses only integer operations on a finite set of possible values. Because of that, they are easy and fast to compute on digital systems like FPGAs.

Cellular Automata

CAs represent one of the simplest types of time-, space-, and value-discrete dynamical systems and were initially introduced by von Neumann [55, 56]. Following this idea, CAs have been analyzed concerning several different properties, including structural [41, 82], algebraic [18, 31, 49, 77, 79], dynamical [6, 17, 30, 38, 68], and behavioral [14, 40, 51, 59, 73] properties.

Fig. 2
figure 2

Lattice of a one-dimensional Cellular Automaton with periodic boundary conditions. Using only the orange weights results in \({n = 3};\) using the orange and green weights results in \({n = 5}.\) The state of the cell \({s_0}\) in the (i)th iteration is the weighted sum of the cell states in its neighborhood in the \({(i-1)}\)th iteration

The CAs considered in this paper consist of a finite, regular, one-dimensional lattice of N cells (see Fig. 2), for reasons discussed below. Each of the cells can be in one of m discrete states. The lattice is assumed to be circularly closed, resulting in periodic boundary conditions. In this sense, the right neighbor of the rightmost cell \(({s_{N-1}})\) is the leftmost cell \(({s_0}),\) and vice versa. A configuration of a CA at a discrete iteration i consists of the states of all its cells at that iteration and can thus be written as a state vector \({ {\textbf{s}}^{(i)} \in {\mathbb {Z}}_{m}^{N} }\) according to

$$\begin{aligned} {\textbf{s}}^{(i)} = (s_0^{(i)}, \ldots , s_{N-1}^{(i)})^{\textrm{T}} ,\quad \text {with } s_k \in {\mathbb {Z}}_{m}, \end{aligned}$$
(2)

where \({ {\mathbb {Z}}_{m} = {\mathbb {Z}}/m{\mathbb {Z}}}\) denotes the ring of integers modulo m, and (i) denotes the iteration index. The states of the cells change over the iterations according to a predefined rule. At iteration (i), the cell state \({ s_k^{(i)} }\) is defined depending on the states of the cells in its neighborhood of fixed size n at iteration \({ (i-1) }\) (see Fig. 2). The neighborhood of a cell contains the cell itself, as well as a range of r neighboring cells to the left and right, respectively, leading to

$$\begin{aligned} n = 2r+1 ,\quad \text {with } r \in {\mathbb {N}}^+. \end{aligned}$$
(3)

The iterative update of the cell states can be described in terms of a local rule \({ f:{\mathbb {Z}}_{m}^n \rightarrow {\mathbb {Z}}_{m} },\) which defines the dynamic behavior of the CA according to

$$\begin{aligned} s_k^{(i)} = f(s_{k-r}^{(i-1)}, \ldots , s_{k+r}^{(i-1)}). \end{aligned}$$
(4)

Since we use periodic boundary conditions, the indices \({k-r, \ldots , k+r}\) of the states in Eq. (4) have to be taken \({\text {mod } N}.\)

A subset of general CAs is given by Linear CAs [77]. The local rule of a Linear CA is a linear combination of the cell states in the considered neighborhood. Hence, for Linear CAs, the local rule f can be defined as

$$\begin{aligned} f(s_{k-r}^{(i)},\ldots , s_{k+r}^{(i)}) = \sum \limits _{j=-r}^{r}{w_j}s_{k+j}^{(i)} \end{aligned}$$
(5)

with rule coefficients \({ w_j \in {\mathbb {Z}}_m}.\) A linear rule can thus be identified by the tuple of its rule coefficients \({{\textbf{w}} = (w_{-r},\ldots ,w_r)}.\) Unless otherwise noted, we will restrict the CA rule to linear rules in this paper. A prominent example is the elementary rule 90 CA, which is defined by \({m=2},\) \({n = 3}\) and \({(w_{-1}, w_0, w_1) = (1,0,1)}\) [79].

We introduce a restriction on the neighborhood n to define the true neighborhood \({{\hat{n}}}.\) For \({{\hat{n}}},\) we require that \({w_{-r} \ne 0}\) or \({w_r \ne 0},\) i.e., the outermost rule coefficients do not both vanish.

For each linear rule f,  there exists a mirrored rule \({{\hat{f}}}\) with \({\hat{{\textbf{w}}} = (w_r,\ldots ,w_{-r})}.\) If the rule coefficients are symmetric with respect to the central coefficient \({w_0},\) it holds that \({{\hat{f}} = f}.\) In total, there exist \({ m^n }\) different Linear CA rules, which directly follows from Eq. (5). We denote the set of all linear rules for given m and n by

$$\begin{aligned} {\mathcal {R}}(m,n) = \left\{ \left( w_{-r},\ldots ,w_{r}\right) : w_i \in {\mathbb {Z}}_m, n=2r+1\right\} . \end{aligned}$$
(6)

The local rule f is applied simultaneously to every cell of the lattice, such that the configuration \({ {\textbf{s}}^{(i-1)} }\) updates to the next iteration \({ {\textbf{s}}^{(i)} },\) and therefore it induces a global rule \({ F:{\mathbb {Z}}_{m}^N \rightarrow {\mathbb {Z}}_{m}^N }.\) For linear CAs, this mapping of configurations can be described by multiplication with a circulant matrix \({ \varvec{\textsf{W}} \in {\mathbb {Z}}_{m}^{N \times N} },\) which is given by

$$\begin{aligned} \varvec{\textsf{W}} = \text {circ}(w_0, \ldots , w_r, 0, \ldots , 0, w_{-r}, \ldots , w_{-1}), \end{aligned}$$
(7)

with \({ \text {circ} }\) as defined in [77] (note: if \({ N = n },\) the circulant matrix has no additional zero entries that are not rule coefficients). Thus, the global rule for a Linear CA can be written as

$$\begin{aligned} {\textbf{s}}^{(i)} = F({\textbf{s}}^{(i-1)}) = \varvec{\textsf{W}}{\textbf{s}}^{(i-1)}. \end{aligned}$$
(8)
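As an illustration of Eqs. (5), (7), and (8), the following sketch performs one CA update both via the local rule and via the circulant matrix; the rule 90 coefficients and the impulse initialization are merely example choices.

```python
import numpy as np

m, N, r = 2, 12, 1
w = (1, 0, 1)  # rule 90 coefficients (w_{-1}, w_0, w_1)

def ca_step(s):
    """One application of the global rule F (Eq. (8)) via the local rule
    (Eq. (5)); np.roll realizes the mod-N index wrap-around of Eq. (4)."""
    return sum(w[j + r] * np.roll(s, -j) for j in range(-r, r + 1)) % m

# Equivalent circulant matrix of Eq. (7): row k holds w_j at column (k+j) mod N.
W = sum(w[j + r] * np.roll(np.eye(N, dtype=int), j, axis=1)
        for j in range(-r, r + 1)) % m

s = np.zeros(N, dtype=int)
s[0] = 1  # single-impulse initial configuration
assert np.array_equal(ca_step(s), (W @ s) % m)
```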

It has been shown that several key properties typically used to characterize dynamical systems are not computable for general CAs. Even for general one-dimensional CAs, nilpotency is undecidable, and the topological entropy cannot even be approximated [17, 30, 37]. Furthermore, injectivity and surjectivity can be computed only for one- and two-dimensional CAs [17, 38]. However, when restricting the analysis to one-dimensional Linear CAs, all of the mentioned properties are computable. This is the reason why we focus on one-dimensional Linear CAs in this paper.

Fig. 3
figure 3

ReCA architecture as initially proposed by Yilmaz [85]

ReCA framework

CAs were first employed as the reservoir in the RC framework by Yilmaz [85], replacing the recurrently connected neurons typically used in ESNs. The original architecture of a ReCA model is depicted in Fig. 3. The input to the model is a time series \({ {\textbf{x}} },\) which is fed sample by sample into an encoding stage. The encoding stage, as proposed in [85], serves several purposes. First, the input is preprocessed depending on the type of data, which may include feature expansion, weighted summation, scaling, and binarization. Second, the processed data are mapped to the cells of the CA in the reservoir. Third, the processed data are encoded into the mapped cell states [85]. With the input encoded into its cell states, the global rule of the CA is executed iteratively for a fixed number of iterations. The output of the CA is then passed to the readout layer, which produces the final model output \({{\textbf{y}}}.\)

The ReCA framework has been analyzed and developed further based on the initially proposed architecture. In Nichele and Gundersen [57], the authors use hybrid CA-based reservoirs split into two halves, each half running with a different rule to enrich the dynamics within the reservoir. However, this increases the search space for suitable rule combinations in the reservoir, and it remains unclear how to design the reservoir effectively.

Deep Reservoir Computing using the ReCA approach is investigated by Nichele and Molund [58] by stacking two ReCA models one after the other, resulting in decreased error rates in most of the analyzed cases. This design principle, however, to some extent contradicts the original intention of RC, which is to reduce the complexity of supervised training of Neural Networks (NNs) [33].

The analysis of suitable CA rules has been extended from elementary CAs \(({ m=2, {\hat{n}} = 3 })\) to complex CAs (\({ m \ge 3}\) and/or \({{\hat{n}} \ge 5 }\)) in [5]. In their work, the authors use a GA to perform a heuristic optimization within the super-exponentially growing rule space (\({ m^{m^{{\hat{n}}}} },\) since they do not restrict themselves to Linear CAs) to find suitable rules for use in the reservoir. One of the biggest challenges with this approach is that the rule space quickly becomes unmanageable for heuristic optimization methods, including genetic algorithms. Even when the number of possible states is merely doubled from, e.g., \({ m=2 }\) to \({ m=4 },\) the number of possible rules with a three-neighborhood grows from \({2^{2^{3}} = 256}\) to \({ 4^{4^{3}} \approx 3.4 \times 10^{38} }.\) This example strikingly shows that even small increases in the complexity of the CA reservoirs make applying heuristic search and optimization methods practically impossible.

Most of the research mentioned above has been based mainly on the synthetic 5-bit and 20-bit memory tasks [85]. However, as the authors in [47] point out, especially the 5-bit memory task is not sufficient to draw conclusions about the generalization capability of a model, since this task consists of only 32 examples. Furthermore, the model is trained and tested on the whole dataset, which contradicts the common practice of separating training and test sets. Therefore, they adapt the 5-bit memory task by splitting the 32 examples into a training and a test set. This, however, shrinks the number of available training and test examples further. The authors also investigate the effect of different feature extraction techniques on the reservoir output, with the result that simply overwriting CA cells in the reservoir works well in less complex CAs.

A rule 90-based FPGA implementation of a ReCA model for the application of handwritten digit recognition based on the MNIST dataset is presented in [52]. Even though their implementation does not reach the classification accuracy of current state-of-the-art Convolutional Neural Network (CNN)-based implementations, the authors show that ReCA is a promising alternative to traditional Neural Network-based machine learning approaches. This is especially underlined by the fact that the energy efficiency of their implementation is improved by a factor of 15 compared to CNN implementations [52].

An analysis of the influence of several hyperparameters in the ReCA framework has been conducted in [24], with the result that for general CAs, the overall performance of the model is dependent on and sensitive to the concrete choice of hyperparameters.

Fig. 4
figure 4

Refined ReCA architecture

Refined ReCA architecture

Based on the initially published ReCA framework, we refine our view of the architecture by splitting the encoding layer into multiple parts, since it fulfills several different and independent tasks. Figure 4 depicts the refined ReCA architecture. The input data \({\textbf{x}}\) is fed into the transformation layer (“Transformation”), which transforms the data and prepares it for the following quantization. The transformation layer can be used to apply arbitrary transformation functions, e.g., the hyperbolic tangent, to the input data. The quantization layer then quantizes the input to the allowed states \({x_q \in {\mathbb {Z}}_m}\) (“Quantization”). Note that the transformation and quantization layers often work together to achieve the desired \({x_q}.\) The quantized input \({x_q}\) is then passed to the mapping layer (“Mapping”), which selects the CA cells that receive the quantized input. The following encoding layer (“Encoding”) encodes the quantized input into the selected cells. After that, the CA in the reservoir updates the cells for a fixed number of iterations, and the states of the CA are used by the readout layer to calculate the ReCA model output \({{\textbf{y}}^{(t)}}\) (“Reservoir and readout”).

In “ReCA computations”, we show by example how the layers interact during an inference step, and how the states of the CA in several iterations are combined to form the reservoir output.

In the rest of this paper, without loss of generality, we only consider the case of one-dimensional time series \({ {\textbf{x}} = (x^{(0)},\ldots ,x^{(T-1)}) }\) with \({ x^{(t)} \in [-1, 1] }.\) If multidimensional data are to be used, the transformation, quantization, mapping, and encoding layers are adjusted to the input dimension. For data \({ x^{(t)} \notin [-1, 1] },\) the transformation and quantization need to be adapted. Furthermore, to improve the readability of the following definitions, the superscript (t) is omitted in the rest of this section when the time context is clear.

Transformation

We separate the transformation layer into two steps. First, we apply a transformation function \({\tilde{{\textbf{x}}}_{\tau }=\tau ({\textbf{x}})}\) to the input. Second, we scale the transformed input to the range \(x_{\tau } \in [0, m-1]\) since we require this input range in the subsequent quantization layer. To analyze the effect of different fixed-point number representations on the overall ReCA performance, we included the following transformation functions in our experiments:

  • complement

    $$\begin{aligned} {\tilde{x}}_{\tau } = {\left\{ \begin{array}{ll} x, &{\text {if}\ x \in [0,1]}, \\ 2+x, &{\text {otherwise}}. \end{array}\right. } \end{aligned}$$
    (9)
  • gray and scale_offset

    $$\begin{aligned} {\tilde{x}}_{\tau } = x + 1. \end{aligned}$$
    (10)
  • sign_value

    $$\begin{aligned} {\tilde{x}}_{\tau } = {\left\{ \begin{array}{ll} x, &{\text {if}\ x \in [0,1]}, \\ -x+1, &{\text {otherwise}}. \end{array}\right. } \end{aligned}$$
    (11)

Rescaling is then done using

$$\begin{aligned} x_{\tau } = \frac{m-1}{2} {\tilde{x}}_{\tau }. \end{aligned}$$
(12)

Using these transformations, we are able to mimic different floating-point to fixed-point conversion methods. With the complement transformation, the numbers are represented similarly to two’s complement, while sign_value uses a binary sign-and-value representation. The scale_offset approach shifts the input range to only positive numbers and then uses the default binary representation. The gray transformation uses the same shift but encodes the values using Gray code. The conversion to Gray code is only correct if m is a power of two; otherwise, neighboring values might not differ in only one bit.

Quantization

To quantize the input values, we use the typical rounding approach

$$\begin{aligned} {\tilde{x}}_q = {\left\{ \begin{array}{ll} 0, &{\text {if}\ x_{\tau } \in [0,0.5)}, \\ 1, &{\text {if}\ x_{\tau } \in [0.5,1.5)}, \\ 2, &{\text {if}\ x_{\tau } \in [1.5,2.5)}, \\ &\vdots \\ m-1, &{\text {if}\ x_{\tau } \in [m-1.5,m-1]}. \end{array}\right. } \end{aligned}$$
(13)

In the case of the gray transformation, the quantized input \({{\tilde{x}}_q}\) is transformed once more, leading to the final quantized value

$$\begin{aligned} x_q = {\left\{ \begin{array}{ll} {\tilde{x}}_q \oplus ({\tilde{x}}_q \gg 1) \mod m, &{\text {if }} \textit{gray},\\ {\tilde{x}}_q, &{\text {else}}, \end{array}\right. } \end{aligned}$$
(14)

with \({\oplus }\) representing the binary bitwise exclusive-or and \({\gg }\) the binary right-shift operation.
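For illustration, the transformation and quantization layers (Eqs. (9)–(14)) can be sketched as follows. We implement the rounding of Eq. (13) as \(\lfloor x_{\tau } + 0.5 \rfloor ,\) which reproduces its half-open intervals exactly; the function and argument names are ours.

```python
import numpy as np

def transform(x, m, method="scale_offset"):
    """Transformation layer, Eqs. (9)-(12), for inputs x in [-1, 1]."""
    if method == "complement":
        x_t = np.where(x >= 0, x, 2 + x)      # Eq. (9)
    elif method in ("gray", "scale_offset"):
        x_t = x + 1                           # Eq. (10)
    else:                                     # sign_value
        x_t = np.where(x >= 0, x, -x + 1)     # Eq. (11)
    return (m - 1) / 2 * x_t                  # rescaling, Eq. (12)

def quantize(x_tau, m, method="scale_offset"):
    """Quantization layer, Eqs. (13)-(14)."""
    x_q = np.minimum(np.floor(x_tau + 0.5).astype(int), m - 1)  # Eq. (13)
    if method == "gray":
        x_q = (x_q ^ (x_q >> 1)) % m          # Eq. (14): binary-reflected Gray code
    return x_q

m = 4
print(quantize(transform(np.array([0.816]), m), m))  # -> [3]
```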

Mapping

Yilmaz [85] mentions that multiple random projections of the input into the reservoir are necessary to achieve low errors. However, instead of implementing multiple separate CA reservoirs as in [85], we follow the design described in [5] and subdivide a single CA lattice into multiple parts. To this end, we divide the lattice of the CA in the reservoir into \({N_r}\) compartments. Each compartment has the same number of \({N_c}\) cells. For example, a lattice of size \({N=512}\) divided into \({N_r=16}\) compartments with \({N_c=32}\) cells each is described by the tuple \({ (N_r, N_c)=(16, 32) }\) with \({ N = N_r N_c }.\) The mapping layer selects the cells of the CA that should receive the input value. Out of each compartment, one cell is randomly selected into which the input is encoded in the next step. This random mapping is fixed once and does not change. It can be modeled as a Boolean mask vector \({{\textbf{p}}} \in \{0,1\}^N,\) in which the elements representing the cells of the CA lattice that shall receive the input are set to one, and all other elements are set to zero. The mapped input values \({\textbf{x}}_p,\) as well as the masked CA state vectors \({\textbf{s}}_p\) and \({\textbf{s}}_{\lnot p},\) can then be obtained by multiplication with the mask vector according to:

$$\begin{aligned} {\textbf{x}}_p&= x_q{\textbf{p}},\nonumber \\ {\textbf{s}}_p&= {\textbf{s}} \odot {\textbf{p}},\nonumber \\ {\textbf{s}}_{\lnot p}&= {\textbf{s}} \odot \lnot {\textbf{p}}, \end{aligned}$$
(15)

where \(\odot \) denotes the Hadamard product and \(\lnot \) denotes logical negation.

Encoding

Since the mapping layer only defines into which cells the quantized input \({x_q}\) should be encoded, we have to define how the encoding is actually done. In our experiments, we tested the following commonly used encoding functions. Let \(\bar{{\textbf{s}}}\) be the initial state of the CA in the reservoir, \({\textbf{x}}_p\) be the mapped input values, and \(\bar{{\textbf{s}}}_p\) and \(\bar{{\textbf{s}}}_{\lnot p}\) be the masked initial state vectors, as defined in “Mapping”. Furthermore, let the XOR \(\oplus \) and absolute value \({(~)^{\vert \cdot |}}\) operations be defined element-wise. Using modulo m arithmetic, the encoded CA state \({\textbf{s}}\) is then defined by:

  • replacement encoding [85]

    $$\begin{aligned} {\textbf{s}} = {\textbf{x}}_p + \bar{{\textbf{s}}}_{\lnot p}, \end{aligned}$$
    (16)
  • bitwise xor encoding [47, 58]

    $$\begin{aligned} {\textbf{s}} = ({\textbf{x}}_p \oplus \bar{{\textbf{s}}}_p) + \bar{{\textbf{s}}}_{\lnot p}. \end{aligned}$$
    (17)

Additionally, we analyzed the following new encoding functions:

  • additive encoding

    $$\begin{aligned} {\textbf{s}} = ({\textbf{x}}_p + \bar{{\textbf{s}}}_p) + \bar{{\textbf{s}}}_{\lnot p}, \end{aligned}$$
    (18)
  • subtractive encoding

    $$\begin{aligned} {\textbf{s}} = \left( {\textbf{x}}_p - \bar{{\textbf{s}}}_p\right) ^{|\cdot |} + \bar{{\textbf{s}}}_{\lnot p}. \end{aligned}$$
    (19)

The states of the cells not selected by the mapping layer do not change during the encoding process. The replacement encoding overwrites the information stored in the affected cells of the CA with the new input. This is different for the xor encoding, which combines the new input with the current cell states and is, next to replacement encoding, commonly used in ReCA. To analyze the influence of small changes in the encoding, we use the additive and subtractive encoding schemes, which slightly differ from the xor encoding.
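A sketch of the mapping mask and the four encoding functions (Eqs. (15)–(19)); the lattice configuration and the random seed are arbitrary example choices.

```python
import numpy as np

m, N_r, N_c = 4, 3, 4
N = N_r * N_c
rng = np.random.default_rng(0)

# Mapping: one randomly chosen cell per compartment; the Boolean mask p
# is drawn once and then kept fixed (Eq. (15)).
p = np.zeros(N, dtype=int)
for c in range(N_r):
    p[c * N_c + rng.integers(N_c)] = 1

def encode(x_q, s_bar, method="xor"):
    """Encoding layer, Eqs. (16)-(19), in modulo-m arithmetic."""
    x_p = x_q * p                                  # mapped input
    s_p, s_not_p = s_bar * p, s_bar * (1 - p)      # masked state vectors
    if method == "replacement":
        return (x_p + s_not_p) % m                 # Eq. (16)
    if method == "xor":
        return ((x_p ^ s_p) + s_not_p) % m         # Eq. (17)
    if method == "additive":
        return (x_p + s_p + s_not_p) % m           # Eq. (18)
    return (np.abs(x_p - s_p) + s_not_p) % m       # subtractive, Eq. (19)

s_bar = rng.integers(m, size=N)   # some current reservoir state
s = encode(3, s_bar)              # encode the quantized input x_q = 3
```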

Reservoir and readout

After encoding, the CA in the reservoir iterates the encoded state \({\textbf{s}}\) for a fixed number I of iterations, leading to the iterated cell states \({\hat{{\textbf{s}}}^{(1)}, \dots , \hat{{\textbf{s}}}^{(I)}}\). The cell states after each iteration are concatenated to form the reservoir output \({\textbf{r}}\) [85]. The following readout layer then computes the weighted sum of the reservoir output

$$\begin{aligned} {\textbf{y}} = \varvec{\textsf{U}}{\textbf{r}}+{\textbf{b}}, \end{aligned}$$
(20)

with the weight matrix \({\varvec{\textsf{U}}}\) and bias \({{\textbf{b}}}.\) Since \({\varvec{\textsf{U}}}\) and \({{\textbf{b}}}\) are the only trainable parameters in the ReCA model, a simple linear regression can be used. To simplify the notation, it will be assumed that the input to the readout layer \({{\textbf{r}}}\) has a 1 appended to also include the bias \({{\textbf{b}}}\) in the weight matrix \({\varvec{\textsf{U}}}.\)

To train the ReCA model, the reservoir output \({{\textbf{r}}^{(t)}}\) is concatenated for each input \({{\textbf{x}}^{(t)}}\) into \({\varvec{\textsf{R}}}.\) Furthermore, the ground truth solutions \({\bar{{\textbf{y}}}^{(t)}}\) are concatenated in the same way to generate \({\bar{\varvec{\textsf{Y}}}}.\) When using ordinary least squares, the weight matrix \({\varvec{\textsf{U}}}\) can be calculated by

$$\begin{aligned} \varvec{\textsf{U}} = {\left( \varvec{\textsf{R}}^{\textrm{T}}\varvec{\textsf{R}}\right) }^{-1}\varvec{\textsf{R}}^{\textrm{T}} \bar{\varvec{\textsf{Y}}}. \end{aligned}$$
(21)

There are many adaptations of the linear regression algorithm. For example, Tikhonov regularization [25], also called L2 regularization, can be added, resulting in Ridge Regression [28]. It is also possible to run linear regression in an online and sequential fashion [42]. We use Ridge Regression in our experiments.
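A minimal sketch of the readout and its Ridge-Regression training, i.e., Eq. (21) extended by an L2 term, with the bias folded into \({\varvec{\textsf{U}}}\) via an appended constant one; the regularization strength is a placeholder value.

```python
import numpy as np

def train_readout(R, Y, alpha=1e-6):
    """Ridge regression: Eq. (21) plus Tikhonov term alpha*I. R stacks the
    reservoir outputs r^(t) row-wise, Y the ground-truth targets."""
    R1 = np.hstack([R, np.ones((R.shape[0], 1))])  # append 1 to absorb the bias b
    A = R1.T @ R1 + alpha * np.eye(R1.shape[1])
    return np.linalg.solve(A, R1.T @ Y)            # U, with the bias in its last row

def readout(U, r):
    """Model output per Eq. (20)."""
    return np.append(r, 1.0) @ U
```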

ReCA computations

Fig. 5
figure 5

Example of ReCA computation for an input sample \({ x^{(t)}=0.816 }\) that gets transformed (scale_offset method) and quantized to \({x_q^{(t)}=3},\) and an initial state \(\bar{{\textbf{s}}}^{(t)}\) at time t as depicted. The model uses a linear CA with rule weights \({\textbf{w}} = (1,0,1)\) and lattice configuration \({(N_r,N_c)=(3,4)},\) xor encoding and \({I=4}\) steps. The state of the CA after the ith iteration is denoted by \({\hat{{\textbf{s}}}^{(i)}}.\) The colors of the lattice indicate the three compartments of the CA

During inference, an input sample \({x^{(t)}}\) passes through each of the aforementioned layers of the ReCA model. An example is depicted in Fig. 5, in which the input sample \({x^{(t)}=0.816},\) with \({-1 \le x^{(t)} \le 1},\) is fed into the specified model and passed to the Transformation/Quantization layer. This layer applies the scale_offset transformation to the input sample, resulting in \({{\tilde{x}}_{\tau }^{(t)} = 1.816}\) and \({x_{\tau }^{(t)} = 2.724}.\) Since \({x_{\tau }^{(t)} \in \left[ 2.5, 3\right] },\) the input sample gets quantized to \({x_q^{(t)} = 3}\) (see “Transformation” and “Quantization” and Fig. 5, part I). After quantization, the quantized input sample is mapped onto randomly selected cells of the CA lattice as specified by the Boolean mask vector \({\textbf{p}},\) such that each of the three compartments receives the input once. The mapped input sample is given by \({\textbf{x}}_p\) (see “Mapping” and Fig. 5, part II). The mapped input sample is then encoded into the initial reservoir state \(\bar{{\textbf{s}}}^{(t)}\) using bitwise XOR encoding, resulting in the encoded initial CA state \({\textbf{s}}^{(t)}\) (see “Encoding” and Fig. 5, part III). For example, \({s_4^{(t)} = x_q^{(t)} \oplus {\bar{s}}_4^{(t)} = 3 \oplus 1}.\) Rewritten in binary notation, it is easy to see that \({s_4^{(t)} = 11_2 \oplus 01_2 = 10_2 = 2}.\) The encoded initial state forms the starting point \({\hat{{\textbf{s}}}^{(0)} = {\textbf{s}}^{(t)}}\) for the evolution of the CA, which is then executed for a fixed number of iterations \({I \in {\mathbb {N}}^{+}},\) such that \({\hat{{\textbf{s}}}^{(0)}}\) evolves under the repeated application of the Linear CA rule to \({\hat{{\textbf{s}}}^{(I)}}\) (see Fig. 5, part IV, with \(I=4\)). After the execution of the CA finishes, the reservoir outputs the concatenated CA states \({{\textbf{r}}^{(t)} = \left[ \hat{{\textbf{s}}}^{(1)},\ldots , \hat{{\textbf{s}}}^{(4)} \right] }\) (see Fig. 5, part V), as mentioned in “Reservoir and readout”. The last iterated CA state \({\hat{{\textbf{s}}}^{(I)}}\) is used as the initial reservoir state \({\bar{{\textbf{s}}}^{(t+1)}}\) for the next input sample \({x^{(t+1)}}.\) Finally, the reservoir output \({\textbf{r}}^{(t)}\) is passed to the readout layer, which applies the linear readout trained via linear regression to compute the model output \({\textbf{y}}^{(t)}\) (see “Reservoir and readout” and Fig. 5, part VI).
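The following self-contained sketch reproduces this data flow. The random seed, and hence the concrete mask and initial state, are arbitrary choices, so the individual cell values differ from Fig. 5, while the quantized input \({x_q^{(t)} = 3}\) and the overall flow match.

```python
import numpy as np

m, N_r, N_c, I, r = 4, 3, 4, 4, 1
w = (1, 0, 1)                        # linear rule coefficients (w_{-1}, w_0, w_1)
N = N_r * N_c
rng = np.random.default_rng(0)

p = np.zeros(N, dtype=int)           # part II: fixed random mapping mask
for c in range(N_r):
    p[c * N_c + rng.integers(N_c)] = 1

def reca_step(x, s_bar):
    x_tau = (m - 1) / 2 * (x + 1)                          # part I: scale_offset
    x_q = min(int(np.floor(x_tau + 0.5)), m - 1)           # part I: quantization
    s = (((x_q * p) ^ (s_bar * p)) + s_bar * (1 - p)) % m  # part III: xor encoding
    states = []
    for _ in range(I):                                     # part IV: iterate the CA
        s = sum(w[j + r] * np.roll(s, -j) for j in range(-r, r + 1)) % m
        states.append(s)
    return np.concatenate(states), states[-1]              # part V: r^(t), next initial state

r_t, s_bar_next = reca_step(0.816, rng.integers(m, size=N))
assert len(r_t) == I * N   # the reservoir output feeds the readout (part VI)
```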

Hyperparameters

Since the trainable parameters in the readout layer can be optimized using simple linear optimization techniques, a crucial step in designing ReCA models is the choice of hyperparameters. In our analysis, we focus on the following general hyperparameters:

  • Number of states m of the CA: This has an influence on the operation domain of the CA since \({{\mathbb {Z}}_{m}}\) is either a field (if m is prime) or a ring (if m is non-prime). It significantly affects the mathematical properties and thus the dynamic behavior of the CA. Since m defines the number of possible states of each cell, it also influences the linear separability of the reservoir output in the readout layer.

  • True neighborhood \({\hat{n}}\): The size of the neighborhood influences the expansion rate of local information on the lattice and thus also affects the dynamic behavior.

  • Lattice size N: This impacts the size of the dynamical system and thus affects the complexity of the CA.

  • Subdivision of the lattice into \(N_r\) compartments with \(N_c\) cells each: This influences the mapping of the input samples onto the reservoir cells.

  • Number of iterations I of the CA per input sample: This influences the degree of interactions between the cells per input sample.

  • Transformation and quantization: The choice of transformation and quantization functions defines how the input data is presented to the dynamical system.

  • Mapping and encoding: The mapping and encoding methods define how the input is inserted into the state of the dynamical system.

Besides the general hyperparameters, the hyperparameter F, i.e., the global rule of the CA, is of particular importance because it essentially defines the fundamental basis of the dynamics and topological properties of the CA. As the rule space of Linear CAs grows exponentially with respect to m and \({\hat{n}},\) guidance is vital when it comes to hyperparameter selection in the design process of ReCA models. Since we restrict our analysis to linear rules, we term the respective framework ReLiCA.

It is important to note that all of these hyperparameters have interdependent effects on the overall behavior of the CA and, in turn, on the performance of the ReLiCA model in time series processing tasks.

Mathematical parameters

This section introduces the mathematical parameters we use to analyze Linear CA rules. Depending on m, \({{\mathbb {Z}}_m}\) is a finite field if m is prime, and otherwise a finite ring. This has several mathematical consequences, e.g., concerning the existence of unique multiplicative inverses. Unless otherwise noted, we assume the more general case where m is not prime (\({{\mathbb {Z}}_{m}}\) is a ring).

We define the prime factor decomposition of m as

$$\begin{aligned} m=p_1^{k_1} \cdots p_h^{k_h} \end{aligned}$$
(22)

with the set of prime factors as

$$\begin{aligned} {\mathscr {P}}= \{p_1, \ldots , p_h\} \end{aligned}$$
(23)

and their multiplicities

$$\begin{aligned} \mathcal {K} = \{k_1,\ldots ,k_h\}. \end{aligned}$$
(24)

The set of prime weights can be generated using

$$\begin{aligned} {\mathscr {P}}_w = \{s \in {\mathbb {Z}}_m {\setminus } \{0\} :\gcd (s,m)=1\} \end{aligned}$$
(25)

and the set of non-prime weights using

$$\begin{aligned} {\bar{{\mathscr {P}}}}_w = \{s \in {\mathbb {Z}}_m {\setminus } \{0\} :\gcd (s,m) \ne 1\}, \end{aligned}$$
(26)

where gcd denotes the greatest common divisor.

Transient and cycle lengths

The behavior of a CA over time can be separated into a transient phase of length k and a cyclic phase of length c. For Linear CAs, this can be expressed as

$$\begin{aligned} \varvec{\textsf{W}}^{k}{\textbf{s}}^{(0)} = \varvec{\textsf{W}}^{k+c}{\textbf{s}}^{(0)}, \end{aligned}$$
(27)

with the circulant rule matrix \({\varvec{\textsf{W}}}\) and the initial configuration \({{\textbf{s}}^{(0)}}\) [49, 50]. The decomposition of the state space of a CA into transients and cycles gives further information about its dynamic behavior. A Linear CA with no transient phase has no Garden-of-Eden (GoE) states. GoE states have no predecessors and can thus only appear as initial states if the CA has a transient phase. For the computation of transient and cycle lengths, we refer the interested reader to [36, 50, 62, 63, 70, 71, 87].

Cyclic subgroup generation

A cyclic subgroup is generated by a generator element g. This generator can be used to generate the multiplicative

$$\begin{aligned} {\mathscr {S}}^{\times }(g) = \{g^0, g^1, g^2, \ldots , g^{(m-1)}\} \end{aligned}$$
(28)

and additive

$$\begin{aligned} {\mathscr {S}}^{+}(g) = \{0, g, 2g, \ldots , (m-1)g\} \end{aligned}$$
(29)

cyclic subgroups [21, 23, 43].

The order of the cyclic additive subgroup \({|{\mathscr {S}}^+(g)|}\) can be calculated by [21, 23]

$$\begin{aligned} |{\mathscr {S}}^+(g)| = \frac{m}{\gcd (m,g)}. \end{aligned}$$
(30)

We use the order of cyclic subgroups to analyze whether the set of possible states during an iteration of a linear CA shrinks or not.
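For illustration, Eq. (30) in code; the example values in \({\mathbb {Z}}_8\) are our own.

```python
from math import gcd

def additive_order(g, m):
    """Order of the cyclic additive subgroup S^+(g), Eq. (30)."""
    return m // gcd(m, g)

assert additive_order(2, 8) == 4   # 2 generates only {0, 2, 4, 6} in Z_8
assert additive_order(3, 8) == 8   # g coprime to m generates all of Z_8
```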

Topological properties

For a mathematical analysis, it is often convenient to consider infinite Linear CAs whose lattice consists of infinitely many cells [81]. Hence, further properties of infinite one-dimensional CAs can be defined that characterize the behavior of the CA as a dynamical system. For some properties, we only give informal and intuitive descriptions. Formal definitions can be found in [7, 10, 17, 46]. In the following, the symbol \(\exists _n\) denotes “there exist exactly n”.

State space and orbit Intuitively, the set of all possible lattice configurations of an infinite CA can be thought of as forming a state space. Furthermore, a notion of distance that induces a metric topology on the state space can be introduced. For a detailed definition, we refer the interested reader to [46]. An individual element in this set is a specific state configuration of the lattice. The series of points in the state space during the operation of an infinite Linear CA, i.e., the path \({({\textbf{s}}^{(0)}, \ldots , {\textbf{s}}^{(I)})}\) along the visited lattice configurations under iteration of F for I iterations with initial configuration \({{\textbf{s}}^{(0)}},\) is called an orbit. Based on this topological framework, further properties of the dynamic behavior of Linear one-dimensional CAs can be computed that characterize it for the asymptotic case \({N \rightarrow \infty }.\) However, only finite lattices can be realized in practical implementations and simulations of CAs, for which periodic boundary conditions have only limited influence on the behavior of the CA compared to static boundary conditions [44].

Fig. 6
figure 6

Iteration diagram of Linear CA with \({m=4},\) \({{\hat{n}}=3},\) \({N=12},\) \({{\textbf{w}} = (0,2,1)}\) (resulting in \({H=2}\)) and a a single cell initialized with state 1 (impulse) or b random initial configuration for \({I=9}\) iterations. Figures c and d have the same setup, but with \({{\textbf{w}} = (1,2,1)}\) (resulting in \({H=4}\)). The colors indicate different cell states in \({{\mathbb {Z}}_m}\)

Topological entropy The topological entropy is a measure of the uncertainty of a dynamical system under repeated application of its mapping function (the global rule F for infinite Linear CAs), starting with a partially defined initial state [17]. It can be used to characterize the asymptotic behavior of the system during operation. Since discrete and finite dynamical systems fall into periodic state patterns, the topological entropy gives an idea of the complexity of the orbit structure and can be used to distinguish ordered from chaotic dynamical systems. Consider, for example, two runs of the same (infinite) Linear CA with different initial configurations that are close in the state space. If the Linear CA has a low entropy, the final states of the two runs are also likely to be close in the state space [7]. If, however, the CA had a high topological entropy, it would show chaotic behavior and likely produce diverging orbits during the two runs even though the initial states were close. Hence, a high entropy leads to increased uncertainty in the dynamical system’s behavior. This behavior can also be seen in Fig. 6, where the orbits of the rule with smaller entropy (Fig. 6a and b) show less chaotic behavior compared to the orbits of the rule with higher entropy (Fig. 6c and d).

The topological entropy (probabilistic approach) is closely related to the Lyapunov exponents (geometric approach) and can be computed based thereon. Assuming a CA over \({{\mathbb {Z}}_m},\) with the prime factor decomposition in Eq. (22), we define for \({i=1, \ldots , h}\)

$$\begin{aligned} \mathcal {P}_i&= \left\{ 0 \right\} \cup \left\{ j: \gcd \left( w_j, p_i \right) = 1 \right\} ,\nonumber \\ L_i&= \min \mathcal {P}_i, \nonumber \\ R_i&= \max \mathcal {P}_i, \end{aligned}$$
(31)

with \({w_j}\) as defined in Eq. (5). Then, the left \({\lambda ^{-}}\) and right \({\lambda ^{+}}\) Lyapunov exponents are [17]

$$\begin{aligned} \lambda ^{-}&= \max _{1 \le i \le h}\left\{ R_i \right\} ,\nonumber \\ \lambda ^{+}&= -\min _{1 \le i \le h}\left\{ L_i \right\} . \end{aligned}$$
(32)

The topological entropy can be calculated using [17]

$$\begin{aligned} {\mathscr {H}}= \sum _{i=1}^{h} k_i \left( R_i - L_i \right) \log _2\left( p_i \right) . \end{aligned}$$
(33)

To be able to compare the topological entropy of a CA acting on different-sized finite rings, we introduce the normalized topological entropy

$$\begin{aligned} {\widetilde{{\mathscr {H}}}}= \frac{{\mathscr {H}}}{\sum _{i=1}^{h}k_i\log _2\left( p_i \right) } = \frac{{\mathscr {H}}}{\log _2\left( m \right) } \end{aligned}$$
(34)

with m as defined in Eq. (22). For prime power rings, \({{\widetilde{{\mathscr {H}}}}}\) takes only integer values, where \({{\widetilde{{\mathscr {H}}}}= 1}\) is the smallest nonzero entropy, \({{\widetilde{{\mathscr {H}}}}= 2}\) the second smallest, etc.
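As a check, Eqs. (31)–(34) can be scripted directly; the two example rules are those of Fig. 6, and all function and variable names are ours.

```python
from math import gcd, log2

def prime_factorization(m):
    """Return {p_i: k_i} per Eq. (22)."""
    factors, d = {}, 2
    while d * d <= m:
        while m % d == 0:
            factors[d] = factors.get(d, 0) + 1
            m //= d
        d += 1
    if m > 1:
        factors[m] = factors.get(m, 0) + 1
    return factors

def entropy(w, m):
    """Lyapunov exponents and (normalized) topological entropy of a linear
    CA with coefficients w = (w_{-r}, ..., w_r), per Eqs. (31)-(34)."""
    r = (len(w) - 1) // 2
    H, L, R = 0.0, [], []
    for p, k in prime_factorization(m).items():
        P_i = [0] + [j for j in range(-r, r + 1) if gcd(w[j + r], p) == 1]
        L.append(min(P_i)); R.append(max(P_i))
        H += k * (R[-1] - L[-1]) * log2(p)       # Eq. (33)
    return max(R), -min(L), H, H / log2(m)       # Eqs. (32) and (34)

# The two rules of Fig. 6 for m = 4:
print(entropy((0, 2, 1), 4))   # -> (1, 0, 2.0, 1.0), i.e., H = 2
print(entropy((1, 2, 1), 4))   # -> (1, 1, 4.0, 2.0), i.e., H = 4
```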

Equicontinuity A Linear CA is said to be equicontinuous (or stable) if any two states within a fixed size neighborhood in the state space diverge by at most some upper bound distance under iteration of F [46]. Equicontinuity is given if the Linear CA fulfills the condition [46]

$$\begin{aligned} (\forall p \in {\mathscr {P}}): p \mid \gcd (m, w_{-r}, \ldots , w_{-1}, w_1, \ldots , w_r). \end{aligned}$$
(35)

Sensitivity On the other hand, a Linear CA is sensitive to initial conditions if, for any initial state \({{\textbf{s}}^{(0)}},\) there exists another distinct initial state in any arbitrarily small neighborhood of \({{\textbf{s}}^{(0)}},\) such that both orbits diverge by at least some lower bound distance [46]. If the condition

$$\begin{aligned} (\exists p \in {\mathscr {P}}): p \not \mid \gcd (m, w_{-r}, \ldots , w_{-1}, w_1, \ldots , w_r) \end{aligned}$$
(36)

is fulfilled, the corresponding CA is sensitive [46].

Expansivity Suppose the orbits of any two different states in the state space diverge by at least some lower bound distance under forward iteration of F. In that case, the corresponding CA is called positively expansive [46]. Compared to sensitivity, positive expansivity is a stronger property. Positive expansivity is given for a Linear CA if [46]

$$\begin{aligned} \gcd (m, w_{-r}, \ldots , w_{-1})=\gcd (m, w_{1}, \ldots , w_{r})=1. \end{aligned}$$
(37)

For invertible infinite Linear CAs, this concept can be generalized by additionally considering backward iteration of F and calling such CAs expansive [46]. The condition for expansivity is the same as Eq. (38) for Linear CAs.

Transitivity Transitivity is given for a Linear CA if it has states that eventually move, under iteration of F, from one arbitrarily small neighborhood to any other [10]. In other words, the Linear CA cannot be divided into independent subsystems. Codenotti and Margara [15] showed that, for CAs, transitivity implies sensitivity. The condition for transitivity of a Linear CA is [10]

$$\begin{aligned} \gcd (m, w_{-r}, \ldots , w_{-1}, w_1, \ldots , w_r)=1. \end{aligned}$$
(38)

In addition, strong transitivity is given if a CA has orbits that include every state of its state space. For strong transitivity, a Linear CA must fulfill the condition [46]

$$\begin{aligned} (\forall p \in {\mathscr {P}})(\exists w_i,w_j): p \not \mid w_i \wedge p \not \mid w_j. \end{aligned}$$
(39)

Ergodicity In contrast to transitivity, ergodicity concerns statistical properties of the orbits of a dynamical system. While transitivity indicates that the state space of infinite linear CAs cannot be separated, ergodicity, intuitively, denotes the fact that typical orbits of almost all initial states (except for a set of points with measure zero) in any subspace under iteration of F eventually revisit the entire set with respect to the normalized Haar measure [9, 67]. Cattaneo et al. [9] show that, for infinite linear CAs, ergodicity and transitivity are equivalent. The condition for a linear CA to be ergodic is the same as Eq. (38).

Regularity If cyclic orbits are dense in the state space for an infinite Linear CA, then it is denoted as regular [10]. Regularity is defined for Linear CA by condition [10]

$$\begin{aligned} \gcd (m, w_{-r}, \ldots , w_r)=1. \end{aligned}$$
(40)

Surjectivity and injectivity The global rule F of a Linear CA is surjective if every state configuration has a predecessor. Thus, surjective CAs have no GoE states and no transient phase [77]. Cattaneo et al. [9] showed that transitive CAs are surjective. For one-dimensional CAs, surjectivity is equivalent to regularity of the global rule F [10]. Surjectivity for F is given if condition (40) is fulfilled [31].

Injectivity of F denotes the fact that every state has at most one predecessor. Every injective CA is also surjective [46]. If F is surjective and injective, the CA is called bijective, which is equivalent to reversibility [77]. The condition for injectivity of a Linear CA is given by [31]

$$\begin{aligned} (\forall p \in {\mathscr {P}})(\exists _1 w_i): p \not \mid w_i. \end{aligned}$$
(41)

Chaos The behavior of dynamical systems can range from ordered to chaotic. The framework of dynamical systems lacks a precise and universal definition of chaos. However, there is widespread agreement that chaotic behavior is based on sensitivity, transitivity, and regularity [19]. Manzini and Margara [46] identified five classes of increasing degree of chaos for Linear CAs: equicontinuous CAs, sensitive but not transitive CAs, transitive but not strongly transitive CAs, strongly transitive but not positively expansive CAs, and positively expansive CAs. Since for Linear CAs, transitivity implies sensitivity and surjectivity, whereby the latter is in turn equivalent to regularity, transitive Linear CAs can be classified as topologically chaotic [9].
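The gcd conditions of Eqs. (35)–(41) translate directly into code; the helper and key names are ours, and we read Eq. (39) as requiring, per prime divisor of m, at least two coefficients not divisible by it.

```python
from math import gcd
from functools import reduce

def prime_divisors(m):
    ps, d = [], 2
    while d * d <= m:
        if m % d == 0:
            ps.append(d)
            while m % d == 0:
                m //= d
        d += 1
    if m > 1:
        ps.append(m)
    return ps

def properties(w, m):
    """Topological properties of a linear CA with coefficients
    w = (w_{-r}, ..., w_r), per the gcd conditions of Eqs. (35)-(41)."""
    r = (len(w) - 1) // 2
    left, right = list(w[:r]), list(w[r + 1:])
    g_side = reduce(gcd, left + right, m)    # gcd of m and all w_i except w_0
    P = prime_divisors(m)
    return {
        "equicontinuous":      all(g_side % p == 0 for p in P),           # (35)
        "sensitive":           any(g_side % p != 0 for p in P),           # (36)
        "pos_expansive":       reduce(gcd, left, m) == 1
                               and reduce(gcd, right, m) == 1,            # (37)
        "transitive":          g_side == 1,                               # (38)
        "strongly_transitive": all(sum(wi % p != 0 for wi in w) >= 2
                                   for p in P),                           # (39)
        "surjective":          reduce(gcd, list(w), m) == 1,              # (40)
        "injective":           all(sum(wi % p != 0 for wi in w) == 1
                                   for p in P),                           # (41)
    }

print(properties((1, 0, 1), 2))  # rule 90: positively expansive and transitive,
                                 # but not injective (two odd coefficients)
```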

Error metric

To be able to compare different models, we use the mean-squared error (MSE)

$$\begin{aligned} \text {MSE}({\textbf{y}}, \bar{{\textbf{y}}}) = \frac{1}{n} \sum ^{n}_{i=1} {\left( {\bar{y}}_i - y_i\right) }^2 \end{aligned}$$
(42)

and the Normalized Mean Squared Error (NMSE)

$$\begin{aligned} {\text {NMSE}}({\textbf{y}}, \bar{{\textbf{y}}}) = \frac{\text {MSE}({\textbf{y}}, \bar{{\textbf{y}}})}{\textrm{Var}(\bar{{\textbf{y}}})} \end{aligned}$$
(43)

with the ground truth \({\bar{{\textbf{y}}}}\) and the prediction of the model \({{\textbf{y}}}.\)
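For reference, both metrics in code:

```python
import numpy as np

def mse(y, y_bar):
    """Mean-squared error, Eq. (42)."""
    return np.mean((y_bar - y) ** 2)

def nmse(y, y_bar):
    """Normalized mean-squared error, Eq. (43)."""
    return mse(y, y_bar) / np.var(y_bar)
```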

Fig. 7
figure 7

Empirical cumulative distribution functions for MG_25. The different configurations have the following number of data points: 960, 15,360, 8064

Proposed ReLiCA design algorithm

Since guidance in the choice of hyperparameters would greatly speed up and assist the design of ReLiCA models, we propose the Reservoir Computing using Linear Cellular Automata design algorithm (ReLiCADA). We start with an outline of ReLiCADA in “ReLiCADA outline”, including a short analysis of the influence of the CA rules and the transformation, quantization, mapping, and encoding layers on the ReLiCA model performance. The outline is followed by the formal definition of the ReLiCADA design constraints in “ReLiCADA design constraints”, the reasoning behind them in “Reasoning behind design constraints”, as well as pseudocode for the rule selection in “Rule selection pseudocode”. After that, we analyze the number of selected rules in “Number of rules selected by ReLiCADA” and discuss the relation to the Edge of Chaos in “Edge of Chaos”.

ReLiCADA outline

We propose the Reservoir Computing using Linear Cellular Automata design algorithm (ReLiCADA) to assist in the design of ReLiCA models. ReLiCADA constrains the hyperparameters of ReLiCA models to combinations that demonstrably lead to good performance in time series prediction tasks. The algorithm consists of a set of design constraints and is based on the evaluation of thousands of experiments, followed by a mathematical analysis of Linear CA properties. The main idea of ReLiCADA is to limit the search space of Linear CA rules from \({m^n}\) (see “Cellular Automata”) to a small number of promising rules, and to select matching transformation, quantization, mapping, and encoding functions. Another purpose of ReLiCADA is to identify ReLiCA models that produce low errors on a wide range of different datasets, and not only on a single pathological dataset like the 5-bit memory task.

The choice of the Linear CA rule and the transformation, quantization, mapping, and encoding functions significantly impacts the overall ReLiCA model performance. This can be seen in Fig. 7, where we depict the \({\text {NMSE}}\) of different ReLiCA models for the MG_25 dataset (other datasets produce similar results; see “Datasets” for dataset descriptions). For this analysis, we ran all possible CA rules with all combinations of the transformation and quantization configurations (complement, gray, scale_offset, sign_value) and the encoding functions (additive, replacement, subtractive, xor). The empirical cumulative distribution function specifies the proportion of ReLiCA models with the same or lower \({\text {NMSE}}.\) As the figure shows, only a tiny percentage of all ReLiCA models come close to the optimal performance for the chosen m and \({\hat{n}}\) (lower left part of Fig. 7), hindering random and heuristic search approaches, especially for complex CAs (larger m or \({\hat{n}}\)). To the best of our knowledge, to date there are no clear rules or guidelines on how to select the Linear CA in ReLiCA models. Thus, obtaining a well-performing ReLiCA model has remained challenging.

Our approach was to exhaustively test the performance of ReLiCA models with almost all combinations of the above-mentioned transformation, quantization, mapping, and encoding methods over the complete rule search space of several Linear CA configurations (\({{\hat{n}}},\) m, N, and I) on several different datasets. The experiments indicate that specific conditions on the choice of the model’s hyperparameters lead to an improvement in performance. For example, some transformation and quantization approaches are more robust against hyperparameter changes than others, and most of the generally well-performing Linear CA rules share common mathematical properties. We identified these common properties and described them in terms of the mathematical parameters defined in “Mathematical parameters”. The result is ReLiCADA, a set of design constraints that are applied to the hyperparameters of ReLiCA models. Thus, out of all possible hyperparameter configurations, ReLiCADA selects a small number of promising candidate models. A crucial part is the restriction of the exponentially growing Linear CA rule space to only a small subset of rules that are among the top-performing rules in the overall rule space. In doing so, ReLiCADA enormously reduces the design time of ReLiCA models because it avoids an exhaustive search over the whole Linear CA rule space, which is not feasible especially for more complex CAs. Instead, ReLiCADA enables the targeted testing of a few promising models that are sharply defined by the following conditions.

We use the definitions stated in “Mathematical parameters” to describe the design rules of ReLiCADA. We limited our analysis to \({|{\mathscr {P}}| \le 2},\) which will also be assumed in the description of the rule selection algorithm. This was done since we are primarily interested in \({{\mathbb {Z}}_m}\) with a single prime factor. Some of the proposed rules might also work or might be generalized for the case \({|{\mathscr {P}}| > 2},\) but we have not verified this.

ReLiCADA design constraints

ReLiCADA selects model configurations only if all of the following hyperparameter constraints are fulfilled:

$$\begin{aligned} \text {transformation} = \textit{scale\_offset}, \end{aligned}$$
(44a)
$$\begin{aligned} \text {quantization} = \textit{scale\_offset}, \end{aligned}$$
(44b)
$$\begin{aligned} \text {mapping} = \textit{random}, \end{aligned}$$
(44c)
$$\begin{aligned} \text {encoding} = \textit{replacement}, \end{aligned}$$
(44d)
$$\begin{aligned} (\forall p \in {\mathscr {P}})(\exists _1 w_i): p \not \mid w_i, \end{aligned}$$
(44e)
$$\begin{aligned} {\widetilde{{\mathscr {H}}}}= 1, \end{aligned}$$
(44f)
$$\begin{aligned} \text {remove mirrored rules}. \end{aligned}$$
(44g)

Additionally, the following conditions will only be used based on the choice of m:

  • if \({|{\mathscr {P}}| = 1}\) and \(k_1 \ne 1,\) i.e., m is a prime power and \({{\mathbb {Z}}_m}\) forms a ring:

    $$\begin{aligned} \exists _2 i: w_i \ne 0, \end{aligned}$$
    (45a)
    $$\begin{aligned} \forall w_i: (w_i \notin {\mathscr {P}}_w) \vee (w_i \in \{1, m-1\}), \end{aligned}$$
    (45b)
    $$\begin{aligned} \forall w_i: (w_i \notin {\bar{{\mathscr {P}}}}_w) \vee (|{\mathscr {S}}^+(w_i)|=4). \end{aligned}$$
    (45c)
  • if m is prime, i.e., \({{\mathbb {Z}}_m}\) forms a field:

    $$\begin{aligned} \forall w_i: (w_i=0) \vee ({\mathscr {S}}^{\times }(w_i)={\mathbb {Z}}_m {\setminus } 0), \end{aligned}$$
    (46)
  • if \({|{\mathscr {P}}| = 2},\) i.e., \({{\mathbb {Z}}_m}\) forms a ring:

    $$\begin{aligned} (\exists w_i, w_j): (p_1 \not \mid w_i) \wedge (p_2 \not \mid w_j). \end{aligned}$$
    (47)

Conditions (44a) to (44g) are always used, independent of the choice of m and \({{\hat{n}}},\) while conditions (45a) to (47) are only used depending on the choice of \({{\mathbb {Z}}_m}.\) If \({{\mathbb {Z}}_4}\) is chosen, it is impossible to fulfill constraint (45c). Because of this, it is ignored in the \({{\mathbb {Z}}_4}\) case.

The selection in Eq. (44g) between the rule \({{\textbf{w}}}\) and its mirrored rule \({\hat{{\textbf{w}}}}\) is made using the following condition

$$\begin{aligned} \sum _{i=-r}^{-1}w_i \le \sum _{i=1}^{r}w_i, \end{aligned}$$
(48)

which evaluates to true for only one of the two rules if \({{\textbf{w}}\ne \hat{{\textbf{w}}}}.\) If \({{\textbf{w}}=\hat{{\textbf{w}}}},\) this condition is always fulfilled. If the condition is true, we choose \({{\textbf{w}}},\) and otherwise \({\hat{{\textbf{w}}}}.\) The selection between \({{\textbf{w}}}\) and \({\hat{{\textbf{w}}}}\) is not optimized to increase the performance and is only used to further reduce the number of selected rules. Because of this, selection methods other than Eq. (48) are also possible.

Reasoning behind design constraints

The constraints (44a) to (44d) fix the transformation, quantization, mapping, and encoding methods to an evidently well-performing combination (see “General hyperparameters”). The conditions (44e), (44f), and (45a) to (47) belong to the rule selection for the Linear CA reservoir. To limit the number of selected rules even further, we added condition (44g).

Our experiments showed that nearly all of the generally well-performing rules are injective. Because of this, we included Eq. (44e) (see Eq. (41)). With this condition, the selected CAs are not only injective but also surjective and regular (see Eq. (40)). Moreover, due to the injectivity, the CAs have no transient phase and, thus, no GoE states [77]. Also due to injectivity, the selected CAs are neither strongly transitive (see Eq. (39)) nor positively expansive (see Eq. (37)).

Furthermore, \({{\widetilde{{\mathscr {H}}}}}\) has a significant impact on the ReCA model performance. Using Eq. (44f) ensures that the CA is sensitive (see Eq. (36)) as well as transitive, ergodic, and expansive (see Eq. (38)). While this would also be the case for other \({{\widetilde{{\mathscr {H}}}}}\) values, \({{\widetilde{{\mathscr {H}}}}=1}\) resulted, in most cases, in the best performance and has the advantage of the smallest possible neighborhood \({{\hat{n}} \ge 3},\) which reduces the complexity of hardware implementations. Constraint (44f) also implies that the CA is not equicontinuous (see Eq. (35)).

Conditions (45a) to (47) are not based on mathematical characteristics but were chosen to improve the ReCA model performance and reduce the overall number of rules.

Rule selection pseudocode

Algorithm 1 ReLiCADA - Rule Selection

For any given m and \({{\hat{n}}},\) Algorithm 1 implements the process of rule selection. Once again, we assume that m is limited to \(\vert {\mathscr {P}}\vert \le 2.\) To initialize the algorithm, the empty set \(\mathcal {S}\) for the selected rules is created (line 1). Furthermore, the prime factor decomposition (\({\mathscr {P}}, \mathcal {K},\) line 2) and the neighborhood radius (line 3) are calculated.

The algorithm then differentiates between the prime (Eq. (46), line 4), the combined prime (Eq. (47), line 12), and the prime power (Eqs. (45a)–(45c), line 30) cases.

For the prime case, lines 6 and 7 collect all weights w that fulfill constraint (46). Line 8 ensures that conditions (44e) and (44f) are satisfied. Since ReLiCADA does not select any rules for \({\hat{n}} \ne 3\) when m is prime, we added line 5.

The same holds for the combined prime case, which explains line 13. To realize condition (47), we generate the sets \(\mathcal {W}_i\) and \(\mathcal {W}_j\) containing all weights that are coprime to \(p_1\) and \(p_2,\) respectively (lines 14–23). Condition (44f) is always fulfilled by the limitation that each weight must be coprime to only one of the two prime factors of m. The resulting rules also satisfy constraint (44e). To generate all selected ReLiCADA rules, we iterate over all possible combinations of \(\mathcal {W}_i\) and \(\mathcal {W}_j\) (lines 24–28).

Lastly, we handle the prime power case. For this, we generate the set of used prime weights \(\mathcal {W}_p\) according to condition (45b) (line 31) and non-prime weights \(\mathcal {W}_{{\bar{p}}}\) according to condition (45c) (lines 32–41). Once again, we iterate over all possible combinations of the weight sets \(\mathcal {W}_p\) and \(\mathcal {W}_{{\bar{p}}}\) (lines 42–57). Each weight combination generates two rules selected by ReLiCADA, which will be denoted by u and v. These are initialized with all zeros (lines 44 and 45). For both rules, the prime weight is at index one to fulfill constraint (44f) (lines 46 and 47). The non-prime weights can be at different indices in the weight vector (lines 48–54). By using a single prime and a single non-prime weight, constraints (44e) and (45a) are taken into account.

To finally generate the rules selected by ReLiCADA, we need to check condition (44g). This is done in line 59, where the function check_mirror checks if all rules in \(\mathcal {S}\) satisfy Eq. (48) and flips their weight vector otherwise.

The returned set of rules fulfills conditions (44e) to (44g) and (45a) to (47). The remaining conditions of ReLiCADA, Eqs. (44a) to (44d), do not influence the CA rule and are thus not included in Algorithm 1.
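
To make conditions (46) and (47) concrete, the following Python sketch checks them under our reading of \({\mathscr {S}}^{\times }(w)\) as the multiplicative span \(\{w^k \bmod m : k \ge 1\};\) the prime power conditions (45a)–(45c) and the full rule assembly of Algorithm 1 are omitted here:

def mult_span(w, m):
    """Multiplicative span S^x(w) = {w^k mod m : k >= 1}."""
    span, x = set(), w % m
    while x not in span:
        span.add(x)
        x = (x * w) % m
    return span

def satisfies_46(w, m):
    """Condition (46), m prime: w = 0 or S^x(w) = Z_m without 0,
    i.e., every non-zero weight must be a primitive root mod m."""
    return w % m == 0 or mult_span(w, m) == set(range(1, m))

def satisfies_47(weights, p1, p2):
    """Condition (47), m = p1 * p2: some weight is not divisible by p1
    and some (possibly different) weight is not divisible by p2."""
    return (any(w % p1 != 0 for w in weights)
            and any(w % p2 != 0 for w in weights))

# Over the field Z_5, only 2 and 3 generate Z_5 without 0 (4^2 = 1 already):
print([w for w in range(1, 5) if satisfies_46(w, 5)])  # -> [2, 3]
# Over the ring Z_6 with p1 = 2, p2 = 3:
print(satisfies_47((3, 0, 2), 2, 3))                   # -> True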

Number of rules selected by ReLiCADA

Table 1 Number of rules selected by ReLiCADA

In Table 1, the number of rules selected by ReLiCADA and the number of all linear rules are listed for different m and \({{\hat{n}}=3}.\) It is worth pointing out that the number of rules selected by ReLiCADA is independent of \({{\hat{n}}},\) whereas the total number of rules depends on the chosen neighborhood \({{\hat{n}}}.\) From Table 1, it is easy to see that ReLiCADA reduces the number of rules to analyze by several orders of magnitude. Hence, when designing a ReCA model for a specific application, one does not have to check all rules in the rule space, but only the few rules pre-selected by ReLiCADA.

Limiting the rule space

Table 2 Five rules with \({m=4, {\hat{n}}=5}\) and same \(\lambda \) parameters, but different entropies

The general rule space has no inherent structure or order with respect to the dynamical behavior and properties of the CAs [40]. To the best of our knowledge, there exists no classification scheme or parametrization of the general rule space that imposes a clear structure on it with respect to the dynamical behavior of the CAs without restricting to linear rules [16, 17, 30, 37, 38, 76].

There are several approaches to classify and parametrize the general rule space. However, none of them imposes a definite structure or order on the rules that is suited to restrict the rule space for ReCA model performance optimization. For example, the four Wolfram classes are undecidable [16] and, thus, cannot practically be used to constrain the rule space.

Another often used parametrization is given by the Langton \(\lambda \) parameter, whose purpose is to partition the rule space such that rules from the same partition exhibit similar ordered or disordered dynamics [40]. However, especially for small m and \({\hat{n}},\) the \(\lambda \) parameter is not able to discriminate well between deterministic and chaotic partitions, which is why it is suggested to be used for large configurations with \({m\ge 4}\) and \({{\hat{n}}\ge 5}\) [40]. Nevertheless, even for such large configurations, it can be shown that \(\lambda \) does not discriminate well between different dynamical regimes of the CAs. Table 2 lists five linear rules with \({m=4, {\hat{n}}=5}\) together with their \(\lambda \) parameter and entropy \({\mathscr {H}}.\) As it can be seen, all five rules have \(\lambda =0.75,\) but different entropies. This example illustrates the fact that even for the same \(\lambda \) value, one can get a wide range of deterministic to chaotic rules. Since both parameters intend to describe the chaotic behavior of the rules, they should be similar from a qualitative perspective. As the example in Table 2 shows, this is not the case, which makes \(\lambda \) unsuited to restrict the general rule space in a meaningful manner. Similar considerations can be done for other rule table-based parametrizations, like the Z parameter [76, 83].
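
This failure of \(\lambda \) for linear rules can be made concrete. Assuming \(\lambda \) is defined, as usual, as the fraction of rule-table entries that do not map to the quiescent state 0 [40], the short Python sketch below shows that every linear \({\mathbb {Z}}_4\) rule containing at least one invertible weight has \(\lambda = 1 - 1/4 = 0.75,\) irrespective of its dynamics (a unit weight makes the linear congruence uniquely solvable for that cell, so exactly a fraction 1/m of all neighborhoods maps to 0):

from itertools import product

def langton_lambda(w, m):
    """Fraction of rule-table entries that do not map to the quiescent
    state 0, for a linear rule with weight vector w over Z_m."""
    n_hat = len(w)
    quiescent = sum(
        1 for hood in product(range(m), repeat=n_hat)
        if sum(wi * xi for wi, xi in zip(w, hood)) % m == 0
    )
    return 1 - quiescent / m ** n_hat

# Two rules with very different dynamics, identical lambda:
print(langton_lambda((0, 2, 1, 0, 3), 4))  # -> 0.75
print(langton_lambda((1, 0, 0, 0, 0), 4))  # -> 0.75 (a pure shift)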

For the aforementioned reasons, we limit the CA rule space to linear rules with fixed \({\mathbb {Z}}_m\) and \({\hat{n}}\) and use an analytical approach, which will be verified by multiple exhaustive searches over the complete linear rule space (see “Results”). Additionally, in “Rule selection” we show that our proposed rule selection algorithm outperforms the only optimization method known to us [5], which optimizes over the general rule space using a GA.

Edge of Chaos

The Edge of Chaos (EoC) theory is broadly discussed for CAs [19, 40, 51, 59, 73]. However, to the best of our knowledge, no analysis of the EoC has been done for the ReCA framework yet. The EoC can be compared to the Edge of Lyapunov Stability (EoLS) in the ESN framework [74]. Verstraeten et al. [74] analyzed the connection of the Lyapunov exponents of a specific ESN model to its memory and nonlinear capabilities. These analyses showed that CAs have the highest computational power at the EoC, and ESN models at the EoLS.

Using the five groups of CA rules with increasing degree of chaos, as defined by Manzini and Margara [46] (see “Chaos”), we can see that all of the CAs selected by ReLiCADA belong to the third group, implying that they exhibit a “medium” amount of chaos. By the definitions of Devaney and Knudsen, they are chaotic, but not expansively chaotic [11]. Since the CA rules selected by ReLiCADA are among the best-performing rules, we conjecture, without proof, that this might correlate with the Edge of Chaos.

As an example, for the configuration \({m=4}\) and \({{\hat{n}}=3},\) ReLiCADA selects, among others, the rule with \({{\textbf{w}}=(0,2,1)},\) which is depicted in Fig. 6a and b. The two iteration diagrams show that this Linear CA, on the one hand, has memorization capabilities by shifting the initial state to the left. This left shift can also be interpreted as a transmission of local information along the lattice. On the other hand, it shows interactions of neighboring cells during iteration. These properties (storage, transmission, and interaction) constitute computational capabilities in dynamical systems [40, 73]. Generally, all selected ReLiCADA CA rules show similar behavior.
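
The iteration behavior of this rule can be reproduced with a short Python sketch; the lattice size, the single-seed initial state, and the periodic boundary are our choices for illustration:

import numpy as np

def step(state, w, m):
    """One synchronous update of a 1D Linear CA with periodic boundary:
    cell i becomes sum_j w_j * state[i + j] mod m, with w = (w_{-r}, ..., w_r)."""
    r = len(w) // 2
    new = np.zeros_like(state)
    for j, wj in enumerate(w, start=-r):
        new = (new + wj * np.roll(state, -j)) % m
    return new

# Rule w = (0, 2, 1) over Z_4: x_i <- 2*x_i + x_{i+1} (mod 4).
state = np.zeros(16, dtype=int)
state[8] = 1  # single non-zero seed
for _ in range(5):
    print(state)
    state = step(state, (0, 2, 1), 4)

The printed iterations show the seed travelling to the left (transmission/storage) while the \(2x_i\) term mixes neighboring values (interaction), as described above.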

Experiments

We will now introduce the experimental setup used to verify and validate the performance of ReLiCADA. The datasets are introduced in “Datasets”, and the models against which ReLiCADA is compared are described in “Compared models”.

Datasets

To test the performance of the different hyperparameter configurations of the ReLiCA models, we use datasets that have already been used in several other papers to compare different time series models. The following datasets can, thus, be regarded as benchmark datasets. These datasets might not require fast inference times, one of the main advantages of ReCA models, but they are suitable choices for broad comparability with other studies. All datasets are defined over discrete time steps with \(t \in {\mathbb {N}}.\) We use x(t) to describe the input to the model, and y(t) represents the ground truth solution. The x and y values are rescaled to \([-1,1].\) Unless otherwise noted, the task is a one-step-ahead prediction, i.e., \(y(t) = x(t+1),\) using the inputs up to x(t). The abbreviations used to name the datasets throughout the paper are denoted by (name). For all datasets, 102 disjoint sequences of length 1100 are generated. The first 100 sequences are used for training, one for testing, and the last one for validation.

Hénon map

The Hénon map (Hénon) was introduced in [26] and is defined as

$$\begin{aligned} y(t) = x(t+1) = 1 - 1.4{x(t)}^2+0.3x(t-1). \end{aligned}$$
(49)
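
A minimal generator for this dataset might look as follows; the zero initial values and the min-max rescaling to \([-1,1]\) are our assumptions:

import numpy as np

def henon(length, x0=0.0, x1=0.0):
    """Iterate the Henon map of Eq. (49)."""
    x = np.empty(length)
    x[0], x[1] = x0, x1
    for t in range(1, length - 1):
        x[t + 1] = 1 - 1.4 * x[t] ** 2 + 0.3 * x[t - 1]
    return x

seq = henon(1100)
seq = 2 * (seq - seq.min()) / (seq.max() - seq.min()) - 1  # rescale to [-1, 1]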

Mackey–Glass

The Mackey–Glass time series uses the nonlinear time-delay differential equation introduced in [45]

$$\begin{aligned} \frac{{\textrm{d}}x}{{\textrm{d}}t} = \beta \frac{x(t-\tau ) }{1+{x(t-\tau ) }^{n}}-\gamma x(t) \end{aligned}$$
(50)

with \(\beta =0.2,\) \(\gamma =0.1,\) \(\tau =17,\) and \(n=10.\) The task is to predict \(y(t)=x(t+1)\) using x(t) (MG). Furthermore, we use the prediction task \(y(t)=x(t+25)\) using x(t) (MG_25).
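
A generator sketch using a simple Euler discretization of Eq. (50); the step size dt = 1.0 and the randomized constant initial history are our assumptions (finer schemes such as RK4 are also common for Mackey–Glass):

import numpy as np

def mackey_glass(length, beta=0.2, gamma=0.1, tau=17, n=10, dt=1.0, seed=0):
    """Euler integration of Eq. (50) with a delay buffer of tau/dt samples."""
    rng = np.random.default_rng(seed)
    hist = int(tau / dt)
    x = list(0.5 + 0.05 * rng.standard_normal(hist + 1))  # initial history
    for _ in range(length):
        x_tau = x[-hist - 1]                               # x(t - tau)
        x.append(x[-1] + dt * (beta * x_tau / (1 + x_tau ** n) - gamma * x[-1]))
    return np.asarray(x[hist + 1:])

seq = mackey_glass(1100)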

Multiple superimposed oscillator

The multiple superimposed oscillator (MSO) is defined as

$$\begin{aligned} x(t)=\sum _{i=1}^{n}\sin (\varphi _i t),\quad \text {with } t \in {\mathbb {N}}. \end{aligned}$$
(51)

The MSO12 dataset uses \(\varphi _1=0.2,\) \(\varphi _2=0.331,\) \(\varphi _3=0.42,\) \(\varphi _4=0.51,\) \(\varphi _5=0.63,\) \(\varphi _6=0.74,\) \(\varphi _7=0.85,\) \(\varphi _8=0.97,\) \(\varphi _9=1.08,\) \(\varphi _{10}=1.19,\) \(\varphi _{11}=1.27,\) and \(\varphi _{12}=1.32\) as defined in [22]. We use the prediction tasks \(y(t)=x(t+1)\) (MSO) and \(y(t)=x(t+3)\) (MSO_3) with x(t) as input.

Nonlinear autoregressive-moving average

The nonlinear autoregressive-moving average was first introduced in [4] as a time series dataset. We use the 10th order (NARMA_10)

$$\begin{aligned} x(t+1) ={} & 0.3x(t) + 0.05x(t) \sum _{i=0}^{9} x(t-i) \\ & \quad + 1.5u(t-9)u(t) + 0.1, \end{aligned}$$
(52)

the 20th order (NARMA_20)

$$\begin{aligned} x(t+1) ={} & \tanh \Bigg [ 0.3x(t) + 0.05x(t) \sum _{i=0}^{19} x(t-i) \\ & \quad + 1.5u(t-19)u(t) + 0.01 \Bigg ] + 0.2, \end{aligned}$$
(53)

and the 30th order (NARMA_30)

$$\begin{aligned} x(t+1) ={} & 0.2x(t) + 0.004x(t) \sum _{i=0}^{29} x(t-i) \\ & \quad + 1.5u(t-29)u(t) + 0.201 \end{aligned}$$
(54)

versions as defined in [12]. The input u(t) is generated by a uniform independent and identically distributed (i.i.d.) random variable in the interval [0, 0.5]. The task is to predict x(t) using u(t).
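
A generator sketch for the 10th-order system of Eq. (52); the 20th- and 30th-order variants of Eqs. (53) and (54) follow the same pattern:

import numpy as np

def narma10(length, seed=0):
    """Generate Eq. (52); u(t) is i.i.d. uniform on [0, 0.5]."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, length)
    x = np.zeros(length)
    for t in range(9, length - 1):
        x[t + 1] = (0.3 * x[t]
                    + 0.05 * x[t] * x[t - 9:t + 1].sum()
                    + 1.5 * u[t - 9] * u[t]
                    + 0.1)
    return u, x

u, x = narma10(1100)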

Nonlinear communication channel

This dataset emulates a nonlinear communication channel and was introduced in [34] as

$$\begin{aligned} q(t) ={} & 0.08u(t+2) - 0.12u(t+1) + u(t) + 0.18u(t-1) \\ & - 0.1u(t-2) + 0.09u(t-3) - 0.05u(t-4) \\ & + 0.04u(t-5) + 0.03u(t-6) + 0.01u(t-7), \\ x(t) ={} & q(t) + 0.036\,q(t)^2 - 0.011\,q(t)^3. \end{aligned}$$
(55)

The channel input u is a random i.i.d. sequence sampled from \(\{-3, -1, 1, 3\}.\) The task is to predict \(x(t-2)\) using u(t) (NCC).
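
A generator sketch for Eq. (55); the index bookkeeping for the non-causal taps u(t+1), u(t+2) at the sequence borders is our own:

import numpy as np

def ncc(length, seed=0):
    """Generate the nonlinear channel of Eq. (55); input i.i.d. from {-3,-1,1,3}."""
    rng = np.random.default_rng(seed)
    u = rng.choice([-3.0, -1.0, 1.0, 3.0], size=length + 10)
    # FIR taps for u(t+2), u(t+1), u(t), ..., u(t-7):
    taps = [0.08, -0.12, 1.0, 0.18, -0.1, 0.09, -0.05, 0.04, 0.03, 0.01]
    q = np.array([sum(c * u[t + 2 - k] for k, c in enumerate(taps))
                  for t in range(7, length + 7)])
    x = q + 0.036 * q ** 2 - 0.011 * q ** 3
    return u[7:length + 7], x

u, x = ncc(1100)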

Pseudo periodic synthetic time series

Introduced by UC Irvine [20], the dataset can be generated using

$$\begin{aligned} x(t) = \sum _{i=3}^{7} \frac{1}{2^i} \sin \left( 2 \pi \left( 2^{2+i} + \text {rand}(2^i) \right) \cdot \frac{t}{10000} \right) \end{aligned}$$
(56)

as defined in [60] (PPST).

Predictive modeling problem

First introduced by Xue et al. [84], the dataset can be generated using

$$\begin{aligned} x(t) = \sin (t+\sin (t)),\quad \text {with } t \in {\mathbb {N}}\end{aligned}$$
(57)

(PMP).

Compared models

To have a reference for the ReLiCA model performance values, several state-of-the-art models were used as a baseline. These models and their hyperparameters are established in this section.

In the following description, a parameter optimized during hyperparameter optimization is denoted by a range, e.g., \([a, b].\) Optuna [2] is used as the hyperparameter optimization framework, with the TPESampler algorithm and 100 runs per model. For models using epoch-based training, early stopping was used, configured to stop the training if the loss does not decrease by at least \({10^{-5}}\) within a patience of three epochs.
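
As an illustration of this setup, the following sketch tunes a single hyperparameter with Optuna's TPESampler; the synthetic data and the ridge objective are placeholders, not the paper's actual training pipeline:

import numpy as np
import optuna
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic placeholder data; the paper trains on the benchmark datasets instead.
rng = np.random.default_rng(0)
X = rng.random((400, 16))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

def objective(trial):
    # Search range written in the paper's notation: alpha in [1e-10, 1].
    alpha = trial.suggest_float("alpha", 1e-10, 1.0, log=True)
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)  # 100 runs per model, as described above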

Training was done using 100 parallel models, each using one of the training sequences. A single model is used for testing and validation. For training, testing, and validation, the first 100 data points of each sequence are treated as an initial transient and are discarded.

The results for these models are listed in Table 3.

Neural Networks

These models were created using TensorFlow 2.8.0 [1] with the default settings unless otherwise noted. The models have an Input layer and use a Dense layer as output. The hidden layers were adapted to the respective model. We used Adam [39] as the optimizer with a learning rate of \({[10^{-10}, 1]}.\) MSE is used as the loss function.

The Recurrent Neural Network [66] (RNN) uses a SimpleRNN layer with 64 units with dropout [0, 1] and recurrent dropout [0, 1].

The GRU layer was used for the Gated Recurrent Unit NN [13] with 32 units and dropout [0, 1].

The Long Short-Term Memory NN [27] uses the LSTM layer with 32 units and dropout [0, 1].

The Neural Network (NN) model [65] uses [1, 4] Dense layers with [1, 64] neurons per layer as hidden layers. The inputs to this model are the last 20 values of x(t), which results in the vector \({{\textbf{x}}=[x(t-19), x(t-18), \ldots , x(t)]}.\)

RC models

We used an ESN, SCR, and DLR model. All models use the Scikit-learn 1.1.2 [61] Ridge optimizer with an alpha \({[10^{-10}, 1]}.\)

The ESN model [32] uses the Tensorflow Addons ESN cell implementation embedded into our code. We used 128 units with a connectivity of 10%. The other parameters are input scale [0, 10],  input offset \({[-10, 10]},\) spectral radius [0, 1],  and leaky integration rate [0, 1].

We implemented the SCR and DLR models according to [64]. Both use 256 units, a spectral radius of [0, 1],  input scale [0, 10],  and input offset \({[-10, 10]}.\)
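
As a sketch of what these reservoir topologies look like (based on the construction in Rodan and Tino [64]; the parameter names are ours), the recurrent weight matrices can be built as follows:

import numpy as np

def dlr_weights(n_units, r):
    """Delay Line Reservoir (DLR): units form a chain, all weights equal r."""
    W = np.zeros((n_units, n_units))
    W[np.arange(1, n_units), np.arange(n_units - 1)] = r
    return W

def scr_weights(n_units, r):
    """Simple Cycle Reservoir (SCR): the chain is closed into a single cycle,
    so the spectral radius of W equals |r|."""
    W = dlr_weights(n_units, r)
    W[0, n_units - 1] = r
    return W

W = scr_weights(256, 0.9)  # 256 units as in the text; r = 0.9 is an example value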

ReCA models

We used our implementation of the refined ReCA architecture together with the nonlinear CA rules found by a GA in [5]. The lattice has a size of (16, 32), and the CA performs four iterations per input sample. All rules found by the GA in Babson and Teuscher [5] were analyzed using all combinations of the transformation and quantization configurations (complement, gray, scale_offset, sign_value) and the encoding functions (additive, replacement, subtractive, xor). A Ridge optimizer is used for training. We call this model Babson.

Linear model

A simple linear regression model using Scikit-learn was also evaluated. Like the NN, the linear model, denoted by Linear, has the last 20 values of x(t) as input.

Complexity

One of the main advantages of ReCA models is their low computational complexity. To compare the complexities of the different types of models, we approximated the computational complexity of their inference steps. This analysis was optimized for implementations on FPGAs without specialized hardware like multiply–accumulate units. Nevertheless, it is a good indication for other types of implementations as well.

Assuming two numbers a, b that are represented by \({a'},\) \({b'}\) bits, we define the following complexities: addition and subtraction have a complexity of \({\min (a',b')},\) whereas multiplication and division have a complexity of \({a' \times b'}.\) For additions and subtractions, we assume that the hardware does not need to deal with the most significant bits (MSBs) of the larger number since these are zero in the smaller number. For multiplication and division, we assume a shift-and-add implementation. To approximate the complexity of the tanh function, we use the seventh-order Lambert's continued fraction [78]. We assume the same complexity for the sigmoid function. The ReLU function has a complexity of zero.
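
The following sketch encodes this cost model; dense_layer_cost is a hypothetical helper of ours that illustrates how the per-operation costs accumulate for a simple layer:

def add_cost(a_bits, b_bits):
    """Addition/subtraction: min of the operand widths (the MSBs of the
    wider operand need no full adder stages)."""
    return min(a_bits, b_bits)

def mul_cost(a_bits, b_bits):
    """Multiplication/division: product of the operand widths
    (shift-and-add implementation)."""
    return a_bits * b_bits

def dense_layer_cost(n_in, n_out, bits=32):
    """Hypothetical example: a dense layer needs n_in multiplications and
    n_in - 1 additions per output neuron (bias and activation omitted)."""
    per_neuron = n_in * mul_cost(bits, bits) + (n_in - 1) * add_cost(bits, bits)
    return n_out * per_neuron

print(dense_layer_cost(64, 1))  # 64*32*32 + 63*32 = 67552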

We assume that the input and output have 32 bits, and all models use 32 bits to represent their internal states. For the ReCA models, the CA uses the required number of bits to represent \({{\mathbb {Z}}_m},\) and the readout layer also uses 32 bits.

The number of units in the different baseline models was chosen to make the overall model complexity similar to the tested ReLiCA models. Because of this, the number of units was not optimized for model performance.

Results

Our experiments can be divided into two phases. In the first phase (“General hyperparameters”), we analyzed and identified well-working choices for nearly all of the hyperparameters of the ReLiCA model (general hyperparameters), except for the Linear CA rule. In the second phase (“Rule selection”), we fixed the general hyperparameters based on the results in the first phase, and exhaustively analyzed the Linear CA rule performance. Detailed results of the experiments can be found in Appendix C.

General hyperparameters

Since the main focus of our analysis lies in selecting suitable combinations of transformation, quantization, mapping, and encoding methods, and linear CA rules for the reservoir, we fixed the general hyperparameters of the ReLiCA models to reduce the parameter space. As a starting point for the parameter values in our experiments, we used the results of previous studies [5, 8, 47, 48, 57, 85, 86].

We first analyzed the influence of the reservoir size (N) and the number of CA iterations on the overall ReLiCA performance. The datasets used for the following analysis are MG, MG_25, MSO, MSO_3, NARMA_10, NARMA_20, NARMA_30, NCC, PPST, PPST_10, PMP (see “Datasets”).

Fig. 8 Influence of the lattice size \({(N_r, N_c)}\) on ReLiCA with \({{\mathbb {Z}}_4},\) \({{\hat{n}}=3},\) \({I=4},\) scale_offset, and replacement

To test the influence of the reservoir size, we tested the following lattice sizes: (16, 32),  (16, 33),  and (17, 31). These were chosen since the total number of cells is similar, but their prime factor decomposition differs significantly. The results for the ReLiCA model using scale_offset, replacement, \({{\mathbb {Z}}_4},\) \({{\hat{n}}=3}\) and \({I=4}\) are depicted in Fig. 8. Other ReLiCA models showed similar behavior. Since none of the lattice sizes is superior to the others, we used (16, 32) in the following experiments. This was done since, for most hardware implementations, a power-of-two number of cells would most likely be suitable.

Fig. 9 Influence of the number of iterations I on ReLiCA with \({{\mathbb {Z}}_4},\) \({{\hat{n}}=3},\) (16, 32), scale_offset, and replacement

To see the influence of the CA iterations, we tested the ReLiCA model using scale_offset, replacement, \({{\mathbb {Z}}_4},\) \({{\hat{n}}=3},\) and \({(N_r, N_c) = (16,32)}.\) The results are shown in Fig. 9; other configurations yielded similar results. Increasing the number of CA iterations beyond two steps did not lead to a significant monotonic decrease in the overall NMSE. This is in line with the results by Babson and Teuscher [5], who achieved a success rate of \({{99}\%}\) in the 5-bit memory task for complex CA reservoirs and four iterations. In their study, elementary \(({m=2},{{\hat{n}}=3})\) CAs were found to require eight iterations. However, Nichele and Gundersen [57] show that several single elementary CA rules also achieve a success rate of \({{\ge 95}\%}\) in the 5-bit memory task with only four iterations. Since higher numbers of CA iterations imply higher computational complexity and longer training and testing times, we fixed the number of iterations to four in all subsequent experiments.

Fig. 10 Comparison of the different transformation, quantization, mapping, and encoding functions using ReLiCA with \({{\mathbb {Z}}_4},\) \({{\hat{n}}=3},\) \({I=4},\) \({(N_r, N_c) = (16,32)}.\) Used abbreviations: additive, replacement, subtractive, xor; complement, gray, scale_offset, sign_value

Another finding is that the replacement encoding together with the scale_offset transformation achieves low errors in most configurations and is thus the most stable encoding with respect to changing values of the other hyperparameters. This can be seen in Fig. 10. Therefore, we fixed the transformation method to scale_offset and the encoding to random replacement.

The random mapping generator’s seeds, the only random element in the ReLiCA model, were fixed to ensure reproducible results.

Rule selection

To analyze the performance of the Reservoir Computing using Linear Cellular Automata design algorithm, we used the following time series benchmark datasets: MG, MG_25, MSO, MSO_3, NARMA_10, NARMA_20, NARMA_30, NCC, PPST, PPST_10, PMP (see “Datasets”). To train the ReLiCA models, a Ridge optimizer is used with \({\alpha =1},\) the default value of Scikit-learn. The number of states m, neighborhood \({\hat{n}},\) and local rule of the CA are varied throughout the experiments. We denote the models designed using ReLiCADA by ReLiCA* and the general class of ReLiCA models using the whole set of possible Linear CA rules by ReLiCA. All individual performance values are listed in Appendices B and C. We used a train-test split of the datasets to conduct our experiments. Unless otherwise noted, the test performance values are reported.

Fig. 11 Comparison of the performance of the overall best linear rule with the best and worst rule selected by ReLiCADA

In Fig. 11, we compare the mean \({\text {NMSE}}\) of the overall best ReLiCA model, analyzing all possible linear rules, with the best and worst ReLiCA* model, whose rules were selected by ReLiCADA. Best and worst are determined per dataset, resulting in the possibility that different rules are used for the different datasets. It can be seen that the best ReLiCA* model is very close to the overall best ReLiCA model, especially considering that the overall worst rule has a mean \({\text {NMSE}}> 1.\) Not only the best ReLiCA* model shows nearly optimal performance, but also the worst one. It is also evident that increasing \({\hat{n}}\) from 3 to 5 did not improve the performance. This behavior was also verified for several other values of m (see Appendix C).

Fig. 12 Rules selected by ReLiCADA are better than x% of the overall linear rules

Instead of using only the mean \({\text {NMSE}}\) for this analysis, we also checked how many ReLiCA models are worse than a selected ReLiCA* model. The results are depicted in Fig. 12 and clearly show that the best ReLiCA* model is better than at least 95% of the total rule space. Even the worst ReLiCA* model is still better than 80% of the overall ReLiCA models. This again verifies that the performance of all rules selected by ReLiCADA is far better than that of a randomly chosen linear rule.

Since it is feasible to test all configurations selected by ReLiCADA, the best performance shown in Figs. 11 and 12 can always be achieved in practice.

Fig. 13 Comparison of model performance with model complexity

Since one goal was to achieve a computationally simple model with low complexity while maintaining good model performance, we compared these two parameters in Fig. 13. We used a train-test-validation split of the dataset for this analysis. The test performance values were used to select the best model, and the validation performance values are shown in Fig. 13. No large deviations between test and validation performance were evident during our experiments. The ReLiCA models have less complexity compared to the RC and NN models. Despite their computational simplicity, they still achieve similar or even better performance. Increasing m for the ReLiCA models increases not only the model complexity but also the model performance. However, it is apparent that the performance gain by increasing m declines. A neighborhood of \({\hat{n}}=3\) was chosen for the ReLiCA models since increasing the neighborhood would not result in better performance.

Despite the nonlinear, and thus more complex, CAs of the Babson models, their performance does not match that of the ReLiCA* models. While the ReLiCA* \({\mathbb {Z}}_4\) models achieve a mean NMSE of 0.12, the Babson models only achieve 0.34. As the nonlinear CA rules of the Babson models were optimized with a GA, this indicates that heuristic search and optimization algorithms cannot deal well with the structure and size of the general CA rule space.

Fig. 14 Influence of the random mapping seed on the ReLiCA model performance. The used model configuration is: \({\mathbb {Z}}_4,\) \({\hat{n}}=3,\) scale_offset, replacement

To analyze the influence of the random mapping on the ReLiCA model performance, we tested several different seeds for the random mapping generator. While there is an influence on the performance, it is negligible for the models selected by ReLiCADA. In Fig. 14, the empirical cumulative distribution function for different seeds is visualized for \({\mathbb {Z}}_4,\) \({\hat{n}}=3\) ReLiCA and ReLiCA* models using scale_offset and replacement. The slight performance difference decreases even further with larger m.

During our experiments, we mainly focused on the integer rings \(m=2^a\) with \(a \in {\mathbb {N}}^+\) since these are most suitable for implementations on FPGAs and other digital systems. Nevertheless, we verified ReLiCADA for several other values of m (see Appendix C). These results showed that ReLiCADA can also be used for \(m \ne 2^a.\) According to our experiments, CAs over \({\mathbb {Z}}_2\) behave differently; for example, the best encoding for these CAs is the xor encoding. Since this configuration was not of primary interest, we did not analyze it further. Furthermore, we verified ReLiCADA on lattice sizes other than (16, 32) and iteration counts other than 4. For these configurations as well, ReLiCADA showed great improvements in performance compared to the whole set of all possible ReLiCA models. The performance values are listed in Appendix C.

We also ran tests where the quantized input \(x_q\) was directly fed into the readout layer, forming a quantized skip connection. When the replacement encoding was used, this did not lead to any performance gain. Since ReLiCADA only uses replacement encoding, quantized skip connections are not used in our models. However, a performance gain was observed when the readout layer was provided with the original input x directly. Since this imposes only a very small increase in complexity, we recommend using this skip connection if possible.
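
A sketch of such a readout with an input skip connection; all arrays here are random placeholders standing in for the CA reservoir features, the original input, and the target:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
reservoir_states = rng.random((1000, 512))   # CA readout features (placeholder)
x = rng.random((1000, 1))                    # original, un-quantized input
y = rng.random(1000)                         # prediction target (placeholder)

features = np.hstack([reservoir_states, x])  # append the input skip connection
readout = Ridge(alpha=1.0).fit(features, y)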

Nonlinear capabilities

During our experiments, we saw that ReLiCA models could not deal with highly nonlinear datasets, like Hénon, very well. However, after using the hyperparameter optimizer Optuna to optimize the quantization thresholds (see Eq. (13)) and the regularization of the Ridge optimizer, the performance of the ReLiCA model increased drastically. The ReLiCA* model with \({\mathbb {Z}}_{16},\) \({\hat{n}}=3\) achieved an NMSE of 0.321 before optimization and 0.048 after. Other transformation and quantization layers could likely improve the nonlinear capabilities of linear ReLiCA models. However, this was not further analyzed.

Further tests have shown that the ReLiCA model performance also improves on the other datasets when Optuna is used to optimize quantization thresholds. Since we wanted to create a fast and easy-to-train model, we refrained from using threshold optimization in our results.

Conclusion

ReCA models form a particular subset of the broader field of RC that is especially well suited to implementation on FPGAs. However, the choice of hyperparameters and, primarily, the search for suitable CA rules are major challenges during the design phase of such models. When restricted to Linear CAs, fundamental properties can be computed analytically. Based on the results of nearly a million experiments, we recognized that Linear CA rules that achieve low errors on many relevant benchmark datasets share specific mathematical properties. These insights led to the Reservoir Computing using Linear Cellular Automata design algorithm, which selects hyperparameters that have been shown to work well in the experiments. Most importantly, the proposed algorithm pre-selects a few rules out of a rule space that grows exponentially with increasing m and \({\hat{n}}.\) As has been shown, the best-performing selected rules are among the top \({5}\%\) of the overall rule space. Moreover, the proposed models achieve, on average, a lower error than other state-of-the-art Neural Network models and, at the same time, exhibit lower computational complexity, showing the strength of ReLiCADA. Furthermore, with the immensely reduced hyperparameter space, the time needed to design and implement ReCA models is drastically reduced. In conclusion, ReLiCADA is a promising approach for designing and implementing ReCA models for time series processing and analysis.

Table 3 Baseline model performance
Table 4 ReCA model performance (test set)
Table 5 ReLiCADA model performance (validation set)