Forecasting Solar Cycle 25 with Physical Model-Validated Recurrent Neural Networks

The Sun’s activity, which is associated with the solar magnetic cycle, creates a dynamic environment in space known as space weather. Severe space weather can disrupt space-based and Earth-based technologies. Slow, decadal-scale variations on solar-cycle timescales are important for radiative forcing of the Earth’s atmosphere and impact satellite lifetimes and atmospheric dynamics. Predicting the solar magnetic cycle is therefore of critical importance for humanity. In this context, a novel development is the application of machine-learning algorithms for solar-cycle forecasting. Diverse approaches have been developed for this purpose; however, no consensus has emerged across the different techniques and physics-based approaches. Here, we first explore the performance of four different machine-learning algorithms – all belonging to a class called Recurrent Neural Networks (RNNs) – in predicting simulated sunspot cycles based on a widely studied, stochastically forced, nonlinear time-delay solar dynamo model. We find that the Echo State Network (ESN) algorithm performs the best, but that predictability is limited to only one future sunspot cycle, in agreement with recent physical insights. Subsequently, we train the ESN algorithm and a modified version of it (MESN) with solar-cycle observations to forecast Cycles 22 – 25. We obtain accurate hindcasts for Solar Cycles 22 – 24. For Solar Cycle 25 the ESN algorithm forecasts a peak amplitude of 131 ± 14 sunspots around July 2024 and indicates a cycle length of approximately 10 years. The MESN forecasts a peak of 137 ± 2 sunspots around April 2024, with the same cycle length. Qualitatively, both forecasts indicate that Cycle 25 will be slightly stronger than Cycle 24 but weaker than Cycle 23. Our novel approach bridges physical model-based forecasts with machine-learning-based approaches, achieving consistency across these diverse techniques.


Introduction
The total number of sunspots seen in the Sun varies with an approximately 11-year cycle. This cycle itself is not a regular one, its amplitude varies with time with no particular regularity and occasionally goes through extreme phases of high or low activity (Hathaway, 2015;Saha, Chandra, and Nandy, 2022). The solar cycle plays an important role in governing space weather that in turn has major impacts on our modern society. These include disruptions of satellite operations that impact telecommunication networks and global-positioning systems, geomagnetic storms that lead to electric power grid failures and air-traffic disruptions over polar routes (Schrijver et al., 2015). The economic cost of a severe magnetic storm, say, e.g., of the magnitude of the great magnetic storm of 1859 -the Carrington event -is estimated to be greater than the economic cost of hurricane Katrina (Committee on the Societal and Economic Impacts of Severe Space Weather Events: A Workshop, 2008). Long-term solar-activity variations have stimulated the growth of the field of space climate (Nandy and Martens, 2007) with the emerging understanding that solar-activity variations have profound impacts on planetary space environments, atmospheric evolution (Bharati Das et al., 2019;Basak and Nandy, 2021), and habitability (Nandy, Valio, and Petit, 2017;Nandy et al., 2021).
Such considerations have spurred the field of solar-cycle forecasting, with diverse techniques employed to forecast upcoming solar cycles (Nandy, 2021); this is deemed one of the most outstanding challenges in heliophysics (Daglis et al., 2021). Petrovay (2020) classified such techniques into three groups: model-based methods, precursor methods, and extrapolation methods. Each has its strengths and weaknesses. Most importantly, the first two are closely connected with physical insight into the solar dynamo that determines the solar cycle, while the last one is model-agnostic and data-based. The solar magnetic cycle is thought to originate in a dynamo mechanism through complex, nonlinear interactions between magnetic fields and plasma flows in the Sun's convection zone (Charbonneau, 2010). It is believed to be weakly chaotic in nature (Petrovay, 2020). The extreme conditions and turbulent nature of the solar convection zone, combined with a lack of observational constraints, make computational modeling of the solar dynamo mechanism quite challenging. There have been a few model-based forecasts for Solar Cycle 24, which has just concluded (Dikpati, De Toma, and Gilman, 2006; Dikpati et al., 2007; Choudhuri, Chatterjee, and Jiang, 2007; Jiang, Chatterjee, and Choudhuri, 2007). However, these model-based forecasts were highly inconsistent - a result of differing assumptions regarding the turbulent nature of the Sun's convection zone (Yeates, Nandy, and Mackay, 2008). A NASA-NOAA panel that typically attempts to generate a consensus prediction before the start of a sunspot cycle made an early forecast of a very strong Solar Cycle 24, which proved to be incorrect; in fact, the panel revised its forecast to a weak Cycle 24 a few years after its first forecast. This indicates the uncertainty and challenges inherent in predicting solar cycles. We note that terrestrial weather forecasting follows a similar route.
Crucially, in the case of terrestrial weather forecasting the computational models are well established, observations provide much stronger constraints, and massive computational resources are utilized. In contrast, the solar dynamo model and its parameters are far less constrained by observations than models for terrestrial weather forecasting.
A recently developed physical model-based forecasting scheme relied on coupling two distinct models of magnetic-flux transport on the Sun, namely a solar surface flux-transport model and a solar convection-zone dynamo model (Bhowmik and Nandy, 2018). This physics-based modeling technique was quite successful in hindcasting nearly a century of solar-cycle observations and predicted a weak, but not insignificant, Solar Cycle 25, similar to or slightly stronger than the previous cycle, peaking in 2024 – 2025. Independent observations indicate that Solar Cycle 25 initiated in late 2019 (Nandy, Bhatnagar, and Pal, 2020). Given major advances both in understanding the theory of solar-cycle predictability and in the application of data-based machine-learning techniques to solar-cycle forecasting, it is timely to assess whether the best of these very diverse techniques result in predictions that are consistent with each other.
A recent review of progress in solar-cycle predictions by Nandy (2021) points out that divergences and inconsistencies in solar-cycle forecasts persist across sunspot Cycles 24 and 25; however, physical model-based predictions have converged for Solar Cycle 25. Nandy (2021) argues that this convergence is based on insights into solar-cycle predictability gleaned in recent times, including the understanding that long-term solar-cycle forecasts are not possible because theoretical processes limit the cycle memory to only one solar cycle (Karak and Nandy, 2012; Hazra, Brun, and Nandy, 2020). This is borne out by observations (Muñoz-Jaramillo et al., 2013). However, the analysis by Nandy (2021) indicates divergence across physics-based and data-based, model-agnostic forecasting techniques. Can we bridge this gap? This is one of the central motivations of our work.
The last decade has found machine-learning techniques to be extraordinarily successful in making forecasts. They are particularly useful in those cases where a mechanistic model is either unavailable or poorly constrained, as is the case in many astrophysical and geophysical problems. They have played an increasingly important role in making data-based forecasts in several problems in solar physics (Bobra and Couvidat, 2015; Bobra and Ilonidis, 2016; Dhuri, Hanasoge, and Cheung, 2019; Sinha et al., 2022). A recent, comprehensive analysis by Sinha et al. (2022), in particular, indicates the fidelity of machine-learning models in the domain of flare forecasting. Starting with Koons and Gorney (1990) and Fessant, Bengio, and Collobert (1996), different neural networks have been used with varying degrees of success to forecast solar cycles (Pesnell, 2012), including a few recent studies (Pala and Atici, 2019; Covas, Peixinho, and Fernandes, 2019; Benson et al., 2020) that made forecasts for the ongoing Cycle 25.
Let us note that machine-learning techniques, particularly those based on deep neural networks, although sometimes spectacularly successful, are treated mostly as black boxes by practitioners. This may lead to mistaken conclusions, particularly when applied to limited data (Riley, 2019) - as is the case for the solar cycle. Hence, it is necessary to choose machine-learning algorithms with care and to critically review their forecasts. In this paper, we show how to make such a choice with the specific example of solar-cycle forecasting.
We use four different machine-learning algorithms, all within a general framework called Recurrent Neural Networks (RNNs). The characteristic feature of RNNs is that their connection topology possesses cycles (Jaeger, 2001). They are well known for their ability to model sequential data. In addition, we use a simple linear Autoregressive (linear AR) algorithm as a reference against which to compare the performance of the different RNNs. Therefore, we use five different algorithms:
• Linear Autoregressive (linear AR).
• Vanilla Recurrent Neural Network (vanilla RNN).
• Long Short-Term Memory (LSTM).
• Gated Recurrent Unit (GRU).
• Echo State Network (ESN).

Figure 1
The reservoir is a collection of N nodes. The state of the reservoir is given by the state vector x. The connections between nodes of the reservoir, W_res, depicted by red arrows, are taken from a large, sparse, random matrix with a spectral radius less than unity. During training, the time series y(1), y(2), . . . , y(T) is fed into the feedback neuron. The update rule for each node is x(t + 1) = tanh(u_in w_in + W_res x(t) + W_fb y(t)), where w_in and W_fb are random weights and u_in is a constant. The output of the reservoir, ŷ, is a weighted sum - with weights W_out - over the state of the reservoir. To forecast, the output of the reservoir is fed back into the feedback neuron.
A detailed mathematical treatment of all of these RNN architectures can be found in Goodfellow, Bengio, and Courville (2016). In our problem, the ESN architecture works the best. For this reason, in the next section, we concentrate on ESN. We mention the other algorithms only for comparison purposes with ESN.
The rest of this paper is organized in the following manner: in Section 2 we give a brief introduction to the ESN algorithm. To study its limitations we use data from simulations of a stochastic dynamo model that allows us to generate an infinite amount of data. We describe this model in Section 3. In Section 4, we apply the ESN to the real sunspot data. Finally, we present the conclusions. Important implementation details regarding how to manually choose certain parameters of the ESN are listed in Appendix A. Then, in Appendix B, the reader can also find how we train the remaining RNN models.

Echo State Networks
In this paper we obtain the best forecasts with a particular technique called the Echo State Network (ESN) (Jaeger, 2001; Maass, Natschläger, and Markram, 2002; Jaeger and Haas, 2004; Lukoševičius and Jaeger, 2009). Within the machine-learning community the name ESN is more popular, whereas in the physics community reservoir computing is the most common name used to describe this network.
It has been used successfully to forecast delay differential equations, low-dimensional chaotic systems, and even large spatiotemporally chaotic systems (Pathak et al., 2018). This method has so far not been used to forecast the solar cycle, although theoretical considerations suggest that the solar dynamo mechanism can be represented by a system of delay differential equations (Wilmot-Smith et al., 2006).
Let us first briefly introduce the idea of the ESN as applied to the problem of forecasting the next solar cycle (see Figure 1). At the heart of the algorithm lies a neural network with a large number of nodes - the reservoir. Every node of this network is called a neuron. The state of the reservoir is given by the state vector x of dimension N. Every neuron receives input from all the neurons in the network (including itself), from a constant input signal u_in, and from a feedback neuron. Each neuron is updated by applying a nonlinear function (often the hyperbolic tangent) to a linear combination of its inputs, each multiplied by a different random weight:

x(t + 1) = tanh(u_in w_in + W_res x(t) + W_fb y(t)). (1)

The random weights connecting any two neurons of this network, W_res, are given by the corresponding elements of a large, sparse, random matrix whose spectral radius is less than unity. The connection weights between the feedback neuron and the reservoir are given by the feedback matrix W_fb. The output of the reservoir is a linear combination of the outputs of the individual neurons, weighted by another set of weights W_out:

ŷ(t) = W_out x(t) + b. (2)

First, we must train the reservoir. This proceeds in the following manner. A time series y(1), y(2), . . . , y(T) is fed into the feedback neuron sequentially. At every time instant an output of the reservoir, ŷ(t), is obtained. The weights W_out and the bias b are chosen so as to minimize the ridge-regression loss function (Shalev-Shwartz and Ben-David, 2014):

C = (1/T) Σ_{t=1}^{T} ||y(t) − ŷ(t)||₂² + β ||W_out||₂². (3)

The loss function C has two contributions. The first is the Mean Squared Error (MSE) between the true signal y and the forecast ŷ; we use the symbol ||·||₂² to indicate the square of the second norm. The second term, β||W_out||₂², is a penalty on the second norm of W_out - this is called L2 regularization. The constant β controls the strength of this penalty term. Regularization tries to avoid the overfitting problem.
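The training procedure above can be sketched in a few lines of numpy. This is a minimal illustration rather than our actual implementation: the reservoir size, sparsity, weight scales, and the stand-in sine series are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, beta = 200, 500, 1e-5   # reservoir size, series length, regularization

# Sparse random reservoir, rescaled so its spectral radius is below unity
W_res = rng.uniform(-0.5, 0.5, (N, N)) * (rng.random((N, N)) < 0.05)
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))

w_in = rng.uniform(-0.5, 0.5, N)   # input weights
W_fb = rng.uniform(-0.5, 0.5, N)   # feedback weights
u_in = 1.0                         # constant input signal

y = np.sin(2 * np.pi * np.arange(T) / 50)   # stand-in training series

# Drive the reservoir with the training series (Equation 1)
X = np.zeros((N, T))
for t in range(T - 1):
    X[:, t + 1] = np.tanh(u_in * w_in + W_res @ X[:, t] + W_fb * y[t])

# Ridge-regression readout (Equation 3): W_out = y X^T (X X^T + beta I)^(-1)
W_out = y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(N))
y_hat = W_out @ X   # reservoir output (Equation 2, bias omitted for brevity)
```

To forecast, one then replaces the feedback signal y(t) with the reservoir's own output ŷ(t) and iterates the same update rule forward in time.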
If we optimize W_out such that the output of the reservoir is a very good approximation to the training data, the forecast may actually become less reliable. A central feature of machine-learning techniques in general, and of the ESN in particular, is that a very large amount of training data is often necessary to obtain a reliable forecast. The forecast is expected to improve the longer the training period is, but there is no a priori constraint on how long a training period is appropriate. This is true for almost any problem in the natural sciences, and particularly so for forecasting solar cycles.
The sunspot data is a one-dimensional time series z(1), z(2), . . . , z(t), . . . , z(T). The sunspot time series can be used directly in the ESN if we treat y(t) as a scalar, i.e., y(t) = z(t). In this case the output ŷ(t) = ẑ(t) is also a scalar. We show in Section 3 (see Figure 2) that this direct method generates decent, but not very good, forecasts at later times. To improve the forecast at later times, we develop a variation on the standard ESN algorithm, which we call the Modified Echo State Network (MESN). This algorithm introduces three changes in data preparation and use. One, instead of feeding the input signal one data point at a time, at time t we construct a p-dimensional vector that contains a history of p values; this vector is fed to the feedback neuron as per Equation (1). Two, we change the dimension of the output of the reservoir, such that we no longer forecast one time instant after every update of the reservoir, but instead forecast a q-dimensional vector ŷ(t) = [ẑ(t), ẑ(t + 1), . . . , ẑ(t + q − 1)] at time t as per Equation (2). Three, the target in Equation (3) is no longer y(t) but rather the future points y*(t) = [z(t), z(t + 1), . . . , z(t + q − 1)].
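The three data-preparation changes can be made concrete with a small helper that builds the p-point history vectors and q-point targets. The exact alignment of history and target below is our reading of the description and should be treated as illustrative.

```python
import numpy as np

def make_mesn_pairs(z, p, q):
    """Build MESN input/target pairs from a 1-D series z.

    Input at step t:  the p-point history [z(t-p), ..., z(t-1)].
    Target at step t: the q future points [z(t), ..., z(t+q-1)].
    """
    inputs, targets = [], []
    for t in range(p, len(z) - q + 1):
        inputs.append(z[t - p : t])
        targets.append(z[t : t + q])
    return np.array(inputs), np.array(targets)
```

With q set to roughly the length of a sunspot cycle in months, each target vector spans one full cycle, which is what allows the MESN to forecast a cycle in one go.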

Forecast for a Dynamo Model of Solar Cycle
To study how the algorithm performs when it is not constrained by limited data, we first apply it to a model of the solar dynamo. There are several low-dimensional, stochastic models that reproduce the qualitative features of the global sunspot data, namely oscillations whose frequency and amplitude may change abruptly from one cycle to another. We use a widely studied model for this purpose (Wilmot-Smith et al., 2006; Hazra, Passos, and Nandy, 2014; Tripathi, Nandy, and Banerjee, 2021). Specifically, we follow the prescription in Hazra, Passos, and Nandy (2014) and construct a solar dynamo model consisting of two coupled time-delay differential equations:

dB(t)/dt = (ω/L) A(t − T0) − B(t)/τ,
dA(t)/dt = α(t) f1(B(t − T1)) B(t − T1) − A(t)/τ.

The square of the signal B(t) corresponds to the global sunspot number, that is z(t) = B²(t), and A(t) is the poloidal field strength. The parameters T0 and T1 control the time delays of the equations and the function α(t) is stochastic:

α(t) = α0 [1 + (δ/100) σ(t, τ_c)].

Here, σ(t, τ_c) is a uniform random number in the interval [-1, +1]; we draw a new random number from this distribution every τ_c. The parameter δ ∈ [0, 100] controls the strength of the noise. The function f1 is called the quenching factor and it involves the error function (erf) and two thresholds B_min, B_max:

f1(B) = (1/4) [1 + erf(B² − B²_min)] [1 − erf(B² − B²_max)].

We use the model parameters listed in Table 1 together with a fixed set of initial conditions. We solve the differential equations from t = 0 to t = 1100 with a time step dt = 10^-3, using the trapezoidal rule to integrate the delay terms and a second-order Runge-Kutta method for the nondelay parts. Once we obtain the solution B(t), we resample it so that the final time series has a time step of 10^-1; this is necessary because the ESN algorithm cannot forecast far ahead in time if it learns a time series with a very short time step. We divide the time series into two parts: a long training signal that contains 52 cycles and a short test signal consisting of the next 4 cycles, as shown in Figure 2(a).
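The integration loop can be sketched as follows. Since Table 1 is not reproduced here, every numerical value below is a placeholder, and for brevity the sketch uses a plain Euler step with piecewise-constant delay lookups rather than the trapezoidal/Runge-Kutta scheme described above.

```python
import math
import numpy as np

# Placeholder parameter values (Table 1 is not reproduced here);
# the symbols follow the notation in the text.
omega_by_L = -0.34       # omega/L
tau = 15.0               # diffusion timescale
alpha0 = 0.17            # mean alpha
T0, T1 = 2.0, 0.5        # time delays
Bmin, Bmax = 1.0, 7.0    # quenching thresholds
delta, tau_c = 30.0, 1.0 # noise strength and correlation time
dt, t_end = 1e-2, 200.0  # coarser step than the paper's 1e-3, for brevity

rng = np.random.default_rng(1)

def f1(B):
    # Quenching factor with two erf thresholds (schematic form)
    return 0.25 * (1 + math.erf(B**2 - Bmin**2)) * (1 - math.erf(B**2 - Bmax**2))

n = int(t_end / dt)
lag0, lag1 = int(T0 / dt), int(T1 / dt)
A, B = np.zeros(n), np.zeros(n)
A[0] = B[0] = 1.0        # illustrative initial history

redraw = max(int(tau_c / dt), 1)
sigma = rng.uniform(-1, 1)
for i in range(n - 1):
    if i % redraw == 0:
        sigma = rng.uniform(-1, 1)            # new uniform draw every tau_c
    alpha = alpha0 * (1 + (delta / 100) * sigma)
    A_d = A[i - lag0] if i >= lag0 else A[0]  # A(t - T0)
    B_d = B[i - lag1] if i >= lag1 else B[0]  # B(t - T1)
    B[i + 1] = B[i] + dt * (omega_by_L * A_d - B[i] / tau)
    A[i + 1] = A[i] + dt * (alpha * f1(B_d) * B_d - A[i] / tau)

z = B**2   # proxy for the global sunspot number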
Before feeding the training signal into the reservoir, we normalize it by dividing by the maximum amplitude found; this helps to prevent the saturation of the tanh function in Equation (1). We use a reservoir size of N = 1000 neurons and a regularization parameter β = 10^-5. As the connections W_res are initialized randomly, we obtain different forecasts every time we run the algorithm. We therefore run it 10 times, generating an ensemble of 10 independent forecasts, and take the mean of the forecasts as the final result that we compare against the test data. Figures 2(b) and (c) show that we obtain good agreement for the first cycle only: for the next three cycles the variance of the ensemble grows and the mean falls far from the true signal. We note, though, that the forecast for the first cycle is accurate, with a test MSE of 8.89. Qualitatively, this result is not a particularity of the test signal we decided to forecast; for example, if we train with 60 cycles instead of 52, we still obtain accurate forecasts for cycle 61 only. It is in this sense that we call our ESN a physical model-validated recurrent neural network.

Figure 2 In (a), we show the dynamo signal from Hazra, Passos, and Nandy (2014). We train the reservoir with this signal except for the last four cycles, which we leave for testing the forecasts obtained. The dashed line separates the training signal from the test signal. In (b), the green dotted line represents the average of an ensemble of 10 independent forecasts of the last four cycles. The standard deviation of the ensemble is plotted in orange. The red line represents the test signal. The figure exhibits that the variance of the ensemble grows after the first cycle and that the forecast is no longer accurate. In (c), we show a zoomed-in plot of the forecast obtained for the first cycle.
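The ensemble-averaging step is straightforward; the sketch below uses a noisy stand-in function in place of a full ESN run, since only the aggregation logic is being illustrated.

```python
import numpy as np

def esn_forecast(seed, steps=120):
    """Stand-in for a single ESN forecast run; in the real pipeline only
    the random reservoir initialization differs between ensemble members."""
    rng = np.random.default_rng(seed)
    t = np.arange(steps)
    return np.sin(2 * np.pi * t / 40) + 0.1 * rng.standard_normal(steps)

runs = np.stack([esn_forecast(seed) for seed in range(10)])  # 10 members
mean_forecast = runs.mean(axis=0)   # reported as the final forecast
spread = runs.std(axis=0)           # reported as the uncertainty band
```

The growth of `spread` with lead time is exactly the behavior seen after the first forecast cycle in Figure 2(b).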

Application to Solar Cycle
Next, we apply all the RNN algorithms to forecast solar cycles. To be specific, consider the case of forecasting one particular cycle, say Cycle 23. We train the algorithms with the thirteen-month running average of the sunspot data up to the end of Cycle 22. Then, we continue running the network to produce the forecast.
In Figure 3 we show the forecasts for Solar Cycles 22, 23, and 24 using the five different algorithms. In red, we show the thirteen-month running average of the sunspot data, plotted with some width to distinguish it better from the other curves. Clearly, the linear Autoregressive (linear AR) algorithm performs the worst. For Cycle 22 the ESN is the best at forecasting the peak, followed by the vanilla RNN, LSTM, and GRU. For Cycle 23 the ESN and the vanilla RNN are able to capture the first peak of the data reasonably well; the other algorithms forecast a significantly lower number of sunspots near the peak. For Cycle 24 the ESN again makes the best forecast. Both the vanilla RNN and LSTM make reasonable forecasts, but the GRU forecasts a significantly larger number of sunspots near the peak than actually observed. Overall, the forecast for Cycle 22 is the least accurate. This may be because the earlier the cycle, the less data we have to train the algorithms. In Appendix B we provide a detailed quantitative comparison that supports the qualitative discussion here. Note that both LSTM and GRU have more fitting parameters than the ESN and also require more computing resources. In general, in machine-learning problems with limited data it is often observed that algorithms with too many parameters perform badly due to overfitting, whereas algorithms with too few parameters also perform badly. We conclude that the ESN is not only a physical model-validated network but also the appropriate algorithm for our purpose.
Next, we concentrate on our forecasts for Cycles 23, 24, and the ongoing Cycle 25 using the ESN. In Figure 4, we show the mean (green) and standard deviation (orange) of an ensemble of 10 independent forecasts for each case. For Cycles 23 and 24 we compare our forecast with the thirteen-month running average of the sunspot data (red). We also indicate in light blue how noisy the original sunspot data is by shading the standard deviation of the running average. The forecast for Cycle 23, obtained with regularization β = 10^-7, is quite accurate up to the first peak. The standard deviation of the forecast increases with time until the first peak, after which it decreases. For Cycle 24, also with β = 10^-7, the overall agreement is better but the standard deviation is larger near the peak.
By construction, the reliability of the forecast decreases as time progresses, because the forecast of one point depends on the forecast of the previous one. Consequently, small errors at early times are gradually magnified as the forecast progresses. As described in Section 2, we develop a variation on the standard ESN algorithm, the Modified ESN (MESN), to improve the forecasts. All forecasts given by the MESN are done with β = 10^-10, p = 15, and q = 129. Note that the value of q is approximately the number of months in an 11-year cycle (11 × 12), which means we forecast one cycle in one go. The forecasts are plotted in the bottom panel of Figure 4. Compared to the standard algorithm, our modified algorithm not only gives better results when tested against the observations for Cycles 23 and 24, it also gives more robust forecasts, as the standard deviation of the ensemble is smaller.
Our forecasts for the ongoing Cycle 25 with both algorithms are shown in the right-most column of Figure 4. Using the standard algorithm with β = 10^-6 we forecast that the maximum of Cycle 25 will occur between May and June 2024 and that the maximum number of sunspots will be 113 ± 15. Our modified algorithm gives a maximum that is flatter, almost constant between June 2023 and August 2024, with a maximum number of 124 ± 2 sunspots. Note that the averaged sunspot data shows a distinctive two-peak behavior in both Cycles 23 and 24 - a behavior not present in all sunspot cycles - that is not captured by either of our algorithms. We expect the same for Cycle 25: neither of our algorithms can forecast whether it may or may not have this two-peak feature. Both forecasts show that Cycle 25 is expected to reach a minimum near the beginning of the year 2030. Qualitatively, Cycle 25 is going to be weaker than Cycle 23 but stronger than Cycle 24. We note that this qualitative result differs from other machine-learning-based forecasts (Pala and Atici, 2019; Benson et al., 2020): the first forecasts that Cycle 25 will have similar strength to Cycle 23, while the second forecasts that Cycle 25 will be slightly weaker than Cycle 24. However, our result agrees with recent physics-based forecasts (Bhowmik and Nandy, 2018).
Next, we show how robust our forecasts are with respect to when we stop training and start forecasting. We expect the following: near the minima of the cycle the signal-to-noise ratio in the sunspot data is the lowest. Hence, if we stop training at the lowest point of a cycle we expect the worst result, assuming the level of noise is roughly constant in time. A look at the sunspot data shows that the noise is not constant but is actually significantly larger near the peak of the cycle - the noise increases as the signal increases. Nevertheless, the signal-to-noise ratio is less than unity near the minima of the cycles. In Figure 5 we show four representative cases with the standard ESN algorithm for Cycle 24. The best forecast is obtained when we are in the rising phase of the cycle. We also find that the standard ESN algorithm becomes unstable if the training is stopped very close to the most recent minimum. The forecast for Cycle 25 was generated in 2020, when we were very near its minimum; hence we expect our forecast to improve if we recalculate our results using the new data available now, in the middle of 2022.
We show the recalculated forecasts for each ESN algorithm in Figure 6. We observe that the ESN (a) gives a higher peak when trained with the new data, as the maximum amplitude has changed from 113 ± 15 to 131 ± 14 sunspots. The MESN (b) also increases the amplitude, from 124 ± 2 to 137 ± 2 sunspots, reaching better agreement with the ESN.

Summary and Conclusion
We use five different algorithms based on Recurrent Neural Networks (RNNs) to forecast Solar Cycle 25. First, we hindcast Cycles 23 and 24 with them and find that the Echo State Network (ESN) performs the best. We note that the ESN performs fairly well for the last five cycles (20 - 24), except for Cycle 21. But the forecasts are inaccurate beyond one cycle, even in the case of the dynamo model, for which we can generate very large amounts of data. This corroborates what physical model-based studies have already demonstrated - theoretical processes limit the cycle memory to only one solar cycle (Karak and Nandy, 2012; Muñoz-Jaramillo et al., 2013; Hazra, Brun, and Nandy, 2020). Although we have validated the ESN with a solar dynamo model, note that we have treated the dynamo model and the solar data separately: we have not first tuned the neural network on simulations of the dynamo model and then fine-tuned it on the solar data. We leave such a technique, called transfer learning, as future work; it may improve our forecasts, although not beyond one cycle.
This motivates the design of the Modified ESN (MESN) algorithm. The MESN produces more accurate and more robust forecasts, but it too remains limited to forecasting only one cycle ahead. We also note that the forecasts given by both algorithms strongly depend on when we stop training and start forecasting, and that the best forecasts are obtained when we are in the rising phase of the cycle. Therefore, we use the data until 2022 to forecast sunspot Cycle 25. The two algorithms agree that Cycle 25 is going to last for about 10 years. They also agree on the approximate time of the peak: the ESN places it in July 2024, while the MESN places it in April 2024. As for the maximum number of sunspots, the ESN forecasts 131 ± 14, whereas the MESN forecasts 137 ± 2. Hence, both algorithms forecast that Cycle 25 will be slightly stronger than Cycle 24, but weaker than Cycle 23.
We have taken a novel approach towards bridging the gap between physics-based and machine-learning-based solar-cycle forecasts by first applying our techniques with simulated data from a physics-inspired solar dynamo model and subsequently utilizing the same algorithm on observational data. Our forecast is consistent with physical model-based approaches.

Appendix A: ESN Hyperparameter Selection
Hyperparameters are those values that need to be selected by the user, regarding the algorithm architecture, the optimization method, and the regularization applied. In the previous sections, we mention which reservoir size N and which regularization parameter β we use for each cycle, but we do not explain why we select those values.

Table 2 Training MSE of the dynamo forecast of Figure 2 for different values of β. We use an ensemble of 10 independent forecasts and we calculate the training MSE for each one. We list the mean value of the error and its standard deviation.

A.1 Selecting the Reservoir Size
We always use a reservoir size of N = 1000 neurons. We could have used larger sizes for the dynamo problem, as it is recommended to set the reservoir size as large as possible (Lukoševičius, 2012). However, a thousand neurons is suitable for the sunspot problem considering our limited amount of training data. This is because the reservoir size determines the number of parameters we need to learn (via the matrix W_out); too many parameters under limited data give rise to stronger overfitting.

A.2 Appropriate Regularization for the Old ESN
We choose β by comparing different values of it against the Mean Squared Error (MSE) on the training data; as an example, we show the results for the dynamo problem of Section 3 in Table 2. We select the largest regularization that still gives a low training MSE, while avoiding values of β that give rise to forecasts and training MSEs with large standard deviations. We consider 10^-5 a good candidate: it is ten times larger than 10^-6 and yet gives an MSE of the same magnitude with a small standard deviation. We follow exactly the same procedure for choosing appropriate values of β when forecasting each solar cycle.
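The β-scan can be sketched as a loop over candidate values, computing the ridge solution and training MSE for each; the reservoir states below are synthetic stand-ins for the states collected during an actual training run.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 50, 300
X = rng.standard_normal((N, T))                 # stand-in reservoir states
y = rng.standard_normal(N) @ X + 0.01 * rng.standard_normal(T)

results = {}
for beta in [1e-3, 1e-4, 1e-5, 1e-6, 1e-7]:
    W_out = y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(N))
    results[beta] = np.mean((W_out @ X - y) ** 2)   # training MSE
```

One then picks the largest β whose training MSE (and its spread across the ensemble) remains of the same order as that obtained with smaller values.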

A.3 Appropriate Parameters for the MESN
Our MESN is much more robust to the random initialization of the reservoir weights than the classical ESN. We also find that it is capable of generating some individual forecasts of the sunspot cycles even without regularization. However, without regularization a single forecast can go completely wrong (giving thousands of sunspots) and ruin the mean and standard deviation of the ensemble. That is why we decided to apply the smallest regularization that allows building a stable ensemble of 10 forecasts for all cycles. We find that β = 10^-10 works best and we keep it constant for all forecasts produced with this algorithm. Finally, we select the amount of feedback history p by examining how the standard deviation of the ensemble of forecasts changes with p. We study the set of values p ∈ {2^0, 2^1, . . . , 2^10}. We observe that for values of p larger than 64 the standard deviation of the ensemble grows with p; large values such as p = 1024 give rise to unreliable forecasts with a standard deviation greater than a thousand sunspots. Among the small values of p between 1 and 64, we arbitrarily set p = 15 and use this value for all cycles.

Appendix B: Recurrent Neural Networks
The characteristic feature of Recurrent Neural Networks (RNNs) is that their connection topology possesses cycles (Jaeger, 2001). They are well known for their ability to model sequential data. We also use a linear Autoregressive model (linear AR) as a performance baseline for the RNN architectures. Having already explained how we train an ESN, we now show how we train the remaining RNN architectures: linear AR, vanilla RNN, LSTM, and GRU. The crucial difference between the RNNs we consider in this section and the ESN algorithm is that for these RNNs the elements of the matrices W_res and W_fb are also optimized.

B.1 Dataset Preparation
A common practice that helps when training machine-learning models is to normalize the data. For all algorithms we use min-max scaling, which maps the signal onto the interval [0, 1] with a linear transformation. We then create training, validation, and test datasets according to the solar cycle we want to forecast. For example, when forecasting Solar Cycle 22, the data from the 22nd solar cycle is the test data, while the data prior to that cycle is split into training and validation data in the ratio 90:10. We use the training set for updating the model parameters (learning) and the validation set for hyperparameter tuning. Finally, we check whether the forecast is accurate by comparing it with the test set, which is never used for training. An RNN model can predict either a single time step ahead sequentially or multiple time steps ahead; we denote this parameter the target size (q) and try to find the most suitable value for forecasting each solar cycle. Training an RNN requires learning a transformation from an input to a target, so from the training set we build input-target pairs: an input is a signal time window of length L, and its target is that window's future continuation. Mathematically, the tth input-target pair consists of the sequence of L points [z_t, z_{t+1}, z_{t+2}, . . . , z_{t+L−1}] as the input and the sequence [z_{t+L}, . . . , z_{t+L+q−1}] as the target.
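A minimal sketch of this preparation, with a synthetic signal and illustrative window sizes (`L`, `q`) standing in for the sunspot data and the tuned values:

```python
import numpy as np

def min_max_scale(z):
    """Linearly map the signal onto [0, 1]."""
    return (z - z.min()) / (z.max() - z.min())

def make_pairs(z, L, q):
    """Build (input, target) pairs: a window of L points and its
    q-point continuation, as described in the text."""
    n = len(z) - L - q + 1
    X = np.stack([z[t:t + L] for t in range(n)])
    Y = np.stack([z[t + L:t + L + q] for t in range(n)])
    return X, Y

# Synthetic stand-in for the monthly sunspot series.
signal = min_max_scale(np.sin(np.linspace(0, 20, 300)))
X, Y = make_pairs(signal, L=24, q=6)
# X.shape == (271, 24), Y.shape == (271, 6)
```

Each row of `X` is one input window and the matching row of `Y` is its q-step continuation, so consecutive rows overlap by L − 1 points.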

B.2 Training Details
We train the algorithms using batch gradient descent with a learning rate η = 0.001. We adaptively decrease η in a step-wise fashion by a factor of 0.9 after every one-third of the total number of epochs; one epoch is completed when the network has seen all the examples in the training set. We find that this helps prevent oscillations in the later stages of training. We use Adam (Kingma and Ba, 2015) as the optimizer and develop the RNN architectures in Python with the PyTorch library (Paszke et al., 2019). For evaluation, we use libraries from Scikit-learn (Pedregosa et al., 2011).
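The optimizer and schedule described above can be set up in PyTorch roughly as follows; the tiny vanilla RNN, linear readout, random data, and epoch count are placeholders, not the architectures or datasets used in the paper:

```python
import torch

# Adam with the step-wise learning-rate decay described above:
# eta shrinks by a factor of 0.9 after every third of the epochs.
model = torch.nn.RNN(input_size=1, hidden_size=16, batch_first=True)
readout = torch.nn.Linear(16, 1)
params = list(model.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)
epochs = 300
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=epochs // 3, gamma=0.9)

loss_fn = torch.nn.MSELoss()
x = torch.randn(8, 24, 1)   # placeholder batch of input windows
y = torch.randn(8, 1)       # placeholder one-step-ahead targets
for epoch in range(epochs):
    optimizer.zero_grad()
    out, h = model(x)
    pred = readout(out[:, -1, :])   # predict from the last hidden state
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```

After training, the learning rate has been reduced three times, ending at 0.001 × 0.9³.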

B.3 Hyperparameter Selection
Every RNN architecture has a set of hyperparameters that must be tuned to achieve the desired performance. For our training purposes we chose to tune the following: the number of hidden units in each layer (n_H), the number of hidden layers (n_l), and the number of epochs (N). We search for the best set of hyperparameters by exploring different combinations of them with a grid search (LeCun et al., 1998). Since MSE is our optimization metric, we pick the configuration that yields the minimum MSE on the validation set. As an example, Table 3 shows the validation MSE for different hyperparameter configurations of the GRU architecture, with the best configuration indicated in bold. Table 4 directly lists the best hyperparameters found for each RNN. For a fair comparison we also perform hyperparameter tuning for the linear AR model. Its main tunable parameter is the number of taps (n_T), which we tune by trying a range of values and monitoring the MSE on the validation set; in each case we train the model for 4000 epochs. We choose the value of n_T that gives the lowest validation MSE, which in our experiments is n_T = 34.
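A minimal grid-search sketch in the spirit of the above; the grid values and the toy scoring function are purely illustrative, not the paper's actual search space, and `train_and_validate` stands in for a real training run that returns the validation MSE:

```python
import itertools

def grid_search(train_and_validate):
    """Evaluate every combination of the hyperparameters and keep the
    one with the lowest validation MSE."""
    grid = {
        "n_hidden": [16, 32, 64],   # n_H (illustrative values)
        "n_layers": [1, 2],         # n_l
        "epochs": [2000, 4000],     # N
    }
    best, best_mse = None, float("inf")
    for combo in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), combo))
        mse = train_and_validate(config)
        if mse < best_mse:
            best, best_mse = config, mse
    return best, best_mse

# Toy scorer that prefers more hidden units and fewer layers.
best, mse = grid_search(
    lambda c: 1.0 / c["n_hidden"] + 0.1 * c["n_layers"])
```

An exhaustive grid is affordable here because only three hyperparameters with a handful of values each are tuned.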

B.4 Evaluation Metrics
We train the algorithms by minimizing the MSE loss. For evaluating the quality of their forecasts, we again use the MSE computed between the forecast and the corresponding true signal. We additionally report the Mean Absolute Error (MAE) and the coefficient of determination (R²-score). The R²-score is defined as

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − μ_y)²,

where y, ŷ, and μ_y denote the actual signal, predicted signal, and mean of the actual signal, respectively. The R²-score measures how much of the variability of the true signal is explained by the predicted signal. The best case is R² = 1.0, which indicates that the model completely explains the signal.
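The three metrics can be computed directly; this minimal sketch matches the standard definitions (and, for these simple cases, the Scikit-learn implementations the paper relies on):

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error between true and predicted signals."""
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean Absolute Error between true and predicted signals."""
    return float(np.mean(np.abs(y - y_hat)))

def r2_score(y, y_hat):
    """R^2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)."""
    return float(1.0 - np.sum((y - y_hat) ** 2)
                 / np.sum((y - y.mean()) ** 2))

y = np.array([1.0, 2.0, 3.0, 4.0])
assert r2_score(y, y) == 1.0                      # perfect forecast
assert r2_score(y, np.full(4, y.mean())) == 0.0   # mean-only baseline
```

Note that R² can be negative when a forecast explains the signal worse than the constant mean does.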

B.5 Comparison of Cycle Forecasts Using Different RNN Architectures
To forecast a solar cycle, we train and forecast 10 times with each of the algorithms (MESN, vanilla RNN, LSTM, and GRU). Then, we compute the mean and standard deviation of the ensemble of forecasts.
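The ensemble summary can be sketched as follows, with a stand-in `forecast` function (random numbers of a plausible sunspot magnitude) in place of a trained model:

```python
import numpy as np

def forecast(rng):
    """Stand-in for one trained model's 12-month sunspot forecast."""
    return 100 + 5 * rng.standard_normal(12)

# Run the forecaster 10 times and report mean +/- std per month.
rng = np.random.default_rng(1)
runs = np.stack([forecast(rng) for _ in range(10)])  # shape (10, 12)
ensemble_mean = runs.mean(axis=0)
ensemble_std = runs.std(axis=0)
```

The ensemble mean is the quoted forecast and the per-month standard deviation gives the uncertainty band plotted around it.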

B.6 Discussion
From the results, we find that the performance of the MESN is quite impressive considering the ease of its training. In some cases, the vanilla RNN and LSTM perform comparably to or occasionally better than the MESN, but on average the MESN is the best-performing algorithm. Among the RNN architectures, the performance of the vanilla RNN is quite good on average, as indicated by the MSE and MAE metrics. The performance of the GRU model is sometimes unsatisfactory compared to the MESN, which is a little surprising considering that it is widely acclaimed for several sequence-modeling tasks (Chung et al., 2014; Che et al., 2018; Ravanelli et al., 2018). The linear AR model, being a rather naive forecasting model, fails to fit the curves accurately and performs the poorest among all the algorithms in almost all the cases where testing data is available. For Cycle 25, we find that the vanilla RNN and LSTM forecasts are in close agreement, and the times of their peak amplitudes are also close to that of the MESN.

Acknowledgment AEF acknowledges Marta Vilademunt for useful feedback on the paper figures. We thank other participants in our SPARC project, particularly Sourangshu Bhattarcharya, for stimulating discussions. DM thanks the participants in the machine-learning seminars at NORDITA organized by S-H Lim for useful inputs. DN acknowledges a Visiting Professorship grant from the Wenner-Gren Foundation for facilitating his visit to NORDITA, Stockholm in 2018, when ideas related to this project were developed. We thank the anonymous referee for pointing out early work on the application of neural networks to forecast solar cycles and the transfer-learning technique.
Author contributions The project was planned by DM, DN and SC. AEF coded the two different ESN algorithms and applied them on both sunspot data and on data from the dynamo model. He is responsible for Figures 1, 2, 4, 5, 6. AG coded all remaining RNN algorithms and applied them on sunspot data, creating the forecasts shown in Figures 3, 7, 8. The paper was written largely by DM and DN with input from all the authors.
Funding Note Open access funding provided by Stockholm University. The collaboration between NORDITA and KTH on the one hand and IISER Kolkata on the other hand has been made possible through the SPARC project grant SPARC/2018-2019/P746/SL of the Government of India, which funded academic visits to facilitate this research. The Center of Excellence in Space Sciences India is funded by IISER Kolkata, Ministry of Education, Government of India. DM is supported by two grants from the Swedish research council, Nos. 638-2013-9243 and 2016-05225. AEF's stay at NORDITA was supported by a grant from the Swedish Research Council. Part of the work was done during an earlier visit by AEF to NORDITA. The visit was supported by the Erasmus+ Programme, where AEF did a three-month internship at NORDITA as a recent graduate from the University of Barcelona.

Data Availability
The data from the dynamo model was obtained by solving Equation (4). The monthly sunspot data is publicly available and it was obtained from the SILSO World Data Center (2015).

Competing interests
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.