1 Introduction

Emerging technologies that enable the real-time monitoring of athletes in training and competition have fostered interest in methods to predict and optimize athlete performance. Predictive models for how much an athlete “has left in the tank” enable the investigation of pacing strategies (Behncke, 1997; Sundström et al., 2014; de Jong et al., 2017) and the dynamic adjustment of strategies to optimize the outcome of a competition (Hoogkamer et al., 2018). Such models can be described as a digital athlete, i.e., a computer-based model for enhancing training programming or strategy optimization. A foundation for such advances is research in performance modeling, which can be understood as the mathematical abstraction of exercise physiology.

1.1 Critical-power-based approaches to performance modeling

One of the seminal models in the area of performance modeling is the critical power model, which relies on the notions of a critical power (CP) and a finite energy reserve for work above critical power \((W')\) (Hill, 1993; Whipp et al., 1982). Monod and Scherrer (1965) defined CP as “the maximum work rate a muscle can keep up for a very long time without fatigue”. Thus, CP can be considered a threshold for sustainable exercise. W’ represents a capacity for work to be performed at a rate above CP and is conceptualized as an energy store. Using the definitions of time to exhaustion (TTE) and constant power output (P), the critical power model can be summarized in the relationship

$$\begin{aligned} TTE = \frac{{W'}}{{P}-{CP}}. \end{aligned}$$
(1.1)

To determine CP and W’, an athlete has to conduct between three and five exercise tests until exhaustion at various constant exercise intensities. CP and W’ are then fitted to these distinct TTE and P observations and their relationship can be used to predict the time to exhaustion at other intensities. Hill (1993) emphasized that the attractiveness of the model lies in its coarse simplicity and it should not be employed if highly accurate predictions are required. Nevertheless, its straightforward application and its elegant abstraction have led to improved understanding and prediction of performance dynamics (Poole et al., 2016; Sreedhara et al., 2019; Vanhatalo et al., 2011).
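As a minimal sketch of how Eq. 1.1 can be applied, the following Python snippet predicts TTE and fits CP and W’ by ordinary least squares. It uses the linear work-time form \(P \cdot TTE = W' + CP \cdot TTE\), which follows from rearranging Eq. 1.1 and is only one of several possible fitting approaches. The function names and the test values are illustrative assumptions; the data are synthetic values generated from the model itself, not measurements from any cited study.

```python
import numpy as np


def time_to_exhaustion(power, cp, w_prime):
    """Eq. 1.1: TTE = W' / (P - CP), defined only for P > CP."""
    if power <= cp:
        raise ValueError("Eq. 1.1 only applies to power outputs above CP")
    return w_prime / (power - cp)


def fit_cp_model(powers, ttes):
    """Least-squares fit of the linear work-time form W = W' + CP * TTE."""
    powers = np.asarray(powers, dtype=float)
    ttes = np.asarray(ttes, dtype=float)
    work = powers * ttes  # total work of each exhaustive test in J
    cp, w_prime = np.polyfit(ttes, work, 1)  # slope = CP, intercept = W'
    return float(cp), float(w_prime)


# Synthetic example (assumed values): four tests generated from
# CP = 250 W and W' = 20 000 J, then recovered by the fit.
true_cp, true_w_prime = 250.0, 20000.0
test_powers = [300.0, 330.0, 360.0, 400.0]
test_ttes = [time_to_exhaustion(p, true_cp, true_w_prime) for p in test_powers]
fitted_cp, fitted_w_prime = fit_cp_model(test_powers, test_ttes)
```

With real test data the fitted values would not reproduce the observations exactly, and the residuals give a first impression of how well the two-parameter model describes an athlete.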

While the critical power model predicts energy expenditure at high intensities, it does not consider the recovery of W’ after exercise has ended or during exercise at low intensities. Formally, exercise protocols that alternate between intensities below CP and above CP constitute intermittent exercise. In order to predict performance capabilities of athletes during intermittent exercise, models need to predict recovery of W’ during phases of exercise below the CP intensity.

One of the most widely covered approaches to predict recovery of W’ during intermittent exercise is the \(W'\) balance (\(W'_{\mathrm{bal}}\)) model (Sreedhara et al., 2019; Jones & Vanhatalo, 2017). Since the first publication by Skiba et al. (2012), an updated form of \(W'_{\mathrm{bal}}\) was introduced by Skiba et al. (2015) and another alternative form was proposed by Bartram et al. (2018). \(W'_{\mathrm{bal}}\) models have been used to search for optimal drafting strategies in running (Hoogkamer et al., 2018) or to predict phases of perceived exhaustion during cycling exercise (Skiba et al., 2014).

Despite these advances, research into energy recovery modeling is an evolving field, and \(W'_{\mathrm{bal}}\) models have been scrutinized for their limitations. Similar to energy expenditure dynamics, recovery dynamics are derived from exhaustive exercise tests and therefore data for model fitting and validation are sparse (Vanhatalo et al., 2011; Sreedhara et al., 2019). Recent findings suggest that current \(W'_{\mathrm{bal}}\) models overly simplify energy recovery dynamics, and that model modifications that account for characteristics of prior exhaustive exercise (Caen et al., 2019) as well as bi-exponential recovery dynamics (Caen et al., 2021) might improve recovery predictions. These modified models feature additional parameters, which introduces challenges in fitting them to small data sets. Indeed, the search for models that optimally balance complexity with applicability to few data points is a primary challenge of energy recovery modeling.

1.2 Hydraulic models of human performance

Hydraulic models offer an alternative to \(W'_{\mathrm{bal}}\) models for predicting energy expenditure and recovery dynamics during exercise. Instead of using CP and W’, they represent energy dynamics as liquid flow within a system of tanks and pipes. These tanks and pipes are arranged according to physiological parameters such as maximal oxygen uptake and estimated phosphocreatine levels of an athlete. The first hydraulic model was proposed by Margaria (1976) to provide an intuitive conceptualization of bioenergetic responses to exercise. Morton (1986) further elaborated the model, formalized its dynamics with differential equations, and published it as the Margaria-Morton (M-M) model. Later, Sundström (2016) proposed an extension of the M-M model and named it the Margaria-Morton-Sundström (M-M-S) model. Compared to \(W'_{\mathrm{bal}}\) models, hydraulic models can predict more complex energy expenditure and recovery dynamics and have the potential to address the shortcomings of \(W'_{\mathrm{bal}}\) models that have been highlighted in recent literature.

A challenge of these hydraulic models is that their parameters require in-depth knowledge about bioenergetic systems. Indeed, Morton (2006) concluded that it remained to be seen to what extent model predictions conform to reality. Also, the more recently proposed M-M-S model by Sundström (2016) has yet to be validated experimentally. Behncke (1997) applied the M-M model to world records in competitive running, and while the predictions agreed with values provided in the literature, he also pointed out situations in which the naive interpretation of the model would not be justified. Furthermore, Behncke (1997) stated that constraints dictated by physiological conditions made explicit computations with the M-M model “rather cumbersome”. Collectively, the requirement to set parameters according to physiological measures impedes the application and validation of the M-M and M-M-S models.

To overcome the issues caused by ascribing the model parameters to concrete bioenergetic measures, we proposed a generalized form of the M-M hydraulic model in Weigend et al. (2021). Our generalized hydraulic model allows the fitting of its parameters using an optimization approach that only requires CP and W’ as inputs. In this way, our modified model preserves the flexibility needed to model the observed dynamics without requiring strict correspondence to parameters pertaining to bioenergetics. In a proof-of-concept, we showed that our fitted hydraulic model could successfully predict both energy expenditure and recovery kinetics for one example case, the former in line with predictions of the CP model and the latter in a manner that matches published observations. While the generalized hydraulic model demonstrated satisfactory predictivity, it is still unknown whether it can outperform the existing \(W'_{\mathrm{bal}}\) models.

Therefore, in this work, we compare the prediction quality of our generalized hydraulic model from Weigend et al. (2021) to that of three \(W'_{\mathrm{bal}}\) models. We hypothesized the hydraulic model would predict the observed recovery ratios compiled from past studies overall more accurately than the \(W'_{\mathrm{bal}}\) models. We found that the hydraulic model outperformed the \(W'_{\mathrm{bal}}\) models on objective goodness-of-fit and prediction metrics. We conclude that the generalized hydraulic model provides a beneficial new perspective on energy recovery modeling that should be investigated further.

2 Material and methods

The Materials and methods are structured in the following way. In Sect. 2.1, the \(W'_{\mathrm{bal}}\) and hydraulic models are defined and their underlying assumptions specified. We then introduce a new \(W'_{\mathrm{bal}}\) model that has been fitted to the same recovery data as the investigated hydraulic model. In Sect. 2.2, we discuss how we will objectively compare the \(W'_{\mathrm{bal}}\) and hydraulic models. In particular, we propose a procedure to obtain comparable recovery ratios. The validation data set consists of data compiled from previously published studies on the recovery from exercise. Section 2.3 lists the data exclusion criteria and extraction procedures. Finally, in Sect. 2.4, we describe the metrics used to assess the goodness of fit and the prediction capabilities of the models.

2.1 Model definitions

The CP, \(W'_{\mathrm{bal}}\), and the hydraulic models feature assumptions and parameters that require defining.

2.1.1 Energy expenditure and recovery with \(W'_{\mathrm{bal}}\)

The critical power model predicts energy expenditure and is the underlying model for the \(W'_{\mathrm{bal}}\) models. The four essential assumptions of the critical power model are stated as follows (Hill, 1993; Morton, 2006):

  1. An individual’s power output is a function of two energy sources: aerobic (using oxidative metabolism) and anaerobic (non-oxidative metabolism).

  2. Aerobic energy is unlimited in capacity but its conversion rate into power output is limited (CP).

  3. Anaerobic energy is limited in capacity (W’) but its conversion rate is unlimited.

  4. Exhaustion occurs when all of W’ is depleted.

These assumptions are reflected in Eq. 1.1, in which time to exhaustion (TTE) is estimated from the available W’ divided by the power output in excess of CP. At every time step during which an athlete exercises above CP, the product of the time elapsed and the difference between the actual power output and CP is subtracted from the energy capacity W’. Thus, during a constant power output above CP, the critical power model predicts a linear depletion of W’. When W’ is depleted, exhaustion is reached.

Fig. 1

Example energy expenditure and recovery predictions of two models for intermittent exercise. W’ expenditure during exercise above CP (the red dashed line in the lower panel) was modeled using the CP model while W’ recovery during exercise below CP was modeled using either the \(W'_{\mathrm{skib}}\) or the \(W'_{\mathrm{bart}}\) model. Differences in predicted recovery kinetics are clearly visible

As observable in the example in Fig. 1, subsequently established \(W'_{\mathrm{bal}}\) models combine the assumed linear depletion of W’ at power outputs above CP with predictions for W’ reconstitution during exercise below CP. The initial \(W'_{\mathrm{bal}}\) model by Skiba et al. (2012) was later updated in Skiba et al. (2015). Substantial differences between these versions exist, and as shown by Skiba and Clarke (2021), the original model by Skiba et al. (2012) contradicts the assumption of Eq. 1.1 that W’ linearly depletes. As such, we focused on the updated version by Skiba et al. (2015) and we refer to it henceforth as \(W'_{\mathrm{bal-ode}}\). We denote the remaining capacity of W’ at a discrete time point t during exercise as \(W^\prime _{\mathrm {bal-ode}_{t}}\). \(P_t\) refers to the power output at a discrete time step t. \(\Delta t\) is the difference between the discrete time step \(t-1\) and t in seconds. We define the overall \(W'_{\mathrm{bal-ode}}\) model as

$$\begin{aligned} W^\prime _{\mathrm {bal-ode}_{t}} = {\left\{ \begin{array}{ll} W^\prime _{\mathrm {bal-ode}_{t-1}} - (P_t - {CP})\Delta t, &{} \text {for} \; P_t \ge {CP}\\ {W'} - ({W'} - W^\prime _{\mathrm {bal-ode}_{t-1}}) \cdot e^{\frac{-\Delta t}{\mathcal {T}_t}}, &{} \text {for} \; P_t < {CP}. \\ \end{array}\right. } \end{aligned}$$
(2.1)

During a constant power output above or at CP (\(P_t \ge \textit{CP}\)), \(W^\prime _{\mathrm {bal-ode}_{t}}\) decreases linearly as t increases, like the critical power model predicts. During power outputs below CP, \(W^\prime _{\mathrm {bal-ode}_{t}}\) increases exponentially with W’ as its asymptote. \(\mathcal {T}_t\) affects recovery speed and varies between distinct \(W'_{\mathrm{bal-ode}}\) models. At a discrete time step t, the \(\mathcal {T}_{t}\) of the \(W'_{\mathrm{bal-ode}}\) model by Skiba et al. (2015) (\(\mathcal {T}_{\mathrm {skib}_t}\)) is estimated as

$$\begin{aligned} \mathcal {T}_{\mathrm {skib}_t} = \frac{{W'}}{D_{{CP}_t}}, \end{aligned}$$
(2.2)

where \(D_{\textit{CP}_t}\) represents the difference between CP and \(P_t\), i.e., \(\textit{CP} - P_t\), which is positive during recovery. Henceforth, we refer to Eq. 2.1 with Eq. 2.2 as \(W'_{\mathrm{skib}}\). Figure 1 depicts an example for \(W'_{\mathrm{skib}}\) predictions. For the time steps of the first 3 min, \(P_t\) was below CP and \(W^\prime _{\mathrm {bal-ode}_{t}}\), i.e., the available \(W^\prime \) balance, remained at its maximum. Then, the power output increased above CP for the next 3 min. The available W’ balance decreased by \(P_t - \textit{CP}\) per second. Between 6 min and 9 min, \(P_t\) dropped below CP again, \(W'_{\mathrm{skib}}\) simulated recovery, and \(W^\prime _{\mathrm {bal-ode}_{t}}\) rose exponentially with W’ as its asymptote. Speed of recovery was affected by \(\mathcal {T}_{\mathrm {skib}_t}\) from Eq. 2.2, which took the difference between \(P_t\) and CP into account. That is observable by comparing recovery between 6 min and 9 min to recovery between 12 min and 21 min. During the second recovery bout, \(P_t\) was lower and therefore the slope of the exponential recovery was steeper. During the last 3 min, \(P_t\) was equal to CP and thus \(W^\prime _{\mathrm {bal-ode}_{t}}\) did not change. If \(W^\prime _{\mathrm {bal-ode}_{t}}\) reached 0 J, exhaustion would be predicted. In the example in Fig. 1, the athlete was predicted to be close to exhaustion, but some of their energy capacity remained.
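The update rule of Eq. 2.1 combined with Eq. 2.2 can be sketched in a few lines of Python. The snippet below is an illustrative implementation under stated assumptions, not the code used for this study: the function names, the athlete values (CP = 250 W, W’ = 20 000 J), and the power profile are hypothetical and were chosen only to show depletion, recovery, and the steady state at CP.

```python
import math


def tau_skiba(d_cp, w_prime):
    """Eq. 2.2: time constant from W' and D_CP = CP - P_t (requires D_CP > 0)."""
    return w_prime / d_cp


def w_bal_ode_step(w_bal_prev, power, cp, w_prime, dt=1.0):
    """One discrete update of Eq. 2.1 using the time constant of Eq. 2.2."""
    if power >= cp:
        # Linear depletion at or above CP.
        return w_bal_prev - (power - cp) * dt
    # Exponential recovery towards the asymptote W' below CP.
    tau = tau_skiba(cp - power, w_prime)
    return w_prime - (w_prime - w_bal_prev) * math.exp(-dt / tau)


def simulate(powers, cp, w_prime, dt=1.0):
    """Return the W' balance after every time step, starting from a full W'."""
    balance = w_prime
    history = []
    for power in powers:
        balance = w_bal_ode_step(balance, power, cp, w_prime, dt)
        history.append(balance)
    return history


# Hypothetical athlete and 1-s power profile (illustrative values only):
# 3 min below CP, 3 min above CP, 3 min of recovery, 1 min exactly at CP.
CP, W_PRIME = 250.0, 20000.0
protocol = [200.0] * 180 + [330.0] * 180 + [150.0] * 180 + [250.0] * 60
trace = simulate(protocol, CP, W_PRIME)
```

In this sketch the balance stays at W’ during the first segment, drops by 80 J per second during the second, recovers exponentially during the third, and remains unchanged while the power output equals CP.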

Bartram et al. (2018) investigated the recovery rate of W’ of elite cyclists and observed faster recovery rates than Skiba et al. (2015). Therefore, they proposed another \(\mathcal {T}_t\) for Equation 2.1 to predict faster recovery. The \(\mathcal {T}_{t}\) by Bartram et al. (2018) (\(\mathcal {T}_{\mathrm {bart}_t}\)) was defined as

$$\begin{aligned} \mathcal {T}_{\mathrm {bart}_t} = 2287.2 \cdot {D_{\textit{CP}_t}}^{-0.688}. \end{aligned}$$
(2.3)

Henceforth, we will refer to Equation 2.1 with Equation 2.3 as \(W'_{\mathrm{bart}}\). Predictions of \(W'_{\mathrm{bart}}\) are depicted alongside those of \(W'_{\mathrm{skib}}\) in the example in Fig. 1. It is observable that \(W'_{\mathrm{bart}}\) predicted faster recovery dynamics.
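To make the difference between Eq. 2.2 and Eq. 2.3 concrete, the sketch below evaluates both time constants for the same recovery intensity. It relies on the closed form \(1 - e^{-T/\mathcal {T}}\) for the fraction of W’ restored after T seconds, which follows from iterating the recovery branch of Eq. 2.1 with a constant \(\mathcal {T}_t\) and a fully depleted starting balance. The function names are assumptions made for illustration; the W’ value is the group average of the Bartram et al. (2018) cyclists reported in Sect. 3.1.

```python
import math


def tau_skiba(d_cp, w_prime):
    """Eq. 2.2: time constant depends on W' and D_CP."""
    return w_prime / d_cp


def tau_bartram(d_cp):
    """Eq. 2.3: time constant depends on D_CP only."""
    return 2287.2 * d_cp ** -0.688


def recovered_fraction(tau, t_rec):
    """Fraction of W' restored after t_rec s at constant intensity,
    starting from a fully depleted W' (Eq. 2.1 with constant tau)."""
    return 1.0 - math.exp(-t_rec / tau)


# Group-averaged W' from Sect. 3.1 and one of the prescribed D_CP values.
w_prime = 23300.0
d_cp = 200.0

tau_s = tau_skiba(d_cp, w_prime)
tau_b = tau_bartram(d_cp)
frac_s = recovered_fraction(tau_s, 60.0)
frac_b = recovered_fraction(tau_b, 60.0)
print(f"tau_skib = {tau_s:.1f} s, tau_bart = {tau_b:.1f} s")
print(f"recovered after 60 s: skib {frac_s:.2f}, bart {frac_b:.2f}")
```

For these inputs the time constant of Eq. 2.3 is the smaller of the two, so the corresponding model recovers a larger share of W’ within the same 60 s.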

2.1.2 The hydraulic model

In Weigend et al. (2021) we expressed our generalized hydraulic tank model mathematically as a system of discretized differential equations with eight parameters (A.3 and A.4 in the Appendix of that publication). Henceforth, the model will be referred to as the hydraulic M-M model by Morton (1986) as interpreted by Weigend et al. (2021) (\(\mathrm {hydraulic}_\mathrm {weig}\)), a schematic of which is depicted in Fig. 2. \(\mathrm {hydraulic}_\mathrm {weig}\) models power output of an athlete as a function of three interacting energy sources, which are represented as liquid-containing tanks. As depicted in Fig. 2, these tanks are named the aerobic energy source (Ae), anaerobic fast energy source (AnF) and anaerobic slow energy source (AnS). Ae is assumed to have infinite volume, which is indicated by the fading color to the left. A pipe connects Ae to the middle tank and has the maximal flow capacity \(m^{Ae}\). The pipe from the right tank into the middle tank allows flow in both directions and has the maximal flow capacity \(m^{AnS}\) from AnS into AnF and \(m^{AnF}\) from AnF into AnS. A tap (p) is attached to the bottom of the middle tank AnF and liquid flow from this tap represents energy demand. The fill levels of tanks and flows through pipes change as liquid flows from the tap. If the athlete expends energy, the liquid level of AnF drops and initiates flow from Ae and AnS into AnF. When the middle tank is empty, liquid flow out of p no longer matches the demand and exhaustion is assumed.

In the situation depicted in Fig. 2, p was opened and liquid flowed out of AnF. As a result, the liquid level in AnF dropped and thus liquid started to flow from Ae into AnF. The more the fill level of the middle tank AnF dropped, the less liquid pressed against the pipe exit from Ae, and the more the flow from Ae increased. In Fig. 2, the flow out of p was so large that the fill level of AnF dropped below the top of AnS (\(h>\theta \)) and liquid from AnS started to flow into AnF too. The liquid volume in AnF is limited such that it can only contribute to flow out of p for a limited time. If the simulated athlete stopped exercise, then their power output would decrease to 0 and the tap p would close. If the tap was closed in the situation depicted in Fig. 2, liquid from the outer tanks Ae and AnS would refill the middle tank AnF until its fill level rose above the fill level of AnS. Then, liquid from Ae would continue to refill AnF and liquid from AnF would flow into AnS. The model would mimic recovery and Ae would eventually refill both other tanks.

Fig. 2

A three-component hydraulic model as defined by Weigend et al. (2021). Tanks are conceptualized as the aerobic energy source (Ae), anaerobic fast energy source (AnF), and anaerobic slow energy source (AnS). Ae is assumed to be infinite in volume, which is indicated by the fading color to the left. A tap p is attached to the bottom of AnF and flow from it represents energy demand. Pipes connecting the three tanks have maximal flow capacities \(m^{Ae}\), \(m^{AnS}\), and \(m^{AnF}\)

Morton (1986) mathematically expressed liquid flows within this system as first- and second-order ordinary differential equations and in Weigend et al. (2021) we extended these equations so that they apply to all possible configurations of our generalized model. Due to liquid pressure and flow dynamics, varying fill levels of the middle and the right tank affect how flow from the tap p is estimated. With these interactions between three tanks, \(\mathrm {hydraulic}_\mathrm {weig}\) is capable of predicting energy expenditure and recovery as a more complex function than the above introduced critical power and \(W'_{\mathrm{bal-ode}}\) models. It features eight adjustable parameters. As depicted in Fig. 2, the parameters \(\phi \), \(\gamma \) and \(\theta \) represent tank positions, AnF and AnS represent tank capacities measured in J, and \( m^{Ae}\), \(m^{AnF}\) and \(m^{AnS}\) represent maximal flow capacities in W. A configuration c for \(\mathrm {hydraulic}_\mathrm {weig}\) entails the positions, sizes and capacities of each tank and is therefore defined as

$$\begin{aligned} c = [ AnF, AnS, m^{Ae}, m^{AnS}, m^{AnF}, \phi , \theta , \gamma ]. \end{aligned}$$
(2.4)

Fitting \(\mathrm {hydraulic}_\mathrm {weig}\) to an athlete means finding the configuration that enables the model output to best reproduce the observed exercise responses of that athlete. In Weigend et al. (2021) we introduced an evolutionary computation workflow to derive such configurations. We fitted a configuration to CP and W’ measures of an athlete as well as recovery ratios derived from a publication by Caen et al. (2019). The same recovery ratios are used for every fitting and thus, to fit \(\mathrm {hydraulic}_\mathrm {weig}\) to an athlete, only CP and W’ of the athlete are required. These are the same measures that are needed to apply \(W'_{\mathrm{bal-ode}}\) models and, hence, the required input measures are the same for all compared \(\mathrm {hydraulic}_\mathrm {weig}\) and \(W'_{\mathrm{bal-ode}}\) models. Following our fitting method in Weigend et al. (2021), we ran 10 evolutionary fittings to the given CP and W’ and used the best one.

2.1.3 An additional \(W'_{\mathrm{bal}}\) model

The above introduced models \(W'_{\mathrm{skib}}\), \(W'_{\mathrm{bart}}\), and \(\mathrm {hydraulic}_\mathrm {weig}\) were each created by fitting to different sets of recovery observations. Objective metrics to compare model quality, e.g., the Akaike Information Criterion (Burnham & Anderson, 2004), require models to be fitted to the same data. Therefore, to allow a more comprehensive comparison, we added a third \(W'_{\mathrm{bal-ode}}\) model with a new \(\mathcal {T}_{t}\) derived from the same recovery measures that were used to fit \(\mathrm {hydraulic}_\mathrm {weig}\) (\(\mathcal {T}_{\mathrm {weig}_t}\)). We derived this \(\mathcal {T}_{\mathrm {weig}_t}\) with a procedure as close as possible to the ones of Skiba et al. (2012) and Bartram et al. (2018).

As the first step, a constant value for \(\mathcal {T}_t\) in Equation 2.1 was fitted to each recovery ratio and recovery time combination from Table 1 of the Appendix of Weigend et al. (2021). For these observations power output was constant for every discrete time step t during recovery and thus \(\mathcal {T}_t\) could be considered as constant with the same value for all t. We used the standard Broyden-Fletcher-Goldfarb-Shanno algorithm implementation of SciPy (SciPy 1.0 Contributors et al., 2020) with 200 as the initial guess to fit a constant \(\mathcal {T}_t\) that enabled Equation 2.1 to best reproduce the observed recovery ratio. This resulted in twelve pairs of fitted constant \(\mathcal {T}_t\)s to constant recovery intensities.
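This first step can be sketched with `scipy.optimize.minimize` and `method="BFGS"`. The snippet is a hedged illustration, not the original fitting code: it assumes a squared-error objective on the recovery ratio in percent and assumes that recovery starts from a fully depleted W’, so that Eq. 2.1 with a constant \(\mathcal {T}_t\) reduces to the closed form \(100 \cdot (1 - e^{-T_\mathrm {rec}/\mathcal {T}})\). The single observation is synthetic and was generated with a known time constant of 450 s.

```python
import numpy as np
from scipy.optimize import minimize


def predicted_recovery_ratio(tau, t_rec):
    """Recovery ratio in % after t_rec s of recovery from full depletion
    (closed form of Eq. 2.1 for a constant tau)."""
    return 100.0 * (1.0 - np.exp(-t_rec / tau))


def fit_constant_tau(observed_ratio, t_rec, initial_guess=200.0):
    """Fit one constant tau so that Eq. 2.1 reproduces one observed ratio."""

    def squared_error(x):
        # Assumed objective: squared difference in percentage points.
        return (predicted_recovery_ratio(x[0], t_rec) - observed_ratio) ** 2

    result = minimize(squared_error, x0=[initial_guess], method="BFGS")
    return float(result.x[0])


# Synthetic observation (not from the cited studies), generated with tau = 450 s.
t_rec_example = 240.0
ratio_example = predicted_recovery_ratio(450.0, t_rec_example)
tau_fitted = fit_constant_tau(ratio_example, t_rec_example)
print(f"fitted tau: {tau_fitted:.1f} s")
```

Repeating this fit for each recovery ratio and recovery time combination yields one constant \(\mathcal {T}_t\) per condition, which can then be paired with the corresponding recovery intensity.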

As the next step, we fitted an exponential function to these twelve pairs using the non-linear least squares implementation of SciPy (SciPy 1.0 Contributors et al., 2020). With the recovery intensity as \(D_{\textit{CP}_t}\), the function was of the form \(\mathcal {T}_t = a \cdot e^{D_{\textit{CP}_t}\cdot {b}} + c\). With the values of Skiba et al. (2012) as the initial guess (546, -0.01, 316), the resulting optimal constants were \(a = 1274.45, b=-0.0308,\) and \( c= 266.65\). Thus, given any \(D_{{CP_t}}\) at a discrete time step t, \(\mathcal {T}_{\mathrm {weig}_t}\) can be estimated as

$$\begin{aligned} \mathcal {T}_{\mathrm {weig}_t} = 1274.45 \cdot e^{-0.0308 \cdot D_{{\textit{CP}}_t}} + 266.65. \end{aligned}$$
(2.5)

Unfortunately, this fitted equation did not fit the data satisfactorily (\(R^2=0.14\)), but we nevertheless used it because it was developed using a procedure that closely resembled those used to estimate \(\mathcal {T}_{\mathrm {skib}_t}\) and \(\mathcal {T}_{\mathrm {bart}_t}\). The introduction of \(\mathcal {T}_{\mathrm {weig}_t}\) is valuable because it allows the application of the Akaike Information Criterion metric, which requires compared models to be fitted to the same data points. Henceforth, Equation 2.1 with Equation 2.5 will be referred to as the \(W'_{\mathrm{bal-ode}}\) model with \(\mathcal {T}_{\mathrm {weig}_t}\) (\(W^\prime _\mathrm {weig}\)).
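The second fitting step and Eq. 2.5 can be illustrated with `scipy.optimize.curve_fit`. The following sketch makes several assumptions: the function names are hypothetical, and the twelve (\(D_{CP}\), \(\mathcal {T}\)) pairs are synthetic values generated from Eq. 2.5 itself, because the fitted pairs of the study are not listed in this section. The synthetic data are therefore reproduced almost perfectly, which is not representative of the \(R^2\) reported above for the real data.

```python
import numpy as np
from scipy.optimize import curve_fit


def tau_exponential(d_cp, a, b, c):
    """Functional form used for the fit: tau = a * exp(d_cp * b) + c."""
    return a * np.exp(d_cp * b) + c


def tau_weig(d_cp):
    """Eq. 2.5 with the constants reported in the text."""
    return tau_exponential(d_cp, 1274.45, -0.0308, 266.65)


# Synthetic pairs generated from Eq. 2.5 (NOT the fitted pairs of the study).
d_cp_values = np.linspace(10.0, 200.0, 12)
tau_values = tau_weig(d_cp_values)

# Initial guess as stated in the text.
initial_guess = (546.0, -0.01, 316.0)
(a_fit, b_fit, c_fit), _ = curve_fit(
    tau_exponential, d_cp_values, tau_values, p0=initial_guess, maxfev=10000
)
print(f"a = {a_fit:.2f}, b = {b_fit:.4f}, c = {c_fit:.2f}")
```

The `maxfev` argument is only a safeguard that raises the evaluation budget of the solver; it is not taken from the described procedure.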

2.2 Procedure for computing comparable recovery predictions

Fig. 3

A schematic of the protocol to estimate recovery ratios. An exhaustive work bout (WB1) at a set intensity (\(P_\mathrm {work}\)) is prescribed. Immediately after exhaustion is reached, a recovery bout (RB) follows at a lower recovery intensity (\(P_\mathrm {rec}\)) for a set duration (\(T_\mathrm {rec}\)). Then, a second exhaustive work bout (WB2) is conducted and the ratio of the time to exhaustion of WB2 to the one of WB1 represents the amount that was recovered during RB

We compare the ability of the above-defined \(W'_{\mathrm{bal-ode}}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) models to predict “recovery ratios”. Recovery ratios are computed from exercise protocols involving two exhaustive work bouts (WB1 and WB2) interspersed with a recovery bout (RB). A schematic of the protocol is depicted in Fig. 3. First, the model simulates exercise at a fixed work intensity (\(P_\mathrm {work}\)) above CP until exhaustion. Immediately after exhaustion is reached, exercise intensity switches to a lower recovery intensity (\(P_\mathrm {rec}\)) below CP. After a set time (\(T_\mathrm {rec}\)) at that recovery intensity, a second work bout (WB2) until exhaustion at \(P_\mathrm {work}\) is simulated. The time to exhaustion of WB2 is expected to be shorter than the one of WB1 due to the limited recovery time in between the work bouts. Because it is assumed that W’ is completely depleted at the end of WB1, the ratio of the second time to exhaustion to the first represents the amount of W’ recovered. Thus, the time to exhaustion of WB2 divided by the time to exhaustion of WB1 multiplied by 100 results in a recovery ratio in percent (%).

This outlined procedure aligns with the assumptions of the CP and \(W'_{\mathrm{bal-ode}}\) models and enables the direct comparison of the simulated recovery ratios of each model (\(\mathrm {hydraulic}_\mathrm {weig}\), \(W'_{\mathrm{skib}}\), \(W'_{\mathrm{bart}}\), and \(W^\prime _\mathrm {weig}\)) with each other and with the published data. As an example, the recovery ratio curves in Fig. 5 were obtained by simulating the WB1 \(\rightarrow \) RB \(\rightarrow \) WB2 protocol for every recovery duration (\(T_\mathrm {rec}\)) between 0 s and 900 s. Published observations by Caen et al. (2021) were added to the plot.
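For the \(W'_{\mathrm{bal-ode}}\) family, the WB1 \(\rightarrow \) RB \(\rightarrow \) WB2 protocol can be sketched as a small discrete simulation of Eq. 2.1. The code below is an illustration under assumptions, not the simulation used for the reported results: the function names, the 0.1-s step size, and the athlete values are hypothetical, and the hydraulic model is not covered. The time constant is passed in as a function so that Eq. 2.2, 2.3, or 2.5 could be plugged in.

```python
import math


def step(w_bal, power, cp, w_prime, tau_fn, dt):
    """One update of Eq. 2.1; tau_fn maps (D_CP, W') to the time constant."""
    if power >= cp:
        return w_bal - (power - cp) * dt
    tau = tau_fn(cp - power, w_prime)
    return w_prime - (w_prime - w_bal) * math.exp(-dt / tau)


def time_to_exhaustion(w_bal, p_work, cp, w_prime, tau_fn, dt):
    """Simulate a work bout from the given balance until W' is depleted."""
    if p_work <= cp:
        raise ValueError("the work intensity must be above CP")
    steps = 0
    while w_bal > 1e-9:
        w_bal = step(w_bal, p_work, cp, w_prime, tau_fn, dt)
        steps += 1
    return steps * dt


def recovery_ratio(p_work, p_rec, t_rec, cp, w_prime, tau_fn, dt=0.1):
    """WB1 -> RB -> WB2 protocol; returns TTE(WB2) / TTE(WB1) * 100 in %."""
    tte_wb1 = time_to_exhaustion(w_prime, p_work, cp, w_prime, tau_fn, dt)
    w_bal = 0.0  # W' is assumed to be fully depleted at the end of WB1
    for _ in range(round(t_rec / dt)):
        w_bal = step(w_bal, p_rec, cp, w_prime, tau_fn, dt)
    tte_wb2 = time_to_exhaustion(w_bal, p_work, cp, w_prime, tau_fn, dt)
    return 100.0 * tte_wb2 / tte_wb1


def tau_skiba(d_cp, w_prime):
    """Eq. 2.2, used here as one possible tau_fn."""
    return w_prime / d_cp


# Hypothetical athlete and protocol (illustrative values only).
ratio_example = recovery_ratio(
    p_work=350.0, p_rec=150.0, t_rec=200.0, cp=250.0, w_prime=20000.0,
    tau_fn=tau_skiba,
)
print(f"recovery ratio: {ratio_example:.1f} %")
```

Because depletion in Eq. 2.1 is linear, the simulated ratio for these models approaches \(100 \cdot (1 - e^{-T_\mathrm {rec}/\mathcal {T}})\) as the step size shrinks; the explicit simulation is shown because it mirrors the protocol description.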

2.3 Data extraction for model comparisons

We extracted data from previous studies that investigated energy recovery dynamics and used it to compare and evaluate recovery ratio predictions of all models. The studies for comparison were identified from Table 1 of the comprehensive review by Chorley and Lamb (2020). From these studies, we retained those that featured appropriate data, except those that met the following exclusion criteria:

  • Featured a mode of exercise other than cycling. Cycle ergometers measure power output directly. Power outputs during modes of exercise like running or swimming are not directly comparable because they are estimated using different methods or are approximated, e.g., Morton and Billat (2004) focused only on speed instead of power.

  • The observations were made under extreme conditions, e.g., hypoxia or altitude.

  • Insufficient information was reported to simulate the prescribed protocol and/or to infer a recovery ratio of W’ in percent, e.g., the integral version of the \(W'_{\mathrm{bal}}\) model by Skiba et al. (2012) assumes recovery during high-intensity exercise such that recovery ratios cannot be straightforwardly inferred.

  • The prescribed protocol leaves doubt if reported recovery ratios are comparable to the “recovery estimation protocol” described earlier, e.g., repeated ramp tests until exhaustion, 50 % W’ depletion followed by a 3-min all-out test, or knee-extension maximal voluntary contraction (MVC) test during recovery.

Five studies were included for comparison, four of which were obtained from the Chorley and Lamb (2020) review: Bartram et al. (2018), Chidnok et al. (2012), Ferguson et al. (2010), and Caen et al. (2019). After the review by Chorley and Lamb (2020) was published, Caen et al. (2021) published a study that investigated the W’ reconstitution dynamics in even more detail and which was thus added to the list.

The data in the listed studies were presented in diverse ways, such that modifications were made to some of the data to enable model comparison. The study by Caen et al. (2019) did not report distinct mean values for every investigated condition, such that we derived approximate values in Weigend et al. (2021) to fit our \(\mathrm {hydraulic}_\mathrm {weig}\) to their conditions. Hence, the data for comparison are the values from Weigend et al. (2021). Further, the study by Bartram et al. (2018) fitted their own \(W'_{\mathrm{bart}}\) model, where \(\mathcal {T}_t\) is defined according to Equation 2.3. Therefore, we used \(W'_{\mathrm{bart}}\) model predictions for prescribed intensities of Bartram et al. (2018) as the observations against which the other models were compared.

The study by Chidnok et al. (2012) reported times to exhaustion from their intermittent exercise protocol instead of recovery ratios. Power output during recovery was constant in their tests. Therefore, in order to derive recovery ratio estimations that are comparable with the WB1 \(\rightarrow \) RB \(\rightarrow \) WB2 procedure defined above, we fitted a constant value for \(\mathcal {T}_t\) of the \(W'_{\mathrm{bal-ode}}\) model to each of their prescribed protocols and times to exhaustion. These constant values for \(\mathcal {T}_t\) were fitted with the Brent method implementation by SciPy (SciPy 1.0 Contributors et al., 2020) to find a local minimum within the interval [100, 1 000]. We then used WB1 \(\rightarrow \) RB \(\rightarrow \) WB2 recovery ratio estimations of \(W'_{\mathrm{bal-ode}}\) models with fitted constant \(\mathcal {T}_t\) as the observations with which to compare \(W'_{\mathrm{skib}}\), \(W'_{\mathrm{bart}}\), \(W^\prime _\mathrm {weig}\), and \(\mathrm {hydraulic}_\mathrm {weig}\).

2.4 The metrics of goodness of fit

The metrics of goodness of fit used to compare the models were Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the small-sample version of the Akaike Information Criterion (\({\mathrm{AIC}}_{\mathrm{c}}\)). Chai and Draxler (2014) discussed RMSE and MAE as widely adopted metrics for assessing model prediction capabilities. We compared predictive accuracy by comparing RMSE and MAE on data to which competing models were not fitted. Lower values for RMSE and MAE were interpreted as more accurate predictions.

To statistically compare prediction error distributions between models, we used a bootstrap hypothesis test (Efron & Tibshirani, 1993; Good, 2000). We did so because only small data sets were available and we could not assume normally distributed prediction errors with equal variances for every compared model. The null hypothesis of our bootstrap test was that the prediction error distributions of two compared models are the same. Because we used two prediction error metrics (RMSE and MAE), we investigated the null hypothesis on both. We used the absolute difference between the RMSE values, and likewise between the MAE values, of the compared groups as our test statistics. With the null hypothesis that error distributions are the same, we could bootstrap new samples by randomly selecting with replacement from all pooled observations. We created a distribution of test statistics from 1 000 000 bootstrap samples to reliably approximate the p-value of our observed test statistic at high precision. We rejected the null hypothesis if the p-value\(\,<\,.05\).
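The bootstrap test described above can be sketched as follows. This is a minimal illustration under assumptions rather than the original analysis code: the function names are hypothetical, the p-value is taken as the share of bootstrap statistics that are at least as large as the observed one, and the default number of bootstrap samples is reduced to 10 000 to keep the example fast (the study used 1 000 000). The error values in the usage example are made up.

```python
import numpy as np


def rmse(errors):
    """Root Mean Squared Error of a sample of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))


def mae(errors):
    """Mean Absolute Error of a sample of prediction errors."""
    return float(np.mean(np.abs(errors)))


def bootstrap_p_value(errors_a, errors_b, metric, n_boot=10000, seed=0):
    """Approximate p-value for H0: both error samples share one distribution."""
    errors_a = np.asarray(errors_a, dtype=float)
    errors_b = np.asarray(errors_b, dtype=float)
    observed = abs(metric(errors_a) - metric(errors_b))
    pooled = np.concatenate([errors_a, errors_b])
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_boot):
        # Under H0 both groups are drawn from the pooled observations.
        sample_a = rng.choice(pooled, size=errors_a.size, replace=True)
        sample_b = rng.choice(pooled, size=errors_b.size, replace=True)
        if abs(metric(sample_a) - metric(sample_b)) >= observed:
            count += 1
    return count / n_boot


# Made-up prediction errors in percentage points (illustrative only).
errors_model_a = [2.0, -3.5, 1.0, 4.0, -1.5, 0.5]
errors_model_b = [12.0, -9.0, 15.5, -11.0, 8.0, 13.0]
p_rmse = bootstrap_p_value(errors_model_a, errors_model_b, rmse, n_boot=2000)
p_mae = bootstrap_p_value(errors_model_a, errors_model_b, mae, n_boot=2000)
```

Fixing the seed makes the approximation reproducible; with more bootstrap samples the estimated p-value becomes more precise at the cost of run time.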

We also compared models with the \({\mathrm{AIC}}_{\mathrm{c}}\), which was first proposed by Sugiura (1978). The \({\mathrm{AIC}}_{\mathrm{c}}\) is a model selection tool to investigate the balance between model complexity and explanatory capability (Burnham & Anderson, 2004). It penalizes the number of parameters of the model, and the lower the \({\mathrm{AIC}}_{\mathrm{c}}\) score, the better the balance between model complexity and goodness of fit is met. The \({\mathrm{AIC}}_{\mathrm{c}}\) was calculated as

$$\begin{aligned} \text {AIC}_c = n \cdot \text {ln}(\text {MSE}) + 2k + \frac{2k \cdot (k+1)}{n-k-1}, \end{aligned}$$
(2.6)

where MSE is the Mean Squared Error, n is the number of data points and k is the number of parameters of the model. Models have to be fitted to and applied to the same data in order to obtain comparable \({\mathrm{AIC}}_{\mathrm{c}}\) scores. Therefore, only \(W^\prime _\mathrm {weig}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) were comparable with this criterion in this work.
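Eq. 2.6 translates directly into a short function. The sketch below is illustrative; the function name and the input values are assumptions, and how the number of parameters k is counted for each model is not specified here and is therefore left as an input.

```python
import math


def aicc(mse, n, k):
    """Eq. 2.6: n * ln(MSE) + 2k + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("Eq. 2.6 requires n > k + 1")
    return n * math.log(mse) + 2 * k + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical values: the same MSE and sample size, different complexity.
score_simple = aicc(mse=25.0, n=12, k=3)
score_complex = aicc(mse=25.0, n=12, k=8)
print(f"k = 3: {score_simple:.2f}, k = 8: {score_complex:.2f}")
```

With an identical MSE the model with more parameters receives the higher score, so a more complex model has to reduce the MSE considerably to be preferred, especially when n is small.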

Altogether, the hypothesis that the more complex \(\mathrm {hydraulic}_\mathrm {weig}\) model fits the data better than the established \(W'_{\mathrm{bal-ode}}\) models will be supported if the overall RMSE, MAE and \({\mathrm{AIC}}_{\mathrm{c}}\) scores are lower for \(\mathrm {hydraulic}_\mathrm {weig}\) than for other models, and if prediction error distributions are significantly different to those of other models.

3 Results

In the following section, we present the extracted data and the prediction results of \(W'_{\mathrm{bal-ode}}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) models for each listed previous study. We refer to extracted data from studies by the last name of the first author, e.g., the extracted data from Bartram et al. (2018) is referred to as “Bartram data set”. All studies collected their data through performance tests that required athletes to exercise until volitional exhaustion. Such tests are affected by circumstances that are hard to measure and control, e.g., motivation, nutrition, and state of mind. Therefore, recovery ratio observations are noisy and the extracted group averages were accompanied by large standard deviations. These uncertainties prevented us from drawing conclusions about model quality from averages on individual data sets and instead required us to compare models across all available data. We begin by presenting the extracted data and model predictions of individual data sets in Sections 3.1 to 3.5, followed by a summary of all prediction errors and the resulting RMSE, MAE, and \({\mathrm{AIC}}_{\mathrm{c}}\) scores in Section 3.6.

3.1 Bartram data set

Fig. 4
Comparison of model predictions with the defined recovery estimation protocol (WB1 \(\rightarrow \) RB \(\rightarrow \) WB2). Depicted are the recovery dynamics around 60 s at various \(D_{CP}\) recovery intensities after a preceding exhaustive exercise at the intensity that is predicted to lead to exhaustion after 100 s (P100). Chosen intensities and time frames stem from the protocol prescribed by Bartram et al. (2018) and predictions of \(W'_{\mathrm{bart}}\) were used as the observations with which to compare models

Table 1 The left part of the table summarizes extracted data and conditions from Bartram et al. (2018)

The protocol prescribed by Bartram et al. (2018) consisted of three work bouts interspersed with two 60-s recovery bouts. The first two work bouts each lasted for 30 s, and the final one until volitional exhaustion. Work bout exercise intensity (\(P_\mathrm {work}\)) was set to P100, i.e., the intensity that was predicted to lead to exhaustion after 100 s. The recovery bout intensity (\(P_\mathrm {rec}\)) was set to one of five differences to CP (\(D_{CP}\)): 200 W (i.e., CP - 200 W), 150 W, 100 W, 50 W, or 0 W. The group-averaged CP and W’ for the four world-class cyclists featured in Bartram et al. (2018) were 393 W and 23 300 J. Altogether, these input values resulted in an estimated P100 exhaustive intensity of 626 W and recovery intensities \(D_{CP}\) 0 of 393 W, \(D_{CP}\) 50 of 343 W, \(D_{CP}\) 100 of 293 W, \(D_{CP}\) 150 of 243 W, and \(D_{CP}\) 200 of 193 W, respectively.
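These intensities are consistent with Eq. (1.1) solved for P, i.e., \(P = CP + W'/TTE\). The arithmetic can be sketched as follows; the helper name `power_for_tte` is hypothetical and the sketch assumes that P100 was obtained from the critical power model in this way.

```python
def power_for_tte(cp, w_prime, tte):
    """Constant power (W) predicted to cause exhaustion after `tte` seconds.

    Rearranges Eq. (1.1), TTE = W' / (P - CP), to P = CP + W' / TTE.
    """
    return cp + w_prime / tte


cp = 393.0         # group-averaged CP in W (Bartram data set)
w_prime = 23300.0  # group-averaged W' in J (Bartram data set)

p100 = power_for_tte(cp, w_prime, 100.0)  # 626.0 W

# Recovery intensities expressed as differences to CP (D_CP), in W.
recovery_intensities = {d_cp: cp - d_cp for d_cp in (0, 50, 100, 150, 200)}
```

The same helper applies to the other protocols, e.g., `power_for_tte(cp, w_prime, 240.0)` for a P240 intensity.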

The resulting recovery predictions of \(W'_{\mathrm{bal-ode}}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) models are summarized in Fig. 4 and Table 1. The \(W'_{\mathrm{bart}}\) model was not compared because it was the model that Bartram et al. (2018) fitted to their observations and we used it to create the observations against which the other models were compared. The fitted \(\mathrm {hydraulic}_\mathrm {weig}\) configuration to CP and W’ by Bartram et al. (2018) was: [23 112 J, 65 845 J, 392 W, 149 W, 24 W, 0.73, 0.01, 0.24].

Figure 4 and Table 1 show that in all cases except \(D_{CP}\) 0 the recovery ratios predicted by the \(\mathrm {hydraulic}_\mathrm {weig}\) model were closest to the ones observed by Bartram et al. (2018), followed by \(W'_{\mathrm{skib}}\) and then \(W^\prime _\mathrm {weig}\). At \(D_{CP}\) 0, in contrast, the hydraulic model was the only model to predict recovery.

3.2 Caen data set

Fig. 5
Comparison of model predictions with published observations by Caen et al. (2021). After an exhaustive exercise bout at P240, recovery dynamics at an intensity of 161 W were simulated using the defined recovery estimation protocol of this work (WB1 \(\rightarrow \) RB \(\rightarrow \) WB2). Published observed recovery ratios by Caen et al. (2021) are depicted in blue

Table 2 The left part of the table summarizes extracted data and conditions from Caen et al. (2021). The right part of the table displays model predictions

The protocol by Caen et al. (2021) investigated the recovery dynamics following exhaustive exercise at P240 (published average of 349 W). They prescribed a recovery intensity of 161 W on average, which was determined by selecting \(90\,\%\) of the power at gas exchange threshold (Binder et al., 2008) of their participants. The average CP of their participants was 269 W, and the average W’ 19 200 J. The reported observed recovery ratios were \(28.6\,\%\,\pm \,8.2\,\%\) after 30 s, \(34.8\,\%\,\pm \,11.1\,\%\) after 60 s, \(44.2\,\%\,\pm \,9.7\,\%\) after 120 s, \(50.5\,\%\,\pm \,12.1\,\%\) after 180 s, \(55.1\,\%\,\pm \,13.3\,\%\) after 240 s, \(56.8\,\%\,\pm \,16.4\,\%\) after 300 s, \(73.7\,\%\,\pm \,19.3\,\%\) after 600 s, and \(71.3\,\%\,\pm \,20.8\,\%\) after 900 s.

The simulation parameters and results of the defined recovery estimation protocol are summarized in Fig. 5 and Table 2. Fitting \(\mathrm {hydraulic}_\mathrm {weig}\) to CP and W’ group averages resulted in the configuration [17 631 J, 46 246 J, 267 W, 118 W, 21 W, 0.68, 0.01, 0.29]. The recovery ratios predicted by the \(\mathrm {hydraulic}_\mathrm {weig}\) model better matched the observed values compared to all the other models. Nevertheless, some lack of fit for the \(\mathrm {hydraulic}_\mathrm {weig}\) model was observed: the model overpredicted the recovery ratios at early time points and underpredicted those at longer time points except for the last one. \(W'_{\mathrm{skib}}\), \(W'_{\mathrm{bart}}\), and \(W^\prime _\mathrm {weig}\) model predictions consistently overestimated recovery for longer recovery times.

3.3 Chidnok data set

Chidnok et al. (2012) prescribed a protocol that alternated between 60-s work bouts and 30-s recovery bouts until the athlete reached exhaustion. With their protocol, the work-bout intensity \(P_\mathrm {work}\) was set to P240. The protocol prescribed four trials each with a different recovery intensity \(P_\mathrm {rec}\) (20 W as the “low” recovery intensity, 95 W as “medium”, 173 W as “high”, and 270 W as the “severe” recovery intensity). The participants had an average CP of 241 W and W’ of 21 100 J. Their recorded times to exhaustion were \(1\,224\,\text {s}\,\pm \,497\,\)s with “low” recovery intensity, \(759\,\text {s}\,\pm \,243\,\)s with “medium”, \(557\,\text {s}\,\pm \,90\,\)s with “high”, and \(329\,\text {s}\,\pm \,29\,\)s with “severe”.

Fig. 6
Predicted recovery dynamics of compared models up to 60 s after a preceding exhaustive exercise at P240 and at three different recovery intensities (20 W, 95 W, and 173 W). Observations were predicted recovery ratios of \(W'_{\mathrm{bal-ode}}\) models with a constant \(\mathcal {T}\) fitted to reported times to exhaustion by Chidnok et al. (2012)

Table 3 The left part of the table summarizes extracted data and conditions from Chidnok et al. (2012). The right part of the table displays model predictions

As described in Sect. 2.3, in order to compare observations of Chidnok et al. (2012) to WB1 \(\rightarrow \) RB \(\rightarrow \) WB2 protocol estimations, a constant value for \(\mathcal {T}_t\) for the \(W'_{\mathrm{bal-ode}}\) model was fitted to the protocol by Chidnok et al. (2012) for each of their recovery conditions. The resulting \(\mathcal {T}_t\) values were 165.19 s for the “low” recovery intensity protocol, 124.81 s for “medium”, and 107.45 s for “high”. The “severe” recovery intensity was left out because 270 W lies above the average CP of 241 W. In this case, no recovery should occur if the assumptions of the \(W'_{\mathrm{bal-ode}}\) model hold true. Chidnok et al. (2012) prescribed recovery bouts of 30 s. The WB1 \(\rightarrow \) RB \(\rightarrow \) WB2 protocol estimations with the corresponding fitted \(\mathcal {T}_t\) values and a \(T_\mathrm {rec}\) of 30 s were 24.6 % at the “low” intensity, 21.7 % at “medium”, and 16.7 % at the “high” recovery intensity.

The fitted \(\mathrm {hydraulic}_\mathrm {weig}\) configuration to CP and W’ by Chidnok et al. (2012) was: [18 920 J, 48 052 J, 240 W, 115 W, 19 W, 0.68, 0.05, 0.31]. Predictions of all models and extracted conditions for the recovery estimation protocol are summarized in Fig. 6 and Table 3. In the case of the “low” recovery intensity, predictions of the \(W'_{\mathrm{skib}}\) and \(W^\prime _\mathrm {weig}\) models were the most accurate. In the case of the “medium” recovery intensity, the \(W'_{\mathrm{skib}}\) model was the most accurate, and in the remaining “high” condition, \(\mathrm {hydraulic}_\mathrm {weig}\) and \(W'_{\mathrm{bart}}\) model predictions were closest to the data. None of the models made predictions that were close to all three of the observations.

3.4 Ferguson data set

Fig. 7
A comparison of predicted recovery dynamics after an exhaustive exercise bout at P360 and at a recovery intensity of 20 W. Recovery ratios are estimated with the (WB1 \(\rightarrow \) RB \(\rightarrow \) WB2) protocol, which resembles the prescribed protocol by Ferguson et al. (2010). Published observations by Ferguson et al. (2010) are depicted in blue

Ferguson et al. (2010) prescribed a protocol with an initial time to exhaustion bout at the intensity that was predicted to lead to exhaustion after 360 s (P360), followed by a recovery at 20 W for 2 min, 6 min, or 15 min. After recovery, exercise intensity was increased back to one of three possible high-intensity work rates. Thus, each participant performed nine tests in total with three different constant work rates after three different recovery times. The CP model was fitted to these three times to exhaustion after each recovery period to determine changes in CP and W’. Ferguson et al. (2010) published their group averages for CP as 212 W, W’ as 21 600 J, the P360 as 269 W, and the observed recovery ratios as \(37\,\%\,\pm \,5\,\%\) after 2 min, \(65\,\%\,\pm \,6\,\%\) after 6 min, and \(86\,\%\,\pm \,4\,\%\) after 15 min.

Table 4 The left part of the table summarizes extracted data and conditions from Ferguson et al. (2010). The right part of the table displays model predictions

Extracted parameters for the recovery intensity protocol and model prediction results are summarized in Fig. 7 and Table 4 together with reported means by Ferguson et al. (2010). The fitted \(\mathrm {hydraulic}_\mathrm {weig}\) configuration to CP and W’ group averages by Ferguson et al. (2010) was [18 730 J, 81 031 J, 212 W, 94 W, 19 W, 0.63, 0.21, 0.34]. In this setup, \(W^\prime _\mathrm {weig}\) was overall closest to published observations. \(\mathrm {Hydraulic}_\mathrm {weig}\) overestimated the recovery after 120 s and after 900 s. \({W'_{\mathrm{skib}}}\) and \(W'_{\mathrm{bart}}\) overestimated recovery in every instance.

3.5 Weigend data set

We took the values from Table 1 in the Appendix of our publication Weigend et al. (2021), where we derived them from Caen et al. (2019). The reported measures recreate the depicted means in Fig. 3 of the publication by Caen et al. (2019). They consisted of three recovery ratios for each of four conditions: preceding exhausting exercise at P240 or P480 followed by recovery at 33 % of CP or 66 % of CP. The participants of Caen et al. (2019) had an average CP of 248 W and W’ of 18 200 J, which results in a P240 of 323 W, a P480 of 285 W, 33 % of CP as 81 W, and 66 % of CP as 163 W.

Fig. 8
Predicted recovery dynamics in comparison to measures that we derived from observations of Caen et al. (2019). We derived three recovery ratios for four conditions each: Preceding exhausting exercise at P240 or P480 followed by recovery at 33 % of CP or 66 % of CP. Depicted observations are the values from Table 1 in the Appendix of our publication Weigend et al. (2021) and approximate Fig. 3 of the publication by Caen et al. (2019). \(W^\prime _\mathrm {weig}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) were fitted to these observations

Table 5 The left part of the table summarizes extracted data and conditions that Weigend et al. (2021) derived from Caen et al. (2019). The right part of the table displays model predictions. Both \(W^\prime _\mathrm {weig}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) were fitted to these observations. Their predicted recovery ratios were recorded for the \({\mathrm{AIC}}_{\mathrm{c}}\) goodness of fit estimation metric in Sect. 2.4

Extracted parameters for the recovery ratio estimation protocol and model predictions are summarized in Fig. 8 and Table 5. The best fit \(\mathrm {hydraulic}_\mathrm {weig}\) configuration to CP and W’ group averages was [18 042 J, 46 718 J, 247 W, 107 W, 17 W, 0.72, 0.02, 0.25]. As described earlier, the recovery ratio values by Weigend et al. (2021) were used to fit the \(\mathcal {T}_{\mathrm {weig}_t}\) for the \(W^\prime _\mathrm {weig}\) model, and they were also used in the evolutionary fitting process for \(\mathrm {hydraulic}_\mathrm {weig}\) to fit recovery dynamics. Therefore, neither \(W^\prime _\mathrm {weig}\) nor \(\mathrm {hydraulic}_\mathrm {weig}\) was scrutinized for predictive accuracy on this data set. Their predicted recovery ratios were recorded for the \({\mathrm{AIC}}_{\mathrm{c}}\) goodness of fit estimation metric covered in the next subsection. Out of the remaining two models, predictions of \(W'_{\mathrm{skib}}\) were closer to the observations, but both overpredicted in nearly all instances.

3.6 Summary of metrics of goodness of fit

Table 6 Summary of the model prediction errors and estimated metric scores. The first two columns summarize prediction errors used to compare predictive accuracy of \(W'_{\mathrm{bart}}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) via MAE, standard deviation of absolute errors (SD), and RMSE. Prediction accuracy had to be assessed using data to which models were not fitted, such that we had to exclude the Bartram and Weigend data sets. The subsequent three columns summarize the prediction errors used to compare the predictive accuracies of \(W'_{\mathrm{skib}}\), \(W^\prime _\mathrm {weig}\) and \(\mathrm {hydraulic}_\mathrm {weig}\). Here, the Weigend data set was excluded because \(W^\prime _\mathrm {weig}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) were fitted to it. Finally, the \({\mathrm{AIC}}_{\mathrm{c}}\) metric requires models to be fitted to and to be evaluated on the same data. Therefore, we compared \({\mathrm{AIC}}_{\mathrm{c}}\) scores estimated from prediction errors of \(W^\prime _\mathrm {weig}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) on all data sets. We approximated p-values for absolute differences in MAE and RMSE with a bootstrap hypothesis test and considered \(\mathrm {p}<.05\) as significant. For every metric, a lower score means a better result

Table 6 summarizes the prediction errors of the competing models and resulting metric scores on our investigated data sets. RMSE and MAE were defined as the metrics to assess predictive accuracy. On the Caen, Chidnok, and Ferguson data sets, the MAE scores were 24.87 with a standard deviation of absolute errors (SD) of 14.35 for \(W'_{\mathrm{bart}}\) and 7.11 (\(\text {SD}=6.83\)) for \(\mathrm {hydraulic}_\mathrm {weig}\) (\(\mathrm {p}<.001\) for the difference in MAEs, bootstrap hypothesis test). The RMSE scores on these data sets were 28.46 for \(W'_{\mathrm{bart}}\) and 9.69 for \(\mathrm {hydraulic}_\mathrm {weig}\). The bootstrap hypothesis test with the absolute difference in RMSEs as its test statistic also resulted in \(\mathrm {p}<.001\).

Both remaining models \(W'_{\mathrm{skib}}\) and \(W^\prime _\mathrm {weig}\) could be compared to \(\mathrm {hydraulic}_\mathrm {weig}\) on the Bartram, Caen, Chidnok, and Ferguson data sets. The \(\mathrm {hydraulic}_\mathrm {weig}\) featured the lowest MAE with 7.07, the lowest SD with 7.17, and lowest RMSE with 9.94. \({W'_{\mathrm{skib}}}\) predictions were significantly different to \(\mathrm {hydraulic}_\mathrm {weig}\) (\(\mathrm {p}<.001\) with the MAE test statistic and \(\mathrm {p}=.001\) with RMSE). \(W'_{\mathrm{weig}}\) predictions were significantly different to \(\mathrm {hydraulic}_\mathrm {weig}\) (\(\mathrm {p}=.019\) with the MAE test statistic and \(\mathrm {p}=.031\) with RMSE).

\({\mathrm{AIC}}_{\mathrm{c}}\) was chosen as the metric to assess which model provides the best trade-off between predictive capabilities and complexity. Models must be fitted to and tested on the same data for \({\mathrm{AIC}}_{\mathrm{c}}\) scores to be comparable. Hence, as reflected in the last two columns of Table 6, \(W^\prime _\mathrm {weig}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) could be compared on combined data points of all covered data sets. With a k of 3 for \(W^\prime _\mathrm {weig}\) and a k of 8 for \(\mathrm {hydraulic}_\mathrm {weig}\) the resulting scores were 151.85 for \(\mathrm {hydraulic}_\mathrm {weig}\) and 181.03 for \(W^\prime _\mathrm {weig}\). The \(\mathrm {hydraulic}_\mathrm {weig}\) achieved the lower \({\mathrm{AIC}}_{\mathrm{c}}\) score.

4 Discussion

In this study, we compared the prediction capabilities and goodness of fit of \(\mathrm {hydraulic}_\mathrm {weig}\) to that of \(W'_{\mathrm{bal}}\) models. We hypothesized that the hydraulic model would more accurately predict recovery ratios observed in past studies. Models were compared on extracted data from five studies and the \(\mathrm {hydraulic}_\mathrm {weig}\) model outperformed the \(W'_{\mathrm{skib}}\), \(W'_{\mathrm{bart}}\), and \(W^\prime _\mathrm {weig}\) models with respect to objective RMSE, MAE, and \({\mathrm{AIC}}_{\mathrm{c}}\) metrics. Our findings therefore support the hypothesis. Below, we discuss our results in more detail and interpret them in the context of previous findings. We present arguments for why the \(\mathrm {hydraulic}_\mathrm {weig}\) model outperformed the \(W'_{\mathrm{bal-ode}}\) models and we propose limitations and future work. Finally, we end this section with statements about the significance and implications of our results.

4.1 Interpretation and contextualization

We observed that the standard deviations of absolute prediction errors in Sect. 3.6 as well as the overall MAE and RMSE were considerably lower for the \(\mathrm {hydraulic}_\mathrm {weig}\) than for the \(W'_{\mathrm{bal-ode}}\) models. However, when averaging the prediction errors on isolated data sets listed in Table 6, \(\mathrm {hydraulic}_\mathrm {weig}\) only made more accurate predictions than its competitors on the Bartram and Caen data sets. For the Bartram data set, the MAE of \(\mathrm {hydraulic}_\mathrm {weig}\) was 6.96, compared to 18.74 for \(W'_{\mathrm{skib}}\) and 26.82 for \(W^\prime _\mathrm {weig}\). For the Caen data set, the MAE of \(\mathrm {hydraulic}_\mathrm {weig}\) was the lowest with 3.52. On the Chidnok data set, it was \(W'_{\mathrm{skib}}\) that achieved the lowest MAE with 9.4, and on the Ferguson data set it was \(W^\prime _\mathrm {weig}\) with 6.7.

As pointed out by Skiba and Clarke (2021) and Sreedhara et al. (2019), \(W'_{\mathrm{bal-ode}}\) models are meant to be applied to any athlete under a wide range of possible conditions. A lower MAE score for \(W^\prime _\mathrm {weig}\) on the Ferguson data set means \(W^\prime _\mathrm {weig}\) predicted recovery ratios more closely for the particular group (six recreationally active men) under the particular test conditions that Ferguson tested. However, to determine the usefulness of a model for predicting performance for high-intensity intermittent exercise in a more general sense, models have to be evaluated on a multitude of scenarios. After combining all data sets, \(\mathrm {hydraulic}_\mathrm {weig}\) achieved the overall lowest MAE score, which means that \(\mathrm {hydraulic}_\mathrm {weig}\) could predict recovery ratios overall more accurately for a range of groups and settings.

Fig. 9
Simulated recovery ratios using the \(\mathrm {hydraulic}_\mathrm {weig}\) and \(W'_{\mathrm{bal-ode}}\) models in response to prior exercise of differing intensities. The plots show that \(W'_{\mathrm{bal-ode}}\) models are insensitive to the properties of prior exhausting exercise, i.e., their predictions were not affected by \(P_{\mathrm {work}}\). In contrast, the \(\mathrm {hydraulic}_\mathrm {weig}\) was sensitive to the prior exercise properties. Performance models \(W'_{\mathrm{bal-ode}}\) were configured with a CP of 393 W and W’ of 23 300 J. \(\mathrm {Hydraulic}_\mathrm {weig}\) featured the configuration [23 112 J, 65 845 J, 392 W, 149 W, 24 W, 0.73, 0.01, 0.24]. All simulations differed only in \(P_{\mathrm {work}}\), which decreased from P100 to P480 in the simulations depicted from left to right. \(P_{\mathrm {work}}\) = P100, prescribed by Bartram et al. (2018), was the highest intensity out of compared studies. They investigated recovery after 60 s, therefore the \(W'_{\mathrm{bart}}\) model prediction after 60 s is marked as the observation. \(P_{\mathrm {work}}\) = P480 was the lowest prescribed intensity out of compared studies and recovery ratios for \(P_{\mathrm {work}}\) = P480, \(P_{\mathrm {rec}}\) = \(33\,\%\) of CP from the Weigend data set were marked as observations on the right

The less consistent prediction quality across data sets of the \(W'_{\mathrm{bal-ode}}\) models agrees with findings by Caen et al. (2019), who proposed that the predictive capabilities of \(W'_{\mathrm{bal}}\) models may improve with modifications that account for intensity and duration of prior exhaustive exercise. As an example, out of all compared studies in this work, Bartram et al. (2018) prescribed the highest work bout intensity for their experimental setup (\(P_\mathrm {work}\) = P100). Considering the suggestion by Caen et al. (2019) that a shorter time to exhaustion at a high intensity allows a quicker recovery, it seems reasonable that the \(W'_{\mathrm{bart}}\) model estimated the fastest recovery kinetics out of all recovery models.

Conversely, the Caen et al. (2019) study prescribed the lowest work bout intensity out of all compared studies (\(P_\mathrm {work}\) = P480). Their observed recovery ratios are summarized in the Weigend data set and were slower than the \(W'_{\mathrm{bart}}\) predictions. This observation again matches the assumption that a longer exhaustive exercise at a lower intensity requires a longer recovery.

Despite the differences in observed recovery rates, the \(W'_{\mathrm{bal-ode}}\) models allow for only a single recovery rate no matter the nature of the prior exercise. To illustrate this point, we conducted simulations to depict the influence of prior exercise intensity on the recovery ratios predicted by the \(W'_{\mathrm{bal-ode}}\) and \(\mathrm {hydraulic}_\mathrm {weig}\) models. Figure 9 depicts four simulations. All simulations shared the same test setup except for differing \(P_\mathrm {work}\) intensities. The simulation on the left had \(P_\mathrm {work} = P100\) as prescribed by Bartram et al. (2018), and the simulation on the right featured the lowest \(P_\mathrm {work} = P480\) as found in the Weigend data set. From left to right, \(P_\mathrm {work}\) of the simulations decreased stepwise. Bartram et al. (2018) investigated recovery after 60 s and therefore the \(W'_{\mathrm{bart}}\) prediction after 60 s is marked as the observation on the left. Recovery ratios with \(P_\mathrm {work} = P480\) and \(P_\mathrm {rec} = 33\,\%\) of CP of the Weigend data set are marked as observations on the right. The recovery ratios predicted by the \(W'_{\mathrm{bal-ode}}\) models were the same for each \(P_{\mathrm {work}}\) and their predictions were therefore unable to fit all observations equally well. In contrast, the hydraulic model could account for such characteristics.

This result occurred because of the interactions between the three tanks that \(\mathrm {hydraulic}_\mathrm {weig}\) uses to model energy recovery. For example, during high-intensity exercise, the liquid level in AnF would rapidly decrease and the contribution of AnS would be less than during lower-intensity exercise when the liquid level in AnF would decrease more slowly. Differences in fill states of AnS affected recovery estimations and enabled \(\mathrm {hydraulic}_\mathrm {weig}\) to predict rapid recovery after high-intensity exercise and a slower recovery after exercise at a lower intensity. We suggest that the standard deviations of absolute errors as well as the overall MAE and RMSE scores of the \(\mathrm {hydraulic}_\mathrm {weig}\) model were smaller than those of \(W'_{\mathrm{bal-ode}}\) models because the hydraulic model could account for characteristics of prior exhaustive exercise.

Further, we propose that \(\mathrm {hydraulic}_\mathrm {weig}\) has also achieved overall better metric scores because it better captured the bi-exponential nature of energy recovery, as opposed to the mono-exponential \(W'_{\mathrm{bal-ode}}\) models. Indeed, the observed recovery ratios of the Caen data set increased rapidly from 0 s to 120 s and then continued to rise more slowly at longer durations (Fig. 5). Caen et al. (2021) showed that their observations were well explained with a bi-exponential model that implements a steeper slope during the beginning of recovery. Also, the first \(W'_{\mathrm{bal}}\) paper by Skiba et al. (2012) proposed an alternative bi-exponential version of their \(W'_{\mathrm{bal}}\) model with two \(\mathcal {T}_t\)s but such bi-exponential \(W'_{\mathrm{bal}}\) models have yet to be applied in practice.
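The difference between the two curve shapes can be made concrete with a short sketch. The functional forms below are generic mono- and bi-exponential recovery curves starting from full depletion; they are assumptions for illustration and may differ from the exact formulations of Caen et al. (2021) or Skiba et al. (2012). The names `fast_share`, `tau_fast`, and `tau_slow` are hypothetical.

```python
import math


def mono_exponential(t, tau):
    """Recovery ratio (0..1) after t seconds with a single time constant."""
    return 1.0 - math.exp(-t / tau)


def bi_exponential(t, fast_share, tau_fast, tau_slow):
    """Recovery ratio (0..1) as a weighted sum of a fast and a slow component.

    fast_share is the fraction of the total recovery attributed to the fast
    component; the remainder recovers with the slow time constant.
    """
    fast = fast_share * (1.0 - math.exp(-t / tau_fast))
    slow = (1.0 - fast_share) * (1.0 - math.exp(-t / tau_slow))
    return fast + slow
```

With a small `tau_fast` and a large `tau_slow`, the bi-exponential curve rises steeply at first and then flattens, which a single time constant cannot reproduce over the whole recovery period.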

4.2 Limitations and future work

The observed improved prediction quality of \(\mathrm {hydraulic}_\mathrm {weig}\) comes at the cost of a time-demanding fitting process. As outlined in Sect. 2.1.2, our fitting process from Weigend et al. (2021) requires CP and W’ of an athlete as inputs and then obtains fitted \(\mathrm {hydraulic}_\mathrm {weig}\) configurations using an evolutionary computation approach. Different CP and W’ values require a new \(\mathrm {hydraulic}_\mathrm {weig}\) configuration to be fitted to them. Additionally, fittings for our comparison took computation times of 3 h or more on 7 cores of an \(\text {Intel}^{\circledR }\) \(\text {Xeon}^{\circledR }\) CPU E5-2650 v4 @ 2.20 GHz each. On the other hand, obtaining \(\mathcal {T}_t\) for our \(W'_{\mathrm{bal-ode}}\) models was solved in milliseconds and they can be applied to any CP and W’ combination without fitting \(\mathcal {T}_t\) anew. Therefore, the application of \(\mathrm {hydraulic}_\mathrm {weig}\) is a time consuming task in comparison to the application of \(W'_{\mathrm{bal-ode}}\) models. In order to improve the feasibility of application, future work must optimize the hydraulic model fitting process to minimize this limitation.

Further, prediction results show that estimated recovery ratios of the used WB1 \(\rightarrow \) RB \(\rightarrow \) WB2 protocol are affected by rounding errors that arose because the recovery ratios were estimated from simulations in discrete time steps. For example, in Table 5, it was observed that the \(W'_{\mathrm{bal-ode}}\) model predictions between the P240 and P480 trials varied slightly in a few cases, even though the recovery kinetics should have been the same. The variations were caused by simulation time steps of size 0.1 that made rounding effects play a bigger role in shorter work bouts of P240. Smaller step sizes would decrease the error, but to more fully prevent inaccuracies, future work ought to formalize model simulations in differential equations that do not require estimations in discrete time steps.
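The effect of the step size can be illustrated with a small sketch that integrates a mono-exponential recovery in discrete steps and compares it with the closed-form solution. The explicit Euler scheme, the function names, and the values of `tau` and `t_rec` are assumptions for this example; the simulation scheme used in the study may differ.

```python
import math


def recovery_ratio_exact(t_rec, tau):
    """Closed-form mono-exponential recovery ratio after full depletion."""
    return 1.0 - math.exp(-t_rec / tau)


def recovery_ratio_stepped(t_rec, tau, dt):
    """Same recovery, integrated with explicit Euler steps of size dt."""
    n_steps = round(t_rec / dt)
    balance = 0.0  # recovered fraction of W', 0.0 means exhausted
    for _ in range(n_steps):
        # Recovery rate is proportional to the part that is still depleted.
        balance += dt * (1.0 - balance) / tau
    return balance


tau, t_rec = 100.0, 60.0  # hypothetical values in seconds
for dt in (1.0, 0.1, 0.01):
    gap = recovery_ratio_stepped(t_rec, tau, dt) - recovery_ratio_exact(t_rec, tau)
    print(f"dt={dt}: stepped minus exact = {gap:.6f}")
```

The gap between the stepped and the exact result shrinks as `dt` decreases but never vanishes, which is why a closed-form or differential-equation formulation avoids the issue altogether.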

Predicted recovery ratios of the \(\mathrm {hydraulic}_\mathrm {weig}\) at recovery intensities close to CP require further investigation too. The comparison on the \(D_{CP}\)0 case of the Bartram data set (see Fig. 4) revealed that \(\mathrm {hydraulic}_\mathrm {weig}\) predicts a slight recovery during exercise at CP. No liquid from Ae flowed back into the system, but the ongoing flow from AnS to AnF still caused the liquid level in AnF to rise during the recovery bout. Recovery while exercising at CP intensity is a controversial assumption that is made to an even stronger extent by the original \(W'_{\mathrm{bal}}\) model of Skiba et al. (2012). Such dynamics have to be taken into consideration when the models are used for predictions and are important directions for future investigation.

Additionally, extracted recovery ratio observations from previous studies come with associated uncertainties that could not be considered in our MAE, RMSE and \({\mathrm{AIC}}_{\mathrm{c}}\) scores. As an example, the standard deviations of observed recovery ratios by Caen et al. (2021) depicted in Fig. 5 are greater than reported standard deviations by Ferguson et al. (2010) depicted in Fig. 7. One could argue that comparisons to reported means by Ferguson et al. (2010) therefore provide a better indication of prediction quality. Unfortunately, we could not incorporate these standard deviations into goodness-of-fit metrics because recovery ratios were reported differently across the compared studies. Caen et al. (2021) and Ferguson et al. (2010) reported averaged observed recovery ratios with standard deviations. For the Bartram data set, we had to use \(W'_{\mathrm{bart}}\) predictions as the observations to compare to. In Weigend et al. (2021), we derived our values from Caen et al. (2019) without standard deviations. For the Chidnok data set, we fitted constant values for \(\mathcal {T}_t\) for \(W'_{\mathrm{bal-ode}}\) models to their reported times to exhaustion to obtain comparable recovery ratios in percent. We believe these various formats of observations highlight the need for more studies on energy recovery dynamics that are also more comparable with one another.

Larger data sets are vital for better-informed investigations of recovery models and their improvement in future work. We see the combination of data sets in this work as a step in this direction. In order to improve and compare models more holistically, it is important that more comparable studies are conducted in the future and combined into a larger test bed for performance models.

4.3 Significance and implications

To the best of our knowledge, performance models on energy recovery during intermittent exercise have yet to be compared in such detail. Our comparison on data from five studies allowed a more holistic view on recovery dynamics and confirmed limitations of \(W'_{\mathrm{bal-ode}}\) models that were suggested by previous literature. Our results imply that more complex models like \(\mathrm {hydraulic}_\mathrm {weig}\) can improve energy recovery predictions. We propose that further efforts to merge and compare data are significant steps to bring the research area of energy recovery modeling forward. We further propose that the predictive capabilities of hydraulic models are promising and that hydraulic models should be considered as a possible future direction to advance energy recovery modeling.

5 Conclusion

We conclude that the \(\mathrm {hydraulic}_\mathrm {weig}\) outperformed \(W'_{\mathrm{bal-ode}}\) models when fit to multiple independent data sets featuring intermittent high-intensity exercise. The predictive accuracy and goodness of fit were better for the hydraulic model, even for the \({\mathrm{AIC}}_{\mathrm{c}}\) metric, which includes a penalty for the number of model parameters. The hydraulic model is thus likely more generalizable than \(W'_{\mathrm{bal-ode}}\) models, which are typically applied within narrow contexts. Future research should focus on improving the feasibility of the hydraulic model, because it is computationally more burdensome to use than \(W'_{\mathrm{bal-ode}}\) models. Pending such improvements, we foresee athletes adopting this model to optimize pacing and interval training workouts. To contribute towards further advancements, we publish all material, extracted data, and simulation scripts here: https://github.com/faweigend/pypermod.