Introduction

Biological systems produce dynamic nonlinear outputs that are measurable across time, such as the fluctuations in the beat-to-beat (RR) intervals of the heart (Lipsitz and Goldberger 1992; Peng et al. 1995). The apparently “chaotic looking” behaviour of the fluctuations in an RR interval time series is accepted to contain meaningful structural richness, which can be assessed using methods derived from nonlinear dynamics that quantify the complexity (i.e., degree of self-similarity of fluctuations over multiple orders of temporal magnitude; Peng et al. 1995) and entropy (i.e., the regularity or randomness of the fluctuations; Richman and Moorman 2000) of the RR interval signal. In contrast, traditional linear time-domain methods provide a measure of variability between successive RR intervals, and frequency-domain methods provide an estimation of the absolute or relative power of the RR interval signal (Shaffer and Ginsberg 2017).

Together, the time-domain, frequency-domain, and nonlinear heart rate variability (HRV) metrics reflect the global functioning of the autonomic nervous system (ANS) through the interplay of sympathetic and parasympathetic activity at the sinus node (Task force 1996; Schwab et al. 2003). From a health-related and clinical perspective, a notable increase or decrease in heart rate complexity and variability away from an individual’s optimal range may be indicative of an increased risk of sudden death, or adverse cardiac events such as arrhythmias, myocardial infarcts, postural hypotension, and congestive heart failure (Kleiger et al. 1987; Goldberger et al. 1988; Lipsitz 1989; La Rovere et al. 1998; Stein et al. 2005). Moreover, research has shown a higher HRV to be positively associated with working memory (Mosley et al. 2018), cognitive performance (Hansen et al. 2004), emotional regulation (Williams et al. 2015) and incidence of depression (de la Torre-Lugue et al. 2016).

Research utilising a wide variety of HRV metrics has shown that during wakeful rest, both heart rate complexity (Kaplan et al. 1991; Iyengar et al. 1996; Pikkujamsa et al. 1999; Beckers et al. 2006; Voss et al. 2015) and variability (Jensen-Urstad et al. 1997; Umetani et al. 1998; Goff et al. 2010; Hernandez-Vicente et al. 2020) progressively decrease from early adulthood through to older age in healthy individuals. The World Health Organisation projects the number of people in the world over 60 years of age to increase from 1 billion (as of 2020) to 1.4 billion by 2030 and 2.1 billion by 2050 (data from who.int). Given the potentially negative physiological and psychological implications associated with a decrease in heart rate complexity and variability, it is pertinent there is continued research into the utility of HRV in older adults.

Previous research has assessed the intra- and inter-day reliability of a few specific time-domain, frequency-domain (Al Haddad et al. 2011; Cipryan and Litschmannova 2013; Uhlig et al. 2020) and nonlinear HRV metrics (Maestri et al. 2007a). However, to the authors’ knowledge, the inter-day reliability of the nonlinear HRV metrics has yet to be assessed in a homogeneous group of healthy older adults. The current study therefore sought to extend the current literature investigating the reliability of HRV metrics, with the primary aim of providing new data on the day-to-day reliability of a range of HRV metrics in healthy active younger and older adults. The study also sought to assess the effect of age on HRV.

Methods

Participants

Sixty-six healthy individuals (50 male; 16 female) were recruited to participate in the study. Participants were divided into two age groups, the younger group (YG) were aged 18 to 30 years (N = 22; 16 M, 6F) and the older group (OG) were aged 50 to 70 years (N = 44; 34 M, 10F).

All participants were regular exercisers, having performed above the World Health Organisation guidelines (i.e., 2.5 to 5 h of moderate exercise per week; Bull et al. 2020) for ≥ 2 years. All participants were recruited to be closely matched for physical activity levels and exercise capacity. Participants were required to be non-obese, non-smokers, have no known or signs/symptoms of cardiovascular, neuromuscular, renal, or metabolic conditions and not be taking medications or dietary supplements that would affect cardiac function. The study was completed with full ethical approval of the University of Kent Research Ethics Committee (Proposal number: 21_2020_21), according to Declaration of Helsinki standards. All participants provided written informed consent prior to testing.

Experimental design

Each participant completed three visits to the laboratory at the same time of day (± 1 h) between the hours of 8 am and 4 pm (AM visits, YG N = 8 and OG N = 21; PM visits, YG N = 14 and OG N = 23). Visit one involved participant screening, laboratory familiarisation, and an incremental exercise test (IET) to determine aerobic fitness. At visits two and three, participants completed a 30-min supine resting RR interval measurement to derive the HRV metrics.

Visits were conducted on non-consecutive days (with a minimum gap of 2 full days and maximum gap of 5 days between visits) and participants were instructed to refrain from any exercise in the day prior to testing and intense exercise in the two days prior. Participants were instructed to arrive euhydrated and in a post-prandial state, having eaten at least 4 h prior to testing. Participants were told not to consume caffeine within 8 h and alcohol within 24 h of testing.

Preliminary measurements and incremental exercise testing (visit one)

At visit one prior to exercise testing all participants provided written informed consent, completed a health questionnaire and the long form international physical activity questionnaire (Craig et al. 2003). Resting blood pressure, participant height, body mass and body composition were then measured, after which the participants completed a cycling IET to determine markers of aerobic fitness.

The IET protocol was performed on an electro-magnetically braked ergometer (Excalibur Sport, Lode BV, Groningen, The Netherlands). Participants completed a 10-min warm-up at 50 W, after which the required cycling power output increased by 25 W every minute (i.e., 1 W every 2.4 s) until they reached volitional exhaustion (operationally defined as a cadence of < 60 revolutions/min for > 5 s, despite strong verbal encouragement).

During the IET, respiratory gas exchange data were assessed using online breath-by-breath gas analysis (Metalyzer 3B; CORTEX Biophysik GmbH, Leipzig, Germany). Prior to all testing, the gas analyser was calibrated according to the manufacturer’s recommendations using ambient air and known concentrations of oxygen and carbon dioxide. The bidirectional turbine (flow meter) was calibrated with a 3-L calibration syringe.

Each participant’s peak oxygen uptake (\(\dot{V}{\text{O}}_{{{\text{2peak}}}}\)) was assessed as the highest oxygen uptake that was attained during a 1-min period in the test. Each participant’s gas exchange threshold was determined as the breakpoint in carbon dioxide production and oxygen consumption (i.e., the point at which the carbon dioxide production begins to increase out of proportion to the oxygen consumption). This breakpoint also coincided with the increase in both ventilatory equivalent of oxygen (\(\dot{V}{\text{E}}/\dot{V}{\text{O}}_{{2}}\)) and end-tidal pressure of oxygen with no concomitant increase in ventilatory equivalent of carbon dioxide (\(\dot{V}{\text{E}}/\dot{V}{\text{CO}}_{{2}}\); Beaver and Wasserman 1986; Pallares et al. 2016). The respiratory compensation point was determined as an increase in both the \(\dot{V}{\text{E}}/\dot{V}{\text{O}}_{{2}}\) and \(\dot{V}{\text{E}}/\dot{V}{\text{CO}}_{{2}}\) and a decrease in partial pressure of end-tidal carbon dioxide (Whipp et al. 1989; Lucia et al. 1999).

Measurement of RR intervals (visits two and three)

For collection of RR intervals, participants were in a supine resting position in a temperature-controlled room set at 20 °C. The room was kept dark and quiet, and participants were instructed not to verbalise throughout the measurement and to breathe freely at their normal resting rate. Before the 30-min RR interval measurement commenced, an initial 20-min supine rest period was carried out to ensure participants were at complete rest and their heart rates were stable.

To collect the RR intervals participants wore a Polar H10 heart rate monitor with a Pro Strap (Polar Electro Oy, Kempele, Finland), which has been shown to provide strong agreement and comparable RR interval signal quality to conventional ECG devices (Gilgen-Ammann et al. 2019; Schaffarczyk et al. 2022). The elastic electrodes of the Pro Strap were moistened, and the strap lengthened to fit around the participant’s chest circumference as described by the manufacturer. The RR intervals were acquired at 1000 Hz via the Elite HRV application (Elite HRV, Asheville, NC, USA) on a mobile device positioned directly next to the participant. The RR intervals were then exported as a text file for processing and analysis offline in MATLAB.

RR interval data pre-processing

All RR interval time series were pre-processed to exclude artifacts and outliers. First, RR intervals shorter than 0.2 s or longer than 2.0 s were removed. Second, RR intervals that differed from the mean of the surrounding 40 RR intervals by more than 20% were excluded.
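The two pre-processing rules above could be sketched as follows. This is a minimal illustration rather than the authors' MATLAB code: the function name `remove_artifacts` is hypothetical, and it assumes that the "surrounding 40 RR intervals" means up to 20 intervals on either side of the current one (fewer at the edges of the recording), evaluated after the range filter has been applied.

```python
from statistics import fmean


def remove_artifacts(rr, lo=0.2, hi=2.0, window=40, max_dev=0.20):
    """Return RR intervals (in seconds) with artifacts and outliers removed."""
    # Rule 1: drop intervals outside the accepted range of lo..hi seconds.
    in_range = [x for x in rr if lo <= x <= hi]

    half = window // 2  # assumption: 20 neighbours before and 20 after
    kept = []
    for i, x in enumerate(in_range):
        # Neighbouring intervals, excluding the interval under test.
        neighbours = in_range[max(0, i - half):i] + in_range[i + 1:i + 1 + half]
        if not neighbours:
            kept.append(x)
            continue
        local_mean = fmean(neighbours)
        # Rule 2: keep the interval only if it lies within max_dev of the local mean.
        if abs(x - local_mean) <= max_dev * local_mean:
            kept.append(x)
    return kept


# Synthetic example: a steady 0.8 s rhythm with three corrupted beats.
demo_rr = [0.8] * 50
demo_rr[10] = 0.1   # too short, removed by rule 1
demo_rr[30] = 2.5   # too long, removed by rule 1
demo_rr[25] = 1.2   # 50% above the local mean, removed by rule 2
clean_rr = remove_artifacts(demo_rr)
```

The percentage of removed intervals for a recording would then be `100 * (len(demo_rr) - len(clean_rr)) / len(demo_rr)`.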

The numbers of RR interval artifacts and outliers removed from the RR interval time series were as follows. Day 1: YG, 19.6 ± 20.5 RR intervals or 1.12 ± 1.24% (range 0.05 to 4.33%) of total RR intervals; OG, 7.5 ± 10.6 RR intervals or 0.46 ± 0.64% (range 0.00 to 2.65%) of total RR intervals. Day 2: YG, 16.3 ± 15.9 RR intervals or 0.94 ± 0.94% (range 0.00 to 3.03%) of total RR intervals; OG, 6.7 ± 12.1 RR intervals or 0.42 ± 0.76% (range 0.00 to 4.10%) of total RR intervals.

Heart rate complexity—nonlinear metric analysis

Approximate and sample entropy

Approximate entropy (ApEn; Pincus 1991) and sample entropy (SampEn; Richman and Moorman 2000) quantify the conditional probability that a template length of m and m + 1 data points is repeated during the time series within a tolerance of r (set at a % of the time series SD). SampEn differs from ApEn, as it avoids counting self-matches by taking the logarithm after averaging, thus reducing the inherent bias existing within the ApEn calculation.

In the current study template length was set at m = 2 and tolerance r = 0.2 of the SD of the RR interval time series, for both ApEn and SampEn analysis (Kaplan et al. 1991). ApEn was calculated as shown by Eq. (1) and SampEn by Eq. (2), where N is the number of data points in the time series, m is the length of the template, Ai is the number of matches of the ith template of length m + 1 data points, and Bi is the number of matches of the ith template of length m data points:

$$ApEn\left( m,r,N \right) = -\frac{1}{N - m}\sum\limits_{i = 1}^{N - m} \log \frac{A_{i}}{B_{i}}$$
(1)
$$SampEn\left( m,r,N \right) = -\log \left( \frac{\sum\limits_{i = 1}^{N - m} A_{i}}{\sum\limits_{i = 1}^{N - m} B_{i}} \right)$$
(2)
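As an illustration of the template-matching counts Ai and Bi, the following sketch computes both entropies in pure Python. It is not the authors' implementation. The function names are hypothetical; templates are compared with the maximum absolute difference between corresponding points; the tolerance uses the sample SD; and ApEn is written with a leading minus sign so that the value is non-negative, with self-matches included for ApEn and excluded for SampEn.

```python
import math
import random
from statistics import stdev


def _match_counts(x, m, r, include_self):
    """Return per-template match counts (A_i for length m + 1, B_i for length m)."""
    n_templates = len(x) - m  # same number of templates for both lengths
    a_counts, b_counts = [], []
    for i in range(n_templates):
        a_i = b_i = 0
        for j in range(n_templates):
            if i == j and not include_self:
                continue
            # Two length-m templates match if every pair of points is within r.
            if max(abs(x[i + k] - x[j + k]) for k in range(m)) <= r:
                b_i += 1
                # The length-(m + 1) templates also match if the next points agree.
                if abs(x[i + m] - x[j + m]) <= r:
                    a_i += 1
        a_counts.append(a_i)
        b_counts.append(b_i)
    return a_counts, b_counts


def sample_entropy(x, m=2, r_frac=0.2):
    r = r_frac * stdev(x)  # assumption: tolerance based on the sample SD
    a, b = _match_counts(x, m, r, include_self=False)
    return -math.log(sum(a) / sum(b))  # logarithm taken after summing


def approximate_entropy(x, m=2, r_frac=0.2):
    r = r_frac * stdev(x)
    a, b = _match_counts(x, m, r, include_self=True)
    return -sum(math.log(ai / bi) for ai, bi in zip(a, b)) / len(a)


# Synthetic check: a perfectly regular series versus uniform random noise.
regular = [0.0, 1.0] * 50
random.seed(0)
noise = [random.random() for _ in range(200)]
```

On these synthetic series the alternating signal gives an entropy of zero for both metrics, whereas the random series gives clearly positive values. Note that `sample_entropy` is undefined when no template matches are found, which can happen for very short recordings.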

Detrended fluctuation analysis

The detrended fluctuation analysis (DFA) algorithm was used, as outlined by Peng et al. (1994), to measure the fractal scaling of the RR interval time series. The DFA algorithm allows for the detection of long-range correlations embedded in seemingly non-stationary physiological time series data. The RR interval time series is first integrated, using Eq. (3):

$$y(k) = \sum_{j=1}^{k}\left( RR_{j} - \overline{RR} \right), \quad k = 1, \ldots, N$$
(3)

The integrated time series is then divided into boxes of equal length, n. Within each box of length n, a least-squares line is fitted to the data; yn(k) denotes the local trend in each box. The integrated time series y(k) is then detrended by subtracting the local trend, yn(k), within each box. The root-mean-square fluctuation of the integrated and detrended time series is calculated by Eq. (4):

$$F\left( n \right) = \sqrt{\frac{1}{N}\sum\limits_{k = 1}^{N} \left[ y\left( k \right) - y_{n}\left( k \right) \right]^{2}}$$
(4)

The computation in Eq. (4) is repeated across all box sizes to provide a relationship between F(n), the average fluctuation as a function of box size, and the box size, n (the number of RR interval data points in a box). The slope of the double log plot, log F(n) vs log n, determines the scaling exponent α. DFA α was calculated over box sizes of 4 \(\le\) n \(\le\) 64 data points. DFA α1 was calculated over box sizes of 4 \(\le\) n \(\le\) 16 data points (i.e., scaling exponent calculated over short time scales) and DFA α2 was calculated over box sizes of 16 \(\le\) n \(\le\) 64 data points (i.e., scaling exponent calculated over long time scales), as used previously by Peng et al. (1995).
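The integrate, detrend and regress steps of Eqs. (3) and (4) can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' code: the function names are hypothetical, every integer box size between `n_min` and `n_max` is used, boxes do not overlap, and any trailing data points that do not fill a complete box are ignored.

```python
import math
import random


def _linfit(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_alpha(rr, n_min=4, n_max=64):
    """Scaling exponent: slope of log F(n) against log n."""
    mean_rr = sum(rr) / len(rr)
    # Eq. (3): cumulative sum of the mean-centred RR intervals.
    y, total = [], 0.0
    for value in rr:
        total += value - mean_rr
        y.append(total)

    log_n, log_f = [], []
    for n in range(n_min, n_max + 1):
        n_boxes = len(y) // n  # assumption: incomplete final box is dropped
        sq_resid = 0.0
        for b in range(n_boxes):
            ks = list(range(b * n, (b + 1) * n))
            segment = y[b * n:(b + 1) * n]
            slope, intercept = _linfit(ks, segment)  # local trend y_n(k)
            sq_resid += sum((s - (slope * k + intercept)) ** 2
                            for k, s in zip(ks, segment))
        # Eq. (4): root-mean-square fluctuation at this box size.
        f_n = math.sqrt(sq_resid / (n_boxes * n))
        log_n.append(math.log(n))
        log_f.append(math.log(f_n))

    alpha, _ = _linfit(log_n, log_f)
    return alpha


# Synthetic signals: white noise and its running sum (Brown noise).
random.seed(1)
white = [random.gauss(0.0, 1.0) for _ in range(2000)]
brown, level = [], 0.0
for step in white:
    level += step
    brown.append(level)
```

With this sketch, `dfa_alpha(rr, 4, 16)` and `dfa_alpha(rr, 16, 64)` would correspond to the α1 and α2 ranges described above. On the synthetic signals, the white noise should return a value near 0.5 and the Brown noise a value near 1.5.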

The DFA produces a scaling exponent α. An α of 0.5 indicates that the value of one RR interval is completely uncorrelated with any previous values (i.e., unpredictable white noise; indicative of a very rough time series). An α of 1.5 indicates Brown noise and a loss of long-range correlations (i.e., a smooth output with long-term memory). An α of 1.0 (i.e., 1/f or pink noise) is suggestive of a physiological output of high complexity that is statistically self-similar with long-range correlations (Peng et al. 1995). Figure 1A presents an example raw RR interval time series and Fig. 1B presents the integrated time series with the least-squares fit “trend” line plotted for box sizes of 64 data points.

Fig. 1 A Example raw RR interval time series; B the integrated RR interval time series, with the least-squares fit representing the “trend” in each box (red lines) and the vertical lines indicating the box size of n = 64 data points. The RR interval data presented produced a DFA α = 1.04 (DFA α calculated over box sizes of 4 \(\le\) n \(\le\) 64; data were from a younger male participant aged 18 years)

Multiscale entropy

Multiscale entropy (MSE) analysis was performed as outlined by Costa et al. (2002) providing a measure of complexity of time series over multiple scales. The MSE analysis overcomes limitations of SampEn and ApEn which only measure the regularity of time series data on one scale, and therefore do not capture the structural and dynamical behaviour of the time series.

From the one-dimensional discrete time series, {χ1, …, χi, …, χN}, coarse-grained time series, {y(τ)}, were constructed, determined by the scale factor, τ, according to Eq. (5):

$$y_{j}^{\left( \tau \right)} = \frac{1}{\tau}\sum\limits_{i = \left( j - 1 \right)\tau + 1}^{j\tau} \chi_{i}, \quad 1 \le j \le N/\tau$$
(5)

At scale one, the time series {y(1)} is simply the original time series. The length of each coarse-grained time series is equal to the length of the original time series divided by the scale factor, τ. The SampEn for each coarse-grained time series is calculated and plotted against the scale factor, τ, producing an MSE curve. The SampEn of each coarse-grained time series was computed using Eq. (2) with a template length m = 2 and r = 0.2 of the SD of the RR interval time series. The area under the MSE curve was calculated from scales 1 to 8 using Eq. (6) and is defined as the complexity index (CI-8), with higher CI values indicating greater complexity of the physiological signal.

$${\text{CI }} = \mathop \sum \limits_{i = 1}^{\tau } SampEn\left( i \right)$$
(6)
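A compact sketch of the coarse-graining in Eq. (5) and the summation in Eq. (6) is given below. It is an illustration only: the function names are hypothetical, trailing points that do not fill a complete window are dropped, and the tolerance r is assumed to be fixed from the sample SD of the original (scale one) series rather than recomputed at each scale.

```python
import math
import random
from statistics import stdev


def coarse_grain(x, tau):
    """Eq. (5): means of consecutive non-overlapping windows of length tau."""
    n_out = len(x) // tau
    return [sum(x[j * tau:(j + 1) * tau]) / tau for j in range(n_out)]


def sampen(x, m, r):
    """Eq. (2) with an absolute tolerance r; self-matches are excluded."""
    n_templates = len(x) - m
    a = b = 0
    for i in range(n_templates):
        for j in range(i + 1, n_templates):  # each pair of templates once
            if max(abs(x[i + k] - x[j + k]) for k in range(m)) <= r:
                b += 1
                if abs(x[i + m] - x[j + m]) <= r:
                    a += 1
    return -math.log(a / b)


def complexity_index(x, max_scale=8, m=2, r_frac=0.2):
    """Return (CI, MSE curve) with CI the sum of SampEn over scales 1..max_scale."""
    r = r_frac * stdev(x)  # assumption: tolerance fixed from the original series
    curve = [sampen(coarse_grain(x, tau), m, r)
             for tau in range(1, max_scale + 1)]
    return sum(curve), curve  # Eq. (6)


# Synthetic example: Gaussian white noise.
random.seed(2)
noise = [random.gauss(0.0, 1.0) for _ in range(600)]
ci8, mse_curve = complexity_index(noise)
```

For uncorrelated noise the MSE curve is expected to fall as the scale factor increases, because averaging reduces the spread of the coarse-grained series relative to the fixed tolerance.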

Poincaré plot SD2

Poincaré plots of RR interval time series were produced by plotting each RR interval as a function of the previous RR interval (Woo et al. 1992). Poincaré plots were then analysed with an ellipse fitting procedure to derive the metrics SD1 (the standard deviation of the points perpendicular to the line of identity) and SD2 (the standard deviation along the line of identity; Brennan et al. 2001). Only SD2 was reported, as SD1 carries the same information as RMSSD, differing only by a scaling factor of approximately √2 (Shaffer and Ginsberg 2017).
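One common way to obtain SD1 and SD2 without an explicit ellipse fit is to rotate the plot by 45 degrees and take the standard deviation along each new axis. The sketch below uses that rotation approach as an assumption; it is not necessarily the fitting procedure used in the study, and the function name is hypothetical.

```python
import math
from statistics import stdev


def poincare_sd(rr):
    """Return (SD1, SD2) of the Poincare plot of RR_n against RR_n+1."""
    current, following = rr[:-1], rr[1:]
    root2 = math.sqrt(2)
    # Coordinates perpendicular to the line of identity.
    across = [(b - a) / root2 for a, b in zip(current, following)]
    # Coordinates along the line of identity.
    along = [(b + a) / root2 for a, b in zip(current, following)]
    return stdev(across), stdev(along)


# Hypothetical RR intervals in milliseconds.
demo_rr = [800, 810, 790, 805, 795, 820, 780, 800]
sd1, sd2 = poincare_sd(demo_rr)
```

Because the rotation preserves total variance, SD1² + SD2² equals the sum of the variances of the two plotted coordinates, which gives a quick consistency check on any implementation.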

Heart rate variability—linear metric analysis

Time-domain metrics

The time-domain measures of heart rate variability quantify the amount of variability present within the RR interval time series.

The root mean square of successive differences between normal RR intervals (RMSSD) was calculated using Eq. (7):

$${\text{RMSSD}} = \sqrt{\frac{1}{N - 1}\sum\limits_{n = 1}^{N - 1} \left( RR_{n + 1} - RR_{n} \right)^{2}}$$
(7)

The standard deviation of normal RR intervals (SDNN) was calculated using Eq. (8):

$${\text{SDNN}} = \sqrt{\frac{1}{N - 1}\sum\limits_{n = 1}^{N} \left( RR_{n} - \overline{RR} \right)^{2}}$$
(8)

The RMSSD and SDNN metrics were reported in milliseconds and as natural logarithm transformed values (LnRMSSD and LnSDNN).
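Equations (7) and (8) translate directly into a few lines of code. The sketch below is illustrative, with hypothetical function names, and assumes the RR intervals are supplied in milliseconds after artifact removal.

```python
import math


def rmssd(rr):
    """Eq. (7): root mean square of the N - 1 successive differences."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def sdnn(rr):
    """Eq. (8): sample standard deviation of the RR intervals."""
    mean_rr = sum(rr) / len(rr)
    return math.sqrt(sum((x - mean_rr) ** 2 for x in rr) / (len(rr) - 1))


# Hypothetical four-beat example in milliseconds.
demo_rr = [800, 810, 790, 800]
ln_rmssd = math.log(rmssd(demo_rr))  # natural logarithm transformed value
ln_sdnn = math.log(sdnn(demo_rr))
```

For the four-beat example the successive differences are 10, -20 and 10 ms, so RMSSD is the square root of 600/3 = 200, and SDNN is the square root of 200/3.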

Frequency-domain metrics

The frequency-domain measures of heart rate variability provide an estimate of spectral power in frequency bands. The power spectrum was estimated using a parametric autoregressive model, with the absolute power in the low frequency (LF) band (0.04–0.15 Hz) and high frequency (HF) band (0.15–0.4 Hz) calculated, along with the LF/HF ratio. The absolute power in the LF and HF bands is reported in ms² and as natural logarithm transformed values (Ln).

Statistical analysis

Data are presented as individual values or mean ± SD (unless specified otherwise). Statistical analyses were conducted using IBM SPSS Statistics 29 (IBM, Armonk, New York, USA). Visual inspection of Q-Q plots and Shapiro–Wilk statistics were used to check whether data were normally distributed.

Day-to-day reliability of all heart rate complexity and variability metrics was assessed through a two-way random intraclass correlation coefficient (ICC2,1) for absolute agreement, standard error of measurement (SEM), minimal detectable change (MDC) and bias (the mean difference between day 1 and day 2). Upper and lower 95% limits of agreement (LOA) were calculated as the mean of differences between days ± 1.96 × the standard deviation of the differences. Between-day coefficients of variation (CVs) of all HRV metrics were calculated by dividing the SD of both days’ measurements by the mean of both days’ measurements and multiplying by one hundred. Between-participant CVs for all HRV metrics were calculated by dividing the SD of all participants’ measurements by the mean of all participants’ measurements and multiplying by one hundred. Paired samples t-tests were used to assess whether a significant difference in the complexity and variability metrics was present between days for each age group.
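The reliability statistics described above can be sketched for one metric measured on two days as follows. This is not the SPSS procedure used in the study. The ICC follows the standard two-way random effects, absolute agreement, single measurement formula. The remaining choices are assumptions made for illustration, since the text does not specify them: bias is taken as day 2 minus day 1, SEM as the pooled SD multiplied by √(1 − ICC), MDC as SEM × 1.96 × √2, and the between-day CV as the mean of the per-participant CVs. The function name and dictionary keys are hypothetical.

```python
import math
from statistics import fmean, stdev


def reliability(day1, day2):
    """Reliability statistics for paired measurements on two days."""
    n, k = len(day1), 2
    rows = list(zip(day1, day2))
    grand = fmean(day1 + day2)
    row_means = [fmean(r) for r in rows]
    col_means = [fmean(day1), fmean(day2)]

    # Two-way ANOVA decomposition (participants x days).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    # ICC(2,1): two-way random effects, absolute agreement, single measure.
    icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    diffs = [b - a for a, b in rows]          # assumption: day 2 minus day 1
    bias = fmean(diffs)
    sd_diff = stdev(diffs)
    sem = stdev(day1 + day2) * math.sqrt(max(0.0, 1 - icc))  # assumption
    mdc = sem * 1.96 * math.sqrt(2)                           # assumption
    cv_between_day = fmean([stdev(r) / fmean(r) * 100 for r in rows])
    return {
        "icc": icc,
        "bias": bias,
        "loa": (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff),
        "sem": sem,
        "mdc": mdc,
        "cv_between_day": cv_between_day,
    }


# Hypothetical data: identical days, and a constant 5-unit shift on day 2.
same = reliability([10, 20, 30, 40], [10, 20, 30, 40])
shifted = reliability([10, 20, 30, 40], [15, 25, 35, 45])
```

The shifted example shows why absolute agreement matters: the two days are perfectly correlated, yet the systematic 5-unit offset lowers the ICC to 40/43 (about 0.93) rather than 1.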

Based on the ICCs, relative reliability was defined as: poor = ICC < 0.5, moderate = ICC ≥ 0.5 to < 0.75, good = ICC ≥ 0.75 to < 0.90 and excellent = ICC ≥ 0.90 (Koo and Li 2016).
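The thresholds above map onto a simple lookup. The helper below is a hypothetical convenience function that encodes exactly the cut-offs stated in the text.

```python
def classify_icc(icc):
    """Label relative reliability using the cut-offs stated in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# The lowest and highest ICCs reported in the study.
labels = [classify_icc(value) for value in (0.33, 0.93)]
```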

Hedges’ g effect sizes and the 95% confidence intervals were calculated to assess the differences between the two age groups (YG vs. OG) HRV metrics and interpreted as: 0.2 to 0.5 small effect, 0.5 to 0.8 medium effect, ≥ 0.8 large effect (Cohen 1992).
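A sketch of the effect size calculation is shown below. Hedges' g is computed as Cohen's d with the usual small-sample correction factor. The confidence interval uses a common large-sample approximation of the standard error, which is an assumption here because the text does not state how the intervals were derived. The function name is hypothetical.

```python
import math
from statistics import fmean, variance


def hedges_g(group_a, group_b):
    """Return (g, (lower, upper)) for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled standard deviation from the two sample variances.
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    d = (fmean(group_a) - fmean(group_b)) / math.sqrt(pooled_var)
    # Small-sample bias correction.
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    g = d * correction
    # Assumption: approximate standard error for the 95% interval.
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + g ** 2 / (2 * (n_a + n_b)))
    return g, (g - 1.96 * se, g + 1.96 * se)


# Hypothetical values for two small groups.
g_demo, ci_demo = hedges_g([2, 4, 6, 8], [1, 3, 5, 7])
```

In the example both groups have a variance of 20/3 and the means differ by 1, so d = 1/√(20/3) and the correction factor is 20/23, giving g of roughly 0.34.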

Multiple linear regressions were performed to estimate the effect of participant age, sex and \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) on all heart rate complexity and variability metrics. Males were set as the baseline reference level; therefore, positive beta coefficients indicate that being female will likely result in a higher value.
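The dummy coding described above (male = 0 as the reference level, female = 1) can be illustrated with a small ordinary least squares fit. This sketch only recovers the beta coefficients; it does not compute the standard errors or P values that a statistics package reports. The function names and the synthetic data are hypothetical, and the normal equations are solved with a basic Gaussian elimination for the sake of a self-contained example.

```python
def _solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(predictors, y):
    """Return [intercept, beta_1, ...] from the normal equations."""
    x = [[1.0] + list(row) for row in predictors]  # prepend intercept column
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    return _solve(xtx, xty)


# Synthetic data generated from known coefficients (no noise).
age = [20, 25, 30, 55, 60, 65]
female = [0, 1, 0, 1, 0, 1]  # dummy coding: male = 0 (reference), female = 1
vo2peak = [4.0, 2.5, 3.0, 3.5, 2.0, 3.8]
metric = [60.0 - 0.5 * a + 3.0 * f + 1.0 * v
          for a, f, v in zip(age, female, vo2peak)]
coefs = fit_ols(list(zip(age, female, vo2peak)), metric)
```

Here the coefficient on `female` is +3.0, which under this coding reads as females having a value 3 units higher than males of the same age and fitness.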

The significance level was set at P < 0.05 in all cases.

Results

Participant characteristics and anthropometrics

Data from forty-four older adults (34 M; 10F) and twenty-two younger adults (16 M; 6F) were included in the analysis. Table 1 presents participant anthropometrics and IET data.

Table 1 Participant characteristics, anthropometrics and IET data (mean ± SD)

Reliability of heart rate complexity and variability-based metrics

Based upon the ICCs, the OG demonstrated poor reliability for the CI-8 and SD2 metrics, moderate reliability for the RMSSD, SDNN, LnRMSSD, LnSDNN, LF (ms²), HF (ms²), LF (log), HF (log), ApEn, SampEn, DFA α, DFA α1 and DFA α2 metrics, and good reliability for the LF/HF metric (Table 2). By comparison, the YG demonstrated poor reliability for the ApEn, SampEn and SD2 metrics, moderate reliability for the LnSDNN, LF (ms²), LF (log), DFA α2 and CI-8 metrics, good reliability for the RMSSD, SDNN, LnRMSSD, HF (ms²), HF (log), LF/HF and DFA α metrics and excellent reliability for the DFA α1 metric (Table 3).

Table 2 Older group day-to-day reliability of RR interval complexity and variability metrics
Table 3 Younger group day-to-day reliability of RR interval complexity and variability metrics

Effect of age, sex and \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) on heart rate complexity

There was a significant reduction in the ApEn (P < 0.001; Fig. 2E), SampEn (P = 0.031; Fig. 2F) and SD2 (P < 0.001; Fig. 2H) metrics with ageing (Table 5). There was no significant effect of age on the CI-8 (P = 0.493; Fig. 2G; Table 5).

Fig. 2 Comparisons between the younger and older groups’ complexity and variability metrics. A root mean square of successive differences between normal RR intervals; B standard deviation of normal RR intervals; C low frequency power; D high frequency power; E approximate entropy; F sample entropy; G complexity index under 8 scales; H standard deviation of points along the line of identity of the Poincaré plot (*P < 0.05; **P < 0.001; data points are the mean of both days for each individual participant)

There was no significant effect of age on the DFA α1 (P = 0.107; Fig. 3B) and DFA α2 (P = 0.147; Fig. 3C) metrics (Table 5). The DFA α metric was significantly increased with ageing (P = 0.029; Fig. 3A).

Fig. 3 Comparisons between the younger and older groups’ detrended fluctuation analysis metrics. A DFA α (box sizes of 4 \(\le\) n \(\le\) 64 data points); B DFA α1 (box sizes of 4 \(\le\) n \(\le\) 16 data points); C DFA α2 (box sizes of 16 \(\le\) n \(\le\) 64 data points; *P < 0.05; data points are the mean of both days for each individual participant)

There was a significant effect of sex (P = 0.028), but not of \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) (P = 0.822), on DFA α1, with females presenting with lower values. There was no significant effect of sex or \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) on the ApEn, SampEn, DFA α, DFA α2, CI-8 and SD2 metrics (P > 0.05; Table 5).

Effect of age, sex and \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) on heart rate variability

There was a significant reduction in RMSSD (P < 0.001; Fig. 2A), SDNN (P < 0.001; Fig. 2B), LF power (P < 0.001; Fig. 2C) and HF power (P < 0.001; Fig. 2D) metrics with ageing (Table 5).

There was no significant effect of sex or \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) on any of the linear HRV metrics (P > 0.05; Table 5).

Discussion

Reliability of heart rate complexity and variability metrics

The current study provides new inter-day reliability data for a range of widely utilised time-domain, frequency-domain and nonlinear HRV metrics in healthy highly active younger and older adults. The primary findings of this investigation reveal all linear HRV metrics in both the younger adult and older adult groups to exhibit moderate to good inter-day reliability, as indicated by ICCs ranging from 0.56 to 0.88 (Tables 2 and 3). Similarly, the majority of nonlinear HRV metrics demonstrated moderate to excellent inter-day reliability with ICCs ranging from 0.55 to 0.93 (Tables 2 and 3). There were exceptions however, with ApEn, SampEn and SD2 metrics of the YG, and the SD2 metric of the OG exhibiting poor relative reliability, as shown by ICCs of less than 0.50 (Tables 2 and 3). This variability in the inter-day reliability of HRV metrics can likely be attributed to the sensitivity of the ANS and the influence of various individual internal and external factors that can be challenging to control (Fatisson et al. 2016).

It has been suggested that the assessment of test–retest reliability should not rely solely on ICCs (Weir 2005). This viewpoint is supported by the current study, with the ApEn, SampEn and CI-8 HRV metrics displaying ICCs ranging from 0.37 to 0.69, indicating poor to moderate relative reliability (Tables 2 and 3). However, these metrics exhibited low SEM values (ranging from 0.06 to 0.20) and low between day CVs (ranging from 2.95% to 7.65%), which suggests high absolute retest reliability. This apparent contradiction can be explained by the homogeneous population recruited and low between participant CVs for these specific metrics, leading to low relative but high absolute reliability (Atkinson and Nevill 1998; Weir 2005). In contrast, the SD2 metric showed both low relative reliability (ICCs ranging from 0.33 to 0.44) and low absolute reliability (between day CVs of 18.13% to 20.42% and SEM values of 17.43 to 60.00). Similarly, the frequency-domain metrics LF, HF, and LF/HF also exhibited low absolute reliability (Tables 2 and 3). These findings indicate that specific HRV metrics may present significant challenges when used to detect intervention/treatment effects or individual changes over time. Consequently, the HRV metrics with low relative and absolute reliability may not be suitable in specific research contexts, especially those with limited sample sizes or small intervention/treatment effects.

ICCs and SEM of the SampEn and DFA metrics for both age groups in the current study are comparable to those reported by Maestri et al. (2007a) who examined HRV inter-day reliability in healthy adults with a mean age of 38 years (range 26 to 56 years). Accordingly, the LnRMSSD, LnSDNN, LnLF, and LnHF metrics of both age groups produced similar ICCs to those reported for healthy young students aged between 18 and 39 years (Uhlig et al. 2020), in addition to comparable between day CVs and SEM to healthy trained young adults (aged 21.5 ± 1.4 years; Al Haddad et al. 2011). The corroboration between reliability studies improves confidence in the expected retest error of HRV metrics. However, it also emphasises the high level of variance in certain HRV metrics (i.e., LF, HF, LF/HF and SD2), as well as the difficulty facing researchers in sufficiently powering studies which are utilising HRV measurements across multiple visits and/or during longitudinal studies.

The study builds upon previous HRV reliability research, providing inter-day reliability data for short-term resting HRV measurements for younger and, importantly, older adults across a range of widely utilised HRV metrics. The reliability data in Tables 2 and 3 provide a resource for researchers to reference when calculating sample sizes for future HRV studies with healthy adult participants. Importantly, given the disparity in the reliability of different HRV metrics (ICCs, 0.33 to 0.93; between day CVs, 2.9 to 36.5%; Tables 2 and 3), it is recommended that study sample size be based upon the chosen metric with the lowest reliability to reduce the likelihood of a type I or type II error across all metrics. In addition, the reliability statistics also allow for the assessment of whether there is a true intervention effect or individual change in HRV metrics within a study and not just a result of biological and measurement error.

Effect of age, sex and \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) on heart rate complexity and variability

The current study findings demonstrate a significant age-related decline in linear (RMSSD, LnRMSSD, SDNN, LnSDNN, LF, HF) and nonlinear (ApEn, SampEn and SD2) HRV metrics (Tables 4 and 5; Fig. 2), corroborating the findings of a broad body of literature which has assessed the effect of age on heart rate complexity and variability (Kaplan et al. 1991; Iyengar et al. 1996; Jensen-Urstad et al. 1997; Umetani et al. 1998; Pikkujamsa et al. 1999; Beckers et al. 2006; Goff et al. 2010; Voss et al. 2015; Hernandez-Vicente et al. 2020). An age-related decrease in both the linear and nonlinear HRV metrics is expected, primarily driven by alterations in the ANS, characterised by a decline in parasympathetic activity and an increase in sympathetic drive (Seals and Esler 2000).

Table 4 Mean HRV metrics for age groups and effect size comparisons
Table 5 Multiple linear regression model statistics

Despite age-related differences in all other HRV metrics, there was no significant effect of age on the nonlinear DFA α1 and α2 metrics (Table 5; Fig. 3B and C). Mean DFA α1 and α2 values were close to 1.0 (i.e., 1/f or pink noise), indicative of a healthy physiological signal of high complexity that is exhibiting both short and long-range fractal-like correlations (Peng et al. 1995). These findings are comparable to previous research which also found no age-related difference in the DFA α1 and α2 metrics (Vuksanovic and Gal 2005; Schmitt and Ivanov 2007; Wiersema et al. 2022). Seminal research exploring the effect of age on the fractal behaviour of RR interval time series observed healthy older adults (α2 = 0.75 ± 0.17) to have a significant decline in long-range fractal correlations, in comparison to healthy younger adults (α2 = 0.99 ± 0.10; Iyengar et al. 1996). The mean age of the older group in the study of Iyengar et al. (1996) was greater than that of the older group of the current study (74 years vs 59 years), which may partly explain the difference in findings between the studies, as may the high activity levels of the older participants of the current study. It is important to note that despite recruiting a homogeneous sample, several participants did produce α1 and α2 values closer to 0.5 and 1.5 (Fig. 3B and C). Such between participant variation is expected, occurring to differing extents for all HRV metrics (Tables 2 and 3), and highlights the importance of also accounting for the inter-individual variability of HRV metrics when seeking to understand the utility of HRV in different populations.

The findings of the current study demonstrate no significant age-related change in the nonlinear CI-8 metric (Fig. 2G; Table 5). Like the DFA α1 and α2 metrics, the CI-8 metric captures the structural and dynamical behaviour of the RR interval time series over multiple scales (Costa et al. 2002). Accordingly, the complexity (DFA and CI-8) of the study participants’ RR interval time series is suggestive of their cardiovascular system’s ability to adapt to physiologic perturbations and respond quickly to challenges to maintaining homeostasis (Peng et al. 2009; Manor and Lipsitz 2013). The mixed findings of the effect of age on different HRV metrics highlight the necessity of employing multiple heart rate complexity and variability metrics when analysing RR interval time series. If only specific time-domain, frequency-domain or nonlinear HRV metrics are utilised, studies may fail to capture different linear and nonlinear aspects of the signal, therefore potentially missing important information on cardiac interval dynamics. However, the choice and combination of HRV metrics by researchers is also likely to be dependent on the research context, with different HRV metrics better suited to capturing specific properties and/or changes in cardiac interval dynamics, in addition to the redundancy of combining HRV metrics which measure similar HRV properties (Maestri et al. 2007b).

The current study included male (N = 50) and female (N = 16) participants. Sex differences in HRV are well documented and are influenced by physiological, hormonal, and neural factors (Koenig and Thayer 2016). Moreover, sex-related differences in HRV may be more pronounced in younger adults, when compared to older adults (Maria et al. 2023). It should be noted that the current study did not control for menstrual cycle phase or hormone changes due to the menopause, which are known to affect HRV (Aubert et al. 2003; Maria et al. 2023). Sex did not significantly predict the HRV metrics in the current study, except for the DFA α1 metric (Table 5). The significant effect of sex indicates that females present with a lower α1 value in comparison to males. Such a difference in α1 is suggestive of a notable change in the short-range fractal correlation properties of HRV and an alteration in sympathetic and vagal activation (Tulppo et al. 2005).

While sex was not significantly predictive of most HRV metrics, the beta coefficients indicate a trend towards females having higher values in HRV metrics primarily associated with parasympathetic activity (i.e., HF power and RMSSD) in comparison to males. There is evidence to support an increase in parasympathetic modulation (as indicated by absolute HF power) in females compared to males (Koenig and Thayer 2016). However, the evidence has been argued to be inconclusive, with heterogeneity in study findings likely arising from differences in study methodology and analysis methods (Maria et al. 2023).

Aerobic physical activity has been shown to have positive effects on measures of HRV in both younger and older adults, when compared to sedentary age-matched individuals, through enhanced autonomic balance, improved baroreflex sensitivity and cardiac adaptations (Aubert et al. 2003). To capture the effect of inherent biological ageing on HRV (i.e., in individuals unaffected by sedentary behaviour or underlying pathologies), all participants of the current study were recruited to be in full health and regular exercisers closely matched for physical activity levels and aerobic fitness (Table 1). Although the YG did present with a higher absolute aerobic fitness as measured by \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) (YG \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) = 3.5 ± 1.0 L min−1 vs. OG \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) = 3.0 ± 0.8 L min−1), \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) was not significantly predictive of any HRV metric (Table 5).

Limitations

The current study only assessed the reliability of HRV metrics derived from short-term RR interval measurements in healthy active younger and older adults during free-breathing wakeful supine rest. Due to the sensitivity of the ANS to various external and internal factors (Fatisson et al. 2016), caution is advised when extrapolating the reliability data reported herein to HRV metrics derived from RR interval measurements performed under different conditions. The current study was limited to the assessment of inter-day reliability and did not assess the intra-day reliability of the HRV metrics. Given the sensitivity of the ANS, it is probable the inter-day variation in HRV largely reflects biological error, whereas intra-day variation in HRV would likely provide a closer insight into the measurement error.

The current study assessed a range of time-domain, frequency-domain and nonlinear HRV metrics, which are extensively studied and widely accepted to provide valuable information regarding ANS function in ageing, between sexes and in athletes (Koenig and Thayer 2016; Shaffer and Ginsberg 2017; Lundstrom et al. 2023). However, it is important to highlight that the study does not provide a comprehensive list of available HRV metrics. Notably, the study did not include HRV metrics from the major families of symbolic dynamics, predictability, and empirical mode decomposition (Maestri et al. 2007b). Researchers should specifically consider using the symbolic dynamics metric, one variation pattern (1VP), and the empirical mode decomposition metric, IMAI2. The 1VP and IMAI2 metrics have been shown to provide additive predictive value independent of clinical predictors when assessing chronic heart failure patients (Maestri et al. 2007b) and to detect experimentally induced changes in autonomic cardiovascular regulation in healthy individuals (Guzzetti et al. 2005).

The nonlinear HRV metric, ApEn, was included in the current study as a metric from the entropy family, which can assess the irregularity or randomness of an RR interval time series (Pincus 1991). However, the calculation of ApEn has notable limitations, arising from its counting of self-matches, that may affect its interpretation (Richman and Moorman 2000). ApEn is sensitive to data length, particularly for short data sequences such as RR interval time series, which can bias the results because the calculation relies on identifying matching patterns within the arbitrarily specified tolerance parameter, “r”. Moreover, self-matching can make ApEn lack relative consistency: if the ApEn of one time series is higher than that of another, it should remain higher under all parameter choices, yet this is not always the case (Richman and Moorman 2000). Despite ApEn demonstrating high absolute retest reliability, researchers are advised to account for these limitations when using ApEn for HRV analysis.
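
The role of self-matching is easiest to see in the calculation itself. The following minimal Python sketch follows the definition of ApEn given by Pincus (1991); it is illustrative only and is not the software used in the current study. The function names, the default embedding dimension (m = 2) and the helper that sets r to 0.2 × SD are assumptions.

```python
import math
import statistics


def _phi(series, m, r):
    """Mean log-fraction of templates of length m lying within r of each template."""
    n_templates = len(series) - m + 1
    templates = [series[i:i + m] for i in range(n_templates)]
    total = 0.0
    for a in templates:
        # Every template is compared with all templates, including itself.
        # This self-match keeps the count >= 1 (so the logarithm is always
        # defined) but biases ApEn towards regularity in short series.
        matches = sum(
            1
            for b in templates
            if max(abs(x - y) for x, y in zip(a, b)) <= r
        )
        total += math.log(matches / n_templates)
    return total / n_templates


def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) = phi(m) - phi(m + 1), with r an absolute tolerance."""
    return _phi(series, m, r) - _phi(series, m + 1, r)


def sd_tolerance(series, fraction=0.2):
    """Hypothetical helper: tolerance as a fraction of the series SD."""
    return fraction * statistics.pstdev(series)
```

For example, `approximate_entropy(rr, m=2, r=sd_tolerance(rr))` would return the ApEn of a list of RR intervals `rr`. Recomputing the value over a range of `r` for two recordings is one simple way to check whether their ranking is preserved, which is the relative consistency issue described above.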

Conclusion

The current findings show that widely used HRV metrics derived from short-term (30-min) RR interval measurements are reproducible between days in healthy, highly active younger and older adults. However, there is a disparity in the inter-day reliability of different HRV metrics, with certain metrics presenting with a higher level of variance (i.e., LF, HF, LF/HF and SD2). Linear and nonlinear HRV metrics capture different aspects of cardiac interval dynamics; therefore, researchers should not exclude metrics based solely on their reliability. Instead, studies should be designed appropriately based upon the chosen HRV metrics to increase the probability of detecting a true effect. This study also extends previous research by demonstrating a significant age-related decline in the majority of linear and nonlinear HRV metrics assessed. However, the participants’ sex and \(\dot{V}{\text{O}}_{{{\text{2peak}}}}\) did not significantly influence the HRV metrics.