A large body of research has documented a variety of learning outcomes affected by a wide spectrum of psychological factors, such as stress, anxiety, attention, and other diverse responses to lectures, exams, presentations, and normal school life. Stress, one of the most researched phenomena, has been found to have significant cognitive, psychological, social, and behavioral consequences for many learners, ultimately impacting academic success (Ahmady et al., 2021; Kwon et al., 2017; Lisnyj et al., 2020; Moreira et al., 2016; Ribeiro et al., 2018). Extensive research has revealed intricate relationships among emotional and cognitive variables and learning. These encompass test anxiety effects on educational outcomes (von der Embse et al., 2012; Zeidner, 2007, 2014); the influence of stress on attention, memory, and higher cognitive functions (Kofman et al., 2006; Redondo et al., 2019); and its impact on task performance in high-stakes environments (Martin & Naziruddin, 2020; Wetzel et al., 2010).

In traditional educational research, factors related to learning have typically been measured using self-report instruments (Weenk et al., 2018), which, while statistically confirmed for validity and reliability, have consistently faced criticism due to their inherent limitations (Williamson, 2007). Among the concerns is the subjective nature of self-reports, making them vulnerable to distortion. For example, respondents can misunderstand or misinterpret the literal or practical meaning of the survey questions (Hunt & Bhopal, 2004; Sudman et al., 1996). They can also result in mis-recalled information given that people often forget their experiences, even very significant ones, over surprisingly short periods (Schmier & Halpern, 2004; Wagenaar & Keren, 1986). Respondents must rely on their estimations, judgments, and perspectives (Brann et al., 2021; Chafouleas, 2011); may overestimate a certain state or trait (Williamson, 2007); or may answer in a way they perceive to be socially desirable (Krumpal, 2011). Further, the subjectivity inherent in self-report measures inevitably leads to reliability problems, since responses are notoriously changeable. Fatigue, anger, or broad situational changes also can cause response variation in the same person (Salinsky et al., 2001). Finally, self-report instruments typically inquire about learning experiences post-occurrence, presenting limitations in capturing learners’ states in real time during the learning process and in monitoring these states over prolonged learning periods (Pekrun, 2006; Prokofieva et al., 2019; Weenk et al., 2018).

Researchers have recognized and addressed the limitations of self-report instruments by pioneering the measurement of psychophysiological variables as a direct and objective reflection of learners’ cognitive and emotional states (e.g., Paas and Van Merriënboer (1994)). Psychophysiological measures assess the interplay between psychological and physiological conditions using a variety of tools and data, including electroencephalography (EEG) and event-related potentials (ERP), heart rate (HR) and heart rate variability (HRV), blood pressure (BP), electromyography (EMG), thermal imaging, and pupillometry (Gaffey & Wirth, 2014; Lohani et al., 2019). Physiological responses avoid the self-bias risks and are generally not distorted by respondents’ recall, understanding, interpretations, judgments, or intentions. Given their increased objectivity, psychophysiological measures capture stable individual features, increasing the reliability of the collected data (Scrimin et al., 2019, 2021). With the development of newer wearable technologies, the use of psychophysiological measures in educational research has become even more attractive. Such devices allow for real-time continuous monitoring of changes in individual psychological states, data that self-report pre- and post-surveys have only been able to collect to a limited degree (Castaldo et al., 2019; Huhn et al., 2022).

In an extension of clinical research, HRV is a psychophysiological measure increasingly used in educational research. HRV is a relatively simple, affordable, noninvasive, and reliable measure that can capture changes in learner states over time (Castaldo et al., 2019; Laborde et al., 2017; Weenk et al., 2018). HRV represents the variability of the time intervals between consecutive heartbeats. It is a quantitative expression of balance in the functioning of the autonomic nervous system (ANS), that is, of sympathetic (signaling a body in a state of alertness) and parasympathetic (signaling a body in a state of relaxation) modulation (Huhn et al., 2022; Prichard et al., 2012; Tracy et al., 2016; Wimmer et al., 2019). In the presence of high demands, such as stressors, the sympathetic nervous system predominates over the parasympathetic, leading to an increased heart rate (HR) and a decreased HRV. Conversely, dominant parasympathetic activity, reflected in a decreased HR and an increased HRV, is associated with a state of relaxation (Huhn et al., 2022).

HRV is an index of the complex mix of physiological, emotional, and cognitive processes involved in self-regulation and adaptability, owing to brain–heart connections (Laborde et al., 2017; Malik, 1996). These connections work via two pathways: (1) inhibitory pathways from the prefrontal cortex to the amygdala via GABA (gamma-aminobutyric acid, a neurotransmitter with a calming effect on the nervous system) and (2) additional regulatory pathways from the amygdala to the ANS through the production of several hormones and neurotransmitters, including dopamine, serotonin, cortisol, adrenaline, and noradrenaline. These two pathways modulate heart rate and thus HRV. Hence, HRV is influenced by a range of factors involved in physiological, emotional, and cognitive regulation: physiological factors such as cardiac function, metabolic processes, hormonal regulation, respiration rate, and age; emotional factors such as stress, depression, anxiety, and anger; and cognitive factors such as working memory, attention, and executive function. All of these factors are closely interrelated with the inhibitory GABAergic pathways and the hormones and neurotransmitters produced by the amygdala. High HRV implies a balanced ANS in a state of robust physiological, emotional, and cognitive regulation, allowing learners to respond to varied external stimuli with flexibility and resilience (Yee Chung et al., 2021), whereas lower HRV can signal an unbalanced ANS, implying physiological, emotional, and/or cognitive difficulties.

HRV is typically measured using an electrocardiogram (ECG), which records the electrical activity of the heart. HRV analysis includes the time domain, frequency domain, and non-linear domain parameters. Time domain parameters are the simplest form of HRV analysis, involving direct measurements of the time intervals between heartbeats. Common metrics include the standard deviation of normal-to-normal (NN) intervals (SDNN) and the root mean square of successive differences (RMSSD). Frequency domain analysis examines the spectral components of HRV, decomposing the HR signal into frequency bands such as high-frequency (HF) bands (0.15–0.40 Hz), low-frequency (LF) bands (0.04–0.15 Hz), and very low-frequency (VLF) bands (0.003–0.04 Hz). The duration of HRV recordings can vary; short-term recordings lasting around 5 min are typical for quick assessments, while long-term recordings (over 24 h) offer a more comprehensive overview. Ensuring data quality in HRV measurement is crucial. In the medical context, HRV measurement standards were provided by the Task Force of the European Society of Cardiology and the North American Society for Pacing and Electrophysiology in 1996, with recommendations for meticulous pre-processing, artifact-free ECG recordings, and accounting for age, gender, health status, current medications, and the environment during the recording (Malik, 1996).
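
As an illustration of the time domain calculations, a minimal sketch in Python (using NumPy and hypothetical interval values) is given below; for actual analyses, validated toolkits such as Kubios are preferable.

```python
import numpy as np

def time_domain_hrv(rr_ms):
    """Basic time domain HRV metrics from a series of NN/RR intervals in milliseconds."""
    rr = np.asarray(rr_ms, dtype=float)
    diffs = np.diff(rr)                              # successive differences between intervals
    sdnn = rr.std(ddof=1)                            # SDNN: standard deviation of NN intervals
    rmssd = np.sqrt(np.mean(diffs ** 2))             # RMSSD: root mean square of successive differences
    pnn50 = 100.0 * np.mean(np.abs(diffs) > 50)      # pNN50: % of successive differences > 50 ms
    return {"SDNN": sdnn, "RMSSD": rmssd, "pNN50": pnn50}

# Hypothetical, artifact-free RR intervals (ms) from a short resting recording
rr_example = [812, 845, 790, 860, 830, 805, 870, 840]
print(time_domain_hrv(rr_example))
```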

Educational studies have been conducted using HRV data in diverse learning contexts to provide an objective understanding of learners’ regulation mechanisms regarding emotional and cognitive phenomena (Forte et al., 2019; Lehrer et al., 2020). Studies on learner emotions have investigated the association between HRV and stress, anxiety, panic disorder, depression, emotion, empathy, burnout, and social competence (Dimitriev et al., 2008; Hammoud et al., 2018; Huhn et al., 2018; Melillo et al., 2011; Scrimin et al., 2018; Tharion et al., 2009). Similarly, studies of learner cognition have revealed a link between HRV and learner attention, working memory, executive functioning, problem-solving, general intelligence, and performance (Hansen et al., 2004; Melis & van Boxtel, 2001; Scrimin et al., 2018).

Since the use of HRV in educational research is still nascent, numerous challenges related to HRV measurement and interpretation must be addressed to provide sound guidelines for educational contexts. While HRV measurement offers an alternative and complementary method for assessing learners’ psycho-cognitive states, its validity has yet to be established, particularly regarding sensitivity and specificity, crucial attributes of a valid measure. HRV measures must be sensitive enough to accurately detect the states of interest in a study (e.g., stress and attention) and specific enough to rule out states that are not of interest (e.g., relaxation rather than stress, or distraction rather than attention). Both sensitivity and specificity pertain to the validity and accuracy of a measure, along with concurrent validity (the extent to which HRV correlates with similar measures assessed at approximately the same time, such as subjective self-report data) and predictive validity (the extent to which HRV predicts certain learning outcomes).

Concerns regarding HRV utilization in educational research are as follows. First, several factors can limit its validity, such as physiological variables related to health status, particularly respiratory and cardiovascular diseases (Campbell & Ehlert, 2012; Malik, 1996; Redondo et al., 2019), which should be controlled through robust methodological approaches (Laborde et al., 2017). Second, despite the ease of data collection, the wide range of information collected by an HRV measure can be easily misconstrued, and difficulties with interpreting findings can limit real-world implications (Laborde et al., 2017; Scrimin et al., 2021). Although Malik’s (1996) pivotal guidelines provided early standards for the clinical use of HRV data, they are dated and focused primarily on clinical medicine, necessitating more relevant guides for current educational research contexts. Third, few studies have systematically examined the concurrent or predictive validity of HRV measures in educational research.

This study addresses these caveats and provides a comprehensive overview of the methodological aspects of using HRV measures in educational research. Accordingly, we posed the following research questions: (1) To what extent has HRV been used to measure stress (as a representative emotional area) and attention (as a representative cognitive area) in educational research, and for which learners have HRV data been used? (2) How has HRV data been collected and analyzed in educational research (device, measuring time, parameters, precautions, and analysis software)? (3) To what extent does HRV data demonstrate concurrent and predictive validity in relation to other data types, such as self-reported or achievement data, in educational research? We intend to provide educational researchers with practical recommendations and methodological considerations regarding HRV use.

Method

This review was designed, conducted, and reported in accordance with the updated PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 statement and guidelines (Page et al., 2020).

Search Strategy

We searched four databases judged to be most relevant to our topic: PubMed (National Library of Medicine; 1781 ~), PsycINFO (1887 ~), Web of Science (1990 ~), and ERIC (1966 ~). We initially performed database searches for all records up to March 5, 2024, using search terms in three categories: heart rate variability (heart rate variability* OR HRV), educational context (education OR educat* OR learn* OR student* OR teach*), and stress or attention (attention* OR stress*). Detailed search queries for each database are presented in Table S1 (Search Queries in Online Supplementary File 1). In addition, we performed a hand search for gray literature in Google Scholar. An experienced research librarian supervised the development and implementation of the search strategies.

Inclusion and Exclusion Criteria

Inclusion criteria initially comprised original English research articles that investigated stress or attention using HRV measurements in educational contexts. We excluded research with extremely young (below 5 years) or older age groups (above 65 years) based on existing research evidence regarding HRV age dependency (Antelmi et al., 2004; Bonnemeier et al., 2003; Michels et al., 2013; Voss et al., 2015; Zhang, 2007). We only reviewed studies involving healthy individuals with no physical or psychological problems. Our inclusion criteria are presented in Table 1. After removing duplicates, two authors independently reviewed the titles and abstracts of all articles. Disagreement was resolved by consulting a third author to reach a consensus.

Table 1 Inclusion and exclusion criteria

Data Extraction

Data were extracted from all eligible studies based on our research questions, resulting in the following data extraction categories and subcategories. More detailed data extracted by the classification scheme (as in Lee et al. (2020)) are provided in Table S2 (Classification Scheme in Online Supplementary File 1).

  • Areas of interest (stress/attention), participants (sample size, age, and school level group), and participant control (excluded participants and participant instructions)

  • HRV data collection and analysis (educational setting, HRV device, software, pre-processing, measurement standards compliance, baseline measure, measuring time, parameters, and data analysis and interpretation)

  • Concurrent and predictive validity (concurrently measured psychophysiological, self-report, and performance data) and correlation level (L1/L2/L3/L4)

For the areas of interest, data were coded as stress, attention, or both. Due to the blurred definitions and mixed use of the terms, stress in this study is defined as a complex psychological and physiological phenomenon that arises in response to perceived challenges or demands related to learning. Stress in this study is associated with specific situations or events rather than being chronic in nature. It manifests as anticipatory anxiety, increased mental and physiological arousal, and emotional tension. This phenomenon affects various aspects of an individual’s cognitive, emotional, behavioral, and physiological processes (Blackburn & Epel, 2017; Fink, 2017). We coded both studies of stress and those of anxiety uniformly as studies of stress.

We defined attention as a multifaceted cognitive function that involves the allocation of cognitive resources to stimuli or tasks. It encompasses processes such as focusing, sustaining concentration over time, managing cognitive load, and the ability to selectively prioritize certain stimuli over others (Guo et al., 2020; Redondo et al., 2019). We coded studies with experiments or interventions that required attention, such as the Stroop test or psychomotor vigilance task (PVT), as studies of attention.

We extracted information on HRV data collection and analysis, including participants (sample size, age, and school level) and participant control (excluded participants and participant instructions); the educational setting and the presence of professors/teachers; the HRV devices and software used; the presence of pre-processing; compliance with measurement standards; the presence of baseline measurements; the duration of measurement; the specific HRV parameters collected; and the methods of data analysis and interpretation.

Regarding the correlation level used to examine the concurrent validity of HRV measurement, we coded the extent of the correlation between HRV and other data into four levels based on Cohen’s (1992) conventions for effect size: level 1 (large effect size, r > 0.5), level 2 (medium or small effect size, approximately r = 0.3 or r = 0.1), level 3 (the r value was not reported, or no information on statistical correlation was provided), and level 4 (not correlated).
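
To make the coding scheme concrete, a minimal Python sketch is shown below; the handling of boundary values and of unreported coefficients reflects our reading of the scheme rather than an explicit rule.

```python
def correlation_level(r):
    """Map a reported Pearson r (or None when no value was reported) to the four coding levels.
    Boundary handling is an illustrative assumption."""
    if r is None:
        return "L3"          # r not reported, or no statistical information provided
    if abs(r) > 0.5:
        return "L1"          # large effect size (Cohen, 1992)
    if abs(r) >= 0.1:
        return "L2"          # small-to-medium effect size
    return "L4"              # effectively not correlated

print(correlation_level(0.62))   # -> L1
print(correlation_level(0.24))   # -> L2
print(correlation_level(None))   # -> L3
print(correlation_level(0.03))   # -> L4
```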

Using a data extraction form developed for this study through iterative testing and revision, the three authors worked independently. Disagreements were resolved through joint discussion.

Assessment of Study Quality

To assess the quality of the selected studies, we used the QualSyst Standard Quality Assessment Criteria by Kmet et al. (2004). This assessment tool was selected due to its applicability to evaluating the quality of quantitative studies, including non-randomized controlled trials. The criteria for evaluating quantitative studies are (1) clear question/objective, (2) appropriate study design, (3) proper subject/comparison group selection, (4) detailed subject characteristics, (5) random allocation in interventional studies, (6) investigator blinding, (7) subject blinding in interventional studies, (8) robust outcome and exposure measures, (9) appropriate sample size, (10) adequate analytic methods, (11) variance estimates for results, (12) control of confounding, (13) detailed results reporting, and (14) evidence-supported conclusions. These 14 criteria for quantitative studies are scored from 0 (lowest quality) to 2 (highest quality). The study quality was defined using a final mean score, with lower than 0.50 indicating inadequate quality, 0.50–0.70 indicating adequate, 0.71–0.80 indicating good, and higher than 0.80 indicating strong (Lee et al., 2008). We chose a conservative cutoff point of 0.71 for inclusion. The final quality scores ranged from 0.71 to 0.95, demonstrating that all included studies were of good quality.
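
A minimal sketch of how such a summary score can be computed is given below (Python); treating non-applicable items by dropping them from the denominator is our reading of the QualSyst procedure rather than a rule stated here.

```python
def qualsyst_summary(item_scores):
    """Summarize QualSyst ratings: items scored 0-2, None for 'not applicable'.
    Returns the normalized mean score (0-1) and the quality label used in this review."""
    applicable = [s for s in item_scores if s is not None]
    mean_score = sum(applicable) / (2 * len(applicable))   # normalize to the 0-1 range
    if mean_score > 0.80:
        label = "strong"
    elif mean_score >= 0.71:
        label = "good"
    elif mean_score >= 0.50:
        label = "adequate"
    else:
        label = "inadequate"
    return mean_score, label

# Hypothetical ratings for the 14 criteria, with one item not applicable
scores = [2, 2, 1, 2, None, 1, 1, 2, 2, 2, 1, 1, 2, 2]
print(qualsyst_summary(scores))   # -> (0.807..., 'strong')
```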

Results

Trial Flow

A total of 1726 articles were identified from the four academic databases. None was identified in the gray literature. The first screening process eliminated 1025 duplicate articles, leaving 701 potentially relevant articles for title review. After 294 articles had been eliminated based on study titles, 407 potentially relevant articles remained for abstract review. A thorough abstract review further excluded 190 articles. The remaining 217 articles were retrieved for full-text review, resulting in an additional 169 being excluded. A final set of 48 articles were included in this systematic review. The exclusion process details and reasons for exclusion at each step are presented in Fig. 1.

Fig. 1
figure 1

Trial flow for this systematic review

Study Features

The 48 studies that met the inclusion criteria were published from 2006 to 2024 (March 5) and involved 2706 people in total. Most of the studies were quantitative (n = 43, 89.6%), whereas a few studies used mixed methods (n = 5, 10.4%), adopting both qualitative (interview) and quantitative approaches.

Areas of Interest and Participants

Of the studies, 44 out of 48 focused on stress (91.7%), while a few focused on attention (n = 12, 25%). Eight out of 48 studies focused on both (Fig. 2). Participants were aged between 6 and 58 years, and the sample size of each study ranged from 5 to 160. While there were relatively few studies with fewer than 10 participants (n = 2, 4.2%) or more than 100 participants (n = 6, 12.5%), most studies had sample sizes between 10 and 50 (n = 22, 45.8%) or between 50 and 100 (n = 18, 37.5%). By school level, the largest group of studies used HRV with undergraduate students (n = 33, 68.8%), about half of whom were health sciences students, namely, medical (n = 15), pharmacy/biotechnology (n = 1), nursing (n = 1), and physiotherapy (n = 1) students. Studies with participants at other school levels were represented in similarly small numbers: primary school students (n = 3, 6.3%), secondary school students (n = 3, 6.3%), and postgraduate students or professionals (n = 4, 8.3%). All the postgraduates or professionals were interns, residents, surgeons, or medical staff from medical fields.

Fig. 2
figure 2

Area of interest and educational settings of the included studies

Of the 48 studies, 27 (56.3%) controlled for participant factors, either excluding participants (marked as Excl: in Table S3 (Details of Reviewed Studies in Online Supplementary File 2)) (n = 30) or providing instructions (marked as Instr: in Table S3) (n = 13), or both (n = 16). Exclusions were most common for cardiovascular diseases (e.g., hypertension, heart diseases, coronary artery disease, arrhythmias, and tachycardia; n = 12), followed by mental health issues (n = 13), obesity or weight/BMI-related criteria (BMI > 25 or 30 kg/m2, BMI < 18 or 18.5 kg/m2; n = 7), and endocrine diseases including metabolic disorders (e.g., diabetes; n = 8). Criteria limiting smoking varied: Seven studies included only nonsmokers, five studies restricted participants from smoking before measuring HRV, and one study did not restrict participants from smoking due to the risk of withdrawal symptoms. Because hormonal changes can affect the cardiac ANS, four studies limited participants based on women’s menstrual cycles and only included females in the proliferative menstrual phase. Two studies recruited only males. Studies varied in participant exclusion based on medication (n = 14); some barred any regular or long-term medication users, and others excluded those on cardiovascular drugs, or required a medication-free period prior to HRV data collection. Instructions given before HRV measurement (n = 17, 35.4%) included avoiding caffeine (n = 15, 88.2% of instruction studies), alcohol (n = 11), and intense exercise (n = 4) and ensuring adequate sleep (n = 3). Eight controlled for noise, seven for temperature (e.g., 23 °C), four for illumination, and three for humidity (e.g., 45–55%). Seven studies specified the time frame for measuring HRV (e.g., between 9:00 and 12:30 pm after breakfast), and 11 studies asked participants to maintain their posture during HRV measurements (e.g., prone and supine).

HRV Data Collection and Analysis

Data Collection Context

Studies collected HRV data from varied contexts, which were classified into three categories: exam (n = 11, 22.9%), learning (n = 22, 45.8%), and experiments (n = 23, 47.9%; Fig. 2). The exam category included written exams, oral exams, and skill assessments; learning included face-to-face lectures, online lectures, clinical practice, and peer tutoring; experiments included cognitive tasks (e.g., Stroop test, mathematics, or memory), psychomotor tasks, and educational intervention (e.g., stress management program and video stressors). Stress measurements were predominantly conducted in examination (n = 11) and learning contexts (n = 22), whereas attention was primarily assessed in experimental settings (n = 12), with two of these studies also measuring attention in a learning context.

Device and Measuring Time

Among the 48 studies analyzed, 26 (54.2%) explicitly reported adherence to Malik’s (1996) guidelines. One study reported compliance with both Malik (1996) and Laborde et al. (2017). Although those remaining 22 studies are presumed to generally follow the recommended steps (e.g., baseline measurement and pre-processing) as suggested in the standard, they did not explicitly mention this compliance. Measuring devices were either wearable or static. Of the studies, 21 (43.8%) were conducted using wristband (n = 11, 22.9%) or chest belt (n = 13, 27.1%) wearable devices, and 3 studies involved the use of both. Twenty-eight studies (58.3%) used static devices, among which one study additionally employed a chest belt. The Polar series (n = 15, 31.3%) was the most used HRV device, and Kubios (n = 19, 39.6%) was the predominant HRV analysis software. Software like Kubios, MATLAB, and some from Polar offered pre-processing features, although their usage was variably reported (see Table S2 (Classification Scheme in Online Supplementary File 1) for all devices and software used in the studies).

For pre-processing, about two-fifths of the studies explicitly mentioned their engagement in this process (n = 19, 39.6%), primarily through software functions. Although the majority did not specifically report on pre-processing, it can be inferred that they implemented it, as indicated by the software used (n = 17, 48.6%). Altogether, approximately 80% of the studies are estimated to have involved pre-processing. Most studies collected HRV data for less than 30 min (n = 37, 77.1%), and seven studies collected HRV data for more than 30 min, even up to several days (14.6%). The measurement duration in nine studies depended on the completion of a designed task (18.8%). More details regarding the research designs associated with the different HRV measurement durations are provided in Table S3 (Details of Reviewed Studies in Online Supplementary File 2).

Parameters

HRV data can be analyzed using different methods: time domain analysis, frequency domain analysis, and non-linear domain analysis. Of the studies, 38 used time domain analysis, 38 used frequency domain analysis, and 11 used non-linear analysis (studies often used more than one method). The most frequently analyzed time domain parameters were RMSSD (n = 29, 60.4%), which indicates parasympathetic activation; SDNN (n = 15, 31.3%), the standard deviation of normal-to-normal R-R intervals; and pNN50 (n = 12, 25%), the percentage of successive normal-to-normal intervals differing by more than 50 ms. In the frequency domain, the most frequently analyzed parameters were HF (n = 32, 66.7%), which reflects parasympathetic activity; LF (n = 26, 54.2%), which reflects sympathetic activity; and LF/HF (n = 24, 50.0%), which reflects sympathovagal balance. The non-linear domain was analyzed relatively rarely. Eight papers investigated the SD1 and SD2 non-linear domain parameters, which mirror short-term and long-term HRV variability, respectively. LF/HF in the frequency domain and SDNN and pNN50 in the time domain were used more frequently for measuring stress than for measuring attention (see Table S2 (Classification Scheme in Online Supplementary File 1) for the complete parameter frequencies used in the studies).

Analysis and Interpretation

When interpreting HRV parameters, most of the studies evaluated the relative value—i.e., increases or decreases—in the collected data rather than considering absolute standards as used in a clinical context. Forty out of 48 examined changes from baseline values (83.3%), and ten studies (20.8%) compared HRV data between experimental and control groups. Studies considered participants’ age, gender, health status, and length of recording holistically (Xhyheri et al., 2012). None of the reviewed studies considered the normal range HRV parameter standards suggested and widely used in clinical settings.

Concurrent and Predictive Validity of HRV Measurement

Concurrently Measured Data

In all reviewed studies, HRV was not the only data collected; other psychophysiological, self-report, and performance data were also collected. Of the 48 studies, 35 (72.9%) collected other psychophysiological data, such as heart rate, cortisol levels, or blood pressure, and 35 (72.9%) collected self-report data, such as stress scale scores, affect scale scores, or cognitive load scores, which we used to evaluate the concurrent validity of the HRV data. Eighteen studies (37.5%) collected performance data, such as test scores or skill performance scores, from which we evaluated the predictive validity of the HRV data.

Regarding psychophysiological measures, many stress studies (n = 44) concurrently collected heart rate (n = 25, 56.8%) and cortisol (n = 12, 27.3%) data, as did attention studies; four collected heart rate data (33.3%) and two collected cortisol data (16.7%) along with HRV data.

Validated self-report stress scales, such as the State-Trait Anxiety Inventory (STAI; n = 10) or the Subjective Units of Distress Scale (SUDS; n = 3), were used in many stress studies, and some studies used non-validated self-report scales (n = 6, 13.6%) to complement the HRV data (n = 27, 56.3%). HRV data used to measure attention were more often paired with cognitive performance data (n = 5, 41.7%) than with self-report scale data (n = 1, 8.3%).

Among the performance data used by studies, grades (n = 4, 9.0%), skill assessment (n = 3, 6.8%), and surgical training (n = 3, 6.8%) were primarily used to measure stress, whereas only cognitive performance scores were primarily used to measure attention (n = 5, 41.7%). Studies reported other diverse psychological data, including affect, depression, personality traits, empathy, resilience, mindfulness, or burnout, as detailed in Table S2 (Classification Scheme in Online Supplementary File 1).

Correlation Between HRV and Other Data: Concurrent Validity and Predictive Validity

To assess the concurrent validity of HRV data used to measure stress or attention in an educational context, one of four correlation levels (L1, L2, L3, and L4) was coded for each HRV-variable pair (a study could contribute more than one pair). Level 1 (large effect size, r > 0.5) and level 2 (medium or small effect size) correlations were considered as one group to contrast with level 4 correlations (not correlated). For stress measurement, there were 14 L1/L2 pairs and 8 L4 pairs, implying that the concurrent validity of HRV data used to measure stress has moderate support. For attention measurement, there was one L1/L2 pair and two L4 pairs; given this notably small number of pairs, there is insufficient evidence to support the concurrent validity of HRV data for measuring attention.

The top three concurrently assessed data types were heart rate, cortisol, and validated stress scales. The L1/L2 (high correlation) group included two studies with heart rate data, three studies with cortisol data, and eight studies with stress scale data. Heart rate is known to be related to HRV (Kazmi et al., 2016; Sacha, 2014); 25 of the 48 studies co-measured participants’ heart rates, among which only two studies (8.0%) fell into the L1/L2 group. Among the 12 studies that collected cortisol data, three studies (one that used blood samples and two that used saliva samples) reported a high correlation with HRV data (RMSSD, LF, HF), but one study showed that the parameter LF/HF had no correlation with cortisol data. The remaining studies that used psychophysiological measurements corresponded to L3, implying that most were contextually consistent with HRV data.

Among the 22 studies that employed validated stress scales, 8 (36.4%) reported a high correlation between HRV data and stress scale scores. The instruments used in L1 studies were Escala de Valoración del Estado de Ánimo (Mood Assessment Scale; EVEA) and Test Anxiety Inventory (TAI). The parameters used were balanced between time domain (RMSSD, NN50, SDNN) and frequency domain (LF, HF, LF/HF) data. EVEA scores highly correlated with HRV RMSSD, LF, and HF data but did not correlate with LF/HF data. Visual Analogue Scale (VAS) scores highly correlated with HRV data in students’ passive conditions, but no correlation with HRV data was found for students’ active or interactive conditions.

In attention studies, the concurrent validity of HRV data was supported due to a high correlation between HRV data and cortisol data used to measure attention, but validity was weakened as there was no correlation between HRV data and respiration rate. The overall correlations relevant to concurrent validity are provided in Table 2.

Table 2 Concurrent validity

To determine whether HRV data can predict academic or other performance, we considered the correlation between performance data and HRV data. Among the 21 studies that measured HRV and performance data, only 4 studies fell into the L1/L2 group, with 11 studies in the L4 group. Among stress studies, this trend was more pronounced. All three studies that co-measured test scores reported no correlation between test scores and HRV data, and two out of three studies that used skill assessment data provided evidence against the predictive validity of HRV. Two studies that used PVT test scores indicated low predictive validity of HRV data for measuring attention. The overall correlation information relevant to predictive validity is provided in Table 3. The study features by areas of interest, namely, stress and attention, and their proportional composition within the set of the reviewed studies are provided in Table 4.

Table 3 Predictive validity
Table 4 Study features by areas of interest

Discussion

Given the scarcity of guidance on HRV use in education, this review explored the potential, measurement practices, and validity of HRV measures for educational research. By systematically reviewing 48 educational studies that utilized HRV data, we synthesized the literature into (Q1) areas of interest and learner groups, (Q2) methodologies for data collection and analysis, and (Q3) concurrent and predictive validity of HRV measurement for educational research. To address the challenges of using HRV measurement in educational research as a complement to self-report instruments, the findings and implications are discussed below by research question.

Measuring Stress and Attention with HRV in Educational Contexts

Most studies measured stress, which is consistent with the general clinical context (Glass, 2009; Sassi et al., 2015). In the clinical context, HRV data have been used for assessing stress at both the trait or tonic level and the state or phasic level (Malik, 1996), although clinicians tend to focus more on chronic stress from continuous stressors than on event-based stressors. In the studies analyzed, however, HRV data were predominantly utilized to represent stress at the state or phasic level, where stressors were academic events such as lectures, practice, or exam situations. Castaldo et al. (2019) reviewed 12 studies of acute mental stress measured with HRV in a clinical context, all of which involved stressors from cognitive tasks, such as arithmetic, Stroop tasks, and academic exams. Hair cortisol data, in contrast to saliva and blood cortisol, are indicative of accumulated or chronic stress and may therefore not be suitable as a co-measured variable (Huhn et al., 2018). Self-report data, which have typically been used in existing educational research, measure stress at the end of a learning event, and HRV in psychiatric clinical settings measures chronic stress. HRV in the reviewed studies captured changes in learner stress throughout educational scenarios in real time, thereby providing valuable insight into instructional design; this highlights the potential of using HRV measures in the field of education.

Attention was researched relatively less than stress. Measuring attention via HRV data in an educational context is a newer research area. Given that HRV can be used as a marker of attentional processes (Mathewson et al., 2007), HRV measures were used to assess different attentional demands (the extent to which a task is challenging and requires a high level of attention) in real time. Higher HF-HRV may be indicative of enhanced attention (e.g., Causse et al. (2011), Quintana et al. (2012), and Ramírez et al. (2015)). Among diverse psychophysiological data, EEG data are more commonly utilized to measure attention than HRV data (Schoenberg & David, 2014). However, the mobility of students during learning activities can hinder EEG measurement reliability, leading to a high degree of measurement error and making large-scale, long-term studies challenging (Xu & Zhong, 2018). The scarcity of attention studies using HRV highlights the need for further research in this area.

The studies included in this review showed that adult participants—undergraduates and professionals—were studied more frequently than children or adolescents. Specifically, stress was studied more widely, while attention was studied primarily in undergraduates. Although the therapeutic value of HRV biofeedback has been established in children, adolescents, and adults (Dormal et al., 2021), research in educational contexts with children and adolescents has yet to be adequately explored in comparison to research with adults (Goessl et al., 2017; Lehrer et al., 2020). In addition, participants in the healthcare field (medical students, residents, surgeons, or medical staff) accounted for 54.2% of the study populations. This is likely due to the nature of HRV and the fact that healthcare researchers have the necessary expertise to collect and analyze HRV data considering associated medical factors. This highlights the need for proper guidelines on the collection, analysis, and interpretation of HRV data for educational research.

Collecting and Analyzing HRV Data from Educational Contexts

The present study synthesized the methods that the 48 studies reported in order to provide educational researchers with guidelines for utilizing HRV data to measure stress and attention in educational contexts. A classic work on HRV measurement standards by the Task Force of the European Society of Cardiology and the North American Society for Pacing and Electrophysiology (Malik, 1996) originally targeted heart disease patients based on the equipment available in 1996, providing standards for collecting and processing HRV data, interpreting results, and reporting findings. This guideline has been pivotal in HRV research. In the reviewed studies, about 54.2% explicitly reported following Malik’s (1996) guidelines. One study reported following the recommendations of Laborde et al. (2017), which provide methodological considerations crucial for experimental design, data analysis, and reporting using HRV data in the context of self-regulation research. While that work signifies a step forward from clinical diagnostics toward psychological research, it acknowledges its own limitations as a prescriptive guideline. This highlights the nuanced transition HRV research is undergoing, from its roots in clinical applications to broader psychological contexts, and the ongoing need for comprehensive, directly applicable guidelines for its integration into educational research. The other studies did not mention any alternative standards but are presumed to generally follow the same recommended steps, which underscores the importance of reporting compliance with these standards.

The data collection contexts of the studies suggest that most HRV data measuring stress in an educational setting were collected during learning, with a relatively high proportion collected during exams and in experimental settings, whereas HRV data measuring attention were predominantly collected in an experimental context.

Studies controlling for participant factors in HRV research—through excluding certain individuals, providing specific instructions, or implementing both strategies—reflect the complexity of HRV as an index of the interplay between physiological, emotional, and cognitive processes. This approach, grounded in the understanding of heart-brain connections (Laborde et al., 2017; Malik, 1996), accounts for the impact of various health conditions and behaviors, like cardiovascular diseases, mental health issues, obesity, and substance intake (e.g., alcohol, caffeine, and medication), as well as hormonal influences related to female menstrual cycles. This level of detail is necessary to ensure the accuracy and reliability of HRV as a measure of these interrelated ANS balance and regulation systems.

In most studies, instructions, such as avoiding caffeine, alcohol, and intense exercise and ensuring adequate sleep, were given to participants before HRV measurement. Studies also controlled for noise, temperature, illumination, humidity, time, and participants’ posture during HRV data collection. Instructions employed in the studies imply that all these factors can affect HRV and produce artifacts or noise for HRV data. Similarly, most studies recommended that participants rest, relax, and practice guided breathing before measuring the baseline. This approach, though not strictly enforced, seems to have mitigated the significant respiratory rate influence on HRV. Most studies documented baseline measurements, and we inferred that the four without explicit reports also did so, highlighting the necessity for consistent reporting of key data measurement steps.

Recent studies over the past 3 years have shown a notable surge in the utilization of HRV data in virtual reality (VR), simulation, and online learning contexts (8 out of 12 studies, or 66.7%). Consequently, participant exclusion criteria have been expanded to include technology-related issues, such as vision or ear problems and motion sickness, which, while not directly related to HRV, affect the study conditions. This trend signifies a shift towards more personalized learning strategies enabled by technology, wherein learning outcomes can be optimized by closely monitoring and adapting to learners’ real-time needs and conditions using physiological data like HRV.

Just under half of the reviewed studies (43.8%) reported using a wearable HRV measurement device. The literature highlights several differences between wearable and static HRV devices. Static devices typically require users to remain seated or lying down during measurement and typically record HRV for a short duration, such as 5 min (Rangaswamy et al., 2016). These devices are regarded as more accurate and precise than wearable HRV devices. This aligns with our finding that a significant portion of L1/L2 studies, which reported a high correlation between HRV and other measures, used static devices. Wearable devices, particularly wristband and chest belt forms, are extensively employed. Wristbands are favored for everyday health tracking due to their user-friendliness, while chest belts, offering more accurate heart rate data, are preferred by athletes and professionals. In the spectrum of data accuracy for HRV devices, chest belts sit between wristbands and static devices, providing a balance of mobility and precision.

Although no significant differences in correlation levels of co-measured data and HRV were observed between wristband and chest belt types, Menghini et al. (2019) found that wrist-worn sensors measure HRV accurately in static conditions, like resting and guided breathing, but less so in dynamic conditions such as speaking. These findings underscore that the movement intensity of learning activities and the desired accuracy of HRV data should be weighed when determining the appropriate device. If measurement accuracy is of paramount importance, a static device may be more suitable; conversely, if ease of use or mobility is a priority, a wearable device may be more appropriate.

The HRV measurement duration varied across studies: 77.1% of studies measured HRV data for less than 30 min, with stress being measured primarily in the short term (no longer than 30 min), while attention tended to be measured in both the short and long term (Mathewson et al., 2007; Redondo et al., 2019). According to Malik (1996), the measurement duration can vary depending on the study purpose: short-term measurements (e.g., 5 min) are used in clinical settings, while long-term measurements (e.g., 24 h) are used to assess HRV changes during daily activities, and the guideline suggests the HRV metrics best suited to each. Task-dependent measures of HRV have commonly been employed, particularly when specific tasks need to be completed. This emphasizes the significance of considering the areas of inquiry and the context in which HRV is being studied. The experimental designs employed in each study are detailed in Table S3 (Details of Reviewed Studies in Online Supplementary File 2).

In long-term HRV data collection and in abnormal situations like stress, the issue of data stationarity has long been recognized (Kaczmarek et al., 2019; Malik, 1996; Spellenberg et al., 2020). Over long periods, physiological responses can change significantly, making it difficult to assume a constant pattern and to interpret frequency domain metrics such as the LF and HF components. Averaging HRV data over prolonged monitoring or during stress might also mask important fluctuations and nuances in bodily responses, leading to less accurate assessments of autonomic regulation.

To address the issue of data stationarity and collect more reliable data, pre-processing steps were emphasized and reported in the studies reviewed. For pre-processing, studies utilized various analysis software programs. The most used software was Kubios, followed by the Polar series (including the Polar Precision Performance program, Polar ProTrainer software, and Polar Flow). Other software programs used included MATLAB, Cardiovit AT-10 Plus, LabChart Pro, AcqKnowledge, QHRV stress assessment, and BEAT software. All these software programs feature pre-processing functions, which include detrending, noise/artifact detection and removal, data filtering, ectopic beat detection and classification, data quality assessment, and visual inspection. Under nonstationary conditions like stress, the Binary Symbolic Dynamics and Heart Rate Asymmetry (HRA) methods are useful additions to HRV analysis. The Binary Symbolic Dynamics method simplifies HRV into binary sequences to better identify stress-related changes (Speelenberg et al., 2020). HRA distinguishes emotional responses, with positive emotions showing more short-term deceleration (Kaczmarek et al., 2019). Although not used in the 48 studies, incorporating these methods in future research, especially in educational contexts for stress assessment, could enhance the reliability of HRV as a stress measure.
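
As a concrete illustration of the simplest of these pre-processing steps, the sketch below (Python/NumPy) drops physiologically implausible intervals and beats that jump sharply from their predecessor (a crude proxy for ectopic-beat detection); the thresholds are illustrative assumptions, and validated routines such as those built into Kubios should be preferred in practice.

```python
import numpy as np

def clean_rr(rr_ms, lower=300, upper=2000, max_rel_change=0.2):
    """Crude artifact filtering for RR intervals (ms): remove implausible interval lengths
    and beats deviating more than max_rel_change from the preceding interval.
    Dropped beats are simply discarded here; real toolkits interpolate instead."""
    rr = np.asarray(rr_ms, dtype=float)
    keep = (rr > lower) & (rr < upper)                  # implausible interval lengths
    rel_change = np.abs(np.diff(rr)) / rr[:-1]          # relative jump from the previous beat
    keep[1:] &= rel_change < max_rel_change             # sudden jumps suggest ectopic beats/artifacts
    return rr[keep]

# Hypothetical series containing two artifacts (1650 ms and 430 ms)
rr_raw = [810, 820, 1650, 805, 430, 815, 800, 825]
print(clean_rr(rr_raw))   # artifacts (and beats immediately following them) are dropped
```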

Studies employed three types of HRV analysis: time domain, frequency domain, and non-linear. Studies often used more than one type of analysis. Time domain analysis considers the time intervals between heartbeats and reports values such as the mean and standard deviation of these intervals (e.g., SDNN and RMSSD). This analysis type is simple and easy to understand, but it may not provide as much information as the other two analysis types. Frequency domain analysis considers different frequencies present in the HRV signal. This can be represented as power spectral density, which separates the signal into different frequency bands and depicts the power (or strength) of each band (e.g., HF, LF, and VLF). This analysis type can provide more detailed information regarding the HRV signal, but it can be more difficult to understand. Non-linear analysis considers the complexity and irregularity of the HRV signal. This can be implemented using techniques such as fractal analysis and entropy analysis. Non-linear analysis can provide even more detailed information about the HRV signal but can be more difficult to understand than frequency domain analysis. Stress was primarily investigated using the RMSSD and SDNN parameters in the time domain and LF, HF, and LF/HF in the frequency domain. Attention was investigated using RMSSD in the time domain and HF in the frequency domain, but the use of LF and LF/HF was relatively limited. Within the domain of analysis, some studies used all parameters, some used a subset, and some used modified versions (C_HRV). The most common combination was RMSSD, HF, and LF/HF.
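
To make the frequency domain procedure concrete, the sketch below (Python with NumPy/SciPy) resamples an RR series onto an even time grid and estimates LF and HF band powers with Welch's power spectral density; the 4 Hz resampling rate and the omission of detrending and artifact handling are simplifying assumptions.

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch

def frequency_domain_hrv(rr_ms, fs=4.0):
    """Estimate LF and HF band powers (ms^2) from RR intervals via Welch's PSD.
    A simplified sketch; dedicated toolkits add detrending and artifact correction."""
    rr = np.asarray(rr_ms, dtype=float)
    t = np.cumsum(rr) / 1000.0                          # beat times in seconds
    t_even = np.arange(t[0], t[-1], 1.0 / fs)           # evenly spaced time grid
    rr_even = interp1d(t, rr, kind="cubic")(t_even)     # resample the tachogram
    f, psd = welch(rr_even - rr_even.mean(), fs=fs, nperseg=min(256, len(rr_even)))

    def band_power(lo, hi):
        mask = (f >= lo) & (f < hi)
        return np.trapz(psd[mask], f[mask])             # integrate the PSD over the band

    lf = band_power(0.04, 0.15)
    hf = band_power(0.15, 0.40)
    return {"LF": lf, "HF": hf, "LF/HF": lf / hf if hf > 0 else float("nan")}
```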

In interpreting parameters, standard normal and abnormal HRV ranges exist in the clinical context: a daytime RMSSD value lower than 25 ± 4 ms is associated with an increased risk of cardiovascular disease (Jarczok et al., 2019). An SDNN value of less than 50 ms is interpreted as a state of poor resistance to stress, and a pNN50 value of less than 3% indicates parasympathetic dysfunction (Jarczok et al., 2019; Xhyheri et al., 2012). The LF normal range is 1170 ± 413 ms², HF is 975 ± 203 ms², and the LF/HF ratio is 1.5–2.0 (Nunan et al., 2010). SD1 and SD2 are the standard deviations of R-R interval variability calculated on a phase space (Poincaré) plot. In healthy individuals, the plotted patterns show broad and generally symmetrical oval shapes, whereas patterns in unhealthy individuals deviate from the normal range and cluster in a narrow region (Fishman et al., 2012).
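
The SD1 and SD2 indices mentioned above follow directly from the geometry of the Poincaré plot; a minimal sketch of the standard formulas is shown below (Python/NumPy).

```python
import numpy as np

def poincare_sd1_sd2(rr_ms):
    """SD1/SD2 from a Poincaré plot in which each RR interval is plotted against the next.
    SD1 captures short-term variability, SD2 longer-term variability."""
    rr = np.asarray(rr_ms, dtype=float)
    x, y = rr[:-1], rr[1:]
    sd1 = np.sqrt(np.var(y - x, ddof=1) / 2.0)   # dispersion perpendicular to the identity line
    sd2 = np.sqrt(np.var(y + x, ddof=1) / 2.0)   # dispersion along the identity line
    return sd1, sd2
```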

The interpretation of HRV parameters in educational research differs from the interpretation for clinical use, as the former focuses on monitoring HRV changes in response to different educational interventions or conditions. Thus, although there are established standards for clinical use, it is common in educational research to interpret HRV parameters according to personalized algorithms by comparing a baseline level to a level post-intervention to assess whether values increased or decreased (Lu et al., 2022; Xhyheri et al., 2012).
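
A minimal sketch of this baseline-referenced interpretation is given below (Python, with hypothetical values).

```python
def relative_change(baseline, task):
    """Percent change in an HRV parameter from a resting baseline to a task or intervention period,
    the within-person comparison most reviewed studies relied on instead of clinical normal ranges."""
    return 100.0 * (task - baseline) / baseline

# Hypothetical example: RMSSD (ms) drops during an exam relative to a resting baseline
print(relative_change(baseline=42.0, task=28.0))   # ≈ -33.3, i.e., reduced parasympathetic activity
```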

The Concurrent and Predictive Validity of HRV Measures for Educational Research

HRV data seemed to have an acceptable level of concurrent validity as a measure of stress in educational contexts due to its strong correlation with other measures of stress, such as heart rate, cortisol levels, and validated stress scale scores. Among the 39 pairs of data that were analyzed, 14 pairs were associated with high correlations, while 8 pairs indicated no correlation. The concurrent validity of HRV data as a measure of attention in an educational context is less clear: for attention, only one pair was highly correlated, while two pairs indicated no correlation. This suggests that a correlation between HRV data and measures of attention cannot yet be confirmed.

Notably, among the studies that collected heart rate data, only 4.2% fell into the group with high correlations with HRV data, which is inconsistent with previous research (Kazmi et al., 2016; Sacha, 2014) showing that heart rate is strongly related to HRV. Additionally, among the 12 studies that collected cortisol data, only three (one blood sample and two saliva samples) reported a high correlation with HRV data (RMSSD, LF, HF), while another study showed that the parameter LF/HF had no correlation with cortisol data. Among stress studies, 36.4% of those that employed validated stress scales (8 of 22) reported a high correlation between HRV data and stress scale scores. The instruments used in L1 studies were the TAI and EVEA, and the parameters used were time domain (RMSSD, NN50, SDNN) and frequency domain (LF, HF, LF/HF) data. We also found that EVEA scores highly correlated with HRV RMSSD, LF, and HF data but were not correlated with LF/HF. Overall, the correlations between HRV and other measures, such as heart rate, cortisol levels, and stress scale scores, were not always present and varied depending on the areas of interest, the types of scales or sub-scales used, and the parameters considered.

This study also investigated the extent to which HRV data can predict certain learning outcomes. Our findings suggest a limited correlation between performance and stress measured by HRV data. Of the studies included in the analysis that measured HRV data along with performance data, only two found a correlation between the two variables. Furthermore, most studies (80%) that measured test scores reported no correlation between test scores and HRV data, and two out of three studies using skill assessment data also reported no correlation. Additionally, two studies using PVT (psychomotor vigilance task) found low predictive validity for HRV data measuring attention. Previous research also suggests that stress can negatively impact cognitive performance. For example, studies have found that stress can impair working memory and attention, increasing the likelihood of errors or accidents (Lupien et al., 2009; McEwen & Gianaros, 2011). HRV data is useful for providing a more comprehensive understanding of learners’ psychological, physiological, and cognitive states, but more research is needed to confirm its validity for measuring attention and predicting learning outcomes.

Table 5 consolidates the discussions so far, synthesizing recommendations for educational research using HRV data and covering eleven key domains. The 1996 guidelines for HRV research on cardiac patients did not include certain elements that are newly added in these guidelines, which are tailored for educational research. These new additions are (1) areas of measurement, (7) co-measured data, and (10) concurrent and predictive validity; due to the focus on normal learners rather than cardiac patients, (2) participant control, (4) data collection condition, and specialized (9) interpretation have been incorporated. Technological advancements available nearly 30 years later have led to updates in (3) HRV measurement devices and (8) HRV software and pre-processing, while (5) measurement duration and (6) parameters build upon the existing guidelines with new techniques. The guidelines under (11) standardized reporting delineate reporting items to ensure that all the aforementioned aspects are adhered to and duly reported.

Table 5 Synthesized recommendations for educational research using HRV data

A limitation of the findings is that it is not yet clear whether HRV data can be used to accurately measure attention in an educational context. Additionally, our review suggests that there may be only a limited correlation between stress and performance as measured by HRV data, indicating that more research is needed to confirm the predictive validity of HRV data regarding learning outcomes. The review also found that correlations between HRV data and other measures, such as heart rate, cortisol levels, and stress scale scores, are not always present and can vary depending on the areas of interest, the kinds of scales or sub-scales employed, and the parameters used. This means that more research is needed to understand the relationship between HRV and these other measures and to determine the best ways to use HRV data in educational research. In addition, in this study we considered stress and attention as key psychological and cognitive phenomena in learners, focusing our review on research that included either or both of these terms. Studies that used only terms such as anxiety, arousal, or cognitive load without mentioning stress or attention may therefore have been excluded. We suggest that future research conduct a more comprehensive analysis of HRV data usage across a broader spectrum of learner characteristics.

Although HRV data can capture changes in learner stress and/or attention in real time, providing valuable insight into instructional design, it is not yet entirely clear whether HRV can serve as a valid measure of stress and/or attention in an educational context. Future research should continue to use HRV data in educational contexts to investigate its potential and to develop evidence-based guidelines for the proper collection, utilization, and interpretation of HRV data in educational research, thereby ensuring its validity. In addition, the increased utilization of HRV data in digital learning research, as shown in recent studies, suggests the need for future investigations to understand learners’ cognitive and emotional states in technology-enhanced learning (TEL) environments more precisely and to link these insights to educational interventions.

Conclusion

This study examined the potential and validity of utilizing HRV data for educational research by conducting a systematic review of 48 studies that employed HRV measurement in educational contexts. Our findings implied that HRV data can provide valuable real-time insight into instructional design by capturing changes in learner stress during educational activities. The study also highlights the need for guidelines on the proper use of HRV in educational research and calls for further detailed investigations to be conducted. HRV data had a moderate level of concurrent validity as a measure of stress in an educational context. The concurrent validity of HRV data as a measure of attention in an educational context was less clear. There were limited correlations between stress and performance as measured by HRV data. More research is needed to confirm HRV data’s validity for measuring stress and attention and for predicting learning outcomes.

This study significantly extends the classic standards of utilizing HRV data, initially based on clinical contexts with heart disease patients, by synthesizing recommendations for educational research among healthy learners. It integrates the use of modern devices and software previously unavailable, providing guidelines for collecting, processing, analyzing, and interpreting HRV data. Furthermore, the study delves into the interplay of HRV with psychological and physiological aspects, such as stress and attention, thereby proposing fresh HRV research avenues and expanding its scope beyond conventional heart health metrics.

It is of great value to incorporate more objective data into educational research to support evidence-based educational decisions. However, it is important not to over-rely on a data-driven approach, but rather to supplement quantitative data with in-depth qualitative data to provide a more comprehensive understanding of educational phenomena. We hope that this study will contribute to promoting a multifaceted approach that utilizes multi-modal data in educational research.