Introduction

The use of mobile health technologies, including wearable sensors, in sleep research has increased substantially over the last several years. The PubMed library contained 3 articles containing the terms “sleep” and “wearable” in 2012; this had increased to 1262 articles by the end of 2022, suggesting that sleep researchers increasingly recognize the potential of harnessing wearable devices in their investigations of sleep.

Many of the benefits of using commercially available wearable devices in research studies are clearly apparent—the technology is of relatively low cost, readily available, and continues to improve in accuracy, convenience, and impact on care [1]. Furthermore, wearables passively collect data without the need for substantial levels of participant interaction, and without the need to travel to a sleep lab for polysomnography (PSG). One of the greatest potential benefits is in the ability of wearables to evaluate multiple nights (sometimes years) of sleep, thereby providing longitudinal data that may be more reflective of a participant’s normal sleep patterns compared to what can generally be obtained with more traditional methods. Longitudinal assessments, especially when gathered from large, diverse populations, can help us better understand how sleep variability and different sleep patterns might impact human health outcomes. Wearables also provide value to participants since they can return personalized health data in a user-friendly, real-time way via data visualizations, which may also help with engagement and retention in sleep-related research. Furthermore, wearables capture other data (e.g., steps, exercise) that might affect sleep, which cannot be done using PSG.

Given the emerging possibilities of utilizing wearables in sleep research, we aim to provide a comprehensive review regarding the use of wearable devices in population-based studies that focus on sleep, including potential challenges for researchers as well as future directions in the arena.

Where the Field has Been: Traditional Assessments of Sleep

Historically, the only assessment of sleep done on a population scale has been survey assessment using subjective recall. For example, the largest database of population sleep in the USA comes from the Behavioral Risk Factor Surveillance System (BRFSS), in which the Centers for Disease Control and Prevention (CDC) queries randomly chosen individuals over the phone regarding how much interviewees think they sleep in a 24-h period; this database has often been tapped as a resource for sleep researchers [2,3,4]. The Nurses’ Health Study [5], with over 70,000 participants, was another influential, population-level study that utilized surveys for the assessment of sleep. However, survey-based data are fraught with limitations, including the use of subjective recall (instead of objective measurement), the use of ordinal numbers (e.g., 5 vs 6 h of sleep per night), uncertainty as to whether this number reflects hours slept last night or an average over many nights and whether it includes naps or non-nighttime sleep, and the cross-sectional nature of the query as opposed to longitudinally tracked sleep [6].

On the opposite end of the spectrum, the most in-depth assessment of sleep comes from PSG, which remains the gold standard for sleep assessment and for the diagnosis of many sleep disorders, and is currently the most accurate measure of sleep duration and architecture. However, major limitations remain for PSG when considering population-level studies. PSG must occur in a sleep laboratory, outside of the natural environment of the participant, thereby limiting the number of participants in a research study. This also limits participant selection based on where a PSG laboratory might be available, and cost is an issue both in obtaining data and in analysis. Furthermore, PSG does not allow for the collection of longitudinal sleep information without repeated returns to a sleep laboratory (and if it did, it would likely alter the timing and duration of the sleep it purports to measure).

Actigraphy, which has been used in sleep research studies since the late 1970s, is a non-invasive method of measuring gross motor activity, and has been used to monitor human rest/activity cycles. Actigraphs are generally wrist-worn devices that rely on accelerometry and have been used successfully in many sleep-related research studies to estimate sleep duration (though not sleep architecture) [7, 8]. These devices use (lack of) body movements (usually from the wrist/hand) to infer periods of sleep. While actigraphy cannot provide sleep staging data and is certainly not the gold standard for sleep measurements, it provides reasonable estimates of sleep duration, wake after sleep onset (WASO), sleep latency (when paired with a sleep diary), and sleep efficiency. Notably, actigraphy has been used in many different populations, including those who are ambulatory, chronically ill, and even those who are hospitalized [7, 9,10,11,12,13]. While some software can provide automatic scoring of the data, actigraphy is generally manually scored with an experienced scorer determining potential “rest intervals” of when a person is thought to be trying to sleep, and then an algorithm applied to these intervals to determine sleep vs. wake. The Cole-Kripke algorithm is an example of how some actigraphy programs determine sleep vs. not sleep and takes into account activity during a given epoch as well as activity in the surrounding time periods [14]. Actigraphy has been used successfully in moderate to very large studies of sleep duration for many years. The Study of Osteoporotic Fractures (SOF) enrolled almost 3000 women in the 1980s to examine what factors, including actigraphically measured sleep and activity, contributed to the development of osteoporosis [15]. Similarly, the Osteoporotic Fractures in Men Study (MrOS) used actigraphy in a group of almost 6000 men in the early 2000s, from which multiple sleep-related insights were gathered [16, 17]. 
The Hispanic Community Health Study/Study of Latinos (HCHS/SOL) study has used actigraphy in their ancillary studies, such as the Sueño Sleep Ancillary Study, and Study of Latinos – Investigation of Neurocognitive Aging (SOL-INCA) [18, 19], both of which have provided insights into the sleep and sleep-related health outcomes in a large Latino population. Furthermore, the UK Biobank study utilized actigraphy in over 100,000 individuals, where participants were asked to wear an actigraphy device on their wrist for seven consecutive days. These data have already led to novel sleep-related observations such as insight into possible sleep phenotypes, despite the modest duration of recording (only 7 days) and lack of use of PSG [20, 21]. Such data, when combined with genetic information, helped confirm—and also extend—knowledge of genes important for sleep timing and duration.
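The epoch-by-epoch scoring described above, in which an epoch is classified using its own activity count together with the counts in surrounding epochs, can be sketched as a weighted moving window. The following is a minimal illustration only: the weights, the threshold, and the function name `score_sleep_wake` are hypothetical placeholders chosen for readability, and are not the published Cole-Kripke coefficients or the implementation used by any actigraphy software.

```python
from typing import List, Sequence

# Illustrative weights for epochs t-4 ... t+2. These are placeholders,
# NOT the published Cole-Kripke coefficients.
ILLUSTRATIVE_WEIGHTS = (0.05, 0.05, 0.10, 0.15, 0.40, 0.15, 0.10)
ILLUSTRATIVE_THRESHOLD = 10.0  # hypothetical cutoff in activity counts


def score_sleep_wake(counts: Sequence[float],
                     weights: Sequence[float] = ILLUSTRATIVE_WEIGHTS,
                     threshold: float = ILLUSTRATIVE_THRESHOLD,
                     lookback: int = 4) -> List[str]:
    """Label each epoch 'sleep' or 'wake' from a weighted sum of its own
    activity count and the counts of neighbouring epochs."""
    labels = []
    n = len(counts)
    for t in range(n):
        weighted = 0.0
        for k, w in enumerate(weights):
            idx = t - lookback + k
            if 0 <= idx < n:  # epochs outside the recording contribute zero
                weighted += w * counts[idx]
        labels.append("sleep" if weighted < threshold else "wake")
    return labels
```

In practice such scoring would be applied only within the scorer-defined rest intervals, and the coefficients would come from the validated algorithm for the specific device and epoch length.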

Where the Field is Now: Wearables, Smartphones, and What They Measure

Wearables

Wearables are like actigraphy in many ways: they use a similar underlying technology (3-axis accelerometry, which captures movement along the x-, y-, and z-axes), they infer sleep from a lack of movement, and they are able to estimate similar sleep metrics as actigraphy, likely through algorithms similar to those described above (though the algorithms and analyses from these devices are proprietary and not publicly available). Many devices now also utilize green-light photoplethysmography (PPG) or infrared sensing to determine heart rate, heart rate variability, and pulse oximetry, and can also now estimate respiratory rate (so-called multi-sensor wearables). Most commercially purchased devices are worn on the wrist or finger, with fewer worn on the chest or hip. The technology is also often connected to smartphones and other devices. The expectation is that wearables provide at least similar data to actigraphy, with claims or hopes that these additional inputs can help approach PSG-level accuracy.

In terms of accuracy in healthy populations, some studies suggest that wearables over-estimate sleep duration compared to PSG on the order of only 8 min (Fig. 1) [22,23,24,25,26,27,28]. Furthermore, recent studies have shown many newer devices performed as well as or better than actigraphy on sleep vs. wake measures in children to adults, though not as accurate as PSG [24, 29]. Given the reliance on movement (or lack thereof) for estimation of sleep, wearables have variable accuracy in being able to assess the time of sleep onset. The inclusion of heart rate, heart rate variability, and respiratory rate in device algorithms has improved the assessment of time to sleep onset, but not all devices carry these capabilities [30]. Circadian rhythm measurement is often of interest in sleep-related research and is easily measured with actigraphy. To our knowledge, no population-level studies have been conducted with wearables to evaluate circadian rhythms. However, some recent studies have shown promise in being able to extract circadian activity rhythm data using multiple parameters from the devices, and sometimes in conjunction with additional smartphone data [31,32,33].

Fig. 1

Sleep metric agreement between Fitbit Charge 2 and PSG. Originally published in de Zambotti et al. [82]. Image and figure explanation reproduced with permission from Taylor and Francis. Bland–Altman plots for total sleep time (TST), sleep onset latency (SOL), time N1 + N2 (“light”) sleep, and time in N3 (“deep”) sleep. PSG-Fitbit Charge 2™ discrepancies for sleep outcomes (y-axis) are plotted as a function of the PSG outcomes (x-axis) for each individual. Circles represent individuals in the main group and triangles represent PLMS (Periodic Limb Movements) individuals. Biases are marked; the dotted lines refer to the upper and lower Bland–Altman plot agreement limits. Biases, and upper and lower agreement limits of the biases, are displayed for the main group (n = 35) only for clarity in the graphical representation
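For readers less familiar with Bland–Altman analysis, the bias and agreement limits shown in such plots can be computed from paired measurements as in the following minimal sketch. It assumes the conventional limits of bias ± 1.96 standard deviations and a PSG-minus-device sign convention; both are assumptions for illustration rather than details confirmed from the cited study, and the function name `bland_altman` is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(psg: Sequence[float],
                 device: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired measurements.
    Discrepancies are computed as PSG minus device (assumed convention)."""
    if len(psg) != len(device) or len(psg) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [p - d for p, d in zip(psg, device)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample SD; 1.96 gives approximate 95% limits
    return bias, bias - spread, bias + spread
```

Under this sign convention, a negative bias for total sleep time would indicate that the device over-estimates sleep relative to PSG.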

The measurement of sleep staging has gained momentum in terms of the technology and algorithms used, but further improvements are likely needed prior to comfortably relying on staging data [34, 35]. Most sleep architecture information from wearables is gained from accelerometry but also heart rate variability, which changes when transitioning from one sleep stage to the next [27]. For example, in de Zambotti et al.’s validation study of the Fitbit Charge 2 (Fig. 1), they found 0.81 sensitivity in detecting light sleep (N1 + N2 stages), 0.49 sensitivity in detecting deep sleep (N3), and 0.74 sensitivity in detecting rapid-eye-movement (REM) sleep [36]. While these estimations certainly show promise, they do not yet provide PSG-level sleep architecture information. Indeed, the de Zambotti et al. study utilized slightly older technology, and wearables continue to improve in terms of accuracy.
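Per-stage sensitivity of the kind reported above can be illustrated with an epoch-by-epoch comparison between a reference hypnogram and device output. This is a minimal sketch: the stage labels and the function name `stage_sensitivity` are hypothetical, and it assumes the two recordings are already aligned to the same epochs.

```python
from collections import Counter
from typing import Dict, Sequence


def stage_sensitivity(reference: Sequence[str],
                      predicted: Sequence[str]) -> Dict[str, float]:
    """For each reference stage, return the fraction of its epochs that the
    device assigned to the same stage."""
    if len(reference) != len(predicted):
        raise ValueError("recordings must have the same number of epochs")
    totals = Counter(reference)
    hits = Counter(r for r, p in zip(reference, predicted) if r == p)
    return {stage: hits[stage] / totals[stage] for stage in totals}
```

For example, a result of 0.5 for "deep" would mean that half of the reference N3 epochs were also labeled deep sleep by the device.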

Smartphones

Smartphones, which have become almost ubiquitous in the USA, have the potential to passively monitor a person’s daily activities, including sleep-related measures. These devices allow for data collection from a person’s so-called digital footprint [37], which includes when they start/stop using their phones over the day, peak times of phone use, and typing speed. Others have harnessed these data (most using apps downloaded to the phone, or GPS mobility patterns) to track sleep/wake timing, sleep duration, and circadian-related patterns, with good success [38,39,40,41,42]. Ceolini and Ghosh recently used 2 years of data from 401 participants to analyze over 300 million smartphone touchscreen interactions in an effort to determine the presence of multi-day rhythms across the cohort [43], which underscores the potential for passively collected digital data in providing new insights into sleep and circadian-related patterns in human health.
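As a simple illustration of how start/stop times of phone use could serve as a proxy for sleep timing, the sketch below finds the longest gap between consecutive touchscreen events. This is a deliberately crude heuristic assumed here for illustration; it is not the method used in the cited studies, and the function name `longest_inactivity_gap` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def longest_inactivity_gap(
    events: Sequence[datetime],
) -> Optional[Tuple[datetime, datetime, timedelta]]:
    """Return (start, end, duration) of the longest gap between consecutive
    touchscreen events, a crude proxy for the main sleep window."""
    ordered = sorted(events)
    if len(ordered) < 2:
        return None  # not enough events to define a gap
    start, end = max(zip(ordered, ordered[1:]),
                     key=lambda pair: pair[1] - pair[0])
    return start, end, end - start
```

A real analysis would need to handle phones left idle during waking hours, multiple devices per person, and time zone changes, none of which this sketch addresses.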

Considerations When Using Wearables in Sleep Research

Intellectual Property, Proprietary Sleep Algorithms, and Future Open Framework Efforts

The data obtained from wearables in a research study do not necessarily “belong” to the researcher but remain with the company providing the device. Device companies vary in how much data they share, in what format, and in which data sharing platforms and apps they can connect to. Furthermore, the algorithms that underlie sleep detection are proprietary, and thus the granular level data, such as that obtained with actigraphy, are often unavailable to the researcher. This leaves a reliance on the company-provided algorithm, with no ability for the researcher to make their own assessments of sleep vs. wake or to make further manipulations of the data. While Fitbit has recently allowed for an extended level of granularity for approved research requests, this drawback is certainly a limitation of the use of wearables for sleep research. Additionally, questions remain regarding if and how the different algorithms for different devices might impact the analysis of collected results; in our experience, sleep data provided through company algorithms (e.g., Apple vs. Fitbit) have variations that can impact results, even for basic sleep measurements such as sleep duration. Given the proprietary nature of the company algorithms, the development of open frameworks to standardize data analysis across devices has been difficult, though progress towards this goal is being made. For example, others with expertise in validating consumer devices for sleep research are working to standardize steps towards testing consumer sleep technology performance against PSG and have published open-source code for this purpose [44]. Perez-Pozuelo et al. recently published a device-agnostic heart rate–based algorithm applicable across multiple devices to predict sleep metrics with good results; their code library is open-source and thus available for others to use [26].

Data Management

Collection and aggregation of the sleep data must be considered. As wearables have become more prevalent for research, companies have made participant data (from participants who have consented to share their data) more accessible via application programming interfaces (API) and access to cloud-based storage of data. Third-party platforms are also sometimes available to help collect and aggregate data for research studies, depending on the device [45]. Additionally, some large digital studies, such as the DETECT study [46], collected data from different types of devices, and this aggregation from different platforms may be done increasingly in the future.

Device-related Issues

Data storage on the devices themselves is limited, and devices must be synced regularly to participants’ smartphones (and therefore the cloud-based applications that store individual data). Thus, participants generally need to understand this limitation and must have the knowledge or available assistance in ensuring their phones and devices are set to sync appropriately. Data loss can occur if syncing does not occur regularly; for example, Fitbit stores high-resolution data on the device itself for 5–7 days. If a device is not synced for more than 5 days, older high-resolution data are removed permanently and deemed unrestorable. Furthermore, firmware updates are often offered by device manufacturers, but variably installed by the participant. This could impact how many individuals are using the most optimized version of software, potentially impacting the quality of data collection. If studies are on a smaller scale, these issues are more readily dealt with using direct communication with participants and allowing for frequent check-ins to ensure devices are syncing and updated appropriately. However, depending on the geographic extent and volume of participants, these issues could be insurmountable and impact the reliability of the data collected.
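In studies with many remote participants, a routine check for overdue syncs could help flag those at risk of losing high-resolution data. The following is a minimal sketch under stated assumptions: the 5-day default is taken from the conservative end of the retention window mentioned above, while the data structure (a mapping of participant ID to sync timestamps) and the function name `at_risk_of_data_loss` are hypothetical and do not reflect any vendor's API.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Sequence

RETENTION = timedelta(days=5)  # conservative end of the 5–7 day window


def at_risk_of_data_loss(sync_times: Dict[str, Sequence[datetime]],
                         now: datetime,
                         retention: timedelta = RETENTION) -> List[str]:
    """Return participant IDs whose most recent sync is older than the
    retention window, or who have never synced."""
    flagged = []
    for participant, times in sync_times.items():
        if not times or now - max(times) > retention:
            flagged.append(participant)
    return sorted(flagged)
```

Study staff could run such a check daily and contact flagged participants before older data are overwritten.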

Participant Habits

Participants vary in how they use wearable devices, and many questions remain surrounding sleep specifically. Some individuals are consistent users (Fig. 2), while others may use their devices on specific days or only when they are interested in tracking their data. Not all people wear their devices overnight, particularly given that some devices require charging overnight (though some devices have up to a 5- to 10-day battery life, Table 1), which could preclude a consistent assessment of sleep. This is an important consideration, especially when compared to actigraphy where the battery life can often last several weeks. Additionally, it can be difficult to track short naps, as some devices do not reliably capture naps shorter than 1 h. Furthermore, if an individual is consistently wearing their device to sleep and is actively tracking their sleep, it is possible the individual is concerned about their sleep habits (as opposed to an individual who habitually takes off their device at night and is not concerned enough about their sleep to track or investigate it), which could bias representative samples in sleep research.

Fig. 2

Taken from internal data from a commercial wearable device

Considering wear time in population-based studies. Sensitivity analysis of the variation in the number of individuals based on device wear time and days of data available, suggesting that participant habits impact data collection, i.e., fewer participants wear devices around the clock, which is important when considering total sleep time including naps during the day.
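A sensitivity analysis of this kind can be sketched by counting how many participants remain as the wear-time and days-of-data criteria are tightened. The example below is illustrative only: the input structure (hours of wear per day for each participant) and the function names are hypothetical, and it is not the analysis code behind the figure.

```python
from typing import Dict, Sequence, Tuple


def participants_meeting(daily_wear_hours: Dict[str, Sequence[float]],
                         min_hours: float,
                         min_days: int) -> int:
    """Count participants with at least `min_days` days on which the device
    was worn for at least `min_hours` hours."""
    return sum(
        1
        for hours in daily_wear_hours.values()
        if sum(h >= min_hours for h in hours) >= min_days
    )


def sensitivity_grid(daily_wear_hours: Dict[str, Sequence[float]],
                     hour_cutoffs: Sequence[float],
                     day_cutoffs: Sequence[int]) -> Dict[Tuple[float, int], int]:
    """Cohort size for every combination of wear-time and day cutoffs."""
    return {
        (h, d): participants_meeting(daily_wear_hours, h, d)
        for h in hour_cutoffs
        for d in day_cutoffs
    }
```

Tabulating the grid before fixing inclusion criteria makes explicit how much of the cohort is lost at each threshold.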

Table 1 Common and upcoming wearable devices in sleep-related research

Validation in the Population Studied

More investigation is needed regarding the use of wearable trackers for sleep research in healthy vs. unhealthy populations, though it is unlikely to be much different than the variation found in actigraphy measurements. Most of the available validation studies of wearables have been conducted in a younger, ambulatory population. Despite this, many studies are utilizing wearable devices in populations with one or more comorbidities. For example, one recent systematic review investigated the use of wearable sensors for the monitoring of diabetes-related parameters in studies published between 2010 and 2020 and found that out of 26 studies, accelerometer-based devices with PPG were the most common with the intent of measuring activity and heart rate, with no specific validation data used [47]. Notably, de Zambotti et al. recently called for a move towards a standardized performance evaluation approach to new technologies in sleep health measurements as opposed to the traditionally termed “validation” studies [48]. This is an effort to address the multiple new devices (and analytics algorithms), many with increased technologic capabilities, that are being used for sleep research in a variety of different populations, most without access to the raw data and proprietary algorithms that would allow for a scientific evaluation.

Data Privacy

Data privacy issues are not unique to digital sleep health studies but apply to any study where participants are asked to share wearable and other health data electronically. While the number of individuals who use wearables has increased substantially, there are increasing concerns regarding data privacy [49]. Wearables collect an enormous amount of data: not only physical activity and sleep metrics, but also personal information about the participants, including zip code, GPS location, and time stamps. In addition to the security of the databases themselves that store large datasets, there are also concerns regarding the exposures participants may face when linking their devices to cloud-based storage systems. Each layer of sharing has the potential to expose the participant to additional risks, with potential for reidentification remaining a large concern. A recent systematic review of 72 studies where participants shared biometric data of some kind showed high risk to the participants of reidentification, with sometimes as little as 1–300 s of data needed to identify an individual participant [50]. On an individual level, participants may also be concerned about which aspects of their data will be shared and how the information will be used. In one study, non-English speaking participants expressed concern about the term “tracker” in “activity tracker” and how this would be applied to them [51].

Bias Within Samples and Diversity-Related Issues in Wearables

Bias within samples from consumer wearables is a major consideration for researchers, and relates to increasing the diversity of participant populations, which remains important for the advancement and strengthening of clinical research, including in sleep research [52, 53].

First, in regard to selection bias and representative samples, equity is certainly lacking in access to digital resources as well as the specific demographic groups that regularly purchase and use activity trackers. Even with growing adoption of broadband internet and smartphones among rural and lower-income Americans, there is still a digital divide in the use of internet-connected devices including mobile health technologies [54,55,56]. More studies are needed, but Pew research and data reported by Chandrasekaran et al. suggest that about 20–30% of Americans use a smartwatch or wearable fitness tracker, with women reporting higher usage than men [57, 58]. Individuals earning ≥ $75,000 annually were found to be more than twice as likely to wear a fitness tracker compared to those earning less; college graduates, white adults, and those who enjoyed exercise were more likely to wear a device as well [57]. Furthermore, more adults under the age of 50 use wearables compared to older adults, and suburban populations own more wearables than their urban and rural counterparts [58]. In general, those who buy and wear these devices are possibly more health conscious and may possess a greater understanding of health-related information. These consumer patterns naturally lend themselves to selection bias, particularly when examining secondary data from digital devices. Unless researchers are providing devices to their intended population or filtering participants in some capacity, a degree of selection bias should be expected. However, these studies are no less important in adding to the literature given the potential to reach large populations.

Another source of potential bias in how wearables are used comes from the underlying technology itself. Some studies have reported that wearable devices, and the technology that underlies even clinical-grade pulse oximetry, have lower accuracy among users with darker skin tones, specifically for heart rate and oxygen sensing, with a recent class action lawsuit claiming racial bias against a high-profile wearable technology company [59,60,61].

All of the sources of bias discussed above have an impact on how diverse participants are included in sleep-related research. While these are all very important considerations, there may also be a flip side for digital devices in that there is also a potential to help bridge important gaps in research-related diversity issues. Studies in underserved and underrepresented communities show that while cost and knowledge of fitness trackers were barriers to having a device, the majority of individuals report they would use a device or mobile health application, and would be willing to share data for research purposes [51, 62]. Given that digital devices are relatively inexpensive, they may be more cost-effective to deploy to underserved communities compared to the more expensive actigraphy devices. Furthermore, in a bring-your-own-device (BYOD) approach (discussed below), enrolling participants who already have a device may free up funding and other resources to provide devices for those who are unable to afford them. Additionally, digitally deployed studies, where recruitment is decentralized (also discussed below), could have a further reach in recruiting diverse populations than traditional in-person studies [63]. Finally, wearables can feed health (and sleep) data back to participants, particularly in the context of digital research studies, where survey results, interpretations, and general recommendations can also be readily returned to participants.

When considering how wearables impact diversity, two opposing arguments could be made, as outlined below.

Bring-Your-Own-Device Approach

While investigators may be able to provide devices for hundreds or even thousands of individuals, it would become a much larger, and likely unsustainable, undertaking to provide devices for the tens to hundreds of thousands of individuals that often comprise population-level studies. Instead, given that up to 30% of individuals in the USA (equating to roughly 100 million people) currently use a wearable device, a bring-your-own-device (BYOD) approach to harness data from devices that individuals are already using may be the most effective [57]. For example, we were previously able to obtain sleep data from a large population (over 150,000 individuals) of Fitbit users and assess how sleep duration and variability correlated with self-reported BMI in the cohort [64]. This study was feasible only because these individuals already owned activity trackers and used them consistently prior to accessing the data. The DETECT study utilized a prospective BYOD approach based on data from wearables (including sleep information) in an effort to track COVID-19 infections [65], monitor long-COVID, and study vaccine reactogenicity [66, 67]. This study has enrolled over 40,000 participants to date [46]. Similarly, the TemPredict study analyzed biometric data from over 60,000 individuals wearing an OuraRing in an effort to detect COVID-19 [68]. While the latter two studies arose from a need (and likely public desire) to help fight COVID-19, they demonstrate how effective digital recruiting can be on a large scale. Furthermore, the BYOD design also allows for allocation of research funding and resources to provide devices to those who do not already have one or who cannot afford one, thus combatting some of the diversity issues described above.

Where we Might be Headed: Decentralized Digital Studies

As intimated above, most studies utilizing activity trackers are site-based, small, and observational, even though the potential for large population studies is immense. The current paradigm of in-person sleep research where one institution (or multiple institutions in multi-site trials) recruits hundreds to thousands of participants may pose limits on how data from wearable devices are retrieved, processed, and aggregated. Furthermore, recruitment for site-based research has geographic constraints since most study participants reside near study sites for in-person visits, which are typically located near urban academic medical centers. This selection bias results in homogenous study cohorts that do not fully represent the diversity of the real-world patient population, thus creating evidence that only applies to a limited group of patients and further propagates health disparities [69]. The decentralized study approach can leverage direct-to-participant engagement using digital recruitment methods through social media networks, apps, and other online channels without geographic restrictions [70, 71]. Sleep research may be able to move towards the same idea on a large scale with the utilization of wearables for fully remote sleep tracking [72]. With the use of third-party platforms and/or smartphone applications, data collection from wearables among large study populations is increasingly feasible and more easily managed, in large part due to the COVID-19 pandemic [73, 74]. Furthermore, without the need for research coordinators to conduct in-person screening, recruitment, and enrollment, the lack of manpower that sometimes limits studies would be a smaller hurdle in the conduct of large investigations.
While the UK Biobank study recruited 500,000 participants (with ~ 100 K subsequently invited to wear accelerometer devices) over 22 sites [75], and the All of Us Research Program (AOURP) in the USA has recruited over 600,000 individuals to date (with participants offered the BYOD option to share sleep data) via digital and in-person recruitment, these studies had far more resources than most investigators have access to [76]. One of the main disadvantages of decentralizing studies is likely related to reliable data collection and ensuring that participants are completing any questionnaires/surveys and reliably linking their devices to the study—it may be unrealistic to expect the same level of engagement in a digital study that one would find in an in-person study.

An interesting aspect to consider in decentralized studies is the ability to bring in electronic health record (EHR) data, which has the potential to make data sets more robust as it provides background health information, as well as possibly vital signs and laboratory measurements [77]. Having access to these aspects of an individual’s healthcare can help enhance sleep-related research, and can also help us understand how sleep impacts important healthcare outcomes. Our approach to date is to have participants link their EHR data into digital studies when possible; however, there are major networks currently aggregating EHR data for the investigation of research questions. For example, PCORnet studies have harnessed large-volume, real-world data through different Clinical Research Networks that contain patient data from multiple institutions all the way from major healthcare centers to community health clinics. Kaiser Permanente has been aggregating patient data for research purposes for decades [78, 79], and has recently aimed to have a biobank of over 400,000 patients that includes EHR and some biomarker data. The Veteran’s Administration (VA) also has an EHR database that can be mined for health research. While some sleep-related diagnoses (e.g., insomnia, OSA) may be extractable from such datasets, a next step may be to have individuals link their wearable device data into the EHR (and thus these networks) to help networks link patient health data to daily digital footprints that include sleep data.

Future Considerations

Until recently, applications for wearable devices have revolved around the consumer, with a goal of increasing users in the wearable market by the owner company. Now, however, some devices are becoming cleared for specific medical use purposes (for example, Fitbit and Apple Watch technology is FDA cleared for atrial fibrillation detection), and clinical purposes that involve sleep disorders may be developed in the future, with the intent of diagnosis, monitoring, or even treatment. Given the proliferation of commercial wearables, we expect that the presence of these devices in the research space for discoveries related to sleep will only increase. Researchers may need to become more comfortable with the limitations of the data collection from these devices (e.g., not having PSG or granular actigraphy data available) and also recognize that differences in algorithms between devices can impact research findings. As the consumer space evolves, researchers may also need to adapt to different wearables. It also remains to be seen how popular wearables remain in the consumer setting, which could impact how these devices are utilized in a sleep research context.

At the current juncture, wearables show promise in becoming a core component of the next wave of sleep-related research studies. One of the next stepping stones would be increasing the number of interventional studies, as most of the population-level studies completed using wearables to date have been observational in nature. Certainly, most modern wearables seem reasonable substitutes for actigraphy, and we hope that as technology and algorithms improve, deeper sleep-level data will also become available. Over time, we also hope that more granular data is made available by the companies that market these devices. Additionally, it would be interesting to see if wearables themselves could be used as an intervention on a large scale to help individuals track their own biometric data to improve their own sleep and possibly other potentially sleep-dependent clinical outcomes. Interestingly, not all studies with wearables have been positive—for example, step counters used for weight loss may have instead led participants to reward high step counts with additional food [80]. And, for sleep devices, some sleep data leads people to try and perfect their sleep, causing frustration where none existed before—so-called orthosomnia [81]. Important questions remain around accuracy with different skin tones, and health equity.

Conclusions

We anticipate an increase in wearable-based sleep research in the near future, and we may also see wearables become more prominent in the clinical setting as another tool to improve outcomes. More randomized clinical trials are needed in the population-level sleep space, and decentralized digital studies may serve as an important opportunity in accomplishing this goal.