Abstract
While individuals often fail to assess their mental health subjectively in their day-to-day activities, the recent development of consumer-grade wearable devices has enormous potential to monitor daily workload objectively by acquiring physiological signals. Therefore, this work collected consumer-grade physiological signals from twenty-four participants, following a four-hour cognitive load elicitation paradigm with self-chosen tasks in uncontrolled environments and a four-hour mental workload elicitation paradigm in a controlled environment. The recorded dataset of approximately 315 hours consists of electroencephalography, acceleration, electrodermal activity, and photoplethysmogram data balanced across low and high load levels. Participants performed office-like tasks in the controlled environment (mental arithmetic, Stroop, N-Back, and Sudoku) with two defined difficulty levels, and tasks of their choosing in the uncontrolled environments (mainly researching, programming, and writing emails). Each task label was provided by the participants using two 5-point Likert scales of mental workload and stress as well as the pairwise NASA-TLX questionnaire. This data is suitable for developing real-time mental health assessment methods, conducting research on signal processing techniques for challenging environments, and developing personal cognitive load assistants.
Background & Summary
Daily workload can be an occupational stressor for many professionals, including healthcare staff, aviation personnel1, students, performance-evaluated workers, and many more2,3, and it can even arise from unpaid work such as household chores or childcare4. Additionally, prolonged cognitive load can lead to stress and, therefore, to many stress-related diseases. Associated adverse consequences affect both the stressed individual and the economies of the world5. To provide individuals with an essential toolkit for self-help, the World Health Organization (WHO) published an illustrated management guide in 25+ languages (https://www.who.int/publications/i/item/9789240003927). The prerequisite for utilizing cognitive load management techniques is that individuals need to assess their load levels in a timely and accurate manner, which, despite the aforementioned stress management tools at hand, many fail to do6. Moreover, distinguishing between cognitive overload and stress in situ is often challenging. Reasons include the occupation of mental resources by the task at hand, an overestimation of personal thresholds, or a fear of repercussions for communicating a personal boundary, amongst others7. Besides individual assessment, two further options exist to assess an individual’s workload level: classification by an expert and objective classification. Expert-based assessments have the disadvantages of requiring time and extensive training, such that, in turn, the associated costs are high and expert-based classification is rendered unrealistic for broad adoption. Objective classification has been researched extensively and is commonly based on patterns in individuals’ biomarkers (e.g., quantifying cortisol levels in saliva). As gold-standard methods require substantial time or highly specialized and expensive hardware, most of these methods are rendered equally unusable for the broad adoption of real-time detection and management applications.
For the aforementioned reasons, wearable low-cost sensors are researched extensively. Methods based on wearable sensors that acquire physiological signals are capable of detecting mental workload and stress utilizing electroencephalography, acceleration, electrodermal activity, and photoplethysmogram data, amongst others8,9,10,11. However, much of the existing research body is limited by either (i) utilizing only already publicly available data, (ii) considering only one physiological modality, (iii) performing studies only in unrealistic controlled environments, (iv) utilizing non-standard questionnaires, (v) considering only predetermined tasks, (vi) lacking a proper sensor data synchronization protocol, or (vii) collecting data only from a small number of participants. Table 1 compares the present work to other publicly available data sets of similar purpose.
The broader goals of creating this data set were to overcome the existing limitations and to enable the research community to investigate the classification and regression of cognitive load levels from physiological signals, both in controlled and uncontrolled environments. A wide range of participant-chosen tasks was permitted, with the constraint of being as close as possible to realistic home-office tasks. Timely collected labels were ensured, and three forms of the data set are provided: a) the fully raw and unprocessed data, b) a minimally pre-processed version, synchronized between the time series of individual sensors and already broken down into individual tasks with associated labels, and c) a multitude of extracted hand-crafted features, ready to be utilized in machine learning models. Scripts were made publicly available at https://github.com/HPI-CH/UNIVERSE.
Additional publicly available artefacts include recording notes, the parameters for each individual’s data synchronization, log files, experimental configurations, notes collected during recordings in the controlled environment on excessive movements or temporary losses of the Bluetooth connection (usually, these lasted four to eight seconds and occurred rarely), and participants’ notes in PDF form. The actual sensor data is available as recorded, in a synchronized and minimally pre-processed format, and as extracted features.
Figure 1 provides an overview of the developed study design and the assays used, while detailed information on the experimental design is given in ‘Experimental Design’ and data characteristics are described in detail in ‘Data Records’. Aside from cognitive load classification, this data can be reused to answer various research questions like reaction time estimation experiments, biomarker analysis, and artefact identification in physiological data for ambulatory settings, amongst others.
Methods
Ethics Statement
Ethical approval has been obtained from the Institutional Review Board (IRB) of the University of Potsdam (application number 36/2022). Study information sheets were sent to potential participants weeks before they participated in the study, and participants had sufficient opportunities to ask questions or raise any concerns before, during, and after the study. Written consent was obtained for both the participation in this study and the publication of anonymized data. It was communicated to the participants that they could, at any time and without negative consequences, drop out of the study, which one participant chose to do while collecting data in the self-chosen environments.
Participants & Demographics Information
Data collection was conducted during the summer, autumn, and winter of 2022 as well as the spring of 2023, after advertisements were sent out via mailing lists. Inclusion criteria required participants to be aged 18 to 68, fluent in English, have normal or corrected-to-normal vision, know how to use a smartphone, and regularly perform work that is performance-evaluated (e.g. students or employees). Potential participants were excluded if they were retired or needed to regularly take medication to support the treatment of a neurological disease such as depression, brain damage, or similar. In total, 24 participants agreed and were eligible to participate. One participant dropped out for personal reasons during the data collection in the self-chosen environments, and two additional participants recorded incomplete datasets, as the participant-controlled data collection failed. However, mostly complete laboratory data was collected for these three participants. The data is nearly balanced across biological sex (11 female participants). The participants were highly educated: the majority held the equivalent of a master’s degree or higher (seventeen participants), while every participant had at least a bachelor’s degree. Participants were aged 24 to 61, with a mean age of 29.5 years (± 8.2 years). All the participants were right-handed. The distribution of the countries of origin is depicted in Fig. 2a; the participants were from Germany (10), Brazil (3), Bangladesh (2), Chile (2), India (2), Ecuador (1), Egypt (1), Mexico (1), Peru (1), and the USA (1).
Experimental Design
To overcome the existing limitations in the research body, an experimental design was developed including data collection at varying days, varying times of day, and varying locations. As a result, the data set at hand provides for each participant data that has been recorded on different days and times of day, allowing the investigation of the fluctuation of physiological signals for this individual in unnatural (controlled laboratory) as well as in natural (realistic work-from-home) environments.
Controlled Environment
In the laboratory setting, in designated rooms at the chair for Digital Health - Connected Healthcare at the Hasso Plattner Institute, Digital Engineering Faculty of the University of Potsdam, participants were instructed to follow the study design depicted in sub-figure A of Fig. 1. Explanations of each respective task were given to the participants before they followed the instructions given by the study platform. Participants sat comfortably in a temperature-controlled room at a distance of 80 cm from a Full HD PC monitor, with floor-to-ceiling windows facing a green backyard garden. The laboratory rooms for the controlled recordings had been chosen in a rarely frequented wing, and the study directors had temporarily posted notes on all the doors of that wing, reminding passersby to a) talk, if at all, at a low volume, and b) not enter the recording rooms. Due to these notes and the sound-absorbing properties of the doors, few distractions reached the participants while performing the tasks. If, by accident, someone entered the room, the experimenter present in the recording room got up quietly and immediately escorted the entering person outside, explaining that they were unable to use that room for the time being and ensuring the interruption was kept at a low volume and of short duration. Thereby, interruptions during the recordings in the controlled environment were kept to a minimum (approximately five interruptions noticed by the participants) and noted down in the experimental notes. The experimenters sat at another workplace in the same room, far enough away to ensure the participant’s comfort and give a sense of privacy, while still allowing questions to be raised if necessary. To avoid participant-driven breaks, participants were reminded to use the bathroom before the recordings, while being encouraged to notify the experimenter and take a bathroom break during the recording, if necessary.
Participants were given drinks of their choice (water, different teas, or coffee), and the choice was noted down in the session notes if it deviated from water. Subsequently, participants were instructed on how to use the devices on their own while the study director ensured proper device fit, and started the Python PsychoPy application (v2022.2.1, developed in Python 3.8).
During the study, the participants were mainly guided by the PsychoPy application. Only for two activities, the synchronization of the PsychoPy application with the sensors at the beginning and at the end of the experiment, did the platform ask the participants to call the study directors for assistance. This synchronization was performed by fast-paced tapping on the spacebar, performed by the participant with the hand on which the Empatica E4 watch was worn, their non-dominant hand (i.e. the left hand). Before the experiment, participants had been shown how to perform the fast-paced spacebar tapping by raising the hand with the Empatica E4 watch high above their head and quickly dropping it onto the spacebar, thereby generating high-acceleration data. This whole process was to be repeated four times for each of the synchronizations at the start and end of the experiment. During the instruction, all participants seemingly understood the procedure sufficiently well and tapped the training keyboard properly. However, during the actual experiment, some participants (during approximately ten individual sessions in total) were too careful with the recording equipment. It was later confirmed in personal discussions that some participants were afraid of damaging the keyboard. As a result, some participants performed slow-paced tappings with small effective ranges of motion, resulting in difficult-to-detect patterns within the acceleration data. If these challenges occurred, the experimenters noted them in the recording notes. The study director sat at the desk opposite the participant, hidden behind two monitors, and took notes on excessive body movements or other anomalies that might have led to artefacts during the recordings, such as timestamps of drops in the Bluetooth connection or when the participant drank something.
The study protocol for the controlled environment is illustrated in subfigure A of Fig. 1. In the beginning and after synchronizing the sensors with each other and with the platform, the participants watched a relaxation video of scenic shots from the national park Torres Del Paine, Chile, with relaxing music (https://www.youtube.com/watch?v=jXl1GbK5ZO8&t=3s). Next, a set of questionnaires and an eye-closing session were performed, to assess the baseline affective state and physiological signals of the participants. For the remainder of the recording, the participants performed four different tasks in two difficulty levels: easy and hard, each of ten minutes duration and in random order.
These four tasks encompassed (i) Mental Arithmetic, (ii) Sudoku, (iii) N-back Task, and (iv) Stroop Task. (i) The Mental Arithmetic Task is one of the well-known cognitive load-inducing techniques12. Here, participants are required to solve mathematical problems mentally, without additional support from writing instruments or calculators. The study was designed to include simple addition and subtraction for the easy level, potentially involving carry numbers. For the hard level, complex calculations of multiple-digit multiplication and division were required. Operands ranged from -100 to 100, and the specific tasks, answers, and reaction times were logged during each experiment. (ii) The digital version of the game Sudoku13 can sometimes be found pre-installed on Linux and Windows systems and was utilized in this study. The Sudoku application was used with two distinct difficulty levels, easy and hard, among the four default difficulty levels provided (https://wiki.gnome.org/Apps/Sudoku). Participants placed numbers from 1 to 9 in a 9 × 9 grid such that each column, row, and subsection contains all numbers, accounting for the constraint that the same number cannot occur twice in the same row, column, or 3 × 3 grid (i.e. subsection). If the participants solved the game before the time was up, they were instructed to play more rounds of the same difficulty level until they reached the time limit of ten minutes and the application automatically terminated. (iii) The N-back task14 requires memory-sequencing of coloured rectangle blocks shown on the screen. The participant had to match the colour of the current stimulus with the colour of the stimulus n elements earlier. The application was configured to depict six distinct colours, including white, with n = 1 and n = 2 defining the respective easy and hard difficulty levels. The participants had two seconds to give their answer before inactivity was rated as a missed trial.
Colours, answers, and reaction times were logged during each experiment. (iv) To stress the control processes of working memory, the Stroop task15 was chosen. In this task, a sequence of single words appears on the screen, each naming a colour. However, the word itself is rendered in the same or a different font colour. The participants had to recognize the font colour, ignoring the written word, and type the starting letter of the name of the font colour (e.g. y for yellow). In total, four colours were utilized. On the easy level, the participants had a maximum of 5.5 seconds to answer, whereas on the hard difficulty, the answer had to be given in under 1.5 seconds. Colours, answers, and reaction times were logged during each experiment.
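The two difficulty levels of the Mental Arithmetic Task can be illustrated with a short sketch. The function below is a hypothetical reconstruction based only on the constraints stated above (addition/subtraction with operands in [-100, 100] for the easy level, multi-digit multiplication and division for the hard level); it is not the study's actual PsychoPy code, and restricting division to integer quotients is an added simplifying assumption.

```python
import random

def make_problem(difficulty, rng=None):
    """Generate one mental-arithmetic problem and its answer.

    Easy: addition/subtraction with operands in [-100, 100].
    Hard: multi-digit multiplication or division (integer quotients,
    a simplifying assumption not stated in the paper)."""
    rng = rng or random.Random()
    if difficulty == "easy":
        a, b = rng.randint(-100, 100), rng.randint(-100, 100)
        op = rng.choice("+-")
        answer = a + b if op == "+" else a - b
    elif difficulty == "hard":
        op = rng.choice("*/")
        b = rng.randint(10, 99)
        if op == "*":
            a = rng.randint(10, 99)
            answer = a * b
        else:
            a = b * rng.randint(10, 99)   # guarantees an integer quotient
            answer = a // b
    else:
        raise ValueError(f"unknown difficulty: {difficulty}")
    return f"{a} {op} {b}", answer
```

Logging each generated problem, the participant's answer, and the reaction time, as described in the text, would then happen per trial around this generator.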
Each task other than Sudoku was preceded by a trial session of 45 seconds. Objective labels and data were logged by the PsychoPy application (e.g. task difficulty, start and end timestamps, task-specific information, and more). Subjective labels were provided by the participants as answers to the NASA-TLX, PANAS, affective slider, and Likert scale questionnaires after the relaxation video and at the end of all of the tasks. For reasons of time, in between the individual tasks the participants answered only the pair-wise NASA-TLX questionnaire16 and the affective sliders17, and reported their mental workload and stress level during the previous task on a Likert scale18. The pair-wise NASA-TLX questionnaire quantifies the subjective mental workload of a given task on six continuous sub-scales, each ranging from 0 to 100, covering mental demand, frustration, physical demand, temporal demand, performance, and effort. For these sub-scales, additional weights were derived from the pair-wise comparisons of the questions. The affective sliders measure the subjective ratings of pleasure and arousal on two separate sliders. The sliders were designed with visual bipolar affective states through emoticons19 to rate the current emotion and did not need a written explanation, though they were explained before the experiment in a video-guided explanation of the whole study paradigm. The participants also rated their subjective mental workload and stress levels on two 5-point Likert scales with the options: “very very low”, “low”, “nor low nor high”, “high”, and “very very high”.
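The weighted NASA-TLX workload score can be computed from the six sub-scale ratings and the 15 pair-wise comparisons following the standard NASA-TLX procedure: each sub-scale's weight is the number of pairwise comparisons it wins, and the overall score is the weight-averaged rating. The sketch below shows this generic computation; the function and variable names are illustrative, not taken from the study's code.

```python
from itertools import combinations

SUBSCALES = ["mental", "physical", "temporal", "performance", "effort", "frustration"]

def nasa_tlx_weighted(ratings, pair_choices):
    """Weighted NASA-TLX workload score (standard procedure, generic sketch).

    ratings: dict subscale -> rating in [0, 100]
    pair_choices: dict (subscale_a, subscale_b) -> the subscale chosen as
                  more important, one entry per unordered pair (15 pairs)."""
    assert len(pair_choices) == len(list(combinations(SUBSCALES, 2)))  # 15 pairs
    weights = {s: 0 for s in SUBSCALES}
    for chosen in pair_choices.values():
        weights[chosen] += 1
    # Weights sum to 15, so the weighted score stays within [0, 100].
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15.0
```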
After the recording, information on how to use the devices in the uncontrolled environments was repeated, the participants received a printed handout illustrating proper sensor fit detailing the steps required for data acquisition, and an appointment for the second recording in the controlled environment was made.
Uncontrolled Environment
For the recordings in the uncontrolled environment, participants were free to choose where, when, and on which tasks they wanted to perform a recording. The only strict requirement was that participants had to aim for a balanced distribution between tasks they would consider low load as well as high load. A printout with steps to be followed, handed out to each participant, served as a guideline to ensure sufficient data quality of the recordings, instructing the participants to perform the synchronization shaking protocol described in ‘Data Synchronization’, to perform sensor fit checks as illustrated, to start with an eye-closing session of one minute, and to record data for a variable time of 15 to 45 minutes at a stretch. Most of the participants followed the instructions on the shaking of the devices (some, however, at a rather low intensity and velocity), on the eye-closing protocol (more than 90%), and on the distribution of low-vs-high load tasks. The most prominent tasks were reading or searching information (i.e. documentations or research papers; ≈ 26.75%), coding (various difficulties and programming languages; ≈ 16.74%), and processing data (e.g. performing data analysis, project planning based on data, doing mathematical calculations, etc.; ≈ 16.74%), while the least prominent tasks were relaxation (e.g. meditation; ≈ 0.5%), preparing a presentation ( ≈ 1.9%), and attending a meeting ( ≈ 2.8%). Other tasks performed by the participants encompassed playing a game ( ≈ 7.8%), responding to emails ( ≈ 8.6%), watching a video (e.g. learning a new skill; ≈ 8.6%), and typing (e.g. summarizing publications or writing a new manuscript; ≈ 9.7%).
Recording Devices
Two wearable devices capable of recording physiological signals were used in this study, the Muse S headband and the Empatica E4 watch, shown in Fig. 3. The Muse S headband is capable of measuring Electroencephalography (EEG), Accelerometer (ACC), Gyroscope (GYRO), and Photoplethysmography (PPG) data. EEG data was recorded at 256 Hz at the five sensor locations AF7, AF8, TP9, TP10, and FpZ, according to the 10/20 international system and depicted in detail in Fig. 3b. ACC and GYRO data were sampled at 50 Hz, and PPG data could have been sampled at 64 Hz. However, for reasons of battery life and the stability of the Bluetooth connection, PPG data was not recorded from the headband; as a consequence, PPG data was recorded only with the Empatica E4 watch. The recording for the Muse S headband was started and data was collected via the third-party app Mind Monitor, which provided many additional features such as power band values. Amongst these is the Horse Shoe Indicator (HSI) for each electrode location, reflecting three states of electrode connection quality (0.0 means no connection, 1.0 means good connection, and 2.0 means poor connection), which could be used in future studies to correlate it with the computed signal quality indices of this work. Subsequent pre-processing steps were performed on the raw data, while the band-power features computed by the third-party app were discarded from the subsequent analysis but made publicly available as well. The Empatica E4 is capable of measuring Temperature (TEMP), Photoplethysmography (PPG), Electrodermal Activity (EDA), and Acceleration (ACC) data using the accompanying E4 realtime app provided by the manufacturer for iOS and Android devices or the E4 streaming server for Windows. TEMP and EDA data were collected with a sampling rate of 4 Hz, PPG data at 64 Hz, and ACC data at 32 Hz.
Two further recording devices were utilized during the study: a Google Pixel phone (Pixel 3a; Android 12), on which the data collection apps had been installed, and a personal computer (PC) that displayed the study platform PsychoPy. After data collection in the controlled environment, the respective Google Pixel phone was handed out to the participant alongside the Muse S headband and the Empatica E4 watch used, for subsequent data collection in the self-chosen environments. The PC ran Ubuntu 20.04.5 and had 4 cores and 8 GB of RAM. An extensive logging functionality had been implemented, which logged a multitude of information: synchronization taps, questionnaire answers (e.g. NASA-TLX, PANAS, affective sliders, and Likert scales), eye-closing timestamps, and task-specific information such as correct answers, response times, and the operators in the specific task, amongst others. For reasons of redundancy, the logging was directed to two files: a .csv file and a .log file. In two recordings, this redundancy was needed, as one of the files had not been written properly due to an application error, while the other file had been correctly stored on disk. As a consequence, for two of the recordings in the controlled environments, the log files had to be utilized to derive the task labels, and for two further recordings, the physiological signals had to be interpolated at the end of the recordings due to Bluetooth connection problems with the sensors.
Data Synchronization
The internal clocks of (wearable) devices can run at different speeds than those of reference systems, resulting in a phenomenon called clock drift, and wearable sensors are no exception to this. This situation is worsened by the circumstance that wearable sensors can have different time zone settings and that floating-point computations on timestamps are performed at varying degrees of accuracy. For these reasons, the internal clocks of devices need to be resynchronized. To ensure the synchronicity of data recorded from different devices and platforms, multiple solutions exist, such as the Lab-Streaming Layer (LSL; https://github.com/sccn/labstreaminglayer). However, this solution requires the availability of a platform that receives the streamed data, which is not guaranteed in the uncontrolled environments chosen by participants. For these circumstances, a shake-based protocol was developed, and participants were asked to follow it closely. When starting a session in the self-chosen environment, participants had to start the recordings, place the wearable sensors flat on a surface and wait for six to twelve seconds, then take both the Muse S headband and the Empatica E4 watch together and shake them vigorously for about twelve seconds, and finally place the devices on a flat surface again and wait for six to twelve seconds before starting with the actual task. This procedure was to be repeated at the end of the recording. By resting both devices on a flat surface twice and simultaneously shaking them in between, very clear and similar patterns of acceleration and gyroscope data were collected. Figure 4 provides an overview of the resulting accelerometer data.
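A minimal sketch of how the shake segments could be located in the accelerometer data is given below: it thresholds the gravity-corrected acceleration magnitude to find the first and last high-movement samples of a recording. The published pipeline instead relies on peak detection and the Jointly package, so this function, its name, and its threshold heuristic are purely illustrative.

```python
import numpy as np

def shake_bounds(acc, thresh_factor=4.0):
    """Return the indices of the first and last high-acceleration (shake)
    samples in a recording, or None if no shake is found.

    acc: (n, 3) array of accelerometer samples."""
    mag = np.linalg.norm(acc, axis=1)
    mag = np.abs(mag - np.median(mag))        # remove the static gravity offset
    thresh = thresh_factor * np.median(mag) + 1e-9
    active = np.flatnonzero(mag > thresh)
    if active.size == 0:
        return None
    return int(active[0]), int(active[-1])
```

Matching the start and end shakes across both devices then yields the anchor points from which the clock drift between them can be estimated.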
After loading the Muse S and Empatica E4 data utilizing devicely in version 1.1.1 (https://pypi.org/project/devicely/1.0.2/), subsequent peak detection allowed for synchronization of the time series by alignment after clock drift, using the Python-based synchronization package Jointly20 in version 1.0.4. However, a few participants omitted this step during the recordings in the self-chosen environment, and a few of the accelerometer recordings in the uncontrolled environments were difficult to align. For this reason, and to ensure the same pre-processing steps across the published data, synchronization was performed based on the timestamps given by the wearables for the data recorded in the uncontrolled environments, while the data in the controlled environments had been synchronized using Jointly.
Data Processing
Two different formats of unprocessed data are available: (i) the raw data as recorded from the individual wearable devices and the study platform, and (ii) the data organized by tasks, which is the result of splitting the original (raw) time series data into the respective tasks performed by the participants. Task extraction was based either on a simple splitting technique using the timestamps from the devices in the uncontrolled environment (ii.1) or on a sophisticated shake-detection protocol performed by the experimenters for each recording in the controlled environment (ii.2). Additionally, a simple data pre-processing pipeline was implemented for use in qualitative and quantitative evaluations and resulted in three data sets: (i-pre), (ii.1-pre), and (ii.2-pre). Pre-processing steps for the EEG data included a Butterworth band-pass filter for the range of 0.5-50 Hz, implemented with the MNE-Python package (https://mne.tools/stable/generated/mne.filter.filter_data.html) and applied to remove high-frequency noise from muscle activation of the scalp and low-frequency disturbances such as heartbeats. An additional Butterworth filter at 50 Hz removed the power-line interference from the signal. Furthermore, a movement filter was applied by band-pass filtering the accelerometer data from the headband in the range of 0.5-20 Hz and computing the magnitude of the acceleration for each participant of a given controlled or uncontrolled session. By applying a binary search, a participant-wise threshold was computed that flagged 3 to 5.25% of the data as high-acceleration; the corresponding EEG data was then interpolated. Additionally, the EEG data from the controlled session was normalized by min-max normalization and by removing the baseline obtained from the eye-closing session. In contrast, the data from the uncontrolled session went through only a min-max normalization, as not all participants performed an eye-closing session as instructed. The data from both sessions were average-referenced.
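The participant-wise movement threshold described above lends itself to a compact sketch: a binary search over candidate thresholds on the acceleration magnitude until the flagged fraction of samples falls between 3% and 5.25%. The function below is an illustrative reconstruction under that reading of the text, not the study's code.

```python
import numpy as np

def movement_threshold(acc_mag, lo_frac=0.03, hi_frac=0.0525, iters=50):
    """Binary-search a threshold on the acceleration magnitude such that
    between 3% and 5.25% of samples are flagged as high-movement."""
    lo, hi = float(acc_mag.min()), float(acc_mag.max())
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        frac = float(np.mean(acc_mag > mid))
        if frac > hi_frac:
            lo = mid          # too many samples flagged -> raise the threshold
        elif frac < lo_frac:
            hi = mid          # too few samples flagged -> lower the threshold
        else:
            return mid
    return (lo + hi) / 2.0
```

The EEG samples coinciding with magnitudes above the returned threshold would then be interpolated, as described in the text.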
Due to the lack of dedicated recording channels for ocular (electrooculogram, EOG), muscular (electromyogram, EMG), or cardiac (electrocardiogram, ECG) activity, the widely applied21 steps of EOG, EMG, and ECG removal were not performed; only other obvious artefacts, such as the loss of contact or of the Bluetooth connection, amongst others, were interpolated using the mean value of the neighbouring values. The raw blood volume pulse (BVP) data extracted from the PPG sensors underwent the same normalization mentioned for the EEG data. Subsequently, for further data cleaning, a Savitzky-Golay filter with filter order 4 and window length 31, implemented with the SciPy Python package (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html), was applied to the BVP data. The TEMP and EDA data were preprocessed by interpolating the missing data. The preprocessing steps are summarized in Fig. 5.
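The BVP smoothing step can be reproduced with SciPy directly, as the text names both the function and its parameters (polynomial order 4, window length 31). The snippet below applies it to a synthetic pulse-like signal; the signal itself is fabricated for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter

# Smooth a noisy blood-volume-pulse-like signal with the filter settings
# reported in the text (polynomial order 4, window length 31).
fs = 64                                     # E4 BVP sampling rate, Hz
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(42)
bvp = np.sin(2 * np.pi * 1.2 * t)           # ~72 bpm pulse wave (synthetic)
noisy = bvp + rng.normal(scale=0.3, size=t.size)
smooth = savgol_filter(noisy, window_length=31, polyorder=4)
```

With a 64 Hz sampling rate, a 31-sample window spans roughly half a second, which smooths high-frequency noise while preserving the pulse waveform.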
For extracting features from the preprocessed data of the four modalities, the data was segmented into 60-second windows with 80% overlap. As described in the literature22, the power spectral density (PSD) of five commonly used frequency bands, namely δ (0.5-4 Hz), θ (4-7 Hz), α (8-12 Hz), β (12-30 Hz), and γ (30-50 Hz), was computed from each of the four channels of EEG data using Welch’s method and a Hann window function. The average of each band power over the four channels, denoted as mean-(δ, θ, α, β, γ), and the ratio θ/α were used as features. Additionally, asymmetry features were calculated by subtracting the log-transformed spectral power of the right hemisphere from that of the left hemisphere for each power band, denoted as frontal-α-asy and (δ, θ, α, β, γ)-asy. Using the NeuroKit2 python package (https://neuropsychology.github.io/NeuroKit/), features were extracted from the cleaned BVP data. The time-domain features include the mean of the RR intervals (HRV-MeanNN), the standard deviation of the RR intervals (HRV-SDNN), and the root mean square of successive RR-interval differences (HRV-RMSSD). The frequency-domain features include the normalized low-frequency (0.04 to 0.15 Hz) power (HRV-LFn), the normalized high-frequency (0.15 to 0.4 Hz) power (HRV-HFn), and the ratio between the two (HRV-LFn/HRV-HFn). From the EDA signal, the number of skin conductance response peaks (SCR-Peaks-N) and the mean amplitude of the peak occurrences (SCR-Peaks-Amplitude-Mean) were extracted. Furthermore, the mean (mean-temp) and standard deviation (std-temp) of the temperature were extracted from the cleaned TEMP signal.
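The windowed EEG band-power extraction can be sketched as follows. For simplicity, this example operates on a single (channel-averaged) signal rather than four channels, and the exact Welch segment length is an assumption; the band edges, window length, and overlap follow the text.

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 4), "theta": (4, 7), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 50)}

def band_power_features(eeg, fs=256, win_s=60, overlap=0.8):
    """Band powers per 60 s window with 80% overlap (simplified sketch;
    eeg is a 1-D channel-averaged signal here)."""
    step = int(win_s * fs * (1 - overlap))   # 12 s hop between windows
    size = int(win_s * fs)
    rows = []
    for start in range(0, eeg.size - size + 1, step):
        seg = eeg[start:start + size]
        # Welch PSD with a Hann window; 4 s segments are an assumption.
        freqs, psd = welch(seg, fs=fs, window="hann", nperseg=4 * fs)
        df = freqs[1] - freqs[0]
        feats = {name: float(psd[(freqs >= lo) & (freqs < hi)].sum() * df)
                 for name, (lo, hi) in BANDS.items()}
        feats["theta_alpha_ratio"] = feats["theta"] / feats["alpha"]
        rows.append(feats)
    return rows
```

In the published feature set, this computation is done per channel and then averaged, and asymmetry features additionally compare left- and right-hemisphere channels.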
Data Records
The data23 has been deposited with Zenodo at https://doi.org/10.5281/zenodo.10371068, run by the CERN Data Centre. Considering only task-related data, the recordings in the controlled environment total 60.30 hours from the Muse S headband and 60.37 hours from the Empatica E4 watch, while the recordings in the uncontrolled, participant-chosen environments total 94.85 hours from the Muse S headband and 99.33 hours from the Empatica E4 watch ( ≈ 315 hours in total). Information on the folder structure in the data repository is given in Fig. 6. Information on the additional data collected during the recordings in the controlled environment is provided in Table 2, while information on the data collected by the PsychoPy platform is given in Table 3. For the data recorded in the controlled environments, laboratory notes from the participants’ recordings and the task labels are located in all the top-level folders. The ‘Raw’ top-level folders contain the data in the format as recorded. No analytical procedures were performed on this data, artefacts are present, and no features have been extracted. The PsychoPy platform logs are stored only in this folder. The PsychoPy logs were used to create the ‘Labeled’ folder, in which the data was split into the respective tasks performed per laboratory recording (‘Lab1’ and ‘Lab2’). The respective name of each sub-folder, e.g. ‘arithmetic_easy’, gives information about the task performed (‘arithmetic’, i.e. mental calculations) and the difficulty level (‘easy’, i.e. designed to result in little-to-no load). The data recorded in participant-chosen uncontrolled environments is stored in the ‘Wild’ data folder. Here, in the ‘Raw’ top-level folder, the data collected during each of the participant-performed recordings is stored in folders numbered in ascending order from 1 to N.
Data in the ‘Labeled’ folder is stored in sub-folders named according to the stress and mental workload labels provided by the participants for the given recordings, using five-point Likert scales (ranging from “very low”-“vlw”, “low”-“low”, “neither low nor high”-“nor”, and “high”-“hig”, to “very high”-“vhg”). For example, if the participant labeled the first recording in the self-chosen environment as “neither low nor high” in stress and “low” in mental workload, the data for this recording can be found in the subfolder “nor_stress_low_mw_1”.
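Given this naming scheme, the labels can be recovered from a folder name with a small helper. The function below is a hypothetical convenience, not part of the dataset’s tooling:

```python
def parse_wild_folder(name):
    """Parse a 'Labeled' sub-folder name, e.g. 'nor_stress_low_mw_1'.

    Returns (stress_label, mental_workload_label, recording_number)."""
    stress, _, mw, _, idx = name.split("_")
    return stress, mw, int(idx)

labels = parse_wild_folder("nor_stress_low_mw_1")
```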
As for the ‘Preprocessed’ data, the data were preprocessed as described in ‘Data Processing’ and saved according to the extracted individual tasks and the eye-closing sessions of the Labeled data. Furthermore, features were extracted using 60-second windows from the preprocessed data of the task folders only, excluding the eye-closing sessions, and stored in the ‘Features’ folder.
Technical Validation
Data Quality Analysis
Commonly, data quality is quantified by using metrics such as the signal-to-noise ratio (SNR). This approach works particularly well for clearly defined expected signals, such as event-related potentials based on somatosensory stimuli24. However, when the signal of interest is not well-defined, another option is to define what is regarded as noise and compute the SNR as a signal quality index (SQI) for each part of the signal of interest. For the EEG data21, the BVP data25, and the EDA data26, many recommendations for data pre-processing exist, which give information on good and bad signals. Consequently, four different types of SQI were calculated for the EEG, BVP, EDA, and TEMP data, based on related work27.
For the EEG signal, the SQI was computed as SNR in decibels (dB). First, the power line interference was removed using a notch filter at 50 Hz, realized as a digital filter applied forward and backward. Second, the four-channel EEG data was re-referenced to the common average, retaining four channels. For each channel, the Power Spectral Density (PSD) was computed using Welch’s periodogram method and averaged across all ten-second windows (with five seconds of overlap) within each recording session. The resulting power spectra were averaged across channels. To quantify the signal, the SNR was computed as \(10\,{\log }_{10}(band\_power/noise\_power)\), with band_power referring to the individual bands of interest (i.e. Delta (δ, <4 Hz), Theta (θ, 4-7 Hz), Alpha (α, 8-12 Hz), Beta (β, 13-30 Hz), and Gamma (γ, 31-100 Hz, below the Nyquist frequency)) and noise_power referring to the power in the highest frequency band below the Nyquist frequency (i.e. 100-125 Hz). Results are reported as SNR in dB in Table 4. As the results in Table 4 are averaged across all individuals and recordings of the different environments, and as some participants took less care than others to ensure good signal quality during the recordings, Table 5 gives an overview of the individual data quality indices.
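A minimal sketch of this SQI computation is given below, assuming a 256 Hz Muse S sampling rate (Nyquist frequency 128 Hz); the notch quality factor is a hypothetical choice not stated in the text.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.signal import filtfilt, iirnotch, welch

FS = 256  # assumed Muse S sampling rate -> Nyquist frequency of 128 Hz

def eeg_snr_sqi(eeg, band):
    """eeg: array of shape (channels, samples); band: (low, high) in Hz.

    Returns the SNR in dB of `band` against the 100-125 Hz noise band."""
    b, a = iirnotch(50.0, Q=30.0, fs=FS)          # remove 50 Hz power-line interference
    eeg = filtfilt(b, a, eeg, axis=-1)            # zero-phase: applied forward and backward
    eeg = eeg - eeg.mean(axis=0, keepdims=True)   # common-average reference
    # Welch PSD over ten-second windows with five seconds of overlap.
    freqs, psd = welch(eeg, fs=FS, nperseg=10 * FS, noverlap=5 * FS, axis=-1)
    psd = psd.mean(axis=0)                        # average spectra across channels

    def band_power(lo, hi):
        m = (freqs >= lo) & (freqs < hi)
        return trapezoid(psd[m], freqs[m])

    return 10 * np.log10(band_power(*band) / band_power(100, 125))
```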
For the BVP signal, the SQI was computed as a measure of spectral entropy (SE) for each minute. As the signal was collected at a 64 Hz sampling rate, bandpass filtering in the low-frequency ranges was possible below the Nyquist frequency of 32 Hz. First, a bandpass filter with a passband between 1 Hz and 3 Hz was applied. Subsequently, the SE-SQI was computed as a weighted sum over 15 non-overlapping 4-second windows: Welch’s periodogram method was applied to compute the PSD for each 4-second window of the recording, and the derived PSD was normalized such that the total power across all frequency bins (1.0 Hz to 3.0 Hz, at a frequency resolution of 0.25 Hz) totalled approximately 1.0. For each minute of the recordings, the PSD was averaged across the 15 four-second windows, and the weighted sum was built as the sum over frequency bins of the averaged power multiplied by its binary logarithm. The negative of this weighted sum, divided by the binary logarithm of the number of frequency bins (i.e. 9; [1.0, 1.25, …, 2.75, 3.0]), was stored as the SE-SQI. An SE-SQI value of less than 0.8 is regarded as a good signal, while values above 0.8 are regarded as noisy27. In total, 68.94% of the data collected in the controlled environments is regarded as good signal quality, compared to 51.18% of the data collected in the uncontrolled environments.
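The SE-SQI described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors’ code: the Butterworth filter order is a hypothetical choice, and the normalization is applied to the window-averaged PSD.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

FS = 64  # Empatica E4 BVP sampling rate

def se_sqi(bvp_minute):
    """Spectral-entropy SQI for one minute (60 * 64 samples) of BVP data.

    Values below 0.8 indicate a good signal; values above 0.8 indicate noise."""
    b, a = butter(2, [1.0, 3.0], btype="bandpass", fs=FS)  # 1-3 Hz passband
    x = filtfilt(b, a, np.asarray(bvp_minute, dtype=float))
    # 15 non-overlapping 4-second windows; 4-s Welch segments give 0.25 Hz resolution.
    psds = []
    for w in x.reshape(15, 4 * FS):
        freqs, psd = welch(w, fs=FS, nperseg=4 * FS)
        psds.append(psd[(freqs >= 1.0) & (freqs <= 3.0)])  # 9 bins: 1.0, 1.25, ..., 3.0
    p = np.mean(psds, axis=0)
    p = np.clip(p / p.sum(), 1e-12, 1.0)  # normalize total power to ~1.0
    # Normalized spectral entropy: weighted sum of p * log2(p), scaled by log2(#bins).
    return -np.sum(p * np.log2(p)) / np.log2(p.size)
```

A regular pulse concentrates power in one bin (low entropy, good signal), while noise spreads power across bins (entropy near 1, graded noisy).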
These results can, as with the EDA data, partly be explained by some participants ensuring a better sensor fit than others, e.g. by wearing the Empatica E4 watch more tightly on the wrist of the non-dominant hand (confirmed in personal discussions after participant-driven data recordings). For the EDA signal, the SQI was computed as the rate of amplitude changes (RAC). The EDA signal, sampled by the Empatica E4 watch at 4 Hz, was analyzed for trends and abrupt changes in the skin conductance level, following the relative temporal progress of the time series data. If at any given point in time t the EDA value x(t) was significantly higher than four samples earlier (i.e. one second earlier; x(t) ≥ 1.2·x(t−4)) or significantly lower (x(t) ≤ 0.9·x(t−4)), the signal was graded as noisy and marked accordingly. If, however, no significant change in EDA value occurred throughout the last 480 samples (i.e. one minute of data; |x(t−480) − x(t+i)| ≤ 0.001 for all i ∈ [−480, 0]), the signal of the whole last minute was graded as noisy and marked accordingly. In total, 81.18% of the EDA data recorded in the controlled environment was graded as good signal quality, while 94.00% of the EDA data recorded in the uncontrolled environments was graded as good signal quality. For the TEMP signal, the data was analyzed as in related work28. However, due to the suspicion that some participants forgot to take off the protective cap from the Empatica E4 watch sensors, a narrower temperature range (30.00 to 40.00 degrees Celsius) was chosen as good signal quality data to rule out ambient room temperature data29.
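The RAC grading rules translate almost directly into code. The sketch below uses the stated thresholds; the loop structure and the function name are implementation choices for illustration.

```python
import numpy as np

FS = 4  # Empatica E4 EDA sampling rate

def rac_noisy_mask(eda):
    """Boolean mask marking EDA samples graded as noisy by the RAC rules above."""
    eda = np.asarray(eda, dtype=float)
    noisy = np.zeros(eda.size, dtype=bool)
    # Abrupt changes: compare each sample to the one recorded one second (4 samples) earlier.
    for t in range(4, eda.size):
        if eda[t] >= 1.2 * eda[t - 4] or eda[t] <= 0.9 * eda[t - 4]:
            noisy[t] = True
    # Flat lines: no significant change across the last 480 samples.
    for t in range(480, eda.size):
        window = eda[t - 480:t + 1]
        if np.abs(window - window[0]).max() <= 0.001:
            noisy[t - 480:t + 1] = True
    return noisy
```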
To reproduce the results, only the data described in ‘Data Records’ and Fig. 6 are required, together with Python 3.11.5 and the packages NumPy 1.26.2, Pandas 1.5.3, and SciPy 1.11.4. The code to compute the respective SQI values (EEG-SQI-SNR, SE-SQI, and RAC-SQI) can be found online at https://github.com/HPI-CH/UNIVERSE.
Statistical Analysis
To validate the technical robustness of the data set, a statistical analysis was performed on the extracted features mentioned in ‘Data Processing’, comparing easy and hard tasks across all participants in the controlled setup. Normality was tested using the Shapiro-Wilk test for each individual feature, for the sets of easy and hard tasks separately and for all tasks together; the test mostly suggested rejecting the null hypothesis, as illustrated by the example in Fig. 7. The figure shows the Q-Q plot of the differences between high- and low-class features for all tasks. While some feature distributions followed normal distributions, most features have heavy tails, suggesting participant-dependent outliers in the feature set. A paired t-test was therefore performed for normally distributed feature sets, and a Wilcoxon signed-rank test otherwise, using the ttest_rel (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html) and wilcoxon (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html) functions, respectively.
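The test-selection logic can be sketched as follows; the 0.05 normality threshold and the use of the paired differences for the Shapiro-Wilk test are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

def paired_feature_test(easy, hard, alpha=0.05):
    """Choose between paired t-test and Wilcoxon based on normality of the differences."""
    diff = np.asarray(hard) - np.asarray(easy)
    if shapiro(diff).pvalue > alpha:      # differences look normal -> paired t-test
        return ttest_rel(hard, easy).pvalue
    return wilcoxon(hard, easy).pvalue    # otherwise non-parametric Wilcoxon

rng = np.random.default_rng(1)
easy = rng.normal(0.0, 1.0, 30)
hard = easy + rng.normal(1.0, 0.5, 30)   # shifted "hard" condition
p = paired_feature_test(easy, hard)
```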
The resulting p-values are depicted in Table 6: the power-ratio feature has a p-value below 0.001 for all instances, marking it as a strong feature for the data set. The many other instances with similarly small p-values confirm that the data can differentiate strongly between the tasks and thereby validate the technical robustness of the data set.
Machine Learning Analysis
Machine learning models were evaluated to classify between easy and hard tasks in the controlled environment and self-assessed high- and low-load tasks in the uncontrolled environment, to ensure the technical robustness of the data set. Before building the models, the data went through the preprocessing steps described in ‘Data Processing’. Using the mentioned features from the four modalities, Logistic Regression (LR) classifiers from the sklearn Python package (https://scikit-learn.org/stable/) were trained using K-fold cross-validation (CV) with K = 5 for the train-test split of the data. To tune the hyper-parameters of the models, a grid search was performed for each train-test set using the default CV and the following hyper-parameters: penalty l1 with solvers [liblinear, saga], and penalty l2 with solvers [liblinear, saga, newton-cg, lbfgs], each with regularization strength C from logspace(-3, 3, 7). Only sessions with data from all four signals (EEG, EDA, BVP, and TEMP) were considered for the models. For participants UN_112 and UN_120, only one controlled session was used for classification due to the unavailability of a second one. For the remaining 22 participants, the data from two controlled sessions each were combined to develop a personalized model. The data was labeled with the defined task labels: easy and hard. For the recordings in the uncontrolled environment, the NASA-TLX scores reported by the participants were used to differentiate between easy and difficult tasks using a personalized threshold for each of the 21 participants with valid recordings. Due to the low quality of the BVP data, the data from UN_102 was left out of the comparison but should be considered in further evaluations, given that fewer than four modalities can be used for the models. The accuracy of the binary classification achieved by the models in the controlled and uncontrolled sessions is depicted in Fig. 8.
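A sketch of the described model selection, using the stated hyper-parameter grid on placeholder data: the feature standardization step and the synthetic labels are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hyper-parameter grid as described: l1/l2 penalties with compatible solvers.
param_grid = [
    {"logisticregression__penalty": ["l1"],
     "logisticregression__solver": ["liblinear", "saga"],
     "logisticregression__C": np.logspace(-3, 3, 7)},
    {"logisticregression__penalty": ["l2"],
     "logisticregression__solver": ["liblinear", "saga", "newton-cg", "lbfgs"],
     "logisticregression__C": np.logspace(-3, 3, 7)},
]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))  # placeholder feature matrix
y = (X[:, 0] + 0.3 * rng.standard_normal(200) > 0).astype(int)  # synthetic easy/hard labels

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, param_grid)        # default 5-fold CV, as in the text
scores = cross_val_score(search, X, y, cv=5)   # outer 5-fold train-test split
```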
For the recordings from the controlled and uncontrolled sessions, the personalized models achieved mean classification accuracies of 71% and 74%, respectively, validating multiple possible future use cases of the data set, such as building a personalized cognitive load assistant. However, the lowest classification accuracy was close to chance level for both sessions (48% for controlled and 54% for uncontrolled), which suggests that further pre-processing at an individual level might be needed for some participants’ data, highlighting the potential to develop and test advanced algorithms on this data set. These results are in line with the F1, precision, and recall values, which average 82%, 76%, and 90% for the data from the uncontrolled environments and 71%, 69%, and 74% for the data from the controlled environment. The better results in the uncontrolled environments might be explained by participants being more strongly engaged in their actual, personal tasks there than in the experimental tasks in the controlled environments, which were of little-to-no personal importance to them.
Usage Notes
The dataset can be used for multiple use cases depending on individual research questions. The ‘Raw’ data can support a broad range of signal processing research questions on obtaining good-quality data from consumer-grade devices for both PPG and EEG sensors. In particular, the ‘Raw’ EEG data can be used to analyze the usability of few-channel EEG compared to conventional clinical-grade EEG recordings. The ‘Preprocessed’ data is a good source for the deep learning community working on unsupervised multi-class mental workload classification, or on time-series data in general. Furthermore, the ‘Features’ provided with the data set can be used to investigate the contribution of each feature and to answer the research question of whether multi-modality is necessary in uncontrolled environments. In line with recent work on multi-task learning30, this data set can serve as a control group for disease- and non-disease-specific analyses by drastically increasing the available reference data.
Code availability
The Python code used to load the raw data and synchronize it based on the shaking and tapping protocol is available at https://github.com/HPI-CH/UNIVERSE. Furthermore, the repository includes the Python code for the preprocessing and feature extraction techniques, as well as for the machine learning models used to validate the data set. The repository is self-contained and documents its own usage.
References
Masi, G., Amprimo, G., Ferraris, C. & Priano, L. Stress and workload assessment in aviation-a narrative review. Sensors 23, 3556, https://doi.org/10.3390/s23073556 (2023).
Hemakom, A., Atiwiwat, D. & Israsena, P. ECG and EEG based machine learning models for the classification of mental workload and stress levels for women in different menstrual phases, men, and mixed sexes. Biomedical Signal Processing and Control 95, 106379, https://doi.org/10.1016/j.bspc.2024.106379 (2024).
Thielmann, B., Schumann, H., Botscharow, J. & Böckelmann, I. Subjective perceptions of workload and stress of emergency service personnel depending on work-related behavior and experience patterns. Notfall+ Rettungsmedizin 25, 15–22, https://doi.org/10.1007/s10049-022-01076-y (2022).
Reich-Stiebert, N., Froehlich, L. & Voltmer, J.-B. Gendered mental labor: A systematic literature review on the cognitive dimension of unpaid work within the household and childcare. Sex Roles 88, 475–494, https://doi.org/10.1007/s11199-023-01362-0 (2023).
Hassard, J., Teoh, K., Visockaite, G., Dewe, P. & Cox, T. The cost of work-related stress to society: A systematic review. Journal of Occupational Health Psychology https://doi.org/10.1037/ocp0000069 (2018).
Þórarinsdóttir, H., Kessing, L. V. & Faurholt-Jepsen, M. Smartphone-based self-assessment of stress in healthy adult individuals: A systematic review. Journal of Medical Internet Research https://doi.org/10.2196/jmir.6397 (2017).
Epel, E. S. et al. More than a feeling: A unified view of stress measurement for population science. Frontiers in neuroendocrinology 49, 146–169, https://doi.org/10.1016/j.yfrne.2018.03.001 (2018).
Sharma, L. D. et al. Evolutionary inspired approach for mental stress detection using EEG signal. Expert Systems with Applications 197, https://doi.org/10.1016/j.eswa.2022.116634 (2022).
Garcia-Ceja, E., Osmani, V. & Mayora, O. Automatic stress detection in working environments from smartphones’ accelerometer data: A first step. IEEE Journal of Biomedical and Health Informatics 20, 1053–1060, https://doi.org/10.1109/JBHI.2015.2446195 (2016).
Anusha, A. S. et al. Electrodermal activity based pre-surgery stress detection using a wrist wearable. IEEE Journal of Biomedical and Health Informatics 24, 92–100, https://doi.org/10.1109/JBHI.2019.2893222 (2020).
Loh, H. W. et al. Application of photoplethysmography signals for healthcare systems: An in-depth review. Computer Methods and Programs in Biomedicine 216, 106677, https://doi.org/10.1016/j.cmpb.2022.106677 (2022).
Ahern, S. & Beatty, J. Pupillary responses during information processing vary with scholastic aptitude test scores. Science 205, 1289–1292, https://doi.org/10.1126/science.472746 (1979).
Shakti, D. et al. EEG as a tool to measure cognitive load while playing Sudoku: A preliminary study. In 2019 3rd International Conference on Electronics, Materials Engineering & Nano-Technology (IEMENTech), 1–5, https://doi.org/10.1109/IEMENTech48150.2019.8981192 (2019).
Kane, M. J. & Engle, R. W. The role of prefrontal cortex in working-memory capacity, executive attention, and general fluid intelligence: an individual-differences perspective. Psychonomic Bulletin & Review 9, 637–671, https://doi.org/10.3758/bf03196323 (2002).
Stroop, J. R. Studies of interference in serial verbal reactions. Journal of Experimental Psychology 18, 643–662, https://doi.org/10.1037/h0054651 (1935).
Hart, S. G. & Staveland, L. E. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology, vol. 52, 139–183, https://doi.org/10.1016/S0166-4115(08)62386-9 (Elsevier, 1988).
Betella, A. & Verschure, P. F. The affective slider: A digital self-assessment scale for the measurement of human emotions. PloS one 11, e0148037, https://doi.org/10.1371/journal.pone.0148037 (2016).
Ouwehand, K., Kroef, A. v. d., Wong, J. & Paas, F. Measuring cognitive load: Are there more valid alternatives to Likert rating scales? In Frontiers in Education, vol. 6, 702616, https://doi.org/10.3389/feduc.2021.702616 (Frontiers Media SA, 2021).
Mehrabian, A. & Russell, J. A. An approach to environmental psychology (The MIT Press, 1974). ISBN 978-0-262-13090-5.
Herdick, A., Musmann, F., Sasso, A., Albert, J. & Arnrich, B. Jointly: A python package for synchronizing multiple sensors with accelerometer data, https://doi.org/10.5281/zenodo.5833858 (2022).
Urigüen, J. A. & Garcia-Zapirain, B. EEG artifact removal-state-of-the-art and guidelines. Journal of Neural Engineering 12, 031001, https://doi.org/10.1088/1741-2560/12/3/031001 (2015).
Moontaha, S., Kappattanavar, A., Hecker, P. & Arnrich, B. Wearable EEG-based cognitive load classification by personalized and generalized model using brain asymmetry. In Proc. of the 16th Int. Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF, 41–51, https://doi.org/10.5220/0011628300003414 (2023).
Anders, C., Moontaha, S., Real, S. & Arnrich, B. A dataset on unobtrusive measurement of cognitive load and physiological signals (EEG, PPG, EDA) in uncontrolled environments. Zenodo, https://doi.org/10.5281/zenodo.10371068 (2023).
Anders, C., Curio, G., Arnrich, B. & Waterstraat, G. Optimization of data pre-processing methods for time-series classification of electroencephalography data. Network: Computation in Neural Systems 34, 374–391, https://doi.org/10.1080/0954898X.2023.2263083 (2023).
Tamura, T., Maeda, Y., Sekine, M. & Yoshida, M. Wearable photoplethysmographic sensors-past and present. Electronics 3, 282–302, https://doi.org/10.3390/electronics3020282 (2014).
Gashi, S. et al. Detection of artifacts in ambulatory electrodermal activity data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 44:1–44:31, https://doi.org/10.1145/3397316 (2020).
Nasseri, M. et al. Signal quality and patient experience with wearable devices for epilepsy management. Epilepsia 61, S25–S35, https://doi.org/10.1111/epi.16527 (2020).
Böttcher, S. et al. Data quality evaluation in wearable monitoring. Scientific Reports 12, 21412, https://doi.org/10.1038/s41598-022-25949-x (2022).
Regalia, G., Resnati, D. & Tognetti, S. Sensors on the wrist. In Narayan, R. (ed.) Encyclopedia of Sensors and Biosensors (First Edition), 1–20, https://doi.org/10.1016/B978-0-12-822548-6.00130-8 (Elsevier, 2023).
Dai, R. et al. Multi-task learning for randomized controlled trials: a case study on predicting depression with wearable data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 1–23, https://doi.org/10.1145/3534591 (2022).
Hasan, T., Zaman, S., Wesley, A., Tsiamyrtzis, P. & Pavlidis, I. Sympathetic activation in deadlines of deskbound research - a study in the wild, https://doi.org/10.1145/3544549.3585585 (2023).
Hinss, M. F. et al. Open multi-session and multi-task EEG cognitive dataset for passive brain-computer interface applications. Scientific Data 10, 85, https://doi.org/10.1038/s41597-022-01898-y (2023).
Wang, Y., Duan, W., Dong, D., Ding, L. & Lei, X. A test-retest resting, and cognitive state EEG dataset during multiple subject-driven states. Scientific Data 9, 566, https://doi.org/10.1038/s41597-022-01607-9 (2022).
Hosseini, S. et al. A multimodal sensor dataset for continuous stress detection of nurses in a hospital. Scientific Data 9, 255, https://doi.org/10.1038/s41597-022-01361-y (2022).
Coşkun, B. et al. A physiological signal database of children with different special needs for stress recognition. Scientific Data 10, 382, https://doi.org/10.1038/s41597-023-02272-2 (2023).
Kang, S. et al. K-EmoPhone: A mobile and wearable dataset with in-situ emotion, stress, and attention labels. Scientific Data 10, 351, https://doi.org/10.1038/s41597-023-02248-2 (2023).
Zaman, S. et al. Stress and productivity patterns of interrupted, synergistic, and antagonistic office activities. Scientific Data 6, 264, https://doi.org/10.1038/s41597-019-0249-5 (2019).
Moontaha, S., Schumann, F. & Arnrich, B. Online learning for wearable EEG-based emotion classification. Sensors 23, https://doi.org/10.3390/s23052387 (2023).
Acknowledgements
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Projektnummer 491466077. This research was (partially) funded by the HPI Research School on Data Science and Engineering.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Contributions
Conceptualization, S.M., C.A. and S.R.; Data curation, S.M., C.A., and S.R.; Formal analysis, S.M., C.A., and S.R.; Investigation, S.M. and C.A.; Methodology, S.M. and C.A.; Project administration, S.M., C.A. and B.A.; Resources, S.M., C.A. and B.A.; Software, S.M., C.A. and S.R.; Supervision, S.M., C.A. and B.A.; Validation, S.M. and C.A.; Visualization, S.M. and C.A.; Writing - original draft, S.M. and C.A.; Writing - review & editing, S.M., C.A., S.R., and B.A. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Anders, C., Moontaha, S., Real, S. et al. Unobtrusive measurement of cognitive load and physiological signals in uncontrolled environments. Sci Data 11, 1000 (2024). https://doi.org/10.1038/s41597-024-03738-7