Background & Summary

Daily workload can be an occupational stressor for many professionals, including healthcare staff, aviation personnel1, students, performance-evaluated workers, and many more2,3, and it even applies to unpaid work such as household or childcare4. Additionally, prolonged cognitive load can lead to stress and, in turn, to many stress-related diseases. The associated adverse consequences affect both the stressed individual and the economies of the world5. To provide individuals with an essential self-help toolkit, the World Health Organization (WHO) published an illustrated management guide in more than 25 languages (https://www.who.int/publications/i/item/9789240003927). The prerequisite to utilizing cognitive load management techniques is that individuals assess their load levels in a timely and accurate manner, which, despite the aforementioned stress management tools at hand, many fail to do6. Moreover, distinguishing between cognitive overload and stress in situ is often challenging. Reasons include the occupation of mental resources with the task at hand, an overestimation of personal thresholds, or a fear of repercussions for communicating a personal boundary, amongst others7. Besides self-assessment, two further options exist to assess an individual’s workload level: classification by an expert and objective classification. Expert-based assessments require time and extensive training; the associated costs are therefore high, rendering expert-based classification unrealistic for broad adoption. Objective classification has been researched extensively and is commonly based on patterns in individuals’ biomarkers (e.g., quantifying cortisol levels in saliva). As gold-standard methods require either considerable time or highly specialized and expensive hardware, most of them are equally unusable for broadly adopted real-time detection and management applications.

For the aforementioned reasons, wearable low-cost sensors are researched extensively. Methods based on wearable sensors that acquire physiological signals can detect mental workload and stress utilizing electroencephalography, acceleration, electrodermal activity, and photoplethysmogram data, amongst others8,9,10,11. However, much of the existing body of research is limited by either (i) utilizing only already publicly available data, (ii) considering only one physiological modality, (iii) performing studies only in unrealistic controlled environments, (iv) utilizing non-standard questionnaires, (v) considering only predetermined tasks, (vi) lacking a proper sensor data synchronization protocol, or (vii) collecting data only from a small number of participants. Table 1 compares the present work to other publicly available data sets of similar purpose.

Table 1 The comparison between the presented work and related work underlines the trend in the research area of mental state classification to utilize multimodal data, preferably labeled using scientifically validated questionnaires.

The broader goals of creating this data set were to overcome these limitations and to enable the research community to investigate the classification and regression of cognitive load levels based on physiological signals, both in controlled and uncontrolled environments. A wide range of participant-chosen tasks was permitted, with the constraint of being as close as possible to realistic home-office tasks. Labels were collected in a timely manner, and three forms of the data set are provided: a) the fully raw and unprocessed data, b) a minimally pre-processed version, synchronized between the time series of the individual sensors and already broken down into individual tasks with associated labels, and c) a multitude of extracted hand-crafted features, ready to be utilized in machine learning models. Scripts were made publicly available at https://github.com/HPI-CH/UNIVERSE.

Additional publicly available artefacts include recording notes, the parameters for each individual’s data synchronization, log files, experimental configurations, notes collected during recordings in the controlled environment on excessive movements or temporary losses of the Bluetooth connection (these occurred rarely and usually lasted four to eight seconds), and participants’ notes in PDF form. The actual sensor data is available as recorded, in a synchronized and minimally pre-processed format, and as extracted features.

Figure 1 provides an overview of the developed study design and the assays used, while detailed information on the experimental design is given in ‘Experimental Design’ and data characteristics are described in detail in ‘Data Records’. Aside from cognitive load classification, this data can be reused to answer various research questions, such as reaction time estimation, biomarker analysis, and artefact identification in physiological data from ambulatory settings, amongst others.

Fig. 1

(A) Study protocol for the experiments in the controlled laboratory environment. Participants performed two sessions of approximately two hours each, with each task lasting ten minutes and no rest between tasks. Before each task transition, participants answered questionnaires designed to rate their mental state during the previous task. (B) Study protocol for the experiments in the uncontrolled environments. Participants performed self-guided sessions until approximately four hours of data had been recorded in total. (C) In the first appointment, each participant was instructed on proper sensor placement, after which the first controlled recording took place. Subsequent recordings in self-chosen environments followed at participant-chosen times, while the participants temporarily had the sensors in their possession. The data recording per participant terminated with a final recording in the controlled environment, resulting in a temporal structure of A-B+-A, spanning a variable number of days and participant-chosen recording durations.

Methods

Ethics Statement

Ethical approval was obtained from the Institutional Review Board (IRB) of the University of Potsdam (application number 36/2022). Study information sheets were sent to potential participants weeks before they participated in the study, and participants had sufficient opportunity to ask questions or raise concerns before, during, and after the study. Written consent was obtained both for participation in this study and for the publication of anonymized data. It was communicated to the participants that they could drop out of the study at any time and without negative consequences, which one participant chose to do while collecting data in the self-chosen environments.

Participants & Demographics Information

Data collection was conducted during the summer, autumn, and winter of 2022, as well as the spring of 2023, after advertisements had been sent out via mailing lists. Inclusion criteria required participants to be aged 18 to 68, fluent in English, have normal or corrected-to-normal vision, know how to use a smartphone, and regularly perform work that is performance-evaluated (e.g. students or employees). Potential participants were excluded if they were retired or regularly needed to take medication supporting the treatment of a neurological disease such as depression, brain damage, or similar. In total, 24 participants agreed and were eligible to participate. One participant dropped out for personal reasons during the data collection in the self-chosen environments, and two additional participants recorded incomplete data sets, as the participant-controlled data collection failed. However, the laboratory data of these three participants is mostly complete. The data is nearly balanced across biological sex (11 female participants). The participants were well educated: the majority held the equivalent of a master’s degree or higher (seventeen participants), and every participant had at least a bachelor’s degree. Participants were aged 24 to 61 years, with a mean age of 29.5 years (± 8.2 years). All participants were right-handed. The distribution of countries of origin is depicted in Fig. 2a; the participants were from Germany (10), Brazil (3), Bangladesh (2), Chile (2), India (2), Ecuador (1), Egypt (1), Mexico (1), Peru (1), and the USA (1).

Fig. 2

Metadata on (a) the countries of origin of the participants, and (b) the duration of data collected for each participant in the controlled environment and in the uncontrolled environments (distinguished by colour in the figure). As can be seen in (a), the majority of participants identified Europe, South America, or Asia as their continent of origin, while the minority of participants identified North America or Africa as their continent of origin. As depicted in (b), overall approximately 120 hours of recordings in the controlled environment and approximately 194 hours of recordings in the uncontrolled environments were collected in total for both the Muse S headband and the Empatica E4 watch.

Experimental Design

To overcome the existing limitations in the body of research, an experimental design was developed that includes data collection on varying days, at varying times of day, and in varying locations. As a result, the data set at hand provides, for each participant, data recorded on different days and at different times of day, allowing the investigation of the fluctuation of physiological signals for each individual in unnatural (controlled laboratory) as well as natural (realistic work-from-home) environments.

Controlled Environment

In the laboratory setting, in designated rooms at the chair for Digital Health - Connected Healthcare at the Hasso Plattner Institute, the Digital Engineering Faculty of the University of Potsdam, participants were instructed to follow the study design depicted in sub-figure A of Fig. 1. Explanations of each respective task were given to the participants before they followed the instructions given by the study platform. Participants sat comfortably in a temperature-controlled room with floor-to-ceiling windows facing a green backyard garden, at a distance of 80 cm from a Full HD PC monitor. The laboratory rooms for the controlled recordings had been chosen in a rarely frequented wing, and the study directors had temporarily pasted notes on all the doors of that wing, reminding passersby to a) talk, if at all, at a low volume, and b) not to enter the recording rooms. Due to these notes and the sound-absorbing properties of the doors, few distractions reached the participants while performing the tasks. If, by accident, someone entered the room, the experimenter present in the recording room got up quietly and immediately escorted the entering person outside, explaining that they were unable to use that room for the time being and ensuring the interruption was kept at low volume and of short duration. Thereby, interruptions during the recordings in the controlled environment were kept to a minimum (approximately five interruptions noticed by the participants) and noted down in the experimental notes. The experimenters sat at another workplace in the same room, far enough away to ensure the participant’s comfort and give a sense of privacy, while still allowing questions to be raised if necessary. To avoid participant-driven breaks, participants were reminded to use the bathroom before the recordings, while being encouraged to notify the experimenters and take a bathroom break during the recording if necessary. Participants were given drinks of their choice (water, different teas, or coffee), and the choice was noted down in the session notes if it deviated from water. Subsequently, participants were instructed on how to use the devices on their own while the study director ensured proper device fit, and the Python PsychoPy application (v2022.2.1, developed in Python 3.8) was started.

During the study, the participants were mainly guided by the PsychoPy application. Only for two activities, the synchronization of the PsychoPy application with the sensors at the beginning and at the end of the experiment, did the platform ask the participants to call the study directors for assistance. This synchronization was performed by fast-paced tapping on the spacebar with the hand on which the Empatica E4 watch was worn, the non-dominant (i.e. left) hand. Before the experiment, participants had been shown how to perform the fast-paced spacebar tapping by raising the hand with the Empatica E4 watch high above their head and quickly dropping it onto the spacebar, thereby generating high-acceleration data. This whole process was to be repeated four times for each of the synchronizations at the start and at the end of the experiment. During the instruction, all participants seemingly understood the procedure sufficiently well and tapped the training keyboard properly. However, during the actual experiments, some participants (in approximately ten individual sessions in total) were too careful with the recording equipment. It was later confirmed in personal discussions that some participants were afraid of damaging the keyboard. As a result, some participants performed slow-paced taps with small effective ranges of motion, resulting in acceleration patterns that are difficult to detect in the acceleration data. When these challenges occurred, the experimenters noted them in the recording notes. The study director sat at the desk opposite the participant, hidden behind two monitors, and took notes on excessive body movements or other anomalies that might have led to artefacts during the recordings, such as timestamps of drops in the Bluetooth connection or moments when the participant drank something.

The study protocol for the controlled environment is illustrated in subfigure A of Fig. 1. At the beginning, after synchronizing the sensors with each other and with the platform, the participants watched a relaxation video of scenic shots from Torres del Paine National Park, Chile, with relaxing music (https://www.youtube.com/watch?v=jXl1GbK5ZO8&t=3s). Next, a set of questionnaires and an eye-closing session were administered to assess the participants’ baseline affective state and physiological signals. For the remainder of the recording, the participants performed four different tasks at two difficulty levels, easy and hard, each of ten minutes’ duration and in random order.

These four tasks encompassed (i) Mental Arithmetic, (ii) Sudoku, (iii) the N-back Task, and (iv) the Stroop Task. (i) The Mental Arithmetic Task is a well-known cognitive load-inducing technique12. Here, participants are required to solve mathematical problems mentally, without support from writing instruments or calculators. The study was designed to include simple addition and subtraction, potentially involving carry numbers, for the easy level. For the hard level, complex calculations with multiple-digit multiplication and division had to be performed. Operands ranged from -100 to 100, and the specific tasks, answers, and reaction times were logged during each experiment. (ii) A digital version of the game Sudoku13, which can sometimes be found pre-installed on Linux and Windows systems, was utilized in this study. Two of the four default difficulty levels provided by the application (https://wiki.gnome.org/Apps/Sudoku) were chosen: easy and hard. Participants filled a 9 × 9 grid with the numbers 1 to 9 such that each row, column, and 3 × 3 subsection contained all numbers, under the constraint that the same number cannot occur twice in the same row, column, or subsection. If the participants solved the puzzle before the time was up, they were instructed to play further rounds of the same difficulty level until the ten-minute time limit was reached and the application terminated automatically. (iii) The N-back task14 requires memory-sequencing of coloured rectangular blocks shown on the screen. The participant had to match the colour of the current stimulus with the colour of the stimulus n elements earlier. The application was configured to depict six distinct colours (amongst them white), with n = 1 and n = 2 defining the easy and hard difficulty levels, respectively. The participants had two seconds to give their answer before inactivity was rated as a missed trial. Colours, answers, and reaction times were logged during each experiment. (iv) To stress the control processes of working memory, the Stroop task15 was chosen. In this task, a sequence of single colour words appears on the screen, each rendered in a font colour that either matches or differs from the written word. The participants had to recognize the font colour, ignoring the written word, and type the first letter of the font colour’s name (e.g. y for yellow). In total, four colours were utilized. On the easy level, the participants had a maximum of 5.5 seconds to answer, whereas on the hard level, the answer had to be given within 1.5 seconds. Colours, answers, and reaction times were logged during each experiment.
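To make the task logic concrete, the following minimal sketch illustrates the Stroop trial mechanics described above. It is not the study’s PsychoPy implementation (which is available in the repository linked above); the colour set beyond yellow, the only colour confirmed by the text, is a placeholder assumption.

```python
import random

# Assumed colour set: only "yellow" is confirmed by the text; the other
# three colours are illustrative placeholders.
COLOURS = ["yellow", "red", "green", "blue"]

def make_stroop_trial(congruent_probability: float = 0.5):
    """Return (word, font_colour): the written colour word and the font
    colour it is rendered in, which may or may not match."""
    word = random.choice(COLOURS)
    if random.random() < congruent_probability:
        font_colour = word
    else:
        font_colour = random.choice([c for c in COLOURS if c != word])
    return word, font_colour

def is_correct(font_colour: str, typed_key: str, response_time_s: float,
               time_limit_s: float) -> bool:
    """A response counts as correct if the typed key is the first letter of
    the font colour's name and arrived within the level's time limit
    (5.5 s on the easy level, 1.5 s on the hard level)."""
    return typed_key == font_colour[0] and response_time_s <= time_limit_s
```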

Each task, except Sudoku, was preceded by a 45-second trial session. Objective labels and data were logged by the PsychoPy application (e.g. task difficulty, start and end timestamps, task-specific information, and more). Subjective labels were provided by the participants as answers to the NASA-TLX, PANAS, affective slider, and Likert scale questionnaires after the relaxation video and at the end of all tasks. For reasons of time, in between the individual tasks the participants answered only the pair-wise NASA-TLX questionnaire16 and the affective sliders17, and reported their mental workload and stress level during the previous task on a Likert scale18. The pair-wise NASA-TLX questionnaire quantifies the subjective mental workload of a given task on six continuous sub-scales, each ranging from 0 to 100, covering mental demand, frustration, physical demand, temporal demand, performance, and effort. For these sub-scales, additional weights were derived from pair-wise comparisons of the questions. The affective sliders measure subjective ratings of pleasure and arousal on two separate sliders. The sliders were designed with visual bipolar affective states conveyed through emoticons19 for rating the current emotion and required no written explanation, although they were covered before the experiment in a video-guided explanation of the whole study paradigm. The participants also rated their subjective mental workload and stress levels on two 5-point Likert scales with the options “very very low”, “low”, “nor low nor high”, “high”, and “very very high”.
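For readers unfamiliar with the weighted (pair-wise) NASA-TLX scoring, the following sketch shows the standard computation: each sub-scale receives a weight equal to the number of times it was chosen in the 15 pairwise comparisons, and the overall workload is the weight-averaged rating. This is illustrative and not taken from the study’s code base.

```python
# Standard weighted NASA-TLX computation (illustrative sketch).
SUBSCALES = ["mental", "physical", "temporal", "performance", "effort", "frustration"]

def weighted_tlx(ratings: dict, pairwise_choices: list) -> float:
    """ratings: sub-scale -> rating in [0, 100];
    pairwise_choices: the 15 sub-scales chosen as more important."""
    assert len(pairwise_choices) == 15
    weights = {s: pairwise_choices.count(s) for s in SUBSCALES}
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15.0

# Example: a task rated highly on mental demand and effort.
ratings = {"mental": 80, "physical": 10, "temporal": 55,
           "performance": 40, "effort": 70, "frustration": 35}
choices = (["mental"] * 5 + ["effort"] * 4 + ["temporal"] * 3
           + ["performance"] * 2 + ["frustration"] * 1)  # 15 comparisons
print(weighted_tlx(ratings, choices))  # => 64.0
```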

After the recording, the information on how to use the devices in the uncontrolled environments was repeated, the participants received a printed handout illustrating proper sensor fit and detailing the steps required for data acquisition, and an appointment for the second recording in the controlled environment was made.

Uncontrolled Environment

For the recordings in the uncontrolled environments, participants were free to choose where, when, and on which tasks they wanted to record. The only strict requirement was to aim for a balanced distribution between tasks they would consider low load and tasks they would consider high load. A printout with the steps to be followed, handed out to each participant, served as a guideline to ensure sufficient data quality: it instructed the participants to perform the synchronization shaking protocol described in ‘Data Synchronization’, to perform sensor fit checks as illustrated, to start with an eye-closing session of one minute, and to record data for a variable duration of 15 to 45 minutes at a stretch. Most of the participants followed the instructions on shaking the devices (some, however, at rather low intensity and velocity), on the eye-closing protocol (more than 90%), and on the distribution of low-vs-high load tasks. The most prominent tasks were reading or searching for information (e.g. documentation or research papers; ≈ 26.75%), coding (various difficulties and programming languages; ≈ 16.74%), and processing data (e.g. performing data analysis, project planning based on data, doing mathematical calculations; ≈ 16.74%), while the least prominent tasks were relaxation (e.g. meditation; ≈ 0.5%), preparing a presentation (≈ 1.9%), and attending a meeting (≈ 2.8%). Other tasks performed by the participants encompassed playing a game (≈ 7.8%), responding to emails (≈ 8.6%), watching a video (e.g. learning a new skill; ≈ 8.6%), and typing (e.g. summarizing publications or writing a new manuscript; ≈ 9.7%).

Recording Devices

Two wearable devices capable of recording physiological signals were used in this study: the Muse S headband and the Empatica E4 watch, shown in Fig. 3. The Muse S headband can measure electroencephalography (EEG), accelerometer (ACC), gyroscope (GYRO), and photoplethysmography (PPG) data. EEG data was recorded at 256 Hz at the five sensor locations AF7, AF8, TP9, TP10, and FpZ according to the international 10-20 system, depicted in detail in Fig. 3b. ACC and GYRO data were sampled at 50 Hz, and PPG data could have been sampled at 64 Hz. However, to preserve battery life and the stability of the Bluetooth connection, PPG data was not recorded from the headband; as a consequence, PPG data was recorded only with the Empatica E4 watch. Recordings from the Muse S headband were started, and the data collected, via the third-party app Mind Monitor. This app provided many additional features such as power band values, amongst them the Horse Shoe Indicator (HSI) for each electrode location, which reflects three states of electrode connection quality (0.0: no connection, 1.0: good connection, 2.0: poor connection) and could be used in future studies to correlate with the signal quality indices computed in this work. Subsequent pre-processing steps were performed on the raw data; the band-power features computed by the third-party app were excluded from the subsequent analysis but made publicly available as well. The Empatica E4 can measure temperature (TEMP), photoplethysmography (PPG), electrodermal activity (EDA), and acceleration (ACC) data using the accompanying E4 realtime app provided by the manufacturer for iOS and Android devices, or the E4 streaming server for Windows. TEMP and EDA data were collected at a sampling rate of 4 Hz, PPG data at 64 Hz, and ACC data at 32 Hz.

Fig. 3

Figure 3a shows a study director wearing the Muse S headband and the Empatica E4 watch. Figure 3b was adapted with permission from38, and shows the electrode positions of the Muse S headband dry-electrode sensors according to the international 10-20 system.

Two further recording devices were utilized during the study: a Google Pixel phone (Pixel 3a; Android 12), on which the data collection apps had been installed, and a personal computer (PC) used to display the study platform PsychoPy. After data collection in the controlled environment, the respective Google Pixel phone was handed out to the participant alongside the Muse S headband and the Empatica E4 watch for the subsequent data collection in the self-chosen environments. The PC ran Ubuntu 20.04.5 and had 4 cores and 8 GB of RAM. An extensive logging functionality had been implemented, which logged a multitude of information, including synchronization taps, questionnaire answers (e.g. for NASA-TLX, PANAS, affective sliders, and Likert scales), eye-closing timestamps, and task-specific information such as correct answers, response times, and the operators in the specific task, amongst others. For redundancy, the logging was directed to two files: a .csv file and a .log file. In two recordings, this redundancy was needed, as one of the files had not been written properly due to an application error, while the other file had been stored correctly on disk. As a consequence, for two of the recordings in the controlled environments, the log files had to be used to derive the task labels, and for two further recordings, the physiological signals had to be interpolated at the end of the recordings due to Bluetooth connection problems with the sensors.

Data Synchronization

The internal clocks of (wearable) devices can run at different speeds than those of reference systems, resulting in a phenomenon called clock drift; wearable sensors are no exception to this. This situation is worsened by the circumstances that wearable sensors can have different time zone settings and that computations on floating-point representations of timestamps are performed at varying degrees of accuracy. For these reasons, the internal clocks of devices need to be resynchronized. To ensure the synchronicity of data recorded from different devices and platforms, multiple solutions exist, such as the Lab-Streaming Layer (LSL; https://github.com/sccn/labstreaminglayer). However, this solution requires the availability of a platform that receives the streamed data, which is not guaranteed in the uncontrolled environments chosen by participants. For these circumstances, a shake-based protocol was developed, and participants were asked to follow it closely. When starting a session in the self-chosen environment, participants had to start the recordings, place the wearable sensors flat on a surface and wait for six to twelve seconds, then take both the Muse S headband and the Empatica E4 watch together and shake them vigorously for about twelve seconds, and finally place the devices on a flat surface again and wait for six to twelve seconds before starting with the actual task. This procedure was to be repeated at the end of the recording. By resting both devices on a flat surface twice and shaking them simultaneously in between, very clear and similar patterns of acceleration and gyroscope data were collected. Figure 4 provides an overview of the resulting accelerometer data. After loading the Muse S and Empatica E4 data utilizing devicely in version 1.1.1 (https://pypi.org/project/devicely/1.0.2/), subsequent peak detection allowed the time series to be aligned after clock drift, using the Python-based synchronization package Jointly20 in version 1.0.4. However, a few participants omitted this step during the recordings in the self-chosen environments, and a few of the accelerometer recordings in the uncontrolled environments were therefore difficult to align. For this reason, and to ensure the same pre-processing steps across the published data, synchronization for the data recorded in the uncontrolled environments was performed based on the timestamps given by the wearables, while the data from the controlled environments was synchronized using Jointly.
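The following sketch illustrates the underlying idea of shake-based alignment. The published data was synchronized with Jointly20, so this is a simplified, illustrative re-implementation that assumes both acceleration magnitude signals have already been resampled to a common rate and ignores within-session clock drift.

```python
import numpy as np
from scipy.signal import correlate

def acc_magnitude(acc_xyz: np.ndarray) -> np.ndarray:
    """Z-scored Euclidean norm of the three accelerometer axes
    (shape (n_samples, 3)); shakes dominate this signal."""
    mag = np.linalg.norm(acc_xyz, axis=1)
    return (mag - mag.mean()) / mag.std()

def estimate_offset_seconds(mag_a: np.ndarray, mag_b: np.ndarray,
                            fs: float = 50.0) -> float:
    """Lag (in seconds) that best aligns sensor B to sensor A, found via
    cross-correlation of the shake-dominated magnitude signals; fs is the
    assumed common resampling rate."""
    xcorr = correlate(mag_a, mag_b, mode="full")
    lag = np.argmax(xcorr) - (len(mag_b) - 1)
    return lag / fs
```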

Fig. 4

Overview of the normalized acceleration magnitude calculated from the Muse S headband and the Empatica E4 watch, utilized for data synchronization between the wearable devices. The shake start and end times are prominently visible as abrupt changes of the acceleration magnitude, while the duration of the shaking activity is similar for both sensors at about 12 seconds. To form the pre-processed and labeled physiological data from the controlled sessions, the time series were aligned between the shakes using the Python package Jointly20. For the data recorded in the uncontrolled sessions, the labeled data in the ‘Labeled’ folders of each participant was extracted using the timestamps from both wearable devices rather than the sensor synchronization using Jointly. This decision was made because some participants had forgotten to shake the devices at the beginning, at the end, or at both ends of the respective recording in the uncontrolled environments.

Data Processing

Two different formats of unprocessed data are available: (i) the raw data as recorded from the individual wearable devices and the study platform, and (ii) the data organized by tasks, which is the result of splitting the original (raw) time series data into the respective tasks performed by the participants. Task extraction was based either on a simple splitting technique using the timestamps from the devices for the uncontrolled environments (ii.1) or on a shake-detection protocol performed by the experimenters for each recording in the controlled environment (ii.2). Additionally, a simple data pre-processing pipeline was implemented for use in qualitative and quantitative evaluations, resulting in three data sets (i-pre), (ii.1-pre), and (ii.2-pre). Pre-processing of the EEG data included a Butterworth band-pass filter for the range of 0.5-50 Hz, implemented with the mne Python package (https://mne.tools/stable/generated/mne.filter.filter_data.html) and applied to remove high-frequency noise from muscle activation of the scalp and low-frequency disturbances such as heartbeats. An additional notch filter at 50 Hz removed the power-line interference from the signal. Furthermore, a movement filter was applied: the accelerometer data from the headband was filtered in the range of 0.5-20 Hz, and the acceleration magnitude was computed for each participant in a given controlled or uncontrolled session. By applying a binary search, a participant-wise threshold was computed that flags 3 to 5.25% of the data as high-acceleration, and the corresponding EEG data was subsequently interpolated. Additionally, the EEG data from the controlled sessions was normalized by min-max normalization and by removing the baseline obtained from the eye-closing session. In contrast, the data from the uncontrolled sessions went through min-max normalization only, as not all participants performed an eye-closing session as instructed. The data from both session types was average-referenced. Due to the lack of dedicated recording channels for ocular (electrooculogram, EOG), muscular (electromyogram, EMG), or cardiac (electrocardiogram, ECG) activity, the widely applied21 steps of EOG-, EMG-, and ECG-removal were not performed, and only other obvious artefacts (such as losses of contact or of the Bluetooth connection, amongst others) were interpolated using the mean of the neighbouring values. The raw blood volume pulse (BVP) data extracted from the PPG sensors underwent the same normalization as described for the EEG data. Subsequently, for further data cleaning, a Savitzky-Golay filter with filter order 4 and window length 31, implemented with the Scipy Python package (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html), was applied to the BVP data. The TEMP and EDA data were preprocessed by interpolating missing data. The preprocessing steps are summarized in Fig. 5.
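The following condensed sketch illustrates the described filtering and normalization steps, assuming EEG of shape (n_channels, n_samples) at 256 Hz and a one-dimensional BVP array. It simplifies the published pipeline (e.g. the movement filter and eye-closing baseline subtraction are omitted), and the choice of MNE’s IIR path to obtain a Butterworth design is our reading of the text.

```python
import numpy as np
import mne
from scipy.signal import savgol_filter

def preprocess_eeg(eeg: np.ndarray, sfreq: float = 256.0) -> np.ndarray:
    """eeg: float64 array of shape (n_channels, n_samples)."""
    # 0.5-50 Hz band-pass; method="iir" uses MNE's default Butterworth design
    eeg = mne.filter.filter_data(eeg, sfreq, l_freq=0.5, h_freq=50.0, method="iir")
    # 50 Hz notch against power-line interference
    eeg = mne.filter.notch_filter(eeg, sfreq, freqs=50.0, method="iir")
    # average reference across channels
    eeg = eeg - eeg.mean(axis=0, keepdims=True)
    # global min-max normalization (the study additionally subtracted the
    # eye-closing baseline for the controlled sessions)
    return (eeg - eeg.min()) / (eeg.max() - eeg.min())

def preprocess_bvp(bvp: np.ndarray) -> np.ndarray:
    """Savitzky-Golay smoothing with the stated order and window length."""
    bvp = savgol_filter(bvp, window_length=31, polyorder=4)
    return (bvp - bvp.min()) / (bvp.max() - bvp.min())
```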

Fig. 5

The preprocessing steps performed to clean the raw EEG data: a Butterworth filter, a notch filter, and baseline normalization were applied. The BVP data was preprocessed by applying a Savitzky-Golay filter and baseline normalization.

For extracting features from the preprocessed data of the four modalities, 60-second windows with 80% overlap were used. As described in the literature22, the power spectral density (PSD) of five commonly used frequency bands, namely δ (0.5-4 Hz), θ (4-7 Hz), α (8-12 Hz), β (12-30 Hz), and γ (30-50 Hz), was computed from each of the four EEG channels using Welch’s method with a Hann window function. The average of each band power over the four channels, denoted as mean-(δ, θ, α, β, γ), and the ratio θ/α were used as features. Additionally, asymmetry features were calculated by subtracting the log-transformed spectral power of the right hemisphere from that of the left hemisphere for each power band, denoted as frontal-α-asy and (δ, θ, α, β, γ)-asy. Using the NeuroKit2 Python package (https://neuropsychology.github.io/NeuroKit/), features were extracted from the cleaned BVP data. The time-domain features include the mean of the RR intervals (HRV-MeanNN), the standard deviation of the RR intervals (HRV-SDNN), and the root mean square of successive RR interval differences (HRV-RMSSD). The frequency-domain features include the normalized low-frequency (0.04 to 0.15 Hz) power (HRV-LFn), the normalized high-frequency (0.15 to 0.4 Hz) power (HRV-HFn), and the ratio between the two (HRV-LFn/HRV-HFn). From the EDA signal, the number of skin conductance response peaks (SCR-Peaks-N) and the mean amplitude of the peak occurrences (SCR-Peaks-Amplitude-Mean) were extracted. Furthermore, the mean (mean-temp) and standard deviation (std-temp) of the temperature were extracted from the cleaned TEMP signal.
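As an illustration of the spectral feature computation, the following sketch derives the mean band powers and the θ/α ratio for one window. The Welch segment length of four seconds is an assumption; the published feature extraction code should be considered authoritative.

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 4), "theta": (4, 7), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 50)}

def band_powers(window: np.ndarray, fs: float = 256.0) -> dict:
    """window: EEG of shape (n_channels, n_samples) for one 60-s window."""
    # Welch PSD with a Hann window; 4-s segments are an assumed parameter
    freqs, psd = welch(window, fs=fs, window="hann", nperseg=int(4 * fs))
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        # integrate the PSD over the band, then average across channels
        powers[f"mean-{name}"] = np.trapz(psd[:, mask], freqs[mask], axis=-1).mean()
    powers["theta/alpha"] = powers["mean-theta"] / powers["mean-alpha"]
    return powers
```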

Data Records

The data23 has been deposited on Zenodo (https://doi.org/10.5281/zenodo.10371068), run by the CERN Data Centre. Considering task-related data only, the data records comprise 60.30 hours from the Muse S headband and 60.37 hours from the Empatica E4 watch recorded in the controlled environment, and 94.85 hours from the Muse S headband and 99.33 hours from the Empatica E4 watch recorded in the uncontrolled, participant-chosen environments (≈ 315 hours in total). Information on the folder structure of the data repository is given in Fig. 6. Information on the additional data collected during the recordings in the controlled environment is provided in Table 2, while information on the data collected by the PsychoPy platform is given in Table 3. For the data recorded in the controlled environments, laboratory notes from the participants’ recordings and the task labels are located in all top-level folders. The ‘Raw’ top-level folders contain the data in the format as recorded: no analytical procedures were performed on this data, artefacts are present, and no features have been extracted. The PsychoPy platform logs are stored only in this folder. The PsychoPy logs were used to create the ‘Labeled’ folder, in which the data was split into the respective tasks performed per laboratory recording (‘Lab1’ and ‘Lab2’). The name of each sub-folder, e.g. ‘arithmetic_easy’, indicates the task performed (‘arithmetic’, i.e. mental calculations) and the difficulty level (‘easy’, i.e. designed to result in little-to-no load). The data recorded in participant-chosen uncontrolled environments is stored in the ‘Wild’ data folder. Here, in the ‘Raw’ top-level folder, the data collected during each of the participant-performed recordings is stored in folders numbered in ascending order from 1 to N. Data stored in the ‘Labeled’ folder is organized in sub-folders named according to the labels for stress and mental workload provided by the participants for the given recording, using the five-point Likert scales (ranging from “very very low” (“vlw”), “low” (“low”), “nor low nor high” (“nor”), and “high” (“hig”), to “very very high” (“vhg”)). That means, if a participant labeled the first recording in the self-chosen environment as “nor low nor high” in stress and “low” in mental workload, the data for this recording can be found in the sub-folder called “nor_stress_low_mw_1”.

Fig. 6

Overview of the data records. Per participant (i.e. UN_101 to UN_124), four top-level folders exist: ‘Raw’, ‘Labeled’, ‘Preprocessed’, and ‘Features’. The subfolders ‘Lab1’, ‘Lab2’, and ‘Wild’ refer to the first laboratory session with a participant (‘Lab1’), the recordings performed autonomously by the participant in self-chosen environments (‘Wild’), and the second laboratory session (‘Lab2’). The number of subfolders in the ‘Wild’-folder varies from participant to participant, as participants performed N recordings of varying duration. Only the contents of ‘Lab1’ and ‘Wild’ are depicted, as the structure of folders in ‘Lab1’ and ‘Lab2’ is the same. The ‘Raw’-top-level-folder contains the data as recorded. In the ‘Labeled’-folder, data is split into tasks and individual files for each modality are given. The ‘Preprocessed’-folder contains preprocessed data, while the ‘Features’-folder contains features extracted for each task.

Table 2 Examples of the contents of the files present for each participant in the two top-level folders ‘Raw’ and ‘Labeled’.
Table 3 Examples of the data collected by the PsychoPy platform for each of the recordings in the controlled environment for each participant.

The ‘Preprocessed’ data was preprocessed as described in ‘Data Processing’ and saved according to the extracted individual tasks and the eye-closing sessions of the labeled data. Furthermore, features were extracted using 60-second windows from the preprocessed data of the task folders only, excluding the eye-closing sessions, and stored in the ‘Features’ folder.
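As a convenience for reusers, the following sketch shows how the ‘Labeled’ folder hierarchy could be traversed. The per-modality file names matched by the glob pattern are hypothetical placeholders; Table 2 and the repository document the actual file contents.

```python
from pathlib import Path
import pandas as pd

def load_labeled_tasks(root: str, participant: str = "UN_101",
                       session: str = "Lab1") -> dict:
    """Map each task-folder name (e.g. 'arithmetic_easy') to a dict of its
    CSV contents, keyed by file stem. 'session' can be 'Lab1', 'Lab2', or a
    'Wild' sub-folder such as 'nor_stress_low_mw_1' nested under 'Wild'."""
    tasks = {}
    base = Path(root) / participant / "Labeled" / session
    for task_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        # file names are placeholders; any CSV in the task folder is loaded
        tasks[task_dir.name] = {csv.stem: pd.read_csv(csv)
                                for csv in task_dir.glob("*.csv")}
    return tasks
```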

Technical Validation

Data Quality Analysis

Commonly, data quality is quantified using metrics such as the signal-to-noise ratio (SNR). This approach works particularly well for clearly defined expected signals, such as event-related potentials based on somatosensory stimuli24. However, when the signal of interest is not well-defined, another option is to define what is regarded as noise and compute the SNR as a signal quality index (SQI) for each part of the signal of interest. For EEG data21, BVP data25, and EDA data26, many recommendations for data pre-processing exist, which give information on good and bad signals. Consequently, four different types of SQI were calculated for the EEG, BVP, EDA, and TEMP data, based on related work27.

For the EEG signal, the SQI was computed as SNR in decibels (dB). First, the power line interference was removed using a notch filter at 50 Hz, realized as a digital filter applied forward and backward. Second, the four-channel EEG data was average-referenced, retaining four channels. For each channel, the PSD was computed using Welch’s periodogram method, averaged across all ten-second windows (with five seconds of overlap) within each recording session. The resulting power spectra were averaged across channels. To quantify the signal, the SNR was computed as \(10\cdot {\log }_{10}(band\_power/noise\_power)\), with band_power referring to the individual bands of interest (i.e. Delta (δ, <4 Hz), Theta (θ, 4 Hz - 7 Hz), Alpha (α, 8 Hz - 12 Hz), Beta (β, 13 Hz - 30 Hz), and Gamma (γ, 31 Hz - 100 Hz, below the Nyquist frequency)) and noise_power referring to the power in the highest frequency band below the Nyquist frequency (i.e. 100 Hz - 125 Hz). Results are reported as SNR in dB in Table 4. As the results reported in Table 4 are averaged across all individuals and recordings of the different environments, and particularly as some participants took less care than others to ensure good signal quality during the recordings, Table 5 gives an overview of the individual data quality indices.
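A hedged re-sketch of this computation is given below. The authoritative implementation is published in the repository referenced at the end of this section; the exact Welch parameters here are assumptions consistent with the description.

```python
import numpy as np
from scipy.signal import welch

def eeg_band_snr_db(eeg: np.ndarray, fs: float = 256.0) -> dict:
    """eeg: notch-filtered, average-referenced data (n_channels, n_samples).
    Welch averages over 10-s windows with 5-s overlap, matching the text."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(10 * fs), noverlap=int(5 * fs))
    psd = psd.mean(axis=0)  # average across the four channels

    def power(lo: float, hi: float) -> float:
        mask = (freqs >= lo) & (freqs < hi)
        return psd[mask].sum()

    noise = power(100.0, 125.0)  # highest band below the Nyquist frequency
    bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
             "beta": (13, 31), "gamma": (31, 100)}
    return {name: 10 * np.log10(power(lo, hi) / noise)
            for name, (lo, hi) in bands.items()}
```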

Table 4 Averaged signal quality indices for the EEG data in decibels.
Table 5 Signal Quality Indices as described in ‘Data Quality Analysis’ in decibels (dB) for the EEG data, or in % of good quality data for EDA, BVP, and TEMP data.

For the BVP signal, the SQI was computed as a measure of spectral entropy (SE) for each minute. As the signal was collected at a 64 Hz sampling rate, bandpass filtering in the low-frequency ranges was possible below the Nyquist frequency of 32 Hz. First, a bandpass filter with a passband between 1 Hz and 3 Hz was applied. Subsequently, the SE-SQI was computed as a weighted sum over 15 non-overlapping 4-second windows: Welch’s periodogram method was applied to compute the PSD for each 4-second window of the recording, and the derived PSD was normalized such that the total power across all frequency bins (1.0 Hz to 3.0 Hz, at a frequency resolution of 0.25 Hz) totalled approximately 1.0. For each minute of the recordings, the PSD was averaged across the 15 four-second windows, and the weighted sum of the averaged power of each frequency bin, multiplied by the binary logarithm of that averaged power, was formed. The negative of this weighted sum, divided by the binary logarithm of the number of frequency bins (i.e. 9; [1.0, 1.25, …, 2.75, 3.0]), was stored as the SE-SQI. An SE-SQI value of less than 0.8 is regarded as a good signal, while values above 0.8 are regarded as noisy27. In total, 68.94% of the data collected in the controlled environments and 51.18% of the data collected in the uncontrolled environments are regarded as being of good signal quality.
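The following sketch mirrors this per-minute spectral entropy computation. It assumes one full minute of 64 Hz BVP per call and a fourth-order Butterworth band-pass (the filter order is an assumption); it is illustrative rather than the published implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

def bandpass_1_3_hz(bvp: np.ndarray, fs: float = 64.0) -> np.ndarray:
    """Band-pass between 1 and 3 Hz (assumed 4th-order Butterworth)."""
    b, a = butter(4, [1.0, 3.0], btype="band", fs=fs)
    return filtfilt(b, a, bvp)

def se_sqi(bvp_minute: np.ndarray, fs: float = 64.0) -> float:
    """Spectral entropy for one minute (60 * 64 samples) of band-passed BVP;
    values below 0.8 count as good signal quality."""
    windows = bvp_minute.reshape(15, int(4 * fs))   # 15 non-overlapping 4-s windows
    freqs, psd = welch(windows, fs=fs, nperseg=int(4 * fs), axis=-1)
    mask = (freqs >= 1.0) & (freqs <= 3.0)          # 9 bins at 0.25 Hz resolution
    p = psd[:, mask].mean(axis=0)                   # average PSD across windows
    p = p / p.sum()                                 # normalize total power to ~1
    return float(-(p * np.log2(p)).sum() / np.log2(mask.sum()))
```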

As with the EDA data, these results can partly be explained by some participants ensuring a better sensor fit than others, e.g. by wearing the Empatica E4 watch more tightly on the wrist of the non-dominant hand (confirmed in personal discussions after the participant-driven data recordings). For the EDA signal, the SQI was computed as the rate of amplitude change (RAC). The EDA signal, sampled by the Empatica E4 watch at 4 Hz, was analyzed for trends and abrupt changes in the skin conductance level, following the relative temporal progress of the time series. If at any given point in time t the EDA value was significantly higher than four samples earlier (i.e. one second earlier; \(x_{t} \geq 1.2\cdot x_{t-4}\)) or significantly lower than four samples earlier (\(x_{t} \leq 0.9\cdot x_{t-4}\)), the signal was graded as noisy and marked accordingly. If, however, no significant change of the EDA value occurred throughout the last 480 samples (i.e. one minute of data; \(| x_{t-480}-x_{t+i}| \leq 0.001\;\forall i\in [-480,0]\)), the signal of the whole last minute was graded as noisy and marked accordingly. In total, 81.18% of the EDA data recorded in the controlled environment and 94.00% of the EDA data recorded in the uncontrolled environments was graded as being of good signal quality. For the TEMP signal, the data was analyzed as in related work28. However, due to the suspicion that some participants forgot to take off the protective cap from the Empatica E4 watch sensors, a narrower temperature range (30.00 to 40.00 degrees Celsius) was chosen as good signal quality data to rule out ambient room temperature data29.
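A minimal sketch of the RAC grading and the TEMP range check follows, using the thresholds stated above (the 480-sample flatline window is taken literally from the text); the published code remains authoritative.

```python
import numpy as np

def rac_noisy_mask(eda: np.ndarray, flat_window: int = 480,
                   flat_eps: float = 0.001) -> np.ndarray:
    """Boolean mask marking 4 Hz EDA samples graded as noisy."""
    noisy = np.zeros(len(eda), dtype=bool)
    # abrupt amplitude changes relative to the value one second earlier
    for t in range(4, len(eda)):
        prev = eda[t - 4]
        if eda[t] >= 1.2 * prev or eda[t] <= 0.9 * prev:
            noisy[t] = True
    # flatline detection: no significant change over the stated window
    for t in range(flat_window, len(eda)):
        seg = eda[t - flat_window : t + 1]
        if np.all(np.abs(seg - seg[0]) <= flat_eps):
            noisy[t - flat_window : t + 1] = True
    return noisy

def temp_good_mask(temp: np.ndarray) -> np.ndarray:
    """TEMP samples within the plausible skin-temperature range (deg C)."""
    return (temp >= 30.0) & (temp <= 40.0)
```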

To reproduce these results, only the data as described in ‘Data Records’ and Fig. 6, Python in version 3.11.5, and the Python packages Numpy in version 1.26.2, Pandas in version 1.5.3, and Scipy in version 1.11.4 are required. The code to compute the respective SQI values (EEG-SQI-SNR, SE-SQI, and RAC-SQI) can be found online at https://github.com/HPI-CH/UNIVERSE.

Statistical Analysis

To validate the technical robustness of the data set, a statistical analysis was performed on the extracted features described in ‘Data Processing’, comparing easy and hard tasks across all participants in the controlled setup. Normality was tested using the Shapiro-Wilk test for each individual feature, for each set of easy and hard tasks, and for all tasks together; the test mostly suggested rejecting the null hypothesis, as in the example depicted in Fig. 7. The figure shows the Q-Q plot of the differences between high- and low-class features for all tasks. While some feature distributions followed normal distributions, most features have heavy tails, suggesting participant-dependent outliers in the feature set. Subsequently, a paired t-test was performed for normally distributed features, and a Wilcoxon signed-rank test otherwise, using the ttest_rel (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html) and wilcoxon (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html) functions, respectively.
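The per-feature test selection can be sketched as follows; the significance level used to accept normality is an assumption for illustration.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

def compare_easy_hard(easy: np.ndarray, hard: np.ndarray, alpha: float = 0.05):
    """easy/hard: one feature value per participant, paired by participant.
    Shapiro-Wilk on the paired differences decides between a paired t-test
    and a Wilcoxon signed-rank test."""
    diffs = hard - easy
    _, p_normal = shapiro(diffs)
    if p_normal > alpha:                      # normality not rejected
        stat, p = ttest_rel(hard, easy)
        test = "paired t-test"
    else:
        stat, p = wilcoxon(hard, easy)
        test = "Wilcoxon signed-rank"
    return test, stat, p
```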

Fig. 7

Q-Q plots of the differences between high and low feature values for all tasks. The feature names are displayed in unique colours inside the plot. Some features, e.g. frontal-α-asy and α-asy, follow a close-to-normal distribution. Many features have heavy tails, suggesting possible outliers in the data set.

The resulting p-values are depicted in Table 6. The power ratio feature has a p-value of less than 0.001 in all instances, establishing it as a strong feature for the data set. The many other instances with similarly small p-values underline that the data can strongly differentiate between the tasks, thereby validating the technical robustness of the data set.

Table 6 P-values derived after t-tests between easy and difficult versions of the same task for the controlled environments, computed as described in ‘Statistical Analysis’.

Machine Learning Analysis

Machine learning models were evaluated to classify between easy and hard tasks in the controlled environment, and between self-assessed high- and low-load tasks in the uncontrolled environments, to further validate the technical robustness of the data set. Before model building, the data went through the preprocessing steps described in ‘Data Processing’. Using the aforementioned features from the four modalities, Logistic Regression (LR) classifiers from the sklearn Python package (https://scikit-learn.org/stable/) were trained using K-fold cross-validation (CV) with K = 5 for the train-test split of the data. To tune the hyper-parameters of the models, a grid search was performed for each train-test split using the default CV and the following hyper-parameters: penalty l1 with solvers liblinear and saga, and penalty l2 with solvers liblinear, saga, newton-cg, and lbfgs, each with regularization strength C from logspace(-3, 3, 7). Only sessions providing data from all four signals (EEG, EDA, BVP, and TEMP) were considered for the models. For participants UN_112 and UN_120, only one controlled session was used for classification due to the unavailability of a second one. For the remaining 22 participants, the data from both controlled sessions was combined to develop a personalized model for each participant. The data was labeled with the defined task labels, easy and hard. For the recordings in the uncontrolled environments, the NASA-TLX scores reported by the participants were utilized to differentiate between easy and difficult tasks using a personalized threshold for each of the 21 participants with valid recordings. Due to the low quality of the BVP data, the data from UN_102 was left out of this comparison, but it should be considered in further evaluations in which fewer than four modalities are used for the models. The accuracy of the binary classification achieved by the models in the controlled and uncontrolled sessions is depicted in Fig. 8.
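The per-participant evaluation can be sketched as follows; shuffling and the random seed are assumptions, and X and y stand for one participant’s feature matrix and easy/hard labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import accuracy_score

# Hyper-parameter grid as stated in the text.
param_grid = [
    {"penalty": ["l1"], "solver": ["liblinear", "saga"],
     "C": np.logspace(-3, 3, 7)},
    {"penalty": ["l2"], "solver": ["liblinear", "saga", "newton-cg", "lbfgs"],
     "C": np.logspace(-3, 3, 7)},
]

def personalized_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """5-fold outer CV around a grid-searched LR (default inner CV)."""
    scores = []
    outer = KFold(n_splits=5, shuffle=True, random_state=0)  # assumed seed
    for train_idx, test_idx in outer.split(X):
        search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid)
        search.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], search.predict(X[test_idx])))
    return float(np.mean(scores))
```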

Fig. 8

Averaged results from the Machine Learning models, averaged for (a) controlled and (b) uncontrolled sessions. The dashed lines show the mean values across participants. For participants UN_102, UN_103, and UN_120, no classification models for (b) the data from the uncontrolled environment were computed.

For the recordings from the controlled and uncontrolled sessions, the personalized models achieved a mean classification accuracy of 71% and 74%, respectively, validating multiple possible future use cases of the data set, such as building a personalized cognitive load assistant. However, the lowest classification accuracy achieved was close to chance level for both session types (48% for controlled and 54% for uncontrolled), which suggests that further pre-processing at the individual level might be needed for some participants’ data, highlighting the potential to develop and test advanced algorithms on this data set. These results are in line with the values for F1 score, precision, and recall, which average 82%, 76%, and 90% for the data from the uncontrolled environments, and 71%, 69%, and 74% for the data from the controlled environments. The better results in the uncontrolled environments might be explained by participants being more strongly involved in their actual, personal tasks there than in the experimental tasks in the controlled environments, which were of little-to-no personal importance to them.

Usage Notes

The data set can be used for multiple use cases, depending on individual research questions. The ‘Raw’ data can address numerous signal processing research questions on obtaining good-quality data from consumer-grade devices, for both the PPG and EEG sensors. In particular, the ‘Raw’ EEG data can also be used to analyze the usability of few-channel EEG data compared to conventional clinical-grade EEG recordings. The ‘Preprocessed’ data is a good source for the deep learning community working on unsupervised multi-class mental workload classification, or on time-series data in general. Furthermore, the ‘Features’ provided with the data set can be utilized to investigate the contribution of each feature and to answer the research question of the necessity of multi-modality in uncontrolled environments. In line with recent work on multi-task learning30, this data set can serve as a control group for disease- and non-disease-specific analyses by drastically increasing the available reference data.