1 Introduction

In the last two decades, accurate automated detection of human actions has attracted significant interest in a wide variety of fields. Due to its potential for improving workflow efficiency and quality assessment, human action recognition (HAR) has been used in fields such as health and surgical training (Czempiel et al. 2020; Eivazi et al. 2017; Garcia-Ceja et al. 2020; Padoy 2019), autonomous driving (Billah et al. 2019; Ohn-Bar and Trivedi 2016) and human–robot interaction (Martínez-Villaseñor and Ponce 2019; Mojarad et al. 2018). Furthermore, the availability of video data provided by wearable devices has prompted a surge in research on egocentric action recognition (EAR), the recognition of human actions from a first-person perspective (Kapidis et al. 2018; Kazakos et al. 2019; Li et al. 2015; Ma et al. 2016). The majority of these video-based HAR methods use a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract image-based features and to leverage temporal relationships between video frames (Basha et al. 2020a, b; Dai et al. 2020; Garcia-Hernando et al. 2018; Kapidis et al. 2021; Ng et al. 2015). When trained with large quantities of training data from publicly available action recognition datasets, neural network-based models have been able to reach recognition accuracies of over 90% for everyday actions (Boualia and Amara 2021; Wan et al. 2020).

However, in more specific applications, such as skill evaluation in surgical procedures (Eivazi et al. 2017) or the detection of use errors in usability studies, the desired action sequences are commonly not represented in either third-person video datasets such as UCF101 (Soomro et al. 2012) or egocentric video datasets such as EPIC-KITCHENS (Damen et al. 2018) and EGTEA Gaze + (Li et al. 2021). Increasing the data labeling and algorithm training efforts can improve classification performance (Mizik and Hanssens 2018), but is often cost-intensive (Chen et al. 2021; Romero Ugalde et al. 2015) or simply unfeasible (Garcia-Ceja et al. 2020; Zhou et al. 2020). Consequently, the training efforts and resulting recognition performances for specific human–object interaction tasks, such as tool handling, remain unknown, while data availability could considerably limit the applicability of state-of-the-art models. Alternatively, it has been suggested that performance may be improved by increasing the number of trainable features for machine learning (ML) models (Jobanputra et al. 2019). In EAR, increasing the number of features is commonly achieved by including additional data modalities as input, such as audio information (Cartas et al. 2019; Kazakos et al. 2019; Wu et al. 2016), hand tracking (Garcia-Hernando et al. 2018; Kapidis et al. 2019b; Tekin et al. 2019a) or object detection (Kapidis et al. 2019a; Núñez-Marcos et al. 2022).

In the past decade, mobile eye tracking (MET) has established itself as a valuable tool for the measurement of cognitive processes and visual attention, since the human gaze reveals important object- and scene-related information during cognitively demanding manual tasks (Huang et al. 2020; Lukander et al. 2017). In the literature, only a few ML models have used gaze data to infer, detect or predict user actions (Kit and Sullivan 2016; Klaib et al. 2021; Ulutas et al. 2020). Even though these studies have shown that ML models could vastly benefit from exploiting human attention during human–object interaction (Rong et al. 2021), eye movements have been largely neglected as a feature input for EAR models. Recent publications on automated neural network-based object–gaze mapping methods (Wolf et al. 2018) and subsequent developments of advanced gaze metrics, such as the object–gaze distance (OGD), which captures information about the user's peripheral field of vision by representing the relative distance of the gaze point to task-related objects (Wang et al. 2021), serve as an inspiration to explore the relevance of eye movements as a data modality for EAR.

In this paper, we propose a multi-modal gaze-enhanced action recognition framework, the Peripheral Vision-Based HMM (PVHMM), and investigate the suitability of gaze movement data as an additional input modality for an EAR model. The PVHMM combines object and gaze features through an automated object-of-interest (OOI) mapping and achieves action recognition using a hidden Markov model (HMM). Two gaze features, the OOI Hits and the OGD, were investigated using 43 eye tracking recordings. An exemplary medical device handling task, which consists of seven cognitively demanding handling actions, was chosen as the subject of this study. We conducted an ablation study to investigate the PVHMM's robustness to adversarial noise by adding Gaussian noise to the gaze data, and assessed generalization to unseen users by employing leave-one-subject-out (LOSO) cross-validation. Finally, to study the differences from purely image-based approaches, the classification performance was compared with that of a popular image-based convolutional neural network, the VGG-16 (Simonyan and Zisserman 2014). With our findings, we offer insights into the effectiveness of eye movements as a data modality for EAR and contribute a gaze-enhanced recognition framework that shows accurate results on the presented application using a small mobile eye tracking dataset.

2 Related work

Current research on HAR has focused on the use of video-based ML approaches, using videos from first- and third-person perspectives. These approaches have included a variety of object, motion and temporal features and can recognize general physical activities using large video datasets. In specific applications, where the required training effort and architectural complexity outweigh the benefits, gaze data have the potential to add a valuable sensory modality that captures task-relevant user attention.

2.1 Action recognition

Over the years, many ML models, such as support vector machines (SVMs), random forests (RFs), HMMs, conditional random fields (CRFs) and K-nearest neighbors (KNN), have been successfully applied to video-based HAR under strictly controlled environments (Minh Dang et al. 2020). More recently, neural network-based approaches, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), including long short-term memory (LSTM) networks, have gained favorable attention due to their remarkable performances in image and object classification on public HAR datasets. Models that employ LSTMs can consider additional temporal connections and, as a result, take advantage of the sequential nature of the data. Ng et al. (2015) employed an RNN with LSTM cells and a 2D-ConvNet, using images extracted from video recordings as model input. Dai et al. (2020) developed a neural network for video-based HAR, combining recurrent layers for temporal feature extraction with 2D convolutional and recurrent layers for spatio-temporal feature extraction.

Others have gone one step further by proposing 3D architectures that include a variety of hidden layers, consisting of 3D convolutional, dropout, 3D pooling and fully connected layers, for the spatio-temporal consideration of third-person perspective video data (Almaadeed et al. 2019; Basha, Pulabaigari, et al., 2020; Boualia and Amara 2021; Wan et al. 2020). These state-of-the-art models have reached classification performances of over 90% on well-known datasets such as UCF101 (Soomro et al. 2012), HMDB51 (Kuehne et al. 2011) and KTH (Schüldt et al. 2004).

2.2 Egocentric action recognition

First-person vision (FPV)-based HAR, also known as egocentric action recognition (EAR), has gained increased popularity due to the recent accessibility of wearable recording devices, including the Microsoft Hololens, Magic Leap or GoPro, and the emergence of publicly available FPV datasets, namely EPIC-KITCHENS (Damen et al. 2018) and EGTEA Gaze + (Li et al. 2021). Furthermore, FPV offers a unique perspective on human–object interaction and can reveal the user's intention (Kanade and Hebert 2012), as camera movement is driven by the wearer's actions (Bandini and Zariffa 2020). On the other hand, EAR also poses unique challenges. Due to the non-static camera setup, noise and motion blur are introduced into the recorded images by fast head and body movements, hindering motion-based action recognition systems that were developed for third-person perspective videos (Li et al. 2015). Additionally, the egocentric view can lead to object–hand occlusion during object manipulation (Kapidis et al. 2019a).

To tackle these challenges, researchers have explored various multi-modal approaches, which are driven by cues intrinsic to egocentric videos (Núñez-Marcos et al. 2022), including object and hand recognition (Garcia-Hernando et al. 2018; Kapidis et al. 2018, 2019a; Ma et al. 2016; Tekin et al. 2019b), audio–visual (Arabacı et al. 2018; Cartas et al. 2019; Kazakos et al. 2019; Wu et al. 2016) or motion-based approaches (Li et al. 2015; Tang et al. 2018). Kazakos et al. (2019) approached EAR using the fusion of three modalities, namely RGB, motion and audio, to surpass single-modality performances, achieving 36.66% action detection accuracy on the EPIC-KITCHENS dataset. Kapidis et al. (2019b) proposed a multitask learning scheme using an RNN model with preceding 2D-ConvNets, which were trained to recognize single hands and OOIs in the scene, respectively. They were able to reach a 65.70% accuracy on the EGTEA Gaze + dataset and 19.29% for the recognition of actions in the EPIC-KITCHENS dataset. Furthermore, Kapidis et al. (2019a) investigated another EAR approach using the detection of hand motions and object locations, without further use of visual features. They trained an LSTM model and were able to achieve a 34.9% accuracy on the EPIC-KITCHENS dataset, showing comparable results to more input-demanding video-based methods. Sudhakaran et al. (2019) introduced a model that uses an LSTM to focus on task-relevant spatial regions of interest, achieving 61.86% on the EGTEA dataset.

The above-mentioned models have explored and utilized a wide variety of egocentric cues to gain an understanding of the task-relevant features of each scene, but have not been able to show EAR accuracies that would suggest success in more specific tasks on smaller datasets. Surprisingly, eye movements have been a largely disregarded informational source, considering that the position and the duration of the human gaze are known to reveal cognitive processes and crucial contextual information during manual tasks (Fuchs 2021; Land and Hayhoe 2001; Li et al. 2021). This can be attributed to the dependency of deep learning models, such as LSTMs, on large data quantities and to the low number of publicly available datasets that provide gaze movement data during cognitively demanding tasks. Hence, our work aims to provide insights into the informational value of gaze data and its effectiveness for task-specific EAR of sequential actions.

2.3 Gaze-based action recognition

In recent years, MET systems have become increasingly affordable and accurate, with the Tobii Pro Glasses 3 and the Pupil Labs Core reporting data-sampling rates of up to 100 Hz and 200 Hz, respectively, and angular accuracies of around 0.60° (Tobii Pro 2020; Pupil Labs 2022). Moreover, MET systems record video data from an egocentric view at high resolution, enabling the transferability of state-of-the-art EAR approaches.

However, the dynamically changing scene of FPV recordings increases the object-related data evaluation efforts manifold, particularly since automated gaze-object mapping procedures using CNNs have only been introduced recently (Wolf et al. 2018). This leads to a lack of available datasets with cognitively demanding tasks, which could potentially benefit from supplementary gaze information. Consequently, the use of MET systems for EAR has been severely limited and researchers have either explored gaze prediction and modeling (Fathi et al. 2012; Huang et al. 2020; Li et al. 2021; Min and Corso 2021) or utilized non-contextual gaze metrics such as fixation and saccade information for action recognition (Kit and Sullivan 2016; Li et al. 2015; Liao et al. 2018). Fathi et al. (2012) presented a generative probabilistic model that combines gaze prediction and object-based features as multi-modal input for action recognition during daily actions. They showed that the inclusion of gaze information resulted in higher action recognition performances, achieving 47% accuracy (chance = 4%). Li et al. (2021) proposed a deep model for joint gaze estimation and action recognition, reaching an accuracy of 53.40%.

Kit and Sullivan (2016) collected an eye tracking (ET) dataset of five natural tasks, such as navigating an office hallway, making a sandwich or typing printed text into a word processor, and trained an HMM using saccade amplitude and saccade direction. The final model achieved an overall classification accuracy of 36% (chance = 20%). Liao et al. (2018) trained an RF classifier for classifying five real-world navigation tasks using five different types of gaze features. No image-based features were used, and the RF was trained with features such as fixation density, saccade direction and encoded saccade sequences, resulting in a reported average classification accuracy of 67% (chance = 20%). Finally, Li et al. (2015) utilized the area near the gaze point to filter out the relevant trajectories of manipulation points. They were able to show that a multi-modal input of object, motion and egocentric cues achieved the highest performance on the GTEA Gaze dataset.

Based on the promising results of the presented approaches, the inclusion of gaze data as an additional data modality has shown great potential to boost classification performance, warranting further exploration of the influence of gaze movements on EAR models. In this paper, we present a gaze-enhanced action recognition framework, which is the first to exploit a combination of object features and gaze movement features, and investigate its classification performance when applied to coherent human action sequences of a relevant medical device handling task.

3 The proposed Peripheral Vision-Based Hidden Markov Model framework (PVHMM)

The proposed EAR workflow of our PVHMM is visualized in Fig. 1. First, the data are collected using the Tobii Pro Glasses 2 eye tracker and fixations are detected using the default settings of Tobii Pro Lab (Version 1.162). Second, a Mask R-CNN algorithm is trained on the supervise.ly platform for automated object detection (Supervisely 2022). We used the supervise.ly native image augmentation tool to increase the number of training images, including crop, flip and Gaussian blur. Third, gaze features are extracted and transformed into HMM input. In this paper, we investigated the action recognition performance using two different gaze features, which were extracted as follows: the trained model and the gaze coordinates of each recording are combined using the cGOM (see Wolf et al. 2018) and the OGD (see Wang et al. 2021) algorithms. The cGOM algorithm matches the gaze point with the detected OOIs to create a list of OOI Hits. The OGD includes the wearer's peripheral gaze information by measuring the 2D Euclidean distance between the gaze center and each OOI in every frame of the ET recording. Subsequently, the peripheral gaze information is transformed into dictionary values, where each entry contains a single numerical value. Fourth, action recognition is achieved using an HMM classifier and the investigated gaze features. In our framework, the HMM is trained to classify one action for each fixation, which yields a sequence of actions that has to be further processed to be usable for real applications. Consequently, successive fixations with identical action labels are merged and treated as part of the same entry, leading to a more representative sequence of the performed actions.

Fig. 1
figure 1

Proposed Peripheral Vision-Based HMM (PVHMM) framework consists of four main parts: (1) ET data recording of the handling task. (2) Detection of objects of interest (OOI) using the MASK R-CNN algorithm. (3) Extraction and transformation of gaze features OOI Hits or object–gaze distance (OGD). (4) Training and validation of the hidden Markov model (HMM) classifier for a fixation-by-fixation and action sequence classification

3.1 Mobile eye tracking recording

The majority of publicly available EAR baseline datasets, such as EPIC-KITCHENS (Damen et al. 2018), HMDB51 (Kuehne et al. 2011) or EGTEA Gaze + (Li et al. 2021), include only simple, non-specific human actions, such as walking, drinking or opening a fridge. Consequently, using these datasets can limit our knowledge of the models' performance when applied to more specific scenarios involving complex actions, such as product interaction, manufacturing or surgery. Therefore, to validate the proposed approach, we introduce a new dataset, which includes human activities within a task-specific context with limited training data availability. A medical device handling procedure was chosen, which is defined by a safety-critical intended use, prescribed by the device's instructions for use (IFU). The gaze data were extracted as x/y gaze coordinates for each fixation, as calculated by the Tobii Pro Lab software, and were used as input for the object segmentation (step 2 of the PVHMM), along with the video recordings.

3.1.1 Materials and Equipment

Data were recorded using the Tobii Pro Glasses 2 MET glasses. The device has a reported accuracy of 0.6° at a distance of 1.5 m. The front-facing camera has reported viewing angles of 82° horizontal and 52° vertical, recording at a resolution of 1920 × 1080 px and 25 frames per second, and thus represents the user's field of vision. The tracking percentage of the evaluated gaze samples was 85.2 ± 7.2%, as reported by the Tobii Pro Lab software. Algorithm training and testing were conducted on a GPU-enabled workstation with the following characteristics: GPU: NVIDIA GeForce RTX 2080 (8 GB), working memory: 16 GB, CPU: AMD Ryzen 7 3700X 8-core processor.

3.1.2 Participants

Twenty subjects (17 male, 3 female) aged 20–31, university students and one PhD candidate, participated in the study. All participants reported normal or corrected-to-normal vision and no neurological conditions. Due to the absence of task-native experts, seven subjects were trained to an expert level on the chosen device handling task. The expert training was conducted prior to data recording and was carried out until subjects acquired the ability to finish the assembly repeatedly, without the use of the IFU and without making any mistakes. Each participant provided informed consent before testing and received a small monetary compensation.

3.1.3 Device handling task

To test the EAR system on a safety-relevant application that follows a specific set of sequential actions, a commercially available insulin injection device, the UnoPen, along with a novel connected add-on smart device prototype, the UnoPen Smartpilot, was chosen for the handling task. The Smartpilot add-on is connected to a smartphone application, which automatically counts the administered insulin units, displays instructions and shows the recommended holding time, during which the injection device is not to be removed from the injection site. Both devices were provided by the Swiss medical device manufacturer Ypsomed AG. For this task, participants were asked to carry out one successful injection into an injection pad (representative of the human body), using the UnoPen with the Smartpilot. The correct sequence of actions, given by the IFU, which is to be detected during the fourth step of the PVHMM framework, is shown in Fig. 2. Additionally, the output of the EAR classifier of the PVHMM is depicted as both a fixation sequence and the resulting collapsed action sequence (step 4 of the PVHMM). Herein, a safe injection cycle, as given by the IFU, consists of a sequence of eight actions in the following order: (1) Cap Off, (2) Apply Tip, (3) Setting Units, (4) Priming, (5) Setting Units, (6) Injection, (7) Remove Tip, and (8) Cap On. Thus, the task consists of seven distinct actions, where the action Setting Units is carried out twice.

Fig. 2
figure 2

Sequence of human–device actions as provided by the Instructions for Use (IFU) of the UnoPen (Ypsomed AG). A complete workflow consists of eight sequential actions, where Setting Units (3) is carried out twice, once before the action Priming (4) and once before the action Injection (6). A visualization of the output of the PVHMM is given for fixation-based HAR and for the resulting collapsed sequence of actions, where successive action labels are summarized into a single entry (step 4 of the PVHMM)

3.2 Object segmentation 

The object detection builds upon the Mask R-CNN (Region-Based Convolutional Neural Network) instance segmentation algorithm. Figure 3 shows the seven OOIs that were chosen for the training of the algorithm, including both larger objects, such as the phone, the injection pad and the pen, and very small objects, such as the unit gauge, the needle tip and the safety. OOIs such as Gauge, Cap and Safety were chosen because of their distinct appearance in no more than two of the eight investigated actions. For example, users were expected to focus their visual attention on the OOI Cap only during the actions Cap Off and Cap On. With this method of OOI selection and algorithm training, we expect a more accurate classification performance for the HMM classifier.

Fig. 3
figure 3

Seven OOIs chosen for the object detection algorithm and the gaze feature transformation. The Smartpilot on top of the UnoPen is shown on the right and was separated into OOIs Tip, Pen and Gauge. (Color figure online)

Images for training were extracted from several different participant recordings, and a total of 344 images were labeled for the training dataset. To increase the number of training images from this small dataset and to avoid over-fitting, we used image augmentation on the supervise.ly platform to artificially proliferate the training images and boost the network's performance. A combination of the following augmentations was applied: multiplication, vertical flip, crop, rotation, Gaussian blur, contrast and brightness, yielding 55′728 unique training images. The augmented images were grouped into a training and a validation set with a 95:5 split. To increase the training efficiency of the neural network, a pre-trained Mask R-CNN model was used. The model was trained using a transfer learning approach for an initial 15 epochs with a learning rate (LR) of 0.001, followed by a second training step of 25 epochs with an LR of 0.0001 to achieve the effect of LR decay. The quality of the masks was evaluated using the intersection over union (IoU) metric (see Sect. 3.5.2).

We extracted the model weights from the epoch with the lowest loss value, which were used for object segmentation in all subsequent analyses.
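For readers who want to reproduce this training step outside the supervise.ly platform, the two-stage transfer learning schedule can be approximated with torchvision's Mask R-CNN implementation. The following is a minimal sketch under assumptions (a COCO-pretrained backbone and a data_loader that yields images and instance segmentation targets in torchvision's detection format); it is not the training configuration used on supervise.ly:

```python
# Minimal transfer learning sketch (not the supervise.ly setup used in this study):
# fine-tune a COCO-pretrained Mask R-CNN on 7 OOI classes with the two-stage
# learning rate schedule described above (15 epochs at 1e-3, 25 epochs at 1e-4).
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 8  # 7 OOIs + background

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
in_feat = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, NUM_CLASSES)
in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, NUM_CLASSES)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

# data_loader is an assumed DataLoader yielding (images, targets) in torchvision
# detection format; it is not defined here.
for lr, n_epochs in [(1e-3, 15), (1e-4, 25)]:          # manual two-stage LR decay
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(n_epochs):
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss = sum(model(images, targets).values())  # sum of RPN/ROI/mask losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```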

3.3 Gaze feature extraction

To examine the capability of the proposed framework to achieve accurate EAR using a small dataset and to determine the influence of the information density of a gaze feature on the EAR performance, two fixation-based, object-related gaze features were investigated. Using the weights extracted from the object segmentation model training, the gaze coordinates of each trial are transformed into the OOI Hits and OGD data formats. The OOI Hits feature stores the gaze information as a list of looked-at OOIs, with one entry for each fixation, which can be used as-is to train the HMM model for action recognition. The OGD, on the other hand, provides a 2D pixel (px) distance measurement between the gaze point and each segmented OOI, resulting in an x-by-seven matrix, where x is the number of fixations in a trial. Since this format is not HMM compatible, we performed a gaze feature transformation, which is explained in the following sections.
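To illustrate the OOI Hits format, each fixation can be mapped to the OOI whose segmentation mask contains its gaze point, or to a no-hit label otherwise. The sketch below assumes one binary mask per OOI and pixel gaze coordinates; it mirrors the idea behind cGOM but is not the cGOM implementation:

```python
import numpy as np

def ooi_hit(masks, gaze_xy, no_hit="Background"):
    """Return the name of the OOI whose mask contains the gaze point.

    masks   -- dict mapping OOI name -> binary mask (H x W numpy array)
    gaze_xy -- fixation gaze point as (x, y) pixel coordinates
    """
    x, y = int(round(gaze_xy[0])), int(round(gaze_xy[1]))
    for name, mask in masks.items():
        h, w = mask.shape
        if 0 <= y < h and 0 <= x < w and mask[y, x]:
            return name
    return no_hit

# One OOI Hits entry per fixation, e.g.:
# hits = [ooi_hit(masks_of_fixation[i], gaze_of_fixation[i]) for i in range(n_fixations)]
```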

3.3.1 Object–Gaze Distance (OGD) feature transformation

In their recent work, Wang et al. (2021) presented a way to increase the informational value of gaze data by using an automated, machine learning-assisted computation of the 2D Euclidean distance between each OOI and the gaze point for each fixation. The calculation of distances to the gaze point allows for the simultaneous acquisition of time-series data for all OOIs, thus multiplying the object-related gaze data by the number of OOIs. Moreover, studies have shown that near-peripheral vision is frequently used in the decision-making process (Reingold and Sheridan 2012; Krupinski et al. 2006), allowing us to include more task-related information in our gaze features. The illustration in Fig. 4a shows four detected OOIs (Gauge, Pen, Tip, Pad) and their matching distances to the gaze point (red circle). Figure 4b shows a visualization of different fields of vision during the device handling study, displayed as red and orange rings around the gaze point. The recorded near-peripheral vision is restricted to a 52° visual angle, which is given by the maximum measurable vertical viewing angle of the Tobii Pro Glasses 2 front camera.
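A minimal sketch of the OGD computation for a single fixation, assuming binary OOI masks: the distance is taken as the 2D Euclidean pixel distance from the gaze point to the nearest pixel of each mask (zero if the gaze point lies on the OOI). This follows the description by Wang et al. (2021) but is not their implementation:

```python
import numpy as np

def object_gaze_distance(mask, gaze_xy):
    """2D Euclidean pixel distance from the gaze point to the nearest pixel of an OOI mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:                      # OOI not detected in this frame
        return np.inf
    d = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])
    return float(d.min())                 # 0.0 if the gaze point lies on the OOI

def ogd_vector(masks, gaze_xy):
    """One OGD entry per OOI, in alphabetical OOI order (one row of the x-by-seven matrix)."""
    return [object_gaze_distance(masks[name], gaze_xy) for name in sorted(masks)]
```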

Fig. 4
figure 4

In a, the Euclidean pixel distances from the gaze point (red) to each OOI are visualized. The output of the OGD consists of an array of 2D distances, such as OGD = \([d_{Gauge}, d_{Pen}, d_{Tip}, d_{Pad}]\), with one entry for each OOI. In b, three fields of vision, foveal (circular area A), para- and perifoveal (circular area B) and peripheral vision (area C, rest of the image), are shown in a snapshot of the egocentric ET recording of the insulin injection task

As the HMM action recognition algorithm expects a sequence of single-dimensional observations as input, the OGD values of all OOIs were transformed into a single entry for each fixation. First, the areas of each field of vision were assigned a letter from A to D (see Table 1). As the influence of the number of vision areas on the HMM classification performance was unknown at the time of our investigation, two different numbers of vision areas, three and four, were investigated. Then, the calculated distance of each OOI was encoded as the letter of the matching vision area, and the letters were strung together to form a word containing \(n_{\mathrm{OOI}}\) letters, in alphabetical OOI order. Furthermore, a dictionary was created containing every possible combination of OOI positions in a subject's visual field, \(n_{c} = n_{P}^{\,n_{\mathrm{OOI}}}\), where \(n_{P}\) represents the number of chosen vision areas; each combination was assigned a unique dictionary integer value within the range \(R = \{\, y \mid 1 \le y \le n_{c} \,\}\). When using three vision areas (\(n_{P}=3\)) and seven OOIs (\(n_{\mathrm{OOI}}=7\)), a dictionary is created with \(n_{c}=2187\) values, ranging from AAAAAAA → \(n_{\mathrm{dic}}=1\) to CCCCCCC → \(n_{\mathrm{dic}}=2187\). Similarly, using four vision areas and the same number of OOIs results in \(n_{c}\) = 16′384 dictionary values. The number of dictionary values is therefore directly determined by the number of OOIs and the number of vision areas chosen for the analysis. Notably, many dictionary values never occur in practice, since many string combinations (such as AAAAAAA) are physically impossible: due to the small area of the foveal field of vision, only a limited number of OOIs can lie within it at the same time.

Table 1 Vision areas used for the gaze features of the HMM, determined according to their biological definitions and converted into pixel ranges that account for the resolution of the Tobii Pro Glasses 2

After the OGD gaze transformation, a one-dimensional numerical vector (one entry per fixation) is obtained. While the absolute positions of objects are lost, each vector entry stores the discretized position of each OOI within the peripheral field of vision at the given moment in time. This vector is used as a gaze feature input for the HMM model for EAR. We expect that by discretizing the OGDs, we can reduce the computational complexity of the classifier while achieving high classification accuracies using small datasets.
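The discretization step can be illustrated as follows. In this sketch, the pixel thresholds for vision areas A and B (25 px and 60 px, borrowed from the noise levels in Sect. 3.5.1 rather than from Table 1) and the OOI names are illustrative assumptions; each fixation's per-OOI distances are encoded as a letter word and mapped to a unique integer by interpreting the word as a base-\(n_{P}\) number:

```python
import numpy as np

OOI_NAMES = ["Cap", "Gauge", "Pad", "Pen", "Phone", "Safety", "Tip"]   # assumed names, alphabetical
VISION_AREAS = [(25.0, "A"), (60.0, "B"), (np.inf, "C")]  # illustrative px thresholds (cf. Table 1)

def to_area(distance_px):
    """Map an object-gaze distance to a discretized vision area letter."""
    for threshold, letter in VISION_AREAS:
        if distance_px <= threshold:
            return letter

def ogd_to_dictionary_value(distances):
    """Encode one fixation's per-OOI distances as a unique integer in 1..n_P**n_OOI."""
    letters = [to_area(distances[name]) for name in OOI_NAMES]
    digits = {letter: i for i, (_, letter) in enumerate(VISION_AREAS)}  # A->0, B->1, C->2
    value = 0
    for letter in letters:                       # interpret the letter word as a base-3 number
        value = value * len(VISION_AREAS) + digits[letter]
    return value + 1                             # 'AAAAAAA' -> 1, 'CCCCCCC' -> 3**7 = 2187

# Example: all OOIs far from the gaze point except the Pen, which lies in the foveal area.
example = {name: 400.0 for name in OOI_NAMES}
example["Pen"] = 10.0
print(ogd_to_dictionary_value(example))
```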

3.4 Hidden Markov Model (HMM)

Due to the HMM’s ability to process sequential data, the PVHMM framework utilizes an HMM as the main EAR classifier. An HMM estimates the probability of the presence of unmeasurable hidden states X = (\({x}_{1}\), …, \({x}_{t}\)), through a sequence of measurable associated observations \(O\) = (\({o}_{1}\), …,\({o}_{t}\)). Observations can be any of those features (fixation durations, OOI Hits, transformed OGD, etc.) that convey information about the probability of occurrence of underlying hidden states. These observations \({o}_{t} \in K\) = (\({k}_{1}\), …,\({k}_{M}\)) are contained in \(K\), the set of possible observables. The set of possible hidden states is defined by S, \({x}_{t} \in S\) = (\({s}_{1}\), …,\({s}_{N}\)) which in the presented case is the set of seven possible actions during the device handling task. HMMs are built upon the Markov assumption, which states that the probability of each event is solely dependent on the current observation and the previous state, which can be written as

$$P(x_{t} \mid x_{1}, \ldots, x_{t-1}) = P(x_{t} \mid x_{t-1})$$
(1)

Furthermore, HMMs use two probability distributions, \(P_{A}\) and \(P_{B}\). \(P_{A}\) models the conditional transition probability from state \(s_{i}\) to \(s_{j}\), \(a_{ij} = P_{A}(x_{t} = s_{j} \mid x_{t-1} = s_{i})\), where each transition is stored in a transition matrix \(a\). The transition matrix is commonly initialized during the training of the model and carries the transition probabilities between each pair of hidden states, which in our case are the transition probabilities between different actions, such as Setting Units and Remove Tip. The transition probabilities can also be adjusted manually, based on task-related information (Courtemanche et al. 2011), in order to increase the model accuracy. \(P_{B}\) models the conditional probability of emitting the observable \(k_{l}\), given the state \(s_{i}\), where each emission is stored in the emission matrix \(b_{i,l} = P_{B}(o_{t} = k_{l} \mid x_{t} = s_{i})\). Thus, an HMM is defined by a tuple of five elements, \(\mu = (S, K, P, a, b)\). For the actual prediction, the Viterbi algorithm is used to find the most likely sequence of hidden states \(X\), given an observation sequence \(O\) (for example, an OOI sequence). Thereby, the Viterbi algorithm optimizes over the globally complete sequence (Allahverdyan and Galstyan 2011).
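For reference, the Viterbi decoding step can be written as a compact, self-contained sketch in log space; the notation follows the definitions above, with pi assumed to denote the initial state probabilities (the actual framework may rely on an existing HMM library):

```python
import numpy as np

def viterbi(observations, pi, a, b):
    """Most likely hidden state sequence for a discrete HMM.

    observations -- sequence of observation indices o_t in 0..M-1
    pi           -- initial state probabilities, shape (N,)
    a            -- transition matrix a[i, j] = P(x_t = s_j | x_{t-1} = s_i), shape (N, N)
    b            -- emission matrix b[i, l] = P(o_t = k_l | x_t = s_i), shape (N, M)
    """
    eps = 1e-12                                   # avoid log(0)
    log_pi, log_a, log_b = (np.log(np.asarray(m) + eps) for m in (pi, a, b))
    T, N = len(observations), len(pi)
    delta = np.empty((T, N))                      # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)             # back-pointers
    delta[0] = log_pi + log_b[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a    # scores[i, j]: previous state i -> state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[:, observations[t]]
    states = np.empty(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                # backtrack the globally best path
        states[t] = psi[t + 1, states[t + 1]]
    return states
```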

Within the PVHMM, the HMM takes a one-dimensional data vector of one trial, containing either the OOI Hits or the transformed OGD, and predicts one pre-trained action for each fixation. By using fixation-level metrics as the input data for the EAR classification, the temporal boundaries of actions can be approximated with the precision of the duration of fixations, which typically ranges from 100 ms up to 2000 ms. As a result, we obtain a vector with a fixation-by-fixation prediction of actions for each investigated handling trial.

3.5 Experimental validation

The proposed PVHMM action recognition framework was evaluated in two parts. First, we tested the classification accuracy under laboratory conditions using data collected from two expert subjects. Both expert subjects (authors of this work) were familiar with the task and showed the ability to perform the task perfectly prior to the data recording. To prevent overfitting of the HMM model, we included 9 trials that did not follow the IFU sequence: 6 trials with repeated execution of the actions Setting Units, Priming, Cap On or Injection and 3 trials with omitted actions Setting Units, Priming, Cap On or Remove Tip. Consequently, the dataset contains trials with action sequences of various lengths and alternative orders. Using this set of 25 trials (16 normal-order and 9 alternative-order sequences), we conducted an ablation study, in which we evaluated the influence of the number of vision areas on the performance of the classifier and the robustness of the system to Gaussian noise, using the leave-one-out (LOO) cross-validation method, where the data of a single session were used as the test set in each fold. The goal of the ablation study was to choose the optimal parameters for our ML framework using a small dataset while retaining the larger dataset for the final evaluation. Second, we included 18 more trials from 5 new experts and 13 novices, resulting in a total of 43 recorded trials from 20 participants, with a mean of 142.58 ± 44.17 fixations and a mean of 8.23 ± 0.86 performed actions per trial. Here, most participants followed the correct order of action sequences during the execution of the task. We applied a leave-one-subject-out (LOSO) cross-validation method, where the data of a single new individual are utilized as the test set in each fold. The data of the 25 expert trials of the LOO investigation were used solely for the training of the model, leading to 18 folds during cross-validation. This way, we are able to conduct a subject-independent evaluation of the performance of the framework for new subjects (Gholamiangonabadi et al. 2020; Gunduz 2019). To determine the influence of the engineered gaze features on the EAR performance, all evaluations were performed with both gaze features, the OOI Hits and the OGD, and finally compared to the performance of a purely image-based classifier, the VGG-16. The VGG-16 (pre-trained on "ImageNet") was trained using transfer learning with the same LOSO approach and an 80:20 training-validation split of the training data. The model was trained for 20 epochs with a starting learning rate of LR = 0.0001, a categorical cross-entropy loss function, a batch size of 8 and the "Adam" optimizer. We expect the image-based classifier to show a lower classification performance, mainly because the training images for closely related actions such as Cap On/Cap Off or Apply Tip/Remove Tip are almost indistinguishable when looking at single images (see Fig. 5).
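The LOSO protocol can be outlined with scikit-learn's LeaveOneGroupOut, where each new subject forms one group and the 25 expert trials are appended to every training fold. The sketch below is schematic (trial identifiers are dummies and the actual training and evaluation calls are omitted), not the evaluation script used in this study:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Illustrative layout: 18 new trials, one per new subject (the LOSO groups), plus
# 25 expert trials that are always kept in the training fold.
new_subjects = np.arange(18)                          # one trial per new subject
new_trials = [f"trial_{s}" for s in new_subjects]
expert_trials = [f"expert_trial_{i}" for i in range(25)]

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(new_trials, groups=new_subjects)):
    train_set = expert_trials + [new_trials[i] for i in train_idx]  # experts always in training
    test_set = [new_trials[i] for i in test_idx]                    # held-out subject's trial
    # ... train the HMM (or the VGG-16) on train_set and evaluate on test_set ...
print(fold + 1, "folds")                              # -> 18 folds
```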

Fig. 5
figure 5

Still images from the medical device handling dataset, showing the similarity between actions Cap Off (a) and Cap On (b) as well as Apply Tip (c) and Remove Tip (d)

3.5.1 Ablation study

To investigate the influence of different gaze features on the classification accuracy of the HMM, the algorithm performance was evaluated using two gaze features, the OOI Hits and the transformed OGD, for both the LOO and LOSO cross-validation. The two gaze features differ in the object-related informational density, since OOI Hits carry binary gaze-on-object information, while the OGD contains information on the distance of the gaze center to each OOI and thus carries information of the near-peripheral area of vision. To gain a better understanding of the advantages and limitations of our proposed PVHMM classification framework, we evaluated the following range of parameters during a preliminary ablation study:

Number of vision areas To evaluate the influence of the number of discretized fields of vision on the classification accuracy, we evaluated sets of three and four vision areas, namely ABC (foveal, parafoveal & perifoveal, and near-peripheral vision) and ABCD (foveal, parafoveal, perifoveal, and near-peripheral vision).

Gaussian Noise To test the framework's robustness to noise, we applied three Gaussian noise levels to the raw gaze point coordinate data. Here, we chose the peak noise levels according to the outer threshold of each field of vision: 25 px, 60 px and 150 px. The influence of noisy data on the classification accuracy was compared between the OOI Hits and the OGD gaze features and evaluated during the ablation study.
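A minimal sketch of the noise injection, assuming that the stated pixel level is used as the standard deviation of zero-mean Gaussian noise added independently to the x and y gaze coordinates (the exact parameterization of the peak noise level may differ):

```python
import numpy as np

def add_gaussian_gaze_noise(gaze_xy, noise_px, seed=None):
    """Add zero-mean Gaussian noise to raw gaze coordinates.

    gaze_xy  -- array of shape (n_fixations, 2) with x/y pixel coordinates
    noise_px -- noise level in pixels (25, 60 or 150 in the ablation study)
    """
    rng = np.random.default_rng(seed)
    noisy = gaze_xy + rng.normal(0.0, noise_px, size=gaze_xy.shape)
    # Keep coordinates inside the 1920 x 1080 px scene camera frame.
    return np.clip(noisy, [0, 0], [1919, 1079])
```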

3.5.2 Evaluation metrics

First, we evaluated the performance of the object detection algorithm using the widely used intersection over union (IoU) metric, which is calculated by dividing the area of overlap between the predicted and the manually labeled mask by their area of union. Additionally, the precision, recall and f1-score were computed. The instance segmentation performance was evaluated on a set of 20 sample images from 5 study recordings. The sample images mostly contain scenes of human interaction with the OOIs, which increases the detection complexity through occlusion, and contained 91 ground-truth OOIs in total. Confidence thresholds of 0.5 and 0.7 set a standard for which predicted masks are evaluated, by only considering masks for the IoU calculation that were predicted with a confidence value above 0.5 and 0.7, respectively.
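For reference, the mask-level IoU and its confidence-filtered averaging can be sketched as follows; the matching of predicted masks to ground-truth masks is assumed to be given here, which simplifies the actual evaluation procedure:

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary segmentation masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 0.0

def mean_iou(predictions, gt_masks, confidence_threshold=0.5):
    """Average IoU over predictions whose detection confidence exceeds the threshold.

    predictions -- list of (confidence, predicted_mask, matched_gt_index) tuples (assumed format)
    gt_masks    -- list of ground-truth binary masks
    """
    ious = [mask_iou(mask, gt_masks[gt_idx])
            for conf, mask, gt_idx in predictions
            if conf >= confidence_threshold]
    return float(np.mean(ious)) if ious else 0.0
```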

Second, we evaluate the classifier accuracy on a fixation-by-fixation level, where each fixation contains a single classification made by the HMM model. The ability of the PVHMM and the VGG-16 to accurately detect the correct fixation-level action sequence is evaluated using the f1-score. Additionally, the accuracy is calculated for the VGG-16 classifier. Due to the novelty of the intended application of the proposed framework, i.e., for workflow and performance analysis in EAR, traditional classifier evaluation metrics do not suffice for the evaluation of all relevant aspects. Therefore, for the evaluation of the performance, we introduce three new action sequence-based metrics. The EAR performance is evaluated using the action sequence accuracy, the action sequence sensitivity and the action sequence precision metrics. For these action-level metrics, consecutive fixations with the same classified action label are collapsed into a single entry (see Fig. 1), transforming a fixation sequence into an action sequence. The introduced metrics compare the list of actions recognized by the PVHMM to the manually labeled ground truth. A classification error was counted for each misclassified, undetected and/or wrongfully detected action. Here, the duration of the collapsed actions and their deviation from the ground truth were not considered.

We evaluated the classifier using the following four evaluation metrics:

f1-score Frequently used in ML classifier performance evaluations, the f1-score is calculated as the harmonic mean of the precision and recall. In our case, the f1-score is used to evaluate the action classification on a fixation level for the PVHMM and the VGG-16. The f1-score is calculated fixation-by-fixation and as a result reflects the temporal boundaries of the detected actions (i.e., a perfect f1-score would lead to the perfect detection of the start and end of each action).

Action sequence accuracy (1 − (no. of classification errors / no. of classifications made)). The accuracy gives a sense of the overall ability of the classifier to retrieve the ground-truth collapsed action sequence. A perfect action sequence accuracy of 1.0 signifies that the algorithm is able to classify the performed sequence of actions perfectly. If the number of errors exceeds the number of classifications, for example when the HMM classifies a sequence of two actions but the ground truth contains a sequence of eight actions, the value 0 is assigned.

Action sequence sensitivity (no. of correctly made classifications / total no. of actions in the ground-truth sequence). The action sequence sensitivity allows us to quantify the number of actions that were correctly identified by the classifier, compared to the number of actions contained in the ground truth. A maximum value of 1.0 signifies a correct detection of all actions of the ground-truth sequence.

Action sequence precision (no. of correctly made classifications / total no. of classifications made). The action sequence precision expresses the fraction of correctly classified actions out of all classifications made by the algorithm, indicating how precise the classifier is whenever a prediction is made. A precision value of 1.0 means that every classification that was made was correct.

It should be noted that the action-level metrics solely compare the action sequences; a perfect score can therefore be reached even if the start and end times of the detected actions are misaligned with respect to the ground truth.
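To make the action-level metrics concrete, the sketch below collapses fixation-level predictions into an action sequence and compares it with the ground truth. The alignment used to count correct classifications (via difflib.SequenceMatcher) is one plausible interpretation of the description above, not necessarily the exact procedure used in this study:

```python
from difflib import SequenceMatcher
from itertools import groupby

def collapse(fixation_labels):
    """Merge successive fixations with identical action labels into one action entry."""
    return [label for label, _ in groupby(fixation_labels)]

def action_sequence_metrics(pred_fixations, true_fixations):
    pred, true = collapse(pred_fixations), collapse(true_fixations)
    # Count correctly classified actions via the longest matching subsequences.
    matcher = SequenceMatcher(None, pred, true, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    errors = max(len(pred), len(true)) - correct      # misclassified, missed or spurious actions
    accuracy = max(0.0, 1.0 - errors / len(pred)) if pred else 0.0
    sensitivity = correct / len(true) if true else 0.0
    precision = correct / len(pred) if pred else 0.0
    return accuracy, sensitivity, precision

# Toy example (one fixation per action for brevity): Priming is missed, so the two
# Setting Units blocks merge into one, and a spurious Injection appears before Cap On.
truth = ["CapOff", "ApplyTip", "SettingUnits", "Priming", "SettingUnits",
         "Injection", "RemoveTip", "CapOn"]
pred = ["CapOff", "ApplyTip", "SettingUnits", "SettingUnits",
        "Injection", "RemoveTip", "Injection", "CapOn"]
print(action_sequence_metrics(pred, truth))
```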

The resulting data were not normally distributed, and statistical differences were therefore evaluated using a nonparametric Wilcoxon signed-rank test.

4 Results

To judge the validity of our proposed PVHMM framework, we evaluate and report the results of the object segmentation algorithm, the influence of Gaussian noise and choice of vision areas, as well as the main evaluation using a small dataset with many different subjects.

4.1 Object segmentation

The object detection model shows excellent results for the classification metrics across all objects at IoU confidence thresholds of 0.5 and 0.7. The qualitative results showed that differences between predicted and ground-truth masks mostly occur when OOIs are close to one another (e.g., Tip and Pen), which can cause misdetections between OOIs. Averaged over all seven OOIs, the trained model achieved a mean IoU of 0.82 and 0.90 for thresholds 0.5 and 0.7, respectively (see Table 2). For smaller OOIs, such as OOI Safety and OOI Tip, the mean IoUs showed lower values, indicating a less accurate segmentation performance. Imperfectly segmented object masks of small OOIs therefore have a non-negligible effect on the IoU. The results suggest that the IoU is biased toward larger OOIs, due to the higher circumference-to-surface-area ratio of small OOIs, but that the algorithm can be used for gaze feature extraction (step 3 of the PVHMM) with high confidence.

Table 2 Class-specific classification metrics of the OOI object detection model. The evaluation was carried out at intersection over union (IoU) thresholds of 0.5 and 0.7, on a sample of 20 images. The weighted average scales the metrics with the number of detections of each class

4.2 Ablation studies

Table 3 shows the results of the HMM classifier's performance within our ablation study, using the LOO cross-validation. When using three vision areas without added noise in the raw gaze data, both the OOI Hits and the OGD gaze features show very high f1-scores, action sequence accuracies, action sequence sensitivities and action sequence precisions. Specifically, during the fixation sequence classification, the use of the peripheral vision-based OGD feature led to a statistically significant increase (p = 0.05) in the f1-score. The evaluation of the action sequence classification performance yields very similar results for both gaze features, and no statistical significance was observed. After increasing the number of discretized vision areas from three to four during the OGD gaze feature transformation, the model shows slightly reduced performance in both the fixation sequence and action sequence metrics. The ablation study of the vision areas therefore suggests that the use of more vision areas, and thus more dictionary values, leads to a small, statistically non-significant decrease in the classification performance.

Table 3 Results of the leave-one-out (LOO) cross-validation ablation study. The evaluation metrics are shown for three (ABC) and four (ABCD) vision areas, as well as added Gaussian noise of 25 px, 60 px and 150 px

The classifier’s robustness was investigated by evaluating the classification performance in response to noisy gaze data. Contrary to our initial expectations, the introduction of a small Gaussian noise of 25 px, which is the size of the foveal field of vision, resulted in a statistically non-significant increase in the mean action sequence accuracy for both the OOI Hits (p = 0.246) and the OGD feature (p = 0.143). With a further increase in the noise level to 60 px and 150 px, the HMM’s performance decreases noticeably, showing the lowest overall f1-score for both OOI Hits (0.439 ± 0.12) and OGD (0.791 ± 0.11) at the highest investigated noise level. However, even when subjected to an adversary noise of up to 150 px, the OGD-based PVHMM showed a reasonably high classification performance, significantly outperforming the OOI Hits-based PVHMM in the f1-score (p < 0.001), actions sequence accuracy (p < 0.001) and action sequence sensitivity (p < 0.001). The results suggest that the implementation of peripheral gaze information can provide higher model stability and classification accuracy when dealing with noisy data.

The results of the ablation study suggest that both investigated gaze features for the PVHMM result in similar performances, even though the OGD-based PVHMM performs significantly better on a fixation-by-fixation level. Based on the findings of the ablation study, the subsequent LOSO evaluation of the OGD method was conducted using three vision areas (ABC) and without additional Gaussian noise.

4.3 Main evaluation

In the main evaluation of the algorithm's performance, the data were evaluated using a leave-one-subject-out (LOSO) cross-validation method, where the trials of a single individual are used as the test set in each iteration. Figure 6 shows the confusion matrices for the classification of the seven investigated actions for the gaze feature-enhanced PVHMM and the image-based VGG-16 classifier. In Fig. 6b, we see that the OGD feature achieves higher true positive rates, as well as lower false negative and false positive rates, compared to both the OOI Hits gaze feature and the VGG-16, for all investigated actions. Especially for the actions Priming and Remove Tip, both involving small OOIs, the rate of true positive classified fixations is increased by over 20%. Similarly, the false positive rate is reduced most drastically for the action Injection. Consequently, the results suggest that the PVHMM using the OGD gaze feature is capable of distinguishing actions with similar movements and identical OOIs, such as Cap On/Cap Off and Apply Tip/Remove Tip, with high accuracy, while the image-based classifier is unable to effectively distinguish between similar actions. However, both HMMs show a high false negative rate for the actions Setting Units and Priming. Overall, for both PVHMMs, most instances were misclassified as either the preceding or succeeding action, while misclassifications for the VGG-16 were highest for similar human–object interactions.

Fig. 6
figure 6

Confusion matrix of the egocentric action classification results of the PVHMM trained using the OOI Hits (a) and OGD (b) gaze feature. Additionally, the confusion matrix for the image classifier VGG-16 (c) is shown

Table 4 shows the quantitative results of the LOSO cross-validation. The HMM that was trained using the OGD gaze feature significantly outperforms both the HMM trained with the OOI Hits feature and the VGG-16, with more accurate classifications both on a fixation-by-fixation level and on the collapsed action sequence level. Between the two gaze features, the inclusion of peripheral vision significantly increased the f1-score (p = 0.015), while achieving higher mean action sequence sensitivity (p = 0.055), action sequence accuracy (p = 0.221) and action sequence precision (p = 0.136). The VGG-16 showed a classification accuracy of 0.595 ± 0.10 (chance = 0.142) and an f1-score of 0.472 ± 0.09, which was significantly lower than both the OOI Hits-based (p < 0.001) and the OGD-based classifier (p < 0.001). Compared to the LOO evaluation of multiple trials from the same 2 subjects, the addition of 18 trials from 18 new subjects led to a decrease in the classification performance. However, the PVHMM using the OGD feature as input still shows the ability to classify fixation sequences and action sequences with reasonable accuracy, with an average f1-score of 0.849 ± 0.09 and an average action sequence accuracy of 0.840 ± 0.20.

Table 4 Results of the leave-one-subject-out (LOSO) cross-validation of the main study. The PVHMM trained with OGD gaze features outperforms the OOI Hits gaze feature and the VGG-16. The highest achieved value for each performance metric is indicated in [bold]

5 Discussion

In this manuscript, we presented a new egocentric action recognition framework that combines contextual task-related gaze information, video-based object segmentation and a hidden Markov model for the classification of human action sequences in a physical device handling task. The results show that the use of gaze movement data is highly effective for EAR using a small dataset of fewer than 45 egocentric video recordings and improves upon purely image-based classification approaches. The object segmentation algorithm was shown to work well within the PVHMM framework. Smaller objects were harder to detect and the overall performance might improve with a higher detection performance for OOIs such as Safety and Tip. The presented study of the suggested EAR framework has provided the following insights:

The introduction of noisy gaze data leads to a decrease in the PVHMM's classification performance at high noise levels, regardless of the chosen gaze feature input. Contrary to our expectations, small noise of up to 25 px has been shown to slightly increase the average algorithm performance. We assume that a small noise level can shift the gaze center in situations where it lies close to small OOIs, such as the needle tip, to a location where the gaze center coincides with said OOI. Consequently, in some cases, a small noise-induced gaze point offset can lead to an advantageous increase in the algorithm performance that should be investigated further, particularly when using OOI Hits as a gaze feature. Higher noise distortions affect the ability of the HMM to accurately detect the correct action and lead to overall lower performance, since it becomes more likely that an OOI Hit is falsely detected either as a "no Hit" or as a hit on a different OOI in close proximity. Conversely, as the OGD feature does not rely on binary gaze-on-target logic, it shows remarkably higher robustness to noise, confirming the suggestion by Wang et al. (2021) that the quantification of peripheral gaze information can be used effectively for EAR.

High fixation-by-fixation classification performances during the main LOSO cross-validation show that human actions can be detected within the time window of single fixations, which commonly have a duration of 100–600 ms. Compared to other studies, where the data samples of each classifiable action range from 2 to 11 min (Kit and Sullivan 2016), the proposed gaze-based features show the potential for application in real-world scenarios.

The presented PVHMM has shown the capability of processing long video segments in which human actions occur in sequential order, by compressing the available information using a multi-modal object–gaze feature approach. Compared to a pure image processing approach, the PVHMM is able to differentiate more effectively between actions that involve the same objects, background and often very similar hand–object movements, by using the engineered gaze metrics and the temporal separation of tasks. However, the confusion matrices show the HMM models' tendency to misclassify fixations with either the preceding or succeeding action label when processing a coherent sequence of actions. This highlights a shortcoming of the OOI Hits gaze feature, which only carries information about the currently looked-at object and therefore only limited contextual information. During actions that require hand–eye coordination, operators can show gaze transitions between different objects, for example during the assembly of the needle tip onto the insulin pen, leading to our HMM model wrongly classifying these fixations as a different action. These misclassifications were reduced when using the OGD as a gaze feature for the HMM. As the OGD feature incorporates more object features by adding peripheral gaze information, the model is less dependent on exact gaze movements but learns the position of the OOIs relative to the center of attention for each action. Courtemanche et al. (2011) have shown that a stability factor for the transition properties can optimize the reactivity of the HMM, which could be included in future iterations.

The inclusion of gaze movements has shown that misclassifications can occur between actions with similar gaze paths, such as for Setting Units and Injection. In both cases, similar dictionary values can occur where other distinct OOIs lie outside of the foveal, parafoveal and perifoveal field of vision. While the effect of various vision area sizes on the classification performance in these cases should be further investigated, in instances where multiple actions contain identical gaze behavior and OGD dictionary values, a drop in classification performance is to be expected.

In this paper, we have shown that a multi-modal approach using object and gaze features, without the further use of visual features, results in notable EAR performance. When subjected to inter-subject variability, the use of peripheral gaze information as an advanced object-related gaze metric led to significant classification improvements over a purely image-based VGG-16 classification network that does not consider temporal information. While the application of the PVHMM was studied in a limited application context, where the sequential nature of the data benefits the use of HMMs, and further limitations regarding the general applicability to other sequential action-based tasks are still to be explored, we nevertheless provide significant findings regarding the use of gaze data in egocentric action recognition.

The increasing popularity of ready-to-use ET systems and the integration of ET into popular AR and VR head-mounted displays, such as the Hololens 2 and the HTC Vive Pro Eye, have made the technology more accessible to the public. Nevertheless, research regarding the use of eye movement data for the classification of human actions via ML systems has been scarce (Lukander et al. 2017). We have shown that the PVHMM performs well in a specific manual task involving sequential human–device interaction, using a modern connected medical device. In specific real-world applications with similar action sequence orderings, such as standardized surgery procedures, where the collection of large data quantities for the training of state-of-the-art EAR models is often unfeasible, the use of gaze data within frameworks such as the presented PVHMM can constitute a valuable alternative. The ability to achieve f1-scores of up to 95% and action sequence accuracies of up to 87% when using the data of expert individuals can be especially useful for the assessment of applications with repetitive execution and predetermined task sequences, such as the monitoring of assembly processes and the prevention of quality problems in a manufacturing assembly line (Bauters et al. 2018; J. Chen et al. 2019). Furthermore, the use of gaze-enhanced action recognition allows the subsequent calculation of advanced gaze metrics for each action. Consequently, gaze metrics such as the coefficient K (Krejtz et al. 2016), which can be used for discerning ambient and focal attention, can be included to characterize a subject's cognitive load and identify cognitively challenging actions.

We believe that the PVHMM provides a significant contribution to the understanding of eye movements as an input modality and to the advancement of ET-based EAR, showing that accurate results can be achieved in the classification of an object handling task involving small objects and closely related, sequential actions. The combination of context-rich gaze features, such as the OGD, with an HMM classifier has enabled high EAR classification performance using only a small mobile eye tracking dataset of 43 recordings.

6 Limitations

Naturally, the presented evaluations possess some limitations, which we would like to address. We have investigated gaze-enhanced EAR within one specific exemplary use case. Although we have included recordings with varying action sequences, the use of HMMs benefits from a task with a narrow setting, in which the sequence of actions does not vary widely (ideally, the sequence is predetermined and not variable). Consequently, the framework is best suited for tasks with a high degree of ordered action sequences, and the performance of the PVHMM has to be studied in the context of other use cases. In our evaluation, the competing classification method does not consider temporal information. While we chose this approach to compare the benefit of gaze features as a data modality, future comparisons should include temporal models such as an LSTM. The main evaluation of the gaze data from novice individuals shows that the overall accuracy decreases with high inter-person task execution variability. Therefore, future investigations should include an extensive study on the influence of more training data on the PVHMM performance. Finally, the investigated model did not include a no-action class. Consequently, it is unknown how it will behave if action sequences including task-unrelated actions are investigated.

7 Conclusion

In the present work, we introduced the Peripheral Vision-Based HMM (PVHMM), a gaze-enhanced egocentric action recognition framework for the classification of task-specific human action sequences. Through the evaluation of a small MET dataset of a relevant device handling task, we have shown that gaze movements are a valuable informational data modality and can be used to achieve egocentric action recognition. By investigating two object-related gaze features, evidence was given that using context-rich gaze data leads to remarkable classification performances and outperforms a commonly used image-based classifier. Additionally, we were able to show that the use of a peripheral vision-based object–gaze distance (OGD) gaze feature led to increased performance and increased robustness to noise, compared to a traditional gaze feature. The validation using an exemplary manual handling task showed the potential to achieve accurate EAR with limited data resources, creating an opportunity for applications in environments such as performance assessment and surgical workflow analysis in the future.