A deep learning based system for handwashing procedure evaluation

Hand washing preparation can be considered as one of the main strategies for reducing the risk of surgical site contamination and thus the infections risks. Within this context, in this paper we propose an embedded system able to automatically analyze, in real-time, the sequence of images acquired by a depth camera to evaluate the quality of the handwashing procedure. In particular, the designed system runs on an NVIDIA Jetson Nano\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{{\mathrm{TM}}}$$\end{document}TM computing platform. We adopt a convolutional neural network, followed by a majority voting scheme, to classify the movement of the worker according to one of the ten gestures defined by the World Health Organization. To test the proposed system, we collect a dataset built by 74 different video sequences. The results achieved on this dataset confirm the effectiveness of the proposed approach.


Introduction
Nowadays, about 722,000 patients yearly in the world are affected by a healthcare associated infection; among them, 10% of the infected patients eventually die [1]. About 40% of healthcare associated infections are caused by an improper hand hygiene among healthcare workers (hereinafter only workers), which clean their hands less than half of the time they should [2]. Thus, it becomes more and more important for the public safety the adoption of strategies to improve hand hygiene, so as to reduce the healthcare infection rates.
Within this context, it is particularly relevant the surgical hand preparation, aimed at minimizing the risk of surgical site contamination with microorganisms originating from the surgeon's hands.
In 2009, the World Health Organisation (WHO) released guidelines on Hand Hygiene in Healthcare [3] to be adopted as a standard procedure for surgical hand washing; the procedure is composed of a sequence of well-defined gestures (detailed in Table 1) that the worker has to perform in a predefined order and with each gesture having a minimum time duration.
To evaluate the compliance with hand hygiene procedure, three main techniques can be adopted [3]: (1) direct observation of practice; (2) self-report of healthcare workers and (3) indirect calculation, based on the measurement of the products' usage. According to Haas and Larson [4], self-report is not really accurate, while indirect calculation based on the measurement of products does not provide information of non-compliance. On the contrary, direct observation is considered the gold standard, since it provides all the information required for the analysis; unfortunately, this task is typically carried out by human observers, thus it is time-consuming and expensive.
To reduce the costs associated with direct observation, in the last years some automatic or semi-automatic solutions to allow a machine to guarantee direct observation during the hand-washing [5] have been proposed. For

F
Wrap each finger with the other hand and rub it in its entirety with circular movements. The gesture is repeated for both hands.

FA
Wrap the wrist with the other hand, with slow circular movements rub the arm in its entirety slowly rising towards the elbow. The gesture is repeated for both arms.

W
Wet the hands, the movement consists in getting wet from the hand up to the elbow.
machine-based direct observation, it is required to have a system that can (1) monitor the compliance with hand hygiene procedure and also (2) provide a real-time feedback to the worker during the hand washing procedure, allowing workers to improve the gestures and then reduce the risk of infection for the patient. Among the most promising methods, video analytic has surely played a key role. Indeed, one of the most important milestones in this field has been the introduction of camera sensors for data acquisition, as well as the development of algorithms based on artificial intelligence, typically traditional machine learning, for the automatic analysis of the acquired images and videos [6] [7] [8]. Nevertheless, the literature is still quite limited, and a definitive solution to the problem has not been found yet.
In this paper, we propose a method based on deep learning for monitoring the surgeon handwashing procedure. The proposed system has been designed to be applied to both younger staff training and surgeon hand washing evaluation. Indeed, it is able to analyze the sequence of images acquired by a depth sensor and to classify the gestures performed by the worker in real-time. Moreover, the proposed method can provide a compliance feedback score, with respect to the single gesture, in terms of compliance with the guidelines. The continuous feedback about any single gesture allow the worker to realize what the actual performance are, so as to immediately adapt the movement in case of low conformance visual feedback. This approach is particularly relevant during the training phase, allowing the trainees to immediately recognize possible errors, but also to engage surgeon to respect the procedure timely.
With respect to state-of-the-art, we introduce two main novelties: (1) we propose a novel method for continuous surgeon handwashing procedure evaluation based on deep learning; the gestures are analyzed and classified by means of a Convolutional Neural Network; furthermore, the temporal information is taken into account via an overlapped sliding window; indeed, the decision is not taken by evaluating the single frame, but instead by considering a sequence of frames with a majority voting rule; (2) we built a dataset composed of 74 different video sequences; the dataset contains the sequence of gestures officially defined by the Hand Hygiene in Healthcare guidelines; the dataset is freely available under request for benchmark purposes 1 ; according to our knowledge, this is the first dataset on this topic made publicly available.
The paper is organized as follows: Sect. 2 introduces related works. In Sect. 3, we detail the proposed system fd98, together with the experimental setup and the description of our dataset. In Sect. 5, we define the metrics used in our experimentation and report the obtained results. Section 4 describes how the system works. Finally, we draw some conclusions and future work in Sect. 6.

Related work
Among the methods proposed in the last years for direct observation, we can identify two main approaches, depending on the type of sensors used for the measure. In the first category we can identify the systems employing a visual sensor and a video analytic algorithm that automatically analyze the gestures performed by the worker during the hand-washing procedure [9]. The second category includes those systems adopting other kind of sensors, typically wearable, such as smart watches [10]. Harmony [11] is an example of system belonging to the latter category. It is a hand wash monitoring system based on distributed sensors: each worker wears a smart watch, able to collect information related to linear acceleration, gravity, and gyroscope signals. Furthermore, a set of Bluetooth devices is put close to soap dispensers and in areas of interest, such as wash zones and patient bed zones. The smart watch communicates with sensors for dynamic activation and deactivation and analyzes the movement of the worker so as to identify and evaluate her/his gestures. The system is invasive and potentially expensive, since it requires the introduction of a smart watch for each worker to be monitored. Even if this is effective for generic healthcare workers, it can not be worn by surgeons, due to hygiene rules inside the operating rooms. Thus, we do not consider this as well as other similar approach like [12] as feasible for our purposes.
An interesting system is RFWash [13]. In the paper, authors propose use of a radio-frequency (RF) commercialoff-the-shelf mmWave sensor for evaluating the 9 gestures the WHO recommended for alcohol-based handrub. The authors characterize the challenges of recognizing back-toback hand gestures using an RF-based gesture recognition processing pipeline. Indeed, as evident, the lack of pauses between gestures makes segmentation difficult, which, in turn, affects the performance of the subsequent classification component. For that reason a new sequence learning approach that performs segmentation and recognition simultaneously has been proposed. The model is trained using continuous stream of minimally labelled RF data corresponding to naturally performed handrub gestures. The RFWash performance have been carried out using a dataset of 1,800 gesture samples collected from ten subjects over 3 months. A deep architecture has been defined including Convolution layers followed by Max Pooling (2x2), Fully Connected (FC) and Bidirectional LSTM layers, aiming at extracting spatiotemporal gesture features from input RD frames; also, a softmax layer and Connectionist Temporal Classification (CTC) is employed to predict the gesture sequence. From the performance point of view, with sequences of a duration of 5s per gesture the mean gesture error rate (GER) is about 11% while increasing the duration to 10s per gesture the mean GER drop to 7.41%.
In [14], a system for the automatic analysis of the hands after the washing has been proposed: the hands are washed with soap mixed with UV reflective powder; the operator has to insert the hands inside a case equipped with ultraviolet lighting and a digital camera. The presence of ultraviolet lighting leads the skin to show ultraviolet light only on the treated surfaces. The images acquired by the camera are then automatically analyzed by a segmentation algorithm, and the contour of the hand is determined by evaluating the green intensity channel. The green channel pixels belonging to the hand are partitioned in three clusters using c-means clustering algorithm. The optimal threshold between the intensity of clean and dirty areas is extracted using these clusters, to evaluate a percentage of dirty area and the remaining percentage of clean area. Even if based on a camera and on visual inspection, the system, like the previous one, is still quite invasive, since it requires the adding of UV reflective powder in all the soap dispensers; furthermore, the introduction of the UV-lighting case in the equipment could be a further source of contamination, infringing hygiene hand-washing rules for operating rooms.
In [8] a camera sensor put on top of the washing machine is introduced and a video analysis algorithm is proposed to automatically evaluate the quality of the washing procedure. This is among the first methods in which the sequence of gestures performed by the surgeon has been automatically analyzed, thanks to the introduction of a machine learning approach. Indeed, the segmentation step combines information related to both color and motion; then, a tracking procedure based on a single multimodal particle filter and a k-means-based clustering technique is adopted to track both hands and arms. Finally, a SVM ensemble classifier has been employed for recognizing the specific gesture. The use of a traditional camera introduces several issues related to variation in illumination conditions, as well as to the presence of the water. The experimentation has been carried out on 6 different hand poses with detection rates performance ranging from about 86% up to about 97%.
In [6], Xia et al. improved the performance achieved by [8] extending the number of poses to recognise from 6 to 12 (using poses for left and right hands see 1) adopting a SoftKinetic DS325 camera and applying Linear Discriminant Analysis (LDA) classifier. The performance on single frame pose estimation has been evaluated using a Leave-One-Person-Out (LOPO) subject-independent cross-validation protocol, considering both RGB and depth images. To address the high dimensionality of the HOG (Histogram of Gradient) features (2916 dimensions), in each round of the cross-validation the original HOG features are projected in lower dimensional subspace using the Principal Component Analysis (PCA). According to their analysis the LDA classification cost about 0.0139 ms on an Intel i7 with a 3.70 GHz clock. The recognition rate on single frame have been of 94.80% on RGB channel and 92.35% for depth channel. Finally a further experiment has been executed considering a video-pose estimation with a slide windows of different sizes and using a majority voting on single frame classification. With a windows size of 20 frames the achieved recognition rates are 99.37% and 98.31% for the two channels respectively. Authors declared that the recognition rate is 100% when the size of the windows is equal to whole video of the single poses ( Fig. 1).
Many of the proposed solutions requires computing powers such one provided by modern PC and those based on RGB cameras, due to the different environmental condition present tuning and setup problems when installed in different surgery blocks that prevent their application in real context. In [7] the authors use a depth sensor, namely a Kinect sensor, instead than a traditional camera. The system they propose, called WashInDepth, is able to record the washing procedure and determine if the subject has correctly complied with the prescribed guidelines. A background subtraction is applied and a set of hand-crafted features is extracted and then used to fed a decision tree classifier. The experimentation has been carried out with the involvement of 15 participants for two different scenarios. in the person independent scenario (where data from 10 participants were used for training and 5 for testing) the best performance achieved has been of about 55%. In the person depended scenario (Wherein, both the training and test data is from the same person) the best achieved performance has been of about 97%. Both performance have been achieved with a 15 Â 15 block size and smoothing windows of length equal to 50. Another important aspect of the WashInDepth solution is the possibility to run an a computer stick.
This kind of approaches, exploiting camera and video analytics solutions, represent a very important milestone in the scientific literature in this field. Anyway, although the topic is very relevant, as demonstrated by the main recommendation from the WHO for fighting the recent coronavirus pandemic, we can not find a wide literature; this is probably due mainly to the lack of dataset publicly available to be used for training, but also for benchmarking purposes. Indeed, in the era of deep learning, a huge amount of data becomes essential [15].
To face with this issue, we can surely inherit the wide literature available in gesture recognition [16][17][18][19]: indeed, each movement to be performed from the worker can be seen as a specific class of gesture to be recognized and analyzed. Within this context, the deep learning plays a crucial role, since most of the algorithms proposed in the last few years are based on this new frontier of artificial intelligence. Although there is not a standard taxonomy for partitioning the methods for gesture recognition, we can identify two interesting contributions, proposed respectively in [20] and [21]. Their taxonomy is mainly based on the type of sensor used for analysing the movement: visionbased [22], glove-based [23] and depth-based.
According to the above mentioned surveys, the first two approaches are not promising and natural enough, while the most promising methods available in the literature are based on depth cameras, which allows to exploit the third dimension related to the depth. This conclusion is still valid in our specific problem; indeed, the gloves can not be used for hygiene reasons and the vision based system suffers for environmental conditions; vice-versa, depth sensor, able to also evaluate the three dimensional space, seems to be the most suited sensor.
Independently on the sensor adopted for acquiring the set of images, the best results in gesture recognition are typically obtained by using convolutional neural networks (CNNs), which achieve outstanding results outperforming ''non-deep'' state-of-the-art methods [21]. Although a lot of different CNNs has been proposed in the last years [24,25], the new trend seems to be mainly related to the introduction of recurrent neural network (RNNs), such as LSTM, GRU or TCN [26,27], able to automatically encode the temporal information, which is evidently a very important and not negligible feature when dealing with gestures evolving during the time.
Anyway, the main drawback lies in the amount of data required for training when dealing with RNNs, which is typically higher with respect to the CNNs counterpart. This is an important consideration, since in our specific problem the amount of data is quite limited.

Dataset
The sequence of specific gestures to be performed during the hand washing procedure depends on several factors, including the context (e.g., patient care, visit, surgical operation), the type of soap, the use of specific tools like nail cleaners or sponges and so on. In this paper, we focus on the surgical hand washing procedure, as described in [28]. The procedure includes eleven different gestures, which need to be performed in a given order; the details of each gesture are reported in Table 1, together with an abbreviation of the gesture itself, which will be used hereinafter in the paper. Gestures 1 and 11 are exactly the same.
The dataset we propose was collected with the support of professors from the Department of Medicine, Surgery and Dentistry -''Schola Medica Salernitana'' of the University of Salerno, Italy.
The procedure was simulated by 53 different voluntaries, equally distributed between males (27) and females (26). The participants also had different height, so implying that the procedure was performed at different distances with respect to the camera. All the participants signed an informed consent. Each voluntary was properly trained by a medical doctor before performing the procedure; furthermore, during the procedure itself, a video with the specific gesture to be performed was shown to the worker (see Sect. 4 for more details). Also, each sequence was validated by a doctor before its insertion into the dataset.
The camera is mounted at a height above the washbasin of D plane ¼ 0:9m, in a zenithal position; the top view allows for the movements of the hands and of the arms without any occlusions; furthermore, the chosen height also entirely Neural Computing and Applications captures the area where the worker has to move for washing his/her hands.
The camera used for the acquisition of the dataset is an Intel Ò RealSense TM Depth Camera D435. The camera is controlled by a NVIDIA Ò Jetson Nano TM computing platform equipped with Quad-core ARM Ò Cortex TM -A57 CPU, a NVIDIA Maxwell TM with 128 core NVIDIA CUDA Ò GPU and 4GB LPDDR4 64-bit of RAM running Ubuntu operating system. The system was used for building the dataset and for interpreting the worker gestures providing real-time feedback using the designed GUI. The dataset consists of 74 depth video sequences; each video contains the sequence of the ten gestures, obtained by a continuous capture of the whole hand washing procedure performed by a worker. The depth images are represented in 16 bits, where each pixel represents the distance from the camera (in millimeters). Each image is captured at a resolution of 640 Â 480, and the acquisition is performed at 15 frames per second. Sample depth images for each gesture are shown in the rightmost column of Table 1.
The dataset was partitioned into training and test set. The training set is composed of 50 sequences recorded by 41 different subjects; the test set includes the remaining 24 sequences, performed by 12 different subjects. In the whole, the dataset consists of more than 131, 000 frames, as reported in Table 2. The table also reports for each gesture the number of frames and the average duration. We can note that the average duration of the gestures ranges from short gestures (4 seconds), such as W and S, to long gestures (more than 20 seconds), such as N and SH.

Proposed method
In this paper, we formulate the problem of assessing the conformance of the hand washing procedure in terms of a gesture recognition problem. In more details, we train a classifier to associate the frame to one of the ten classes, each class being associated to a gesture.
Each image is cropped to 340 Â 340, so as to only deal with the central region of the image, containing the hands of the worker. Furthermore, a bicubic interpolation is applied for rescaling the image to 170 Â 170. To only isolate regions of interest and to remove the background from the analysis, we apply a threshold on each image; with more details, we consider as background all the pixels in the image at a distance higher than D th , where D th ¼ D plane À . In our experiments, ¼ 1cm. Finally, we halve the size of samples (from 16 bits to 8 bits) and rescale the values as follows: where s o ði; jÞ and s r ði; jÞ are the values of the (i, j)-th pixel of the image before and after rescaling, respectively. The images preprocessed as described above are fed to a classier. Five different architectures are considered in our experimentation, namely VGG19 [29], ResNet50 [30], Xception [31], MobileNet [32] and NASNet [33]. The architectures are chosen to take into account both different dimensions and different typologies of layers. In particular, we considered: (1) networks of different dimensions, namely large (VGG19, NASNet), medium sized (ResNet50, Xception) and small networks (MobileNet); (2) networks based on different concepts, from traditional convolutional layers (VGG19, NASNet) to more modern blocks inspired by Network-In-Network architectures, respectively based on either residual blocks (ResNet50) or on depthwise separable convolutional layers (Xception, MobileNet).
During the training procedure, independently of the specific architecture considered, we applied a data augmentation technique to increase the robustness of the method with respect to the following main situations (Fig. 2): (1) the worker can be either right-handed or lefthanded; (2) while washing the hands, the worker could assume oblique position with respect to the washbasin. Starting from this consideration, we augmented the dataset with (1) flipped and (2) rotated images. The rotation was performed by 10 , 20 and 30 in both directions. We did not consider larger rotations since they cannot be physically done by the worker. For each gesture, the number of frames and the average duration of each gesture is provided for both training and test sets 4 How the system works The system was designed to work in real-world environments. To design the graphical user interface (GUI) of the system, we started with an observation period of medical staff in the frame of the BIPS national research project. During this observation period, we noted medical staff habits and discussed with them several possible GUI alternatives. The most appreciated is the one presented in Fig. 3. An automatic activation function was developed to facilitate the access to the system: it is required to maintain the hands under the camera for at least 3 s and the system starts.
The GUI is divided into three areas. Going in clockwise order, in the upper-left corner area the real-time video of the deep camera is showed. For facilitating the gesture recognition, a white box is impressed over the video, so that the medical staff can easily center his/her hands under the camera at the right distance. In the upper-right corner, a recorded video of the actual gesture to execute according to the WHO procedure is presented.
The bottom-half of the GUI is dedicated to the real-time system feedback. The area is divided into twelve rows. The first row, marked in blue, is a progress bar that indicates the remaining time for completing the actual gesture. This is important since each gesture has a specific duration as defined by the WHO guidelines; the next tens rows provide the real-time gestures recognition feedback and finally the bottom row report the performance: the actual achieved score (updated at the end of the procedure), the daily and weekly best scores. This gaming part was appreciated by the medical staff.
The most relevant part of the GUI is the one dedicated to the single gesture feedback. Differently from what the other systems offer, exploiting the system architecture based on ten different classifiers and the selected hardware platform, our solution can show in real-time the level of classification of the gesture actually recognised by the system. To advise the medical staff with the gesture to be performed, its name is highlighted in yellow (e.g., in Fig. 3 the Interdigital fold has to be executed). The level of compliance of the gesture is provided by a green bar: the longer the bar, the higher is the compliance with the gesture described by the WHO guidelines (full length mean 100% compliance).
The system provides a visual feedback about misclassifications too. Indeed, during the specific gesture execution, the GUI shows (through red bars) the level of classification of the other gestures. Since this happen in real-time, the user immediately realises that the performed gesture is not properly recognised by the system and try to improve its execution in the remaining time. Also this feature was considered very useful from the medical staff.
Finally, when all the gestures were executed, the overall performance is summarised by the GUI (right part of Fig. 3) so that the medical staff next time can improve the execution of those gestures with a lower level of recognition. The Score value is calculated as the average of the estimated accuracy of the gestures.

Experimental results
In this section, we present the performance achieved by the proposed system for conformance assessment of the hand washing procedure carried out by an healthcare worker. We assess the impact on performance of the different choices made during the design phase of the proposed approach by considering two main dimensions of analysis: the deep network architecture and the temporal dimension exploitation.
As regards the first point, we consider different state-ofthe-art deep neural network architectures and for each of them we evaluate performance using different optimizers, loss functions and initialization procedures. Then, we select the network architecture and configuration providing the highest performance on the test set as the base network for the successive studies related to the remaining dimension of analysis.
As for the second point, we explore the beneficial impact that may derive from the exploitation of the temporal information, which allows us to make the classification decision on a sequence of contiguous frames instead of a single frame. In this direction, we consider the aggregation by majority voting, or weighted voting, of the outputs provided by a classifier on all the frames within a sliding window.
The remainder of the current section is organized as follows: in Sect. 5.1 we describe the indices used to measure performance; then, we dedicate a specific subsection to each dimension of analysis: the deep network architecture and the temporal dimension exploitation are discussed in Sects. 5.2 and 5.3, respectively.

Performance indices
Depending on the level of detail and the goal of the analysis, we use several indices to assess the performance achieved by the investigated methods. In particular, the f 1 score is adopted as a single global performance index that compares and ranks different solutions; we also report the Precision and Recall to provide additional information on the type of the errors, i.e., false positive and false negative, respectively. The above three indices are calculated as: Recall ¼ 1 P l2Lŷl j j X l2Lŷ l j jRðy l ;ŷ l Þ ð3Þ where: F 1 ðy l ;ŷ l Þ ¼ 2 Á Pðy l ;ŷ l Þ Á Rðy l ;ŷ l Þ Pðy l ;ŷ l Þ þ Rðy l ;ŷ l Þ ; Pðy l ;ŷ l Þ ¼ y l \ŷ l j j y l j j and Rðy l ;ŷ l Þ ¼ y l \ŷ l j ĵ y l j j ð5Þ L is the set of the labels, y andŷ are the sets of predicted(sample, label) and true(sample, label) pairs, respectively, y l andŷ l the subsets of y andŷ with label l.
We also use the box plots (see Fig. 4 to briefly recall the elements of the box plot representation) as a tool for studying the impact over performance when setting specific values for the various parameters of the deep networks. In particular, we use this representation to graphically depict the collection of values of the f 1 score that are obtained by setting a value for a specific method parameter and by varying all the remaining parameters.
Finally, we use the 10 Â 10 confusion matrix for reporting the performance for each gesture class to focus on the ability of the system to distinguish some hand gestures from others.

Deep network architecture
To select the classifier used as base for the successive design stage regarding the exploitation of the temporal information, we consider various possible choices of the deep network architecture, the optimizer, the loss function, the training initialization procedure. In particular, as for the deep network architecture, we consider MobileNet, NAS-Net, ResNet50, VGG19 and Xception. As optimizer, we consider the following four choices: Adam, Adadelta, SGD, RMSprop [34]. We investigate also on the loss function by considering the mean square error (MSE) and the categorical cross entropy (CCE). As a final element of the study, we consider the starting point of the training procedure, i.e., training the network from scratch using randomly initialized weights (R) or, conversely, using the weights of the corresponding network already trained on a different domain (in this case, we used ImageNet initialization -I).
In our tests, we consider all the possible quadruples of values given by the Cartesian product of the sets of the dimensions of analysis: we evaluate 80 models (5 network architectures Â 4 optimizers Â 2 loss functions Â 2 training procedures). Each model is individually trained by starting with an initial value of the learning rate of 0.001 and then by decreasing it by a 0.3 factor when the validation loss does not increase after six consecutive epochs.
In Table 3, we report the performance achieved on the test set by the considered models with indication of the respective configuration of the parameters. Specifically, the performance of each model is reported in the three rightmost columns of the table and are expressed in terms of the Recall, Precision and f 1 indices.
We notice a large variability of the performance of the different configurations of the classifiers ranging from a minimum value f 1 ¼ 0:654 obtained by InceptionV3 using random weight initialization, using the SGD optimizer and categorical cross-entropy loss function, to the maximum value f 1 ¼ 0:884 achieved by VGG19 architecture, Fig. 4 Elements of the box plot model used for reporting compactly the salient elements of a distribution: they report the 25th, 50th (median) and 75th percentiles, the mean value and standard deviation. The upper and lower fences represent values more and less than 75th and 25th percentiles (3rd and 1st quartiles), respectively, by 1.5 times the difference between the 3rd and 1st quartiles; values above upper fence or below lower fence are generally declared as outliers initialized with ImageNet weights, using the Adam optimizer and the mean square error loss function (reported in bold in Table 3). The large variability of the results does not allow us to find evident correlations between the classification performance and each of the meta-parameters described above. To overcome this limitation, we use the box plots to compare the distributions of the f 1 score obtained by setting a value for each parameter and varying the remaining ones. Consequently, each plot of Fig. 5 deepens the study with regard to the architecture, the weight initialization, the optimizer and the loss function. As an example, the leftmost box plot in Fig. 5.a is derived from the 16 values of the f 1 score reported in Table 3 using the MobileNet as network architecture.
The first observation that we can draw from the analysis of the plots in Fig. 5 is that the weights initialization procedure is the only configuration parameter that has a relevant impact over performance. Fig. 5b suggests a superiority of the I choice with respect to R; this is further confirmed by the fact that 35 over 40 models (almost 90% of cases) trained with weights initialized from ImageNet achieve an f 1 score superior to the corresponding model trained with randomly initialized weights. Conversely, for all the other meta-parameters we do not find a value that relevantly stands alone over the others with respect to the value of the f1 index.
Nevertheless, from Fig. 5a, we notice the high compactness of the box plot related to the NASNet, which denotes a degree of robustness of this network with respect to the considered training parameters that is higher than the other deep networks; in fact, with exception of NASNet, all the other networks are characterized by a large difference between the upper and the lower fences. From a practical standpoint, this may be a relevant aspect when one does not want to train the network by considering all the possible parameters configurations.
From Fig. 5c, we notice that on one side it has to be expected similar performance when choosing Adadelta, Adam and RMSprop while on the other side, at least for the problem under analysis, the SGD optimizer should be avoided.
Finally, Fig. 5.d does not suggest any particular advantage in using the categorical cross entropy or the mean square error as loss function for the problem under consideration.

Temporal dimension exploitation
In this subsection, we analyze how the performance can be improved by exploiting the temporal dimension. The hand gesture occurs over a time interval and thus consists of a sequence of consecutive and highly correlated frames. The approaches considered in Sect. 5.2 take a decision over the generic i-th frame only using information extracted from that frame. Here, we intend to perform classification of the i-th frame by exploiting information of the r consecutive frames in the time window preceding the i-th frame. The four leftmost columns account for the network meta-parameters, where specifically: Arch. stands for the deep network architecture, Init. stands for weight initialization (with R = random, I = from ImageNet), Opt. denotes the optimizer, and L.F.  There is a wide literature on methods for frame sequences analysis applied to gesture recognition. In this regard, the recent trends of the scientific community suggest the use of deep learning methods specifically devised to learn a time-series representation, as recurrent neural networks (RNNs). Long Short Term Memory recurrent networks (LSTM-RNN) represents a notable example of such approaches. However, it is well-known that such a method requires very large train datasets to achieve an acceptable level of generalization. This observation was also confirmed by experiments that we carried out by training a LSTM-RNN with features extracted by the VGG19 base network; as a matter of fact, we obtained poor performance, largely below those yielded by using the network operating at the frame level (for the sake of conciseness we do not report detailed information on the outcomes of this experiment).
As an alternative solution to exploit temporal dimension, we decide to consider a simpler, yet effective, strategy based on the aggregation of the decisions of single frame classifiers operating over a sliding window. In this regard, we consider two aggregation strategies, namely: • Majority vote (MV): each frame is classified individually, then the most represented class in the window is the one assigned to the frame. • Weighted sum (WS): the classification outputs of each class across all the frames in the window are added together, then the class with the highest score is chosen.
The advantage of the second approach is that it does not require an additional training phase to the base singleframe classifier and does not introduce further computational burden. It only requires to set the size of the sliding window. Thus, to study the impact of this parameter on the performance, in Fig. 6 we find the curves of the performance, over the test set expressed in terms of the f 1 index, as a function of the sliding window size. In particular, there are included all the values from 1 to 60, where the performance with window size equal to 1 refers to the single frame classifier. The plots in Fig. 6 are all referred to the network configuration that achieves the highest performance (f 1 ¼ 0:884) on the test set and highlighted in bold in Table 3.
It is immediately evident the beneficial impact of taking the decision on a frame by exploiting also the classification outputs over the preceding frames included in a sliding window. We note that the larger is the sliding window, the higher is the overall performance, which passes from f 1 ¼ 0:884 with window size equal to one to f 1 ¼ 0:953 adopting the MV strategy or f 1 ¼ 0:957 with WS strategy, when the window size reaches the value of 60 frames. It is important to note that, since the dataset adopted for the experimental validation contains video captured at 15 fps, a window size of 60 frames corresponds to a time interval of 4 seconds that is roughly the duration of some gestures of the washing procedure (i.e., W and S). For this reason, the analysis of Figure 6 is stopped at the value of 60 frames. Table 4, shows the confusion matrix calculated over the test set of the best performing deep neural network, according to the previous subsection. The results in the table allow us to deepen the analysis on the performance of the approach with regard to each single gesture class. We notice that for four gestures, namely W, N, F and FA, the class accuracy is above 95%, then there are three other gestures, SN, S and P, with performance around 90%, while only 77:4% of the frames belonging to the SH gesture class is correctly recognized. The low performance on this class is mainly due to the erroneous attribution of 5:4% and 8:8% of samples of this class to the SN and BH. This confusion can be motivated by the presence of a sponge in the former or of the back of the hands in the latter as in the frames of the SH gesture class that the classifier is not able to discriminate correctly when taking the decision on a single frame. Similar considerations can be made on other classification errors, such as the 8:2% S samples wrongly attributed to the W class.
To evaluate how the exploitation of the temporal information contributes to the mitigation of the aforementioned problems, we refer to the example cases of a sliding window with sizes 15, 30 and 60 that corresponds to Fig. 6 Performance assessed on the test set and reported in terms of f 1 score of the top performing classifier (VGG19, initialized on Imagenet, trained using mean square error loss function and Adam optimizer) when a sliding window of 15 frames (on the dataset considered in this paper 15 frames correspond to a time interval of 1 s) a time interval of 1, 2 and 4 seconds, respectively. In particular, in Tables 5, 6, 7 and in Tables 8, 9, 10 we report the confusion matrices over the test set when using the MV and the WS aggregation rules, respectively, for the three values of the window size. As it could be expected from the results in Fig. 6, the larger is the window size the higher is the recognition rate on each single gesture class; furthermore, in all cases, the WS aggregation rule assures a slightly better performance compared to the majority voting. Interestingly, when using WS with a window of 4 seconds, we achieve almost 90% accuracy on the most challenging class, that is SH. Results are referred to the single frame classifier, i.e., sliding windows size equal to 1 Results are obtained setting the size of the sliding window equal to 15 and using the majority voting (MV) aggregation rule Table 6 Confusion matrix on the test set of the VGG19 network with weights initialized on ImageNet, and trained using the mean square error loss function and the Adam optimizer Results are obtained setting the size of the sliding window equal to 30 and using the majority voting (MV) aggregation rule

Conclusions
In this paper, we presented an embedded system for evaluating, in real-time at what extent the medical staff is compliant with the surgical hadwashing procedure as defined by the WHO. The system exploits a deep convolutional neural network to analyze the frames captured by a depth camera. In the design of the system, we considered five well-established deep neural networks architecture (MobileNet, NASNet, ResNet50, VGG19, Xception), with  Results are obtained setting the size of the sliding window equal to 60 and using the majority voting (MV) aggregation rule  Results are obtained setting the size of the sliding window equal to 15 and using the weighted sum (WS) aggregation rule Table 9 Confusion matrix on the test set of the VGG19 network with weights initialized on ImageNet, and trained using the mean square error loss function and the Adam optimizer Results are obtained setting the size of the sliding window equal to 30 and using the weighted sum (WS) aggregation rule

Neural Computing and Applications
four optimizers (Adam, Adadelta, SGD, RMSprop), two loss functions (mean square error and the categorical cross entropy) and two approaches for weight initialization (random and from the network trained in a different domain). We also verified that the aggregation of the decisions on a sliding window allowed us to significantly improve performance with respect to the decision taken on single frames. To this aim, in the tests we have explored both majority voting and weighted sum decision aggregation rules. The experimental analysis was conducted using the dataset collected with the support of the medical staff of the Department of Medicine, Surgery and Dentistry -''Schola Medica Salernitana'' of the University of Salerno, Italy. It included 74 video sequences, each referring to the execution of a complete hand washing procedure, and overall it comprised more than 131, 000 frames; the videos were captured at constant frame rate of 15 frames per seconds, corresponding to more than 2 hours of video footage. To the best of our knowledge, this is the first dataset in the literature dedicated to this highly specific problem in the area of gesture recognition. The dataset is made publicly available for scientific purposes upon request.
At the end of the experimental analysis, the best performing configuration was based on VGG19, initialized on ImageNet, trained using the mean square error loss function via the Adam optimizer; the proposed method achieved valuable performance in classifying the gestures among the 10 different classes defined by the WHO procedure, with an F 1 score of 0.957 by aggregating the classification outputs on window of 4 frames using a weighted sum.
We started to work of the system several months before the the outbreak of the COVID-19 pandemic. It is a preliminary system, thus presenting several limitations. Due to the relevance of the hand hygiene, we plan to extend the dataset for obtaining a more extensive and significant performance assessment of the proposed approach; this will also allow us to explore the adoption of more sophisticated network architecture specifically suited for operating on video sequences, such as the RNNs. Furthermore, we will also consider the possibility of extending the proposed method to recognize the handwashing gestures using alcohol-based solutions that can be adopted for paramedical staff, patients and visiting persons, but it may be also adopted in all the industrial sectors where careful hands hygiene is mandatory, such as like food preparation, conservation industry, restaurants.

Declarations
Conflict of interest The authors declare that there is no conflict of interests regarding the publication of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright  Results are obtained setting the size of the sliding window equal to 60 and using the weighted sum (WS) aggregation rule

Neural Computing and Applications
holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.