Enhancing COVID-19 tracking apps with human activity recognition using a deep convolutional neural network and HAR-images

With the emergence of COVID-19, mobile health applications have become increasingly crucial for contact tracing, information dissemination, and pandemic control in general. These apps warn users who have been close to an infected person for a sufficient time and who are therefore potentially at risk. The accuracy of the distance measurement heavily affects the estimated probability of infection. Most of these applications estimate distance from the electromagnetic field produced by Bluetooth Low Energy technology. Nevertheless, radio interference deriving from numerous factors, such as crowding, obstacles, and user activity, can lead to wrong distance estimates and, in turn, to wrong decisions. Besides, most of the social distancing criteria recognized worldwide prescribe different distances depending on the activity of the person and on the surrounding environment. In this study, in order to enhance the performance of COVID-19 tracking apps, a human activity classifier based on a deep convolutional neural network is provided. In particular, the raw data coming from the accelerometer of a smartphone are arranged to form a multi-channel image (HAR-Image), which is used as a fingerprint of the in-progress activity and can serve as an additional input to tracking applications. Experimental results, obtained by analyzing real data, have shown that HAR-Images are effective features for human activity recognition. Indeed, k-fold cross-validation on a real dataset achieved an accuracy very close to 100%.


Introduction
The COVID-19 outbreak has pushed health authorities to fight an unprecedented battle against time. Since its first occurrence in Wuhan, reported on December 31, 2019, the SARS-CoV-2 virus has spread to more than 200 countries around the world, with a case fatality rate (CFR) of 2.25% and an infection fatality rate (IFR) of 0.68% [38]. As reported by the World Health Organization (WHO), at the time of writing (December 16, 2020) there have been 71,919,725 confirmed cases of COVID-19 in the world, including 1,623,064 deaths [60]. As depicted in Fig. 1, the number of deaths is constantly increasing. Numerous countermeasures have been undertaken in the last months to cope with the pandemic, pending the long-awaited vaccine. Although the biological and medical fields are the most active ones [47], many other disciplines are involved in providing useful support [13,21]. In this respect, COVID-19 has changed how the world does science. Hundreds of clinical trials have been launched around the globe, gathering together thousands of researchers, medical doctors, hospitals, and laboratories. Never before have scientists from all over the world focused on a single topic and set aside almost all other research.
Despite these efforts, and even though some vaccines are starting to be distributed around the world, the tracking of infected people still seems the most affordable approach to keeping the spread of the pandemic under control. In this direction, solutions derived from modern mobile communications technologies seem to be the most promising options available [26,41,43,53]. In particular, mobile health applications for smartphones have been adopted by various States around the world to provide useful tools for contact tracing, information dissemination, and pandemic control in general [18]. For example, in Singapore, a mobile app called "TraceTogether" is used to track the virus spreading after an individual has been infected. "Immuni" is a similar app endorsed by the Italian government. In Switzerland, "SwissCovid" is the official mobile app managed by the Federal Office of Public Health (FOPH). Further, in the U.S. and UK, "COVID Symptom Tracker" has been deployed by the Coronavirus Pandemic Epidemiology Consortium (COPE). A search of the most popular app stores, such as the Apple App Store and Google Play Store, for keywords such as Covid and SARS-CoV-2 gives an idea of the huge number of apps related to COVID-19. The rapid proliferation of these apps has exacerbated the well-known problem of app-quality assessment [11]; to cope with it, a new metric named Mobile App Rating Scale (MARS) has recently been introduced [51]. A systematic review of the most popular apps for COVID-19 can be found in [18].
The main aim of these apps is to inform users if they have been close to an infected person for a time sufficient to be considered at high risk of contagion. Therefore, distance and position estimation are required to be as accurate as possible. Although the positioning task has largely been solved for outdoor situations by using the Global Positioning System (GPS) along with precision-refining methodologies [36], other technologies need to be used indoors. Typically, indoor positioning systems (IPS) include a network of devices equipped with different technologies, which can be based on radio waves, such as WiFi and Bluetooth Low Energy (BLE), as well as on optical and acoustic solutions, such as LiFi technology [40] and ultrasonic sensors, respectively. Nevertheless, as is known, the distance between individuals can be estimated by using the Time-of-Flight (TOF) method without using the position of the involved devices. Due to its low energy consumption, easy deployment, low cost, and widespread availability, BLE is the most widely used solution for distance estimation. More specifically, the RSSI (Received Signal Strength Indicator) is evaluated to estimate the distance between two devices/individuals through the Friis equation [24]; a minimal sketch of this RSSI-to-distance conversion is given after the list below. Nevertheless, the environment in which a BLE system operates has a decisive influence on the intensity of the received signal and, hence, on the RSSI correctness. This can create problems when trying to estimate the distance. Factors that can introduce variability in the estimation of signal intensity are [50]:
• Metals and other reflective materials, which cause signal bounce;
• Liquid elements that absorb the signal. In this regard, crowds of people represent a serious drawback: the human body, being composed mostly of water, constitutes an important barrier to the propagation of Bluetooth signals in the 2.4 GHz ISM band;
• Physical obstacles;
• Height differences between devices;
• Relative orientation of the devices.
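As a hedged illustration of the RSSI-to-distance step mentioned above, the following sketch uses the log-distance path-loss model, a common practical stand-in for the idealized Friis equation. The calibration constants (the measured power at 1 m and the path-loss exponent) are illustrative assumptions that would have to be calibrated per device and environment; they are not taken from any specific tracking app.

```python
# Illustrative sketch: RSSI-based distance estimation with the
# log-distance path-loss model. Constants are assumptions, not values
# prescribed by any particular contact-tracing application.

def rssi_to_distance(rssi_dbm: float,
                     measured_power_dbm: float = -59.0,   # assumed RSSI at 1 m
                     path_loss_exponent: float = 2.0) -> float:
    """Return an approximate distance in meters for a given RSSI reading."""
    return 10 ** ((measured_power_dbm - rssi_dbm) / (10 * path_loss_exponent))

if __name__ == "__main__":
    for rssi in (-55, -65, -75, -85):
        print(f"RSSI {rssi} dBm -> ~{rssi_to_distance(rssi):.2f} m")
```

In practice the path-loss exponent grows in cluttered indoor environments, which is exactly the source of error discussed in the list above.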
Although advances in Machine Learning (ML), Deep Learning (DL), and Artificial Intelligence (AI) in general have allowed the development of RSSI real-time correction algorithms [34], IPS and its application to distance estimation remain a challenging problem.
Besides, most of the worldwide recognized social distancing laws and criteria prescribe different distances depending on the activity of the person and on the surrounding environment. For example, the distance between two individuals can be reduced if face masks are used or if the persons are in outdoor spaces, whereas the distance must be increased if they are indoors or if they are engaged in physical activities, such as walking, running, or other sports. Therefore, information about the surrounding environment and the user activity can heavily affect the utility of mobile health applications.
In this study, in order to enhance the performance of COVID-19 tracking apps, a human activity classifier based on a Deep Convolutional Neural Network (DCNN) is provided. In particular, the raw data coming from the accelerometer of a smartphone are arranged to form a multi-channel image (HAR-Image), which is used as a fingerprint of the in-progress activity. Our aim is to provide a system able to automatically learn useful information about the device owner's activity and, consequently, about the surrounding environment. Such a system can be integrated as an additional function into tracking applications implementing the exposure estimation rules, in order to empower their capabilities.
The comparison with the most common machine learning-based classifiers has shown that the aforementioned HAR-Images are extremely effective features for Human Activity Recognition (HAR). Indeed, when evaluated through k-fold cross-validation, the proposed classifier achieved an accuracy very close to 100%.
The remainder of the paper is organized as follows. Firstly, in Sect. 2, contact tracing is presented and discussed. In Sect. 3, the state-of-the-art methods for HAR based on Machine Learning and Artificial Intelligence are reviewed. In Sect. 4, the HAR-Images and the DCNN are presented in detail, whereas Sect. 5 describes the experiments and reports the results. Finally, Sect. 6 is devoted to the conclusions and future works.

Contact tracing
Contact tracing is the main countermeasure adopted by public health authorities to cope with the COVID-19 disease [30]. More specifically, contact tracing refers to the ability to reconstruct the contact chains of virus-positive people. The tracking can also take place in a "traditional" way, by interviewing positive people, tracing the situations in which they could have endangered the health of people close to them, and properly warning those people. Nevertheless, there are situations, such as having been in line at the supermarket, in a bar, or in an office, in which we do not know exactly whom we met. In these situations, digital support systems, such as an app, can represent an effective help. A contact tracing app works as follows: when a user downloads the app, a code (id-code) is associated with it, and this code changes several times a day. Using Bluetooth technology, the smartphones within Bluetooth range exchange these codes, without anyone, not even the system, knowing who they correspond to. To preserve privacy, the app does not collect any personal data, such as name, age, address, telephone number, or email. When a person is found to be positive to COVID-19, their id-code is entered by the healthcare staff into a system from which each smartphone picks it up and compares it with the list of codes it has registered. If it turns out that the smartphone has had risky contact with the one associated with a positive person, appropriate behavioral indications are provided to the involved user.
There are two types of apps, namely those that make use of centralized databases and those that use decentralized ones. In both cases, when individual A meets B, their smartphones exchange encrypted id-codes [42]. If A becomes positive, the status of A's app is updated. At this point, in centralized systems the data remain on the server, whereas in decentralized systems the data remain on the user's smartphone.
Regardless of the type, the apps need to comply with the guidelines issued by worldwide health organizations (see, for example, Fig. 2). The cases considered to be at low risk are:
• A person who has been in an enclosed environment with a COVID-19 case for less than 15 min or at a distance greater than 2 m.
• A person who has had face-to-face contact with a COVID-19 case for less than 15 min and at a distance of less than 2 m.
• Traveling together with a COVID-19 case in any type of transport at a distance greater than 2 m.
Finally, the ECDC explains that a "longer contact duration increases the risk of transmission"; the 15-min limit is arbitrarily chosen for practical purposes. Notice that, according to the World Health Organization, the incubation period, during which a person may nevertheless be contagious, ranges from 2 to 14 days. It is not easy to know and remember whom we have been close to, within 2 m, for about 15 min, over the last two weeks. For this reason, an app able to estimate the proximity of two devices can make an effective contribution. Further, a point-to-point detection system eliminates the need for central data collection by States and can be sufficient for tracing people by considering the threshold of 15 min of exposure. Indeed, the goal of these applications is not to warn people that certain distance thresholds have been exceeded, but to be able, a posteriori and many days later, to warn them that a proximity contact has occurred with a person who subsequently tested positive.
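To make the decision logic described above concrete, the sketch below flags a contact as high-risk when the cumulative time spent within the distance threshold exceeds 15 minutes. The 2 m and 15 min constants mirror the guidelines discussed in this section, while the data layout (a list of periodic distance estimates) and the helper name is_high_risk are simplifying assumptions, not part of any specific app.

```python
# Illustrative exposure-risk rule: cumulative close-contact time against a
# distance threshold, mirroring the 2 m / 15 min guideline in the text.

DISTANCE_THRESHOLD_M = 2.0
EXPOSURE_THRESHOLD_S = 15 * 60

def is_high_risk(distances_m, sample_period_s: float = 60.0) -> bool:
    """distances_m: iterable of estimated distances (m), one per sample period."""
    close_time_s = sum(sample_period_s for d in distances_m
                       if d <= DISTANCE_THRESHOLD_M)
    return close_time_s >= EXPOSURE_THRESHOLD_S

if __name__ == "__main__":
    # 20 one-minute samples, 16 of them closer than 2 m -> flagged as high risk
    distances = [1.2] * 16 + [3.5] * 4
    print(is_high_risk(distances))  # True
```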

Human activity recognition and related work
In estimating exposure risks, effective contact tracing cannot disregard the activities that characterize the specific behavior of the involved users (whether they are running, walking, standing, sitting, etc.). Thus, automatic HAR assumes paramount importance in supporting tracking applications. Although the human activity recognition task has been extensively studied in the last decades [28], it remains a challenging problem that needs to be addressed and solved. Its main use is in eldercare and healthcare applications, especially when it is combined with other technologies, such as the Internet of Things (IoT). HAR can be performed by using many technologies, but nowadays the proliferation of small-sized electronic equipment and the large usage of AI- and ML-based algorithms in many research and industrial fields [12,14,15,16] have allowed the spread of HAR solutions leveraging the built-in sensors of smartphones. Typical human activities that can be recognized by HAR systems are walking, sleeping, driving, sitting, running, standing, cooking, etc. Nevertheless, HAR is widely used in many other real applications, such as crime monitoring [39], home automation [22], suspicious human activity detection [55], daily activity monitoring [17], gesture recognition [31], and military applications [35]. Data collected by inertial systems, video frames, and images are usually processed to perform HAR. In this regard, HAR is generally performed first by processing univariate or multivariate time series in order to extract useful and effective features, and then by inferring activities through ML-based techniques. Deep neural networks (DNN) are the most promising tools for automatically learning features from raw data, and they have outperformed the signal processing approach, which requires great knowledge of the specific domain and a manual design of the features [58].
The current approaches to HAR can be classified into two main categories, namely vision-based and sensor-based. The core processing of the former includes the classical steps adopted in computer vision for mining useful features, such as data preprocessing, cleaning, segmentation, feature extraction, and finally classification. Due to the huge diffusion of built-in cameras in mobile electronic devices, in the last decades there has been a great demand for automatic image processing, and many approaches have therefore been proposed for video-based HAR. The most significant advances in this field can be found in [5], where the HAR task is categorized according to several criteria. A taxonomy-based approach and an enlightening comparison among different methods are provided by Aggarwal et al. in [1]. Again, in [29], the authors discuss the advantages and disadvantages of different features mined from images. Many other surveys can be found in the literature [7,48,52,56]. However, as reported in [57], issues related to shadows, observation angle, background colors, light intensity, and more can negatively affect the HAR quality.
In contrast, the built-in sensors of smartphones can overcome these issues and can be effectively used for HAR. Besides, many daily activities, such as sitting, standing, and walking, are strongly related to gravity and acceleration, and can therefore be identified by using three-axis accelerometers, gyroscopes, and other sensors commonly integrated into smartphones. A detailed description of how smartphone accelerometers are used for HAR can be found in [4], while in [37] the combined usage of accelerometer, gyroscope, and magnetometer sensors along with a deep neural network is shown. Also, in [3], the authors propose a HAR system capable of identifying 20 activities by using five wearable dual-axis accelerometers and an ML-based classifier, achieving an accuracy of 84%. Again, in [49], five transport activities are identified by using only the smartphone inertial sensors.
Another important aspect to be considered in sensor-based HAR is that time series-based classification requires partitioning the signals into temporal windows, each associated with an activity. Usually, fixed-length temporal windows that are shifted over the time series are used (the sliding window approach). For example, in [25] a HAR system able to recognize six activities by applying a Deep Recurrent Neural Network (DRNN) is presented. The paper shows how the size of the sliding window and its offset (the so-called stride) affect the recognition time (throughput); a best recognition rate of 95.03% was achieved. In [8], the time series of each axis of a three-axis accelerometer were partitioned with a 50-sample window, corresponding to 2.5 seconds, and then given as input to three different recurrent networks, one for each axis. The model proved able to distinguish six activities with an accuracy of 95.1%. A dynamic sliding window is instead shown in [44]: the approach proposed by the authors is able to adjust the window size and the stride at every step of the training. The experimental results have shown that the model achieves good performance; indeed, the best results were 95.32% and 97.69% for the recall and precision metrics, respectively.
However, as reported in [2], the choice of the right window size and offset is an open challenge, because it depends on many factors, such as the nature of the sensor data, the activity to be recognized, and the designed model.
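The following minimal sketch shows the fixed-length sliding-window segmentation discussed above, assuming a NumPy array holding a multivariate sensor stream; the window size and stride are precisely the free parameters whose choice is debated in [2].

```python
import numpy as np

def sliding_windows(signal: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Segment a (T, q) multivariate time series into (n_windows, window, q) blocks."""
    n_windows = (len(signal) - window) // stride + 1
    return np.stack([signal[i * stride:i * stride + window]
                     for i in range(n_windows)])

if __name__ == "__main__":
    x = np.random.randn(1000, 3)           # e.g. a 3-axis accelerometer stream
    w = sliding_windows(x, window=200, stride=10)
    print(w.shape)                          # (81, 200, 3)
```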
Finally, owing to their similarity to the approach shown in this study, we report some of the most notable methods using Convolutional Neural Networks (CNNs) for HAR. Although CNNs were originally designed for dealing with images, their use in many other application fields has proven to be very effective. Indeed, 1D signals can also be processed by a CNN by exploiting the aforementioned sliding window approach. The first important study reporting the usage of a CNN for HAR can be found in [63]. The authors use a different CNN for each axis of the accelerometer, and the CNN outputs are then arranged into a single flattened vector, which is used as a hidden layer of a fully connected neural network. The achieved accuracy is 88.19%, 76.83%, and 96.88% on three different datasets, respectively. A divide-and-conquer approach together with a 1D CNN is used in [9]: the activities are first classified into static and dynamic classes by a binary 1D CNN-based classifier, and then, for each class, specific activities are discriminated by a 3-class 1D CNN classifier. The results show an accuracy of 97.62% and 94.2% when tested on the UCI and OPPORTUNITY datasets, respectively. Images derived by applying the Fast Fourier Transform (FFT) to each signal of a three-axis accelerometer and gyroscope are used in [27]: the resulting 28 × 28 images are used as input to a CNN, achieving 88%, 87%, and 87% for precision, recall, and F1 score, respectively, when the method is applied to the recognition of eight human activities.
A comprehensive study of different techniques used for making HAR based on machine learning algorithms and deep networks can be found in [54].

HAR-images based activity recognition
In this Section, the proposed approach is presented in detail. To this end, the mathematical representation of HAR is first provided, the HAR-Images are then described, and finally it is explained how they are used by a CNN to mine effective features.

Mathematical formalism of HAR
According to the definition given in [10], human activities can be seen as a set of actions performed by the user in a given environment over a temporal period.
Accordingly, let $A = \{A_1, A_2, \ldots, A_n\}$ be the set of activities to be recognized, and let $S = \{S_1, S_2, \ldots, S_m\}$ be the set of sensors involved in capturing the data related to the activities.
Then, a specific action $a \in A$ that occurs in a certain time window $t$ can be associated with a tuple of time series $d_t$ capturing the activity information, whose elements $r_t^{(l)}$ are the sensor readings acquired over $t$.
Note that, because a sensor can provide multiple time series as a result of the measurement action, each $r_t^{(l)}$ is a $q^{(l)}$-dimensional vector of time series, with $q^{(l)}$ the number of time series provided by the sensor $l \in S$. Besides, because each sensor has its own sampling frequency $f^{(l)}$, the number of samples $\nu^{(l)}$ captured in the given temporal window $t$ is given by

$\nu^{(l)} = f^{(l)} \cdot t .$

As a consequence, each $r_t^{(l)}$ is a $(\nu^{(l)} \times q^{(l)})$-dimensional matrix, while $d_t$ can be represented as a $(\nu \times q)$-dimensional matrix, where $\nu = \max_{l \in S} \nu^{(l)}$ and $q = \sum_{l=1}^{p} q^{(l)}$, with $p$ the number of sensors involved in the measurement. More specifically, $d_t$ can be represented by a tensor having the structure depicted in Fig. 3.
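As a small worked example of these quantities (using, as an assumed setting, the sensor configuration of the experiments later in the paper): a single three-axis accelerometer ($p = 1$, $q = 3$) sampled at $f = 20$ Hz over a window of $t = 10$ s yields $\nu = f \cdot t = 200$ samples, so $d_t$ is a 200 × 3 matrix. The snippet below merely makes these dimensions explicit with placeholder data.

```python
import numpy as np

f_hz = 20          # accelerometer sampling frequency (Hz)
t_s = 10           # window length (s)
nu = f_hz * t_s    # number of samples captured in the window: 200
q = 3              # time series per sensor (X, Y, Z axes)

d_t = np.zeros((nu, q))   # the (nu x q) reading matrix for one window
print(d_t.shape)          # (200, 3)
```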
The goal is to build a map $\mathcal{F}$ between the action occurring over $t$, $\tilde{a}_t \in A$, and the sensor readings $d_t$, that is,

$\tilde{a}_t = \mathcal{F}(d_t) .$

Note that $\mathcal{F}$ provides an estimation of the action associated with the sensor readings, so $\tilde{a}_t$ is to be considered the estimated action, which needs to be compared with the corresponding true action $\hat{a}_t$ (ground truth).
Hence, the HAR goal is to find an optimal $\mathcal{F}$ by minimizing the discrepancy between the estimated action $\tilde{a}_t$ and the true one $\hat{a}_t$.
Typically, a loss function $\Gamma$ is employed to represent this discrepancy, as follows:

$\min_{\mathcal{F}} \; \Gamma\big(\mathcal{F}(d_t), \hat{a}_t\big) .$

Besides, the time series are not directly used as input to $\mathcal{F}$; rather, a projection $\Phi$ of them onto a new $d$-dimensional state space is employed, that is,

$\Phi(d_t) = (\phi_1, \phi_2, \ldots, \phi_d) ,$

where the $\phi_i$ are the new features.
Ultimately, the goal is to minimize this loss for any $a \in A$.

HAR-images
Usually, sensor-based HAR is treated as a classical time series analysis problem, and is therefore addressed by using recurrent neural networks, such as LSTM (Long Short-Term Memory) networks [8,25,62]. In this study, on the contrary, we investigate the feasibility of recognizing human activities by using convolutional neural networks. To this purpose, images need to be derived from the raw sensor data. We refer to them as HAR-Images.

Fig. 3 Data structure of $d_t$: there are $p$ sensors involved, and each sensor reading $r_t^{(l)}$ is associated with a different number $q^{(l)}$ of time series, each of length $\nu^{(l)}$.

With reference to the formalism above, any sensor reading is first windowed and then used to build the HAR-Images as follows. As mentioned earlier, any $r_t^{(l)}$ is a $(\nu^{(l)} \times q^{(l)})$-dimensional matrix, so two generic columns of this matrix, $r_t^{(l),j}$ and $r_t^{(l),k}$, may be used to build an image on an x-y plane. The dots on this plane are obtained by using one column as the x-coordinate and the other as the y-coordinate. Hence, for any temporal window $t$, the derived image includes $\nu^{(l)}$ dots. Besides, for a sensor $l \in S$ with $q^{(l)}$ readings, the number of distinct admissible column pairs, and hence of images, $nC^{(l)}$, is given by

$nC^{(l)} = \binom{q^{(l)}}{2} = \frac{q^{(l)}\,(q^{(l)}-1)}{2} .$

Note that the image size of each channel depends on the minimum and maximum values assumed by each column involved in the given temporal window $t$; indeed, the coordinates of any dot are determined by the values of these columns at a given index. Therefore, let $\delta^{(l)}$ be the difference between the maximum and the minimum values assumed by the readings of sensor $l$; then the size of each corresponding image-channel, $chSize^{(l)}$, is given by

$chSize^{(l)} = \delta^{(l)} \times \delta^{(l)} .$

However, in order to provide more informative content to the image-channels, each dot on the x-y plane is associated with a number ranging from 0 to 255, as happens in the matrix representation of images. Such a number represents the number of occurrences in which a dot falls in the same position. However, because the sensor readings include different values, it could happen that no dots fall in the same position.
To address this issue, a quantization process is performed on the images. More precisely, each column is first quantized with a given quantization step $Q$, and the resulting discrete sets are then used as coordinates of the dots. As a consequence, the size of the quantized image-channel, $chSizeQ^{(l)}$, becomes

$chSizeQ^{(l)} = \frac{\delta^{(l)}}{Q} \times \frac{\delta^{(l)}}{Q} .$

Ultimately, each image built on a pair of columns can be considered as a different channel of a HAR-Image. The same procedure is adopted for each sensor, so that, for a given $t$, the number of HAR-Images is $p$, and each HAR-Image includes $nC^{(l)}$ channels.
For example, for a three-axis accelerometer ($q = 3$ and $p = 1$), the different time series associated with each axis captured in a temporal window $t$ lead to $nC = 3$, that is, a unique ($p = 1$) HAR-Image including three channels. The first image-channel is derived by considering the axes X-Y, the second X-Z, and the third Y-Z.
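The sketch below builds the three channels of a HAR-Image from one accelerometer window by quantizing each pair of axes with step Q and counting how many samples fall on the same quantized (x, y) position, capped at 255 as described above. The fixed square canvas and the per-window, minimum-based origin are implementation assumptions, since the text leaves these details to the figures.

```python
import numpy as np
from itertools import combinations

def har_image(window: np.ndarray, q_step: float = 0.025, size: int = 40) -> np.ndarray:
    """Build a multi-channel HAR-Image from one (nu, q) sensor window.

    Each channel is an occurrence map of one pair of axes: sample values are
    quantized with step q_step and every quantized (x, y) pair increments the
    corresponding pixel, capped at 255 as in a grayscale image. The fixed
    (size x size) canvas is an implementation assumption.
    """
    n_axes = window.shape[1]
    channels = []
    for j, k in combinations(range(n_axes), 2):            # e.g. XY, XZ, YZ
        x = np.floor((window[:, j] - window[:, j].min()) / q_step).astype(int)
        y = np.floor((window[:, k] - window[:, k].min()) / q_step).astype(int)
        img = np.zeros((size, size), dtype=np.uint8)
        for xi, yi in zip(np.clip(x, 0, size - 1), np.clip(y, 0, size - 1)):
            if img[xi, yi] < 255:                          # avoid uint8 wrap-around
                img[xi, yi] += 1
        channels.append(img)
    return np.stack(channels, axis=-1)                     # (size, size, n_pairs)

if __name__ == "__main__":
    window = np.random.randn(200, 3)                       # one 10 s window at 20 Hz
    print(har_image(window).shape)                         # (40, 40, 3)
```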
With reference to the projection $\Phi$ introduced above, the HAR-Images represent a kind of projection of the time series into a $(p \times \delta/Q \times \delta/Q \times nC)$-dimensional state space. Nevertheless, as shown in the next subsection, such a projection includes further transformations.

HAR-features mining and deep convolutional neural network
Once the HAR-Images are built, relevant features need to be mined from them in order to build a classifier. Pattern recognition (PR) based approaches have been extensively used for HAR in the last decades [33]. Although PR has proven to be effective in cases in which only a few activities need to be recognized, it has shown several weaknesses in most daily cases. Firstly, the features are mined in a heuristic, hand-crafted way, so the effectiveness of this approach can be heavily affected by human skills in the specific domain. Secondly, such features are often related to simple statistical measures, such as the standard deviation and the mean, which are not able to capture deep insights. Thirdly, traditional PR requires many labeled data, which may not always be available. Finally, most of the existing PR-based approaches make use of static data; as a consequence, they are not adequate for dealing with data streams and incremental learning. On the contrary, deep learning is able to overcome these issues by performing feature extraction and model building simultaneously in a single process [6]. Further, the features are learned automatically, they can represent complex activities, and the insights learned can be transferred to new, similar tasks.
Among the many deep learning-based approaches, we focus on the Deep Convolutional Neural Network, which is a stacked network including a Convolutional Neural Network and a Deep Neural Network. DNNs can be seen as a special category of the more general Artificial Neural Network (ANN). The main difference between them is that a DNN makes use of more hidden layers than a classical ANN. This gives DNNs the capability of mining more representative and salient features through a process in which more complex features are derived from less complex ones: layers near the input represent simple features, while layers closer to the output represent increasingly complex features. Generally, DNNs are used after other deep networks, such as CNNs.
CNNs are extensively used in computer vision, and they have proven highly effective at image classification and at speech and text analysis. As is known, CNNs rely on the convolution operation followed by a pooling process.
As described earlier, for a sensor $l \in S$, the corresponding HAR-Image includes $nC^{(l)}$ image-channels, each of size $(\delta^{(l)}/Q \times \delta^{(l)}/Q)$. Thus, let $X^{c,(l)} \in \mathbb{R}^{\delta^{(l)}/Q \times \delta^{(l)}/Q}$ be the c-th single image-channel of the HAR-Image associated with sensor $l$, and let $K^{f,(l)} \in \mathbb{R}^{a \times b}$ be the f-th of the $N_f$ convolutional filters; then the convolution operation between the CNN input $X^{c,(l)}$ and each filter produces an output $Y^{c,(l)}$ whose entries are

$Y^{c,(l)}_{i,j} = \sum_{u=1}^{a} \sum_{v=1}^{b} K^{f,(l)}_{u,v}\, X^{c,(l)}_{(i-1)S_x + u,\;(j-1)S_y + v} ,$

with $Y^{c,(l)}_{i,j}$ the entries of the output matrix $Y^{c,(l)}$. The row size $Y^{c,(l)}_x$ and column size $Y^{c,(l)}_y$ of $Y^{c,(l)}$ are given by

$Y^{c,(l)}_x = \left\lfloor \frac{\delta^{(l)}/Q - a + 2P}{S_x} \right\rfloor + 1, \qquad Y^{c,(l)}_y = \left\lfloor \frac{\delta^{(l)}/Q - b + 2P}{S_y} \right\rfloor + 1 ,$

where $S_x$ and $S_y$ control the shifting (called stride) of the filter along the x and y axes of the input, while $P$ (padding) defines the number of zeros added around the border of $X^{c,(l)}$ in order to match the output size to that of the input, without compromising the result of the convolution operation.
To take into account the presence of the multiple channels of the sensor involved, a multi-channel 2D convolution is used. It consists of the element-wise summation of the outputs of the convolution performed on each channel of the given sensor $l \in S$:

$Y^{(l)}_{i,j} = \sum_{c=1}^{nC^{(l)}} Y^{c,(l)}_{i,j} ,$

where $Y^{(l)}_{i,j}$ are the entries of $Y^{(l)}$. Note that all image-channels of a HAR-Image associated with a sensor $l$ have the same size. Thus, with an abuse of notation, we can state that the size of $Y^{(l)}$ is

$(Y^{(l)}_x \times Y^{(l)}_y) = (Y^{c,(l)}_x \times Y^{c,(l)}_y)$

for any $c = 1, \ldots, nC^{(l)}$.
Once the convolution operation has ended, a pooling process is generally performed. It is used to reduce the size of the convolutional output in order to speed up the subsequent computations. To accomplish this, the CNN output $Y^{(l)}$ is divided into regions on which some calculations are performed. A typical pooling is MaxPool, which consists of finding the maximum value in the considered region and using it as a point of the new image-output. Ultimately, a pooling operation that uses $(k^{(l)} \times k^{(l)})$-dimensional regions reduces $Y^{(l)}$ to the $(Y^{(l)}_x / k^{(l)} \times Y^{(l)}_y / k^{(l)})$-dimensional MaxPool output $W^{(l)}$, which represents the so-called feature map.
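As a small worked check of the size relations above, the helper below evaluates the output-size formula: with a 40 × 40 channel (the size used in the experiments of Sect. 5), a 2 × 2 filter, stride 1, and no padding it gives 39, and the same relation applied with a 2 × 2 region and stride 2 gives the 20 × 20 max-pooling output.

```python
def conv_output_size(n: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size: floor((n - k + 2 * padding) / stride) + 1."""
    return (n - k + 2 * padding) // stride + 1

# 40x40 input, 2x2 filter, stride 1, no padding -> 39x39 convolution output
print(conv_output_size(40, 2))            # 39
# a 2x2 pooling region shifted with stride 2 halves each spatial dimension
print(conv_output_size(40, 2, stride=2))  # 20
```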
Thus, with reference to the projection $\Phi$ introduced earlier, further transformations have been added to the raw sensor data. In a nutshell, $\Phi$ acts in two steps: (a) the raw sensor data are first arranged as images, and (b) these images are then used to mine useful features. Note that what has been said so far refers to a single sensor $l \in S$.
To take into account all the involved sensors, each feature map $W^{(l)}$ associated with a sensor $l$ is flattened into a one-dimensional vector. Then, all these vectors are joined together to form a new, unique one-dimensional vector, $Flat\_Vec$. With reference to the feature-map size above, the size of this vector can be expressed as

$|Flat\_Vec| = \sum_{l=1}^{p} \frac{Y^{(l)}_x}{k^{(l)}} \cdot \frac{Y^{(l)}_y}{k^{(l)}} ,$

with $p$ the number of sensors involved in the measurement.
Finally, a DNN with a softmax output layer is connected to the CNN output for classification purposes. More precisely, $Flat\_Vec$ is given as input to a deep fully connected neural network, and its output is then provided as input to the softmax layer to infer the class of the input in terms of probabilities.
Accordingly, let $z = 1, \ldots, \zeta$ be the index of the hidden layers of the DNN and let $s^{(z)}$ be the number of neurons of the hidden layer $z$; then, when $Flat\_Vec$ is provided as input to this DNN, the output of each layer, $h^{(z)}$, can be expressed as

$h^{(z)} = \sigma\!\left(W^{(z)} h^{(z-1)} + b^{(z)}\right) ,$

where $W^{(z)} \in \mathbb{R}^{s^{(z)} \times s^{(z-1)}}$ and $b^{(z)} \in \mathbb{R}^{s^{(z)}}$ are the weight matrix and the bias vector of layer $z$, respectively, and $\sigma$ is an activation function. We remark that the size of $h^{(z)}$ is $s^{(z)}$, and that the layer $z = 1$ takes as input the above $Flat\_Vec$.
As mentioned above, the last layer $\zeta$ includes a softmax function. With reference to the layer output defined above, the j-th output component $h^{(\zeta)}_j$ of the last layer of the DNN can be expressed as

$h^{(\zeta)}_j = \sum_{k} w^{(\zeta)}_{j,k}\, h^{(\zeta-1)}_k + b^{(\zeta)}_j ,$

with $w^{(\zeta)}_{j,k}$ and $b^{(\zeta)}_j$ the entries of $W^{(\zeta)}$ and $b^{(\zeta)}$, respectively. Then, the j-th component of the softmax output $h^{sm}$ is given by

$h^{sm}_j = \frac{e^{h^{(\zeta)}_j}}{\sum_{k=1}^{n} e^{h^{(\zeta)}_k}} ,$

with $j = 1, \ldots, n$ and $n$ the number of admissible actions in $A$. The action $A_j$ with the highest probability is considered to be the output of the classifier. Such an action, along with the true action, is fed to the loss function introduced above and used during the training phase. Figure 4 depicts the general structure of the DCNN used in this study. Note that the depicted network is used for any $t$; that is to say, the network is trained on all the HAR-Images derived by sliding the temporal window $t$ over the whole training dataset. To accomplish this, a specific offset, $stride^{(l)}$, is usually used. As a consequence, the number $numHAR^{(l)}_{Images}$ of $nC$-channel HAR-Images associated with a sensor reading $l$ is given by

$numHAR^{(l)}_{Images} = \left\lfloor \frac{D^{(l)} - \nu^{(l)}}{stride^{(l)}} \right\rfloor + 1 ,$

where $D^{(l)}$ is the length of the sensor reading $l$. As better shown in the next section, the choice of $t$ and of the stride, and consequently of $numHAR^{(l)}_{Images}$, heavily affects the performance of the proposed network, and therefore they need to be chosen appropriately.
Finally, we remark that the loss function to be minimized is the one generally used for training deep neural networks for multi-class classification [64].

Experiments and results
The aim of this Section is to demonstrate the effectiveness of HAR-Images in representing actions starting from raw sensor readings and to show the effectiveness of the DCNN as a powerful classifier for HAR. To accomplish this, a real dataset was used in the experiments. All the experiments were conducted by using the high-level neural networks API library for Python over the TensorFlow platform running on a 64-bit Windows 10 operating system. The machine used to perform the experiments was a portable PC equipped with an Intel(R) 4-Core i7-8565U CPU @ 1.80GHz, 16 GB of RAM, and NVIDIA GeForce MX150 GPU with 4 GB of memory.

WISDM dataset
The dataset used in the experiments was released by the Wireless Sensor Data Mining (WISDM) Laboratory [32]. It includes raw data collected from several Android-based mobile devices under controlled laboratory conditions. In particular, the data refer to samples produced by the three-axis accelerometers embedded in smartphones, which are also capable of detecting the device orientation. In the latest version of the dataset, thirty-six volunteers were engaged to collect data. Each volunteer was asked to wear a smartphone in the front pocket of the pants while performing specific actions for a given period of time. The horizontal movements of the user's leg were captured by the X-axis of the accelerometer, the Y-axis captured the vertical movement, and the forward and backward movements were captured by the Z-axis. The labeling of the data and the start and stop of the data collection related to any action were controlled by an app running on the smartphone, which also collected other information, such as the timestamp, the sample frequency, and the user's name. The app was set to collect data every 50 ms, that is, f = 20 Hz, which means 20 samples per second.

Fig. 4 Network structure of the HAR-Images DCNN. Once the HAR-Images over a given $t$ are built, several convolutional and pooling transformations are performed. The output of these layers is first flattened and then fed to the DNN. The softmax layer concludes the network by providing an estimation of the classes in terms of probabilities.
Six common activities were considered in the data collection: Walking, Jogging, Upstairs, Downstairs, Sitting, and Standing. The dataset includes 1,098,208 samples, distributed as shown in Table 1. For any axis, the sample values range from −20 to 20.
As is usual, we performed an exploratory data analysis (EDA) [23], which led to the elimination of the samples that report a value of 0 for all axes. Because the accelerometer is sensitive to gravity, the condition in which all axes return a value equal to zero is considered an error. The resulting distribution of the number of samples per user is shown in Table 2.
As depicted, the total number of samples was reduced to 1,085,367; that is, 12,841 samples were deleted.
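A minimal loading-and-cleaning sketch corresponding to the EDA step just described, assuming the WISDM v1.1 raw file layout (user, activity, timestamp, x, y, z per line, with a trailing semicolon on the last field); the file name and column names are assumptions.

```python
import pandas as pd

COLUMNS = ["user", "activity", "timestamp", "x", "y", "z"]

def load_wisdm(path: str = "WISDM_ar_v1.1_raw.txt") -> pd.DataFrame:
    """Load the raw WISDM accelerometer file and drop all-zero readings."""
    df = pd.read_csv(path, header=None, names=COLUMNS,
                     on_bad_lines="skip")                 # skip malformed lines
    # The last field of each raw line typically ends with ';'
    df["z"] = pd.to_numeric(df["z"].astype(str).str.rstrip(";"), errors="coerce")
    df = df.dropna(subset=["x", "y", "z"])
    # An accelerometer at rest still senses gravity, so (0, 0, 0) readings are
    # treated as erroneous and removed, as described in the text.
    mask = (df.x == 0) & (df.y == 0) & (df.z == 0)
    return df[~mask].reset_index(drop=True)

if __name__ == "__main__":
    data = load_wisdm()
    print(len(data), "samples after cleaning")
```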
A dataset including transformed features is also provided by the WISDM Laboratory. It includes 46 features derived by considering a temporal window of 10 seconds, that is, 200 sensor readings. These features are all derived from six basic feature types, computed for each axis: average value, standard deviation, average absolute difference, average resultant acceleration, time between peaks, and binned distribution. We refer the reader to [32] for further details.
We remark that in this study we do not use such transformed features; we use the raw data. Nevertheless, as shown in Sect. 5.5, we used these features to compare the performance of our approach with state-of-the-art Machine Learning techniques.

HAR-images
The samples reported in Table 2 were used to build the dataset of HAR-Images as discussed earlier. In order to compare our approach with the state-of-the-art, we used a temporal window of t = 10 seconds, which corresponds to $\nu = 200$ samples at 20 Hz. Besides, a stride of 10 samples was used. As discussed above, because the WISDM dataset includes data captured by a single three-axis accelerometer, we have p = 1 and q = 3. As a consequence, for any window t a single HAR-Image is produced (p = 1), including 3 channels (nC = 3). Applying the window-count relation above, the total number of 3-channel HAR-Images is 105,205, distributed as reported in Table 3. Finally, a quantization step of Q = 0.025 was used, which led to images of size (40 × 40). Also, in order to adapt the entries of the image matrices to the DCNN input, all images were normalized to have entry values ranging from 0 to 1.
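The following fragment makes the windowing arithmetic of this subsection explicit: it counts how many HAR-Images a continuous segment yields with ν = 200 and stride = 10, and rescales the 0–255 occurrence counts to the [0, 1] range fed to the DCNN. The 5-minute segment length in the example is a hypothetical value, not a figure taken from the dataset.

```python
import numpy as np

WINDOW, STRIDE = 200, 10   # nu = 200 samples (10 s at 20 Hz), stride of 10 samples

def num_windows(segment_length: int) -> int:
    """Number of HAR-Images obtainable from one continuous sensor segment."""
    return 0 if segment_length < WINDOW else (segment_length - WINDOW) // STRIDE + 1

def normalize(har_image: np.ndarray) -> np.ndarray:
    """Scale 0-255 occurrence counts to the [0, 1] range expected by the DCNN."""
    return har_image.astype(np.float32) / 255.0

if __name__ == "__main__":
    # e.g. a hypothetical 5-minute segment recorded at 20 Hz
    print(num_windows(5 * 60 * 20))   # (6000 - 200) // 10 + 1 = 581
```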

DCNN model and experimental setting
With reference to Fig. 4, the implemented DCNN structure includes two multi-channel 2D convolutional layers with 2 and 4 (2 × 2)-dimensional filters, respectively, padding = same and ReLU activation, each followed by a (2 × 2) MaxPooling layer. Next, a flattening layer is used, followed by a DNN. The DNN includes two fully connected layers of 48 and 24 neurons, respectively, each followed by a Dropout of 0.5. Finally, a softmax layer with 6 outputs concludes the network. Figure 5 shows the Python representation of the network structure along with the input and output sizes of each network element. The network was trained by using sparse categorical cross-entropy as the loss function, the Adam optimizer, 100 epochs, and a batch size of 5.
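A sketch of the described architecture using tf.keras (assuming that the "high-level neural networks API" mentioned in the experimental setup corresponds to Keras). The layer sizes, filters, padding, dropout, loss, optimizer, epochs, and batch size follow the description above, while the ReLU activation on the fully connected layers and the (40, 40, 3) input shape are assumptions consistent with the HAR-Images built earlier.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcnn(input_shape=(40, 40, 3), n_classes=6) -> tf.keras.Model:
    """DCNN mirroring the structure described in the text (a sketch, not the
    authors' original code)."""
    model = models.Sequential([
        layers.Conv2D(2, (2, 2), padding="same", activation="relu",
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(4, (2, 2), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(48, activation="relu"),   # activation assumed
        layers.Dropout(0.5),
        layers.Dense(24, activation="relu"),   # activation assumed
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training as described: model.fit(x_train, y_train, epochs=100, batch_size=5)
```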

Performance metrics and results
In order to evaluate the classification quality of the proposal, the following metrics, derived from the multi-class confusion matrix [20], were used on the testing dataset: Accuracy = (TP + TN)/(TP + TN + FP + FN), Sensitivity (Recall) = TP/(TP + FN), Specificity = TN/(TN + FP), Precision = TP/(TP + FP), the F-measure (the harmonic mean of Precision and Sensitivity), and the AUC (Area Under the ROC Curve). Here, TPs (True Positives) are the actions correctly classified, FPs (False Positives) are the actions incorrectly classified, FNs (False Negatives) are the actions incorrectly rejected, and TNs (True Negatives) are the actions correctly rejected. We remark that each metric ranges from 0 to 1, or equivalently from 0% to 100%, where 100% represents the best performance.
All the metrics were evaluated by using the Python library scikit-learn [19].
In order to investigate the performance of the proposal without running into bias-related problems, the 10-fold cross-validation technique was used in the experiments. Accordingly, with reference to Table 2, the samples belonging to each action were first divided into 10 folds, and then, in turn, each fold was used for testing while the remaining samples were used for training. To accomplish this, the sklearn library for Python was used. Tables 4 and 5 show the confusion matrix and the per-action metrics of the experimental results, respectively; the mean, standard deviation, and 95% confidence intervals are also shown. As depicted, all the metrics achieved values close to their maximum for every action, and only a few samples were incorrectly classified (Table 4). Although some actions can be considered very similar, such as Sitting and Standing, the HAR-Images proved able to distinguish them with high accuracy: the achieved accuracy was 100%. The classifier showed some weakness only for Upstairs, which, in a few cases, was confused with Walking or Downstairs; however, also in this case, the accuracy achieved values very close to 100%.
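A sketch of the 10-fold evaluation protocol using scikit-learn, assuming integer-encoded activity labels and the hypothetical build_dcnn helper from the previous sketch; the exact fold handling and metric computation of the original experiments may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report

def cross_validate(images: np.ndarray, labels: np.ndarray, n_splits: int = 10):
    """Stratified 10-fold cross-validation of the DCNN with per-class metrics."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    y_true, y_pred = [], []
    for train_idx, test_idx in skf.split(images, labels):
        model = build_dcnn(n_classes=len(np.unique(labels)))  # helper from the sketch above
        model.fit(images[train_idx], labels[train_idx],
                  epochs=100, batch_size=5, verbose=0)
        pred = model.predict(images[test_idx]).argmax(axis=1)
        y_true.extend(labels[test_idx])
        y_pred.extend(pred)
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=4))
```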

Comparison and discussion
In this Section, a comparison with some solutions presented in the literature is provided.
Firstly, in order to show the effectiveness of the HAR-Images as powerful features for HAR, we tested the performance of the so-called transformed features provided by the WISDM Lab [32]. We remark that, as mentioned in Sect. 5.1, such transformed features are derived from the raw time-series data by using a temporal window of 10 seconds (corresponding to 200 samples), that is, the same number of samples used in our experiments. The transformed dataset is provided by the WISDM Lab as an ARFF file containing 46 features (including the class) and 5418 samples. Two features, "UNIQUE_ID" and "user", were pruned in our experiments because they could bias the results. Four classifiers from the WEKA suite [59] were used, namely Naive Bayes, MLP (Multi-Layer Perceptron), the J-48 decision tree, and SMO (Sequential Minimal Optimization). The 10-fold cross-validation technique was also used to test the performance of the WISDM features. Figure 6 depicts the average values of the performance metrics for each classifier. As shown, our approach outperformed all the others: all the metrics achieved a value very close to 100%. Although MLP and J-48 seem to perform better than Naive Bayes and SMO, their Sensitivity, 0.78 and 0.81 respectively, is very low compared with the 0.99 achieved by our proposal. Another important aspect to be considered is that, unlike our approach, the transformed features seem unable to deal with unbalanced datasets. Indeed, as depicted in Table 6, the number of WISDM samples per action is quite different, and most of the samples refer to the Walking and Jogging actions. The low values of Sensitivity, Precision, AUC, and F-measure, together with the higher values of Specificity, obtained by the WEKA classifiers are a clear demonstration of this drawback introduced by the WISDM features.
Finally, in order to provide further proof of the effectiveness of our approach, Table 7 shows a comparison with some of the most notable solutions recently presented in the literature that use deep neural networks, namely DCNN [45], LSTM [46], and CNN [61]. DCNN makes use of merged features that combine the WISDM transformed features with features extracted by a CNN. Notice that the per-action accuracy of DCNN is missing in the table and only its mean is reported, because we used the results published in the corresponding paper. LSTM refers to a solution based on a recurrent neural network, while CNN refers to a solution based on a pure CNN with a 1D convolution operation.
As depicted, our approach was able to recognize every action with similar accuracy and precision; indeed, all the metrics achieved similar values, close to 100%. On the contrary, Upstairs and Downstairs were confused by DCNN, LSTM, and CNN, for which the associated metrics achieved the lowest values. Likewise, the Standing action was not well distinguished from the other ones.

Conclusions and future works
In this study, we have proposed a method to enhance the performance of COVID-19 tracking apps through human activity recognition (HAR). In particular, starting from the raw data readings coming from the built-in sensors of smartphones, we have derived special images, called HAR-Images, able to capture useful and salient knowledge related to the in-progress user activity, and thus usable as a kind of signature of the activity. A deep convolutional neural network (DCNN) has been used to mine such insights. A mathematical model of the proposal has first been provided, and then its application to data coming from a three-axis accelerometer has been described.
Unlike state-of-the-art approaches, which build images by using each axis of an accelerometer separately, the proposed HAR-Images are derived by jointly exploiting pairs of axes, whose values are used as the coordinates of dots on an x-y plane. This different point of view gives the proposal the capability to capture salient relations existing among the data of different axes gathered at the same time. As a consequence, the extracted images include more powerful insights, which allows building a better-performing DCNN-based classifier.
Indeed, the experimental results, obtained by using the WISDM dataset, have been absolutely satisfactory. The metrics, derived from the multi-class confusion matrix reported below (Table 4), achieved values very close to 100%.

Table 4 Confusion matrix over the 10 folds (rows: actual classes; columns: predicted classes)
Actual / Predicted   Jogging  Walking  Upstairs  Downstairs  Sitting  Standing
Jogging              32425    7        7         0           0        0
Walking              0        41648    27        44          0        0
Upstairs             37       73       11483     73          0        0
Downstairs           5        58       65        9300        0        0
Sitting              0        0        1         0           5542     19
Standing             0        0        0         0           3        4388

Compared with the state-of-the-art approaches, our proposal has proven to be the best; indeed, the performance of most of the other approaches present in the literature has never exceeded an F-measure of 91%. These results confirm the effectiveness of the proposed HAR-Images and of the DCNN structure in performing HAR. In addition, its ability to work in real time makes the method suitable for use in COVID-19 tracking apps.
The experimental results encourage us to investigate, in the future, the application of the HAR-Images also in other contexts, such as telemedicine or personal fitness monitoring.

Conflicts of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.