1 Introduction

Human-centered computing is an emerging research and application area that aims at understanding human behavior (and psychology) and at integrating users and their social context with digital technology. This necessitates and subsumes human activity recognition (HAR), which aims at recognizing the actions, characteristics, and goals of one or more individuals from a temporal series of observations streamed from one or more sensors. Automatic recognition of behavioral context can serve many domains, including health monitoring, elderly care, indoor and outdoor sports, virtual coaching, surveillance systems, robot learning and cooperation, smart homes, Industry 4.0, predictive maintenance, human–robot interaction, the military, etc. A comprehensive survey of applications can be found in [1].

Generally, the objectives of HAR systems can be summarized as follows: (1) determining (both in real time and offline) the ongoing actions/activities of a person, a group of persons, or even a crowd based on sensory observation data, (2) assessing the ongoing action; for example, if the action is an indoor squat exercise, what is the performance level of the player and what mistakes were made during the exercise, (3) determining certain characteristics of the subject(s), such as the identity of people in the given space and biometrics such as gender, age, weight, etc., (4) acquiring knowledge about the context within which the observed activities take place, and (5) estimating the body weight.

For our purposes in this article, we consider an action or activity as an ordered intelligible sequence of bodily movements.

HAR systems can be classified based on the modality of sensory information used, as the kind of sensory data greatly affects the types of features, algorithms, architectures, and techniques used for analysis. Generally, we can identify the following streams of research and developments in HAR systems: (1) HAR systems based on visual data (such as images and video streams), (2) HAR systems based on inertial motion streaming from sensors such as IMUs (Inertial Measurement Units), (3) HAR systems based on the received signal strength (RSS) from commodity routers installed in the surrounding environment [2, 3], (4) HAR systems based on infrared signals, and (5) HAR systems based on audio signals.

To the best of our knowledge, there is not much work fusing several of these modalities together. In the current study, we focus exclusively on the second modality, namely, inertial motion timeseries data. This type of data has the advantage of being more intrinsic in capturing the motion dynamics of the activity, and it is not affected by irregularities in the surrounding environment, such as the absence of light needed for visual capture. However, this is also its weakness: compared with the other modalities, it is less capable of capturing the contextual environment and interactions with other objects and/or persons in that environment.

The idea of an automatic HAR system depends on collecting measurements from appropriate sensors that are affected by the motion attributes and characteristics of different body joints. Then, from these measurements, features are extracted to be used in training activity models, which in turn are used to make inferences about these activities and their performers. A typical example of a class of activities is the Activities of Daily Living (ADL) that people habitually perform every day, such as eating, moving, hygiene, dressing, etc. [4].

There are many methods and data acquisition systems based on different sensory readings for recognizing these actions. Historically, heavy devices such as data acquisition cards (DAQs) were used to collect accelerometer data [5]. Later, smaller integrated circuits connected to PDAs were used for this purpose. Two acquisition paradigms have since dominated: camera-based computer vision systems and inertial sensor-based systems. In computer vision, activities are recorded by cameras [6]. However, the drawback of the camera-based approach is that it may fail when full camera coverage of the person's activities is absent. In addition, cameras raise privacy issues and are intrusive, as people do not feel free and comfortable being observed constantly.

Based on the data acquisition paradigm, HAR systems can be divided into two general categories: ambient/surrounding fixed-sensor systems and wearable mobile sensor systems. In the fixed setting, data are collected from distributed sensors attached to fixed locations in the activity's surrounding environment (where users' activities are monitored), such as surveillance cameras, microphones, motion sensors, etc.; see [7] for applications in smart homes. Alternatively, the sensors are attached to interactive objects in order to detect the type of interaction with them, such as motion sensors attached to cupboard doors or microwave ovens (to detect opening/closing), or to a water tap to detect turning it on or off, and so on. Although this method can detect some actions efficiently, it has many limitations due to its fixed nature. One limitation is that if the user leaves the place, she will no longer be observed by the fixed sensors and her activities will not be detectable. Privacy is another issue, especially when considering video surveillance cameras and/or auditory sensors.

Figure 1 shows the evolution of hardware for gathering inertial motion data: a National Instruments DAQ card-1200 with analog input, a USB-based wearable programmable chip, and the more advanced devices, e.g., smartphones and smartwatches.

Fig. 1 Acceleration sensing hardware evolution

In the wearable-based systems, the measurements are taken from mobile sensors mounted on (or embedded in commodity devices mounted on) human body parts such as the wrists, legs, waist, and chest. Many sensors can be used to measure selected features of human body motion, such as accelerometers, gyroscopes, compasses, and GPS receivers. Other sensors measure phenomena around the user, such as barometers, magnetometers, light sensors, temperature sensors, humidity sensors, microphones, and mobile cameras. These kinds of sensors are typically embedded in wearable devices. Consumer inertial MEMS IMU sensors are tiny devices with a typical package size of 2.5 mm \(\times \) 3 mm that contain tri-axial accelerometers and gyroscopes. They can be found in many commodity products we use almost every day, such as smartphones, fitness trackers, smartwatches, gaming controllers, etc. Two such high-tech IMU devices, named BMI160, are inside the NASA helicopter Ingenuity, which flew many times on the surface of Mars in 2021.

In contrast to the fixed-sensor-based systems, wearables are more intrinsic and able to continuously measure data from the user everywhere, while sleeping, working, or even traveling, since they are not bound to a specific place where the sensors are installed. Also, wearables can directly and efficiently measure data from particular body parts without much of the preprocessing needed, for example, with fixed depth cameras. So, as opposed to ambient sensors, a wearable sensor is intimately related to the body part to which it is mounted and, in some sense, is directly expressive of that part (such as its motion dynamics, vital signs, etc.), with little or no preprocessing. Examples of wearables include smartwatches, smart shoes, sensory gloves, hand straps, and clothing [8, 9]. Recent statistics show that the total number of smartphone subscriptions reached about 6.5 billion in 2022 and is expected to reach 7.7 billion in the coming years [10]. According to an IDTechEx report [11], the wearable sensor industry was worth \(\$2.5\) billion in 2020, having tripled in size since 2014, and is expected to exceed \(\$4.5\) billion in 2024. Therefore, the disadvantages of wearable mobile sensors of being intrusive, uncomfortable, and annoying have largely vanished, making on-board sensing from smart mobile and wearable devices very suitable for HAR data acquisition.

As current trends have shifted greatly towards data-oriented systems with statistical machine learning analysis and synthesis, feature extraction and representation learning play essential roles in the effectiveness and efficiency of HAR systems. For inertial timeseries data, feature engineering has proven as effective as automatic feature extraction/selection through representation learning in the deep learning paradigm. From one perspective, this is due to the limited number and small sizes of the available datasets. From another perspective, inertial timeseries data carry little explicit semantics compared, for example, with visual, auditory, and textual data, so they are not very amenable to the high-level convolutional operations dominant in deep neural networks. However, the power of deep learning emerges chiefly in transfer learning, which is of great benefit in deploying practical solutions based on wearable and mobile devices.

Transfer learning is the transfer of knowledge gained in one domain and/or task to another domain and/or task, so that we need not start over on the latter. Transfer learning is essential in HAR systems, not only for saving computational resources [by avoiding end-to-end training on the new task(s)], but more importantly for handling the tremendous heterogeneities induced by the huge configuration space of the data collection and streaming processes in IMU-based HAR systems (a minimal fine-tuning sketch is given after the list below). These include, among others, the following:

  1. Hardware heterogeneity:

    • Different kinds of devices (with embedded IMUs) such as smartphones, smartwatches, wrist bands, specialized IMU sensors that can be placed on different body parts, etc.

    • Different brands of the same kind of device. For example, varieties of smartphones such as iPhone, Galaxy, Huawei, etc.

    • Different models of the same brand of the same device. For example, iPhone 10, iPhone 11, etc.

    • All the above lead to different characteristics of the IMU sensing technology, sampling rates, dynamic ranges, sampling rate instabilities, etc.

  2. Software heterogeneity:

    • Different operating systems; for example, iOS vs. Android.

    • Different versions of the same operating system.

    • Different filters embedded in the IMU sensors or sensors logs.

  3. Data collection heterogeneity:

    • Variations in the location(s) of the device(s) mounted on the body part(s). Given a fixed type of device, say a smartwatch, it can be worn on the left or right wrist. Smartphones have an even wider range of locations. These lead to variations in the signals corresponding to the same activity.

    • All other factors fixed, gender and age greatly affect the signals of the same activity.
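To make this concrete, the following is a minimal fine-tuning sketch in PyTorch: a small 1D-CNN feature extractor trained on a source configuration is frozen, and only a new classification head is trained on data from the heterogeneous target configuration. The layer sizes, window length, number of target classes, and checkpoint name are illustrative assumptions, not taken from any specific work surveyed here.

```python
# Minimal transfer-learning sketch (PyTorch), with illustrative shapes:
# windows of 6 inertial channels x 128 samples, 8 hypothetical target activities.
import torch
import torch.nn as nn

class IMUFeatureExtractor(nn.Module):
    """Small 1D-CNN backbone producing a 64-dimensional embedding per window."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(6, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())

    def forward(self, x):              # x: (batch, 6, 128)
        return self.net(x)             # -> (batch, 64)

backbone = IMUFeatureExtractor()
# backbone.load_state_dict(torch.load("source_domain_weights.pt"))  # hypothetical checkpoint

for p in backbone.parameters():        # freeze the source-domain representation
    p.requires_grad = False

head = nn.Linear(64, 8)                # new head for the target activities
model = nn.Sequential(backbone, head)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 6, 128)            # dummy target-domain windows
y = torch.randint(0, 8, (16,))
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```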

Figure 2 gives a general framework of modern approaches to HAR that are data-oriented. The first process in the pipeline is, of course, the data collection which can be done through ambient devices such as surveillance cameras, microphones, etc., or through wearable devices such as smartwatches, smart clothing, etc. The data then are cleaned, curated, denoised, etc. Then, data go through higher-level preprocessing steps including, for example, the separation of body and gravitational acceleration. Data are then segmented, for example, into time spans based on gait cycle, or transitions between different consecutive activities, etc.

Fig. 2 General framework of HAR systems
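As an illustration of one of these preprocessing steps, the sketch below separates gravitational from body acceleration with a low-pass Butterworth filter; the 0.3 Hz cutoff and 50 Hz sampling rate are common but assumed values, not ones prescribed by the works surveyed here.

```python
# Sketch: split raw tri-axial acceleration into a slowly varying gravity
# component (low-pass) and a dynamic body component (the remainder).
import numpy as np
from scipy.signal import butter, filtfilt

def split_gravity_body(acc, fs=50.0, cutoff=0.3, order=3):
    """acc: array of shape (n_samples, 3) with raw tri-axial acceleration."""
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    gravity = filtfilt(b, a, acc, axis=0)   # slowly varying component
    body = acc - gravity                    # remaining dynamic component
    return gravity, body

acc = np.cumsum(np.random.randn(500, 3) * 0.01, axis=0) + np.array([0.0, 0.0, 9.81])
gravity, body = split_gravity_body(acc, fs=50.0)
```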

The data then go through a crucial feature extraction/selection step to capture and produce the main characteristics and representations of the given data. This process can be done in three main ways: (1) purely classical hand-crafted features, (2) purely automatic feature generation through deep neural methods, and (3) a hybrid of the two, where a preliminary simple feature extraction/transformation step is performed (for example, transforming the original timeseries into a new series induced by computing its autocorrelation function), followed by representation learning through deep neural methods. The produced feature vectors/feature maps are then fed to a learning model, which can be classical, such as a support vector machine, or a more recent deep neural network. In the latter, the last layer(s) are concerned with the application domain, which can vary greatly: recognizing the type of action performed, predicting and estimating the biometrics of the person performing the action (such as gender, age, weight, etc.), or identifying the surrounding environment, such as the type of ground on which walking occurs (asphalt, grass, sand, etc.).

We present, to the best of our knowledge, the most recent research works related to the use of transfer learning in IMU-based HAR systems. The overall aim of this paper is to give an introduction to and a survey of human activity recognition systems using inertial motion data streamed from IMU sensors embedded on-board smart mobile and wearable devices. The survey is oriented more towards technical aspects, with some hints to application areas. More specifically, we focus on the following:

  1. We investigate the data collection process for HAR systems. In addition, we provide a detailed survey of publicly available benchmark HAR datasets. To the best of our knowledge, this is the most comprehensive compiled list of available HAR datasets to date.

  2. A major preprocessing step for any HAR system is how to present the raw inertial data to the subsequent machinery, whether a machine learning model or a more traditional deterministic non-adaptive system. Here we survey traditional handcrafted feature engineering techniques (extraction/selection) in addition to the more recent representation learning based on deep neural computing. It is apparent that for the sole purpose of end-to-end classification, more traditional techniques based on feature engineering are effective enough; however, for transfer learning across domains, datasets, and/or environmental configurations and setups, deep learning is a necessity.

  3. Transfer learning is a very powerful tool in machine learning in general (and a major path to artificial general intelligence), and in particular for inertial-based HAR systems. This is due to the tremendous heterogeneity and the inherent gap between the data collection and streaming process during system training and the corresponding process during deployment. This is further complicated by the low semantic content of timeseries inertial motion data, as compared, for example, with visual data. Here, we survey the work done in cross-domain HAR systems and assess how effective it is.

  4. We then categorize HAR systems depending on whether they are developed for online real-time prediction or for offline batch analysis.

  5. Deep neural networks are highly parameterized complex models and can be highly sensitive to small perturbations (chaotic systems), caused by so-called adversarial attacks. ("Chaotic" in the sense that the process of training deep neural networks can be viewed as the evolution of a dynamical system. This dynamical system is not robust; that is, it is sensitive to small perturbations. For example, small perturbations in the weight initialization or round-off errors caused by floating-point operations can lead to dramatically different results, such as divergence of the training process instead of convergence, or convergence to better or worse solutions of the loss function.) There are only a few works addressing adversarial attacks against timeseries data. The hardest challenge is to qualify and quantify such attacks in this context. In visual data, adversarial attacks are mainly characterized by the perturbed image being imperceptible to the human eye, a quality basically stemming from the high semantic content of visual data. Such a quality is absent from timeseries data. Here, we survey the few works addressing adversarial attacks against timeseries signals and further point to research challenges in this area.

  6. We briefly discuss embedded implementations of HAR systems.

  7. Finally, we discuss and survey some major application domains of HAR systems based on inertial motion data.

As the HAR research area is vast, we had to focus on particular aspects and give our own perspective on current developments. The survey is by no means exhaustive. However, it gives, at least, useful entry points for deeper investigation into the particular aspects of interest to the reader, in addition to some tutorial and pedagogical merits. All the materials are presented with a rather critical and evaluative eye.

The paper is organized as follows. Section 1 is an introduction. Section 2 discusses the data collection aspects and processes for human activity recognition tasks based on the inertial motion modality. In addition, we present, in tabular form, a comprehensive survey of publicly available benchmark datasets in the literature related to that domain; this table is listed in "Appendix A". Section 3 introduces the feature extraction and selection techniques for timeseries data that are used in learning systems for activity analysis and inference. In Sect. 4 we discuss offline systems that are based on one-shot tasks, such as the recognition of a single isolated action, versus real-time continuous operation over online streaming timeseries data. Section 5 surveys and assesses the state-of-the-art techniques for transfer learning of activity tasks based on inertial motion signals. Section 6 gives some typical application domains of activity recognition from wearable IMUs; these span biometrics (e.g., age, gender, gait), virtual coaching, healthcare, and ergonomics. In Sect. 7, we discuss some implementations of HAR systems on embedded devices. Finally, Sect. 8 concludes the paper with an overall analysis of inertial-based HAR systems and their future potential.

2 Data collection

There is a wide variety of activity datasets that are based on inertial motion data. These are typically used as benchmarks to test new techniques for activity analysis and inference systems. A comprehensive survey and a rather elaborate description of publicly available benchmark datasets are given in Table 15 in "Appendix A". Figure 3 shows configurations of different HAR datasets: (a) the motion node that recorded data in the USC-HAD dataset [12], and (b) the sensor placement used for the REALDISP dataset [13].

Fig. 3 Configurations of different HAR datasets. Permission was obtained from the authors of [12] and [13] to use this figure

Figure 4 shows sample images while performing different ADL activities (EJUST-ADL-2 [14]).

Fig. 4 Sample images while performing different ADL activities (EJUST-ADL-2 [14])

Figure 5 shows the activities and data collection of the EJUST-ADL-1 dataset [24] using an Apple smartwatch.

Fig. 5 The activities and data collection of the EJUST-ADL-1 dataset using an Apple smartwatch

When designing any sensor-based activity recognition system, the number of sensors and their locations are critical parameters. Regarding locations, different body parts have been chosen, from the feet to the chest, and the locations are selected according to the relevant activities. For example, ambulation activities (such as walking, jumping, running, etc.) are typically detected using a waist- or chest-mounted sensor [15]. On the other hand, non-ambulation activities (such as brushing teeth, combing hair, eating, etc.) are naturally classified more effectively using wrist-worn or, in general, upper-body-mounted sensors [16].

Many kinds of physical features can be measured using wearable sensors [17]. These include: (1) environmental data such as barometric pressure, temperature, humidity, light intensity, magnetic fields, and noise level, (2) vital health signs such as pulse, body temperature, blood pressure, blood sugar (glucose), and respiratory rate, (3) location data, typically identified by longitude and latitude using GPS sensors, and (4) body limb motion, such as the acceleration and rotation of body parts like arms, legs, and torso, using accelerometers and gyroscopes. Note that all of these can be considered timeseries data. Wearable devices can also produce other kinds of data, including textual data from social media.

However, in this article we focus exclusively on timeseries motion data. For the purpose of activity recognition, inertial motion measurements have proven to represent human motion accurately, and some methods depending only on acceleration measurements have achieved very high accuracy [15, 16]. However, it is very difficult to extract the pattern of motion directly from raw acceleration data because of high-frequency oscillation and noise. So, preprocessing in addition to feature extraction and selection techniques should be applied to the raw data before modeling, or alternatively deep learning methods can be used for automatic feature extraction and selection (representation learning). The latter might not be a good choice over raw noisy data, taking into consideration adversarial attacks and their qualitative and quantitative nature over timeseries data.

A line of work aims at automatically determining the on-body sensor location given the inertial motion signal of a given fixed activity. For example, Madcor et al. [18] study the particular action of 'walking' and how the induced signal can be used to determine the sensor's location. They do not perform any feature extraction; instead, they use the raw signal data as direct input to a random forest classifier. They validate their work on four datasets: EJUST-GINR-1 [19, 20], RealDisp [13, 21], HuGaDB [22], and MMUSID [23]. All of these datasets are either exclusively gait or include gait among other activities; however, only the gait is used in the study.
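A rough sketch of this raw-signal approach is shown below, using synthetic data; the window length, number of candidate locations, and classifier settings are illustrative assumptions rather than the exact setup of [18].

```python
# Sketch: flatten fixed-length 6-axis windows and classify the on-body location
# with a random forest, mirroring the 'raw signal in, location out' idea.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_windows, win_len, n_channels = 600, 128, 6            # acc(x,y,z) + gyro(x,y,z)
X = rng.normal(size=(n_windows, win_len, n_channels))   # synthetic stand-in for gait windows
y = rng.integers(0, 4, size=n_windows)                  # e.g., 4 candidate body locations

X_flat = X.reshape(n_windows, -1)                       # raw samples, no feature extraction
X_tr, X_te, y_tr, y_te = train_test_split(X_flat, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("location accuracy:", clf.score(X_te, y_te))
```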

Figure 6 shows a silhouette diagram with the sensor locations in all the considered datasets. The signals used for recognition are the tri-axial linear acceleration and the tri-axial angular velocity. These two modalities have proven effective enough to detect the sensor's location, as evidenced by the performance metrics: accuracy reached above 90%, and most of the recall and F1-score results are above 80%. As expected, most of the confusion in location determination occurs between symmetric sensors across the body, such as between the left and right calf, left and right wrist, etc. This line of research can be applied to data augmentation of timeseries inertial data as well as to healthcare applications, for example, in treatment assessment for people with motion disabilities. Figure 7 shows a data sample of 3D acceleration, 3D rotation, and 3D angular velocity signals of the activity 'walk' (dataset EJUST-ADL-1 [24]).

Fig. 6 Schematic diagram of on-body sensor locations in some activity datasets

Fig. 7 A data sample of 3D acceleration, 3D rotation, and 3D angular velocity signals of the activity ‘walk’ (dataset EJUST-ADL-1 [24])

A similar line of research is conducted by Abdu-Aguye et al. [25]. The output of this work can be used to create 'virtual IMU sensors'. In [25] the authors explore the feasibility of transforming/mapping inertial timeseries data streamed from a 'source' body location, whilst performing a given activity, to a different 'target' body location. They use a bidirectional long short-term memory (Bi-LSTM) network to learn such a mapping. An LSTM is a rather stable form of recurrent neural network (RNN) that can capture long-term correlations between the current time sample and historical samples of the input timeseries. The bidirectional version also takes the future into account when predicting the current moment; hence, it smooths the predicted signal and generates more accurate timeseries. However, it can apparently only be used in offline mode (or in online mode with limited future input). The model is rather simple and is shown in Fig. 8. Figure 9 shows the IMU sensor placements on the human body (EJUST-GINR-1 dataset [19, 20]).

Fig. 8 Model for mapping inertial motion data across different body locations [25]

Fig. 9 IMU sensor placements on the human body (EJUST-GINR-1 dataset [19, 20])
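A rough sketch of such a source-to-target mapping model in PyTorch is given below; the hidden size, window length, and training details are illustrative assumptions and need not match the exact configuration of [25].

```python
# Sketch: Bi-LSTM sequence-to-sequence regressor mapping a 6-channel inertial
# window recorded at a 'source' body location to the corresponding window at a
# 'target' location.
import torch
import torch.nn as nn

class LocationMapper(nn.Module):
    def __init__(self, n_channels=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_channels)   # back to 6 inertial channels

    def forward(self, x):            # x: (batch, time, channels) from the source location
        out, _ = self.lstm(x)
        return self.proj(out)        # predicted signal at the target location

model = LocationMapper()
src = torch.randn(8, 200, 6)         # e.g., wrist-mounted IMU windows (dummy data)
tgt = torch.randn(8, 200, 6)         # corresponding ankle-mounted windows (dummy data)
loss = nn.MSELoss()(model(src), tgt)
loss.backward()
```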

One of the most difficult and challenging problems in the collection of inertial motion data is synchronization. One reason this process is more difficult for this data modality is that timeseries data have much lower semantic content than, for example, visual or auditory data. In the latter modalities, synchronization and/or segmentation can be done offline in a manual fashion, though it is a laborious process; such an offline manual process is not feasible in the case of timeseries data. Synchronization can involve determining the start and end of unit actions, the transition/switching between consecutive actions, synchronization and alignment among different IMUs used simultaneously in data collection, and, in the case of multiple modalities, the alignment among such modalities.

Fayez et al. [26] collected the VaIS dataset, a bimodal dataset of squats comprising visual data and inertial motion timeseries. The authors used 9 different IMU devices mounted on different parts of the subject's body in addition to 2 RGB cameras, one positioned in front of the subject and the other at a half frontal/lateral view. For the IMU devices, data are collected and then sent in the form of CSV files to a mobile app on an iPhone. Through this application, all of the IMU sensors are readily synchronized. Each record in the CSV files has a timestamp with a resolution on the order of milliseconds, which is used to synchronize among the 9 IMU devices. For the visual data they used the camera of a smartphone, with a timestamp app used to annotate the video capture with a time format closest to that produced by the IMU devices. The readings obtained from the IMU units carry global timing in the following format: <YYYY-MMM-DDTHH:MM:SS.S>.

Figure 12 shows a snapshot of the video capture with time annotation. Optical character recognition is then applied to the section of the frames depicting the time annotation in order to extract the time. The time formats of the IMUs and those extracted from the frames are matched together. Finally, the time instances in both modalities are matched, taking into consideration the difference between the IMU sampling rate and the video frame rate. This causes a delay error between the two modalities that does not exceed 200 ms. During the data collection process, the authors used a whistle to indicate exactly when the subject started the activity. This generates a high-frequency peak in the sound wave, which is used to cut the video at the start of the given activity. The first extracted frame is then matched with the corresponding IMU frame given the aligned time steps. A similar procedure is employed to indicate the last video and IMU frames.
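A minimal sketch of this kind of timestamp-based alignment is given below, assuming pandas and illustrative column names; the 200 ms tolerance mirrors the worst-case delay reported above.

```python
# Sketch: match each video frame (timestamp recovered by OCR) to the nearest
# IMU record within a 200 ms tolerance.
import pandas as pd

imu = pd.DataFrame({
    "t": pd.to_datetime(["2021-05-01T10:00:00.000", "2021-05-01T10:00:00.020",
                         "2021-05-01T10:00:00.040"]),
    "acc_x": [0.01, 0.03, 0.02]})
frames = pd.DataFrame({
    "t": pd.to_datetime(["2021-05-01T10:00:00.010", "2021-05-01T10:00:00.043"]),
    "frame_id": [0, 1]})

aligned = pd.merge_asof(frames.sort_values("t"), imu.sort_values("t"),
                        on="t", direction="nearest",
                        tolerance=pd.Timedelta("200ms"))
print(aligned)
```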

The same authors recently collected a new workout activity dataset [27], called MM-DOS. It is a multi-modal dataset and is unique in the sense that it combines many data modalities and sensors: RGB, inertial motion, depth, and thermal. Accordingly, synchronization here is much more challenging than in the previous work. The data are collected from 50 participants performing four types of workout exercises. As in the previous work, all the IMU sensors are synchronized through a mobile app running on an iPhone, and all IMUs have the same sampling rate. The depth imaging is obtained using a Kinect camera, and the timestamp is extracted from the Kinect metadata using MATLAB in the same format as the IMUs. RGB video is captured using mobile phones, and a mobile app is used to annotate each frame of the video capture with the timestamp.

See Fig. 10 for an illustration of the time annotation of RGB video frames. Optical character recognition is used to extract this annotation and convert it into a proper time format. The timestamp of the frontal camera is used as a frame of reference for all other devices, taking into account the different sampling rates of the involved sensors. In addition, the authors used a beep sound to indicate exactly when the subject starts the activity so that it produces a peak in the audio signal. This sound is used to align the start of the RGB video, and all timestamps are then synchronized taking into consideration the different sampling rates across all devices. As for the thermal imaging, synchronization was performed completely manually due to the inability to obtain any metadata, including timestamps, from the video stream. The whole synchronization process is depicted in Fig. 11. Figure 12 shows a video recording with timestamp annotation [26].

Fig. 10 Frontal camera view of a shoulder press exercise with the timestamp on the lower right [27]

Fig. 11 Synchronization among all modalities in the MM-DOS dataset [27]

Fig. 12 Video recording with timestamp annotation [26]

3 Feature extraction/selection

For inertial-based HAR systems, many approaches to feature extraction and selection have been developed focusing on the structural and/or statistical properties of the input timeseries. In this section, we survey the main feature extraction and selection techniques. There are different kinds of features used to represent timeseries data. Some typical examples of features are shown in Table 1, which are essentially based on handling the timeseries either in the time domain or in the frequency domain.

Table 1 Different types of timeseries features

Huynh and Schiele [29] have suggested that selecting individually tailored features for each activity can improve both the computational efficiency and the recognition accuracy. However, designing a specific descriptor for each activity is not flexible when adding new activities to the system. For example, if the system has fifty activities, fifty different descriptors need to be designed in the worst case, which can be impractical.

In the following subsections, we survey research works using different types of features, e.g., the autocorrelation function, Fourier analysis/spectra, autoregression, raw data, embeddings, etc. These types of features can be used standalone or combined with other types of features, e.g., time-domain features (standard deviation, variance, kurtosis, root mean square, mean, skewness, etc.) and/or frequency-domain features (discrete cosine transform, wavelet transform, entropy, quantiles, etc.).
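As a concrete illustration, the sketch below computes a small mixed time-/frequency-domain feature vector of the kind listed above for a single-axis window; the particular feature choice and the 50 Hz sampling rate are our own illustrative assumptions.

```python
# Sketch: handcrafted features for one window of one axis of one sensor.
import numpy as np
from scipy.stats import skew, kurtosis

def handcrafted_features(window, fs=50.0):
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(window.size, d=1.0 / fs)
    return np.array([
        window.mean(), window.std(), window.var(),
        np.sqrt(np.mean(window ** 2)),          # root mean square
        skew(window), kurtosis(window),
        freqs[np.argmax(spectrum[1:]) + 1],     # dominant (non-DC) frequency
        np.sum(spectrum ** 2) / window.size,    # spectral energy
    ])

feat = handcrafted_features(np.sin(np.linspace(0, 20 * np.pi, 256)))
```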

3.1 Autocorrelation function

One of the most effective and simple feature extraction techniques is based on the use of autocorrelation function [30] taking lag values starting from 0 up to a certain maximum value. The autocorrelation function computes the correlation/similarity between a given timeseries and a lagged version of it. The empirical estimation of the autocorrelation function can be computed as follows:

$$\begin{aligned} \begin{aligned} {\hat{\text {acf}}}(\ell ) =&\frac{{\hat{\gamma }}(\ell )}{{\hat{\gamma }}(0)} \\ {\hat{\gamma }}(\ell ) =&\frac{1}{T} \sum _{t = 1}^{T - \ell } (x_{t+\ell } - {\bar{x}})(x_{t} - {\bar{x}}) \end{aligned}\end{aligned}$$
(1)

where \({\bar{x}}\) is the sample mean of the timeseries and \(\ell \) is the lag value (the amount by which the original signal is shifted). The autocorrelation function captures the linear correlation between the timeseries and shifted versions of it. In some way it can be viewed as a feature reduction technique that reduces the original signal into a smaller signal whose length depends on the maximum lag value \(\ell \). From another perspective, it can also be considered a length normalization: different timeseries may have different lengths, but if we fix a certain lag, the acf of all of them up to this lag produces timeseries of the same length. This can be necessary for many analytical techniques, for example, when computing the similarity/dissimilarity between two timeseries.
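A direct implementation of the empirical estimator in Eq. (1) may look as follows (illustrative sketch; the maximum lag of 10 matches the choice reported in [24] below).

```python
# Sketch: empirical autocorrelation of a 1-D signal for lags 0..max_lag,
# yielding a fixed-length descriptor regardless of the signal length.
import numpy as np

def acf(x, max_lag=10):
    x = np.asarray(x, dtype=float)
    T = x.size
    xm = x - x.mean()
    gamma0 = np.sum(xm * xm) / T
    return np.array([np.sum(xm[lag:] * xm[:T - lag]) / T / gamma0
                     for lag in range(max_lag + 1)])

signal = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * np.random.randn(200)
descriptor = acf(signal, max_lag=10)    # length 11, independent of the signal length
```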

For example, Gomaa et al. [24] used a smartwatch to collect sensory data for 14 activities of daily living (ADLs), creating a dataset called EJUST-ADL-1 (see Table 15 for full description). They compute the autocorrelation function for each of the tri-axial timeseries of the acceleration, angular velocity, and rotation displacement, up to a certain lag (empirically choosing 10 as suitable). These are vectorized and fed to a random forest (RF) classifier that recognizes the performed activity. They achieved 80% accuracy over the 14 activities.

The autocorrelation function has also been used effectively for identifying periodic peaks [31] when segmenting repetitive actions. In a different context, Mostafa et al. [32] extracted the autocorrelation function from a tri-axial accelerometer (linear motion) and a tri-axial gyroscope (angular motion) to produce a multi-channel feature vector consisting of 6 channels, each of length 11 (10 being the maximum lag of the acf). The signals are streamed from IMU sensors during the walking activity. The authors experimented with multiple gait datasets, including EJUST-GINR-1 [19, 20], OU-ISIR [33], the Gait Events DataSet (GEDS) [34], and the Human Gait Database (HuGaDB) [22].

The acf features are then fed into two models for biometrics prediction: a classical random forest for gender classification and a convolutional neural network (CNN) for age estimation. Figure 13 shows the CNN architecture used. In comparison with other works in the literature, their CNN regressor performs better over almost all the considered datasets, with an age mean absolute error ranging from 0.9 up to 2.4 years. For gender recognition, both the random forest and the CNN have comparable results above \(90\%\) (except for the OU-ISIR dataset). This indicates the effectiveness of the autocorrelation function as a timeseries representation that is rather simple to compute and interpret.

Fig. 13 The BioDeep CNN architecture [32]

Adel et al. [20] use the autocorrelation function as a feature extractor for person identification using inertial streaming from wearable IMUs during the walking activity. Again the acf is computed for the 6-axial inertial signal (3-axial accelerometer and 3-axial gyroscope). They experiment with the EJUST-GINR-1 gait dataset [19, 20]. The inherent periodicity of the gait signal, especially in the y-direction, is apparent from Fig. 14; hence, the acf is well suited to capturing the gait characteristics of the given timeseries, especially the unique gait fingerprint of each individual. The authors feed the induced feature vectors to three classifiers: an SVM, an RF, and a CNN. The number of subjects is rather limited (20); however, the authors study the best on-body sensor location for person identification and how fusing several IMU sensors mounted on different body parts affects the results. Even though the case for gait is obvious (lower-body locations), the proposed methodology opens new directions for similar kinds of research with more complex types of activities.

Fig. 14 A sample of gait signal (EJUST-GINR-1 gait dataset [19, 20])

A hidden assumption in the use of autocorrelation is that the activity signals are statistically stationary, meaning that the statistical behavior of the timeseries does not change over time. To the best of our knowledge, this has empirically been shown to be the case, in that the performance of predictive models is rather satisfactory. However, much work still needs to be done to verify this assumption and to determine for which classes of activities it holds, and, more importantly, to state this assumption explicitly and put it in perspective. Another assumption, or rather simplification, is that each inertial modality (whether linear or angular), and each axis of the same modality, is assumed to be completely independent of the others. A better descriptor might be obtained by considering the cross-correlation among the different axes of the same modality and even across different modalities, as sketched below.
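A sketch of such a cross-axis descriptor, computed analogously to Eq. (1) but between two different channels, is shown below; this is our own illustration rather than a method taken from the surveyed works.

```python
# Sketch: empirical lagged cross-correlation between two channels
# (e.g., accelerometer x- and y-axes) for lags 0..max_lag.
import numpy as np

def cross_corr(x, y, max_lag=10):
    x, y = np.asarray(x, float), np.asarray(y, float)
    T = x.size
    xm, ym = x - x.mean(), y - y.mean()
    norm = T * x.std() * y.std()
    return np.array([np.sum(xm[lag:] * ym[:T - lag]) / norm
                     for lag in range(max_lag + 1)])

t = np.linspace(0, 8 * np.pi, 200)
ccf = cross_corr(np.sin(t), np.sin(t + 0.5), max_lag=10)
```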

Nourani [35] presented a comprehensive comparison of HAR feature sets using IMU sensors, selecting five state-of-the-art feature sets as case studies: statistical features, self-similar features, histogram-bin features, physical features, and orientation-invariant features. The self-similar features include the number of autocorrelation peaks, prominent autocorrelation peaks, weak autocorrelation peaks, the maximum autocorrelation value, the first autocorrelation peak height, etc.

Wang [36] presented a data-fusion-based hybrid sensory system for elderly ADL recognition; one of the proposed features is the autocorrelation of the data collected from the accelerometer, gyroscope, magnetometer, barometer, and temperature sensors. Janidarmian et al. [37] presented a comprehensive analysis of wearable acceleration sensors in HAR; one of the features used is the autocorrelation, specifically the height of the first and second peaks and the position of the second peak.

Morris et al. [38] proposed RecoFit, a system that uses a wearable sensor to find, recognize, and count repetitive exercises. Data were acquired by a wearable armband worn on the subject's right forearm; it contains a SparkFun IMU sensor (3-axial accelerometer and 3-axial gyroscope). The authors used the autocorrelation function to find patterns of self-similar repetitive exercise. Each 5-s window is transformed into 224 features. For instance, the authors compute the following five features: number of autocorrelation peaks, prominent peaks, weak peaks, maximum autocorrelation value, and height of the first autocorrelation peak after a zero-crossing.

Muehlbauer et al. [39] proposed a HAR system based on segmentation, recognition, and counting. They achieved \(85\%\) segmentation accuracy and \(94\%\) recognition accuracy on ten classes with user-independent training. The proposed segmentation approach utilizes features based on the autocorrelation function. For each window, the authors calculated the median crossings and the peaks of the signal's autocorrelation function. High values of these two features clearly indicated that an exercise was being performed.

Zebin et al. [40] proposed a HAR system based on multichannel timeseries signals collected by five wearable sensor units, each with an integrated MPU-9150 IMU (accelerometer and gyroscope). Six activities were recognized: walking, walking upstairs, walking downstairs, sitting, standing, and lying down. The authors applied four machine learning algorithms: k-NN, NB, SVM, and MLP. Several feature extraction methods were adopted: average value for all three acceleration components, RMS value for all three acceleration components, autocorrelation features (height of the main peak, height and position of the second peak), spectral peak features (height and position of the first six peaks), and spectral power features (total power in five adjacent and pre-defined frequency bands).

Jain et al. [41] proposed a stride segmentation method from 3-axial accelerometer data for three walking activities (level walking, walking upstairs, and walking downstairs). The autocorrelation function is used to estimate stride boundaries. The segmentation and extraction of stride-relevant data were achieved using a tuning parameter based on the minimum standard deviation among the segmented strides. The method is evaluated on the HAPT [42] and OU-ISIR gait [43] inertial sensor datasets.

Bota et al. [44] proposed a Semi-Supervised Active Learning (SSAL) approach based on Self-Training (ST) for HAR to partially automate the annotation process and reduce the number of samples an expert needs to annotate. The proposed approach selects the most relevant samples for annotation and propagates their annotation to the most confident samples. The authors compared supervised and unsupervised models with the proposed approach on two ADL datasets: (1) the UCI dataset [45] with the activities standing, sitting, downstairs, and upstairs, where the considered features are statistical-domain features (e.g., skewness, max, mean, histogram), temporal-domain features (e.g., zero crossing rate), and spectral-domain features (e.g., median frequency, max power spectrum); and (2) the CADL dataset (collected by the authors) with the activities standing, sitting, downstairs, and upstairs, where the considered features are statistical-domain features (e.g., mean, histogram, root mean square), temporal-domain features (e.g., autocorrelation, temporal centroid), and spectral-domain features (e.g., max power spectrum). The proposed method reduced the required annotated samples by more than \(89\%\) while maintaining an accurate model.

Altun and Barshan [46] compared various techniques for HAR using inertial and magnetic sensors: Bayesian decision making (BDM), the least-squares method (LSM), k-NN, DTW, SVM, and ANN. Daily and sports activities are recognized using five sensor units worn by eight users on the chest, arms, and legs. Each sensor unit comprises a 3-axial gyroscope, a 3-axial accelerometer, and a 3-axial magnetometer; 26 features were calculated for each signal: minimum and maximum values, mean value, variance, skewness, kurtosis, 10 equally spaced samples from the autocorrelation sequence, and the first five peaks of the discrete Fourier transform with the corresponding frequencies.

Yu [47] presented CNN and RNN models to classify fall and non-fall activities. A publicly available dataset was used [48], in which 6 MTx sensor units were worn; each sensor unit comprises a 3-axial accelerometer, a 3-axial gyroscope, a 3-axial magnetometer, and an atmospheric pressure sensor. This dataset includes 20 fall activities and 16 ADL activities. The CNN achieved an accuracy of over \(95\%\) in classifying fall and non-fall activities. The RNN achieved an accuracy of over \(97\%\) in classifying fall, non-fall, and a third category (defined as "pre/postcondition"). The following features were extracted for each time window: minimum and maximum values, mean value, variance, skewness, kurtosis, the first 11 values of the autocorrelation sequence, and the first five peaks of the discrete Fourier transform.

Pereira et al. [49] presented a biofeedback system based on sEMG and IMU sensors for evaluating home-based physiotherapy exercises. The sEMG signal was used to identify temporal intervals in which muscular activity was occurring. Exercise repetitions were segmented into time windows from which features related to subject posture were extracted. The extracted features include statistical features (e.g., skewness, kurtosis, histogram, etc.) and temporal features (e.g., mean, median, maximum, minimum, variance, temporal centroid, standard deviation, root mean square, autocorrelation, etc.). These features were input to DT, k-NN, RF, and SVM methods to classify proper/improper exercise execution, achieving an accuracy of \(\ge \, 92\%\).

Rosati et al. [50] compared two feature sets for HAR; the first includes time-, frequency-, and time-frequency-domain features, while the second includes time-domain features that consider the signals' physical context. The comparison of the two feature sets was based on the performance of four machine learning classifiers. 61 healthy subjects performed 7 ADL activities while wearing an IMU-based device. A 5-s window was used to segment each signal; for each window, 222 and 221 features were extracted for the first and second feature sets, respectively. The first feature set includes 222 features from different domains: 20 time-domain features (mean value, variance, standard deviation, skewness, kurtosis, minimum value, maximum value, 25th percentile, 75th percentile, interquartile range, 10 samples from the autocorrelation sequence), 3 frequency-domain features (mean and median frequency of the power spectrum, Shannon spectral entropy), and 14 time-frequency-domain features (norms of the approximation and detail coefficients of 7 levels of discrete wavelet transform decomposition). For each feature set, simultaneous feature selection and classifier parameter optimization were performed using a Genetic Algorithm (GA). The results showed that the SVM has the best performance on both feature sets (\(97.1\%\) and \(96.7\%\)). The second feature set allowed for easier interpretation of changes in biomechanical behavior in complex cases, e.g., when applied to pathological subjects.

Muaaz [51] proposed a HAR system called WiWeHAR using Wi-Fi and wearable IMU sensors. Standard Wi-Fi network interface cards collected the Channel State Information (CSI), and the wearable IMU (accelerometer, gyroscope, magnetometer) captured the subject's local body movements. The authors computed the time-variant Mean Doppler Shift (MDS) from the processed CSI data and the magnitude of each IMU sensor signal. Then, they extracted 23 time-domain and frequency-domain features from the MDS and the magnitude data: mean, variance, standard deviation, skewness, mean absolute value, waveform length, enhanced mean absolute value, enhanced waveform length, weighted mean absolute value (step-wise or continuous window function), maximum fractal length, mean amplitude change, root mean square, difference absolute standard deviation, simple squared integral, Willison amplitude, zero crossing, slope sign change, maximum of the absolute value, slope, peaks of the autocorrelation, spectral peaks, and spectral energy. The authors evaluated the performance of WiWeHAR using a multimodal human activity dataset collected from 9 subjects. Every subject performed four activities: walking, falling, sitting, and picking up an object from the floor. The results indicated that the proposed multimodal system outperformed the unimodal CSI, accelerometer, gyroscope, and magnetometer HAR systems, achieving an overall recognition accuracy of 99.6–100%.

3.2 Fourier analysis

Another major and very classical descriptor of the timeseries is based on the Fourier transform and its variants. Generally, the Fourier transform can be considered as the projection of the given timeseries onto an infinite basis of complex exponentials \(e^{i 2\pi nt}\). As these complex exponentials are orthonormal, they form a basis of an infinite-dimensional space, and the nth Fourier coefficient is the projection of the function f(t) onto the axis of the nth complex exponential \(e^{i 2\pi nt}\). The numerical/computational realization of the Fourier transform is the Discrete Fourier transform, whose most prominent implementation is the Fast Fourier transform. The coefficients of the transform can be taken as a descriptor/feature vector of the input timeseries. As the transform is infinite in nature, we consider only the dominant components, which correspond to the dominant frequencies (in magnitude) in the original timeseries.
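The sketch below illustrates such a descriptor: the magnitudes and frequencies of the k dominant discrete Fourier components of a window (the sampling rate and k are illustrative assumptions).

```python
# Sketch: describe a window by its k largest (non-DC) Fourier components.
import numpy as np

def dominant_fft_components(window, fs=50.0, k=5):
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(window.size, d=1.0 / fs)
    top = np.argsort(spectrum[1:])[::-1][:k] + 1        # skip the DC component
    return np.concatenate([spectrum[top], freqs[top]])  # 2k-dimensional descriptor

window = np.sin(2 * np.pi * 2.0 * np.arange(256) / 50.0)  # 2 Hz tone sampled at 50 Hz
features = dominant_fft_components(window)
```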

Of particular interest is the wavelet transform. The wavelet transform is an improvement over the classical Fourier transform in which the signal is decomposed in terms of basis functions localized in both time and frequency. This helps capture irregularities and non-stationarities in the input timeseries, as it acts as a localized Fourier transform. Given a continuous-time signal x(t), its continuous wavelet transform can be computed as follows:

$$\begin{aligned} \begin{aligned} \gamma (s,\tau ) = \frac{1}{\sqrt{s}} \int x(t) \cdot \psi \left( \frac{t - \tau }{s}\right) {\rm{d}}t \end{aligned} \end{aligned}$$
(2)

The function \(\psi \) represents the wavelet basis function (called the mother wavelet). Each wavelet is characterized by two parameters: its translation value \(\tau \) and its scaling value s. These parameters determine both the time localization and the frequency localization of the mother wavelet. By continually changing the parameter \(\tau \), the mother wavelet is translated and, accordingly, temporal resolution is achieved. On the other hand, by continually changing the parameter s, resolution across frequency ranges is gained.

By convolving the given timeseries with the scaled and translated versions of the mother wavelet, a similarity value between the signal and the mother wavelet at that translation and scale is obtained. This similarity defines the wavelet coefficient \(\gamma (s,\tau )\). Abdu-Aguye et al. [52] investigated the effectiveness of two feature extraction techniques, one of which is based on decomposing the signal by applying the wavelet transform with the Haar mother wavelet.

The autocorrelation of the resulting coefficients is computed up to a certain lag (they experimented with 5, 10, and 15). The resulting feature vectors are then fed to 4 different classifiers: random forests [53], canonical correlation forests (a variation of random forests) [54], ProtoNN [55], and Deep Forests (gcForest) [56].
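A minimal sketch of this kind of descriptor, assuming the PyWavelets package, decomposes a window with the Haar wavelet and then takes the autocorrelation of the concatenated coefficients; the decomposition level and lag are illustrative and need not match the exact setup of [52].

```python
# Sketch: Haar wavelet decomposition followed by the autocorrelation of the
# concatenated wavelet coefficients, yielding a (max_lag + 1)-dimensional descriptor.
import numpy as np
import pywt

def haar_acf_descriptor(window, level=3, max_lag=10):
    coeffs = np.concatenate(pywt.wavedec(window, "haar", level=level))
    c = coeffs - coeffs.mean()
    gamma0 = np.sum(c * c) / c.size
    return np.array([np.sum(c[lag:] * c[:c.size - lag]) / c.size / gamma0
                     for lag in range(max_lag + 1)])

desc = haar_acf_descriptor(np.random.randn(256))
```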

The problem explored is direct activity recognition, specifically, recognizing activities of daily living over three datasets: HAPT [42], Sports and Daily Activities [57], and EJUST-ADL-1 [24]. Their results show the effectiveness of the developed wavelet-based descriptor, especially in comparison with the other descriptor proposed by the authors, which is based on autoregressive modeling. Across all classifiers and datasets, the wavelet-based descriptor achieves accuracies of at least roughly \(80\%\).

Motivated by the strong need for monitoring human activities for healthcare, Preece et al. [58] conducted a rather extensive comparison of different types of feature descriptors extracted from accelerometer signals. They compared 14 feature extraction methods derived from wavelet-transform, time-domain, and frequency-domain characteristics of the inertial signals. The comparison was done using two activity datasets collected from 20 subjects. One dataset contains the common daily activities of walking, ascending stairs, and descending stairs. The second dataset is more encompassing and contains 8 activities, including hopping, running, jogging, and jumping. To test the sensitivity of the feature descriptors with respect to the sensor's location, the authors compared the activity classification accuracy over inertial signals streamed from three different lower-limb locations (waist, thigh, and ankle) and combinations of these. The lower limb was chosen exclusively because all the activities mainly involve the lower-body joints. A nearest neighbor classifier was used to test the performance of different configurations of feature descriptor/sensor placement/activity set, with cross-subject validation used to assess and compare the configurations. Their empirical study indicated that the wavelet transform is best suited to characterizing non-stationary inertial signals; however, it does not perform as well for classifying stationary activities performed by healthy subjects. In the latter case, frequency-based features outperform the others. Their best descriptor achieved about \(95\%\) cross-subject classification accuracy.

Foerster and Fahrenberg [59] proposed a motion and posture recognition model based on accelerometer data. The considered activities include posture, motion, stairs, lying, supine, sitting, walking, and bicycling. The activities were recorded by 31 subjects, comprising 13 motions and postures, each repeated three times. Five uni-axial sensors were used at three placement locations (the sternum, with three axes, and the right and left thighs). The authors proposed a hierarchical classification method that recognizes sub-categories of motion and posture activities with only \(3.2\%\) misclassifications. The raw DC values and rectified AC values were averaged for each activity and monitoring segment. The walking frequency was calculated using the short-time Fourier transform within a frequency band of 0.5–4 Hz from the z-axis of the sternum sensor.

Preece et al. [58] compared different feature descriptors extracted from accelerometer signals. They compared 14 feature extraction methods derived from wavelet-transform, time-domain (e.g., mean, standard deviation, median, etc.), and frequency-domain (e.g., principal frequency, spectral energy, magnitudes of the first five components of the FFT analysis, etc.) characteristics of the IMU signals. In order to extract the frequency-domain features, an FFT was applied on each 2-s window (128 samples) with \(50\%\) overlap between consecutive windows. The final frequency-domain feature set comprises the magnitudes of the first five components of the FFT power spectrum.

Janidarmian et al. [37] proposed a HAR system based on wearable accelerometers. The proposed features include the Power Spectral Density (PSD), the Fourier coefficients in the frequency domain, the positions and power levels of the highest 6 peaks of PSD computed over a sliding window, and the total power in 5 adjacent and pre-defined frequency bands.

Bao and Intille [60] proposed a HAR system based on subject-annotated acceleration data. Features such as mean, energy, frequency-domain entropy, and correlation are extracted from 512-sample sliding windows of acceleration data with \(50\%\) overlap between consecutive windows. Each window spans 6.7 s. A window of several seconds is sufficient to cover cycles in activities like walking, window scrubbing, and vacuuming, and the window size of 512 samples allows for quick computation of FFTs.

Huynh and Schiele [29] studied the impact of features and window length on HAR accuracy. To study the effect of various window lengths, acceleration features were extracted from windows of 128, 256, 512, 1024, and 2048 samples. For each window, the authors computed the magnitude of the mean, the variance, the energy, the spectral entropy, and the discrete FFT coefficients. The results showed that there is neither a single feature nor a single window length that achieves the best accuracy for all activities. The FFT features were always among the features with the highest cluster precision. However, the FFT coefficients that achieved the highest precision differ for each activity, and recognition can be enhanced by selecting features for each activity individually. In general, the highest peaks of the FFT coefficients lay between the first and the tenth coefficient. The recognition results also indicated that grouping different FFT coefficients into bands of exponentially increasing size can be a trade-off against using individual or paired coefficients. For the non-FFT features, the variance has consistently high precision values, while the spectral entropy has the highest values for the activity ‘standing’. On average, features with window lengths of 1 and 2 s achieved slightly higher precision values than those with other window lengths. However, there are big differences across the various activities, and selecting different window lengths for different activities resulted in higher recognition rates.

Eyobu and Han [61] presented a spectrogram-based feature representation and data augmentation technique for HAR based on wearable IMU sensor data. A Hanning window was utilized, and 512 samples were used in each FFT vector with a sampling frequency of 50 Hz. The Hanning window, defined as in Eq. (3), was named after Julius von Hann, an Austrian meteorologist, and is also known as the cosine bell.

$$\begin{aligned} \begin{aligned} w(n)&= \cos ^{2}\left( \frac{n}{N}\pi \right) , \quad n = -\frac{N-1}{2}, \\&\quad \ldots , -1,0,1,\ldots ,\frac{N-1}{2} \end{aligned} \end{aligned}$$
(3)

An ensemble of data augmentations in feature space is proposed to tackle the data scarcity problem. Performance evaluation was done on an LSTM architecture to assess the impact of the proposed feature representation and data augmentations on HAR accuracy. In addition to a newly proposed dataset, the data augmentation technique is evaluated on the UCI HAR dataset [62]. Using few spectral features, the authors achieved state-of-the-art recognition performance. The smaller spectral feature set exhibits a lower training time of about 1 h and 45 min, compared to about 2 h and 30 min for the large feature set utilized in this work. The proposed extraction approach achieved the best performance enhancement in accuracy of \(52.77\%\) compared to the UCI online dataset, while the proposed OR (original spectral features) + LA1 (first local averaging) combination achieved a performance enhancement in accuracy of \(32.6\%\) compared to the UCI online dataset.
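A minimal sketch of a Hann-windowed spectral feature of this kind is given below (512-sample frames as above; the log transform and the exact feature choice are our own illustration).

```python
# Sketch: apply the Hann (cosine-bell) window of Eq. (3) to a 512-sample frame
# and keep the log-magnitude spectrum as a feature vector.
import numpy as np

def hann_log_spectrum(frame):
    windowed = frame * np.hanning(frame.size)
    return np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)

frame = np.random.randn(512)               # e.g., one 50 Hz accelerometer frame
spec_features = hann_log_spectrum(frame)   # 257 log-spectral values
```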

García [63] proposed human activity recognition and segmentation using Hidden Markov Models. The recognition is based on IMU signals collected by two smartphone sensors: accelerometer and gyroscope. Six different activities are considered: walking, walking-upstairs, walking-downstairs, sitting, standing, and lying. These activities were performed by 30 users. The feature vector included: mean, standard deviation, median, absolute value (for each axis), minimum and maximum values, Signal Magnitude Area (SMA), energy (average sum of squares), interquartile range, entropy, autoregression coefficients, correlation coefficient between two signals, maximum frequency component, frequency signal weighted average, skewness, kurtosis, frequency interval energy within the 64 FFT bins of each window, and the angle between two vectors. The author concluded that cepstral techniques are suitable for HAR and segmentation. Although inertial signal analysis is similar to speech processing, some transformations (like pre-emphasis) are useless for inertial signal analysis. The RASTA filtering and delta coefficients were found to be suitable only for activity segmentation; thus, their use is not recommended when subject/activity recognition is the main goal. The use of Activity Sequence Modelling is highly suggested since it reduced all error rates for recognition and segmentation. The subject recognition error rate was reduced from 26.8 to 16.5% and the vector size from 561 to 180 features. The activity segmentation error rate was reduced from 5.2 to 0.4% and the vector size from 561 to 405 features. The activity recognition error rate increased slightly from 3.5 to 3.6% using the cepstral techniques; however, the vector size was lower (only 180 features).

3.2.1 Spectra

Machado et al. [64] proposed a HAR system based on a 3-axial accelerometer sensor; the proposed system is based on an unsupervised learning approach. The considered features are statistical domain features (kurtosis, skewness, mean, standard deviation, interquartile range, histogram, root mean square, median absolute deviation), temporal domain features (zero crossing rate, pairwise correlation, autocorrelation), and spectral domain features (maximum frequency, median frequency, cepstral coefficients, power spectrum, Mel-frequency cepstral coefficients, fundamental frequency, power bandwidth).

Krause et al. [65] proposed a wearable system to determine online the user context and context transition probabilities. The authors tried various sampling frequencies and applied different feature engineering methods. For low sampling rates (1 and 10 samples per minute), the onboard averages and absolute differences were calculated. For the high sampling rate (8 samples per second), the Fast Fourier Transform (FFT) oscillatory spectra of the recorded accelerometer values were calculated. A 64-point FFT (i.e., spectra of 8 s windows, with a separate spectrum for each axis) is a good choice to discriminate various fine-grained movement patterns. The resulting spectra were log-transformed. In the low sampling rate cases, all 8 data channels of the BodyMedia device were recorded, forming an 8-dim feature space. For the high sampling rate, the spectra of the two accelerometers were recorded, forming a 128-dim feature space.

Abreu et al. [66] proposed a HAR system where the signals were collected from four sensors: accelerometer, gyroscope, magnetometer, and microphone. A new dataset was proposed comprising 10 activities, e.g., opening a door, brushing teeth, typing on a keyboard, etc. The proposed HAR classifier is based on multiple HMMs, one per activity. The authors used all three axes (x, y, z) of the 3-axial sensors (accelerometer, gyroscope, and magnetometer) as well as the overall magnitude of each, \(\sqrt{x^{2} + y^{2} + z^{2}}\). To address classification conflicts between similar activities (e.g., opening a door and opening a faucet), sound was included in the classification process by combining the IMU sensors with the microphone. Each signal is segmented into time windows of length 250 ms without overlap. For each 250 ms window, over 30 different features were extracted. The features came from the temporal, spectral, and statistical domains and were calculated for all sensors and axes. The final output is a feature vector with 265 features for each 250 ms time window. The authors achieved an overall accuracy of \(84 \pm 4.8\%\) utilizing 27 features selected by Forward Selection, coming from different domains (statistical, temporal, and spectral). For online recognition, the proposed solution was preliminarily tested on 3 of the 8 initial subjects. The classifier detected activities within a continuous stream with an F1-score of \(74 \pm 26\%\).

Shoaib et al. [67] present a fusion-based HAR system using smartphone motion sensors. Ten male subjects performed seven activities: walking, running, sitting, standing, jogging, biking, walking upstairs, and walking downstairs. Five smartphones (Samsung Galaxy SII i9100) were used with the following sensors: accelerometer (with and without gravity), gyroscope, and magnetometer. The authors adopted various machine learning techniques: Bayesian networks, SVM, logistic regression, k-NN, decision trees, and random forest. Three sets of time-domain features and one set of frequency-domain features (the sum of the first 5 FFT coefficients and spectral energy) were proposed. The authors showed that the utilized sensors, except the magnetometer, were individually capable of recognition, depending on the activity type, the body location, the feature set, and the classification method (personalized or generalized). The authors also showed that sensor fusion only enhances the overall recognition accuracy when the individual sensor performances are not high enough.

Tahir et al. [68] presented a HAR model based on IMU sensors, i.e., gyroscopes and accelerometers. The IMU data is processed using multiple filters such as the Savitzky–Golay, median, and Hampel filters for examining the lower/upper cutoff frequency patterns. Statistical features (mean, median, variance, and max/min values), frequency-domain features (chirp z-transform (CZT), spectral entropy, Hilbert transform), and wavelet transform features (the Walsh–Hadamard transform) were extracted. The Adam [69] and AdaDelta [70] methods were utilized in the feature optimization phase. The proposed model was evaluated on the USC-HAD dataset [12] and the Intelligent Media-sporting Behavior (IMSB) dataset [71], a new subject-annotated sports dataset. Leave-one-out cross validation was utilized, and the proposed model achieved recognition accuracies of \(91.25\%\), \(93.66\%\) and \(90.91\%\) on the USC-HAD [12], IMSB [71], and MHealth [72] datasets, respectively.

Figueira et al. [73] proposed a HAR system where two sensors were used (accelerometer and barometer) with sampling frequencies of 30 and 5 Hz, respectively. The obtained signals were divided into 5-s windows. The dataset used comprises 25 subjects performing ADLs with the smartphone worn in 12 different positions. Three sets of features were extracted from each window: spectral domain (maximum/median/fundamental frequencies, maximum power spectrum, total energy, spectral centroid/spread/skewness/kurtosis/slope/decrease/roll-on/roll-off, curve distance, spectral variation), statistical domain (skewness, kurtosis, histogram, mean, standard deviation, interquartile range, variance, root mean square, median absolute deviation), and temporal domain (correlation, temporal centroid, autocorrelation, zero crossing rate, linear regression).

Attal et al. [74] presented various classification methods for HAR. Three accelerometers were placed at the subject's chest, right shank, and left ankle. Four supervised classifiers (k-NN, SVM, GMM, RF) and three unsupervised classifiers (k-means, GMM, HMM) were compared. Raw sensor data and extracted features were used independently as inputs for each classification method. Nine accelerometer signals were collected from three MTx IMUs. For each signal, the following time- and frequency-domain features are calculated: eleven time-domain features (mean, variance, median, interquartile range, skewness, kurtosis, root mean square, zero crossing, peak-to-peak, crest factor, and range) and six frequency-domain features (DC component of the FFT spectrum, energy spectrum, entropy spectrum, sum of wavelet coefficients, squared sum of wavelet coefficients, and energy of wavelet coefficients). The k-NN gave the best results among the supervised classification algorithms, while the HMM provided the best results among the unsupervised ones.

Suto et al. [75] presented various feature extraction methods applied in the HAR domain. The quality of the selected features was assessed using a feed-forward artificial neural network, k-Nearest Neighbor, and a decision tree. The authors applied a sliding time window of 32 samples with \(50\%\) overlap between consecutive windows. This window length spans a 1.6 s time interval, as the authors used the WARD database [76] that has a sampling frequency of about 20 Hz. The extracted time-domain features are the mean, variance, mean absolute deviation, root mean square, zero crossing rate, interquartile range, 75th percentile, kurtosis, signal magnitude area (SMA), and the min–max values. The extracted frequency-domain features include the spectral energy, spectral entropy, spectral centroid, and the principal frequency. Other kinds of extracted features are the correlation between axes, autoregressive coefficients, and the tilt angle. The results showed that the proposed machine learning and feature selection approach outperformed other classification methods (up to approximately \(100\%\) recognition rates). The authors also showed that a small number of sensors (compared to other works) can be sufficient for good classification performance.
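Many of the surveyed works share a common core of per-window descriptors. The sketch below is illustrative only; the chosen subset of features and the 32-sample window with 50% overlap (as in [75]) are assumptions for the example:

import numpy as np
from scipy import stats

def window_features(x, fs=20.0):
    # common time- and frequency-domain descriptors for one axis within one window
    xc = x - x.mean()
    spectrum = np.abs(np.fft.rfft(xc)) ** 2
    p = spectrum / (spectrum.sum() + 1e-12)        # normalized power for entropy
    return {
        "mean": x.mean(),
        "var": x.var(),
        "mad": np.mean(np.abs(xc)),                # mean absolute deviation
        "rms": np.sqrt(np.mean(x ** 2)),
        "iqr": np.subtract(*np.percentile(x, [75, 25])),
        "kurtosis": stats.kurtosis(x),
        "zero_crossings": int(np.sum(np.diff(np.sign(xc)) != 0)),
        "spectral_energy": spectrum.sum() / len(x),
        "spectral_entropy": float(-np.sum(p * np.log2(p + 1e-12))),
        "principal_freq": np.fft.rfftfreq(len(x), d=1 / fs)[np.argmax(spectrum)],
    }

# 32-sample windows with 50% overlap over a synthetic signal
sig = np.random.randn(200)
feats = [window_features(sig[i:i + 32]) for i in range(0, len(sig) - 32 + 1, 16)]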

3.3 Autoregression

Vector autoregression is an extension of ordinary least squares (OLS) regression to the multivariate case. A k-dimensional timeseries (k can be the number of inertial sensors times the number of degrees of freedom of each sensor) can be linearly regressed in time. In other words, the vector value at time t is linearly dependent on the vector values at previous time steps (up to a certain order) plus a Gaussian noise term. Formally, this can be written as follows:

$$\begin{aligned} \begin{aligned} {\textbf{y}}_{t} = A_{1}{\textbf{y}}_{t-1} + A_{2} {\textbf{y}}_{t-2} + \cdots + A_{p} {\textbf{y}}_{t-p} + {\textbf{b}} + {\epsilon }_{t} \end{aligned} \end{aligned}$$
(4)

The \(\{A_{i}\}\) matrices represent the linear transformations, \(\textbf{b}\) is a bias/intercept term, and \({\epsilon }_{t}\) is a white Gaussian noise. This kind of regression is called an autoregressive (AR) model in the terminology of timeseries analysis. A key to the validity of these models is that the given timeseries is stationary [30], which roughly means that the statistical properties (such as the moments) of the signal remain fixed over time. Even though this might arguably not be the case for many human actions, features based on autoregressive models have been used with reasonable success. The model’s unknowns are these matrices, the bias vector, and the covariance matrix of the noise term. These can be estimated using a variant of the ordinary least squares technique. Once these parameters are estimated, they can be taken as the feature vector or the descriptor of the given timeseries.
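As an illustration, the following sketch (assuming the statsmodels library; not tied to any particular surveyed paper) fits the model of Eq. (4) to a multi-axis window and flattens the estimated parameters into a feature vector:

import numpy as np
from statsmodels.tsa.api import VAR   # assumed third-party dependency

def ar_descriptor(window, order=3):
    # fit a VAR(p) model to a (time x channels) window and use the estimated
    # coefficient matrices {A_i} and intercept b as the feature descriptor
    result = VAR(window).fit(maxlags=order, ic=None)   # plain least-squares fit
    return np.concatenate([result.coefs.ravel(),        # p * k * k coefficients
                           result.intercept.ravel()])   # k intercept terms

# e.g., a 256-sample window of 6 channels (3-axis accelerometer + 3-axis gyroscope)
window = np.random.randn(256, 6)
features = ar_descriptor(window, order=3)   # length 3*6*6 + 6 = 114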

Abdu-Aguye et al. [77] investigated the effectiveness of this kind of features for activity recognition using four different classifiers, namely, random forests, canonical correlation forests, ProtoNN, and deep forests. They tested their work over three benchmarks: EJUST-ADL-1 [24], HAPT [42], and Daily and Sports Activities [57]. The dimensionality of the timeseries in each dataset depends on the number of inertial modalities used. The empirical study shows the general efficacy of using the autoregressive coefficients as a feature descriptor of the timeseries, even though statistical stationarity is questionable for several kinds of actions in these datasets. The results, though, differ widely based on the dataset used as well as the classifier applied. However, random forests can be considered the best economical choice in terms of the achieved predictive performance and their demand for computational resources.

Hassan et al. [78] presented a HAR system using the smartphone IMU sensors and deep learning. Firstly, the relevant features are extracted from the raw sensor data, e.g., mean, median, autoregressive coefficients, etc. Secondly, the features are processed by Kernel Principal Component Analysis (KPCA) and Linear Discriminant Analysis (LDA). Finally, the features are used to train a Deep Belief Network (DBN) for accurate activity recognition. The proposed approach outperformed traditional recognition methods such as the multiclass Support Vector Machine (SVM) and Artificial Neural Network (ANN). The system was evaluated on 12 different physical activities, where the mean recognition rate was \(89.61\%\) and the overall accuracy was \(95.85\%\). The other traditional methods achieved at best a mean recognition rate of \(82.02\%\) and an overall accuracy of \(94.12\%\).

Khan et al. [15] presented a HAR system based on a 3-axial accelerometer via an augmented feature vector and a hierarchical recognizer. At the lower level, the activity state (static, transition, dynamic) is classified using statistical features and ANNs. The upper recognition level utilizes autoregressive modeling of the acceleration signals. The augmented feature vector consists of autoregressive coefficients, signal magnitude area, and the tilt angle. The resulting feature vector is further processed by linear discriminant analysis and ANNs to recognize a particular human activity. The proposed activity-recognition method recognizes 3 states and 15 activities with an average accuracy of \(97.9\%\) using a single 3-axial accelerometer placed at the user’s chest.

San-Segundo et al. [79] adopted the well-known methods successfully applied in the speech processing domain: Mel Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction (PLP) coefficients. In addition, Relative Spectra (RASTA) filtering [80] and delta coefficients were also evaluated for IMU processing. The proposed HAR system is based on Hidden Markov Models (HMMs) and recognizes 6 activities: walking, walking upstairs/downstairs, sitting, standing, and lying. A publicly available dataset was used, UCI Human Activity Recognition Using Smartphones [81, 82]. This dataset includes several sessions of physical activity sequences from 30 subjects. The dataset was divided randomly into 6 subsets to apply a sixfold cross validation procedure. The baseline method for feature extraction performed several calculations in the time and frequency domains. For the time domain, the features included the mean, correlation, signal magnitude area (SMA), and autoregressive coefficients. For the frequency domain, the features included the energy of different frequency bands and frequency skewness. Other features included the angle between vectors. A total of 561 features were extracted to represent each window (i.e., every 2.56 s). The results presented in this paper significantly improve on the error rates of the baseline method. The adapted MFCC and PLP coefficients enhanced HAR and segmentation accuracies while decreasing the size of the feature vector. RASTA filtering and delta coefficients reduced the segmentation error rate significantly, giving the best results: an activity segmentation error rate lower than \(0.5\%\).

3.4 Raw data

Most of the work done on HAR systems is based on supervised learning over derived features, manually or automatically extracted from the sample timeseries. In the following we survey some of the techniques that deviate from this direction.

Some research works have treated the streamed signals as direct input without any prior preprocessing and/or feature engineering. Most of that work can be considered as automatic feature engineering (representation learning) through the use of recent advances in deep neural networks. However, some of that work just directly models the temporal aspects of the incoming signal without explicitly learning any feature representation. In other words, the dynamics of the model itself encode the input signal without explicitly transforming (whether manually or automatically) the signal into some feature space. The most prominent example in that direction is the use of Hidden Markov Models (HMMs). An HMM is a latent variable model where there is a sequential relationship among the latent variables. It is theoretically founded on the theory of Markov stochastic processes, where such a process satisfies the Markovian property that the current state of the system depends exclusively on the previous state and is hence independent of earlier history.

The introduction of latent states, forming the HMM, actually extends this model beyond such a Markovian assumption. More specifically, the latent variables represent discrete states that transit among themselves according to some stochastic law. The transition behavior follows the Markovian property, that is, the next transition depends only on the current state. Alongside transiting to a new state, a sampled output is observed, through which these hidden states are inferred. The parameters of the model to be learned or estimated are: the transition probabilities, the parameters of the observation distributions, and the distribution of the initial state. By their inherent nature, HMMs are well suited for modeling sequential data, and in particular, in our context, timeseries data. They can be used as an inferential as well as a generative engine. The main advantage is that HMMs are mathematically well established and hence can provide explanations for their induced inferences and outputs. They require relatively modest computational resources during training (especially when trained using frequentist approaches such as expectation maximization) when compared to deep neural networks.

In the context of timeseries HAR signals, an HMM model can be viewed as encoding the given activity. For example, assume that we have an HMM H with only two states that is used to model the timeseries given in Fig. 15a. Figure 15b shows the corresponding state sequence produced by this two-state HMM. As can be seen from the figure, we can perceive the HMM as encoding the timeseries as a binary string. So, for example, if the timeseries is the z-acceleration of a push-up exercise, then the modeling HMM just encodes this exercise using two levels, which could correspond to the up position and the down position.

Fig. 15 An example signal and its corresponding HMM abstraction with two states
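A minimal sketch of this two-state encoding, assuming the hmmlearn library and a synthetic oscillatory signal standing in for the push-up z-acceleration, could look as follows:

import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumed third-party dependency

# synthetic z-axis acceleration of a push-up-like oscillation sampled at 50 Hz
t = np.arange(0, 10, 1 / 50)
acc_z = np.sin(2 * np.pi * 0.5 * t) + 0.1 * np.random.randn(len(t))

# a two-state HMM "encodes" the signal as a binary string of up/down states
hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
hmm.fit(acc_z.reshape(-1, 1))
states = hmm.predict(acc_z.reshape(-1, 1))   # e.g., 0000111100001111..., cf. Fig. 15b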

Ashry et al. [83] have used HMM for activity classification. Their main contribution is the novel application of such models, along with devising dissimilarity measures between different HMMs in the domain of activity recognition. Each activity sample, a multi-axial timeseries, is modeled using HMMs for the different axes; the HMM itself is taken to represent the signal (a descriptor of the signal). The training phase consists entirely of deriving the HMM models for all the activity samples in the given dataset. These are considered templates/prototypes to be used in the subsequent testing phase. In a 1-Nearest Neighbor fashion, a new test multi-dimensional timeseries is used to induce its characterizing HMM. This HMM is compared against those of the templates using a dissimilarity measure developed by the authors to deduce the closest one whose class of activity is then taken as the decision for the new sample.

Figure 16 shows the architecture of the proposed model in [83]. The authors validated their work on two publicly available datasets, namely, EJUST-ADL-1 [24] and USC-HAD [12], and compared the performance against an extant approach utilizing feature extraction and another technique utilizing a deep Long Short-Term Memory (LSTM) classifier. Table 2 shows a snapshot of the results. On EJUST-ADL-1 they achieved performance metrics (accuracy, sensitivity, specificity, precision, and F-measure) on the order of \(90\%\), and on the order of \(85\%\) on the USC-HAD dataset.

Fig. 16 Framework of the proposed HMM model for HAR [83]

Table 2 A comparison between an HMM-based method [83], LSTM using raw data, and RF [24] over two different public datasets

A crucial note about this work is that no preprocessing step for feature extraction is performed; the raw timeseries data streamed from the sensors are fed directly to the HMM. Hence, in some sense, HMMs can provide the same advantage as deep neural networks for automatic feature extraction. However, the features extracted here are encoded in the form of dynamics. This is very appropriate for low semantic content data such as timeseries. However, it may not be feasible for more complicated data modalities such as visual or audio data. It is also of interest whether HMMs can be used for transfer learning, which is a basic advantage of deep neural networks. In principle it seems plausible that HMMs can be used for that purpose; however, to the best of our knowledge, we have not found any work addressing the feasibility and/or effectiveness of HMMs for transfer learning.

On a different track Gomaa et al. [84] apply two well-known statistical and measure-theoretic approaches in the context of human activity recognition. The first approach is time-neglectful in the sense that the temporal order among the sampled points of the timeseries is dropped. It is based on using statistical methods, particularly, goodness-of-fit tests, where the timeseries is treated as a collection of unordered points that are generated from some unknown probability distribution. A goodness-of-fit test is a statistical tool used to test the hypothesis that a specific sample (in our case the collection of timeseries points) is generated from a given distribution.

In [84] the authors apply one of the best known goodness-of-fit tests, namely, the Kolmogorov–Smirnov (KS) test. This test was developed in the 1930s by the two Russian mathematicians Kolmogorov and Smirnov [85, 86]. The second approach is more time-aware, considering the ordered nature of the timeseries as a sequential stream of measurements. Their approach is based on direct analysis of the timeseries sample points using two distance metrics that can measure the dissimilarity between two timeseries even if they have different lengths, which is typical in HAR applications. The first dissimilarity metric is based on the estimated autocorrelation coefficients of the given signals. The second is the well-known dynamic time warping (DTW). The latter is much more accurate, though it is more computationally demanding. Their approach is essentially a 1-NN (1-Nearest Neighbor), where the distance function is taken to be one of these two dissimilarity measures in an infinite dimensional timeseries space. Their experimentation is limited to one dataset of signals from a wrist-worn accelerometer collected for ADL activity recognition, called “Dataset for ADL Recognition with Wrist-Worn Accelerometer” [7, 87], which has 14 activities of daily living. The DTW-based method is more stable, reaching an accuracy of about \(84\%\), which is more than \(25\%\) better than that of the autocorrelation-based dissimilarity measure. The results based on the KS test are much less promising.
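As a rough illustration of the time-aware approach, the following self-contained sketch (not the code of [84]) computes the classic DTW cost between two 1-D series of possibly different lengths and uses it for 1-Nearest-Neighbor classification against stored prototypes:

import numpy as np

def dtw_distance(a, b):
    # dynamic programming table of cumulative alignment costs
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify_1nn(query, templates):
    # templates: list of (series, label) prototypes; return the label of the
    # nearest prototype under the DTW dissimilarity
    return min(templates, key=lambda t: dtw_distance(query, t[0]))[1]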

On a similar track, where the sequential nature of the timeseries signal is ignored, Gomaa et al. [88] treat a timeseries as an unordered set of measurements sampled from an unknown probability distribution. They choose some random timeseries samples from each activity and fit them using probability distributions. These templates are then used as prototypes against which a new inertial motion sample is tested to determine the corresponding action. The authors used two probabilistic modeling techniques: histograms and kernel density estimates. For histograms, the Manhattan distance is used as a measure of dissimilarity, whereas the Manhattan distance and KL-divergence are used as measures of dissimilarity for kernel density estimates. Again, the validation of the approach is limited, experimenting on only one dataset, the “Dataset for ADL Recognition with Wrist-Worn Accelerometer” [7, 87]. This dataset has 14 ADLs. For some configurations they managed to reach above \(80\%\) accuracy.

3.5 Embedding

All of the previous discussion has focused on either handcrafted feature engineering or using the timeseries data directly. A third alternative has gained much attention recently with the emergence of deep learning, in particular deep neural networks, which is representation learning, or more concretely automatic feature extraction. Generally, automatic feature extraction through deep neural computing has achieved tremendous success in recent years and has allowed for technological applications, and concerns as well, that had been unimaginable before. Although the burden of engineering effective features is now alleviated, this comes at a price: we do not understand the generated embedding/representation, and generally the inferred decisions. In some cases, especially with the visual modality, some work has been done to visualize and interpret the resulting hierarchical representations. Generally, the ability to generate informative embeddings is crucial for effective transfer learning. In addition, the data manifold in the embedding space is generally easier to manipulate than the original input space.

Deep neural networks entered the mainstream with the seminal work of Hinton and Salakhutdinov [89]. Since then, the general trend in machine learning research has been directed towards the adoption and analysis of deep methods as applied in different domains. Abdu-Aguye et al. [90] train a convolutional neural network (CNN) essentially for representation learning, to generate embeddings for inertial timeseries data that are then used for transfer learning across different datasets. They use the convolutional filters of the trained CNN as a feature extractor and subsequently train a feedforward neural network as a classifier (or, for that matter, any classifier can be used instead) over the extracted features/embeddings for other datasets. The overall architecture of their system is shown in Fig. 17.

Fig. 17 CNN for feature extraction + classification [90]

The authors use a spatial pyramid pooling layer in order to generate a fixed-size embedding regardless of the length of the input timeseries, hence, alleviating the need to fix the input length of the timeseries. They validate the effectiveness of their embedding model on five different activity recognition datasets: EJUST-ADL-1 (aka Gomaa-1) [24], HAPT [42, 91], EJUST-ADL-2 (aka HAD-AW) [14], Daily and Sports Activities [57], and REALDISP [13].
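The fixed-size property of the pyramid pooling can be sketched as follows (PyTorch; the 4-2-1 level configuration follows Tables 3 and 4, while the channel count and window lengths are arbitrary assumptions):

import torch
import torch.nn.functional as F

def pyramid_pool_1d(feature_maps, levels=(4, 2, 1)):
    # max-pool (batch x channels x length) feature maps into 4, 2, and 1 bins and
    # concatenate, so the output size is independent of the input length
    pooled = [F.adaptive_max_pool1d(feature_maps, k).flatten(1) for k in levels]
    return torch.cat(pooled, dim=1)      # batch x (channels * (4 + 2 + 1))

x_short = torch.randn(8, 64, 120)        # convolutional maps of a short window
x_long = torch.randn(8, 64, 500)         # convolutional maps of a long window
assert pyramid_pool_1d(x_short).shape == pyramid_pool_1d(x_long).shape  # (8, 448)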

Tables 3 and 4 show example results of the transfer learning: the first table shows the performance metric in terms of the overall accuracy, and the second shows the computational resources in physical time (seconds). The diagonal entries are the baseline (self-testing), whereas the off-diagonal entries are cross-testing.

Table 3 Classification accuracy using 4-2-1 pooling [90]
Table 4 Training times (in seconds) for self-testing and cross-testing scenarios (4-2-1 pooling) [90]

The machine used to obtain these timings utilized an 8-core Intel Xeon Platinum 8168 CPU with 16 GB of RAM. During the training cycles, CPU-only training was used to give a fair estimate of performance and eliminate any extraneous advantages due to the use of graphics processing units (GPUs). The authors thereafter report the mean and standard deviation of the obtained times measured in seconds.

In the cross-testing scenario, a model is pre-trained on each source dataset mentioned in the first column, and then fine-tuned and tested on the target dataset in each corresponding row of the table. It is apparent that transfer learning, and hence the generated embeddings, are effective: the maximum reduction in accuracy is about \(6\%\), and in most cases it is about \(3\%\). On the other hand, the bottom table shows tremendous savings in computational resources as a result of transfer learning, where the only computation required is for fine-tuning. In the worst case (i.e., on the Daily Sports dataset) about \(24 \times \) speedup is achieved compared to the self-testing scenario, and about \(52 \times \) speedup in the best case (i.e., on the REALDISP dataset). It is also apparent that the best transfer learning results were achieved when the Daily Sports and EJUST-ADL-2 datasets were used as source datasets. This means that the pre-trained model is most powerful when transferring the gained knowledge from these datasets to another dataset. This can be attributed to the wide spectrum of activities contained in these two datasets (19 actions in the former and 31 in the latter); see Table 15 for a full description of these datasets.

Another work based on proper embedding representation learning of timeseries inertial signals is done by Abdu-Aguye et al. [92]. In particular, they employed such embedding representation in a deep metric learning (aka similarity learning) framework for activity recognition. A deep Triplet Network is used to generate fixed-length embeddings/descriptors from activity samples. Such embeddings are located in a classical Euclidean space and are to be used for activity classification. The overall system architecture is similar to that shown in Fig. 17, except for the use of triplet network with the corresponding loss function and the replacement of the fully connected layers towards the end of the information flow by a 1-Nearest Neighbor classifier with the Euclidean distance as the dissimilarity metric. The triplet loss can be defined as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\theta ) = \sum \max \left( \Vert E_{\rm{a}} - E_{\rm{p}} \Vert _{2}^{2} - \Vert E_{\rm{a}} - E_{\rm{n}} \Vert _{2}^{2} + \alpha , 0\right) \end{aligned}\end{aligned}$$
(5)

where \(E_{\rm{a}}\) is the embedding of the anchor sample, \(E_{\rm{p}}\) is the embedding of the positive sample, and finally, \(E_{\rm{n}}\) is the embedding of the negative sample. The summation of the loss function is taken over all the triplets in the training set. \(\alpha \) represents a margin parameter that describes the minimum desired spacing between entities of different classes.
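A direct transcription of Eq. (5) for a batch of embedding triplets could look like the following PyTorch sketch (illustrative; nn.TripletMarginLoss offers a similar built-in loss, though with non-squared distances by default):

import torch
import torch.nn.functional as F

def triplet_loss(e_anchor, e_positive, e_negative, margin=1.0):
    # rows of each tensor are embeddings; Eq. (5) summed over the batch of triplets
    d_pos = (e_anchor - e_positive).pow(2).sum(dim=1)   # ||E_a - E_p||_2^2
    d_neg = (e_anchor - e_negative).pow(2).sum(dim=1)   # ||E_a - E_n||_2^2
    return F.relu(d_pos - d_neg + margin).sum()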

The authors evaluate their work using the same 5 datasets used in [90]. Transfer learning (cross-testing) is also performed across different datasets: a model is built by pretraining on some source dataset, then fine-tuned and tested on a target dataset. Table 5 shows a sample of the results, both self-testing and cross-testing, of the proposed deep metric architecture on the five mentioned datasets. Comparing these results with those of the previously mentioned work, Table 3, we observe the following: (1) the results of self-testing are comparable, with both methods in close proximity to each other; however, (2) the method proposed in [90] is much more capable of transferring to new datasets, that is, the embeddings induced by that scheme are a more abstract representation of the inertial timeseries signals that can transcend the source dataset. As the two lines of work do their evaluation over the same list of datasets, it is worth investigating which factors affect the transferability of the underlying model.

Table 5 Classification accuracy of embeddings using deep metric learning [92]

The work done in [90] learns the embeddings within an end-to-end training pipeline whose overall final purpose is activity recognition, whereas the work in [92] learns the embeddings in a more abstract setting, where the task of activity recognition appears only implicitly in the choice of the three examples for the triplet loss: the anchor, the positive example, and the negative example. It may also be that the technique used for action classification plays a role in affecting the transfer learning; in the former work a fully connected neural network is used, whereas in the latter a 1-Nearest Neighbor classifier is used. Perhaps the choice of a single neighbor leads to overfitting and more smoothing is needed as regularization. All of these need to be investigated, as transfer learning in general, and its use for inertial timeseries data in particular, is crucial to the success of such systems and to the practical, widespread, effective use of wearable devices.

Khaertdinov et al. [93] use a variant of deep metric learning equipped with attention for HAR. The main focus of the paper is to design a system that is robust against two problems in wearables-based HAR systems: (1) inter-class similarities (different activities have similar inertial motion patterns) and (2) subject heterogeneity, where different subjects perform the same activity quite differently. The authors show the detrimental effect of these two factors on the recognition performance by using t-SNE visualization over a 2-dimensional Euclidean space. The crux of the argument of the plotted graphs is that different actions have largely overlapping clusters and the same action may spread over different clusters in that 2-dimensional space. This motivates the whole work; however, it can be criticized that visualization is rather weak evidence, as it projects from the rich high-dimensional space of the timeseries into a very low 2-dimensional space with far fewer degrees of freedom.

More evidence is needed, for example, plotting 3-dimensional visualizations, plotting multiple 2-dimensional visualizations over varying features, or other means. The main technical contributions of the paper behind that motivation of good class separability are: (1) developing a hierarchical triplet loss function and (2) developing a smart triplet mining algorithm. The hierarchical triplet loss is based on a data structure called the “hierarchical tree” that stores information about the inter-class and intra-class distances of all activity classes in the dataset. Given two activity classes \(a_{1}\) and \(a_{2}\), their inter-class distance can be computed as follows:

$$\begin{aligned} \begin{aligned} d_{\rm{inter}}(a_{1},a_{2}) = \frac{1}{\Vert a_{1}\Vert \cdot \Vert a_{2}\Vert } \sum _{i \in a_{1},j \in a_{2}} \Vert E_{i} - E_{j} \Vert _{2}^{2} \end{aligned} \end{aligned}$$
(6)

where \(E_{k}\) is the embedding of sample k. So, it is the average of all distances between pairwise samples from the two classes. On the other hand, the intra-class distance for activity b can be computed as follows:

$$\begin{aligned} \begin{aligned} d_{\rm{intra}}(b) = \frac{1}{\Vert b\Vert \left( \Vert b\Vert - 1\right) } \sum _{i,j \in b} \Vert E_{i} - E_{j} \Vert _{2} \end{aligned} \end{aligned}$$
(7)
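Both quantities can be computed directly from the sample embeddings, as in the following NumPy sketch (illustrative only; E1, E2, and E are hypothetical arrays whose rows are the embeddings of samples from the respective classes):

import numpy as np

def inter_class_distance(E1, E2):
    # Eq. (6): mean squared Euclidean distance over all cross-class embedding pairs
    diffs = E1[:, None, :] - E2[None, :, :]
    return float(np.mean(np.sum(diffs ** 2, axis=-1)))

def intra_class_distance(E):
    # Eq. (7): average (non-squared) pairwise distance within one class,
    # excluding the zero-distance self-pairs from the normalization
    diffs = E[:, None, :] - E[None, :, :]
    d = np.sqrt(np.sum(diffs ** 2, axis=-1))
    n = len(E)
    return float(d.sum() / (n * (n - 1)))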

Each leaf node of the hierarchical tree corresponds to a certain class label. Classes are merged at different levels of the tree based on some merging threshold. The hierarchical loss function can now be computed as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\theta ) = \frac{1}{2N} \sum _{i = 1}^{N} \max \left\{ \Vert E_{i}^{\rm{a}} - E_{i}^{\rm{p}} \Vert _{2}^{2} - \Vert E_{i}^{\rm{a}} - E_{i}^{\rm{n}} \Vert _{2}^{2} + \alpha ',0\right\} \end{aligned} \end{aligned}$$
(8)

where N is the number of triplets and \(E_{i}^{\rm{a}}, E_{i}^{\rm{p}},E_{i}^{\rm{n}}\) are the embeddings of the anchor, positive, and negative samples in triplet i. \(\alpha '\) is a dynamic margin that depends on the intra-class distance of the anchor activity as well as the level \(\ell \) in the hierarchical tree at which the anchor and negative classes are merged. Adapting the margin to the particular activities involved in the triplet makes the training more robust and resilient to the two aforementioned effects of inter-class similarity and subject heterogeneity. The second main technical contribution of this work is the technique for mining triplets for a mini-batch, which the authors call “anchor-neighbor sampling”.

It is essentially a smarter, guided selection of the triplets that makes proper use of the constructed hierarchical tree to improve the training process and better separate the classes. The authors evaluate their work over three public benchmark datasets: PAMAP2 [94], USC-HAD [12], and MHEALTH [95]. Their embedding model is based on an LSTM with two types of attention: temporal attention over the sequence of timeseries points and spatial attention scanning multiple channels (different inertial types and axes) at the same time. They report results over all three datasets and multiple configurations of their model (for example, using one or both types of attention) and compare with other state-of-the-art methods.

Strömbäck et al. [31] create a dataset, called “MM-Fit”, for workout exercises. This dataset is multi-modal, collected from various devices streaming the following types of data: (1) inertial motion data collected from multiple devices including smartphones, smartwatches, and earbuds, (2) multi-view RGB-Depth video streaming, and (3) 2D and 3D pose estimates. The task is activity recognition, particularly in this case recognition of the workout exercise. The authors compare the power of different deep recognition models based on unimodal vs. multimodal data types.

Each modality is trained separately using a stacked convolutional autoencoder to produce an embedding representation of the crucial features from the given modality. For the inertial motion timeseries, the autoencoder architecture consists of 1D convolution filters. As for the visual streaming, they use the 3D pose estimates from [96], and the relative joint positions are encoded using cylindrical coordinates, as such a coordinate system is more robust against background and illumination variation. The 3D poses are arranged into 3D images that are fed to an autoencoder based on 2D convolution filters.

Each unimodal type of sensory data is trained independently to generate an embedding that is output from the bottleneck of the autoencoder (the last layer of the encoder part). These induced embedding representations of the different modalities are then fused and fed to a multi-modal fully connected autoencoder. Lastly, the fused embeddings of this latter network are used to train a fully connected classification network to recognize 10 different workout exercises. Their multi-modal framework achieves \(96\%\) accuracy on unseen subjects in their dataset MM-Fit. This is in contrast with \(94\%\) using data from the smartwatch only, \(85\%\) from the smartphone only, and \(82\%\) on data from the earbud device. A general criticism of this work is that the proposed model architecture may be overly complicated.

Mahmud et al. [97] proposed a self-attention model that utilizes different types of attention mechanisms to generate a higher-dimensional feature representation. The authors evaluate on four publicly available HAR datasets: PAMAP2 [94], Opportunity [98], Skoda [99], and USC-HAD [12]. The sensor attention maps can capture the importance of sensor modalities and locations in predicting the activity classes. The authors adopted the self-attention architecture from the Neural Machine Translation (NMT) task [100] to the HAR problem and proposed a model incorporating self-attention with sensor and temporal attention.

The authors proposed that the data samples are equivalent to words and the time windows are analogous to sentences. They introduced the first attention layer on the raw input. Secondly, the authors adopted the self-attention and positional encoding from the transformer architecture into the HAR problem in order to capture the spatio-temporal dependencies of sensor signals and their modalities. After a number of self-attention blocks, the authors added another layer of attention to learn the global attention from the context. Finally, a fully connected layer was placed to classify the activity. The proposed model achieved remarkable performance enhancement over the existing state-of-the-art models on both the benchmark test users and the leave-one-user-out evaluation. The authors also observed that the sensor attention maps were able to capture the impact of sensor modality and placement in predicting the various activity classes.

Tao et al. [101] proposed an attention-based model for HAR using multiple IMUs worn at various body locations. A feature extraction module extracts the most discriminating features from every sensor with CNNs. An attention-based fusion method learns the importance of sensors at various body locations and generates an attention-based feature representation. A further feature extraction module learns the inter-sensor correlations, which are fed to a classifier predicting activity classes. The proposed method was evaluated using 5 public datasets (Daily [102], Skoda [103], PAMAP2 [94], Sensors [67], and DaphNet [104]) and outperformed the state-of-the-art approaches on various activity categories.

Li and Wang [105] proposed a deep learning method based on residual blocks and a bi-directional LSTM (BiLSTM). The model extracts spatial features of the multidimensional IMU signals using the residual block, then obtains the forward and backward dependencies of the feature sequence using the BiLSTM. The resulting features are input to a Softmax layer to perform the HAR task. The optimal parameters of the model were determined experimentally. A new dataset was proposed containing six activities: sitting, standing, walking, running, going upstairs, and going downstairs. The proposed model was evaluated on the newly proposed dataset and two other public datasets, namely WISDM [106] and PAMAP2 [94]. The accuracies achieved are \(96.95\%\), \(97.32\%\) and \(97.15\%\) on the newly proposed dataset, WISDM, and PAMAP2, respectively.

In [107], the authors proposed an end-to-end graph neural network (GNN). This system captures the information in each sample in an efficient way and also captures the relations with the other samples in the form of an undirected graph. This is probably the first contribution in which time series are given a graph-based structural representation aiming at HAR using sensor data. The proposed approach was evaluated on six publicly available datasets and achieves approximately 100% recognition accuracy on all of them. The GNN classification is done in two steps. First, node classification is performed: for each node \(v \in V\) of G, labeled as \(y_{v}\), the label is predicted by calculating an embedding vector \(h_{v}\) such that \(y_{v}=f(h_{v})\), where f is a differentiable approximation function; f maps the similarity between nodes in the embedding space to the actual nodes existing in the real graph, and the embedding \(h_{v}\) stores the node information in a compressed form. Second, the graph structure is considered alongside the node information for the graph classification.

3.6 Input tensor structure

In the analysis of inertial motion data, a diverse set of tensor structures have been used as inputs to the learning architectures (whether classical or variants of deep neural networks). The input tensor structure essentially depends on several factors, notably, the number of IMU sensors to be used for analysis, the number of axes of the sensor(s) used, whether vectorization (concatenating into a single vector) is used, whether raw sensor data or some sort of extracted features are used, etc.

For example, one may use only the accelerometer data along one axis (or combine the magnitudes of several axes into one resultant signal); in this case a 1D tensor is used. One may also take the three axes of an accelerometer and concatenate them into a single vector (vectorization), again yielding a 1D input tensor. Alternatively, the three axes of the accelerometer may each form a separate channel, giving a 2D input tensor, or be stacked vertically within one channel, which again gives a 2D tensor. One may also use several sensors, e.g., accelerometer, gyroscope (angular velocity), and magnetometer; each has three axes that can be stacked vertically in a single channel, so with three channels, each being a 2D array of three axial readings, we obtain a 3D input tensor. In conclusion, the input tensor structure depends on the particular work and the authors’ choice; a few illustrative layouts are sketched below.
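The following NumPy lines are illustrative only; the sensor set (one accelerometer and one gyroscope), the axis count, and the 128-sample window length are assumptions for the example:

import numpy as np

# hypothetical 3-axis accelerometer and gyroscope windows, 128 samples each
acc = np.random.randn(3, 128)                    # rows: x, y, z
gyr = np.random.randn(3, 128)

# (a) 1D tensor: magnitude of the accelerometer axes only
mag = np.sqrt((acc ** 2).sum(axis=0))            # shape (128,)

# (b) 1D tensor by vectorization: concatenate all six axes into one vector
vec = np.concatenate([acc.ravel(), gyr.ravel()]) # shape (768,)

# (c) 2D tensor: six axes stacked as separate rows/channels
two_d = np.vstack([acc, gyr])                    # shape (6, 128)

# (d) 3D tensor: one channel per sensor, each a 2D array of three axial readings
three_d = np.stack([acc, gyr])                   # shape (2, 3, 128)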

An example of such variations can be found in the work of Mubarak et al. [90], where the authors used six 1D channels: accelerometer-x, accelerometer-y, accelerometer-z, gyroscope-x, gyroscope-y, and gyroscope-z. Each of these is a 1D timeseries, so the whole input can be considered a 2D tensor. A similar strategy was applied in the work of Abeer et al. [32]. In [83] the authors use hidden Markov models (HMMs) to model inertial motion data. For each axis of each sensor a dedicated HMM is used to model the dynamics of motion along this axis. So the input to each HMM is a univariate motion timeseries, in other words a 1D tensor. On the other hand, Madcor et al. [18] used two sensors, the accelerometer and gyroscope, each sampled along three axes. The timeseries of the six axes are, after some preprocessing, concatenated into one big vector (vectorization) to be input to the learning architecture. So here the input is a 1D tensor.

3.7 Transformers

In this subsection, we survey the use of the transformer model in human activity recognition from IMU timeseries data. The transformer is a deep neural network model that was developed primarily for NLP tasks and later adopted for computer vision.

In [108], the authors proposed a one-patch lightweight transformer that combines the advantages of RNNs and CNNs. In addition, the authors proposed TransFed, a federated learning classifier built on the proposed lightweight transformer, which better addresses privacy concerns. The authors designed a test framework to acquire a new HAR dataset gathered from 5 subjects, and then used this dataset to assess the performance of the HAR classifier in both federated and centralized settings. In addition, the authors used a public dataset to assess the performance of the HAR classifier in a centralized setting for comparison with other HAR classifiers. The results showed that the proposed approach outperformed state-of-the-art HAR classifiers based on CNNs and RNNs while at the same time being more computationally efficient. Table 6 presents the hyper-parameters adopted during the training phase.

Table 6 Hyper-parameters of each subject for federated learning [108]

Table 7 presents a comparison of two main aspects between the transformer proposed in [108] and the RNN and CNN methods. RNN models allow no parallelization during training due to their sequential nature, so they are slow and computationally expensive. CNN models can compute in parallel but are computationally intensive due to the convolution operation. The proposed transformer approach removes both recurrence and convolution, which are replaced with a self-attention mechanism that models input/output dependencies; the method relies completely on attention to compute input/output representations. Moreover, transformers allow for parallelization of the computations. Also, RNNs and CNNs typically use a huge number of parameters (hundreds of thousands or more), while the proposed transformer in [108] has only 14,697 parameters. In addition, the proposed transformer uses a single patch rather than multiple patches and is thus computationally efficient.

Table 7 Comparison of the transformer method with the RNNs and CNNs methods based on the computational costs [108]
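As a minimal, generic illustration of this recurrence-free and convolution-free computation (PyTorch; the embedding size, window length, and head count are arbitrary assumptions and not the configuration of [108]):

import torch
import torch.nn as nn

embed_dim, window_len, batch = 32, 128, 8
x = torch.randn(batch, window_len, embed_dim)      # projected IMU frames

# self-attention: queries, keys, and values all come from the same sequence,
# and every output position attends to all input positions in parallel
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
out, weights = attn(x, x, x)
print(out.shape, weights.shape)                    # (8, 128, 32) (8, 128, 128)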

Table 8 presents a comparison of the transformer proposed in [108] with state-of-the-art approaches. The proposed approach achieved a large enhancement in accuracy when (1) trained and tested on the proposed new dataset (labelled with b) and (2) on the WISDM dataset (labelled with a). It reaches an accuracy of 98.74% (federated setting) and 99.14% (centralized setting) on the newly gathered dataset, and an overall accuracy of 98.89% on the WISDM dataset (centralized setting).

Table 8 Comparison of the transformer method with state-of-the-art approaches for HAR classification [108]

In [112], the authors proposed the Human Activity Recognition Transformer (HART), a sensor-wise transformer framework adapted to the IMUs inside mobile devices. Results on HAR tasks over many public datasets revealed that HART uses fewer floating point operations (FLOPS) and fewer parameters while achieving better results than the state-of-the-art methods. In addition, the authors provided performance assessments of various architectures in heterogeneous environments, showing that the proposed models generalize better across different sensor devices and on-body locations.

Table 9 shows the F-scores of various models on four datasets: RealWorld, HHAR, MotionSense, and SHL. Moreover, the authors combine five datasets to form a larger and more diverse one for assessment. The combination covers thirteen different activities with high class imbalance, collected from various devices and locations.

For the proposed configuration with MobileViT and MobileHART, two network settings were developed: extra small (XS) and extra extra small (XXS). The XS and XXS settings differ in the number of filters, the expansion factors governing filter growth after every MV2 layer, and the embedding size of the MobileViT/MobileHART blocks. The number of layers and attention heads are the same in the two settings.

Table 9 The F-scores on five datasets [112]

In [116], the authors proposed a self-attention Transformer model for ADL classification and compared it with a recurrent Long Short-Term Memory (LSTM) model. The proposed approach is a two-level hierarchical model: atomic activities are detected in the first step, and in the second step their probability scores are calculated and used by the Transformer-based classifier to recognize 7 more complex ADLs. The experiments showed that the Transformer model was on par with, and sometimes outperformed, the LSTM in the subject-dependent setting (73.36% vs. 69.09%), relying only on the attention mechanism to represent global dependencies between input and output without using any recurrence. The proposed model was evaluated on 2 different segment lengths, demonstrating its effectiveness in learning long-range dependencies of short actions occurring within complex activities.

In [117], the authors presented a Two-stream Convolution Augmented Human Activity Transformer (THAT) model. The proposed approach utilizes a two-stream architecture to extract both time-over-channel and channel-over-time features, and then uses a multi-scale convolution-augmented transformer to extract range-based patterns. Results on four evaluation datasets showed that the proposed approach is more effective and efficient than the state-of-the-art models. Table 10 presents the recognition accuracy of the THAT model in comparison with five state-of-the-art models on four evaluation datasets.

Table 10 The THAT model recognition accuracy in comparison with five state-of-the-art models on four evaluation datasets [117]

In [121], the transformer model was utilized for human activity recognition from timeseries signals. The self-attention approach in the transformer model represents the individual dependencies between timeseries values. The attention mechanism has performance comparable to state-of-the-art CNN-LSTM models (e.g., [122]). The transformer model was evaluated on the KU-HAR dataset spanning a wide range of activities [123], achieving an average recognition accuracy of \(99.2\%\) compared to \(89.67\%\) for a Random Forest with FFT features [123].

4 Confined versus streamed actions

Most of the work done in HAR targets offline applications and hence focuses on recognizing confined actions, that is, the timeseries sample is assumed to represent exclusively one action/activity of interest. A line of research, though, has been targeting online real-time applications, and hence focuses on recognizing a continuous stream of actions in real time. Most of this work uses vision-based models [124], in which actions in a video stream are detected and recognized in both offline and online modes [125]. In other approaches, data are collected from distributed ambient sensors such as microphones, cameras, and motion sensors that are attached to fixed locations in the surrounding environment (e.g., walls, cupboard doors, microwave ovens, and water taps). Examples of these systems include the Aruba and Tulum datasets [126], created within the CASAS smart-home project [127]. Studies on these datasets are done by Yala et al. [128]. These sensors, however, have many restrictions owing, in particular, to their fixed nature, and user activities are not detectable, let alone recognizable, if the user goes outside the spatial area where these sensors are installed. Additionally, from the perspective of privacy, it is neither acceptable nor convenient to monitor people continuously using cameras in private rooms. And even for less private places, monitoring is susceptible to security breaches.

For detecting activities in a continuous stream of inertial motion signals from wearable devices, some segmentation approaches such as activity-based windowing, time-based windowing, and sensor-based windowing [128] have been proposed. However, each of them admits some kind of drawback. In activity-based windowing, the activities are in general not well distinguished, resulting in largely imprecise activity boundaries. In time-based windowing, streaming data are divided into fixed-time intervals/windows. However, this faces the problem of determining the proper window length: if it is too small, the window may contain insufficient information for decision making; if it is too large, information from multiple activities may be included in the allotted window. Generally, this parameter is problematic as it may differ according to the activity, the gender of the subject performing the activity, the age of the subject, the sampling rate of the IMU device, etc. So smarter techniques should be used to determine an adaptive window size, for example, detecting the change in the stochastic properties of the signals, as is generally done in Markov regime-switching methods. In sensor-based windowing, each window contains the same number of sensor events [128]. A major drawback of this method is that the window may contain sensor events that are widely separated in time, and it may contain sensor events of more than one subject, as in the Tulum dataset.
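A basic time-based windowing of a continuous stream can be sketched as follows (the 50 Hz rate, 2.56 s window, and 50% overlap are assumptions chosen only for illustration):

import numpy as np

def time_windows(stream, fs=50, win_s=2.56, overlap=0.5):
    # cut a (time x channels) stream into fixed-length, partially overlapping windows
    win = int(win_s * fs)
    hop = int(win * (1 - overlap))
    return [stream[i:i + win] for i in range(0, len(stream) - win + 1, hop)]

stream = np.random.randn(50 * 60, 6)       # one minute of 6-axis data at 50 Hz
segments = time_windows(stream)            # 128-sample windows with 50% overlap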

One recent work that addresses the continuous recognition of streamed actions is done by Ashry et al. [129]. They process inertial motion signals streamed from IMU sensors on board a smartwatch. They collect their own dataset, called “CHAR-SW”, which consists of different streams of activities of daily living, see Fig. 18a; see Table 15 for a full description of the dataset. They use a cascading bidirectional long short-term memory (Bi-LSTM) for their analysis. The Bi-LSTM has better and more stable performance than the uni-directional LSTM, as the former can perform smoothing: at any point in time, both past and future motion samples are available for the current prediction. From another perspective, this architecture may not be feasible for real-time systems where only past samples (and possibly a small window of future samples) are available. In this work the input to the Bi-LSTM is not the raw timeseries data, but a descriptor composed of a continuous stream of the following windowed features: autocorrelation, median, entropy, and instantaneous frequency. So their approach is a mixture of classical feature engineering, as manifested by the handcrafted descriptor of the input timeseries, and deep automatic feature extraction, as manifested by the Bi-LSTM, which automatically captures the sequential dependency of the streaming actions. The technique can thus be viewed as converting the original timeseries into a coarser, more abstract timeseries; a minimal sketch of such a Bi-LSTM over windowed features is given after Fig. 18. The system architecture is shown in Fig. 18b. The evaluation of the system showed that it can recognize activities in (almost) real time with an accuracy of up to \(91\%\). Figure 19 shows a framework for gait-based person identification [20].

Fig. 18

Online analysis of a streaming sequence of actions using IMU sensory data from a wearable device [129]

Fig. 19

Framework for gait-based person identification [20]

Khannouz and Glatard [130] evaluated data stream classifiers on the HAR use case. The authors measured the classification accuracy and resource consumption (in terms of runtime, memory, and power) of five stream classifiers on two real human activity datasets [13, 38] and three synthetic datasets. The results showed that the Hoeffding Tree, the Mondrian forest, and the Naive Bayes classifiers performed better overall than the Feedforward Neural Network and the Micro Cluster Nearest Neighbor classifiers on four datasets, including the real ones. The three best stream classifiers still performed much worse than an offline classifier on the real datasets. The Hoeffding Tree and the Mondrian forest were the most memory-consuming and had the longest runtimes, but no difference in power consumption between classifiers was reported. Stream learning for HAR on connected devices thus faces two main challenges: high memory consumption and low F1 scores.

Krishnan and Cook [131] proposed a sliding window technique for HAR in an online streaming setting, that is, classifying activities whenever new sensor events come in. Since different activities are best classified by different window lengths of sensor events, the authors incorporated time decay and mutual information-based weighting of sensor events within a sensor window. More contextual information, such as the previous activity and the previous window activity, was also added to the feature vector characterizing a sensor window. Experiments evaluating these methods on real-world datasets concluded that combining mutual information weighting of sensor events and past contextual information into the feature vector leads to the best-performing streaming activity recognition.

Abdallah et al. [132] proposed a dynamic human activity recognition framework for growing data streams, called STAR (STream learning for mobile Activity Recognition). The framework comprises incremental and active learning methods for real-time recognition and adaptation in streaming situations. As the data stream grows, the authors refine, improve, and personalize the model to handle drift in the stream. To detect concurrent activities, the authors applied online clustering to each data chunk; the prediction and adaptation methods were deployed on each cluster. The experimental results using three real activity datasets, OPPORTUNITY [98], WISDM [106], and Smart Phone Accelerometer Data (SPAD) [133], showed that the proposed approach enhances activity recognition performance, especially across different subjects.

Abdallah et al. [134] proposed and evaluated a method for concept evolution (e.g., a new action or activity) applied to emerging data streams. The method continuously screens the streaming data in order to detect any evolving updates, and it is able to detect the arrival of new concepts, whether normal or abnormal. The approach also applies continuous and active learning to adapt to the observed concepts in real-time. The authors evaluated the proposed approach on the HAR datasets OPPORTUNITY [135] and WISDM [106] as benchmarks for emerging data stream applications. The approach proved effective on these datasets in observing new concepts and adapting continuously with low computational cost.

Yala et al. [136] proposed an approach to HAR using online sensor data. The authors presented four techniques for extracting features from sensor event sequences. They proposed a sensor-window approach to perform HAR online by recognizing activities whenever a new sensor event occurs. As different activities are better identified by different window lengths, the authors proposed a mutual information-based weighting of sensor events within a window. Since some activities do not involve many movements, the authors also proposed a “last-state” sensor feature set within a window to identify such activities. These techniques were evaluated on the first 6 weeks of the Aruba dataset and on 3 months of the Tulum dataset [126]. The methods showed an improvement in classification accuracy over the baseline approach when events with missing labels were removed. In the baseline case, the authors construct a fixed-dimension feature vector that contains the times of the first and last sensor events, the window duration, and the counts of the different sensor events within the window.

Zhang and Ramachandran [137] adopted an online method called the Very Fast Decision Tree (VFDT) to simulate a realistic HAR scenario. The two main contributions are that (1) the authors trained the model online, using each data sample only once for training, and (2) the model can be updated to recognize new activities after building the VFDT by adding a small number of labeled samples. The proposed method achieves an average accuracy of \(85.9\%\) over all subjects, with single-subject accuracies ranging between \(60.5\%\) and \(99.3\%\). The average accuracy of learning a new activity from other data is \(84\%\), and the accuracy for a single subject reached up to \(100\%\).
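The sketch below shows a prequential (“test-then-train”) evaluation loop with a Hoeffding tree, the classifier family behind VFDT, assuming the `river` streaming-learning library is available. The feature extraction is a placeholder, and the setup is a generic illustration of streaming HAR evaluation rather than the specific pipeline of [130] or [137].

```python
from river import tree, metrics

def extract_features(window):
    # Placeholder: a dict of simple per-axis statistics for one IMU window.
    return {f"mean_{i}": float(axis.mean()) for i, axis in enumerate(window.T)}

model = tree.HoeffdingTreeClassifier()
accuracy = metrics.Accuracy()

def process_stream(labelled_windows):
    """labelled_windows yields (window, activity_label) pairs in arrival order."""
    for window, label in labelled_windows:
        x = extract_features(window)
        y_pred = model.predict_one(x)   # test on the incoming sample first...
        if y_pred is not None:
            accuracy.update(label, y_pred)
        model.learn_one(x, label)       # ...then train on it exactly once
    return accuracy.get()
```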

Kwon et al. [138] proposed IMUTube, which integrates computer vision and signal processing techniques to convert human activity videos into virtual IMU data streams. These virtual IMU data streams are equivalent to accelerometer signals collected from various human body locations. The virtually generated IMU data can enhance the performance of different models on public HAR datasets. This work represents an integrated system that utilizes computer vision and signal processing for sensor-based on-body HAR and opens the door to further enhancing recognition accuracy based on a large dataset.

5 Transfer learning

Transfer learning refers to the learning paradigm where what has been learned in one setting and/or task is exploited to initiate and/or improve generalization in another setting and/or task [139]. It is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned [140]. Transfer learning has been boosted by the recent revolution in deep learning, though it is not particular to the latter. Deep neural computing has revolutionized representation learning, which in turn has made transfer learning readily available.

5.1 Fine tuning

The key motivation, especially in the context of deep learning, is the fact that most models which solve complex problems need a lot of data, and getting vast amounts of labeled data for supervised models can be really difficult, considering the time and effort it takes to label data points. Hence, foundational models (i.e., models that have already been trained on massive amounts of data by capable entities such as big corporations) can be downloaded and fine-tuned for specific tasks with only a limited amount of data and a noticeable degree of success.
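A minimal fine-tuning sketch in PyTorch is shown below: the transferred backbone is frozen and only a new task-specific head is trained on the (small) target dataset. The backbone, layer sizes, and data shapes here are illustrative stand-ins, not a specific model from the surveyed works.

```python
import torch
import torch.nn as nn

def build_finetuned_model(pretrained_backbone: nn.Module,
                          feature_dim: int, n_target_classes: int) -> nn.Module:
    for p in pretrained_backbone.parameters():
        p.requires_grad = False                   # freeze the transferred weights
    head = nn.Linear(feature_dim, n_target_classes)  # new task-specific head
    return nn.Sequential(pretrained_backbone, head)

# Illustrative backbone; in practice this would be loaded from a source-task model.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(6 * 100, 64), nn.ReLU())
model = build_finetuned_model(backbone, feature_dim=64, n_target_classes=5)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 6, 100)        # a batch of 6-axis IMU windows (illustrative)
y = torch.randint(0, 5, (32,))
loss = criterion(model(x), y)      # only the head receives gradient updates
loss.backward()
optimizer.step()
```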

A rather recent survey on the progress of transfer learning for classification, regression, and clustering problems can be found in [141]. Lisa Torrey and Jude Shavlik, in their book [140], described three possible advantages to consider when employing transfer learning on a target task using a model pre-trained on a source task. These benefits are illustrated in Fig. 20, where two performance curves are shown: the green curve is the training performance of a trained-from-scratch (end-to-end) model on the target task, whereas the red curve shows the performance of fine-tuning the pre-trained source model to work on the target task.

1. Higher start: The target skill starts higher than it otherwise would be. In some cases, this alone can be considered satisfactory performance, making further fine-tuning unnecessary.

2. Higher slope: The rate of improvement of the target task during training (fine-tuning) of the source model is steeper than it otherwise would be, i.e., the convergence rate is higher.

3. Higher asymptote: The converged performance of the pretrained source model on the target task is better than it otherwise would be, i.e., it converges to a higher performance point.

Fig. 20

Different ways of learning improvement through transfer learning

On a more foundational level, the work done in [142] empirically investigates and quantifies the generality vs. specificity of neurons in each layer of a deep CNN. The results indicate that transferability is negatively affected by two factors: (1) the specialization of higher-layer neurons to their original source task at the expense of performance on the target task and (2) optimization difficulties related to splitting networks between co-adapted neurons (different hidden neurons having correlated behavior). Using an example network trained on ImageNet, the authors demonstrated that either of these two issues may dominate, depending on whether the features are transferred from the bottom, middle, or top of the network. It also follows, rather naturally, that transferability of features decreases as the distance between the base/source task and the target task increases, but in any case transferability is better than random initialization (training-from-scratch). A final result is that initializing a network with transferred features from almost any level of layers can boost generalization, though the degree of effective transferability varies with the layer. Most of the research on transfer learning has been focused exclusively on visual data; some recent work is directed towards timeseries, especially inertial motion data, for the purposes of activity recognition and gait analysis.

In the following, we survey some of the transfer learning work on timeseries data. In the preceding discussion we have already mentioned several research works that in principle represent effective applications of transfer learning in the context of timeseries analysis. For example, Abdu-Aguye et al. [90] developed an approach to transfer learning for fixed- and variable-length activity recognition inertial motion timeseries data. They perform end-to-end training of a CNN, see Fig. 17, and then use the convolutional filters as an unsupervised feature extractor. This portion of the network is then used to generate features for any dataset and activities; such features can subsequently be fed to any classifier (in the paper, the classifier is a classical feedforward neural network).
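The sketch below illustrates this feature-extraction flavour of transfer: a frozen convolutional stack (a stand-in with illustrative layer sizes; in practice its weights would be loaded from the source-task CNN) produces features, and a separate classifier is trained on those features for the target dataset.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.neural_network import MLPClassifier

conv_extractor = nn.Sequential(
    nn.Conv1d(in_channels=6, out_channels=32, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=9, padding=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
)  # illustrative; source-trained weights would be loaded here
conv_extractor.eval()

def extract(windows: np.ndarray) -> np.ndarray:
    """windows: (n_samples, 6 axes, window_length) -> (n_samples, 64) features."""
    with torch.no_grad():
        return conv_extractor(torch.from_numpy(windows).float()).numpy()

# Illustrative target-dataset arrays.
X_target = np.random.randn(200, 6, 100)
y_target = np.random.randint(0, 5, size=200)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
clf.fit(extract(X_target[:150]), y_target[:150])       # only the classifier is trained
print(clf.score(extract(X_target[150:]), y_target[150:]))
```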

The authors used a mini-batch of size 128 for all the pooling parameter evaluations. The number of training epochs in both self-testing and transfer learning was 35. The training and testing phases were repeated 15 times, with different data samples used each time for training and testing. In both self-testing and transfer learning, training used 75% of the data and testing used the remaining 25%.

The authors validated their approach on five different heterogeneous activity datasets, and a snapshot of their results is shown in Table 3. The classification accuracies obtained are shown together with their standard errors (i.e., \(accuracy \pm std.err\)). The baselines are the diagonal elements indicating the self-testing scenarios (train and test on the same dataset), whereas the off-diagonal elements show the transfer learning results (cross-testing): train on a dataset, then fine-tune and test on another. Relative to the baseline diagonal accuracies, transfer learning shows very promising results, especially considering that the given datasets are very diverse regarding the devices used to collect the streaming data, the sampling rates, the subjects performing the activities, the locations of the sensors on the subject’s body, etc. The results also indicate the stability and robustness of the transfer learning scenarios, with standard errors that are generally low, remaining below \(3\%\) in all the scenarios considered, implying that transfer learning across the timeseries activity datasets is effective and comparable to training from scratch.

It can be observed that the HAD-AW and REALDISP datasets consistently yield the best transfer learning (cross-testing) results. This can be attributed to the fact that they consist of a large and diverse range of activities, causing the convolutional filters to be relatively more discriminative than when trained on other, smaller datasets. So, a wider range of activities may play a more crucial role in the effectiveness of transfer learning than having many and longer samples of a smaller set of activities. This conclusion is also supported by observing that the HAPT dataset, having the smallest number of activities, shows the poorest transfer learning results in all scenarios. This makes a stronger case for the use of class-diverse datasets as source datasets for transfer learning.

We now turn to the second important benefit of transfer learning, namely, computational feasibility: transfer learning should save a huge amount of computational resources, as the target task requires just fine-tuning rather than training-from-scratch. The average times required for self-testing and transfer learning scenarios are shown in Table 4. The self-training scenarios take much longer, due to the multitude of operations inherent in training the feature extraction part, namely the convolutional layers.

In transfer learning, the training is done on some dataset, and the classifier is then fine-tuned on another dataset. In this paper specifically, the authors train a CNN, use its convolutional filters as a feature extractor, and then train a feedforward neural network classifier on the features extracted for other datasets. The transfer learning (cross-testing) scenarios take significantly less time (only fine-tuning is needed), yielding a 24.17\(\times \) speedup compared to the diagonal baseline (end-to-end training) in the worst case (i.e., on the Daily Sports dataset) and a 52.31\(\times \) speedup in the best case (i.e., on the REALDISP dataset). We can also observe that the timings for the transfer-learning scenarios are fairly steady regardless of the source dataset, implying that the transfer learning times depend purely on the size of the target dataset.

The same authors have done further work in the same direction [92], where instead of using a CNN as a feature extractor, they apply deep metric learning. Metric learning is based on representation learning of the input samples inside some embedding Euclidean space, such that similar inputs (e.g., inertial timeseries from the same activity) are mapped to proximal representations in the embedding space, and different inputs (e.g., inertial timeseries corresponding to different activities) are mapped to faraway representation vectors. The distance here is essentially the Euclidean distance. The “deep” qualifier in deep metric learning refers to the use of deep learning technology to generate such representations. The authors used a deep Triplet network to generate fixed-length descriptors of inertial timeseries samples for the purpose of activity classification.

The triplet network consists of 2 convolutional and 3 feedforward layers, with a 1-D Spatial Pyramid Pooling layer inserted between the convolutional and feedforward layers. Batch Normalization is used between successive feedforward layers. The triplet loss function has a margin parameter \(\alpha \) equal to 1, and the triplet selection strategy is Random Negative Triplet Selection. The optimizer is Stochastic Gradient Descent with a Nesterov momentum of 0.90. A 128-dimensional embedding vector is used as the feature vector for classification. For training, 75% of each selected dataset was utilized. After training, the embeddings of the training samples were used as exemplars for a 1-Nearest Neighbor classifier, which was then used to obtain the classification accuracy over the embeddings of the unseen 25% of the dataset. Experimentation is carried out on the same five datasets as in Tables 3 and 4. Activity classification accuracies reached up to about \(96\%\) in the self-testing scenarios and up to about \(91\%\) in cross-testing without retraining.
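A compact sketch of the deep-metric-learning idea is shown below: an embedding network trained with a triplet loss (margin 1 and 128-dimensional embeddings, as mentioned above) followed by a 1-nearest-neighbour decision on the embeddings. The embedding network itself is an illustrative stand-in, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

embedder = nn.Sequential(                     # illustrative embedding network
    nn.Conv1d(6, 32, kernel_size=9, padding=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(8), nn.Flatten(),
    nn.Linear(32 * 8, 128),
)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.SGD(embedder.parameters(), lr=1e-2, momentum=0.9, nesterov=True)

# One training step on a batch of (anchor, positive, negative) windows.
anchor, positive, negative = (torch.randn(16, 6, 100) for _ in range(3))
loss = triplet_loss(embedder(anchor), embedder(positive), embedder(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()

def classify_1nn(query_windows, exemplar_windows, exemplar_labels):
    """Assign each query the label of its nearest exemplar embedding.
    exemplar_labels is a tensor of integer activity labels."""
    with torch.no_grad():
        q, e = embedder(query_windows), embedder(exemplar_windows)
        nearest = torch.cdist(q, e).argmin(dim=1)   # Euclidean nearest neighbour
    return exemplar_labels[nearest]
```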

A similar approach was taken by Khaertdinov et al. [93], who use deep metric learning equipped with attention for HAR. Their goal is to develop a robust wearable-based HAR system that overcomes two difficulties: (1) inter-class similarities (different activities having similar inertial motion patterns) and (2) subject heterogeneity, where different subjects perform the same activity quite differently. Hyperparameter tuning was done on the validation sections of the datasets. The model parameters were optimized by Stochastic Gradient Descent with Adaptive Moment Estimation (ADAM) using the default parameters (\(\epsilon = 10^{-8}\), \(\beta _{1} = 0.9\), \(\beta _{2} = 0.999\)). The models were trained for 100 epochs, and the learning rate was decreased by a factor of two after every 10 epochs without performance improvement. Training with the Hierarchical Triplet Loss (HTL) requires pre-training with the original triplet loss; thus, the best models on the validation sets were fine-tuned using HTL for 10 extra epochs. For all models (including the LSTM model trained with the original triplet loss), the authors set the LSTM hidden state vector size to 512, and for all the triplet networks they fixed the mini-batch size to approximately 128.

Hu et al. [143] did work aimed at re-purposing activity instances collected in one activity domain for training a classifier in another activity domain. The authors call these samples “pseudo training data” in the target domain. In contrast to the previous work, which is feature-based, this kind of transfer learning is instance-based. They achieve their goal by developing an algorithm that utilizes pre-existing web data describing the two activity sets. Using such descriptions, a similarity framework is built within which activity samples in the source domain are mapped to activity samples in the target domain with some confidence value.

The authors omit sequential information in the dataset; thus, the evaluation criterion is simply the accuracy on the test dataset, that is, the number of correctly predicted activities over the total number of activities in all time periods. The activity names (e.g., preparing breakfast, cleaning, etc.) are used as queries submitted to Google, retrieving the top K results; K is a parameter tuned to clarify the relationship between the algorithm's accuracy and the number of Web pages retrieved per query. Using this approach, the authors obtained a significant increase in activity recognition performance in the target domain compared to the baseline case where no knowledge transfer is done. In experiments on real-world datasets, their algorithm achieved about \(60\%\) accuracy with no or very few training data in the target domain.

Khan et al. [144] proposed an instance-based transfer learning framework. The work is well motivated, as the authors emphasize the need for transfer learning. Their justification can be articulated in the following points:

1. HAR systems are typically based on a limited number of activities that are present in the training and testing splits of the underlying dataset. However, in practical scenarios this limited set of activities needs to be extended (concept drift).

2. In practical scenarios the data distributions can be radically different from the distributions inferred during the training phase of the HAR system (data drift).

3. There is a shortage of labeled data samples and an abundance of unlabeled samples. Transfer learning can help in the direction of self-supervised learning.

4. Training and testing environments and setups are typically quite different from the corresponding deployments.

Much more could be said about the challenges in inertial motion-based HAR systems. A detailed discussion of the heterogeneities in HAR systems is given in Sect. 1; see also Fig. 20 and the associated discussion at the beginning of Sect. 5 regarding the learning benefits gained from transfer learning.

In the face of such challenges, Khan et al. [144] proposed a framework, called TransAct, which augments the instance-based Transfer Boost algorithm. They apply their method on three publicly available datasets: MHealth [95], Daily and Sports Activities [57], and SBHAR [42] (detailed descriptions of these datasets can be found in Table 15). The authors experimented with these datasets in ways designed to test their transfer learning agenda: they perform cross-testing spanning different environments, and they accommodate new sets of activities and changes in the data distribution by combining clustering with anomaly detection.

The accelerometer data were segmented by a sliding window into frames of 128 samples with 50% overlap; the frame length was the same across all the datasets. The authors removed noise using a low-pass filter. Applying the FFT to the accelerometer data showed that most of the high-energy frequency components lie between 0 and 20 Hz. The authors extracted both time-domain features (e.g., mean, standard deviation, etc.) and frequency-domain features (e.g., energy, entropy, etc.), with the latter computed using the FFT on each frame. The experimental results show that the TransAct model achieves on average \(81\%\) activity detection accuracy.
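The following sketch shows how such per-frame time- and frequency-domain features might be computed for one 128-sample single-axis frame; the exact feature list used by the authors may differ, and the 20 Hz cut-off follows the observation above.

```python
import numpy as np

def frame_features(frame, fs):
    """Hand-crafted features for one single-axis frame, mixing time-domain
    statistics with FFT-based frequency-domain statistics."""
    # Time domain
    mean, std = frame.mean(), frame.std()
    # Frequency domain via FFT, keeping components up to 20 Hz
    spectrum = np.abs(np.fft.rfft(frame - mean))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = spectrum[freqs <= 20.0]
    energy = np.sum(band ** 2) / len(band)
    p = band / (band.sum() + 1e-12)             # normalized spectral distribution
    spectral_entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return np.array([mean, std, energy, spectral_entropy])

frame = np.random.randn(128)                    # one 128-sample accelerometer frame
print(frame_features(frame, fs=50))
```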

Khan et al. [145] proposed a framework, called UnTran, that aims at bridging the gap between a HAR model learned in some source domain and its adaptation or deployment in a different target activity domain (device heterogeneities, sensing bias, inherent variabilities in subjects’ behaviors and actuations). UnTran is based on transferring two layers of an autoencoder pretrained on samples from the source domain to the target domain, where these two layers are used to generate a common feature space for the two domains. Specifically, the authors trained a Deep Sparse Autoencoder (DSAE) on pre-extracted feature vectors and subsequently used part of the pretrained encoder as a generic feature extractor. Three SVM classifiers are trained across three different configurations: (1) training on features extracted by the DSAE from the source domain samples, (2) training on features extracted by the DSAE from the target samples, and finally, (3) training on low-level features extracted from the target samples. The decisions from the three classifiers are then fused to induce a final decision. The authors validated their work on three datasets: Opportunity [146], Daily and Sports Activities [57], and WISDM Actitracker [106] (detailed descriptions of these datasets can be found in Table 15).

The authors segmented the accelerometer data into frames of 128 samples with 50% overlap between frames. A low-pass median filter was applied to cancel noise, and statistical time-domain and frequency-domain features were extracted. These features were fed to the classifier in batches of size 32; the frame length and batch size were the same across all datasets and experiments. Transfer learning baseline methods were implemented: Transfer Component Analysis (TCA) [147] in Python and Joint Distribution Adaptation (JDA) [148] in MATLAB. The proposed Deep Sparse Autoencoder (DSAE) comprises 4 layers, with a softmax layer used to encode the classification labels in the source domain. The first two layers of the source-tuned network were used to build the proposed classifier.

The experiments show that UnTran achieves about \(75\%\) F1-score in recognizing unseen new activities using only \(10\%\) of the target samples. The system also maintains an F1-score of \(98\%\) for recognition of seen activities in the presence of only 2–3% labeled activity samples.

Motivated by much the same reasons, Chikhaoui et al. [149] explored the development of a transfer learning framework based on CNNs. Feature representations learned on one activity dataset are used to analyze activities in another dataset that is characterized by different data collection hardware and software setups, and even changes in the sensing modalities. A CNN classifier is pretrained and the learned weights are transferred to a new, similarly structured network in the target domain. The latter network is fine-tuned using samples from the target dataset without making any changes or updates to the transferred weights. The authors validated their work over three benchmark datasets: MobiAct [150], RealWorld Human Activity Recognition (RWHAR) [151], and Mobile Sensing Heterogeneities for Activity Recognition [152]. Hence, they tested the effectiveness of their approach across different users, device placements, and sensor modalities.

The accelerometer data were segmented into non-overlapping 4 s segments. The authors trained the source model for 100 epochs, and the target models were fine-tuned for 50 epochs. The CNN models were implemented using the TensorFlow library. The authors obtained F1-scores up to about 0.94 in some of the evaluated scenarios.

Another interesting work with a different spirit is that of Abdu-Aguye et al. [25], briefly discussed above. The idea can be conceptualized as generating “virtual IMU sensors”: given inertial motion data streamed from a real IMU sensor mounted on some source body location, these timeseries are mapped to their correspondence on a target body location, so the latter samples can be considered as streaming inertial motion data from a virtual IMU sensor mounted on the target location. Independently of the performed activity, the authors developed a learning architecture that maps an inertial sample from the source body location to the target body location; it is meant to capture how dynamics in general are transferred and correlated across different body locations, regardless of the activity. In order to empirically test their work, they developed a rigorous evaluation strategy, as shown in Fig. 21. Given a source body location and a target body location, the inertial samples of the dataset are divided into 3 groups: \(50\%\), \(25\%\), and \(25\%\). The mapping between the source and target locations is called “roaming”. Three sets of evaluations are performed. The first evaluation aims at establishing a baseline as a lower bound of the predictive performance: a classifier is trained on \(25\%\) of the target samples (1B in Fig. 21) and then tested on \(25\%\) of the source samples (0C). The second evaluation is based on training a roaming model on the \(50\%\) source and target samples (denoted 0A and 1A in Fig. 21), where the input to the roaming model is a source sample and the output is a target sample; then \(25\%\) of the source samples (0C) are roamed using the trained model, and the classifier developed in the first evaluation is used to evaluate the roamed samples, that is, the roamed samples form the testing data. This evaluation helps derive the performance gains obtained from using the roaming model. The third and final set of evaluations is done to obtain an upper bound of the predictive performance, where the classifier trained in the first set of evaluations is used to evaluate genuine target samples (1C). By design, the predictive performance of the roaming model (the second set of experiments) lies between the lower and upper bounds obtained by the first and third sets of evaluations. The authors used the REALDISP [13] dataset for their empirical study, where multiple sensors stream from different body locations at the same time; see Table 15. A recurrent deep model based on bidirectional LSTM is used for training the roaming map, see Fig. 8.

Fig. 21

Data assignment for experimental and evaluation methodology [25]

Table 11 shows a sample of the results obtained through roaming in terms of average accuracy and F1-score of activity recognition. The left column shows the activities categorized into four groups: activities involving the whole body (\(\ddagger \)), trunk activities (\(\Vert \)), the upper extremities (\(\dagger \)), and the lower extremities (\(^\wedge \)). The source location is the right lower arm (RLA), with two target locations: the right upper arm (RUA) and the back (BACK). Except for activities involving the lower extremities, the RUA upper bound (the UB column) shows good F1-metrics. This can be trivially explained by the fact that such activities have more dynamics captured by the RUA as compared to those involving only the lower extremities. In addition, the lower bounds (the LB column in the table) indicate that the upper and lower parts of the arm generate similar dynamics; this may be attributed to their proximity and connectedness. The proposed roaming (the “Proposed” column in the table) then shows significant improvement over the lower bound, indicating the effectiveness of the roaming model. Turning to the back location, we see that the lower-bound results are much worse. This means that the classifier is unable to detect the proper dynamics of the back from the RLA, which emphasizes the need for a transformation, “roaming” in the authors’ terminology, of the inertial motion signals of the RLA to the back before making any useful inference. As shown in the table, the transformation is rather effective in many of the cases, especially when compared to the upper bounds.

Table 11 F1-measures for roaming experiments for right lower arm (RLA) to right upper arm (RUA) and back (BACK) [25]

5.2 Prompting

With the rise of pre-trained models, fine-tuning these models to target domains is a crucial task; thus, parameter-efficient transfer learning of large models is very important. In [153], the authors proposed black-box visual prompting (BlackVIP), which adapts pre-trained models efficiently without knowledge of the model architectures and parameters. BlackVIP consists of two components: a coordinator and simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The coordinator designs input-dependent, image-shaped visual prompts that improve few-shot adaptation under distribution and location shift. SPSA-GC efficiently estimates the gradient of the target model to update the coordinator. Experiments on 16 datasets showed that BlackVIP allows robust adaptation to different domains without access to the pre-trained models' parameters and with minimal memory requirements. Such prompting models can help in communicating the feedback of a virtual coach for human activity recognition and assessment [154].

In [155], the authors proposed “inverse prompting” to better control text generation. The main idea of inverse prompting is to use the generated text to inversely predict the prompt, which improves the relevance between the prompt and the generated text. The authors pre-trained a large-scale language model and conducted a systematic study using human evaluation on open-domain poem generation and question answering tasks. The results showed that the proposed approach outperforms the baseline models, and the generation quality approached human performance on some tasks.

In [156], the authors proposed a Prompt approach for Text Generation (PTG) in a transferable setting. Initially, PTG learns a set of source prompts for different source generation tasks and then transfers these prompts to target generation tasks. The authors designed an adaptive attention mechanism to derive the target prompts, taking both task-level and instance-level information into account. For each data instance, PTG learns a specific target prompt attached to highly relevant source prompts. The results showed that PTG achieved better results than fine-tuning approaches. The authors released the proposed source prompts as open source so that researchers can reuse them to enhance new text generation tasks.

In [157], the authors presented Multitask Prompt Tuning (MPT), which first learns a single transferable prompt by transferring knowledge from different task-specific source prompts. The authors then learn multiplicative low-rank updates to this prompt to adapt it efficiently to every downstream target task. Results on 23 NLP datasets showed that the proposed method outperforms state-of-the-art approaches, including the full fine-tuning baseline in some tasks, while tuning only 0.035% of the task-specific parameters.

6 Application domains

In this section we present some case studies in which inertial motion sensors are effectively used for different kinds of human actions.

6.1 Biometrics

Human biometrics are a set of measurements of bodily characteristics that can be used for authentication and/or identification. The most common biometrics include fingerprints, face, voice, iris, and palm or finger vein patterns. In our context, however, the biometrics are based on the execution pattern of an action or activity, captured through streaming IMU motion signals. Recognition of human biometrics has been widely studied in recent years due to its significance and use in many applications such as security and surveillance, speech analysis [158, 159], recommendation systems [160], and healthcare applications [161].

The wide and varying types of biometrics dictate the use of various kinds of data modalities streaming from different kinds of sensors. The typical data used are visual, audio, and inertial motion data. Here, we naturally focus on the latter, especially since its applicability is more recent with the widespread use of wearables with embedded IMU units.

To the best of our knowledge, IMU-based biometric systems have been directed almost exclusively towards three characteristics: age, gender, and gait. The next subsection is dedicated to gait analysis; here we consider age and gender. Mostafa et al. [19] proposed a method for gender recognition from inertial data based on the wavelet transform (refer to Sect. 3.2 for details about feature extraction using the Fourier transform and its variants). The motion data are collected during the walking activity (gait streaming data). For that purpose they collected the EJUST-GINR-1 dataset [19, 20], which contains samples from smartwatches in addition to specialized IMU units placed at 8 different parts of the human body; see Fig. 9. Random forest and CNN classifiers are used for the binary male/female classification, with the wavelet transform of the 6-axial timeseries (3-axis accelerometer and 3-axis gyroscope angular velocity) as input features. The authors studied which sensor location(s) on the human body are optimal in terms of distinguishing males from females during the act of walking. They found, as expected, that sensors placed on the legs and waist are best at recognizing gender during walking, reaching an accuracy of about \(97\%\) using the sensor placed on the left cube. The authors also experimented with the OU-ISIR Gait dataset [33]; however, the performance was rather modest, about \(75\%\) (using the CNN).
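A rough sketch of wavelet-based gait features feeding a binary gender classifier is shown below, assuming the PyWavelets library. The wavelet family, decomposition level, energy-based features, and random data are illustrative choices, not the configuration used in [19].

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wavelet_features(window_6axis, wavelet="db4", level=3):
    """window_6axis: (window_length, 6) accelerometer + gyroscope window."""
    feats = []
    for axis in window_6axis.T:
        coeffs = pywt.wavedec(axis, wavelet, level=level)
        # Energy of each sub-band (approximation + detail coefficients).
        feats.extend(float(np.sum(c ** 2)) for c in coeffs)
    return np.array(feats)

# Illustrative training data: 100 gait windows with random labels.
X = np.stack([wavelet_features(np.random.randn(256, 6)) for _ in range(100)])
y = np.random.randint(0, 2, size=100)           # 0 = male, 1 = female (illustrative)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(clf.score(X, y))
```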

Using the OU-ISIR dataset [33], a competition for gender recognition and age estimation was conducted [43] based on IMU signals extracted during gait activity. Most of the participating teams obtained relatively low accuracy in gender classification, whereas the results on age estimation were much better. The best results of this competition were presented by Garofalo et al. [162], who reached an accuracy of about \(76\%\) for gender recognition and a mean absolute error of about 5.39 years for age regression by using an orientation-independent AE-GDI representation along with a CNN.

Riaz et al. [163] proposed a system for gender recognition and age and height estimation based on IMU signals streamed from on-body sensors during the gait action. Age estimation is formulated as a classification problem by splitting the age into three ranges: (1) less than 40 years, (2) between 40 and 50 years, and (3) older than 50 years. Height estimation is similarly formulated as a classification problem over three ranges: (1) less than or equal to 170 cm, (2) between 170 and 180 cm, and (3) greater than or equal to 180 cm. The authors used random forest as a classifier with two validation strategies: tenfold cross-validation and the more difficult subject-wise cross-validation. They reached an accuracy of about \(93\%\) for gender recognition using the sensor on the chest, about \(89\%\) for age classification using the chest and lower back sensors, and about \(89\%\) for height classification using the chest sensor.

Jain et al. [164] did gender recognition from the gait action using accelerometer and gyroscope signals streamed from IMU sensors embedded in a smartphone. The authors used multi-level local pattern and local binary pattern features as feature extractors. The extracted features are then fed to an SVM and to aggregate bootstrapping (bagging). To evaluate these models, 252 gait signals collected from 42 subjects were used. They achieved an overall accuracy of about \(77\%\) using the multi-level local pattern features along with bagging.

Mostafa et al. [32] proposed a system, called BioDeep, for gender and age analysis. The authors employ a simple, yet effective (as shown above in other works), feature extraction technique based on the autocorrelation function. This descriptor is then fed to two decoupled CNNs, one for gender recognition and the other for age estimation; age estimation here is formulated as a regression problem. As a baseline, they also used a random forest, justified by the aim of testing the effectiveness of deep networks. There is, however, some confusion in the paper: the authors claim to test the effectiveness of deep learning, whereas they actually test the effectiveness of deep neural networks in particular, as random forest can be viewed as a deep learning strategy as well [165]. The main contribution of the paper can be considered the extensive empirical studies performed on a varied list of datasets; to the best of our knowledge, this might be the most comprehensive study from that perspective. They perform self-testing as well as cross-testing (transfer learning) across these datasets. The proposed system architecture is shown in Fig. 13. The authors validated their work using several datasets and different on-body sensors including: EJUST-GINR-1 (right cube RC sensor), EJUST-GINR-1 (waist sensor) [19, 20], HuGaDB (Right Shin RS) [22], and GEDS (Tibialis anterior Right TaR) [34].

Tables 12 and 13 show a sample of the results produced in this paper for both gender recognition and age estimation. The diagonal elements show the self-testing scenarios, that is, train and test on the same dataset, which also play the role of a baseline when considering the off-diagonal elements, which represent the cross-testing results, that is, train on one dataset and test on a completely different one. The performance metric for gender recognition is the overall accuracy, whereas it is the mean absolute error for age estimation. The results indicate that the proposed method is very effective in both self-testing and cross-testing, where the latter is more significant as it represents a form of transfer learning that can be used to overcome the heterogeneity problem in inertial-based systems (see the discussion in Sect. 1 for the sources and consequences of this problem). For gender recognition, the results are transferable to within 1–2%; for age estimation, the results are transferable to within a 2–3 times increase in the mean absolute error. The authors report that transfer learning provided a 20–30\(\times \) speedup in training time compared to end-to-end training on the same dataset.

Table 12 Self- and cross-testing results of gender recognition (accuracy) [32]
Table 13 Self- and cross-testing results of age estimation (mean absolute error) [32]

Riaz et al. [166] estimated age based on IMU analysis of the human walk. The authors recorded 6-dimensional accelerations and angular velocities of 86 subjects while they performed standard gait activities, using IMUs mounted on the chest. The recorded signals were segmented into single steps, and for each step a total of 50 spatio-spectral features were computed from the 6-dimensional components. In order to estimate the subject's age, three ML models were trained: RFs, SVMs, and an MLP. Two cross-validation techniques were utilized: tenfold and subject-wise cross-validation. A random forest regressor trained and validated on hybrid data achieved an average RMSE (root mean squared error) of 3.32 years and an MAE (mean absolute error) of 1.75 years with tenfold cross-validation, and an average RMSE of 8.22 years with subject-wise cross-validation. Since the subjects belong to two demographic regions (Europe and South Asia), this gives broader empirical confirmation that age can be estimated based on human gait.

Khabir et al. [167] estimated gender and age from an IMU-based gait dataset that is part of the Osaka University-ISIR Gait Database [43]. The authors showed that an SVM model had the best accuracy for classifying gender, while decision trees had the highest variance score (\(R^{2}\) value) for age prediction.

Rasnayaka et al. [168] studied the personal characteristics that can be inferred from on-body IMU sensors and the effect of sensor location on the user's privacy. By analyzing sensor locations with respect to privacy and service, the authors recommended locations that maintain functionality (such as biometric authentication) while reducing privacy invasion. The authors collected a multi-stream dataset from 3 on-body IMU sensors covering 6 locations and 6 actions, with different physical, personal, and socio-economic characteristics, from 53 subjects. An opinion survey about the relative importance of each characteristic was conducted with 566 persons. Using these datasets, the authors showed that gait data can reveal much personal information that may be considered a privacy issue. The opinion survey produced a ranking of the physical attributes based on their perceived importance. Using a privacy vulnerability index, sensors located in the front pocket or on the wrist were found to be more privacy-invasive, whereas back-pocket or bag sensors were less privacy-invasive without a large loss of functionality (e.g., for biometric authentication).

6.1.1 Gait analysis

Gait is the manner of human walk, and it is a biometric trait that is unique to each individual [169]. In the computational domain, gait analysis is important particularly for person authentication and identification. Person authentication is concerned with the problem of deciding whether a new gait instance belongs to a legitimate user or to an imposter [170]. This is typically done by searching for the new pattern in a list of patterns of the legitimate users [171]. In contrast, for person identification/recognition, the task is to classify a new gait instance as belonging to one of a set of predefined users [43]. The embedded inertial motion sensors in commodity wearable devices provide real-time streaming of linear acceleration and angular velocity, in addition to magnetometer readings, which together can fully capture the motion dynamics of the owner, or at least the motion dynamics of the body part (for example, the wrist) to which the device is attached. The main advantage here is that this process occurs implicitly and does not require the active participation of the human subject [172,173,174,175]. This is in contrast to fingerprint authentication, where the person has to take an action to authenticate, or even to face recognition, which can be easily fooled. Person identification using motion data streaming from wearables turns out to be important in applications where, for example, different people share the same wearable device and it is desirable to identify who is using the device at a given time while walking.

The work done in [20] tries to answer two questions (though the answers seem obvious, it is more of an engineering verification): (1) which body location(s) is(are) most effective for person identification through streaming from an IMU device attached to such part(s), and (2) if we have the freedom of mounting more IMU devices on different body parts, which combination is the most effective? The authors apply their work on the EJUST-GINR-1 [19, 20] dataset, a gait dataset with 20 subjects, 10 males and 10 females. See Table 15 for a full description of the dataset, and see Fig. 9 for a depiction of the sensor locations over the different body parts. Two of these sensors are smartwatches worn on both wrists. The authors use all 3 axes of the linear acceleration as well as the 3 axes of the angular velocity as streaming sources. They then derive from these timeseries, specifically via the autocorrelation function, 6-axial, fixed-size, much smaller signals, which are fed to 3 different classifiers, namely random forest, support vector machine, and a convolutional neural network. The overall schema of the system is shown in Fig. 19. Detailed performance metrics are given for every sensor on the body and for paired combinations of sensors for the three classifiers. Generally, the results are good, and, as expected, the best results are obtained from sensors mounted on the lower part of the body as they are closer to the motion source. However, the sensor on the back of the neck achieves the next best performance after the IMUs mounted on the lower body parts. The authors hypothesize that the “gait patterns are transmitted almost intact through the backbone up to this location”; however, they do not provide any further evidence or references for this claim. The main drawback of this work is that the number of subjects is small (a total of 20), in addition to the lack of diversity in their age (they are all in their early 20s), though they are balanced regarding gender.

An extension of this work was done by Adel et al. [176], using a more advanced modeling technique, namely Siamese networks, and validating the work over multiple richer datasets. In a sense, this work brings the long-standing work on vision-based gait recognition using Siamese networks into the realm of inertial motion data. As expected, these networks are used for automatic feature extraction. The proposed scheme is used for person authentication, where the network is trained, using raw inertial data, with only a few samples per authorized subject. Once trained, the network can be used to authenticate any given user using only one gait sample of that user. The work is validated using three publicly available datasets: OU-ISIR [33] (a large-scale dataset with 744 subjects), EJUST-GINR-1 [19, 20] (a small-scale dataset with only 20 subjects), and MMUISD [23] (a medium-scale dataset with 120 subjects). The main performance metric adopted in this paper is the equal error rate (EER), the point on the ROC (receiver operating characteristic) curve where the false acceptance/positive rate equals the false rejection/negative rate; it is one of the typical metrics used to measure the performance of person identification and authentication systems. Accuracy is reported as well. The overall framework is shown in Fig. 22. Their model achieved \(3.42\%\) EER on OU-ISIR (the world's largest inertial gait dataset) after training on 592 subjects and testing on another 78 subjects; each subject provided on average 5.9 s of gait data. The best EER was \(6.26\%\) on MMUISD and \(5.13\%\) on EJUST-GINR-1. The latter dataset has the fewest subjects (only 20) but the largest gait duration per subject, so we can conclude from these results that a larger number of subjects is more effective than the gait longevity of individual subjects. An interesting contribution of the authors is the use of transfer learning to train on subjects from a particular dataset and authenticate subjects from another. The hardware and software setups across the different datasets are quite different (for example, the type of sensor and its location on the body); see Table 15. Hence, this task is quite difficult. The baseline case is, of course, when the source and target datasets are the same. The trained Siamese network is essentially used to generate embeddings, and then difference vectors, for the target datasets. Applying the same reasoning as above, using OU-ISIR as the source dataset produces the best transfer results over the other datasets.
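For reference, the EER can be computed from genuine and impostor similarity scores via the ROC curve, as in the sketch below; the score distributions used here are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: the operating point where the false acceptance rate (FPR)
    equals the false rejection rate (1 - TPR)."""
    y_true = np.concatenate([np.ones_like(genuine_scores), np.zeros_like(impostor_scores)])
    y_score = np.concatenate([genuine_scores, impostor_scores])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    frr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - frr))          # threshold where FAR is closest to FRR
    return (fpr[idx] + frr[idx]) / 2.0

# Illustrative scores: genuine pairs should score higher than impostor pairs.
eer = equal_error_rate(np.random.normal(0.8, 0.1, 500), np.random.normal(0.4, 0.15, 500))
print(f"EER = {eer:.2%}")
```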

Fig. 22

Framework based on Siamese networks for person authentication based on inertial gait streaming [176]

6.2 Virtual coaching

In this section, we survey activity recognition work focusing mainly on indoor exercises (push-ups, squats, etc.). In the context of athletics and sporting, inertial motion sensors typically comprise accelerometers to measure force and acceleration, a gyroscope to give an indication of rotation (angular velocity and angular displacement), and a magnetometer to measure body orientation. Each of these sensors collects data across three perpendicular axes and can capture an athlete's movement in minute detail. These are used to continuously stream data that are typically processed by state-of-the-art machine learning, data modeling, and inference techniques that induce much higher-level useful information from such data about workout and exercising actions. For example, they can be used to differentiate between a jump to the right or a jump to the left, a walk or a run, a walk on sand, solid land, or grass, etc. Other uses include measuring bowling in cricket, jumping in basketball, kicking a ball in soccer, fencing, etc. One of the most popular uses of inertial sensors is in trying to quantify athlete readiness and fatigue in the field. For example, there are protocols that are often included in an athlete's warm-up while wearing sensors. Wearables, especially IMUs, can be used to detect changes in acceleration or direction of acceleration which may change with injury and/or fatigue. Each person has an individual motion signature, so coaches, trainers, and sports scientists can compare an athlete to her typical normal self. IMUs can provide good insights into exercising and/or playing performance and can aid in providing assessment and feedback to athletes recovering from injury.

Generally, IMUs can be placed anywhere on the human body. However, depending on the exercise or workout, some placements are more appropriate for capturing the dynamics of the motion of that exercise. For example, IMUs can be placed at the ankle for running sports, at the waist to detect jumping, at the wrists for sports such as tennis and fencing, and between the scapulae to approximate gross body movements. Current IMU technologies have improved dramatically; for example, sampling rates have exceeded 1 kHz, which allows for thorough and detailed analysis of high-speed movements. Also, dynamic ranges have increased to about 200 g (for force and acceleration measurements), so we can now measure movements such as high-velocity cutting, acceleration and deceleration, and jumping.

Optimal human movement is an important goal for coaches, trainers, therapists, and athletes in most sports. Skeletal muscles generate the force needed for these motions. Consequently, proper timing of skeletal muscle activation is important to improve movement quality and achieve better sports performance. However, injury, fatigue, and/or poor training can lead to sub-optimal timing and coordination of skeletal muscle activations [177, 178]. This can lead to a severe decrease in exercising performance and an increased risk of injury. Therefore, the assessment of muscle activation patterns provides a picture of the overall health and effectiveness of the musculoskeletal system [179]. It allows athletes to reduce muscle injury or imbalance and increase the effectiveness of training exercises.

Different feedback systems are used in the field of sports and rehabilitation. The use of such systems has shown some success by providing a positive effect on training control [180, 181]. In addition, the feedback system has the advantage of providing early warnings and precautions that can help to reduce the risk of injury [182] and to improve the performance during different movement tasks (e.g., balance) of elderly people [183]. Feedback can be given in several ways including visual, auditory, haptic, textual, or various combinations of these.

In [31], the authors present a multi-modal deep learning method utilizing many data sources for activity segmentation, exercise recognition, and counting the number of repetitions. The authors introduced MM-Fit, a multi-modal dataset containing a collection of IMU signals extracted from multiple devices: smartphones, smartwatches, and earbuds. These devices were worn by subjects performing full-body workouts and are time-synchronized with multi-viewpoint RGB-D video and pose estimates in 2D and 3D. The authors developed a rigorous, hierarchical architecture: at the bottom level, autoencoders learn representations of the individual modality signals; at later layers, incremental fusion of the representations of different modalities is performed in order to learn more abstract representations (embeddings) of the lower-level representations. Three main tasks are addressed by this work: activity segmentation, exercise recognition, and counting the number of repetitions of a given exercise. In total, 21 full-body workout sessions were recorded, totalling 809 min across 10 participants, with workout lengths ranging from 27 to 67 min and an average duration of 39 min. Workout exercises monitored in this dataset include: squats, lunges (with dumbbells), bicep curls (alternating arms), sit-ups, push-ups, sitting overhead dumbbell triceps extensions, standing dumbbell rows, jumping jacks, sitting dumbbell shoulder presses, and dumbbell lateral shoulder raises. Table 15 gives a more detailed description of the dataset. For exercise recognition, the authors did extensive evaluation using each device separately and individual modalities within each device, in addition to different fusions of all of these. For example, for the IMUs, the authors experimented with the accelerometer only, the gyroscope only, and a combination of both. The best recognition accuracy is obtained from the fusion of all sensor modalities, reaching more than \(99\%\). On the other hand, the inertial sensors of the earbud and of the smartphone placed in the right trouser pocket gave the lowest accuracy, around \(80\%\). A similarly fine-grained evaluation was done regarding the performance of counting repetitions of each exercise across each device. The performance metrics used were the mean absolute repetition counting error and the percentage of exercise sets for which the predicted repetition count is exact, within 1, or within 2 of the ground truth. From the perspective of exercise type, squats, lunges, and lateral raises gave the lowest mean absolute repetition counting error, whereas biceps curls and push-ups gave the worst. These results are consistent across all devices, sensors, and modalities. From the device perspective, on average across all exercises, the smartwatch on the left hand gave the lowest mean absolute repetition counting error.
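As a much simpler heuristic than the multi-modal model above, shown only to illustrate the repetition-counting task itself, the sketch below counts peaks in the smoothed accelerometer magnitude of a single exercise set. The smoothing length, minimum repetition period, and prominence threshold are illustrative parameters.

```python
import numpy as np
from scipy.signal import find_peaks

def count_repetitions(acc_xyz, fs, min_rep_period_s=1.0):
    """acc_xyz: (n_samples, 3) accelerometer segment of one exercise set."""
    magnitude = np.linalg.norm(acc_xyz, axis=1)
    # Light moving-average smoothing to suppress jitter.
    kernel_len = max(1, int(0.2 * fs))
    kernel = np.ones(kernel_len) / kernel_len
    smoothed = np.convolve(magnitude - magnitude.mean(), kernel, mode="same")
    peaks, _ = find_peaks(smoothed,
                          distance=int(min_rep_period_s * fs),   # min spacing between reps
                          prominence=0.5 * smoothed.std())
    return len(peaks)

segment = np.random.randn(30 * 50, 3)   # 30 s of 50 Hz accelerometer data (illustrative)
print(count_repetitions(segment, fs=50))
```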

6.3 Healthcare

Accurate, reliable, and timely quantification of a human's psychological, physical, and physiological state using wearable sensing devices is paramount in health monitoring and safety surveillance. This in turn necessitates the measurement of relevant signals from various on-body spots. A comprehensive review of the use of wearables in the medical sector can be found in [184]. All the capabilities of smartphones, including on-board embedded IMU sensors, can be used to collect behavioral data unobtrusively (without any intervention or disturbance to the subject), in situ, as it unfolds in the course of daily life [185]. Harari et al. [185] investigate the use of continuously streamed data about a person's social interactions, daily activities, and mobility patterns for psychological research and profiling, and discuss the ongoing methodological and ethical challenges associated with research in this domain. The authors boldly claim that smartphone, and generally wearable, sensing would lead psychology on the road to fully realizing its promise as a truly behavioral science.

The work by Baghdadi et al. [186] aims at developing a general framework for a monitoring system that predicts critical physical states prior to any potential safety or health incident. They used a Kalman filter to estimate hip acceleration and trunk posture from data streamed from a single IMU mounted at the ankle. These data are combined with a priori information from a Fourier series approximation relating body kinematics to the periodic nature of gait. As a first application, they built an SVM classifier for continuous fatigue monitoring by classifying the feature pattern of the kinematic profile into ‘fatigue’ versus ‘non-fatigue’ states. This provides a minimally intrusive, inexpensive approach for fatigue detection in the workplace. The authors examined two research questions: (1) can a single IMU placed at the right ankle capture the subtle changes in gait due to fatigue? and (2) can a computationally efficient classifier distinguish between fatigued and non-fatigued states? Experiments using an SVM classifier answered both questions affirmatively, reaching an accuracy of \(90\%\) in predicting the fatigue states of the 20 recruited subjects. The study found that fatigue, induced by a simulated manual material handling manufacturing task, results in changes in gait kinematics; in a more abstract sense, this says that inertial motion data are able to capture and distinguish the fatigue state. From the feature extraction/selection perspective, not all gait kinematic features perform equally well in fatigue prediction; foot acceleration and position trajectories are among the best performing ones. This line of research is crucial for studies in ergonomics. The authors propose that their framework for safety surveillance and fatigue monitoring, which uses a minimal set of sensors and data, can be extended to other movement monitoring applications in healthcare through ordinary smart wearable devices.
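As a hedged illustration of the classification step only (not Baghdadi et al.'s exact pipeline), an SVM separating 'fatigue' from 'non-fatigue' kinematic feature vectors could be set up as follows with scikit-learn; the feature matrix and labels below are synthetic placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix: each row is a gait cycle described by kinematic features
# (e.g., foot acceleration statistics, position trajectory descriptors).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))        # 200 gait cycles x 12 features (illustrative)
y = rng.integers(0, 2, size=200)      # 1 = fatigue, 0 = non-fatigue (illustrative labels)

# RBF-kernel SVM with feature standardization, evaluated by 5-fold cross-validation.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean())
```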

IMU devices typically have local storage that is large enough for data to be recorded directly on the device; hence, sampling rates can be increased to capture minute motion details and very fast movements, as in sports [27]. IMUs can easily be used indoors, across long distances, and even underwater in some cases.

6.3.1 Vital signs

Vital signs can be defined as the measurements of the basic functions of the human body [187]. The four major signs routinely monitored by medical doctors and healthcare specialists are [187]: body temperature, pulse rate, respiration/breathing rate, and blood pressure. Strictly speaking, blood pressure is not considered a vital sign, but it is often measured along with the other vital signs. Vital signs are indeed useful in monitoring and detecting medical issues.

Epson’s sensing technology for vital signs monitoring is a product with a sensing device that measures vital signs indicating biological activity in the human body [188]. The proposed vital sign solution utilizes photo-electric sensing. This measurement technology exploits the light-absorbing properties of hemoglobin: light from a green LED is shone onto the blood vessels lying under the skin. Epson improved the measurement accuracy by using a 3-axial accelerometer and an algorithm to detect and eliminate noise from body movement (e.g., caused by arm movement).
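The general idea of using an accelerometer as a noise reference to clean a photoplethysmography (PPG) signal can be sketched with a simple least-mean-squares (LMS) adaptive filter; this is a generic illustration, not Epson's proprietary algorithm, and the signals, filter length, and step size below are synthetic assumptions.

```python
import numpy as np

def lms_cancel(ppg, acc_ref, n_taps=8, mu=0.01):
    """Subtract the motion component predicted from the accelerometer reference (LMS filter)."""
    w = np.zeros(n_taps)
    cleaned = np.zeros_like(ppg)
    for n in range(n_taps, len(ppg)):
        x = acc_ref[n - n_taps:n][::-1]   # most recent reference samples
        motion_est = w @ x                # estimated motion artifact
        e = ppg[n] - motion_est           # error = cleaned PPG sample
        w += 2 * mu * e * x               # LMS weight update
        cleaned[n] = e
    return cleaned

# Synthetic example: a 1.2 Hz pulse corrupted by 2.5 Hz arm-swing motion, sampled at 100 Hz.
t = np.arange(0, 10, 1 / 100)
motion = 0.8 * np.sin(2 * np.pi * 2.5 * t)
ppg = np.sin(2 * np.pi * 1.2 * t) + motion
acc = motion + 0.05 * np.random.randn(t.size)   # accelerometer sees mostly the motion
clean = lms_cancel(ppg, acc)
```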

In [189], the authors reviewed the utilization of wearable devices in healthcare and workout monitoring based on vital signals, including the electrocardiogram (ECG), electroencephalogram (EEG), electromyography (EMG), inertial and body movement signals, heart rate, blood, sweat, etc.

In [190], the authors reviewed wearable devices and detection algorithms that use piezoresistive and inertial sensors (i.e., accelerometers and gyroscopes) for monitoring breathing and respiratory parameters and for detecting dysfunctions or pathologies without surgery. Several tools were utilized to extract the respiratory parameters from the acceleration data.

In [191], the authors proposed an approach for detecting sleep parameters by measuring variations in the earth’s magnetic field. The authors used a magnetometer sensor laid on the body to detect millimeter-scale night-time breathing movements by measuring changes in the magnetic vectors. A soft, non-invasive wearable sensor is printed on a miniature printed circuit board, which includes a wireless Bluetooth Low Energy (BLE) module and a low-power microcontroller. Minimal power consumption is achieved by a smart processing algorithm that calculates the respiration rate, apnea time, and movement time.
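As a rough illustration of how a respiration rate can be derived from such a slowly varying body-worn signal, the sketch below band-passes the typical breathing band and counts peaks; it is not the on-board algorithm of [191], and the scalar magnetometer trace, sampling rate, and filter band are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def respiration_rate(signal, fs):
    """Estimate breaths per minute by band-passing the respiration band and counting peaks."""
    b, a = butter(2, [0.1, 0.7], btype="band", fs=fs)        # roughly 6-42 breaths/min
    filtered = filtfilt(b, a, signal)
    peaks, _ = find_peaks(filtered, distance=int(fs * 1.5))  # at least 1.5 s between breaths
    duration_min = len(signal) / fs / 60.0
    return len(peaks) / duration_min

# Synthetic magnetometer trace: 0.25 Hz breathing (15 breaths/min) plus noise, sampled at 25 Hz.
fs = 25
t = np.arange(0, 120, 1 / fs)
mag = 0.5 * np.sin(2 * np.pi * 0.25 * t) + 0.05 * np.random.randn(t.size)
print(round(respiration_rate(mag, fs)), "breaths/min")
```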

In [192], the authors reviewed the conventional approaches for monitoring cardio-pulmonary rates and the possibility of replacing these methods with radar-based approaches. The authors also studied the challenges that radar-based vital signs monitoring approaches need to overcome in order to be applied in the healthcare domain, and presented a proof-of-concept (PoC) radar-based vital sign detector with good measurement results. In [193], the authors highlighted the ability of IMU sensors to detect vital signs; the presented results were promising for monitoring soldiers’ vital signs.

6.4 Ergonomics

Kuschan et al. [194] studied two lifting movements using acceleration and angular velocity signals. In the ergonomic movement, the test subjects bent at the hips and knees only, generating the lifting power by squatting down; in the unergonomic movement, the test subjects bent forward and lifted the box mainly with their backs. The signals were collected by a vest equipped with five 9-axis IMU sensors. The IMU data collected from both the ergonomic and unergonomic movements were statistically analyzed and visualized. A CareJack vest was used to provide the worker with real-time feedback about his ergonomic movement, decreasing the risk of spinal and back diseases.

Mudiyanselage et al. [195] proposed a surface electromyography (EMG) based system using machine learning to recognize body movements that may harm muscles while handling materials. The study utilized the lifting equation developed by the U.S. National Institute for Occupational Safety and Health (NIOSH). The equation specifies a Recommended Weight Limit, the maximum acceptable weight that a healthy worker can lift or carry, and a Lifting Index value to assess the risk threshold. Four different ML models (Decision Trees, SVM, k-NN, RF) were developed to classify the risk assessments based on the NIOSH lifting equation. To find the best performance of each algorithm, the models’ sensitivity to different parameters was also studied. The results showed that the Decision Tree model was able to predict the risk level with an accuracy close to \(99.35\%\).
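For readers unfamiliar with the NIOSH lifting equation referenced above, the sketch below computes the Recommended Weight Limit (RWL) and Lifting Index (LI) from the standard metric multipliers; the frequency and coupling multipliers normally come from NIOSH lookup tables and are passed in here as plain arguments, and the example values are illustrative only.

```python
def niosh_rwl(h_cm, v_cm, d_cm, a_deg, fm=1.0, cm=1.0):
    """Recommended Weight Limit (kg), metric form: RWL = LC*HM*VM*DM*AM*FM*CM."""
    lc = 23.0                                    # load constant (kg)
    hm = min(25.0 / max(h_cm, 25.0), 1.0)        # horizontal multiplier
    vm = 1.0 - 0.003 * abs(v_cm - 75.0)          # vertical multiplier
    dm = min(0.82 + 4.5 / max(d_cm, 25.0), 1.0)  # distance multiplier
    am = 1.0 - 0.0032 * a_deg                    # asymmetric multiplier
    return lc * hm * vm * dm * am * fm * cm      # fm, cm: from NIOSH tables

def lifting_index(load_kg, rwl_kg):
    """LI > 1 indicates increasing risk of lifting-related injury."""
    return load_kg / rwl_kg

rwl = niosh_rwl(h_cm=40, v_cm=60, d_cm=50, a_deg=30, fm=0.88, cm=0.95)
print(f"RWL = {rwl:.1f} kg, LI = {lifting_index(15, rwl):.2f}")
```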

Humadi et al. [196] proposed a system based on IMUs and a Kinect V2 for Rapid Upper Limb Assessment (RULA) risk assessment, compared against a reference motion-capture camera system. The activities were the different tasks of manual material handling. The calculated RULA scores were compared to the reference system using the proportion agreement index (Po) and Cohen’s Kappa coefficient (\(\kappa \)). The wearable-based approach showed a “substantial” agreement (\(\kappa > 0.6\)) with the motion-capture cameras, whereas the optical-based approach showed an inconsistent agreement ranging from “fair” (\(\kappa < 0.4\)) to “moderate” (\(\kappa > 0.4\)), caused by self-occlusion and object-occlusion. Thus, the wearable-based approach was more suitable for in-field ergonomic risk assessment than the optical-based approach.
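The agreement statistics used in this comparison are straightforward to reproduce; the sketch below computes the proportion agreement index and Cohen's kappa for two raters' risk categories using scikit-learn, with purely illustrative labels.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative RULA action levels assigned to the same trials by the IMU system and the
# reference motion-capture system (1 = acceptable ... 4 = investigate and change immediately).
imu_scores = np.array([1, 2, 2, 3, 4, 3, 2, 1, 4, 3])
ref_scores = np.array([1, 2, 3, 3, 4, 3, 2, 1, 4, 2])

po = np.mean(imu_scores == ref_scores)              # proportion agreement index (Po)
kappa = cohen_kappa_score(imu_scores, ref_scores)   # chance-corrected agreement (kappa)
print(f"Po = {po:.2f}, kappa = {kappa:.2f}")
```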

Fauziah et al. [197] proposed a tool that gives workers real-time feedback on ergonomic risks based on their postures. Direct visits to a manufacturing site in Indonesia were performed to study various production tasks and to obtain data on complaints of Work-Related Musculoskeletal Disorders (WMSDs). Tasks with a higher risk of injury were chosen for task analysis in order to segment the task steps. Appropriate risk assessment methods, e.g., Rapid Entire Body Assessment (REBA), Rapid Upper Limb Assessment (RULA), Ovako Working Posture Analysis System (OWAS), etc., were chosen based on the critical characteristics of the tasks. The tasks studied included cleaning up production residues, grinding, manual bending, and welding; the assessment method adopted was OWAS. The measures considered as WMSDs risk factors were the postures of the back, arms, and legs, and the weight being handled. These measures drove the design of the sensor system, which integrated 7 IMUs placed on the corresponding body locations. On a companion mobile application, a score shows the risk value and a warning signal is prompted to the worker.

Hsu and Lin [198] proposed a system of IMUs and an exoskeleton robot for work performance assessment and for assisting workers in performing different agricultural activities. With 3 IMUs located on the worker’s trunk and legs, the system controls the exoskeleton robot to move the worker’s legs while supporting the back. For evaluating the waist assistance system, ergonomic assessment tests were applied, e.g., Rapid Upper Limb Assessment (RULA) and electromyography (EMG). By using the exoskeleton robot, the muscle activity of the erector spinae was reduced by about \(17.3\%\) when lifting 10 kg objects during stoop lifting and by \(20.4\%\) during semi-squat lifting. For the same tasks, the muscle activity of the biceps femoris was reduced by \(18.2\%\) and \(15.2\%\), respectively. This work is also applicable to rehabilitation, construction work, and other related domains.

Humadi et al. [199] proposed an IMU system to assess manual material handling activities using in-field RULA scores, computed using 3D Cardan angles and 2D projection angles relative to reference values recorded by a motion-capture camera system. Axis conventions are generally used to establish the location/orientation of the coordinate axes as a frame-of-reference. For the trunk and neck joint angles, the 2D convention had a significantly smaller RMSE (\(p < 0.05\)); for the other upper-body angles, which convention had the smaller RMSE depended on the angle under analysis. The 3D convention showed a “moderate” agreement with the frame-of-reference, while the 2D convention showed a “substantial” agreement for two activities and a “moderate” agreement for one activity. For repeated sessions performed by each subject, the intra-class correlation coefficients ranged from 0.82 to 0.94 for the 3D convention and from 0.87 to 0.95 for the 2D convention. Thus, the proposed wearable IMU system using the 2D convention was found to be an accurate ergonomic risk assessment tool.

Jahanian et al. [200] proposed a system to study risk and recovery metrics of arm use in order to discriminate between (1) Manual Wheelchair (MWC) users with spinal cord injury (SCI) and matched able-bodied subjects, and (2) MWC users with rotator cuff pathology progression over 1 year and users without pathology progression (a longitudinal study). 34 MWC users and 34 age- and gender-matched able-bodied subjects were recruited. Upper arm risk (humeral elevation \(>60^{\circ }\)) and recovery (static \(\ge 5\) s and humeral elevation \(<40^{\circ }\)) measurements were calculated from wireless IMUs placed on the upper arms and torso in a free-living environment. Two independent magnetic resonance imaging studies, about 1 year apart, were evaluated for a subset of 16 MWC users. The frequency of risk events (\(p =0.019\)), the summed duration of recovery events (\(p =0.025\)), and the duration of each recovery event (\(p =0.003\)) were higher for MWC users than for able-bodied subjects. The summed duration of risk events (\(p =0.047\)), the frequency of risk events (\(p =0.027\)), and the risk-to-recovery ratio (\(p =0.02\)) were higher, whereas the summed duration of recovery events (\(p =0.036\)) and the frequency of recovery events (\(p =0.047\)) were lower, for MWC users with rotator cuff pathology progression (\(n = 5\)) than for users without progression (\(n = 11\)). The IMU-derived measurements quantifying arm use postures \(>60^{\circ }\) and risk-to-recovery ratios gave insights into potential risk factors for rotator cuff pathology progression.
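A simple way to derive such risk and recovery events from an IMU-based humeral elevation time series is thresholding followed by run-length analysis, as in the hedged sketch below; the thresholds follow the definitions above, but the sampling rate and signal are assumptions and the "static" requirement on recovery periods is ignored for brevity.

```python
import numpy as np

def arm_use_events(elevation_deg, fs, risk_thresh=60.0, rec_thresh=40.0, rec_min_s=5.0):
    """Count risk events (elevation > 60 deg) and recovery events (>= 5 s below 40 deg)."""
    def run_lengths(mask):
        # lengths (in samples) of consecutive True runs
        padded = np.concatenate(([0], mask.astype(int), [0]))
        edges = np.flatnonzero(np.diff(padded))
        return edges[1::2] - edges[::2]

    risk_runs = run_lengths(elevation_deg > risk_thresh)
    rec_runs = run_lengths(elevation_deg < rec_thresh)
    rec_runs = rec_runs[rec_runs >= rec_min_s * fs]   # keep only sustained recovery periods
    return {"risk_events": len(risk_runs),
            "recovery_events": len(rec_runs),
            "risk_to_recovery": len(risk_runs) / max(len(rec_runs), 1)}

# Synthetic 1-minute humeral elevation trace sampled at 50 Hz.
fs = 50
t = np.arange(0, 60, 1 / fs)
elevation = 45 + 30 * np.sin(2 * np.pi * 0.05 * t)
print(arm_use_events(elevation, fs))
```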

Vignais et al. [201] performed an ergonomic analysis of a material handling task by integrating a subtask video analysis with a RULA computation, using a motion capture system combining IMUs and electrogoniometers. Five workers participated in the experiment, with seven IMUs located on the worker’s upper body (pelvis, thorax, head, arms, forearms). An upper-body biomechanical model continuously provided trunk, neck, shoulder, and elbow joint angles. Wrist joint angles were provided by electrogoniometers synchronized with the IMU system, and the worker’s activity was video recorded at the same time. Joint angles were input to an ergonomic evaluation model during post-processing based on the RULA method. A RULA score was computed at each time step to specify the upper-body exposure risk (right and left sides), and local risk scores were also calculated to identify the anatomical origin of the exposure. The video-recorded work activity was analyzed over time to identify all subtasks included in the task. The mean RULA scores were in the high-risk range for all subjects (6 and 6.2 for right and left sides, respectively). A temporal analysis showed that workers spent most of the work time at a RULA score of 7. Mean local scores showed that the joints most exposed during the task were the elbows, lower arms, wrists, and hands. Elbows and lower arms were at a high risk level during the whole work cycle (\(100\%\) for right and left sides). Wrists and hands were exposed to a risky level during most of the work period (right: \(82.13 \pm 7.46\%\); left: \(77.85 \pm 12.46\%\)). Subtasks called ‘snow thrower’, ‘opening vacuum sealer’, ‘cleaning’ and ‘storing’ were identified as the most unsuitable for right and left sides, where the mean RULA scores and the percentages of time spent were at risky levels.

Villalobos and Mac Cawley [202] proposed an IMU-based system that recognizes human activity using AI and ML techniques to perform task classification and ergonomic assessments in the workplace. The authors placed low-cost IMUs on slaughterhouse workers’ wrists to gain information about their motions, and identified the wrist/hand risk factors leading to work-related musculoskeletal disorders (WRMSDs). They showed that, using low-cost IMU sensors on the workers’ wrists, the sharpness of the knife can be accurately classified and the worker’s RULA score can be predicted.

6.5 Human–computer interaction (HCI)

In [203], the authors proposed a wearable system operating in real-time for recognizing finger air-writing in 3-D space using the Arduino Nano 33 BLE Sense, an edge device that runs TensorFlow Lite for on-device action recognition. The proposed system allows users to write characters (26 English lower-case letters and 10 digits) by moving their fingers in the air. The system uses a DL algorithm to recognize the 36 characters from the motion data collected by the IMUs, processed by the microcontroller embedded in the Arduino Nano 33 BLE Sense. The authors collected 63,000 air-written stroke data samples from 35 subjects (18 males and 17 females), which were used for CNN training. The proposed system achieved a high recognition accuracy of 97.95%.
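A minimal sketch of how such an on-device recognizer could be structured, assuming fixed-length IMU stroke windows, a small 1D CNN over 36 classes, and conversion to TensorFlow Lite for the microcontroller; this is not the authors' exact network, and the window length and layer sizes are assumptions.

```python
import tensorflow as tf

NUM_CLASSES, WINDOW, CHANNELS = 36, 128, 6   # 36 characters; 128 samples of 6-axis IMU data

# Small 1D CNN suitable for an embedded target.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, CHANNELS)),
    tf.keras.layers.Conv1D(16, 5, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=30, validation_split=0.2)  # trained on the stroke dataset

# Convert to TensorFlow Lite so the model can run on a device such as the Nano 33 BLE Sense.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]             # default weight quantization
open("airwriting.tflite", "wb").write(converter.convert())
```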

In [204], the authors proposed a mobility-aware hand gesture segmentation method that detects and segments hand gestures, and utilized a CNN to classify hand gestures in the presence of noise due to mobility. The proposed system, called MobiGesture, was used for healthcare. Using leave-one-subject-out cross-validation, experiments with real subjects showed that the proposed approach achieves 94.0% precision and 91.2% recall when the subject is moving. The proposed algorithm is 16.1%, 15.3%, and 14.4% more accurate than the state-of-the-art when the subject is standing, walking, and jogging, respectively.

In [205], the authors proposed two hand gesture recognition methods based on RNNs. The first method is based on video signals and utilizes a combined CNN and RNN structure; the second uses accelerometer data and only an RNN. A fixed-point optimization, which quantizes most of the weights into two bits, is used to reduce the memory required for storing the weights and to minimize the power consumption of the hardware and software processing.
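A hedged numpy sketch of the kind of fixed-point scheme described, quantizing a weight matrix to four levels (two bits) with a per-tensor scale; the retraining and step-size selection performed in [205] are omitted, and the choice of levels is an assumption.

```python
import numpy as np

def quantize_2bit(w):
    """Uniform quantization of weights to 2 bits (4 levels: -2, -1, 0, +1 times a scale)."""
    scale = np.max(np.abs(w)) / 2.0                               # per-tensor step size
    codes = np.clip(np.round(w / scale), -2, 1).astype(np.int8)   # stored as 2-bit codes
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(64, 32)).astype(np.float32)
codes, scale = quantize_2bit(w)
w_hat = dequantize(codes, scale)
print("mean abs quantization error:", np.mean(np.abs(w - w_hat)))
```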

In [206], the authors proposed the FingerPing sensing technique, which can recognize fine-grained hand poses by analyzing acoustic resonance features. A surface transducer fixed on a thumb ring injects acoustic chirps (20–6000 Hz) into the body, and four receivers on the wrist and thumb gather the chirps. Different hand poses create different paths for the acoustic chirps to travel, producing pose-specific frequency responses at the four receivers. FingerPing can discriminate up to 22 hand poses, including the thumb touching each of the twelve phalanges of the hand, as well as the ten American Sign Language digit poses. Experiments with 16 subjects showed that the proposed system recognizes these two pose sets with accuracies of 93.77% and 95.64%, respectively.

In [207], the authors proposed a transfer learning approach to train target sensor systems from existing source sensor systems. The method utilizes system identification techniques to learn a mapping function between the signals of the source and target sensor systems, and vice versa, which is feasible for sensor signals of the same or of different modalities. The authors proposed two transfer learning models to translate recognition systems based on activity templates/models, depending on the features of the source and target sensor systems. The proposed transfer learning approaches were evaluated in an HCI scenario, transferring between wearable sensors located at various body locations, and between wearable sensors and an ambient depth camera. The results showed that an effective transfer is achievable with just a few seconds of data, irrespective of the transfer direction, for similar or cross sensor modalities.
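The core of such a signal-to-signal mapping can be illustrated with a least-squares fit over a short time-aligned recording; [207] uses system identification methods, so this plain linear regression is only a simplified stand-in, and the channel counts and data are assumptions.

```python
import numpy as np

def fit_mapping(source, target):
    """Least-squares linear map (with bias) from source sensor channels to target channels."""
    X = np.hstack([source, np.ones((source.shape[0], 1))])   # add bias column
    W, *_ = np.linalg.lstsq(X, target, rcond=None)
    return W

def apply_mapping(source, W):
    X = np.hstack([source, np.ones((source.shape[0], 1))])
    return X @ W

# A few seconds of time-aligned data, e.g., wrist IMU (6 channels) -> chest IMU (6 channels).
rng = np.random.default_rng(1)
src = rng.normal(size=(500, 6))
true_W = rng.normal(size=(6, 6))
tgt = src @ true_W + 0.01 * rng.normal(size=(500, 6))

W = fit_mapping(src, tgt)
pred = apply_mapping(src, W)
print("RMSE:", np.sqrt(np.mean((pred - tgt) ** 2)))
```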

In [208], the authors proposed a Quantum Water Strider Algorithm with Hybrid-Deep-Learning-based Activity Recognition, called the QWSA-HDLAR model, which can recognize various types of human activities. The QWSA-HDLAR model utilizes deep transfer learning: a neural-architecture-search network (NASNet)-based feature extractor generates feature vectors, and a QWSA-based hyperparameter tuning method selects the hyperparameters of the NASNet model. The classification of human activities is achieved using a hybrid CNN with a bidirectional RNN model. Experiments on two test datasets, the KTH and UCF Sports datasets, showed that the QWSA-HDLAR model outperforms state-of-the-art DL approaches.

7 Embedded HAR

In this section, we address HAR systems implemented on embedded devices. Basterretxea et al. [209] aimed at an FPGA-based autonomous single-chip HAR system that needs no external computing means or human intervention. The system comprises all typical HAR processing phases, i.e., signal segmentation, feature extraction through signal processing, feature selection through dimensionality reduction of the input space, and activity recognition using a neural network classifier. A physical activity recognition application was used as a frame-of-reference to evaluate the system performance and analyze the potential benefits of FPGAs in prospective wearable HAR applications.

Czabke et al. [210] presented a microcontroller-based algorithm for human activity classification based on a 3-axial accelerometer. For long-term monitoring of activity patterns, the amount of data is kept small through efficient data processing. The authors classified the following activities in real-time: resting, walking, running, and unknown activity. The algorithm ran on the proposed device, Motionlogger, which has the size of a key fob and can be placed in a pocket or a handbag. Testing the algorithm with 10 users who placed the Motionlogger in their pockets showed an average accuracy \(> 90\%\).

Li et al. [211] presented an activity recognition system from passive RFID data using a deep convolutional neural network. For activity recognition, the RFID data are fed directly into a deep convolutional neural network instead of hand-selecting features. A cascade structure first detects “object use” from the RFID data and then predicts the activity; “object use” here means recognizing activities based on the objects being used, i.e., detecting human–object interaction from the sensed data. The proposed system treats activity recognition as a multi-class classification problem and is thus scalable to applications with a large number of activity classes. The system was tested on RFID data obtained in a trauma room, comprising 14 h of RFID data from 16 actual trauma resuscitations (the restoration of vital functions in patients with acute, life-threatening conditions). The proposed system outperformed state-of-the-art activity recognition systems and performed similarly to process-phase detection systems based on wearable sensors. The authors also studied the strengths and challenges of the proposed deep learning architecture for activity recognition using RFID data.

Sundaramoorthy et al. [212] proposed HARNet, a resource-efficient and computationally practical network for online incremental learning and user adaptability, intended as a mitigation technique for recognizing anomalous user behavior. The Heterogeneity Activity Recognition Dataset [152] was used to evaluate HARNet and other proposed variants using acceleration data collected from different mobile platforms. The authors used decimation for down-sampling in order to normalize sampling frequencies across mobile devices, and the Discrete Wavelet Transform to preserve information in both frequency and time. Evaluating HARNet on user adaptability resulted in an accuracy increase of about \(35\%\), by exploiting the model’s ability to extract features that differentiate between activities in heterogeneous environments.

Saponas et al. [213] noted that researchers have explored various approaches, such as services running in the background of mobile phones, cameras in the environment, RFID readers, etc. The authors argued that for HAR systems to transition from an interesting research topic to mainstream technology that enhances day-to-day activities, HAR must use many components of the user’s computing ecosystem, e.g., low-power embedded hardware running continuously on the phone, cloud services tracking the user’s activities long-term and recognizing the current activity from all inputs, etc. The authors provided a brief discussion of how such a system could be implemented and utilized.

Chen et al. [214] proposed a multi-level HAR modeling workflow named Stage-Logits-Memory Distillation (SMLDist) to construct deep convolutional HAR models for embedded hardware. SMLDist comprises stage distillation, memory distillation, and logits distillation. Stage distillation constrains the learning direction of the intermediate features. In memory distillation, the teacher model teaches the student models how to explain and store the inner relationships between high-dimensional features based on Hopfield networks. Logits distillation constructs logits filtered by a smoothed conditional rule to keep the probability distribution and improve the accuracy of the softer targets. The authors compared the accuracy, macro F1 score, and energy cost of a MobileNet V3 model [215] built by SMLDist on embedded platforms with those of various state-of-the-art HAR systems. The batch size of the training and validation samples was set to 256, and the learning rate to \(1 \times 10^{-4}\). The proposed model achieves a good balance between robustness and efficiency. SMLDist can also compress models with only a slight performance loss at an equal compression ratio compared to other advanced knowledge distillation approaches on 7 public datasets.
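As a generic illustration of the logits distillation component (not SMLDist's exact filtered rule), a temperature-softened KL term combined with the usual cross-entropy can be written as follows in PyTorch; the temperature, weighting, and class count are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Combine softened teacher targets (KL term) with the hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random logits for a batch of 256 windows and 7 activity classes.
student = torch.randn(256, 7, requires_grad=True)
teacher = torch.randn(256, 7)
labels = torch.randint(0, 7, (256,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```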

Bhat et al. [216] proposed the first HAR system performing both online training and inference. The proposed system first generates features using the FFT and the discrete wavelet transform of a textile-based stretch sensor and an accelerometer. Based on these features, the authors developed an ANN classifier that is trained online using the policy gradient algorithm. Experiments on a low-power IoT device (TI-CC2650 MCU) with nine subjects showed an accuracy of \(97.7\%\) in recognizing 6 activities and the transitions between them, while consuming less than 12.5 mW of power. In another line of work, Bhat et al. [217] introduced an open-source platform called OpenHealth that allows for health monitoring using wearable sensors. OpenHealth represents a hardware/software suite and wearable devices enabling autonomous integration of clinically relevant data. The OpenHealth platform comprises a wearable device, standard software interfaces, and implementations of activity and gesture recognition applications.
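A hedged sketch of the kind of FFT- and discrete-wavelet-transform-based feature generation described above; the window length, wavelet, and chosen statistics here are assumptions, not the exact features of [216].

```python
import numpy as np
import pywt

def window_features(window, fs=50):
    """Extract simple FFT- and DWT-based features from one sensor channel window."""
    # Frequency-domain features: dominant frequency and total spectral energy.
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1 / fs)
    feats = [freqs[np.argmax(spectrum[1:]) + 1], np.sum(spectrum ** 2)]
    # Wavelet features: energy of each level of a 3-level Daubechies-4 decomposition.
    for coeffs in pywt.wavedec(window, "db4", level=3):
        feats.append(np.sum(coeffs ** 2))
    return np.array(feats)

# One 2.56-s accelerometer-axis window at 50 Hz (synthetic walking-like signal).
t = np.arange(0, 2.56, 1 / 50)
acc_x = np.sin(2 * np.pi * 2.0 * t) + 0.1 * np.random.randn(t.size)
print(window_features(acc_x))
```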

Alessandrini et al. [218] addressed the HAR problem directly on embedded devices by developing an RNN on a low-cost, low-power microcontroller, targeting both high accuracy and low complexity. The authors first implemented an RNN that combines photoplethysmography (PPG) and 3-axial accelerometer data, using the latter to account for motion artifacts in the PPG so as to recognize human activity accurately. They then deployed the RNN to a Cloud-JAM L4 embedded device based on an STM32 microcontroller, optimized to preserve an accuracy \(> 95\%\) while requiring small computational power and memory resources. The experiments showed that such a system can be developed efficiently on a resource-limited platform, aiming at a fully autonomous wearable embedded system for HAR and logging.

Choudhury et al. [219] presented the Mobile Sensing Platform (MSP), a small wearable device designed for accurate activity recognition on embedded devices to support context-aware ubiquitous computing applications. The authors performed many real-world deployments and subject studies and used the results to enhance the hardware/software design and the HAR algorithms.

Ravi et al. [220] proposed a HAR system based on deep learning for accurate and real-time classification on low-power wearable devices. To gain robustness against changes in sensor orientation, placement, and sampling rate, features are generated from the spectral domain of the IMU data, and the proposed method sums temporal convolutions over the transformed input. The accuracy of the proposed approach was assessed against current state-of-the-art techniques on both laboratory and real-world activity datasets. The authors performed a systematic analysis of the parameters used in feature generation and compared activity recognition computation times on mobile devices and sensor nodes. The proposed system was deployed both as an Android app and as an embedded algorithm for the Intel Edison Development Platform, which was used to demonstrate on-node human activity classification using the trained recognition model. With a dual-core Intel Atom CPU running at 500 MHz, wireless connectivity, and compact physical dimensions of \(35.5 \times 25.0 \times 3.9\) mm, the Intel Edison is a small but powerful platform suitable for smart sensor network applications.

Table 14 presents a comparison and summary of several embedded HAR systems, including power consumption, performance metrics, and time consumption.

Table 14 Comparison of several embedded HAR systems, including power consumption, performance metrics and time consumption

8 Discussion and conclusion

Human activity recognition (HAR) has gained tremendous momentum in recent years along with the digitization of all aspects of society in general and individual lives in particular. Of particular significance and potential are those systems based on streaming of inertial data (such as linear and rotational motion) from IMU devices; such sensing capabilities are currently embedded in almost all commodity mobile and wearable devices. As the field is vast and diverse, this article is by no means comprehensive; it is, though, meant to provide a logically and conceptually rather complete picture to advanced readers, as well as a readable, guided introduction for newcomers. Our logical and conceptual perspective mimics the typical data science pipeline for state-of-the-art AI-based systems: starting with datasets, either publicly available or collected; moving to features, either engineered or automatically extracted; developing learning models, including the transfer of pre-trained models to new datasets and/or tasks; and finally validating the trained models. We have also touched on adversarial attacks, as actual deployment involves security/privacy issues. Adversarial attacks against timeseries data are a fairly recent research topic, and only a handful of works are available.

There are huge opportunities on both the research and industrial sides for IMU-based HAR. To the best of our knowledge, there have been very few works addressing the fusion of inertial data with other data modalities such as the visual. Some recent work in empirical psychology aims at merging IMU data with social media texts, trying to profile and understand the mental state of the user and to associate potential risks with behavioral patterns and mood changes, for example. With the COVID-19 outbreak, gyms among other public places were forced to close their doors in order to control the spread of the global pandemic. Moreover, people were advised to stay at home and, in some places, were fined if found outside. Hence, many people’s mental and physical state started to deteriorate. As a result, more and more workout mobile apps gained increasing popularity, and there has been a growing interest in performing exercising and sports indoors with the assistance of smart digital technology for tracking and assessing such exercising. Such apps are based essentially on measuring and tracking motions, and this is where inertial data are readily, naturally, and directly helpful. Many more quality inertial datasets need to be collected and made open, with well-documented setup and data configuration. We believe that for closed HAR systems (where the deployment environment very much resembles the environment in which the model was trained), deep neural methods may not be that effective, especially given the limitations of datasets and their sizes. However, as seen above, deep methods are very effective and useful in transfer learning, handling the deployment of HAR systems in situations and contexts radically different from those where the model was built.

Inertial data streamed from IMU devices are more intrinsic than other data modalities, especially the visual, as the sensors are embedded into wearables and hence carried with the user wherever she is. They also have the advantage of complying better with the security and privacy concerns of users. Their main disadvantages are high sensitivity to noise and low explicit semantic content compared, for example, with visual, audio, or textual data. Hence, smart algorithms are needed to extract the most feasible and valuable information from them.

Some research works have applied advanced 3D data tensor models to medical image analysis, specifically for Alzheimer’s disease [221,222,223,224]. However, the modalities used in these works are visual scans (such as magnetic resonance and, in general, 3D visual data). To the best of our knowledge, these models have not been used for the analysis of inertial motion data, which is the main focus of the current paper. It is nevertheless worth investigating the use of such architectures, and others like normalizing flows and GANs (Generative Adversarial Networks), for the analysis of timeseries inertial motion data.

Another crucial research point in the domain of HAR based on IMU data is that of “adversarial attacks” on time series data. Deep neural networks are highly parameterized complex models, and they can be highly sensitive, and hence non-robust, to small perturbations, as observed by [225]. This is most evident in visual data, where the perturbations can be completely imperceptible to the human eye yet lead to vastly different classification outputs than expected or desired [225]. These perturbations may be explicitly engineered (targeted adversarial attacks) and added to inputs for malicious purposes, which can potentially have devastating consequences in critical production systems. Therefore, there is a need to design methods that are robust to such attacks, detecting and mitigating them, in order to maintain the integrity and correct functionality of such systems [226].

Most of the work done in that direction has targeted visual data, aiming at detection of and mitigation against such attacks, including techniques such as adversarial learning and the careful construction of adversarial examples. Attacking timeseries streaming data can have detrimental consequences as well. For example, models deployed to detect and monitor seismic waves can be fooled into creating fear and panic in society, and wearables used for activity recognition from inertial signals can be fooled into making the user believe that fake actions were performed, or that a certain performance level was attained in a given activity [227].

Only recently have there been some emergent works targeting adversarial attacks against timeseries data, among which are data streamed from IMU sensors. To the best of our knowledge, Fawaz et al. [228] in 2019 were the first to address adversarial attacks against deep learning models trained on timeseries data. Such timeseries adversarial examples can severely affect the predictive performance and reliability of decisions taken by such models in real-world applications, especially those related to healthcare, sports, and security. The crux of this work is that timeseries data, similar to visual data, are also prone to adversarial attacks; the authors even developed adversarial examples by transferring and adapting attacks that had been shown to work on images. More specifically, the authors formulated attacks on timeseries data based on two methods: (1) the Fast Gradient Sign Method (FGSM) and (2) the Basic Iterative Method (BIM). FGSM is based on a one-step update along the sign of the gradient of the loss with respect to the input (the opposite direction to the learning update), while BIM applies this update iteratively with a small step size.
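For concreteness, the sketch below applies FGSM and BIM to a timeseries classifier in PyTorch, following the standard formulations these attacks were adapted from; the 1D-CNN model, window shape, and attack parameters are placeholders, not those of [228].

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.1):
    """One-step attack: move each input sample along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def bim(model, x, y, eps=0.1, alpha=0.01, steps=10):
    """Iterative variant: repeated small FGSM steps, clipped to an eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = torch.clamp(x_adv, x - eps, x + eps)   # stay within the perturbation budget
        x_adv = x_adv.detach()
    return x_adv

# Placeholder 1D-CNN classifier over 6-channel IMU windows of 128 samples, 6 activity classes.
model = torch.nn.Sequential(
    torch.nn.Conv1d(6, 16, 5), torch.nn.ReLU(), torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(), torch.nn.Linear(16, 6))
x = torch.randn(4, 6, 128)
y = torch.randint(0, 6, (4,))
x_adv = bim(model, x, y)
```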

Saponas et al. [213] highlighted the key challenges facing the adoption of HAR in future industrial systems. The authors considered performing HAR using “mobile devices” the crucial factor and accordingly highlighted two important challenges: location independence and continuous operation. They noted that no single device can capture all the context needed for the required HAR use cases, and argued that future HAR systems face the challenge of utilizing all components of the computing ecosystem, from embedded devices to the cloud.

In [229], the authors argue that one might have the false perception that performing HAR using IMUs is already in a perfect state and that no further work needs to be done. On the contrary, among those who are not using the sensing capabilities of their wearable devices, 36% cited measurement inaccuracy and 34% cited improper activity tracking as the reason [230]. The authors argue that HAR is still far from being solved due to the following reasons: (1) the difficulty of defining an activity, (2) heterogeneous benchmarks, (3) the expense and difficulty of collecting ground-truth data for HAR systems, (4) varying subject characteristics that cause diverse activity profiles, and (5) the lack of standardization in storing, processing, and analyzing the data. These points highlight the open research problems in this field and give some insights into future research directions.