1 Introduction

Human activity recognition (HAR) is a challenging research area extensively investigated in the field of Ambient Assisted Living (AAL) (Suryadevara and Mukhopadhyay 2014). The knowledge of the daily behaviour of an elderly person living independently can be valuable information for clinicians (Godfrey et al. 2007). However, self-assessment of daily activities has been shown to be subjective and variable, as a subject’s own assessment can differ from that of an expert in the field (Smith et al. 2005). This explains the increasing attention paid to the development of automatic activity tracking systems for subject monitoring.

Different sensor platforms are utilized with the aim of automatically monitoring a person in a home environment. According to the type of sensor/s employed for data collection, approaches can be roughly divided into three different categories; (1) computer vision (video cameras), (2) ambient sensors such as passive infrared sensors (PIRs), On/Off or open/close sensors, microphones, vibration sensors etc. and (3) wearable devices with inertial sensors such as accelerometers, gyroscopes and magnetometers. Research efforts are currently shifting towards wearable solutions, which avoid occlusion and the privacy concerns related to the use of video cameras in a home environment. In fact, previous surveys regarding the acceptability of the use of wearable devices showed positive results, not only in adults but also within the elderly population (Nelson et al. 2016; O’Brien et al. 2015).

Several efforts have been made to develop systems for HAR using wearable sensors in view of its many applications in fitness, security and health care. Examples include fall detection systems (Ugolotti et al. 2013; Micucci et al. 2017) or sleeping analysis systems (de Arriba-Pérez et al. 2018). However, attention has predominantly been given to fitness applications in which periodic activities such as walking, running or climbing stairs are analysed (Bayat et al. 2014; Casale et al. 2011; Capela et al. 2015; Erdaş et al. 2016). Little research is reported concerning quotidian daily activities, such as eating, drinking or hygiene-related activities, which could also be used as an indicator of a person’s well-being. The authors in (Amft et al. 2007; Dong et al. 2014; Amft and Tröster 2008) investigated activity events related to eating as well as dietary periods. Amft et al. (2010) studied fluid intake. However, hygiene-related activities have not yet been studied in depth.

A HAR system normally embodies data collection, signal pre-processing, feature extraction, feature selection/reduction and activity classification. Given that human activity takes place at frequencies of up to 20 Hz, data collection is normally performed at frequencies of 50–100 Hz to avoid aliasing in the collected signals (Wang et al. 2011a). In the pre-processing step, different filters are applied to eliminate potential noise outside the human activity frequency range, as well as to separate the low frequency component (gravity) from the high frequency component (body motion) (Bayat et al. 2014; Casale et al. 2011; Erdaş et al. 2016; Mannini and Sabatini 2010). After filtering, the signal is normally segmented into sliding windows (Bayat et al. 2014; Casale et al. 2011; Ravi et al. 2005; Wundersitz et al. 2015; Erdaş et al. 2016; Capela et al. 2015) or shorter fundamental movements known as primitives (Zhang and Sawchuk 2012; Garcia-Ceja and Brena 2013) from which feature vectors are extracted.

Features are normally calculated in the time and frequency domains. The extraction of features in the wavelet domain has been investigated (Erdaş et al. 2016); however, the classification results did not improve when these features were added to those calculated in the other two domains. Within the time domain, statistical features have demonstrated good classification results. Examples include mean, standard deviation, inter-quartile range and correlation between accelerometer axes (Casale et al. 2011). Algorithms such as Dynamic time warping (DTW) (Billiet et al. 2016; Bruno et al. 2013) and Histogram of Oriented Gradients (Miyamoto and Ogawa 2014), typically used in computer vision, are also employed as feature descriptors. Once the feature vector is defined, a step of feature selection/reduction is normally applied to remove redundant and irrelevant features that do not contribute significantly to the signal description. Selection models such as Chi-Square (Erdaş et al. 2016), Analysis of variance (ANOVA) (Wundersitz et al. 2015), Lasso Regression (Wundersitz et al. 2015), or filter-based approaches such as Relief-F or correlation-based feature selection (CFS) (Capela et al. 2015), have been used in previous work for dimensionality reduction purposes. Finally, the classification step takes place. Although unsupervised classification approaches have been investigated in the past (Kwon et al. 2014), human activity recognition is typically tackled as a supervised learning problem. A wide range of supervised classifiers such as support vector machines (SVM), neural networks, random forest (RF) or K-nearest neighbours (KNN) have been used in previous studies (Jafari et al. 2007).

After reviewing the previous work on activity recognition using wearable sensors, our observation was that some activities are less easily discriminated than others. Due to this, the proposed models so far struggle to recognize specific activities sharing a high degree of similarity.

The aim of the research reported here is to mitigate the drawback above by developing a novel model able to autonomously identify those less discriminative activities to improve their classification rates. This was achieved by the development of a novel multi-level refinement approach able to autonomously identify such activities and optimize their feature vectors. The study, based on data from a wrist-mounted tri-axial accelerometer worn on the dominant hand of the users, includes a set of seven quotidian activities of daily living (ADLs). The results achieved show a classification accuracy improvement for all the refined activities, giving support to the use of the methodology proposed.

The rest of this paper is organized as follows. Section 2 reviews previous work on the use of wearable devices for activity recognition. Section 3 describes the methodology followed to develop the multi-level refinement classification approach. Section 4 discusses the results. Section 5 justifies the use of the multi-level refinement step in future activity classification studies.

2 Related work

Many attempts have been made to develop HAR systems using wearable sensors. Studies have varied with regard to the type of sensors, the number of them, their location, the activities to be tracked, the pre-processing techniques, the feature extraction and selection methods, the classification approaches as well as the research purpose itself. Bayat et al. (2014) used a smart-phone with a single tri-axial accelerometer to test the activity recognition results on a set of six activities, carrying the phone in the pocket and carrying the phone in the hand. They achieved a maximum accuracy of \(91.15\%\) on the ‘in-hand’ experiment using a combination of different classifiers. Casale et al. (2011) studied a group of five activities using a single tri-axial accelerometer worn on the chest, achieving a maximum accuracy of \(94\%\) using the Random Forest Classifier. Ravi et al. (2005) used a tri-axial accelerometer worn near the pelvic region to study eight different activities, obtaining an accuracy of over \(99\%\) by combining different classifiers through plurality voting. However, data was collected from only two subjects, thus compromising the variability of the experiment.

Wundersitz et al. (2015) studied a circuit composed of eight fitness activities using a system embodying a tri-axial accelerometer and a tri-axial gyroscope embedded in a vest and located at the upper trunk of the experiment participants, achieving a maximum accuracy of \(92\%\) using a logistic model tree (LMT). Zhang and Sawchuk (2012) used data from a tri-axial accelerometer and a tri-axial gyroscope worn at the right hip to develop a Bag-of-Words (BoW) approach applied to activity recognition. They achieved a maximum accuracy of \(92.7\%\) using a K-Means clustering algorithm for primitive construction and a support vector machine with a radial basis function (RBF) kernel for the classification of those primitives. Erdaş et al. (2016) used data from a tri-axial accelerometer worn on the chest to study seven activities. Their approach included an ensemble feature selection which, combined with a Random Forest classifier, obtained a maximum accuracy of \(88\%\). Godfrey et al. (2007) used two accelerometers worn on the sternum and on the right thigh to classify sitting, standing and stepping with an accuracy of around \(98\%\). Jafari et al. (2007) classified sit-to-stand, stand-to-sit, lie-to-stand and stand-to-lie movements using a tri-axial accelerometer, reporting an average accuracy of \(84\%\) with a K Nearest Neighbours classifier.

Suto and Oniga (2018) investigated the impact of signal processing, activation functions and hyper-parameter selection on the classification performance of artificial neural networks (ANNs) on two different activity sets. The maximum classification accuracy achieved was 99.4% when using five sensor nodes and 97% when using a single sensor unit, with each sensor unit embodying a tri-axial accelerometer and a tri-axial gyroscope.

Dong et al. (2014) developed a detector for periods of eating from data collected by a wrist-worn tri-axial accelerometer and gyroscope, combining a custom wrist energy peak detector and a naive Bayes classifier, achieving an accuracy of \(81\%\). Munoz-Organero and Lotfi (2016) developed a stochastic model for recognizing and classifying different types of steps and falls using a tri-axial accelerometer worn on the abdomen for model training and on the chest for validation, achieving a step classification accuracy of \(91.14\%\). Micucci et al. (2017) studied the recognition of falls across three different datasets including accelerometer data from different activities of daily living as well as from simulated falls. The performance of one-class and two-class KNN and SVM classifiers was evaluated. The highest average sensitivity (96.5%) was achieved with the two-class SVM. Garcia-Ceja and Brena (2013) tackled a long-term activity recognition problem as a distribution of simple activities with an accelerometer worn on the belt of the users, achieving a maximum accuracy of \(92.5\%\) using a KNN classifier. Wang et al. (2011a) developed a Hidden Markov Model (HMM) to classify six different daily activities using data from a waist-worn tri-axial accelerometer, achieving a recognition accuracy of \(94.8\%\).

Billiet et al. (2016) studied six transition activities in rheumatic and musculoskeletal patients using a bi-axial accelerometer worn on the biceps of the dominant arm. This study included signal-processing features and pattern-based features applying dynamic time warping (DTW), obtaining an average accuracy of \(93.5\%\). Mannini and Sabatini (2010) developed a Gaussian Continuous Hidden Markov Model (cHMM)-based sequential classifier using data from five bi-axial accelerometers to classify seven different activities, achieving a maximum accuracy of \(98.4\%\). Amft and Tröster (2008) investigated dietary events using four inertial measurement units (IMUs) worn on the upper and lower parts of the arms, an ear microphone and a sensor collar composed of a stethoscope microphone and an electromyogram (EMG), obtaining a maximum recognition rate of \(82\%\) for the classification of cutlery usage, drink, spoon usage and eating only using the hand. Using the same sensors, Amft et al. (2007) studied the intake of six different types of food and water by the use of a Probabilistic context-free grammar (PCFG) parser, achieving an average classification rate of \(80\%\). Amft et al. (2010) studied the recognition of sips by using the feature similarity search (FSS) as feature descriptor, alongside traditional features, for data collected by a wrist-worn system embodying a tri-axial accelerometer, a tri-axial gyroscope and a compass, as well as a magnetic coupling sensor to measure the relative position between the wrist and the shoulder of subjects. The achieved sip detection rate was \(89.2\%\). Bruno et al. (2013) proposed a recognition model for motion primitives relying on Gaussian mixture modelling (GMM) and Gaussian mixture regression (GMR) based on dynamic time warping (DTW) and Mahalanobis distance descriptors, using data from a single wrist-worn accelerometer.

Among the studies above, a common feature can be seen: some activities are less easily discriminated than others. For example, Wang et al. (2011a) struggled to classify ‘sit’ and ‘fall’. Billiet et al. (2016) observed that the highest number of false detections in their study occurred in two specific activities, ‘get up’ and ‘maxreach’. Amft et al. (2007) found the groups ‘spoon’ and ‘apple’ to have a considerably lower detection rate than others. Zhang and Sawchuk (2012) had difficulty classifying ‘walking upstairs’ and ‘walking downstairs’ among other walking-related activities. This problem has motivated the development of the multi-level refinement approach presented in this paper.

3 Methodology

A simple wristband which includes a tri-axial accelerometer has proven to be the most acceptable form of wearable sensor. There are many readily available wristbands which incorporate other sensors, including gyroscope, GPS, heart rate variability (HRV) and temperature sensors. A tri-axial accelerometer is a sensor that provides simultaneous acceleration measurements in three orthogonal directions, namely the x, y and z axes, to represent acceleration \(A_x\), \(A_y\) and \(A_z\), respectively. From that data, an informative feature set can be calculated and the activity classification is subsequently tackled as a supervised classification problem. The use of a tri-axial accelerometer on the wrist is illustrated in Fig. 1. As an alternative to a tri-axial accelerometer, a gyroscope could also be used. However, a gyroscope consumes approximately ten times the power of an accelerometer (Dong et al. 2014), which makes it excessively power consuming for all-day monitoring systems. The decision was therefore taken to use only accelerometer data which, as shown by previous HAR studies, is sufficient to achieve high activity classification rates.

Fig. 1 Diagram showing how the accelerometer is mounted on the wrist and the direction of the orthogonal output signals \([A_x, A_y, A_z]\)

Based on the literature review presented in Sect. 2, it can be observed that the general question of HAR is being addressed via several approaches. To the best of our knowledge, previous work in the field lacks further analysis of classification results which can lead to better classification performance.

In the remainder of this section, the multi-level refinement activity classification approach will be explained step by step. This investigation differs from earlier studies as follows. Firstly, some of the activities included in the activity set have not been studied together. Secondly, the proposed approach does not end at the classification step as traditional approaches do. Instead, the classification performance, in the form of a confusion matrix, is examined to identify those activities which worsen the performance of the proposed model, with the aim of directing the computational effort towards reducing the misclassification rate of the model. A comparison between the methodology used in this paper and that in most previous work is illustrated in Fig. 2.

The motivation for this design is the fact that signals from different activities differ from each other in different ways. Consequently, a subset of features can be informative to differentiate between two specific activities but not necessarily between two other activities. In the case of unbalanced datasets (some class or classes have a lower amount of data), performing a feature selection on the whole activity set could miss out relevant features for specific activities from which less data is available, since the aim of the feature selection is the optimization of the overall model classification accuracy.

Fig. 2 Diagram representing the steps employed by previous work for activity classification (top) and in this paper (bottom). It can be seen that after the classification step takes place, an additional multi-level refinement step is included, whereby pairs of activities which worsen the performance of the classification model are grouped together for further inspection

3.1 Data collection

Six subjects, two female and four male, between 21 and 36 years of age, participated in this research experiment. They were asked to perform a set of seven quotidian daily activities:

  1. Hand washing,

  2. Teeth brushing,

  3. Standing,

  4. Sitting,

  5. Picking up an object from the floor,

  6. Walking upstairs,

  7. Walking downstairs.

Fig. 3 Diagram of the activities studied and their accelerometer signals (1 s windows)

The volunteers were asked to wear the sensor on their dominant hand but no instructions were given as to how to perform the activities, adding reality and variability to the data. Since human activity happens at no more than 20 Hz (Wang et al. 2011a), a sampling frequency of 100 Hz was used which according to the Nyquist theorem will be sufficient to avoid undesirable aliasing effects on the collected signals.

A Meta Motion R wristband-mounted system was used for data collection (MetaMotionR 2017). A visual representation of the system alongside one-second windows of the signals from the different activities studied in this paper can be observed in Fig. 3. This system includes, among other sensors, a tri-axial accelerometer, which was employed in this experiment. After the data was collected, it was sent via Bluetooth Low Energy to a smart-phone using the Metabase app, which allows the sensors to be configured and the sensor data to be accessed. The question of where an inertial sensor should be worn to optimize the information collected remains unresolved, since the optimal location depends on the chosen activity set. However, compared to other areas of the human body, the wrist enables a high degree of freedom. Additionally, the wrist is a natural place for instrumentation, which avoids undesired obtrusiveness. Social acceptance is also likely to increase due to the resemblance of a wrist-worn device to a common watch.

3.2 Pre-processing

The accelerometer data is composed of three different time series \(a_{x_t}, a_{y_t}, a_{z_t}\), which correspond to the medio-lateral, vertical and antero-posterior acceleration inputs respectively. A fourth time series, namely \(|a_{t}|\), is calculated as the magnitude (Euclidean norm) of the three-dimensional acceleration vector, \(|a_{t}| = \sqrt{a_{x_t}^2 + a_{y_t}^2 + a_{z_t}^2}\).

Different smoothing and filtering techniques are applied to each time series. First of all, a median filter with a window length \(w_{l} = 7\) has been used for smoothing purposes. Wang et al. (2011b) demonstrated that this filter has a competitive signal to noise ratio (SNR) as compared to other types of filters used for accelerometer data. On top of the median filter, a low-pass 20 Hz Butterworth filter is applied to filter out those frequencies not related to human activity. According to previous studies, human activity takes place at frequencies lower than 20 Hz (Wang et al. 2011a).

Two different components can be extracted from the accelerometer raw data: the gravity component, which is associated with the low frequency component of the signal, and the linear acceleration caused by the motion itself, which is associated with the high frequency component of the original signal. As in previous work (Casale et al. 2011), a cut-off frequency of 1 Hz was used to extract the gravity component and the motion component from the signal.
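As a hedged illustration, the pre-processing chain described so far (median smoothing with \(w_l = 7\), 20 Hz low-pass filtering and the 1 Hz gravity/motion split) could be sketched as follows. The filter order (second order), the forward-only filtering, the edge handling of the median filter and the computation of the motion component as the filtered signal minus the gravity estimate are all assumptions, since the text does not specify them; the function names are hypothetical.

```python
import math
from statistics import median


def median_filter(signal, window_length=7):
    """Sliding median; the window shrinks at the edges (an assumption)."""
    half = window_length // 2
    n = len(signal)
    return [median(signal[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def butterworth_lowpass(signal, cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass as one biquad, forward pass only.

    The order and the single-pass filtering are assumptions.
    """
    if not signal:
        return []
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2Q) with Q = 1/sqrt(2)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b0 = (1.0 - cos_w0) / 2.0 / a0
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    # Start from the steady state of the first sample to avoid a start-up transient.
    x1 = x2 = y1 = y2 = signal[0]
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def preprocess(signal, fs_hz=100.0):
    """Return (filtered, gravity, motion) for one acceleration time series."""
    smoothed = median_filter(signal, window_length=7)
    filtered = butterworth_lowpass(smoothed, cutoff_hz=20.0, fs_hz=fs_hz)
    gravity = butterworth_lowpass(filtered, cutoff_hz=1.0, fs_hz=fs_hz)
    # Assumption: motion is what remains after removing the gravity estimate.
    motion = [f - g for f, g in zip(filtered, gravity)]
    return filtered, gravity, motion
```

Under these assumptions, a constant input (pure gravity) yields a gravity series equal to the input and a motion series of zeros, which is the behaviour the 1 Hz split is meant to achieve.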

In addition to the time series above, the rate of change of acceleration (jerk) of the signal, before it is split into motion and gravity components, was calculated, thereby obtaining four additional time series. The resultant time series are as follows:

  • Original signal: [\(a_{x_{t}}, a_{y_{t}}, a_{z_{t}}, |a_{t}|\)]

  • Motion: [\(a_{x_{{m}_{t}}}, a_{y_{{m}_{t}}}, a_{z_{{m}_{t}}}, |a_{{m}_{t}}|\)]

  • Gravity: [\(a_{x_{{g}_{t}}}, a_{y_{{g}_{t}}}, a_{z_{{g}_{t}}}, |a_{{g}_{t}}|\)]

  • Jerk: [\(a_{x_{{j}_{t}}}, a_{y_{{j}_{t}}}, a_{z_{{j}_{t}}}, |a_{{j}_{t}}|\)], where \(|a_{{j}_{t}}|= \frac{\partial |a_{t}| }{ \partial t }\)
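The magnitude and jerk series listed above could be derived along the following lines. This is a minimal sketch: the first-difference approximation of the derivative, the repetition of the first value to preserve the series length, and the function names are assumptions rather than details given in the text.

```python
import math


def magnitude(ax, ay, az):
    """Euclidean norm of the acceleration vector at each time step."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]


def jerk(signal, fs_hz=100.0):
    """Approximate the time derivative by a first difference scaled by fs.

    The first value is repeated so the output has the same length as the
    input (an assumption made for convenience).
    """
    if len(signal) < 2:
        return [0.0] * len(signal)
    diffs = [(b - a) * fs_hz for a, b in zip(signal, signal[1:])]
    return [diffs[0]] + diffs


# Usage: the jerk of the magnitude series, as in the definition of |a_j|.
ax, ay, az = [3.0, 3.0, 6.0], [4.0, 4.0, 8.0], [0.0, 0.0, 0.0]
a_mag = magnitude(ax, ay, az)      # [5.0, 5.0, 10.0]
a_mag_jerk = jerk(a_mag)           # derivative of |a_t| in units per second
```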

The segmentation of the signals is performed using sliding windows with a window length of 1 s and a \(40\%\) overlapping percentage. Features will be calculated from these windows a posteriori.
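With a 100 Hz sampling rate, a 1 s window holds 100 samples and a 40% overlap corresponds to a step of 60 samples between consecutive windows. A minimal segmentation sketch is given below; discarding a trailing partial window and the function name are assumptions.

```python
def sliding_windows(signal, fs_hz=100, window_s=1.0, overlap=0.4):
    """Split a time series into fixed-length, overlapping windows."""
    window_len = int(round(window_s * fs_hz))
    step = max(1, int(round(window_len * (1.0 - overlap))))
    # Assumption: a trailing segment shorter than one window is dropped.
    return [signal[start:start + window_len]
            for start in range(0, len(signal) - window_len + 1, step)]


# Usage: 2.2 s of data gives windows starting at samples 0, 60 and 120.
windows = sliding_windows(list(range(220)))
```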

3.3 Feature extraction

Guided by previous work (Ravi et al. 2005; Erdaş et al. 2016; Dargie 2009), different features in the time and frequency domains were evaluated. Within the time domain, different statistical features were explored. These include measures of central tendency such as the mean and root mean square (RMS), measures of statistical dispersion such as range, standard deviation and inter-quartile range, measures of distribution shape such as kurtosis or skewness, measures of dependence between different axes such as Pearson’s correlation, and measures of the magnitude of a varying quantity such as signal magnitude area. In addition to the statistical features, measures such as peak frequency, zero-crossings frequency and signal entropy were calculated. After converting the signal to the frequency domain through the Fast Fourier Transform (FFT), features such as the largest magnitude of the signal spectrum, the index of the spectrum component with the highest magnitude and the energy of the signal across all the spectrum components were explored. Except in a few cases where it was not appropriate, features were calculated over all the time series presented in Sect. 3.2. The dimensionality of the resultant feature vector is \(n=266\).

This feature set has been carefully selected so as to provide informative and discriminative information with regard to a wide array of signal characteristics, such as range, dispersion, central tendency, periodicity, frequency distribution, magnitude and changes in direction. Descriptions of the selected features are presented below:

  • Mean:

    $$\begin{aligned} \overline{a}=\frac{1}{W_l}\sum _{t=0}^{W_l} a_t \end{aligned}$$
    (1)

    where \(a_t\) is the acceleration at time t and \(W_l\) is the window length expressed as number of samples.

  • Standard deviation:

    $$\begin{aligned} \sigma = \sqrt{\frac{\sum _{t=0}^{W_l} (a_t-\overline{a})^2}{W_l-1}} \end{aligned}$$
    (2)

    where \(a_t\) is the acceleration at time t, \(W_l\) is the window length expressed as number of samples, and \(\overline{a}\) the mean acceleration of the corresponding window.

  • Signal magnitude area:

    $$\begin{aligned} sma = \frac{1}{W_l} \int _{0}^{T=W_l} \left( \left| a_{x_t}-\overline{a_x}\right| +\left| a_{y_t}-\overline{a_y}\right| +\left| a_{z_t}-\overline{a_z}\right| \right) dt \end{aligned}$$
    (3)

    where \(a_{{x}_{t}}, a_{{y}_{t}}, a_{{z}_{t}}\) are the acceleration at time t on the x, y and z axes respectively, \(W_l\) is the window length expressed as number of samples, and \(\overline{a_x}\), \(\overline{a_y}\), \(\overline{a_z}\) the mean acceleration on the corresponding axis in the corresponding window.

  • Signal entropy:

    $$\begin{aligned} H(a)=\sum _{t=0}^{W_l} \left| a_t-\overline{a}\right| \log _{10} \left| a_t-\overline{a}\right| \end{aligned}$$
    (4)

    where \(a_t\) is the acceleration at time t, \(W_l\) is the window length expressed as number of samples and \(\overline{a}\) the mean acceleration in the corresponding window.

  • Correlation:

    $$\begin{aligned} r_{xy}=\frac{Cov(a_x,a_y)}{\sigma (a_x)\sigma (a_y)} \end{aligned}$$
    (5)

    where \(Cov(a_x,a_y)\) is the covariance of the acceleration on the axes x and y, and \(\sigma (a_x)\) and \(\sigma (a_y)\) are the standard deviations of the acceleration on the axes x and y respectively.

  • Skewness:

    $$\begin{aligned} \gamma _1=\frac{\frac{1}{W_l}\sum _{t=0}^{W_l} (a_t-\overline{a})^3}{(\sigma (a))^3} \end{aligned}$$
    (6)

    where \(a_t\) is the acceleration at time t, \(W_l\) is the window length expressed as number of samples and, \(\overline{a}\) and \(\sigma (a)\) are the mean acceleration and the standard deviation in the corresponding window respectively.

  • Kurtosis:

    $$\begin{aligned} \beta _2=\frac{\frac{1}{W_l}\sum _{t=0}^{W_l} (a_t-\overline{a})^4}{(\sigma (a))^4} \end{aligned}$$
    (7)

    where \(a_t\) is the acceleration at time t, \(W_l\) is the window length expressed as number of samples and, \(\overline{a}\) and \(\sigma (a)\) are the mean acceleration and the standard deviation in the corresponding window respectively.

  • Root mean square:

    $$\begin{aligned} RMS=\sqrt{\frac{1}{W_l}\sum _{t=0}^{W_l}(a_t)^2} \end{aligned}$$
    (8)

    where \(a_t\) is the acceleration at time t and \(W_l\) is the window length expressed as number of samples. The root mean square is calculated on all the time series presented in Sect. 3.2.

    The transformation from the time domain to the frequency domain has been computed using the Fast Fourier Transform:

    $$\begin{aligned} A(k) = \sum _{t=0}^{W_l-1}a_te^{(-i2\pi kt/W_l)} \end{aligned}$$
    (9)

    where \(a_t\) is the acceleration at time t and \(W_l\) is the window length expressed as number of samples, A(k) is the sequence of \(W_l\) complex-valued numbers given the sequence of data a(t).

  • Energy:

    $$\begin{aligned} E=\frac{\sum _{k=1}^{W_l}\left| a_k\right| ^2}{W_l} \end{aligned}$$
    (10)

    where \(a_1, a_2, \ldots , a_{W_l}\) are the FFT components of the corresponding window of length \(W_l\).
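The features defined in Eqs. (1) to (10) can be computed per window along the lines of the sketch below. It is an illustration under stated assumptions, not the implementation used in the study: sums run over the samples available in the window, the integral in Eq. (3) is discretised as a sum over samples, zero deviations are skipped in Eq. (4) to avoid \(\log_{10} 0\), the covariance in Eq. (5) uses the same \(W_l - 1\) normalisation as Eq. (2), and a direct discrete Fourier transform stands in for the FFT. All function names are hypothetical.

```python
import cmath
import math


def mean(w):                                   # Eq. (1)
    return sum(w) / len(w)


def std(w):                                    # Eq. (2), W_l - 1 in the denominator
    m = mean(w)
    return math.sqrt(sum((a - m) ** 2 for a in w) / (len(w) - 1))


def sma(wx, wy, wz):                           # Eq. (3), integral taken as a sum
    mx, my, mz = mean(wx), mean(wy), mean(wz)
    total = sum(abs(x - mx) + abs(y - my) + abs(z - mz)
                for x, y, z in zip(wx, wy, wz))
    return total / len(wx)


def signal_entropy(w):                         # Eq. (4), zero terms skipped
    m = mean(w)
    devs = (abs(a - m) for a in w)
    return sum(d * math.log10(d) for d in devs if d > 0.0)


def correlation(wx, wy):                       # Eq. (5)
    mx, my = mean(wx), mean(wy)
    cov = sum((x - mx) * (y - my) for x, y in zip(wx, wy)) / (len(wx) - 1)
    return cov / (std(wx) * std(wy))


def moment_ratio(w, order):                    # Eq. (6) for order 3, Eq. (7) for order 4
    m = mean(w)
    moment = sum((a - m) ** order for a in w) / len(w)
    return moment / std(w) ** order


def rms(w):                                    # Eq. (8)
    return math.sqrt(sum(a * a for a in w) / len(w))


def dft(w):                                    # Eq. (9), direct O(n^2) transform
    n = len(w)
    return [sum(a * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, a in enumerate(w))
            for k in range(n)]


def energy(w):                                 # Eq. (10)
    return sum(abs(c) ** 2 for c in dft(w)) / len(w)
```

For windows of 100 samples the direct transform is fast enough for illustration; a library FFT would normally be preferred in practice.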

3.4 Feature selection/reduction

As stated by Mannini and Sabatini (2010), when the dimension of a feature space is considerably high, learning the parameters for a classifier becomes a difficult and time-consuming task. In addition, feature selection/reduction can maintain or even increase the discriminative capability of the whole feature set. The current study has explored the use of three different methods for dimensionality reduction. First, an analysis of variance (ANOVA) was conducted, where features were ranked according to their F measure, which is calculated as the ratio of the variance between classes to the variance within classes. Even though the use of ANOVA on non-normally distributed data can increase the chances of obtaining false positives, the F measure here is only used to rank features according to the dissimilarity between classes. After features are ranked, the subset that maximizes the classification result is selected. The other two approaches explored were principal component analysis (PCA) and truncated singular value decomposition (SVD). These two approaches perform an orthogonal transformation of the data into a new coordinate system whose coordinates are those which maximize the variance of the data. The difference between the two approaches is that PCA centres the data before computing the singular value decomposition.
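The ANOVA-based ranking can be illustrated with the following sketch, which computes a one-way F statistic for each feature column and orders the features from most to least discriminative. The function names and the handling of a zero within-class variance are assumptions; choosing how many top-ranked features to keep would still require evaluating a classifier for each candidate subset size, as described above.

```python
def anova_f_scores(X, y):
    """One-way ANOVA F statistic for every feature column of X.

    X is a list of feature vectors (rows) and y the list of class labels.
    """
    classes = sorted(set(y))
    n_samples, n_features = len(X), len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        grand_mean = sum(column) / n_samples
        between = within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            group_mean = sum(values) / len(values)
            between += len(values) * (group_mean - grand_mean) ** 2
            within += sum((v - group_mean) ** 2 for v in values)
        between /= len(classes) - 1          # variance between classes
        within /= n_samples - len(classes)   # variance within classes
        if within == 0.0:
            # Assumption: a perfectly separated feature gets an infinite score.
            scores.append(float("inf") if between > 0.0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def rank_features(X, y):
    """Feature indices sorted by decreasing F statistic."""
    scores = anova_f_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Usage: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 1.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.0]]
y = [0, 0, 1, 1]
ranking = rank_features(X, y)
```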

3.5 Classification

As mentioned above, seven activities were investigated in this study. Each activity was recorded for each person separately and then labelled accordingly, so that the data could be fed into the selected classifiers after feature extraction and selection. The performance of three different classifiers was evaluated: K-nearest neighbours (KNN), random forest (RF) and support vector machine (SVM) using a radial basis function (RBF) kernel. The optimal classifier was selected during the feature selection stage along with the optimal number of features/components. A 3-fold cross-validation method was used to test the robustness of the classification results. Each fold included three different sets: the training set, the validation set and the test set. The validation set was used to identify refinement opportunities and the test set to validate the performance improvement during the different refinement steps.

3.6 Multi-level refinement

Our multi-level refinement can be defined as an algorithm that aims at optimizing the classification accuracy of a group of classes by an improvement on the recognition rate of those classes which lower the classification rate of the whole group. Its implementation is justified by the fact that in a classification problem, a classification accuracy lower than \(100\%\) is normally caused by the difficulty to classify specific classes, unless the recognition rate is identical for all the classes, though this is not a common occurrence.

After the activity classification takes place, the confusion matrix is further analyzed and activities are compared in pairs. If the classification accuracy between any pair of activities is lower than that of the whole model, those activities are grouped together for refinement. Activities which are found to lower the accuracy of the system due to their misclassification rate with an activity already belonging to a refinement group are added to that same group; otherwise a new refinement group is constructed with this pair of activities. At this point the feature selection is optimized for each group, selecting the most informative feature set for each of them. This process is repeated until groups of two activities are reached.

The multi-level refinement then focuses the computational efforts on the classification of those activities that are more difficult to classify in the first place. A pseudocode of the multi-level refinement algorithm is shown in Algorithm 1.

Algorithm 1 Pseudocode of the multi-level refinement algorithm
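The grouping step described in this section can be sketched as below. This is an interpretation under stated assumptions and not a transcription of Algorithm 1: the pairwise accuracy of two activities is taken as the fraction of correctly classified samples within the 2 x 2 sub-matrix of the confusion matrix restricted to that pair, and the function name is hypothetical.

```python
def refinement_groups(cm):
    """Group classes whose pairwise accuracy is below the overall accuracy.

    cm is a square confusion matrix (rows: true class, columns: predicted).
    Returns a list of groups, each a sorted list of class indices.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    overall = sum(cm[i][i] for i in range(n)) / total
    groups = []  # disjoint sets of class indices
    for i in range(n):
        for j in range(i + 1, n):
            pair_total = cm[i][i] + cm[i][j] + cm[j][i] + cm[j][j]
            if pair_total == 0:
                continue
            # Assumption: accuracy restricted to the two classes of the pair.
            pair_accuracy = (cm[i][i] + cm[j][j]) / pair_total
            if pair_accuracy < overall:
                merged = {i, j}
                remaining = []
                for g in groups:
                    # Join any existing group that already holds i or j.
                    if g & merged:
                        merged |= g
                    else:
                        remaining.append(g)
                groups = remaining + [merged]
    return [sorted(g) for g in groups]


# Usage: classes 1-2 and 2-3 are confused, so they end up in one group.
cm = [[100, 0, 0, 0],
      [0, 90, 10, 0],
      [0, 10, 80, 10],
      [0, 0, 10, 90]]
groups = refinement_groups(cm)
```

Each resulting group would then receive its own feature selection and classifier, and the same grouping would be applied again to the confusion matrix of that smaller model until only pairs remain.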

4 Results

In this section, the experimental results reached are presented, explained and discussed. This section is divided as follows; Sect. 4.1 examines the feature selection methods proposed for dimensionality reduction. Section 4.2 presents the classification results and the improvement achieved by the multi-level refinement algorithm. Finally Sect. 4.3 discusses the results obtained.

4.1 Feature reduction

ANOVA K-best, PCA and SVD are computed to find out the optimal subset of features/components for the description of the data set. To do so, the performance of the different feature selection methods is examined across all the possible values ranging from n = 1 to n = 266 (whole feature set). These three feature reduction techniques are examined on the three different classifiers proposed; random forest, KNN and SVM. The best classification performance (average per-class classification accuracy = 99.15%) is achieved using ANOVA K-Best alongside a random forest classifier when n = 149, n being the number of features after ranking according to their F ratio. The performance of the different feature selection methods with random forest across the number of dimensions can be observed in Fig. 4.

Fig. 4 Classification performance of the feature selection methods based on a random forest classifier

4.2 Classification and refinement

The first step in deploying the multi-level refinement algorithm is to train and evaluate a 7-class classification model representing the 7 studied activities using the training set and test set respectively. The classification resulted in the confusion matrix and classification metrics presented in Tables 1 and 2, respectively.

Table 1 Confusion matrix before refinement using a standard classification method
Table 2 Classification metrics of the 7-class model before refinement

The average per-class classification accuracy achieved by the model is 99.15%; however, there are notable differences in precision and recall between activities. At this point, the multi-level refinement algorithm is used to identify the activities that lower the performance of the model. This is performed using the validation set. “Teeth Brushing”, “Walking Downstairs” and “Walking Upstairs” were identified as needing refinement. For convenience, this group of activities is referred to as “Group 1”.
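The per-class quantities discussed here can be derived directly from a confusion matrix. The sketch below shows one way to do so and to flag candidate activities for refinement. The rule used to flag a class (precision or recall below a fixed threshold) is an assumption made for illustration and is not necessarily the criterion applied by the refinement algorithm; the function names and the toy matrix are likewise hypothetical.

```python
def per_class_metrics(cm):
    """Precision, recall and accuracy per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    metrics = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                  # true class c, predicted otherwise
        fp = sum(row[c] for row in cm) - tp   # predicted c, true class differs
        tn = total - tp - fn - fp
        metrics.append({
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "accuracy": (tp + tn) / total,
        })
    return metrics

def needs_refinement(cm, threshold):
    """Indices of classes whose precision or recall is below `threshold`.

    The threshold rule is an illustrative assumption, not the paper's rule.
    """
    return [c for c, m in enumerate(per_class_metrics(cm))
            if min(m["precision"], m["recall"]) < threshold]

# Toy 3-class confusion matrix (illustrative only): classes 1 and 2 are
# confused with each other, class 0 is classified perfectly.
cm_toy = [[10, 0, 0],
          [0, 8, 2],
          [0, 3, 7]]
flagged = needs_refinement(cm_toy, 0.9)
```

On the toy matrix, the two mutually confused classes are flagged while the perfectly classified one is not, which mirrors how a group such as Group 1 would be formed.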

Taking into account the activities within Group 1, feature selection was performed again and a new 3-class model covering these activities was trained using the training set. The aim of this step is to optimize the feature vector for the classification of the identified set of activities; the resulting vector was found to have a dimensionality of n = 131. This new 3-class model was then used to reclassify the samples that had previously been predicted as belonging to Group 1. The classification metrics and resultant confusion matrix after the first refinement step can be seen in Tables 3 and 4, respectively.

Table 3 Classification metrics after the first refinement step
Table 4 Confusion matrix after the first refinement step

After the first refinement step, the same process was repeated. In this case, the 3-class classification model and the validation set were used to identify activities that needed further refinement. The new set comprised the activities “Walking Downstairs” and “Walking Upstairs”, and is referred to as “Group 2”. A new 2-class model was trained (using the training set) with a feature vector optimized for the classification of the activities within Group 2. The dimensionality of the vector in this case was n = 186. Note that the number of dimensions increased compared with the previous refinement step (n = 131). This may be due to the high similarity in terms of acceleration between walking downstairs and walking upstairs. The samples from the test set previously classified as belonging to Group 2 were reclassified using the new 2-class model, leading to the classification metrics shown in Table 5 and the confusion matrix shown in Table 6.
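The two refinement steps described above can be viewed as a cascade in which each stage re-predicts only the samples whose current label falls inside the group being refined. The following sketch illustrates that control flow under simplifying assumptions: each refined model is represented by a plain callable, and the toy models, feature tuples and label strings are hypothetical stand-ins rather than the trained classifiers used in this work.

```python
def refine_cascade(samples, predictions, stages):
    """Apply refinement stages in order.

    `stages` is a list of (group, model) pairs. At each stage, every sample
    whose current label belongs to `group` is re-predicted by `model`;
    all other labels are left untouched.
    """
    labels = list(predictions)
    for group, model in stages:
        labels = [model(x) if label in group else label
                  for x, label in zip(samples, labels)]
    return labels

# Hypothetical stand-ins for the 3-class and 2-class refined models.
def model_group1(x):
    return "brush" if x[0] < 0.5 else "up"

def model_group2(x):
    return "up" if x[1] > 0 else "down"

# Toy samples (two made-up features each) and initial 7-class-style labels.
samples_toy = [(0.1, 0.0), (0.9, -1.0), (0.8, 2.0), (0.2, 1.0)]
initial_toy = ["sit", "up", "brush", "brush"]
stages_toy = [
    ({"brush", "down", "up"}, model_group1),  # first refinement (Group 1)
    ({"down", "up"}, model_group2),           # second refinement (Group 2)
]
refined_toy = refine_cascade(samples_toy, initial_toy, stages_toy)
```

In this sketch a sample labelled outside every group (here "sit") is never re-predicted, so the refinement cost is spent only on the confusable activities. In a real pipeline each refined model would also apply its own feature selection before predicting.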

Table 5 Classification metrics after the second refinement step
Table 6 Confusion matrix after the second refinement step

The successive refinement steps improve the per-class classification accuracy of the refined classes. As a result, the proposed multi-level refinement approach performs better than standard off-the-shelf classification methods. This improvement is visualized in Fig. 5, which shows the per-class classification accuracy of the refined activities and the average per-class classification accuracy of the 7-class model across the different refinement steps.

Fig. 5 Comparison of activity classification accuracy before and after the refinement steps

4.3 Validation and discussion

To validate the multi-level refinement algorithm, a test was run on the Anguita et al. (2013) data set. The data set contains data collected from 30 volunteers performing a group of six different activities of daily living (ADLs) while wearing a tri-axial accelerometer and a tri-axial gyroscope on the waist. After the initial classification, two groups were formed: (1) sitting and laying, and (2) walking downstairs and walking upstairs. An improvement in classification performance was observed for all four grouped activities. The per-class classification accuracy improved from 98.29 to 100% for sitting, from 98.29 to 100% for laying, from 99.19 to 99.37% for walking downstairs, and from 99.39 to 99.56% for walking upstairs.

These results suggest that the proposed multi-level refinement can improve classification accuracy for those activities that are most difficult to distinguish from one another when following a traditional classification approach. In addition, this method may benefit unbalanced experiments in which data from specific activities are more difficult to collect than data from others. After the data set is split for cross-validation, the data from those specific activities may not be sufficient to classify them against similar activities. This problem could be reduced by applying the multi-level refinement approach presented.

5 Conclusion

In this paper we propose a novel multi-level refinement technique for optimizing classification results in a HAR problem using accelerometer data. The proposed refinement algorithm autonomously analyses the confusion matrix, finding those activities that worsen the performance of the model and grouping them together for further inspection. Individual groups of activities are then studied separately, and this process is repeated until only two activities remain in each group. The results show that classification improves at each refinement step. Activities with low classification rates in the first place obtained better classification accuracy when studied separately, in some cases even with lower-dimensional feature vectors.

This suggests that feature informativeness depends on the activity set chosen. Computational effort should therefore be directed to the particular groups of activities (or classes) with lower classification performance, in order to optimize the selection of features and thus their classification rate. This approach could have a significant positive impact when the recognition of a specific class is crucial to the study, as, for example, in a fall detection system.