1 Introduction

Human activity recognition is an important topic in human–computer interaction and has been used in multiple fields, e.g., games (PS VR), interactions with devices via gestures, security systems, and lifelogging. Human activity recognition is also required in daily life. For example, abnormal patient activities must be detected automatically and appropriately. Automatic detection of abnormal activities reduces the burden on nurses and enables them to help other people. Activity recognition could also prove useful in corporate management. Automatic collection and analysis of the working states of employees may improve working efficiency. Therefore, it is important to develop a monitoring system that can automatically, correctly, and unobtrusively recognize and record the activities of daily living (ADL).

2 Related Work

Multiple methods for human activity recognition have been developed, some of which have achieved a high accuracy of coarse-grained activity classification using a single sensor or multiple sensors and classifiers. However, these methods are not sufficiently precise for recognizing the differences in daily activities that include asymmetric and small movements of both hands.

Maurer [1] proposed a method employing 4 sensors (a dual-axis accelerometer, a light strength sensor, a temperature sensor, and a sound sensor) that were attached on the wrist of a user and used a C4.5 decision tree as a classifier. The method achieved a recognition accuracy of 93% in an experiment classifying 7 types of ambulation activities that have motions larger than those involved in daily deskwork.

Berchtold [2] used an acceleration sensor that was embedded in a smartphone placed in the pocket or held by a hand and a fuzzy logic-based classifier. The recognition accuracy was between 71% and 97% in an experiment classifying 15 types of activities. In addition, the recognition accuracy of the activities “sitting or standing in a moving bus” and “dancing” was in the range 71%–73%.

Riboni [3] collected acceleration data from both a smartphone and a device attached on a user’s waist. With a logistic multiclass classifier, the recognition accuracy in an experiment identifying 23 types of daily human activities, including “brushing teeth” and “writing on the blackboard,” was 93%. However, incorrect recognition occasionally occurred when 2 activities involved similar body movements. For example, the activity “writing on the blackboard,” the recognition accuracy of which was 76.8%, was easily recognized incorrectly as “standing still,” the recognition accuracy of which was 65.16%.

Kao [4] introduced a method based on a fuzzy basis function using a three-dimensional (3D) acceleration sensor that was attached on a user’s waist. In an experiment classifying 7 types of daily activities, this method achieved an overall recognition accuracy of 95%, even though some activities that contained similar activity patterns of the upper limbs were misclassified. For example, the activity “knocking,” a combination of hand and body movements, was misclassified as “running,” which has a clear body movement pattern.

Using a smartphone or multiple sensors attached to the user’s waist, related works achieved high recognition accuracy for ambulation activities such as “walking” or “running.” However, the recognition accuracy was lower for activities involving hand movements. We think that analyzing the differences between the movements of both hands may enhance the accuracy of recognizing small asymmetric hand movements, which is one of the requirements of our society.

Daily desk activities generally involve small asymmetric hand movements. Therefore, we believe using only a single hand movement is insufficient for accurately recognizing these activities. Guiard [5] presented a theoretical framework of the asymmetry of human bimanual actions. Each hand has a different role, and different tasks have different movement combinations of both hands. The use of both hands is essential for differentiating activities involving asymmetric hand movements and small body movements.

Moreover, previous studies have shown that using multiple sensors is helpful in achieving high recognition accuracy; however, users generally feel uncomfortable when multiple accelerometers are attached to body parts such as the waist, ankle, and buttock [6]. We believe that using accelerometers only on a user’s wrists will reduce the inconvenience of the sensors in daily life.

After comparing multiple machine learning methods, we decided that the support vector machine (SVM) and random forest (RF) methods would perform well with a number of features and reduce overfitting.

3 Objectives

Based on related studies, we believe that enhancing the recognition accuracy is a goal that deserves experimental evaluation. This paper evaluates a method that uses acceleration data from both wrists of a user to recognize ADL with small asymmetric hand movements, which differ from the activities considered in related studies. With machine learning methods, the acceleration of both hands should be able to characterize the differences between daily activities.

We designed an experiment to recognize several daily activities and evaluate the performance of features, classifiers, and participants. In our experiment, many combinations of basic statistics and Fourier transform coefficients were evaluated as conditions to find the best one.

4 Experiment

This section describes the target activities, measuring devices, participants, extracted features, combinations of features, classifiers, and collected dataset used in our experiment.

4.1 Target Activities

To focus on the classification of activities with hand movements and reduce the influence of body movements, including walking or running, we chose 4 types of daily deskwork activities performed in the sitting position as target activities (Table 1).

Table 1. Target activities.

4.2 Measuring Device

Two 3D acceleration sensors were attached to both the wrists of a user, as shown in Fig. 1. These sensors were used to collect the acceleration data in three orthogonal directions at a sampling rate of 100 Hz. Using Bluetooth, the data were transferred to a computer in real time.

Fig. 1. A photograph showing the positions of the sensors.

4.3 Participants

In total, 10 participants (1 female and 9 male university students with an average age of 25 years) participated in the experiment. During the experiment, a participant sat in front of the desk. No other specific instructions concerning the body posture or the method to finish the target activities were provided to the participants.

4.4 Features

The acceleration data collected from one accelerometer at time t correspond to a 3D vector, \( V\left( t \right) = \left( {x_{t} ,y_{t} ,z_{t} } \right) \), where \( x_{t} \), \( y_{t} \), and \( z_{t} \) represent the acceleration data of the X, Y, and Z axes, respectively. The proposed method extracts features from the amplitude A(t), which is expressed as Eq. (1):

$$ A\left( t \right) = \sqrt{x_{t}^{2} + y_{t}^{2} + z_{t}^{2}}. $$
(1)

A(t) is not affected by the orientation of gravity although the accelerometers worn by different participants may have had different orientations during the experiment.
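As a minimal sketch of Eq. (1), the amplitude can be computed with NumPy as follows. The function name `amplitude` and the array layout (one row per sample, columns x, y, z) are assumptions made for illustration and are not taken from the experiment.

```python
import numpy as np

def amplitude(samples: np.ndarray) -> np.ndarray:
    """Return A(t) for an array of shape (n_samples, 3) holding x, y, z."""
    samples = np.asarray(samples, dtype=float)
    return np.sqrt(np.sum(samples ** 2, axis=1))

# Made-up readings of a stationary sensor in two orientations (units of g).
flat = np.array([[0.0, 0.0, 1.0]])
tilted = np.array([[0.6, 0.0, 0.8]])
print(amplitude(flat), amplitude(tilted))
```

Both made-up readings have the same amplitude, which illustrates why A(t) does not depend on how the sensor is oriented on the wrist.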

The sampled time series data were segmented into time windows for feature extraction. Our analysis used consecutive overlapping windows whose overlap was 75%. The window length w was 2.56 s, i.e., 256 samples, which means that the overlapping time p was 1.92 s. Multiple previous studies have used window lengths that ranged from 1 s [3] to 6.7 s [7] because each pair of ambulation movements, e.g., “sitting–standing–sitting,” can be completed in 6 s. Based on our experience, we think that the important hand/finger movements in deskwork activities need to be captured within 3 s. Therefore, we used a window length of 2.56 s, which is less than 3 s.
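The segmentation described above can be sketched as follows. With 256-sample windows and 75% overlap, consecutive windows share 192 samples, so each window starts 64 samples (0.64 s) after the previous one. The function name `sliding_windows` and the placeholder signal are assumptions for illustration.

```python
import numpy as np

WINDOW = 256      # samples per window (2.56 s at 100 Hz)
OVERLAP = 0.75    # fraction shared by consecutive windows
STEP = int(WINDOW * (1 - OVERLAP))  # 64 samples between window starts

def sliding_windows(a: np.ndarray, window: int = WINDOW, step: int = STEP) -> np.ndarray:
    """Split a 1-D amplitude series into overlapping windows, one per row."""
    a = np.asarray(a, dtype=float)
    if a.size < window:
        return np.empty((0, window))
    starts = range(0, a.size - window + 1, step)
    return np.stack([a[s:s + window] for s in starts])

signal = np.arange(1000, dtype=float)   # 10 s of placeholder data
windows = sliding_windows(signal)
print(windows.shape)                    # (12, 256)
```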

Two types of features were extracted from each window: basic statistics (maximum, minimum, and average) and Fourier transform coefficients. A Fourier transform can help us obtain the energy of each frequency component from a signal with multiple frequency components.

The maximum, minimum, and average of A(t) between time \( t_{a} \) and \( t_{a} + w \) are denoted as \( Max(t_{a} ) \), \( Min(t_{a} ) \), and \( Ave(t_{a} ) \), respectively. A basic statistics vector, \( \varvec{S}(t) \), comprises \( Max(t) \), \( Min(t) \), and \( Ave(t) \). The suffixes R and L represent the right and left hands, respectively. For example, \( \varvec{S}_{L} \left( t \right) \) represents the basic statistics vector of the left hand. Meanwhile, 128 Fourier transform coefficients were extracted from A(t) in each window; these coefficients are denoted as F(t).
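The two feature types can be sketched for a single window as follows. The text does not state how the 128 coefficients were obtained, so taking the magnitudes of the lowest bins of a one-sided FFT is an assumption, as are the function names and the placeholder window.

```python
import numpy as np

def basic_statistics(window: np.ndarray) -> np.ndarray:
    """Return [Max, Min, Ave] of the amplitude values in one window."""
    return np.array([window.max(), window.min(), window.mean()])

def fourier_coefficients(window: np.ndarray, n_coeffs: int = 128) -> np.ndarray:
    """Return the magnitudes of the n_coeffs lowest-frequency FFT bins.

    Assumption: coefficients are magnitudes of the one-sided FFT.
    """
    spectrum = np.abs(np.fft.rfft(window))   # 129 bins for a 256-sample window
    return spectrum[:n_coeffs]

# Placeholder window: a 5 Hz oscillation around 1 g, sampled at 100 Hz.
t = np.arange(256) / 100.0
window = 1.0 + 0.1 * np.sin(2 * np.pi * 5 * t)

s = basic_statistics(window)
f128 = fourier_coefficients(window)
f5 = fourier_coefficients(window, 5)
print(s.shape, f128.shape, f5.shape)   # (3,) (128,) (5,)
```

Under this assumption, a shorter set of coefficients is simply the leading part of the full set, starting from the lowest frequency component.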

Our evaluation used \( Max_{R} ,Max_{L} , Min_{R} , Min_{L} , Ave_{R} \), and \( Ave_{L} \) at times \( t, t + p,\, t + 2p,\,t + 3p,\,t + 4p,\,{\text{and}}\,t + 5p \) as well as \( \varvec{F}_{\varvec{L}} \left( t \right) \) and \( \varvec{F}_{\varvec{R}} \left( t \right) \). A set of features at times \( t, t + p, t + 2p, t + 3p, t + 4p,\,{\text{and}}\,t + 5p \) contains information of the temporal progress, which is important to correctly classify activities. Therefore, to compare the importance of features, we composed several types of feature vectors using the basic statistics of time t (S1) and the basic statistics of times \( t, t + p, t + 2p, \,t + 3p, t + 4p,\,{\text{and}}\,t + 5p \) (S6). F128 represents a full range of Fourier transform coefficients. F5 and F10 have 5 and 10 Fourier transform coefficients, respectively, starting from the lowest frequency component.
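As an illustration of how a feature vector spanning six time windows for both hands might be assembled, the sketch below concatenates the three basic statistics of six consecutive windows from each wrist, giving 2 × 6 × 3 = 36 values. The ordering of the values inside the vector, the function names, and the random placeholder data are assumptions.

```python
import numpy as np

def stats_per_window(windows: np.ndarray) -> np.ndarray:
    """Map windows of shape (n, w) to rows of [Max, Min, Ave], shape (n, 3)."""
    return np.column_stack([windows.max(axis=1), windows.min(axis=1), windows.mean(axis=1)])

def both_hands_s6(left_windows: np.ndarray, right_windows: np.ndarray, k: int = 6) -> np.ndarray:
    """Build feature vectors from k consecutive windows of each hand.

    Row i concatenates the statistics of windows i .. i+k-1, left hand first.
    The ordering inside the vector is an assumption.
    """
    left, right = stats_per_window(left_windows), stats_per_window(right_windows)
    n = min(len(left), len(right)) - k + 1
    rows = [np.concatenate([left[i:i + k].ravel(), right[i:i + k].ravel()]) for i in range(n)]
    return np.array(rows).reshape(max(n, 0), 2 * 3 * k)

rng = np.random.default_rng(0)
left = rng.normal(1.0, 0.1, size=(20, 256))    # placeholder windows, left wrist
right = rng.normal(1.0, 0.1, size=(20, 256))   # placeholder windows, right wrist
features = both_hands_s6(left, right)
print(features.shape)   # (15, 36)
```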

4.5 Conditions

A condition specifies the components of a feature vector: B represents both hands, L represents the left hand, and R represents the right hand. For example, B(S1 + F128) represents a feature vector containing the basic statistics of time t and the full range of Fourier transform coefficients of both hands. In total, 33 conditions of the feature vectors are listed in Table 2.

Table 2. Components of the feature vectors.

4.6 Classifiers

For activity recognition, two types of machine learning methods were used to build classifiers: SVM and RF. The scikit-learn library for the Python programming language was used to build SVM classifiers [8] with the radial basis function kernel and RF classifiers [9] with 10 trees and without a depth limit. For SVM, this experiment used C = 1.0 and gamma = 1/(number of features) to evaluate all datasets, where C trades off the misclassification of training examples against the simplicity of the decision surface and gamma defines how far the influence of a single training example reaches [10]. Moreover, a grid search was performed to find the best parameters for the dataset.
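The default classifier settings described above could be written with scikit-learn roughly as follows. This is a sketch rather than the code used in the experiment: the synthetic data from `make_classification` and the fixed `random_state` are assumptions added only to make the example runnable and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# gamma="auto" makes scikit-learn use 1 / n_features, matching the setting above.
svm = SVC(kernel="rbf", C=1.0, gamma="auto")

# 10 trees grown without a depth limit (max_depth=None).
rf = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=0)

# Synthetic stand-in data: 4 classes, 36 features (not the collected dataset).
X, y = make_classification(n_samples=200, n_features=36, n_informative=10,
                           n_classes=4, random_state=0)

for name, clf in [("SVM", svm), ("RF", rf)]:
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))
```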

4.7 Collected Dataset

We collected data from 10 participants on different days and extracted samples from these data for analysis. The sizes of the extracted samples are listed in Table 3.

Table 3. Sample sizes.

4.8 Experimental Procedure

Each participant was asked to perform the target activities. In “keyboard typing,” the participants were asked to use a full mechanical keyboard with a Japanese layout and perform a copy-typing task of a part of an article, as shown in Fig. 2. The article was a Japanese computer science article published in an IEICE magazine and contained both Japanese and English text. “Writing” was a task to copy, with a pen, a part of a different article published in the same magazine. One black ballpoint pen and one sheet of A4 paper were arranged for each participant. In “browsing on a tablet,” the participants were asked to find the specified places on Google Maps using an 8-inch HUAWEI Android tablet (JDN-W09). These places were not displayed on maps with scales of over 200 m. The participants were required to find the target locations starting from a 1-km scale map by knowing only their names. They were not allowed to use the search function. In “finding pages,” the participants were asked to find 5 pages in an English computing survey magazine of 171.4 mm × 254 mm size.

Fig. 2. A photograph showing the experimental performance for “keyboard typing.”

5 Evaluation

5.1 Evaluation Method

The accuracy of recognition was based on a 10-fold cross validation. Cross validation [11] is a method to estimate how accurately a predictive model will perform in practice by repeatedly separating the dataset and using one part as test data and the other parts as training data. Furthermore, cross validation can effectively limit problems such as overfitting.

Two types of evaluation were used in this paper: the evaluation of the overall accuracy of activity discrimination and the evaluation of individual accuracy. In the former, the datasets of the 10 participants were merged into one total dataset, which was randomly partitioned for the 10-fold cross validation. To find the best feature vector and classifier, the overall accuracy was evaluated under all conditions listed in Table 2.

In the evaluation of individual accuracy, the accuracy for each participant was evaluated via 10-fold cross validation using the best feature vector and classifier obtained in the evaluation of the overall accuracy.
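A 10-fold cross validation with random partitioning can be sketched with scikit-learn as follows. The synthetic data, the choice of `KFold` with shuffling, the fixed seeds, and the pooling of out-of-fold predictions before scoring are all assumptions; the exact procedure used in the experiment may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for a merged dataset: 4 classes, 36 features.
X, y = make_classification(n_samples=400, n_features=36, n_informative=10,
                           n_classes=4, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Every sample is predicted exactly once, by a model that did not train on it.
y_pred = cross_val_predict(clf, X, y, cv=cv)

p = precision_score(y, y_pred, average="macro")
r = recall_score(y, y_pred, average="macro")
f = 2 * p * r / (p + r)   # harmonic mean of macro precision and macro recall
print(f"macro precision={p:.2f} recall={r:.2f} F-score={f:.2f}")
```

For the individual evaluation, the same sketch would be run once per participant on that participant's samples only.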

5.2 Evaluation Criterion

The measures of accuracy include precision, recall, and F-score, a measure combining precision and recall. For a given class, precision is the fraction of the items predicted as that class that actually belong to it, while recall is the fraction of the items actually belonging to that class that are predicted as it.

In multiclass classification problems, although the performance of a classification model can be evaluated using precision, recall, and F-score, there are different ways to calculate them, e.g., micro-average, macro-average, and weighted average. Macro-averaging gives every class equal weight, so macro-averaged precision and recall are affected by minority classes, which means that the macro-average can reveal the performance of classifiers more precisely. Therefore, in this study, macro-averaged precision and recall were used for the evaluation.

The precision, recall, and F-score can be calculated using a confusion matrix, which can show the direct relationship of actual classes with predicted classes in a single table. A sample multiclass confusion matrix is summarized in Table 4.

Table 4. Multiclass confusion matrix.

Let \( m_{jk} \) be the (j, k)-th element in an n × n confusion matrix. \( m_{jk} \) represents the count of data items that belong to an actual class \( C_{j} \) but are recognized as class \( C_{k} \). For class \( C_{i} \), we can calculate the precision \( P\left( {C_{i} } \right) \) and recall \( R\left( {C_{i} } \right) \) using Eqs. (2) and (3):

$$ P\left( {C_{i} } \right) = \frac{{TP_{i} }}{{TP_{i} + FP_{i} }} $$
(2)
$$ R\left( {C_{i} } \right) = \frac{{TP_{i} }}{{TP_{i} + FN_{i} }} $$
(3)

where true positive \( (TP_{i} ) \), false positive \( (FP_{i} ) \), false negative \( (FN_{i} ) \), and true negative \( (TN_{i} ) \) are defined by Eqs. (4) to (7).

$$ TP_{i} = m_{ii} $$
(4)
$$ FP_{i} = \left( \sum\nolimits_{j = 1}^{n} m_{ji} \right) - TP_{i} $$
(5)
$$ FN_{i} = \left( \sum\nolimits_{k = 1}^{n} m_{ik} \right) - TP_{i} $$
(6)
$$ TN_{i} = \left( \sum\nolimits_{j = 1}^{n} \sum\nolimits_{k = 1}^{n} m_{jk} \right) - TP_{i} - FP_{i} - FN_{i} $$
(7)

The macro-averaged precision \( P \) and recall \( R \) are expressed as Eqs. (8) and (9).

$$ P = \frac{{\sum\nolimits_{j = 1}^{n} {P\left( {C_{j} } \right)} }}{n} $$
(8)
$$ R = \frac{{\sum\nolimits_{j = 1}^{n} {R\left( {C_{j} } \right)} }}{n} $$
(9)

The F-score, which is the harmonic mean of \( P \) and \( R \), is expressed as Eq. (10).

$$ F = \frac{2 \cdot P \cdot R}{P + R} $$
(10)

The F-score was used for the evaluation to compare the accuracy of recognition.
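The calculation in Eqs. (2) to (10) can be sketched with NumPy as follows. The function name `macro_scores` and the 3-class matrix are made up for illustration; the matrix is not experimental data.

```python
import numpy as np

def macro_scores(m):
    """Return macro precision, macro recall, and their harmonic mean.

    m[j, k] counts items of actual class j that were recognized as class k.
    """
    m = np.asarray(m, dtype=float)
    tp = np.diag(m)
    fp = m.sum(axis=0) - tp      # column total minus the diagonal, Eq. (5)
    fn = m.sum(axis=1) - tp      # row total minus the diagonal, Eq. (6)
    precision = (tp / (tp + fp)).mean()   # Eqs. (2) and (8)
    recall = (tp / (tp + fn)).mean()      # Eqs. (3) and (9)
    f_score = 2 * precision * recall / (precision + recall)  # Eq. (10)
    return precision, recall, f_score

# Made-up 3-class confusion matrix, for illustration only.
m = np.array([[8, 2, 0],
              [1, 9, 0],
              [1, 1, 8]])
p, r, f = macro_scores(m)
print(round(p, 3), round(r, 3), round(f, 3))
```

Note that this F-score is the harmonic mean of the macro-averaged precision and recall, which is generally not the same value as the average of per-class F-scores.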

6 Results

6.1 Results of the 10-Fold Cross-Validation

The experimental results of RF and SVM are shown in Figs. 3 and 4, respectively.

Fig. 3. F-score of the random forest (RF) method.

Fig. 4. F-score of the support vector machine (SVM) method.

As shown in Figs. 3 and 4, for S6, the F-score of RF reached 0.90, and that of SVM reached 0.93. Moreover, the result of B(S6) with parameter tuning, which corresponds to B(S6) after tuning the parameters in SVM, achieved the highest recognition accuracy, i.e., 0.94.

Conversely, the recognition accuracy of RF for F5 was 0.55, and that of SVM for F10 was 0.4, which is the lowest recognition accuracy in this evaluation.

As shown in Figs. 3 and 4, the accuracy of using the acceleration data from both hands is higher than the accuracy of using the acceleration data from only one hand. This means that feature vectors that have the features of both hands are more effective than those that have the features of only one hand. Meanwhile, the recognition accuracy of the feature vector S6 was better than those of other vectors, which means that the basic statistics are more effective than the Fourier transform coefficients. The confusion matrices of B(S6) with parameter tuning of SVM, which corresponds to S6 of both hands after tuning the parameters, and B(S6) of RF are summarized in Tables 5 and 6, respectively.

Table 5. Confusion matrix [support vector machine (SVM)] of B(S6) with parameter tuning.
Table 6. Confusion matrix [random forest (RF)] of B(S6).

6.2 Results for Individuals

We evaluated a model for each participant based on B(S6) with parameter tuning in SVM and B(S6) in RF, which are the most effective feature vectors according to the evaluation described in Subsect. 6.1. The dataset of each participant was used to build a classification model, and the model was evaluated via 10-fold cross validation. The results for each participant are shown in Fig. 5.

Fig. 5. Evaluation of the individuals.

Figure 5 shows that the F-scores of RF for all participants were between 0.92 and 0.99. In addition, for each participant, the F-scores of RF were slightly higher than the F-scores of SVM, which were between 0.90 and 0.99. However, the F-scores of participant 4 were 0.90 and 0.92 in SVM and RF, respectively, which were lower than those of the others. The confusion matrices of participant 4 for SVM and RF are shown in Tables 7 and 8, respectively.

Table 7. Confusion matrix (SVM) of participant No. 4.
Table 8. Confusion matrix (RF) of participant No. 4.

7 Discussion

7.1 Period

Figures 3 and 4 illustrate that the accuracies of recognition of L(S6), R(S6), and B(S6) are better than those of L(S1), R(S1), and B(S1), respectively, for SVM and RF. This finding shows that a long sampling period contributes toward improving the recognition accuracy in the case considering only one hand.

7.2 Hands

From a comparison of the recognition accuracies obtained in the conditions shown in Figs. 3 and 4, we can observe that the recognition accuracy for the case involving both hands (B) is consistently higher than that for the case involving only one hand. For the best feature vector S6, the recognition accuracies for the case involving both hands were 0.93 and 0.91 for SVM and RF, respectively. These accuracies were higher than those obtained using either the left hand or the right hand. These results show that features from both hands are more effective than features from only one hand.

However, as summarized in Tables 5 and 6, the activity “browsing on a tablet” was easily misclassified as “keyboard typing” or “finding pages,” while the activity “finding pages” was easily misclassified as “browsing on a tablet.” It is likely that some movement patterns shared by these activities are not yet discriminated accurately; they need to be studied further.

7.3 Fourier Transform Coefficients

By comparing all the conditions in Figs. 3 and 4, we can see that the Fourier transform coefficients are not as effective as the basic statistics (maximum, minimum, and average) in this evaluation. However, using a combination of basic statistics and a few Fourier transform coefficients can achieve nearly the same result as using basic statistics alone.

Furthermore, comparing the results of F128, F10, and F5, we see that the use of Fourier transform coefficients alone cannot achieve high recognition accuracy. This means that Fourier transform coefficients are not useful in this evaluation.

7.4 Classifiers

Figures 3 and 4 show that on average, RF can achieve better recognition accuracy than SVM. However, using S6 in SVM resulted in the highest recognition accuracy, i.e., 0.93. Moreover, the accuracy of S6 increased to 0.94 after using the best parameter in the grid search. The best parameters were C = 1000, gamma = 0.001, and kernel = rbf. This shows that tuning the parameters of SVM can enhance the recognition accuracy.
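A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The candidate values below are a hypothetical grid that merely contains the reported best values (C = 1000, gamma = 0.001); the grid actually searched is not given in the text. The synthetic data, the 5-fold setting (chosen to keep the sketch fast), and the `f1_macro` scorer are also assumptions. `f1_macro` averages per-class F-scores, which differs slightly from the F-score of Eq. (10).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data: 4 classes, 36 features (not the collected dataset).
X, y = make_classification(n_samples=200, n_features=36, n_informative=10,
                           n_classes=4, random_state=0)

# Hypothetical candidate values; only C=1000 and gamma=0.001 come from the text.
param_grid = {
    "C": [1, 10, 100, 1000],
    "gamma": [0.0001, 0.001, 0.01],
    "kernel": ["rbf"],
}

search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The best parameters found on the synthetic data have no bearing on the experiment; the sketch only shows the mechanics of the search.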

Conversely, tuning the number of trees in RF does not enhance the recognition accuracy. In this evaluation, the parameter n_estimators was set to 10, 100, and 1000 to test the classifier; however, the recognition accuracies were nearly the same.

In addition, Fig. 5 shows the results of the individual evaluations. For most participants, RF can achieve better recognition accuracy with individual datasets than SVM.

7.5 Individual Variation

As shown in Fig. 5, the recognition accuracy of each participant was over 0.9 and most recognition accuracies were over 0.95 in RF, which means that for 10 participants, this method can achieve good recognition accuracy.

However, the recognition accuracy of participant 4 was slightly lower than those of the others. We think this is because participant No. 4 was left-handed even though he behaved like a right-handed person in daily life. As summarized in Tables 7 and 8, confusion can easily occur for the activities “keyboard typing” and “writing.” In addition, the activity “browsing on a tablet” was easily misclassified as “finding pages” or “writing,” whereas the activity “finding pages” was easily misclassified as “keyboard typing” or “browsing on a tablet.” Participant 4 finished the target activities using the commonly used hand (his non-dominant hand); however, we believe that the movements of his dominant hand may have affected the recognition accuracy. We need to investigate such individual variations further.

8 Conclusions

This paper evaluated a method to recognize deskwork activities based on machine learning algorithms using acceleration data from both wrists. The evaluation was performed on 33 conditions of feature vectors comprising two types of features. The experimental findings suggested that the recognition accuracy was better when using data from both hands than when using data from only one hand. Moreover, a feature vector containing only the basic statistics features of 6 time windows was the most effective among all feature vectors and achieved an F-score of over 0.90 for each participant in the individual evaluations.

In addition, the classifiers were found to influence the recognition accuracy. RF achieved better recognition accuracy than SVM under most conditions. However, the highest recognition accuracy was achieved by using SVM.

Furthermore, our results suggested that analyzing the acceleration of movements of both hands could enhance the accuracy of recognizing daily activities with small asymmetric movements.