Journal of Ambient Intelligence and Humanized Computing, Volume 5, Issue 1, pp 77–89

Wearable sensor-based human activity recognition from environmental background sounds


  • Y. Zhan, Department of Electronics and Electrical Engineering, Keio University
  • Tadahiro Kuroda, Department of Electronics and Electrical Engineering, Keio University
Original Research

DOI: 10.1007/s12652-012-0122-2

Cite this article as:
Zhan, Y. & Kuroda, T. J Ambient Intell Human Comput (2014) 5: 77. doi:10.1007/s12652-012-0122-2


Abstract

Understanding individuals' activities, social interactions, and group dynamics within a society is one of the fundamental problems that social and community intelligence (SCI) research faces. Environmental background sound is a rich information source for identifying individual and social behaviors. Therefore, many power-aware wearable devices with sound recognition functions are widely used to trace and understand human activities. The design of these sound recognition algorithms faces two major challenges: limited computation resources and a strict power consumption requirement. In this paper, a new method for recognizing environmental background sounds with a power-aware wearable sensor is presented. By employing a novel low-calculation one-dimensional (1-D) Haar-like sound feature with hidden Markov model (HMM) classification, this method achieves high recognition accuracy while still meeting the wearable sensor's power requirement. Our experimental results indicate that an average recognition accuracy of 96.9 % is achieved when testing with 22 typical environmental sounds related to personal and social activities. The method outperforms other commonly used sound recognition algorithms in terms of both accuracy and power consumption. This is very promising for future integration with other sensors to provide more trustworthy activity recognition results for the SCI system.


Keywords: Social and community intelligence · Digital footprint · WSNs · Sound recognition · Haar-like feature · HMM

1 Introduction

The past decade witnessed rapid development in basic Internet and communication theories and in newly emerging technologies such as wireless sensor networks (WSNs) (Culler et al. 2004) and wearable sensing and computation (Bonfiglio and Rossi 2011); these technologies are gradually entering an applicable stage where they can be used for various purposes. This technological advancement has recently led to the emergence of a brand-new research area: social and community intelligence (SCI) (Zhang et al. 2011). With SCI technology, individuals' behavior patterns, social interactions, and community dynamics within a society can be explored, collected, analyzed, and managed. In addition, applications of SCI technology will help enrich our daily lives and improve our society's efficiency.

“Community detection and social behavior analysis” and “socially-aware computing” are two major topics of SCI research (Zhang et al. 2011; Pentland 2005). Reliable detection and comprehension of individual activities and person-to-person interactions in a society is a fundamental problem encountered in SCI research. Information pertaining to personal and social activities can be detected and traced through the so-called “digital footprint” left behind by people while interacting with cyber-physical spaces (Zhang et al. 2011; Guo et al. 2011b). With various sensors embedded in mobile phones and wearable devices, and with the involvement of WSN technology, people's daily information can be digitized and perceived. This facilitates the understanding of the “digital footprint” of individual and social interactions inside an organization (Pentland 2005; Choudhury 2004; Laibowitz et al. 2006; Yano et al. 2009; Yano et al. 2008; Guo et al. 2011a). Among these sensing media, acoustic sound is a rich information source for identifying individual and social behaviors.

In this study, a sound sensor embedded in the wearable sensor node (Yano et al. 2009; Nishimura et al. 2008) shown in Fig. 1 is utilized to recognize environmental background sounds occurring around people. These sounds contain useful information for understanding what activities an individual performs. They can also act as a social “bridge” among people. By recognizing these sounds continuously throughout the day, a log of a person's daily activities can be established. This log indicates personal and social interactive information, and many SCI applications can be created from it. For example, it is very helpful for establishing household medical systems such as remote monitoring and diagnosis of patients, and daily physical and health monitoring of individuals at home. Log information can also assist in understanding social interactions in a particular group or society; for example, the working status of employees and their efficiency in offices or other workplaces (Yano et al. 2009). A good example of a group-dynamics application is determining a commonly favored individual in a group: the wearable device “UberBadge”, which contains a sound sensor, is mounted on each participant of the group (Pentland 2005; Laibowitz et al. 2006).
Fig. 1

Our power-aware wearable sensor node, embedding sound, acceleration, IR, and other sensors, with the size of an ID card (3.86 inch × 2.87 inch × 0.35 inch)

Energy efficiency plays an important role for mobile and wearable devices in the SCI system (Bonfiglio and Rossi 2011). In order to reveal individual activities and social interactions, most front-end sensing units are mobile and portable, for example mobile phones, PDAs, and wearable devices. In addition, these devices are powered by an energy-limited battery, unlike a DSP or FPGA board fitted with a power adaptor. Conventional sound recognition and acoustic signal processing algorithms that can be executed on DSP or FPGA platforms (Dong et al. 2007; Veitch et al. 2011) may not perform well on our wearable sensor node. Therefore, a major challenge of this research is developing a new sound recognition algorithm that achieves high accuracy with low calculation cost to meet the energy requirement.

Environmental sound recognition research has been reported previously (Chen et al. 2005; Goldhor 1993; Ma et al. 2006; Chu et al. 2009; Cowling and Sitte 2003; Peltonen et al. 2002; Dong et al. 2007; Bharatula et al. 2005). At the feature extraction stage, conventional state-of-the-art Mel-frequency cepstrum coefficient (MFCC) filtering is used to extract the sound feature and obtain good recognition accuracy (Chen et al. 2005; Goldhor 1993; Ma et al. 2006; Dong et al. 2007). However, a computationally expensive FFT must be calculated before entering the bank of Mel-scale filters in the extraction flow, which increases the calculation complexity of sound feature extraction. At the classification stage, the performance of the Gaussian mixture model (GMM), support vector machine (SVM), Linde–Buzo–Gray algorithm (LBG), k-means, and hidden Markov model (HMM) classifiers has been studied and compared (Cowling and Sitte 2003). From that work, we have learned that the HMM classifier (Ma et al. 2006; Rabiner 1989) can achieve high recognition accuracy with an acceptable increase in calculation cost compared with the other classifiers.

In this paper, a novel Haar + HMM algorithm is proposed to recognize environmental background sounds. Haar-like filtering is a feature extraction method commonly used in the 2-D image processing field. It was first used in 2-D face detection and yielded good performance (Viola and Jones 2004); it was also applied to speech and non-speech detection (Nishimura and Kuroda 2008b). In order to exploit its low cost and high efficiency, 1-D Haar-like filtering is newly employed here for environmental sound recognition. The integral signal (IS) method (Nishimura and Kuroda 2008a) further decreases the calculation cost considerably during Haar-like filtering without compromising accuracy. Furthermore, the HMM classifier achieves comparatively high recognition accuracy at the classification stage. With these advantages, our Haar + HMM algorithm is very effective and can be used for environmental background sound recognition on the power-aware wearable sensor node.

The rest of this paper is organized as follows. Relevant previous work is discussed in “Sect. 2”. In “Sect. 3”, our proposed Haar + HMM algorithm is introduced in detail. Evaluation benchmarks for our proposed sound recognition algorithms are presented in “Sect. 4”. “Section 5” introduces a detailed experimental process. In “Sect. 6”, with the introduced sound recognition algorithm and experimental data, system results and discussions are presented. Finally, the conclusion and future work are given in “Sect. 7”.

2 Review of related work

In this section, we address three questions. First, we examine why sound is used as a detection medium for recognizing people's daily activities. Second, we review research related to sound recognition in general and for human activities. Finally, we review research on activity recognition using wearable devices.

Accurately identifying individuals and understanding person-to-person activities inside a society is a premise for the SCI system to fulfill its functions. Many detection media are used to recognize human activities; the most commonly used are acceleration (Bao and Intille 2004; Yin et al. 2008; Krause et al. 2005), video (Rota and Thonnat 2000), infrared ray (IR), and sound (Chen et al. 2005; Bharatula et al. 2005; Pentland 2005; Laibowitz et al. 2006; Yano et al. 2009). In (Bao and Intille 2004), five two-axis accelerometers were attached to the tester's joints to recognize 20 human daily activities, which was done successfully with 84 % accuracy. Yin et al. (2008) also used acceleration sensors to detect abnormal activities caused by Parkinson's or Alzheimer's disease. From these reports, we can conclude that acceleration is mainly applied to detecting an individual's activities and is rarely employed for detecting social activities. Video is also widely used to detect people's individual and social activities (Rota and Thonnat 2000). Because of security and privacy concerns, however, employing images as an activity detection medium is inconvenient or not allowed in some privacy-sensitive locations, such as a hospital or a restroom. In addition, image signal processing is more computationally complex than acoustic signal processing. Sound has unique advantages in terms of detection accuracy, algorithm complexity, and operational convenience. Therefore, it is an ideal detection medium for personal and social activity recognition.

Recently, in (Chu et al. 2009), a new matching pursuit (MP) algorithm was introduced to decompose a sound's time–frequency feature. In each step, the best matching atom is searched from a redundant dictionary (such as a Gabor dictionary), so that the sound can be presented as a linear combination of those atoms. A drawback of the MP algorithm is that the search cost grows significantly as the number of atoms in the dictionary increases. In (Dong et al. 2007), a complicated MFCC-based sound feature with HMM classification is implemented on the Ezairo 5900 SoC system, classifying environmental sounds for a hearing-aid application; a specialized 24-bit DSP IP core is employed to process the acoustic environmental sounds. It is difficult for our power-aware wearable sensor to execute such complex algorithms. In (Chen et al. 2005), seven bathroom activities are recognized by detecting sounds such as showering and tooth-brushing. The sounds are sampled by a microphone and subsequently recognized by the MFCC + HMM algorithm on a PC, achieving an average recognition accuracy of 83.5 %. The difference between our research and Chen's work is that Chen's recognition is processed off-line on a PC, whereas in our case processing must be done using the limited power available in the wearable sensor node.

To accomplish activity recognition on a power-aware wearable device, lightweight signal processing is necessary. In (Krause et al. 2005), five activities are discriminated using a wrist-worn eWatch accelerometer platform: walking, running, sitting, standing, and ascending/descending stairs. The detection accuracy evidently decreases as the accelerometer's sampling rate is reduced, and an optimized sampling scheme realizes a tradeoff: the deployment lifetime of the eWatch increases without significant deterioration in accuracy. In (Bharatula et al. 2005), a tradeoff between the power consumption and accuracy of a sound-based context recognition system is reported. Free combinations of nine time-domain features (such as mean and variance) and five frequency-domain features (such as bandwidth and frequency centroid) constitute the sound feature sets, and different recognition results are obtained with different classifiers. A target sound feature set and classifier are decided by the tradeoff between accuracy and power consumption. However, exploring the ideal sound feature set and classifier is an empirical and complicated process. Compared with this method, our proposed Haar-like sound feature with HMM classification is more effective.

3 Sound recognition implementation by utilizing the Haar + HMM algorithm

The proposed sound recognition flow is shown in Fig. 2. It follows two sequential steps: off-line generation of sound templates and on-line sound classification. Features of the template sounds are extracted by low-computation Haar-like filtering; after off-line training, the sound templates are completed and stored in memory in advance. When an input test sound arrives, its feature is extracted on-line by the same filtering method. The recognition result is then obtained by comparison with the prepared templates using the HMM classifiers (Rabiner 1989; Rabiner and Juang 1993).
Fig. 2

Sound recognition flow

3.1 Haar-like sound feature extraction

3.1.1 1-D Haar-like filtering

Inspired by the low-cost and efficient feature extraction of Haar-like filtering in 2-D face detection (Viola and Jones 2004), this filtering method has also been applied to 1-D signals, for example speech/non-speech detection and acceleration processing and recognition (Nishimura and Kuroda 2008b; Hanai et al. 2009).

A basic Haar-like filter hfilter(j) is denoted by Eq. (1) and shown in Fig. 3.
$$ h_{filter} (j) = \left\{ {\begin{array}{*{20}l} { - 1,} & { - W_{filter} /2 < j \le 0} \\ { + 1,} & {0 < j \le W_{filter} /2} \\ \end{array} } \right. \quad (1) $$
where, Wfilter is the width of the Haar-like filter hfilter(j).
Fig. 3

One-dimensional (1-D) Haar-like filter hfilter(j)

In comparison with the MFCC's Mel-scale filters, the Haar-like filter is simple and has a low calculation cost. Its filter width Wfilter and the shift width Wshift between neighboring filters, as shown in Fig. 4, are adjustable. These simple, controllable parameters can be designed and applied for the feature extraction of environmental sound in our research.
Fig. 4

One-dimensional (1-D) Haar-like filtering for one frame’s sound signal

One frame of the sound signal (256 sampling points) processed by Haar-like filtering is shown in Fig. 4. The Haar-like feature xm is calculated as the sum of the absolute outputs of the Haar-like filtered signals:
$$ x_{m} = \sum\limits_{n = 0}^{N - 1} {\left| {\sum\limits_{k = 1}^{{W_{filter} }} {h_{m} (k) \times s(nW_{shift} + k)} } \right|} , \quad (2a) $$
$$ \phantom{x_{m}} = \sum\limits_{n = 0}^{N - 1} {\left| {oneFilterValue(n)} \right|} , \quad (2b) $$
where s(t) is the input sound signal and hm(k) denotes a Haar-like filter whose length can take different values. Wshift is the shift width between neighboring filters. The number of filters N in one frame is calculated as
$$ N = (W_{frame} - W_{filter} )/W_{shift} + 1. \quad (3) $$
The parameter Wshift is adjustable as α changes [α is defined in Eq. (4)]. A longer Wshift (larger α) reduces the value of N and accordingly decreases the calculation for each frame's sound data. The variation of α also affects the final recognition result. When α = 0, Wshift is set to 1.
$$ \alpha = W_{shift} /W_{filter} \quad (4) $$
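For illustration, the direct feature computation of Eqs. (2a)–(4) can be sketched as follows. This is a minimal sketch; the names (haar_feature, w_filter, alpha) are ours and not from the original implementation.

```python
import numpy as np

def haar_feature(frame, w_filter, alpha=0.0):
    """One frame's Haar-like feature x_m by the direct definition (Eq. 2a).

    frame    : 1-D array of sound samples (e.g. 256 points)
    w_filter : filter width W_filter (even)
    alpha    : W_shift / W_filter (Eq. 4); alpha = 0 means W_shift = 1
    """
    w_shift = max(1, int(alpha * w_filter))            # W_shift = 1 when alpha = 0
    n = (len(frame) - w_filter) // w_shift + 1         # number of filters N (Eq. 3)
    h = np.concatenate([-np.ones(w_filter // 2),       # -1 over the first half,
                        np.ones(w_filter // 2)])       # +1 over the second half (Eq. 1)
    x_m = 0.0
    for i in range(n):
        segment = frame[i * w_shift : i * w_shift + w_filter]
        x_m += abs(np.dot(h, segment))                 # |oneFilterValue(i)| (Eq. 2b)
    return x_m

frame = np.sin(0.1 * np.arange(256))                   # synthetic frame for illustration
print(haar_feature(frame, w_filter=8, alpha=0.5))
```

Note that each filter output here costs Wfilter multiplications; the IS method below removes this cost.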

3.1.2 Integral signal (IS)

From Eq. (1) and Fig. 3, it follows that the coefficients of the Haar-like filter are −1 when j ≤ 0 and change to +1 when j > 0. Thus, after the sound signal s(t) passes a Haar-like filter of width Wfilter, the filtering result is the absolute value of the difference between the sums of the sampled sound over the two intervals (−Wfilter/2, 0] and (0, Wfilter/2]. Based on this, and borrowing the integral image concept introduced in (Viola and Jones 2004), a concept called the Integral Signal (Nishimura and Kuroda 2008a) is newly utilized in this work. The IS of each sound frame is calculated and stored in memory as a preprocessed intermediate signal for later use. It is defined as follows:
$$ IS(n) = \sum\limits_{t \le n} {s(t)} \quad (5) $$
Therefore, the filtered sound signal calculation can be denoted as
$$ oneFilterValue = IS(t + W_{filter} ) - 2 \times IS(t + W_{filter} /2) + IS(t) \quad (6) $$

In Eq. (2a), Wfilter multiplications and Wfilter − 1 additions are needed to obtain each filtering result. With the proposed IS method in Eq. (6), however, this is reduced to one multiplication and two additions. Therefore, the computational complexity of xm in Eq. (2b) obviously decreases, while the accuracy does not deteriorate.
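The equivalence between Eq. (6) and the direct filter output can be checked numerically with the sketch below; the function names (integral_signal, one_filter_value) are ours.

```python
import numpy as np

def integral_signal(s):
    """IS(n) = sum of s(t) for t <= n, stored with a leading 0 so IS[0] = 0 (Eq. 5)."""
    return np.concatenate([[0.0], np.cumsum(s)])

def one_filter_value(IS, t, w_filter):
    """Haar-like filter output at position t via Eq. (6):
    three table lookups replace W_filter multiplications."""
    return IS[t + w_filter] - 2.0 * IS[t + w_filter // 2] + IS[t]

# check Eq. (6) against the direct definition on a random frame
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
IS = integral_signal(s)
t, w = 40, 16
direct = s[t + w // 2 : t + w].sum() - s[t : t + w // 2].sum()  # (+1 half) minus (-1 half)
assert abs(one_filter_value(IS, t, w) - direct) < 1e-9
```

Since the IS array is shared by all filter widths and positions within a frame, the cumulative sum is computed once per frame and then reused.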

3.1.3 Haar-like sound feature

A Haar-like filter group \( h_{v} = \left\{ {h_{v1} , \, h_{v2} , \ldots ,h_{vi} , \ldots , \, h_{vn} } \right\} \) (1 ≤ i ≤ n), chosen from a pool of M filter groups, is utilized to extract the feature of sound sv(t), where 1 ≤ v ≤ p and p is the number of all detected sounds. hvi is a 1-D Haar-like filter as defined in “Sect. 3.1.1”, and n is the feature dimension of each sound frame.

Two parameters decide the pool size M: HaarWidMax (the maximum Haar filter width) and HaarFilNum (the number of Haar filters). The value of M is given by the combination expression below:
$$ M = \left( {\begin{array}{*{20}c} {HaarWidMax/2} \\ {HaarFilNum} \\ \end{array} } \right). \quad (7) $$
For each frame of sound sv(t), its Haar-like feature Xv is formed by passing the frame through the Haar-like filter group \( h_{v} = \left\{ {h_{v1} , \, h_{v2} , \ldots ,h_{vi} , \ldots , \, h_{vn} } \right\} \). Therefore, the sound feature Xv can be calculated by utilizing the IS method and is denoted as
$$ X_{v} = \left\{ {x_{v1} ,x_{v2} , \ldots ,x_{vi} , \ldots ,x_{vn} } \right\}, \quad (8) $$
where 1 ≤ i ≤ n, n = HaarFilNum is the feature dimension of each sound frame, and xvi is the previously introduced Haar-like feature xm.

The sound feature plays an important role in achieving the expected final recognition results. With the simple Haar-like filter group and the IS method, the extraction of the Haar-like sound feature can be completed at an extremely low computational cost. The resulting Haar-like sound features are simple and effective, which helps speed up the feature extraction process and reduce the calculation cost significantly enough to meet the energy requirement.

3.2 Off-line training for the Haar-like filters group

The Haar-like filter group hv determines the feature Xv of the individual sound sv(t). The detailed training process for selecting the filter group hv is described in (Nishimura and Kuroda 2008b). The selection is based on the training error, which is evaluated by matching feature vectors extracted from the training data against the clustering model; the filter group yielding the minimum error is selected.

Two assumptions are established in the training stage:
  1. Once the value of HaarFilNum has been decided, the feature dimension of all p sounds is the same, as Xv in Eq. (8) defines (n = HaarFilNum).

  2. Once hv for the sound sv(t) has been chosen, the filter groups for the remaining p − 1 sounds are chosen from the remaining M − 1 candidate filter groups in the pool. This guarantees that different sounds sv(t) adopt different filter groups hv.

The two parameters HaarWidMax and HaarFilNum introduced in Eq. (7) decide the training complexity and the search scale during the selection of hv. The size of the search pool M for combinations of these two parameters is shown in Table 1. For example, when HaarWidMax = 18 and HaarFilNum = 5, the feature {x1, x2, x3, x4, x5} of each sound corresponds to one Haar-like filter group out of a pool of M = 126 filter groups.
Table 1

Training Haar-like filters pool size M with relation to the “HaarWidMax” and “HaarFilNum”

In this table, the bottom gray cells are impossible cases; the middle white cells are non-executable cases because the M value is less than our target of 22 testing sounds; and the top gray cells are our experimental cases
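Reading Eq. (7) as choosing HaarFilNum distinct filter widths out of the HaarWidMax/2 available even widths, the pool size can be reproduced with a one-line sketch (pool_size is our name):

```python
from math import comb

def pool_size(haar_wid_max, haar_fil_num):
    """Pool size M of Eq. (7): choose haar_fil_num distinct even filter widths
    out of the haar_wid_max / 2 available ones."""
    return comb(haar_wid_max // 2, haar_fil_num)

print(pool_size(18, 5))   # 126, matching the HaarWidMax = 18, HaarFilNum = 5 example
```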

During the training of hv, the LBG clustering model (Linde et al. 1980) was employed to develop new cluster centers in (Nishimura and Kuroda 2008b). In this research, k-means clustering (Duda et al. 2001; Wiki_k-means 2012) is applied instead, because k-means is more controllable than LBG: the number of cluster centers in LBG grows by splitting in powers of 2, whereas k-means can adopt values smaller than that. Moreover, in the following HMM classification stage, the number of observation symbols in the HMM model equals the number of k-means clusters. This change of clustering method helps reduce the size of the HMM's observation alphabet and further decreases the HMM classifier's calculation cost.

3.3 HMM classification

As shown in Fig. 2, to classify different environmental sounds, an appropriate off-line trained HMM classifier λv(π, A, B) for each individual sound sv(t) is necessary. After obtaining the updated centroids of sound sv(t) by k-means clustering, an observation Oq is formed by mapping the training sound vector q to a centroid index; namely, the training vector is assigned the index of the nearest centroid. Therefore, an observation sequence of sound sv(t) can be denoted as \( O_{v} = \left\{ {O_{1} ,O_{2} , \ldots ,O_{q} , \ldots ,O_{T} } \right\}_{v} \). With the composed training sound's Ov and the initial HMM parameters λv(π, A, B)0, the Baum–Welch algorithm is applied to refine the model λv(π, A, B) until the change between iterations falls below ε in the HMM classifier's training stage (Welch 2003; Rabiner and Juang 1993; Rabiner 1989).
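The quantization step, from feature vectors to an observation sequence, can be sketched as below, assuming the k-means centroids have already been trained; to_observations is our name.

```python
import numpy as np

def to_observations(features, centroids):
    """Quantize each feature vector to the index of its nearest centroid,
    yielding the HMM observation sequence O_v = {O_1, ..., O_T}."""
    # (T, K) matrix of Euclidean distances from each feature to each centroid
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])   # from k-means training
feats = np.array([[0.1, -0.1], [4.8, 5.2], [0.9, 1.1]])
print(to_observations(feats, centroids))                     # [0 2 1]
```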

The block diagram of the on-line HMM classification of a test sound is shown in Fig. 5. In the real recognition stage, the extracted Haar-like feature of an unknown test sound l is quantized to establish an observation sequence Ol. After computing the probabilities P(Ol|λl) (1 ≤ l ≤ p) against all template sounds using the Viterbi algorithm (Rabiner and Juang 1993; Gold and Morgan 2000), the template with the highest likelihood is recognized as the most similar to the test sound.
Fig. 5

Block diagram of a test sound’s HMM classification

$$ l^{*} = \mathop {\arg \max }\limits_{1 \le l \le p} \left[ {P\left( {O_{l} |\lambda^{l} } \right)} \right] \quad (9) $$

Analyzing Eq. (9), we find that the calculation cost is on the order of p × N2 × T for each sound: proportional to the number of all detected sounds p, the square of the number of states N, and the number of observations T in the sequence of the HMM model (Rabiner 1989; Rabiner and Juang 1993).
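This cost estimate can be made concrete with a small sketch (viterbi_ops is our name), taking p = 22 sounds and N = 7 states from Sect. 6.1, and roughly T = 124 observations per 1-s unit, which follows from the 256-point frames with 50 % overlap at 16 kHz described in Sect. 5.

```python
def viterbi_ops(p, n_states, t_obs):
    """Rough operation count for scoring one input against p HMM templates:
    each template's trellis costs about N^2 operations per observation."""
    return p * n_states ** 2 * t_obs

# p = 22 sounds, N = 7 states, T = 124 observations per 1-s recognition unit
print(viterbi_ops(22, 7, 124))   # operation count to compare against the Sect. 4 budget
```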

4 Benchmark values to evaluate our proposed sound recognition algorithms

As mentioned in “Sect. 1”, wearable devices can help in accurately understanding individual and social interactions, but they mostly have limited battery power. The following three blocks inside the wearable device consume most of the limited energy (Yamashita et al. 2006; Bonfiglio and Rossi 2011; Bharatula et al. 2005): the analog-to-digital converter (ADC), the communication block, and the MCU microprocessor, as depicted in Fig. 6.
Fig. 6

Schematic diagram of the wearable sensor node

Among them, the ADC and communication blocks consume the most energy, so the energy remaining for the MCU is limited (Doherty et al. 2001; Yamashita et al. 2006; Bonfiglio and Rossi 2011). It has also been proven that locally processing the sampled data consumes less energy than transmitting it to upper servers for processing (Lynch and Loh 2006; Bharatula et al. 2004; Bonfiglio and Rossi 2011). Thus, the MCU should use its limited energy to complete signal processing locally inside the sensor node. That means the applied algorithm must operate within the node's energy budget while the final recognition accuracy is guaranteed to a reasonable degree.

Regarding how much accuracy a sound recognition system can achieve, several studies have reported results (Chu et al. 2009; Ma et al. 2006; Eronen et al. 2006). When the recognition targets are environmental sounds, the listening-test experiments in these studies indicate that human hearing achieves approximately 82 % accuracy. This conclusion provides a benchmark for deciding the accuracy level of our environmental sound recognition research.

Another aspect is to evaluate whether the applied recognition algorithm(s) can accomplish the sound recognition using the limited power assigned to the MCU inside our wearable sensor. The MCU is a Renesas Technology H8S/2218 chip (Renesas_H8S_2218 2011; Yamashita et al. 2006; Nishimura et al. 2008), a microprocessor fabricated in a 0.35 μm process with a 16-bit architecture, 65 basic instructions, a 6 mA working current, and a 3.0–3.6 V working voltage. Inside this MCU chip is an embedded low-power H8S/2000 CPU core on which our proposed sound recognition algorithm is executed. The CPU core works at 20 MHz (50 ns per cycle) and 1.8 V input voltage with an average 4 mA working current. The main parameters of the MCU and the CPU core are summarized in Table 2. From the specification (Renesas_H8S_2218 2011), we can calculate that a one-cycle command, such as an addition or subtraction, consumes 4 mA × 1.8 V × 50 ns = 0.36 nJ, and a four-cycle command, such as a multiplication, consumes 4 mA × 1.8 V × 4 × 50 ns = 1.44 nJ.
Table 2

Main technical parameters of the H8S/2218 MCU and embedded H8S/2000 CPU Core

*10 mAh is the energy assigned for the sound processing module in CPU

We aim for the sound module in the sensor node to work continuously for 24 h (3,600 × 24 = 86,400 s), with the CPU core finishing the sound recognition algorithm within each 1-s sampling period; the recognition results can then capture a person's activities for a whole day. The algorithm is executed by individual addition and multiplication operations in the CPU.
  • 1.8 V × 10 mAh = 18 mWh = 64.7 J (1 J = 2.78 × 10^−4 Wh) of energy in the CPU for calculation.

  • 64.7 J / 86,400 s = 7.5 × 10^5 nJ/s = 0.75 million nJ/s of energy assigned for execution of the sound recognition algorithm.
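The budget arithmetic above can be reproduced directly; all numeric values below come from the text.

```python
# Back-of-envelope check of the energy figures (values from Sect. 4)
volts, amps, cycle_s = 1.8, 4e-3, 50e-9    # CPU core: 1.8 V, 4 mA, 20 MHz clock

e_add = volts * amps * cycle_s             # one-cycle add/subtract
e_mul = volts * amps * 4 * cycle_s         # four-cycle multiply
assert abs(e_add - 0.36e-9) < 1e-12        # 0.36 nJ per addition
assert abs(e_mul - 1.44e-9) < 1e-12        # 1.44 nJ per multiplication

budget_j = 1.8 * 10e-3 * 3600              # 1.8 V x 10 mAh, converted to joules
per_second_nj = budget_j / 86400 * 1e9     # spread over 24 h of operation
print(round(per_second_nj))                # 750000, i.e. 0.75 million nJ/s
```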

Therefore, based on our hardware platform, a minimum 82 % sound recognition accuracy and a maximum 0.75 million nJ/s computational power consumption are adopted as benchmarks to evaluate the performance of the sound recognition algorithms. They are indicated as dashed lines in Fig. 11 for performance comparison. If the performance marks of the applied algorithms fall into the top-left region of the figure, the algorithms are suitable for our sound recognition application.

5 Experimental process

5.1 Test target sounds

Many personal and social activities happen in our daily life, and we can understand these activities by recognizing their background sounds. The 22 experimental sounds in our research are listed below. They were sampled in real environments, not in a noise-isolated space; among them, the background sounds of social activities were sampled in noisy environments.
  • Background sounds of personal activities
    1. Vacuum cleaner (house cleaning)
    2. Washing machine (washing clothing)
    3. Water sound from tap —Household cleaning
    4. Brushing teeth
    5. Shaving (shaving beard)
    6. Taking a shower (bath)
    7. Hair dryer (drying hair)
    8. Urination (man)
    9. Flushing toilet (using the water closet) —Sanitary
    10. Chewing cake (eating)
    11. Drinking (drinking something)
    12. Oven timer (toasting food) —Dietetic
    13. Walking inside a room
    14. Walking (walking in the street)
    15.
    16. Moving train (travelling in a train)
    17. Rain hitting an umbrella (in the rain) —Outdoor activities
    18. Telephone ringing (phone call)

  • Background sounds of social activities
    19. Supermarket (shopping)
    20. Discussion/talking in the lab (discussing/talking with others)
    21. Restaurant (dining out)
    22. Front square of a subway entrance (meeting friends)
5.2 Experimental data collection and data sets

The sampling mode of our wearable sensor node introduced in Fig. 1 was configured wirelessly in advance; during data collection, the node operates under this configuration. The node is hung in front of the tester's chest or placed in the environment, depending on the test activity. For example, it can be placed on the bathroom countertop while the tester takes a shower; under normal circumstances, it is hung in front of the chest. The sampling rate of these 22 sounds is 16 kHz with 16-bit resolution. The data are stored in the sensor node's on-board memory and used for training the sound templates and as test inputs.

Each of the above-mentioned 22 types of sounds was recorded more than three times on different days. For each sound, one recording is randomly picked as the testing input; these 22 testing inputs compose the testing data set. The remaining recordings of each sound are collected together as the template training set. The durations of the testing set vary from 14.9 to 256.8 s (indicated in the second column of Table 4). The durations of the template training set range from 16 to 277 s, with a total length of 1,788 s.

5.3 Performance evaluation

During the recognition process, each unit of detected sound is 1 s long; that is, the sound recognition algorithms must finish within each one-second unit, as discussed in “Sect. 4”. Each sound frame contains 256 sampling points with a 50 % overlap.

The recognition accuracy rate AR is defined as:
$$ AR = \frac{{C_{u} }}{{A_{u} }} \times 100\,\% \quad (10) $$
where Cu stands for the number of correctly recognized units (1-s periods) and Au stands for the number of all input sound units (1-s periods).

Another factor in evaluating the performance of our sound-context recognition system is the calculation cost, determined by the number of multiplication and addition operations in the whole algorithm flow.

6 Experimental results and discussion

As analyzed in “Sect. 4”, the sound recognition algorithm executed on the wearable sensor must achieve high recognition accuracy while satisfying the sensor node's computational power budget. After conducting the experiments and analyzing their results in this section, we find that our proposed Haar + HMM algorithm for environmental sound recognition successfully satisfies these requirements.

6.1 Parameters tuning and recognition accuracy

Figure 7 shows how the parameters HaarFilNum and HaarWidMax affect the average accuracy of the sound recognition system. Among all cases, with HaarFilNum = 5, HaarWidMax = 18, α = 0 (Wshift = 1), number of HMM states = 7, number of HMM observation symbols = 15, and ε = 0.01, the average accuracy over the 22 sounds reaches its highest value of 98.2 %. Even with HaarFilNum = 2 (other parameters identical), the accuracy exceeds 94.0 %. These results greatly outperform the minimum required accuracy of 82 % set in “Sect. 4”, and prove that our proposed Haar + HMM environmental sound recognition algorithm with the proposed training method is effective.
Fig. 7

Average accuracy as a function of the parameters HaarFilNum and HaarWidMax (α = 0)

Besides α = 0, the recognition results for the typical values α = Wshift/Wfilter = 0.5 and α = Wshift/Wfilter = 1 are also shown in Fig. 8. Except for the value of α, the parameters are set as in the previous experiment, which yielded a maximum accuracy of 98.2 %. From this figure, we observe that the accuracy in all cases surpasses the required minimum of 82 %. Varying α does not significantly affect the accuracy of our proposed sound recognition system: the accuracy ranges from a minimum of 93.7 % to a maximum of 98.2 %. Different combinations of HaarFilNum and α introduce only a 4.5 % variation. At HaarFilNum = 5, where the maximum accuracy occurs, the variation is only 1.3 %. So the influence of α on accuracy is not significant if an appropriate HaarFilNum is chosen.
Fig. 8

Average accuracy as a function of the parameters HaarFilNum and α

6.2 Different sound features’ performance comparison

Different sound features yield different performance. With the same HMM classifier as in “Sect. 6.1”, the accuracy and calculation cost of the MFCC feature (Davis and Mermelstein 1980) and three Haar-like features (α = 0, 0.5, 1.0, HaarFilNum = 5, HaarWidMax = 18) are compared. MFCC feature extraction is complex: it involves an FFT, logarithms, a discrete cosine transform (DCT), and many multiplications. In contrast, the Haar-like feature requires only a small number of additions and multiplications, as Eq. (6) shows. The experimental results in Fig. 9 and Table 3 prove that our proposed Haar + HMM outperforms MFCC + HMM in terms of both accuracy and calculation cost. The most aggressive case, α = 1.0, obtains 96.9 % accuracy using only 8.3 % of MFCC’s multiplications and 8.2 % of MFCC’s additions.
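To illustrate why the cost is so low, here is a minimal sketch of a two-rectangle 1-D Haar-like filter. Since Eq. (6) is not reproduced in this section, we assume each response is the difference of sums over two adjacent windows, with the filter slid by Wshift = α·Wfilter; the width schedule up to HaarWidMax is likewise an assumption for illustration:

```python
def haar_like_feature(frame, width, shift):
    """Responses of a two-rectangle 1-D Haar-like filter slid across a frame.

    Each response is the sum over the left window minus the sum over the
    adjacent right window: additions and subtractions only, no multiplies.
    """
    responses = []
    for pos in range(0, len(frame) - 2 * width + 1, shift):
        left = sum(frame[pos:pos + width])
        right = sum(frame[pos + width:pos + 2 * width])
        responses.append(left - right)
    return responses

def haar_feature_vector(frame, haar_fil_num=5, haar_wid_max=18, alpha=1.0):
    """Bank of HaarFilNum filters with widths spread up to HaarWidMax (assumed schedule)."""
    widths = [max(1, haar_wid_max * (k + 1) // haar_fil_num)
              for k in range(haar_fil_num)]
    vec = []
    for w in widths:
        shift = max(1, int(alpha * w))  # alpha = Wshift / Wfilter
        vec.extend(haar_like_feature(frame, w, shift))
    return vec
```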
Fig. 9

Performance comparison of proposed Haar-like and traditional MFCC sound features with same HMM classifier—average accuracy and multiplication/addition calculation cost (256 samples/frame)

Table 3

Performance comparison of different sound features: MFCC and Haar-like (α = 0, 0.5, 1.0); per frame = 256 samples

Parameter α is an important variable that affects the system’s accuracy and calculation cost. From Figs. 8, 9 and Table 3, it is evident that the average recognition accuracy drops by 1.3 % when α changes from 0 to 1. However, this trivial 1.3 % decrease in accuracy considerably reduces the calculation cost: multiplications are reduced by 72.2 % and additions by 79.8 % compared with the reference α = 0 case. This is because the number of filter positions N in Eq. (3) decreases as α increases, which dramatically reduces the calculation cost of the feature extraction stage. Meanwhile, increasing α only slightly deteriorates the final recognition accuracy. We believe this limited decrease occurs because most environmental sounds are quasi-stationary.
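Assuming the usual sliding-window count for Eq. (3), N = ⌊(L − Wfilter)/Wshift⌋ + 1 (this exact form is our assumption, as the equation is not reproduced here), the effect of α on the position count N can be sketched as:

```python
def num_filter_positions(frame_len, w_filter, alpha):
    """Number of filter positions N for one filter width (assumed form of Eq. (3))."""
    w_shift = max(1, int(alpha * w_filter))  # alpha = 0 falls back to a 1-sample shift
    return (frame_len - w_filter) // w_shift + 1

print(num_filter_positions(256, 18, 0.0))  # 239 positions at alpha = 0
print(num_filter_positions(256, 18, 1.0))  # 14 positions at alpha = 1
```

The roughly 17-fold drop in positions per filter is what drives the large reduction in additions and multiplications as α grows.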

6.3 Performance comparison of different classifiers

With the same α = 1.0 Haar-like feature configuration used in “Sect. 6.2”, the performance of the HMM classifier is compared with the reference k-means and LBG classifiers. The comparison results are shown in Fig. 10 and Tables 4, 5. The number of clusters for both the HMM and k-means classifiers is 15. The LBG cluster count is set to 16 = 2⁴, which is close to the 15 clusters of k-means and HMM, for a fair comparison. The Haar + HMM algorithm achieves the highest average accuracy of 96.9 % among the three cases.
Fig. 10

Performance comparison of LBG, k-means, and HMM classifiers with the same Haar-like sound feature (α = 1.0)

Table 4

Recognition accuracy confusion matrix of the 22 tested sounds with the Haar + HMM algorithm (α = 1.0); accuracy comparison with the other two Haar + HMM cases (α = 0 and 0.5), Haar + k-means, and Haar + LBG
Table 5

Comprehensive performance comparison of four different sound recognition algorithms: MFCC + HMM, Haar + LBG, Haar + k-means, and Haar + HMM (1 s unit = 124 frames in each 1 s sound unit, Haar-like feature’s α = 1.0)

*Total energy = 1.44 nJ × (number of multiplications) + 0.36 nJ × (number of additions), in mil. nJ, based on the discussion in “Sect. 4”

During classification, the HMM classifier needs more computation than the k-means classifier. As mentioned in “Sect. 3.3”, in HMM classification the Viterbi algorithm determines the final recognition performance from the on-line observation sequence Ol. The algorithm is additionally employed to estimate the likelihood of the Ol sequence, which is computed from the k-means cluster centroids developed during the off-line training stage. Moreover, the Viterbi algorithm itself employs many multiplications, as Eq. (9) indicates. These factors lead to the increased multiplication count compared with the k-means classifier in Fig. 10.

6.4 Performance comparison of whole system

Performance of the recognition algorithms for different environmental sounds is compared in Table 5 and Fig. 11. Results of four algorithms (MFCC + HMM, Haar + LBG, Haar + k-means, and Haar + HMM) are compared. Among them, three algorithms (MFCC + HMM, Haar + LBG, and Haar + HMM) exceed the 82 % accuracy benchmark set in “Sect. 4”. The highest accuracy is achieved by the Haar + HMM algorithm.
Fig. 11

Performance comparison of MFCC + HMM, Haar + LBG, Haar + k-means, and Haar + HMM (Haar-like feature’s α = 1.0)

In Fig. 11, we also find that the sound feature extracted by Haar-like filtering needs less calculation energy than MFCC filtering. All three algorithms using the Haar-like feature are executable within the wearable sensor’s energy budget. The MFCC sound feature with the HMM classifier, however, is so complicated that it exceeds the 0.75 mil. nJ/s calculation energy benchmark. Compared with the Haar + k-means method, the Haar + HMM algorithm’s calculation energy increases by 0.661 − 0.483 = 0.178 mil. nJ/s, but its accuracy increases by 96.9 % − 75.6 % = 21.3 % thanks to the effective HMM classification.

Within the top-left region confined by the two benchmarks, the Haar + HMM algorithm achieves the best overall performance. It consumes slightly more energy than Haar + LBG (0.661 − 0.509 = 0.152 mil. nJ/s more), but achieves a much higher accuracy: 96.9 % versus Haar + LBG’s 82.3 %. When the calculation energy requirement becomes stricter, Haar + LBG is a candidate solution.
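The budget check can be reproduced from the per-operation energies in the Table 5 footnote and the per-second figures quoted in the text; the operation-count function is shown for completeness (actual per-algorithm counts are in Table 5):

```python
MUL_NJ, ADD_NJ = 1.44, 0.36  # nJ per multiplication / addition, from Sect. 4
BUDGET = 0.75                # calculation energy budget, mil. nJ per 1 s unit

def energy_mil_nj(n_mul, n_add):
    """Total calculation energy in millions of nJ (Table 5 footnote formula)."""
    return (MUL_NJ * n_mul + ADD_NJ * n_add) / 1e6

# Per-second energies quoted in the text (mil. nJ/s).
quoted = {"Haar + HMM": 0.661, "Haar + k-means": 0.483, "Haar + LBG": 0.509}
feasible = [name for name, e in quoted.items() if e <= BUDGET]
print(feasible)  # all three Haar-based algorithms fit the budget
print(round(quoted["Haar + HMM"] - quoted["Haar + LBG"], 3))  # 0.152
```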

7 Conclusion and future work

Environmental background sound is a rich information source for identifying individual and social behaviors. In this study, a power-aware wearable sensor is utilized to recognize the environmental sounds occurring in the background of these activities. A novel Haar + HMM algorithm with low calculation cost and high recognition accuracy is utilized to realize this function.

Based on the wearable sensor’s power budget and listening test results, the target recognition accuracy and power consumption benchmarks used to evaluate the sound algorithms were decided. With the Haar + HMM algorithm, an average accuracy of 96.9 % over 22 typical personal and social environmental sounds has been achieved. This proves that our proposed algorithm performs well in sound-context recognition while still satisfying the wearable sensor’s power requirement. Experimental comparison also indicates that our method outperforms other commonly used sound recognition algorithms in terms of both accuracy and power consumption. This method is promising for future systems that combine it with other sensors, such as accelerometers, to achieve a higher accuracy rate and more reliable human activity recognition results for the SCI system.

Some interesting tasks remain for the future. One is to implement our algorithm on a real power-aware wearable sensor node and evaluate it there. Another concerns unregistered sounds: as in most environmental sound recognition research, the test sound templates in our study were trained and registered in advance, but in real applications the input can be a new, non-registered sound. To address this, methods from the related fields of speech and face recognition can be considered. Moreover, sound-context recognition of more complex social activities is also a direction of our future work.


The authors sincerely thank Dr. Yano K., Senior Chief Researcher of the Central Research Laboratory at Hitachi Ltd., for providing us the opportunity to take part in this research. We sincerely thank Mr. Ohkubo N. and Mr. Wakisaka Y. for developing the wearable sensor node used in our experiments. We also thank Dr. Daribo Ismael, Mr. Jun Nishimura, and Mr. Hao Zhang for their helpful discussions and comments during this research. Finally, we gratefully acknowledge the anonymous reviewers, whose valuable comments and suggestions were very helpful in improving the presentation of this paper and guiding our future work.

Copyright information

© Springer-Verlag 2012