
1 Introduction

1.1 Background

In the ubicomp research community, estimating indoor context, such as users' indoor locations and activities and the conditions of an indoor environment, is one of the most important research tasks. Figure 1 shows examples of the context information that many ubicomp studies focus on, together with an example application based on that information. This context information can be roughly categorized into environmental conditions and user context, and user context can in turn be roughly categorized into user positions, user activities and user conditions. Such context information enables context-aware services such as a home automation system, which adaptively controls home appliances according to the activities of a user or changes in the environment, and a surveillance system for an elderly person living independently, which observes the person's activities or detects dangerous events such as falls. In Fig. 1, a context recognition system obtains an environmental condition, such as the open state of a window, and user context, such as the user running at coordinates \((x', y')\), and a home automation system then asks an air conditioner to direct airflow to \((x', y')\).

Typical existing approaches to indoor context recognition employ sensing devices, including special devices that emit signals such as RF signals and ultrasound [12], acceleration sensors attached to the wrist of a user [6] or to indoor everyday objects [14], and surveillance cameras [5]. However, these methods have the following problems: (1) the special devices are expensive in many cases, (2) the user must always wear the sensor devices, (3) the installation and maintenance costs of the sensing system are high, and (4) capturing camera images invades privacy and makes the user uncomfortable. With the spread of wireless communications, indoor context recognition using Wi-Fi signals is now attracting attention. Wi-Fi devices have become inexpensive and popular, and they are already installed in many indoor environments. Therefore, an indoor context recognition system using Wi-Fi signals can be installed inexpensively. Such a system employs signals transmitted and received by a smartphone carried by a user, or by computers and access points installed in the indoor environment. Because the propagation of Wi-Fi signals is affected by human movements and environmental changes, context information can be estimated from changes in the propagation information of Wi-Fi signals.

Fig. 1. Example of context information and applications

One type of propagation information used by many studies on indoor context recognition based on Wi-Fi signals is the received signal strength indicator (RSSI). RSSI is the signal strength observed by a receiver and is obtained as a real value for each Wi-Fi packet. Furthermore, owing to recent developments in wireless communication technologies, channel state information (CSI) has become available as propagation information of Wi-Fi signals. Wi-Fi communications standardized in IEEE 802.11n employ two characteristic techniques: multiple input and multiple output (MIMO), which uses multiple antennas for transmitting and receiving signals, and orthogonal frequency division multiplexing (OFDM), which uses multiple subcarriers with different frequencies. CSI describes the attenuation and phase information of Wi-Fi signals for each subcarrier in OFDM and for each pair of transmit and receive antennas in MIMO.

As mentioned above, the context information studied in indoor context recognition using Wi-Fi signals is roughly categorized into user positions and user activities. Some researchers have proposed methods for fall detection [16], gesture recognition [3] and detection of vital signs such as respiration [1] and heartbeats [2], as well as activity recognition [15], using Wi-Fi signals. The methods for indoor localization and activity recognition using Wi-Fi signals are categorized into device-bound methods and device-free passive methods. The device-bound methods require a user to carry a Wi-Fi device such as a smartphone. In contrast, the device-free passive methods employ only computers and access points installed in an environment and do not require a user to wear any device. A typical application of the device-bound methods for indoor localization is indoor navigation, because the user is assumed to carry a device such as a smartphone on which a navigation app can be installed. The device-bound methods for activity recognition that attempt to detect slight human movements such as respiration and heartbeats assume that a Wi-Fi device is located close to the human body, so that the human movements have a significant effect on Wi-Fi propagation. The device-free passive methods for indoor localization and activity recognition are useful for implementing home automation systems and surveillance systems, such as intrusion detection and dangerous action detection, because they sense transparently.

1.2 Motivation

In the existing methods for context recognition, the burden of installing the system and using it every day is large because the system employs expensive special devices or requires sensor devices to be attached to the user's wrist or to target objects. Because smartphones and inexpensive Wi-Fi devices are now widely available, we can reduce the burden on the user by using these devices in the following context recognition methods.

Indoor Localization. Many existing indoor localization methods employ camera images [5] or special devices that emit signals such as ultrasound [12]. However, capturing camera images invades the privacy of a user, and the special devices are typically expensive, which makes it difficult to install these localization systems in a home environment. In contrast, because Wi-Fi devices are inexpensive and in many cases are already installed in the environment, we can construct an inexpensive and privacy-aware indoor localization system using Wi-Fi signals.

However, in the prior studies on indoor localization using Wi-Fi signals, the user must collect training data, i.e., the propagation information of Wi-Fi signals, while recording his/her position as ground truth. In addition, these methods require a large amount of training data collected in the environment, resulting in a high installation cost for the localization system. Therefore, this study tries to reduce the burden of collecting training data in the environment by transferring training data collected in other environments for device-free passive localization.

Activity Recognition. In the existing methods for activity recognition, sensor devices such as a smartwatch attached to the wrist of a user place a burden on the user. In contrast, because activity recognition methods based on Wi-Fi signals employ a smartphone carried by a person or Wi-Fi devices installed in the environment, they place little burden on the user.

Prior studies on activity recognition using Wi-Fi signals employ frequency analysis of the propagation information to recognize activities with periodic movements such as walking, or employ anomaly detection techniques to detect falling. Meanwhile, many prior studies on gesture recognition using Wi-Fi signals require training data (i.e. Wi-Fi signals recorded while gestures are performed) collected at each position, and then compare the shapes of the collected signal waveforms to classify the gestures. Because the propagation information of Wi-Fi signals changes greatly when the position of the user changes, the accuracy of these gesture recognition methods degrades significantly when a test gesture is performed at a position different from the training positions. Therefore, this study investigates gesture recognition methods that are independent of the user position.

Estimating States of Indoor Everyday Objects. Many of the existing methods for estimating the states of indoor everyday objects, such as the open/close state of a door, assume that a sensor device is attached to each object [14]. However, the deployment and maintenance costs of these distributed sensing approaches are high because of the burden of replacing device batteries and faulty devices. In contrast, because Wi-Fi devices are already installed in the environment and state estimation using Wi-Fi signals can be performed with few devices, the deployment and maintenance costs of the Wi-Fi devices are considered to be low.

Although the effects on Wi-Fi signal propagation caused by the movement of indoor everyday objects differ from object to object, similar objects may be installed in the same environment. Accordingly, it is difficult to estimate the states of the objects by simple frequency analysis or by comparing the shapes of the waveforms of the propagation information. Furthermore, because the propagation information of Wi-Fi signals (i.e. CSI) describes combined multipath effects such as reflection and path loss, it is difficult to intuitively design classification features to be extracted from the CSI data and to verify their validity. Therefore, this study tries to estimate the states of indoor everyday objects precisely using deep learning techniques. Deep learning techniques enable us to extract meaningful features automatically even when the effects on Wi-Fi signals caused by different objects are easily confused.

1.3 Research Content

The goal of this study is to develop a practical context recognition system utilizing Wi-Fi signals. We propose methods that solve the problems of the prior studies for three important types of context information: indoor positions, states of indoor everyday objects and hand motion gestures. Information about the indoor positions of a user and the states of indoor everyday objects is useful for a home automation system that controls lighting, HVAC and home appliances. Moreover, a gesture recognition system enables easy and intuitive control of networked appliances. In this paper, we introduce our three studies on practical recognition of the context information mentioned above.

  1. Transferring Positioning Model for Device-free Passive Indoor Localization

    Recent studies on device-free passive indoor localization using Wi-Fi signals rely on the fact that the variance value of RSSI increases when a user passes between a signal transmitter and a receiver [11]. This study transfers training data (i.e. the variance values of RSSI with the known user coordinates) collected in other environments (source environments) to a target environment [8]. By doing so, we can reduce the burden related to collecting training data in the target environment.

  2. Detecting State Changes of Indoor Everyday Objects

    CSI is more fine-grained than RSSI because it includes multipath effects such as reflection and path loss of Wi-Fi signals. However, because Wi-Fi signals are reflected by various objects in an environment, it is difficult to manually model the effect of each object on CSI. Therefore, this study tries to precisely detect state changes of indoor everyday objects, such as open/close events of a door or a window, by automatically modeling the effect of each object on CSI using deep learning techniques [9].

  3. Position Independent Gesture Recognition

    The propagation information of Wi-Fi signals, such as CSI, changes depending on the positions of a user and a transmitter. Many prior studies on gesture recognition using Wi-Fi signals assume that training data are collected in the same situation as the test data [3]. This study tries to extract features that are independent of user positions from CSI obtained at a few training positions, and then to recognize gestures performed at any position in an environment [10].

2 Transferring Positioning Model for Device-Free Passive Indoor Localization

2.1 Background

Device-free passive indoor localization relies on the fact that Wi-Fi signals are affected by a human body. When multiple Wi-Fi devices are installed in an environment, the set of RSSI values obtained by the receivers depends on the position of a person. Therefore, the set of RSSI values can be regarded as a fingerprint of the position. However, because raw RSSI values change according to various environmental factors such as humidity, temperature and the positions of house furnishings, a recent device-free passive localization method [11] employs the variance of RSSI. When a person passes between a transmitter and a receiver, the variance of RSSI increases. Therefore, we employ the variance of RSSI for each position of a user as a fingerprint.
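As a minimal sketch of the variance-based fingerprint described above, the following code computes one RSSI variance per receiver over a recent window of readings. The receiver names, the window length and the RSSI values are hypothetical and are not taken from the experiments.

```python
from statistics import pvariance

def variance_fingerprint(rssi_by_receiver, window):
    """Return one RSSI variance per receiver over the last `window` samples.

    rssi_by_receiver maps a receiver name to a list of RSSI readings (dBm).
    """
    fingerprint = {}
    for receiver, readings in rssi_by_receiver.items():
        recent = readings[-window:]
        # Population variance of the most recent readings on this link.
        fingerprint[receiver] = pvariance(recent)
    return fingerprint

# Hypothetical readings: the rx1 link is crossed by a person, the rx2 link is not.
readings = {
    "rx1": [-50, -58, -47, -61, -49, -57],
    "rx2": [-60, -60, -61, -60, -61, -60],
}
fp = variance_fingerprint(readings, window=6)
```

In this toy input the crossed link shows a much larger variance than the undisturbed link, which is the property the fingerprint relies on.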

Because the fingerprinting approach relies on machine learning techniques, the procedure consists of a training phase and a test phase. In the training phase, we obtain fingerprints (i.e. the sets of RSSI variance values observed when a user is at known coordinates), and the obtained fingerprints are then used as training data to learn an indoor positioning model. The positioning model estimates the user coordinates from the RSSI observed by the receivers. However, to learn the indoor positioning model, we must collect labeled training data at many positions in the environment. Collecting such training data in a user's house is very costly and impractical because the user has to input his/her coordinates at many training points. In this study, we try to construct an indoor positioning model for a target environment without using any labeled training data obtained in the target environment, by transferring training data from other environments (source environments) to the target environment. By doing so, we can easily construct a positioning model for any environment by reusing labeled training data obtained in advance from several source environments.

2.2 Proposed Method

We assume that one transmitter and multiple receivers are installed in each of the source and target environments, and that the floor plans of the environments, including the device positions, are given. First, for each transmitter and receiver pair in a source environment, we learn a variance model that describes the relationship between the RSSI variance and the position at which a person crosses the line segment connecting the transmitter and the receiver. Then, the variance model is transferred to a pair in the target environment whose RSSI characteristics appear similar to those of the pair in the source environment. Finally, we learn a model for detecting a person who passes between a transmitter and a receiver, and also learn models for estimating the coordinates of the user on the line segment. Using the outputs of these models, we track a person in the target environment with a particle filter.

Learning Variance Model. Because we found that signal characteristics change significantly between regions separated by walls, a variance model is constructed for each separated region (sub-line segment). We employ a mixture of two Gaussian functions as the variance model. The peaks of the model (i.e. the means of the Gaussian functions) correspond to the end points of the sub-line segment, which are the positions of the transmitter and the receiver (or the walls). The other model parameters (i.e. the variance and weight of each Gaussian function) are computed by least-squares approximation using the positions on the sub-line segment at which the person passes and the corresponding RSSI variance values.
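The fitting step can be illustrated with the sketch below, which fixes the two Gaussian means at the end points of a sub-line segment and estimates the remaining parameters by least squares. The optimization strategy is an assumption made for illustration (a grid search over the two widths with a linear least-squares solve for the weights); the text does not specify how the least-squares problem is solved, and the function names, grid values and synthetic data are hypothetical.

```python
import numpy as np

def gaussian(x, mean, sigma):
    """Unnormalized Gaussian bump centred at `mean`."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

def fit_variance_model(positions, variances, length, sigmas):
    """Fit v(x) = w1*G(x; 0, s1) + w2*G(x; length, s2) by least squares.

    The means are fixed at the segment end points. Each (s1, s2) pair on the
    grid is tried, and the weights are solved linearly for that pair.
    """
    x = np.asarray(positions, dtype=float)
    y = np.asarray(variances, dtype=float)
    best = None
    for s1 in sigmas:
        for s2 in sigmas:
            basis = np.column_stack([gaussian(x, 0.0, s1), gaussian(x, length, s2)])
            weights, *_ = np.linalg.lstsq(basis, y, rcond=None)
            error = float(np.sum((basis @ weights - y) ** 2))
            if best is None or error < best[0]:
                best = (error, {"w1": weights[0], "w2": weights[1], "s1": s1, "s2": s2})
    return best[1]

def predict_variance(model, x, length):
    """Evaluate the two-Gaussian variance model at positions x."""
    x = np.asarray(x, dtype=float)
    return (model["w1"] * gaussian(x, 0.0, model["s1"])
            + model["w2"] * gaussian(x, length, model["s2"]))

# Synthetic check: recover known parameters on a 4 m sub-line segment.
length = 4.0
true_model = {"w1": 6.0, "w2": 4.0, "s1": 0.5, "s2": 1.0}
xs = np.linspace(0.0, length, 41)
ys = predict_variance(true_model, xs, length)
model = fit_variance_model(xs, ys, length, sigmas=[0.25, 0.5, 1.0, 2.0])
```

Fixing the means reduces the fit to two widths and two weights, so a coarse grid over the widths is enough for a sketch; a general nonlinear least-squares solver could be substituted.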

Transferring Variance Model. We first select sub-line segments in the source environments whose characteristics are similar to those of the sub-line segment in the target environment. The selection is based on the following criteria: (1) the length of the sub-line segment, (2) the distribution of RSSI when there is no person in the environment and (3) the distribution of RSSI variance values when a person walks randomly. We then transfer the variance model to the sub-line segment in the target environment by taking a weighted average of the model parameters based on the above criteria.
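The weighted averaging can be sketched as follows. How the three criteria are combined into a single weight is not stated in the text, so the sketch simply assumes that one non-negative similarity score per selected source segment is already available; the parameter names and numbers are hypothetical.

```python
def transfer_variance_model(source_models, similarities):
    """Average source model parameters, weighting each model by its similarity."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("at least one similarity must be positive")
    keys = source_models[0].keys()
    return {
        key: sum(s * m[key] for m, s in zip(source_models, similarities)) / total
        for key in keys
    }

# Two hypothetical source models; the first is three times as similar to the target.
sources = [
    {"w1": 6.0, "w2": 4.0, "s1": 0.5, "s2": 1.0},
    {"w1": 2.0, "w2": 8.0, "s1": 1.5, "s2": 1.0},
]
target_model = transfer_variance_model(sources, similarities=[3.0, 1.0])
```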

Constructing Positioning Model. We construct passing detection models and positioning models for the target environment based on the transferred variance models. A passing detection model for a sub-line segment uses an SVM on the variance values to detect whether or not a user crosses the sub-line segment at a certain time. When the passing detection model detects a crossing, the positioning model for that sub-line segment estimates the point at which the person crossed, based on the transferred variance model.

Fig. 2. Experimental environment

Fig. 3. Accuracy for each method

2.3 Evaluation

Data Set. A participant walked for 20 min in each of four environments where one transmitter and 10 receivers are installed (Fig. 2). We evaluated our method based on one-environment-out cross validation where three environments are source environments and one environment is a target environment.

Evaluation Methodology. We compared our method with a method trained on labeled sensor data obtained in the same environment (Supervised) and a method that selects a variance model from the source environments at random in the transfer phase (Random).

Accuracy of Tracking. Mean distance errors are shown in Fig. 3. Our method achieved an average positioning error of 1.71 m, and the difference between the error of Supervised and that of our method was only about 0.08 m. Furthermore, our method reduced the error by about 0.64 m compared to Random. These results confirm the effectiveness of the proposed method.
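The evaluation protocol can be sketched as below: each environment in turn serves as the target while the others serve as sources, and accuracy is reported as the mean Euclidean distance between estimated and true coordinates. The environment names and the coordinates are placeholders, not experimental data.

```python
import math

def leave_one_environment_out(environments):
    """Yield (source_environments, target_environment) for every fold."""
    for target in environments:
        yield [env for env in environments if env != target], target

def mean_distance_error(estimated, ground_truth):
    """Average Euclidean distance (m) between estimated and true coordinates."""
    distances = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return sum(distances) / len(distances)

folds = list(leave_one_environment_out(["env1", "env2", "env3", "env4"]))
error = mean_distance_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```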

3 Detecting State Changes of Indoor Everyday Objects

3.1 Background

To detect state changes of indoor everyday objects such as door open/close events, many existing methods assume that a sensor device is attached to each indoor object. However, this distributed sensing approach is expensive to deploy because a sensor device must be attached to each indoor object and these devices must be maintained, e.g., by replacing batteries and faulty devices. Therefore, this study proposes a method for estimating the states of indoor everyday objects without using distributed sensors attached to the objects. We employ a commodity Wi-Fi access point as a transmitter of Wi-Fi signals and a computer equipped with a commodity Wi-Fi module as a receiver installed in the environment.

Owing to recent developments in wireless communication techniques, we can obtain CSI, one type of propagation information of Wi-Fi signals, from some advanced Wi-Fi network interface cards (NICs) such as the Intel 5300 Wi-Fi NIC. CSI is an \(N_t \times N_r \times N_s\)-dimensional complex matrix, where \(N_t\), \(N_r\) and \(N_s\) are the numbers of transmit antennas, receive antennas and subcarriers, respectively. The absolute values and angles of the CSI elements describe the attenuation and the phase shift of Wi-Fi signals, respectively.
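As a small illustration of this representation, the sketch below builds a random complex array with the dimensions used later in the evaluation (\(N_t = 2\), \(N_r = 3\), \(N_s = 30\)) and extracts the amplitude and phase of each element. The random values stand in for a measured CSI sample and carry no physical meaning.

```python
import numpy as np

n_t, n_r, n_s = 2, 3, 30  # transmit antennas, receive antennas, subcarriers
rng = np.random.default_rng(0)

# Placeholder for one measured CSI sample: a complex N_t x N_r x N_s array.
csi = rng.standard_normal((n_t, n_r, n_s)) + 1j * rng.standard_normal((n_t, n_r, n_s))

amplitude = np.abs(csi)   # attenuation per antenna pair and subcarrier
phase = np.angle(csi)     # phase shift in radians, in (-pi, pi]
```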

However, because CSI describes combined multipath effects, it is difficult to intuitively design features to be extracted from CSI. Therefore, we apply feature learning approaches based on deep neural networks (DNNs). Specifically, we design a novel DNN architecture consisting of convolutional layers and long short-term memory (LSTM) layers. The convolutional layers learn meaningful features that take into account the correlation of CSI in each channel, and the LSTM layers learn the temporal dynamics of an event. Furthermore, because we attempt to recognize the events of multiple indoor objects, CSI contains the mixed effects caused by those objects. To separate the mixed effects, we employ independent component analysis (ICA). Moreover, to further improve the event recognition accuracy, we exploit knowledge about the event/state transitions of an object (e.g. a door open event occurs only when the door is in a closed state) using hidden Markov models (HMMs).

3.2 Proposed Method

We assume that a transmitter and a receiver are installed in a room. First, we decompose the amplitude and phase information of the CSI data using ICA. Next, the original amplitude, original phase, decomposed amplitude and decomposed phase time-series data are fed into a DNN to extract meaningful features. Finally, the extracted features are input into HMMs to smooth the classification result and to exploit the knowledge about the event/state transitions of an object. Our method estimates, at each time slice, whether each object is in an “open” event, a “close” event, an “opened” state or a “closed” state.

Decomposition Using ICA. We decompose \(N_t N_r\)-dimensional time-series data (i.e. the amplitude or phase information of the CSI data for each subcarrier) using ICA to separate the effects on the CSI data caused by each object. To capture minute changes in signals caused by an object of interest, we construct an unmixing matrix tailored to the object. From labeled training data, we extract time-segments while events of the object occur and use only the extracted time-segments to compute the unmixing matrix tailored to the object.

Feature Extraction Using DNN. Our network consists of three convolutional layers and two LSTM layers. The input of the network is a time series of amplitude (or phase) values within a time window of width \(W_T\) (i.e. an \(N_t N_r N_s \times W_T\)-dimensional matrix). In the training phase, the outputs of the LSTM layers are fed into a softmax layer that outputs class probabilities.
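The shape of the network input can be made concrete with the following sketch, which slices a CSI amplitude time series into windows of size \(N_t N_r N_s \times W_T\). The window width, the stride and the function name are hypothetical; the text does not state the values used.

```python
import numpy as np

def make_windows(csi_amplitude, window_width, stride):
    """Slice a (time, N_t, N_r, N_s) array into (N_t*N_r*N_s, W_T) windows."""
    num_packets = csi_amplitude.shape[0]
    flat = csi_amplitude.reshape(num_packets, -1)  # (time, N_t*N_r*N_s)
    windows = [
        flat[start:start + window_width].T         # (N_t*N_r*N_s, W_T)
        for start in range(0, num_packets - window_width + 1, stride)
    ]
    return np.stack(windows)

# Placeholder series: 1,000 packets with N_t = 2, N_r = 3, N_s = 30.
amplitude = np.arange(1000 * 2 * 3 * 30, dtype=float).reshape(1000, 2, 3, 30)
windows = make_windows(amplitude, window_width=100, stride=50)
```

Each element of `windows` is one candidate input matrix for the convolutional layers.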

Classification Using HMM. We prepare left-to-right HMMs for each event/state of the objects to estimate the events/states of the objects from the DNN outputs. In HMM decoding, we use the knowledge about the event/state transitions to prohibit impossible transitions, such as a door “open” event occurring when the door is already in an “opened” state.
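The effect of prohibiting impossible transitions can be illustrated with a simplified decoder. The sketch below is not the left-to-right HMM configuration described above: it assumes a single four-state chain per object, treats the per-frame class probabilities as observation likelihoods, and uses a hypothetical self-transition probability. It only shows how a transition mask constrains Viterbi decoding.

```python
import numpy as np

STATES = ["closed", "open", "opened", "close"]
# ALLOWED[i, j] is True when state i may be followed by state j.
ALLOWED = np.array([
    [1, 1, 0, 0],   # closed -> closed, or an "open" event starts
    [0, 1, 1, 0],   # open event -> continues, or the door becomes opened
    [0, 0, 1, 1],   # opened -> opened, or a "close" event starts
    [1, 0, 0, 1],   # close event -> continues, or the door becomes closed
], dtype=bool)

def viterbi(class_probs, allowed, stay_prob=0.9):
    """Most likely state sequence given per-frame class probabilities."""
    num_steps, num_states = class_probs.shape
    trans = np.where(np.eye(num_states, dtype=bool), stay_prob, 1.0 - stay_prob)
    with np.errstate(divide="ignore"):
        log_trans = np.where(allowed, np.log(trans), -np.inf)  # mask impossible moves
        log_obs = np.log(class_probs)
    score = log_obs[0].copy()
    back = np.zeros((num_steps, num_states), dtype=int)
    for t in range(1, num_steps):
        candidates = score[:, None] + log_trans   # indexed [previous, current]
        back[t] = np.argmax(candidates, axis=0)
        score = candidates[back[t], np.arange(num_states)] + log_obs[t]
    path = [int(np.argmax(score))]
    for t in range(num_steps - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [STATES[i] for i in reversed(path)]

# The third frame spuriously favours "opened" while the door stays closed.
probs = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.30, 0.05, 0.60, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.85, 0.05, 0.05, 0.05],
])
decoded = viterbi(probs, ALLOWED)
```

Because a direct jump from "closed" to "opened" is masked out, the isolated misclassification is smoothed away in this toy sequence.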

Fig. 4. Experimental environment

Fig. 5. Classification accuracy for each method

3.3 Evaluation

Data Set. We collected data in three real environments in which six indoor everyday objects are installed, as shown in Fig. 4. In each environment, we installed an access point as a transmitter and a PC with the Intel 5300 NIC as a receiver. A modified NIC driver developed by Halperin et al. [4] was installed on the PC to collect CSI data. The transmitter has two antennas and the receiver has three antennas, i.e., \(N_t = 2\) and \(N_r = 3\). In addition, the number of subcarriers is 30, i.e., \(N_s = 30\). We sent UDP packets at a rate of approximately 1,000 Hz to obtain CSI. In each environment, a participant conducted 150 sessions of data collection. Throughout a session, the participant used all objects so that each event of the objects occurred once in an arbitrary order. We randomly selected 90% of the sessions as training sessions and used the remaining sessions as test sessions.

Evaluation Methodology. We estimated the states of each object every 0.1 s to evaluate the performance of our method. We prepared the following methods to investigate the effectiveness of the ICA, the knowledge of the objects, and the HMMs.

  • w/o ICA: This method does not use time-series data obtained by ICA. It simply uses amplitude and phase time-series data extracted from the raw CSI data.

  • w/o knowledge: This method does not use the knowledge about the event/state transitions of each object.

  • w/o HMM: This method does not use HMMs. We simply use the classification results of the softmax layer.

Classification Accuracy. Figure 5 shows the classification accuracy for each method in the three environments. When we do not use ICA, the F-measures decrease by about 1%–5%. Moreover, when we do not use the knowledge of the objects, the F-measures decrease by about 8%–15%, and when we do not use HMMs, the F-measures decrease greatly, by about 10%–20%. These results confirm that the ICA, the knowledge of the objects, and the HMMs are effective for estimating the states of the objects.

4 Position Independent Gesture Recognition

4.1 Background

Existing gesture recognition methods employ a depth camera such as Microsoft Kinect [13] or a wearable acceleration sensor such as a smart watch attached to the wrist of a user [6]. However, the depth camera approach has a problem of limited sensing area and the wearable approach requires a user to wear a wristwatch device.

In this study, we try to recognize hand gestures using neither a camera device nor a hand-worn sensor device. We assume that a user carries a commodity smartphone in, for example, his/her chest pocket, and we recognize hand gestures based on the CSI of the signals transmitted by the smartphone, which are affected by movements of the hand. We obtain CSI using a computer equipped with a commodity Wi-Fi module that is installed in the environment and communicates with the smartphone.

In this study, we investigate gesture recognition independent of the user position using CSI. The propagation of Wi-Fi signals is greatly affected by the position of the user and the direction of the user body because the transmitter (smartphone) is carried by the user. Therefore, we try to extract the component corresponding to the velocity of hand movements from CSI based on the Doppler shift, which can be a feature independent of the user position, and investigate the effectiveness of the feature for gesture recognition.

However, it is difficult to estimate the Doppler shift directly from the phase components of CSI because the transmitter and the receiver are not precisely synchronized and the bandwidth of Wi-Fi signals is much wider than the range of the Doppler shift caused by human movements. A recent study [7] computes the Doppler velocity corresponding to the velocity of a target object from CSI using the multiple signal classification (MUSIC) algorithm. This method relies on the fact that, when the target moves, the path length of the signals reflected by the target changes, and so do the phase components. Therefore, the velocity component of the moving target can be computed from the difference in the phase components over several packets. We attempt to use this method to extract features corresponding to the hand velocity.

4.2 Methodology

Our method computes the velocity components of the hand movement from CSI and recognizes hand motion gestures using HMMs. We use the tethering function of a smartphone to obtain the CSI of signals transmitted from the smartphone. The tethering function of many smartphones uses only one antenna for transmitting and receiving signals. Therefore, we extract the velocity components from CSI obtained from the single antenna.

Computing Doppler Velocity. We extract the Doppler velocity corresponding to the hand velocity from CSI based on the method proposed in [7]. First, we apply conjugate multiplication between the elements of CSI for each packet. In [7], the authors multiply the CSI of two receive antennas. In contrast, because the tethering function of the smartphone uses only one receive antenna, we multiply the CSI of adjacent subcarriers. Next, we subtract the CSI averaged within a certain time window, which is regarded as the static component, to remove that component. Finally, we estimate the Doppler velocity from several Wi-Fi packets using the MUSIC algorithm. However, because the whole hand moves in hand motion gestures, we obtain many values corresponding to the velocities of the various parts of the hand. Therefore, we build a pseudo spectrogram from the pseudospectra computed by the MUSIC algorithm for each sliding time window.
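The three steps above can be sketched on synthetic data as follows. This is a textbook MUSIC frequency estimator applied to one adjacent-subcarrier product, not the exact procedure of [7]; the model order, the assumed number of sources, the frequency grid, and all signal parameters are hypothetical. The pseudospectrum is expressed over Doppler frequency, which relates to the rate of path-length change through the carrier wavelength.

```python
import numpy as np

def adjacent_conjugate_product(csi):
    """csi: (packets, subcarriers). Multiply each subcarrier by the conjugate
    of its neighbour, which cancels a phase offset common to all subcarriers."""
    return csi[:, :-1] * np.conj(csi[:, 1:])

def remove_static(signal):
    """Subtract the time average over the window, treated as the static part."""
    return signal - signal.mean(axis=0, keepdims=True)

def music_pseudospectrum(x, num_sources, order, freqs, sample_rate):
    """MUSIC pseudospectrum of a complex series x over candidate frequencies (Hz)."""
    # Overlapping snapshots of length `order` form the sample covariance matrix.
    snapshots = np.array([x[i:i + order] for i in range(len(x) - order + 1)])
    covariance = snapshots.T @ snapshots.conj() / len(snapshots)
    _, eigvecs = np.linalg.eigh(covariance)          # eigenvalues ascending
    noise = eigvecs[:, : order - num_sources]        # noise subspace
    n = np.arange(order)
    steering = np.exp(2j * np.pi * np.outer(freqs, n) / sample_rate)
    projection = steering.conj() @ noise             # a(f)^H E_n for each f
    return 1.0 / np.sum(np.abs(projection) ** 2, axis=1)

# Synthetic CSI: static paths plus one reflection with a 20 Hz Doppler shift,
# and a random per-packet phase offset modelling the lack of synchronization.
rng = np.random.default_rng(0)
sample_rate, doppler = 200.0, 20.0
t = np.arange(200) / sample_rate
static = np.array([1.0 + 0.5j, 0.8 - 0.3j])
dynamic = 0.2 * np.exp(2j * np.pi * doppler * t)[:, None] * np.array([1.0, np.exp(0.3j)])
offset = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(len(t), 1)))
csi = (static + dynamic) * offset
csi += 1e-3 * (rng.standard_normal(csi.shape) + 1j * rng.standard_normal(csi.shape))

x = remove_static(adjacent_conjugate_product(csi))[:, 0]
freqs = np.arange(-50.0, 51.0)
spectrum = music_pseudospectrum(x, num_sources=2, order=20, freqs=freqs,
                                sample_rate=sample_rate)
peak = freqs[np.argmax(spectrum)]
```

In this simplified setting the product contains components at both the positive and negative Doppler frequency, so two sources are assumed and only the magnitude of the peak frequency is meaningful. Repeating the estimate over sliding windows and stacking the pseudospectra yields a pseudo spectrogram.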

Classification Using HMM. For each time slice of this pseudo spectrogram, we compute the variance, the maximum value and the kurtosis as classification features, and then classify gestures using a 10-state left-to-right HMM prepared for each gesture.
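A minimal sketch of the per-slice feature computation is given below. It assumes the pseudo spectrogram is stored as a (time slices × velocity bins) array and uses the Pearson definition of kurtosis (fourth standardized moment); the text does not state which kurtosis convention is used, and the sample values are placeholders.

```python
import numpy as np

def spectrogram_features(spectrogram):
    """Return (time_slices, 3): variance, maximum and kurtosis of each slice."""
    mean = spectrogram.mean(axis=1, keepdims=True)
    centered = spectrogram - mean
    variance = (centered ** 2).mean(axis=1)
    maximum = spectrogram.max(axis=1)
    # Fourth standardized moment; defined as 0 for a perfectly flat slice.
    with np.errstate(divide="ignore", invalid="ignore"):
        kurtosis = np.where(variance > 0,
                            (centered ** 4).mean(axis=1) / variance ** 2,
                            0.0)
    return np.column_stack([variance, maximum, kurtosis])

# Placeholder spectrogram: one peaked slice and one flat slice.
spectrogram = np.array([[0.0, 0.0, 4.0, 0.0],
                        [1.0, 1.0, 1.0, 1.0]])
features = spectrogram_features(spectrogram)
```

The resulting feature sequence, one row per time slice, is what a per-gesture HMM would consume.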

4.3 Evaluation

Data Set. A PC equipped with the Intel 5300 NIC was installed in our experimental environment as a receiver, and a participant carried a smartphone in front of the chest. The smartphone was connected to the PC using the tethering function with a center frequency of 5.2 GHz, and sent Wi-Fi packets at a rate of approximately 200 Hz. The participant stood facing the computer at distances of 2 m, 4 m and 6 m, and performed 10 sessions, each consisting of 6 kinds of hand gestures performed 10 times: moving the hand “up”, “down”, “left”, “right”, “clockwise” and “anticlockwise”.

Classification Accuracy. We investigated the effect of the velocity components on position-independent recognition using leave-one-position-out cross-validation, in which data collected at two positions are used as training data and data collected at the remaining position are used as test data. The average accuracy and F-measure were poor: 42.4% and 37.4%, respectively. This may be because the amplitudes of the spectrograms differ between positions. Therefore, we believe that adjusting the amplitude components for each position would improve the recognition accuracy in the position-independent setting.

5 Conclusion

In this paper, we introduced our three studies on context recognition using Wi-Fi signals: indoor localization, state estimation of indoor objects and gesture recognition. We proposed an indoor localization method with low installation cost and an accurate state estimation method for indoor objects. In addition, we introduced our ongoing study on gesture recognition independent of user positions.