Introduction

Human authentication has long been a focus of attention in the security field. Authentication and identity camouflage are like two opposing armies that continually upgrade their equipment, each proposing new solutions in response to the other's technological advances. Outdated authentication often leads to serious economic losses, a loss of credibility, and even more fatal consequences.

In recent years, improvements in computing power and the development of related methods have increased both the diversity and the accuracy of human authentication, gradually meeting a wide range of personal and organizational privacy-protection needs.

Knowledge-based authentication, built on "something people know" [1], has been widely used for decades, including common access cards [2, 3] and username–password combinations. A 4- or 6-digit PIN code or a common access card makes it difficult for a perpetrator to impersonate someone else on short notice and without preparation. However, several studies [4] have shown that knowledge-based credentials such as PIN codes are difficult to remember and vulnerable to attack [5, 6]. Common access cards are easily stolen or lost [3], and usernames and passwords are easily disclosed and misused. Moreover, about one-fifth [4] of people prefer to store all of their personal information on a single device, significantly increasing both the risk and the harm of information leakage.

Biometric authentication, built on "something people are", greatly alleviates the above problems. Biometrics are usually divided into bio-physiological and bio-behavioral according to the nature of the human feature. Bio-physiological authentication recognizes a person's identity by measuring physical features of the body, including hand-region features [7, 8], facial-region features [9, 10], and ocular features [11, 12]. It does not require people to remember anything, but it does require their cooperation in providing a sufficient amount of feature data. It is therefore usually applied where people provide personal information voluntarily, or are compelled to provide it under supervision, such as in police stations and banks. Bio-behavioral authentication, in contrast, is well suited to unsupervised settings. It identifies people by detecting and learning biological behaviors such as habitual personal movements, including voice recognition [13, 14], gait recognition [15, 16], keystroke dynamics [17], and signatures [18, 19]. Because each person's physical characteristics (e.g., height, weight, arm length, muscle and throat development) and behavioral habits (e.g., keystroke speed and strength, step length) differ, bio-behavioral authentication can identify people with acceptable accuracy. Compared with knowledge-based credentials, bio-behavioral traits cannot be lost or stolen. Compared with bio-physiological authentication, bio-behavioral authentication requires less hardware and can operate with low-cost sensors. It also offers higher concealment [20, 21] and is harder to detect.

However, whether in dataset collection or in method deployment, one fact cannot be ignored: in most real-life scenarios, the person being measured does not provide such ample authentication information. Take the most common example: stealing is often a quick, precise movement completed within 1–5 s. The thief's face is not fully exposed to the camera, and there is no interaction with the phone's screen or buttons. Similar scenarios, such as an intruder opening a door, likewise yield neither sufficient data nor a rich variety of features for authentication.

This leads to two important questions and challenges:

  1. Can a simple movement that is completed in a short time and generates only a small amount of data be used for authentication?

  2. Can simple, commonly used sensors capture these very short-term movements?

In response to these challenges, we compress the time and complexity of the authentication movement to an unprecedented degree. At the same time, we design a variety of method structures to process the data and analyze them in detail.

In this work, we first model the authentication movement mathematically in terms of human joints and bones and construct the corresponding model function. Based on the function's parameters, we analyze the authentication movement in detail and design the sensor-selection method.

Then, because no existing public dataset could support our work, we collected a 110-person movement dataset for the designed authentication movement. The participants had no prior knowledge: they did not know the purpose of the data before collection, and were informed of its use only afterward, at which point we obtained their authorization. This ensures that the collected data reflect the participants' everyday habits.

Finally, since no previous authentication method has been based on data of the same order of magnitude as our movement data, we designed different deep neural network structures based on a variety of common backbones, combined with several machine learning classification methods, for comparison and analysis of the results.

We note that a shorter conference version [22] of this paper appeared at the ACM Turing Celebration Conference (2020). That work neither analyzed the authentication movement mathematically nor clarified the connection between the movement, the mathematical model, and the sensors. Compared with it, we greatly compress the movement complexity, reduce the number of sensors, add four additional cells, and add a cell evaluation module.

The remainder of the paper is organized as follows. The next section reviews related work in bio-behavioral authentication. We then analyze the mathematical model and design the authentication movement, after which we introduce our proposed method, RaiseAuth. In the penultimate section, we analyze the model's performance and its resistance to attack. The final section concludes this work.

Related work

In this section, we summarize the recent research community's efforts in bio-behavioral authentication and the corresponding measurement studies.

A number of bio-behavioral authentication methods have been reported during the last decade [1, 23]. Hong [24] collected the sensor data generated by the human hand while writing, using a special watch containing an acceleration sensor, and used these data as the basis for authentication. Each participant was required to write 20 words and approximately 120 strokes, taking approximately three minutes to generate one training sample. Langyue [25] proposed a novel feature extractor that achieved 97% authentication accuracy based on gait behavior, where each person needed to provide more than 33 min of walking data for training. Timothy [26] authenticated people through the coherent movements of continuous typing, and designed the method on a training dataset of over 5000 keystrokes per person. These methods often achieve accuracies above 95% and have good application prospects in specific situations. However, the datasets they require are clearly huge, and collecting enough data from each person to train the model and improve accuracy is costly, which makes these methods unsuitable for short, simple movement data.

Reducing the complexity and data scale of the authentication movement is important, because each order-of-magnitude decrease brings a new set of unforeseen challenges. Several studies [27,28,29,30] have also tried to achieve authentication with simple movements. In [31], the authors used the combined movement a user makes when answering the phone, i.e., unlocking the phone and raising it to the ear, as the authentication movement. They achieved a 5% Equal Error Rate (EER) on a self-built 48-person dataset, based on a training set in which each person performed ten movements within 6000 ms. Jakub [32] collected 40 min of phone-holding behavior data per participant and proposed a method based on Multilayer Perceptron (MLP) and Convolutional Neural Network (CNN) machine learning models; the trained model achieves 8% EER on 20-s test samples. Although these methods and systems have contributed to compressing the amount of data, there is still room for improvement, for two reasons. First, they are all purely data-driven and do not analyze or explain the authentication movement itself, leaving the correlation between the authentication movement and the authentication method unclear. Our method fills this gap by analyzing in detail the mathematical relationships among movements, sensors, and authentication. Second, even with these simplifications, the amount of data they require for a single training sample is still more than 2.5 times ours.

In contrast, RaiseAuth uses a simpler authentication movement, establishes a mathematical model to explain the rationality of the movement, and comprehensively analyzes the model structure and authentication results.

Fig. 1 Human arm joints and bones mathematical model

Movement and sensors

In this section, we build a mathematical model of human joints and bones. By analyzing the relevant parameters of the model functions, we design the authentication movement and explain the reasons for its selection. We also design the sensor-selection method according to the analysis results.

Mathematical analysis

Most recognizable human behaviors depend on the operation of the limbs. Compared with the trunk, the limbs have fewer bones but greater flexibility. The muscles ensure the limbs' normal operation, while the bones and joints provide a natural entry point for building mathematical models.

We chose the right arm among the limbs as the model-building template (Fig. 1), where point O is the shoulder joint, point M is the elbow, and point N is the wrist. We take point O as the coordinate origin and place the bone and joint model in the spatial coordinate system. We assume that the initial state of the arm is \(O-A_1-B_1\), the end state of the arm movement is \(O-A_2-B_2\), and the arm state \(O-M-N\) is the intermediate state at time \(t(t\in [0,T])\) during the movement. Note that the arm's trajectory from state \(O-A_1-B_1\) to state \(O-A_2-B_2\) is not necessarily a straight line, but more likely an irregular trajectory.

Set point \(A_1(a,0,0)\) and point \(B_1(a+b,0,0)\). At time t, the elbow is located at point \(M(x_1,y_1,z_1)\), and the rotation angle of the upper arm \(O-M\) is:

$$\begin{aligned} \overrightarrow{\theta _1}=\overrightarrow{\theta _1}(t)=(\theta _{1x}(t),\theta _{1y}(t),\theta _{1z}(t)), \end{aligned}$$
(1)

the angular velocity is:

$$\begin{aligned} \overrightarrow{\omega _1}=\overrightarrow{\omega _1}(t)=(\omega _{1x}(t),\omega _{1y}(t),\omega _{1z}(t)). \end{aligned}$$
(2)

For the forearm \(M-N\), set point \(N(x_2,y_2,z_2)\); relative to the upper arm, the rotation angle is:

$$\begin{aligned} \overrightarrow{\theta _2}=\overrightarrow{\theta _2}(t)=(\theta _{2x}(t),\theta _{2y}(t),\theta _{2z}(t)), \end{aligned}$$
(3)

the angular velocity is:

$$\begin{aligned} \overrightarrow{\omega _2}=\overrightarrow{\omega _2}(t)=(\omega _{2x}(t),\omega _{2y}(t),\omega _{2z}(t)). \end{aligned}$$
(4)

Then

$$\begin{aligned} \overrightarrow{\theta _1}=\int _{0}^{t}\overrightarrow{\omega _{1}(t)}\textrm{d}t=\left( \int _{0}^{t}\omega _{1x}(t)\textrm{d}t,\int _{0}^{t}\omega _{1y}(t)\textrm{d}t,\int _{0}^{t}\omega _{1z}(t)\textrm{d}t\right) , \end{aligned}$$
(5)
$$\begin{aligned} \overrightarrow{\theta _2}=\int _{0}^{t}\overrightarrow{\omega _{2}(t)}\textrm{d}t=\left( \int _{0}^{t}\omega _{2x}(t)\textrm{d}t,\int _{0}^{t}\omega _{2y}(t)\textrm{d}t,\int _{0}^{t}\omega _{2z}(t)\textrm{d}t\right) . \end{aligned}$$
(6)

When \(t=T\), set point \(A_2(c_1,d_1,e_1)\) and point \(B_2(c_2,d_2,e_2)\). As the arm changes state from \(O-A_1-B_1\) to \(O-A_2-B_2\), the trajectory traversed by the elbow M is \(L_1={A_1}{A_2}\), and the trajectory traversed by the wrist N is \(L_2={B_1}{B_2}\). Since the upper arm \(O-M\) and the forearm \(M-N\) are two rigid bones, the points on segments OM and MN move concentrically with the same angular velocities as the end points M and N, respectively. Therefore, the movement state of the arm can be represented by the motion trajectories of points M and N alone, giving the movement state function

$$\begin{aligned} Y=L_1+L_2. \end{aligned}$$
(7)

Then

$$\begin{aligned} {\left\{ \begin{array}{ll} x_1=a\cos \theta _{1x}\\ y_1=a\cos \theta _{1y}\\ z_1=a\cos \theta _{1z} \end{array}\right. }, \end{aligned}$$
(8)
$$\begin{aligned} {\left\{ \begin{array}{ll} x_2=x_1+b\cos (\theta _{1x}+\theta _{2x})\\ y_2=y_1+b\cos (\theta _{1y}+\theta _{2y})\\ z_2=z_1+b\cos (\theta _{1z}+\theta _{2z}) \end{array}\right. }, \end{aligned}$$
(9)

calculate the differential:

$$\begin{aligned} {\left\{ \begin{array}{ll} dx_1=-a\sin \theta _{1x}\cdot \omega _{1x}(t)\textrm{d}t\\ dy_1=-a\sin \theta _{1y}\cdot \omega _{1y}(t)\textrm{d}t\\ dz_1=-a\sin \theta _{1z}\cdot \omega _{1z}(t)\textrm{d}t \end{array}\right. }, \end{aligned}$$
(10)
$$\begin{aligned} {\left\{ \begin{array}{ll} dx_2=[-a\sin \theta _{1x}\cdot \omega _{1x}-b\sin (\theta _{1x}+\theta _{2x})(\omega _{1x}+\omega _{2x})]\textrm{d}t\\ dy_2=[-a\sin \theta _{1y}\cdot \omega _{1y}-b\sin (\theta _{1y}+\theta _{2y})(\omega _{1y}+\omega _{2y})]\textrm{d}t\\ dz_2=[-a\sin \theta _{1z}\cdot \omega _{1z}-b\sin (\theta _{1z}+\theta _{2z})(\omega _{1z}+\omega _{2z})]\textrm{d}t \end{array}\right. }, \end{aligned}$$
(11)

then

$$\begin{aligned} L_1&=\int _{0}^{T}\sqrt{(dx_1)^2+(dy_1)^2+(dz_1)^2} \nonumber \\&=a\int _{0}^{T}\sqrt{(\sin \theta _{1x}\cdot \omega _{1x})^2+(\sin \theta _{1y}\cdot \omega _{1y})^2+(\sin \theta _{1z}\cdot \omega _{1z})^2}\textrm{d}t \end{aligned}$$
(12)
$$\begin{aligned} L_2=\int _{0}^{T}\sqrt{(dx_2)^2+(dy_2)^2+(dz_2)^2}=\int _{0}^{T}\sqrt{f(t)}\textrm{d}t, \end{aligned}$$
(13)

where

$$\begin{aligned} f(t)&=a^2[(\omega _{1x}\sin \theta _{1x})^2+(\omega _{1y}\sin \theta _{1y})^2+(\omega _{1z}\sin \theta _{1z})^2] \nonumber \\&\quad +b^2[((\omega _{1x}+\omega _{2x})\sin (\theta _{1x}+\theta _{2x}))^2 \nonumber \\&\quad +((\omega _{1y}+\omega _{2y})\sin (\theta _{1y}+\theta _{2y}))^2 \nonumber \\&\quad +((\omega _{1z}+\omega _{2z})\sin (\theta _{1z}+\theta _{2z}))^2] \nonumber \\&\quad +2ab[\omega _{1x}(\omega _{1x}+\omega _{2x})\sin \theta _{1x}\sin (\theta _{1x}+\theta _{2x}) \nonumber \\&\quad +\omega _{1y}(\omega _{1y}+\omega _{2y})\sin \theta _{1y}\sin (\theta _{1y}+\theta _{2y}) \nonumber \\&\quad +\omega _{1z}(\omega _{1z}+\omega _{2z})\sin \theta _{1z}\sin (\theta _{1z}+\theta _{2z})]. \end{aligned}$$
(14)

The derivation above shows that the movement state function consists of two parts, \(L_1\) and \(L_2\), which means the most comprehensive data-acquisition scheme would place sensors at both the elbow M and the wrist N and use their data for authentication. However, Formulas (12) and (13) show that the parameters required to compute \(L_2\) cover all of those required by \(L_1\), and that the parameter richness of \(L_2\) is much higher than that of \(L_1\), so the weight of \(L_2\) in the movement state function should be much higher than that of \(L_1\). We therefore conclude that placing sensors only at the wrist N to measure \(L_2\) should still achieve authentication with good performance.
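As a sanity check on this asymmetry, the arc lengths in Formulas (12) and (13) can be evaluated numerically. The segment lengths, initial angles, and angular-velocity profiles below are illustrative assumptions on our part, not measured data.

```python
import numpy as np

# Illustrative parameters (assumptions, not measurements): segment lengths (m),
# duration (s), and constant angular-velocity components (rad/s).
a, b, T = 0.30, 0.25, 1.0
t = np.linspace(0.0, T, 1000)
w1 = np.array([1.0, 0.5, 0.2])   # upper-arm angular velocity (x, y, z)
w2 = np.array([0.8, 0.3, 0.1])   # forearm angular velocity relative to the upper arm

# theta(t) = omega * t + an assumed initial angle, per Formulas (5)-(6)
th1 = np.outer(t, w1) + 0.3
th2 = np.outer(t, w2) + 0.2

# Integrand of Formula (12): elbow speed along L1
v1 = a * np.sqrt(((w1 * np.sin(th1)) ** 2).sum(axis=1))
# Wrist speed along L2, from the differentials in Formula (11)
v2 = np.sqrt(((-a * np.sin(th1) * w1
               - b * np.sin(th1 + th2) * (w1 + w2)) ** 2).sum(axis=1))

# Trapezoidal integration over [0, T]
L1 = float(np.sum(0.5 * (v1[1:] + v1[:-1]) * np.diff(t)))
L2 = float(np.sum(0.5 * (v2[1:] + v2[:-1]) * np.diff(t)))
```

For any such profile, the wrist integrand depends on every parameter the elbow integrand depends on, plus the forearm terms, which is exactly the parameter-coverage argument made above.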

Authentication movement

Formula (14) shows that the main parameters affecting the movement state function are the upper-arm length a, the forearm length b, and the three spatial components of the angular velocity \(\omega \) and the rotation angle \(\theta \); any movement that exercises these parameters can therefore serve for authentication. The movement we designed is accordingly very simple (Fig. 2):

Fig. 2 Authentication movement. a In standing posture. b In sitting posture

We asked participants to place their arms below their waists and then raise them 15–20 cm with their natural strength and speed; this movement is typically completed within 2 s. Placing the hands below the waist merely ensures that participants have enough room to raise them. The raise itself matters because the arm movement must stem from the participant's subjective decision, which makes the participant mobilize more muscle power to perform it; this is the key factor affecting the angular velocity and rotation angle. At the same time, since each participant's upper-arm and forearm lengths differ, the movement also supplies more parameter variables for the movement state function. This simple movement therefore satisfies all the requirements of the movement state function. We provide two scenarios: completing the movement while sitting and while standing, respectively.

Sensor selection

We chose the accelerometer, magnetometer, and gyroscope as the data-acquisition sensors for two reasons. First, the movement function obtained from the mathematical analysis of the arm model reveals the parameters that determine the authentication movement: the accelerometer provides good data support for calculating the trajectory traveled by the wrist, the gyroscope directly reflects changes in angular velocity, and the magnetometer makes it convenient to confirm the instantaneous direction of movement. Second, we aim to increase the applicability of the method by collecting data with simple, commonly used sensors. Most current smartphones have built-in accelerometer, magnetometer, and gyroscope sensors to serve various mobile applications, so choosing these three sensors gives our method a wide range of application scenarios rather than leaving it purely theoretical.

Accelerometer

The accelerometer collects the acceleration of the sensor itself. Due to the earth's gravity, the accelerometer reading always contains a constant 9.81 m/s\(^{2}\) component, even when the device is stationary. To eliminate this interference, a high-pass filter is applied to the accelerometer data. A low-pass filter is also applied to the raw accelerometer data to remove the noise generated during collection. The accelerometer data reflect the user's habitual arm strength and force direction, and are additionally used in the auxiliary trajectory calculation.

The high-pass-filtered and low-pass-filtered data are added to the feature subset selection together with the raw data, to ensure that important feature information is not lost.
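A minimal sketch of this filtering step, using Butterworth filters from SciPy; the cutoff frequencies and filter order are illustrative choices of ours, as the paper does not specify them:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100.0  # sensor sampling rate in Hz, as used in this work

def highpass(x, cutoff=0.5, fs=FS, order=4):
    """Remove the constant gravity component from raw accelerometer data.
    The cutoff frequency is an illustrative assumption."""
    b, a = butter(order, cutoff / (fs / 2.0), btype="high")
    return filtfilt(b, a, x)

def lowpass(x, cutoff=20.0, fs=FS, order=4):
    """Suppress high-frequency collection noise; cutoff again illustrative."""
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)

# Example: 2 s of one accelerometer axis with the 9.81 m/s^2 gravity offset
# plus a synthetic 1.5 Hz movement component.
t = np.linspace(0.0, 2.0, int(2 * FS), endpoint=False)
raw = 9.81 + 2.0 * np.sin(2 * np.pi * 1.5 * t)
hp = highpass(raw)   # gravity offset removed, movement signal kept
```

Zero-phase filtering with `filtfilt` avoids shifting the short movement waveform in time, which would otherwise distort such brief samples.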

Fig. 3 RaiseAuth method architecture

Gyroscope

Owing to the rigidity and precession of a gyroscope, its data provide an important basis for calculating changes in angular velocity in the three spatial directions.

Magnetometer

The magnetometer measures the strength and direction of the magnetic field at the sensor, so its data reflect the sensor's instantaneous changes in direction. In our method, magnetometer data are used to capture the direction of the participant's arm movement.

Method

For better evaluation, we build a variety of architectures based on existing popular backbones and place them in different cells, so as to select the most suitable one and analyze it further. The architecture of the RaiseAuth method is shown in Fig. 3. It contains five cells, each configured with its corresponding data-preprocessing method. Cell evaluation calculates the score each cell obtains on the dataset during training and provides feedback that causes RaiseAuth to close the paths to low-scoring cells.

The accelerometer, magnetometer, and gyroscope data are denoted (AX, AY, AZ), (MX, MY, MZ), and (GX, GY, GZ), respectively. The accelerometer data are additionally processed by low-pass filtering (LPF) and high-pass filtering (HPF), with the results denoted (LX, LY, LZ) and (HX, HY, HZ), respectively.

1D multi-scale cell

Our first approach processes and analyzes the sensor data from a one-dimensional perspective. Because a one-dimensional convolution kernel has a limited receptive field, we extract features from the input data at three different scales, based on the bottleneck of ResNet [33], to perform the multi-class classification task.

Fig. 4 1D multi-scale cell data preprocessing

1D data preprocessing

As shown in Fig. 4, we stitch the sensor data together in the order HPF, LPF, accelerometer, magnetometer, gyroscope. The data length of each sensor on each axis is fixed at 140: shorter sequences are zero-padded and longer ones are truncated, forming a one-dimensional input of length 2100.
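This padding-and-stitching step can be sketched as follows; the sensor names and the dictionary layout of a sample are our own illustration:

```python
import numpy as np

AXIS_LEN = 140
SENSOR_ORDER = ["hpf", "lpf", "acc", "mag", "gyro"]  # stitching order from Fig. 4

def fix_length(seq, length=AXIS_LEN):
    """Zero-pad short per-axis sequences; discard the excess of long ones."""
    seq = np.asarray(seq, dtype=np.float32)
    if seq.shape[0] >= length:
        return seq[:length]
    return np.pad(seq, (0, length - seq.shape[0]))

def make_1d_input(sample):
    """sample: dict sensor name -> array of shape (n_readings, 3) for axes x, y, z.
    Returns a vector of length 5 sensors x 3 axes x 140 = 2100."""
    parts = [fix_length(sample[name][:, ax])
             for name in SENSOR_ORDER for ax in range(3)]
    return np.concatenate(parts)
```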

Cell architecture details

The cell structure and details are shown in Fig. 5, where BN denotes batch normalization and "/2" denotes a stride of 2. For example, "1\(\times \)7 conv, 32, /2" denotes a convolution with a 1\(\times \)7 kernel, 32 output channels, and stride 2; "1\(\times \)3 maxpool, /2" denotes max pooling with a 1\(\times \)3 kernel and stride 2; and "1\(\times \)6 avgpool, /1" denotes average pooling with a 1\(\times \)6 kernel and stride 1.
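This notation can be checked against the standard output-length formula for strided 1-D convolution and pooling. The padding values below are our assumption, chosen so that each stride-2 layer exactly halves the resolution:

```python
def out_len(n, kernel, stride, padding=0):
    """Standard output length of a 1-D convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

# For example, a "1x7 conv, /2" followed by a "1x3 maxpool, /2"
# on the 2100-length input:
n = out_len(2100, 7, 2, padding=3)   # 1050
n = out_len(n, 3, 2, padding=1)      # 525
```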

Fig. 5 1D multi-scale cell architecture

When the preprocessed data enter the cell, they first pass through two convolution layers that increase the number of channels. We use a large convolution kernel to enlarge the receptive field, thereby increasing the breadth and depth of the feature map. Following [34], we adopt batch normalization (BN) right after each convolution and before activation. After the activation function, max pooling mixes features and adjusts the output resolution. Then, given the restricted receptive field of one-dimensional convolution, we perform residual convolutions on the input at three scales, 1\(\times \)3, 1\(\times \)5, and 1\(\times \)7, and finally concatenate the outputs of the average pooling into a feature vector. This vector is fed into a fully connected layer and combined with softmax to perform classification.

During training, we use a batch size of 64 and optimize with SGD with a momentum of 0.9. The learning rate is initialized to 0.01 and decays exponentially at a rate of 0.95; when the error plateaus, the learning rate is divided by 10. Following the practice in [33], dropout is not used. The models are trained for up to 150 epochs.
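The learning-rate rule can be sketched as a per-epoch update; treating a single non-improving epoch as a plateau is a simplification of the rule above, since the paper does not state the plateau patience:

```python
def step_lr(lr, improved, decay=0.95, plateau_factor=10.0):
    """One epoch of the assumed schedule: exponential decay every epoch,
    plus a divide-by-10 when the validation error did not improve."""
    lr *= decay
    if not improved:
        lr /= plateau_factor
    return lr

lr = 0.01                 # initial learning rate for this cell
lr = step_lr(lr, True)    # a normal epoch: 0.01 * 0.95
```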

2D residual cell

In this cell, we preprocess the signal into a 2D structure and apply ResNet to the multi-class classification task, constructing cells from models of different depths.

2D data preprocessing

Fig. 6 2D residual cell data preprocessing

As shown in Fig. 6, we arrange the sensor data into an input of shape 3 (channels) \(\times \) 5 (height) \(\times \) 140 (width). The data length of one training sample for each sensor on each axis is fixed at 140: shorter sequences are zero-padded and longer ones are truncated. The height of 5 corresponds to the 5 sensors, and the 3 channels correspond to each sensor's data on the three axes (x, y, z).

The advantage of this structure is that, during convolution, the kernel can simultaneously cover the data generated on all three axes, and, depending on the kernel size, the receptive field can also cover data generated by other sensors at similar times.
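A sketch of this 2D arrangement, with the same illustrative sample layout as before:

```python
import numpy as np

def fix_length(seq, length=140):
    """Zero-pad or truncate one per-axis sequence to the fixed length."""
    seq = np.asarray(seq, dtype=np.float32)
    if seq.shape[0] >= length:
        return seq[:length]
    return np.pad(seq, (0, length - seq.shape[0]))

def make_2d_input(sample, order=("hpf", "lpf", "acc", "mag", "gyro")):
    """sample: dict sensor name -> (n_readings, 3) array.
    Returns a 3 (axis channel) x 5 (sensor row) x 140 (time) tensor."""
    out = np.zeros((3, len(order), 140), dtype=np.float32)
    for row, name in enumerate(order):
        for ax in range(3):
            out[ax, row] = fix_length(sample[name][:, ax])
    return out
```

A 2D kernel sliding over this tensor sees all three axis channels at once, and taller kernels additionally span neighboring sensor rows, matching the receptive-field argument above.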

Cell architecture details

Guided by Occam's razor and by previous work on ResNet [33], we choose ResNet-18, ResNet-50, and ResNet-101 to construct the cells. Even though residual blocks theoretically ensure that accuracy will at least not decrease as the network deepens, we still try shallow networks for the accuracy comparison. During training we use a batch size of 32 and optimize with SGD with a momentum of 0.9. The learning rate is initialized to 0.01 and decays exponentially at a rate of 0.94; when the error plateaus, the learning rate is divided by 10. Dropout is not used. The models are trained for up to 130 epochs.

Traditional machine learning cell

In [35], it was shown that traditional machine learning methods can often outperform deep neural networks on small-sample datasets. For example, on the ORL dataset [36], where 400 images are divided into 40 classes, the Random Forest algorithm classifies better than a deep neural network, which our previous work [22] also confirmed. Since our dataset is likewise a small-sample dataset, we built a traditional machine learning cell for experiments and comparisons.

Feature extraction and concatenation

To increase the optional range and diversity of features, on the basis of the original data we compute \({\text {MA}} = \sqrt{{\text {AX}}^{2} + {\text {AY}}^{2} + {\text {AZ}}^{2}}\), \({\text {MM}} = \sqrt{{\text {MX}}^{2} + {\text {MY}}^{2} + {\text {MZ}}^{2}}\), \({\text {MG}} = \sqrt{{\text {GX}}^{2} + {\text {GY}}^{2} + {\text {GZ}}^{2}}\), \({\text {MLA}} = \sqrt{{\text {LX}}^{2} + {\text {LY}}^{2} + {\text {LZ}}^{2}}\), and \({\text {MHA}} = \sqrt{{\text {HX}}^{2} + {\text {HY}}^{2} + {\text {HZ}}^{2}}\). For the resulting 20 sets of data, we calculate the minimum, maximum, mean, standard deviation, skewness, and kurtosis of each set. Each training sample therefore yields a total of 120 features, as shown in Table 1. For convenience, we assign each feature in Table 1 a unique id; for example, FId.15 is the mean of the Z component of the acceleration measured by the accelerometer (AZ), and FId.95 is the skewness of the acceleration magnitude (MA). Feature ids are used below in place of the features' full names.
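The 120-feature extraction can be sketched as follows; the sensor names in the dictionary are our illustration, and SciPy's sample skewness and kurtosis are assumed to match the statistics used:

```python
import numpy as np
from scipy.stats import kurtosis, skew

SENSORS = ["hpf", "lpf", "acc", "mag", "gyro"]

def stats6(series):
    """The six statistics extracted per series: min, max, mean, std,
    skewness, and kurtosis."""
    s = np.asarray(series, dtype=float)
    return [s.min(), s.max(), s.mean(), s.std(), skew(s), kurtosis(s)]

def extract_features(sample):
    """sample: dict sensor name -> (3, n) array of x/y/z readings.
    20 series (5 sensors x (3 axes + 1 magnitude)), 6 statistics each
    = 120 features per training sample."""
    feats = []
    for name in SENSORS:
        xyz = np.asarray(sample[name], dtype=float)
        for axis_series in xyz:                                # AX/AY/AZ-style series
            feats.extend(stats6(axis_series))
        feats.extend(stats6(np.sqrt((xyz ** 2).sum(axis=0))))  # MA-style magnitude
    return np.array(feats)
```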

Table 1 The correspondence between feature name and feature id
Table 2 Feature subset in sitting/standing posture

Feature fusion in bio-behavioral authentication can increase accuracy and reduce data redundancy. In general, the earlier the fusion, the better the effect [37]; however, fusing this early, at the sensor level, brings a large amount of noise into the model, so sensor-level fusion often does not yield the best results. Feature-level fusion is therefore considered the more effective choice for improving accuracy, and it is what we select.

Algorithm 1 The CFM feature subset selection algorithm

Feature subset selection

Feature subset selection plays an important role in reducing data dimensionality and preventing overfitting. Based on the correlation between features and classes, we propose a feature subset selection algorithm named correlation feature maximization (CFM) (Algorithm 1).

Suppose the number of features is \(N_{\textrm{feature}}\). CFM evaluates each feature according to its correlation with the classes [38], yielding \(N_{\textrm{feature}}\) correlation scores that form the array \(S_{\textrm{feature}}\); the higher the score, the greater the correlation between the feature and the class. Suppose \(N_{\textrm{sensor}}\) sensors are involved in the authentication movement. CFM treats the features extracted from the same sensor as one set, giving \(N_{\textrm{sensor}}\) sets of features, and lets \(M_{\textrm{subset}}\) be the number of features in each set. CFM also assigns each feature a weight according to its rank in the correlation scores, forming the array \(W_{\textrm{feature}}\): the higher the rank, the larger the weight. How the weights are spread depends on whether the user values the individual performance of each sensor or the cooperative performance of all sensors. If individual sensor performance matters more, the weight gap between highly ranked and lowly ranked features is large; if sensor cooperation matters more, the gap is small. The details of CFM are shown in Algorithm 1.

Finally, the features with the top \(M_{\textrm{sensor}}\) scores are extracted from each set and combined into the feature subset (Table 2). This lets CFM filter out the best-performing features produced by each sensor and group them into the feature subset. In the sitting posture, the features related to the high-pass-filtered accelerometer, low-pass-filtered accelerometer, raw accelerometer, magnetometer, and gyroscope account for 15%, 21.7%, 21.7%, 28.3%, and 13.3% of the feature subset, respectively. In the standing posture, the proportions are 14.9%, 19.1%, 19.1%, 23.4%, and 23.4%. Every sensor thus contributes a substantial share of the feature subset.
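A compact sketch of CFM under simple assumptions: absolute feature-to-label correlation as the score and a geometric rank weighting are our illustrative choices for details the description leaves open.

```python
import numpy as np

def cfm_select(X, y, sensor_of, m_per_sensor, decay=0.9):
    """Sketch of CFM. X: (n_samples, n_features); y: numeric class labels;
    sensor_of[j]: index of the sensor that produced feature j.
    Returns the indices of the selected feature subset."""
    n_features = X.shape[1]
    # S_feature: correlation-based score of each feature against the class
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    # W_feature: rank-based weights; `decay` near 1 spreads weights evenly
    # (favoring sensor cooperation), while a small `decay` sharply favors
    # the individually strongest features.
    weights = np.empty(n_features)
    weights[np.argsort(-scores)] = decay ** np.arange(n_features)
    weighted = scores * weights
    # Keep the top-m features within each sensor's feature set
    selected = []
    for sensor in sorted(set(sensor_of)):
        idx = [j for j in range(n_features) if sensor_of[j] == sensor]
        idx.sort(key=lambda j: -weighted[j])
        selected.extend(idx[:m_per_sensor])
    return sorted(selected)
```

Selecting per sensor set, rather than globally, is what guarantees that every sensor contributes features to the final subset, as reflected in the proportions reported above.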

Classifier selection

Based on the feature subsets (Table 2), and to find the most suitable classification algorithm, five commonly used classifiers [39] (Naive Bayes (NB), Bayes Net (BN), J48, Random Forest (RF), and Simple Logistic (SL)) were used to build and evaluate authentication models in the sitting and standing postures, respectively.

Cell evaluation

Cell evaluation will calculate the Score each cell gets:

$$\begin{aligned} Score = \frac{1}{N_{\textrm{param}} \times {\text {Error}}}, \end{aligned}$$
(15)

where \(N_{\textrm{param}}\) is the number of parameters involved in the cell's calculation, and Error is the minimum error rate on the validation set during training. At the end of training, this module selects the cell with the highest score and keeps its path while closing the paths to the other cells, ensuring that exactly one cell path remains.
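For illustration, with hypothetical parameter counts and validation error rates (not results from this paper), Formula (15) behaves as follows; note how it rewards compact models:

```python
def cell_score(n_param, error):
    """Formula (15): higher score for fewer parameters and lower validation error."""
    return 1.0 / (n_param * error)

# Hypothetical cells: (parameter count, minimum validation error rate)
cells = {
    "1d_multiscale": (2.1e6, 0.04),
    "2d_resnet18":   (11.7e6, 0.03),
    "ml_cell":       (5.0e4, 0.05),
}
best = max(cells, key=lambda name: cell_score(*cells[name]))
# With these illustrative numbers the lightweight ml_cell wins despite not
# having the lowest error, because its parameter count is tiny.
```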

Performance evaluation

In this section, we first describe the composition of our dataset and the collection of training and test samples. We then evaluate the performance of the model's different cells on the dataset and conduct random-attack and imitation-attack experiments.

Data collection used an iPhone 7 (64 GB) running iOS 12.2. Computation was performed on a Windows 10 machine with an RTX 3080 Ti GPU, 32 GB of RAM, and a 1 TB hard disk.

Data collection

We used the built-in accelerometer, magnetometer, and gyroscope sensors to collect the data. The number of participants in the experiment was 138: 110 participated in collecting the training and testing sets, 20 in the random attack experiment, and 8 in the imitation attack experiment. The sampling rate of the accelerometer, magnetometer, and gyroscope was set to 100 Hz.

In response to our earlier challenge of achieving high-accuracy authentication with the smallest possible dataset, and to compress the number of training samples required per person as much as possible, we asked each of the 110 participants to perform the authentication movement 5 times in both the sitting and standing postures. The duration of each movement was determined by the participant's personal habits but was usually no more than two seconds (in fact, we found that the longest movement time among the 110 participants was 1.38 s). Therefore, in the 1D and 2D data preprocessing sections, we set the data length of each axis of each sensor to 140, ensuring that all sample data could be included while padding as few zeros as possible. A total of 1100 training samples were collected. We then asked the 110 participants to perform the authentication movement 4 more times in the sitting and standing postures to form the test dataset. In total, we obtained 1980 training and testing samples covering 110 classes of authentication movements, on which we perform a 110-class multi-classification task.
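The fixed-length preprocessing described above might look like the following sketch (the helper name is an assumption):

```python
import numpy as np

def pad_axis(samples, target_len=140):
    """Zero-pad (or truncate) one sensor axis to a fixed length.

    At a 100 Hz sampling rate, 140 points cover 1.4 s, enough for the
    longest observed movement (1.38 s) while padding as few zeros as
    possible. `samples` is a 1-D sequence of readings from one axis.
    """
    x = np.asarray(samples, dtype=float)
    if x.shape[0] >= target_len:
        return x[:target_len]
    return np.pad(x, (0, target_len - x.shape[0]))
```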

To test RaiseAuth's resistance to random attacks, each of the 20 random attackers provided 50 random attacks in the sitting posture and 50 in the standing posture. Each random attack has the following characteristic: the attacker tries to perform the authentication movement with random behavioral habits, without knowledge of the data of the 110 real users. In total, 1000 random attack samples were provided for the standing posture and 1000 for the sitting posture.

To test RaiseAuth's resistance to imitated attacks, four pairs of people who were similar in height, weight, forearm length, and upper arm length were asked to perform the authentication movement. Each pair consisted of a real user and an imitating attacker. After observing the real user's authentication movement, the attacker tried to imitate the real user 20 times in the standing posture and 20 times in the sitting posture. The four pairs thus provided 80 imitated attack samples in the standing posture and 80 in the sitting posture.

Table 3 Cell parameters and classification error in sitting posture, standing posture and combined dataset
Fig. 7

The accuracy results of RaiseAuth traditional machine learning cell combined with 5 classification algorithms in sitting/standing posture

Results

Based on the training and testing sets of the 110 participants, we evaluated the performance of each cell; part of the results is shown in Table 3. The Params column indicates the number of parameters involved in the cell's computation; the more parameters, the more computing resources are required for training and testing. The number of parameters of the 1D multi-scale cell is significantly lower than those of the 2D ResNet-18, 2D ResNet-50, and 2D ResNet-101 cells. For the error rate, we evaluated in three ways, training and testing the model on the sitting-posture data, the standing-posture data, and the mixed sitting-and-standing data, respectively. The results show that, whether the network is a shallow 1-dimensional low-parameter structure or a 101-layer deep 2-dimensional high-parameter network, the error rates are essentially the same: the depth of the network did not bring a significant change. This agrees with our earlier predictions and with previous work [35]. We believe this is because the amount of sample data is too small and the number of sample categories too large for a deep neural network to obtain enough data to train its weights, so the deep networks typically overfit very quickly on the training set but do not perform well on the validation and test sets. Nevertheless, the results show that the sitting posture has a lower error rate than the standing posture, and the mixed dataset achieves an effect comparable to the sitting posture. The relatively low error of the 1D multi-scale cell indicates that, compared with deepening the network structure, collecting information from different scales is more effective: richer scale information improves training accuracy on a small-sample dataset.

Then, we tested the traditional machine learning cell. We selected five algorithms (Naive Bayes (NB), Bayes Net (BN), J48, Random Forest (RF), and Simple Logistic (SL)) as the model's classifier and trained the model on the standing- and sitting-posture datasets respectively, using 5-fold cross-validation. The Random Forest classifier performed best, achieving an accuracy of 97.49% in the sitting posture and 94.68% in the standing posture (Fig. 7). The accuracy is

$$\begin{aligned} {\text {Accuracy}} = \frac{{\text {TP}} + {\text {TN}}}{{\text {TP}} + {\text {FP}} + {\text {FN}} + {\text {TN}}}, \end{aligned}$$
(16)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
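Eq. (16) can be checked with a small helper (the function name is ours; the arguments are the four confusion-matrix counts):

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Accuracy per Eq. (16): correct decisions over all decisions."""
    return (tp + tn) / (tp + fp + fn + tn)
```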

At the same time, we compared the CFM algorithm with five commonly used feature-subset selection algorithms. We limited each algorithm to the same number of features as CFM and again used the Random Forest classifier. In the sitting posture, the accuracies of ReliefF [40], GainRatio [41], InfoGain [42], SymmetricalUncert [43], OneR [44], and CFM (ours) are 96.98%, 97.21%, 97.21%, 97.35%, 97.38%, and 97.49%, respectively. In the standing posture, they are 93.17%, 93.55%, 93.17%, 94.49%, 93.24%, and 94.68%. This shows that our CFM algorithm performs best on this task.
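The Random Forest evaluation with 5-fold cross-validation can be sketched as follows, using scikit-learn as a stand-in for the toolchain actually used; `X` and `y` (the CFM-selected feature subset and the 110 user labels) are assumed inputs:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_rf(X, y, folds=5, seed=0):
    """Mean cross-validated accuracy of a Random Forest classifier on the
    feature subset X with user labels y (5-fold, per the evaluation setup)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=folds).mean()
```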

The False Positive Rate (FPR) and True Positive Rate (TPR) of the model were calculated and the Receiver Operating Characteristic (ROC) curves drawn for the sitting posture (Fig. 8) and the standing posture (Fig. 9). The ROC curves visually display the FPR and TPR: the more the curve bulges toward the upper left corner, the better the classifier.

Fig. 8

The ROC curve in the sitting posture of RaiseAuth

Fig. 9

The ROC curve in the standing posture of RaiseAuth

Fig. 10

The sample predictions of 1000 random attacks calculated by RaiseAuth in sitting posture

Fig. 11

The sample predictions of 1000 random attacks calculated by RaiseAuth in standing posture

Random attack test

To test the robustness and security of RaiseAuth, each of the 20 attackers performed 50 random authentication-movement collections in both the sitting and standing postures, without knowledge of the 110 real users' information. This yielded 1000 sitting attack samples and 1000 standing attack samples. These 2000 attack samples were used to test RaiseAuth, and the sample prediction for each random attack is shown in Figs. 10 and 11. The sample prediction represents the model's confidence in assigning a sample to a class; for example, a 40% prediction means the current sample matches one class with 40% confidence. In the sitting posture, no attack prediction exceeded 33%, and only 2 attacks had predictions in the 31%-35% range. In the standing posture, no attack prediction exceeded 35%, and only 3 attacks had predictions in the 31%-35% range. The Minimum Threshold Line indicates the sample-prediction threshold the model needs under attack in order to resist it well. Usually, the prediction threshold for real users is set to 90%; that is, a judgment is considered reliable only when the model's prediction exceeds 90%. Our experiments show that all 2000 random attack samples can be filtered out by setting the minimum threshold line at just 35%, making the gap between random attack samples and real user samples very clear. This means the traditional machine learning cell in the RaiseAuth structure performs well in resisting random attacks.
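The thresholding described above can be sketched as follows (the helper names are ours):

```python
def is_accepted(prediction: float, threshold: float = 0.90) -> bool:
    """Accept a sample only when the model's class confidence exceeds the
    threshold (90% for real users in the experiments)."""
    return prediction > threshold

def minimum_threshold_line(attack_predictions):
    """The lowest threshold that rejects every attack sample: the highest
    confidence any attack achieved."""
    return max(attack_predictions)
```

Since no random attack exceeded 35% confidence while real users are accepted above 90%, the two populations are separated by a wide margin.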

Table 4 Body size of attacker and real user in imitated attack test

Imitated attack test

Imitated attacks occur occasionally in real life: attackers achieve identity obfuscation by observing and imitating real users. We tested RaiseAuth's resistance to imitated attacks. The participants were four pairs of people similar in height, weight, forearm length, and upper arm length (Table 4); four were real users and four were attackers. The real users collected authentication movement data normally, while the attackers observed and imitated the real users' collection process. Each attacker then launched 20 imitated attacks against the real user of similar body size in the sitting (Fig. 12) and standing (Fig. 13) postures. In the results, the attack predictions were mostly concentrated in the 40%-60% range, and none exceeded 70%. The minimum threshold line is 70%.

At the same time, we analyzed the experimental data of the real users and attackers, taking the two best features in the standing posture (Fig. 14) as an example. The data come from the pair Real user 1 and Attacker 1: the lower-left corner shows the values of FId.19 and FId.21 generated by Real user 1 over 20 authentication movements, and the upper-right corner shows the values of FId.19 and FId.21 generated by Attacker 1 over 20 imitation movements. Even though the real user and the attacker were very similar in body size and the attacker was given 20 opportunities to observe and imitate, the difference between them remained obvious to RaiseAuth.

Fig. 12

The sample predictions of imitated attacks calculated by RaiseAuth in sitting posture

Fig. 13

The sample predictions of imitated attacks calculated by RaiseAuth in standing posture

Fig. 14

Feature Id 19’s and Feature Id 21’s data visualization of a pair of participants in imitated attack test

Conclusions

In this work, we aim to achieve high-accuracy authentication while compressing the movement complexity of the authentication movement as much as possible. The contribution of this work is twofold.

First, based on the mathematical modeling of the arm, we constructed the movement state function and, according to its parameters, maximally compressed the complexity of the authentication movement.

Second, to the best of our knowledge, we are the first to perform authentication on an authentication movement of such low movement complexity. We constructed an authentication dataset involving 138 people and, based on this dataset, designed a multi-cell authentication method named RaiseAuth.

Through the analysis of the experimental results, we can draw two conclusions: (1) the traditional machine learning cell in RaiseAuth outperforms the deep neural network cells, which means that for a small-sample, multi-category dataset, traditional machine learning algorithms are more effective, in agreement with previous work [35]; (2) RaiseAuth performs well against random attacks launched without knowledge of real user information, but suffers some impact when confronting imitators with similar physical parameters who observe real users and conduct 20 imitation attacks. This shows that, under extreme conditions, authentication based on body movements may be affected by imitated attacks.

In summary, RaiseAuth can perform multi-classification tasks with good performance on the 110-class authentication dataset with low movement complexity and small samples.