Introduction

Neural decoding based on electromyography (EMG) signals has attracted many researchers [1]. It is a technology that translates the bioelectrical signals of muscles into corresponding instructions [2]. Compared with other human–computer interaction modes, neural decoding is more convenient and less constrained by the surrounding environment, giving it tremendous development potential in the medical, entertainment and military fields [3, 4].

A large body of literature addresses the neural decoding of gestures. Naik et al. [5] combined independent component analysis with Icasso clustering to extract features from surface electromyography (sEMG), then classified gestures by linear discriminant analysis (LDA). Lima et al. investigated relevance vector machines and fractal dimension to identify seven gestures [6]. In addition, the convolutional neural network (CNN), building on its success in computer vision, has been applied to neural decoding [7,8,9,10,11]. Wei et al. [12] combined information detected from electrodes in different ways and fed it to a multi-stream CNN framework. Building on this work, Hu et al. [13] incorporated time-series information with a recurrent neural network, which improved recognition accuracy. Allard et al. [14] and Zhai et al. [15] proposed computing feature matrices from time–frequency domain information and classifying gestures with CNN models.

However, all these methods recognize gestures under a fixed force level; the combination of data preprocessing and classifiers has not been examined under varying force levels, which means strength information is neglected. Considering this factor, force myography was used to recognize sixteen gestures at three force levels from nine subjects [16]. Unlike force sensing resistor (FSR) signals, sEMG signals cannot present force-related information directly. Jiang et al. [17] upgraded their hardware with inertial measurement units (IMU) and sEMG sensors and analyzed both surface gestures and air gestures, but only medium and low force levels were considered; the effect of a high force level was not investigated. Moreover, the variation of EMG signals under different force levels may degrade gesture recognition performance. To eliminate the influence of force, Al-Timemy et al. [18] adopted an energy-based feature set for six gestures at three force levels. Although this method obtained good results, it ignored the usefulness of force information in real life.

Intuitively, gesture recognition is not an isolated problem. When a gesture is performed, the subject applies different force levels according to the needs of the environment. As discussed in [19,20,21], these levels can also be decoded from EMG signals. This naturally motivates us to explore a multi-task learning (MTL) framework to decode gestures and force levels from sEMG signals [22].

MTL is a method that learns multiple related tasks simultaneously [23]. By sharing feature representations between tasks, it generalizes better than learning each task independently [24]. The introduced inductive bias also reduces the risk of over-fitting. However, most prior works either treat all tasks as equally important or tune the task weights by greedy search, which usually cannot find the optimal model parameters [25]. To overcome this problem, pseudo-task augmentation (PTA) is employed to influence the learning dynamics. As a complementary technique, it has been validated to yield performance gains for both single-task learning (STL) and MTL [26].

The main contributions of this paper can be summarized as follows:

1. Different from datasets that collect various gestures at a constant force, a dataset containing eight gestures at three force levels is provided. On this basis, combinations of data preprocessing and classifiers are compared to find appropriate methods for gesture recognition.

2. Different from most works, which address a single task of gesture recognition or force estimation, neural decoding is formulated here as an MTL problem. The feasibility of decoding gestures and force levels synchronously from sEMG signals is explored, which boosts gesture recognition performance and provides additional force information.

3. To move beyond treating all tasks as equally important, a PTA strategy is adopted. Different from the method proposed in [26], a PTA strategy with weight coefficients is introduced, which accounts for the relationship between tasks and demonstrates efficient performance in the experiments.

The remainder of the paper is organized as follows. The datasets and materials are discussed in Sect. 2. The proposed multi-task CNN models associated with PTA are presented in Sect. 3. Experiments and results on the two datasets are reported in Sect. 4. Finally, conclusions are summarized in Sect. 5.

Datasets and materials

In this paper, two datasets are used in the experiments: one from amputees and the other from subjects with healthy limbs.

Amputees dataset

In [18], nine amputees performing six gestures at three force levels participated in the experiment. sEMG signals were sampled at 2000 Hz. The six gestures are thumb flexion, index flexion, fine pinch, tripod grip, hook grip and spherical grip. Each gesture was performed for 5–9 trials, and each trial lasted 2.5–20 s depending on the amputee. The force levels are high, medium and low. Following the protocol, the first eight EMG electrode channels are used in the experiments. To address sample imbalance, this paper first sorts the trials of each gesture by file size and then selects the five largest for the experiments. The first, third and fourth are used as the training set; the second and fifth are used as the validation and testing sets.
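A minimal sketch of this trial-selection scheme is given below (Python). The directory layout, the file pattern, and the assumption that "first, third and fourth" refer to size order are all illustrative, not taken from the source.

```python
import os
from glob import glob

def split_trials(gesture_dir, pattern="*.mat"):
    """Select the five largest trial files of one gesture and split them
    into training / validation / testing sets as described above."""
    trials = sorted(glob(os.path.join(gesture_dir, pattern)),
                    key=os.path.getsize, reverse=True)[:5]
    train = [trials[0], trials[2], trials[3]]   # 1st, 3rd, 4th largest
    val = [trials[1]]                           # 2nd largest
    test = [trials[4]]                          # 5th largest
    return train, val, test
```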

Healthy subjects dataset

For the experiments on subjects with healthy limbs, a wearable device was developed to collect the sEMG signals (see Fig. 1). The device consists of 16 acquisition modules, each containing a pair of electrodes with a vertical spacing of 10 mm. All 16 sEMG signals were amplified with a gain of 960 and band-pass filtered between 20 and 500 Hz. The signals were sampled at 1000 Hz with an analog-to-digital converter, yielding an 8-bit digital signal per module. For comparison, the signals from the odd-numbered acquisition modules are used in the experiments.

Fig. 1 sEMG signal collection device

As for the collection protocol, seven subjects aged 25 to 30 volunteered, including two females and five males. During collection, a force sensor was first used to measure the maximum voluntary contraction (MVC) of each subject. Then, the subjects were asked to perform eight gestures at three force levels, corresponding to 20%, 40% and 60% of MVC, respectively. These gestures include palm press, thumb press, three-finger grasp, grasp, pinch, fist press and key pinch (see Fig. 2), all selected from commonly used gestures that involve strength, such as grasping or pressing a surface. For each gesture, five trials were collected and each trial lasted five seconds. To mitigate muscle fatigue, there was a rest of several seconds between consecutive trials. The sEMG signals detected from the selected eight channels are shown in Fig. 3.

Fig. 2 Eight gestures with variable force levels

Fig. 3 sEMG signals from eight channels

MTL and PTA strategies

Data preprocessing and CNN models

Given the raw sEMG signals, we divide them into small segments with a sliding window. Considering the large amount of data required by the CNN model, an overlapped windowing scheme is used. Based on past experiments, the window length should be shorter than 300 ms so that the delay is imperceptible to subjects in real-life applications [27, 28]. In this work, it is set to 200 ms for both datasets, with an overlap of 140 ms.
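A minimal sketch of this windowing step (Python/NumPy; the array layout is an assumption):

```python
import numpy as np

def segment(emg, fs, win_ms=200, overlap_ms=140):
    """Split a (channels, samples) sEMG recording into overlapped windows.

    A 200 ms window with 140 ms overlap gives a 60 ms stride,
    i.e. 120 samples at 2000 Hz or 60 samples at 1000 Hz."""
    win = int(fs * win_ms / 1000)
    step = win - int(fs * overlap_ms / 1000)
    n = (emg.shape[1] - win) // step + 1
    return np.stack([emg[:, i * step:i * step + win] for i in range(n)])
```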

Before being input to the CNN, the segmented data of each electrode are transformed into frequency-domain information by the Fast Fourier Transform. Since the majority of sEMG energy lies between 0 and 500 Hz, the first 100 spectral bins are used as input for the amputee dataset [14, 29], and the first frequency band is removed to reduce baseline drift and motion artifacts. For the healthy-subject dataset, the first five frequency bands are removed owing to the hardware filters.
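This preprocessing could look as follows (a sketch; the exact bin-selection convention is an assumption consistent with the numbers above):

```python
import numpy as np

def spectrum_features(seg, drop_bins=1, keep_bins=100):
    """Per-channel magnitude spectrum of a (channels, samples) segment.

    For the 2000 Hz amputee data a 200 ms window has 400 samples, so the
    rfft bins are 5 Hz wide and the first 100 bins span 0-500 Hz.
    drop_bins=1 removes the lowest band (baseline drift, motion
    artifact); the healthy-subject data would use drop_bins=5."""
    spec = np.abs(np.fft.rfft(seg, axis=1))
    return spec[:, drop_bins:keep_bins]
```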

To make full use of the information in the electrode channels, a multi-stream CNN model is designed, as shown in Fig. 4. The spectrum of each electrode serves as the input of one stream. Each stream contains three blocks, each consisting of a batch normalization (BN) layer, a convolutional layer and a max-pooling layer. The convolutional layer has 32 kernels of size 1×3, and the max-pooling layer has size 1×2 with a stride of 2. By stacking blocks, the CNN extracts features through a hierarchy of spectral abstractions. All streams converge into fully connected (FC) layers for classification. There are three FC layers; the first two have 512 and 256 nodes, respectively, with a dropout probability of 0.5, while the last FC layer consists of two parts connected independently to the second FC layer, matching the numbers of gestures and force levels.

Fig. 4 Multi-stream CNN structure
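As a concrete reference, a minimal PyTorch sketch of this architecture is given below. The layer sizes follow the text; the ReLU activations, padding, and the flattened feature size are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class MultiStreamCNN(nn.Module):
    """Sketch of the multi-stream CNN of Fig. 4 (defaults match the
    amputee dataset: 8 channels, 6 gestures, 3 force levels)."""

    def __init__(self, n_streams=8, n_bins=100, n_gestures=6, n_forces=3):
        super().__init__()
        def block(in_ch):
            return nn.Sequential(
                nn.BatchNorm2d(in_ch),
                nn.Conv2d(in_ch, 32, kernel_size=(1, 3), padding=(0, 1)),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2)))
        self.streams = nn.ModuleList([
            nn.Sequential(block(1), block(32), block(32))
            for _ in range(n_streams)])
        feat = 32 * (n_bins // 8) * n_streams  # three 1x2 poolings: 100 -> 12
        self.fc = nn.Sequential(
            nn.Linear(feat, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5))
        self.head_g = nn.Linear(256, n_gestures)   # gesture scores y^pg
        self.head_f = nn.Linear(256, n_forces)     # force-level scores y^pf

    def forward(self, x):                          # x: (B, n_streams, 1, n_bins)
        feats = [s(x[:, i:i + 1]) for i, s in enumerate(self.streams)]
        z = torch.cat([f.flatten(1) for f in feats], dim=1)
        z = self.fc(z)
        return self.head_g(z), self.head_f(z)
```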

MTL framework

As described above, the two tasks are handled simultaneously by the last FC layer of the model, which means they share the same features, as most MTL frameworks do. The details of the MTL framework are introduced below.

Consider a training set with \(T\) samples \(D = \{ s_{i}, {\mathbf{y}}_{i} \}_{i = 1}^{T}\), where \(s_{i}\) is the \(i\)th signal segment and \({\mathbf{y}}_{i}\) denotes the corresponding labels, made up of a gesture label \({\mathbf{y}}_{i}^{\text{g}}\) and a force-level label \({\mathbf{y}}_{i}^{\text{f}}\). For clarity, the index \(i\) is omitted below. The shared feature vector \({\mathbf{x}} \in {\mathbb{R}}^{C \times 1}\) produced by the last max-pooling layer can be formulated as:

$$ {\mathbf{x}} = f(s;\; {\mathbf{k}}_{c}, {\mathbf{b}}_{c}, {\boldsymbol{\gamma}}, {\boldsymbol{\beta}}) $$
(1)

where \(f\) denotes the non-linear mapping from the input signal to the features, \({\mathbf{k}}_{c}\) and \({\mathbf{b}}_{c}\) are the kernels and bias vectors of the convolutional layers, and \({\boldsymbol{\gamma}}\) and \({\boldsymbol{\beta}}\) are the sets of scales and shifts of the BN layers.

After feature extraction by the last pooling layer, three FC layers are employed for classification. Suppose \({\mathbf{W}}_{i} \in {\mathbb{R}}^{D_{i-1} \times D_{i}}\) and \({\mathbf{b}}_{i} \in {\mathbb{R}}^{D_{i} \times 1}\) are the weight matrix and bias vector of the \(i\)th FC layer with \(D_{i}\) outputs (\(D_{0} = C\)); then the prediction score \({\mathbf{y}}_{i}^{p}\) of the \(i\)th FC layer is:

$$ {\mathbf{y}}_{i}^{p} = {\mathbf{W}}_{i}^{T} {\mathbf{y}}_{i - 1}^{p} + {\mathbf{b}}_{i} $$
(2)

where \(i = 1, 2, 3\) and \({\mathbf{y}}_{0}^{p} = {\mathbf{x}}\). Specifically, the third FC layer covers both gestures and force levels. Let \({\mathbf{W}}_{3\text{g}}\), \({\mathbf{W}}_{3\text{f}}\), \({\mathbf{b}}_{3\text{g}}\), \({\mathbf{b}}_{3\text{f}}\) denote the weight matrices and bias vectors of its two parts; the outputs \({\mathbf{y}}^{\text{pg}}\) and \({\mathbf{y}}^{\text{pf}}\) can be represented as:

$$ {\mathbf{y}}^{{{\text{pg}}}} = {\mathbf{W}}_{{3{\text{g}}}}^{T} {\mathbf{x}} + {\mathbf{b}}_{{3{\text{g}}}} $$
(3)
$$ {\mathbf{y}}^{{{\text{pf}}}} = {\mathbf{W}}_{{3{\text{f}}}}^{T} {\mathbf{x}} + {\mathbf{b}}_{{3{\text{f}}}} $$
(4)

The probabilities of \({\mathbf{x}}\) belonging to each gesture (\({\hat{\mathbf{y}}}^{\text{pg}}\)) and each force level (\({\hat{\mathbf{y}}}^{\text{pf}}\)) are calculated by feeding \({\mathbf{y}}^{\text{pg}}\) and \({\mathbf{y}}^{\text{pf}}\) into softmax functions.

$$ {\text{softmax}}({\mathbf{y}}^{{{\text{pg}}}} )_{m} = p(\hat{y}^{{{\text{pg}}}} = m|{\mathbf{x}}) = \frac{{\exp (y_{m}^{{{\text{pg}}}} )}}{{\sum\nolimits_{i} {\exp (y_{i}^{{{\text{pg}}}} )} }}, $$
(5)
$$ {\text{softmax}}({\mathbf{y}}^{{{\text{pf}}}} )_{n} = p(\hat{y}^{{{\text{pf}}}} = n|{\mathbf{x}}) = \frac{{\exp (y_{n}^{{{\text{pf}}}} )}}{{\sum\nolimits_{j} {\exp (y_{j}^{{{\text{pf}}}} )} }}, $$
(6)

where \(y_{i}^{{{\text{pg}}}}\) and \(y_{j}^{{{\text{pf}}}}\) are the \(i\)th element in \({\mathbf{y}}^{{{\text{pg}}}}\) and the \(j\)th element in \({\mathbf{y}}^{{{\text{pf}}}}\). The softmax function converts the output \({\mathbf{y}}^{{{\text{pg}}}}\) and \({\mathbf{y}}^{{{\text{pf}}}}\) into a probability distribution over respective labels. Finally, the predicted gesture \(\hat{y}^{{{\text{pg}}}}\) and force level \(\hat{y}^{{{\text{pf}}}}\) are obtained via:

$$ \hat{y}^{{{\text{pg}}}} = \mathop {\text{argmax}}\limits_{m} \;{\text{softmax}}({\mathbf{y}}^{{{\text{pg}}}} )_{m} , $$
(7)
$$ \hat{y}^{{{\text{pf}}}} = \mathop {\text{argmax}}\limits_{n} \;{\text{softmax}}({\mathbf{y}}^{{{\text{pf}}}} )_{n} . $$
(8)
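In code, Eqs. (5)–(8) amount to a softmax over each head followed by an argmax. A sketch, assuming the `MultiStreamCNN` model outlined earlier:

```python
import torch.nn.functional as F

logits_g, logits_f = model(x)        # y^pg and y^pf from the two heads
p_g = F.softmax(logits_g, dim=1)     # Eq. (5)
p_f = F.softmax(logits_f, dim=1)     # Eq. (6)
pred_gesture = p_g.argmax(dim=1)     # Eq. (7)
pred_force = p_f.argmax(dim=1)       # Eq. (8)
```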

The cross-entropy losses are employed:

$$ L_{\text{g}} = - \sum\limits_{m = 1}^{M} y_{m}^{\text{g}} \log p(\hat{y}^{\text{pg}} = m \mid {\mathbf{x}}, {\mathbf{W}}_{1}, {\mathbf{b}}_{1}, {\mathbf{W}}_{2}, {\mathbf{b}}_{2}, {\mathbf{W}}_{3\text{g}}, {\mathbf{b}}_{3\text{g}}) $$
(9)
$$ L_{\text{f}} = - \sum\limits_{n = 1}^{N} y_{n}^{\text{f}} \log p(\hat{y}^{\text{pf}} = n \mid {\mathbf{x}}, {\mathbf{W}}_{1}, {\mathbf{b}}_{1}, {\mathbf{W}}_{2}, {\mathbf{b}}_{2}, {\mathbf{W}}_{3\text{f}}, {\mathbf{b}}_{3\text{f}}) $$
(10)

where \(M\) and \(N\) are the numbers of gestures and force levels, respectively, and \(y_{m}^{\text{g}}\), \(y_{n}^{\text{f}}\) are the components of the one-hot labels \({\mathbf{y}}^{\text{g}}\) and \({\mathbf{y}}^{\text{f}}\). Letting \(\Theta\) denote the parameters of the whole model, the loss function, which in contrast to STL consists of two parts, is as follows:

$$ \mathop {\min }\limits_{\Theta } (L_{{\text{g}}} + \alpha L_{{\text{f}}} ) $$
(11)

where \(\alpha\) represents the importance of the auxiliary task. In practice, the force-level prediction is valuable only when the gesture is recognized correctly, so \(\alpha\) ranges from 0 to 1.
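Eqs. (9)–(11) correspond to a weighted sum of two cross-entropy terms. A minimal sketch (note that `F.cross_entropy` applies log-softmax internally, so it takes the raw scores of Eqs. (3)–(4)):

```python
import torch.nn.functional as F

def mtl_loss(logits_g, logits_f, y_g, y_f, alpha=0.4):
    """Joint objective of Eq. (11): gesture loss plus the force-level
    loss weighted by alpha in (0, 1]."""
    loss_g = F.cross_entropy(logits_g, y_g)   # Eq. (9)
    loss_f = F.cross_entropy(logits_f, y_f)   # Eq. (10)
    return loss_g + alpha * loss_f
```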

PTA strategy

The PTA strategy adopts from MTL the idea of training related tasks drawn from the same feature space [30]. If the last layer of the MTL framework is regarded as a decoder for each task, PTA maintains a number of distinct decoders within each task. It has been shown that the PTA strategy has a fundamental effect on the learning dynamics, leading to further improvements in both STL and MTL. For MTL with \(T\) tasks and \(D\) decoders per task, the PTA learning problem can be expressed as follows:

$$ \Theta^{*} = \mathop {{\text{argmin}}}\limits_{\Theta} \frac{1}{TD}\sum\limits_{t = 1}^{T} \sum\limits_{d = 1}^{D} L(y^{t}, \hat{y}^{td}) $$
(12)

where \(y^{t}\) is the true label of the \(t\)th task, and \(\hat{y}^{{{\text{td}}}}\) is the predicted score of the \(d\)th decoder in the \(t\)th task.

However, this method gives equal importance to all tasks by default, which is usually unreasonable in MTL. For example, only when a gesture is predicted correctly is the prediction of its force level meaningful. Therefore, an approximate range for the weight is first determined through grid search, and the learning dynamics are then further shaped by the PTA strategy. The modified PTA strategy during training is as follows:

$$ \Theta_{{{\text{tr}}}} = \mathop {{\text{argmin}}}\limits_{\Theta } \frac{1}{D}\sum\limits_{d = 1}^{D} {\{ L(y^{{\text{g}}} ,\hat{y}^{{{\text{gd}}}} )} + \alpha \, \cdot \,L(y^{{\text{f}}} ,\hat{y}^{{{\text{fd}}}} )\} $$
(13)

where \(\hat{y}^{{{\text{gd}}}}\) and \(\hat{y}^{{{\text{fd}}}}\) are prediction scores of the \(d\)th decoder for the gestures and force levels, respectively. For validation, the best performing decoder for each task is selected as follows:

$$ \Theta_{{{\text{eval}}}} = \mathop {{\text{argmin}}}\limits_{\Theta } \{ L(y^{{\text{g}}} ,\hat{y}^{{{\text{g}}d_{1} }} ) + \alpha L(y^{{\text{f}}} ,\hat{y}^{{{\text{f}}d_{2} }} )\} $$
(14)

where \(d_{1}, d_{2} \in [1, D]\). Because the two parts of the loss function are independent, the weight parameter \(\alpha\) does not influence the selection, so it is omitted in the experiments. To distinguish it from the original PTA strategy, the algorithm in this paper is referred to as weighted PTA (WPTA). The specific implementation of WPTA is shown in Algorithm 1.

Algorithm 1 The WPTA strategy

In this way, the parameters of each decoder are initialized independently so that the altered learning dynamics take effect. Furthermore, by updating \(\Theta\) in every iteration while freezing all decoders except the first one of each task, an optimal model can still be learned for each task. Finally, for validation and testing, the best-performing decoder of each task is selected, which also improves computational efficiency.
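A minimal PyTorch sketch of the WPTA objectives under the notation above. The per-iteration decoder-freezing schedule of Algorithm 1 is omitted, and the shared feature `z` (taken here as the output of the second FC layer) and the helper names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WPTAHeads(nn.Module):
    """D independently initialized decoders (pseudo-tasks) per task."""
    def __init__(self, feat=256, n_gestures=6, n_forces=3, D=4):
        super().__init__()
        self.dec_g = nn.ModuleList(nn.Linear(feat, n_gestures) for _ in range(D))
        self.dec_f = nn.ModuleList(nn.Linear(feat, n_forces) for _ in range(D))

def wpta_train_loss(z, heads, y_g, y_f, alpha=0.4):
    """Training objective of Eq. (13): per-task average over decoders,
    with the force task weighted by alpha."""
    loss_g = torch.stack([F.cross_entropy(d(z), y_g) for d in heads.dec_g]).mean()
    loss_f = torch.stack([F.cross_entropy(d(z), y_f) for d in heads.dec_f]).mean()
    return loss_g + alpha * loss_f

def select_best_decoders(z_val, heads, y_g, y_f):
    """Eq. (14): pick the lowest-loss decoder per task on validation data;
    alpha is irrelevant here because the two choices are independent."""
    d1 = min(heads.dec_g, key=lambda d: F.cross_entropy(d(z_val), y_g).item())
    d2 = min(heads.dec_f, key=lambda d: F.cross_entropy(d(z_val), y_f).item())
    return d1, d2
```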

Experiments and analysis

In this section, the proposed methods are evaluated on the amputee dataset and the healthy-subject dataset, respectively. On each dataset, a series of experiments is conducted: (1) gesture recognition with different methods; (2) multi-task recognition of gestures and force levels; (3) MTL with the WPTA strategy. All experiments are run three times and the average performance is reported below.

Recognition results of amputees

Gesture recognition with different methods

Gesture recognition under different force levels is an important topic. Building on previous experience, data preprocessing and classification algorithms are examined first. To verify the rationality of the proposed method, the methods in [12, 17] are used for comparison. The former applies downsampling and a low-pass Butterworth filter for amplitude estimation and then feeds the processed data to a multi-stream CNN, providing a strong contrast to the proposed method. The latter addresses a task similar to this paper's: it first extracts features including mean absolute value, zero crossings, slope sign changes and waveform length from the raw signal, and then adopts LDA to recognize gestures at two force levels. For the amputee dataset, the results of the nine amputees under the three methods are shown in Fig. 5.

Fig. 5 Results of different methods

It shows that the CNN model based on frequency-domain information achieves the best results (about 5.57% higher than the CNN with time-domain information, and 2.14% higher than the traditional method). This indicates that, compared with the time-domain information used in [12], frequency-domain information is more sensitive to variations in force. Moreover, unlike the traditional method, the CNN model can extract implicit information from the data more effectively, which verifies the effectiveness of the proposed method.

Multi-task recognition for gestures and force levels

Based on the multi-stream CNN with spectra as inputs, the MTL framework described in Sect. 3 is employed. Since the gesture recognition task is more important than the force-level recognition task, the weight coefficient \(\alpha\) is searched from 0.0 to 1.0 with an interval of 0.1. The results of both gesture and force-level prediction are shown in Tables 1 and 2.

Table 1 Gesture recognition results with different weight coefficients (%)
Table 2 Force-level recognition results with different weight coefficients (%)
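The grid search over \(\alpha\) is simply an outer training loop. A sketch, where `train_and_evaluate` is a hypothetical helper that trains the MTL model with the given weight and returns both validation accuracies:

```python
results = {}
for k in range(11):                                 # alpha = 0.0, 0.1, ..., 1.0
    alpha = round(0.1 * k, 1)
    acc_g, acc_f = train_and_evaluate(alpha)        # hypothetical helper
    results[alpha] = (acc_g, acc_f)
best_alpha = max(results, key=lambda a: results[a][0])  # by gesture accuracy
```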

Notably, \(\alpha = 0.0\) means that only the gesture task is performed, without force-level prediction, so no force result is reported. Table 1 shows that, compared with single-task gesture recognition, MTL effectively improves gesture accuracy. This suggests that gestures and force levels share a uniform feature space, verifying the feasibility of classifying both tasks within one MTL framework. Specifically, features learned for the force-level recognition task may be helpful for the gesture recognition task, and vice versa, so both tasks perform better under MTL. Besides, this method provides additional force information, making it more practical in real life. On this basis, MTL with grid search further improves gesture recognition accuracy, which again indicates a correlation between the two tasks and shows that variable force levels affect gesture recognition.

Multi-task recognition with the WPTA method

Table 1 shows that the gesture recognition accuracy is best when \(\alpha\) is between 0.3 and 0.6, so the weight coefficients in Eqs. 13 and 14 are set to 0.3, 0.4, 0.5 and 0.6, separately. For comparison with the original PTA method [26], training with Eq. 12 is also performed, denoted as "No" on the \(\alpha\) axis in Fig. 6.

Fig. 6 Results of the WPTA method

The green plane denotes the best result obtained by the MTL method with grid search (\(\alpha\) = 0.3, one decoder). With MTL and WPTA, improvements are achieved in both tasks, demonstrating the superiority of the proposed method. This is attributed to WPTA adaptively changing the learning dynamics, which can be regarded as fine-tuning the importance of the auxiliary task, thereby overcoming the limitation of a fixed weight coefficient shared by all subjects in grid search. Increasing the number of decoders improves the results further, in line with the conclusion of [26]. Theoretically, the original PTA method might achieve the same results with enough decoders, but at the cost of wasted computational resources.

Recognition results of healthy subjects

For the healthy-subject dataset, experiments under the different settings were carried out following the same procedure. For further analysis, experiments on segmented data with a length of 150 ms were also conducted. The best gesture prediction performance and the corresponding force-level prediction scores are shown in Table 3.

Table 3 Results of healthy subjects (%)

For MTL and MTL with WPTA in Table 3, \(\alpha\) is 0.3 for the 150 ms segment length and 0.4 for 200 ms. The proposed algorithm still shows clear advantages over the other methods. Since the spectra of 200 ms segments contain more detailed frequency-domain information, they yield better performance than 150 ms segments.

In particular, a comparison between STL and MTL with WPTA for each subject, based on frequency-domain information, is given in Table 4. For most subjects the proposed method yields favorable results regardless of the time window size. The reason is that with MTL, features learned for the force-level task improve the main task, and the additional loss also acts as a regularizer that prevents over-fitting. Besides, WPTA overcomes the drawback of equal task importance and yields a reasonable distribution of importance weights. Rather than one decoder per task, WPTA offers choices among multiple decoders, further improving performance. In addition, because physiological conditions differ across subjects, the degree of improvement also varies: stronger subjects resist muscle fatigue better during collection, resulting in more stable EMG signals at each force level, whereas for others the EMG signals are affected by muscle fatigue and the proposed method brings little improvement.

Table 4 Comparison of STL and MTL with WPTA (%)

In addition, compared with the amputee results, the healthy subjects perform better even though more gestures are classified under the same electrode-channel condition, and the accuracy differences among methods are smaller. This is likely because, without damage to the forearm, the sEMG signals of healthy subjects are more regular.

Conclusions and future work

Gesture recognition based on sEMG signals has been investigated extensively in recent years, yet neural decoding under variable force levels remains a difficult problem. This paper first explores combinations of data preprocessing and classifiers, showing that frequency-domain information is more sensitive to strength. Considering the importance of force information in real life, an MTL framework is leveraged to decode gestures and force levels simultaneously, and the experimental results validate its efficiency. Since grid search is too coarse and may lead to locally optimal solutions, the PTA strategy is then considered; however, on its own it cannot improve performance because all tasks are assigned the same weight by default. By combining the two methods, the WPTA technique is applied, which boosts the performance of all tasks in MTL. Notably, this form of optimization is suitable not only for sEMG signal decoding but also for other MTL tasks.

In future work, we will continue to study adaptive weighting methods as a replacement for grid search.