In this section, we discuss and compare the performance obtained with different classification setups and approaches. When presenting classification performance, we specifically report the area under the curve (AUC), since it provides a more reliable performance estimate for our imbalanced binary classification problem. In addition, while training any classifier, class weights are set inversely proportional to the number of samples in each class, so as to remove any bias caused by the imbalanced class sizes. Of all the setups discussed in this section, only the person-dependent one uses the data of a single participant for both training and testing, in a Leave-One-Sample-Out manner. The other setups, person-independent and transductive parameter transfer (TPT), use data from other participants. Thus, the person-dependent setup, being a personalised setting by nature, is expected to act as an upper bound on the performance.
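As a minimal illustration of this evaluation protocol, the snippet below shows how class weighting and AUC scoring might be set up with scikit-learn; the toy data and variable names are placeholders, not our actual features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# toy imbalanced data standing in for the accelerometer features
X = rng.normal(size=(200, 10))
y = (rng.random(200) < 0.2).astype(int)  # roughly 20% positive samples

# class_weight='balanced' sets each class weight inversely
# proportional to the number of samples in that class
clf = LogisticRegression(class_weight='balanced')
clf.fit(X[:150], y[:150])

# AUC is computed from continuous scores rather than hard labels,
# which makes it a more reliable metric under class imbalance
auc = roc_auc_score(y[150:], clf.predict_proba(X[150:])[:, 1])
print(f"AUC: {auc:.2f}")
```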
Person-dependent performance
In the person-dependent setup, each participant is trained and tested on their own data. Since we do not have enough data to form distinct training and test sets, we applied a Leave-One-Sample-Out cross-validation scheme for performance evaluation. Based on the findings reported in [10], we made sure that the training set is not contaminated: in each fold, any samples adjacent to the test sample are removed from the training set. With this elimination, we aim to provide an unbiased performance estimate. We used a logistic regressor as the classifier, where the optimal regularisation parameter C in Eq. (1) is found by nested k-fold cross-validation.
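A sketch of this decontaminated Leave-One-Sample-Out procedure is given below, assuming samples are ordered in time; the `margin` parameter (how many neighbours are dropped on each side of the test sample) is a placeholder, since the exact number depends on the window overlap.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

def decontaminated_loso(X, y, margin=1):
    """Leave-One-Sample-Out where samples adjacent in time to the
    test sample are removed from the training set, following [10]."""
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        # keep only samples further than `margin` positions from i
        train = np.abs(np.arange(n) - i) > margin
        # nested k-fold cross-validation over the regularisation C
        clf = LogisticRegressionCV(Cs=10, cv=5, class_weight='balanced')
        clf.fit(X[train], y[train])
        scores[i] = clf.predict_proba(X[i:i + 1])[0, 1]
    return roc_auc_score(y, scores)
```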
The procedure is applied to each participant's data separately, yielding a performance estimate for each. The resulting scores vary considerably, ranging from 55 to 79% AUC. The mean performance across all participants is 68 ± 6%. Individual scores for each participant are shown in Fig. 5.
The variation in performance scores can be linked to two factors we have already discussed. The first is the personal connection between speech and the body movements captured by the accelerometer. As expected, the problem becomes harder for people with more subtle movements, resulting in lower performance. Still, every participant's performance score is higher than random (50% AUC), showing that our features remain discriminative.
The second factor is related to the class distributions. As shown in Fig. 5, some participants' class distributions are highly skewed towards the negative class. We cannot say that such imbalance always guarantees low performance, since it may still be possible to train robust models from a small number of highly informative samples. However, we already see negative effects of this imbalance in our results: the participants with the lowest performance scores have few positive samples. Only two participants have AUC scores lower than 60% (P12: 55% and P15: 57%), and they have the second and third lowest percentages of positive samples (25 and 16%, respectively) in the whole dataset. So, for these two participants, we cannot be sure whether the low performance is caused by subtle movement while speaking or by the small number of positive samples.
We expect these results to act as an unbiased upper bound on speech detection performance.
Person-independent performance
In the person-independent setup, we used Leave-One-Subject-Out cross-validation for performance evaluation, where each participant's samples are classified with the model obtained from the other participants' data. The training set is thus formed by concatenating and standardising all other participants' data. As in the person-dependent setup, a logistic regressor is used as the classifier, and the optimal regularisation parameter is found on the training set with cross-validation.
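The sketch below outlines this Leave-One-Subject-Out procedure under the same assumptions as before; `datasets` is a hypothetical list of per-participant `(X, y)` pairs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def person_independent_loso(datasets):
    """datasets: list of (X, y) pairs, one per participant."""
    aucs = []
    for t, (X_test, y_test) in enumerate(datasets):
        # concatenate all other participants' data into one training set
        X_tr = np.vstack([X for s, (X, _) in enumerate(datasets) if s != t])
        y_tr = np.concatenate([y for s, (_, y) in enumerate(datasets) if s != t])
        # standardise with statistics computed on the training set only
        scaler = StandardScaler().fit(X_tr)
        clf = LogisticRegressionCV(cv=5, class_weight='balanced')
        clf.fit(scaler.transform(X_tr), y_tr)
        probs = clf.predict_proba(scaler.transform(X_test))[:, 1]
        aucs.append(roc_auc_score(y_test, probs))
    return np.mean(aucs), np.std(aucs)
```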
With this setup, we obtained an average AUC of 58 ± 7%. The individual scores varied from 45 to 60% and are also shown in Fig. 5, together with the results of the other setups. Apart from two participants (7 and 8), for whom the person-independent setup yielded slightly better AUC scores than the dependent one, the person-dependent setup always outperforms the independent one. We compared the per-person performances of the two setups using a paired one-tailed t test. As expected, the test showed that the person-dependent setup yields significantly better results than the independent one (\(p<0.01\)).
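The paired one-tailed test can be reproduced with SciPy as sketched below; the AUC values are illustrative placeholders, not our actual per-participant scores.

```python
import numpy as np
from scipy.stats import ttest_rel

# per-participant AUC scores for the two setups (illustrative values)
dep = np.array([0.68, 0.72, 0.61, 0.79, 0.66])
indep = np.array([0.58, 0.60, 0.55, 0.63, 0.57])

# paired one-tailed test: H1 is that the person-dependent setup
# scores higher on the same participants
t_stat, p_value = ttest_rel(dep, indep, alternative='greater')
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```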
In the ideal learning paradigm, training with more samples should yield a better, more robust model, which contradicts what we observe. However, this paradigm also assumes that all samples are drawn independently from the same probability distribution (i.i.d.). From Figs. 3 and 4, it is more likely that every participant has their own distribution from which their samples are drawn. Thus, concatenating the data of all participants and training a single model on it results in a decision boundary that is neither reasonable nor practical for the held-out participant. These person-independent results strengthen our claim about the personal nature of the connection between speech and body movements, and motivate the need for an adaptive model.
Transductive parameter transfer performance
Our TPT experiments also employed a Leave-One-Subject-Out setup, where each participant is treated as the target dataset while all other participants act as source sets. This setup is comparable to the person-independent one, since only the labels of the other participants are used for classification. With TPT, an average AUC of 65 ± 6% is obtained. Individual performance values are included in Fig. 5, alongside those of the person-dependent and person-independent setups.
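For concreteness, the sketch below outlines the TPT pipeline with our design choices (logistic regression for the source classifiers, an EMD kernel and multi-output KRR, all discussed in the following subsections). It is a simplified illustration, not our exact implementation: hyperparameter selection is omitted, and the POT library is merely one possible way to compute the EMD.

```python
import numpy as np
import ot  # POT: Python Optimal Transport, one way to compute the EMD
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LogisticRegression

def emd_kernel(sets, gamma=1.0):
    """Exponentiated Earth Mover's Distance kernel between datasets."""
    n = len(sets)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            M = ot.dist(sets[i], sets[j])   # pairwise ground costs
            d = ot.emd2([], [], M)          # EMD with uniform weights
            K[i, j] = K[j, i] = np.exp(-gamma * d)
    return K

def tpt_scores(source_sets, X_target):
    # 1) train one person-specific linear classifier per source set
    thetas = []
    for X, y in source_sets:
        clf = LogisticRegression(class_weight='balanced').fit(X, y)
        thetas.append(np.append(clf.coef_.ravel(), clf.intercept_))
    thetas = np.array(thetas)

    # 2) learn a mapping from data distributions to classifier
    #    parameters with KRR on a precomputed EMD kernel
    all_sets = [X for X, _ in source_sets] + [X_target]
    K = emd_kernel(all_sets)
    krr = KernelRidge(kernel='precomputed').fit(K[:-1, :-1], thetas)

    # 3) predict the target's parameters from its unlabelled feature
    #    distribution and apply the resulting linear classifier
    theta = krr.predict(K[-1:, :-1]).ravel()
    return X_target @ theta[:-1] + theta[-1]
```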
TPT clearly outperformed the person-independent setup for the majority of the participants (16 out of 18), providing AUC scores close to those of the person-dependent setup. A one-tailed t test between the TPT and person-independent scores showed that TPT is significantly better (\(p < 0.01\)). For a few participants (2, 7, 8 and 11), TPT even outperforms the person-dependent setup; however, the person-dependent results are still significantly better than TPT overall (\(p < 0.02\)). This result is quite interesting and might be caused by different factors. For participants 7 and 8, even the person-independent setup outperforms the person-dependent one, which suggests that for these participants, using more data (even data belonging to other participants) provides a better estimate of the decision boundary. In such cases, we may expect TPT to outperform all other setups. Although the same pattern is not present for participants 2 and 11, we might still argue that these participants benefited from the data of other participants, most probably those with a similar distribution.
These results show that it is still possible to generalise over unseen data with acceptable performance if an adaptive method like TPT is employed. One might argue that there is relatively little variation in an individual's behaviour within a 10-min segment. However, since between-person variation remains fairly high over this interval, as can be seen from Figs. 2 and 3, it is particularly interesting that we obtain good results, demonstrating the generalisation ability of our method even with a limited amount of data. With the proposed transfer learning approach, performance is always better than the random baseline, and statistical significance tests showed that our method performs better than traditional non-adaptive person-independent learning.
Comparison with the state of the art
This section compares the performance of our transductive parameter transfer implementation with state-of-the-art approaches. First, we present the person-independent results obtained with the Random Forest (RF) and Hidden Markov Model (HMM)-based approaches proposed in [12]. Second, we present the results obtained with the TPT implementation given in [35] and discuss in detail how our different choices affected the final performance. Individual performance scores obtained with all four methods, including ours, can be seen in Fig. 6.
Non-adaptive person-independent methods
We implemented the methods presented in [12], using the exact setup they defined: the same features (PSD, 0–8 Hz), window sizes for feature extraction (5 s for RF, 3.5 s for HMM), number of trees in the Random Forest classifier (500) and number of states in the HMM (2). We compare against the Leave-One-Subject-Out cross-validation setup reported in [12].
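The snippet below sketches how these two baselines might be instantiated; the class-conditional HMM scoring (one 2-state HMM per class, compared by log-likelihood) is our reading of [12], and the toy data stands in for the PSD features.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# toy stand-ins for the PSD (0-8 Hz) features used in [12]
X_train, y_train = rng.normal(size=(100, 16)), rng.integers(0, 2, 100)

# RF baseline: 500 trees over features from 5-s windows
rf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)

# HMM baseline: one 2-state HMM per class over 3.5-s windows;
# a test window is scored by the log-likelihood difference
hmm_pos = GaussianHMM(n_components=2).fit(X_train[y_train == 1])
hmm_neg = GaussianHMM(n_components=2).fit(X_train[y_train == 0])
score = hmm_pos.score(X_train[:5]) - hmm_neg.score(X_train[:5])
```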
With the RF, we obtained an average AUC of 55 ± 6%. The HMM performed slightly better, providing an average AUC of 59 ± 6%. Compared to our person-independent results obtained with logistic regression, neither RF nor HMM provided a significantly better result. This is an interesting finding, since it shows that a linear model is as powerful as a nonlinear one for the speech detection problem in a Leave-One-Subject-Out setup. Our proposed TPT method, on the other hand, significantly outperforms both of these methods. Only three participants score better with the baselines than with our TPT implementation: participants 1 and 3 for RF, and participants 1 and 11 for HMM. One-tailed t tests between our TPT results and both RF and HMM showed that TPT performs significantly better (\(p<0.01\) for both). The authors of [12] applied their non-adaptive method to a limited dataset of only nine people. We believe that, as the number of participants increases, the person-specific nature of speech is magnified and the need for adaptive methods grows.
Detailed comparison with state-of-the-art TPT implementation
Our proposed TPT implementation improves upon the one presented in [35]. Although the basic framework of the method remains the same, our implementation choices make the method more suitable for the nature of our problem, as demonstrated by the performance results. We ran the implementation provided by [35] on our data, which resulted in an AUC of 62 ± 6%. Our implementation outperforms it for 15 out of 18 participants, and a paired one-tailed t test between the performance scores shows that our implementation is significantly better (\(p<0.01\)).
There are four main differences between our implementation and the one in [35]. The TPT implementation in [35] uses: (1) an SVM instead of logistic regression (LR), (2) independent support vector regressors (SVRs) instead of KRR, (3) a density kernel (DK) instead of the EMD kernel, and (4) support vectors (SV) instead of the whole data (WD) to estimate the distributions of the source sets. To investigate which modification affected the performance most, we carried out four follow-up experiments; in each, we replaced one of our choices with the original one from [35]. Table 1 shows the average AUC and standard deviation over all participants obtained with each of these modifications. One-tailed t tests were used to quantify the differences between our full implementation and each modified approach.
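An illustrative harness for these follow-up experiments is sketched below; `run_tpt` is a hypothetical function returning per-participant AUC scores as an array, not part of our actual code.

```python
from scipy.stats import ttest_rel

full = run_tpt()  # hypothetical: our implementation (LR + KRR + EMD + WD)

# each variant swaps exactly one of our choices for the one in [35]
variants = {
    'SVM': run_tpt(classifier='linear_svm'),
    'SVR': run_tpt(regressor='independent_svrs'),
    'DK':  run_tpt(kernel='density'),
    'SV':  run_tpt(source_summary='support_vectors'),
}

for name, aucs in variants.items():
    # one-tailed: H1 is that the full implementation scores higher
    t, p = ttest_rel(full, aucs, alternative='greater')
    print(f"{name}: mean AUC = {aucs.mean():.2f}, p = {p:.4f}")
```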
Table 1 Performance and significance of the four modified TPT implementations compared to ours, which has an average AUC of 65 ± 6% (**\(p<0.01\); *\(p<0.05\))
Table 1 shows that the most influential change is the use of a logistic regressor instead of a linear SVM: the two setups in which our logistic regressor is replaced by an SVM (SVM and SV in Table 1) have the lowest performance. This is unexpected, since the two classifiers are quite similar. However, the logistic regressor was more successful than the linear SVM when training the person-specific classifiers, which we believe explains this performance difference.
Since our features are often correlated with each other, we preferred KRR over independent SVRs, a choice also supported by [30]. The performances in Table 1 back this decision: our method with KRR performed significantly better than the SVR variant. Although the average performance difference between the two methods is small, ours is significantly better.
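The practical difference between the two regression choices can be seen in the sketch below, assuming a precomputed source-set kernel `K_src` and a matrix `thetas` of per-source classifier parameters (both placeholders): KRR fits all correlated parameter dimensions jointly, whereas the setup of [35] requires one SVR per dimension.

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

# KRR treats the whole parameter vector as one multi-output target,
# sharing a single kernel-space solution across correlated outputs
krr = KernelRidge(kernel='precomputed').fit(K_src, thetas)

# independent SVRs, as in [35], fit each parameter dimension
# separately and ignore the correlations between outputs
svrs = [SVR(kernel='precomputed').fit(K_src, thetas[:, d])
        for d in range(thetas.shape[1])]
```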
Finally, we can see that replacing the EMD kernel with a density kernel (DK) does not affect the performance at all; for our data, the density kernel was as successful as the EMD kernel in estimating the similarity of distributions. This differs from the findings in [30], but we believe it is related to the distribution characteristics of our data.