In this section we present the results of the identification and authentication analyses with timing-based as well as audio-based features, the results obtained with fused features, and a comparison of all results. Identification and authentication are different analysis tasks, and their performance reporting should not be mixed: good identification performance does not automatically indicate good authentication performance, or vice versa. We report accuracy for the identification task and the Equal Error Rate (EER) for the authentication task.
As in the previous section, separate subsections present the results of the identification and authentication tasks for the timing and audio analyses. These are followed by a comparison of the results and by the fusion of the timing and audio results for the authentication task.
The 4-fold cross-validation was performed for all tests using all 4 combinations of training and testing recording sets, so that every recording served as training data in one setup and as testing data in another. The average over these 4 tests is reported as the cross-validated result.
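As an illustration, the protocol can be expressed as a minimal sketch in Python, where `sessions`, `train_model`, and `evaluate` are hypothetical placeholders for the recording sets and the system's enrollment and scoring routines:

```python
# Minimal sketch of the 4-fold cross-validation over recording sessions:
# each session is held out for testing once, the remaining three train
# (enroll) the system, and the four results are averaged.
import numpy as np

def cross_validate(sessions, train_model, evaluate):
    results = []
    for test_idx in range(len(sessions)):            # 4 folds
        test_set = sessions[test_idx]                # held-out session
        train_set = [s for i, s in enumerate(sessions) if i != test_idx]
        model = train_model(train_set)               # enroll on 3 sessions
        results.append(evaluate(model, test_set))    # accuracy or EER
    return np.mean(results)                          # cross-validated result
```

For the 1-training-session setups reported below, the roles of the held-out session and the remaining sessions are simply swapped.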
Identification task
Timing-based results
When evaluating the identification task with the same template and distance metric as in the authentication task, we obtained a rank-1 accuracy of 56.7% when using 1 session for training and 64.6% when using 3 sessions for training. The full Cumulative Matching Characteristic (CMC) curve for the latter case is given in Fig. 6. A CMC curve plots accuracy against rank N, i.e., the fraction of trials in which the tested user appears among the N top-scoring candidates.
The identification accuracy is clearly not very high, and it increases only slightly when three sessions are used for training and the remaining session for testing.
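For reference, a CMC curve can be computed from a matrix of match scores as in the following minimal sketch; `scores` and `labels` are hypothetical inputs, and for distance-based timing matching the negated distances can serve as scores:

```python
# Minimal sketch of a CMC computation. scores[i, j] is the similarity of
# test sample i to the enrolled template of user j (higher = more similar);
# labels[i] is the true identity of sample i.
import numpy as np

def cmc_curve(scores, labels):
    n_samples, n_users = scores.shape
    order = np.argsort(-scores, axis=1)                  # best score first
    # Position of the true user in each sample's sorted candidate list.
    ranks = np.array([np.where(order[i] == labels[i])[0][0]
                      for i in range(n_samples)])
    # CMC value at rank N = fraction of samples whose true user is in the top N.
    return np.array([(ranks < n).mean() for n in range(1, n_users + 1)])

# cmc_curve(...)[0] is the rank-1 identification accuracy reported above.
```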
Audio-based results
When using the audio-based information in an identification setting, the results were much better than those of the timing analysis. The accuracy depended strongly on the number of Gaussian mixtures (GM) per PDF and on the number of HMM states. Tables 2 and 3 give a partial overview of the accuracies obtained for various numbers of GMs per PDF and various numbers of HMM states. Table 2 shows the results when using 3 sessions for training and 1 for testing. The best result is obtained for 3 HMM states in combination with 128 GMs per PDF, but other settings give results that are almost as high.
Table 2 Cross-validated identification accuracy of the audio analysis when using 3 sessions for training and 1 for testing
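The per-user acoustic modelling behind these results can be sketched as follows. The paper's models were built with the HTK tools; purely as an illustrative stand-in (an assumption, not the actual toolchain) we use hmmlearn's `GMMHMM`, with `user_features` as a hypothetical mapping from each user to a list of per-recording feature arrays (frames × MFCC-like coefficients):

```python
# Minimal sketch: one GMM-HMM per enrolled user; identification picks the
# model with the highest frame-normalized log-likelihood.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_user_models(user_features, n_states=3, n_mix=128):
    models = {}
    for user, recordings in user_features.items():
        X = np.vstack(recordings)                 # concatenate all frames
        lengths = [len(r) for r in recordings]    # per-recording lengths
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type='diag', n_iter=20)
        hmm.fit(X, lengths)
        models[user] = hmm
    return models

def identify(models, recording):
    # Rank-1 decision: the enrolled user whose model scores the test
    # recording best, normalized by the number of frames.
    return max(models,
               key=lambda u: models[u].score(recording) / len(recording))
```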
We also ran a similar test using only a single session for training the system, with the other 3 sessions used for testing. In this case the results are significantly lower, as can be seen in Table 3. Comparing the two tables, we also note that the range of accuracy values is much broader when only 1 session is used for training, which is mainly due to the smaller amount of training data. We want to explore this phenomenon in future work, where we will have a larger amount of data and use universal model adaptation to decrease the amount of data needed to train a user model.
Table 3 Cross-validated identification accuracy of the audio analysis when using 1 session for training and 3 for testing
Table 4 shows the cross-validated results of the best audio models for each scenario. Interestingly, the randomly selected recordings yield the best results, which means that the typing behavior changed slightly between sessions. For example, when using only one session for training, the system achieved 90.62% (cross-validated); within this scenario, the best result was 92.91% with the second session used for training, and the worst was 88.93% with the fourth session in training. Thus the first session, in which the user typed the password for the first time, was neither the worst nor the best. In a real-life application the user chooses a password that is familiar to him, and therefore it should be easier to type. The poor results with the last session in training should likewise not affect a real-life application, because users usually do not lose their typing habits over time.
Table 4 Comparison of cross-validated audio analysis identification accuracy when using the best acoustic models
Authentication task
Timing-based results
The obtained Equal Error Rate (EER) was 14.4% when using 1 session for training and 3 for testing, and it dropped to 11.7% when 3 sessions were used for training and 1 for testing. This EER is at an acceptable level, especially given that the password is short (only 8 characters, i.e. 15 timing features) and is a common English word that is most likely easy to type for all participants.
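The template and distance metric are those described earlier in the paper and are not restated here; purely as an illustrative assumption, a scaled Manhattan distance (a common keystroke-dynamics choice) against a per-feature mean/standard-deviation template over the 15 timing features (8 hold times plus 7 inter-key latencies) would look like this:

```python
# Illustrative sketch only: template matching over 15 timing features.
import numpy as np

def build_template(enroll_samples):
    # enroll_samples: array of shape (n_samples, 15)
    mean = enroll_samples.mean(axis=0)
    std = enroll_samples.std(axis=0) + 1e-6   # avoid division by zero
    return mean, std

def scaled_manhattan(template, probe):
    mean, std = template
    # Lower distance = better match to the enrolled user.
    return np.sum(np.abs(probe - mean) / std)
```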
Audio-based results using calibration
We first evaluated the audio information for authentication purposes. When using a single session for training and the remaining three sessions for testing, the EER was as high as 21.1%. Even using 3 sessions for training and the last session for testing did not improve the results significantly: the EER decreased only to 19.1%. These results are too poor for practical purposes and, more importantly, they are worse than the results we obtained with the timing analysis.
Compared to the identification results, the problem is clearly the inconsistency of the output probabilities across test utterances. In the identification task we computed, for each test utterance, the probability under every user model enrolled in the system, so only the relative ranking of the models mattered. For the authentication task the probabilities were normalized by the number of frames and the energy of the test recording (as supported by the HTK tools), but they still fell in a different range for every recording, so no meaningful global threshold could be set. We therefore decided to use the first user as a benchmark. In a real system, the first user could be used to calibrate the setup (environment sound), and this calibration model could then be used to normalize the probability produced by the tested user's model. The main idea is that scoring the same recording with both models provides a reference level for the scores on that particular recording.
We computed the ratio between the genuine-model and calibration-model probabilities; the resulting scores of the genuine trials (1 for every test recording) and the impostor trials (48 for every test recording) were then used to compute the final EER, using the formula below.
$$Logprob_{\mathrm{calibrated}} = \frac{Logprob_{\mathrm{actual}}}{Logprob_{\mathrm{calibration\_user}}} \times 50$$
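A minimal sketch of this calibration and of the subsequent EER computation could look as follows; `genuine` and `impostor` are the per-trial calibrated scores, and we assume (as an illustration) that higher calibrated scores indicate a genuine trial:

```python
# Minimal sketch: calibration by the first user's model, then EER.
import numpy as np

def calibrate(logprob_actual, logprob_calibration_user):
    # Normalize the claimed user's log-probability by the calibration
    # (first) user's log-probability on the same recording, as in the
    # formula above.
    return logprob_actual / logprob_calibration_user * 50.0

def equal_error_rate(genuine, impostor):
    # Sweep a threshold over all observed scores; the EER is where the
    # false accept rate (impostors at or above threshold) meets the false
    # reject rate (genuine scores below threshold).
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```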
The authentication results varied between 9.4% and 14.8% EER. The best cross-validated result of 11.6% was achieved with a 1-state HMM with 512 Gaussian mixtures per PDF and 3 training (enrollment) sessions. The worst result for 3 training sessions (21%) was achieved with a 3-state HMM with 1024 mixtures per PDF. For the more realistic scenario of only 1 training session, the best cross-validated result of 16.6% EER was achieved with a 3-state HMM with 64 mixtures per PDF.
Comparison of timing and audio analysis
In this section we compare the performance results based on the timing information with those based on the audio information (Tables 5 and 6).
Table 5 Comparison of cross-validated rank-1 identification accuracy when using the best acoustic models and the general timing analysis models
Table 6 Comparison of cross-validated EER authentication results when using the general timing and audio analysis models with the calibrated results of the best acoustic models
We can clearly observe that audio-based and timing-based KD perform differently. Most notably, timing-based KD performs significantly better in the authentication task (see Table 6), while audio-based KD performs much better in the identification task (see Table 5). Given the high performance of audio-based KD in identification, we expect that better authentication performance should be attainable as well; the main hurdle at this moment is that the distance scores need to be normalized.
Fusion of the timing and audio analysis results for the authentication task
First, we chose the best models from the audio authentication setup for 1 and 3 training sessions. The chosen models were used for the fusion of the calibrated audio results with the timing-based analysis distances. The fusion was done by simple multiplication of the distances after rescaling both to the same range from 0 to 200. It was necessary to exclude the results of the first user, who was used for the audio probability calibration described above (also in the timing analysis), so the results are for 49 user authentications, without cross-validation.
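A minimal sketch of this simple linear fusion, with `timing_distances` and `audio_distances` as hypothetical per-trial score arrays, could look as follows:

```python
# Minimal sketch: rescale both score sets to a common 0-200 range, then
# fuse by element-wise multiplication.
import numpy as np

def rescale(scores, lo=0.0, hi=200.0):
    s_min, s_max = scores.min(), scores.max()
    return lo + (scores - s_min) / (s_max - s_min) * (hi - lo)

def fuse(timing_distances, audio_distances):
    # The fused value is again treated as a distance (lower = genuine).
    # Note the trial with the minimum distance in either system maps to 0.
    return rescale(timing_distances) * rescale(audio_distances)
```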
For comparison, we also used the Bosaris toolkit [6], which implements a fusion approach widely used in speaker identification/authentication [16] and in Query-by-Example Search on Speech [19]. We used half of the testing set as a development subset to train the fusion function and applied it to the rest of the testing set, the evaluation subset. We then swapped the two subsets; the average EER is presented in Table 7.
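Bosaris performs linear logistic-regression score fusion; as a rough stand-in (an assumption: scikit-learn instead of the actual MATLAB toolkit), the same idea can be sketched as follows, where the `dev_*` arrays train the fusion weights that are then applied to the held-out evaluation subset:

```python
# Minimal sketch of logistic-regression score fusion in the Bosaris style.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_timing, dev_audio, dev_labels):
    # One column per subsystem; labels: 1 = genuine trial, 0 = impostor.
    X = np.column_stack([dev_timing, dev_audio])
    return LogisticRegression().fit(X, dev_labels)

def apply_fusion(model, eval_timing, eval_audio):
    X = np.column_stack([eval_timing, eval_audio])
    # Fused score: the learned linear combination of the two subsystems.
    return model.decision_function(X)
```

Swapping the development and evaluation subsets and averaging the two EERs mirrors the protocol described above.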
Table 7 EER authentication results when using the best acoustic models and fusing them with the relevant timing models (same train/test sessions) using the simple linear approach and the Bosaris toolkit
For the 1-test-session fusion we chose the scenario where session 3 was used for testing and sessions 1, 2 and 4 for training of the 1-state, 512-Gaussian-mixture HMM model. For the 3-test-session scenario we chose session 2 for training and sessions 1, 3 and 4 for testing of the 3-state, 64-mixture model. The results of the original timing and calibrated audio distances, compared with the fused ones, are given in Table 7 and compared using Detection Error Trade-off (DET) curves in Fig. 7. The Bosaris toolkit results are presented as DET curves in Fig. 8a and b. It is clear from the data that the fusion of the timing and calibrated audio systems provides significantly better results than either of them alone.
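For completeness, the points of a DET curve such as those in Figs. 7 and 8 can be computed from the genuine and impostor scores as in this minimal sketch; the false accept and false reject rates are plotted against each other on normal-deviate (probit) axes:

```python
# Minimal sketch of DET-curve coordinates on probit-scaled axes.
import numpy as np
from scipy.stats import norm

def det_points(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # Probit transform; clip away 0/1, where the normal deviate is infinite.
    eps = 1e-6
    return (norm.ppf(np.clip(far, eps, 1 - eps)),
            norm.ppf(np.clip(frr, eps, 1 - eps)))
```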