1 Introduction

This paper is an extension of a paper presented at the CSNet 2023 conference, which detailed the collection and analysis of the data set and the results of the correlation algorithm. This paper adds further data analysis using K-Nearest Neighbour (K-NN) with Manhattan distance, t-SNE dimensionality reduction for visualizing distinct groups of data points, and additional results using K-NN to differentiate between the participants in the data set.

It is well known that weak, reused, or compromised passwords are a common security risk, with compromised credentials causing 80% of breaches of web applications by external attackers [1]. While there have been efforts to raise awareness of these risks, many people remain unconcerned about their password security, with 24% of people using a variation of the same 8 common passwords [2]. Furthermore, password reuse is very common: half of IT professionals admit to reusing passwords [3], the average employee reuses each password 13 times [4], and half of all people use the same password for all their accounts [5]. To make matters worse, in a study by Google [2], only 45% of respondents said they would change their password after discovering that their accounts had been breached. It has long been suggested that passwords are on their way out, to be replaced by other methods of authentication [6, 7]; however, passwords remain by far the most used method of personal authentication for account access control, often backed up by 2-factor authentication (2FA) on a mobile device giving a one-time code. While one-time codes are highly effective at preventing account breaches, a significant portion of people do not employ them on their accounts [8]. According to a Ponemon Institute report, roughly half of people report that one-time code 2FA causes irritation and interrupts their workflow [9]. Furthermore, 2FA is not a guarantee of security, as hackers can find ways to bypass it [10].

It has been shown that the characteristics of a person's typing can be used to accurately differentiate between people [11, 12]. This can be used to improve the security of password authentication by adding an additional authentication step. Multi-factor authentication consists of at least 2 factors among "something you know", such as a password, "something you have", such as a device, and "something you are", i.e., personal biometrics. The first two of these are in common use; however, the use of physiological biometrics for authentication is uncommon due to the extra implementation costs, such as iris scanners or fingerprint readers, or potentially exploitable face recognition [13]. Authentication through behavioral keystroke dynamics, in contrast, requires no additional hardware, nor any extra actions by the users, making it a promising method to unintrusively strengthen security.

There is a reported lack of keystroke dynamics data sets [12]. This study aims to add to the available data sets and give a preliminary analysis of them using simple statistical methods. The paper consists of the sections Background, giving an overview of the literature and existing data sets; Method, covering data collection; and finally Data analysis with results.

2 Background

In the 1980s, studies were done on the applicability of keystroke dynamics authentication (KDA) [14,15,16]. They found it to be a highly promising and feasible method to increase security. Since then, keystroke dynamics for profiling and authentication has been studied extensively for decades, and there is a wide collection of studies and literature on the topic. Works have investigated the effects of password length, password entropy, longitudinal effects [17], typing pressure, and touch screens, with free-text typing for continuous authentication [18] as well as fixed-text authentication [11, 12, 19, 20].

2.1 Data sets

While multiple data sets have been made to study keystroke dynamics, they are often limited in size and variety of passwords, and few are publicly available. In [12], the openly available data sets for KDA were surveyed. The authors report a lack of available data sets for KDA and list 6 KDA data sets, 4 of which include fixed text. The identified fixed-text data sets are ".tie5Roanl" [21], "try4-mbs" [22], "greyc laboratory" [23], and [24], consisting of "yesnomaybe", "bahaNe312!", and "ballzonecart". An issue with some of these data sets is that they use implausible passwords consisting of random characters, which goes against the common password recommendation of memorable passphrases [25]. Another issue is the limited selection of passwords in these data sets, making it difficult to determine which features of a password are most beneficial for KDA, such as length, entropy, readability, and typing distance.

The data sets presented in this work consist of readable passwords of varying length and symbol replacements, and unlike other data sets also include an “attack set” of entries from individuals who are unfamiliar with the passwords to emulate an attacker, as well as the legitimate users of the familiarized password. The inclusion of the attack set allows a higher typing variance for KDA benchmarking to make up for a limited data set size.

2.2 Metrics

Metrics for typing characteristics have been studied [17]. From a sequence of timing data of key presses and releases on a keyboard, metrics such as the timing between events can be extracted and used to categorize a password entry. The metrics used in this study are

  • Press-to-Press: the time between when a key is pressed down and the subsequent key is pressed down.

  • Release-to-Press: the time between when a key is released and the subsequent key is pressed down. This time may be negative.

  • Hold time: also known as Press-to-Release, the duration of time a key is held down.

Figure 1 demonstrates these three metrics. Other metrics might also be used, such as release-to-release, typing speed, or measurements between more distant keys across the password. Such metrics can be derived from the three metrics used here.
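As an illustration, the three metrics can be computed from a logged event sequence. The sketch below is a minimal Python example, assuming a hypothetical event format of (timestamp in ms, key, action) tuples and that each key occurs once per entry; the actual logger format may differ.

```python
def extract_metrics(events):
    """Compute press-to-press, release-to-press, and hold times (ms)
    from a list of (timestamp_ms, key, action) events, where action is
    'press' or 'release'. Assumes each key occurs once per entry."""
    press = {k: t for t, k, a in events if a == "press"}
    release = {k: t for t, k, a in events if a == "release"}
    keys = [k for t, k, a in events if a == "press"]  # keys in press order

    press_to_press = [press[b] - press[a] for a, b in zip(keys, keys[1:])]
    # Release-to-press may be negative when keystrokes overlap.
    release_to_press = [press[b] - release[a] for a, b in zip(keys, keys[1:])]
    hold = [release[k] - press[k] for k in keys]
    return press_to_press, release_to_press, hold
```

For example, if the second key is pressed before the first is released, the release-to-press value is negative, reflecting overlapping keystrokes.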

Fig. 1 A sequence of key actions showing the timing metrics

2.3 Possible configurations of KDA

Fig. 2 Two possible configurations of a KDA algorithm

Figure 2 demonstrates two possible ways to combine KDA with one-time codes. To achieve a more convenient multi-factor user authentication, keystroke dynamics may be used in parallel with one-time codes: only when the biometrics algorithm denies access is a one-time code from a 2FA app required, along with the correct password, to access the account. This may reduce the irritation and workflow interruption of 2FA, leading to higher adoption and thus stronger password security overall. However, accuracy is highly pertinent for this to be the case. The biometrics algorithm tuned to a user may correctly accept a login attempt by the authorized user, measured by the True Acceptance Rate (TAR), or correctly reject an unauthorized attacker, measured by the True Rejection Rate (TRR). It may also erroneously deny access to the authorized user, the False Rejection Rate (FRR), or erroneously allow access to an unauthorized attacker, the False Acceptance Rate (FAR). A false rejection results in irritation for the user, who has to go through an extra authentication step with the required one-time code. More importantly, a false acceptance would result in an account breach. Therefore, the applicability of keystroke dynamics to augment 2FA in parallel is strongly dependent on the false acceptance rate.

Alternatively, keystroke dynamics can be implemented as an additional step in series with a one-time code for high-security scenarios. In this case, the FRR is the major factor for applicability. Using a distance metric between a password entry and the legitimate user's recorded keystroke dynamics, a threshold can be used as a classifier. This threshold decides the strictness of the authentication and therefore affects both FRR and FAR: a lower FAR can be achieved at the cost of a higher FRR. Typically, studies give the Equal Error Rate (EER), which is the point where FAR and FRR meet. However, with the two proposed authentication flows, it is also useful to provide results for when the algorithm is optimized for lower false acceptances or lower false rejections. To achieve this, the confidence threshold to pass the KDA step can be adjusted. A higher threshold increases the FRR and decreases the FAR.
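The trade-off can be illustrated with a small sketch that sweeps the threshold over similarity scores. The score values and function names below are hypothetical; higher scores are assumed to mean greater similarity to the legitimate user's profile.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR for a given acceptance threshold on similarity scores."""
    frr = sum(s < threshold for s in genuine) / len(genuine)    # rejected genuine entries
    far = sum(s >= threshold for s in impostor) / len(impostor)  # accepted impostor entries
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping candidate thresholds."""
    best = None
    for t in sorted(set(genuine + impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]  # (EER estimate, threshold at EER)
```

Raising the threshold shifts errors from false accepts toward false rejects, which is why the parallel and series configurations favor different operating points.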

Table 1 The selected passwords

3 Method

The study is set up to gather data in a simulated real-life situation. In the event of a widely compromised password, the number of login attempts per person will be limited, and the attackers will likely be unfamiliar with the password. One possible scenario is a group of students acquiring the password of a university faculty member and attempting to log into their account to gain access to exam material. To simulate this scenario, two data sets are needed. The first consists of a large number of login attempts by a wide variety of participants at the university, here referred to as the "attack data set". The second is a data set of the "legitimate" users of a password, consisting of a small set of individuals with a larger quantity of data per individual, referred to as the "defence data set".

The study is designed to determine the following:

RQ1: How consistently do individuals type the collected passwords, and how is this affected by the typing proficiency (i.e., keystrokes per second)?

RQ2: How do features such as password length and complexity affect the consistency and distinctiveness of an individual's keystroke dynamics?

RQ3: How do realistic attackers differ from users, and can a data set of plausible attacker attempts augment the accuracy of a KDA algorithm?

3.1 Data collection

A system setup was developed to gather the data in the study, using a keylogger written in C++ which records every key press and release with millisecond timing. To maintain data integrity and continuous data collection, measures were taken to ensure the participants could not stop, disturb, or sabotage the collection: participants were not allowed to exit the data collection application, access the file system, or access other applications. The setup displays a word or phrase to type and only logs successful typing attempts; unnecessary key presses are not logged. If there is no typing within a time limit, the program resets. Keyboard keys which could interfere with the data collection were disabled, and duplicate copies of the recorded data were made regularly. Many USB-connected keyboards have a polling frequency of less than 1000 Hz, often 125 Hz, resulting in an 8 millisecond polling period. This is not an issue with PS/2 keyboards, so a PS/2 Norwegian keyboard was used to achieve millisecond time resolution in this experiment.

The data set collected for this study consists of two parts. The first is a set of 5 people typing the set of passwords 200 times each, representing legitimate users of a password; the second is a larger set of 100 people, each typing two passwords ten times, emulating a set of attackers. This provides a realistic comparison to compromised passwords being misused, where the small set of participants is equivalent to the legitimate users of accounts, while the larger set of participants giving 10 typing attempts per password represents malicious login attempts.

The attack data sets were collected at the university campus reception. The data set contains the millisecond timing of every key press and key release for a set of passwords typed by a set of participants. The keylogger was left unattended, running on a publicly available PC on campus. The participants were asked to write 2 of 6 possible passwords 10 times each and were rewarded with a unique code which could be exchanged for a small chocolate bar. The PC was left unattended during data collection; however, to prevent participants from attempting to get multiple codes, only one chocolate was given per person in exchange for a code, and a sleep delay was added to the system after each participant had received their code. If a participant gives up halfway, the program times out and resets, discarding the data from their attempt. For the defence set, the keylogger was altered to take a higher count of password entries. A group of 5 individuals was recruited to type all 6 passwords, 200 times each, on the same machine and keyboard as the attack set.

3.2 The passwords used

The passwords selected for this data set are shown in Table 1. They consist of varying lengths and complexity.

The number of keystrokes is how many keys need to be pressed to write the password, excluding "enter". Since 3 measurements are made per keystroke, the number of dimensions is three times the number of keystrokes, plus the hold time of "enter". The definition of entropy used here is the measure of possible configurations: given a password length L and a pool of R symbols, there are \(R^L\) possibilities. E is the bits of entropy given that \(2^E = R^L\), which gives \(E=\log _2(R^L) = L\log _2 R\). Four plausible passwords were made, consisting of one to four words selected from a dictionary. The shorter two passwords have two versions with special symbols, giving them a larger symbol pool R and thus higher entropy, while the longer pass-phrases are not appropriate for character substitutions, as they would be too inconvenient to type. Since a common way of cracking passwords is a dictionary attack, it is common advice not to use dictionary words in passwords. Other common advice is to create memorable pass-phrases by chaining multiple words to achieve a high entropy [26, 27]. Nevertheless, using dictionary words in passwords is common and is therefore relevant to investigate.
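Under this definition, the entropy in bits follows directly, as the short sketch below shows (the symbol pool sizes in the comment are illustrative):

```python
import math

def entropy_bits(length, pool_size):
    """E = log2(R^L) = L * log2(R) bits, for a password of length L
    drawn from a pool of R symbols."""
    return length * math.log2(pool_size)

# An 8-character password over lowercase letters only (R = 26) gives
# roughly 37.6 bits; substituting in digits and symbols raises R and thus E.
```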

Table 2 Standard deviation

3.3 Data quality and usability

The data consists of typing timing data of key presses and releases with 1 millisecond time resolution. Certain factors may affect the quality and usability of the data set.

Time resolution

The data sets have a time resolution of 1 millisecond. A PS/2-connected keyboard was used, since many USB-connected keyboards have a polling frequency lower than 1 kHz. It has been shown in [28] that higher clock resolutions produce better results.

Size

Since the defence data sets are limited in size and participants, attack sets were also collected. The attack data sets include 103 individuals typing two passwords ten times each, giving roughly 33 participants per password. This allows the attack sets to have a much higher variance than each defence set, which produces more realistic results than only comparing between the defence sets. While ten repetitions may not be sufficient for an algorithm to differentiate each individual, this data set is only intended to emulate an attack case. The "defence" data set consists of a much smaller number of individuals, but with a sufficiently high quantity of data per person, allowing an algorithm to recognize certain individuals typing certain passwords.

Demographic

The data collection was performed at a university campus, and the participants consist mostly of students. There is a high variance in typing proficiency in the participation pool. The participants are a random sample of people at the university.

Repeating individuals

While the data collection is stated to only permit one session per person of 20 password entries, some individuals may have come more than once, since participation is anonymous and unsupervised. However, the reward handout for participation is done manually, and it is stated that only one reward is given per person. In the case that some individuals attempted to participate on multiple days, the data may still be useful, as each person may type slightly differently on different days, and this is within a realistic attack scenario where the same attacker may attempt to log in on different days.

Typing speed

The typing speed of the attack data set is on average lower than that of the defence data set. This was expected, as the individuals in the defence set may improve their proficiency with the typed password, while the attack set consists of attackers unfamiliar with the passwords. The attempts from the attack data set may be used to validate the algorithm.

Repetitive typing

The act of typing the passwords many times in a row may affect data quality in the defence set. Other studies have investigated the longitudinal effects of typing over a longer time period; however, the data sets presented in this study were each collected in single sessions. There may be factors such as boredom and typing fatigue that manifest as detectable patterns in the data.

Realism

The attack data set emulates a real-world scenario where many different attackers get a handful of attempts at logging in with a compromised password. This data is useful since it gives a look into how people generally type the passwords to compare with the typing patterns of specific individuals.

Plausibility of passwords

The passwords used were selected from English dictionary words. Some of the pre-existing fixed-text data set passwords consist of a string of random symbols, while this data set aims to investigate the effect of password length and character substitutions.

Table 3 Correlation analysis

3.4 Data format

The first step is to process the data into a more useful form. The data collection program produces two formats which can be derived from each other: one is a sequential list of key presses and releases with millisecond timing, while the other is a set of metrics of the timing between these key presses and releases. This data could be further processed to show the timing relations between arbitrary pairs of keys. Each row in the data sets is a single password entry, consisting of three metrics for each key of the corresponding password in sequential order.

Fig. 3 t-SNE of (a) the 25-dimensional data set and (b) the 97-dimensional data set

4 Data analysis

4.1 Variance

The standard deviation of the sampled data can be used to show typing consistency and is found by calculating the standard deviation of each data column and averaging these. This is shown in Table 2.

The participants' average standard deviations range from 0.024 to 0.1916. This correlates with typing speed, and lower variance is expected to correlate with KDA accuracy, as higher typing consistency is expected to produce more distinct classifications. The average standard deviation for a password can then indicate which password would perform best with KDA. It can be seen that special characters result in a higher standard deviation. The password length does not appear to have a significant effect, which is expected since this is the average of the variances of each data measurement.
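This per-column computation can be reproduced with a short sketch. It assumes the data layout described in the Data format section: one row per password entry, one column per timing metric.

```python
import statistics

def average_stddev(entries):
    """Mean of the per-column sample standard deviations.
    entries: list of rows, one row per password entry."""
    columns = list(zip(*entries))
    return statistics.mean(statistics.stdev(col) for col in columns)
```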

Table 4 KNN results, K = 3

4.2 Correlation

Correlation is a measure of statistical relationships between data. The correlation between two entries X and Y is defined in Eq. 1, where \(\bar{x}\) and \(\bar{y}\) are the averages of the sets, and x and y are the members of the sets.

$$\begin{aligned} Correl(X,Y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2\sum (y-\bar{y})^2}} \end{aligned}$$
(1)

Each entry of each set is correlated with the entries of every other set to get the average correlation between sets. Table 3 shows the average correlations of entries from one data set to entries of another data set. As each data set has some variance, the average correlation of entries with their own data set indicates the consistency of proportional timing between keys. Standard deviation showed the average timing consistency of each key, while this shows the consistency of the timing of each key relative to the others. As the average standard deviation of each feature was highest when special characters were used, the self-correlation is also lower in these cases. Additionally, it can be seen that increasing the length of the password has a detrimental effect on the self-correlation, i.e., the consistency of typing between entries.
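A minimal sketch of this set-to-set comparison, directly following Eq. 1 (entry and set names are illustrative):

```python
import statistics

def correl(x, y):
    """Pearson correlation between two entries, as in Eq. 1."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def average_cross_correlation(set_a, set_b):
    """Average correlation of every entry in set_a with every entry in set_b."""
    return statistics.mean(correl(x, y) for x in set_a for y in set_b)
```

For self-correlation, each entry would be excluded from its own set before averaging, as done later in the KDA algorithm.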

4.3 t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique commonly used for data visualization; it can reveal patterns and structures within high-dimensional data by mapping them to a lower-dimensional space while preserving pairwise similarities. Figure 3a shows the t-SNE visualization of the "observer" data sets, reduced from 25 dimensions. Figure 3b shows a visualization of the data sets for "Repetition Learn Machine Thinker", reduced from 97 dimensions.

The attack set has a high variance. As the attack set consists of a mix of approximately 30 attackers per password, this is expected. The defence sets are well clustered, especially with the faster typers. Participants 4 and 5 are the slowest typers in the set and are less well-defined in this visualization. Comparing the two graphs, it can be observed that the shortest passwords result in significantly better clustering than the longer password.
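Such a projection can be sketched with scikit-learn, assuming it is available; the array shapes and parameters below are illustrative rather than the exact settings used for Fig. 3.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(entries, perplexity=30.0, seed=0):
    """Map an (n_entries, n_features) array of timing vectors to 2-D.
    Perplexity must be smaller than the number of entries."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(np.asarray(entries))
```

The 2-D output can then be scatter-plotted with one color per participant to reveal the clusters discussed above.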

4.4 K-nearest neighbor

Using K-nearest neighbor, an entry can be classified as belonging to one of the participants. This is done by measuring the distance from an entry, i.e., a data point, to the data points in a labeled data set and assigning the data point the majority class of its nearest neighbors. A constant K is the number of neighbors used to decide the class. To measure the distance between two data points, Manhattan distance is used, shown in Eq. 2, where a and b are the coordinate elements of data points A and B, respectively. Other distance metrics such as Euclidean distance were also considered; however, Manhattan distance was found to perform better for high-dimensional data.

Fig. 4 The KDA correlation algorithm

Fig. 5 Participant 4, "Ob$erv3r": correlation of each entry of the defence set with its own data set (blue) and with the attack set (orange)

Fig. 6 Participant 4, "Ob$erv3r": correlation of each entry of the attack set with its own data set and with the defence set

Table 5 Correlation results
Fig. 7 Defence entries and attack entries with a threshold

The issue with implementing this method in a real scenario is the requirement of having a data set of multiple people typing the same password. However, it may be possible to synthesize these data.

$$\begin{aligned} \mathrm{Manhattan\,Distance}(A,B) = \sum |a - b| \end{aligned}$$
(2)

To calculate the true accepts and false rejects of the K-nearest neighbor method for a data set, each data point is evaluated against the data set with the evaluated data point removed from it, and the resulting classification is compared with the original label. The value of K was selected based on the resulting accuracy. To find the true rejects/false accepts, the data points from the attack data set are classified as either the target class (false accept) or another class (true reject). A choice must be made here whether a data point may be classified as belonging to the attack set. If not, each attack data point must be classified as one of the 5 users, guaranteeing an average false accept rate of 20%. If attack data points can be classified as belonging to the attack set, then the data points of the specific attack-set individual whose entry is being evaluated must be excluded, since the attack set is only meant to represent plausible attackers and should not include the specific attacker attempting to log in. Note that although attack participants were instructed not to participate twice, and the data collection process was only partially supervised, it is possible some participated more than once, which would skew the false positive rate using this method. For the results listed in Table 4, it is assumed this is not the case.
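The leave-one-out evaluation with Manhattan distance can be sketched as follows; the data layout (label, feature vector) and function names are assumptions for illustration, not the exact implementation used.

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan distance between two feature vectors, as in Eq. 2."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_classify(point, dataset, k=3):
    """Classify a point as the majority label of its k nearest neighbors."""
    nearest = sorted(dataset, key=lambda lv: manhattan(point, lv[1]))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(dataset, k=3):
    """Evaluate each point against the data set with itself removed."""
    hits = sum(
        knn_classify(vec, dataset[:i] + dataset[i + 1:], k) == label
        for i, (label, vec) in enumerate(dataset)
    )
    return hits / len(dataset)
```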

4.5 Correlation as a KDA algorithm

By correlating a password entry to the data sets, a correlation threshold can be used to determine if the entry belongs to the legitimate user or an unknown attacker.

When a password entry is input to the KDA algorithm to determine whether it comes from the legitimate user or an attacker, it is

  • Correlated with every entry of the defence data set, to find the average correlation with the data set for that user.

  • Correlated with every entry in the attack data set, to find the average correlation with attackers.

  • Evaluated based on average correlations to defence and attack data sets, compared to each other and a correlation threshold.

The correlation value may be used to differentiate between user and attacker. One method is to assume the highest correlation is always the match. Another method is requiring a correlation above a certain threshold. A third method is requiring that the difference between self-correlation and attack-correlation is high enough. The method used here is a combination of the first two.
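The combined rule can be sketched as below: accept only if the entry's average correlation with the defence set exceeds both the threshold and its average correlation with the attack set. The helper names and threshold value are illustrative.

```python
import statistics

def correl(x, y):
    """Pearson correlation between two entries, as in Eq. 1."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def kda_accept(entry, defence_set, attack_set, threshold=0.8):
    """Accept if the entry correlates with the legitimate user's data
    more strongly than the threshold and than with the attack set."""
    c_def = statistics.mean(correl(entry, e) for e in defence_set)
    c_att = statistics.mean(correl(entry, e) for e in attack_set)
    return c_def > threshold and c_def > c_att
```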

Two types of measures can be made here. A common performance metric is the Equal Error Rate (EER), the point at which the FAR is equal to the FRR. The confidence threshold is configured for each individual and password. The alternative measure is one where either a low FAR or a low FRR is highly prioritized, by measuring one rate at the point where the other reaches 0.

Through the KDA process in Fig. 4, a data entry is correlated with both the defence set and the attack set to determine if it belongs to the correct user or an attacker. To achieve an accept, the correlation coefficient with the defence set needs to be higher than the set threshold, but also higher than the correlation with the attack set. When correlating the entries of the defence set, this may lead to false rejects, as demonstrated in Fig. 5. However, when correlating the entries of the attack set with the defence set, attack entries that have a higher correlation coefficient with the attack set lead to true rejects even above the threshold, as demonstrated in Fig. 6. This allows the threshold at the EER to be lowered, which in turn lowers the FRR.

When correlating the defence entries to the defence set, the specific entry in question is excluded from the set, and when attack entries are correlated to the attack set, every entry of the corresponding participant in the attack set is excluded. By comparing the entries marked as green in Figs. 5 and 6, it is clear that this method of correlating with an attack set in addition to the defence set can augment the accuracy of the algorithm, as the additional true rejects outnumber the additional false rejects. An issue with implementing this method is that, for the sake of usability, reducing false rejects to an absolute minimum might be a priority; its suitability therefore depends on whether a low false reject rate, a high true accept rate, or a low EER is prioritized.

To find the EER of this process for each defence set, each entry of the defence set is evaluated to find the FRR and TAR, and each entry of the attack set is evaluated to find the TRR and FAR. The threshold is then adjusted so that the FAR is equal to the FRR. Additionally, the threshold can be adjusted so that either the FRR or the FAR is equal to zero. Note that since excluded entries may lead to false rejects, an FRR of 0 cannot always be achieved by adjusting the threshold. Table 5 presents these results for the data sets collected. Figure 7 shows these correlations for one of the participants typing one of the passwords as an example.

5 Conclusion and future work

The main contributions of this work are the public data set for keystroke dynamics research; an analysis of the data set using various statistical methods, with an investigation into the effects of password features; and K-NN and correlation algorithms with results as a benchmark for further research.

The consistency of the data sets was analyzed using various methods, such as standard deviation of each data column, average correlation of each entry of a set with its own set as well as the other sets, and t-SNE to visualize the distinct clustering of each participant. It was found that typing speed was correlated with accuracy.

Two algorithms were used to produce benchmark results with the collected data sets. The KNN method is dependent on sets of multiple people typing the same password; however, these data could potentially be synthesized for an arbitrary password by simulating a human typist. The correlation or distance threshold method can be used on single individuals typing a unique password and is validated using a set of realistic attackers. Both methods can be augmented with a broad set of realistic attack attempts. Password length had a positive effect with the KNN method and a negative effect with the distance threshold method. Password complexity had a detrimental effect with both methods.

The authors will proceed to implement machine learning algorithms such as LSTM in order to improve the results and to investigate the effects of password features and the usage of an attack set with such algorithms. The data sets are publicly available online on USN Figshare and in [29], adding to the limited selection of data sets for KDA research, and include a variety of passwords typed by multiple people with a range of typing speeds.