Adaptive password guessing: learning language, nationality and dataset source

Human-chosen passwords are often predictable. Research has shown that users of similar demographics, or users choosing passwords for the same website, will often choose similar passwords. Human password guessers leverage this knowledge to tailor their attacks. In this paper, we demonstrate that a learning algorithm can actively learn these same characteristics of the passwords as it is guessing and that it can leverage this information to adaptively improve its guessing. Furthermore, we show that if we split our candidate wordlists based on these characteristics, then a multi-armed bandit style guessing algorithm can adaptively choose to guess from the wordlist that will maximise successes.


Introduction
Password guessing is a common method an attacker will deploy for compromising end users. A human password guesser will often leverage auxiliary information (spoken language, knowledge of the website the passwords were created for, or demographics of the users) in order to tailor their guesses. However, nowadays, many guessing attacks are automated. It is important for us to understand the capabilities of such automated attacks. In this paper, we investigate whether an automated guessing algorithm can similarly identify such patterns and leverage them to improve its guessing.
A commonly used method for guessing passwords involves using wordlists of password guesses and guessing these in an optimal order to compromise as many users as possible. Guessing passwords in the optimal order is important for an attacker as they wish to compromise many users with a small number of guesses. In particular, an online attacker is often using automated attacks and is limited to a certain number of guesses before a lockout is triggered. It is a challenge for an attacker to discover which guesses will result in the highest success rates.
In this paper, we suggest and develop an explore and exploit protocol based on the classic multi-armed bandit problem (MAB). This protocol can be used to guess a password set effectively using guesses from a selection of wordlists. Because of the learning nature of the model over a small number of guesses, we view this as a method which could prove effective as an online and offline guessing method.
We see at least three potential offensive use-cases for such a guessing model. The first approach is the most direct and utilises the real-time convergence of the MAB. An online guesser, guessing a selection of users' passwords from a website, will learn from each success and use it to inform the next guess made against all users. A second approach involves an attacker gathering information by applying the multi-armed bandit to an offline leaked dataset of users from a given organisation. They can use the MAB to determine the optimum choice of wordlist and then carry out a tailored (online) attack on other users from the same organisation. This has the potential to be effective because guessing with passwords created by users of the same organisation can significantly improve guessing returns [1]. A third, similar method could be deployed by learning information via online guessing against a subset of accounts. The MAB can guess against these accounts, learning characteristics until they are locked out. Once the MAB has highlighted the appropriate wordlist, the wordlist can be used against other users, avoiding triggering lockout on potentially more valuable users.
While offensive use cases are interesting, a more immediate application of this work is to emphasise the need for organisations to encourage users away from predictable password patterns. This work demonstrates that passwords differ measurably depending on their source use and that an automated password guesser can take advantage of this to significantly improve guessing success. This indicates that websites should consider blocklisting passwords in a way that is tailored to their particular subject matter and users. In particular, websites that have experienced previous password leaks could work to restrict future users from using passwords which occurred with a high frequency in that leak.
The paper continues as follows: Section 2 describes related work. Section 3 describes our set-up of the multi-armed bandit learning model. Section 4 demonstrates that our multi-armed bandit can be used to identify dataset source and that leveraging this knowledge improves guessing. We also show that the multi-armed bandit can improve guessing even when it is not leveraging specific knowledge about the dataset source. Section 5 demonstrates that language and nationality characteristics can also be derived during guessing and that these can be adaptively leveraged to improve guessing success. Section 6 provides a brief summary and conclusion. Appendix A provides the analysis behind the variable choice and implementation decisions for our model set-up.

Related work
The most widely used guessing strategy involves combining wordlists with word mangling rules. The success of this strategy, which is implemented by Hashcat and John the Ripper, is best highlighted through the success of both teams in the annual KoreLogic "Crack Me If You Can" contest and also in the widespread use of both tools [2][3][4]. In 2005, Narayanan and Shmatikov employed Markov models to enable faster guessing [5]. A Markov model can predict the next character in a sequence based on the current character. In 2009, Weir et al. used probabilistic context-free grammars (PCFG) to guess passwords [6]. PCFGs characterise a password according to its "structures". Structures can be password guesses or word mangling templates that can take dictionary words as input. In 2016, Wang et al. developed a targeted password guessing model which seeds guesses using users' personally identifiable information [7]. Wang et al. leverage existing probabilistic techniques including Markov models and PCFG as well as Bayesian theory. They create tags for specific personally identifying information (PII) associated with a user. In their most successful version, TarGuess-I, they use training with these type-based PII tags to create a semantic-aware PCFG. Independently, Li et al. also created a method for seeding password guesses with personal information. Their guessing also extended the probabilistic context-free grammar method [8]. Also in 2016, the use of artificial neural networks for password guessing was proposed by Melicher et al. [9]. Artificial neural networks are computation models inspired by biological neural networks. These were used to generate novel password guesses based on training data. In 2019, Hitaj et al. proposed using deep generative adversarial networks (GAN) to create password guesses [10]. A generative adversarial network pits one neural network against another in a zero-sum game. Their model, PassGAN, is able to autonomously learn the distribution of real passwords from leaked passwords and can leverage this to generate new guesses. In 2020, Pasquini et al. [11,12] introduced the idea of "password strong locality" to describe the grouping together of passwords that share fine-grained characteristics. This password locality can be leveraged to train their learning model to generate passwords that are similar to those seen and to help with password guessing.
Our model differs from previous work in that previous work has tried to create effective words to be guessed against passwords and has used learning techniques to inform how to create these words. In our work, we assume lists of guesses exist in the form of multiple wordlists; our learning technique informs which wordlist will be most effective for guessing the particular password set and how to combine guesses from multiple wordlists in order to utilise them effectively. In particular, instead of creating a single ordered wordlist, our method proposes splitting wordlists according to their characteristics, e.g. a German password wordlist and a wordlist leaked from LinkedIn. In this way, the wordlists with characteristics that are relevant to the passwords being guessed can be chosen and then others ignored.
We believe this is an effective way to tackle optimising guessing because research has shown that demographics and password source play a significant role in users' password choice. In 2012, Malone and Maher investigated passwords created for different websites and found that nationality plays a role in users' choice of passwords [13]. They also found that passwords often follow a theme related to the service they were chosen for. For example, a common LinkedIn password is "LinkedIn". This result was further investigated by Wei et al. in their 2018 paper "The Password Doesn't Fall Far: How Service Influences Password Choice" [14].
Researchers have also studied password sets drawn from particular languages. Sishi et al. studied the strength and content of passwords derived from seven South African languages [15]. Li et al. completed a large scale study of Chinese web passwords [16]. Weir et al. used Finnish passwords as a basis for studying the use of PCFG in the creation of password guesses and mangling rules [6]. Dell et al. [17] included both Italian and Finnish password sets and guessed them using English, Italian and Finnish wordlists. In this work, we show that nationality plays a role in password choice even when the spoken language is the same.

The multi-armed bandit
The learning technique we use is based on an adaptation of the classic multi-armed bandit problem. The multi-armed bandit problem describes the trade-off a gambler faces when presented with a number of different gambling machines. Each machine provides a random reward from a probability distribution specific to that machine. The crucial problem the gambler faces is how much time to spend exploring different machines and how much time to spend exploiting the machine that seems to offer the best rewards. The objective of the gambler is to maximise the sum of rewards earned through a sequence of lever pulls.
In our scenario we consider each different wordlist as a bandit that will give a certain distribution of successes when used. In order to make effective guesses, we want to explore the returns from each wordlist and also exploit the most promising wordlist. With each guess, we learn more about the distribution of the password set that we are trying to guess. Leveraging this knowledge, we can guess using the wordlist that best matches the password set distribution, thus maximising rewards.

Password guessing: problem set-up
Suppose we have $n$ wordlists. Each wordlist, $i = 1, \ldots, n$, has a probability distribution $p_i$, and $\sigma_i(k)$ denotes the position of password $k$ in wordlist $i$. So, the probability assigned to password $k$ in wordlist $i$ is $p_{i,\sigma_i(k)}$.
Suppose we make $m$ guesses where the words guessed are $k_j$ for $j = 1, \ldots, m$. Each of these words is guessed against the $N$ users in the password set, and we find $N_j$, the number of users' passwords compromised with guess number $j$.
To model the password set that we are trying to guess, we suppose it has been generated by choosing passwords from our $n$ wordlists. Let $q_i$ be the proportion of passwords from wordlist $i$ that was used when generating the password set. Our aim will be to estimate $q_1, \ldots, q_n$, noting that
$$\sum_{i=1}^{n} q_i = 1 \quad \text{and} \quad q_i \ge 0. \tag{1}$$
This means that the $q_i$ are coordinates of a point in a probability simplex. If the password set was really composed from the wordlists with proportions $q_i$, the probability of seeing password $k$ in the password set would be
$$Q_k = \sum_{i=1}^{n} q_i \, p_{i,\sigma_i(k)}. \tag{2}$$
By construction, the $Q_k$ are between 0 and 1, and because $\sum_i q_i \le 1$, then $\sum_k Q_k \le 1$.
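As a minimal sketch of the mixture probability $Q_k$ described above, each wordlist can be held as a dictionary from password to empirical probability, which plays the role of $p_{i,\sigma_i(k)}$. The helper names `wordlist_distribution` and `mixture_probability` and the toy passwords are illustrative assumptions and are not taken from the released implementation.

```python
from collections import Counter


def wordlist_distribution(passwords):
    """Turn a list of (possibly repeated) passwords into a probability distribution."""
    counts = Counter(passwords)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def mixture_probability(word, distributions, q):
    """Q_k = sum_i q_i * p_i(k); a word absent from wordlist i contributes 0."""
    return sum(q_i * p_i.get(word, 0.0) for q_i, p_i in zip(q, distributions))


# Toy data (hypothetical, not drawn from the leaks used in the paper).
p1 = wordlist_distribution(["123456", "123456", "dublin", "ireland"])
p2 = wordlist_distribution(["123456", "password", "schatz", "schatz"])
q = [0.75, 0.25]  # assumed weightings of the two wordlists
Q_123456 = mixture_probability("123456", [p1, p2], q)
Q_schatz = mixture_probability("schatz", [p1, p2], q)
```

Because "123456" appears in both toy wordlists, both weightings contribute to its score, whereas "schatz" only receives mass from the second wordlist.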

Maximum likelihood estimation
Given this problem set-up, we will construct a likelihood function which will describe the likelihood that a given set of parameters $q_1, \ldots, q_n$ describe the password set.
In this section, we introduce this likelihood estimator and describe the methods for convergence to a unique maximum.

Likelihood function
We construct the following likelihood for our model with $m$ guesses:
$$L = \binom{N}{N_1, \ldots, N_m,\; N - N_1 - \cdots - N_m} \, Q_{k_1}^{N_1} \cdots Q_{k_m}^{N_m} \, \left(1 - Q_{k_1} - \cdots - Q_{k_m}\right)^{N - N_1 - \cdots - N_m},$$
where the first term is the multinomial coefficient representing the number of possible orderings for successfully guessing all $N$ users' passwords, where $(N_1, \ldots, N_m)$ are the successes for the guesses we have already made, and $(N - N_1 - \cdots - N_m)$ are the successes for the guesses we have yet to make. The second term, $Q_{k_j}^{N_j}$, denotes the probability with which we expect to see password $k_j$ in the password set to the power of how many times it was actually seen. The final term represents the remaining guesses and states that they account for the remaining users' passwords in the password set that have not yet been compromised.
Our goal is to maximise this likelihood function by choosing good estimates for $q_1, \ldots, q_n$ based on our observed rewards from each previous guess. Note that a single password can exist in multiple wordlists, so with each guess, we learn more about $q_i$ for all of the wordlists. In fact, one of the interesting features of this model compared to a traditional multi-armed bandit model is that one lever pull can provide information about all the bandits.
We can take the log of the likelihood function to create a simplified expression. In addition, we can remove the multinomial, which is simply a constant for any values of $Q$. This leaves us with
$$\log L = \sum_{j=1}^{m} N_j \log Q_{k_j} + \left(N - N_1 - \cdots - N_m\right) \log\left(1 - Q_{k_1} - \cdots - Q_{k_m}\right).$$
Note that if $(1 - Q_{k_1} - \cdots - Q_{k_m}) = 0$, then the probability of a password existing outside the first $m$ passwords is 0, so the number of times we see it would also have to be 0: the whole term will be 0 and $\log(0)$ is not an issue. A similar argument applies to all the other terms, since if $Q_{k_j}$ were zero, $N_j$ would also be zero, as we would not have seen the password.
In [18], we prove that the log-likelihood function, $\log L$, is concave. This means that the likelihood function has a unique maximum value [19], making it a good candidate for numerical optimisation. We will use gradient descent to find the $q_i$ that maximise $L$ after $m$ guesses subject to the constraints (Eq. 1).
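The log-likelihood described above can be evaluated directly from the guess probabilities and the observed success counts. The following is a minimal sketch under stated assumptions: the function name `log_likelihood` and its argument layout are hypothetical, the constant multinomial term is dropped, and zero-count terms are skipped to mirror the convention that such terms contribute 0.

```python
import math


def log_likelihood(q_guessed, successes, n_users):
    """Log-likelihood up to the constant multinomial term.

    q_guessed[j] is Q_{k_j}, successes[j] is N_j and n_users is N.
    """
    total = 0.0
    for q_k, n_j in zip(q_guessed, successes):
        if n_j > 0:
            if q_k <= 0.0:
                return float("-inf")  # observed a password the model rules out
            total += n_j * math.log(q_k)
    remaining_users = n_users - sum(successes)
    remaining_mass = 1.0 - sum(q_guessed)
    if remaining_users > 0:
        if remaining_mass <= 0.0:
            return float("-inf")  # users remain but no probability mass is left
        total += remaining_users * math.log(remaining_mass)
    return total


# Toy check: 4 of 10 users chose the guessed password, so Q = 0.4 should
# score higher than Q = 0.2.
better = log_likelihood([0.4], [4], 10)
worse = log_likelihood([0.2], [4], 10)
```

In this toy case the value at $Q = 0.4$ exceeds the value at $Q = 0.2$, as expected when the model probability matches the observed frequency.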

Gradient descent
As we apply iterations of gradient descent to estimate the parameters $q_1, \ldots, q_n$ which maximise the likelihood function, we must maintain the constraints of the system (Eq. 1). To meet these constraints, we project the gradient vector onto a probability simplex and then adjust our step size so that we stay within that space.
With each iteration of gradient descent, we move a step in the direction which maximises our likelihood function. The gradient $g$ is scaled by a factor $\alpha$ to give a step size. This is further scaled by an amount $\beta$ to ensure the move from the current point $p$ to $p + \alpha\beta g$ satisfies $\beta \le 1$, $\beta \lVert g \rVert \le 1$ and that $p + \beta g$ lies within the simplex.
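One way to realise a constrained step of this kind is sketched below. It is an illustration under assumptions rather than the authors' implementation: here the projection is taken to mean subtracting the mean of the gradient so that the components sum to zero (which keeps the weights summing to one), and $\beta$ is then capped so that no weight becomes negative. The function name `projected_step` is hypothetical.

```python
import math


def projected_step(q, grad, alpha=0.1):
    """Take one constrained step from q along the projected gradient."""
    n = len(q)
    mean = sum(grad) / n
    g = [gi - mean for gi in grad]  # components now sum to zero
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(q)  # no feasible direction of improvement
    beta = min(1.0, 1.0 / norm)  # enforces beta <= 1 and beta * ||g|| <= 1
    for qi, gi in zip(q, g):
        if gi < 0:  # moving towards the q_i = 0 face of the simplex
            beta = min(beta, qi / -gi)
    return [qi + alpha * beta * gi for qi, gi in zip(q, g)]


new_q = projected_step([0.5, 0.5], [3.0, 1.0])
```

With $\alpha < 1$ the updated point stays strictly inside the feasible region whenever the starting point does, because the step is a fraction of the largest move that keeps every weight non-negative.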

Multi-armed bandit implementation
The multi-armed bandit problem involves a number of design choices. The key variables are the initial $q_i$ values, the choice of word to guess and how big a step size to take in the gradient descent. In Appendix A, we test multiple options for each of these variables. Below, we summarise the optimal implementation for our model.
Initialisation. We expect the gradient descent to improve with each guess made since every guess provides it with more information. There are a number of ways of initialising the $\hat{q}_i$ in gradient descent after each guess provides new information. Based on our analysis (shown in Appendix A), we will initialise the $\hat{q}$ values to $\hat{q}_i = 1/n$ and reset them to this value before each new guess. This means we do not carry forward information from a previous bad estimation.
Guess choice. Once we have generated our estimate of the $\hat{q}$-values, we want to use them to inform our next password guess. We will use what we denote the Q-method for the guess choice decision.
The Q-method uses the predicted $\hat{q}$-values to estimate the probability of seeing each word $k$ in the password set. Suppose, for example, that a word $k$ has probability $p_1(k)$ in wordlist 1 and also occurs in wordlists 2 and 3 with probabilities $p_2(k)$ and $p_3(k)$, respectively. Using Eq. 2, we can compute the total probability of this word occurring in the password set by multiplying the probability of the word in each wordlist by the weighting assigned to that wordlist and summing. So if wordlists 1, 2, 3 are weighted as $q_1, q_2, q_3$, then the probability of password $k$ occurring in a password set made up of these wordlists is
$$Q_k = q_1 \, p_1(k) + q_2 \, p_2(k) + q_3 \, p_3(k).$$
Computing this for each candidate word $k$ determines which $k$ has the highest probability of being in the password set, and we use this word as the next guess.
Gradient descent step-size. Recall that with each gradient descent iteration, we move a step in the direction which maximises our likelihood function. The gradient is scaled by a factor $\alpha$ to determine how far we move towards the perceived maximum. A step size which is too large could overstep the maximum and fail to converge; a step size that is too small means we may never reach the maximum value.
From our analysis in Appendix A, we conclude that the best estimates for the $q$ values are provided using the constant alpha method for determining step size. This method uses a fixed $\alpha$ in every iteration, which results in a reduced step size as we approach the maximum. In particular, $\alpha = 0.1$ is an effective value for our model.
Pseudo-code for our implementation of the multi-armed bandit is shown in Algorithm 1. The full code is also available on GitHub [20].
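The loop described in this section can be sketched in Python as follows. This is a hedged illustration, not the code from the repository cited above: the function names (`estimate_q`, `choose_guess`, `bandit_guess`), the mean-subtraction form of the projection, the exclusion of already guessed words, and the toy wordlists are all assumptions made for the example.

```python
import math


def estimate_q(dists, guesses, successes, n_users, alpha=0.1, max_loops=100):
    """Fit the wordlist weights to the guesses made so far."""
    n = len(dists)
    q = [1.0 / n] * n  # reset to the uniform estimate before every fit
    remaining_users = n_users - sum(successes)
    for _ in range(max_loops):
        Q = [sum(qi * d.get(k, 0.0) for qi, d in zip(q, dists)) for k in guesses]
        rest = 1.0 - sum(Q)
        grad = []
        for d in dists:  # partial derivative of log L with respect to each q_i
            gi = 0.0
            for k, Qk, Nj in zip(guesses, Q, successes):
                if Nj > 0 and Qk > 0.0:
                    gi += Nj * d.get(k, 0.0) / Qk
                if remaining_users > 0 and rest > 0.0:
                    gi -= remaining_users * d.get(k, 0.0) / rest
            grad.append(gi)
        mean = sum(grad) / n
        g = [gi - mean for gi in grad]  # keep the weights summing to one
        norm = math.sqrt(sum(x * x for x in g))
        if norm < 1e-9:
            break
        beta = min(1.0, 1.0 / norm)
        for qi, gi in zip(q, g):
            if gi < 0:
                beta = min(beta, qi / -gi)  # never step below q_i = 0
        q = [qi + alpha * beta * gi for qi, gi in zip(q, g)]
    return q


def choose_guess(dists, q, guessed):
    """Q-method: pick the unguessed word with the highest mixture probability."""
    scores = {}
    for qi, d in zip(q, dists):
        for word, p in d.items():
            if word not in guessed:
                scores[word] = scores.get(word, 0.0) + qi * p
    return max(scores, key=scores.get) if scores else None


def bandit_guess(dists, target_counts, n_guesses):
    """Alternate between fitting the weights and guessing; q is the last fit."""
    n_users = sum(target_counts.values())
    guesses, successes = [], []
    q = [1.0 / len(dists)] * len(dists)
    for _ in range(n_guesses):
        q = estimate_q(dists, guesses, successes, n_users)
        word = choose_guess(dists, q, set(guesses))
        if word is None:
            break
        guesses.append(word)
        successes.append(target_counts.get(word, 0))
    return guesses, successes, q


# Toy example (hypothetical data): the target set is drawn entirely from wordlist A.
wordlist_a = {"a1": 0.5, "a2": 0.3, "a3": 0.2}
wordlist_b = {"b1": 0.4, "b2": 0.3, "b3": 0.3}
target = {"a1": 5, "a2": 3, "a3": 2}
guesses, successes, q_hat = bandit_guess([wordlist_a, wordlist_b], target, 3)
```

In the toy run, the first success already shifts almost all of the weight to wordlist A, so the remaining guesses are taken from it. Note that one guess updates the estimate for every wordlist at once, since the gradient is computed over all of them.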

Dataset source
In this section, we will investigate whether the multi-armed bandit can identify the source of a dataleak, given a set of possible options. This information could be valuable to validate the source of a password leak or to aid a password guesser in tailoring their guesses. We will also show that the multi-armed bandit can adaptively choose between wordlist options to improve guessing success even when we have no information about the password leak source.
To do this, we utilise existing real-world leaked password datasets. The datasets we have chosen are computerbits.ie, hotmail.com, flirtlife.de and 000webhost.com. The 10 most popular passwords in each of these password leaks are shown in Fig. 1.

Algorithm 1 Multi-armed bandit.
Input: wordlists i = 1, ..., n and password set to be guessed.
Output: guesses m = 1, ..., 100 and estimate of wordlist weightings q_i.
while m < 100 do
    ...
    ... stop_q or loops > 100 do
        loops ← loops + 1
        ...
    end while
    Compute the Q score for every word
    next_guess = max_word (Q_k)
    m ← m + 1
end while

Computerbits.ie dataset: N = 1795. In 2009, 1795 users' passwords were leaked from the Irish website Computerbits.ie. The most popular words in this dataset include many Irish-orientated words: dublin, ireland, munster, celtic. Also, the second most popular password for the website Computerbits.ie was "computerbits", reinforcing the idea that the service provider has an impact on the user's choice of password [14]. We make the assumption, based on the website domain and origin, that the dominant nationality of the users in this dataset is Irish.
Hotmail.com dataset: N = 7300. Ten thousand users' passwords from the website Hotmail.com were made public in 2009 when they were uploaded to pastebin.com by an anonymous user. Though it is still unknown, it is suspected that the users were compromised by means of phishing scams [21]. The most popular password in this leak was "123456", which occurred 48 times (0.66% of users chose this password).
Flirtlife.de dataset: N = 98,912. In 2006, over 100,000 passwords were leaked from the German dating site Flirtlife.de. A write-up by Heise online [22], a German security information website, states that the leaked file contained many usernames and passwords with typographic errors. It seems that attackers were harvesting the data during log-in attempts.
After cleaning this data (using the methods specified in [13]), we were left with 98,912 users and 43,838 unique passwords. The passwords in this data are predominantly German and Turkish.
000webhost.com dataset: N = 15,252,206. In 2015, 15 million users' passwords were leaked from 000webhost.com [23]. The attacker exploited a bug in an outdated version of PHP. The passwords were plaintext and created with a composition policy that forced them to be at least 6 characters long and include both letters and numbers. The leaked dataset was cleaned in the same way as in [23]. There are 10 million unique passwords in the dataset. The rank 1 password represents a surprisingly low 0.16% of the users' passwords.
The above datasets were chosen because they are available online for others to replicate this work, they are used regularly within the literature [13,23,24], and because they each contribute interesting characteristics in terms of either user demographics or composition restrictions. Cleaning of the password datasets was completed according to the needs of the dataset; in particular, we ensured that each user only contributed a single password.

Dataset source 1: 100% from Flirtlife.de
Let us take the following scenario: A password guesser or security architect finds a list of passwords online. They suspect the passwords were leaked from a particular location (for example from the flirtlife.de website), but they do not know how to validate it. The investigator can read the passwordset into the multi-armed bandit along with a selection of candidate sources and determine if the suspected source is correct.
To test the effectiveness of such a test, we create a new password set containing 1000 users' passwords sampled without replacement from the flirtlife.de dataset. We read three candidate sources into the multi-armed bandit: the remaining 97,912 users' passwords from Flirtlife, 1795 users' passwords from computerbits.ie and 7300 users' passwords from hotmail.com. We set the multi-armed bandit to guessing the sampled password set. It will make one guess at a time and then compute the estimated weight of each wordlist. If the scheme is effective, it should be able to determine that the sample is best matched to the Flirtlife dataset.
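The hold-out construction described above (a sample drawn without replacement, with the remaining users kept as a candidate wordlist) can be sketched as follows. The function name `split_sample`, the fixed seed and the toy leak are assumptions for illustration only.

```python
import random
from collections import Counter


def split_sample(passwords, sample_size, seed=0):
    """Split one leak into a target sample and a disjoint remainder wordlist."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    order = list(range(len(passwords)))
    rng.shuffle(order)
    chosen = set(order[:sample_size])
    sample = [p for i, p in enumerate(passwords) if i in chosen]
    remainder = [p for i, p in enumerate(passwords) if i not in chosen]
    return sample, remainder


# Toy leak with one password per user (hypothetical data).
leak = ["123456", "schatz", "123456", "hallo", "frankfurt", "passwort"]
sample, wordlist = split_sample(leak, sample_size=2)
```

Splitting by user index rather than by password value means that two users who share a password can land on different sides of the split, which is what allows the remainder to be informative about the sample.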
The results are shown in Fig. 2. By 10 guesses, the multi-armed bandit has assigned a 90% weighting to the Flirtlife dataset, showing that the sample most likely originated from this source. Note that the sample was chosen without replacement, and therefore, the 1000 users in the sample are distinct from the 97,912 users in the Flirtlife dataset. The multi-armed bandit is able to effectively match the sample to the correct source. The multi-armed bandit model outlined in Algorithm 1 also results in high guessing success.

Dataset source 2: 60% 000webhost, 30% Hotmail, 10% Computerbits
We now investigate whether the multi-armed bandit can determine the weightings for a passwordset even when it was created as a combination from multiple sources. To do this, we create password sets from a particular mix of sources.
We create a password set made up of 10,000 users' passwords; 6000 were selected randomly from the 000webhost dataset, 3000 from the hotmail.com dataset and 1000 from the computerbits.ie dataset. In Fig. 3(l), we show the weightings (q-values) that the multi-armed bandit assigns to each of the candidate datasets: 000webhost, Hotmail and Computerbits. Since we created the password set, we know the true weightings are 0.6, 0.3 and 0.1, respectively, and these are shown as the black solid horizontal lines. The multi-armed bandit does a good job of estimating the actual weightings. After just 3 guesses, the multi-armed bandit can accurately estimate the weightings that should be assigned to each of the three wordlists.
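A mixed password set of this kind can be assembled by sampling a fixed number of users from each source. The sketch below is illustrative: `build_mixture`, the seed and the tiny stand-in sources (scaled down from 6000/3000/1000 to 6/3/1) are assumptions, not the authors' tooling.

```python
import random


def build_mixture(sources, sizes, seed=0):
    """Sample sizes[name] passwords without replacement from each source and shuffle."""
    rng = random.Random(seed)
    mixed = []
    for name, size in sizes.items():
        mixed.extend(rng.sample(sources[name], size))
    rng.shuffle(mixed)
    return mixed


# Stand-in sources with labelled entries so the mix can be checked (hypothetical data).
sources = {
    "000webhost": ["w%d" % i for i in range(20)],
    "hotmail": ["h%d" % i for i in range(10)],
    "computerbits": ["c%d" % i for i in range(5)],
}
sizes = {"000webhost": 6, "hotmail": 3, "computerbits": 1}
mixed = build_mixture(sources, sizes)
```

The known proportions (here 0.6, 0.3 and 0.1) then serve as the ground truth against which the estimated weightings can be compared.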
In Fig. 3(r), we plot the number of successful password guesses for the different wordlist guessing options: guessing with each wordlist individually, combining all the wordlists together and using the multi-armed bandit to adaptively choose between the wordlists. Despite 000webhost being the dominant source for the passwords, it is not effective at guessing the passwords efficiently. When we guess solely using the 000webhost.com dictionary, we get lower guessing returns than when we use the multi-armed bandit to adaptively inform our guesses. We believe this is because of the flattened nature of the distribution, which likely results from the composition restrictions that were placed on the passwords. Guessing the top password from the 000webhost wordlist compromises only 0.16% of the users' passwords, whereas guessing the top password in the Hotmail dictionary results in 0.66% of the users being compromised. Also, because the 000webhost passwords are restricted by a composition policy, they do not accurately guess the 40% of passwords in the password set that come from hotmail.com and computerbits.ie. It is interesting that combining all the wordlists into "wordlists combined" does no better than 000webhost.com and significantly worse than the multi-armed bandit method. It is worth noting that the multi-armed bandit guessing would also be influenced by the high ranking of 000webhost passwords and their low guessing success. However, it still performs significantly better because it guesses using information and weightings from all the dictionaries.

Dataset source 3: 60% Flirtlife, 30% Hotmail, 10% Computerbits
To see whether the multi-armed bandit is still effective when the 000webhost.com set is not included, we create a new password set. This time, we take 60% of the passwords from Flirtlife.de, 30% from Hotmail.com and 10% from Computerbits.ie.
In Fig. 4(l), we plot the estimated q-values after the gradient descent was completed for each guess. Again, even after a small number of guesses, we have good predictions for how the password set is distributed between the three wordlists.
In Fig. 4(r), we show the number of users successfully compromised after each new guess. After 100 guesses, the multi-armed bandit method had compromised 795 users, in comparison to the 870 users compromised by guessing the correct password in the correct order for every guess.
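The optimum baseline can be computed from the target set itself, assuming that "the correct password in the correct order" means guessing the target's own most popular passwords in decreasing order of frequency. The helper name `optimal_returns` and the toy data are illustrative; the final line simply expresses the reported 795 users as a share of the reported 870.

```python
from collections import Counter


def optimal_returns(target_passwords, n_guesses):
    """Cumulative users compromised when guessing the target's top passwords in order."""
    curve, total = [], 0
    for _, count in Counter(target_passwords).most_common(n_guesses):
        total += count
        curve.append(total)
    return curve


toy_curve = optimal_returns(["a", "a", "b", "c", "a", "b"], 2)

# Share of the optimum achieved by the multi-armed bandit in this experiment,
# using the two figures reported in the text.
share_of_optimum = 795 / 870
```

On these figures the multi-armed bandit reaches a little over 91% of the optimum after 100 guesses.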

Dataset source 4: passwords from all four sources
Finally, we create a new password set, this time made up of 10,000 users' passwords from all four wordlists: 55% were selected randomly from the hotmail.com dataset, 30% from the flirtlife.de dataset, 10% from the 000webhost dataset and 5% from the computerbits.ie dataset. In Fig. 5(l), we plot the estimated q-values after the gradient descent has completed for each guess. The actual proportions are shown as solid horizontal lines. Within 20 guesses, the multi-armed bandit can estimate the correct weightings. It estimates the correct order after 4 guesses. Figure 5(r) shows the guessing returns for guessing with the individual wordlists, guessing with all the wordlists combined and guessing using the multi-armed bandit.

Dataset unknown source
In the previous simulations, we showed that if we include the source dataset as a wordlist option, the multi-armed bandit can identify it and use this knowledge to improve guessing success. We now investigate whether the multi-armed bandit can still be leveraged to improve guessing success even if there is no obvious link between the wordlists provided and the passwordset.
To investigate this, we use the 2009 rockyou.com password leak which includes 32 million plaintext user credentials. This password set has been frequently used by researchers in the field and therefore allows effective comparison to other works. All four wordlists were used to guess the Rockyou password set: Computerbits, Hotmail, Flirtlife and 000webhost. However, this time, we have no a priori information about a relationship between the passwordset, Rockyou, and these wordlists. Figure 6(l) shows the estimated breakdown of Rockyou between the four wordlists. Hotmail is assigned the highest rating with 000webhost, Flirtlife and Computerbits falling below it respectively. In terms of the breadth of the audience demographic in each of the wordlists, this assessment of the breakdown seems logical. The nationality specific websites such as computerbits.ie and flirtlife.de fall lowest, and 000webhost.com, which enforces composition restrictions, fares slightly worse than hotmail.com.
In Fig. 6(r), we compare using the multi-armed bandit adaptive guessing (solid purple line) to guessing using each wordlist separately. The multi-armed bandit performs well. After 100 guesses, it has compromised 945,371 users (64% optimum, 2.9% total) in comparison to 804,731 (54% optimum, 2.5% total), 703,041 (47% optimum, 2.2% total), 603,783 (41% optimum, 1.9% total) and 64,024 (4.3% optimum, 0.2% total) from Flirtlife, Hotmail, Computerbits and 000webhost, respectively. Notice that the combination guessing strategy follows the distribution of the 000webhost wordlist all the way until guess 59, when it eventually guesses the most popular password, 123456. Simply combining wordlists means that whichever wordlist has the most users in it will be dominant. This gives good evidence to support our suggestion of splitting wordlists out based on their characteristics and adaptively learning which wordlist to choose from, rather than the traditional method of creating one large wordlist for guessing.
Fig. 7 Irish users. Left: q-value estimates for the Irish password set from Computerbits.ie. The multi-armed bandit identifies that the password set is best linked to the Irish users wordlist. Right: guessing success. The multi-armed bandit and the Irish users wordlist offer similar guessing returns, and both are better than using a generic (all users) password set
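The dominance effect of simply pooling wordlists can be illustrated with a toy example. The sketch below is an assumption-laden illustration: `pooled_top` is a hypothetical helper, and the two toy wordlists are invented so that a large list with a flat distribution is combined with a small list that has one very popular password.

```python
from collections import Counter


def pooled_top(wordlists, n):
    """Rank passwords by raw count after concatenating all wordlists."""
    pooled = Counter()
    for passwords in wordlists:
        pooled.update(passwords)
    return [word for word, _ in pooled.most_common(n)]


# Hypothetical data: the large list obeys a letters-and-digits policy.
large = ["abc123x"] * 300 + ["qwe123r"] * 200 + ["zxc987k"] * 5
small = ["123456"] * 40 + ["dublin"] * 10
ranking = pooled_top([large, small], 3)
```

Although "123456" accounts for 80% of the small list, it is only the third guess in the pooled ranking, because raw counts from the larger list come first.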

Language and user nationality
It is well known that user demographics, such as nationality and language, play an important role in users' password choices [13,25,26]. Indeed, this is information that human password guessers might look for when determining their guessing strategies. In this section, we investigate whether the multi-armed bandit can identify these characteristics and leverage them to improve guessing.
Determining the dominant language used in a password set would be a relatively simple task if we have the entire password set as a plaintext list. However, as passwords are hashed and salted or protected by a server, we must instead make a guess, and if it is successful, we can reflect on what language seems to be resulting in the most successes. The multi-armed bandit can help us with this learning exercise.
Clearly, a simple method could tell the difference between, say, Chinese and English passwords. We are interested in the more challenging setting of distinguishing between Irish users' passwords and English users' passwords when the spoken language is the same, or between English and German passwords where both use the Latin alphabet. In this section, we will show that our learning methods are able to identify these subtle distinctions.
The two password sets we will try guessing are the computerbits.ie password set and the flirtlife.de password set.
Computerbits.ie is made up of 1795 Irish users. Flirtlife.de is made up of 98,912 predominantly German and Turkish users. The two wordlists were drawn from the large set of 31 password leak datasets known as Collection #1 [27]. One of these password datasets was selected, and from this, we extracted all the passwords whose corresponding email address contained the country code top-level domain ".ie" and separately ".de". These formed our nationality specific user wordlists from Ireland and Germany with 90,583 and 6,541,691 users, respectively.
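The extraction of nationality-specific wordlists by country code top-level domain can be sketched as below. The on-disk format of the source dataset is not stated here, so the `email:password` line format is an assumption, as are the function name `passwords_for_tld` and the example addresses.

```python
def passwords_for_tld(lines, tld):
    """Collect passwords whose email address ends with the given top-level domain.

    Assumes one "email:password" credential per line; malformed lines are skipped.
    """
    suffix = tld.lower()
    selected = []
    for line in lines:
        email, sep, password = line.rstrip("\n").partition(":")
        if sep and password and email.lower().endswith(suffix):
            selected.append(password)
    return selected


# Hypothetical credentials for illustration only.
lines = [
    "alice@example.ie:dublin1",
    "bob@example.de:schatz",
    "carol@example.com:123456",
    "broken-line",
    "dave@mail.IE:pass:word",
]
irish = passwords_for_tld(lines, ".ie")
german = passwords_for_tld(lines, ".de")
```

Splitting on the first colon only keeps passwords that themselves contain a colon intact.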
Irish passwords. We are interested in whether the multi-armed bandit will match the distribution of the Irish password set computerbits.ie to the extrapolated Irish wordlist taken from the subset of Collection #1 (denoted "Irish users" from now on).
In Fig. 7(l), we included three wordlists: the hotmail.com leaked passwords, the flirtlife.de password set and the Irish users. Hotmail.com is an international website. However, it is suspected that the Hotmail users in the dataset we have were compromised by means of phishing scams aimed at the Latino community. Flirtlife is a dating site with predominantly German and Turkish users. Figure 7(l) plots the breakdown estimated by the multi-armed bandit. From the first guess, it estimates that the passwords in the computerbits.ie set match closely to the passwords in the Irish users wordlist. Notice that some weighting is assigned to the Hotmail wordlist but essentially none to the flirtlife.de password set.
In Fig. 7(r), we guess the passwords in the Irish computerbits.ie password set. The black line shows the returns for an optimum first 100 guesses. We also guess them using the order and passwords from the full Collection #1 password set that the Irish users were chosen from. We label this full dataset "all users". We made 100 guesses against the 1795 users in the Computerbits password set. The top 100 most popular words were chosen in order from each wordlist. The wordlist composed of only Irish users performed better at guessing than the wordlist with all users' passwords in it. We also include the guessing success for our multi-armed bandit model. It performs as well as guessing using the Irish users set, showing that it was able to quickly learn the nationality and adapt its guessing accordingly.
[Figure caption, truncated: "... guessing success. The multi-armed bandit offers significantly better guessing returns than guessing using the German users password set or the all users password set."]
German passwords
We now try to guess the flirtlife.de password set using the wordlist of German users. While flirtlife.de is a German dating site, its main users were both German and Turkish.
In Fig. 8(l), the multi-armed bandit does not link the Flirtlife passwords to the German users wordlist until after the high frequency passwords, up to 50, have been guessed. However, in Fig. 8(r), the guessing success is still highest for the multi-armed bandit. The next best option is the German users wordlist, followed by guessing using all users' passwords. This indicates that while German passwords do feature strongly, German is not the only nationality featuring in the password set. Recall that Flirtlife is made up of users from two nationalities and languages: German and Turkish. The multi-armed bandit in this case is the best option for guessing as it takes into account guesses from all wordlists and adaptively chooses between them.

Conclusion
This research demonstrates that an automated password guesser can learn characteristics of a password set with each guess made and that it can leverage this information to improve guessing success.
We have shown that a multi-armed bandit model can adaptively choose between different wordlists to improve guessing success. We have also demonstrated that characteristics such as dataset source, language and nationality can be inferred from a leaked password set in an automated way using this multi-armed bandit (MAB) technique. Our MAB learning algorithm refines its estimate of the distribution of the password set it is guessing with every guess made. Importantly, it requires no a priori training. In many previous wordlist approaches, a single ordered wordlist is created. In our method, wordlists are separated based on their source or characteristics. In our examples, the separation of wordlists consistently improves guessing success over using a single ordered wordlist. This adaptive learning model demonstrates that a password guesser can learn about a password set with each guess made and emphasises the effectiveness of dynamic real-time analysis of guessing returns.
Knowing the potential of this guessing model is useful for both users and organisations. It provides evidence for the importance of guiding users away from passwords which reflect characteristics associated with demographic or website specific terms. It also demonstrates that password choices differ measurably depending on their source. This suggests that websites could consider tailored blocklisting techniques. In particular, websites that have experienced previous password leaks could restrict future users from using passwords which occurred with a high frequency in that leak.
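One way a site might act on this suggestion is sketched below. It is a hypothetical illustration only: the function names, the `top_n` cut-off and the case-insensitive comparison are assumptions, and the sample leak is made up.

```python
from collections import Counter

def build_blocklist(leaked_passwords, top_n=100):
    """Collect the top_n most frequent passwords from a site's own earlier leak."""
    # Assumption: compare case-insensitively so "Password" and "password" match.
    counts = Counter(p.lower() for p in leaked_passwords)
    return {password for password, _ in counts.most_common(top_n)}

def is_allowed(candidate, blocklist):
    """Reject a new password if it was among the frequent leaked ones."""
    return candidate.lower() not in blocklist

# Made-up leak, for illustration only.
leak = ["computerbits", "computerbits", "Computerbits",
        "ireland", "ireland", "x9!kQ"]
blocklist = build_blocklist(leak, top_n=2)
```

Because the blocklist is drawn from the site's own leak, it captures site-specific terms that a generic list of common passwords could miss.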

Appendix A: Multi-armed bandit design
The multi-armed bandit problem involves a number of design choices. The key variables are the initial $q_i$ values, the choice of word to guess and the step size to take in the gradient descent. In this appendix, we will describe the analysis which helped us choose an optimum set-up for each of these factors.

A.1 Gradient descent initialisation
We expect the gradient descent to improve with each guess made since every guess provides it with more information.
There are a number of ways of initialising the $\hat{q}_i$ in gradient descent after each guess provides new information. The following are three different methods for choosing the initialisation value:
Random: Randomly pick starting values for $\hat{q}_i$, subject to Eq. 1.
Average: Choose the average starting value, i.e. assume the passwords are uniformly distributed between the n wordlists, so $\hat{q}_i = 1/n$.
Best: Use our previous best estimate for the $\hat{q}$-values, based on the gradient descent results for the previous guess.
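The three initialisation choices can be sketched as below. This is an illustrative sketch rather than the authors' code: the function name `init_q` is hypothetical, Eq. 1 is taken to be the probability simplex constraint (non-negative values summing to one), and the Random option uses normalised exponential draws, which is one of several reasonable ways to sample the simplex.

```python
import random

def init_q(method, n, previous=None, rng=random):
    """Return n starting q-values on the probability simplex."""
    if method == "random":
        # Normalised exponential draws give a uniform sample on the simplex.
        draws = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(draws)
        return [d / total for d in draws]
    if method == "average":
        # Assume the password set is spread evenly over the n wordlists.
        return [1.0 / n] * n
    if method == "best":
        # Reuse the estimate from the previous guess; before any guess has
        # been made there is none, so fall back to the average (an assumption).
        return list(previous) if previous is not None else [1.0 / n] * n
    raise ValueError(f"unknown initialisation method: {method}")

q_random = init_q("random", 3, rng=random.Random(0))
q_average = init_q("average", 3)
q_best = init_q("best", 3, previous=[0.6, 0.3, 0.1])
```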

A.2 Guess choices
Once we have generated our estimate of the $\hat{q}$-values, we want to use them to inform our next guessed password. We suggest three options for how to choose our next guess:
Random: Randomly choose a wordlist and guess the next most popular password in that wordlist.
Best wordlist: Guess the next most popular password from the wordlist with the highest corresponding $\hat{q}$-value.
Q-method: Use information from the $\hat{q}$-values combined with the frequencies of the passwords in the wordlists to inform our next guess (as shown in Alg. 1).
These options have different advantages. In the first option, we randomly choose a wordlist to guess from, but we are still taking the most probable guess from the wordlist we choose. This option emphasises the continued exploration of all the wordlists. In the second option, we are choosing the wordlist we believe accounts for the largest proportion of the password set.
The last option bases password guess choices specifically on Eq. 2. It uses the predicted $\hat{q}$-values to estimate the probability of seeing each word k in the password set. Suppose, for example, that a word k has probability $p_1(k)$ in wordlist 1 and also occurs in wordlists 2 and 3 with probabilities $p_2(k)$ and $p_3(k)$, respectively. Using Eq. 2, we can compute the total probability of this word occurring in the password set by multiplying the probability of the word occurring in each wordlist by the weighting assigned to that wordlist and summing the results. So if wordlists 1, 2, 3 are weighted as $q_1$, $q_2$, $q_3$, then the probability of password k occurring in a password set made up of these wordlists is $p(k) = q_1 p_1(k) + q_2 p_2(k) + q_3 p_3(k)$. Computing this for each candidate word k determines which k has the highest probability of being in the password set, and we use this word as the next guess.
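The three guess choices can be compared on a toy example. The sketch below is not Alg. 1 itself: it assumes each wordlist is a dictionary mapping a password to its relative frequency in that wordlist, and the function names, passwords and weights are made up for illustration.

```python
import random

def next_in_list(wordlist, guessed):
    """Most probable password in one wordlist that has not been guessed yet."""
    remaining = {k: p for k, p in wordlist.items() if k not in guessed}
    return max(remaining, key=remaining.get) if remaining else None

def guess_random(wordlists, guessed, rng=random):
    """Random option: pick any wordlist, then take its next most popular word."""
    return next_in_list(wordlists[rng.randrange(len(wordlists))], guessed)

def guess_best_wordlist(wordlists, q, guessed):
    """Best wordlist option: only consult the wordlist with the largest q-value."""
    best = max(range(len(q)), key=q.__getitem__)
    return next_in_list(wordlists[best], guessed)

def guess_q_method(wordlists, q, guessed):
    """Q-method: score each word k by p(k) = sum_i q_i * p_i(k), take the maximum."""
    scores = {}
    for q_i, wordlist in zip(q, wordlists):
        for k, p in wordlist.items():
            if k not in guessed:
                scores[k] = scores.get(k, 0.0) + q_i * p
    return max(scores, key=scores.get) if scores else None

wordlists = [
    {"123456": 0.5, "password": 0.3, "dublin": 0.2},
    {"password": 0.6, "berlin": 0.4},
    {"password": 0.5, "123456": 0.5},
]
q = [0.5, 0.3, 0.2]

first_best = guess_best_wordlist(wordlists, q, set())
first_q = guess_q_method(wordlists, q, set())
second_q = guess_q_method(wordlists, q, {"password"})
```

Here the Best wordlist option guesses "123456", the top word of the highest-weighted list, while the Q-method guesses "password", because its combined weight 0.5·0.3 + 0.3·0.6 + 0.2·0.5 = 0.43 exceeds the 0.35 of "123456".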

A.3 Evaluation of initialisation and guess choices
In Fig. 9, we compare the nine combinations of initialisation and guess choice. We do this by creating a simulated password set made up of three wordlists with the following proportions: 60%, 30% and 10%. The black lines show the true proportions, and the plotted points show the multi-armed bandit's estimates of these proportions after each guess.
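A simulated password set of this kind could be generated as sketched below. How the authors built theirs is not stated here, so the procedure is an assumption: each simulated user is assigned a source wordlist according to the target proportions and then draws a password from that wordlist according to its frequencies. The function name and toy wordlists are hypothetical.

```python
import random

def simulate_password_set(wordlists, proportions, size, seed=0):
    """Draw `size` passwords from a mixture of wordlists with given proportions."""
    rng = random.Random(seed)
    # Pre-split each wordlist into parallel lists of words and weights.
    tables = [(list(wl), list(wl.values())) for wl in wordlists]
    # Choose the source wordlist of every simulated user.
    sources = rng.choices(range(len(wordlists)), weights=proportions, k=size)
    passwords = []
    for i in sources:
        words, weights = tables[i]
        passwords.append(rng.choices(words, weights=weights, k=1)[0])
    return passwords, sources

# Toy wordlists mapping password -> relative frequency (made up).
toy = [
    {"a1": 0.7, "a2": 0.3},
    {"b1": 0.5, "b2": 0.5},
    {"c1": 1.0},
]
passwords, sources = simulate_password_set(toy, [0.6, 0.3, 0.1], size=10000)
shares = [sources.count(i) / len(sources) for i in range(3)]
```

The observed `shares` are the ground truth against which the bandit's estimated proportions would be compared after each guess.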

A.3.1 Guess choice
We can see that the Best wordlist method for choosing guesses is not effective for determining the characteristics of the password set. Where there is large overlap between the passwords in the wordlists, the Best wordlist method will provide information about the distribution of each dictionary. However, when this is not the case, we do not learn about the relationship between the password set and all the wordlists. The Random dictionary and Q-method seem to perform broadly similarly in the tests above.

A.3.2 Initialisation
When the number of guesses is less than the number of wordlists, the likelihood function can be degenerate. This means multiple combinations of q-values can maximise the likelihood function. This is not as much of an issue in the Average and Random initialisation methods but in the Best qs initialisation method we are seeding the next guess with the best q-values from our previous guess. Since we are constrained by the probability simplex defined by Eq. 1, movement to leave this initial approximation can be constrained. For example, see Fig. 10 depicting a 3-dictionary probability simplex which shows q in a corner of the simplex: q-values in a corner position are limited in the valid directions they can move in. This creates the potential to become "stuck". When the Best Wordlist method is used for guess choice, the chance of becoming "stuck" is compounded by limiting the information gathered about any password set other than the one estimated as the best. A simple solution is to reset the q-values before each descent as is done in the Average values method. A more complex solution could involve avoiding seeding guesses until a non-degenerate likelihood function can be derived.
Notice the spikes in two of the graphs in Fig. 9 which show initialisation with Random q-values. These spikes represent a failure to converge when the randomly chosen initial $\hat{q}$-values are far from the true values. In our other simulated password sets, these spikes were also present in the graphs that used random qs for initialisation and a random dictionary for guessing. For this reason, we will avoid using the random initialisation method when determining the make-up of a password set.

A.3.3 Initialisation and guess choice conclusion
Based on this analysis, we find that the results from the models initialised using both Random q-values and previous Best q-values are not reliable. In addition, when guesses are chosen from the Best wordlist only, the model does not fare well at approximating the q-values. The Average q initialisation method paired with either the Random dictionary guess choice or the Q-method guess choice performs consistently well at approximating characteristics. One advantage of the Q-method over the random dictionary choice is that it is deterministic.

A.4 Choice of gradient descent step size
Recall that in each gradient descent iteration, we move a step in the direction which maximises our likelihood function. The gradient g is scaled by a factor α to determine how far we move towards the perceived maximum. A step size which is too large could fail to converge to the maximum and instead overstep it; a step size that is too small means we may never reach the maximum value. There are a number of options for choosing this step size; three popular options are listed below:
Constant alpha: Let α be a small constant, for example 0.1. This set-up means that the step size will become smaller as we converge towards the maximum.
Constant step size: Keep the step size constant by choosing an α value that depends on the size of the gradient, $\alpha = c / \lVert g \rVert$ for a fixed step length c. A downside of this fixed step size is that when we get close to the maximum, we could take a fixed size step past it.
Adaptive: The step size can be adapted based on the perceived gains. For example, if the likelihood function is increasing, then we might try increasing α on the next step, whereas if the movement will make the function decrease, then we can scale back α until we get a value which will not cause a decrease. This option means that the function will not decrease on any step. However, it is important to verify that we remain within the probability simplex with each adaptation of α.
Each method has advantages and disadvantages, and it is important to choose a method that works for a given set-up of the multi-armed bandit. We implemented these three options in four ways: (1) constant alpha, (2) constant step size, (3) starting with constant alpha and then adapting step size to maximise likelihood increase or avoid likelihood decrease and (4) starting with constant step size and then adapting step size to maximise likelihood increase or avoid likelihood decrease.
Recall that the adaptation involves increasing the step size if doing so will result in an increased likelihood value and, conversely, decreasing the step size incrementally until the resulting likelihood is an improvement on the last likelihood. Thus, we can guarantee that each step results in an increase to the likelihood function.
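The step size rules can be sketched on a toy problem. The code below is illustrative only: it maximises a simple quadratic stand-in with its peak at (0.6, 0.3, 0.1) rather than the paper's likelihood function, it omits the check that the iterate stays inside the probability simplex, and the doubling and halving factors of the adaptive rule are assumptions.

```python
import math

def step_to(x, g, a):
    """Move from x along the gradient g, scaled by a."""
    return [xi + a * gi for xi, gi in zip(x, g)]

def ascend(f, grad, x, rule, alpha=0.1, step=0.1, iters=100):
    """Maximise f from x using one of three step size rules."""
    for _ in range(iters):
        g = grad(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break  # stationary point reached
        if rule == "constant_alpha":
            a = alpha              # step length alpha * ||g|| shrinks near the maximum
        elif rule == "constant_step":
            a = step / norm        # step length is always `step`
        elif rule == "adaptive":
            a = alpha
            # Scale alpha back until the move no longer decreases f.
            while f(step_to(x, g, a)) < f(x) and a > 1e-12:
                a /= 2.0
            alpha = 2.0 * a        # try a larger alpha on the next step
        else:
            raise ValueError(f"unknown rule: {rule}")
        x = step_to(x, g, a)
    return x

# Toy concave objective (a stand-in, not the paper's likelihood).
target = [0.6, 0.3, 0.1]
f = lambda x: -sum((xi - ti) ** 2 for xi, ti in zip(x, target))
grad = lambda x: [-2.0 * (xi - ti) for xi, ti in zip(x, target)]
dist = lambda x: math.sqrt(-f(x))

start = [1.0 / 3.0] * 3
results = {rule: ascend(f, grad, start, rule)
           for rule in ("constant_alpha", "constant_step", "adaptive")}
```

On this toy objective, the constant alpha and adaptive rules converge to the maximum, while the constant step size rule ends up stepping back and forth across it, which is the downside noted above.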

A.4.1 Evaluation of step size
Our goal is to achieve an accurate prediction of the q-values for a selection of wordlists to help us determine the characteristics of the password set. We therefore compare how effectively the step size options estimate the q-values. We can do this in two ways: by looking directly at the q-values and visually comparing them against the actual values, or by plotting the distance of the q-values from the actual values and analysing this graph. Below, we do both, and both indicate that the constant alpha approach seems to work best with our model.
In the previous section, we determined that the Average qs & Random dictionary setup was a reliable set-up for estimating the q values. Therefore, in Fig. 11, we use the Average qs & Random dictionary set-up of the model and then test each of the four step size options. We can see that constant alpha seems to give the best approximation of the true q-values for this set-up of the initialisation and guess choice. Both adaptive options give a poor estimation, and the constant step size option performs only slightly worse than constant alpha.
Next, we will use the L1 norm to assess the estimates in a wider variety of conditions. We can use the L1 norm to measure the distance between the actual and estimated q-values at each guessing point. The result is the sum of the absolute differences between the actual and estimated q for each wordlist.
Fig. 11 Dataset source 3: q-value estimates. Shown for each combination of gradient descent step size methods. The black lines show the true values of $q_1$, $q_2$, $q_3$, and the points show our estimated $\hat{q}_1$, $\hat{q}_2$, $\hat{q}_3$ after each guess is made
Fig. 12 L1 norms shown for each gradient descent step size method. The L1 norm shows the distance the estimated $\hat{q}$ value is away from the actual q value. We want this difference to be small. The subplots show these values in more detail for guesses 30-40
Figure 12 compares the L1 norms for the four step methods. In Fig. 12a, we see the L1 norms for the constant alpha method. All of the Q-method approximations have L1 norms consistently close to zero after an initial peak within the first 3 guesses. We see spikes in the Random qs & Best wordlist plot, and the other two Best wordlist options perform poorly. Random dictionary choice performs well with an L1 norm generally less than 0.1. Figure 12b shows the L1 norms for the constant step size method. We see regular spikes in the Random qs & Best wordlist method. There is more variation in the Q-method results. The L1 norm for the Best qs & Best wordlist method never converges to zero.
The general impression from the two adaptive plots (Fig. 12c and d) is that they are inconsistent in their estimation. We see spikes in all the Random q initialisation methods. The constant step size adapting method reports reasonably small L1 values for Average q initialisation with Q-method guess choice and for Best q & Random dictionary choice.
The constant alpha adapting method only shows low L1 values for Average q initialisation and Q-method guess choice.
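The L1 distance used in this comparison is simple to compute. The sketch below assumes the true and estimated q-values are given as equal-length lists; the function name and the sequence of estimates are made up for illustration and are not results from the experiments.

```python
def l1_distance(q_true, q_est):
    """Sum of absolute differences between true and estimated q-values."""
    if len(q_true) != len(q_est):
        raise ValueError("q_true and q_est must have the same length")
    return sum(abs(t - e) for t, e in zip(q_true, q_est))

true_q = [0.6, 0.3, 0.1]
# Hypothetical estimates after three successive guesses.
estimates = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.5, 0.35, 0.15],
    [0.6, 0.3, 0.1],
]
trace = [l1_distance(true_q, q) for q in estimates]
```

Plotting such a trace against the guess number gives a curve of the kind shown in Fig. 12: a value near zero means the estimated proportions match the true ones.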

A.4.2 Step size variable conclusion
From this analysis, we can conclude that the best estimates for the q-values are provided using the constant alpha method for determining step size. This method uses a fixed α in each iteration, which results in a reduced step size as we approach the maximum. For the graphs shown above, we used the constant value α = 0.1. We also tested the values 0.5 and 0.01 and determined, based on L1 norm results, that the α = 0.1 value worked best for our model.
In this appendix, we have covered in detail the variable choices that we considered in our multi-armed bandit model. In the paper, we use the variables chosen above. Where possible, we plot the estimate and actual q-values to demonstrate that the model is working as intended and converging to the correct maximum likelihood.