Real-Time Detection of Dictionary DGA Network Traffic using Deep Learning

Botnets and malware continue to avoid detection by static rules engines when using domain generation algorithms (DGAs) for callouts to unique, dynamically generated web addresses. Common DGA detection techniques fail to reliably detect DGA variants that combine random dictionary words to create domain names that closely mirror legitimate domains. To combat this, we created a novel hybrid neural network, Bilbo the "bagging" model, that analyses domains and scores the likelihood they are generated by such algorithms and therefore are potentially malicious. Bilbo is the first parallel usage of a convolutional neural network (CNN) and a long short-term memory (LSTM) network for DGA detection. Our unique architecture is found to be the most consistent in performance in terms of AUC, F1 score, and accuracy when generalising across different dictionary DGA classification tasks compared to current state-of-the-art deep learning architectures. We validate using reverse-engineered dictionary DGA domains and detail our real-time implementation strategy for scoring real-world network logs within a large financial enterprise. In four hours of actual network traffic, the model discovered at least five potential command-and-control networks that commercial vendor tools did not flag.


INTRODUCTION
Malware continues to pose a serious threat to individuals and corporations alike [33]. Typical attack methods such as viruses, phishing emails, and worms attempt to retrieve private user data, destroy systems, or start unwanted programs. The majority of these attacks may be launched through the network [31], posing a major threat to any Internet-facing device. Some malware reaches out to a command and control (C&C) centre hosted behind domains generated by an algorithm (DGA domains) after it infiltrates the target system to receive further instructions. Identification of such domains in network traffic allows for the detection of malware-infected machines.
A single active DGA has been seen generating up to a few hundred domains per day [33]. At scale within a company, this is infeasible for a human analyst to triage amidst the thousands of benign domains occurring simultaneously. Automated detection systems are developing, but sightings of DGAs in worms, botnets, and other malicious settings are growing [1].
However, DGAs that combine random words from a dictionary like "milkdustbadliterally[.]com", "couragenearest[.]net", and "boredlaptopattorney[.]ru" [13] are meaningfully harder for humans to detect (see Table 1 for comparison). In this paper, we will refer to this type of DGA as a dictionary DGA and focus on those using dictionaries composed of English words.
Common defences against malicious DGA domains include blacklists [23,26], random forest classifiers [3,50,55], and clustering techniques [8,32]. When the lists are well maintained and the features are chosen carefully, these methods have acceptable efficacy. However, both blacklists and these models possess serious limitations: relying on hand-picked features which are time-consuming to develop, lacking the ability to generalise with the few manual features implemented, and requiring continuous expert maintenance. More comprehensive tactics are necessary to detect incessant new DGAs stemming from network-based malware.
Recent innovations using deep learning have achieved state-of-the-art accuracy on DGA detection. Such models are highly flexible, with proven success in complex language problems. They do not require hand-crafted features that are time-intensive to make and easy to evade. Woodbridge et al. [50] were the first to present a long short-term memory (LSTM) network for DGA classification. Other architectures were later applied, such as further variations on an LSTM [4,28,46,48,55], a convolutional neural network (CNN) [38,58], and a hybrid CNN-LSTM model [56]. Although successful for random-character DGA domains, these classifiers have largely been ineffective in identifying dictionary DGA domains. These models also perform well on their various testing sets, but their performance can suffer when attempting to generalise to new DGA families or new versions of previously seen families.
Against this background, we present a novel deep learning model for dictionary DGA detection. This advances the state of the art in the following ways. First, we present the first use of a parallel CNN and LSTM hybrid for DGA detection, specifically applied to dictionary DGA detection. The model is trained on standard large-scale datasets of reverse-engineered dictionary DGA domains. It achieves the most consistent success at dictionary DGA classification amongst state-of-the-art deep learning architectures for classification, generalisability, and time-based resiliency. Second, we detail our insights into dictionary DGA domains' inter-relationships and their resulting effect on the generalisability of models. Third, we validate our model on live network traffic in a large financial institution. In four hours of logs, it discovered five potential C&C networks that commercial vendor tools did not flag. Finally, we detail our scalable implementation strategy within the security context of a corporation for real-time analysis.

BACKGROUND
An ever-growing amount of malware relies on communication with C&C channels to receive instructions and system-specific code [33]. The destination (domain or IP address) of this channel can be hardcoded in the malware itself, making its location discoverable via reverse engineering or straightforward log aggregation techniques. Once known, this domain or IP address can be blacklisted, rendering the malware inert. To avoid this single point of failure, malware authors employ domain fluxing, in which the destination of the C&C communication changes systematically as the attacker registers new domains to the C&C hub.
The key to malware domain fluxing is the use of unique and likely unregistered domains that are known to the attacker but can blend in to regular traffic. To accomplish this, malware families employ domain generation algorithms (DGAs) to create pseudo-random domains for use in communication. These domains are used for short periods of time and then phased out for newly-generated domains; this quick turnover means that manual techniques are not effective. Additionally, reverse engineering these algorithms may be slow or impossible if the malware is encrypted. For the vast majority of malware samples, traffic related to malicious activity is present in networks weeks or months before the malware is analysed and blacklisted [26].
To prevent DGA-based malware from exfiltrating, disabling, or tampering with assets, institutions must detect malicious traffic as soon as possible. Throughout this paper we will discuss our solution while keeping in mind that it must be practical, operating in real time and enriching contextual data within true threat environments.

Domain Generation Algorithms (DGAs)
DGA usage spans a variety of cases, from benign resource generation to phishing campaigns and the management of botnets, groups of machines that have been infected by malware, such as Kraken [37], Conficker [34,35], Murofet [42], and others [52]. The goal of all DGAs is to generate domains that do not already exist and, for malicious cases, will not be flagged by vendor security tools or analysts. To accomplish this, DGA authors typically use either a character-based or a dictionary-based pseudo-random assembly process to form domains.
Each method has benefits and downfalls. Character-based DGA domains are more likely to not be registered. But to a human security analyst, gibberish domains made from character-based DGAs stand out from human-crafted domains due to their phonetic implausibility and lack of known words within them. There is a visible unique pattern underlying character DGA domains, such as "lrluiologistbikerepil", that dictionary DGA domains, like "recordkidneyestablishmen", do not follow. Dictionary DGA domains are more challenging to detect when scanning logs because they are pronounceable, contain known words, and mirror the character distribution of legitimate English domains [7]. See similarities between known dictionary DGA domains and benign domains in Table 1.
DGA detection systems have been implemented to assist in highlighting DGA domains for further investigation. These have largely been tailored towards character-based DGAs. Character-based DGAs are more common: of 43 known reverse-engineered DGAs available in DGArchive [13], 40 of them use a seed to pseudorandomly assemble characters or a word surrounded by random characters to form a domain name.
Most methods for generic DGA analysis still struggle to identify dictionary-based DGA malware families because they classify all DGAs rather than focusing on specific algorithms. This paper will focus on classifying the largest available sets of known dictionary DGA domains: gozi [25], matsnu [44], and suppobox [14]. Each varies in the dictionary-based domain generation tactic, the length of the domain, and the dictionary corpus. These dictionary DGA families are often undetected by methods proposed in prior research aimed at general DGA detection because of the large number of families available for other types of DGAs. By targeting where others are weaker, our model can provide greater coverage when used in conjunction with generic DGA models and other contextual information for increased confidence in identification.
Much of prior DGA research has involved making lookups into historical or related domain name server (DNS) records. Such methods often rely on signals attained from Non-Existent Domain (NXDomain) responses when unregistered domains are queried. Since DGAs often generate hundreds of domains per day and at most only a few of those domains are actually registered by the attacker, large numbers of these requests result in NXDomains. Many NXDomain responses from the same computer are unlikely to result from expected user behaviour, and thus this pattern of DNS traffic can be associated with DGA activity [8,22,52].
However, such queries within high-volume DNS log data can be prohibitively slow and unsuitable for real-time decision-making needed to reduce the risk of compromise. It is for this reason that our model considers limited data, only the domain name, rather than all of the potential fields given through standard network logging. We also only use open source datasets rather than restricted NXDomain lists for reproducibility and to provide an accessible starting point for others looking to tailor this system to their own environment.

Related Work
Defensive tactics began analysing network logs with statistical or manually selected features, instead of static blacklisting or rules, when maintaining those lists and rules became overwhelming. Unsupervised probabilistic filtering [36] and random forest models [3,39] were some of the leading systems for detecting DGAs.
Later techniques included more contextual information, which improved the longevity of detection systems. Clustering [51,52,59], Hidden Markov Models (HMMs) [8], random forest models [40,47,53], and sequential hypothesis testing [22] used data such as WHOIS or NXDomain responses with the domain to identify DGAs. However, a number of these techniques require batches of live data to maintain relevancy or high volumes of data which are not typically feasible in real-time environments.
Deep learning first addressed DGA detection with work by Woodbridge et al. [50], an implementation of an LSTM used for nonspecific DGA analysis. Their experiments show that their deep learning approach, an LSTM network, outperforms a character-level HMM and a random forest model that utilise features such as the entropy of character distribution. Their analysis and implementation led to a large success for identifying most DGA families; however, their LSTM did not score highly on suppobox or matsnu, dictionary DGA families.
Since then others have joined the field, implementing a variety of deep learning models. Several took the LSTM model from Woodbridge et al. and provided improvements. Tran et al. [46] took the native class imbalance of DGA data into account. Others updated the training data with other known DGA datasets [4,55] or added more contextual information to the score [11]. Another altered the original architecture of their LSTM to a bi-directional LSTM layer [28], demonstrating the potential enhancements of changing the model's architecture.
When a CNN was applied to text classification [18,19,57] and showed success over an LSTM on some tasks [54,58], it was eventually applied to malicious URL analysis [38]. Other approaches to this problem include a Generative Adversarial Network (GAN), showing that the arms race for DGA detection could advance on its own [7]. Recent work combining CNN convolutions and LSTM temporal processing into new sequential hybrid models has also been brought to this problem [10,20,30,56]. Other comparative works have been published attempting to finalise which model is the best for DGA detection [9,12,29,43,55,56]. Their evaluations state that deep learning maintains greater success over random forest models trained using manually-selected features, but do not consider the greater context of the model's deployment or implementation environment. Our research picks up this work, systematically evaluating deep learning architectures to specifically target where most DGA detection systems consistently underperform: dictionary DGAs.
Koh et al. were one of the first to train deep learning to specifically target dictionary DGA domains [21]. Utilising a pre-trained embedding for the words within the domain, they trained an LSTM both on single-DGA and multiple-DGA data sets. While their results set the bar for dictionary DGA detection, their model had severe limitations from its context-sensitive word embedding on what it could learn and they did not use all available data during training and testing. Another related work on dictionary DGA detection is WordGraph from Pereira et al. [32]. They take large batches of NXDomains and the longest common substring (LCS) of every pair within the set, connecting any co-occurring LCS within a single domain name to construct their WordGraph. The dictionary DGA domains are shown to cluster whereas benign domains have no discernible pattern, and the approach is shown to generalise over changes to the DGA's dictionary. A random forest classifier is trained on the patterns between domains to identify dictionary DGA patterns. This method shows promise at adapting to different DGAs. However, it is too computationally intensive for many systems to support for only domain name analysis.

Real-Time Deployment Environment
Within a large corporation with thousands of employees, security tools struggle to assist analysts attempting to monitor corporate assets. Analysts investigating anomalous activity use a variety of filters to limit the data they need to consider before finalising a verdict on any given activity. We assume other filters for response type, network protocol, NXDomain results, proxy labels, etc. are also included. Scores from a model for dictionary DGA detection would be added into the system for analysts to include whichever additional information they deem necessary.
Much like the work by Kumar et al. [24] and Vinayakumar et al. [48], we aim to not only address this cyber security issue with text classification techniques, but also the greater system in which the model would be deployed. Prior systems consider the various model performance metrics on common data sets as well as the real-world generalisability, response time, and scalability of their chosen model when scoring domains in real time. We extend their work to new controlled tests and describe deploying detection systems within a corporate environment.

BILBO THE "BAGGING" HYBRID MODEL
We present a new deep learning model to deploy for real-time dictionary DGA classification. As mentioned before, deep learning architectures are capable of learning variations to dictionaries and DGAs, with the added benefit of training quickly. There have been many deep learning architectures published for this task for state-of-the-art comparison.
Since we can treat domains as sequences of characters, LSTM models are a natural fit for classifying DGA domains. LSTM nodes make decisions about one element in the sequence based on what they have seen earlier in the sequence. Thus, LSTM nodes learn parameters that are shared across the elements of the sequence. This parameter sharing allows LSTMs to scale to handle much longer sequences than would be practical for traditional feedforward neural networks [16]. For example, an LSTM neuron might recall that it has seen seven vowels in a nine-character domain, making it unlikely that the domain is made up of natural English text. This sequential specialisation of LSTMs attracted us initially, but we found it alone could not generalise to new dictionary DGAs as well as other architectures.
Others have applied CNNs in various forms since their use for URL analysis by Saxe et al. [38]. Convolutional neural networks (CNNs) were designed to handle information that is in a grid format, such as pixels in an image matrix. By treating text as a one-dimensional grid of letters, CNNs were shown to have excellent results for natural language tasks [19,57]. We translate domain names to arrays of characters, allowing the CNN to examine local relationships between characters via a sliding window, thus grouping characters together into words. For example, the domain "facebook" can be broken down into four-character windows: "face", "aceb", "cebo", "eboo", and "book". By dividing character arrays into smaller, related parts in this manner, CNNs demonstrated success on URL classification tasks [38].
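The sliding-window decomposition described above can be illustrated with a minimal Python sketch. The function name `char_windows` and its `stride` parameter are introduced here for illustration only; a convolutional layer learns filters over such windows rather than enumerating the substrings explicitly.

```python
def char_windows(domain: str, size: int, stride: int = 1) -> list[str]:
    """Return every contiguous `size`-character window of `domain`.

    Hypothetical helper: it only mimics the receptive fields that a
    one-dimensional convolution of width `size` would slide over.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    last_start = len(domain) - size
    return [domain[i:i + size] for i in range(0, last_start + 1, stride)]


windows = char_windows("facebook", 4)
print(windows)  # ['face', 'aceb', 'cebo', 'eboo', 'book']
```

A domain shorter than the window yields an empty list, which is one reason fixed-length padded inputs are convenient for convolutional models.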
When multiple models perform well on the same task, many practitioners have combined models or model architectures to enhance the various benefits they individually provide. The most common technique is to combine pre-trained models to form an ensemble model, where each individual model produces a score and these scores are combined in some way to produce a new score. In this context we could train a general DGA classifier that combines one model trained to classify character DGA domains and another trained to classify dictionary DGA domains. The benefit of combining both models is dependent on how they are combined and how it decides which model to "trust" for its final decision without the context of how they were developed.
Hybrid models are similar to ensemble models, but rather than taking the individual score from each component, a hybrid model combines the architectures before the extracted features are reduced to a single score. These models are trained as a single end-to-end model. A hybrid architecture allows the model to learn which combinations of features of the input are significant indicators for accurate classification. Most common hybrid models combine architectures by stacking them in different ways. For instance, a CNN's convolutional layer can be used to extract features that are then fed into an LSTM layer [10,20,30,49,56].
Our novel hybrid model, as seen in Figure 1, processes domain names via an LSTM layer and a CNN layer in parallel. The outputs of these two architectures are then aggregated or "bagged" by a single-layer ANN. This "bagging" is a vital opportunity for this model to discern which parts of the captured information from the LSTM and CNN assist the best when labelling dictionary DGA and benign domains. Inserting an ANN instead of a single function increases the potential optimisation of the "bagging". Because of the importance of this piece in the architecture, we named our model Bilbo the "bagging" model. Unlike ensembles, which optimise their components prior to conjoining, hybrids optimise over all the components. As demonstrated in our results (Section 6), Bilbo successfully combines LSTM, CNN, and ANN layers for dictionary DGA detection and is the best at consistently classifying dictionary DGAs amongst state-of-the-art deep learning models.
Figure 2: Comparing the shared longest common substrings from within each domain family considered during our classification (alexa, suppobox, gozi, and matsnu). The circumference is grouped by colour for each family. The counts are for the number of times the overlapping LCS was an LCS for a domain pair within a given family. Note that any overlap in the centre has no meaning and the counts contain overlap between LCS shared between one pair of families and any other.

DATA ANALYSIS
To better understand the success and failures of the models used in our tests, we conducted a brief analysis of our data set of known dictionary DGAs. The dictionary DGA domains were selected from collections of related DGAs, called DGA families, published on DGArchive [13], a trusted database of domains extracted from reverse-engineered DGA malware. From this source, several families of DGAs were empirically identified as solely dictionary DGAs based on the structure of the domain names generated by malware samples. The families selected were suppobox, gozi, and matsnu with domains collected over two years (2016-17) by DGArchive. After removing duplicate domain names, the resulting selection contained 137,745 samples of suppobox, 18,539 samples of matsnu, and 20,313 samples of gozi.
The legitimate domains in the training set originate from the Alexa Top 1 Million domains, measured in 2016 [5]. The Alexa list ranks domains by the number of times each has been accessed. Since DGA-based malware tends to use domains for short periods of time, we assume that top Alexa domain names are human-generated and label them as non-DGA. These popular domains, mostly containing valid English words, encourage the model to learn characteristics of legitimate combinations of English words. We randomly sampled an equivalent number of domains from Alexa to match the total number of dictionary DGA samples available.
To further understand our data, we conducted two comparisons:
(1) After extracting the longest common substrings within each family, we compared the lists between families for dictionary similarity. See Figure 2 for a summary of those results.
(2) Using the widely adopted Jaro-Winkler algorithm for string similarity [15], we compared every domain in our data set with the other domains in its own family and with those of every other family. The histogram in Figure 3 shows how similar the families are and how this could influence the results for generalisability.

Longest Common Substring (LCS)
The application of this algorithm was inspired by Pereira et al.'s technique for dictionary extraction [32]. We applied this to each individual group (alexa, suppobox, gozi, and matsnu) to generate a list of every LCS between pairs of domains. These lists contain all possible dictionary words used to generate the domains. By comparing the lists between the families, we can see how learning one family's list could assist in identifying the other. Figure 2 visualises the overlap between sets with a chord diagram. The circumference is partitioned into four parts and is labelled with the count for the number of times overlapping substrings were seen as the LCS for a domain pair within its family. For instance, look at the black vertical chord between gozi and alexa. The colour black means that alexa, the family assigned black, is the smaller portion of this relationship, i.e. fewer of its LCS (approximately 10 million) are within the overlap with gozi (approximately 100 million).
LCS overlapping between alexa and gozi also include LCS from other overlaps. gozi's large partition of the circumference while also being the smallest family means it overlaps frequently with other groups. Overall matsnu and gozi have the largest overlap, sharing 8.6% of their LCS and 92% of their LCS when including the number of times each was seen as the LCS of a pair. The longest LCS between them was 14 characters; the average length for LCS was 4.238 characters. Therefore, there must be only a few very common substrings between the families, which deep learning models could learn.
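As a rough sketch of this comparison, the following Python computes the LCS of every domain pair within a family, counts how often each substring occurs, and intersects the resulting lists between two families. The domain lists and the `min_len` cut-off are hypothetical, and the quadratic pairwise loop is only practical for small samples; it is not the tooling used to produce Figure 2.

```python
from collections import Counter
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (first found on ties)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_counts(domains, min_len=3):
    """Count how often each substring is the LCS of a pair within one family."""
    counts = Counter()
    for a, b in combinations(domains, 2):
        lcs = longest_common_substring(a, b)
        if len(lcs) >= min_len:  # min_len is an assumed noise filter
            counts[lcs] += 1
    return counts


# Hypothetical toy families, not taken from DGArchive.
family_a = ["milkdustbad", "dustcourage", "nearestmilk"]
family_b = ["boredmilk", "milklaptop"]

counts_a = lcs_counts(family_a)
counts_b = lcs_counts(family_b)
shared = set(counts_a) & set(counts_b)
print(counts_a)  # Counter({'dust': 1, 'milk': 1})
print(shared)    # {'milk'}
```

Weighting each shared substring by its count, rather than treating the lists as plain sets, corresponds to the distinction drawn above between the share of distinct LCS and the share when including the number of times each was seen.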

Jaro-Winkler (JW) Score
To understand the similarity of an entire domain string with any other domain, we used the JW score [15].
This algorithm takes the ordering of the characters and the collection of characters to develop a score in the range [0,1]. The closer the score is to one, the more similar the domains are to one another. We compared every domain to generate diagrams such as Figure 3.
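For readers unfamiliar with the score, a minimal pure-Python sketch of Jaro-Winkler similarity follows. It assumes the commonly used prefix scale of 0.1 and a maximum rewarded prefix of four characters; the source does not state which parameters or library were used, so these values and the helper `mean_pairwise_jw` are assumptions.

```python
from itertools import combinations


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (p and max_prefix assumed)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def mean_pairwise_jw(domains) -> float:
    """Average JW score over all unordered pairs (needs at least two domains)."""
    scores = [jaro_winkler(a, b) for a, b in combinations(domains, 2)]
    return sum(scores) / len(scores)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Because every pair is scored, the cost grows quadratically with family size, so sampling may be needed for the larger families.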
Most families follow the same distribution with a mean of about 0.5 for JW score. However, notice the slight skew in alexa and suppobox. Due to a large percentage of their domains having little to no JW similarity, the average score for alexa was 0.4023 and suppobox was 0.4901. This slight difference is amplified when considering other aspects of the family. Both suppobox and alexa have the smallest average lengths of domains at 13 and 9 characters, respectively. Both groups have a standard deviation of approximately four characters and most frequent length of about eight characters. With this, the low JW scores for alexa and suppobox make sense with shorter domains. The other sets, matsnu and gozi, are much longer in comparison with most frequent lengths of 14 and 23 characters, respectively. The dictionary for their DGAs seems to select from shorter, 3-5 character words. Since there are fewer possible combinations of valid short words, there is more overlap between gozi and matsnu, which is also apparent in Figure 2.
This exploratory data analysis helped us develop an intuition around how different dictionary DGAs relate to each other and gave us hope that models would pick up on these relationships even though most of these families use different dictionaries and generation algorithms. Also, this same analysis should prove useful when comparing and expanding the model with other dictionary DGA families as they emerge.

EXPERIMENTAL DESIGN
We frame the DGA detection problem as a binary text classification task on only the domain string. The score provided by our model can then be used independently or be enriched with additional security context. In this section, tests are designed on known labelled data to demonstrate the baseline performance of each model. These experiments reflect practitioner concerns on model deployments within a real-world context:
(1) Testing the model's ability to do binary classification of benign and dictionary DGA domains
(2) Evaluating the model's generalisability for identifying unseen dictionary DGAs
(3) Examining the model's scores as the dictionaries and DGAs evolve over time, i.e. how well the model can classify new dictionary DGA domains from known families
We compare Bilbo to four deep learning models: a single-layer ANN, CNN, LSTM, and MIT's CNN-LSTM Hybrid [49,56]. Each is based on state-of-the-art models for DGA classification; the implementation for each is described below. Our results highlight the strengths and weaknesses of each architecture in the different scenarios.

Testing
Each experiment uses data pulled from the Alexa Top 1 Million list [5] and DGArchive [13]. The only three available dictionary DGA families are considered: gozi, matsnu, and suppobox. For model training and validation, the data is always separated into three sets: training, testing, and holdout. Training and testing are used at every epoch to see if early stopping should occur, preventing overfitting. The results for each metric, listed in Section 6, are from applying the model to the holdout set.
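The role of the testing set in early stopping can be sketched as follows. The stopping criterion is not specified in the text, so this example assumes a simple patience rule on the per-epoch testing loss; the function name, the `patience` value, and the loss figures are all hypothetical.

```python
def early_stopping_epoch(test_losses, patience=2):
    """Return the 1-based epoch at which training would stop.

    Assumed rule: stop once the testing loss has failed to improve on its
    best value for `patience` consecutive epochs; otherwise run to the end.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(test_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(test_losses)


# Hypothetical per-epoch testing losses.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.39]
print(early_stopping_epoch(losses))  # 5
```

The holdout set plays no part in this decision, which is what keeps the reported metrics independent of the stopping point.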

Testing Classification.
The first test evaluates how the model performs for binary classification between benign (negatives) and dictionary DGA domains (positives). With a balanced dataset, 80% was used for training the model. The remaining 20% (approximately 60,000 domains) was randomly sampled to use for testing and holdout: 50,000 domains for testing the model at each epoch and 10,000 domains for validating the model after training was completed. All training, testing, and validation data sets contained an approximately equal number of positives and negatives.

Testing Generalisability.
This test evaluates how the model generalises to unseen dictionary DGAs. For this, three trials are created from the data sorted by dictionary DGA family. Each trial takes two of the families for training and splits the third over testing and holdout. For example, one variant uses matsnu and suppobox domains to train the model while evaluating the model's performance using gozi domains. This paper is the first to test DGA detection models in this way.
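The three trials can be sketched as a leave-one-family-out split. How benign domains are divided between training and evaluation is not described here, so the sketch below assumes that disjoint, equally sized benign samples are drawn for each side and that the held-out data is halved into testing and holdout; the function name and toy domains are hypothetical.

```python
import random


def leave_one_family_out(families, benign, seed=0):
    """Yield (held_out, train, testing, holdout) for each DGA family.

    Each item in the returned lists is a (domain, label) pair, with
    label 1 for dictionary DGA domains and 0 for benign domains.
    """
    rng = random.Random(seed)
    for held_out in sorted(families):
        train_pos = [d for name in sorted(families) if name != held_out
                     for d in families[name]]
        eval_pos = list(families[held_out])
        # Assumption: balance each side with disjoint benign samples.
        pool = list(benign)
        rng.shuffle(pool)
        train_neg = pool[:len(train_pos)]
        eval_neg = pool[len(train_pos):len(train_pos) + len(eval_pos)]
        train = [(d, 1) for d in train_pos] + [(d, 0) for d in train_neg]
        evaluation = [(d, 1) for d in eval_pos] + [(d, 0) for d in eval_neg]
        rng.shuffle(train)
        rng.shuffle(evaluation)
        half = len(evaluation) // 2  # assumed even testing/holdout split
        yield held_out, train, evaluation[:half], evaluation[half:]


# Hypothetical toy data standing in for DGArchive and Alexa samples.
families = {"gozi": ["g1", "g2"], "matsnu": ["m1", "m2"], "suppobox": ["s1", "s2"]}
benign = ["b1", "b2", "b3", "b4", "b5", "b6"]

for held_out, train, testing, holdout in leave_one_family_out(families, benign):
    print(held_out, len(train), len(testing), len(holdout))
```

With the toy data, each trial trains on eight labelled domains and evaluates on four, none of which come from a family seen during training.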

Testing Time-based Resiliency.
DGAs have been found to evolve over time, varying their generation algorithms slightly or using entirely new dictionaries [24]. While our tests for generalisability highlight some of the deep learning models' ability to classify alterations in the dictionary DGA, they are limited by our scope of sampling in 2016-17. To test a detection system's resiliency on future versions of dictionary DGA domains, we evaluate our models trained on data from 2016-17 with DGA samples from November 2019. Models trained on all three dictionary DGA families are applied to this dataset.

Implementation of Deep Learning Models
Deep learning models take numerical sequences as input. Thus, every domain string is encoded as an array of integers and then padded with zeros to ensure that all inputs are of the same size. Each Unicode character is mapped to an integer through a constructed list of 40 valid domain-name characters. For example, "google" would be converted to [7,15,15,7,12,5] and padded with zeros at the beginning to get all inputs up to our maximum length of a domain string: 63 characters. Our final input is [0, 0, ..., 0, 7, 15, 15, 7, 12, 5]. During initial iterations, we confirmed that padding the end of the sequence made no difference when compared with padding the beginning of the sequence. Rather than a common embedding for all deep learning models, the embedding is learned by the model during training. The output from each deep learning model is a score, a single float between zero and one. This value indicates the model's confidence that the domain was generated by a dictionary DGA.
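A minimal sketch of this encoding is shown below. The exact list of 40 valid characters is not given in the text, so the alphabet here (lowercase letters, then digits, then hyphen, period, and underscore, 39 characters in total) is an assumption chosen so that the letters map to 1-26 and reproduce the "google" example; the names `VALID_CHARS` and `encode_domain` are hypothetical.

```python
import string

# Assumed alphabet: the source's own 40-character list is not reproduced here.
VALID_CHARS = string.ascii_lowercase + string.digits + "-._"
# Index 0 is reserved for padding, so real characters start at 1.
CHAR_TO_INT = {ch: i for i, ch in enumerate(VALID_CHARS, start=1)}
MAX_LEN = 63  # maximum length of a domain string, as stated in the text


def encode_domain(domain: str, max_len: int = MAX_LEN) -> list[int]:
    """Map each character to an integer and left-pad with zeros."""
    codes = [CHAR_TO_INT[ch] for ch in domain.lower()]  # KeyError if unknown
    if len(codes) > max_len:
        raise ValueError(f"domain longer than {max_len} characters")
    return [0] * (max_len - len(codes)) + codes


encoded = encode_domain("google")
print(encoded[-6:])  # [7, 15, 15, 7, 12, 5]
print(len(encoded))  # 63
```

Raising on unknown characters is a design choice for this sketch; a deployed scorer might instead map them to a dedicated out-of-vocabulary index.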
We compare our main model, Bilbo, against four models adapted from state-of-the-art architectures: a single-layer ANN, a CNN, an LSTM, and MIT's Hybrid [49,56]. The code for each model is in Appendix A. As mentioned in Section 2.2, deep learning models have frequently been shown to outperform feature-based approaches for DGA detection and are capable of millisecond scoring speeds. Because of these ideal characteristics for a dictionary DGA detection system, Bilbo is only compared to other deep learning architectures.
All models were built in Keras [45] using the TensorFlow [2] backend on a MacBook Pro, to convey the ease of model retraining and to show that models can be deployed on smaller cloud servers. Each model is trained three times for ten epochs with a batch size of 512.

Artificial Neural Network (ANN).
This fundamental model architecture underlies both the CNN and LSTM. As a baseline for this study, similar to Yu et al. [56], we include a single-layer ANN with 100 neurons in its hidden layer during our testing and consideration. This architecture is also included within Bilbo as the conjoining layer for the parallel CNN and the LSTM component architectures.

Long Short-Term Memory (LSTM) Network.
This architecture is a slight adaptation on the LSTM used by Woodbridge et al. [50]. Because it was tuned for a slightly different task, we reevaluated some of its hyperparameters. From our automated grid search of hyperparameters, as shown in Figure 4, it was clear that increasing LSTM layer size improved our accuracy on the testing set for generic binary classification. We found that an LSTM layer of 256 nodes provided us with the highest accuracy on the testing dataset without loss to its performance in real-time deployments.
The only alterations to the original model were the input parameters to match our standard across models and doubling the size of the LSTM layer. This is the same architecture implemented as a component within Bilbo.
Figure 4: The LSTM layer size and optimiser are compared for accuracy on the test set, demonstrating improved performance using larger networks and either the adam or rmsprop optimiser.

Convolutional Neural Network (CNN).
We followed Saxe et al.'s parallel convolution structure [38] to compare against a state-of-the-art CNN. After testing a variety of filter sizes individually, we also analysed combinations of filters to find the best architecture for our task. Based on LCS analysis for each family, the majority of substrings within dictionary DGAs appeared to be within the range of two to six characters. This model's final architecture includes five different sizes (2-6 characters) of convolutions, 60 filters of each length with a stride of one character, and pooling, with the pooled outputs later concatenated to provide a large amount of information towards the final score. This architecture balances the model complexity against the prediction accuracy on our training set.
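To make the filter widths concrete, the following sketch lists the character windows that a filter of each width slides over with a stride of one. It ignores padding and embeddings, and it assumes that pooling keeps a single value per filter, so the concatenated feature vector has one entry per filter; `char_windows` is a hypothetical helper, not part of the published code.

```python
def char_windows(domain, width, stride=1):
    """All substrings of `width` characters visited by a filter of that width."""
    return [domain[i:i + width] for i in range(0, len(domain) - width + 1, stride)]

FILTER_WIDTHS = (2, 3, 4, 5, 6)  # filter sizes stated in the text
FILTERS_PER_WIDTH = 60           # filters per size stated in the text

domain = "boilingbeetle"
windows = {width: char_windows(domain, width) for width in FILTER_WIDTHS}

# Assumption: one pooled value per filter, concatenated across all widths.
feature_length = len(FILTER_WIDTHS) * FILTERS_PER_WIDTH
```

Under that pooling assumption, each domain is summarised by 300 values before the final scoring layers.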

Bilbo.
Our initial results with the individual LSTM and CNN, as seen in Table 2, indicated each model was learning relevant but distinct characteristics for accurate identification of dictionary DGAs. Bilbo's architecture "bags" the extracted features from the LSTM and CNN with a hidden layer of 100 nodes, from which a final prediction is rendered. This hybrid model learns to balance the features extracted by both the LSTM and CNN. The same architectures described previously for the individual ANN, LSTM, and CNN are combined to form Bilbo. This model is the first parallel usage of a CNN and LSTM hybrid for DGA detection.
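A dependency-free sketch of the bagging step follows. It assumes the LSTM contributes a 256-value feature vector and the CNN contributes 300 pooled values (five filter widths times 60 filters). The ReLU hidden activation, the sigmoid output, and the random, untrained weights are illustrative assumptions; the actual Keras definition is the one referenced in Appendix A.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """Fully connected layer: one output per row of `weights`."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bilbo_head(lstm_features, cnn_features, hidden_w, hidden_b, out_w, out_b):
    """Concatenate both feature vectors, apply the hidden layer, emit one score."""
    merged = list(lstm_features) + list(cnn_features)
    hidden = dense(merged, hidden_w, hidden_b, relu)
    return dense(hidden, [out_w], [out_b], sigmoid)[0]

# Assumed sizes; weights are random placeholders, not trained parameters.
LSTM_DIM, CNN_DIM, HIDDEN = 256, 300, 100
rng = random.Random(0)

def rand_vec(n):
    return [rng.uniform(-0.05, 0.05) for _ in range(n)]

lstm_out = rand_vec(LSTM_DIM)   # stand-in for the LSTM branch output
cnn_out = rand_vec(CNN_DIM)     # stand-in for the pooled CNN branch output
hidden_w = [rand_vec(LSTM_DIM + CNN_DIM) for _ in range(HIDDEN)]
hidden_b = [0.0] * HIDDEN
out_w = rand_vec(HIDDEN)
out_b = 0.0

score = bilbo_head(lstm_out, cnn_out, hidden_w, hidden_b, out_w, out_b)
```

Because both branches feed the same hidden layer, training can weight LSTM and CNN features against each other rather than chaining one into the other.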

MIT Hybrid Model.
Based on the original encoder-decoder model presented by MIT [49], several recent publications have adapted this CNN-LSTM hybrid model to DGA classification [30,43,56]. Unlike our model, this uses the CNN convolutions to feed inputs into an LSTM. The MIT hybrid architecture adapted by Yu et al. [56] is another benchmark during testing. Comparing Bilbo's parallel usage of a CNN and an LSTM to this model demonstrates the significance of our parallel architecture in binary classification of dictionary DGAs. Their single convolutional layer consists of 128 one-dimensional filters, each three characters long with a stride of one. This is fed into a Max Pooling layer before a 64-node LSTM. This model contains no dropout and relies on a single sigmoid to flatten the results to a single score.

Metrics for Comparison
Considering real-world applications for DGA detection, a balance must be found between incorrectly labelled domains and a lack of confidence for true dictionary DGA domains. To help measure each model's performance in this respect, three core metrics are calculated to summarise common metrics used in machine learning research. The first is the area under the receiver operating characteristic (ROC) curve (AUC), which measures the model's ability to detect true positives as a function of the false positive rate. Maximising AUC means improving labelling of both positive and negative samples. The second is accuracy: how well the model scored positive and negative labels out of all samples in the holdout set. Finally, the F1 score is the harmonic mean of precision and recall, giving insight into the context of true positive labels within the holdout set.
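Using abbreviations for true positive (TP), true negative (TN), false positive (FP), false negative (FN), true positive rate (TPR), and false positive rate (FPR), these are computed in the following ways:
\[ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} \]
\[ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR}) \]
\[ \mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \]
\[ F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \]
Consistency of the core metrics in every setting is key to finding the best performer while evaluating models on labelled data. To quantify this, the core metrics are treated as assessment questions: one point of consistency is awarded to each of the top three models within every core metric. The model with the most points across testing classification and generalisability is deemed the most consistent performing model.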
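The consistency scoring described above can be sketched as a simple tally. The model names and metric values below are made-up placeholders, not results from this paper, and tie handling is an assumption (ties fall back to input order).

```python
def consistency_points(metric_tables, top_n=3):
    """Award one point to each of the top `top_n` models within every metric.

    metric_tables maps a metric name to {model name: value}; higher is better.
    """
    points = {}
    for scores in metric_tables.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for model in ranked[:top_n]:
            points[model] = points.get(model, 0) + 1
    return points

# Hypothetical values for four hypothetical models, for illustration only.
example = {
    "AUC":      {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6},
    "F1":       {"A": 0.5, "B": 0.9, "C": 0.6, "D": 0.8},
    "Accuracy": {"A": 0.7, "B": 0.6, "C": 0.9, "D": 0.8},
}
points = consistency_points(example)
most_consistent = max(points, key=points.get)
```

Summing these tallies over every test and trial gives a single ranking of models by consistency.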

RESULTS
In this section, we elaborate on the values of metrics from each model resulting from each test. The threshold for a label for every test was 0.5. Overall, the priority is to accurately apply both positive and negative labels to the dataset. From these tests, the model with the best consistency score is viewed as the best for deploying into real-world settings.
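As a sketch of how the 0.5 threshold turns model scores into labels and then into the core metrics, the code below computes confusion counts, accuracy, F1, and AUC in plain Python. Treating a score exactly equal to the threshold as positive is an assumption, and the scores and labels are made-up examples rather than data from the study. The AUC uses the rank formulation, which equals the area under the ROC curve.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Return (TP, TN, FP, FN); scores at or above the threshold are labelled DGA (1)."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores and ground-truth labels (1 = dictionary DGA, 0 = benign).
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
counts = confusion_counts(scores, labels)
```

Accuracy and F1 depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds.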

Results of Testing Classification
The values for the metrics from this test are provided in Table 3. In this test, the ANN is significantly worse than the specialised deep learning models in every metric, according to a Student's t-test with 95% confidence on all the collected results. The ANN's FPR of 0.1953 is almost an order of magnitude worse than MIT's FPR, which was the best. The CNN and LSTM are statistically similar in all metrics, with the LSTM outperforming the CNN in precision, TPR, and FPR. This is due to the imbalance between the dictionary DGA families, with suppobox comprising about 78% of the malicious samples. During our substring analysis, we found that suppobox contained the longest substrings, revealing that models which learn the long sequences of suppobox's dictionary words would have an advantage when classifying the majority of dictionary DGA domains. The LSTM is designed to learn sequential relationships between characters rather than subsets of characters like the CNN. This is why the LSTM beats the CNN and, as shown in Table 6, is a consistent leader in the core metrics.
Both MIT's hybrid model and Bilbo perform the best across all metrics. The difference between the two is insignificant in all metrics, differing by less than 0.01 for the F1 score, accuracy, and AUC. This near-identical performance is similar to the LSTM and CNN comparison earlier. There is also a pattern in most of the metrics that when the CNN is better than the LSTM, Bilbo is better than the MIT model, and vice versa. MIT's parameters are mostly dedicated to the LSTM layer, explaining the similar performance between the two models.
Bilbo consistently performs between or better than its component models in all metrics by regularising the performance of the LSTM and CNN with an ANN, displaying the expected results of our parallel architecture. In the empirical analysis of the results, the top-scoring domains from both the CNN and LSTM were present in the final scoring of Bilbo, as expected.
The difference between the deep learning models, excluding the ANN, in this test is very small. Given a domain name, they are all successful at distinguishing dictionary DGA domains from benign domains after learning from three diverse dictionary DGA families. The consistency scores for this test place the LSTM model, the MIT model, and Bilbo as the best performers.

Results of Testing Generalisability
As presented in Table 4, the metrics have been limited to three core metrics to maximise for best overall performance. A model's AUC indicates the model's likelihood of correctly classifying a sample as positive or negative. The F1 score conveys how well the model correctly labels dictionary DGA domains with regard to those that should be or were labelled. Accuracy states how well the model labelled the data within this particular holdout set. The values for the core metrics were not expected to surpass 0.9 due to the differences between each dictionary DGA family. Analysis of each family's LCS and the JW scores between families, not depicted in this paper, showed that some families overlap more with one family than another. This dependence influences each model's performance by limiting its ability to generalise unless certain families have been seen before. Hence the values across this table are lower than in Table 3.
The ANN outperforms the other models in this task with higher core metrics in two of the trials. However, it only surpasses the other models when matsnu or gozi are part of the training set. Figure 2 depicts a large overlap in their LCS. This could explain what the ANN is able to learn for better performance on new DGAs when either matsnu or gozi is in the training set and the other is in the testing set. The next most consistent performer in this test is the CNN. Its training on smaller character windows allows it to excel when applied to new dictionary DGAs. Based on earlier data analysis, the most frequent LCS in every family were three to four characters and typically overlapped. The large overlap in LCS between matsnu and gozi reinforces these short substrings, explaining why the CNN outperforms others when both matsnu and gozi are in the training set.
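For readers unfamiliar with the LCS comparison, the sketch below finds the longest run of characters shared by two domain labels using dynamic programming. It assumes LCS here denotes the longest common substring, and the two strings are hypothetical examples rather than samples from any DGA family.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters present in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

# Hypothetical domain labels that share one dictionary word.
shared = longest_common_substring("sharedwordhouse", "bluewordgarden")
```

Applying this pairwise across samples from two families gives a rough picture of how much vocabulary the families share.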

Results of Testing Time-based Resiliency
The final test is on a single day's worth of recent domain samples from each of the dictionary DGA families already considered. Listed are the ratios of true positives out of the total number of samples for that dictionary DGA family. Total samples for each family are as follows: 1325 from gozi, 686 from matsnu, and 4257 from suppobox.
Using all of the trained models from the classification test, the average scores are listed. The results are close between all model architectures and, when averaged, are close to the accuracy seen during testing. The relative decrease in accuracy for matsnu and gozi is due to the class imbalance between the dictionary DGA families in the dataset. Regardless of which model is selected for deployment, it will need to be updated frequently with new labelled data, whenever trusted data is available, to increase accuracy on future dictionary DGA domains.
Throughout all of these tests, each state-of-the-art deep learning model achieves top metrics. To determine which is the best, we consider the application environment the model is to be deployed in and its need for a consistent, well-performing model. After aggregating the consistency points for the top performers from every core metric in each test and trial, presented in Table 6, Bilbo is found to be the most consistent and capable model for deploying within real-world dictionary DGA detection systems.

REAL-WORLD DEPLOYMENT
Once Bilbo was trained, tested, and validated using open source data from the Alexa Top 1 Million [5] and DGArchive [13], we evaluated performance in a live system. We deployed the model on a cluster of servers to be queried by a data pipeline and applied the model to live network traffic from a large enterprise.

Implementation at the Corporate Level
Within corporate environments, a large security information and event management (SIEM) system is typically used to centralise and process relevant data sources. Security analysts use the SIEM for their daily work to investigate suspicious activity within their environment. The data they view is limited by a series of filters and joins they apply on various datasets.
To productionise Bilbo in a high-throughput environment generating hundreds of domains per second, we developed a model-as-a-service framework. This framework promotes scalability, modularity, and ease of maintenance. Client systems processing domain names, such as the SIEM, make requests of the model servers to receive scores on new domains. This communication is performed using gRPC, Google's library for remote procedure calls [17], which was selected for its speed over methods like REST (Representational State Transfer). The communication from client to server is language-agnostic, allowing a client written in Java or Scala to interface seamlessly with our Python model.
Figure 5: Three stages of the updating process for our Model as a Service (MAAS) architecture, which makes model deployments accessible to the SIEM and other client systems. From left to right, the first stage shows clients interacting through the load balancer with old model servers. To update the servers, we spin up new model servers with the latest version and confirm production readiness before attaching them to the load balancer. Finally, the old model servers are deleted, leaving the new model servers in their place. At no point during this process will the clients be unable to receive scores from our models.
A load balancer manages traffic to the model servers and only the load balancer endpoint is exposed to the client. This allows multiple clients to reach out to a single location in order to receive scores from the model. Any number of model servers can run behind the load balancer, but these details are abstracted away from the clients, who only interface with the load balancer endpoint. This allows us to increase and decrease the size of the model server cluster in response to changing demand without interrupting service; such scaling can be configured to take place automatically in response to metrics like CPU utilisation.
While our model does not learn inline, its predictions, combined with a ground truth label provided by an analyst, can be used to retrain the model, allowing it to learn from mistakes and improve its predictive power. Thus, we need to be able to deploy a retrained model frequently and with low overhead. Since the model server cluster is behind a load balancer, we can make this change without shutting down the service. We simply put additional model servers (running the newest model) behind the load balancer and, once they have been confirmed to run successfully, remove the model servers running an outdated version. The model update process can be seen in Figure 5. Along with their scores, the model servers return the version of the model that they are running; this is helpful in evaluating our models over time and in distinguishing between models during the brief overlap period when two versions of the model are running behind the load balancer.
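The update procedure can be illustrated with a small in-memory simulation. This is only a sketch: `ModelServer`, `LoadBalancer`, and `rolling_update` are hypothetical stand-ins for real cloud infrastructure, the round-robin routing and the readiness probe are assumptions, and the constant scoring functions are placeholders for the model.

```python
class ModelServer:
    def __init__(self, version, score_fn):
        self.version = version
        self._score_fn = score_fn

    def score(self, domain):
        # Return the score together with the model version, as the text describes.
        return self._score_fn(domain), self.version

    def is_ready(self, probe="example.com"):
        # Assumed readiness check: the server answers with a valid probability.
        value, _ = self.score(probe)
        return 0.0 <= value <= 1.0

class LoadBalancer:
    def __init__(self, servers):
        self._servers = list(servers)
        self._turn = 0

    def attach(self, server):
        self._servers.append(server)

    def detach_version(self, version):
        self._servers = [s for s in self._servers if s.version != version]

    def versions(self):
        return sorted({s.version for s in self._servers})

    def score(self, domain):
        # Round-robin routing (an assumption about the balancing policy).
        server = self._servers[self._turn % len(self._servers)]
        self._turn += 1
        return server.score(domain)

def rolling_update(balancer, new_servers, old_version):
    """Attach verified new servers first, then remove the outdated ones."""
    if not all(server.is_ready() for server in new_servers):
        raise RuntimeError("new model servers failed the readiness check")
    for server in new_servers:
        balancer.attach(server)
    balancer.detach_version(old_version)

balancer = LoadBalancer([ModelServer("v1", lambda d: 0.1) for _ in range(2)])
before = balancer.score("example.com")
rolling_update(balancer, [ModelServer("v2", lambda d: 0.2) for _ in range(2)], "v1")
after = balancer.score("example.com")
```

Because new servers are attached before old ones are detached, the pool is never empty, and the version returned with each score identifies which model produced it.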
Several key design decisions allow us to handle requests to the service at very large scale. While gRPC minimises network latency by allowing bi-directional streaming between the client and server, the calls to our service are still time-intensive, so we built in a Bloom filter caching mechanism on the client side to avoid this bottleneck. This more intelligent client only reaches out to the server if it receives a domain that it has not recently seen before. Our analysis of domain traffic revealed that only 15% of domains are unique in an hour of traffic; this optimisation dramatically reduces the workload of our model server cluster.
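A minimal sketch of such a client-side cache is shown below, using only the standard library. The filter size, the number of hashes, the double-hashing scheme, and the `CachingClient` wrapper are all assumptions made for illustration. A Bloom filter can report false positives, so a small fraction of new domains may be skipped, and a real deployment would reset or rotate the filter periodically to honour "recently".

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

class CachingClient:
    """Calls the remote scorer only for domains the filter has not seen."""

    def __init__(self, score_remote):
        self.seen = BloomFilter()
        self.score_remote = score_remote
        self.remote_calls = 0

    def submit(self, domain):
        if domain in self.seen:
            return None  # probably scored already; skip the network call
        self.seen.add(domain)
        self.remote_calls += 1
        return self.score_remote(domain)

# The lambda stands in for the gRPC call to the model servers.
client = CachingClient(score_remote=lambda domain: 0.5)
replies = [client.submit(d) for d in ["a.example", "b.example", "a.example"]]
```

The filter stores only bits rather than the domains themselves, which keeps the client's memory footprint fixed regardless of traffic volume.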
We evaluated Bilbo based on its processing capacity and its findings, as seen below. Our initial prototype consisted of a single client reaching out to a load balancer with a single server in the cloud. With an unoptimised compilation of TensorFlow for our backend, the fastest scoring averaged approximately 10 ms per record, increasing linearly with an increasing number of requests. If we anticipate 1000 domains per second, our model only needs to be hosted on 10 servers. On a cloud service such as Amazon Web Services, we can keep a ten-node cluster running for less than fifty cents (USD) per hour.
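The sizing arithmetic can be checked with a few lines of code. The sketch assumes each server scores one record at a time at a constant 10 ms per record, which is a simplification because the text notes that latency grows with load; `servers_needed` is a hypothetical helper.

```python
def servers_needed(domains_per_second, ms_per_record):
    """Servers required if each one scores records sequentially at a fixed latency."""
    busy_ms_per_second = domains_per_second * ms_per_record
    return -(-busy_ms_per_second // 1000)  # ceiling division on integers

# Figures from the text: 1000 domains per second at about 10 ms per record.
full_load = servers_needed(1000, 10)  # 10 servers

# With client-side caching, only about 15% of domains in an hour are unique.
unique_per_second = 1000 * 15 // 100
cached_load = servers_needed(unique_per_second, 10)  # 2 servers
```

Under these assumptions, the cache described earlier would reduce the steady-state requirement from ten servers to two.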

Results in Enterprise Traffic
For further model performance testing, Bilbo is evaluated on real-world network traffic. We randomly selected one window of traffic from August 14th, 2017, and another from November 15th, 2017. Each network sample set contains domain names over a two-hour period. After parsing the domain names from the URLs in the logs, the August and November data contained 20,000 and 45,000 unique domains, respectively.
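Extracting unique domain names from logged URLs can be sketched with Python's standard `urllib.parse` module. The log format is not described in the text, so the inputs below are hypothetical and use reserved example domains; `unique_domains` is an illustrative helper.

```python
from urllib.parse import urlsplit

def unique_domains(urls):
    """Return the set of lower-cased hostnames found in an iterable of URLs."""
    domains = set()
    for url in urls:
        # urlsplit only recognises the host when a "//" prefix is present.
        if "://" not in url:
            url = "//" + url
        host = urlsplit(url).hostname
        if host:
            domains.add(host.rstrip("."))
    return domains

# Hypothetical log entries.
log_urls = [
    "http://Example.com:8080/index.html?x=1",
    "https://www.example.org/path",
    "example.com/other/page",
]
domains = unique_domains(log_urls)
```

Only the resulting hostnames, not the paths or query strings, would then be passed to the model for scoring.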
Since we lack ground truth for the domains in our captured samples to validate our results, we pulled in additional information for each domain. First, we included the action decision of the proxy, which denies domains that are known to be malicious. Second, we added scores from VirusTotal [41], a site that aggregates blacklists to provide reputation scores for domains and is commonly used by security analysts for evaluation of domains (accessed November, 2017). Note that both the proxy and VirusTotal are imperfect since they are unaware of malicious content related to a domain until thorough analysis has been performed, which can take many weeks [26]. We cross-referenced the high scores from our model with the results from the proxy and VirusTotal to perform a basic investigation.
Feeding our model only the domain names, we discovered a series of domains with similar naming patterns:
• dot.masticationlamest[.]com/a s
At a glance, these domains follow an algorithmic pattern of three characters, two words, and the "/a s" ending, making them strong candidate dictionary-based DGA domains. Upon further examination, all of these domains were queried by the same machine, which, prior to our discovery, had been deactivated due to complaints of incredibly sluggish performance. This is highly suggestive of malware activity using a dictionary DGA network.
Figure 6: ThreatCrowd network graph of the domain "boilingbeetle" discovered in enterprise proxy traffic by the ensemble model. This domain is connected through select IP addresses to other domains of similar structure, in the pattern of a command and control network.
Additionally, we found four domains, each representing distinct suspicious networks matching the expected pattern for dictionary-based DGA C&C hubs. Each of these networks, when visualised by ThreatCrowd, a crowdsourced network analysis "system for finding and researching artefacts relating to cyber threats" [6], is shown to comprise domains that are made up of two or more unrelated words, all resolving to the same IP address, in the pattern of a domain-fluxing dictionary DGA. The "boilingbeetle" network is shown in Figure 6. These domains and their related networks were not flagged by the online blacklists used by VirusTotal; only some of the domains within each network were blocked by the proxy.
Further investigation noted that these networks are for advertisement traffic, indicating that dictionary DGA techniques are being used to bypass ad-blocker mechanisms. Although not apparently malicious, these five discoveries of dictionary-based DGA from potential malware, found in only a few hours of proxy log data, demonstrate that our solution is able to flag relevant results in live traffic.

CONCLUSIONS AND FUTURE WORK
In this paper, we present a parallel hybrid architecture named Bilbo, composed of an LSTM, a CNN, and an ANN, for dictionary DGA detection. Dictionary DGAs bypass most general, manually defined DGA defences and are harder to detect due to their natural language characteristics. Bilbo is compared to state-of-the-art deep learning models adapted for dictionary DGA classification and evaluated on consistency over AUC, accuracy, and F1 score. Overall, Bilbo is the most consistent and capable of the models compared.
Bilbo was then applied to a large financial corporation's SIEM, providing inline predictions within a scalable framework to handle high-throughput network traffic. During investigations, our model's scores were used to filter data and flag suspicious activity for further analysis.
When applied to several hours of live network logs, Bilbo successfully classified traffic matching the expected network pattern: a single IP address hosting several domain names that make no semantic sense and follow a trend of English words put together. Although the identified domains from the network logs were not botnets or worms reaching out to a C&C, which are very rare, Bilbo was able to identify dictionary DGAs used by advertisement networks and other applications with potential malicious intent.
Later improvements include the continued reduction of false positives and the application of natural language processing (NLP) techniques. One method to reduce false positives would be to layer a generative model that determines whether the input domain is similar to any data Bilbo has seen before. This could increase or decrease the score, or add another filter to alter a user's confidence in the score. Applicable NLP techniques detect anomalous word combinations in domains by scoring the likelihood that words would be collocated. This could prove fruitful for DGA detection but depends heavily on the corpus used for parsing out words and gathering the initial collocation information that establishes a baseline of what is normal.
ACKNOWLEDGMENTS
Thank you to Capital One for the incredible opportunity to deploy a machine learning model developed for research into a live environment for evaluation. To Jason Trost, your mentorship and intellectual curiosity inspire everyone around you. We appreciate your and Capital One's support to publish our work as an academic paper after our talks in industry.
To the reviewers at our last attempted venues, thank you for the incredible feedback that greatly improved our analysis.