Applying NLP techniques to malware detection in a practical environment

Executable files remain a popular vector for compromising endpoint computers, and they are often obfuscated to evade anti-virus programs. Dynamic analysis requires too much time to examine every suspicious file from the Internet, so a fast filtering method is required. With the recent development of natural language processing (NLP) techniques, printable strings have become more effective for detecting malware, and the combination of printable strings and NLP techniques can be used as a filtering method. In this paper, we apply NLP techniques to malware detection. This paper reveals that printable strings with NLP techniques are effective for detecting malware in a practical environment. Our dataset consists of more than 500,000 samples obtained from multiple sources. Our experimental results demonstrate that our method is effective not only against subspecies of existing malware, but also against new malware, as well as against packed malware and anti-debugging techniques.


Introduction
Targeted attacks are among the most serious threats on the Internet. Standard payloads such as traditional executable files remain popular; according to one report, executable files are the second most common malicious email attachment type [49]. Attackers often obfuscate malware to evade anti-virus programs, and pattern-matching-based detection methods such as anti-virus programs barely detect new malware [28]. To detect new malware, automated dynamic analysis systems and sandboxes are effective: any suspicious binary is forced to execute in a sandbox, and if its behavior is malicious, the file is classified as malware. While dynamic analysis is a powerful method, it requires too much time to examine all suspicious files from the Internet. Furthermore, it requires high-performance servers and licenses for commercial operating systems and applications. Therefore, a fast filtering method is required. To achieve this, static detection methods with machine learning techniques are applicable. These methods extract features from the malware binary and the Portable Executable (PE) header. While printable strings are often analyzed, they have not been a decisive element for detection. With the recent development of natural language processing (NLP) techniques, printable strings have become more effective for detecting malware [8]. Therefore, the combination of printable strings and NLP techniques can be used as a filtering method.
In this paper, we apply NLP techniques to malware detection. This paper reveals that printable strings with NLP techniques are effective for detecting malware in a practical environment. Our time series dataset consists of more than 500,000 samples obtained from multiple sources. Our experimental results demonstrate that our method can detect new malware, and that it is effective against packed malware and anti-debugging techniques. This paper makes the following contributions.
- Printable strings with NLP techniques are effective for detecting malware in a practical environment.
- Our method is effective not only against subspecies of existing malware, but also against new malware.
- Our method is effective against packed malware and anti-debugging techniques.
The structure of this paper is as follows. Section 2 describes related studies. Section 3 provides the natural language processing techniques related to this study. Section 4 describes our NLP-based detection model. Section 5 evaluates our model with the dataset. Section 6 discusses the performance and research ethics. Finally, Section 7 concludes this study.

Static analysis
Malware detection methods are categorized into dynamic analysis and static analysis. Our detection model is categorized into static analysis. Hence, this section focuses on the features used in static analysis.
One of the most common features is byte n-grams extracted from the malware binary [47]. Abou-Assaleh et al. used frequent n-grams to generate signatures from malicious and benign samples [1]. Kolter et al. used information gain to select 500 n-gram features [12,13]. Zhang et al. also used information gain to select important n-gram features [55]. Henchiri et al. conducted an exhaustive feature search over a set of malware samples and strove to avoid overfitting [6]. Jacob et al. used bigram distributions to detect similar malware [9]. Raff et al. applied neural networks to raw bytes without explicit feature construction [41]. Similar approaches extract features from opcode n-grams [3,10,35,57] or program disassembly [7,14,18,44,50].
While many studies focus on the body of malware, several studies focus on the PE headers. Shafiq et al. extracted features from PE headers to detect malicious executables [40].
Other features are also used to detect malware. Schultz et al. used n-grams, printable strings, and DLL imports with machine learning techniques for malware detection [46]. Masud et al. used byte n-grams, assembly instructions, and DLL function calls [20]. Ye et al. used interpretable strings such as API execution calls and important semantic strings [54]. Lee et al. focused on the similarity between two files to identify and classify malware [16]; the similarity is calculated from the extracted printable strings. Mastjik et al. analyzed string matching methods to identify the same malware family [19]. Their method uses three pattern matching algorithms: Jaro, Lowest Common Subsequence (LCS), and n-grams. Kolosnjaji et al. proposed a method to classify malware with a neural network consisting of convolutional and feedforward constructs [11]. Their model extracts features from the n-grams of instructions and the headers of executable files. Aghakhani et al. studied how machine learning based on static analysis features performs on packed samples [2]. They used a balanced dataset with 37,269 benign samples and 44,602 malicious samples.
Thus, several studies have used printable strings as features. However, printable strings have not been used as the main method for detection, and the combination of printable strings and NLP techniques has not been evaluated in a practical environment. This paper pursues the possibility of printable strings as a filtering method.

NLP-based detection
Our detection model uses some NLP techniques. This section focuses on the NLP-based detection methods.
Moskovitch et al. used some NLP techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) to represent byte n-gram features [35]. Nagano et al. proposed a method to detect malware with Paragraph Vector [36]. Their method extracts the features from the DLL Import name, assembly code, and hex dump. A similar approach is classifying malware from API sequences with TF-IDF and Paragraph Vector [51]. This method requires dynamic analysis to extract API sequences. Thus, the printable strings are not used as the main method for detection.
NLP-based detection has also been applied to malicious traffic and other content. Paragraph Vector was applied to extract features from proxy logs [30,32], and this method was extended to analyze network packets [22,31]. To mitigate class imbalance problems, the lexical features are adjusted by extracting frequent words [23,33]. Our method uses this technique to mitigate class imbalance problems. Some methods use a Doc2vec model to detect malicious JavaScript code [29,37,39]. Other methods use NLP techniques to detect memory corruptions [52] or malicious VBA macros [24][25][26][27]34].

NLP techniques
This section describes the NLP techniques related to this study. The main goal of NLP is to enable computers to process and analyze large amounts of natural language data. Documents written in natural language are separated into words so that NLP techniques such as Bag-of-Words can be applied, and the resulting corpus of words is converted into vectors that computers can process.

Bag-of-words
Bag-of-Words (BoW) [43] is a fundamental method of document classification in which the frequency of each word is used as a feature for training a classifier. This model converts each document into a vector based on the frequency of each word. Let d be a document, w_i (i = 1, 2, 3, ...) its words, and n_i the frequency of w_i. A document d can then be defined as a set of word-frequency pairs:

d = {(w_1, n_1), (w_2, n_2), ..., (w_m, n_m)}   (1)

Fixing the position of each frequency n_i and omitting the words yields the frequency vector of document d_j (j = 1, 2, 3, ...):

d_j = (n_{j,1}, n_{j,2}, ..., n_{j,m})   (2)

Stacking these vectors for all documents, with the term frequencies of all distinct words ordered as in Eq. (2), gives the document-word matrix:

X = (d_1, d_2, ..., d_N)^T   (3)
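The construction of Eqs. (1)-(3) can be sketched in plain Python (a minimal illustration, not the authors' implementation; the example words are hypothetical):

```python
from collections import Counter

def bow_matrix(documents):
    """Build a document-word matrix: one frequency vector per document.

    Word order is discarded; the vector dimension equals the number of
    distinct words across all documents.
    """
    vocab = sorted({w for doc in documents for w in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)                 # word frequencies n_i of Eq. (1)
        matrix.append([counts[w] for w in vocab])  # vector d_j of Eq. (2)
    return vocab, matrix

# Hypothetical word lists extracted from two executables
docs = [["kernel32", "loadlibrary", "kernel32"],
        ["copyright", "loadlibrary"]]
vocab, X = bow_matrix(docs)
# vocab: ['copyright', 'kernel32', 'loadlibrary']
# X:     [[0, 2, 1], [1, 0, 1]]
```

Note that the matrix X grows one column per distinct word, which motivates the dimension reduction discussed next.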
Thus, this model converts documents into vectors. However, it does not preserve the order of the words in the original documents, and the dimension of the converted vectors equals the number of distinct words in the original documents. To analyze large-scale data, the dimension should be reduced so that the data can be analyzed in a practical time.

Latent semantic indexing
Latent Semantic Indexing (LSI) analyzes the relevance between a group of documents and the words they contain. In the LSI model, the BoW vectors are reduced by singular value decomposition, and each component of the vectors is weighted. The decomposed matrix shows the relevance between the document group and the words in the documents. Term Frequency-Inverse Document Frequency (TF-IDF) is usually used to weight each component of the vector. Let |D| be the total number of documents, |{d : t_i ∈ d}| the number of documents containing word i, and frequency_{i,j} the frequency of word i in document j. TF-IDF is defined by Eq. (4):

tfidf(i, j) = frequency_{i,j} × log(|D| / |{d : t_i ∈ d}|)   (4)
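As a concrete check of Eq. (4), a direct transcription in Python (natural logarithm here; practical implementations differ in log base and smoothing, so treat this as a sketch):

```python
import math

def tfidf(freq_ij, total_docs, docs_with_word):
    """TF-IDF of word i in document j per Eq. (4):
    frequency_{i,j} * log(|D| / |{d : t_i in d}|)."""
    return freq_ij * math.log(total_docs / docs_with_word)

# A word appearing 3 times in a document, present in 2 of 8 documents
score = tfidf(3, 8, 2)    # 3 * log(4)
rare = tfidf(3, 8, 1)     # a rarer word gets a higher weight
common = tfidf(3, 8, 8)   # a word in every document gets weight 0
```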
TF-IDF weights the vectors before singular value decomposition. The component x_{i,j} of the matrix X is the TF-IDF value of word i in document j. From the theory of linear algebra, X can be decomposed into orthogonal matrices U and V and a diagonal matrix Σ:

X = U Σ V^T

In this singular value decomposition, U is a column-orthogonal matrix whose column vectors are linearly independent; hence U is a basis of the document vector space. Generally, the matrix U represents latent meaning.
In this model, the number of dimensions can be chosen arbitrarily. Thus, this model reduces the dimension so that the data can be analyzed in a practical time.
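The dimension reduction can be sketched with NumPy's SVD (an illustration of truncating X = UΣV^T to k dimensions, not the gensim implementation used later):

```python
import numpy as np

def lsi_reduce(X, k):
    """Project the rows (documents) of a TF-IDF matrix X onto the
    top-k latent dimensions via truncated SVD: X ~= U_k S_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]   # document coordinates in the latent space

# Toy TF-IDF matrix: docs 0-1 share words, docs 2-3 share different words
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])
docs_2d = lsi_reduce(X, 2)   # 4 documents, each now a 2-dimensional vector
```

In the reduced space, documents that share vocabulary land close together, which is what the classifier later exploits.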

Paragraph vector
To represent word meaning or context, Word2vec was created [21]. Word2vec is a shallow neural network trained to reconstruct the linguistic contexts of words. It takes a large corpus of documents as input and produces a vector space of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector. Word vectors are positioned in the space such that words sharing common contexts in the corpus are located close to one another. An example of arithmetic on these vectors is queen = king − man + woman. Paragraph Vector is an extension of Word2vec that represents a whole document [15], and Doc2vec is an implementation of Paragraph Vector; the only change is replacing a word with a document ID. Words can have different meanings in different contexts, so the vectors of two documents containing the same word in two distinct senses must account for this distinction. Thus, this model represents a document with word meaning or context.

Detection model

Our detection model consists of language models and classifiers; one of each is selected. In the training phase, the selected language model is constructed from the words extracted from malicious and benign samples, and this language model extracts the lexical features. The selected classifier is then trained with the lexical features and labels. In the testing phase, the constructed language model and the trained classifier classify unknown executable files as malicious or benign.

Training
The training procedure is shown in Algorithm 1. Our method extracts all printable (ASCII) strings from the malicious and benign samples and splits the strings into words. The frequent words are selected from each class, and the selected language model is constructed from the selected words. Our method uses a Doc2vec or LSI model to represent the lexical features: the Doc2vec model is constructed from the corpus of words, while the LSI model is constructed from the TF-IDF scores of the words. The words are converted into lexical features with the selected language model, yielding labeled feature vectors. Thereafter, the selected classifier is trained with the labeled feature vectors. The classifiers are Support Vector Machine (SVM), Random Forests (RF), XGBoost (XGB), Multi-Layer Perceptron (MLP), and Convolutional Neural Networks (CNN). These classifiers are popular in various fields and have distinct characteristics.
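The training flow can be approximated with scikit-learn (a sketch of the LSI+SVM combination on a hypothetical toy corpus; the paper's implementation uses gensim for the LSI model, and the words here are invented for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical printable strings, joined into one text per sample
train_docs = ["kernel32 loadlibrary virtualalloc",
              "kernel32 loadlibrary createremotethread",
              "copyright version helpstring",
              "copyright version producturl"]
labels = [1, 1, 0, 0]   # 1 = malicious, 0 = benign

model = make_pipeline(
    TfidfVectorizer(),                              # BoW weighted by TF-IDF
    TruncatedSVD(n_components=2, random_state=0),   # LSI-style reduction
    SVC(kernel="linear"),                           # the SVM classifier
)
model.fit(train_docs, labels)
```

A trained pipeline like this can then score an unseen sample with `model.predict(["..."])`, which mirrors the test phase described below.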

Test
The test procedure is shown in Algorithm 2. Our method extracts printable strings from unknown samples and splits the strings into words. The extracted words are converted into lexical features with the constructed language model, and the trained classifier classifies the samples as malicious or benign.

Implementation
Our detection model is implemented in Python 2.7. Gensim [42] provides the LSI and Doc2vec models. Scikit-learn provides the SVM and RF classifiers, XGB is provided as a Python package, and the MLP and CNN are implemented with Chainer. The parameters are optimized in the next section.

Dataset
To evaluate our detection model, hundreds of thousands of PE samples were obtained from multiple sources. One is the FFRI dataset, which is part of the MWS datasets [5]. This dataset contains logs collected from the dynamic malware analysis system Cuckoo sandbox and a static analyzer. It is written in JSON format and categorized into the years 2013 to 2019 based on the year obtained. Each year's data except 2018 contains printable strings extracted from malware samples, which can be used as malicious samples (FFRI 2013-2017, 2019). Note that this dataset does not contain the malware samples themselves. Printable strings extracted from benign samples are contained in the 2019 data as Cleanware. These benign data do not contain time stamps, so we randomly categorized them into 3 groups (Clean A, B, and C). The other samples were obtained from Hybrid Analysis (HA), a popular malware distribution site, from which almost ten thousand samples were obtained. To extract printable strings from these samples, we use the strings command on Linux. This command outputs each string on one line, and our method uses these strings as words. Our extraction method is identical to that of the FFRI dataset. Thus, our dataset is constructed from multiple sources. Table 1 shows the number of each data, unique words, and family names. Tables 2 and 3 show the top family names.
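The extraction step can be reproduced in Python (a sketch mimicking the default behavior of strings(1), i.e., runs of four or more printable ASCII characters; not the exact FFRI extraction code):

```python
import re

def printable_strings(data, min_len=4):
    """Extract runs of printable ASCII characters from raw bytes,
    mimicking the Unix strings(1) default (runs of 4 or more)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]

# Toy byte blob standing in for part of a PE file
blob = b"\x00\x01MZ\x90kernel32.dll\x00\xffLoadLibraryA\x00ab\x00"
words = printable_strings(blob)
# words: ['kernel32.dll', 'LoadLibraryA']
```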
The unique words column indicates the number of distinct words extracted from the whole dataset. In NLP-based detection methods, the number of unique words is important because the classification difficulty and complexity mainly depend on it. The family column indicates the number of distinct malware families as defined by Microsoft Defender. Each benign dataset contains a huge number of unique words, which suggests the benign samples are well distributed and not biased. Each malicious dataset contains a sufficient number of unique words and malware families, which suggests they contain not only subspecies of existing malware but also new malware. Hence, these samples are well distributed and not biased.
In this experiment, TP indicates malicious samples predicted correctly. Since our detection model performs binary classification, the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are used.
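For clarity, the point metrics used later (Accuracy, Precision, Recall, F1) follow the standard definitions, written out here as a pure-Python sketch:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels,
    where 1 (malicious) is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# 4 malicious samples (three detected) and 4 benign (one false alarm)
a, p, r, f = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                            [1, 1, 1, 0, 1, 0, 0, 0])
```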

Parameter optimization
To optimize the parameters of our detection model, the Clean A-B and FFRI 2013-2016 datasets are used. First, the number of unique words is optimized. To construct a language model, our detection model selects frequent words from each class, with the same number of frequent words selected from each class. This process adjusts the lexical features and mitigates class imbalance problems [23]. The F1 score for each model is shown in Fig. 2.
In this figure, the vertical axis represents the F1 score, and the horizontal axis represents the total number of unique words. In the Doc2vec model, the optimum number of unique words is 500. In the LSI model, the F1 score gradually rises and reaches its maximum at 9000.
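The class-balanced vocabulary selection described above can be sketched as follows (hypothetical toy data; the per-class count k corresponds to half the total number of unique words tuned in Fig. 2):

```python
from collections import Counter

def balanced_vocabulary(mal_docs, ben_docs, k):
    """Take the k most frequent words from each class so both classes
    contribute equally, mitigating class imbalance in the corpus."""
    mal_counts = Counter(w for doc in mal_docs for w in doc)
    ben_counts = Counter(w for doc in ben_docs for w in doc)
    mal_top = {w for w, _ in mal_counts.most_common(k)}
    ben_top = {w for w, _ in ben_counts.most_common(k)}
    return mal_top | ben_top

mal = [["kernel32", "loadlibrary"], ["kernel32", "virtualalloc"]]
ben = [["copyright", "version"], ["copyright", "helpstring"]]
vocab = balanced_vocabulary(mal, ben, 1)
# {'kernel32', 'copyright'}: the most frequent word of each class
```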
Thereafter, the other parameters are optimized by grid search, a technique widely used in machine learning research. The optimized parameters are shown in Tables 5 and 6.
Thus, our detection model uses these parameters in the remaining experiments.

Cross-validation
To evaluate the generalization performance, tenfold cross-validation is performed on the Clean A and FFRI 2013-2015 datasets. Figure 3 shows the result. The vertical axis represents the Accuracy (A), Precision (P), Recall (R), or F1 score (F). Overall, each metric shows good performance, and the LSI model is more effective than the Doc2vec model. Thus, the generalization performance of our detection model is almost perfect.

Time series analysis
To evaluate the detection rate (recall) for new malware, the time series is important. The purpose of our method is to detect unknown malware, so in practical use the test samples should not predate the training samples. To address this, we consider the time series of the samples: as shown in Table 1, the training samples are selected from the earlier ones. Moreover, the benign samples account for the majority of the test samples, so the test set is imbalanced, which represents a more practical environment. Thus, the combination of each dataset is more challenging than cross-validation. The results of the time series analysis are shown in Figs. 4 and 5.
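The time-series protocol amounts to splitting by year so that no test sample predates the training data (an illustrative sketch; the year keys and sample names are hypothetical):

```python
def time_series_split(samples_by_year, last_train_year):
    """Train only on samples from last_train_year or earlier; everything
    later is held out for testing, so no test sample predates training."""
    train = [s for y, ss in samples_by_year.items() if y <= last_train_year for s in ss]
    test = [s for y, ss in samples_by_year.items() if y > last_train_year for s in ss]
    return train, test

data = {2013: ["a", "b"], 2014: ["c"], 2016: ["d", "e"]}
train, test = time_series_split(data, 2015)
# train: ['a', 'b', 'c'], test: ['d', 'e']
```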
The vertical axis represents the recall. Overall, the recall gradually decreases as time proceeds. The detection rates on the FFRI dataset are better than those on the HA dataset, presumably because the training samples were obtained from the same source. Nonetheless, the recall on HA stays at almost 0.9. Note that these samples were identified by VirusTotal and categorized based on the year first defined. The LSI model was more effective than the Doc2vec model, and among the classifiers, the SVM and MLP performed well. Thus, our detection model is effective against new malware. Furthermore, the same technique [23] mitigates class imbalance problems in executable files.
To visualize the relationship between sensitivity and specificity, the ROC curves of each model with FFRI 2016 are depicted in Figs. 6 and 7. The vertical axis represents the true positive rate (recall), and the horizontal axis represents the false positive rate. Our detection model maintains practical performance with a low false positive rate. As expected, the LSI model was more effective than the Doc2vec model. The best AUC score is 0.992, achieved with the LSI and SVM combination.
The required time for training and test of FFRI 2016 is shown in Table 7.
This experiment was conducted on a computer with Windows 10, a Core i7-5820K 3.3 GHz CPU, 32 GB of DDR4 memory, and a Serial ATA 3 HDD. The training time depends mainly on the classifier: complicated classifiers such as CNN require more time for training. The test time stays flat regardless of the classifier, and the time to classify a single file is almost 0.1 s. This speed seems sufficient to examine all suspicious files from the Internet.

Practical performance
In a practical environment, actual samples are more diverse, so the experimental samples might not represent the population appropriately. To mitigate this problem, a larger-scale dataset has to be used, and the training set should be smaller. To represent the actual sample distribution, the FFRI 2019 and Clean A-C datasets are used; together they contain 500,000 samples. These samples are randomly divided into 10 groups: one group is used as the training samples, and the remaining 9 groups are used as the test samples. The training and test are repeated 10 times. The average result of the practical experiment is shown in Fig. 8. The vertical axis represents the Accuracy (A), Precision (P), Recall (R), or F1 score (F). Note that the training samples account for only 10 percent, so the dataset is highly imbalanced. The LSI and SVM are the best combination, with a best F1 score of 0.934. Thus, our detection model is effective in a practical environment. The detection rates of known and unknown malware are shown in Table 8.
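The 10-group protocol above can be sketched as follows (illustrative only; real samples replace the integers):

```python
import random

def one_train_nine_test(samples, seed=0):
    """Shuffle, split into 10 groups, and yield (train, test) pairs where
    a single group trains and the other nine test; 10 rounds in total."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[i::10] for i in range(10)]
    for i in range(10):
        train = groups[i]
        test = [s for j, g in enumerate(groups) if j != i for s in g]
        yield train, test

rounds = list(one_train_nine_test(list(range(100))))
# 10 rounds; each trains on 10 samples and tests on the remaining 90
```

Note this is the inverse of the usual cross-validation split: the small partition trains and the large one tests, which keeps the test set highly imbalanced and closer to a practical deployment.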
The detection rate for unknown malware is on the same level as for known malware. Thus, our detection model is effective not only against subspecies of existing malware but also against new malware.

Defeating packed malware and anti-debugging techniques
Our detection model uses lexical features extracted from printable strings. These features include API and argument names, which are useful for identifying malware. These useful strings, however, are obfuscated in packed malware, and several anti-debugging techniques might vary the lexical features. The test samples of the time series analysis are categorized by PEiD into 4 types: packed or unpacked, and anti-debugging or no anti-debugging. PEiD detects the most common packers and anti-debugging techniques with more than 470 different signatures for PE files. Since the FFRI dataset does not contain malware samples, we analyzed the HA dataset. Table 9 shows the detection rate for each malware type, and Tables 10 and 11 show the top names. Contrary to expectations, each detection rate is on the same level. We also analyzed the benign samples, which contain 27,196 packed samples and 77,893 samples with anti-debugging techniques. The detection rate for each type is 0.988 to 0.997. Therefore, our method does not seem to require deobfuscation. Thus, our method is effective against packed malware and anti-debugging techniques.

Limitation
We are aware that our study may have some limitations. The first is attributed to our dataset. As described previously, this study used more than 500,000 samples; actual samples, however, might be more diverse. Hence, our dataset might not represent the population appropriately. As a matter of fact, we cannot use all actual samples for evaluation. To the best of our knowledge, the best available mitigation is to use large-scale datasets from multiple sources.
The second is a lack of detailed analysis. In this study, we used a simple signature-based packer detector, which has approximately 30 percent false negatives [38]. We did not completely identify the packer names of our samples, so our experimental results may not apply to some sophisticated packers that are not detected by signature-based packer detectors. We identified our samples with VirusTotal and Microsoft Defender. As reported in a previous paper, labels in VirusTotal can change over time [56], which may affect the accuracy of our experiment. Further analysis is required to resolve these issues.
The third is a lack of comparison. In this paper, we focused on practical performance and did not compare our method with other related studies. Owing to differences in datasets and implementations, a fair comparison was not feasible; further experiments are required to provide one.

Conclusion
In this paper, we applied NLP techniques to malware detection and revealed that printable strings with NLP techniques are effective for detecting malware in a practical environment. Our dataset consists of more than 500,000 samples obtained from multiple sources. The training samples were selected from the earlier samples, and the test samples contain many benign samples and are thereby imbalanced. In a further experiment, the training samples account for only 10 percent, making the dataset highly imbalanced. Thus, our dataset represents a more practical environment. Our experimental results show that our detection model is effective not only against subspecies of existing malware but also against new malware. Furthermore, our detection model is effective against packed malware and anti-debugging techniques.
Our study clearly has some limitations. Despite this, we believe it can be a starting point for evaluating practical performance. Our method might be applicable to other architectures. In future work, we will analyze the samples in more detail; a more precise packer detector will improve the reliability of this study.

Conflict of interest
The authors declare that they have no conflict of interest.
Funding This work was supported by JSPS KAKENHI Grant Number 21K11898.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.