MAPAS: a practical deep learning-based Android malware detection system

Many malicious applications appear every day, threatening numerous users. Consequently, a surge of studies has sought to protect users from newly emerging malware by using machine learning algorithms. Although existing machine learning- and deep learning-based Android malware detection approaches achieve high accuracy by combining multiple features, their high computational cost makes them impractical to deploy on mobile devices. In this paper, we propose MAPAS, a malware detection system that achieves high accuracy with adaptable use of computing resources. MAPAS analyzes the behaviors of malicious applications based on their API call graphs by using convolutional neural networks (CNN). However, MAPAS does not use the classifier model generated by CNN; it uses CNN only to discover common features of the API call graphs of malware. To detect malware efficiently, MAPAS employs a lightweight classifier that calculates the similarity between API call graphs used for malicious activities and the API call graphs of the applications to be classified. To demonstrate the effectiveness and efficiency of MAPAS, we implement a prototype, thoroughly evaluate it, and compare it with a state-of-the-art Android malware detection approach, MaMaDroid. Our evaluation results demonstrate that MAPAS classifies applications 145.8% faster and uses around ten times less memory than MaMaDroid. MAPAS also achieves higher accuracy (91.27%) than MaMaDroid (84.99%) in detecting unknown malware. In addition, MAPAS can generally detect any type of malware with high accuracy.

However, previous deep learning-based malware detection approaches commonly incur a very high cost in computing resources because they combine multiple features to achieve high accuracy [71]. For example, a classifier model generated by a convolutional neural network (CNN) requires an enormous amount of memory to classify data [44]. Consequently, although previously proposed deep learning-based malware detection systems can achieve very high accuracy, they are unlikely to be deployable on mobile devices or personal computers, whose computing resources are limited. Therefore, it is of great importance to develop a malware detection approach that protects users from newly emerging malware and can be used in practice.
In this work, we propose a practical malware detection system, MAPAS, that achieves high accuracy against known and unknown malware with adaptable use of computing resources. MAPAS learns the behaviors of malicious applications from their API call graphs by using a deep learning algorithm (CNN). It then detects malware based on common patterns in the API call graphs of malware. To detect malware efficiently, MAPAS does not use the classifier model created by CNN; instead, it uses a lightweight classifier that applies the Jaccard similarity algorithm [3] to calculate a similarity score between API call graphs used for malicious activities and the API call graphs of the applications to be classified.
To show the effectiveness and efficiency of MAPAS, we thoroughly evaluate our prototype and compare it with a state-of-the-art Android malware detection approach, MaMaDroid [61]. MaMaDroid also uses API call graphs to detect malware based on its behaviors. Our evaluation results demonstrate that MAPAS classifies applications faster and uses much less memory than the previous approach. Specifically, MAPAS classifies applications 145.8% faster and uses around ten times less memory than MaMaDroid (when MaMaDroid uses the random forest algorithm). In addition, MAPAS achieves higher accuracy (91.27%) than MaMaDroid (84.99%) in detecting unknown malware (i.e., when both systems classify malware released later than the samples in our training dataset).
In summary, this paper makes the following contributions:
This paper is organized as follows. We first provide technical background in Sect. 2. Section 3 explains the goals of MAPAS, and Sect. 4 presents its specific design. We evaluate MAPAS to demonstrate its effectiveness and efficiency in Sect. 5. Previous studies are discussed in Sect. 6. Finally, Sect. 7 concludes the paper.
We release the source code of our proof-of-concept implementation at https://github.com/okokabv/MAPAS.

Background
In this section, we introduce malware detection methods and a common limitation that hinders the practical use of machine learning- and deep learning-based Android malware detection approaches, which are the mainstream of malware detection.

Detecting Android malware
Android malware detection approaches can be categorized into two groups according to the analysis method used to collect features of malware: (1) dynamic analysis-based approaches and (2) static analysis-based approaches.
Dynamic analysis-based malware detection approaches have an advantage over static analysis-based approaches in analyzing the concrete behaviors of malware [5,12,14,22,26,27,32,36,67,70,73,74,81,83,87,90]. They also have the advantage of being able to analyze malware equipped with anti-analysis mechanisms such as obfuscation. However, dynamic analysis typically consumes considerable resources and time because applications must actually be executed.

Typical features used for static analysis-based malware detection approaches
The first step in developing a malware detection system is to decide which features of malware distinguish it from benign applications. Typically, developer-written descriptions, user reviews, permissions, opcodes and APIs are used as such features.
Developer-written descriptions: A few studies employed developer-written descriptions of applications as a key feature for detecting malware [53,62]. However, detecting malware from developer-written descriptions is unreliable because the accurate execution behaviors of applications can hardly be inferred from them.
User reviews: Some Android malware detection approaches employed user reviews as an important feature [33,41]. However, as with the approaches based on developer-written descriptions, their accuracy is not high enough for practical use because user reviews usually do not contain concrete explanations of applications that can be used for detecting malware.
Opcode: Several previous studies showed that there are common opcode patterns that can be used to classify malicious applications [16,54,66,85]. They used common patterns of bytecode opcodes, such as move and invoke, in malicious applications.
Permissions: Many studies detect malware based on the permissions that applications require (e.g., access to a user's location, phone information, or a mobile device's network status) [10,19,23,42,46,63,64,76]. These approaches detect malware by using permissions that malicious applications commonly request, such as the network permission combined with access to users' location. However, Avdiienko et al. [11] showed that, similar to malware, most benign Android applications access sensitive user information and use many permissions that are also typical of malware. Consequently, permission-based malware detection approaches can incur a high false positive rate.
APIs: Many approaches attempted to classify malicious applications based on the APIs used in them [2,18,30,34,37,40,58,61]. By analyzing the APIs used in an application, we can understand the functionalities that the application provides to users. For example, if an application uses APIs such as android.telephony and android.telecom, we can infer that the application monitors a mobile phone's network status and manages phone calls. As such, Android APIs provide functional information about what an application does, and we can infer an application's behavior from the APIs it uses. However, if we use only APIs as the key feature for identifying malware, we may suffer many false positives because analyzing APIs does not reveal an application's concrete behaviors and many common APIs are used in both benign and malicious applications [11].

Unpractical machine/deep learning-based Android malware detection approaches
Over the past several years, a surge of studies has proposed detecting Android malware with machine learning- or deep learning-based approaches, which classify malicious applications based on the features discussed in the previous section [23,30,31,34,38,45,47,49,51,54,58,77,78,85,88,92]. The notable advantage of deep learning algorithms is that they can eliminate the need for domain expertise and manual feature extraction because they learn the features of data algorithmically [68]. However, previous approaches commonly incur a very high cost (in terms of computing resources and time) because they combine multiple features to achieve high accuracy [71]. Consequently, even though they can achieve high accuracy, their high cost makes them difficult to employ in practice.

Goal
In this work, our goal is to detect malicious applications efficiently while achieving high accuracy, in order (1) to reduce the cost of detection and (2) to cope with the increasing volume of Android malware. To this end, we optimize the Android malware detection process by combining a deep learning algorithm with a deep learning interpretation approach that extracts the dominant, common features used in malware. Deep learning-based malware detection approaches have shown high accuracy but have the disadvantage of consuming considerable computing resources and time (as discussed in Sect. 2.3). In general, both constructing a classifier model with a deep learning algorithm and using that model to classify malware are very expensive because these approaches use complex features to increase accuracy. In this paper, we use a deep learning algorithm with a deep learning interpretation approach not to distinguish malicious applications from benign ones, but only to identify high-weight features of malware. We then build a low-cost classifier that finds malicious applications based only on the high-weight features identified by the deep learning algorithm. In this way, we avoid heuristic feature selection and reduce the computing resources and time needed to detect malware (Fig. 1).

Design overview
Malware features used: In this work, we attempt to detect malicious applications based on common patterns in their API call graphs. With API call graphs, we can find the concrete malicious behaviors of malicious applications [20,50,72].
To be specific, MAPAS uses a deep learning algorithm to analyze frequently used patterns of API call graphs that can lead to leakage of sensitive information (social security numbers, credit card numbers, passwords, etc.). MAPAS then detects malware based on the identified patterns of malicious API call graphs. The design of MAPAS consists of the following three steps:
(1) Data preprocessing: As the first step, MAPAS generates the training dataset by extracting API call graphs from malicious and benign applications. Specifically, MAPAS obtains API call graphs by conducting taint analysis with Flowdroid [9].
(2) Identifying high-weight API call graphs: In this step, MAPAS first vectorizes the training dataset and performs deep learning on it by using convolutional neural networks (CNN). After the learning phase finishes, MAPAS uses a deep learning interpretation approach, Grad-CAM, to discover the high-weight API call graphs used in malicious applications.
(3) Malware detection: In the last step, MAPAS classifies malware by using the Jaccard algorithm, which calculates the similarity between the API call graphs of an application and the high-weight API call graphs of malicious applications.

Data preprocessing for generating training dataset
MAPAS extracts the API call graphs of applications by conducting taint analysis. Taint analysis is a static analysis method used to track data flows in an application. Specifically, we use taint analysis to analyze data flows from sources that read sensitive data (e.g., a function reading a password) to sinks that can transfer data (e.g., a function writing to a socket), thereby identifying whether sensitive information can be leaked. Hence, we can find potential leaks of sensitive information in an application. For MAPAS, we chose a static analysis tool based on the evaluation results of Arzt [8] and Qiu et al. [65]. Many taint analysis tools exist, such as Flowdroid [9], AppScan [28], Epicc [60], JoDroid [56], DroidSafe [25] and Amandroid [80]. Among them, Arzt [8] and Qiu et al. [65] showed that Flowdroid has the best overall results in terms of accuracy and runtime performance. Therefore, in this work, we generate API call graphs based on the taint analysis results from Flowdroid [9]. The detailed process for generating API call graphs with Flowdroid is shown in Fig. 2.
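To illustrate what a source-to-sink data flow looks like, the following minimal sketch enumerates call paths from a source to a sink in a toy call graph. It is not Flowdroid and does not perform real taint analysis; the graph, the method names (`Source.readPassword`, `Helper.encode`, `Sink.writeSocket`, `Logger.debug`) and the `find_flows` helper are hypothetical placeholders chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy call graph: each key calls the methods in its list.
# The names are placeholders, not real Android APIs.
CALL_GRAPH: Dict[str, List[str]] = {
    "Source.readPassword": ["Helper.encode"],
    "Helper.encode": ["Sink.writeSocket", "Logger.debug"],
    "Logger.debug": [],
    "Sink.writeSocket": [],
}
SOURCES: Set[str] = {"Source.readPassword"}  # methods that read sensitive data
SINKS: Set[str] = {"Sink.writeSocket"}       # methods that can transfer data out


def find_flows(graph: Dict[str, List[str]],
               sources: Set[str],
               sinks: Set[str]) -> List[List[str]]:
    """Return every acyclic call path that starts at a source and ends at a sink."""
    flows: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        if node in sinks:
            flows.append(path)
            return
        for callee in graph.get(node, []):
            if callee not in path:  # do not revisit a method (avoids cycles)
                walk(callee, path + [callee])

    for source in sorted(sources):
        walk(source, [source])
    return flows


flows = find_flows(CALL_GRAPH, SOURCES, SINKS)
for flow in flows:
    print(" -> ".join(flow))
```

Each returned path plays the role of one API call graph in the sense used here: a specific sequence of calls through which sensitive data could reach a sink.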
It is worth noting that we exclude applications with obfuscated API calls from the taint analysis. Among obfuscation techniques such as renaming, control flow obfuscation, string encryption, API hiding and class encryption [52], Flowdroid cannot extract API call graphs from applications that use API hiding or class encryption. Therefore, MAPAS has to exclude obfuscated applications from which Flowdroid cannot extract API call graphs. We leave addressing this limitation as future work (Fig. 3).

Deep learning and identifying high-weight API call graphs from malware
MAPAS trains a deep learning algorithm (CNN) [44] on the training dataset. While learning the dataset, the algorithm finds important features in the collected API call graphs used in malware and constructs a classification model. MAPAS then discovers the important features by using a deep learning interpretation approach.
Vectorizing API call graphs: To apply deep learning to API call graphs, which are text-type data, they must be converted into vectors. To vectorize text-type data, we can map each word in the data to an integer and create a vector of the mapped integers. Alternatively, we can vectorize text-type data by analyzing the correlation between words, known as word2vec [55], or the correlation between documents, known as doc2vec [43]. MAPAS does not use vectorization methods such as word2vec and doc2vec; instead, it vectorizes API call graphs by simply mapping each API call graph to an integer. The API call graphs that MAPAS needs to find in order to detect malicious applications are specific sequences of function calls from sources to sinks, as discussed in Sect. 4.2. Each malicious API call graph represents a possible case of a sensitive information leak. Therefore, to detect malware, MAPAS should focus on detecting the existence of such API call graphs rather than analyzing the relationships between them.
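The integer mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: each API call graph is assumed to be serialized as a string, the helper names `build_vocab` and `vectorize` are hypothetical, and reserving 0 for padding (and for graphs not seen during training) is a choice made only for this example.

```python
from typing import Dict, List


def build_vocab(apps: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct API call graph a positive integer ID, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for graphs in apps:
        for graph in graphs:
            if graph not in vocab:
                vocab[graph] = len(vocab) + 1  # 0 is reserved for padding
    return vocab


def vectorize(graphs: List[str], vocab: Dict[str, int], length: int) -> List[int]:
    """Map an application's API call graphs to IDs, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(graph, 0) for graph in graphs][:length]
    return ids + [0] * (length - len(ids))


# Hypothetical applications, each given as a list of serialized API call graphs.
apps = [
    ["A->X", "B->Y"],
    ["B->Y", "C->Z", "A->X"],
]
vocab = build_vocab(apps)
vectors = [vectorize(app, vocab, length=4) for app in apps]
print(vocab)
print(vectors)
```

Because every call graph is treated as an opaque token, the vector records only which graphs occur in an application, which matches the stated goal of detecting their existence rather than modeling relationships between them.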
Learning the dataset: MAPAS analyzes API call graphs that are commonly used in malware and can leak sensitive information. To this end, MAPAS uses CNN [44] to learn the vectorized dataset. CNN is an effective deep learning algorithm for text-type data because it exploits regional information in the data [35]. Please refer to "Appendix A" for details on CNN. By learning the vectorized dataset with CNN, MAPAS can find common patterns of API call graphs that are frequently used in actual malicious applications. The overall learning process in MAPAS is illustrated in Fig. 4.
Finding high-weight features with a deep learning interpretation approach: Deep learning models are black boxes. Due to their multilayer and nonlinear structures, their predictions are not transparent [57]. Because CNN also operates as a black box, we cannot directly determine from a CNN-generated classifier model which API call graphs have high weights (i.e., which API call graphs are important) for detecting malware. Hence, several deep learning interpretation approaches have been proposed to reveal the specific data that substantially contributed to a classifier model generated by a deep learning algorithm [4,29].
To observe the high-weight API call graphs analyzed by CNN, MAPAS employs Grad-CAM [69], which produces visual explanations from CNN-based models. Please refer to "Appendix B" for more details on the approach.
As a result of using Grad-CAM, MAPAS found a high-weight API call graph whose source is android.content and whose sink is java.net. This call graph can leak a user's sensitive information over the network.
After discovering high-weight features with Grad-CAM, MAPAS can distinguish malicious applications from benign ones based on such features. Note that, to reduce the use of computing resources, MAPAS does not detect malware with the classifier model generated by CNN. In Sect. 5, we demonstrate the effectiveness and efficiency of MAPAS by comparing it to the classifier model generated by CNN (Fig. 5).

Malware detection
For detecting malicious applications, MAPAS measures the similarity between two sets (the high-weight API call graphs and the call graphs extracted from an unclassified application) by using the Jaccard similarity algorithm [3], as shown in Fig. 6. The Jaccard similarity takes a value between 0 and 1. If the two sets are exactly equal, the similarity score is 1, and if the two sets are totally different, the similarity score is 0. For two sets $A$ and $B$, the Jaccard similarity is expressed as follows.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Fig. 6 Malware classification process of MAPAS

MAPAS considers an application to be malware if the similarity score is higher than a threshold (0.4303), which we set based on the testing results in Sect. 5.2.
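The classification rule can be sketched in a few lines. This is an illustrative sketch rather than the released code: the function names `jaccard` and `is_malware`, the example call graph labels, and the convention of returning 0 for two empty sets are assumptions made here; only the threshold value 0.4303 comes from the text.

```python
from typing import Iterable

THRESHOLD = 0.4303  # threshold reported in the text


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(set_a & set_b) / len(union)


def is_malware(app_graphs: Iterable[str],
               high_weight_graphs: Iterable[str],
               threshold: float = THRESHOLD) -> bool:
    """Flag an application when its similarity to the high-weight graphs exceeds the threshold."""
    return jaccard(app_graphs, high_weight_graphs) > threshold


# Hypothetical call graph labels, for illustration only.
high_weight = {"g1", "g2", "g3"}
suspicious_app = {"g2", "g3", "g4"}  # shares 2 of 4 distinct graphs -> 0.5
ordinary_app = {"g1", "g4", "g5"}    # shares 1 of 5 distinct graphs -> 0.2

print(jaccard(suspicious_app, high_weight), is_malware(suspicious_app, high_weight))
print(jaccard(ordinary_app, high_weight), is_malware(ordinary_app, high_weight))
```

Since the classifier needs only set intersection and union, its cost grows with the number of call graphs per application rather than with the size of a trained neural network.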

Evaluation
In this section, we evaluate MAPAS to demonstrate its effectiveness and efficiency. Our evaluation addresses the following research questions: RQ

Experimental configuration
Setup: We performed our evaluations on a workstation running Ubuntu 18.04 with a 20-core Intel Xeon Gold 6230 CPU at 2.10 GHz, 128 GB of RAM and an NVIDIA GeForce RTX 2080 GPU.
Datasets: We first collected the top 10,000 applications from Google Play Store [24]. We then randomly downloaded 10,653 malicious applications released in 2018 and 2019 from VirusShare [75]. In addition, we used 23,039 malicious applications from the Android Malware Dataset (AMD) [79]. Wei et al. classified the AMD into 70 categories [79]. Table 1 shows the number of applications used for our evaluation. The training dataset is used for generating a classifier model with CNN. We used the test dataset to evaluate the effectiveness of MAPAS.

Finding high-weight features
Training dataset: We used 9000 malicious applications provided by VirusShare [75] and 9000 benign applications downloaded from Google Play Store [6] to train a classifier model with CNN. To this end, we extracted API call graphs from the 18,000 applications by using Flowdroid [9]. In total, we obtained 21,690 unique API call graphs and used them as the training dataset.

Model learning and verification
We trained a classifier model by using CNN with the training dataset. Next, we verified the classifier model with k-fold cross-validation. The accuracy of the classifier model measured by the validation is 0.9695 on average. To pick a threshold, we measured the Jaccard similarity between the high-weight API call graphs and the API call graphs extracted from malicious applications and from benign ones. As a result, the similarity scores are 0.561 and 0.2996, respectively. We used the average (0.4303) of the two scores as the threshold for detecting malware. By using the average, MAPAS avoids biased results. In other words, it avoids a bias toward either false negatives (classifying an application as benign when it is actually malware) or false positives (classifying an application as malware when it is actually benign).
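The threshold arithmetic can be checked with a short sketch. The helper name `midpoint_threshold` is hypothetical; the two inputs are the similarity scores reported above for malicious and benign applications.

```python
from statistics import mean


def midpoint_threshold(malicious_score: float, benign_score: float) -> float:
    """Return the average of the two class-level similarity scores."""
    return mean([malicious_score, benign_score])


# Scores reported in the text for malicious and benign applications.
threshold = midpoint_threshold(0.561, 0.2996)
print(round(threshold, 4))
```

Placing the threshold midway between the two scores leaves the same margin (about 0.13) on either side, so neither class is favored by construction.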

Performance evaluation of MAPAS with the CNN classifier model
MAPAS uses the Jaccard similarity algorithm as a classifier to detect malware. We evaluated the performance and the computing resource usage of MAPAS's malware detection process. We also measured the performance and the computing resource usage of the classifier model generated by CNN. For this evaluation, we used 1000 malicious applications and 1000 benign applications from the test dataset, as shown in Table 1. Table 3 shows the experimental results. To classify the 2000 applications, MAPAS took 21.18 s (1.059 ms on average) on a single core. The CNN classifier model processed them in 15.92 s (0.796 ms on average) by using one GPU. It is worth noting that, when we used the CNN classifier model without a GPU, we could not finish processing the 2000 applications within 24 h. In addition, as shown in Table 3, the CNN classifier model used 10,590 MiB of GPU memory and about 2070 MB of RAM (1214.16% more than MAPAS). We also measured the detection accuracy. The CNN classifier model showed an 11% lower detection rate than MAPAS.

Performance evaluation of MAPAS with MaMaDroid
We compare the performance of MAPAS with that of previous work, MaMaDroid [61]. Similar to MAPAS, MaMaDroid uses the API call graphs of malicious applications to detect them. For the comparison, we created a classifier for each system by using the 9000 benign applications and 9000 malicious ones in the training dataset. MaMaDroid converted API call graphs into Markov chains [59] and created a classifier by learning 198,916 features. In contrast, MAPAS used 21,659 unique API call graphs to create a classifier. By default, MaMaDroid uses random forest (RF) [13] and k-nearest neighbors (k-NN) [17]. In this evaluation, we used only a CPU, without a GPU, for both MaMaDroid and MAPAS.
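To give a rough idea of how call sequences can be turned into Markov chain features, the following sketch estimates transition probabilities between abstracted states. It is a simplified illustration, not MaMaDroid's implementation; the helper name `transition_probabilities` and the state labels in the example are placeholders assumed here.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(sequences: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next state | current state) from observed call sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sequence in sequences:
        # Count every consecutive (current, next) pair in the sequence.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    probabilities: Dict[str, Dict[str, float]] = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {state: n / total for state, n in followers.items()}
    return probabilities


# Hypothetical call sequences over abstracted states (placeholder labels).
sequences = [
    ["android", "java", "android"],
    ["android", "self-defined"],
]
probs = transition_probabilities(sequences)
print(probs)
```

With S states, such a chain yields up to S × S transition probabilities per application, which suggests why a Markov chain representation can produce a much larger feature set than a list of unique call graphs.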
Performance of the learning process: Figure 7 shows the evaluation results of the learning phase of each system. MaMaDroid+CNN used about 1214% more RAM than MAPAS in the learning phase (MAPAS used 2.26 GB of RAM and MaMaDroid+CNN used 34 GB of RAM). Also, MaMaDroid+CNN spent 5.45 times as much time as MAPAS did in the learning phase.
Performance of the classification process: To evaluate the classification process of MAPAS and MaMaDroid, we used each system to classify 2000 applications in the test dataset. The evaluation results are shown in Figs. 8 and 9. Overall, MaMaDroid using the random forest algorithm (MaMaDroid+RF) showed the best accuracy, as in Fig. 8. MAPAS achieves about 3% lower accuracy than MaMaDroid+RF. However, MAPAS showed the shortest execution time and the lowest RAM usage, as illustrated in Fig. 9. To be specific, MAPAS classifies applications 76.4% and 145.8% faster than MaMaDroid+RF and MaMaDroid+k-NN, respectively, while using much less memory (around ten times less than MaMaDroid+RF).
Detecting malware of various categories: We evaluated the effectiveness of MAPAS and MaMaDroid+RF in detecting Android malware in the 70 categories defined by Wei et al. [79]. The measurement results are shown in Table 4. MAPAS showed about 99% accuracy for the 70 malware categories.
Detecting unknown malware: We evaluated the performance of MAPAS and MaMaDroid+RF in detecting unknown malicious applications. To this end, we collected malware released later than the applications in the training dataset from VirusShare [75]. As in Table 4, MAPAS showed 91% accuracy in detecting unknown malware, which is about 6 percentage points higher than MaMaDroid.
The closest related work to this paper is MaMaDroid [61], which used Markov chains [59] to calculate the probability of transition from the current state (sources) to another state (sinks) in the API call graphs used in malicious applications. MaMaDroid then used k-NN and random forest algorithms to learn from the Markov chains and to generate a classifier model. In addition, DeepFlow [92] and EveDroid [45] also used API call graphs for detecting malware. They focused especially on detecting newly emerging malicious applications by using a deep learning algorithm.

Conclusion
In this paper, we proposed MAPAS, an effective and efficient malware detection approach. MAPAS analyzes common features of API call graphs extracted from malicious applications by using a deep learning algorithm. Then, it detects malware based on the features with a lightweight classifier for the efficiency. Our evaluation results showed that MAPAS outperforms a state-of-the-art approach, MaMaDroid [61], in terms of the usage of computing resources and the accuracy for detecting unknown malware. Also, MAPAS can generally detect any type of malware with high accuracy.

Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A: CNN
A convolutional neural network (CNN) is an artificial neural network that uses the convolution operation. The convolution operation extracts regional information from the data as a filter (or kernel) moves over it. The filter computes the convolution while moving across the input data at a fixed interval, which is called the stride. The output of the convolution operation is called a feature map. The ReLU activation function is applied to the feature map to keep only positive values. After that, a new layer is created by pooling, a method that reduces the size of a feature map while emphasizing feature information. As the number of filters in the convolution operation increases, the number of feature maps increases. With more feature maps, there is a risk of high memory use and of overfitting due to the large number of features. CNN uses pooling to mitigate these problems. There are two common kinds of pooling: max pooling and average pooling. Max pooling reduces the size by keeping only the largest value within each region of the feature map, whereas average pooling reduces the size by taking the average of the values within each region. After that, the extracted output is flattened into one dimension and passed to a fully connected (FC) layer. In the FC layer, the result is classified by using the softmax activation function.
CNN was originally designed for processing images. However, CNN has recently also been widely used for natural language processing [35].
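The convolution, ReLU and max pooling steps can be traced on a tiny one-dimensional input. The sketch below uses plain Python with made-up numbers and hypothetical helper names (`conv1d`, `relu`, `max_pool`); as is customary in CNN libraries, the "convolution" slides the kernel without flipping it.

```python
from typing import List


def conv1d(signal: List[float], kernel: List[float], stride: int = 1) -> List[float]:
    """Slide the kernel over the signal and take a weighted sum at each position."""
    width = len(kernel)
    return [
        sum(signal[start + offset] * kernel[offset] for offset in range(width))
        for start in range(0, len(signal) - width + 1, stride)
    ]


def relu(values: List[float]) -> List[float]:
    """Keep positive values and replace the rest with zero."""
    return [max(0.0, value) for value in values]


def max_pool(values: List[float], size: int) -> List[float]:
    """Keep the largest value in each non-overlapping window of the given size."""
    return [max(values[start:start + size])
            for start in range(0, len(values) - size + 1, size)]


signal = [1.0, 2.0, -1.0, 0.0, 3.0, 1.0]  # made-up input
kernel = [1.0, -1.0]                      # made-up filter

feature_map = relu(conv1d(signal, kernel, stride=1))
pooled = max_pool(feature_map, size=2)
print(feature_map)
print(pooled)
```

In this example, the five convolution outputs are reduced to two pooled values, which shows how pooling shrinks a feature map while keeping its strongest responses.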

Appendix B: CAM
Because a deep learning algorithm operates in a black-box manner, it is important to interpret which values affected the classification results of a deep learning model. Therefore, in recent years, many interpretation approaches have been proposed to identify the features that have an important influence on the classification results of a deep learning model [4,29]. Among these interpretation approaches, the class activation map (CAM) is used for CNN [91]. CAM uses global average pooling (GAP) instead of FC to extract the feature information of CNN. With $A^k_{ij}$ denoting the value at position $(i, j)$ of the $k$-th feature map of the last convolutional layer and $w^c_k$ denoting the weight that connects the GAP output of that feature map to class $c$, the formula of CAM is as follows.

$$M^c_{ij} = \sum_k w^c_k A^k_{ij}$$

However, CAM can be used only after replacing FC with GAP in the model.
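Grad-CAM: Gradient-weighted class activation map (Grad-CAM) [69] extracts the features that affect the result without requiring a GAP layer in the model. To calculate the value of CAM, Grad-CAM uses the gradient of the class score with respect to the last convolutional layer, obtained by backpropagation. With $y^c$ denoting the score of class $c$, $A^k_{ij}$ the value at position $(i, j)$ of the $k$-th feature map of the last convolutional layer, and $Z$ the number of positions in a feature map, the formula of Grad-CAM is as follows.

$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha^c_k A^k\right)$$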
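The Grad-CAM computation can be sketched for a one-dimensional convolutional layer: each feature map is weighted by the average gradient of the class score with respect to that map, the weighted maps are summed position by position, and ReLU is applied. The sketch below assumes the feature maps and gradients have already been obtained from a trained model; the helper name `grad_cam` and all numbers are made up for illustration.

```python
from typing import List


def grad_cam(feature_maps: List[List[float]],
             gradients: List[List[float]]) -> List[float]:
    """Compute a 1D Grad-CAM map.

    feature_maps[k][i] is the activation of feature map k at position i.
    gradients[k][i] is the gradient of the class score with respect to that activation.
    """
    # Weight of each feature map: the average of its gradients over all positions.
    weights = [sum(channel) / len(channel) for channel in gradients]

    positions = len(feature_maps[0])
    cam = []
    for i in range(positions):
        weighted_sum = sum(w * fmap[i] for w, fmap in zip(weights, feature_maps))
        cam.append(max(0.0, weighted_sum))  # ReLU keeps only positive evidence
    return cam


# Made-up activations and gradients for two feature maps over three positions.
feature_maps = [[1.0, 2.0, 0.0],
                [0.0, 1.0, 3.0]]
gradients = [[0.5, 0.5, 0.5],
             [-1.0, -1.0, -1.0]]

cam = grad_cam(feature_maps, gradients)
print(cam)
```

Positions with a large value in the resulting map are the ones that pushed the class score up the most; in the setting described here, such positions would correspond to the high-weight API call graphs.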