Introduction

The legendary Android operating system has dominated the smartphone industry since 2011 (Statista 2011). The Android operating system has approximately 2.5 billion active users from 190 countries, according to Android Statistics (Business of Apps: Android Statistics 2022). In this digital era, Android smartphones play an essential role in fulfilling multiple user needs. Therefore, its impacts on various aspects of society are immense. The Android operating system dominated the global market share since 2014 and it loomed over 87% of the market share in 2022 (Business of Apps: Android Statistics 2022). Android has become a prime target for cybercriminals because of its widespread use and open-source nature. Cybercriminals and anti-malware developers are constantly at odds in the realm of malware detection. With evolving technologies, the two protagonists have quickly modified their strategies. Malicious actors typically strive to profit in an unethical or even unlawful way. Mobile malware could steal sensitive and confidential user data, misuse the user's device to send SMS to premium text services, or install adware that causes users to view malicious websites or download other malware. The researchers have conducted several studies to develop countermeasures against Android operating systems and application security issues. Generally, most of the studies are based on the detection of generic Android malware, and studies based on the malware categories are relatively few. The malicious patterns can effectively be identified by the malware category recognition. Android malware detection can be categorized into the signature and behavior-based detection. The signature-based approach detects malicious behavior only after malware attacks have already occurred (a posteriori event), because it generates well-defined patterns (Portokalidis et al. 2006; Wressnegger et al. 2017; Oyama et al. 2012). Also, signature-based solutions rely on cryptographic algorithms or similarity measurement techniques (Tchakounté et al. 2021). This traditional method successfully detects well-known dangerous patterns because these malicious patterns are already stored in the database, whereas it cannot detect zero-day attacks.

On the other hand, the behavior-based approaches are further classified into static, dynamic, and hybrid methods. Researchers frequently use static analysis to extract features without running applications on a real device or emulator. This approach is particularly appealing since it requires less computing time and overhead during implementation. Static analysis cannot detect malware that hides or obfuscates its abusive behavior during execution. Dynamic analysis is required to address this issue because it monitors the run-time behavior of the applications. System calls are the most frequently retrieved features by dynamic analysis techniques. System call sequences will adequately reveal the malignant behavior patterns of different malware categories. As a result, this paper presents a dynamic analysis-based method that uses system call frequencies as features. This method successfully identified several categories of Android malware, including riskware, banking trojan, SMS malware, and adware.

The highlights of the proposed methodology are described below:

  • The system call frequencies are utilized to build a detection solution for Android malware categories.

  • A dynamic analysis-based model is proposed for the detection of Android malware categories.

  • A novel feature vector generation method based on Huffman Encoding is incorporated with the detection model.

  • The proposed model was evaluated using machine learning and deep learning techniques and compared with previous studies.

Sect. "Related works" summarizes the related studies on Android malware detection. Sect. "Proposed methodology" thoroughly explains the proposed methodology for detecting Android malware categories. The experimental results are discussed in Sect. "Experiments and Results". Sect. "Conclusion" concludes the paper and discusses potential directions for further research.

Related works

The detection systems can recognize more malware by identifying their related families, prioritizing the risky families, and capturing their impact on users (Alswaina and Elleithy 2020). Generally, Android malware detection approaches are classified into two categories: signature-based and behavior-based techniques. The behavior-based detection approaches are further divided into static, dynamic, and hybrid analysis-based techniques.

The signature-based techniques primarily rely on the known signature patterns of the malware. For example, a set of semi-supervised algorithms for the automatic generation of different Android malware family signatures were developed by Atzeni et al. (2018). However, this approach fails to detect unknown malware attacks. Therefore, researchers generally prefer something other than this traditional technique like behavior-based strategies, which includes static, dynamic, and hybrid methods.

The articles (Alswaina and Khaled 2018; Arindaam Roy. et al. 2020; Zhou et al. 2020; Zhu et al. 2021; Elayan and Mustafa 2021; Imtiaz 2021; Almahmoud and Dalia Alzu’bi et al. 2021; Pei et al. 2020; Kim et al. 2021; Bai et al. 2021) propose static analysis-based techniques for detecting Android malware. The studies (Alswaina and Khaled. 2018, Imtiaz 2021, Pei et al. 2020, Kim et al. 2021, Bai et al. 2021) discusses the family classification of Android malware with a static analysis approach. Alswaina et al. (Alswaina and Khaled. 2018) used machine learning approaches to categorize malware families and developed a reverse engineering framework to extract permissions. Ibrahim et al. (2021) proposed DeepAMD for android malware and its family in static and dynamic layers. Similarly, for malware detection and family attribution, Pei et al. (2020) developed a unique deep-learning system called AMalNet. Kim et al. (2021) propose leveraging built-in custom permissions and machine learning to categorize Android malware families. Bai et al. (2021) used static features such as permissions, API calls, activity, services, broadcast receivers, and content providers to classify Android malware families. This study (Bai et al. 2021) uses a lot of machine learning and neural network techniques, based on manual features from literature reviews and documentary features from Android developers.

The dynamic analysis methodology has been applied to detect Android malware since obfuscated malware and malicious dynamic content loading cannot be detected by static analysis. Several research articles, such as those in Mahindru and Sangal (2021a, 2021b), Martín et al. (2018), Abderrahmane et al. (2019), D'Angelo et al. (2021), explore dynamic analysis-based Android malware detection. Martin A et al. (2018) introduce CANDYMAN, a malware classification tool that leverages the Markov chain to categorize Android malware families. They rely on deep learning techniques and use Markov chains for detection. System calls are used as features in convolutional neural networks by Abderrahamane et al. (2019) to detect fraudulent Android applications. This approach depends on pair-level system call dependencies. With the assistance of the CuckooDroid sandbox (2020), a mobile security framework for static and dynamic feature extraction, D'Angelo et al. (2021) created another dynamic analysis-based Android malware classification system.

The combined static and dynamic features are promoted in hybrid-based detection systems. For example, in their article, Ding et al. (2021) present a hybrid analysis-based technique for identifying Android malware and categorizing malware families. This approach used static and dynamic analysis to extract static features (like permissions and intents) and dynamic features (like network traffic data) to classify malware families. Similarly, Taheri et al. (2019) presented a hybrid, two-layer Android malware analyzer based on static and dynamic analysis-based malware category classification. Dhalaria and Gandotra (2021) depicted another hybrid approach for both malware detection and family classification, in which the authors have employed the information gain feature selection algorithm. Many malware detection studies have recently tried to use machine learning to make advancements in detecting unidentified Android malware (Meijin et al. 2022). El Fiky et al. (2021) developed machine learning-based approaches for identifying Android malware categories. Zhang et al. (2019) propose a combination of n-gram analysis and online classifiers for Android malware detection and family attribution. Shao et al. (2021) introduced a novel detection technique based on sampling strategies. The authors created two distinct sampling algorithms based on various malware families, to address the sample imbalance in the dataset. The original authors of CICMalDroid dataset, Samaneh Mahdavifar et al. (2022) employed a semi-supervised learning method combined with a pseudo-labelling technique. It is obvious that pseudo-labelling is sensitive to the initial predictions. Although this approach reduces label dependency, it may lead to incorrect prediction results if there are only limited data points available. Lee introduced pseudo-labelling technique (Lee 2013). In this technique, clustering is done to label unknown data points. If there are only a few labelled points and proper clustering cannot be performed, the resulting pseudo-labels may lead the classifier to the incorrect decision boundary. The authors claim that their proposed method shows an accuracy of 95.19% with only 100 labelled training samples. However, these results could be unstable with only 100 labelled samples since pseudo-labelling is highly influenced by the initial predictions. Moreover, the semi-supervised technique is highly based on a self-training approach. The main drawback of this approach is that wrong predictions with high confidence will propagate the prediction error into model learning.

Proposed methodology

This section outlines the proposed methodology to classify Android malware categories. The proposed method follows a dynamic analysis-based approach that utilizes system call frequencies as features. Figure 1 represents the model of proposed detection solution. The detailed descriptions of each phase are provided in the following Sects. "Data acquisition", "Data pre-processing" and "Feature selection").

Fig. 1
figure 1

Proposed system model

Data acquisition

This section summarizes the dataset used for the proposed method. The dataset CICMalDroid (2020), Mahdavifar et al. (2022), Canadian Institute for Cybersecurity (2020) is used, which comprises Android samples broken down into five categories: Adware, Banking malware, SMS malware, Riskware, and Benign. For example, the adware category contains families like Judy, Ewind, Copycat, GhostClicker, etc. Each malware category has a variety of families. This dataset was gathered from different sources, including the VirusTotal service (2022), Contagio Mini Dump blog (2022), MalDozer (Karbab et al. 2018), etc. The proposed methodology employs the CSV (Comma Separated Value) file containing 139 extracted system call frequencies from 11,598 APK (Android Application Package) files of five malware categories. Table 1 shows the count of different APK sample categories used in the study.

Table 1 Count of different application samples

Data pre-processing

The act of transforming unprocessed data into something that a machine learning model can use is known as data pre-processing. It is the first and most crucial stage in developing a machine-learning model. The data pre-processing is used to improve the model's accuracy and efficiency. The basic steps in data pre-processing include importing libraries and datasets, finding missing data, removing NULL values or unnecessary data, encoding categorical data, data scaling, augmentation, and feature vector generation. Since there are no missing or NULL values in the dataset, a new Huffman encoding-based feature vector generation technique is proposed and data scaling is applied as pre-processing tasks.

Feature vector generation

The Huffman encoding-based feature vector generation is used in this phase. According to the literature survey, the methodology presented in this paper is the first truly innovative way for detecting Android malware. The system calls help to monitor the dynamic behavior of apps; therefore, it will be easy to identify malicious patterns by determining how frequently a given application uses a system call. System call frequencies are therefore used as features in the study. The proposed method uses the frequencies of 139 different system calls, which were acquired in the data acquisition phase. Then Huffman encoding is used, which is an optimization technique that maps system call frequencies into an optimized size value in O(nlogn) times. The new feature vector is then created using these Huffman-encoded values, which boosts the performance of the detection framework even more due to its higher encoding speed and effectiveness. The Huffman's optimality or minimum-redundancy code property makes it a more efficient technique (Moffat 2019).

Huffman encoding

David A. Huffman invented the Huffman Encoding compression technique (Huffman_coding 2022). This technique is based on the frequency of occurrence of a data item. According to this encoding scheme, a unique code is obtained for each system call. Given ‘m’ number of application samples, A = {a1, a2, a3,..., am} and S = {s1, s2, s3,...., sn} represents the ‘n’ number of features (system call frequencies) used by ‘m’ APK samples (Here, m = 11,598 and n = 139) as shown in Table 2

Table 2 Application samples and features

Then the feature set, Fa,s represents all system call frequencies of all APK samples (Fig. 2).

Fig. 2
figure 2

Feature set of proposed system

The corresponding Huffman trees for each row of system calls are built and mapped to unique codes by assigning 0 and 1 in the left and right child trees, respectively. At the end, a sequence of 0’s and 1’s will be generated for each leaf node. Table 3 represents the mapped Huffman codes of some of the given system calls.

Table 3 Huffman codes of system call frequencies

As the final step, the optimum size required for each system call frequency value is obtained by multiplying the length of the corresponding Huffman codes with the value of system call frequency (Eq. 1). This will be used as the new feature vector for the final detection model. The pseudo-code of Huffman encoding is shown in Fig. 3.

Fig. 3
figure 3

Pseudo code of Huffman encoding

$$Size=len\left(Huffma{n}{code\left(Syste{m}{call}\right)}\right)*System\_call\_frequency$$
(1)
a. Minimum-redundancy code / Optimality property

Minimum-redundancy code or optimality means the average number of coding digits per feature is minimized. This property makes Huffman an efficient technique. Any application of Huffman's algorithm will always create a minimum-redundancy code (Huffman 1952). Therefore, it will help to generate an optimal feature vector for the final detection model. Thus, this can improve the detection accuracy of the model. A minimum-redundancy code exists in which the two least-value features are siblings and share a common parent in the corresponding binary code tree. These two features are joined into a combined node with weight given by their sum. Following that, a minimum-redundancy code is created for this reduced-by-one feature set. Then expanding that feature into its two components, yields a minimum-redundancy code for the original set of features (Moffat 2019).

Data scaling

One of the most critical phases in data pre-processing before building a machine learning model is feature scaling or data scaling. It is used to generalize data points so there will be less space between them. A machine learning model's strength can be changed through scaling, from poor to better. In this work, standard and Min–max scaling techniques are employed.

Feature selection

Feature selection is necessary for a model to predict the target variable. This process aims to minimize the number of input variables to select those features that are identified as most beneficial. The proposed model uses the Chi-square technique to select appropriate features. Thus, it reduces the feature space for the final machine learning model.

Classification

In this phase, the Android malware category classification is experimented using machine learning and deep learning techniques. The machine learning classifiers use the feature vector as input that was obtained from the previous step. Then classifiers like Random Forest, Decision Tree, Logistic Regression, Support Vector Machine, and AdaBoost are employed to detect Android malware categories. The experiments were also carried out with convolutional neural networks and multi-layer perceptron techniques.

Experiments and results

This section discusses the experiments and results. The proposed system is built with a novel method for creating feature vectors based on Huffman coding. A total of 139 features (system call frequencies) from 11,598 data samples provided by CICMaldroid 2020 (Mahdavifar 2020; Mahdavifar et al. 2022; Canadian Institute for Cybersecurity 2020) were used in the system. The proposed feature vector generation technique significantly improves the overall effectiveness of the detection model. Although there are several data scaling techniques, the primary issue for machine learning is selecting the appropriate scaling method. The studies (Ambarwari et al. 2020; Shahriyari 2019) support the impact of data scaling methods on various ML algorithms. As a result, the proposed solution uses standard scaling because it yields better performance with machine learning models. The experimental results are shown in Tables 4 and 5. The corresponding result graphs are depicted in Figs. 4 and 5.

Table 4 Results without Proposed Feature vector generation
Table 5 Results with Huffman encoding
Fig. 4
figure 4

Results without proposed feature vector generation

Fig. 5
figure 5

Results with Huffman encoding-based feature vector generation

Table 4 shows the results without the proposed feature vector generation. This demonstrates that, with a greater detection accuracy of 0.931, the random forest model performs better. The decision tree, support vector machine, k-nearest neighbor classifiers, multi-layer perceptron and CNN models provide more than 80% of detection accuracies.

The experiment results with the proposed Huffman Encoding-based approach are presented in Table 5. It is evident that the proposed approach yields a greater accuracy of 98.70%. Moreover, it has increased the performance of other classifiers as well. This proves how effectively proposed feature vector generation process works. Huffman encoding supplies an optimised size value for each feature in O(nlogn) times, which increases the detection model's efficacy.

The proposed feature vector generation technique is compared with logarithmic transformation-based feature vector generation (Table 6). As per the results, the log transformation gives the greater accuracy is 93.06% with Random Forest model. It is known that log transformation will reduce the skewness of data by compressing the range of large numbers and extending the range of small numbers. However, it may lead to high memory consumption and increased time complexity. Also, this technique is computationally expensive and it may cause lowering of models’ accuracy. Whereas, in the Huffman encoding method, its minimum redundancy code property and higher encoding speed enables it to produce an optimal feature vector, which can improve the performance of the final detection model. The experiments also proved that Huffman-encoding feature vector generation gives better results than logarithmic based feature vector generation. Figure 6 shows the corresponding result graph of logarithmic transformation-based approach.

Table 6 Results with Logarithmic transformation
Fig. 6
figure 6

Results with logarithmic transformation

From Fig. 7, it is clear that the performance of the proposed method is higher than the method without using any feature vector generation and the baseline technique, i.e., logarithmic transformation technique.

Fig. 7
figure 7

Result comparison of proposed method

As shown in Table 7, the effectiveness of the proposed system is compared to that of a few existing methodologies. The results show that the proposed system outperforms the alternatives.

Table 7 Performance comparison of the proposed system with previous methods

Conclusion

This paper presents an Android malware category detection system based on a novel Huffman encoding-based feature vector generation scheme. The proposed model includes phases like data acquisition, data pre-processing, feature selection, and classification. The system call frequencies of 11,598 Android application samples were used as features to design this solution because this dynamic feature helps to recognize the dynamic behavior patterns of malware category. The Huffman encoding technique is employed in the data pre-processing phase to provide the optimum size of system call frequencies used by the applications. Several Machine learning-based experiments are conducted to evaluate the effectiveness of the proposed system. Based on the findings of the experiments, the proposed method using the Random Forest model outperforms other models with a better accuracy of 98.70%. The results were also compared with the performance of logarithmic transformation-based feature vector generation, showing that the proposed approach exhibits better results. Additionally, the model's effectiveness was compared with a few earlier methods, and it was discovered that this work yields better outcomes. This solution relies on a dynamic feature called system calls; thus, in future research studies, the static features like permissions, API calls, intents, etc., should be integrated with the detection solution.